
Lecture 14: Shrinkage

Reading: Section 6.2

STATS 202: Data mining and analysis

Jonathan Taylor, 10/29

Slide credits: Sergio Bacallado


Shrinkage methods

The idea is to perform a linear regression while regularizing, or shrinking, the coefficients $\hat\beta$ toward 0.

Why would shrunk coefficients be better?

- Shrinking introduces bias, but it may significantly decrease the variance of the estimates. If the latter effect is larger, the test error decreases.

- There are Bayesian motivations for doing this: the prior tends to shrink the parameters.


Ridge regression

Ridge regression solves the following optimization:

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

The first term is the RSS of the model (shown in blue on the slide). The second term (in red) is the squared $\ell_2$ norm of $\beta$, i.e. $\|\beta\|_2^2$.

The parameter $\lambda$ is a tuning parameter; it modulates the relative importance of fit vs. shrinkage.

We compute the estimate $\hat\beta^R_\lambda$ for many values of $\lambda$ and then choose among them by cross-validation, as sketched below.
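A minimal sketch of this workflow in Python, assuming scikit-learn and a synthetic dataset (the data and all settings here are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data: 200 observations, 10 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Fit ridge over a grid of penalties; RidgeCV picks lambda
# (called alpha in scikit-learn) by cross-validation.
alphas = np.logspace(-2, 4, 100)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("chosen lambda:", ridge.alpha_)
```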


Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of $\lambda$.

[Figure: squared bias, variance, and mean squared error plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2/\|\hat\beta\|_2$ (right).]
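A hedged sketch of such a simulation in Python (the data-generating process, sizes, and grids below are assumptions for illustration, not the study behind the slide; the figure decomposes test error, while this sketch decomposes the error of the coefficient estimates):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
p, n, reps = 10, 50, 200
beta_true = rng.normal(size=p)
lambdas = np.logspace(-1, 3, 30)

bias2, var, mse = [], [], []
for lam in lambdas:
    estimates = np.empty((reps, p))
    for r in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(size=n)
        estimates[r] = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    mean_est = estimates.mean(axis=0)
    bias2.append(np.sum((mean_est - beta_true) ** 2))      # squared bias
    var.append(estimates.var(axis=0).sum())                # total variance
    mse.append(np.mean(np.sum((estimates - beta_true) ** 2, axis=1)))

# Variance falls and squared bias grows as lambda increases;
# their sum (the MSE) is minimized at an intermediate value.
print("lambda minimizing estimated MSE:", lambdas[np.argmin(mse)])
```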


Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.$$

Multiplying $X_1$ by $c$ can be compensated by dividing $\hat\beta_1$ by $c$; after doing this, the RSS is unchanged.

In ridge regression, this is not true.

In practice, what do we do?

- Scale each variable so that it has sample variance 1 before running the regression (as in the sketch below).

- This prevents penalizing some coefficients more than others.
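A minimal sketch of this in Python, assuming scikit-learn (the data is made up to have wildly different scales):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) * np.array([1, 10, 100, 1e3, 1e4])
y = rng.normal(size=100)

# Standardize each column to sample variance 1 before the ridge fit,
# so the penalty treats all coefficients equally.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```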


Example. Ridge regression

Ridge regression of Balance in the Credit dataset.

[Figure: standardized ridge coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2/\|\hat\beta\|_2$ (right).]


Selecting λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right), both plotted against $\lambda$.]


The Lasso

Lasso regression solves the following optimization:

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

The first term is the RSS of the model (blue on the slide). The second term (red) is the $\ell_1$ norm of $\beta$, i.e. $\|\beta\|_1$.

Why would we use the Lasso instead of Ridge regression?

- Ridge regression shrinks all the coefficients, but to non-zero values.

- The Lasso shrinks some of the coefficients all the way to zero: a convex alternative to best subset selection or stepwise selection! (See the sketch below.)
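A minimal sketch, assuming scikit-learn and synthetic data with a sparse true coefficient vector (all settings illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [5.0, -4.0, 3.0]        # only 3 non-zero coefficients
y = X @ beta_true + rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("exact zeros, lasso:", np.sum(lasso.coef_ == 0.0))  # many
print("exact zeros, ridge:", np.sum(ridge.coef_ == 0.0))  # typically none
```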


Example. Ridge regression

Ridge regression of Balance in the Credit dataset, repeated from above.

[Figure: the same standardized ridge coefficient paths as above.]

A lot of pesky small coefficients throughout the regularization path.


Example. The Lasso

Lasso regression of Balance in the Credit dataset.

[Figure: standardized lasso coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ (left) and against $\|\hat\beta^L_\lambda\|_1/\|\hat\beta\|_1$ (right).]

Those coefficients are shrunk to zero.


An alternative formulation for regularization

- Ridge: for every $\lambda$, there is an $s$ such that $\hat\beta^R_\lambda$ solves:

$$\operatorname*{minimize}_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s.$$

- Lasso: for every $\lambda$, there is an $s$ such that $\hat\beta^L_\lambda$ solves:

$$\operatorname*{minimize}_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s.$$

- Best subset:

$$\operatorname*{minimize}_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\mathbf{1}(\beta_j \neq 0) \le s.$$
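Why such an $s$ exists is not spelled out here; a short argument for the Lasso case (a standard observation, added for completeness): if $\hat\beta^L_\lambda$ minimizes the penalized criterion, take $s = \|\hat\beta^L_\lambda\|_1$. Then for any $\beta$ with $\|\beta\|_1 \le s$,

$$\text{RSS}(\hat\beta^L_\lambda) + \lambda\|\hat\beta^L_\lambda\|_1 \;\le\; \text{RSS}(\beta) + \lambda\|\beta\|_1 \;\le\; \text{RSS}(\beta) + \lambda s,$$

and subtracting $\lambda s = \lambda\|\hat\beta^L_\lambda\|_1$ gives $\text{RSS}(\hat\beta^L_\lambda) \le \text{RSS}(\beta)$, so the penalized solution also solves the constrained problem. The Ridge case is identical with $\sum_j \beta_j^2$ in place of $\|\beta\|_1$.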


Visualizing Ridge and the Lasso with 2 predictors

[Figure: contours of the RSS over $(\beta_1, \beta_2)$, with the Lasso constraint region $\sum_{j=1}^p |\beta_j| \le s$ (a diamond, ♦) and the Ridge constraint region $\sum_{j=1}^p \beta_j^2 \le s$ (a disk).]

Best subset with $s = 1$ is the union of the axes...


When is the Lasso better than Ridge?

Example 1. Most of the coefficients are non-zero.

[Figure: squared bias, variance, and mean squared error plotted against $\lambda$ (left) and against $R^2$ on the training data (right).]

- Bias, variance, and MSE for the Lasso (solid lines) and Ridge (dotted).

- The bias is about the same for both methods.

- The variance of Ridge regression is smaller, and so is its MSE.


When is the Lasso better than Ridge?

Example 2. Only 2 coefficients are non-zero.

[Figure: squared bias, variance, and mean squared error plotted against $\lambda$ (left) and against $R^2$ on the training data (right).]

- Bias, variance, and MSE for the Lasso (solid lines) and Ridge (dotted).

- The bias, variance, and MSE are all lower for the Lasso; a simulation in this spirit is sketched below.
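A hedged sketch of such a comparison in Python (the dimensions, signal, and penalty grid below are assumptions for illustration, not the slide's exact study):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
n, p = 100, 45
beta_true = np.zeros(p)
beta_true[:2] = [4.0, -3.0]             # only 2 non-zero coefficients

X_train = rng.normal(size=(n, p))
y_train = X_train @ beta_true + rng.normal(size=n)
X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta_true + rng.normal(size=1000)

# Tune each method's penalty by cross-validation, then compare test MSE.
lasso = LassoCV(cv=5).fit(X_train, y_train)
ridge = RidgeCV(alphas=np.logspace(-2, 3, 50), cv=5).fit(X_train, y_train)
print("lasso test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
print("ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
```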


Choosing λ by cross-validation

[Figure: cross-validation error (left) and standardized lasso coefficients (right), both plotted against $\|\hat\beta^L_\lambda\|_1/\|\hat\beta\|_1$.]
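A minimal sketch of producing the left panel's ingredients in Python, assuming scikit-learn (synthetic data; `mse_path_` holds the per-fold CV error along the penalty path):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
y = X @ np.r_[np.ones(3), np.zeros(17)] + rng.normal(size=100)

fit = LassoCV(cv=10).fit(X, y)
cv_error = fit.mse_path_.mean(axis=1)   # mean CV error at each penalty
print("minimum mean CV error:", cv_error.min())
print("lambda achieving it:", fit.alpha_)
print("non-zero coefficients kept:", np.sum(fit.coef_ != 0))
```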


A very special case

Suppose n = p and our matrix of predictors is X = I.

Then, the objective function in Ridge regression can be simplified:

$$\sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$$

and we can minimize the terms that involve each $\beta_j$ separately:

$$(y_j - \beta_j)^2 + \lambda\beta_j^2.$$

It is easy to show that

$$\hat\beta^R_j = \frac{y_j}{1+\lambda}.$$
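A quick numerical sanity check in Python, assuming scikit-learn (illustrative, not from the slides): with $X = I$ and no intercept, ridge should return exactly $y_j/(1+\lambda)$.

```python
import numpy as np
from sklearn.linear_model import Ridge

y = np.array([3.0, -1.5, 0.2, 4.0, -2.2, 0.9])
X = np.eye(len(y))                       # n = p, X = I
lam = 2.0

fit = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(fit.coef_, y / (1 + lam)))   # True: matches closed form
```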


A very special case

Similar story for the Lasso; the objective function is:

$$\sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}|\beta_j|,$$

and we can minimize the terms that involve each $\beta_j$ separately:

$$(y_j - \beta_j)^2 + \lambda|\beta_j|.$$

It is easy to show that

$$\hat\beta^L_j = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}$$
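This is the soft-thresholding operator. A minimal numpy sketch of the closed form, with a brute-force check that it minimizes the one-dimensional objective (all values illustrative):

```python
import numpy as np

def soft_threshold(y, lam):
    # Shrink toward 0 by lam/2, and snap to exactly 0 inside [-lam/2, lam/2].
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

lam = 2.0
y = np.array([3.0, -1.5, 0.4, -0.9, 2.5])
beta_hat = soft_threshold(y, lam)
print(beta_hat)                          # [ 2.  -0.5  0.   0.   1.5]

# Brute-force minimization of (y_0 - b)^2 + lam * |b| over a fine grid.
grid = np.linspace(-5, 5, 100001)
objective = (y[0] - grid) ** 2 + lam * np.abs(grid)
print(np.isclose(grid[np.argmin(objective)], beta_hat[0], atol=1e-3))  # True
```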


Lasso and Ridge coefficients as a function of λ

[Figure: coefficient estimate as a function of the least-squares estimate $y_j$ for a fixed $\lambda$: Ridge shrinks proportionally toward 0 (left panel), while the Lasso translates by $\lambda/2$ and sets small values exactly to 0 (right panel); the least-squares line is shown for reference in both.]


Bayesian interpretations

Ridge: $\hat\beta^R$ is the posterior mean, with a Normal prior on $\beta$.

Lasso: $\hat\beta^L$ is the posterior mode, with a Laplace prior on $\beta$.
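To see why (a standard MAP calculation under the usual Gaussian-noise model, sketched here rather than taken from the slides): with $y_i \mid \beta \sim N\big(\beta_0 + \sum_j \beta_j x_{i,j}, \sigma^2\big)$ and independent priors $\beta_j \sim N(0, \tau^2)$, the negative log-posterior is, up to constants,

$$\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2,$$

which is the Ridge criterion with $\lambda = \sigma^2/\tau^2$. Replacing the Normal prior with a Laplace prior, whose density is $\propto e^{-|\beta_j|/b}$, turns the second term into a multiple of $\sum_j |\beta_j|$, giving the Lasso criterion.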

[Figure: the two prior densities $g(\beta_j)$: the smooth Normal prior (Ridge) and the Laplace prior (Lasso), which is sharply peaked at 0.]
