Lecture 14: Shrinkage

Reading: Section 6.2

STATS 202: Data mining and analysis

Jonathan Taylor, 10/29

Slide credits: Sergio Bacallado

1 / 19

Shrinkage methods

The idea is to perform a linear regression, while regularizing or shrinking the coefficients β̂ toward 0.

Why would shrunk coefficients be better?

▶ This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this would decrease the test error.

▶ There are Bayesian motivations to do this: the prior tends to shrink the parameters.

2 / 19


Ridge regression

Ridge regression solves the following optimization:

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]

The first term is the RSS of the model.

The second term, the penalty, is the squared ℓ2 norm of β, ‖β‖₂².

The parameter λ is a tuning parameter. It modulates the importance of fit vs. shrinkage.

We compute an estimate β̂^R_λ for many values of λ and then choose among them by cross-validation.

3 / 19
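
A minimal sketch of computing a ridge coefficient path over a grid of λ values with scikit-learn, whose `alpha` argument plays the role of λ; the data and variable names are illustrative, not from the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Illustrative data: n = 100 observations, p = 10 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

# Standardize the predictors so the penalty treats all coefficients alike.
X_std = StandardScaler().fit_transform(X)

# Fit ridge over a grid of lambda values (scikit-learn calls lambda `alpha`).
lambdas = np.logspace(-2, 4, 50)
coef_path = np.array([Ridge(alpha=lam).fit(X_std, y).coef_ for lam in lambdas])

# One row of coefficients per lambda; entries shrink toward 0 as lambda grows.
print(coef_path.shape)                     # (50, 10)
print(coef_path[0, :2], coef_path[-1, :2])
```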

Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of λ.

[Figure: bias, variance, and test error (y-axis: mean squared error) plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

4 / 19

Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.
\]

Multiplying X₁ by c can be compensated by dividing β̂₁ by c, i.e. after doing this we have the same RSS.

In ridge regression, this is not true.

In practice, what do we do?

▶ Scale each variable so that it has sample variance 1 before running the regression (see the sketch after this slide).

▶ This prevents penalizing some coefficients more than others.

5 / 19
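
A small sketch of the scale sensitivity described above (illustrative data; `Ridge`'s `alpha` is λ): rescaling one predictor leaves the least-squares fit unchanged up to a rescaled coefficient, but changes the ridge fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

# Rescale the first predictor by c = 1000 (e.g. dollars instead of thousands).
X_resc = X.copy()
X_resc[:, 0] *= 1000.0

# Least squares: the fit is unchanged, beta_1 is simply divided by 1000.
b_ols = LinearRegression().fit(X, y).coef_[0]
b_ols_resc = LinearRegression().fit(X_resc, y).coef_[0]
print(b_ols, 1000.0 * b_ols_resc)        # essentially equal

# Ridge: the penalty depends on the scale, so the fit itself changes.
b_ridge = Ridge(alpha=100.0).fit(X, y).coef_[0]
b_ridge_resc = Ridge(alpha=100.0).fit(X_resc, y).coef_[0]
print(b_ridge, 1000.0 * b_ridge_resc)    # no longer equal
```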


Example. Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

6 / 19

Selecting λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right) as functions of λ.]

7 / 19
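
A minimal sketch of selecting λ by cross-validation with scikit-learn's RidgeCV; the λ grid and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)
X_std = StandardScaler().fit_transform(X)

# Evaluate a grid of lambda values by 10-fold cross-validation.
lambdas = np.logspace(-3, 3, 100)
fit = RidgeCV(alphas=lambdas, cv=10).fit(X_std, y)
print("lambda chosen by cross-validation:", fit.alpha_)
```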

The Lasso

Lasso regression solves the following optimization:

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]

The first term is the RSS of the model.

The second term, the penalty, is the ℓ1 norm of β, ‖β‖₁.

Why would we use the Lasso instead of Ridge regression?

▶ Ridge regression shrinks all the coefficients to non-zero values.

▶ The Lasso shrinks some of the coefficients all the way to zero. A convex alternative to best subset selection or stepwise selection! (A small sketch follows this slide.)

8 / 19
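
A minimal sketch of the sparsity contrast on illustrative data (`alpha` is scikit-learn's name for λ, and the chosen values are arbitrary): the Lasso sets many coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative sparse setting: 20 predictors, only the first 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))  # typically 0
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))  # many
```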


Example. Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

A lot of pesky small coefficients throughout the regularization path.

9 / 19

Example. The Lasso

Lasso regression of default in the Credit dataset.

[Figure: standardized Lasso coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^L_λ‖₁/‖β̂‖₁ (right).]

Those coefficients are shrunk to zero.

10 / 19

An alternative formulation for regularization

▶ Ridge: for every λ, there is an s such that β̂^R_λ solves:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 < s.
\]

▶ Lasso: for every λ, there is an s such that β̂^L_λ solves:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| < s.
\]

▶ Best subset:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{s.t.} \quad \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) < s.
\]

11 / 19
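
Why the penalized and constrained forms match (a sketch of the standard Lagrangian argument, which the slide states but does not prove), written here for the ridge case:

\[
L(\beta, \lambda) \;=\; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \;+\; \lambda \Big( \sum_{j=1}^{p} \beta_j^2 - s \Big).
\]

For fixed λ ≥ 0, minimizing L over β is the penalized ridge problem, since the constant λs does not affect the minimizer. Conversely, if β̂^R_λ minimizes the penalized objective, then it solves the constrained problem with s = ∑_j (β̂^R_{λ,j})²: any β with ∑_j β_j² ≤ s and a strictly smaller RSS would also have a strictly smaller penalized objective, contradicting optimality. The Lasso case is analogous, with the ℓ1 penalty.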


Visualizing Ridge and the Lasso with 2 predictors

[Figure, two predictors: constraint region for the Lasso (left, diamond ♦: ∑_{j=1}^{p} |β_j| < s) and for Ridge regression (right: ∑_{j=1}^{p} β_j^2 < s).]

Best subset with s = 1 is the union of the axes...

12 / 19

When is the Lasso better than Ridge?

Example 1. Most of the coefficients are non-zero.

[Figure: mean squared error plotted against λ (left) and against R² on the training data (right).]

▶ Bias, Variance, MSE. The Lasso (—), Ridge (· · · ).

▶ The bias is about the same for both methods.

▶ The variance of Ridge regression is smaller, and so is its MSE.

13 / 19


When is the Lasso better than Ridge?

Example 2. Only 2 coefficients are non-zero.

[Figure: mean squared error plotted against λ (left) and against R² on the training data (right).]

▶ Bias, Variance, MSE. The Lasso (—), Ridge (· · · ).

▶ The bias, variance, and MSE are all lower for the Lasso (a rough simulation sketch follows this slide).

14 / 19
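
A rough simulation sketch of the two scenarios above (Example 1: dense coefficients, Example 2: only a few non-zero), with λ chosen by cross-validation for each method; the sample sizes and coefficient values are illustrative, not those behind the figures.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def test_mse(beta, n=200, seed=0):
    """Fit Lasso and Ridge (lambda chosen by CV) and return their test MSEs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + rng.normal(size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    lasso = LassoCV(cv=5).fit(X_tr, y_tr)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X_tr, y_tr)
    return (mean_squared_error(y_te, lasso.predict(X_te)),
            mean_squared_error(y_te, ridge.predict(X_te)))

p = 45
dense = np.ones(p)          # Example 1-like: all coefficients non-zero
sparse = np.zeros(p)
sparse[:2] = 5.0            # Example 2-like: only 2 non-zero coefficients

# Ridge tends to win in the dense setting, the Lasso in the sparse one.
print("dense  (lasso MSE, ridge MSE):", test_mse(dense))
print("sparse (lasso MSE, ridge MSE):", test_mse(sparse))
```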


Choosing λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right) plotted against ‖β̂^L_λ‖₁/‖β̂‖₁.]

15 / 19
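
A minimal sketch of tracing out the cross-validation error curve for the Lasso over a λ grid (illustrative data; scikit-learn's `alpha` is λ):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Illustrative sparse data: only 2 truly non-zero coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=100)

# Cross-validated MSE for each lambda on the grid; pick the minimizer.
lambdas = np.logspace(-3, 1, 40)
cv_mse = [
    -cross_val_score(Lasso(alpha=lam, max_iter=10_000), X, y,
                     cv=10, scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print(f"lambda with smallest CV error: {best_lambda:.4f}")
```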

A very special case

Suppose n = p and our matrix of predictors is X = I.

Then, the objective function in Ridge regression can be simplified:

\[
\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda \beta_j^2.
\]

It is easy to show that

\[
\hat{\beta}^R_j = \frac{y_j}{1 + \lambda}.
\]

16 / 19
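
One way to check the formula (a one-line calculus verification, not shown on the slide): the per-coordinate objective is smooth, so set its derivative to zero.

\[
\frac{d}{d\beta_j}\Big[ (y_j - \beta_j)^2 + \lambda \beta_j^2 \Big]
= -2\,(y_j - \beta_j) + 2\lambda \beta_j = 0
\quad\Longrightarrow\quad
\hat{\beta}^R_j = \frac{y_j}{1 + \lambda}.
\]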


A very special case

Similar story for the Lasso; the objective function is:

\[
\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda |\beta_j|.
\]

It is easy to show that

\[
\hat{\beta}^L_j =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2;\\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2;\\
0 & \text{if } |y_j| < \lambda/2.
\end{cases}
\]

17 / 19
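
A tiny numerical check of the two closed forms in this special case, written as a numpy sketch (the example vector `y` is arbitrary):

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution when X = I: soft-thresholding of y at lam / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

def ridge_shrink(y, lam):
    """Ridge solution when X = I: proportional shrinkage of y."""
    return y / (1.0 + lam)

y = np.array([-3.0, -0.4, 0.2, 1.0, 2.5])
print(soft_threshold(y, lam=1.0))  # small entries are set exactly to 0
print(ridge_shrink(y, lam=1.0))    # every entry is shrunk, none exactly 0
```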


Lasso and Ridge coefficients as a function of λ

[Figure: coefficient estimates for Ridge vs. least squares (left) and for the Lasso vs. least squares (right), plotted against y_j.]

18 / 19

Bayesian interpretations

Ridge: β̂^R is the posterior mean, with a Normal prior on β.

Lasso: β̂^L is the posterior mode, with a Laplace prior on β.

[Figure: prior densities g(β_j); left: the Normal prior corresponding to Ridge, right: the Laplace prior corresponding to the Lasso.]

19 / 19
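
A short sketch of where the Lasso statement comes from (the ridge/Normal case is analogous): assume i.i.d. Normal(0, σ²) errors and independent Laplace(0, b) priors on the β_j. Up to additive constants, the negative log-posterior is

\[
-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \;+\; \frac{1}{b} \sum_{j=1}^{p} |\beta_j| \;+\; \mathrm{const},
\]

so the posterior mode minimizes the Lasso objective with λ = 2σ²/b; replacing the Laplace prior with a Normal prior turns the ℓ1 term into an ℓ2 penalty and gives ridge instead.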