Lecture 14: Shrinkage

Reading: Section 6.2

STATS 202: Data mining and analysis

Jonathan Taylor, 10/29

Slide credits: Sergio Bacallado

1 / 19

Shrinkage methods

The idea is to perform a linear regression, while regularizing or shrinking the coefficients β̂ toward 0.

Why would shrunk coefficients be better?

▶ This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this would decrease the test error.

▶ There are Bayesian motivations to do this: the prior tends to shrink the parameters.

2 / 19


Ridge regression

Ridge regression solves the following optimization:

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]

The first term is the RSS of the model.

The second term, the penalty, is the squared ℓ2 norm of β, ‖β‖₂².

The parameter λ is a tuning parameter. It modulates the importance of fit vs. shrinkage.

We compute an estimate β̂^R_λ for many values of λ and then choose among them by cross-validation.

3 / 19
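
A minimal sketch of computing a ridge coefficient path over a grid of λ values with scikit-learn, whose `alpha` argument plays the role of λ; the data and variable names are illustrative, not from the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Illustrative data: n = 100 observations, p = 10 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

# Standardize the predictors so the penalty treats all coefficients alike.
X_std = StandardScaler().fit_transform(X)

# Fit ridge over a grid of lambda values (scikit-learn calls lambda `alpha`).
lambdas = np.logspace(-2, 4, 50)
coef_path = np.array([Ridge(alpha=lam).fit(X_std, y).coef_ for lam in lambdas])

# One row of coefficients per lambda; entries shrink toward 0 as lambda grows.
print(coef_path.shape)                     # (50, 10)
print(coef_path[0, :2], coef_path[-1, :2])
```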

Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of λ.

[Figure: bias, variance, and test error (y-axis: mean squared error) plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

4 / 19

Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.
\]

Multiplying X₁ by c can be compensated by dividing β̂₁ by c, i.e. after doing this we have the same RSS.

In ridge regression, this is not true.

In practice, what do we do?

▶ Scale each variable so that it has sample variance 1 before running the regression (see the sketch after this slide).

▶ This prevents penalizing some coefficients more than others.

5 / 19
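
A small sketch of the scale sensitivity described above (illustrative data; `Ridge`'s `alpha` is λ): rescaling one predictor leaves the least-squares fit unchanged up to a rescaled coefficient, but changes the ridge fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

# Rescale the first predictor by c = 1000 (e.g. dollars instead of thousands).
X_resc = X.copy()
X_resc[:, 0] *= 1000.0

# Least squares: the fit is unchanged, beta_1 is simply divided by 1000.
b_ols = LinearRegression().fit(X, y).coef_[0]
b_ols_resc = LinearRegression().fit(X_resc, y).coef_[0]
print(b_ols, 1000.0 * b_ols_resc)        # essentially equal

# Ridge: the penalty depends on the scale, so the fit itself changes.
b_ridge = Ridge(alpha=100.0).fit(X, y).coef_[0]
b_ridge_resc = Ridge(alpha=100.0).fit(X_resc, y).coef_[0]
print(b_ridge, 1000.0 * b_ridge_resc)    # no longer equal
```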


Example. Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

6 / 19

Selecting λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right) as functions of λ.]

7 / 19
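
A minimal sketch of selecting λ by cross-validation with scikit-learn's RidgeCV; the λ grid and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)
X_std = StandardScaler().fit_transform(X)

# Evaluate a grid of lambda values by 10-fold cross-validation.
lambdas = np.logspace(-3, 3, 100)
fit = RidgeCV(alphas=lambdas, cv=10).fit(X_std, y)
print("lambda chosen by cross-validation:", fit.alpha_)
```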

The Lasso

Lasso regression solves the following optimization:

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]

The first term is the RSS of the model.

The second term, the penalty, is the ℓ1 norm of β, ‖β‖₁.

Why would we use the Lasso instead of Ridge regression?

▶ Ridge regression shrinks all the coefficients to non-zero values.

▶ The Lasso shrinks some of the coefficients all the way to zero. A convex alternative to best subset selection or stepwise selection! (A small sketch follows this slide.)

8 / 19
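
A minimal sketch of the sparsity contrast on illustrative data (`alpha` is scikit-learn's name for λ, and the chosen values are arbitrary): the Lasso sets many coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative sparse setting: 20 predictors, only the first 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))  # typically 0
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))  # many
```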


Example. Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]

A lot of pesky small coefficients throughout the regularization path.

9 / 19

Example. The Lasso

Lasso regression of default in the Credit dataset.

[Figure: standardized Lasso coefficients for Income, Limit, Rating, and Student plotted against λ (left) and against ‖β̂^L_λ‖₁/‖β̂‖₁ (right).]

Those coefficients are shrunk to zero.

10 / 19

An alternative formulation for regularization

▶ Ridge: for every λ, there is an s such that β̂^R_λ solves:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 < s.
\]

▶ Lasso: for every λ, there is an s such that β̂^L_λ solves:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| < s.
\]

▶ Best subset:

\[
\underset{\beta}{\mathrm{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2
\quad \text{s.t.} \quad \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) < s.
\]

11 / 19
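
Why the penalized and constrained forms match (a sketch of the standard Lagrangian argument, which the slide states but does not prove), written here for the ridge case:

\[
L(\beta, \lambda) \;=\; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \;+\; \lambda \Big( \sum_{j=1}^{p} \beta_j^2 - s \Big).
\]

For fixed λ ≥ 0, minimizing L over β is the penalized ridge problem, since the constant λs does not affect the minimizer. Conversely, if β̂^R_λ minimizes the penalized objective, then it solves the constrained problem with s = ∑_j (β̂^R_{λ,j})²: any β with ∑_j β_j² ≤ s and a strictly smaller RSS would also have a strictly smaller penalized objective, contradicting optimality. The Lasso case is analogous, with the ℓ1 penalty.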


Visualizing Ridge and the Lasso with 2 predictors

[Figure, two predictors: constraint region for the Lasso (left, diamond ♦: ∑_{j=1}^{p} |β_j| < s) and for Ridge regression (right: ∑_{j=1}^{p} β_j^2 < s).]

Best subset with s = 1 is the union of the axes...

12 / 19

When is the Lasso better than Ridge?

Example 1. Most of the coefficients are non-zero.

[Figure: mean squared error plotted against λ (left) and against R² on the training data (right).]

▶ Bias, Variance, MSE. The Lasso (—), Ridge (· · · ).

▶ The bias is about the same for both methods.

▶ The variance of Ridge regression is smaller, and so is its MSE.

13 / 19


When is the Lasso better than Ridge?

Example 2. Only 2 coefficients are non-zero.

[Figure: mean squared error plotted against λ (left) and against R² on the training data (right).]

▶ Bias, Variance, MSE. The Lasso (—), Ridge (· · · ).

▶ The bias, variance, and MSE are all lower for the Lasso (a rough simulation sketch follows this slide).

14 / 19
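
A rough simulation sketch of the two scenarios above (Example 1: dense coefficients, Example 2: only a few non-zero), with λ chosen by cross-validation for each method; the sample sizes and coefficient values are illustrative, not those behind the figures.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def test_mse(beta, n=200, seed=0):
    """Fit Lasso and Ridge (lambda chosen by CV) and return their test MSEs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + rng.normal(size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    lasso = LassoCV(cv=5).fit(X_tr, y_tr)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X_tr, y_tr)
    return (mean_squared_error(y_te, lasso.predict(X_te)),
            mean_squared_error(y_te, ridge.predict(X_te)))

p = 45
dense = np.ones(p)          # Example 1-like: all coefficients non-zero
sparse = np.zeros(p)
sparse[:2] = 5.0            # Example 2-like: only 2 non-zero coefficients

# Ridge tends to win in the dense setting, the Lasso in the sparse one.
print("dense  (lasso MSE, ridge MSE):", test_mse(dense))
print("sparse (lasso MSE, ridge MSE):", test_mse(sparse))
```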


Choosing λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right) plotted against ‖β̂^L_λ‖₁/‖β̂‖₁.]

15 / 19
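
A minimal sketch of tracing out the cross-validation error curve for the Lasso over a λ grid (illustrative data; scikit-learn's `alpha` is λ):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Illustrative sparse data: only 2 truly non-zero coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=100)

# Cross-validated MSE for each lambda on the grid; pick the minimizer.
lambdas = np.logspace(-3, 1, 40)
cv_mse = [
    -cross_val_score(Lasso(alpha=lam, max_iter=10_000), X, y,
                     cv=10, scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print(f"lambda with smallest CV error: {best_lambda:.4f}")
```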

A very special case

Suppose n = p and our matrix of predictors is X = I.

Then, the objective function in Ridge regression can be simplified:

\[
\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda \beta_j^2.
\]

It is easy to show that

\[
\hat{\beta}^R_j = \frac{y_j}{1 + \lambda}.
\]

16 / 19
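
One way to check the formula (a one-line calculus verification, not shown on the slide): the per-coordinate objective is smooth, so set its derivative to zero.

\[
\frac{d}{d\beta_j}\Big[ (y_j - \beta_j)^2 + \lambda \beta_j^2 \Big]
= -2\,(y_j - \beta_j) + 2\lambda \beta_j = 0
\quad\Longrightarrow\quad
\hat{\beta}^R_j = \frac{y_j}{1 + \lambda}.
\]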


A very special case

Similar story for the Lasso; the objective function is:

\[
\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda |\beta_j|.
\]

It is easy to show that

\[
\hat{\beta}^L_j =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2;\\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2;\\
0 & \text{if } |y_j| < \lambda/2.
\end{cases}
\]

17 / 19
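
A tiny numerical check of the two closed forms in this special case, written as a numpy sketch (the example vector `y` is arbitrary):

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution when X = I: soft-thresholding of y at lam / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

def ridge_shrink(y, lam):
    """Ridge solution when X = I: proportional shrinkage of y."""
    return y / (1.0 + lam)

y = np.array([-3.0, -0.4, 0.2, 1.0, 2.5])
print(soft_threshold(y, lam=1.0))  # small entries are set exactly to 0
print(ridge_shrink(y, lam=1.0))    # every entry is shrunk, none exactly 0
```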


Lasso and Ridge coefficients as a function of λ

[Figure: coefficient estimates for Ridge vs. least squares (left) and for the Lasso vs. least squares (right), plotted against y_j.]

18 / 19

Bayesian interpretations

Ridge: β̂^R is the posterior mean, with a Normal prior on β.

Lasso: β̂^L is the posterior mode, with a Laplace prior on β.

[Figure: prior densities g(β_j); left: the Normal prior corresponding to Ridge, right: the Laplace prior corresponding to the Lasso.]

19 / 19
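
A short sketch of where the Lasso statement comes from (the ridge/Normal case is analogous): assume i.i.d. Normal(0, σ²) errors and independent Laplace(0, b) priors on the β_j. Up to additive constants, the negative log-posterior is

\[
-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \;+\; \frac{1}{b} \sum_{j=1}^{p} |\beta_j| \;+\; \mathrm{const},
\]

so the posterior mode minimizes the Lasso objective with λ = 2σ²/b; replacing the Laplace prior with a Normal prior turns the ℓ1 term into an ℓ2 penalty and gives ridge instead.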