Shrinkage method II: Lasso
The lasso, short for Least Absolute Shrinkage and Selection Operator, differs from ridge regression in that it performs variable selection. The lasso coefficients β̂^L_λ minimize

\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.

In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by

\|\beta\|_1 = \sum_j |\beta_j|.
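As a concrete illustration of this definition, here is a minimal sketch using scikit-learn on simulated data (the data and penalty value are illustrative; in scikit-learn's parameterization, alpha plays the role of λ up to a 1/(2n) scaling of the RSS term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]       # only two predictors carry signal
y = X @ beta_true + rng.normal(size=n)

# scikit-learn minimizes (1 / (2n)) * RSS + alpha * ||beta||_1,
# so alpha corresponds to lambda up to the 1/(2n) scaling.
fit = Lasso(alpha=0.5).fit(X, y)
print(fit.coef_)                  # many coefficients are exactly zero
```

Unlike ridge regression, the ℓ1 penalty sets many of the estimated coefficients exactly to zero, which is what makes the lasso perform variable selection.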
The solution path of the lasso
Figure: The standardized lasso coefficients on the Credit data set, shown as a function of λ (left panel) and of ‖β̂^L_λ‖1/‖β̂‖1 (right panel). The labeled curves correspond to the Income, Limit, Rating, and Student predictors; the vertical axes show the standardized coefficients.
Equivalent formulation

Ridge regression is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s.

The lasso is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s.

Best subset regression is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p I(\beta_j \neq 0) \le s.
Sparse variable selection by the lasso
Figure: Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, |β1| + |β2| ≤ s and β1² + β2² ≤ s, while the red ellipses are the contours of the RSS.
Comparison of Lasso and Ridge regression
Figure: Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set, as functions of λ. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest. In this simulation setting, all 45 predictors are related to the response; n = 50.
Figure: Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso, as functions of λ. The simulated data are similar to those in the previous figure, except that now only two predictors are related to the response. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest. In this simulation setting, only 2 of 45 predictors are related to the response; n = 50.
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or exactly zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors related to the response is never known a priori for real data sets. A technique such as cross-validation can be used to determine which approach is better on a particular data set, as in the sketch below.
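A minimal sketch of this cross-validation comparison with scikit-learn (the simulated data, penalty values, and fold count are illustrative, not taken from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 50, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]                      # sparse truth: 2 of 45 predictors matter
y = X @ beta + rng.normal(size=n)

# Estimate test MSE for each method by 10-fold cross-validation
for name, model in [("lasso", Lasso(alpha=0.1, max_iter=10000)),
                    ("ridge", Ridge(alpha=1.0))]:
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.2f}")    # lasso should win in this sparse setting
```

In a dense setting (many moderate coefficients), the same comparison would tend to favor ridge instead; the point is that the data, not prior belief, decides.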
Mathematical comparison in a simple case
Consider a simple special case with n = p, and X a diagonal matrix with 1's on the diagonal and 0's in all off-diagonal elements. To simplify the problem further, assume also that we are performing regression without an intercept. With these assumptions, the usual least squares problem simplifies to finding β1, ..., βp that minimize

\sum_{j=1}^p (y_j - \beta_j)^2.

In this case the least squares solution is given by

β̂j = yj.
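In this same setting, ridge regression and the lasso admit simple closed-form solutions (the standard results that the figure below visualizes):

\hat\beta_j^{R} = \frac{y_j}{1 + \lambda}, \qquad
\hat\beta_j^{L} =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2, \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2, \\
0 & \text{if } |y_j| \le \lambda/2.
\end{cases}

Ridge shrinks every least squares estimate by the same proportion, whereas the lasso translates each estimate toward zero by λ/2 and sets estimates smaller than λ/2 in magnitude exactly to zero (soft-thresholding).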
Figure: The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal, plotted against yj. Left: the ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: the lasso coefficient estimates are soft-thresholded towards zero.
In the case of a more general data matrix X, the story is a little more complicated than what is depicted in the previous figure, but the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
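To make the two behaviors concrete, here is a small sketch implementing the proportional-shrinkage and soft-thresholding rules from the orthonormal setting above (the function names are illustrative):

```python
import numpy as np

def ridge_shrink(y, lam):
    """Ridge in the orthonormal setting: shrink every estimate proportionally."""
    return y / (1.0 + lam)

def lasso_shrink(y, lam):
    """Lasso in the orthonormal setting: soft-threshold at lambda/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-1.5, -0.3, 0.1, 0.8, 1.5])   # least squares estimates (= y_j here)
print(ridge_shrink(y, 1.0))   # every entry shrunk, none exactly zero
print(lasso_shrink(y, 1.0))   # entries below the threshold become exactly zero
```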
Bayesian perspective
Figure: Left: ridge regression is the posterior mode for β under a Gaussian prior. Right: the lasso is the posterior mode for β under a double-exponential prior. Both priors g(βj) are plotted as functions of βj.
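To see why (a standard calculation, sketched here under the usual Gaussian noise assumption): suppose yi = β0 + Σj βj xij + εi with εi ∼ N(0, σ²), and place independent double-exponential (Laplace) priors g(βj) ∝ exp(−|βj|/b) on the coefficients. Up to additive constants, the negative log posterior is

-\log p(\beta \mid X, y) = \frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \frac{1}{b} \sum_{j=1}^p |\beta_j| + \text{const},

so the posterior mode is exactly the lasso solution with λ = 2σ²/b. Replacing the double-exponential prior with a Gaussian prior turns the penalty term into a multiple of Σj βj², yielding the ridge regression objective instead.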
Selecting the tuning parameter
We choose a grid of λ values, and compute the cross-validation error for each value of λ, as described in Chapter 5. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
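A minimal sketch of this recipe using scikit-learn's LassoCV (the data, grid size, and fold count are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(2)
n, p = 50, 45
X = rng.normal(size=(n, p))
y = X[:, 0] * 4 - X[:, 1] * 3 + rng.normal(size=n)   # 2 signal variables

# 10-fold cross-validation over a grid of 100 candidate alpha (lambda) values
cv_fit = LassoCV(n_alphas=100, cv=10, max_iter=10000).fit(X, y)
print("selected alpha:", cv_fit.alpha_)

# Re-fit on all observations at the selected tuning parameter value,
# mirroring the final step of the recipe described above
final = Lasso(alpha=cv_fit.alpha_, max_iter=10000).fit(X, y)
print("nonzero coefficients:", np.sum(final.coef_ != 0))
```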
Figure: Left: cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: the coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
Figure: Left: ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set (2 out of 45 variables are signal variables, with n = 50), plotted against ‖β̂^L_λ‖1/‖β̂‖1. Right: the corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.