Shrinkage method II: Lasso
The lasso, short for Least Absolute Shrinkage and Selection Operator, differs from ridge regression in that it performs variable selection. The lasso coefficients β̂^L_λ minimize

\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.

In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by

\|\beta\|_1 = \sum_j |\beta_j|.
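As a concrete illustration of this definition, here is a minimal sketch using scikit-learn on simulated data (the data and penalty value are illustrative; in scikit-learn's parameterization, alpha plays the role of λ up to a 1/(2n) scaling of the RSS term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]       # only two predictors carry signal
y = X @ beta_true + rng.normal(size=n)

# scikit-learn minimizes (1 / (2n)) * RSS + alpha * ||beta||_1,
# so alpha corresponds to lambda up to the 1/(2n) scaling.
fit = Lasso(alpha=0.5).fit(X, y)
print(fit.coef_)                  # many coefficients are exactly zero
```

Unlike ridge regression, the ℓ1 penalty sets many of the estimated coefficients exactly to zero, which is what makes the lasso perform variable selection.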
The solution path of the lasso
Figure: The standardized lasso coefficients on the Credit data set, shown as a function of λ (left panel) and of ‖β̂^L_λ‖1/‖β̂‖1 (right panel). The labeled curves correspond to the Income, Limit, Rating, and Student predictors; the vertical axes show the standardized coefficients.
Equivalent formulation

Ridge regression is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s.

The lasso is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s.

Best subset regression is equivalent to

\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p I(\beta_j \neq 0) \le s.
Sparse variable selection by the lasso
Figure: Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, |β1| + |β2| ≤ s and β1² + β2² ≤ s, while the red ellipses are the contours of the RSS.
Comparison of Lasso and Ridge regression
Figure: Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set, as functions of λ. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest. In this simulation setting, all 45 predictors are related to the response; n = 50.
Figure: Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso, as functions of λ. The simulated data are similar to those in the previous figure, except that now only two predictors are related to the response. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest. In this simulation setting, only 2 of 45 predictors are related to the response; n = 50.
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or exactly zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors related to the response is never known a priori for real data sets. A technique such as cross-validation can be used to determine which approach is better on a particular data set, as in the sketch below.
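A minimal sketch of this cross-validation comparison with scikit-learn (the simulated data, penalty values, and fold count are illustrative, not taken from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 50, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]                      # sparse truth: 2 of 45 predictors matter
y = X @ beta + rng.normal(size=n)

# Estimate test MSE for each method by 10-fold cross-validation
for name, model in [("lasso", Lasso(alpha=0.1, max_iter=10000)),
                    ("ridge", Ridge(alpha=1.0))]:
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.2f}")    # lasso should win in this sparse setting
```

In a dense setting (many moderate coefficients), the same comparison would tend to favor ridge instead; the point is that the data, not prior belief, decides.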
Mathematical comparison in a simple case
Consider a simple special case with n = p, and X a diagonal matrix with 1's on the diagonal and 0's in all off-diagonal elements. To simplify the problem further, assume also that we are performing regression without an intercept. With these assumptions, the usual least squares problem simplifies to finding β1, ..., βp that minimize

\sum_{j=1}^p (y_j - \beta_j)^2.

In this case the least squares solution is given by

β̂j = yj.
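In this same setting, ridge regression and the lasso admit simple closed-form solutions (the standard results that the figure below visualizes):

\hat\beta_j^{R} = \frac{y_j}{1 + \lambda}, \qquad
\hat\beta_j^{L} =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2, \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2, \\
0 & \text{if } |y_j| \le \lambda/2.
\end{cases}

Ridge shrinks every least squares estimate by the same proportion, whereas the lasso translates each estimate toward zero by λ/2 and sets estimates smaller than λ/2 in magnitude exactly to zero (soft-thresholding).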
Figure: The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal, plotted against yj. Left: the ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: the lasso coefficient estimates are soft-thresholded towards zero.
In the case of a more general data matrix X, the story is a little more complicated than what is depicted in the previous figure, but the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
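To make the two behaviors concrete, here is a small sketch implementing the proportional-shrinkage and soft-thresholding rules from the orthonormal setting above (the function names are illustrative):

```python
import numpy as np

def ridge_shrink(y, lam):
    """Ridge in the orthonormal setting: shrink every estimate proportionally."""
    return y / (1.0 + lam)

def lasso_shrink(y, lam):
    """Lasso in the orthonormal setting: soft-threshold at lambda/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-1.5, -0.3, 0.1, 0.8, 1.5])   # least squares estimates (= y_j here)
print(ridge_shrink(y, 1.0))   # every entry shrunk, none exactly zero
print(lasso_shrink(y, 1.0))   # entries below the threshold become exactly zero
```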
Bayesian perspective
Figure: Left: ridge regression is the posterior mode for β under a Gaussian prior. Right: the lasso is the posterior mode for β under a double-exponential prior. Both priors g(βj) are plotted as functions of βj.
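To see why (a standard calculation, sketched here under the usual Gaussian noise assumption): suppose yi = β0 + Σj βj xij + εi with εi ∼ N(0, σ²), and place independent double-exponential (Laplace) priors g(βj) ∝ exp(−|βj|/b) on the coefficients. Up to additive constants, the negative log posterior is

-\log p(\beta \mid X, y) = \frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \frac{1}{b} \sum_{j=1}^p |\beta_j| + \text{const},

so the posterior mode is exactly the lasso solution with λ = 2σ²/b. Replacing the double-exponential prior with a Gaussian prior turns the penalty term into a multiple of Σj βj², yielding the ridge regression objective instead.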
Selecting the tuning parameter
We choose a grid of λ values, and compute the cross-validation error for each value of λ, as described in Chapter 5. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
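A minimal sketch of this recipe using scikit-learn's LassoCV (the data, grid size, and fold count are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(2)
n, p = 50, 45
X = rng.normal(size=(n, p))
y = X[:, 0] * 4 - X[:, 1] * 3 + rng.normal(size=n)   # 2 signal variables

# 10-fold cross-validation over a grid of 100 candidate alpha (lambda) values
cv_fit = LassoCV(n_alphas=100, cv=10, max_iter=10000).fit(X, y)
print("selected alpha:", cv_fit.alpha_)

# Re-fit on all observations at the selected tuning parameter value,
# mirroring the final step of the recipe described above
final = Lasso(alpha=cv_fit.alpha_, max_iter=10000).fit(X, y)
print("nonzero coefficients:", np.sum(final.coef_ != 0))
```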
Figure: Left: cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: the coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
Figure: Left: ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set (2 out of 45 variables are signal variables, with n = 50), plotted against ‖β̂^L_λ‖1/‖β̂‖1. Right: the corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.