Lecture 14: Shrinkage
Reading: Section 6.2
STATS 202: Data mining and analysis
Jonathan Taylor, 10/29. Slide credits: Sergio Bacallado
Shrinkage methods
The idea is to perform a linear regression while regularizing, or shrinking, the coefficients β̂ toward 0.
Why would shrunk coefficients be better?
- This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this decreases the test error.
- There are Bayesian motivations to do this: the prior tends to shrink the parameters.
Ridge regression
Ridge regression solves the following optimization:

$$\min_{\beta}\ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
In blue, we have the RSS of the model.
In red, we have the squared ℓ₂ norm of β, or ‖β‖₂².
The parameter λ is a tuning parameter. It modulates the importance of fit vs. shrinkage.
We find an estimate β̂^R_λ for many values of λ and then choose it by cross-validation.
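As a concrete sketch (not from the slides): ridge regression with λ chosen by cross-validation can be run with scikit-learn, where the tuning parameter λ is called `alpha`. The data below is synthetic.

```python
# Sketch: ridge regression with the penalty chosen by cross-validation.
# scikit-learn calls the tuning parameter lambda "alpha".
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=2.0, size=n)

# Fit an estimate for a grid of lambda values and keep the value
# with the smallest cross-validation error.
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas).fit(X, y)
print("chosen lambda:", model.alpha_)
```

Note that scikit-learn's exact penalty scaling differs slightly from the slide's objective, but the role of λ is the same.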
Bias-variance tradeoff
In a simulation study, we compute bias, variance, and test error as a function of λ.
[Figure: mean squared error as a function of λ (left) and of ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]
Ridge regression
In least-squares linear regression, scaling the variables has no effect on the fit of the model:
Y = β0 + β1X1 + β2X2 + · · ·+ βpXp.
Multiplying X1 by c can be compensated by dividing β̂1 by c, i.e. after doing this we have the same RSS.
In ridge regression, this is not true.
In practice, what do we do?
- Scale each variable such that it has sample variance 1 before running the regression.
- This prevents penalizing some coefficients more than others.
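A minimal sketch of this practice with scikit-learn (synthetic data; the scaling step and the ridge fit are chained in a pipeline):

```python
# Sketch: standardize each predictor to sample variance 1 before ridge,
# so the penalty treats every coefficient symmetrically.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Columns on wildly different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([1.0, 0.01, 100.0]) + rng.normal(size=50)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(pipe.named_steps["ridge"].coef_)
```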
Example. Ridge regression
Ridge regression of default in the Credit dataset.
[Figure: standardized coefficients for Income, Limit, Rating, and Student as a function of λ (left) and of ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]
Selecting λ by cross-validation
[Figure: cross-validation error (left) and standardized coefficients (right) as a function of λ.]
The Lasso
Lasso regression solves the following optimization:

$$\min_{\beta}\ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
In blue, we have the RSS of the model.
In red, we have the ℓ₁ norm of β, or ‖β‖₁.
Why would we use the Lasso instead of Ridge regression?
- Ridge regression shrinks all the coefficients to a non-zero value.
- The Lasso shrinks some of the coefficients all the way to zero. A convex alternative to best subset selection or stepwise selection!
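To illustrate (a sketch on synthetic data, not from the slides): fitting scikit-learn's Lasso and Ridge to a sparse problem shows the Lasso setting many coefficients to exactly zero, while ridge leaves them small but non-zero. The penalty value 0.5 is an arbitrary choice.

```python
# Sketch: the Lasso produces exact zeros; ridge typically does not.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]          # only 2 truly non-zero coefficients
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso zeros:", (lasso.coef_ == 0).sum())  # many exact zeros
print("ridge zeros:", (ridge.coef_ == 0).sum())  # typically none
```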
Example. Ridge regression
Ridge regression of default in the Credit dataset.
[Figure: standardized coefficients for Income, Limit, Rating, and Student as a function of λ (left) and of ‖β̂^R_λ‖₂/‖β̂‖₂ (right).]
A lot of pesky small coefficients throughout the regularization path.
Example. The Lasso
Lasso regression of default in the Credit dataset.
[Figure: standardized Lasso coefficients for Income, Limit, Rating, and Student as a function of λ (left) and of ‖β̂^L_λ‖₁/‖β̂‖₁ (right).]
Those coefficients are shrunk to zero.
An alternative formulation for regularization
- Ridge: for every λ, there is an s such that β̂^R_λ solves:

$$\underset{\beta}{\text{minimize}}\ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s.$$

- Lasso: for every λ, there is an s such that β̂^L_λ solves:

$$\underset{\beta}{\text{minimize}}\ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s.$$

- Best subset:

$$\underset{\beta}{\text{minimize}}\ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \Big)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) \le s.$$
Visualizing Ridge and the Lasso with 2 predictors
The Lasso constraint region ∑_{j=1}^p |β_j| ≤ s is a diamond ♦; the Ridge constraint region ∑_{j=1}^p β_j² ≤ s is a disk.

[Figure: the two constraint regions with p = 2 predictors.]

Best subset with s = 1 is the union of the axes...
When is the Lasso better than Ridge?
Example 1. Most of the coefficients are non-zero.
[Figure: bias, variance, and MSE as a function of λ (left) and of the training-data R² (right); the Lasso (—), Ridge (· · · ).]
- The bias is about the same for both methods.
- The variance of Ridge regression is smaller, and so is the MSE.
When is the Lasso better than Ridge?
Example 2. Only 2 coefficients are non-zero.
[Figure: bias, variance, and MSE as a function of λ (left) and of the training-data R² (right); the Lasso (—), Ridge (· · · ).]
- The bias, variance, and MSE are all lower for the Lasso.
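This behavior can be reproduced in a small simulation (a sketch, with hypothetical sample sizes and coefficients; scikit-learn's LassoCV and RidgeCV pick λ by cross-validation):

```python
# Sketch: in a sparse setting (only 2 non-zero coefficients), the Lasso
# tends to achieve a lower test MSE than ridge.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(3)
n, p = 100, 45
beta = np.zeros(p)
beta[:2] = [5.0, -5.0]                    # only 2 non-zero, as in Example 2

def avg_test_mse(make_model, reps=20):
    errs = []
    for _ in range(reps):                 # average over simulated datasets
        X = rng.normal(size=(2 * n, p))
        y = X @ beta + rng.normal(size=2 * n)
        fit = make_model().fit(X[:n], y[:n])        # train on first half
        errs.append(np.mean((y[n:] - fit.predict(X[n:])) ** 2))
    return float(np.mean(errs))

mse_lasso = avg_test_mse(lambda: LassoCV(cv=5))
mse_ridge = avg_test_mse(lambda: RidgeCV(alphas=np.logspace(-3, 3, 30)))
print("lasso:", mse_lasso, "ridge:", mse_ridge)
```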
Choosing λ by cross-validation
[Figure: cross-validation error (left) and standardized Lasso coefficients (right) as a function of ‖β̂^L_λ‖₁/‖β̂‖₁.]
A very special case
Suppose n = p and our matrix of predictors is X = I.
Then, the objective function in Ridge regression can be simplified:

$$\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

and we can minimize the terms that involve each β_j separately:

$$(y_j - \beta_j)^2 + \lambda \beta_j^2.$$

It is easy to show that

$$\hat\beta^R_j = \frac{y_j}{1 + \lambda}.$$
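The closed form can be checked numerically by minimizing the per-coordinate objective over a fine grid (a sketch with arbitrary example values):

```python
# Numerical check of beta-hat^R_j = y_j / (1 + lambda) when X = I:
# minimize (y_j - b)^2 + lambda * b^2 over a grid of candidate b values.
import numpy as np

y = np.array([3.0, -1.5, 0.2])
lam = 2.0
grid = np.linspace(-5, 5, 100001)          # grid spacing 1e-4

for yj in y:
    objective = (yj - grid) ** 2 + lam * grid ** 2
    b_hat = grid[np.argmin(objective)]
    assert abs(b_hat - yj / (1 + lam)) < 1e-3   # matches the closed form
print(y / (1 + lam))
```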
A very special case
Similar story for the Lasso; the objective function is:

$$\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

and we can minimize the terms that involve each β_j separately:

$$(y_j - \beta_j)^2 + \lambda |\beta_j|.$$

It is easy to show that

$$\hat\beta^L_j = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}$$
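This piecewise formula is known as soft-thresholding; a minimal sketch (example values are arbitrary):

```python
# Soft-thresholding: the Lasso solution when X = I, as a function of y_j.
import numpy as np

def soft_threshold(y, lam):
    # Shrink y toward 0 by lam/2, and set it to exactly 0 when |y| <= lam/2.
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([3.0, -1.5, 0.2])
lam = 2.0
print(soft_threshold(y, lam))   # the last entry is exactly 0
```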
Lasso and Ridge coefficients as a function of λ
[Figure: Ridge (left) and Lasso (right) coefficient estimates as a function of y_j, compared with the least-squares estimate.]
Bayesian interpretations
Ridge: β̂R is the posterior mean, with a Normal prior on β.
Lasso: β̂L is the posterior mode, with a Laplace prior on β.
[Figure: prior densities g(β_j): Gaussian for Ridge (left) and Laplace for the Lasso (right).]