Post on 01-Sep-2020
Variable Selection at Scale
Trevor Hastie, Stanford University
with Ryan Tibshirani and Rob Tibshirani
Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).
• Simulation study to evaluate them all over a wide range of settings.
Conclusions:
• Forward stepwise is very close to best subset, but much faster.
• Relaxed lasso is the overall winner, and fastest by far.
• In wide-data settings with low SNR, lasso can beat best subset and forward stepwise.
Paper: https://arxiv.org/abs/1707.08692
R package: https://github.com/ryantibs/best-subset/
Best Subset Selection

[Figure 3.5 from ESL: All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.]
1. For each subset of size k of the p variables, evaluate the fitting objective (e.g. RSS) via linear regression on the training data.
2. Candidate models β(k) are at the lower frontier — the best for each k on the training data.
3. Pick k using a validation dataset (or CV), and deliver β(k).
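For intuition, step 1 can be sketched as a brute-force search in a few lines of numpy. This is illustrative only (the `best_subset` helper and the toy data are our own, and no intercept is fitted); practical implementations use leaps-and-bounds or MIO rather than enumerating all subsets:

```python
from itertools import combinations
import numpy as np

def best_subset(X, y, k):
    """Return (best_indices, best_rss): the size-k subset minimizing training RSS."""
    n, p = X.shape
    best_rss, best_idx = np.inf, None
    for idx in combinations(range(p), k):
        Xs = X[:, idx]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_idx = rss, idx
    return best_idx, best_rss

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = X[:, 0] + 2 * X[:, 3] + 0.1 * rng.standard_normal(50)
idx, rss = best_subset(X, y, k=2)
print(idx)  # with this strong signal, the two truly active variables
```

The cost is (p choose k) least-squares fits per subset size, which is why the slide calls the problem combinatorially hard.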
Properties of Best Subset Selection

✓ Well-defined goal — the obvious gold standard for variable selection.
✓ Feasible for least squares regression with p ≈ 35 using clever algorithms (Furnival and Wilson, 1974, "Leaps and Bounds").
✗ Combinatorially hard for large p.
? Obvious gold standard — really?
Best Subset Selection Breakthrough

Rahul Mazumder, with Bertsimas and King (AoS 2016), cracks the forty-year-old best-subset selection bottleneck! They use mixed-integer optimization (MIO) along with the gurobi solver.

minimize_{z,β}  (1/2) Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j)²

subject to  −M z_j ≤ β_j ≤ M z_j,  z_j ∈ {0,1},  j = 1, …, p,
            Σ_{j=1}^{p} z_j ≤ k.

Their procedure iteratively narrows the optimality gap — if the gap hits zero, they have found the solution.
Forward Stepwise Selection

Greedy forward algorithm, traditionally thought of as a sub-optimal but feasible alternative to best-subset regression.

1. Start with the null model (response mean).
2. Choose among the p variables to find the best single-variable model in terms of the fitting objective.
3. Choose among the remaining p − 1 variables to find the one that, when included with the previously chosen variable, best improves the fitting objective.
4. Choose among the remaining p − 2, …, and so on.

Forward stepwise produces a nested sequence of models β(k), k = 1, 2, …. Pick k using a validation dataset.
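The greedy loop above can be sketched naively in numpy. This is a plain refit-each-candidate version (the `forward_stepwise` name and toy data are our own); the efficient guided-QR variant used in practice is not shown:

```python
import numpy as np

def forward_stepwise(X, y, kmax):
    """Greedily add the variable that most reduces RSS; return the nested index sets."""
    n, p = X.shape
    active, path = [], []
    for _ in range(kmax):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            Xs = X[:, active + [j]]          # candidate model: current set plus j
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active = active + [best_j]
        path.append(list(active))
    return path

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 10))
y = 3 * X[:, 2] + X[:, 7] + 0.2 * rng.standard_normal(80)
path = forward_stepwise(X, y, kmax=3)
print(path[0], path[1])
```

Each step here refits from scratch; QR updating reduces the whole path to the cost of one full least-squares fit, as noted on the next slide.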
Forward Stepwise Selection Properties

✓ Computationally feasible with big data, and also works with n ≪ p.
✓ Efficient computations with squared-error loss. Computations can be arranged as a guided QR decomposition of the X matrix, and hence cost the same as a full least-squares fit: O(np min(n, p)).
✓ Performance very similar to best subset selection, although difficult counterexamples can be constructed.
✗ Efficiency not available for GLMs, although score approximations can be used.
✗ Tedious with very large p and n, since terms are added one at a time.
Lasso

The lasso (Tibshirani, 1996) solves

minimize_β  (1/2) Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j)²   s.t.  ‖β‖_1 ≤ t

Generally, the smaller t, the sparser the solutions, and approximate nesting occurs. We compute many solutions over a range of values of t, and select t using validation data.

Often thought of as a convex relaxation of the best-subset problem

minimize_β  (1/2) Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j)²   s.t.  ‖β‖_0 ≤ k
Lasso properties

We typically solve the lasso in Lagrange form

minimize_β  (1/2) Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j)² + λ‖β‖_1
✓ Extremely fast algorithms for solving lasso problems (with many loss functions). Pathwise coordinate descent via glmnet (Friedman, H, Tibshirani, 2010) exploits sparsity, active-set convergence, strong rules, and more, to rapidly compute the entire solution path on a grid of values of λ.
✓ With large p, provides convenient subset selection, taking leaps rather than single steps.
✗ Since coefficients are both selected and regularized, can suffer from shrinkage bias.
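The coordinate-descent update at the heart of glmnet is a soft-thresholding step. A stripped-down numpy sketch (our own, assuming squared-error loss, no intercept, and a fixed λ; none of glmnet's accelerations such as active sets, strong rules, or warm starts are shown):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    col_ss = np.sum(X ** 2, axis=0)       # per-coordinate curvature ||x_j||^2
    beta = np.zeros(p)
    r = y.copy()                           # residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]      # partial residual: remove j's current fit
            zj = X[:, j] @ r
            beta[j] = soft_threshold(zj, lam) / col_ss[j]
            r = r - X[:, j] * beta[j]      # restore residual with updated beta_j
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
y = 2 * X[:, 0] - 1.5 * X[:, 5] + 0.3 * rng.standard_normal(100)
beta = lasso_cd(X, y, lam=10.0)
print(np.nonzero(beta)[0])
```

Each coordinate update is exact (the one-dimensional lasso problem has the closed-form soft-threshold solution), which is why the zeros are exact zeros rather than small numbers.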
[Figure: Lasso coefficient path, standardized coefficients plotted against ‖β(λ)‖_1 / ‖β(0)‖_1, with subset sizes 0, 2, 3, 4, 5, 7, 8, 10 marked along the top.]

Lasso: β(λ) = argmin_β (1/N) Σ_{i=1}^{N} (y_i − β_0 − x_i^T β)² + λ‖β‖_1,
fit using the lars package in R (Efron, H, Johnstone, Tibshirani 2002).
Lasso and Least-Angle Regression (LAR)

Interesting connection between lasso and forward stepwise.

LAR algorithm: democratic forward stepwise

1. Find the variable X(1) most correlated with the response.
2. While moving towards the least-squares fit on X(1), keep track of the correlations of the other variables with the evolving residual.
3. When X(2) catches up in correlation, include it in the model, and move the pair toward their joint least-squares fit (the correlations stay tied!).
4. And so on.

LAR path = lasso path (almost always). Forward stepwise goes all the way with each variable, while LAR lets others in when they catch up. This slow learning was inspired by the forward-stagewise approach of boosting.
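The "catch-up" mechanics of step 2-3 can be checked numerically. This sketch (our own toy design, with centered unit-norm columns) computes the entry point of the second variable for a single LAR step and verifies that its correlation with the residual exactly ties the leader's:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)             # unit-norm, centered columns
y = 1.5 * X[:, 1] + X[:, 4] + 0.1 * rng.standard_normal(n)
y -= y.mean()

c = X.T @ y                                 # current inner products ("correlations")
j1 = int(np.argmax(np.abs(c)))              # first variable to enter
s = np.sign(c[j1])
C = np.abs(c[j1])
a = X.T @ (s * X[:, j1])                    # decay rate of each correlation per unit step
# Smallest positive step at which some other variable ties the leader:
gammas = []
for k in range(p):
    if k == j1:
        continue
    for g in ((C - c[k]) / (1 - a[k]), (C + c[k]) / (1 + a[k])):
        if g > 1e-12:
            gammas.append((g, k))
gamma, j2 = min(gammas)
r = y - gamma * s * X[:, j1]                # residual after the partial step
print(j1, j2, abs(abs(X[:, j1] @ r) - abs(X[:, j2] @ r)))  # tie is exact (≈ 0)
```

With a single active variable the equiangular direction is just the variable itself, so the step-length formula above is the one-variable special case of the LARS update.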
[Figure 3.14 from ESL: Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.]

[Figure 3.15 from ESL: The left panel shows the LAR coefficient profiles on the simulated data, as a function of L1 arc length. The right panel shows the lasso profiles. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.]
Relaxed Lasso

Originally proposed by Meinshausen (2006). We present a simplified version.

• Suppose β_λ is the lasso solution at λ, and let A_λ be the active set of indices with nonzero coefficients in β_λ.
• Let β^LS_{A_λ} be the coefficients in the least-squares fit using only the variables in A_λ. Let β^LS_λ be the full-sized version of this coefficient vector, padded with zeros. β^LS_λ debiases the lasso, while maintaining its sparsity.
• Define the relaxed lasso

  β^RELAX_λ(γ) = γ · β_λ + (1 − γ) · β^LS_λ

Once β^LS_λ is computed at the desired values of λ, the whole family β^RELAX_λ(γ) comes free of charge!
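The construction above is easy to sketch: given any lasso solution, refit least squares on its active set, pad with zeros, and blend. The `relaxed_lasso` helper is our own, and the hard-coded shrunken coefficients stand in for a real lasso fit:

```python
import numpy as np

def relaxed_lasso(X, y, beta_lasso, gamma):
    """gamma=1 recovers the lasso fit; gamma=0 is the debiased LS refit."""
    active = np.nonzero(beta_lasso)[0]
    beta_ls = np.zeros_like(beta_lasso)
    if active.size:
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        beta_ls[active] = coef              # padded with zeros off the active set
    return gamma * beta_lasso + (1 - gamma) * beta_ls

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 6))
y = X @ np.array([2.0, 0, 0, -1.0, 0, 0]) + 0.1 * rng.standard_normal(60)
beta_lasso = np.array([1.6, 0, 0, -0.7, 0, 0])   # illustrative shrunken lasso solution
b0 = relaxed_lasso(X, y, beta_lasso, gamma=0.0)  # LS refit on {0, 3}: debiased
b1 = relaxed_lasso(X, y, beta_lasso, gamma=1.0)  # the lasso fit itself
```

Because the blend is linear in γ, sweeping γ after the one LS refit per λ costs essentially nothing, which is the "free of charge" point above.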
Simulation

[Figure: Expected prediction error versus subset size for best subset, forward stepwise, lasso, relaxed lasso (γ = 0), and relaxed lasso (γ = 0.5); n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]
Simulation Setup

Y = Σ_{j=1}^{p} X_j β_j + ε,   X ∼ N_p(0, Σ),   ε ∼ N(0, σ²)

• p = 30, sample size n = 70, and the first s = 5 values of β are 1; the rest are zero.
• Σ is a correlation matrix, with Cov(X_i, X_j) = ρ^{|i−j|} and ρ = 0.35.
• σ² is chosen here to achieve the desired SNR = Var(Xβ)/σ² of 0.71.
• This is equivalent to a percentage variance explained (R²) of 42%, since the population PVE = SNR/(1 + SNR).
• Where appropriate we have a separate validation set of size n, and an infinite test set.
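The design above can be sketched directly in numpy (the `simulate` function and its argument names are our own; it calibrates σ² from the target SNR using Var(x^T β) = β^T Σ β):

```python
import numpy as np

def simulate(n=70, p=30, s=5, rho=0.35, snr=0.71, seed=0):
    """Draw (X, y) from the slide's design: AR(1)-correlated Gaussian X,
    beta with s leading ones, sigma^2 set so Var(X beta)/sigma^2 = snr."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    beta = np.zeros(p)
    beta[:s] = 1.0
    signal_var = beta @ Sigma @ beta        # Var(x^T beta) under X ~ N(0, Sigma)
    sigma = np.sqrt(signal_var / snr)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, sigma

X, y, beta, sigma = simulate()
pve = 0.71 / (1 + 0.71)                     # population PVE = SNR/(1+SNR) ≈ 0.42
```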
Degrees of Freedom

We can get some insight into the aggressiveness of the procedures by looking at their degrees of freedom.

Suppose y_i = f(x_i) + ε_i, i = 1, …, n, and assume Var(ε_i) = σ². Let ŷ_i be the fitted value for observation i, after applying some regression method to the n pairs (x_i, y_i) (e.g. best-subset linear regression of size k, or the lasso with parameter λ). Then

df = (1/σ²) Σ_{i=1}^{n} Cov(ŷ_i, y_i)

These covariances are with respect to the sampling distribution of the y_i. The more aggressive the procedure, the more it will overfit the training responses, and hence the higher the covariances and df.
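The covariance formula can be checked by Monte Carlo under a fixed design. For plain least squares on p predictors, df should come out near trace(H) = p; plugging a selection-based fitter into the same loop would show the inflated df discussed on this slide. The simulation setup here is our own:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 50, 5, 1.0
X = rng.standard_normal((n, p))
f = X @ np.ones(p)                          # fixed true mean function f(x_i)
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix for OLS on this design

reps = 2000
yhat = np.empty((reps, n))
ys = np.empty((reps, n))
for r in range(reps):
    y = f + sigma * rng.standard_normal(n)  # redraw responses, design fixed
    ys[r] = y
    yhat[r] = H @ y
# df = (1/sigma^2) * sum_i Cov(yhat_i, y_i), estimated across replications
df = sum(np.cov(yhat[:, i], ys[:, i])[0, 1] for i in range(n)) / sigma ** 2
print(df)  # close to trace(H) = p = 5
```

Replacing `H @ y` with a fit that searches over variables (best subset, stepwise) makes df exceed the number of selected coefficients, which is the aggressiveness the slide measures.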
[Figure: Degrees of freedom versus subset size for best subset, forward stepwise, lasso, relaxed lasso (γ = 0), and relaxed lasso (γ = 0.5); n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]
Notable features of the previous plot

• Df for the lasso is the size of the active set (Efron et al 2004, Zou et al 2007) — shrinkage offsets selection.
• Best subset is most aggressive, with forward stepwise just behind (in this example). Df can exceed p for best subset and forward stepwise due to non-convexity (Janson et al 2015, Kaufman & Rosset 2014).
• Relaxed lasso is notably less aggressive, in particular β^LS_λ (γ = 0).
Next plots · · ·

• Show results over a range of SNRs.
• Averaged over 10 simulations.
• For each method, a validation set of the same size as the training set is used to select the best model.
• Reported errors are over an infinite test set.
[Figure: (Test Error)/σ² versus signal-to-noise ratio (0.05 to 6.00) for best subset, lasso, relaxed lasso, and stepwise; correlation 0.35, beta-type 2, n = 70, p = 30, s = 5.]
[Figure: Relative risk (to the null model) versus signal-to-noise ratio for best subset, lasso, relaxed lasso, and stepwise; correlation 0.35, beta-type 2, n = 70, p = 30, s = 5.]
[Figure: Proportion of variance explained versus signal-to-noise ratio for best subset, lasso, relaxed lasso, and stepwise; correlation 0.35, beta-type 2, n = 70, p = 30, s = 5.]
[Figure: Classification measure F (recovery of the true β) versus signal-to-noise ratio for best subset, lasso, relaxed lasso, and stepwise; correlation 0.35, beta-type 2, n = 70, p = 30, s = 5.]
Next plots · · ·

As before, but also:

• Different pairwise correlations between variables.
• Different patterns of true coefficients.
• Different problem sizes n, p.
Timings

Setting                      BS         FS      Lasso    R Lasso
low      n=100, p=10         3.43       0.006   0.002    0.002
medium   n=500, p=100       >120 min    0.818   0.009    0.009
high-5   n=50,  p=1000      >126 min    0.137   0.011    0.011
high-10  n=100, p=1000      >144 min    0.277   0.019    0.021

Average time in seconds to compute one path of solutions for each method, on a Linux cluster.
[Figure: (Test Error)/σ² versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 100, p = 10, s = 5.]
[Figure: Relative risk (to the null model) versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 100, p = 10, s = 5.]
[Figure: (Test Error)/σ² versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 500, p = 100, s = 5.]
[Figure: Relative risk (to the null model) versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 500, p = 100, s = 5.]
[Figure: (Test Error)/σ² versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 100, p = 1000, s = 10.]
[Figure: Relative risk (to the null model) versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 × beta-types 1, 2, 3), for best subset, lasso, relaxed lasso, and stepwise; n = 100, p = 1000, s = 10.]
Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).
• Simulation study to evaluate them all over a wide range of settings.

Conclusions:

• Forward stepwise is very close to best subset, but much faster.
• Relaxed lasso is the overall winner, and fastest by far.
• In wide-data settings with low SNR, lasso can beat best subset and forward stepwise.

Paper: https://arxiv.org/abs/1707.08692
R package: https://github.com/ryantibs/best-subset/

Thank you for attending!