Variable Selection at Scale

Trevor Hastie, Stanford University

with Ryan Tibshirani and Rob Tibshirani


Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).

• Simulation study to evaluate them all over a wide range of settings.

Conclusions:

• Forward stepwise is very close to best subset, but much faster.

• Relaxed lasso is the overall winner, and the fastest by far.

• In wide-data settings with low SNR, the lasso can beat best subset and forward stepwise.

Paper: https://arxiv.org/abs/1707.08692

R package: https://github.com/ryantibs/best-subset/


Best Subset Selection

[Slide reproduces page 58 of The Elements of Statistical Learning (Figure 3.5): "All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size," plotted for subset sizes k = 0, ..., 8, together with the accompanying book text, reproduced below.]

... cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.

3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined. Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred.

In summary, best-subset selection proceeds as follows (an R sketch is given below):

1. For each subset of size k of the p variables, evaluate the fitting objective (e.g. RSS) via linear regression on the training data.

2. Candidate models β(k) are at the lower frontier: the best for each k on the training data.

3. Pick k using a validation dataset (or CV), and deliver β(k).
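To make this concrete, here is a minimal R sketch (not the authors' bestsubset package): leaps::regsubsets does the exhaustive search, and a held-out validation set picks k. The simulated data, sizes, and variable names are illustrative assumptions.

```r
## Minimal sketch of best-subset selection with a validation-set choice of k.
## Assumes the leaps package; the data are simulated purely for illustration.
library(leaps)

set.seed(1)
n <- 70; p <- 8
X  <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("X", 1:p)
Xv <- matrix(rnorm(n * p), n, p); colnames(Xv) <- colnames(X)   # validation set
beta <- c(rep(1, 3), rep(0, p - 3))
y  <- drop(X  %*% beta) + rnorm(n)
yv <- drop(Xv %*% beta) + rnorm(n)

fit <- regsubsets(x = X, y = y, nvmax = p)        # steps 1-2: best model of each size k

val.mse <- sapply(1:p, function(k) {
  b    <- coef(fit, id = k)                       # coefficients of the best size-k model
  pred <- cbind(1, Xv[, names(b)[-1], drop = FALSE]) %*% b
  mean((yv - pred)^2)
})
k.hat <- which.min(val.mse)                       # step 3: pick k on the validation set
coef(fit, id = k.hat)
```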


Properties of Best Subset Selection

✓ Well-defined goal: the obvious gold standard for variable selection.

✓ Feasible for least squares regression with p ≈ 35 using clever algorithms (Furnival and Wilson, 1974, “Leaps and Bounds”).

✗ Combinatorially hard for large p.

? Obvious gold standard: really?


Best Subset Selection Breakthrough

Rahul Mazumder, with Bertsimas and King (AoS 2016), cracks the forty-year-old best-subset selection bottleneck! They use mixed-integer optimization (MIO) along with the gurobi solver.

minimize_{z,β}   (1/2) ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p x_{ij} β_j )²

subject to   −M z_j ≤ β_j ≤ M z_j,   z_j ∈ {0,1},   j = 1, ..., p,   ∑_{j=1}^p z_j ≤ k.

Their procedure iteratively narrows the optimality gap; if the gap hits zero, they have found the solution.


Forward Stepwise Selection

Greedy forward algorithm, traditionally thought of as a sub-optimal but feasible alternative to best-subset regression.

1. Start with the null model (the response mean).

2. Choose among the p variables to find the best single-variable model in terms of the fitting objective.

3. Choose among the remaining p − 1 variables to find the one that, when included with the previously chosen variable, best improves the fitting objective.

4. Choose among the remaining p − 2 variables, and so on.

Forward stepwise produces a nested sequence of models β(k), k = 1, 2, .... Pick k using a validation dataset.
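A minimal R sketch of the greedy pass, reusing the simulated X, y, Xv, yv from the best-subset sketch above (an assumption); method = "forward" in leaps::regsubsets plays the role of the guided search, and k is again chosen on the validation set.

```r
## Minimal sketch of forward-stepwise selection; X, y, Xv, yv as in the
## best-subset sketch above.
library(leaps)

fwd <- regsubsets(x = X, y = y, nvmax = p, method = "forward")  # nested models beta(k)

val.mse <- sapply(1:p, function(k) {
  b    <- coef(fwd, id = k)
  pred <- cbind(1, Xv[, names(b)[-1], drop = FALSE]) %*% b
  mean((yv - pred)^2)
})
coef(fwd, id = which.min(val.mse))               # k picked on the validation data
```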


Forward Stepwise Selection Properties

✓ Computationally feasible with big data, and also works with n ≪ p.

✓ Efficient computations with squared-error loss. Computations can be arranged as a guided QR decomposition of the X matrix, and hence cost the same as a full least-squares fit, O(np·min(n, p)).

✓ Performance very similar to best subset selection, although difficult counterexamples can be constructed.

✗ Efficiency not available for GLMs, although score approximations can be used.

✗ Tedious with very large p and n, since terms are added one at a time.


Lasso

The lasso (Tibshirani, 1996) solves

minimize_β   (1/2) ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p x_{ij} β_j )²   subject to   ‖β‖₁ ≤ t

Generally, the smaller t, the sparser the solution, and approximate nesting occurs. We compute many solutions over a range of values of t, and select t using validation data.

Often thought of as a convex relaxation of the best-subset problem

minimize_β   (1/2) ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p x_{ij} β_j )²   subject to   ‖β‖₀ ≤ k


Lasso properties

We typically solve lasso in Lagrange form

minimize_β   (1/2) ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p x_{ij} β_j )² + λ‖β‖₁

✓ Extremely fast algorithms for solving lasso problems (with many loss functions). Pathwise coordinate descent via glmnet (Friedman, H, Tibshirani, 2010) exploits sparsity, active-set convergence, strong rules, and more, to rapidly compute the entire solution path on a grid of values of λ (see the sketch below).

✓ With large p, provides convenient subset selection, taking leaps rather than single steps.

✗ Since coefficients are both selected and regularized, can suffer from shrinkage bias.
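A minimal glmnet sketch of the pathwise fit and the λ-selection step, using the simulated X and y from the earlier sketches (an assumption); cross-validation stands in here for the separate validation set used in the talk.

```r
## Minimal sketch: lasso path over a grid of lambda via pathwise coordinate
## descent in glmnet, with lambda chosen by cross-validation.
library(glmnet)

fit <- glmnet(X, y, alpha = 1)        # entire solution path on a grid of lambda
plot(fit, xvar = "lambda")            # coefficient profiles along the path

cvfit <- cv.glmnet(X, y, alpha = 1)   # 10-fold CV over the same grid
coef(cvfit, s = "lambda.min")         # sparse coefficients at the selected lambda
```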


[Figure: Lasso coefficient path on an example data set; standardized coefficients plotted against ‖β(λ)‖₁/‖β(0)‖₁.]

Lasso: β(λ) = argmin_β (1/N) ∑_{i=1}^N (y_i − β_0 − x_i^T β)² + λ‖β‖₁

fit using the lars package in R (Efron, Hastie, Johnstone, Tibshirani 2002)


Lasso and Least-Angle Regression (LAR)

Interesting connection between Lasso and Forward Stepwise.

LAR algorithm: Democratic Forward Stepwise

1. Find the variable X(1) most correlated with the response.

2. While moving towards the least-squares fit on X(1), keep track of the correlations of the other variables with the evolving residual.

3. When X(2) catches up in correlation, include it in the model, and move the pair toward their least-squares fit (correlations stay tied!).

4. And so on.

LAR path = Lasso path (almost always). Forward stepwise goes all the way with each variable, while LAR lets others in when they catch up. This slow learning was inspired by the forward-stagewise approach of boosting.
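A minimal sketch with the lars package mentioned earlier, again on the simulated X and y (an assumption); it computes the full LAR and lasso paths so the two coefficient profiles can be compared.

```r
## Minimal sketch: LAR and lasso paths via the lars package.
library(lars)

lar.fit   <- lars(X, y, type = "lar")     # least-angle regression path
lasso.fit <- lars(X, y, type = "lasso")   # lasso path (LAR plus drop steps)

op <- par(mfrow = c(1, 2))
plot(lar.fit)                             # coefficient profiles along each path
plot(lasso.fit)
par(op)
```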


[Figure 3.14 from ESL (§3.4 Shrinkage Methods): progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.]

[Figure 3.15 from ESL: the left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length; the right panel shows the Lasso profiles. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.]


Relaxed Lasso

Originally proposed by Meinshausen (2006). We present a simplified version.

• Suppose β_λ is the lasso solution at λ, and let A_λ be the active set of indices with nonzero coefficients in β_λ.

• Let β^LS_{A_λ} be the coefficients in the least-squares fit using only the variables in A_λ. Let β^LS_λ be the full-sized version of this coefficient vector, padded with zeros. β^LS_λ debiases the lasso, while maintaining its sparsity.

• Define the Relaxed Lasso

  β^RELAX_λ(γ) = γ · β_λ + (1 − γ) · β^LS_λ

Once β^LS_λ is computed at the desired values of λ, the whole family β^RELAX_λ(γ) comes free of charge!
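A minimal sketch of the γ-blend at a single λ, built from a glmnet fit plus a least-squares refit on the active set; this is an illustrative re-implementation rather than the authors' code, and the λ and γ values are arbitrary. (Recent versions of glmnet also expose a relax argument that computes this family internally.)

```r
## Minimal sketch: relaxed lasso as a blend of the lasso coefficients and the
## least-squares refit on the lasso active set, at one value of lambda.
library(glmnet)

relaxed_coef <- function(X, y, lambda, gamma) {
  fit <- glmnet(X, y, alpha = 1)
  b   <- as.numeric(coef(fit, s = lambda))       # (intercept, beta_lambda)
  A   <- which(b[-1] != 0)                       # active set A_lambda
  bLS <- numeric(length(b))                      # full-length LS version, padded with zeros
  if (length(A) > 0) {
    bLS[c(1, A + 1)] <- coef(lm(y ~ X[, A, drop = FALSE]))
  } else {
    bLS[1] <- mean(y)
  }
  gamma * b + (1 - gamma) * bLS                  # gamma * lasso + (1 - gamma) * LS refit
}

relaxed_coef(X, y, lambda = 0.1, gamma = 0.5)    # X, y from the earlier sketches
```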


Simulation

[Figure: expected prediction error versus subset size (0 to 30) for Best Subset, Forward Stepwise, Lasso, Relaxed Lasso 0 (γ = 0), and Relaxed Lasso 0.5 (γ = 0.5); n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]


Simulation Setup

Y = ∑_{j=1}^p X_j β_j + ε,   X ∼ N_p(0, Σ),   ε ∼ N(0, σ²)

• p = 30, sample size n = 70, and the first s = 5 values of β are 1, the rest are zero.

• Σ is a correlation matrix, with Cov(X_i, X_j) = ρ^{|i−j|} and ρ = 0.35.

• σ² is chosen here to achieve a desired SNR = Var(Xβ)/σ² of 0.71.

• This is equivalent to a percentage variance explained (R²) of 42%, since the population PVE = SNR/(1 + SNR).

• Where appropriate we have a separate validation set of size n, and an infinite test set. (A sketch of this data-generating process follows.)
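A minimal R sketch of this data-generating process; the function name is my own, and σ² is backed out from the target SNR as Var(Xβ)/SNR.

```r
## Minimal sketch of the simulation setup: AR(1)-correlated Gaussian X,
## s nonzero coefficients equal to 1, and noise scaled to a target SNR.
library(MASS)   # for mvrnorm

sim_data <- function(n = 70, p = 30, s = 5, rho = 0.35, snr = 0.71) {
  Sigma  <- rho^abs(outer(1:p, 1:p, "-"))           # Cov(X_i, X_j) = rho^|i-j|
  X      <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta   <- c(rep(1, s), rep(0, p - s))
  sigma2 <- drop(t(beta) %*% Sigma %*% beta) / snr  # so that Var(X beta) / sigma^2 = SNR
  y      <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta, sigma2 = sigma2)
}

train <- sim_data()   # population PVE = SNR / (1 + SNR), about 0.42 here
```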


Degrees of Freedom

We can get some insight into the aggressiveness of the procedures by looking at their degrees of freedom.

Suppose y_i = f(x_i) + ε_i, i = 1, ..., n, and assume Var(ε_i) = σ². Let ŷ_i be the fitted value for observation i, after applying some regression method to the n pairs (x_i, y_i) (e.g. best-subset linear regression of size k, or the lasso with parameter λ).

df = (1/σ²) ∑_{i=1}^n Cov(ŷ_i, y_i)

These covariances are with respect to the sampling distribution of the y_i. The more aggressive the procedure, the more it will overfit the training responses, and hence the higher the covariances and the df.
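The covariance definition can be estimated directly by simulation. Below is a minimal sketch (my own illustration, not the authors' code): with f and X held fixed, the response is regenerated many times and Cov(ŷ_i, y_i) is averaged for a lasso fit at one fixed λ.

```r
## Minimal sketch: Monte Carlo estimate of df = (1/sigma^2) * sum_i Cov(yhat_i, y_i)
## for the lasso at a fixed lambda, with X and f(x) = x' beta held fixed.
library(glmnet)

set.seed(1)
n <- 70; p <- 30; sigma <- 1; lambda <- 0.1
X    <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))
f    <- drop(X %*% beta)                        # fixed mean function f(x_i)

R    <- 500                                     # Monte Carlo replications
Y    <- matrix(f, n, R) + sigma * matrix(rnorm(n * R), n, R)
Yhat <- apply(Y, 2, function(y)
  drop(predict(glmnet(X, y, alpha = 1), newx = X, s = lambda)))

sum(sapply(1:n, function(i) cov(Yhat[i, ], Y[i, ]))) / sigma^2   # estimated df
```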


[Figure: degrees of freedom versus subset size (0 to 30) for Best Subset, Forward Stepwise, Lasso, Relaxed Lasso 0 (γ = 0), and Relaxed Lasso 0.5 (γ = 0.5); n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]


Notable features of previous plot

• Df for the lasso is the size of the active set (Efron et al 2004, Zou et al 2007): shrinkage offsets selection.

• Best subset is the most aggressive, with forward stepwise just behind (in this example). Df can exceed p for BS and FS due to non-convexity (Janson et al 2015, Kaufman & Rosset 2014).

• The Relaxed Lasso is notably less aggressive, in particular β^LS_λ (γ = 0).


Next plots · · ·

• Show results over a range of SNRs

• Averaged over 10 simulations

• For each method, a validation set of the same size as the training set is used to select the best model

• Reported errors are over an infinite test set


[Figure: (Test error)/σ² versus signal-to-noise ratio (0.05 to 6.00) for Best subset, Lasso, Relaxed lasso, and Stepwise; correlation 0.35, coefficient Type 2, n = 70, p = 30, s = 5.]


[Figure: relative risk (to the null model) versus signal-to-noise ratio (0.05 to 6.00) for Best subset, Lasso, Relaxed lasso, and Stepwise; correlation 0.35, coefficient Type 2, n = 70, p = 30, s = 5.]


[Figure: proportion of variance explained versus signal-to-noise ratio (0.05 to 6.00) for Best subset, Lasso, Relaxed lasso, and Stepwise; correlation 0.35, coefficient Type 2, n = 70, p = 30, s = 5.]


[Figure: classification measure F (recovery of the true β) versus signal-to-noise ratio (0.05 to 6.00) for Best subset, Lasso, Relaxed lasso, and Stepwise; correlation 0.35, coefficient Type 2, n = 70, p = 30, s = 5.]


Next plots · · ·

As before, but also

• Different pairwise correlations between variables

• Different patterns of true coefficients

• Different problem sizes n, p.


Timings

Setting                    BS         FS      Lasso   R Lasso
low      n=100,  p=10      3.43       0.006   0.002   0.002
medium   n=500,  p=100     >120 min   0.818   0.009   0.009
high-5   n=50,   p=1000    >126 min   0.137   0.011   0.011
high-10  n=100,  p=1000    >144 min   0.277   0.019   0.021

Average time in seconds to compute one path of solutions for each method, on a Linux cluster.


[Figure: (Test error)/σ² versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 100, p = 10, s = 5.]


[Figure: relative risk (to the null model) versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 100, p = 10, s = 5.]


[Figure: (Test error)/σ² versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 500, p = 100, s = 5.]


[Figure: relative risk (to the null model) versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 500, p = 100, s = 5.]


[Figure: (Test error)/σ² versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 100, p = 1000, s = 10.]


[Figure: relative risk (to the null model) versus signal-to-noise ratio (0.05 to 6.00), across correlations 0, 0.35, 0.7 and coefficient Types 1-3, for Best subset, Lasso, Relaxed lasso, and Stepwise; n = 100, p = 1000, s = 10.]


Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).

• Simulation study to evaluate them all over a wide range of settings.

Conclusions:

• Forward stepwise is very close to best subset, but much faster.

• Relaxed lasso is the overall winner, and the fastest by far.

• In wide-data settings with low SNR, the lasso can beat best subset and forward stepwise.

Paper: https://arxiv.org/abs/1707.08692

R package: https://github.com/ryantibs/best-subset/

Thank you for attending!
