
Variable Selection at Scale

Trevor Hastie, Stanford University

with Ryan Tibshirani and Rob Tibshirani

1 / 32


Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).

• Simulation study to evaluate them all over a wide range of settings.

Conclusions:

• forward stepwise very close to best subset, but much faster.

• relaxed lasso overall winner, and fastest by far.

• In wide-data settings and low SNR, lasso can beat best subset and forward stepwise.

Paper: https://arxiv.org/abs/1707.08692

R package: https://github.com/ryantibs/best-subset/

2 / 32


Best Subset Selection

[Excerpt from The Elements of Statistical Learning, p. 58, Chapter 3: Linear Methods for Regression.]

[Figure 3.5: All possible subset models for the prostate cancer example. At each subset size (x-axis, 0 to 8) is shown the residual sum-of-squares (y-axis) for each model of that size.]

…cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.

3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred: …

Best subset selection, in brief (a sketch follows below):

1. For each subset of size k of the p variables, evaluate the fitting objective (e.g. RSS) via linear regression on the training data.

2. Candidate models β(k) are at the lower frontier — the best for each k on the training data.

3. Pick k using a validation dataset (or CV), and deliver β(k).
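A minimal base-R sketch of this enumeration (an illustrative helper, not the authors' bestsubset package); it is practical only for small p:

## Brute-force best subset: for each size k, enumerate all subsets of the columns
## of X, fit each by least squares, and keep the one with smallest training RSS.
best_subset_enum <- function(X, y, kmax = ncol(X)) {
  p <- ncol(X)
  lapply(1:kmax, function(k) {
    subsets <- combn(p, k, simplify = FALSE)
    rss <- sapply(subsets, function(s)
      sum(residuals(lm(y ~ X[, s, drop = FALSE]))^2))
    best <- subsets[[which.min(rss)]]
    list(k = k, vars = best, rss = min(rss))   # lower-frontier model of size k
  })
}

Step 3 would then compare these kmax candidate models on validation data; the number of fits grows as the sum of C(p, k) over k, which is why the cleverer search strategies on the next slides matter.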

3 / 32

Properties of Best Subset Selection

✓ Well-defined goal — the obvious gold standard for variable selection.

✓ Feasible for least squares regression with p ≈ 35 using clever algorithms (Furnival and Wilson, 1974, “Leaps and Bounds”); see the regsubsets sketch below.

✗ Combinatorially hard for large p.

? Obvious gold standard — really?
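A short sketch using the leaps package, which implements the Furnival-Wilson leaps-and-bounds search (the data objects X, y and the choice nvmax = 10 are assumptions for illustration):

## Exact best-subset search via leaps and bounds; practical up to roughly p ~ 35.
library(leaps)

rs <- regsubsets(x = X, y = y, nvmax = 10, method = "exhaustive")
ss <- summary(rs)
ss$rss                    # training RSS of the best model of each size k
ss$which                  # which variables the size-k winner uses
## In the talk, k is then chosen on a validation set rather than by BIC or Cp.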

4 / 32



Best Subset Selection Breakthrough

Rahul Mazumder, with Bertsimas and King (AoS 2016), crack the forty-year-old best-subset selection bottleneck! They use mixed-integer optimization (MIO) along with the gurobi solver.

minimize_{z, β}   (1/2) Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²

subject to   −M z_j ≤ β_j ≤ M z_j ,   z_j ∈ {0, 1},   j = 1, …, p,
             Σ_{j=1}^p z_j ≤ k.

Their procedure iteratively narrows the optimality gap — if the gap hits zero, they have found the solution.
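As a rough illustration (not the authors' code) of how this MIO maps onto a solver, here is a sketch assuming the gurobi R package and a Gurobi license are available; the bound M, the time limit, and the variable ordering (β0, β, z) are all assumptions:

## Best subset of size <= k via the big-M mixed-integer formulation above.
library(gurobi)

best_subset_mio <- function(X, y, k, M = 10, time_limit = 60) {
  n <- nrow(X); p <- ncol(X)
  Xt <- cbind(1, X)                        # design with intercept column
  nv <- 1 + 2 * p                          # variables: beta0, beta_1..p, z_1..p

  ## Quadratic objective (1/2)||y - Xt b||^2, dropping the constant y'y/2
  Q <- matrix(0, nv, nv)
  Q[1:(p + 1), 1:(p + 1)] <- 0.5 * crossprod(Xt)
  obj <- c(-drop(crossprod(Xt, y)), rep(0, p))

  ## Constraints: beta_j - M z_j <= 0, -beta_j - M z_j <= 0, sum(z) <= k
  A <- matrix(0, 2 * p + 1, nv)
  for (j in 1:p) {
    A[j, 1 + j] <- 1;      A[j, 1 + p + j] <- -M
    A[p + j, 1 + j] <- -1; A[p + j, 1 + p + j] <- -M
  }
  A[2 * p + 1, (p + 2):nv] <- 1

  model <- list(Q = Q, obj = obj, A = A,
                rhs = c(rep(0, 2 * p), k),
                sense = rep("<", 2 * p + 1),
                lb = c(rep(-Inf, p + 1), rep(0, p)),
                vtype = c(rep("C", p + 1), rep("B", p)),
                modelsense = "min")
  res <- gurobi(model, list(TimeLimit = time_limit, OutputFlag = 0))
  list(beta0 = res$x[1], beta = res$x[2:(p + 1)],
       subset = which(res$x[(p + 2):nv] > 0.5))
}

In practice M must be large enough not to cut off the solution (Bertsimas, King and Mazumder discuss data-driven choices and warm starts); if the solver stops at the time limit, its reported optimality gap tells you how far the incumbent could be from the true best subset.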

5 / 32


Forward Stepwise Selection

Greedy forward algorithm, traditionally thought of as a sub-optimal but feasible alternative to best-subset regression.

1. Start with null model (response mean).

2. Choose among the p variables to find the best single-variable model in terms of the fitting objective.

3. Choose among the remaining p − 1 variables to find the one that, when included with the previously chosen variable, best improves the fitting objective.

4. Choose among the remaining p − 2 …, and so on.

Forward stepwise produces a nested sequence of models β(k), k = 1, 2, …. Pick k using a validation dataset. (A sketch follows below.)
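A minimal base-R sketch of this greedy scheme (naive refitting at each step, not the QR-updating implementation mentioned on the next slide; the function name is illustrative):

## Greedy forward stepwise with squared-error loss, producing a nested path.
forward_stepwise <- function(X, y, kmax = min(nrow(X) - 1, ncol(X))) {
  p <- ncol(X); active <- integer(0); path <- vector("list", kmax)
  for (k in 1:kmax) {
    candidates <- setdiff(1:p, active)
    rss <- sapply(candidates, function(j)
      sum(residuals(lm(y ~ X[, c(active, j), drop = FALSE]))^2))
    active <- c(active, candidates[which.min(rss)])   # add the best improver
    path[[k]] <- active                               # nested model beta^(k)
  }
  path
}

Each refit above is done from scratch; the O(n p min(n, p)) cost quoted on the next slide comes from updating a QR decomposition instead, so that the whole path costs no more than one full least-squares fit.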

6 / 32

Forward Stepwise Selection Properties

✓ Computationally feasible with big data, and also works with n ≪ p.

✓ Efficient computations with squared-error loss. Computations can be arranged as a guided QR decomposition of the X matrix, and hence cost the same as a full least-squares fit, O(n p min(n, p)).

✓ Performance very similar to best subset selection, although difficult counterexamples can be constructed.

✗ Efficiency not available for GLMs, although score approximations can be used.

✗ Tedious with very large p and n, since terms are added one at a time.

7 / 32


Lasso

The lasso (Tibshirani, 1996) solves

minimize_β   (1/2) Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   s.t.   ‖β‖₁ ≤ t

Generally, the smaller t, the sparser the solutions, and approximate nesting occurs. We compute many solutions over a range of values of t, and select t using validation data.

Often thought of as a convex relaxation for the best-subset problem

minimize_β   (1/2) Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   s.t.   ‖β‖₀ ≤ k

8 / 32


Lasso properties

We typically solve lasso in Lagrange form

minimize_β   (1/2) Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ‖β‖₁

✓ Extremely fast algorithms for solving lasso problems (with many loss functions). Pathwise coordinate descent via glmnet (Friedman, H, Tibshirani, 2010) exploits sparsity, active-set convergence, strong rules, and more, to rapidly compute the entire solution path on a grid of values of λ (see the sketch below).

✓ With large p, provides convenient subset selection, taking leaps rather than single steps.

✗ Since coefficients are both selected and regularized, can suffer from shrinkage bias.
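A minimal sketch of this workflow: fit the path with glmnet, then choose λ by validation-set prediction error (Xval and yval are assumed held-out data, mirroring the simulation protocol rather than any one package function):

## Lasso path plus validation-set selection of lambda.
library(glmnet)

fit  <- glmnet(X, y)                      # entire solution path over a lambda grid
pred <- predict(fit, newx = Xval)         # n_val x n_lambda matrix of predictions
mse  <- colMeans((yval - pred)^2)         # validation error at each lambda
lam  <- fit$lambda[which.min(mse)]        # chosen tuning parameter
beta <- coef(fit, s = lam)                # sparse coefficients at the chosen lambda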

9 / 32


[Figure: “Lasso Coefficient Path”, standardized coefficients plotted against ‖β(λ)‖₁/‖β(0)‖₁.]

Lasso: β(λ) = argmin_β (1/N) Σ_{i=1}^N ( y_i − β_0 − x_i^T β )² + λ‖β‖₁

Fit using the lars package in R (Efron, Hastie, Johnstone, Tibshirani 2002).
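A short sketch of producing such a path plot with lars (X and y are assumed to be in memory):

## Lasso path computed by LARS; plot() shows coefficient profiles against the
## L1-norm fraction, as in the figure above.
library(lars)

path <- lars(as.matrix(X), y, type = "lasso")
plot(path)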

10 / 32

Lasso and Least-Angle Regression (LAR)

Interesting connection between Lasso and Forward Stepwise.

LAR algorithm: Democratic Forward Stepwise

1. Find variable X(1) most correlated with the response.

2. While moving towards the least-squares fit on X(1), keep track of the correlations of the other variables with the evolving residual.

3. When X(2) catches up in correlation, include it in the model, and move the pair toward their least-squares fit (the correlations stay tied!).

4. And so on.

LAR path = Lasso path (almost always). Forward stepwise goes all the way with each variable, while LAR lets others in when they catch up. This slow learning was inspired by the forward-stagewise approach of boosting. (A small comparison sketch follows.)
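A small sketch contrasting the two paths with the lars package (purely illustrative; X and y are assumed):

## The LAR and lasso paths agree until a coefficient would cross zero, where the
## lasso modification drops that variable from the active set (cf. next slide).
library(lars)

lar_path   <- lars(as.matrix(X), y, type = "lar")
lasso_path <- lars(as.matrix(X), y, type = "lasso")
par(mfrow = c(1, 2))
plot(lar_path)     # Least Angle Regression profiles
plot(lasso_path)   # Lasso profiles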

11 / 32

[Excerpt from The Elements of Statistical Learning, p. 75, Section 3.4: Shrinkage Methods.]

[Figure 3.14: Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.]

[Figure 3.15: The left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the Lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.]

12 / 32

Relaxed Lasso

Originally proposed by Meinshausen (2006). We present a simplified version.

• Suppose β_λ is the lasso solution at λ, and let A_λ be the active set of indices with nonzero coefficients in β_λ.

• Let β^LS_{A_λ} be the coefficients in the least-squares fit using only the variables in A_λ. Let β^LS_λ be the full-sized version of this coefficient vector, padded with zeros.

β^LS_λ debiases the lasso, while maintaining its sparsity.

• Define the Relaxed Lasso

β^RELAX_λ(γ) = γ · β_λ + (1 − γ) · β^LS_λ

Once β^LS_λ is computed at the desired values of λ, the whole family β^RELAX_λ(γ) comes free of charge! (A sketch follows below.)
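A minimal sketch of this construction, using glmnet for the lasso fit and an ordinary least-squares refit on the active set (the function name and the default γ = 0.5 are illustrative):

## Simplified relaxed lasso: blend the lasso solution at lambda with the
## least-squares refit on its active set A_lambda.
library(glmnet)

relaxed_lasso <- function(X, y, lambda, gamma = 0.5) {
  fit     <- glmnet(X, y)
  b_lasso <- as.numeric(coef(fit, s = lambda))       # (intercept, beta_lambda)
  active  <- which(b_lasso[-1] != 0)                 # A_lambda
  b_ls    <- numeric(length(b_lasso))
  if (length(active) > 0) {
    ols <- lm(y ~ X[, active, drop = FALSE])
    b_ls[1]          <- coef(ols)[1]                 # refit intercept
    b_ls[1 + active] <- coef(ols)[-1]                # padded with zeros elsewhere
  } else {
    b_ls[1] <- mean(y)                               # null model
  }
  gamma * b_lasso + (1 - gamma) * b_ls               # beta^RELAX_lambda(gamma)
}

Recent versions of glmnet also offer a relax = TRUE argument that computes this family directly, with γ then selectable by cross-validation.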

13 / 32


Simulation

[Figure: Expected prediction error versus subset size for Best Subset, Forward Stepwise, Lasso, Relaxed Lasso 0 and Relaxed Lasso 0.5; n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]

14 / 32

Simulation Setup

Y = Σ_{j=1}^p X_j β_j + ε,    X ∼ N_p(0, Σ),    ε ∼ N(0, σ²)

• p = 30, sample size n = 70, and the first s = 5 values of β are 1, the rest are zero.

• Σ is a correlation matrix, with Cov(X_i, X_j) = ρ^|i−j|, and ρ = 0.35.

• σ² is chosen here to achieve a desired SNR = Var(Xβ)/σ² of 0.71.

• This is equivalent to a percentage variance explained (R²) of 42%, since the population PVE = SNR/(1 + SNR).

• Where appropriate we have a separate validation set of size n, and an infinite test set. (A data-generating sketch follows below.)
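A minimal sketch of this data-generating process (not the authors' simulation code; the function name is illustrative):

## AR(1)-correlated Gaussian X, s leading coefficients equal to 1, and noise
## variance set to hit a target SNR = Var(X beta) / sigma^2.
library(MASS)   # for mvrnorm

sim_data <- function(n = 70, p = 30, s = 5, rho = 0.35, snr = 0.71) {
  Sigma  <- rho^abs(outer(1:p, 1:p, "-"))          # Cov(X_i, X_j) = rho^|i-j|
  beta   <- c(rep(1, s), rep(0, p - s))
  sigma2 <- drop(t(beta) %*% Sigma %*% beta) / snr # population Var(X beta) / SNR
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta, sigma2 = sigma2)
}

With these defaults the population PVE is SNR/(1 + SNR) ≈ 0.42, matching the slide.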

15 / 32



Degrees of Freedom

We can get some insight into the aggressiveness of the procedures by looking at their degrees of freedom.

Suppose y_i = f(x_i) + ε_i, i = 1, …, n, and assume Var(ε_i) = σ². Let ŷ_i be the fitted value for observation i, after applying some regression method to the n pairs (x_i, y_i) (e.g. best-subset linear regression of size k, or the lasso with parameter λ).

df = (1/σ²) Σ_{i=1}^n Cov(ŷ_i, y_i)

These covariances are with respect to the sampling distribution of the y_i. The more aggressive the procedure, the more it will overfit the training responses, and hence the higher the covariances and df. (A Monte Carlo sketch follows below.)
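A minimal Monte Carlo sketch of this definition (the helper name, the fitter interface, and the replication count are assumptions for illustration):

## Estimate df by simulation: draw many responses from the same f and sigma,
## refit each time, and average Cov(yhat_i, y_i) over observations.
estimate_df <- function(fitter, X, f, sigma, nrep = 500) {
  n <- nrow(X)
  Y    <- matrix(0, n, nrep)
  Yhat <- matrix(0, n, nrep)
  for (r in 1:nrep) {
    y <- f + rnorm(n, sd = sigma)      # y_i = f(x_i) + eps_i
    Y[, r]    <- y
    Yhat[, r] <- fitter(X, y)          # fitted values from the chosen procedure
  }
  ## df = (1/sigma^2) * sum_i Cov(yhat_i, y_i), covariances over the nrep draws
  sum(sapply(1:n, function(i) cov(Yhat[i, ], Y[i, ]))) / sigma^2
}

As a sanity check, plugging in fitter = function(X, y) fitted(lm(y ~ X)) should return approximately p + 1, the classical degrees of freedom of a full least-squares fit with intercept.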

16 / 32


[Figure: Degrees of freedom versus subset size for Best Subset, Forward Stepwise, Lasso, Relaxed Lasso 0 and Relaxed Lasso 0.5; n = 70, p = 30, s = 5, SNR = 0.71, PVE = 0.42.]

17 / 32

Notable features of previous plot

• Df for the lasso is the size of the active set (Efron et al 2004, Zou et al 2007) — shrinkage offsets selection.

• Best subset most aggressive, with forward stepwise just behind (in this example). Df can exceed p for BS and FS due to non-convexity (Janson et al 2015, Kaufman & Rosset 2014).

• Relaxed lasso notably less aggressive, in particular β^LS_λ (γ = 0).

18 / 32

Next plots · · ·

• Show results over a range of SNRs

• Averaged over 10 simulations

• For each method, a validation set of the same size as the training set is used to select the best model

• Reported errors are over infinite test set

19 / 32

[Figure: (Test error)/σ² versus signal-to-noise ratio (0.05 to 6.00) for Best subset, Lasso, Relaxed lasso and Stepwise; correlation 0.35, Type 2, n = 70, p = 30, s = 5.]

20 / 32

[Figure: Relative risk (to the null model) versus signal-to-noise ratio for Best subset, Lasso, Relaxed lasso and Stepwise; correlation 0.35, Type 2, n = 70, p = 30, s = 5.]

21 / 32

[Figure: Proportion of variance explained versus signal-to-noise ratio for Best subset, Lasso, Relaxed lasso and Stepwise; correlation 0.35, Type 2, n = 70, p = 30, s = 5.]

22 / 32

[Figure: Classification measure F (recovery of the true β) versus signal-to-noise ratio for Best subset, Lasso, Relaxed lasso and Stepwise; correlation 0.35, Type 2, n = 70, p = 30, s = 5.]

23 / 32

Next plots · · ·

As before, but also

• Different pairwise correlations between variables

• Different patterns of true coefficients

• Different problem sizes n, p.

24 / 32

Timings

Setting                      BS          FS      Lasso    Relaxed Lasso
low      (n=100,  p=10)      3.43        0.006   0.002    0.002
medium   (n=500,  p=100)     >120 min    0.818   0.009    0.009
high-5   (n=50,   p=1000)    >126 min    0.137   0.011    0.011
high-10  (n=100,  p=1000)    >144 min    0.277   0.019    0.021

Average time in seconds (except where marked in minutes) to compute one path of solutions for each method, on a Linux cluster.

25 / 32

[Figure: (Test error)/σ² versus signal-to-noise ratio, in a 3×3 grid of panels (correlations 0, 0.35, 0.7 by coefficient Types 1, 2, 3), for Best subset, Lasso, Relaxed lasso and Stepwise; n = 100, p = 10, s = 5.]

26 / 32

[Figure: Relative risk (to the null model) versus signal-to-noise ratio, same 3×3 grid of correlations and coefficient types; n = 100, p = 10, s = 5.]

27 / 32

[Figure: (Test error)/σ² versus signal-to-noise ratio, same 3×3 grid; n = 500, p = 100, s = 5.]

28 / 32

[Figure: Relative risk (to the null model) versus signal-to-noise ratio, same 3×3 grid; n = 500, p = 100, s = 5.]

29 / 32

[Figure: (Test error)/σ² versus signal-to-noise ratio, same 3×3 grid; n = 100, p = 1000, s = 10.]

30 / 32

[Figure: Relative risk (to the null model) versus signal-to-noise ratio, same 3×3 grid; n = 100, p = 1000, s = 10.]

31 / 32

Outline and Summary

We consider linear regression models η(X) = X^T β with potentially very large numbers of variables, and methods for selecting an informative subset.

• Revisit two baby boomers (best-subset selection and forward-stepwise selection), one millennial (lasso) and a newborn (relaxed lasso).

• Simulation study to evaluate them all over a wide range of settings.

Conclusions:

• forward stepwise very close to best subset, but much faster.

• relaxed lasso overall winner, and fastest by far.

• In wide-data settings and low SNR, lasso can beat best subset and forward stepwise.

Paper: https://arxiv.org/abs/1707.08692

R package: https://github.com/ryantibs/best-subset/

Thank you for attending!

32 / 32
