
Transcript of “The Lasso and related penalization methods” - Stanford University (statweb.stanford.edu/~tibs/sta305files/lasso.pdf)

  • 1

    The Lasso and related penalization methods

    Basic lasso, related approaches, generalizations, novel applications

  • Lasso and related methods 3

    Linear regression via the Lasso (Tibshirani, 1995)

    • Outcome variable yi, for cases i = 1, 2, . . . , n; features xij, j = 1, 2, . . . , p

    • Minimize $\sum_{i=1}^{n}\big(y_i - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j |\beta_j|$

    • Equivalent to minimizing the sum of squares subject to the constraint $\sum_j |\beta_j| \le s$.

    • Similar to ridge regression, which uses the constraint $\sum_j \beta_j^2 \le t$.

    • Lasso does variable selection and shrinkage, while ridge only shrinks (a small illustration follows this list).

    • See also “Basis Pursuit” (Chen, Donoho and Saunders, 1998).
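
    A minimal R sketch of this selection-vs-shrinkage contrast, using the glmnet package that appears later in these slides; the simulated data (n = 100, p = 20, five truly non-zero coefficients) and the value s = 0.1 passed to coef() are illustrative assumptions, not from the slides.

      # simulate data in which only the first 5 of 20 predictors matter
      set.seed(1)
      n <- 100; p <- 20
      x <- matrix(rnorm(n * p), n, p)
      y <- drop(x %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)

      library(glmnet)
      lasso <- glmnet(x, y, alpha = 1)   # lasso: L1 penalty
      ridge <- glmnet(x, y, alpha = 0)   # ridge: squared (L2) penalty
      # at a comparable penalty level the lasso sets many coefficients exactly to zero,
      # while ridge only shrinks them towards zero
      sum(coef(lasso, s = 0.1)[-1] == 0)   # many exact zeros
      sum(coef(ridge, s = 0.1)[-1] == 0)   # typically none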

  • Lasso and related methods 4

    Picture of Lasso and Ridge regression

    [Figure: the lasso constraint region |β1| + |β2| ≤ s and the ridge constraint region β1² + β2² ≤ t in the (β1, β2) plane, together with the least squares estimate β̂.]

  • Lasso and related methods 5

    Example: Prostate Cancer Data

    yi = log (PSA), xij measurements on a man and his prostate

    [Figure: lasso coefficient profiles plotted against the shrinkage factor s (0 to 1) for the eight predictors lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45.]

  • Lasso and related methods 6

    Algorithms for the lasso

    • Standard convex optimizer

    • Least angle regression (LAR) - Efron et al. 2004 - computes the entire path of solutions. State of the art until 2008; see the LARS section in this course.

    • Pathwise coordinate descent - new

  • Lasso and related methods 7

    Pathwise coordinate descent for the lasso

    • Coordinate descent: optimize one parameter (coordinate) at a time.

    • How? Suppose we had only one predictor. The problem is to minimize $\sum_i (y_i - x_i\beta)^2 + \lambda|\beta|$

    • The solution is the soft-thresholded estimate $\mathrm{sign}(\hat\beta)\,(|\hat\beta| - \lambda)_+$, where $\hat\beta$ is the usual least squares estimate (see the sketch after this list).

    • Idea: with multiple predictors, cycle through each predictor in turn. We compute the partial residuals $r_i = y_i - \sum_{j \ne k} x_{ij}\hat\beta_j$ and apply univariate soft-thresholding, pretending that our data are $(x_{ik}, r_i)$.
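
    A minimal R sketch of this soft-thresholding step. It assumes the single predictor is standardized so that sum(x^2) = 1 and writes the squared-error term with a factor 1/2, so that the penalized solution is exactly sign(beta_hat) * (|beta_hat| − λ)+; the simulated data and the value of λ are illustrative.

      # soft-threshold operator S(z, lambda) = sign(z) * (|z| - lambda)_+
      soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

      set.seed(1)
      n <- 50
      x <- rnorm(n)
      x <- x / sqrt(sum(x^2))      # standardize so that sum(x^2) = 1
      y <- 3 * x + rnorm(n)

      # minimize (1/2) * sum((y - x * b)^2) + lambda * |b|
      beta_ols   <- sum(x * y)     # least squares estimate for a single standardized predictor
      beta_lasso <- soft_threshold(beta_ols, lambda = 1)
      c(ols = beta_ols, lasso = beta_lasso)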


  • Lasso and related methods 8

    • It turns out that this is coordinate descent for the lasso criterion $\sum_i \big(y_i - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j |\beta_j|$

    • Like skiing to the bottom of a hill, going north-south, east-west, north-south, etc.

    • Too simple?!

    Demonstration:
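
    In place of the live demonstration, here is a minimal R sketch of the cycling idea above: sweep over the coordinates, and for each one soft-threshold the inner product of that predictor with its partial residual. It assumes centred y, columns of x standardized so that sum(x[, k]^2) = 1, and the criterion written as (1/2) * RSS + λ * Σ|βj|; the data, the value of λ and the fixed number of passes are illustrative.

      soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

      # coordinate descent for (1/2) * sum((y - x %*% beta)^2) + lambda * sum(abs(beta)),
      # with each column of x scaled so that sum(x[, k]^2) == 1
      lasso_cd <- function(x, y, lambda, n_passes = 100) {
        beta <- rep(0, ncol(x))
        for (pass in seq_len(n_passes)) {
          for (k in seq_len(ncol(x))) {
            r_k     <- y - x[, -k, drop = FALSE] %*% beta[-k]   # partial residual for predictor k
            beta[k] <- soft_threshold(sum(x[, k] * r_k), lambda)
          }
        }
        beta
      }

      set.seed(1)
      n <- 100; p <- 10
      x <- scale(matrix(rnorm(n * p), n, p)) / sqrt(n - 1)   # columns: mean 0, sum of squares 1
      y <- drop(x %*% c(2, -2, rep(0, p - 2))) + 0.1 * rnorm(n)
      y <- y - mean(y)                                       # centre y (no intercept)
      round(lasso_cd(x, y, lambda = 0.5), 3)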

  • Lasso and related methods 9

    A brief history of coordinate descent for the lasso

    1997 Tibshirani’s student Wenjiang Fu at U. Toronto develops the “shooting algorithm” for the lasso. Tibshirani doesn’t fully appreciate it.

    2002 Ingrid Daubechies gives a talk at Stanford, describes a one-at-a-time algorithm for the lasso. Hastie implements it, makes an error, and Hastie + Tibshirani conclude that the method doesn’t work.

    2006 Friedman is external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses coordinate descent for the elastic net. Friedman, Hastie + Tibshirani revisit this problem. Others have too: Krishnapuram and Hartemink (2005), Genkin, Lewis and Madigan (2007), Wu and Lange (2008), Meier, van de Geer and Buehlmann (2008).

  • Lasso and related methods 10

    Pathwise coordinate descent for the lasso

    • Start with a large value of λ (a very sparse model) and slowly decrease it.

    • Most coordinates that are zero never become non-zero.

    • The coordinate descent code for the lasso is just 73 lines of Fortran! (A sketch of the pathwise loop follows this list.)
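
    A minimal R sketch of the pathwise strategy: start at the smallest λ for which all coefficients are zero, decrease λ over a grid, and warm-start each solution from the previous one. It reuses the coordinate-descent update from the earlier sketch, with the same standardization assumptions; the grid of 100 λ values down to 0.01 · λmax and the fixed number of passes are illustrative choices (this is not the 73-line Fortran code).

      soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

      # one sweep of coordinate updates, starting from the warm start `beta`
      cd_pass <- function(x, y, beta, lambda) {
        for (k in seq_len(ncol(x))) {
          r_k     <- y - x[, -k, drop = FALSE] %*% beta[-k]
          beta[k] <- soft_threshold(sum(x[, k] * r_k), lambda)
        }
        beta
      }

      lasso_path <- function(x, y, n_lambda = 100, n_passes = 50) {
        lambda_max <- max(abs(crossprod(x, y)))     # smallest lambda giving the all-zero solution
        lambdas    <- exp(seq(log(lambda_max), log(0.01 * lambda_max), length.out = n_lambda))
        beta <- rep(0, ncol(x))                     # warm start for the first (largest) lambda
        path <- matrix(0, n_lambda, ncol(x))
        for (m in seq_along(lambdas)) {
          for (pass in seq_len(n_passes)) beta <- cd_pass(x, y, beta, lambdas[m])
          path[m, ] <- beta                         # most coordinates that are zero never become non-zero
        }
        path
      }

      # path <- lasso_path(x, y)   # x, y as in the earlier coordinate-descent sketch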

  • Lasso and related methods 11

    Extensions

    • Pathwise coordinate descent can be generalized to many other models: logistic/multinomial for classification, the graphical lasso for undirected graphs, the fused lasso for signals.

    • Its speed and simplicity are quite remarkable.

    • glmnet R package now available on CRAN

  • Lasso and related methods 12

    Logistic regression

    • Outcome Y = 0 or 1; logistic regression model

    $\log\!\left(\frac{\Pr(Y = 1)}{1 - \Pr(Y = 1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$

    • The criterion is the binomial log-likelihood plus an absolute-value (L1) penalty (see the sketch below).
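
    A minimal R sketch of this penalized logistic regression, using the glmnet package from the previous slide; the simulated data and the use of cv.glmnet to choose λ are illustrative assumptions.

      library(glmnet)

      set.seed(1)
      n <- 200; p <- 50
      x   <- matrix(rnorm(n * p), n, p)
      eta <- x[, 1] - x[, 2]                       # only two informative predictors
      y   <- rbinom(n, 1, 1 / (1 + exp(-eta)))     # binary 0/1 outcome

      fit   <- glmnet(x, y, family = "binomial", alpha = 1)   # binomial log-likelihood + L1 penalty
      cvfit <- cv.glmnet(x, y, family = "binomial")            # ten-fold CV over the lambda path
      coef(cvfit, s = "lambda.min")                             # sparse coefficients at the chosen lambda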

  • Lasso and related methods 13

    Speed Trials

    Competitors:

    lars As implemented in the R package, for squared-error loss.

    glmnet Fortran-based R package using coordinate descent — the topic of this talk. Does squared error and logistic (2- and K-class).

    l1logreg Lasso-logistic regression package by Koh, Kim and Boyd, using state-of-the-art interior point methods for convex optimization.

    BBR/BMR Bayesian binomial/multinomial regression package by Genkin, Lewis and Madigan. Also uses coordinate descent to compute the posterior mode with a Laplace prior — the lasso fit.

    Based on simulations (next 3 slides) and real data (4th slide).

  • Lasso and related methods 17

    Logistic Regression — Real Datasets

    Name          Type      N       p        glmnet    l1logreg  BBR/BMR
    Dense
      Cancer      14 class  144     16,063   2.5 mins  NA        2.1 hrs
      Leukemia    2 class   72      3571     2.50      55.0      450
    Sparse
      Internet ad 2 class   2359    1430     5.0       20.9      34.7
      Newsgroup   2 class   11,314  777,811  2 mins              3.5 hrs

    Timings in seconds (unless stated otherwise). For Cancer, Leukemia and Internet ad, times are for ten-fold cross-validation over 100 λ values; for Newsgroup we performed a single run with 100 values of λ, with λmin = 0.05 λmax.

  • Lasso and related methods 19

    Generalizations of the lasso

    • The Elastic net (Zou and Hastie 2005)

    $\min_\beta \ \tfrac{1}{2}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2/2 \qquad (1)$

    The squared norm tends to average the coefficients of correlated predictors, while the L1 norm chooses among the averaged groups. When p > n, the number of non-zero coefficients can exceed n, unlike the lasso.

    Form of the penalty used in glmnet: $\sum_{j=1}^{p}\big[\tfrac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j|\big]$ (see the sketch below).
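
    A minimal R sketch relating the glmnet mixing parameter α to the penalty above (α = 1 gives the lasso, α = 0 ridge); the simulated p > n data and the choice α = 0.4 are illustrative.

      library(glmnet)

      set.seed(1)
      n <- 80; p <- 200                    # p > n, as discussed above
      x <- matrix(rnorm(n * p), n, p)
      y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

      enet  <- glmnet(x, y, alpha = 0.4)   # penalty: lambda * sum( (1 - alpha)/2 * beta_j^2 + alpha * |beta_j| )
      lasso <- glmnet(x, y, alpha = 1)     # pure lasso
      # number of non-zero coefficients along each path; with p > n the elastic net
      # can keep more than n variables in the model, unlike the lasso
      c(enet = max(enet$df), lasso = max(lasso$df))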

  • Lasso and related methods 20

    [Figure: coefficient paths (coefficients vs. step) for the Lasso, the Elastic Net (α = 0.4), and Ridge Regression. Leukemia data, logistic, N = 72, p = 3571; first 10 steps shown.]

  • Lasso and related methods 21

    Grouped Lasso

    (Yuan and Lin 2005). Here the variables occur in groups (such as dummy variables for multi-level factors). Suppose Xj is an N × pj orthonormal matrix that represents the jth group of pj variables, j = 1, . . . , m, and βj is the corresponding coefficient vector. The grouped lasso solves

    $\min_\beta \ \Big\|y - \sum_{j=1}^{m} X_j\beta_j\Big\|_2^2 + \sum_{j=1}^{m} \lambda_j \|\beta_j\|_2, \qquad (2)$

    where $\lambda_j = \lambda\sqrt{p_j}$. The 2-norm shrinks all coefficients in a group down to zero together. Coordinate descent can be applied in both models (see the sketch below).
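
    A minimal R sketch of the blockwise update behind (2): with an orthonormal Xj, minimizing over the group's coefficients soft-thresholds the length of sj = Xjᵀ rj (rj the partial residual for group j) toward zero. It writes the squared-error term with a factor 1/2, so it matches (2) only up to a rescaling of the λj; the two groups, the noise level and λ = 0.5 are illustrative assumptions.

      # block update for group j: beta_j <- (1 - lambda_j / ||s_j||)_+ * s_j, where s_j = t(X_j) %*% r_j
      group_update <- function(Xj, r_j, lambda_j) {
        s_j <- drop(crossprod(Xj, r_j))
        max(0, 1 - lambda_j / sqrt(sum(s_j^2))) * s_j
      }

      # one cycle of blockwise coordinate descent over the groups, with lambda_j = lambda * sqrt(p_j)
      group_lasso_pass <- function(X_list, y, beta_list, lambda) {
        for (j in seq_along(X_list)) {
          fitted_rest <- Reduce(`+`, Map(`%*%`, X_list[-j], beta_list[-j]))
          r_j <- y - drop(fitted_rest)                           # partial residual for group j
          beta_list[[j]] <- group_update(X_list[[j]], r_j,
                                         lambda * sqrt(ncol(X_list[[j]])))
        }
        beta_list
      }

      # two groups with orthonormal columns (via QR); only the first group is truly active
      set.seed(1)
      n  <- 100
      X1 <- qr.Q(qr(matrix(rnorm(n * 3), n, 3)))   # N x 3, orthonormal
      X2 <- qr.Q(qr(matrix(rnorm(n * 2), n, 2)))   # N x 2, orthonormal
      y  <- drop(X1 %*% c(2, -1, 0)) + 0.3 * rnorm(n)
      y  <- y - mean(y)

      beta <- list(rep(0, 3), rep(0, 2))
      for (pass in 1:50) beta <- group_lasso_pass(list(X1, X2), y, beta, lambda = 0.5)
      beta   # the second, inactive group is typically set to zero as a whole block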