
Transcript
  • Lecture 12: Variable Selection - Beyond LASSO

    Hao Helen Zhang


  • Extension of LASSO

    To achieve oracle properties,

    Lq penalty with 0 < q < 1; SCAD penalty (Fan and Li 2001; Zhang et al. 2007)

    Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007), with penalty

    $$J_\lambda(\beta) = \lambda \sum_{j=1}^{p} w_j |\beta_j|.$$

    Nonnegative garotte (Breiman, 1995)

    To handle correlated predictors,

    Elastic net (Zou and Hastie 2005), with penalty

    $$J_\lambda(\beta) = \sum_{j=1}^{p} \left[\lambda|\beta_j| + (1-\lambda)\beta_j^2\right], \qquad \lambda \in [0,1].$$

    OSCAR (Bondell and Reich, 2006)

    Adaptive elastic net (Zou and Zhang 2009)


  • For group variable selection,

    group LASSO (Yuan and Lin 2006); supnorm penalty (Zhang et al. 2008)

    For nonparametric (nonlinear) models,

    LBP (Zhang et al. 2004); COSSO (Lin and Zhang 2006; Zhang 2006; Zhang and Lin 2006); group-lasso-based methods


  • Various Penalty Functions

    [Figure: four panels plotting penalty functions p(|x|) against x — the L_0 penalty, the L_0.5 penalty, the L_1 penalty, and the SCAD penalty (λ = 0.8, a = 3).]


  • About Lq Penalty Methods

    q = 0 corresponds to variable selection: the penalty is the number of nonzero coefficients, i.e., the model size.

    For q ≤ 1, more weight is placed on the coordinate directions; q = 1 is the smallest q for which the constraint region is convex.

    q can be fractions. Maybe estimate q from the data?


  • Motivations:

    LASSO imposes the same penalty on all coefficients.

    Ideally, a small penalty should be applied to large coefficients (for small bias), while a large penalty should be applied to small coefficients (for better sparsity).

    LASSO is not an oracle procedure

    Methods with oracle properties include:

    SCAD, weighted lasso, adaptive lasso, nonnegative garotte, etc.


  • What is A Good Penalty

    A good penalty function should lead to an estimator that is

    "nearly" unbiased;

    sparse: small coefficients are thresholded to zero (reduced model complexity);

    continuous in the data (robust against small perturbations).

    How to make these happen? (Fan and Li 2002)

    A sufficient condition for unbiasedness: J′(|β|) = 0 for large |β| (the penalty is bounded by a constant).

    A necessary and sufficient condition for sparsity: J(|β|) is singular at 0.

    Most of the oracle penalties are concave, so they suffer numerically from multiple local minima.


  • Smoothly Clipped Absolute Deviation Penalty

    $$p_\lambda(\beta) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda,\\[4pt] -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta| \le a\lambda,\\[4pt] \dfrac{(a+1)\lambda^2}{2} & \text{if } |\beta| > a\lambda, \end{cases}$$

    where a > 2 and λ > 0 are tuning parameters.

    A quadratic spline function with two knots at λ and aλ.

    singular at the origin; continuous first-order derivative.

    constant penalty on large coefficients; not convex
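    As a concrete reference point, here is a minimal NumPy sketch of the SCAD penalty above; the function name scad_penalty and the default values are illustrative, not from the slides.

```python
import numpy as np

def scad_penalty(beta, lam=0.8, a=3.0):
    """Evaluate the SCAD penalty p_lambda(beta) elementwise (lam > 0, a > 2)."""
    b = np.abs(np.atleast_1d(np.asarray(beta, dtype=float)))
    out = np.empty_like(b)
    small = b <= lam                       # linear (lasso-like) part
    mid = (b > lam) & (b <= a * lam)       # quadratic transition part
    large = b > a * lam                    # constant part (no further penalty growth)
    out[small] = lam * b[small]
    out[mid] = -(b[mid] ** 2 - 2 * a * lam * b[mid] + lam ** 2) / (2 * (a - 1))
    out[large] = (a + 1) * lam ** 2 / 2
    return out

# scad_penalty(np.linspace(-4, 4, 101)) traces the SCAD curve shown in the
# earlier figure (lambda = 0.8, a = 3).
```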


  • SCAD Penalty

    [Figure: plot of the SCAD penalty L(w) against w.]


  • Computational Issues

    The SCAD penalty is not convex

    Local Quadratic Approximation (LQA; Fan and Li, 2001)

    Local Linear Approximation (LLA; Zou and Li, 2008)
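    To make the LLA idea concrete: each LLA step replaces the SCAD penalty by its tangent line at the current estimate, which turns the nonconvex problem into a weighted lasso. The sketch below is illustrative only; it assumes the weighted subproblem is solved with scikit-learn's Lasso via the column-rescaling trick described later for the adaptive lasso, and the names scad_derivative and lla_step are mine.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) for the SCAD penalty, evaluated at t >= 0."""
    t = np.abs(t)
    return lam * (t <= lam) + (np.maximum(a * lam - t, 0.0) / (a - 1)) * (t > lam)

def lla_step(X, y, beta_current, lam, a=3.7):
    """One LLA step: solve the weighted lasso
       min_b ||y - X b||^2 + sum_j w_j |b_j|,  w_j = p'_lambda(|beta_current_j|),
       by dividing each column of X by w_j and calling an ordinary lasso."""
    n = X.shape[0]
    w = np.maximum(scad_derivative(beta_current, lam, a), 1e-8)  # guard w_j = 0
    X_star = X / w                       # column j of X divided by w_j
    # sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha||b||_1;
    # alpha = 1/(2n) makes that proportional to ||y - Xb||^2 + ||b||_1.
    fit = Lasso(alpha=1.0 / (2 * n), fit_intercept=False).fit(X_star, y)
    return fit.coef_ / w                 # undo the rescaling
```

    Iterating lla_step from a reasonable initial estimate (or a single step from a consistent initializer, as in Zou and Li 2008) approximates the SCAD solution.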


  • Nonnegative Garotte

    Breiman (1995): solve

    $$\min_{c_1,\dots,c_p} \Big\|y - \sum_{j=1}^{p} x_j \hat\beta^{ols}_j c_j\Big\|^2 + \lambda_n \sum_{j=1}^{p} c_j \quad \text{subject to } c_j \ge 0, \ j = 1,\dots,p,$$

    and denote the solution as the ĉj's. The garotte solution is

    $$\hat\beta^{garotte}_j = \hat c_j \hat\beta^{ols}_j, \qquad j = 1,\dots,p.$$

    A sufficiently large λn shrinks some cj to exactly 0 (sparsity).

    The nonnegative garotte estimate is selection consistent. See also Yuan and Lin (2005).
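    Below is a small numerical sketch of the garotte above. It uses a generic bound-constrained solver (scipy.optimize.minimize with L-BFGS-B) rather than a dedicated quadratic-programming routine, and the function name nonnegative_garotte is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def nonnegative_garotte(X, y, lam):
    """Nonnegative garotte sketch: shrink OLS coefficients by factors c_j >= 0
       solving min_c ||y - sum_j x_j * b_ols_j * c_j||^2 + lam * sum_j c_j."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * beta_ols                      # column j is x_j * beta_ols_j

    def objective(c):
        r = y - Z @ c
        return r @ r + lam * c.sum()

    def gradient(c):
        return -2 * Z.T @ (y - Z @ c) + lam

    p = X.shape[1]
    res = minimize(objective, x0=np.ones(p), jac=gradient,
                   method='L-BFGS-B', bounds=[(0, None)] * p)
    return res.x * beta_ols               # garotte estimate: c_hat_j * beta_ols_j
```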


  • Adaptive LASSO

    Proposed by Zou (2006), Zhang and Lu (2007), and Wang et al. (2007):

    $$\hat\beta^{alasso} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|,$$

    where wj ≥ 0 for all j.

    Adaptive choices of weights

    The weights are data-dependent and adaptively chosen from the data.

    Large coefficients receive small weights (and small penalties)

    Small coefficients receive large weights (and penalties)


  • How to Compute Weights

    Need some good initial estimates β̃j ’s.

    In general, we require β̃j ’s to be root-n consistent

    The weights are calculated as

    $$w_j = \frac{1}{|\tilde\beta_j|^\gamma}, \qquad j = 1,\dots,p,$$

    where γ > 0 is another tuning parameter.

    For example, the β̃j's can be chosen as the OLS estimates β̂ols. Alternatively, if collinearity is a concern, one can use the ridge estimates β̂ridge.

    As the sample size n grows,

    the weights for zero-coefficient predictors get inflated (to ∞); the weights for nonzero-coefficient predictors converge to a finite constant.
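    A short sketch of this weight construction follows; the function name adaptive_weights, the small constant guarding against exactly-zero initial estimates, and the scikit-learn Ridge call are illustrative additions, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import Ridge

def adaptive_weights(X, y, gamma=1.0, use_ridge=False, ridge_alpha=1.0):
    """Adaptive lasso weights w_j = 1 / |beta_tilde_j|^gamma from a
       root-n consistent initial fit: OLS, or ridge if collinearity is a concern."""
    if use_ridge:
        beta_init = Ridge(alpha=ridge_alpha, fit_intercept=False).fit(X, y).coef_
    else:
        beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS initial estimate
    return 1.0 / (np.abs(beta_init) ** gamma + 1e-12)       # guard near-zero estimates
```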


  • Oracle Properties of Adaptive LASSO

    Assume the first q variables are important, and that

    $$\frac{1}{n} X_n^T X_n \to C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

    where C is a positive definite matrix.

    Assume the initial estimators β̃ are root-n consistent. Suppose λn/√n → 0 and λn n^{(γ−1)/2} → ∞.

    Consistency in variable selection:

    $$\lim_{n\to\infty} P(\hat{\mathcal{A}}_n = \mathcal{A}_0) = 1, \qquad \text{where } \hat{\mathcal{A}}_n = \{j : \hat\beta^{alasso}_{j,n} \ne 0\}.$$

    Asymptotic normality:

    $$\sqrt{n}\,\big(\hat\beta^{alasso}_{n,\mathcal{A}_0} - \beta_{\mathcal{A}_0,0}\big) \to_d N\big(0,\ \sigma^2 C_{11}^{-1}\big).$$


  • Additional Comments

    The oracle properties are closely related to the super-efficiency phenomenon (Lehmann and Casella 1998).

    The adaptive lasso simultaneously estimates large coefficients (asymptotically) unbiasedly and thresholds small estimates to zero.

    The adaptive lasso uses the same rationale as SCAD.


  • Computational Advantages of Adaptive LASSO

    The adaptive lasso

    has continuous solutions (continuous in data)

    is a convex optimization problem

    can be solved by the same efficient LARS algorithm to obtain the entire solution path.

    By contrast,

    Bridge regression (Frank and Friedman 1993) uses the Lq penalty.

    The bridge with 0 < q < 1 has the oracle properties (Knight and Fu, 2000), but the bridge solution with q < 1 is not continuous in the data (see the picture).


  • Fit Adaptive LASSO using LARS algorithm

    Algorithm:

    1. Define x**_j = x_j/w_j for j = 1, · · · , p.

    2. Solve the lasso problem for all λn:

    $$\hat\beta^{**} = \arg\min_{\beta} \Big\|y - \sum_{j=1}^{p} x^{**}_j \beta_j\Big\|^2 + \lambda_n \sum_{j=1}^{p} |\beta_j|.$$

    3. Output β̂alasso_j = β̂**_j / w_j, j = 1, · · · , p.

    The LARS algorithm is used to compute the entire solution path of the lasso in Step 2.

    The computational cost is of order O(np²), the same order of computation as a single OLS fit.
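    The three steps translate almost literally into code. The sketch below assumes scikit-learn's lars_path as the LARS implementation for Step 2; the function name adaptive_lasso_path is illustrative.

```python
import numpy as np
from sklearn.linear_model import lars_path

def adaptive_lasso_path(X, y, weights):
    """Steps 1-3 above: rescale columns by 1/w_j, run the lasso path with LARS,
       then rescale the coefficients back."""
    X_star = X / weights                                      # step 1: x**_j = x_j / w_j
    alphas, _, coefs = lars_path(X_star, y, method='lasso')   # step 2: full lasso path
    return alphas, coefs / weights[:, None]                   # step 3: beta_j = beta**_j / w_j
```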


  • Adaptive Solution Plot


  • Connection of Nonnegative Garotte and Adaptive LASSO

    Let γ = 1. The garotte can be reformulated as

    $$\min_{\beta} \Big\|y - \sum_{j=1}^{p} x_j \beta_j\Big\|^2 + \lambda_n \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat\beta^{ols}_j|} \quad \text{subject to } \beta_j \hat\beta^{ols}_j \ge 0, \ j = 1,\dots,p.$$

    The nonnegative garotte can be regarded as the adaptive lasso (γ = 1) with an additional sign constraint.

    Zou (2006) showed that, if λn → ∞ and λn/√n → 0, then the nonnegative garotte is consistent for variable selection.


  • Adaptive LASSO for Generalized Linear Models (GLMs)

    Assume the data have a density of the form

    $$f(y\,|\,\mathbf{x}, \theta) = h(y)\exp\big(y\theta - \phi(\theta)\big), \qquad \theta = \mathbf{x}^T\beta,$$

    which belongs to the exponential family with canonical parameter θ. The adaptive lasso solves the penalized negative log-likelihood

    $$\hat\beta^{alasso}_n = \arg\min_{\beta} \sum_{i=1}^{n} \big(-y_i\, \mathbf{x}_i^T\beta + \phi(\mathbf{x}_i^T\beta)\big) + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|,$$

    where

    $$w_j = 1/|\hat\beta^{mle}_j|^\gamma, \qquad j = 1,\dots,p,$$

    and β̂mle_j is the maximum likelihood estimator.


  • Adaptive LASSO for GLMs

    For logistic regression, the adaptive lasso solves

    $$\hat\beta^{alasso}_n = \arg\min_{\beta} \sum_{i=1}^{n} \Big(-y_i\, \mathbf{x}_i^T\beta + \log\big(1 + e^{\mathbf{x}_i^T\beta}\big)\Big) + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|.$$

    For Poisson log-linear regression models, the adaptive lasso solves

    $$\hat\beta^{alasso}_n = \arg\min_{\beta} \sum_{i=1}^{n} \Big(-y_i\, \mathbf{x}_i^T\beta + \exp\big(\mathbf{x}_i^T\beta\big)\Big) + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|.$$

    Zou (2006) shows that, under some mild regularity conditions, β̂alasso_n enjoys the oracle properties if λn is chosen appropriately.

    The limiting covariance matrix of β̂alasso_n is I11^{-1}, where I11 is the Fisher information when the true sub-model is known.
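    The same column-rescaling reduction used in the linear case carries over to GLMs. Below is a minimal sketch for the logistic case, assuming scikit-learn's L1-penalized LogisticRegression solves the inner problem; the mapping between its inverse-regularization parameter C and λn, and the weakly penalized fit standing in for the MLE, are schematic choices of mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptive_lasso_logistic(X, y, gamma=1.0, C=1.0):
    """Adaptive lasso for logistic regression via column rescaling:
       w_j = 1/|beta_mle_j|^gamma, fit an L1 logistic model on X_j / w_j,
       then divide the coefficients back by w_j."""
    # (near-)MLE initial fit: a very weak l2 penalty as a stand-in for the MLE
    mle = LogisticRegression(C=1e6, fit_intercept=False).fit(X, y)
    w = 1.0 / (np.abs(mle.coef_.ravel()) ** gamma + 1e-12)
    fit = LogisticRegression(penalty='l1', solver='liblinear',
                             C=C, fit_intercept=False).fit(X / w, y)
    return fit.coef_.ravel() / w
```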


  • Adaptive LASSO for High Dimensional Linear Regression

    Huang, Ma, and Zhang (2008) consider the case p → ∞ as n → ∞. If a reasonable initial estimator is available, they show:

    Under appropriate conditions, β̂alasso is consistent in variable selection, and the nonzero coefficient estimators have the same asymptotic distribution they would have if the true model were known (oracle properties).

    Under a partial orthogonality condition, i.e., unimportant covariates are only weakly correlated with important covariates, marginal regression can be used to obtain the initial estimator, and the oracle property of β̂alasso still holds even when p > n.

    For example, marginal regression initial estimates work in the essentially uncorrelated case

    $$|\mathrm{corr}(X_j, X_k)| \le \rho_n = O(n^{-1/2}), \qquad \forall\, j \in \mathcal{A}_0,\ k \in \mathcal{A}_0^c.$$


  • Correlated Predictors

    Motivations:

    If p > n or p ≫ n, the lasso can select at most n variables before it saturates, because of the nature of the convex optimization problem.

    If there are high pairwise correlations among predictors, the lasso tends to select only one variable from each group and does not care which one is selected.

    For the usual n > p situation, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression (Tibshirani, 1996).

    Methods: elastic net, OSCAR, adaptive elastic net


  • Motivations of Elastic Net

    A typical microarray data set has many thousands of predictors (genes) and often fewer than 100 samples.

    Genes sharing the same biological 'pathway' tend to have high correlations between them (Segal and Conklin, 2003).

    The genes belonging to the same pathway form a group.

    The ideal gene selection method should do two things:

    eliminate the trivial genes

    automatically include whole groups in the model once one gene among them is selected ('grouped selection').

    The lasso lacks the ability to reveal the grouping information.


  • Elastic net

    Zou and Hastie (2005) combine the lasso and the ridge

    $$\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda_2\|\beta\|^2 + \lambda_1\|\beta\|_1.$$

    The l1 part of the penalty generates a sparse model.

    The quadratic part of the penalty

    removes the limitation on the number of selected variables;

    encourages grouping effect;

    stabilizes l1 regularization path.

    The new estimator is called the Elastic Net. The final solution is

    $$\hat\beta^{enet} = (1 + \lambda_2)\,\hat\beta,$$

    which removes the double shrinkage effect.
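    For a quick numerical illustration, the sketch below fits scikit-learn's ElasticNet on invented toy data with one highly correlated pair of predictors. Note that scikit-learn parameterizes the penalty through (alpha, l1_ratio) rather than (λ1, λ2), and returns the "naive" elastic net solution without the (1 + λ2) rescaling above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)     # a highly correlated pair
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# sklearn's penalty is alpha*l1_ratio*||b||_1 + 0.5*alpha*(1-l1_ratio)*||b||^2,
# i.e. (lambda1, lambda2) up to the 1/(2n) factor in its squared-error loss.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(enet.coef_)        # correlated columns 0 and 1 tend to enter together
```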


  • Elastic Net Regularization

    The elastic net penalty function

    has singularity at the vertex (necessary for sparsity)

    has strictly convex edges (encouraging grouping)

    Properties: the elastic net solution

    simultaneously does automatic variable selection and continuous shrinkage

    can select groups of correlated variables, retaining 'all the big fish'.

    The elastic net often outperforms the lasso in terms of prediction accuracy in numerical results.



  • Strictly Convex Penalty vs LASSO penalty

    Consider

    $$\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda J(\beta),$$

    where J(β) > 0 for β ≠ 0.

    Assume xi = xj for some i, j ∈ {1, · · · , p}.

    If J is strictly convex, then β̂i = β̂j for any λ > 0.

    If J(β) = ‖β‖1, then β̂i β̂j ≥ 0, and β̂* is another minimizer, where

    $$\hat\beta^{*}_k = \begin{cases} \hat\beta_k & \text{if } k \ne i,\ k \ne j,\\ (\hat\beta_i + \hat\beta_j)\cdot s & \text{if } k = i,\\ (\hat\beta_i + \hat\beta_j)\cdot (1-s) & \text{if } k = j, \end{cases}$$

    for any s ∈ [0, 1].


  • Comments

    There is a clear distinction between strictly convex penalty functions and the lasso penalty.

    Strict convexity guarantees the grouping effect in the extreme situation with identical predictors.

    In contrast, the lasso does not even have a unique solution.

    The elastic net penalty with λ2 > 0 is strictly convex, thus enjoying the grouping effect.


  • Results on the Grouping Effect

    Given data (y, X) and parameters (λ1, λ2), suppose the response y is centered and the predictors X are standardized. Let β̂enet(λ1, λ2) be the elastic net estimate. Suppose β̂enet_i(λ1, λ2) β̂enet_j(λ1, λ2) > 0; then

    $$\frac{1}{\|y\|_1}\,\big|\hat\beta^{enet}_i(\lambda_1,\lambda_2) - \hat\beta^{enet}_j(\lambda_1,\lambda_2)\big| < \frac{\sqrt{2}}{\lambda_2}\,\sqrt{1-\rho_{ij}},$$

    where ρij = x_i^T x_j is the sample correlation.

    The closer ρij is to 1, the more grouping in the solution.

    The larger λ2 is, the more grouping in the solution.


  • Effective Degrees of Freedom

    Effective df describes the model complexity.

    df is very useful in estimating the prediction accuracy of the fitted model.

    df is well studied for linear smoothers: if ŷ = Sy, then df(ŷ) = tr(S).

    For the l1-related methods, the non-linear nature makes the analysis difficult.


  • Elastic net: degrees of freedom

    For the elastic net, an unbiased estimate of df is

    $$\widehat{df}(\hat\beta^{enet}) = \mathrm{tr}\big(H_{\lambda_2}(\mathcal{A})\big),$$

    where A = {j : β̂enet_j ≠ 0, j = 1, · · · , p} is the active set (containing all the selected variables), and

    $$H_{\lambda_2}(\mathcal{A}) = X_{\mathcal{A}}\big(X_{\mathcal{A}}^T X_{\mathcal{A}} + \lambda_2 I\big)^{-1} X_{\mathcal{A}}^T.$$

    Special case for the lasso: λ2 = 0 and

    $$\widehat{df}(\hat\beta^{lasso}) = \text{the number of nonzero coefficients in } \hat\beta^{lasso}.$$
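    This df estimate is easy to compute directly from the fitted coefficients; here is a small NumPy sketch (the function name enet_df is illustrative).

```python
import numpy as np

def enet_df(X, coef, lam2):
    """Unbiased df estimate for the elastic net: tr(H_{lambda2}(A)) with
       H = X_A (X_A^T X_A + lambda2 I)^{-1} X_A^T.  With lam2 = 0 this
       reduces to the lasso case (the number of nonzero coefficients)."""
    A = np.flatnonzero(coef != 0)         # active set
    if A.size == 0:
        return 0.0
    XA = X[:, A]
    M = XA.T @ XA + lam2 * np.eye(A.size)
    # tr(XA M^{-1} XA^T) = tr(M^{-1} XA^T XA) by cyclicity of the trace
    return float(np.trace(np.linalg.solve(M, XA.T @ XA)))
```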


  • Computing and Tuning Elastic Net

    Solution path algorithm:

    The elastic net solution path is piecewise linear.

    Given a fixed λ2, a stage-wise algorithm called LARS-EN efficiently solves the entire elastic net solution path.

    The computational effort is the same as that of a single OLS fit.

    R package: elasticnet

    Two-dimensional cross validation


  • Adaptive Elastic Net

    Zou and Zhang (2009) propose

    $$\hat\beta^{aenet} = (1+\lambda_2)\,\arg\min_{\beta} \|y - X\beta\|^2 + \lambda_2\|\beta\|^2 + \lambda_1 \sum_{j=1}^{p} w_j|\beta_j|,$$

    where wj = (|β̂enet_j|)^{−γ} for j = 1, · · · , p and γ > 0 is a constant.

    Main results:

    Under weak regularity conditions, the adaptive elastic-net has the oracle property.

    Numerically, the adaptive elastic-net deals with the collinearity problem better than the other oracle-like methods and has better finite sample performance.
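    One practical way to compute this estimator is to fold the ridge term into augmented rows and handle the weighted l1 term by column rescaling, so that an ordinary lasso solver can be reused. The sketch below follows that route with scikit-learn; the initial naive elastic net fit used for the weights, the regularization values, and the function name adaptive_elastic_net are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

def adaptive_elastic_net(X, y, lam1, lam2, gamma=1.0):
    """Adaptive elastic net sketch:
       1) weights from an initial (naive) elastic net fit,
       2) rewrite the ridge term as augmented rows sqrt(lam2)*I,
       3) handle the weighted l1 term by column rescaling + ordinary lasso,
       4) apply the (1 + lam2) rescaling from the definition."""
    n, p = X.shape
    init = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y).coef_
    w = 1.0 / (np.abs(init) ** gamma + 1e-8)
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])   # ||y-Xb||^2 + lam2||b||^2
    y_aug = np.concatenate([y, np.zeros(p)])
    m = n + p
    # after dividing columns by w_j the penalty becomes lam1 * ||b*||_1;
    # sklearn's Lasso uses (1/(2m))||.||^2 + alpha||b||_1, so alpha = lam1/(2m)
    fit = Lasso(alpha=lam1 / (2 * m), fit_intercept=False).fit(X_aug / w, y_aug)
    return (1 + lam2) * fit.coef_ / w
```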


  • Motivation of Group Lasso

    The problem of group selection arises naturally in many practical situations. For example, in multifactor ANOVA,

    each factor has several levels and can be expressed through a group of dummy variables;

    the goal is to select important main effects and interactions for accurate prediction, which amounts to the selection of groups of derived input variables.

    Variable selection amounts to the selection of important factors (groups of variables) rather than individual derived variables.


  • Another Motivation: Nonparametric Regression

    Additive Regression Models

    Each component in the additive model may be expressed as a linear combination of a number of basis functions of the original measured variable.

    The selection of important measured variables corresponds to the selection of groups of basis functions.

    Variable selection amounts to the selection of groups of basis functions (those corresponding to the same variable).


  • Group Selection Setup

    Consider the general regression problem with J factors:

    $$y = \sum_{j=1}^{J} X_j \beta_j + \epsilon,$$

    where y is an n × 1 vector, ε ∼ N(0, σ²I), Xj is an n × pj matrix corresponding to the jth factor, and βj is a coefficient vector of size pj, j = 1, . . . , J.

    Both y and the Xj's are centered. Each Xj is orthonormalized, i.e.,

    $$X_j^T X_j = I_{p_j}, \qquad j = 1, \dots, J,$$

    which can be done through Gram–Schmidt orthonormalization. Denote X = (X1, X2, . . . , XJ) and β = (β_1^T, β_2^T, · · · , β_J^T)^T.


  • Group LASSO

    Yuan and Lin (2006) propose

    $$\hat\beta^{glasso} = \arg\min_{\beta} \Big\|y - \sum_{j=1}^{J} X_j\beta_j\Big\|^2 + \lambda \sum_{j=1}^{J} \sqrt{p_j}\,\|\beta_j\|,$$

    where

    λ ≥ 0 is a tuning parameter;

    ‖βj‖ = √(βj^T βj) is the (empirical) L2 norm.

    The group lasso penalty encourages sparsity at the factor level.

    When p1 = · · · = pJ = 1, the group lasso reduces to the lasso.
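    Under the orthonormal-groups setup above (X_j^T X_j = I_{p_j}), blockwise coordinate descent for this objective has a closed-form update: each group is either zeroed out or shrunk toward its least-squares-on-residual direction. Below is a minimal NumPy sketch assuming that orthonormalization; the function name group_lasso and the iteration count are illustrative.

```python
import numpy as np

def group_lasso(Xs, y, lam, n_iter=200):
    """Block coordinate descent for
       min_b ||y - sum_j X_j b_j||^2 + lam * sum_j sqrt(p_j) * ||b_j||,
       assuming each X_j has orthonormal columns (X_j^T X_j = I).
       Xs is a list of n x p_j arrays; returns the list of coefficient blocks."""
    betas = [np.zeros(Xj.shape[1]) for Xj in Xs]
    for _ in range(n_iter):
        for j, Xj in enumerate(Xs):
            # partial residual excluding group j
            r = y - sum(Xk @ bk for k, (Xk, bk) in enumerate(zip(Xs, betas)) if k != j)
            S = Xj.T @ r
            thr = lam * np.sqrt(Xj.shape[1]) / 2.0   # /2 because the loss has no 1/2 factor
            norm_S = np.linalg.norm(S)
            betas[j] = np.zeros_like(S) if norm_S <= thr else (1 - thr / norm_S) * S
    return betas
```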


  • Degree of Freedom for Group LASSO

    $$\widehat{df} = \sum_{j=1}^{J} I\big(\|\hat\beta^{glasso}_j\| > 0\big) + \sum_{j=1}^{J} \frac{\|\hat\beta^{glasso}_j\|}{\|\hat\beta^{ols}_j\|}\,(p_j - 1).$$



  • Motivation of Adaptive Group LASSO

    The group lasso can suffer from estimation inefficiency and selection inconsistency. To remedy this, Wang and Leng (2006) propose

    $$\hat\beta^{aglasso} = \arg\min_{\beta} \Big\|y - \sum_{j=1}^{J} X_j\beta_j\Big\|^2 + \lambda \sum_{j=1}^{J} w_j\|\beta_j\|,$$

    where the weight wj = ‖β̃j‖^{−γ} for all j = 1, · · · , J.

    The adaptive group LASSO has oracle properties: β̂aglasso is root-n consistent, selection consistent, and asymptotically normal and efficient.
