
Lecture 5: Variable Selection and Sparsity

Tuo Zhao

Schools of ISyE and CSE, Georgia Tech

High Dimensional Variable Selection

ISYE/CSE 6740: Computational Data Analysis

Linear Models

The simplest regression model in the world:

y = Xθ* + ε.

Design Matrix: X ∈ R^{n×d},
Response Vector: y ∈ R^n,
Random Noise: ε ∼ N(0, σ²I_n).

n > d: Ordinary Least Squares Estimator (equivalent to the MLE)

θ_o = (X^⊤X)^{-1} X^⊤ y.

d ≫ n: X^⊤X is not invertible. What to do?

Tuo Zhao — Lecture 5: Variable Selection and Sparsity 3/49
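As a quick sanity check, the OLS estimator above can be computed in a few lines of NumPy. This is a minimal sketch; the synthetic data (n, d, σ, θ*) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# OLS estimator theta_o = (X^T X)^{-1} X^T y for the n > d case,
# on synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.1
X = rng.standard_normal((n, d))
theta_star = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_star + sigma * rng.standard_normal(n)

# np.linalg.lstsq solves the least-squares problem without explicitly
# forming (X^T X)^{-1}, which is numerically preferable.
theta_o, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(theta_o - theta_star)))  # small estimation error
```

With n well above d and small noise, the estimate recovers θ* up to noise-level error.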


Motivating Example: Credit Card

[Figure slides (pages 4-6): credit card data example.]

Motivating Example: Medical Imaging

[Figure slide (page 7): medical imaging example.]

Sparsity-Inducing Norm Regularization

[Figure: the linear model y = Xθ* + ε, with y ∈ R^n, an n × d design matrix X, and a sparse coefficient vector θ*.]

Sparsity Assumption: ∑_{j=1}^d 1(θ*_j ≠ 0) = s ≪ d.

Greedy Selection and Ridge Estimator

What we learned in textbooks:

Forward Selection: always increases the model size.
Backward Selection: always decreases the model size.
Stepwise Selection: dynamically adjusts the model size.
Hypothesis Testing: a t-test for each coefficient.
Ridge Estimator: the model size is fixed.

This lecture is about: Lasso, Logistic Lasso, Graphical Lasso, Group Lasso, Elastic-net, Dantzig Selector, ...


Lasso and Ridge Regression

Lasso Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² subject to ‖θ‖₁ ≤ R,

where R is a tuning parameter.

Ridge Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² subject to ‖θ‖₂ ≤ R.

Geometric Intuition

Linear regression via the Lasso (Tibshirani, 1995). Given observations {(y_i, x_i1, ..., x_ip)}_{i=1}^N,

min_β ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² subject to ∑_{j=1}^p |β_j| ≤ t.

Similar to ridge regression, which has the constraint ∑_j β_j² ≤ t.

Lasso does variable selection and shrinkage, while ridge only shrinks.

[Figure: contours of the least-squares loss meeting the ℓ1 ball, whose corners give sparse solutions, and the ℓ2 ball. Slide adapted from Trevor Hastie, useR! 2009.]

Regularized Least Squares Regression

Lasso (Tibshirani, 1996):

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₁,

where λ > 0 is the regularization parameter.

Ridge Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₂².

Remark: The ℓ1 norm can trap some coordinates at exactly zero.

Why does the ℓ1 norm work?

Best Subset Selection using ℓ0 regularization:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₀,

where ‖θ‖₀ = ∑_{j=1}^d 1(θ_j ≠ 0).

Differences (ℓ0 vs. ℓ1):

Discontinuous vs. continuous
Nonconvex vs. convex
Unbiased vs. biased
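The continuity/bias contrast can be seen directly from the two proximal maps: the ℓ1 penalty yields the continuous but biased soft-thresholding operator, while the ℓ0 penalty yields the discontinuous but unbiased hard-thresholding operator. A minimal sketch, with an illustrative threshold level:

```python
import numpy as np

def soft_threshold(z, lam):
    # prox of lam*|.|: continuous, but shrinks survivors by lam (bias)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    # l0-style thresholding: discontinuous at |z| = lam,
    # but survivors are kept unchanged (no bias)
    return np.where(np.abs(z) > lam, z, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(soft_threshold(z, 1.0))  # large entries shrunk toward zero
print(hard_threshold(z, 1.0))  # large entries kept as-is
```

Both maps zero out small inputs; only the soft threshold shifts the surviving entries, which is the bias referred to on the slide.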

Why does the ℓ1 norm work?

[Figure: the ℓ1 and ℓ0 regularizers plotted as functions of θ_j.]

Extensions to Generalized Linear Models

Logistic Lasso (Tibshirani, 1996):

θ̂ = argmin_θ (1/n) ∑_{i=1}^n log(1 + exp(−y_i x_i^⊤ θ)) + λ‖θ‖₁.

Design Matrix: X = [x_1, ..., x_n]^⊤ ∈ R^{n×d},
Response Vector: y = [y_1, ..., y_n]^⊤ ∈ {−1, +1}^n.

ERM Framework: Loss + ℓ1 regularization:

Sparse Support Vector Machine,
Sparse LAD Regression,
... ...
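A minimal sketch of evaluating the logistic-lasso objective above; the synthetic data and the value of λ are illustrative assumptions. At θ = 0 every margin is zero, so the loss term is exactly log 2, which makes a convenient baseline.

```python
import numpy as np

def logistic_lasso_objective(theta, X, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i x_i^T theta)) + lam * ||theta||_1
    margins = y * (X @ theta)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.sum(np.abs(theta))

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
theta_star = np.array([2.0, -1.0, 0.0, 0.0])
y = np.sign(X @ theta_star + 0.1 * rng.standard_normal(n))

f_star = logistic_lasso_objective(theta_star, X, y, lam=0.01)
f_zero = logistic_lasso_objective(np.zeros(d), X, y, lam=0.01)
print(f_star < f_zero)  # the true parameter beats theta = 0
```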

Extensions to Undirected Graphical Models

Gaussian Graphical Models: X = (X_1, ..., X_d) ∈ R^d ∼ N(0, Σ). Precision Matrix: Ω = Σ^{-1}. X_j and X_k are independent given the other variables if Ω_jk = 0. The sparsity pattern of Ω encodes the conditional independence graph G = (V, E).

Graphical Lasso:

Ω̂ = argmin_Ω −log|Ω| + trace(S^⊤Ω) + λ ∑_{j,k} |Ω_jk|,

Data Matrix: X = [x_1, ..., x_n]^⊤,
Sample Mean: x̄ = (1/n) ∑_{i=1}^n x_i,
Empirical Covariance: S = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^⊤.
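The input S to the graphical lasso is just the (biased) empirical covariance from the slide. A minimal sketch on synthetic data (the data are an assumption), checked against NumPy's built-in covariance:

```python
import numpy as np

# Empirical covariance S = (1/n) sum_i (x_i - xbar)(x_i - xbar)^T,
# the data-dependent input to the graphical lasso.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))

xbar = X.mean(axis=0)
Xc = X - xbar
S = (Xc.T @ Xc) / n  # divide by n (not n-1), matching the slide

# np.cov expects variables in rows, hence X.T; bias=True divides by n.
print(np.allclose(S, np.cov(X.T, bias=True)))  # True
```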


Examples of Undirected Graphical Models

[Figure: the estimated undirected graph using the Arabidopsis dataset; nodes are genes such as MPDC1, FPPS1, HMGR1, GPPS, DXR, and MECPS.]

Group Lasso

Linear Model with Group Structure:

y = ∑_{j=1}^d X_{G_j} θ*_{G_j} + ε,

where X_{G_j} ∈ R^{n×m_j}, θ*_{G_j} ∈ R^{m_j}, and G_j ∩ G_k = ∅ for j ≠ k.

Group Regularization:

θ̂ = argmin_θ (1/2n)‖y − ∑_{j=1}^d X_{G_j} θ_{G_j}‖₂² + λ‖θ‖_{1,p},

where 2 ≤ p ≤ ∞ and ‖θ‖_{1,p} = ∑_{j=1}^d ‖θ_{G_j}‖_p.

Structural Sparsity Assumption: ‖θ*‖_{0,p} = s ≪ d.
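The group norm ‖θ‖_{1,p} above is an ℓ1 norm across groups of ℓp norms within groups, which is why it zeroes out whole groups at once. A minimal sketch (the grouping below is illustrative):

```python
import numpy as np

def group_norm(theta, groups, p):
    # ||theta||_{1,p} = sum_j ||theta_{G_j}||_p over disjoint groups
    return sum(np.linalg.norm(theta[g], ord=p) for g in groups)

theta = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]

print(group_norm(theta, groups, p=2))       # ||(3,4)||_2 + 0 + 5 = 10
print(group_norm(theta, groups, p=np.inf))  # max(3,4) + 0 + 5 = 9
```

Note the middle group contributes 0 regardless of p: an entirely zero group costs nothing, which is the structural sparsity the penalty encourages.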

Region Sparsity of Brain Medical Imaging

[Figure slide (page 19): region sparsity in brain imaging.]

Group Regularization

The group regularization yields joint sparsity over each block of coefficients. What is the difference between the Ridge and the ℓ2 norm regularization?

[Figure: the ℓ2 and ℓ∞ regularization functions.]

Extension to Multitask Regression

Multitask Regression Models:

Y = XΘ* + W.

Response Matrix: Y ∈ R^{n×m},
Regression Coefficient Matrix: Θ* ∈ R^{d×m},
Random Noise: W has i.i.d. Gaussian entries.

Regularization Across Tasks:

Θ̂ = argmin_Θ (1/2n)‖Y − XΘ‖_F² + λ‖Θ‖_{1,p},

where ‖Θ‖_{1,p} = ∑_{j=1}^d (∑_{k=1}^m |Θ_jk|^p)^{1/p}.

Structural Sparsity Assumption: ‖Θ*‖_{0,p} = s ≪ d.

Elastic-net Regularization

Elastic-net Regularized Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ₁‖θ‖₁ + λ₂‖θ‖₂²,

where λ₁ and λ₂ are regularization parameters.

Remark:

Extra tuning effort (two parameters instead of one)
Handles collinearity
Grouping effects for correlated variables
Eases computation (the objective becomes strongly convex)

Elastic-net Regularization

Ridge Regularization:

∑_{j=1}^d θ_j² ∝ ∑_{j>k} [(θ_j − θ_k)² + (θ_j + θ_k)²].

The Ridge regularization encourages shrinkage of the differences θ_j − θ_k (and sums θ_j + θ_k), pulling the coefficients of highly correlated variables toward each other.

Therefore, the elastic-net regularized regression tends to jointly select or remove highly correlated variables.

Extensions: Elastic-net Penalized Logistic/Poisson Regression.
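The proportionality above is an exact algebraic identity: since (a − b)² + (a + b)² = 2a² + 2b², each index j appears in d − 1 pairs, so the pairwise sum equals 2(d − 1)∑_j θ_j². A quick numerical check (the random θ is illustrative):

```python
import numpy as np

# Verify: sum_{j>k} [(t_j - t_k)^2 + (t_j + t_k)^2] = 2(d-1) * sum_j t_j^2
rng = np.random.default_rng(0)
d = 6
theta = rng.standard_normal(d)

pairwise = sum((theta[j] - theta[k]) ** 2 + (theta[j] + theta[k]) ** 2
               for j in range(d) for k in range(j))
print(np.isclose(pairwise, 2 * (d - 1) * np.sum(theta ** 2)))  # True
```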

Dantzig Selector

Dantzig Selector:

θ̂ = argmin_θ ‖θ‖₁ subject to (1/n)‖X^⊤(y − Xθ)‖∞ ≤ λ.

General Form:

θ̂ = argmin_θ R(θ) subject to R*(∇L(θ)) ≤ λ.

Remark:

Essentially linear optimization
Similar statistical performance to the Lasso
Less popular in practice

Dantzig Selector as a Linear Program

Parameter Decomposition: θ = θ⁺ − θ⁻.

Reparametrization:

min_{θ⁺,θ⁻} 1^⊤θ⁺ + 1^⊤θ⁻

subject to X^⊤(Xθ⁺ − Xθ⁻ − y) ≤ λ1,
−λ1 ≤ X^⊤(Xθ⁺ − Xθ⁻ − y),
θ⁺ ≥ 0, θ⁻ ≥ 0.

Remark: Efficiently solved by existing linear programming solvers.
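As a sketch, the LP above can be handed to any off-the-shelf LP solver; here SciPy's linprog is used (the solver choice and the synthetic data are assumptions, not from the slides), with the 1/n scaling of the Dantzig constraint folded in:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
theta_star = np.array([2.0, 0.0, -1.5, 0.0])
y = X @ theta_star + 0.01 * rng.standard_normal(n)

lam = 0.05
G = X.T @ X / n                         # (1/n) X^T X
b = X.T @ y / n                         # (1/n) X^T y

# Variables z = [theta_plus; theta_minus] >= 0; minimize 1^T z.
c = np.ones(2 * d)
A_ub = np.vstack([np.hstack([G, -G]),   #  (1/n) X^T (X theta - y) <= lam 1
                  np.hstack([-G, G])])  # -(1/n) X^T (X theta - y) <= lam 1
b_ub = np.concatenate([lam + b, lam - b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

theta_hat = res.x[:d] - res.x[d:]
# Feasibility: the residual correlation is capped at lam.
print(np.max(np.abs(X.T @ (y - X @ theta_hat))) / n <= lam + 1e-6)
```

With small λ and a well-conditioned design, θ̂ lands close to θ* and keeps the zero coordinates small.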

Statistical Properties

Parameter Estimation:

Lasso: ‖θ̂ − θ*‖₂² = O_P(s log d / n),
Group Lasso: ‖θ̂ − θ*‖₂² = O_P(s log d / n + s·m_max / n).

Remark:

Restricted Eigenvalue Conditions
Light Tail Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Statistical Properties

Variable Selection:

Lasso: P(sign(θ̂) = sign(θ*)) → 1,
Group Lasso: P(sign(θ̂) = sign(θ*)) → 1.

Remark:

Restricted Eigenvalue Conditions + Irrepresentable Conditions
Light Tail Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Statistical Properties

Excess Risk Bound:

Lasso: E L(θ̂) − E L(θ*) = O_P(√(s log d / n)),
Group Lasso: E L(θ̂) − E L(θ*) = O_P(√(s log d / n) + √(s·m_max / n)).

Remark:

Statistical Learning Theory vs. Statistics
Bounded Design and Response Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Nonsmooth Convex Optimization

Computational Algorithms

You may have heard of:

1. Proximal Gradient Algorithm (Nesterov, 2007)
2. Accelerated Proximal Gradient Algorithm (Beck et al., 2009)
3. Coordinate Descent Algorithm (Friedman et al., 2007)
4. Accelerated Coordinate Descent Algorithm (Lin et al., 2014)
5. Extensions to Stochastic and Parallel Optimization

Proximal Gradient Algorithm

The proximal gradient algorithm is the most fundamental computational algorithm for solving high-dimensional sparse estimation problems (Nesterov, 2007):

θ̂ = argmin_θ L(θ) + R_λ(θ) =: F_λ(θ).

Remark:

Simple and easy to implement
Handles complex regularization
Software packages available in R

Proximal Gradient Algorithm

Given the solution θ^(t), we take

θ^(t+1) = argmin_θ L(θ^(t)) + (θ − θ^(t))^⊤ ∇L(θ^(t)) + (1/2η_t)‖θ − θ^(t)‖₂² + R_λ(θ)
        = argmin_θ (1/2)‖θ − θ^(t) + η_t ∇L(θ^(t))‖₂² + η_t R_λ(θ),

where η_t is the step size parameter. Then we have

θ^(t+1) = T_{η_t λ}(θ^(t) − η_t ∇L(θ^(t))).

Proximal Gradient Algorithm

Lasso: At the t-th iteration,

θ_j^(t+1) = sign(θ̃_j^(t+1)) · max{|θ̃_j^(t+1)| − ηλ, 0}, where θ̃_j^(t+1) = θ_j^(t) − η ∇_j L(θ^(t)).

Group Lasso: At the t-th iteration,

θ_{G_j}^(t+1) = θ̃_{G_j}^(t+1) · max{1 − λη / ‖θ̃_{G_j}^(t+1)‖₂, 0}, where θ̃_{G_j}^(t+1) = θ_{G_j}^(t) − η ∇_{G_j} L(θ^(t)).
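The Lasso update above is easy to run end-to-end. A minimal NumPy sketch of the resulting proximal gradient (ISTA) iteration on synthetic data; the data, the step-size rule η = 1/L with L = ‖X‖₂²/n, and the iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal map of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = [2.0, -3.0, 1.5]          # s = 3 sparse truth
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam = 0.1
eta = n / np.linalg.norm(X, 2) ** 2        # 1/L, L = sigma_max(X)^2 / n
theta = np.zeros(d)
for _ in range(500):
    grad = -X.T @ (y - X @ theta) / n      # gradient of (1/2n)||y - X theta||^2
    theta = soft_threshold(theta - eta * grad, eta * lam)

print(np.sum(theta != 0), np.max(np.abs(theta - theta_star)))
```

The iterates stay sparse (most coordinates are trapped at zero) while the large true coefficients survive, up to the ℓ1 shrinkage bias of order λ.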

Convergence Analysis

Sublinear Rate of Convergence:

T = O(L/ε) iterations such that F_λ(θ^(t+1)) − F_λ(θ̂) ≤ ε.

Remark:

∇L(·) is Lipschitz continuous: ‖∇L(θ′) − ∇L(θ)‖₂ ≤ L‖θ′ − θ‖₂.
L ≤ 1/η ≤ 2L (guaranteed by line search).
Accelerated version: O(√(L/ε)).
Linear rate of convergence requires strong convexity.

Coordinate Descent Algorithm

The coordinate descent algorithm is the most famous computational algorithm for solving high-dimensional sparse estimation problems (Friedman et al., 2007, 2010):

Simple and easy to implement
Extremely efficient when the solution is sparse
High precision
Decomposable regularization: R_λ(θ) = ∑_{j=1}^d r_λ(θ_j)

Randomized Coordinate Descent Algorithm

At the t-th iteration, we sample j from {1, ..., d} with equal probability, and take

θ_j^(t+1) = argmin_{θ_j} L(θ^(t)) + (θ_j − θ_j^(t)) ∇_j L(θ^(t)) + (1/2η_j)(θ_j − θ_j^(t))² + R_λ(θ_{\j}^(t)) + r_λ(θ_j)
          = argmin_{θ_j} (1/2)(θ_j − θ_j^(t) + η_j ∇_j L(θ^(t)))² + η_j r_λ(θ_j),

where η_j is the step size parameter. Then we have

θ_j^(t+1) = T_{η_j λ}(θ_j^(t) − η_j ∇_j L(θ^(t))) and θ_{\j}^(t+1) = θ_{\j}^(t).
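Specialized to the Lasso, where r_λ(θ_j) = λ|θ_j|, the coordinate update has a closed form. A minimal sketch below uses a cyclic sweep instead of random sampling (for determinism; the randomized version updates the same way) and includes the partial residual update trick mentioned later. The synthetic data are an assumption:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:2] = [1.5, -2.0]
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam = 0.1
M = (X ** 2).sum(axis=0) / n      # coordinate Lipschitz constants M_j
theta = np.zeros(d)
r = y - X @ theta                 # residual, maintained incrementally
for _ in range(100):
    for j in range(d):
        grad_j = -X[:, j] @ r / n                       # nabla_j L(theta)
        new_j = soft_threshold(theta[j] - grad_j / M[j], lam / M[j])
        r += X[:, j] * (theta[j] - new_j)               # partial residual update
        theta[j] = new_j

print(np.max(np.abs(theta - theta_star)))
```

Maintaining the residual r makes each coordinate update O(n) instead of O(nd), which is the source of the algorithm's efficiency on sparse problems.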

Convergence Analysis

Sublinear Rate of Convergence:

T = O(d·max_j M_j / ε) iterations such that E[F_λ(θ^(t+1))] − F_λ(θ̂) ≤ ε.

Remark:

∇_j L(·, θ_{\j}) is Lipschitz continuous for all j = 1, ..., d: |∇_j L(θ′_j, θ_{\j}) − ∇_j L(θ_j, θ_{\j})| ≤ M_j |θ′_j − θ_j|.
1/η_j = M_j (often explicitly calculated).
Accelerated version: O(d√(max_j M_j / ε)).
Partial residual update trick.

Warm Start Initialization

Regularization sequence {λ_K}_{K=0}^N with λ_0 = (1/n)‖X^⊤y‖∞.

Estimators {θ̂_K}_{K=0}^N go from sparse to dense, starting at θ̂_0 = 0.

[Figure: each problem min_θ L(θ) + R_{λ_{K+1}}(θ) is initialized at the previous solution θ̂_K, with a geometrically decaying sequence, e.g. λ_{K+1} = 0.96 λ_K.]
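A sketch of the start of the path: λ_0 = (1/n)‖X^⊤y‖∞ is exactly the threshold at which θ̂ = 0, which one proximal step from zero verifies. The geometric decay factor 0.96 follows my reading of the (garbled) slide and is otherwise an assumption; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -1.0] + [0.0] * 6) + 0.1 * rng.standard_normal(n)

lam0 = np.max(np.abs(X.T @ y)) / n          # lambda_0 = (1/n)||X^T y||_inf
lams = lam0 * 0.96 ** np.arange(10)         # sparse-to-dense path

# One proximal gradient step from theta = 0: the argument of the soft
# threshold is eta * X^T y / n, whose largest entry equals eta * lam0,
# so thresholding at eta * lam0 returns exactly 0 (theta_hat_0 = 0).
eta = n / np.linalg.norm(X, 2) ** 2
g = X.T @ y / n
step = np.sign(eta * g) * np.maximum(np.abs(eta * g) - eta * lam0, 0.0)
print(np.allclose(step, 0.0))  # True
```

For any λ < λ_0 the threshold no longer kills the largest correlation, so coordinates begin to enter the model as the path proceeds.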

Active Set Strategy

[Figure slides (pages 39-45): the coordinates 1-12 are split into an active set A and an inactive set; the algorithm iterates over the active set only, and coordinates move between the two sets across iterations.]

Solution Path

[Figure: Lasso and Elastic-net solution paths.]

Proximal Newton Algorithm

Given the solution θ^(t), we take

θ^(t+0.5) = argmin_θ L(θ^(t)) + (θ − θ^(t))^⊤ ∇L(θ^(t)) + (1/2)(θ − θ^(t))^⊤ ∇²L(θ^(t)) (θ − θ^(t)) + R_λ(θ).

Combined with the backtracking line search, we have

θ^(t+1) = θ^(t) + η(θ^(t+0.5) − θ^(t)).

Remark: Each subproblem is solved by the coordinate descent algorithm. The proximal Newton algorithm can be much more efficient than the coordinate descent algorithm in practice.

Software Libraries

Available Packages

glmnet: Lasso, Logistic/Poisson Lasso. Developed by J. Friedman; maintained by T. Hastie.

PICASSO: Lasso, Logistic/Poisson Lasso.

huge: Graphical Lasso.

liblinear: Sparse Support Vector Machine, Logistic Lasso. Developed by C. Jin; maintained by T. Helleputte.

QUIC: Graphical Lasso. Developed by C. Hsieh; maintained by M. Sustik.