Transcript of Lecture 5: Variable Selection and Sparsity (tzhao80/Lectures/Lecture_5.pdf)

Page 1:

Lecture 5: Variable Selection and Sparsity

Tuo Zhao

Schools of ISyE and CSE, Georgia Tech

Page 2:

High Dimensional Variable Selection

Page 3:

Linear Models

The simplest regression model in the world:

$y = X\theta^* + \varepsilon$.

Design Matrix: $X \in \mathbb{R}^{n \times d}$,

Response Vector: $y \in \mathbb{R}^{n}$,

Random Noise: $\varepsilon \sim N(0, \sigma^2 I_n)$.

$n > d$: Ordinary Least Squares estimator (equivalent to the MLE):

$\hat{\theta}^{o} = (X^\top X)^{-1} X^\top y$.

$d \gg n$: $X^\top X$ is not invertible. What to do?
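To see the breakdown concretely, here is a minimal NumPy sketch (dimensions and noise level are made up for illustration): when $d \gg n$, $X^\top X$ has rank at most $n$, and plain least squares no longer recovers a sparse $\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                   # d >> n: more features than samples
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]     # sparse ground truth
y = X @ theta_star + 0.1 * rng.standard_normal(n)

print(np.linalg.matrix_rank(X.T @ X))            # at most n = 50 < d = 200, so X'X is singular

# (X'X)^{-1} X'y does not exist; lstsq returns the minimum-norm least-squares
# solution instead, which is dense and need not be close to theta_star.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.count_nonzero(np.abs(theta_ls) > 1e-8)) # essentially all d coordinates are nonzero
```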


Page 6:

Motivating Example: Credit Card


Page 9:

Motivating Example: Medical Imaging

Page 10:

Sparsity-Inducing Norm Regularization

(Diagram: the linear model $y = X\theta^* + \varepsilon$ with $X \in \mathbb{R}^{n \times d}$, $d \gg n$.)

Sparsity Assumption: $\sum_{j=1}^{d} \mathbb{1}(\theta^*_j \neq 0) = s \ll d$.

Page 11:

Greedy Selection and Ridge Estimator

What we learned in textbooks:

Forward Selection: It always increases the model size

Backward Selection: It always decreases the model size

Stepwise Selection: It dynamically adjusts the model size

Hypothesis Testing: t-test for each coefficient.

Ridge Estimator: The model size is fixed

This lecture is about: Lasso, Logistic Lasso, Graphical Lasso, Group Lasso, Elastic-net, Dantzig Selector, ...


Page 17:

Lasso and Ridge Regression

Lasso Regression:

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2$ subject to $\|\theta\|_1 \leq R$,

where $R$ is a tuning parameter.

Ridge Regression:

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2$ subject to $\|\theta\|_2 \leq R$.

Page 18:

Geometric Intuition

(Embedded slide: Trevor Hastie, useR! 2009, "Linear regression via the Lasso" (Tibshirani, 1995). Given observations $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{N}$,

$\min_{\beta} \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \leq t$.

Similar to ridge regression, which has constraint $\sum_j \beta_j^2 \leq t$. Lasso does variable selection and shrinkage, while ridge only shrinks.

Figure: the $\ell_1$ and $\ell_2$ constraint regions.)

Page 19:

Regularized Least Squares Regression

Lasso (Tibshirani, 1996):

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$,

where $\lambda > 0$ is the regularization parameter.

Ridge Regression:

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2$.

Remark: The $\ell_1$ norm can trap some coordinates at zero values.
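A minimal scikit-learn sketch of this contrast (sizes, noise level, and the value of alpha are made up; scikit-learn's Lasso uses the same $\frac{1}{2n}$ scaling as above, so alpha plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d, s = 100, 500, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = rng.uniform(1.0, 3.0, size=s)      # s-sparse ground truth
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Lasso minimizes (1/2n)||y - X theta||_2^2 + alpha*||theta||_1.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.count_nonzero(lasso.coef_))   # a handful of nonzeros: the l1 norm traps coordinates at zero
print(np.count_nonzero(ridge.coef_))   # typically all d coordinates nonzero: the l2 norm only shrinks
```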

Page 20:

Why does the $\ell_1$ norm work?

Best Subset Selection using the $\ell_0$ regularization:

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_0$,

where $\|\theta\|_0 = \sum_{j=1}^{d} \mathbb{1}(\theta_j \neq 0)$.

Differences ($\ell_0$ vs. $\ell_1$):

Discontinuous vs. Continuous

Nonconvex vs. Convex

Unbiased vs. Biased

Page 21:

Why does the $\ell_1$ norm work?

(Figure: the $\ell_1$ and $\ell_0$ regularizers plotted against $\theta_j$.)

Page 22:

Extensions to Generalized Linear Models

Logistic Lasso (Tibshirani, 1996):

$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i x_i^\top \theta)\big) + \lambda\|\theta\|_1$.

Design Matrix: $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$,

Response Vector: $y = [y_1, \ldots, y_n]^\top \in \{-1, +1\}^n$.

ERM Framework: Loss + $\ell_1$ regularization:

Sparse Support Vector Machine,

Sparse LAD Regression,

...
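A minimal scikit-learn sketch of an $\ell_1$-penalized logistic regression (the data and the value of C are made up; scikit-learn uses an inverse regularization strength C, roughly $C \approx 1/(n\lambda)$ for the scaling above, and its liblinear solver handles the $\ell_1$ penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:5] = 2.0
y = np.sign(X @ theta_star + 0.5 * rng.standard_normal(n))   # labels in {-1, +1}, as on the slide

# penalty="l1" with the liblinear solver fits the l1-penalized logistic loss.
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print(np.count_nonzero(clf.coef_))   # only a few coordinates survive the l1 penalty
```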

Page 23:

Extensions to Undirected Graphical Models

Gaussian Graphical Models: $X = (X_1, \ldots, X_d) \in \mathbb{R}^d \sim N(0, \Sigma)$. Precision Matrix: $\Omega = \Sigma^{-1}$. $X_j$ and $X_k$ are independent given the other variables if $\Omega_{jk} = 0$. The sparsity pattern of $\Omega$ encodes the conditional independence graph $G = (V, E)$.

Graphical Lasso:

$\hat{\Omega} = \arg\min_{\Omega} \; -\log(|\Omega|) + \mathrm{trace}(S^\top \Omega) + \lambda\sum_{j,k} |\Omega_{jk}|$,

Data Matrix: $X = [x_1, \ldots, x_n]^\top$,

Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,

Empirical Covariance: $S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$.
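As an illustration, a small sketch with a made-up chain-structured precision matrix; scikit-learn's GraphicalLasso stands in for a graphical lasso solver, with alpha as the $\ell_1$ penalty level:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 10, 500
# Build a sparse precision matrix: a chain graph on top of the identity.
Omega = np.eye(d)
for j in range(d - 1):
    Omega[j, j + 1] = Omega[j + 1, j] = 0.4
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

est = GraphicalLasso(alpha=0.1).fit(X)           # alpha is the l1 penalty level
Omega_hat = est.precision_
edges = np.abs(Omega_hat) > 1e-4
np.fill_diagonal(edges, False)
print(edges.sum() // 2)                          # recovered edges, ideally d - 1 = 9
```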


Page 25:

Examples of Undirected Graphical Models

(Figure: the estimated undirected graph using the arabidopsis dataset; nodes are genes such as MPDC1, FPPS1, HMGR1, IPPI1, GGPPS11, and DXPS2.)

Page 26:

Group Lasso

Linear Model with Group Structure

$y = \sum_{j=1}^{d} X_{G_j}\theta^*_{G_j} + \varepsilon$,

where $X_{G_j} \in \mathbb{R}^{n \times m_j}$, $\theta^*_{G_j} \in \mathbb{R}^{m_j}$, and $G_j \cap G_k = \emptyset$.

Group Regularization

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\Big\|y - \sum_{j=1}^{d} X_{G_j}\theta_{G_j}\Big\|_2^2 + \lambda\|\theta\|_{1,p}$,

where $2 \leq p \leq \infty$ and $\|\theta\|_{1,p} = \sum_{j=1}^{d} \|\theta_{G_j}\|_p$.

Structural Sparsity Assumption: $\|\theta^*\|_{0,p} = s \ll d$.

Page 27:

Region Sparsity of Brain Medical Imaging

Page 28:

Group Regularization

The group regularization yields joint sparsity over each block of coefficients. What is the difference between the Ridge and the $\ell_2$ norm regularization?

(Figure: the $\ell_2$ and $\ell_\infty$ regularization functions.)

Page 29:

Extension to Multitask Regression

Multitask Regression Models

$Y = X\Theta^* + W$.

Response Matrix: $Y \in \mathbb{R}^{n \times m}$,

Regression Coefficient Matrix: $\Theta^* \in \mathbb{R}^{d \times m}$,

Random Noise: $W$ with i.i.d. Gaussian entries.

Regularization Across Tasks

$\hat{\Theta} = \arg\min_{\Theta} \frac{1}{2n}\|Y - X\Theta\|_F^2 + \lambda\|\Theta\|_{1,p}$,

where $\|\Theta\|_{1,p} = \sum_{j=1}^{d} \big(\sum_{k=1}^{m} |\Theta_{jk}|^p\big)^{1/p}$.

Structural Sparsity Assumption: $\|\Theta^*\|_{0,p} = s \ll d$.

Page 30:

Elastic-net Regularization

Elastic-net Regularized Regression

$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$,

where $\lambda_1$ and $\lambda_2$ are regularization parameters.

Remark:

Extra tuning efforts.

Collinearity

Grouping effects

Ease computation

Page 31:

Elastic-net Regularization

Ridge Regularization:

$\sum_{j=1}^{d} \theta_j^2 \propto \sum_{j > k} \big[(\theta_j - \theta_k)^2 + (\theta_j + \theta_k)^2\big]$.

The Ridge regularization encourages the diminution of the $\theta_j - \theta_k$'s and $\theta_j + \theta_k$'s for highly correlated variables.

Therefore, the elastic-net regularized regression tends to jointly select or remove highly correlated variables.

Extensions: Elastic-net Penalized Logistic/Poisson Regression
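A small scikit-learn sketch of the grouping effect just described (the three nearly identical columns and the penalty levels are made up; compare how the two estimators distribute weight over the correlated columns):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, d = 100, 50
z = rng.standard_normal(n)
X = rng.standard_normal((n, d))
for j in range(3):                                   # three nearly identical columns
    X[:, j] = z + 0.01 * rng.standard_normal(n)
y = 3.0 * z + 0.1 * rng.standard_normal(n)

# ElasticNet objective in scikit-learn:
#   (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2,
# i.e. lambda_1 = alpha*l1_ratio and lambda_2 = alpha*(1 - l1_ratio)/2.
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(np.round(lasso.coef_[:3], 2))   # lasso tends to concentrate weight on one of the columns
print(np.round(enet.coef_[:3], 2))    # elastic-net tends to spread weight across all three
```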

Page 32:

Dantzig Selector

Dantzig Selector:

$\hat{\theta} = \arg\min_{\theta} \|\theta\|_1$ subject to $\frac{1}{n}\|X^\top(y - X\theta)\|_\infty \leq \lambda$.

General Form:

$\hat{\theta} = \arg\min_{\theta} \mathcal{R}(\theta)$ subject to $\mathcal{R}^*\big(\nabla\mathcal{L}(\theta)\big) \leq \lambda$.

Remark:

Essentially linear optimization

Similar performance

Less popular

Page 33:

Dantzig Selector as Linear Program

Parameter Decomposition: $\theta = \theta^+ - \theta^-$.

Reparametrization:

$\min_{\theta^+, \theta^-} \; \mathbf{1}^\top\theta^+ + \mathbf{1}^\top\theta^-$

subject to $X^\top(X\theta^+ - X\theta^- - y) \leq \lambda\mathbf{1}$,

$-\lambda\mathbf{1} \leq X^\top(X\theta^+ - X\theta^- - y)$,

$\theta^+ \geq 0, \; \theta^- \geq 0$.

Remark: Efficiently solved by existing linear programming solvers.
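A minimal SciPy sketch of this linear program (problem sizes and $\lambda$ are made up; linprog with the HiGHS backend stands in for a generic LP solver, and the $1/n$ factor from the previous slide is kept in the constraints):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, lam = 50, 100, 0.2
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = [1.5, -1.0, 2.0]
y = X @ theta_star + 0.1 * rng.standard_normal(n)

A = X.T @ X / n                     # gradient map: (1/n) X'(X theta - y) = A theta - b
b = X.T @ y / n

# Variables z = [theta_plus; theta_minus] >= 0, objective 1'z = ||theta||_1.
c = np.ones(2 * d)
A_ub = np.block([[A, -A], [-A, A]])
b_ub = np.concatenate([b + lam, lam - b])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
theta_hat = res.x[:d] - res.x[d:]
print(np.round(theta_hat[:5], 2))   # the first three coordinates should carry most of the weight
```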

Page 34:

Statistical Properties

Parameter Estimation:

Lasso: $\|\hat{\theta} - \theta^*\|_2^2 = O_P\!\left(\frac{s \log d}{n}\right)$

Group Lasso: $\|\hat{\theta} - \theta^*\|_2^2 = O_P\!\left(\frac{s \log d}{n} + \frac{s\, m_{\max}}{n}\right)$

Remark:

Restricted Eigenvalue Conditions

Light Tail Conditions

Scaling: $s \log d / n \to 0$, $s\, m_{\max} / n \to 0$

Page 35:

Statistical Properties

Variable Selection:

Lasso: $\mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1$

Group Lasso: $\mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1$

Remark:

Restricted Eigenvalue Conditions + Irrepresentable Conditions

Light Tail Conditions

Scaling: $s \log d / n \to 0$, $s\, m_{\max} / n \to 0$

Page 36:

Statistical Properties

Excess Risk Bound:

Lasso: $\mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\!\left(\sqrt{\frac{s \log d}{n}}\right)$

Group Lasso: $\mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\!\left(\sqrt{\frac{s \log d}{n}} + \sqrt{\frac{s\, m_{\max}}{n}}\right)$

Remark:

Statistical Learning Theory vs. Statistics

Bounded Design and Response Conditions

Scaling: $s \log d / n \to 0$, $s\, m_{\max} / n \to 0$

Page 37:

Nonsmooth Convex Optimization

Page 38:

Computational Algorithms

You may have heard:

1 Proximal Gradient Algorithm (Nesterov, 2007)

2 Accelerated Proximal Gradient Algorithm (Beck et al., 2009)

3 Coordinate Descent Algorithm (Friedman et al., 2007)

4 Accelerated Coordinate Descent Algorithm (Lin et al., 2014)

5 Extension to Stochastic Optimization and Parallel Optimization

Page 39:

Proximal Gradient Algorithm

The proximal gradient algorithm is the most fundamental computational algorithm for solving high dimensional sparse estimation problems (Nesterov, 2007).

$\hat{\theta} = \arg\min_{\theta} \underbrace{\mathcal{L}(\theta) + \mathcal{R}_\lambda(\theta)}_{\mathcal{F}_\lambda(\theta)}$.

Remark:

Simple and easy to implement

Handle Complex Regularization

Software packages available in R

Page 40:

Proximal Gradient Algorithm

Given the solution $\theta^{(t)}$, we take

$\theta^{(t+1)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_t}\|\theta - \theta^{(t)}\|_2^2 + \mathcal{R}_\lambda(\theta)$

$= \arg\min_{\theta} \; \frac{1}{2}\|\theta - \theta^{(t)} + \eta_t\nabla\mathcal{L}(\theta^{(t)})\|_2^2 + \eta_t\mathcal{R}_\lambda(\theta)$,

where $\eta_t$ is the step size parameter. Then we have

$\theta^{(t+1)} = \mathcal{T}_{\eta_t\lambda}\big(\theta^{(t)} - \eta_t\nabla\mathcal{L}(\theta^{(t)})\big)$.
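For the lasso, this update is a gradient step on the least-squares loss followed by coordinatewise soft-thresholding. A minimal NumPy sketch (sizes, $\lambda$, and the $1/L$ step size rule are made up for illustration):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_proximal_gradient(X, y, lam, n_iters=500):
    n, d = X.shape
    eta = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # step size 1/L, L = ||X||_2^2 / n
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n          # gradient of (1/2n)||y - X theta||_2^2
        theta = soft_threshold(theta - eta * grad, eta * lam)
    return theta

rng = np.random.default_rng(0)
n, d = 100, 400
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:5] = [3, -2, 1.5, 1, -1]
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_hat = lasso_proximal_gradient(X, y, lam=0.1)
print(np.count_nonzero(theta_hat), np.round(theta_hat[:5], 2))
```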

Page 41:

Proximal Gradient Algorithm

Lasso: At the $t$-th iteration,

$\theta_j^{(t+1)} = \mathrm{sign}\big(\tilde{\theta}_j^{(t+1)}\big)\cdot\max\big\{|\tilde{\theta}_j^{(t+1)}| - \eta\lambda,\, 0\big\}$, where $\tilde{\theta}_j^{(t+1)} = \theta_j^{(t)} - \eta\nabla_j\mathcal{L}(\theta^{(t)})$.

Group Lasso: At the $t$-th iteration,

$\theta_{G_j}^{(t+1)} = \tilde{\theta}_{G_j}^{(t+1)}\cdot\max\Big\{1 - \frac{\lambda\eta}{\|\tilde{\theta}_{G_j}^{(t+1)}\|_2},\, 0\Big\}$, where $\tilde{\theta}_{G_j}^{(t+1)} = \theta_{G_j}^{(t)} - \eta\nabla_{G_j}\mathcal{L}(\theta^{(t)})$.
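A small NumPy sketch of the two proximal maps side by side (the vector and the group structure are made up): the $\ell_1$ prox zeroes small coordinates individually, while the group prox zeroes or shrinks whole blocks.

```python
import numpy as np

def prox_l1(v, tau):
    # coordinatewise soft-thresholding (lasso prox)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_group(v, tau, groups):
    # blockwise soft-thresholding (group lasso prox); groups is a list of index arrays
    out = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > 0:
            out[g] = v[g] * max(1.0 - tau / norm, 0.0)
    return out

v = np.array([0.3, -2.0, 1.2, -0.1, 0.05, 0.02])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
print(prox_l1(v, 0.5))             # small coordinates are zeroed individually
print(prox_group(v, 0.5, groups))  # the whole weak block [3, 4, 5] is zeroed together
```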

Page 42:

Convergence Analysis

Sublinear Rate of Convergence:

$T = O\!\left(\frac{L}{\epsilon}\right)$ iterations such that $\mathcal{F}_\lambda(\theta^{(t+1)}) - \mathcal{F}_\lambda(\hat{\theta}) \leq \epsilon$.

Remark:

$\nabla\mathcal{L}(\cdot)$ is Lipschitz continuous: $\|\nabla\mathcal{L}(\theta') - \nabla\mathcal{L}(\theta)\|_2 \leq L\|\theta' - \theta\|_2$.

$L \leq 1/\eta \leq 2L$ (Guaranteed by line search)

Accelerated version: $O\big(\sqrt{L/\epsilon}\big)$.

Linear rate of convergence requires strong convexity.

Page 43:

Coordinate Descent Algorithm

The coordinate descent algorithm is the most famous computational algorithm for solving high dimensional sparse estimation problems (Friedman et al., 2007, 2010).

Simple and easy to implement

Extremely efficient when the solution is sparse

High precision

Decomposable regularization: $\mathcal{R}_\lambda(\theta) = \sum_{j=1}^{d} r_\lambda(\theta_j)$

Page 44:

Randomized Coordinate Descent Algorithm

At the $t$-th iteration, we sample $j$ from $\{1, \ldots, d\}$ with equal probability, and take

$\theta_j^{(t+1)} = \arg\min_{\theta_j} \; \mathcal{L}(\theta^{(t)}) + \big(\theta_j - \theta_j^{(t)}\big)\nabla_j\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_j}\big(\theta_j - \theta_j^{(t)}\big)^2 + \mathcal{R}_\lambda\big(\theta_{\setminus j}^{(t)}\big) + r_\lambda(\theta_j)$

$= \arg\min_{\theta_j} \; \frac{1}{2}\big(\theta_j - \theta_j^{(t)} + \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big)^2 + \eta_j r_\lambda(\theta_j)$,

where $\eta_j$ is the step size parameter. Then we have

$\theta_j^{(t+1)} = \mathcal{T}_{\eta_j\lambda}\big(\theta_j^{(t)} - \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big)$ and $\theta_{\setminus j}^{(t+1)} = \theta_{\setminus j}^{(t)}$.
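For the lasso, each coordinate update has a closed form: soft-threshold the partial-residual correlation and rescale. A minimal NumPy sketch, using cyclic sweeps instead of random sampling and the cheap residual update (the partial residual trick mentioned on the next slide); sizes and $\lambda$ are made up:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent for (1/2n)||y - X theta||_2^2 + lam*||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    resid = y.copy()                       # residual y - X theta, updated in place
    col_sq = (X ** 2).sum(axis=0) / n      # coordinate constants M_j = x_j'x_j / n for the quadratic loss
    for _ in range(n_sweeps):
        for j in range(d):
            # partial residual trick: add column j's contribution back in
            rho = X[:, j] @ resid / n + col_sq[j] * theta[j]
            new_j = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (theta[j] - new_j)   # cheap O(n) residual update
            theta[j] = new_j
    return theta

rng = np.random.default_rng(0)
n, d = 100, 400
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:5] = [3, -2, 1.5, 1, -1]
y = X @ theta_star + 0.1 * rng.standard_normal(n)
print(np.round(lasso_coordinate_descent(X, y, lam=0.1)[:6], 2))
```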

Page 45:

Convergence Analysis

Sublinear Rate of Convergence:

$T = O\!\left(\frac{d\max_j M_j}{\epsilon}\right)$ iterations such that $\mathbb{E}\,\mathcal{F}_\lambda(\theta^{(t+1)}) - \mathcal{F}_\lambda(\hat{\theta}) \leq \epsilon$.

Remark:

$\nabla_j\mathcal{L}(\cdot, \theta_{\setminus j})$ is Lipschitz continuous for all $j = 1, \ldots, d$: $|\nabla_j\mathcal{L}(\theta_j', \theta_{\setminus j}) - \nabla_j\mathcal{L}(\theta_j, \theta_{\setminus j})| \leq M_j|\theta_j' - \theta_j|$.

$1/\eta_j = M_j$ (Often explicitly calculated)

Accelerated version: $O\big(d\sqrt{\max_j M_j/\epsilon}\big)$

Partial Residual Update Trick.

Page 46:

Warm Start Initialization

Regularization sequence $\{\lambda_K\}_{K=0}^{N}$: $\lambda_0 = \frac{1}{n}\|X^\top y\|_\infty$.

$\{\hat{\theta}_K\}_{K=0}^{N}$ from sparse to dense: $\hat{\theta}_0 = 0$.

(Figure: a chain of problems $\min_\theta \mathcal{L}(\theta) + \mathcal{R}_{\lambda_{K+1}}(\theta)$ with $\lambda_{K+1} = 0.96\,\lambda_K$, each initialized at the previous solution $\hat{\theta}_K$.)
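A sketch of the warm-start path (self-contained, with a proximal gradient inner solver standing in for whichever solver is used; the geometric decay factor of 0.96 mirrors the figure above, the rest is made up):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_path_warm_start(X, y, n_lambdas=20, decay=0.96, n_iters=200):
    """Solve a decreasing lambda sequence, warm-starting each problem at the previous solution."""
    n, d = X.shape
    eta = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)          # proximal gradient step size
    lam = np.linalg.norm(X.T @ y, np.inf) / n            # lambda_0: theta_hat_0 = 0
    theta = np.zeros(d)
    path = []
    for _ in range(n_lambdas):
        for _ in range(n_iters):                          # a few proximal gradient passes
            grad = X.T @ (X @ theta - y) / n
            theta = soft_threshold(theta - eta * grad, eta * lam)
        path.append((lam, theta.copy()))
        lam *= decay                                      # lambda_{K+1} = 0.96 * lambda_K
    return path

rng = np.random.default_rng(0)
n, d = 100, 300
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:5] = [3, -2, 1.5, 1, -1]
y = X @ theta_star + 0.1 * rng.standard_normal(n)

for lam, th in lasso_path_warm_start(X, y)[::5]:
    print(round(lam, 3), np.count_nonzero(th))            # solutions go from sparse to dense
```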

Page 47:

Active Set Strategy

(Figure sequence, pages 47-53: coordinates 1-12 are partitioned into an active set A and its complement, the inactive set; across iterations the partition is updated, e.g., {2, 5, 8, 11} forms the active set at one stage, with coordinates later entering and leaving it.)

Page 54:

Solution Path

(Figure: Lasso and Elastic-net solution paths.)

Page 55:

Proximal Newton Algorithm

Given the solution $\theta^{(t)}$, we take

$\theta^{(t+0.5)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^\top\nabla^2\mathcal{L}(\theta^{(t)})(\theta - \theta^{(t)}) + \mathcal{R}_\lambda(\theta)$.

Combined with the backtracking line search, we have

$\theta^{(t+1)} = \theta^{(t)} + \eta\big(\theta^{(t+0.5)} - \theta^{(t)}\big)$.

Remark: Each subproblem is solved by the coordinate descent algorithm. The proximal Newton algorithm can be much more efficient than the coordinate descent algorithm in practice.

Page 56:

Software Libraries

Page 57:

Available Packages

glmnet: Lasso, Logistic/Poisson Lasso. Developed by J. Friedman; maintained by T. Hastie.

PICASSO: Lasso, Logistic/Poisson Lasso.

huge: Graphical Lasso.

liblinear: Sparse Support Vector Machine, Logistic Lasso. Developed by C. Jin; maintained by T. Helleputte.

QUIC: Graphical Lasso. Developed by C. Hsieh; maintained by M. Sustik.
