
Lecture 5: Variable Selection and Sparsity

Tuo Zhao

Schools of ISyE and CSE, Georgia Tech

High Dimensional Variable Selection

ISYE/CSE 6740: Computational Data Analysis

Linear Models

The simplest regression model in the world:

y = Xθ* + ε.

Design Matrix: X ∈ R^{n×d},
Response Vector: y ∈ R^n,
Random Noise: ε ∼ N(0, σ²I_n).

n > d: Ordinary Least Squares Estimator (equivalent to the MLE)

θ_o = (X^⊤X)^{-1} X^⊤ y.

d ≫ n: X^⊤X is not invertible. What to do?

Tuo Zhao — Lecture 5: Variable Selection and Sparsity 3/49
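As a quick sanity check, the OLS estimator above can be computed in a few lines of NumPy. This is a minimal sketch; the synthetic data (n, d, σ, θ*) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# OLS estimator theta_o = (X^T X)^{-1} X^T y for the n > d case,
# on synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.1
X = rng.standard_normal((n, d))
theta_star = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_star + sigma * rng.standard_normal(n)

# np.linalg.lstsq solves the least-squares problem without explicitly
# forming (X^T X)^{-1}, which is numerically preferable.
theta_o, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(theta_o - theta_star)))  # small estimation error
```

With n well above d and small noise, the estimate recovers θ* up to noise-level error.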


Motivating Example: Credit Card

[Figure slides (pages 4-6): credit card data example.]

Motivating Example: Medical Imaging

[Figure slide (page 7): medical imaging example.]

Sparsity-Inducing Norm Regularization

[Figure: the linear model y = Xθ* + ε, with y ∈ R^n, an n × d design matrix X, and a sparse coefficient vector θ*.]

Sparsity Assumption: ∑_{j=1}^d 1(θ*_j ≠ 0) = s ≪ d.

Greedy Selection and Ridge Estimator

What we learned in textbooks:

Forward Selection: always increases the model size.
Backward Selection: always decreases the model size.
Stepwise Selection: dynamically adjusts the model size.
Hypothesis Testing: a t-test for each coefficient.
Ridge Estimator: the model size is fixed.

This lecture is about: Lasso, Logistic Lasso, Graphical Lasso, Group Lasso, Elastic-net, Dantzig Selector, ...


Lasso and Ridge Regression

Lasso Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² subject to ‖θ‖₁ ≤ R,

where R is a tuning parameter.

Ridge Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² subject to ‖θ‖₂ ≤ R.

Geometric Intuition

Linear regression via the Lasso (Tibshirani, 1995). Given observations {(y_i, x_i1, ..., x_ip)}_{i=1}^N,

min_β ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² subject to ∑_{j=1}^p |β_j| ≤ t.

Similar to ridge regression, which has the constraint ∑_j β_j² ≤ t.

Lasso does variable selection and shrinkage, while ridge only shrinks.

[Figure: contours of the least-squares loss meeting the ℓ1 ball, whose corners give sparse solutions, and the ℓ2 ball. Slide adapted from Trevor Hastie, useR! 2009.]

Regularized Least Squares Regression

Lasso (Tibshirani, 1996):

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₁,

where λ > 0 is the regularization parameter.

Ridge Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₂².

Remark: The ℓ1 norm can trap some coordinates at exactly zero.

Why does the ℓ1 norm work?

Best Subset Selection using ℓ0 regularization:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₀,

where ‖θ‖₀ = ∑_{j=1}^d 1(θ_j ≠ 0).

Differences (ℓ0 vs. ℓ1):

Discontinuous vs. continuous
Nonconvex vs. convex
Unbiased vs. biased
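The continuity/bias contrast can be seen directly from the two proximal maps: the ℓ1 penalty yields the continuous but biased soft-thresholding operator, while the ℓ0 penalty yields the discontinuous but unbiased hard-thresholding operator. A minimal sketch, with an illustrative threshold level:

```python
import numpy as np

def soft_threshold(z, lam):
    # prox of lam*|.|: continuous, but shrinks survivors by lam (bias)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    # l0-style thresholding: discontinuous at |z| = lam,
    # but survivors are kept unchanged (no bias)
    return np.where(np.abs(z) > lam, z, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(soft_threshold(z, 1.0))  # large entries shrunk toward zero
print(hard_threshold(z, 1.0))  # large entries kept as-is
```

Both maps zero out small inputs; only the soft threshold shifts the surviving entries, which is the bias referred to on the slide.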

Why does the ℓ1 norm work?

[Figure: the ℓ1 and ℓ0 regularizers plotted as functions of θ_j.]

Extensions to Generalized Linear Models

Logistic Lasso (Tibshirani, 1996):

θ̂ = argmin_θ (1/n) ∑_{i=1}^n log(1 + exp(−y_i x_i^⊤ θ)) + λ‖θ‖₁.

Design Matrix: X = [x_1, ..., x_n]^⊤ ∈ R^{n×d},
Response Vector: y = [y_1, ..., y_n]^⊤ ∈ {−1, +1}^n.

ERM Framework: Loss + ℓ1 regularization:

Sparse Support Vector Machine,
Sparse LAD Regression,
... ...
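A minimal sketch of evaluating the logistic-lasso objective above; the synthetic data and the value of λ are illustrative assumptions. At θ = 0 every margin is zero, so the loss term is exactly log 2, which makes a convenient baseline.

```python
import numpy as np

def logistic_lasso_objective(theta, X, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i x_i^T theta)) + lam * ||theta||_1
    margins = y * (X @ theta)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.sum(np.abs(theta))

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
theta_star = np.array([2.0, -1.0, 0.0, 0.0])
y = np.sign(X @ theta_star + 0.1 * rng.standard_normal(n))

f_star = logistic_lasso_objective(theta_star, X, y, lam=0.01)
f_zero = logistic_lasso_objective(np.zeros(d), X, y, lam=0.01)
print(f_star < f_zero)  # the true parameter beats theta = 0
```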

Extensions to Undirected Graphical Models

Gaussian Graphical Models: X = (X_1, ..., X_d) ∈ R^d ∼ N(0, Σ). Precision Matrix: Ω = Σ^{-1}. X_j and X_k are independent given the other variables if Ω_jk = 0. The sparsity pattern of Ω encodes the conditional independence graph G = (V, E).

Graphical Lasso:

Ω̂ = argmin_Ω −log|Ω| + trace(S^⊤Ω) + λ ∑_{j,k} |Ω_jk|,

Data Matrix: X = [x_1, ..., x_n]^⊤,
Sample Mean: x̄ = (1/n) ∑_{i=1}^n x_i,
Empirical Covariance: S = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^⊤.
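The input S to the graphical lasso is just the (biased) empirical covariance from the slide. A minimal sketch on synthetic data (the data are an assumption), checked against NumPy's built-in covariance:

```python
import numpy as np

# Empirical covariance S = (1/n) sum_i (x_i - xbar)(x_i - xbar)^T,
# the data-dependent input to the graphical lasso.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))

xbar = X.mean(axis=0)
Xc = X - xbar
S = (Xc.T @ Xc) / n  # divide by n (not n-1), matching the slide

# np.cov expects variables in rows, hence X.T; bias=True divides by n.
print(np.allclose(S, np.cov(X.T, bias=True)))  # True
```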


Examples of Undirected Graphical Models

[Figure: the estimated undirected graph using the Arabidopsis dataset; nodes are genes such as MPDC1, FPPS1, HMGR1, GPPS, DXR, and MECPS.]

Group Lasso

Linear Model with Group Structure:

y = ∑_{j=1}^d X_{G_j} θ*_{G_j} + ε,

where X_{G_j} ∈ R^{n×m_j}, θ*_{G_j} ∈ R^{m_j}, and G_j ∩ G_k = ∅ for j ≠ k.

Group Regularization:

θ̂ = argmin_θ (1/2n)‖y − ∑_{j=1}^d X_{G_j} θ_{G_j}‖₂² + λ‖θ‖_{1,p},

where 2 ≤ p ≤ ∞ and ‖θ‖_{1,p} = ∑_{j=1}^d ‖θ_{G_j}‖_p.

Structural Sparsity Assumption: ‖θ*‖_{0,p} = s ≪ d.
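The group norm ‖θ‖_{1,p} above is an ℓ1 norm across groups of ℓp norms within groups, which is why it zeroes out whole groups at once. A minimal sketch (the grouping below is illustrative):

```python
import numpy as np

def group_norm(theta, groups, p):
    # ||theta||_{1,p} = sum_j ||theta_{G_j}||_p over disjoint groups
    return sum(np.linalg.norm(theta[g], ord=p) for g in groups)

theta = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]

print(group_norm(theta, groups, p=2))       # ||(3,4)||_2 + 0 + 5 = 10
print(group_norm(theta, groups, p=np.inf))  # max(3,4) + 0 + 5 = 9
```

Note the middle group contributes 0 regardless of p: an entirely zero group costs nothing, which is the structural sparsity the penalty encourages.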

Region Sparsity of Brain Medical Imaging

[Figure slide (page 19): region sparsity in brain imaging.]

Group Regularization

The group regularization yields joint sparsity over each block of coefficients. What is the difference between the Ridge and the ℓ2 norm regularization?

[Figure: the ℓ2 and ℓ∞ regularization functions.]

Extension to Multitask Regression

Multitask Regression Models:

Y = XΘ* + W.

Response Matrix: Y ∈ R^{n×m},
Regression Coefficient Matrix: Θ* ∈ R^{d×m},
Random Noise: W has i.i.d. Gaussian entries.

Regularization Across Tasks:

Θ̂ = argmin_Θ (1/2n)‖Y − XΘ‖_F² + λ‖Θ‖_{1,p},

where ‖Θ‖_{1,p} = ∑_{j=1}^d (∑_{k=1}^m |Θ_jk|^p)^{1/p}.

Structural Sparsity Assumption: ‖Θ*‖_{0,p} = s ≪ d.

Elastic-net Regularization

Elastic-net Regularized Regression:

θ̂ = argmin_θ (1/2n)‖y − Xθ‖₂² + λ₁‖θ‖₁ + λ₂‖θ‖₂²,

where λ₁ and λ₂ are regularization parameters.

Remark:

Extra tuning effort (two parameters instead of one)
Handles collinearity
Grouping effects for correlated variables
Eases computation (the objective becomes strongly convex)

Elastic-net Regularization

Ridge Regularization:

∑_{j=1}^d θ_j² ∝ ∑_{j>k} [(θ_j − θ_k)² + (θ_j + θ_k)²].

The Ridge regularization encourages shrinkage of the differences θ_j − θ_k (and sums θ_j + θ_k), pulling the coefficients of highly correlated variables toward each other.

Therefore, the elastic-net regularized regression tends to jointly select or remove highly correlated variables.

Extensions: Elastic-net Penalized Logistic/Poisson Regression.
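The proportionality above is an exact algebraic identity: since (a − b)² + (a + b)² = 2a² + 2b², each index j appears in d − 1 pairs, so the pairwise sum equals 2(d − 1)∑_j θ_j². A quick numerical check (the random θ is illustrative):

```python
import numpy as np

# Verify: sum_{j>k} [(t_j - t_k)^2 + (t_j + t_k)^2] = 2(d-1) * sum_j t_j^2
rng = np.random.default_rng(0)
d = 6
theta = rng.standard_normal(d)

pairwise = sum((theta[j] - theta[k]) ** 2 + (theta[j] + theta[k]) ** 2
               for j in range(d) for k in range(j))
print(np.isclose(pairwise, 2 * (d - 1) * np.sum(theta ** 2)))  # True
```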

Dantzig Selector

Dantzig Selector:

θ̂ = argmin_θ ‖θ‖₁ subject to (1/n)‖X^⊤(y − Xθ)‖∞ ≤ λ.

General Form:

θ̂ = argmin_θ R(θ) subject to R*(∇L(θ)) ≤ λ.

Remark:

Essentially linear optimization
Similar statistical performance to the Lasso
Less popular in practice

Dantzig Selector as a Linear Program

Parameter Decomposition: θ = θ⁺ − θ⁻.

Reparametrization:

min_{θ⁺,θ⁻} 1^⊤θ⁺ + 1^⊤θ⁻

subject to X^⊤(Xθ⁺ − Xθ⁻ − y) ≤ λ1,
−λ1 ≤ X^⊤(Xθ⁺ − Xθ⁻ − y),
θ⁺ ≥ 0, θ⁻ ≥ 0.

Remark: Efficiently solved by existing linear programming solvers.
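As a sketch, the LP above can be handed to any off-the-shelf LP solver; here SciPy's linprog is used (the solver choice and the synthetic data are assumptions, not from the slides), with the 1/n scaling of the Dantzig constraint folded in:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
theta_star = np.array([2.0, 0.0, -1.5, 0.0])
y = X @ theta_star + 0.01 * rng.standard_normal(n)

lam = 0.05
G = X.T @ X / n                         # (1/n) X^T X
b = X.T @ y / n                         # (1/n) X^T y

# Variables z = [theta_plus; theta_minus] >= 0; minimize 1^T z.
c = np.ones(2 * d)
A_ub = np.vstack([np.hstack([G, -G]),   #  (1/n) X^T (X theta - y) <= lam 1
                  np.hstack([-G, G])])  # -(1/n) X^T (X theta - y) <= lam 1
b_ub = np.concatenate([lam + b, lam - b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

theta_hat = res.x[:d] - res.x[d:]
# Feasibility: the residual correlation is capped at lam.
print(np.max(np.abs(X.T @ (y - X @ theta_hat))) / n <= lam + 1e-6)
```

With small λ and a well-conditioned design, θ̂ lands close to θ* and keeps the zero coordinates small.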

Statistical Properties

Parameter Estimation:

Lasso: ‖θ̂ − θ*‖₂² = O_P(s log d / n),
Group Lasso: ‖θ̂ − θ*‖₂² = O_P(s log d / n + s·m_max / n).

Remark:

Restricted Eigenvalue Conditions
Light Tail Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Statistical Properties

Variable Selection:

Lasso: P(sign(θ̂) = sign(θ*)) → 1,
Group Lasso: P(sign(θ̂) = sign(θ*)) → 1.

Remark:

Restricted Eigenvalue Conditions + Irrepresentable Conditions
Light Tail Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Statistical Properties

Excess Risk Bound:

Lasso: E L(θ̂) − E L(θ*) = O_P(√(s log d / n)),
Group Lasso: E L(θ̂) − E L(θ*) = O_P(√(s log d / n) + √(s·m_max / n)).

Remark:

Statistical Learning Theory vs. Statistics
Bounded Design and Response Conditions
Scaling: s log d / n → 0, s·m_max / n → 0

Nonsmooth Convex Optimization

Computational Algorithms

You may have heard of:

1. Proximal Gradient Algorithm (Nesterov, 2007)
2. Accelerated Proximal Gradient Algorithm (Beck et al., 2009)
3. Coordinate Descent Algorithm (Friedman et al., 2007)
4. Accelerated Coordinate Descent Algorithm (Lin et al., 2014)
5. Extensions to Stochastic and Parallel Optimization

Proximal Gradient Algorithm

The proximal gradient algorithm is the most fundamental computational algorithm for solving high-dimensional sparse estimation problems (Nesterov, 2007):

θ̂ = argmin_θ L(θ) + R_λ(θ) =: F_λ(θ).

Remark:

Simple and easy to implement
Handles complex regularization
Software packages available in R

Proximal Gradient Algorithm

Given the solution θ^(t), we take

θ^(t+1) = argmin_θ L(θ^(t)) + (θ − θ^(t))^⊤ ∇L(θ^(t)) + (1/2η_t)‖θ − θ^(t)‖₂² + R_λ(θ)
        = argmin_θ (1/2)‖θ − θ^(t) + η_t ∇L(θ^(t))‖₂² + η_t R_λ(θ),

where η_t is the step size parameter. Then we have

θ^(t+1) = T_{η_t λ}(θ^(t) − η_t ∇L(θ^(t))).

Proximal Gradient Algorithm

Lasso: At the t-th iteration,

θ_j^(t+1) = sign(θ̃_j^(t+1)) · max{|θ̃_j^(t+1)| − ηλ, 0}, where θ̃_j^(t+1) = θ_j^(t) − η ∇_j L(θ^(t)).

Group Lasso: At the t-th iteration,

θ_{G_j}^(t+1) = θ̃_{G_j}^(t+1) · max{1 − λη / ‖θ̃_{G_j}^(t+1)‖₂, 0}, where θ̃_{G_j}^(t+1) = θ_{G_j}^(t) − η ∇_{G_j} L(θ^(t)).
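The Lasso update above is easy to run end-to-end. A minimal NumPy sketch of the resulting proximal gradient (ISTA) iteration on synthetic data; the data, the step-size rule η = 1/L with L = ‖X‖₂²/n, and the iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal map of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = [2.0, -3.0, 1.5]          # s = 3 sparse truth
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam = 0.1
eta = n / np.linalg.norm(X, 2) ** 2        # 1/L, L = sigma_max(X)^2 / n
theta = np.zeros(d)
for _ in range(500):
    grad = -X.T @ (y - X @ theta) / n      # gradient of (1/2n)||y - X theta||^2
    theta = soft_threshold(theta - eta * grad, eta * lam)

print(np.sum(theta != 0), np.max(np.abs(theta - theta_star)))
```

The iterates stay sparse (most coordinates are trapped at zero) while the large true coefficients survive, up to the ℓ1 shrinkage bias of order λ.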

Convergence Analysis

Sublinear Rate of Convergence:

T = O(L/ε) iterations such that F_λ(θ^(t+1)) − F_λ(θ̂) ≤ ε.

Remark:

∇L(·) is Lipschitz continuous: ‖∇L(θ′) − ∇L(θ)‖₂ ≤ L‖θ′ − θ‖₂.
L ≤ 1/η ≤ 2L (guaranteed by line search).
Accelerated version: O(√(L/ε)).
Linear rate of convergence requires strong convexity.

Coordinate Descent Algorithm

The coordinate descent algorithm is the most famous computational algorithm for solving high-dimensional sparse estimation problems (Friedman et al., 2007, 2010):

Simple and easy to implement
Extremely efficient when the solution is sparse
High precision
Decomposable regularization: R_λ(θ) = ∑_{j=1}^d r_λ(θ_j)

Randomized Coordinate Descent Algorithm

At the t-th iteration, we sample j from {1, ..., d} with equal probability, and take

θ_j^(t+1) = argmin_{θ_j} L(θ^(t)) + (θ_j − θ_j^(t)) ∇_j L(θ^(t)) + (1/2η_j)(θ_j − θ_j^(t))² + R_λ(θ_{\j}^(t)) + r_λ(θ_j)
          = argmin_{θ_j} (1/2)(θ_j − θ_j^(t) + η_j ∇_j L(θ^(t)))² + η_j r_λ(θ_j),

where η_j is the step size parameter. Then we have

θ_j^(t+1) = T_{η_j λ}(θ_j^(t) − η_j ∇_j L(θ^(t))) and θ_{\j}^(t+1) = θ_{\j}^(t).
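Specialized to the Lasso, where r_λ(θ_j) = λ|θ_j|, the coordinate update has a closed form. A minimal sketch below uses a cyclic sweep instead of random sampling (for determinism; the randomized version updates the same way) and includes the partial residual update trick mentioned later. The synthetic data are an assumption:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:2] = [1.5, -2.0]
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam = 0.1
M = (X ** 2).sum(axis=0) / n      # coordinate Lipschitz constants M_j
theta = np.zeros(d)
r = y - X @ theta                 # residual, maintained incrementally
for _ in range(100):
    for j in range(d):
        grad_j = -X[:, j] @ r / n                       # nabla_j L(theta)
        new_j = soft_threshold(theta[j] - grad_j / M[j], lam / M[j])
        r += X[:, j] * (theta[j] - new_j)               # partial residual update
        theta[j] = new_j

print(np.max(np.abs(theta - theta_star)))
```

Maintaining the residual r makes each coordinate update O(n) instead of O(nd), which is the source of the algorithm's efficiency on sparse problems.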

Convergence Analysis

Sublinear Rate of Convergence:

T = O(d·max_j M_j / ε) iterations such that E[F_λ(θ^(t+1))] − F_λ(θ̂) ≤ ε.

Remark:

∇_j L(·, θ_{\j}) is Lipschitz continuous for all j = 1, ..., d: |∇_j L(θ′_j, θ_{\j}) − ∇_j L(θ_j, θ_{\j})| ≤ M_j |θ′_j − θ_j|.
1/η_j = M_j (often explicitly calculated).
Accelerated version: O(d√(max_j M_j / ε)).
Partial residual update trick.

Warm Start Initialization

Regularization sequence {λ_K}_{K=0}^N with λ_0 = (1/n)‖X^⊤y‖∞.

Estimators {θ̂_K}_{K=0}^N go from sparse to dense, starting at θ̂_0 = 0.

[Figure: each problem min_θ L(θ) + R_{λ_{K+1}}(θ) is initialized at the previous solution θ̂_K, with a geometrically decaying sequence, e.g. λ_{K+1} = 0.96 λ_K.]
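A sketch of the start of the path: λ_0 = (1/n)‖X^⊤y‖∞ is exactly the threshold at which θ̂ = 0, which one proximal step from zero verifies. The geometric decay factor 0.96 follows my reading of the (garbled) slide and is otherwise an assumption; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -1.0] + [0.0] * 6) + 0.1 * rng.standard_normal(n)

lam0 = np.max(np.abs(X.T @ y)) / n          # lambda_0 = (1/n)||X^T y||_inf
lams = lam0 * 0.96 ** np.arange(10)         # sparse-to-dense path

# One proximal gradient step from theta = 0: the argument of the soft
# threshold is eta * X^T y / n, whose largest entry equals eta * lam0,
# so thresholding at eta * lam0 returns exactly 0 (theta_hat_0 = 0).
eta = n / np.linalg.norm(X, 2) ** 2
g = X.T @ y / n
step = np.sign(eta * g) * np.maximum(np.abs(eta * g) - eta * lam0, 0.0)
print(np.allclose(step, 0.0))  # True
```

For any λ < λ_0 the threshold no longer kills the largest correlation, so coordinates begin to enter the model as the path proceeds.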

Active Set Strategy

[Figure slides (pages 39-45): the coordinates 1-12 are split into an active set A and an inactive set; the algorithm iterates over the active set only, and coordinates move between the two sets across iterations.]

Solution Path

[Figure: Lasso and Elastic-net solution paths.]

Proximal Newton Algorithm

Given the solution θ^(t), we take

θ^(t+0.5) = argmin_θ L(θ^(t)) + (θ − θ^(t))^⊤ ∇L(θ^(t)) + (1/2)(θ − θ^(t))^⊤ ∇²L(θ^(t)) (θ − θ^(t)) + R_λ(θ).

Combined with the backtracking line search, we have

θ^(t+1) = θ^(t) + η(θ^(t+0.5) − θ^(t)).

Remark: Each subproblem is solved by the coordinate descent algorithm. The proximal Newton algorithm can be much more efficient than the coordinate descent algorithm in practice.

Software Libraries

Available Packages

glmnet: Lasso, Logistic/Poisson Lasso. Developed by J. Friedman; maintained by T. Hastie.

PICASSO: Lasso, Logistic/Poisson Lasso.

huge: Graphical Lasso.

liblinear: Sparse Support Vector Machine, Logistic Lasso. Developed by C. Jin; maintained by T. Helleputte.

QUIC: Graphical Lasso. Developed by C. Hsieh; maintained by M. Sustik.