Transcript of "Statistics for high-dimensional data: Group Lasso and additive models" (...buhlmann/teaching/presentation4.pdf)

Page 1:

Statistics for high-dimensional data: Group Lasso and additive models

Peter Bühlmann and Sara van de Geer

Seminar für Statistik, ETH Zürich

May 2012

Page 2:

The Group Lasso (Yuan & Lin, 2006)

the high-dimensional parameter vector is structured into q groups or partitions (known a priori):

$$G_1, \ldots, G_q \subseteq \{1, \ldots, p\}, \quad \text{disjoint, with } \cup_g G_g = \{1, \ldots, p\}$$

corresponding coefficients: $\beta_G = \{\beta_j;\ j \in G\}$

Page 3:

Example: categorical covariates. $X^{(1)}, \ldots, X^{(p)}$ are factors (categorical variables), each with 4 levels (e.g. "letters" from DNA)

for encoding a main effect: 3 parameters
for encoding a first-order interaction: 9 parameters
and so on ...

the parameterization (e.g. sum contrasts) is structured as follows:
- intercept: no penalty
- main effect of $X^{(1)}$: group $G_1$ with df = 3
- main effect of $X^{(2)}$: group $G_2$ with df = 3
- ...
- first-order interaction of $X^{(1)}$ and $X^{(2)}$: $G_{p+1}$ with df = 9
- ...

often, we want sparsity on the group level: either all parameters of an effect are zero, or none are

Page 4:

often, we want sparsity on the group level: either all parameters of an effect are zero, or none are

this can be achieved with the Group Lasso penalty

$$\lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2, \qquad \|\beta_{G_g}\|_2 = \sqrt{\sum_{j \in G_g} \beta_j^2},$$

typically with $m_g = \sqrt{|G_g|}$
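
For concreteness, a minimal numpy sketch of this penalty; the function name and the 0-based group index lists are illustrative, and $m_g = \sqrt{|G_g|}$ follows the slide:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Evaluate lam * sum_g m_g * ||beta_{G_g}||_2 with m_g = sqrt(|G_g|)."""
    return lam * sum(np.sqrt(len(G)) * np.linalg.norm(beta[list(G)]) for G in groups)

# e.g. p = 5 coefficients split into two groups (0-based indices)
beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
print(group_lasso_penalty(beta, groups=[[0, 1], [2, 3, 4]], lam=0.5))
```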

Page 5:

properties of the Group Lasso penalty:
- for group sizes $|G_g| \equiv 1$ ⇒ standard Lasso penalty
- convex penalty ⇒ convex optimization for standard likelihoods (exponential family models)
- either $(\hat\beta_G(\lambda))_j = 0$ for all $j \in G$, or $\neq 0$ for all $j \in G$
- the penalty is invariant under orthonormal transformations, e.g. invariant when requiring an orthonormal parameterization for factors

Page 6:

DNA splice site detection: (mainly) a prediction problem

DNA sequence:

. . . ACGGC . . .  E E E  GC  I I I I  . . . AAC . . .
(3 exon positions, then the potential donor site "GC", then 4 intron positions)

response $Y \in \{0, 1\}$: splice or non-splice site
predictor variables: 7 factors, each having 4 levels (full dimension: $4^7 = 16{,}384$)

data:
training: 5,610 true splice sites and 5,610 non-splice sites, plus an unbalanced validation set
test data: 4,208 true splice sites and 89,717 non-splice sites

Page 7:

logistic regression:

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \text{main effects} + \text{first-order interactions} + \ldots$$

up to second-order interactions: 1156 parameters (1 intercept, $7 \cdot 3 = 21$ main-effect, $21 \cdot 9 = 189$ first-order and $35 \cdot 27 = 945$ second-order interaction parameters)

use the Group Lasso, which selects whole terms

Page 8:

[Figure: $\ell_2$-norms of the estimated coefficient groups per term; top panel: main effects 1–7 and first-order interactions (1:2, ..., 6:7); bottom panel: second-order interactions (1:2:3, ..., 5:6:7); legend: GL, GL/R, GL/MLE]

- mainly neighboring DNA positions show interactions (this has been "known" and "debated")
- no interactions among exons and introns (with the Group Lasso method)
- no second-order interactions (with the Group Lasso method)

Page 9:

predictive power: competitive with "state of the art" maximum entropy modeling from Yeo and Burge (2004)

correlation between true and predicted class:

Logistic Group Lasso: 0.6593
max. entropy (Yeo and Burge): 0.6589

our model (not necessarily the method/algorithm) is simple and has a clear interpretation

Page 10:

Generalized group Lasso penalty

$$\lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}},$$

where the $A_j$ are $T_j \times T_j$ positive definite matrices

⇒ generalized group Lasso:

$$\hat\beta = \arg\min_{\beta}\left(\|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}}\right)$$

Page 11:

reparameterize:

$$\tilde\beta_{G_j} = A_j^{1/2}\beta_{G_j}, \qquad \tilde{X}_{G_j} = X_{G_j} A_j^{-1/2}$$

then

$$\hat\beta = \arg\min_{\beta}\left(\|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}}\right)$$

can be derived from

$$\hat\beta_{G_j} = A_j^{-1/2}\hat{\tilde\beta}_{G_j}, \qquad \hat{\tilde\beta} = \arg\min_{\tilde\beta}\left(\|Y - \tilde{X}\tilde\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \|\tilde\beta_{G_j}\|_2\right)$$
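
A small numerical check of this reduction (a sketch; it uses a Cholesky factor $R$ with $R^T R = A_j$ in place of the symmetric square root $A_j^{1/2}$, which works equally well since only $\beta^T A_j \beta = \|R\beta\|_2^2$ matters):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
beta = rng.standard_normal(T)
M = rng.standard_normal((T, T))
A = M @ M.T + np.eye(T)                    # a positive definite A_j

R = np.linalg.cholesky(A).T                # A = R^T R; any square root of A works
beta_tilde = R @ beta                      # reparameterized block coefficients
assert np.isclose(np.sqrt(beta @ A @ beta), np.linalg.norm(beta_tilde))
# with X_tilde = X @ np.linalg.inv(R) one also has X_tilde @ beta_tilde == X @ beta,
# so the generalized problem is a standard group Lasso in (X_tilde, beta_tilde)
```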

Page 12:

Groupwise prediction penalty and parameterization invariance

$$\lambda \sum_{j=1}^{q} m_j \|X_{G_j}\beta_{G_j}\|_2 = \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T X_{G_j}^T X_{G_j}\beta_{G_j}}$$

is a generalized group Lasso penalty if the $X_{G_j}^T X_{G_j}$ are positive definite (i.e. necessarily $|G_j| \le n$)

this penalty is invariant under any (invertible) transformation within groups, i.e. one can use $\tilde\beta_{G_j} = B_j\beta_{G_j}$ where $B_j$ is any $T_j \times T_j$ invertible matrix ⇒

$$X_{G_j}\hat\beta_{G_j} = \tilde{X}_{G_j}\hat{\tilde\beta}_{G_j}, \qquad \{j;\ \hat\beta_{G_j} \neq 0\} = \{j;\ \hat{\tilde\beta}_{G_j} \neq 0\}$$

Page 13:

Some aspects from theory

"again":
- optimal prediction and estimation (oracle inequality)
- group screening: $\hat{S} \supseteq S_0$ (the set of active groups) with high probability

but listen to Sara in ≈ "a few" minutes

interesting case:
- the $G_j$'s are "large"
- the $\beta_{G_j}$'s are "smooth"

Page 14:

example: high-dimensional additive model

$$Y = \sum_{j=1}^{p} f_j(X^{(j)}) + \varepsilon$$

and expand each $f_j$ in basis functions $B_k^{(j)}$:

$$f_j(x^{(j)}) = \sum_{k=1}^{n} \beta_k^{(j)} B_k^{(j)}(x^{(j)}), \qquad (\beta_{G_j})_k = \beta_k^{(j)}$$

$f_j(\cdot)$ smooth ⇒ "smoothness" of $\beta_{G_j}$
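
To make the expansion concrete, a sketch of one design block $H_j$ using a truncated power spline basis (a hypothetical stand-in for the $B_k^{(j)}$; any B-spline or similar basis would do):

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Evaluate a truncated power spline basis at points x.

    A stand-in for the basis functions B^{(j)}_k above: polynomial
    terms x, ..., x^degree plus one term (x - t)_+^degree per knot.
    """
    cols = [x**d for d in range(1, degree + 1)]
    cols += [np.maximum(x - t, 0.0)**degree for t in knots]
    return np.column_stack(cols)

# design block for covariate j: one column per basis function, so that
# f_j(x) = H_j @ beta_Gj and group G_j collects all its coefficients
x_j = np.random.default_rng(0).uniform(size=100)
H_j = truncated_power_basis(x_j, knots=np.linspace(0.1, 0.9, 5))
print(H_j.shape)  # (100, 8): 3 polynomial + 5 knot terms
```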

Page 15:

Computation and KKT

criterion function:

$$Q_\lambda(\beta) = n^{-1}\sum_{i=1}^{n} \rho_\beta(x_i, Y_i) + \lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2,$$

with loss function $\rho_\beta(\cdot, \cdot)$ convex in $\beta$

KKT conditions:

$$\nabla\rho(\beta)_{G_g} + \lambda m_g \frac{\beta_{G_g}}{\|\beta_{G_g}\|_2} = 0 \quad \text{if } \beta_{G_g} \neq 0 \text{ (not the 0-vector)},$$

$$\|\nabla\rho(\beta)_{G_g}\|_2 \le \lambda m_g \quad \text{if } \beta_{G_g} \equiv 0.$$
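
These conditions double as a convergence check in code. A minimal sketch (names illustrative; the gradient scaling follows the criterion $Q_\lambda$ above):

```python
import numpy as np

def kkt_violation(grad_blocks, beta_blocks, lam, m):
    """Maximum violation of the group Lasso KKT conditions.

    grad_blocks[g]: gradient of the smooth loss w.r.t. beta_{G_g}; e.g. for
    squared error loss rho(beta) = ||Y - X beta||_2^2 / n this is
    -2/n * X_g^T (Y - X beta).
    """
    worst = 0.0
    for g, (grad, beta) in enumerate(zip(grad_blocks, beta_blocks)):
        if np.linalg.norm(beta) > 0:  # active group: gradient must cancel the penalty subgradient
            viol = np.linalg.norm(grad + lam * m[g] * beta / np.linalg.norm(beta))
        else:                         # inactive group: gradient must stay inside the dual ball
            viol = max(0.0, np.linalg.norm(grad) - lam * m[g])
        worst = max(worst, viol)
    return worst
```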

Page 16:

Block coordinate descent algorithm

generic description, for both Lasso and Group Lasso problems:
- cycle through all coordinates $j = 1, \ldots, p, 1, 2, \ldots$ (or $j = 1, \ldots, q, 1, 2, \ldots$)
- optimize the penalized log-likelihood w.r.t. $\beta_j$ (or $\beta_{G_j}$), keeping all other coefficients $\beta_k$, $k \neq j$ (or $k \notin G_j$) fixed

(a closed-form sketch of one such sweep for squared error loss follows the overlay sequence below)

Pages 16–20 step through one Lasso sweep; the coordinate currently being optimized carries no superscript, $\beta^{(0)}$ denotes an old value and $\beta^{(1)}$ an updated one:

$(\beta_1,\ \beta_2 = \beta_2^{(0)},\ \ldots,\ \beta_j = \beta_j^{(0)},\ \ldots,\ \beta_p = \beta_p^{(0)})$
$(\beta_1 = \beta_1^{(1)},\ \beta_2,\ \ldots,\ \beta_j = \beta_j^{(0)},\ \ldots,\ \beta_p = \beta_p^{(0)})$
$(\beta_1 = \beta_1^{(1)},\ \beta_2 = \beta_2^{(1)},\ \ldots,\ \beta_j,\ \ldots,\ \beta_p = \beta_p^{(0)})$
$(\beta_1 = \beta_1^{(1)},\ \beta_2 = \beta_2^{(1)},\ \ldots,\ \beta_j = \beta_j^{(1)},\ \ldots,\ \beta_p)$
$(\beta_1,\ \beta_2 = \beta_2^{(1)},\ \ldots,\ \beta_j = \beta_j^{(1)},\ \ldots,\ \beta_p = \beta_p^{(1)})$  (the next cycle begins)

Pages 21–25 show the same sweep blockwise for the Group Lasso:

$(\beta_{G_1},\ \beta_{G_2} = \beta_{G_2}^{(0)},\ \ldots,\ \beta_{G_j} = \beta_{G_j}^{(0)},\ \ldots,\ \beta_{G_q} = \beta_{G_q}^{(0)})$
$(\beta_{G_1} = \beta_{G_1}^{(1)},\ \beta_{G_2},\ \ldots,\ \beta_{G_j} = \beta_{G_j}^{(0)},\ \ldots,\ \beta_{G_q} = \beta_{G_q}^{(0)})$
$(\beta_{G_1} = \beta_{G_1}^{(1)},\ \beta_{G_2} = \beta_{G_2}^{(1)},\ \ldots,\ \beta_{G_j},\ \ldots,\ \beta_{G_q} = \beta_{G_q}^{(0)})$
$(\beta_{G_1} = \beta_{G_1}^{(1)},\ \beta_{G_2} = \beta_{G_2}^{(1)},\ \ldots,\ \beta_{G_j} = \beta_{G_j}^{(1)},\ \ldots,\ \beta_{G_q})$
$(\beta_{G_1},\ \beta_{G_2} = \beta_{G_2}^{(1)},\ \ldots,\ \beta_{G_j} = \beta_{G_j}^{(1)},\ \ldots,\ \beta_{G_q} = \beta_{G_q}^{(1)})$  (the next cycle begins)
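
A minimal sketch of such a sweep for squared error loss, assuming each block has been orthonormalized ($n^{-1}X_g^T X_g = I$) so that the block update is a closed-form groupwise soft-thresholding; the general case needs an inner solver:

```python
import numpy as np

def group_lasso_bcd(X_blocks, y, lam, m, n_sweeps=100, tol=1e-8):
    """Block coordinate descent for ||y - X beta||_2^2 / n + lam * sum_g m[g] * ||beta_g||_2."""
    n = len(y)
    beta = [np.zeros(X.shape[1]) for X in X_blocks]
    resid = y - sum(X @ b for X, b in zip(X_blocks, beta))
    for _ in range(n_sweeps):
        max_change = 0.0
        for g, X in enumerate(X_blocks):
            resid = resid + X @ beta[g]          # partial residual: put block g back
            z = X.T @ resid / n                  # unpenalized block solution (orthonormal case)
            norm_z = np.linalg.norm(z)
            shrink = max(0.0, 1.0 - lam * m[g] / (2.0 * norm_z)) if norm_z > 0 else 0.0
            new_b = shrink * z                   # groupwise soft-thresholding
            max_change = max(max_change, np.linalg.norm(new_b - beta[g]))
            beta[g] = new_b
            resid = resid - X @ beta[g]          # remove the updated block again
        if max_change < tol:
            break
    return beta
```

The factor 2 in the shrinkage comes from the $n^{-1}\|Y - X\beta\|_2^2$ scaling of the loss; the thresholding rule is exactly the KKT condition from Page 15.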

Page 26:

for the Gaussian log-likelihood (squared error loss): blockwise updates are easy and closed-form solutions exist (use the KKT conditions)

for other loss functions (e.g. logistic loss): blockwise updates have no closed-form solution

a strategy which is fast: improve every coordinate/group numerically, but not until numerical convergence (by using a quadratic approximation of the log-likelihood function for improving/optimizing a single block)

and further tricks ... (still allowing provable numerical convergence); a sketch of one such block improvement follows below
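
One way such a "quadratic approximation, not to convergence" block improvement can look for the logistic loss. This is a sketch, not the grplasso implementation: it majorizes the logistic Hessian by its 0.25 bound and takes a single proximal step; all names are illustrative:

```python
import numpy as np

def logistic_block_step(X_g, eta, y, beta_g, lam, m_g):
    """One quadratic-approximation improvement of block g for logistic loss.

    eta is the current linear predictor X @ beta; y has values in {0, 1}.
    The logistic Hessian is bounded by 0.25 * X^T X / n, giving a simple
    quadratic majorization; one proximal (groupwise soft-threshold) step
    improves the block without solving the subproblem to convergence.
    """
    n = len(y)
    p_hat = 1.0 / (1.0 + np.exp(-eta))            # current probabilities
    grad = -X_g.T @ (y - p_hat) / n               # gradient of average logistic loss w.r.t. beta_g
    L = 0.25 * np.linalg.norm(X_g, 2) ** 2 / n    # Lipschitz bound for the block gradient
    z = beta_g - grad / L                         # gradient step on the smooth part
    norm_z = np.linalg.norm(z)
    shrink = max(0.0, 1.0 - lam * m_g / (L * norm_z)) if norm_z > 0 else 0.0
    return shrink * z                             # proximal (soft-threshold) step
```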

Page 27:

How fast?

logistic case: $p = 10^6$, $n = 100$; group size = 20; sparsity: 2 active groups = 40 parameters

for 10 different $\lambda$-values, CPU time using grplasso: 203.16 seconds ≈ 3.5 minutes (dual-core processor with 2.6 GHz and 32 GB RAM)

today we can easily deal with predictors in the Megas, i.e. $p \approx 10^6$–$10^7$


Page 29:

The sparsity-smoothness penalty (SSP)

(whose corresponding optimization becomes again a Group Lasso problem ...)

for additive modeling in high dimensions:

$$Y_i = \sum_{j=1}^{p} f_j(x_i^{(j)}) + \varepsilon_i \quad (i = 1, \ldots, n)$$

$f_j: \mathbb{R} \to \mathbb{R}$ smooth univariate functions, $p \gg n$

Page 30:

in principle: basis expansion for every $f_j(\cdot)$ with basis functions $h_{j,1}, \ldots, h_{j,K}$, $j = 1, \ldots, p$, where $K = O(n)$ (or e.g. $K = O(n^{1/2})$)

⇒ represent

$$\sum_{j=1}^{p} f_j(x^{(j)}) = \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{j,k} h_{j,k}(x^{(j)})$$

⇒ a high-dimensional parametric problem

and use the Group Lasso penalty to ensure sparsity of whole functions:

$$\lambda \sum_{j=1}^{p} \|\beta_j\|_2, \qquad \beta_j := (\beta_{j,1}, \ldots, \beta_{j,K})^T$$

Page 31:

drawback: this does not exploit smoothness (except by choosing an appropriate K, which is "bad" if different $f_j$'s have different complexity)

when using a large number of basis functions (large K) to achieve a high degree of flexibility ⇒ additional control for smoothness is needed

Page 32:

Sparsity-smoothness penalty (SSP):

$$\lambda_1 \sum_{j=1}^{p} \|f_j\|_n + \lambda_2 \sum_{j=1}^{p} I(f_j), \qquad \|f_j\|_n = \|H_j\beta_j\|_2/\sqrt{n},$$

$$I^2(f_j) = \int \left(f_j''(x)\right)^2 dx = \beta_j^T W_j \beta_j,$$

where $f_j = (f_j(X_1^{(j)}), \ldots, f_j(X_n^{(j)}))^T$ and $(W_j)_{k\ell} = \int h_{j,k}''(x)\, h_{j,\ell}''(x)\, dx$

⇒ the SSP penalty does variable selection ($\hat f_j \equiv 0$ for some $j$)

Page 33:

Orthogonal basis and diagonal smoothing matrices

$$n^{-1}H_j^T H_j = I \quad \text{and} \quad W_j \equiv \mathrm{diag}(d_1^2, \ldots, d_K^2) := D^2, \qquad d_k = k^m \ (m > 1/2)$$

then the penalty becomes

$$\lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2$$

⇒

$$\hat\beta(\lambda_1, \lambda_2) = \arg\min_{\beta}\ \|Y - \sum_{j=1}^{p} H_j\beta_j\|_2^2/n + \lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2$$
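
A direct transcription of this doubly penalized objective's penalty term (the block layout and $d_k = k^m$ follow the slide; the function name is illustrative):

```python
import numpy as np

def ssp_penalty(beta_blocks, lam1, lam2, m=2):
    """Evaluate lam1 * sum_j ||beta_j||_2 + lam2 * sum_j ||D beta_j||_2
    with D = diag(1^m, 2^m, ..., K^m), m > 1/2 as on the slide."""
    total = 0.0
    for beta_j in beta_blocks:
        K = len(beta_j)
        d = np.arange(1, K + 1, dtype=float) ** m
        total += lam1 * np.linalg.norm(beta_j) + lam2 * np.linalg.norm(d * beta_j)
    return total
```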

Page 34:

the difficulty is the computation, although it is still a convex optimization problem

see Section 5.3.3 in the book

Page 35:

A modified SSP penalty

$$\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)}$$

for additive modeling:

$$\hat f_1, \ldots, \hat f_p = \arg\min_{f_1, \ldots, f_p} \|Y - \sum_{j=1}^{p} f_j\|_2^2 + \lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)}$$

assuming the $f_j$ are twice differentiable ⇒ the solution is a natural cubic spline with knots at the $X_i^{(j)}$ ⇒ finite-dimensional parameterization, e.g. with B-splines:

$$f = \sum_{j=1}^{p} f_j, \qquad f_j = H_j\beta_j$$

Page 36:

the penalty then becomes:

$$\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)} = \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{B_j^T B_j}_{\Sigma_j}\,\beta_j + \lambda_2\, \beta_j^T \underbrace{W_j}_{\text{integrated 2nd derivatives}}\,\beta_j}$$

$$= \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{(\Sigma_j + \lambda_2 W_j)}_{A_j = A_j(\lambda_2)}\,\beta_j}$$

⇒ reparameterize $\tilde\beta_j = \tilde\beta_j(\lambda_2) = R_j\beta_j$ with $R_j^T R_j = A_j = A_j(\lambda_2)$ (Cholesky); the penalty becomes

$$\lambda_1 \sum_{j=1}^{p} \|\tilde\beta_j\|_2$$

(depending on $\lambda_2$), i.e., a Group Lasso penalty
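
Put together, a sketch of this construction with random stand-in data (the $H_j$, $W_j$ and $\lambda_2$ here are placeholders; after the transform, any standard group Lasso solver applies and coefficients map back via $\beta_j = R_j^{-1}\tilde\beta_j$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, p, lam2 = 50, 6, 3, 0.1
H_blocks = [rng.standard_normal((n, K)) for _ in range(p)]        # basis evaluations H_j
W_blocks = [np.diag(np.arange(1.0, K + 1) ** 4) for _ in range(p)]  # illustrative smoothing matrices

# A_j(lam2) = Sigma_j + lam2 * W_j with Sigma_j = H_j^T H_j, then Cholesky A_j = R_j^T R_j
A_blocks = [H.T @ H + lam2 * W for H, W in zip(H_blocks, W_blocks)]
R_blocks = [np.linalg.cholesky(A).T for A in A_blocks]
H_tilde = [H @ np.linalg.inv(R) for H, R in zip(H_blocks, R_blocks)]
# now fit a standard group Lasso on the transformed design blocks H_tilde
```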

Page 37:

HIF1α motif additive regression: finding HIF1α transcription factor binding sites on DNA sequences

n = 287, p = 196

the additive model with SSP has ≈ 20% better prediction performance than the linear model with Lasso

bootstrap stability analysis: select the variables (functions) which occur in at least 50% of all bootstrap runs ⇒ only 2 stable variables / candidate motifs remain

[Figure: estimated partial effects for the two stable candidate motifs; left panel: Motif.P1.6.23, right panel: Motif.P1.6.26; y-axis: partial effect]

Page 38:

[Figure repeated from Page 37: partial effects for Motif.P1.6.23 (left) and Motif.P1.6.26 (right)]

right panel: the variable corresponds to a true, known motif

the variable/motif in the left panel has good additional support for relevance (nearness to the transcriptional start site of important genes, ...)