
Statistics for high-dimensional data: Group Lasso and additive models

Peter Bühlmann and Sara van de Geer

Seminar für Statistik, ETH Zürich

May 2012

The Group Lasso (Yuan & Lin, 2006)

the high-dimensional parameter vector is structured into q groups or partitions (known a priori):

G_1, ..., G_q ⊆ {1, ..., p}, disjoint, and ∪_g G_g = {1, ..., p}

corresponding coefficients: β_G = {β_j ; j ∈ G}

Example: categorical covariates. X^(1), ..., X^(p) are factors (categorical variables), each with 4 levels (e.g. “letters” from DNA)

for encoding a main effect: 3 parameters
for encoding a first-order interaction: 9 parameters
and so on ...

parameterization (e.g. sum contrasts) is structured as follows:
- intercept: no penalty
- main effect of X^(1): group G_1 with df = 3
- main effect of X^(2): group G_2 with df = 3
- ...
- first-order interaction of X^(1) and X^(2): G_{p+1} with df = 9
- ...

often, we want sparsity on the group level: either all parameters of an effect are zero, or none are


this can be achieved with the Group-Lasso penalty

\lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2 , \qquad \text{typically } m_g = \sqrt{|G_g|},

where \|\cdot\|_2 denotes the Euclidean norm.

properties of the Group-Lasso penalty:
- for group sizes |G_g| ≡ 1 ⟿ standard Lasso penalty
- convex penalty ⟿ convex optimization for standard likelihoods (exponential family models)
- either (β̂_G(λ))_j = 0 for all j ∈ G, or ≠ 0 for all j ∈ G
- penalty is invariant under orthonormal transformations, e.g. invariant when requiring an orthonormal parameterization for factors
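To make the penalty concrete, here is a minimal Python/NumPy sketch (illustrative code, not from the slides) that evaluates λ ∑_g m_g ‖β_{G_g}‖_2 for a coefficient vector, a group partition, and the default weights m_g = √|G_g|:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Evaluate lam * sum_g m_g * ||beta_{G_g}||_2 with weights m_g = sqrt(|G_g|)."""
    total = 0.0
    for G in groups:                                # G: index array of one group
        m_g = np.sqrt(len(G))                       # default weight m_g = sqrt(|G_g|)
        total += m_g * np.linalg.norm(beta[G])      # Euclidean norm of the group block
    return lam * total

# toy example: p = 6 coefficients split into q = 2 groups of size 3;
# only the first group is non-zero, so only it contributes to the penalty
beta = np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.0])
groups = [np.arange(0, 3), np.arange(3, 6)]
print(group_lasso_penalty(beta, groups, lam=0.5))
```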

DNA splice site detection: (mainly) a prediction problem

DNA sequence:

... ACGGC ...  E E E  GC  I I I I  ... AAC ...

potential donor site: 3 exon positions (E), the GC dinucleotide, and 4 intron positions (I)

response Y ∈ {0, 1}: splice or non-splice site
predictor variables: 7 factors, each having 4 levels

(full dimension: 4^7 = 16′384)

data:

training: 5′610 true splice sites and 5′610 non-splice sites, plus an unbalanced validation set

test data: 4′208 true splice sites and 89′717 non-splice sites

logistic regression:

\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \text{main effects} + \text{first-order interactions} + \ldots

up to second-order interactions: 1156 parameters
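The count of 1156 parameters can be checked directly: with 7 factors on 4 levels, each main effect has 3 df, each first-order interaction 3·3 = 9 df, and each second-order interaction 3·3·3 = 27 df. A one-line verification in Python:

```python
from math import comb

# intercept + 7 main effects (3 df) + 21 first-order (9 df) + 35 second-order (27 df)
print(1 + 7 * 3 + comb(7, 2) * 3**2 + comb(7, 3) * 3**3)   # -> 1156
```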

use the Group-Lasso which selects whole terms

[Figure: ℓ2-norm of the estimated coefficients for each term — main effects 1–7 and first-order interactions 1:2, ..., 6:7 in the upper panel, second-order interactions 1:2:3, ..., 5:6:7 in the lower panel — shown for GL, GL/R and GL/MLE.]

- mainly neighboring DNA positions show interactions (has been “known” and “debated”)
- no interaction among exons and introns (with the Group Lasso method)
- no second-order interactions (with the Group Lasso method)

predictive power: competitive with “state-of-the-art” maximum entropy modeling from Yeo and Burge (2004)

correlation between true and predicted class:

  Logistic Group Lasso            0.6593
  max. entropy (Yeo and Burge)    0.6589

our model (not necessarily the method/algorithm) is simple and has a clear interpretation

Generalized group Lasso penalty

\lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}} ,

where the A_j are T_j × T_j positive definite matrices (with T_j = |G_j| the group size)

⟿ generalized group Lasso:

\hat{\beta} = \arg\min_{\beta}\Big( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}} \Big)

reparameterize

\tilde{\beta}_{G_j} = A_j^{1/2}\beta_{G_j} , \qquad \tilde{X}_{G_j} = X_{G_j} A_j^{-1/2}

\hat{\beta} = \arg\min_{\beta}\Big( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}} \Big)

can be derived from

\hat{\beta}_{G_j} = A_j^{-1/2}\,\hat{\tilde{\beta}}_{G_j} , \qquad \hat{\tilde{\beta}} = \arg\min_{\tilde{\beta}}\Big( \|Y - \tilde{X}\tilde{\beta}\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \|\tilde{\beta}_{G_j}\|_2 \Big)
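A minimal NumPy sketch of this reduction (illustrative names, assuming some standard group-Lasso solver is available for the transformed problem): transform each block of X by A_j^{-1/2}, solve the ordinary group Lasso in the tilde coordinates, then map the solution back with A_j^{-1/2}.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric square root A^{1/2} of a positive definite matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def to_standard_group_lasso(X, groups, A_list):
    """Build Xtilde with blocks Xtilde_{G_j} = X_{G_j} A_j^{-1/2};
    also return the A_j^{1/2} factors needed for the back-transformation."""
    Xt = X.astype(float).copy()
    roots = []
    for G, A in zip(groups, A_list):
        A_half = sym_sqrt(A)
        Xt[:, G] = X[:, G] @ np.linalg.inv(A_half)   # X_{G_j} A_j^{-1/2}
        roots.append(A_half)
    return Xt, roots

def back_transform(beta_tilde, groups, roots):
    """Recover beta_{G_j} = A_j^{-1/2} betatilde_{G_j} from the standard group-Lasso fit."""
    beta = beta_tilde.copy()
    for G, A_half in zip(groups, roots):
        beta[G] = np.linalg.solve(A_half, beta_tilde[G])
    return beta
```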

Groupwise prediction penalty and parameterization invariance

\lambda \sum_{j=1}^{q} m_j \|X_{G_j}\beta_{G_j}\|_2 = \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T X_{G_j}^T X_{G_j} \beta_{G_j}}

is a generalized group Lasso penalty if the X_{G_j}^T X_{G_j} are positive definite (i.e. necessarily |G_j| ≤ n)

this penalty is invariant under any (invertible) transformation within groups, i.e. one can use \tilde{\beta}_{G_j} = B_j\beta_{G_j} (with \tilde{X}_{G_j} = X_{G_j} B_j^{-1}), where B_j is any T_j × T_j invertible matrix ⟿

X_{G_j}\hat{\beta}_{G_j} = \tilde{X}_{G_j}\hat{\tilde{\beta}}_{G_j} , \qquad \{j;\ \hat{\beta}_{G_j} \neq 0\} = \{j;\ \hat{\tilde{\beta}}_{G_j} \neq 0\}

Some aspects from theory

“again”:
- optimal prediction and estimation (oracle inequality)
- group screening: Ŝ ⊇ S_0 (the set of active groups) with high probability

but listen to Sara in ≈ “a few” minutes

interesting case:
- the G_j's are “large”
- the β_{G_j}'s are “smooth”

example: high-dimensional additive model

Y = \sum_{j=1}^{p} f_j(X^{(j)}) + \varepsilon

and expand f_j(x^{(j)}) = \sum_{k=1}^{n} \beta_k^{(j)} B_k^{(j)}(x^{(j)}), where \beta_k^{(j)} = (\beta_{G_j})_k and the B_k^{(j)} are basis functions

f_j(\cdot) smooth ⇒ “smoothness” of \beta_{G_j}

Computation and KKT

criterion function

Q_\lambda(\beta) = n^{-1}\sum_{i=1}^{n} \underbrace{\rho_\beta(x_i, Y_i)}_{\text{loss fct.}} + \lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2 ,

with the loss function \rho_\beta(\cdot,\cdot) convex in \beta

KKT conditions:

\nabla\rho(\beta)_{G_g} + \lambda m_g \frac{\beta_{G_g}}{\|\beta_{G_g}\|_2} = 0 \quad \text{if } \beta_{G_g} \neq 0 \text{ (not the 0-vector)},

\|\nabla\rho(\beta)_{G_g}\|_2 \leq \lambda m_g \quad \text{if } \beta_{G_g} \equiv 0.
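For a candidate solution, the KKT conditions can be verified numerically. A small sketch for the squared-error case ρ(β) = ‖Y − Xβ‖_2^2/n, whose gradient is −2X^T(Y − Xβ)/n (the function name and tolerance are illustrative):

```python
import numpy as np

def kkt_violation(beta, X, Y, groups, weights, lam, tol=1e-8):
    """Largest violation of the group-Lasso KKT conditions for
    rho(beta) = ||Y - X beta||_2^2 / n."""
    n = len(Y)
    grad = -2.0 * X.T @ (Y - X @ beta) / n                 # gradient of the loss
    worst = 0.0
    for G, m in zip(groups, weights):
        g, b = grad[G], beta[G]
        nb = np.linalg.norm(b)
        if nb > tol:        # active group: grad_{G_g} + lam * m_g * b / ||b|| = 0
            worst = max(worst, np.linalg.norm(g + lam * m * b / nb))
        else:               # inactive group: ||grad_{G_g}||_2 <= lam * m_g
            worst = max(worst, np.linalg.norm(g) - lam * m)
    return max(worst, 0.0)
```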

Block coordinate descent algorithm

generic description for both the Lasso and the Group-Lasso problem:
- cycle through all coordinates j = 1, ..., p, 1, 2, ... (or groups j = 1, ..., q, 1, 2, ...)
- optimize the penalized log-likelihood w.r.t. β_j (or β_{G_j}), keeping all other coefficients β_k, k ≠ j (or β_{G_k}, k ≠ j) fixed

Lasso: within one sweep the parameter vector passes through the states

(β_1, β_2^{(0)}, ..., β_p^{(0)}) → (β_1^{(1)}, β_2, ..., β_p^{(0)}) → ... → (β_1^{(1)}, ..., β_{p−1}^{(1)}, β_p) → (β_1, β_2^{(1)}, ..., β_p^{(1)}) → ...

where the coordinate written without a superscript is the one currently being optimized, (0) marks values from the previous sweep and (1) marks already updated values.

Group Lasso: analogously, blockwise over the groups,

(β_{G_1}, β_{G_2}^{(0)}, ..., β_{G_q}^{(0)}) → (β_{G_1}^{(1)}, β_{G_2}, ..., β_{G_q}^{(0)}) → ... → (β_{G_1}^{(1)}, ..., β_{G_{q−1}}^{(1)}, β_{G_q}) → (β_{G_1}, β_{G_2}^{(1)}, ..., β_{G_q}^{(1)}) → ...

for the Gaussian log-likelihood (squared error loss): blockwise updates are easy and closed-form solutions exist (use the KKT conditions)
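A minimal sketch of such a blockwise update for the criterion ‖Y − Xβ‖_2^2/n + λ ∑_g m_g ‖β_{G_g}‖_2 (assuming, so that the closed form below applies, that each group's columns have been orthonormalized, X_{G_g}^T X_{G_g}/n = I; this is an illustration, not the grplasso implementation):

```python
import numpy as np

def block_coordinate_descent(X, Y, groups, weights, lam, n_sweeps=100):
    """Group-Lasso coordinate descent for ||Y - X beta||_2^2/n + lam * sum_g m_g ||beta_{G_g}||_2,
    assuming X_{G_g}^T X_{G_g} / n = I for every group."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):                        # cycle through the groups
        for G, m in zip(groups, weights):
            r = Y - X @ beta + X[:, G] @ beta[G]     # partial residual without group g
            z = X[:, G].T @ r / n
            nz = np.linalg.norm(z)
            # closed-form blockwise minimizer: group soft-thresholding (from the KKT conditions)
            beta[G] = 0.0 if nz == 0 else max(0.0, 1.0 - lam * m / (2.0 * nz)) * z
    return beta
```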

for other loss functions (e.g. the logistic loss): blockwise updates have no closed-form solution ⟿

a fast strategy: improve every coordinate/group numerically, but not until numerical convergence (using a quadratic approximation of the log-likelihood for improving/optimizing a single block)

and further tricks ... (still allowing provable numerical convergence)

How fast?

logistic case: p = 10^6, n = 100, group size = 20, sparsity: 2 active groups = 40 parameters

for 10 different λ-values, CPU time using grplasso: 203.16 seconds ≈ 3.5 minutes (dual-core processor with 2.6 GHz and 32 GB RAM)

we can easily deal today with numbers of predictors in the millions, i.e. p ≈ 10^6 – 10^7


The sparsity-smoothness penalty (SSP)

(whose corresponding optimization becomes again a Group-Lasso problem ...)

for additive modeling in high dimensions

Y_i = \sum_{j=1}^{p} f_j(x_i^{(j)}) + \varepsilon_i \quad (i = 1, \ldots, n)

f_j : \mathbb{R} \to \mathbb{R} smooth univariate functions, p ≫ n

in principle: basis expansion for every f_j(·), j = 1, ..., p, with basis functions h_{j,1}, ..., h_{j,K}, where K = O(n) (or e.g. K = O(n^{1/2}))

⟿ represent

\sum_{j=1}^{p} f_j(x^{(j)}) = \sum_{j=1}^{p}\sum_{k=1}^{K} \beta_{j,k}\, h_{j,k}(x^{(j)})

⟿ high-dimensional parametric problem

and use the Group-Lasso penalty to ensure sparsity of whole functions

\lambda \sum_{j=1}^{p} \|\beta_{G_j}\|_2 , \qquad \beta_{G_j} := \beta_j = (\beta_{j,1}, \ldots, \beta_{j,K})^T

drawback: does not exploit smoothness (except when choosing an appropriate K, which is “bad” if different f_j's have different complexity)

when using a large number of basis functions (large K) for achieving a high degree of flexibility ⟿ need additional control for smoothness
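To illustrate the basis-expansion step, here is a small sketch (illustrative choices only: a truncated-power spline basis with quantile knots stands in for whatever basis h_{j,1}, ..., h_{j,K} one prefers) that builds the stacked design matrix and the group index sets, so that each function f_j corresponds to one group of K columns on which the Group-Lasso penalty acts:

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Spline-type basis for one variable: x, x^2, x^3 and (x - t)_+^3 for each knot t."""
    cols = [x**d for d in range(1, degree + 1)]
    cols += [np.clip(x - t, 0.0, None)**degree for t in knots]
    return np.column_stack(cols)

def additive_design(X, n_knots=5, degree=3):
    """Expand each column X^{(j)} into K basis columns; return the stacked design
    H = [H_1, ..., H_p] and the group index sets G_1, ..., G_p (one group per function)."""
    n, p = X.shape
    blocks, groups, start = [], [], 0
    for j in range(p):
        x = X[:, j]
        knots = np.quantile(x, np.linspace(0.1, 0.9, n_knots))
        H_j = truncated_power_basis(x, knots, degree)
        H_j -= H_j.mean(axis=0)          # center each block; the intercept is kept separate
        K = H_j.shape[1]
        blocks.append(H_j)
        groups.append(np.arange(start, start + K))
        start += K
    return np.column_stack(blocks), groups

# applying the penalty lam * sum_j ||beta_{G_j}||_2 to this design selects whole functions f_j
```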

Sparsity-Smoothness Penalty (SSP)

\lambda_1 \sum_{j=1}^{p} \|f_j\|_n + \lambda_2 \sum_{j=1}^{p} I(f_j),

where \|f_j\|_n = \|H_j\beta_j\|_2/\sqrt{n},

I^2(f_j) = \int (f_j''(x))^2\,dx = \beta_j^T W_j \beta_j,

\mathbf{f}_j = (f_j(X_1^{(j)}), \ldots, f_j(X_n^{(j)}))^T, \quad \text{and } (W_j)_{k,\ell} = \int h_{j,k}''(x)\, h_{j,\ell}''(x)\,dx

⟿ the SSP penalty does variable selection (\hat{f}_j ≡ 0 for some j)

Orthogonal basis and diagonal smoothing matrices

n^{-1} H_j^T H_j = I \quad \text{and} \quad W_j \equiv \mathrm{diag}(d_1^2, \ldots, d_K^2) := D^2, \qquad d_k = k^m \ (m > 1/2)

then, the penalty becomes

\lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2

⟿

\hat{\beta}(\lambda_1, \lambda_2) = \arg\min_{\beta}\Big( \big\|Y - \sum_{j=1}^{p} H_j\beta_j\big\|_2^2/n + \lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2 \Big)

the difficulty is the computation, although it is still a convex optimization problem

see Section 5.3.3 in the book
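Under this orthonormal/diagonal setting the penalty is easy to evaluate; a small illustrative sketch with d_k = k^m (m = 2 chosen only for the example):

```python
import numpy as np

def ssp_penalty_diagonal(beta_blocks, lam1, lam2, m=2.0):
    """lam1 * sum_j ||beta_j||_2 + lam2 * sum_j ||D beta_j||_2 with D = diag(1^m, ..., K^m)."""
    total = 0.0
    for beta_j in beta_blocks:
        K = len(beta_j)
        d = np.arange(1, K + 1) ** m     # d_k = k^m penalizes higher-order coefficients more
        total += lam1 * np.linalg.norm(beta_j) + lam2 * np.linalg.norm(d * beta_j)
    return total
```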

A modified SSP-penalty

\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)}

for additive modeling:

\hat{f}_1, \ldots, \hat{f}_p = \arg\min_{f_1,\ldots,f_p}\Big( \big\|Y - \sum_{j=1}^{p} f_j\big\|_2^2 + \lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)} \Big)

assuming f_j is twice differentiable ⟿ the solution is a natural cubic spline with knots at the X_i^{(j)} ⟿ finite-dimensional parameterization with e.g. B-splines:

f = \sum_{j=1}^{p} f_j , \qquad f_j = H_j\beta_j

penalty becomes:

\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_2^2 + \lambda_2 I^2(f_j)}
= \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{H_j^T H_j}_{\Sigma_j}\, \beta_j + \lambda_2\, \beta_j^T \underbrace{W_j}_{\text{integrated 2nd derivatives}} \beta_j}
= \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{(\Sigma_j + \lambda_2 W_j)}_{A_j = A_j(\lambda_2)}\, \beta_j}

⟿ re-parameterize \tilde{\beta}_j = \tilde{\beta}_j(\lambda_2) = R_j\beta_j with R_j^T R_j = A_j = A_j(\lambda_2) (Cholesky); the penalty becomes

\lambda_1 \sum_{j=1}^{p} \|\tilde{\beta}_j\|_2 \qquad (\text{depending on } \lambda_2),

i.e., a Group-Lasso penalty
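A minimal NumPy sketch of this Cholesky reparameterization (illustrative code, assuming the Σ_j = H_j^T H_j and the smoothness matrices W_j have already been computed): form A_j(λ_2), factor it, transform the design blocks, and recover β_j from the group-Lasso solution in the tilde coordinates.

```python
import numpy as np

def reparameterize_modified_ssp(H_blocks, W_blocks, lam2):
    """For each j: A_j(lam2) = H_j^T H_j + lam2 * W_j = R_j^T R_j (Cholesky),
    and Htilde_j = H_j R_j^{-1}; the modified SSP penalty then equals
    lam1 * sum_j ||betatilde_j||_2, an ordinary Group-Lasso penalty."""
    H_tilde, R_list = [], []
    for H_j, W_j in zip(H_blocks, W_blocks):
        A_j = H_j.T @ H_j + lam2 * W_j
        L = np.linalg.cholesky(A_j)                       # A_j = L L^T
        R_j = L.T                                         # so R_j^T R_j = A_j
        H_tilde.append(np.linalg.solve(R_j.T, H_j.T).T)   # H_j R_j^{-1}
        R_list.append(R_j)
    return H_tilde, R_list

def recover_beta(beta_tilde_blocks, R_list):
    """Map the group-Lasso solution back: beta_j = R_j^{-1} betatilde_j."""
    return [np.linalg.solve(R_j, bt) for R_j, bt in zip(R_list, beta_tilde_blocks)]
```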

HIF1α motif additive regression

for finding HIF1α transcription factor binding sites on DNA sequences

n = 287, p = 196

the additive model with SSP has ≈ 20% better prediction performance than the linear model with the Lasso

bootstrap stability analysis: select the variables (functions) which have occurred in at least 50% of all bootstrap runs ⟿ only 2 stable variables/candidate motifs remain

[Figure: estimated partial effects of the two stable candidate motifs, Motif.P1.6.23 (left panel) and Motif.P1.6.26 (right panel); y-axis: Partial Effect.]

right panel: variable corresponds to a true, known motif

variable/motif corresponding to left panel: good additional support for relevance (nearness to the transcriptional start site of important genes, ...)