Statistics for high-dimensional data: ℓ1-penalization for smooth loss functions
Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012
Generalized linear models

a very useful extension of linear models:
Y_1, ..., Y_n independent,

  g(E[Y_i | X_i = x]) = µ + ∑_{j=1}^p β_j^0 x^{(j)}

g(·) a real-valued, known link function

notation: f(x) = f_{µ,β}(x) = µ + ∑_{j=1}^p β_j x^{(j)}

the (conditional) distribution of Y_i (given X_i) depends on X_i only through f_{µ,β^0}(X_i):

  p(y|x) = p_{µ,β^0}(y|x)
Logistic regression

Y_i ∈ {0,1},  E[Y_i | X_i = x] = π(x) = P[Y_i = 1 | X_i = x],

  g(π(x)) = log(π(x)/(1 − π(x))) = f_{µ,β}(x)   (logit link function)

⇒ log(p_{µ,β}(Y_i | X_i)) = Y_i f_{µ,β}(X_i) − log(1 + exp(f_{µ,β}(X_i)))
Lasso for GLMs and M-estimators

  µ̂, β̂ = argmin_{µ,β} −n^{−1} ∑_{i=1}^n log(p_{µ,β}(Y_i | X_i)) + λ‖β‖_1

(no direct penalization of the intercept µ)

or, more generally:

  µ̂, β̂ = argmin_{µ,β} −n^{−1} ∑_{i=1}^n ρ_{f_{µ,β}}(Y_i, X_i) + λ‖β‖_1
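As an illustration, the penalized objective above can be minimized by proximal gradient descent; the following is a minimal sketch for the logistic-loss case (this is not glmnet's coordinate-descent algorithm; the fixed step size, function names, and toy settings are our own assumptions):

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_logistic(X, Y, lam, step=0.1, iters=2000):
    """Proximal-gradient sketch for
    argmin_{mu,beta} -n^{-1} sum_i log p_{mu,beta}(Y_i|X_i) + lam * ||beta||_1,
    with the intercept mu left unpenalized."""
    n, p = len(X), len(X[0])
    mu, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # gradient of the negative average log-likelihood:
        # n^{-1} sum_i (pi(X_i) - Y_i) * (1, X_i)
        g_mu, g = 0.0, [0.0] * p
        for xi, yi in zip(X, Y):
            f = mu + sum(b * x for b, x in zip(beta, xi))
            r = 1.0 / (1.0 + math.exp(-f)) - yi
            g_mu += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        mu -= step * g_mu                       # plain gradient step (no penalty)
        beta = [soft_threshold(b - step * gj, step * lam)
                for b, gj in zip(beta, g)]      # proximal step for the l1 penalty
    return mu, beta
```

On toy data where only the first covariate carries signal, a moderate λ shrinks the irrelevant coefficient to zero while keeping the informative one active.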
Binary classification and some loss functions

logistic loss:

  ρ_f(Y, X) = −Y f(X) + log(1 + exp(f(X)))
            = log(1 + exp(−(2Y − 1) f(X))) ∝ log_2(1 + exp(−Ỹ f(X))),  with Ỹ = 2Y − 1

(Exercise 3.2)

many loss functions for binary classification can be written as

  ρ(f, y) = function of the “margin” ỹ f
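The margin representation can be made concrete; a small sketch of the common margin-based losses (function names are ours, labels recoded to {−1, +1}):

```python
import math

# Margin-based losses for binary classification, written as functions of
# the margin m = (2y - 1) * f(x).
def loss_01(m):       # misclassification (0-1) loss
    return 1.0 if m <= 0 else 0.0

def loss_hinge(m):    # hinge loss (support vector machines)
    return max(1.0 - m, 0.0)

def loss_exp(m):      # exponential loss (AdaBoost)
    return math.exp(-m)

def loss_loglik(m):   # logistic loss, scaled by 1/log(2) so it equals 1 at m = 0
    return math.log2(1.0 + math.exp(-m))
```

The hinge, exponential, and scaled logistic losses are convex upper bounds of the 0-1 loss, and the logistic loss grows only linearly for large negative margins.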
[Figure: loss functions for binary classification, plotted against the margin (2y − 1)f: 0-1 loss ρ_{0−1}, hinge loss ρ_hinge, exponential loss ρ_exp, and logistic (log-likelihood) loss ρ_{log-lik}]
the logistic loss:
• leads to probabilities π(x) = P[Y = 1 | X = x]
• is approximately linear for large negative values of ỹ f (for heavy misclassifications) ⇒ “robustness”
• is an upper convex approximation of the misclassification loss

all these properties make the logistic loss “useful” in many applications
for all “ordinary” GLMs, the negative log-likelihood loss is convex in the parameters

methodology with general convex loss functions: similar to the Lasso with squared error loss,
e.g. variable screening still holds, and one can/should do second-stage fitting such as the adaptive Lasso or MLE refitting

software for the Lasso for GLMs: R package glmnet (Friedman, Hastie & Tibshirani, 2008)
Non-convex loss functions

examples: the (penalized) maximum likelihood estimator in
• mixture models, in particular mixtures of high-dimensional Gaussian regressions
• mixed effects models

both involve a non-convex loss function ρ(·)
Finite mixture of Gaussian regressions model

motivation: the data is inhomogeneous; different data points belong to different Gaussian linear models, and the membership of data points to the different models is unknown

model:

  Y_i | X_i ∼ ∑_{r=1}^k π_r N(X_i β_r, σ_r^2)

unknown parameters: θ = (β_1, ..., β_k, σ_1, ..., σ_k, π_1, ..., π_{k−1})
dim(θ) = k·p + k + (k − 1) = k·(p + 2) − 1
First step: ℓ1-penalization for reparametrized linear models

one mixture component: need to estimate β and σ
could use the ℓ1-penalized MLE:

  β̂, σ̂ = argmin_{β,σ} ( log(σ) + ‖Y − Xβ‖_2^2/(2nσ^2) + λ‖β‖_1 )

⇒ non-convex objective function!

an alternative:

  β̂, σ̂ = argmin_{β,σ} ( log(σ) + ‖Y − Xβ‖_2^2/(2nσ^2) + λ‖β‖_1/σ )

⇒ equivariant under scaling Y′ = bY, β′ = bβ, σ′ = bσ (b > 0), and λ can be chosen “universally”, not depending on σ
this estimator still involves a non-convex objective function
but: reparametrize!

  φ_j = β_j/σ,  ρ = σ^{−1}

the alternative estimator in reparametrized form is:

  φ̂, ρ̂ = argmin_{φ,ρ} ( −log(ρ) + ‖ρY − Xφ‖_2^2/(2n) + λ‖φ‖_1 )

with
• a convex objective function
• a universal regularization parameter which can be chosen independently of σ (or ρ)

thus, we have an estimator where:
• there is “no need” for cross-validation since λ is “universal”
• we get a consistent estimate of σ ⇒ useful, e.g., for the construction of p-values
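The equivalence between the scaled-penalty objective and its reparametrized convex form can be checked numerically; a minimal sketch (function names are ours, with plain-Python linear algebra for illustration):

```python
import math

def obj_original(beta, sigma, X, Y, lam):
    """Scaled-penalty objective: log(sigma) + ||Y - X beta||^2 / (2 n sigma^2)
    + lam * ||beta||_1 / sigma  (non-convex in (beta, sigma))."""
    n = len(Y)
    rss = sum((y - sum(b * x for b, x in zip(beta, xi))) ** 2
              for xi, y in zip(X, Y))
    return (math.log(sigma) + rss / (2 * n * sigma ** 2)
            + lam * sum(abs(b) for b in beta) / sigma)

def obj_reparam(phi, rho, X, Y, lam):
    """Reparametrized objective: -log(rho) + ||rho Y - X phi||^2 / (2 n)
    + lam * ||phi||_1  (convex in (phi, rho))."""
    n = len(Y)
    rss = sum((rho * y - sum(p * x for p, x in zip(phi, xi))) ** 2
              for xi, y in zip(X, Y))
    return -math.log(rho) + rss / (2 * n) + lam * sum(abs(p) for p in phi)
```

Substituting φ = β/σ and ρ = 1/σ, the two objectives agree term by term: log(σ) = −log(ρ), the residual sums of squares coincide, and λ‖β‖_1/σ = λ‖φ‖_1.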
this appears first in Städler, Bühlmann & van de Geer (2010), refined by Sun and Zhang (2010): the scaled Lasso
it is an alternative to the “square root Lasso” of Belloni, Chernozhukov & Wang (2011), which also has a universal regularization parameter
back to the mixture of Gaussian regressions, in reparametrized form:

  P(Y_i ∈ dy | X_i) = ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r y − X_i φ_r)^2/2) dy

unknown parameters: θ = (φ_1, ..., φ_k, ρ_1, ..., ρ_k, π_1, ..., π_{k−1})
ℓ1-penalized MLE:

  θ̂(λ) = argmin_θ ( −n^{−1} ∑_{i=1}^n log( ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r Y_i − X_i φ_r)^2/2) ) + λ ∑_{r=1}^k ‖φ_r‖_1 )
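For concreteness, the penalized negative log-likelihood above can be evaluated directly; a minimal sketch (the function name and parameter packing are illustrative assumptions):

```python
import math

def penalized_nll(theta, X, Y, lam):
    """Objective of the l1-penalized MLE for a k-component mixture of
    Gaussian regressions in reparametrized form:
    -n^{-1} sum_i log( sum_r pi_r * rho_r/sqrt(2 pi)
                       * exp(-(rho_r Y_i - X_i phi_r)^2 / 2) )
    + lam * sum_r ||phi_r||_1.
    theta = (phis, rhos, pis), with phis a list of k coefficient vectors."""
    phis, rhos, pis = theta
    n = len(Y)
    nll = 0.0
    for xi, yi in zip(X, Y):
        dens = 0.0
        for phi, rho, pi_r in zip(phis, rhos, pis):
            resid = rho * yi - sum(p * x for p, x in zip(phi, xi))
            dens += pi_r * rho / math.sqrt(2 * math.pi) * math.exp(-0.5 * resid ** 2)
        nll -= math.log(dens) / n
    return nll + lam * sum(abs(p) for phi in phis for p in phi)
```

With k = 1, π_1 = 1 and λ = 0, this reduces to the usual Gaussian negative log-likelihood.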
Riboflavin production with Bacillus subtilis: n = 146, p = 4088

preselection of the 100 variables (genes) with the highest empirical variance; the tuning parameter is chosen via 10-fold cross-validation

[Figure: 10-fold cross-validation error versus the number of mixture components k = 1, ..., 5]
⇒ the data is not homogeneous; the k = 3-component mixture model fits best

the ℓ1-penalized MLE selects 51 variables (genes)
top 20 variables according to ∑_{r=1}^3 |β̂_{r,j}|

[Figure: coefficient magnitudes (ranging from −0.7 to 0.5) for the top 20 genes: genes 1, 29, 70, 91, 43, 52, 81, 3, 73, 25, 87, 41, 27, 44, 15, 83, 46, 74, 85, 34]

⇒ mostly the same signs of β̂_{1,j}, β̂_{2,j}, β̂_{3,j} across the three components
Computation for the ℓ1-penalized MLE in mixtures of Gaussian regressions

  θ̂(λ) = argmin_θ ( −n^{−1} ∑_{i=1}^n log( ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r Y_i − X_i φ_r)^2/2) ) + λ ∑_{r=1}^k ‖φ_r‖_1 )
this involves a non-convex loss function, due to the mixture components

if we knew the membership of each point to a component, with

  ∆_{i,r} = 1 if the i-th data point belongs to component r,

we could interchange

  log( ∑_{r=1}^k π_r ... ) → ∑_{r=1}^k ∆_{i,r} log( ... )

⇒ convex optimization
such a setting calls for the EM algorithm:
• E-step: compute the responsibilities

  γ_{i,r} = E[∆_{i,r} | Y_i, θ^[m]]

• M-step: minimize the objective function above with the substitution

  log( ∑_{r=1}^k π_r ... ) → ∑_{r=1}^k γ_{i,r} log( ... )

the M-step is a convex optimization
⇒ such an EM algorithm converges to a stationary point (not necessarily a global optimum)
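The E-step amounts to computing posterior membership probabilities via Bayes' rule; a minimal sketch in the reparametrized form (the function name is ours; the M-step is omitted):

```python
import math

def e_step(X, Y, phis, rhos, pis):
    """E-step: responsibilities gamma_{i,r} = E[Delta_{i,r} | Y_i, theta^[m]],
    the posterior probability that point i belongs to component r, using the
    component densities pi_r * rho_r/sqrt(2 pi) * exp(-(rho_r y - x phi_r)^2/2)."""
    gammas = []
    for xi, yi in zip(X, Y):
        w = []
        for phi, rho, pi_r in zip(phis, rhos, pis):
            resid = rho * yi - sum(p * x for p, x in zip(phi, xi))
            w.append(pi_r * rho / math.sqrt(2 * math.pi) * math.exp(-0.5 * resid ** 2))
        s = sum(w)
        gammas.append([wr / s for wr in w])  # normalize: Bayes' rule
    return gammas
```

A data point lying close to the regression line of one component receives a responsibility near 1 for that component.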
one can speed this up by using the generalized EM algorithm
generalized M-step: any new parameter value that improves the objective function
although we only have convergence to a stationary point, the problem is better behaved than standard MLE in mixture models; consider (in the other parameterization):

  −ℓ(β, σ; data)/n + λ ∑_{r=1}^k ‖β_r‖_1/σ_r

the typical problem with mixture models is an unbounded likelihood as σ_r → 0, but this scenario is penalized by the scaled ℓ1-penalty!
⇒ the penalized likelihood stays bounded
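The boundedness argument can be illustrated numerically: for a single observation fitted exactly, the unpenalized Gaussian negative log-likelihood log(σ) diverges to −∞ as σ → 0, while the scaled penalty λ|β|/σ grows faster, so the penalized objective is bounded below (a toy sketch with illustrative values of β and λ):

```python
import math

# One observation fitted exactly (Y = X * beta), so the residual term is 0.
def unpenalized(sigma):
    """Gaussian negative log-likelihood at an exact fit: log(sigma) -> -inf."""
    return math.log(sigma)

def penalized(sigma, beta=1.0, lam=0.1):
    """Adding the scaled l1-penalty lam * |beta| / sigma: the 1/sigma term
    dominates log(sigma) as sigma -> 0, so the objective blows up, not down."""
    return math.log(sigma) + lam * abs(beta) / sigma
```

The penalized objective has an interior minimum (here near σ = λ|β|) instead of degenerating at σ = 0.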
the methodology remains the same as for the standard Lasso,
that is: often, one would use refitting or adaptive ℓ1-penalization

R package fmrlasso (Städler)
Linear mixed effects models

model:

  Y_ij = X_ij^T β + Z_ij^T b_i + ε_ij,  i = 1, ..., N,  j = 1, ..., n_i,
  b_1, ..., b_N i.i.d. N_q(0, Γ_τ),  ε_ij i.i.d. N(0, σ^2)

with X_ij the p×1 vector of fixed-effects covariates, Z_ij the q×1 vector of random-effects covariates, and b_i the random coefficients

high-dimensional: p ≫ N_T = ∑_{i=1}^N n_i
the dimension q of the random effects may be very large; q* = dim(τ) is “small”
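Under this model, the marginal covariance of the responses within group i is Cov(Y_i) = Z_i Γ_τ Z_i^T + σ² I_{n_i}, which is what the (non-convex) marginal likelihood is built from; a minimal sketch computing it for one group (the function name and plain-list linear algebra are ours):

```python
def marginal_cov(Z, Gamma, sigma2):
    """Marginal covariance of the n_i responses in group i under
    Y_ij = X_ij' beta + Z_ij' b_i + eps_ij, b_i ~ N_q(0, Gamma),
    eps_ij ~ N(0, sigma2):  Cov(Y_i) = Z_i Gamma Z_i' + sigma2 * I.
    Z is the n_i x q random-effects design for group i."""
    ni, q = len(Z), len(Z[0])
    # start from sigma2 * I
    V = [[sigma2 if j == k else 0.0 for k in range(ni)] for j in range(ni)]
    for j in range(ni):
        for k in range(ni):
            # add the random-effects contribution Z_j' Gamma Z_k
            V[j][k] += sum(Z[j][a] * Gamma[a][b] * Z[k][b]
                           for a in range(q) for b in range(q))
    return V
```

For a random-intercept model (Z_ij ≡ 1, q = 1), this gives the classical compound-symmetry structure: variance τ² + σ² on the diagonal and covariance τ² within a group.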
ℓ1-penalized MLE:

  −ℓ(β, τ, σ^2; data)/N + λ‖β‖_1

this involves a non-convex loss function

computation: coordinatewise optimization for j = 1, 2, ..., p, p+1, ..., p+q*+1, 1, 2, ...

R package lmmlasso (Schelldorfer)
Riboflavin production with Bacillus subtilis

N = 28 groups (strains) with n_i ∈ {2, ..., 6} and N_T = 111; p = 4088

model:

  Y_ij = X_ij^T β + Z_ij,k1 b_i,k1 + Z_ij,k2 b_i,k2 + ε_ij,
  b_1, ..., b_N i.i.d. N_2(0, diag(τ_{k1}^2, τ_{k2}^2))

Estimates    | LMMLasso | adaptive LMMLasso | Lasso | adaptive Lasso
σ̂^2         | 0.18     | 0.15              | 0.30  | 0.20
τ̂_{k1}^2    | 0.17     | 0.08              | –     | –
τ̂_{k2}^2    | 0.03     | 0.03              | –     | –
|Ŝ|          | 18       | 14                | 21    | 20
LMMLasso:
• smaller (estimated) residual variance for (adaptive) LMMLasso than for (adaptive) Lasso
• 53% of the total variability is between groups: (0.17 + 0.03)/(0.18 + 0.17 + 0.03) = 0.53 (for LMMLasso)
• (adaptive) LMMLasso is sparser than (adaptive) Lasso