Statistics for high-dimensional data: ℓ1-penalization for smooth loss functions
Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012
Generalized linear models

a very useful extension of linear models:
Y_1, ..., Y_n independent,

  g(E[Y_i | X_i = x]) = µ + ∑_{j=1}^p β_j^0 x^{(j)}

g(·) a real-valued, known link function

notation: f(x) = f_{µ,β}(x) = µ + ∑_{j=1}^p β_j x^{(j)}

the (conditional) distribution of Y_i (given X_i) depends on X_i only through f_{µ,β^0}(X_i):

  p(y|x) = p_{µ,β^0}(y|x)
Logistic regression

Y_i ∈ {0,1},  E[Y_i | X_i = x] = π(x) = P[Y_i = 1 | X_i = x],

  g(π(x)) = log(π(x)/(1 − π(x))) = f_{µ,β}(x)   (logit link function)

⇒ log(p_{µ,β}(Y_i | X_i)) = Y_i f_{µ,β}(X_i) − log(1 + exp(f_{µ,β}(X_i)))
Lasso for GLMs and M-estimators

  µ̂, β̂ = argmin_{µ,β} −n^{−1} ∑_{i=1}^n log(p_{µ,β}(Y_i | X_i)) + λ‖β‖_1

(no direct penalization of the intercept µ)

or, more generally:

  µ̂, β̂ = argmin_{µ,β} −n^{−1} ∑_{i=1}^n ρ_{f_{µ,β}}(Y_i, X_i) + λ‖β‖_1
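As an illustration, the penalized objective above can be minimized by proximal gradient descent; the following is a minimal sketch for the logistic-loss case (this is not glmnet's coordinate-descent algorithm; the fixed step size, function names, and toy settings are our own assumptions):

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_logistic(X, Y, lam, step=0.1, iters=2000):
    """Proximal-gradient sketch for
    argmin_{mu,beta} -n^{-1} sum_i log p_{mu,beta}(Y_i|X_i) + lam * ||beta||_1,
    with the intercept mu left unpenalized."""
    n, p = len(X), len(X[0])
    mu, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # gradient of the negative average log-likelihood:
        # n^{-1} sum_i (pi(X_i) - Y_i) * (1, X_i)
        g_mu, g = 0.0, [0.0] * p
        for xi, yi in zip(X, Y):
            f = mu + sum(b * x for b, x in zip(beta, xi))
            r = 1.0 / (1.0 + math.exp(-f)) - yi
            g_mu += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        mu -= step * g_mu                       # plain gradient step (no penalty)
        beta = [soft_threshold(b - step * gj, step * lam)
                for b, gj in zip(beta, g)]      # proximal step for the l1 penalty
    return mu, beta
```

On toy data where only the first covariate carries signal, a moderate λ shrinks the irrelevant coefficient to zero while keeping the informative one active.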
Binary classification and some loss functions

logistic loss:

  ρ_f(Y, X) = −Y f(X) + log(1 + exp(f(X)))
            = log(1 + exp(−(2Y − 1) f(X))) ∝ log_2(1 + exp(−Ỹ f(X))),  with Ỹ = 2Y − 1

(Exercise 3.2)

many loss functions for binary classification can be written as

  ρ(f, y) = function of the “margin” ỹ f
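The margin representation can be made concrete; a small sketch of the common margin-based losses (function names are ours, labels recoded to {−1, +1}):

```python
import math

# Margin-based losses for binary classification, written as functions of
# the margin m = (2y - 1) * f(x).
def loss_01(m):       # misclassification (0-1) loss
    return 1.0 if m <= 0 else 0.0

def loss_hinge(m):    # hinge loss (support vector machines)
    return max(1.0 - m, 0.0)

def loss_exp(m):      # exponential loss (AdaBoost)
    return math.exp(-m)

def loss_loglik(m):   # logistic loss, scaled by 1/log(2) so it equals 1 at m = 0
    return math.log2(1.0 + math.exp(-m))
```

The hinge, exponential, and scaled logistic losses are convex upper bounds of the 0-1 loss, and the logistic loss grows only linearly for large negative margins.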
[Figure: loss functions for binary classification, plotted against the margin (2y − 1)f: 0-1 loss ρ_{0−1}, hinge loss ρ_hinge, exponential loss ρ_exp, and logistic (log-likelihood) loss ρ_{log-lik}]
the logistic loss:
• leads to probabilities π(x) = P[Y = 1 | X = x]
• is approximately linear for large negative values of ỹ f (for heavy misclassifications) ⇒ “robustness”
• is an upper convex approximation of the misclassification loss

all these properties make the logistic loss “useful” in many applications
for all “ordinary” GLMs, the negative log-likelihood loss is convex in the parameters

methodology with general convex loss functions: similar to the Lasso with squared error loss,
e.g. variable screening still holds, and one can/should do second-stage fitting such as the adaptive Lasso or MLE refitting

software for the Lasso for GLMs: R package glmnet (Friedman, Hastie & Tibshirani, 2008)
Non-convex loss functions

examples: the (penalized) maximum likelihood estimator in
• mixture models, in particular mixtures of high-dimensional Gaussian regressions
• mixed effects models

both involve a non-convex loss function ρ(·)
Finite mixture of Gaussian regressions model

motivation: the data is inhomogeneous; different data points belong to different Gaussian linear models, and the membership of data points to the different models is unknown

model:

  Y_i | X_i ∼ ∑_{r=1}^k π_r N(X_i β_r, σ_r^2)

unknown parameters: θ = (β_1, ..., β_k, σ_1, ..., σ_k, π_1, ..., π_{k−1})
dim(θ) = k·p + k + (k − 1) = k·(p + 2) − 1
First step: ℓ1-penalization for reparametrized linear models

one mixture component: need to estimate β and σ
could use the ℓ1-penalized MLE:

  β̂, σ̂ = argmin_{β,σ} ( log(σ) + ‖Y − Xβ‖_2^2/(2nσ^2) + λ‖β‖_1 )

⇒ non-convex objective function!

an alternative:

  β̂, σ̂ = argmin_{β,σ} ( log(σ) + ‖Y − Xβ‖_2^2/(2nσ^2) + λ‖β‖_1/σ )

⇒ equivariant under scaling Y′ = bY, β′ = bβ, σ′ = bσ (b > 0), and λ can be chosen “universally”, not depending on σ
this estimator still involves a non-convex objective function
but: reparametrize!

  φ_j = β_j/σ,  ρ = σ^{−1}

the alternative estimator in reparametrized form is:

  φ̂, ρ̂ = argmin_{φ,ρ} ( −log(ρ) + ‖ρY − Xφ‖_2^2/(2n) + λ‖φ‖_1 )

with
• a convex objective function
• a universal regularization parameter which can be chosen independently of σ (or ρ)

thus, we have an estimator where:
• there is “no need” for cross-validation since λ is “universal”
• we get a consistent estimate of σ ⇒ useful, e.g., for the construction of p-values
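The equivalence between the scaled-penalty objective and its reparametrized convex form can be checked numerically; a minimal sketch (function names are ours, with plain-Python linear algebra for illustration):

```python
import math

def obj_original(beta, sigma, X, Y, lam):
    """Scaled-penalty objective: log(sigma) + ||Y - X beta||^2 / (2 n sigma^2)
    + lam * ||beta||_1 / sigma  (non-convex in (beta, sigma))."""
    n = len(Y)
    rss = sum((y - sum(b * x for b, x in zip(beta, xi))) ** 2
              for xi, y in zip(X, Y))
    return (math.log(sigma) + rss / (2 * n * sigma ** 2)
            + lam * sum(abs(b) for b in beta) / sigma)

def obj_reparam(phi, rho, X, Y, lam):
    """Reparametrized objective: -log(rho) + ||rho Y - X phi||^2 / (2 n)
    + lam * ||phi||_1  (convex in (phi, rho))."""
    n = len(Y)
    rss = sum((rho * y - sum(p * x for p, x in zip(phi, xi))) ** 2
              for xi, y in zip(X, Y))
    return -math.log(rho) + rss / (2 * n) + lam * sum(abs(p) for p in phi)
```

Substituting φ = β/σ and ρ = 1/σ, the two objectives agree term by term: log(σ) = −log(ρ), the residual sums of squares coincide, and λ‖β‖_1/σ = λ‖φ‖_1.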
this appears first in Städler, Bühlmann & van de Geer (2010), refined by Sun and Zhang (2010): the scaled Lasso
it is an alternative to the “square root Lasso” of Belloni, Chernozhukov & Wang (2011), which also has a universal regularization parameter
back to the mixture of Gaussian regressions, in reparametrized form:

  P(Y_i ∈ dy | X_i) = ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r y − X_i φ_r)^2/2) dy

unknown parameters: θ = (φ_1, ..., φ_k, ρ_1, ..., ρ_k, π_1, ..., π_{k−1})
ℓ1-penalized MLE:

  θ̂(λ) = argmin_θ ( −n^{−1} ∑_{i=1}^n log( ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r Y_i − X_i φ_r)^2/2) ) + λ ∑_{r=1}^k ‖φ_r‖_1 )
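For concreteness, the penalized negative log-likelihood above can be evaluated directly; a minimal sketch (the function name and parameter packing are illustrative assumptions):

```python
import math

def penalized_nll(theta, X, Y, lam):
    """Objective of the l1-penalized MLE for a k-component mixture of
    Gaussian regressions in reparametrized form:
    -n^{-1} sum_i log( sum_r pi_r * rho_r/sqrt(2 pi)
                       * exp(-(rho_r Y_i - X_i phi_r)^2 / 2) )
    + lam * sum_r ||phi_r||_1.
    theta = (phis, rhos, pis), with phis a list of k coefficient vectors."""
    phis, rhos, pis = theta
    n = len(Y)
    nll = 0.0
    for xi, yi in zip(X, Y):
        dens = 0.0
        for phi, rho, pi_r in zip(phis, rhos, pis):
            resid = rho * yi - sum(p * x for p, x in zip(phi, xi))
            dens += pi_r * rho / math.sqrt(2 * math.pi) * math.exp(-0.5 * resid ** 2)
        nll -= math.log(dens) / n
    return nll + lam * sum(abs(p) for phi in phis for p in phi)
```

With k = 1, π_1 = 1 and λ = 0, this reduces to the usual Gaussian negative log-likelihood.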
Riboflavin production with Bacillus subtilis: n = 146, p = 4088

preselection of the 100 variables (genes) with the highest empirical variance; the tuning parameter is chosen via 10-fold cross-validation

[Figure: 10-fold cross-validation error versus the number of mixture components k = 1, ..., 5]
⇒ the data is not homogeneous; the k = 3-component mixture model fits best

the ℓ1-penalized MLE selects 51 variables (genes)
top 20 variables according to ∑_{r=1}^3 |β̂_{r,j}|

[Figure: coefficient magnitudes (ranging from −0.7 to 0.5) for the top 20 genes: genes 1, 29, 70, 91, 43, 52, 81, 3, 73, 25, 87, 41, 27, 44, 15, 83, 46, 74, 85, 34]

⇒ mostly the same signs of β̂_{1,j}, β̂_{2,j}, β̂_{3,j} across the three components
Computation for the ℓ1-penalized MLE in mixtures of Gaussian regressions

  θ̂(λ) = argmin_θ ( −n^{−1} ∑_{i=1}^n log( ∑_{r=1}^k π_r (ρ_r/√(2π)) exp(−(ρ_r Y_i − X_i φ_r)^2/2) ) + λ ∑_{r=1}^k ‖φ_r‖_1 )
this involves a non-convex loss function, due to the mixture components

if we knew the membership of each point to a component, with

  ∆_{i,r} = 1 if the i-th data point belongs to component r,

we could interchange

  log( ∑_{r=1}^k π_r ... ) → ∑_{r=1}^k ∆_{i,r} log( ... )

⇒ convex optimization
such a setting calls for the EM algorithm:
• E-step: compute the responsibilities

  γ_{i,r} = E[∆_{i,r} | Y_i, θ^[m]]

• M-step: minimize the objective function above with the substitution

  log( ∑_{r=1}^k π_r ... ) → ∑_{r=1}^k γ_{i,r} log( ... )

the M-step is a convex optimization
⇒ such an EM algorithm converges to a stationary point (not necessarily a global optimum)
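The E-step amounts to computing posterior membership probabilities via Bayes' rule; a minimal sketch in the reparametrized form (the function name is ours; the M-step is omitted):

```python
import math

def e_step(X, Y, phis, rhos, pis):
    """E-step: responsibilities gamma_{i,r} = E[Delta_{i,r} | Y_i, theta^[m]],
    the posterior probability that point i belongs to component r, using the
    component densities pi_r * rho_r/sqrt(2 pi) * exp(-(rho_r y - x phi_r)^2/2)."""
    gammas = []
    for xi, yi in zip(X, Y):
        w = []
        for phi, rho, pi_r in zip(phis, rhos, pis):
            resid = rho * yi - sum(p * x for p, x in zip(phi, xi))
            w.append(pi_r * rho / math.sqrt(2 * math.pi) * math.exp(-0.5 * resid ** 2))
        s = sum(w)
        gammas.append([wr / s for wr in w])  # normalize: Bayes' rule
    return gammas
```

A data point lying close to the regression line of one component receives a responsibility near 1 for that component.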
one can speed this up by using the generalized EM algorithm
generalized M-step: any new parameter value that improves the objective function
although we only have convergence to a stationary point, the problem is better behaved than standard MLE in mixture models; consider (in the other parameterization):

  −ℓ(β, σ; data)/n + λ ∑_{r=1}^k ‖β_r‖_1/σ_r

the typical problem with mixture models is an unbounded likelihood as σ_r → 0, but this scenario is penalized by the scaled ℓ1-penalty!
⇒ the penalized likelihood stays bounded
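The boundedness argument can be illustrated numerically: for a single observation fitted exactly, the unpenalized Gaussian negative log-likelihood log(σ) diverges to −∞ as σ → 0, while the scaled penalty λ|β|/σ grows faster, so the penalized objective is bounded below (a toy sketch with illustrative values of β and λ):

```python
import math

# One observation fitted exactly (Y = X * beta), so the residual term is 0.
def unpenalized(sigma):
    """Gaussian negative log-likelihood at an exact fit: log(sigma) -> -inf."""
    return math.log(sigma)

def penalized(sigma, beta=1.0, lam=0.1):
    """Adding the scaled l1-penalty lam * |beta| / sigma: the 1/sigma term
    dominates log(sigma) as sigma -> 0, so the objective blows up, not down."""
    return math.log(sigma) + lam * abs(beta) / sigma
```

The penalized objective has an interior minimum (here near σ = λ|β|) instead of degenerating at σ = 0.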
the methodology remains the same as for the standard Lasso,
that is: often, one would use refitting or adaptive ℓ1-penalization

R package fmrlasso (Städler)
Linear mixed effects models

model:

  Y_ij = X_ij^T β + Z_ij^T b_i + ε_ij,  i = 1, ..., N,  j = 1, ..., n_i,
  b_1, ..., b_N i.i.d. N_q(0, Γ_τ),  ε_ij i.i.d. N(0, σ^2)

with X_ij the p×1 vector of fixed-effects covariates, Z_ij the q×1 vector of random-effects covariates, and b_i the random coefficients

high-dimensional: p ≫ N_T = ∑_{i=1}^N n_i
the dimension q of the random effects may be very large; q* = dim(τ) is “small”
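Under this model, the marginal covariance of the responses within group i is Cov(Y_i) = Z_i Γ_τ Z_i^T + σ² I_{n_i}, which is what the (non-convex) marginal likelihood is built from; a minimal sketch computing it for one group (the function name and plain-list linear algebra are ours):

```python
def marginal_cov(Z, Gamma, sigma2):
    """Marginal covariance of the n_i responses in group i under
    Y_ij = X_ij' beta + Z_ij' b_i + eps_ij, b_i ~ N_q(0, Gamma),
    eps_ij ~ N(0, sigma2):  Cov(Y_i) = Z_i Gamma Z_i' + sigma2 * I.
    Z is the n_i x q random-effects design for group i."""
    ni, q = len(Z), len(Z[0])
    # start from sigma2 * I
    V = [[sigma2 if j == k else 0.0 for k in range(ni)] for j in range(ni)]
    for j in range(ni):
        for k in range(ni):
            # add the random-effects contribution Z_j' Gamma Z_k
            V[j][k] += sum(Z[j][a] * Gamma[a][b] * Z[k][b]
                           for a in range(q) for b in range(q))
    return V
```

For a random-intercept model (Z_ij ≡ 1, q = 1), this gives the classical compound-symmetry structure: variance τ² + σ² on the diagonal and covariance τ² within a group.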
ℓ1-penalized MLE:

  −ℓ(β, τ, σ^2; data)/N + λ‖β‖_1

this involves a non-convex loss function

computation: coordinatewise optimization for j = 1, 2, ..., p, p+1, ..., p+q*+1, 1, 2, ...

R package lmmlasso (Schelldorfer)
Riboflavin production with Bacillus subtilis

N = 28 groups (strains) with n_i ∈ {2, ..., 6} and N_T = 111; p = 4088

model:

  Y_ij = X_ij^T β + Z_ij,k1 b_i,k1 + Z_ij,k2 b_i,k2 + ε_ij,
  b_1, ..., b_N i.i.d. N_2(0, diag(τ_{k1}^2, τ_{k2}^2))

Estimates    | LMMLasso | adaptive LMMLasso | Lasso | adaptive Lasso
σ̂^2         | 0.18     | 0.15              | 0.30  | 0.20
τ̂_{k1}^2    | 0.17     | 0.08              | –     | –
τ̂_{k2}^2    | 0.03     | 0.03              | –     | –
|Ŝ|          | 18       | 14                | 21    | 20
LMMLasso:
• smaller (estimated) residual variance for (adaptive) LMMLasso than for (adaptive) Lasso
• 53% of the total variability is between groups: (0.17 + 0.03)/(0.18 + 0.17 + 0.03) = 0.53 (for LMMLasso)
• (adaptive) LMMLasso is sparser than (adaptive) Lasso