Motivation Duality Screening rules Coordinate descent
Mind the duality gap: safer rules for the Lasso
Olivier Fercoq
Joint work with A. Gramfort, E. Ndiaye and J. Salmon
26 October 2018
Sparse linear model
Objective: determine the parameters β of the model
$$\underbrace{y}_{y \in \mathbb{R}^n} \approx \underbrace{X}_{X \in \mathbb{R}^{n \times p}} \underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta \in \mathbb{R}^p}$$
where we would like β sparse.
$y \in \mathbb{R}^n$: a signal
$X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: a collection of atoms (the dictionary)
The Lasso
A possible way to obtain a sparse vector when the dictionary is known:
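$$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \Big( \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda\|\beta\|_1}_{\text{sparsity-inducing penalty}} \Big)$$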
• Not necessarily unique solution
• Best λ unknown a priori
Chosen by cross-validation: needs to solve a lot of Lasso problems $\hat\beta^{(\lambda_1)}, \dots, \hat\beta^{(\lambda_T)}$
(often $T = 100$ and 10 folds)
Algorithmic solutions 1
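$$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \Big( \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{quadratic}} + \underbrace{\lambda\|\beta\|_1}_{\text{piecewise linear}} \Big)$$
• Least Angle Regression: Efron et al. (2004)
  based on solving linear systems
  very efficient when $p$ is small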
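Algorithmic solutions 2
Convex problem
$$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \Big( \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{differentiable}} + \underbrace{\lambda\|\beta\|_1}_{\text{separable}} \Big)$$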
Use soft-thresholding
$$\mathrm{ST}(\lambda, y) = \arg\min_{t \in \mathbb{R}} \Big( \tfrac{1}{2}(y - t)^2 + \lambda|t| \Big) = \mathrm{sign}(y) \cdot (|y| - \lambda)_+$$
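The closed form above can be checked numerically. The following is a minimal Python sketch, not part of the talk; the function names `soft_threshold` and `scalar_objective` are illustrative choices, and the argument order follows the slide's $\mathrm{ST}(\lambda, y)$.

```python
import math


def soft_threshold(lam: float, y: float) -> float:
    """ST(lam, y) = sign(y) * max(|y| - lam, 0)."""
    return math.copysign(max(abs(y) - lam, 0.0), y)


def scalar_objective(t: float, lam: float, y: float) -> float:
    """The one-dimensional problem that ST(lam, y) is claimed to minimize."""
    return 0.5 * (y - t) ** 2 + lam * abs(t)


if __name__ == "__main__":
    lam, y = 1.0, 3.0
    t_closed_form = soft_threshold(lam, y)
    # Brute-force comparison on a fine grid of candidate values of t.
    grid = [i / 1000.0 for i in range(-5000, 5001)]
    t_grid = min(grid, key=lambda t: scalar_objective(t, lam, y))
    print(t_closed_form, t_grid)
```

Inputs with $|y| \le \lambda$ are mapped exactly to zero, which is what makes the Lasso iterates sparse.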
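• Proximal algorithm: Beck & Teboulle (2009)
  useful when $r \mapsto x_j^\top r$ is cheap (e.g. FFT)
• Coordinate descent: Friedman et al. (2007)
$$\beta_j^{k+1} = \begin{cases} \mathrm{ST}\Big( \dfrac{\lambda}{\|x_j\|^2},\ \beta_j^k - \dfrac{1}{\|x_j\|^2}\, x_j^\top (X\beta^k - y) \Big) & \text{if } j = j_{k+1} \\ \beta_j^k & \text{otherwise} \end{cases}$$
  very useful when $p$ is large and $X$ is sparse (e.g. text)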
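For the proximal algorithm bullet, here is a minimal sketch of the plain proximal gradient iteration. It assumes a constant step size $1/L$ with $L$ the largest eigenvalue of $X^\top X$, and it omits the momentum step of Beck & Teboulle's accelerated method. The names `ista` and `n_iter` are illustrative and not taken from the talk.

```python
import numpy as np


def soft_threshold(lam, v):
    """Entrywise ST(lam, v) = sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def ista(X, y, lam, n_iter=500):
    """Proximal gradient on 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    # Step size 1/L, where L is the squared spectral norm of X.
    L = np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # The costly part of each iteration: one product with X, one with X^T.
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(lam / L, beta - grad / L)
    return beta
```

Each iteration touches all $p$ coordinates at once, which is why this family is attractive when products with $X$ and $X^\top$ are fast.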
Dual problem
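Primal function: $P_\lambda(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
Dual feasible set: $\Delta_X = \{\theta \in \mathbb{R}^n : |x_j^\top \theta| \le 1,\ \forall j \in [p]\}$
Dual solution:
$$\hat\theta^{(\lambda)} = \arg\max_{\theta \in \Delta_X \subset \mathbb{R}^n} \underbrace{\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2}_{= D_\lambda(\theta)}$$
$\hat\theta^{(\lambda)} = \Pi_{\Delta_X}(y/\lambda)$, the projection of $y/\lambda$ onto $\Delta_X$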
Duality gap properties
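• Duality: for all $\beta \in \mathbb{R}^p$ and $\theta \in \Delta_X$,
$$D_\lambda(\theta) \le D_\lambda(\hat\theta^{(\lambda)}) = P_\lambda(\hat\beta^{(\lambda)}) \le P_\lambda(\beta)$$
• One can compute
$$G_\lambda(\beta, \theta) = P_\lambda(\beta) - D_\lambda(\theta) = \tfrac{1}{2}\|X\beta - y\|_2^2 + \lambda\|\beta\|_1 - \Big( \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2 \Big)$$
• Stopping criterion
$$G_\lambda(\beta, \theta) \le \varepsilon \ \Rightarrow\ P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)}) \le \varepsilon \quad (\beta \text{ is an } \varepsilon\text{-solution})$$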
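The three quantities on this slide translate directly into code. The sketch below is an illustration, not the speaker's implementation; the function names are hypothetical, and the caller is assumed to pass a feasible $\theta \in \Delta_X$ (the gap is only a valid certificate in that case).

```python
import numpy as np


def primal(X, y, beta, lam):
    """P_lam(beta) = 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))


def dual(y, theta, lam):
    """D_lam(theta) = 0.5 * ||y||^2 - 0.5 * lam^2 * ||theta - y / lam||^2."""
    return 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)


def duality_gap(X, y, beta, theta, lam):
    """G_lam(beta, theta); an upper bound on P(beta) - P(beta_hat) if theta is feasible."""
    if np.max(np.abs(X.T @ theta)) > 1.0 + 1e-10:
        raise ValueError("theta is not dual feasible")
    return primal(X, y, beta, lam) - dual(y, theta, lam)
```

A simple feasible point to try it with is $\theta = y / \max(\lambda, \|X^\top y\|_\infty)$, which satisfies $\|X^\top \theta\|_\infty \le 1$ by construction.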
Karush-Kuhn-Tucker conditions (KKT)
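• Lagrangian:
$$L(\theta, \beta) = \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2 + \lambda\|\beta\|_1 - \lambda\beta^\top X^\top \theta$$
• Primal solution: $\hat\beta^{(\lambda)} \in \mathbb{R}^p$
• Dual solution: $\hat\theta^{(\lambda)} \in \mathbb{R}^n$
KKT:
$$\lambda\hat\theta^{(\lambda)} = y - X\hat\beta^{(\lambda)}$$
$$\|X^\top \hat\theta^{(\lambda)}\|_\infty \le 1$$
$$|\hat\beta_j^{(\lambda)}| \big( |x_j^\top \hat\theta^{(\lambda)}| - 1 \big) = 0 \quad \forall j \in [p]$$
In particular, $\forall \lambda \ge \lambda_{\max} = \|X^\top y\|_\infty$, $0 \in \mathbb{R}^p$ is a primal solution for $P_\lambda$ and $y/\lambda \in \Delta_X$ is dual optimal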
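The last claim can be verified in two lines from the definitions of $\Delta_X$, $P_\lambda$ and $D_\lambda$. For $\lambda \ge \lambda_{\max}$, the point $y/\lambda$ is dual feasible:

$$\Big\|X^\top \frac{y}{\lambda}\Big\|_\infty = \frac{\|X^\top y\|_\infty}{\lambda} = \frac{\lambda_{\max}}{\lambda} \le 1$$

and the primal and dual values coincide at the pair $(0, y/\lambda)$:

$$P_\lambda(0) = \tfrac{1}{2}\|y\|_2^2 = \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\frac{y}{\lambda} - \frac{y}{\lambda}\Big\|_2^2 = D_\lambda(y/\lambda)$$

so the duality gap is zero and both points are optimal.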
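Safe regions
We can screen variables thanks to the KKT condition $|\hat\beta_j^{(\lambda)}| \big( |x_j^\top \hat\theta^{(\lambda)}| - 1 \big) = 0$:
If $|x_j^\top \hat\theta^{(\lambda)}| < 1$ then $\hat\beta_j^{(\lambda)} = 0$
Caveat: $\hat\theta^{(\lambda)}$ is unknown
Solution: consider a safe region $C$ such that $\hat\theta^{(\lambda)} \in C$
If $\sup_{\theta \in C} |x_j^\top \theta| < 1$ then $\hat\beta_j^{(\lambda)} = 0$
Goal: find a region $C$
1. that contains $\hat\theta^{(\lambda)}$
2. as small as possible
3. such that $\mu_C : x \mapsto \sup_{\theta \in C} |x^\top \theta|$ is easy to compute
→ $C = B(c, r)$, a ball of center $c \in \mathbb{R}^n$ and radius $r > 0$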
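For a ball the test has a closed form: by the Cauchy-Schwarz inequality, $\mu_{B(c,r)}(x) = |x^\top c| + r\|x\|_2$. A minimal sketch of the resulting sphere test follows; the names `sphere_test` and `screened_mask` are illustrative, and the center and radius are assumed to be supplied by the caller.

```python
import numpy as np


def sphere_test(X, c, r):
    """mu_B(x_j) = |x_j^T c| + r * ||x_j||_2 for every column x_j of X."""
    return np.abs(X.T @ c) + r * np.linalg.norm(X, axis=0)


def screened_mask(X, c, r):
    """True for the columns j that can be discarded, i.e. mu_B(x_j) < 1."""
    return sphere_test(X, c, r) < 1.0
```

The cost is one product with $X^\top$ plus the column norms, which can be precomputed once.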
Creating a safe sphere
Original safe rule: El Ghaoui et al. (2012)
Static safe rule, becomes useless when $\lambda$ gets small
Dynamic screening rule
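• Use a dual point $\theta_k \in \Delta_X$ that evolves as the optimisation algorithm proceeds
  Bonnefoy et al. (2014, 2015)
• We have a primal point $\beta_k$ and the residuals
$$\rho_k = y - X\beta_k$$
• Dual candidate:
$$\theta_k = \frac{y - X\beta_k}{\max(\lambda, \|X^\top \rho_k\|_\infty)}$$
• Motivation: $\theta_k \in \Delta_X$ and if $\beta_k \to \hat\beta^{(\lambda)}$ then $\theta_k \to \hat\theta^{(\lambda)}$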
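The dual candidate is a one-line rescaling of the residuals. The sketch below is illustrative only; `dual_point` is a hypothetical name, and it assumes $\lambda > 0$ so that the denominator is never zero.

```python
import numpy as np


def dual_point(X, y, beta, lam):
    """Rescale the residuals rho = y - X beta so that the result lies in Delta_X."""
    rho = y - X @ beta
    # Dividing by max(lam, ||X^T rho||_inf) guarantees ||X^T theta||_inf <= 1.
    return rho / max(lam, np.max(np.abs(X.T @ rho)))
```

The product `X.T @ rho` is also what a coordinate descent sweep needs, so constructing $\theta_k$ adds little overhead when done every few passes.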
Limits of previous dynamic screening rules
$r_k = \|\theta_k - y/\lambda\|$ does not converge to 0: the limit ball is
GAP safe rule
Bonnefoy et al. (2014,2015)
Weak duality:
$$\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta_k - \frac{y}{\lambda}\Big\|_2^2 \le \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\hat\theta^{(\lambda)} - \frac{y}{\lambda}\Big\|_2^2 \le P_\lambda(\beta_k)$$
Rearranging:
$$\check R_\lambda(\theta)^2 = \Big\|\theta_k - \frac{y}{\lambda}\Big\|_2^2 \ge \Big\|\hat\theta^{(\lambda)} - \frac{y}{\lambda}\Big\|_2^2 \ge -\frac{2}{\lambda^2} P_\lambda(\beta_k) + \frac{1}{\lambda^2}\|y\|_2^2 = \hat R_\lambda(\beta)^2$$
$$r_\lambda(\beta, \theta) = \sqrt{\check R_\lambda(\theta)^2 - \hat R_\lambda(\beta)^2} = \frac{\sqrt{2\, G_\lambda(\beta, \theta)}}{\lambda}$$
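The last equality follows by substituting the two radii and recognising the dual function $D_\lambda(\theta) = \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\|\theta - y/\lambda\|_2^2$:

$$\check R_\lambda(\theta)^2 - \hat R_\lambda(\beta)^2 = \Big\|\theta - \frac{y}{\lambda}\Big\|_2^2 + \frac{2}{\lambda^2} P_\lambda(\beta) - \frac{1}{\lambda^2}\|y\|_2^2 = \frac{2}{\lambda^2}\big( P_\lambda(\beta) - D_\lambda(\theta) \big) = \frac{2\, G_\lambda(\beta, \theta)}{\lambda^2}$$

As a purely illustrative numerical example (the values are made up): with $G_\lambda(\beta, \theta) = 0.02$ and $\lambda = 0.5$, the radius is $r = \sqrt{0.04}/0.5 = 0.4$. Since the gap tends to 0 as the solver converges, so does the radius.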
Algorithm 1 Coordinate descent (Lasso)
Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$
Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do    ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$    ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then    ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in [p]$ do    ▷ Soft-threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \dfrac{\lambda_t}{\|x_j\|^2},\ \beta_j - \dfrac{x_j^\top (X\beta - y)}{\|x_j\|^2} \Big)$
        end for
    end for
end for
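Algorithm 1 could be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation benchmarked in the talk: the dual point is the rescaled residual from the dynamic screening slide, the loop index is 0-based so the gap is checked when `k % f == 0`, and the names `lasso_path_cd` and `duality_gap` are hypothetical.

```python
import numpy as np


def soft_threshold(lam, v):
    """Scalar ST(lam, v) = sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * max(abs(v) - lam, 0.0)


def duality_gap(X, y, beta, lam):
    """Gap G_lam(beta, theta) with theta the rescaled residual (a feasible dual point)."""
    rho = y - X @ beta
    theta = rho / max(lam, np.max(np.abs(X.T @ rho)))
    p_val = 0.5 * rho @ rho + lam * np.sum(np.abs(beta))
    d_val = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    return p_val - d_val


def lasso_path_cd(X, y, lambdas, eps=1e-8, K=10_000, f=10):
    """Cyclic coordinate descent along a decreasing grid of lambdas, with warm starts."""
    norms2 = np.sum(X ** 2, axis=0)
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        residual = y - X @ beta  # kept equal to y - X @ beta during the sweeps
        for k in range(K):
            # Check the stopping criterion every f passes over the coordinates.
            if k % f == 0 and duality_gap(X, y, beta, lam) <= eps:
                break
            for j in range(X.shape[1]):
                if norms2[j] == 0.0:
                    continue  # an all-zero column never enters the model
                old = beta[j]
                new = soft_threshold(lam / norms2[j], old + X[:, j] @ residual / norms2[j])
                if new != old:
                    residual -= (new - old) * X[:, j]
                    beta[j] = new
        path.append(beta.copy())
    return path
```

Maintaining the residual incrementally keeps each coordinate update at the cost of one column of $X$, which is what makes the method attractive for sparse designs.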
Algorithm 2 Gap Safe screening for coordinate descent
Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$
Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do    ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$    ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$, $A^{\lambda_t}(C) = \{ j \in [p] : \mu_C(x_j) \ge 1 \}$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then    ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in A^{\lambda_t}(C)$ do    ▷ Soft-threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \dfrac{\lambda_t}{\|x_j\|^2},\ \beta_j - \dfrac{x_j^\top (X\beta - y)}{\|x_j\|^2} \Big)$
        end for
    end for
end for
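A possible Python rendering of Algorithm 2 for a single $\lambda$ is given below. It is a sketch under assumptions that go beyond the slides: the safe region is taken to be the sphere centered at the current dual point $\theta$ with radius $\sqrt{2 G_\lambda(\beta, \theta)}/\lambda$, the sphere test uses the closed form $|x_j^\top \theta| + r\|x_j\|_2$, and the name `gap_safe_cd` and the `screen` flag are hypothetical, added only so the screened and unscreened runs can be compared.

```python
import numpy as np


def soft_threshold(lam, v):
    """Scalar ST(lam, v) = sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * max(abs(v) - lam, 0.0)


def gap_safe_cd(X, y, lam, eps=1e-8, K=10_000, f=10, screen=True):
    """Cyclic coordinate descent for one lambda with optional Gap Safe sphere screening."""
    norms2 = np.sum(X ** 2, axis=0)
    norms = np.sqrt(norms2)
    active = norms2 > 0.0  # all-zero columns can never be used
    beta = np.zeros(X.shape[1])
    residual = np.array(y, dtype=float)  # kept equal to y - X @ beta
    for k in range(K):
        if k % f == 0:
            corr = X.T @ residual
            scale = max(lam, np.max(np.abs(corr)))
            theta = residual / scale  # feasible dual point
            p_val = 0.5 * residual @ residual + lam * np.sum(np.abs(beta))
            d_val = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
            gap = max(p_val - d_val, 0.0)
            if gap <= eps:
                break
            if screen:
                radius = np.sqrt(2.0 * gap) / lam
                mu = np.abs(corr) / scale + radius * norms  # sphere test per column
                for j in np.flatnonzero(active & (mu < 1.0)):
                    # Variable j is certified inactive: zero it and drop it for good.
                    residual += beta[j] * X[:, j]
                    beta[j] = 0.0
                    active[j] = False
        for j in np.flatnonzero(active):
            old = beta[j]
            new = soft_threshold(lam / norms2[j], old + X[:, j] @ residual / norms2[j])
            if new != old:
                residual -= (new - old) * X[:, j]
                beta[j] = new
    return beta, active
```

The gap is always evaluated on the full design, so the stopping criterion certifies an ε-solution of the original problem even though the sweeps only visit the surviving columns.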
Gap safe rule: advantages
• Dynamic rule
• The safe region converges towards $\{\hat\theta^{(\lambda)}\}$
• It works better in practice
[Figure: proportion of active variables (Leukemia, n = 72, p = 7,129)]
Computing time
[Figure: time (s) against -log10(duality gap) for No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]
Time to obtain an ε-solution (Leukemia, n = 72, p = 7,129)
Computing time
[Figure: time (s) against -log10(duality gap) for No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]
Time to obtain an ε-solution (RCV1, n = 20,242, p = 47,236)
Conclusion
• New safe screening rule based on the duality gap
• Theoretical advantage: converging safe region
• Improves computational efficiency on Scikit-Learn's coordinate descent implementation, Pedregosa et al. (2011)
• Extensions
  - same idea works for many nonsmooth convex losses (group-lasso, sparse logistic regression, SVM, ...)
  - guarantees for variable selection
  - accuracy guarantees on the whole path $[\lambda_T, \lambda_1]$, not only $\{\lambda_1, \dots, \lambda_T\}$