Mind the duality gap: safer rules for the Lasso · A possible way to obtain a sparse vector when...

Motivation Duality Screening rules Coordinate descent

Mind the duality gap: safer rules for the Lasso

Olivier Fercoq

Joint work with A. Gramfort, E. Ndiaye and J. Salmon

26 October 2018



Sparse linear model

Objective: determine the parameters β of the model

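$$
\underbrace{y}_{\in \mathbb{R}^n} \approx \underbrace{X}_{\in \mathbb{R}^{n \times p}} \underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta \in \mathbb{R}^p}
$$

where we would like β to be sparse.

$y \in \mathbb{R}^n$: a signal

$X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: a collection of atoms (the dictionary)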


The Lasso

A possible way to obtain a sparse vector when the dictionary is known:

$$
\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \Big( \underbrace{\tfrac{1}{2} \|y - X\beta\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda \|\beta\|_1}_{\text{sparsity-inducing penalty}} \Big)
$$

• The solution is not necessarily unique

• The best λ is unknown a priori

λ is chosen by cross-validation, which requires solving many Lasso problems $\hat\beta^{(\lambda_1)}, \dots, \hat\beta^{(\lambda_T)}$ (often T = 100 values of λ and 10 folds)


Algorithmic solutions 1


$$
\Big( \underbrace{\tfrac{1}{2} \|y - X\beta\|_2^2}_{\text{quadratic}} + \underbrace{\lambda \|\beta\|_1}_{\text{piecewise linear}} \Big)
$$

• Least Angle Regression: Efron et al. (2004)
  based on solving linear systems; very efficient when p is small


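Algorithmic solutions 2

Convex problem

$$
\Big( \underbrace{\tfrac{1}{2} \|y - X\beta\|_2^2}_{\text{differentiable}} + \underbrace{\lambda \|\beta\|_1}_{\text{separable}} \Big)
$$

Use soft-thresholding

$$
\mathrm{ST}(\lambda, y) = \arg\min_{t \in \mathbb{R}} \Big( \tfrac{1}{2} (y - t)^2 + \lambda |t| \Big) = \operatorname{sign}(y) \cdot (|y| - \lambda)_+
$$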

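• Proximal algorithm: Beck & Teboulle (2009)
  useful when $r \mapsto x_j^\top r$ is cheap (e.g. FFT)

• Coordinate descent: Friedman et al. (2007)

$$
\beta_j^{k+1} =
\begin{cases}
\mathrm{ST}\Big( \dfrac{\lambda}{\|x_j\|_2^2}, \ \beta_j^k - \dfrac{1}{\|x_j\|_2^2} x_j^\top (X\beta^k - y) \Big) & \text{if } j = j_{k+1} \\
\beta_j^k & \text{otherwise}
\end{cases}
$$

  very useful when p is large and X is sparse (e.g. text)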
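As an illustration of the soft-thresholding operator and of one coordinate descent step, here is a minimal NumPy sketch. The function names `soft_threshold` and `cd_update` are hypothetical, the columns of X are assumed to be nonzero, and the threshold is scaled by $\|x_j\|_2^2$ as in Algorithm 1 of these slides.

```python
import numpy as np

def soft_threshold(lam, y):
    """ST(lam, y) = sign(y) * (|y| - lam)_+, applied elementwise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def cd_update(X, y, beta, lam, j):
    """One coordinate descent step on coordinate j; returns a new vector."""
    col_sq = X[:, j] @ X[:, j]              # ||x_j||^2, assumed nonzero
    grad_j = X[:, j] @ (X @ beta - y)       # x_j^T (X beta - y)
    new_beta = beta.copy()
    new_beta[j] = soft_threshold(lam / col_sq, beta[j] - grad_j / col_sq)
    return new_beta
```

Each such step exactly minimises the Lasso objective along coordinate j, so the objective cannot increase.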


Dual problem

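Primal function: $P_\lambda(\beta) = \tfrac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

Dual feasible set: $\Delta_X = \{ \theta \in \mathbb{R}^n : |x_j^\top \theta| \le 1, \ \forall j \in [p] \}$

Dual solution:

$$
\hat\theta^{(\lambda)} = \arg\max_{\theta \in \Delta_X \subset \mathbb{R}^n} \underbrace{\tfrac{1}{2} \|y\|_2^2 - \tfrac{\lambda^2}{2} \Big\| \theta - \tfrac{y}{\lambda} \Big\|_2^2}_{= D_\lambda(\theta)}
$$

Equivalently, $\hat\theta^{(\lambda)} = \Pi_{\Delta_X}(y/\lambda)$, the projection of $y/\lambda$ onto $\Delta_X$.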


Duality gap properties

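• Duality: for all $\beta \in \mathbb{R}^p$ and $\theta \in \Delta_X$,

$$
D_\lambda(\theta) \le D_\lambda(\hat\theta^{(\lambda)}) = P_\lambda(\hat\beta^{(\lambda)}) \le P_\lambda(\beta)
$$

• One can compute

$$
G_\lambda(\beta, \theta) = P_\lambda(\beta) - D_\lambda(\theta) = \tfrac{1}{2} \|X\beta - y\|_2^2 + \lambda \|\beta\|_1 - \Big( \tfrac{1}{2} \|y\|_2^2 - \tfrac{\lambda^2}{2} \Big\| \theta - \tfrac{y}{\lambda} \Big\|_2^2 \Big)
$$

• Stopping criterion

$$
G_\lambda(\beta, \theta) \le \varepsilon \implies P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)}) \le \varepsilon \quad (\beta \text{ is an } \varepsilon\text{-solution})
$$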
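The primal value, dual value and duality gap can be evaluated directly from their definitions. The sketch below assumes dense NumPy arrays and uses hypothetical function names (`primal`, `dual`, `duality_gap`); it does not check that θ is dual feasible, which is the caller's responsibility.

```python
import numpy as np

def primal(X, y, beta, lam):
    """P_lam(beta) = 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def dual(y, theta, lam):
    """D_lam(theta) = 0.5 * ||y||^2 - 0.5 * lam^2 * ||theta - y/lam||^2."""
    return 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)

def duality_gap(X, y, beta, theta, lam):
    """G_lam(beta, theta); nonnegative whenever theta is dual feasible."""
    return primal(X, y, beta, lam) - dual(y, theta, lam)
```

A coordinate descent loop can stop as soon as `duality_gap(...) <= eps`, which certifies an ε-solution.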


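Karush-Kuhn-Tucker conditions (KKT)

• Lagrangian:

$$
L(\theta, \beta) = \tfrac{1}{2} \|y\|_2^2 - \tfrac{\lambda^2}{2} \Big\| \theta - \tfrac{y}{\lambda} \Big\|_2^2 + \lambda \|\beta\|_1 - \lambda \beta^\top X^\top \theta
$$

• Primal solution: $\hat\beta^{(\lambda)} \in \mathbb{R}^p$

• Dual solution: $\hat\theta^{(\lambda)} \in \mathbb{R}^n$

KKT:

$$
\begin{aligned}
& \lambda \hat\theta^{(\lambda)} = y - X \hat\beta^{(\lambda)} \\
& \|X^\top \hat\theta^{(\lambda)}\|_\infty \le 1 \\
& |\hat\beta^{(\lambda)}_j| \, \big( |x_j^\top \hat\theta^{(\lambda)}| - 1 \big) = 0 \quad \forall j \in [p]
\end{aligned}
$$

In particular, for all $\lambda \ge \lambda_{\max} = \|X^\top y\|_\infty$, $0 \in \mathbb{R}^p$ is a primal solution for $P_\lambda$ and $y/\lambda \in \Delta_X$ is dual optimal.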
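The statement about $\lambda_{\max}$ can be checked numerically: with β = 0 the first KKT condition gives θ = y/λ, and only dual feasibility remains to be verified. The helper names `lambda_max` and `zero_is_solution` and the tolerance `tol` are assumptions made for this sketch.

```python
import numpy as np

def lambda_max(X, y):
    """Smallest lam for which beta = 0 solves the Lasso: ||X^T y||_inf."""
    return np.max(np.abs(X.T @ y))

def zero_is_solution(X, y, lam, tol=1e-12):
    """Check the KKT conditions at beta = 0.

    With beta = 0 the link equation gives theta = y / lam, and the
    complementarity condition holds trivially, so only the feasibility
    ||X^T theta||_inf <= 1 has to be tested (up to a rounding tolerance).
    """
    theta = y / lam
    return bool(np.max(np.abs(X.T @ theta)) <= 1.0 + tol)
```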


Creating a safe sphere



Original safe rule: El Ghaoui et al. (2012)

Static safe rule: it becomes useless when λ gets small


Dynamic screening rule

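• Use a dual point $\theta_k \in \Delta_X$ that evolves as the optimisation algorithm proceeds: Bonnefoy et al. (2014, 2015)

• We have a primal point $\beta_k$ and the residuals

$$
\rho_k = y - X\beta_k
$$

• Dual candidate:

$$
\theta_k = \frac{y - X\beta_k}{\max(\lambda, \|X^\top \rho_k\|_\infty)}
$$

• Motivation: $\theta_k \in \Delta_X$ and if $\beta_k \to \hat\beta^{(\lambda)}$ then $\theta_k \to \hat\theta^{(\lambda)}$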
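The dual candidate is a rescaling of the current residual, which takes a few lines of NumPy. The function name `dual_candidate` is hypothetical, and λ > 0 is assumed so that the denominator is never zero.

```python
import numpy as np

def dual_candidate(X, y, beta, lam):
    """Rescale the residual rho = y - X beta so that it lies in Delta_X."""
    rho = y - X @ beta
    # Dividing by max(lam, ||X^T rho||_inf) guarantees |x_j^T theta| <= 1 for all j.
    scale = max(lam, np.max(np.abs(X.T @ rho)))
    return rho / scale
```

The product `X.T @ rho` is also what coordinate descent needs for its updates, so in practice the extra cost of building θ can be shared with the solver.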


Limits of previous dynamic screening rules

$r_k = \|\theta_k - y/\lambda\|_2$ does not converge to 0 in general: the limit ball has a nonzero radius


GAP safe rule

Bonnefoy et al. (2014,2015)

Weak duality:

$$
\tfrac{1}{2} \|y\|_2^2 - \tfrac{\lambda^2}{2} \Big\| \theta_k - \tfrac{y}{\lambda} \Big\|_2^2 \le \tfrac{1}{2} \|y\|_2^2 - \tfrac{\lambda^2}{2} \Big\| \hat\theta^{(\lambda)} - \tfrac{y}{\lambda} \Big\|_2^2 \le P_\lambda(\beta_k)
$$

Rearranging:

$$
\check{R}_\lambda(\theta)^2 = \Big\| \theta_k - \tfrac{y}{\lambda} \Big\|_2^2 \ge \Big\| \hat\theta^{(\lambda)} - \tfrac{y}{\lambda} \Big\|_2^2 \ge -\tfrac{2}{\lambda^2} P_\lambda(\beta_k) + \tfrac{1}{\lambda^2} \|y\|_2^2 = \hat{R}_\lambda(\beta)^2
$$

Radius:

$$
r_\lambda(\beta, \theta) = \sqrt{\check{R}_\lambda(\theta)^2 - \hat{R}_\lambda(\beta)^2} = \frac{\sqrt{2 G_\lambda(\beta, \theta)}}{\lambda}
$$
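The radius above, together with a sphere test, can be sketched as follows. The slides do not spell out the test itself; the sketch assumes that the safe region is the ball C = B(θ, r) and that $\mu_C(x) = \max_{\theta' \in C} |x^\top \theta'| = |x^\top \theta| + r \|x\|_2$, so that feature j can be discarded when $\mu_C(x_j) < 1$. The function names are hypothetical.

```python
import numpy as np

def gap_safe_radius(X, y, beta, theta, lam):
    """r = sqrt(2 * G_lam(beta, theta)) / lam for a dual feasible theta."""
    primal = 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))
    dual = 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = max(primal - dual, 0.0)   # clip tiny negative values due to rounding
    return np.sqrt(2.0 * gap) / lam

def sphere_test(X, theta, radius):
    """Boolean mask: True for features that must be kept (mu_C(x_j) >= 1).

    Assumption: C is the ball B(theta, radius), for which
    mu_C(x_j) = |x_j^T theta| + radius * ||x_j||_2.
    """
    mu = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0)
    return mu >= 1.0
```

As the duality gap goes to 0 the radius goes to 0 as well, so the test becomes increasingly selective as the solver converges.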


Algorithm 1 Coordinate descent (Lasso)

Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$

Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do                                ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$                              ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then        ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in [p]$ do                          ▷ Soft-Threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \frac{\lambda_t}{\|x_j\|_2^2}, \ \beta_j - \frac{x_j^\top (X\beta - y)}{\|x_j\|_2^2} \Big)$
        end for
    end for
end for
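A possible NumPy transcription of Algorithm 1 is given below as a sketch rather than a reference implementation. It assumes dense X with nonzero columns, builds θ by residual rescaling as on the dynamic screening slide, maintains the residual incrementally, and uses 0-based iteration counts (so the gap is checked when `k % f == 0`). The names `lasso_gap` and `lasso_path_cd` are hypothetical.

```python
import numpy as np

def soft_threshold(lam, z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_gap(X, y, beta, lam):
    """Duality gap with the rescaled-residual dual point."""
    rho = y - X @ beta
    theta = rho / max(lam, np.max(np.abs(X.T @ rho)))
    primal = 0.5 * rho @ rho + lam * np.abs(beta).sum()
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    return primal - dual

def lasso_path_cd(X, y, lambdas, eps=1e-6, K=1000, f=10):
    """Coordinate descent over a decreasing grid of lambdas with warm starts."""
    p = X.shape[1]
    col_sq = (X ** 2).sum(axis=0)          # ||x_j||^2, assumed nonzero
    beta = np.zeros(p)                     # solution at lambda_max
    path = []
    for lam in lambdas:                    # loop over lambdas
        residual = y - X @ beta            # warm start from the previous solution
        for k in range(K):
            if k % f == 0 and lasso_gap(X, y, beta, lam) <= eps:
                break                      # stop if the duality gap is small
            for j in range(p):             # soft-threshold each coordinate
                old = beta[j]
                z = old + X[:, j] @ residual / col_sq[j]
                beta[j] = soft_threshold(lam / col_sq[j], z)
                if beta[j] != old:
                    residual -= (beta[j] - old) * X[:, j]
        # Stored even if K epochs ran out before the gap test passed.
        path.append(beta.copy())
    return path
```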


Algorithm 2 Gap Safe screening for coordinate descent

Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$

Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do                                ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$                              ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$, $A^{\lambda_t}(C) = \{ j \in [p] : \mu_C(x_j) \ge 1 \}$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then        ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in A^{\lambda_t}(C)$ do                    ▷ Soft-Threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \frac{\lambda_t}{\|x_j\|_2^2}, \ \beta_j - \frac{x_j^\top (X\beta - y)}{\|x_j\|_2^2} \Big)$
        end for
    end for
end for
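For a single value of λ, Algorithm 2 might be sketched in NumPy as follows. This is an illustration under assumptions: the safe region C is taken to be the Gap Safe sphere B(θ, √(2G)/λ), so that $\mu_C(x_j) = |x_j^\top \theta| + r \|x_j\|_2$; screened coefficients are reset to 0; and once a feature is screened it stays screened. The name `lasso_cd_gap_safe` is hypothetical.

```python
import numpy as np

def soft_threshold(lam, z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd_gap_safe(X, y, lam, eps=1e-6, K=1000, f=10):
    """Coordinate descent with Gap Safe sphere screening (single lambda)."""
    p = X.shape[1]
    col_norm = np.linalg.norm(X, axis=0)       # ||x_j||_2, assumed nonzero
    beta = np.zeros(p)
    residual = y.astype(float)                 # y - X beta with beta = 0
    keep = np.ones(p, dtype=bool)              # active set A(C)
    for k in range(K):
        if k % f == 0:
            corr = np.abs(X.T @ residual)
            scale = max(lam, corr.max())
            theta = residual / scale           # dual feasible point
            primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
            dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
            gap = max(primal - dual, 0.0)
            if gap <= eps:                     # stop if the duality gap is small
                break
            radius = np.sqrt(2.0 * gap) / lam
            # Sphere test: keep j only if |x_j^T theta| + r * ||x_j|| >= 1.
            keep &= (corr / scale + radius * col_norm) >= 1.0
            drop = ~keep & (beta != 0.0)
            if drop.any():                     # zero out screened coefficients
                residual += X[:, drop] @ beta[drop]
                beta[drop] = 0.0
        for j in np.flatnonzero(keep):         # soft-threshold active coordinates
            old = beta[j]
            z = old + X[:, j] @ residual / col_norm[j] ** 2
            beta[j] = soft_threshold(lam / col_norm[j] ** 2, z)
            if beta[j] != old:
                residual -= (beta[j] - old) * X[:, j]
    return beta, keep
```

The inner loop only visits features that survived screening, which is where the time savings come from when most features are discarded early.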


Gap safe rule: advantages

• Dynamic rule

• The safe region converges towards $\{\hat\theta^{(\lambda)}\}$

• It works better in practice

[Figure: proportion of active variables (Leukemia, n = 72; p = 7,129)]


Computing time

[Figure: time (s) versus -log10(duality gap) for No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]

Time to obtain an ε-solution (Leukemia, n = 72; p = 7,129)


Computing time

[Figure: time (s) versus -log10(duality gap) for No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]

Time to obtain an ε-solution (RCV1, n = 20,242; p = 47,236)


Conclusion

• New safe screening rule based on the duality gap

• Theoretical advantage: converging safe region

• Improves computational efficiency on Scikit-Learn's coordinate descent implementation: Pedregosa et al. (2011)

• Extensions

- same idea works for many nonsmooth convex losses (group-lasso, sparse logistic regression, SVM, ...)

- guarantees for variable selection

- accuracy guarantees on the whole path $[\lambda_T, \lambda_1]$, not only $\{\lambda_1, \dots, \lambda_T\}$
