
Motivation Duality Screening rules Coordinate descent

Mind the duality gap: safer rules for the Lasso

Olivier Fercoq

Joint work with A. Gramfort, E. Ndiaye and J. Salmon

26 October 2018


Sparse linear model

Objective: determine the parameters β of the model

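$$
\underbrace{y}_{\in\, \mathbb{R}^n}
\;\approx\;
\underbrace{X}_{\in\, \mathbb{R}^{n \times p}}
\times
\underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta \,\in\, \mathbb{R}^p}
$$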

where we would like β sparse.

$y \in \mathbb{R}^n$: a signal

$X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: a collection of atoms (the dictionary)


The Lasso

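A possible way to obtain a sparse vector when the dictionary is known:

$$
\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p}
\Big(
\underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting term}}
+ \underbrace{\lambda \|\beta\|_1}_{\text{sparsity-inducing penalty}}
\Big)
$$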

• Not necessarily unique solution

• Best λ unknown a priori

Chosen by cross-validation: needs to solve a lot of Lasso problems $\hat\beta^{(\lambda_1)}, \dots, \hat\beta^{(\lambda_T)}$ (often $T = 100$ and 10 folds)


Algorithmic solutions 1

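$$
\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p}
\Big(
\underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{quadratic}}
+ \underbrace{\lambda \|\beta\|_1}_{\text{piecewise linear}}
\Big)
$$

• Least Angle Regression: Efron et al. (2004)

- based on linear system inversions

- very efficient when $p$ is small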


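Algorithmic solutions 2

Convex problem

$$
\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p}
\Big(
\underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{differentiable}}
+ \underbrace{\lambda \|\beta\|_1}_{\text{separable}}
\Big)
$$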

Use soft-thresholding

$$
\mathrm{ST}(\lambda, y) = \arg\min_{t \in \mathbb{R}} \Big( \tfrac{1}{2}(y - t)^2 + \lambda |t| \Big) = \operatorname{sign}(y) \cdot (|y| - \lambda)_+
$$
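The closed form above translates directly into a few lines of NumPy. This is a minimal sketch; the function name `soft_threshold` and the sample values are illustrative choices, not taken from the talk.

```python
import numpy as np

def soft_threshold(lam, y):
    """Return sign(y) * max(|y| - lam, 0), applied elementwise."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Entries with |y| <= lam are set to zero, the others are shrunk towards zero by lam.
values = soft_threshold(1.0, [-3.0, -0.5, 0.0, 0.5, 3.0])
```

The argument order `(lam, y)` follows the $\mathrm{ST}(\lambda, y)$ notation of the slide.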

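• Proximal algorithm: Beck & Teboulle (2009)

- useful when $r \mapsto x_j^\top r$ is cheap (e.g. FFT)

• Coordinate descent: Friedman et al. (2007)

$$
\beta_j^{k+1} =
\begin{cases}
\mathrm{ST}\Big( \dfrac{\lambda}{\|x_j\|^2},\ \beta_j^k - \dfrac{1}{\|x_j\|^2}\, x_j^\top (X\beta^k - y) \Big) & \text{if } j = j_{k+1} \\
\beta_j^k & \text{otherwise}
\end{cases}
$$

where $j_{k+1}$ is the coordinate updated at iteration $k+1$.

- very useful when $p$ is large and $X$ is sparse (e.g. text)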
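The coordinate update can be sketched as one cyclic pass over the columns of $X$. This is an illustrative sketch under assumptions: the cyclic order, the maintained residual, the helper names `cd_epoch` and `primal`, and the random test problem are choices made here, not details from the talk.

```python
import numpy as np

def soft_threshold(lam, y):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def cd_epoch(X, y, beta, lam):
    """One cyclic pass of coordinate descent over all p coordinates (in place)."""
    residual = y - X @ beta          # kept up to date so each update costs O(n)
    sq_norms = (X ** 2).sum(axis=0)  # ||x_j||^2 for every column
    for j in range(X.shape[1]):
        if sq_norms[j] == 0.0:
            continue                 # an all-zero column cannot change the fit
        old = beta[j]
        # x_j^T (X beta - y) equals -x_j^T residual
        beta[j] = soft_threshold(lam / sq_norms[j],
                                 old + X[:, j] @ residual / sq_norms[j])
        if beta[j] != old:
            residual -= (beta[j] - old) * X[:, j]
    return beta

def primal(X, y, beta, lam):
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.abs(beta).sum()

# Hypothetical random problem; lam is set to half of ||X^T y||_inf.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
y = rng.standard_normal(20)
lam = 0.5 * np.max(np.abs(X.T @ y))
beta = np.zeros(50)
objectives = [primal(X, y, beta, lam)]
for _ in range(20):
    cd_epoch(X, y, beta, lam)
    objectives.append(primal(X, y, beta, lam))
```

Each coordinate step minimises the objective exactly along one axis, so the recorded objective values never increase.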


Dual problem

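Primal function: $P_\lambda(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

Dual feasible set: $\Delta_X = \{\theta \in \mathbb{R}^n : |x_j^\top \theta| \le 1,\ \forall j \in [p]\}$

Dual solution:

$$
\hat\theta^{(\lambda)} = \arg\max_{\theta \in \Delta_X \subset \mathbb{R}^n}
\underbrace{\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2}_{= D_\lambda(\theta)}
$$

$\hat\theta^{(\lambda)} = \Pi_{\Delta_X}(y/\lambda)$, the projection of $y/\lambda$ onto $\Delta_X$.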


Duality gap properties

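• Duality: for all $\beta \in \mathbb{R}^p$ and $\theta \in \Delta_X$,

$$
D_\lambda(\theta) \le D_\lambda(\hat\theta^{(\lambda)}) = P_\lambda(\hat\beta^{(\lambda)}) \le P_\lambda(\beta)
$$

• One can compute

$$
G_\lambda(\beta, \theta) = P_\lambda(\beta) - D_\lambda(\theta)
= \tfrac{1}{2}\|X\beta - y\|_2^2 + \lambda \|\beta\|_1
- \Big( \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2 \Big)
$$

• Stopping criterion

$$
G_\lambda(\beta, \theta) \le \varepsilon \implies P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)}) \le \varepsilon
\quad (\beta \text{ is an } \varepsilon\text{-solution})
$$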
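The gap is cheap to evaluate once a feasible dual point is available. The sketch below spells out the primal, the dual and their difference; the function names and the random data are assumptions made for illustration, and the dual point is simply $y$ rescaled to be feasible.

```python
import numpy as np

def primal(X, y, beta, lam):
    """P_lambda(beta) = 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def dual(y, theta, lam):
    """D_lambda(theta) = 0.5 * ||y||^2 - 0.5 * lam^2 * ||theta - y / lam||^2."""
    return 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)

def duality_gap(X, y, beta, theta, lam):
    return primal(X, y, beta, lam) - dual(y, theta, lam)

# Hypothetical random problem with an arbitrary primal point.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = 1.0
beta = rng.standard_normal(10)
# Any feasible theta works; here y is rescaled so that |x_j^T theta| <= 1 for all j.
theta = y / max(lam, np.max(np.abs(X.T @ y)))
gap = duality_gap(X, y, beta, theta, lam)
```

Because `theta` is feasible, the gap is nonnegative and bounds the suboptimality of `beta`, whatever `beta` is.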


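Karush-Kuhn-Tucker conditions (KKT)

• Lagrangian:

$$
L(\theta, \beta) = \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2 + \lambda \|\beta\|_1 - \lambda \beta^\top X^\top \theta
$$

• Primal solution: $\hat\beta^{(\lambda)} \in \mathbb{R}^p$

• Dual solution: $\hat\theta^{(\lambda)} \in \mathbb{R}^n$

KKT:

$$
\begin{aligned}
& \lambda \hat\theta^{(\lambda)} = y - X\hat\beta^{(\lambda)} \\
& \|X^\top \hat\theta^{(\lambda)}\|_\infty \le 1 \\
& |\hat\beta^{(\lambda)}_j| \big( |x_j^\top \hat\theta^{(\lambda)}| - 1 \big) = 0 \quad \forall j \in [p]
\end{aligned}
$$

In particular, for all $\lambda \ge \lambda_{\max} = \|X^\top y\|_\infty$, $0 \in \mathbb{R}^p$ is a primal solution for $P_\lambda$ and $y/\lambda \in \Delta_X$ is dual optimal.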
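The last statement can be checked numerically: above $\lambda_{\max}$ the pair $(0, y/\lambda)$ is feasible and closes the duality gap. The snippet is a small sketch on random data; the variable names and the factor 1.5 are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)

lam_max = np.max(np.abs(X.T @ y))   # lambda_max = ||X^T y||_inf

lam = 1.5 * lam_max                 # any value >= lam_max would do
beta = np.zeros(10)                 # candidate primal solution
theta = y / lam                     # candidate dual solution

primal_value = 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))
dual_value = 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
gap_at_zero = primal_value - dual_value
feasible = bool(np.max(np.abs(X.T @ theta)) <= 1.0)
```

A zero gap at a feasible dual point certifies that both candidates are optimal.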


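Safe regions

We can screen variables thanks to the KKT condition $|\hat\beta^{(\lambda)}_j| \big( |x_j^\top \hat\theta^{(\lambda)}| - 1 \big) = 0$:

If $|x_j^\top \hat\theta^{(\lambda)}| < 1$, then $\hat\beta^{(\lambda)}_j = 0$

Attention: $\hat\theta^{(\lambda)}$ is unknown

Solution: consider a safe region $C$ such that $\hat\theta^{(\lambda)} \in C$

If $\sup_{\theta \in C} |x_j^\top \theta| < 1$, then $\hat\beta^{(\lambda)}_j = 0$

Goal: find a region $C$

1. that contains $\hat\theta^{(\lambda)}$

2. as small as possible

3. such that $\mu_C : x \mapsto \sup_{\theta \in C} |x^\top \theta|$ is easy to compute

→ $C = B(c, r)$, a ball of center $c \in \mathbb{R}^n$ and radius $r > 0$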
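For a ball, the third requirement is met with a closed form. The worked equation below is a standard Cauchy-Schwarz computation added here for illustration, not a formula recovered from the slides:

```latex
\mu_{B(c,r)}(x)
  = \sup_{\|u\|_2 \le 1} \bigl| x^\top (c + r u) \bigr|
  = |x^\top c| + r \, \|x\|_2 ,
\qquad \text{attained at } u = \operatorname{sign}(x^\top c)\, \frac{x}{\|x\|_2} .
```

Feature $j$ can therefore be discarded as soon as $|x_j^\top c| + r \|x_j\|_2 < 1$, which costs one inner product per feature.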


Creating a safe sphere


Original safe rule: El Ghaoui et al. (2012)

Static safe rule, becomes useless when $\lambda$ gets small


Dynamic screening rule

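• Use a dual point $\theta_k \in \Delta_X$ that evolves as the optimisation algorithm proceeds: Bonnefoy et al. (2014, 2015)

• We have a primal point $\beta_k$ and the residuals

$$
\rho_k = y - X\beta_k
$$

• Dual candidate:

$$
\theta_k = \frac{y - X\beta_k}{\max(\lambda, \|X^\top \rho_k\|_\infty)}
$$

• Motivation: $\theta_k \in \Delta_X$, and if $\beta_k \to \hat\beta^{(\lambda)}$ then $\theta_k \to \hat\theta^{(\lambda)}$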
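The dual candidate is a one-line rescaling of the residual. A minimal sketch follows, where the function name `dual_point` and the random data are assumptions made for illustration.

```python
import numpy as np

def dual_point(X, y, beta, lam):
    """Rescale the residual y - X beta so that the result lies in Delta_X."""
    residual = y - X @ beta
    return residual / max(lam, np.max(np.abs(X.T @ residual)))

# Hypothetical random problem, evaluated at the primal point beta = 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = 0.3 * np.max(np.abs(X.T @ y))
theta = dual_point(X, y, np.zeros(10), lam)
max_correlation = np.max(np.abs(X.T @ theta))
```

Dividing by the maximum of $\lambda$ and $\|X^\top \rho_k\|_\infty$ guarantees $|x_j^\top \theta_k| \le 1$ for every $j$, so the point is always feasible.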


Limits of previous dynamic screening rules

$r_k = \|\theta_k - y/\lambda\|$ does not converge to 0: the limit ball is [figure]


GAP safe rule

Previous dynamic rules (Bonnefoy et al. (2014, 2015)) use the first inequality below; weak duality adds the second:

$$
\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta_k - \frac{y}{\lambda}\Big\|_2^2
\;\le\;
\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\hat\theta^{(\lambda)} - \frac{y}{\lambda}\Big\|_2^2
\;\le\; P_\lambda(\beta_k)
$$

Rearranging:

$$
\check R_\lambda(\theta)^2 = \Big\|\theta_k - \frac{y}{\lambda}\Big\|_2^2
\;\ge\;
\Big\|\hat\theta^{(\lambda)} - \frac{y}{\lambda}\Big\|_2^2
\;\ge\;
-\frac{2}{\lambda^2} P_\lambda(\beta_k) + \frac{1}{\lambda^2}\|y\|_2^2 = \hat R_\lambda(\beta)^2
$$

$$
r_\lambda(\beta, \theta) = \sqrt{\check R_\lambda(\theta)^2 - \hat R_\lambda(\beta)^2} = \sqrt{2 G_\lambda(\beta, \theta)} \,/\, \lambda
$$
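The radius turns into a screening test once a center is fixed. The sketch below assumes the sphere is centered at the current dual point $\theta$, as in the Gap Safe sphere test, and uses the ball formula $|x_j^\top \theta| + r\|x_j\|_2$; the function name `gap_safe_screen` and the random data are illustrative assumptions.

```python
import numpy as np

def gap_safe_screen(X, y, beta, lam):
    """Return (active_mask, radius) for the sphere B(theta, sqrt(2 * gap) / lam)."""
    residual = y - X @ beta
    theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
    primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = max(primal - dual, 0.0)    # guard against tiny negative round-off
    radius = np.sqrt(2.0 * gap) / lam
    # sup over the ball of |x_j^T t| equals |x_j^T theta| + radius * ||x_j||
    mu = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0)
    return mu >= 1.0, radius

# Hypothetical random problem, screened at beta = 0 for two values of lambda.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam_max = np.max(np.abs(X.T @ y))
j_star = int(np.argmax(np.abs(X.T @ y)))
beta0 = np.zeros(10)
active_small, radius_small = gap_safe_screen(X, y, beta0, 0.9 * lam_max)
active_big, radius_big = gap_safe_screen(X, y, beta0, 2.0 * lam_max)
```

Above $\lambda_{\max}$ the gap at $\beta = 0$ vanishes, the sphere collapses to a point and every feature is discarded; below it, the feature attaining $\lambda_{\max}$ is always kept.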


Algorithm 1 Coordinate descent (Lasso)

Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$

Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do    ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$    ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then    ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in [p]$ do    ▷ Soft-threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \frac{\lambda_t}{\|x_j\|^2},\ \beta_j - \frac{x_j^\top (X\beta - y)}{\|x_j\|^2} \Big)$
        end for
    end for
end for


Algorithm 2 Gap Safe screening for coordinate descent

Input: $X, y, \varepsilon, K, f, (\lambda_t)_{t \in [T-1]}$

Initialization: $\lambda_0 = \lambda_{\max}$, $\beta^{\lambda_0} = 0$
for $t \in [T-1]$ do    ▷ Loop over λ's
    $\beta \leftarrow \beta^{\lambda_{t-1}}$    ▷ previous ε-solution
    for $k \in [K]$ do
        if $k \bmod f = 1$ then
            Construct $\theta \in \Delta_X$, $A^{\lambda_t}(C) = \{ j \in [p] : \mu_C(x_j) \ge 1 \}$
            if $G_{\lambda_t}(\beta, \theta) \le \varepsilon$ then    ▷ Stop if duality gap small
                $\beta^{\lambda_t} \leftarrow \beta$
                break
            end if
        end if
        for $j \in A^{\lambda_t}(C)$ do    ▷ Soft-threshold coordinates
            $\beta_j \leftarrow \mathrm{ST}\Big( \frac{\lambda_t}{\|x_j\|^2},\ \beta_j - \frac{x_j^\top (X\beta - y)}{\|x_j\|^2} \Big)$
        end for
    end for
end for
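Algorithms 1 and 2 can be sketched as a single NumPy function with a `screen` switch. This is an illustrative sketch, not the authors' implementation: the function name `lasso_path_gap_safe`, the default parameters, the sphere centered at the rescaled residual, the reset of screened coefficients to zero and the geometric grid of λ values are all assumptions made here.

```python
import numpy as np

def soft_threshold(lam, y):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def lasso_path_gap_safe(X, y, lambdas, eps=1e-8, max_epochs=10000, f=10, screen=True):
    """Warm-started coordinate descent over lambdas, with optional Gap Safe screening."""
    norms = np.linalg.norm(X, axis=0)
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:                      # loop over lambda's
        active = norms > 0                   # screening restarts for each lambda
        residual = y - X @ beta              # warm start from the previous solution
        for k in range(max_epochs):
            if k % f == 0:                   # every f epochs: dual point, gap, screening
                theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
                primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
                dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
                gap = max(primal - dual, 0.0)
                if gap <= eps:               # stop if duality gap small
                    break
                if screen:
                    radius = np.sqrt(2.0 * gap) / lam
                    keep = np.abs(X.T @ theta) + radius * norms >= 1.0
                    dropped = active & ~keep
                    if dropped.any():
                        # screened coefficients are zero at the optimum: reset them
                        residual += X[:, dropped] @ beta[dropped]
                        beta[dropped] = 0.0
                        active &= keep
            for j in np.flatnonzero(active):  # soft-threshold active coordinates
                old = beta[j]
                new = soft_threshold(lam / norms[j] ** 2,
                                     old + X[:, j] @ residual / norms[j] ** 2)
                if new != old:
                    residual -= (new - old) * X[:, j]
                    beta[j] = new
        path.append(beta.copy())
    return np.array(path)

# Hypothetical random problem and a short geometric grid starting at lambda_max.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 30))
y = rng.standard_normal(50)
lam_max = np.max(np.abs(X.T @ y))
lambdas = lam_max * np.logspace(0, -1, 5)
path_screen = lasso_path_gap_safe(X, y, lambdas, screen=True)
path_plain = lasso_path_gap_safe(X, y, lambdas, screen=False)
```

With `screen=False` the inner loop visits every coordinate, which corresponds to Algorithm 1; with `screen=True` it only visits the coordinates that the current sphere cannot rule out. Both runs should agree up to the tolerance `eps`, since the rule only removes coordinates that are zero at the optimum.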


Gap safe rule: advantages

• Dynamic rule

• The safe region converges towards $\{\hat\theta^{(\lambda)}\}$

• It works better in practice

[Figure: proportion of active variables (Leukemia, n = 72; p = 7,129)]


Computing time

[Figure: time (s) as a function of -log10(duality gap), comparing No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]

Time to obtain an ε-solution (Leukemia, n = 72; p = 7,129)


Computing time

[Figure: time (s) as a function of -log10(duality gap), comparing No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere) and GAP SAFE (dome)]

Time to obtain an ε-solution (RCV1, n = 20,242; p = 47,236)


Conclusion

• New safe screening rule based on the duality gap

• Theoretical advantage: converging safe region

• Improves computational efficiency on Scikit-Learn's coordinate descent implementation: Pedregosa et al. (2011)

• Extensions

- same idea works for many nonsmooth convex losses (group-lasso, sparse logistic regression, SVM, ...)

- guarantees for variable selection

- accuracy guarantees on the whole path $[\lambda_T, \lambda_1]$, not only $\{\lambda_1, \dots, \lambda_T\}$
