A geometric alternative to Nesterov accelerated gradient...

A geometric alternative to Nesterov accelerated gradient descent Bubeck, Lee, Singh, 2015 Yulia Rubanova University of Toronto November 29, 2017 Bubeck, Lee, Singh, 2015 Geometric alternative to Nesterov GD

A geometric alternative to Nesterov accelerated gradient descent

Bubeck, Lee, Singh, 2015

Yulia Rubanova

University of Toronto

November 29, 2017

Bubeck, Lee, Singh, 2015 Geometric alternative to Nesterov GD

Problem statement

Goal: minimize β-smooth and α-strongly convex function

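$$\min_{x \in \mathbb{R}^n} f(x)$$

α-strongly convex function: "not too flat"

$$\forall x, y \in \mathbb{R}^n: \quad f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2$$

β-smooth function: "not too curvy"

$$\forall x, y \in \mathbb{R}^n: \quad f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2} \|y - x\|^2$$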
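The two inequalities can be checked numerically on a toy problem. The sketch below is only an illustration: the diagonal quadratic, the names `LAM`, `bounds` and `violations`, and the sampling range are assumptions made here, not part of the slides. For this quadratic, α and β are the smallest and largest diagonal entries.

```python
import random

# Hypothetical test function: f(x) = 0.5 * sum(lam_i * x_i^2), a diagonal quadratic.
# Its strong-convexity and smoothness constants are the smallest and largest lam_i.
LAM = [1.0, 4.0, 10.0]
ALPHA, BETA = min(LAM), max(LAM)

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(LAM, x))

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def bounds(x, y):
    """Return (lower, f(y), upper): both quadratic bounds built at x, evaluated at y."""
    lin = f(x) + sum(g * (yi - xi) for g, xi, yi in zip(grad(x), x, y))
    sq = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    return lin + 0.5 * ALPHA * sq, f(y), lin + 0.5 * BETA * sq

# Count random pairs (x, y) that break lower <= f(y) <= upper (up to rounding).
rng = random.Random(0)
violations = 0
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in LAM]
    y = [rng.uniform(-5, 5) for _ in LAM]
    lower, fy, upper = bounds(x, y)
    if not (lower - 1e-9 <= fy <= upper + 1e-9):
        violations += 1
```

Under these assumptions no sampled pair should violate either bound, since f itself is a quadratic whose curvature lies between α and β.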

Problem statement

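[Figure: $f$ together with the quadratics $f^{\beta}_{\text{sm}}$ and $f^{\alpha}_{\text{conv}}$ built at the point $x$; $x^*$ marks the minimizer.]

$$f^{\beta}_{\text{sm}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2} \|y - x\|^2$$

$$f^{\alpha}_{\text{conv}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2$$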

Classic approach: gradient descent

Gradient descent:

$$x_{t+1} = x_t - \gamma \nabla f(x_t), \quad t = 0, 1, \dots$$

GD convergence rate: $O\left(\left(1 - \frac{1}{\kappa}\right)^t\right)$, where $\kappa = \frac{\beta}{\alpha}$
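A minimal sketch of the update and its rate, assuming a hypothetical diagonal quadratic with minimizer x* = 0 and the step size γ = 1/β (the slide leaves γ unspecified). The names `gradient_descent`, `sq_norm` and `iterates` are illustrative only.

```python
LAM = [1.0, 4.0, 10.0]          # hypothetical diagonal quadratic, minimizer x* = 0
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def sq_norm(x):
    return sum(xi * xi for xi in x)

def gradient_descent(x0, steps, gamma=1.0 / BETA):
    """Iterate x_{t+1} = x_t - gamma * grad f(x_t) and return all iterates."""
    xs = [list(x0)]
    for _ in range(steps):
        x = xs[-1]
        xs.append([xi - gamma * gi for xi, gi in zip(x, grad(x))])
    return xs

iterates = gradient_descent([1.0, 1.0, 1.0], steps=50)

# Squared distance to x* = 0 compared with the (1 - 1/kappa)^t envelope.
within_rate = all(
    sq_norm(x) <= (1 - 1 / KAPPA) ** t * sq_norm(iterates[0]) + 1e-12
    for t, x in enumerate(iterates)
)
```

On this example each coordinate contracts by a factor of at most 1 - 1/κ per step, so the squared distance stays under the envelope.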

Can we obtain faster convergence? Yes!

Can we obtain faster convergence?

Accelerated Gradient Descent algorithm [Nesterov, 83]:

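Let $y_0 = x_0$. For $t = 0, 1, 2, \dots$, update:

$$x_{t+1} = y_t - \frac{1}{\beta} \nabla f(y_t)$$

$$y_{t+1} = x_{t+1} + \frac{\sqrt{\beta} - \sqrt{\alpha}}{\sqrt{\beta} + \sqrt{\alpha}} (x_{t+1} - x_t)$$

Convergence rate: $O\left(\left(1 - \frac{1}{\sqrt{\kappa}}\right)^t\right)$

- Convergence rate is optimal

- Pure algebraic trick; no intuition behind the method.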
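The two-sequence update above can be sketched in a few lines. The test problem (a diagonal quadratic with κ = 100), the plain gradient descent baseline with step 1/β, and all function names are assumptions made for illustration.

```python
import math

LAM = [1.0, 25.0, 100.0]        # hypothetical diagonal quadratic, minimizer x* = 0
ALPHA, BETA = min(LAM), max(LAM)

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(LAM, x))

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def gradient_descent(x0, steps):
    """Baseline: x_{t+1} = x_t - (1/beta) * grad f(x_t)."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - gi / BETA for xi, gi in zip(x, grad(x))]
    return x

def nesterov_agd(x0, steps):
    """The two-sequence update from the slide, started at y_0 = x_0."""
    momentum = (math.sqrt(BETA) - math.sqrt(ALPHA)) / (math.sqrt(BETA) + math.sqrt(ALPHA))
    x, y = list(x0), list(x0)
    for _ in range(steps):
        x_next = [yi - gi / BETA for yi, gi in zip(y, grad(y))]
        y = [xn + momentum * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
    return x

x0 = [1.0, 1.0, 1.0]
gap_gd = f(gradient_descent(x0, 100))    # f(x*) = 0, so f is the optimality gap
gap_agd = f(nesterov_agd(x0, 100))
```

With the same number of gradient evaluations, `gap_agd` should come out well below `gap_gd` on this ill-conditioned example, in line with the rates on the slides.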

Attempts to interpret Nesterov AGD

[Allen-Zhu and Orecchia, 2014] view AGD as a linear coupling of Gradient Descent and Mirror Descent

[Su, Boyd and Candes, 2015] view AGD as the discretization of a certain second-order ODE

[Bubeck, Lee, Singh, 2015] A new algorithm with a geometric interpretation and the same theoretical guarantees as AGD.

Notation

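$f(x)$ is β-smooth and α-strongly convex

Denote $\kappa = \frac{\beta}{\alpha}$

$$x^+ = x - \frac{1}{\beta} \nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha} \nabla f(x)$$

The ball with center $x$ and radius $r$:

$$B(x, r^2) = \{ y \in \mathbb{R}^n : \|y - x\|^2 \le r^2 \}$$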

Preliminaries

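Theorem. For an α-strongly convex function $f(x)$ and $\forall x \in \mathbb{R}^n$:

$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right)$$

It is equivalent to show that:

$$\|x^* - x^{++}\|^2 \le \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$

Recall the definition $x^{++} = x - \frac{1}{\alpha}\nabla f(x)$:

$$\|x^* - x^{++}\|^2 = \|x - x^*\|^2 + \frac{2}{\alpha}\nabla f(x)^T (x^* - x) + \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$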

Preliminaries

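$$\|x^* - x^{++}\|^2 = \|x - x^*\|^2 + \frac{2}{\alpha}\nabla f(x)^T (x^* - x) + \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$

By definition of strong convexity (applied with $y = x^*$ and multiplied by $\frac{2}{\alpha}$):

$$\|x^* - x\|^2 + \frac{2}{\alpha}\nabla f(x)^T (x^* - x) \le -\frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$

Substituting into the previous equation for the ball:

$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2}\|\nabla f(x)\|^2 - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$

Recall that $x^*$ is a minimum and $f(x) \ge f(x^*)$:

$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$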
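As a sanity check of the ball derived above, the sketch below evaluates the slack between the squared radius and the squared distance from x++ to x*. It assumes a hypothetical diagonal quadratic whose minimizer x* = 0 is known; `long_step`, `ball_slack` and `min_slack` are names introduced here for illustration.

```python
import random

LAM = [1.0, 4.0, 10.0]          # hypothetical diagonal quadratic
ALPHA = min(LAM)
X_STAR = [0.0, 0.0, 0.0]        # its known minimizer

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(LAM, x))

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def long_step(x):
    """x++ = x - (1/alpha) * grad f(x)."""
    return [xi - gi / ALPHA for xi, gi in zip(x, grad(x))]

def ball_slack(x):
    """Squared radius from the theorem minus ||x* - x++||^2 (non-negative iff x* is in the ball)."""
    g2 = sum(gi * gi for gi in grad(x))
    radius_sq = g2 / ALPHA**2 - 2.0 / ALPHA * (f(x) - f(X_STAR))
    dist_sq = sum((si - pi) ** 2 for si, pi in zip(X_STAR, long_step(x)))
    return radius_sq - dist_sq

rng = random.Random(1)
min_slack = min(
    ball_slack([rng.uniform(-5, 5) for _ in LAM]) for _ in range(1000)
)
```

For this quadratic the slack works out to the sum of x_i^2 (λ_i - α), which is never negative, so x* stays inside the ball at every sampled point.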

Interpretation

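(recall $x^{++} = x - \frac{1}{\alpha}\nabla f(x)$)

$$\|x^{++} - x^*\|^2 \le \|x^{++} - x\|^2 = \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$

[Figure: $f$ with the quadratics $f^{\beta}_{\text{sm}}$ and $f^{\alpha}_{\text{conv}}$ built at $x$; the points $x$, $x^+$, $x^{++}$ and the minimizer $x^*$ are marked.]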

Preliminaries

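Theorem. For an α-strongly convex and β-smooth function $f(x)$ and $\forall x \in \mathbb{R}^n$:

$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right)$$

From the strong convexity:

$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right)$$

Using the definition of smoothness (recall $x^+ = x - \frac{1}{\beta}\nabla f(x)$):

$$f(x^+) \le f(x) + \nabla f(x)^T (x^+ - x) + \frac{\beta}{2}\|x^+ - x\|^2 = f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$$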

Preliminaries

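From smoothness:

$$f(x) \ge f(x^+) + \frac{1}{2\beta}\|\nabla f(x)\|^2$$

Substituting into $x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}(f(x) - f(x^*))\right)$:

$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right)$$

Note that $f(x^+) \ge f(x^*)$ because $x^*$ is a minimum of $f(x)$.

$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$$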
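The chain of balls in this derivation can be followed numerically. The sketch below computes the three squared radii (strong convexity only, after substituting smoothness, and after dropping the function-value term) for a hypothetical diagonal quadratic with f(x*) = 0; the function name `squared_radii` is an assumption for illustration.

```python
LAM = [1.0, 4.0, 10.0]          # hypothetical diagonal quadratic, minimizer x* = 0
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA
F_STAR = 0.0                    # f(x*) for this quadratic

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(LAM, x))

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def squared_radii(x):
    """Squared radii of the three balls around x++, in the order they are derived."""
    g = grad(x)
    g2 = sum(gi * gi for gi in g)
    x_plus = [xi - gi / BETA for xi, gi in zip(x, g)]            # x+ = x - grad/beta
    r_strong = g2 / ALPHA**2 - 2.0 / ALPHA * (f(x) - F_STAR)      # strong convexity only
    r_both = g2 / ALPHA**2 * (1 - 1 / KAPPA) - 2.0 / ALPHA * (f(x_plus) - F_STAR)
    r_simple = g2 / ALPHA**2 * (1 - 1 / KAPPA)                    # f(x+) >= f(x*) dropped
    return r_strong, r_both, r_simple

r_strong, r_both, r_simple = squared_radii([1.0, 1.0, 1.0])
```

Each substitution can only enlarge the radius, so the three values should be non-decreasing; the last one is the form that no longer needs f(x*).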

Interpretation

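Introduce $\tilde{x}$ on $f^{\alpha}_{\text{conv}}$ s.t. $f(\tilde{x}) = f(x^+) \ge f(x^*)$

$$\|x^{++} - x^*\| \le \|x^{++} - \tilde{x}\|$$

How to find $\tilde{x}$?

$$\tilde{x} = x - t\nabla f(x)$$

[Figure: $f$ with the quadratics $f^{\beta}_{\text{sm}}$ and $f^{\alpha}_{\text{conv}}$ built at $x$; the points $x$, $x^+$, $\tilde{x}$, $x^{++}$ and the minimizer $x^*$ are marked.]

Our requirement on $\tilde{x}$:

$$f^{\alpha}_{\text{conv}}(\tilde{x}) = f(x) - t\|\nabla f(x)\|^2 + \frac{t^2\alpha}{2}\|\nabla f(x)\|^2 = f^{\beta}_{\text{sm}}(x^+) = f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$$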

Interpretation

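Solving the quadratic equation:

$$\alpha t^2 - 2t + \frac{1}{\beta} = 0 \implies t = \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right)$$

$$\|\tilde{x} - x^{++}\|^2 = \left\|(x - t\nabla f(x)) - \left(x - \frac{1}{\alpha}\nabla f(x)\right)\right\|^2 = \left\|\frac{1}{\alpha}\nabla f(x) - \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right)\nabla f(x)\right\|^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\alpha^2}$$

$$\|x^* - x^{++}\|^2 \le \|\tilde{x} - x^{++}\|^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\alpha^2}$$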
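The algebra on this slide is easy to confirm numerically. The sketch below plugs the closed-form t back into the quadratic and checks the factor that multiplies ‖∇f(x)‖²; the sample values α = 1, β = 10 and the name `matching_step` are assumptions for illustration.

```python
import math

def matching_step(alpha, beta):
    """Smaller root t of alpha*t^2 - 2*t + 1/beta = 0."""
    kappa = beta / alpha
    return (1.0 - math.sqrt(1.0 - 1.0 / kappa)) / alpha

alpha, beta = 1.0, 10.0          # sample constants, kappa = 10
t = matching_step(alpha, beta)

# Residual of the quadratic equation at t; should vanish up to rounding.
residual = alpha * t * t - 2.0 * t + 1.0 / beta

# x~ - x++ = (1/alpha - t) * grad f(x), so this factor multiplies ||grad f(x)||^2.
factor = (1.0 / alpha - t) ** 2
expected = (1.0 - alpha / beta) / alpha**2
```

The two quantities `factor` and `expected` should agree, which is the (1 - 1/κ)/α² coefficient in the last line of the slide.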

A suboptimal algorithm

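We have just proven:

$$x^* \in B_{\text{sm-conv}}\left(x_0^{++}, \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_0)\|^2}{\alpha^2}\right)$$

Assume that we are given $x_0$ such that

$$x^* \in B_{\text{guarantee}}(x_0, R_0^2)$$

We can intersect these two balls and get a ball with a smaller radius.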

A suboptimal algorithm

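[Figure: the balls $B_{\text{guarantee}}$ (center $x_0$, radius $R_0$) and $B_{\text{sm-conv}}$ (center $x_0^{++}$), with the points $x_0$, $x_0^+$, $x_0^{++}$, $x^*$, the direction $-\nabla f(x_0)$, and the new center $x_1$.]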

A suboptimal algorithm

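From geometry:

$$x^* \in B_{\text{intersect}}\left(x_1, \left(1 - \frac{1}{\kappa}\right)R_0^2\right)$$

where $x_1$ is the center of a new ball.

Repeat the procedure using $B\left(x_1, \left(1 - \frac{1}{\kappa}\right)R_0^2\right)$ as a guarantee.

We have reduced the squared radius of the original ball $B_{\text{guarantee}}(x_0, R_0^2)$ by a factor of $\left(1 - \frac{1}{\kappa}\right)$.

After $t$ iterations:

$$\|x_t - x^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^t R_0^2$$

Same rate of convergence as gradient descent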
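The ball-shrinking loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the test function is a hypothetical diagonal quadratic with x* = 0, the starting radius uses the bound ‖x - x*‖ ≤ ‖∇f(x)‖/α that follows from strong convexity, and `enclosing_ball` is a helper written here from elementary geometry (the sphere where two balls meet, or the smaller ball when one already covers the intersection).

```python
LAM = [1.0, 4.0, 10.0]            # hypothetical diagonal quadratic, minimizer x* = 0
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA

def grad(x):
    return [l * xi for l, xi in zip(LAM, x)]

def sq_norm(v):
    return sum(vi * vi for vi in v)

def enclosing_ball(a, ra2, b, rb2):
    """Center and squared radius of a ball containing the intersection of B(a, ra2) and B(b, rb2)."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    d2 = sq_norm(diff)
    if d2 == 0.0 or d2 < abs(ra2 - rb2):
        # One ball already covers the intersection: keep the smaller one.
        return (list(a), ra2) if ra2 <= rb2 else (list(b), rb2)
    shift = (ra2 - rb2) / (2.0 * d2)
    center = [0.5 * (ai + bi) - shift * di for ai, bi, di in zip(a, b, diff)]
    radius2 = rb2 - (d2 + rb2 - ra2) ** 2 / (4.0 * d2)
    return center, max(radius2, 0.0)

def shrink_step(x, r2):
    """Intersect B_guarantee(x, r2) with B_sm-conv(x++, (1 - 1/kappa) |grad|^2 / alpha^2)."""
    g = grad(x)
    x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]
    rb2 = (1.0 - 1.0 / KAPPA) * sq_norm(g) / ALPHA**2
    return enclosing_ball(x, r2, x_pp, rb2)

x = [1.0, 1.0, 1.0]
r2 = sq_norm(grad(x)) / ALPHA**2      # valid start: ||x - x*|| <= ||grad f(x)|| / alpha
history = [(x, r2)]
for _ in range(30):
    x, r2 = shrink_step(x, r2)
    history.append((x, r2))
```

Each entry of `history` holds a center and a squared radius; under these assumptions the squared radius should shrink by at least the factor 1 - 1/κ per step while the ball keeps containing x* = 0.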

Acceleration

What if we had a better guarantee?

Assume that we are given $x_0$ such that

$$x^* \in B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}\left(f(y) - f(x^*)\right)\right),$$

where $y$ is such that $f(x_0) \le f(y)$ (for example, $y := x_0$).

From the definition of smoothness and $f(x_0) \le f(y)$:

$$f(x_0^+) \le f(x_0) + \nabla f(x_0)^T (x_0^+ - x_0) + \frac{\beta}{2}\|x_0^+ - x_0\|^2 = f(x_0) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2 \le f(y) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$

Acceleration

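$$f(x_0^+) \le f(y) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$

$$f(y) \ge f(x_0^+) + \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$

Substituting $f(y)$ in $B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}(f(y) - f(x^*))\right)$:

$$x^* \in B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

This ball is smaller than $B_{\text{guarantee}}(x_0, R_0^2)$.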

Acceleration

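$$x^* \in B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

For the second ball, we use the same one as before:

$$x^* \in B_{\text{sm-conv}}\left(x_0^{++}, \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

We take the intersection of the two balls. Note that if one ball became smaller, the intersection is also smaller.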

Lemma: intersection of the two balls

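Denote:

$$\varepsilon = \frac{1}{\kappa}; \quad g^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}; \quad \delta = \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)$$

$$x^* \in B_{\text{guarantee}}\left(x_0, R_0^2 - \varepsilon g^2 - \delta\right) \cap B_{\text{sm-conv}}\left(x_0^{++}, g^2(1 - \varepsilon) - \delta\right)$$

Theorem. Consider $g$, $\varepsilon$ such that $\varepsilon \in (0, 1)$, $g \ge 0$. There exists $c$ such that for any $\delta > 0$:

$$B(0, R^2 - \varepsilon g^2 - \delta) \cap B(g, g^2(1 - \varepsilon) - \delta) \subset B(c, R^2(1 - \sqrt{\varepsilon}) - \delta)$$

Here $R^2$ plays the role of $R_0^2$, and the second ball is centered at distance $g$ from the origin.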

Sketch of the proof

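[Figure: two intersecting balls, one centered at $0$ with squared radius $1 - \varepsilon g^2 - \delta$ and one centered at $g$ with squared radius $g^2(1 - \varepsilon) - \delta$; the plane of intersection lies at distance $x$ from $0$ and $g - x$ from $g$, and $r$ is the radius of the intersection.]

$$r^2 = 1 - \varepsilon g^2 - \delta - x^2 = g^2(1 - \varepsilon) - \delta - (g - x)^2 \implies x = \frac{1}{2g}$$

$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta$$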

Sketch of the proof

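We will use the AM-GM inequality to estimate $\varepsilon g^2 + \frac{1}{(2g)^2}$:

$$\frac{1}{2}\left(\varepsilon g^2 + \frac{1}{(2g)^2}\right) \ge \sqrt{\varepsilon g^2 \cdot \frac{1}{4g^2}} = \frac{\sqrt{\varepsilon}}{2} \implies \varepsilon g^2 + \frac{1}{(2g)^2} \ge \sqrt{\varepsilon}$$

Therefore,

$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta \le 1 - \sqrt{\varepsilon} - \delta$$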
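The bound from the sketch can be checked on a grid of values of g. The code below is a small illustration with the first squared radius normalized to 1 as in the sketch; the sample values of ε and δ, the grid, and the name `intersection_radius_sq` are assumptions made here.

```python
import math

def intersection_radius_sq(g, eps, delta):
    """r^2 from the sketch: squared radius of the sphere where the two balls meet."""
    x = 1.0 / (2.0 * g)                          # offset of the intersection plane from 0
    from_first = 1.0 - eps * g * g - delta - x * x
    from_second = g * g * (1.0 - eps) - delta - (g - x) ** 2
    assert abs(from_first - from_second) < 1e-9  # both balls give the same r^2
    return from_first

eps, delta = 0.1, 0.01                           # sample values, eps = 1/kappa
grid = [0.6 + 0.05 * k for k in range(40)]       # g from 0.6 to 2.55
worst = max(intersection_radius_sq(g, eps, delta) for g in grid)
bound = 1.0 - math.sqrt(eps) - delta
```

The largest value on the grid occurs near g⁴ = 1/(4ε), where AM-GM is tight, and it should still stay below `bound`.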

Main result

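Using the theorem, we take the smallest ball that contains the intersection of the two:

$$x^* \in B_{\text{intersect}}\left(x_1, R_0^2\left(1 - \frac{1}{\sqrt{\kappa}}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

The intersection ball is always shrinking:

$$R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$

$$\|x^* - x_t\|^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$

Same convergence rate as in Nesterov AGD

Much faster than the $\left(1 - \frac{1}{\kappa}\right)^t$ rate in Gradient Descent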

Caveats

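How do we find such $x_0$, $R_0$, $y$ for the guarantee ball?

$$x^* \in B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}\left(f(y) - f(x^*)\right)\right),$$

where $y$ is such that $f(x_0) \le f(y)$

For the first iteration, we can use $B_{\text{sm-conv}}$:

$$x^* \in B_{\text{sm-conv}}\left(x_0^{++}, \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

For other iterations we need a trick which we won't cover in this lecture.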

Algorithm

Geometric gradient descent

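Input: convexity parameter α and initial point $x_0$

$c_0 = x_0^{++}$; $R_0^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)$

for $t = 1, 2, \dots$ do

    $x_t := c_{t-1}$

    Ball 1: $B_{\text{guarantee}}\left(x_t, R_{t-1}^2 - \frac{\|\nabla f(x_t)\|^2}{\alpha^2\kappa}\right)$

    Ball 2: $B_{\text{sm-conv}}\left(x_t^{++}, \frac{\|\nabla f(x_t)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$

    Find $B(c_t, R_t^2)$, the minimum ball enclosing the intersection; $R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)R_{t-1}^2$

end for

return $x_t$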

Summary

- Goal: minimize a β-smooth and α-strongly convex function

- We have provided a geometric interpretation of GD and Accelerated GD

- Gradient descent has a $\left(1 - \frac{1}{\kappa}\right)^t$ convergence rate

- Accelerated GD achieves a $\left(1 - \frac{1}{\sqrt{\kappa}}\right)^t$ convergence rate

Thank you for listening
