A geometric alternative to Nesterov accelerated gradient descent
Bubeck, Lee, Singh, 2015
Yulia Rubanova
University of Toronto
November 29, 2017
Bubeck, Lee, Singh, 2015 Geometric alternative to Nesterov GD
Problem statement
Goal: minimize a β-smooth and α-strongly convex function
$$\min_{x \in \mathbb{R}^n} f(x)$$
α-strongly convex function: "not too flat"
$$\forall x, y \in \mathbb{R}^n:\quad f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2}\|y - x\|^2$$
β-smooth function: "not too curvy"
$$\forall x, y \in \mathbb{R}^n:\quad f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2}\|y - x\|^2$$
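[Figure: the function f together with the quadratic upper bound f^β_sm and the quadratic lower bound f^α_conv built at x; the points x and x* are marked.]
$$f^{\beta}_{\text{sm}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2}\|y - x\|^2$$
$$f^{\alpha}_{\text{conv}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2}\|y - x\|^2$$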
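The two quadratic bounds can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical diagonal quadratic f(x) = ½ Σ dᵢxᵢ² (not taken from the slides), for which α = min dᵢ and β = max dᵢ, and tests that f is sandwiched between the two models at random points.

```python
import random

# Hypothetical test function (an assumption, not from the slides):
# f(x) = 0.5 * sum(d_i * x_i^2), so alpha = min(d_i) and beta = max(d_i).
D = [1.0, 10.0]
ALPHA, BETA = min(D), max(D)


def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def quadratic_model(x, y, c):
    """f(x) + grad f(x)^T (y - x) + (c / 2) * ||y - x||^2."""
    linear = sum(gi * (yi - xi) for gi, xi, yi in zip(grad(x), x, y))
    square = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    return f(x) + linear + 0.5 * c * square


def sandwiched(x, y, tol=1e-9):
    """True if f_conv(y) <= f(y) <= f_sm(y), both models built at x."""
    lower = quadratic_model(x, y, ALPHA)
    upper = quadratic_model(x, y, BETA)
    return lower - tol <= f(y) <= upper + tol


rng = random.Random(0)
pairs = [([rng.uniform(-5, 5) for _ in D], [rng.uniform(-5, 5) for _ in D])
         for _ in range(1000)]
all_sandwiched = all(sandwiched(x, y) for x, y in pairs)
print(all_sandwiched)
```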
Classic approach: gradient descent
Gradient descent:
$$x_{t+1} = x_t - \gamma \nabla f(x_t), \quad t = 0, 1, \dots$$
GD convergence rate: $O\big((1 - \frac{1}{\kappa})^t\big)$, where $\kappa = \frac{\beta}{\alpha}$
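As a small illustration of this rate, the sketch below runs gradient descent on a hypothetical diagonal quadratic (an assumption, not from the slides) with the step size γ = 1/β, which is one common choice, and checks that the gap f(x_t) − f(x*) stays below (1 − 1/κ)^t times the initial gap.

```python
# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
D = [1.0, 10.0]
ALPHA, BETA = min(D), max(D)
KAPPA = BETA / ALPHA


def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def gradient_descent(x0, steps, gamma=1.0 / BETA):
    """Return the iterates x_0, ..., x_steps of x_{t+1} = x_t - gamma * grad f(x_t)."""
    iterates = [list(x0)]
    for _ in range(steps):
        x = iterates[-1]
        iterates.append([xi - gamma * gi for xi, gi in zip(x, grad(x))])
    return iterates


iterates = gradient_descent([3.0, -2.0], steps=50)
gaps = [f(x) for x in iterates]  # f(x*) = 0 for this test function
rate = 1.0 - 1.0 / KAPPA
bound_ok = all(gaps[t] <= rate ** t * gaps[0] + 1e-12 for t in range(len(gaps)))
print(bound_ok, gaps[0], gaps[-1])
```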
Can we obtain faster convergence? Yes!
Can we obtain faster convergence?
Accelerated Gradient Descent algorithm [Nesterov, 83]:
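Let $y_0 = x_0$. For $t = 0, 1, 2, \dots$, update:
$$x_{t+1} = y_t - \frac{1}{\beta}\nabla f(y_t)$$
$$y_{t+1} = x_{t+1} + \frac{\sqrt{\beta} - \sqrt{\alpha}}{\sqrt{\beta} + \sqrt{\alpha}}(x_{t+1} - x_t)$$
Convergence rate: $O\big((1 - \frac{1}{\sqrt{\kappa}})^t\big)$
- Convergence rate is optimal (for first-order methods)
- Pure algebraic trick; no intuition behind the method.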
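The update rule above is short enough to transcribe directly. The sketch below does so for a hypothetical diagonal quadratic (an assumption for illustration) and compares the final gap f(x_t) − f(x*) with plain gradient descent using step size 1/β; it illustrates the speed-up on one toy problem and proves nothing in general.

```python
import math

# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
D = [1.0, 10.0]
ALPHA, BETA = min(D), max(D)


def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def gradient_descent(x0, steps):
    x = list(x0)
    for _ in range(steps):
        x = [xi - gi / BETA for xi, gi in zip(x, grad(x))]
    return x


def nesterov_agd(x0, steps):
    """x_{t+1} = y_t - grad f(y_t) / beta;  y_{t+1} = x_{t+1} + m * (x_{t+1} - x_t)."""
    sa, sb = math.sqrt(ALPHA), math.sqrt(BETA)
    momentum = (sb - sa) / (sb + sa)
    x, y = list(x0), list(x0)
    for _ in range(steps):
        x_next = [yi - gi / BETA for yi, gi in zip(y, grad(y))]
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_next, x)]
        x = x_next
    return x


x0 = [3.0, -2.0]
gap_gd = f(gradient_descent(x0, 50))
gap_agd = f(nesterov_agd(x0, 50))
print(gap_gd, gap_agd)
```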
Attempts to interpret Nesterov AGD
[Allen-Zhu and Orecchia, 2014] view AGD as a linear coupling of Gradient Descent and Mirror Descent
[Su, Boyd and Candes, 2015] view AGD as the discretization of a certain second-order ODE
[Bubeck, Lee, Singh, 2015] propose a new algorithm with the same theoretical guarantees as AGD and a geometric interpretation.
Notation
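$f(x)$ is β-smooth and α-strongly convex
Denote $\kappa = \frac{\beta}{\alpha}$
$$x^{+} = x - \frac{1}{\beta}\nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha}\nabla f(x)$$
The ball with center $x$ and radius $r$:
$$B(x, r^2) = \{y \in \mathbb{R}^n : \|y - x\|^2 \le r^2\}$$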
Preliminaries
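Theorem. For an α-strongly convex function $f$ and any $x \in \mathbb{R}^n$:
$$x^* \in B\left(x^{++},\ \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f(x^*)\big)\right)$$
It is equivalent to show that:
$$\|x^* - x^{++}\|^2 \le \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f(x^*)\big)$$
Recall the definition $x^{++} = x - \frac{1}{\alpha}\nabla f(x)$:
$$\|x^* - x^{++}\|^2 = \|x - x^*\|^2 + \frac{2}{\alpha}\nabla f(x)^T (x^* - x) + \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$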
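By the definition of strong convexity (taking $y = x^*$ and multiplying by $\frac{2}{\alpha}$):
$$\|x^* - x\|^2 + \frac{2}{\alpha}\nabla f(x)^T (x^* - x) \le -\frac{2}{\alpha}\big(f(x) - f(x^*)\big)$$
Substituting into the previous equation for the ball:
$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2}\|\nabla f(x)\|^2 - \frac{2}{\alpha}\big(f(x) - f(x^*)\big)$$
Recall that $x^*$ is a minimum, so $f(x) \ge f(x^*)$:
$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$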
Interpretation
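(recall $x^{++} = x - \frac{1}{\alpha}\nabla f(x)$)
$$\|x^{++} - x^*\|^2 \le \|x^{++} - x\|^2 = \frac{1}{\alpha^2}\|\nabla f(x)\|^2$$
[Figure: the function f with the curves f^β_sm and f^α_conv; the points x*, x, x+ and x++ are marked.]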
Preliminaries
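Theorem. For an α-strongly convex and β-smooth function $f$ and any $x \in \mathbb{R}^n$:
$$x^* \in B\left(x^{++},\ \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\big(f(x^+) - f(x^*)\big)\right)$$
From the strong convexity:
$$x^* \in B\left(x^{++},\ \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f(x^*)\big)\right)$$
Using the definition of smoothness (recall $x^+ = x - \frac{1}{\beta}\nabla f(x)$):
$$f(x^+) \le f(x) + \nabla f(x)^T (x^+ - x) + \frac{\beta}{2}\|x^+ - x\|^2 = f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$$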
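From smoothness:
$$f(x) \ge f(x^+) + \frac{1}{2\beta}\|\nabla f(x)\|^2$$
Substituting into $x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}(f(x) - f(x^*))\right)$:
$$x^* \in B\left(x^{++},\ \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\big(f(x^+) - f(x^*)\big)\right)$$
Note that $f(x^+) \ge f(x^*)$ because $x^*$ is a minimum of $f$, so
$$x^* \in B\left(x^{++},\ \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$$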
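Both ball statements can be sanity-checked numerically. The sketch below assumes a hypothetical diagonal quadratic with known minimizer x* = 0 (an assumption for illustration only) and verifies at random points that x* lies in the ball obtained from strong convexity alone, and that the second radius, which trades f(x) for f(x+), is never smaller than the first.

```python
import random

# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
D = [1.0, 10.0]
ALPHA, BETA = min(D), max(D)
KAPPA = BETA / ALPHA
X_STAR = [0.0, 0.0]


def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def balls_at(x):
    """Return (x++, squared radius from strong convexity, squared radius with smoothness)."""
    g = grad(x)
    x_plus = [xi - gi / BETA for xi, gi in zip(x, g)]
    x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]
    g2 = sum(gi * gi for gi in g) / ALPHA ** 2
    r2_conv = g2 - (2.0 / ALPHA) * (f(x) - f(X_STAR))
    r2_sm_conv = g2 * (1.0 - 1.0 / KAPPA) - (2.0 / ALPHA) * (f(x_plus) - f(X_STAR))
    return x_pp, r2_conv, r2_sm_conv


def check(x, tol=1e-9):
    x_pp, r2_conv, r2_sm_conv = balls_at(x)
    dist2 = sum((a - b) ** 2 for a, b in zip(X_STAR, x_pp))
    return dist2 <= r2_conv + tol and r2_conv <= r2_sm_conv + tol


rng = random.Random(1)
all_ok = all(check([rng.uniform(-5, 5) for _ in D]) for _ in range(1000))
print(all_ok)
```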
Interpretation
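Introduce $\tilde{x}$ on $f^{\alpha}_{\text{conv}}$ such that $f^{\alpha}_{\text{conv}}(\tilde{x}) = f^{\beta}_{\text{sm}}(x^+) \ge f(x^*)$
$$\|x^{++} - x^*\| \le \|x^{++} - \tilde{x}\|$$
How to find $\tilde{x}$?
$$\tilde{x} = x - t\nabla f(x)$$
[Figure: the function f with the curves f^β_sm and f^α_conv; the points x*, x̃, x, x+ and x++ are marked.]
Our requirement on $\tilde{x}$:
$$f^{\alpha}_{\text{conv}}(\tilde{x}) = f(x) - t\|\nabla f(x)\|^2 + \frac{t^2\alpha}{2}\|\nabla f(x)\|^2 = f^{\beta}_{\text{sm}}(x^+) = f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2$$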
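Solving the quadratic equation:
$$\alpha t^2 - 2t + \frac{1}{\beta} = 0 \implies t = \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right)$$
$$\|\tilde{x} - x^{++}\|^2 = \left\|(x - t\nabla f(x)) - \left(x - \frac{1}{\alpha}\nabla f(x)\right)\right\|^2 = \left\|\frac{1}{\alpha}\nabla f(x) - \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right)\nabla f(x)\right\|^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\alpha^2}$$
$$\|x^* - x^{++}\|^2 \le \|\tilde{x} - x^{++}\|^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\alpha^2}$$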
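The closed form for t is easy to verify with plain arithmetic. The sketch below uses a few arbitrary (α, β) pairs, chosen only for illustration, and checks that t solves the quadratic and that (1/α − t)², which is ‖x̃ − x++‖² divided by ‖∇f(x)‖², equals (1 − 1/κ)/α².

```python
import math


def step_to_x_tilde(alpha, beta):
    """Smaller root of alpha * t^2 - 2 * t + 1 / beta = 0."""
    kappa = beta / alpha
    return (1.0 - math.sqrt(1.0 - 1.0 / kappa)) / alpha


checks = []
for alpha, beta in [(1.0, 10.0), (0.5, 8.0), (2.0, 2.5)]:
    t = step_to_x_tilde(alpha, beta)
    residual = alpha * t * t - 2.0 * t + 1.0 / beta
    lhs = (1.0 / alpha - t) ** 2            # ||x~ - x++||^2 / ||grad f(x)||^2
    rhs = (1.0 - alpha / beta) / alpha ** 2  # (1 - 1/kappa) / alpha^2
    checks.append((residual, lhs, rhs))
    print(alpha, beta, t, residual, lhs, rhs)
```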
A suboptimal algorithm
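We have just proven:
$$x^* \in B_{\text{sm-conv}}\left(x_0^{++},\ \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_0)\|^2}{\alpha^2}\right)$$
Assume that we are given $x_0$ such that
$$x^* \in B_{\text{guarantee}}(x_0, R_0^2)$$
We can intersect these two balls and get a ball with a smaller radius.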
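[Figure: the balls B_guarantee (center x_0, radius R_0) and B_sm-conv (center x_0^{++}); the points x*, x_0^+, the new center x_1 and the direction −∇f(x_0) are marked.]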
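From geometry:
$$x^* \in B_{\text{intersect}}\left(x_1,\ \left(1 - \frac{1}{\kappa}\right)R_0^2\right)$$
where $x_1$ is the center of a new ball.
Repeat the procedure using $B\left(x_1, (1 - \frac{1}{\kappa})R_0^2\right)$ as a guarantee.
We have reduced the squared radius of the original ball $B_{\text{guarantee}}(x_0, R_0^2)$ by a factor of $(1 - \frac{1}{\kappa})$.
After $t$ iterations:
$$\|x_t - x^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^t R_0^2$$
Same rate of convergence as gradient descent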
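The procedure above can be sketched in a few lines. Everything below is an illustration under stated assumptions: the test function is a hypothetical diagonal quadratic with x* = 0, the initial guarantee R_0² = ‖∇f(x_0)‖²/α² comes from strong convexity, and `enclose_intersection` is a helper written for this sketch that returns the smallest ball containing the intersection of two balls (the lens formula when the intersection circle lies between the centers, otherwise the smaller ball).

```python
# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
D = [1.0, 10.0]
ALPHA, BETA = min(D), max(D)
KAPPA = BETA / ALPHA
X_STAR = [0.0, 0.0]


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def enclose_intersection(ca, ra2, cb, rb2):
    """Smallest ball containing B(ca, ra2) and B(cb, rb2) intersected (radii are squared)."""
    d2 = sq_dist(ca, cb)
    if d2 > 0.0 and d2 >= abs(ra2 - rb2):
        # The plane of the intersection circle lies between the two centers.
        shift = (ra2 - rb2) / (2.0 * d2)
        c = [0.5 * (a + b) - shift * (a - b) for a, b in zip(ca, cb)]
        r2 = rb2 - (d2 + rb2 - ra2) ** 2 / (4.0 * d2)
        return c, max(r2, 0.0)
    # Otherwise the smaller ball is already the smallest enclosing ball.
    return (list(cb), rb2) if rb2 < ra2 else (list(ca), ra2)


def suboptimal_method(x0, steps):
    """Return [(x_t, R_t^2)] where B(x_t, R_t^2) is the current guarantee ball."""
    x = list(x0)
    # Strong convexity gives ||x0 - x*|| <= ||grad f(x0)|| / alpha.
    r2 = sum(gi * gi for gi in grad(x)) / ALPHA ** 2
    history = [(x, r2)]
    for _ in range(steps):
        g = grad(x)
        g2 = sum(gi * gi for gi in g) / ALPHA ** 2
        x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]
        x, r2 = enclose_intersection(x, r2, x_pp, (1.0 - 1.0 / KAPPA) * g2)
        history.append((x, r2))
    return history


history = suboptimal_method([3.0, -2.0], steps=60)
rate = 1.0 - 1.0 / KAPPA
print(history[0][1], history[-1][1])
```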
Acceleration
What if we had a better guarantee?
Assume that we are given $x_0$ such that
$$x^* \in B_{\text{guarantee}}\left(x_0,\ R_0^2 - \frac{2}{\alpha}\big(f(y) - f(x^*)\big)\right),$$
where $y$ is such that $f(x_0) \le f(y)$ (for example, $y := x_0$).
From the definition of smoothness and $f(x_0) \le f(y)$:
$$f(x_0^+) \le f(x_0) + \nabla f(x_0)^T (x_0^+ - x_0) + \frac{\beta}{2}\|x_0^+ - x_0\|^2 = f(x_0) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2 \le f(y) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$
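$$f(x_0^+) \le f(y) - \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$
$$f(y) \ge f(x_0^+) + \frac{1}{2\beta}\|\nabla f(x_0)\|^2$$
Substituting $f(y)$ in $B_{\text{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}(f(y) - f(x^*))\right)$:
$$x^* \in B_{\text{guarantee}}\left(x_0,\ R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\big(f(x_0^+) - f(x^*)\big)\right)$$
This ball is smaller than $B_{\text{guarantee}}(x_0, R_0^2)$.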
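Written out step by step, the substitution above only uses the lower bound on $f(y)$ and the identity $\alpha\beta = \alpha^2\kappa$:

```latex
\begin{aligned}
R_0^2 - \frac{2}{\alpha}\bigl(f(y) - f(x^*)\bigr)
&\le R_0^2 - \frac{2}{\alpha}\Bigl(f(x_0^+) + \frac{1}{2\beta}\|\nabla f(x_0)\|^2 - f(x^*)\Bigr) \\
&= R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha\beta} - \frac{2}{\alpha}\bigl(f(x_0^+) - f(x^*)\bigr) \\
&= R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\bigl(f(x_0^+) - f(x^*)\bigr).
\end{aligned}
```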
For the second ball, we use the same one as before:
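$$x^* \in B_{\text{sm-conv}}\left(x_0^{++},\ \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\big(f(x_0^+) - f(x^*)\big)\right)$$
We take the intersection of the two balls. Note that if one ball becomes smaller, the intersection also becomes smaller.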
Lemma: intersection of the two balls
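Denote:
$$\varepsilon = \frac{1}{\kappa}; \qquad g^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}; \qquad \delta = \frac{2}{\alpha}\big(f(x_0^+) - f(x^*)\big)$$
$$x^* \in B_{\text{guarantee}}\left(x_0,\ R_0^2 - \varepsilon g^2 - \delta\right) \cap B_{\text{sm-conv}}\left(x_0^{++},\ g^2(1 - \varepsilon) - \delta\right)$$
Theorem. Consider $g$, $\varepsilon$ such that $\varepsilon \in (0, 1)$, $g \ge 0$, and a point $a$ with $\|a\| = g$. There exists $c$ such that for any $\delta > 0$:
$$B(0,\ R^2 - \varepsilon g^2 - \delta) \cap B(a,\ g^2(1 - \varepsilon) - \delta) \subset B(c,\ R^2(1 - \sqrt{\varepsilon}) - \delta)$$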
Sketch of the proof
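[Figure: two intersecting balls, one centered at 0 with squared radius 1 − εg² − δ and one centered at distance g from 0 with squared radius g²(1 − ε) − δ; the intersection circle has radius r and lies at distance x from 0 and g − x from the second center.]
With the radius normalized to $R = 1$:
$$r^2 = 1 - \varepsilon g^2 - \delta - x^2 = g^2(1 - \varepsilon) - \delta - (g - x)^2 \implies x = \frac{1}{2g}$$
$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta$$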
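We will use the AM-GM inequality to estimate $\varepsilon g^2 + \frac{1}{(2g)^2}$:
$$\frac{1}{2}\left(\varepsilon g^2 + \frac{1}{(2g)^2}\right) \ge \sqrt{\varepsilon g^2 \cdot \frac{1}{4g^2}} = \frac{\sqrt{\varepsilon}}{2} \implies \varepsilon g^2 + \frac{1}{(2g)^2} \ge \sqrt{\varepsilon}$$
Therefore,
$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta \le 1 - \sqrt{\varepsilon} - \delta$$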
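The final inequality can be checked on a grid of values. The sketch below evaluates r² from the formula above (with R = 1) and compares it with 1 − √ε − δ; the grid and the sample value ε = 0.25 are arbitrary choices for illustration. It also evaluates the point where AM-GM is tight, namely εg² = 1/(2g)².

```python
import math


def lens_radius_sq(g, eps, delta=0.0):
    """r^2 = 1 - eps * g^2 - 1 / (2g)^2 - delta, as derived in the sketch (R = 1)."""
    return 1.0 - eps * g * g - 1.0 / (2.0 * g) ** 2 - delta


def bound(eps, delta=0.0):
    return 1.0 - math.sqrt(eps) - delta


grid_ok = all(
    lens_radius_sq(g, eps) <= bound(eps) + 1e-12
    for eps in [0.01, 0.1, 0.25, 0.5, 0.9]
    for g in [0.5 + 0.05 * k for k in range(1, 200)]
)

# AM-GM holds with equality when eps * g^2 = 1 / (2g)^2, i.e. g = (4 * eps)^(-1/4).
eps = 0.25
g_tight = (4.0 * eps) ** -0.25
gap_at_tight = bound(eps) - lens_radius_sq(g_tight, eps)
print(grid_ok, g_tight, gap_at_tight)
```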
Main result
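Using the theorem, we take the smallest ball that contains the intersection of the two:
$$x^* \in B_{\text{intersect}}\left(x_1,\ R_0^2\left(1 - \frac{1}{\sqrt{\kappa}}\right) - \frac{2}{\alpha}\big(f(x_0^+) - f(x^*)\big)\right)$$
The intersection ball is always shrinking:
$$R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$
$$\|x^* - x_t\|^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$
Same convergence rate as in Nesterov AGD
Much faster than the $(1 - \frac{1}{\kappa})^t$ rate of Gradient Descent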
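To get a feel for the gap between the two rates, the sketch below counts how many iterations each contraction factor needs to shrink the bound by a given factor. The tolerance 1e-6 and the sample condition numbers are arbitrary choices for illustration.

```python
import math


def iterations_needed(rate, tol):
    """Smallest integer t with rate**t <= tol, for 0 < rate < 1 and 0 < tol < 1."""
    return math.ceil(math.log(tol) / math.log(rate))


def compare(kappa, tol=1e-6):
    """Iteration counts for the (1 - 1/kappa) and (1 - 1/sqrt(kappa)) rates."""
    plain = iterations_needed(1.0 - 1.0 / kappa, tol)
    accelerated = iterations_needed(1.0 - 1.0 / math.sqrt(kappa), tol)
    return plain, accelerated


results = {kappa: compare(kappa) for kappa in (10.0, 100.0, 10000.0)}
for kappa, (plain, accelerated) in results.items():
    print(kappa, plain, accelerated)
```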
Caveats
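How do we find such $x_0$, $R_0$, $y$ for the guarantee ball?
$$x^* \in B_{\text{guarantee}}\left(x_0,\ R_0^2 - \frac{2}{\alpha}\big(f(y) - f(x^*)\big)\right),$$
where $y$ is such that $f(x_0) \le f(y)$
For the first iteration, we can use $B_{\text{sm-conv}}$:
$$x^* \in B_{\text{sm-conv}}\left(x_0^{++},\ \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\big(f(x_0^+) - f(x^*)\big)\right)$$
For other iterations we need a trick which we won't cover in this lecture.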
Algorithm
Geometric gradient descent
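Input: convexity parameter α and initial point $x_0$
$c_0 = x_0^{++}$; $R_0^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)$
for $t = 1, 2, \dots$ do
    $x_t := c_{t-1}$
    Ball 1: $B_{\text{guarantee}}\left(x_t,\ R_{t-1}^2 - \frac{\|\nabla f(x_t)\|^2}{\alpha^2\kappa}\right)$
    Ball 2: $B_{\text{sm-conv}}\left(x_t^{++},\ \frac{\|\nabla f(x_t)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$
    Find $B(c_t, R_t^2)$, the minimum ball enclosing the intersection; $R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)R_{t-1}^2$
end for
return $x_t$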
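A runnable sketch of this loop needs some rule that keeps Ball 1 valid after the first iteration. The code below assumes one such rule, which the slides do not describe: x_t is chosen by an exact line search on the line through c_{t−1} and x_{t−1}^+ (this is believed to be the combining step of the original paper, but treat it as an assumption here), and Ball 1 stays centered at c_{t−1}. The test function is a hypothetical diagonal quadratic with x* = 0, for which the line search has a closed form, and `enclose_intersection` is a helper written for this sketch.

```python
import math

# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
D = [1.0, 4.0, 10.0]
ALPHA, BETA = min(D), max(D)
KAPPA = BETA / ALPHA
X_STAR = [0.0, 0.0, 0.0]


def grad(x):
    return [d * xi for d, xi in zip(D, x)]


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def line_search(p, q):
    """Exact minimizer of the quadratic f on the line through p and q."""
    v = [qi - pi for pi, qi in zip(p, q)]
    curvature = sum(d * vi * vi for d, vi in zip(D, v))
    if curvature == 0.0:
        return list(p)
    s = -sum(gi * vi for gi, vi in zip(grad(p), v)) / curvature
    return [pi + s * vi for pi, vi in zip(p, v)]


def enclose_intersection(ca, ra2, cb, rb2):
    """Smallest ball containing B(ca, ra2) and B(cb, rb2) intersected (radii are squared)."""
    d2 = sq_dist(ca, cb)
    if d2 > 0.0 and d2 >= abs(ra2 - rb2):
        shift = (ra2 - rb2) / (2.0 * d2)
        c = [0.5 * (a + b) - shift * (a - b) for a, b in zip(ca, cb)]
        return c, max(rb2 - (d2 + rb2 - ra2) ** 2 / (4.0 * d2), 0.0)
    return (list(cb), rb2) if rb2 < ra2 else (list(ca), ra2)


def geometric_descent(x0, steps):
    """Return [(c_t, R_t^2)] for t = 0, ..., steps."""
    g = grad(x0)
    c = [xi - gi / ALPHA for xi, gi in zip(x0, g)]                  # c_0 = x_0^{++}
    r2 = (1.0 - 1.0 / KAPPA) * sum(gi * gi for gi in g) / ALPHA ** 2
    x_plus = [xi - gi / BETA for xi, gi in zip(x0, g)]
    history = [(c, r2)]
    for _ in range(steps):
        x = line_search(c, x_plus)        # assumed combining step, not in the slides
        g = grad(x)
        g2 = sum(gi * gi for gi in g) / ALPHA ** 2
        x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]
        c, r2 = enclose_intersection(c, max(r2 - g2 / KAPPA, 0.0),  # Ball 1
                                     x_pp, (1.0 - 1.0 / KAPPA) * g2)  # Ball 2
        x_plus = [xi - gi / BETA for xi, gi in zip(x, g)]
        history.append((c, r2))
    return history


history = geometric_descent([3.0, -2.0, 1.0], steps=40)
rate = 1.0 - 1.0 / math.sqrt(KAPPA)
print(history[0][1], history[-1][1])
```

On this toy problem the recorded radii can be compared against the $(1 - \frac{1}{\sqrt{\kappa}})^t$ envelope; for a general f the exact line search would have to be replaced by a numerical one.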
Summary
- Goal: minimize a β-smooth and α-strongly convex function
- We have provided a geometric interpretation of GD and Accelerated GD
- Gradient descent has a $(1 - \frac{1}{\kappa})^t$ convergence rate
- Accelerated GD achieves a $(1 - \frac{1}{\sqrt{\kappa}})^t$ convergence rate
Thank you for listening