Mini-Course 1: SGD Escapes Saddle Points
Yang Yuan
Computer Science Department, Cornell University
Gradient Descent (GD)
- Task: min_x f(x)
- GD does iterative updates: x_{t+1} = x_t − η_t ∇f(x_t)
Gradient Descent (GD) has at least two problems
- Computing the full gradient is slow for big data.
- It gets stuck at stationary points.
Stochastic Gradient Descent (SGD)
- Very similar to GD, but the gradient now has some randomness:
  x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).
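A quick numerical sketch of the difference (a toy example of my own, not from the lecture): on f(x, y) = x² − y², plain GD initialized exactly at the saddle (0, 0) never moves, because the gradient there is exactly zero, while SGD's gradient noise pushes the iterate onto the negative-curvature direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(z):
    # f(x, y) = x^2 - y^2 has a saddle point at the origin
    return np.array([2.0 * z[0], -2.0 * z[1]])

def run(steps=200, eta=0.1, noise=0.0):
    z = np.zeros(2)  # start exactly at the saddle
    for _ in range(steps):
        g = grad(z) + noise * rng.standard_normal(2)  # E[g] = grad f(z)
        z = z - eta * g
    return z

z_gd = run(noise=0.0)   # gradient is exactly zero at the saddle: GD is stuck
z_sgd = run(noise=0.1)  # the noise has a component along the escape direction
```

With noise, the y-coordinate is amplified by a factor (1 + 2η) per step, so SGD leaves the saddle exponentially fast once the noise gives it any component along e₁.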
Why do we use SGD?
Initially because:
- Much cheaper to compute, using mini-batches
- Can still converge to the global minimum in the convex case
But now people realize:
- Can escape saddle points! (Today's topic)
- Can escape shallow local minima (Next time's topic; some progress)
- Can find local minima that generalize well (Not well understood)
Therefore, it's not only faster, but also works better!
About the g_t that we use
x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).
- In practice, g_t is obtained by sampling a mini-batch of size 128 or 256 from the dataset.
- To simplify the analysis, we assume
  g_t = ∇f(x_t) + ξ_t,
  where ξ_t ~ N(0, I) or ξ_t is drawn uniformly from the ball B_0(r).
- In general, the analysis works as long as ξ_t has non-negligible components in every direction.
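The two noise models above can be sampled as follows (a sketch; the dimension d and the function grad_f are placeholders of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5  # hypothetical dimension

def xi_gaussian():
    # xi_t ~ N(0, I): standard Gaussian in R^d
    return rng.standard_normal(d)

def xi_ball(r=1.0):
    # xi_t uniform in the ball B_0(r): random direction, radius r * U^(1/d)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return r * rng.uniform() ** (1.0 / d) * v

def g(grad_f, x):
    # stochastic gradient: E[g_t] = grad f(x_t), since E[xi_t] = 0
    return grad_f(x) + xi_gaussian()
```

Both choices are isotropic, so the noise has a component along every direction, which is what the escape analysis needs.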
Preliminaries
- L-Lipschitz, i.e.,
  |f(w_1) − f(w_2)| ≤ L‖w_1 − w_2‖_2
- ℓ-smoothness: the gradient is ℓ-Lipschitz, i.e.,
  ‖∇f(w_1) − ∇f(w_2)‖_2 ≤ ℓ‖w_1 − w_2‖_2
- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e.,
  ‖∇²f(w_1) − ∇²f(w_2)‖_sp ≤ ρ‖w_1 − w_2‖_2
- We need this because we will use the Hessian at the current point to approximate the function in a neighborhood,
- and then bound the approximation error.
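For a quadratic f(w) = ½wᵀAw these constants are explicit: the gradient is Aw, so ℓ is the spectral norm of A, and the Hessian is constant, so ρ = 0. A small numerical check (my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric, so f(w) = 0.5 w^T A w
ell = np.linalg.norm(A, 2)   # spectral norm = gradient Lipschitz constant

def grad(w):
    return A @ w

# ||grad f(w1) - grad f(w2)|| = ||A (w1 - w2)|| <= ell * ||w1 - w2||
for _ in range(1000):
    w1, w2 = rng.standard_normal(2), rng.standard_normal(2)
    assert np.linalg.norm(grad(w1) - grad(w2)) <= ell * np.linalg.norm(w1 - w2) + 1e-9
```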
Saddle points and negative eigenvalues
Stationary points: saddle points, local minima, local maxima
For stationary points, ∇f(w) = 0:
- If ∇²f(w) ≻ 0, it's a local minimum.
- If ∇²f(w) ≺ 0, it's a local maximum.
- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.
- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be a local minimum, a local maximum, or a saddle point.
  - f is "flat" in some directions
  - SGD behaves like a random walk there
  - We only consider the non-degenerate case!
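The classification above is mechanical to check numerically; a sketch (the tolerance tol is an arbitrary cutoff of mine for "nonzero"):

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    # Assumes grad f(w) = 0 already holds at w; inspect the Hessian spectrum.
    eig = np.linalg.eigvalsh(hessian)
    if eig.min() > tol:
        return "local minimum"
    if eig.max() < -tol:
        return "local maximum"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "degenerate"  # some eigenvalue ~ 0: excluded from this analysis

# Hessian of f(x, y) = x^2 - y^2 at the origin: eigenvalues +2 and -2
classify_stationary_point(np.diag([2.0, -2.0]))
```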
Strict saddle property
f(w) is (α, γ, ε, ζ)-strict saddle if, for any w:
- ‖∇f(w)‖_2 ≥ ε,
- or λ_min(∇²f(w)) ≤ −γ < 0,
- or there exists w* such that ‖w − w*‖_2 ≤ ζ and the region centered at w* with radius 2ζ is α-strongly convex.
Which means:
- The gradient is large,
- or (at a stationary point) we have a negative-eigenvalue direction to escape along,
- or (at a stationary point with no negative eigenvalues) we are pretty close to a local minimum.
Strict saddle functions are everywhere
- Orthogonal tensor decomposition [Ge et al 2015]
- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]
- Matrix completion [Ge et al 2016]
- Generalized phase retrieval problem [Sun et al 2016]
- Low-rank matrix recovery [Bhojanapalli et al 2016]
Moreover, in these problems, all local minima are equally good! That means:
- SGD escapes all saddle points,
- so once SGD arrives at a local minimum, it has found a global minimum!
- This is one popular way to prove that SGD solves the problem.
Main Results
- [Ge et al 2015]: whp, SGD will escape all saddle points and converge to a local minimum. The convergence time has polynomial dependence on the dimension d.
- [Jin et al 2017]: whp, PGD (a variant of SGD) will escape all saddle points and converge to a local minimum much faster. The dependence on d is logarithmic.
- Both use the same proof framework. We'll mainly look at the newer result.
Description of PGD
Do the following iteratively:
- If ‖∇f(x_t)‖ ≤ g_thres, and the last perturbation was more than t_thres steps ago, add a random perturbation (uniform in a ball).
- If the perturbation happened exactly t_thres steps ago, but f has decreased by less than f_thres, return the value from before the last perturbation.
- Do a gradient descent step:
  x_{t+1} = x_t − η∇f(x_t)
A few remarks:
- Unfortunately, it is not a fast algorithm, because it uses full GD!
- η = c/ℓ; g_thres, t_thres, f_thres depend on a constant c, as well as other parameters.
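The loop above can be sketched as follows. This is my own simplification: the thresholds are free parameters here, not the paper's exact expressions in c, ℓ, ρ, and the perturbation is uniform in a ball of a fixed radius.

```python
import numpy as np

rng = np.random.default_rng(3)

def pgd(grad_f, f, x0, eta, g_thres, t_thres, f_thres, radius, max_iter=10_000):
    x = x0.astype(float).copy()
    t_perturb = -(t_thres + 1)          # "no perturbation yet"
    x_before, f_before = None, None
    for t in range(max_iter):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - t_perturb > t_thres:
            # small gradient, long since last perturbation: perturb (ball)
            x_before, f_before, t_perturb = x.copy(), f(x), t
            xi = rng.standard_normal(len(x))
            xi *= radius * rng.uniform() ** (1.0 / len(x)) / np.linalg.norm(xi)
            x = x + xi
        elif t - t_perturb == t_thres and f_before - f(x) < f_thres:
            # t_thres steps after a perturbation, f barely decreased: trapped,
            # so return the point from before the perturbation
            return x_before
        x = x - eta * grad_f(x)         # plain gradient descent step
    return x
```

On f(x, y) = (x² − 1)² + y², which has a saddle at the origin and minima at (±1, 0), this sketch starts at the saddle, perturbs, escapes, and eventually returns a point near one of the minima.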
Main theorem in [Jin et al 2017]
Theorem (Main Theorem)
Assume the function f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant c_max such that, for any δ > 0, ∆_f ≥ f(x_0) − f*, constant c ≤ c_max, and ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:
O( (ℓ(f(x_0) − f*)/ε̃²) · log⁴(dℓ∆_f/(ε̃²δ)) )
- If we could show SGD has a similar property, that would be great!
- The convergence rate is almost optimal.
More general version: why it's fast
Theorem (A more general version)
Assume f is ℓ-smooth and ρ-Hessian Lipschitz. There exists an absolute constant c_max such that, for any δ > 0, ∆_f ≥ f(x_0) − f*, constant c ≤ c_max, and ε̃ ≤ ℓ²/ρ, PGD(c) will output a point ζ-close to an ε̃-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:
O( (ℓ(f(x_0) − f*)/ε̃²) · log⁴(dℓ∆_f/(ε̃²δ)) )
This is essentially saying the same thing. If f is not strict saddle, only an ε̃-second-order stationary point (instead of a local minimum) is guaranteed.
ε-stationary points
- ε-first-order stationary point:
  ‖∇f(x)‖ ≤ ε
- ε-second-order stationary point:
  ‖∇f(x)‖ ≤ ε, λ_min(∇²f(x)) ≥ −√(ρε)
- If f is ℓ-smooth, λ_min(∇²f(x)) ≥ −ℓ.
- For any ε > ℓ²/ρ, an ε-first-order stationary point of an ℓ-smooth function is an (ℓ²/ρ)-second-order stationary point.
- If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is a local minimum.
[Nesterov, 1998]
Theorem
Assume that f is ℓ-smooth. Then for any ε̃ > 0, if we run GD with step size η = 1/ℓ and termination condition ‖∇f(x)‖ ≤ ε̃, the output will be an ε̃-first-order stationary point, and the algorithm terminates in the following number of iterations:
ℓ(f(x_0) − f*) / ε̃²
[Jin et al 2017]: PGD converges to an ε̃-second-order stationary point in O( (ℓ(f(x_0) − f*)/ε̃²) · log⁴(dℓ∆_f/(ε̃²δ)) ) steps.
- Matched up to log factors!
Why −√(ρε)?
- Consider the third-order approximation around x [Nesterov and Polyak, 2006]:
  min_y { ⟨∇f(x), y − x⟩ + (1/2)⟨∇²f(x)(y − x), y − x⟩ + (ρ/6)‖y − x‖³ }
  and denote the minimizer by T_x.
- Denote the distance r = ‖x − T_x‖.
- Then ‖∇f(T_x)‖ ≤ ρr², and ∇²f(T_x) ⪰ −(3/2)ρr I.
- To get a lower bound for r:
  r ≥ max{ √(‖∇f(T_x)‖/ρ), −(2/(3ρ))λ_min(∇²f(T_x)) }
- Setting the two terms equal gives the curvature threshold −√(ρε).
Related results
1. "Gradient Descent Converges to Minimizers" by Lee, Simchowitz, Jordan and Recht, '15.
   - With random initialization, GD almost surely never touches any saddle points, and always converges to local minima.
2. "The power of normalization: faster evasion of saddle points", Kfir Levy, '16.
   - Normalized gradient descent can escape saddle points in O(d³ poly(1/ε)) steps: slower than [Jin et al 2017], faster than [Ge et al 2015], but still polynomial in d.
Main theorem in [Jin et al 2017] (recap)
Recall: PGD(c) outputs a point ζ-close to a local minimum, with probability 1 − δ, in O( (ℓ(f(x_0) − f*)/ε̃²) · log⁴(dℓ∆_f/(ε̃²δ)) ) iterations.
Proof framework: Progress, Escape and Trap
- Progress: when ‖∇f(x)‖ > g_thres, f(x) is decreased by at least f_thres/t_thres per step.
- Escape: when ‖∇f(x)‖ ≤ g_thres and λ_min(∇²f(x)) ≤ −γ, whp the function value is decreased by f_thres after the perturbation plus t_thres steps.
  - That is f_thres/t_thres per step on average.
- Trap:
  - The algorithm can't make progress and escape forever, because f is bounded below!
  - When it stops: the perturbation happened t_thres steps ago, but f has decreased by less than f_thres.
  - That means ‖∇f(x)‖ ≤ g_thres before the perturbation, and whp there is no eigenvalue ≤ −γ.
  - So it's a local minimum!
Progress
Lemma
If f is ℓ-smooth, then for GD with step size η < 1/ℓ, we have:
f(x_{t+1}) ≤ f(x_t) − (η/2)‖∇f(x_t)‖²
Proof.
f(x_{t+1}) ≤ f(x_t) + ∇f(x_t)ᵀ(x_{t+1} − x_t) + (ℓ/2)‖x_{t+1} − x_t‖²
          = f(x_t) − η‖∇f(x_t)‖² + (η²ℓ/2)‖∇f(x_t)‖²
          ≤ f(x_t) − (η/2)‖∇f(x_t)‖²
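The guaranteed per-step decrease can be checked numerically. A minimal sketch (the quadratic f, the smoothness constant ℓ, and the step size η below are toy choices for illustration, not from the lecture):

```python
import numpy as np

# Check the descent lemma on an l-smooth quadratic: for eta < 1/l,
# each GD step satisfies f(x_{t+1}) <= f(x_t) - (eta/2)*||grad f(x_t)||^2.
A = np.diag([1.0, 3.0, 5.0])           # Hessian; l = 5 (largest eigenvalue)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eta = 0.1                              # eta = 0.1 < 1/l = 0.2
x = np.array([1.0, -2.0, 0.5])
for _ in range(20):
    g = grad(x)
    x_next = x - eta * g
    # decrease guaranteed by the lemma (small tolerance for float error)
    assert f(x_next) <= f(x) - 0.5 * eta * (g @ g) + 1e-12
    x = x_next
print("descent lemma held at every step")
```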
Escape: main idea
Escape: thin pancake
Main Lemma: measure the width
Lemma
Suppose we start with a point x̃ satisfying the following conditions:
‖∇f(x̃)‖ ≤ g_thres, λ_min(∇²f(x̃)) ≤ −γ
Let e₁ be the minimum eigenvector. Consider two gradient descent sequences {u_t}, {w_t}, with initial points u₀, w₀ satisfying:
‖u₀ − x̃‖ ≤ r, w₀ = u₀ + µr·e₁, µ ∈ [δ/(2√d), 1]
Then, for any step size η ≤ c_max/ℓ and any T ≥ t_thres, we have
min{f(u_T) − f(u₀), f(w_T) − f(w₀)} ≤ −2.5 f_thres
I As long as u₀ − w₀ lies along e₁ and ‖u₀ − w₀‖ ≥ δr/(2√d), at least one of them will escape!
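The coupling argument can be seen on a toy example. A sketch (the test function and all constants are assumptions for illustration): run GD from two points that differ only along the negative-curvature direction e₁; at least one of them must escape the saddle.

```python
import numpy as np

# f has a saddle at (0,0) with e1 = (1,0) (eigenvalue -1) and minima at (+-1, 0).
f = lambda p: 0.25 * (p[0]**2 - 1)**2 + 0.5 * p[1]**2
grad = lambda p: np.array([p[0]**3 - p[0], p[1]])

def gd(p, eta=0.1, T=200):
    for _ in range(T):
        p = p - eta * grad(p)
    return p

u0 = np.array([0.0, 0.05])                  # exactly on the stable manifold: stuck
w0 = u0 + 0.05 * np.array([1.0, 0.0])       # shifted along e1 only

uT, wT = gd(u0), gd(w0)
# u_T stays at the saddle; w_T slides down toward the minimum at (1, 0),
# so at least one of the two runs decreases f substantially.
assert min(f(uT) - f(u0), f(wT) - f(w0)) < -0.2
print("coupled pair: at least one sequence escaped")
```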
Escape Case
Lemma (Escape case)
Suppose we start with a point x̃ satisfying the following conditions:
‖∇f(x̃)‖ ≤ g_thres, λ_min(∇²f(x̃)) ≤ −γ
Let x₀ = x̃ + ξ, where ξ is drawn from the uniform distribution over the ball of radius r, and let x_t be the iterates of GD from x₀. Then when η < c_max/ℓ, with probability at least 1 − δ, for any T ≥ t_thres:
f(x_T) − f(x̃) ≤ −f_thres
Proof of the escape lemma
By smoothness, the perturbation step does not increase f much:
f(x₀) − f(x̃) ≤ ∇f(x̃)ᵀξ + (ℓ/2)‖ξ‖² ≤ · · · ≤ 1.5 f_thres
By the main lemma, for any x₀ ∈ X_stuck, we know (x₀ ± µr·e₁) ∉ X_stuck, where µ ∈ [δ/(2√d), 1]. Hence X_stuck is a thin slab:
Vol(X_stuck) ≤ Vol(B_x̃^(d−1)(r)) × (δr/(2√d)) × 2
Therefore, the probability that we picked a point in X_stuck is bounded by
Vol(X_stuck) / Vol(B_x̃^(d)(r)) ≤ δ
Thus, with probability at least 1 − δ, x₀ ∉ X_stuck, and in this case, by the main lemma:
f(x_T) − f(x̃) ≤ −2.5 f_thres + 1.5 f_thres = −f_thres
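The escape-case lemma can be simulated on the same toy saddle as before (the function, the perturbation radius, and the step count are assumptions for illustration): perturb the saddle uniformly in a small ball and run plain GD; since the stuck region is a measure-near-zero slab, random starts escape with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda p: 0.25 * (p[0]**2 - 1)**2 + 0.5 * p[1]**2   # saddle at the origin
grad = lambda p: np.array([p[0]**3 - p[0], p[1]])

saddle = np.array([0.0, 0.0])
trials, escaped = 100, 0
for _ in range(trials):
    xi = rng.normal(size=2)
    xi = 0.1 * np.sqrt(rng.uniform()) * xi / np.linalg.norm(xi)  # uniform in ball, r = 0.1
    p = saddle + xi
    for _ in range(300):                                          # plain GD after perturbing
        p = p - 0.1 * grad(p)
    if f(p) < f(saddle) - 0.1:                                    # decreased by a fixed threshold
        escaped += 1
print(escaped, "of", trials, "perturbed runs escaped the saddle")
```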
How to prove the main Lemma?
I If u_T does not decrease the function value, then {u₀, · · · , u_T} are close to x̃.
I If {u₀, · · · , u_T} are close to x̃, GD on w₀ will decrease the function value.
We will need the following quadratic approximation:
f̃_y(x) = f(y) + ∇f(y)ᵀ(x − y) + (1/2)(x − y)ᵀH(x − y)
where H = ∇²f(x̃).
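For a ρ-Hessian-Lipschitz f, the error of this frozen-Hessian approximation is controlled by |f(x) − f̃_y(x)| ≤ (ρ/2)‖y − x̃‖‖x − y‖² + (ρ/6)‖x − y‖³, which is the bound used later in the proof. A numerical sketch (the cubic test function and ρ = 1 are assumptions chosen so ρ is known exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sum(x**3) / 6.0        # Hessian is diag(x): rho = 1 works
grad = lambda x: x**2 / 2.0
hess = lambda x: np.diag(x)
rho = 1.0

x_tilde = rng.normal(size=5)
H = hess(x_tilde)                       # Hessian frozen at x_tilde
for _ in range(100):
    y = x_tilde + 0.1 * rng.normal(size=5)
    x = y + 0.1 * rng.normal(size=5)
    f_tilde = f(y) + grad(y) @ (x - y) + 0.5 * (x - y) @ H @ (x - y)
    bound = (rho / 2) * np.linalg.norm(y - x_tilde) * np.linalg.norm(x - y)**2 \
            + (rho / 6) * np.linalg.norm(x - y)**3
    assert abs(f(x) - f_tilde) <= bound + 1e-12
print("frozen-Hessian error bound verified on 100 random pairs")
```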
Two lemmas (simplified)
Lemma (u_T-stuck)
There exists an absolute constant c_max s.t., for any initial point u₀ with ‖u₀ − x̃‖ ≤ r, define
T = min{ inf_t { t | f̃_{u₀}(u_t) − f(u₀) ≤ −3 f_thres }, t_thres }
Then, for any η ≤ c_max/ℓ, we have ‖u_t − x̃‖ ≤ Φ for all t < T.
Lemma (w_T-escape)
There exists an absolute constant c_max s.t., define
T = min{ inf_t { t | f̃_{w₀}(w_t) − f(w₀) ≤ −3 f_thres }, t_thres }
Then, for any η ≤ c_max/ℓ, if ‖u_t − x̃‖ ≤ Φ for all t < T, we have T < t_thres.
Prove the main lemma
Assume x̃ is the origin. Define
T′ = inf_t { t | f̃_{u₀}(u_t) − f(u₀) ≤ −3 f_thres }
Case T′ ≤ t_thres:
We know ‖u_{T′−1}‖ ≤ Φ by the u_T-stuck lemma. By a simple calculation, we can show that ‖u_{T′}‖ = O(Φ) as well. Then:
f(u_{T′}) − f(u₀)
 ≤ ∇f(u₀)ᵀ(u_{T′} − u₀) + (1/2)(u_{T′} − u₀)ᵀ∇²f(u₀)(u_{T′} − u₀) + (ρ/6)‖u_{T′} − u₀‖³
 ≤ f̃_{u₀}(u_{T′}) − f(u₀) + (ρ/2)‖u₀ − x̃‖‖u_{T′} − u₀‖² + (ρ/6)‖u_{T′} − u₀‖³
 ≤ −2.5 f_thres
Prove the main lemma
Case T′ > t_thres: By the u_T-stuck lemma, we know ‖u_t‖ ≤ Φ for all t ≤ t_thres. Using the w_T-escape lemma, we know
T″ = inf_t { t | f̃_{w₀}(w_t) − f(w₀) ≤ −3 f_thres } ≤ t_thres
Then we may reduce this to the case T′ ≤ t_thres, because w and u are interchangeable.
Prove the u_T-stuck lemma
Lemma (u_T-stuck)
There exists an absolute constant c_max s.t., for any initial point u₀ with ‖u₀ − x̃‖ ≤ r, define
T = min{ inf_t { t | f̃_{u₀}(u_t) − f(u₀) ≤ −3 f_thres }, t_thres }
Then, for any η ≤ c_max/ℓ, we have ‖u_t − x̃‖ ≤ Φ for all t < T.
I We won't move much in the directions with large negative eigenvalues; otherwise that would already be a lot of progress!
Let B_t be the component of u_t in the remaining subspace, where the eigenvalues are ≥ −γ/100. Then
‖B_{t+1}‖ ≤ (1 + ηγ/100)‖B_t‖ + 2η·g_thres
If T ≤ t_thres, we will have (1 + ηγ/100)^T ≤ 3, so ‖B_T‖ is bounded.
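Unrolling this recursion gives ‖B_T‖ ≤ (1 + ηγ/100)^T ‖B₀‖ + 2η·g_thres · ((1 + ηγ/100)^T − 1)/(ηγ/100), so once (1 + ηγ/100)^T ≤ 3 the whole trajectory stays in a constant-size region. A numerical sketch (η, γ, g_thres, and ‖B₀‖ are assumed toy values):

```python
import numpy as np

eta, gamma, g_thres = 0.01, 1.0, 0.1
a = 1 + eta * gamma / 100          # per-step growth factor
b = 2 * eta * g_thres              # per-step additive drift

T = int(np.log(3) / np.log(a))     # largest T with a**T <= 3
B = 1.0                            # ||B_0||
for t in range(T):
    B = a * B + b                  # worst-case recursion

# closed-form unrolling: a^T * B_0 + b * (a^T - 1) / (a - 1)
closed = a**T * 1.0 + b * (a**T - 1) / (a - 1)
assert abs(B - closed) < 1e-6
assert B <= 3 * 1.0 + b * 2 / (a - 1)   # bounded, using a^T <= 3
print("||B_T|| stays below", 3 * 1.0 + b * 2 / (a - 1))
```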
Prove the w_T-escape lemma
Lemma (w_T-escape)
There exists an absolute constant c_max s.t., define
T = min{ inf_t { t | f̃_{w₀}(w_t) − f(w₀) ≤ −3 f_thres }, t_thres }
Then, for any η ≤ c_max/ℓ, if ‖u_t − x̃‖ ≤ Φ for all t < T, we have T < t_thres.
Prove the w_T-escape lemma
I Let v_t = w_t − u_t.
I We want to show that for some T < t_thres, w_T makes progress.
I If w_t makes no progress, then by the u_T-stuck lemma it is still near x̃.
I Therefore, we would always have ‖v_t‖ ≤ ‖u_t‖ + ‖w_t‖ ≤ 2Φ.
However, v_t increases very rapidly, so it cannot stay small forever!
I In the e₁ direction, v₀ is at least δr/(2√d).
I Every step, it is multiplied by at least 1 + ηγ.
I Within T < t_thres steps, we get ‖v_T‖ > 2Φ, a contradiction, so w_T made progress!
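The geometric growth of v_t is exact on a quadratic model. A sketch (quadratic toy saddle, assumed constants): GD gives v_{t+1} = (I − ηH)v_t, so the e₁ component (eigenvalue −γ) is multiplied by exactly 1 + ηγ every step.

```python
import numpy as np

eta, gamma = 0.1, 0.5
H = np.diag([1.0, -gamma])          # e1 = (0, 1) has eigenvalue -gamma
grad = lambda p: H @ p

u = np.array([0.3, 0.0])
w = u + np.array([0.0, 1e-3])       # coupled start, shifted along e1 only
for t in range(60):
    u = u - eta * grad(u)
    w = w - eta * grad(w)

v = w - u
# on this quadratic, ||v_T|| = 1e-3 * (1 + eta*gamma)^60 exactly
assert np.isclose(np.linalg.norm(v), 1e-3 * (1 + eta * gamma) ** 60)
print("separation after 60 steps:", np.linalg.norm(v))
```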