
Post on 05-Jun-2021


Mini-Course 1: SGD Escapes Saddle Points

Yang Yuan

Computer Science Department, Cornell University

Gradient Descent (GD)

- Task: min_x f(x)

- GD performs the iterative update x_{t+1} = x_t − η_t ∇f(x_t)

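As a quick illustration, the GD update can be sketched as follows (a minimal sketch; the toy function and step size are invented for the example):

```python
import numpy as np

def gd(grad, x0, eta=0.1, steps=200):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Toy example: f(x) = ||x||^2 / 2, so grad f(x) = x and the minimizer is 0.
x_final = gd(lambda x: x, x0=[3.0, -4.0])
```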

Gradient Descent (GD) has at least two problems

- Computing the full gradient is slow for big data.

- It can get stuck at stationary points.


Stochastic Gradient Descent (SGD)

- Very similar to GD, but the gradient now has some randomness:

  x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).

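In practice the stochastic gradient g_t comes from averaging per-sample gradients over a random mini-batch, which is an unbiased estimate of ∇f. A minimal least-squares sketch (the data, sizes, and step size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noiseless least squares: f(w) = (1 / 2n) * sum_i (a_i . w - b_i)^2
n, d = 1000, 5
A = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
b = A @ w_true

def minibatch_grad(w, batch=64):
    """Unbiased stochastic gradient: E[g_t] = grad f(w)."""
    idx = rng.integers(0, n, size=batch)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ w - bb) / batch

# SGD: x_{t+1} = x_t - eta_t * g_t
w = np.zeros(d)
for _ in range(2000):
    w -= 0.01 * minibatch_grad(w)
```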


Why do we use SGD?

Initially because:

- Much cheaper to compute using a mini-batch

- Can still converge to the global minimum in the convex case

But now people realize:

- Can escape saddle points! (Today's topic)

- Can escape shallow local minima (next time's topic; some progress.)

- Can find local minima that generalize well (not well understood)

Therefore, it's not only faster, but also works better!

About the g_t that we use

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).

- In practice, g_t is obtained by sampling a mini-batch of size 128 or 256 from the dataset

- To simplify the analysis, we assume

  g_t = ∇f(x_t) + ξ_t,

  where ξ_t ∼ N(0, I) or ξ_t is drawn from the ball B_0(r)

- In general, the analysis works whenever ξ_t has non-negligible components in every direction.

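Under this simplified model g_t = ∇f(x_t) + ξ_t, even an iterate started exactly at a saddle point escapes, because the noise has a component along the negative-curvature direction. A sketch on a made-up test function with a saddle at the origin and minima at (±1, 0):

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2: saddle at (0, 0), minima at (+-1, 0).
grad = lambda z: np.array([z[0]**3 - z[0], z[1]])

x = np.zeros(2)                              # start exactly at the saddle
eta = 0.01
for _ in range(5000):
    g = grad(x) + 0.1 * rng.normal(size=2)   # g_t = grad f(x_t) + xi_t
    x = x - eta * g
```

The noise kicks the iterate off the unstable x-direction, after which the negative curvature amplifies the displacement and the iterate settles near one of the two minima.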

Preliminaries

- L-Lipschitz, i.e.,

  |f(w_1) − f(w_2)| ≤ L‖w_1 − w_2‖_2

- ℓ-smoothness: the gradient is ℓ-Lipschitz, i.e.,

  ‖∇f(w_1) − ∇f(w_2)‖_2 ≤ ℓ‖w_1 − w_2‖_2

- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e.,

  ‖∇²f(w_1) − ∇²f(w_2)‖_sp ≤ ρ‖w_1 − w_2‖_2

- We need this because we will use the Hessian at the current point to approximate f in a neighborhood

- Then bound the approximation error.

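For a quadratic f(w) = ½ wᵀHw the gradient-Lipschitz constant ℓ is exactly the spectral norm of H, which a quick numerical check confirms (the matrix here is random, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4))
H = H @ H.T                                  # symmetric Hessian of f(w) = w.T H w / 2
ell = float(np.linalg.eigvalsh(H).max())     # spectral norm of H = smoothness constant

grad = lambda w: H @ w

# Check ||grad f(w1) - grad f(w2)|| <= ell * ||w1 - w2|| on random pairs
ok = all(
    np.linalg.norm(grad(w1) - grad(w2)) <= ell * np.linalg.norm(w1 - w2) + 1e-9
    for w1, w2 in (rng.normal(size=(2, 4)) for _ in range(100))
)
```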

Saddle points and negative eigenvalues


Stationary points: saddle points, local minima, local maxima

For stationary points, ∇f(w) = 0:

- If ∇²f(w) ≻ 0, it's a local minimum.

- If ∇²f(w) ≺ 0, it's a local maximum.

- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.

- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be either a local minimum (maximum) or a saddle point.
  - f is "flat" in some directions
  - SGD behaves like a random walk there
  - We only consider the non-degenerate case!
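This classification can be read off directly from the Hessian's eigenvalues (a minimal helper sketch; the tolerance used for the degenerate case is an arbitrary choice):

```python
import numpy as np

def classify_stationary(hessian, tol=1e-8):
    """Classify a stationary point (grad f = 0) via Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hessian)
    if np.any(np.abs(eig) < tol):
        return "degenerate"
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"         # mixed signs

# f(x, y) = x^2 - y^2 has a saddle at the origin: Hessian = diag(2, -2).
kind = classify_stationary(np.diag([2.0, -2.0]))
```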


Strict saddle property

f(w) is (α, γ, ε, ζ)-strict saddle if, for any w, one of the following holds:

- ‖∇f(w)‖_2 ≥ ε

- or λ_min(∇²f(w)) ≤ −γ < 0

- or there exists w* such that ‖w − w*‖_2 ≤ ζ, and the region centered at w* with radius 2ζ is α-strongly convex.

Which means:

- The gradient is large,

- or (stationary point) we have a negative-eigenvalue direction to escape along,

- or (stationary point, no negative eigenvalues) we are pretty close to a local minimum.


Strict saddle functions are everywhere

- Orthogonal tensor decomposition [Ge et al. 2015]

- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]

- Matrix completion [Ge et al. 2016]

- Generalized phase retrieval [Sun et al. 2016]

- Low-rank matrix recovery [Bhojanapalli et al. 2016]

Moreover, in these problems, all local minima are equally good! That means:

- SGD escapes all saddle points

- So SGD arrives at some local minimum → global minimum!

- This is one popular way to prove that SGD solves the problem.

Main Results

- [Ge et al. 2015]: whp, SGD escapes all saddle points and converges to a local minimum. The convergence time has polynomial dependence on the dimension d.

- [Jin et al. 2017]: whp, PGD (a variant of SGD) escapes all saddle points and converges to a local minimum much faster. The dependence on d is only logarithmic.

- Same proof framework. We'll mainly look at the newer result.



Description of PGD

Do the following iteratively:

- If ‖∇f(x_t)‖ ≤ g_thres, and the last perturbation was > t_thres steps ago, apply a random perturbation (uniform in a ball)

- If the perturbation happened t_thres steps ago, but f has decreased by less than f_thres, return the value before the last perturbation

- Do a gradient descent step:

  x_{t+1} = x_t − η∇f(x_t)

A few remarks:

- Unfortunately, it is not a fast algorithm, because it uses full GD!

- η = c/ℓ; g_thres, t_thres, and f_thres depend on a constant c, as well as on the other problem parameters.
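The three rules can be sketched as follows. This is only an illustrative toy implementation: the constants (η, g_thres, t_thres, f_thres, perturbation radius) are simplified ad-hoc choices, not the carefully tuned values from [Jin et al. 2017]:

```python
import numpy as np

def pgd(f, grad, x0, eta=0.05, g_thres=1e-3, t_thres=1000,
        f_thres=1e-3, r=0.1, max_iter=20000, seed=0):
    """Toy sketch of perturbed gradient descent."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    t_perturb = -t_thres - 1              # time of the last perturbation
    x_before, f_before = None, None
    for t in range(max_iter):
        # Rule 1: small gradient and no recent perturbation -> perturb
        if np.linalg.norm(grad(x)) <= g_thres and t - t_perturb > t_thres:
            x_before, f_before, t_perturb = x.copy(), f(x), t
            xi = rng.normal(size=x.shape)
            x = x + r * xi / np.linalg.norm(xi)
        # Rule 2: perturbed t_thres steps ago but little decrease -> trapped
        if x_before is not None and t - t_perturb == t_thres:
            if f_before - f(x) < f_thres:
                return x_before           # whp near a local minimum
        # Rule 3: plain gradient descent step
        x = x - eta * grad(x)
    return x

# Toy function with a saddle at the origin and minima at (+-1, 0).
f = lambda z: (z[0]**2 - 1)**2 / 4 + z[1]**2 / 2
grad = lambda z: np.array([z[0]**3 - z[0], z[1]])
x_star = pgd(f, grad, x0=[0.0, 0.0])
```

Started at the saddle, the first perturbation lets GD escape (f drops a lot); a later perturbation at the minimum fails to decrease f, triggering the trap rule and returning the near-minimum point.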

Main theorem in [Jin et al. 2017]

Theorem (Main Theorem)
Assume the function f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant c_max such that, for any δ > 0, Δ_f ≥ f(x_0) − f*, any constant c ≤ c_max, and ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:

  O( (ℓ(f(x_0) − f*) / ε̃²) · log⁴( dℓΔ_f / (ε̃²δ) ) )

- If one could show that SGD has a similar property, that would be great!

- The convergence rate is almost optimal.


More general version: why it's fast

Theorem (A more general version)
Assume f is ℓ-smooth and ρ-Hessian Lipschitz. There exists an absolute constant c_max such that, for any δ > 0, Δ_f ≥ f(x_0) − f*, any constant c ≤ c_max, and ε̃ ≤ ℓ²/ρ, PGD(c) will output an ε̃-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:

  O( (ℓ(f(x_0) − f*) / ε̃²) · log⁴( dℓΔ_f / (ε̃²δ) ) )

Essentially saying the same thing. If f is not strict saddle, only an ε̃-second-order stationary point (instead of a local minimum) is guaranteed.


ε-stationary points

- ε-first-order stationary point:

  ‖∇f(x)‖ ≤ ε

- ε-second-order stationary point:

  ‖∇f(x)‖ ≤ ε, and λ_min(∇²f(x)) ≥ −√(ρε)

- If f is ℓ-smooth, then λ_min(∇²f(x)) ≥ −ℓ everywhere.

- So for any ε ≥ ℓ²/ρ, the Hessian condition is vacuous (−√(ρε) ≤ −ℓ), and every ε-first-order stationary point of an ℓ-smooth function is already an ε-second-order stationary point; the notion is interesting only for ε ≤ ℓ²/ρ.

- If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is (ζ-close to) a local minimum.
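The two conditions are straightforward to check numerically given the gradient and Hessian at a point (a minimal sketch):

```python
import numpy as np

def is_second_order_stationary(grad_x, hess_x, eps, rho):
    """Check: ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -sqrt(rho*eps)."""
    small_grad = np.linalg.norm(grad_x) <= eps
    lam_min = np.linalg.eigvalsh(hess_x).min()
    return bool(small_grad and lam_min >= -np.sqrt(rho * eps))

# The saddle of f(x, y) = x^2 - y^2 is first-order stationary but fails the
# second-order condition, since lambda_min = -2 < -sqrt(rho * eps) = -0.1.
saddle_ok = is_second_order_stationary(np.zeros(2), np.diag([2.0, -2.0]), eps=0.01, rho=1.0)
min_ok = is_second_order_stationary(np.zeros(2), np.eye(2), eps=0.01, rho=1.0)
```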

[Nesterov, 1998]

Theorem
Assume that f is ℓ-smooth. Then for any ε̃ > 0, if we run GD with step size η = 1/ℓ and termination condition ‖∇f(x)‖ ≤ ε̃, the output will be an ε̃-first-order stationary point, and the algorithm terminates in the following number of iterations:

  ℓ(f(x_0) − f*) / ε̃²

[Jin et al. 2017]: PGD converges to an ε̃-second-order stationary point in O( (ℓ(f(x_0) − f*) / ε̃²) · log⁴( dℓΔ_f / (ε̃²δ) ) ) steps.

- Matched up to log factors!


Why −√(ρε)?

- Use the third-order (cubic-regularized) approximation at x [Nesterov and Polyak, 2006]:

  min_y { ⟨∇f(x), y − x⟩ + (1/2)⟨∇²f(x)(y − x), y − x⟩ + (ρ/6)‖y − x‖³ },

  and denote the minimizer by T_x.

- Denote the distance r = ‖x − T_x‖

- Then ‖∇f(T_x)‖ ≤ ρr², and ∇²f(T_x) ⪰ −(3/2)ρr·I

- This gives a lower bound for r:

  r ≥ max{ √(‖∇f(T_x)‖/ρ), −(2/(3ρ))·λ_min(∇²f(T_x)) }

- Setting the two terms equal is what produces the threshold −√(ρε)

Related results

1. "Gradient Descent Converges to Minimizers" by Lee, Simchowitz, Jordan, and Recht, '15
   - With random initialization, GD almost surely never touches any saddle point, and always converges to local minima.

2. "The Power of Normalization: Faster Evasion of Saddle Points", Kfir Levy, '16
   - Normalized gradient descent can escape saddle points in O(d³ poly(1/ε)), slower than [Jin et al. 2017], faster than [Ge et al. 2015], but still polynomial in d.




Proof framework: Progress, Escape and Trap

I Progress: when ‖∇f (x)‖ > gthres, f (x) is decreased by atleast fthres/tthres.

I Escape: when ‖∇f (x)‖ ≤ gthres, and λmin∇2f (x) ≤ −γ, whpfunction value is decreased by fthres after perturbation+tthressteps.

I fthres/tthres on average each step.

I Trap:I The algorithm can’t do progress and escape forever, because

it’s bounded!I When it stops: perturbation happened tthres steps ago, but f

is decreased for less than fthresI That means, ‖∇f (x)‖ ≤ gthres before perturbation, and whp

there is no eigenvalue ≤ −γ.I So it’s a local minimum!

Progress

Lemma
If f is ℓ-smooth, then for GD with step size η < 1/ℓ, we have:

f(x_{t+1}) ≤ f(x_t) − (η/2)‖∇f(x_t)‖²

Proof.

f(x_{t+1}) ≤ f(x_t) + ∇f(x_t)ᵀ(x_{t+1} − x_t) + (ℓ/2)‖x_{t+1} − x_t‖²
          = f(x_t) − η‖∇f(x_t)‖² + (η²ℓ/2)‖∇f(x_t)‖²
          ≤ f(x_t) − (η/2)‖∇f(x_t)‖²

Escape: main idea


Escape: thin pancake

Main Lemma: measure the width

Lemma
Suppose we start with a point x̃ satisfying the following conditions:

‖∇f(x̃)‖ ≤ g_thres, λ_min(∇²f(x̃)) ≤ −γ

Let e_1 be the eigenvector of the minimum eigenvalue. Consider two gradient descent sequences {u_t}, {w_t}, with initial points u_0, w_0 satisfying:

‖u_0 − x̃‖ ≤ r, w_0 = u_0 + µr·e_1, µ ∈ [δ/(2√d), 1]

Then, for any step size η ≤ c_max/ℓ and any T ≥ t_thres, we have

min{f(u_T) − f(u_0), f(w_T) − f(w_0)} ≤ −2.5 f_thres

I As long as u_0 − w_0 lies along e_1 and ‖u_0 − w_0‖ ≥ δr/(2√d), at least one of them will escape!


Escape Case

Lemma (Escape case)
Suppose we start with a point x̃ satisfying the following conditions:

‖∇f(x̃)‖ ≤ g_thres, λ_min(∇²f(x̃)) ≤ −γ

Let x_0 = x̃ + ξ, where ξ is drawn from the uniform distribution over the ball of radius r, and let x_t be the iterates of GD from x_0. Then when η ≤ c_max/ℓ, with probability at least 1 − δ, for any T ≥ t_thres:

f(x_T) − f(x̃) ≤ −f_thres

Proof of the escape lemma

By smoothness, the perturbation step does not increase f by much:

f(x_0) − f(x̃) ≤ ∇f(x̃)ᵀξ + (ℓ/2)‖ξ‖² ≤ · · · ≤ 1.5 f_thres

By the main lemma, for any x_0 ∈ X_stuck, we know (x_0 ± µr·e_1) ∉ X_stuck, where µ ∈ [δ/(2√d), 1]. Hence X_stuck is a thin slab along e_1:

Vol(X_stuck) ≤ Vol(B_x̃^(d−1)(r)) × (δr/(2√d)) × 2

Therefore, the probability that we pick a point in X_stuck is bounded by

Vol(X_stuck) / Vol(B_x̃^(d)(r)) ≤ δ

Thus, with probability at least 1 − δ, x_0 ∉ X_stuck, and in this case, by the main lemma,

f(x_T) − f(x̃) ≤ −2.5 f_thres + 1.5 f_thres = −f_thres
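This mechanism can be sketched on the toy saddle f(x, y) = x² − y² (all constants are illustration-only, and the perturbation is drawn from a box rather than a ball for simplicity): GD started exactly at the saddle never moves, while a small random perturbation escapes and decreases f below −f_thres.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p):
    return p[0]**2 - p[1]**2      # saddle point at the origin

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])

def gd(p0, eta=0.1, steps=100):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - eta * grad(p)
    return p

# Started exactly at the saddle, the gradient is zero and GD never moves.
stuck = gd([0.0, 0.0])

# A uniform perturbation of radius r escapes: any nonzero y-component is
# multiplied by (1 + 2*eta) each step, so f drops below -f_thres.
r, f_thres = 1e-3, 1e-2
x0 = rng.uniform(-r, r, size=2)
escaped = gd(x0)

print(f(stuck), f(escaped) < -f_thres)
```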

How to prove the main Lemma?

I If u_T does not decrease the function value, then {u_0, · · · , u_T} are close to x̃.

I If {u_0, · · · , u_T} are close to x̃, GD on w_0 will decrease the function value.


We will need the following quadratic approximation:

f̃_y(x) = f(y) + ∇f(y)ᵀ(x − y) + (1/2)(x − y)ᵀH(x − y)

where H = ∇²f(x̃).
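As a sanity check on this approximation, the sketch below compares f̃_y with f on a toy function with Lipschitz Hessian; the gap shrinks cubically in ‖x − y‖. For simplicity the Hessian is taken at y itself (i.e. x̃ = y); the function and points are illustration-only:

```python
import numpy as np

# Quadratic approximation of f around y, with H evaluated at y (x~ = y here).
def f(x):
    return 0.25 * np.sum(x**4) + 0.5 * np.sum(x**2)

def grad(x):
    return x**3 + x

def hess(x):
    return np.diag(3 * x**2 + 1)

y = np.array([0.5, -0.3])
H = hess(y)

def f_tilde(x):
    d = x - y
    return f(y) + grad(y) @ d + 0.5 * d @ H @ d

errs = []
for scale in [1e-1, 1e-2, 1e-3]:
    d = scale * np.ones(2)
    errs.append(abs(f(y + d) - f_tilde(y + d)))

# Shrinking ||d|| by 10x shrinks the error by roughly 1000x (cubic remainder).
print(errs)
```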

Two lemmas (simplified)

Lemma (u_T-stuck)
There exists an absolute constant c_max s.t., for any initial point u_0 with ‖u_0 − x̃‖ ≤ r, define

T = min{ inf_t {t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres}, t_thres }

Then, for any η ≤ c_max/ℓ, we have ‖u_t − x̃‖ ≤ Φ for all t < T.

Lemma (w_T-escape)
There exists an absolute constant c_max s.t., define

T = min{ inf_t {t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres}, t_thres }

Then, for any η ≤ c_max/ℓ, if ‖u_t − x̃‖ ≤ Φ for t < T, we have T < t_thres.

Prove the main lemma

Assume x̃ is the origin. Define

T′ = inf_t {t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres}

Case T′ ≤ t_thres:
We know ‖u_{T′−1}‖ ≤ Φ by the u_T-stuck lemma. By a simple calculation, we can show that ‖u_{T′}‖ = O(Φ) as well. Then, since the Hessian is ρ-Lipschitz,

f(u_{T′}) − f(u_0)
≤ ∇f(u_0)ᵀ(u_{T′} − u_0) + (1/2)(u_{T′} − u_0)ᵀ∇²f(u_0)(u_{T′} − u_0) + (ρ/6)‖u_{T′} − u_0‖³
≤ f̃_{u_0}(u_{T′}) − f(u_0) + (ρ/2)‖u_0 − x̃‖‖u_{T′} − u_0‖² + (ρ/6)‖u_{T′} − u_0‖³
≤ −2.5 f_thres

Case T′ > t_thres:
By the u_T-stuck lemma, we know ‖u_t‖ ≤ Φ for all t ≤ t_thres. Using the w_T-escape lemma, we know

T″ = inf_t {t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres} ≤ t_thres

Then we may reduce this to the case T′ ≤ t_thres, because w and u are interchangeable.

Prove the u_T-stuck lemma

Lemma (u_T-stuck)
There exists an absolute constant c_max s.t., for any initial point u_0 with ‖u_0 − x̃‖ ≤ r, define

T = min{ inf_t {t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres}, t_thres }

Then, for any η ≤ c_max/ℓ, we have ‖u_t − x̃‖ ≤ Φ for all t < T.

I We won't move much along directions of large negative eigenvalues; otherwise that would already be a lot of progress!

Let B_t be the component of u_t in the remaining subspace, spanned by eigendirections with eigenvalue ≥ −γ/100. Then

‖B_{t+1}‖ ≤ (1 + ηγ/100)‖B_t‖ + 2η g_thres

If T ≤ t_thres, we will have (1 + ηγ/100)^T ≤ 3, so ‖B_T‖ is bounded.
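The arithmetic behind that last step can be checked by unrolling the recurrence (the constants below are illustration-only placeholders for η, γ, g_thres):

```python
# Unrolling ||B_{t+1}|| <= (1 + a) ||B_t|| + b with a = eta*gamma/100 and
# b = 2*eta*g_thres gives the closed form
#   ||B_T|| <= (1 + a)^T ||B_0|| + b * ((1 + a)^T - 1) / a,
# so whenever (1 + a)^T <= 3 we get ||B_T|| <= 3 ||B_0|| + 2 b / a.
eta, gamma, g_thres = 0.1, 0.5, 0.01
B0, T = 1.0, 100
a, b = eta * gamma / 100, 2 * eta * g_thres

B = B0
for _ in range(T):
    B = (1 + a) * B + b            # worst case of the recurrence

closed_form = (1 + a)**T * B0 + b * ((1 + a)**T - 1) / a
print(B, closed_form, (1 + a)**T)
```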

Prove the w_T-escape lemma

Lemma (w_T-escape)
There exists an absolute constant c_max s.t., define

T = min{ inf_t {t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres}, t_thres }

Then, for any η ≤ c_max/ℓ, if ‖u_t − x̃‖ ≤ Φ for t < T, we have T < t_thres.

I Let v_t = w_t − u_t.

I We want to show that for some T < t_thres, w_T makes progress.

I If w_t makes no progress, then by the u_T-stuck lemma it is still near x̃.

I Therefore, we would always have ‖v_t‖ ≤ ‖u_t‖ + ‖w_t‖ ≤ 2Φ.

However, v_t increases very rapidly; it can't stay small forever!

I In the e_1 direction, v_0 has magnitude at least δr/(2√d).

I Every step multiplies it by at least 1 + ηγ.

I Within T < t_thres steps, we get ‖v_T‖ > 2Φ, a contradiction, so w_T made progress!
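The geometric growth of the coupling distance can be seen exactly on a quadratic saddle (γ, ℓ, η, and the starting points below are illustration-only): along e_1, the difference of the two GD sequences is multiplied by precisely 1 + ηγ at every step.

```python
import numpy as np

gamma, ell, eta = 0.1, 1.0, 0.5        # eta <= 1/ell
H = np.diag([-gamma, ell])             # e_1 is the first coordinate axis

def gd_step(x):
    return x - eta * (H @ x)           # gradient of f(x) = 1/2 x^T H x

u = np.array([0.3, 0.2])
w = u + 1e-3 * np.array([1.0, 0.0])    # w_0 = u_0 offset along e_1

ratios = []
for _ in range(50):
    v1 = (w - u)[0]                    # coupling distance along e_1
    u, w = gd_step(u), gd_step(w)
    ratios.append((w - u)[0] / v1)

# Each step multiplies the e_1 component of v_t = w_t - u_t by 1 + eta*gamma.
print(min(ratios), max(ratios))
```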