Mini-Course 1: SGD Escapes Saddle Points

Yang Yuan
Computer Science Department, Cornell University

Gradient Descent (GD)

- Task: min_x f(x)

- GD does iterative updates: x_{t+1} = x_t − η_t ∇f(x_t)
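As a minimal sketch (my own illustration, not from the slides), the GD update rule on a simple quadratic looks like this; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=200):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2, so grad_f(x) = x; the unique minimum is at 0.
x_final = gradient_descent(lambda x: x, x0=[1.0, -2.0])
```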


Gradient Descent (GD) has at least two problems

- Computing the full gradient is slow for big data.

- It can get stuck at stationary points.


Stochastic Gradient Descent (SGD)

- Very similar to GD, but the gradient now has some randomness:

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).



Why do we use SGD?

Initially because:

- It is much cheaper to compute using mini-batches

- It can still converge to the global minimum in the convex case

But now people realize:

- It can escape saddle points! (Today's topic)

- It can escape shallow local minima (next time's topic; some progress)

- It can find local minima that generalize well (not well understood)

Therefore, it's not only faster, but also works better!


About the g_t that we use

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).

- In practice, g_t is obtained by sampling a mini-batch of size 128 or 256 from the dataset

- To simplify the analysis, we assume

  g_t = ∇f(x_t) + ξ_t

  where ξ_t ~ N(0, I), or ξ_t is uniform in the ball B_0(r)

- In general, the analysis works as long as ξ_t has non-negligible components in every direction.
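Under this noise model, one SGD run can be sketched as follows (illustrative only; the step size, noise scale, and iteration count are my own choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_f, x0, eta=0.05, steps=500, sigma=0.1):
    """SGD under the slides' noise model: g_t = grad f(x_t) + xi_t with
    Gaussian xi_t, so E[g_t] = grad f(x_t) (unbiased stochastic gradient)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x) + sigma * rng.standard_normal(x.shape)
        x = x - eta * g
    return x

# On the convex f(x) = ||x||^2 / 2, SGD hovers in a small noise ball around 0.
x_final = sgd(lambda x: x, x0=[2.0, 2.0])
```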


Preliminaries

- L-Lipschitz, i.e.,

  |f(w_1) − f(w_2)| ≤ L‖w_1 − w_2‖_2

- ℓ-smoothness: the gradient is ℓ-Lipschitz, i.e.,

  ‖∇f(w_1) − ∇f(w_2)‖_2 ≤ ℓ‖w_1 − w_2‖_2

- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e.,

  ‖∇²f(w_1) − ∇²f(w_2)‖_sp ≤ ρ‖w_1 − w_2‖_2

- We need this because we will use the Hessian at the current spot to approximate the neighborhood,

- and then bound the approximation error.
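A small numerical illustration of ℓ-smoothness (my own example, not from the slides): for the quadratic f(w) = ½ wᵀAw with symmetric A, the gradient is Aw, so the best gradient-Lipschitz constant is the spectral norm ‖A‖_sp, and the ratio ‖∇f(w_1) − ∇f(w_2)‖ / ‖w_1 − w_2‖ never exceeds it:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric; f(w) = 0.5 * w^T A w
grad = lambda w: A @ w

ell = np.linalg.norm(A, 2)  # spectral norm = largest eigenvalue here

# Sample random pairs and measure the gradient-difference ratio.
ratios = []
for _ in range(100):
    w1, w2 = rng.standard_normal(2), rng.standard_normal(2)
    ratios.append(np.linalg.norm(grad(w1) - grad(w2)) / np.linalg.norm(w1 - w2))
max_ratio = max(ratios)  # always <= ell, approaching it along the top eigenvector
```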


Saddle points and negative eigenvalues


Stationary points: saddle points, local minima, local maxima

For stationary points, ∇f(w) = 0:

- If ∇²f(w) ≻ 0, it's a local minimum.

- If ∇²f(w) ≺ 0, it's a local maximum.

- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.

- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be either a local minimum (maximum) or a saddle point.
  - f is "flat" in some directions
  - SGD behaves like a random walk
  - We only consider the non-degenerate case!
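This classification can be sketched directly from the Hessian's eigenvalues (an illustration, with a tolerance I chose to flag the degenerate case):

```python
import numpy as np

def classify_stationary(hessian, tol=1e-8):
    """Classify a stationary point (grad = 0) from its Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
    if eig.min() > tol:
        return "local minimum"        # Hessian positive definite
    if eig.max() < -tol:
        return "local maximum"        # Hessian negative definite
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"         # both signs present
    return "degenerate"               # some eigenvalue is (numerically) zero

# f(x, y) = x^2 - y^2 is stationary at the origin with Hessian diag(2, -2).
kind = classify_stationary(np.diag([2.0, -2.0]))
```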



Strict saddle property

f(w) is (α, γ, ε, ζ)-strict saddle if, for any w:

- ‖∇f(w)‖_2 ≥ ε,

- or λ_min(∇²f(w)) ≤ −γ < 0,

- or there exists w* such that ‖w − w*‖_2 ≤ ζ, and the region centered at w* with radius 2ζ is α-strongly convex.

Which means:

- the gradient is large,

- or (at a stationary point) we have a negative-eigenvalue direction to escape along,

- or (at a stationary point with no negative eigenvalues) we are already pretty close to a local minimum.
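The case analysis can be made concrete as below. This is my own simplified sketch: it takes the minimum Hessian eigenvalue and the distance to the nearest local minimum w* as given scalars, and it does not verify the α-strong-convexity of the 2ζ-ball, which the definition requires:

```python
import numpy as np

def strict_saddle_case(grad, hess_min_eig, dist_to_wstar, gamma, eps, zeta):
    """Report which clause of the (alpha, gamma, eps, zeta)-strict-saddle
    definition applies at a point (simplified: strong convexity not checked)."""
    if np.linalg.norm(grad) >= eps:
        return "large gradient"
    if hess_min_eig <= -gamma:
        return "escape direction"      # negative curvature to slide down
    if dist_to_wstar <= zeta:
        return "near a local minimum"
    return "no clause applies"

# At the saddle of f(x, y) = x^2 - y^2: zero gradient, lambda_min = -2.
case = strict_saddle_case(np.zeros(2), hess_min_eig=-2.0,
                          dist_to_wstar=np.inf, gamma=0.5, eps=0.01, zeta=0.1)
```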



Strict saddle functions are everywhere

- Orthogonal tensor decomposition [Ge et al. 2015]

- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]

- Matrix completion [Ge et al. 2016]

- Generalized phase retrieval [Sun et al. 2016]

- Low-rank matrix recovery [Bhojanapalli et al. 2016]

Moreover, in these problems, all local minima are equally good! That means:

- SGD escapes all saddle points,

- so SGD arrives at a local minimum → a global minimum!

- This is one popular way to prove that SGD solves the problem.


Main Results

- [Ge et al. 2015] says that, w.h.p., SGD will escape all saddle points and converge to a local minimum. The convergence time has polynomial dependence on the dimension d.

- [Jin et al. 2017] says that, w.h.p., PGD (a variant of SGD) will escape all saddle points and converge to a local minimum much faster. The dependence on d is logarithmic.

- Same proof framework. We'll mainly look at the newer result.



Description of PGD

Do the following iteratively:

- If ‖∇f(x_t)‖ ≤ g_thres, and the last perturbation was more than t_thres steps ago, add a random perturbation (uniform in a ball)

- If the perturbation happened t_thres steps ago, but f has decreased by less than f_thres, return the value from before the last perturbation

- Do a gradient descent step:

  x_{t+1} = x_t − η∇f(x_t)

A few remarks:

- Unfortunately, it is not a fast algorithm, because it uses full GD!

- η = c/ℓ; g_thres, t_thres, and f_thres depend on a constant c, as well as other parameters.
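The loop above can be sketched as follows. This is a minimal illustration of the three rules, not the paper's algorithm verbatim: the threshold values, perturbation radius, and step size here are arbitrary placeholders, whereas in [Jin et al. 2017] they are set from c, ℓ, ρ, and ε̃:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgd(f, grad_f, x0, eta=0.1, g_thres=1e-3, t_thres=20, f_thres=1e-4,
        r=1e-2, steps=200):
    """Perturbed gradient descent (sketch): plain GD steps, plus a random ball
    perturbation whenever the gradient is small, and early return when a
    perturbation failed to decrease f (suggesting a local minimum)."""
    x = np.asarray(x0, dtype=float)
    last_perturb, x_before, f_before = -t_thres - 1, None, None
    for t in range(steps):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - last_perturb > t_thres:
            x_before, f_before = x.copy(), f(x)   # remember pre-perturbation state
            u = rng.standard_normal(x.shape)
            x = x + r * u / np.linalg.norm(u)     # random point on a small sphere
            last_perturb = t
        if x_before is not None and t - last_perturb == t_thres \
                and f_before - f(x) < f_thres:
            return x_before                       # f barely decreased: stop here
        x = x - eta * grad_f(x)                   # gradient descent step
    return x

# f(x, y) = x^2 - y^2 has a saddle at the origin. Starting exactly there,
# plain GD would never move; the perturbation puts PGD on the escape direction
# (negative curvature in y), along which it then moves away from the saddle.
f = lambda x: x[0]**2 - x[1]**2
grad_f = lambda x: np.array([2 * x[0], -2 * x[1]])
x = pgd(f, grad_f, x0=[0.0, 0.0])
```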


Main theorem in [Jin et al. 2017]

Theorem (Main Theorem)
Assume f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant c_max such that, for any δ > 0, ∆f ≥ f(x_0) − f*, and constant c ≤ c_max, with ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:

  O( (ℓ(f(x_0) − f*) / ε̃²) · log⁴( dℓ∆f / (ε̃²δ) ) )

- If we could show SGD has a similar property, that would be great!

- The convergence rate is almost optimal.


Page 58:

More general version: why it's fast

Theorem (A more general version)

Assume f is ℓ-smooth and ρ-Hessian Lipschitz. There exists an absolute constant cmax such that, for any δ > 0, ∆f ≥ f (x0) − f ∗, constant c ≤ cmax, and ε̃ ≤ ℓ²/ρ, PGD(c) will output a point ζ-close to an ε̃-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:

O( ℓ(f (x0) − f ∗)/ε̃² · log⁴( dℓ∆f /(ε̃²δ) ) )

Essentially saying the same thing. If f is not strict saddle, only an ε̃-second-order stationary point (instead of a local minimum) is guaranteed.

Page 59:

ε-stationary points

I ε-first-order stationary point: ‖∇f (x)‖ ≤ ε

I ε-second-order stationary point: ‖∇f (x)‖ ≤ ε, λmin(∇²f (x)) ≥ −√(ρε)

I If f is ℓ-smooth, then λmin(∇²f (x)) ≥ −ℓ.

I For any ε > ℓ²/ρ, an ε-first-order stationary point of an ℓ-smooth function is an (ℓ²/ρ)-second-order stationary point.

I If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is a local minimum.


Page 64:

[Nesterov, 1998]

Theorem

Assume that f is ℓ-smooth. Then for any ε̃ > 0, if we run GD with step size η = 1/ℓ and termination condition ‖∇f (x)‖ ≤ ε̃, the output will be an ε̃-first-order stationary point, and the algorithm terminates within the following number of iterations:

ℓ(f (x0) − f ∗)/ε̃²

[Jin et al 2017]: PGD converges to an ε̃-second-order stationary point in O( ℓ(f (x0) − f ∗)/ε̃² · log⁴( dℓ∆f /(ε̃²δ) ) ) steps.

I Matched up to log factors!
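Nesterov's guarantee is easy to observe empirically: run GD with step size 1/ℓ and stop once the gradient norm drops below ε̃. The quadratic test function below is an arbitrary choice of ours.

```python
import numpy as np

def gd_first_order(grad, x0, ell, eps_tilde, max_iter=100_000):
    """GD with step size 1/ell until an eps-first-order stationary point."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_tilde:
            return x, t                  # t iterations sufficed
        x = x - g / ell
    return x, max_iter

# f(x) = 0.5 * (x1^2 + 10 x2^2) is ell-smooth with ell = 10.
grad = lambda x: np.array([1.0, 10.0]) * x
x, t = gd_first_order(grad, [1.0, 1.0], ell=10.0, eps_tilde=1e-6)
assert np.linalg.norm(grad(x)) <= 1e-6   # an eps-first-order stationary point
```

On this strongly convex example GD finishes far faster than the worst-case bound ℓ(f (x0) − f ∗)/ε̃²; the bound is tight only for hard instances.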

Page 65:

Why −√(ρε)?

I If we use the third-order approximation at x [Nesterov and Polyak, 2006]:

min_y { ⟨∇f (x), y − x⟩ + (1/2)⟨∇²f (x)(y − x), y − x⟩ + (ρ/6)‖y − x‖³ }

and denote the minimizer as Tx .

I Denote the distance r = ‖x − Tx‖.

I Then ‖∇f (Tx)‖ ≤ ρr², and ∇²f (Tx) ⪰ −(3/2)ρr I.

I This gives a lower bound for r :

r ≥ max{ √(‖∇f (Tx)‖/ρ), −(2/(3ρ))λmin(∇²f (Tx)) }

I Setting the two terms equal gives the threshold −√(ρε).
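The balance behind the threshold can be made explicit; a short calculation under the two bounds above (our rearrangement, matching the constants on the slide):

```latex
\sqrt{\frac{\|\nabla f(T_x)\|}{\rho}}
  \;=\; -\frac{2}{3\rho}\,\lambda_{\min}\!\left(\nabla^2 f(T_x)\right)
\quad\Longrightarrow\quad
\lambda_{\min}\!\left(\nabla^2 f(T_x)\right)
  \;=\; -\frac{3}{2}\sqrt{\rho\,\|\nabla f(T_x)\|}.
```

Writing ‖∇f (Tx)‖ = ε, the matching Hessian threshold is −(3/2)√(ρε); dropping the constant gives the −√(ρε) in the definition of an ε-second-order stationary point.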


Page 70:

Related results

1. "Gradient Descent Converges to Minimizers", by Lee, Simchowitz, Jordan and Recht, '15.

I With random initialization, GD almost surely never touches any saddle point, and always converges to a local minimum.

2. "The Power of Normalization: Faster Evasion of Saddle Points", Kfir Levy, '16.

I Normalized gradient descent can escape saddle points in O(d³ poly(1/ε)) iterations: slower than [Jin et al 2017], faster than [Ge et al 2015], but still polynomial in d.



Page 75:

Proof framework: Progress, Escape and Trap

I Progress: when ‖∇f (x)‖ > gthres, f (x) decreases by at least fthres/tthres per step.

I Escape: when ‖∇f (x)‖ ≤ gthres and λmin(∇²f (x)) ≤ −γ, whp the function value decreases by fthres after the perturbation + tthres steps.

I That is fthres/tthres per step on average.

I Trap:

I The algorithm can't make progress and escape forever, because f is bounded below!

I When it stops: the perturbation happened tthres steps ago, but f decreased by less than fthres.

I That means ‖∇f (x)‖ ≤ gthres before the perturbation, and whp there is no eigenvalue ≤ −γ.

I So it's a local minimum!


Page 83:

Progress

Lemma

If f is ℓ-smooth, then GD with step size η < 1/ℓ satisfies:

f (xt+1) ≤ f (xt) − (η/2)‖∇f (xt)‖²

Proof.

f (xt+1) ≤ f (xt) + ∇f (xt)ᵀ(xt+1 − xt) + (ℓ/2)‖xt+1 − xt‖²
        = f (xt) − η‖∇f (xt)‖² + (η²ℓ/2)‖∇f (xt)‖²
        ≤ f (xt) − (η/2)‖∇f (xt)‖²
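The descent lemma can be sanity-checked numerically; below on a random convex quadratic (dimension and seed are arbitrary choices of ours), with ℓ taken as the largest Hessian eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A @ A.T                          # PSD Hessian, so f(x) = 0.5 x^T H x is convex
ell = np.linalg.eigvalsh(H).max()    # smoothness constant of f

f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

eta = 0.5 / ell                      # any eta < 1/ell works
x = rng.normal(size=5)
for _ in range(20):
    g = grad(x)
    x_next = x - eta * g
    # descent lemma: f(x_{t+1}) <= f(x_t) - (eta / 2) * ||grad f(x_t)||^2
    assert f(x_next) <= f(x) - 0.5 * eta * (g @ g) + 1e-12
    x = x_next
```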

Page 84:

Escape: main idea

Page 86:

Escape: thin pancake

Page 87:

Main Lemma: measure the width

Lemma

Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let e1 be the eigenvector of the minimum eigenvalue. Consider two gradient descent sequences {ut}, {wt} with initial points u0, w0 satisfying:

‖u0 − x̃‖ ≤ r, w0 = u0 + µr e1, µ ∈ [δ/(2√d), 1]

Then, for any step size η ≤ cmax/ℓ and any T ≥ tthres, we have

min{f (uT ) − f (u0), f (wT ) − f (w0)} ≤ −2.5 fthres

I As long as u0 − w0 lies along e1 and ‖u0 − w0‖ ≥ δr/(2√d), at least one of them will escape!
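On a pure quadratic saddle the lemma's mechanism is visible directly: two coupled GD runs that differ only along e1 separate exponentially, so at least one must leave the neighborhood and decrease f. The quadratic and the constants below are our toy choices.

```python
import numpy as np

gamma = 0.5
H = np.diag([1.0, -gamma])           # saddle: lambda_min = -gamma
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

eta = 0.1
e1 = np.array([0.0, 1.0])            # eigenvector of the minimum eigenvalue
u = np.array([0.3, 1e-6])            # starts with almost no e1 component
w = u + 1e-3 * e1                    # coupled start, shifted along e1

for _ in range(200):
    u = u - eta * grad(u)
    w = w - eta * grad(w)

# At least one of the two sequences escapes and decreases f substantially.
assert min(f(u), f(w)) < -1.0        # here it is w, the one shifted along e1
```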


Page 89:

Escape Case

Lemma (Escape case)

Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let x0 = x̃ + ξ, where ξ comes from the uniform distribution over the ball with radius r, and let xt be the iterates of GD from x0. Then, when η < cmax/ℓ, with probability at least 1 − δ, for any T ≥ tthres:

f (xT ) − f (x̃) ≤ −fthres


Page 93:

Proof of the escape lemma

By smoothness, the perturbation step does not increase f much:

f (x0) − f (x̃) ≤ ∇f (x̃)ᵀξ + (ℓ/2)‖ξ‖² ≤ · · · ≤ 1.5 fthres

By the main lemma, for any x0 ∈ Xstuck, we know (x0 ± µr e1) ∉ Xstuck, where µ ∈ [δ/(2√d), 1]. So Xstuck has width at most 2 · δr/(2√d) along e1:

Vol(Xstuck) ≤ Vol(B(d−1)x̃ (r)) × δr/(2√d) × 2

Therefore, the probability that we pick a point in Xstuck is bounded by

Vol(Xstuck) / Vol(B(d)x̃ (r)) ≤ δ

Page 94:

Proof of the escape lemma

Thus, with probability at least 1 − δ, x0 ∉ Xstuck, and in this case, by the main lemma:

f (xT ) − f (x̃) ≤ −2.5 fthres + 1.5 fthres = −fthres


Page 98:

How to prove the main Lemma?

I If uT does not decrease the function value, then {u0, · · · , uT} stay close to x̃ .

I If {u0, · · · , uT} are close to x̃ , then GD from w0 will decrease the function value.

We will need the following quadratic approximation:

f̃y (x) = f (y) + ∇f (y)ᵀ(x − y) + (1/2)(x − y)ᵀH(x − y)

where H = ∇²f (x̃).


Page 100:

Two lemmas (simplified)

Lemma (uT -stuck)

There exists an absolute constant cmax s.t., for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

Lemma (wT -escape)

There exists an absolute constant cmax s.t., define

T = min{ inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.


Page 104:

Prove the main lemma

Assume x̃ is the origin. Define

T′ = inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }

Case T′ ≤ tthres:

We know ‖uT′−1‖ ≤ Φ by the uT -stuck lemma. By a simple calculation, we can show that ‖uT′‖ = O(Φ) as well. Then:

f (uT′) − f (u0)
≤ ∇f (u0)ᵀ(uT′ − u0) + (1/2)(uT′ − u0)ᵀ∇²f (u0)(uT′ − u0) + (ρ/6)‖uT′ − u0‖³
≤ f̃u0(uT′) − f (u0) + (ρ/2)‖u0 − x̃‖‖uT′ − u0‖² + (ρ/6)‖uT′ − u0‖³
≤ −2.5 fthres


Page 107:

Prove the main lemma

Case T′ > tthres: By the uT -stuck lemma, we know ‖ut‖ ≤ Φ for all t ≤ tthres. Using the wT -escape lemma, we get

T′′ = inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres } ≤ tthres

Then we may reduce this to the case T′ ≤ tthres, because w and u are interchangeable.


Page 112:

Prove the uT -stuck-lemma

Lemma (uT -stuck)

There exists an absolute constant cmax s.t., for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

I We won't move much in the directions with large negative eigenvalues; otherwise that would already be a lot of progress!

Let Bt be the component of ut in the remaining subspace, where the eigenvalues are ≥ −γ/100. Then

‖Bt+1‖ ≤ (1 + ηγ/100)‖Bt‖ + 2η gthres

If T ≤ tthres, we will have (1 + ηγ/100)^T ≤ 3, so ‖BT‖ is bounded.

Page 113:

Prove the wT -escape-lemma

Lemma (wT -escape)

There exists an absolute constant cmax s.t., define

T = min{ inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.


Page 118: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 119: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 120: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 121: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!