Mini-Course 1: SGD Escapes Saddle Points

Yang Yuan
Computer Science Department, Cornell University

Gradient Descent (GD)

- Task: min_x f(x)

- GD does iterative updates: x_{t+1} = x_t − η_t ∇f(x_t)
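As a minimal sketch (my own illustration, not from the slides), the GD update rule on a simple quadratic looks like this; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=200):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2, so grad_f(x) = x; the unique minimum is at 0.
x_final = gradient_descent(lambda x: x, x0=[1.0, -2.0])
```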


Gradient Descent (GD) has at least two problems

- Computing the full gradient is slow for big data.

- It can get stuck at stationary points.


Stochastic Gradient Descent (SGD)

- Very similar to GD, but the gradient now has some randomness:

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).



Why do we use SGD?

Initially because:

- It is much cheaper to compute using mini-batches

- It can still converge to the global minimum in the convex case

But now people realize:

- It can escape saddle points! (Today's topic)

- It can escape shallow local minima (next time's topic; some progress)

- It can find local minima that generalize well (not well understood)

Therefore, it's not only faster, but also works better!


About the g_t that we use

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).

- In practice, g_t is obtained by sampling a mini-batch of size 128 or 256 from the dataset

- To simplify the analysis, we assume

  g_t = ∇f(x_t) + ξ_t

  where ξ_t ~ N(0, I), or ξ_t is uniform in the ball B_0(r)

- In general, the analysis works as long as ξ_t has non-negligible components in every direction.
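Under this noise model, one SGD run can be sketched as follows (illustrative only; the step size, noise scale, and iteration count are my own choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_f, x0, eta=0.05, steps=500, sigma=0.1):
    """SGD under the slides' noise model: g_t = grad f(x_t) + xi_t with
    Gaussian xi_t, so E[g_t] = grad f(x_t) (unbiased stochastic gradient)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x) + sigma * rng.standard_normal(x.shape)
        x = x - eta * g
    return x

# On the convex f(x) = ||x||^2 / 2, SGD hovers in a small noise ball around 0.
x_final = sgd(lambda x: x, x0=[2.0, 2.0])
```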


Preliminaries

- L-Lipschitz, i.e.,

  |f(w_1) − f(w_2)| ≤ L‖w_1 − w_2‖_2

- ℓ-smoothness: the gradient is ℓ-Lipschitz, i.e.,

  ‖∇f(w_1) − ∇f(w_2)‖_2 ≤ ℓ‖w_1 − w_2‖_2

- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e.,

  ‖∇²f(w_1) − ∇²f(w_2)‖_sp ≤ ρ‖w_1 − w_2‖_2

- We need this because we will use the Hessian at the current spot to approximate the neighborhood,

- and then bound the approximation error.
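A small numerical illustration of ℓ-smoothness (my own example, not from the slides): for the quadratic f(w) = ½ wᵀAw with symmetric A, the gradient is Aw, so the best gradient-Lipschitz constant is the spectral norm ‖A‖_sp, and the ratio ‖∇f(w_1) − ∇f(w_2)‖ / ‖w_1 − w_2‖ never exceeds it:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric; f(w) = 0.5 * w^T A w
grad = lambda w: A @ w

ell = np.linalg.norm(A, 2)  # spectral norm = largest eigenvalue here

# Sample random pairs and measure the gradient-difference ratio.
ratios = []
for _ in range(100):
    w1, w2 = rng.standard_normal(2), rng.standard_normal(2)
    ratios.append(np.linalg.norm(grad(w1) - grad(w2)) / np.linalg.norm(w1 - w2))
max_ratio = max(ratios)  # always <= ell, approaching it along the top eigenvector
```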


Saddle points and negative eigenvalues


Stationary points: saddle points, local minima, local maxima

For stationary points, ∇f(w) = 0:

- If ∇²f(w) ≻ 0, it's a local minimum.

- If ∇²f(w) ≺ 0, it's a local maximum.

- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.

- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be either a local minimum (maximum) or a saddle point.
  - f is "flat" in some directions
  - SGD behaves like a random walk
  - We only consider the non-degenerate case!
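This classification can be sketched directly from the Hessian's eigenvalues (an illustration, with a tolerance I chose to flag the degenerate case):

```python
import numpy as np

def classify_stationary(hessian, tol=1e-8):
    """Classify a stationary point (grad = 0) from its Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
    if eig.min() > tol:
        return "local minimum"        # Hessian positive definite
    if eig.max() < -tol:
        return "local maximum"        # Hessian negative definite
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"         # both signs present
    return "degenerate"               # some eigenvalue is (numerically) zero

# f(x, y) = x^2 - y^2 is stationary at the origin with Hessian diag(2, -2).
kind = classify_stationary(np.diag([2.0, -2.0]))
```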



Strict saddle property

f(w) is (α, γ, ε, ζ)-strict saddle if, for any w:

- ‖∇f(w)‖_2 ≥ ε,

- or λ_min(∇²f(w)) ≤ −γ < 0,

- or there exists w* such that ‖w − w*‖_2 ≤ ζ, and the region centered at w* with radius 2ζ is α-strongly convex.

Which means:

- the gradient is large,

- or (at a stationary point) we have a negative-eigenvalue direction to escape along,

- or (at a stationary point with no negative eigenvalues) we are already pretty close to a local minimum.
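The case analysis can be made concrete as below. This is my own simplified sketch: it takes the minimum Hessian eigenvalue and the distance to the nearest local minimum w* as given scalars, and it does not verify the α-strong-convexity of the 2ζ-ball, which the definition requires:

```python
import numpy as np

def strict_saddle_case(grad, hess_min_eig, dist_to_wstar, gamma, eps, zeta):
    """Report which clause of the (alpha, gamma, eps, zeta)-strict-saddle
    definition applies at a point (simplified: strong convexity not checked)."""
    if np.linalg.norm(grad) >= eps:
        return "large gradient"
    if hess_min_eig <= -gamma:
        return "escape direction"      # negative curvature to slide down
    if dist_to_wstar <= zeta:
        return "near a local minimum"
    return "no clause applies"

# At the saddle of f(x, y) = x^2 - y^2: zero gradient, lambda_min = -2.
case = strict_saddle_case(np.zeros(2), hess_min_eig=-2.0,
                          dist_to_wstar=np.inf, gamma=0.5, eps=0.01, zeta=0.1)
```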



Strict saddle functions are everywhere

- Orthogonal tensor decomposition [Ge et al. 2015]

- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]

- Matrix completion [Ge et al. 2016]

- Generalized phase retrieval [Sun et al. 2016]

- Low-rank matrix recovery [Bhojanapalli et al. 2016]

Moreover, in these problems, all local minima are equally good! That means:

- SGD escapes all saddle points,

- so SGD arrives at a local minimum → a global minimum!

- This is one popular way to prove that SGD solves the problem.


Main Results

- [Ge et al. 2015] says that, w.h.p., SGD will escape all saddle points and converge to a local minimum. The convergence time has polynomial dependence on the dimension d.

- [Jin et al. 2017] says that, w.h.p., PGD (a variant of SGD) will escape all saddle points and converge to a local minimum much faster. The dependence on d is logarithmic.

- Same proof framework. We'll mainly look at the newer result.



Description of PGD

Do the following iteratively:

- If ‖∇f(x_t)‖ ≤ g_thres, and the last perturbation was more than t_thres steps ago, add a random perturbation (uniform in a ball)

- If the perturbation happened t_thres steps ago, but f has decreased by less than f_thres, return the value from before the last perturbation

- Do a gradient descent step:

  x_{t+1} = x_t − η∇f(x_t)

A few remarks:

- Unfortunately, it is not a fast algorithm, because it uses full GD!

- η = c/ℓ; g_thres, t_thres, and f_thres depend on a constant c, as well as other parameters.
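The loop above can be sketched as follows. This is a minimal illustration of the three rules, not the paper's algorithm verbatim: the threshold values, perturbation radius, and step size here are arbitrary placeholders, whereas in [Jin et al. 2017] they are set from c, ℓ, ρ, and ε̃:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgd(f, grad_f, x0, eta=0.1, g_thres=1e-3, t_thres=20, f_thres=1e-4,
        r=1e-2, steps=200):
    """Perturbed gradient descent (sketch): plain GD steps, plus a random ball
    perturbation whenever the gradient is small, and early return when a
    perturbation failed to decrease f (suggesting a local minimum)."""
    x = np.asarray(x0, dtype=float)
    last_perturb, x_before, f_before = -t_thres - 1, None, None
    for t in range(steps):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - last_perturb > t_thres:
            x_before, f_before = x.copy(), f(x)   # remember pre-perturbation state
            u = rng.standard_normal(x.shape)
            x = x + r * u / np.linalg.norm(u)     # random point on a small sphere
            last_perturb = t
        if x_before is not None and t - last_perturb == t_thres \
                and f_before - f(x) < f_thres:
            return x_before                       # f barely decreased: stop here
        x = x - eta * grad_f(x)                   # gradient descent step
    return x

# f(x, y) = x^2 - y^2 has a saddle at the origin. Starting exactly there,
# plain GD would never move; the perturbation puts PGD on the escape direction
# (negative curvature in y), along which it then moves away from the saddle.
f = lambda x: x[0]**2 - x[1]**2
grad_f = lambda x: np.array([2 * x[0], -2 * x[1]])
x = pgd(f, grad_f, x0=[0.0, 0.0])
```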


Main theorem in [Jin et al. 2017]

Theorem (Main Theorem)
Assume f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant c_max such that, for any δ > 0, ∆f ≥ f(x_0) − f*, and constant c ≤ c_max, with ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:

  O( (ℓ(f(x_0) − f*) / ε̃²) · log⁴( dℓ∆f / (ε̃²δ) ) )

- If we could show SGD has a similar property, that would be great!

- The convergence rate is almost optimal.


Page 58:

More general version: why it's fast

Theorem (A more general version)

Assume f is ℓ-smooth and ρ-Hessian Lipschitz. There exists an absolute constant cmax such that, for any δ > 0, ∆f ≥ f (x0) − f ∗, constant c ≤ cmax, and ε̃ ≤ ℓ²/ρ, PGD(c) will output a point ζ-close to an ε̃-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:

O( ℓ(f (x0) − f ∗)/ε̃² · log⁴( dℓ∆f /(ε̃²δ) ) )

Essentially saying the same thing. If f is not strict saddle, only an ε̃-second-order stationary point (instead of a local minimum) is guaranteed.

Page 59:

ε-stationary points

I ε-first-order stationary point: ‖∇f (x)‖ ≤ ε

I ε-second-order stationary point: ‖∇f (x)‖ ≤ ε, λmin(∇²f (x)) ≥ −√(ρε)

I If f is ℓ-smooth, then λmin(∇²f (x)) ≥ −ℓ.

I For any ε > ℓ²/ρ, an ε-first-order stationary point of an ℓ-smooth function is an (ℓ²/ρ)-second-order stationary point.

I If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is a local minimum.


Page 64:

[Nesterov, 1998]

Theorem

Assume that f is ℓ-smooth. Then for any ε̃ > 0, if we run GD with step size η = 1/ℓ and termination condition ‖∇f (x)‖ ≤ ε̃, the output will be an ε̃-first-order stationary point, and the algorithm terminates within the following number of iterations:

ℓ(f (x0) − f ∗)/ε̃²

[Jin et al 2017]: PGD converges to an ε̃-second-order stationary point in O( ℓ(f (x0) − f ∗)/ε̃² · log⁴( dℓ∆f /(ε̃²δ) ) ) steps.

I Matched up to log factors!
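Nesterov's guarantee is easy to observe empirically: run GD with step size 1/ℓ and stop once the gradient norm drops below ε̃. The quadratic test function below is an arbitrary choice of ours.

```python
import numpy as np

def gd_first_order(grad, x0, ell, eps_tilde, max_iter=100_000):
    """GD with step size 1/ell until an eps-first-order stationary point."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_tilde:
            return x, t                  # t iterations sufficed
        x = x - g / ell
    return x, max_iter

# f(x) = 0.5 * (x1^2 + 10 x2^2) is ell-smooth with ell = 10.
grad = lambda x: np.array([1.0, 10.0]) * x
x, t = gd_first_order(grad, [1.0, 1.0], ell=10.0, eps_tilde=1e-6)
assert np.linalg.norm(grad(x)) <= 1e-6   # an eps-first-order stationary point
```

On this strongly convex example GD finishes far faster than the worst-case bound ℓ(f (x0) − f ∗)/ε̃²; the bound is tight only for hard instances.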

Page 65:

Why −√(ρε)?

I If we use the third-order approximation at x [Nesterov and Polyak, 2006]:

min_y { ⟨∇f (x), y − x⟩ + (1/2)⟨∇²f (x)(y − x), y − x⟩ + (ρ/6)‖y − x‖³ }

and denote the minimizer as Tx .

I Denote the distance r = ‖x − Tx‖.

I Then ‖∇f (Tx)‖ ≤ ρr², and ∇²f (Tx) ⪰ −(3/2)ρr I.

I This gives a lower bound for r :

r ≥ max{ √(‖∇f (Tx)‖/ρ), −(2/(3ρ))λmin(∇²f (Tx)) }

I Setting the two terms equal gives the threshold −√(ρε).
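The balance behind the threshold can be made explicit; a short calculation under the two bounds above (our rearrangement, matching the constants on the slide):

```latex
\sqrt{\frac{\|\nabla f(T_x)\|}{\rho}}
  \;=\; -\frac{2}{3\rho}\,\lambda_{\min}\!\left(\nabla^2 f(T_x)\right)
\quad\Longrightarrow\quad
\lambda_{\min}\!\left(\nabla^2 f(T_x)\right)
  \;=\; -\frac{3}{2}\sqrt{\rho\,\|\nabla f(T_x)\|}.
```

Writing ‖∇f (Tx)‖ = ε, the matching Hessian threshold is −(3/2)√(ρε); dropping the constant gives the −√(ρε) in the definition of an ε-second-order stationary point.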


Page 70:

Related results

1. "Gradient Descent Converges to Minimizers", by Lee, Simchowitz, Jordan and Recht, '15.

I With random initialization, GD almost surely never touches any saddle point, and always converges to a local minimum.

2. "The Power of Normalization: Faster Evasion of Saddle Points", Kfir Levy, '16.

I Normalized gradient descent can escape saddle points in O(d³ poly(1/ε)) iterations: slower than [Jin et al 2017], faster than [Ge et al 2015], but still polynomial in d.



Page 75:

Proof framework: Progress, Escape and Trap

I Progress: when ‖∇f (x)‖ > gthres, f (x) decreases by at least fthres/tthres per step.

I Escape: when ‖∇f (x)‖ ≤ gthres and λmin(∇²f (x)) ≤ −γ, whp the function value decreases by fthres after the perturbation + tthres steps.

I That is fthres/tthres per step on average.

I Trap:

I The algorithm can't make progress and escape forever, because f is bounded below!

I When it stops: the perturbation happened tthres steps ago, but f decreased by less than fthres.

I That means ‖∇f (x)‖ ≤ gthres before the perturbation, and whp there is no eigenvalue ≤ −γ.

I So it's a local minimum!


Page 83:

Progress

Lemma

If f is ℓ-smooth, then GD with step size η < 1/ℓ satisfies:

f (xt+1) ≤ f (xt) − (η/2)‖∇f (xt)‖²

Proof.

f (xt+1) ≤ f (xt) + ∇f (xt)ᵀ(xt+1 − xt) + (ℓ/2)‖xt+1 − xt‖²
        = f (xt) − η‖∇f (xt)‖² + (η²ℓ/2)‖∇f (xt)‖²
        ≤ f (xt) − (η/2)‖∇f (xt)‖²
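The descent lemma can be sanity-checked numerically; below on a random convex quadratic (dimension and seed are arbitrary choices of ours), with ℓ taken as the largest Hessian eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A @ A.T                          # PSD Hessian, so f(x) = 0.5 x^T H x is convex
ell = np.linalg.eigvalsh(H).max()    # smoothness constant of f

f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

eta = 0.5 / ell                      # any eta < 1/ell works
x = rng.normal(size=5)
for _ in range(20):
    g = grad(x)
    x_next = x - eta * g
    # descent lemma: f(x_{t+1}) <= f(x_t) - (eta / 2) * ||grad f(x_t)||^2
    assert f(x_next) <= f(x) - 0.5 * eta * (g @ g) + 1e-12
    x = x_next
```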

Page 84:

Escape: main idea

Page 86:

Escape: thin pancake

Page 87:

Main Lemma: measure the width

Lemma

Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let e1 be the eigenvector of the minimum eigenvalue. Consider two gradient descent sequences {ut}, {wt} with initial points u0, w0 satisfying:

‖u0 − x̃‖ ≤ r, w0 = u0 + µr e1, µ ∈ [δ/(2√d), 1]

Then, for any step size η ≤ cmax/ℓ and any T ≥ tthres, we have

min{f (uT ) − f (u0), f (wT ) − f (w0)} ≤ −2.5 fthres

I As long as u0 − w0 lies along e1 and ‖u0 − w0‖ ≥ δr/(2√d), at least one of them will escape!
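On a pure quadratic saddle the lemma's mechanism is visible directly: two coupled GD runs that differ only along e1 separate exponentially, so at least one must leave the neighborhood and decrease f. The quadratic and the constants below are our toy choices.

```python
import numpy as np

gamma = 0.5
H = np.diag([1.0, -gamma])           # saddle: lambda_min = -gamma
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

eta = 0.1
e1 = np.array([0.0, 1.0])            # eigenvector of the minimum eigenvalue
u = np.array([0.3, 1e-6])            # starts with almost no e1 component
w = u + 1e-3 * e1                    # coupled start, shifted along e1

for _ in range(200):
    u = u - eta * grad(u)
    w = w - eta * grad(w)

# At least one of the two sequences escapes and decreases f substantially.
assert min(f(u), f(w)) < -1.0        # here it is w, the one shifted along e1
```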


Page 89:

Escape Case

Lemma (Escape case)

Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let x0 = x̃ + ξ, where ξ comes from the uniform distribution over the ball with radius r, and let xt be the iterates of GD from x0. Then, when η < cmax/ℓ, with probability at least 1 − δ, for any T ≥ tthres:

f (xT ) − f (x̃) ≤ −fthres


Page 93:

Proof of the escape lemma

By smoothness, the perturbation step does not increase f much:

f (x0) − f (x̃) ≤ ∇f (x̃)ᵀξ + (ℓ/2)‖ξ‖² ≤ · · · ≤ 1.5 fthres

By the main lemma, for any x0 ∈ Xstuck, we know (x0 ± µr e1) ∉ Xstuck, where µ ∈ [δ/(2√d), 1]. So Xstuck has width at most 2 · δr/(2√d) along e1:

Vol(Xstuck) ≤ Vol(B(d−1)x̃ (r)) × δr/(2√d) × 2

Therefore, the probability that we pick a point in Xstuck is bounded by

Vol(Xstuck) / Vol(B(d)x̃ (r)) ≤ δ

Page 94:

Proof of the escape lemma

Thus, with probability at least 1 − δ, x0 ∉ Xstuck, and in this case, by the main lemma:

f (xT ) − f (x̃) ≤ −2.5 fthres + 1.5 fthres = −fthres


Page 98:

How to prove the main Lemma?

I If uT does not decrease the function value, then {u0, · · · , uT} stay close to x̃ .

I If {u0, · · · , uT} are close to x̃ , then GD from w0 will decrease the function value.

We will need the following quadratic approximation:

f̃y (x) = f (y) + ∇f (y)ᵀ(x − y) + (1/2)(x − y)ᵀH(x − y)

where H = ∇²f (x̃).


Page 100:

Two lemmas (simplified)

Lemma (uT -stuck)

There exists an absolute constant cmax s.t., for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

Lemma (wT -escape)

There exists an absolute constant cmax s.t., define

T = min{ inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.


Page 104:

Prove the main lemma

Assume x̃ is the origin. Define

T′ = inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }

Case T′ ≤ tthres:

We know ‖uT′−1‖ ≤ Φ by the uT -stuck lemma. By a simple calculation, we can show that ‖uT′‖ = O(Φ) as well. Then:

f (uT′) − f (u0)
≤ ∇f (u0)ᵀ(uT′ − u0) + (1/2)(uT′ − u0)ᵀ∇²f (u0)(uT′ − u0) + (ρ/6)‖uT′ − u0‖³
≤ f̃u0(uT′) − f (u0) + (ρ/2)‖u0 − x̃‖‖uT′ − u0‖² + (ρ/6)‖uT′ − u0‖³
≤ −2.5 fthres


Page 107:

Prove the main lemma

Case T′ > tthres: By the uT -stuck lemma, we know ‖ut‖ ≤ Φ for all t ≤ tthres. Using the wT -escape lemma, we get

T′′ = inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres } ≤ tthres

Then we may reduce this to the case T′ ≤ tthres, because w and u are interchangeable.


Page 112:

Prove the uT -stuck-lemma

Lemma (uT -stuck)

There exists an absolute constant cmax s.t., for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f (u0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

I We won't move much in the directions with large negative eigenvalues; otherwise that would already be a lot of progress!

Let Bt be the component of ut in the remaining subspace, where the eigenvalues are ≥ −γ/100. Then

‖Bt+1‖ ≤ (1 + ηγ/100)‖Bt‖ + 2η gthres

If T ≤ tthres, we will have (1 + ηγ/100)^T ≤ 3, so ‖BT‖ is bounded.

Page 113:

Prove the wT -escape-lemma

Lemma (wT -escape)

There exists an absolute constant cmax s.t., define

T = min{ inf_t { t | f̃w0(wt) − f (w0) ≤ −3 fthres }, tthres }

Then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.


Page 118: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 119: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 120: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!

Page 121: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can

Prove the wT -escape-lemma

I let vt = wt − utI We want to say for T < tthres, wT made progress.

I If wt makes no progress, by uT -stuck-lemma, it’s still near x̃ .

I Therefore, we always have ‖vt‖ ≤ ‖ut‖+ ‖wt‖ ≤ 2Φ.

However, vt is increasing very rapidly. It can’t be always small!

I At e1 direction v0 has at least δr2√d

I Every time it multiplies by at least 1 + ηγ.

I In T < tthres, we get vT > 2Φ, so wT made progress!