Mini-Course 1: SGD Escapes Saddle Points

Yang Yuan

Computer Science Department, Cornell University
Gradient Descent (GD)

- Task: min_x f(x)
- GD does iterative updates x_{t+1} = x_t − η_t ∇f(x_t)
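The update rule above can be sketched in a few lines. This is a minimal illustration, not any specific implementation from the course; the quadratic objective f(x) = ‖x‖²/2 is an assumed example chosen so the minimum is known.

```python
import numpy as np

def gd(grad_f, x0, eta=0.1, steps=200):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2, so grad_f(x) = x; the minimum is the origin,
# and each step contracts x by a factor (1 - eta).
x_final = gd(lambda x: x, x0=[3.0, -2.0])
```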
Gradient Descent (GD) has at least two problems

- Computing the full gradient is slow for big data.
- It can get stuck at stationary points.
Stochastic Gradient Descent (SGD)

- Very similar to GD, but the gradient now has some randomness:

  x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).
Why do we use SGD?

Initially because:

- Much cheaper to compute using mini-batches
- Can still converge to the global minimum in the convex case

But now people realize:

- Can escape saddle points! (Today's topic)
- Can escape shallow local minima (Next time's topic, some progress.)
- Can find local minima that generalize well (Not well understood)

Therefore, it's not only faster, but also works better!
About the g_t that we use

x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).

- In practice, g_t is obtained by sampling a mini-batch of size 128 or 256 from the dataset
- To simplify the analysis, we assume

  g_t = ∇f(x_t) + ξ_t,

  where ξ_t ∼ N(0, I) or ξ_t is uniform in the ball B_0(r)
- In general, the analysis works as long as ξ_t has non-negligible components in every direction.
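The simplified noise model can be sketched directly: add Gaussian noise to the true gradient so that the stochastic gradient stays unbiased. This is an illustrative toy (the step size, noise scale, and quadratic objective are assumptions, not values from the slides).

```python
import numpy as np

def sgd(grad_f, x0, eta=0.05, steps=500, sigma=0.1, seed=0):
    """SGD under the simplified noise model g_t = grad f(x_t) + xi_t,
    with xi_t ~ N(0, sigma^2 I), so E[g_t] = grad f(x_t)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        xi = sigma * rng.standard_normal(x.shape)  # isotropic noise
        g = grad_f(x) + xi                         # unbiased stochastic gradient
        x = x - eta * g
    return x

# On f(x) = ||x||^2 / 2 the iterates hover near the minimum, with a
# fluctuation radius controlled by eta and sigma.
x_final = sgd(lambda x: x, x0=[3.0, -2.0])
```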
Preliminaries

- L-Lipschitz, i.e.,

  |f(w1) − f(w2)| ≤ L‖w1 − w2‖2

- ℓ-Smoothness: the gradient is ℓ-Lipschitz, i.e.,

  ‖∇f(w1) − ∇f(w2)‖2 ≤ ℓ‖w1 − w2‖2

- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e.,

  ‖∇²f(w1) − ∇²f(w2)‖sp ≤ ρ‖w1 − w2‖2

  - We need this because we will use the Hessian at the current spot to approximate the neighborhood,
  - and then bound the approximation error.
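As a concrete sanity check of the ℓ-smoothness definition (my own example, not from the slides): for a quadratic f(w) = ½ wᵀAw with symmetric A, the gradient is Aw, so the smoothness constant ℓ is exactly the spectral norm of A, and the Lipschitz inequality holds for any pair of points.

```python
import numpy as np

# For f(w) = 0.5 * w^T A w (A symmetric), grad f(w) = A w, hence
# ||grad f(w1) - grad f(w2)|| = ||A (w1 - w2)|| <= ||A||_sp * ||w1 - w2||,
# i.e. ell equals the spectral norm (largest |eigenvalue|) of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                      # symmetrize
ell = np.linalg.norm(A, 2)             # spectral norm

grad = lambda w: A @ w
w1, w2 = rng.standard_normal(5), rng.standard_normal(5)
lhs = np.linalg.norm(grad(w1) - grad(w2))
rhs = ell * np.linalg.norm(w1 - w2)    # lhs <= rhs by the bound above
```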
Saddle points and negative eigenvalues
Stationary points: saddle points, local minima, local maxima

For stationary points, ∇f(w) = 0:

- If ∇²f(w) ≻ 0, it's a local minimum.
- If ∇²f(w) ≺ 0, it's a local maximum.
- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.
- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be either a local minimum (maximum) or a saddle point.
  - f is "flat" in some directions
  - SGD is like a random walk there
  - We only consider the non-degenerate case!
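The case analysis above translates directly into code: compute the Hessian's eigenvalues and branch on their signs. A minimal sketch (the function name and tolerance are my own choices):

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point (grad f = 0) by the Hessian's eigenvalues."""
    eigs = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    if np.any(np.abs(eigs) < tol):
        return "degenerate"              # some eigenvalue is (numerically) 0
    if np.all(eigs > 0):
        return "local minimum"
    if np.all(eigs < 0):
        return "local maximum"
    return "saddle point"                # mixed +/- eigenvalues

# f(x, y) = x^2 - y^2 has a stationary point at the origin with
# Hessian diag(2, -2): one positive and one negative eigenvalue.
kind = classify_stationary_point(np.diag([2.0, -2.0]))
```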
Strict saddle property

f(w) is (α, γ, ε, ζ)-strict saddle if, for any w,

- ‖∇f(w)‖2 ≥ ε,
- or λmin(∇²f(w)) ≤ −γ < 0,
- or there exists w* such that ‖w − w*‖2 ≤ ζ, and the region centered at w* with radius 2ζ is α-strongly convex.

Which means:

- The gradient is large,
- or (stationary point) we have a negative-eigenvalue direction to escape along,
- or (stationary point, no negative eigenvalues) we are pretty close to a local minimum.
Strict saddle functions are everywhere

- Orthogonal tensor decomposition [Ge et al. 2015]
- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]
- Matrix completion [Ge et al. 2016]
- Generalized phase retrieval problem [Sun et al. 2016]
- Low-rank matrix recovery [Bhojanapalli et al. 2016]

Moreover, in these problems, all local minima are equally good! That means:

- SGD escapes all saddle points
- So, SGD arrives at some local minimum → global minimum!
- This is one popular way to prove that SGD solves the problem.
Main Results

- [Ge et al. 2015] says that, w.h.p., SGD will escape all saddle points and converge to a local minimum. The convergence time has polynomial dependence on the dimension d.
- [Jin et al. 2017] says that, w.h.p., PGD (a variant of SGD) will escape all saddle points and converge to a local minimum much faster. The dependence on d is logarithmic.
- Same proof framework. We'll mainly look at the newer result.
Description of PGD

Do the following iteratively:

- If ‖∇f(x_t)‖ ≤ g_thres, and the last perturbation was > t_thres steps before, do a random perturbation (uniform in a ball)
- If the perturbation happened t_thres steps ago, but f has decreased by less than f_thres, return the value from before the last perturbation
- Do a gradient descent step:

  x_{t+1} = x_t − η∇f(x_t)

- Unfortunately, this is not a fast algorithm, because of the full-gradient GD step!
- η = c/ℓ; g_thres, t_thres, and f_thres depend on a constant c, as well as the other parameters.
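The loop above can be sketched as follows. This is only a schematic reading of the three rules, not the exact algorithm of Jin et al. 2017 (their threshold values are specific functions of ℓ, ρ, ε, and c; here all parameters are left as arguments, and the toy saddle f(x, y) = x² − y² is my own example).

```python
import numpy as np

def pgd(f, grad_f, x0, eta, g_thres, t_thres, f_thres, r, steps=200, seed=0):
    """Schematic Perturbed Gradient Descent: perturb when the gradient is
    small, roll back if f did not decrease enough within t_thres steps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb_t = -t_thres - 1       # time of the last perturbation
    x_before, f_before = None, None     # state saved at that perturbation
    for t in range(steps):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - last_perturb_t > t_thres:
            # Small gradient and no recent perturbation: save state, then
            # add noise sampled uniformly from a ball of radius r.
            x_before, f_before = x.copy(), f(x)
            xi = rng.standard_normal(x.shape)
            xi *= r * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi
            last_perturb_t = t
        elif t - last_perturb_t == t_thres and f_before - f(x) < f_thres:
            # The perturbation failed to escape: return the saved point.
            return x_before
        x = x - eta * grad_f(x)         # plain GD step
    return x

# f(x, y) = x^2 - y^2: GD started exactly at the saddle (0, 0) never moves,
# but PGD's perturbation pushes it off and it escapes along the y-axis.
f = lambda v: v[0] ** 2 - v[1] ** 2
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
x = pgd(f, grad, x0=[0.0, 0.0], eta=0.1, g_thres=1e-3,
        t_thres=20, f_thres=1e-6, r=0.1)
```

Note the design choice that makes the rollback test meaningful: the point and function value are saved at perturbation time, so "decreased by less than f_thres" is measured against the pre-perturbation value, exactly as in the second rule above.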
Description of PGD
Do the following iteratively:
I If ‖∇f (xt)‖ ≤ gthres, and last perturbed time is > tthres stepsbefore, do random perturbation (ball)
I If perturbation happened tthres steps ago, but f is decreasedfor less than fthres, return the value before last perturbation
I Do a gradient descent step:
xt+1 = xt − η∇f (xt)
A few Remarks:
I Unfortunately.. Not a fast algorithm because of GD!
I η = c` . gthres, tthres, fthres depends on a constant c , as well as
other parameters.
![Page 53: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/53.jpg)
Description of PGD
Do the following iteratively:
I If ‖∇f (xt)‖ ≤ gthres, and last perturbed time is > tthres stepsbefore, do random perturbation (ball)
I If perturbation happened tthres steps ago, but f is decreasedfor less than fthres, return the value before last perturbation
I Do a gradient descent step:
xt+1 = xt − η∇f (xt)
A few Remarks:
I Unfortunately.. Not a fast algorithm because of GD!
I η = c` . gthres, tthres, fthres depends on a constant c , as well as
other parameters.
![Page 54: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/54.jpg)
Description of PGD
Do the following iteratively:
I If ‖∇f (xt)‖ ≤ gthres, and last perturbed time is > tthres stepsbefore, do random perturbation (ball)
I If perturbation happened tthres steps ago, but f is decreasedfor less than fthres, return the value before last perturbation
I Do a gradient descent step:
xt+1 = xt − η∇f (xt)
A few Remarks:
I Unfortunately.. Not a fast algorithm because of GD!
I η = c` . gthres, tthres, fthres depends on a constant c , as well as
other parameters.
![Page 55: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/55.jpg)
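The loop above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: f, grad_f, and all threshold constants are caller-supplied placeholders (the paper derives η = c/ℓ and the thresholds from c and the problem parameters).

```python
import numpy as np

def pgd(f, grad_f, x0, eta, g_thres, t_thres, f_thres, r,
        max_iter=10_000, seed=0):
    """Minimal sketch of the PGD loop described above (not the paper's
    exact implementation; all thresholds are caller-supplied)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    t_perturb = -10**9            # iteration of the last perturbation
    f_before, x_before = None, None
    for t in range(max_iter):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - t_perturb > t_thres:
            # small gradient and no recent perturbation: add noise drawn
            # uniformly from a ball of radius r, remembering the current point
            f_before, x_before = f(x), x.copy()
            xi = rng.normal(size=x.shape)
            xi *= r * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
            t_perturb = t
        elif t - t_perturb == t_thres and f_before - f(x) < f_thres:
            # the perturbation did not buy an fthres decrease:
            # report the point from before the perturbation
            return x_before
        x = x - eta * grad_f(x)   # ordinary gradient descent step
    return x

# Demo on a convex quadratic (an illustrative assumption): PGD stops
# near the minimizer at the origin.
f = lambda x: 0.5 * float(np.dot(x, x))
out = pgd(f, lambda x: x, np.array([1.0, 1.0]), eta=0.1,
          g_thres=1e-3, t_thres=10, f_thres=1e-6, r=1e-3)
print(f(out))
```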
Main theorem in [Jin et al 2017]

Theorem (Main Theorem)
Assume f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant cmax such that, for any δ > 0, ∆f ≥ f (x0) − f*, constant c ≤ cmax, and ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:

O( (ℓ(f (x0) − f*)/ε̃²) · log⁴( dℓ∆f /(ε̃²δ) ) )

I If we could show that SGD has a similar property, that would be great!
I The convergence rate is almost optimal.
More general version: why it's fast

Theorem (A more general version)
Assume f is ℓ-smooth and ρ-Hessian Lipschitz. There exists an absolute constant cmax such that, for any δ > 0, ∆f ≥ f (x0) − f*, constant c ≤ cmax, and ε̃ ≤ ℓ²/ρ, PGD(c) will output an ε̃-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:

O( (ℓ(f (x0) − f*)/ε̃²) · log⁴( dℓ∆f /(ε̃²δ) ) )

Essentially saying the same thing: if f is not strict saddle, only an ε̃-second-order stationary point (instead of a local minimum) is guaranteed.
ε-stationary points

I ε-first-order stationary point:

‖∇f (x)‖ ≤ ε

I ε-second-order stationary point:

‖∇f (x)‖ ≤ ε, λmin(∇²f (x)) ≥ −√(ρε)

I If f is ℓ-smooth, λmin(∇²f (x)) ≥ −ℓ.
I For any ε > ℓ²/ρ, an ε-first-order stationary point of an ℓ-smooth function is an (ℓ²/ρ)-second-order stationary point.
I If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is a local minimum.
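The two conditions above are easy to test numerically. A small sketch (the test function x² − y² and the constants are illustrative assumptions):

```python
import numpy as np

def is_second_order_stationary(g, H, eps, rho):
    """Check the eps-second-order condition above:
    ||grad f(x)|| <= eps  and  lambda_min(hess f(x)) >= -sqrt(rho*eps)."""
    lam_min = np.linalg.eigvalsh(H)[0]    # eigvalsh returns ascending order
    return bool(np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps))

# The origin of f(x, y) = x**2 - y**2 has zero gradient (first-order
# stationary for every eps) but lambda_min = -2, so it fails the
# second-order test until eps reaches lambda_min**2 / rho = 4.
g0, H0 = np.zeros(2), np.diag([2.0, -2.0])
print(is_second_order_stationary(g0, H0, eps=0.01, rho=1.0))  # False
print(is_second_order_stationary(g0, H0, eps=4.0, rho=1.0))   # True
```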
[Nesterov, 1998]

Theorem
Assume that f is ℓ-smooth. Then for any ε̃ > 0, if we run GD with step size η = 1/ℓ and termination condition ‖∇f (x)‖ ≤ ε̃, the output will be an ε̃-first-order stationary point, and the algorithm terminates in the following number of iterations:

ℓ(f (x0) − f*)/ε̃²

[Jin et al 2017]: PGD converges to an ε̃-second-order stationary point in O( (ℓ(f (x0) − f*)/ε̃²) · log⁴( dℓ∆f /(ε̃²δ) ) ) steps.

I Matched up to log factors!
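Nesterov's bound can be sanity-checked numerically. The quadratic below is an illustrative assumption; the run only demonstrates that the iteration count stays (far) under the worst-case bound ℓ(f (x0) − f*)/ε̃².

```python
import numpy as np

# GD with eta = 1/l on a hand-picked l-smooth quadratic, stopping once
# ||grad f(x)|| <= eps_t, and comparing against the worst-case bound.
l = 4.0
A = np.diag([1.0, l])                  # Hessian spectrum in (0, l], f* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x = np.array([10.0, -10.0])
eps_t = 1e-3
bound = l * f(x) / eps_t**2            # = 1e9 here; very loose worst case
t = 0
while np.linalg.norm(grad(x)) > eps_t:
    x = x - (1.0 / l) * grad(x)
    t += 1
print(t, t <= bound)
```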
Why −√(ρε)?

I If we use the third-order approximation around x [Nesterov and Polyak, 2006]:

min_y { ⟨∇f (x), y − x⟩ + ½⟨∇²f (x)(y − x), y − x⟩ + (ρ/6)‖y − x‖³ }

denote the minimizer as Tx .
I Denote the distance r = ‖x − Tx‖.
I Then ‖∇f (Tx)‖ ≤ ρr², ∇²f (Tx) ⪰ −(3/2)ρr I .
I This gives a lower bound for r :

r ≥ max{ √(‖∇f (Tx)‖/ρ), −(2/(3ρ))λmin(∇²f (Tx)) }

I Asking when the two terms are equal leads to the threshold −√(ρε).
Related results

1. "Gradient Descent Converges to Minimizers" by Lee, Simchowitz, Jordan and Recht, '15.
I With random initialization, GD almost surely never touches any saddle point, and always converges to a local minimum.
2. "The power of normalization: faster evasion of saddle points", Kfir Levy, '16.
I Normalized gradient descent can escape saddle points in O(d³ poly(1/ε)) iterations: slower than [Jin et al 2017], faster than [Ge et al 2015], but still polynomial in d.
Main theorem in [Jin et al 2017]

Theorem (Main Theorem)
Assume f is ℓ-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant cmax such that, for any δ > 0, ∆f ≥ f (x0) − f*, constant c ≤ cmax, and ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:

O( (ℓ(f (x0) − f*)/ε̃²) · log⁴( dℓ∆f /(ε̃²δ) ) )
Proof framework: Progress, Escape and Trap

I Progress: when ‖∇f (x)‖ > gthres, f (x) is decreased by at least fthres/tthres per step.
I Escape: when ‖∇f (x)‖ ≤ gthres and λmin(∇²f (x)) ≤ −γ, whp the function value decreases by fthres after the perturbation + tthres steps.
I That is again fthres/tthres per step on average.
I Trap:
I The algorithm can't make progress and escape forever, because f is bounded below!
I When it stops: the perturbation happened tthres steps ago, but f has decreased by less than fthres.
I That means ‖∇f (x)‖ ≤ gthres before the perturbation, and whp there is no eigenvalue ≤ −γ.
I So it's a local minimum!
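The accounting above can be made concrete with toy numbers (assumptions, not the paper's constants): every tthres-step window either decreases f by at least fthres or the algorithm stops, so the run length is at most ∆f divided by the per-step decrease fthres/tthres.

```python
# Illustrative budget calculation for the Progress/Escape/Trap argument.
# All three constants below are made-up placeholders, not the paper's values.
delta_f = 100.0      # total decrease available: f(x0) - f*
fthres = 0.5         # guaranteed decrease per window
tthres = 20          # window length in steps
max_steps = delta_f / (fthres / tthres)   # upper bound on total iterations
print(max_steps)  # 4000.0
```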
Progress

Lemma
If f is ℓ-smooth, then for GD with step size η ≤ 1/ℓ, we have:

f (xt+1) ≤ f (xt) − (η/2)‖∇f (xt)‖²

Proof.

f (xt+1) ≤ f (xt) + ∇f (xt)ᵀ(xt+1 − xt) + (ℓ/2)‖xt+1 − xt‖²
         = f (xt) − η‖∇f (xt)‖² + (η²ℓ/2)‖∇f (xt)‖²
         ≤ f (xt) − (η/2)‖∇f (xt)‖²
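The lemma is easy to verify numerically on an ℓ-smooth test function (a quadratic chosen as an illustrative assumption):

```python
import numpy as np

# Check the progress lemma: one GD step with eta <= 1/l satisfies
#   f(x_next) <= f(x) - (eta/2) * ||grad f(x)||^2.
l = 10.0
A = np.diag([1.0, 3.0, l])                # Hessian spectrum in (0, l]
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
rng = np.random.default_rng(0)
eta = 1.0 / l
ok = True
for _ in range(100):
    x = rng.normal(size=3)
    g = grad(x)
    x_next = x - eta * g
    ok &= f(x_next) <= f(x) - 0.5 * eta * np.dot(g, g) + 1e-12
print(ok)  # True
```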
Escape: main idea

(figure slide)

Escape: thin pancake

(figure slide)
Main Lemma: measure the width

Lemma
Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let e1 be the eigenvector of the minimum eigenvalue. Consider two gradient descent sequences {ut}, {wt}, with initial points u0, w0 satisfying:

‖u0 − x̃‖ ≤ r , w0 = u0 + µre1, µ ∈ [δ/(2√d), 1]

Then, for any step size η ≤ cmax/ℓ and any T ≥ tthres, we have

min{f (uT ) − f (u0), f (wT ) − f (w0)} ≤ −2.5 fthres

I As long as u0 − w0 lies along e1 and ‖u0 − w0‖ ≥ δr/(2√d), at least one of them will escape!
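The exponential separation behind the main lemma can be seen on the simplest saddle, f (x) = ½xᵀHx with H = diag(−γ, ℓ) (an illustrative assumption): the difference of two coupled GD sequences evolves exactly as (I − ηH)ᵗ(w0 − u0), so its component along e1 grows like (1 + ηγ)ᵗ.

```python
import numpy as np

# Two GD sequences near a quadratic saddle, whose initial points differ
# only along the escape direction e1, separate exponentially fast.
gamma, l, eta = 1.0, 4.0, 0.25
H = np.diag([-gamma, l])
step = lambda x: x - eta * (H @ x)
u = np.array([1e-6, 0.1])        # start near the saddle at the origin
w = u + np.array([1e-4, 0.0])    # offset mu*r along e1
for t in range(60):
    u, w = step(u), step(w)
sep = np.linalg.norm(w - u)
print(sep)   # = 1e-4 * (1 + eta*gamma)**60 = 1e-4 * 1.25**60, about 65
```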
Escape Case

Lemma (Escape case)
Suppose we start with a point x̃ satisfying the following conditions:

‖∇f (x̃)‖ ≤ gthres, λmin(∇²f (x̃)) ≤ −γ

Let x0 = x̃ + ξ, where ξ comes from the uniform distribution over a ball with radius r , and let xt be the iterates of GD from x0. Then when η ≤ cmax/ℓ, with probability at least 1 − δ, for any T ≥ tthres:

f (xT ) − f (x̃) ≤ −fthres
Proof of the escape lemma

By smoothness, the perturbation step does not increase f much:

f (x0) − f (x̃) ≤ ∇f (x̃)ᵀξ + (ℓ/2)‖ξ‖² ≤ · · · ≤ 1.5 fthres

By the main lemma, for any x0 ∈ Xstuck, we know (x0 ± µre1) ∉ Xstuck, where µ ∈ [δ/(2√d), 1]. So Xstuck is contained in a thin slab:

Vol(Xstuck) ≤ Vol(B^(d−1)_x̃ (r)) × (δr/(2√d)) × 2

Therefore, the probability that we picked a point in Xstuck is bounded by

Vol(Xstuck) / Vol(B^(d)_x̃ (r)) ≤ δ

Thus, with probability at least 1 − δ, x0 ∉ Xstuck, and in this case, by the main lemma:

f (xT ) − f (x̃) ≤ −2.5 fthres + 1.5 fthres = −fthres
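The volume argument can be checked by Monte Carlo (dimension and constants below are illustrative assumptions): a uniform perturbation from the ball lands in a slab of total width δr/√d around the hyperplane orthogonal to e1 with probability roughly δ/√(2π), comfortably below δ. Here the slab is centered at 0, which is the worst case for the density.

```python
import numpy as np

# Estimate the fraction of a uniform perturbation ball that lies in the
# thin "stuck" slab |<x, e1>| <= delta*r/(2*sqrt(d)) of the proof above.
d, r, delta = 50, 1.0, 0.1
rng = np.random.default_rng(0)
n = 200_000
g = rng.normal(size=(n, d))
pts = g / np.linalg.norm(g, axis=1, keepdims=True)     # uniform on sphere
pts *= r * rng.uniform(size=(n, 1)) ** (1.0 / d)       # uniform in ball
half_width = delta * r / (2 * np.sqrt(d))              # slab half-width
frac = np.mean(np.abs(pts[:, 0]) <= half_width)        # hit probability
print(frac, frac <= delta)
```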
How to prove the main Lemma?
I If uT does not decrease function value, then {u0, · · · , uT} areclose to x̃ .
I If {u0, · · · , uT} are close to x̃ , GD on w0 will decrease thefunction value.
![Page 96: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/96.jpg)
How to prove the main Lemma?
I If uT does not decrease function value, then {u0, · · · , uT} areclose to x̃ .
I If {u0, · · · , uT} are close to x̃ , GD on w0 will decrease thefunction value.
![Page 97: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/97.jpg)
How to prove the main Lemma?
I If uT does not decrease function value, then {u0, · · · , uT} areclose to x̃ .
I If {u0, · · · , uT} are close to x̃ , GD on w0 will decrease thefunction value.
![Page 98: Mini-Course 1: SGD Escapes Saddle PointsWhy do we use SGD? Initially because: I Much cheaper to compute using mini-batch I Can still converge to global minimum in convex case I Can](https://reader035.fdocument.org/reader035/viewer/2022071510/612f97871ecc515869438c1e/html5/thumbnails/98.jpg)
How to prove the main lemma?

- If uT does not decrease the function value, then {u0, · · · , uT} stay close to x̃.
- If {u0, · · · , uT} stay close to x̃, then GD from w0 will decrease the function value.

We will need the following quadratic approximation:

f̃y(x) = f(y) + ∇f(y)⊤(x − y) + (1/2)(x − y)⊤H(x − y),

where H = ∇²f(x̃). (Note that the Hessian is frozen at x̃, not evaluated at y.)
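Because the Hessian in f̃y is frozen at x̃, the approximation error has two pieces: the usual cubic Taylor term (ρ/6)‖x − y‖³ plus a (ρ/2)‖y − x̃‖‖x − y‖² term from swapping ∇²f(y) for ∇²f(x̃), where ρ is the Hessian-Lipschitz constant. A minimal numeric check, using sin (for which ρ = 1) and illustrative points of my own choosing:

```python
import math

def f(x): return math.sin(x)
def df(x): return math.cos(x)
def d2f(x): return -math.sin(x)

def f_tilde(x, y, x_tilde):
    # Quadratic model anchored at y, with the Hessian frozen at x_tilde.
    H = d2f(x_tilde)
    return f(y) + df(y) * (x - y) + 0.5 * H * (x - y) ** 2

# For sin, |f'''| <= 1, so the Hessian is 1-Lipschitz (rho = 1) and
# |f(x) - f_tilde_y(x)| <= (rho/2)|y - x_tilde||x - y|^2 + (rho/6)|x - y|^3.
rho = 1.0
x_tilde, y, x = 0.3, 0.5, 0.9
err = abs(f(x) - f_tilde(x, y, x_tilde))
bound = (rho / 2) * abs(y - x_tilde) * (x - y) ** 2 + (rho / 6) * abs(x - y) ** 3
print(err <= bound)  # True
```

This is exactly the error decomposition used later in the chain of inequalities for f(uT′) − f(u0).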
Two lemmas (simplified)

Lemma (uT-stuck)
There exists an absolute constant cmax such that for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f(u0) ≤ −3 fthres }, tthres }.

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

Lemma (wT-escape)
There exists an absolute constant cmax such that if we define

T = min{ inf_t { t | f̃w0(wt) − f(w0) ≤ −3 fthres }, tthres },

then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.
Prove the main lemma

Assume x̃ is the origin. Define

T′ = inf_t { t | f̃u0(ut) − f(u0) ≤ −3 fthres }.

Case T′ ≤ tthres:
We know ‖u_{T′−1}‖ ≤ Φ by the uT-stuck lemma. By a simple calculation, we can show that ‖u_{T′}‖ = O(Φ) as well. Then:

f(u_{T′}) − f(u0)
≤ ∇f(u0)⊤(u_{T′} − u0) + (1/2)(u_{T′} − u0)⊤∇²f(u0)(u_{T′} − u0) + (ρ/6)‖u_{T′} − u0‖³
≤ f̃u0(u_{T′}) − f(u0) + (ρ/2)‖u0 − x̃‖‖u_{T′} − u0‖² + (ρ/6)‖u_{T′} − u0‖³
≤ −2.5 fthres
Prove the main lemma

Case T′ > tthres:
By the uT-stuck lemma, we know ‖ut‖ ≤ Φ for all t ≤ tthres. Using the wT-escape lemma, we then know

T″ = inf_t { t | f̃w0(wt) − f(w0) ≤ −3 fthres } ≤ tthres.

Then we may reduce this to the case T′ ≤ tthres, because w and u are interchangeable.
Prove the uT-stuck lemma

Lemma (uT-stuck)
There exists an absolute constant cmax such that for any initial point u0 with ‖u0 − x̃‖ ≤ r, define

T = min{ inf_t { t | f̃u0(ut) − f(u0) ≤ −3 fthres }, tthres }.

Then, for any η ≤ cmax/ℓ, we have ‖ut − x̃‖ ≤ Φ for all t < T.

- We won't move much in the directions of large negative eigenvalues; otherwise we would already have made a lot of progress!

Let Bt denote the component of ut in the remaining subspace, where the eigenvalues are ≥ −γ/100. Then

‖B_{t+1}‖ ≤ (1 + ηγ/100)‖Bt‖ + 2η gthres.

If T ≤ tthres, we will have (1 + ηγ/100)^T ≤ 3, so ‖BT‖ is bounded.
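Unrolling this geometric recursion gives ‖BT‖ ≤ (1 + ηγ/100)^T ‖B0‖ + 2η gthres · ((1 + ηγ/100)^T − 1)/(ηγ/100), and while (1 + ηγ/100)^T ≤ 3 the second factor is at most 200/(ηγ), so the bound is T-free: ‖BT‖ ≤ 3‖B0‖ + 400 gthres/γ. A small sketch checking the worst-case recursion against this closed form (η, γ, gthres, B0, T are illustrative values, not from the lecture):

```python
# Worst-case unrolling of the recursion
#   ||B_{t+1}|| <= (1 + eta*gamma/100) * ||B_t|| + 2*eta*g_thres.
eta, gamma, g_thres = 0.01, 1.0, 0.05
a = 1 + eta * gamma / 100
B0, T = 1.0, 1000

B = B0
for _ in range(T):
    B = a * B + 2 * eta * g_thres  # take equality at every step

# Closed form of the unrolled recursion, and the T-free bound that
# holds as long as a**T <= 3.
closed = a**T * B0 + 2 * eta * g_thres * (a**T - 1) / (a - 1)
bound = 3 * B0 + 400 * g_thres / gamma
print(a**T <= 3, abs(B - closed) < 1e-6, B <= bound)
```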
Prove the wT-escape lemma

Lemma (wT-escape)
There exists an absolute constant cmax such that if we define

T = min{ inf_t { t | f̃w0(wt) − f(w0) ≤ −3 fthres }, tthres },

then, for any η ≤ cmax/ℓ, if ‖ut − x̃‖ ≤ Φ for all t < T, we have T < tthres.
Prove the wT-escape lemma

- Let vt = wt − ut.
- We want to show that for some T < tthres, wT makes progress.
- If wt makes no progress, then by the uT-stuck lemma it is still near x̃.
- Therefore, we always have ‖vt‖ ≤ ‖ut‖ + ‖wt‖ ≤ 2Φ.

However, vt grows very rapidly, so it cannot stay small forever:

- In the e1 direction, v0 has magnitude at least δr/(2√d).
- Every step, this component is multiplied by at least 1 + ηγ.
- Within T < tthres steps, we get ‖vT‖ > 2Φ, a contradiction, so wT made progress!
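The geometric growth of vt is easiest to see on a pure quadratic saddle f(x) = (1/2)x⊤Hx with H = diag(−γ, 1): the difference of two GD runs satisfies v_{t+1} = (I − ηH)vt, so its e1-component is multiplied by exactly 1 + ηγ every step. A minimal sketch (γ, η, and the starting points are illustrative values of my own):

```python
# Two coupled GD runs on the quadratic saddle f(x, y) = 0.5*(-gamma*x^2 + y^2).
gamma, eta = 0.5, 0.1

def gd_step(p):
    # Gradient of the quadratic saddle is (-gamma*x, y).
    gx, gy = -gamma * p[0], p[1]
    return (p[0] - eta * gx, p[1] - eta * gy)

u = (0.01, 0.02)
w = (u[0] + 0.001, u[1])  # perturbed copy, shifted along e1

for _ in range(50):
    u, w = gd_step(u), gd_step(w)

v1 = w[0] - u[0]
print(v1 / 0.001)  # matches (1 + eta*gamma)**50, about 11.47
```

Once this exponentially growing component exceeds 2Φ, the bullet-point argument above kicks in: at least one of the two coupled runs must have escaped.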