
### Transcript of CSC411: Optimization for Machine Learning

• CSC411: Optimization for Machine Learning

University of Toronto

September 20–26, 2018

1. Based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang, and others.


• Optimization problem: find θ∗ = argmin_θ f(θ), where θ ∈ Rᵈ are the parameters and f : Rᵈ → R is the cost function; a minimizer satisfies f(θ∗) ≤ f(θ) for all θ.

• A minimum may be local (smallest f in a neighbourhood of θ) or global (smallest f over all θ).

• In machine learning, θ are the model parameters. Given data (x, y) and a model likelihood p(y | x, θ), a standard cost function is the negative log-likelihood f(θ) = −log p(y | x, θ).

• At a minimum θ∗, the partial derivatives vanish: ∂f(θ∗)/∂θ = 0.

• The gradient collects all partial derivatives: ∇_θ f = (∂f/∂θ₁, ∂f/∂θ₂, ..., ∂f/∂θ_d).

• Gradient descent with a fixed learning rate η: initialize θ₀; for t = 1, 2, ...:

- δ_t ← −η ∇_θ f(θ_{t−1})
- θ_t ← θ_{t−1} + δ_t
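As a concrete illustration of this update rule, here is a minimal Python sketch; the function name and the toy objective f(θ) = (θ − 3)², with gradient 2(θ − 3), are my own choices for illustration, not from the slides.

```python
def gradient_descent(grad, theta0, eta=0.1, num_steps=100):
    """Repeat the update delta_t = -eta * grad(theta); theta += delta_t."""
    theta = theta0
    for _ in range(num_steps):
        delta = -eta * grad(theta)   # delta_t <- -eta * grad f(theta_{t-1})
        theta = theta + delta        # theta_t <- theta_{t-1} + delta_t
    return theta

# Toy objective f(theta) = (theta - 3)^2, gradient 2*(theta - 3); minimum at 3.
theta_star = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```

With this stable step size the gap to the minimum shrinks by a constant factor (here 0.8) per step, so `theta_star` ends very close to 3.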

• Gradient descent with an adaptive step size: initialize θ₀; for t = 1, 2, ...: choose η_t so that f(θ_{t−1} − η_t ∇_θ f(θ_{t−1})) < f(θ_{t−1}), then set

- δ_t ← −η_t ∇_θ f(θ_{t−1})
- θ_t ← θ_{t−1} + δ_t

• Gradient descent with momentum α ∈ [0, 1): initialize θ₀ and δ₀; for t = 1, 2, ...:

- δ_t ← −η ∇_θ f(θ_{t−1}) + α δ_{t−1}
- θ_t ← θ_{t−1} + δ_t

The hyperparameter α controls how strongly past update directions carry over into the current step.
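A sketch of the momentum variant in Python, again on a toy quadratic f(θ) = (θ − 3)² of my own choosing (the hyperparameter values are illustrative, not from the slides):

```python
def gd_momentum(grad, theta0, eta=0.1, alpha=0.9, num_steps=500):
    """Gradient descent with momentum:
    delta_t = -eta * grad(theta_{t-1}) + alpha * delta_{t-1}."""
    theta, delta = theta0, 0.0   # initialize theta_0 and delta_0
    for _ in range(num_steps):
        delta = -eta * grad(theta) + alpha * delta
        theta = theta + delta
    return theta

# Toy objective f(theta) = (theta - 3)^2 with gradient 2*(theta - 3).
theta_star = gd_momentum(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```

With a large α the iterate overshoots and oscillates around the minimum before settling, but the accumulated direction still drives it to θ = 3.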

• In summary, with learning rate η: initialize θ₀ and repeat

- δ_t ← −η ∇_θ f(θ_{t−1})
- θ_t ← θ_{t−1} + δ_t

until a stopping criterion is met.

• Stopping criteria: stop when the cost barely changes, |f(θ_{t+1}) − f(θ_t)| < ϵ, or when the gradient is nearly zero, ‖∇_θ f‖ < ϵ.
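These two stopping criteria can be sketched in code as follows; the loop structure and the 1-D toy objective are illustrative assumptions of mine (in 1-D, ‖∇_θ f‖ is just the absolute value of the derivative).

```python
def gd_until_converged(f, grad, theta0, eta=0.1, eps=1e-8, max_steps=10_000):
    """Gradient descent that stops when |f(theta_{t+1}) - f(theta_t)| < eps
    or when the gradient magnitude falls below eps."""
    theta = theta0
    for _ in range(max_steps):
        if abs(grad(theta)) < eps:               # ||grad f(theta)|| < eps
            break
        new_theta = theta - eta * grad(theta)
        if abs(f(new_theta) - f(theta)) < eps:   # |f(theta_{t+1}) - f(theta_t)| < eps
            theta = new_theta
            break
        theta = new_theta
    return theta

f = lambda th: (th - 3.0) ** 2
theta_star = gd_until_converged(f, lambda th: 2.0 * (th - 3.0), theta0=0.0)
```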

• Learning Rate (Step Size)

In gradient descent, the learning rate α is a hyperparameter we need to tune. Here are some things that can go wrong:

α too small: slow progress

α too large: oscillations

α much too large: instability

Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (e.g. try 0.1, 0.03, 0.01, ...).

Intro ML (UofT) CSC311-Lec3 37 / 42
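The grid search suggested above can be sketched as follows; the grid values follow the slide's example, while the toy objective and helper names are illustrative assumptions of mine.

```python
def run_gd(eta, grad, theta0=0.0, num_steps=50):
    """Run plain gradient descent with learning rate eta and return the iterate."""
    theta = theta0
    for _ in range(num_steps):
        theta -= eta * grad(theta)
    return theta

# Toy objective f(theta) = (theta - 3)^2 with gradient 2*(theta - 3).
f = lambda th: (th - 3.0) ** 2
grad = lambda th: 2.0 * (th - 3.0)

# Grid-search the learning rate: pick the eta whose final cost is lowest.
grid = [0.1, 0.03, 0.01, 0.003, 0.001]
best_eta = min(grid, key=lambda eta: f(run_gd(eta, grad)))
```

On this well-conditioned toy problem the largest stable step size wins the search, since every candidate converges and the largest one converges fastest.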

• A gradient implementation can be checked numerically against the two-sided finite-difference estimate

∂f/∂θᵢ ≈ [f(θ₁, ..., θᵢ + ϵ, ..., θ_d) − f(θ₁, ..., θᵢ − ϵ, ..., θ_d)] / (2ϵ)

for a small ϵ.
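A minimal sketch of this finite-difference check in Python; the test function f(θ) = θ₀² + 3θ₁, the evaluation point, and ϵ = 1e−5 are illustrative choices of mine.

```python
def numeric_grad(f, theta, i, eps=1e-5):
    """Two-sided finite-difference estimate of df/dtheta_i at theta."""
    theta_plus = list(theta)
    theta_plus[i] += eps
    theta_minus = list(theta)
    theta_minus[i] -= eps
    return (f(theta_plus) - f(theta_minus)) / (2.0 * eps)

# Check the analytic gradient of f(theta) = theta_0^2 + 3*theta_1.
f = lambda th: th[0] ** 2 + 3.0 * th[1]
theta = [1.5, 2.0]
analytic = [2.0 * theta[0], 3.0]   # hand-derived gradient at theta
estimates = [numeric_grad(f, theta, i) for i in range(2)]
```

If the analytic and numeric values disagree beyond the finite-difference error, the gradient code has a bug.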

• Stochastic gradient descent (SGD): instead of computing the gradient on the full training set, update the parameters using the gradient of the cost on a single randomly chosen training example (or a small mini-batch), which makes each update much cheaper.

Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average.
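The noisy-but-downhill-on-average behaviour can be seen in a small sketch; the data-fitting objective, mean_i (θ − xᵢ)², and all hyperparameter values here are illustrative assumptions of mine, not from the slides.

```python
import random

# Fit theta to minimize mean_i (theta - x_i)^2; the minimizer is mean(x).
random.seed(0)
xs = [random.gauss(5.0, 1.0) for _ in range(200)]

theta, eta = 0.0, 0.05
for step in range(2000):
    x = random.choice(xs)        # one random training example
    grad_i = 2.0 * (theta - x)   # gradient on that example: noisy estimate
    theta -= eta * grad_i        # noisy step, downhill on average
```

Each individual step can point away from the minimum, but with a fixed learning rate `theta` settles into a small fluctuating band around the sample mean.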

Intro ML (UofT) CSC311-Lec4 66 / 70

• SGD Learning Rate

In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients.

Typical strategy:

- Use a large learning rate early in training so you can get close to the optimum.
- Gradually decay the learning rate to reduce the fluctuations.
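One common way to implement this strategy is a 1/t-style decay schedule; the particular formula η_t = η₀ / (1 + t/τ) and the values of η₀ and τ below are illustrative assumptions of mine, not from the lecture.

```python
def decayed_lr(t, eta0=0.5, tau=100.0):
    """Learning rate at step t: large early on, shrinking as training proceeds."""
    return eta0 / (1.0 + t / tau)

# Sample the schedule at a few steps: starts at eta0, halves by t = tau.
schedule = [decayed_lr(t) for t in (0, 100, 900)]
```

Early steps use nearly the full η₀ to make fast progress; later steps shrink toward zero, damping the stochastic fluctuations.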


• SGD Learning Rate

Warning: by reducing the learning rate, you reduce the fluctuations, which can appear to make the loss drop suddenly. But this can come at the expense of long-run performance.


• SGD and Non-convex optimization

[Figure: a non-convex cost function with a local minimum and the global minimum marked.]