CSC411: Optimization for Machine Learning
University of Toronto
September 20–26, 2018
Based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang, and others.
The goal of optimization is to find the parameters that minimize an objective:
θ∗ = argmin_θ L(θ), where θ ∈ R^d and L: R^d → R.
That is, we want θ∗ such that L(θ∗) ≤ L(θ) for all θ.
Example: maximum likelihood. Given data (x, y) and a model p(y | x, θ), maximizing the likelihood of the data is equivalent to minimizing the negative log-likelihood −log p(y | x, θ).
At a minimum the gradient vanishes: ∂L(θ∗)/∂θ = 0.
The gradient collects the partial derivatives with respect to each parameter:
∇_θ L = (∂L/∂θ_1, ∂L/∂θ_2, ..., ∂L/∂θ_d)
Gradient descent with a fixed learning rate η:
Initialize θ_0; for t = 1, 2, ...:
  δ_t ← −η ∇_θ L(θ_{t−1})
  θ_t ← θ_{t−1} + δ_t
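A minimal sketch of this update rule in NumPy; the quadratic objective quad_loss and its gradient quad_grad are made-up stand-ins for L and ∇_θ L.

```python
import numpy as np

# Made-up stand-ins for L(theta) and grad L(theta): a simple quadratic bowl.
def quad_loss(theta):
    return 0.5 * np.sum(theta ** 2)

def quad_grad(theta):
    return theta

def gradient_descent(grad, theta0, eta=0.1, num_steps=100):
    """Plain gradient descent with a fixed learning rate eta."""
    theta = theta0.copy()
    for _ in range(num_steps):
        delta = -eta * grad(theta)   # delta_t = -eta * grad L(theta_{t-1})
        theta = theta + delta        # theta_t = theta_{t-1} + delta_t
    return theta

theta_star = gradient_descent(quad_grad, theta0=np.array([3.0, -2.0]))
print(theta_star)  # approaches the minimizer [0, 0]
```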
Gradient descent with line search, where the step size η_t is chosen at each iteration:
Initialize θ_0; for t = 1, 2, ...:
  Pick η_t such that L(θ_{t−1} − η_t ∇_θ L(θ_{t−1})) < L(θ_{t−1})
  δ_t ← −η_t ∇_θ L(θ_{t−1})
  θ_t ← θ_{t−1} + δ_t
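A sketch of this line-search idea using a simple backtracking rule (halve the step size until the objective decreases); the starting step eta0 = 1.0 and the halving factor are illustrative choices, not prescribed here.

```python
import numpy as np

def gd_with_backtracking(loss, grad, theta0, eta0=1.0, num_steps=100):
    """Gradient descent where each step size eta_t is shrunk until the loss decreases."""
    theta = theta0.copy()
    for _ in range(num_steps):
        g = grad(theta)
        eta = eta0
        # Backtrack: halve eta until L(theta - eta * g) < L(theta).
        while loss(theta - eta * g) >= loss(theta) and eta > 1e-12:
            eta *= 0.5
        theta = theta - eta * g      # theta_t = theta_{t-1} - eta_t * grad L(theta_{t-1})
    return theta

# Usage sketch on a toy quadratic:
# gd_with_backtracking(lambda th: 0.5 * th @ th, lambda th: th, np.array([3.0, -2.0]))
```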
Gradient descent with momentum, with momentum parameter α ∈ [0, 1):
Initialize θ_0 and δ_0; for t = 1, 2, ...:
  δ_t ← −η ∇_θ L(θ_{t−1}) + α δ_{t−1}
  θ_t ← θ_{t−1} + δ_t
The momentum parameter α controls how much of the previous update carries over into the current one; the learning rate η still sets the size of the gradient step.
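A sketch of the momentum update, keeping a running update vector delta; α = 0.9 here is a common illustrative value, not a recommendation.

```python
import numpy as np

def gd_with_momentum(grad, theta0, eta=0.1, alpha=0.9, num_steps=100):
    """Gradient descent with momentum: delta_t = -eta * grad L(theta_{t-1}) + alpha * delta_{t-1}."""
    theta = theta0.copy()
    delta = np.zeros_like(theta)              # delta_0 = 0
    for _ in range(num_steps):
        delta = -eta * grad(theta) + alpha * delta
        theta = theta + delta
    return theta
```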
In each variant the update has the same basic form:
  δ_t ← −η ∇_θ L(θ_{t−1})
  θ_t ← θ_{t−1} + δ_t
Stopping criteria: terminate when the objective stops improving, |L(θ_{t+1}) − L(θ_t)| < ϵ, or when the gradient is small, ∥∇_θ L(θ_t)∥ < ϵ.
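A sketch of how these two stopping tests might sit inside the loop; the tolerance eps and the step budget are illustrative assumptions.

```python
import numpy as np

def gd_until_converged(loss, grad, theta0, eta=0.1, eps=1e-6, max_steps=10000):
    """Run gradient descent until the loss stops improving or the gradient norm is small."""
    theta = theta0.copy()
    prev_loss = loss(theta)
    for _ in range(max_steps):
        theta = theta - eta * grad(theta)
        cur_loss = loss(theta)
        if abs(cur_loss - prev_loss) < eps:       # |L(theta_{t+1}) - L(theta_t)| < eps
            break
        if np.linalg.norm(grad(theta)) < eps:     # ||grad_theta L|| < eps
            break
        prev_loss = cur_loss
    return theta
```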
Learning Rate (Step Size)
In gradient descent, the learning rate η is a hyperparameter we need to tune. Here are some things that can go wrong:
- η too small: slow progress
- η too large: oscillations
- η much too large: instability
Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (i.e. try 0.1, 0.03, 0.01, ...).
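A sketch of that grid search: train once per candidate learning rate with a fixed step budget and keep the value with the lowest final loss. The candidate grid and the 200-step budget are assumptions for illustration.

```python
import numpy as np

def lr_grid_search(loss, grad, theta0, candidates=(0.1, 0.03, 0.01, 0.003, 0.001)):
    """Run a short training budget per candidate learning rate; keep the best final loss."""
    best_eta, best_loss = None, np.inf
    for eta in candidates:
        theta = theta0.copy()
        for _ in range(200):                 # fixed, illustrative step budget per candidate
            theta = theta - eta * grad(theta)
        final_loss = loss(theta)
        if final_loss < best_loss:
            best_eta, best_loss = eta, final_loss
    return best_eta
```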
Gradient checking: verify an analytic gradient ∇_θ L with finite differences. Perturb one parameter at a time:
  ∂L/∂θ_i ≈ [ L(θ_1, ..., θ_i + ϵ, ..., θ_d) − L(θ_1, ..., θ_i − ϵ, ..., θ_d) ] / (2ϵ)
for a small ϵ.
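A sketch of this central-difference check, assuming theta is a 1-D float array; epsilon = 1e-5 is a typical but arbitrary choice, and the comparison against the analytic gradient is shown only in a comment.

```python
import numpy as np

def numerical_gradient(loss, theta, epsilon=1e-5):
    """Central finite differences; perturbs one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad[i] = (loss(theta_plus) - loss(theta_minus)) / (2 * epsilon)
    return grad

# Usage sketch: compare against an analytic gradient on a toy quadratic, e.g.
# np.allclose(numerical_gradient(lambda th: 0.5 * th @ th, theta), theta, atol=1e-4)
```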
Stochastic Gradient Descent
Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average.
[Figure: parameter-space trajectories of batch gradient descent vs. stochastic gradient descent]
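A minimal mini-batch SGD sketch of this idea: each update uses the gradient averaged over a random mini-batch instead of the full training set. The per-example gradient function grad_example, the batch size, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def sgd(grad_example, X, y, theta0, eta=0.01, batch_size=32, num_epochs=10):
    """Mini-batch SGD: each update uses a noisy gradient estimate from a random mini-batch."""
    rng = np.random.default_rng(0)            # fixed seed for reproducibility (illustrative)
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(num_epochs):
        order = rng.permutation(n)            # visit the data in a fresh random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Average per-example gradients over the mini-batch.
            g = np.mean([grad_example(theta, X[i], y[i]) for i in idx], axis=0)
            theta = theta - eta * g
    return theta
```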
SGD Learning Rate
In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients.
Typical strategy (a decay-schedule sketch follows after this list):
- Use a large learning rate early in training so you can get close to the optimum.
- Gradually decay the learning rate to reduce the fluctuations.
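One way to implement this strategy is a decay schedule on the base learning rate; the 1/t form and the decay constant below are illustrative, not the only option.

```python
def decayed_learning_rate(eta0, t, decay=0.01):
    """1/t-style decay: start near eta0 and shrink the step size as training proceeds."""
    return eta0 / (1.0 + decay * t)

# Inside a training loop (sketch):
# for t in range(num_steps):
#     eta_t = decayed_learning_rate(eta0=0.1, t=t)
#     theta = theta - eta_t * grad(theta)
```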
SGD Learning Rate
Warning: by reducing the learning rate, you reduce the fluctuations, which can appear to make the loss drop suddenly. But this can come at the expense of long-run performance.
SGD and Non-convex optimization
[Figure: non-convex loss surface with a local minimum and a global minimum; arrows indicate stochastic gradient descent updates]
Stochastic methods have a chance of escaping from bad minima.
Gradient descent with a small step size converges to the first minimum it finds.