CSC411: Optimization for Machine Learning


  • CSC411: Optimization for Machine Learning

    University of Toronto

    September 20–26, 2018

    Based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang, and others.


  • The optimization problem: find θ∗ = argmin_θ L(θ), where θ ∈ R^D are the parameters and L : R^D → R is the cost function.

    How good a candidate θ is can be measured by the gap L(θ) − L(θ∗).


  • Machine learning as optimization: the parameters θ are fit to training data.

    Given training pairs (x, y) and a model p(y | x, θ), maximizing the likelihood is the same as minimizing the negative log-likelihood L(θ) = − log p(y | x, θ).
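
    As a worked version of this objective, assuming the training pairs are i.i.d. (an assumption not stated explicitly in the fragments above), the negative log-likelihood splits into a sum over examples:

    % Negative log-likelihood over an assumed i.i.d. training set {(x_i, y_i)}, i = 1..N
    \mathcal{L}(\theta)
      = -\log \prod_{i=1}^{N} p(y_i \mid x_i, \theta)
      = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)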

  • At a minimum θ∗, the partial derivatives vanish: ∂L(θ∗)/∂θ_j = 0.

    The gradient collects all partial derivatives of L with respect to θ into one vector:

    ∇θ L = ( ∂L/∂θ_1, ∂L/∂θ_2, ..., ∂L/∂θ_D )

  • Gradient descent with learning rate η:

    Initialize θ_0; for t = 1 : T:

    δ_t ← −η ∇θ L(θ_{t−1})
    θ_t ← θ_{t−1} + δ_t
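
    A minimal NumPy sketch of this update; the gradient function `grad` and the example cost are placeholders for illustration, not code from the course:

    import numpy as np

    def gradient_descent(grad, theta0, eta=0.1, num_steps=100):
        """Plain gradient descent: repeatedly step in the negative gradient direction."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(num_steps):
            delta = -eta * grad(theta)   # delta_t = -eta * grad L(theta_{t-1})
            theta = theta + delta        # theta_t = theta_{t-1} + delta_t
        return theta

    # Example: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
    theta_star = gradient_descent(lambda th: 2 * th, theta0=[3.0, -2.0])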

  • Gradient descent with line search (adaptive step size):

    Initialize θ_0; for t = 1 : T:

    Choose η_t so that L(θ_{t−1} − η_t ∇θ L(θ_{t−1})) < L(θ_{t−1})
    δ_t ← −η_t ∇θ L(θ_{t−1})
    θ_t ← θ_{t−1} + δ_t
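
    One simple way to satisfy the decrease condition above is to start from a trial step size and shrink it until the cost goes down. This backtracking-style sketch is an illustration, not necessarily the exact rule intended by the slides:

    import numpy as np

    def gd_with_line_search(loss, grad, theta0, eta0=1.0, num_steps=100, shrink=0.5):
        """Gradient descent where each step size eta_t is chosen so the cost decreases."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(num_steps):
            g = grad(theta)
            eta = eta0
            # Shrink eta until L(theta - eta * g) < L(theta) (or eta becomes negligible).
            while loss(theta - eta * g) >= loss(theta) and eta > 1e-12:
                eta *= shrink
            theta = theta - eta * g
        return theta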

  • Gradient descent with momentum α ∈ [0, 1):

    Initialize θ_0 and δ_0; for t = 1 : T:

    δ_t ← −η ∇θ L(θ_{t−1}) + α δ_{t−1}
    θ_t ← θ_{t−1} + δ_t

    The momentum parameter α controls how much of the previous step is carried over; α = 0 recovers plain gradient descent.
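
    A sketch of the momentum update, carrying the previous step δ as extra state (function and variable names are placeholders):

    import numpy as np

    def gd_with_momentum(grad, theta0, eta=0.1, alpha=0.9, num_steps=100):
        """Gradient descent with momentum: each step reuses a fraction alpha of the last step."""
        theta = np.asarray(theta0, dtype=float)
        delta = np.zeros_like(theta)
        for _ in range(num_steps):
            delta = -eta * grad(theta) + alpha * delta   # delta_t = -eta*grad + alpha*delta_{t-1}
            theta = theta + delta
        return theta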

  • Stochastic gradient descent with learning rate η:

    Instead of the full gradient, use the gradient computed on a single training example (or a small mini-batch) i:

    δ_t ← −η ∇θ L^(i)(θ_{t−1})
    θ_t ← θ_{t−1} + δ_t
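
    A mini-batch SGD sketch under the common setup where the cost averages over training examples; the batch size, the shuffling scheme, and the `grad_on_batch` function are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sgd(grad_on_batch, data, theta0, eta=0.01, num_epochs=10, batch_size=32):
        """Stochastic gradient descent: take a noisy gradient step per random mini-batch."""
        theta = np.asarray(theta0, dtype=float)
        rng = np.random.default_rng(0)
        n = len(data)
        for _ in range(num_epochs):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
                batch = [data[i] for i in idx]
                theta = theta - eta * grad_on_batch(theta, batch)
        return theta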

  • Stopping criteria: stop when the cost stops improving, |L(θ_{t+1}) − L(θ_t)| < ϵ,

    or when the gradient is nearly zero, ∥∇θ L(θ_t)∥ < ϵ.
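
    The two criteria translate directly into a small check (the threshold ϵ is an illustrative value):

    import numpy as np

    def converged(loss_prev, loss_curr, grad_curr, eps=1e-6):
        """Stop when the cost stops changing or the gradient is nearly zero."""
        return abs(loss_curr - loss_prev) < eps or np.linalg.norm(grad_curr) < eps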

  • Learning Rate (Step Size)

    In gradient descent, the learning rate η is a hyperparameter we need to tune. Here are some things that can go wrong:

    η too small: slow progress

    η too large: oscillations

    η much too large: instability

    Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (e.g., try 0.1, 0.03, 0.01, ...).
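
    A sketch of that grid search; `train_and_validate` stands in for whatever routine trains a model with a given learning rate and returns a validation loss (a hypothetical helper, not part of the course code):

    def grid_search_learning_rate(train_and_validate, candidates=(0.1, 0.03, 0.01, 0.003, 0.001)):
        """Try each candidate learning rate and keep the one with the lowest validation loss."""
        results = {lr: train_and_validate(lr) for lr in candidates}
        return min(results, key=results.get)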

  • Gradient checking with finite differences:

    ∂L/∂θ_i ≈ [ L(θ_1, ..., θ_i + ϵ, ..., θ_D) − L(θ_1, ..., θ_i − ϵ, ..., θ_D) ] / (2ϵ)

    Compare this numerical estimate against your analytic gradient to catch implementation bugs; it is too slow to use for training itself.
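
    A sketch of the centred-difference check, comparing a numerical estimate against an analytic gradient (names and tolerances are illustrative):

    import numpy as np

    def numerical_gradient(loss, theta, eps=1e-5):
        """Estimate each partial derivative of loss at theta by centred differences."""
        theta = np.asarray(theta, dtype=float)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            bump = np.zeros_like(theta)
            bump[i] = eps
            grad[i] = (loss(theta + bump) - loss(theta - bump)) / (2 * eps)
        return grad

    # Usage: np.allclose(numerical_gradient(loss, theta), analytic_grad(theta), atol=1e-4)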


  • Stochastic Gradient Descent

    Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average.

    [Figure: optimization paths for batch gradient descent vs. stochastic gradient descent]

  • SGD Learning Rate

    In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients.

    Typical strategy:

    • Use a large learning rate early in training so you can get close to the optimum.

    • Gradually decay the learning rate to reduce the fluctuations.
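
    One common way to implement that strategy is a simple decay schedule; the 1/t form below is an illustrative choice, not taken from the slides:

    def decayed_learning_rate(step, base_lr=0.1, decay_rate=0.01):
        """Start near base_lr and shrink the learning rate roughly like 1/step."""
        return base_lr / (1.0 + decay_rate * step)

    # e.g. decayed_learning_rate(0) == 0.1, decayed_learning_rate(1000) ≈ 0.0091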

  • SGD Learning Rate

    Warning: by reducing the learning rate, you reduce the fluctuations, which can appear to make the loss drop suddenly. But this can come at the expense of long-run performance.


  • SGD and Non-convex optimization

    [Figure: a non-convex loss surface with a local minimum and the global minimum; arrows show stochastic gradient descent updates]

    Stochastic methods have a chance of escaping from bad minima.

    Gradient descent with a small step size converges to the first minimum it finds.
