
Page 1

New Optimization
Chung-Ming Chien

2020/4/9

Page 2

Background Knowledge

• μ-strong convexity

• Lipschitz continuity

• Bregman proximal inequality

Page 3

New Optimization
Chung-Ming Chien

2020/4/12

Page 4

New Optimizers for Deep Learning

Chung-Ming Chien

2020/4/12

Page 5

What have you learned before?

• SGD

• SGD with momentum

• Adagrad

• RMSProp

• Adam

Page 6

Some Notations

• $\theta_t$: model parameters at time step $t$

• $\nabla L(\theta_t)$ or $g_t$: gradient at $\theta_t$, used to compute $\theta_{t+1}$

• $m_{t+1}$: momentum accumulated from time step 0 to time step $t$, which is used to compute $\theta_{t+1}$

(Diagram: input $x_t$ → model $\theta_t$ → prediction $y_t$, compared with target $\hat{y}_t$ to give the loss $L(\theta_t; x_t)$ and the gradient $g_t$.)

Page 7

What is Optimization about?

• Find a $\theta$ to get the lowest $\sum_x L(\theta; x)$!!

• Or, find a $\theta$ to get the lowest $L(\theta)$!!

Page 8

On-line vs Off-line

• On-line: one pair of $(x_t, \hat{y}_t)$ at each time step

(Diagram: $x_t$ → model $\theta_t$ → $y_t$, compared with $\hat{y}_t$ → $L(\theta_t; x_t)$ → $g_t$)

Page 9

On-line vs Off-line

• Off-line: feed all pairs $(x_t, \hat{y}_t)$ into the model at every time step

• The rest of this lecture will focus on the off-line case

(Diagram: the whole training set $(x_1, \hat{y}_1), \dots, (x_t, \hat{y}_t)$ → model $\theta_t$ → predictions $y_1, \dots, y_t$ → $L(\theta_t)$ → $g_t$)

Page 10

SGD

Credit to Prof. Hung-yi Lee's lecture slides
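The SGD slide itself is an image only; as a quick reminder, here is a minimal NumPy sketch of the vanilla update $\theta_t = \theta_{t-1} - \eta\, g_{t-1}$ (the learning rate value and the toy quadratic loss are illustrative assumptions, not from the slide):

```python
import numpy as np

def sgd_step(theta, grad, eta=0.01):
    """Vanilla SGD: step against the gradient with a fixed learning rate."""
    return theta - eta * grad

# Toy usage: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = sgd_step(theta, 2 * theta)
# theta is now close to the minimum at the origin
```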

Page 11

SGD with Momentum (SGDM)

Credit to Prof. Hung-yi Lee's lecture slides
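The SGDM slides are also images; a minimal sketch of the update used later in this deck, $m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1})$, $\theta_t = \theta_{t-1} - m_t$ (the hyper-parameter values are illustrative assumptions):

```python
import numpy as np

def sgdm_step(theta, m, grad, eta=0.01, lam=0.9):
    """SGD with momentum: accumulate past gradients in m, then step by m."""
    m = lam * m + eta * grad
    return theta - m, m

theta, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta, m = sgdm_step(theta, m, 2 * theta)
```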

Page 12

SGD with Momentum (SGDM)

Credit to Prof. Hung-yi Lee's lecture slides

Page 13

Why momentum?

Credit to Prof. Hung-yi Lee's lecture slides

Stop! Let me take a look!

Page 14

Adagrad

Credit to Prof. Hung-yi Lee's lecture slides

What if the gradients at the first few time steps are extremely large…

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\sum_{i=0}^{t-1} (g_i)^2}}\, g_{t-1}$$
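A minimal sketch of the Adagrad update above; the small constant `eps` added for numerical stability is an assumption not shown on the slide:

```python
import numpy as np

def adagrad_step(theta, sum_sq, grad, eta=0.01, eps=1e-8):
    """Adagrad: divide the step by the root of the accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2           # the sum never shrinks, so steps only get smaller
    theta = theta - eta / (np.sqrt(sum_sq) + eps) * grad
    return theta, sum_sq
```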

Page 15

RMSProp

Credit to Prof. Hung-yi Lee's lecture slides

The exponential moving average (EMA) of squared gradients is not monotonically increasing

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{v_t}}\, g_{t-1}$$

$$v_1 = (g_0)^2,\qquad v_t = \alpha v_{t-1} + (1-\alpha)(g_{t-1})^2$$
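A minimal sketch of the RMSProp update above; as with Adagrad, the `eps` term and the default values are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(theta, v, grad, eta=0.001, alpha=0.9, eps=1e-8):
    """RMSProp: replace Adagrad's running sum with an EMA of squared gradients."""
    v = alpha * v + (1 - alpha) * grad ** 2   # the EMA can also shrink again
    theta = theta - eta / (np.sqrt(v) + eps) * grad
    return theta, v
```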

Page 16

Adam

• SGDM:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_{t-1},\qquad \theta_t = \theta_{t-1} - \eta\, m_t$$

• RMSProp:
$$v_1 = (g_0)^2,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_{t-1})^2,\qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{v_t}}\, g_{t-1}$$

Combining both, with de-biasing:
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}},\qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}},\qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$$

$$\beta_1 = 0.9,\qquad \beta_2 = 0.999,\qquad \varepsilon = 10^{-8}$$
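Putting the pieces together, a minimal sketch of one Adam step with the default hyper-parameters above (`t` starts at 1 so that the de-biasing terms are well defined):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient and squared gradient, then de-biasing."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # de-biased first moment
    v_hat = v / (1 - beta2 ** t)              # de-biased second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```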

Page 17

What have you learned before?

• SGD [Cauchy, 1847]

• SGD with momentum [Rumelhart, et al., Nature'86]

• Adagrad [Duchi, et al., JMLR'11]

• RMSProp [Hinton, et al., Lecture slides, 2013]

• Adam [Kingma, et al., ICLR'15]

Adagrad, RMSProp and Adam use an adaptive learning rate.

Page 18

Optimizers: Real Application

• BERT: Adam

• Transformer: Adam

• Tacotron: Adam

Page 19

Optimizers: Real Application

• Mask R-CNN: SGDM

• ResNet: SGDM

• SGDM (a third example; the model name appears only as an image on the original slide)

Page 20

Optimizers: Real Application

• MAML: Adam

• Big-GAN: Adam

Page 21

Back to 2014…

Page 22

Back to 2014…

ADAM

Page 25

Adam vs SGDM

[Luo, et al., ICLR’19]

Page 26

Adam vs SGDM

[Luo, et al., ICLR’19]

Page 27

Adam vs SGDM

• Adam: fast training, large generalization gap, unstable

• SGDM: stable, little generalization gap, better convergence(?)

An intuitive illustration for generalization gap

Page 28

Simply combine Adam with SGDM?

• SWATS [Keskar, et al., arXiv’17]

Begin with Adam (fast), end with SGDM

(Flow: start training with Adam → when some switching criteria are met, initialize the SGDM learning rate and switch to SGDM → run until convergence)

Page 29

Towards Improving Adam…

• Trouble shooting

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_{t-1})^2,\qquad \beta_2 = 0.999$$

The "memory" of $v_t$ covers roughly the last 1000 steps!!

In the final stage of training, most gradients are small and non-informative, while some mini-batches rarely provide a large, informative gradient.

time step:  …  100000  100001  100002  100003  …  100999      101000
gradient:       1       1       1       1          100000      1
movement:       η       η       η       η          10√10 η     10^(-3.5) η

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_{t-1},\qquad \beta_1 = 0.9$$

Page 30

Towards Improving Adam…

• Trouble shooting

The maximum movement distance for one single update is roughly upper bounded by $\frac{1}{\sqrt{1-\beta_2}}\,\eta = \sqrt{1000}\,\eta$.

time step:  …  100000  100001  100002  100003  …  100999      101000
gradient:       1       1       1       1          100000      1
movement:       η       η       η       η          10√10 η     10^(-3.5) η

Non-informative gradients contribute more than informative gradients.
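To see where the bound above comes from, a short worst-case sketch (ignoring the momentum EMA and de-biasing for simplicity): if $v_{t-1} \approx 0$ and a single large gradient $g$ arrives at step $t$, then

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g^2 \;\ge\; (1-\beta_2)\,g^2 \quad\Longrightarrow\quad \frac{\eta\,|g|}{\sqrt{v_t}} \;\le\; \frac{\eta\,|g|}{\sqrt{1-\beta_2}\,|g|} \;=\; \frac{\eta}{\sqrt{1-\beta_2}} = \sqrt{1000}\,\eta \approx 31.6\,\eta \qquad (\beta_2 = 0.999).$$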

Page 31

Towards Improving Adam…

• AMSGrad [Reddi, et al., ICLR’18]

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\, m_t,\qquad \hat{v}_t = \max(\hat{v}_{t-1}, v_t)$$

Reduce the influence of non-informative gradients

De-biasing is removed because of the max operation

Monotonically decreasing learning rate: remember Adagrad vs RMSProp?
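A minimal NumPy sketch of the AMSGrad update above (Adam's usual default hyper-parameter values are assumed here):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, grad, eta=0.001,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad: like Adam, but divide by the running MAX of v_t and skip de-biasing."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)              # never decreases, so the lr never grows back
    theta = theta - eta * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```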

Page 32

Towards Improving Adam…

• Trouble shooting

In the final stage of training, most gradients are small and non-informative, while some mini-batches rarely provide a large, informative gradient.

Learning rates are either extremely large (for small gradients) or extremely small (for large gradients).

Page 33

Towards Improving Adam…

• AMSGrad only handles large learning rates

• AdaBound [Luo, et al., ICLR’19]

$$\theta_t = \theta_{t-1} - \mathrm{Clip}\!\left(\frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\right)\hat{m}_t$$

$$\mathrm{Clip}(x) = \mathrm{Clip}\!\left(x,\; 0.1 - \frac{0.1}{(1-\beta_2)\,t + 1},\; 0.1 + \frac{0.1}{(1-\beta_2)\,t}\right)$$

That's not "adaptive" at all…
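A minimal sketch following the clipping rule on the slide; how de-biasing interacts with the clip is simplified here, so treat it as an illustration rather than the reference implementation of [Luo, et al., ICLR'19]:

```python
import numpy as np

def adabound_step(theta, m, v, grad, t, eta=0.001, final_lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaBound sketch: clip Adam's per-parameter learning rate into a band
    that tightens around final_lr as t grows."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    lower = final_lr - final_lr / ((1 - beta2) * t + 1)
    upper = final_lr + final_lr / ((1 - beta2) * t)
    lr = np.clip(eta / (np.sqrt(v_hat) + eps), lower, upper)
    return theta - lr * m_hat, m, v
```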

Page 34

Towards Improving SGDM…

• Adaptive learning rate algorithms: dynamically adjust the learning rate over time

• SGD-type algorithms: fix the learning rate for all updates… too slow with a small learning rate, and poor results with a large one

There might be a “best” learning rate?

Page 35

Towards Improving SGDM

• LR range test [Smith, WACV’17]

Page 36

Towards Improving SGDM

• Cyclical LR (a short schedule sketch follows below)

• learning rate: decided by the LR range test

• stepsize: several epochs

• avoid local minima by varying the learning rate

[Smith, WACV’17]

The more exploration the better!
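A minimal sketch of the triangular cyclical schedule from [Smith, WACV'17]; `base_lr` and `max_lr` would come from the LR range test, and the specific default values here are made up for illustration:

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.1, stepsize=2000):
    """Triangular cyclical LR: ramp linearly from base_lr up to max_lr and back,
    with a half-cycle of `stepsize` iterations."""
    cycle = step // (2 * stepsize)
    x = abs(step / stepsize - 2 * cycle - 1)   # position inside the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# lr rises for the first 2000 steps, falls for the next 2000, then repeats
lrs = [triangular_lr(s) for s in range(8000)]
```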

Page 37

Towards Improving SGDM

• SGDR [Loshchilov, et al., ICLR’17]

Page 38

Towards Improving SGDM

• One-cycle LR

• warm-up + annealing + fine-tuning

[Smith, et al., arXiv’17]

Page 39

Does Adam need warm-up?

Of Course!

Experiments show that the gradient distribution is distorted in the first ~10 steps of training

Page 40

Does Adam need warm-up?

Distorted gradients → distorted EMA of squared gradients → bad learning rate

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_{t-1})^2$$

Keeping your step size small at the beginning of training helps to reduce the variance of the gradients.

Page 41

Does Adam need warm-up?

• RAdam [Liu, et al., ICLR’20]

$$\rho_t = \rho_\infty - \frac{2t\,\beta_2^{\,t}}{1-\beta_2^{\,t}} \quad\text{(effective memory size of the EMA)}$$

$$\rho_\infty = \frac{2}{1-\beta_2} - 1 \quad\text{(max memory size, } t \to \infty\text{)}$$

$$r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\,\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\,\rho_t}} \quad\text{(approximates } \mathrm{Var}[1/\hat{v}_\infty]\,/\,\mathrm{Var}[1/\hat{v}_t]\text{, for } \rho_t > 4\text{)}$$

When $\rho_t \le 4$ (first few steps of training):
$$\theta_t = \theta_{t-1} - \eta\,\hat{m}_t$$

When $\rho_t > 4$:
$$\theta_t = \theta_{t-1} - \frac{\eta\, r_t}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$$

$r_t$ is increasing through time!
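A minimal sketch of the RAdam update following the formulas above; treat it as illustrative rather than a faithful re-implementation of [Liu, et al., ICLR'20]:

```python
import math
import numpy as np

def radam_step(theta, m, v, grad, t, eta=0.001,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """RAdam sketch: rectify the adaptive learning rate once rho_t > 4;
    before that, fall back to a plain (de-biased) momentum step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:
        denom = np.sqrt(v / (1 - beta2 ** t))           # de-biased second moment
        r_t = math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                        ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - eta * r_t * m_hat / (denom + eps)
    else:
        theta = theta - eta * m_hat                     # un-adapted warm-up step
    return theta, m, v
```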

Page 42

RAdam vs SWATS

• Inspiration
  RAdam: distortion of the gradient at the beginning of training results in an inaccurate adaptive learning rate.
  SWATS: non-convergence and generalization gap of Adam; slow training of SGDM.

• How?
  RAdam: apply a warm-up learning rate to reduce the influence of the inaccurate adaptive learning rate.
  SWATS: combine their advantages by applying Adam first, then SGDM.

• Switch
  RAdam: SGDM to RAdam.
  SWATS: Adam to SGDM.

• Why switch?
  RAdam: the approximation of the variance of $\hat{v}_t$ is invalid at the beginning of training.
  SWATS: to pursue better convergence.

• Switch point
  RAdam: when the approximation becomes valid.
  SWATS: some human-defined criteria.

Page 43

k step forward, 1 step back

• Lookahead [Zhang, et al., arXiv’19]

universal wrapper for all optimizers

For $t = 1, 2, \dots$ (outer loop):
    $\theta_{t,0} = \phi_{t-1}$
    For $i = 1, 2, \dots, k$ (inner loop):
        $\theta_{t,i} = \theta_{t,i-1} + \mathrm{Optim}(Loss, data, \theta_{t,i-1})$
    $\phi_t = \phi_{t-1} + \alpha(\theta_{t,k} - \phi_{t-1})$

Similar to Reptile?

$\mathrm{Optim}$ can be any optimizer, e.g. Ranger = RAdam + Lookahead.
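A minimal sketch of the inner/outer loop above; the plain-SGD inner step and the toy quadratic loss are stand-ins, since Optim can be any optimizer:

```python
import numpy as np

def lookahead(phi, inner_step, grad_fn, k=5, alpha=0.5, outer_steps=100):
    """Lookahead: run k fast inner-optimizer steps from the slow weights phi,
    then move phi a fraction alpha towards where the fast weights ended up."""
    for _ in range(outer_steps):
        theta = phi.copy()                        # fast weights start at the slow weights
        for _ in range(k):
            theta = inner_step(theta, grad_fn(theta))
        phi = phi + alpha * (theta - phi)         # "1 step back": interpolate
    return phi

# Stand-in inner optimizer (plain SGD) on L(theta) = ||theta||^2.
phi = lookahead(np.array([1.0, -2.0]),
                inner_step=lambda th, g: th - 0.01 * g,
                grad_fn=lambda th: 2 * th)
```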

Page 44

k step forward, 1 step back

• Lookahead

1 step back: avoid exploration that is too risky

[Zhang, et al., arXiv'19]

Look for a flatter minimum

Better generalization, more stable

Page 45

More than momentum…

• Momentum recap

Stop! Let me take a look!

Credit to Prof. Hung-yi Lee's lecture slides

Page 46

More than momentum…

• Momentum recap

Stop! Let me take a look!

You can't make it? You should have said so first!

As if I could see the future…

Credit to Prof. Hung-yi Lee's lecture slides

Page 47

Can we look into the future?

• Nesterov accelerated gradient (NAG)

• SGDM

• Look into the future…

[Nesterov, Dokl. Akad. Nauk SSSR'83]

SGDM:
$$m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1}),\qquad \theta_t = \theta_{t-1} - m_t$$

NAG:
$$m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1} - \lambda m_{t-1}),\qquad \theta_t = \theta_{t-1} - m_t$$

Math Warning

Do we need to maintain a duplicate copy of the model parameters?

Page 48

Can we look into the future?

• Nesterov accelerated gradient (NAG)

• SGDM

NAG:
$$m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1} - \lambda m_{t-1}),\qquad \theta_t = \theta_{t-1} - m_t$$

Let $\theta_t' = \theta_t - \lambda m_t$. Then
$$\theta_t' = \theta_{t-1} - m_t - \lambda m_t = \theta_{t-1} - \lambda m_t - \lambda m_{t-1} - \eta\nabla L(\theta_{t-1} - \lambda m_{t-1}) = \theta_{t-1}' - \lambda m_t - \eta\nabla L(\theta_{t-1}'),$$
$$m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1}').$$

SGDM:
$$m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1}),\qquad \theta_t = \theta_{t-1} - m_t$$
or equivalently
$$\theta_t = \theta_{t-1} - \lambda m_{t-1} - \eta\nabla L(\theta_{t-1}),\qquad m_t = \lambda m_{t-1} + \eta\nabla L(\theta_{t-1})$$

So, written in terms of $\theta'$, NAG takes the same form as SGDM: no need to maintain a duplicate copy of the model parameters.

So you also know about advance deployment?
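A minimal sketch of the reformulated NAG update derived above, which tracks only $\theta'$ and $m$ (no duplicate copy of the parameters); the toy gradient function is an illustrative assumption:

```python
import numpy as np

def nag_step(theta_prime, m, grad_fn, eta=0.01, lam=0.9):
    """Reformulated NAG: the gradient is evaluated directly at theta_prime,
    so no second copy of the model parameters is needed."""
    g = grad_fn(theta_prime)
    m = lam * m + eta * g
    theta_prime = theta_prime - lam * m - eta * g
    return theta_prime, m

theta_p, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta_p, m = nag_step(theta_p, m, lambda th: 2 * th)
```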

Page 49

Adam in the future

• Nadam

• SGDM

[Dozat, ICLR workshop’16]

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$$

Nadam:
$$\hat{m}_t = \frac{\beta_1 m_t}{1-\beta_1^{\,t+1}} + \frac{(1-\beta_1)\, g_{t-1}}{1-\beta_1^{\,t}}$$

Adam:
$$\hat{m}_t = \frac{1}{1-\beta_1^{\,t}}\bigl(\beta_1 m_{t-1} + (1-\beta_1)\, g_{t-1}\bigr) = \frac{\beta_1 m_{t-1}}{1-\beta_1^{\,t}} + \frac{(1-\beta_1)\, g_{t-1}}{1-\beta_1^{\,t}}$$

Page 50

Do you really know your optimizer?

• A story of L2 regularization…

$$L_{l2}(\theta) = L(\theta) + \gamma\|\theta\|^2$$

SGD:
$$\theta_t = \theta_{t-1} - \nabla L_{l2}(\theta_{t-1}) = \theta_{t-1} - \nabla L(\theta_{t-1}) - \gamma\theta_{t-1}$$

SGDM:
$$\theta_t = \theta_{t-1} - \lambda m_{t-1} - \eta\bigl(\nabla L(\theta_{t-1}) + \gamma\theta_{t-1}\bigr)$$
$$m_t = \lambda m_{t-1} + \eta\bigl(\nabla L(\theta_{t-1}) + \gamma\theta_{t-1}\bigr)\,?\qquad\text{or}\qquad m_t = \lambda m_{t-1} + \eta\,\nabla L(\theta_{t-1})\,?$$

Adam:
$$m_t = \lambda m_{t-1} + \eta\bigl(\nabla L(\theta_{t-1}) + \gamma\theta_{t-1}\bigr)\,?\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\bigl(\nabla L(\theta_{t-1}) + \gamma\theta_{t-1}\bigr)^2\,?$$

Page 51

Do you really know your optimizer?

• AdamW & SGDW with momentum

AdamW:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L(\theta_{t-1}),\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\bigl(\nabla L(\theta_{t-1})\bigr)^2$$
$$\theta_t = \theta_{t-1} - \eta\left(\frac{1}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t + \gamma\theta_{t-1}\right)$$

SGDWM:
$$m_t = \lambda m_{t-1} + \eta\,\nabla L(\theta_{t-1}),\qquad \theta_t = \theta_{t-1} - m_t - \gamma\theta_{t-1}$$

[Loshchilov, arXiv’17]

L2 regularization or weight decay?
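To make the distinction concrete, a minimal sketch contrasting the two choices in an Adam-style update; the hyper-parameter values are illustrative assumptions:

```python
import numpy as np

def adam_l2_step(theta, m, v, grad, t, eta=0.001, gamma=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """L2 regularization: the decay term gamma*theta is folded into the gradient,
    so it also passes through m_t, v_t and the adaptive scaling."""
    g = grad + gamma * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, eta=0.001, gamma=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay (AdamW): m_t and v_t see only the loss gradient;
    the decay is applied directly to the parameters at the end."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + gamma * theta), m, v
```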

Page 52

Something helps optimization…

• Shuffling

• Dropout

• Gradient noise

The more exploration, the better!

$$g_{t,i} = g_{t,i} + N(0, \sigma_t^2),\qquad \sigma_t = \frac{c}{(1+t)^\gamma}$$

[Neelakantan, et al., arXiv’15]
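A minimal sketch of the gradient-noise formula above; the values of `c` and `gamma` here are placeholders rather than the settings used in [Neelakantan, et al., arXiv'15]:

```python
import numpy as np

def add_gradient_noise(grad, t, c=1.0, gamma=0.55):
    """Add annealed Gaussian noise to the gradient: sigma_t = c / (1 + t)^gamma."""
    sigma = c / (1 + t) ** gamma
    return grad + np.random.normal(0.0, sigma, size=grad.shape)
```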

Page 53

Something helps optimization…

• Warm-up

• Curriculum learning

• Fine-tuning

Teach your model patiently!

[Bengio, et al., ICML’09]

Train your model with easy data (e.g. clean speech) first, then difficult data.

This perhaps helps to improve generalization.

Page 54

Something helps optimization…

• Normalization

• Regularization

Page 55

What have we learned today?

Team SGD

• SGD

• SGDM

• Learning rate scheduling

• NAG

• SGDWM

Team Adam

• Adagrad

• RMSProp

• Adam

• AMSGrad

• AdaBound

• Learning rate scheduling

• RAdam

• Nadam

• AdamW

SWATS: switches from team Adam to team SGD

AMSGrad and AdaBound: deal with extreme values of the learning rate

Lookahead: a wrapper that works with either team

Page 56

What have we learned today?

SGDM

• Slow

• Better convergence

• Stable

• Smaller generalization gap

Adam

• Fast

• Possible non-convergence

• Unstable

• Larger generalization gap

Page 57

Advice

SGDM

• Computer vision (image classification, segmentation, object detection)

Adam

• NLP (QA, machine translation, summarization)

• Speech synthesis

• GAN

• Reinforcement learning

Page 58

Universal Optimizer?

No Way!!!

Page 59

What I think I am going to learn in this class

Page 60

What I actually learned in this class

Page 61

Reference

• [Ruder, arXiv’17] Sebastian Ruder, “An Overview of Gradient Descent Optimization Algorithms”, arXiv, 2017

• [Rumelhart, et al., Nature’86] David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, “Learning Representations by Back-Propagating Errors”, Nature, 1986

• [Duchi, et al., JMLR’11] John Duchi, Elad Hazan and Yoram Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization”, JMLR, 2011

• [Hinton, et al., Lecture slides, 2013] Geoffrey Hinton, Nitish Srivastava and Kevin Swersky, ”RMSProp”, Lecture slides, 2013

• [Kingma, et al., ICLR'15] Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", ICLR, 2015

Page 62

Reference

• [Keskar, et al., arXiv’17] Nitish Shirish Keskar and Richard Socher, “Improving Generalization Performance by Switching from Adam to SGD”, arXiv, 2017

• [Reddi, et al., ICLR’18] Sashank J. Reddi, Satyen Kale and Sanjiv Kumar, “On the Convergence of Adam and Beyond”, ICLR, 2018

• [Luo, et al., ICLR’19] Liangchen Luo, Yuanhao Xiong, Yan Liu and Xu Sun, “Adaptive Gradient Methods with Dynamic Bound of Learning Rate”, ICLR, 2019

• [Smith, WACV’17] Leslie N. Smith, “Cyclical Learning Rates for Training Neural Networks”, WACV, 2017

• [Loshchilov, et al., ICLR’17] Ilya Loshchilov and Frank Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR, 2017

Page 63

Reference

• [Smith, et al., arXiv’17] Leslie N. Smith and Nicholay Topin, “Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates”, arXiv, 2017

• [Liu, et al., ICLR’20] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Jiawei Han, “On the Variance of the Adaptive Learning Rate and Beyond”, ICLR, 2020

• [Zhang, et al., arXiv’19] Michael R. Zhang, James Lucas, Geoffrey Hinton and Jimmy Ba, “Lookahead Optimizer: k Step Forward, 1 Step Back”, arXiv, 2019

• [Nesterov, Dokl. Akad. Nauk SSSR'83] Yu. E. Nesterov, "A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2)", Dokl. Akad. Nauk SSSR, 1983

Page 64

Reference

• [Dozat, ICLR workshop'16] Timothy Dozat, "Incorporating Nesterov Momentum into Adam", ICLR workshop, 2016

• [Loshchilov, arXiv’17] Ilya Loshchilov and Frank Hutter, “Fixing Weight Decay Regularization in Adam”, arXiv, 2017

• [Neelakantan, et al., arXiv’15] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach and James Martens, “Adding Gradient Noise Improves Learning for Very Deep Networks“, arXiv, 2015

• [Bengio, et al., ICML'09] Yoshua Bengio, Jérôme Louradour, Ronan Collobert and Jason Weston, "Curriculum Learning", ICML, 2009