Introduction to Deep Learning (Optimization Algorithms)

WM CS, Zeyi (Tim) Tao, 11/01/2019

Transcript of "Introduction to Deep Learning (Optimization Algorithms)" (…liqun/teaching/cs680_19f/dl2.pdf)

Page 1:

WM CS, Zeyi (Tim) Tao, 11/01/2019

Introduction to Deep Learning (Optimization Algorithms)

Page 2:

Topics

SGD

SGDM

AdaGrad

Adam

AdaDelta

RMSprop

Adaptive LR

Page 3:

ERM Problem Statement

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \cdots, (X_N, d_N)$

• Minimize the following function (w.r.t. $W$):

$$\mathrm{Loss}(W) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{div}\big(f(W; X_i),\ d_i\big) + \gamma(w)$$

• This is a problem of function minimization

Term annotations: $\mathrm{Loss}(W)$ is the total loss, the sum runs over all training instances, $\mathrm{div}(\cdot,\cdot)$ is the measurement (divergence) function, $f(W; X_i)$ is the output of the net in response to the input, $d_i$ is the desired output (label), and $\gamma(w)$ is the regularizer. A minimal sketch of this objective in code follows.
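
To make the objective concrete, here is a minimal NumPy sketch (not from the slides) of the regularized empirical risk; the linear `forward` model, the squared `div`, the L2 regularizer, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

def forward(W, x):
    # Hypothetical linear model standing in for f(W; x).
    return W @ x

def squared_div(d_hat, d):
    # One possible choice of div(): scaled squared error.
    return 0.5 * np.sum((d_hat - d) ** 2)

def l2_regularizer(W):
    # gamma(w) = 1/2 ||w||^2 (see the regularization slides).
    return 0.5 * np.sum(W ** 2)

def empirical_risk(W, X, D, lam=0.01):
    # Loss(W) = (1/N) sum_i div(f(W; X_i), d_i) + lam * gamma(W)
    N = X.shape[0]
    data_term = sum(squared_div(forward(W, X[i]), D[i]) for i in range(N)) / N
    return data_term + lam * l2_regularizer(W)

# Toy usage with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # N = 8 inputs with 3 features each
D = rng.normal(size=(8, 2))   # 2-dimensional targets
W = rng.normal(size=(2, 3))
print(empirical_risk(W, X, D))
```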

Page 4:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

Prediction model: $f_w(x_i) = w x_i + b$

Square loss:

$$\mathrm{Loss}_w = \frac{1}{N}\sum_{i=1}^{N}\big(d_i - f_w(x_i)\big)^2$$

Suppose $d_i = w x_i + b + n$, where $n \sim \mathrm{Normal}(0, 1)$. Then

$$\mathbb{E}[d_i] = \mathbb{E}[w x_i + b + n] = w x_i + b, \qquad \mathrm{Var}[d_i] = \mathrm{Var}[w x_i + b + n] = 1$$

Page 5:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

The probability of observing a single pair $(x_i, d_i)$:

$$p(d_i \mid x_i) = e^{-\frac{(d_i - (w x_i + b))^2}{2}}$$

The maximum likelihood objective:

$$\mathrm{Like}(d, x) = \prod_{i=1}^{N} e^{-\frac{(d_i - (w x_i + b))^2}{2}}$$

Page 6:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

$$\mathrm{Like}(d, x) = \prod_{i=1}^{N} e^{-\frac{(d_i - (w x_i + b))^2}{2}}$$

Taking the log gives the quantity to maximize:

$$l(d, x) = -\frac{1}{2}\sum_{i=1}^{N}\big(d_i - (w x_i + b)\big)^2 \quad \text{(MAX)}$$

which is equivalent to minimizing the mean square error:

$$\mathrm{MSE}(d, x) = \frac{1}{2}\sum_{i=1}^{N}\big(d_i - f_w(x_i)\big)^2 \quad \text{(MIN)}$$
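
As a quick numerical check (a sketch with an assumed toy data-generating model, not from the slides): once the constant Gaussian normalizer is dropped, the negative log-likelihood and the scaled squared error are the same function, so maximizing one is minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
w_true, b_true = 2.0, -1.0
d = w_true * x + b_true + rng.normal(size=50)   # d_i = w x_i + b + n, n ~ Normal(0, 1)

def neg_log_like(w, b):
    # -log prod_i exp(-(d_i - (w x_i + b))^2 / 2), dropping the constant normalizer.
    return 0.5 * np.sum((d - (w * x + b)) ** 2)

def scaled_mse(w, b):
    # (1/2) sum_i (d_i - f_w(x_i))^2 for the linear model f_w(x) = w x + b.
    return 0.5 * np.sum((d - (w * x + b)) ** 2)

# Identical values for any (w, b): the two objectives share their minimizer.
print(neg_log_like(1.5, 0.0), scaled_mse(1.5, 0.0))
```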

Page 7:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

Prediction model: $f_w(x_i) = \sigma(w x_i + b)$, where $\sigma(\cdot)$ is assumed to be a sigmoid/softmax-like squashing function.

$$p(d_i = 1 \mid x_i) = f_w(x_i), \qquad p(d_i = 0 \mid x_i) = 1 - f_w(x_i)$$

$$p(d_i \mid x_i) = [f_w(x_i)]^{d_i}\,[1 - f_w(x_i)]^{(1 - d_i)}$$

$$\mathrm{Like}(d, x) = \prod_{i=1}^{N} [f_w(x_i)]^{d_i}\,[1 - f_w(x_i)]^{(1 - d_i)}$$

Page 8:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

$$\mathrm{Like}(d, x) = \prod_{i=1}^{N} [f_w(x_i)]^{d_i}\,[1 - f_w(x_i)]^{(1 - d_i)}$$

The negative log-likelihood gives the cross-entropy loss:

$$\mathrm{like}(d, x) = -\sum_{i=1}^{N} \Big[ d_i \log\big(f_w(x_i)\big) + (1 - d_i)\log\big(1 - f_w(x_i)\big) \Big] \quad \text{(binary)}$$

$$\mathrm{like}(d, x) = -\sum_{i=1}^{N}\sum_{j=1}^{C} \Big[ d_{ij} \log\big(f_w(x_i)_j\big) + (1 - d_{ij})\log\big(1 - f_w(x_i)_j\big) \Big] \quad \text{(multi-class, $C$ classes)}$$
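
A minimal sketch of the binary case (the sigmoid model, the `eps` safety term, and the toy data are assumptions, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, b, x, d):
    # -sum_i [ d_i log f_w(x_i) + (1 - d_i) log(1 - f_w(x_i)) ]
    p = sigmoid(w * x + b)
    eps = 1e-12                      # numerical safety term, not in the slides
    return -np.sum(d * np.log(p + eps) + (1 - d) * np.log(1 - p + eps))

# Toy usage: labels in {0, 1}.
rng = np.random.default_rng(2)
x = rng.normal(size=20)
d = (x > 0).astype(float)
print(binary_cross_entropy(1.0, 0.0, x, d))
```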

Page 9:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

$$f_w(x_i) = \sigma(w x_i + b) = \hat{d}_i$$

Mean square loss:

$$\mathrm{Loss} = \frac{1}{2}\sum_{i=1}^{N}(\hat{d}_i - d_i)^2, \qquad \frac{d\,\mathrm{Loss}}{dw} = \sum_{i=1}^{N} (\hat{d}_i - d_i)\,\sigma'(w x_i + b)\, x_i$$

Cross-entropy:

$$\mathrm{Loss} = -\sum_{i=1}^{N}\Big[ d_i \log(\hat{d}_i) + (1 - d_i)\log(1 - \hat{d}_i) \Big], \qquad \frac{d\,\mathrm{Loss}}{dw} = \sum_{i=1}^{N} x_i\big(\sigma(w x_i + b) - d_i\big)$$

Page 10:

[Figure: a k-layer feed-forward network with inputs x_1, x_2, x_3, hidden activations σ_1, …, σ_{k-2}, σ_{k-1}, σ_k, pre-activations z^{(1)}, …, z^{(k)}, and weights w^{(2)}, ….]

$$Y = \sigma_k\Big(W_k\,\sigma_{k-1}\big(\cdots\sigma_2\big(W_2\,\sigma_1(W_1 x + b_1)\big)\cdots\big) + b_k\Big)$$

The vanishing gradient problem

Page 11:

Choice of div() Functions: Mean Square Loss vs. Cross-Entropy

Mean square loss:

$$\frac{d\,\mathrm{Loss}}{dw} = \sum_{i=1}^{N} (\hat{d}_i - d_i)\,\sigma'(w x_i + b)\, x_i$$

Cross-entropy:

$$\frac{d\,\mathrm{Loss}}{dw} = \sum_{i=1}^{N} x_i\big(\sigma(w x_i + b) - d_i\big)$$
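
To see why the cross-entropy gradient behaves better with saturating activations, here is a small numerical comparison of the two formulas above (a sketch with an assumed single training pair and a deliberately saturated weight; not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy setting: one (x_i, d_i) pair and a weight that pushes
# sigma(w x + b) deep into the flat region of the sigmoid.
x, d = 1.0, 1.0
w, b = -10.0, 0.0
z = w * x + b
d_hat = sigmoid(z)
sigma_prime = d_hat * (1.0 - d_hat)          # derivative of the sigmoid

grad_mse = (d_hat - d) * sigma_prime * x     # dLoss/dw for the square loss
grad_ce = x * (d_hat - d)                    # dLoss/dw for cross-entropy

# The square-loss gradient is suppressed by sigma'(z) ~ 0, while the
# cross-entropy gradient stays large: the vanishing-gradient contrast.
print(f"sigma'(z)              = {sigma_prime:.2e}")
print(f"MSE gradient           = {grad_mse:.2e}")
print(f"cross-entropy gradient = {grad_ce:.2e}")
```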

Page 12:

Regularization

$$\mathrm{Loss}(W) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{div}\big(f(W; X_i),\ d_i\big) + \lambda\,\gamma(w)$$

$$\gamma(w) = \frac{1}{2}\|w\|^2 = \frac{1}{2}\|w - 0\|^2$$
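
A minimal sketch of this L2 regularizer and its gradient (the `lam` value is an assumed placeholder); the gradient λw is why L2 regularization acts as weight decay in the update rule:

```python
import numpy as np

def l2_penalty(w, lam):
    # lambda * gamma(w) with gamma(w) = 1/2 ||w||^2
    return lam * 0.5 * np.sum(w ** 2)

def l2_penalty_grad(w, lam):
    # d/dw [ lambda * 1/2 ||w||^2 ] = lambda * w, so the SGD update becomes
    # w <- w - eta * (data_gradient + lam * w), i.e. "weight decay".
    return lam * w

w = np.array([3.0, -2.0, 0.5])
print(l2_penalty(w, lam=0.1), l2_penalty_grad(w, lam=0.1))
```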

Page 13:

Regularization

Candidate models of increasing complexity:

$$w_1 x_1 + b$$

$$w_1 x_1^2 + w_2 x_1 + b$$

$$w_1 x_1^n + w_2 x_1^{n-1} + \cdots + w_{n-1} x_1^2 + w_n x_1 + b$$

Page 14:

Regularization

$$\min_w\ \frac{1}{N}\sum_{i=1}^{N} \mathrm{div}\big(f(W; X_i),\ d_i\big) + \lambda\,\gamma(w)$$

$$\min_w\ \frac{1}{N}\sum_{i=1}^{N} \mathrm{div}\big(f(W; X_i),\ d_i\big) + 100\,w_n^2 + 100\,w_{n-1}^2 + \cdots$$

A large penalty on the high-order coefficients drives them toward zero, effectively selecting a simpler model.

Page 15:

Deep neural network

[Figure: a deep network f_t(W_t) with weights w_1, w_2, w_3, w_4, w_5, inputs x_i ∼ 𝒟, and targets d_1, d_2, d_3, ⋯ compared through an L2 div() loss.]

Page 16:

Unconstrained First-order Optimization

• For real-valued output vectors, the (scaled) L2 divergence is popular

$$\mathrm{Div}(\hat{d}, d) = \frac{1}{2}\|\hat{d} - d\|^2 = \frac{1}{2}\sum_i (\hat{d}_i - d_i)^2$$

Page 17:

Unconstrained First-order Optimization

ERM, convex case: local optimum = global optimum

First order, Gradient Descent (linear approximation):

$$f(w + \Delta w) \approx f(w) + f'(w)\,\Delta w$$

Second order, Newton's Method (quadratic approximation):

$$f(w + \Delta w) \approx f(w) + f'(w)\,\Delta w + \frac{1}{2} f''(w)\,(\Delta w)^2$$
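
A one-dimensional sketch contrasting the two update rules on an assumed toy quadratic f(w) = (w - 3)^2 + 1 (not from the slides):

```python
# First-order (gradient) step vs. second-order (Newton) step in 1-D.
def f_prime(w):
    return 2.0 * (w - 3.0)     # f'(w) for f(w) = (w - 3)^2 + 1

def f_second(w):
    return 2.0                 # f''(w) is constant for a quadratic

w, eta = 10.0, 0.1
w_gd = w - eta * f_prime(w)               # w <- w - eta * f'(w)
w_newton = w - f_prime(w) / f_second(w)   # w <- w - f'(w) / f''(w)

# Newton jumps straight to the minimizer w = 3 for a quadratic;
# the gradient step only moves part of the way, controlled by eta.
print(w_gd, w_newton)
```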

Page 18:

Gradient descent variants

Batch gradient descent

$$w_t = w_{t-1} - \eta_k\,\nabla_w f\big(W;\ X = \{(x_1, d_1), (x_2, d_2), \cdots\}\big)$$

MNIST: 60k × 28 × 28

Page 19:

Gradient descent variants

Batch gradient descent

$$w_t = w_{t-1} - \eta_k\,\nabla_w f\big(W;\ X = \{(x_1, d_1), (x_2, d_2), \cdots\}\big)$$

Stochastic gradient descent

$$w_t = w_{t-1} - \eta_k\,\nabla_w f\big(W;\ (x_i, d_i)\big)$$

Mini-batch gradient descent (shuffle the data first)

$$w_t = w_{t-1} - \eta_k\,\nabla_w f\big(W;\ X = \{(x_i, d_i) \mid i = 1, 2, \cdots, b\}\big)$$
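
A minimal mini-batch SGD loop for a least-squares toy problem (the model, batch size `b`, learning rate `eta`, and epoch count are assumed values, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
d = X @ w_true + 0.1 * rng.normal(size=256)

def grad(w, Xb, db):
    # Gradient of the (half) mean squared error on one mini-batch.
    return Xb.T @ (Xb @ w - db) / len(db)

w = np.zeros(5)
eta, b = 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))                  # "shuffle" step from the slide
    for start in range(0, len(X), b):
        idx = perm[start:start + b]
        w = w - eta * grad(w, X[idx], d[idx])       # w_t = w_{t-1} - eta_k g_t

print(np.max(np.abs(w - w_true)))                   # should be small
```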

Page 20:

Batch Gradient Descent vs. SGD

Page 21:

Gradient descent variants

Batch gradient descent

• Pro: fewer oscillations and less noise; vectorization; stable
• Con: can settle into poor local optima; not memory friendly

Stochastic gradient descent

• Pro: computationally fast per update; fast convergence on large datasets; fits into memory
• Con: noisy; longer overall training time; no vectorization

Mini-batch gradient descent: a compromise between the two.

Page 22:

Mini-batch gradient descent (Vanilla SGD)

$$w_t = w_{t-1} - \eta_k\,\nabla_w f\big(W;\ (x_i, d_i)\big)$$

$$w_t = w_{t-1} - \eta_k\, g_t$$

where $\eta_k$ is the learning rate (step size) and $g_t$ is the gradient (direction).

Page 23:

Learning Rate I

Page 24:

Learning Rate II

Page 25:

Learning Rate III: Strategies

Step decay

Cyclical Learning Rate
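
Sketches of the two schedules named above, with assumed hyperparameters (base rate, drop factor, bounds, step size); the cyclical version follows the common triangular form:

```python
import math

def step_decay(eta0, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the base rate by `drop` every `epochs_per_drop` epochs.
    return eta0 * (drop ** (epoch // epochs_per_drop))

def cyclical_lr(eta_min, eta_max, it, step_size=200):
    # Triangular cyclical schedule: the rate bounces linearly between
    # eta_min and eta_max with a period of 2 * step_size iterations.
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    return eta_min + (eta_max - eta_min) * max(0.0, 1.0 - x)

print([round(step_decay(0.1, e), 4) for e in (0, 9, 10, 20)])
print([round(cyclical_lr(1e-4, 1e-2, it), 4) for it in (0, 100, 200, 300, 400)])
```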

Page 26:

SGD to SGD with Momentum (SGDM)

$$v_t = \beta v_{t-1} + \eta g_t, \qquad w_t = w_{t-1} - v_t$$

Pro: • Accelerates SGD • Helps escape local minima • Dampens oscillations (early in training)

Con: • Oscillations near the end (overshooting around the minimum)

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
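
A minimal sketch of the update rule above (`beta` and `eta` are assumed default values):

```python
import numpy as np

def sgdm_step(w, v, g, beta=0.9, eta=0.01):
    # SGD with momentum: v_t = beta * v_{t-1} + eta * g_t;  w_t = w_{t-1} - v_t
    v = beta * v + eta * g
    w = w - v
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.5, 0.5])          # stand-in gradient
w, v = sgdm_step(w, v, g)
print(w, v)
```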

Page 27:

SGD to Adaptive Subgradient Methods (AdaGrad)

$$g_{t,i} = \nabla_{w_{t,i}} f$$

$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

$$w_{t+1,i} = w_{t,i} - \frac{\eta_0}{\sqrt{G_{t,i} + \epsilon}} \odot g_{t,i}$$

Pro: • Well suited to sparse data

Con: • Monotonically decreasing learning rate

AdaGrad performs smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features, and larger updates (i.e., higher learning rates) for parameters associated with infrequent features.
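
A per-parameter sketch of the AdaGrad rule above (`eta0` and `eps` are assumed values):

```python
import numpy as np

def adagrad_step(w, G, g, eta0=0.1, eps=1e-8):
    # G accumulates all past squared gradients: G_t = G_{t-1} + g_t^2
    G = G + g ** 2
    # Element-wise scaled step: w <- w - eta0 / sqrt(G + eps) * g
    w = w - eta0 / np.sqrt(G + eps) * g
    return w, G

w = np.array([1.0, -2.0])
G = np.zeros_like(w)
for g in (np.array([0.5, 0.0]), np.array([0.5, 2.0])):
    w, G = adagrad_step(w, G, g)
print(w, G)
```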

Page 28:

AdaGrad to AdaDelta

$$g_{t,i} = \nabla_{w_{t,i}} f$$

$$v_{t,i} = \beta v_{t-1,i} + (1 - \beta)\, g_{t,i}^2$$

$$w_{t+1,i} = w_{t,i} - \frac{\eta_0}{\sqrt{v_{t,i} + \epsilon}} \odot g_{t,i}$$

Pro: • Well suited to sparse data • Uses a sliding (exponentially decaying) window of squared gradients

Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size.
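
A sketch of the decayed-average update above (RMSprop uses the same form; `beta`, `eta0`, and `eps` are assumed values):

```python
import numpy as np

def rms_step(w, v, g, beta=0.9, eta0=0.001, eps=1e-8):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2  (exponentially decaying average)
    v = beta * v + (1.0 - beta) * g ** 2
    # Element-wise scaled step: w <- w - eta0 / sqrt(v + eps) * g
    w = w - eta0 / np.sqrt(v + eps) * g
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = rms_step(w, v, np.array([0.5, 2.0]))
print(w, v)
```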

Page 29:

AdaGrad to Adam

$$g_{t,i} = \nabla_{w_{t,i}} f$$

$$m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1)\, g_{t,i}$$

$$v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2)\, g_{t,i}^2$$

$$w_{t+1,i} = w_{t,i} - \frac{\eta_0}{\sqrt{v_{t,i} + \epsilon}} \odot m_{t,i}$$

Pro: • Fast in practice (currently popular) • Uses exponential moving averages (EMA)

Con: • Can converge to suboptimal solutions • The EMA diminishes the influence of large, informative gradient changes
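
A sketch of the Adam rule as written above (the slide omits bias correction, so none is applied here; `beta1`, `beta2`, `eta0`, and `eps` are assumed defaults):

```python
import numpy as np

def adam_step(w, m, v, g, beta1=0.9, beta2=0.999, eta0=0.001, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g             # first moment: EMA of g
    v = beta2 * v + (1.0 - beta2) * g ** 2        # second moment: EMA of g^2
    w = w - eta0 / np.sqrt(v + eps) * m           # element-wise scaled step
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, m, v, np.array([0.5, 2.0]))
print(w, m, v)
```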

Page 30:

Adam to ?

Page 31:

Which to use?

• SGD + SGDM: use what you are familiar with; know your data; test on a small batch first
• Adam + SGD: shuffle the data; choose a proper learning rate