
WM CS, Zeyi (Tim) Tao, 11/01/2019

Introduction to Deep Learning (Optimization Algorithms)


Topics

SGD

SGDM

AdaGrad

Adam

AdaDelta

RMSprop

Adaptive LR

ERM Problem Statement

• Given a training set of input-output pairs (x1, d1), (x2, d2), ⋯, (xN, dN)

• Minimize the following loss function (w.r.t. W):

Loss(W) = (1/N) ∑_{i=1}^{N} div(f(W; xi), di) + γ(w)

• This is a problem of function minimization

where:

• the total loss is a sum over all training instances,

• div(·, ·) is the measurement function comparing the output of the net in response to an input with the desired output (label),

• γ(w) is the regularizer.
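As a concrete reading of this objective, here is a minimal NumPy sketch, assuming a linear model for f and squared error for div(); the function name and the λ value are illustrative, not from the slides.

```python
import numpy as np

def empirical_loss(w, b, X, d, lam=0.01):
    """Regularized empirical risk: mean divergence over the training
    set plus an L2 penalty, with squared error as div()."""
    pred = X @ w + b                   # f(W; x_i) for a linear model
    div = (pred - d) ** 2              # per-example divergence
    reg = 0.5 * np.sum(w ** 2)         # gamma(w) = 0.5 * ||w||^2
    return div.mean() + lam * reg

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_loss(np.zeros(3), 0.0, X, d))
```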

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

Prediction model: fw(xi) = w xi + b

Squared loss: Lossw = (1/N) ∑_{i=1}^{N} (di − fw(xi))²

Suppose di = w xi + b + n, where n ∼ Normal(0, 1). Then

E[di] = E[w xi + b + n] = w xi + b

Var[di] = Var[w xi + b + n] = 1

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

The probability of observing a single pair (xi, di):

p(di | xi) = exp(−(di − (w xi + b))² / 2)

The likelihood (to be maximized):

Like(d, x) = ∏_{i=1}^{N} exp(−(di − (w xi + b))² / 2)

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

Like(d, x) = ∏_{i=1}^{N} exp(−(di − (w xi + b))² / 2)

Taking the log gives the log-likelihood to maximize:

l(d, x) = −(1/2) ∑_{i=1}^{N} (di − (w xi + b))²   (MAX)

which is equivalent to minimizing the squared error:

(MIN)  MSE(d, x) = (1/2) ∑_{i=1}^{N} (di − fw(xi))²

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

Prediction model: fw(xi) = σ(w xi + b), where σ() is assumed to be sigmoid/softmax-like

p(di = 1 | xi) = fw(xi),  p(di = 0 | xi) = 1 − fw(xi)

p(di | xi) = [fw(xi)]^di [1 − fw(xi)]^(1−di)

Like(d, x) = ∏_{i=1}^{N} [fw(xi)]^di [1 − fw(xi)]^(1−di)

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

Like(d, x) = ∏_{i=1}^{N} [fw(xi)]^di [1 − fw(xi)]^(1−di)

Taking the negative log gives the cross-entropy loss to minimize:

like(d, x) = −∑_{i=1}^{N} [di log(fw(xi)) + (1 − di) log(1 − fw(xi))]   (binary)

like(d, x) = −∑_{i=1}^{N} ∑_{j=1}^{C} [dij log(fw(xi)j) + (1 − dij) log(1 − fw(xi)j)]   (multi-class, C classes)
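A minimal NumPy sketch of these two cross-entropy sums; the function names are my own, clipping is added to avoid log(0), and D is assumed to be a one-hot label matrix.

```python
import numpy as np

def binary_cross_entropy(d, d_hat, eps=1e-12):
    """-sum_i [d_i log f_w(x_i) + (1 - d_i) log(1 - f_w(x_i))]"""
    d_hat = np.clip(d_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(d * np.log(d_hat) + (1 - d) * np.log(1 - d_hat))

def multiclass_cross_entropy(D, D_hat, eps=1e-12):
    """Same sum taken over examples i and classes j (one-hot D)."""
    D_hat = np.clip(D_hat, eps, 1 - eps)
    return -np.sum(D * np.log(D_hat) + (1 - D) * np.log(1 - D_hat))

# toy usage: 4 examples, binary labels
d = np.array([1, 0, 1, 1])
d_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(d, d_hat))
```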

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

fw(xi) = σ(w xi + b) = d̂i

Squared loss: Loss = (1/2) ∑_{i=1}^{N} (d̂i − di)²

Cross-entropy: Loss = −∑_{i=1}^{N} [di log(d̂i) + (1 − di) log(1 − d̂i)]

Squared-loss gradient: dLoss/dw = ∑_{i=1}^{N} (d̂i − di) σ′(w xi + b) xi

Cross-entropy gradient: dLoss/dw = ∑_{i=1}^{N} xi (σ(w xi + b) − di)

[Figure: a fully connected k-layer network with inputs x1, x2, x3, hidden activations σ1, …, σk−2, σk−1, σk, and pre-activations z(1), …, z(k).]

Y = σk(Wk σk−1(⋯ σ2(W2 σ1(W1 x + b1)) ⋯) + bk)

The vanishing gradient problem

Choice of div() functions: Mean Squared Loss vs. Cross-Entropy

Squared-loss gradient: dLoss/dw = ∑_{i=1}^{N} (d̂i − di) σ′(w xi + b) xi

Cross-entropy gradient: dLoss/dw = ∑_{i=1}^{N} xi (σ(w xi + b) − di)

The squared-loss gradient carries the extra factor σ′(w xi + b), which shrinks toward zero when the sigmoid saturates; the cross-entropy gradient does not.
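To see the σ′ factor at work, here is a small numeric check of the two per-example gradients above, assuming a single example with x = 1 and d = 1; when the sigmoid saturates (large negative w x + b), the squared-loss gradient nearly vanishes while the cross-entropy gradient does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example with label d = 1 and input x = 1
x, d, b = 1.0, 1.0, 0.0
for w in [0.5, -2.0, -6.0]:                # w = -6.0: sigmoid badly saturated
    z = w * x + b
    s = sigmoid(z)
    grad_mse = (s - d) * s * (1 - s) * x   # (d_hat - d) * sigma'(z) * x
    grad_ce  = (s - d) * x                 # x * (sigma(z) - d)
    print(f"w={w:5.1f}  sigma(z)={s:.4f}  dMSE/dw={grad_mse:+.6f}  dCE/dw={grad_ce:+.6f}")
```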

Regularization

Loss(W) = (1/N) ∑_{i=1}^{N} div(f(W; xi), di) + λ γ(w)

γ(w) = (1/2) ||w||² = (1/2) ||w − 0||²
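A small sketch making explicit that adding λ γ(w) to the loss contributes λ w to the gradient (weight decay); the values are illustrative.

```python
import numpy as np

def l2_penalty(w):
    """gamma(w) = 0.5 * ||w||^2 = 0.5 * ||w - 0||^2"""
    return 0.5 * np.dot(w, w)

def l2_penalty_grad(w):
    """Gradient of the penalty: adding lambda * gamma(w) to the loss
    adds lambda * w to the total gradient (weight decay)."""
    return w

w = np.array([3.0, -1.0, 0.5])
lam = 0.1
print(l2_penalty(w), lam * l2_penalty_grad(w))
```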

Regularization

Models of increasing polynomial degree:

w1 x + b

w1 x^2 + w2 x + b

w1 x^n + w2 x^(n−1) + ⋯ + w(n−1) x^2 + wn x + b

Regularization

min_w (1/N) ∑_{i=1}^{N} div(f(W; xi), di) + λ γ(w)

min_w (1/N) ∑_{i=1}^{N} div(f(W; xi), di) + 100 w_n^2 + 100 w_(n−1)^2 + ⋯

Deep neural network

[Figure: a deep network with weights w1, …, w5 mapping inputs xi ∼ 𝒟 through the model ft(Wt) to outputs compared against labels d1, d2, d3, ⋯ under an L2 divergence div().]

Unconstrained First-order Optimization

• For real-valued output vectors, the (scaled) L2 divergence is popular

Div(d̂, d) = (1/2) ||d̂ − d||² = (1/2) ∑_i (d̂i − di)²

Unconstrained First-order Optimization

When the ERM objective is convex, a local optimum is a global optimum.

First order: gradient descent

Second order: Newton's method

Linear (first-order) approximation: f(w + Δw) ≈ f(w) + f′(w) Δw

Quadratic (second-order) approximation: f(w + Δw) ≈ f(w) + f′(w) Δw + (1/2) f″(w) (Δw)²
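A 1-D sketch of the update rules these two approximations suggest, on the toy objective f(w) = (w − 3)²; the step size and iteration count are illustrative.

```python
# toy convex objective f(w) = (w - 3)^2, so f'(w) = 2(w - 3), f''(w) = 2
def f_prime(w):  return 2.0 * (w - 3.0)
def f_double(w): return 2.0

w_gd, w_newton, eta = 0.0, 0.0, 0.1
for _ in range(5):
    w_gd     = w_gd - eta * f_prime(w_gd)                          # first-order (gradient) step
    w_newton = w_newton - f_prime(w_newton) / f_double(w_newton)   # second-order (Newton) step
print(w_gd, w_newton)   # Newton lands on the minimum w = 3 in one step for a quadratic
```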

Gradient descent variants: Batch gradient descent

wt = wt−1 − ηk ∇w f(W; X = {(x1, d1), (x2, d2), ⋯})

MNIST: 60k images of 28 × 28 pixels

Gradient descent variants

Batch gradient descent

wt = wt−1 − ηk ∇w f(W; X = {(x1, d1), (x2, d2), ⋯})

Stochastic gradient descent

wt = wt−1 − ηk ∇w f(W; (xi, di))

Mini-batch gradient descent

wt = wt−1 − ηk ∇w f(W; X = {(xi, di) | i = 1, 2, ⋯, b})   (shuffle the data each epoch)

Batch gradient descent vs. SGD

Gradient descent variants

Batch gradient descent
• Pros: fewer oscillations and less noise; vectorization; stable updates
• Cons: can settle in a local optimum; not memory friendly

Stochastic gradient descent
• Pros: computationally fast per update; fast convergence on large datasets; fits into memory
• Cons: noisy updates; longer overall training time; no vectorization

Mini-batch gradient descent: a compromise between the two.

Mini-batch gradient descent (Vanilla SGD)

wt = wt−1 − ηk ∇w f(W; (xi, di))

wt = wt−1 − ηk gt, where ηk is the learning rate (step size) and gt is the gradient (direction).
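A minimal sketch of this update as a mini-batch training loop, assuming a linear least-squares model; the function name and hyperparameters are illustrative.

```python
import numpy as np

def sgd(X, d, eta=0.05, batch_size=16, epochs=20, seed=0):
    """Mini-batch SGD on a linear least-squares model (illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = 2 * X[batch].T @ (X[batch] @ w - d[batch]) / len(batch)
            w -= eta * g                           # w_t = w_{t-1} - eta_k * g_t
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
d = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, d))
```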

Learning Rate I

Learning Rate II

Learning Rate III: Strategies

Step decay

Cyclical Learning Rate
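Sketches of these two strategies, with illustrative constants: step decay multiplies the rate by a fixed factor every few epochs, and a triangular cyclical schedule sweeps the rate between a lower and an upper bound.

```python
def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the initial rate by `drop` every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def cyclical(eta_min, eta_max, it, cycle=200):
    """Triangular cyclical learning rate: linearly up for half a cycle,
    then linearly back down."""
    pos = it % cycle
    half = cycle / 2
    frac = pos / half if pos < half else (cycle - pos) / half
    return eta_min + (eta_max - eta_min) * frac

print([round(step_decay(0.1, e), 4) for e in (0, 10, 20)])
print([round(cyclical(1e-4, 1e-2, t), 4) for t in (0, 50, 100, 150)])
```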

SGD to SGD with Momentum (SGDM)

vt = β vt−1 + η gt

wt = wt−1 − vt

Pros: accelerates SGD; helps overcome local minima; dampens oscillations early in training.

Cons: can oscillate around the minimum near the end of training.

The momentum term grows for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction.
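A minimal NumPy sketch of this momentum update on a toy quadratic objective; hyperparameters are illustrative.

```python
import numpy as np

def sgdm_step(w, v, grad, eta=0.01, beta=0.9):
    """v_t = beta * v_{t-1} + eta * g_t ;  w_t = w_{t-1} - v_t"""
    v = beta * v + eta * grad
    w = w - v
    return w, v

w, v = np.zeros(3), np.zeros(3)
for _ in range(100):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic objective
    w, v = sgdm_step(w, v, grad)
print(w)
```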

SGD to Adaptive Subgradient Methods (AdaGrad)

gt,i = ∇wt,i f

Gt,i = ∑_{τ=1}^{t} (gτ,i)²   (sum of all past squared gradients, per coordinate)

wt+1,i = wt,i − (η0 / √(Gt,i + ϵ)) ⊙ gt,i

Pros: well suited to sparse data.

Cons: the accumulated sum makes the effective learning rate monotonically decreasing.

AdaGrad performs smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features, and larger updates (i.e., higher learning rates) for parameters associated with infrequent features.
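A per-coordinate sketch of the AdaGrad update above on a toy quadratic; note that G only grows, so the effective step size only shrinks. Hyperparameters are illustrative.

```python
import numpy as np

def adagrad_step(w, G, grad, eta0=0.1, eps=1e-8):
    """G accumulates all past squared gradients, coordinate-wise."""
    G = G + grad ** 2
    w = w - eta0 / np.sqrt(G + eps) * grad
    return w, G

w, G = np.zeros(3), np.zeros(3)
for _ in range(200):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic objective
    w, G = adagrad_step(w, G, grad)
print(w)
```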

AdaGrad to AdaDelta

gt,i = ∇wt,i f

vt,i = β vt−1,i + (1 − β) (gt,i)²

wt+1,i = wt,i − (η0 / √(vt,i + ϵ)) ⊙ gt,i

Pros: well suited to sparse data; uses a sliding (exponentially decaying) window of past squared gradients.

Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size.
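A sketch of this exponentially averaged (RMSprop-style) variant of the update, reusing the toy-quadratic loop pattern from the AdaGrad sketch; hyperparameters are illustrative.

```python
import numpy as np

def rmsprop_step(w, v, grad, eta0=0.01, beta=0.9, eps=1e-8):
    """Exponential moving average of squared gradients (sliding window)."""
    v = beta * v + (1 - beta) * grad ** 2
    w = w - eta0 / np.sqrt(v + eps) * grad
    return w, v

w, v = np.zeros(3), np.zeros(3)
for _ in range(500):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic objective
    w, v = rmsprop_step(w, v, grad)
print(w)
```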

AdaGrad to Adam

gt,i = ∇wt,if

mt,i = β1mt−1,i + (1 − β1)gt,i

vt,i = β2vt−1,i + (1 − β2)g2t,i

wt+1,i = wt,i − (η0 / √(vt,i + ϵ)) ⊙ mt,i

Pros: very fast in practice (the current default choice); based on exponential moving averages (EMA) of the gradient and its square.

Cons: may converge to a suboptimal solution; the EMA can diminish the effect of large changes in the gradients.
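A minimal NumPy sketch of the Adam update; the bias-correction step here is standard Adam and does not appear in the formulas above, and the hyperparameters are illustrative.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta0=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """EMAs of the gradient (m) and squared gradient (v), with the
    standard bias correction (not shown on the slide)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - eta0 * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic objective
    w, m, v = adam_step(w, m, v, grad, t)
print(w)
```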

Adam to ?

Which to use?

• SGD + SGDM
  - Use the optimizer you are familiar with
  - Know your data
  - Test on a small batch first
• Adam + SGD
  - Shuffle the data
  - Choose a proper learning rate