
Page 1:

Lecture 15: Optimization
CS 109B, STAT 121B, AC 209B, CSE 109B

Mark Glickman and Pavlos Protopapas

Page 2:

Learning vs. Optimization

• Goal of learning: minimize the generalization error

$$J(\theta) = \mathbb{E}_{(x,y)\sim p_{data}}\left[L(f(x;\theta),\, y)\right]$$

• In practice, empirical risk minimization:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta),\, y^{(i)})$$

The quantity optimized is different from the quantity we care about.

Page 3:

Batch vs. Stochastic Algorithms

• Batch algorithms
  – Optimize the empirical risk using exact gradients
• Stochastic algorithms
  – Estimate the gradient from a small random sample (mini-batch)

$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim p_{data}}\left[\nabla L(f(x;\theta),\, y)\right]$$

Large mini-batch: gradient computation is expensive.
Small mini-batch: greater variance in the estimate, more steps needed for convergence.

Page 4:

Critical Points

• Points with zero gradient
• The second derivative (Hessian) determines the curvature

Goodfellow et al. (2016)

Page 5:

Stochastic Gradient Descent

• Take small steps in the direction of the negative gradient
• Sample m examples from the training set and compute:

$$g = \frac{1}{m}\sum_i \nabla_\theta L(f(x^{(i)};\theta),\, y^{(i)})$$

• Update the parameters:

$$\theta = \theta - \epsilon_k\, g$$

In practice: shuffle the training set once and pass through it multiple times.
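To make the update concrete, here is a minimal NumPy sketch of the loop described above. The gradient function `grad_loss` and the `(X, y)` data pair are assumed placeholders for illustration, not code from the lecture.

```python
import numpy as np

def sgd(grad_loss, theta, data, lr=0.01, epochs=10, batch_size=32, rng=None):
    """Mini-batch SGD: shuffle each epoch, step against the averaged gradient."""
    rng = rng or np.random.default_rng(0)
    X, y = data
    n = len(X)
    for epoch in range(epochs):
        order = rng.permutation(n)                     # shuffle the training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # g = (1/m) * sum_i grad L(f(x_i; theta), y_i)
            g = np.mean([grad_loss(theta, X[i], y[i]) for i in idx], axis=0)
            theta = theta - lr * g                     # theta <- theta - eps_k * g
    return theta
```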

Page 6:

Stochastic Gradient Descent

Oscillations occur because the updates do not exploit curvature information.

[Figure: SGD trajectory on the contours of J(θ); Goodfellow et al. (2016)]

Page 7:

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Page 8:

Local Minima

Goodfellow et al. (2016)

Page 9:

Local Minima

• Old view: local minima are a major problem in neural network training
• Recent view:
  – For sufficiently large neural networks, most local minima incur low cost
  – It is not important to find the true global minimum

Page 10:

Saddle Points

• Recent studies indicate that in high dimensions, saddle points are more likely than local minima
• The gradient can be very small near saddle points

A saddle point is a local minimum along some directions and a local maximum along others.

Goodfellow et al. (2016)

Page 11:

Saddle Points

• SGD is seen to escape saddle points
  – Moves downhill, uses noisy gradients
• Second-order methods get stuck
  – They solve for a point with zero gradient

Goodfellow et al. (2016)

Page 12:

Poor Conditioning

• Poorly conditioned Hessian matrix
  – High curvature: small steps lead to a huge increase in cost
• Learning is slow despite strong gradients

Oscillations slow down progress.

Goodfellow et al. (2016)

Page 13:

No Critical Points

• Some cost functions do not have critical points

Goodfellow et al. (2016)

Page 14:

No Critical Points

The gradient norm increases, but the validation error decreases.

[Figure: convolutional nets for object detection; Goodfellow et al. (2016)]

Page 15:

Exploding and Vanishing Gradients

$$h_1 = W x, \qquad h_i = W h_{i-1}, \quad i = 2,\ldots,n$$

$$y = \sigma(h_{n,1} + h_{n,2}), \quad \text{where } \sigma(s) = \frac{1}{1+e^{-s}}$$

Linear activation in the hidden layers.

deeplearning.ai

Page 16:

Exploding and Vanishing Gradients

Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$:

$$\begin{pmatrix} h_{1,1} \\ h_{1,2} \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \ldots, \qquad \begin{pmatrix} h_{n,1} \\ h_{n,2} \end{pmatrix} = \begin{pmatrix} a^n & 0 \\ 0 & b^n \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

$$y = \sigma(a^n x_1 + b^n x_2)$$

$$\nabla y = \sigma'(a^n x_1 + b^n x_2)\begin{pmatrix} n\,a^{n-1} x_1 \\ n\,b^{n-1} x_2 \end{pmatrix}$$

Page 17:

Exploding and Vanishing Gradients

Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.

Case 1: $a = 1,\ b = 2$: $\ y \to 1$, $\ \nabla y \to \begin{pmatrix} n \\ n\,2^{n-1} \end{pmatrix}$. Explodes!

Case 2: $a = 0.5,\ b = 0.9$: $\ y \to \sigma(0) = 1/2$, $\ \nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Vanishes!
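A quick numeric check of the toy example (a sketch under the slide's assumptions of linear hidden layers and W = diag(a, b)): repeatedly applying W makes the activations, and with them the gradient factors a^(n-1) and b^(n-1), grow or shrink geometrically with depth.

```python
import numpy as np

def deep_linear(a, b, n, x=np.array([1.0, 1.0])):
    """Forward pass through n layers with W = diag(a, b) and linear activations."""
    W = np.diag([a, b])
    h = x
    for _ in range(n):
        h = W @ h
    return h

for n in (5, 20, 50):
    print(n,
          np.linalg.norm(deep_linear(1.0, 2.0, n)),   # grows like 2^n: explodes
          np.linalg.norm(deep_linear(0.5, 0.9, n)))   # shrinks toward 0: vanishes
```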

Page 18:

Exploding and Vanishing Gradients

• Exploding gradients lead to cliffs
• Can be mitigated using gradient clipping

Goodfellow et al. (2016)
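One common way to implement the clipping mentioned above is to rescale the gradient whenever its norm exceeds a threshold. This is a generic sketch; the function name and threshold are illustrative.

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale gradient g so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

# Usage inside an SGD step: theta = theta - lr * clip_by_norm(g, max_norm=5.0)
```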

Page 19:

Poor correspondence between local and global structure

Goodfellow et al. (2016)

Page 20:

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Page 21:

Momentum

• SGD is slow when there is high curvature
• The average gradient presents a faster path to the optimum:
  – the vertical components cancel out

[Figure: contours of J(θ); deeplearning.ai]

Page 22:

Momentum

• Uses past gradients for the update
• Maintains a new quantity: the 'velocity'
• An exponentially decaying average of gradients:

$$v = \alpha v + (-\epsilon g)$$

where $\alpha \in [0,1)$ controls how quickly the effect of past gradients decays, and $-\epsilon g$ is the current gradient update.

Page 23:

Momentum

• Compute the gradient estimate:

$$g = \frac{1}{m}\sum_i \nabla_\theta L(f(x^{(i)};\theta),\, y^{(i)})$$

• Update the velocity:

$$v = \alpha v - \epsilon g$$

• Update the parameters:

$$\theta = \theta + v$$
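The three steps above as a minimal NumPy sketch; the per-mini-batch gradient `grad` is assumed to be computed elsewhere (e.g. as in the SGD sketch earlier).

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """One momentum update: accumulate velocity, then move the parameters."""
    v = alpha * v - lr * grad        # v <- alpha*v - eps*g
    theta = theta + v                # theta <- theta + v
    return theta, v

# theta = np.zeros(d); v = np.zeros_like(theta)
# for each mini-batch: theta, v = momentum_step(theta, v, g)
```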

Page 24:

Momentum

Damped oscillations: gradients in opposite directions get cancelled out.

[Figure: momentum trajectory on the contours of J(θ); Goodfellow et al. (2016)]

Page 25:

Nesterov Momentum

• Apply an interim update:

$$\tilde{\theta} = \theta + \alpha v$$

• Perform a correction based on the gradient at the interim point:

$$g = \frac{1}{m}\sum_i \nabla_\theta L(f(x^{(i)};\tilde{\theta}),\, y^{(i)})$$

$$v = \alpha v - \epsilon g$$

$$\theta = \theta + v$$

Momentum based on the look-ahead slope.
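A sketch of the Nesterov variant; it differs from plain momentum only in where the gradient is evaluated. `grad_at` is an assumed helper that returns the mini-batch gradient at a given parameter value.

```python
def nesterov_step(theta, v, grad_at, lr=0.01, alpha=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point theta + alpha*v."""
    theta_interim = theta + alpha * v       # interim (look-ahead) point
    g = grad_at(theta_interim)              # gradient at the interim point
    v = alpha * v - lr * g
    return theta + v, v
```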

Page 26:

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Page 27:

Adaptive Learning Rates

• Oscillations along the vertical direction
  – Learning must be slower along parameter θ2
• Use a different learning rate for each parameter?

[Figure: contours of J(θ) in the (θ1, θ2) plane]

Page 28:

AdaGrad

• Accumulate squared gradients:

$$r_i = r_i + g_i^2$$

• Update each parameter:

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$

• Greater progress along gently sloped directions

The step size is inversely proportional to the square root of the cumulative squared gradient.
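A minimal per-step sketch of AdaGrad as written above; `grad` is the current mini-batch gradient and `r` is the running sum of squared gradients.

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    """AdaGrad: per-parameter step scaled by the accumulated squared gradient."""
    r = r + grad**2                                   # accumulate squared gradients
    theta = theta - lr * grad / (delta + np.sqrt(r))
    return theta, r

# r starts at np.zeros_like(theta); parameters with small cumulative gradients take larger steps
```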

Page 29:

RMSProp

• For non-convex problems, AdaGrad can prematurely decrease the learning rate
• Use an exponentially weighted average for the gradient accumulation:

$$r_i = \rho r_i + (1-\rho)\, g_i^2$$

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$
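The same step with RMSProp's exponentially weighted accumulator, as a sketch:

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: exponentially weighted average of squared gradients instead of a full sum."""
    r = rho * r + (1.0 - rho) * grad**2
    theta = theta - lr * grad / (delta + np.sqrt(r))
    return theta, r
```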

Page 30:

Adam

• RMSProp + Momentum
• Estimate the first moment:

$$v_i = \rho_1 v_i + (1-\rho_1)\, g_i$$

• Estimate the second moment:

$$r_i = \rho_2 r_i + (1-\rho_2)\, g_i^2$$

• Update the parameters:

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, v_i$$

Also applies bias correction to v and r.

Works well in practice and is fairly robust to hyper-parameters.
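A sketch of one Adam step, including the bias correction mentioned above. The default hyper-parameters shown (ρ1 = 0.9, ρ2 = 0.999) are the commonly used values, not taken from the slide.

```python
import numpy as np

def adam_step(theta, v, r, grad, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment, with bias correction."""
    v = rho1 * v + (1.0 - rho1) * grad          # first-moment estimate
    r = rho2 * r + (1.0 - rho2) * grad**2       # second-moment estimate
    v_hat = v / (1.0 - rho1**t)                 # bias correction (t = 1, 2, ...)
    r_hat = r / (1.0 - rho2**t)
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r
```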

Page 31:

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Page 32:

Parameter Initialization

• Goal: break symmetry between units
  – so that each unit computes a different function
• Initialize all weights (not biases) randomly
  – Gaussian or uniform distribution
• Scale of initialization?
  – Too large -> gradient explosion; too small -> gradient vanishing

Page 33:

Xavier Initialization

• Heuristic for all outputs to have unit variance
• For a fully-connected layer with m inputs:

$$W_{ij} \sim N\!\left(0,\ \frac{1}{m}\right)$$

• For ReLU units, it is recommended to use:

$$W_{ij} \sim N\!\left(0,\ \frac{2}{m}\right)$$
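A small sketch of sampling weights this way with NumPy; the function name and arguments are illustrative, not from the lecture.

```python
import numpy as np

def init_weights(m_in, n_out, relu=False, rng=None):
    """Gaussian initialization with variance 1/m (Xavier) or 2/m for ReLU units."""
    rng = rng or np.random.default_rng(0)
    var = (2.0 if relu else 1.0) / m_in
    return rng.normal(0.0, np.sqrt(var), size=(m_in, n_out))
```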

Page 34:

Normalized Initialization

• For a fully-connected layer with m inputs and n outputs:

$$W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$

• This heuristic trades off between initializing all layers to have the same activation variance and the same gradient variance
• Sparse variant when m is large
  – Initialize k nonzero weights in each unit
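And the corresponding sketch for the normalized (uniform) variant:

```python
import numpy as np

def glorot_uniform(m_in, n_out, rng=None):
    """Normalized (Glorot) initialization: U(-sqrt(6/(m+n)), +sqrt(6/(m+n)))."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (m_in + n_out))
    return rng.uniform(-limit, limit, size=(m_in, n_out))
```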

Page 35:

Bias Initialization

• Output unit bias
  – Set from the marginal statistics of the output in the training set
• Hidden unit bias
  – Avoid saturation at initialization
  – E.g. in ReLU, initialize the bias to 0.1 instead of 0
• Units controlling the participation of other units
  – Set the bias to allow participation at initialization

Page 36:

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Page 37:

Feature Normalization

• It is good practice to normalize features before applying a learning algorithm:

$$\tilde{x} = \frac{x - \mu}{\sigma}$$

where x is the feature vector, μ is the vector of mean feature values, and σ is the vector of standard deviations of the feature values.

• Features on the same scale, with mean 0 and variance 1
  – Speeds up learning
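A minimal sketch of this standardization for a design matrix `X` with one feature per column; μ and σ are returned so the same transform can be reused on new data.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization: subtract the mean, divide by the standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to apply the same transform at test time
```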

Page 38:

Feature Normalization

[Figure: contours of J(θ) before normalization vs. after normalization]

Page 39:

Internal Covariate Shift

Each hidden layer changes the distribution of inputs to the next layer, which slows down learning.

Normalize the inputs to layer 2
Normalize the inputs to layer n

Page 40:

Batch Normalization

• Training time:
  – Mini-batch of activations for the layer to normalize:

$$H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & \ddots & \vdots \\ H_{N1} & \cdots & H_{NK} \end{pmatrix}$$

with K hidden-layer activations (columns) and N data points in the mini-batch (rows).

Page 41:

Batch Normalization

• Training time:
  – Normalize the mini-batch of activations for the layer:

$$H' = \frac{H - \mu}{\sigma}$$

where

$$\mu = \frac{1}{m}\sum_i H_{i,:} \qquad \sigma = \sqrt{\frac{1}{m}\sum_i (H-\mu)_i^2 + \delta}$$

μ is the vector of mean activations across the mini-batch, and σ is the vector of standard deviations of each unit across the mini-batch.

Page 42:

Batch Normalization

• Training time:
  – Normalization can reduce the expressive power of the network
  – Instead use:

$$\gamma H' + \beta$$

  – where γ and β are learnable parameters; this allows the network to control the range of the normalization
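A sketch of the training-time computation for one layer, combining the normalization from the previous slide with the learnable rescaling; `H` is the N x K matrix of mini-batch activations.

```python
import numpy as np

def batch_norm_train(H, gamma, beta, delta=1e-5):
    """Normalize a mini-batch of activations H (N x K), then rescale with learnable gamma, beta."""
    mu = H.mean(axis=0)                        # per-unit mean over the mini-batch
    sigma = np.sqrt(H.var(axis=0) + delta)     # per-unit SD (delta avoids divide-by-zero)
    H_norm = (H - mu) / sigma
    return gamma * H_norm + beta, mu, sigma
```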

Page 43:

Batch Normalization

Add normalization operations for layer 1 (applied to each mini-batch, Batch 1 … Batch N):

$$\mu_1 = \frac{1}{m}\sum_i H_{i,:} \qquad \sigma_1 = \sqrt{\frac{1}{m}\sum_i (H-\mu)_i^2 + \delta}$$

Page 44:

Batch Normalization

Add normalization operations for layer 2, and so on (applied to each mini-batch, Batch 1 … Batch N):

$$\mu_2 = \frac{1}{m}\sum_i H_{i,:} \qquad \sigma_2 = \sqrt{\frac{1}{m}\sum_i (H-\mu)_i^2 + \delta}$$

Page 45:

Batch Normalization

• Differentiate the joint loss for the N mini-batches
• Back-propagate through the normalization operations
• Test time:
  – The model needs to be evaluated on a single example
  – Replace μ and σ with running averages collected during training
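A sketch of the test-time behaviour: normalize a single example with running averages of μ and σ. Here the running averages are assumed to be maintained as exponential moving averages during training, which is one common choice rather than something stated on the slide.

```python
import numpy as np

def batch_norm_test(h, running_mu, running_sigma, gamma, beta):
    """Test-time batch norm: use running averages of mu and sigma collected during training."""
    return gamma * (h - running_mu) / running_sigma + beta

# During training, e.g.: running_mu = 0.9 * running_mu + 0.1 * mu  (exponential moving average)
```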