
Lecture 15: Optimization
CS109B, STAT121B, AC209B, CSE109B

Mark Glickman and Pavlos Protopapas

Learning vs. Optimization

• Goal of learning: minimize the generalization error

$$J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\left[\, L(f(x;\theta), y) \,\right]$$

• In practice, empirical risk minimization:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} L\!\left(f(x^{(i)};\theta),\, y^{(i)}\right)$$

The quantity optimized is different from the quantity we care about
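
As a concrete illustration of the empirical-risk formula above, here is a minimal NumPy sketch. The squared-error loss and linear model f(x; θ) = θᵀx are stand-in assumptions, not the lecture's specific choices.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """(1/m) * sum_i L(f(x_i; theta), y_i) with a squared-error loss."""
    preds = X @ theta              # f(x^(i); theta) for all i at once
    losses = (preds - y) ** 2      # L(f(x^(i); theta), y^(i))
    return losses.mean()

# Toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, y))
```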

Batch vs. Stochastic Algorithms

• Batch algorithms
– Optimize the empirical risk using exact gradients

• Stochastic algorithms
– Estimate the gradient from a small random sample

$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\left[\, \nabla L(f(x;\theta), y) \,\right]$$

Large mini-batch: gradient computation is expensive
Small mini-batch: greater variance in the estimate, so more steps are needed to converge

Critical Points

• Points with zero gradient
• The 2nd derivative (Hessian) determines the curvature

Goodfellow et al. (2016)

Stochastic Gradient Descent

• Take small steps in the direction of the negative gradient
• Sample m examples from the training set and compute:

$$g = \frac{1}{m}\sum_{i} \nabla L\!\left(f(x^{(i)};\theta),\, y^{(i)}\right)$$

• Update parameters:

$$\theta = \theta - \epsilon_k\, g$$

In practice: shuffle the training set once and pass through it multiple times
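
A minimal mini-batch SGD loop corresponding to the update above. The linear model with squared-error loss is an illustrative assumption carried over from the earlier sketch, not the lecture's specific setup.

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):                          # multiple passes over the data
        perm = rng.permutation(n)                    # shuffle the training set
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # g = (1/m) * sum_i grad L(f(x_i; theta), y_i) for squared error
            g = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta = theta - lr * g                   # theta = theta - eps_k * g
    return theta
```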

Stochastic Gradient Descent

Oscillations because updates do not exploit curvature information

[Figure: contours of J(θ) with an oscillating SGD path; Goodfellow et al. (2016)]

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Local Minima

Goodfellow et al. (2016)

Local Minima

• Old view: local minima are a major problem in neural network training

• Recent view:
– For sufficiently large neural networks, most local minima incur low cost
– It is not important to find the true global minimum

Saddle Points

• Recent studies indicate that in high dimensions, saddle points are more likely than local minima

• The gradient can be very small near saddle points

A saddle point is a local minimum along some directions and a local maximum along others

Goodfellow et al. (2016)

Saddle Points

• SGD is seen to escape saddle points
– It moves downhill and uses noisy gradients

• Second-order methods get stuck
– They solve for a point with zero gradient

Goodfellow et al. (2016)

Poor Conditioning

• Poorly conditioned Hessian matrix
– High curvature: small steps lead to a huge increase in the cost

• Learning is slow despite strong gradients

Oscillations slow down progress

Goodfellow et al. (2016)
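
To make poor conditioning concrete, here is a small illustrative experiment of my own (not from the slides): gradient descent on the quadratic J(θ) = ½ θᵀHθ with a badly conditioned Hessian oscillates along the high-curvature direction while crawling along the flat one.

```python
import numpy as np

# Poorly conditioned quadratic: curvature 100 along theta_1, curvature 1 along theta_2
H = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
lr = 0.019  # just under the stability limit 2/100 for the steep direction

for step in range(21):
    grad = H @ theta                  # gradient of 0.5 * theta^T H theta
    theta = theta - lr * grad
    if step % 5 == 0:
        print(step, theta)            # theta_1 oscillates in sign, theta_2 barely moves
```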

No Critical Points

• Some cost functions do not have critical points

Goodfellow et al. (2016)

No Critical Points

The gradient norm increases during training, but the validation error decreases

[Figure: convolutional nets for object detection; Goodfellow et al. (2016)]

Exploding and Vanishing Gradients

$$h^{(1)} = Wx, \qquad h^{(i)} = W h^{(i-1)}, \quad i = 2, \ldots, n$$

$$y = \sigma\!\left(h^{(n)}_1 + h^{(n)}_2\right), \quad \text{where } \sigma(s) = \frac{1}{1 + e^{-s}}$$

Linear activation in the hidden layers

deeplearning.ai

Exploding and Vanishing Gradients

Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$:

$$\begin{pmatrix} h^{(1)}_1 \\ h^{(1)}_2 \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad\Rightarrow\quad \begin{pmatrix} h^{(n)}_1 \\ h^{(n)}_2 \end{pmatrix} = \begin{pmatrix} a^n & 0 \\ 0 & b^n \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

$$y = \sigma\!\left(a^n x_1 + b^n x_2\right)$$

$$\nabla y = \sigma'\!\left(a^n x_1 + b^n x_2\right)\begin{pmatrix} n a^{n-1} x_1 \\ n b^{n-1} x_2 \end{pmatrix}$$

Exploding and Vanishing Gradients

Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$

Case 1: $a = 1,\ b = 2$:

$$y \to 1, \qquad \nabla y \to \begin{pmatrix} n \\ n\,2^{n-1} \end{pmatrix} \quad \text{Explodes!}$$

Case 2: $a = 0.5,\ b = 0.9$:

$$y \to 0, \qquad \nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{Vanishes!}$$
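
A quick numerical check of the two cases above, looking at the final activations h⁽ⁿ⁾ and the gradient with respect to (a, b); this is a small NumPy sketch of the slide's toy example, with the sigmoid output omitted.

```python
import numpy as np

def chain(a, b, n, x=(1.0, 1.0)):
    """Deep linear chain h^(i) = W h^(i-1) with W = diag(a, b)."""
    x1, x2 = x
    h_n = np.array([a**n * x1, b**n * x2])        # h^(n) = diag(a^n, b^n) x
    grad_ab = np.array([n * a**(n - 1) * x1,      # d(a^n x1)/da
                        n * b**(n - 1) * x2])     # d(b^n x2)/db
    return h_n, grad_ab

print(chain(1.0, 2.0, n=100))   # Case 1: activations and gradient blow up (~2^100)
print(chain(0.5, 0.9, n=100))   # Case 2: activations and gradient shrink toward zero
```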

Exploding and Vanishing Gradients

• Exploding gradients lead to cliffs
• Can be mitigated using gradient clipping

Goodfellow et al. (2016)
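
A minimal sketch of gradient clipping by global norm, one common way to implement the mitigation mentioned above; the threshold value is an illustrative assumption.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # a "cliff" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0)) # rescaled to norm 5
```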

Poor correspondence between local and global structure

Goodfellow et al. (2016)

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Momentum

• SGD is slow when there is high curvature

• The averaged gradient presents a faster path to the optimum:
– vertical components cancel out

[Figure: contours of J(θ); deeplearning.ai]

Momentum

• Uses past gradients for the update
• Maintains a new quantity: 'velocity'
• Exponentially decaying average of gradients:

$$v = \alpha v + (-\epsilon g)$$

$\alpha \in [0, 1)$ controls how quickly the effect of past gradients decays; $-\epsilon g$ is the current gradient update

Momentum

• Compute gradient estimate:

$$g = \frac{1}{m}\sum_{i} \nabla_\theta L\!\left(f(x^{(i)};\theta),\, y^{(i)}\right)$$

• Update velocity:

$$v = \alpha v - \epsilon g$$

• Update parameters:

$$\theta = \theta + v$$
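
The momentum update as code, reusing the illustrative linear-model gradient from the earlier SGD sketch; the model, loss, and hyper-parameter values are assumptions for the example only.

```python
import numpy as np

def sgd_momentum(X, y, lr=0.01, alpha=0.9, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)                       # velocity
    n = len(y)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            g = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient estimate
            v = alpha * v - lr * g                        # v = alpha*v - eps*g
            theta = theta + v                             # theta = theta + v
    return theta
```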

Momentum

Damped oscillations: gradients in opposite directions get cancelled out

[Figure: contours of J(θ); Goodfellow et al. (2016)]

Nesterov Momentum

• Apply an interim update:

$$\tilde{\theta} = \theta + v$$

• Perform a correction based on the gradient at the interim point:

$$g = \frac{1}{m}\sum_{i} \nabla_\theta L\!\left(f(x^{(i)};\tilde{\theta}),\, y^{(i)}\right)$$

$$v = \alpha v - \epsilon g$$

$$\theta = \theta + v$$

Momentum based on the look-ahead slope
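
The same loop with the Nesterov look-ahead: the gradient is evaluated at the interim point θ̃ = θ + v before the velocity and parameters are updated. As before, the model and data are illustrative assumptions.

```python
import numpy as np

def nesterov_momentum(X, y, lr=0.01, alpha=0.9, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    n = len(y)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            theta_interim = theta + v                              # interim (look-ahead) point
            g = 2 * Xb.T @ (Xb @ theta_interim - yb) / len(idx)    # gradient at interim point
            v = alpha * v - lr * g
            theta = theta + v
    return theta
```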

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Adaptive Learning Rates

• Oscillations along the vertical direction
– Learning must be slower along parameter θ2

• Use a different learning rate for each parameter?

[Figure: contours of J(θ) over parameters θ1 and θ2]

AdaGrad

• Accumulate squared gradients:

$$r_i = r_i + g_i^2$$

• Update each parameter:

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$

• Greater progress along gently sloped directions

The step size is inversely proportional to the cumulative squared gradient
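
A per-parameter AdaGrad step as a sketch; δ is the small stabilizing constant, and the gradient g is assumed to be supplied by the caller.

```python
import numpy as np

def adagrad_step(theta, g, r, lr=0.01, delta=1e-7):
    """One AdaGrad update; r accumulates squared gradients over all steps."""
    r = r + g**2                                       # r_i = r_i + g_i^2
    theta = theta - lr / (delta + np.sqrt(r)) * g      # per-parameter step size
    return theta, r

# Usage: r starts at zero and is carried from step to step
theta, r = np.zeros(3), np.zeros(3)
g = np.array([0.5, -1.0, 2.0])                         # gradient from the current mini-batch
theta, r = adagrad_step(theta, g, r)
```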

RMSProp

• For non-convex problems, AdaGrad can prematurely decrease the learning rate

• Use an exponentially weighted average for the gradient accumulation:

$$r_i = \rho r_i + (1-\rho)\, g_i^2$$

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$
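
The same step with RMSProp's decaying accumulator; ρ = 0.9 is a typical illustrative value rather than a prescription from the lecture.

```python
import numpy as np

def rmsprop_step(theta, g, r, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update: exponentially weighted average of squared gradients."""
    r = rho * r + (1 - rho) * g**2                 # r_i = rho*r_i + (1-rho)*g_i^2
    theta = theta - lr / (delta + np.sqrt(r)) * g
    return theta, r
```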

Adam

• RMSProp + Momentum

• Estimate first moment:

$$v_i = \rho_1 v_i + (1-\rho_1)\, g_i$$

• Estimate second moment:

$$r_i = \rho_2 r_i + (1-\rho_2)\, g_i^2$$

• Update parameters:

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, v_i$$

Also applies bias correction to v and r

Works well in practice and is fairly robust to hyper-parameters
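
An Adam step including the bias correction mentioned above; t is the step counter starting at 1, and the default ρ values shown are common choices assumed for the example.

```python
import numpy as np

def adam_step(theta, g, v, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: momentum-like first moment + RMSProp-like second moment."""
    v = rho1 * v + (1 - rho1) * g            # first-moment estimate
    r = rho2 * r + (1 - rho2) * g**2         # second-moment estimate
    v_hat = v / (1 - rho1**t)                # bias-corrected first moment
    r_hat = r / (1 - rho2**t)                # bias-corrected second moment
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r
```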

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Parameter Initialization

• Goal: break symmetry between units
– so that each unit computes a different function

• Initialize all weights (not biases) randomly
– Gaussian or uniform distribution

• Scale of the initialization?
– Large -> gradient explosion, Small -> gradient vanishing

Xavier Initialization

• Heuristic for all outputs to have unit variance
• For a fully-connected layer with m inputs:

$$W_{ij} \sim N\!\left(0,\ \frac{1}{m}\right)$$

• For ReLU units, it is recommended:

$$W_{ij} \sim N\!\left(0,\ \frac{2}{m}\right)$$
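
The two heuristics as code; m is the fan-in of the layer, and the function names (the second variant is often called He initialization) are my own labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(m, n_out):
    """W_ij ~ N(0, 1/m) for a fully-connected layer with m inputs."""
    return rng.normal(0.0, np.sqrt(1.0 / m), size=(n_out, m))

def he_normal(m, n_out):
    """W_ij ~ N(0, 2/m), the variance recommended for ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / m), size=(n_out, m))
```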

Normalized Initialization

• Fully-connected layer with m inputs, n outputs:

$$W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$

• This heuristic trades off between giving all layers the same activation variance and the same gradient variance

• Sparse variant when m is large
– Initialize k nonzero weights in each unit
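
The normalized uniform rule as a sketch, together with a simple sparse variant matching the last bullet; the exact sparse scheme (k nonzero incoming weights per unit, drawn from the same uniform range) is my own illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalized_uniform(m, n):
    """W_ij ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n))) for m inputs, n outputs."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

def sparse_init(m, n, k=16):
    """Keep only k nonzero incoming weights per output unit (illustrative variant)."""
    W = normalized_uniform(m, n)
    for row in W:                                   # each row = one output unit
        zero_idx = rng.choice(m, size=max(m - k, 0), replace=False)
        row[zero_idx] = 0.0
    return W
```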

Bias Initialization

• Output unit bias
– Marginal statistics of the output in the training set

• Hidden unit bias
– Avoid saturation at initialization
– E.g. for ReLU, initialize the bias to 0.1 instead of 0

• Units controlling participation of other units
– Set the bias to allow participation at initialization

Outline

• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization

Feature Normalization

• Good practice to normalize features before applying the learning algorithm:

$$\tilde{x} = \frac{x - \mu}{\sigma}$$

x: feature vector; μ: vector of mean feature values; σ: vector of SDs of the feature values

• Features on the same scale: mean 0 and variance 1
– Speeds up learning
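
Feature standardization as a sketch; computing the statistics on the training set and reusing them for held-out data is standard practice, assumed here rather than stated on the slide.

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and standard deviation from the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8     # small constant avoids division by zero
    return mu, sigma

def standardize(X, mu, sigma):
    """x_tilde = (x - mu) / sigma, applied column-wise."""
    return (X - mu) / sigma
```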

Feature Normalization

[Figure: contours of J(θ) before normalization vs. after normalization]

Internal Covariate Shift

Each hidden layer changes the distribution of inputs to the next layer, which slows down learning

Normalize the inputs to layer 2, …, normalize the inputs to layer n

Batch Normalization

• Training time:
– Mini-batch of activations for the layer to normalize:

$$H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & \ddots & \vdots \\ H_{N1} & \cdots & H_{NK} \end{pmatrix}$$

N data points in the mini-batch, K hidden-layer activations

Batch Normalization

• Training time:
– Normalize the mini-batch of activations:

$$H' = \frac{H - \mu}{\sigma}, \quad \text{where} \quad \mu = \frac{1}{m}\sum_{i} H_{i,:}, \qquad \sigma = \sqrt{\frac{1}{m}\sum_{i} (H - \mu)_i^2 + \delta}$$

μ: vector of mean activations across the mini-batch; σ: vector of SDs of each unit across the mini-batch

Batch Normalization

• Training time:
– Normalization can reduce the expressive power
– Instead use:

$$\gamma H' + \beta$$

γ, β: learnable parameters

– Allows the network to control the range of the normalization

Batch Normalization

[Figure: mini-batches 1 … N flowing through the network, with normalization operations added for layer 1, computing $\mu_1 = \frac{1}{m}\sum_i H_{i,:}$ and $\sigma_1 = \sqrt{\frac{1}{m}\sum_i (H-\mu_1)_i^2 + \delta}$]

Batch Normalization

[Figure: normalization operations ($\mu_2$, $\sigma_2$) added for layer 2, and so on …]

Batch Normalization

• Differentiate the joint loss for the N mini-batches
• Back-propagate through the normalization operations

• Test time:
– The model needs to be evaluated on a single example
– Replace μ and σ with running averages collected during training
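
A compact sketch pulling the pieces together: training-time normalization with learnable γ and β, plus running averages of μ and σ maintained for test time. The running-average momentum of 0.9 is an illustrative choice, not from the lecture.

```python
import numpy as np

class BatchNorm:
    """Batch normalization for an (N, K) matrix of activations H."""

    def __init__(self, K, delta=1e-5, momentum=0.9):
        self.gamma = np.ones(K)          # learnable scale
        self.beta = np.zeros(K)          # learnable shift
        self.delta = delta
        self.momentum = momentum
        self.run_mu = np.zeros(K)        # running averages used at test time
        self.run_sigma = np.ones(K)

    def forward(self, H, training=True):
        if training:
            mu = H.mean(axis=0)                                         # per-unit mean over the batch
            sigma = np.sqrt(((H - mu) ** 2).mean(axis=0) + self.delta)  # per-unit SD over the batch
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_sigma = self.momentum * self.run_sigma + (1 - self.momentum) * sigma
        else:
            mu, sigma = self.run_mu, self.run_sigma                     # single-example evaluation
        H_prime = (H - mu) / sigma
        return self.gamma * H_prime + self.beta
```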