Neural Networks: Backpropagation & Regularization


Benjamin Roth, Nina Poerner

CIS LMU München


Outline

1 Backpropagation

2 Regularization


Backpropagation

Forward propagation: input information x propagates through the network to produce the output y.

Calculate cost J(θ), as you would with regression.

Compute gradients w.r.t. all model parameters θ...

... how?

- We know how to compute gradients w.r.t. parameters of the output layer (just like regression).
- How to calculate them w.r.t. parameters of the hidden layers?


Chain Rule of Calculus

Let $x, y, z \in \mathbb{R}$.

Let $f, g : \mathbb{R} \to \mathbb{R}$ be functions with

$y = g(x)$

$z = f(g(x))$

Then

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$
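A quick numerical illustration (not from the original slides; the concrete functions and step size are chosen purely for illustration): a minimal Python sketch comparing the chain-rule derivative with a finite-difference approximation.

```python
import math

# Illustrative choice: g(x) = x^2 and f(y) = sin(y), so z = sin(x^2)
def g(x):
    return x ** 2

def f(y):
    return math.sin(y)

def dz_dx_chain_rule(x):
    y = g(x)
    dz_dy = math.cos(y)   # f'(y)
    dy_dx = 2.0 * x       # g'(x)
    return dz_dy * dy_dx  # dz/dx = dz/dy * dy/dx

def dz_dx_numeric(x, eps=1e-6):
    # central finite-difference approximation of dz/dx
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x = 1.3
print(dz_dx_chain_rule(x), dz_dx_numeric(x))  # the two values agree closely
```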


Chain Rule of Calculus: Vector-valued Functions

Let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}$. Let $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^m \to \mathbb{R}^n$ be functions with

$y = g(x)$

$z = f(g(x)) = f(y)$

Then

$$\frac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$

In order to write this in vector notation, we need to define the Jacobian matrix.


Jacobian

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function.

$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\ \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_2}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix}$$

How to write in terms of gradients?

We can write the chain rule as:

$$\underbrace{\nabla_x z}_{m \times 1} = \Big(\underbrace{\frac{\partial y}{\partial x}}_{n \times m}\Big)^{T}\, \underbrace{\nabla_y z}_{n \times 1}$$
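A small NumPy sketch of the chain rule in this matrix form (not from the slides; the concrete choice $g(x) = Ax$, $f(y) = \|y\|^2$ and all shapes are illustrative assumptions), compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.normal(size=(n, m))     # illustrative linear map: y = g(x) = A x
x = rng.normal(size=m)

def g(x):
    return A @ x                # Jacobian dy/dx = A, shape (n, m)

def f(y):
    return np.sum(y ** 2)       # z = f(y), gradient grad_y z = 2 y

y = g(x)
jacobian = A                    # (n, m)
grad_y = 2.0 * y                # (n,)

# Vector chain rule: grad_x z = (dy/dx)^T grad_y z, shape (m,)
grad_x = jacobian.T @ grad_y

# Finite-difference check of the first component
eps = 1e-6
e0 = np.zeros(m); e0[0] = eps
numeric = (f(g(x + e0)) - f(g(x - e0))) / (2.0 * eps)
print(grad_x[0], numeric)       # should match to several decimal places
```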


Viewing the Network as a Graph

Nodes are function outputs (can be scalar or vector valued)

Arrows are functions

Example:

$y = v^T\,\mathrm{relu}(W^T x)$

$z = W^T x$;  $r = \mathrm{relu}(z)$

[Figure: computation graph with input nodes $x$, $W$, $v$; intermediate nodes $z = W^T x$ and $r = \max(0, z)$; output node $y = v^T r$; and cost node $J$ computed from $y$.]
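A minimal NumPy sketch of the forward pass through this example graph (the dimensions, the random initialization and the squared-error cost are illustrative assumptions; the slides leave the exact form of $J$ open):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # input vector
W = rng.normal(size=(5, 3))       # hidden-layer weight matrix
v = rng.normal(size=3)            # output-layer weight vector
target = 1.0                      # assumed training target for the cost

# One node of the graph per line
z = W.T @ x                       # z = W^T x
r = np.maximum(0.0, z)            # r = relu(z) = max(0, z)
y = v @ r                         # y = v^T r
J = 0.5 * (y - target) ** 2       # assumed cost, e.g. squared error
```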


Forward pass

Green: Known or computed node

[Figure: the computation graph from the previous slide; during the forward pass the nodes $z$, $r$, $y$ and $J$ are computed in this order from the known inputs $x$, $W$, $v$.]


Backward pass

Red: Gradient of J w.r.t. node known or computed

[Figure: the computation graph from the previous slides; gradients flow backwards from $J$ through $y$, $r$ and $z$ to the parameters $v$ and $W$.]

The gradients computed during the backward pass, applying the chain rule node by node:

$$\nabla_y J = \frac{dJ}{dy}$$

$$\nabla_v J = \Big(\frac{\partial y}{\partial v}\Big)^T \frac{dJ}{dy}$$

$$\nabla_r J = \Big(\frac{\partial y}{\partial r}\Big)^T \frac{dJ}{dy}$$

$$\nabla_z J = \Big(\frac{\partial r}{\partial z}\Big)^T \Big(\frac{\partial y}{\partial r}\Big)^T \frac{dJ}{dy}$$

$$\nabla_W J = \Big(\frac{\partial z}{\partial W}\Big)^T \Big(\frac{\partial r}{\partial z}\Big)^T \Big(\frac{\partial y}{\partial r}\Big)^T \frac{dJ}{dy}$$
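A minimal NumPy sketch of this backward pass for the example network, continuing the forward-pass sketch above (the squared-error cost and all dimensions remain illustrative assumptions; because relu acts elementwise, the product with $(\partial r/\partial z)^T$ reduces to an elementwise mask):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(5, 3))
v = rng.normal(size=3)
target = 1.0                          # assumed training target

# Forward pass (as on the previous slides)
z = W.T @ x
r = np.maximum(0.0, z)
y = v @ r
J = 0.5 * (y - target) ** 2           # assumed squared-error cost

# Backward pass: gradients of J w.r.t. every node, in reverse order
dJ_dy = y - target                    # dJ/dy for the assumed cost
grad_v = r * dJ_dy                    # (dy/dv)^T dJ/dy, since y = v^T r
grad_r = v * dJ_dy                    # (dy/dr)^T dJ/dy
grad_z = (z > 0).astype(float) * grad_r   # relu Jacobian is diagonal -> elementwise mask
grad_W = np.outer(x, grad_z)          # (dz/dW)^T grad_z, since z = W^T x

# Finite-difference check for one entry of W
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
J2 = 0.5 * (v @ np.maximum(0.0, W2.T @ x) - target) ** 2
print(grad_W[0, 0], (J2 - J) / eps)   # should agree to several decimals
```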



Regularization

[Figure: three fits to the same data points, plotted as prediction against feature, illustrating overfitting vs. underfitting.]

Overfitting vs. underfitting

Regularization: any modification to a learning algorithm made in order to reduce its generalization error but not its training error

Build a “preference” into the model for some solutions in the hypothesis space

Non-preferred solutions are penalized: they are only chosen if they fit the training data much better than the preferred solutions


Regularization

Large parameters → overfitting

Prefer models with smaller weights

Popular regularizers (see the short sketch below):

- Penalize a large L2 norm (= Euclidean norm) of the weight vectors
- Penalize a large L1 norm (= Manhattan norm) of the weight vectors
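A tiny sketch of the two penalty terms for an arbitrary weight vector (the values are chosen only for illustration; the L2 regularizer uses the squared norm, matching the next slide):

```python
import numpy as np

theta = np.array([0.5, -2.0, 3.0])      # arbitrary weight vector
l2_penalty = np.sum(theta ** 2)         # squared L2 (Euclidean) norm, theta^T theta
l1_penalty = np.sum(np.abs(theta))      # L1 (Manhattan) norm
print(l2_penalty, l1_penalty)           # 13.25 and 5.5
```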


L2-Regularization

Add term that penalizes large L2 norm of weight vector θ

The amount of penalty is controlled by a parameter λ

$$J'(\theta) = J(\theta, x, y) + \frac{\lambda}{2}\,\theta^T \theta$$
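A minimal NumPy sketch of how the penalty changes the objective and its gradient (the unregularized loss, the data and the value of λ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # assumed design matrix
y = rng.normal(size=20)               # assumed targets
theta = rng.normal(size=4)
lam = 0.1                             # regularization strength lambda

def loss(theta):
    residual = X @ theta - y
    J = 0.5 * np.mean(residual ** 2)          # original cost J(theta, x, y)
    return J + 0.5 * lam * theta @ theta      # J'(theta) = J + (lambda/2) theta^T theta

def grad(theta):
    residual = X @ theta - y
    return X.T @ residual / len(y) + lam * theta   # gradient gains an extra lambda * theta term

# One gradient-descent step; the penalty shrinks the weights ("weight decay")
theta = theta - 0.1 * grad(theta)
print(loss(theta))
```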


L2-Regularization

The surface of the objective function is now a combination of the original loss and the regularization penalty.


Summary

Feedforward networks: layers of (non-linear) function compositions

Non-linearities for hidden layers: relu, tanh, ...

Non-linearities for output units (classification): σ, softmax

Training via backpropagation: compute the gradient of the cost w.r.t. the parameters using the chain rule

Regularization: penalize large parameter values, e.g. by adding the L2 norm of the parameter vector to the loss


Outlook

“Manually” defining forward and backward passes in numpy is time-consuming.

Deep learning frameworks let you define the forward pass as a “computation graph” made up of simple, differentiable operations (e.g., dot products).

They do the backward pass for you.

tensorflow + keras, pytorch, theano, MXNet, CNTK, caffe, ...
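For comparison with the manual NumPy passes above, here is a small PyTorch sketch of the same example network; the framework records the computation graph during the forward pass, and `backward()` fills in the gradients (shapes, initialization and the squared-error cost are again illustrative assumptions):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5)
W = torch.randn(5, 3, requires_grad=True)   # parameters track gradients
v = torch.randn(3, requires_grad=True)
target = torch.tensor(1.0)

# Forward pass: the graph y = v^T relu(W^T x) is recorded automatically
z = W.t() @ x
r = torch.relu(z)
y = v @ r
J = 0.5 * (y - target) ** 2                 # assumed squared-error cost

# Backward pass: autograd applies the chain rule through the recorded graph
J.backward()
print(W.grad.shape, v.grad.shape)           # gradients w.r.t. W and v
```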
