Neural Networks: Backpropagation & Regularization


Transcript of Neural Networks: Backpropagation & Regularization

Page 1: Neural Networks: Backpropagation & Regularization

Neural Networks: Backpropagation & Regularization

Benjamin Roth, Nina Poerner

CIS LMU München


Page 2: Neural Networks: Backpropagation & Regularization

Outline

1 Backpropagation

2 Regularization


Page 3: Neural Networks: Backpropagation & Regularization

Backpropagation

Forward propagation: Input information x propagates through the network to produce output y.

Calculate cost J(θ), as you would with regression.

Compute gradients w.r.t. all model parameters θ...

... how?

We know how to compute gradients w.r.t. parameters of the output layer (just like regression).

How to calculate them w.r.t. parameters of the hidden layers?


Page 4: Neural Networks: Backpropagation & Regularization

Chain Rule of Calculus

Let x, y, z ∈ R.

Let f, g : R → R.

y = g(x)

z = f(g(x))

Then
\[
\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}
\]
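As a quick sanity check (not part of the original slides), the following NumPy sketch compares the analytic chain-rule derivative with a finite-difference estimate for one concrete choice of f and g:

```python
import numpy as np

# Illustrative example: g(x) = x**2, f(y) = sin(y), so z = sin(x**2).

def g(x):
    return x ** 2        # y = g(x)

def f(y):
    return np.sin(y)     # z = f(y)

x = 1.3
y = g(x)

analytic = np.cos(y) * (2 * x)   # dz/dy * dy/dx

eps = 1e-6                       # central finite difference on z = f(g(x))
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numeric)         # the two values agree up to numerical error
```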


Page 5: Neural Networks: Backpropagation & Regularization

Chain Rule of Calculus: Vector-valued Functions

Let x ∈ R^m, y ∈ R^n, z ∈ R.

Let f : R^n → R, g : R^m → R^n.

y = g(x)

z = f(g(x)) = f(y)

Then
\[
\frac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}
\]

In order to write this in vector notation, we need to define the Jacobian matrix.
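A small worked example of the summation form (the specific f and g are illustrative choices, not from the slides):

```python
import numpy as np

# g(x) = (x1*x2, x1 + x2) and f(y) = y1*y2, so z = x1**2 * x2 + x1 * x2**2.
x1, x2 = 1.5, -0.5
y = np.array([x1 * x2, x1 + x2])

dz_dy = np.array([y[1], y[0]])      # dz/dy_j for f(y) = y1*y2
dy_dx = np.array([[x2, x1],         # dy_j/dx_i (row j, column i)
                  [1.0, 1.0]])

# chain rule: dz/dx_i = sum_j dz/dy_j * dy_j/dx_i
dz_dx = np.array([np.sum(dz_dy * dy_dx[:, i]) for i in range(2)])

# derivative of z = x1**2 * x2 + x1 * x2**2 computed directly, for comparison
direct = np.array([2 * x1 * x2 + x2 ** 2, x1 ** 2 + 2 * x1 * x2])
print(np.allclose(dz_dx, direct))   # True
```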


Page 6: Neural Networks: Backpropagation & Regularization

Jacobian

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function.

\[
\frac{\partial y}{\partial x} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\
\frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_2}{\partial x_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_m}
\end{pmatrix}
\]

How to write in terms of gradients?

We can write the chain rule as:

\[
\underbrace{\nabla_x z}_{m \times 1} = \Bigl(\underbrace{\frac{\partial y}{\partial x}}_{n \times m}\Bigr)^{T} \underbrace{\nabla_y z}_{n \times 1}
\]
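The same statement in a NumPy sketch: the gradient w.r.t. x is the Jacobian, transposed, applied to the gradient w.r.t. y. The linear g and quadratic f below are illustrative choices, not from the slides.

```python
import numpy as np

# g(x) = W x (Jacobian dy/dx = W), f(y) = sum(y**2) (gradient 2*y).
rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = W @ x
grad_y = 2 * y                 # gradient of z w.r.t. y
grad_x = W.T @ grad_y          # (dy/dx)^T times the gradient w.r.t. y

# finite-difference check of grad_x
eps = 1e-6
numeric = np.array([
    (np.sum((W @ (x + eps * np.eye(m)[i])) ** 2)
     - np.sum((W @ (x - eps * np.eye(m)[i])) ** 2)) / (2 * eps)
    for i in range(m)
])
print(np.allclose(grad_x, numeric, atol=1e-4))   # True
```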


Page 7: Neural Networks: Backpropagation & Regularization

Viewing the Network as a Graph

Nodes are function outputs (can be scalar or vector valued)

Arrows are functions

Example:

y = v^T relu(W^T x)

z = W^T x;  r = relu(z) = max(0, z)

[Computation graph: x and W feed into the node z = W^T x; z feeds into r = max(0, z); r and v feed into y = v^T r; y feeds into the cost J.]


Page 12: Neural Networks: Backpropagation & Regularization

Forward pass

Green: Known or computed node

[The same computation graph, with every node now computed (shown green on the slide): first z = W^T x, then r = max(0, z), then y = v^T r, and finally the cost J.]
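A minimal NumPy version of this forward pass; the sizes, random values, and the squared-error cost are illustrative choices, not given on the slides:

```python
import numpy as np

# Forward pass for y = v^T relu(W^T x), with an example squared-error cost J.
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input
W = rng.normal(size=(3, 4))   # first-layer weights
v = rng.normal(size=4)        # output-layer weights
t = 1.0                       # target value, used only for the example cost

z = W.T @ x                   # z = W^T x
r = np.maximum(0.0, z)        # r = relu(z) = max(0, z)
y = v @ r                     # y = v^T r
J = 0.5 * (y - t) ** 2        # example cost
print(z, r, y, J)
```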


Page 17: Neural Networks: Backpropagation & Regularization

Backward pass

Red: Gradient of J w.r.t. node known or computed

[The same computation graph, traversed backwards; the red annotations give the gradient of J w.r.t. each node, accumulated from the output back to the parameters:]

\[
\begin{aligned}
\text{at } y:&\quad \frac{dJ}{dy}\\
\text{at } v:&\quad \frac{\partial y}{\partial v}\,\frac{dJ}{dy}\\
\text{at } r:&\quad \frac{\partial y}{\partial r}\,\frac{dJ}{dy}\\
\text{at } z:&\quad \Bigl(\frac{\partial r}{\partial z}\Bigr)^{T}\frac{\partial y}{\partial r}\,\frac{dJ}{dy}\\
\text{at } W:&\quad \Bigl(\frac{\partial z}{\partial W}\Bigr)^{T}\Bigl(\frac{\partial r}{\partial z}\Bigr)^{T}\frac{\partial y}{\partial r}\,\frac{dJ}{dy}
\end{aligned}
\]
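The corresponding backward pass written out by hand in NumPy (self-contained, with the same illustrative shapes and squared-error cost as in the forward-pass sketch above):

```python
import numpy as np

# Backprop for J = 0.5*(y - t)**2 with y = v^T relu(W^T x).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(3, 4))
v = rng.normal(size=4)
t = 1.0

# forward pass
z = W.T @ x
r = np.maximum(0.0, z)
y = v @ r
J = 0.5 * (y - t) ** 2

# backward pass (chain rule, from the output back to the parameters)
dJ_dy = y - t                             # dJ/dy for the example cost
grad_v = r * dJ_dy                        # since dy/dv = r
grad_r = v * dJ_dy                        # since dy/dr = v
grad_z = (z > 0).astype(float) * grad_r   # relu'(z) is 0 or 1, applied elementwise
grad_W = np.outer(x, grad_z)              # z = W^T x, so dJ/dW_ij = x_i * dJ/dz_j
print(grad_v, grad_W)
```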


Page 24: Neural Networks: Backpropagation & Regularization

Outline

1 Backpropagation

2 Regularization


Page 25: Neural Networks: Backpropagation & Regularization

Regularization

[Figure: three plots of the same data points (feature on the x-axis, prediction on the y-axis), showing an underfit model, a well-fitting model, and an overfit model.]

Overfitting vs. underfitting

Regularization: any modification to a learning algorithm that is intended to reduce its generalization error but not its training error

Build a “preference” into the model for some solutions in the hypothesis space

Unpreferred solutions are penalized: they are only chosen if they fit the training data much better than the preferred solutions


Page 27: Neural Networks: Backpropagation & Regularization

Regularization

Large parameters → overfitting

Prefer models with smaller weights

Popular regularizers:

Penalize large L2 norm (= Euclidean norm) of weight vectors

Penalize large L1 norm (= Manhattan norm) of weight vectors
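As a small numeric illustration (not from the slides), the two penalty terms for an example weight vector:

```python
import numpy as np

theta = np.array([0.5, -2.0, 3.0])       # example weight vector

l2_penalty = np.sum(theta ** 2)          # squared L2 (Euclidean) norm, theta^T theta
l1_penalty = np.sum(np.abs(theta))       # L1 (Manhattan) norm

print(l2_penalty, l1_penalty)            # 13.25 and 5.5
```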


Page 28: Neural Networks: Backpropagation & Regularization

L2-Regularization

Add term that penalizes large L2 norm of weight vector θ

The amount of penalty is controlled by a parameter λ

\[
J'(\theta) = J(\theta, x, y) + \frac{\lambda}{2}\,\theta^{T}\theta
\]
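A minimal NumPy sketch of how the penalty enters the loss and its gradient; the unregularized loss J and its gradient below are illustrative stand-ins (a least-squares loss), not something specified on the slides:

```python
import numpy as np

def J(theta, x, y):
    return 0.5 * np.sum((x @ theta - y) ** 2)          # example unregularized loss

def grad_J(theta, x, y):
    return x.T @ (x @ theta - y)                       # its gradient w.r.t. theta

def J_reg(theta, x, y, lam):
    return J(theta, x, y) + 0.5 * lam * theta @ theta  # + (lambda/2) * theta^T theta

def grad_J_reg(theta, x, y, lam):
    return grad_J(theta, x, y) + lam * theta           # penalty contributes lambda * theta

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
y = rng.normal(size=10)
theta = rng.normal(size=3)
print(J_reg(theta, x, y, lam=0.1), grad_J_reg(theta, x, y, lam=0.1))
```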


Page 29: Neural Networks: Backpropagation & Regularization

L2-Regularization

The surface of the objective function is now a combination of the original loss and the regularization penalty.
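A one-dimensional illustration (an assumption for clarity, not from the slides): with J(θ) = 0.5·(θ − 3)² and the penalty (λ/2)·θ², the minimum of the combined objective moves from θ = 3 toward zero as λ grows.

```python
import numpy as np

thetas = np.linspace(-1.0, 5.0, 601)   # grid with step 0.01
for lam in [0.0, 0.5, 2.0]:
    J_reg = 0.5 * (thetas - 3.0) ** 2 + 0.5 * lam * thetas ** 2
    print(lam, thetas[np.argmin(J_reg)])   # minimizer 3/(1+lambda): 3.0, 2.0, 1.0
```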


Page 30: Neural Networks: Backpropagation & Regularization

Summary

Feedforward networks: layers of (non-linear) function compositions

Non-linearities for hidden layers: relu, tanh, ...

Non-linearities for output units (classification): σ, softmax

Training via backpropagation: compute the gradient of the cost w.r.t. the parameters using the chain rule

Regularization: penalize large parameter values, e.g. by adding the L2 norm of the parameter vector to the loss


Page 35: Neural Networks: Backpropagation & Regularization

Outlook

“Manually” defining forward and backward passes in numpy is time-consuming

Deep learning frameworks let you define the forward pass as a “computation graph” made up of simple, differentiable operations (e.g., dot products).

They do the backward pass for you

tensorflow + keras, pytorch, theano, MXNet, CNTK, caffe, ...
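For instance, a minimal PyTorch sketch (one of the frameworks listed above) for the example network from earlier; only the forward pass is written out, and autograd produces the gradients:

```python
import torch

# y = v^T relu(W^T x); the cost is an illustrative squared error.
x = torch.randn(3)
W = torch.randn(3, 4, requires_grad=True)
v = torch.randn(4, requires_grad=True)

y = v @ torch.relu(W.T @ x)    # forward pass defines the computation graph
J = 0.5 * (y - 1.0) ** 2       # example cost

J.backward()                   # backward pass is generated automatically
print(W.grad, v.grad)          # gradients of J w.r.t. the parameters
```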
