Lecture 7: Multilayer Perceptrons


Transcript of Lecture 7: Multilayer Perceptrons

Page 1: Lecture 7: Multilayer Perceptrons

Lecture 7: Multilayer Perceptrons

Andreas Wichert, Department of Computer Science and Engineering

Técnico Lisboa

Page 2: Lecture 7: Multilayer Perceptrons

Similarity to real neurons...

Page 3: Lecture 7: Multilayer Perceptrons

• The dot product is a linear operation; its result is represented by the value net

• Nonlinearity can be achieved by a nonlinear activation (transfer) function φ() applied to net
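
As a concrete illustration of the two points above, here is a minimal sketch (not from the slides; the weights, the input, and the choice of tanh as φ are assumptions) that computes net as a dot product and then applies a nonlinear transfer function:

```python
import numpy as np

# Illustrative values only; w, x and the choice of tanh as the transfer function are assumed.
w = np.array([0.5, -0.3, 0.8])   # weight vector
x = np.array([1.0, 2.0, -1.0])   # input pattern

net = np.dot(w, x)               # linear part: the dot product, the value "net"
o = np.tanh(net)                 # nonlinearity from the transfer function φ

print(net, o)
```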

Page 4: Lecture 7: Multilayer Perceptrons

Sgn

• An example of a nonlinear transfer function is the sgn (sign) function

Page 5: Lecture 7: Multilayer Perceptrons

Perceptron (1957)

• Linear threshold unit (LTU)

[Figure: linear threshold unit with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias weight w0 with x0 = 1, a summation node Σ, and output o]

McCulloch-Pitts model of a neuron (1943)

The “bias”, a constant term that does not depend on any input value
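
A common way to implement the bias, matching the x0 = 1 convention in the figure, is to prepend a constant 1 to the input so that w0 is treated like any other weight; a minimal sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.2, 0.5, -0.4])      # w0 (bias weight), w1, w2 -- example values
x = np.array([1.5, -0.5])           # original input (x1, x2)

x_aug = np.concatenate(([1.0], x))  # prepend x0 = 1, so the bias contributes w0 * 1
net = np.dot(w, x_aug)              # w0 + w1*x1 + w2*x2
print(net)
```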

Page 6: Lecture 7: Multilayer Perceptrons

Linearly separable patterns

X0=1, bias...

Page 7: Lecture 7: Multilayer Perceptrons

• The goal of a perceptron is to correctly classify each pattern of the set D = {x1, x2, ..., xm} into one of the classes C1 and C2

• The output for class C1 is o = 1 and for class C2 is o = −1

• For n = 2, the decision boundary is a line

Page 8: Lecture 7: Multilayer Perceptrons

Perceptron learning rule

• Consider linearly separable problems

• How do we find appropriate weights?

• Initialize the weight vector w to small random values

• Check whether the output o has the desired value d, i.e. whether the pattern is assigned to the desired class

• η is called the learning rate, with 0 < η ≤ 1

Δw = η · (d − o) · x
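
A minimal sketch of this learning rule on a toy linearly separable data set (the data, the learning rate η = 0.1, and the number of epochs are assumptions for illustration):

```python
import numpy as np

# Toy linearly separable data: class C1 (d = +1) vs. class C2 (d = -1); values are made up.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([1, 1, -1, -1])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=3)   # small random weights; w[0] is the bias weight
eta = 0.1                            # learning rate, 0 < eta <= 1

for epoch in range(20):
    for x, target in zip(X, d):
        x_aug = np.concatenate(([1.0], x))        # x0 = 1 for the bias
        o = 1 if np.dot(w, x_aug) >= 0 else -1    # sgn output of the linear threshold unit
        w += eta * (target - o) * x_aug           # perceptron rule: Δw = η (d - o) x

print(w)   # weights of a separating line
```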

Page 9: Lecture 7: Multilayer Perceptrons

• In supervised learning the network's output is compared with known correct answers

• Supervised learning: learning with a teacher

• (d − o) plays the role of the error signal

• The algorithm converges to the correct classification if the training data is linearly separable and η is sufficiently small

Page 10: Lecture 7: Multilayer Perceptrons

XOR problem and Perceptron

• Analyzed by Minsky and Papert in the mid-1960s

Page 11: Lecture 7: Multilayer Perceptrons

Multi-layer Networks

• The limitations of the simple perceptron do not apply to feed-forward networks with intermediate or "hidden" nonlinear units

• A network with just one hidden layer can represent any Boolean function

• The great power of multi-layer networks was realized long ago

• But it was only in the eighties that it was shown how to make them learn

Page 12: Lecture 7: Multilayer Perceptrons

XOR-example
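
The slide itself shows the network only as a figure; the sketch below uses hand-picked (assumed, not learned) weights for a network with one hidden layer of threshold units that computes XOR:

```python
def step(z):
    # Threshold activation: 1 if z >= 0, else 0.
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: h1 behaves like OR, h2 like AND (hand-picked weights).
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    # Output: "OR but not AND", which is exactly XOR.
    return step(h1 - h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table
```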

Page 13: Lecture 7: Multilayer Perceptrons

• Multiple layers of cascaded linear units still produce only linear functions

• We search for networks capable of representing nonlinear functions

• Units should use nonlinear activation functions

• Examples of nonlinear activation functions:

Page 14: Lecture 7: Multilayer Perceptrons

Gradient Descent for one Unit

Page 15: Lecture 7: Multilayer Perceptrons

Linear Unit

Page 16: Lecture 7: Multilayer Perceptrons

Sigmoid Unit

Page 17: Lecture 7: Multilayer Perceptrons

Logistic Regression

Page 18: Lecture 7: Multilayer Perceptrons

Sigmoid Unit versus Logistic Regression

Page 19: Lecture 7: Multilayer Perceptrons

Linear Unit versus Logistic Regression

Page 20: Lecture 7: Multilayer Perceptrons

Distance

Page 21: Lecture 7: Multilayer Perceptrons

Better Decision Boundary of Logistic Regression LR (sigmoid) Compared to the Linear Unit

Page 22: Lecture 7: Multilayer Perceptrons

Back-propagation (1980)

• Back-propagation is a learning algorithm for multi-layer neural networks

• It was invented independently several times:
  • Bryson and Ho [1969]
  • Werbos [1974]
  • Parker [1985]
  • Rumelhart et al. [1986]

Parallel Distributed Processing, Vol. 1: Foundations. David E. Rumelhart, James L. McClelland and the PDP Research Group

What makes people smarter than computers? These volumes by a pioneering neurocomputing...

Page 23: Lecture 7: Multilayer Perceptrons

• Feed-forward networks with hidden nonlinear units are universal approximators; they can approximate every bounded continuous function with an arbitrarily small error

• Each Boolean function can be represented by a network with a single hidden layer

• However, the representation may require an exponential number of hidden units

• The hidden units should be nonlinear, because multiple layers of linear units can only produce linear functions

Page 24: Lecture 7: Multilayer Perceptrons
Page 25: Lecture 7: Multilayer Perceptrons
Page 26: Lecture 7: Multilayer Perceptrons

Back-propagation

• The algorithm gives a prescription for changing the weights wij in any feed-forward network so as to learn a training set of input-output pairs {(xk, yk)}

• We consider a simple two-layer network

Page 27: Lecture 7: Multilayer Perceptrons

• The input pattern is represented by the five-dimensional vector x

• Nonlinear hidden units compute the outputs V1, V2, V3

• Two output units compute the outputs o1 and o2

• The units V1, V2, V3 are referred to as hidden units because we cannot see their outputs and cannot directly perform error correction on them

Page 28: Lecture 7: Multilayer Perceptrons

• The output layer of a feed-forward network can be trained by the perceptron rule (stochastic gradient descent), since it is itself a perceptron

Page 29: Lecture 7: Multilayer Perceptrons
Page 30: Lecture 7: Multilayer Perceptrons
Page 31: Lecture 7: Multilayer Perceptrons
Page 32: Lecture 7: Multilayer Perceptrons
Page 33: Lecture 7: Multilayer Perceptrons
Page 34: Lecture 7: Multilayer Perceptrons
Page 35: Lecture 7: Multilayer Perceptrons
Page 36: Lecture 7: Multilayer Perceptrons
Page 37: Lecture 7: Multilayer Perceptrons

• In general, with an arbitrary number of layers, the back-propagation update rule always has the form

Δw_ij = η Σ_{d=1..m} δ_output^d · V_input^d

• where "output" and "input" refer to the two units joined by the connection concerned

• V stands for the appropriate input (a hidden unit output or a real input x^d)

• δ depends on the layer concerned

Page 38: Lecture 7: Multilayer Perceptrons

• This approach can be extended to any number of layers

• The weights are used as usual in the forward pass, but the errors represented by δ are propagated backward

Page 39: Lecture 7: Multilayer Perceptrons

Networks with Hidden Linear Layers

Page 40: Lecture 7: Multilayer Perceptrons
Page 41: Lecture 7: Multilayer Perceptrons

• We have to use a nonlinear differentiable activation function in hidden units

• Examples:

f(x) = σ(x) = 1 / (1 + e^(−α·x))

f'(x) = σ'(x) = α · σ(x) · (1 − σ(x))
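
These two formulas translate directly into code; a minimal sketch (α = 1 and the test point 0.2 are choices made here, not given on this slide; the value σ(0.2) = 0.54983 reappears in the worked example later):

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    # f(x) = 1 / (1 + exp(-alpha * x))
    return 1.0 / (1.0 + np.exp(-alpha * x))

def sigmoid_prime(x, alpha=1.0):
    # f'(x) = alpha * sigma(x) * (1 - sigma(x))
    s = sigmoid(x, alpha)
    return alpha * s * (1.0 - s)

print(sigmoid(0.2), sigmoid_prime(0.2))
```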

Page 42: Lecture 7: Multilayer Perceptrons

Two Kinds of Units

• Output units
  • Require a bias
  • Solve linearly separable problems, which means the input to them has to be somehow linearised
  • Do not require a nonlinear activation function
  • We should use the sigmoid function or softmax to represent probabilities and to get a better decision boundary (see the sketch after this list)

• Hidden units
  • Nonlinear activation function
  • Feature extraction
  • Do they require a bias? It is commonly used; the Universal Approximation Theorem uses hidden units with a bias
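
A minimal sketch of the two output choices mentioned in the list above (the example net values are assumptions): a sigmoid for a single two-class output, a softmax for mutually exclusive classes:

```python
import numpy as np

def sigmoid(net):
    # Single output unit: probability of class 1.
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):
    # Several output units: probabilities over mutually exclusive classes.
    e = np.exp(net - np.max(net))   # subtracting the max is a standard stability trick
    return e / e.sum()

print(sigmoid(0.7))                          # e.g. P(class 1) for one output unit
print(softmax(np.array([0.7, 0.1, -0.4])))   # three class probabilities, summing to 1
```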

Page 43: Lecture 7: Multilayer Perceptrons

Output Units are linear (Perceptron)

• The hidden layer applies a nonlinear transformation from the input space to the hidden space

• In the hidden space a linear discrimination can be performed

[Figure: network diagram in which each unit applies the activation function f( )]

Page 44: Lecture 7: Multilayer Perceptrons

Bias?

Page 45: Lecture 7: Multilayer Perceptrons

More on Back-Propagation

• Gradient descent over the entire network weight vector

• Easily generalized to arbitrary directed graphs

• Will find a local, not necessarily global, error minimum

• In practice, often works well (can run multiple times)

Page 46: Lecture 7: Multilayer Perceptrons

• Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large

• Often a weight momentum term α is included

• The momentum parameter α is chosen between 0 and 1; 0.9 is a good value

Δw_pq(t+1) = −η · ∂E/∂w_pq + α · Δw_pq(t)
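
A minimal sketch of this momentum update; the error gradient here is only a placeholder (a real network would obtain it by back-propagation), and the weights and number of steps are assumptions:

```python
import numpy as np

eta, alpha = 0.1, 0.9              # learning rate and momentum parameter (0.9 as suggested)
w = np.array([0.5, -0.2])          # example weights
delta_w = np.zeros_like(w)         # previous update Δw(t)

def grad_E(w):
    # Placeholder gradient of the error E(w) = ||w||^2; stands in for the back-propagated gradient.
    return 2.0 * w

for t in range(5):
    delta_w = -eta * grad_E(w) + alpha * delta_w   # Δw(t+1) = -η ∂E/∂w + α Δw(t)
    w = w + delta_w
    print(t, w)
```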

Page 47: Lecture 7: Multilayer Perceptrons

• Minimizes the error over the training examples

• Will it generalize well to unseen examples?

• Training can take thousands of iterations; it is slow!

• Using network after training is very fast

Page 48: Lecture 7: Multilayer Perceptrons

Convergence of Back-propagation

• Gradient descent converges to some local minimum
  • perhaps not the global minimum...
  • add momentum
  • use stochastic gradient descent
  • train multiple nets with different initial weights

• Nature of convergence
  • initialize the weights near zero
  • therefore, the initial network is near-linear
  • increasingly non-linear functions become possible as training progresses

Page 49: Lecture 7: Multilayer Perceptrons
Page 50: Lecture 7: Multilayer Perceptrons

Expressive Capabilities of ANNs

• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • but it might require a number of hidden units that is exponential in the number of inputs

• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • See: https://en.wikipedia.org/wiki/Universal_approximation_theorem

• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Page 51: Lecture 7: Multilayer Perceptrons
Page 52: Lecture 7: Multilayer Perceptrons

Early-Stopping Rule

Page 53: Lecture 7: Multilayer Perceptrons
Page 54: Lecture 7: Multilayer Perceptrons
Page 55: Lecture 7: Multilayer Perceptrons

Cross Validation to determine Parameters

Page 56: Lecture 7: Multilayer Perceptrons

Example (different notation)

• Consider a network with M layers, m = 1, 2, ..., M

• V_i^m denotes the output of the i-th unit of the m-th layer

• V_i^0 is a synonym for x_i, the i-th input

• The index m refers to layers, not to patterns

• w_ij^m denotes the connection from V_j^(m-1) to V_i^m

Page 57: Lecture 7: Multilayer Perceptrons

Δw_jk = η Σ_{d=1..m} δ_j^d · x_k^d

ΔW_ij = η Σ_{d=1..m} δ_i^d · V_j^d

• Both updates have the same form, with a different definition of δ

• The superscript d is the pattern index

Page 58: Lecture 7: Multilayer Perceptrons

• The equation

δ_j^d = f'(net_j^d) Σ_{i=1..2} W_ij · δ_i^d

• allows us to determine the δ of a given hidden unit V_j in terms of the δ's of the output units o_i

• The weights are used as usual in the forward pass, but the errors δ are propagated backward

• hence the name back-propagation

Page 59: Lecture 7: Multilayer Perceptrons

Stochastic Back-Propagation Algorithm (mostly used)

1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal through the network

4. Compute the deltas for the output layer

5. Compute the deltas for the preceding layer for m=M,M-1,..2

6. Update all connections

7. Goto 2 and repeat for the next pattern

V_i^m = f(net_i^m) = f(Σ_j w_ij^m · V_j^(m-1))

δ_i^M = f'(net_i^M) · (t_i^d − V_i^M)

δ_i^(m-1) = f'(net_i^(m-1)) · Σ_j w_ji^m · δ_j^m

Δw_ij^m = η · δ_i^m · V_j^(m-1)

w_ij^new = w_ij^old + Δw_ij
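
The sketch below follows these seven steps for the small two-layer network of the next slides, using the two training patterns given there; the learning rate, number of epochs, and random weight initialization are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(net):
    # Sigmoid activation, as in the worked example.
    return 1.0 / (1.0 + np.exp(-net))

def f_prime(net):
    # f'(net) = f(net) * (1 - f(net))
    s = f(net)
    return s * (1.0 - s)

# The two training patterns of the worked example.
X = np.array([[1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1]], dtype=float)
T = np.array([[1, 0],
              [0, 1]], dtype=float)

# Step 1: initialize the weights to small random values (3 hidden units, 2 output units).
w = rng.normal(scale=0.1, size=(3, 5))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(2, 3))   # hidden-to-output weights
eta = 0.5                                # learning rate (assumed)

for epoch in range(2000):
    for x, t in zip(X, T):                                   # Step 2: choose a pattern
        net_hidden = w @ x                                    # Step 3: propagate the signal
        V = f(net_hidden)
        net_out = W @ V
        o = f(net_out)
        delta_out = f_prime(net_out) * (t - o)                # Step 4: output-layer deltas
        delta_hid = f_prime(net_hidden) * (W.T @ delta_out)   # Step 5: preceding-layer deltas
        W += eta * np.outer(delta_out, V)                     # Step 6: update all connections
        w += eta * np.outer(delta_hid, x)                     # Step 7: continue with next pattern

print(np.round(f(W @ f(w @ X.T)).T, 2))   # trained outputs, to be compared with T
```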

Page 60: Lecture 7: Multilayer Perceptrons

Example

w1={w11=0.1,w12=0.1,w13=0.1,w14=0.1,w15=0.1}

w2={w21=0.1,w22=0.1,w23=0.1,w24=0.1,w25=0.1}

w3={w31=0.1,w32=0.1,w33=0.1,w34=0.1,w35=0.1}

W1={W11=0.1,W12=0.1,W13=0.1}

W2={W21=0.1,W22=0.1,W23=0.1}

X1={1,1,0,0,0}; t1={1,0}

X2={0,0,0,1,1}; t2={0,1}

f(x) = σ(x) = 1 / (1 + e^(−x))

f'(x) = σ'(x) = σ(x) · (1 − σ(x))

Page 61: Lecture 7: Multilayer Perceptrons

net_1^1 = Σ_{k=1..5} w_1k · x_k^1,   V_1^1 = f(net_1^1) = 1 / (1 + e^(−net_1^1))

net_2^1 = Σ_{k=1..5} w_2k · x_k^1,   V_2^1 = f(net_2^1) = 1 / (1 + e^(−net_2^1))

net_3^1 = Σ_{k=1..5} w_3k · x_k^1,   V_3^1 = f(net_3^1) = 1 / (1 + e^(−net_3^1))

net_1^1 = 1·0.1 + 1·0.1 + 0·0.1 + 0·0.1 + 0·0.1 = 0.2   (likewise net_2^1 = net_3^1 = 0.2)

V_1^1 = f(net_1^1) = 1/(1+exp(−0.2)) = 0.54983
V_2^1 = f(net_2^1) = 1/(1+exp(−0.2)) = 0.54983
V_3^1 = f(net_3^1) = 1/(1+exp(−0.2)) = 0.54983

Page 62: Lecture 7: Multilayer Perceptrons

net_1^1 = Σ_{j=1..3} W_1j · V_j^1,   o_1^1 = f(net_1^1) = 1 / (1 + e^(−net_1^1))

net_2^1 = Σ_{j=1..3} W_2j · V_j^1,   o_2^1 = f(net_2^1) = 1 / (1 + e^(−net_2^1))

net_1^1 = 0.54983·0.1 + 0.54983·0.1 + 0.54983·0.1 = 0.16495
o_1^1 = f(net_1^1) = 1/(1+exp(−0.16495)) = 0.54114

net_2^1 = 0.54983·0.1 + 0.54983·0.1 + 0.54983·0.1 = 0.16495
o_2^1 = f(net_2^1) = 1/(1+exp(−0.16495)) = 0.54114

Page 63: Lecture 7: Multilayer Perceptrons

For hidden-to-output connections

• We will use stochastic gradient descent with η = 1

ΔW_ij = η Σ_{d=1..m} (t_i^d − o_i^d) · f'(net_i^d) · V_j^d

For a single pattern and η = 1 this becomes

ΔW_ij = (t_i − o_i) · f'(net_i) · V_j

With f'(x) = σ'(x) = σ(x) · (1 − σ(x)):

ΔW_ij = (t_i − o_i) · σ(net_i) · (1 − σ(net_i)) · V_j

δ_i = (t_i − o_i) · σ(net_i) · (1 − σ(net_i)),   ΔW_ij = δ_i · V_j

Page 64: Lecture 7: Multilayer Perceptrons

• δ_1 = (1 − 0.54114) · (1/(1+exp(−0.16495))) · (1 − 1/(1+exp(−0.16495))) = 0.11394

• δ_2 = (0 − 0.54114) · (1/(1+exp(−0.16495))) · (1 − 1/(1+exp(−0.16495))) = −0.13437

δ_1 = (t_1 − o_1) · σ(net_1) · (1 − σ(net_1)),   ΔW_1j = δ_1 · V_j

δ_2 = (t_2 − o_2) · σ(net_2) · (1 − σ(net_2)),   ΔW_2j = δ_2 · V_j

Page 65: Lecture 7: Multilayer Perceptrons

Input-to-hidden connections

Δw_jk = (Σ_{i=1..2} δ_i · W_ij) · f'(net_j) · x_k

Δw_jk = (Σ_{i=1..2} δ_i · W_ij) · σ(net_j) · (1 − σ(net_j)) · x_k

δ_j = σ(net_j) · (1 − σ(net_j)) · Σ_{i=1..2} W_ij · δ_i

Δw_jk = δ_j · x_k

Page 66: Lecture 7: Multilayer Perceptrons

δ_1 = 1/(1+exp(−0.2)) · (1 − 1/(1+exp(−0.2))) · (0.1·0.11394 + 0.1·(−0.13437))

δ_1 = −5.0568e-04
δ_2 = −5.0568e-04
δ_3 = −5.0568e-04

δ_1 = σ(net_1) · (1 − σ(net_1)) · Σ_{i=1..2} W_i1 · δ_i

δ_2 = σ(net_2) · (1 − σ(net_2)) · Σ_{i=1..2} W_i2 · δ_i

δ_3 = σ(net_3) · (1 − σ(net_3)) · Σ_{i=1..2} W_i3 · δ_i

Page 67: Lecture 7: Multilayer Perceptrons

First Adaptation for x1 (one epoch is an adaptation over all training patterns, in our case x1 and x2)

Hidden layer deltas:   δ_1 = −5.0568e-04,  δ_2 = −5.0568e-04,  δ_3 = −5.0568e-04
Output layer deltas:   δ_1 = 0.11394,  δ_2 = −0.13437

Inputs:          x_1 = 1,  x_2 = 1,  x_3 = 0,  x_4 = 0,  x_5 = 0
Hidden outputs:  V_1 = 0.54983,  V_2 = 0.54983,  V_3 = 0.54983

ΔW_ij = δ_i · V_j

Δw_jk = δ_j · x_k
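
The numbers of this first adaptation step can be checked with a short script that reproduces the forward pass and the deltas exactly as computed above (nothing new is introduced; only the printing is added):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and the first pattern from the example (all weights equal to 0.1).
w = np.full((3, 5), 0.1)          # input-to-hidden weights
W = np.full((2, 3), 0.1)          # hidden-to-output weights
x1 = np.array([1, 1, 0, 0, 0.0])
t1 = np.array([1, 0.0])

# Forward pass.
net_hidden = w @ x1               # each component: 0.2
V = sigmoid(net_hidden)           # 0.54983
net_out = W @ V                   # 0.16495
o = sigmoid(net_out)              # 0.54114

# Deltas.
delta_out = (t1 - o) * o * (1 - o)              # 0.11394 and -0.13437
delta_hidden = V * (1 - V) * (W.T @ delta_out)  # each: -5.0568e-04

print(V, o, delta_out, delta_hidden)
```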

Page 68: Lecture 7: Multilayer Perceptrons
Page 69: Lecture 7: Multilayer Perceptrons

• The input pattern is represented by the five-dimensional vector x

• Nonlinear hidden units compute the outputs V1, V2, V3

• Two output units compute the outputs o1 and o2

• The units V1, V2, V3 are referred to as hidden units because we cannot see their outputs and cannot directly perform error correction on them

Page 70: Lecture 7: Multilayer Perceptrons
Page 71: Lecture 7: Multilayer Perceptrons
Page 72: Lecture 7: Multilayer Perceptrons
Page 73: Lecture 7: Multilayer Perceptrons
Page 74: Lecture 7: Multilayer Perceptrons
Page 75: Lecture 7: Multilayer Perceptrons
Page 76: Lecture 7: Multilayer Perceptrons
Page 77: Lecture 7: Multilayer Perceptrons
Page 78: Lecture 7: Multilayer Perceptrons
Page 79: Lecture 7: Multilayer Perceptrons
Page 80: Lecture 7: Multilayer Perceptrons

Literature

• Simon O. Haykin, Neural Networks and Learning Machines (3rd Edition), Pearson 2008

• Chapter 4

• Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer 2006

• Chapter 5

Page 81: Lecture 7: Multilayer Perceptrons

Literature (Additional)

• Introduction To The Theory Of Neural Computation (Santa Fe Institute Series Book 1), John A. Hertz, Anders S. Krogh, Richard G. Palmer, Addison-Wesley Pub. Co, Redwood City, CA; 1 edition (January 1, 1991)

• Chapter 6

Page 82: Lecture 7: Multilayer Perceptrons

Literature

• Machine Learning - A Journey to Deep Learning, A. Wichert, Luis Sa-Couto, World Scientific, 2021

• Chapter 6