ME II Kap. 8b, H. Burkhardt, Institut für Informatik, Universität Freiburg

Page 1

Learning strategies for neuronal nets - the backpropagation algorithm

In contrast to the NNs with threshold units treated so far, we now consider NNs with non-linear activation functions f(x). The most common choice is the sigmoid function ψ(x):

f(x) = \psi(x) = \frac{1}{1 + e^{-ax}}

[Figure: sigmoid curves over x ∈ [-6, 6] for several values of the steepness parameter a]

With increasing a it approximates the step function σ(x).

Page 2

The function ψ(x) is closely related to the tanh function. For a = 1 we have:

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,\psi(2x) - 1

If the two parameters a and b are added, a more general form results:

\psi_g(x) = \frac{1}{1 + e^{-(ax - b)}}

with a controlling the steepness and b controlling the position on the x-axis.

The non-linearity of the activation function is the core reason for the existence of the multi-layer perceptron; without this property the multi-layer perceptron would collapse into a trivial linear one-layer network.

The differentiability of f(x) allows us to apply the necessary conditions (∂(·)/∂w_{i,j}) for the optimization of the weight coefficients w_{i,j}.

The first derivative of the sigmoid function can be expressed in terms of the function itself:

\frac{d\psi(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \psi(x)\,(1 - \psi(x))
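The identities above are easy to check numerically. The following short sketch (plain NumPy; the helper name psi is mine, not part of the lecture material) verifies the derivative identity ψ'(x) = ψ(x)(1 − ψ(x)) by central differences and the tanh relation for a = 1:

```python
import numpy as np

def psi(x, a=1.0):
    """Sigmoid activation psi(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.linspace(-6.0, 6.0, 121)

# Derivative identity: d/dx psi(x) = psi(x) * (1 - psi(x))
h = 1e-5
numerical = (psi(x + h) - psi(x - h)) / (2.0 * h)
assert np.allclose(numerical, psi(x) * (1.0 - psi(x)), atol=1e-8)

# Relation to tanh for a = 1: tanh(x) = 2*psi(2x) - 1
assert np.allclose(np.tanh(x), 2.0 * psi(2.0 * x) - 1.0)
```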

Page 3

Figure of the extended weight matrix

\mathbf{W}' = [\,\mathbf{b}\;\;\mathbf{W}\,] =
\begin{bmatrix}
w_{1,0} = b_1 & w_{1,1} & w_{1,2} & \cdots & w_{1,N} \\
w_{2,0} = b_2 & w_{2,1} & w_{2,2} & \cdots & w_{2,N} \\
w_{3,0} = b_3 & w_{3,1} & w_{3,2} & \cdots & w_{3,N} \\
\vdots & \vdots & \vdots & & \vdots \\
w_{M,0} = b_M & w_{M,1} & w_{M,2} & \cdots & w_{M,N}
\end{bmatrix}
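As an illustration (my own, not from the slides), the following NumPy sketch builds the extended weight matrix W' = [b W] and evaluates one layer on the extended input x' = [1, xᵀ]ᵀ; sizes and names are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3                               # inputs and neurons of the layer

W = rng.normal(size=(M, N))               # weight matrix W (M x N)
b = rng.normal(size=(M, 1))               # bias vector b (M x 1)
W_ext = np.hstack([b, W])                 # extended weight matrix W' = [b  W], shape (M, N+1)

x = rng.normal(size=(N, 1))
x_ext = np.vstack([np.ones((1, 1)), x])   # extended input x' = [1; x]

psi = lambda s: 1.0 / (1.0 + np.exp(-s))  # sigmoid activation
y = psi(W_ext @ x_ext)                    # layer output, shape (M, 1)

# The extended form is equivalent to the explicit bias formulation:
assert np.allclose(y, psi(W @ x + b))
```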

Page 4

The neural net with H layers

The multi-layer NN with H layers has its own weight matrix W'_i in every layer, but identical sigmoid functions:

Learning is based on adjusting the weight matrices so as to minimize a squared-error criterion between the target value y and the approximation ŷ produced by the net (supervised learning). The expected value is formed over the available training ensemble {y_j, x_j}:

J = \min_{\mathbf{W}'_i} E\{\,\|\hat{\mathbf{y}} - \mathbf{y}\|^2\,\} = \min_{\mathbf{W}'_i} E\{\,\|\mathbf{y}_H - \mathbf{y}\|^2\,\}

\mathbf{x}' \rightarrow [\mathbf{W}'_1] \rightarrow \mathbf{y}_1 \rightarrow [\mathbf{W}'_2] \rightarrow \mathbf{y}_2 \rightarrow \cdots \rightarrow \mathbf{y}_{H-1} \rightarrow [\mathbf{W}'_H] \rightarrow \mathbf{y}_H

\hat{\mathbf{y}} = \mathbf{y}_H = \boldsymbol{\psi}(\mathbf{W}'_H\,\boldsymbol{\psi}(\cdots\boldsymbol{\psi}(\mathbf{W}'_2\,\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}'))\cdots))

Note: the one-layer NN and the linear polynomial classifier are identical if the activation function is set to ψ ≡ 1.
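A minimal sketch of the forward map ŷ = ψ(W'_H ψ(… ψ(W'_1 x') …)) with extended weight matrices, assuming sigmoid activations in every layer (the layer sizes below are arbitrary example values):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, W_ext_list):
    """Forward pass through H layers, each given by an extended matrix W'_r = [b_r  W_r]."""
    y = x
    for W_ext in W_ext_list:
        y_ext = np.vstack([np.ones((1, 1)), y])   # prepend the constant 1 -> extended vector
        y = sigmoid(W_ext @ y_ext)
    return y

rng = np.random.default_rng(1)
sizes = [4, 6, 5, 3]                              # N, N_1, N_2, N_H for an H = 3 layer net
W_ext_list = [rng.normal(size=(sizes[r + 1], sizes[r] + 1)) for r in range(len(sizes) - 1)]

x = rng.normal(size=(sizes[0], 1))
y_hat = forward(x, W_ext_list)                    # network output y_H, shape (3, 1)
print(y_hat.ravel())
```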

Page 5

Relation to the concept of function approximation by a linear combination of basis functions

The first layer creates the basis functions in the form of a hidden vector y_1, and the second layer forms a linear combination of these basis functions.

Therefore the coefficient matrix W'_1 of the first layer controls the shape of the basis functions, while the weight matrix W'_2 of the second layer contains the coefficients of the linear combination. Additionally the result is weighted by the activation function.

Page 6

Reaction of a neuron with two inputs and a threshold function σ

Page 7

Reaction of a neuron with two inputs and a sigmoid function ψ

Page 8

Overlapping two neurons with two inputs each and a sigmoid function

Page 9

Reaction of a neuron in the second layer to two neurons of the first layer after weighting by the sigmoid function

Page 10

The backpropagation learning algorithm

• The class mapping is done by a multi-layer perceptron, which produces a 1 on the output in the regions of x that are populated by the samples of the corresponding class, and a 0 in the areas occupied by other classes. In the areas in between and outside, an interpolation resp. extrapolation is performed.

• The network is trained by optimizing the squared quality criterion:

J = E\{\,\|\hat{\mathbf{y}}(\mathbf{W}'_i, \mathbf{x}') - \mathbf{y}\|^2\,\}

• This criterion is non-linear both in the elements of the input vector x' and in the weight coefficients {w'^{\,h}_{ij}}.

Page 11

• Inserting the function map of the multi-layer perceptron yields:

J = E\{\,\|\boldsymbol{\psi}(\mathbf{W}'_H\,\boldsymbol{\psi}(\cdots\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}')\cdots)) - \mathbf{y}\|^2\,\}

• Sought is the global minimum. It is not clear whether such an element exists or how many local minima exist. The backpropagation algorithm solves the problem iteratively with a gradient algorithm. One iterates according to:

\mathbf{W}' \leftarrow \mathbf{W}' - \alpha\,\nabla_{\mathbf{W}'}J \qquad\text{with } \mathbf{W}' = \{\mathbf{W}'_1, \mathbf{W}'_2, \ldots, \mathbf{W}'_H\}

and the gradients:

\nabla_{\mathbf{W}'}J = \frac{\partial J}{\partial \mathbf{W}'} = \left\{\frac{\partial J}{\partial w'^{\,h}_{nm}}\right\}

• The iteration terminates when the gradient vanishes at a (local or even global) minimum.

• A proper choice of α is difficult. Small values increase the number of required iterations. High values reduce the probability of running into a local minimum, but risk making the method diverge and miss the minimum (or give rise to oscillations); a small numerical illustration follows below.
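The effect of α can be seen on a toy problem that is not part of the slides: plain gradient descent on the one-dimensional criterion J(w) = ½cw² converges only for α < 2/c and oscillates or diverges for larger steps, which illustrates the trade-off described in the last bullet:

```python
import numpy as np

def gradient_descent(alpha, c=4.0, w0=1.0, iters=30):
    """Iterate w <- w - alpha * dJ/dw for J(w) = 0.5 * c * w**2, i.e. dJ/dw = c * w."""
    w = w0
    for _ in range(iters):
        w = w - alpha * c * w
    return w

for alpha in (0.05, 0.2, 0.45, 0.6):
    print(f"alpha = {alpha:4.2f} -> w after 30 steps: {gradient_descent(alpha):+.3e}")
# Small alpha: slow but safe convergence; alpha > 2/c = 0.5: the iteration diverges.
```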

Page 12

• The iteration has only linear convergence (gradient algorithm).

• The error caused by a single sample results in:

\hat{\mathbf{y}}(\mathbf{x}) = \boldsymbol{\psi}(\mathbf{W}'_H\,\boldsymbol{\psi}(\cdots\boldsymbol{\psi}(\mathbf{W}'_2\,\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}'))\cdots))

J_j = \tfrac{1}{2}\,\|\hat{\mathbf{y}}_j(\mathbf{x}_j) - \mathbf{y}_j\|^2 = \tfrac{1}{2}\sum_{k=1}^{N_H}\left(\hat{y}_k(j) - y_k(j)\right)^2

• The expected value E{·} of the gradient has to be approximated by the mean value of all sample gradients:

\nabla J = \frac{1}{n}\sum_{j=1}^{n}\nabla J_j

• Calculating the gradient: for the derivative of the composed function, the chain rule is needed.

Page 13

• We have to distinguish between
  – individual learning, based on the last sample:

    [\mathbf{x}_j, \mathbf{y}_j] \;\Rightarrow\; \nabla J_j

  – and cumulative learning (batch learning), based on:

    \nabla J = \frac{1}{n}\sum_{j=1}^{n}\nabla J_j

• Partial derivatives for one layer: for the r-th layer we have

y^r_m = \psi(s^r_m) = \psi\!\left(\sum_{n=0}^{N_r} w'^{\,r}_{mn}\, x^r_n\right)

where x^r_n is the n-th input and y^r_m the m-th output of layer r.

Page 14

Definition of the variables of two layers for the backpropagation algorithm

[Figure: two consecutive layers; layer r−1 has activations s_k^{r−1}, outputs y_k^{r−1} and sensitivities δ_k^{r−1}; the weights w'_{jk}^r connect it to layer r with activations s_j^r, outputs y_j^r and sensitivities δ_j^r]

Page 15

• First we calculate the effect of the last hidden layer on the output layer r = H. Taking partial derivatives of the quality function J (actually J_j, but the index j is omitted for simplicity) and applying the chain rule gives:

\frac{\partial J}{\partial w'^{\,H}_{nm}} = \frac{\partial J}{\partial s^H_n}\cdot\frac{\partial s^H_n}{\partial w'^{\,H}_{nm}} = -\delta^H_n\, y^{H-1}_m

• Introducing the sensitivity of cell n,

\delta_n = -\frac{\partial J}{\partial s_n}

• this results in:

\delta^H_n = -\frac{\partial J}{\partial s^H_n} = -\frac{\partial}{\partial \hat{y}_n}\left(\tfrac{1}{2}\sum_{k=1}^{N_H}(\hat{y}_k - y_k)^2\right)\cdot\frac{\partial \hat{y}_n}{\partial s^H_n} = (y_n - \hat{y}_n)\, f'(s^H_n)

• and thus for the update of the weights:

w_{\text{new}} = w_{\text{old}} + \Delta w \qquad\text{with } \Delta w = -\alpha\,\frac{\partial J}{\partial w}

\Delta w^H_{nm} = \alpha\,\delta^H_n\, y^{H-1}_m = -\alpha\,(\hat{y}_n - y_n)\, f'(s^H_n)\, y^{H-1}_m
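In vector form, the output-layer step above can be sketched as follows (my own NumPy illustration; it assumes a sigmoid output layer, so f'(s) = ψ(s)(1 − ψ(s)), and omits the bias column):

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(2)
alpha = 0.1
N_H, N_Hm1 = 3, 5                          # output units and units of layer H-1

W_H = rng.normal(size=(N_H, N_Hm1))        # weights of the output layer
y_prev = rng.uniform(size=(N_Hm1, 1))      # outputs y^(H-1) of the previous layer
y_target = rng.uniform(size=(N_H, 1))      # target vector y

s_H = W_H @ y_prev                         # activations s^H
y_hat = sigmoid(s_H)                       # network output y^H = f(s^H)

# Sensitivities of the output layer: delta^H = (y - y_hat) * f'(s^H)
delta_H = (y_target - y_hat) * y_hat * (1.0 - y_hat)

# Weight update: Delta w_nm^H = alpha * delta_n^H * y_m^(H-1)  (outer product)
W_H += alpha * delta_H @ y_prev.T
```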

Page 16

• For all other hidden layers r < H the reasoning is a little more complex. Because of the mutual dependencies, the value s^{r-1}_j affects all elements s^r_k of the following layer. Using the chain rule results in:

\delta^{\,r-1}_j = -\frac{\partial J}{\partial s^{r-1}_j} = -\sum_k \frac{\partial J}{\partial s^r_k}\cdot\frac{\partial s^r_k}{\partial s^{r-1}_j} = \sum_k \delta^r_k\,\frac{\partial s^r_k}{\partial s^{r-1}_j}

• With

\frac{\partial s^r_k}{\partial s^{r-1}_j} = \frac{\partial}{\partial s^{r-1}_j}\left(\sum_m w'^{\,r}_{km}\, f(s^{r-1}_m)\right) = w'^{\,r}_{kj}\, f'(s^{r-1}_j)

• this results in:

\delta^{\,r-1}_j = f'(s^{r-1}_j)\left[\sum_k w'^{\,r}_{kj}\,\delta^r_k\right]

• i.e. the errors "backpropagate" from the output layer to the lower layers!
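A small sketch of this recursion in vector form (my own notation; sigmoid layers assumed, so f'(s) = ψ(s)(1 − ψ(s))): the bracketed sum is simply (W^r)^T δ^r, multiplied element-wise by f'(s^{r-1}).

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(3)
n_prev, n_curr = 5, 3                     # sizes of layer r-1 and layer r

W_r = rng.normal(size=(n_curr, n_prev))   # weights w_kj^r connecting layer r-1 to layer r
s_prev = rng.normal(size=(n_prev, 1))     # activations s^(r-1)
delta_r = rng.normal(size=(n_curr, 1))    # sensitivities delta^r of layer r (already known)

f_prime = sigmoid(s_prev) * (1.0 - sigmoid(s_prev))   # f'(s^(r-1)) for the sigmoid
delta_prev = f_prime * (W_r.T @ delta_r)              # delta^(r-1), shape (n_prev, 1)
```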

Page 17

• All learning samples are passed through without any change of the weight coefficients, and the gradient ∇J results from forming the mean value of the ∇J_j. Both values can be used to update the parameter matrix W' of the perceptron:

\mathbf{W}'_{k+1} = \mathbf{W}'_k - \alpha\,\nabla J_j \qquad\text{(individual learning)}

\mathbf{W}'_{k+1} = \mathbf{W}'_k - \alpha\,\nabla J \qquad\text{(batch learning)}

• The gradients ∇J and ∇J_j differ: the former is the mean value of the latter (∇J = E{∇J_j}); in other words, ∇J_j is a random sample of ∇J.

Page 18

The whole individual learning procedure consists of these steps:
• Choose initial values for the coefficient matrices W_1, W_2, ..., W_H of all layers.
• Take a new observation [x'_j, y_j].
• Calculate the discriminant function ŷ_j from the given observation x'_j and the current weight matrices (forward calculation).
• Calculate the error between the estimation ŷ and the target vector y:

\Delta\mathbf{y} = \hat{\mathbf{y}} - \mathbf{y}

• Calculate the gradient ∇J with respect to all perceptron weights (error backpropagation). To do so, first calculate δ_j^H of the output layer according to:

\delta^H_j = -\frac{\partial J}{\partial s^H_j} = -\frac{\partial J}{\partial \hat{y}_j}\cdot\frac{\partial \hat{y}_j}{\partial s^H_j} = (y_j - \hat{y}_j)\, f'(s^H_j)

• and from this calculate backwards all values of the lower layers according to:

\delta^{\,r-1}_j = f'(s^{r-1}_j)\left[\sum_k w'^{\,r}_{kj}\,\delta^r_k\right] \qquad\text{for } r = H, H-1, \ldots, 2

Page 19

• (In parallel) correct all perceptron weights according to:

w'^{\,r}_{nm} \leftarrow w'^{\,r}_{nm} - \alpha\,\frac{\partial J}{\partial w'^{\,r}_{nm}} = w'^{\,r}_{nm} + \alpha\,\delta^r_n\, y^{r-1}_m \qquad\text{for } r = 1,2,\ldots,H;\; n = 1,2,\ldots,N_r;\; m = 1,2,\ldots,M_r

• For individual learning the weights {w'^{\,h}_{nm}} are corrected with ∇J_j for every sample, while for batch learning all learning samples have to be considered in order to determine the averaged gradient ∇J from the sequence {∇J_j}, before the weights can be corrected according to:

w'^{\,r}_{nm} \leftarrow w'^{\,r}_{nm} + \alpha\sum_i \delta^r_n(i)\, y^{r-1}_m(i) \qquad\text{for } r = 1,2,\ldots,H;\; n = 1,2,\ldots,N_r;\; m = 1,2,\ldots,M_r

• considering the first derivative of the sigmoid function f(s) = ψ(s):

f'(s) = \frac{d\psi(s)}{ds} = \psi(s)\,(1 - \psi(s)) = y\,(1 - y)

Page 20

Backpropagation algorithm for matrices
• Choose initial values for the coefficient matrices W_1, W_2, ..., W_H of all layers.
• Take a new observation [x'_j, y_j].
• Calculate the discriminant function ŷ_j from the given observation x'_j and the current weight matrices (forward calculation). Store all values y^r and s^r of the intermediate layers r = 1, 2, …, H.
• Calculate the error between the estimation ŷ and the target vector y at the output:

\Delta\mathbf{y} = \hat{\mathbf{y}} - \mathbf{y}

• Calculate the gradient ∇J with respect to all perceptron weights (error backpropagation). To do so, first calculate δ^H of the output layer according to:

\boldsymbol{\delta}^H = -\frac{\partial J}{\partial \mathbf{s}^H} = -\frac{\partial J}{\partial \hat{\mathbf{y}}}\cdot\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{s}^H} = (\mathbf{y} - \hat{\mathbf{y}})\odot f'(\mathbf{s}^H)

• and from this calculate backwards all values of the lower layers according to:

\boldsymbol{\delta}^{r-1} = f'(\mathbf{s}^{r-1})\odot\left((\mathbf{W}^r)^T\boldsymbol{\delta}^r\right) \qquad\text{for } r = H, H-1, \ldots, 2

Page 21

• Individual learning: (in parallel) correct all perceptron weight matrices with ∇J_j for every sample according to:

\mathbf{W}'^{\,r} \leftarrow \mathbf{W}'^{\,r} - \alpha\,\frac{\partial J}{\partial \mathbf{W}'^{\,r}} = \mathbf{W}'^{\,r} + \alpha\,\boldsymbol{\delta}^r\,(\mathbf{y}^{r-1})^T \qquad\text{for } r = 1,2,\ldots,H

• Cumulative learning: all samples have to be considered in order to determine the averaged value ∇J from the sequence of the {∇J_j}, before the weights can be corrected according to:

\mathbf{W}'^{\,r} \leftarrow \mathbf{W}'^{\,r} + \alpha\sum_j \boldsymbol{\delta}^r_j\,(\mathbf{y}^{r-1}_j)^T \qquad\text{for } r = 1,2,\ldots,H

• considering the first derivative of the sigmoid function f(s) = ψ(s):

f'(s) = \frac{d\psi(s)}{ds} = \psi(s)\,(1 - \psi(s)) = y\,(1 - y)
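Putting the matrix formulas of the last two slides together, an individual-learning step can be sketched as follows (my own NumPy illustration with sigmoid activations in every layer and extended matrices W'_r = [b_r W_r]; it is not the Matlab code used in the demos, and the XOR-style toy data is only an example):

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def train_sample(W_ext, x, y_target, alpha):
    """One individual-learning step: forward pass, error backpropagation, weight update."""
    # Forward calculation; store the outputs y^r of all layers (ys[0] is the input).
    ys = [x]
    for W in W_ext:
        y_ext = np.vstack([np.ones((1, 1)), ys[-1]])
        ys.append(sigmoid(W @ y_ext))

    # Output layer: delta^H = (y - y_hat) * f'(s^H), with f'(s) = y(1 - y) for the sigmoid.
    delta = (y_target - ys[-1]) * ys[-1] * (1.0 - ys[-1])

    # Backpropagate and correct the weights layer by layer, from layer H down to layer 1.
    for r in range(len(W_ext) - 1, -1, -1):
        y_prev_ext = np.vstack([np.ones((1, 1)), ys[r]])
        W_no_bias = W_ext[r][:, 1:]                      # drop the bias column for backpropagation
        delta_prev = ys[r] * (1.0 - ys[r]) * (W_no_bias.T @ delta)
        W_ext[r] += alpha * delta @ y_prev_ext.T         # W'_r <- W'_r + alpha * delta^r (y^(r-1))^T
        delta = delta_prev
    return ys[-1]

rng = np.random.default_rng(4)
sizes = [2, 4, 1]                                        # a small 2-4-1 network
W_ext = [rng.normal(scale=0.5, size=(sizes[r + 1], sizes[r] + 1)) for r in range(len(sizes) - 1)]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR-like toy samples
Y = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(5000):
    for x, y in zip(X, Y):
        train_sample(W_ext, x.reshape(-1, 1), y.reshape(-1, 1), alpha=0.5)
```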

Page 22

Properties:
• The backpropagation algorithm is simple to implement, but very complex, especially for large coefficient matrices, which requires the number of samples to be large as well. The dependence of the method on the starting values of the weights, the correction factor α, and the order in which the samples are processed also has an adverse effect.
• Linear dependencies remain unconsidered, just as for recursive training of the polynomial classifier.
• The gradient algorithm has the advantage that it can be used for very large problems.
• The dimension of the coefficient matrix W results from the dimensions of the input vector, the hidden layers, and the output vector (N, N_1, …, N_{H−1}, K) as:

T = \dim(\mathbf{W}) = \sum_{h=1}^{H} N^h\,(N^{h-1} + 1) \qquad\text{with } N^0 = N = \dim(\mathbf{x})\ \text{(feature space)}\ \text{and } N^H = K = \dim(\hat{\mathbf{y}})\ \text{(number of classes)}
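For example, for the character-recognition net used later in this chapter (35 inputs, 10 hidden neurons, 26 outputs) the formula gives T = 10·(35 + 1) + 26·(10 + 1) = 646 parameters; a one-line check:

```python
def n_parameters(layer_sizes):
    """T = sum_h N^h * (N^(h-1) + 1) for layer sizes (N, N_1, ..., N_H)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(n_parameters([35, 10, 26]))   # 10*(35+1) + 26*(10+1) = 646
```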

Page 23

Dimensioning the net
• The considerations for designing the multi-layer perceptron with threshold function (σ resp. sign) give quite a good idea of how many hidden layers and how many neurons should be used in an MLPC for backpropagation learning (assuming one has a certain idea of the distribution of the clusters).
• The net is to be chosen as simple as possible. The higher the dimension, the higher the risk of overfitting, accompanied by a loss of the ability to generalize. A higher dimension also allows many local minima, in which the algorithm can get stuck!
• For a given number of samples, the learning phase should be stopped at a certain point. If it is not stopped, the net is overfitted to the existing data and loses its ability to generalize.

[Figure: training and test error over the training epochs]

Page 24

Complexity of the backpropagation algorithm

• For T = dim(W) weights and bias terms, linear complexity in the number of weights is easily understood, since O(T) computation steps are needed for the forward simulation, O(T) steps for the backpropagation of the error, and also O(T) operations for the correction of the weights; altogether: O(T).

• If the gradients were determined experimentally by finite differences (in order to do so, one would have to increment each weight and determine the effect on the quality criterion by forward calculation), i.e. by calculating a difference quotient according to (no analytical evaluation):

\frac{\partial J}{\partial w^r_{ji}} = \frac{J(w^r_{ji} + \varepsilon) - J(w^r_{ji})}{\varepsilon} + O(\varepsilon)

• then T forward calculations of complexity O(T) would result, and therefore an overall complexity of O(T²).
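A sketch of the finite-difference estimate (my own illustration with a tiny one-layer net as the quality criterion): each of the T weights needs its own perturbed forward calculation, which is where the overall O(T²) cost comes from.

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))                 # a single weight matrix as the parameter set
x = rng.normal(size=(4, 1))
y = rng.uniform(size=(3, 1))

def J(W):
    """Squared-error criterion for one sample and a one-layer sigmoid net."""
    return 0.5 * float(np.sum((sigmoid(W @ x) - y) ** 2))

J0 = J(W)                                   # one reference forward calculation
eps = 1e-6
grad_fd = np.zeros_like(W)
for j in range(W.shape[0]):                 # one extra forward calculation per weight
    for i in range(W.shape[1]):
        W_pert = W.copy()
        W_pert[j, i] += eps
        grad_fd[j, i] = (J(W_pert) - J0) / eps

# Analytical gradient for comparison: dJ/dW = ((y_hat - y) * y_hat * (1 - y_hat)) x^T
y_hat = sigmoid(W @ x)
grad_exact = ((y_hat - y) * y_hat * (1.0 - y_hat)) @ x.T
assert np.allclose(grad_fd, grad_exact, atol=1e-4)
```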

Page 25

Backpropagation for training a two-layer network for function approximation
(Matlab demo: "Backpropagation Calculation")

• Sought is an approximation of the function

y(x) = 1 + \sin\!\left(\frac{\pi}{4}\,x\right) \qquad\text{for } -2 \le x \le 2

• First layer (sigmoid):

\mathbf{y}^1 = \boldsymbol{\psi}(\mathbf{W}^1 x + \mathbf{b}^1) = \boldsymbol{\psi}(\mathbf{W}'^{\,1}\mathbf{x}') = \boldsymbol{\psi}(\mathbf{s}^1)
\qquad\text{with } \mathbf{W}^1 = \begin{bmatrix} w^1_{1,1} \\ w^1_{2,1} \end{bmatrix} \text{ and } \mathbf{b}^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \end{bmatrix}

• Second layer (linear):

\hat{y} = y^2 = \mathbf{W}^2\,\mathbf{y}^1 + b^2 \qquad\text{with } \mathbf{W}^2 = \begin{bmatrix} w^2_{1,1} & w^2_{1,2} \end{bmatrix}

[Figure: 1-2-1 network with input x, hidden activations s^1_1, s^1_2, hidden outputs y^1_1, y^1_2, biases b^1_1, b^1_2, b^2, output activation s^2 with ŷ = y^2, and sensitivities δ^1_1, δ^1_2, δ^2]

Page 26

Backpropagation for training a two-layer network for function approximation

• Backpropagation: for the output layer (second layer) with linear activation function, f' = 1 gives:

\delta^2 = (y - \hat{y})

• Correction of the weights in the output layer:

\Delta w^2_{1,j} = \alpha\,\delta^2\, y^1_j \quad\text{and}\quad \Delta b^2 = \alpha\,\delta^2 \qquad\text{for } j = 1,2

• Backpropagation: for the first layer with sigmoid function:

\delta^1_j = f'(s^1_j)\left[w^2_{1,j}\,\delta^2\right] = y^1_j\,(1 - y^1_j)\left[w^2_{1,j}\,\delta^2\right] \qquad\text{for } j = 1,2

• Correction of the weights in the input layer:

\Delta w^1_{j,1} = \alpha\,\delta^1_j\, x \quad\text{and}\quad \Delta b^1_j = \alpha\,\delta^1_j \qquad\text{for } j = 1,2
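The update rules above translate almost line by line into code. The following sketch (my own, not the Matlab demo) trains the 1-2-1 network on samples of y(x) = 1 + sin(πx/4); the learning factor and the initialization are arbitrary example choices:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(6)
alpha = 0.1

# 1-2-1 network: W1, b1 for the sigmoid hidden layer, W2, b2 for the linear output layer.
W1 = rng.normal(size=(2, 1)); b1 = rng.normal(size=(2, 1))
W2 = rng.normal(size=(1, 2)); b2 = rng.normal(size=(1, 1))

xs = np.linspace(-2.0, 2.0, 21)
for epoch in range(2000):
    for x in xs:
        t = 1.0 + np.sin(np.pi / 4.0 * x)          # target y(x)

        y1 = sigmoid(W1 * x + b1)                  # first layer
        y2 = (W2 @ y1 + b2).item()                 # second (linear) layer, y_hat

        delta2 = t - y2                            # delta^2 = (y - y_hat), since f' = 1
        delta1 = y1 * (1.0 - y1) * (W2.T * delta2) # delta^1_j = y1_j (1 - y1_j) w2_1j delta^2

        W2 += alpha * delta2 * y1.T                # Delta w2_1j = alpha delta^2 y1_j
        b2 += alpha * delta2                       # Delta b2    = alpha delta^2
        W1 += alpha * delta1 * x                   # Delta w1_j1 = alpha delta^1_j x
        b1 += alpha * delta1                       # Delta b1_j  = alpha delta^1_j
```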

Page 27

Starting Matlab demo: matlab-BPC.bat

Page 28

Backpropagation for training a two-layer network for function approximation
(Matlab demo: "Function Approximation")

• Sought is an approximation of the function

y(x) = 1 + \sin\!\left(i\cdot\frac{\pi}{4}\,x\right) \qquad\text{for } -2 \le x \le 2

• With increasing value of i (difficulty index) the requirements on the MLPC network increase. The approximation based on the sigmoid function needs more and more layers in order to represent several periods of the sine function.

• Problem: convergence to local minima, even when the network is big enough to represent the function
  – The network is too small, so that the approximation is poor, i.e. the global minimum can be reached, but the approximation quality is poor (i = 8 and network 1-3-1)
  – The network is big enough, so that the approximation could be good, but it converges towards a local minimum (i = 4 and network 1-3-1)

Page 29

Page 30

Starting Matlab demo: matlab-FA.bat

Page 31

Model big enough but only local minimum

Page 32

Generalization ability of the network

• Assume that the network is trained with 11 samples:

\{y_1, x_1\}, \{y_2, x_2\}, \ldots, \{y_{11}, x_{11}\}

• How well does the network approximate the function for samples that were not learned, depending on the complexity of the network?

• Rule: the network should have fewer parameters than the available input/output pairs.

Starting Matlab demo (Hagan 11-21): matlab-GENERAL.bat

Page 33

Improving the generalization ability of a net by adding noise

• If only few samples are available, the generalization ability of a NN can be improved by extending the number of samples, e.g. by normally distributed noise, i.e. by adding further samples which would be found in the nearest neighbourhood with high probability.

• In doing so the intra-class areas are broadened and the borders between the classes are sharpened.

[Figure: sample distributions of class 1 and class 2]
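A minimal sketch of this augmentation idea (the number of copies and the noise level are assumed example values, not taken from the lecture):

```python
import numpy as np

def augment_with_noise(X, y, copies=5, sigma=0.1, rng=None):
    """Extend the training set by normally distributed noisy copies of each sample."""
    rng = np.random.default_rng(0) if rng is None else rng
    X_noisy = np.concatenate([X + rng.normal(scale=sigma, size=X.shape) for _ in range(copies)])
    y_noisy = np.concatenate([y for _ in range(copies)])
    return np.concatenate([X, X_noisy]), np.concatenate([y, y_noisy])

X = np.array([[0.0, 0.0], [1.0, 1.0]])   # two original samples
y = np.array([0, 1])                     # their class labels
X_aug, y_aug = augment_with_noise(X, y)  # 2 + 2*5 = 12 samples after augmentation
```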

Page 34

Interpolation by overlapping of normal distributions

y(x) = \sum_i \alpha_i\, f(x - t_i)

[Figure: weighted sum of radial basis transfer functions over x ∈ [-3, 3]]
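A short sketch of such a weighted sum with Gaussian basis functions f (centres t_i, weights α_i and the width are arbitrary example values):

```python
import numpy as np

def rbf_sum(x, centres, alphas, width=0.5):
    """y(x) = sum_i alpha_i * f(x - t_i) with a Gaussian basis function f."""
    x = np.asarray(x, dtype=float)[:, None]               # shape (n_points, 1)
    f = np.exp(-0.5 * ((x - centres) / width) ** 2)       # shape (n_points, n_centres)
    return f @ alphas

centres = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
alphas = np.array([0.3, 0.8, 1.2, 0.8, 0.3])
x = np.linspace(-3.0, 3.0, 7)
print(rbf_sum(x, centres, alphas))
```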

Page 35

Interpolation by overlapping of normal distributions

Page 36

Character recognition task for 26 letters of size 7×5

• with gradually increasing additive noise:

Page 37

Two-layer neural net for symbol recognition with 35 input values (pixels), 10 neurons in the hidden layer and 26 neurons in the output layer:

\mathbf{y}^2 = \boldsymbol{\psi}(\mathbf{W}^2\,\boldsymbol{\psi}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1) + \mathbf{b}^2)

[Figure: first layer y^1 = f^1(W^1 x + b^1) with x of size N×1 (N = 35), W^1 of size S^1×N = 10×35, and b^1, s^1, y^1 of size S^1×1 = 10×1; second layer y^2 = f^2(W^2 y^1 + b^2) with W^2 of size S^2×S^1 = 26×10 and b^2, s^2, y^2 of size S^2×1 = 26×1]

Page 38

Demo with MATLAB

• Opening Matlab
  – Demo in Toolbox Neural Networks
  – "Character recognition" (command line)
  – Two networks are trained
    • network 1 without noise
    • network 2 with noise
  – Network 2 has better results than network 1 on an independent set of test data

Starting Matlab demo: matlab-CR.bat

Page 39

Possibilities to accelerate the adaptation

Adaptation of a NN is simply a general parameter optimization problem. Accordingly, a variety of other optimization methods from numerical mathematics could be applied in principle. This can lead to significant improvements (convergence speed, convergence region, stability, complexity).

General conclusions cannot be made easily, since the cases can differ a lot and compromises have to be made!

• Heuristic improvements of the gradient algorithm (step-size control)
  – gradient algorithm (steepest descent) with momentum term (the update of the weights depends not only on the gradient but also on the former update; see the sketch after this list)
  – using an adaptive learning factor
• Conjugate gradient algorithm
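A sketch of the momentum variant of the weight update mentioned in the list (my own illustration; μ is the momentum factor, for which a value of 0.5 is used later in the Theodoridis demo):

```python
import numpy as np

def momentum_step(W, grad, velocity, alpha=0.1, mu=0.5):
    """Gradient descent with momentum: the update depends on the gradient and on the former update."""
    velocity = mu * velocity - alpha * grad      # new update = momentum term + gradient term
    return W + velocity, velocity

# Usage sketch: keep one velocity array per weight matrix, initialised with zeros.
W = np.zeros((3, 4))
velocity = np.zeros_like(W)
for _ in range(10):
    grad = np.ones_like(W)                       # placeholder for the backpropagated gradient dJ/dW
    W, velocity = momentum_step(W, grad, velocity)
```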

Page 40

• Newton and semi-Newton algorithms (using the second derivatives of the error function, e.g. in the form of the Hessian matrix or an estimate of it)
  – Quickprop
  – Levenberg-Marquardt algorithm
• Pruning techniques: starting with a sufficiently big network, neurons that have no or little effect on the quality function are taken off the net, which can reduce overfitting of the data.

Taylor expansion of the quality function at w:

J(\mathbf{w} + \Delta\mathbf{w}) = J(\mathbf{w}) + \nabla\mathbf{J}^T\Delta\mathbf{w} + \tfrac{1}{2}\,\Delta\mathbf{w}^T\mathbf{H}\,\Delta\mathbf{w} + \text{terms of higher order}

with the gradient vector \nabla\mathbf{J} = \frac{\partial J}{\partial\mathbf{w}} and the Hessian matrix \mathbf{H} = \left\{\frac{\partial^2 J}{\partial w_i\,\partial w_j}\right\}
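For completeness (this step is not spelled out on the slide): minimizing the quadratic model above with respect to Δw, i.e. setting its gradient to zero, yields the Newton step that the algorithms listed here approximate:

\nabla\mathbf{J} + \mathbf{H}\,\Delta\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad \Delta\mathbf{w} = -\mathbf{H}^{-1}\,\nabla\mathbf{J}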

Page 41

Demo with MATLAB (demo of Theodoridis)

• C:\...\matlab\PR\startdemo
• Example 2 (XOR) with (2-2-1 and 2-5-5-1)
• Example 3 with three-layer network [5,5] (3000 epochs, learning rate 0.7, momentum 0.5)

Starting the Matlab demo: matlab-theodoridis.bat

Two solutions of the XOR problem:
• Convex area implemented with a two-layer net (2-2-1); possible misclassification for a variance > 1/4·√2 ≈ 0.35.
• Union of 2 convex areas implemented with a three-layer net (2-5-5-1); possible misclassification for a variance > 0.5.

Note: 2-2-2-1 does not converge!

Page 42

XOR with high noise in order to activate the second solution! With two lines an error-free (exact) combination cannot be reached, and the gradient is still different from 0!

Page 43

Page 44

Page 45

Demo with MATLAB (Klassifikationgui.m)

• Opening Matlab
  – first invoke setmypath.m, then
  – C:\Home\ppt\Lehre\ME_2002\matlab\KlassifikatorEntwurf-WinXX\Klassifikationgui
  – set of data: load Samples/xor2.mat (setting: 500 epochs, learning rate 0.7)
  – load weight matrices: models/xor-trivial.mat
  – load weight matrices: models/xor-3.mat
• Two-class problem with banana shape
  – load set of data: Samples/banana_test_samples.mat
  – load presetting from: models/MLP_7_3_2_banana.mat

Starting Matlab demo: matlab-Klassifikation_gui.bat

Page 46

Solution of the XOR-problem with two-layer nn

Page 47

Solution of the XOR-problem with two-layer nn

Page 48

Solution of the XOR-problem with three-layer nn

Page 49

Solution of the XOR-problem with three-layer nn

Page 50

Page 51

Page 52

Convergence behaviour of backpropagation algorithm

• Backpropagation with common gradient descent (steepest descent)

Starting Matlab demo: matlab-BP-gradient.bat

• Backpropagation with conjugate gradient descent (conjugate gradient) – significantly better convergence!

Starting Matlab demo: matlab-BP-CGgradient.bat

Page 53

NN and their properties
• "Quick and Dirty": a NN design is simple to implement and results in solutions that are by all means usable (a suboptimal solution is better than no solution at all).
• Particularly large problems can be solved.
• All strategies only use "local" optimization functions and generally do not reach a global optimum, which is a big drawback of using neural nets.

Page 54

Using a NN for iterative calculation of the Principal Component Analysis (KLT)

• Instead of solving the KLT explicitly by an eigenvalue decomposition, a NN can be used for iterative calculation.

• A two-layer perceptron is used. The output w of the hidden layer is the feature vector sought.

[Figure: linear two-layer net; the first layer maps the N-dimensional input x through A^T to the M-dimensional hidden vector w, the second layer maps w through A back to the N-dimensional reconstruction x̂]

• The first layer calculates the feature vector as:

\mathbf{w} = \mathbf{A}^T\mathbf{x}

Page 55

• Suppose a linear activation function is used. The second layer implements the reconstruction of x with the transposed weight matrix of the first layer:

\hat{\mathbf{x}} = \mathbf{A}\,\mathbf{w}

• The target of the optimization is to minimize:

J = E\{\|\mathbf{x} - \hat{\mathbf{x}}\|^2\} = E\{\|\mathbf{x} - \mathbf{A}\mathbf{w}\|^2\} = E\{\|\mathbf{x} - \mathbf{A}\mathbf{A}^T\mathbf{x}\|^2\}

• The following learning rule (a sketch in code follows below):

\mathbf{A} \leftarrow \mathbf{A} - \alpha\,(\mathbf{A}\mathbf{w} - \mathbf{x})\,\mathbf{w}^T = \mathbf{A} - \alpha\,\nabla_{\mathbf{A}}J

• leads to a coefficient matrix A which can be written as a product of two matrices (without proof!):

\mathbf{A} = \mathbf{B}_M\,\mathbf{T}

where B_M is an N×M matrix of the M dominant eigenvectors of E{xx^T} and T is an orthonormal M×M matrix which causes a rotation of the coordinate system within the space spanned by the M dominant eigenvectors of E{xx^T}.
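A sketch of this iterative rule in code (my own illustration with a fixed learning factor and zero-mean toy data, so that the moment matrix equals the covariance matrix; the convergence behaviour depends on α and on the data scaling):

```python
import numpy as np

rng = np.random.default_rng(7)

# Zero-mean toy data whose dominant directions lie along the first two coordinates.
n, N, M = 5000, 4, 2
X = rng.normal(size=(n, N)) * np.array([2.0, 1.5, 0.5, 0.1])

A = rng.normal(scale=0.1, size=(N, M))      # coefficient matrix of the two-layer linear net
alpha = 0.01
for x in X:
    x = x[:, None]                          # column vector
    w = A.T @ x                             # first layer: feature vector w = A^T x
    A -= alpha * (A @ w - x) @ w.T          # learning rule A <- A - alpha (A w - x) w^T

# The columns of A should now span the subspace of the M dominant eigenvectors of E{x x^T}.
print(np.round(A, 2))
```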

Page 56

• The learning strategy does not include a translation, i.e. it calculates the eigenvectors of the moment matrix E{xx^T} and not of the covariance matrix K = E{(x − µ_x)(x − µ_x)^T}. This can be implemented by subtracting a recursively estimated expected value µ_x = E{x} before starting the recursive estimation of the feature vector. The same effect can be achieved by using an extended observation vector:

\mathbf{x}' = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix} \quad\text{instead of}\quad \mathbf{x}