CS407 Neural Computation · 2004-09-03
CS407 Neural Computation
Lecture 5: The Multi-Layer Perceptron (MLP)
and Backpropagation
Lecturer: A/Prof. M. Bennamoun
What is a perceptron and what is a Multi-Layer Perceptron (MLP)?
What is a perceptron?
[Figure: a single perceptron. Input signals x1 … xm are weighted by synaptic weights wk1 … wkm and combined with bias bk at a summing junction Σ; the result vk passes through activation function ϕ(.) to give output yk.]

vk = Σ(j=1..m) wkj xj + bk

yk = ϕ(vk)

Discrete Perceptron: ϕ(.) = sign(.)
Continuous Perceptron: ϕ(.) = S-shape
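The two perceptron variants differ only in the activation ϕ. A minimal sketch in Python (the weight, bias, and input values here are made-up illustration values, not from the lecture):

```python
import math

def perceptron(x, w, b, discrete=True):
    """Single perceptron: v = sum_j w_j x_j + b, then y = phi(v)."""
    v = sum(wj * xj for wj, xj in zip(w, x)) + b
    if discrete:
        return 1 if v >= 0 else -1           # phi = sign(.)
    return 1.0 / (1.0 + math.exp(-v))        # phi = S-shape (logistic)

print(perceptron([1.0, 2.0], [0.5, -0.25], 0.1))                  # 1
print(perceptron([1.0, 2.0], [0.5, -0.25], 0.1, discrete=False))  # ≈ 0.525
```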
Activation Function of a perceptron
[Figure: the signum function, a hard limiter switching between -1 and +1 at vi = 0, and an S-shaped (sigmoid) curve rising from 0 to +1.]

Discrete Perceptron: ϕ(.) = sign(.) (signum function)
Continuous Perceptron: ϕ(v) = S-shape
MLP Architecture

The Multi-Layer Perceptron was first introduced by M. Minsky and S. Papert in 1969.
Type: Feedforward
Neuron layers: 1 input layer, 1 or more hidden layers, 1 output layer
Learning Method: Supervised
Terminology/Conventions

Arrows indicate the direction of data flow.
The first layer, termed the input layer, just contains the input vector and does not perform any computations.
The second layer, termed the hidden layer, receives input from the input layer and sends its output to the output layer.
After applying their activation function, the neurons in the output layer contain the output vector.
Why the MLP?

The single-layer perceptron classifiers discussed previously can only deal with linearly separable sets of patterns.
The multilayer networks to be introduced here are the most widespread neural network architecture:
– Not useful until the 1980s, because of the lack of efficient training algorithms (McClelland and Rumelhart 1986)
– Made practical by the introduction of the backpropagation training algorithm.
Different Non-Linearly Separable Problems (http://www.zsolutions.com/light.htm)

| Structure | Types of Decision Regions | Exclusive-OR Problem | Classes with Meshed Regions | Most General Region Shapes |
| --- | --- | --- | --- | --- |
| Single-Layer | Half plane bounded by hyperplane | [A/B diagram] | [A/B diagram] | [region figure] |
| Two-Layer | Convex open or closed regions | [A/B diagram] | [A/B diagram] | [region figure] |
| Three-Layer | Arbitrary (complexity limited by no. of nodes) | [A/B diagram] | [A/B diagram] | [region figure] |
What is backpropagation training and how does it work?
What is Backpropagation?

Supervised Error Back-propagation Training
– The mechanism of backward error transmission (delta learning rule) is used to modify the synaptic weights of the internal (hidden) and output layers
  • The mapping error can be propagated back into the hidden layers
– Can implement arbitrarily complex input/output mappings or decision surfaces to separate pattern classes
  • For which the explicit derivation of mappings and discovery of relationships is almost impossible
– Produces surprising results and generalizations
Architecture: Backpropagation Network

The Backpropagation Net was first introduced by G.E. Hinton, E. Rumelhart and R.J. Williams in 1986.
Type: Feedforward
Neuron layers: 1 input layer, 1 or more hidden layers, 1 output layer
Learning Method: Supervised
Reference: Clara Boyd
Backpropagation Preparation

Training Set: a collection of input-output patterns that are used to train the network.
Testing Set: a collection of input-output patterns that are used to assess network performance.
Learning Rate α: a scalar parameter, analogous to step size in numerical integration, used to set the rate of weight adjustments.
Backpropagation training cycle
1/ Feedforward of the input training pattern
2/ Backpropagation of the associated error
3/ Adjustment of the weights

Reference: Eric Plammer
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Source: Fausett, L., Fundamentals of Neural Networks, Prentice Hall, 1994.

Notation (p. 292 of Fausett)
BP NN With Single Hidden Layer

[Figure: three-layer network. Input (I/P) layer units connect to hidden-layer units through weights vi,j; hidden-layer units connect to output (O/P) layer units through weights wj,k.]

Reference: Dan St. Clair; Fausett: Chapter 6
Notation
x = input training vector
t = output target vector
δk = portion of the error-correction weight adjustment for wjk that is due to an error at output unit Yk; also the information about the error at unit Yk that is propagated back to the hidden units that feed into unit Yk
δj = portion of the error-correction weight adjustment for vij that is due to the backpropagation of error information from the output layer to hidden unit Zj
α = learning rate
v0j = bias on hidden unit j
w0k = bias on output unit k
Activation Functions

[Figure: common activation functions, including the binary step, the hyperbolic tangent, and the binary sigmoid.]

f(x) = 1 / (1 + exp(-x))
f'(x) = f(x) [1 - f(x)]

Should be continuous, differentiable, and monotonically non-decreasing. Plus, its derivative should be easy to compute.
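The binary sigmoid satisfies these requirements; a quick numerical check of the derivative identity f'(x) = f(x)[1 - f(x)] used throughout backpropagation:

```python
import math

def f(x):
    """Binary sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    """Its derivative via the identity f'(x) = f(x)[1 - f(x)]."""
    return f(x) * (1.0 - f(x))

# Compare against a central finite difference at a few points:
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    assert abs(numeric - f_prime(x)) < 1e-6

print(f(0.0), f_prime(0.0))   # 0.5 0.25
```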
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Fausett, L., pp. 294-296.

[Figure: BP network with input units X1, X2, X3, hidden units Z1 … Zj … Z3, output unit Yk, and bias units 1 feeding the hidden and output layers.]
Let’s examine Training Algorithm Equations
[Figure: network with inputs X1, X2, X3 (plus bias 1), hidden units Z1, Z2, Z3 (plus bias 1), and output Y1; v2,1 labels the weight from X2 to Z1.]

Vectors & matrices make computation easier.

X = [x1 … xn]
V0 = [v0,1 … v0,p]

V = | v1,1 … v1,p |
    |  …      …   |
    | vn,1 … vn,p |

Step 4 computation becomes:
Z_in = V0 + XV
Z = [f(z_in1) … f(z_inp)]

W0 = [w0,1 … w0,m]

W = | w1,1 … w1,m |
    |  …      …   |
    | wp,1 … wp,m |

Step 5 computation becomes:
Y_in = W0 + ZW
Y = [f(y_in1) … f(y_inm)]
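The matrix form of Steps 4 and 5 can be sketched directly with NumPy. As illustration values, this uses the Example 2 numbers that appear later in the lecture:

```python
import numpy as np

def f(x):
    """Binary sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(X, V0, V, W0, W):
    Z = f(V0 + X @ V)   # Step 4: hidden activations (length p)
    Y = f(W0 + Z @ W)   # Step 5: output activations (length m)
    return Z, Y

X  = np.array([0.6, 0.8, 0.0])
V  = np.array([[2., 1., 0.], [1., 2., 2.], [0., 3., 1.]])
V0 = np.array([0., 0., -1.])
W  = np.array([[-1.], [1.], [2.]])
W0 = np.array([-1.])

Z, Y = feedforward(X, V0, V, W0, W)
print(Z)   # ≈ [0.8808 0.9002 0.6457]
print(Y)   # ≈ [0.5772]
```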
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Generalisation

Once trained, the weights are held constant, and input patterns are applied in feedforward mode. This is commonly called "recall mode".
We wish the network to "generalize", i.e. to make sensible choices about input vectors which are not in the training set.
Commonly we check the generalization of a network by dividing the known patterns into a training set, used to adjust the weights, and a test set, used to evaluate the performance of the trained network.
Generalisation …

Generalisation can be improved by:
– Using a smaller number of hidden units (the network must learn the rule, not just the examples)
– Not overtraining (occasionally check that the error on the test set is not increasing)
– Ensuring the training set includes a good mixture of examples

There is no good rule for deciding upon a good network size (number of layers, number of units per layer).
Usually use one input/output unit per class rather than a continuous variable or binary encoding.
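The train/test division described here can be sketched as a simple holdout split (the function name, test fraction, and toy patterns are illustrative assumptions, not from the lecture):

```python
import random

def train_test_split(patterns, test_fraction=0.25, seed=0):
    """Divide known (input, target) patterns into a training set
    (used to adjust weights) and a test set (used to evaluate
    the trained network), as described above."""
    rng = random.Random(seed)
    shuffled = patterns[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

patterns = [(i, i % 2) for i in range(20)]   # toy (input, target) pairs
train, test = train_test_split(patterns)
print(len(train), len(test))   # 15 5
```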
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Example 1
The XOR function could not be solved by a single-layer perceptron network.

The function is:

| X | Y | F |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Reference: R. Spillman
XOR Architecture

[Figure: inputs x and y, each also feeding a bias unit 1, connect to two hidden units and then one output unit:
hidden 1: f(v01 + v11·x + v21·y)
hidden 2: f(v02 + v12·x + v22·y)
output: f(w01 + w11·z1 + w21·z2)]
Initial Weights

Randomly assign small weight values:
v01 = -.3, v11 = .21, v21 = .15
v02 = .25, v12 = -.4, v22 = .1
w01 = -.4, w11 = -.2, w21 = .3
Feedforward – 1st Pass

Training Case: (0 0 0), i.e. x = 0, y = 0, target 0.

Activation function f: f(x) = 1 / (1 + e^(-x))

z_in1 = -.3(1) + .21(0) + .15(0) = -.3  →  z1 = f(-.3) = .43
z_in2 = .25(1) - .4(0) + .1(0) = .25  →  z2 = f(.25) = .56
y_in1 = -.4(1) - .2(.43) + .3(.56) = -.318  →  y1 = f(-.318) = .42 (not 0)
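This first feedforward pass can be checked numerically; a short Python sketch using the slide's initial weights:

```python
import math

def f(x):
    """Logistic activation from the slide."""
    return 1.0 / (1.0 + math.exp(-x))

x1, x2 = 0.0, 0.0                       # training case (0 0 -> 0)
z_in1 = -0.3 + 0.21 * x1 + 0.15 * x2    # hidden unit 1 net input
z_in2 = 0.25 - 0.4 * x1 + 0.1 * x2      # hidden unit 2 net input
z1, z2 = f(z_in1), f(z_in2)
y_in1 = -0.4 - 0.2 * z1 + 0.3 * z2      # output unit net input
y1 = f(y_in1)

print(round(z1, 2), round(z2, 2), round(y1, 2))   # 0.43 0.56 0.42
```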
Backpropagate

(Training case (0 0 0): x = 0, y = 0, target t1 = 0, same weights as before.)

δ1 = (t1 – y1) f'(y_in1) = (t1 – y1) f(y_in1)[1 - f(y_in1)]
δ1 = (0 – .42)(.42)(1 - .42) = -.102

δ_in1 = δ1 w11 = -.102(-.2) = .02
δ1 = δ_in1 f'(z_in1) = .02(.43)(1 - .43) = .005

δ_in2 = δ1 w21 = -.102(.3) = -.03
δ2 = δ_in2 f'(z_in2) = -.03(.56)(1 - .56) = -.007
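The delta computations on this slide can be verified with plain arithmetic; note that f'(v) is evaluated from the activations via f'(v) = f(v)[1 - f(v)]:

```python
# Activations from the feedforward pass and target t1 = 0:
z1, z2, y1 = 0.43, 0.56, 0.42
t1, w11, w21 = 0.0, -0.2, 0.3

delta_out = (t1 - y1) * y1 * (1 - y1)   # output delta: ≈ -0.102
d_in1 = delta_out * w11                 # error reaching hidden unit 1
delta1 = d_in1 * z1 * (1 - z1)          # hidden delta 1: ≈ 0.005
d_in2 = delta_out * w21                 # error reaching hidden unit 2
delta2 = d_in2 * z2 * (1 - z2)          # hidden delta 2: ≈ -0.0076
                                        # (the slide rounds intermediates, giving -0.007)
print(delta_out, delta1, delta2)
```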
Calculate the Weights – First Pass

Update rules (j = 1, 2; i = 1, 2; the slide takes α = 1 here):

Δw0,1 = α δ1
Δwj,1 = α δ1 zj
Δv0,j = α δj
Δvi,j = α δj xi

Δw01 = -.102
Δw11 = δ1 z1 = (-.102)(.43) = -.0439
Δw21 = δ1 z2 = (-.102)(.56) = -.0571
Δv01 = -.005
Δv02 = -.007
Δv11 = δ1 x1 = (.005)(0) = 0
Δv12 = δ2 x1 = (-.007)(0) = 0
Δv21 = δ1 x2 = (.005)(0) = 0
Δv22 = δ2 x2 = (-.007)(0) = 0
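These increments follow directly from the deltas, the hidden activations, and the inputs; since the input is (0, 0), only the biases and hidden-to-output weights move. A sketch (no separate learning rate is applied here, matching the slide's effective α = 1):

```python
# Deltas from the backpropagation step, activations, and inputs:
delta_out, delta1, delta2 = -0.102, 0.005, -0.007
z1, z2 = 0.43, 0.56
x1, x2 = 0.0, 0.0

dw01 = delta_out                         # output bias increment
dw11 = delta_out * z1                    # hidden-1 -> output weight
dw21 = delta_out * z2                    # hidden-2 -> output weight
dv11, dv21 = delta1 * x1, delta1 * x2    # input -> hidden-1 weights (zero input)
dv12, dv22 = delta2 * x1, delta2 * x2    # input -> hidden-2 weights (zero input)

print(round(dw11, 4), round(dw21, 4), dv11, dv12)   # -0.0439 -0.0571 0.0 0.0
```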
Update the Weights – First Pass

(inputs still x = 0, y = 0)

v01 = -.305, v11 = .21, v21 = .15
v02 = .243, v12 = -.4, v22 = .1
w01 = -.502, w11 = -.244, w21 = .243
Final Result

After about 500 iterations:
v01 = -1.5, v11 = 1, v21 = 1
v02 = -.5, v12 = 1, v22 = 1
w01 = -.5, w11 = -2, w21 = 1
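Read with hard-threshold units (the large-weight limit of the sigmoid), the final weights on this slide do implement XOR; a quick check:

```python
def step(v):
    """Hard threshold at 0 (idealized limit of the sigmoid)."""
    return 1 if v >= 0 else 0

def xor_net(x, y):
    z1 = step(-1.5 + x + y)           # hidden 1: fires only for (1,1), i.e. AND
    z2 = step(-0.5 + x + y)           # hidden 2: fires if either input is on, i.e. OR
    return step(-0.5 - 2 * z1 + z2)   # output: OR and not AND = XOR

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, "->", xor_net(x, y))   # 0, 1, 1, 0
```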
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Example 2
[Figure: network with inputs X1, X2, X3, hidden units Z1, Z2, Z3, output Y1, and bias units 1; v2,1 labels the weight from X2 to Z1.]

X = [0.6 0.8 0]

V = [2 1 0; 1 2 2; 0 3 1]
V0 = [0 0 -1]

W = [-1 1 2]'
W0 = [-1]

t = 0.9 (desired output for input X)
α = 0.3
f(x) = 1 / (1 + e^(-x))
n = 3, p = 3, m = 1

Reference: Vamsi Pegatraju and Aparna Patsa
Primary Values: Inputs to Epoch I

X=[0.6 0.8 0]; W=[-1 1 2]'; W0=[-1];
V=[2 1 0; 1 2 2; 0 3 1];
V0=[0 0 -1];
Target t=0.9; α = 0.3
Epoch – I

Step 4: Z_in = V0 + XV = [2 2.2 0.6];
Z = f([Z_in]) = [0.8808 0.9002 0.646];
Step 5: Y_in = W0 + ZW = [0.3114];
Y = f([Y_in]) = 0.5772;

Sum of Squares Error obtained originally: (0.9 – 0.5772)^2 = 0.1042
Step 6: Error = tk – Yk = 0.9 – 0.5772
Now we have only one output, and hence the value of k = 1.
δ1 = (t1 – y1) f'(Y_in1)
We know f'(x) for the sigmoid = f(x)(1 – f(x))
⇒ δ1 = (0.9 − 0.5772)(0.5772)(1 − 0.5772) = 0.0788
For the hidden-to-output weights we have (j = 1, 2, 3):
ΔWj,k = α δk Zj = α δ1 Zj
⇒ ΔW1 = (0.3)(0.0788)[0.8808 0.9002 0.646]' = [0.0208 0.0213 0.0153]'
Bias: ΔW0,1 = α δ1 = (0.3)(0.0788) = 0.0236
Step 7: Backpropagation to the first hidden layer
For Zj (j = 1, 2, 3) we have:
δ_inj = Σ(k=1..m) δk Wj,k = δ1 Wj,1
⇒ δ_in1 = -0.0788; δ_in2 = 0.0788; δ_in3 = 0.1576
δj = δ_inj f'(Z_inj)
⇒ δ1 = -0.0083; δ2 = 0.0071; δ3 = 0.0361
ΔVi,j = α δj Xi (with X = [0.6 0.8 0])

⇒ ΔV1 = [-0.0015 -0.0020 0]'
⇒ ΔV2 = [0.0013 0.0017 0]'
⇒ ΔV3 = [0.0065 0.0087 0]'

ΔV0 = α [δ1 δ2 δ3] = [-0.0025 0.0021 0.0108]
Step 8: Updating of W, V, W0, V0

Wnew = Wold + ΔW1 = [-0.9792 1.0213 2.0153]'
Vnew = Vold + ΔV = [1.9985 1.0013 0.0065; 0.998 2.0017 2.0087; 0 3 1]
W0new = -0.9764
V0new = [-0.0025 0.0021 -0.9892]

Completion of the first epoch.
Primary Values: Inputs to Epoch 2

X=[0.6 0.8 0]; W=[-0.9792 1.0213 2.0153]'; W0=[-0.9764];
V=[1.9985 1.0013 0.0065; 0.998 2.0017 2.0087; 0 3 1];
V0=[-0.0025 0.0021 -0.9892];
Target t=0.9; α = 0.3
Epoch – 2

Step 4: Z_in = V0 + XV = [1.995 2.2042 0.6217];
Z = f([Z_in]) = [0.8803 0.9006 0.6506];
Step 5: Y_in = W0 + ZW = [0.3925];
Y = f([Y_in]) = 0.5969;

Sum of Squares Error obtained from the first epoch: (0.9 – 0.5969)^2 = 0.0918
Step 6: Error = tk – Yk = 0.9 – 0.5969
Again, as we have only one output, the value of k = 1.
δ1 = (t1 – y1) f'(Y_in1)
⇒ δ1 = (0.9 − 0.5969)(0.5969)(1 − 0.5969) = 0.0729
For the hidden-to-output weights we have (j = 1, 2, 3):
ΔWj,k = α δk Zj = α δ1 Zj
⇒ ΔW1 = (0.3)(0.0729)[0.8803 0.9006 0.6506]' = [0.0193 0.0197 0.0142]'
Bias: ΔW0,1 = α δ1 = 0.0219
Step 7: Backpropagation to the first hidden layer
For Zj (j = 1, 2, 3) we have:
δ_inj = Σ(k=1..m) δk Wj,k = δ1 Wj,1
⇒ δ_in1 = -0.0714; δ_in2 = 0.0745; δ_in3 = 0.1469
δj = δ_inj f'(Z_inj)
⇒ δ1 = -0.0075; δ2 = 0.0067; δ3 = 0.0334
ΔVi,j = α δj Xi

⇒ ΔV1 = [-0.0013 -0.0018 0]'
⇒ ΔV2 = [0.0012 0.0016 0]'
⇒ ΔV3 = [0.006 0.008 0]'

ΔV0 = α [δ1 δ2 δ3] = [-0.0022 0.002 0.01]
Step 8: Updating of W, V, W0, V0

Wnew = Wold + ΔW1 = [-0.9599 1.041 2.0295]'
Vnew = Vold + ΔV = [1.9972 1.0025 0.0125; 0.9962 2.0033 2.0167; 0 3 1]
W0new = -0.9545
V0new = [-0.0047 0.0041 -0.9792]

Completion of the second epoch.
Step 4: Z_in = V0 + XV = [1.9906 2.2082 0.6417];
⇒ Z = f([Z_in]) = [0.8798 0.9010 0.6551];
Step 5: Y_in = W0 + ZW = [0.4684];
⇒ Y = f([Y_in]) = 0.6150;

Sum of Squares Error at the end of the second epoch: (0.9 – 0.615)^2 = 0.0812.
From the last two values of the Sum of Squares Error, we see that the value is gradually decreasing as the weights are updated.
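The error sequence of this worked example (0.1042, 0.0918, 0.0812) can be reproduced with a short NumPy sketch of Steps 4-8; last-digit differences arise because the slides round intermediate values:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

# Example 2 setup (Fausett notation):
X  = np.array([0.6, 0.8, 0.0])
V  = np.array([[2., 1., 0.], [1., 2., 2.], [0., 3., 1.]])
V0 = np.array([0., 0., -1.])
W  = np.array([-1., 1., 2.])
W0 = -1.0
t, alpha = 0.9, 0.3

errors = []
for epoch in range(3):
    Z = f(V0 + X @ V)                    # Step 4: hidden activations
    y = f(W0 + Z @ W)                    # Step 5: output
    errors.append((t - y) ** 2)
    delta_k = (t - y) * y * (1 - y)      # Step 6: output delta
    delta_j = delta_k * W * Z * (1 - Z)  # Step 7: hidden deltas (uses old W)
    W  += alpha * delta_k * Z            # Step 8: weight updates
    W0 += alpha * delta_k
    V  += alpha * np.outer(X, delta_j)
    V0 += alpha * delta_j

print(np.round(errors, 4))   # ≈ [0.104 0.092 0.081]
```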
Backpropagation Neural Networks
- Architecture
- BP training Algorithm
- Generalization
- Examples: Example 1, Example 2
- Uses (applications) of BP networks
- Options/Variations on BP: Momentum; Sequential vs. batch; Adaptive learning rates
- Appendix: References and suggested reading
Functional Approximation

Multi-Layer Perceptrons can approximate any continuous function with a two-layer network using squashing activation functions.
If the activation functions can vary with the function, it can be shown that an n-input, m-output function requires at most 2n+1 hidden units.
See Fausett 6.3.2 for more details.
Function Approximators

Example: a function h(x) approximated by H(w,x)
Applications

We look at a number of applications of backpropagation MLPs. In each case we'll examine:
– Problem to be solved
– Architecture used
– Results

Reference: J. Hertz, A. Krogh, R.G. Palmer, "Introduction to the Theory of Neural Computation", Addison Wesley, 1991
NETtalk - Specifications

The problem is to convert written text to speech.
Conventionally this is done by hand-coded linguistic rules, as in the DECtalk system. NETtalk uses a neural network to achieve similar results.
Input is written text.
Output is the choice of phoneme for a speech synthesiser.
NETtalk - Architecture
[Figure: a 7-letter sliding window moving over the input text.]
The 7-letter sliding window generates the phoneme for the centre character.
Input units use a 1-of-29 code => 203 input units (29 × 7).
80 hidden units, fully interconnected.
26 output units, a 1-of-26 code representing the most likely phoneme.
NETtalk - Results
1024-word training set.
After 10 epochs: intelligible speech.
After 50 epochs: 95% correct on the training set, 78% correct on the test set.
Note that this network must generalise - many input combinations are not in the training set.
Results were not as good as DECtalk's, but required significantly less effort to code up.
Sonar Classifier
Task: distinguish between a rock and a metal cylinder from the sonar return off the bottom of a bay.
The time-varying input signal is converted to the frequency domain to reduce the input dimension. (This is a linear transform and could be done with a fixed-weight neural network.)
Used a 60-x-2 network, with x from 0 to 24.
Training took about 200 epochs. A 60-2 network classified about 80% of the training set; a 60-12-2 network classified 100% of the training set and 85% of the test set.
ALVINN
Drives at 70 mph on a public highway.
[Network diagram:]
– 30×32 pixels as inputs
– 4 hidden units (30×32 weights into each of the four hidden units)
– 30 outputs for steering
Navigation of a Car
The task is to control a car on a winding road.
Inputs are a 30×32-pixel image from a video camera on the roof and an 8×32 image from a range finder => 1216 inputs.
29 hidden units.
45 output units arranged in a line, a 1-of-45 code representing hard-left .. straight-ahead .. hard-right.
Navigation of Car - Results
Training set of 1200 simulated road images.
Trained for 40 epochs.
Could drive at 5 km/hr on the road, limited by the calculation speed of the feed-forward network.
Twice as fast as the best non-net solution.
Backgammon
– Trained on 3000 example board scenarios of (position, dice, move), rated from -100 (very bad) to +100 (very good) by a human expert.
– Some important information, such as “pip-count” and “degree-of-trapping”, was included as input.
– Some “noise” was added to the input set (scenarios with random scores).
– Handcrafted examples were added to the training set to correct obvious errors.
Backgammon results
– 459 inputs, 2 hidden layers of 24 units each, plus 1 output for the score (all possible moves are evaluated).
– Won 59% of games against a conventional backgammon program (41% without the extra info, 45% without noise in the training set).
– Won the computer olympiad, 1989, but lost to a human expert (not surprising, since it was trained on human-scored examples).
Encoder / Image Compression
We wish to encode a number of input patterns in an efficient number of bits for storage or transmission.
We can use an autoassociative network, i.e. an M-N-M network with M inputs, N < M hidden units, and M outputs, trained with target outputs equal to the inputs.
The hidden units must then encode the inputs in fewer signals in the hidden layer.
The outputs of the hidden layer are the encoded signal.
Encoders
We can store/transmit the hidden values using the first half of the network, and decode them using the second half.
We may need to truncate hidden-unit values to fixed precision, which must be considered during training.
Cottrell et al. tried 8×8 blocks (8 bits each) of images, encoded in 16 units, giving results similar to conventional approaches.
Works best with similar images.
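The M-N-M idea can be sketched with the classic 8-3-8 encoder (assuming numpy; this toy network is an illustration, not Cottrell et al.'s image network):

```python
import numpy as np

# Sketch: an 8-3-8 autoassociator trained with targets equal to inputs.
# The 3 hidden outputs are the compressed code.
rng = np.random.default_rng(1)
X = np.eye(8)                           # 8 one-hot input patterns, M = 8
V = rng.uniform(-0.5, 0.5, (8, 3))      # encoder weights (M -> N)
W = rng.uniform(-0.5, 0.5, (3, 8))      # decoder weights (N -> M)
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

def reconstruct(X):
    code = sig(X @ V)                   # "first half": encode for storage
    return code, sig(code @ W)          # "second half": decode

_, y0 = reconstruct(X)
mse_before = np.mean((X - y0) ** 2)
for _ in range(3000):                   # backprop on the reconstruction error
    z, y = reconstruct(X)
    dy = (y - X) * y * (1 - y)          # output deltas (sigmoid derivative)
    dz = (dy @ W.T) * z * (1 - z)       # hidden deltas
    W -= 0.5 * z.T @ dy
    V -= 0.5 * X.T @ dz
_, y1 = reconstruct(X)
mse_after = np.mean((X - y1) ** 2)
```

After training, storing or transmitting only the 3 code values per pattern (instead of 8) suffices to reconstruct the input approximately.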
Neural network for OCR: a feedforward network trained using back-propagation.
[Figure: input layer of pixel units, a hidden layer, and an output layer with one unit per character (A, B, C, D, E).]
Pattern Recognition
Post-code (or ZIP code) recognition is a good example - hand-written characters need to be classified.
One interesting network used a 16×16 pixel-map input of handwritten digits, already located and scaled by another system, with 3 hidden layers plus a 1-of-10 output layer.
The first two hidden layers were feature detectors.
ZIP code classifier
The first hidden layer had the same feature detector connected to 5×5 blocks of the input, at 2-pixel intervals => an 8×8 array of the same detector, each copy with the same weights but connected to a different part of the input.
There were twelve such feature-detector arrays.
The same holds for the second hidden layer, but with 4×4 arrays connected to 5×5 blocks of the first hidden layer, with 12 different features.
A conventional 30-unit third hidden layer followed.
ZIP Code Classifier - Results
Note that the 8×8 and 4×4 arrays of feature detectors share the same weights => many fewer weights to train.
Trained on 7300 digits, tested on 2000.
Error rates: 1% on the training set, 5% on the test set.
If cases with no clear winner are rejected (i.e. the largest output is not much greater than the second largest), then, with 12% rejection, the error rate on the test set drops to 1%.
Performance was improved further by removing more weights: “optimal brain damage”.
Backpropagation Neural Networks
– Architecture
– BP training algorithm
– Generalization
– Examples (Example 1, Example 2)
– Uses (applications) of BP networks
– Options/variations on BP (momentum; sequential vs. batch; adaptive learning rates)
– Appendix
– References and suggested reading
Heuristics for making BP better
Training with BP is more an art than a science - a result of one's own experience.
Normalizing the inputs:
– preprocess the data so that the mean value of each input is closer to zero (see the “prestd” function in Matlab);
– input variables should be uncorrelated, which can be achieved by “Principal Component Analysis” (PCA). See the “prepca” and “trapca” functions in Matlab.
Sequential vs. Batch update
“Sequential” learning means that a given input pattern is forward-propagated, the error is determined and back-propagated, and the weights are updated; the same procedure is then repeated for the next pattern.
“Batch” learning means that the weights are updated only after the entire set of training patterns has been presented to the network. In other words, all patterns are forward-propagated, and the error is determined and back-propagated, but the weights are only updated once all patterns have been processed. Thus, the weight update is performed only once per epoch.
If P = # patterns in one epoch:
∆w = (1/P) Σ_p ∆w_p,  p = 1, …, P
Sequential vs. Batch update
In some cases it is advantageous to accumulate the weight correction terms for several patterns (or even an entire epoch, if there are not too many patterns) and make a single weight adjustment (equal to the average of the weight correction terms) for each weight, rather than updating the weights after each pattern is presented. This procedure has a “smoothing effect” (because of the use of the average) on the correction terms. In some cases, this smoothing may increase the chances of convergence to a local minimum.
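The two schedules can be contrasted on a toy linear unit (all numbers are hypothetical; `delta_w` plays the role of the per-pattern correction term ∆w_p):

```python
# Toy data for a linear unit y = w*x with targets t = 2x (illustrative).
X = [0.5, -1.0, 2.0, 1.5]
T = [1.0, -2.0, 4.0, 3.0]
alpha = 0.1

def delta_w(w, x, t):
    # per-pattern correction: -alpha * dE/dw for E = 1/2 (t - w*x)^2
    return alpha * (t - w * x) * x

# Sequential (online): update after every pattern.
w_seq = 0.0
for x, t in zip(X, T):
    w_seq += delta_w(w_seq, x, t)

# Batch: accumulate over the epoch, apply the average once.
w_bat = 0.0
corrections = [delta_w(w_bat, x, t) for x, t in zip(X, T)]
w_bat += sum(corrections) / len(corrections)   # Δw = (1/P) Σ_p Δw_p
```

After one epoch the sequential weight has moved further (it used four updates), while the batch weight took one smoothed, averaged step.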
Initial weights
– The initial weights will influence whether the net reaches a global (or only a local) minimum of the error and, if so, how quickly it converges.
– The values of the initial weights must not be too large; otherwise the initial input signals to each hidden or output unit will likely fall in the region where the derivative of the sigmoid function is very small (f'(net) ≈ 0): the so-called saturation region.
– On the other hand, if the initial weights are too small, the net input to a hidden or output unit will be close to zero, which also causes extremely slow learning.
– It is best to set the initial weights (and biases) to random numbers between -0.5 and 0.5 (or between -1 and 1, or some other suitable interval).
– The values may be positive or negative, because the final weights after training may be of either sign as well.
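A minimal sketch of the suggested initialisation (the helper name `init_weights` is illustrative):

```python
import random

# Uniform random weights in [-0.5, 0.5], as recommended in the text.
def init_weights(n_in, n_out, lo=-0.5, hi=0.5, seed=42):
    rng = random.Random(seed)
    # one row of weights (plus a bias term) per receiving unit
    return [[rng.uniform(lo, hi) for _ in range(n_in + 1)]
            for _ in range(n_out)]

w = init_weights(3, 2)   # e.g. 3 inputs feeding 2 hidden units
```

Each unit gets small positive and negative weights, avoiding both the saturation region and the near-zero net inputs discussed above.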
Memorization vs. generalization
How long to train the net: since the usual motivation for applying a backprop net is to achieve a balance between memorization and generalization, it is not necessarily advantageous to continue training until the error actually reaches a minimum.
– Use 2 disjoint sets of data during training: 1/ a set of training patterns and 2/ a set of training-testing patterns (or validation set).
– Weight adjustments are based on the training patterns; however, at intervals during training, the error is computed using the validation patterns.
– As long as the error on the validation set decreases, training continues.
– When that error begins to increase, the net is starting to memorize the training patterns too specifically (it starts to lose its ability to generalize). At this point, training is terminated.
Early stopping
[Figure: error vs. training time. The error on the training set (which changes w_ij) keeps decreasing; the error on the validation set (which does not change w_ij) eventually starts to rise - stop there!]
L. Studer, IPHE-UNIL
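The stopping rule can be sketched as follows (the error values and the `patience` parameter are hypothetical; in practice the errors come from evaluating the net on the validation set each epoch):

```python
# Stop when the validation error starts increasing (early stopping).
def early_stop(val_errors, patience=1):
    """Return the epoch with the best validation error seen before
    the error has risen for more than `patience` consecutive epochs."""
    best, best_epoch, worse = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, worse = err, epoch, 0
        else:
            worse += 1
            if worse > patience:      # error keeps rising: memorising
                break
    return best_epoch

stop = early_stop([1.0, 0.6, 0.4, 0.35, 0.4, 0.5, 0.7])
```

Here training would be stopped at epoch 3, where the validation error bottomed out before rising.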
Backpropagation with momentum
Backpropagation with momentum: the weight change is in a direction that is a combination of 1/ the current gradient and 2/ the previous gradient.
Momentum can be added so that weights tend to change more quickly if they keep changing in the same direction over several training cycles:
∆w_ij(t+1) = α δ x_i + µ ∆w_ij(t)
µ is called the “momentum factor” and ranges over 0 < µ < 1.
– When subsequent changes are in the same direction, the rate increases (accelerated descent).
– When subsequent changes are in opposite directions, the rate decreases (stabilisation).
Backpropagation with momentum…
[Figure: weight-update vectors w(t−1), w(t), w(t) + αδz, and w(t+1), showing the momentum term carrying the previous step into the new update.]
Source: Fausett, L., Fundamentals of Neural Networks, Prentice Hall, 1994, p. 305.
BP training algorithm - Adaptive Learning Rate
Source: Fausett, L., Fundamentals of Neural Networks, Prentice Hall, 1994.
Adaptive Learning rate…
Adaptive parameters: vary the learning rate during training, accelerating learning slowly if all is well (error E decreasing), but reducing it quickly if things go unstable (E increasing).
For example:
α(t+1) = α(t) + a          if ∆E < 0 for the last few epochs
α(t+1) = (1 − b)·α(t)      if ∆E > 0
α(t+1) = α(t)              otherwise
Typically, a = 0.1, b = 0.5.
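A sketch of this rule (the function name `update_rate` and the example ∆E values are illustrative):

```python
# Grow α additively while E falls; shrink it multiplicatively when E rises.
def update_rate(alpha, delta_E, a=0.1, b=0.5):
    if delta_E < 0:             # error decreasing: accelerate slowly
        return alpha + a
    if delta_E > 0:             # error increasing: back off quickly
        return (1 - b) * alpha
    return alpha                # unchanged otherwise

alpha = 0.3
alpha = update_rate(alpha, -0.05)   # E fell:  0.3 -> 0.4
alpha = update_rate(alpha, +0.02)   # E rose:  0.4 -> 0.2
```

The asymmetry (additive growth, multiplicative decay) is what makes the rate recover slowly but collapse fast when training goes unstable.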
Matlab BP NN Architecture
A neuron with a single R-element input vector is shown below. The individual element inputs:
– are multiplied by weights,
– and the weighted values are fed to the summing junction. Their sum is simply Wp, the dot product of the (single-row) matrix W and the vector p.
The neuron has a bias b, which is summed with the weighted inputs to form the net input n. This sum, n, is the argument of the transfer function f.
This expression can, of course, be written in MATLAB code as:
n = W*p + b
However, the user will seldom write code at this low level, since such code is already built into functions that define and simulate entire networks.
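The same computation, sketched in Python with numpy (the weight, input, and bias values are illustrative):

```python
import numpy as np

# Equivalent of the MATLAB line n = W*p + b:
# W is a single-row weight matrix, p the R-element input, b the bias.
W = np.array([[0.5, -1.0, 2.0]])   # 1 x R weight matrix
p = np.array([1.0, 2.0, 0.5])      # R-element input vector
b = 0.1
n = W @ p + b                      # net input to the transfer function f
```

`n` is then passed through the transfer function f to produce the neuron's output.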
Matlab BP NN Architecture
Backpropagation Neural Networks
– Architecture
– BP training algorithm
– Generalization
– Examples (Example 1, Example 2)
– Uses (applications) of BP networks
– Options/variations on BP (momentum; sequential vs. batch; adaptive learning rates)
– Appendix
– References and suggested reading
Learning Rule
Similar to the Delta Rule. Our goal is to minimize the error E, which measures the difference between the targets t_k and our outputs y_k, using a least-squares error measure:
E = ½ Σ_k (t_k − y_k)²
To find out how to change w_jk and v_ij to reduce E, we need to find ∂E/∂w_jk and ∂E/∂v_ij.
Fausett, section 6.3, p. 324
Delta Rule Derivation: Hidden-to-Output
E = ½ Σ_k (t_k − y_k)², hence
∂E/∂w_JK = ∂/∂w_JK [ ½ Σ_k (t_k − y_k)² ] = ∂/∂w_JK [ ½ (t_K − y_K)² ]
(only the k = K term depends on w_JK)
= −(t_K − y_K) ∂y_K/∂w_JK = −(t_K − y_K) f′(y_in,K) ∂y_in,K/∂w_JK
where y_k = f(y_in,k) and y_in,K = Σ_j z_j w_jK, so that ∂y_in,K/∂w_JK = z_J. Hence
∂E/∂w_JK = −(t_K − y_K) f′(y_in,K) z_J
Notice the difference between the subscripts k (which ranges over all nodes between the hidden and output layers) and K (which denotes the particular node K of interest).
Delta Rule Derivation: Hidden-to-Output
It is convenient to define δ_K = (t_K − y_K) f′(y_in,K).
Thus ∆w_jk = −α ∂E/∂w_jk = α (t_k − y_k) f′(y_in,k) z_j = α δ_k z_j.
In summary, ∆w_jk = α δ_k z_j, with δ_K = (t_K − y_K) f′(y_in,K).
Delta Rule Derivation: Input to Hidden
E = ½ Σ_k (t_k − y_k)², hence
∂E/∂v_IJ = −Σ_k (t_k − y_k) ∂y_k/∂v_IJ = −Σ_k (t_k − y_k) f′(y_in,k) ∂y_in,k/∂v_IJ = −Σ_k δ_k ∂y_in,k/∂v_IJ
where y_k = f(y_in,k) and y_in,k = Σ_j z_j w_jk. Only z_J depends on v_IJ, so
∂y_in,k/∂v_IJ = w_Jk ∂z_J/∂v_IJ = w_Jk f′(z_in,J) x_I
hence ∂E/∂v_IJ = −[ Σ_k δ_k w_Jk ] f′(z_in,J) x_I
It is convenient to define δ_J = [ Σ_k δ_k w_Jk ] f′(z_in,J).
Notice the difference between the subscripts j and J, and i and I.
∆v_ij = −α ∂E/∂v_ij = α [ Σ_k δ_k w_jk ] f′(z_in,j) x_i = α δ_j x_i
Delta Rule Derivation: Input to Hidden
In summary, ∆v_ij = α δ_j x_i, where δ_J = [ Σ_k δ_k w_Jk ] f′(z_in,J).
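The two summary rules can be combined into one training loop; a minimal sketch assuming numpy, a logistic f, and the slides' x/z/y, v/w notation (the single training pattern and learning rate are illustrative):

```python
import numpy as np

# One-pattern training loop implementing the derived update rules:
#   Δw_jk = α δ_k z_j,  δ_K = (t_K − y_K) f'(y_in,K)
#   Δv_ij = α δ_j x_i,  δ_J = [Σ_k δ_k w_Jk] f'(z_in,J)
f = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic: f'(a) = f(a)(1 − f(a))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])                 # input pattern (illustrative)
t = np.array([1.0])                      # target
v = rng.uniform(-0.5, 0.5, (2, 3))       # input-to-hidden weights v_ij
w = rng.uniform(-0.5, 0.5, (3, 1))       # hidden-to-output weights w_jk
alpha = 0.5

for _ in range(1000):
    z = f(x @ v)                         # hidden activations z_j
    y = f(z @ w)                         # outputs y_k
    delta_k = (t - y) * y * (1 - y)      # δ_k = (t_k − y_k) f'(y_in,k)
    delta_j = (delta_k @ w.T) * z * (1 - z)   # δ_j = [Σ_k δ_k w_jk] f'(z_in,j)
    w += alpha * np.outer(z, delta_k)    # Δw_jk = α δ_k z_j
    v += alpha * np.outer(x, delta_j)    # Δv_ij = α δ_j x_i

E = 0.5 * np.sum((t - f(f(x @ v) @ w)) ** 2)
```

Each iteration is one "sequential" update as described earlier: forward pass, output deltas, hidden deltas, then both weight layers adjusted.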
Backpropagation Neural Networks
– Architecture
– BP training algorithm
– Generalization
– Examples (Example 1, Example 2)
– Uses (applications) of BP networks
– Options/variations on BP (momentum; sequential vs. batch; adaptive learning rates)
– Appendix
– References and suggested reading
Suggested Reading
L. Fausett, “Fundamentals of Neural Networks”, Prentice-Hall, 1994, Chapter 6.
References:
These lecture notes were based on the references of the previous slide and on the following references:
1. Eric Plummer, University of Wyoming: www.karlbranting.net/papers/plummer/Pres.ppt
2. Clara Boyd, Columbia University, NY: comet.ctr.columbia.edu/courses/elen_e4011/2002/Artificial.ppt
3. Dan St. Clair, University of Missouri-Rolla: http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/Misc/CS404_fall2001/Lectures/Lect09_102301/
4. Vamsi Pegatraju and Aparna Patsa: web.umr.edu/~stclair/class/classfiles/cs404_fs02/Lectures/Lect09_102902/Lect8_Homework/L8_3.ppt
5. Richard Spillman, Pacific Lutheran University: www.cs.plu.edu/courses/csce436/notes/pr_l22_nn5.ppt
6. Khurshid Ahmad and Matthew Casey, University of Surrey: http://www.computing.surrey.ac.uk/courses/cs365/