CS621: Artificial Intelligence
Transcript of CS621: Artificial Intelligence
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 44– Multilayer Perceptron Training: Sigmoid Neuron; Backpropagation
9th Nov, 2010
The Perceptron Model

- Inputs x1, …, xn-1, xn with weights w1, …, wn-1, wn
- Output = y
- Threshold = θ
- y = 1 for Σ wi·xi ≥ θ; y = 0 otherwise

[Figure: step function — y plotted against Σ wi·xi jumps from 0 to 1 at θ]
Perceptron Training Algorithm

1. Start with a random value of w (ex: <0,0,0,…>)
2. Test for w·xi > 0. If the test succeeds for i = 1, 2, …, n, then return w.
3. Modify w: w_next = w_prev + x_fail
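The three steps above can be sketched directly in Python. This is a minimal sketch under one standard assumption the slide leaves implicit: each training vector is preprocessed (a bias component appended, class-0 vectors negated) so that a correct w satisfies w·x > 0 for every vector. The AND example at the bottom is illustrative, not from the slides.

```python
def perceptron_train(X, max_iters=1000):
    """PTA sketch. Assumes each vector in X is preprocessed (bias
    component appended, class-0 vectors negated) so that a correct
    w satisfies w.x > 0 for every x in X."""
    w = [0.0] * len(X[0])                    # 1. start with w = <0,0,0,...>
    for _ in range(max_iters):
        failed = None
        for x in X:                          # 2. test w.x > 0 for all x
            if sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                failed = x
                break
        if failed is None:
            return w                         #    all tests succeed: return w
        w = [wi + xi for wi, xi in zip(w, failed)]  # 3. w_next = w_prev + x_fail
    return w

# AND gate: positive example (1,1) with bias 1; negative examples negated
X = [(1, 1, 1), (0, 0, -1), (0, -1, -1), (-1, 0, -1)]
w = perceptron_train(X)
```

By the perceptron convergence theorem this returns well within the iteration budget on separable data such as the AND example.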
Feedforward Network
Limitations of perceptron

- Non-linear separability is all-pervading
- A single perceptron does not have enough computing power
- Eg: XOR cannot be computed by a perceptron

Solutions

- Tolerate error (ex: the pocket algorithm used by connectionist expert systems): try to get the best possible hyperplane using only perceptrons
- Use higher-dimension surfaces (ex: degree-2 surfaces like a parabola)
- Use a layered network
Pocket Algorithm

- Algorithm evolved in 1985; essentially uses PTA
- Basic idea: always preserve the best weight obtained so far in the "pocket"
- Change weights, if found better (i.e. the changed weights result in reduced error)
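The pocket idea can be sketched as PTA plus a "best so far" slot. This is a minimal sketch, not the exact 1985 formulation; the random example order, the XOR test data, and all names are choices of this sketch.

```python
import random

def pocket_train(X, y, updates=200, seed=0):
    """Pocket algorithm sketch: run ordinary PTA corrections, but keep
    in the 'pocket' the lowest-error weight vector seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])

    def n_errors(w):
        # predict 1 iff w.x > 0, then count disagreements with the labels
        return sum(((sum(wi * xi for wi, xi in zip(w, x)) > 0) != (t == 1))
                   for x, t in zip(X, y))

    pocket, pocket_err = list(w), n_errors(w)
    for _ in range(updates):
        x, t = rng.choice(list(zip(X, y)))          # random training example
        o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if o != t:                                  # PTA-style correction
            sign = 1 if t == 1 else -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            e = n_errors(w)
            if e < pocket_err:                      # changed weights are better:
                pocket, pocket_err = list(w), e     # preserve them in the pocket
    return pocket, pocket_err

# XOR with a bias input: not linearly separable, so some error must be tolerated
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 1, 1, 0]
w, err = pocket_train(X, y)
```

Because the pocket is only ever replaced by strictly better weights, the returned error can never exceed that of the initial weight vector.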
XOR using 2 layers

x1 XOR x2 = OR(AND(x1, NOT(x2)), AND(NOT(x1), x2))
• Non-LS function expressed as a linearly separable function of individual linearly separable functions.
Example - XOR

First compute AND(NOT(x1), x2) with a single perceptron:

x1  x2  x̄1·x2
0   0   0
0   1   1
1   0   0
1   1   0

The weight constraints are: 0 < θ, w2 ≥ θ, w1 < θ, w1 + w2 < θ.
A solution: w1 = -1, w2 = 1.5, θ = 1.
Calculation of XOR

Calculation of x1·x̄2 is symmetric (w1 = 1.5, w2 = -1, θ = 1). The output perceptron with w1 = 1, w2 = 1, θ = 0.5 computes the OR of the two hidden outputs x̄1·x2 and x1·x̄2.

[Figure: two-layer XOR network — output unit (θ = 0.5, weights 1 and 1) over hidden units x̄1·x2 and x1·x̄2; each hidden unit takes x1 and x2 with weights -1 and 1.5 in mirrored order]
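The hand-built two-layer network can be checked directly; a minimal sketch using exactly the weights and thresholds from the slides:

```python
def step(net, theta):
    """Perceptron output: 1 iff the net input reaches the threshold."""
    return 1 if net >= theta else 0

def xor_net(x1, x2):
    h1 = step(-1.0 * x1 + 1.5 * x2, 1)     # NOT(x1) AND x2: w = (-1, 1.5), theta = 1
    h2 = step(1.5 * x1 - 1.0 * x2, 1)      # x1 AND NOT(x2): mirrored weights
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # OR of the hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```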
Some Terminology

- A multilayer feedforward neural network has: an input layer, an output layer, and hidden layer(s) (which assist computation)
- Output units and hidden units are called computation units.
Training of the MLP

- Multilayer Perceptron (MLP)
- Question: how to find weights for the hidden layers when no target output is available?
- Credit assignment problem, to be solved by "Gradient Descent"
[Figure: network with inputs x1, x2, hidden units h1, h2 and one output unit; each unit has linear I/O behaviour y = m·x + c, with slopes m1, m2, m3 and offsets c1, c2, c3]

h1 = m1(w1·x1 + w2·x2) + c1
h2 = m2(w3·x1 + w4·x2) + c2
Out = m3(w5·h1 + w6·h2) + c3 = k1·x1 + k2·x2 + k3
Can Linear Neurons Work?
Note: the whole structure shown in the earlier slide is reducible to a single neuron with the same linear behavior.
Claim: A neuron with linear I-O behavior can't compute X-OR.
Proof: Considering all possible cases [assuming 0.1 and 0.9 as the lower and upper thresholds]:

For (0,0), Zero class: m(w1·0 + w2·0 + c) ≤ 0.1, i.e. m·c ≤ 0.1
For (0,1), One class: m(w1·0 + w2·1 + c) ≥ 0.9, i.e. m·w2 + m·c ≥ 0.9
For (1,0), One class: m(w1·1 + w2·0 + c) ≥ 0.9, i.e. m·w1 + m·c ≥ 0.9
For (1,1), Zero class: m(w1·1 + w2·1 + c) ≤ 0.1, i.e. m·w1 + m·w2 + m·c ≤ 0.1

Adding the two One-class inequalities gives m·w1 + m·w2 + 2m·c ≥ 1.8, while adding the two Zero-class inequalities gives m·w1 + m·w2 + 2m·c ≤ 0.2. These equations are inconsistent. Hence X-OR can't be computed.
Observations:
1. A linear neuron can't compute X-OR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence there is no additional power due to the hidden layer.
3. Non-linearity is essential for power.
Multilayer Perceptron
Gradient Descent Technique

- Let E be the error at the output layer:

  E = (1/2) Σ_{j=1..p} Σ_{i=1..n} (t_i - o_i)_j²

- t_i = target output; o_i = observed output
- i is the index going over the n neurons in the outermost layer
- j is the index going over the p patterns (1 to p)
- Ex: XOR: p = 4 and n = 1
Weights in a FF NN

- w_mn is the weight of the connection from the nth neuron to the mth neuron
- E vs. W is a complex surface in the space defined by the weights w_ij
- -∂E/∂w_mn gives the direction in which a movement of the operating point in the w_mn co-ordinate space will result in maximum decrease in error:

  Δw_mn ∝ -∂E/∂w_mn

[Figure: neuron n feeding neuron m through the weight w_mn]
Sigmoid neurons

- Gradient descent needs a derivative computation: not possible in the perceptron due to the discontinuous step function used!
- Sigmoid neurons with easy-to-compute derivatives are used:

  y → 1 as x → ∞
  y → 0 as x → -∞

- Computing power comes from the non-linearity of the sigmoid function.
Derivative of Sigmoid function

y = 1 / (1 + e^(-x))

dy/dx = -(1 / (1 + e^(-x))²) · (-e^(-x)) = e^(-x) / (1 + e^(-x))²

      = (1 / (1 + e^(-x))) · (1 - 1 / (1 + e^(-x)))

      = y(1 - y)
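The closed form y(1 - y) can be checked numerically against a central finite difference; a small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    y = sigmoid(x)
    analytic = y * (1.0 - y)                               # dy/dx = y(1 - y)
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
    assert abs(analytic - numeric) < 1e-8
```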
Training algorithm

- Initialize weights to random values.
- For input x = <x_n, x_{n-1}, …, x_0>, modify weights as follows (target output = t, observed output = o):

  E = (1/2)(t - o)²
  Δw_i ∝ -∂E/∂w_i

- Iterate until E < τ (threshold)
Calculation of ∆wi

Δw_i = -η ∂E/∂w_i   (η = learning constant, 0 ≤ η ≤ 1)

∂E/∂w_i = (∂E/∂net) · (∂net/∂w_i),   where net = Σ_{i=0..n} w_i x_i

        = (∂E/∂o) · (∂o/∂net) · (∂net/∂w_i)

        = -(t - o) · o(1 - o) · x_i

Hence Δw_i = η(t - o) o(1 - o) x_i
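The update rule Δw_i = η(t - o)o(1 - o)x_i can be exercised on a single sigmoid neuron. A minimal sketch on the AND function (learnable by one neuron, unlike XOR); folding the bias in as a constant input x_0 = 1, the seed, η = 0.5, and the epoch count are all choices of this sketch, not from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# AND truth table; x[0] = 1 is a constant bias input playing the role of -theta
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

random.seed(0)
eta = 0.5                                  # learning constant, 0 <= eta <= 1
w = [random.uniform(-0.5, 0.5) for _ in range(3)]

for epoch in range(10000):
    for x, t in data:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(3):                 # delta w_i = eta (t - o) o (1 - o) x_i
            w[i] += eta * (t - o) * o * (1 - o) * x[i]

for x, t in data:
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    print(x[1:], round(o))                 # rounds to the AND truth table
```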
Observations

Does the training technique support our intuition?
- The larger the x_i, the larger is Δw_i.
- The error burden is borne by the weight values corresponding to large input values.
Backpropagation on feedforward network

Backpropagation algorithm

- Fully connected feedforward network
- Pure FF network (no jumping of connections over layers)

[Figure: output layer (m o/p neurons) at the top, hidden layers in the middle, input layer (n i/p neurons) at the bottom; weight w_ji connects neuron i in one layer to neuron j in the next]
Gradient Descent Equations

Δw_ji = -η ∂E/∂w_ji   (η = learning rate, 0 ≤ η ≤ 1)

∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)   (net_j = input at the jth layer)

Writing δ_j = -∂E/∂net_j and using ∂net_j/∂w_ji = o_i:

Δw_ji = η δ_j o_i
Backpropagation – for outermost layer

δ_j = -∂E/∂net_j = -(∂E/∂o_j) · (∂o_j/∂net_j)   (net_j = input at the jth unit)

E = (1/2) Σ_{p=1..m} (t_p - o_p)²

∂E/∂o_j = -(t_j - o_j)   and   ∂o_j/∂net_j = o_j(1 - o_j)

Hence δ_j = (t_j - o_j) o_j (1 - o_j), and

Δw_ji = η (t_j - o_j) o_j (1 - o_j) o_i
Backpropagation for hidden layers

[Figure: output layer (m o/p neurons), hidden layers, input layer (n i/p neurons); the δ_k of the next layer are propagated backwards to find the value of δ_j]
Backpropagation – for hidden layers

δ_j = -∂E/∂net_j = -(∂E/∂o_j) · (∂o_j/∂net_j) = -(∂E/∂o_j) · o_j(1 - o_j)

∂E/∂o_j = Σ_{k ∈ next layer} (∂E/∂net_k) · (∂net_k/∂o_j) = Σ_{k ∈ next layer} (-δ_k) · w_kj

Hence δ_j = o_j(1 - o_j) Σ_{k ∈ next layer} δ_k w_kj, and

Δw_ji = η δ_j o_i
General Backpropagation Rule

- General weight updating rule:

  Δw_ji = η δ_j o_i

- Where

  δ_j = (t_j - o_j) o_j (1 - o_j)   for the outermost layer

  δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj   for hidden layers
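The two δ rules combine into a complete training loop. A minimal 2-2-1 sketch for XOR; the seed, η = 0.5, the epoch count, and the bias-as-extra-input convention are all choices of this sketch, and with an unlucky initialisation plain gradient descent can stall in a local minimum, so only the error decrease is checked.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
eta = 0.5                                  # learning rate
# 2-2-1 network; index 0 of every weight vector is the bias weight
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_o = [random.uniform(-1, 1) for _ in range(3)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    xi = (1,) + x                          # bias input + real inputs
    h = [sigmoid(sum(w * v for w, v in zip(ws, xi))) for ws in w_h]
    hi = (1,) + tuple(h)
    return xi, hi, sigmoid(sum(w * v for w, v in zip(w_o, hi)))

def total_error():
    return sum((t - forward(x)[2]) ** 2 for x, t in data) / 2

err_before = total_error()
for epoch in range(20000):
    for x, t in data:
        xi, hi, o = forward(x)
        d_o = (t - o) * o * (1 - o)        # outermost: (t_j - o_j) o_j (1 - o_j)
        d_h = [hi[j + 1] * (1 - hi[j + 1]) * d_o * w_o[j + 1]
               for j in range(2)]          # hidden: o_j (1 - o_j) sum_k d_k w_kj
        for i in range(3):                 # delta w_ji = eta * delta_j * o_i
            w_o[i] += eta * d_o * hi[i]
        for j in range(2):
            for i in range(3):
                w_h[j][i] += eta * d_h[j] * xi[i]
err_after = total_error()
```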
How does it work?

- Input propagation forward and error propagation backward (e.g. XOR)

[Figure: the two-layer XOR network again — output unit (θ = 0.5, weights 1 and 1) over hidden units x̄1·x2 and x1·x̄2, hidden weights -1 and 1.5 in mirrored order]