CS621: Artificial Intelligence
Transcript of CS621: Artificial Intelligence
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 44– Multilayer Perceptron Training: Sigmoid Neuron; Backpropagation
9th Nov, 2010
The Perceptron Model

- Inputs x1, …, xn-1, xn with weights w1, …, wn-1, wn
- Output = y
- Threshold = θ
- y = 1 for Σ wi·xi ≥ θ; y = 0 otherwise

[Figure: step function — y plotted against Σ wi·xi jumps from 0 to 1 at θ]
Perceptron Training Algorithm

1. Start with a random value of w (ex: <0,0,0,…>)
2. Test for w·xi > 0. If the test succeeds for i = 1, 2, …, n, then return w.
3. Modify w: w_next = w_prev + x_fail
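The three steps above can be sketched directly in Python. This is a minimal sketch under one standard assumption the slide leaves implicit: each training vector is preprocessed (a bias component appended, class-0 vectors negated) so that a correct w satisfies w·x > 0 for every vector. The AND example at the bottom is illustrative, not from the slides.

```python
def perceptron_train(X, max_iters=1000):
    """PTA sketch. Assumes each vector in X is preprocessed (bias
    component appended, class-0 vectors negated) so that a correct
    w satisfies w.x > 0 for every x in X."""
    w = [0.0] * len(X[0])                    # 1. start with w = <0,0,0,...>
    for _ in range(max_iters):
        failed = None
        for x in X:                          # 2. test w.x > 0 for all x
            if sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                failed = x
                break
        if failed is None:
            return w                         #    all tests succeed: return w
        w = [wi + xi for wi, xi in zip(w, failed)]  # 3. w_next = w_prev + x_fail
    return w

# AND gate: positive example (1,1) with bias 1; negative examples negated
X = [(1, 1, 1), (0, 0, -1), (0, -1, -1), (-1, 0, -1)]
w = perceptron_train(X)
```

By the perceptron convergence theorem this returns well within the iteration budget on separable data such as the AND example.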
Feedforward Network
Limitations of perceptron

- Non-linear separability is all-pervading
- A single perceptron does not have enough computing power
- Eg: XOR cannot be computed by a perceptron

Solutions

- Tolerate error (ex: the pocket algorithm used by connectionist expert systems): try to get the best possible hyperplane using only perceptrons
- Use higher-dimension surfaces (ex: degree-2 surfaces like a parabola)
- Use a layered network
Pocket Algorithm

- Algorithm evolved in 1985; essentially uses PTA
- Basic idea: always preserve the best weight obtained so far in the "pocket"
- Change weights, if found better (i.e. the changed weights result in reduced error)
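The pocket idea can be sketched as PTA plus a "best so far" slot. This is a minimal sketch, not the exact 1985 formulation; the random example order, the XOR test data, and all names are choices of this sketch.

```python
import random

def pocket_train(X, y, updates=200, seed=0):
    """Pocket algorithm sketch: run ordinary PTA corrections, but keep
    in the 'pocket' the lowest-error weight vector seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])

    def n_errors(w):
        # predict 1 iff w.x > 0, then count disagreements with the labels
        return sum(((sum(wi * xi for wi, xi in zip(w, x)) > 0) != (t == 1))
                   for x, t in zip(X, y))

    pocket, pocket_err = list(w), n_errors(w)
    for _ in range(updates):
        x, t = rng.choice(list(zip(X, y)))          # random training example
        o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if o != t:                                  # PTA-style correction
            sign = 1 if t == 1 else -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            e = n_errors(w)
            if e < pocket_err:                      # changed weights are better:
                pocket, pocket_err = list(w), e     # preserve them in the pocket
    return pocket, pocket_err

# XOR with a bias input: not linearly separable, so some error must be tolerated
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 1, 1, 0]
w, err = pocket_train(X, y)
```

Because the pocket is only ever replaced by strictly better weights, the returned error can never exceed that of the initial weight vector.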
XOR using 2 layers

x1 XOR x2 = OR(AND(x1, NOT(x2)), AND(NOT(x1), x2))
• Non-LS function expressed as a linearly separable function of individual linearly separable functions.
Example - XOR

First compute AND(NOT(x1), x2) with a single perceptron:

x1  x2  x̄1·x2
0   0   0
0   1   1
1   0   0
1   1   0

The weight constraints are: 0 < θ, w2 ≥ θ, w1 < θ, w1 + w2 < θ.
A solution: w1 = -1, w2 = 1.5, θ = 1.
Calculation of XOR

Calculation of x1·x̄2 is symmetric (w1 = 1.5, w2 = -1, θ = 1). The output perceptron with w1 = 1, w2 = 1, θ = 0.5 computes the OR of the two hidden outputs x̄1·x2 and x1·x̄2.

[Figure: two-layer XOR network — output unit (θ = 0.5, weights 1 and 1) over hidden units x̄1·x2 and x1·x̄2; each hidden unit takes x1 and x2 with weights -1 and 1.5 in mirrored order]
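The hand-built two-layer network can be checked directly; a minimal sketch using exactly the weights and thresholds from the slides:

```python
def step(net, theta):
    """Perceptron output: 1 iff the net input reaches the threshold."""
    return 1 if net >= theta else 0

def xor_net(x1, x2):
    h1 = step(-1.0 * x1 + 1.5 * x2, 1)     # NOT(x1) AND x2: w = (-1, 1.5), theta = 1
    h2 = step(1.5 * x1 - 1.0 * x2, 1)      # x1 AND NOT(x2): mirrored weights
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # OR of the hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```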
Some Terminology

- A multilayer feedforward neural network has: an input layer, an output layer, and hidden layer(s) (which assist computation)
- Output units and hidden units are called computation units.
Training of the MLP

- Multilayer Perceptron (MLP)
- Question: how to find weights for the hidden layers when no target output is available?
- Credit assignment problem, to be solved by "Gradient Descent"
[Figure: network with inputs x1, x2, hidden units h1, h2 and one output unit; each unit has linear I/O behaviour y = m·x + c, with slopes m1, m2, m3 and offsets c1, c2, c3]

h1 = m1(w1·x1 + w2·x2) + c1
h2 = m2(w3·x1 + w4·x2) + c2
Out = m3(w5·h1 + w6·h2) + c3 = k1·x1 + k2·x2 + k3
Can Linear Neurons Work?
Note: the whole structure shown in the earlier slide is reducible to a single neuron with the same linear behavior.
Claim: A neuron with linear I-O behavior can't compute X-OR.
Proof: Considering all possible cases [assuming 0.1 and 0.9 as the lower and upper thresholds]:

For (0,0), Zero class: m(w1·0 + w2·0 + c) ≤ 0.1, i.e. m·c ≤ 0.1
For (0,1), One class: m(w1·0 + w2·1 + c) ≥ 0.9, i.e. m·w2 + m·c ≥ 0.9
For (1,0), One class: m(w1·1 + w2·0 + c) ≥ 0.9, i.e. m·w1 + m·c ≥ 0.9
For (1,1), Zero class: m(w1·1 + w2·1 + c) ≤ 0.1, i.e. m·w1 + m·w2 + m·c ≤ 0.1

Adding the two One-class inequalities gives m·w1 + m·w2 + 2m·c ≥ 1.8, while adding the two Zero-class inequalities gives m·w1 + m·w2 + 2m·c ≤ 0.2. These equations are inconsistent. Hence X-OR can't be computed.
Observations:
1. A linear neuron can't compute X-OR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence there is no additional power due to the hidden layer.
3. Non-linearity is essential for power.
Multilayer Perceptron
Gradient Descent Technique

- Let E be the error at the output layer:

  E = (1/2) Σ_{j=1..p} Σ_{i=1..n} (t_i - o_i)_j²

- t_i = target output; o_i = observed output
- i is the index going over the n neurons in the outermost layer
- j is the index going over the p patterns (1 to p)
- Ex: XOR: p = 4 and n = 1
Weights in a FF NN

- w_mn is the weight of the connection from the nth neuron to the mth neuron
- E vs. W is a complex surface in the space defined by the weights w_ij
- -∂E/∂w_mn gives the direction in which a movement of the operating point in the w_mn co-ordinate space will result in maximum decrease in error:

  Δw_mn ∝ -∂E/∂w_mn

[Figure: neuron n feeding neuron m through the weight w_mn]
Sigmoid neurons

- Gradient descent needs a derivative computation: not possible in the perceptron due to the discontinuous step function used!
- Sigmoid neurons with easy-to-compute derivatives are used:

  y → 1 as x → ∞
  y → 0 as x → -∞

- Computing power comes from the non-linearity of the sigmoid function.
Derivative of Sigmoid function

y = 1 / (1 + e^(-x))

dy/dx = -(1 / (1 + e^(-x))²) · (-e^(-x)) = e^(-x) / (1 + e^(-x))²

      = (1 / (1 + e^(-x))) · (1 - 1 / (1 + e^(-x)))

      = y(1 - y)
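The closed form y(1 - y) can be checked numerically against a central finite difference; a small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    y = sigmoid(x)
    analytic = y * (1.0 - y)                               # dy/dx = y(1 - y)
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
    assert abs(analytic - numeric) < 1e-8
```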
Training algorithm

- Initialize weights to random values.
- For input x = <x_n, x_{n-1}, …, x_0>, modify weights as follows (target output = t, observed output = o):

  E = (1/2)(t - o)²
  Δw_i ∝ -∂E/∂w_i

- Iterate until E < τ (threshold)
Calculation of ∆wi

Δw_i = -η ∂E/∂w_i   (η = learning constant, 0 ≤ η ≤ 1)

∂E/∂w_i = (∂E/∂net) · (∂net/∂w_i),   where net = Σ_{i=0..n} w_i x_i

        = (∂E/∂o) · (∂o/∂net) · (∂net/∂w_i)

        = -(t - o) · o(1 - o) · x_i

Hence Δw_i = η(t - o) o(1 - o) x_i
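The update rule Δw_i = η(t - o)o(1 - o)x_i can be exercised on a single sigmoid neuron. A minimal sketch on the AND function (learnable by one neuron, unlike XOR); folding the bias in as a constant input x_0 = 1, the seed, η = 0.5, and the epoch count are all choices of this sketch, not from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# AND truth table; x[0] = 1 is a constant bias input playing the role of -theta
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

random.seed(0)
eta = 0.5                                  # learning constant, 0 <= eta <= 1
w = [random.uniform(-0.5, 0.5) for _ in range(3)]

for epoch in range(10000):
    for x, t in data:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(3):                 # delta w_i = eta (t - o) o (1 - o) x_i
            w[i] += eta * (t - o) * o * (1 - o) * x[i]

for x, t in data:
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    print(x[1:], round(o))                 # rounds to the AND truth table
```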
Observations

Does the training technique support our intuition?
- The larger the x_i, the larger is Δw_i.
- The error burden is borne by the weight values corresponding to large input values.
Backpropagation on feedforward network

Backpropagation algorithm

- Fully connected feedforward network
- Pure FF network (no jumping of connections over layers)

[Figure: output layer (m o/p neurons) at the top, hidden layers in the middle, input layer (n i/p neurons) at the bottom; weight w_ji connects neuron i in one layer to neuron j in the next]
Gradient Descent Equations

Δw_ji = -η ∂E/∂w_ji   (η = learning rate, 0 ≤ η ≤ 1)

∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)   (net_j = input at the jth layer)

Writing δ_j = -∂E/∂net_j and using ∂net_j/∂w_ji = o_i:

Δw_ji = η δ_j o_i
Backpropagation – for outermost layer

δ_j = -∂E/∂net_j = -(∂E/∂o_j) · (∂o_j/∂net_j)   (net_j = input at the jth unit)

E = (1/2) Σ_{p=1..m} (t_p - o_p)²

∂E/∂o_j = -(t_j - o_j)   and   ∂o_j/∂net_j = o_j(1 - o_j)

Hence δ_j = (t_j - o_j) o_j (1 - o_j), and

Δw_ji = η (t_j - o_j) o_j (1 - o_j) o_i
Backpropagation for hidden layers

[Figure: output layer (m o/p neurons), hidden layers, input layer (n i/p neurons); the δ_k of the next layer are propagated backwards to find the value of δ_j]
Backpropagation – for hidden layers

δ_j = -∂E/∂net_j = -(∂E/∂o_j) · (∂o_j/∂net_j) = -(∂E/∂o_j) · o_j(1 - o_j)

∂E/∂o_j = Σ_{k ∈ next layer} (∂E/∂net_k) · (∂net_k/∂o_j) = Σ_{k ∈ next layer} (-δ_k) · w_kj

Hence δ_j = o_j(1 - o_j) Σ_{k ∈ next layer} δ_k w_kj, and

Δw_ji = η δ_j o_i
General Backpropagation Rule

- General weight updating rule:

  Δw_ji = η δ_j o_i

- Where

  δ_j = (t_j - o_j) o_j (1 - o_j)   for the outermost layer

  δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj   for hidden layers
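The two δ rules combine into a complete training loop. A minimal 2-2-1 sketch for XOR; the seed, η = 0.5, the epoch count, and the bias-as-extra-input convention are all choices of this sketch, and with an unlucky initialisation plain gradient descent can stall in a local minimum, so only the error decrease is checked.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
eta = 0.5                                  # learning rate
# 2-2-1 network; index 0 of every weight vector is the bias weight
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_o = [random.uniform(-1, 1) for _ in range(3)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    xi = (1,) + x                          # bias input + real inputs
    h = [sigmoid(sum(w * v for w, v in zip(ws, xi))) for ws in w_h]
    hi = (1,) + tuple(h)
    return xi, hi, sigmoid(sum(w * v for w, v in zip(w_o, hi)))

def total_error():
    return sum((t - forward(x)[2]) ** 2 for x, t in data) / 2

err_before = total_error()
for epoch in range(20000):
    for x, t in data:
        xi, hi, o = forward(x)
        d_o = (t - o) * o * (1 - o)        # outermost: (t_j - o_j) o_j (1 - o_j)
        d_h = [hi[j + 1] * (1 - hi[j + 1]) * d_o * w_o[j + 1]
               for j in range(2)]          # hidden: o_j (1 - o_j) sum_k d_k w_kj
        for i in range(3):                 # delta w_ji = eta * delta_j * o_i
            w_o[i] += eta * d_o * hi[i]
        for j in range(2):
            for i in range(3):
                w_h[j][i] += eta * d_h[j] * xi[i]
err_after = total_error()
```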
How does it work?

- Input propagation forward and error propagation backward (e.g. XOR)

[Figure: the two-layer XOR network again — output unit (θ = 0.5, weights 1 and 1) over hidden units x̄1·x2 and x1·x̄2, hidden weights -1 and 1.5 in mirrored order]