
Transcript of CS621: Artificial Intelligence

Page 1: CS621: Artificial Intelligence

CS621: Artificial Intelligence
Pushpak Bhattacharyya

CSE Dept., IIT Bombay

Lecture 44 – Multilayer Perceptron Training: Sigmoid Neuron; Backpropagation

9th Nov, 2010

Page 2: CS621: Artificial Intelligence

The Perceptron Model

[Figure: a perceptron with inputs x1, …, xn-1, xn, weights w1, …, wn-1, wn, threshold θ, and output y]

y = 1 for Σ wixi >= θ
  = 0 otherwise

Page 3: CS621: Artificial Intelligence

[Figure: step-function characteristic of the perceptron – the output y jumps from 0 to 1 as Σ wixi crosses θ]

Page 4: CS621: Artificial Intelligence

Perceptron Training Algorithm

1. Start with a random value of w (ex: <0, 0, 0, …>)
2. Test for w·xi > 0. If the test succeeds for i = 1, 2, …, n, then return w
3. Otherwise modify w: wnext = wprev + xfail, and go back to step 2
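
Below is a minimal Python sketch of the PTA as stated above. It assumes the usual convention that each input vector carries a trailing constant component standing in for the threshold and that negative-class vectors are pre-negated, so the single test w·x > 0 covers both classes; the function name, iteration cap, and AND-style example data are illustrative only.

```python
def perceptron_training_algorithm(examples, max_iters=10000):
    """examples: augmented input vectors (trailing constant component for the
    threshold); negative-class vectors are assumed to be pre-negated."""
    w = [0.0] * len(examples[0])                         # start with w = <0, 0, 0, ...>
    for _ in range(max_iters):
        failed = next((x for x in examples
                       if sum(wi * xi for wi, xi in zip(w, x)) <= 0), None)
        if failed is None:                               # test w.x > 0 succeeded for all i
            return w
        w = [wi + xi for wi, xi in zip(w, failed)]       # w_next = w_prev + x_fail
    return w

# Illustrative use: learn AND (positive example kept, negative examples negated)
pos = [[1, 1, 1]]
neg = [[-a, -b, -1] for a, b in ((0, 0), (0, 1), (1, 0))]
print(perceptron_training_algorithm(pos + neg))
```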

Page 5: CS621: Artificial Intelligence

Feedforward Network

Page 6: CS621: Artificial Intelligence

Limitations of perceptron

• Non-linear separability is all-pervading
• A single perceptron does not have enough computing power
• E.g., XOR cannot be computed by a perceptron

Page 7: CS621: Artificial Intelligence

Solutions

• Tolerate error (ex: pocket algorithm used by connectionist expert systems). Try to get the best possible hyperplane using only perceptrons
• Use higher-dimension surfaces, ex: degree-2 surfaces like the parabola
• Use a layered network

Page 8: CS621: Artificial Intelligence

Pocket Algorithm

• Algorithm evolved in 1985 – essentially uses the PTA
• Basic idea:
  Always preserve the best weight obtained so far in the "pocket".
  Change pocket weights only if found better (i.e. the changed weights result in reduced error).
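
A minimal Python sketch of the pocket idea, reusing the augmented/negated input convention from the PTA sketch above; the `count_errors` helper and the iteration cap are illustrative assumptions, not the exact 1985 formulation.

```python
def count_errors(w, examples):
    """Number of examples failing the test w.x > 0."""
    return sum(1 for x in examples
               if sum(wi * xi for wi, xi in zip(w, x)) <= 0)

def pocket_algorithm(examples, max_iters=10000):
    w = [0.0] * len(examples[0])                          # running PTA weights
    pocket, pocket_errors = w[:], count_errors(w, examples)
    for _ in range(max_iters):
        failed = next((x for x in examples
                       if sum(wi * xi for wi, xi in zip(w, x)) <= 0), None)
        if failed is None:
            return w                                      # data perfectly separated
        w = [wi + xi for wi, xi in zip(w, failed)]        # ordinary PTA update
        errors = count_errors(w, examples)
        if errors < pocket_errors:                        # keep the best weights seen so far
            pocket, pocket_errors = w[:], errors
    return pocket
```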

Page 9: CS621: Artificial Intelligence

XOR using 2 layers

)))),(()),(,(( 2121

212121

xxNOTANDxNOTxANDORxxxxxx

• Non-LS function expressed as a linearly separable function of individual linearly separable functions.

Page 10: CS621: Artificial Intelligence

Example - XOR

Truth table of x̄1x2:

x1  x2  x̄1x2
0   0   0
0   1   1
1   0   0
1   1   0

Calculation of x̄1x2: a perceptron with inputs x1, x2, weights w1 = -1, w2 = 1.5 and threshold θ = 1.
The required constraints are: 0 < θ, w2 >= θ, w1 < θ, w1 + w2 < θ.

Calculation of XOR: a perceptron with inputs x1x̄2 and x̄1x2, weights w1 = 1, w2 = 1 and threshold θ = 0.5.

Page 11: CS621: Artificial Intelligence

Example - XOR

[Figure: two-layer network for XOR – inputs x1, x2 feed hidden units computing x1x̄2 (weights 1.5, -1) and x̄1x2 (weights -1, 1.5); their outputs feed an output unit with w1 = 1, w2 = 1 and θ = 0.5]
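
A small Python check of the two-layer XOR construction in the figure above. It assumes the hidden units use the threshold θ = 1 from the previous slide; the helper name `threshold_unit` is illustrative.

```python
def threshold_unit(weights, theta, inputs):
    """Perceptron unit: 1 if sum(w_i * x_i) >= theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

def xor_two_layer(x1, x2):
    h1 = threshold_unit([1.5, -1.0], 1.0, [x1, x2])    # x1 AND NOT x2
    h2 = threshold_unit([-1.0, 1.5], 1.0, [x1, x2])    # NOT x1 AND x2
    return threshold_unit([1.0, 1.0], 0.5, [h1, h2])   # OR of the hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_two_layer(x1, x2))     # reproduces the XOR truth table
```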

Page 12: CS621: Artificial Intelligence

Some Terminology

A multilayer feedforward neural network has:
• Input layer
• Output layer
• Hidden layer (assists computation)

Output units and hidden units are called computation units.

Page 13: CS621: Artificial Intelligence

Training of the MLP

• Multilayer Perceptron (MLP)
• Question: how to find weights for the hidden layers when no target output is available?
• Credit assignment problem – to be solved by "Gradient Descent"

Page 14: CS621: Artificial Intelligence

Can Linear Neurons Work?

[Figure: inputs x1 and x2 feed two linear hidden neurons h1 and h2, whose outputs feed a linear output neuron; each linear neuron computes y = mx + c]

y1 = m1x + c1,  y2 = m2x + c2,  y3 = m3x + c3

h1 = m1(w11x1 + w12x2) + c1
h2 = m2(w21x1 + w22x2) + c2

Out = w5h1 + w6h2 + c3
    = k1x1 + k2x2 + k3
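
A quick numerical illustration of this reduction in Python with NumPy. The specific m, c and w values are arbitrary assumptions chosen only to show that the two-layer linear network computes exactly the same function as a single linear neuron with coefficients k1, k2, k3.

```python
import numpy as np

# Arbitrary linear-neuron parameters (illustrative assumptions only)
m1, c1, m2, c2 = 2.0, 0.3, -1.5, 0.7          # hidden neurons: y = m*x + c
W = np.array([[0.5, -0.2],                     # w11, w12
              [1.0,  0.4]])                    # w21, w22
w5, w6, c3 = 0.8, -0.6, 0.1                    # output-layer weights and offset

def two_layer_linear(x):
    h1 = m1 * (W[0] @ x) + c1
    h2 = m2 * (W[1] @ x) + c2
    return w5 * h1 + w6 * h2 + c3

# Collapse to a single linear neuron: Out = k1*x1 + k2*x2 + k3
k = w5 * m1 * W[0] + w6 * m2 * W[1]
k3 = w5 * c1 + w6 * c2 + c3

x = np.array([0.4, -1.2])
print(two_layer_linear(x), k @ x + k3)         # both expressions give the same value
```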

Page 15: CS621: Artificial Intelligence

Note: The whole structure shown in the earlier slide is reducible to a single neuron with the given linear behavior.

Claim: A neuron with linear I-O behavior can’t compute X-OR.

Proof: Considering all possible cases:

[assuming 0.1 and 0.9 as the lower and upper thresholds]

Out = k1x1 + k2x2 + k3 = m(w1x1 + w2x2) + c

For (0,0), Zero class:  m(w1·0 + w2·0) + c <= 0.1,  i.e.  c <= 0.1

For (0,1), One class:  m(w1·0 + w2·1) + c >= 0.9,  i.e.  m·w2 + c >= 0.9

Page 16: CS621: Artificial Intelligence

For (1,0), One class:  m(w1·1 + w2·0) + c >= 0.9,  i.e.  m·w1 + c >= 0.9

For (1,1), Zero class:  m(w1·1 + w2·1) + c <= 0.1,  i.e.  m·(w1 + w2) + c <= 0.1

These equations are inconsistent: adding the two One-class inequalities gives m·(w1 + w2) + 2c >= 1.8, while adding the (0,0) and (1,1) inequalities gives m·(w1 + w2) + 2c <= 0.2. Hence X-OR can't be computed.

Observations:
1. A linear neuron can't compute X-OR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence no additional power due to the hidden layer.
3. Non-linearity is essential for power.

Page 17: CS621: Artificial Intelligence

Multilayer Perceptron

Page 18: CS621: Artificial Intelligence

Training of the MLP

• Multilayer Perceptron (MLP)
• Question: how to find weights for the hidden layers when no target output is available?
• Credit assignment problem – to be solved by "Gradient Descent"

Page 19: CS621: Artificial Intelligence

Gradient Descent Technique

Let E be the error at the output layer

ti = target output; oi = observed output

i is the index going over n neurons in the outermost layer

j is the index going over the p patterns (1 to p)

Ex: XOR – p = 4 and n = 1

E = (1/2) Σj=1..p Σi=1..n (ti - oi)²
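
A small Python illustration of the error expression above for the XOR case (p = 4 patterns, n = 1 output neuron); the observed outputs are made-up numbers used only to show the computation.

```python
# XOR targets for the four patterns and some hypothetical observed outputs
targets  = [[0.0], [1.0], [1.0], [0.0]]    # p = 4 patterns, n = 1 output neuron each
observed = [[0.1], [0.8], [0.7], [0.2]]    # assumed network outputs

# E = 1/2 * sum over patterns j and output neurons i of (t_i - o_i)^2
E = 0.5 * sum((t - o) ** 2
              for tj, oj in zip(targets, observed)
              for t, o in zip(tj, oj))
print(E)                                   # 0.5 * (0.01 + 0.04 + 0.09 + 0.04) = 0.09
```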

Page 20: CS621: Artificial Intelligence

Weights in a FF NN

• wmn is the weight of the connection from the nth neuron to the mth neuron.
• The E vs. W surface is a complex surface in the space defined by the weights wij.
• -∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error.

[Figure: neuron n feeding neuron m through the weight wmn]

Δwmn ∝ -∂E/∂wmn

Page 21: CS621: Artificial Intelligence

Sigmoid neurons

• Gradient Descent needs a derivative computation – not possible in the perceptron due to the discontinuous step function used!
• Sigmoid neurons with easy-to-compute derivatives are used!
  y → 1 as x → ∞
  y → 0 as x → -∞
• Computing power comes from the non-linearity of the sigmoid function.

Page 22: CS621: Artificial Intelligence

Derivative of Sigmoid function

y = 1 / (1 + e^(-x))

dy/dx = e^(-x) / (1 + e^(-x))²

      = [1 / (1 + e^(-x))] · [1 - 1 / (1 + e^(-x))]

      = y(1 - y)
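
A short Python check of the identity above, comparing the analytic derivative y(1 - y) with a numerical finite-difference estimate; the evaluation point and step size h are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
y = sigmoid(x)
analytic  = y * (1 - y)                               # dy/dx = y(1 - y)
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numerical)                            # the two values agree closely
```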

Page 23: CS621: Artificial Intelligence

Training algorithm

• Initialize weights to random values.
• For input x = <xn, xn-1, …, x0>, modify weights as follows
  (Target output = t, Observed output = o):

  E = (1/2)(t - o)²
  Δwi ∝ -∂E/∂wi

• Iterate until E < threshold

Page 24: CS621: Artificial Intelligence

Calculation of ∆wi

∂E/∂wi = (∂E/∂net) · (∂net/∂wi),  where net = Σi wixi

       = (∂E/∂o) · (∂o/∂net) · xi

       = -(t - o) · o(1 - o) · xi

Δwi = -η ∂E/∂wi  (η = learning constant, 0 < η <= 1)

    = η (t - o) o(1 - o) xi
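
A minimal sketch of this update rule for a single sigmoid neuron in Python; the AND-like training data, learning rate, and epoch count are illustrative assumptions.

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Inputs carry a trailing 1 so the last weight acts as a bias; targets are AND-like.
data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
w = [random.uniform(-0.5, 0.5) for _ in range(3)]
eta = 0.5                                              # learning constant, 0 < eta <= 1

for _ in range(5000):
    for x, t in data:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # delta w_i = eta * (t - o) * o * (1 - o) * x_i
        w = [wi + eta * (t - o) * o * (1 - o) * xi for wi, xi in zip(w, x)]

for x, t in data:
    print(x[:2], t, round(sigmoid(sum(wi * xi for wi, xi in zip(w, x))), 2))
```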

Page 25: CS621: Artificial Intelligence

Observations

Does the training technique support our intuition?
• The larger the xi, the larger is ∆wi.
• The error burden is borne by the weight values corresponding to large input values.

Page 26: CS621: Artificial Intelligence

Backpropagation on feedforward network

Page 27: CS621: Artificial Intelligence

Backpropagation algorithm

• Fully connected feedforward network
• Pure FF network (no jumping of connections over layers)

[Figure: output layer (m o/p neurons), hidden layers, and input layer (n i/p neurons); wji is the weight on the connection from neuron i to neuron j]

Page 28: CS621: Artificial Intelligence

Gradient Descent Equations

Δwji = -η ∂E/∂wji  (η = learning rate, 0 < η <= 1)

∂E/∂wji = (∂E/∂netj) · (∂netj/∂wji)  (netj = input at the jth layer)

        = (∂E/∂netj) · oi

Δwji = η δj oi,  where δj = -∂E/∂netj

Page 29: CS621: Artificial Intelligence

Backpropagation – for outermost layer

δj = -∂E/∂netj  (netj = input at the jth layer)

   = -(∂E/∂oj) · (∂oj/∂netj)

E = (1/2) Σp=1..m (tp - op)²

Hence, δj = (tj - oj) oj (1 - oj)

Δwji = η (tj - oj) oj (1 - oj) oi

Page 30: CS621: Artificial Intelligence

Backpropagation for hidden layers

[Figure: output layer (m o/p neurons), hidden layers, and input layer (n i/p neurons); hidden neuron j receives input from neuron i and feeds neurons k in the next layer]

δk is propagated backwards to find the value of δj.

Page 31: CS621: Artificial Intelligence

Backpropagation – for hidden layers

δj = -∂E/∂netj = -(∂E/∂oj) · (∂oj/∂netj)

   = -(∂E/∂oj) · oj(1 - oj)

∂E/∂oj = Σk∈next layer (∂E/∂netk) · (∂netk/∂oj)
       = Σk∈next layer (-δk) · wkj

Hence, δj = oj(1 - oj) Σk∈next layer δk wkj

Δwji = η δj oi = η oj(1 - oj) (Σk∈next layer δk wkj) oi

Page 32: CS621: Artificial Intelligence

General Backpropagation Rule

• General weight updating rule:  Δwji = η δj oi

• Where
  δj = (tj - oj) oj (1 - oj)  for the outermost layer
  δj = oj (1 - oj) Σk∈next layer δk wkj  for hidden layers
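
A compact Python sketch putting the general rule above to work on XOR with one hidden layer of two sigmoid neurons; the network size, learning rate, initialization, and epoch count are illustrative assumptions, not values from the lecture.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# 2 inputs -> 2 hidden sigmoid neurons -> 1 output sigmoid neuron (bias as extra weight)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]
eta = 0.5
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

for _ in range(20000):
    for (x1, x2), t in data:
        x = [x1, x2, 1]                                    # trailing 1 = bias input
        h = [sigmoid(sum(w * xi for w, xi in zip(wj, x))) for wj in w_hidden]
        ho = h + [1]
        o = sigmoid(sum(w * hi for w, hi in zip(w_out, ho)))

        delta_o = (t - o) * o * (1 - o)                    # outermost-layer delta
        # hidden-layer delta: o_j(1 - o_j) * sum_k delta_k * w_kj
        delta_h = [h[j] * (1 - h[j]) * delta_o * w_out[j] for j in range(2)]

        # general rule: delta w_ji = eta * delta_j * o_i
        w_out = [w + eta * delta_o * hi for w, hi in zip(w_out, ho)]
        for j in range(2):
            w_hidden[j] = [w + eta * delta_h[j] * xi
                           for w, xi in zip(w_hidden[j], x)]

# A 2-2-1 net can occasionally settle in a local minimum; re-seeding usually fixes it.
for (x1, x2), t in data:
    x = [x1, x2, 1]
    h = [sigmoid(sum(w * xi for w, xi in zip(wj, x))) for wj in w_hidden] + [1]
    print(x1, x2, t, round(sigmoid(sum(w * hi for w, hi in zip(w_out, h))), 2))
```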

Page 33: CS621: Artificial Intelligence

How does it work?

• Input propagation forward and error propagation backward (e.g. XOR)

[Figure: the two-layer XOR network from earlier – inputs x1, x2, hidden units computing x1x̄2 (weights 1.5, -1) and x̄1x2 (weights -1, 1.5), and an output unit with w1 = 1, w2 = 1, θ = 0.5]