
SOLUTIONS MANUAL

THIRD EDITION

Neural Networks and Learning Machines

Simon Haykin and Yanbo Xue
McMaster University, Canada


CHAPTER 1
Rosenblatt's Perceptron

Problem 1.1

(1) If wᵀ(n)x(n) > 0, then y(n) = +1. If also x(n) belongs to C1, then d(n) = +1. Under these conditions, the error signal is

e(n) = d(n) - y(n) = 0

and from Eq. (1.22) of the text:

w(n + 1) = w(n) + ηe(n)x(n) = w(n)

This result is the same as line 1 of Eq. (1.5) of the text.

(2) If wᵀ(n)x(n) < 0, then y(n) = -1. If also x(n) belongs to C2, then d(n) = -1. Under these conditions, the error signal e(n) remains zero, and so from Eq. (1.22) we have

w(n + 1) = w(n)

This result is the same as line 2 of Eq. (1.5).

(3) If wᵀ(n)x(n) > 0 and x(n) belongs to C2, we have

y(n) = +1
d(n) = -1

The error signal e(n) is -2, and so Eq. (1.22) yields

w(n + 1) = w(n) - 2ηx(n)

which has the same form as the first line of Eq. (1.6), except for the scaling factor 2.

(4) Finally, if wᵀ(n)x(n) < 0 and x(n) belongs to C1, then

y(n) = -1
d(n) = +1

In this case, the use of Eq. (1.22) yields

w(n + 1) = w(n) + 2ηx(n)

which has the same mathematical form as line 2 of Eq. (1.6), except for the scaling factor 2.
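As a quick illustration of cases (1)-(4), here is a minimal Python sketch of the error-correction update w(n + 1) = w(n) + ηe(n)x(n); the toy data, labels, and learning rate are hypothetical choices, not part of the textbook problem.

```python
import numpy as np

def perceptron_step(w, x, d, eta=0.5):
    """One error-correction update: w <- w + eta * e * x, with e = d - y."""
    y = 1 if w @ x > 0 else -1   # hard-limiter output
    e = d - y                    # e is 0 (correct), +2, or -2
    return w + eta * e * x

# Toy linearly separable data; last component is a constant bias input.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0],      # class C1 (d = +1)
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])  # class C2 (d = -1)
d = np.array([+1, +1, -1, -1])

w = np.zeros(3)
for _ in range(10):                 # a few passes suffice here
    for xi, di in zip(X, d):
        w = perceptron_step(w, xi, di)
print(w, [1 if w @ xi > 0 else -1 for xi in X])
```

Note that e is 0 or ±2, which is exactly the factor of 2 remarked on in cases (3) and (4).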

Problem 1.2

The output signal is defined by

y = tanh(v/2) = tanh(b/2 + (1/2) Σi wi xi)


Equivalently, we may write

b + Σi wi xi = y′     (1)

where

y′ = 2 tanh⁻¹(y)

Equation (1) is the equation of a hyperplane.

Problem 1.3

(a) AND operation: Truth Table 1

x1  x2 | y
 1   1 | 1
 0   1 | 0
 1   0 | 0
 0   0 | 0

This operation may be realized using the perceptron of Fig. 1.

The hard limiter input is

v = w1 x1 + w2 x2 + b = x1 + x2 - 1.5

If x1 = x2 = 1, then v = 0.5, and y = 1
If x1 = 0 and x2 = 1, then v = -0.5, and y = 0
If x1 = 1 and x2 = 0, then v = -0.5, and y = 0
If x1 = x2 = 0, then v = -1.5, and y = 0

[Figure 1: Problem 1.3 — perceptron with inputs x1, x2, weights w1 = 1, w2 = 1, bias b = -1.5, and a hard limiter producing output y]


These conditions agree with truth table 1.

OR operation: Truth Table 2

x1  x2 | y
 1   1 | 1
 0   1 | 1
 1   0 | 1
 0   0 | 0

The OR operation may be realized using the perceptron of Fig. 2.

In this case, the hard limiter input is

v = x1 + x2 - 0.5

If x1 = x2 = 1, then v = 1.5, and y = 1
If x1 = 0 and x2 = 1, then v = 0.5, and y = 1
If x1 = 1 and x2 = 0, then v = 0.5, and y = 1
If x1 = x2 = 0, then v = -0.5, and y = 0

These conditions agree with truth table 2.

[Figure 2: Problem 1.3 — perceptron with inputs x1, x2, weights w1 = 1, w2 = 1, bias b = -0.5, and a hard limiter producing output y]


COMPLEMENT operation: Truth Table 3

x | y
1 | 0
0 | 1

The COMPLEMENT operation may be realized as in Figure 3.

[Figure 3: Problem 1.3 — single-input perceptron with weight w = -1, bias b = 0.5, and a hard limiter producing output y]

The hard limiter input is

v = wx + b = -x + 0.5

If x = 1, then v = -0.5, and y = 0
If x = 0, then v = 0.5, and y = 1

These conditions agree with truth table 3.

(b) EXCLUSIVE OR operation: Truth Table 4

x1  x2 | y
 1   1 | 0
 0   1 | 1
 1   0 | 1
 0   0 | 0

This operation is not linearly separable, and therefore cannot be realized by a single perceptron.
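The three linearly separable gates above can be checked numerically. A minimal sketch, with the weights and biases taken from Figures 1-3 (the helper name is ours):

```python
import numpy as np

def hard_limiter_gate(w, b):
    """Return a gate function y = 1 if w.x + b > 0, else 0."""
    return lambda x: int(np.dot(w, x) + b > 0)

AND = hard_limiter_gate(np.array([1.0, 1.0]), -1.5)
OR = hard_limiter_gate(np.array([1.0, 1.0]), -0.5)
NOT = hard_limiter_gate(np.array([-1.0]), 0.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(np.array(x)), OR(np.array(x)))  # reproduces Truth Tables 1 and 2
print("NOT:", NOT(np.array([1])), NOT(np.array([0])))  # reproduces Truth Table 3
```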

Problem 1.4

The Gaussian classifier consists of a single unit with a single weight and zero bias, determined in accordance with Eqs. (1.37) and (1.38) of the textbook, respectively, as follows:

w = (1/σ²)(μ1 - μ2) = -20

b = (1/(2σ²))(μ2² - μ1²) = 0


Problem 1.5

Using the condition

C = σ²I

in Eqs. (1.37) and (1.38) of the textbook, we get the following formulas for the weight vector and bias of the Bayes classifier:

w = (1/σ²)(μ1 - μ2)

b = (1/(2σ²))(||μ2||² - ||μ1||²)
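A minimal numerical sketch of these formulas for two Gaussian classes with common covariance C = σ²I; the means, σ², and test point below are made-up illustration values:

```python
import numpy as np

def bayes_linear_classifier(mu1, mu2, sigma2):
    """Weights and bias of the Bayes classifier for C = sigma^2 * I."""
    w = (mu1 - mu2) / sigma2
    b = (mu2 @ mu2 - mu1 @ mu1) / (2.0 * sigma2)
    return w, b

mu1 = np.array([0.0, 0.0])          # hypothetical class-1 mean
mu2 = np.array([2.0, 2.0])          # hypothetical class-2 mean
w, b = bayes_linear_classifier(mu1, mu2, sigma2=1.0)
x = np.array([0.5, 0.2])
print("decide C1" if w @ x + b > 0 else "decide C2")
```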


CHAPTER 4
Multilayer Perceptrons

Problem 4.1

Assume that each neuron is represented by a McCulloch-Pitts model. Also assume that

xi = 1 if the input bit is 1, and xi = 0 if the input bit is 0.

The induced local field of neuron 1 is

v1 = x1 + x2 - 1.5

We may thus construct the following table:

x1:   0    0    1    1
x2:   0    1    0    1
v1:  -1.5 -0.5 -0.5  0.5
y1:   0    0    0    1

The induced local field of neuron 2 is

v2 = x1 + x2 - 2y1 - 0.5

Accordingly, we may construct the following table:

x1:   0    0    1    1
x2:   0    1    0    1
y1:   0    0    0    1
v2:  -0.5  0.5  0.5 -0.5
y2:   0    1    1    0

[Figure 4: Problem 4.1 — network of Fig. P4.1: inputs x1, x2 feed neuron 1 (weights +1, +1, bias -1.5) and neuron 2 (weights +1, +1 from the inputs, -2 from neuron 1, bias -0.5); the overall output is y2]


From this table we observe that the overall output y2 is 0 if x1 and x2 are both 0 or both 1, and it is 1 if x1 is 0 and x2 is 1, or vice versa. In other words, the network of Fig. P4.1 operates as an EXCLUSIVE OR gate.
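A short sketch verifying the two tables above (the function name is ours):

```python
def mcculloch_pitts(v):
    """Hard threshold of a McCulloch-Pitts neuron."""
    return 1 if v > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        y1 = mcculloch_pitts(x1 + x2 - 1.5)           # neuron 1
        y2 = mcculloch_pitts(x1 + x2 - 2 * y1 - 0.5)  # neuron 2
        print(x1, x2, "->", y2)   # prints the XOR truth table
```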

Problem 4.2

Figure 1 shows the evolution of the free parameters (synaptic weights and biases) of the neural network as the back-propagation learning process progresses. Each epoch corresponds to 100 iterations. From the figure, we see that the network reaches a steady state after about 25 epochs. Each neuron uses a logistic function for its sigmoid nonlinearity. Also, the desired response is defined as

d = 0.9 for symbol (bit) 1
d = 0.1 for symbol (bit) 0

Figure 2 shows the final form of the neural network. Note that we have used biases (the negative of thresholds) for the individual neurons.

Figure 1: Problem 4.2, where one epoch = 100 iterations

[Figure 2: Problem 4.2 — final network: hidden neuron 1 (w11 = -4.72, w12 = -4.24, b1 = 1.6), hidden neuron 2 (w21 = -3.51, w22 = -3.52, b2 = 5.0), output neuron 3 (w31 = -6.80, w32 = 6.44, b3 = -2.85)]


Problem 4.3

If the momentum constant α is negative, Equation (4.43) of the text becomes

Now we find that if the derivative has the same algebraic sign on consecutive iterations

of the algorithm, the magnitude of the exponentially weighted sum is reduced. The opposite istrue when alternates its algebraic sign on consecutive iterations. Thus, the effect of the

momentum constant α is the same as before, except that the effects are reversed, compared to thecase when α is positive.

Problem 4.4

From Eq. (4.43) of the text we have

(1)

For the case of a single weight, the cost function is defined by



Hence, the application of (1) to this case yields

Δw(n) = -2k1η Σ_{t=1}^{n} α^(n-t) (w(t) - w0)

In this case, the partial derivative ∂E(t)/∂w(t) has the same algebraic sign on consecutive iterations. Hence, with 0 < α < 1 the exponentially weighted adjustment Δw(n) to the weight w at time n grows in magnitude. That is, the weight w is adjusted by a large amount. The inclusion of the momentum constant α in the algorithm for computing the optimum weight w* = w0 tends to accelerate the downhill descent toward this optimum point.
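A minimal sketch of gradient descent with momentum on the quadratic cost E = k1(w - w0)² + k2; the constants and step counts are illustrative:

```python
def descend(eta=0.05, alpha=0.7, k1=1.0, w0=2.0, steps=50):
    """Gradient descent with momentum on E = k1*(w - w0)**2 + k2."""
    w, dw = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * k1 * (w - w0)     # dE/dw keeps the same sign until w crosses w0
        dw = alpha * dw - eta * grad   # momentum accumulates the adjustment
        w += dw
    return w

# Momentum accelerates the approach to the optimum w* = w0 = 2.
print(descend(alpha=0.0), descend(alpha=0.7))
```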

Problem 4.5

Consider Fig. 4.14 of the text, which has an input layer, two hidden layers, and a single output neuron. We note the following:

y1⁽³⁾ = F(A1⁽³⁾) = F(w, x)

Hence, the derivative of F(A1⁽³⁾) with respect to the synaptic weight w1k⁽³⁾ connecting neuron k in the second hidden layer to the single output neuron is

∂F(A1⁽³⁾)/∂w1k⁽³⁾ = [∂F(A1⁽³⁾)/∂y1⁽³⁾][∂y1⁽³⁾/∂v1⁽³⁾][∂v1⁽³⁾/∂w1k⁽³⁾]     (1)

where v1⁽³⁾ is the activation potential of the output neuron. Next, we note that

∂F(A1⁽³⁾)/∂y1⁽³⁾ = 1,   y1⁽³⁾ = φ(v1⁽³⁾),   v1⁽³⁾ = Σk w1k⁽³⁾ yk⁽²⁾     (2)

where yk⁽²⁾ is the output of neuron k in layer 2. We may thus proceed further and write

∂y1⁽³⁾/∂v1⁽³⁾ = φ′(v1⁽³⁾) = φ′(A1⁽³⁾)     (3)


∂v1⁽³⁾/∂w1k⁽³⁾ = yk⁽²⁾ = φ(Ak⁽²⁾)     (4)

Thus, combining (1) to (4):

∂F(w, x)/∂w1k⁽³⁾ = ∂F(A1⁽³⁾)/∂w1k⁽³⁾ = φ′(A1⁽³⁾) φ(Ak⁽²⁾)

Consider next the derivative of F(w, x) with respect to wkj⁽²⁾, the synaptic weight connecting neuron j in layer 1 (i.e., first hidden layer) to neuron k in layer 2 (i.e., second hidden layer):

∂F(w, x)/∂wkj⁽²⁾ = [∂F(w, x)/∂y1⁽³⁾][∂y1⁽³⁾/∂v1⁽³⁾][∂v1⁽³⁾/∂yk⁽²⁾][∂yk⁽²⁾/∂vk⁽²⁾][∂vk⁽²⁾/∂wkj⁽²⁾]     (5)

where yk⁽²⁾ is the output of neuron k in layer 2, and vk⁽²⁾ is the activation potential of that neuron. Next we note that

∂F(w, x)/∂y1⁽³⁾ = 1     (6)

∂y1⁽³⁾/∂v1⁽³⁾ = φ′(A1⁽³⁾)     (7)

∂v1⁽³⁾/∂yk⁽²⁾ = w1k⁽³⁾,   since v1⁽³⁾ = Σk w1k⁽³⁾ yk⁽²⁾     (8)

∂yk⁽²⁾/∂vk⁽²⁾ = φ′(vk⁽²⁾) = φ′(Ak⁽²⁾),   since yk⁽²⁾ = φ(vk⁽²⁾)     (9)


∂vk⁽²⁾/∂wkj⁽²⁾ = yj⁽¹⁾ = φ(vj⁽¹⁾) = φ(Aj⁽¹⁾),   since vk⁽²⁾ = Σj wkj⁽²⁾ yj⁽¹⁾     (10)

Substituting (6) to (10) into (5), we get

∂F(w, x)/∂wkj⁽²⁾ = φ′(A1⁽³⁾) w1k⁽³⁾ φ′(Ak⁽²⁾) φ(Aj⁽¹⁾)

Finally, we consider the derivative of F(w, x) with respect to wji⁽¹⁾, the synaptic weight connecting source node i in the input layer to neuron j in layer 1. We may thus write

∂F(w, x)/∂wji⁽¹⁾ = [∂F(w, x)/∂y1⁽³⁾][∂y1⁽³⁾/∂v1⁽³⁾][∂v1⁽³⁾/∂yj⁽¹⁾][∂yj⁽¹⁾/∂vj⁽¹⁾][∂vj⁽¹⁾/∂wji⁽¹⁾]     (11)

where yj⁽¹⁾ is the output of neuron j in layer 1, and vj⁽¹⁾ is the activation potential of that neuron. Next we note that

∂F(w, x)/∂y1⁽³⁾ = 1     (12)

∂y1⁽³⁾/∂v1⁽³⁾ = φ′(A1⁽³⁾)     (13)

∂v1⁽³⁾/∂yj⁽¹⁾ = Σk w1k⁽³⁾ ∂yk⁽²⁾/∂yj⁽¹⁾
             = Σk w1k⁽³⁾ [∂yk⁽²⁾/∂vk⁽²⁾][∂vk⁽²⁾/∂yj⁽¹⁾]
             = Σk w1k⁽³⁾ φ′(Ak⁽²⁾) ∂vk⁽²⁾/∂yj⁽¹⁾     (14)


∂vk⁽²⁾/∂yj⁽¹⁾ = wkj⁽²⁾     (15)

∂yj⁽¹⁾/∂vj⁽¹⁾ = φ′(vj⁽¹⁾) = φ′(Aj⁽¹⁾),   since yj⁽¹⁾ = φ(vj⁽¹⁾)     (16)

∂vj⁽¹⁾/∂wji⁽¹⁾ = xi,   since vj⁽¹⁾ = Σi wji⁽¹⁾ xi     (17)

Substituting (12) to (17) into (11) yields

∂F(w, x)/∂wji⁽¹⁾ = φ′(A1⁽³⁾) [Σk w1k⁽³⁾ φ′(Ak⁽²⁾) wkj⁽²⁾] φ′(Aj⁽¹⁾) xi

Problem 4.12

According to the conjugate-gradient method, we have

Δw(n) = η(n)p(n)
      = η(n)[-g(n) + β(n-1)p(n-1)]
      ≈ -η(n)g(n) + β(n-1)η(n-1)p(n-1)     (1)

where, in the second term of the last line in (1), we have used η(n - 1) in place of η(n). Define

Δw(n - 1) = η(n-1)p(n-1)

We may then rewrite (1) as

Δw(n) ≈ -η(n)g(n) + β(n-1)Δw(n-1)     (2)

On the other hand, according to the generalized delta rule, we have for neuron j:

Δwj(n) = αΔwj(n-1) + ηδj(n)y(n)     (3)

Comparing (2) and (3), we observe that they have a similar mathematical form:



• The vector -g(n) in the conjugate-gradient method plays the role of δj(n)y(n), where δj(n) is the local gradient of neuron j and y(n) is the vector of inputs for neuron j.

• The time-varying parameter β(n - 1) in the conjugate-gradient method plays the role of the momentum α in the generalized delta rule.

Problem 4.13

We start with (4.127) in the text:

β(n) = -sᵀ(n-1)Ar(n) / sᵀ(n-1)As(n-1)     (1)

The residual r(n) is governed by the recursion:

r(n) = r(n-1) - η(n-1)As(n-1)

Equivalently, we may write

-η(n-1)As(n-1) = r(n) - r(n-1)     (2)

Hence, multiplying both sides of (2) by sᵀ(n-1), we obtain

-η(n-1)sᵀ(n-1)As(n-1) = sᵀ(n-1)[r(n) - r(n-1)] = -sᵀ(n-1)r(n-1)     (3)

where it is noted that (by definition)

sᵀ(n-1)r(n) = 0

Moreover, multiplying both sides of (2) by rᵀ(n), we obtain

-η(n-1)sᵀ(n-1)Ar(n) = -η(n-1)rᵀ(n)As(n-1) = rᵀ(n)[r(n) - r(n-1)]     (4)

where it is noted that Aᵀ = A. Dividing (4) by (3) and invoking the use of (1):

β(n) = rᵀ(n)[r(n) - r(n-1)] / sᵀ(n-1)r(n-1)     (5)

which is the Hestenes-Stiefel formula.


In the linear form of the conjugate-gradient method, we have

sᵀ(n-1)r(n-1) = rᵀ(n-1)r(n-1)

in which case (5) is modified to

β(n) = rᵀ(n)[r(n) - r(n-1)] / rᵀ(n-1)r(n-1)     (6)

which is the Polak-Ribière formula. Moreover, in the linear case we have

rᵀ(n)r(n-1) = 0

in which case (6) reduces to the Fletcher-Reeves formula:

β(n) = rᵀ(n)r(n) / rᵀ(n-1)r(n-1)
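The three β formulas are easy to compare numerically. A small sketch (the vector values are arbitrary illustrations):

```python
import numpy as np

def betas(r_new, r_old, s_old):
    """Hestenes-Stiefel, Polak-Ribiere, and Fletcher-Reeves choices of beta."""
    dr = r_new - r_old
    hs = (r_new @ dr) / (s_old @ r_old)
    pr = (r_new @ dr) / (r_old @ r_old)
    fr = (r_new @ r_new) / (r_old @ r_old)
    return hs, pr, fr

r_old = np.array([1.0, -0.5, 0.2])
r_new = np.array([0.3, 0.1, -0.1])
s_old = r_old.copy()   # when s(n-1) = r(n-1), HS and PR coincide
print(betas(r_new, r_old, s_old))
```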

Problem 4.15

In this problem, we explore the operation of a fully connected multilayer perceptron trained with the back-propagation algorithm. The network has a single hidden layer. It is trained to realize the following one-to-one mappings:

(a) Inversion:
    f(x) = 1/x,   1 < x < 100

(b) Logarithmic computation:
    f(x) = log10 x,   1 < x < 10

(c) Exponentiation:
    f(x) = e^(-x),   1 < x < 10

(d) Sinusoidal computation:
    f(x) = sin x,   0 ≤ x ≤ π/2

(a) f(x) = 1/x for 1 < x < 100. The network is trained with:



learning-rate parameter η = 0.3, and
momentum constant α = 0.7.

Ten different network configurations were trained to learn this mapping. Each network was trained identically, that is, with the same η and α, with bias terms, and with 10,000 passes of the training vectors (with one exception noted below). Once each network was trained, the test dataset was applied to compare the performance and accuracy of each configuration. Table 1 summarizes the results obtained:

Table 1
Number of hidden neurons:            3     4     5     7     10    15    20    30    100   30 (100,000 passes)
Average percentage error at output:  4.73  4.43  3.59  1.49  1.12  0.93  0.85  0.94  0.90  0.19

The results of Table 1 indicate that even with a small number of hidden neurons, and with a relatively small number of training passes, the network is able to learn the mapping described in (a) quite well.

(b) f(x) = log10 x for 1 < x < 10. The results of this second experiment are presented in Table 2:

Table 2
Number of hidden neurons:            2     3     4     5     7     10    15    20    30    100   30 (100,000 passes)
Average percentage error at output:  2.55  2.09  0.46  0.48  0.85  0.42  0.85  0.96  1.26  1.18  0.41

Here again, we see that the network performs well even with a small number of hidden neurons. Interestingly, in this second experiment the network peaked in accuracy with 10 hidden neurons, after which the accuracy of the network to produce the correct output started to decrease.

(c) f(x) = e^(-x) for 1 < x < 10. The results of this third experiment (using the logistic function, as with experiments (a)


and (b)), are summarized in Table 3:

Table 3
Number of hidden neurons:            2      3      4      5      7      10     15     20     30     100    30 (100,000 passes)
Average percentage error at output:  244.0  185.17 134.85 133.67 141.65 158.77 151.91 144.79 137.35 98.09  103.99

These results are unacceptable, since the network is unable to generalize when each neuron is driven to its limits.

The experiment with 30 hidden neurons and 100,000 training passes was repeated, but this time the hyperbolic tangent function was used as the nonlinearity. The result obtained this time was an average percentage error of 3.87% at the network output. This last result shows that the hyperbolic tangent function is a better choice than the logistic function as the sigmoid function for realizing the mapping f(x) = e^(-x).

(d) f(x) = sin x for 0 ≤ x ≤ π/2. Finally, the following results were obtained using the logistic function with 10,000 training passes, except for the last configuration:

Table 4
Number of hidden neurons:            2     3     4     5     7     10    15    20    30    100   30 (100,000 passes)
Average percentage error at output:  1.63  1.25  1.18  1.11  1.07  1.01  1.01  0.72  1.21  3.19  0.4

The results of Table 4 show that the accuracy of the network peaks around 20 neurons, after which the accuracy decreases.
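For readers who want to reproduce experiments of this kind, here is a minimal sketch of a one-hidden-layer MLP trained by back-propagation with momentum to approximate f(x) = 1/x; the input scaling, initialization, and training length are our illustrative choices, not the exact settings used above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=(500, 1))
d = 1.0 / x                                    # target mapping f(x) = 1/x
xs, ds = x / 100.0, d                          # scale inputs for the sigmoid
H, eta, alpha = 10, 0.3, 0.7                   # 10 hidden neurons, as in Table 1

W1, b1 = rng.normal(0, 1, (H, 1)), np.zeros((H, 1))
W2, b2 = rng.normal(0, 1, (1, H)), np.zeros((1, 1))
vW1 = vW2 = 0.0
sig = lambda v: 1.0 / (1.0 + np.exp(-v))

for _ in range(10000):
    y1 = sig(W1 @ xs.T + b1)                   # hidden layer (logistic)
    y2 = W2 @ y1 + b2                          # linear output neuron
    e = ds.T - y2                              # error signal
    gW2 = -e @ y1.T / len(xs)                  # output-layer gradient
    g1 = (W2.T @ -e) * y1 * (1 - y1)           # back-propagated local gradients
    gW1 = g1 @ xs / len(xs)
    vW2 = alpha * vW2 - eta * gW2; W2 += vW2   # momentum updates
    vW1 = alpha * vW1 - eta * gW1; W1 += vW1
    b2 -= eta * (-e).mean(axis=1, keepdims=True)
    b1 -= eta * g1.mean(axis=1, keepdims=True)

print(np.mean(np.abs(e / ds.T)) * 100, "% mean error")
```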


CHAPTER 5
Kernel Methods and Radial-Basis Function Networks

Problem 5.9

The expected square error is given by

J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^m0} [f(xi) - F(xi, ξ)]² f_ξ(ξ) dξ

where f_ξ(ξ) is the probability density function of a noise distribution in the input space R^m0. It is reasonable to assume that the noise vector ξ is additive to the input data vector x. Hence, we may define the cost function J(F) as

J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^m0} [f(xi) - F(xi + ξ)]² f_ξ(ξ) dξ     (1)

where (for convenience of presentation) we have interchanged the order of summation and integration, which is permissible because both operations are linear. Let

z = xi + ξ,   or   ξ = z - xi

Hence, we may rewrite (1) in the equivalent form:

J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^m0} [f(xi) - F(z)]² f_ξ(z - xi) dz     (2)

Note that the subscript ξ in f_ξ(.) merely refers to the "name" of the noise distribution and is therefore untouched by the change of variables. Differentiating (2) with respect to F, setting the result equal to zero, and finally solving for F(z), we get the optimal estimator

F̂(z) = Σ_{i=1}^{N} f(xi) f_ξ(z - xi) / Σ_{i=1}^{N} f_ξ(z - xi)

This result bears a close resemblance to the Watson-Nadaraya estimator.

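A minimal sketch of the resulting estimator with a Gaussian choice for f_ξ; the kernel width and sample data are illustrative:

```python
import numpy as np

def nadaraya_watson(z, x_train, f_train, sigma=0.5):
    """F(z) = sum_i f(x_i) k(z - x_i) / sum_i k(z - x_i), Gaussian kernel k."""
    w = np.exp(-0.5 * ((z[:, None] - x_train[None, :]) / sigma) ** 2)
    return (w * f_train[None, :]).sum(axis=1) / w.sum(axis=1)

x_train = np.linspace(0, 2 * np.pi, 40)
f_train = np.sin(x_train)                    # samples of a target function
z = np.linspace(0, 2 * np.pi, 5)
print(nadaraya_watson(z, x_train, f_train))  # smooth estimate of sin(z)
```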


CHAPTER 6
Support Vector Machines

Problem 6.1

From Eqs. (6.2) in the text we recall that the optimum weight vector wo and optimum bias bo satisfy the following pair of conditions:

woᵀxi + bo ≥ +1 for di = +1
woᵀxi + bo ≤ -1 for di = -1

where i = 1, 2, ..., N. Equivalently, we may write

min_{i=1,2,...,N} |woᵀxi + bo| = 1

as the defining condition for the pair (wo, bo).

Problem 6.2

In the context of a support vector machine, we note the following:

1. Misclassification of patterns can only arise if the patterns are nonseparable.
2. If the patterns are nonseparable, it is possible for a pattern to lie inside the margin of separation and yet be on the correct side of the decision boundary. Hence, nonseparability does not necessarily mean misclassification.

Problem 6.3

We start with the primal problem formulated as follows (see Eq. (6.15) of the text):

J(w, b, α) = (1/2)wᵀw - Σ_{i=1}^{N} αi di wᵀxi - b Σ_{i=1}^{N} αi di + Σ_{i=1}^{N} αi     (1)

Recall from (6.12) in the text that

w = Σ_{i=1}^{N} αi di xi

Premultiplying w by wᵀ:



wᵀw = Σ_{i=1}^{N} αi di wᵀxi     (2)

We may also write

wᵀ = Σ_{i=1}^{N} αi di xiᵀ

Accordingly, we may redefine the inner product wᵀw as the double summation:

wᵀw = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjᵀxi     (3)

Thus, substituting (2) and (3) into (1) yields

Q(α) = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjᵀxi + Σ_{i=1}^{N} αi     (4)

subject to the constraint

Σ_{i=1}^{N} αi di = 0

Recognizing that αi ≥ 0 for all i, we see that (4) is the formulation of the dual problem.
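A small sketch evaluating the dual objective in matrix form, Q(α) = Σ αi - (1/2) αᵀGα with Gij = di dj xiᵀxj; the toy data and multipliers are illustrative:

```python
import numpy as np

def dual_objective(alpha, X, d):
    """Q(alpha) for the linear-kernel SVM dual."""
    G = (d[:, None] * d[None, :]) * (X @ X.T)   # d_i d_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.1, 0.0, 0.1, 0.0])          # satisfies sum(alpha * d) = 0
print(dual_objective(alpha, X, d))
```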

Problem 6.4

Consider a support vector machine designed for nonseparable patterns. Assuming the use of the "leave-one-out method" for training the machine, the following situations may arise when the example left out is used as a test example:

1. The example is a support vector.
   Result: correct classification.
2. The example lies inside the margin of separation but on the correct side of the decision boundary.
   Result: correct classification.
3. The example lies inside the margin of separation but on the wrong side of the decision boundary.
   Result: incorrect classification.


Problem 6.5

By definition, a support vector machine is designed to maximize the margin of separation between the examples drawn from different classes. This definition applies to all sources of data, be they noisy or otherwise. It follows therefore that, by its very nature, the support vector machine is robust to the presence of additive noise in the data used for training and testing, provided that all the data are drawn from the same population.

Problem 6.6

Since the Gram matrix K = {K(xi, xj)} is a square matrix, it can be diagonalized using the similarity transformation

K = QΛQᵀ

where Λ is a diagonal matrix consisting of the eigenvalues of K and Q is an orthogonal matrix whose columns are the associated eigenvectors. With K being a positive matrix, Λ has nonnegative entries. The inner-product (i.e., Mercer) kernel k(xi, xj) is the ijth element of matrix K. Hence,

k(xi, xj) = (QΛQᵀ)ij = Σ_{l=1}^{m1} (Q)il (Λ)ll (Qᵀ)lj = Σ_{l=1}^{m1} (Q)il (Λ)ll (Q)jl     (1)

Let ui denote the ith row of matrix Q. (Note that ui is not an eigenvector.) We may then rewrite (1) as the inner product

k(xi, xj) = uiᵀΛuj = (Λ^(1/2)ui)ᵀ(Λ^(1/2)uj)     (2)

where Λ^(1/2) is the square root of Λ. By definition, we have

k(xi, xj) = φᵀ(xi)φ(xj)     (3)



Comparing (2) and (3), we deduce that the mapping from the input space to the hidden (feature) space of a support vector machine is described by

φ: xi → Λ^(1/2)ui

Problem 6.7

(a) From the solution to Problem 6.6, we have

φ: xi → Λ^(1/2)ui

Suppose the input vector xi is multiplied by the orthogonal (unitary) matrix Q. We then have a new mapping φ′ described by

φ′: Qxi → QΛ^(1/2)ui

Correspondingly, we may write

k(Qxi, Qxj) = (QΛ^(1/2)ui)ᵀ(QΛ^(1/2)uj) = (Λ^(1/2)ui)ᵀQᵀQ(Λ^(1/2)uj)     (1)

where ui is the ith row of Q. From the definition of an orthogonal (unitary) matrix:

Q⁻¹ = Qᵀ

or equivalently

QᵀQ = I

where I is the identity matrix. Hence, (1) reduces to

k(Qxi, Qxj) = (Λ^(1/2)ui)ᵀ(Λ^(1/2)uj) = k(xi, xj)

In words, the Mercer kernel exhibits the unitary invariance property.


(b) Consider first the polynomial machine described by

k(Qxi, Qxj) = [(Qxi)ᵀ(Qxj) + 1]^p
            = (xiᵀQᵀQxj + 1)^p
            = (xiᵀxj + 1)^p
            = k(xi, xj)

Consider next the RBF network described by the Mercer kernel:

k(Qxi, Qxj) = exp[-(1/(2σ²)) ||Qxi - Qxj||²]
            = exp[-(1/(2σ²)) (Qxi - Qxj)ᵀ(Qxi - Qxj)]
            = exp[-(1/(2σ²)) (xi - xj)ᵀQᵀQ(xi - xj)]
            = exp[-(1/(2σ²)) (xi - xj)ᵀ(xi - xj)],   since QᵀQ = I
            = k(xi, xj)

Finally, consider the multilayer perceptron described by

k(Qxi, Qxj) = tanh[β0 (Qxi)ᵀ(Qxj) + β1]
            = tanh(β0 xiᵀQᵀQxj + β1)
            = tanh(β0 xiᵀxj + β1)
            = k(xi, xj)

Thus all three types of the support vector machine, namely, the polynomial machine, RBF network, and MLP, satisfy the unitary invariance property in their own individual ways.
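A quick numerical check of this invariance for the three kernels; the random orthogonal Q, kernel parameters, and test vectors are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
xi, xj = rng.normal(size=3), rng.normal(size=3)

poly = lambda a, b: (a @ b + 1.0) ** 3
rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
mlp = lambda a, b: np.tanh(0.5 * (a @ b) + 0.1)

for k in (poly, rbf, mlp):
    print(np.isclose(k(Q @ xi, Q @ xj), k(xi, xj)))  # True for all three
```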


Problem 6.17

The truth table for the XOR function, operating on a three-dimensional pattern x, is as follows:

Table 1
x1  x2  x3 | desired response y
+1  +1  +1 | +1
+1  -1  +1 | -1
-1  +1  +1 | -1
+1  +1  -1 | -1
+1  -1  -1 | +1
-1  +1  -1 | +1
-1  -1  -1 | -1
-1  -1  +1 | +1

To proceed with the support vector machine for solving this multidimensional XOR problem, let the Mercer kernel be

k(x, xi) = (1 + xᵀxi)^p

The minimum value of the power p (denoting a positive integer) needed for this problem is p = 3. For p = 2, we end up with a zero weight vector, which is clearly unacceptable.

Setting p = 3, we thus have

k(x, xi) = (1 + xᵀxi)³ = 1 + 3xᵀxi + 3(xᵀxi)² + (xᵀxi)³

where x = [x1, x2, x3]ᵀ and likewise for xi. Then, proceeding in a manner similar to, but much more cumbersome than, that described for the two-dimensional XOR problem in Section 6.6, we end up with a polynomial machine defined by

y = x1 x2 x3

This machine satisfies the entries of Table 1.
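The final machine is easy to check against Table 1:

```python
from itertools import product

for x1, x2, x3 in product((+1, -1), repeat=3):
    y = x1 * x2 * x3            # output of the polynomial machine
    print(x1, x2, x3, "->", y)  # reproduces the desired responses in Table 1
```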


CHAPTER 8
Principal-Components Analysis

Problem 8.5

From Example 8.2 in the text:

λ0 = 1 + σ²     (1)

q0 = s     (2)

The correlation matrix of the input is

R = ssᵀ + σ²I     (3)

where s is the signal vector and σ² is the variance of an element of the additive noise vector. Hence, using (2) and (3):

λ0 = q0ᵀRq0 / q0ᵀq0
   = sᵀ(ssᵀ + σ²I)s / sᵀs
   = [(sᵀs)(sᵀs) + σ²(sᵀs)] / sᵀs
   = sᵀs + σ²
   = ||s||² + σ²     (4)

The vector s is a signal vector of unit length: ||s|| = 1. Hence, (4) simplifies to

λ0 = 1 + σ²

which is the desired result given in (1).



Problem 8.6

From (8.46) in the text we have

w(n + 1) = w(n) + ηy(n)[x(n) - y(n)w(n)]     (1)

As n → ∞, w(n) → q1, and so we deduce from (1) that

x(n) = y(n)q1 for n → ∞     (2)

where q1 is the eigenvector associated with the largest eigenvalue λ1 of the correlation matrix R = E[x(n)xᵀ(n)], where E is the expectation operator. Multiplying (2) by its own transpose and then taking expectations, we get

E[x(n)xᵀ(n)] = E[y²(n)] q1q1ᵀ

Equivalently, we may write

R = σY² q1q1ᵀ     (3)

where σY² is the variance of the output y(n). Post-multiplying (3) by q1:

Rq1 = σY² q1q1ᵀq1 = σY² q1     (4)

where it is noted that q1ᵀq1 = 1 by definition. From (4) we readily see that σY² = λ1, which is the desired result.

Problem 8.7

Writing the learning algorithm for minor-components analysis in matrix form:

w(n + 1) = w(n) - ηy(n)[x(n) - y(n)w(n)]

Proceeding in a manner similar to that described in Section 8.5 of the textbook, we have the nonlinear differential equation:

(d/dt)w(t) = [wᵀ(t)Rw(t)]w(t) - Rw(t)

Define


w(t) = Σ_{k=1}^{M} θk(t)qk     (1)

where qk is the kth eigenvector of the correlation matrix R = E[x(n)xᵀ(n)] and the coefficient θk(t) is the projection of w(t) onto qk. We may then identify two cases, as summarized here:

Case I: 1 ≤ k < m

For this first case, we define

αk(t) = θk(t)/θm(t) for some fixed m     (2)

Accordingly, we find that

dαk(t)/dt = (λm - λk)αk(t)     (3)

With the eigenvalues of R arranged in decreasing order:

λ1 > λ2 > ... > λk > ... > λm > 0

it follows that αk(t) → 0 as t → ∞.

Case II: k = m

For this second case, we find that

dθm(t)/dt = λm θm(t)(θm²(t) - 1) for t → ∞     (4)

Hence, θm(t) → ±1 as t → ∞.

Thus, in light of the results derived for cases I and II, we deduce from (1) that:

w(t) → qm, the eigenvector associated with the smallest eigenvalue λm, as t → ∞, and

σY² = E[y²(n)] → λm.


Problem 8.8

From (8.87) and (8.88) of the text:

Δwj = ηyj x′ - ηyj² wj     (1)

x′ = x - Σ_{k=0}^{j-1} wk yk     (2)

where, for convenience of presentation, we have omitted the dependence on time n. Equations (1) and (2) may be represented by a vector-valued signal-flow graph.

[Signal-flow graph for Problem 8.8: the weighted terms -y0 w0, -y1 w1, ..., -y_{j-1} w_{j-1} are added to the input x to form x′; x′ - yj wj is then scaled by ηyj to produce Δwj]

Note: The dashed lines in the graph indicate inner (dot) products formed by the input vector x and the pertinent synaptic weight vectors w0, w1, ..., wj to produce y0, y1, ..., yj, respectively.

Problem 8.9

Consider a network consisting of a single layer of neurons with feedforward connections. The algorithm for adjusting the matrix of synaptic weights W(n) of the network is described by the recursive equation (see Eq. (8.91) of the text):



W(n + 1) = W(n) + η(n){y(n)xᵀ(n) - LT[y(n)yᵀ(n)]W(n)}     (1)

where x(n) is the input vector and y(n) is the output vector; LT[.] is a matrix operator that sets all the elements above the diagonal of its matrix argument to zero, thereby making it lower triangular.

First, we note that the asymptotic stability theorem discussed in the text does not apply directly to the convergence analysis of stochastic approximation algorithms involving matrices; it is formulated to apply to vectors. However, we may write the elements of the parameter (synaptic weight) matrix W(n) in (1) as a vector, that is, one column vector stacked on top of another. We may then interpret the resulting nonlinear update equation in a corresponding way, and so proceed to apply the asymptotic stability theorem directly.

To prove the convergence of the learning algorithm described in (1), we may use the method of induction to show that if the first j columns of matrix W(n) converge to the first j eigenvectors of the correlation matrix R = E[x(n)xᵀ(n)], then the (j + 1)th column will converge to the (j + 1)th eigenvector of R. Here we use the fact that, in light of the convergence of the maximum eigenfilter involving a single neuron, the first column of the matrix W(n) converges with probability 1 to the first eigenvector of R, and so on.
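A minimal sketch of this update rule (the generalized Hebbian algorithm) for extracting the leading principal components; the data dimensions, learning rate, and epoch count are illustrative:

```python
import numpy as np

def gha(X, n_components=2, eta=0.01, epochs=50, seed=0):
    """Train W with W <- W + eta*(y x^T - LT[y y^T] W), where y = W x."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (n_components, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
W = gha(X - X.mean(axis=0), n_components=2)
print(W)  # rows approximate the two leading eigenvectors of R
```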

Problem 8.10

The results of a computer experiment on the training of a single-layer feedforward network using the generalized Hebbian algorithm are described by Sanger (1990). The network has 16 output neurons and 4096 inputs arranged as a 64 x 64 grid of pixels. The training involved presentation of 2000 samples, which are produced by low-pass filtering a white Gaussian noise image and then multiplying with a Gaussian window function. The low-pass filter was a Gaussian function with a standard deviation of 2 pixels, and the window had a standard deviation of 8 pixels.

Figure 1, presented on the next page, shows the first 16 receptive field masks learned by the network (Sanger, 1990). In this figure, positive weights are indicated by "white" and negative weights are indicated by "black"; the ordering is left-to-right and top-to-bottom.

The results displayed in Fig. 1 are rationalized as follows (Sanger, 1990):

• The first mask is a low-pass filter, since the input has most of its energy near dc (zero frequency).

• The second mask cannot be a low-pass filter, so it must be a band-pass filter with a mid-band frequency as small as possible, since the input power decreases with increasing frequency.

• Continuing the analysis in the manner described above, the frequency response of successive masks approaches dc as closely as possible, subject (of course) to being orthogonal to previous masks.

The end result is a sequence of orthogonal masks that respond to progressively higher frequencies.



Figure 1: Problem 8.10 (Reproduced with permission of Biological Cybernetics)


CHAPTER 9
Self-Organizing Maps

Problem 9.1

Expanding the function g(yj) in a Taylor series around yj = 0, we get

g(yj) = g(0) + g⁽¹⁾(0) yj + (1/2!) g⁽²⁾(0) yj² + ...     (1)

where

g⁽ᵏ⁾(0) = ∂ᵏg(yj)/∂yjᵏ evaluated at yj = 0, for k = 1, 2, ...

Let

yj = 1 if neuron j is on, and yj = 0 if neuron j is off.

Then, we may rewrite (1) as

g(yj) = g(0) + g⁽¹⁾(0) + (1/2!) g⁽²⁾(0) + ...   if neuron j is on
g(yj) = g(0)                                    if neuron j is off

Correspondingly, we may write

dwj/dt = ηyj x - g(yj)wj
       = ηx - wj[g(0) + g⁽¹⁾(0) + (1/2!) g⁽²⁾(0) + ...]   if neuron j is on
       = -g(0)wj                                           if neuron j is off

Consequently, a nonzero g(0) has the effect of making dwj/dt assume a nonzero value when neuron j is off, which is undesirable. To alleviate this problem, we make g(0) = 0.


Problem 9.2

Assume that y(c) is a minimum L2 (least-squares) distortion vector quantizer for the code vector c. We may then form the distortion function

D2 = (1/2) ∫ f_c(c) ||c′(y(c)) - c||² dc

This distortion function is similar to that of Eq. (10.20) in the text, except for the use of c and c′ in place of x and x′, respectively. We wish to minimize D2 with respect to y(c) and c′(y).

Assuming that the noise model π(ν) is a smooth function of the noise vector ν, we may expand the decoder output x′ in a Taylor series. In particular, using a second-order approximation, we get (Luttrell, 1989b)

∫ π(ν) ||x′(c(x) + ν) - x||² dν ≈ [1 + (D2/2)∇²] ||x′(c) - x||²     (1)

where

∫ π(ν) dν = 1
∫ νi π(ν) dν = 0
∫ νi νj π(ν) dν = D2 δij

and δij is a Kronecker delta. We now make the following observations:

• The first term on the right-hand side of (1) is the conventional distortion term.
• The second term (i.e., the curvature term) arises due to the output noise model π(ν).

Problem 9.3

Consider the Peano curve shown in part (d) of Fig. 9.9 of the text. This particular self-organizing feature map pertains to a one-dimensional lattice fed with a two-dimensional input. We see that (counting from left to right) neuron 14, say, is quite close to neuron 97. It is therefore possible for a large enough input perturbation to make neuron 14 jump into the neighborhood of neuron 97, or vice versa. If this change were to happen, the topology-preserving property of the SOM algorithm would no longer hold.

For a more convincing demonstration, consider a higher-dimensional, namely three-dimensional, input structure mapped onto a two-dimensional lattice of 10-by-10 neurons. The



network is trained with an input consisting of 8 Gaussian clouds with unit variance but different centers. The centers are located at the points (0,0,0,...,0), (4,0,0,...,0), (4,4,0,...,0), (0,4,0,...,0), (0,0,4,...,0), (4,0,4,...,0), (4,4,4,...,0), and (0,4,4,...,0). The clouds occupy the 8 corners of a cube, as shown in Fig. 1a. The resulting labeled feature map computed by the SOM algorithm is shown in Fig. 1b. Although each of the classes is grouped together in the map, the planar feature map fails to capture the complete topology of the input space. In particular, we observe that class 6 is adjacent to class 2 in the input space, but is not adjacent to it in the feature map.

The conclusion to be drawn here is that although the SOM algorithm does perform clustering on the input space, it may not always completely preserve the topology of the input space.

Figure 1: Problem 9.3

Problem 9.4

Consider, for example, a two-dimensional lattice using the SOM algorithm to learn a two-dimensional input distribution, as illustrated in Fig. 9.8 of the textbook. Suppose that the neuron at the center of the lattice breaks down; this failure may have a dramatic effect on the evolution of the feature map. On the other hand, a small perturbation applied to the input space leaves the map learned by the lattice essentially unchanged.

Problem 9.5

The batch version of the SOM algorithm is defined by

wj = Σi πj,i xi / Σi πj,i for some prescribed neuron j     (1)

where πj,i is the discretized version of the pdf π(ν) of the noise vector ν. From Table 9.1 of the text we recall that πj,i plays a role analogous to that of the neighborhood function. Indeed, we can


substitute hj,i(x) for πj,i in (1). We are interested in rewriting (1) in a form that highlights the role of Voronoi cells. To this end, we note that the dependence of the neighborhood function hj,i(x), and therefore of πj,i, on the input pattern x is indirect, with the dependence being through the Voronoi cell in which x lies. Hence, for all input patterns that lie in a particular Voronoi cell, the same neighborhood function applies. Let each Voronoi cell be identified by an indicator function Ii,k interpreted as follows:

Ii,k = 1 if the input pattern xi lies in the Voronoi cell corresponding to winning neuron k.

In light of these considerations, we may rewrite (1) in the new form

wj = Σk Σi πj,k Ii,k xi / Σk Σi πj,k Ii,k     (2)

Now let mk denote the centroid of the Voronoi cell of neuron k and Nk denote the number of input patterns that lie in that cell. We may then simplify (2) as

wj = Σk πj,k Nk mk / Σk πj,k Nk = Σk Wj,k mk     (3)

where Wj,k is a weighting function defined by

Wj,k = πj,k Nk / Σk πj,k Nk     (4)

with

Σk Wj,k = 1 for all j

Equation (3) bears a close resemblance to the Watson-Nadaraya regression estimator defined in Eq. (5.61) of the textbook. Indeed, in light of this analogy, we may offer the following observations:

• The SOM algorithm is similar to nonparametric regression in a statistical sense.
• Except for the normalizing factor Nk, the discretized pdf πj,i, and therefore the neighborhood function hj,i, plays the role of a kernel in the Watson-Nadaraya estimator.



• The width of the neighborhood function plays the role of the span of the kernel.
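A minimal sketch of the batch update (3) for a one-dimensional lattice, with a Gaussian neighborhood standing in for πj,k; the lattice size, width, and data are illustrative:

```python
import numpy as np

def batch_som_step(W, X, sigma=1.0):
    """One batch update: w_j <- sum_k h_jk N_k m_k / sum_k h_jk N_k."""
    winners = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    J = W.shape[0]
    Nk = np.bincount(winners, minlength=J)        # patterns per Voronoi cell
    mk = np.zeros_like(W)
    for k in range(J):
        if Nk[k] > 0:
            mk[k] = X[winners == k].mean(axis=0)  # cell centroids
    j, k = np.meshgrid(np.arange(J), np.arange(J), indexing="ij")
    h = np.exp(-0.5 * ((j - k) / sigma) ** 2)     # neighborhood, role of pi_{j,k}
    num = (h * Nk[None, :]) @ mk
    den = (h * Nk[None, :]).sum(axis=1, keepdims=True)
    return num / np.maximum(den, 1e-12)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
W = rng.uniform(0, 1, size=(20, 2))               # 20-neuron chain
for _ in range(30):
    W = batch_som_step(W, X, sigma=2.0)
print(W[:5])
```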

Problem 9.6

In its basic form, Hebb's postulate of learning states that the adjustment Δwkj applied to the synaptic weight wkj is defined by

Δwkj = η yk xj

where yk is the output signal produced in response to the input signal xj.

The weight update for the maximum eigenfilter includes the term η yk xj and, additionally, a stabilizing term defined by -yk² wkj. The term η yk xj provides for synaptic amplification.

In contrast, in the SOM algorithm two modifications are made to Hebb's postulate of learning:

1. The stabilizing term -yk² wkj is set equal to -yk wkj.
2. The output yk of neuron k is set equal to a neighborhood function.

The net result of these two modifications is to make the weight update for the SOM algorithm assume a form similar to that in competitive learning rather than Hebbian learning.

Problem 9.7

In Fig. 1 (shown on the next page), we summarize the density-matching results of a computer simulation on a one-dimensional lattice consisting of 20 neurons. The network is trained with a triangular input density. Two sets of results are displayed in this figure:

1. The standard SOM (Kohonen) algorithm, shown as the solid line.
2. The conscience algorithm, shown as the dashed line; the line labeled "predict" is its straight-line approximation.

In Fig. 1, we have also included the exact result. Although it appears that both algorithms fail to match the input density exactly, we see that the conscience algorithm comes closer to the exact result than the standard SOM algorithm.


Figure 1: Problem 9.7

Problem 9.11

The results of a computer simulation for a one-dimensional lattice with a two-dimensional (triangular) input are shown in Fig. 1 on the next page for an increasing number of iterations. The experiment begins with random weights at zero time, and then the neurons start spreading out.

Two distinct phases in the learning process can be recognized from this figure:

1. The neurons become ordered (i.e., the one-dimensional lattice becomes untangled), which happens at about 20 iterations.

2. The neurons spread out to match the density of the input distribution, culminating in the steady-state condition attained after 25,000 iterations.

Figure 1: Problem 9.11


CHAPTER 10
Information-Theoretic Learning Models

Problem 10.1

The maximum-entropy distribution of the random variable X is a uniform distribution over the range [a, b], as shown by

fX(x) = 1/(b - a) for a ≤ x ≤ b, and 0 otherwise

Hence,

h(X) = -∫ fX(x) log fX(x) dx = -(1/(b - a)) log(1/(b - a)) ∫_a^b dx = log(b - a)

Problem 10.3

Let

Yi = aiᵀX1
Zi = biᵀX2

where the vectors X1 and X2 have multivariate Gaussian distributions. The correlation coefficient between Yi and Zi is defined by

ρi = E[YiZi] / (E[Yi²] E[Zi²])^(1/2)
   = aiᵀΣ12 bi / [(aiᵀΣ11 ai)(biᵀΣ22 bi)]^(1/2)     (1)



where

Σ11 = E[X1X1ᵀ]
Σ12 = E[X1X2ᵀ] = Σ21ᵀ
Σ22 = E[X2X2ᵀ]

The mutual information between Yi and Zi is defined by

I(Yi; Zi) = -(1/2) log(1 - ρi²)

Let r denote the rank of the cross-covariance matrix Σ12. Given the vectors X1 and X2, we may invoke the idea of canonical correlations as summarized here:

• Find the pair of random variables Y1 = a1ᵀX1 and Z1 = b1ᵀX2 that are most highly correlated.

• Extract the pair of random variables Y2 = a2ᵀX1 and Z2 = b2ᵀX2 in such a way that Y1 and Y2 are uncorrelated and so are Z1 and Z2.

• Continue these two steps until at most r pairs of variables {(Y1, Z1), (Y2, Z2), ..., (Yr, Zr)} have been extracted.

The essence of the canonical correlations described above is to encapsulate the dependence between the random vectors X1 and X2 in the sequence {(Yi, Zi)}, i = 1, ..., r. The uncorrelatedness of the pairs in this sequence, that is,

E[YiYj] = E[ZiZj] = 0 for all j ≠ i

means that the mutual information between the vectors X1 and X2 is the sum of the mutual information measures between the individual elements of the pairs (Yi, Zi). That is, we may write

I(X1; X2) = Σ_{i=1}^{r} I(Yi; Zi) + constant
          = -(1/2) Σ_{i=1}^{r} log(1 - ρi²) + constant

where ρi is defined by (1).



Problem 10.4

Consider a multilayer perceptron with a single hidden layer. Let wji denote the synaptic weight of hidden neuron j connected to source node i in the input layer. Let xi|α denote the ith component of the input vector x, given example α. Then the induced local field of neuron j is

vj|α = Σi wji xi|α     (1)

Correspondingly, the output of hidden neuron j for example α is given by

yj|α = φ(vj|α)     (2)

where φ(.) is the logistic function

φ(v) = 1/(1 + e^(-v))

Consider next the output layer of the network. Let wkj denote the synaptic weight of output neuron k connected to hidden neuron j. The induced local field of output neuron k is

vk|α = Σj wkj yj|α     (3)

The kth output of the network is therefore

yk|α = φ(vk|α)     (4)

The output yk|α is assigned a probabilistic interpretation by writing

pk|α = yk|α     (5)

Accordingly, we may view yk|α as an estimate of the conditional probability that the proposition k is true, given the example α at the input. On this basis, we may interpret

1 - yk|α = 1 - pk|α

as the estimate of the conditional probability that the proposition k is false, given the input example α. Correspondingly, let qk|α denote the actual (true) value of the conditional probability that the proposition k is true, given the input example α. This means that 1 - qk|α is the actual



value of the conditional probability that the proposition k is false, given the input example α. Thus, we may define the Kullback-Leibler divergence for the multilayer perceptron as

D(p||q) = Σα pα Σk [qk|α log(qk|α/pk|α) + (1 - qk|α) log((1 - qk|α)/(1 - pk|α))]

where pα is the a priori probability of occurrence of example α at the input.

To perform supervised training of the multilayer perceptron, we use gradient descent on D(p||q) in weight space. First, we use the chain rule to express the partial derivative of D(p||q) with respect to the synaptic weight wkj of output neuron k as follows:

∂D(p||q)/∂wkj = Σα [∂D(p||q)/∂pk|α][∂pk|α/∂yk|α][∂yk|α/∂vk|α][∂vk|α/∂wkj]
             = -Σα pα (qk|α - pk|α) yj|α     (6)

Next, we express the partial derivative of D(p||q) with respect to the synaptic weight wji of hidden neuron j by writing

∂D(p||q)/∂wji = -Σα pα Σk [qk|α/pk|α - (1 - qk|α)/(1 - pk|α)] ∂pk|α/∂wji     (7)

Via the chain rule, we write

∂pk|α/∂wji = [∂pk|α/∂yk|α][∂yk|α/∂vk|α][∂vk|α/∂yj|α][∂yj|α/∂vj|α][∂vj|α/∂wji]
           = φ′(vk|α) wkj φ′(vj|α) xi|α     (8)

But

φ′(vk|α) = yk|α(1 - yk|α) = pk|α(1 - pk|α)     (9)

Hence, using (8) and (9), we may simplify (7) as

∂D(p||q)/∂wji = -Σα pα xi|α φ′(Σi wji xi|α) Σk (pk|α - qk|α) wkj



where φ′(.) is the derivative of the logistic function φ(.) with respect to its argument.

Assuming the use of the learning-rate parameter η for all weight changes applied to the network, we may use the method of steepest descent to write the following two-step probabilistic algorithm:

1. For output neuron k, compute

Δwkj = -η ∂D(p||q)/∂wkj = η Σα pα (qk|α - pk|α) yj|α

2. For hidden neuron j, compute

Δwji = -η ∂D(p||q)/∂wji = η Σα pα xi|α φ′(Σi wji xi|α) Σk (pk|α - qk|α) wkj

Problem 10.9

We first note that the mutual information between the random variables X and Y is defined by

I(X; Y) = h(X) + h(Y) - h(X, Y)

To maximize the mutual information I(X; Y), we need to maximize the sum of the differential entropies h(X) and h(Y), and also minimize the joint differential entropy h(X, Y). From the definition of differential entropy, both h(X) and h(Y) attain their maximum value of 0.5 when X and Y occur with probability 1/2. Moreover, h(X, Y) is minimized when the joint probability of X and Y occupies the smallest possible region in the probability space.

Problem 10.10

The outputs Y1 and Y2 of the two neurons in Fig. P10.6 in the text are respectively defined by



where the w1i are the synaptic weights of output neuron 1, and the w2i are the synaptic weights of output neuron 2. The mutual information between the output vector Y = [Y1, Y2]ᵀ and the input vector X = [X1, X2, ..., Xm]ᵀ is

I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(N)     (1)

where h(Y) is the differential entropy of the output vector Y and h(N) is the differential entropy of the noise vector N = [N1, N2]ᵀ.

Since the noise terms N1 and N2 are Gaussian and uncorrelated, it follows that they are statistically independent. Hence,

h(N) = h(N1, N2) = h(N1) + h(N2) = 1 + log(2πσN²)     (2)

The differential entropy of the output vector Y is

h(Y) = h(Y1, Y2) = -∫∫ f(y1, y2) log f(y1, y2) dy1 dy2

where f(y1, y2) is the joint pdf of Y1 and Y2. Both Y1 and Y2 are dependent on the same set of input signals, and so they are correlated with each other. Let

R = E[YYᵀ] = [ r11  r12
              r21  r22 ]

where

rij = E[YiYj], i, j = 1, 2

The individual elements of the correlation matrix R are given by

r11 = σ1² + σN²
r12 = r21 = σ1σ2ρ12
r22 = σ2² + σN²


where σ1² and σ2² are the respective variances of Y1 and Y2 in the absence of noise, and ρ12 is their correlation coefficient, also in the absence of noise. For the general case of an N-dimensional Gaussian distribution, we have

fY(y) = [1/((2π)^(N/2) (det R)^(1/2))] exp(-(1/2) yᵀR⁻¹y)

Correspondingly, the differential entropy of the N-dimensional vector Y is described as

h(Y) = log[(2πe)^(N/2) (det R)^(1/2)]

where e is the base of the natural logarithm. For the problem at hand, we have N = 2, and so

h(Y) = log(2πe det(R)) = 1 + log(2π det(R))     (3)

Hence, the use of (2) and (3) in (1) yields

I(X; Y) = log(det(R)/σN²)     (4)

For a fixed noise variance σN², the mutual information I(X; Y) is maximized by maximizing the determinant det(R). By definition,

det(R) = r11 r22 - r12 r21

That is,

det(R) = σN⁴ + σN²(σ1² + σ2²) + σ1²σ2²(1 - ρ12²)     (5)

Depending on the value of the noise variance σN², we may identify two distinct situations:

1. Large noise variance. When σN² is large, the third term in (5) may be neglected, obtaining

det(R) ≈ σN⁴ + σN²(σ1² + σ2²)


In this case, maximizing det(R) requires that we maximize σ1² + σ2². This requirement may be satisfied simply by maximizing the variance σ1² of output Y1 or the variance σ2² of output Y2, separately. Since the variance of output Yi, i = 1, 2, is equal to σi² in the absence of noise and σi² + σN² in the presence of noise, it follows from the Infomax principle that the optimum solution for a fixed noise variance is to maximize the variance of either output, Y1 or Y2.

2. Low noise variance. When the noise variance σN² is small, the third term σ1²σ2²(1 - ρ12²) in (5) becomes important relative to the other two terms. The mutual information I(X; Y) is then maximized by making an optimal tradeoff between two options: keeping the output variances σ1² and σ2² large, and making the outputs Y1 and Y2 of the two neurons uncorrelated.

Based on these observations, we may now make the following two statements:

• A high noise level favors redundancy of response, in which case the two output neurons compute the same linear combination of inputs. Only one such combination yields a response with maximum variance.

• A low noise level favors diversity of response, in which case the two output neurons compute different linear combinations of inputs, even though such a choice may result in a reduced output variance.

Problem 10.11

(a) We are given

Ya = S + Na
Yb = S + Nb

Hence,

(Ya + Yb)/2 = S + (1/2)(Na + Nb)

The mutual information between (Ya + Yb)/2 and the signal component S is

I((Ya + Yb)/2; S) = h((Ya + Yb)/2) - h((Ya + Yb)/2 | S)     (1)

The differential entropy of (Ya + Yb)/2 is



h((Ya + Yb)/2) = (1/2)[1 + log((π/2) var[Ya + Yb])]     (2)

The conditional differential entropy of (Ya + Yb)/2 given S is

h((Ya + Yb)/2 | S) = h((Na + Nb)/2) = (1/2)[1 + log((π/2) var[Na + Nb])]     (3)

Hence, the use of (2) and (3) in (1) yields (after the simplification of terms)

I((Ya + Yb)/2; S) = (1/2) log(var[Ya + Yb] / var[Na + Nb])

(b) The signal component S is ordinarily independent of the noise components Na and Nb. Hence, with

Ya + Yb = 2S + Na + Nb

it follows that

var[Ya + Yb] = 4 var[S] + var[Na + Nb]

The ratio var[Ya + Yb]/var[Na + Nb] in the expression for the mutual information I((Ya + Yb)/2; S) may therefore be interpreted as a signal-plus-noise to noise ratio.

Problem 10.12

Principal-components analysis (PCA) and independent-components analysis (ICA) share a common feature: they both linearly transform an input signal into a fixed set of components.

However, they differ from each other in two important respects:

1. PCA performs decorrelation by minimizing second-order moments; higher-order moments are not involved in this computation. On the other hand, ICA achieves statistical independence by using higher-order moments.



2. The output signal vector resulting from PCA has a diagonal covariance matrix. The first principal component defines a direction in the original signal space that captures the maximum possible variance; the second principal component defines another direction in the remaining orthogonal subspace that captures the next maximum possible variance; and so on. On the other hand, ICA does not find the directions of maximum variance but rather interesting directions, where the term "interesting" refers to "deviation from Gaussianity".

Problem 10.13

Independent-components analysis may be used as a preprocessing tool before signal detection and pattern classification. In particular, through a change of coordinates resulting from the use of ICA, the probability density function of multichannel data may be expressed as a product of marginal densities. This change, in turn, permits density estimation with shorter observations.

Problem 10.14

Consider m random variables X1, X2, ..., Xm that are defined by

Xi = Σ_{j=1}^{N} aij Uj, i = 1, 2, ..., m

where the Uj are independent random variables. Darmois' theorem states that if the Xi are independent, then the variables Uj for which aij ≠ 0 are all Gaussian.

For independent-components analysis to work, at most a single Xi can be Gaussian. If all the Xi are independent to begin with, there is no need for the application of independent-components analysis. This, in turn, means that all the Xi must be Gaussian. For a finite N, this condition can only be satisfied if all the Uj are not only independent but also Gaussian.

Problem 10.15

The use of independent-components analysis results in a set of components that are as statistically independent of each other as possible. In contrast, the use of decorrelation only addresses second-order statistics, and there is therefore no guarantee of statistical independence.

Problem 10.16

The Kullback-Leibler divergence between the joint pdf fY(y, w) and the factorial pdf f̃Y(y, w) is the multifold integral

D(f||f̃) = ∫ fY(y, w) log[fY(y, w)/f̃Y(y, w)] dy     (1)



Let

dy = dyi dyj dy′

where y′ excludes yi and yj. We may then rewrite (1) as

D(f||f̃) = ∫∫ dyi dyj ∫ fY(y, w) log fY(y, w) dy′ - ∫∫ dyi dyj ∫ fY(y, w) log f̃Y(y, w) dy′
        = ∫∫ f_{Yi,Yj}(yi, yj, w) log f_{Yi,Yj}(yi, yj, w) dyi dyj - ∫∫ f_{Yi,Yj}(yi, yj, w) log[f_{Yi}(yi, w) f_{Yj}(yj, w)] dyi dyj
        = ∫∫ f_{Yi,Yj}(yi, yj, w) log{f_{Yi,Yj}(yi, yj, w)/[f_{Yi}(yi, w) f_{Yj}(yj, w)]} dyi dyj
        = I(Yi; Yj)

That is, the Kullback-Leibler divergence between the joint pdf fY(y, w) and the factorial pdf f̃Y(y, w) is equal to the mutual information between the components Yi and Yj of the output vector Y, for any pair (i, j).

Problem 10.18

Define the output matrix

Y = [ y1(0)  y1(1)  ...  y1(N-1)
      y2(0)  y2(1)  ...  y2(N-1)
       ...
      ym(0)  ym(1)  ...  ym(N-1) ]     (1)

where m is the dimension of the output vector y(n) and N is the number of samples used in computing the matrix Y. Correspondingly, define the m-by-N matrix of activation functions

Φ(Y) = [ φ(y1(0))  φ(y1(1))  ...  φ(y1(N-1))
         φ(y2(0))  φ(y2(1))  ...  φ(y2(N-1))
          ...
         φ(ym(0))  φ(ym(1))  ...  φ(ym(N-1)) ]     (2)



In the batch mode, we define the average weight adjustment (see Eq. (10.100) of the text)

ΔW = (1/N) Σ_{n=0}^{N-1} ΔW(n) = η[I - (1/N) Σ_{n=0}^{N-1} φ(y(n)) yᵀ(n)] W

Equivalently, using the matrix definitions introduced in (1) and (2), we may write

ΔW = η[I - (1/N) Φ(Y)Yᵀ] W

which is the desired formula.

Problem 10.19

(a) Let q(y) denote a pdf equal to the determinant det(J), with the elements of the Jacobian J being as defined in Eq. (10.115). Then, using Eq. (10.116), we may express the entropy of the random vector Z at the output of the nonlinearity in Fig. 10.16 of the text as

h(Z) = -D(f||q)

Invoking the Pythagorean decomposition of the Kullback-Leibler divergence, we write

D(f||q) = D(f||f̃) + D(f̃||q)

Hence, the differential entropy

h(Z) = -D(f||f̃) - D(f̃||q)     (1)



(b) If q(yi) happens to equal the source pdf fU(yi) for all i, we then find that D(f̃||q) = 0. In such a case, (1) reduces to

h(Z) = -D(f||f̃)

That is, the entropy h(Z) is equal to the negative of the Kullback-Leibler divergence between the pdf fY(y) and the corresponding factorial distribution f̃Y(y).

Problem 10.20

(a) From Eq. (10.124) in the text,

Φ = log|det(A)| + log|det(W)| + Σi log|∂zi/∂yi|

The matrix A of the linear mixer is fixed. Hence, differentiating Φ with respect to W:

∂Φ/∂W = W⁻ᵀ + ∂/∂W Σi log(∂zi/∂yi)     (1)

(b) From Eq. (10.126) of the text,

zi = 1/(1 + e^(-yi))

Differentiating zi with respect to yi:

∂zi/∂yi = e^(-yi)/(1 + e^(-yi))² = zi - zi²     (2)

Hence, differentiating log(∂zi/∂yi) with respect to the demixing matrix W, we get

Df q

0=

h Z( ) Df f

–=

f Y y( )

Φ det A( ) det W( )∂zi

∂ yi-------

logi

∑+log+log=

Φ

∂Φ∂W--------- W T– ∂

∂W---------

∂zi

∂ yi-------

logi

∑+=

zi1

1 eyi–

+------------------=

∂zi

∂ yi------- e

yi–

1 eyi–

+( )2

-------------------------=

zi zi2

–=

∂zi

∂ yi-------

log

∂∂W---------

∂zi

∂ yi-------

log∂

∂W--------- zi zi

2–( )log=

13

Page 51: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(3)

But from (2) we have

Hence, we may simplify (3) to

We may thus rewrite (1) as

Putting this relation in matrix form and recognizing that the demixer output y is equal to Wxwhere x is the observation vector, we find that the adjustment applied to W is defined by

where η is the learning-rate parameter and 1 is a vector of ones.

∂zi

∂W--------- ∂

∂zi------- zi zi

2–( )log=

∂zi

∂W--------- 1

zi zi2

–( )------------------- 1 2zi–( )=

∂zi

∂ yi-------

∂ yi

∂W--------- 1

zi zi2

–( )------------------- 1 2zi–( )=

∂zi

∂ yi------- 1

zi zi2

–---------------

1=

∂∂W---------

∂zi

∂ yi-------

log∂ yi

∂W--------- 1 2zi–( )=

∂Φ∂W--------- W T– ∂ yi

∂W--------- 1 2zi–( )

i∑+=

∆W η ∂Φ∂W---------=

η W T– 1 2z–( )xT+( )=

14

Page 52: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

CHAPTER 11Stochastic Methodfs Rooted in Statistical Mechanics

Problem 11.1

By definition, we have

where t denotes time and n denotes the number of discrete steps. For n = 1, we have the one-steptransition probability

For n = 2 we have the two-step transition probability

where the sum is taken over all intermediate steps k taken by the system. By induction, it thusfollows that

Problem 11.2

For p > 0, the state transition diagram for the random walk process shown in Fig., P11.2 of the testis irreducible. The reason for saying so is that the system has only one class, namely,{0, +1, +2, ...}.

Problem 11.3

The state transition diagram of Fig. P11.3 in the text pertains to a Markov chain with two classes:{x1} and {x1, x2}.

pijn( )

P X t j X t n– i= =( )=

pij1( )

pij P X t j X t 1– i= =( )= =

pij2( )

pik pkjk∑=

pijn 1–( )

pik pkjn( )

k∑=

1

Page 53: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 11.4

The stochastic matrix of the Markov chain in Fig. P11.4 of the text is given by

Let π1, π2, and π3 denote the steady-state probabilities of this chain. We may then write (see Eq.(11.27) of the text)

That is,

We also have, by definition,

Hence,

or equivalently

and so

P

34--- 1

4--- 0

023--- 1

3---

14--- 3

4--- 0

=

π1 π134---

π2 0( ) π314---

+ +=

π2 π114---

π223---

π334---

+ +=

π3 π1 0( ) π213---

π3 0( )+ +=

π1 π3=

π2 3π3=

π1 π2 π3+ + 1=

π3 3π3 π3+ + 1=

π315---=

2

Page 54: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 11.6

The Metropolis algorithm and the Gibbs sampler are similar in that they both generate a Markovchain with the Gibbs distribution as the equilibrium distribution.

They differ from each other in the following respect: In the Metropolis algorithm, thetransition probabilities of the Markov chain are stationary. In contrast, in the Gibbs sampler, theyare nonstationary.

Problem 11.7

Simulated annealing algorithm for solving the travelling salesman problem:

1. Set up an annealing schedule for the algorithm.2. Initialize the algorithm by picking a tour at random.3. Choose a pair of cities in the tour and then reverse the order that the cities in-between the

selected pairs are visited. This procedure, illustrated in Figure 1 below, generates new tours ina local manner:

4. Calculate the energy difference due to the reversal of paths applied in step 3.5. If the energy difference so calculated is negative or zero, accept the new tour. If, on the other

hand, it is positive, accept the change in the tour with probability defined in accordance withthe Metropolis algorithm.

6. Select another pair of cities, and repeat steps 3 to 5 until the required number of iterations isaccepted.

7. Lower the temperature in the annealing schedule, and repeat steps 3 to 6.

Problem 11.8

(a) We start with the notion that a neuron j flips from state xj to -xj at temperature T withprobability

(1)

π115---=

π235---=

. . ...... . . .....

..

. . ....... .. . ..Figure 1: Problem 11.7

P x j x j–→( ) 11 ∆E j T⁄–( )exp+---------------------------------------------=

3

Page 55: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

where ∆Ej is the energy difference resulting from such a flip. The energy function of the Boltzmanmachine is defined by

Hence, the energy change produced by neuron j flipping from state xj to -xj is

(2)

where vi is the induced local field of neuron j.

(b) In light of the result in (2), we may rewrite (1) as

This means that for an initial state xj = -1, the probability that neuron j is flipped into state +1 is

(3)

(c) For an initial state of xj = +1, the probability that neuron j is flipped into state -1 is

(4)

The flipping probability in (4) and the one in (3) are in perfect agreement with the followingprobabilistic rule

E12--- w jixix j

ji j≠

∑i

∑–=

∆E jenergy with neuron j

in state x j energy with neuron j

in state x– j –=

x j( ) w jixi x j–( ) w jixij

∑+j

∑–=

2x j w jixij

∑–=

2x jv j–=

P x j x j–→( ) 11 2x jvj T⁄( )exp+---------------------------------------------=

11 2– vj T⁄( )exp+-------------------------------------------

11 +2vj T⁄( )exp+------------------------------------------- 1 1

1 2– vj T⁄( )exp+-------------------------------------------–=

x j+1 with probability P v j( )

-1 with probability 1-P v j( )

=

4

Page 56: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

where P(vj) is itself defined by

Problem 11.9

The log-likelihood function L(w) is (see (11.48) of the text)

Differentiating L(w) with respect to weight wji:

The energy function E(x) is defined by (see (11.39) of the text)

Hence,

, (1)

We also note the following:

(2)

(3)

P v j( ) 11 2– vj T⁄( )exp+-------------------------------------------=

L w( ) E x( )T

------------– E x( )

T------------–

expx∑log–exp

∑logxα T∈∑=

∂L w( )∂w ji

----------------1T--- ∂E x( )

∂w ji--------------- 1

E x( )T

------------– exp

∑-------------------------------------– 1

E x( )T

------------– exp

x∑-------------------------------------+

xα T∈∑=

E x( ) 12--- w jixix j

ji j≠

∑i

∑–=

∂E x( )∂w ji

--------------- xix j–= i j≠

P Xβ xβ Xα xα= =( ) 1E x( )

T------------–

expxβ

∑-------------------------------------=

P X x=( ) 1E x( )

T------------–

expx∑-------------------------------------=

5

Page 57: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Accordingly, using the formulas of (1) to (3), we may redefine the derivative as

follows:

which is the desired result.

Problem 11.10

(a) Factoring the transition process from state i to state j into a two-step process, we may expressthe transition probability pji as

for (1)

where τji is the probability that a transition from state j to state i is attempted, and qji is theconditional probability that the attempt is successful given that it was attempted. When j = i, theproperty that each row of the stochastic matrix must add to unity implies that

(b) We require that the attempt-rate matrix be symmetric:

for all (2)

and that it satisfies the normalization condition

for all

We also require the property of complementary conditional transition probability

(3)

For a stationary distribution, we have

for all i (4)

∂L w( ) ∂w ji⁄

∂L w( )∂w ji

----------------1T--- P Xβ xβ Xα xα= =( )x jxi P X x=( )x jxi

x∑–

xα T∈∑=

p ji τ jiq ji= j i≠

pii 1 pijj i≠∑–=

1 τ ijqijj i≠∑–=

τ ji τ ij= i j≠

τ jij

∑ 1= i j≠

q ji 1 qij–=

πi πj p jij

∑=

6

Page 58: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Hence, using (1) to (3) in (4):

(5)

Next, recognizing that

for all i

we may go on to write

(6)

Hence, combining (5) and (6), using the symmetry property of (2), and then rearranging terms:

(7)

(c) For the condition of (7) can only be satisfied if

which, in turn, means that qij is defined by

(8)

(d) Make a change of variables:

where T and T* are arbitrary constants. We may then express πi in terms of Ei as

πi πjτ ji p jij

∑=

πjτ ji 1 qij–( )j

∑=

pijj

∑ 1=

πi πi pijj

∑=

πi pijj

∑=

πiτ ijqijj

∑=

τ ji πiqij πjqij πj–+( )j

∑ 0=

τ ji 0≠

πiqij πjqij πj–+ 0=

qij1

1 πi πj⁄( )+----------------------------=

Ei T πi T *+log–=

7

Page 59: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

where

Accordingly, we may reformulate (8) in the new form

(9)

where ∆E = Ei - Ej. To evaluate the constant Z, we note that

and therefore

(e) The formula of (9) is the only possible distribution for state transitions in the Boltzmannmachine; it is recognized as the Gibbs distribution.

Problem 11.11

We start with the Kullback-Leibler divergence

(1)

The probability distribution in the clamped condition is naturally independent of the synaptic

weights wji in the Boltzman machine, whereas the probability distribution is dependent on wji.

Hence differentiating (1) with respect to wji:

πi1Z---

Ei

T-----–

exp=

Z T *T

-------– exp=

qij1

11T--- Ei E j–( )–

exp+

------------------------------------------------------=

11 ∆E T⁄–( )exp+------------------------------------------=

πii

∑ 1=

Z Ei T⁄–( )expi

∑=

D p+ p- pα+ pα

+

pα-

------

logα∑=

pα+

pα-

8

Page 60: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(2)

To minimize , we use the method of gradient descent:

(3)

where ε is a positive constant.

Let denote the joint probability that the visible neurons are in state α and the hidden

neurons are in state β, given that the network is in its clamped condition. We may then write

Assuming that the network is in thermal equilibrium, we may use the Gibbs distribution

to write

(4)

where Eαβ is the energy of the network when the visible neurons are in state α and the hiddenneurons are in state β. The partition function Z is itself defined by

The energy Eαβ is defined in terms of the synaptic weights wji by

∂D p+ p-

∂w ji---------------------

pα+

pα-

------pα

-

∂w ji-----------

α∑–=

D p+ p-

∆w ji εD p+ p-

∂w ji------------------–=

εpα

+

pα-

------pα

-

∂w ji-----------

α∑=

pαβ-

pα-

pαβ-

β∑=

pαβ- 1

Z---

EαβT

---------– exp=

pα- 1

Z---

EαβT

---------– exp

β∑=

ZEαβT

---------– exp

β∑

α∑=

9

Page 61: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(5)

where is the state of neuron i when the visible neurons are in state α and the hidden neurons

are in state β. Therefore, using (4):

(6)

From (5) we have (remembering that in a Boltzmann machine wji = wij)

(7)

The first term on the right-hand side of (6) is therefore

where we have made use of the Gibbs distribution

as the probability that the visible neurons are in state α and the hidden neurons are in state β in thefree-running condition. Consider next the second term on the right-hand side of (6). Except for theminus sign, we may express this term as the product of two factors:

(8)

The first factor in (8) is recognized as the Gibbs distribution defined by

(9)

Eαβ12--- w jix j αβxi αβ

ji j≠

∑i

∑–=

xi αβ

pα-

∂w ji-----------

1ZT-------

EαβT

---------– ∂Eαβ

∂w ji------------exp

β∑–=

1

Z2

------ ∂Z∂w ji-----------

EαβT

---------– exp

β∑–

∂Eαβ∂w ji------------ x j αβxi αβ–=

1ZT-------

EαβT

---------– ∂Eαβ

∂w ji------------exp

β∑– +

1ZT-------

EαβT

---------– x j αβxi αβexp

β∑=

1T--- pαβ

-x j αβxi αβ

β∑=

pαβ- 1

Z---

EαβT

---------– exp=

1

Z2

------ ∂Z∂w ji-----------

EαβT

---------– exp

β∑ 1

Z---

EαβT

---------– exp

β∑ 1

Z--- ∂Z

∂w ji-----------=

pα-

pα- 1

Z---

EαβT

---------– exp

β∑=

10

Page 62: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

To evaluate the second factor in (8), we write

(10)

Using (9) and (10) in (8):

(11)

We are now ready to revisit (6) and thus write

We now make the following observations:

1. The sum of probability over the states α is unity, that is,

(12)

2. The joint probability

(13)

Similarly

(14)

3. The probability of a hidden state, given some visible state, is naturally the same whether thevisible neurons of the network in thermal equilibrium are clamped in that state by the externalenvironment or arrive at that state by free running of the network, as shown by

(15)

In light of this relation we may rewrite Eq. (13) as

1Z--- ∂Z

∂w ji----------- 1

Z--- ∂

∂w ji-----------

EαβT

---------– exp

β∑

α∑=

1TZ-------–

EαβT

---------– ∂Eαβ

∂w ji------------exp

β∑

α∑=

1TZ-------

EαβT

---------– x j αβxi αβexp

β∑

α∑=

1T--- pαβ

-x j αβxi αβ

β∑

α∑=

1

Z2

------ ∂Z∂w ji-----------

EαβT

---------– exp

β∑

pα-

T------ pαβ

-x j αβxi αβ

β∑

α∑=

∂ pα-

∂w ji-----------

1T--- pαβ

-x j αβxi αβ

pα-

T------ pαβ

-x j αβxi αβ

β∑

α∑–

β∑=

pα+

pα+

α∑ 1=

pαβ-

pβ α-

pα-

=

pαβ+

pβ α+

pα+

=

pβ α-

pβ α+

=

11

Page 63: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(16)

Moreover, we may write

(17)

Accordingly, we may rewrite (3) as follows:

Define the following terms:

= learning rate parameter

=

=

=

=

=

We may then finally formulate the Boltzmann learning rule as

Problem 11.12

(a) We start with the relative entropy:

pαβ-

pβ α+

pα-

=

pα+

pα-

------

pαβ-

pα+

pβ α+

=

pαβ+

=

∆w jiεT---

pα+

pα-

------ pαβ-

x j αβxi αβ pα+

α∑ pαβ

-x j αβxi αβ

β∑

α∑–

β∑

α∑

=

εT--- pαβ

+x j αβxi αβ pαβ

-x j αβxi αβ

β∑

α∑–

β∑

α∑

=

ηεT---

ρ ji+

x j αβxi αβ+><

pαβ+

x j αβxi αββ∑

α∑

ρ ji-

x j αβxi αβ-><

pαβ-

x j αβxi αββ∑

α∑

∆w ji η ρ ji+ ρ ji

-–( )=

12

Page 64: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(1)

From probability theory, we have

(2)

(3)

where, in the last line, we have made use of the fact that the input neurons are always clamped tothe environment, which means that

Substituting (2) and (3) into (1):

(4)

where the state α refers to the input neurons and γ refers to the output neurons.

(b) With denoting the conditional probability of finding the output neurons in state γ, given

that the input neurons are in state α, we may express the probability distribution of the outputstates as

The conditional is determined by the synaptic weights of the network in accordance with the

formula

(5)

where

(6)

Dp+ p- pαγ

+ pαγ+

pαγ-

---------

logγ∑

α∑=

pαγ+

pγ α+

pα+

=

pαγ-

pγ α-

pα-

pγ α-

pα+

= =

pα-

pα+

=

Dp+ p- pα

+pγ α

+ pγ α+

pγ α-

----------

logγ∑

α∑=

pγ α

pγ-

pγ α-

pα-

α∑=

pγ α-

pγ α- 1

Z1α---------

EγβαT

-----------– exp

β∑=

Eγβα12--- w ji s jsi[ ] γβα

i∑

j∑=

13

Page 65: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

The parameter Z1α is the partition function:

(7)

The function of the Boltzmann machine is to find the synaptic weights for which the conditional

probability approaches the desired value .

Applying the gradient method to the relative entropy of (1):

(8)

Using (4) in (8) and recognizing that is determined by the environment (i.e., it is

independent of the network), we get

(9)

To evaluate the partial derivative we use (5) to (7):

(10)

Next, we recognize the following pair of relations:

(11)

1Z1α---------

EγβαT

-----------– exp

γ∑

β∑=

pγ α-

pγ α+

∆w ji -ε∂D

p+ p-

∂w ji-------------------=

pγ α+

∆w ji ε pα+ pγ α

+

pγ α-

----------pγ α

-

∂w ji-----------

γ∑

α∑=

∂ pγ α- ∂w ji⁄

pγ α-

∂w ji-----------

1Z1α--------- 1

T---–

EγβαT

-----------– Eγβα

∂w ji-----------exp

β∑=

1

Z1α2

---------∂Z1α∂w ji------------

EγβαT

-----------– exp

β∑–

1Z1α--------- 1

T--- s jsi[ ] γβα

EγβαT

-----------– exp

β∑=

+1

Z1α2

--------- s jsi[ ] γβαEγβα

T-----------–

expγ∑

β∑

1Z1α---------

EγβαT

-----------– exp pγβ1α

-=

14

Page 66: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(12)

where the term is the averaged correlation of the states sj and si with the input neurons

clamped to state α and the network in a free-running condition. Substituting (11) and (12) in (10):

(13)

Next, substituting (13) into (9):

(14)

We now recognize that

for all α (15)

(16)

Accordingly, substituting (15) and (16) into (14):

1Z1α--------- s jsi[ ] γβα

EγβαT

-----------– exp

γ∑

β∑ <s jsi>α

-pγ α

-=

<s jsi>α-

∂ pγ α-

∂w ji--------------

1T--- s jsi[ ] γβα pγβα

- <s jsi>α-

pγ α-

–β∑

=

∆w jiεT--- pα

+s jsi[ ] γβα pγβ1α

- pγ α+

pγ α-

----------

β∑

α∑

α∑

=

pα+ <s jsi>α

-pγ α

+

γ∑

α∑–

pα+

α∑ 1=

s jsi[ ] γβα pγβ|α- pγ α

+

pγ α-

----------

β∑

γ∑ pγ α

+s jsi[ ] γβα

pγβ α+

pγ α-

-------------

β∑

γ∑=

pγ α+ <s jsi>γα

γ∑=

<s jsi>α+

=

∆w jiεT--- pα

+ <s jsi>α+ <s jsi>α

-–( )

α∑=

η pα+ ρ ji α

+ ρ ji α-

–( )α∑=

15

Page 67: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

where ; and and are the averaged correlations in the clamped and free-

running conditions, given that the input neurons are in state α.

Problem 11.15

Consider the expected distortion (energy)

(1)

where d(x, yj) is the distortion measure for representing the data point x by the vector yj, and

is the probability that x belongs to the cluster of points represented by yj. To

determine the association probabilities at a given expected distortion, we maximize the entropysubject to the constraint of (1). For a fixed Y = {yj}, we assume that the association probabilitiesof different data points are independent. We may thus express the entropy as

(2)

The probability distribution that maximizes the entropy under the expectation constraint is theGibbs distribution:

(3)

where

is the partition function. The inverse temperature B = 1/T is the Lagrange multiplier defined by thevalue of E in (1).

Problem 11.6

(a) The free energy is

(1)

where D is the expected distortion, T is the temperature, and H is the conditional entropy. Theexpected distortion is defined by

η ε T⁄= ρ ji α+ ρ ji α

-

E P x C j∈( )d x y j,( )j

∑x∑=

P x C j∈( )

H P x C j∈( ) P x C j∈( )logj

∑x∑–=

P x C j∈( ) 1Zx------ 1

T---d x y j,( )–

exp=

Zx1T---d x y j,( )–

expj

∑=

F D TH–=

16

Page 68: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(2)

The conditional entropy if defined by

(3)

The minimizing P(Y = y|X = x) is itself defined by the Gibbs distribution:

(4)

where

(5)

is the partition function. Substituting (2) to (5) into (1), we get

This result simplifies as follows by virtue of the definition given in (5) for the partition function:

(6)

(b) Differentiating the minimum free energy F* of (6) with respect to y:

(7)

Using the definition of Zx given in (5), we write:

(8)

D P(Xy∑

x∑ x)= = P(Y y X x)d x y,( )= =

H Y X( ) P(Xy∑

x∑ x)P(Y y X x) P(Y y X x)= =log= ==–=

P(Y y X x)1

Zx------ d x y,( )

T-----------------–

exp= = =

Zxd x y,( )

T-----------------–

expy∑=

F*

P(Xy∑

x∑ x)

1Zx------ d x y,( )

T-----------------–

d x y,( )exp= =

T P(Xy∑

x∑ x)

1Zx------ d x y,( )

T-----------------–

Zx1T---–log– d x y,( )

exp=+

T P(Xy∑

x∑ x)

1Zx------ d x y,( )

T-----------------–

Zxlog–( )exp==

F*

T P(Xx∑– x) Zxlog= =

∂F*

∂y--------- T P(X

x∑– x)

1Zx------

∂Zx

∂y---------= =

∂Zx

∂y---------

1T--- d x y,( )

T-----------------–

∂d x y,( )∂y

--------------------expy∑–=

17

Page 69: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Hence, we may rewrite (7) as

(9)

where use has been made of (4). Noting that

we may then state that the condition for minimizing the Lagrangian with respect to y is

for all y (10)

Normalizing this result with respect to P(X = x) we get the minimizing condition:

for all y (11)

(c) Consider the squared Euclidean distortion

for which we have

(12)

For this particular measure we find it more convenient to normalize (10) with respect to theprobability

We may then write the minimizing condition with respect to y as

(13)

∂F*

∂y--------- P(X

y∑

x∑ x)

1Zx------ d x y,( )

T-----------------–

∂d x y,( )∂y

--------------------exp= =

P(Xx∑ x)P(Y y X x)

∂d x y,( )∂y

--------------------= = = =

P(X x Y, y)= = P(Y y X x)= P(X x)= = =

P(X x Y, y)∂d x y,( )

∂y--------------------= =

x∑ 0=

P(Y y X x)∂d x y,( )

∂y--------------------= =

x∑ 0=

d x y,( ) x y–2 x y–( )T x y–( )= =

∂d x y,( )∂y

--------------------∂

∂y------ x y–( )T x y–( )=

2 x y–( )( )–=

P(Y y) P(X x,Y y)= =x∑= =

P(X x|Y y)∂d x y,( )

∂y--------------------= =

x∑ 0=

18

Page 70: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Using (12) in (13) and solving for y, we get the desired minimizing solution

(14)

which is recognized as the formula for a centroid.

Problem 11.17

The advantage of deterministic annealing over maximum likelihood is that it does not make anyassumption on the underlying probability distribution of the data.

Problem 11.18

(a) Let

, k = 1, 2, ..., K

where tk is the center or prototype vector of the kth radial basis function and K is the number ofsuch functions (i.e., hidden units). Define the normalized radial basis function

The average squared cost over the training set is

(1)

where is the output vector of the RBF network in response to the input xi. The Gibbs

distribution for is

(2)

where d is defined in (1) and

y

P(X x|Y y)x= =x∑

P(X x|Y y)= =x∑

--------------------------------------------------=

ϕk x( ) 1

2σ2--------- x = tk

2–

exp=

Pk x( )ϕk x( )

ϕk x( )k∑---------------------=

d1N---- yi F xi( )–

2

i=1

N

∑=

F xi( )

P x R∈( )

P x R∈( ) 1Zx------ d

T---–

exp=

19

Page 71: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(3)

(b) The Lagrangian for minimizing the average misclassification cost is

F = d - TH

where the average squared cost d is defined in (1), and the entropy H is defined by

where is the probability of associating class j at the output of the RBF network with theinput x.

ZxdT---–

expyi

∑=

H p j x( ) j x( )logj

∑x∑–=

p j x( )

20

Page 72: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

CHAPTER 12Dynamic Programming

Problem 12.1

As the discount factor γ approaches 1, the computation of the cost-to-go function J π(i) becomeslonger because of the corresponding increase in the number of time steps involved in thecomputation.

Problem 12.2

(a) Let πbe an arbitrary policy, and suppose that this policy chooses action at time step 0.

We may then write

where pa is the probability of choosing action a, c(i, a) is the expected cost, pij(a) is the

probability of transition from state i to state j under action a, Wπ(j) is the expected cost-to-gofunction from time step n = 1 onward, and j is the state at that time step. We now note that

which follows from the observation that if the state at time step n = 1 is j, then the situation at thattime step is the same as if the process had started in state j with the exception that all the returnsare multiplied by the discount factor γ. Hence, we have

which implies that

(1)

a Ai∈

i( ) pa c i a,( ) pij a( )Wπ

j( )j=1

N

∑+

a Ai∈∑=

j( ) γ J j( )≤

i( ) pa c i a,( ) γ pij a( )J j( )j

∑+ ≥

pa mina Ai∈

c i a,( ) γ pij a( )J j( )j

∑+ ≥

mina

c i a,( ) γ pij a( )J j( )j

∑+ =

i( ) mina Ai∈

c i a,( ) γ pij a( )J j( )j

∑+ ≥

1

Page 73: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(b) Suppose we next go the other way by choosing a0 with

(2)

Let πbe the policy that chooses a0 at time step 0 and, if the next state is j, the process is viewed asoriginating in state j following a policy πj such that

Where ε is a small positive number. Hence

(3)

Since J(i) < Jπ(i), (3) implies that

Hence, from (2) it follows that

(4)

(c) Finally, since ε is arbitrary, we immediately deduce from (1) and (4) that the optimum cost-to-go function

(5)

which is the desired result

Problem 12.3

Writing the system of N simultaneous equations (12.22) of the text in matrix form:

c i a0,( ) γ pij a0( )j

∑+ mina

c i a,( ) γ pij a( )J j( )j

∑+ =

Jπj J j( ) ε+≤

Jπj c i a0,( ) pij a0( )J

πj j( )j=1

N

∑+=

c i a0,( ) pij a0( )J j( ) γε+j=1

N

∑+≤

J i( ) c i a0,( ) γ pij a0( )J j( ) γε+j=1

N

∑+≤

J i( ) mina Ai∈

c i a,( ) γ pij a( )J j( )j

∑+ γε+≤

J*

i( ) mina

c i a,( ) γ pij a( )J*

j( )j

∑+ =

2

Page 74: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Jµ = c(µ) + γP(µ)Jµ (1)

where

Rearranging terms in 1), we may write

where I is the N-by-N identity matrix. For the solution Jµ to be unique we require that the N-by-Nmatrix (I - γP(µ)) have an inverse matrix for all possible values of the discount factor γ.

Problem 12.4

Consider an admissible policy {µ0, µ1, ...}, a positive integer K, and cost-to-go function J. Let the

costs of the first K stages be accumulated, and add the terminal cost γKJ(XK), thereby obtainingthe total expected cost

where E is the expectational operator, To minimize the total expected cost, we start with γKJ(XK)and perform K iterations of the dynamic programming algorithm, as shown by

(1)

with the initial condition

JK(X) = γKJ(X)

J µ( )J

µ( ) 1( ), Jµ( ) 2( ), … J

µ( )N( )

T=

c µ( ) C 1 µ,( ), C 2 µ,( ), … C N µ,( )T

=

P µ( )

p11 µ( ) p12 µ( ) … p1N µ( )

p21 µ( ) p22 µ( ) … p2N µ( )

pN 1 µ( ) pN 2 µ( ) … pNN µ( )

=

. . .

. . .

. . .

I γP µ( )J µ( )– c µ( )=

E γKJ X K( ) γn

g X n µn X n( ) X n-1,,( )n=0

K -1

∑+

J n X n( ) minµn

E gn X n µn X n( ) X n-1,,( ) J n+1 X n+1( )+[ ]=

3

Page 75: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Now consider the function Vn defined by

for all n and X (2)

The function Vn(X) is the optimal K-stage cost J0(X). Hence, the dynamic programming algorithmof (1) can be rewritten in terms of the function Vn(X) as follows:

with the initial condition

V0(X) = J(X)

which has the same mathematical form as that specified in the problem.

Problem 12.5

An important property of dynamic programming is the monotonicity property described by

This property follows from the fact that if the terminal cost gK for K stages is changed to a

uniformly larger cost , that is,

for all XK,

then the last stage cost-to-go function JK-1(XK-1) will be uniformly increased. In more generalterms, we may state the following.

Given two cost-to-go functions JK’+1 and with

for all XK+1,

we find that for all XK and µK the following relation holds

This relation merely restates the monotonicity property of the dynamic programming algorithm.

V n X( )J K -n X( )

γK -n---------------------=

V n+1 X 0( ) minµ

E g( X 0 µ X 0( ) X 1), γV n X 1( )+,[ ]=

Jµn+1 J

µn≤

gK

gK X K( ) gK X K( )≥

J K+1

J K+1 X K+1( ) J K+1 X K+1( )≥

E gK X K µK X K+1, ,( ) J K+1 X K+1( )+[ ] E gK X K µK X K+1, ,( ) J K+1 X K+1( )+[ ]≤

4

Page 76: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 12.6

According to (12.24) of the text the Q-factor for state-action pair (i, a) and stationary policy µsatisfies the condition

for all i

This equation emphasizes the fact that the policy µ is greedy with respect to the cost-to-go

function Jµ(i).

Problem 12.7

Figure 1, shown below, presents an interesting interpretation of the policy iteration algorithm. Inthis interpretation, the policy evaluation step is viewed as the work of a critic that evaluates the

performance of the current policy; that is, it calculates an estimate of the cost-to-go function .The policy improvement step is viewed as the work of a controller or actor that accounts for thelatest evaluation made by her critic and acts out the improved policy µn+1. In short, the critic looksafter policy evaluation and the controller (actor) looks after policy improvement and the iterationbetween them goes on.

Problem 12.8

From (12.29) in the text, we find that for each possible state, the value iteration algorithm requiresNM iterations, where N is the number of states and M is the number of admissible actions. Hence,

the total number of iterations for all N states in N2M.

i µ i( ),( ) mina

i a,( )=

Jµn

Environment

Controller (Actor)

Cost-to-goJµn

Critic

Stateiµn+1(i)

Figure 1: Problem 12.7

5

Page 77: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 12.9

To reformulate the value-iteration algorithm in terms of Q-factors, the only change we need tomake is in step 2 of Table 12.2 in the text. Specifically, we rewrite this step as follows:

For n = 0, 1, 2, ..., compute

Problem 12.10

The policy-iteration algorithm alternates between two steps: policy evaluation, and policyimprovement. In other words, an optimal policy is computed directly in the policy iterationalgorithm. In contrast, no such thing happens in the value iteration algorithm.

Another point of difference is that in policy iteration the cost-to-go function is recomputedon each iteration of the algorithm. This burdensome computational difficulty is avoided in thevalue-iteration algorithm.

Problem 12.14

From the definition of Q-factor given in (12.24) in the text and Bellman’s optimality equation(12.11), we immediately see that

where the minimization is performed over all possible actions a.

Problem 12.15

The value-iteration algorithm requires knowledge of the state transition probabilities. In contrast,Q-learning operates without this knowledge. But through an interactive process, Q-learning learnsestimates of the transition probabilities in an implicit manner. Recognizing the intimaterelationship between value iteration and Q-learning, we may therefore view Q-learning as anadaptive version of the value-iteration algorithm.

Q i a,( ) c i a,( ) γ pij a( )J n j( )j=1

N

∑+=

J n+1 i( ) mina

Q i a,( )=

J*

i( ) mina

Q i a,( )=

6

Page 78: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 12.16

Using Table 12.4 in the text, we may construct the signal-flow graph in Figure 1 for the Q-learning algorithm:

Problem 12.17

The whole point of the Q-learning algorithm is that it eliminates the need for knowing the statetransition probabilities. If knowledge of the state transition probabilities is available, then the Q-learning algorithm assumes the same form as the value-iteration algorithm.

Computeoptimumaction

ComputetargetQ-factor

UpdateQ-factor

Unitdelay

Qn+1(in+1, a, w)an Qtarget

Qn(in, a, w)

Figure 1: Problem 12.16

7

Page 79: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

CHAPTER 13Neurodynamics

Problem 13.1

The equilibrium state x(0) is (asymptotically) stable if in a small neighborhood around x(0), thereexists a positive definite function V(x) such that its derivative with respect to time is negativedefinite in that region.

Problem 13.3

Consider the symem of coupled nonlinear differential equations:

, j = 1, 2, ..., N

where W is the weight matrix, i is the bias vector, and x is the state vector with its jth elementdenoted by xj.

(a) With the bias vector i treated as input and with fixed initial condition x(0), let denote thefinal state vector of the system. Then,

, j = 1, 2, ..., N

For a given matrix W and input vector i, the set of initial points x(0) evolves to a fixed point. Thefixed points are functions of W and i. Thus, the system acts as a “mapper” with i as input and

as output, as shown in Fig. 1(a):

(b) With the initial state vector x(0) treated as input, and the bias vector i being fixed, letdenote the final state vector of the system. We may then write

, j = 1, 2, .., N

dx j

dt-------- ϕ j W i x, ,( )=

x ∞( )

0 ϕ j W i x ∞( ), ,( )=

x ∞( )

W;x(0) : fixed

W;i : fixed

(a) (b)

x ∞( )x ∞( )

x(0)i

Figure 1: Problem 13.3

x ∞( )

0 ϕ j W i:fixed x, ∞( ),( )=

1

Page 80: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Thus with x(0) acting as input and acting as output, the dynamic system behaves like apattern associator, as shown in Fig. 1b.

Problem 13.4

(a) We are given the fundamental memories:

The weight matrix of the Hopfield network (with N = 25 and p = 3) is therefore

(b) According to the alignment condition, we write

, i = 1, 2, 3

Consider first , for which we have

x ∞( )

ξ1 +1, +1, +1, +1, +1T

=

ξ2 +1, -1, -1, +1, -1T

=

ξ3 1-, +1, -1, +1, +1T

=

W1N---- ξ iξ i

T PN----I–

i=1

p

∑=

15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

=

ξ i Wξ i( )sgn=

ξ1

Wξ1( )sgn15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

+1+1+1+1+1

sgn=

2

Page 81: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Thus all three fundamental memories satisfy the alignment condition.

Note: Wherever a particular element of the product is zero, the neuron in question is left in

its previous state.

(c) Consider the noisy probe:

15---

0+4+2+2+4

sgn

+1+1+1+1+1

ξ1= = =

Wξ2( )sgn15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

+1-1-1+1-1

sgn=

15---

+2-4-20-4

sgn

+1-1-1+1-1

ξ2= = =

Wξ3( )sgn15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

-1+1-1+1+1

sgn=

15---

-2+40

+2+4

sgn

-1+1-1+1+1

ξ2= = =

Wξ i

3

Page 82: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

which is the fundamental memory with its second element reversed in polarity. We write

(1)

Therefore,

Thus, neurons 2 and 5 want to change their states. We therefore have 2 options:

• Neuron 5 is chosen for a state change, which yields the result

This vector is recognized as the fundamental memory , and the computation is thereby

terminated.

• Neuron 2 is chosen to change its state, yielding the vector

Next, we go on to compute

x +1, -1, +1, +1, +1T

=

Wx15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

+1-1+1+1+1

=

15---

+2+400-2

=

Wx( )sgn

+1+1+1+1-1

=

x +1, +1, +1, +1, +1T

=

ξ1

x +1, -1, +1, +1, -1T

=

4

Page 83: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Hence, neurons 3 and 4 want to change their states:

• If we permit neuron 3 to change its state from +1 to -1, we get

which is recognized as the fundamental memory .

• If we permit neuron 4 to change its state from +1 to -1, we get

which is recognized as the negative of the third fundamental memory .

In both cases, the new state would satisfy the alignment condition and the computation is thenterminated.

Thus, when the noisy version of is applied to the network, with its second element

changed in polarity, one of 2 things can happen with equal likelihood:

1. The original is recovered after 1 iteration.

2. The second fundamental memory or the negative of the third fundamental memory is

recovered after 2 iterations, which, of course, is in error.

Wx15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

+1-1+1+1-1

=

15---

+4-2-2-2-2

=

Wx( )sgn

+1-1-1-1-1

=

x +1, -1, -1, +1, -1T

=

ξ2

x +1, -1, +1, -1, -1=

ξ3

ξ1

ξ1

ξ2 ξ3

5

Page 84: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 13.5

Given the probe vector

and the weight matrix of (1) Problem 13.4, we find that

and

According to this result, neurons 2 and 5 have changed their states. In synchronous updating, thisis permitted. Thus, with the new state vector

on the next iteration,we compute

x +1, -1, +1, +1, +1T

=

Wx15---

2400-2

=

Wx( )sgn

+1+1+1+1-1

=

x

+1+1+1+1-1

=

Wx15---

0 -1 +1 +1 -1-1 0 +1 +1 +3+1 +1 0 -1 +1+1 +1 -1 0 +1-1 +3 +1 +1 0

+1+1+1+1-1

=

6

Page 85: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Hence,

The new state vector is therefore

which is recognized as the original probe. In this problem, we thus find that the networkexperiences a limit cycle of duration 2.

Problem 13.6

(a) The vectors

are simply the negatives of the three fundamental memories considered in Problem 13.4,respectively. These 3 vectors are therefore also fundamental memories of the Hopfield network.

15---

+2-200

+4

=

Wx( )sgn

+1-1+1+1+1

=

x

+1-1+1+1+1

=

ξ1 -1, -1, -1, +1, -1T

=

ξ2 +1, +1, +1, -1, +1T

=

ξ3 +1, -1, +1, -1, -1T

=

7

Page 86: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(b) Consider the vector

which is the result of masking the first element of the fundamental memory of Problem 13.4.

According to our notation, a neuron of the Hopfield network is in either state +1 or -1. Wetherefore have the choice of setting the zero element of x to +1 or -1. The first option restores thevector x to its original form: fundamental memory , which satisfies the alignment condition.

Alternatively, we may set the zero element equal to -1, obtaining

In this latter case, the alignment condition is not satisfied. The obvious choice is therefore theformer one.

Problem 13.7

We are given

(a) For state s2 we have

which yields

Next for state s4, we have

x 0, +1, +1, +1, +1T

=

ξ1

ξ1

x -1, +1, +1, +1, +1T

=

W 0 -1-1 0

=

Ws20 -1-1 0

-1+1

=

-1+1

=

Ws2( )sgn -1+1

s2= =

8

Page 87: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

which yields

Thus, both states s2 and s4 satisfy the alignment condition and are therefore stable.

Consider next the state s1, for which we write

which yields

Thus, both neurons want to change; suppose we pick neuron 1 to change its state, yielding the

new state vector [-1, +1]T. This is a stable vector as it satisfies the alignment condition. If,however, we permit neuron 2 to change its state, we get a state vector equal to s4. Similarly, we

may show that the state vector s3 = [-1, -1]T is also unstable. The resulting state-transition diagramof the network is thus as depicted in Fig. 1.

Ws40 -1-1 0

+1-1

=

+1-1

=

Ws4( )sgn +1-1

s4= =

Ws10 -1-1 0

+1+1

=

-1-1

=

Ws1( )sgn -1-1

s1= =

9

Page 88: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

The results depicted in Fig. 1 assume the use of asynchronous updating. If, however, weuse synchronous updating, we find that in the case of s1:

Permitting both neurons to change state, we get the new state vector [-1, -1]T. This is recognizedto be stable state s3. Now, we find that

which takes back to state s1.

Thus, in the synchronous updating case, the states s1 and s3 represent a limit cycle withlength 2.

Returning to the normal operation of the Hopfield network, we note that the energyfunction of the network is

since

(1)

. .

..

(-1, 1) (1, 1)

(-1, -1) (1, -1)

x2

Figure 1: Problem 13.7

Ws1( )sgn -1-1

=

0 -1-1 0

-1-1

+1+1

=

E12--- w jisis j

j∑

ii j≠

∑–=

12---w12s1s2

12---w21s2s1––=

w12s1s2–= w12 w21=

s1s2=

10

Page 89: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Evaluating (1) for all possible states of the network, we get the following table:

State Energy[+1, +1] +1[-1, +1] -1[-1. -1] +1[+1. -1] -1

Thus, states s1 and s3 represent global minima and are therefore stable.

Problem 13.8

The energy function of the Hopfield network is

(1)

The overlap mv is defined by

(2)

and the weight wji is itself defined by

(3)

Substituting (3) into (1) yields

where, in the third line, we made use of (2).

E12--- w jis jsi

j∑

i∑–=

mv1N---- s jξv j,

j∑=

w ji1N---- ξv j, ξv i,

v∑=

E1

2N-------- ξv j, ξv i, s jsi

v∑

j∑

i∑–=

12N--------– siξv i,

i∑

s jξv j,j

v∑=

12N-------- mvN( ) mvN( )

v∑–=

N2---- mv

2

v∑–=

11

Page 90: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 13.11

We start with the function (see (13.48) of the text)

(1)

where is the derivative of the function with respect to its argument. We now

differentiate the function E with respect to time t and note the following relations:

1. Cji = Cij

2.

3.

Accordingly, we may use (1) to express the derivative as follows:

(2)

From Eq. (13.47) in the text, we have

, j = 1, 2, ..., N (3)

Hence using (3) in (2) and collecting terms, we get the final result

(4)

Provided that the coefficient aj(uj) satisfies the nonnegativity condition

E12--- c jiϕ i ui( )ϕ j u j( ) b j λ( )ϕ′ j λ( ) λd

0

u j

∫j=1

N

∑–j=1

N

∑i=1

N

∑=

ϕ′ j.( ) ϕ j

.( )

∂∂t-----ϕ j u j( )

∂u j

∂t-------- ∂

u j-----ϕ j u j( )=

∂u j

∂t--------ϕ′ j u j( )=

∂∂t----- b j λ( )ϕ′ j λ( ) λd

0

u j

∫∂u j

∂t-------- ∂

u j----- b j λ( )ϕ′ j λ( ) λd

0

u j

∫=

∂u j

∂t--------b j u j( )ϕ′ j u j( )=

∂E ∂t⁄

∂E∂t-------

∂u j

∂t-------- c jiϕ′ j u j( ) b j u j( )ϕ′ j u j( )

j=1

N

∑–j=1

N

∑i=1

N

=

∂u j

∂t-------- a j u j( ) b j u j( ) c jiϕ i ui( )

j=1

N

∑–

=

∂E∂t------- a j u j( )ϕ′ j u j( ) b j u j( ) c jiϕ i ui( )

i=1

N

∑– 2

j=1

N

∑–=

12

Page 91: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

aj(uj) > 0 for all uj

and the function satisfies the monotonicity condition

for all uj,

we then immediately see from (4) that

for all t

In words, the function E defined in(1) is the Lyapunov function for the coupled system ofnonlinear differential equations (3).

Problem 13.12

From (13.61) of the text:

, j = 1, 2, ..., N (1)

where

where δji is a Kronecker delta. According to the Cohen-Grossberg theorem of (13.47) in the text,we have

(2)

Comparison of (1) and (2) yields the following correspondences between the Cohen-Grossbergtheorem and the brain-in-state-box (BSB) model:

Therefore, using these correspondences in (13.48) of thetext:

Cohen-Grossberg Theorem BSB Modeluj vj

aj(uj) 1

bj(uj) -vj

cji -cji

ϕ′ j u j( )

ϕ′ j u j( ) 0≥

∂E∂t------- 0≤

ddt-----v j t( ) v j t( )– c jiϕ i vi( )

i=1

N

∑+=

c ji δ ji βw ji+=

ddt-----u j t( ) a j u j( ) b j u j( ) c jiϕ i ui( )

i=1

N

∑––=

ϕ i ui( ) ϕ vi( )

13

Page 92: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

,

we get the following Liapunov function for the BSB model:

(3)

From (13.55) in the text, we note that

We therefore have

Hence, the second term of (3) is given by

inside the linear region (4)

The first term of (3) is given by

(5)

Finally, substituting (4) and (5) into (3), we obtain

E12--- c jiϕ i ui( )ϕ j u j( ) b j λ( )ϕ′ j λ( ) λd

u j

∫j

∑–j

∑i

∑=

E -12--- c jiϕ vi( )ϕ v j( ) λϕ′ λ( ) λd

0

v j

∫j

∑+j

∑i

∑=

ϕ y j( )+1 if y j 1>

y j if -1 y j 1≤ ≤

-1 if y j -1≤

=

ϕ′ y j( )0, y j 1>

1, y j 1≤

=

λϕ′ λ( ) λd0

v j

∫j

∑ λ λd0

v j

∫j

∑ 12--- v j

2

j∑= =

12--- x j

2

j∑=

12--- c jiϕ vi( )ϕ v j( )

i∑

j∑–

12--- δ ji βw ji+( )ϕ vi( )ϕ v j( )

i∑

j∑–=

β2---– w jix jxi

12--- ϕ2

v j( )j

∑–i

∑j

∑=

β2--- w jix jxi

12--- x j

2

j∑–

i∑

j∑–=

14

Page 93: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

which is the desired result

Problem 13.13

The activation function of Fig. P13.13 is a nonmonotonic function of the argument v; that is,

assumes both positive and negative values. It therefore violates the monotonicitycondition required by the Cohen-Grossberg theorem; see Eq. (4) of Problem 13.11. This meansthat the cohen-Grossberg theorem is not applicable to an associative memory like a Hopfieldnetwork that uses the activation function of Fig. P14.15.

Eβ2--- w jix jxi

β2---xT Wx–

i∑

j∑–=

ϕ v( )∂ϕ ∂v⁄

15

Page 94: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

CHAPTER 15Dynamically Driven Recurrent Networks

Problem 15.1

Referring to the simple recurrent neural network of Fig. 15.3, let the vector u(n) denote the inputsignal, the vector x(n) denotes the signal produced at the output of the hidden layer, and the vectory(n) denotes the output signal of the whole network. Then, treating x(n) as the state of thenetwork, we may describe the state-space model of the network as follows:

where and are vector-valued functions of their respective arguments.

Problem 15.2

Referring to the recurrent MLP of Fig. 15.4, we note the following:

(1)

(2)

(3)

where , , and are vector-valued functions of their respective arguments.

Substituting (1) into (2), we write

(4)

Define the state of the system at time n as

(5)

Then, from (4) and (5) we immediately see that

(6)

where f is a new vector-valued function. Define the output of the system as

x n 1+( ) f x n( ) u n( ),( )=

y n( ) g x n( )( )=

f .( ) g .( )

xI n 1+( ) f 1 xI n( ) u n( ),( )=

xII n 1+( ) f 2 xII n( ) xI n 1+( ),( )=

x0 n 1+( ) f 3 x0 n( )xII n 1+( )( )=

f 1.( ) f 2

.( ) f 3.( )

xII f 2 xII f 1 xI u n( ),( ),( )=

xII n 1+( )xII n( )

x0 n 1–( )=

x n 1+( ) f x n( ) u n( ),( )=

1

Page 95: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(7)

With x0(n) included in the definition of the state x(n + 1) and with x(n) dependent on the inputu(n), we thus have

(8)

where is another vector valued function. Equations (6) and (8) define the state-space modelof the recurrent MLP.

Problem 15.3

It is indeed possible for a dynamic system to be controllable but unobservable, and vice versa.This statement is justified by virtue of the fact that the conditions for controllability andobservability are entirely different, which means that there are situations where the conditions aresatisfied for one and not for the other.

Problem 15.4

(a) We are given the process equation

Hence, iterating forward in time, we write

and so on. By induction, we may state that the state x(n + q) is a nested nonlinear function of x(n)and uq(n), where

(b) The Jacobian of x(n + q) with respect to uq(n) at the origin, is

y n( ) x0 n( )=

y n( ) g x n( ) u n( ),( )=

g .,.( )

x n 1+( ) φ Wax n( ) wbu n( )+( )=

x n 2+( ) φ Wax n 1+( ) wbu n 1+( )+( )=

φ Waφ Wax n( ) wbu n( )+( ) wbu n 1+( )+( )=

x n 3+( ) φ Wax n 2+( ) wbu n 2+( )+( )=

φ WaφWaφ(Wax n( ) wbu n( ))+ wbu n 1+( )+( ) wbu n 2+( ))+=

uq n( ) u n( ) u n 1+( ) … u n q 1–+( ), ,,[ ] T=

Jq n( ) ∂x n q+( )∂uq n( )

-----------------------x n( ) 0=u n( ) 0=

=

2

Page 96: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

As an illustrative example, consider the cast of q = 3. The Jacobian of x(n + 3) with respect tou3(n) is

From the defining equation of x(n + 3), we find that

All these partial derivatives have been evaluated at x(n) = 0 and u(n) = 0. The Jacobian J3(n) istherefore

We may generalize this result by writing

Problem 15.5

We start with the state-space model

(1)

where c is a column vector. We thus write

J3 n( ) ∂x n 3+( )∂u n( )

----------------------- ∂x n 2+( )∂u n 1+( )----------------------- ∂x n 3+( )

∂u n 2+( )-----------------------, ,

x n( ) 0=u n( ) 0=

=

∂x n 3+( )∂u n( )

----------------------- φ′ 0( )Waφ′ 0( )Waφ′ 0( )wb=

AAb=

A2b=

∂x n 3+( )∂u n 1+( )----------------------- φ′ 0( )Waφ′ 0( )wb=

Ab=

∂x n 3+( )∂u n 2+( )----------------------- φ′ 0( )wb=

b=

J3 n( ) A2b,Ab,b[ ]=

Jq n( ) Aq 1– b,Aq 2– b …,, Ab,b[ ]=

x n 1+( ) φ Wax n( ) wbu n( )+( )=

y n( ) cT x n( )=

y n 1+( ) cT x n 1+( )=

3

Page 97: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

(2)

(3)

and so on. By induction, we may therefore state that y(n + q) is a nested nonlinear function of x(n)and uq(n), where

Define the q-by-1 vector

The Jacobian of yq(n) with respect to x(n), evaluated at the origin, is defined by

As an illustrative example, consider the case of q = 3, for which we have

From (1), we readily find that

From (2), we find that

From (3), we finally find that

cT φ Wax n( ) wbu n( )+( )=

y n 2+( ) cT x n 2+( )=

cT φ Waφ Wax n( ) wbu n( )+( ) wbu n 1+( )+( )=

uq n( ) u n( ) u n 1+( ) …, u n q 1–+( ),,[ ] T=

yq n( ) y n( ) y n 1+( ) … y n q 1–+( ), ,,[ ] T=

Jq n( )∂yq

Tn( )

∂x n( )----------------- x n( ) 0=

u n( ) 0=

=

J3 n( ) ∂y n( )∂x n( )-------------- ∂y n 1+( )

∂x n( )----------------------- ∂y n 2+( )

∂x n( )-----------------------, ,

x n( ) 0=u n( ) 0=

=

∂y n( )∂x n( )-------------- c=

∂y n 1+( )∂x n( )

----------------------- c φ′ 0( )Wa( )T=

cAT=

4

Page 98: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

All these partial derivatives have been evaluated at the origin. We thus write

By induction, we may now state that the Jacobian Jq(n) for observability is, in general,

where c is a column vector and .

Problem 15.6

We are given a nonlinear dynamic system described by

(1)

Suppose x(n) is N-dimensional and u(n) is m-dimensional. Define a new nonlinear dynamicsystem in which the input is of additive form, as shown by

(2)

where

(3)

(4)

and

(5)

∂y n 2+( )∂x n( )

----------------------- c φ′ 0( )Wa( )T φ′ 0( )Wa( )=

cAT AT=

c AT( )2

=

J3 n( ) c c AT cAT,( )2

,[ ]=

Jq n( ) c cAT c AT( )2

… c AT( )q 1–

, , , ,[ ]=

A φ′ 0( )Wa=

x n 1+( ) f x n( ) u n( ),( )=

x ′ n 1+( ) f ′ x ′ n( )( ) u ′ n( )+=

x ′ n( ) x n( )u n 1–( )

=

u ′ n( ) 0u n( )

=

f ′ x ′ n( )( ) f x n( ) u n( ),( )0

=

5

Page 99: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Both and are (N + m)-dimensional, and the first N elements of are zero. Fromthese definitions, we readily see that

which is in perfect agreement with the description of the original nonlinear dynamic systemdefined in (1).

Problem 15.7

(a) The state-space model of the local activation feedback system of Fig. P15.7a depends on howthe linear dynamic component is described. For example, we may define the input as

(1)

where B is a (p-1)-by-(p-1) matrix and

Let w denote the synaptic weight vector of the single neuron in Fig. P15.7a, with w1 being the firstelement and w0 denoting the rest. we may then write

(2)

where

and

x ′ n( ) u ′ n( ) u ′ n( )

x ′ n 1+( ) x n 1+( )u n( )

=

f x n( ) u n( ),( )0

0u n( )

= =

z n( ) x n 1–( )Bu n( )

=

u n( ) u n( ) u n 1–( ) … u n p– 2+( ), ,,[ ] T=

x n( ) wT z n( ) b+=

w1 w0T,[ ] x n 1–( )

Bu n( )b+=

w1x n 1–( ) B ′u ′ n( )+=

u ′ n( ) u ′ n( )1

=

6

Page 100: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

The output y(n) is defined by

(3)

Equations (2) and (3) define the state-space model of Fig. P15.7a, assuming that its lineardynamic component is described by (1).

(b) Consider next the local output feedback system of Fig. 15.7b. Let the linear dynamiccomponent of this system be described by (1). The output of the whole system in Fig. 15.7b isthen defined by

(4)

where w1, w0, , and are all as defined previously. The output y(n) of Fig. P15.7b is

(5)

Equations (4) and (5) define the state-space model of the local output feedback system of Fig.P15.7b, assuming that its linear dynamic component is described by (1).

The process (state) equation of the local feedback system of Fig. P15.7a is linear but itsmeasurement equation is nonlinear, and conversely for the local feedback system of Fig. P15.7b.These two local feedback systems are controllable and observable, because they both satisfy theconditions for controllability and observability.

Problem 15.8

We start with the state equation

Hence, we write

B ′ w0T B b,[ ]=

y n( ) ϕ x n( )( )=

x n( ) φ wT z n( ) b+( )=

φ w1 w0T,[ ] x n 1–( )

Bu n( )b+

=

φ w1x n 1–( ) B ′u ′ n( )+( )=

B ′ u ′ n( )

y n( ) x n( )=

x n 1+( ) φ Wax n( ) wbu n( )+( )=

x n 2+( ) φ Wax n 1+( ) wbu n 1+( )+( )=

φ Waφ Wax n( ) wbu n( )+( ) wbu n 1+( )+( )=

7

Page 101: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

and so on.

By induction, we may now state that x(n + q) is a nested nonlinear function of x(n) and uq(n), andthus write

where g is a vector-valued function, and

By definition, the output is correspondingly given by

where Φ is a new scalar-valued nonlinear function.

x n 3+( ) φ Wax n 2+( ) wbu n 2+( )+( )=

φ Waφ Waφ Wax n( ) wbu n( )+( ) wbu n 1+( )+( ) wbu n 2+( )+( )=

x n q+( ) g x n( )uq n( )( )=

uq n( ) u n( ) u n 1+( ) … u n q 1–+( ), ,,[ ] T=

y n q+( ) cT x n q+( )=

cT g x n( )uq n( )( )=

Φ x n( ) uq n( ),( )=

8

Page 102: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 15.11

Consider a state-space model described by

(1)

(2)

Using (1), we may readily write

and so on. Accordingly, the simple recurrent network of Fig. 15.3 may be unfolded in time asfollows:

Problem 15.12

The local gradient for the hybrid form of the BPTT algorithm is given by

where is the number of additional steps taken before performing the next BPTT computation,

with .

x n 1+( ) f x n( ) u n( ),( )=

y n( ) g x n( )( )=

x n( ) f x n 1–( ) u n 1–( ),( )=

x n 1–( ) f x n 2–( ) u n 2–( ),( )=

x n 2–( ) f x n 3–( ) u n 3–( ),( )=

x(n-3)

u(n-3)

x(n-2)

u(n-2)

x(n-1)

u(n-1)

x(n) y(n)

f( ).

f( ).

f( ). g( ) .

Figure : Problem 15.11

δ j l( )

ϕ′ v j l( )( )e j l( ) for l n=

ϕ′ v j l( )( ) e j l( ) wkj l( )δl l 1+( )k∑+

for n h l n< <–

ϕ′ v j l( )( ) wkj l( )δl l 1+( )k∑ for n h l j h ′–< <–

=

h ′h ′ h<

9

Page 103: Haykin,Xue-Neural Networks and Learning Machines 3ed Soln

Problem 15.13

(a) The nonlinear state dynamics of the real-time recurrent learning algorithm of described in(15.48) and (15.52) olf the text may be reformulated in the equivalent form:

(1)

where is the Kronecker delta and yj(n + 1) is the output of neuron j at time n + 1. For a

teacher-forced recurrent network, we have

(2)

Hence, substituting (2) into (1), we get

(3)

(b) Let

Provided that the learning-rate parameter is small enough, we may put

Under this condition, we may rewrite (3) as follows:

(4)

This nonlinear state equation is the centerpiece of the RTRL algorithm using teacher forcing.

∂ y j n 1+( )∂wkl n( )

-------------------------- ϕ′ v j n( )( ) w ji n( )∂ξ i n( )

∂wkl n( )------------------- δkjξ l n( )+

i A B∈ ∈∑=

δkj

ξ i n( )ui n( ) if i A∈

di n( ) if i C∈

yi n( ) if i B-C∈

=

∂ y j n 1+( )∂wkl n( )

-------------------------- ϕ′ v j n( )( ) w ji n( )∂ yi n( )∂wkl n( )------------------- δkjξ l n( )+

i B-C∈∑=

πklj

n( )∂ yi n( )∂wkl n( )-------------------=

η

πklj

n 1+( )∂ yi n 1+( )∂wkl n 1+( )----------------------------

∂ yi n 1+( )∂wkl n( )

-------------------------≈=

πklj

n 1+( ) φ′ v j n( )( ) w ji n( )πklj

n( ) δkjξ l n( )+i B-C∈∑=

10