Haykin,Xue-Neural Networks and Learning Machines 3ed Soln
SOLUTIONS MANUAL
THIRD EDITION
Neural Networks and Learning Machines
Simon Haykin and Yanbo Xue, McMaster University, Canada

CHAPTER 1
Rosenblatt's Perceptron
Problem 1.1

(1) If wᵀ(n)x(n) > 0, then y(n) = +1. If also x(n) belongs to C1, then d(n) = +1. Under these conditions, the error signal is

e(n) = d(n) - y(n) = 0

and from Eq. (1.22) of the text:

w(n + 1) = w(n) + ηe(n)x(n) = w(n)

This result is the same as line 1 of Eq. (1.5) of the text.

(2) If wᵀ(n)x(n) < 0, then y(n) = -1. If also x(n) belongs to C2, then d(n) = -1. Under these conditions, the error signal e(n) remains zero, and so from Eq. (1.22) we have

w(n + 1) = w(n)

This result is the same as line 2 of Eq. (1.5).

(3) If wᵀ(n)x(n) > 0 and x(n) belongs to C2, we have

y(n) = +1, d(n) = -1

The error signal e(n) is -2, and so Eq. (1.22) yields

w(n + 1) = w(n) - 2ηx(n)

which has the same form as the first line of Eq. (1.6), except for the scaling factor 2.

(4) Finally, if wᵀ(n)x(n) < 0 and x(n) belongs to C1, then

y(n) = -1, d(n) = +1

In this case, the use of Eq. (1.22) yields

w(n + 1) = w(n) + 2ηx(n)

which has the same mathematical form as line 2 of Eq. (1.6), except for the scaling factor 2.
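As a quick check of the four cases, here is a small NumPy sketch (illustrative code, not part of the manual; the signum output and the ±1 targets follow Eq. (1.22)):

```python
import numpy as np

def perceptron_step(w, x, d, eta=1.0):
    """One error-correction update, Eq. (1.22): w(n+1) = w(n) + eta*e(n)*x(n)."""
    y = 1 if w @ x > 0 else -1      # signum output of the perceptron
    e = d - y                       # error signal: 0 when correct, else +/-2
    return w + eta * e * x, e

# Cases (1)-(2): correct classification leaves w unchanged.
w = np.array([0.5, -0.2])
x = np.array([1.0, 1.0])
w_same, e_same = perceptron_step(w, x, d=+1)    # w.x = 0.3 > 0, so y = +1 = d

# Cases (3)-(4): misclassification moves w by 2*eta*x (note the factor 2).
w_moved, e_moved = perceptron_step(w, x, d=-1)
```

When the example is on the correct side, e = 0 and the weight vector stays put; otherwise the correction is twice the usual perceptron step, matching the "scaling factor 2" remark above.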
Problem 1.2

The output signal is defined by

y = tanh(v/2) = tanh(b/2 + (1/2) Σ_i w_i x_i)

Equivalently, we may write

b + Σ_i w_i x_i = y′   (1)

where

y′ = 2 tanh⁻¹(y)

Equation (1) is the equation of a hyperplane.

Problem 1.3

(a) AND operation: Truth Table 1

Inputs       Output
x1   x2   |    y
0    0    |    0
0    1    |    0
1    0    |    0
1    1    |    1

This operation may be realized using the perceptron of Fig. 1.

Figure 1: Problem 1.3 (hard-limiter perceptron with w1 = 1, w2 = 1, b = -1.5)

The hard limiter input is

v = w1 x1 + w2 x2 + b = x1 + x2 - 1.5

If x1 = x2 = 1, then v = 0.5, and y = 1
If x1 = 0 and x2 = 1, then v = -0.5, and y = 0
If x1 = 1 and x2 = 0, then v = -0.5, and y = 0
If x1 = x2 = 0, then v = -1.5, and y = 0

These conditions agree with truth table 1.
OR operation: Truth Table 2

Inputs       Output
x1   x2   |    y
0    0    |    0
0    1    |    1
1    0    |    1
1    1    |    1

The OR operation may be realized using the perceptron of Fig. 2.

Figure 2: Problem 1.3 (hard-limiter perceptron with w1 = 1, w2 = 1, b = -0.5)

In this case, the hard limiter input is

v = x1 + x2 - 0.5

If x1 = x2 = 1, then v = 1.5, and y = 1
If x1 = 0 and x2 = 1, then v = 0.5, and y = 1
If x1 = 1 and x2 = 0, then v = 0.5, and y = 1
If x1 = x2 = 0, then v = -0.5, and y = 0

These conditions agree with truth table 2.

COMPLEMENT operation: Truth Table 3

Input     Output
x    |      y
0    |      1
1    |      0

The COMPLEMENT operation may be realized as in Figure 3.

Figure 3: Problem 1.3 (hard-limiter perceptron with w1 = -1, b = 0.5)

The hard limiter input is

v = wx + b = -x + 0.5

If x = 1, then v = -0.5, and y = 0
If x = 0, then v = 0.5, and y = 1

These conditions agree with truth table 3.

(b) EXCLUSIVE OR operation: Truth Table 4

Inputs       Output
x1   x2   |    y
0    0    |    0
0    1    |    1
1    0    |    1
1    1    |    0

This operation is not linearly separable, and therefore it cannot be solved by the perceptron.
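The three single-perceptron gates above can be verified with a short sketch (illustrative code, not from the manual; the weights and biases are those of Figs. 1 to 3):

```python
import numpy as np

def perceptron(x, w, b):
    """Hard-limiter perceptron: y = 1 if v = w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

AND = lambda x1, x2: perceptron([x1, x2], [1, 1], -1.5)   # Fig. 1
OR  = lambda x1, x2: perceptron([x1, x2], [1, 1], -0.5)   # Fig. 2
NOT = lambda x:      perceptron([x],      [-1],   0.5)    # Fig. 3

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
```

No such single-line assignment of weights exists for EXCLUSIVE OR, which is the point of part (b).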
Problem 1.4

The Gaussian classifier consists of a single unit with a single weight and zero bias, determined in accordance with Eqs. (1.37) and (1.38) of the textbook, respectively, as follows:

w = (1/σ²)(μ1 - μ2) = -20

b = (1/(2σ²))(μ2² - μ1²) = 0

Problem 1.5

Using the condition

C = σ²I

in Eqs. (1.37) and (1.38) of the textbook, we get the following formulas for the weight vector and bias of the Bayes classifier:

w = (1/σ²)(μ1 - μ2)

b = (1/(2σ²))(‖μ2‖² - ‖μ1‖²)
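The formulas of Problems 1.4 and 1.5 can be exercised numerically (a sketch with made-up means and variance, not the values of the problems; the decision rule wᵀx + b > 0 for class C1 is the standard one for equal priors):

```python
import numpy as np

# Bayes classifier weights for C = sigma^2 * I; mu1, mu2, sigma2 are
# illustrative values, not taken from the problem statement.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 2.0])
sigma2 = 1.0

w = (mu1 - mu2) / sigma2                      # Eq. (1.37)
b = (mu2 @ mu2 - mu1 @ mu1) / (2 * sigma2)    # Eq. (1.38)

def classify(x):
    """Decide class C1 if w.x + b > 0, else C2."""
    return 1 if w @ x + b > 0 else 2
```

As a sanity check, each class mean should be assigned to its own class.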
CHAPTER 4
Multilayer Perceptrons
Problem 4.1

Assume that each neuron is represented by a McCulloch-Pitts model. Also assume that

x_i = 1 if the input bit is 1, and x_i = 0 if the input bit is 0.

The induced local field of neuron 1 is

v1 = x1 + x2 - 1.5

We may thus construct the following table:

x1:    0     0     1     1
x2:    0     1     0     1
v1:  -1.5  -0.5  -0.5   0.5
y1:    0     0     0     1

The induced local field of output neuron 2 is

v2 = x1 + x2 - 2y1 - 0.5

Accordingly, we may construct the following table:

x1:    0     0     1     1
x2:    0     1     0     1
y1:    0     0     0     1
v2:  -0.5   0.5   0.5  -0.5
y2:    0     1     1     0

Figure 4: Problem 4.1 (hidden neuron 1: weights +1, +1, bias -1.5; output neuron 2: weights +1, +1 from the inputs, -2 from neuron 1, bias -0.5)

From this table we observe that the overall output y2 is 0 if x1 and x2 are both 0 or both 1, and it is 1 if x1 is 0 and x2 is 1 or vice versa. In other words, the network of Fig. P4.1 operates as an EXCLUSIVE OR gate.
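The two-neuron network can be traced directly in code (an illustrative sketch; the weights and biases are those of Fig. P4.1 as reconstructed above):

```python
def heaviside(v):
    """McCulloch-Pitts hard limiter: outputs 1 if v > 0, else 0."""
    return int(v > 0)

def xor_net(x1, x2):
    # Hidden neuron 1: v1 = x1 + x2 - 1.5, fires only when x1 = x2 = 1 (an AND).
    y1 = heaviside(x1 + x2 - 1.5)
    # Output neuron 2: v2 = x1 + x2 - 2*y1 - 0.5 (an OR with a -2 veto from y1).
    y2 = heaviside(x1 + x2 - 2 * y1 - 0.5)
    return y2
```

The hidden neuron detects the "both on" case and vetoes the OR-like output neuron, which is exactly how the hidden layer buys the nonlinearity a single perceptron lacks.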
Problem 4.2

Figure 1 shows the evolutions of the free parameters (synaptic weights and biases) of the neural network as the back-propagation learning process progresses; each epoch corresponds to 100 iterations. From the figure, we see that the network reaches a steady state after about 25 epochs. Each neuron uses a logistic function for its sigmoid nonlinearity. Also, the desired response is defined as

d = 0.9 for symbol (bit) 1
d = 0.1 for symbol (bit) 0

Figure 2 shows the final form of the neural network. Note that we have used biases (the negatives of thresholds) for the individual neurons.

Figure 1: Problem 4.2, where one epoch = 100 iterations (plot omitted)

Figure 2: Problem 4.2 (final network: w11 = -4.72, w12 = -4.24, w21 = -3.51, w22 = -3.52, w31 = -6.80, w32 = 6.44, b1 = 1.6, b2 = 5.0, b3 = -2.85)
Problem 4.3

If the momentum constant α is negative, Equation (4.43) of the text becomes

Δw_ji(n) = -η Σ_{t=0}^{n} α^{n-t} ∂E(t)/∂w_ji(t)
         = -η Σ_{t=0}^{n} (-1)^{n-t} |α|^{n-t} ∂E(t)/∂w_ji(t)

Now we find that if the derivative ∂E/∂w_ji has the same algebraic sign on consecutive iterations of the algorithm, the magnitude of the exponentially weighted sum is reduced. The opposite is true when ∂E/∂w_ji alternates its algebraic sign on consecutive iterations. Thus, the effect of the momentum constant α is the same as before, except that the effects are reversed compared to the case when α is positive.

Problem 4.4

From Eq. (4.43) of the text we have

Δw_ji(n) = -η Σ_{t=0}^{n} α^{n-t} ∂E(t)/∂w_ji(t)   (1)

For the case of a single weight, the cost function is defined by

E = k1(w - w0)² + k2

Hence, the application of (1) to this case yields

Δw(n) = -2k1η Σ_{t=0}^{n} α^{n-t} (w(t) - w0)

In this case, the partial derivative ∂E(t)/∂w(t) has the same algebraic sign on consecutive iterations. Hence, with 0 < α < 1, the exponentially weighted adjustment Δw(n) to the weight w at time n grows in magnitude; that is, the weight w is adjusted by a large amount. The inclusion of the momentum constant α in the algorithm for computing the optimum weight w* = w0 tends to accelerate the downhill descent toward this optimum point.

Problem 4.5

Consider Fig. 4.14 of the text, which has an input layer, two hidden layers, and a single output neuron. Denote the output of neuron k in layer l by y_k^(l), the corresponding induced local field (activation potential) by v_k^(l), and the sigmoid by φ, so that y_k^(l) = φ(v_k^(l)). The output of the whole network is

y_1^(3) = F(w, x)

Hence, the derivative of F(w, x) with respect to the synaptic weight w_1k^(3) connecting neuron k in the second hidden layer to the single output neuron is

∂F/∂w_1k^(3) = (∂F/∂y_1^(3)) (∂y_1^(3)/∂v_1^(3)) (∂v_1^(3)/∂w_1k^(3))   (1)

where v_1^(3) is the activation potential of the output neuron. Next, we note that

∂F/∂y_1^(3) = 1   (2)

Since y_1^(3) = φ(v_1^(3)), we have

∂y_1^(3)/∂v_1^(3) = φ′(v_1^(3))   (3)

Moreover, since v_1^(3) = Σ_k w_1k^(3) y_k^(2), where y_k^(2) is the output of neuron k in layer 2, we may write

∂v_1^(3)/∂w_1k^(3) = y_k^(2) = φ(v_k^(2))   (4)

Thus, combining (1) to (4):

∂F/∂w_1k^(3) = φ′(v_1^(3)) φ(v_k^(2))

Consider next the derivative of F(w, x) with respect to w_kj^(2), the synaptic weight connecting neuron j in layer 1 (i.e., first hidden layer) to neuron k in layer 2 (i.e., second hidden layer):

∂F/∂w_kj^(2) = (∂F/∂y_1^(3)) (∂y_1^(3)/∂v_1^(3)) (∂v_1^(3)/∂y_k^(2)) (∂y_k^(2)/∂v_k^(2)) (∂v_k^(2)/∂w_kj^(2))   (5)

where y_k^(2) is the output of neuron k in layer 2, and v_k^(2) is the activation potential of that neuron. Next we note that

∂F/∂y_1^(3) = 1   (6)

∂y_1^(3)/∂v_1^(3) = φ′(v_1^(3))   (7)

v_1^(3) = Σ_k w_1k^(3) y_k^(2), and hence ∂v_1^(3)/∂y_k^(2) = w_1k^(3)   (8)

y_k^(2) = φ(v_k^(2)), and hence ∂y_k^(2)/∂v_k^(2) = φ′(v_k^(2))   (9)

v_k^(2) = Σ_j w_kj^(2) y_j^(1), and hence ∂v_k^(2)/∂w_kj^(2) = y_j^(1) = φ(v_j^(1))   (10)

Substituting (6) to (10) into (5), we get

∂F/∂w_kj^(2) = φ′(v_1^(3)) w_1k^(3) φ′(v_k^(2)) φ(v_j^(1))

Finally, we consider the derivative of F(w, x) with respect to w_ji^(1), the synaptic weight connecting source node i in the input layer to neuron j in layer 1. We may thus write

∂F/∂w_ji^(1) = (∂F/∂y_1^(3)) (∂y_1^(3)/∂v_1^(3)) (∂v_1^(3)/∂y_j^(1)) (∂y_j^(1)/∂v_j^(1)) (∂v_j^(1)/∂w_ji^(1))   (11)

where y_j^(1) is the output of neuron j in layer 1, and v_j^(1) is the activation potential of that neuron. Next we note that

∂F/∂y_1^(3) = 1   (12)

∂y_1^(3)/∂v_1^(3) = φ′(v_1^(3))   (13)

∂v_1^(3)/∂y_j^(1) = Σ_k w_1k^(3) (∂y_k^(2)/∂y_j^(1)) = Σ_k w_1k^(3) φ′(v_k^(2)) (∂v_k^(2)/∂y_j^(1))   (14)

∂v_k^(2)/∂y_j^(1) = w_kj^(2)   (15)

y_j^(1) = φ(v_j^(1)), and hence ∂y_j^(1)/∂v_j^(1) = φ′(v_j^(1))   (16)

v_j^(1) = Σ_i w_ji^(1) x_i, and hence ∂v_j^(1)/∂w_ji^(1) = x_i   (17)

Substituting (12) to (17) into (11) yields

∂F/∂w_ji^(1) = φ′(v_1^(3)) [Σ_k w_1k^(3) φ′(v_k^(2)) w_kj^(2)] φ′(v_j^(1)) x_i
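The three closed-form derivatives can be checked against a finite-difference estimate on a tiny random network (a sketch, not from the manual; the tanh nonlinearity, layer sizes, and random data are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
phi  = np.tanh                        # one possible choice of sigmoid
dphi = lambda v: 1.0 - np.tanh(v)**2  # its derivative

# Tiny network in the shape of Fig. 4.14: input -> layer 1 -> layer 2 -> output.
W1 = rng.standard_normal((3, 2))      # w_ji^(1)
W2 = rng.standard_normal((4, 3))      # w_kj^(2)
W3 = rng.standard_normal((1, 4))      # w_1k^(3)
x  = rng.standard_normal(2)

def forward(W1, W2, W3, x):
    v1 = W1 @ x;  y1 = phi(v1)
    v2 = W2 @ y1; y2 = phi(v2)
    v3 = W3 @ y2; y3 = phi(v3)        # y3 = F(w, x)
    return v1, y1, v2, y2, v3, y3

v1, y1, v2, y2, v3, y3 = forward(W1, W2, W3, x)

# The derived formulas, vectorized over the relevant indices:
g3 = dphi(v3)[0] * y2                               # phi'(v1^3) phi(v_k^2)
g2 = dphi(v3)[0] * np.outer(W3[0] * dphi(v2), y1)   # phi'(v1^3) w_1k phi'(v_k^2) phi(v_j^1)
g1 = dphi(v3)[0] * np.outer((W3[0] * dphi(v2)) @ W2 * dphi(v1), x)

# Finite-difference check on one first-layer weight:
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p, W2, W3, x)[-1][0] - y3[0]) / eps
```

Agreement between `g1[0, 0]` and `num` confirms the chain-rule bookkeeping in (11) to (17).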
Problem 4.12

According to the conjugate-gradient method, we have

Δw(n) = η(n)p(n)
      = η(n)[-g(n) + β(n-1)p(n-1)]
      ≈ -η(n)g(n) + β(n-1)η(n-1)p(n-1)   (1)

where, in the second term of the last line in (1), we have used η(n-1) in place of η(n). Define

Δw(n-1) = η(n-1)p(n-1)

We may then rewrite (1) as

Δw(n) ≈ -η(n)g(n) + β(n-1)Δw(n-1)   (2)

On the other hand, according to the generalized delta rule, we have for neuron j:

Δw_j(n) = αΔw_j(n-1) + ηδ_j(n)y(n)   (3)

Comparing (2) and (3), we observe that they have a similar mathematical form:

• The vector -g(n) in the conjugate-gradient method plays the role of δ_j(n)y(n), where δ_j(n) is the local gradient of neuron j and y(n) is the vector of inputs for neuron j.

• The time-varying parameter β(n-1) in the conjugate-gradient method plays the role of the momentum constant α in the generalized delta rule.
Problem 4.13

We start with (4.127) in the text:

β(n) = -[sᵀ(n-1) A r(n)] / [sᵀ(n-1) A s(n-1)]   (1)

The residual r(n) is governed by the recursion

r(n) = r(n-1) - η(n-1) A s(n-1)

Equivalently, we may write

-η(n-1) A s(n-1) = r(n) - r(n-1)   (2)

Hence, multiplying both sides of (2) by sᵀ(n-1), we obtain

η(n-1) sᵀ(n-1) A s(n-1) = -sᵀ(n-1)(r(n) - r(n-1)) = sᵀ(n-1) r(n-1)   (3)

where it is noted that (by definition)

sᵀ(n-1) r(n) = 0

Moreover, multiplying both sides of (2) by rᵀ(n), we obtain

-η(n-1) rᵀ(n) A s(n-1) = -η(n-1) sᵀ(n-1) A r(n) = rᵀ(n)(r(n) - r(n-1))   (4)

where it is noted that Aᵀ = A. Dividing (4) by (3) and invoking the use of (1):

β(n) = [rᵀ(n)(r(n) - r(n-1))] / [sᵀ(n-1) r(n-1)]   (5)

which is the Hestenes-Stiefel formula.

In the linear form of the conjugate-gradient method, we have

sᵀ(n-1) r(n-1) = rᵀ(n-1) r(n-1)

in which case (5) is modified to

β(n) = [rᵀ(n)(r(n) - r(n-1))] / [rᵀ(n-1) r(n-1)]   (6)

which is the Polak-Ribière formula. Moreover, in the linear case we have

rᵀ(n) r(n-1) = 0

in which case (6) reduces to the Fletcher-Reeves formula:

β(n) = [rᵀ(n) r(n)] / [rᵀ(n-1) r(n-1)]
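On a linear (quadratic) problem with exact line searches, the three β formulas coincide numerically, which can be demonstrated with a few conjugate-gradient steps (an illustrative sketch; the matrix, right-hand side, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric positive-definite A defines the quadratic cost.
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)
b = rng.standard_normal(5)

x = np.zeros(5)
r = b - A @ x          # residual
s = r.copy()           # initial direction s(0) = r(0)
for _ in range(3):
    eta = (r @ s) / (s @ A @ s)          # exact line search
    x = x + eta * s
    r_new = r - eta * (A @ s)            # residual recursion, Eq. (2)
    beta_hs = -(s @ A @ r_new) / (s @ A @ s)         # Eq. (1)/(5)
    beta_pr = (r_new @ (r_new - r)) / (r @ r)        # Polak-Ribiere, Eq. (6)
    beta_fr = (r_new @ r_new) / (r @ r)              # Fletcher-Reeves
    s = r_new + beta_fr * s
    r = r_new
```

In the nonlinear (non-quadratic) setting the three formulas genuinely differ, which is why the Polak-Ribière and Fletcher-Reeves variants exist as separate algorithms.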
Problem 4.15

In this problem, we explore the operation of a fully connected multilayer perceptron trained with the back-propagation algorithm. The network has a single hidden layer. It is trained to realize the following one-to-one mappings:

(a) Inversion: f(x) = 1/x, 1 < x < 100
(b) Logarithmic computation: f(x) = log10(x), 1 < x < 10
(c) Exponentiation: f(x) = exp(-x), 1 < x < 10
(d) Sinusoidal computation: f(x) = sin(x), 0 ≤ x ≤ π/2

(a) f(x) = 1/x for 1 < x < 100

The network is trained with:
learning-rate parameter η = 0.3, and
momentum constant α = 0.7.

Ten different network configurations were trained to learn this mapping. Each network was trained identically, that is, with the same η and α, with bias terms, and with 10,000 passes of the training vectors (with one exception noted below). Once each network was trained, the test dataset was applied to compare the performance and accuracy of each configuration. Table 1 summarizes the results obtained:

Table 1
Number of hidden neurons    Average percentage error at the network output
  3                         4.73%
  4                         4.43%
  5                         3.59%
  7                         1.49%
 10                         1.12%
 15                         0.93%
 20                         0.85%
 30                         0.94%
100                         0.90%
 30 (100,000 passes)        0.19%

The results of Table 1 indicate that even with a small number of hidden neurons, and with a relatively small number of training passes, the network is able to learn the mapping described in (a) quite well.

(b) f(x) = log10(x) for 1 < x < 10

The results of this second experiment are presented in Table 2:

Table 2
Number of hidden neurons    Average percentage error at the network output
  2                         2.55%
  3                         2.09%
  4                         0.46%
  5                         0.48%
  7                         0.85%
 10                         0.42%
 15                         0.85%
 20                         0.96%
 30                         1.26%
100                         1.18%
 30 (100,000 passes)        0.41%

Here again, we see that the network performs well even with a small number of hidden neurons. Interestingly, in this second experiment the network peaked in accuracy with 10 hidden neurons, after which the accuracy of the network started to decrease.

(c) f(x) = exp(-x) for 1 < x < 10

The results of this third experiment (using the logistic function, as with experiments (a) and (b)) are summarized in Table 3:

Table 3
Number of hidden neurons    Average percentage error at the network output
  2                         244.00%
  3                         185.17%
  4                         134.85%
  5                         133.67%
  7                         141.65%
 10                         158.77%
 15                         151.91%
 20                         144.79%
 30                         137.35%
100                          98.09%
 30 (100,000 passes)        103.99%

These results are unacceptable, since the network is unable to generalize when each neuron is driven to its limits.

The experiment with 30 hidden neurons and 100,000 training passes was repeated, but this time the hyperbolic tangent function was used as the nonlinearity. The result obtained this time was an average percentage error of 3.87% at the network output. This last result shows that the hyperbolic tangent function is a better choice than the logistic function as the sigmoid function for realizing the mapping f(x) = exp(-x).

(d) f(x) = sin(x) for 0 ≤ x ≤ π/2

Finally, the following results were obtained using the logistic function with 10,000 training passes, except for the last configuration:

Table 4
Number of hidden neurons    Average percentage error at the network output
  2                         1.63%
  3                         1.25%
  4                         1.18%
  5                         1.11%
  7                         1.07%
 10                         1.01%
 15                         1.01%
 20                         0.72%
 30                         1.21%
100                         3.19%
 30 (100,000 passes)        0.40%

The results of Table 4 show that the accuracy of the network peaks around 20 neurons, after which the accuracy decreases.
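The experimental setup above can be sketched in miniature: a one-hidden-layer network with logistic hidden units, trained by batch backprop with the stated η = 0.3 and α = 0.7 to fit f(x) = 1/x (everything else here, including the input scaling, hidden-layer size, and epoch count, is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.uniform(1.0, 100.0, size=(200, 1))
d = 1.0 / x
xs = x / 100.0                      # scale inputs into a workable range

H = 10                              # hidden neurons (one of the tried sizes)
W1 = rng.standard_normal((1, H)) * 0.5; b1 = np.zeros(H)
W2 = rng.standard_normal((H, 1)) * 0.5; b2 = np.zeros(1)
eta, alpha = 0.3, 0.7
vel = [np.zeros_like(p) for p in (W1, b1, W2, b2)]

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

losses = []
for epoch in range(2000):
    h = sigmoid(xs @ W1 + b1)       # hidden activations
    y = h @ W2 + b2                 # linear output unit
    err = y - d
    losses.append(float(np.mean(err ** 2)))
    # Backprop gradients, averaged over the batch.
    gW2 = h.T @ err / len(xs); gb2 = err.mean(0)
    dh = (err @ W2.T) * h * (1 - h)
    gW1 = xs.T @ dh / len(xs); gb1 = dh.mean(0)
    for i, (p, g) in enumerate(zip((W1, b1, W2, b2), (gW1, gb1, gW2, gb2))):
        vel[i] = alpha * vel[i] - eta * g   # momentum update, Eq. (4.43)
        p += vel[i]
```

This is only a sketch of the training loop, not a reproduction of the manual's numbers; the tables above came from much longer runs on a held-out test set.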
CHAPTER 5
Kernel Methods and Radial-Basis Function Networks
Problem 5.9

The expected square error is given by

J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^{m0}} (f(x_i) - F(x_i, ξ))² f_ξ(ξ) dξ

where f_ξ(ξ) is the probability density function of a noise distribution in the input space R^{m0}. It is reasonable to assume that the noise vector ξ is additive to the input data vector x. Hence, we may define the cost function J(F) as

J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^{m0}} (f(x_i) - F(x_i + ξ))² f_ξ(ξ) dξ   (1)

where (for convenience of presentation) we have interchanged the order of summation and integration, which is permissible because both operations are linear. Let

z = x_i + ξ, or equivalently ξ = z - x_i

Hence, we may rewrite (1) in the equivalent form

J(F) = (1/2) ∫_{R^{m0}} Σ_{i=1}^{N} (f(x_i) - F(z))² f_ξ(z - x_i) dz   (2)

Note that the subscript ξ in f_ξ(·) merely refers to the "name" of the noise distribution and is therefore untouched by the change of variables. Differentiating (2) with respect to F, setting the result equal to zero, and finally solving for F(z), we get the optimal estimator

F(z) = [Σ_{i=1}^{N} f(x_i) f_ξ(z - x_i)] / [Σ_{i=1}^{N} f_ξ(z - x_i)]

This result bears a close resemblance to the Watson-Nadaraya estimator.
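The optimal estimator has an immediate implementation: a normalized kernel-weighted average of the training values (a sketch; the Gaussian kernel standing in for the noise density f_ξ, the bandwidth, and the sine training data are all illustrative choices):

```python
import numpy as np

def nadaraya_watson(z, x_train, f_train, sigma=0.25):
    """F(z) = sum_i f(x_i) k(z - x_i) / sum_i k(z - x_i), with a Gaussian
    kernel k playing the role of the noise density f_xi."""
    k = np.exp(-(z - x_train) ** 2 / (2.0 * sigma ** 2))
    return float((f_train * k).sum() / k.sum())

x_train = np.linspace(0.0, 3.0, 31)
f_train = np.sin(x_train)
est = nadaraya_watson(1.5, x_train, f_train)
```

With a reasonably narrow kernel, the estimate at an interior point tracks the underlying function closely; widening `sigma` trades variance for smoothing bias.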
CHAPTER 6
Support Vector Machines
Problem 6.1

From Eqs. (6.2) in the text we recall that the optimum weight vector w_o and optimum bias b_o satisfy the following pair of conditions:

w_oᵀx_i + b_o ≥ +1 for d_i = +1
w_oᵀx_i + b_o ≤ -1 for d_i = -1

where i = 1, 2, ..., N. Equivalently, we may write

min_{i=1,2,...,N} |w_oᵀx_i + b_o| = 1

as the defining condition for the pair (w_o, b_o).

Problem 6.2

In the context of a support vector machine, we note the following:

1. Misclassification of patterns can only arise if the patterns are nonseparable.
2. If the patterns are nonseparable, it is possible for a pattern to lie inside the margin of separation and yet be on the correct side of the decision boundary. Hence, nonseparability does not necessarily mean misclassification.

Problem 6.3

We start with the primal problem formulated as follows (see Eq. (6.15) of the text):

J(w, b, α) = (1/2)wᵀw - Σ_{i=1}^{N} α_i d_i wᵀx_i - b Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i   (1)

Recall from (6.12) in the text that

w = Σ_{i=1}^{N} α_i d_i x_i

Premultiplying w by wᵀ:

wᵀw = Σ_{i=1}^{N} α_i d_i wᵀx_i   (2)

We may also write

wᵀ = Σ_{i=1}^{N} α_i d_i x_iᵀ

Accordingly, we may redefine the inner product wᵀw as the double summation:

wᵀw = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_jᵀx_i   (3)

Thus, substituting (2) and (3) into (1) yields

Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_jᵀx_i   (4)

subject to the constraint

Σ_{i=1}^{N} α_i d_i = 0

Recognizing that α_i ≥ 0 for all i, we see that (4) is the formulation of the dual problem.

Problem 6.4

Consider a support vector machine designed for nonseparable patterns. Assuming the use of the "leave-one-out method" for training the machine, the following situations may arise when the example left out is used as a test example:

1. The example is a support vector.
   Result: Correct classification.
2. The example lies inside the margin of separation, but on the correct side of the decision boundary.
   Result: Correct classification.
3. The example lies inside the margin of separation, but on the wrong side of the decision boundary.
   Result: Incorrect classification.
Problem 6.5

By definition, a support vector machine is designed to maximize the margin of separation between the examples drawn from different classes. This definition applies to all sources of data, be they noisy or otherwise. It follows that, by its very nature, the support vector machine is robust to the presence of additive noise in the data used for training and testing, provided that all the data are drawn from the same population.
Problem 6.6

Since the Gram matrix K = {K(x_i, x_j)} is a symmetric square matrix, it can be diagonalized using the similarity transformation

K = QΛQᵀ

where Λ is a diagonal matrix consisting of the eigenvalues of K and Q is an orthogonal matrix whose columns are the associated eigenvectors. With K being a positive (semidefinite) matrix, Λ has nonnegative entries. The inner-product (i.e., Mercer) kernel k(x_i, x_j) is the ij-th element of matrix K. Hence,

k(x_i, x_j) = (QΛQᵀ)_ij = Σ_{l=1}^{m1} (Q)_il (Λ)_ll (Qᵀ)_lj = Σ_{l=1}^{m1} (Q)_il (Λ)_ll (Q)_jl   (1)

Let u_i denote the ith row of matrix Q. (Note that u_i is not an eigenvector.) We may then rewrite (1) as the inner product

k(x_i, x_j) = u_iᵀ Λ u_j = (Λ^(1/2) u_i)ᵀ (Λ^(1/2) u_j)   (2)

where Λ^(1/2) is the square root of Λ. By definition, we have

k(x_i, x_j) = φᵀ(x_i) φ(x_j)   (3)

Comparing (2) and (3), we deduce that the mapping from the input space to the hidden (feature) space of a support vector machine is described by

φ: x_i → Λ^(1/2) u_i

Problem 6.7

(a) From the solution to Problem 6.6, we have the mapping

φ: x_i → Λ^(1/2) u_i

Suppose the input vector x_i is multiplied by the orthogonal (unitary) matrix Q. We then have a new mapping φ′ described by

φ′: Qx_i → QΛ^(1/2) u_i

Correspondingly, we may write

k(Qx_i, Qx_j) = (QΛ^(1/2) u_i)ᵀ (QΛ^(1/2) u_j) = (Λ^(1/2) u_i)ᵀ QᵀQ (Λ^(1/2) u_j)   (1)

where u_i is the ith row of Q. From the definition of an orthogonal (unitary) matrix:

Q⁻¹ = Qᵀ, or equivalently, QᵀQ = I

where I is the identity matrix. Hence, (1) reduces to

k(Qx_i, Qx_j) = (Λ^(1/2) u_i)ᵀ (Λ^(1/2) u_j) = k(x_i, x_j)

In words, the Mercer kernel exhibits the unitary invariance property.
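The feature map of Problem 6.6 can be verified numerically: build a Gram matrix, eigendecompose it, and check that the rows Λ^(1/2)u_i reproduce the kernel values (a sketch; the polynomial kernel and random data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2))

# Gram matrix for a polynomial Mercer kernel.
K = (1.0 + X @ X.T) ** 2
lam, Q = np.linalg.eigh(K)          # K = Q diag(lam) Q^T, columns of Q are eigenvectors
lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off

Phi = Q * np.sqrt(lam)              # row i is Lambda^(1/2) u_i, with u_i the i-th row of Q
K_rebuilt = Phi @ Phi.T             # phi(x_i)^T phi(x_j) should reproduce K
```

This is exactly the finite-sample version of Mercer's theorem: the kernel acts as an inner product in the (here, 6-dimensional) feature space spanned by the scaled rows of Q.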
(b) Consider first the polynomial machine described by

k(Qx_i, Qx_j) = ((Qx_i)ᵀ(Qx_j) + 1)^p = (x_iᵀQᵀQx_j + 1)^p = (x_iᵀx_j + 1)^p = k(x_i, x_j)

Consider next the RBF network described by the Mercer kernel:

k(Qx_i, Qx_j) = exp(-(1/(2σ²)) ‖Qx_i - Qx_j‖²)
             = exp(-(1/(2σ²)) (Qx_i - Qx_j)ᵀ(Qx_i - Qx_j))
             = exp(-(1/(2σ²)) (x_i - x_j)ᵀQᵀQ(x_i - x_j))
             = exp(-(1/(2σ²)) (x_i - x_j)ᵀ(x_i - x_j)),  using QᵀQ = I
             = k(x_i, x_j)

Finally, consider the multilayer perceptron described by

k(Qx_i, Qx_j) = tanh(β0 (Qx_i)ᵀ(Qx_j) + β1) = tanh(β0 x_iᵀQᵀQx_j + β1) = tanh(β0 x_iᵀx_j + β1) = k(x_i, x_j)

Thus all three types of the support vector machine, namely the polynomial machine, the RBF network, and the MLP, satisfy the unitary invariance property in their own individual ways.
Problem 6.17

The truth table for the XOR function, operating on a three-dimensional pattern x, is as follows:

Table 1
     Inputs        Desired response
x1   x2   x3   |   y
+1   +1   +1   |   +1
+1   -1   +1   |   -1
-1   +1   +1   |   -1
+1   +1   -1   |   -1
+1   -1   -1   |   +1
-1   +1   -1   |   +1
-1   -1   -1   |   -1
-1   -1   +1   |   +1

To proceed with the support vector machine for solving this multidimensional XOR problem, let the Mercer kernel be

k(x, x_i) = (1 + xᵀx_i)^p

The minimum value of the power p (denoting a positive integer) needed for this problem is p = 3. For p = 2, we end up with a zero weight vector, which is clearly unacceptable.

Setting p = 3, we thus have

k(x, x_i) = (1 + xᵀx_i)³ = 1 + 3xᵀx_i + 3(xᵀx_i)² + (xᵀx_i)³

where

x = [x1, x2, x3]ᵀ

and likewise for x_i. Then, proceeding in a manner similar to, but much more cumbersome than, that described for the two-dimensional XOR problem in Section 6.6, we end up with a polynomial machine defined by

y = x1 x2 x3

This machine satisfies the entries of Table 1.
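The claim that the machine y = x1·x2·x3 reproduces Table 1 is easy to verify exhaustively (a small sketch; the `xor3` helper, defined here for the ±1 coding with +1 as logical 1, is ours, not the manual's):

```python
from itertools import product

def machine(x1, x2, x3):
    """Polynomial machine obtained for p = 3."""
    return x1 * x2 * x3

def xor3(x1, x2, x3):
    """Three-input XOR in the +/-1 coding (+1 plays the role of logical 1)."""
    bits = [v == +1 for v in (x1, x2, x3)]
    return +1 if (bits[0] ^ bits[1] ^ bits[2]) else -1

table = {p: xor3(*p) for p in product((+1, -1), repeat=3)}
```

In the ±1 coding, the parity (XOR) of three bits is literally the product of the inputs, which is why the cubic term of the p = 3 kernel suffices while p = 2 cannot represent it.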
CHAPTER 8
Principal-Components Analysis
Problem 8.5

From Example 8.2 in the text:

λ0 = 1 + σ²   (1)
q0 = s   (2)

The correlation matrix of the input is

R = ssᵀ + σ²I   (3)

where s is the signal vector and σ² is the variance of an element of the additive noise vector. Hence, using (2) and (3):

λ0 = (q0ᵀ R q0)/(q0ᵀ q0)
   = (sᵀ(ssᵀ + σ²I)s)/(sᵀs)
   = ((sᵀs)(sᵀs) + σ²(sᵀs))/(sᵀs)
   = sᵀs + σ²
   = ‖s‖² + σ²   (4)

The vector s is a signal vector of unit length:

‖s‖ = 1

Hence, (4) simplifies to

λ0 = 1 + σ²

which is the desired result given in (1).
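The result can be confirmed with a direct eigendecomposition (a sketch; the particular unit-length signal vector and noise variance are illustrative):

```python
import numpy as np

s = np.array([0.6, 0.8, 0.0])        # unit-length signal vector
sigma2 = 0.25                        # noise variance
R = np.outer(s, s) + sigma2 * np.eye(3)

lam, V = np.linalg.eigh(R)           # eigenvalues in ascending order
lam0 = lam[-1]                       # largest eigenvalue, expected 1 + sigma2
q0 = V[:, -1]                        # its eigenvector, expected +/- s
```

The remaining eigenvalues all equal σ², so the gap between λ0 and the rest is exactly the unit signal power.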
Problem 8.6

From (8.46) in the text we have

w(n + 1) = w(n) + ηy(n)[x(n) - y(n)w(n)]   (1)

As n → ∞, w(n) → q1, and so we deduce from (1) that

x(n) = y(n)q1 for n → ∞   (2)

where q1 is the eigenvector associated with the largest eigenvalue λ1 of the correlation matrix R = E[x(n)xᵀ(n)], and E is the expectation operator. Multiplying (2) by its own transpose and then taking expectations, we get

E[x(n)xᵀ(n)] = E[y²(n)] q1 q1ᵀ

Equivalently, we may write

R = σ_Y² q1 q1ᵀ   (3)

where σ_Y² = E[y²(n)] is the variance of the output y(n). Post-multiplying (3) by q1:

R q1 = σ_Y² q1 q1ᵀ q1 = σ_Y² q1   (4)

where it is noted that ‖q1‖ = 1 by definition. From (4) we readily see that σ_Y² = λ1, which is the desired result.

Problem 8.7

Writing the learning algorithm for minor-components analysis in matrix form:

w(n + 1) = w(n) - ηy(n)[x(n) - y(n)w(n)]

Proceeding in a manner similar to that described in Section 8.5 of the textbook, we have the nonlinear differential equation

dw(t)/dt = [wᵀ(t) R w(t)] w(t) - R w(t)

Define

w(t) = Σ_{k=1}^{m} θ_k(t) q_k   (1)

where q_k is the kth eigenvector of the correlation matrix R = E[x(n)xᵀ(n)] and the coefficient θ_k(t) is the projection of w(t) onto q_k. We may then identify two cases, as summarized here:

Case I: 1 ≤ k < m

For this first case, we define

α_k(t) = θ_k(t)/θ_m(t) for some fixed m   (2)

Accordingly, we find that

dα_k(t)/dt = -(λ_k - λ_m) α_k(t)   (3)

With the eigenvalues of R arranged in decreasing order,

λ1 > λ2 > … > λ_k > … > λ_m > 0

it follows that α_k(t) → 0 as t → ∞.

Case II: k = m

For this second case, we find that

dθ_m(t)/dt = λ_m θ_m(t)(θ_m²(t) - 1) as t → ∞   (4)

Hence, θ_m(t) → ±1 as t → ∞.

Thus, in light of the results derived for cases I and II, we deduce from (1) that, as t → ∞, w(t) → q_m, the eigenvector associated with the smallest eigenvalue λ_m, and

σ_Y² = E[y²(n)] → λ_m
Problem 8.8

From (8.87) and (8.88) of the text:

Δw_j = ηy_j x′ - ηy_j² w_j   (1)

x′ = x - Σ_{k=0}^{j-1} w_k y_k   (2)

where, for convenience of presentation, we have omitted the dependence on time n. Equations (1) and (2) may be represented by a vector-valued signal-flow graph: the input x passes through a chain of summing junctions that subtract w_0 y_0, w_1 y_1, ..., w_{j-1} y_{j-1} in turn to form x′, which is then scaled by ηy_j to produce Δw_j.

Note: In the graph, dashed lines indicate inner (dot) products formed by the input vector x and the pertinent synaptic weight vectors w_0, w_1, ..., w_j to produce y_0, y_1, ..., y_j, respectively.

Problem 8.9

Consider a network consisting of a single layer of neurons with feedforward connections. The algorithm for adjusting the matrix of synaptic weights W(n) of the network is described by the recursive equation (see Eq. (8.91) of the text):

W(n + 1) = W(n) + η(n){y(n)xᵀ(n) - LT[y(n)yᵀ(n)]W(n)}   (1)

where x(n) is the input vector and y(n) is the output vector; LT[·] is a matrix operator that sets all the elements above the diagonal of the matrix argument to zero, thereby making it lower triangular.

First, we note that the asymptotic stability theorem discussed in the text does not apply directly to the convergence analysis of stochastic approximation algorithms involving matrices; it is formulated to apply to vectors. However, we may write the elements of the parameter (synaptic weight) matrix W(n) in (1) as a vector, that is, one column vector stacked on top of another. We may then interpret the resulting nonlinear update equation in a corresponding way and so proceed to apply the asymptotic stability theorem directly.

To prove the convergence of the learning algorithm described in (1), we may use the method of induction to show that if the first j columns of matrix W(n) converge to the first j eigenvectors of the correlation matrix R = E[x(n)xᵀ(n)], then the (j + 1)th column will converge to the (j + 1)th eigenvector of R. Here we use the fact that, in light of the convergence of the maximum eigenfilter involving a single neuron, the first column of the matrix W(n) converges with probability 1 to the first eigenvector of R, and so on.

Problem 8.10

The results of a computer experiment on the training of a single-layer feedforward network using the generalized Hebbian algorithm are described by Sanger (1990). The network has 16 output neurons, and 4096 inputs arranged as a 64 x 64 grid of pixels. The training involved presentation of 2000 samples, which are produced by low-pass filtering a white Gaussian noise image and then multiplying with a Gaussian window function. The low-pass filter was a Gaussian function with standard deviation of 2 pixels, and the window had a standard deviation of 8 pixels.

Figure 1 shows the first 16 receptive field masks learned by the network (Sanger, 1990). In this figure, positive weights are indicated by "white" and negative weights are indicated by "black"; the ordering is left-to-right and top-to-bottom.

The results displayed in Fig. 1 are rationalized as follows (Sanger, 1990):

• The first mask is a low-pass filter, since the input has most of its energy near dc (zero frequency).

• The second mask cannot be a low-pass filter, so it must be a band-pass filter with a mid-band frequency as small as possible, since the input power decreases with increasing frequency.

• Continuing the analysis in the manner described above, the frequency response of successive masks approaches dc as closely as possible, subject (of course) to being orthogonal to previous masks.

The end result is a sequence of orthogonal masks that respond to progressively higher frequencies.

Figure 1: Problem 8.10 (Reproduced with permission of Biological Cybernetics)
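The matrix update (1), which is Sanger's generalized Hebbian algorithm, can be sketched on toy two-dimensional data (an illustrative run, not from the text; the covariance matrix, learning rate, and sample count are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Zero-mean data with a clear dominant direction, covariance C.
C = np.array([[3.0, 1.0],
              [1.0, 2.0]])
L = np.linalg.cholesky(C)
X = rng.standard_normal((5000, 2)) @ L.T

m = 2                                 # number of output neurons
W = rng.standard_normal((m, 2)) * 0.1
eta = 0.01
for x in X:
    y = W @ x
    # Sanger's rule: dW = eta * (y x^T - LT[y y^T] W), LT = lower-triangular part.
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# The rows of W should approach the leading eigenvectors of the correlation matrix.
evals, evecs = np.linalg.eigh(C)
```

The lower-triangular operator implements the implicit deflation: neuron j learns in the subspace left over by neurons 0, ..., j-1, which is what makes the rows converge to successive eigenvectors rather than all collapsing onto the first.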
CHAPTER 9
Self-Organizing Maps
Problem 9.1

Expanding the function g(y_j) in a Taylor series around y_j = 0, we get

g(y_j) = g(0) + g^(1)(0) y_j + (1/2!) g^(2)(0) y_j² + …   (1)

where

g^(k)(0) = ∂^k g(y_j)/∂y_j^k evaluated at y_j = 0, for k = 1, 2, ….

Let

y_j = 1 if neuron j is on, and y_j = 0 if neuron j is off.

Then, we may rewrite (1) as

g(y_j) = g(0) + g^(1)(0) + (1/2!) g^(2)(0) + …  if neuron j is on
g(y_j) = g(0)  if neuron j is off

Correspondingly, we may write

dw_j/dt = ηy_j x - g(y_j) w_j
        = ηx - [g(0) + g^(1)(0) + (1/2!) g^(2)(0) + …] w_j  if neuron j is on
        = -g(0) w_j  if neuron j is off

Consequently, a nonzero g(0) has the effect of making dw_j/dt assume a nonzero value when neuron j is off, which is undesirable. To alleviate this problem, we make g(0) = 0.
Problem 9.2

Assume that y(c) is a minimum L2 (least-squares) distortion vector quantizer for the code vector c. We may then form the distortion function

D2 = (1/2) ∫ f(c) ‖c′(y(c)) - c‖² dc

This distortion function is similar to that of Eq. (10.20) in the text, except for the use of c and c′ in place of x and x′, respectively. We wish to minimize D2 with respect to y(c) and c′(y).

Assuming that π(ν) is a smooth function of the noise vector ν, we may expand the decoder output x′ in a Taylor series. In particular, using a second-order approximation, we get (Luttrell, 1989b)

(1/2) ∫ π(ν) ‖x′(c(x) + ν) - x‖² dν ≈ (1 + (D2/2) ∇²) ‖x′(c) - x‖²   (1)

where

∫ π(ν) dν = 1
∫ ν_i π(ν) dν = 0
∫ ν_i ν_j π(ν) dν = D2 δ_ij

and δ_ij is a Kronecker delta. We now make the following observations:

• The first term on the right-hand side of (1) is the conventional distortion term.
• The second term (i.e., the curvature term) arises due to the output noise model π(ν).

Problem 9.3

Consider the Peano curve shown in part (d) of Fig. 9.9 of the text. This particular self-organizing feature map pertains to a one-dimensional lattice fed with a two-dimensional input. We see that (counting from left to right) neuron 14, say, is quite close to neuron 97. It is therefore possible for a large enough input perturbation to make neuron 14 jump into the neighborhood of neuron 97, or vice versa. If this change were to happen, the topology-preserving property of the SOM algorithm would no longer hold.

For a more convincing demonstration, consider a higher-dimensional, namely three-dimensional, input structure mapped onto a two-dimensional lattice of 10-by-10 neurons. The network is trained with an input consisting of 8 Gaussian clouds with unit variance but different centers. The centers are located at the points (0,0,0), (4,0,0), (4,4,0), (0,4,0), (0,0,4), (4,0,4), (4,4,4), and (0,4,4). The clouds occupy the 8 corners of a cube as shown in Fig. 1a. The resulting labeled feature map computed by the SOM algorithm is shown in Fig. 1b. Although each of the classes is grouped together in the map, the planar feature map fails to capture the complete topology of the input space. In particular, we observe that class 6 is adjacent to class 2 in the input space, but is not adjacent to it in the feature map.

The conclusion to be drawn here is that although the SOM algorithm does perform clustering on the input space, it may not always completely preserve the topology of the input space.

Figure 1: Problem 9.3
Problem 9.4
Consider for example a two-dimensional lattice using the SOM algorithm to learn a two-dimensional input distribution as illustrated in Fig. 9.8 in the textbook. Suppose that the neuron atthe center of the lattice breaks down; this failure may have a dramatic effect on the evolution ofthe feature map. On the other hand, a small perturbation applied to the input space leaves the maplearned by the lattice essentially unchanged.
Problem 9.5
The batch version of the SOM algorithm is defined by
w_j = ( Σ_i π_{j,i} x_i ) / ( Σ_i π_{j,i} )   for some prescribed neuron j   (1)

where π_{j,i} is the discretized version of the pdf of the noise vector ν. From Table 9.1 of the text we recall that π_{j,i} plays a role analogous to that of the neighborhood function. Indeed, we can
substitute h_{j,i(x)} for π_{j,i} in (1). We are interested in rewriting (1) in a form that highlights the role of Voronoi cells. To this end, we note that the dependence of the neighborhood function h_{j,i(x)}, and therefore of π_{j,i}, on the input pattern x is indirect, the dependence being through the Voronoi cell in which x lies. Hence, for all input patterns that lie in a particular Voronoi cell, the same neighborhood function applies. Let each Voronoi cell be identified by an indicator function I_{i,k}, interpreted as follows: I_{i,k} = 1 if the input pattern x_i lies in the Voronoi cell corresponding to winning neuron k. In light of these considerations, we may rewrite (1) in the new form
w_j = ( Σ_k Σ_i π_{j,k} I_{i,k} x_i ) / ( Σ_k Σ_i π_{j,k} I_{i,k} )   (2)

Now let m_k denote the centroid of the Voronoi cell of neuron k, and let N_k denote the number of input patterns that lie in that cell. We may then simplify (2) as

w_j = ( Σ_k π_{j,k} N_k m_k ) / ( Σ_k π_{j,k} N_k ) = Σ_k W_{j,k} m_k   (3)

where W_{j,k} is a weighting function defined by

W_{j,k} = π_{j,k} N_k / ( Σ_k π_{j,k} N_k )   (4)

with

Σ_k W_{j,k} = 1   for all j

Equation (3) bears a close resemblance to the Nadaraya-Watson regression estimator defined in Eq. (5.61) of the textbook. Indeed, in light of this analogy, we may offer the following observations:

• The SOM algorithm is similar to nonparametric regression in a statistical sense.
• Except for the normalizing factor N_k, the discretized pdf π_{j,i}, and therefore the neighborhood function h_{j,i}, plays the role of a kernel in the Nadaraya-Watson estimator.
• The width of the neighborhood function plays the role of the span of the kernel.
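The weighted-centroid form of equations (3) and (4) can be sketched in a few lines of code. The following is a minimal illustration, not the text's implementation: it assumes a one-dimensional lattice, a Gaussian neighborhood playing the role of π_{j,k}, and hypothetical names throughout.

```python
import math

def batch_som_update(weights, data, sigma):
    """One batch-SOM pass: assign each input to the Voronoi cell of its
    winning (nearest) neuron, then recompute every weight vector as a
    neighborhood-weighted average of the cell centroids m_k (Eqs. (3)-(4))."""
    n_neurons = len(weights)
    # Step 1: Voronoi assignment (indicator I_{i,k}).
    cells = {k: [] for k in range(n_neurons)}
    for x in data:
        winner = min(range(n_neurons),
                     key=lambda k: sum((xi - wi) ** 2
                                       for xi, wi in zip(x, weights[k])))
        cells[winner].append(x)
    # Step 2: centroid m_k and count N_k for each nonempty cell.
    stats = {}
    for k, xs in cells.items():
        if xs:
            m = [sum(c) / len(xs) for c in zip(*xs)]
            stats[k] = (m, len(xs))
    # Step 3: w_j = sum_k W_{j,k} m_k with W_{j,k} ∝ pi_{j,k} N_k.
    new_weights = []
    for j in range(n_neurons):
        num = [0.0] * len(weights[j])
        den = 0.0
        for k, (m, n_k) in stats.items():
            h = math.exp(-((j - k) ** 2) / (2.0 * sigma ** 2))  # pi_{j,k}
            den += h * n_k
            num = [a + h * n_k * mi for a, mi in zip(num, m)]
        new_weights.append([a / den for a in num] if den > 0 else weights[j])
    return new_weights
```

For a 3-neuron lattice with two tight data clusters near 0.1 and 0.9, one pass pulls the edge neurons toward the cluster centroids while the middle neuron lands on their neighborhood-weighted average.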
Problem 9.6
In its basic form, Hebb's postulate of learning states that the adjustment Δw_kj applied to the synaptic weight w_kj is defined by

Δw_kj = η y_k x_j

where y_k is the output signal produced in response to the input signal x_j.

The weight update for the maximum eigenfilter includes the term η y_k x_j and, additionally, a stabilizing term defined by −y_k² w_kj. The term η y_k x_j provides for synaptic amplification.

In contrast, in the SOM algorithm two modifications are made to Hebb's postulate of learning:

1. The stabilizing term is set equal to −y_k w_kj.
2. The output y_k of neuron k is set equal to a neighborhood function.

The net result of these two modifications is to make the weight update for the SOM algorithm assume a form similar to that in competitive learning rather than Hebbian learning.

Problem 9.7

In Fig. 1 (shown on the next page), we summarize the density-matching results of a computer simulation on a one-dimensional lattice consisting of 20 neurons. The network is trained with a triangular input density. Two sets of results are displayed in this figure:

1. The standard SOM (Kohonen) algorithm, shown as the solid line.
2. The conscience algorithm, shown as the dashed line; the line labeled "predict" is its straight-line approximation.

In Fig. 1, we have also included the exact result. Although it appears that both algorithms fail to match the input density exactly, we see that the conscience algorithm comes closer to the exact result than the standard SOM algorithm.
Figure 1: Problem 9.7
Problem 9.11
The results of computer simulation for a one-dimensional lattice with a two-dimensional (triangular) input are shown in Fig. 1 on the next page for an increasing number of iterations. The experiment begins with random weights at zero time, and then the neurons start spreading out. Two distinct phases in the learning process can be recognized from this figure:

1. The neurons become ordered (i.e., the one-dimensional lattice becomes untangled), which happens at about 20 iterations.
2. The neurons spread out to match the density of the input distribution, culminating in the steady-state condition attained after 25,000 iterations.
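The two phases above can be reproduced with a compact online SOM. The sketch below is our own illustration, not the simulation used for Fig. 1: it uses a one-dimensional lattice, a uniform (rather than triangular) input for simplicity, and assumed decay schedules for the learning rate and neighborhood width.

```python
import math
import random

def train_som_1d(num_neurons=20, num_iters=5000, seed=0):
    """Online SOM on a 1-D lattice. A wide, shrinking Gaussian neighborhood
    drives the early ordering phase; the small residual learning rate and
    narrow neighborhood give the later density-matching (convergence) phase."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(num_neurons)]   # random initial weights
    for n in range(num_iters):
        x = rng.random()                             # sample the input density
        eta = 0.5 * math.exp(-n / 500.0) + 0.01      # learning-rate schedule
        sigma = max(0.5, (num_neurons / 2.0) * math.exp(-n / 300.0))
        i_win = min(range(num_neurons), key=lambda i: abs(x - w[i]))
        for i in range(num_neurons):
            h = math.exp(-((i - i_win) ** 2) / (2.0 * sigma ** 2))
            w[i] += eta * h * (x - w[i])             # move toward the input
    return w
```

Because every update is a convex step toward an input in [0, 1], the weights remain in [0, 1]; after training they spread out over the input range rather than collapsing to a point.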
Figure 1: Problem 9.11
CHAPTER 10Information-Theoretic Learning Models
Problem 10.1
The maximum-entropy distribution of the random variable X is a uniform distribution over the range [a, b], as shown by

f_X(x) = 1/(b − a),   a ≤ x ≤ b
f_X(x) = 0,   otherwise

Hence,

h(X) = −∫ f_X(x) log f_X(x) dx = (1/(b − a)) ∫_a^b log(b − a) dx = log(b − a)

Problem 10.3

Let

Y_i = a_i^T X_1
Z_i = b_i^T X_2

where the vectors X_1 and X_2 have multivariate Gaussian distributions. The correlation coefficient between Y_i and Z_i is defined by

ρ_i = E[Y_i Z_i] / ( E[Y_i²] E[Z_i²] )^{1/2}
    = a_i^T E[X_1 X_2^T] b_i / { (a_i^T E[X_1 X_1^T] a_i)(b_i^T E[X_2 X_2^T] b_i) }^{1/2}
    = a_i^T Σ_12 b_i / { (a_i^T Σ_11 a_i)(b_i^T Σ_22 b_i) }^{1/2}   (1)
where

Σ_11 = E[X_1 X_1^T]
Σ_12 = E[X_1 X_2^T] = Σ_21^T
Σ_22 = E[X_2 X_2^T]

The mutual information between Y_i and Z_i is defined by

I(Y_i; Z_i) = −(1/2) log(1 − ρ_i²)

Let r denote the rank of the cross-covariance matrix Σ_12. Given the vectors X_1 and X_2, we may invoke the idea of canonical correlations as summarized here:

• Find the pair of random variables Y_1 = a_1^T X_1 and Z_1 = b_1^T X_2 that are most highly correlated.
• Extract the pair of random variables Y_2 = a_2^T X_1 and Z_2 = b_2^T X_2 in such a way that Y_1 and Y_2 are uncorrelated, and so are Z_1 and Z_2.
• Continue these two steps until at most r pairs of variables {(Y_1, Z_1), (Y_2, Z_2), ..., (Y_r, Z_r)} have been extracted.

The essence of the canonical correlations described above is to encapsulate the dependence between the random vectors X_1 and X_2 in the sequence {(Y_1, Z_1), (Y_2, Z_2), ..., (Y_r, Z_r)}. The uncorrelatedness of the pairs in this sequence, that is,

E[Y_i Y_j] = E[Z_i Z_j] = 0   for all j ≠ i

means that the mutual information between the vectors X_1 and X_2 is the sum of the mutual information measures between the individual elements of the pairs {(Y_i, Z_i)}_{i=1}^r. That is, we may write

I(X_1; X_2) = Σ_{i=1}^r I(Y_i; Z_i) + constant
            = −(1/2) Σ_{i=1}^r log(1 − ρ_i²) + constant

where ρ_i is defined by (1).
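The sum-of-pairs identity can be checked numerically for the Gaussian case, where the constant vanishes. The sketch below is our own verification (names and the test covariance are assumptions): the canonical correlations are computed as the singular values of the whitened cross-covariance, and the resulting −(1/2) Σ log(1 − ρ_i²) is compared against the direct Gaussian mutual information (1/2) log(det Σ_11 det Σ_22 / det Σ).

```python
import numpy as np

# Build a positive-definite joint covariance for (X1, X2), each 2-dimensional.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)
S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

# Canonical correlations rho_i = singular values of the whitened
# cross-covariance L11^{-1} S12 L22^{-T}.
L11 = np.linalg.cholesky(S11)
L22 = np.linalg.cholesky(S22)
M = np.linalg.inv(L11) @ S12 @ np.linalg.inv(L22).T
rho = np.linalg.svd(M, compute_uv=False)

# Mutual information two ways: via canonical correlations, and directly.
I_canonical = -0.5 * np.sum(np.log(1 - rho ** 2))
I_direct = 0.5 * np.log(np.linalg.det(S11) * np.linalg.det(S22)
                        / np.linalg.det(Sigma))
```

The two quantities agree exactly (up to floating-point error), since det Σ = det S11 · det S22 · Π_i (1 − ρ_i²) for jointly Gaussian vectors.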
Problem 10.4
Consider a multilayer perceptron with a single hidden layer. Let w_ji denote the synaptic weight of hidden neuron j connected to source node i in the input layer. Let x_{i|α} denote the ith component of the input vector x, given example α. Then the induced local field of neuron j is

v_{j|α} = Σ_i w_ji x_{i|α}   (1)

Correspondingly, the output of hidden neuron j for example α is given by

y_{j|α} = φ(v_{j|α})   (2)

where φ(·) is the logistic function

φ(v) = 1 / (1 + e^{−v})

Consider next the output layer of the network. Let w_kj denote the synaptic weight of output neuron k connected to hidden neuron j. The induced local field of output neuron k is

v_{k|α} = Σ_j w_kj y_{j|α}   (3)

The kth output of the network is therefore

y_{k|α} = φ(v_{k|α})   (4)

The output y_{k|α} is assigned a probabilistic interpretation by writing

p_{k|α} = y_{k|α}   (5)

Accordingly, we may view y_{k|α} as an estimate of the conditional probability that the proposition k is true, given the example α at the input. On this basis, we may interpret

1 − y_{k|α} = 1 − p_{k|α}

as the estimate of the conditional probability that the proposition k is false, given the input example α. Correspondingly, let q_{k|α} denote the actual (true) value of the conditional probability that the proposition k is true, given the input example α. This means that 1 − q_{k|α} is the actual
value of the conditional probability that the proposition k is false, given the input example α. Thus, we may define the Kullback-Leibler divergence for the multilayer perceptron as

D_{p‖q} = Σ_α p_α Σ_k [ q_{k|α} log(q_{k|α}/p_{k|α}) + (1 − q_{k|α}) log( (1 − q_{k|α})/(1 − p_{k|α}) ) ]

where p_α is the a priori probability of occurrence of example α at the input.

To perform supervised training of the multilayer perceptron, we use gradient descent on D_{p‖q} in weight space. First, we use the chain rule to express the partial derivative of D_{p‖q} with respect to the synaptic weight w_kj of output neuron k as follows:

∂D_{p‖q}/∂w_kj = Σ_α (∂D_{p‖q}/∂p_{k|α})(∂p_{k|α}/∂y_{k|α})(∂y_{k|α}/∂v_{k|α})(∂v_{k|α}/∂w_kj)
               = −Σ_α p_α (q_{k|α} − p_{k|α}) y_{j|α}   (6)

Next, we express the partial derivative of D_{p‖q} with respect to the synaptic weight w_ji of hidden neuron j by writing

∂D_{p‖q}/∂w_ji = −Σ_α p_α Σ_k [ q_{k|α}/p_{k|α} − (1 − q_{k|α})/(1 − p_{k|α}) ] ∂p_{k|α}/∂w_ji   (7)

Via the chain rule, we write

∂p_{k|α}/∂w_ji = (∂p_{k|α}/∂y_{k|α})(∂y_{k|α}/∂v_{k|α})(∂v_{k|α}/∂y_{j|α})(∂y_{j|α}/∂v_{j|α})(∂v_{j|α}/∂w_ji)
              = φ′(v_{k|α}) w_kj φ′(v_{j|α}) x_{i|α}   (8)

But

φ′(v_{k|α}) = y_{k|α}(1 − y_{k|α}) = p_{k|α}(1 − p_{k|α})   (9)

Hence, using (8) and (9), we may simplify (7) as

∂D_{p‖q}/∂w_ji = −Σ_α p_α x_{i|α} φ′(v_{j|α}) Σ_k (q_{k|α} − p_{k|α}) w_kj
where φ′(·) is the derivative of the logistic function φ(·) with respect to its argument.

Assuming the use of the learning-rate parameter η for all weight changes applied to the network, we may use the method of steepest descent to write the following two-step probabilistic algorithm:

1. For output neuron k, compute

Δw_kj = −η ∂D_{p‖q}/∂w_kj = η Σ_α p_α (q_{k|α} − p_{k|α}) y_{j|α}

2. For hidden neuron j, compute

Δw_ji = −η ∂D_{p‖q}/∂w_ji = η Σ_α p_α x_{i|α} φ′(v_{j|α}) Σ_k (q_{k|α} − p_{k|α}) w_kj
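The two-step algorithm above can be exercised on a tiny network. The following sketch is our own illustration (a 2-2-2 network with a single training example, so p_α = 1; all names are assumptions): repeated application of the two updates should drive the Kullback-Leibler divergence down.

```python
import math

def phi(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(W_hid, W_out, x):
    """Forward pass; the outputs play the role of the probabilities p_k."""
    y_hid = [phi(sum(w * xi for w, xi in zip(row, x))) for row in W_hid]
    p = [phi(sum(w * yj for w, yj in zip(row, y_hid))) for row in W_out]
    return y_hid, p

def kl(q, p):
    """Kullback-Leibler divergence D(p||q) of the text, for one example."""
    return sum(qk * math.log(qk / pk)
               + (1 - qk) * math.log((1 - qk) / (1 - pk))
               for qk, pk in zip(q, p))

x, q = [1.0, -0.5], [0.9, 0.2]            # input and target probabilities
W_hid = [[0.3, -0.1], [0.2, 0.4]]
W_out = [[0.1, 0.5], [-0.3, 0.2]]
eta = 0.5

before = kl(q, forward(W_hid, W_out, x)[1])
for _ in range(200):
    y_hid, p = forward(W_hid, W_out, x)
    # Step 1: output weights, Delta w_kj = eta (q_k - p_k) y_j.
    for k in range(2):
        for j in range(2):
            W_out[k][j] += eta * (q[k] - p[k]) * y_hid[j]
    # Step 2: hidden weights,
    # Delta w_ji = eta x_i phi'(v_j) sum_k (q_k - p_k) w_kj.
    for j in range(2):
        dphi = y_hid[j] * (1 - y_hid[j])
        back = sum((q[k] - p[k]) * W_out[k][j] for k in range(2))
        for i in range(2):
            W_hid[j][i] += eta * x[i] * dphi * back
after = kl(q, forward(W_hid, W_out, x)[1])
```

After 200 sweeps the divergence is strictly smaller than at the start, as the derivation predicts for sufficiently small steps.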
Problem 10.9
We first note that the mutual information between the random variables X and Y is defined by

I(X; Y) = h(X) + h(Y) − h(X, Y)

To maximize the mutual information I(X;Y), we need to maximize the sum of the differential entropy h(X) and the differential entropy h(Y), and also minimize the joint differential entropy h(X, Y). From the definition of differential entropy, both h(X) and h(Y) attain their maximum value of 0.5 when X and Y occur with probability 1/2. Moreover, h(X, Y) is minimized when the joint probability of X and Y occupies the smallest possible region in the probability space.
Problem 10.10
The outputs Y_1 and Y_2 of the two neurons in Fig. P10.6 of the text are respectively defined by

Y_1 = Σ_{i=1}^m w_1i x_i + N_1
Y_2 = Σ_{i=1}^L w_2i x_i + N_2
where the w_1i are the synaptic weights of output neuron 1, and the w_2i are the synaptic weights of output neuron 2. The mutual information between the output vector Y = [Y_1, Y_2]^T and the input vector X = [X_1, X_2, ..., X_m]^T is

I(X; Y) = h(Y) − h(Y|X) = h(Y) − h(N)   (1)

where h(Y) is the differential entropy of the output vector Y and h(N) is the differential entropy of the noise vector N = [N_1, N_2]^T.

Since the noise terms N_1 and N_2 are Gaussian and uncorrelated, it follows that they are statistically independent. Hence,

h(N) = h(N_1, N_2) = h(N_1) + h(N_2) = 1 + log(2π σ_N²)   (2)

The differential entropy of the output vector Y is

h(Y) = h(Y_1, Y_2) = −∫∫ f_{Y1,Y2}(y_1, y_2) log f_{Y1,Y2}(y_1, y_2) dy_1 dy_2

where f_{Y1,Y2}(y_1, y_2) is the joint pdf of Y_1 and Y_2. Both Y_1 and Y_2 are dependent on the same set of input signals, and so they are correlated with each other. Let

R = E[Y Y^T] = | r_11  r_12 |
               | r_21  r_22 |

where

r_ij = E[Y_i Y_j],   i, j = 1, 2

The individual elements of the correlation matrix R are given by

r_11 = σ_1² + σ_N²
r_12 = r_21 = σ_1 σ_2 ρ_12
r_22 = σ_2² + σ_N²

where σ_1² and σ_2² are the respective variances of Y_1 and Y_2 in the absence of noise, and ρ_12 is their correlation coefficient, also in the absence of noise. For the general case of an N-dimensional Gaussian distribution, we have

f_Y(y) = ( 1 / ( (2π)^{N/2} (det R)^{1/2} ) ) exp( −(1/2) y^T R^{−1} y )

Correspondingly, the differential entropy of the N-dimensional vector Y is described as

h(Y) = log( (2πe)^{N/2} (det R)^{1/2} )

where e is the base of the natural logarithm. For the problem at hand, we have N = 2, and so

h(Y) = log( 2πe (det R)^{1/2} ) = 1 + log( 2π (det R)^{1/2} )   (3)

Hence, the use of (2) and (3) in (1) yields

I(X; Y) = log( (det R)^{1/2} / σ_N² ) = (1/2) log( det(R) / σ_N⁴ )   (4)

For a fixed noise variance σ_N², the mutual information I(X;Y) is maximized by maximizing the determinant det(R). By definition,

det(R) = r_11 r_22 − r_12 r_21

That is,

det(R) = σ_N⁴ + σ_N²(σ_1² + σ_2²) + σ_1² σ_2² (1 − ρ_12²)   (5)

Depending on the value of the noise variance σ_N², we may identify two distinct situations:

1. Large noise variance. When σ_N² is large, the third term in (5) may be neglected, obtaining

det(R) ≈ σ_N⁴ + σ_N²(σ_1² + σ_2²)
In this case, maximizing det(R) requires that we maximize (σ_1² + σ_2²). This requirement may be satisfied simply by maximizing the variance σ_1² of output Y_1 or the variance σ_2² of output Y_2, separately. Since the variance of output Y_i, i = 1, 2, is equal to σ_i² in the absence of noise and σ_i² + σ_N² in the presence of noise, it follows from the Infomax principle that the optimum solution for a fixed noise variance is to maximize the variance of either output, Y_1 or Y_2.

2. Low noise variance. When the noise variance σ_N² is small, the third term σ_1² σ_2² (1 − ρ_12²) in (5) becomes important relative to the other two terms. The mutual information I(X;Y) is then maximized by making an optimal tradeoff between two options: keeping the output variances σ_1² and σ_2² large, and making the outputs Y_1 and Y_2 of the two neurons uncorrelated.

Based on these observations, we may now make the following two statements:

• A high noise level favors redundancy of response, in which case the two output neurons compute the same linear combination of inputs. Only one such combination yields a response with maximum variance.
• A low noise level favors diversity of response, in which case the two output neurons compute different linear combinations of inputs, even though such a choice may result in a reduced output variance.

Problem 10.11

(a) We are given

Y_a = S + N_a
Y_b = S + N_b

Hence,

(Y_a + Y_b)/2 = S + (1/2)(N_a + N_b)

The mutual information between (Y_a + Y_b)/2 and the signal component S is

I( (Y_a + Y_b)/2 ; S ) = h( (Y_a + Y_b)/2 ) − h( (Y_a + Y_b)/2 | S )   (1)

The differential entropy of (Y_a + Y_b)/2 is
h( (Y_a + Y_b)/2 ) = (1/2) [ 1 + log( (π/2) var[Y_a + Y_b] ) ]   (2)

The conditional differential entropy of (Y_a + Y_b)/2, given S, is

h( (Y_a + Y_b)/2 | S ) = h( (N_a + N_b)/2 ) = (1/2) [ 1 + log( (π/2) var[N_a + N_b] ) ]   (3)

Hence, the use of (2) and (3) in (1) yields (after the simplification of terms)

I( (Y_a + Y_b)/2 ; S ) = (1/2) log( var[Y_a + Y_b] / var[N_a + N_b] )

(b) The signal component S is ordinarily independent of the noise components N_a and N_b. Hence, with

Y_a + Y_b = 2S + N_a + N_b

it follows that

var[Y_a + Y_b] = 4 var[S] + var[N_a + N_b]

The ratio var[Y_a + Y_b]/var[N_a + N_b] in the expression for the mutual information I( (Y_a + Y_b)/2 ; S ) may therefore be interpreted as a signal-plus-noise to noise ratio.

Problem 10.12

Principal-components analysis (PCA) and independent-components analysis (ICA) share a common feature: they both linearly transform an input signal into a fixed set of components. However, they differ from each other in two important respects:

1. PCA performs decorrelation by minimizing second-order moments; higher-order moments are not involved in this computation. On the other hand, ICA achieves statistical independence by using higher-order moments.

2. The output signal vector resulting from PCA has a diagonal covariance matrix. The first principal component defines a direction in the original signal space that captures the maximum possible variance; the second principal component defines another direction in the remaining orthogonal subspace that captures the next maximum possible variance; and so on. On the other hand, ICA does not find the directions of maximum variance but rather interesting directions, where the term "interesting" refers to "deviation from Gaussianity".
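The second-order limitation of PCA can be demonstrated numerically. The sketch below is our own illustration (mixing matrix and names are assumptions): after PCA whitening, the data have an identity covariance, but so does any rotation of the whitened data, so second-order statistics alone cannot single out the independent components.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1, 1, size=(2, 5000))        # independent non-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                    # mixing matrix
x = A @ s                                     # observed mixtures

# PCA whitening: project onto eigenvectors of the sample covariance and
# rescale, giving an identity covariance afterwards.
C = np.cov(x)
evals, evecs = np.linalg.eigh(C)
z = np.diag(evals ** -0.5) @ evecs.T @ (x - x.mean(axis=1, keepdims=True))
Cz = np.cov(z)

# Any rotation of the whitened data is equally uncorrelated.
th = np.pi / 4
R = np.array([[np.cos(th), -np.sin(th)],
              [np.sin(th),  np.cos(th)]])
Cr = np.cov(R @ z)
```

Both covariance matrices come out as the identity, which is why ICA must bring in higher-order moments to resolve the remaining rotational ambiguity.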
Problem 10.13
Independent-components analysis may be used as a preprocessing tool before signal detection and pattern classification. In particular, through a change of coordinates resulting from the use of ICA, the probability density function of multichannel data may be expressed as a product of marginal densities. This change, in turn, permits density estimation with shorter observations.
Problem 10.14
Consider m random variables X_1, X_2, ..., X_m that are defined by

X_i = Σ_{j=1}^N a_ij U_j,   i = 1, 2, ..., m

where the U_j are independent random variables. Darmois' theorem states that if the X_i are independent, then the variables U_j for which a_ij ≠ 0 are all Gaussian.

For independent-components analysis to work, at most a single X_i can be Gaussian. If all the X_i are independent to begin with, there is no need for the application of independent-components analysis. This, in turn, means that all the X_i must be Gaussian. For a finite N, this condition can only be satisfied if all the U_j are not only independent but also Gaussian.

Problem 10.15

The use of independent-components analysis results in a set of components that are as statistically independent of each other as possible. In contrast, the use of decorrelation only addresses second-order statistics, and there is therefore no guarantee of statistical independence.

Problem 10.16

The Kullback-Leibler divergence between the joint pdf f_Y(y, w) and the factorial pdf f̃_Y(y, w) is the multifold integral

D_{f‖f̃} = ∫ f_Y(y, w) log( f_Y(y, w) / f̃_Y(y, w) ) dy   (1)
Let

dy = dy_i dy_j dy′

where y′ excludes y_i and y_j. We may then rewrite (1) as

D_{f‖f̃} = ∫∫ dy_i dy_j ∫ f_Y(y, w) log f_Y(y, w) dy′ − ∫∫ dy_i dy_j ∫ f_Y(y, w) log f̃_Y(y, w) dy′
         = ∫∫ f_{Yi,Yj}(y_i, y_j, w) log f_{Yi,Yj}(y_i, y_j, w) dy_i dy_j
           − ∫∫ f_{Yi,Yj}(y_i, y_j, w) log( f_{Yi}(y_i, w) f_{Yj}(y_j, w) ) dy_i dy_j
         = ∫∫ f_{Yi,Yj}(y_i, y_j, w) log( f_{Yi,Yj}(y_i, y_j, w) / ( f_{Yi}(y_i, w) f_{Yj}(y_j, w) ) ) dy_i dy_j
         = I(Y_i; Y_j)

That is, the Kullback-Leibler divergence between the joint pdf f_Y(y, w) and the factorial pdf f̃_Y(y, w) is equal to the mutual information between the components Y_i and Y_j of the output vector Y for any pair (i, j).

Problem 10.18

Define the output matrix

Y = | y_1(0)   y_1(1)   ...   y_1(N−1) |
    | y_2(0)   y_2(1)   ...   y_2(N−1) |
    |   ...      ...    ...     ...    |
    | y_m(0)   y_m(1)   ...   y_m(N−1) |   (1)

where m is the dimension of the output vector y(n) and N is the number of samples used in computing the matrix Y. Correspondingly, define the m-by-N matrix of activation functions
Φ(Y) = | φ(y_1(0))   φ(y_1(1))   ...   φ(y_1(N−1)) |
       | φ(y_2(0))   φ(y_2(1))   ...   φ(y_2(N−1)) |
       |    ...         ...      ...       ...     |
       | φ(y_m(0))   φ(y_m(1))   ...   φ(y_m(N−1)) |   (2)

In the batch mode, we define the average weight adjustment (see Eq. (10.100) of the text)

ΔW = (1/N) Σ_{n=0}^{N−1} ΔW(n) = η [ I − (1/N) Σ_{n=0}^{N−1} φ(y(n)) y^T(n) ] W

Equivalently, using the matrix definitions introduced in (1) and (2), we may write

ΔW = η [ I − (1/N) Φ(Y) Y^T ] W

which is the desired formula.

Problem 10.19

(a) Let q(y) denote a pdf equal to the determinant det(J), with the elements of the Jacobian J being as defined in Eq. (10.115). Then, using Eq. (10.116), we may express the entropy of the random vector Z at the output of the nonlinearity in Fig. 10.16 of the text as

h(Z) = −D_{f‖q}

Invoking the Pythagorean decomposition of the Kullback-Leibler divergence, we write

D_{f‖q} = D_{f‖f̃} + D_{f̃‖q}

Hence, the differential entropy

h(Z) = −D_{f‖f̃} − D_{f̃‖q}   (1)
(b) If q(y_i) happens to equal the source pdf f_U(y_i) for all i, we then find that D_{f̃‖q} = 0. In such a case, (1) reduces to

h(Z) = −D_{f‖f̃}

That is, the entropy h(Z) is equal to the negative of the Kullback-Leibler divergence between the pdf f_Y(y) and the corresponding factorial distribution f̃_Y(y).

Problem 10.20

(a) From Eq. (10.124) in the text,

Φ = log|det(A)| + log|det(W)| + Σ_i log(∂z_i/∂y_i)

The matrix A of the linear mixer is fixed. Hence, differentiating Φ with respect to W:

∂Φ/∂W = W^{−T} + ∂/∂W Σ_i log(∂z_i/∂y_i)   (1)

(b) From Eq. (10.126) of the text,

z_i = 1 / (1 + e^{−y_i})

Differentiating z_i with respect to y_i:

∂z_i/∂y_i = e^{−y_i} / (1 + e^{−y_i})² = z_i − z_i²   (2)

Hence, differentiating log(∂z_i/∂y_i) with respect to the demixing matrix W, we get
∂/∂W log(∂z_i/∂y_i) = ∂/∂W log(z_i − z_i²) = ( 1/(z_i − z_i²) ) (1 − 2z_i) ∂z_i/∂W   (3)

But from (2) we have

∂z_i/∂W = (∂z_i/∂y_i)(∂y_i/∂W) = (z_i − z_i²) ∂y_i/∂W

Hence, we may simplify (3) to

∂/∂W log(∂z_i/∂y_i) = (1 − 2z_i) ∂y_i/∂W

We may thus rewrite (1) as

∂Φ/∂W = W^{−T} + Σ_i (1 − 2z_i) ∂y_i/∂W

Putting this relation in matrix form and recognizing that the demixer output y is equal to Wx, where x is the observation vector, we find that the adjustment applied to W is defined by

ΔW = η ∂Φ/∂W = η ( W^{−T} + (1 − 2z) x^T )

where η is the learning-rate parameter, z is the vector with elements z_i, and 1 is a vector of ones.
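The update ΔW = η(W^{−T} + (1 − 2z)x^T) can be run sample-by-sample on synthetic mixtures. The sketch below is our own illustration of this stochastic rule (the Bell-Sejnowski Infomax form with the logistic nonlinearity of Eq. (10.126)); the Laplacian sources, mixing matrix, and step size are assumptions, chosen because the logistic score suits super-Gaussian sources.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(0.0, 1.0, size=(2, 20000))   # independent super-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                   # fixed linear mixer
X = A @ S                                    # observations x = A s

W = np.eye(2)                                # demixing matrix
eta = 0.005
for x in X.T:
    y = W @ x                                # demixer output
    z = 1.0 / (1.0 + np.exp(-y))             # logistic nonlinearity
    # Delta W = eta (W^{-T} + (1 - 2z) x^T), applied per sample.
    W += eta * (np.linalg.inv(W).T + np.outer(1.0 - 2.0 * z, x))

P = W @ A   # ideally a scaled permutation matrix after convergence
```

The term W^{−T} keeps W away from singularity, so the iteration remains well defined; after the run, the combined matrix P = WA indicates how close the demixer has come to inverting the mixer up to scaling and permutation.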
CHAPTER 11
Stochastic Methods Rooted in Statistical Mechanics
Problem 11.1
By definition, we have

p_ij^(n) = P(X_t = j | X_{t−n} = i)

where t denotes time and n denotes the number of discrete steps. For n = 1, we have the one-step transition probability

p_ij^(1) = p_ij = P(X_t = j | X_{t−1} = i)

For n = 2 we have the two-step transition probability

p_ij^(2) = Σ_k p_ik p_kj

where the sum is taken over all intermediate steps k taken by the system. By induction, it thus follows that

p_ij^(n) = Σ_k p_ik p_kj^(n−1)

Problem 11.2

For p > 0, the state transition diagram for the random-walk process shown in Fig. P11.2 of the text is irreducible. The reason for saying so is that the system has only one class, namely {0, +1, +2, ...}.

Problem 11.3

The state transition diagram of Fig. P11.3 in the text pertains to a Markov chain with two classes: {x1} and {x1, x2}.
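The n-step recursion above (a discrete Chapman-Kolmogorov equation) can be checked by direct matrix multiplication. The sketch below is our own illustration on an arbitrary two-state chain, using plain nested lists.

```python
def mat_mul(A, B):
    """Multiply two matrices represented as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# A small two-state stochastic matrix (rows sum to one).
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2 = mat_mul(P, P)        # two-step probabilities p_ij^(2) = sum_k p_ik p_kj
P3 = mat_mul(P2, P)       # three-step, grouping as p^(2) then one more step
P3_alt = mat_mul(P, P2)   # same result, grouping as one step then p^(2)
```

Every power of P is again stochastic, and the two groupings of the three-step product agree, which is exactly the content of the induction step.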
Problem 11.4
The stochastic matrix of the Markov chain in Fig. P11.4 of the text is given by

P = | 3/4  1/4   0  |
    |  0   2/3  1/3 |
    | 1/4  3/4   0  |

Let π1, π2, and π3 denote the steady-state probabilities of this chain. We may then write (see Eq. (11.27) of the text)

π1 = (3/4)π1 + (0)π2 + (1/4)π3
π2 = (1/4)π1 + (2/3)π2 + (3/4)π3
π3 = (0)π1 + (1/3)π2 + (0)π3

That is,

π1 = π3
π2 = 3π3

We also have, by definition,

π1 + π2 + π3 = 1

Hence,

π3 + 3π3 + π3 = 1

or equivalently

π3 = 1/5

and so

π1 = 1/5,   π2 = 3/5
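The steady-state solution π = (1/5, 3/5, 1/5) can be confirmed by iterating π ← πP until convergence; the sketch below is our own check in plain Python.

```python
# The stochastic matrix of Fig. P11.4 (rows index the current state).
P = [[0.75, 0.25, 0.0],
     [0.0,  2/3,  1/3],
     [0.25, 0.75, 0.0]]

# Power iteration on the row vector pi: pi_i <- sum_j pi_j P[j][i].
pi = [1/3, 1/3, 1/3]
for _ in range(200):
    pi = [sum(pi[j] * P[j][i] for j in range(3)) for i in range(3)]
# pi converges to (0.2, 0.6, 0.2), i.e. (1/5, 3/5, 1/5).
```

The chain is irreducible and aperiodic (it has self-loops), so the iteration converges to the unique stationary distribution regardless of the starting vector.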
Problem 11.6
The Metropolis algorithm and the Gibbs sampler are similar in that they both generate a Markov chain with the Gibbs distribution as the equilibrium distribution.

They differ from each other in the following respect: in the Metropolis algorithm, the transition probabilities of the Markov chain are stationary. In contrast, in the Gibbs sampler, they are nonstationary.
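A minimal Metropolis sampler makes the equilibrium property concrete. The sketch below is our own illustration (state space, energies, and names are assumptions): a symmetric proposal plus the Metropolis acceptance test drives the visit frequencies toward the Gibbs distribution π_i ∝ exp(−E_i/T).

```python
import math
import random

def metropolis(energies, T, num_steps, seed=0):
    """Sample a Gibbs distribution over discrete states with the Metropolis
    algorithm: propose a state uniformly (symmetric attempt), accept downhill
    moves always and uphill moves with probability exp(-dE/T)."""
    rng = random.Random(seed)
    n = len(energies)
    state = 0
    counts = [0] * n
    for _ in range(num_steps):
        proposal = rng.randrange(n)                  # symmetric proposal
        dE = energies[proposal] - energies[state]
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            state = proposal                         # accept the attempt
        counts[state] += 1
    return [c / num_steps for c in counts]

E = [0.0, 1.0, 2.0]
T = 1.0
freq = metropolis(E, T, 200000)
Z = sum(math.exp(-e / T) for e in E)
target = [math.exp(-e / T) / Z for e in E]
```

With a long enough run, the empirical frequencies match the Gibbs probabilities to within sampling error.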
Problem 11.7
Simulated annealing algorithm for solving the travelling-salesman problem:

1. Set up an annealing schedule for the algorithm.
2. Initialize the algorithm by picking a tour at random.
3. Choose a pair of cities in the tour, and then reverse the order in which the cities in between the selected pair are visited. This procedure, illustrated in Figure 1 below, generates new tours in a local manner.
4. Calculate the energy difference due to the reversal of paths applied in step 3.
5. If the energy difference so calculated is negative or zero, accept the new tour. If, on the other hand, it is positive, accept the change in the tour with a probability defined in accordance with the Metropolis algorithm.
6. Select another pair of cities, and repeat steps 3 to 5 until the required number of iterations is accepted.
7. Lower the temperature in the annealing schedule, and repeat steps 3 to 6.
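The steps above can be sketched compactly in code. The following is our own illustration on random city coordinates (the geometric annealing schedule and all parameter values are assumptions); the reversal move of step 3 is the classic 2-opt segment flip.

```python
import math
import random

def tour_length(tour, cities):
    """Total closed-tour length (the 'energy' of the configuration)."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(cities, T0=1.0, alpha=0.95, iters_per_T=200, seed=0):
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    rng.shuffle(tour)                                # step 2: random tour
    T = T0                                           # step 1: schedule start
    while T > 1e-3:
        for _ in range(iters_per_T):
            i, j = sorted(rng.sample(range(len(tour)), 2))
            new = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # step 3
            dE = tour_length(new, cities) - tour_length(tour, cities)  # step 4
            # step 5: Metropolis acceptance test.
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                tour = new
        T *= alpha                                   # step 7: cool down
    return tour

rng = random.Random(42)
cities = [(rng.random(), rng.random()) for _ in range(12)]
best = anneal_tsp(cities)
```

Recomputing the full tour length for each move keeps the sketch short; an efficient implementation would evaluate only the two edges changed by the reversal.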
Figure 1: Problem 11.7

Problem 11.8

(a) We start with the notion that a neuron j flips from state x_j to −x_j at temperature T with probability

P(x_j → −x_j) = 1 / ( 1 + exp(−ΔE_j / T) )   (1)
where ΔE_j is the energy difference resulting from such a flip. The energy function of the Boltzmann machine is defined by

E = −(1/2) Σ_i Σ_{j, j≠i} w_ji x_i x_j

Hence, the energy change produced by neuron j flipping from state x_j to −x_j is

ΔE_j = (energy with neuron j in state x_j) − (energy with neuron j in state −x_j)
     = −x_j Σ_i w_ji x_i − x_j Σ_i w_ji x_i
     = −2 x_j Σ_i w_ji x_i
     = −2 x_j v_j   (2)

where v_j is the induced local field of neuron j.

(b) In light of the result in (2), we may rewrite (1) as

P(x_j → −x_j) = 1 / ( 1 + exp(2 x_j v_j / T) )

This means that for an initial state x_j = −1, the probability that neuron j is flipped into state +1 is

1 / ( 1 + exp(−2 v_j / T) )   (3)

(c) For an initial state of x_j = +1, the probability that neuron j is flipped into state −1 is

1 / ( 1 + exp(2 v_j / T) ) = 1 − 1 / ( 1 + exp(−2 v_j / T) )   (4)

The flipping probability in (4) and the one in (3) are in perfect agreement with the following probabilistic rule:

x_j = +1 with probability P(v_j), and x_j = −1 with probability 1 − P(v_j)
where P(v_j) is itself defined by

P(v_j) = 1 / ( 1 + exp(−2 v_j / T) )

Problem 11.9

The log-likelihood function L(w) is (see Eq. (11.48) of the text)

L(w) = Σ_{x_α∈T} [ log Σ_{x_β} exp(−E(x)/T) − log Σ_x exp(−E(x)/T) ]

Differentiating L(w) with respect to the weight w_ji:

∂L(w)/∂w_ji = −(1/T) Σ_{x_α∈T} [ Σ_{x_β} (∂E(x)/∂w_ji) exp(−E(x)/T) / Σ_{x_β} exp(−E(x)/T)
              − Σ_x (∂E(x)/∂w_ji) exp(−E(x)/T) / Σ_x exp(−E(x)/T) ]

The energy function E(x) is defined by (see Eq. (11.39) of the text)

E(x) = −(1/2) Σ_i Σ_{j, j≠i} w_ji x_i x_j

Hence,

∂E(x)/∂w_ji = −x_i x_j,   i ≠ j   (1)

We also note the following:

P(X_β = x_β | X_α = x_α) = exp(−E(x)/T) / Σ_{x_β} exp(−E(x)/T)   (2)

P(X = x) = exp(−E(x)/T) / Σ_x exp(−E(x)/T)   (3)
Accordingly, using the formulas of (1) to (3), we may redefine the derivative ∂L(w)/∂w_ji as follows:

∂L(w)/∂w_ji = (1/T) Σ_{x_α∈T} [ Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i − Σ_x P(X = x) x_j x_i ]

which is the desired result.

Problem 11.10

(a) Factoring the transition process from state i to state j into a two-step process, we may express the transition probability p_ji as

p_ji = τ_ji q_ji   for j ≠ i   (1)

where τ_ji is the probability that a transition from state j to state i is attempted, and q_ji is the conditional probability that the attempt is successful, given that it was attempted. When j = i, the property that each row of the stochastic matrix must add to unity implies that

p_ii = 1 − Σ_{j≠i} p_ij = 1 − Σ_{j≠i} τ_ij q_ij

(b) We require that the attempt-rate matrix be symmetric:

τ_ji = τ_ij   for all i ≠ j   (2)

and that it satisfy the normalization condition

Σ_j τ_ji = 1   for all i

We also require the property of complementary conditional transition probability:

q_ji = 1 − q_ij   (3)

For a stationary distribution, we have

π_i = Σ_j π_j p_ji   for all i   (4)
Hence, using (1) to (3) in (4):

π_i = Σ_j π_j τ_ji q_ji = Σ_j π_j τ_ji (1 − q_ij)   (5)

Next, recognizing that

Σ_j p_ij = 1   for all i

we may go on to write

π_i = π_i Σ_j p_ij = Σ_j π_i p_ij = Σ_j π_i τ_ij q_ij   (6)

Hence, combining (5) and (6), using the symmetry property of (2), and then rearranging terms:

Σ_j τ_ji ( π_i q_ij + π_j q_ij − π_j ) = 0   (7)

(c) For τ_ji ≠ 0, the condition of (7) can only be satisfied if

π_i q_ij + π_j q_ij − π_j = 0

which, in turn, means that q_ij is defined by

q_ij = 1 / ( 1 + π_i/π_j )   (8)

(d) Make a change of variables:

E_i = −T log π_i + T*

where T and T* are arbitrary constants. We may then express π_i in terms of E_i as
π_i = (1/Z) exp(−E_i/T)

where

Z = exp(−T*/T)

Accordingly, we may reformulate (8) in the new form

q_ij = 1 / ( 1 + exp(−(E_i − E_j)/T) ) = 1 / ( 1 + exp(−ΔE/T) )   (9)

where ΔE = E_i − E_j. To evaluate the constant Z, we note that

Σ_i π_i = 1

and therefore

Z = Σ_i exp(−E_i/T)

(e) The formula of (9) is the only possible distribution for state transitions in the Boltzmann machine; it is recognized as the Gibbs distribution.

Problem 11.11

We start with the Kullback-Leibler divergence

D_{p+‖p−} = Σ_α p_α^+ log( p_α^+ / p_α^− )   (1)

The probability distribution p_α^+ in the clamped condition is naturally independent of the synaptic weights w_ji in the Boltzmann machine, whereas the probability distribution p_α^− is dependent on w_ji. Hence, differentiating (1) with respect to w_ji:
∂D_{p+‖p−}/∂w_ji = −Σ_α ( p_α^+ / p_α^− ) ∂p_α^−/∂w_ji   (2)

To minimize D_{p+‖p−}, we use the method of gradient descent:

Δw_ji = −ε ∂D_{p+‖p−}/∂w_ji = ε Σ_α ( p_α^+ / p_α^− ) ∂p_α^−/∂w_ji   (3)

where ε is a positive constant.

Let p_αβ^− denote the joint probability that the visible neurons are in state α and the hidden neurons are in state β, given that the network is in its free-running condition. We may then write

p_α^− = Σ_β p_αβ^−

Assuming that the network is in thermal equilibrium, we may use the Gibbs distribution

p_αβ^− = (1/Z) exp(−E_αβ/T)   (4)

to write

p_α^− = (1/Z) Σ_β exp(−E_αβ/T)

where E_αβ is the energy of the network when the visible neurons are in state α and the hidden neurons are in state β. The partition function Z is itself defined by

Z = Σ_α Σ_β exp(−E_αβ/T)

The energy E_αβ is defined in terms of the synaptic weights w_ji by
E_αβ = −(1/2) Σ_i Σ_{j, j≠i} w_ji x_{j|αβ} x_{i|αβ}   (5)

where x_{i|αβ} is the state of neuron i when the visible neurons are in state α and the hidden neurons are in state β. Therefore, using (4):

∂p_α^−/∂w_ji = −(1/(ZT)) Σ_β exp(−E_αβ/T) ∂E_αβ/∂w_ji − (1/Z²) (∂Z/∂w_ji) Σ_β exp(−E_αβ/T)   (6)

From (5) we have (remembering that in a Boltzmann machine w_ji = w_ij)

∂E_αβ/∂w_ji = −x_{j|αβ} x_{i|αβ}   (7)

The first term on the right-hand side of (6) is therefore

−(1/(ZT)) Σ_β exp(−E_αβ/T) ∂E_αβ/∂w_ji = (1/(ZT)) Σ_β exp(−E_αβ/T) x_{j|αβ} x_{i|αβ}
                                        = (1/T) Σ_β p_αβ^− x_{j|αβ} x_{i|αβ}

where we have made use of the Gibbs distribution

p_αβ^− = (1/Z) exp(−E_αβ/T)

as the probability that the visible neurons are in state α and the hidden neurons are in state β in the free-running condition. Consider next the second term on the right-hand side of (6). Except for the minus sign, we may express this term as the product of two factors:

(1/Z²) (∂Z/∂w_ji) Σ_β exp(−E_αβ/T) = [ (1/Z) Σ_β exp(−E_αβ/T) ] [ (1/Z) ∂Z/∂w_ji ]   (8)

The first factor in (8) is recognized as the Gibbs distribution

p_α^− = (1/Z) Σ_β exp(−E_αβ/T)   (9)
To evaluate the second factor in (8), we write

(1/Z) ∂Z/∂w_ji = (1/Z) ∂/∂w_ji Σ_α Σ_β exp(−E_αβ/T)
               = −(1/(TZ)) Σ_α Σ_β exp(−E_αβ/T) ∂E_αβ/∂w_ji
               = (1/(TZ)) Σ_α Σ_β exp(−E_αβ/T) x_{j|αβ} x_{i|αβ}
               = (1/T) Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ}   (10)

Using (9) and (10) in (8):

(1/Z²) (∂Z/∂w_ji) Σ_β exp(−E_αβ/T) = (p_α^−/T) Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ}   (11)

We are now ready to revisit (6) and thus write

∂p_α^−/∂w_ji = (1/T) Σ_β p_αβ^− x_{j|αβ} x_{i|αβ} − (p_α^−/T) Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ}

We now make the following observations:

1. The sum of the probability p_α^+ over the states α is unity; that is,

Σ_α p_α^+ = 1   (12)

2. The joint probability

p_αβ^− = p_{β|α}^− p_α^−   (13)

Similarly,

p_αβ^+ = p_{β|α}^+ p_α^+   (14)

3. The probability of a hidden state, given some visible state, is naturally the same whether the visible neurons of the network in thermal equilibrium are clamped in that state by the external environment or arrive at that state by free running of the network, as shown by

p_{β|α}^− = p_{β|α}^+   (15)

In light of this relation we may rewrite Eq. (13) as
p_αβ^− = p_{β|α}^+ p_α^−   (16)

Moreover, we may write

( p_α^+ / p_α^− ) p_αβ^− = p_α^+ p_{β|α}^+ = p_αβ^+   (17)

Accordingly, we may rewrite (3) as follows:

Δw_ji = (ε/T) [ Σ_α Σ_β ( p_α^+ / p_α^− ) p_αβ^− x_{j|αβ} x_{i|αβ} − Σ_α p_α^+ Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ} ]
      = (ε/T) [ Σ_α Σ_β p_αβ^+ x_{j|αβ} x_{i|αβ} − Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ} ]

Define the following terms:

η = ε/T = learning-rate parameter
ρ_ji^+ = <x_j x_i>^+ = Σ_α Σ_β p_αβ^+ x_{j|αβ} x_{i|αβ}
ρ_ji^− = <x_j x_i>^− = Σ_α Σ_β p_αβ^− x_{j|αβ} x_{i|αβ}

We may then finally formulate the Boltzmann learning rule as

Δw_ji = η ( ρ_ji^+ − ρ_ji^− )

Problem 11.12

(a) We start with the relative entropy:
(1)    D_{p⁺‖p⁻} = Σ_α Σ_γ p_αγ⁺ log(p_αγ⁺ / p_αγ⁻)

From probability theory, we have

(2)    p_αγ⁺ = p_{γ|α}⁺ p_α⁺

(3)    p_αγ⁻ = p_{γ|α}⁻ p_α⁻ = p_{γ|α}⁻ p_α⁺

where, in the last line, we have made use of the fact that the input neurons are always clamped to the environment, which means that

    p_α⁻ = p_α⁺

Substituting (2) and (3) into (1):

(4)    D_{p⁺‖p⁻} = Σ_α Σ_γ p_α⁺ p_{γ|α}⁺ log(p_{γ|α}⁺ / p_{γ|α}⁻)

where the state α refers to the input neurons and γ refers to the output neurons.

(b) With p_{γ|α}⁻ denoting the conditional probability of finding the output neurons in state γ, given that the input neurons are in state α, we may express the probability distribution of the output states as

    p_γ⁻ = Σ_α p_{γ|α}⁻ p_α⁻

The conditional p_{γ|α}⁻ is determined by the synaptic weights of the network in accordance with the formula

(5)    p_{γ|α}⁻ = (1/Z_1α) Σ_β exp(−E_γβα/T)

where

(6)    E_γβα = −(1/2) Σ_j Σ_i w_ji [s_j s_i]_γβα
The parameter Z_1α is the partition function:

(7)    Z_1α = Σ_β Σ_γ exp(−E_γβα/T)

The function of the Boltzmann machine is to find the synaptic weights for which the conditional probability p_{γ|α}⁻ approaches the desired value p_{γ|α}⁺.

Applying the gradient method to the relative entropy of (1):

(8)    Δw_ji = −ε ∂D_{p⁺‖p⁻}/∂w_ji

Using (4) in (8) and recognizing that p_{γ|α}⁺ is determined by the environment (i.e., it is independent of the network), we get

(9)    Δw_ji = ε Σ_α Σ_γ p_α⁺ (p_{γ|α}⁺ / p_{γ|α}⁻) ∂p_{γ|α}⁻/∂w_ji

To evaluate the partial derivative ∂p_{γ|α}⁻/∂w_ji, we use (5) to (7):

(10)   ∂p_{γ|α}⁻/∂w_ji = (1/Z_1α)(1/T) Σ_β [s_j s_i]_γβα exp(−E_γβα/T)
                         − (1/Z_1α²)(∂Z_1α/∂w_ji) Σ_β exp(−E_γβα/T)

with ∂Z_1α/∂w_ji = (1/T) Σ_γ Σ_β [s_j s_i]_γβα exp(−E_γβα/T), since ∂E_γβα/∂w_ji = −[s_j s_i]_γβα.

Next, we recognize the following pair of relations:

(11)   (1/Z_1α) exp(−E_γβα/T) = p_{γβ|α}⁻
(12)   (1/Z_1α) Σ_γ Σ_β [s_j s_i]_γβα exp(−E_γβα/T) = <s_j s_i>_α⁻

where the term <s_j s_i>_α⁻ is the averaged correlation of the states s_j and s_i with the input neurons clamped to state α and the network in a free-running condition. Substituting (11) and (12) in (10):

(13)   ∂p_{γ|α}⁻/∂w_ji = (1/T) [ Σ_β [s_j s_i]_γβα p_{γβ|α}⁻ − <s_j s_i>_α⁻ p_{γ|α}⁻ ]

Next, substituting (13) into (9):

(14)   Δw_ji = (ε/T) Σ_α p_α⁺ [ Σ_γ Σ_β [s_j s_i]_γβα p_{γβ|α}⁻ (p_{γ|α}⁺ / p_{γ|α}⁻) ]
              − (ε/T) Σ_α p_α⁺ <s_j s_i>_α⁻ Σ_γ p_{γ|α}⁺

We now recognize that

(15)   Σ_γ p_{γ|α}⁺ = 1    for all α

(16)   Σ_γ Σ_β [s_j s_i]_γβα p_{γβ|α}⁻ (p_{γ|α}⁺ / p_{γ|α}⁻)
       = Σ_γ p_{γ|α}⁺ Σ_β [s_j s_i]_γβα (p_{γβ|α}⁻ / p_{γ|α}⁻)
       = Σ_γ p_{γ|α}⁺ <s_j s_i>_γα
       = <s_j s_i>_α⁺

Accordingly, substituting (15) and (16) into (14):

    Δw_ji = (ε/T) Σ_α p_α⁺ ( <s_j s_i>_α⁺ − <s_j s_i>_α⁻ )
          = η Σ_α p_α⁺ ( ρ_{ji|α}⁺ − ρ_{ji|α}⁻ )
where η = ε/T; ρ_{ji|α}⁺ and ρ_{ji|α}⁻ are the averaged correlations in the clamped and free-running conditions, respectively, given that the input neurons are in state α.
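The clamped and free-running correlations that drive this learning rule can be computed exactly for a network small enough to enumerate. The sketch below is our own illustration (not part of the original solution; all function and variable names are ours): it evaluates <s_j s_i> under the Gibbs distribution with and without clamping, then applies Δw_ji = η(ρ⁺ − ρ⁻).

```python
import itertools
import numpy as np

def gibbs_correlations(w, T, clamp=None):
    """Exact <s_j s_i> under p(s) proportional to exp(-E/T), with
    E = -0.5 * s^T w s (w symmetric, zero diagonal). `clamp` optionally
    fixes some units, e.g. {0: +1.0, 1: -1.0}."""
    n = w.shape[0]
    states, weights = [], []
    for bits in itertools.product((-1.0, 1.0), repeat=n):
        s = np.array(bits)
        if clamp and any(s[i] != v for i, v in clamp.items()):
            continue  # skip states inconsistent with the clamped units
        weights.append(np.exp(0.5 * s @ w @ s / T))  # exp(-E/T)
        states.append(s)
    p = np.array(weights) / np.sum(weights)
    return sum(pi * np.outer(s, s) for pi, s in zip(p, states))

def boltzmann_update(w, patterns, visible, T=1.0, eta=0.05):
    """One step of dw_ji = eta * (rho_plus - rho_minus), averaging the
    clamped correlations over equiprobable training patterns."""
    rho_plus = np.mean([gibbs_correlations(w, T, dict(zip(visible, pat)))
                        for pat in patterns], axis=0)
    rho_minus = gibbs_correlations(w, T)
    dw = eta * (rho_plus - rho_minus)
    np.fill_diagonal(dw, 0.0)  # no self-connections
    return w + dw
```

The exhaustive enumeration replaces the Gibbs sampling used in practice; it is feasible only for a handful of units, but it makes the two-phase structure of the rule explicit.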
Problem 11.15
Consider the expected distortion (energy)

(1)    E = Σ_x Σ_j P(x ∈ C_j) d(x, y_j)

where d(x, y_j) is the distortion measure for representing the data point x by the vector y_j, and P(x ∈ C_j) is the probability that x belongs to the cluster of points represented by y_j. To determine the association probabilities at a given expected distortion, we maximize the entropy subject to the constraint of (1). For a fixed Y = {y_j}, we assume that the association probabilities of different data points are independent. We may thus express the entropy as

(2)    H = −Σ_x Σ_j P(x ∈ C_j) log P(x ∈ C_j)

The probability distribution that maximizes the entropy under the expectation constraint is the Gibbs distribution:

(3)    P(x ∈ C_j) = (1/Z_x) exp(−d(x, y_j)/T)

where

    Z_x = Σ_j exp(−d(x, y_j)/T)

is the partition function. The inverse temperature B = 1/T is the Lagrange multiplier defined by the value of E in (1).

Problem 11.16

(a) The free energy is

(1)    F = D − TH

where D is the expected distortion, T is the temperature, and H is the conditional entropy. The expected distortion is defined by
(2)    D = Σ_x Σ_y P(X = x) P(Y = y|X = x) d(x, y)

The conditional entropy is defined by

(3)    H(Y|X) = −Σ_x Σ_y P(X = x) P(Y = y|X = x) log P(Y = y|X = x)

The minimizing P(Y = y|X = x) is itself defined by the Gibbs distribution:

(4)    P(Y = y|X = x) = (1/Z_x) exp(−d(x, y)/T)

where

(5)    Z_x = Σ_y exp(−d(x, y)/T)

is the partition function. Substituting (2) to (5) into (1), we get

    F* = Σ_x Σ_y P(X = x)(1/Z_x) exp(−d(x, y)/T) d(x, y)
         + T Σ_x Σ_y P(X = x)(1/Z_x) exp(−d(x, y)/T)(−d(x, y)/T − log Z_x)
       = −T Σ_x Σ_y P(X = x)(1/Z_x) exp(−d(x, y)/T) log Z_x

This result simplifies as follows by virtue of the definition given in (5) for the partition function:

(6)    F* = −T Σ_x P(X = x) log Z_x

(b) Differentiating the minimum free energy F* of (6) with respect to y:

(7)    ∂F*/∂y = −T Σ_x P(X = x)(1/Z_x)(∂Z_x/∂y)

Using the definition of Z_x given in (5), we write

(8)    ∂Z_x/∂y = −(1/T) exp(−d(x, y)/T)(∂d(x, y)/∂y)
Hence, we may rewrite (7) as

(9)    ∂F*/∂y = Σ_x P(X = x)(1/Z_x) exp(−d(x, y)/T)(∂d(x, y)/∂y)
             = Σ_x P(X = x) P(Y = y|X = x)(∂d(x, y)/∂y)

where use has been made of (4). Noting that

    P(X = x, Y = y) = P(Y = y|X = x) P(X = x)

we may then state that the condition for minimizing the Lagrangian with respect to y is

(10)   Σ_x P(X = x, Y = y)(∂d(x, y)/∂y) = 0    for all y

Normalizing this result with respect to P(X = x), we get the minimizing condition:

(11)   Σ_x P(Y = y|X = x)(∂d(x, y)/∂y) = 0    for all y

(c) Consider the squared Euclidean distortion

    d(x, y) = ‖x − y‖² = (x − y)ᵀ(x − y)

for which we have

(12)   ∂d(x, y)/∂y = −2(x − y)

For this particular measure, we find it more convenient to normalize (10) with respect to the probability

    P(Y = y) = Σ_x P(X = x, Y = y)

We may then write the minimizing condition with respect to y as

(13)   Σ_x P(X = x|Y = y)(∂d(x, y)/∂y) = 0
Using (12) in (13) and solving for y, we get the desired minimizing solution

(14)   y = Σ_x P(X = x|Y = y) x / Σ_x P(X = x|Y = y)

which is recognized as the formula for a centroid.
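The two-step iteration implied by this result — Gibbs association probabilities at temperature T, followed by the centroid update — can be written down directly. The sketch below is our own illustration (hypothetical names; a uniform prior P(X = x) = 1/n is assumed):

```python
import numpy as np

def da_step(X, Y, T):
    """One deterministic-annealing step: X is (n, d) data, Y is (k, d) code vectors."""
    # squared Euclidean distortion d(x, y) between every data point and code vector
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)          # shape (n, k)
    # Gibbs association probabilities P(Y=y|X=x) = exp(-d/T) / Z_x
    logp = -d / T
    P = np.exp(logp - logp.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    # centroid condition (14): y = sum_x P(X=x|Y=y) x, with uniform P(X=x)
    Y_new = (P.T @ X) / P.sum(axis=0)[:, None]
    return P, Y_new
```

At low T the associations harden and the update reduces to the ordinary K-means centroid step; at high T all code vectors coalesce near the global mean.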
Problem 11.17

The advantage of deterministic annealing over maximum likelihood is that it does not make any assumption about the underlying probability distribution of the data.

Problem 11.18

(a) Let

    φ_k(x) = exp(−(1/2σ²)‖x − t_k‖²),    k = 1, 2, ..., K

where t_k is the center or prototype vector of the kth radial basis function and K is the number of such functions (i.e., hidden units). Define the normalized radial basis function

    P_k(x) = φ_k(x) / Σ_k φ_k(x)

The average squared cost over the training set is

(1)    d = (1/N) Σ_{i=1}^N ‖y_i − F(x_i)‖²

where F(x_i) is the output vector of the RBF network in response to the input x_i. The Gibbs distribution for P(x ∈ R) is

(2)    P(x ∈ R) = (1/Z_x) exp(−d/T)

where d is defined in (1) and

(3)    Z_x = Σ_{y_i} exp(−d/T)

(b) The Lagrangian for minimizing the average misclassification cost is

    F = d − TH

where the average squared cost d is defined in (1), and the entropy H is defined by

    H = −Σ_x Σ_j p_j(x) log p_j(x)

where p_j(x) is the probability of associating class j at the output of the RBF network with the input x.
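The normalized radial-basis functions P_k(x) of part (a) amount to a softmax over negative squared distances. A minimal sketch (our own naming, not from the text):

```python
import numpy as np

def normalized_rbf(x, centers, sigma):
    """phi_k(x) = exp(-||x - t_k||^2 / (2 sigma^2)); P_k(x) = phi_k / sum_j phi_j."""
    d2 = ((centers - x) ** 2).sum(axis=1)   # squared distances to all centers t_k
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return phi / phi.sum()
```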
CHAPTER 12Dynamic Programming
Problem 12.1
As the discount factor γ approaches 1, the computation of the cost-to-go function J^π(i) takes longer because of the corresponding increase in the effective number of time steps involved in the computation.
Problem 12.2
(a) Let π be an arbitrary policy, and suppose that it chooses action a ∈ A_i at time step 0. We may then write

    J^π(i) = Σ_{a∈A_i} p_a [ c(i, a) + Σ_{j=1}^N p_ij(a) W^π(j) ]

where p_a is the probability of choosing action a, c(i, a) is the expected cost, p_ij(a) is the probability of transition from state i to state j under action a, W^π(j) is the expected cost-to-go function from time step n = 1 onward, and j is the state at that time step. We now note that

    W^π(j) ≥ γ J(j)

which follows from the observation that if the state at time step n = 1 is j, then the situation at that time step is the same as if the process had started in state j, except that all the returns are multiplied by the discount factor γ. Hence, we have

    J^π(i) ≥ Σ_{a∈A_i} p_a [ c(i, a) + γ Σ_j p_ij(a) J(j) ]
           ≥ Σ_{a∈A_i} p_a min_{a∈A_i} [ c(i, a) + γ Σ_j p_ij(a) J(j) ]
           = min_{a∈A_i} [ c(i, a) + γ Σ_j p_ij(a) J(j) ]

which implies that

(1)    J^π(i) ≥ min_{a∈A_i} [ c(i, a) + γ Σ_j p_ij(a) J(j) ]
(b) Suppose we next go the other way by choosing a_0 with

(2)    c(i, a_0) + γ Σ_j p_ij(a_0) J(j) = min_a [ c(i, a) + γ Σ_j p_ij(a) J(j) ]

Let π be the policy that chooses a_0 at time step 0 and, if the next state is j, views the process as originating in state j and follows a policy π_j such that

    J^{π_j}(j) ≤ J(j) + ε

where ε is a small positive number. Hence

(3)    J^π(i) = c(i, a_0) + γ Σ_{j=1}^N p_ij(a_0) J^{π_j}(j)
             ≤ c(i, a_0) + γ Σ_{j=1}^N p_ij(a_0) J(j) + γε

Since J(i) ≤ J^π(i), (3) implies that

    J(i) ≤ c(i, a_0) + γ Σ_j p_ij(a_0) J(j) + γε

Hence, from (2) it follows that

(4)    J(i) ≤ min_{a∈A_i} [ c(i, a) + γ Σ_j p_ij(a) J(j) ] + γε

(c) Finally, since ε is arbitrary, we immediately deduce from (1) and (4) that the optimum cost-to-go function satisfies

(5)    J*(i) = min_a [ c(i, a) + γ Σ_j p_ij(a) J*(j) ]

which is the desired result.

Problem 12.3

Writing the system of N simultaneous equations (12.22) of the text in matrix form:
Jµ = c(µ) + γP(µ)Jµ (1)
where

    J^μ = [J^μ(1), J^μ(2), ..., J^μ(N)]ᵀ
    c(μ) = [c(1, μ), c(2, μ), ..., c(N, μ)]ᵀ

    P(μ) = [ p_11(μ)  p_12(μ)  ...  p_1N(μ) ]
           [ p_21(μ)  p_22(μ)  ...  p_2N(μ) ]
           [    :        :             :    ]
           [ p_N1(μ)  p_N2(μ)  ...  p_NN(μ) ]

Rearranging terms in (1), we may write

    (I − γ P(μ)) J^μ = c(μ)

where I is the N-by-N identity matrix. For the solution J^μ to be unique, we require that the N-by-N matrix (I − γP(μ)) have an inverse for all possible values of the discount factor γ.
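This linear system gives a direct numerical route to policy evaluation: solve (I − γP(μ))J = c(μ) rather than iterate. A sketch (our own names; the costs and transition matrix below are made up purely for illustration):

```python
import numpy as np

def evaluate_policy(c, P, gamma):
    """Cost-to-go of a fixed policy mu: solve (I - gamma * P(mu)) J = c(mu)."""
    return np.linalg.solve(np.eye(len(c)) - gamma * P, c)
```

For γ < 1 and a stochastic P(μ), the spectral radius of γP(μ) is below 1, so the matrix is always invertible and the solution unique.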
Problem 12.4
Consider an admissible policy {μ_0, μ_1, ...}, a positive integer K, and cost-to-go function J. Let the costs of the first K stages be accumulated, and add the terminal cost γ^K J(X_K), thereby obtaining the total expected cost

    E[ γ^K J(X_K) + Σ_{n=0}^{K-1} γ^n g(X_n, μ_n(X_n), X_{n+1}) ]

where E is the expectation operator. To minimize the total expected cost, we start with γ^K J(X_K) and perform K iterations of the dynamic programming algorithm, as shown by

(1)    J_n(X_n) = min_{μ_n} E[ g_n(X_n, μ_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1}) ]

with the initial condition

    J_K(X) = γ^K J(X)
Now consider the function V_n defined by

(2)    V_n(X) = J_{K-n}(X) / γ^{K-n}    for all n and X

The function V_K(X) is then the optimal K-stage cost J_0(X). Hence, the dynamic programming algorithm of (1) can be rewritten in terms of the function V_n(X) as follows:

    V_{n+1}(X_0) = min_μ E[ g(X_0, μ(X_0), X_1) + γ V_n(X_1) ]

with the initial condition

    V_0(X) = J(X)

which has the same mathematical form as that specified in the problem.

Problem 12.5

An important property of dynamic programming is the monotonicity property described by

    J^{μ_{n+1}} ≤ J^{μ_n}

This property follows from the fact that if the terminal cost g_K for K stages is changed to a uniformly larger cost ḡ_K, that is,

    ḡ_K(X_K) ≥ g_K(X_K)    for all X_K,

then the last-stage cost-to-go function J_{K-1}(X_{K-1}) will be uniformly increased. In more general terms, we may state the following. Given two cost-to-go functions J_{K+1} and J̄_{K+1} with

    J̄_{K+1}(X_{K+1}) ≥ J_{K+1}(X_{K+1})    for all X_{K+1},

we find that for all X_K and μ_K the following relation holds:

    E[ g_K(X_K, μ_K, X_{K+1}) + J_{K+1}(X_{K+1}) ] ≤ E[ g_K(X_K, μ_K, X_{K+1}) + J̄_{K+1}(X_{K+1}) ]

This relation merely restates the monotonicity property of the dynamic programming algorithm.
Problem 12.6
According to (12.24) of the text, the Q-factor for state-action pair (i, a) and stationary policy μ satisfies the condition

    Q^μ(i, μ(i)) = min_a Q^μ(i, a)    for all i

This equation emphasizes the fact that the policy μ is greedy with respect to the cost-to-go function J^μ(i).

Problem 12.7

Figure 1 presents an interesting interpretation of the policy iteration algorithm. In this interpretation, the policy evaluation step is viewed as the work of a critic that evaluates the performance of the current policy; that is, it calculates an estimate of the cost-to-go function J^{μ_n}. The policy improvement step is viewed as the work of a controller or actor that accounts for the latest evaluation made by the critic and acts out the improved policy μ_{n+1}. In short, the critic looks after policy evaluation, the controller (actor) looks after policy improvement, and the iteration between them goes on.

[Figure 1: Problem 12.7 — block diagram: the critic observes the state i of the environment and evaluates the cost-to-go J^{μ_n}; the controller (actor) uses this evaluation to apply the improved policy μ_{n+1}(i) to the environment.]

Problem 12.8

From (12.29) in the text, we find that for each possible state, the value iteration algorithm requires NM iterations, where N is the number of states and M is the number of admissible actions. Hence, the total number of iterations for all N states is N²M.
Problem 12.9
To reformulate the value-iteration algorithm in terms of Q-factors, the only change we need to make is in step 2 of Table 12.2 in the text. Specifically, we rewrite this step as follows:

For n = 0, 1, 2, ..., compute

    Q(i, a) = c(i, a) + γ Σ_{j=1}^N p_ij(a) J_n(j)

    J_{n+1}(i) = min_a Q(i, a)

Problem 12.10

The policy-iteration algorithm alternates between two steps: policy evaluation and policy improvement. In other words, an optimal policy is computed directly in the policy-iteration algorithm; no such computation takes place in the value-iteration algorithm.

Another point of difference is that in policy iteration the cost-to-go function J^{μ_n} is recomputed on each iteration of the algorithm. This computational burden is avoided in the value-iteration algorithm.

Problem 12.14

From the definition of the Q-factor given in (12.24) of the text and Bellman's optimality equation (12.11), we immediately see that

    J*(i) = min_a Q(i, a)

where the minimization is performed over all possible actions a.

Problem 12.15

The value-iteration algorithm requires knowledge of the state transition probabilities. In contrast, Q-learning operates without this knowledge; through an interactive process, it learns estimates of the transition probabilities in an implicit manner. Recognizing the intimate relationship between value iteration and Q-learning, we may therefore view Q-learning as an adaptive version of the value-iteration algorithm.
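The Q-factor form of value iteration described in Problem 12.9 can be sketched in a few lines (our own illustration; the array layouts `c[i, a]` and `P[a]` are assumptions, not notation from the text):

```python
import numpy as np

def value_iteration(c, P, gamma, tol=1e-10):
    """c[i, a]: expected cost of action a in state i; P[a]: N x N transition
    matrix under action a. Each sweep forms Q(i, a) = c(i, a) + gamma * sum_j
    P[a][i, j] * J_n(j) and sets J_{n+1}(i) = min_a Q(i, a)."""
    N, M = c.shape
    J = np.zeros(N)
    while True:
        Q = np.stack([c[:, a] + gamma * P[a] @ J for a in range(M)], axis=1)
        J_next = Q.min(axis=1)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next, Q.argmin(axis=1)   # cost-to-go and greedy policy
        J = J_next
```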
Problem 12.16
Using Table 12.4 in the text, we may construct the signal-flow graph in Figure 1 for the Q-learning algorithm:
Problem 12.17
The whole point of the Q-learning algorithm is that it eliminates the need for knowing the statetransition probabilities. If knowledge of the state transition probabilities is available, then the Q-learning algorithm assumes the same form as the value-iteration algorithm.
[Figure 1: Problem 12.16 — signal-flow graph of the Q-learning algorithm: the current Q-factor Q_n(i_n, a, w) is used to compute the optimum action a_n and the target Q-factor Q_target; the update block produces Q_{n+1}(i_{n+1}, a, w), which is fed back through a unit delay.]
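The model-free character of Q-learning noted above can be illustrated with a tabular sketch that touches the environment only through a sampled `step` function (our own construction; the learning-rate and exploration settings are arbitrary choices):

```python
import random
import numpy as np

def q_learning(step, n_states, n_actions, episodes=2000, gamma=0.9,
               eta=0.1, eps=0.2, seed=0):
    """Tabular Q-learning for a cost-minimization MDP. `step(i, a)` returns
    (cost, next_state); no transition probabilities are ever supplied.
    Update: Q(i,a) <- Q(i,a) + eta * [c + gamma * min_b Q(j,b) - Q(i,a)]."""
    rng = random.Random(seed)
    Q = np.zeros((n_states, n_actions))
    i = 0
    for _ in range(episodes):
        # epsilon-greedy action selection (greedy = minimum cost-to-go)
        a = rng.randrange(n_actions) if rng.random() < eps else int(Q[i].argmin())
        cost, j = step(i, a)
        target = cost + gamma * Q[j].min()
        Q[i, a] += eta * (target - Q[i, a])
        i = j
    return Q
```

With knowledge of the transition probabilities, the sampled target would be replaced by its expectation, recovering the value-iteration step of Problem 12.17.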
CHAPTER 13Neurodynamics
Problem 13.1
The equilibrium state x(0) is (asymptotically) stable if in a small neighborhood around x(0), thereexists a positive definite function V(x) such that its derivative with respect to time is negativedefinite in that region.
Problem 13.3
Consider the system of coupled nonlinear differential equations:

    dx_j/dt = φ_j(W, i, x),    j = 1, 2, ..., N

where W is the weight matrix, i is the bias vector, and x is the state vector with its jth element denoted by x_j.

(a) With the bias vector i treated as input and with fixed initial condition x(0), let x(∞) denote the final state vector of the system. Then

    0 = φ_j(W, i, x(∞)),    j = 1, 2, ..., N

For a given matrix W and input vector i, the set of initial points x(0) evolves to a fixed point. The fixed points are functions of W and i. Thus, the system acts as a "mapper" with i as input and x(∞) as output, as shown in Fig. 1(a).

[Figure 1: Problem 13.3 — (a) mapper: input i, output x(∞), with W and x(0) fixed; (b) pattern associator: input x(0), output x(∞), with W and i fixed.]

(b) With the initial state vector x(0) treated as input and the bias vector i fixed, let x(∞) denote the final state vector of the system. We may then write

    0 = φ_j(W, i: fixed, x(∞)),    j = 1, 2, ..., N
Thus, with x(0) acting as input and x(∞) acting as output, the dynamic system behaves like a pattern associator, as shown in Fig. 1(b).
Problem 13.4
(a) We are given the fundamental memories:

    ξ_1 = [+1, +1, +1, +1, +1]ᵀ
    ξ_2 = [+1, −1, −1, +1, −1]ᵀ
    ξ_3 = [−1, +1, −1, +1, +1]ᵀ

The weight matrix of the Hopfield network (with N = 5 and p = 3) is therefore

    W = (1/N) Σ_{i=1}^p ξ_i ξ_iᵀ − (p/N) I

      = (1/5) [  0  −1  +1  +1  −1 ]
              [ −1   0  +1  +1  +3 ]
              [ +1  +1   0  −1  +1 ]
              [ +1  +1  −1   0  +1 ]
              [ −1  +3  +1  +1   0 ]

(b) According to the alignment condition, we write

    sgn(Wξ_i) = ξ_i,    i = 1, 2, 3

Consider first ξ_1, for which we have

    sgn(Wξ_1) = sgn((1/5)[0, +4, +2, +2, +4]ᵀ) = [+1, +1, +1, +1, +1]ᵀ = ξ_1

Similarly,

    sgn(Wξ_2) = sgn((1/5)[+2, −4, −2, 0, −4]ᵀ) = [+1, −1, −1, +1, −1]ᵀ = ξ_2

    sgn(Wξ_3) = sgn((1/5)[−2, +4, 0, +2, +4]ᵀ) = [−1, +1, −1, +1, +1]ᵀ = ξ_3

Thus all three fundamental memories satisfy the alignment condition.

Note: Wherever a particular element of the product Wξ_i is zero, the neuron in question is left in its previous state.
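The weight matrix and the alignment checks above can be reproduced in a few lines (a sketch of the same computation in our own notation):

```python
import numpy as np

memories = np.array([[+1, +1, +1, +1, +1],
                     [+1, -1, -1, +1, -1],
                     [-1, +1, -1, +1, +1]], dtype=float)
p, N = memories.shape

# Outer-product (Hebbian) rule: W = (1/N) sum_i xi xi^T - (p/N) I
W = memories.T @ memories / N - (p / N) * np.eye(N)

def sgn_update(x, v):
    """sgn with the 'stay in previous state' convention for zero inputs."""
    y = np.sign(v)
    y[y == 0] = x[y == 0]
    return y

# Alignment condition: every memory is a fixed point of the update
for xi in memories:
    assert np.array_equal(sgn_update(xi, W @ xi), xi)
```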
(c) Consider the noisy probe

    x = [+1, −1, +1, +1, +1]ᵀ

which is the fundamental memory ξ_1 with its second element reversed in polarity. We write

(1)    Wx = (1/5)[+2, +4, 0, 0, −2]ᵀ

Therefore,

    sgn(Wx) = [+1, +1, +1, +1, −1]ᵀ

Thus, neurons 2 and 5 want to change their states. We therefore have two options:

• Neuron 2 is chosen for a state change, which yields the result

    x = [+1, +1, +1, +1, +1]ᵀ

This vector is recognized as the fundamental memory ξ_1, and the computation is thereby terminated.

• Neuron 5 is chosen to change its state, yielding the vector

    x = [+1, −1, +1, +1, −1]ᵀ

Next, we go on to compute

    Wx = (1/5)[+4, −2, −2, −2, −2]ᵀ

and therefore

    sgn(Wx) = [+1, −1, −1, −1, −1]ᵀ

Hence, neurons 3 and 4 want to change their states:

• If we permit neuron 3 to change its state from +1 to −1, we get

    x = [+1, −1, −1, +1, −1]ᵀ

which is recognized as the fundamental memory ξ_2.

• If we permit neuron 4 to change its state from +1 to −1, we get

    x = [+1, −1, +1, −1, −1]ᵀ

which is recognized as the negative of the third fundamental memory, −ξ_3.

In both cases, the new state satisfies the alignment condition, and the computation is then terminated.

Thus, when the noisy version of ξ_1 is applied to the network, with its second element reversed in polarity, one of two things can happen with equal likelihood:

1. The original ξ_1 is recovered after one iteration.
2. The second fundamental memory ξ_2 or the negative of the third fundamental memory, −ξ_3, is recovered after two iterations, which, of course, is in error.
Problem 13.5
Given the probe vector

    x = [+1, −1, +1, +1, +1]ᵀ

and the weight matrix W of Problem 13.4, we find that

    Wx = (1/5)[+2, +4, 0, 0, −2]ᵀ

and

    sgn(Wx) = [+1, +1, +1, +1, −1]ᵀ

According to this result, neurons 2 and 5 have changed their states; in synchronous updating, this is permitted. Thus, with the new state vector

    x = [+1, +1, +1, +1, −1]ᵀ

on the next iteration we compute

    Wx = (1/5)[+2, −2, 0, 0, +4]ᵀ

Hence,

    sgn(Wx) = [+1, −1, +1, +1, +1]ᵀ

The new state vector is therefore

    x = [+1, −1, +1, +1, +1]ᵀ

which is recognized as the original probe. In this problem, we thus find that the network exhibits a limit cycle of length 2.
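The synchronous oscillation above is easy to reproduce numerically (our own sketch, reusing the Hebbian weight construction of Problem 13.4; names are ours):

```python
import numpy as np

memories = np.array([[+1, +1, +1, +1, +1],
                     [+1, -1, -1, +1, -1],
                     [-1, +1, -1, +1, +1]], dtype=float)
p, N = memories.shape
W = memories.T @ memories / N - (p / N) * np.eye(N)

def sync_step(x):
    """Synchronous update: all neurons fire at once; zero input keeps state."""
    v = W @ x
    y = np.sign(v)
    y[y == 0] = x[y == 0]
    return y

x0 = np.array([+1, -1, +1, +1, +1], dtype=float)
x1 = sync_step(x0)   # -> [+1, +1, +1, +1, -1]
x2 = sync_step(x1)   # -> back to x0: a limit cycle of length 2
```

Under asynchronous (one-neuron-at-a-time) updating the same network would instead settle into a fixed point, which is why the Hopfield model prescribes asynchronous dynamics.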
Problem 13.6
(a) The vectors

    x_1 = [−1, −1, −1, −1, −1]ᵀ
    x_2 = [−1, +1, +1, −1, +1]ᵀ
    x_3 = [+1, −1, +1, −1, −1]ᵀ

are simply the negatives of the three fundamental memories ξ_1, ξ_2, ξ_3 considered in Problem 13.4, respectively. These three vectors are therefore also fundamental memories of the Hopfield network.
(b) Consider the vector

    x = [0, +1, +1, +1, +1]ᵀ

which is the result of masking the first element of the fundamental memory ξ_1 of Problem 13.4. According to our notation, a neuron of the Hopfield network is in either state +1 or −1. We therefore have the choice of setting the zero element of x to +1 or −1. The first option restores the vector x to its original form, the fundamental memory ξ_1, which satisfies the alignment condition. Alternatively, we may set the zero element equal to −1, obtaining

    x = [−1, +1, +1, +1, +1]ᵀ

In this latter case, the alignment condition is not satisfied. The obvious choice is therefore the former one.
Problem 13.7

We are given

    W = [  0  −1 ]
        [ −1   0 ]

(a) For state s_2 = [−1, +1]ᵀ we have

    Ws_2 = [ 0 −1; −1 0 ][−1, +1]ᵀ = [−1, +1]ᵀ

which yields

    sgn(Ws_2) = [−1, +1]ᵀ = s_2

Next, for state s_4 = [+1, −1]ᵀ, we have
    Ws_4 = [ 0 −1; −1 0 ][+1, −1]ᵀ = [+1, −1]ᵀ

which yields

    sgn(Ws_4) = [+1, −1]ᵀ = s_4

Thus, both states s_2 and s_4 satisfy the alignment condition and are therefore stable.

Consider next the state s_1 = [+1, +1]ᵀ, for which we write

    Ws_1 = [−1, −1]ᵀ

which yields

    sgn(Ws_1) = [−1, −1]ᵀ ≠ s_1

Thus, both neurons want to change; suppose we pick neuron 1 to change its state, yielding the new state vector [−1, +1]ᵀ. This is a stable vector, as it satisfies the alignment condition. If, however, we permit neuron 2 to change its state, we get a state vector equal to s_4. Similarly, we may show that the state vector s_3 = [−1, −1]ᵀ is also unstable. The resulting state-transition diagram of the network is thus as depicted in Fig. 1.

[Figure 1: Problem 13.7 — state-transition diagram over the four states (+1, +1), (−1, +1), (−1, −1), (+1, −1): the unstable states (+1, +1) and (−1, −1) make transitions into the stable states (−1, +1) and (+1, −1).]
The results depicted in Fig. 1 assume the use of asynchronous updating. If, however, we use synchronous updating, we find that in the case of s_1:

    sgn(Ws_1) = [−1, −1]ᵀ

Permitting both neurons to change state, we get the new state vector [−1, −1]ᵀ, which is recognized as state s_3. Now we find that

    W [−1, −1]ᵀ = [+1, +1]ᵀ

which takes us back to state s_1. Thus, in the synchronous-updating case, the states s_1 and s_3 form a limit cycle of length 2.

Returning to the normal operation of the Hopfield network, we note that the energy function of the network is

(1)    E = −(1/2) Σ_i Σ_{j, j≠i} w_ji s_i s_j
         = −(1/2) w_12 s_1 s_2 − (1/2) w_21 s_2 s_1
         = −w_12 s_1 s_2        since w_12 = w_21
         = s_1 s_2

Evaluating (1) for all possible states of the network, we get the following table:

    State       Energy
    [+1, +1]      +1
    [−1, +1]      −1
    [−1, −1]      +1
    [+1, −1]      −1

Thus, the states s_2 = [−1, +1]ᵀ and s_4 = [+1, −1]ᵀ attain the global minimum of the energy and are therefore stable.
Problem 13.8
The energy function of the Hopfield network is

(1)    E = −(1/2) Σ_i Σ_j w_ji s_j s_i

The overlap m_v is defined by

(2)    m_v = (1/N) Σ_j s_j ξ_{v,j}

and the weight w_ji is itself defined by

(3)    w_ji = (1/N) Σ_v ξ_{v,j} ξ_{v,i}

Substituting (3) into (1) yields

    E = −(1/2N) Σ_v Σ_i Σ_j ξ_{v,j} ξ_{v,i} s_j s_i
      = −(1/2N) Σ_v ( Σ_i s_i ξ_{v,i} )( Σ_j s_j ξ_{v,j} )
      = −(1/2N) Σ_v (m_v N)(m_v N)
      = −(N/2) Σ_v m_v²

where, in the third line, we made use of (2).
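The identity E = −(N/2) Σ_v m_v² is easy to verify numerically (our own check; note that, as in the derivation above, the diagonal terms i = j are retained in both expressions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 8, 3
xi = rng.choice([-1.0, 1.0], size=(q, N))   # fundamental memories
s = rng.choice([-1.0, 1.0], size=N)         # an arbitrary network state

W = xi.T @ xi / N                           # w_ji = (1/N) sum_v xi_vj xi_vi
E_direct = -0.5 * s @ W @ s                 # E = -(1/2) sum_ij w_ji s_j s_i
m = xi @ s / N                              # overlaps m_v = (1/N) sum_j s_j xi_vj
E_overlap = -0.5 * N * (m ** 2).sum()       # E = -(N/2) sum_v m_v^2

assert np.isclose(E_direct, E_overlap)
```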
Problem 13.11
We start with the function (see (13.48) of the text)

(1)    E = (1/2) Σ_{i=1}^N Σ_{j=1}^N c_ji φ_i(u_i) φ_j(u_j) − Σ_{j=1}^N ∫_0^{u_j} b_j(λ) φ′_j(λ) dλ

where φ′_j(·) is the derivative of the function φ_j(·) with respect to its argument. We now differentiate the function E with respect to time t and note the following relations:

1. c_ji = c_ij

2. (∂/∂t) φ_j(u_j) = (∂u_j/∂t)(∂/∂u_j) φ_j(u_j) = (∂u_j/∂t) φ′_j(u_j)

3. (∂/∂t) ∫_0^{u_j} b_j(λ) φ′_j(λ) dλ = (∂u_j/∂t)(∂/∂u_j) ∫_0^{u_j} b_j(λ) φ′_j(λ) dλ = (∂u_j/∂t) b_j(u_j) φ′_j(u_j)

Accordingly, we may use (1) to express the derivative ∂E/∂t as follows:

(2)    ∂E/∂t = Σ_{j=1}^N (∂u_j/∂t) [ Σ_{i=1}^N c_ji φ′_j(u_j) φ_i(u_i) − b_j(u_j) φ′_j(u_j) ]

From Eq. (13.47) in the text, we have

(3)    ∂u_j/∂t = a_j(u_j) [ b_j(u_j) − Σ_{i=1}^N c_ji φ_i(u_i) ],    j = 1, 2, ..., N

Hence, using (3) in (2) and collecting terms, we get the final result

(4)    ∂E/∂t = −Σ_{j=1}^N a_j(u_j) φ′_j(u_j) [ b_j(u_j) − Σ_{i=1}^N c_ji φ_i(u_i) ]²

Provided that the coefficient a_j(u_j) satisfies the nonnegativity condition
    a_j(u_j) > 0    for all u_j

and the function φ_j(u_j) satisfies the monotonicity condition

    φ′_j(u_j) ≥ 0    for all u_j,

we then immediately see from (4) that

    ∂E/∂t ≤ 0    for all t

In words, the function E defined in (1) is a Lyapunov function for the coupled system of nonlinear differential equations (3).
Problem 13.12
From (13.61) of the text:

(1)    (d/dt) v_j(t) = −v_j(t) + Σ_{i=1}^N c_ji φ_i(v_i),    j = 1, 2, ..., N

where

    c_ji = δ_ji + β w_ji

and δ_ji is a Kronecker delta. According to the Cohen-Grossberg theorem of (13.47) in the text, we have

(2)    (d/dt) u_j(t) = a_j(u_j) [ b_j(u_j) − Σ_{i=1}^N c_ji φ_i(u_i) ]

Comparison of (1) and (2) yields the following correspondences between the Cohen-Grossberg theorem and the brain-state-in-a-box (BSB) model:

    Cohen-Grossberg Theorem    BSB Model
    u_j                        v_j
    a_j(u_j)                   1
    b_j(u_j)                   −v_j
    c_ji                       −c_ji
    φ_i(u_i)                   φ(v_i)

Therefore, using these correspondences in (13.48) of the text,

    E = (1/2) Σ_i Σ_j c_ji φ_i(u_i) φ_j(u_j) − Σ_j ∫_0^{u_j} b_j(λ) φ′_j(λ) dλ,
we get the following Lyapunov function for the BSB model:

(3)    E = −(1/2) Σ_i Σ_j c_ji φ(v_i) φ(v_j) + Σ_j ∫_0^{v_j} λ φ′(λ) dλ

From (13.55) in the text, we note that

    φ(y_j) = +1    if y_j > 1
             y_j   if −1 ≤ y_j ≤ 1
             −1    if y_j < −1

We therefore have

    φ′(y_j) = 0    if |y_j| > 1
              1    if |y_j| ≤ 1

Hence, the second term of (3) is given by

(4)    Σ_j ∫_0^{v_j} λ φ′(λ) dλ = Σ_j ∫_0^{v_j} λ dλ = (1/2) Σ_j v_j² = (1/2) Σ_j x_j²    inside the linear region

The first term of (3) is given by

(5)    −(1/2) Σ_j Σ_i c_ji φ(v_i) φ(v_j) = −(1/2) Σ_j Σ_i (δ_ji + β w_ji) φ(v_i) φ(v_j)
       = −(β/2) Σ_j Σ_i w_ji x_j x_i − (1/2) Σ_j φ²(v_j)
       = −(β/2) Σ_j Σ_i w_ji x_j x_i − (1/2) Σ_j x_j²

Finally, substituting (4) and (5) into (3), we obtain

    E = −(β/2) Σ_j Σ_i w_ji x_j x_i = −(β/2) xᵀWx

which is the desired result.
Problem 13.13
The activation function φ(v) of Fig. P13.13 is a nonmonotonic function of its argument v; that is, its derivative ∂φ/∂v assumes both positive and negative values. It therefore violates the monotonicity condition required by the Cohen-Grossberg theorem; see Eq. (4) of Problem 13.11. This means that the Cohen-Grossberg theorem is not applicable to an associative memory, such as a Hopfield network, that uses the activation function of Fig. P13.13.
CHAPTER 15Dynamically Driven Recurrent Networks
Problem 15.1
Referring to the simple recurrent neural network of Fig. 15.3, let the vector u(n) denote the input signal, x(n) the signal produced at the output of the hidden layer, and y(n) the output signal of the whole network. Then, treating x(n) as the state of the network, we may describe the state-space model of the network as follows:

    x(n + 1) = f(x(n), u(n))
    y(n) = g(x(n))

where f(·) and g(·) are vector-valued functions of their respective arguments.
Problem 15.2
Referring to the recurrent MLP of Fig. 15.4, we note the following:

(1)    x_I(n + 1) = f_1(x_I(n), u(n))

(2)    x_II(n + 1) = f_2(x_II(n), x_I(n + 1))

(3)    x_O(n + 1) = f_3(x_O(n), x_II(n + 1))

where f_1(·), f_2(·), and f_3(·) are vector-valued functions of their respective arguments. Substituting (1) into (2), we write

(4)    x_II(n + 1) = f_2(x_II(n), f_1(x_I(n), u(n)))

Define the state of the system at time n as

(5)    x(n) = [x_I(n); x_II(n); x_O(n)]

Then, from (4) and (5) we immediately see that

(6)    x(n + 1) = f(x(n), u(n))

where f is a new vector-valued function. Define the output of the system as
(7)    y(n) = x_O(n)

With x_O(n) included in the definition of the state x(n + 1), and with x(n + 1) dependent on the input u(n), we thus have

(8)    y(n) = g(x(n), u(n))

where g(·,·) is another vector-valued function. Equations (6) and (8) define the state-space model of the recurrent MLP.
Problem 15.3
It is indeed possible for a dynamic system to be controllable but unobservable, and vice versa.This statement is justified by virtue of the fact that the conditions for controllability andobservability are entirely different, which means that there are situations where the conditions aresatisfied for one and not for the other.
Problem 15.4
(a) We are given the process equation

    x(n + 1) = φ(W_a x(n) + w_b u(n))

Hence, iterating forward in time, we write

    x(n + 2) = φ(W_a x(n + 1) + w_b u(n + 1))
             = φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))

    x(n + 3) = φ(W_a x(n + 2) + w_b u(n + 2))
             = φ(W_a φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1)) + w_b u(n + 2))

and so on. By induction, we may state that the state x(n + q) is a nested nonlinear function of x(n) and u_q(n), where

    u_q(n) = [u(n), u(n + 1), ..., u(n + q − 1)]ᵀ

(b) The Jacobian of x(n + q) with respect to u_q(n), evaluated at the origin, is

    J_q(n) = ∂x(n + q)/∂u_q(n)    evaluated at x(n) = 0, u(n) = 0

As an illustrative example, consider the case of q = 3. The Jacobian of x(n + 3) with respect to u_3(n) is

    J_3(n) = [ ∂x(n + 3)/∂u(n), ∂x(n + 3)/∂u(n + 1), ∂x(n + 3)/∂u(n + 2) ]    at x(n) = 0, u(n) = 0

From the defining equation of x(n + 3), we find that

    ∂x(n + 3)/∂u(n) = φ′(0)W_a φ′(0)W_a φ′(0)w_b = A·A·b = A²b
    ∂x(n + 3)/∂u(n + 1) = φ′(0)W_a φ′(0)w_b = Ab
    ∂x(n + 3)/∂u(n + 2) = φ′(0)w_b = b

where A = φ′(0)W_a and b = φ′(0)w_b, and all the partial derivatives have been evaluated at x(n) = 0 and u(n) = 0. The Jacobian J_3(n) is therefore

    J_3(n) = [A²b, Ab, b]

We may generalize this result by writing

    J_q(n) = [A^{q−1}b, A^{q−2}b, ..., Ab, b]

Problem 15.5

We start with the state-space model

(1)    x(n + 1) = φ(W_a x(n) + w_b u(n))
       y(n) = cᵀx(n)

where c is a column vector. We thus write

    y(n + 1) = cᵀx(n + 1) = cᵀφ(W_a x(n) + w_b u(n))
(2)    y(n + 1) = cᵀφ(W_a x(n) + w_b u(n))

(3)    y(n + 2) = cᵀx(n + 2) = cᵀφ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))

and so on. By induction, we may therefore state that y(n + q) is a nested nonlinear function of x(n) and u_q(n), where

    u_q(n) = [u(n), u(n + 1), ..., u(n + q − 1)]ᵀ

Define the q-by-1 vector

    y_q(n) = [y(n), y(n + 1), ..., y(n + q − 1)]ᵀ

The Jacobian of y_q(n) with respect to x(n), evaluated at the origin, is defined by

    J_q(n) = ∂y_qᵀ(n)/∂x(n)    evaluated at x(n) = 0, u(n) = 0

As an illustrative example, consider the case of q = 3, for which we have

    J_3(n) = [ ∂y(n)/∂x(n), ∂y(n + 1)/∂x(n), ∂y(n + 2)/∂x(n) ]    at x(n) = 0, u(n) = 0

From (1), we readily find that

    ∂y(n)/∂x(n) = c

From (2), we find that

    ∂y(n + 1)/∂x(n) = (φ′(0)W_a)ᵀc = Aᵀc

From (3), we finally find that

    ∂y(n + 2)/∂x(n) = (φ′(0)W_a)ᵀ(φ′(0)W_a)ᵀc = (Aᵀ)²c

All these partial derivatives have been evaluated at the origin. We thus write

    J_3(n) = [c, Aᵀc, (Aᵀ)²c]

By induction, we may now state that the Jacobian J_q(n) for observability is, in general,

    J_q(n) = [c, Aᵀc, (Aᵀ)²c, ..., (Aᵀ)^{q−1}c]

where c is a column vector and A = φ′(0)W_a.

Problem 15.6

We are given a nonlinear dynamic system described by

(1)    x(n + 1) = f(x(n), u(n))

Suppose x(n) is N-dimensional and u(n) is m-dimensional. Define a new nonlinear dynamic system in which the input is of additive form, as shown by

(2)    x′(n + 1) = f′(x′(n)) + u′(n)

where

(3)    x′(n) = [x(n); u(n − 1)]

(4)    u′(n) = [0; u(n)]

and

(5)    f′(x′(n)) = [f(x(n), u(n)); 0]
Both x′(n) and u′(n) are (N + m)-dimensional, and the first N elements of u′(n) are zero. From these definitions, we readily see that

    x′(n + 1) = [x(n + 1); u(n)] = [f(x(n), u(n)); 0] + [0; u(n)]

which is in perfect agreement with the description of the original nonlinear dynamic system defined in (1).
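The controllability and observability matrices derived in Problems 15.4 and 15.5 for the network linearized at the origin can be assembled numerically. A sketch under our own naming (φ′(0) = 1 is assumed by default; the example matrices in the test are made up):

```python
import numpy as np

def linearized_jacobians(Wa, wb, c, q, phi_prime0=1.0):
    """Linearization at the origin: A = phi'(0) Wa, b = phi'(0) wb.
    Controllability Jacobian: [A^{q-1} b, ..., A b, b]
    Observability Jacobian:   [c, A^T c, ..., (A^T)^{q-1} c]"""
    A = phi_prime0 * Wa
    b = phi_prime0 * wb
    ctrl = np.column_stack([np.linalg.matrix_power(A, q - 1 - k) @ b
                            for k in range(q)])
    obsv = np.column_stack([np.linalg.matrix_power(A.T, k) @ c
                            for k in range(q)])
    return ctrl, obsv
```

Full rank of these matrices at the origin is the condition for local controllability and local observability, respectively, of the recurrent network.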
Problem 15.7
(a) The state-space model of the local activation feedback system of Fig. P15.7a depends on how the linear dynamic component is described. For example, we may define the input as

(1)    z(n) = [x(n − 1); B u(n)]

where B is a (p − 1)-by-(p − 1) matrix and

    u(n) = [u(n), u(n − 1), ..., u(n − p + 2)]ᵀ

Let w denote the synaptic weight vector of the single neuron in Fig. P15.7a, with w_1 being the first element and w_0 denoting the rest. We may then write

(2)    x(n) = wᵀz(n) + b
            = [w_1, w_0ᵀ][x(n − 1); B u(n)] + b
            = w_1 x(n − 1) + B′u′(n)

where

    B′ = [w_0ᵀB, b]    and    u′(n) = [u(n); 1]

The output y(n) is defined by

(3)    y(n) = φ(x(n))

Equations (2) and (3) define the state-space model of Fig. P15.7a, assuming that its linear dynamic component is described by (1).

(b) Consider next the local output feedback system of Fig. P15.7b. Let the linear dynamic component of this system be described by (1). The state of the whole system in Fig. P15.7b is then defined by

(4)    x(n) = φ(wᵀz(n) + b)
            = φ([w_1, w_0ᵀ][x(n − 1); B u(n)] + b)
            = φ(w_1 x(n − 1) + B′u′(n))

where w_1, w_0, B′, and u′(n) are all as defined previously. The output y(n) of Fig. P15.7b is

(5)    y(n) = x(n)

Equations (4) and (5) define the state-space model of the local output feedback system of Fig. P15.7b, assuming that its linear dynamic component is described by (1).

The process (state) equation of the local feedback system of Fig. P15.7a is linear but its measurement equation is nonlinear, and conversely for the local feedback system of Fig. P15.7b. These two local feedback systems are controllable and observable, because they both satisfy the conditions for controllability and observability.

Problem 15.8

We start with the state equation

    x(n + 1) = φ(W_a x(n) + w_b u(n))

Hence, we write

    x(n + 2) = φ(W_a x(n + 1) + w_b u(n + 1))
             = φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))
and so on.
By induction, we may now state that x(n + q) is a nested nonlinear function of x(n) and uq(n), andthus write
where g is a vector-valued function, and
By definition, the output is correspondingly given by
where Φ is a new scalar-valued nonlinear function.
x n 3+( ) φ Wax n 2+( ) wbu n 2+( )+( )=
φ Waφ Waφ Wax n( ) wbu n( )+( ) wbu n 1+( )+( ) wbu n 2+( )+( )=
x n q+( ) g x n( )uq n( )( )=
uq n( ) u n( ) u n 1+( ) … u n q 1–+( ), ,,[ ] T=
y n q+( ) cT x n q+( )=
cT g x n( )uq n( )( )=
Φ x n( ) uq n( ),( )=
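As a numerical check on this unfolding, here is a minimal sketch, assuming tanh for the activation φ and small random choices for Wa, wb, and c (all illustrative, not taken from the text): iterating the state equation three times agrees with the explicitly nested expression for x(n + 3).

```python
import numpy as np

# Hypothetical small model x(n+1) = phi(Wa x(n) + wb u(n)), y(n) = c^T x(n).
rng = np.random.default_rng(0)
N = 3
Wa = rng.normal(size=(N, N)) * 0.5
wb = rng.normal(size=N)
c = rng.normal(size=N)
phi = np.tanh

def step(x, u):
    """One application of the state equation."""
    return phi(Wa @ x + wb * u)

def unfold(x0, u_seq):
    """x(n+q) obtained by iterating the state equation q times,
    i.e. the nested function g(x(n), uq(n)) of Problem 15.8."""
    x = x0
    for u in u_seq:
        x = step(x, u)
    return x

x0 = rng.normal(size=N)
u = rng.normal(size=3)  # u(n), u(n+1), u(n+2)

# Explicit nesting for q = 3, matching the displayed expansion:
x3_nested = phi(Wa @ phi(Wa @ phi(Wa @ x0 + wb * u[0]) + wb * u[1]) + wb * u[2])
assert np.allclose(unfold(x0, u), x3_nested)

# The output y(n+3) = c^T x(n+3) is then a scalar function Phi(x(n), uq(n)).
y3 = c @ unfold(x0, u)
```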
Problem 15.11

Consider a state-space model described by

x(n + 1) = f(x(n), u(n))        (1)
y(n) = g(x(n))        (2)

Using (1), we may readily write

x(n) = f(x(n - 1), u(n - 1))
x(n - 1) = f(x(n - 2), u(n - 2))
x(n - 2) = f(x(n - 3), u(n - 3))

and so on. Accordingly, the simple recurrent network of Fig. 15.3 may be unfolded in time as follows:

[Figure: Problem 15.11 — the network unfolded in time. The inputs u(n - 3), u(n - 2), and u(n - 1) drive successive copies of f(·) that carry the state from x(n - 3) through x(n - 2) and x(n - 1) to x(n), which is passed through g(·) to produce the output y(n).]

Problem 15.12

The local gradient for the hybrid form of the BPTT algorithm is given by

δj(l) = φ′(vj(l))ej(l)                            for l = n
δj(l) = φ′(vj(l))[ej(l) + Σk wkj(l)δk(l + 1)]     for n - h′ < l < n
δj(l) = φ′(vj(l)) Σk wkj(l)δk(l + 1)              for n - h ≤ l ≤ n - h′

where h′ is the number of additional steps taken before performing the next BPTT computation, with h′ < h.
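The backward recursion above can be sketched in code. This is a schematic illustration only: the network (a single fully recurrent layer with φ = tanh, so φ′(v) = 1 - tanh(v)^2), the stored local fields v_hist, the error signals e_hist, and the values of h and h′ are all hypothetical placeholders, not quantities from the text.

```python
import numpy as np

# Schematic sketch of the hybrid BPTT local-gradient recursion.
rng = np.random.default_rng(1)
M = 4                # number of neurons (illustrative)
h, h_prime = 6, 2    # truncation depth h, error-injection window h' (h' < h)
n = 10               # current time step
W = rng.normal(size=(M, M)) * 0.3                      # w_kj(l), held fixed here
v_hist = {l: rng.normal(size=M) for l in range(n - h, n + 1)}
e_hist = {l: rng.normal(size=M) for l in range(n - h, n + 1)}
dphi = lambda v: 1.0 - np.tanh(v) ** 2                 # phi'(v) for phi = tanh

delta = {}
for l in range(n, n - h, -1):                          # backward: l = n, ..., n-h+1
    # sum_k w_kj(l) delta_k(l+1), absent at the starting step l = n
    back = W.T @ delta[l + 1] if l < n else np.zeros(M)
    if l == n:
        delta[l] = dphi(v_hist[l]) * e_hist[l]
    elif l > n - h_prime:                              # error still injected
        delta[l] = dphi(v_hist[l]) * (e_hist[l] + back)
    else:                                              # propagation only
        delta[l] = dphi(v_hist[l]) * back
```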
Problem 15.13

(a) The nonlinear state dynamics of the real-time recurrent learning algorithm described in (15.48) and (15.52) of the text may be reformulated in the equivalent form:

∂yj(n + 1)/∂wkl(n) = φ′(vj(n))[Σi∈A∪B wji(n)∂ξi(n)/∂wkl(n) + δkjξl(n)]        (1)

where δkj is the Kronecker delta and yj(n + 1) is the output of neuron j at time n + 1. For a teacher-forced recurrent network, we have

ξi(n) = ui(n)  if i ∈ A
      = di(n)  if i ∈ C
      = yi(n)  if i ∈ B - C        (2)

Hence, substituting (2) into (1), and noting that the inputs ui(n) and desired responses di(n) do not depend on the weights, we get

∂yj(n + 1)/∂wkl(n) = φ′(vj(n))[Σi∈B-C wji(n)∂yi(n)/∂wkl(n) + δkjξl(n)]        (3)

(b) Let

πkl^j(n) = ∂yj(n)/∂wkl(n)

Provided that the learning-rate parameter η is small enough, we may put

πkl^j(n + 1) = ∂yj(n + 1)/∂wkl(n + 1) ≈ ∂yj(n + 1)/∂wkl(n)

Under this condition, we may rewrite (3) as follows:

πkl^j(n + 1) = φ′(vj(n))[Σi∈B-C wji(n)πkl^i(n) + δkjξl(n)]        (4)

This nonlinear state equation is the centerpiece of the RTRL algorithm using teacher forcing.
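Equation (4) can be exercised numerically. The sketch below assumes φ = tanh and, for simplicity, treats every neuron as belonging to B - C; the induced local fields and the concatenated signal vector ξ(n) are externally supplied placeholders, not quantities from the text.

```python
import numpy as np

# Minimal sketch of the RTRL teacher-forcing sensitivity recursion (4).
rng = np.random.default_rng(2)
M = 3                                  # neurons j in B (all taken to be in B - C)
L = 5                                  # length of the signal vector xi(n)
Wrec = rng.normal(size=(M, M)) * 0.2   # recurrent weights w_ji, i in B - C
dphi = lambda v: 1.0 - np.tanh(v) ** 2 # phi'(v) for phi = tanh

# pi[j, k, l] approximates d y_j(n) / d w_kl(n); initialized to zero.
pi = np.zeros((M, M, L))

def rtrl_step(pi, v, xi):
    """One update pi(n) -> pi(n+1) following equation (4):
    pi_kl^j(n+1) = phi'(v_j(n)) * (sum_i w_ji(n) pi_kl^i(n) + delta_kj xi_l(n))."""
    new_pi = np.einsum('ji,ikl->jkl', Wrec, pi)   # recurrent term, sum over i
    for j in range(M):
        new_pi[j, j, :] += xi                     # delta_kj * xi_l(n) hits k = j
        new_pi[j] *= dphi(v[j])                   # scale by phi'(v_j(n))
    return new_pi

for n in range(20):
    v = rng.normal(size=M)    # induced local fields (placeholder values)
    xi = rng.normal(size=L)   # xi(n): inputs / teacher-forced outputs
    pi = rtrl_step(pi, v, xi)
```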