A Note on TopicRNN
Tomonari MASADA @ Nagasaki University
July 13, 2017
1 Model
TopicRNN is a generative model proposed by Dieng et al. [1]. Its generative story for a particular document $x_{1:T}$ is as follows.
1. Draw a topic vector $\theta \sim N(0, I)$.
2. Given the words $y_{1:t-1}$, for the $t$th word $y_t$ in the document,
(a) Compute the hidden state $h_t = f_W(x_t, h_{t-1})$, where we let $x_t \triangleq y_{t-1}$.
(b) Draw the stop word indicator $l_t \sim \mathrm{Bernoulli}(\sigma(\Gamma^\top h_t))$, with $\sigma$ the sigmoid function.
(c) Draw the word $y_t \sim p(y_t \mid h_t, \theta, l_t, B)$, where
$$p(y_t = i \mid h_t, \theta, l_t, B) \propto \exp\left(v_i^\top h_t + (1 - l_t)\, b_i^\top \theta\right).$$
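Step (c) can be illustrated with a minimal sketch that normalizes the exponentiated scores over the vocabulary. The function name `word_distribution` and the list-of-lists layout for the vectors $v_i$ and $b_i$ are assumptions made for this illustration only; they do not come from [1].

```python
import math

def word_distribution(V, B, h, theta, l):
    """Return p(y_t = i | h_t, theta, l_t, B) for every word i.

    V: list of C output vectors v_i (each of length H)
    B: list of C topic vectors b_i (each of length K)
    h: hidden state h_t (length H)
    theta: topic vector (length K)
    l: stop word indicator l_t (0 or 1)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Unnormalized log-probabilities: v_i^T h_t + (1 - l_t) b_i^T theta
    logits = [dot(v, h) + (1 - l) * dot(b, theta) for v, b in zip(V, B)]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

When `l` is 1 the topic vector drops out of the scores, so a stop word is generated from the hidden state alone.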
2 Lower bound
The log marginal likelihood of the word sequence $y_{1:T}$ and the stop word indicators $l_{1:T}$ is
$$\log p(y_{1:T}, l_{1:T} \mid h_{1:T}) = \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta. \tag{1}$$
A lower bound can be obtained by introducing a distribution $q(\theta)$ and applying Jensen's inequality:
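$$\begin{aligned}
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
&= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta \\
&= \log \int q(\theta)\, \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&\geq \int q(\theta) \log \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&= \int q(\theta) \log p(\theta)\, d\theta + \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta \\
&\quad + \sum_{t=1}^{T} \int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta - \int q(\theta) \log q(\theta)\, d\theta \\
&\triangleq \mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
\end{aligned} \tag{2}$$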
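The inequality in Eq. (2) can be checked numerically on a toy problem. The sketch below replaces the continuous $\theta$ with a latent variable that takes three values, so the integrals become sums; the prior, the likelihood values, and the function names are all hypothetical and serve only to illustrate the bound.

```python
import math

def log_marginal(prior, lik):
    """log sum_k p(theta_k) p(data | theta_k) for a discrete toy latent variable."""
    return math.log(sum(p * f for p, f in zip(prior, lik)))

def lower_bound(prior, lik, q):
    """sum_k q_k log(p(theta_k) p(data | theta_k) / q_k), a discrete analogue of Eq. (2)."""
    return sum(qk * math.log(p * f / qk) for p, f, qk in zip(prior, lik, q) if qk > 0)

prior = [0.5, 0.3, 0.2]          # hypothetical prior over three latent values
lik = [0.10, 0.40, 0.05]         # hypothetical likelihood of the data under each value
q_uniform = [1 / 3, 1 / 3, 1 / 3]
joint = [p * f for p, f in zip(prior, lik)]
q_exact = [j / sum(joint) for j in joint]    # the true posterior

exact = log_marginal(prior, lik)
# The bound holds for an arbitrary q ...
print(lower_bound(prior, lik, q_uniform) <= exact)
# ... and becomes tight when q equals the true posterior.
print(abs(lower_bound(prior, lik, q_exact) - exact) < 1e-12)
```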
3 Approximate posterior
The approximate posterior $q(\theta)$ is parameterized by an inference network, implemented as a feed-forward neural network. Each expectation in Eq. (2) is approximated with samples from $q(\theta \mid X_c)$, where $X_c$ denotes the term-frequency representation of $y_{1:T}$ excluding stop words. The density of the approximate posterior $q(\theta \mid X_c)$ is specified as follows:
$$q(\theta \mid X_c) = N\left(\theta; \mu(X_c), \mathrm{diag}(\sigma^2(X_c))\right), \tag{3}$$
$$\mu(X_c) = W_1 g(X_c) + a_1, \tag{4}$$
$$\log \sigma(X_c) = W_2 g(X_c) + a_2, \tag{5}$$
where $g(\cdot)$ denotes the feed-forward neural network. Eq. (3) gives the reparameterization of $\theta_k$ as $\theta_k = \mu_k(X_c) + \epsilon_k \sigma_k(X_c)$ for $k = 1, \ldots, K$, where $\epsilon_k$ is a sample from the standard normal distribution $N(0, 1)$.
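Eqs. (3) to (5) and the reparameterization can be sketched as follows. The vector `g` stands in for the network output $g(X_c)$, whose architecture is not specified here; the helper names and all numeric values are assumptions chosen for illustration.

```python
import math
import random

def affine(W, x, a):
    """Return W x + a for a matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + ai for row, ai in zip(W, a)]

def sample_theta(g, W1, a1, W2, a2, rng):
    """Draw one theta from q(theta | X_c) via theta_k = mu_k + eps_k * sigma_k."""
    mu = affine(W1, g, a1)            # Eq. (4)
    log_sigma = affine(W2, g, a2)     # Eq. (5)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    theta = [m + e * math.exp(ls) for m, e, ls in zip(mu, eps, log_sigma)]
    return theta, mu, log_sigma, eps

# Hypothetical values: g plays the role of g(X_c), with K = 2 topics.
rng = random.Random(0)
g = [0.5, -1.0, 2.0]
W1 = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1]]
a1 = [0.0, 0.1]
W2 = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1]]
a2 = [-0.5, 0.0]
theta, mu, log_sigma, eps = sample_theta(g, W1, a1, W2, a2, rng)
print(theta)
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `log_sigma`, which is what lets gradients flow to the inference network parameters.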
4 Monte Carlo integration
We can now rewrite each term of the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ in Eq. (2) as below, where the $\theta^{(s)}$ denote the samples drawn from the approximate posterior $q(\theta \mid X_c)$.
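The first term:
$$\begin{aligned}
\int q(\theta) \log p(\theta)\, d\theta
&\approx \frac{1}{S} \sum_{s=1}^{S} \log p(\theta^{(s)})
= \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{\left(\theta_k^{(s)}\right)^2}{2} \right) \right] \\
&= -\frac{K \log(2\pi)}{2} - \frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s=1}^{S} \left(\theta_k^{(s)}\right)^2}{S}
\end{aligned} \tag{6}$$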
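The closed form on the last line of Eq. (6) is straightforward to evaluate. A minimal sketch, assuming the samples are given as a list of $S$ lists of length $K$ (the function name is hypothetical):

```python
import math

def expected_log_prior(samples):
    """Monte Carlo estimate of E_q[log p(theta)] with p(theta) = N(0, I), as in Eq. (6)."""
    S = len(samples)
    K = len(samples[0])
    # Sum of squared components over all samples and all dimensions
    sq = sum(t * t for theta in samples for t in theta)
    return -0.5 * K * math.log(2 * math.pi) - 0.5 * sq / S
```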
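Each addend of the second term:
$$\begin{aligned}
\int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
&\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{\exp\left(v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \theta^{(s)}\right)}{\sum_{j=1}^{C} \exp\left(v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)}\right)} \\
&= v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \frac{\sum_{s=1}^{S} \theta^{(s)}}{S} - \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\left\{ v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)} \right\}
\end{aligned} \tag{7}$$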
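The last line of Eq. (7) splits into a part that is linear in the sample mean of $\theta$ and a log-normalizer averaged over the samples. The sketch below follows that split; the helper names and the use of a max-shifted log-sum-exp for stability are choices made for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(zs):
    """log(sum(exp(z))) computed with the maximum factored out."""
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

def expected_log_word_prob(V, B, h, l, y, samples):
    """Monte Carlo estimate of E_q[log p(y_t | h_t, l_t, theta)], as in Eq. (7).

    y is the index of the observed word; samples holds S draws of theta.
    """
    S = len(samples)
    K = len(samples[0])
    theta_mean = [sum(theta[k] for theta in samples) / S for k in range(K)]
    # Linear part: v_y^T h + (1 - l) b_y^T (mean of the samples)
    linear = dot(V[y], h) + (1 - l) * dot(B[y], theta_mean)
    # Log-normalizer, averaged over the samples
    log_norm = sum(
        logsumexp([dot(v, h) + (1 - l) * dot(b, theta) for v, b in zip(V, B)])
        for theta in samples
    ) / S
    return linear - log_norm
```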
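Each addend of the third term does not depend on $\theta$, so the integral is exact:
$$\int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta = l_t \log \sigma(\Gamma^\top h_t) + (1 - l_t) \log\left(1 - \sigma(\Gamma^\top h_t)\right) \tag{8}$$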
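Eq. (8) is the Bernoulli log-likelihood of the stop word indicator. A minimal sketch, assuming $\Gamma$ and $h_t$ are plain lists of equal length; the function names are hypothetical, and the branch on the sign of `z` is a standard way to avoid overflow rather than anything specific to TopicRNN.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def stop_word_log_prob(gamma, h, l):
    """l_t log sigmoid(Gamma^T h_t) + (1 - l_t) log(1 - sigmoid(Gamma^T h_t)), as in Eq. (8)."""
    z = sum(g * x for g, x in zip(gamma, h))
    # log(1 - sigmoid(z)) equals log(sigmoid(-z))
    return l * log_sigmoid(z) + (1 - l) * log_sigmoid(-z)
```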
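The fourth term:
$$\begin{aligned}
\int q(\theta) \log q(\theta)\, d\theta
&\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi \sigma_k^2(X_c)}} \exp\left( -\frac{\left(\theta_k^{(s)} - \mu_k(X_c)\right)^2}{2 \sigma_k^2(X_c)} \right) \right] \\
&= -\frac{K \log(2\pi)}{2} - \sum_{k=1}^{K} \log(\sigma_k(X_c)) - \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \frac{\left\{\theta_k^{(s)} - \mu_k(X_c)\right\}^2}{2 \sigma_k^2(X_c)}
\end{aligned} \tag{9}$$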
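The closed form in Eq. (9) can be evaluated directly from the samples and the inference network outputs. A minimal sketch, assuming `mu` and `log_sigma` are lists of length $K$ holding $\mu(X_c)$ and $\log \sigma(X_c)$ (the function name is hypothetical):

```python
import math

def expected_log_q(samples, mu, log_sigma):
    """Monte Carlo estimate of E_q[log q(theta | X_c)] for a diagonal Gaussian, as in Eq. (9)."""
    S = len(samples)
    K = len(mu)
    # Sum over samples and dimensions of (theta_k - mu_k)^2 / (2 sigma_k^2)
    quad = sum(
        (t - m) ** 2 / (2.0 * math.exp(2.0 * ls))
        for theta in samples
        for t, m, ls in zip(theta, mu, log_sigma)
    )
    return -0.5 * K * math.log(2 * math.pi) - sum(log_sigma) - quad / S
```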
5 Objective to be maximized
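Each of the $S$ samples (i.e., $\theta^{(s)}$ for $s = 1, \ldots, S$) is obtained as $\theta^{(s)} = \mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c)$ via the reparameterization, where the $\epsilon_k^{(s)}$ are drawn from the standard normal distribution and $\circ$ is element-wise multiplication. Under this reparameterization the quadratic term in Eq. (9) reduces to $\frac{1}{2S} \sum_{s=1}^{S} \sum_{k=1}^{K} \left(\epsilon_k^{(s)}\right)^2$, which does not depend on any model parameter and is absorbed into the constant, while the $-\frac{K \log(2\pi)}{2}$ terms of Eqs. (6) and (9) cancel.
Consequently, the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ to be maximized is obtained as follows: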
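$$\begin{aligned}
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
= &-\frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s=1}^{S} \left\{ \mu_k(X_c) + \epsilon_k^{(s)} \sigma_k(X_c) \right\}^2}{S} \\
&+ \sum_{t=1}^{T} v_{y_t}^\top h_t + \frac{1}{S} \sum_{s=1}^{S} \sum_{t=1}^{T} (1 - l_t)\, b_{y_t}^\top \left\{ \mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c) \right\} \\
&- \sum_{t=1}^{T} \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\left[ v_j^\top h_t + (1 - l_t)\, b_j^\top \left\{ \mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c) \right\} \right] \\
&+ \sum_{t=1}^{T} \left\{ l_t \log(\sigma(\Gamma^\top h_t)) + (1 - l_t) \log(1 - \sigma(\Gamma^\top h_t)) \right\} \\
&+ \sum_{k=1}^{K} \log(\sigma_k(X_c)) + \mathrm{const.}
\end{aligned} \tag{10}$$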
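Eq. (10) can be evaluated term by term. The sketch below is one possible reading, assuming the hidden states $h_{1:T}$ are precomputed and passed in as `H`, words are given as integer indices, and the noise draws $\epsilon^{(s)}$ are supplied by the caller; all names are hypothetical, and the additive constant is dropped.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(zs):
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

def log_sigmoid(z):
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def objective(H, y, l, V, B, gamma, mu, log_sigma, eps):
    """Evaluate Eq. (10) without the additive constant.

    H: hidden states h_1..h_T; y: word indices; l: stop word indicators;
    eps: S noise vectors, each of length K.
    """
    S = len(eps)
    std = [math.exp(ls) for ls in log_sigma]
    # Reparameterized samples theta^(s) = mu + eps^(s) * sigma (element-wise)
    thetas = [[m + e * s for m, e, s in zip(mu, e_s, std)] for e_s in eps]
    prior = -0.5 * sum(t * t for theta in thetas for t in theta) / S
    words = 0.0
    stops = 0.0
    for h_t, y_t, l_t in zip(H, y, l):
        words += dot(V[y_t], h_t)
        for theta in thetas:
            words += (1 - l_t) * dot(B[y_t], theta) / S
            scores = [dot(v, h_t) + (1 - l_t) * dot(b, theta) for v, b in zip(V, B)]
            words -= logsumexp(scores) / S
        z = dot(gamma, h_t)
        stops += l_t * log_sigmoid(z) + (1 - l_t) * log_sigmoid(-z)
    # Last sum of Eq. (10): sum_k log sigma_k(X_c)
    entropy = sum(log_sigma)
    return prior + words + stops + entropy
```

In practice this quantity would be maximized with automatic differentiation over $W$, $\Gamma$, $B$, the output vectors, and the inference network parameters; the plain-Python version here is only meant to make the structure of the five groups of terms explicit.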
References
[1] Adji Bousso Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. ICLR, 2017.