A Note on TopicRNN

Tomonari MASADA @ Nagasaki University

July 13, 2017

1 Model

TopicRNN is a generative model proposed by [1]. Its generative story for a particular document $y_{1:T}$ is given below.

1. Draw a topic vector $\theta \sim \mathcal{N}(0, I)$.

2. Given the words $y_{1:t-1}$, for the $t$-th word $y_t$ in the document,

   (a) Compute the hidden state $h_t = f_W(x_t, h_{t-1})$, where we let $x_t \triangleq y_{t-1}$.

   (b) Draw the stop-word indicator $l_t \sim \mathrm{Bernoulli}\bigl(\sigma(\Gamma^\top h_t)\bigr)$, with $\sigma$ the sigmoid function.

   (c) Draw the word $y_t \sim p(y_t \mid h_t, \theta, l_t, B)$, where
\[
p(y_t = i \mid h_t, \theta, l_t, B) \propto \exp\bigl(v_i^\top h_t + (1 - l_t)\, b_i^\top \theta\bigr).
\]
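For concreteness, the generative step 2 above can be sketched in Python/NumPy as follows. The RNN cell \verb|f_W|, the array shapes, and the names \verb|Gamma|, \verb|V| (rows $v_i$), and \verb|B| (rows $b_i$) are assumptions made only for this illustration; they are not the authors' implementation.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def generate_word(y_prev, h_prev, theta, f_W, Gamma, V, B, rng):
    """One generative step (2a)-(2c) for the t-th word."""
    x_t = y_prev                                  # (2a) x_t := y_{t-1}
    h_t = f_W(x_t, h_prev)                        # (2a) hidden state of the RNN cell
    l_t = rng.binomial(1, sigmoid(Gamma @ h_t))   # (2b) stop-word indicator
    logits = V @ h_t + (1 - l_t) * (B @ theta)    # (2c) v_i^T h_t + (1 - l_t) b_i^T theta
    y_t = rng.choice(V.shape[0], p=softmax(logits))
    return y_t, h_t, l_t

# theta is drawn once per document (step 1), e.g. theta = rng.standard_normal(K),
# while h_t, l_t, and y_t are drawn word by word.
\end{verbatim}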

2 Lower bound

The log marginal likelihood of the word sequence $y_{1:T}$ and the stop-word indicators $l_{1:T}$ is
\[
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta. \tag{1}
\]

A lower bound can be obtained as follows:
\begin{align*}
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
&= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta \\
&= \log \int q(\theta)\, \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&\geq \int q(\theta) \log \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&= \int q(\theta) \log p(\theta)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
 - \int q(\theta) \log q(\theta)\, d\theta \\
&\triangleq \mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta). \tag{2}
\end{align*}
The inequality is Jensen's inequality applied to the concave logarithm.
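For reference, the first and last terms of Eq. (2) combine into a negative Kullback-Leibler divergence, and the integrand of the third term does not depend on $\theta$, so the bound can be written equivalently as
\[
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
= \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
+ \sum_{t=1}^{T} \log p(l_t \mid h_t; \Gamma)
- \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta)\bigr).
\]
This regrouping is not used below, but it shows that Eq. (2) is a standard variational objective: a word-reconstruction term and a stop-word prediction term, regularized by the KL divergence between $q(\theta)$ and the prior.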

3 Approximate posterior

The approximate posterior $q(\theta)$ is parameterized by an inference network, i.e., a feed-forward neural network. Each expectation in Eq. (2) is approximated with samples from $q(\theta \mid X_c)$, where $X_c$ denotes the term-frequency representation of $y_{1:T}$ excluding stop words. The density of the approximate posterior $q(\theta \mid X_c)$ is specified as follows:

\begin{align*}
q(\theta \mid X_c) &= \mathcal{N}\bigl(\theta;\, \mu(X_c),\, \mathrm{diag}(\sigma^2(X_c))\bigr), \tag{3} \\
\mu(X_c) &= W_1 g(X_c) + a_1, \tag{4} \\
\log \sigma(X_c) &= W_2 g(X_c) + a_2, \tag{5}
\end{align*}
where $g(\cdot)$ denotes the feed-forward neural network. Eq. (3) gives the reparameterization of $\theta_k$ as $\theta_k = \mu_k(X_c) + \varepsilon_k \sigma_k(X_c)$ for $k = 1, \ldots, K$, where $\varepsilon_k$ is a sample from the standard normal distribution $\mathcal{N}(0, 1)$.
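As an illustration, the following minimal NumPy sketch implements Eqs. (3)-(5) together with this reparameterization. The single ReLU layer used for $g(\cdot)$, the Gaussian initialization, and the layer sizes are assumptions of the sketch; the note does not specify them.

\begin{verbatim}
import numpy as np

class InferenceNetwork:
    """q(theta | X_c) = N(mu(X_c), diag(sigma^2(X_c))), Eqs. (3)-(5)."""
    def __init__(self, vocab_size, hidden_size, num_topics, rng):
        scale = 0.01
        self.Wg = rng.normal(0.0, scale, (hidden_size, vocab_size))  # weights of g(.)
        self.W1 = rng.normal(0.0, scale, (num_topics, hidden_size))  # Eq. (4)
        self.a1 = np.zeros(num_topics)
        self.W2 = rng.normal(0.0, scale, (num_topics, hidden_size))  # Eq. (5)
        self.a2 = np.zeros(num_topics)

    def __call__(self, X_c):
        # X_c: term-frequency vector of the document with stop words excluded
        g = np.maximum(0.0, self.Wg @ X_c)       # g(X_c), here a one-layer ReLU net
        mu = self.W1 @ g + self.a1               # mu(X_c)
        log_sigma = self.W2 @ g + self.a2        # log sigma(X_c)
        return mu, log_sigma

def sample_theta(mu, log_sigma, rng, S=1):
    """Reparameterized samples: theta^(s) = mu + eps^(s) * sigma, eps ~ N(0, I)."""
    eps = rng.standard_normal((S, mu.shape[0]))
    return mu + eps * np.exp(log_sigma)
\end{verbatim}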


4 Monte Carlo integration

We can now rewrite each term of the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ in Eq. (2) as below, where the $\theta^{(s)}$ denote samples drawn from the approximate posterior $q(\theta \mid X_c)$.

The first term:
\[
\int q(\theta) \log p(\theta)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \log p(\theta^{(s)})
= \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\bigl(\theta_k^{(s)}\bigr)^2}{2} \right) \right]
= -\frac{K \log(2\pi)}{2} - \frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \bigl(\theta_k^{(s)}\bigr)^2}{S}. \tag{6}
\]

Each addend of the second term:
\[
\int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{\exp\bigl(v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \theta^{(s)}\bigr)}{\sum_{j=1}^{C} \exp\bigl(v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)}\bigr)}
= v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \frac{\sum_{s=1}^{S} \theta^{(s)}}{S}
- \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\bigl\{ v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)} \bigr\}, \tag{7}
\]
where $C$ denotes the vocabulary size.

Each addend of the third term is exact, since the integrand does not depend on $\theta$:
\[
\int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
= l_t \log\bigl(\sigma(\Gamma^\top h_t)\bigr) + (1 - l_t) \log\bigl(1 - \sigma(\Gamma^\top h_t)\bigr). \tag{8}
\]

The fourth term:
\[
\int q(\theta) \log q(\theta)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi \sigma_k^2(X_c)}} \exp\!\left( -\frac{\bigl(\theta_k^{(s)} - \mu_k(X_c)\bigr)^2}{2 \sigma_k^2(X_c)} \right) \right]
= -\frac{K \log(2\pi)}{2} - \sum_{k=1}^{K} \log\bigl(\sigma_k(X_c)\bigr)
- \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \frac{\bigl(\theta_k^{(s)} - \mu_k(X_c)\bigr)^2}{2 \sigma_k^2(X_c)}. \tag{9}
\]
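The estimators (6)-(9) translate directly into code. The sketch below is a minimal NumPy/SciPy version, assuming the samples $\theta^{(s)}$ are stacked into an array of shape (S, K), the rows of \verb|V| and \verb|B| are $v_i$ and $b_i$, and a log-sum-exp is used for the normalizer of Eq. (7) for numerical stability; all names are illustrative.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def term_log_prior(theta):                      # Eq. (6); theta has shape (S, K)
    S, K = theta.shape
    return -K * np.log(2 * np.pi) / 2 - 0.5 * np.sum(theta ** 2) / S

def term_log_word(y_t, h_t, l_t, theta, V, B):  # Eq. (7), one addend
    logits = h_t @ V.T + (1 - l_t) * (theta @ B.T)       # shape (S, C)
    return (logits[:, y_t].mean()                        # numerator, averaged over s
            - logsumexp(logits, axis=1).mean())          # log of the normalizer

def term_log_stop(l_t, h_t, Gamma):             # Eq. (8), one addend (exact)
    p = 1.0 / (1.0 + np.exp(-(Gamma @ h_t)))
    return l_t * np.log(p) + (1 - l_t) * np.log(1 - p)

def term_log_q(theta, mu, log_sigma):           # Eq. (9), i.e. E_q[log q(theta)]
    S, K = theta.shape
    sigma2 = np.exp(2 * log_sigma)
    return (-K * np.log(2 * np.pi) / 2 - log_sigma.sum()
            - np.sum((theta - mu) ** 2 / (2 * sigma2)) / S)
\end{verbatim}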

5 Objective to be maximized

Each of the $S$ samples (i.e., $\theta^{(s)}$ for $s = 1, \ldots, S$) is obtained as $\theta^{(s)} = \mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)$ via the reparameterization, where the $\varepsilon_k^{(s)}$ are drawn from the standard normal distribution and $\circ$ denotes element-wise multiplication. Consequently, the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ to be maximized is obtained as follows:

\begin{align*}
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
={}& -\frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \bigl\{\mu_k(X_c) + \varepsilon_k^{(s)} \sigma_k(X_c)\bigr\}^2}{S} \\
&+ \sum_{t=1}^{T} v_{y_t}^\top h_t
 + \frac{1}{S} \sum_{s=1}^{S} \sum_{t=1}^{T} (1 - l_t)\, b_{y_t}^\top \bigl\{\mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)\bigr\} \\
&- \sum_{t=1}^{T} \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\Bigl[ v_j^\top h_t + (1 - l_t)\, b_j^\top \bigl\{\mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)\bigr\} \Bigr] \\
&+ \sum_{t=1}^{T} \bigl\{ l_t \log\bigl(\sigma(\Gamma^\top h_t)\bigr) + (1 - l_t) \log\bigl(1 - \sigma(\Gamma^\top h_t)\bigr) \bigr\} \\
&+ \sum_{k=1}^{K} \log\bigl(\sigma_k(X_c)\bigr) + \mathrm{const.} \tag{10}
\end{align*}
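Putting the pieces together, the following is a minimal sketch of Eq. (10) for a single document, with the per-sample constants absorbed into the constant and dropped. It assumes the hidden states $h_t$ have already been computed by the RNN; the parameter names mirror the earlier sketches and are illustrative.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def lower_bound(y, l, H, mu, log_sigma, V, B, Gamma, eps):
    """Eq. (10) up to an additive constant.

    y: word indices, shape (T,);  l: stop-word indicators, shape (T,)
    H: hidden states, shape (T, hidden_size);  eps: samples, shape (S, K)
    """
    sigma = np.exp(log_sigma)
    theta = mu + eps * sigma                              # (S, K) reparameterized samples
    S = theta.shape[0]

    # first and last terms of Eq. (10)
    kl_part = -0.5 * np.sum(theta ** 2) / S + log_sigma.sum()

    # word terms: numerator averaged over s, minus the averaged log normalizer
    word_part = 0.0
    for t in range(len(y)):
        logits = H[t] @ V.T + (1 - l[t]) * (theta @ B.T)  # (S, C)
        word_part += logits[:, y[t]].mean() - logsumexp(logits, axis=1).mean()

    # stop-word terms, computed exactly as in Eq. (8)
    p_stop = 1.0 / (1.0 + np.exp(-(H @ Gamma)))           # sigma(Gamma^T h_t), shape (T,)
    stop_part = np.sum(l * np.log(p_stop) + (1 - l) * np.log(1 - p_stop))

    return kl_part + word_part + stop_part
\end{verbatim}

In practice this bound would be maximized jointly over the RNN parameters, $V$, $B$, $\Gamma$, and the inference-network parameters by stochastic gradient ascent, as in [1].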

References

[1] Adji Bousso Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. In ICLR, 2017.
