A Note on TopicRNN

Tomonari MASADA @ Nagasaki University

July 13, 2017

1 Model

TopicRNN is a generative model proposed by [1]. Its generative story for a particular document $y_{1:T}$ is given below.

1. Draw a topic vector $\theta \sim \mathcal{N}(0, I)$.

2. Given the words $y_{1:t-1}$, for the $t$-th word $y_t$ in the document,

   (a) Compute the hidden state $h_t = f_W(x_t, h_{t-1})$, where we let $x_t \triangleq y_{t-1}$.

   (b) Draw the stop-word indicator $l_t \sim \mathrm{Bernoulli}\bigl(\sigma(\Gamma^\top h_t)\bigr)$, with $\sigma$ the sigmoid function.

   (c) Draw the word $y_t \sim p(y_t \mid h_t, \theta, l_t, B)$, where
\[
p(y_t = i \mid h_t, \theta, l_t, B) \propto \exp\bigl(v_i^\top h_t + (1 - l_t)\, b_i^\top \theta\bigr).
\]
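For concreteness, the generative step 2 above can be sketched in Python/NumPy as follows. The RNN cell \verb|f_W|, the array shapes, and the names \verb|Gamma|, \verb|V| (rows $v_i$), and \verb|B| (rows $b_i$) are assumptions made only for this illustration; they are not the authors' implementation.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def generate_word(y_prev, h_prev, theta, f_W, Gamma, V, B, rng):
    """One generative step (2a)-(2c) for the t-th word."""
    x_t = y_prev                                  # (2a) x_t := y_{t-1}
    h_t = f_W(x_t, h_prev)                        # (2a) hidden state of the RNN cell
    l_t = rng.binomial(1, sigmoid(Gamma @ h_t))   # (2b) stop-word indicator
    logits = V @ h_t + (1 - l_t) * (B @ theta)    # (2c) v_i^T h_t + (1 - l_t) b_i^T theta
    y_t = rng.choice(V.shape[0], p=softmax(logits))
    return y_t, h_t, l_t

# theta is drawn once per document (step 1), e.g. theta = rng.standard_normal(K),
# while h_t, l_t, and y_t are drawn word by word.
\end{verbatim}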

2 Lower bound

The log marginal likelihood of the word sequence $y_{1:T}$ and the stop-word indicators $l_{1:T}$ is
\[
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta. \tag{1}
\]

A lower bound can be obtained as follows:
\begin{align*}
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
&= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta \\
&= \log \int q(\theta)\, \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&\geq \int q(\theta) \log \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&= \int q(\theta) \log p(\theta)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
 - \int q(\theta) \log q(\theta)\, d\theta \\
&\triangleq \mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta). \tag{2}
\end{align*}
The inequality is Jensen's inequality applied to the concave logarithm.
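For reference, the first and last terms of Eq. (2) combine into a negative Kullback-Leibler divergence, and the integrand of the third term does not depend on $\theta$, so the bound can be written equivalently as
\[
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
= \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
+ \sum_{t=1}^{T} \log p(l_t \mid h_t; \Gamma)
- \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta)\bigr).
\]
This regrouping is not used below, but it shows that Eq. (2) is a standard variational objective: a word-reconstruction term and a stop-word prediction term, regularized by the KL divergence between $q(\theta)$ and the prior.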

3 Approximate posterior

The approximate posterior $q(\theta)$ is parameterized by an inference network, i.e., a feed-forward neural network. Each expectation in Eq. (2) is approximated with samples from $q(\theta \mid X_c)$, where $X_c$ denotes the term-frequency representation of $y_{1:T}$ excluding stop words. The density of the approximate posterior $q(\theta \mid X_c)$ is specified as follows:

\begin{align*}
q(\theta \mid X_c) &= \mathcal{N}\bigl(\theta;\, \mu(X_c),\, \mathrm{diag}(\sigma^2(X_c))\bigr), \tag{3} \\
\mu(X_c) &= W_1 g(X_c) + a_1, \tag{4} \\
\log \sigma(X_c) &= W_2 g(X_c) + a_2, \tag{5}
\end{align*}
where $g(\cdot)$ denotes the feed-forward neural network. Eq. (3) gives the reparameterization of $\theta_k$ as $\theta_k = \mu_k(X_c) + \varepsilon_k \sigma_k(X_c)$ for $k = 1, \ldots, K$, where $\varepsilon_k$ is a sample from the standard normal distribution $\mathcal{N}(0, 1)$.
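As an illustration, the following minimal NumPy sketch implements Eqs. (3)-(5) together with this reparameterization. The single ReLU layer used for $g(\cdot)$, the Gaussian initialization, and the layer sizes are assumptions of the sketch; the note does not specify them.

\begin{verbatim}
import numpy as np

class InferenceNetwork:
    """q(theta | X_c) = N(mu(X_c), diag(sigma^2(X_c))), Eqs. (3)-(5)."""
    def __init__(self, vocab_size, hidden_size, num_topics, rng):
        scale = 0.01
        self.Wg = rng.normal(0.0, scale, (hidden_size, vocab_size))  # weights of g(.)
        self.W1 = rng.normal(0.0, scale, (num_topics, hidden_size))  # Eq. (4)
        self.a1 = np.zeros(num_topics)
        self.W2 = rng.normal(0.0, scale, (num_topics, hidden_size))  # Eq. (5)
        self.a2 = np.zeros(num_topics)

    def __call__(self, X_c):
        # X_c: term-frequency vector of the document with stop words excluded
        g = np.maximum(0.0, self.Wg @ X_c)       # g(X_c), here a one-layer ReLU net
        mu = self.W1 @ g + self.a1               # mu(X_c)
        log_sigma = self.W2 @ g + self.a2        # log sigma(X_c)
        return mu, log_sigma

def sample_theta(mu, log_sigma, rng, S=1):
    """Reparameterized samples: theta^(s) = mu + eps^(s) * sigma, eps ~ N(0, I)."""
    eps = rng.standard_normal((S, mu.shape[0]))
    return mu + eps * np.exp(log_sigma)
\end{verbatim}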


4 Monte Carlo integration

We can now rewrite each term of the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ in Eq. (2) as below, where the $\theta^{(s)}$ denote samples drawn from the approximate posterior $q(\theta \mid X_c)$.

The first term:
\[
\int q(\theta) \log p(\theta)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \log p(\theta^{(s)})
= \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\bigl(\theta_k^{(s)}\bigr)^2}{2} \right) \right]
= -\frac{K \log(2\pi)}{2} - \frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \bigl(\theta_k^{(s)}\bigr)^2}{S}. \tag{6}
\]

Each addend of the second term:
\[
\int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{\exp\bigl(v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \theta^{(s)}\bigr)}{\sum_{j=1}^{C} \exp\bigl(v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)}\bigr)}
= v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \frac{\sum_{s=1}^{S} \theta^{(s)}}{S}
- \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\bigl\{ v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)} \bigr\}, \tag{7}
\]
where $C$ denotes the vocabulary size.

Each addend of the third term is exact, since the integrand does not depend on $\theta$:
\[
\int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
= l_t \log\bigl(\sigma(\Gamma^\top h_t)\bigr) + (1 - l_t) \log\bigl(1 - \sigma(\Gamma^\top h_t)\bigr). \tag{8}
\]

The fourth term:
\[
\int q(\theta) \log q(\theta)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left[ \frac{1}{\sqrt{2\pi \sigma_k^2(X_c)}} \exp\!\left( -\frac{\bigl(\theta_k^{(s)} - \mu_k(X_c)\bigr)^2}{2 \sigma_k^2(X_c)} \right) \right]
= -\frac{K \log(2\pi)}{2} - \sum_{k=1}^{K} \log\bigl(\sigma_k(X_c)\bigr)
- \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \frac{\bigl(\theta_k^{(s)} - \mu_k(X_c)\bigr)^2}{2 \sigma_k^2(X_c)}. \tag{9}
\]
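The estimators (6)-(9) translate directly into code. The sketch below is a minimal NumPy/SciPy version, assuming the samples $\theta^{(s)}$ are stacked into an array of shape (S, K), the rows of \verb|V| and \verb|B| are $v_i$ and $b_i$, and a log-sum-exp is used for the normalizer of Eq. (7) for numerical stability; all names are illustrative.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def term_log_prior(theta):                      # Eq. (6); theta has shape (S, K)
    S, K = theta.shape
    return -K * np.log(2 * np.pi) / 2 - 0.5 * np.sum(theta ** 2) / S

def term_log_word(y_t, h_t, l_t, theta, V, B):  # Eq. (7), one addend
    logits = h_t @ V.T + (1 - l_t) * (theta @ B.T)       # shape (S, C)
    return (logits[:, y_t].mean()                        # numerator, averaged over s
            - logsumexp(logits, axis=1).mean())          # log of the normalizer

def term_log_stop(l_t, h_t, Gamma):             # Eq. (8), one addend (exact)
    p = 1.0 / (1.0 + np.exp(-(Gamma @ h_t)))
    return l_t * np.log(p) + (1 - l_t) * np.log(1 - p)

def term_log_q(theta, mu, log_sigma):           # Eq. (9), i.e. E_q[log q(theta)]
    S, K = theta.shape
    sigma2 = np.exp(2 * log_sigma)
    return (-K * np.log(2 * np.pi) / 2 - log_sigma.sum()
            - np.sum((theta - mu) ** 2 / (2 * sigma2)) / S)
\end{verbatim}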

5 Objective to be maximized

Each of the $S$ samples (i.e., $\theta^{(s)}$ for $s = 1, \ldots, S$) is obtained as $\theta^{(s)} = \mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)$ via the reparameterization, where the $\varepsilon_k^{(s)}$ are drawn from the standard normal distribution and $\circ$ denotes element-wise multiplication. Consequently, the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ to be maximized is obtained as follows:

\begin{align*}
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
={}& -\frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \bigl\{\mu_k(X_c) + \varepsilon_k^{(s)} \sigma_k(X_c)\bigr\}^2}{S} \\
&+ \sum_{t=1}^{T} v_{y_t}^\top h_t
 + \frac{1}{S} \sum_{s=1}^{S} \sum_{t=1}^{T} (1 - l_t)\, b_{y_t}^\top \bigl\{\mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)\bigr\} \\
&- \sum_{t=1}^{T} \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\Bigl[ v_j^\top h_t + (1 - l_t)\, b_j^\top \bigl\{\mu(X_c) + \varepsilon^{(s)} \circ \sigma(X_c)\bigr\} \Bigr] \\
&+ \sum_{t=1}^{T} \bigl\{ l_t \log\bigl(\sigma(\Gamma^\top h_t)\bigr) + (1 - l_t) \log\bigl(1 - \sigma(\Gamma^\top h_t)\bigr) \bigr\} \\
&+ \sum_{k=1}^{K} \log\bigl(\sigma_k(X_c)\bigr) + \mathrm{const.} \tag{10}
\end{align*}
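Putting the pieces together, the following is a minimal sketch of Eq. (10) for a single document, with the per-sample constants absorbed into the constant and dropped. It assumes the hidden states $h_t$ have already been computed by the RNN; the parameter names mirror the earlier sketches and are illustrative.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def lower_bound(y, l, H, mu, log_sigma, V, B, Gamma, eps):
    """Eq. (10) up to an additive constant.

    y: word indices, shape (T,);  l: stop-word indicators, shape (T,)
    H: hidden states, shape (T, hidden_size);  eps: samples, shape (S, K)
    """
    sigma = np.exp(log_sigma)
    theta = mu + eps * sigma                              # (S, K) reparameterized samples
    S = theta.shape[0]

    # first and last terms of Eq. (10)
    kl_part = -0.5 * np.sum(theta ** 2) / S + log_sigma.sum()

    # word terms: numerator averaged over s, minus the averaged log normalizer
    word_part = 0.0
    for t in range(len(y)):
        logits = H[t] @ V.T + (1 - l[t]) * (theta @ B.T)  # (S, C)
        word_part += logits[:, y[t]].mean() - logsumexp(logits, axis=1).mean()

    # stop-word terms, computed exactly as in Eq. (8)
    p_stop = 1.0 / (1.0 + np.exp(-(H @ Gamma)))           # sigma(Gamma^T h_t), shape (T,)
    stop_part = np.sum(l * np.log(p_stop) + (1 - l) * np.log(1 - p_stop))

    return kl_part + word_part + stop_part
\end{verbatim}

In practice this bound would be maximized jointly over the RNN parameters, $V$, $B$, $\Gamma$, and the inference-network parameters by stochastic gradient ascent, as in [1].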

References

[1] Adji Bousso Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. In ICLR, 2017.
