A Note on Correlated Topic Models
Deriving formulas used in a variational Bayesian inference for
Correlated Topic Models
Tomonari MASADA @ Nagasaki University
December 21, 2012
1 Model
This manuscript includes a derivation of update formulas for correlated topic models (CTM) [1]. We give a generative description of CTM below.
1. For each topic $k$, draw a multinomial $\mathrm{Mul}(\phi_k)$ from a Dirichlet prior $\mathrm{Dir}(\beta)$.
2. For each document $d$,
(a) Draw $m_d$ from a Gaussian $\mathcal{N}(\mu, \Sigma)$.
(b) Let $\theta_{dk} \equiv \exp(m_{dk}) / \sum_{k'} \exp(m_{dk'})$.
(c) For the $i$th word token, draw a topic $z_{di}$ from a multinomial $\mathrm{Mul}(\theta_d)$.
(d) For the $i$th word token, draw a word $x_{di}$ from a multinomial $\mathrm{Mul}(\phi_{z_{di}})$.
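The generative process above can be sketched in Python with NumPy. This is a minimal illustration rather than code from the manuscript: the helper name `sample_ctm_document`, the toy sizes `K` and `W`, and the values chosen for the prior and the Gaussian parameters are all assumptions made for the example.

```python
import numpy as np

def sample_ctm_document(rng, mu, Sigma, phi, n_words):
    """Sample one document following steps 2(a)-2(d) (hypothetical helper)."""
    m = rng.multivariate_normal(mu, Sigma)        # (a) m_d ~ N(mu, Sigma)
    theta = np.exp(m - m.max())                   # (b) softmax; the shift avoids overflow
    theta /= theta.sum()
    z = rng.choice(len(theta), size=n_words, p=theta)               # (c) topic per token
    x = np.array([rng.choice(phi.shape[1], p=phi[k]) for k in z])   # (d) word per token
    return theta, z, x

rng = np.random.default_rng(0)
K, W = 3, 5                                       # toy numbers of topics and word types
beta = np.full(W, 0.5)                            # assumed symmetric Dirichlet prior
phi = rng.dirichlet(beta, size=K)                 # step 1: one word distribution per topic
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])               # topics 0 and 1 are positively correlated
theta, z, x = sample_ctm_document(rng, mu, Sigma, phi, n_words=20)
```

The off-diagonal entries of the covariance are what let topic proportions co-vary across documents, which a single Dirichlet prior on the proportions cannot express.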
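A full joint distribution can be written as follows:

$$
\begin{aligned}
p(x, z, \phi, m \mid \beta, \mu, \Sigma)
&= p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z) \\
&= \prod_k p(\phi_k \mid \beta) \cdot \prod_d p(m_d \mid \mu, \Sigma) \cdot \prod_d \prod_i p(z_{di} \mid m_d)\, p(x_{di} \mid \phi_{z_{di}}) \\
&= \prod_k \frac{\Gamma(\sum_w \beta_w)}{\prod_w \Gamma(\beta_w)} \prod_w \phi_{kw}^{\beta_w - 1}
\cdot \prod_d \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\Bigl\{ -\frac{1}{2} (m_d - \mu)^T \Sigma^{-1} (m_d - \mu) \Bigr\} \\
&\quad \cdot \prod_d \prod_i \prod_k \Bigl\{ \frac{\exp(m_{dk})}{\sum_{k'} \exp(m_{dk'})}\, \phi_{k x_{di}} \Bigr\}^{\delta(z_{di} = k)} ,
\end{aligned}
\tag{1}
$$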
where δ(·) is equal to one when the condition inside the parentheses holds and is equal to zero otherwise.
2 Variational Bayesian inference
A log evidence of an observed document set x can be lower-bounded by using Jensen’s inequality as follows:
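$$
\begin{aligned}
\ln p(x \mid \beta, \mu, \Sigma)
&= \ln \int \sum_z p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)\, d\phi\, dm \\
&= \ln \int \sum_z q(z)\, q(\phi)\, q(m)\, \frac{p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)}{q(z)\, q(\phi)\, q(m)}\, d\phi\, dm \\
&\ge \int \sum_z q(z)\, q(\phi)\, q(m) \ln \frac{p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)}{q(z)\, q(\phi)\, q(m)}\, d\phi\, dm \\
&= \int \sum_z q(z)\, q(m) \ln p(z \mid m)\, dm + \int q(\phi) \ln p(\phi \mid \beta)\, d\phi \\
&\quad + \int \sum_z q(z)\, q(\phi) \ln p(x \mid \phi, z)\, d\phi + \int q(m) \ln p(m \mid \mu, \Sigma)\, dm \\
&\quad - \sum_z q(z) \ln q(z) - \int q(\phi) \ln q(\phi)\, d\phi - \int q(m) \ln q(m)\, dm .
\end{aligned}
\tag{2}
$$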
With respect to variational posteriors, we assume:
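• $q(z)$ is factorized as $\prod_d \prod_i q(z_{di} \mid \gamma_{di}) = \prod_d \prod_i \prod_k \gamma_{dik}^{\delta(z_{di} = k)}$;
• $q(\phi)$ is factorized as $\prod_k q(\phi_k \mid \zeta_k)$, where each $q(\phi_k \mid \zeta_k)$ is a Dirichlet; and
• $q(m)$ is factorized as $\prod_d \prod_k q(m_{dk} \mid r_{dk}, s_{dk})$, where each $q(m_{dk} \mid r_{dk}, s_{dk})$ is a univariate Gaussian with mean $r_{dk}$ and standard deviation $s_{dk}$.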
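2.1 $\int \sum_z q(z)\, q(m) \ln p(z \mid m)\, dm$

$$
\begin{aligned}
&\int \sum_z q(z)\, q(m) \ln p(z \mid m)\, dm \\
&= \int \sum_z \Bigl( \prod_d \prod_i \prod_k \gamma_{dik}^{\delta(z_{di} = k)} \Bigr) \Bigl\{ \prod_d \prod_k q(m_{dk} \mid r_{dk}, s_{dk}) \Bigr\} \ln \prod_d \prod_i \prod_k \Bigl\{ \frac{\exp(m_{dk})}{\sum_{k'} \exp(m_{dk'})} \Bigr\}^{\delta(z_{di} = k)} dm \\
&= \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_{dk} \mid r_{dk}, s_{dk}) \ln \exp(m_{dk})\, dm_{dk}
 - \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_d \mid r_d, s_d) \ln \Bigl\{ \sum_{k'} \exp(m_{dk'}) \Bigr\} dm_d \\
&= \sum_d \sum_i \sum_k \gamma_{dik}\, r_{dk}
 - \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_d \mid r_d, s_d) \ln \Bigl\{ \sum_{k'} \exp(m_{dk'}) \Bigr\} dm_d
\end{aligned}
\tag{3}
$$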
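We obtain a lower bound by a variational method proposed in [1]. Since $f(x) = \ln x \le x/\nu - 1 + \ln \nu$ for any $\nu > 0$, we introduce a new variable $\nu_d$ for each document and obtain the following inequality:

$$
\begin{aligned}
\int q(m_d \mid r_d, s_d) \ln \Bigl\{ \sum_k \exp(m_{dk}) \Bigr\} dm_d
&\le \int q(m_d \mid r_d, s_d) \Bigl\{ \nu_d^{-1} \sum_k \exp(m_{dk}) - 1 + \ln \nu_d \Bigr\} dm_d \\
&= \ln \nu_d - 1 + \nu_d^{-1} \sum_k \int q(m_{dk} \mid r_{dk}, s_{dk}) \exp(m_{dk})\, dm_{dk} \\
&= \ln \nu_d - 1 + \nu_d^{-1} \sum_k \exp(r_{dk} + s_{dk}^2 / 2) .
\end{aligned}
\tag{4}
$$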
Therefore, Eq. (3) can be lower-bounded as follows:

$$
\int \sum_z q(z)\, q(m) \ln p(z \mid m)\, dm
\ge \sum_d \sum_i \sum_k \gamma_{dik} \Bigl\{ r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2 / 2) \Bigr\} .
\tag{5}
$$
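The inequality in Eq. (4) can be checked numerically. The sketch below is an illustration under assumed toy values for $r_{dk}$ and $s_{dk}$ (they do not come from the manuscript): it estimates the expected log-sum-exp by Monte Carlo and compares it with the bound at two values of $\nu_d$. The function name `log_sum_exp_upper_bound` is hypothetical.

```python
import numpy as np

def log_sum_exp_upper_bound(r, s, nu):
    """Right-hand side of Eq. (4): ln(nu) - 1 + (1/nu) * sum_k exp(r_k + s_k^2 / 2)."""
    return np.log(nu) - 1.0 + np.exp(r + s**2 / 2.0).sum() / nu

rng = np.random.default_rng(0)
r = np.array([0.5, -1.0, 0.2])      # assumed variational means r_dk
s = np.array([0.3, 0.5, 0.4])       # assumed variational standard deviations s_dk

# Monte Carlo estimate of E_q[ ln sum_k exp(m_dk) ] with m_dk ~ N(r_dk, s_dk^2).
m = rng.normal(r, s, size=(200_000, 3))
expected_log_sum = np.log(np.exp(m).sum(axis=1)).mean()

nu_opt = np.exp(r + s**2 / 2.0).sum()             # value of nu that makes the bound tightest
bound_at_opt = log_sum_exp_upper_bound(r, s, nu_opt)
bound_elsewhere = log_sum_exp_upper_bound(r, s, 2.0 * nu_opt)
print(expected_log_sum, bound_at_opt, bound_elsewhere)
```

At the tightest choice of $\nu_d$ the bound reduces to $\ln \sum_k \exp(r_{dk} + s_{dk}^2/2)$, so the remaining gap is exactly the Jensen gap between the expectation of a logarithm and the logarithm of an expectation.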
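2.2 $\int q(\phi) \ln p(\phi \mid \beta)\, d\phi$

$$
\begin{aligned}
\int q(\phi) \ln p(\phi \mid \beta)\, d\phi
&= \sum_k \int \frac{\Gamma(\sum_w \zeta_{kw})}{\prod_w \Gamma(\zeta_{kw})} \prod_w \phi_{kw}^{\zeta_{kw} - 1} \ln \Bigl\{ \frac{\Gamma(\sum_w \beta_w)}{\prod_w \Gamma(\beta_w)} \prod_w \phi_{kw}^{\beta_w - 1} \Bigr\} d\phi_k \\
&= K \ln \Gamma\Bigl(\sum_w \beta_w\Bigr) - K \sum_w \ln \Gamma(\beta_w) + \sum_k \sum_w (\beta_w - 1) \Bigl\{ \Psi(\zeta_{kw}) - \Psi\Bigl(\sum_{w'} \zeta_{kw'}\Bigr) \Bigr\}
\end{aligned}
\tag{6}
$$

$$
\begin{aligned}
\int \sum_z q(z)\, q(\phi) \ln p(x \mid \phi, z)\, d\phi
&= \sum_d \sum_i \sum_k \gamma_{dik} \int \frac{\Gamma(\sum_w \zeta_{kw})}{\prod_w \Gamma(\zeta_{kw})} \prod_w \phi_{kw}^{\zeta_{kw} - 1} \ln \phi_{k x_{di}}\, d\phi_k \\
&= \sum_d \sum_i \sum_k \gamma_{dik} \Bigl\{ \Psi(\zeta_{k x_{di}}) - \Psi\Bigl(\sum_w \zeta_{kw}\Bigr) \Bigr\}
\end{aligned}
\tag{7}
$$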
These derivations are identical to those for latent Dirichlet allocation (LDA).
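2.3 $\int q(m) \ln p(m \mid \mu, \Sigma)\, dm$

$$
\begin{aligned}
\int q(m) \ln p(m \mid \mu, \Sigma)\, dm
&= \sum_d \int \prod_k q(m_{dk} \mid r_{dk}, s_{dk}) \ln \Bigl[ \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\Bigl\{ -\frac{1}{2} (m_d - \mu)^T \Sigma^{-1} (m_d - \mu) \Bigr\} \Bigr] dm_d \\
&= -\frac{DK}{2} \ln 2\pi - \frac{D}{2} \ln |\Sigma| - \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} - \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) ,
\end{aligned}
\tag{8}
$$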
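where $(\Sigma^{-1})_{kk'}$ means the $(k, k')$th entry of $\Sigma^{-1}$. The last two terms are derived as follows:

$$
\begin{aligned}
&\int \prod_k q(m_{dk} \mid r_{dk}, s_{dk})\, (m_d - \mu)^T \Sigma^{-1} (m_d - \mu)\, dm_d \\
&= \int \prod_k q(m_{dk} \mid r_{dk}, s_{dk}) \Bigl\{ \sum_k (m_{dk} - \mu_k)^2 (\Sigma^{-1})_{kk} + \sum_k \sum_{k' \ne k} (m_{dk} - \mu_k)(m_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \Bigr\} dm_d \\
&= \sum_k (r_{dk}^2 + s_{dk}^2 - 2 r_{dk} \mu_k + \mu_k^2) (\Sigma^{-1})_{kk} + \sum_k \sum_{k' \ne k} (r_{dk} - \mu_k)(r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \\
&= \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} + \sum_k \sum_{k'} (r_{dk} - \mu_k)(r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \\
&= \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} + (r_d - \mu)^T \Sigma^{-1} (r_d - \mu)
\end{aligned}
\tag{9}
$$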
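As a sanity check on Eq. (9), the expected quadratic form under the factorized Gaussian can be compared with its closed form. The following sketch uses assumed toy values for the means, standard deviations, and covariance matrix; none of them come from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
P = np.linalg.inv(Sigma)                 # precision matrix Sigma^{-1}
r = np.array([0.5, -1.0, 0.2])           # assumed variational means r_dk
s = np.array([0.3, 0.5, 0.4])            # assumed variational standard deviations s_dk

# Left-hand side of Eq. (9): Monte Carlo average of (m - mu)^T P (m - mu)
# with independent coordinates m_dk ~ N(r_dk, s_dk^2).
m = rng.normal(r, s, size=(200_000, 3))
diff = m - mu
monte_carlo = np.einsum("nk,kl,nl->n", diff, P, diff).mean()

# Right-hand side of Eq. (9): variance term plus quadratic form at the mean.
closed_form = np.sum(s**2 * np.diag(P)) + (r - mu) @ P @ (r - mu)
print(monte_carlo, closed_form)
```

Only the diagonal of the precision matrix multiplies the variances because the coordinates are independent under the variational posterior, so the cross terms contribute through the means alone.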
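2.4 $\sum_z q(z) \ln q(z)$

$$
\sum_z q(z) \ln q(z) = \sum_d \sum_i \sum_k \gamma_{dik} \ln \gamma_{dik}
\tag{10}
$$

$$
\int q(\phi) \ln q(\phi)\, d\phi = \sum_k \ln \Gamma\Bigl(\sum_w \zeta_{kw}\Bigr) - \sum_k \sum_w \ln \Gamma(\zeta_{kw}) + \sum_k \sum_w (\zeta_{kw} - 1) \Bigl\{ \Psi(\zeta_{kw}) - \Psi\Bigl(\sum_{w'} \zeta_{kw'}\Bigr) \Bigr\}
\tag{11}
$$

$$
\int q(m) \ln q(m)\, dm = -\frac{DK}{2} - DK \ln \sqrt{2\pi} - \sum_d \sum_k \ln s_{dk}
\tag{12}
$$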
3 Updating posteriors
Consequently, the lower bound in Eq. (2) is obtained as follows:
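$$
\begin{aligned}
\ln p(x \mid \beta, \mu, \Sigma)
&\ge \sum_d \sum_i \sum_k \gamma_{dik} \Bigl\{ r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2 / 2) \Bigr\} \\
&\quad + K \ln \Gamma\Bigl(\sum_w \beta_w\Bigr) - K \sum_w \ln \Gamma(\beta_w) + \sum_k \sum_w (\beta_w - 1) \Bigl\{ \Psi(\zeta_{kw}) - \Psi\Bigl(\sum_{w'} \zeta_{kw'}\Bigr) \Bigr\} \\
&\quad - \sum_k \ln \Gamma\Bigl(\sum_w \zeta_{kw}\Bigr) + \sum_k \sum_w \ln \Gamma(\zeta_{kw}) - \sum_k \sum_w (\zeta_{kw} - 1) \Bigl\{ \Psi(\zeta_{kw}) - \Psi\Bigl(\sum_{w'} \zeta_{kw'}\Bigr) \Bigr\} \\
&\quad + \sum_d \sum_i \sum_k \gamma_{dik} \Bigl\{ \Psi(\zeta_{k x_{di}}) - \Psi\Bigl(\sum_w \zeta_{kw}\Bigr) \Bigr\} - \sum_d \sum_i \sum_k \gamma_{dik} \ln \gamma_{dik} \\
&\quad - \frac{DK}{2} \ln 2\pi - \frac{D}{2} \ln |\Sigma| - \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} - \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) \\
&\quad + \frac{DK}{2} + DK \ln \sqrt{2\pi} + \sum_d \sum_k \ln s_{dk} .
\end{aligned}
\tag{13}
$$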
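Let $L$ denote the right-hand side. With respect to $\nu_d$, we obtain a derivative:

$$
\frac{\partial L}{\partial \nu_d} = \sum_i \sum_{k'} \gamma_{dik'} \Bigl\{ -\nu_d^{-1} + \nu_d^{-2} \sum_k \exp(r_{dk} + s_{dk}^2 / 2) \Bigr\} .
\tag{14}
$$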
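Note that $\sum_i \sum_{k'} \gamma_{dik'}$ is equal to $n_d$, the length of document $d$. From $\partial L / \partial \nu_d = 0$, we obtain $\nu_d = \sum_k \exp(r_{dk} + s_{dk}^2 / 2)$. With respect to $\gamma_{dik}$, we obtain a derivative:

$$
\frac{\partial L}{\partial \gamma_{dik}} = r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2 / 2) + \Psi(\zeta_{k x_{di}}) - \Psi\Bigl(\sum_w \zeta_{kw}\Bigr) - \ln \gamma_{dik} - 1 .
\tag{15}
$$

The terms that do not depend on $k$ are absorbed by the normalization $\sum_k \gamma_{dik} = 1$. Therefore, by using $\nu_d = \sum_k \exp(r_{dk} + s_{dk}^2 / 2)$, we can update $\gamma_{dik}$ as $\gamma_{dik} \propto \exp(r_{dk}) \cdot \exp \Psi(\zeta_{k x_{di}}) / \exp \Psi(\sum_w \zeta_{kw})$. With respect to $r_{dk}$,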
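$$
\frac{\partial L}{\partial r_{dk}} = n_{dk} - \frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2 / 2) - \sum_{k'} (r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} ,
\tag{16}
$$

where $n_{dk} \equiv \sum_i \gamma_{dik}$. The equation $\partial L / \partial r_{dk} = 0$ cannot be solved analytically. Therefore, we maximize

$$
L(r_{dk}) = n_{dk} r_{dk} - \frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2 / 2) + \frac{1}{2} r_{dk}^2 (\Sigma^{-1})_{kk} - r_{dk} \sum_{k'} (r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'}
\tag{17}
$$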
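by some gradient-based method (e.g. L-BFGS). With respect to $s_{dk}$, we maximize

$$
L(s_{dk}) = -\frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2 / 2) - \frac{1}{2} s_{dk}^2 (\Sigma^{-1})_{kk} + \ln s_{dk}
\tag{18}
$$

by using a gradient

$$
\frac{\partial L(s_{dk})}{\partial s_{dk}} = -\frac{n_d}{\nu_d}\, s_{dk} \exp(r_{dk} + s_{dk}^2 / 2) - s_{dk} (\Sigma^{-1})_{kk} + \frac{1}{s_{dk}} ,
\tag{19}
$$

where the factor $s_{dk}$ in the first term comes from differentiating $s_{dk}^2 / 2$ inside the exponential.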
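The numerical maximization with respect to the Gaussian means and standard deviations can be sketched with SciPy's L-BFGS-B optimizer. This is an illustration under assumptions, not the author's implementation: the function names `neg_bound_r` and `neg_bound_s` and all toy inputs are hypothetical, the whole vector $r_d$ is optimized at once instead of one coordinate at a time, $\nu_d$ is held fixed during the optimization, and the gradient with respect to $s_{dk}$ is obtained by differentiating Eq. (18) directly.

```python
import numpy as np
from scipy.optimize import minimize

def neg_bound_r(r, n_dk, n_d, nu, s, mu, P):
    """Negative r_d-dependent part of the bound and its gradient (cf. Eq. (16))."""
    e = np.exp(r + s**2 / 2.0)
    diff = r - mu
    value = n_dk @ r - (n_d / nu) * e.sum() - 0.5 * diff @ P @ diff
    grad = n_dk - (n_d / nu) * e - P @ diff
    return -value, -grad

def neg_bound_s(s, r, n_d, nu, P_diag):
    """Negative of Eq. (18) summed over k, with the gradient of each term in s_dk."""
    e = np.exp(r + s**2 / 2.0)
    value = np.sum(-(n_d / nu) * e - 0.5 * s**2 * P_diag + np.log(s))
    grad = -(n_d / nu) * e * s - s * P_diag + 1.0 / s
    return -value, -grad

# Toy inputs for one document with K = 3 topics (assumed values).
n_dk = np.array([12.0, 5.0, 3.0])        # expected topic counts, sum_i gamma_dik
n_d = n_dk.sum()
mu = np.zeros(3)
P = np.linalg.inv(np.array([[1.0, 0.8, 0.0],
                            [0.8, 1.0, 0.0],
                            [0.0, 0.0, 1.0]]))
r0, s0 = np.zeros(3), np.ones(3)
nu = np.exp(r0 + s0**2 / 2.0).sum()      # nu_d computed once and held fixed here

args_r = (n_dk, n_d, nu, s0, mu, P)
res_r = minimize(neg_bound_r, r0, args=args_r, jac=True, method="L-BFGS-B")

args_s = (res_r.x, n_d, nu, np.diag(P))
res_s = minimize(neg_bound_s, s0, args=args_s, jac=True, method="L-BFGS-B",
                 bounds=[(1e-6, None)] * 3)   # keep the standard deviations positive
r_new, s_new = res_r.x, res_s.x
```

Passing `jac=True` tells `minimize` that each function returns the value and the gradient together, which avoids evaluating the exponentials twice per step.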
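With respect to $\zeta_{kw}$, we obtain the following update: $\zeta_{kw} = \beta_w + \sum_d \sum_i \delta(x_{di} = w)\, \gamma_{dik}$, that is, only the tokens of word $w$ contribute to $\zeta_{kw}$, as in LDA.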
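The closed-form updates in this section can be sketched as follows. This is a minimal illustration with hypothetical function names and a toy corpus; it assumes the usual LDA-style form of the Dirichlet update, in which only the tokens with $x_{di} = w$ contribute to $\zeta_{kw}$.

```python
import numpy as np
from scipy.special import digamma

def update_nu(r, s):
    """nu_d = sum_k exp(r_dk + s_dk^2 / 2)."""
    return np.exp(r + s**2 / 2.0).sum()

def update_gamma(r, zeta, words):
    """gamma_dik proportional to exp(r_dk + Psi(zeta[k, x_di]) - Psi(sum_w zeta[k, w]))."""
    log_g = r + digamma(zeta[:, words]).T - digamma(zeta.sum(axis=1))   # shape (n_d, K)
    log_g -= log_g.max(axis=1, keepdims=True)     # stabilize before exponentiating
    g = np.exp(log_g)
    return g / g.sum(axis=1, keepdims=True)       # each token's row sums to one

def update_zeta(beta, docs, gammas, K):
    """zeta[k, w] = beta[w] + sum of gamma_dik over tokens with x_di = w (assumed form)."""
    zeta_t = np.tile(beta[:, None], (1, K)).astype(float)   # shape (W, K)
    for words, gamma in zip(docs, gammas):
        np.add.at(zeta_t, words, gamma)           # accumulates repeated word ids correctly
    return zeta_t.T

# Toy corpus: two documents given as arrays of word ids (assumed data).
K, W = 2, 4
beta = np.full(W, 0.1)
docs = [np.array([0, 1, 1, 3]), np.array([2, 2, 0])]
zeta = np.ones((K, W))
r = [np.array([0.3, -0.3]), np.array([-0.5, 0.5])]
s = [np.full(K, 0.5), np.full(K, 0.5)]

nus = [update_nu(r_d, s_d) for r_d, s_d in zip(r, s)]
gammas = [update_gamma(r_d, zeta, words) for r_d, words in zip(r, docs)]
zeta = update_zeta(beta, docs, gammas, K)
```

`np.add.at` is used instead of fancy-index assignment because a document may contain the same word several times, and plain `zeta_t[words] += gamma` would count such a word only once.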
With respect to Σ, we have the following function to be maximized:
$$
L(\Sigma) = -\frac{D}{2} \ln |\Sigma| - \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} - \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) .
\tag{20}
$$
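From the first term in Eq. (20), we obtain a derivative $\frac{\partial \ln |\Sigma|}{\partial \Sigma_{kk'}} = \mathrm{tr}\bigl(\Sigma^{-1} \frac{\partial \Sigma}{\partial \Sigma_{kk'}}\bigr)$. The matrix $\Sigma^{-1} \frac{\partial \Sigma}{\partial \Sigma_{kk'}}$ has non-zero entries only in the $k'$th column, and the column has an entry $(\Sigma^{-1})_{lk}$ at the $l$th row. Therefore, $\frac{\partial \ln |\Sigma|}{\partial \Sigma_{kk'}} = (\Sigma^{-1})_{k'k}$. By symmetry, $\frac{\partial \ln |\Sigma|}{\partial \Sigma} = \Sigma^{-1}$.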
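For the second term in Eq. (20), it holds that $\sum_k s_{dk}^2 (\Sigma^{-1})_{kk} = \mathrm{tr}(\Sigma^{-1} S_d)$, where $S_d$ is a diagonal matrix whose $k$th diagonal entry is $s_{dk}^2$. By using an equation (see footnote 1) $\frac{\partial\, \mathrm{tr}(A \Sigma^{-1} B)}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1}$, we obtain $\frac{\partial \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}}{\partial \Sigma} = -\Sigma^{-1} \bigl(\sum_d S_d\bigr) \Sigma^{-1}$.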
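For the last term in Eq. (20), it holds that $(r_d - \mu)^T \Sigma^{-1} (r_d - \mu) = \mathrm{tr}\bigl((r_d - \mu)^T \Sigma^{-1} (r_d - \mu)\bigr)$. Therefore, by using an equation $\frac{\partial\, \mathrm{tr}(A \Sigma^{-1} B)}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1}$ again, we obtain $\frac{\partial (r_d - \mu)^T \Sigma^{-1} (r_d - \mu)}{\partial \Sigma} = -\Sigma^{-1} (r_d - \mu)(r_d - \mu)^T \Sigma^{-1}$. Consequently,

$$
\frac{\partial L(\Sigma)}{\partial \Sigma} = -\frac{D}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \Bigl( \sum_d S_d \Bigr) \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \sum_d \bigl\{ (r_d - \mu)(r_d - \mu)^T \bigr\} \Sigma^{-1} .
\tag{21}
$$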
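Therefore, $\frac{\partial L(\Sigma)}{\partial \Sigma} = 0$ holds when $\Sigma^{-1} = \frac{1}{D} \Sigma^{-1} \sum_d \bigl\{ S_d + (r_d - \mu)(r_d - \mu)^T \bigr\} \Sigma^{-1}$. By multiplying by $\Sigma$ from the left and the right, we obtain $\Sigma = \frac{1}{D} \sum_d \bigl\{ S_d + (r_d - \mu)(r_d - \mu)^T \bigr\}$.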
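The covariance update can be sketched and checked against the gradient in Eq. (21). This is an illustration under assumptions: the function names `update_Sigma` and `grad_Sigma` are hypothetical, and the variational parameters are random toy values rather than the result of inference.

```python
import numpy as np

def update_Sigma(R, S2, mu):
    """Sigma = (1/D) * sum_d { S_d + (r_d - mu)(r_d - mu)^T }.

    R  : (D, K) array of variational means r_dk
    S2 : (D, K) array of variational variances s_dk^2 (the diagonals of S_d)
    """
    D = R.shape[0]
    diff = R - mu
    return (np.diag(S2.sum(axis=0)) + diff.T @ diff) / D

def grad_Sigma(Sigma, R, S2, mu):
    """Right-hand side of Eq. (21), evaluated at a given Sigma."""
    D = R.shape[0]
    P = np.linalg.inv(Sigma)
    diff = R - mu
    M = np.diag(S2.sum(axis=0)) + diff.T @ diff
    return -0.5 * D * P + 0.5 * P @ M @ P

rng = np.random.default_rng(0)
R = rng.normal(size=(50, 3))                  # toy means for D = 50 documents, K = 3
S2 = rng.uniform(0.1, 0.5, size=(50, 3))      # toy variances
mu = np.zeros(3)                              # assumed fixed mean vector
Sigma = update_Sigma(R, S2, mu)
residual = np.abs(grad_Sigma(Sigma, R, S2, mu)).max()   # should be numerically zero
```

Because every $S_d$ has strictly positive diagonal entries, the updated matrix is positive definite even when the number of documents is smaller than the number of topics.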
References
[1] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.
Footnote 1: cf. Eq. (16) in http://research.microsoft.com/en-us/um/people/minka/papers/matrix/minka-matrix.pdf