An Introduction to Spectral Learning
Hanxiao Liu
November 8, 2013
Outline
1. Method of Moments
2. Learning topic models using spectral properties
3. Anchor words
Preliminaries
$X_1, \dots, X_n \sim p(x; \theta)$, $\theta = (\theta_1, \dots, \theta_m)^\top$

An estimator is a function of the data: $\hat\theta = \hat\theta_n = w(X_1, \dots, X_n)$

Maximum Likelihood Estimator (MLE)

$\hat\theta = \operatorname{argmax}_\theta \log L(\theta)$

Bayes Estimator (BE)

$\hat\theta = E(\theta \mid X) = \dfrac{\int \theta\, p(x \mid \theta)\, \pi(\theta)\, d\theta}{\int p(x \mid \theta)\, \pi(\theta)\, d\theta}$
Question: What makes a good estimator?

- The MLE is consistent.
- Both the MLE and BE have asymptotic normality,
  $\sqrt{n}\,(\hat\theta_n - \theta) \rightsquigarrow N\!\left(0, \tfrac{1}{I(\theta)}\right)$,
  under mild (regularity) conditions.
- Both can be computationally expensive.
Example (Gamma distribution)
$p(x_i; \alpha, \theta) = \dfrac{1}{\Gamma(\alpha)\,\theta^\alpha}\, x_i^{\alpha-1} \exp\!\left(-\dfrac{x_i}{\theta}\right)$

$L(\alpha, \theta) = \left(\dfrac{1}{\Gamma(\alpha)\,\theta^\alpha}\right)^{\!n} \left(\prod_{i=1}^n x_i\right)^{\!\alpha-1} \exp\!\left(-\dfrac{\sum_{i=1}^n x_i}{\theta}\right)$
The MLE is hard to compute due to the presence of $\Gamma(\alpha)$ in the likelihood.
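For contrast, the Gamma MLE has no closed form and is typically found by numerical optimization; a minimal sketch, assuming SciPy is available (`stats.gamma.fit` performs numerical maximum likelihood):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=3.0, size=100_000)

# Numerical MLE: with the location fixed at 0 (floc=0), SciPy solves the
# likelihood equations, which involve digamma terms arising from Gamma(alpha).
alpha_hat, _, theta_hat = stats.gamma.fit(samples, floc=0)
print(alpha_hat, theta_hat)  # should be close to the true (2.0, 3.0)
```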
Method of Moments
$j$-th theoretical moment, $j \in [k]$:

$\mu_j(\theta) := E_\theta(X^j)$

$j$-th sample moment, $j \in [k]$:

$M_j := \dfrac{1}{n} \sum_{i=1}^n X_i^j$

Plug in and solve the multivariate polynomial equations

$M_j = \mu_j(\theta), \quad j \in [k],$

which can sometimes be recast as a spectral decomposition.
Example (Gamma distribution)
$p(x_i; \alpha, \theta) = \dfrac{1}{\Gamma(\alpha)\,\theta^\alpha}\, x_i^{\alpha-1} \exp\!\left(-\dfrac{x_i}{\theta}\right)$

$\bar{X} = E(X_i) = \alpha\theta$

$\dfrac{1}{n} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2 = \operatorname{Var}(X_i) = \alpha\theta^2$

$\Rightarrow \quad \hat\theta = \dfrac{1}{n\bar{X}} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2, \qquad \hat\alpha = \dfrac{\bar{X}}{\hat\theta} = \dfrac{n\bar{X}^2}{\sum_{i=1}^n \left(X_i - \bar{X}\right)^2}$
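By contrast, these moment estimators are closed-form. A minimal sketch in Python (NumPy); `gamma_mom` is an illustrative name, not from the slides:

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments estimates for Gamma(alpha, theta),
    matching E(X) = alpha * theta and Var(X) = alpha * theta**2."""
    x = np.asarray(x, dtype=float)
    mean, var = x.mean(), x.var()  # var uses 1/n, as in the derivation
    theta_hat = var / mean         # theta = Var(X) / E(X)
    alpha_hat = mean / theta_hat   # alpha = E(X) / theta = E(X)^2 / Var(X)
    return alpha_hat, theta_hat

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=3.0, size=100_000)
print(gamma_mom(samples))  # should be close to the true (2.0, 3.0)
```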
- No guarantees about the solution of the polynomial system.
- High-order sample moments are hard to estimate: to reach a specified accuracy, the required sample size and computational cost are exponential in $k$ (or $n$)!

Question: Could we recover the true $\theta$ from only low-order moments?

Question: Could we lower the sample requirement and computational complexity based on some (hopefully mild) assumptions?
Learning Topic Models
- Papadimitriou et al. (2000): non-overlapping separation condition (strong)
- Anandkumar et al. (2012), MoM + SD: full-rank assumption (weak); Multinomial Mixture, LDA
- Arora et al. (2012), MoM + NMF + LP: anchor words (mild); LDA, Correlated Topic Model; a more practical algorithm proposed in 2013
Suppose there are $n$ documents, $k$ hidden topics, and $d$ features (vocabulary words).

$M = [\mu_1 | \mu_2 | \dots | \mu_k] \in \mathbb{R}^{d \times k}$, with $\mu_j \in \Delta^{d-1}$ for all $j \in [k]$

$w = (w_1, \dots, w_k)$, $w \in \Delta^{k-1}$

$P(h = j) = w_j, \quad j \in [k]$

For the $v$-th word in a document, $x_v \in \{e_1, \dots, e_d\}$:

$P(x_v = e_i \mid h = j) = \mu_{ij}, \quad j \in [k],\ i \in [d]$

Goal: recover $M$ using low-order moments.
Construct moment statistics
$\operatorname{Pairs}_{ij} := P(x_1 = e_i, x_2 = e_j)$

$\operatorname{Triples}_{ijt} := P(x_1 = e_i, x_2 = e_j, x_3 = e_t)$

$\operatorname{Pairs} = E[x_1 \otimes x_2] \in \mathbb{R}^{d \times d}$

$\operatorname{Triples} = E[x_1 \otimes x_2 \otimes x_3] \in \mathbb{R}^{d \times d \times d}$
Empirical plug-ins, i.e., $\widehat{\operatorname{Pairs}}$ and $\widehat{\operatorname{Triples}}$, can be obtained from data in a straightforward manner. We want to establish an equivalence between the empirical moments and the parameters of interest.
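As a sketch of the plug-in step, assuming each document is represented by the indices of its first three words (the input format `word_triples` below is illustrative, not from the slides):

```python
import numpy as np

def empirical_moments(word_triples, d):
    """Estimate Pairs and Triples by averaging outer products of the
    one-hot indicators x1, x2, x3 over documents.

    word_triples: integer array of shape (n, 3) holding the indices of
    the first three words of each document."""
    n = len(word_triples)
    pairs = np.zeros((d, d))
    triples = np.zeros((d, d, d))
    for i, j, t in word_triples:
        pairs[i, j] += 1.0 / n       # accumulates E[x1 (x2)^T]
        triples[i, j, t] += 1.0 / n  # accumulates E[x1 ⊗ x2 ⊗ x3]
    return pairs, triples
```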
$\operatorname{Triples}(\eta) := E[(x_1 \otimes x_2)\,\langle x_3, \eta\rangle] \in \mathbb{R}^{d \times d}$

$\operatorname{Triples}(\cdot): \mathbb{R}^d \to \mathbb{R}^{d \times d}$

Lemma

$\operatorname{Pairs} = M \operatorname{diag}(w)\, M^\top$

$\operatorname{Triples}(\eta) = M \left(\operatorname{diag}\!\left(M^\top \eta\right) \operatorname{diag}(w)\right) M^\top$
The unknowns $M$ and $w$ are entangled in these moments.
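The lemma is easy to verify numerically on a synthetic model; a minimal sketch (the sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3

M = rng.dirichlet(np.ones(d), size=k).T   # columns mu_j on the simplex
w = rng.dirichlet(np.ones(k))             # topic weights
eta = rng.standard_normal(d)

# Population moments written as sums over topics ...
pairs = sum(w[j] * np.outer(M[:, j], M[:, j]) for j in range(k))
triples_eta = sum(w[j] * (M[:, j] @ eta) * np.outer(M[:, j], M[:, j])
                  for j in range(k))

# ... agree with the factorized forms in the lemma.
assert np.allclose(pairs, M @ np.diag(w) @ M.T)
assert np.allclose(triples_eta, M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T)
```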
Assumption (Non-degeneracy)

$M$ has full column rank $k$.

1. Find $U, V \in \mathbb{R}^{d \times k}$ s.t. $(U^\top M)^{-1}$ and $(V^\top M)^{-1}$ exist.
2. For any $\eta \in \mathbb{R}^d$, define $B(\eta) \in \mathbb{R}^{k \times k}$:
   $B(\eta) := \left(U^\top \operatorname{Triples}(\eta)\, V\right) \left(U^\top \operatorname{Pairs}\, V\right)^{-1}$

Lemma (Observable Operator)

$B(\eta) = \left(U^\top M\right) \operatorname{diag}\!\left(M^\top \eta\right) \left(U^\top M\right)^{-1}$
Input: $\widehat{\operatorname{Pairs}}$ and $\widehat{\operatorname{Triples}}$
Output: topic-word distributions $\widehat{M}$

1. $U, V \leftarrow$ top-$k$ left and right singular vectors of $\widehat{\operatorname{Pairs}}$ (recall $\operatorname{Pairs} = M \operatorname{diag}(w) M^\top$)
2. $\eta \leftarrow$ random sample from $\operatorname{range}(U)$
3. $(\xi_1, \xi_2, \dots, \xi_k) \leftarrow$ right eigenvectors of $B(\eta)$ (recall $B(\eta) = (U^\top M)\operatorname{diag}(M^\top \eta)(U^\top M)^{-1}$)
4. for $j \leftarrow 1$ to $k$: $\hat\mu_j \leftarrow U\xi_j / \langle \mathbf{1}, U\xi_j \rangle$
5. return $\widehat{M} = [\hat\mu_1 | \hat\mu_2 | \dots | \hat\mu_k]$
Recall the observable operator:

$B(\eta) = \left(U^\top M\right) \operatorname{diag}\!\left(M^\top \eta\right) \left(U^\top M\right)^{-1}$

We hope $M^\top \eta$ has distinct entries. How should we pick $\eta$?

- $\eta \leftarrow e_i \;\Rightarrow\; M^\top \eta$ is the $i$-th word's distribution over topics. Prior knowledge is required!
- Otherwise, take $\eta \leftarrow U\theta$ with $\theta \sim \operatorname{Uniform}(S^{k-1})$; then $M^\top \eta$ has distinct entries with probability one.
- SVD is carried out on $\mathbb{R}^{k \times k}$, $k \ll d$.
- Only involves trigram statistics, i.e., low-order moments.
- Guaranteed to recover the parameters.
- Parameters of more complicated models like LDA can be recovered in the same manner.
Tensor Decomposition
Recall:

$\operatorname{Pairs} = M \operatorname{diag}(w)\, M^\top$

$\operatorname{Triples}(\eta) = M \left(\operatorname{diag}\!\left(M^\top \eta\right) \operatorname{diag}(w)\right) M^\top$

Equivalently,

$\operatorname{Pairs} = \sum_{j=1}^k w_j\, \mu_j \otimes \mu_j$

$\operatorname{Triples} = \sum_{j=1}^k w_j\, \mu_j \otimes \mu_j \otimes \mu_j$
Symmetric tensor decomposition? For that, the $\mu_j$ would need to be orthogonal.
Whiten Pairs:

$W := U D^{-1/2}$, where $\operatorname{Pairs} = U D U^\top$ is its rank-$k$ eigendecomposition, so that $W^\top \operatorname{Pairs}\, W = I$

$\mu'_j := \sqrt{w_j}\, W^\top \mu_j$

We can check that $\mu'_j,\ j \in [k]$, are orthonormal vectors.

Do orthogonal tensor decomposition on

$\operatorname{Triples}(W, W, W) = \sum_{j=1}^k w_j \left(W^\top \mu_j\right)^{\otimes 3} = \sum_{j=1}^k \dfrac{1}{\sqrt{w_j}}\, {\mu'_j}^{\otimes 3}$

Then recover $\mu_j$ from $\mu'_j$.
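A minimal sketch of the whitening step on exact moments (NumPy; model sizes illustrative), checking that $W^\top \operatorname{Pairs}\, W = I$ and that the whitened topics are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 4

M = rng.dirichlet(np.ones(d), size=k).T      # topic-word matrix
w = rng.dirichlet(np.ones(k))                # topic weights
Pairs = M @ np.diag(w) @ M.T                 # rank-k second moment

# Whiten using the top-k eigenpairs of Pairs (eigh sorts ascending).
D, U = np.linalg.eigh(Pairs)
D, U = D[-k:], U[:, -k:]
W = U @ np.diag(D ** -0.5)                   # d x k whitening matrix

assert np.allclose(W.T @ Pairs @ W, np.eye(k))

# mu'_j = sqrt(w_j) W^T mu_j, stored as columns, are orthonormal.
M_white = (W.T @ M) * np.sqrt(w)             # k x k, columns mu'_j
assert np.allclose(M_white @ M_white.T, np.eye(k))
```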
Anchor Words
Drawbacks of the previous algorithms:

- Topics cannot be correlated.
- The bound is weak (comparatively speaking).
- Empirical runtime performance is not satisfactory.

Alternative assumptions?
Definition (p-separable)

$M$ is $p$-separable if $\forall j\ \exists i$ s.t. $M_{ij} \ge p$ and $M_{ij'} = 0$ for $j' \ne j$.
Documents do not necessarily contain anchor words.

Two-fold algorithm:
1. Selection: find the anchor word for each topic (see the sketch below).
2. Recovery: recover $M$ based on the anchor words.
Good theoretical guarantees and empirical results
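As a toy illustration of the separability condition (not Arora et al.'s actual selection step, which must find anchors from word co-occurrence statistics rather than from $M$ itself), one can read the anchor words directly off a known $p$-separable $M$:

```python
import numpy as np

def find_anchor_rows(M, p):
    """For each topic j, return a row i with M[i, j] >= p that is the
    row's only nonzero entry (an anchor word for topic j).

    Toy illustration only: it inspects M directly, whereas the real
    selection step works from co-occurrence statistics."""
    anchors = []
    single_nonzero = np.count_nonzero(M, axis=1) == 1
    for j in range(M.shape[1]):
        rows = np.flatnonzero((M[:, j] >= p) & single_nonzero)
        if rows.size == 0:
            raise ValueError(f"no anchor word for topic {j}")
        anchors.append(int(rows[0]))
    return anchors
```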
[Figure: an illustration of anchor words, taken from Ankur Moitra's slides, http://people.csail.mit.edu/moitra/docs/IASM.pdf]
Discussion
Summary:
- A brief introduction to MoM
- Learning topic models by spectral decomposition
- The anchor words assumption
Connections with our work?