Transcript of "An Introduction to Spectral Learning" (hanxiaol/slides/spectral_learning.pdf)

Page 1: An Introduction to Spectral Learning

Hanxiao Liu

November 8, 2013

Page 2: Outline

1 Method of Moments
2 Learning topic models using spectral properties
3 Anchor words

Page 3: Preliminaries

X1, …, Xn ∼ p(x; θ), θ = (θ1, …, θm)ᵀ

An estimator is a statistic of the data: θ̂ = θ̂n = w(X1, …, Xn)

Maximum Likelihood Estimator (MLE)

θ̂ = argmax_θ log L(θ)

Bayes Estimator (BE)

θ̂ = E(θ | X) = ∫ θ p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ

Page 4: Preliminaries

Question: What makes a good estimator?

The MLE is consistent. Both the MLE and the BE are asymptotically normal,

√n (θ̂n − θ) ⇝ N(0, 1/I(θ))

under mild (regularity) conditions, where I(θ) is the Fisher information.

But both can be computationally expensive.

Page 5: Preliminaries

Example (Gamma distribution)

p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) · exp(−xi/θ)

L(α, θ) = (1/(Γ(α) θ^α))^n · (∏_{i=1}^n xi)^(α−1) · exp(−(∑_{i=1}^n xi)/θ)

The MLE is hard to compute because of the Γ(α) term: the likelihood equation for α involves the digamma function Γ′(α)/Γ(α), which has no closed-form solution.

Page 6: Method of Moments

j-th theoretical moment, j ∈ [k]:

µj(θ) := E_θ(X^j)

j-th sample moment, j ∈ [k]:

Mj := (1/n) ∑_{i=1}^n Xi^j

Plug in and solve the resulting multivariate polynomial equations

Mj = µj(θ), j ∈ [k]

which can sometimes be recast as a spectral decomposition.

Page 7: Method of Moments

Example (Gamma distribution)

p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) · exp(−xi/θ)

Matching the first two moments:

X̄ = E(Xi) = αθ

(1/n) ∑_{i=1}^n (Xi − X̄)² = Var(Xi) = αθ²

⇒ θ̂ = (1/(nX̄)) ∑_{i=1}^n (Xi − X̄)², α̂ = nX̄² / ∑_{i=1}^n (Xi − X̄)²
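As a quick numerical check (not part of the slides), here is a minimal NumPy sketch of these two estimators; the function name gamma_mom is hypothetical:

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments estimates for Gamma(alpha, theta) data.

    Solves  xbar = alpha * theta  and  var = alpha * theta^2,
    i.e. theta_hat = var / xbar and alpha_hat = xbar^2 / var.
    """
    xbar = x.mean()
    var = x.var()  # the (1/n)-normalized variance, as on the slide
    return xbar**2 / var, var / xbar  # (alpha_hat, theta_hat)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.5, scale=1.3, size=100_000)
print(gamma_mom(x))  # roughly (2.5, 1.3)
```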

Page 8: Method of Moments

No guarantees about the solution of the moment equations; high-order sample moments are hard to estimate.

To reach a specified accuracy, the required sample size and computational cost are exponential in k (or n)!

Question: Can we recover the true θ from low-order moments only?

Question: Can we lower the sample requirement and computational complexity based on some (hopefully mild) assumptions?

Page 9: Learning the Topic Models

Papadimitriou et al. (2000): non-overlapping separation condition (strong).

Anandkumar et al. (2012), MoM + spectral decomposition: full-rank assumption (weak); applies to the multinomial mixture and LDA.

Arora et al. (2012), MoM + NMF + LP: anchor-words assumption (mild); applies to LDA and the correlated topic model. A more practical algorithm was proposed in 2013.

Page 10: Learning the Topic Models

Suppose there are n documents, k hidden topics, and d features (vocabulary words).

M = [µ1 | µ2 | … | µk] ∈ R^{d×k}, µj ∈ Δ^{d−1} ∀j ∈ [k]

w = (w1, …, wk), w ∈ Δ^{k−1}

P(h = j) = wj, j ∈ [k]

For the v-th word in a document, xv ∈ {e1, …, ed}:

P(xv = ei | h = j) = µij, j ∈ [k], i ∈ [d]

Goal: recover M using low-order moments.
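To make the generative process concrete, here is a small NumPy sketch of sampling from this single-topic model (the names d, k, doc_len and the values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, doc_len = 1000, 5, 3  # vocabulary size, topics, words per document

w = rng.dirichlet(np.ones(k))            # topic distribution, w ∈ Δ^{k-1}
M = rng.dirichlet(np.ones(d), size=k).T  # d x k; column µ_j ∈ Δ^{d-1}

def sample_doc():
    h = rng.choice(k, p=w)  # draw the hidden topic, P(h = j) = w_j
    # given h, the words are drawn i.i.d. from column µ_h
    return rng.choice(d, size=doc_len, p=M[:, h])

docs = [sample_doc() for _ in range(100_000)]
```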

Page 11: Learning the Topic Models

Construct moment statistics:

Pairs_ij := P(x1 = ei, x2 = ej)

Triples_ijt := P(x1 = ei, x2 = ej, x3 = et)

Pairs = E[x1 ⊗ x2] ∈ R^{d×d}

Triples = E[x1 ⊗ x2 ⊗ x3] ∈ R^{d×d×d}

Empirical plug-in estimates of Pairs and Triples can be obtained from data in a straightforward manner. We want to establish an equivalence between these empirical moments and the parameters of interest.
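For instance, a plug-in estimate of Pairs from the documents sampled above (a simplification that uses only the first two words of each document; in practice all word pairs are averaged — Triples is analogous, one dimension higher):

```python
import numpy as np

def empirical_pairs(docs, d):
    """Estimate Pairs[i, j] = P(x1 = e_i, x2 = e_j) by counting
    (first word, second word) pairs across documents."""
    counts = np.zeros((d, d))
    for doc in docs:
        counts[doc[0], doc[1]] += 1.0
    return counts / len(docs)

pairs_hat = empirical_pairs(docs, d)  # reuses docs, d from the sketch above
```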

Page 12: Learning the Topic Models

Triples(η) := E[x1 ⊗ x2 · ⟨x3, η⟩] ∈ R^{d×d}

Triples(·): R^d → R^{d×d}

Lemma

Pairs = M diag(w) Mᵀ

Triples(η) = M diag(Mᵀη) diag(w) Mᵀ

Proof idea: condition on the topic h and use the conditional independence of x1, x2, x3 given h; e.g. Pairs_ij = ∑_t P(h = t) P(x1 = ei | h = t) P(x2 = ej | h = t) = ∑_t wt µit µjt.

The unknown M and w are entangled in these moments.
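A quick numerical check of the lemma, reusing M, w, d, and rng from the sampling sketch above:

```python
# Closed forms from the lemma:
Pairs = M @ np.diag(w) @ M.T
eta = rng.standard_normal(d)
Triples_eta = M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T

# Entrywise, Pairs[i, j] = sum_t w_t * M[i, t] * M[j, t]:
i, j = 3, 7
assert np.isclose(Pairs[i, j], np.sum(w * M[i] * M[j]))
# and pairs_hat above converges to Pairs as the number of documents grows.
```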

Page 13: Learning the Topic Models

Assumption (Non-degeneracy)

M has full column rank k.

1. Find U, V ∈ R^{d×k} s.t. (UᵀM)⁻¹ and (VᵀM)⁻¹ exist.

2. ∀η ∈ R^d, define B(η) ∈ R^{k×k}:

B(η) := (Uᵀ Triples(η) V)(Uᵀ Pairs V)⁻¹

Lemma (Observable Operator)

B(η) = (UᵀM) diag(Mᵀη) (UᵀM)⁻¹

Proof: substitute the previous lemma, B(η) = UᵀM diag(Mᵀη) diag(w) MᵀV · (UᵀM diag(w) MᵀV)⁻¹ = (UᵀM) diag(Mᵀη) (UᵀM)⁻¹. So B(η) is similar to a diagonal matrix whose entries are the entries of Mᵀη.

Page 14: Learning the Topic Models

Input: estimates of Pairs and Triples
Output: topic-word distributions M

U, V ← top-k left and right eigenvectors of the Pairs estimate   (recall Pairs = M diag(w) Mᵀ)
η ← random sample from range(U)
(ξ1, ξ2, …, ξk) ← right eigenvectors of B(η)   (recall B(η) = (UᵀM) diag(Mᵀη) (UᵀM)⁻¹)
for j ← 1 to k do
    µj ← U ξj / ⟨1, U ξj⟩
end
return M = [µ1 | µ2 | … | µk]
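Below is a minimal NumPy transcription of this pseudocode, a sketch assuming well-estimated moments (the papers add stabilization that is omitted here); recover_topics and its argument names are hypothetical:

```python
import numpy as np

def recover_topics(pairs_hat, triples_hat, k, rng):
    """Recover M from estimates of Pairs (d x d) and Triples (d x d x d)."""
    # U, V <- top-k left/right singular vectors of the Pairs estimate
    U, _, Vt = np.linalg.svd(pairs_hat)
    U, V = U[:, :k], Vt[:k].T

    # eta <- random sample from range(U)
    eta = U @ rng.standard_normal(k)

    # Triples(eta): contract the last axis of Triples with eta
    triples_eta = triples_hat @ eta

    # B(eta) = (U^T Triples(eta) V)(U^T Pairs V)^{-1}
    B = (U.T @ triples_eta @ V) @ np.linalg.inv(U.T @ pairs_hat @ V)

    # xi_j <- right eigenvectors of B(eta); mu_j = U xi_j / <1, U xi_j>
    _, Xi = np.linalg.eig(B)
    M_hat = np.real(U @ Xi)           # eigenpairs are real in exact arithmetic
    return M_hat / M_hat.sum(axis=0)  # normalize each column to sum to one
```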

Page 15: Learning the Topic Models

Lemma (Observable Operator)

B(η) = (UᵀM) diag(Mᵀη) (UᵀM)⁻¹

We hope Mᵀη has distinct entries, so that the eigenvectors of B(η) are uniquely determined (up to scale). How to pick η?

η ← ei ⇒ Mᵀη is the i-th word's distribution over topics — but this requires prior knowledge!
Otherwise, η ← Uθ, θ ∼ Uniform(S^{k−1}); such a random direction gives distinct entries almost surely.

Page 16: Learning the Topic Models

The SVD is carried out on R^{k×k}, k ≪ d.
Only trigram statistics, i.e. low-order moments, are involved.
The parameters are guaranteed to be recovered.
Parameters of more complicated models such as LDA can be recovered in the same manner.

Page 17: Tensor Decomposition

Recall

Pairs = M diag(w) Mᵀ
Triples(η) = M diag(Mᵀη) diag(w) Mᵀ

Equivalently, in tensor form:

Pairs = ∑_{j=1}^k wj · µj ⊗ µj

Triples = ∑_{j=1}^k wj · µj ⊗ µj ⊗ µj

Can we apply a symmetric tensor decomposition directly? The µj would need to be orthogonal, which they are not in general.

Page 18: Tensor Decomposition

Whiten Pairs: take its top-k eigendecomposition Pairs = U D Uᵀ and set

W := U D^{−1/2} ⇒ Wᵀ Pairs W = I

µ′j := √wj · Wᵀ µj

One can check that the µ′j, j ∈ [k], are orthonormal vectors.

Then do an orthogonal tensor decomposition of

Triples(W, W, W) = ∑_{j=1}^k wj (Wᵀ µj)^{⊗3} = ∑_{j=1}^k (1/√wj) · µ′j^{⊗3}

and finally recover µj from µ′j.
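A sketch of the whitening step in NumPy, reusing Pairs, M, w, and k from the earlier checks (eigh applies because Pairs is symmetric positive semidefinite):

```python
# Top-k eigendecomposition Pairs = U D U^T (eigh sorts eigenvalues ascending)
D_vals, U = np.linalg.eigh(Pairs)
U, D_vals = U[:, -k:], D_vals[-k:]

W = U / np.sqrt(D_vals)  # W = U D^{-1/2}, so W^T Pairs W = I_k
assert np.allclose(W.T @ Pairs @ W, np.eye(k), atol=1e-6)

# mu'_j = sqrt(w_j) W^T mu_j are orthonormal:
mu_prime = np.sqrt(w) * (W.T @ M)  # scales column j by sqrt(w_j)
assert np.allclose(mu_prime @ mu_prime.T, np.eye(k), atol=1e-6)
```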

Page 19: Anchor Words

Drawbacks of the previous algorithms:

Topics cannot be correlated.
The bound is weak (comparatively speaking).
Empirical runtime performance is not satisfactory.

Alternative assumptions?

Page 20: Anchor Words

Definition (p-separable)

M is p-separable if ∀j, ∃i s.t. Mij ≥ p and Mij′ = 0 for j′ ≠ j.
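To make the definition concrete, here is a small checker (is_p_separable is a hypothetical helper, not from the slides):

```python
import numpy as np

def is_p_separable(M, p):
    """True iff every topic j has an anchor word i: M[i, j] >= p
    and M[i, j'] == 0 for every other topic j'."""
    d, k = M.shape
    for j in range(k):
        others_zero = np.all(np.delete(M, j, axis=1) == 0, axis=1)
        if not np.any((M[:, j] >= p) & others_zero):
            return False
    return True
```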

Such an i is called an anchor word for topic j. Documents do not necessarily contain anchor words.

Two-step algorithm:
1. Selection: find an anchor word for each topic.
2. Recovery: recover M based on the anchor words.

Good theoretical guarantees and empirical results.

Page 21: Anchor Words

[Illustration of anchor words, taken from Ankur Moitra's slides: http://people.csail.mit.edu/moitra/docs/IASM.pdf]

Page 22: Discussion

Summary:
A brief introduction to the Method of Moments (MoM).
Learning topic models by spectral decomposition.
The anchor words assumption.

Connections with our work?