
An Introduction to Spectral Learning


Hanxiao Liu

November 8, 2013


Outline

1. Method of Moments
2. Learning topic models using spectral properties
3. Anchor words


Preliminaries

Let $X_1, \dots, X_n \sim p(x; \theta)$, where $\theta = (\theta_1, \dots, \theta_m)^\top$.

An estimator is a statistic $\hat{\theta} = \hat{\theta}_n = w(X_1, \dots, X_n)$.

Maximum Likelihood Estimator (MLE)

$$\hat{\theta} = \operatorname*{argmax}_\theta\, \log L(\theta)$$

Bayes Estimator (BE)

$$\hat{\theta} = E(\theta \mid X) = \frac{\int \theta\, p(x \mid \theta)\, \pi(\theta)\, d\theta}{\int p(x \mid \theta)\, \pi(\theta)\, d\theta}$$


Preliminaries

Question: What makes a good estimator?

The MLE is consistent.

Both the MLE and the BE are asymptotically normal: under mild (regularity) conditions,

$$\sqrt{n}\left(\hat{\theta}_n - \theta\right) \rightsquigarrow N\!\left(0, \frac{1}{I(\theta)}\right)$$

where $I(\theta)$ is the Fisher information.

Both can be computationally expensive.


Preliminaries

Example (Gamma distribution)

$$p(x_i; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^\alpha}\, x_i^{\alpha-1} \exp\!\left(-\frac{x_i}{\theta}\right)$$

$$L(\alpha, \theta) = \left(\frac{1}{\Gamma(\alpha)\,\theta^\alpha}\right)^{\!n} \left(\prod_{i=1}^n x_i\right)^{\!\alpha-1} \exp\!\left(-\frac{\sum_{i=1}^n x_i}{\theta}\right)$$

The MLE is hard to compute in closed form due to the presence of $\Gamma(\alpha)$.
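In practice the likelihood equations are solved numerically. A minimal sketch using scipy's built-in Gamma fitter; the synthetic sample, the parameter values, and the `floc=0` pin are our illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy import stats

# Synthetic sample; the true (alpha, theta) are arbitrary choices.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=10_000)

# No closed-form MLE exists because of Gamma(alpha), so scipy solves the
# likelihood equations numerically. floc=0 pins the location parameter,
# matching the two-parameter density on the slide.
alpha_hat, _, theta_hat = stats.gamma.fit(x, floc=0)
print(alpha_hat, theta_hat)  # close to (2.0, 3.0) for this sample size
```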


Method of Moments

The $j$-th theoretical moment, $j \in [k]$:

$$\mu_j(\theta) := E_\theta\!\left(X^j\right)$$

The $j$-th sample moment, $j \in [k]$:

$$M_j := \frac{1}{n} \sum_{i=1}^n X_i^j$$

Plug in and solve the multivariate polynomial equations

$$M_j = \mu_j(\theta), \quad j \in [k]$$

This can sometimes be recast as a spectral decomposition.


Method of Moments

Example (Gamma distribution)

$$p(x_i; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^\alpha}\, x_i^{\alpha-1} \exp\!\left(-\frac{x_i}{\theta}\right)$$

Matching the first two moments,

$$\bar{X} = E(X_i) = \alpha\theta, \qquad \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2 = \mathrm{Var}(X_i) = \alpha\theta^2$$

$$\Rightarrow\quad \hat{\theta} = \frac{1}{n\bar{X}} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2, \qquad \hat{\alpha} = \frac{n\bar{X}^2}{\sum_{i=1}^n \left(X_i - \bar{X}\right)^2}$$
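A minimal numpy sketch of these two moment equations; the sample size and the true parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, theta_true = 2.0, 3.0
x = rng.gamma(shape=alpha_true, scale=theta_true, size=100_000)

xbar = x.mean()
var = ((x - xbar) ** 2).mean()  # the (1/n) sample variance used on the slide

theta_hat = var / xbar          # theta = Var(X) / E(X)
alpha_hat = xbar ** 2 / var     # alpha = E(X)^2 / Var(X)
print(alpha_hat, theta_hat)     # approaches (2.0, 3.0) as n grows
```

Unlike the MLE, everything here is in closed form.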


Method of Moments

The method lacks guarantees about the solution, and high-order sample moments are hard to estimate: to reach a specified accuracy, the required sample size and computational cost are exponential in $k$ (or $n$)!

Question: Could we recover the true $\theta$ from only low-order moments?

Question: Could we lower the sample requirement and computational complexity based on some (hopefully mild) assumptions?


Learning Topic Models

Papadimitriou et al. (2000)
Non-overlapping separation condition (strong)

Anandkumar et al. (2012), MoM + SD
Full-rank assumption (weak)
Multinomial mixture, LDA

Arora et al. (2012), MoM + NMF + LP
Anchor words (mild)
LDA, correlated topic model
A more practical algorithm was proposed in 2013


Learning Topic Models

Suppose there are $n$ documents, $k$ hidden topics, and $d$ features (vocabulary words).

Topic-word distributions:

$$M = [\mu_1 | \mu_2 | \dots | \mu_k] \in \mathbb{R}^{d \times k}, \qquad \mu_j \in \Delta^{d-1}\ \forall j \in [k]$$

Topic proportions $w = (w_1, \dots, w_k) \in \Delta^{k-1}$, where the hidden topic $h$ of a document satisfies

$$P(h = j) = w_j, \quad j \in [k]$$

For the $v$-th word in a document, $x_v \in \{e_1, \dots, e_d\}$ and

$$P(x_v = e_i \mid h = j) = \mu_{ij}, \quad j \in [k],\ i \in [d]$$

Goal: recover $M$ using low-order moments.
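A minimal sketch of this single-topic generative process; the dimensions and the Dirichlet draws for the ground-truth parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_docs, doc_len = 50, 3, 5000, 3  # vocabulary, topics, documents, words/doc

# Ground truth: the columns of M and the vector w live on probability simplices.
M = rng.dirichlet(np.ones(d), size=k).T   # d x k topic-word matrix
w = rng.dirichlet(np.ones(k))             # topic proportions

docs = np.empty((n_docs, doc_len), dtype=int)
for doc in range(n_docs):
    h = rng.choice(k, p=w)                              # one hidden topic per document
    docs[doc] = rng.choice(d, size=doc_len, p=M[:, h])  # words i.i.d. given h
```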


Learning Topic Models

Construct moment statistics:

$$\mathrm{Pairs}_{ij} := P(x_1 = e_i,\ x_2 = e_j), \qquad \mathrm{Triples}_{ijt} := P(x_1 = e_i,\ x_2 = e_j,\ x_3 = e_t)$$

That is,

$$\mathrm{Pairs} = E[x_1 \otimes x_2] \in \mathbb{R}^{d \times d}, \qquad \mathrm{Triples} = E[x_1 \otimes x_2 \otimes x_3] \in \mathbb{R}^{d \times d \times d}$$

Empirical plug-ins $\widehat{\mathrm{Pairs}}$ and $\widehat{\mathrm{Triples}}$ can be obtained from data in a straightforward manner. We want to establish an equivalence between these empirical moments and the parameters of interest.
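Continuing the sketch above, the plug-in estimates just average indicator outer products over documents; with one-hot words this reduces to counting:

```python
# Empirical Pairs and Triples from the first three words of each document.
x1, x2, x3 = docs[:, 0], docs[:, 1], docs[:, 2]

pairs_hat = np.zeros((d, d))
triples_hat = np.zeros((d, d, d))
np.add.at(pairs_hat, (x1, x2), 1.0)        # count word co-occurrences
np.add.at(triples_hat, (x1, x2, x3), 1.0)
pairs_hat /= n_docs                        # normalize counts to probabilities
triples_hat /= n_docs
```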


Learning Topic Models

Define the contraction

$$\mathrm{Triples}(\eta) := E[x_1 \otimes x_2 \,\langle x_3, \eta \rangle] \in \mathbb{R}^{d \times d}$$

viewed as a map $\mathrm{Triples}(\cdot) : \mathbb{R}^d \to \mathbb{R}^{d \times d}$.

Lemma

$$\mathrm{Pairs} = M \operatorname{diag}(w)\, M^\top$$

$$\mathrm{Triples}(\eta) = M \left( \operatorname{diag}\!\left(M^\top \eta\right) \operatorname{diag}(w) \right) M^\top$$

(Both follow by conditioning on the hidden topic $h$.) The unknowns $M$ and $w$ are intertwined.


Learning Topic Models

Assumption (Non-degeneracy)

$M$ has full column rank $k$.

1. Find $U, V \in \mathbb{R}^{d \times k}$ s.t. $(U^\top M)^{-1}$ and $(V^\top M)^{-1}$ exist.
2. For any $\eta \in \mathbb{R}^d$, define $B(\eta) \in \mathbb{R}^{k \times k}$:

$$B(\eta) := \left( U^\top \mathrm{Triples}(\eta)\, V \right) \left( U^\top \mathrm{Pairs}\, V \right)^{-1}$$

Lemma (Observable Operator)

$$B(\eta) = \left( U^\top M \right) \operatorname{diag}\!\left(M^\top \eta\right) \left( U^\top M \right)^{-1}$$

Hence $B(\eta)$ is diagonalizable, and its eigenvectors are the columns of $U^\top M$ (up to scaling).


Learning Topic Models

Input: $\widehat{\mathrm{Pairs}}$ and $\widehat{\mathrm{Triples}}$
Output: topic-word distributions $M$

1. $U, V \leftarrow$ top-$k$ left and right singular vectors of $\widehat{\mathrm{Pairs}}$ (a)
2. $\eta \leftarrow$ random sample from $\mathrm{range}(U)$
3. $(\xi_1, \xi_2, \dots, \xi_k) \leftarrow$ right eigenvectors of $B(\eta)$ (b)
4. for $j \leftarrow 1$ to $k$: $\hat{\mu}_j \leftarrow U \xi_j / \langle \mathbf{1}, U \xi_j \rangle$
5. return $\hat{M} = [\hat{\mu}_1 | \hat{\mu}_2 | \dots | \hat{\mu}_k]$

(a) Recall $\mathrm{Pairs} = M \operatorname{diag}(w)\, M^\top$.
(b) Recall $B(\eta) = (U^\top M) \operatorname{diag}(M^\top \eta)\, (U^\top M)^{-1}$.
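A minimal numpy sketch of this procedure; the function name is ours, and it consumes the `pairs_hat` and `triples_hat` estimates from the earlier sketch (or any consistent estimates):

```python
import numpy as np

def recover_topics(pairs_hat, triples_hat, k, seed=0):
    """Observable-operator recovery of the topic-word matrix, as above."""
    rng = np.random.default_rng(seed)

    # 1. Top-k left/right singular vectors of Pairs.
    U, _, Vt = np.linalg.svd(pairs_hat)
    U, V = U[:, :k], Vt[:k].T

    # 2. A random direction in range(U).
    eta = U @ rng.standard_normal(k)

    # 3. Right eigenvectors of the observable operator B(eta).
    T_eta = np.einsum('ijt,t->ij', triples_hat, eta)       # Triples(eta)
    B = (U.T @ T_eta @ V) @ np.linalg.inv(U.T @ pairs_hat @ V)
    _, Xi = np.linalg.eig(B)                               # eigenvectors as columns

    # 4.-5. Map back to word space and normalize each column onto the simplex.
    M_hat = (U @ Xi).real
    M_hat /= M_hat.sum(axis=0, keepdims=True)
    return M_hat
```

The columns of `M_hat` come back in arbitrary order, so a comparison against the true $M$ is up to a column permutation.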


Learning Topic Models

Lemma (Observable Operator)

$$B(\eta) = \left( U^\top M \right) \operatorname{diag}\!\left(M^\top \eta\right) \left( U^\top M \right)^{-1}$$

We hope $M^\top \eta$ has distinct entries, so that the eigenvalues of $B(\eta)$ are distinct. How do we pick $\eta$?

$\eta \leftarrow e_i \Rightarrow M^\top \eta$ is the $i$-th word's distribution over topics. Prior knowledge required!

Otherwise, $\eta \leftarrow U\theta$ with $\theta \sim \mathrm{Uniform}(\mathcal{S}^{k-1})$, which yields distinct entries with probability one.


Learning Topic Models

SVD is carried out on $\mathbb{R}^{k \times k}$, $k \ll d$.
Only trigram statistics, i.e., low-order moments, are involved.
Guaranteed to recover the parameters.
Parameters of more complicated models like LDA can be recovered in the same manner.


Tensor Decomposition

Recall

$$\mathrm{Pairs} = M \operatorname{diag}(w)\, M^\top, \qquad \mathrm{Triples}(\eta) = M \left( \operatorname{diag}\!\left(M^\top \eta\right) \operatorname{diag}(w) \right) M^\top$$

Equivalently,

$$\mathrm{Pairs} = \sum_{j=1}^{k} w_j\, \mu_j \otimes \mu_j, \qquad \mathrm{Triples} = \sum_{j=1}^{k} w_j\, \mu_j \otimes \mu_j \otimes \mu_j$$

Symmetric tensor decomposition? The $\mu_j$ would need to be orthogonal, which motivates whitening.


Tensor Decomposition

Whiten Pairs: let $\mathrm{Pairs} = U D U^\top$ be its rank-$k$ eigendecomposition and set

$$W := U D^{-1/2} \quad \Rightarrow \quad W^\top \mathrm{Pairs}\, W = I$$

$$\mu_j' := \sqrt{w_j}\, W^\top \mu_j$$

One can check that the $\mu_j'$, $j \in [k]$, are orthonormal vectors.

Do orthogonal tensor decomposition on

$$\mathrm{Triples}(W, W, W) = \sum_{j=1}^{k} w_j \left( W^\top \mu_j \right)^{\otimes 3} = \sum_{j=1}^{k} \frac{1}{\sqrt{w_j}}\, {\mu_j'}^{\otimes 3}$$

Then recover $\mu_j$ from $\mu_j'$.
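A minimal numpy sketch of the whitening step; the names are ours, and `pairs_hat`/`triples_hat` are the estimates from the earlier sketch:

```python
import numpy as np

def whiten(pairs_hat, k):
    """W = U D^{-1/2} from the top-k eigenpairs, so that W.T @ Pairs @ W = I_k."""
    eigvals, eigvecs = np.linalg.eigh(pairs_hat)
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    U, D = eigvecs[:, top], eigvals[top]
    return U / np.sqrt(D)                    # scale columns by D^{-1/2}

W = whiten(pairs_hat, k=3)

# Whitened third moment Triples(W, W, W): a small k x k x k symmetric tensor,
# ready for orthogonal tensor decomposition (e.g., by tensor power iteration).
T_w = np.einsum('abc,ai,bj,ct->ijt', triples_hat, W, W, W)
```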


Anchor Words

Drawbacks of the previous algorithms:

Topics cannot be correlated.
The bound is weak (comparatively speaking).
Empirical runtime performance is not satisfactory.

Are there alternative assumptions?


Anchor Words

Definition (p-separable)

$M$ is $p$-separable if $\forall j$, $\exists i$ s.t. $M_{ij} \geq p$ and $M_{ij'} = 0$ for $j' \neq j$. Such a word $i$ is an anchor word for topic $j$.

Documents do not necessarily contain anchor words.

Two-stage algorithm (a sketch of the selection step follows below):

1. Selection: find the anchor word for each topic.
2. Recovery: recover $M$ based on the anchor words.

Good theoretical guarantees and empirical results.
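A minimal sketch of the selection step, greedily picking rows of the row-normalized co-occurrence matrix that are far from the span of the rows already chosen; this is a simplification of the FastAnchorWords idea in the 2013 algorithm, and the names and normalization are our assumptions:

```python
import numpy as np

def select_anchors(pairs_hat, k):
    """Greedy anchor-word selection over rows of the normalized co-occurrence matrix."""
    Q = pairs_hat / pairs_hat.sum(axis=1, keepdims=True)  # row-normalize
    R = Q.copy()                 # residuals after projecting out chosen anchors
    anchors = []
    for _ in range(k):
        i = int(np.argmax((R * R).sum(axis=1)))  # row with the largest residual norm
        anchors.append(i)
        u = R[i] / np.linalg.norm(R[i])
        R -= np.outer(R @ u, u)                  # Gram-Schmidt: remove this direction
    return anchors
```

The recovery step then expresses every other row of $Q$ as a convex combination of the anchor rows in order to reconstruct $M$.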


Anchor Words

[Figure: an illustration of anchor words, taken from Ankur Moitra's slides, http://people.csail.mit.edu/moitra/docs/IASM.pdf]


Discussion

Summary:

A brief introduction to MoM.
Learning topic models by spectral decomposition.
The anchor words assumption.

Connections with our work?