
Latent Variable Models
Probabilistic Models in the Study of Language

Day 4

Roger Levy

UC San Diego

Department of Linguistics

Preamble: plate notation for graphical models

Here is the kind of hierarchical model we’ve seen so far:

[Graphical model: shared parameters θ and Σb at the top; cluster-level parameters b1, b2, …, bm; observations y11, …, y1n1, y21, …, y2n2, …, ym1, …, ymnm at the bottom]

Plate notation for graphical models

Here is a more succinct representation of the same model:

[Plate diagram: nodes θ, Σb, b, i, and y; the b node sits inside a plate of size m, and the i and y nodes sit inside a plate of size N]

◮ The rectangles with N and m are plates; the semantics of a plate with n is “replicate this node n times”

◮ N = ∑_{i=1}^m n_i (see previous slide)

◮ The i node is a cluster identity node

◮ In our previous application of hierarchical models to regression, cluster identities are known
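In code, a plate is just a loop over its index. Here is a toy numpy rendering of the hierarchical model above, with scalar b and illustrative values for θ, Σb, and the observation noise (all assumptions of this sketch, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, Sigma_b, sigma_y = 0.0, 1.0, 0.5   # illustrative top-level parameters
m = 5                                     # size of the m plate (number of clusters)
n = [4, 6, 3, 5, 7]                       # observations per cluster; N = sum(n)

# "Replicate the b node m times": one cluster-level parameter per cluster
b = rng.normal(theta, np.sqrt(Sigma_b), size=m)

# "Replicate the (i, y) nodes N times": each observation records its cluster identity i
y = [(i, rng.normal(b[i], sigma_y)) for i in range(m) for _ in range(n[i])]
```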

The plan for today’s lecture

[Same plate diagram as on the previous slide]

◮ We are going to study the simplest type of latent-variable models

◮ Technically speaking, “latent variable” means any variable whose value is unknown

◮ But it’s conventionally used to refer to hidden structural relations among observations

◮ In today’s clustering applications, simply treat i as unknown

◮ Inferring values of i induces a clustering among observations; to do so we need to put a probability distribution over i

The plan for today’s lecture

We will cover two types of simple latent-variable models:

◮ The mixture of Gaussians for continuous multivariate data;

◮ Latent Dirichlet Allocation (LDA; also called topic models) for categorical data (words) in collections of documents

Mixture of Gaussians

◮ Motivating example: how are phonological categories learned?

◮ Evidence that learning involves a combination of both innate bias and experience:

  ◮ Infants can distinguish some contrasts that adult speakers of languages lacking them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia)

  ◮ Other contrasts are not reliably distinguished until ∼1 year of age by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010)

Learning vowel categories

To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007):

[Scatter plot matrices of F1, F2, and Duration for two individual speakers (S1 and S2), with points labeled by vowel category: e, E, i, I]

Framing the category learning problem

Here’s 19 speakers’ data mixed together:

[Scatter plot matrix of F1, F2, and Duration for all 19 speakers’ data pooled together, with points labeled by vowel category: e, E, i, I]

Framing the category learning problem

◮ “Learning” from such data can be thought of in two ways:

  ◮ Grouping the observations into categories

  ◮ Determining the underlying category representations (positions, shapes, and sizes)

◮ Formally: every possible grouping of observations y into categories represents a partition Π of the observations y. If θ are parameters describing category representations, our problem is to infer

P(Π, θ | y)

from which we could recover the two marginal probability distributions of interest:

P(Π | y) (distribution over partitions given the data)

P(θ | y) (distribution over category properties given the data)
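Spelled out, these marginals come from integrating out θ and summing over partitions in the joint posterior, respectively:

P(Π | y) = ∫_θ P(Π, θ | y) dθ

P(θ | y) = ∑_Π P(Π, θ | y)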

The mixture of Gaussians

◮ Simple generative model of the data: we have k multivariate Gaussians with frequencies φ = ⟨φ1, …, φk⟩, each with its own mean µi and covariance matrix Σi (here we punt on how to induce the correct number of categories)

◮ N observations are generated i.i.d. by:

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ Here is the corresponding graphical model:

[Plate diagram: category parameters µ, Σ inside a plate of size m; observations y inside a plate of size n]
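As a concrete (and purely illustrative) rendering of this generative process in numpy, with made-up values for k, φ, the µi, and the Σi:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: k = 2 components in 2 dimensions
phi = np.array([0.5, 0.5])                          # category frequencies
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]  # component covariances

N = 200
# i ~ Multinom(phi): one category label per observation
i = rng.choice(len(phi), size=N, p=phi)
# y ~ N(mu_i, Sigma_i): draw each observation from its category's Gaussian
y = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in i])
```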

Can we use maximum likelihood?

◮ For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian’s parameters is

µ = ⟨ȳ1, ȳ2, …, ȳk⟩

Σ = [ Var(y1)      Cov(y1, y2)  …  Cov(y1, yk) ]
    [ Cov(y1, y2)  Var(y2)      …  Cov(y2, yk) ]
    [ ⋮            ⋮            ⋱  ⋮           ]
    [ Cov(y1, yk)  Cov(y2, yk)  …  Var(yk)     ]

where “Var” and “Cov” are the sample variance and covariance
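For intuition, the same estimators in numpy on synthetic data (bias=True gives the maximum-likelihood covariance, dividing by N rather than N − 1):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic observations from a single 2-D Gaussian
y = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=500)

mu_hat = y.mean(axis=0)                          # vector of sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)   # MLE covariance matrix
```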

Can we use maximum likelihood?

So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood?

[Scatter plot of the example data points (x and y axes, roughly 2–8)]

Can we use maximum likelihood?

The set of all partitions into ⟨3, 3⟩ observations for our example data:

[Grid of scatter plots, one per ⟨3, 3⟩ partition of the example data, each annotated with its log data likelihood; the values shown range from −17.24 to −10.65, with the best-scoring partition at −10.65]
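A sketch of the brute-force search this implies, on six synthetic 2-D points standing in for the example data; each candidate ⟨3, 3⟩ split is scored by fitting each half with its own Gaussian MLE:

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
# Six synthetic 2-D points standing in for the example data
pts = np.vstack([rng.multivariate_normal([3, 3], np.eye(2), size=3),
                 rng.multivariate_normal([6, 7], np.eye(2), size=3)])

def loglik(cluster):
    """Log likelihood of a cluster under its own Gaussian MLE (sample mean, biased covariance)."""
    mu = cluster.mean(axis=0)
    Sigma = np.cov(cluster, rowvar=False, bias=True)
    return multivariate_normal.logpdf(cluster, mu, Sigma, allow_singular=True).sum()

splits = []
for idx in combinations(range(6), 3):   # each unordered <3,3> split appears twice; harmless
    a = list(idx)
    b = [i for i in range(6) if i not in idx]
    splits.append((loglik(pts[a]) + loglik(pts[b]), a, b))

score, a, b = max(splits)
print(f"best <3,3> split: {a} | {b}  (log likelihood {score:.2f})")
```

The number of candidate partitions grows exponentially in the number of observations, which is what makes the search daunting.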

Can we use maximum likelihood?

This looks like a daunting search task, but there is an even bigger problem. Suppose I try a partition into ⟨5, 1⟩…

[Scatter plot of the example data with a ⟨5, 1⟩ partition highlighted]

ML for this partition: ∞!!!

More generally, for a V-dimensional problem you need at least V + 1 points in each partition. But this constraint would prevent you from finding intuitive solutions to your problem!
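To see where the infinity comes from: a cluster containing a single point can be fit by a Gaussian centered exactly on that point with arbitrarily small covariance, so the density at that point grows without bound. A small sketch (the point and the shrinking variances are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([5.0, 5.0])          # a singleton "cluster"
for sigma in [1.0, 0.1, 0.01, 0.001]:
    # Gaussian centered on the point itself, with covariance sigma^2 * I
    dens = multivariate_normal.pdf(x, mean=x, cov=sigma**2 * np.eye(2))
    print(f"sigma = {sigma:6.3f}  density at the point = {dens:.3e}")
# Here the density is 1 / (2*pi*sigma^2), so it diverges as sigma -> 0
```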

Bayesian Mixture of Gaussians

[Plate diagram as before, now with a prior node α on the category parameters µ, Σ]

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ The Bayesian framework allows us to build in explicit assumptions about what constitutes a “sensible” category size

◮ Returning to our graphical model, we put in a prior on category size/shape

◮ For now we will just leave category prior probabilities uniform:

φ1 = φ2 = φ3 = φ4 = 1/4

◮ Here is a conjugate prior distribution for multivariate Gaussians:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)

The Inverse Wishart distribution

◮ Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it

◮ Below I give samples for Σ = (1 0; 0 1), the 2×2 identity

◮ Here, k = 2 (top row) or k = 5 (bottom row)

[Grid of covariance ellipses sampled from the Inverse Wishart distribution]
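A sketch of how one might draw such samples with scipy, together with a mean drawn from the conditional prior µi | Σi ∼ N(µ0, Σi/A); the scale matrix, degrees of freedom, µ0, and A below are illustrative choices, not the lecture’s settings:

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

Sigma0, nu = np.eye(2), 5      # illustrative scale matrix and degrees of freedom
mu0, A = np.zeros(2), 1.0      # illustrative prior mean and scaling constant

for _ in range(3):
    Sigma = invwishart.rvs(df=nu, scale=Sigma0)            # Sigma_i ~ IW(Sigma0, nu)
    mu = multivariate_normal.rvs(mean=mu0, cov=Sigma / A)  # mu_i | Sigma_i ~ N(mu0, Sigma_i / A)
    print(np.round(mu, 2), "\n", np.round(Sigma, 2))
```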

Inference for Mixture of Gaussians using Gibbs Sampling

◮ We still have not given a solution to the search problem

◮ One broadly applicable solution is Gibbs sampling

◮ Simply put:

1. Randomly initialize cluster assignments
2. On each iteration through the data, for each point:
   2.1 “Forget” the cluster assignment of the current point xi
   2.2 Compute the probability distribution over xi’s cluster assignment conditional on the rest of the partition:

       P(Ci | xi, Π−i) = [ ∫θ P(xi | Ci, θ) P(Ci | θ) P(θ) dθ ] / [ ∑j ∫θ P(xj | Cj, θ) P(Cj | θ) P(θ) dθ ]

   2.3 Randomly sample a cluster assignment for xi from P(Ci | xi, Π−i) and continue
3. Do this for “many” iterations (e.g., until the unnormalized marginal data likelihood is high)
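The conditional above integrates θ out; as a more modest illustration, here is a toy, uncollapsed Gibbs sampler under simplifying assumptions added for the sketch: a known shared spherical covariance σ²I, uniform category probabilities, and a N(0, τ²I) prior on each category mean. It alternates between resampling each point’s assignment given the current means and resampling each mean given its assigned points:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_mog(y, k, sigma2=1.0, tau2=10.0, iters=100, seed=0):
    """Toy Gibbs sampler: mixture of Gaussians with known covariance sigma2*I,
    uniform category probabilities, and a N(0, tau2*I) prior on each category mean."""
    rng = np.random.default_rng(seed)
    N, d = y.shape
    z = rng.integers(k, size=N)                            # 1. random initial assignments
    mu = y[rng.choice(N, size=k, replace=False)].copy()    # initialize means at random data points
    for _ in range(iters):
        # 2. Resample each point's cluster assignment given the current means
        for i in range(N):
            logp = np.array([multivariate_normal.logpdf(y[i], mu[c], sigma2 * np.eye(d))
                             for c in range(k)])
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(k, p=p / p.sum())
        # Resample each category mean from its Normal posterior given its assigned points
        for c in range(k):
            yc = y[z == c]
            post_var = 1.0 / (len(yc) / sigma2 + 1.0 / tau2)
            post_mean = post_var * yc.sum(axis=0) / sigma2
            mu[c] = rng.multivariate_normal(post_mean, post_var * np.eye(d))
    return z, mu
```

Run on standardized (F1, F2, Duration) data with k = 4, the returned z would give one sampled partition, roughly analogous in spirit to the experiments on the following slides.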

Inference for Mixture of Gaussians using Gibbs Sampling

Starting point for our problem:

One pass of Gibbs sampling through the data

Results of Gibbs sampling with known category probabilities

Posterior modes of category structures:

[Three panels of fitted category structure: F1 versus F2, F1 versus Duration, F2 versus Duration]

Results of Gibbs sampling with known category probabilities

Confusion table of assignments of observations to categories:

[Two confusion tables (unsupervised versus supervised), plotting true vowel (I, i, E, e) against inferred cluster (1–4)]

Extending the model to learning category probabilities

◮ The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α1, …, αk and density D(π1, …, πk):

D(π1, …, πk) def= (1/Z) π1^(α1−1) π2^(α2−1) ⋯ πk^(αk−1)

where the normalizing constant Z is

Z = Γ(α1) Γ(α2) ⋯ Γ(αk) / Γ(α1 + α2 + ⋯ + αk)
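A quick feel for how the α parameters shape draws of φ, with illustrative values (numpy provides a Dirichlet sampler directly):

```python
import numpy as np

rng = np.random.default_rng(4)
# Draws of phi = (phi_1, ..., phi_4) from Dirichlet priors with illustrative alphas:
print(rng.dirichlet([1, 1, 1, 1]))          # alpha = 1: uniform over the simplex
print(rng.dirichlet([10, 10, 10, 10]))      # large, equal alphas: phi concentrates near (1/4, ..., 1/4)
print(rng.dirichlet([0.2, 0.2, 0.2, 0.2]))  # small alphas: phi tends to be skewed toward a few categories
```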

Extending the model to learning category probabilities

◮ So we set:

φ ∼ D(Σφ)

◮ Combine this with the rest of the model:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)
i ∼ Multinom(φ)
y ∼ N(µi, Σi)

[Plate diagram of the full model, with hyperparameter nodes (including Σφ) feeding the category parameters inside a plate of size m, and the observations inside a plate of size n]

Having to learn category probabilities too makes the problem harder

[Three panels of inferred category structure: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder

We can make the problem even more challenging by skewing the category probabilities:

Category   Probability
e          0.04
E          0.05
i          0.29
I          0.62

Having to learn category probabilities too makes the problem harder

[Three panels of inferred category structure for this skewed setting: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder

Confusion tables for these cases:

[Two confusion tables, with learning of category frequencies versus without, plotting true vowel (I, i, E, e) against inferred cluster (1–4)]

Summary

◮ We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression!

◮ However, category induction presents additional difficulties for category learning:

  ◮ Non-convexity of the objective function → difficulty of search

  ◮ Degeneracy of maximum likelihood

◮ In general you need far more data, and/or additional information sources, to converge on good solutions

◮ Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.

References

Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208–2213. Cognitive Science Society, Austin, TX.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13–F21.

Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3):407–420.

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63.