Source: idiom.ucsd.edu/.../latent_variable_models_presentation.pdf (retrieved 2012. 8. 9.)

Latent Variable Models: Probabilistic Models in the Study of Language

Day 4

Roger Levy

UC San Diego, Department of Linguistics

Preamble: plate notation for graphical models

Here is the kind of hierarchical model we've seen so far:

[Graphical model: θ and Σb are parents of the cluster effects b1, b2, ..., bm; each bi is in turn a parent of its observations yi1, ..., yini]

Plate notation for graphical models

Here is a more succinct representation of the same model:

[Plate diagram: θ and Σb are parents of b, which sits in a plate replicated m times; the cluster-identity node i and the observation y sit in a plate replicated N times]

◮ The rectangles with N and m are plates; the semantics of a plate with n is "replicate this node n times"

◮ N = ∑_{i=1}^m n_i (see previous slide)

◮ The i node is a cluster identity node

◮ In our previous application of hierarchical models to regression, cluster identities are known
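The plate semantics maps directly onto nested sampling loops in a generative program. A minimal sketch of this model, where the specific Gaussian distributions and parameter values are illustrative assumptions (the slides do not fix them):

```python
import numpy as np

rng = np.random.default_rng(0)

theta, sigma_b, sigma_y = 0.0, 1.0, 0.5   # top-level parameters
n_i = [4, 3, 5]                           # observations per cluster
m = len(n_i)                              # the "m" plate
N = sum(n_i)                              # the "N" plate: N = sum_i n_i

b = rng.normal(theta, sigma_b, size=m)    # plate over m: one b per cluster
i = np.repeat(np.arange(m), n_i)          # cluster identity node for each of the N observations
y = rng.normal(b[i], sigma_y)             # plate over N: one y per observation

assert len(i) == N and len(y) == N
```

Each plate becomes one `size=` argument (or loop); the cluster-identity array `i` is what links the two plates.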

The plan for today's lecture

[Same plate diagram: θ, Σb, b, i, y, with plates over m and N]

◮ We are going to study the simplest type of latent-variable models

◮ Technically speaking, "latent variable" means any variable whose value is unknown

◮ But it's conventionally used to refer to hidden structural relations among observations

◮ In today's clustering applications, we simply treat i as unknown

◮ Inferring values of i induces a clustering among observations; to do so we need to put a probability distribution over i
The plan for today's lecture

We will cover two types of simple latent-variable models:

◮ The mixture of Gaussians, for continuous multivariate data;

◮ Latent Dirichlet Allocation (LDA, also called topic models), for categorical data (words) in collections of documents

Mixture of Gaussians

◮ Motivating example: how are phonological categories learned?

◮ Evidence that learning involves a combination of both innate bias and experience:

◮ Infants can distinguish some contrasts that adults whose native language lacks them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia)

◮ Other contrasts are not reliably distinguished until ∼1 year of age even by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010)

Learning vowel categories

To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007):

[Scatter plot matrix of F1, F2, and Duration for two speakers, S1 and S2; the four vowel categories are e, E, i, I]
Framing the category learning problem

Here are 19 speakers' data mixed together:

[Scatter plot matrix of F1, F2, and Duration for all 19 speakers pooled; categories e, E, i, I]

Framing the category learning problem

◮ "Learning" from such data can be thought of in two ways:

◮ Grouping the observations into categories

◮ Determining the underlying category representations (positions, shapes, and sizes)

◮ Formally: every possible grouping of observations y into categories represents a partition Π of the observations y. If θ are parameters describing category representations, our problem is to infer

P(Π, θ | y)

from which we could recover the two marginal probability distributions of interest:

P(Π | y)  (distribution over partitions given the data)

P(θ | y)  (distribution over category properties given the data)

The mixture of Gaussians

◮ Simple generative model of the data: we have k multivariate Gaussians with frequencies φ = ⟨φ1, ..., φk⟩, each with its own mean µi and covariance matrix Σi (here we punt on how to induce the correct number of categories)

◮ N observations are generated i.i.d. by:

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ Here is the corresponding graphical model:

[Plate diagram: µ and Σ sit in a plate replicated m times; the cluster identity i and observation y sit in a plate replicated n times]
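The two-line generative model can be simulated directly. A sketch with k = 2 components, where the mixing probabilities, means, and covariances are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

phi = np.array([0.4, 0.6])                        # mixing frequencies phi
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

N = 500
i = rng.choice(len(phi), size=N, p=phi)           # i ~ Multinom(phi)
y = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in i])  # y ~ N(mu_i, Sigma_i)

assert y.shape == (N, 2)
```

The learning problem of the following slides is exactly the inverse of this program: given only `y`, recover `i` (the partition) and the component parameters.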

Can we use maximum likelihood?

◮ For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian's parameters is

µ̂ = ⟨ȳ1, ȳ2, ..., ȳk⟩

Σ̂ = | Var(y1)      Cov(y1, y2)  ...  Cov(y1, yk) |
    | Cov(y2, y1)  Var(y2)      ...  Cov(y2, yk) |
    | ...          ...          ...  ...         |
    | Cov(yk, y1)  Cov(yk, y2)  ...  Var(yk)     |

where "Var" and "Cov" are the sample variance and covariance
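These estimates are just the sample mean vector and the denominator-N sample covariance matrix; a quick numerical sketch (the true parameters used to simulate data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulate from a known 2-dimensional Gaussian
y = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.3], [0.3, 1.0]], size=1000)

mu_hat = y.mean(axis=0)                           # MLE of the mean: per-dimension sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)    # MLE of covariance (bias=True divides by N, not N-1)

# Diagonal entries are sample variances, off-diagonal entries sample covariances
assert np.allclose(Sigma_hat, Sigma_hat.T)
```

With 1000 draws, `mu_hat` and `Sigma_hat` land close to the simulating parameters, as the MLE's consistency predicts.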

Can we use maximum likelihood?

So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood?

[Scatter plot of six example points in the (x, y) plane]

Can we use maximum likelihood?

The set of all partitions into ⟨3, 3⟩ observations for our example data:

[Grid of 20 panels, one per labeled ⟨3, 3⟩ partition, each annotated with its log-likelihood; the values range from −17.24 to −10.65]
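The 20 panels can be enumerated mechanically: choosing which 3 of the 6 points go to the first (labeled) cluster gives C(6, 3) = 20 splits, each unordered partition appearing twice with the cluster labels swapped. A sketch:

```python
from itertools import combinations

points = list(range(6))                     # indices of the six observations
splits = [(set(c), set(points) - set(c))    # labeled <3,3> partitions
          for c in combinations(points, 3)]
print(len(splits))                          # 20 labeled splits = C(6, 3)

# Collapse label swaps: each unordered partition was counted twice
unordered = {frozenset(map(frozenset, s)) for s in splits}
print(len(unordered))                       # 10 distinct unordered partitions
```

Even for six points there are 20 candidates to score; with realistic N and variable cluster sizes, the number of partitions explodes combinatorially, which is the search problem the next slides raise.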

Can we use maximum likelihood?

This looks like a daunting search task, but there is an even bigger problem. Suppose I try a partition into ⟨5, 1⟩...

[Scatter plot: five of the six points in one cluster, a lone point in the other]

ML for this partition: ∞!!! (A singleton cluster lets its fitted covariance shrink toward zero, so the density at that point grows without bound.)

More generally, for a V-dimensional problem you need at least V + 1 points in each partition. But this constraint would prevent you from finding intuitive solutions to your problem!
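The divergence is easy to see in one dimension: put the fitted mean on the singleton point and let the standard deviation shrink, and the log-likelihood grows without bound. A minimal illustration:

```python
import math

def normal_logpdf(x, mu, sigma):
    # log density of N(mu, sigma^2) evaluated at x
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# A singleton "cluster" {x = 3.0}: the MLE sets mu = x, and then the
# log-likelihood -0.5*log(2*pi*sigma^2) diverges to +infinity as sigma -> 0
for sigma in [1.0, 0.1, 0.01, 1e-6]:
    print(sigma, normal_logpdf(3.0, 3.0, sigma))
```

This is why raw maximum likelihood, with no constraint or prior on the covariances, prefers degenerate singleton clusters.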

Bayesian Mixture of Gaussians

[Plate diagram: a prior node α is a parent of µ and Σ, which sit in a plate replicated m times; i and y sit in a plate replicated n times]

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ The Bayesian framework allows us to build in explicit assumptions about what constitutes a "sensible" category size

◮ Returning to our graphical model, we put in a prior on category size/shape

◮ For now we will just leave category prior probabilities uniform:

φ1 = φ2 = φ3 = φ4 = 1/4

◮ Here is a conjugate prior distribution for multivariate Gaussians:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)
The Inverse Wishart distribution

◮ Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it

◮ Below I give samples for Σ = ( 1 0 ; 0 1 ), the 2 × 2 identity

◮ Here, k = 2 (top row) or k = 5 (bottom row)

[Figure of sampled covariance ellipses not reproduced in the transcript]
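Samples like those on the slide can be drawn with scipy's `invwishart`, a sketch assuming scipy is available (here `df` plays the role of the degrees-of-freedom parameter; the values 3 and 10 are illustrative, chosen to satisfy scipy's requirement that df be at least the dimension):

```python
import numpy as np
from scipy.stats import invwishart

Sigma0 = np.eye(2)   # scale matrix: the 2x2 identity, as on the slide

# Larger df concentrates the draws more tightly around the scale matrix
for df in (3, 10):
    samples = invwishart.rvs(df=df, scale=Sigma0, size=4, random_state=df)
    for S in samples:
        # Every draw is a valid (positive definite) covariance matrix
        assert np.all(np.linalg.eigvalsh(S) > 0)
```

Plotting each draw as the 1-sd ellipse of N(0, S) reproduces the kind of picture the slide shows: small df gives wildly varying ellipses, large df gives ellipses close to circles.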

Inference for Mixture of Gaussians using Gibbs Sampling

◮ We still have not given a solution to the search problem

◮ One broadly applicable solution is Gibbs sampling

◮ Simply put:

1. Randomly initialize cluster assignments
2. On each iteration through the data, for each point:
   2.1 "Forget" the cluster assignment of the current point xi
   2.2 Compute the probability distribution over xi's cluster assignment conditional on the rest of the partition:

       P(Ci | xi, Π−i) = ∫θ P(xi | Ci, θ) P(Ci | θ) P(θ) dθ  /  ∑j ∫θ P(xi | Cj, θ) P(Cj | θ) P(θ) dθ

   2.3 Randomly sample a cluster assignment for xi from P(Ci | xi, Π−i) and continue
3. Do this for "many" iterations (e.g., until the unnormalized marginal data likelihood is high)
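The loop above can be sketched in a few dozen lines. This sketch makes several simplifying assumptions not on the slide: k = 2 clusters, uniform φ, and step 2.2 approximated by plugging current per-cluster mean/covariance estimates (with a small ridge for numerical stability) into the likelihood rather than integrating over θ, so it only approximates the collapsed conditional:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_pass(y, z, k, rng):
    """One Gibbs sweep over cluster assignments z for data y (N x d)."""
    n, d = y.shape
    for idx in range(n):
        z[idx] = -1                                  # 2.1: "forget" x_i's assignment
        logp = np.full(k, -np.inf)
        for c in range(k):
            members = y[z == c]
            if len(members) == 0:
                continue                             # empty cluster: leave at -inf
            mu = members.mean(axis=0)
            Sigma = np.cov(members, rowvar=False, bias=True) + 0.1 * np.eye(d)
            logp[c] = multivariate_normal.logpdf(y[idx], mu, Sigma)  # 2.2 (plug-in)
        p = np.exp(logp - logp.max())
        z[idx] = rng.choice(k, p=p / p.sum())        # 2.3: resample the assignment
    return z

rng = np.random.default_rng(4)
# Two well-separated synthetic clusters of 30 points each
y = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
z = rng.integers(0, 2, size=len(y))                  # 1: random initialization
for _ in range(20):                                  # 3: "many" iterations
    z = gibbs_pass(y, z, k=2, rng=rng)
```

With data this well separated, the sampled partition settles onto the two true groups (up to label swapping) within a few sweeps.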

Inference for Mixture of Gaussians using Gibbs Sampling

Starting point for our problem:

[Figure of the initial random cluster assignments not reproduced in the transcript]

One pass of Gibbs sampling through the data

[Figure not reproduced in the transcript]

Results of Gibbs sampling with known category probabilities

Posterior modes of category structures:

[Three panels: F1 versus F2, F1 versus Duration, and F2 versus Duration]

Results of Gibbs sampling with known category probabilities

Confusion table of assignments of observations to categories:

[Two confusion tables of cluster (1-4) against true vowel (I, i, E, e): unsupervised and supervised]

Extending the model to learning category probabilities

◮ The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α1, ..., αk:

D(π1, ..., πk) ≝ (1/Z) π1^(α1−1) π2^(α2−1) ··· πk^(αk−1)

where the normalizing constant Z is

Z = Γ(α1) Γ(α2) ··· Γ(αk) / Γ(α1 + α2 + ··· + αk)
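numpy can draw from this distribution directly. A quick sketch with a symmetric α (the particular α values are illustrative), checking two defining properties: every draw lies on the probability simplex, and each coordinate has mean αj / ∑ α:

```python
import numpy as np

rng = np.random.default_rng(5)

alpha = np.array([1.0, 1.0, 1.0, 1.0])      # symmetric Dirichlet over k = 4 categories
pis = rng.dirichlet(alpha, size=1000)       # each row is a draw (pi_1, ..., pi_k)

assert np.allclose(pis.sum(axis=1), 1.0)    # every draw is a probability vector
# E[pi_j] = alpha_j / sum(alpha) = 1/4 here; the sample means should be close
assert np.allclose(pis.mean(axis=0), 0.25, atol=0.03)
```

This is exactly the role the Dirichlet plays below: a prior over the category probabilities φ.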

Extending the model to learning category probabilities

◮ So we set:

φ ∼ D(Σφ)

◮ Combine this with the rest of the model:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)
i ∼ Multinom(φ)
y ∼ N(µi, Σi)

[Plate diagram of the full model, with the Dirichlet hyperparameter Σφ added alongside the priors on the category parameters]

Page 69:

Having to learn category probabilities too makes the problem harder

[Scatterplots of the simulated vowel data in three panels: F1 and F2, F1 and Duration, F2 and Duration]

Page 70:

Having to learn category probabilities too makes the problem harder

We can make the problem even more challenging by skewing the category probabilities:

Category   Probability
e          0.04
E          0.05
i          0.29
I          0.62
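To get a feel for what this skew means for a learner, a minimal simulation (Python; the sample size of 500 tokens is an arbitrary illustrative choice) draws category counts from these probabilities — the rarest vowels contribute only a handful of tokens:

```python
import numpy as np

# Skewed category probabilities from the table above
probs = {"e": 0.04, "E": 0.05, "i": 0.29, "I": 0.62}

rng = np.random.default_rng(0)
n = 500  # hypothetical number of tokens
counts = dict(zip(probs, rng.multinomial(n, list(probs.values()))))

# Expected counts out of 500: e ~ 20, E ~ 25, i ~ 145, I ~ 310 --
# the learner sees very few examples of the rare categories
```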

Page 71:

Having to learn category probabilities too makes the problem harder

[Scatterplots of the skewed-probability vowel data in three panels: F1 and F2, F1 and Duration, F2 and Duration]

Page 72:

Having to learn category probabilities too makes the problem harder

Confusion tables for these cases:

[Confusion tables of true vowel (I, i, E, e) against inferred cluster (1–4): with learning of category frequencies vs. without learning of category frequencies]
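A confusion table like these can be tabulated directly from paired labels. A minimal sketch (Python; the label arrays below are invented placeholders, not the slides' data):

```python
from collections import Counter

# Hypothetical true vowels and inferred cluster labels for six tokens
true_vowels = ["I", "I", "i", "E", "e", "I"]
clusters    = [1,   1,   2,   3,   3,   1]

# Count (true vowel, cluster) pairs; each cell of the confusion
# table is the count for one such pair
table = Counter(zip(true_vowels, clusters))

assert table[("I", 1)] == 3  # all three "I" tokens landed in cluster 1
```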

Page 78:

Summary

◮ We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression!

◮ However, category induction presents additional difficulties beyond supervised category learning:

◮ Non-convexity of the objective function → difficulty of search

◮ Degeneracy of maximum likelihood

◮ In general you need far more data, and/or additional information sources, to converge on good solutions

◮ Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.
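The maximum-likelihood degeneracy is easy to demonstrate numerically. In the toy two-component mixture below (Python; the data points and mixture settings are made up for illustration), collapsing one Gaussian component onto a single data point drives the log-likelihood up without bound:

```python
import math

data = [0.0, 1.0, 2.0, 3.0]

def mix_loglik(data, sigma):
    """Log-likelihood of an equal-weight two-component Gaussian mixture:
    one narrow component centered on the first data point (std = sigma),
    one broad component covering the rest (mean 1.5, std 2)."""
    ll = 0.0
    for x in data:
        p1 = 0.5 * math.exp(-(x - data[0])**2 / (2 * sigma**2)) \
             / (sigma * math.sqrt(2 * math.pi))
        p2 = 0.5 * math.exp(-(x - 1.5)**2 / (2 * 2.0**2)) \
             / (2.0 * math.sqrt(2 * math.pi))
        ll += math.log(p1 + p2)
    return ll

# Shrinking the narrow component's sigma keeps increasing the likelihood:
# the ML objective has no finite maximum
assert mix_loglik(data, 0.01) > mix_loglik(data, 0.1) > mix_loglik(data, 1.0)
```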

Page 79:

References I

Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208–2213. Cognitive Science Society, Austin, TX.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13–F21.

Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3):407–420.

Page 80:

References II

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63.