
Latent Variable Models
Probabilistic Models in the Study of Language

Day 4

Roger Levy

UC San Diego

Department of Linguistics

Preamble: plate notation for graphical models

Here is the kind of hierarchical model we’ve seen so far:

[Graphical model: shared parameters θ and Σb at the top; cluster-level parameters b1, b2, …, bm; observations y11, …, y1n1, y21, …, y2n2, …, ym1, …, ymnm at the bottom]

Plate notation for graphical models

Here is a more succinct representation of the same model:

[Plate diagram: nodes θ, Σb, b, i, and y; the b node sits inside a plate of size m, and the i and y nodes sit inside a plate of size N]

◮ The rectangles with N and m are plates; the semantics of a plate with n is “replicate this node n times”

◮ N = ∑_{i=1}^m n_i (see previous slide)

◮ The i node is a cluster identity node

◮ In our previous application of hierarchical models to regression, cluster identities are known
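In code, a plate is just a loop over its index. Here is a toy numpy rendering of the hierarchical model above, with scalar b and illustrative values for θ, Σb, and the observation noise (all assumptions of this sketch, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, Sigma_b, sigma_y = 0.0, 1.0, 0.5   # illustrative top-level parameters
m = 5                                     # size of the m plate (number of clusters)
n = [4, 6, 3, 5, 7]                       # observations per cluster; N = sum(n)

# "Replicate the b node m times": one cluster-level parameter per cluster
b = rng.normal(theta, np.sqrt(Sigma_b), size=m)

# "Replicate the (i, y) nodes N times": each observation records its cluster identity i
y = [(i, rng.normal(b[i], sigma_y)) for i in range(m) for _ in range(n[i])]
```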

The plan for today’s lecture

[Same plate diagram as on the previous slide]

◮ We are going to study the simplest type of latent-variable models

◮ Technically speaking, “latent variable” means any variable whose value is unknown

◮ But it’s conventionally used to refer to hidden structural relations among observations

◮ In today’s clustering applications, simply treat i as unknown

◮ Inferring values of i induces a clustering among observations; to do so we need to put a probability distribution over i

The plan for today’s lecture

We will cover two types of simple latent-variable models:

◮ The mixture of Gaussians for continuous multivariate data;

◮ Latent Dirichlet Allocation (LDA; also called topic models) for categorical data (words) in collections of documents

Mixture of Gaussians

◮ Motivating example: how are phonological categories learned?

◮ Evidence that learning involves a combination of both innate bias and experience:

  ◮ Infants can distinguish some contrasts that adult speakers of languages lacking them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia)

  ◮ Other contrasts are not reliably distinguished until ∼1 year of age by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010)

Learning vowel categories

To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007):

[Scatter plot matrices of F1, F2, and Duration for two individual speakers (S1 and S2), with points labeled by vowel category: e, E, i, I]

Framing the category learning problem

Here’s 19 speakers’ data mixed together:

[Scatter plot matrix of F1, F2, and Duration for all 19 speakers’ data pooled together, with points labeled by vowel category: e, E, i, I]

Framing the category learning problem

◮ “Learning” from such data can be thought of in two ways:

  ◮ Grouping the observations into categories

  ◮ Determining the underlying category representations (positions, shapes, and sizes)

◮ Formally: every possible grouping of observations y into categories represents a partition Π of the observations y. If θ are parameters describing category representations, our problem is to infer

P(Π, θ | y)

from which we could recover the two marginal probability distributions of interest:

P(Π | y) (distribution over partitions given the data)

P(θ | y) (distribution over category properties given the data)
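Spelled out, these marginals come from integrating out θ and summing over partitions in the joint posterior, respectively:

P(Π | y) = ∫_θ P(Π, θ | y) dθ

P(θ | y) = ∑_Π P(Π, θ | y)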

The mixture of Gaussians

◮ Simple generative model of the data: we have k multivariate Gaussians with frequencies φ = ⟨φ1, …, φk⟩, each with its own mean µi and covariance matrix Σi (here we punt on how to induce the correct number of categories)

◮ N observations are generated i.i.d. by:

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ Here is the corresponding graphical model:

[Plate diagram: category parameters µ, Σ inside a plate of size m; observations y inside a plate of size n]
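As a concrete (and purely illustrative) rendering of this generative process in numpy, with made-up values for k, φ, the µi, and the Σi:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: k = 2 components in 2 dimensions
phi = np.array([0.5, 0.5])                          # category frequencies
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]  # component covariances

N = 200
# i ~ Multinom(phi): one category label per observation
i = rng.choice(len(phi), size=N, p=phi)
# y ~ N(mu_i, Sigma_i): draw each observation from its category's Gaussian
y = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in i])
```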

Can we use maximum likelihood?

◮ For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian’s parameters is

µ = ⟨ȳ1, ȳ2, …, ȳk⟩

Σ = [ Var(y1)      Cov(y1, y2)  …  Cov(y1, yk) ]
    [ Cov(y1, y2)  Var(y2)      …  Cov(y2, yk) ]
    [ ⋮            ⋮            ⋱  ⋮           ]
    [ Cov(y1, yk)  Cov(y2, yk)  …  Var(yk)     ]

where “Var” and “Cov” are the sample variance and covariance
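For intuition, the same estimators in numpy on synthetic data (bias=True gives the maximum-likelihood covariance, dividing by N rather than N − 1):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic observations from a single 2-D Gaussian
y = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=500)

mu_hat = y.mean(axis=0)                          # vector of sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)   # MLE covariance matrix
```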

Can we use maximum likelihood?

So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood?

[Scatter plot of the example data points (x and y axes, roughly 2–8)]

Can we use maximum likelihood?

The set of all partitions into ⟨3, 3⟩ observations for our example data:

[Grid of scatter plots, one per ⟨3, 3⟩ partition of the example data, each annotated with its log data likelihood; the values shown range from −17.24 to −10.65, with the best-scoring partition at −10.65]
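A sketch of the brute-force search this implies, on six synthetic 2-D points standing in for the example data; each candidate ⟨3, 3⟩ split is scored by fitting each half with its own Gaussian MLE:

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
# Six synthetic 2-D points standing in for the example data
pts = np.vstack([rng.multivariate_normal([3, 3], np.eye(2), size=3),
                 rng.multivariate_normal([6, 7], np.eye(2), size=3)])

def loglik(cluster):
    """Log likelihood of a cluster under its own Gaussian MLE (sample mean, biased covariance)."""
    mu = cluster.mean(axis=0)
    Sigma = np.cov(cluster, rowvar=False, bias=True)
    return multivariate_normal.logpdf(cluster, mu, Sigma, allow_singular=True).sum()

splits = []
for idx in combinations(range(6), 3):   # each unordered <3,3> split appears twice; harmless
    a = list(idx)
    b = [i for i in range(6) if i not in idx]
    splits.append((loglik(pts[a]) + loglik(pts[b]), a, b))

score, a, b = max(splits)
print(f"best <3,3> split: {a} | {b}  (log likelihood {score:.2f})")
```

The number of candidate partitions grows exponentially in the number of observations, which is what makes the search daunting.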

Can we use maximum likelihood?

This looks like a daunting search task, but there is an even bigger problem. Suppose I try a partition into ⟨5, 1⟩…

[Scatter plot of the example data with a ⟨5, 1⟩ partition highlighted]

ML for this partition: ∞!!!

More generally, for a V-dimensional problem you need at least V + 1 points in each partition. But this constraint would prevent you from finding intuitive solutions to your problem!
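To see where the infinity comes from: a cluster containing a single point can be fit by a Gaussian centered exactly on that point with arbitrarily small covariance, so the density at that point grows without bound. A small sketch (the point and the shrinking variances are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([5.0, 5.0])          # a singleton "cluster"
for sigma in [1.0, 0.1, 0.01, 0.001]:
    # Gaussian centered on the point itself, with covariance sigma^2 * I
    dens = multivariate_normal.pdf(x, mean=x, cov=sigma**2 * np.eye(2))
    print(f"sigma = {sigma:6.3f}  density at the point = {dens:.3e}")
# Here the density is 1 / (2*pi*sigma^2), so it diverges as sigma -> 0
```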

Bayesian Mixture of Gaussians

[Plate diagram as before, now with a prior node α on the category parameters µ, Σ]

i ∼ Multinom(φ)
y ∼ N(µi, Σi)

◮ The Bayesian framework allows us to build in explicit assumptions about what constitutes a “sensible” category size

◮ Returning to our graphical model, we put in a prior on category size/shape

◮ For now we will just leave category prior probabilities uniform:

φ1 = φ2 = φ3 = φ4 = 1/4

◮ Here is a conjugate prior distribution for multivariate Gaussians:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)

The Inverse Wishart distribution

◮ Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it

◮ Below I give samples for Σ = (1 0; 0 1), the 2×2 identity

◮ Here, k = 2 (top row) or k = 5 (bottom row)

[Grid of covariance ellipses sampled from the Inverse Wishart distribution]
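A sketch of how one might draw such samples with scipy, together with a mean drawn from the conditional prior µi | Σi ∼ N(µ0, Σi/A); the scale matrix, degrees of freedom, µ0, and A below are illustrative choices, not the lecture’s settings:

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

Sigma0, nu = np.eye(2), 5      # illustrative scale matrix and degrees of freedom
mu0, A = np.zeros(2), 1.0      # illustrative prior mean and scaling constant

for _ in range(3):
    Sigma = invwishart.rvs(df=nu, scale=Sigma0)            # Sigma_i ~ IW(Sigma0, nu)
    mu = multivariate_normal.rvs(mean=mu0, cov=Sigma / A)  # mu_i | Sigma_i ~ N(mu0, Sigma_i / A)
    print(np.round(mu, 2), "\n", np.round(Sigma, 2))
```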

Inference for Mixture of Gaussians using Gibbs Sampling

◮ We still have not given a solution to the search problem

◮ One broadly applicable solution is Gibbs sampling

◮ Simply put:

1. Randomly initialize cluster assignments
2. On each iteration through the data, for each point:
   2.1 “Forget” the cluster assignment of the current point xi
   2.2 Compute the probability distribution over xi’s cluster assignment conditional on the rest of the partition:

       P(Ci | xi, Π−i) = [ ∫θ P(xi | Ci, θ) P(Ci | θ) P(θ) dθ ] / [ ∑j ∫θ P(xj | Cj, θ) P(Cj | θ) P(θ) dθ ]

   2.3 Randomly sample a cluster assignment for xi from P(Ci | xi, Π−i) and continue
3. Do this for “many” iterations (e.g., until the unnormalized marginal data likelihood is high)
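The conditional above integrates θ out; as a more modest illustration, here is a toy, uncollapsed Gibbs sampler under simplifying assumptions added for the sketch: a known shared spherical covariance σ²I, uniform category probabilities, and a N(0, τ²I) prior on each category mean. It alternates between resampling each point’s assignment given the current means and resampling each mean given its assigned points:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_mog(y, k, sigma2=1.0, tau2=10.0, iters=100, seed=0):
    """Toy Gibbs sampler: mixture of Gaussians with known covariance sigma2*I,
    uniform category probabilities, and a N(0, tau2*I) prior on each category mean."""
    rng = np.random.default_rng(seed)
    N, d = y.shape
    z = rng.integers(k, size=N)                            # 1. random initial assignments
    mu = y[rng.choice(N, size=k, replace=False)].copy()    # initialize means at random data points
    for _ in range(iters):
        # 2. Resample each point's cluster assignment given the current means
        for i in range(N):
            logp = np.array([multivariate_normal.logpdf(y[i], mu[c], sigma2 * np.eye(d))
                             for c in range(k)])
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(k, p=p / p.sum())
        # Resample each category mean from its Normal posterior given its assigned points
        for c in range(k):
            yc = y[z == c]
            post_var = 1.0 / (len(yc) / sigma2 + 1.0 / tau2)
            post_mean = post_var * yc.sum(axis=0) / sigma2
            mu[c] = rng.multivariate_normal(post_mean, post_var * np.eye(d))
    return z, mu
```

Run on standardized (F1, F2, Duration) data with k = 4, the returned z would give one sampled partition, roughly analogous in spirit to the experiments on the following slides.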

Inference for Mixture of Gaussians using Gibbs Sampling

Starting point for our problem:

One pass of Gibbs sampling through the data

Results of Gibbs sampling with known category probabilities

Posterior modes of category structures:

[Three panels of fitted category structure: F1 versus F2, F1 versus Duration, F2 versus Duration]

Results of Gibbs sampling with known category probabilities

Confusion table of assignments of observations to categories:

[Two confusion tables (unsupervised versus supervised), plotting true vowel (I, i, E, e) against inferred cluster (1–4)]

Extending the model to learning category probabilities

◮ The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α1, …, αk and density D(π1, …, πk):

D(π1, …, πk) def= (1/Z) π1^(α1−1) π2^(α2−1) ⋯ πk^(αk−1)

where the normalizing constant Z is

Z = Γ(α1) Γ(α2) ⋯ Γ(αk) / Γ(α1 + α2 + ⋯ + αk)
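A quick feel for how the α parameters shape draws of φ, with illustrative values (numpy provides a Dirichlet sampler directly):

```python
import numpy as np

rng = np.random.default_rng(4)
# Draws of phi = (phi_1, ..., phi_4) from Dirichlet priors with illustrative alphas:
print(rng.dirichlet([1, 1, 1, 1]))          # alpha = 1: uniform over the simplex
print(rng.dirichlet([10, 10, 10, 10]))      # large, equal alphas: phi concentrates near (1/4, ..., 1/4)
print(rng.dirichlet([0.2, 0.2, 0.2, 0.2]))  # small alphas: phi tends to be skewed toward a few categories
```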

Extending the model to learning category probabilities

◮ So we set:

φ ∼ D(Σφ)

◮ Combine this with the rest of the model:

Σi ∼ IW(Σ0, ν)
µi | Σi ∼ N(µ0, Σi/A)
i ∼ Multinom(φ)
y ∼ N(µi, Σi)

[Plate diagram of the full model, with hyperparameter nodes (including Σφ) feeding the category parameters inside a plate of size m, and the observations inside a plate of size n]

Having to learn category probabilities too makes the problem harder

[Three panels of inferred category structure: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder

We can make the problem even more challenging by skewing the category probabilities:

Category   Probability
e          0.04
E          0.05
i          0.29
I          0.62

Having to learn category probabilities too makes the problem harder

[Three panels of inferred category structure for this skewed setting: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder

Confusion tables for these cases:

[Two confusion tables, with learning of category frequencies versus without, plotting true vowel (I, i, E, e) against inferred cluster (1–4)]

Summary

◮ We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression!

◮ However, category induction presents additional difficulties for category learning:

  ◮ Non-convexity of the objective function → difficulty of search

  ◮ Degeneracy of maximum likelihood

◮ In general you need far more data, and/or additional information sources, to converge on good solutions

◮ Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.

References

Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208–2213. Cognitive Science Society, Austin, TX.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13–F21.

Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3):407–420.

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63.