Latent Variable Models
Probabilistic Models in the Study of Language
Day 4
Roger Levy
UC San Diego
Department of Linguistics
Preamble: plate notation for graphical models
Here is the kind of hierarchical model we’ve seen so far:
[Graphical model: θ and Σb are parents of the cluster-level parameters b1, b2, . . . , bm; each bi is the parent of its cluster's observations yi1, . . . , yini]
Plate notation for graphical models

Here is a more succinct representation of the same model:

[Graphical model in plate notation: θ and Σb are parents of b, which sits in a plate of size m; b and the cluster identity i are parents of y, which sits in a plate of size N]

◮ The rectangles labeled N and m are plates; the semantics of a plate labeled n is "replicate this node n times"
◮ N = n1 + n2 + · · · + nm (see previous slide)
◮ The i node is a cluster identity node
◮ In our previous application of hierarchical models to regression, cluster identities are known
The plan for today's lecture

[Graphical model: as on the previous slide, now with a prior φ added as parent of the cluster identity node i]

◮ We are going to study the simplest type of latent-variable models
◮ Technically speaking, "latent variable" means any variable whose value is unknown
◮ But it's conventionally used to refer to hidden structural relations among observations
◮ In today's clustering applications, we simply treat i as unknown
◮ Inferring values of i induces a clustering among observations; to do so we need to put a probability distribution over i
The plan for today's lecture

We will cover two types of simple latent-variable models:

◮ The mixture of Gaussians for continuous multivariate data;
◮ Latent Dirichlet Allocation (LDA; also called topic models) for categorical data (words) in collections of documents
Mixture of Gaussians

◮ Motivating example: how are phonological categories learned?
◮ Evidence that learning involves a combination of both innate bias and experience:
  ◮ Infants can distinguish some contrasts that adults of speakers lacking them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia)
  ◮ Other contrasts are not reliably distinguished until ∼1 year of age by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010)
Learning vowel categories

To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007):

[Scatter plot matrices of F1, F2, and Duration for two speakers, S1 and S2; points are labeled by vowel category: e, E, i, I]
Framing the category learning problem

Here's 19 speakers' data mixed together:

[Scatter plot matrix of F1, F2, and Duration for all 19 speakers pooled; points are labeled by vowel category: e, E, i, I]
Framing the category learning problem

◮ "Learning" from such data can be thought of in two ways:
  ◮ Grouping the observations into categories
  ◮ Determining the underlying category representations (positions, shapes, and sizes)
◮ Formally: every possible grouping of observations y into categories represents a partition Π of the observations y. If θ are parameters describing category representations, our problem is to infer

  P(Π, θ | y)

from which we could recover the two marginal probability distributions of interest:

  P(Π | y)  (distribution over partitions given the data)
  P(θ | y)  (distribution over category properties given the data)
The mixture of Gaussians

◮ Simple generative model of the data: we have k multivariate Gaussians with frequencies φ = ⟨φ1, . . . , φk⟩, each with its own mean µi and covariance matrix Σi (here we punt on how to induce the correct number of categories)
◮ N observations are generated i.i.d. by:

  i ∼ Multinom(φ)
  y ∼ N(µi, Σi)

◮ Here is the corresponding graphical model:

[Graphical model: φ is the parent of the category identity i; i and the per-category parameters µ, Σ (in a plate of size m) are parents of y (in a plate of size n)]
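As a concrete illustration of this generative process, here is a minimal sketch in Python with NumPy. The category parameters (φ, the means, and the covariances) are invented for illustration, not the lecture's vowel data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category parameters (illustrative only):
# mixing frequencies phi, plus one mean and covariance per category.
phi = np.array([0.25, 0.25, 0.25, 0.25])
mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
Sigmas = np.array([0.5 * np.eye(2) for _ in range(4)])

N = 500
# i ~ Multinom(phi): draw a category identity for each observation
i = rng.choice(len(phi), size=N, p=phi)
# y ~ N(mu_i, Sigma_i): draw each observation from its category's Gaussian
y = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in i])

print(y.shape)  # (500, 2)
```

Inference reverses this process: given only y, we try to recover both the category identities i and the category parameters.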
Can we use maximum likelihood?

◮ For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian's parameters is

  µ = ⟨ȳ1, ȳ2, . . . , ȳk⟩

  Σ = [ Var(y1)      Cov(y1, y2)  . . .  Cov(y1, yk)
        Cov(y1, y2)  Var(y2)      . . .  Cov(y2, yk)
        ...          ...          . . .  ...
        Cov(y1, yk)  Cov(y2, yk)  . . .  Var(yk)    ]

where "Var" and "Cov" are the sample variance and covariance
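These estimators are one-liners in NumPy; here is a small sketch on synthetic data (the true parameters are arbitrary). Note that the ML covariance estimate divides by N rather than N − 1, which `np.cov` exposes via `bias=True`:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
y = rng.multivariate_normal(true_mu, true_Sigma, size=5000)

mu_hat = y.mean(axis=0)                         # vector of sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)  # ML estimate: divide by N

print(mu_hat.round(2))
print(Sigma_hat.round(2))
```

With 5000 samples both estimates land close to the true parameters.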
Can we use maximum likelihood?

So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood?

[Scatter plot: six example points in two dimensions]
Can we use maximum likelihood?

The set of all partitions into ⟨3, 3⟩ observations for our example data:

[Twenty panels, one per ⟨3, 3⟩ partition of the six points, each annotated with its maximized log-likelihood; values range from −17.24 to −10.65]
Can we use maximum likelihood?

This looks like a daunting search task, but there is an even bigger problem. Suppose I try a partition into ⟨5, 1⟩. . .

[Scatter plot: the same six points, with five grouped into one cluster and the sixth left as a singleton]

ML for this partition: ∞!!!

More generally, for a V-dimensional problem you need at least V + 1 points in each partition. But this constraint would prevent you from finding intuitive solutions to your problem!
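The degeneracy is easy to reproduce numerically: a singleton cluster's ML mean is the point itself, and as its fitted covariance shrinks, the density at that point grows without bound. A sketch using SciPy (the point and the variance schedule are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

# A singleton "cluster": one 2-D point. Its ML mean is the point itself,
# so the log-likelihood at the mean diverges as the covariance shrinks.
x = np.array([4.0, 7.0])
for s in [1.0, 1e-2, 1e-4, 1e-8]:
    logp = multivariate_normal.logpdf(x, mean=x, cov=s * np.eye(2))
    print(f"variance {s:g}: log-density {logp:.2f}")
```

The log-density increases without bound as the variance shrinks, so the likelihood of the ⟨5, 1⟩ partition is unbounded.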
Bayesian Mixture of Gaussians

[Graphical model: as before, now with a prior node α added as parent of the per-category parameters µ and Σ]

  i ∼ Multinom(φ)
  y ∼ N(µi, Σi)

◮ The Bayesian framework allows us to build in explicit assumptions about what constitutes a "sensible" category size
◮ Returning to our graphical model, we put in a prior on category size/shape
Bayesian Mixture of Gaussians

[Graphical model: as on the previous slide]

  i ∼ Multinom(φ)
  y ∼ N(µi, Σi)

◮ For now we will just leave category prior probabilities uniform: φ1 = φ2 = φ3 = φ4 = 1/4
◮ Here is a conjugate prior distribution for multivariate Gaussians:

  Σi ∼ IW(Σ0, ν)
  µi | Σi ∼ N(µ0, Σi/A)
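This prior is easy to sample from with SciPy. A minimal sketch with invented hyperparameter values (Σ0 = I, ν = 5, µ0 = 0, A = 2; not the lecture's settings):

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

# Hypothetical hyperparameters (illustrative only)
Sigma0 = np.eye(2)   # Inverse Wishart scale matrix
nu = 5               # degrees of freedom (must exceed dimension - 1)
mu0 = np.zeros(2)    # prior mean of the category mean
A = 2.0              # scales how tightly mu_i concentrates around mu0

# Sigma_i ~ IW(Sigma0, nu)
Sigma_i = invwishart.rvs(df=nu, scale=Sigma0, random_state=0)
# mu_i | Sigma_i ~ N(mu0, Sigma_i / A)
mu_i = multivariate_normal.rvs(mean=mu0, cov=Sigma_i / A, random_state=1)

print(Sigma_i)
print(mu_i)
```

Every draw of Σi is symmetric positive definite, so each sampled category is a well-formed Gaussian.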
The Inverse Wishart distribution

◮ Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it
◮ Below I give samples for Σ = (1 0; 0 1), the 2 × 2 identity matrix
◮ Here, k = 2 (top row) or k = 5 (bottom row)

[Figure: ellipses depicting covariance matrices sampled from the Inverse Wishart]
Inference for Mixture of Gaussians using Gibbs Sampling

◮ We still have not given a solution to the search problem
◮ One broadly applicable solution is Gibbs sampling
◮ Simply put:
  1. Randomly initialize cluster assignments
  2. On each iteration through the data, for each point:
    2.1 "Forget" the cluster assignment of the current point xi
    2.2 Compute the probability distribution over xi's cluster assignment conditional on the rest of the partition:

      P(Ci | xi, Π−i) = ∫θ P(xi | Ci, θ) P(Ci | θ) P(θ) dθ  /  ∑j ∫θ P(xi | Ci = j, θ) P(Ci = j | θ) P(θ) dθ

    2.3 Randomly sample a cluster assignment for xi from P(Ci | xi, Π−i) and continue
  3. Do this for "many" iterations (e.g., until the unnormalized marginal data likelihood is high)
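The steps above can be sketched in runnable form, with heavy simplifications for brevity: one-dimensional data, two clusters, a known shared observation variance, uniform φ, and, unlike the collapsed conditional on the slide (which integrates θ out), explicit resampling of the cluster means from their conjugate Normal posteriors. All data and settings are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D data: two well-separated clusters (invented, not the vowel data)
y = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])

K = 2
sigma2 = 1.0             # known, shared observation variance (assumption)
mu0, tau2 = 0.0, 100.0   # broad Normal prior on each cluster mean

z = rng.integers(K, size=len(y))    # step 1: random initial assignments
mu = np.array([y.min(), y.max()])   # init means at the extremes (for speed)

for it in range(200):               # step 3: "many" iterations
    for n in range(len(y)):         # step 2: sweep through the data
        # steps 2.1-2.2: P(cluster | point) is proportional to the Gaussian
        # likelihood under each cluster's current mean (uniform phi assumed)
        logp = -0.5 * (y[n] - mu) ** 2 / sigma2
        p = np.exp(logp - logp.max())
        z[n] = rng.choice(K, p=p / p.sum())   # step 2.3: resample assignment
    # resample each cluster mean from its conjugate Normal posterior
    for k in range(K):
        yk = y[z == k]
        prec = 1 / tau2 + len(yk) / sigma2
        mean = (mu0 / tau2 + yk.sum() / sigma2) / prec
        mu[k] = rng.normal(mean, np.sqrt(1 / prec))

print(np.sort(mu).round(1))  # should recover means near -3 and 3
```

With well-separated clusters the sampler settles quickly on the intended grouping; the harder, overlapping cases discussed below are where Gibbs sampling needs many more iterations.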
Inference for Mixture of Gaussians using Gibbs Sampling

Starting point for our problem:
One pass of Gibbs sampling through the data
Results of Gibbs sampling with known category probabilities

Posterior modes of category structures:

[Three panels: F1 versus F2, F1 versus Duration, and F2 versus Duration, each showing the inferred category structure]
Results of Gibbs sampling with known category probabilities

Confusion table of assignments of observations to categories:

[Two confusion tables, unsupervised and supervised, each plotting true vowel (I, i, E, e) against inferred cluster (1-4)]
Extending the model to learning category probabilities

◮ The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α1, . . . , αk and defined as

  D(π1, . . . , πk) = (1/Z) π1^(α1−1) π2^(α2−1) · · · πk^(αk−1)

where the normalizing constant Z is

  Z = Γ(α1) Γ(α2) · · · Γ(αk) / Γ(α1 + α2 + · · · + αk)
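The normalizing constant can be checked numerically against SciPy's Dirichlet density. The particular α and π values below are arbitrary:

```python
import numpy as np
from math import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # arbitrary Dirichlet parameters
pi = np.array([0.2, 0.3, 0.5])      # an arbitrary point on the simplex

# Z = Gamma(a1) * ... * Gamma(ak) / Gamma(a1 + ... + ak)
Z = np.prod([gamma(a) for a in alpha]) / gamma(alpha.sum())
# D(pi) = (1/Z) * prod_i pi_i^(alpha_i - 1)
density = np.prod(pi ** (alpha - 1)) / Z

print(density, dirichlet.pdf(pi, alpha))  # the two agree
```

The hand-computed density and SciPy's `dirichlet.pdf` give the same value, confirming the form of Z above.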
Extending the model to learning category probabilities

◮ So we set:

  φ ∼ D(Σφ)

◮ Combine this with the rest of the model:

  Σi ∼ IW(Σ0, ν)
  µi | Σi ∼ N(µ0, Σi/A)
  i ∼ Multinom(φ)
  y ∼ N(µi, Σi)

[Graphical model: as before, with hyperparameter nodes Σθ, ΣΣb, and Σφ added as parents of θ, Σb, and φ respectively]
Having to learn category probabilities too makes the problem harder

[Three panels: F1 and F2, F1 and Duration, and F2 and Duration, showing the inferred category structure when category probabilities must also be learned]
Having to learn category probabilities too makes the problem harder

We can make the problem even more challenging by skewing the category probabilities:

  Category  Probability
  e         0.04
  E         0.05
  i         0.29
  I         0.62
Having to learn category probabilities too makes the problem harder

[Three panels: F1 and F2, F1 and Duration, and F2 and Duration, showing the inferred category structure under the skewed category probabilities]
Having to learn category probabilities too makes the problem harder

Confusion tables for these cases:

[Two confusion tables, with and without learning of category frequencies, each plotting true vowel (I, i, E, e) against inferred cluster (1-4)]
Summary

◮ We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression!
◮ However, category induction presents additional difficulties for category learning:
  ◮ Non-convexity of the objective function → difficulty of search
  ◮ Degeneracy of maximum likelihood
◮ In general you need far more data, and/or additional information sources, to converge on good solutions
◮ Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.
References I

Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208-2213. Cognitive Science Society, Austin, TX.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13-F21.

Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3):407-420.

References II

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273-13278.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49-63.