Expectation Maximization - Bishop PRML Ch. 9

K-Means, Gaussian Mixture Models, Expectation-Maximization
Alireza Ghane (Ghane/Mori)

Transcript of "Expectation Maximization - Bishop PRML Ch. 9"

Page 1: Expectation Maximization - Bishop PRML Ch. 9

Expectation Maximization
Bishop PRML Ch. 9
Alireza Ghane

Page 2: Learning Parameters of Probability Distributions

• We discussed probabilistic models at length
• In assignments 3 and 4 you saw that, given fully observed training data, setting the parameters θ_i of a probability distribution is straightforward
• However, in many settings not all variables are observed (labelled) in the training data: x_i = (x_i, h_i)
  • e.g. speech recognition: we have speech signals, but not phoneme labels
  • e.g. object recognition: we have object labels (car, bicycle), but not part labels (wheel, door, seat)
• Unobserved variables are called latent variables
[Figures: pages from Fergus et al.'s constellation-model paper (shape, appearance, relative-scale and occlusion terms; salient-region features from the Kadir-Brady detector), shown as an example of a model in which the part labels are latent. Figs from Fergus et al.]

Page 3: Outline

K-Means
Gaussian Mixture Models
Expectation-Maximization


Page 5: Unsupervised Learning

[Figure (a): scatter plot of an example 2-D dataset]

• We will start with an unsupervised learning (clustering) problem:
• Given a dataset {x_1, . . . , x_N}, each x_i ∈ R^D, partition the dataset into K clusters
• Intuitively, a cluster is a group of points which are close together and far from the others

Page 6: Distortion Measure

[Figures (a), (i): the example dataset, unlabelled and after clustering]

• Formally, introduce prototypes (or cluster centers) µ_k ∈ R^D
• Use binary r_nk: 1 if point n is in cluster k, 0 otherwise (the 1-of-K coding scheme again)
• Find {µ_k}, {r_nk} to minimize the distortion measure (a small sketch of computing J follows below):

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• e.g. for two clusters k = 1, 2:

    J = \sum_{x_n \in C_1} \| x_n - \mu_1 \|^2 + \sum_{x_n \in C_2} \| x_n - \mu_2 \|^2
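As a concrete illustration (added to this transcript, not from the slides), here is a minimal NumPy sketch of the distortion measure J for a given assignment; the names X, mu, r and the toy data are assumptions made for the example.

```python
import numpy as np

def distortion(X, mu, r):
    """Distortion J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X  : (N, D) data points
    mu : (K, D) cluster prototypes
    r  : (N, K) binary (1-of-K) membership matrix
    """
    # Squared distances between every point and every prototype: shape (N, K)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dist).sum())

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
mu = np.array([[0.0, 0.1], [3.1, 3.0]])
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(distortion(X, mu, r))  # small value: points are near their assigned centres
```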

Page 7: Minimizing the Distortion Measure

• Minimizing J directly is hard:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• However, two things are easy:
  • If we know the µ_k, minimizing J wrt the r_nk
  • If we know the r_nk, minimizing J wrt the µ_k
• This suggests an iterative procedure:
  • Start with an initial guess for the µ_k
  • Iterate two steps: minimize J wrt the r_nk, then minimize J wrt the µ_k
  • Rinse and repeat until convergence


Page 10: Determining Membership Variables

[Figures (a), (b): the dataset and the first assignment of points to the cluster centres]

• Step 1 in an iteration of K-means is to minimize the distortion measure J wrt the cluster membership variables r_nk

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Terms for different data points x_n are independent; for each data point, set r_nk to minimize

    \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Simply set r_nk = 1 for the cluster centre µ_k with the smallest distance


Page 13: Determining Cluster Centers

[Figures (b), (c): assignments held fixed while the cluster centres are re-estimated]

• Step 2: fix the r_nk, minimize J wrt the cluster centers µ_k

    J = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \| x_n - \mu_k \|^2    (switch the order of the sums)

• So we can minimize wrt each µ_k separately
• Take the derivative and set it to zero:

    2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0
    \Leftrightarrow \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

  i.e. the mean of the data points x_n assigned to cluster k


Page 15: K-means Algorithm

• Start with an initial guess for the µ_k
• Iterate two steps:
  • Minimize J wrt the r_nk: assign points to the nearest cluster center
  • Minimize J wrt the µ_k: set each cluster center to the average of the points in its cluster
• Rinse and repeat until convergence (a compact sketch of the full loop follows below)
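To make the two alternating steps concrete, here is a minimal, self-contained K-means sketch (an illustration added to this transcript; the function name kmeans, the random initialization and the toy data are choices made here, not part of the slides).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment (r_nk) and centre (mu_k) updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initial guess for the centres
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 1: minimize J wrt r_nk -- assign each point to its nearest centre
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = sq_dist.argmin(axis=1)
        # Step 2: minimize J wrt mu_k -- each centre becomes the mean of its points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # centres (and hence assignments) no longer change
            break
        mu = new_mu
    return mu, labels

# Usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)), rng.normal(2, 0.5, size=(50, 2))])
mu, labels = kmeans(X, K=2)
print(mu)
```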

Pages 16-24: K-means example

[Figure panels (a)-(i): successive iterations of K-means on the example dataset, alternating the assignment step and the centre-update step]

Next step doesn't change membership – stop

Page 25: K-means Convergence

• Repeat the steps until there is no change in cluster assignments
• At each step, the value of J either goes down or we stop
• There are finitely many possible assignments of data points to clusters, so we are guaranteed to converge eventually
• Note that we may converge to a local minimum of J rather than the global minimum

Page 26: K-means Example - Image Segmentation

[Figures: segmentations for several values of K, alongside the original image]

• K-means clustering on pixel colour values
• Pixels in a cluster are coloured by the cluster mean
• Represent each pixel (e.g. a 24-bit colour value) by a cluster number (e.g. 4 bits for K = 10): a compressed version
• This technique is known as vector quantization (a small colour-quantization sketch follows below)
  • Represent a vector (in this case an RGB value in R^3) as a single discrete value
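A minimal colour-quantization sketch of this idea, added here for illustration. It uses scikit-learn's KMeans for brevity, and a random array as a stand-in image so the snippet stays self-contained; in practice you would load a real image.

```python
import numpy as np
from sklearn.cluster import KMeans

# A stand-in "image": H x W x 3 array of RGB values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

pixels = image.reshape(-1, 3)                  # one row per pixel, columns = R, G, B
K = 10
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Vector quantization: store only the cluster index per pixel (10 codes, so 4 bits each)
codes = km.labels_                             # compressed representation
codebook = km.cluster_centers_                 # K representative colours

# Reconstruction: colour every pixel by its cluster mean
quantized = codebook[codes].reshape(image.shape)
print(codes.shape, codebook.shape, quantized.shape)
```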

Page 27: Outline

K-Means
Gaussian Mixture Models
Expectation-Maximization

Page 28: Hard Assignment vs. Soft Assignment

[Figure (i): the converged K-means clustering of the example dataset]

• In the K-means algorithm, a hard assignment of points to clusters is made
• However, for points near the decision boundary, this may not be such a good idea
• Instead, we could think about making a soft assignment of points to clusters

Page 29: Gaussian Mixture Model

[Figures (a), (b): samples from three Gaussians, shown with and without their component labels]

• The Gaussian mixture model (or mixture of Gaussians, MoG) models the data as a combination of Gaussians
• Above is a dataset generated by drawing samples from three different Gaussians

Page 30: Generative Model

[Figure: graphical model with latent z and observed x, next to panel (a) of the sample data]

• The mixture of Gaussians is a generative model
• To generate a data point x_n, we first generate a value for a discrete variable z_n ∈ {1, . . . , K}
• We then generate a value x_n ~ N(x | µ_k, Σ_k) from the corresponding Gaussian
  (a short ancestral-sampling sketch follows below)
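A minimal ancestral-sampling sketch of this generative process, added for illustration; the mixture parameters below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a K = 3 mixture in 2-D
pi = np.array([0.5, 0.3, 0.2])                       # mixing coefficients, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # component means
Sigma = np.array([np.eye(2) * s for s in (0.2, 0.5, 0.1)])  # component covariances

def sample_mog(N):
    # First draw the latent component z_n, then x_n from that component's Gaussian
    z = rng.choice(len(pi), size=N, p=pi)
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_mog(500)
print(X.shape, np.bincount(Z) / len(Z))  # empirical component frequencies ~ pi
```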

Page 31: Graphical Model

[Figure: plate-notation graphical model with latent z_n, observed x_n and parameters π, µ, Σ; panel (a) shows the sample data]

• Full graphical model using plate notation
• Note z_n is a latent variable, unobserved
• Need to give the conditional distributions p(z_n) and p(x_n | z_n)
• The one-of-K representation is helpful here: z_nk ∈ {0, 1}, z_n = (z_n1, . . . , z_nK)

Page 32: Graphical Model - Latent Component Variable

[Figure: the same plate-notation graphical model and sample data]

• Use a categorical (1-of-K) distribution for p(z_n)
  • i.e. p(z_nk = 1) = π_k
• The parameters of this distribution are {π_k}
• Must have 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} π_k = 1
• p(z_n) = \prod_{k=1}^{K} π_k^{z_nk}

Page 33: Graphical Model - Observed Variable

[Figure: the same plate-notation graphical model and sample data]

• Use a Gaussian distribution for p(x_n | z_n)
• The parameters of this distribution are {µ_k, Σ_k}

    p(x_n | z_nk = 1) = N(x_n | µ_k, Σ_k)

    p(x_n | z_n) = \prod_{k=1}^{K} N(x_n | µ_k, Σ_k)^{z_nk}

Page 34: Graphical Model - Joint distribution

[Figure: the same plate-notation graphical model and sample data]

• The full joint distribution is given by:

    p(x, z) = \prod_{n=1}^{N} p(z_n) p(x_n | z_n)
            = \prod_{n=1}^{N} \prod_{k=1}^{K} π_k^{z_nk} N(x_n | µ_k, Σ_k)^{z_nk}

Page 35: MoG Marginal over Observed Variables

• The marginal distribution p(x_n) for this model is:

    p(x_n) = \sum_{z_n} p(x_n, z_n) = \sum_{z_n} p(z_n) p(x_n | z_n)
           = \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k)

• A mixture of Gaussians (a numerical sketch of evaluating this density follows below)
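As an added illustration (not from the slides), the mixture density and the resulting data log-likelihood can be evaluated directly; SciPy's multivariate_normal is used for the component densities, and the toy parameters are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(X, pi, mu, Sigma):
    """p(x_n) = sum_k pi_k N(x_n | mu_k, Sigma_k), evaluated for each row of X."""
    return sum(pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(len(pi)))

pi = np.array([0.6, 0.4])
mu = np.array([[-2.0, 0.0], [2.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
X = np.array([[-2.1, 0.1], [1.9, -0.2], [0.0, 0.0]])

p = mog_density(X, pi, mu, Sigma)
print(p)                # mixture density at each data point
print(np.log(p).sum())  # the data log-likelihood used later for ML fitting
```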

Page 36: MoG Conditional over Latent Variable

[Figures (b), (c): the data coloured by true component and by responsibility]

• The conditional p(z_nk = 1 | x_n) will play an important role for learning
• It is denoted by γ(z_nk) and can be computed as:

    γ(z_nk) ≡ p(z_nk = 1 | x_n) = \frac{p(z_nk = 1) p(x_n | z_nk = 1)}{\sum_{j=1}^{K} p(z_nj = 1) p(x_n | z_nj = 1)}
            = \frac{π_k N(x_n | µ_k, Σ_k)}{\sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)}

• γ(z_nk) is the responsibility of component k for data point n (a sketch of computing responsibilities follows below)
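A minimal sketch of computing the responsibilities γ(z_nk), added here for illustration; the normalization is done in log space to avoid underflow, and the toy parameters are the same made-up ones as above.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pi, mu, Sigma):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pi)
    log_num = np.stack(
        [np.log(pi[k]) + multivariate_normal(mu[k], Sigma[k]).logpdf(X) for k in range(K)],
        axis=1)                                   # (N, K) unnormalized log responsibilities
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

pi = np.array([0.6, 0.4])
mu = np.array([[-2.0, 0.0], [2.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
X = np.array([[-2.1, 0.1], [1.9, -0.2], [0.0, 0.0]])
gamma = responsibilities(X, pi, mu, Sigma)
print(gamma.round(3))  # each row sums to 1
```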


Page 39: MoG Learning

• Given a set of observations {x_1, . . . , x_N}, without the latent variables z_n, how can we learn the parameters?
  • The model parameters are θ = {π_k, µ_k, Σ_k}
• The answer will be similar to K-means:
  • If we know the latent variables z_n, fitting the Gaussians is easy
  • If we know the Gaussians µ_k, Σ_k, finding the latent variables is easy
• Rather than latent variables, we will use the responsibilities γ(z_nk)


Page 42: MoG Maximum Likelihood Learning

• Given a set of observations {x_1, . . . , x_N}, without the latent variables z_n, how can we learn the parameters?
  • The model parameters are θ = {π_k, µ_k, Σ_k}
• We can use the maximum likelihood criterion:

    θ_ML = \arg\max_θ \prod_{n=1}^{N} \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k)
         = \arg\max_θ \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) \right\}

• Unfortunately, a closed-form solution is not possible this time: we have a log of a sum rather than a log of a product


Page 44: MoG Maximum Likelihood Learning - Problem

• Maximum likelihood criterion, 1-D:

    θ_ML = \arg\max_θ \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} π_k \frac{1}{\sqrt{2π}\, σ_k} \exp\{ -(x_n - µ_k)^2 / (2σ_k^2) \} \right\}

• Suppose we set µ_k = x_n for some k and n; then we have one term in the sum:

    π_k \frac{1}{\sqrt{2π}\, σ_k} \exp\{ -(x_n - µ_k)^2 / (2σ_k^2) \} = π_k \frac{1}{\sqrt{2π}\, σ_k} \exp\{ -(0)^2 / (2σ_k^2) \}

• In the limit as σ_k → 0, this goes to ∞
• So the ML "solution" is to set some µ_k = x_n and σ_k = 0! (a small numerical check follows below)
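A small numerical check of this singularity, added as an illustration with made-up data: fix one component's mean at a data point and watch the log-likelihood blow up as that component's standard deviation shrinks.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

x = np.array([-1.5, -0.3, 0.4, 1.2, 2.0])   # arbitrary 1-D data
pi = np.array([0.5, 0.5])
mu = np.array([x[0], 1.0])                   # first mean collapsed onto a data point

def log_lik(sigma0):
    sigmas = np.array([sigma0, 1.0])
    # Log-likelihood of a 2-component 1-D mixture
    log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigmas)
    return logsumexp(log_terms, axis=1).sum()

for s in [1.0, 0.1, 0.01, 0.001]:
    print(s, round(log_lik(s), 2))           # grows without bound as s -> 0
```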


Page 47: ML for Gaussian Mixtures

• Keeping this problem in mind, we will develop an algorithm for ML estimation of the parameters of a MoG model
  • Search for a local optimum
• Consider the log-likelihood function

    ℓ(θ) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) \right\}

• We can try taking derivatives and setting them to zero, even though no closed-form solution exists

Page 48: Maximizing Log-Likelihood - Means

    ℓ(θ) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) \right\}

    \frac{\partial}{\partial µ_k} ℓ(θ) = \sum_{n=1}^{N} \frac{π_k N(x_n | µ_k, Σ_k)}{\sum_j π_j N(x_n | µ_j, Σ_j)} Σ_k^{-1} (x_n - µ_k)
                                       = \sum_{n=1}^{N} γ(z_nk) Σ_k^{-1} (x_n - µ_k)

• Setting the derivative to 0 and multiplying by Σ_k:

    \sum_{n=1}^{N} γ(z_nk) µ_k = \sum_{n=1}^{N} γ(z_nk) x_n
    \Leftrightarrow µ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n,  where N_k = \sum_{n=1}^{N} γ(z_nk)


Page 50: Maximizing Log-Likelihood - Means and Covariances

• Note that the mean µ_k is a weighted combination of the points x_n, using the responsibilities γ(z_nk) for cluster k:

    µ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n

• N_k = \sum_{n=1}^{N} γ(z_nk) is the effective number of points in the cluster
• A similar result comes from taking derivatives wrt the covariance matrices Σ_k:

    Σ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) (x_n - µ_k)(x_n - µ_k)^T


Page 52: Maximizing Log-Likelihood - Mixing Coefficients

• We can also maximize wrt the mixing coefficients π_k
• Note there is a constraint that \sum_k π_k = 1
  • Use Lagrange multipliers (a short derivation follows below)
• We end up with:

    π_k = \frac{N_k}{N}

  the average responsibility that component k takes
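For completeness, here is the short Lagrange-multiplier calculation behind π_k = N_k/N (a standard derivation written out here, not taken from the slides):

```latex
% Maximize ell(theta) + lambda (sum_k pi_k - 1) with respect to pi_k:
\frac{\partial}{\partial \pi_k}\left[\sum_{n=1}^{N}\ln\Big\{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)\Big\}
      + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)\right]
  = \sum_{n=1}^{N}\frac{\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j \pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)} + \lambda = 0.

% Multiply by pi_k and use the definition of the responsibilities:
\sum_{n=1}^{N}\gamma(z_{nk}) + \lambda\,\pi_k = 0
  \quad\Longrightarrow\quad N_k = -\lambda\,\pi_k.

% Summing over k and using sum_k pi_k = 1 and sum_k N_k = N gives lambda = -N, hence
\pi_k = \frac{N_k}{N}.
```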

Page 53: Three Parameter Types and Three Equations

• These three equations a solution does not make:

    µ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n

    Σ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) (x_n - µ_k)(x_n - µ_k)^T

    π_k = \frac{N_k}{N}

• All depend on γ(z_nk), which in turn depends on all three!
• But an iterative scheme can be used

Page 54: EM for Gaussian Mixtures

• Initialize the parameters, then iterate:
  • E step: calculate the responsibilities using the current parameters

        γ(z_nk) = \frac{π_k N(x_n | µ_k, Σ_k)}{\sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)}

  • M step: re-estimate the parameters using these γ(z_nk)

        µ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n

        Σ_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) (x_n - µ_k)(x_n - µ_k)^T

        π_k = \frac{N_k}{N}

• This algorithm is known as the expectation-maximization algorithm (EM)
• Next we describe its general form, why it works, and why it's called EM (but first an example; a compact code sketch of these updates also follows below)
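A compact, self-contained sketch of these E and M updates, added as an illustration; the function name em_gmm, the initialization strategy and the small covariance regularizer are choices made here, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a mixture of Gaussians; returns (pi, mu, Sigma, gamma)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Simple initialization: random data points as means, shared covariance, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k], computed in log space
        log_r = np.stack([np.log(pi[k]) + multivariate_normal(mu[k], Sigma[k]).logpdf(X)
                          for k in range(K)], axis=1)
        gamma = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                        # effective number of points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma

# Usage on synthetic two-component data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.6, size=(150, 2)), rng.normal(2, 0.6, size=(150, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(pi.round(2), mu.round(2))
```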


Pages 57-62: MoG EM - Example

[Figure panels (a)-(f): EM on the same dataset as the K-means example]

• (a) Same initialization as with K-means before; often, K-means is actually used to initialize EM
• (b) Calculate the responsibilities γ(z_nk)
• (c) Calculate the model parameters {π_k, µ_k, Σ_k} using these responsibilities
• (d) Iteration 2
• (e) Iteration 5
• (f) Iteration 20 - converged

Page 63: Outline

K-Means
Gaussian Mixture Models
Expectation-Maximization

Page 64: General Version of EM

• In general, we are interested in maximizing the likelihood

    p(X | θ) = \sum_Z p(X, Z | θ)

  where X denotes all observed variables and Z denotes all latent (hidden, unobserved) variables
• Assume that maximizing p(X | θ) is difficult (e.g. for the mixture of Gaussians)
• But maximizing p(X, Z | θ) is tractable (everything observed)
  • p(X, Z | θ) is referred to as the complete-data likelihood function, which we don't have


Page 66: A Lower Bound

• The strategy for optimization will be to introduce a lower bound on the likelihood
  • This lower bound will be based on the complete-data likelihood, which is easy to optimize
• Iteratively increase this lower bound
  • Make sure we're increasing the likelihood while doing so

Page 67: A Decomposition Trick

• To obtain the lower bound, we use a decomposition:

    ln p(X, Z | θ) = ln p(X | θ) + ln p(Z | X, θ)    (product rule)

    ln p(X | θ) = L(q, θ) + KL(q || p)

    L(q, θ) ≡ \sum_Z q(Z) \ln \left\{ \frac{p(X, Z | θ)}{q(Z)} \right\}

    KL(q || p) ≡ - \sum_Z q(Z) \ln \left\{ \frac{p(Z | X, θ)}{q(Z)} \right\}

• KL(q || p) is known as the Kullback-Leibler divergence (KL-divergence), and is ≥ 0 (see p. 55 of PRML and the next slide)
• Hence ln p(X | θ) ≥ L(q, θ)
  (a short verification of the decomposition follows below)
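The decomposition can be verified in a couple of lines (a standard derivation, written out here for completeness rather than taken from the slides):

```latex
L(q,\theta) + \mathrm{KL}(q\|p)
  = \sum_{Z} q(Z)\ln\frac{p(X,Z\mid\theta)}{q(Z)}
    - \sum_{Z} q(Z)\ln\frac{p(Z\mid X,\theta)}{q(Z)}
  = \sum_{Z} q(Z)\ln\frac{p(X,Z\mid\theta)}{p(Z\mid X,\theta)} \\
  = \sum_{Z} q(Z)\ln p(X\mid\theta)   % product rule: p(X,Z|\theta) = p(Z|X,\theta)\,p(X|\theta)
  = \ln p(X\mid\theta)\sum_{Z} q(Z)
  = \ln p(X\mid\theta).
```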

Page 68: Kullback-Leibler Divergence

• KL(p(x) || q(x)) is a measure of the difference between distributions p(x) and q(x):

    KL(p(x) || q(x)) = - \sum_x p(x) \log \frac{q(x)}{p(x)}

• Motivation: the average additional amount of information required to encode x using a code designed for the distribution q(x), when x actually comes from p(x)
• Note it is not symmetric: KL(q(x) || p(x)) ≠ KL(p(x) || q(x)) in general
• It is non-negative:
  • Jensen's inequality (ln is concave): - \ln \left( \sum_x x\, p(x) \right) ≤ - \sum_x p(x) \ln x
  • Applied to KL:

        KL(p || q) = - \sum_x p(x) \log \frac{q(x)}{p(x)} ≥ - \ln \left( \sum_x \frac{q(x)}{p(x)} p(x) \right) = - \ln \sum_x q(x) = 0

  (a small numerical check of these properties follows below)
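A tiny numerical check of these properties (asymmetry and non-negativity) for discrete distributions, added as an illustration with made-up distributions p and q.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = -sum_x p(x) log(q(x)/p(x)), for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return float(-(p[mask] * np.log(q[mask] / p[mask])).sum())

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))              # non-negative and, in general, asymmetric
print(kl(p, p))                        # 0 when the distributions match
```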


Page 72: Increasing the Lower Bound - E step

• EM is an iterative optimization technique which tries to maximize this lower bound: ln p(X | θ) ≥ L(q, θ)
• E step: fix θ_old, maximize L(q, θ_old) wrt q
  • i.e. choose the distribution q to maximize L
  • Reordering the bound:

        L(q, θ_old) = ln p(X | θ_old) - KL(q || p)

  • ln p(X | θ_old) does not depend on q
  • The maximum is obtained when KL(q || p) is as small as possible
  • This occurs when q = p, i.e. q(Z) = p(Z | X, θ)
  • This is the posterior over Z; recall these are the responsibilities from the MoG model


Page 76: Increasing the Lower Bound - M step

• M step: fix q, maximize L(q, θ) wrt θ
• The maximization problem is over

    L(q, θ) = \sum_Z q(Z) \ln p(X, Z | θ) - \sum_Z q(Z) \ln q(Z)
            = \sum_Z p(Z | X, θ_old) \ln p(X, Z | θ) - \sum_Z p(Z | X, θ_old) \ln p(Z | X, θ_old)

• The second term is constant with respect to θ
• The first term is the ln of the complete-data likelihood, which is assumed easy to optimize
  • It is the expected complete-data log-likelihood – what we think the complete-data likelihood will be


Page 80: Why does EM work?

• In the M-step we changed from θ_old to θ_new
• This increased the lower bound L, unless we were already at a maximum (in which case we would have stopped)
• This must have caused the log-likelihood to increase
• The E-step set q to make the KL-divergence 0:

    ln p(X | θ_old) = L(q, θ_old) + KL(q || p) = L(q, θ_old)

• Since the lower bound L increased when we moved from θ_old to θ_new:

    ln p(X | θ_old) = L(q, θ_old) < L(q, θ_new) = ln p(X | θ_new) - KL(q || p_new) ≤ ln p(X | θ_new)

  (the last step uses KL(q || p_new) ≥ 0)
• So the log-likelihood has increased going from θ_old to θ_new


Page 84: Expectation Maximization - Bishop PRML Ch. 9vda.univie.ac.at/Teaching/ML/16s/LectureNotes/09_mixture... · 2016-05-13 · Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane

K-Means Gaussian Mixture Models Expectation-Maximization

Bounding Example

Figure 1: EM example: Mixture components and data. The data consists of three samples drawn from each mixture component, shown as circles and triangles. The means of the mixture components are −2 and 2, respectively.

Figure 2: The true likelihood function of the two component means θ1 and θ2, given the data in Figure 1.

Consider a 2-component 1-D MoG with known variances (example from F. Dellaert); an EM run on this setup is sketched below

c©Ghane/Mori 84
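Here is a minimal sketch (not from the slides) of EM on this example: a two-component 1-D mixture with known unit variances and equal weights, estimating only the means, and printing the expected complete log-likelihood Q at each iteration. The data are random draws standing in for the actual example, so the Q values will not match the figures exactly; em_known_variance and the starting means are mine.

```python
import numpy as np
from scipy.stats import norm

def em_known_variance(X, mus, sigma=1.0, pi=(0.5, 0.5), n_iters=5):
    """EM for a two-component 1-D mixture with known, equal variances.

    Only the means theta = (mu1, mu2) are estimated, as in the example above.
    """
    mus = np.array(mus, dtype=float)
    for i in range(n_iters):
        # E-step: responsibilities p(z_n = k | x_n, theta_old)
        comp = np.stack([p * norm.pdf(X, m, sigma) for p, m in zip(pi, mus)])
        resp = (comp / comp.sum(axis=0)).T                       # shape (N, 2)
        # Expected complete log-likelihood at the current parameters
        log_joint = np.stack([np.log(p) + norm.logpdf(X, m, sigma)
                              for p, m in zip(pi, mus)]).T
        Q = (resp * log_joint).sum()
        # M-step: responsibility-weighted means maximize Q
        mus = (resp * X[:, None]).sum(axis=0) / resp.sum(axis=0)
        print(f"i={i + 1}, Q={Q:.6f}, means={np.round(mus, 3)}")
    return mus

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 3), rng.normal(2, 1, 3)])  # 3 samples per component
em_known_variance(X, mus=(-0.5, 0.5))
```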


K-Means Gaussian Mixture Models Expectation-Maximization

Bounding Example

Figure 1 (repeated): Mixture components and data, three samples per component, component means −2 and 2.

Figure 2: The true likelihood function of the two component means θ1 and θ2, given the data in Figure 1.

• True likelihood function (evaluating it on a grid is sketched below)
• Recall we’re fitting means θ1, θ2

c©Ghane/Mori 85
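The surface in Figure 2 can be reproduced directly; this is a sketch under the same assumptions as above (equal weights, unit variances, random draws standing in for the actual data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 3), rng.normal(2, 1, 3)])   # stand-in data

# Evaluate ln p(X | theta1, theta2) on a grid -- the surface shown in Figure 2
grid = np.linspace(-3, 3, 61)
loglik = np.array([[np.log(0.5 * norm.pdf(X, t1, 1.0) +
                           0.5 * norm.pdf(X, t2, 1.0)).sum()
                    for t2 in grid] for t1 in grid])

i, j = np.unravel_index(loglik.argmax(), loglik.shape)
print(grid[i], grid[j])   # maxima roughly near (-2, 2) and, by symmetry, (2, -2)
```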


K-Means Gaussian Mixture Models Expectation-Maximization

Bounding Example

Figures 1 and 2 (repeated): the mixture data and the true likelihood function of the component means θ1, θ2.

Figure 3: Lower Bounds. Panels show the lower bound at EM iterations i = 1, ..., 5, with expected complete log-likelihood values Q = −3.279564, −2.788156, −1.501116, −0.817491, −0.762661.

• Lower bound the likelihood function using averaging distribution q(Z)
• ln p(X|θ) = L(q, θ) + KL(q(Z)||p(Z|X, θ))
• Since q(Z) = p(Z|X, θold), bound is tight (equal to actual likelihood) at θ = θold (a numerical check is sketched below)

c©Ghane/Mori 86
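This tightness can be checked numerically. The sketch below (same assumptions as before; helper names are mine) freezes q at the responsibilities computed at θold and compares the bound L(q, θ) with the true log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 3), rng.normal(2, 1, 3)])
theta_old = (-0.5, 0.5)

comp = np.stack([0.5 * norm.pdf(X, m, 1.0) for m in theta_old])
q = (comp / comp.sum(axis=0)).T                  # E-step responsibilities at theta_old
entropy = -(q * np.log(q)).sum()

def bound(t1, t2):
    # L(q, theta) with q held fixed at the theta_old responsibilities
    log_joint = np.stack([np.log(0.5) + norm.logpdf(X, m, 1.0) for m in (t1, t2)]).T
    return (q * log_joint).sum() + entropy

def loglik(t1, t2):
    return np.log(0.5 * norm.pdf(X, t1, 1.0) + 0.5 * norm.pdf(X, t2, 1.0)).sum()

print(bound(*theta_old), loglik(*theta_old))     # equal: the bound is tight at theta_old
print(bound(-2.0, 2.0) <= loglik(-2.0, 2.0))     # True: a lower bound everywhere else
```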


K-Means Gaussian Mixture Models Expectation-Maximization

EM - Summary

• EM finds a local maximum of the likelihood

  p(X|θ) = ∑_Z p(X, Z|θ)

• Iterates two steps (a generic skeleton is sketched below):
  • E step “fills in” the missing variables Z (calculates their distribution)
  • M step maximizes the expected complete log likelihood (expectation wrt the E-step distribution)
• This works because these two steps are performing a coordinate-wise hill-climbing on a lower bound on the likelihood p(X|θ)

c©Ghane/Mori 90
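A generic skeleton of the loop just described might look as follows; this is a sketch only, and e_step, m_step, and log_lik are placeholders for the model-specific pieces rather than functions from the lecture:

```python
def em(X, theta, e_step, m_step, log_lik, tol=1e-6, max_iters=100):
    """Coordinate-wise hill-climbing on the lower bound L(q, theta)."""
    prev = float("-inf")
    for _ in range(max_iters):
        q = e_step(X, theta)        # "fill in" Z: q(Z) = p(Z | X, theta)
        theta = m_step(X, q)        # maximize the expected complete log-likelihood
        cur = log_lik(X, theta)     # monitor ln p(X | theta)
        assert cur >= prev - 1e-9   # EM never decreases the log-likelihood
        if cur - prev < tol:        # stop near a local maximum
            break
        prev = cur
    return theta
```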


K-Means Gaussian Mixture Models Expectation-Maximization

Conclusion

• Readings: Ch. 9.1, 9.2, 9.4
• K-means clustering
• Gaussian mixture model
• What about K?
  • Model selection: either cross-validation (sketched below) or a Bayesian version (average over all values of K)
• Expectation-maximization, a general method for learning parameters of models when not all variables are observed

c©Ghane/Mori 91
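For the cross-validation option, a minimal sketch (assuming scikit-learn is available; the synthetic data and split sizes are mine) scores each candidate K by held-out log-likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (100, 1)), rng.normal(2, 1, (100, 1))])

for k in range(1, 6):
    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X[train_idx])
        scores.append(gmm.score(X[val_idx]))    # mean held-out log-likelihood per sample
    print(k, np.mean(scores))                   # pick the K with the best held-out score
```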