Machine Learning - CMPcmp.felk.cvut.cz/cmp/courses/recognition/resources/_EM/...Expectation...

Machine Learning: Lecture 11

Thomas Hofmann

June 2, 2005

June 2, 2005 © Thomas Hofmann

Statistical Estimation

• General setting:

– random variable X = a variable representing a random event

– sample space $\mathcal{X}$ = space of possible outcomes

– realization x = observed or hypothetical outcome

– set of probability distributions $\mathcal{P}$ over $\mathcal{X}$, parameterized by some parameter vector θ: $p(\cdot;\theta) \in \mathcal{P}$.

– E.g. $p(\cdot;\theta) \ge 0$ is a probability density function with $\int_{\mathcal{X}} p(x;\theta)\,dx = 1$.

• Statistical estimation: given an observation or a set of

observations, infer an optimal parameter θ

Maximum Likelihood Estimation

• Use likelihood as criterion to rate different hypotheses (θ).

• More convenient to use the so-called log-likelihood function

L(θ;x) = log p(x; θ)

• This means, a parameter θ is preferred over some θ̄, if the

observed data is more likely under θ than θ̄.

• Maximum Likelihood Estimation

$$\hat{\theta} = \arg\max_{\theta} L(\theta; x) = \arg\max_{\theta} \log p(x; \theta)$$

• i.i.d. sample $x = (x_1, \dots, x_n)$: $\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i; \theta)$

MLE: Gaussian Case

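• Example: Gaussian distribution, $\mathcal{X} = \mathbb{R}$, $\theta = (\mu, \sigma)'$, probability density

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

• Maximum likelihood estimates

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$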
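The closed-form Gaussian estimates (the sample mean and the 1/n sample variance) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the sample values are illustrative assumptions, not part of the lecture.

```python
import math

def gaussian_mle(xs):
    """Return (mu_hat, sigma2_hat) for a 1-D i.i.d. sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # Note the 1/n normalization (MLE), not the unbiased 1/(n-1).
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def gaussian_log_likelihood(xs, mu, sigma2):
    """Sum of log N(x_i; mu, sigma2) over the sample."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

# Hypothetical sample chosen so the estimates come out as round numbers.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = gaussian_mle(sample)

# The MLE should score at least as well as nearby parameter values.
ll_at_mle = gaussian_log_likelihood(sample, mu_hat, sigma2_hat)
ll_shifted = gaussian_log_likelihood(sample, mu_hat + 0.5, sigma2_hat)
ll_scaled = gaussian_log_likelihood(sample, mu_hat, 1.5 * sigma2_hat)
```

Comparing `ll_at_mle` with the two perturbed values is a quick sanity check that the estimates sit at a maximum of the log-likelihood.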

MLE: Multivariate Normal Distribution

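• Multivariate normal

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right]$$

• MLE

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})'$$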
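The multivariate estimates are the sample mean vector and the average of outer products of the centered points. As a rough sketch in plain Python (the helper name `mvn_mle` and the four 2-D points are assumptions made up for illustration):

```python
def mvn_mle(points):
    """Return (mu_hat, cov_hat) for a list of d-dimensional points."""
    n = len(points)
    d = len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    # cov[a][b] = (1/n) * sum_i (x_ia - mu_a) * (x_ib - mu_b)
    cov = [
        [sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / n
         for b in range(d)]
        for a in range(d)
    ]
    return mu, cov

# Hypothetical data set with d = 2.
points = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0), (7.0, 8.0)]
mu_hat, cov_hat = mvn_mle(points)
```

In practice a numerical library would usually do this in one call; the explicit loops are kept here only to mirror the formula term by term.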


Mixture Models (1)

• Statistical classification: assume that the observed patterns

x1, . . . ,xn belong to a certain number of K classes c1, . . . , cK .

• Assume further that we do not observe these classes, but rather a

mixture of patterns from different classes.

• For each class we assume that patterns are distributed according to a class-conditional distribution pk(x; θk) parameterized by θk, i.e. pk(x; θk) = p(x | C = ck; θk). Denote θ = (θ1, . . . , θK)′.

• These assumptions lead to a mixture model

$$p(x; \pi, \theta) = \sum_{k=1}^{K} \pi_k\, p_k(x; \theta_k)$$

where πk is the prior probability of class ck (mixing proportions).

Mixture Models (2)

• Notice that $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$.

• A simple example of a density consisting of a mixture of three Gaussians

[Figure: mixture density plotted for x from −5 to 5]
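A mixture density of this kind is just a weighted sum of component densities. The sketch below evaluates a three-component univariate Gaussian mixture; the weights, means, and standard deviations are hypothetical values chosen for illustration and are not the ones used in the lecture's figure.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_k pi_k * p_k(x)."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# Hypothetical parameters (mixing proportions must sum to 1).
weights = [0.3, 0.5, 0.2]
mus = [-2.0, 0.0, 2.5]
sigmas = [0.5, 1.0, 0.7]

# Crude Riemann-sum check that the mixture still integrates to about 1.
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
area = sum(mixture_pdf(x, weights, mus, sigmas) for x in grid) * step
```

Because the weights are non-negative and sum to one, the mixture inherits normalization from its components, which is what the `area` check illustrates.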

Why Mixture Models?

• Mixture models are more powerful than the component models

used for the class-conditional distribution

• Mixture models can capture multimodality and offer a systematic

way to define complex statistical models based on simpler ones.

• Mixture models can also be utilized to “unmix” the data, i.e. to

assign patterns to the unobserved classes (data clustering)

• Bayes rule: posterior probabilities

$$P(c_k \mid x; \pi, \theta) = \frac{\pi_k\, p_k(x; \theta_k)}{\sum_{l=1}^{K} \pi_l\, p_l(x; \theta_l)}$$
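The posterior class probabilities can be computed by normalizing the weighted component densities. Here is a small sketch for univariate Gaussian components; the parameter values and the function name `posteriors` are assumptions for illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(x, weights, mus, sigmas):
    """Bayes rule: P(c_k | x) is proportional to pi_k * p_k(x)."""
    joint = [w * normal_pdf(x, m, s)
             for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)  # this is the mixture density p(x)
    return [j / total for j in joint]

# Hypothetical three-component mixture, queried at the first component's mean.
post = posteriors(-2.0, [0.3, 0.5, 0.2], [-2.0, 0.0, 2.5], [0.5, 1.0, 0.7])
```

Taking the index of the largest entry of `post` gives a hard cluster assignment, which is one way to "unmix" the data.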

MLE in Mixtures: Complete Data Log-Likelihood

• Key question: how to fit the parameters π, θ of a mixture model

• Expectation Maximization (EM) algorithm

• Introduce unobserved cluster membership variables $z_{ik} \in \{0, 1\}$

– $z_{ik} = 1$ denotes the fact that data point $x_i$ has been generated from the k-th component or class

– $\sum_{k=1}^{K} z_{ik} = 1$ for all $i = 1, \dots, n$

• If membership variables were observed, then one could define the so-called complete data log-likelihood,

$$L^c(\pi, \theta; x, z) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left[\log p_k(x_i; \theta_k) + \log \pi_k\right]$$

MLE in Mixtures: Observed Data Log-Likelihood

• Since class membership variables z are not observed, we only

have access to the observed data log-likelihood

$$L(\pi, \theta; x) = \sum_{i=1}^{n} \log p(x_i; \pi, \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, p_k(x_i; \theta_k)$$

• Problem: direct maximization is difficult (logarithm of a sum

effectively introduces complicated couplings)
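Although the observed data log-likelihood is hard to maximize directly, it is easy to evaluate. The sketch below does so for univariate Gaussian components, using the log-sum-exp trick for numerical stability; that trick is an implementation choice added here, not something the lecture prescribes, and all names and values are illustrative assumptions.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the univariate Gaussian density."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_sum_exp(values):
    """Stable log(sum(exp(v))) obtained by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def observed_log_likelihood(xs, weights, mus, sigmas):
    """sum_i log sum_k pi_k * p_k(x_i)."""
    total = 0.0
    for x in xs:
        terms = [math.log(w) + log_normal_pdf(x, m, s)
                 for w, m, s in zip(weights, mus, sigmas)]
        total += log_sum_exp(terms)
    return total

# Hypothetical two-component mixture and a tiny data set.
xs = [-1.0, 0.2, 3.0]
ll = observed_log_likelihood(xs, [0.4, 0.6], [0.0, 2.0], [1.0, 1.5])

# A point far from every component would underflow in a naive
# implementation; the log-sum-exp form stays finite.
ll_far = observed_log_likelihood([100.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```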

Statistical Models with Unobserved Variables

• Imagine we had some estimate of what the unobserved variables could be:

Qik = Pr(zik = 1) = probability that xi belongs to cluster ck

• Try to maximize the expected complete data log-likelihood

$$E_Q\left[L^c(\pi, \theta; x, z)\right] = \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik}\left[\log p_k(x_i; \theta_k) + \log \pi_k\right]$$

• Q is called a variational distribution (we don’t know yet how to choose it appropriately)

Expected Complete Data Log-Likelihood

• Consider the following line of argument

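$$L(\pi, \theta; x) = \sum_{i=1}^{n} \log p(x_i; \pi, \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, p_k(x_i; \theta_k)$$

$$= \sum_{i=1}^{n} \log \sum_{k=1}^{K} Q_{ik}\, \frac{\pi_k\, p_k(x_i; \theta_k)}{Q_{ik}}$$

$$\ge \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik} \log \frac{\pi_k\, p_k(x_i; \theta_k)}{Q_{ik}}$$

$$= L(\pi, \theta, Q; x)$$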

• Inequality follows from the concavity of the logarithm, or more

specifically from Jensen’s inequality.

Jensen’s Inequality

• Jensen’s inequality: for a convex function f and any probability

mass function p

$$E[f(x)] = \sum_{x} p(x) f(x) \ge f\left(\sum_{x} p(x)\, x\right) = f(E[x])$$

• Proof uses a simple inductive argument over the state space size.
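Jensen's inequality is easy to check numerically on a small discrete distribution. The sketch below uses a made-up probability mass function, with f(x) = x² as a convex example. It also evaluates the logarithm, which is concave, so the inequality flips direction; that flipped form is the one used to lower-bound the log-likelihood.

```python
import math

def expectation(pmf, values, f=lambda v: v):
    """E[f(x)] for a discrete distribution given as parallel lists."""
    return sum(p * f(v) for p, v in zip(pmf, values))

# Hypothetical distribution over three outcomes.
pmf = [0.2, 0.5, 0.3]
values = [1.0, 2.0, 4.0]

# Convex case: E[f(x)] >= f(E[x]) with f(x) = x^2.
convex_lhs = expectation(pmf, values, lambda v: v * v)
convex_rhs = expectation(pmf, values) ** 2

# Concave case: E[log x] <= log E[x].
concave_lhs = expectation(pmf, values, math.log)
concave_rhs = math.log(expectation(pmf, values))
```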

Variational Lower Bound

• No matter what Q is, we will get a lower bound on the

log-likelihood function.

• Instead of maximizing L directly, we can hence try to maximize the (simpler) lower bound L(π, θ,Q;x) w.r.t. the parameters π and θ.

Expectation Maximization Algorithm (1)

• Each choice of Q defines a different lower bound L(π, θ,Q;x)

• Key idea: optimize the lower bound also w.r.t. Q, to get the tightest lower bound for a given estimate of π and θ.

• Alternating scheme that maximizes L(π, θ,Q;x) in every step.

– E-step: $Q^{(t+1)} = \arg\max_{Q} L(\pi^{(t)}, \theta^{(t)}, Q; x)$

– M-step: $(\pi^{(t+1)}, \theta^{(t+1)}) = \arg\max_{\pi, \theta} L(\pi, \theta, Q^{(t+1)}; x)$

• M-step optimizes a lower bound instead of the true likelihood

function

• E-step adjusts the bound

• What does that have to do with the function we referred to as

expected complete data log-likelihood above?

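$$L(\pi, \theta, Q; x) = \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik} \log \frac{\pi_k\, p_k(x_i; \theta_k)}{Q_{ik}}$$

$$= \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik} \log \pi_k\, p_k(x_i; \theta_k) - \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik} \log Q_{ik}$$

$$= E_Q\left[L^c(\pi, \theta; x, z)\right] - \sum_{i=1}^{n}\sum_{k=1}^{K} Q_{ik} \log Q_{ik}$$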

• Second term: entropy of Q (does not depend on π or θ)

• Maximizing L(π, θ,Q;x) with respect to π and θ is the same as maximizing the expected complete data log-likelihood.

• How about the E-step?

• It is easy to find a general answer to how Q should be chosen.

• Posterior probability $Q^*_{ik} \equiv \Pr(z_{ik} = 1 \mid x_i; \pi, \theta)$ maximizes L(π, θ,Q;x) for given π and θ.

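• Proof: insert this choice for Q∗ into L(π, θ,Q;x). By Bayes rule, $\pi_k\, p_k(x_i; \theta_k) / Q^*_{ik} = p(x_i; \pi, \theta)$, and $\sum_{k=1}^{K} Q^*_{ik} = 1$, so

$$L(\pi, \theta, Q^*; x) = \sum_{i=1}^{n}\sum_{k=1}^{K} Q^*_{ik} \log \frac{\pi_k\, p_k(x_i; \theta_k)}{Q^*_{ik}} = \sum_{i=1}^{n}\sum_{k=1}^{K} Q^*_{ik} \log p(x_i; \pi, \theta) = L(\pi, \theta; x)$$

• Since L(π, θ,Q;x) ≤ L(π, θ;x) for all Q, a choice of Q that attains equality is optimal.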
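The two claims here (any Q gives a lower bound, and the posterior makes the bound tight) can be checked numerically. The sketch below does this for a hypothetical two-component univariate Gaussian mixture; the data, the parameters, and the function names are assumptions chosen only for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def log_likelihood(xs, pis, mus, sigmas):
    """Observed data log-likelihood L(pi, theta; x)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, s)
                     for p, m, s in zip(pis, mus, sigmas)))
        for x in xs
    )

def lower_bound(xs, Q, pis, mus, sigmas):
    """L(pi, theta, Q; x) = sum_ik Q_ik * log(pi_k p_k(x_i) / Q_ik)."""
    total = 0.0
    for x, row in zip(xs, Q):
        for q, p, m, s in zip(row, pis, mus, sigmas):
            if q > 0.0:  # convention: 0 * log(0) = 0
                total += q * math.log(p * normal_pdf(x, m, s) / q)
    return total

def posterior(xs, pis, mus, sigmas):
    """Q*_ik = Pr(z_ik = 1 | x_i; pi, theta) via Bayes rule."""
    Q = []
    for x in xs:
        joint = [p * normal_pdf(x, m, s)
                 for p, m, s in zip(pis, mus, sigmas)]
        total = sum(joint)
        Q.append([j / total for j in joint])
    return Q

# Hypothetical data and parameters.
xs = [-1.0, 0.2, 3.0]
pis, mus, sigmas = [0.4, 0.6], [0.0, 2.0], [1.0, 1.5]

ll = log_likelihood(xs, pis, mus, sigmas)
uniform_Q = [[0.5, 0.5] for _ in xs]
gap_uniform = ll - lower_bound(xs, uniform_Q, pis, mus, sigmas)
gap_posterior = ll - lower_bound(xs, posterior(xs, pis, mus, sigmas),
                                 pis, mus, sigmas)
```

With these values `gap_uniform` is strictly positive, while `gap_posterior` vanishes up to floating-point error.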

Normal Mixture Model

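• In the case of a mixture of multivariate normal distributions:

• M-step: differentiate the expected complete data log-likelihood and set the derivatives to zero

• Mixing proportions

$$\hat{\pi}_k = \frac{1}{n}\sum_{i=1}^{n} Q_{ik}$$

• Normal model

$$\hat{\mu}_k = \frac{\sum_{i=1}^{n} Q_{ik}\, x_i}{\sum_{i=1}^{n} Q_{ik}}$$

$$\hat{\Sigma}_k = \frac{\sum_{i=1}^{n} Q_{ik}\,(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)'}{\sum_{i=1}^{n} Q_{ik}}$$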

Normal Mixture Model (2)

• E-Step

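$$Q_{ik} = \frac{\pi_k\, |\Sigma_k|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(x_i - \mu_k)'\Sigma_k^{-1}(x_i - \mu_k)\right]}{\sum_{l=1}^{K} \pi_l\, |\Sigma_l|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(x_i - \mu_l)'\Sigma_l^{-1}(x_i - \mu_l)\right]}$$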

EM for Normal Mixture Model

initialize µ̂k at random
initialize Σ̂k = σ²I, where σ² is the overall data variance
repeat
    for each data point xi do
        for each component k = 1, . . . , K do
            compute posterior probability Qik
        end for
    end for
    for each component k = 1, . . . , K do
        compute µ̂k, Σ̂k, π̂k
    end for
until convergence
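The loop above can be sketched in a few lines of Python. To stay within the standard library, this version is restricted to the univariate case (d = 1, so each Σ̂k reduces to a scalar variance). The uniform initialization of the mixing proportions, the fixed iteration count standing in for a convergence test, the variance floor `min_var`, and the synthetic data are all assumptions added for illustration; the lecture does not specify them.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Univariate Gaussian density, parameterized by its variance."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """Observed data log-likelihood of the mixture."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v)
                     for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_gmm_1d(xs, K, n_iter=50, seed=0, min_var=1e-6):
    rng = random.Random(seed)
    n = len(xs)
    mean = sum(xs) / n
    overall_var = sum((x - mean) ** 2 for x in xs) / n
    mus = rng.sample(xs, K)         # random initial means, drawn from the data
    variances = [overall_var] * K   # overall data variance for every component
    pis = [1.0 / K] * K             # assumption: uniform mixing proportions
    history = []
    for _ in range(n_iter):
        # E-step: posterior probabilities Q[i][k]
        Q = []
        for x in xs:
            joint = [p * normal_pdf(x, m, v)
                     for p, m, v in zip(pis, mus, variances)]
            total = sum(joint)
            Q.append([j / total for j in joint])
        # M-step: weighted re-estimation of pi_k, mu_k, sigma_k^2
        for k in range(K):
            nk = sum(row[k] for row in Q)
            pis[k] = nk / n
            mus[k] = sum(row[k] * x for row, x in zip(Q, xs)) / nk
            var_k = sum(row[k] * (x - mus[k]) ** 2 for row, x in zip(Q, xs)) / nk
            variances[k] = max(var_k, min_var)  # guard against collapse
        history.append(log_likelihood(xs, pis, mus, variances))
    return pis, mus, variances, history

# Synthetic data: two hypothetical clusters centered at -3 and +3.
data_rng = random.Random(1)
data = ([data_rng.gauss(-3.0, 1.0) for _ in range(100)]
        + [data_rng.gauss(3.0, 1.0) for _ in range(100)])
pis, mus, variances, history = em_gmm_1d(data, K=2)
```

The `history` list records the observed data log-likelihood after each iteration; it should never decrease, which is a useful check when debugging an EM implementation.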
