
EM algorithm

LING 572

Fei Xia

03/02/06

Outline

• The EM algorithm

• EM for PM models

• Three special cases
  – Inside-outside algorithm
  – Forward-backward algorithm
  – IBM models for MT

The EM algorithm

Basic setting in EM

• X is a set of data points: observed data
• Θ is a parameter vector.
• EM is a method to find θ_ML where

  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)

• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).

The basic EM strategy

• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)

• Given a fixed x, there could be many possible y's.
  – Ex: given a sentence x, there could be many state sequences in an HMM that generates x.

Examples of EM

              | HMM              | PCFG           | MT                        | Coin toss
X (observed)  | sentences        | sentences      | parallel data             | head-tail sequences
Y (hidden)    | state sequences  | parse trees    | word alignments           | coin id sequences
θ             | a_ij, b_ijk      | P(A → B C)     | t(f|e), d(a_j|j,l,m), …   | p1, p2, λ
Algorithm     | Forward-backward | Inside-outside | IBM Models                | N/A

The log-likelihood function

• L is a function of θ, while holding X constant:  L(X | θ) = L(θ) = P(X | θ)

  l(θ) = log L(θ) = log P(X | θ)
       = Σ_{i=1}^{n} log P(x_i | θ)
       = Σ_{i=1}^{n} log Σ_{y_i} P(x_i, y_i | θ)

The iterative approach for MLE

  θ_ML = argmax_θ L(θ)
       = argmax_θ l(θ)
       = argmax_θ Σ_{i=1}^{n} log Σ_{y_i} P(x_i, y_i | θ)

In many cases, we cannot find the solution directly.

An alternative is to find a sequence  θ^0, θ^1, …, θ^t, …
s.t.  l(θ^0) ≤ l(θ^1) ≤ … ≤ l(θ^t) ≤ …

Lower bound of l(θ) − l(θ^t):

  l(θ) − l(θ^t)
    = log P(X | θ) − log P(X | θ^t)
    = Σ_{i=1}^{n} log P(x_i | θ) − Σ_{i=1}^{n} log P(x_i | θ^t)
    = Σ_{i=1}^{n} log Σ_y P(x_i, y | θ) − Σ_{i=1}^{n} log P(x_i | θ^t)
    = Σ_{i=1}^{n} log [ Σ_y P(x_i, y | θ) / P(x_i | θ^t) ]
    = Σ_{i=1}^{n} log [ Σ_y P(y | x_i, θ^t) · P(x_i, y | θ) / ( P(y | x_i, θ^t) P(x_i | θ^t) ) ]
    = Σ_{i=1}^{n} log [ Σ_y P(y | x_i, θ^t) · P(x_i, y | θ) / P(x_i, y | θ^t) ]
    ≥ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) log [ P(x_i, y | θ) / P(x_i, y | θ^t) ]        (Jensen's inequality)
    = Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]

Jensen’s inequality

If f is convex, then  f(E[g(x)]) ≤ E[f(g(x))].

If f is concave, then  f(E[g(x)]) ≥ E[f(g(x))].

log is a concave function, so  E[log p(x)] ≤ log E[p(x)].
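As a quick numerical illustration of the concave case (not from the original slides; the uniform distribution below is an arbitrary choice), a few lines of Python:

```python
import numpy as np

# Numerical check of Jensen's inequality for the concave log function:
# E[log p(x)] <= log E[p(x)].  The distribution is arbitrary (illustration only).
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 1.0, size=100_000)   # positive values playing the role of p(x)

lhs = np.mean(np.log(p))       # E[log p(x)]
rhs = np.log(np.mean(p))       # log E[p(x)]
print(f"E[log p(x)] = {lhs:.4f} <= log E[p(x)] = {rhs:.4f}: {lhs <= rhs}")
```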

Maximizing the lower bound

  θ^{t+1} = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
          = argmax_θ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) log ( P(x_i, y | θ) / P(x_i, y | θ^t) )
          = argmax_θ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
          = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]

(The denominator P(x_i, y | θ^t) does not depend on θ, so it can be dropped from the argmax.)

The Q function

The Q-function

• Define the Q-function (a function of θ):

  – Y is a random vector.
  – X = (x_1, x_2, …, x_n) is a constant (vector).
  – θ^t is the current parameter estimate and is a constant (vector).
  – θ is the normal variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete-data log-likelihood, log P(X, Y | θ), with respect to Y given X and θ^t.

  Q(θ, θ^t) = E[ log P(X, Y | θ) | X, θ^t ]
            = E_{P(Y | X, θ^t)} [ log P(X, Y | θ) ]
            = Σ_Y P(Y | X, θ^t) log P(X, Y | θ)
            = Σ_{i=1}^{n} Σ_{y_i} P(y_i | x_i, θ^t) log P(x_i, y_i | θ)

The inner loop of the EM algorithm

• E-step: calculate

  Q(θ, θ^t) = Σ_{i=1}^{n} Σ_{y_i} P(y_i | x_i, θ^t) log P(x_i, y_i | θ)

• M-step: find

  θ^{t+1} = argmax_θ Q(θ, θ^t)

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence  θ^0, θ^1, …, θ^t, …

• It can be proved that  l(θ^0) ≤ l(θ^1) ≤ … ≤ l(θ^t) ≤ …

The inner loop of the Generalized EM algorithm (GEM)

• E-step: calculate

  Q(θ, θ^t) = Σ_{i=1}^{n} Σ_{y_i} P(y_i | x_i, θ^t) log P(x_i, y_i | θ)

• M-step: find θ^{t+1} such that

  Q(θ^{t+1}, θ^t) ≥ Q(θ^t, θ^t)

  (θ^{t+1} need not be the argmax.)

Recap of the EM algorithm

Idea #1: find θ that maximizes the likelihood of training data

  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)

Idea #2: find the θ^t sequence

No analytical solution → iterative approach: find  θ^0, θ^1, …, θ^t, …
s.t.  l(θ^0) ≤ l(θ^1) ≤ … ≤ l(θ^t) ≤ …

Idea #3: find θ^{t+1} that maximizes a tight lower bound of l(θ) − l(θ^t):

  l(θ) − l(θ^t) ≥ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]      (a tight lower bound)

Idea #4: find θ^{t+1} that maximizes the Q function

  θ^{t+1} = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]      ← lower bound of l(θ) − l(θ^t)
          = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]                           ← the Q function

The EM algorithm

• Start with initial estimate θ^0

• Repeat until convergence
  – E-step: calculate

    Q(θ, θ^t) = Σ_{i=1}^{n} Σ_{y_i} P(y_i | x_i, θ^t) log P(x_i, y_i | θ)

  – M-step: find

    θ^{t+1} = argmax_θ Q(θ, θ^t)
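To make the E-step/M-step loop concrete, here is a minimal sketch of EM for the coin-toss setting from the "Examples of EM" table: λ = P(coin 1), p1 and p2 are the two coins' head probabilities, and the hidden y is which coin produced each sequence. Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def em_two_coins(sequences, n_iter=50, seed=0):
    """Minimal EM sketch for the two-coin model from the 'Examples of EM' table.

    Each observed x_i is a head-tail string; the hidden y_i is which coin produced it.
    Parameters: lam = P(coin 1), p1, p2 = head probabilities (names are illustrative)."""
    heads = np.array([s.count('H') for s in sequences], dtype=float)
    tails = np.array([s.count('T') for s in sequences], dtype=float)

    rng = np.random.default_rng(seed)
    lam, p1, p2 = 0.5, rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)

    for _ in range(n_iter):
        # E-step: posterior P(y_i = coin 1 | x_i, theta^t) for every sequence.
        like1 = lam * p1 ** heads * (1 - p1) ** tails
        like2 = (1 - lam) * p2 ** heads * (1 - p2) ** tails
        w = like1 / (like1 + like2)          # expected "count" of coin 1 per sequence

        # M-step: re-estimate each multinomial from expected counts (normalize).
        lam = w.mean()
        p1 = (w * heads).sum() / (w * (heads + tails)).sum()
        p2 = ((1 - w) * heads).sum() / ((1 - w) * (heads + tails)).sum()

    return lam, p1, p2

# Toy usage: sequences from two coins with very different head probabilities.
data = ['HHHHT', 'HHHHH', 'TTTTH', 'HTTTT', 'HHHTH', 'TTTTT']
print(em_two_coins(data))
```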

Important classes of EM problem

• Products of multinomial (PM) models

• Exponential families

• Gaussian mixture

• …

The EM algorithm for PM models

PM models

  P(x, y | θ) = Π_i p_{1i}^{c_{1i}(x, y, θ)} · Π_i p_{2i}^{c_{2i}(x, y, θ)} · … · Π_i p_{mi}^{c_{mi}(x, y, θ)}

where Θ = (Θ_1, …, Θ_m) is a partition of all the parameters, and for any j:  Σ_i p_{ji} = 1

HMM is a PM

  P(x, y | θ) = Π_{i,j} P(s_j | s_i)^{c(s_i → s_j, x, y)} · Π_{i,j,k} P(w_k | s_i, s_j)^{c(s_i, s_j → w_k, x, y)}

where  a_ij = P(s_j | s_i)  and  b_ijk = P(w_k | s_i, s_j)
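To illustrate the product-of-multinomials form (a sketch, not from the slides; the dictionary-based a and b lookups are assumptions), the joint probability P(x, y | θ) of an arc-emission HMM can be computed by raising each parameter to its count:

```python
from collections import Counter

def hmm_joint_prob(states, words, a, b):
    """Illustrative sketch: P(x, y | theta) for the arc-emission HMM above, written in
    product-of-multinomials form (each parameter raised to its count).

    states: hidden y = (X_1, ..., X_{T+1});  words: observed x = (o_1, ..., o_T);
    a[(si, sj)] = P(sj | si);  b[(si, sj, wk)] = P(wk | si, sj).  Names are assumptions."""
    assert len(states) == len(words) + 1
    trans_counts = Counter(zip(states[:-1], states[1:]))         # c(si -> sj, x, y)
    emit_counts = Counter(zip(states[:-1], states[1:], words))   # c(si, sj -> wk, x, y)

    prob = 1.0
    for (si, sj), c in trans_counts.items():
        prob *= a[(si, sj)] ** c
    for (si, sj, wk), c in emit_counts.items():
        prob *= b[(si, sj, wk)] ** c
    return prob
```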

PCFG

• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

  P(x, y | θ) = Π_{i=1}^{n} P(A_i → α_i | A_i)

  Ex:  P(x, y | θ) = P(S → NP VP | S) · P(NP → Jim | NP) · P(VP → sleeps | VP) · …

PCFG is a PM

  P(x, y | θ) = Π_{A→α} P(α | A)^{c(A→α, x, y)}

where for each A:  Σ_α P(α | A) = 1

Q-function for PM

  Q(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
            = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) log ( Π_j p_j^{C(j, x_i, y)} )
            = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) log p_j
            = ( Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_{j ∈ Θ_1} C(j, x_i, y) log p_j )   ← Q_1(θ, θ^t)
              + … +
              ( Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_{j ∈ Θ_m} C(j, x_i, y) log p_j )

The Q function decomposes into one term per multinomial Θ_k, so each multinomial can be maximized separately.

Maximizing the Q function

Maximize

  Q_1(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) log p_j

subject to the constraint

  Σ_j p_j = 1

Use Lagrange multipliers:

  Q̂(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) log p_j + λ ( Σ_j p_j − 1 )

Optimal solution

  Q̂(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) log p_j + λ ( Σ_j p_j − 1 )

  ∂Q̂(θ, θ^t) / ∂p_j = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) C(j, x_i, y) / p_j + λ = 0

  ⇒  p_j = [ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) C(j, x_i, y) ]  /  [ Σ_{j'} Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) C(j', x_i, y) ]

The numerator is the expected count of parameter j; the denominator (= −λ) is the normalization factor that makes the p_j sum to one.

PM Models

θ_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.

Count(x, y, r) is the number of times that θ_r is seen in the expression for P(x, y | θ).

The EM algorithm for PM Models

• Calculate expected counts

• Update parameters

PCFG example

• Calculate expected counts

• Update parameters

The EM algorithm for PM models

// for each iteration

// for each training example xi

// for each possible y

// for each parameter

// for each parameter
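The body of the pseudocode on this slide did not survive extraction; only the loop comments above remain. Below is a minimal Python sketch of that loop under the stated structure (E-step: accumulate expected counts over all (x_i, y); M-step: renormalize each multinomial). All function and argument names are illustrative assumptions.

```python
from collections import defaultdict

def em_for_pm(data, params, possible_ys, joint_prob, param_counts, n_iter=10):
    """Minimal sketch of the EM loop for PM models, following the loop comments above.

    params:        dict {(group, param): prob}, one multinomial per group (sums to 1 per group)
    possible_ys:   function x -> iterable of candidate hidden y's (assumed enumerable)
    joint_prob:    function (x, y, params) -> P(x, y | theta)
    param_counts:  function (x, y) -> dict {(group, param): C(j, x, y)}"""
    for _ in range(n_iter):                                   # for each iteration
        expected = defaultdict(float)
        for x in data:                                        # for each training example x_i
            ys = list(possible_ys(x))
            joint = [joint_prob(x, y, params) for y in ys]
            z = sum(joint)                                    # P(x_i | theta^t)
            for y, p_xy in zip(ys, joint):                    # for each possible y
                posterior = p_xy / z                          # P(y | x_i, theta^t)
                for j, c in param_counts(x, y).items():       # for each parameter
                    expected[j] += posterior * c              # accumulate expected counts

        # M-step: normalize expected counts within each multinomial (group).
        totals = defaultdict(float)
        for (group, _), cnt in expected.items():              # for each parameter
            totals[group] += cnt
        params = {(g, p): expected.get((g, p), 0.0) / totals[g] if totals[g] > 0 else params[(g, p)]
                  for (g, p) in params}
    return params
```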

Inside-outside algorithm

Inner loop of the Inside-outside algorithm

Given an input sequence w_1m and θ^t:

1. Calculate inside probability:
   • Base case:       β_j(k, k) = P(N^j → w_k)
   • Recursive case:  β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q)

2. Calculate outside probability:
   • Base case:       α_j(1, m) = 1 if j = 1, 0 otherwise
   • Recursive case:  α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) P(N^f → N^j N^g) β_g(q+1, e)
                                + Σ_{f,g} Σ_{e=1}^{p-1} α_f(e, q) P(N^f → N^g N^j) β_g(e, p−1)
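A minimal sketch of the inside pass for a PCFG in Chomsky normal form (illustrative only; the rule-table layout and names are assumptions, and no pruning or log-space tricks are used):

```python
from collections import defaultdict

def inside_probs(words, unary_rules, binary_rules):
    """Minimal sketch of the inside pass for a CNF PCFG (names not from the slides).

    unary_rules:  dict {(N, w): P(N -> w)}
    binary_rules: dict {(N, R, S): P(N -> R S)}
    Returns beta with beta[(N, p, q)] = inside probability of N spanning words p..q (1-based)."""
    m = len(words)
    beta = defaultdict(float)

    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k in range(1, m + 1):
        for (N, w), prob in unary_rules.items():
            if w == words[k - 1]:
                beta[(N, k, k)] += prob

    # Recursive case: beta_j(p, q) = sum_{r,s} sum_{d=p}^{q-1} P(N^j -> N^r N^s) beta_r(p, d) beta_s(d+1, q)
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (N, R, S), prob in binary_rules.items():
                for d in range(p, q):
                    beta[(N, p, q)] += prob * beta[(R, p, d)] * beta[(S, d + 1, q)]
    return beta

# Toy usage: "Jim sleeps" with the rules from the PCFG example above.
unary = {('NP', 'Jim'): 1.0, ('VP', 'sleeps'): 1.0}
binary = {('S', 'NP', 'VP'): 1.0}
print(inside_probs(['Jim', 'sleeps'], unary, binary)[('S', 1, 2)])   # inside prob of S over the whole sentence
```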

Inside-outside algorithm (cont)

  P(N^j → N^r N^s used | w_1m, θ)
    = [ Σ_{p=1}^{m} Σ_{q=1}^{m} α_j(p, q) Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q) ] / P(w_1m | θ)

  P(N^j → w^k used | w_1m, θ)
    = [ Σ_{h=1}^{m} α_j(h, h) β_j(h, h) δ(w_h, w^k) ] / P(w_1m | θ)

3. Collect the counts

4. Normalize and update the parameters

  P(N^j → N^r N^s) = Cnt(N^j → N^r N^s) / Σ_{r,s} Cnt(N^j → N^r N^s)
                   = P(N^j → N^r N^s used | w_1m, θ) / Σ_{r,s} P(N^j → N^r N^s used | w_1m, θ)

  P(N^j → w^k) = Cnt(N^j → w^k) / Σ_k Cnt(N^j → w^k)
               = P(N^j → w^k used | w_1m, θ) / Σ_k P(N^j → w^k used | w_1m, θ)

Expected counts for PCFG rules

  count(N^j → N^r N^s)
    = Σ_Y P(Y | X, θ^t) · count(X, Y, N^j → N^r N^s)
    = Σ_{Tr} P(Tr | w_1m, θ^t) · count(w_1m, Tr, N^j → N^r N^s)
    = P(N^j → N^r N^s used | w_1m, θ^t)      (the quantity computed above, which already sums over all spans p, q)

This is the formula if we have only one sentence. Add an outer sum over the sentences if X contains multiple sentences.

Expected counts (cont)

  count(N^j → w^k)
    = Σ_Y P(Y | X, θ^t) · count(X, Y, N^j → w^k)
    = Σ_{Tr} P(Tr | w_1m, θ^t) · count(w_1m, Tr, N^j → w^k)
    = P(N^j → w^k used | w_1m, θ^t)

Relation to EM

• PCFG is a PM Model

• Inside-outside algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence w_1m.

• Y (hidden data): parse tree Tr.

• Θ (parameters):

  P(N^j → N^r N^s)  and  P(N^j → w^k)

Forward-backward algorithm

The inner loop for forward-backward algorithm

Given an input sequence O_1T and an HMM μ = (S, K, Π, A, B):

1. Calculate forward probability:
   • Base case:       α_i(1) = π_i
   • Recursive case:  α_j(t+1) = Σ_i α_i(t) a_ij b_{ij o_t}

2. Calculate backward probability:
   • Base case:       β_i(T+1) = 1
   • Recursive case:  β_i(t) = Σ_j a_ij b_{ij o_t} β_j(t+1)

3. Calculate expected counts:

   γ_t(i, j) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1}^{N} α_m(t) β_m(t)

4. Update the parameters:

   a_ij = Σ_{t=1}^{T} γ_t(i, j) / Σ_{t=1}^{T} Σ_{j=1}^{N} γ_t(i, j)

   b_ijk = Σ_{t: o_t = w_k} γ_t(i, j) / Σ_{t=1}^{T} γ_t(i, j)
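A minimal numpy sketch of one inner-loop pass for the arc-emission HMM above (array shapes and names are assumptions; a real implementation would add scaling or log-space arithmetic and handle multiple observation sequences):

```python
import numpy as np

def forward_backward_step(obs, pi, A, B):
    """One inner-loop pass of forward-backward for the arc-emission HMM above (sketch).

    obs: observation indices o_1..o_T;  pi: (N,);  A: (N, N) with A[i, j] = a_ij;
    B: (N, N, K) with B[i, j, k] = b_ijk = P(w_k | s_i, s_j).  Returns re-estimated (A, B)."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    beta = np.zeros((T + 1, N))

    # Forward:  alpha_i(1) = pi_i;  alpha_j(t+1) = sum_i alpha_i(t) a_ij b_{ij o_t}
    alpha[0] = pi
    for t in range(T):
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])

    # Backward: beta_i(T+1) = 1;  beta_i(t) = sum_j a_ij b_{ij o_t} beta_j(t+1)
    beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]

    # Expected counts: gamma_t(i,j) = alpha_i(t) a_ij b_{ij o_t} beta_j(t+1) / sum_m alpha_m(t) beta_m(t)
    gamma = np.zeros((T, N, N))
    for t in range(T):
        gamma[t] = alpha[t][:, None] * A * B[:, :, obs[t]] * beta[t + 1][None, :]
        gamma[t] /= (alpha[t] * beta[t]).sum()

    # Update parameters as on the slide.
    A_new = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[2]):
        B_new[:, :, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new
```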

Expected counts

  count(s_i → s_j)
    = Σ_Y P(Y | X, θ^t) · count(X, Y, s_i → s_j)
    = Σ_{X_1T} P(X_1T | O_1T, θ^t) · count(O_1T, X_1T, s_i → s_j)
    = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_1T, θ^t)
    = Σ_{t=1}^{T} γ_t(i, j)

Expected counts (cont)

  count(s_i → s_j, w_k)
    = Σ_Y P(Y | X, θ^t) · count(X, Y, s_i → s_j, w_k)
    = Σ_{X_1T} P(X_1T | O_1T, θ^t) · count(O_1T, X_1T, s_i → s_j, w_k)
    = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_1T, θ^t) · δ(o_t, w_k)
    = Σ_{t: o_t = w_k} γ_t(i, j)

Relation to EM

• HMM is a PM Model

• Forward-backward algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is an observation sequence O_1T.

• Y (hidden data): state sequence X_1T.

• Θ (parameters): a_ij, b_ijk, π_i.

IBM models for MT

Expected counts for (f, e) pairs

• Let Ct(f, e) be the fractional count of (f, e) pair in the training data.

  Ct(f, e) = Σ_{(E,F)} Σ_a ( P(a | E, F) · Σ_{j=1}^{|F|} δ(f, f_j) δ(e, e_{a_j}) )

Here P(a | E, F) is the alignment probability, and the inner sum is the actual count of times e and f are linked in (E, F) by alignment a.

  t(f | e) = Ct(f, e) / Σ_{x ∈ V_F} Ct(x, e)
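The slides do not give code; as an illustration, here is a minimal EM sketch for IBM Model 1, where only t(f|e) is estimated and the alignment prior is uniform, so the expected counts Ct(f, e) above can be accumulated without enumerating alignments. The NULL token and variable names are assumptions.

```python
from collections import defaultdict

def ibm_model1(bitext, n_iter=10):
    """Minimal EM sketch for IBM Model 1 (t(f|e) only; not the full IBM pipeline).

    bitext: list of (E, F) pairs, each a list of tokens; a NULL token is added to E.
    Under Model 1 the posterior over alignments factorizes, so Ct(f, e) can be
    accumulated word pair by word pair."""
    f_vocab = {f for _, F in bitext for f in F}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialize t(f|e) uniformly

    for _ in range(n_iter):
        count = defaultdict(float)   # expected counts Ct(f, e)
        total = defaultdict(float)   # sum_x Ct(x, e)
        for E, F in bitext:
            E = ['NULL'] + E
            for f in F:
                z = sum(t[(f, e)] for e in E)      # normalizer for the alignment posterior
                for e in E:
                    c = t[(f, e)] / z              # expected count of linking f to e
                    count[(f, e)] += c
                    total[e] += c
        # M-step: t(f|e) = Ct(f, e) / sum_x Ct(x, e)
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return dict(t)

# Toy usage (hypothetical two-sentence corpus).
bitext = [(['the', 'house'], ['la', 'maison']), (['the', 'book'], ['le', 'livre'])]
print(ibm_model1(bitext)[('maison', 'house')])
```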

Relation to EM

• IBM models are PM Models.

• The EM algorithm used in IBM models is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence pair (F, E).

• Y (hidden data): word alignment a.

• Θ (parameters): t(f|e), d(i | j, m, n), etc.

Summary

• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – Optimal solution in M-step exists for many classes of problems.

• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM Models for MT

Relations among the algorithms

• The generalized EM
  – The EM algorithm
    • PM models: Inside-Outside, Forward-backward, IBM models
    • Gaussian mixtures

Strengths of EM

• Numerical stability: each iteration of the EM algorithm never decreases the likelihood of the observed data.

• The EM handles parameter constraints gracefully.

Problems with EM

• Convergence can be very slow on some problems and is intimately related to the amount of missing information.

• It is guaranteed to improve the probability of the training corpus, which is different from reducing the errors directly.

• It cannot guarantee to reach a global maximum (it could get stuck at local maxima, saddle points, etc.).
  – The initial estimate is therefore important.

Additional slides

Lower bound lemma

If  g(x) ≥ f(x) for all x,  g(x_0) = f(x_0),  and  x̂ = argmax_x f(x),

then  g(x̂) ≥ g(x_0).

Proof:  g(x̂) ≥ f(x̂) ≥ f(x_0) = g(x_0).

L(θ) is non-decreasing

Let

  g(θ) = l(θ) − l(θ^t)

  f(θ) = Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]

  θ^{t+1} = argmax_θ f(θ)

We have shown that  l(θ) − l(θ^t) ≥ f(θ),  i.e.  g(θ) ≥ f(θ),  and  g(θ^t) = f(θ^t) = 0.

Then  g(θ^{t+1}) ≥ g(θ^t) = 0  (by the lower bound lemma),  hence  l(θ^{t+1}) ≥ l(θ^t).