
EM algorithm

LING 572

Fei Xia

03/02/06

Outline

• The EM algorithm

• EM for PM models

• Three special cases
  – Inside-outside algorithm
  – Forward-backward algorithm
  – IBM models for MT

The EM algorithm

Basic setting in EM

• X is a set of data points: the observed data.

• Θ is a parameter vector.

• EM is a method to find θ_ML, where

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \log P(X \mid \theta)

• Calculating P(X | θ) directly is hard.

• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).

The basic EM strategy

• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)

• Given a fixed x, there could be many possible y's.
  – Ex: given a sentence x, there could be many state sequences in an HMM that generate x.

Examples of EM

|              | HMM              | PCFG         | MT                         | Coin toss            |
|--------------|------------------|--------------|----------------------------|----------------------|
| X (observed) | sentences        | sentences    | parallel data              | head-tail sequences  |
| Y (hidden)   | state sequences  | parse trees  | word alignments            | coin id sequences    |
| θ            | a_ij, b_ijk, π_i | P(A → BC)    | t(f\|e), d(a_j\|j, l, m), … | p1, p2, λ            |
| Algorithm    | forward-backward | inside-outside | IBM models               | N/A                  |
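The coin-toss column of this table can be run end to end. The sketch below uses hypothetical data and starting values: each sample is the number of heads in 10 tosses of one (hidden) coin, and EM alternates a posterior over the hidden coin id with re-estimation of (p1, p2, λ) from expected counts:

```python
def em_two_coins(samples, n_flips, p1, p2, lam, iters=100):
    """EM for the coin-toss example: each sample is the number of heads
    seen in n_flips tosses of one coin; which coin was tossed is hidden.
    Parameters theta = (p1, p2, lam), lam = P(coin 1 is chosen)."""
    for _ in range(iters):
        # E-step: posterior P(y = coin 1 | x_i, theta^t) for each sample
        post = []
        for h in samples:
            w1 = lam * p1 ** h * (1 - p1) ** (n_flips - h)
            w2 = (1 - lam) * p2 ** h * (1 - p2) ** (n_flips - h)
            post.append(w1 / (w1 + w2))
        # M-step: re-estimate parameters from normalized expected counts
        lam = sum(post) / len(post)
        p1 = sum(g * h for g, h in zip(post, samples)) / (n_flips * sum(post))
        p2 = sum((1 - g) * h for g, h in zip(post, samples)) / (
            n_flips * (len(post) - sum(post)))
    return p1, p2, lam

# Hypothetical data: head counts out of 10 flips, from two differently biased coins
data = [8, 9, 7, 8, 3, 2, 4, 3, 8, 2]
p1, p2, lam = em_two_coins(data, 10, p1=0.6, p2=0.5, lam=0.5)
```

On this data the estimates separate into one coin near 0.8 and one near 0.28, with λ near 0.5, regardless of the (asymmetric) starting point shown.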

The log-likelihood function

• L is a function of θ, holding X constant: L(\theta) = L(\theta \mid X) = P(X \mid \theta)

l(\theta) = \log L(\theta) = \log P(X \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta) = \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)

The iterative approach for MLE

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)

In many cases, we cannot find the solution directly.

An alternative is to find a sequence \theta^0, \theta^1, \dots, \theta^t, \dots s.t.

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

l(\theta) - l(\theta^t) = \log P(X \mid \theta) - \log P(X \mid \theta^t)

= \sum_{i=1}^{n} \log P(x_i \mid \theta) - \sum_{i=1}^{n} \log P(x_i \mid \theta^t)

= \sum_{i=1}^{n} \log \frac{\sum_{y} P(x_i, y \mid \theta)}{\sum_{y'} P(x_i, y' \mid \theta^t)}

= \sum_{i=1}^{n} \log \sum_{y} \frac{P(x_i, y \mid \theta^t)}{\sum_{y'} P(x_i, y' \mid \theta^t)} \cdot \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

= \sum_{i=1}^{n} \log \sum_{y} P(y \mid x_i, \theta^t) \cdot \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

\ge \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}   (by Jensen's inequality)

= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

Jensen’s inequality

If f is convex, then E[f(g(x))] \ge f(E[g(x)]).

If f is concave, then E[f(g(x))] \le f(E[g(x)]).

log is a concave function, so:

E[\log p(x)] \le \log E[p(x)]
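A quick numerical check of the concave case, with arbitrary positive values standing in for p(x):

```python
import math
import random

random.seed(0)
# arbitrary positive values in (0.2, 0.8) standing in for p(x)
p = [0.2 + 0.6 * random.random() for _ in range(10000)]

e_log = sum(math.log(v) for v in p) / len(p)   # E[log p(x)]
log_e = math.log(sum(p) / len(p))              # log E[p(x)]
# Jensen: log is concave, so E[log p(x)] <= log E[p(x)]
```

Equality would hold only if p(x) were constant; for any non-degenerate sample the gap is strictly positive.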

Maximizing the lower bound

\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

= \arg\max_\theta \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

= \arg\max_\theta \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

(the denominator P(x_i, y \mid \theta^t) does not depend on θ, so it can be dropped)

= \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} [\log P(x_i, y \mid \theta)]

The Q-function

• Define the Q-function (a function of θ):

Q(\theta, \theta^t) = E_{P(Y \mid X, \theta^t)}[\log P(X, Y \mid \theta) \mid X, \theta^t] = \sum_{Y} P(Y \mid X, \theta^t) \log P(X, Y \mid \theta) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

  – Y is a random vector.
  – X = (x1, x2, …, xn) is a constant (vector).
  – θ^t is the current parameter estimate and is a constant (vector).
  – θ is the variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

The inner loop of the EM algorithm

• E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

• M-step: find

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^t)

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence \theta^0, \theta^1, \dots, \theta^t, \dots

• It can be proved that

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

The inner loop of the Generalized EM algorithm (GEM)

• E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

• M-step: find a θ^{t+1} such that

Q(\theta^{t+1}, \theta^t) \ge Q(\theta^t, \theta^t)

(GEM only requires that the M-step improve the Q-function, not maximize it.)

Recap of the EM algorithm

Idea #1: find θ that maximizes the likelihood of training data

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \log P(X \mid \theta)

Idea #2: find the θt sequence

No analytical solution → iterative approach: find a sequence \theta^0, \theta^1, \dots, \theta^t, \dots s.t.

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

Idea #3: find θ^{t+1} that maximizes a tight lower bound of l(\theta) - l(\theta^t):

l(\theta) - l(\theta^t) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]   (a tight lower bound)

Idea #4: find θ^{t+1} that maximizes the Q-function:

\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]   (lower bound of l(\theta) - l(\theta^t))

= \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} [\log P(x_i, y \mid \theta)]   (the Q-function)

The EM algorithm

• Repeat until convergence
  – E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

  – M-step: find

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^t)

Important classes of EM problems

• Products of multinomial (PM) models

• Exponential families

• Gaussian mixture

• …

The EM algorithm for PM models

PM models

P(x, y \mid \Theta) = \prod_{i} p_i^{\,c_i(x, y)}

where \Theta = (\theta_1, \dots, \theta_m) is a partition of all the parameters into multinomial distributions, and for any j

\sum_{p_i \in \theta_j} p_i = 1

HMM is a PM

P(x, y \mid \theta) = \prod_{i,j} P(s_j \mid s_i)^{\,c(s_i \to s_j,\, x, y)} \cdot \prod_{i,j,k} P(w_k \mid s_i, s_j)^{\,c(s_i \to s_j, w_k,\, x, y)} = \prod_{i,j} a_{ij}^{\,c(s_i \to s_j,\, x, y)} \cdot \prod_{i,j,k} b_{ijk}^{\,c(s_i \to s_j, w_k,\, x, y)}

where a_{ij} = P(s_j \mid s_i) and b_{ijk} = P(w_k \mid s_i, s_j).

PCFG

• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \to \alpha_i \mid A_i)

Ex: P(x, y \mid \theta) = P(S \to NP\ VP \mid S) \cdot P(NP \to Jim \mid NP) \cdot P(VP \to sleeps \mid VP)

PCFG is a PM

P(x, y \mid \theta) = \prod_{A \to \alpha} P(A \to \alpha)^{\,c(A \to \alpha,\, x, y)}, \qquad \sum_{\alpha} P(A \to \alpha) = 1 \text{ for each } A

Q-function for PM

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \prod_{j} p_j^{\,C_j(x_i, y)}

= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \left( C_1(x_i, y) \log p_1 + \dots + C_k(x_i, y) \log p_k \right)

= \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_1(x_i, y) \right) \log p_1 + \dots + \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_k(x_i, y) \right) \log p_k

Maximizing the Q function

Maximize

Q(\theta, \theta^t) = \sum_{j} \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y) \right) \log p_j

subject to the constraint

\sum_{j} p_j = 1

Use Lagrange multipliers:

\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_j(x_i, y) \log p_j + \lambda \left( 1 - \sum_{j} p_j \right)

Optimal solution

\frac{\partial \hat{Q}}{\partial p_j} = \frac{1}{p_j} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y) - \lambda = 0

p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y)

Here 1/\lambda is the normalization factor and the sum is the expected count of parameter j.

PM Models

• \theta_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.

• Count(x, y, r) is the number of times \theta_r is seen in the expression for P(x, y \mid \theta):

P(x, y \mid \theta) = \prod_{r} \theta_r^{\,Count(x, y, r)}

The EM algorithm for PM Models

• Calculate expected counts

count(r) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, Count(x_i, y, r)

• Update parameters

\theta_r^{t+1} = \frac{count(r)}{\sum_{r' \in mult(r)} count(r')}

where mult(r) is the multinomial distribution that \theta_r belongs to.

PCFG example

• Calculate expected counts

• Update parameters

The EM algorithm for PM models

• For each iteration:
  – For each training example x_i:
    • For each possible y: compute P(y | x_i, θ^t)
      – For each parameter r: add P(y | x_i, θ^t) · Count(x_i, y, r) to the expected count of r
  – For each parameter: normalize the expected counts within its multinomial and update
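The loop above can be sketched generically, with the model supplied as callbacks; the parameter naming scheme and the coin-toss usage below are illustrative, not from the slides:

```python
from collections import defaultdict

def em_pm(data, candidates, counts, theta, group, iters=100):
    """Generic EM for a product-of-multinomials (PM) model (a sketch).
    P(x, y | theta) = prod_r theta[r] ** Count(x, y, r).

    candidates(x): the possible hidden y's for x
    counts(x, y):  dict {r: Count(x, y, r)}
    group[r]:      the multinomial each parameter belongs to
    """
    for _ in range(iters):
        expected = defaultdict(float)
        for x in data:                          # for each training example x_i
            joint = {}
            for y in candidates(x):             # for each possible y
                p = 1.0
                for r, c in counts(x, y).items():
                    p *= theta[r] ** c          # P(x, y | theta)
                joint[y] = p
            z = sum(joint.values())
            for y, p in joint.items():          # P(y | x, theta^t) = p / z
                for r, c in counts(x, y).items():
                    expected[r] += (p / z) * c  # accumulate expected counts
        totals = defaultdict(float)             # normalize within each multinomial
        for r, c in expected.items():
            totals[group[r]] += c
        for r in theta:
            if totals[group[r]] > 0:
                theta[r] = expected[r] / totals[group[r]]
    return theta

# The two-coin example as a PM model: pick a coin, flip it 10 times
data = [8, 9, 7, 8, 3, 2, 4, 3, 8, 2]          # head counts (hypothetical)
theta = {("pick", 1): 0.5, ("pick", 2): 0.5,
         ("h", 1): 0.6, ("t", 1): 0.4, ("h", 2): 0.5, ("t", 2): 0.5}
group = {("pick", 1): "pick", ("pick", 2): "pick",
         ("h", 1): "c1", ("t", 1): "c1", ("h", 2): "c2", ("t", 2): "c2"}
theta = em_pm(data, lambda x: [1, 2],
              lambda h, c: {("pick", c): 1, ("h", c): h, ("t", c): 10 - h},
              theta, group)
```

The binomial coefficient is constant per x, so omitting it leaves the posteriors P(y | x, θ^t), and hence the EM updates, unchanged.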

Inside-outside algorithm

Inner loop of the Inside-outside algorithm

Given an input sequence w_{1m} and a PCFG:

1. Calculate inside probability \beta_j(p, q) = P(N^j \Rightarrow^* w_{pq}):
   • Base case: \beta_j(k, k) = P(N^j \to w_k)
   • Recursive case:

\beta_j(p, q) = \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)

2. Calculate outside probability \alpha_j(p, q):
   • Base case: \alpha_j(1, m) = 1 if j = 1 (the start symbol), 0 otherwise
   • Recursive case:

\alpha_j(p, q) = \sum_{f,g} \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \to N^j N^g)\, \beta_g(q+1, e) + \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \to N^g N^j)\, \beta_g(e, p-1)
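Step 1 (the inside probabilities) can be sketched directly from the two cases above; the toy grammar reuses the "Jim sleeps" example, with rule probabilities assumed to be 1 for simplicity:

```python
from collections import defaultdict

def inside(words, unary, binary):
    """Inside probabilities beta[(A, p, q)] = P(A =>* w_p..w_q) for a PCFG
    in Chomsky normal form (a sketch; spans are 0-indexed here).
    unary:  {(A, word): prob}  for rules A -> word
    binary: {(A, B, C): prob}  for rules A -> B C
    """
    m = len(words)
    beta = defaultdict(float)
    # base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words):
        for (A, word), prob in unary.items():
            if word == w:
                beta[(A, k, k)] += prob
    # recursive case: sum over binary rules and split points d
    for span in range(2, m + 1):
        for p in range(m - span + 1):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):
                    beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta

# Toy grammar for the slides' example sentence "Jim sleeps"
binary = {("S", "NP", "VP"): 1.0}
unary = {("NP", "Jim"): 1.0, ("VP", "sleeps"): 1.0}
beta = inside(["Jim", "sleeps"], unary, binary)
# beta[("S", 0, 1)] equals P(w_1m) when S is the start symbol
```

The outside pass and the count collection would then run over the same chart, following the recursions above.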

Inside-outside algorithm (cont)

P(N^j \to N^r N^s \text{ is used} \mid w_{1m}) = \frac{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, P(N^j \to N^r N^s) \sum_{d=p}^{q-1} \beta_r(p, d)\, \beta_s(d+1, q)}{P(w_{1m})}

P(N^j \to w_k \text{ is used} \mid w_{1m}) = \frac{\sum_{h=1}^{m} \alpha_j(h, h)\, P(N^j \to w_k)\, \delta(w_h, w_k)}{P(w_{1m})}

where P(w_{1m}) = \beta_1(1, m).

3. Collect the counts

4. Normalize and update the parameters

\hat{P}(N^j \to N^r N^s) = \frac{Cnt(N^j \to N^r N^s)}{\sum_{r', s'} Cnt(N^j \to N^{r'} N^{s'})} = \frac{P(N^j \to N^r N^s \text{ is used} \mid w_{1m})}{\sum_{r', s'} P(N^j \to N^{r'} N^{s'} \text{ is used} \mid w_{1m})}

\hat{P}(N^j \to w_k) = \frac{Cnt(N^j \to w_k)}{\sum_{k'} Cnt(N^j \to w_{k'})} = \frac{P(N^j \to w_k \text{ is used} \mid w_{1m})}{\sum_{k'} P(N^j \to w_{k'} \text{ is used} \mid w_{1m})}

Expected counts for PCFG rules

count(N^j \to N^r N^s) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, N^j \to N^r N^s)

= \sum_{Tr} P(Tr \mid w_{1m}, \theta) \cdot count(w_{1m}, Tr, N^j \to N^r N^s)

= \sum_{p=1}^{m} \sum_{q=p}^{m} P(N^j \to N^r N^s \text{ is used to cover } w_{pq} \mid w_{1m}, \theta)

This is the formula if we have only one sentence.Add an outside sum if X contains multiple sentences.

Expected counts (cont)

count(N^j \to w_k) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, N^j \to w_k)

= \sum_{Tr} P(Tr \mid w_{1m}, \theta) \cdot count(w_{1m}, Tr, N^j \to w_k)

= \sum_{h=1}^{m} P(N^j \to w_k \text{ is used at position } h \mid w_{1m}, \theta)

Relation to EM

• PCFG is a PM Model

• Inside-outside algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence w1m.

• Y (hidden data): parse tree Tr.

• Θ (parameters): P(N^j \to N^r N^s) and P(N^j \to w_k).

Forward-backward algorithm

The inner loop for forward-backward algorithm

Given an HMM (S, K, \Pi, A, B) and an input sequence O_{1T}:

1. Calculate forward probability \alpha_i(t):
   • Base case: \alpha_i(1) = \pi_i
   • Recursive case: \alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ijo_t}

2. Calculate backward probability \beta_i(t):
   • Base case: \beta_i(T+1) = 1
   • Recursive case: \beta_i(t) = \sum_{j} a_{ij}\, b_{ijo_t}\, \beta_j(t+1)

3. Calculate expected counts:

p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(T+1)}

4. Update the parameters:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \sum_{j'=1}^{N} p_t(i, j')}

\hat{b}_{ijk} = \frac{\sum_{t:\, o_t = w_k} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}
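The four steps can be sketched as one function for the arc-emission HMM used in the slides; the 2-state, 2-symbol parameter values below are made up for illustration:

```python
def forward_backward_step(obs, pi, a, b):
    """One inner-loop pass for an arc-emission HMM (a sketch; notation as in
    the slides): a[i][j] = P(s_j | s_i), b[i][j][k] = P(w_k | s_i -> s_j),
    pi[i] = P(start in s_i). Returns (new_a, new_b) re-estimated from the
    expected counts p_t(i, j)."""
    N, T = len(pi), len(obs)
    # 1. forward probabilities; alpha[0] = pi is the base case
    alpha = [list(pi)] + [[0.0] * N for _ in range(T)]
    for t in range(T):
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * a[i][j] * b[i][j][obs[t]]
                                  for i in range(N))
    # 2. backward probabilities; beta[T] = 1 is the base case
    beta = [[0.0] * N for _ in range(T)] + [[1.0] * N]
    for t in range(T - 1, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j]
                             for j in range(N))
    total = sum(alpha[T])                      # P(O | theta)
    # 3. expected counts p_t(i, j)
    p = [[[alpha[t][i] * a[i][j] * b[i][j][obs[t]] * beta[t + 1][j] / total
           for j in range(N)] for i in range(N)] for t in range(T)]
    # 4. update: normalized expected counts
    new_a = [[sum(p[t][i][j] for t in range(T)) /
              sum(p[t][i][jj] for t in range(T) for jj in range(N))
              for j in range(N)] for i in range(N)]
    K = len(b[0][0])
    new_b = [[[sum(p[t][i][j] for t in range(T) if obs[t] == k) /
               sum(p[t][i][j] for t in range(T))
               for k in range(K)] for j in range(N)] for i in range(N)]
    return new_a, new_b

# A tiny 2-state, 2-symbol example with made-up numbers
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[[0.9, 0.1], [0.5, 0.5]], [[0.5, 0.5], [0.1, 0.9]]]
new_a, new_b = forward_backward_step([0, 1, 0, 1], pi, a, b)
```

By construction each row of new_a and each new_b[i][j] is a proper distribution, which is what makes the update a valid M-step.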

Expected counts

count(s_i \to s_j) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, s_i \to s_j)

= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1T}, \theta) \cdot count(O_{1T}, X_{1,T+1}, s_i \to s_j)

= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1T}, \theta)

= \sum_{t=1}^{T} p_t(i, j)

Expected counts (cont)

count(s_i \to s_j, w_k) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, s_i \to s_j, w_k)

= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1T}, \theta) \cdot count(O_{1T}, X_{1,T+1}, s_i \to s_j, w_k)

= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1T}, \theta) \cdot \delta(o_t, w_k)

= \sum_{t:\, o_t = w_k} p_t(i, j)

Relation to EM

• HMM is a PM Model

• Forward-backward algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is an observation sequence O_{1T}.

• Y (hidden data): state sequence X1T.

• Θ (parameters): aij, bijk, πi.

IBM models for MT

Expected counts for (f, e) pairs

• Let Ct(f, e) be the fractional count of (f, e) pair in the training data.

Ct(f, e) = \sum_{a} P(a \mid E, F, \theta) \sum_{j=1}^{|F|} \delta(f, f_j)\, \delta(e, e_{a_j})

Here P(a | E, F, θ) is the alignment probability, and the inner sum is the actual count of times e and f are linked in (E, F) by alignment a.

t(f \mid e) = \frac{Ct(f, e)}{\sum_{x \in V_F} Ct(x, e)}
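For IBM Model 1 in particular, the sum over alignments factorizes per source word, so Ct(f, e) and the update t(f|e) = Ct(f, e)/Σ_x Ct(x, e) can be computed without enumerating alignments. A sketch on a toy parallel corpus (the word pairs are illustrative):

```python
from collections import defaultdict

def ibm1_em_step(bitext, t):
    """One EM iteration for IBM Model 1 (a sketch). bitext: (F, E) sentence
    pairs; t[(f, e)]: current t(f|e). Under Model 1 the posterior over
    alignments factorizes per source word, so the fractional count Ct(f, e)
    is collected link by link."""
    count = defaultdict(float)
    for F, E in bitext:
        for f in F:
            z = sum(t[(f, e)] for e in E)        # normalizer over links for f
            for e in E:
                count[(f, e)] += t[(f, e)] / z   # fractional count Ct(f, e)
    totals = defaultdict(float)                  # t(f|e) = Ct(f,e) / sum_x Ct(x,e)
    for (f, e), c in count.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in count.items()}

# Toy parallel corpus (hypothetical word pairs)
bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
pairs = {(f, e) for F, E in bitext for f in F for e in E}
t = {p: 1.0 / 3 for p in pairs}                  # uniform initialization
for _ in range(50):
    t = ibm1_em_step(bitext, t)
```

After a few dozen iterations EM has pushed t(la|the), t(maison|house), and t(fleur|flower) close to 1: "maison" is explained by "house", which frees "the" to account for "la".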

Relation to EM

• IBM models are PM Models.

• The EM algorithm used in IBM models is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence pair (F, E).

• Y (hidden data): word alignment a.

• Θ (parameters): t(f|e), d(i | j, m, n), etc.

Summary

• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – Optimal solution in M-step exists for many classes of problems.

• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM Models for MT

Relations among the algorithms

• The generalized EM
  – The EM algorithm
    • PM models
      – Inside-outside algorithm
      – Forward-backward algorithm
      – IBM models
    • Gaussian mixtures

Strengths of EM

• Numerical stability: every iteration of the EM algorithm increases the likelihood of the observed data.

• The EM handles parameter constraints gracefully.

Problems with EM

• Convergence can be very slow on some problems, and is intimately related to the amount of missing information.

• It guarantees improving the probability of the training corpus, which is different from reducing the errors directly.

• It cannot guarantee reaching the global maximum (it can get stuck at local maxima, saddle points, etc.).

  The initial estimate is therefore important.

Lower bound lemma

If f(x) \ge g(x) for all x, f(x_0) = g(x_0), and \hat{x} satisfies g(\hat{x}) \ge g(x_0),

then f(\hat{x}) \ge f(x_0).

Proof: f(\hat{x}) \ge g(\hat{x}) \ge g(x_0) = f(x_0).

L(θ) is non-decreasing

l(\theta) - l(\theta^t) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

Let

f(\theta) = \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right], \qquad g(\theta) = l(\theta^t) + f(\theta)

Then l(\theta) \ge g(\theta) for all \theta, and l(\theta^t) = g(\theta^t) since f(\theta^t) = 0.

We have \theta^{t+1} = \arg\max_\theta f(\theta), so

g(\theta^{t+1}) - g(\theta^t) = f(\theta^{t+1}) - f(\theta^t) \ge 0

By the lower bound lemma,

l(\theta^{t+1}) \ge l(\theta^t)