
EM algorithm

LING 572

Fei Xia

03/02/06

Outline

• The EM algorithm

• EM for PM models

• Three special cases
  – Inside-outside algorithm
  – Forward-backward algorithm
  – IBM models for MT

The EM algorithm

Basic setting in EM

• X is a set of data points: the observed data.

• Θ is a parameter vector.

• EM is a method to find θ_ML, where

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \log P(X \mid \theta)

• Calculating P(X | θ) directly is hard.

• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).

The basic EM strategy

• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)

• Given a fixed x, there could be many possible y's.
  – Ex: given a sentence x, there could be many state sequences in an HMM that generate x.

Examples of EM

|              | HMM              | PCFG         | MT                         | Coin toss            |
|--------------|------------------|--------------|----------------------------|----------------------|
| X (observed) | sentences        | sentences    | parallel data              | head-tail sequences  |
| Y (hidden)   | state sequences  | parse trees  | word alignments            | coin id sequences    |
| θ            | a_ij, b_ijk, π_i | P(A → BC)    | t(f\|e), d(a_j\|j, l, m), … | p1, p2, λ            |
| Algorithm    | forward-backward | inside-outside | IBM models               | N/A                  |
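The coin-toss column of this table can be run end to end. The sketch below uses hypothetical data and starting values: each sample is the number of heads in 10 tosses of one (hidden) coin, and EM alternates a posterior over the hidden coin id with re-estimation of (p1, p2, λ) from expected counts:

```python
def em_two_coins(samples, n_flips, p1, p2, lam, iters=100):
    """EM for the coin-toss example: each sample is the number of heads
    seen in n_flips tosses of one coin; which coin was tossed is hidden.
    Parameters theta = (p1, p2, lam), lam = P(coin 1 is chosen)."""
    for _ in range(iters):
        # E-step: posterior P(y = coin 1 | x_i, theta^t) for each sample
        post = []
        for h in samples:
            w1 = lam * p1 ** h * (1 - p1) ** (n_flips - h)
            w2 = (1 - lam) * p2 ** h * (1 - p2) ** (n_flips - h)
            post.append(w1 / (w1 + w2))
        # M-step: re-estimate parameters from normalized expected counts
        lam = sum(post) / len(post)
        p1 = sum(g * h for g, h in zip(post, samples)) / (n_flips * sum(post))
        p2 = sum((1 - g) * h for g, h in zip(post, samples)) / (
            n_flips * (len(post) - sum(post)))
    return p1, p2, lam

# Hypothetical data: head counts out of 10 flips, from two differently biased coins
data = [8, 9, 7, 8, 3, 2, 4, 3, 8, 2]
p1, p2, lam = em_two_coins(data, 10, p1=0.6, p2=0.5, lam=0.5)
```

On this data the estimates separate into one coin near 0.8 and one near 0.28, with λ near 0.5, regardless of the (asymmetric) starting point shown.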

The log-likelihood function

• L is a function of θ, holding X constant: L(\theta) = L(\theta \mid X) = P(X \mid \theta)

l(\theta) = \log L(\theta) = \log P(X \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta) = \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)

The iterative approach for MLE

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)

In many cases, we cannot find the solution directly.

An alternative is to find a sequence \theta^0, \theta^1, \dots, \theta^t, \dots s.t.

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

l(\theta) - l(\theta^t) = \log P(X \mid \theta) - \log P(X \mid \theta^t)

= \sum_{i=1}^{n} \log P(x_i \mid \theta) - \sum_{i=1}^{n} \log P(x_i \mid \theta^t)

= \sum_{i=1}^{n} \log \frac{\sum_{y} P(x_i, y \mid \theta)}{\sum_{y'} P(x_i, y' \mid \theta^t)}

= \sum_{i=1}^{n} \log \sum_{y} \frac{P(x_i, y \mid \theta^t)}{\sum_{y'} P(x_i, y' \mid \theta^t)} \cdot \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

= \sum_{i=1}^{n} \log \sum_{y} P(y \mid x_i, \theta^t) \cdot \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

\ge \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}   (by Jensen's inequality)

= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

Jensen’s inequality

If f is convex, then E[f(g(x))] \ge f(E[g(x)]).

If f is concave, then E[f(g(x))] \le f(E[g(x)]).

log is a concave function, so:

E[\log p(x)] \le \log E[p(x)]
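A quick numerical check of the concave case, with arbitrary positive values standing in for p(x):

```python
import math
import random

random.seed(0)
# arbitrary positive values in (0.2, 0.8) standing in for p(x)
p = [0.2 + 0.6 * random.random() for _ in range(10000)]

e_log = sum(math.log(v) for v in p) / len(p)   # E[log p(x)]
log_e = math.log(sum(p) / len(p))              # log E[p(x)]
# Jensen: log is concave, so E[log p(x)] <= log E[p(x)]
```

Equality would hold only if p(x) were constant; for any non-degenerate sample the gap is strictly positive.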

Maximizing the lower bound

\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

= \arg\max_\theta \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}

= \arg\max_\theta \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

(the denominator P(x_i, y \mid \theta^t) does not depend on θ, so it can be dropped)

= \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} [\log P(x_i, y \mid \theta)]

The Q-function

• Define the Q-function (a function of θ):

Q(\theta, \theta^t) = E_{P(Y \mid X, \theta^t)}[\log P(X, Y \mid \theta) \mid X, \theta^t] = \sum_{Y} P(Y \mid X, \theta^t) \log P(X, Y \mid \theta) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

  – Y is a random vector.
  – X = (x1, x2, …, xn) is a constant (vector).
  – θ^t is the current parameter estimate and is a constant (vector).
  – θ is the variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

The inner loop of the EM algorithm

• E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

• M-step: find

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^t)

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence \theta^0, \theta^1, \dots, \theta^t, \dots

• It can be proved that

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

The inner loop of the Generalized EM algorithm (GEM)

• E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

• M-step: find a θ^{t+1} such that

Q(\theta^{t+1}, \theta^t) \ge Q(\theta^t, \theta^t)

(GEM only requires that the M-step improve the Q-function, not maximize it.)

Recap of the EM algorithm

Idea #1: find θ that maximizes the likelihood of training data

\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \log P(X \mid \theta)

Idea #2: find the θt sequence

No analytical solution → iterative approach: find a sequence \theta^0, \theta^1, \dots, \theta^t, \dots s.t.

l(\theta^0) \le l(\theta^1) \le \dots \le l(\theta^t) \le \dots

Idea #3: find θ^{t+1} that maximizes a tight lower bound of l(\theta) - l(\theta^t):

l(\theta) - l(\theta^t) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]   (a tight lower bound)

Idea #4: find θ^{t+1} that maximizes the Q-function:

\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]   (lower bound of l(\theta) - l(\theta^t))

= \arg\max_\theta \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} [\log P(x_i, y \mid \theta)]   (the Q-function)

The EM algorithm

• Repeat until convergence
  – E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

  – M-step: find

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^t)

Important classes of EM problems

• Products of multinomial (PM) models

• Exponential families

• Gaussian mixture

• …

The EM algorithm for PM models

PM models

P(x, y \mid \Theta) = \prod_{i} p_i^{\,c_i(x, y)}

where \Theta = (\theta_1, \dots, \theta_m) is a partition of all the parameters into multinomial distributions, and for any j

\sum_{p_i \in \theta_j} p_i = 1

HMM is a PM

P(x, y \mid \theta) = \prod_{i,j} P(s_j \mid s_i)^{\,c(s_i \to s_j,\, x, y)} \cdot \prod_{i,j,k} P(w_k \mid s_i, s_j)^{\,c(s_i \to s_j, w_k,\, x, y)} = \prod_{i,j} a_{ij}^{\,c(s_i \to s_j,\, x, y)} \cdot \prod_{i,j,k} b_{ijk}^{\,c(s_i \to s_j, w_k,\, x, y)}

where a_{ij} = P(s_j \mid s_i) and b_{ijk} = P(w_k \mid s_i, s_j).

PCFG

• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \to \alpha_i \mid A_i)

Ex: P(x, y \mid \theta) = P(S \to NP\ VP \mid S) \cdot P(NP \to Jim \mid NP) \cdot P(VP \to sleeps \mid VP)

PCFG is a PM

P(x, y \mid \theta) = \prod_{A \to \alpha} P(A \to \alpha)^{\,c(A \to \alpha,\, x, y)}, \qquad \sum_{\alpha} P(A \to \alpha) = 1 \text{ for each } A

Q-function for PM

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)

= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \prod_{j} p_j^{\,C_j(x_i, y)}

= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \left( C_1(x_i, y) \log p_1 + \dots + C_k(x_i, y) \log p_k \right)

= \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_1(x_i, y) \right) \log p_1 + \dots + \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_k(x_i, y) \right) \log p_k

Maximizing the Q function

Maximize

Q(\theta, \theta^t) = \sum_{j} \left( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y) \right) \log p_j

subject to the constraint

\sum_{j} p_j = 1

Use Lagrange multipliers:

\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_j(x_i, y) \log p_j + \lambda \left( 1 - \sum_{j} p_j \right)

Optimal solution

\frac{\partial \hat{Q}}{\partial p_j} = \frac{1}{p_j} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y) - \lambda = 0

p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C_j(x_i, y)

Here 1/\lambda is the normalization factor and the sum is the expected count of parameter j.

PM Models

• \theta_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.

• Count(x, y, r) is the number of times \theta_r is seen in the expression for P(x, y \mid \theta):

P(x, y \mid \theta) = \prod_{r} \theta_r^{\,Count(x, y, r)}

The EM algorithm for PM Models

• Calculate expected counts

count(r) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, Count(x_i, y, r)

• Update parameters

\theta_r^{t+1} = \frac{count(r)}{\sum_{r' \in mult(r)} count(r')}

where mult(r) is the multinomial distribution that \theta_r belongs to.

PCFG example

• Calculate expected counts

• Update parameters

The EM algorithm for PM models

• For each iteration:
  – For each training example x_i:
    • For each possible y: compute P(y | x_i, θ^t)
      – For each parameter r: add P(y | x_i, θ^t) · Count(x_i, y, r) to the expected count of r
  – For each parameter: normalize the expected counts within its multinomial and update
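The loop above can be sketched generically, with the model supplied as callbacks; the parameter naming scheme and the coin-toss usage below are illustrative, not from the slides:

```python
from collections import defaultdict

def em_pm(data, candidates, counts, theta, group, iters=100):
    """Generic EM for a product-of-multinomials (PM) model (a sketch).
    P(x, y | theta) = prod_r theta[r] ** Count(x, y, r).

    candidates(x): the possible hidden y's for x
    counts(x, y):  dict {r: Count(x, y, r)}
    group[r]:      the multinomial each parameter belongs to
    """
    for _ in range(iters):
        expected = defaultdict(float)
        for x in data:                          # for each training example x_i
            joint = {}
            for y in candidates(x):             # for each possible y
                p = 1.0
                for r, c in counts(x, y).items():
                    p *= theta[r] ** c          # P(x, y | theta)
                joint[y] = p
            z = sum(joint.values())
            for y, p in joint.items():          # P(y | x, theta^t) = p / z
                for r, c in counts(x, y).items():
                    expected[r] += (p / z) * c  # accumulate expected counts
        totals = defaultdict(float)             # normalize within each multinomial
        for r, c in expected.items():
            totals[group[r]] += c
        for r in theta:
            if totals[group[r]] > 0:
                theta[r] = expected[r] / totals[group[r]]
    return theta

# The two-coin example as a PM model: pick a coin, flip it 10 times
data = [8, 9, 7, 8, 3, 2, 4, 3, 8, 2]          # head counts (hypothetical)
theta = {("pick", 1): 0.5, ("pick", 2): 0.5,
         ("h", 1): 0.6, ("t", 1): 0.4, ("h", 2): 0.5, ("t", 2): 0.5}
group = {("pick", 1): "pick", ("pick", 2): "pick",
         ("h", 1): "c1", ("t", 1): "c1", ("h", 2): "c2", ("t", 2): "c2"}
theta = em_pm(data, lambda x: [1, 2],
              lambda h, c: {("pick", c): 1, ("h", c): h, ("t", c): 10 - h},
              theta, group)
```

The binomial coefficient is constant per x, so omitting it leaves the posteriors P(y | x, θ^t), and hence the EM updates, unchanged.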

Inside-outside algorithm

Inner loop of the Inside-outside algorithm

Given an input sequence w_{1m} and a PCFG:

1. Calculate inside probability \beta_j(p, q) = P(N^j \Rightarrow^* w_{pq}):
   • Base case: \beta_j(k, k) = P(N^j \to w_k)
   • Recursive case:

\beta_j(p, q) = \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)

2. Calculate outside probability \alpha_j(p, q):
   • Base case: \alpha_j(1, m) = 1 if j = 1 (the start symbol), 0 otherwise
   • Recursive case:

\alpha_j(p, q) = \sum_{f,g} \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \to N^j N^g)\, \beta_g(q+1, e) + \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \to N^g N^j)\, \beta_g(e, p-1)
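Step 1 (the inside probabilities) can be sketched directly from the two cases above; the toy grammar reuses the "Jim sleeps" example, with rule probabilities assumed to be 1 for simplicity:

```python
from collections import defaultdict

def inside(words, unary, binary):
    """Inside probabilities beta[(A, p, q)] = P(A =>* w_p..w_q) for a PCFG
    in Chomsky normal form (a sketch; spans are 0-indexed here).
    unary:  {(A, word): prob}  for rules A -> word
    binary: {(A, B, C): prob}  for rules A -> B C
    """
    m = len(words)
    beta = defaultdict(float)
    # base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words):
        for (A, word), prob in unary.items():
            if word == w:
                beta[(A, k, k)] += prob
    # recursive case: sum over binary rules and split points d
    for span in range(2, m + 1):
        for p in range(m - span + 1):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):
                    beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta

# Toy grammar for the slides' example sentence "Jim sleeps"
binary = {("S", "NP", "VP"): 1.0}
unary = {("NP", "Jim"): 1.0, ("VP", "sleeps"): 1.0}
beta = inside(["Jim", "sleeps"], unary, binary)
# beta[("S", 0, 1)] equals P(w_1m) when S is the start symbol
```

The outside pass and the count collection would then run over the same chart, following the recursions above.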

Inside-outside algorithm (cont)

P(N^j \to N^r N^s \text{ is used} \mid w_{1m}) = \frac{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, P(N^j \to N^r N^s) \sum_{d=p}^{q-1} \beta_r(p, d)\, \beta_s(d+1, q)}{P(w_{1m})}

P(N^j \to w_k \text{ is used} \mid w_{1m}) = \frac{\sum_{h=1}^{m} \alpha_j(h, h)\, P(N^j \to w_k)\, \delta(w_h, w_k)}{P(w_{1m})}

where P(w_{1m}) = \beta_1(1, m).

3. Collect the counts

4. Normalize and update the parameters

\hat{P}(N^j \to N^r N^s) = \frac{Cnt(N^j \to N^r N^s)}{\sum_{r', s'} Cnt(N^j \to N^{r'} N^{s'})} = \frac{P(N^j \to N^r N^s \text{ is used} \mid w_{1m})}{\sum_{r', s'} P(N^j \to N^{r'} N^{s'} \text{ is used} \mid w_{1m})}

\hat{P}(N^j \to w_k) = \frac{Cnt(N^j \to w_k)}{\sum_{k'} Cnt(N^j \to w_{k'})} = \frac{P(N^j \to w_k \text{ is used} \mid w_{1m})}{\sum_{k'} P(N^j \to w_{k'} \text{ is used} \mid w_{1m})}

Expected counts for PCFG rules

count(N^j \to N^r N^s) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, N^j \to N^r N^s)

= \sum_{Tr} P(Tr \mid w_{1m}, \theta) \cdot count(w_{1m}, Tr, N^j \to N^r N^s)

= \sum_{p=1}^{m} \sum_{q=p}^{m} P(N^j \to N^r N^s \text{ is used to cover } w_{pq} \mid w_{1m}, \theta)

This is the formula if we have only one sentence.Add an outside sum if X contains multiple sentences.

Expected counts (cont)

count(N^j \to w_k) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, N^j \to w_k)

= \sum_{Tr} P(Tr \mid w_{1m}, \theta) \cdot count(w_{1m}, Tr, N^j \to w_k)

= \sum_{h=1}^{m} P(N^j \to w_k \text{ is used at position } h \mid w_{1m}, \theta)

Relation to EM

• PCFG is a PM Model

• Inside-outside algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence w1m.

• Y (hidden data): parse tree Tr.

• Θ (parameters): P(N^j \to N^r N^s) and P(N^j \to w_k).

Forward-backward algorithm

The inner loop for forward-backward algorithm

Given an HMM (S, K, \Pi, A, B) and an input sequence O_{1T}:

1. Calculate forward probability \alpha_i(t):
   • Base case: \alpha_i(1) = \pi_i
   • Recursive case: \alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ijo_t}

2. Calculate backward probability \beta_i(t):
   • Base case: \beta_i(T+1) = 1
   • Recursive case: \beta_i(t) = \sum_{j} a_{ij}\, b_{ijo_t}\, \beta_j(t+1)

3. Calculate expected counts:

p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(T+1)}

4. Update the parameters:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \sum_{j'=1}^{N} p_t(i, j')}

\hat{b}_{ijk} = \frac{\sum_{t:\, o_t = w_k} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}
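The four steps can be sketched as one function for the arc-emission HMM used in the slides; the 2-state, 2-symbol parameter values below are made up for illustration:

```python
def forward_backward_step(obs, pi, a, b):
    """One inner-loop pass for an arc-emission HMM (a sketch; notation as in
    the slides): a[i][j] = P(s_j | s_i), b[i][j][k] = P(w_k | s_i -> s_j),
    pi[i] = P(start in s_i). Returns (new_a, new_b) re-estimated from the
    expected counts p_t(i, j)."""
    N, T = len(pi), len(obs)
    # 1. forward probabilities; alpha[0] = pi is the base case
    alpha = [list(pi)] + [[0.0] * N for _ in range(T)]
    for t in range(T):
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * a[i][j] * b[i][j][obs[t]]
                                  for i in range(N))
    # 2. backward probabilities; beta[T] = 1 is the base case
    beta = [[0.0] * N for _ in range(T)] + [[1.0] * N]
    for t in range(T - 1, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j]
                             for j in range(N))
    total = sum(alpha[T])                      # P(O | theta)
    # 3. expected counts p_t(i, j)
    p = [[[alpha[t][i] * a[i][j] * b[i][j][obs[t]] * beta[t + 1][j] / total
           for j in range(N)] for i in range(N)] for t in range(T)]
    # 4. update: normalized expected counts
    new_a = [[sum(p[t][i][j] for t in range(T)) /
              sum(p[t][i][jj] for t in range(T) for jj in range(N))
              for j in range(N)] for i in range(N)]
    K = len(b[0][0])
    new_b = [[[sum(p[t][i][j] for t in range(T) if obs[t] == k) /
               sum(p[t][i][j] for t in range(T))
               for k in range(K)] for j in range(N)] for i in range(N)]
    return new_a, new_b

# A tiny 2-state, 2-symbol example with made-up numbers
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[[0.9, 0.1], [0.5, 0.5]], [[0.5, 0.5], [0.1, 0.9]]]
new_a, new_b = forward_backward_step([0, 1, 0, 1], pi, a, b)
```

By construction each row of new_a and each new_b[i][j] is a proper distribution, which is what makes the update a valid M-step.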

Expected counts

count(s_i \to s_j) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, s_i \to s_j)

= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1T}, \theta) \cdot count(O_{1T}, X_{1,T+1}, s_i \to s_j)

= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1T}, \theta)

= \sum_{t=1}^{T} p_t(i, j)

Expected counts (cont)

count(s_i \to s_j, w_k) = \sum_{Y} P(Y \mid X, \theta) \cdot count(X, Y, s_i \to s_j, w_k)

= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1T}, \theta) \cdot count(O_{1T}, X_{1,T+1}, s_i \to s_j, w_k)

= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1T}, \theta) \cdot \delta(o_t, w_k)

= \sum_{t:\, o_t = w_k} p_t(i, j)

Relation to EM

• HMM is a PM Model

• Forward-backward algorithm is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is an observation sequence O_{1T}.

• Y (hidden data): state sequence X1T.

• Θ (parameters): aij, bijk, πi.

IBM models for MT

Expected counts for (f, e) pairs

• Let Ct(f, e) be the fractional count of (f, e) pair in the training data.

Ct(f, e) = \sum_{a} P(a \mid E, F, \theta) \sum_{j=1}^{|F|} \delta(f, f_j)\, \delta(e, e_{a_j})

Here P(a | E, F, θ) is the alignment probability, and the inner sum is the actual count of times e and f are linked in (E, F) by alignment a.

t(f \mid e) = \frac{Ct(f, e)}{\sum_{x \in V_F} Ct(x, e)}
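For IBM Model 1 in particular, the sum over alignments factorizes per source word, so Ct(f, e) and the update t(f|e) = Ct(f, e)/Σ_x Ct(x, e) can be computed without enumerating alignments. A sketch on a toy parallel corpus (the word pairs are illustrative):

```python
from collections import defaultdict

def ibm1_em_step(bitext, t):
    """One EM iteration for IBM Model 1 (a sketch). bitext: (F, E) sentence
    pairs; t[(f, e)]: current t(f|e). Under Model 1 the posterior over
    alignments factorizes per source word, so the fractional count Ct(f, e)
    is collected link by link."""
    count = defaultdict(float)
    for F, E in bitext:
        for f in F:
            z = sum(t[(f, e)] for e in E)        # normalizer over links for f
            for e in E:
                count[(f, e)] += t[(f, e)] / z   # fractional count Ct(f, e)
    totals = defaultdict(float)                  # t(f|e) = Ct(f,e) / sum_x Ct(x,e)
    for (f, e), c in count.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in count.items()}

# Toy parallel corpus (hypothetical word pairs)
bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
pairs = {(f, e) for F, E in bitext for f in F for e in E}
t = {p: 1.0 / 3 for p in pairs}                  # uniform initialization
for _ in range(50):
    t = ibm1_em_step(bitext, t)
```

After a few dozen iterations EM has pushed t(la|the), t(maison|house), and t(fleur|flower) close to 1: "maison" is explained by "house", which frees "the" to account for "la".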

Relation to EM

• IBM models are PM Models.

• The EM algorithm used in IBM models is a special case of the EM algorithm for PM Models.

• X (observed data): each data point is a sentence pair (F, E).

• Y (hidden data): word alignment a.

• Θ (parameters): t(f|e), d(i | j, m, n), etc.

Summary

• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – Optimal solution in M-step exists for many classes of problems.

• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM Models for MT

Relations among the algorithms

• The generalized EM
  – The EM algorithm
    • PM models
      – Inside-outside algorithm
      – Forward-backward algorithm
      – IBM models
    • Gaussian mixtures

Strengths of EM

• Numerical stability: every iteration of the EM algorithm increases the likelihood of the observed data.

• The EM handles parameter constraints gracefully.

Problems with EM

• Convergence can be very slow on some problems, and is intimately related to the amount of missing information.

• It guarantees improving the probability of the training corpus, which is different from reducing the errors directly.

• It cannot guarantee reaching the global maximum (it can get stuck at local maxima, saddle points, etc.).

  The initial estimate is therefore important.

Lower bound lemma

If f(x) \ge g(x) for all x, f(x_0) = g(x_0), and \hat{x} satisfies g(\hat{x}) \ge g(x_0),

then f(\hat{x}) \ge f(x_0).

Proof: f(\hat{x}) \ge g(\hat{x}) \ge g(x_0) = f(x_0).

L(θ) is non-decreasing

l(\theta) - l(\theta^t) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]

Let

f(\theta) = \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right], \qquad g(\theta) = l(\theta^t) + f(\theta)

Then l(\theta) \ge g(\theta) for all \theta, and l(\theta^t) = g(\theta^t) since f(\theta^t) = 0.

We have \theta^{t+1} = \arg\max_\theta f(\theta), so

g(\theta^{t+1}) - g(\theta^t) = f(\theta^{t+1}) - f(\theta^t) \ge 0

By the lower bound lemma,

l(\theta^{t+1}) \ge l(\theta^t)