EM algorithm and applications Lecture #9


Page 1: EM algorithm and applications Lecture #9


EM algorithm and applications Lecture #9

Background Readings: Chapters 11.2, 11.6 in the textbook Biological Sequence Analysis, Durbin et al., 2001.

Page 2: EM algorithm and applications Lecture #9

The EM algorithm

This lecture plan:
1. Presentation and correctness proof of the EM algorithm.
2. Examples of implementations.

Page 3: EM algorithm and applications Lecture #9

Model, Parameters, ML

A “model with parameters θ” is a probabilistic space M, in which each simple event y is determined by values of random variables (dice). The parameters θ are the probabilities associated with the random variables.

(In an HMM of length L, the simple events are HMM-sequences of length L, and the parameters are the transition probabilities m_kl and the emission probabilities e_k(b).)

“Observed data” is a non-empty subset x ⊆ M. (In an HMM, it is usually the set of all simple events which fit a given output sequence.)

Given observed data x, the ML method seeks parameters θ* which maximize the likelihood of the data, p(x|θ) = ∑_y p(x,y|θ). (In an HMM, x can be the transmitted letters and y the hidden states.) Finding such θ* is easy when the observed data is a simple event, but hard in general.

Page 4: EM algorithm and applications Lecture #9

The EM algorithm

Assume a model with parameters as in the previous slide. Given observed data x, the likelihood of x under model parameters θ is given by

p(x|θ) = ∑_y p(x,y|θ).

(The pairs (x,y) are the simple events which comprise x. Informally, y denotes the possible values of the “hidden data”.)

The EM algorithm receives x and parameters θ, and returns new parameters λ* s.t. p(x|λ*) ≥ p(x|θ), with equality only if λ* = θ; i.e., the new parameters increase the likelihood of the observed data.

Page 5: EM algorithm and applications Lecture #9

The EM algorithm

EM uses the current parameters θ to construct a simpler ML problem, Lθ:

Lθ(λ) = ∏_y p(x,y|λ)^{p(y|x,θ)},   so that   log Lθ(λ) = E[log p(x,y|λ)], the expectation being over y ~ p(y|x,θ).

Guarantee: if Lθ(λ) > Lθ(θ), then p(x|λ) > p(x|θ).

[Figure: the graphs of the logarithms of the likelihood functions, log p(x|λ) and log Lθ(λ), as functions of λ, with the current parameters θ and the maximizer λ* of Lθ marked on the λ axis.]

Page 6: EM algorithm and applications Lecture #9

Derivation of the EM Algorithm

Let x be the observed data. Let {(x,y1),…,(x,yk)} be the set of (simple) events which comprise x. Our goal is to find parameters θ* which maximize the sum

p(x|θ*) = p(x,y1|θ*) + p(x,y2|θ*) + … + p(x,yk|θ*).

As this is hard, we start with some parameters θ, and only find λ* s.t. if λ* ≠ θ then:

p(x|λ*) = ∑_{i=1}^k p(x,y_i|λ*) > ∑_{i=1}^k p(x,y_i|θ) = p(x|θ).

Such a λ* is found via “virtual sampling”, defined next.

Page 7: EM algorithm and applications Lecture #9

For given parameters θ, let p_i = p(y_i|x,θ) (note that p1 + … + pk = 1).

We use the p_i’s to define a “virtual” sampling, in which:
y1 occurs p1 times, y2 occurs p2 times, …, yk occurs pk times.

The EM algorithm looks for new parameters λ which maximize the likelihood of this “virtual” sampling. This likelihood is given by

Lθ(λ) = p(y1,x|λ)^{p1} · p(y2,x|λ)^{p2} ⋯ p(yk,x|λ)^{pk}.
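As a concrete illustration (a minimal sketch, not from the lecture), the weights p_i and the log of this virtual-sampling likelihood can be computed as follows, assuming the joint probabilities p(x,y_i|·) of all hidden events can be enumerated; the function names are mine:

```python
import math

def virtual_sampling_weights(joint_probs_theta):
    """joint_probs_theta[i] = p(x, y_i | theta); returns p_i = p(y_i | x, theta)."""
    p_x = sum(joint_probs_theta)                 # p(x | theta) = sum_i p(x, y_i | theta)
    return [p / p_x for p in joint_probs_theta]

def log_L_theta(weights, joint_probs_lam):
    """log L_theta(lambda) = sum_i p_i * log p(x, y_i | lambda)."""
    return sum(w * math.log(q) for w, q in zip(weights, joint_probs_lam))
```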

Page 8: EM algorithm and applications Lecture #9

The EM algorithm

In each iteration the EM algorithm does the following.

(E step): Given θ, compute the function

Lθ(λ) = ∏_y p(x,y|λ)^{p(y|x,θ)}.

(M step): Find λ* which maximizes Lθ(λ).

(The next iteration sets θ ← λ* and repeats.)

Comments:
1. At the M step we only need that Lθ(λ*) > Lθ(θ). This change yields the so-called Generalized EM algorithm. It is used when it is hard to find the optimal λ*.
2. Usually, the computations use the function:

Q(λ) = log(Lθ(λ)) = ∑_y p(y|x,θ)·log(p(x,y|λ)).
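The whole iteration can be summarized in a short generic skeleton (a sketch, assuming the hidden events comprising x can be enumerated; the function and parameter names below are illustrative, not the lecture's):

```python
def em(x, ys, theta, joint_prob, m_step, n_iter=100, tol=1e-9):
    """x: observed data; ys: the hidden events y_1..y_k that comprise x;
       joint_prob(x, y, params) -> p(x, y | params);
       m_step(x, ys, weights)   -> parameters maximizing
                                   Q(lambda) = sum_i weights[i] * log p(x, y_i | lambda)."""
    prev = None
    for _ in range(n_iter):
        # E step: posterior weights p(y_i | x, theta) ("virtual sampling")
        joint = [joint_prob(x, y, theta) for y in ys]
        p_x = sum(joint)
        weights = [j / p_x for j in joint]
        # M step: problem-specific maximization of the weighted likelihood
        theta = m_step(x, ys, weights)
        # Stop when the likelihood of the observed data stops improving
        if prev is not None and p_x - prev < tol:
            break
        prev = p_x
    return theta
```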

Page 9: EM algorithm and applications Lecture #9

Correctness Theorem for the EM Algorithm

Theorem: Let {(x,y1),…,(x,yk)} be a collection of events, as in the setting of the EM algorithm, and let

p_i = prob(y_i|x,θ),   Lθ(λ) = ∏_{i=1}^k prob(x,y_i|λ)^{p_i}.

Then the following holds: if Lθ(λ*) ≥ Lθ(θ), then prob(x|λ*) ≥ prob(x|θ).

Page 10: EM algorithm and applications Lecture #9

Correctness proof of EM

Let p_i = prob(y_i|x,θ) and q_i = prob(y_i|x,λ*). Then from the definition of conditional probability we have:

prob(x,y_i|θ) = p_i·prob(x|θ),   prob(x,y_i|λ*) = q_i·prob(x|λ*).

By the EM assumption on λ* and θ:

Lθ(λ*) = ∏_{i=1}^k (q_i·prob(x|λ*))^{p_i} ≥ ∏_{i=1}^k (p_i·prob(x|θ))^{p_i} = Lθ(θ).

Since ∑_{i=1}^k p_i = 1 we get, after re-arranging terms:

(∏_{i=1}^k q_i^{p_i})·prob(x|λ*) ≥ (∏_{i=1}^k p_i^{p_i})·prob(x|θ).

Page 11: EM algorithm and applications Lecture #9

Correctness proof of EM (end)

From the last slide:

(∏_{i=1}^k q_i^{p_i})·prob(x|λ*) ≥ (∏_{i=1}^k p_i^{p_i})·prob(x|θ).

Dividing by ∏_{i=1}^k q_i^{p_i}, we get:

prob(x|λ*) ≥ (∏_{i=1}^k (p_i/q_i)^{p_i})·prob(x|θ) ≥ prob(x|θ),

since ∏_{i=1}^k (p_i/q_i)^{p_i} ≥ 1 when (p_1,…,p_k) and (q_1,…,q_k) are probability distributions (this is the non-negativity of the relative entropy ∑_i p_i·log(p_i/q_i)). QED
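A quick numeric sanity check of the key inequality ∏_i (p_i/q_i)^{p_i} ≥ 1 (equivalently, non-negativity of the relative entropy); this snippet is illustrative and not part of the lecture:

```python
import math, random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

random.seed(0)
for _ in range(5):
    p = normalize([random.random() for _ in range(4)])   # random distribution p
    q = normalize([random.random() for _ in range(4)])   # random distribution q
    ratio = math.prod((pi / qi) ** pi for pi, qi in zip(p, q))
    assert ratio >= 1.0 - 1e-12    # holds for every draw
```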

Page 12: EM algorithm and applications Lecture #9

Example: Baum-Welch = EM for HMM

The Baum-Welch algorithm is the EM algorithm for HMMs.

E step for HMM:

Lθ(λ) = ∏_s p(s,x|λ)^{p(s|x,θ)},

where λ are the new parameters {m_kl, e_k(b)} and s ranges over the hidden state sequences.

M step for HMM: look for λ which maximizes Lθ(λ).

Recall that for an HMM,

p(s,x|λ) = ∏_{k,l} m_kl^{M_kl^s} · ∏_{k,b} (e_k(b))^{E_k^s(b)},

where M_kl^s is the number of k→l transitions in s, and E_k^s(b) is the number of times state k emits the letter b (along s and x).

Page 13: EM algorithm and applications Lecture #9

Baum-Welch = EM for HMM (cont.)

Writing p(s,x|λ) as above, we get

Lθ(λ) = ∏_s [∏_{k,l} m_kl^{M_kl^s} · ∏_{k,b} (e_k(b))^{E_k^s(b)}]^{p(s|x,θ)}
      = ∏_{k,l} m_kl^{M_kl} · ∏_{k,b} (e_k(b))^{E_k(b)},

where

M_kl = ∑_s p(s|x,θ)·M_kl^s   and   E_k(b) = ∑_s p(s|x,θ)·E_k^s(b).

As we showed, Lθ(λ) is maximized when the m_kl’s and e_k(b)’s are the relative frequencies of the corresponding variables given x and θ, i.e.,

m_kl = M_kl / ∑_{l'} M_kl'   and   e_k(b) = E_k(b) / ∑_{b'} E_k(b').
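For tiny toy HMMs, one such iteration can be sketched by brute-force enumeration of the hidden paths s. A real implementation computes the same expected counts M_kl and E_k(b) with the forward-backward algorithm; the code below is an illustrative sketch of mine, not the lecture's:

```python
from itertools import product
from collections import defaultdict

def bw_iteration(x, states, init, m, e):
    """One Baum-Welch iteration. x: emitted sequence; init[k]: start probability;
       m[k][l]: transition probability; e[k][b]: emission probability.
       Returns updated (m, e) tables (init is not re-estimated here)."""
    M, E, joint = defaultdict(float), defaultdict(float), {}
    # p(s, x | theta) for every hidden path s (exponential: toy sizes only)
    for s in product(states, repeat=len(x)):
        p = init[s[0]] * e[s[0]][x[0]]
        for t in range(1, len(x)):
            p *= m[s[t - 1]][s[t]] * e[s[t]][x[t]]
        joint[s] = p
    p_x = sum(joint.values())
    for s, p in joint.items():
        w = p / p_x                                   # p(s | x, theta)
        for t in range(1, len(x)):
            M[(s[t - 1], s[t])] += w                  # expected counts M_kl
        for t in range(len(x)):
            E[(s[t], x[t])] += w                      # expected counts E_k(b)
    # M step: relative frequencies (alphabet restricted to symbols seen in x;
    # the "or 1.0" guards against an all-zero denominator for unused states)
    symbols = set(x)
    new_m = {k: {l: M[(k, l)] / (sum(M[(k, l2)] for l2 in states) or 1.0)
                 for l in states} for k in states}
    new_e = {k: {b: E[(k, b)] / (sum(E[(k, b2)] for b2 in symbols) or 1.0)
                 for b in symbols} for k in states}
    return new_m, new_e
```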

Page 14: EM algorithm and applications Lecture #9

A simple example: EM for 2 coin tosses

Consider the following experiment: given a coin with two possible outcomes, H (head) and T (tail), with probabilities θ_H and θ_T = 1 − θ_H, the coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (θ_H, θ_T) = (¼, ¾).

Page 15: EM algorithm and applications Lecture #9

EM for 2 coin tosses (cont.)

The “hidden data” which produce x are the sequences y1 = (T,H) and y2 = (T,T).

Hence the likelihood of x with parameters (θ_H, θ_T) is

p(x|θ) = P(x,y1|θ) + P(x,y2|θ) = θ_H·θ_T + θ_T².

For the initial parameters θ = (¼, ¾), we have:

p(x|θ) = ¼ · ¾ + ¾ · ¾ = ¾.

Note that in this case P(x,y_i|θ) = P(y_i|θ) for i = 1,2. We can always define y so that (x,y) = y (otherwise we set y' ≡ (x,y) and replace the “y”s by “y'”s).

Page 16: EM algorithm and applications Lecture #9

EM for 2 coin tosses - E step

Calculate Lθ(λ) = Lθ(λ_H, λ_T).

Recall: λ_H, λ_T are the new parameters, which we need to optimize.

p(y1|x,θ) = p(y1,x|θ)/p(x|θ) = (¾ · ¼)/(¾) = ¼
p(y2|x,θ) = p(y2,x|θ)/p(x|θ) = (¾ · ¾)/(¾) = ¾

Thus we have

Lθ(λ) = p(x,y1|λ)^{p(y1|x,θ)} · p(x,y2|λ)^{p(y2|x,θ)} = p(x,y1|λ)^{1/4} · p(x,y2|λ)^{3/4}.

This is the “virtual sampling”.
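A short numeric check of the posteriors computed above (illustrative only, not part of the lecture):

```python
# Two-coin-toss example: hidden data y1=(T,H), y2=(T,T), theta = (1/4, 3/4).
theta_H, theta_T = 0.25, 0.75
p_xy1 = theta_T * theta_H        # p(x, y1 | theta) = 3/16
p_xy2 = theta_T * theta_T        # p(x, y2 | theta) = 9/16
p_x = p_xy1 + p_xy2              # p(x | theta) = 3/4
print(p_xy1 / p_x, p_xy2 / p_x)  # posteriors 0.25 and 0.75, as on the slide
```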

Page 17: EM algorithm and applications Lecture #9

EM for 2 coin tosses - E step

For a sequence y of coin tosses, let N_H(y) be the number of H’s in y, and N_T(y) be the number of T’s in y. Then

p(y|λ) = λ_H^{N_H(y)} · λ_T^{N_T(y)}.

In our example: y1 = (T,H), y2 = (T,T); hence N_H(y1) = N_T(y1) = 1, N_H(y2) = 0, N_T(y2) = 2.

Page 18: EM algorithm and applications Lecture #9

Example: 2 coin tosses - E step

Thus

Lθ(λ) = p(x,y1|λ)^{1/4} · p(x,y2|λ)^{3/4}
      = (λ_T^{N_T(y1)} · λ_H^{N_H(y1)})^{1/4} · (λ_T^{N_T(y2)} · λ_H^{N_H(y2)})^{3/4}
      = (λ_T · λ_H)^{1/4} · (λ_T²)^{3/4}
      = λ_H^{1/4} · λ_T^{7/4}.

And in general:

Lθ(λ) = λ_T^{N_T} · λ_H^{N_H},   here with N_T = 7/4 and N_H = ¼.

Page 19: EM algorithm and applications Lecture #9

EM for 2 coin tosses - M step

Find λ* which maximizes Lθ(λ).

And as we already saw, λ_T^{N_T} · λ_H^{N_H} is maximized when:

λ_H = N_H / (N_H + N_T);   λ_T = N_T / (N_H + N_T).

Here: λ_H = (¼)/(¼ + 7/4) = 1/8;   λ_T = (7/4)/(¼ + 7/4) = 7/8;

that is, λ* = (1/8, 7/8) and p(x|λ*) = (1/8)·(7/8) + (7/8)² = 7/8.

[The optimal parameters (0,1) will never be reached by the EM algorithm!]
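Putting the E and M steps together for this toy example (an illustrative sketch of mine; as the slide notes, θ_H decreases toward 0 but never reaches it):

```python
def em_two_tosses(theta_H, n_iter=10):
    """EM for x = (T,*) with hidden data y1=(T,H), y2=(T,T)."""
    for _ in range(n_iter):
        theta_T = 1.0 - theta_H
        # E step: posterior weights and expected H/T counts
        p_y1 = theta_T * theta_H          # p(x, y1 | theta)
        p_y2 = theta_T * theta_T          # p(x, y2 | theta)
        p_x = p_y1 + p_y2                 # p(x | theta)
        w1, w2 = p_y1 / p_x, p_y2 / p_x   # posteriors p(y_i | x, theta)
        N_H = w1 * 1 + w2 * 0             # expected number of heads
        N_T = w1 * 1 + w2 * 2             # expected number of tails
        # M step: relative frequencies of the expected counts
        theta_H = N_H / (N_H + N_T)
        print(round(theta_H, 6), round(p_x, 6))
    return theta_H

# Prints the updated theta_H and the likelihood under the previous parameters:
# the first line is 0.125 0.75, matching the slides; theta_H then keeps shrinking.
em_two_tosses(0.25)
```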

Page 20: EM algorithm and applications Lecture #9

EM for a single random variable (dice)

Now the probability of each y (≡ (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities λ1,..,λm.

Let N_k(y) = #(occurrences of outcome k in y). Then

p(y|λ) = ∏_{k=1}^m λ_k^{N_k(y)}.

Let N_k be the expected value of N_k(y), given x and θ:

N_k = E(N_k|x,θ) = ∑_y p(y|x,θ)·N_k(y).

Then we have:

Page 21: EM algorithm and applications Lecture #9

Lθ(λ) for one dice

Lθ(λ) = ∏_y p(y|λ)^{p(y|x,θ)} = ∏_y ∏_{k=1}^m λ_k^{N_k(y)·p(y|x,θ)} = ∏_{k=1}^m λ_k^{∑_y p(y|x,θ)·N_k(y)} = ∏_{k=1}^m λ_k^{N_k},

which is maximized for

λ_k = N_k / ∑_{k'=1}^m N_{k'}.

Page 22: EM algorithm and applications Lecture #9

EM algorithm for n independent observations x1,…, xn:

Expectation step: it can be shown that, if the x^j are independent, then

N_k = ∑_{j=1}^n ∑_{y^j} p(y^j|x^j,θ)·N_k(y^j,x^j) = ∑_{j=1}^n N_k^j,

where

N_k^j = (1/p(x^j|θ)) · ∑_{y^j} p(y^j,x^j|θ)·N_k(y^j,x^j).
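A sketch of this expectation step, under the assumption that the hidden events of each observation can be enumerated (the function and argument names are mine, not the lecture's):

```python
def expected_counts(observations, hidden_events, joint_prob, count, m):
    """observations: the list x^1..x^n
       hidden_events(x) -> list of the possible hidden events y for x
       joint_prob(y, x) -> p(y, x | theta) under the current parameters
       count(y, x, k)   -> N_k(y, x), the number of occurrences of outcome k
       Returns the expected counts [N_1, ..., N_m]."""
    N = [0.0] * m
    for x in observations:
        ys = hidden_events(x)
        joint = [joint_prob(y, x) for y in ys]
        p_x = sum(joint)                       # p(x^j | theta)
        for y, p in zip(ys, joint):
            w = p / p_x                        # p(y | x^j, theta)
            for k in range(m):
                N[k] += w * count(y, x, k)     # accumulate N_k^j
    return N
```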

Page 23: EM algorithm and applications Lecture #9

Example: The ABO locus

A locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportion in a population of the 6 genotypes.

Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE is given by the relative frequencies:

q_{a/a} = N_{a/a}/N,  q_{a/o} = N_{a/o}/N,  q_{b/b} = N_{b/b}/N,  q_{b/o} = N_{b/o}/N,  q_{a/b} = N_{a/b}/N,  q_{o/o} = N_{o/o}/N.

Page 24: EM algorithm and applications Lecture #9

The ABO locus (Cont.)

However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?

The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do?

Page 25: EM algorithm and applications Lecture #9

The ABO locus (Cont.)

The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2·q_a·q_b, q_{a/o} = 2·q_a·q_o, q_{b/o} = 2·q_b·q_o, q_{a/a} = [q_a]², q_{b/b} = [q_b]², q_{o/o} = [q_o]².

In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y:
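Under this rule, the blood-type probabilities implied by given allele frequencies can be written down directly (an illustrative sketch, not from the lecture):

```python
def blood_type_probs(qa, qb, qo):
    """Blood-type probabilities implied by allele frequencies (qa, qb, qo)
       under the Hardy-Weinberg rule."""
    return {
        "A":  qa * qa + 2 * qa * qo,   # genotypes a/a and a/o
        "B":  qb * qb + 2 * qb * qo,   # genotypes b/b and b/o
        "AB": 2 * qa * qb,             # genotype a/b
        "O":  qo * qo,                 # genotype o/o
    }
```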

Page 26: EM algorithm and applications Lecture #9

The ABO locus (Cont.)

The dice’s outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, giving an “ordered genotype pair” – this is the hidden data:
A = {(a,a), (a,o), (o,a)};  B = {(b,b), (b,o), (o,b)};  AB = {(a,b), (b,a)};  O = {(o,o)}.

So we have three parameters of one dice – q_a, q_b, q_o – that we need to estimate. We start with parameters θ = (q_a, q_b, q_o), and then use EM to improve them.

Page 27: EM algorithm and applications Lecture #9

EM setting for the ABO locus

The observed data x = (x1,..,xn) is a sequence of elements (blood types) from the set {A, B, AB, O}; e.g., (B,A,B,B,O,A,B,A,O,B,AB) are observations (x1,…,x11).

The hidden data (i.e., the y’s) for each x^j is the set of ordered pairs of alleles that generate it. For instance, for A it is the set {(a,a), (a,o), (o,a)}.

The parameters θ = {q_a, q_b, q_o} are the (current) probabilities of the alleles.

The complete implementation of the EM algorithm for this problem will be given in the tutorial.
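Since the full implementation is deferred to the tutorial, the following is only a rough sketch of one EM iteration for (q_a, q_b, q_o) from blood-type counts under the Hardy-Weinberg model above; the code and the example counts are illustrative (mine), not the tutorial's:

```python
def abo_em_step(n_A, n_B, n_AB, n_O, qa, qb, qo):
    """One EM iteration for the allele frequencies, given blood-type counts."""
    # E step: posterior over genotypes, then expected allele counts.
    # Among type-A individuals, genotype a/a has posterior qa^2 / (qa^2 + 2*qa*qo).
    pA_aa = qa * qa / (qa * qa + 2 * qa * qo)
    pB_bb = qb * qb / (qb * qb + 2 * qb * qo)
    N_a = n_A * (2 * pA_aa + 1 * (1 - pA_aa)) + n_AB        # expected a-alleles
    N_b = n_B * (2 * pB_bb + 1 * (1 - pB_bb)) + n_AB        # expected b-alleles
    N_o = n_A * (1 - pA_aa) + n_B * (1 - pB_bb) + 2 * n_O   # expected o-alleles
    # M step: relative frequencies of the expected counts.
    total = N_a + N_b + N_o          # equals 2 * (n_A + n_B + n_AB + n_O)
    return N_a / total, N_b / total, N_o / total

# Usage with made-up illustrative counts, starting from uniform allele frequencies:
q = (1/3, 1/3, 1/3)
for _ in range(20):
    q = abo_em_step(44, 27, 4, 88, *q)
print(q)
```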