Hidden Markov Models BIOL337/STAT337/437 Spring Semester 2014.



[Figure: HMM trellis with K hidden states at each of the n stages, emitting the observed symbols x1, x2, x3, ..., xn.]

• Theory of hidden Markov models (HMMs)
• Probabilistic interpretation of sequence alignments using HMMs
• Applications of HMMs to biological sequence modeling and discovery of features such as genes

An HMM

[Figure: the hidden state sequence π1, π2, π3, ..., πn, with each state emitting one observed symbol.]

Example of an HMM

Do you want to play?

The Dishonest Casino

The situation...

•Casino has two dice, one fair (F) and one loaded (L)

•Probabilities for the fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

•Probabilities for the loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10; P(6) = ½

•Before each roll, the casino player switches from the fair die to the loaded die (or vice versa) with probability 1/20

The game...

•You bet $1

•You roll (always with the fair die)

•Casino player rolls (maybe with the fair die, maybe with the loaded die)

•Player who rolls the highest number wins $2

Dishonest Casino HMM

[Figure: two-state HMM with states FAIR and LOADED.]

Transitions: FAIR → LOADED = 0.05, LOADED → FAIR = 0.05, FAIR → FAIR = 0.95, LOADED → LOADED = 0.95

Emissions from the fair die:
P(1 | F) = P(2 | F) = P(3 | F) = P(4 | F) = P(5 | F) = P(6 | F) = 1/6

Emissions from the loaded die:
P(1 | L) = P(2 | L) = P(3 | L) = P(4 | L) = P(5 | L) = 1/10; P(6 | L) = 1/2
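To make the model concrete, here is a minimal Python sketch (my own illustration, not from the slides) that stores the dishonest casino parameters as NumPy arrays; the state ordering and variable names are choices made for this sketch.

```python
import numpy as np

# States: 0 = FAIR, 1 = LOADED (an arbitrary ordering chosen for this sketch)
states = ["FAIR", "LOADED"]
symbols = [1, 2, 3, 4, 5, 6]

# Starting probabilities a_0j (assumed uniform, as in the slides' worked examples)
start = np.array([0.5, 0.5])

# Transition probabilities a_ij: switch dice with probability 1/20
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])

# Emission probabilities e_k(b): rows are states, columns are die faces 1..6
emit = np.array([[1/6, 1/6, 1/6, 1/6, 1/6, 1/6],          # fair die
                 [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]])    # loaded die

# Sanity check: every probability row sums to 1
assert np.allclose(trans.sum(axis=1), 1) and np.allclose(emit.sum(axis=1), 1)
```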

Three Fundamental Questions About Any HMM

•Evaluation: What is the probability of a sequence of outputs of an HMM?

•Decoding: Given a sequence of outputs of an HMM, what is the most probable sequence of states that the HMM went through to produce the output?

•Learning: Given a sequence of outputs of an HMM, how do we estimate the parameters of the model?

Evaluation Question

Suppose the casino player rolls the following sequence of 67 rolls:

1666316316416412646255421
6515616361663616636166466
26532164151151443

How likely is this sequence given our model of how the casino operates?

Probability ≈ 1.3 × 10^-35. (Note that (1/6)^67 = 7.309139054 × 10^-53.)

Decoding Question

1666316316416412646255421
6515616361663616636166466
26532164151151443

What portion of the sequence was generated with the fair die, and what portion with the loaded die?

[Figure: the same rolls, annotated with FAIR and LOADED segments.]

Learning Question

1666316316416412646255421
6515616361663616636166466
26532164151151443

How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? That is, what are the parameters of the model?

[Figure: the same rolls, annotated with FAIR and LOADED segments; within the LOADED segments, P(6) ≈ 0.64.]

Ingredients of an HMM

An HMM M has the following parts...

•Alphabet Σ = {b1, b2, ..., bm} //these symbols are output by the model, for example, Σ = {1,2,3,4,5,6} in the dishonest casino model

•Set of states Q = {1, 2, ..., K} //for example, ‘FAIR’ and ‘LOADED’

•Transition probabilities between states ai,j = probability of making a transition from state i to state j

•Starting probabilities a0j = probability of the model starting in state j

Σ_{j=1..K} aij = 1 (for each state i)

Σ_{j=1..K} a0j = 1

•Emission probabilities within state k, ek(b) = probability of seeing (emitting) the symbol b output while in state k, that is,

ek(b) = P(xi = b | πi = k)

[Figure: a single state k, showing its emission probabilities
ek(b1) = P(xi = b1 | πi = k), ek(b2) = P(xi = b2 | πi = k), ..., ek(bm) = P(xi = bm | πi = k)
and its outgoing transitions ak1, ak2, ..., ak,K (including the self-transition ak,k).]

Some Notation

• π = (π1,π2,...,πt), where πi = state occupied after i steps

• x1,x2,...,xt,π1,π2,...,πt, where πi = state occupied after i steps and xi = symbol emitted while in state πi

[Figure: the HMM trellis again (K states at each of the n stages, with stage i emitting xi); a path through the trellis together with its emissions gives x1,x2,...,xn,π1,π2,...,πn.]

•‘Forgetfulness’ property: if the current state is πt, the next state πt+1 depends only on πt and nothing else

P(πt+1 = k | ‘whatever happened so far’) =

P(πt+1 = k | x1,x2,…,xt,π1,π2,…,πt) =

P(πt+1 = k | πt)

The ‘forgetfulness’ property is part of the definition of an HMM!

What is the probability of x1,x2,...,xn,π1,π2,...,πn?

Example: Take n = 2.

P(x1,x2,π1,π2) = P(x1,π1,x2,π2)

= P(x2,π2 | x1,π1)·P(x1,π1) (conditional probability)

= P(x2 | π2)·P(π2 | x1,π1)·P(x1,π1) (conditional probability)

= P(x2 | π2)·P(π2 | π1)·P(x1,π1) (‘forgetfulness’)

= P(x2 | π2)·P(π2 | π1)·P(x1 | π1)·P(π1) (conditional probability)

= eπ2(x2)·eπ1(x1)·aπ1π2·a0π1

In general, for the sequence x1,x2,...,xn,π1,π2,...,πn, we have that

P(x1,...,xn,π1,...,πn) = Π_{k=1..n} eπk(xk)·aπk-1πk, where aπ0π1 is read as the starting probability a0π1.

[Figure: one particular path through the trellis, e.g. states 2 → 1 → K → … → 2, contributing the transition factors a21, a1K, aK*, a*2 and the emission factors e2(x1), e1(x2), eK(x3), ..., e2(xn).]

Do you want to play?

The Dishonest Casino (cont.)

Suppose the starting probabilities are a0F = ½ and a0L = ½, and that

x = 4 2 5 1 2 6 5 1 2 1
π = F F F F F F F F F F

P(x,π) = a0F·eF(4)·aFF·eF(2)·aFF·eF(5)·aFF·eF(1)·aFF·eF(2)·aFF·eF(6)·aFF·eF(5)·aFF·eF(1)·aFF·eF(2)·aFF·eF(1)

       = ½·(1/6)^10·(0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9

Now suppose that

π = L L L L L L L L L L

P(x,π) = a0L·eL(4)·aLL·eL(2)·aLL·eL(5)·aLL·eL(1)·aLL·eL(2)·aLL·eL(6)·aLL·eL(5)·aLL·eL(1)·aLL·eL(2)·aLL·eL(1)

       = ½·(1/2)^1·(1/10)^9·(0.95)^9 = 0.00000000015756235243 ≈ 0.16 × 10^-9

Now take

x = 6 3 6 6 2 6 5 6 6 1
π = F F F F F F F F F F

P(x,π) = a0F·eF(6)·aFF·eF(3)·aFF·eF(6)·aFF·eF(6)·aFF·eF(2)·aFF·eF(6)·aFF·eF(5)·aFF·eF(6)·aFF·eF(6)·aFF·eF(1)

       = ½·(1/6)^10·(0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9

Now suppose that

π = L L L L L L L L L L

P(x,π) = a0L·eL(6)·aLL·eL(3)·aLL·eL(6)·aLL·eL(6)·aLL·eL(2)·aLL·eL(6)·aLL·eL(5)·aLL·eL(6)·aLL·eL(6)·aLL·eL(1)

       = ½·(1/2)^6·(1/10)^4·(0.95)^9 = 0.00000049238235134735 ≈ 492 × 10^-9

The all-LOADED path is ≈ 100 times more likely!
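To reproduce these numbers, here is a small Python sketch of the joint-probability formula P(x,π) = a0π1·eπ1(x1)·Π aπk-1πk·eπk(xk); it reuses the start, trans and emit arrays from the earlier sketch (my own illustration, not code from the slides).

```python
def joint_prob(x, path, start, trans, emit):
    """P(x, pi): starting probability, then one transition and one emission per step."""
    p = start[path[0]] * emit[path[0], x[0] - 1]
    for k in range(1, len(x)):
        p *= trans[path[k - 1], path[k]] * emit[path[k], x[k] - 1]
    return p

x = [6, 3, 6, 6, 2, 6, 5, 6, 6, 1]
print(joint_prob(x, [0] * 10, start, trans, emit))  # all FAIR   -> about 5.21e-09
print(joint_prob(x, [1] * 10, start, trans, emit))  # all LOADED -> about 4.92e-07
```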

Three Fundamental Problems

• Evaluation Problem (solved by the Forward-Backward algorithm)
  GIVEN: an HMM M and a sequence x
  FIND: P(x | M)

• Decoding Problem (solved by the Viterbi algorithm)
  GIVEN: an HMM M and a sequence x
  FIND: the sequence of states π that maximizes P(x,π | M)

• Learning Problem (solved by the Baum-Welch algorithm)
  GIVEN: an HMM M with unspecified transition/emission probabilities, and a sequence x
  FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)

Let’s not get confused by notation... (lots of different ones!)

P(x | M): probability that x is generated by the model M (the model M consists of the transition and emission probabilities and the architecture of the HMM, that is, the underlying directed graph)

P(x | θ): probability that x is generated by the model M, where θ is the vector of parameter values, that is, the transition and emission probabilities (note that P(x | M) is equivalent to P(x | θ))

P(x): same as P(x | M) and P(x | θ)

P(x,π | M), P(x,π | θ) and P(x,π): probability that x is generated by the model and π is the sequence of states that produced x

[Figure: the HMM trellis (K states per stage, stages 1 through n, emitting x1, x2, x3, ..., xn).]

Decoding Problem for HMMs

GIVEN: Sequence x = x1x2…xn generated by the model M
FIND: Path π = π1π2…πn that maximizes P(x,π)

Formally, the Decoding Problem for HMMs is to find the following, given a sequence x = x1x2…xn generated by the model M:

π* = argmax_π {P(x,π | M)}
P* = max_π {P(x,π | M)} = P(x,π* | M)

Recall that P(x1,...,xn,π1,...,πn) = Π_{k=1..n} eπk(xk)·aπk-1πk.

Let Vk(i) denote the probability of the maximum-probability path from stage 1 to stage i that ends in state k and generates xi in state k. Can we write a (hopefully recursive) equation for Vk(i)?

Vk(i) = max_{π1,...,πi-1} {P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M)}

      = ek(xi)·max_j {ajk·Vj(i-1)}

… (proof using properties of conditional probabilities...)

A recursive equation... so... dynamic programming!

[Figure: stage (i – 1) holds the values V1(i-1), V2(i-1), ..., VK(i-1); each state j feeds state k at stage i through the transition ajk, and state k then emits xi with probability ek(xi).]

How do we start the algorithm?

An initialization step is needed to start the algorithm: add a ‘dummy’ 0th stage with a single start state 0 and set V0(0) = 1 and V1(0) = V2(0) = … = VK(0) = 0. The starting probabilities a01, a02, ..., a0K carry the dummy stage into Stage 1; Stages 1, 2, 3, …, n then follow.

[Figure: the trellis preceded by the dummy 0th stage.]

Viterbi Algorithm for Decoding an HMM

1. Initialization Step. //initialize matrix
   V0(0) = 1; Vj(0) = 0 for all j > 0

2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       Vk(i) = ek(xi)·max_j {ajk·Vj(i-1)}
       Ptrk(i) = argmax_j {ajk·Vj(i-1)}

3. Termination. //recover optimal probability and path
   P* = max_π {P(x,π | M)} = max_j {Vj(n)}
   π*n = argmax_j {Vj(n)}
   for each i = n – 1 downto 1: π*i = Ptrπ*i+1(i+1)

Time: O(K²n); Space: O(Kn)

[Figure: the K × n dynamic-programming matrix with columns x1, x2, x3, ..., xn. The first column is V1(1) = e1(x1)·a01, V2(1) = e2(x1)·a02, ..., VK(1) = eK(x1)·a0K, and, for example, Vk(2) = ek(x2)·max_j {ajk·Vj(1)}.]
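As a concrete illustration, here is a short Python sketch of this algorithm (mine, not from the slides); it assumes the start, trans and emit arrays defined earlier and observation symbols 1..6.

```python
import numpy as np

def viterbi(x, start, trans, emit):
    """Return (best joint probability P*, most probable state path pi*)."""
    n, K = len(x), len(start)
    V = np.zeros((K, n))                 # V[k, i] = best probability ending in state k at stage i+1
    ptr = np.zeros((K, n), dtype=int)
    V[:, 0] = start * emit[:, x[0] - 1]  # stage 1 uses the starting probabilities
    for i in range(1, n):
        for k in range(K):
            scores = trans[:, k] * V[:, i - 1]   # a_jk * V_j(i-1) for every predecessor j
            ptr[k, i] = np.argmax(scores)
            V[k, i] = emit[k, x[i] - 1] * scores[ptr[k, i]]
    # traceback of the optimal path
    path = [int(np.argmax(V[:, -1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    return V[:, -1].max(), path[::-1]

# Example from the slides: x = 2 6 6 gives P* = 0.01128125 and path LOADED, LOADED, LOADED
print(viterbi([2, 6, 6], start, trans, emit))
```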

Computational Example of the Viterbi Algorithm

Recall the Dishonest Casino HMM: states FAIR and LOADED with transitions aFF = aLL = 0.95 and aFL = aLF = 0.05, and emissions P(b | F) = 1/6 for b = 1,...,6, P(b | L) = 1/10 for b = 1,...,5, and P(6 | L) = 1/2.

x = 2 6 6 (with a0F = a0L = 1/2)

Stage 1 (x1 = 2):
  VF(1) = 1/2·1/6 = 1/12
  VL(1) = 1/2·1/10 = 1/20
Stage 2 (x2 = 6):
  VF(2) = 1/6·max{19/20·1/12, 1/20·1/20} = 19/1440 ≈ 0.0131944
  VL(2) = 1/2·max{1/20·1/12, 19/20·1/20} = 19/800 = 0.02375
Stage 3 (x3 = 6):
  VF(3) = 1/6·max{19/20·19/1440, 1/20·19/800} = 361/172800 ≈ 0.0020891
  VL(3) = 1/2·max{1/20·19/1440, 19/20·19/800} = 361/32000 = 0.01128125

P* = 0.01128125 and π* = LLL.

Checking: P(266, LLL) = (1/2)^3·(1/10)·(95/100)^2 = 361/32000 = 0.01128125.

How well does Viterbi perform?

On 300 rolls by the casino, Viterbi is correct 91% of the time!

Problem of Underflow in the Viterbi Algorithm

Vk(i) = max_{π1,...,πi-1} {P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M)} = ek(xi)·max_j {ajk·Vj(i-1)}

The numbers become very small, since probabilities are being multiplied together!

Compute with the logarithms of the probabilities to reduce the occurrence of underflow:

Vk(i) = log(ek(xi)) + max_j {log(ajk) + Vj(i-1)}
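A log-space version of the earlier Viterbi sketch might look like this (illustrative only; zero probabilities become -inf, which the max handles correctly):

```python
import numpy as np

def viterbi_log(x, start, trans, emit):
    """Viterbi in log space; returns log P* and the most probable state path."""
    n, K = len(x), len(start)
    logV = np.full((K, n), -np.inf)
    ptr = np.zeros((K, n), dtype=int)
    with np.errstate(divide="ignore"):            # log(0) -> -inf without warnings
        ls, lt, le = np.log(start), np.log(trans), np.log(emit)
    logV[:, 0] = ls + le[:, x[0] - 1]
    for i in range(1, n):
        for k in range(K):
            scores = lt[:, k] + logV[:, i - 1]    # log(a_jk) + V_j(i-1)
            ptr[k, i] = np.argmax(scores)
            logV[k, i] = le[k, x[i] - 1] + scores[ptr[k, i]]
    path = [int(np.argmax(logV[:, -1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    return logV[:, -1].max(), path[::-1]
```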

Three Fundamental Problems

• Evaluation Problem (solved by the Forward-Backward algorithm)
  GIVEN: an HMM M and a sequence x
  FIND: P(x | M)

• Decoding Problem (solved by the Viterbi algorithm): Done!
  GIVEN: an HMM M and a sequence x
  FIND: the sequence of states π that maximizes P(x,π | M)

• Learning Problem (solved by the Baum-Welch algorithm)
  GIVEN: an HMM M with unspecified transition/emission probabilities, and a sequence x
  FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)

[Figure: the HMM trellis (K states per stage, emitting x1, x2, x3, ..., xn).]

Evaluation Problem for HMMs

GIVEN: Sequence x = x1x2…xn generated by the model M
FIND: P(x), the probability of x given the model M

Formally, the Evaluation Problem for HMMs is to find the following, given a sequence x = x1x2…xn generated by the model M:

P(x) = Σ_π P(x,π | M) = Σ_π P(x | π)·P(π)

There is an exponential number of paths π! Since there are K^n paths, P(x) cannot be computed directly from this sum. So... dynamic programming again!

Let fk(i) denote the probability of the subsequence x1,x2,...,xi of x such that πi = k. The quantity fk(i) is called the forward probability. Can we write a (hopefully recursive) equation for fk(i)?

fk(i) = Σ_{π1,...,πi-1} P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M)

      = ek(xi)·Σ_j {ajk·fj(i-1)}

… (proof using properties of conditional probabilities...)

A recursive equation suitable for dynamic programming.

[Figure: stage (i – 1) holds the forward values f1(i-1), f2(i-1), ..., fK(i-1); each state j feeds state k at stage i through the transition ajk, and state k then emits xi with probability ek(xi).]

How do we start the algorithm?

As with Viterbi, an initialization step with a ‘dummy’ 0th stage is needed: set f0(0) = 1 and f1(0) = f2(0) = … = fK(0) = 0; the starting probabilities a01, a02, ..., a0K carry the dummy stage into Stage 1, and Stages 1, 2, 3, …, n follow.

[Figure: the trellis preceded by the dummy 0th stage.]

Forward Algorithm for Evaluation

1. Initialization Step. //initialize matrix
   f0(0) = 1; fj(0) = 0 for all j > 0

2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       fk(i) = ek(xi)·Σ_j ajk·fj(i-1)

3. Termination. //recover probability of x
   P(x) = Σ_j fj(n)

Time: O(K²n); Space: O(Kn)

[Figure: the K × n dynamic-programming matrix with columns x1, x2, x3, ..., xn. The first column is f1(1) = e1(x1)·a01, f2(1) = e2(x1)·a02, ..., fK(1) = eK(x1)·a0K, and, for example, fk(2) = ek(x2)·Σ_j {ajk·fj(1)}.]
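A matching Python sketch of the forward algorithm (again mine, assuming the earlier start, trans, emit arrays):

```python
import numpy as np

def forward(x, start, trans, emit):
    """Forward algorithm: returns P(x) and the table of forward probabilities f_k(i)."""
    n, K = len(x), len(start)
    f = np.zeros((K, n))
    f[:, 0] = start * emit[:, x[0] - 1]      # f_k(1) = e_k(x1) * a_0k
    for i in range(1, n):
        for k in range(K):
            f[k, i] = emit[k, x[i] - 1] * np.dot(trans[:, k], f[:, i - 1])
    return f[:, -1].sum(), f                 # P(x) = sum over j of f_j(n)

# Example from the slides: x = 2 6 6 gives P(x) = 809/54000 = 0.01498148...
px, f = forward([2, 6, 6], start, trans, emit)
print(px)
```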

Viterbi vs. Forward

Viterbi:
1. Initialization Step. V0(0) = 1; Vj(0) = 0 for all j > 0
2. Main Iteration. for each i = 1 to n, for each k = 1 to K:
   Vk(i) = ek(xi)·max_j {ajk·Vj(i-1)}; Ptrk(i) = argmax_j {ajk·Vj(i-1)}
3. Termination. P* = max_π {P(x,π | M)} = max_j {Vj(n)}; π*n = argmax_j {Vj(n)};
   for each i = n – 1 downto 1: π*i = Ptrπ*i+1(i+1)

Forward:
1. Initialization Step. f0(0) = 1; fj(0) = 0 for all j > 0
2. Main Iteration. for each i = 1 to n, for each k = 1 to K:
   fk(i) = ek(xi)·Σ_j ajk·fj(i-1)
3. Termination. P(x) = Σ_j fj(n)

The only difference: Viterbi takes a maximum where Forward takes a sum.

x = 2 6 6

Stage 1 (x1 = 2):
  fF(1) = 1/2·1/6 = 1/12
  fL(1) = 1/2·1/10 = 1/20
Stage 2 (x2 = 6):
  fF(2) = 1/6·[19/20·1/12 + 1/20·1/20] = 49/3600 ≈ 0.0136111
  fL(2) = 1/2·[1/20·1/12 + 19/20·1/20] = 31/1200 ≈ 0.0258333
Stage 3 (x3 = 6):
  fF(3) = 1/6·[19/20·49/3600 + 1/20·31/1200] = 8/3375 ≈ 0.0023703
  fL(3) = 1/2·[1/20·49/3600 + 19/20·31/1200] = 227/18000 ≈ 0.0126111

P(x) = P(266) = 8/3375 + 227/18000 = 809/54000 = 0.01498148148

Checking against the definition P(x) = Σ_π P(x,π | M), we see that for x = 266,

(1/2·1/10 · 95/100·1/2 · 95/100·1/2) + (1/2·1/10 · 95/100·1/2 · 1/20·1/6) + (1/2·1/10 · 1/20·1/6 · 1/20·1/2) + (1/2·1/10 · 1/20·1/6 · 95/100·1/6) + (1/2·1/6 · 1/20·1/2 · 95/100·1/2) + (1/2·1/6 · 1/20·1/2 · 1/20·1/6) + (1/2·1/6 · 95/100·1/6 · 1/20·1/2) + (1/2·1/6 · 95/100·1/6 · 95/100·1/6) = 0.01498148148

Backward Algorithm: Motivation

Suppose we wish to compute the probability that the ith state is k given the observed sequence of outputs x. (Notice that we would then know the density of the random variable πi.) That is, we must compute

P(πi = k | x) = P(πi = k, x) / P(x)

We start by computing

P(πi = k, x) = P(x1,...,xi,πi = k,xi+1,...,xn)
             = P(x1,...,xi,πi = k) · P(xi+1,...,xn | x1,...,xi,πi = k)
             = fk(i) · bk(i)   (the second factor is new!)

The quantity bk(i) is called the backward probability and is defined by

bk(i) = P(xi+1,...,xn | πi = k)

So then, we have the following equation:

P(πi = k | x) = P(πi = k, x) / P(x) = fk(i)·bk(i) / P(x)

Can we write a (hopefully recursive) equation for bk(i)?

bk(i) = P(xi+1,...,xn | πi = k) = Σ_j ej(xi+1)·akj·bj(i+1)

… (proof using properties of conditional probabilities...)

A recursive equation suitable for dynamic programming.

Backward Algorithm for Evaluation

1. Initialization Step. //initialize matrix
   bj(n) = 1 for all j > 0

2. Main Iteration. //fill in table
   for each i = n - 1 downto 1
     for each k = 1 to K
       bk(i) = Σ_j ej(xi+1)·akj·bj(i+1)

3. Termination. //recover probability of x
   P(x) = Σ_j ej(x1)·bj(1)·a0j

Time: O(K²n); Space: O(Kn)

[Figure: the K × n matrix filled right to left; the last column is b1(n) = b2(n) = ... = bK(n) = 1, and, for example, bk(n - 1) = Σ_j ej(xn)·akj·bj(n).]
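And a matching Python sketch of the backward algorithm (mine, with the same assumed arrays):

```python
import numpy as np

def backward(x, start, trans, emit):
    """Backward algorithm: returns P(x) and the table of backward probabilities b_k(i)."""
    n, K = len(x), len(start)
    b = np.ones((K, n))                               # b_j(n) = 1 for every state j
    for i in range(n - 2, -1, -1):
        for k in range(K):
            b[k, i] = np.sum(emit[:, x[i + 1] - 1] * trans[k, :] * b[:, i + 1])
    px = np.sum(emit[:, x[0] - 1] * b[:, 0] * start)  # termination step
    return px, b

# Example from the slides: x = 2 6 6 again gives P(x) = 0.01498148...
print(backward([2, 6, 6], start, trans, emit)[0])
```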

x = 2 6 6

Stage 3:
  bF(3) = 1
  bL(3) = 1
Stage 2 (using x3 = 6):
  bF(2) = 1/6·19/20·1 + 1/2·1/20·1 = 11/60 ≈ 0.183333
  bL(2) = 1/6·1/20·1 + 1/2·19/20·1 = 29/60 ≈ 0.483333
Stage 1 (using x2 = 6):
  bF(1) = 1/6·19/20·11/60 + 1/2·1/20·29/60 = 37/900 ≈ 0.041111
  bL(1) = 1/6·1/20·11/60 + 1/2·19/20·29/60 = 52/225 ≈ 0.231111

P(x) = P(266) = 1/6·37/900·1/2 + 1/10·52/225·1/2 = 809/54000 = 0.01498148148

Posterior Decoding

There are two types of decoding for an HMM: Viterbi decoding and posterior decoding.

P(πi = k | x) = P(πi = k, x) / P(x) = fk(i)·bk(i) / P(x)

Now we can ask what the most probable state at stage i is. Let π̂i denote this state. Clearly, we have

π̂i = argmax_k {P(πi = k | x)}

Therefore, (π̂1, π̂2,..., π̂n) is the sequence of the most probable states. Notice that this sequence is not (necessarily) the most probable path that the HMM went through to produce x, and it may not even be a valid path!

For x = 266:

π̂1 = argmax{fF(1)·bF(1)/P(x), fL(1)·bL(1)/P(x)}
    = argmax{(1/12·37/900)/(809/54000), (1/20·52/225)/(809/54000)}
    = argmax{185/809, 624/809}
    = argmax{0.2286773795, 0.7713226205}
    = L

π̂2 = argmax{fF(2)·bF(2)/P(x), fL(2)·bL(2)/P(x)}
    = argmax{(49/3600·11/60)/(809/54000), (31/1200·29/60)/(809/54000)}
    = argmax{539/3236, 2697/3236}
    = argmax{0.1665636588, 0.8334363412}
    = L

π̂3 = argmax{fF(3)·bF(3)/P(x), fL(3)·bL(3)/P(x)}
    = argmax{(8/3375·1)/(809/54000), (227/18000·1)/(809/54000)}
    = argmax{128/809, 681/809}
    = argmax{0.1582200247, 0.8417799753}
    = L

The sequence of most probable states given x = 266 is (π̂1, π̂2, π̂3) = (L, L, L).

Note that fπi(k | x) = P(πi = k | x) is the (conditional) density function of the random variable πi.
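A short sketch of posterior decoding built on the forward and backward functions from the earlier sketches (illustrative only):

```python
import numpy as np

def posterior_decode(x, start, trans, emit):
    """Return the posterior probabilities P(pi_i = k | x) and the most probable state per stage."""
    px, f = forward(x, start, trans, emit)
    _, b = backward(x, start, trans, emit)
    post = f * b / px                       # post[k, i] = f_k(i) * b_k(i) / P(x)
    return post, post.argmax(axis=0)

post, best = posterior_decode([2, 6, 6], start, trans, emit)
print(post)   # first column is about (0.229, 0.771), matching the worked example
print(best)   # -> [1, 1, 1], i.e. LOADED at every stage
```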

Viterbi vs. Forward vs. Backward

Viterbi:
1. Initialization Step. V0(0) = 1; Vj(0) = 0 for all j > 0
2. Main Iteration. for each i = 1 to n, for each k = 1 to K:
   Vk(i) = ek(xi)·max_j {ajk·Vj(i-1)}; Ptrk(i) = argmax_j {ajk·Vj(i-1)}
3. Termination. P* = max_π {P(x,π | M)} = max_j {Vj(n)}; π*n = argmax_j {Vj(n)};
   for each i = n – 1 downto 1: π*i = Ptrπ*i+1(i+1)

Forward:
1. Initialization Step. f0(0) = 1; fj(0) = 0 for all j > 0
2. Main Iteration. for each i = 1 to n, for each k = 1 to K:
   fk(i) = ek(xi)·Σ_j ajk·fj(i-1)
3. Termination. P(x) = Σ_j fj(n)

Backward:
1. Initialization Step. bj(n) = 1 for all j > 0
2. Main Iteration. for each i = n - 1 downto 1, for each k = 1 to K:
   bk(i) = Σ_j ej(xi+1)·akj·bj(i+1)
3. Termination. P(x) = Σ_j ej(x1)·bj(1)·a0j

Three Fundamental Problems

• Evaluation Problem (solved by the Forward-Backward algorithm): Done!
  GIVEN: an HMM M and a sequence x
  FIND: P(x | M)

• Decoding Problem (solved by the Viterbi algorithm): Done!
  GIVEN: an HMM M and a sequence x
  FIND: the sequence of states π that maximizes P(x,π | M)

• Learning Problem (solved by the Baum-Welch algorithm)
  GIVEN: an HMM M with unspecified transition/emission probabilities, and a sequence x
  FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)

Two Learning Scenarios

• Learning when states are known
  GIVEN: an HMM M with unspecified transition/emission probabilities, a sequence x, and a state sequence π1,...,πn
  FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)
  For example, the Dishonest Casino dealer allows an observer to watch him changing dice while he produces a large number of rolls.

• Learning when states are unknown
  GIVEN: an HMM M with unspecified transition/emission probabilities and a sequence x
  FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)
  The Dishonest Casino dealer does not allow an observer to watch him changing dice while he produces a large number of rolls.

Learning When States are Known

GIVEN: an HMM M with unspecified transition/emission probabilities, a sequence x, and a state sequence π1,...,πn
FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)

Ajk = # times there is a j → k transition in π1,...,πn
Ek(b) = # times state k in π1,...,πn emits b

The following can be shown to be the maximum likelihood estimators for the parameters in θ (that is, the parameter values that maximize P(x | θ)):

âjk = Ajk / Σ_i Aji
êk(b) = Ek(b) / Σ_c Ek(c)

For example, on the 300 rolls by the casino shown earlier:

âFF = AFF / (AFF + AFL) = 262/(262 + 6) = 0.9776
âFL = AFL / (AFF + AFL) = 6/(262 + 6) = 0.0224
êL(6) = EL(6) / Σ_c EL(c) = 54/95 = 0.5684
êL(1) = EL(1) / Σ_c EL(c) = 8/95 = 0.0842
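A sketch of these counting estimators in Python (my own illustration); it assumes rolls 1..6 and state labels 0/1 as in the earlier arrays, and it ignores the zero-count problem discussed in the next slide, where pseudocounts would be needed.

```python
import numpy as np

def ml_estimate(x, path, K, M):
    """Maximum likelihood estimates of a_jk and e_k(b) from a sequence x with known states."""
    A = np.zeros((K, K))                 # A[j, k] = number of j -> k transitions
    E = np.zeros((K, M))                 # E[k, b] = number of times state k emits symbol b+1
    for i, (sym, st) in enumerate(zip(x, path)):
        E[st, sym - 1] += 1
        if i > 0:
            A[path[i - 1], st] += 1
    a_hat = A / A.sum(axis=1, keepdims=True)   # normalize each row of counts
    e_hat = E / E.sum(axis=1, keepdims=True)
    return a_hat, e_hat

# Tiny usage example with made-up rolls and states (0 = FAIR, 1 = LOADED)
x    = [1, 6, 6, 2, 6, 3]
path = [0, 1, 1, 1, 1, 0]
print(ml_estimate(x, path, K=2, M=6))
```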

Problem of ‘Overfitting’

x = 2 1 5 6 1 2 3 6 2 3
π = F F F F F F F F F F

The counting estimates are then

âFF = 1, âFL = 0, âLL = undefined, âLF = undefined
êF(1) = êF(3) = 0.2, êF(2) = 0.3, êF(4) = 0, êF(5) = 0.1, êF(6) = 0.2

P(x | θ̂) is maximized, but θ̂ is unreasonable! More data is needed to derive sensible parameter values or, as an alternative, pseudocounts can be used.

Learning When States are Unknown

GIVEN: an HMM M with unspecified transition/emission probabilities and a sequence x (the values of Ajk and Ek(b) cannot be computed, since π1,...,πn are unknown)
FIND: the parameter vector θ = (ei(.), aij) that maximizes P(x | θ)

• STEP 1: Estimate our ‘best guess’ of what Ajk and Ek(b) should be
• STEP 2: Update the parameters of the model based on our guess
• Repeat STEPS 1 and 2 until convergence of P(x | θ)

How do we update the current parameters of the model?...

Assume that θcurr represents the current estimates of the HMM parameters. We will derive the new estimate of Ajk (as an example).

First, for each stage i, find the probability that the j → k transition is used from stage i to stage i+1. Assume that θcurr appears in the appropriate places in the formulas below.

P(πi = j, πi+1 = k | x) = P(πi = j, πi+1 = k, x1,...,xn) / P(x)

= P(x1,...,xi, πi = j, πi+1 = k, xi+1,...,xn) / P(x)

= P(πi+1 = k, xi+1,...,xn | πi = j)·P(x1,...,xi, πi = j) / P(x)

= P(πi+1 = k, xi+1,xi+2,...,xn | πi = j)·fj(i) / P(x)

= P(xi+2,...,xn | πi+1 = k)·P(xi+1 | πi+1 = k)·P(πi+1 = k | πi = j)·fj(i) / P(x)

= bk(i+1)·ek(xi+1)·ajk·fj(i) / P(x)

[Figure: the j → k transition between stages i and i+1: fj(i) covers x1,...,xi, the factor ajk covers the transition, ek(xi+1) covers xi+1, bk(i+1) covers xi+2,...,xn, and the whole product is divided by P(x).]

So, we have derived a formula for the probability of a j → k transition from stage i to stage i+1 given the output x and the current values of the parameters:

P(πi = j, πi+1 = k | x, θcurr) = bk(i+1)·ek(xi+1)·ajk·fj(i) / P(x | θcurr)

So, the new value of Ajk (the expected number of j → k transitions) can be found as

Ajk = Σ_i P(πi = j, πi+1 = k | x, θcurr) = Σ_i bk(i+1)·ek(xi+1)·ajk·fj(i) / P(x | θcurr)

In a similar way,

Ek(b) = Σ_{i: xi = b} bk(i)·fk(i) / P(x | θcurr)

To obtain new (updated) values for the parameters of the HMM, we normalize as before. Recall that

âjk = Ajk / Σ_i Aji
êk(b) = Ek(b) / Σ_c Ek(c)

Training Sequences for the Baum-Welch Algorithm

The Baum-Welch algorithm is normally applied to an entire group of sequences that are assumed to have been generated independently by the model. Typically, training sequences are collected over a period of time. Let x^1, x^2, ..., x^r be r training sequences of lengths n1, n2, ..., nr:

x^1 = x^1_1 x^1_2 ... x^1_{n1}
x^2 = x^2_1 x^2_2 ... x^2_{n2}
...
x^r = x^r_1 x^r_2 ... x^r_{nr}

Baum-Welch Algorithm

1. Initialization Step. //initialize parameters
   Pick an initial guess θcurr for the model parameters (best guess or arbitrary)

2. Main Iteration. //refine model parameters by iteration
   repeat
     for each training sequence
       perform the Forward Algorithm
       perform the Backward Algorithm
     calculate the Ajk and Ek(b) given θcurr, using all the training sequences x^1, x^2, ..., x^r
     calculate new model parameters θnew: ajk and ek(b)
     calculate P(x^1, x^2, ..., x^r | θnew) //theory guarantees that this value will increase; note that
       P(x^1, x^2, ..., x^r | θnew) = P(x^1 | θnew)·P(x^2 | θnew)· ... ·P(x^r | θnew) by independence
   until P(x^1, x^2, ..., x^r | θnew) does not change much

3. Termination.
   return θnew as the parameter values

Time: O(K²n) per iteration; Space: O(Kn)
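Below is a compact Baum-Welch sketch for a single training sequence, written to mirror the update formulas above; it reuses the forward and backward functions from the earlier sketches, keeps the starting probabilities fixed, and omits pseudocounts and convergence checks, so treat it as an illustration rather than a full implementation.

```python
import numpy as np

def baum_welch(x, start, trans, emit, n_iter=20):
    """Iteratively re-estimate trans and emit from one observation sequence x."""
    K, M = emit.shape
    for _ in range(n_iter):
        px, f = forward(x, start, trans, emit)
        _, b = backward(x, start, trans, emit)
        A = np.zeros((K, K))
        E = np.zeros((K, M))
        for i in range(len(x) - 1):
            # expected use of each j -> k transition between stages i+1 and i+2
            A += np.outer(f[:, i], emit[:, x[i + 1] - 1] * b[:, i + 1]) * trans / px
        for i, sym in enumerate(x):
            E[:, sym - 1] += f[:, i] * b[:, i] / px    # expected emission counts
        trans = A / A.sum(axis=1, keepdims=True)       # normalize to get theta_new
        emit = E / E.sum(axis=1, keepdims=True)
    return trans, emit

# Usage: refit the casino model from a short, made-up roll sequence (states unknown)
rolls = [1, 6, 6, 3, 6, 6, 2, 6, 1, 4, 6, 6, 5, 2, 6]
print(baum_welch(rolls, start, trans, emit, n_iter=5))
```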

Baum-Welch Algorithm Example

[Figure: a four-state HMM with a begin state B, emitting states 1 and 2, and an end state E.]

Initial transition probabilities: aB1 = 1, aB2 = 0; a11 = 1/2, a12 = 1/2, a1E = 0; a21 = 0, a22 = 0, a2E = 1.

Initial emission probabilities:
e1(A) = 1/4, e1(B) = 3/4
e2(A) = 1/2, e2(B) = 1/2

Training sequences: x1 = ABA, x2 = ABB, x3 = AB

Iteration #1

The Forward and Backward probability tables must be computed for each of the training sequences!

x1 = ABA

Forward probabilities:
            A       B       A
  state 1:  1/4     3/32    3/256
  state 2:  0       1/16    3/128

P(x1) = (0)(3/256) + (1)(3/128) = 3/128

Backward probabilities:
            A       B       A
  state 1:  3/32    1/4     0
  state 2:  0       0       1

P(x1) = (1/4)(3/32)(1) + (1/2)(0)(0) = 3/128

x2 = ABB

Forward probabilities:
            A       B       B
  state 1:  1/4     3/32    9/256
  state 2:  0       1/16    3/128

P(x2) = (3/128)(1) + (9/256)(0) = 3/128

Backward probabilities:
            A       B       B
  state 1:  3/32    1/4     0
  state 2:  0       0       1

P(x2) = (1/4)(3/32)(1) + (1/2)(0)(0) = 3/128

x3 = AB

Forward probabilities:
            A       B
  state 1:  1/4     3/32
  state 2:  0       1/16

P(x3) = (1/16)(1) + (3/32)(0) = 1/16

Backward probabilities:
            A       B
  state 1:  1/4     0
  state 2:  0       1

P(x3) = (1/4)(1/4)(1) + (1/2)(0)(0) = 1/16

All the expected transition counts must be computed. For example,

A12 = [f1(1)·a12·e2(B)·b2(2) + f1(2)·a12·e2(A)·b2(3)] / P(x1)
    + [f1(1)·a12·e2(B)·b2(2) + f1(2)·a12·e2(B)·b2(3)] / P(x2)
    + [f1(1)·a12·e2(B)·b2(2)] / P(x3)

    = [(1/4)·(1/2)·(1/2)·(0) + (3/32)·(1/2)·(1/2)·(1)] / (3/128)
    + [(1/4)·(1/2)·(1/2)·(0) + (3/32)·(1/2)·(1/2)·(1)] / (3/128)
    + [(1/4)·(1/2)·(1/2)·(1)] / (1/16)

    = 3

Likewise, A11 = 2, A21 = 0 = A22.

Computations for the states B and E must be adjusted accordingly. For example,

AB1 = aB1·e1(A)·b1(1) / P(x1) + aB1·e1(A)·b1(1) / P(x2) + aB1·e1(A)·b1(1) / P(x3)
    = (1)·(1/4)·(3/32) / (3/128) + (1)·(1/4)·(3/32) / (3/128) + (1)·(1/4)·(1/4) / (1/16)
    = 3

A2E = f2(3)·a2E / P(x1) + f2(3)·a2E / P(x2) + f2(2)·a2E / P(x3)
    = (3/128)·(1) / (3/128) + (3/128)·(1) / (3/128) + (1/16)·(1) / (1/16)
    = 3

Likewise, A1E = 0 = AB2.

All the expected emission counts must be computed. For example,

E1(A) = [f1(1)·b1(1) + f1(3)·b1(3)] / P(x1) + [f1(1)·b1(1)] / P(x2) + [f1(1)·b1(1)] / P(x3)
      = [(1/4)·(3/32) + (3/256)·(0)] / (3/128) + [(1/4)·(3/32)] / (3/128) + [(1/4)·(1/4)] / (1/16)
      = 3

Likewise, E1(B) = 2, E2(A) = 1, E2(B) = 2.

Finally, all the new parameter values must be computed.

â12 = A12 / (A11 + A12 + A1E) = 3 / (2 + 3 + 0) = 3/5

Similar computations yield the following new transition probabilities:

â11 = 2/5, â12 = 3/5, â1E = 0, â21 = 0, â22 = 0, â2E = 1, âB1 = 1, âB2 = 0

ê1(A) = E1(A) / (E1(A) + E1(B)) = 3 / (3 + 2) = 3/5

Similar computations yield the following new emission probabilities:

ê1(A) = 3/5, ê1(B) = 2/5, ê2(A) = 1/3, ê2(B) = 2/3

Iteration #2

Updated transition probabilities: aB1 = 1, aB2 = 0; a11 = 2/5, a12 = 3/5, a1E = 0; a21 = 0, a22 = 0, a2E = 1.

Updated emission probabilities:
e1(A) = 3/5, e1(B) = 2/5
e2(A) = 1/3, e2(B) = 2/3

Training sequences: x1 = ABA, x2 = ABB, x3 = AB

The Forward and Backward probability tables must be computed for each of the training sequences! Only the Forward probabilities will be computed this time.

x1 = ABA, Forward probabilities:
            A       B         A
  state 1:  3/5     12/125    72/3125
  state 2:  0       6/25      12/625

P(x1) = (0)(72/3125) + (1)(12/625) = 12/625

x2 = ABB, Forward probabilities:
            A       B         B
  state 1:  3/5     12/125    48/3125
  state 2:  0       6/25      24/625

P(x2) = (1)(24/625) + (0)(48/3125) = 24/625

x3 = AB, Forward probabilities:
            A       B
  state 1:  3/5     12/125
  state 2:  0       6/25

P(x3) = (1)(6/25) + (0)(12/125) = 6/25

Iteration #1: P(x1)·P(x2)·P(x3) = (3/128)·(3/128)·(1/16) = 9/262144 ≈ 0.0000343

Iteration #2: P(x1)·P(x2)·P(x3) = (12/625)·(24/625)·(6/25) = 1728/9765625 ≈ 0.0001769

The probability has increased!

A Modeling Example: CpG Islands in DNA Sequence

[Figure: an eight-state HMM with states A+, C+, G+, T+ and A-, C-, G-, T-.]

What are CpG islands and why are they important?

• The frequencies of the four nucleotides A, T, C, and G are fairly stable across the human genome: A ≈ 29.5%, T ≈ 29.6%, C ≈ 20.4%, and G ≈ 20.5%.

• Frequencies of dinucleotides (that is, nucleotide pairs) vary widely across the human genome.

• CG pairs are typically underrepresented, with a frequency well below 1/16. In fact, the CG pair is the least frequent dinucleotide: the C in a CG pair is easily methylated (a methyl group CH3 “joins” the cytosine), and the methyl-C then tends to mutate to a T over the course of evolution by a process called deamination. Thus CG pairs tend to mutate to TG pairs.

• Methylation is suppressed around genes in a genome, and so CpG dinucleotides occur at greater frequencies in and around genes. These high-frequency stretches of DNA are called CpG (or simply CG) islands. (The ‘p’ stands for the phosphodiester bond between the C and the G nucleotide, to emphasize that the C and G are on the same strand of DNA and are not a base pair.)

CpG Islands & Genes

[Figure: a gene with CpG islands at the promoter/5’ end, within the gene body, and near the 3’ end.]

Finding CpG islands is an important problem!

Model of CpG Islands: Architecture

[Figure: eight fully connected states A+, C+, G+, T+ (+: in a CpG island) and A-, C-, G-, T- (-: not in a CpG island).]

Model of CpG Islands: Transitions

The tables below were estimated from many known (experimentally verified) CpG islands and known non-CpG islands (called training sequences).

Transition probabilities within CpG islands (each row sums to 1); emission probabilities are 1/0, e.g., eA+(A) = 1 and eA+(C) = eA+(G) = eA+(T) = 0:

  +     A     C     G     T
  A   .180  .274  .426  .120
  C   .171  .368  .274  .188
  G   .161  .339  .375  .125
  T   .079  .355  .384  .182

Transition probabilities within non-CpG islands (each row sums to 1); emission probabilities are 1/0, e.g., eA-(A) = 1 and eA-(C) = eA-(G) = eA-(T) = 0:

  -     A     C     G     T
  A   .300  .205  .285  .210
  C   .322  .298  .078  .302
  G   .248  .246  .298  .208
  T   .177  .239  .292  .292

Note: The transitions out of each state already add up to one, so there is no room left for transitions between the (+) and (-) states! What do we do?...

Model of CpG Islands: Transitions (cont.)

What about the transitions between the + and – states? Certainly there is a probability (say) p of staying in a CpG island and a probability (say) q of staying in a non-CpG island.

[Figure: the block of + states and the block of – states, with probability p of staying in the + block and 1 – p of leaving it, and probability q of staying in the – block and 1 – q of leaving it.]

Model of CpG Islands: Transitions (cont.)

To estimate the remaining probabilities, use the following steps.

Step 1: Scale all within-block probabilities by a factor of p or q. For example, aA+C+ ← aA+C+ · p, aA-C- ← aA-C- · q, etc.

Step 2: Calculate all the probabilities between the + states and the – states.
  Step 2.1: Let fA-, fC-, fG-, and fT- be the frequencies of A, C, G, and T among the non-CpG nucleotides in the training sequence.
  Step 2.2: Let aA+A- ← fA- · (1 – p), aA+C- ← fC- · (1 – p), etc. Do the same for the – to + transitions.

Step 3: Estimate the probabilities p and q. But how?...

(A small code sketch of Steps 1 and 2 follows.)
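A small Python sketch of Steps 1 and 2 (illustrative; the 8×8 state ordering and the placeholder base-frequency values are my own choices, not from the slides):

```python
import numpy as np

# Within-block transition tables from the slides (rows/columns in order A, C, G, T)
plus  = np.array([[.180, .274, .426, .120],
                  [.171, .368, .274, .188],
                  [.161, .339, .375, .125],
                  [.079, .355, .384, .182]])
minus = np.array([[.300, .205, .285, .210],
                  [.322, .298, .078, .302],
                  [.248, .246, .298, .208],
                  [.177, .239, .292, .292]])

def full_transition_matrix(p, q, f_minus, f_plus):
    """Build the 8x8 matrix over states (A+, C+, G+, T+, A-, C-, G-, T-)."""
    T = np.zeros((8, 8))
    T[:4, :4] = p * plus              # Step 1: stay inside the island, rescaled by p
    T[4:, 4:] = q * minus             # Step 1: stay outside, rescaled by q
    T[:4, 4:] = (1 - p) * f_minus     # Step 2: leave the island into A-, C-, G-, T-
    T[4:, :4] = (1 - q) * f_plus      # Step 2: enter an island into A+, C+, G+, T+
    return T

# Placeholder base frequencies (hypothetical values for illustration only)
f_minus = np.array([0.30, 0.20, 0.20, 0.30])
f_plus  = np.array([0.25, 0.25, 0.25, 0.25])
T = full_transition_matrix(p=299/300, q=0.999, f_minus=f_minus, f_plus=f_plus)
print(T.sum(axis=1))    # every row sums to 1
```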

Geometric Distribution

A random variable X is said to be geometrically distributed if it has a density given by

fX(x) = p·(1 – p)^(x-1), x = 1, 2, ...

Here p is the probability of a success in a series of Bernoulli trials, and X counts the number of trials up to and including the first success.

The expected value and variance of X are easy to remember:

E(X) = 1/p
Var(X) = (1 – p)/p²

Model of CpG Islands: Transitions (cont.)

Let L+ denote the length in nucleotides of a CpG island. L+ is a random variable, and one approach is to model L+ as a geometric random variable (controversial, since CpG islands may not have an exponential length distribution in the genome under study).

P(L+ = 1) = 1 – p   (leaving is considered a success!)
P(L+ = 2) = p(1 – p)
P(L+ = 3) = p²(1 – p)
...
P(L+ = k) = p^(k-1)·(1 – p)

The expected value of L+ is E(L+) = 1/(1 – p). Similarly, E(L-) = 1/(1 – q), where L- is the length in nucleotides of a non-CpG island.

From the training data, compute the average length of a CpG island, set that number equal to E(L+) = 1/(1 – p), and solve for p. Do the same for non-CpG islands. For example, if the average length of a CpG island is 300, then p = 299/300.
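A one-line version of this moment-matching step (illustrative):

```python
def stay_probability(mean_length):
    """Solve E(L) = 1/(1 - p) for p, given the average island length from training data."""
    return 1 - 1 / mean_length

print(stay_probability(300))   # -> 0.99666..., i.e. 299/300
```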

Duration Modeling in Hidden Markov Models

The length distributions of introns and exons show considerable variation.

[Figure: n copies k(1), k(2), ..., k(n-1), k(n) of state k chained together; each copy loops back to itself with probability p and moves forward with probability 1 – p.]

Negative Binomial Distribution

The shortest sequence through the states that can be modeled has length n. Let D denote the duration of state k. (Clearly, D is at least n.) Note that D has a negative binomial distribution with parameters 1 – p (the probability of a success) and n (the number of successes needed):

P(D = L) = C(L-1, n-1)·(1-p)^n·p^(L-n), where C(L-1, n-1) is the binomial coefficient.

Model of CpG Islands: Applications

So what is it good for?

Viterbi Decoding: Given a long strand of DNA, we can decode it using the model!

ATCGTTAGCTACCGACC...  →(Viterbi decoding)→  A-T-C-G-T-T-A-G+C+T+A+C+C+G+A+C+C+...

The run of + states marks a CpG island.

For Example...

Posterior Decoding: Given a long strand of DNA, we can derive the probability distribution of the state at any given position:

ATCGTTAGCTACCGACC...  →(posterior decoding at the ith position)→  fπi(k) = P(πi = k | x)

What if a new genome comes along...?

We just sequenced the porcupine genome.

We know CpG islands play the same role in this genome; that is, they signal the occurrence of a gene.

However, we have no known CpG islands for porcupines.

We suspect the frequency and characteristics of CpG islands are quite different in porcupines and humans.

What do we do...?

Alignment Penalties Revisited: Affine Gap Penalties

If, for example, n ≥ m, then the time needed to run the general gap-penalty algorithm is O(n²m). A compromise between general convex gap penalties and linear gap penalties is the affine gap penalty. In the following discussion, we assume that a deletion will not be followed directly by an insertion; that is, Ix and Iy cannot jump between each other.

γ(n) = d + (n – 1)·e

where d is the gap-opening penalty and e is the gap-extension penalty. For affine gap penalties, there is an implementation of the N-W algorithm that runs in O(nm) time, like the original algorithm.

1. xi is aligned to yj

x1...xi-1 xi

y1...yj-1 yj

2. xi is aligned to a gap

x1...xi-2 xi-1 xi

y1...yj - -

3. yj is aligned to a gap

x1...xi - - -

y1...yj-3 yj-2 yj-1 yj

Updating the Score (cont.)

Updating the score is complicated by the fact that gaps are not all assessed the same penalty: opening a gap is penalized more (typically a lot more!) than extending a group of gaps. Keeping one value F(i,j) does not suffice; instead we keep three:

M(i,j) = optimal score aligning x1x2...xi to y1y2…yj given that xi is aligned to yj

Ix(i,j) = optimal score aligning x1x2...xi to y1y2…yj given that xi is aligned to a gap

Iy(i,j) = optimal score aligning x1x2...xi to y1y2…yj given that yj is aligned to a gap

1. xi is aligned to yj

x1...xi-1 xi

y1...yj-1 yj

2. xi is aligned to a gap

x1...xi-2 xi-1 xi

y1...yj - -

3. yj is aligned to a gap

x1...xi - - -

y1...yj-3 yj-2 yj-1 yj

Updating the Score (cont.)

(We assume that Ix and Iy cannot jump between each other.)

M(i,j) = max { M(i-1,j-1) + m if xi = yj, or M(i-1,j-1) - s if xi ≠ yj;
               Ix(i-1,j-1) + m if xi = yj, or Ix(i-1,j-1) - s if xi ≠ yj;
               Iy(i-1,j-1) + m if xi = yj, or Iy(i-1,j-1) - s if xi ≠ yj }

Ix(i,j) = max { M(i-1,j) – d (opening), Ix(i-1,j) – e (extending) }

Iy(i,j) = max { M(i,j-1) – d (opening), Iy(i,j-1) – e (extending) }

A code sketch of these recurrences follows.
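Here is a compact sketch of global alignment scoring with affine gap penalties along the lines of these recurrences (my own illustration; the scoring values m, s, d, e in the example call are arbitrary, and only the optimal score is returned, not the traceback):

```python
import numpy as np

NEG = float("-inf")

def affine_global_score(x, y, m=2, s=1, d=5, e=1):
    """Optimal global alignment score with match +m, mismatch -s, gap-open -d,
    gap-extend -e, and no direct Ix <-> Iy transitions."""
    n, L = len(x), len(y)
    M  = np.full((n + 1, L + 1), NEG)
    Ix = np.full((n + 1, L + 1), NEG)
    Iy = np.full((n + 1, L + 1), NEG)
    M[0, 0] = 0.0
    for i in range(1, n + 1):                 # leading gaps in y
        Ix[i, 0] = -d - (i - 1) * e
    for j in range(1, L + 1):                 # leading gaps in x
        Iy[0, j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, L + 1):
            sc = m if x[i - 1] == y[j - 1] else -s
            M[i, j]  = sc + max(M[i - 1, j - 1], Ix[i - 1, j - 1], Iy[i - 1, j - 1])
            Ix[i, j] = max(M[i - 1, j] - d, Ix[i - 1, j] - e)
            Iy[i, j] = max(M[i, j - 1] - d, Iy[i, j - 1] - e)
    return max(M[n, L], Ix[n, L], Iy[n, L])

print(affine_global_score("VLSPADK", "HLAESK"))
```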

HMMs for Sequence Alignment

[Figure: the Needleman-Wunsch affine-gap recurrences drawn as a three-state finite-state automaton. State M has Δ = (+1,+1) and score increment s(xi,yj); state Ix has Δ = (+1,0); state Iy has Δ = (0,+1). Entering Ix or Iy from M costs d, and staying in Ix or Iy costs e.]

Δ(i,j) = change in indices when the state is entered

s(xi,yj) = +m if xi = yj, and -s if xi ≠ yj

Needleman-Wunsch Algorithm With Affine Gap Penalties (FSA Representation)

•The recursive equations of dynamic programming have an elegant representation as a finite-state automaton (FSA).

•A finite state machine (FSM) or finite state automaton (plural: automata), or simply a state machine, is a model of behavior composed of a finite number of states, transitions between those states, and actions.

•The new value of the state variable at indices (i,j) is the maximum of the scores corresponding to the transitions coming into the state.

•Each transition score is given by the values of the source state at the offsets specified by the Δ(i,j) of the target state, plus the specified score increment.

•This type of representation corresponds to a finite-state automaton (FSA) common in computer science.

•An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the Δ(i,j) values in the states.

Consider the alignment

x = V L S P A D - K
y = H L - - A E S K

The corresponding path through the FSA visits the states M M Ix Ix M M Iy M (one state per alignment column).

Pair HMMs

• We would like to transform the FSA for the Needleman-Wunsch affine gap penalty algorithm into an HMM.

• Why? The HMM machinery lets us use the resulting probabilistic model to explore questions about the reliability of the alignment obtained by dynamic programming, and to explore alternative (suboptimal) alignments. By weighting all alternatives probabilistically, we will be able to score the similarity of two sequences independently of any specific alignment. We can also build more specialized probabilistic models out of simple pieces, to model more complex versions of sequence alignment.

• How? Two issues need to be resolved: emission probabilities and transition probabilities must be established. We will keep the parameters arbitrary; the model must be fitted to training data to estimate the parameter values.

Probabilistic Model

[Figure: a three-state pair HMM. State M emits an aligned pair (xi, yj) with probability pxiyj; state Ix emits xi (aligned to a gap) with probability pxi; state Iy emits yj with probability pyj. Transitions: M → Ix and M → Iy with probability δ each and M → M with 1 – 2δ; Ix → Ix (and Iy → Iy) with probability ε, and Ix → M (and Iy → M) with probability 1 – ε.]

Something is missing... a begin and an end state!

Pair HMM With Begin and End States

[Figure: the same three emitting states plus a begin state B and an end state E. From B and from M: go to Ix or Iy with probability δ each, to M with 1 – 2δ – τ, and to E with τ. From Ix and from Iy: stay with probability ε, return to M with 1 – ε – τ, and go to E with τ. Emissions as before: M emits (xi, yj) with pxiyj, Ix emits xi with pxi, Iy emits yj with pyj.]

Probability of an Alignment

Consider again the alignment

x = V L S P A D - K
y = H L - - A E S K

with state path M M Ix Ix M M Iy M. Reading off one transition and one emission per column gives

P = (1 – 2δ – τ)·pVH · (1 – 2δ – τ)·pLL · δ·pS · ε·pP · (1 – ε – τ)·pAA · (1 – 2δ – τ)·pDE · δ·pS · (1 – ε – τ)·pKK · τ

What is the most probable alignment?... Use Viterbi!

• All the algorithms we have seen for HMMs apply, for example the Viterbi algorithm, forward-backward, etc.
• There is an extra dimension in the search space because of the extra emitted sequence.
• Instead of Vk(i), we will use Vk(i,j), because an observation of xi does not necessarily mean an observation of yj.
• Imagine we have two clocks, one for the sequence x and one for the sequence y, that run differently in different time zones.
• Vk(i,j) can only advance in certain ways:
  1. In time zone M, both i and j advance
  2. In time zone Ix, only i advances
  3. In time zone Iy, only j advances

Viterbi Algorithm for Decoding a Pair HMM

1. Initialization Step. //initialize three matrices
   VM(0,0) = 1; VM(i,0) = VM(0,j) = 0 for i, j > 0
   VX(0,0) = 0; VX(i,0) = VX(0,j) = 0 for i, j > 0
   VY(0,0) = 0; VY(i,0) = VY(0,j) = 0 for i, j > 0

2. Main Iteration. //fill in three tables
   for each i = 1 to m
     for each j = 1 to n
       VM(i,j) = pxiyj·max {(1 – 2δ – τ)·VM(i-1,j-1), (1 – ε – τ)·VX(i-1,j-1), (1 – ε – τ)·VY(i-1,j-1)}
       VX(i,j) = pxi·max {δ·VM(i-1,j), ε·VX(i-1,j)}
       VY(i,j) = pyj·max {δ·VM(i,j-1), ε·VY(i,j-1)}
       //keep pointers so that the most probable alignment can be reconstructed

3. Termination. //recover optimal probability and path
   P* = τ·max {VM(m,n), VX(m,n), VY(m,n)}
   //use pointers to reconstruct the most probable alignment

Time: O(mn); Space: O(mn)
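A Python sketch of this pair-HMM Viterbi recursion (illustrative; it returns only the probabilities and omits the traceback pointers, and the emission table for the DNA example below is symmetrized, e.g. pTA = pAT, which the slides list only one way):

```python
import numpy as np

def pair_viterbi(x, y, p_match, p_single, delta, eps, tau):
    """Return (tau * max of the three final cells, the VM/VX/VY matrices)."""
    m, n = len(x), len(y)
    VM = np.zeros((m + 1, n + 1))
    VX = np.zeros((m + 1, n + 1))
    VY = np.zeros((m + 1, n + 1))
    VM[0, 0] = 1.0                                    # only VM starts at 1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            VM[i, j] = p_match[x[i - 1] + y[j - 1]] * max(
                (1 - 2 * delta - tau) * VM[i - 1, j - 1],
                (1 - eps - tau) * VX[i - 1, j - 1],
                (1 - eps - tau) * VY[i - 1, j - 1])
            VX[i, j] = p_single[x[i - 1]] * max(delta * VM[i - 1, j], eps * VX[i - 1, j])
            VY[i, j] = p_single[y[j - 1]] * max(delta * VM[i, j - 1], eps * VY[i, j - 1])
    return tau * max(VM[m, n], VX[m, n], VY[m, n]), (VM, VX, VY)

# Parameters of the DNA example from the later slides
p_match = {"TT": 0.5, "CC": 0.5, "AA": 0.5, "GG": 0.5,
           "CT": 0.05, "TC": 0.05, "AG": 0.05, "GA": 0.05,
           "AT": 0.3, "TA": 0.3, "GC": 0.3, "CG": 0.3,
           "GT": 0.15, "TG": 0.15, "AC": 0.15, "CA": 0.15}
p_single = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(pair_viterbi("TAG", "TTACG", p_match, p_single, delta=0.2, eps=0.1, tau=0.1)[0])  # about 1e-05
```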

Remark

With the initialization conditions of the Viterbi algorithm for the pair HMM as suggested above (Durbin et al., 1998, p 84), the resulting alignment of two sequences will always start with a matched pair x1, y1 for any two sequences x and y. Hence the alignment generated by a pair HMM with such a restriction on the initialization step may not be the optimal one.

Question

How to change the initialization condition to allow for alignments starting with a gap aligned to a letter in x or y?

Optimal Log-Odds Alignment

• In log-odds terms, we can compute with an additive model that uses log-odds emission scores and log-odds transition scores.
• In practice, this is normally the most practical way to implement a pair HMM.
• It is possible to merge the emission scores with the transitions as

s(a,b) = log( p_ab / (p_a·p_b) ) + log( (1 – 2δ – τ) / (1 – τ)² )

d = –log( δ·(1 – ε – τ) / ((1 – τ)·(1 – 2δ – τ)) )

e = –log( ε / (1 – τ) )

to produce scores that correspond to the standard terms used in sequence alignment by dynamic programming.

Example: Pair HMM

The pair HMM with begin and end states shown above generates two aligned DNA sequences x and y. State M emits aligned pairs of nucleotides with emission probabilities Pxiyj defined as follows: PTT = PCC = PAA = PGG = 0.5, PCT = PAG = 0.05, PAT = PGC = 0.3, PGT = PAC = 0.15.

The insert states X and Y emit (aligned with gaps) symbols from sequences x and y, respectively. The emission probabilities are the same for both insert states: pA = pC = pG = pT = 0.25.

No symbols are emitted by the begin and end states. The values of the other parameters are: δ = 0.2, ε = 0.1, τ = 0.1.

Pair HMM: Viterbi

Now use the Viterbi algorithm to find the optimal alignment of the DNA sequences x = TAG and y = TTACG.

Answer:
VM(0,0) = 1; VM(i,0) = VM(0,j) = 0 for i, j > 0
VX(0,0) = 0; VX(i,0) = VX(0,j) = 0 for i, j > 0
VY(0,0) = 0; VY(i,0) = VY(0,j) = 0 for i, j > 0

We start the calculations with

VM(1,1) = pTT·(1 – 2δ – τ)·VM(0,0) = 0.25, VX(1,1) = 0, VY(1,1) = 0

and continue with the Viterbi equations above, filling the computed probability values into the cells of the V matrix (below). At the termination step we have

P(x, y, π*) = τ·max{VM(3,5), VX(3,5), VY(3,5)} = 10^-5

The traceback through the V matrix gives the optimal path

π*: VM(1,1) → VY(1,2) → VM(2,3) → VY(2,4) → VM(3,5)

which corresponds to the alignment

y = T T A C G
x = T - A - G

The matrix of probability values V(i,j) determined by the Viterbi algorithm; each cell (i,j) holds the three values VM(i,j), VX(i,j), VY(i,j), written top to bottom (entries on the optimal path were highlighted on the original slide):

            j=0       j=1        j=2        j=3        j=4        j=5
 V(i,j)      -         T          T          A          C          G
 i=0  -     1         0          0          0          0          0
            0         0          0          0          0          0
            0         0          0          0          0          0
 i=1  T     0         0.25       0          0          0          0
            0         0          0          0          0          0
            0         0          0.0125     0.0003125  7.813e-06  1.953e-07
 i=2  A     0         0          0.0375     0.005      3.75e-05   3.125e-07
            0         0.0125     0          0          0          0
            0         0          0          0.001875   2.5e-04    6.25e-06
 i=3  G     0         0          0.0015     0.0009375  7.5e-04    1.0e-04
            0         0.0003125  0.001875   2.5e-04    1.875e-06  1.563e-08
            0         0          0          7.5e-05    4.688e-05  3.75e-05

Pair HMM: Forward

Now find P(x,y) for the DNA sequences x = TAG and y = TTACG using the forward algorithm.

Answer (the forward variables are tabulated below):

Initial values: fM(0,0) = 1, fX(0,0) = fY(0,0) = 0, and f(i,-1) = f(-1,j) = 0 for any i, j and any of the M, X, Y matrices.

Main iteration:

fM(i,j) = pxiyj·[(1 – 2δ – τ)·fM(i-1,j-1) + (1 – ε – τ)·fX(i-1,j-1) + (1 – ε – τ)·fY(i-1,j-1)]
fX(i,j) = pxi·[δ·fM(i-1,j) + ε·fX(i-1,j)]
fY(i,j) = pyj·[δ·fM(i,j-1) + ε·fY(i,j-1)]

At the termination step we have

P(x,y) = τ·(fM(3,5) + fX(3,5) + fY(3,5)) = 3.632 × 10^-5

Forward variables f(i,j) determined by the forward algorithm; each cell (i,j) holds the three values fM(i,j), fX(i,j), fY(i,j), written top to bottom:

            j=0        j=1        j=2        j=3        j=4        j=5
 f(i,j)      -          T          T          A          C          G
 i=0  -     1          0          0          0          0          0
            0          0          0          0          0          0
            0          0.05       1.25e-03   3.125e-05  7.813e-07  1.953e-08
 i=1  T     0          0.25       2.00e-02   3.000e-04  1.250e-06  9.375e-08
            0.05       0          0          0          0          0
            0          0          1.25e-02   1.313e-03  4.781e-05  1.258e-06
 i=2  A     0          0.012      3.75e-02   1.000e-02  1.800e-04  1.944e-06
            1.25e-03   0.0125     1.00e-03   1.500e-05  6.250e-08  4.688e-09
            0          0          6.00e-04   1.890e-03  5.473e-04  2.268e-05
 i=3  G     0          1.50e-04   2.40e-03   1.002e-03  1.957e-03  2.639e-04
            3.125e-05  9.125e-04  1.90e-03   5.004e-04  9.002e-06  9.730e-08
            0          0          7.50e-06   1.202e-04  5.308e-05  9.919e-05

Pair HMM: An Example Question

Question: For the sequences x = TAG and y = TTACG, find the posterior probability of the optimal alignment π* obtained by the Viterbi algorithm for the pair HMM described above.

Answer: The posterior probability of the path π* is given by

p(π* | x, y) = P(x, y, π*) / P(x, y)

From the previous calculations we have P(x, y, π*) = 10^-5 and P(x, y) = 3.632 × 10^-5, therefore

p(π* | x, y) = 10^-5 / (3.632 × 10^-5) = 0.2753