Chapter 4: Hidden Markov Models - Columbia · PDF file · 2007-10-11Chapter 4:...

Prof. Yechiam Yemini (YY)

Computer Science Department, Columbia University

Chapter 4: Hidden Markov Models

4.3 HMM Training

Overview

Learning HMM parameters
  Supervised learning
  Unsupervised learning (Viterbi, EM/Baum-Welch)

The Training Problem

HMM = Topology + Statistical parameters
  Topology is designed
  Training = compute the statistical parameters
Training algorithm:
  Input: a DB of sample HMM behaviors {<X,E>}
  Output: transition/emission probabilities for the HMM

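[Figure: two-state HMM (Start, F, B). Known emissions: e(F,H)=0.5, e(F,T)=0.5. Unknown: e(B,H), e(B,T) and all transition probabilities. Training maps the sample DB below to these parameters.]

Sample DB (hidden path, emitted sequence):
  FFFBFF  HHTHTH
  BFFBFF  THTHTH
  FFBFFF  THHTTH
  BFFFBF  THHTTT
  FFFBBF  HHTHHT
  BFFFFF  HHTTHT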

Supervised vs Unsupervised

Supervised training: use statistics of known samples
  The trainer knows the underlying state sequence for each sample
  Estimates transition/emission probabilities from the sample DB
  E.g., CpG islands; gene components
Unsupervised training (learning): update HMM parameters based on new samples
  How?

Supervised Training: Counting Frequencies

Use the sample DB to count # transitions, # emissions
  A(i,j) = # of i→j transitions in the sample DB
  E(i,S) = # of emissions of symbol S at state i
Estimate θ={a(i,j),e(i,S)} using the respective frequencies
  a(i,j) = A(i,j)/Σ_k A(i,k)
  e(i,S) = E(i,S)/Σ_S' E(i,S')
A(i,j) or E(i,S) are often zero
  E.g., state i is never visited by the sample DB
  Include a correction (pseudocounts):
    A(i,j) = # of i→j transitions + r(i,j)
    E(i,S) = # of emissions of S at state i + r'(i,S)
  r(i,j) and r'(i,S) are typically selected from some prior probability
  Laplace rule: r(i,j)=r'(i,S)=1
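The counting rule and the correction term can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate`, the label "S" for the Start state, and the use of one constant pseudocount `r` for every transition and emission are choices made here, not part of the slides.

```python
from collections import defaultdict

def estimate(samples, states, symbols, r=1.0):
    """Frequency estimates of a(i,j) and e(i,S), adding pseudocount r to every count.

    samples: list of (hidden path, emitted sequence) string pairs.
    "S" is an assumed label for the silent Start state.
    """
    A = defaultdict(float)  # A[(i, j)] = # of i->j transitions
    E = defaultdict(float)  # E[(i, s)] = # of emissions of s at state i
    for path, emitted in samples:
        prev = "S"
        for state, sym in zip(path, emitted):
            A[(prev, state)] += 1
            E[(state, sym)] += 1
            prev = state
    a, e = {}, {}
    for i in ["S"] + list(states):
        total = sum(A[(i, k)] + r for k in states)
        for j in states:
            a[(i, j)] = (A[(i, j)] + r) / total
    for i in states:
        total = sum(E[(i, s)] + r for s in symbols)
        for s in symbols:
            e[(i, s)] = (E[(i, s)] + r) / total
    return a, e

# Toy sample in which state B is never visited.
a, e = estimate([("FFF", "HTH")], states="FB", symbols="HT", r=1.0)
print(a[("F", "B")])  # (0 + 1) / ((2 + 1) + (0 + 1)) = 0.25
print(e[("B", "H")])  # (0 + 1) / ((0 + 1) + (0 + 1)) = 0.5
```

With r=0 the same toy sample would divide by zero for the unvisited state B, which is the situation the correction is meant to avoid.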

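Example

Suppose the sample DB is (sequence #, hidden path, emitted sequence):
  1  FFFBFF  HHTHTH
  2  BFFBFF  THTHTH
  3  FFBFFF  THHTTH
  4  FFFFBF  THTTTH
  5  BFFFBF  THHTHT
  6  FFFBBF  HHTHHT
  7  BFFFFF  HHTTHT
  8  BFBFFF  HTTTHH

[Figure: two-state HMM (Start, F, B) with e(F,H)=0.5, e(F,T)=0.5; e(B,H), e(B,T) and the transition probabilities are to be estimated.]

Count transition frequencies:
  A(S,F)=4; a(S,F)=4/8
    {Start⇒F} transitions occur in sequences (1,3,4,6)
  A(F,B)=7; a(F,B)=7/(4*4+3*4)=1/4
    {F⇒B} transitions occur at 7 locations (1_3, 2_3, 3_2, 4_4, 5_4, 6_3, 8_2), where k_i denotes location i of sequence k
    The denominator counts all transitions out of F: four sequences contribute 4 each and four contribute 3 each
Emission frequencies:
  E(B,H)=8; e(B,H)=8/12
    <B,H> emissions occur at 8 locations (1_4, 2_4, 3_3, 5_5, 6_4, 6_5, 7_1, 8_1); state B is visited 12 times in total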
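The three estimates in this example can be re-derived mechanically. The sketch below recounts the eight samples with `collections.Counter` and exact fractions; the label "S" for the Start state is an assumption, and the order of the samples does not affect the counts.

```python
from collections import Counter
from fractions import Fraction

# (hidden path, emitted sequence) pairs of the example DB
db = [
    ("FFFBFF", "HHTHTH"),
    ("BFFBFF", "THTHTH"),
    ("FFBFFF", "THHTTH"),
    ("FFFFBF", "THTTTH"),
    ("BFFFBF", "THHTHT"),
    ("FFFBBF", "HHTHHT"),
    ("BFFFFF", "HHTTHT"),
    ("BFBFFF", "HTTTHH"),
]

trans, emit = Counter(), Counter()
for path, emitted in db:
    trans[("S", path[0])] += 1            # Start -> first state
    for i, j in zip(path, path[1:]):      # consecutive state pairs
        trans[(i, j)] += 1
    for state, sym in zip(path, emitted): # emissions
        emit[(state, sym)] += 1

a_SF = Fraction(trans[("S", "F")], trans[("S", "F")] + trans[("S", "B")])
a_FB = Fraction(trans[("F", "B")], trans[("F", "F")] + trans[("F", "B")])
e_BH = Fraction(emit[("B", "H")], emit[("B", "H")] + emit[("B", "T")])
print(a_SF, a_FB, e_BH)  # 1/2 1/4 2/3
```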

The Catch: Acquiring DB of Supervised Samples

Need to extract training samples
  Fortunately, many biological sequences admit such samples
    E.g., protein secondary structure regions
    E.g., start/stop codon regions
  However, manual sampling is often difficult and imprecise, and sometimes it is not feasible

This is why unsupervised training (learning) is important

Viterbi Learning (Unsupervised)

Key idea: iterative improvement of model parameters
Input:
  An HMM model with initial parameters θ[0]={a(i,j),e(i,S)}
  A DB of training sequences D
Iterate until convergence:
  1) Compute the Viterbi paths of D: V[t]=V[D,θ[t]]
  2) Count frequencies F[V[t]]={A(i,j), E(i,S)}
  3) Update {a(i,j),e(i,S)}: θ[t+1]⇐F[V[t]]
How do we tell convergence?
  E.g., Δ(θ[t],θ[t+1])<ε where Δ is a distance metric
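The three-step loop can be sketched as follows for a two-state, two-symbol model. Everything named here is an assumption made for illustration: the function names, the label "S" for the Start state, a constant pseudocount of 0.01 added to every count, and the largest absolute parameter change as the distance Δ. The three sequences and the starting values 0.5/0.5/0.8/0.2 are chosen only so that there is something to run.

```python
import math

STATES, SYMBOLS, START = "FB", "HT", "S"

def viterbi(x, a, e):
    """Most probable state path for x, computed with log2 scores."""
    score = {k: math.log2(a[(START, k)] * e[(k, x[0])]) for k in STATES}
    back = []
    for sym in x[1:]:
        ptr, new = {}, {}
        for k in STATES:
            best = max(STATES, key=lambda i: score[i] + math.log2(a[(i, k)]))
            ptr[k] = best
            new[k] = score[best] + math.log2(a[(best, k)] * e[(k, sym)])
        back.append(ptr)
        score = new
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers to the first position
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

def count_and_normalize(paths, seqs, pseudo):
    """Step 2 and 3: count along the Viterbi paths, then normalize each row."""
    A = {(i, j): pseudo for i in START + STATES for j in STATES}
    E = {(k, s): pseudo for k in STATES for s in SYMBOLS}
    for path, x in zip(paths, seqs):
        prev = START
        for k, s in zip(path, x):
            A[(prev, k)] += 1
            E[(k, s)] += 1
            prev = k
    a = {(i, j): A[(i, j)] / sum(A[(i, k)] for k in STATES) for (i, j) in A}
    e = {(k, s): E[(k, s)] / sum(E[(k, t)] for t in SYMBOLS) for (k, s) in E}
    return a, e

def viterbi_training(seqs, a, e, pseudo=0.01, eps=1e-6, max_iter=50):
    for _ in range(max_iter):
        paths = [viterbi(x, a, e) for x in seqs]                    # step 1
        new_a, new_e = count_and_normalize(paths, seqs, pseudo)     # steps 2, 3
        delta = max(abs(new_a[k] - a[k]) for k in a) + max(abs(new_e[k] - e[k]) for k in e)
        a, e = new_a, new_e
        if delta < eps:  # assumed convergence test
            break
    return a, e, paths

seqs = ["HHTHTH", "THTHTH", "TTHTTH"]
a0 = {(i, j): 0.5 for i in START + STATES for j in STATES}
e0 = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.8, ("B", "T"): 0.2}
first_paths = [viterbi(x, a0, e0) for x in seqs]
print(first_paths)  # ['BBFBFB', 'FBFBFB', 'FFBFFB']
a, e, paths = viterbi_training(seqs, a0, e0)
```

The pseudocount keeps every probability positive, so the logarithms in `viterbi` stay defined from one iteration to the next.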

Viterbi Learning

Suppose the sample DB is:
  D = HHTHTH, THTHTH, THHTTH, THTTTH, HTTTHH, THHTTT, HHTHHT, HHTTHT

Initialize estimates:
  θ[0]: e(F,H)=0.5, e(F,T)=0.5, e(B,H)=.9, e(B,T)=.1; all transition probabilities (Start→F, Start→B, F→F, F→B, B→F, B→B) are .5

Step 1: Compute Viterbi paths V[t]

  Viterbi trellis for the first sequence (Start = 1 at position 0):
    position   1        2        3        4        5        6
    symbol     H        H        T        H        T        H
    F          .25      .1125    .0506    .0127    .0057    .0014
    B          .45      .2025    .0101    .0228    .0011    .0026

  V[t]: HHTHTH → BBFBFB

Viterbi Learning

Suppose the sample DB is:
  D = HHTHTH, THTHTH, TTHTTH, TTTHTH, HTTTHH, THHTTT, HHTHHT, HHTTHT

Initialize estimates (use logarithms to base 2):
  θ[0]: e(F,H)=-1, e(F,T)=-1, e(B,H)=-.32, e(B,T)=-2.32 (probabilities 0.8 and 0.2); all transition log-probabilities are -1

Step 1: Compute Viterbi paths V[t]

  Viterbi trellis for the first sequence (Start = 0 at position 0):
    position   1        2        3        4        5        6
    symbol     H        H        T        H        T        H
    F          -2       -3.32    -4.64    -6.64    -7.96    -9.96
    B          -1.32    -2.64    -5.96    -5.96    -9.28    -9.28

  V[t]: HHTHTH → BBFBFB

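Compute Viterbi Paths For Rest of D

D = HHTHTH, THTHTH, TTHTTH, TTTHTH, HTTTHH, THHTTT, HHTHHT, HHTTHT

θ[0] (log base 2): e(F,H)=-1, e(F,T)=-1, e(B,H)=-.32, e(B,T)=-2.32; all transition log-probabilities are -1

Viterbi trellis for THTHTH (Start = 0 at position 0):
  position   1        2        3        4        5        6
  symbol     T        H        T        H        T        H
  F          -2       -4       -5.32    -7.32    -8.64    -10.64
  B          -3.32    -3.32    -6.64    -6.64    -9.96    -9.96

Viterbi trellis for TTHTTH (Start = 0 at position 0):
  position   1        2        3        4        5        6
  symbol     T        T        H        T        T        H
  F          -2       -4       -6       -7.32    -9.32    -11.32
  B          -3.32    -5.32    -5.32    -8.64    -10.64   -10.64

V[1] =
  HHTHTH → BBFBFB
  THTHTH → FBFBFB
  TTHTTH → FFBFFB
  …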

Viterbi Learning

Step 2: update θ[t+1]⇐F[V[t]]

V[1] =
  HHTHTH → BBFBFB
  THTHTH → FBFBFB
  TTHTTH → FFBFFB
  …

Estimate model parameters from V[t]:

  a(i,j)     F        B
  S          2/3      1/3
  F          2/9      7/9
  B          5/6      1/6

  e(i,s)     H        T
  F          0.01/9   8.99/9     (Laplace correction)
  B          8.99/9   0.01/9

Convert to log (base 2):
  θ[t]: a(S,F)=-0.58, a(S,B)=-1.58, a(F,F)=-2.17, a(F,B)=-0.36, a(B,F)=-0.26, a(B,B)=-2.58
        e(F,H)=-9.8, e(F,T)=0, e(B,H)=0, e(B,T)=-9.8

Iterate steps 1,2

Notes

Viterbi Learning: θ[t+1]⇐ F[V[t]]⇐V[t]⇐ θ[t]
Observations:
  Sequential vs. batch learning
  The starting model is important (e.g., consider a(.,.)=0.5 e(.,.)=0.5)
  Does the iteration converge? Is the limit model unique?

Need a formal base to consider such questions

Learning = Estimating Model Parameters

Estimation is a well-developed field of statistics
  Wish to estimate a parameterized probabilistic model from sample observations
  HMM: estimate the transition and emission probabilities from sample sequences
  Estimate the “best” θ={a(i,j),e(i)} to explain the observed sequences {Xs}
A typical estimation setting
  Assume: random data X depends on a parameterized model θ according to a probability f(x,θ)=P(X=x|θ)
  Given: a set {Xs} of sample observations of X
  Compute: an estimate of θ that “best” explains {Xs} [“Best” = min variance, max likelihood, …]
  E.g., θ is a signal transmitted by a source; {Xs} are sample observations by a receiver
  E.g., nature operates as an HMM; θ={a(i,j),e(i)} is a signal transmitted by nature to describe the HMM; {Xs} are sample observations of this signal by biologists

Maximum Likelihood Estimation (MLE)

The MLE problem:
  Consider a random event X, with distribution f(x,θ)=P(X=x|θ)
  Given: independent samples D={x_s} of X
  Compute: θ(D) maximizing L(θ)=log P(x_1…x_r|θ)
    $L(\theta) = \log \prod_s f(x_s,\theta) = \sum_s \log f(x_s,\theta)$

A Biased Coin Example
  X∈{H,T}; define θ=P(X=H) ⇒ f(H,θ)=θ, f(T,θ)=1-θ
  Let D={H,H,H,T,H,T}; intuitively θ(D)=frequency(H,X)=2/3
  L(θ)=Σ_s log f(x_s,θ)=4 log θ + 2 log(1-θ)
  Find θ that maximizes L(θ): ∂L(θ)/∂θ=0 ⇒ 4/θ-2/(1-θ)=0 ⇒ θ=2/3

[Figure: plot of the likelihood $P(D\mid\theta)=\theta^4(1-\theta)^2$ against θ, with its maximum at θ*=2/3]
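The coin example is small enough to check numerically. The sketch below compares the closed-form solution of ∂L/∂θ=0 with a coarse grid search; the grid of 999 points is an arbitrary choice made for this illustration.

```python
import math

data = "HHHTHT"
n_heads, n_tails = data.count("H"), data.count("T")

def log_likelihood(theta):
    """L(θ) = n_H log θ + n_T log(1-θ) for the sample above."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

# Closed form from dL/dθ = 0: θ* = n_H / (n_H + n_T)
theta_star = n_heads / (n_heads + n_tails)

# Independent check: evaluate L on a grid in (0, 1) and keep the best point
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_star, theta_grid)
```

The grid maximizer agrees with 2/3 up to the grid spacing, which is all a search of this resolution can show.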

MLE Notes

The MLE of θ is the value θ* maximizing L(θ)=Σ_s log f(x_s,θ)

The MLE can be computed by solving ∂L(θ)/∂θ=0, or by using any optimization algorithm (e.g., gradient search)

This may lead to a local maximum

MLE Training of HMM

Supervised Training:
  Let Z=(X, π) where X=emitted sequence and π=hidden path
  The HMM parameters are θ={a(i,j),e(i)}
  Supervised learning: the sample DB is D={Z_s} with observed samples Z_s=(X_s, π_s)
MLE for a discrete distribution reduces to counting frequencies
  Suppose Z has a finite # of values {V_1,V_2,…,V_k} with P(Z=V_j)=f(V_j,θ)=θ_j
  The parameter θ=(θ_1,θ_2,…,θ_k) is the distribution of Z
  For a DB of Z samples D={z_s} define n_j = # occurrences of V_j in D, and let n=n_1+…+n_k and f_j=n_j/n be the frequency of V_j
  $$\frac{1}{n}L(\theta) = \frac{1}{n}\log P(z_1,\dots,z_r\mid\theta) = \sum_j f_j\log\theta_j = -\sum_j f_j\log\frac{f_j}{\theta_j} + \sum_j f_j\log f_j = -H(\theta\|f) - H(f)$$
  where $H(\theta\|f)=\sum_j f_j\log(f_j/\theta_j)$ is the relative entropy and $H(f)=-\sum_j f_j\log f_j$ is the sample entropy
  H(f) does not depend on θ, so maximizing L(θ) is the same as minimizing H(θ||f)
  It is well known that H(θ||f)≥0 with equality iff θ=f
  Therefore the MLE is provided by the respective frequencies

HMM Unsupervised Training

Unsupervised Learning:
  Let Z=(X, π) where X=emitted sequence and π=hidden path
  The HMM parameters are θ={a(i,j),e(i)}
  The sample DB D={X_s} provides only partial observations of Z
MLE: find θ maximizing L(θ)=log P(X_1,…,X_r|θ)
  L(θ)=Σ_s L(X_s|θ) where L(X_s|θ)=log P(X_s|θ)
  P(X_s|θ) may be computed using the forward/backward algorithms of Ch 4.2
  But calculating P and then optimizing can be both sensitive and complex
  Baum-Welch solution: instead of optimizing L, optimize its average (expected value)
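As a reminder of what computing P(X_s|θ) involves, here is a minimal forward-recursion sketch checked against a brute-force sum over all hidden paths. The parameter values, the label "S" for the Start state, and the absence of an explicit end state are assumptions of this example rather than details taken from the slides.

```python
import math
from itertools import product

STATES = "FB"
a = {(i, j): 0.5 for i in "SFB" for j in STATES}  # "S" = assumed Start state
e = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.8, ("B", "T"): 0.2}

def forward_likelihood(x):
    """P(x|θ) by the forward recursion, linear in len(x)."""
    f = {k: a[("S", k)] * e[(k, x[0])] for k in STATES}
    for sym in x[1:]:
        f = {k: e[(k, sym)] * sum(f[i] * a[(i, k)] for i in STATES) for k in STATES}
    return sum(f.values())

def brute_force_likelihood(x):
    """P(x|θ) by summing P(x, π|θ) over every path π (exponential; check only)."""
    total = 0.0
    for path in product(STATES, repeat=len(x)):
        p, prev = 1.0, "S"
        for k, sym in zip(path, x):
            p *= a[(prev, k)] * e[(k, sym)]
            prev = k
        total += p
    return total

def log_likelihood(db):
    """L(θ) = Σ_s log P(X_s|θ) for a DB of independent sequences."""
    return sum(math.log(forward_likelihood(x)) for x in db)

x = "HHTHTH"
print(forward_likelihood(x), brute_force_likelihood(x))
print(log_likelihood(["HHTHTH", "THTHTH", "TTHTTH"]))
```

For long sequences the products underflow, so a practical version would rescale f at each step or work with logarithms.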

Baum-Welch Training [1972]

Initialize: θ[0]={a(i,j),e(i,S)}

Compute: the forward/backward likelihood factors f_k(i), b_k(i) for D={X_s}

Compute θ[t+1] as the expected frequencies of transitions & emissions

Compute L(θ[t+1]); stop if Δ=L(θ[t+1])-L(θ[t])<ε

This was generalized by Dempster, Laird, Rubin [1977] into the Expectation Maximization (EM) training algorithm

Expectation-Maximization (EM) Training

Input:
  A DB of sample sequences D={X_s}
  Initial estimate θ[0]
Iterate until convergence:
  Expectation step: compute Q[θ',θ[t]]=E[L(Z|θ')|D,θ[t]], where θ' is a dummy parameter and Z=(X, π)
  Maximization step: compute θ[t+1] as the θ' maximizing Q[θ',θ[t]]
Notes:
  Key idea: instead of maximizing the likelihood L(θ), maximize its expected value Q
  The expectation step can compute Q using the forward/backward algorithms
  The maximization step can use gradient search or other techniques
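For an HMM the two steps take a particularly simple form. The following is the standard result, stated as a sketch; the bar notation for expected counts is introduced here and does not appear in the slides.

```latex
% E step: Q is linear in the expected transition and emission counts
Q(\theta',\theta^{[t]})
  = \sum_{i,j} \bar A(i,j)\,\log a'(i,j) + \sum_{i,S} \bar E(i,S)\,\log e'(i,S),
\qquad
\bar A(i,j) = E\big[A(i,j)\mid D,\theta^{[t]}\big],\quad
\bar E(i,S) = E\big[E(i,S)\mid D,\theta^{[t]}\big]

% M step: maximizing Q under the normalization constraints gives frequencies
a^{[t+1]}(i,j) = \frac{\bar A(i,j)}{\sum_k \bar A(i,k)},
\qquad
e^{[t+1]}(i,S) = \frac{\bar E(i,S)}{\sum_{S'} \bar E(i,S')}
```

Under this reading, the Baum-Welch update is the supervised counting rule applied to expected counts instead of observed ones.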

Ref Book: The Elements of Statistical Learning…Hastie et al

Conclusions

There is a growing range of HMM applications
Training is essential
  Both supervised and unsupervised techniques can be effective
Unsupervised training is of great value but has its challenges
  Convergence to a local optimum; dimensionality; how trustworthy are the results…
Substantial successes with applications:
  CpG islands
  Gene discovery
  Transmembrane proteins…
