Chapter 4: Hidden Markov Models - Columbia · PDF file · 2007-10-11Chapter 4:...
Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University
Chapter 4: Hidden Markov Models
4.3 HMM Training
Overview
Learning HMM parameters
  Supervised learning
  Unsupervised learning (Viterbi, EM/Baum-Welch)
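The Training Problem
HMM = Topology + Statistical parameters
  Topology is designed
  Training = compute the statistical parameters
Training algorithm:
  Input: a DB of sample HMM behaviors {<X,E>}
  Output: transition/emission probabilities for HMM
[Figure: training maps a sample DB to HMM parameters. States Start, F, B; e(F,H)=0.5, e(F,T)=0.5; e(B,H)=?, e(B,T)=?; all six transition probabilities are unknown (?)]
Sample DB (each line: 6 hidden states followed by the 6 emitted symbols):
  FFFBFF HHTHTH
  BFFBFF THTHTH
  FFBFFF THHTTH
  BFFFBF THHTTT
  FFFBBF HHTHHT
  BFFFFF HHTTHT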
Supervised vs Unsupervised
Supervised training: use statistics of known samples
  Trainer knows the underlying state sequence for each sample
  Estimates transition/emission probabilities from the sample DB
  E.g., CpG islands; gene components
Unsupervised training (learning): update HMM parameters based on new samples
  How?
Supervised Training: Counting Frequencies
Use the sample DB to count # transitions, # emissions
  A(i,j) = # of i→j transitions in the sample DB
  E(i,S) = # of emissions of symbol S at state i
Estimate θ={a(i,j),e(i,S)} using the respective frequencies
  a(i,j) = A(i,j)/Σ_k A(i,k)
  e(i,S) = E(i,S)/Σ_S' E(i,S')
A(i,j) or E(i,S) are often zero
  e.g., when state i is not visited by any sequence in the sample DB
  Include a correction (pseudocount):
    A(i,j) = # of i→j transitions + r(i,j)
    E(i,S) = # of emissions of S at state i + r'(i,S)
  r(i,j) and r'(i,S) are typically selected from some prior probability
  Laplace rule: r(i,j)=r'(i,S)=1
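The counting estimator with a pseudocount correction can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `estimate`, the use of a single pseudocount `r` for every entry, and the two-sequence toy DB are all assumptions made for the example.

```python
def estimate(samples, states, symbols, r=1.0):
    """Estimate a(i,j) and e(i,S) from labeled (path, emitted) pairs.

    r is added to every count; r=1 is the Laplace rule, r=0 gives raw
    frequencies (and fails with a division by zero if a state is never visited).
    """
    A = {i: {j: r for j in states} for i in ["Start"] + states}
    E = {i: {s: r for s in symbols} for i in states}
    for path, emitted in samples:
        prev = "Start"
        for state, sym in zip(path, emitted):
            A[prev][state] += 1      # one more prev -> state transition
            E[state][sym] += 1       # one more emission of sym at state
            prev = state
    a = {i: {j: A[i][j] / sum(A[i].values()) for j in states} for i in A}
    e = {i: {s: E[i][s] / sum(E[i].values()) for s in symbols} for i in E}
    return a, e

# Hypothetical toy DB in which state B never emits T.
samples = [("FFB", "HTH"), ("FBF", "THT")]
a, e = estimate(samples, ["F", "B"], ["H", "T"], r=1.0)      # Laplace rule
a0, e0 = estimate(samples, ["F", "B"], ["H", "T"], r=0.0)    # raw frequencies
print(e0["B"]["T"], e["B"]["T"])
```

With raw frequencies the unseen emission gets probability 0, which would rule out any path that emits T at B; the Laplace rule keeps it small but positive.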
Example
Suppose the sample DB is (each line: 6 hidden states followed by the 6 emitted symbols):
  1: FFFBFF HHTHTH
  2: BFFBFF THTHTH
  3: FFBFFF THHTTH
  4: FFFFBF THTTTH
  5: BFBFFF HTTTHH
  6: BFFFBF THHTHT
  7: FFFBBF HHTHHT
  8: BFFFFF HHTTHT
[Figure: HMM with states Start, F, B; e(F,H)=0.5, e(F,T)=0.5; e(B,H)=?, e(B,T)=?; all six transition probabilities are unknown (?)]
Count transition frequencies:
  A(S,F)=4; a(S,F)=4/8
    {Start⇒F} transitions occur in sequences (1,3,4,7)
  A(F,B)=7; a(F,B)=7/(4*4+3*4)=7/28=1/4
    {F⇒B} transitions occur at 7 locations (1_3, 2_3, 3_2, 4_4, 5_2, 6_4, 7_3), where k_i denotes location i of sequence k
Emission frequencies:
  E(B,H)=8; e(B,H)=8/12
    <B,H> emissions occur at 8 locations (1_4, 2_4, 3_3, 5_1, 6_5, 7_4, 7_5, 8_1)
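The counts in this example can be checked mechanically. The sketch below assumes, as the counts imply, that each DB line is 6 hidden states followed by 6 emitted symbols; the variable names are illustrative. Locations are reported as (k, i) pairs, meaning location i of sequence k.

```python
db = ["FFFBFFHHTHTH", "BFFBFFTHTHTH", "FFBFFFTHHTTH", "FFFFBFTHTTTH",
      "BFBFFFHTTTHH", "BFFFBFTHHTHT", "FFFBBFHHTHHT", "BFFFFFHHTTHT"]
samples = [(line[:6], line[6:]) for line in db]   # (hidden path, emitted symbols)

# Sequences whose first state is F, i.e. Start => F transitions.
start_F = [k for k, (path, _) in enumerate(samples, 1) if path[0] == "F"]

# F => B transitions, located by the position of the F.
fb = [(k, i) for k, (path, _) in enumerate(samples, 1)
      for i in range(1, len(path)) if path[i - 1] == "F" and path[i] == "B"]

# All transitions leaving F (an F in the last position has no successor).
from_F = sum(path[:-1].count("F") for path, _ in samples)

# Emissions of H while in state B, and total visits to B.
bh = [(k, i) for k, (path, emitted) in enumerate(samples, 1)
      for i, (state, sym) in enumerate(zip(path, emitted), 1)
      if state == "B" and sym == "H"]
visits_B = sum(path.count("B") for path, _ in samples)

print(start_F, len(fb), from_F, len(bh), visits_B)
```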
The Catch: Acquiring DB of Supervised Samples
Need to extract training samples
Fortunately, many biological sequences admit such samples
  E.g., protein secondary structure regions
  E.g., start/stop codon regions
However, manual sampling is often difficult and imprecise, and sometimes it is not feasible
This is why unsupervised training (learning) is important
Viterbi Learning (Unsupervised)
Key idea: iterative improvement of model parameters
Input:
  An HMM model with initial parameters θ[0]={a(i,j),e(i,S)}
  A DB of training sequences D
Iterate until convergence:
  1) Compute the Viterbi paths of D: V[t]=V[D,θ[t]]
  2) Count frequencies F[V[t]]={A(i,j), E(i,S)}
  3) Update {a(i,j),e(i,S)}: θ[t+1]⇐F[V[t]]
How do we tell convergence?
  e.g., Δ(θ[t],θ[t+1])<ε where Δ is a distance metric
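The three steps of the iteration can be sketched in a few lines of Python. This is one possible reading under stated assumptions, not the lecture's implementation: the function names, the pseudocount r=0.01, and the three-sequence DB are illustrative, and convergence is detected when the Viterbi paths stop changing (which makes Δ(θ[t],θ[t+1])=0) rather than through a general distance metric.

```python
import math

STATES, SYMBOLS = ["F", "B"], ["H", "T"]

def viterbi(x, start, a, e):
    """Step 1: most probable state path for x, using log probabilities."""
    score = {k: math.log(start[k]) + math.log(e[k][x[0]]) for k in STATES}
    back = []
    for sym in x[1:]:
        prev, score, ptr = score, {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: prev[j] + math.log(a[j][k]))
            score[k] = prev[best] + math.log(a[best][k]) + math.log(e[k][sym])
            ptr[k] = best
        back.append(ptr)
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):          # trace the pointers back to position 1
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

def normalize(row):
    total = sum(row.values())
    return {key: value / total for key, value in row.items()}

def reestimate(D, paths, r=0.01):
    """Steps 2 and 3: count along the Viterbi paths, then normalize.
    r is an assumed pseudocount that keeps every probability positive."""
    start = {k: r for k in STATES}
    A = {j: {k: r for k in STATES} for j in STATES}
    E = {k: {s: r for s in SYMBOLS} for k in STATES}
    for x, p in zip(D, paths):
        start[p[0]] += 1
        for j, k in zip(p, p[1:]):
            A[j][k] += 1
        for k, sym in zip(p, x):
            E[k][sym] += 1
    return (normalize(start), {j: normalize(A[j]) for j in STATES},
            {k: normalize(E[k]) for k in STATES})

def viterbi_learning(D, start, a, e, max_iter=50):
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, start, a, e) for x in D]
        if new_paths == paths:          # same paths give the same counts
            break
        paths = new_paths
        start, a, e = reestimate(D, paths)
    return start, a, e, paths

D = ["HHTHTH", "THTHTH", "TTHTTH"]      # illustrative three-sequence DB
start0 = {"F": 0.5, "B": 0.5}
a0 = {j: {k: 0.5 for k in STATES} for j in STATES}
e0 = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.8, "T": 0.2}}
start, a, e, paths = viterbi_learning(D, start0, a0, e0)
print(paths)
```

Because the update depends on θ[t] only through the decoded paths, a different starting model can lead to different paths and therefore to a different limit.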
Viterbi Learning
Suppose the sample DB is:
  D = HHTHTH, THTHTH, THHTTH, THTTTH, HTTTHH, THHTTT, HHTHHT, HHTTHT
Initialize estimates:
  θ[0]: e(F,H)=0.5, e(F,T)=0.5, e(B,H)=.9, e(B,T)=.1; all transitions (Start→F, Start→B, F→F, F→B, B→F, B→B) = .5
Step 1: Compute Viterbi paths V[t]
  Viterbi trellis for HHTHTH (position 0: Start=1, F=0, B=0):
    position   1      2       3       4       5       6
    symbol     H      H       T       H       T       H
    F          .25    .1125   .0506   .0127   .0057   .0014
    B          .45    .2025   .0101   .0228   .0011   .0026
  V[t] = HHTHTH → BBFBFB
Viterbi Learning
Suppose the sample DB is:
  D = HHTHTH, THTHTH, TTHTTH, TTTHTH, HTTTHH, THHTTT, HHTHHT, HHTTHT
Initialize estimates (use logarithms to base 2):
  θ[0]: e(F,H)=-1, e(F,T)=-1, e(B,H)=-.32, e(B,T)=-2.32; all transitions (Start→F, Start→B, F→F, F→B, B→F, B→B) = -1
Step 1: Compute Viterbi paths V[t]
  Viterbi trellis for HHTHTH (position 0: Start=0, F=--, B=--):
    position   1       2       3       4       5       6
    symbol     H       H       T       H       T       H
    F          -2      -3.32   -4.64   -6.64   -7.96   -9.96
    B          -1.32   -2.64   -5.96   -5.96   -9.28   -9.28
  V[t] = HHTHTH → BBFBFB
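The log-domain recursion behind this trellis can be reproduced with a short sketch. The log2 parameters are the ones given for θ[0] on this slide; the function and variable names are illustrative assumptions.

```python
states = ["F", "B"]
log_start = {"F": -1.0, "B": -1.0}
log_a = {j: {k: -1.0 for k in states} for j in states}
log_e = {"F": {"H": -1.0, "T": -1.0}, "B": {"H": -0.32, "T": -2.32}}

def viterbi_trellis(x):
    """Return the trellis columns and the best path for sequence x.
    In log space the products of the Viterbi recursion become sums."""
    v = [{k: log_start[k] + log_e[k][x[0]] for k in states}]
    pointers = []
    for sym in x[1:]:
        column, back = {}, {}
        for k in states:
            j = max(states, key=lambda s: v[-1][s] + log_a[s][k])
            column[k] = v[-1][j] + log_a[j][k] + log_e[k][sym]
            back[k] = j                      # best predecessor of state k
        v.append(column)
        pointers.append(back)
    path = [max(states, key=lambda k: v[-1][k])]
    for back in reversed(pointers):          # follow the pointers backwards
        path.append(back[path[-1]])
    return v, "".join(reversed(path))

v, path = viterbi_trellis("HHTHTH")
for k in states:
    print(k, [round(column[k], 2) for column in v])
print(path)
```

Working with logarithms avoids the rapid underflow of the probability-domain trellis: the scores fall by a roughly constant amount per position instead of shrinking geometrically.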
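Compute Viterbi Paths For Rest of D
  D = HHTHTH, THTHTH, TTHTTH, TTTHTH, HTTTHH, THHTTT, HHTHHT, HHTTHT
  θ[0] (log base 2): e(F,H)=-1, e(F,T)=-1, e(B,H)=-.32, e(B,T)=-2.32; all transitions = -1
  Viterbi trellis for THTHTH (position 0: Start=0, F=--, B=--):
    position   1       2       3       4       5       6
    symbol     T       H       T       H       T       H
    F          -2      -4      -5.32   -7.32   -8.64   -10.64
    B          -3.32   -3.32   -6.64   -6.64   -9.96   -9.96
  Viterbi trellis for TTHTTH (position 0: Start=0, F=--, B=--):
    position   1       2       3       4       5       6
    symbol     T       T       H       T       T       H
    F          -2      -4      -6      -7.32   -9.32   -11.32
    B          -3.32   -5.32   -5.32   -8.64   -10.64  -10.64
  V[1] =
    HHTHTH → BBFBFB
    THTHTH → FBFBFB
    TTHTTH → FFBFFB
    ………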
Viterbi Learning
Step 2: update θ[t+1]⇐F[V[t]]
  V[1] =
    HHTHTH → BBFBFB
    THTHTH → FBFBFB
    TTHTTH → FFBFFB
    ………
  Estimate model parameters from V[t]
    a(i,j)=      F      B
      S          2/3    1/3
      F          2/9    7/9
      B          5/6    1/6
    e(i,s)=      T        H        (Laplace correction)
      F          8.99/9   0.01/9
      B          0.01/9   8.99/9
  Convert to log (base 2)
    θ[t]: e(F,H)=-9.8, e(F,T)=0, e(B,H)=0, e(B,T)=-9.8
          Start→F=-0.58, Start→B=-1.58, F→F=-2.17, F→B=-0.36, B→F=-0.26, B→B=-2.58
Iterate steps 1,2
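The "convert to log" step is a plain element-wise log2 of the estimated tables. The sketch below takes the fractions read from this slide as input (a(S,F)=2/3, a(F,B)=7/9, corrected emissions 8.99/9 and 0.01/9, and so on); the helper name `to_log2` and the rounding to two decimals are assumptions for display only.

```python
import math

a = {"S": {"F": 2 / 3, "B": 1 / 3},
     "F": {"F": 2 / 9, "B": 7 / 9},
     "B": {"F": 5 / 6, "B": 1 / 6}}
e = {"F": {"T": 8.99 / 9, "H": 0.01 / 9},
     "B": {"T": 0.01 / 9, "H": 8.99 / 9}}

def to_log2(table):
    """Replace every probability by its base-2 logarithm, rounded for display."""
    return {i: {j: round(math.log2(p), 2) for j, p in row.items()}
            for i, row in table.items()}

log_a, log_e = to_log2(a), to_log2(e)
print(log_a)
print(log_e)
```

The correction matters here: without it e(F,H) and e(B,T) would be exactly 0 and their logarithms undefined.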
Notes
Viterbi Learning: θ[t+1]⇐ F[V[t]]⇐V[t]⇐ θ[t]
Observations:
  Sequential vs. batch learning
  The starting model is important (e.g., consider a(.,.)=0.5, e(.,.)=0.5)
  Does the iteration converge? Is the limit model unique?
Need a formal base to consider such questions
Learning = Estimating Model Parameters
Estimation is a well developed field of statistics
  Wish to estimate a parameterized probabilistic model from sample observations
  HMM: estimate the transition and emission probabilities from sample sequences
  Estimate the “best” θ={a(i,j),e(i)} to explain the observed sequences {Xs}
A typical estimation setting
  Assume: random data X depends on a parameterized model θ according to a probability f(x,θ)=P(X=x|θ)
  Given: a set {Xs} of sample observations of X
  Compute: an estimate of θ that “best” explains {Xs} [“Best” = min variance, max likelihood, …]
  E.g., θ is a signal transmitted by a source; {Xs} are sample observations by a receiver
  E.g., nature operates as an HMM; θ={a(i,j),e(i)} is a signal transmitted by nature to describe the HMM; {Xs} are sample observations of this signal by biologists
Maximum Likelihood Estimation (MLE)
The MLE problem:
  Consider a random event X, with distribution f(x,θ)=P(X=x|θ)
  Given: independent samples D={xs} of X
  Compute: θ(D) maximizing L(θ)=log P(x1…xr|θ)
  L(θ) = log[Π_s f(xs,θ)] = Σ_s log f(xs,θ)
A Biased Coin Example
  X ∈ {H,T}; define θ=P(X=H) ⇒ f(H,θ)=θ, f(T,θ)=1-θ
  Let D={H,H,H,T,H,T}; intuitively θ(D)=frequency(H,X)=4/6=2/3
  L(θ) = Σ_s log f(xs,θ) = 4 log θ + 2 log(1-θ)
  Find θ that maximizes L(θ): ∂L(θ)/∂θ=0 ⇒ 4/θ-2/(1-θ)=0 ⇒ θ*=2/3
  [Figure: plot of P(X|θ)=θ^4(1-θ)^2 against θ, with the maximum marked at θ*]
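The coin example is small enough to check numerically. The sketch below compares the closed-form solution with a brute-force search over a grid of θ values; the grid resolution of 0.001 is an arbitrary choice made for the illustration.

```python
import math

D = list("HHHTHT")
n_H, n_T = D.count("H"), D.count("T")

def L(theta):
    """Log-likelihood of D for a coin with P(H) = theta."""
    return n_H * math.log(theta) + n_T * math.log(1 - theta)

closed_form = n_H / (n_H + n_T)               # root of 4/θ - 2/(1-θ) = 0
grid = [i / 1000 for i in range(1, 1000)]     # θ in (0, 1), endpoints excluded
numeric = max(grid, key=L)

print(closed_form, numeric)
```

The grid search lands on the grid point closest to 2/3, which agrees with the frequency of H in D.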
MLE Notes
The MLE of θ is the value θ* maximizing L(θ)=Σ_s log f(xs,θ)
The MLE can be computed by solving ∂L(θ)/∂θ=0, or by using any optimization algorithm (e.g., gradient search)
  This may lead to a local maximum
MLE Training of HMM
Supervised Training:
  Let Z=(X, π) where X=emitted sequence and π=hidden path
  The HMM parameters are θ={a(i,j),e(i)}
  Supervised learning: the sample DB is D={Zs} with observed samples Zs=(Xs, πs)
MLE for a discrete distribution reduces to counting frequencies
  Suppose Z has a finite # of values {V1,V2…Vk} with P(Z=Vj)=f(Vj,θ)=θj
  The parameter θ=(θ1,θ2….θk) is the distribution of Z
  For a DB of Z samples D={zs} define nj = # occurrences of Vj in D, and let n=n1+…+nk and fj=nj/n be the frequency of Vj
  L(θ)=L(D,θ)=log P(z1,…zr|θ)=Σ_j nj log(θj)=n Σ_j fj log(θj)
  Σ_j fj log(θj) = -Σ_j fj log(fj/θj) + Σ_j fj log(fj) = -H(θ||f) - H(f)
    where H(θ||f)=Σ_j fj log(fj/θj) is the relative entropy and H(f)=-Σ_j fj log(fj) is the sample entropy
  H(f) does not depend on θ, so maximizing L(θ) is the same as minimizing H(θ||f)
  It is well known that H(θ||f)≥0 with equality iff θ=f
  Therefore the MLE is provided by the respective frequencies
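The claim that the sample frequencies maximize the likelihood can be spot-checked numerically. The sketch below uses hypothetical counts for three values V1, V2, V3, draws random candidate distributions θ, and checks both the decomposition of the average log-likelihood Σ_j fj log(θj) into a relative-entropy term Σ_j fj log(fj/θj) and an entropy term, and the fact that no candidate beats θ=f. All names and numbers are assumptions for the illustration.

```python
import math
import random

counts = {"V1": 5, "V2": 3, "V3": 2}           # hypothetical sample counts n_j
n = sum(counts.values())
f = {v: c / n for v, c in counts.items()}      # sample frequencies f_j

def avg_loglik(theta):
    """(1/n) log P(D|theta) = sum_j f_j log(theta_j)."""
    return sum(f[v] * math.log(theta[v]) for v in f)

def rel_entropy(theta):
    """sum_j f_j log(f_j / theta_j); zero exactly when theta equals f."""
    return sum(f[v] * math.log(f[v] / theta[v]) for v in f)

entropy = -sum(p * math.log(p) for p in f.values())

random.seed(0)
identity_holds, f_is_best = True, True
for _ in range(1000):
    w = [random.random() + 1e-9 for _ in f]
    theta = {v: x / sum(w) for v, x in zip(f, w)}   # random distribution
    if abs(avg_loglik(theta) - (-rel_entropy(theta) - entropy)) > 1e-9:
        identity_holds = False
    if avg_loglik(theta) > avg_loglik(f) + 1e-12:
        f_is_best = False

print(identity_holds, f_is_best)
```

A random search is of course not a proof; it only illustrates the inequality that the relative-entropy argument establishes in general.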
HMM Unsupervised Training
Unsupervised Learning:
  Let Z=(X, π) where X=emitted sequence and π=hidden path
  The HMM parameters are θ={a(i,j),e(i)}
  The sample DB D={Xs} provides only partial observations of Z
MLE: find θ maximizing L(θ)=log P(X1,…Xr|θ)
  L(θ) = Σ_s L(Xs|θ) where L(Xs|θ)=log P(Xs|θ)
  P(Xs|θ) may be computed using the forward/backward algorithms of Ch 4.2
  But calculating P and then optimizing can be both sensitive and complex
  Baum-Welch solution: instead of optimizing L, optimize its average (its expected value over the hidden paths)
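As a reminder of how the likelihood of partially observed data is evaluated, here is a minimal sketch of the forward algorithm and the resulting L(θ). It assumes a model without an explicit end state and uses illustrative parameter values (uniform transitions, e(B,H)=0.9); the function names are not from the lecture.

```python
import math

def forward_likelihood(x, start, a, e):
    """P(x | θ): sum over all hidden paths, computed by the forward recursion."""
    states = list(start)
    f = {k: start[k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {k: e[k][sym] * sum(f[j] * a[j][k] for j in states) for k in states}
    return sum(f.values())

def log_likelihood(D, start, a, e):
    """L(θ) = Σ_s log P(Xs | θ) for a DB of independent sequences."""
    return sum(math.log(forward_likelihood(x, start, a, e)) for x in D)

start = {"F": 0.5, "B": 0.5}
a = {"F": {"F": 0.5, "B": 0.5}, "B": {"F": 0.5, "B": 0.5}}
e = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

p = forward_likelihood("HHTHTH", start, a, e)
print(p, log_likelihood(["HHTHTH", "THTHTH"], start, a, e))
```

With uniform transitions the hidden state at each position is independent of the others, so each symbol is H with probability 0.5·0.5+0.5·0.9=0.7, which gives a simple sanity check on the recursion. For long sequences a practical version would rescale f or work in log space to avoid underflow.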
Baum-Welch Training [1972]
Initialize: θ[0]={a(i,j),e(i,S)}
Compute: the forward/backward likelihood factors fk(i), bk(i) for D={Xs}
Compute θ[t+1] as the expected frequencies of transitions & emissions
Compute L(θ[t+1]); stop if Δ=L(θ[t+1])-L(θ[t])<ε
This was generalized by Dempster, Laird, Rubin [1977] into the Expectation Maximization (EM) training algorithm
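The steps above can be sketched for the two-state coin model. This is a minimal illustration under several assumptions: no end state, no rescaling of the forward/backward factors (adequate only for short sequences), start probabilities re-estimated alongside the transitions, and an illustrative three-sequence DB and starting point. The function names are not from the lecture.

```python
import math

STATES, SYMBOLS = ["F", "B"], ["H", "T"]

def forward(x, start, a, e):
    f = [{k: start[k] * e[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        f.append({k: e[k][sym] * sum(f[-1][j] * a[j][k] for j in STATES)
                  for k in STATES})
    return f

def backward(x, a, e):
    b = [{k: 1.0 for k in STATES}]
    for sym in reversed(x[1:]):
        b.insert(0, {j: sum(a[j][k] * e[k][sym] * b[0][k] for k in STATES)
                     for j in STATES})
    return b

def normalize(row):
    total = sum(row.values())
    return {key: value / total for key, value in row.items()}

def baum_welch_step(D, start, a, e):
    """One update: expected counts under θ[t], then normalize to get θ[t+1].
    Also returns L(θ[t]), the log-likelihood of the input parameters."""
    S = {k: 0.0 for k in STATES}
    A = {j: {k: 0.0 for k in STATES} for j in STATES}
    E = {k: {s: 0.0 for s in SYMBOLS} for k in STATES}
    loglik = 0.0
    for x in D:
        f, b = forward(x, start, a, e), backward(x, a, e)
        px = sum(f[-1].values())
        loglik += math.log(px)
        for k in STATES:
            S[k] += f[0][k] * b[0][k] / px
        for t, sym in enumerate(x):
            for k in STATES:
                E[k][sym] += f[t][k] * b[t][k] / px      # expected emissions
                if t + 1 < len(x):
                    for l in STATES:                      # expected transitions
                        A[k][l] += (f[t][k] * a[k][l] * e[l][x[t + 1]]
                                    * b[t + 1][l] / px)
    return (normalize(S), {j: normalize(A[j]) for j in STATES},
            {k: normalize(E[k]) for k in STATES}, loglik)

def baum_welch(D, start, a, e, eps=1e-6, max_iter=200):
    history = []                       # L(θ[0]), L(θ[1]), ...
    for _ in range(max_iter):
        new_start, new_a, new_e, loglik = baum_welch_step(D, start, a, e)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < eps:
            break                      # Δ < ε: keep the current parameters
        start, a, e = new_start, new_a, new_e
    return start, a, e, history

D = ["HHTHTH", "THTHTH", "TTHTTH"]     # illustrative DB
start0 = {"F": 0.5, "B": 0.5}
a0 = {j: {k: 0.5 for k in STATES} for j in STATES}
e0 = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.8, "T": 0.2}}
start, a, e, history = baum_welch(D, start0, a0, e0)
print(len(history), history[0], history[-1])
```

Unlike Viterbi learning, every hidden path contributes to the counts, weighted by its posterior probability, and the recorded log-likelihoods never decrease from one iteration to the next.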
Expectation-Maximization (EM) Training
Input:
  A DB of sample sequences D={Xs}
  Initial estimate θ[0]
Iterate until convergence:
  Expectation step: compute Q[θ’,θ[t]]=E[L(Z|θ’)|D,θ[t]], where θ’ is a dummy parameter and Z=(X, π)
  Maximization step: compute θ[t+1] as the θ’ maximizing Q[θ’,θ[t]]
Notes:
  Key idea: instead of maximizing the likelihood L(θ), maximize its expected value Q
  The expectation step can compute Q using the forward/backward algorithms
  The maximization step can use gradient search or other techniques
Ref Book: The Elements of Statistical Learning…Hastie et al
Conclusions
There is a growing range of HMM applications
Training is essential
  Both supervised and unsupervised techniques can be effective
  Unsupervised training is of great value but has its challenges
    Convergence to local optimum; dimensionality; how trustworthy are the results…
Substantial successes with applications:
  CpG islands
  Gene discovery
  Transmembrane proteins…