
Machine Learning: CSE 574

Hidden Markov Models

Sargur N. Srihari [email protected]

Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

HMM Topics
1. What is an HMM?
2. State-Space Representation
3. HMM Parameters
4. Generative View of HMM
5. Determining HMM Parameters Using EM
6. Forward-Backward or α−β Algorithm
7. HMM Implementation Issues:
   a) Length of Sequence
   b) Predictive Distribution
   c) Sum-Product Algorithm
   d) Scaling Factors
   e) Viterbi Algorithm

1. What is an HMM?
• Ubiquitous tool for modeling time-series data
• Used in:
  • Almost all speech recognition systems
  • Computational molecular biology, e.g., grouping amino acid sequences into proteins
  • Handwritten word recognition
• It is a tool for representing probability distributions over sequences of observations
• The HMM gets its name from two defining properties:
  • The observation xt at time t was generated by some process whose state zt is hidden from the observer
  • The state zt depends only on the previous state zt−1 and is independent of all prior states (first order)
• Example: z are phoneme sequences, x are acoustic observations

Graphical Model of HMM
• Has the graphical model shown below; the latent variables are discrete
• Joint distribution has the form:

p(x_1,\ldots,x_N, z_1,\ldots,z_N) = p(z_1) \Big[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \Big] \prod_{n=1}^{N} p(x_n \mid z_n)
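The factorization above can be checked numerically on a tiny model. A minimal sketch in Python (the 2-state π, A, and emission table B below are hypothetical numbers chosen only for illustration):

```python
import numpy as np
from itertools import product

# Hypothetical 2-state HMM with 2 observation symbols (values chosen for illustration)
pi = np.array([0.6, 0.4])                     # p(z1)
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])                   # A[j, k] = p(z_n = k | z_{n-1} = j)
B  = np.array([[0.9, 0.1],
               [0.4, 0.6]])                   # B[k, s] = p(x_n = s | z_n = k)

def joint_prob(x, z):
    """p(x_1..x_N, z_1..z_N) = p(z1) * prod p(z_n|z_{n-1}) * prod p(x_n|z_n)."""
    p = pi[z[0]]
    for n in range(1, len(z)):
        p *= A[z[n - 1], z[n]]
    for n in range(len(x)):
        p *= B[z[n], x[n]]
    return p

# Summing the joint over all state sequences gives the marginal p(X),
# and summing over all (X, Z) pairs must give 1 since the model is normalized.
x = [0, 1, 0]
p_X = sum(joint_prob(x, list(z)) for z in product(range(2), repeat=3))
total = sum(joint_prob(list(xs), list(z))
            for xs in product(range(2), repeat=3)
            for z in product(range(2), repeat=3))
```

Because every factor is a normalized distribution, `total` comes out to 1, which makes a convenient sanity check on the factorization.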

HMM Viewed as Mixture
• A single time slice corresponds to a mixture distribution with component densities p(x|z)
• Each state of the discrete variable z represents a different component
• Thus the HMM is an extension of the mixture model
• The choice of mixture component depends on the choice of mixture component at the previous step
• The latent variables are multinomial variables zn that describe which component is responsible for generating xn
• Can use the one-of-K coding scheme

2. State-Space Representation
• The probability distribution of zn depends on the state of the previous latent variable zn−1 through the probability distribution p(zn | zn−1)
• One-of-K coding: the latent variables are K-dimensional binary vectors, e.g., state k of zn corresponds to znk = 1 with all other components 0
• Transition probabilities:

A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1),  with  \sum_k A_{jk} = 1

• The A matrix has rows indexed by the state j of zn−1 and columns indexed by the state k of zn
• A has K(K−1) independent parameters
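Sampling a latent state sequence follows directly from the definitions above: draw z1 from π, then each zn from row zn−1 of A. A minimal sketch with a made-up 3-state transition matrix whose rows each sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain (numbers made up for illustration);
# pi is deterministic so the chain always starts in state 0.
pi = np.array([1.0, 0.0, 0.0])
A  = np.array([[0.90, 0.05, 0.05],
               [0.05, 0.90, 0.05],
               [0.05, 0.05, 0.90]])   # A[j] is the distribution p(z_n | z_{n-1} = j)

def sample_states(N):
    """Draw z_1..z_N: z_1 ~ pi, then z_n ~ row z_{n-1} of A."""
    z = [rng.choice(3, p=pi)]
    for _ in range(N - 1):
        z.append(rng.choice(3, p=A[z[-1]]))
    return z

z = sample_states(100)
```

With 0.9 on the diagonal the sampled sequence tends to stay in one state for long runs, which is the behavior the self-transition probabilities encode.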

Transition Probabilities
• Example with a 3-state latent variable (K = 3)
• In one-of-K coding: zn = 1 corresponds to (zn1, zn2, zn3) = (1, 0, 0); zn = 2 to (0, 1, 0); zn = 3 to (0, 0, 1)
• Row j of A gives the probabilities of moving from state j of zn−1 to each state k of zn: from zn−1 = 1 the probabilities are A11, A12, A13; from zn−1 = 2 they are A21, A22, A23; from zn−1 = 3 they are A31, A32, A33; with A11 + A12 + A13 = 1 and likewise for the other rows
• State Transition Diagram: not a graphical model, since the nodes are not separate variables but states of a single variable

Conditional Probabilities
• Transition probabilities Ajk represent state-to-state probabilities for each variable
• Conditional probabilities are variable-to-variable probabilities, and can be written in terms of the transition probabilities as

p(z_n \mid z_{n-1}) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}

• Note that the exponent zn−1,j znk is a product that evaluates to 0 or 1
• Hence the overall product evaluates to a single Ajk for each setting of values of zn and zn−1
• E.g., zn−1 = 2 and zn = 3 result in only zn−1,2 = 1 and zn,3 = 1; thus p(zn = 3 | zn−1 = 2) = A23
• A is a global HMM parameter

Initial Variable Probabilities
• The initial latent node z1 is a special case without a parent node
• Represented by a vector of probabilities π with elements πk = p(z1k = 1), so that

p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}},  where  \sum_k \pi_k = 1

• Note that π is an HMM parameter representing the probabilities of each state for the first variable

Lattice or Trellis Diagram
• State transition diagram unfolded over time
• Representation of latent variable states
• Each column corresponds to one of the latent variables zn

Emission Probabilities p(xn|zn)
• We have so far only specified p(zn|zn−1) by means of the transition probabilities
• Need to specify the probabilities p(xn|zn,φ) to complete the model, where φ are parameters
• These can be continuous or discrete
• Because xn is observed and zn is discrete, p(xn|zn,φ) consists of a table of K numbers corresponding to the K states of zn
• Analogous to class-conditional probabilities
• Can be represented as

p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}}

3. HMM Parameters
• We have defined three types of HMM parameters: θ = (π, A, φ)
1. Initial probabilities of the first latent variable: π is a vector of the K probabilities of the states for latent variable z1
2. Transition probabilities (state-to-state for any latent variable): A is a K × K matrix of transition probabilities Ajk
3. Emission probabilities (observations conditioned on latent): φ are the parameters of the conditional distribution p(xn|zn)
• The A and π parameters are often initialized uniformly
• Initialization of φ depends on the form of the distribution

Joint Distribution over Latent and Observed Variables
• The joint can be expressed in terms of the parameters:

p(X, Z \mid \theta) = p(z_1 \mid \pi) \Big[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \Big] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

where X = {x_1,..,x_N}, Z = {z_1,..,z_N}, θ = {π, A, φ}

• Most discussion of HMMs is independent of the emission probabilities
• Tractable for discrete tables, Gaussians, GMMs
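The parameterized factorization makes the complete-data log-likelihood a simple sum of log-factors. A sketch with hypothetical parameters (π, A, and the discrete emission table B are invented for illustration):

```python
import numpy as np

# Hypothetical parameters theta = (pi, A, B); values made up for illustration
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])    # discrete emission table p(x|z)

def complete_data_loglik(x, z):
    """ln p(X, Z | theta) = ln p(z1) + sum_n ln p(z_n|z_{n-1}) + sum_n ln p(x_n|z_n)."""
    ll = np.log(pi[z[0]])
    ll += sum(np.log(A[z[n - 1], z[n]]) for n in range(1, len(z)))
    ll += sum(np.log(B[z[n], x[n]]) for n in range(len(x)))
    return ll

ll = complete_data_loglik([0, 1, 1], [0, 0, 1])
```

Working in log space avoids the underflow that the raw product of many small probabilities would cause on long sequences.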

4. Generative View of HMM
• Sampling from an HMM
• HMM with a 3-state latent variable z
• Gaussian emission model p(x|z); contours of constant density of the emission distribution are shown for each state
• Two-dimensional x
• 50 data points generated from the HMM; lines show successive observations
• Transition probabilities fixed so that there is a 5% probability of making a transition to each of the other states, and a 90% probability of remaining in the same state

Left-to-Right HMM
• Obtained by setting the elements Ajk = 0 if k < j
• Corresponding lattice diagram

Left-to-Right HMM Applied Generatively to Digits
• Examples of on-line handwritten 2's
• x is a sequence of pen coordinates
• There are 16 states for z, i.e., K = 16
• Each state can generate a line segment of fixed length in one of 16 angles
• Emission probabilities: a 16 × 16 table
• Transition probabilities set to zero except for those that keep the state index k the same or increment it by one
• Parameters optimized by 25 EM iterations, trained on 45 digits
• The generative model is quite poor, since generated samples don't look like the training examples
• If classification is the goal, one can do better by using a discriminative HMM

(Figures: Training Set, Generated Set)

5. Determining HMM Parameters
• Given a data set X = {x_1,..,x_N}, we can determine the HMM parameters θ = {π, A, φ} using maximum likelihood
• The likelihood function is obtained from the joint distribution by marginalizing over the latent variables Z = {z_1,..,z_N}:

p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)

Computational Issues for Parameters
p(X | θ) = Σ_Z p(X, Z | θ)
• Computational difficulties:
  • The joint distribution p(X, Z | θ) does not factorize over n, in contrast to the mixture model
  • Z has an exponential number of terms, corresponding to the trellis
• Solution:
  • Use conditional-independence properties to reorder the summations
  • Use EM instead, maximizing the expected log-likelihood of the joint, ln p(X, Z | θ)
  • An efficient framework for maximizing the likelihood function in HMMs

EM for MLE in HMM
1. Start with an initial selection of the model parameters θ_old
2. In the E step, take these parameter values and find the posterior distribution of the latent variables p(Z | X, θ_old). Use this posterior to evaluate the expectation of the logarithm of the complete-data likelihood function ln p(X, Z | θ), which can be written as

Q(\theta, \theta_{old}) = \sum_Z p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta)

where the factor p(Z | X, θ_old), independent of θ, is the part that is evaluated
3. In the M step, maximize Q w.r.t. θ

Expansion of Q
• Introduce the notation
  γ(zn) = p(zn | X, θ_old): marginal posterior distribution of latent variable zn
  ξ(zn−1, zn) = p(zn−1, zn | X, θ_old): joint posterior of two successive latent variables
• We will re-express

Q(\theta, \theta_{old}) = \sum_Z p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta)

in terms of γ and ξ

Detail of γ and ξ
• For each value of n we can store
  γ(zn) using K non-negative numbers that sum to unity
  ξ(zn−1, zn) using a K × K matrix whose elements also sum to unity
• Notation: γ(znk) denotes the conditional probability that znk = 1, with similar notation for ξ(zn−1,j, znk)
• Because the expectation of a binary random variable is the probability that it takes the value 1:

\gamma(z_{nk}) = E[z_{nk}] = \sum_{z_n} \gamma(z_n) z_{nk}
\xi(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n) z_{n-1,j} z_{nk}

Expansion of Q
• We begin with

Q(\theta, \theta_{old}) = \sum_Z p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta)

• Substitute

p(X, Z \mid \theta) = p(z_1 \mid \pi) \Big[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \Big] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

• And use the definitions of γ and ξ to get:

Q(\theta, \theta_{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

E-Step
• The goal of the E step is to evaluate γ(zn) and ξ(zn−1, zn) efficiently (forward-backward algorithm); these are the quantities appearing in the expansion of Q(θ, θ_old)

M-Step
• Maximize Q(θ, θ_old) with respect to the parameters θ = {π, A, φ}
• Treat γ(zn) and ξ(zn−1, zn) as constant
• Maximization w.r.t. π and A is easily achieved (using Lagrange multipliers):

\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}

A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}

• Maximization w.r.t. φk: only the last term of Q depends on φk,

\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

which has the same form as in a mixture distribution for i.i.d. data
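The π and A updates above are just normalized sums of the posteriors. A sketch, assuming γ and ξ have already been produced by an E step (the arrays below are made-up values, used only to exercise the updates):

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A.

    gamma: (N, K) array, gamma[n, k] = p(z_nk = 1 | X)
    xi:    (N-1, K, K) array, xi[n, j, k] = posterior of the transition j -> k
           between positions n+1 and n+2 of the chain
    """
    pi = gamma[0] / gamma[0].sum()            # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
    num = xi.sum(axis=0)                      # sum over n of xi(z_{n-1,j}, z_nk)
    A = num / num.sum(axis=1, keepdims=True)  # normalize each row j over k
    return pi, A

# Made-up posteriors for a K = 2, N = 3 chain
gamma = np.array([[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
xi = np.array([[[0.4, 0.2], [0.1, 0.3]],
               [[0.2, 0.3], [0.1, 0.4]]])
pi, A = m_step_pi_A(gamma, xi)
```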

M-Step for Gaussian Emissions
• Maximization of Q(θ, θ_old) w.r.t. φk
• Gaussian emission densities: p(x | φk) ~ N(x | μk, Σk)
• Solution:

\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}

\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}
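The Gaussian updates are responsibility-weighted means and covariances. A sketch (the posteriors and data below are invented for illustration; with hard 0/1 responsibilities the updates reduce to per-state sample means and covariances):

```python
import numpy as np

def m_step_gaussian(gamma, X):
    """Responsibility-weighted mean/covariance updates for Gaussian emissions.

    gamma: (N, K) posteriors, X: (N, D) observations.
    Returns mu of shape (K, D) and Sigma of shape (K, D, D).
    """
    Nk = gamma.sum(axis=0)                 # effective counts per state
    mu = (gamma.T @ X) / Nk[:, None]       # mu_k = sum_n gamma_nk x_n / sum_n gamma_nk
    K, D = mu.shape
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, Sigma

# Made-up hard posteriors and 1-D data to exercise the updates
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = np.array([[0.0], [2.0], [10.0], [14.0]])
mu, Sigma = m_step_gaussian(gamma, X)
```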

M-Step for Multinomial Observed Variables
• Conditional distributions of the observations have the form

p(x \mid z) = \prod_{i=1}^{D} \prod_{k=1}^{K} \mu_{ik}^{x_i z_k}

• M-step equations:

\mu_{ik} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})}

• An analogous result holds for Bernoulli observed variables

6. Forward-Backward Algorithm
• E step: an efficient procedure to evaluate γ(zn) and ξ(zn−1, zn)
• The graph of the HMM is a tree
• This implies that the posterior distribution of the latent variables can be obtained efficiently using a message-passing algorithm
• In the HMM context it is called the forward-backward algorithm, or Baum-Welch algorithm
• Several variants lead to exact marginals; the method called alpha-beta is discussed here

Derivation of Forward-Backward
• Several conditional independences (A-H) hold

A. p(X | zn) = p(x1,..,xn | zn) p(xn+1,..,xN | zn)

• Proved using d-separation: the path from x1,..,xn−1 to xn passes through zn, which is observed; the path is head-to-tail. Thus (x1,..,xn−1) ⊥ xn | zn. Similarly (x1,..,xn−1, xn) ⊥ (xn+1,..,xN) | zn

Conditional independence B
• Since (x1,..,xn−1) ⊥ xn | zn, we have

B. p(x1,..,xn−1 | xn, zn) = p(x1,..,xn−1 | zn)

Conditional independence C
• Since (x1,..,xn−1) ⊥ zn | zn−1,

C. p(x1,..,xn−1 | zn−1, zn) = p(x1,..,xn−1 | zn−1)

Conditional independence D
• Since (xn+1,..,xN) ⊥ zn | zn+1,

D. p(xn+1,..,xN | zn, zn+1) = p(xn+1,..,xN | zn+1)

Conditional independence E
• Since (xn+2,..,xN) ⊥ xn+1 | zn+1,

E. p(xn+2,..,xN | zn+1, xn+1) = p(xn+2,..,xN | zn+1)

Conditional independence F

F. p(X | zn−1, zn) = p(x1,..,xn−1 | zn−1) p(xn | zn) p(xn+1,..,xN | zn)

Conditional independence G
• Since (x1,..,xN) ⊥ xN+1 | zN+1,

G. p(xN+1 | X, zN+1) = p(xN+1 | zN+1)

Conditional independence H

H. p(zN+1 | zN, X) = p(zN+1 | zN)

Evaluation of γ(zn)
• Recall that the aim is to efficiently compute the E step of estimating the HMM parameters:
  γ(zn) = p(zn | X, θ_old): marginal posterior distribution of latent variable zn
• We are interested in finding the posterior distribution p(zn | x1,..,xN)
• This is a vector of length K whose entries correspond to the expected values of znk

Introducing α and β
• Using Bayes' theorem:

\gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n) p(z_n)}{p(X)}

• Using conditional independence A:

\gamma(z_n) = \frac{p(x_1,..,x_n \mid z_n) p(x_{n+1},..,x_N \mid z_n) p(z_n)}{p(X)} = \frac{p(x_1,..,x_n, z_n) p(x_{n+1},..,x_N \mid z_n)}{p(X)} = \frac{\alpha(z_n) \beta(z_n)}{p(X)}

• where α(zn) = p(x1,..,xn, zn), the probability of observing all given data up to time n together with the value of zn
• and β(zn) = p(xn+1,..,xN | zn), the conditional probability of all future data from time n+1 up to N given the value of zn

Recursion Relation for α

α(zn) = p(x1,..,xn, zn)
      = p(x1,..,xn | zn) p(zn)                                          by Bayes' rule
      = p(xn | zn) p(x1,..,xn−1 | zn) p(zn)                             by conditional independence B
      = p(xn | zn) p(x1,..,xn−1, zn)                                    by Bayes' rule
      = p(xn | zn) Σ_{zn−1} p(x1,..,xn−1, zn−1, zn)                     by sum rule
      = p(xn | zn) Σ_{zn−1} p(x1,..,xn−1, zn | zn−1) p(zn−1)            by Bayes' rule
      = p(xn | zn) Σ_{zn−1} p(x1,..,xn−1 | zn−1) p(zn | zn−1) p(zn−1)   by cond. ind. C
      = p(xn | zn) Σ_{zn−1} p(x1,..,xn−1, zn−1) p(zn | zn−1)            by Bayes' rule
      = p(xn | zn) Σ_{zn−1} α(zn−1) p(zn | zn−1)                        by definition of α

Forward Recursion for α Evaluation
• The recursion relation is

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n \mid z_{n-1})

• There are K terms in the summation, and it has to be evaluated for each of the K values of zn
• Each step of the recursion is O(K²)
• The initial condition is

\alpha(z_1) = p(x_1, z_1) = p(z_1) p(x_1 \mid z_1) = \prod_{k=1}^{K} \{ \pi_k p(x_1 \mid \phi_k) \}^{z_{1k}}

• Overall cost for the chain is O(K²N)
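The forward recursion translates directly into code. A sketch for a discrete emission table, with hypothetical parameter values (the 2-state π, A, B are made up for illustration):

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[n, k] = p(x_1..x_n, z_n = k), computed by the forward recursion."""
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, x[0]]                       # alpha(z1) = pi_k p(x1 | phi_k)
    for n in range(1, N):
        # p(xn|zn) * sum_{z_{n-1}} alpha(z_{n-1}) p(zn|z_{n-1}); each step is O(K^2)
        alpha[n] = B[:, x[n]] * (alpha[n - 1] @ A)
    return alpha

# Hypothetical 2-state model with 2 symbols (values made up for illustration)
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha = forward([0, 1, 0], pi, A, B)
p_X = alpha[-1].sum()                                # p(X) = sum_{zN} alpha(zN)
```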

Recursion Relation for β

β(zn) = p(xn+1,..,xN | zn)
      = Σ_{zn+1} p(xn+1,..,xN, zn+1 | zn)                            by sum rule
      = Σ_{zn+1} p(xn+1,..,xN | zn, zn+1) p(zn+1 | zn)               by Bayes' rule
      = Σ_{zn+1} p(xn+1,..,xN | zn+1) p(zn+1 | zn)                   by cond. ind. D
      = Σ_{zn+1} p(xn+2,..,xN | zn+1) p(xn+1 | zn+1) p(zn+1 | zn)    by cond. ind. E
      = Σ_{zn+1} β(zn+1) p(xn+1 | zn+1) p(zn+1 | zn)                 by definition of β

Backward Recursion for β
• Backward message passing: evaluates β(zn) in terms of β(zn+1)

\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} \mid z_{n+1}) p(z_{n+1} \mid z_n)

• The starting condition for the recursion follows from

p(z_N \mid X) = \frac{p(X, z_N)}{p(X)}

which is correct provided we set β(zN) = 1 for all settings of zN
• This is the initial condition for the backward computation
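The backward recursion runs the same way in reverse, starting from β(zN) = 1. A sketch, reusing the same made-up 2-state parameters as the forward example:

```python
import numpy as np

def backward(x, A, B):
    """beta[n, k] = p(x_{n+1}..x_N | z_n = k), computed by the backward recursion."""
    N, K = len(x), A.shape[0]
    beta = np.ones((N, K))                           # beta(zN) = 1 for all states
    for n in range(N - 2, -1, -1):
        # beta(zn) = sum_{z_{n+1}} A[zn, z_{n+1}] p(x_{n+1}|z_{n+1}) beta(z_{n+1})
        beta[n] = A @ (B[:, x[n + 1]] * beta[n + 1])
    return beta

# Hypothetical 2-state model (same made-up numbers used for illustration)
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
x = [0, 1, 0]
beta = backward(x, A, B)

# Consistency check: p(X) = sum_{z1} alpha(z1) beta(z1) = sum pi * p(x1|z1) * beta(z1)
p_X = (pi * B[:, x[0]] * beta[0]).sum()
```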

M-Step Equations
• In the M-step equations, p(X) will cancel out, e.g., in

\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}

• The likelihood itself is obtained from

p(X) = \sum_{z_n} \alpha(z_n) \beta(z_n)

Evaluation of the Quantities ξ(zn−1, zn)
• They correspond to the values of the conditional probabilities p(zn−1, zn | X) for each of the K × K settings of (zn−1, zn)
• Thus we calculate ξ(zn−1, zn) directly by using the results of the α and β recursions:

ξ(zn−1, zn) = p(zn−1, zn | X)                                                        by definition
            = p(X | zn−1, zn) p(zn−1, zn) / p(X)                                     by Bayes' rule
            = p(x1,..,xn−1 | zn−1) p(xn | zn) p(xn+1,..,xN | zn) p(zn | zn−1) p(zn−1) / p(X)   by cond. ind. F
            = α(zn−1) p(xn | zn) p(zn | zn−1) β(zn) / p(X)
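With α and β in hand, γ and ξ follow directly from the two formulas above. A self-contained sketch (the same hypothetical 2-state model as in the earlier examples, with the forward and backward passes inlined):

```python
import numpy as np

# Hypothetical 2-state model (made-up numbers for illustration)
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
x = [0, 1, 0]
N, K = len(x), 2

# Forward and backward passes
alpha = np.zeros((N, K)); alpha[0] = pi * B[:, x[0]]
for n in range(1, N):
    alpha[n] = B[:, x[n]] * (alpha[n - 1] @ A)
beta = np.ones((N, K))
for n in range(N - 2, -1, -1):
    beta[n] = A @ (B[:, x[n + 1]] * beta[n + 1])
p_X = alpha[-1].sum()

# gamma(z_n) = alpha(z_n) beta(z_n) / p(X)
gamma = alpha * beta / p_X

# xi(z_{n-1}, z_n) = alpha(z_{n-1}) p(x_n|z_n) p(z_n|z_{n-1}) beta(z_n) / p(X)
xi = np.zeros((N - 1, K, K))
for n in range(1, N):
    xi[n - 1] = (alpha[n - 1][:, None] * A * (B[:, x[n]] * beta[n])[None, :]) / p_X
```

A useful check: each row of γ sums to one, each K × K slice of ξ sums to one, and marginalizing ξ over zn recovers γ(zn−1).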

Summary of EM to Train an HMM
Step 1: Initialization
• Make an initial selection of parameters θ_old, where θ = (π, A, φ)
  1. π is a vector of the K probabilities of the states for latent variable z1
  2. A is a K × K matrix of transition probabilities Ajk
  3. φ are the parameters of the conditional distribution p(xn | zn)
• The A and π parameters are often initialized uniformly
• Initialization of φ depends on the form of the distribution
• For a Gaussian: the parameters μk are initialized by applying K-means to the data, and Σk corresponds to the covariance matrix of the corresponding cluster

Summary of EM to Train an HMM
Step 2: E Step
• Run both the forward α recursion and the backward β recursion
• Use the results to evaluate γ(zn) and ξ(zn−1, zn), and the likelihood function

Step 3: M Step
• Use the results of the E step to find a revised set of parameters θ_new using the M-step equations

• Alternate between the E and M steps until convergence of the likelihood function

Values for p(xn|zn)
• In the recursion relations, the observations enter through the conditional distributions p(xn|zn)
• The recursions are independent of:
  • The dimensionality of the observed variables
  • The form of the conditional distribution
• So long as it can be computed for each of the K possible states of zn
• Since the observed variables {xn} are fixed, these quantities can be pre-computed at the start of the EM algorithm

7(a) Sequence Length: Using Multiple Short Sequences
• An HMM can be trained effectively only if the sequence is sufficiently long
• True of all maximum-likelihood approaches
• Alternatively, we can use multiple short sequences
• Requires a straightforward modification of the HMM-EM algorithm
• Particularly important in left-to-right models, since in a given observation sequence a given state transition corresponding to a non-diagonal element of A occurs only once

7(b). Predictive Distribution
• The observed data is X = {x1,..,xN}; we wish to predict xN+1
• Application in financial forecasting
• Can be evaluated by first running the forward α recursion and summing over zN and zN+1
• Can be extended to subsequent predictions of xN+2, after xN+1 is observed, using a fixed amount of storage

p(xN+1 | X) = Σ_{zN+1} p(xN+1, zN+1 | X)                                 by sum rule
            = Σ_{zN+1} p(xN+1 | zN+1) p(zN+1 | X)                        by product rule and cond. ind. G
            = Σ_{zN+1} p(xN+1 | zN+1) Σ_{zN} p(zN+1, zN | X)             by sum rule
            = Σ_{zN+1} p(xN+1 | zN+1) Σ_{zN} p(zN+1 | zN) p(zN | X)      by conditional independence H
            = Σ_{zN+1} p(xN+1 | zN+1) Σ_{zN} p(zN+1 | zN) p(zN, X) / p(X)   by Bayes' rule
            = (1 / p(X)) Σ_{zN+1} p(xN+1 | zN+1) Σ_{zN} p(zN+1 | zN) α(zN)  by definition of α
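The final line of the derivation suggests a direct implementation: run the forward recursion, propagate α(zN) through A, then map through the emission table. A sketch with made-up 2-state parameters:

```python
import numpy as np

def predictive(x, pi, A, B):
    """p(x_{N+1} = s | x_1..x_N) for every symbol s, via the forward recursion."""
    alpha = pi * B[:, x[0]]
    for n in range(1, len(x)):
        alpha = B[:, x[n]] * (alpha @ A)
    p_X = alpha.sum()
    # sum_{zN} p(z_{N+1}|z_N) alpha(z_N) / p(X) = p(z_{N+1} | X)
    p_next_state = (alpha @ A) / p_X
    # sum_{z_{N+1}} p(x_{N+1}|z_{N+1}) p(z_{N+1}|X)
    return p_next_state @ B

# Hypothetical 2-state, 2-symbol model (numbers made up for illustration)
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pred = predictive([0, 1, 0], pi, A, B)
```

Only the final α vector is needed to extend the prediction to later time steps, which is the fixed-storage property mentioned above.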

7(c). Sum-Product and HMM
• The HMM graph is a tree, hence the sum-product algorithm can be used to find local marginals for the hidden variables
• Equivalent to the forward-backward algorithm
• Sum-product provides a simple way to derive the alpha-beta recursion formulae
• Transform the directed graph to a factor graph
• Each variable has a node; small squares represent factors; undirected links connect factor nodes to the variables they use

Joint distribution:

p(x_1,\ldots,x_N, z_1,\ldots,z_N) = p(z_1) \Big[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \Big] \prod_{n=1}^{N} p(x_n \mid z_n)

(Figures: HMM graph, fragment of the factor graph)

Deriving Alpha-Beta from Sum-Product
• Begin with a simplified form of the factor graph, obtained by absorbing the emission probabilities into the transition-probability factors
• The factors are given by

h(z_1) = p(z_1) p(x_1 \mid z_1)
f_n(z_{n-1}, z_n) = p(z_n \mid z_{n-1}) p(x_n \mid z_n)

• The messages propagated are

\mu_{z_{n-1} \to f_n}(z_{n-1}) = \mu_{f_{n-1} \to z_{n-1}}(z_{n-1})
\mu_{f_n \to z_n}(z_n) = \sum_{z_{n-1}} f_n(z_{n-1}, z_n) \mu_{z_{n-1} \to f_n}(z_{n-1})

• One can show that this computes the α recursion
• Similarly, starting with the root node, the β recursion is computed
• So also γ and ξ are derived

Final results:

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n \mid z_{n-1})
\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} \mid z_{n+1}) p(z_{n+1} \mid z_n)
\gamma(z_n) = \frac{\alpha(z_n) \beta(z_n)}{p(X)}
\xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1}) p(x_n \mid z_n) p(z_n \mid z_{n-1}) \beta(z_n)}{p(X)}

7(d). Scaling Factors
• An implementation issue for small probabilities
• At each step of the recursion

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n \mid z_{n-1})

to obtain a new value of α(zn) from the previous value α(zn−1), we multiply by p(zn | zn−1) and p(xn | zn)
• These probabilities are small, and the products will underflow
• Logs don't help, since we have sums of products
• The solution is rescaling of α(zn) and β(zn), whose values then remain close to unity
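A common way to implement the rescaling is to normalize α at every step and keep the scaling factors c_n, since ln p(X) = Σ_n ln c_n. A sketch (the parameters are made up; an unscaled pass over a sequence of this length would underflow double precision):

```python
import numpy as np

def forward_scaled(x, pi, A, B):
    """Scaled forward pass.

    alpha_hat[n] = p(z_n | x_1..x_n) and c[n] = p(x_n | x_1..x_{n-1}),
    so ln p(X) = sum_n ln c[n] and nothing underflows on long sequences.
    """
    N, K = len(x), len(pi)
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    a = pi * B[:, x[0]]
    c[0] = a.sum(); alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[:, x[n]] * (alpha_hat[n - 1] @ A)   # un-normalized step
        c[n] = a.sum(); alpha_hat[n] = a / c[n]   # renormalize, remember the factor
    return alpha_hat, np.log(c).sum()

# Hypothetical 2-state model; N = 1000 observations
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
x = [0, 1] * 500
alpha_hat, loglik = forward_scaled(x, pi, A, B)
```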

7(e). The Viterbi Algorithm
• Finding the most probable sequence of hidden states for a given sequence of observables
• In speech recognition: finding the most probable phoneme sequence for a given series of acoustic observations
• Since the graphical model of the HMM is a tree, this can be solved exactly using the max-sum algorithm
• Known as the Viterbi algorithm in the context of HMMs
• Since max-sum works with log probabilities, there is no need to work with rescaled variables as in forward-backward

Viterbi Algorithm for HMM
• A fragment of the HMM lattice showing two paths
• The number of possible paths grows exponentially with the length of the chain
• Viterbi searches this space of paths efficiently
• It finds the most probable path with computational cost linear in the length of the chain

• Treat variable zN as the root node, passing messages to the root from the leaf nodes
• The messages passed are

\mu_{z_n \to f_{n+1}}(z_n) = \mu_{f_n \to z_n}(z_n)
\mu_{f_{n+1} \to z_{n+1}}(z_{n+1}) = \max_{z_n} \{ \ln f_{n+1}(z_n, z_{n+1}) + \mu_{z_n \to f_{n+1}}(z_n) \}
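The max-sum messages above correspond to the familiar dynamic-programming form of Viterbi. A sketch in log space, with a made-up 2-state model in which symbol 0 favors state 0 and symbol 1 favors state 1:

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state sequence via max-sum (Viterbi) in log space."""
    N, K = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    omega = np.zeros((N, K))          # omega[n, k] = best log-prob of any path ending in state k
    back = np.zeros((N, K), dtype=int)
    omega[0] = np.log(pi) + logB[:, x[0]]
    for n in range(1, N):
        scores = omega[n - 1][:, None] + logA        # scores[j, k]: come from j, land in k
        back[n] = scores.argmax(axis=0)              # best predecessor for each state k
        omega[n] = scores.max(axis=0) + logB[:, x[n]]
    # Backtrack from the most probable final state
    z = [int(omega[-1].argmax())]
    for n in range(N - 1, 0, -1):
        z.append(int(back[n][z[-1]]))
    return z[::-1]

# Hypothetical 2-state model (numbers made up for illustration)
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2], [0.2, 0.8]])
B  = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1, 1], pi, A, B)
```

Storing the argmax at each step and backtracking once from the end is what makes the search linear in the chain length rather than exponential.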

Other Topics on Sequential Data
• Sequential Data and Markov Models:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
• Extensions of HMMs:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
• Linear Dynamical Systems:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf
• Conditional Random Fields:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-ConditionalRandomFields.pdf