Hidden Markov Models
Sargur N. Srihari
[email protected]
Machine Learning Course (CSE 574): http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

1. Hidden Markov Models (title slide)

2. HMM Topics
   1. What is an HMM?
   2. State-space representation
   3. HMM parameters
   4. Generative view of HMM
   5. Determining HMM parameters using EM
   6. Forward-backward (alpha-beta) algorithm
   7. HMM implementation issues: a) length of sequence, b) predictive distribution, c) sum-product algorithm, d) scaling factors, e) Viterbi algorithm

3. What is an HMM?
   - A ubiquitous tool for modeling time-series data, used in almost all speech recognition systems, in computational molecular biology (e.g., grouping amino acid sequences into proteins), and in handwritten word recognition.
   - A tool for representing probability distributions over sequences of observations.
   - The HMM gets its name from two defining properties:
     - The observation x_t at time t was generated by some process whose state z_t is hidden from the observer.
     - The state z_t depends only on the state z_{t-1} and is independent of all prior states (first order).
   - Example: z is a phoneme sequence, x are the acoustic observations.

4. Graphical Model of HMM
   - The HMM has the graphical model of a chain of latent variables z_n, each emitting an observation x_n, where the latent variables are discrete.
   - The joint distribution has the form
     p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [\prod_{n=2}^{N} p(z_n | z_{n-1})] \prod_{n=1}^{N} p(x_n | z_n)

5. HMM Viewed as a Mixture
   - A single time slice corresponds to a mixture distribution with component densities p(x | z); each state of the discrete variable z represents a different component.
   - It is an extension of the mixture model: the choice of mixture component depends on the choice of component for the previous observation.
   - The latent variables are multinomial variables z_n describing which component is responsible for generating x_n; a one-of-K coding scheme can be used.

6. State-Space Representation
   - The probability distribution of z_n depends on the state of the previous latent variable z_{n-1} through the distribution p(z_n | z_{n-1}).
   - With one-of-K coding, z_n is a K-dimensional binary vector: if z_n is in state k, then z_{nk} = 1 and all other components are 0.
   - Since the latent variables are K-dimensional binary vectors, p(z_n | z_{n-1}) is a K x K matrix A with elements
     A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1)
     known as the transition probabilities.
   - Since \sum_k A_{jk} = 1, the matrix A has K(K-1) independent parameters.

7. Transition Probabilities
   - Example with a 3-state latent variable (K = 3): the state transition diagram has states z_n = 1, 2, 3 with transition probabilities A_{11},..,A_{33}, and each row sums to one, e.g. A_{11} + A_{12} + A_{13} = 1.
   - In one-of-K coding, z_{n-1} = 1 corresponds to z_{n-1,1} = 1, z_{n-1,2} = 0, z_{n-1,3} = 0, and similarly for states 2 and 3.
   - The state transition diagram is not a graphical model, since its nodes are not separate variables but states of a single variable.

8. Conditional Probabilities
   - The transition probabilities A_{jk} are state-to-state probabilities for each variable; the conditional (variable-to-variable) probabilities can be written in terms of them as
     p(z_n | z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}
   - The exponent z_{n-1,j} z_{nk} is a product that evaluates to 0 or 1, so the overall product evaluates to a single A_{jk} for each setting of the values of z_n and z_{n-1}. E.g., z_{n-1} = 2 and z_n = 3 give z_{n-1,2} = 1 and z_{n,3} = 1, so p(z_n = 3 | z_{n-1} = 2) = A_{23}.
   - A is a global HMM parameter.

9. Initial Variable Probabilities
   - The initial latent node z_1 is a special case without a parent node. It is represented by a vector of probabilities \pi with elements \pi_k = p(z_{1k} = 1), so that
     p(z_1 | \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}},  where \sum_k \pi_k = 1
   - \pi is an HMM parameter representing the probabilities of each state for the first variable.

10. Lattice or Trellis Diagram
   - The state transition diagram unfolded over time: a representation of the latent variable states in which each column corresponds to one of the latent variables z_n.
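The one-of-K coding and row-stochastic transition matrix above can be made concrete with a small sketch that draws a latent state sequence from \pi and A. This is illustrative only; the function and parameter values are mine, not from the slides.

```python
# Hypothetical sketch: sampling z_1..z_N from initial probabilities pi
# and transition matrix A, where row j of A holds p(z_n = k | z_{n-1} = j).
import numpy as np

def sample_states(pi, A, N, rng=None):
    """Draw a latent state sequence of length N (states as integers 0..K-1)."""
    rng = np.random.default_rng(rng)
    K = len(pi)
    z = np.empty(N, dtype=int)
    z[0] = rng.choice(K, p=pi)              # p(z_1) = pi
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])  # row z[n-1] of A sums to 1
    return z

pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])
z = sample_states(pi, A, 10, rng=0)
```

The integer state z[n] corresponds to the one-of-K vector with z_{nk} = 1 at position k = z[n].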
11. Emission Probabilities p(x_n | z_n)
   - So far we have specified only p(z_n | z_{n-1}), by means of the transition probabilities. To complete the model we need the emission probabilities p(x_n | z_n, \phi), where \phi are parameters; the observations can be continuous or discrete.
   - Because x_n is observed and z_n is discrete, p(x_n | z_n, \phi) consists, for a given x_n, of a table of K numbers corresponding to the K states of z_n, analogous to class-conditional probabilities. It can be represented as
     p(x_n | z_n, \phi) = \prod_{k=1}^{K} p(x_n | \phi_k)^{z_{nk}}

12. HMM Parameters
   - We have defined three types of HMM parameters, \theta = (\pi, A, \phi):
     1. Initial probabilities of the first latent variable: \pi is a vector of K probabilities of the states for z_1.
     2. Transition probabilities (state-to-state, for any latent variable): A is a K x K matrix of transition probabilities A_{jk}.
     3. Emission probabilities (observations conditioned on latent states): \phi are the parameters of the conditional distribution p(x_n | z_n).
   - \pi and A are often initialized uniformly; the initialization of \phi depends on the form of the distribution.

13. Joint Distribution over Latent and Observed Variables
   - The joint can be expressed in terms of the parameters:
     p(X, Z | \theta) = p(z_1 | \pi) [\prod_{n=2}^{N} p(z_n | z_{n-1}, A)] \prod_{m=1}^{N} p(x_m | z_m, \phi)
     where X = {x_1,..,x_N}, Z = {z_1,..,z_N}, \theta = {\pi, A, \phi}
   - Most of the discussion of HMMs is independent of the emission probabilities; tractable cases include discrete tables, Gaussians, and Gaussian mixtures.

14. Generative View of HMM
   - Sampling from an HMM with a 3-state latent variable z and a Gaussian emission model p(x | z); contours of constant density of the emission distribution are shown for each state, for two-dimensional x.
   - 50 data points are generated from the HMM; lines show successive observations.
   - The transition probabilities are fixed so that there is a 5% probability of making a transition to each other state and 90% of remaining in the same state.

15. Left-to-Right HMM
   - Obtained by setting the elements A_{jk} = 0 if k < j; the corresponding lattice diagram contains only left-to-right transitions.
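The generative (ancestral-sampling) view described above can be sketched in a few lines: draw a state path, then emit one Gaussian observation per state. All parameter values here are made up for illustration, with 90% self-transition probability as in the slide's example.

```python
# Minimal ancestral-sampling sketch of an HMM with Gaussian emissions.
# Names (sample_hmm, means, covs) are illustrative, not from the slides.
import numpy as np

def sample_hmm(pi, A, means, covs, N, rng=None):
    rng = np.random.default_rng(rng)
    K = len(pi)
    z = np.empty(N, dtype=int)
    x = np.empty((N, means.shape[1]))
    z[0] = rng.choice(K, p=pi)
    x[0] = rng.multivariate_normal(means[z[0]], covs[z[0]])
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])                  # transition
        x[n] = rng.multivariate_normal(means[z[n]], covs[z[n]])  # emission
    return z, x

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.90, 0.05, 0.05],   # 90% of remaining in the same state
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
covs = np.stack([0.2 * np.eye(2)] * 3)
z, x = sample_hmm(pi, A, means, covs, 50, rng=1)
```

Plotting successive rows of x with connecting lines would reproduce the kind of figure the slide describes.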
16. Left-to-Right HMM Applied Generatively to Digits
   - Examples of on-line handwritten 2s: x is a sequence of pen coordinates.
   - There are 16 states for z (K = 16); each state can generate a line segment of fixed length in one of 16 angles, so the emission probabilities form a 16 x 16 table.
   - The transition probabilities are set to zero except for those that keep the state index k the same or increment it by one.
   - The parameters were optimized by 25 EM iterations, trained on 45 digits.
   - The generative model is quite poor, since the generated digits don't look like the training examples. If classification is the goal, one can do better by using a discriminative HMM.

17. Determining HMM Parameters
   - Given a data set X = {x_1,..,x_N} we can determine the HMM parameters \theta = {\pi, A, \phi} using maximum likelihood.
   - The likelihood function is obtained from the joint distribution by marginalizing over the latent variables Z = {z_1,..,z_N}:
     p(X | \theta) = \sum_Z p(X, Z | \theta)

18. Computational Issues for Parameters
   - Difficulties: the joint distribution p(X, Z | \theta) does not factorize over n, in contrast to the mixture model, and the sum over Z has an exponential number of terms, corresponding to the paths through the trellis.
   - Solution: use conditional independence properties to reorder the summations, and use EM to maximize the expected log-likelihood of the joint, ln p(X, Z | \theta). This gives an efficient framework for maximizing the likelihood function in HMMs.

19. EM for Maximum Likelihood Estimation in an HMM
   1. Start with an initial selection of model parameters \theta^{old}.
   2. In the E step, take these parameter values and find the posterior distribution of the latent variables, p(Z | X, \theta^{old}). Use this posterior to evaluate the expectation of the logarithm of the complete-data likelihood ln p(X, Z | \theta), which can be written as
      Q(\theta, \theta^{old}) = \sum_Z p(Z | X, \theta^{old}) ln p(X, Z | \theta)
      where the factor p(Z | X, \theta^{old}), independent of \theta, is what the E step evaluates.
   3. In the M step, maximize Q with respect to \theta.
20. Expansion of Q
   - Introduce the notation
     \gamma(z_n) = p(z_n | X, \theta^{old}): marginal posterior distribution of the latent variable z_n
     \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, \theta^{old}): joint posterior of two successive latent variables
   - We will re-express Q(\theta, \theta^{old}) = \sum_Z p(Z | X, \theta^{old}) ln p(X, Z | \theta) in terms of \gamma and \xi.

21. Detail of \gamma and \xi
   - For each value of n we can store \gamma(z_n) using K non-negative numbers that sum to unity, and \xi(z_{n-1}, z_n) using a K x K matrix of non-negative numbers that also sum to unity.
   - The notation \gamma(z_{nk}) denotes the conditional probability of z_{nk} = 1, with similar notation for \xi(z_{n-1,j}, z_{nk}).
   - Because the expectation of a binary random variable is the probability that it takes the value 1,
     \gamma(z_{nk}) = E[z_{nk}] = \sum_{z_n} \gamma(z_n) z_{nk}
     \xi(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n) z_{n-1,j} z_{nk}

22. Expansion of Q
   - We begin with Q(\theta, \theta^{old}) = \sum_Z p(Z | X, \theta^{old}) ln p(X, Z | \theta), substitute
     p(X, Z | \theta) = p(z_1 | \pi) [\prod_{n=2}^{N} p(z_n | z_{n-1}, A)] \prod_{m=1}^{N} p(x_m | z_m, \phi)
     and use the definitions of \gamma and \xi to get
     Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) ln p(x_n | \phi_k)

23. E-Step
   - The goal of the E step is to evaluate \gamma(z_n) and \xi(z_{n-1}, z_n) efficiently (the forward-backward algorithm).

24. M-Step
   - Maximize Q(\theta, \theta^{old}) with respect to the parameters \theta = {\pi, A, \phi}, treating \gamma(z_n) and \xi(z_{n-1}, z_n) as constant.
   - Maximization w.r.t. \pi and A is easily achieved (using Lagrange multipliers):
     \pi_k = \gamma(z_{1k}) / \sum_{j=1}^{K} \gamma(z_{1j})
     A_{jk} = \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk}) / \sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})
   - Maximization w.r.t. \phi_k: only the last term of Q depends on \phi_k,
     \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) ln p(x_n | \phi_k)
     which has the same form as in a mixture distribution for i.i.d. data.
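The closed-form M-step updates for \pi and A can be sketched directly from the formulas above, given posteriors gamma (shape N x K) and xi (shape (N-1) x K x K) from an E step. The function name and the toy posterior values are mine.

```python
# Sketch of the M-step updates for pi and A from the gamma/xi posteriors.
import numpy as np

def m_step_pi_A(gamma, xi):
    pi = gamma[0] / gamma[0].sum()      # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
    A = xi.sum(axis=0)                  # sum_{n=2}^{N} xi(z_{n-1,j}, z_{nk})
    A /= A.sum(axis=1, keepdims=True)   # normalize over k for each row j
    return pi, A

# toy posteriors for K = 2, N = 3 (each gamma row and xi slice sums to one)
gamma = np.array([[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
xi = np.array([[[0.3, 0.3], [0.2, 0.2]],
               [[0.25, 0.25], [0.25, 0.25]]])
pi, A = m_step_pi_A(gamma, xi)
```

Note how treating gamma and xi as constants turns the maximization into simple normalized counts.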
25. M-Step for Gaussian Emission Densities
   - Maximization of Q(\theta, \theta^{old}) w.r.t. \phi_k for Gaussian emission densities p(x | \phi_k) = N(x | \mu_k, \Sigma_k) gives the solution
     \mu_k = \sum_{n=1}^{N} \gamma(z_{nk}) x_n / \sum_{n=1}^{N} \gamma(z_{nk})
     \Sigma_k = \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T / \sum_{n=1}^{N} \gamma(z_{nk})

26. M-Step for Multinomial Observed Variables
   - When the conditional distribution of the observations has the form
     p(x | z) = \prod_{i=1}^{D} \prod_{k=1}^{K} \mu_{ik}^{x_i z_k}
     the M-step equations are
     \mu_{ik} = \sum_{n=1}^{N} \gamma(z_{nk}) x_{ni} / \sum_{n=1}^{N} \gamma(z_{nk})
   - An analogous result holds for Bernoulli observed variables.

27. The Forward-Backward Algorithm
   - The E step needs an efficient procedure to evaluate \gamma(z_n) and \xi(z_{n-1}, z_n).
   - The graph of the HMM is a tree, which implies that the posterior distribution of the latent variables can be obtained efficiently using a message-passing algorithm. In the context of HMMs it is called the forward-backward or Baum-Welch algorithm.
   - Several variants lead to exact marginals; the method called alpha-beta is discussed here.

28. Derivation of Forward-Backward
   - Several conditional independence properties (A-H) hold.
   A. p(X | z_n) = p(x_1,..,x_n | z_n) p(x_{n+1},..,x_N | z_n)
   - Proved using d-separation: every path from any of x_1,..,x_{n-1} to x_n passes through z_n, which is observed, and the path is head-to-tail. Thus (x_1,..,x_{n-1}) _||_ x_n | z_n, and similarly (x_1,..,x_n) _||_ (x_{n+1},..,x_N) | z_n.

29. Conditional Independence B
   - Since (x_1,..,x_{n-1}) _||_ x_n | z_n, we have
   B. p(x_1,..,x_{n-1} | x_n, z_n) = p(x_1,..,x_{n-1} | z_n)

30. Conditional Independence C
   - Since (x_1,..,x_{n-1}) _||_ z_n | z_{n-1},
   C. p(x_1,..,x_{n-1} | z_{n-1}, z_n) = p(x_1,..,x_{n-1} | z_{n-1})

31. Conditional Independence D
   - Since (x_{n+1},..,x_N) _||_ z_n | z_{n+1},
   D. p(x_{n+1},..,x_N | z_n, z_{n+1}) = p(x_{n+1},..,x_N | z_{n+1})

32. Conditional Independence E
   - Since (x_{n+2},..,x_N) _||_ x_{n+1} | z_{n+1},
   E. p(x_{n+2},..,x_N | z_{n+1}, x_{n+1}) = p(x_{n+2},..,x_N | z_{n+1})

33. Conditional Independence F
   F. p(X | z_{n-1}, z_n) = p(x_1,..,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},..,x_N | z_n)
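The Gaussian-emission M-step above amounts to responsibility-weighted means and covariances per state, and can be sketched as follows (the function name and the toy responsibilities are mine, for illustration only).

```python
# Sketch of the Gaussian M-step: weighted mean and covariance per state.
import numpy as np

def m_step_gaussian(X, gamma):
    """X: (N, D) observations; gamma: (N, K) posteriors gamma(z_nk)."""
    Nk = gamma.sum(axis=0)                    # effective count per state
    mu = (gamma.T @ X) / Nk[:, None]          # weighted means mu_k
    K, D = mu.shape
    Sigma = np.empty((K, D, D))
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]  # weighted covariance
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
gamma = rng.dirichlet(np.ones(3), size=100)   # toy responsibilities, rows sum to 1
mu, Sigma = m_step_gaussian(X, gamma)
```

This is the same form as the M-step of a Gaussian mixture model, exactly as the slide notes for the i.i.d. case.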
34. Conditional Independence G
   - Since (x_1,..,x_N) _||_ x_{N+1} | z_{N+1},
   G. p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})

35. Conditional Independence H
   H. p(z_{N+1} | z_N, X) = p(z_{N+1} | z_N)

36. Evaluation of \gamma(z_n)
   - Recall that the aim is to efficiently compute the E step of estimating the HMM parameters, where \gamma(z_n) = p(z_n | X, \theta^{old}) is the marginal posterior distribution of the latent variable z_n.
   - We are interested in the posterior p(z_n | x_1,..,x_N): a vector of length K whose entries correspond to the expected values of z_{nk}.

37. Introducing \alpha and \beta
   - Using Bayes' theorem,
     \gamma(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)
   - Using conditional independence A,
     \gamma(z_n) = p(x_1,..,x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n) / p(X)
                 = p(x_1,..,x_n, z_n) p(x_{n+1},..,x_N | z_n) / p(X)
                 = \alpha(z_n) \beta(z_n) / p(X)
     where \alpha(z_n) = p(x_1,..,x_n, z_n) is the joint probability of observing all the data up to time n together with the value of z_n, and \beta(z_n) = p(x_{n+1},..,x_N | z_n) is the conditional probability of all future data from time n+1 up to N given the value of z_n.

38. Recursion Relation for \alpha
     \alpha(z_n) = p(x_1,..,x_n, z_n)
                 = p(x_1,..,x_n | z_n) p(z_n)                                               [product rule]
                 = p(x_n | z_n) p(x_1,..,x_{n-1} | z_n) p(z_n)                              [conditional independence B]
                 = p(x_n | z_n) p(x_1,..,x_{n-1}, z_n)                                      [product rule]
                 = p(x_n | z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1}, z_n)              [sum rule]
                 = p(x_n | z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_n | z_{n-1}) p(z_{n-1})  [product rule]
                 = p(x_n | z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})  [conditional independence C]
                 = p(x_n | z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1}) p(z_n | z_{n-1})  [product rule]
                 = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n | z_{n-1})             [definition of \alpha]
39. Forward Recursion for \alpha Evaluation
   - The recursion relation is
     \alpha(z_n) = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n | z_{n-1})
   - There are K terms in the summation, and it has to be evaluated for each of the K values of z_n, so each step of the recursion is O(K^2).
   - The initial condition is
     \alpha(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = \prod_{k=1}^{K} {\pi_k p(x_1 | \phi_k)}^{z_{1k}}
   - The overall cost for the chain is O(K^2 N).

40. Recursion Relation for \beta
     \beta(z_n) = p(x_{n+1},..,x_N | z_n)
                = \sum_{z_{n+1}} p(x_{n+1},..,x_N, z_{n+1} | z_n)                           [sum rule]
                = \sum_{z_{n+1}} p(x_{n+1},..,x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)          [product rule]
                = \sum_{z_{n+1}} p(x_{n+1},..,x_N | z_{n+1}) p(z_{n+1} | z_n)               [conditional independence D]
                = \sum_{z_{n+1}} p(x_{n+2},..,x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)  [conditional independence E]
                = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)       [definition of \beta]

41. Backward Recursion for \beta
   - Backward message passing:
     \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
     evaluates \beta(z_n) in terms of \beta(z_{n+1}).
   - The starting condition for the recursion,
     p(z_N | X) = p(X, z_N) \beta(z_N) / p(X)
     is correct provided we set \beta(z_N) = 1 for all settings of z_N. This is the initial condition for the backward computation.

42. M-Step Equations and p(X)
   - In the M-step equations p(X) cancels out, e.g. in
     \mu_k = \sum_{n=1}^{N} \gamma(z_{nk}) x_n / \sum_{n=1}^{N} \gamma(z_{nk})
   - The likelihood itself is obtained from
     p(X) = \sum_{z_n} \alpha(z_n) \beta(z_n)   (for any choice of n)

43. Evaluation of the Quantities \xi(z_{n-1}, z_n)
   - These correspond to the values of the conditional probabilities p(z_{n-1}, z_n | X) for each of the K x K settings of (z_{n-1}, z_n):
     \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)                                                 [definition]
                       = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)                          [Bayes' rule]
                       = p(x_1,..,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)  [conditional independence F]
                       = \alpha(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) \beta(z_n) / p(X)
   - Thus we calculate \xi(z_{n-1}, z_n) directly using the results of the \alpha and \beta recursions.

44. Summary of EM to Train an HMM, Step 1: Initialization
   - Make an initial selection of parameters \theta^{old}, where \theta = (\pi, A, \phi):
     1. \pi is a vector of K probabilities of the states for the latent variable z_1
     2. A is a K x K matrix of transition probabilities A_{jk}
     3. \phi are the parameters of the conditional distribution p(x_n | z_n)
   - \pi and A are often initialized uniformly; the initialization of \phi depends on the form of the distribution. For Gaussian emissions, the parameters \mu_k can be initialized by applying K-means to the data, with \Sigma_k corresponding to the covariance matrix of the associated cluster.

45. Summary of EM to Train an HMM, Steps 2 and 3
   - Step 2 (E step): run both the forward \alpha recursion and the backward \beta recursion, and use the results to evaluate \gamma(z_n), \xi(z_{n-1}, z_n), and the likelihood function.
   - Step 3 (M step): use the results of the E step to find a revised set of parameters \theta^{new} using the M-step equations.
   - Alternate between E and M steps until the likelihood function converges.

46. Values for p(x_n | z_n)
   - In the recursion relations, the observations enter through the conditional distributions p(x_n | z_n).
   - The recursions are independent of the dimensionality of the observed variables and of the form of the conditional distribution, so long as it can be computed for each of the K possible states of z_n.
   - Since the observed variables {x_n} are fixed, these quantities can be pre-computed at the start of the EM algorithm.

47. 7(a) Sequence Length: Using Multiple Short Sequences
   - An HMM can be trained effectively only if the length of the sequence is sufficiently long; this is true of all maximum likelihood approaches.
   - Alternatively we can use multiple short sequences, which requires a straightforward modification of the HMM-EM algorithm.
   - This is particularly important in left-to-right models, since in a given observation sequence a given state transition corresponding to a non-diagonal element of A occurs only once.
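The E step described above can be sketched end to end with unscaled alpha-beta recursions (adequate for short sequences; slide 7(d) covers the rescaling needed in practice). B[n, k] stands for the pre-computed emission likelihood p(x_n | z_n = k); all names and values are illustrative.

```python
# Unscaled forward-backward sketch computing gamma, xi, and p(X).
import numpy as np

def forward_backward(pi, A, B):
    """B[n, k] = p(x_n | z_n = k). Returns gamma (N,K), xi (N-1,K,K), p(X)."""
    N, K = B.shape
    alpha = np.empty((N, K))
    beta = np.empty((N, K))
    alpha[0] = pi * B[0]                           # alpha(z_1) = pi_k p(x_1|phi_k)
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)       # forward recursion
    beta[-1] = 1.0                                 # beta(z_N) = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[n + 1] * beta[n + 1])     # backward recursion
    pX = alpha[-1].sum()                           # p(X) = sum_z alpha beta (n = N)
    gamma = alpha * beta / pX
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :] / pX
    return gamma, xi, pX

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])  # toy emission likelihoods
gamma, xi, pX = forward_backward(pi, A, B)
```

Each gamma row and each xi slice sums to one, matching the storage description in slide 21.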
48. 7(b) Predictive Distribution
   - The observed data is X = {x_1,..,x_N} and we wish to predict x_{N+1}; an application is financial forecasting.
     p(x_{N+1} | X) = \sum_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)                                 [sum rule]
                    = \sum_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)                     [product rule, conditional independence G]
                    = \sum_{z_{N+1}} p(x_{N+1} | z_{N+1}) \sum_{z_N} p(z_{N+1}, z_N | X)     [sum rule]
                    = \sum_{z_{N+1}} p(x_{N+1} | z_{N+1}) \sum_{z_N} p(z_{N+1} | z_N) p(z_N | X)  [conditional independence H]
                    = \sum_{z_{N+1}} p(x_{N+1} | z_{N+1}) \sum_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)  [Bayes' rule]
                    = (1 / p(X)) \sum_{z_{N+1}} p(x_{N+1} | z_{N+1}) \sum_{z_N} p(z_{N+1} | z_N) \alpha(z_N)  [definition of \alpha]
   - This can be evaluated by first running the forward \alpha recursion and then summing over z_N and z_{N+1}.
   - It can be extended to subsequent predictions of x_{N+2}, after x_{N+1} is observed, using a fixed amount of storage.

49. 7(c) Sum-Product Algorithm and the HMM
   - The HMM graph is a tree, so the sum-product algorithm can be used to find local marginals for the hidden variables. The result is equivalent to the forward-backward algorithm, and sum-product provides a simple way to derive the alpha-beta recursion formulae.
   - The joint distribution is
     p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [\prod_{n=2}^{N} p(z_n | z_{n-1})] \prod_{n=1}^{N} p(x_n | z_n)
   - Transform the directed graph to a factor graph: each variable has a node, small squares represent factors, and undirected links connect factor nodes to the variables they use.
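The inner sum of the predictive distribution above reuses only the final alpha vector: normalizing it gives p(z_N | X), and one matrix product gives p(z_{N+1} | X), which then mixes the emission densities. The function name and toy numbers are mine.

```python
# Sketch of the predictive step from the final forward messages alpha(z_N).
import numpy as np

def predictive_state_dist(alpha_N, A):
    """p(z_{N+1}=k | X) = sum_j p(z_{N+1}=k | z_N=j) alpha(z_N=j) / p(X)."""
    w = alpha_N / alpha_N.sum()   # p(z_N | X), since p(X) = sum_j alpha(z_N=j)
    return w @ A

alpha_N = np.array([0.012, 0.048])          # toy final forward messages
A = np.array([[0.7, 0.3], [0.4, 0.6]])
p_next = predictive_state_dist(alpha_N, A)   # -> [0.46, 0.54]
# p(x_{N+1} | X) would then be sum_k p_next[k] * p(x_{N+1} | phi_k)
```

Because only alpha_N is needed, successive predictions require a fixed amount of storage, as the slide notes.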
50. Deriving Alpha-Beta from Sum-Product
   - Begin with a simplified form of the factor graph. The factors are
     h(z_1) = p(z_1) p(x_1 | z_1)
     f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n)
     simplified by absorbing the emission probabilities into the transition probability factors.
   - The messages propagated are
     \mu_{z_{n-1} -> f_n}(z_{n-1}) = \mu_{f_{n-1} -> z_{n-1}}(z_{n-1})
     \mu_{f_n -> z_n}(z_n) = \sum_{z_{n-1}} f_n(z_{n-1}, z_n) \mu_{f_{n-1} -> z_{n-1}}(z_{n-1})
   - Final results: one can show that the \alpha recursion
     \alpha(z_n) = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n | z_{n-1})
     is computed, and similarly, starting with the root node, the \beta recursion
     \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
     so that \gamma and \xi are also derived:
     \gamma(z_n) = \alpha(z_n) \beta(z_n) / p(X)
     \xi(z_{n-1}, z_n) = \alpha(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) \beta(z_n) / p(X)

51. 7(d) Scaling Factors
   - An implementation issue arises from small probabilities. At each step of the recursion
     \alpha(z_n) = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n | z_{n-1})
     the new value of \alpha(z_n) is obtained from the previous value \alpha(z_{n-1}) by multiplying by p(z_n | z_{n-1}) and p(x_n | z_n). These probabilities are small, and the products will underflow.
   - Logs don't help, since we have sums of products. The solution is a rescaling of \alpha(z_n) and \beta(z_n) so that their values remain close to unity.

52. 7(e) The Viterbi Algorithm
   - The problem: finding the most probable sequence of hidden states for a given sequence of observables; in speech recognition, finding the most probable phoneme sequence for a given series of acoustic observations.
   - Since the graphical model of the HMM is a tree, this can be solved exactly using the max-sum algorithm, known as the Viterbi algorithm in the context of HMMs.
   - Since max-sum works with log probabilities, there is no need to work with rescaled variables as in forward-backward.
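The rescaling idea in 7(d) can be sketched for the forward pass: normalize alpha at each step and keep the per-step scaling factors c_n, so that ln p(X) = sum_n ln c_n is computed without underflow. Variable names and toy values are mine.

```python
# Scaled forward recursion: alpha_hat(z_n) = alpha(z_n) / prod_{m<=n} c_m.
import numpy as np

def forward_scaled(pi, A, B):
    """B[n, k] = p(x_n | z_n = k). Returns normalized alphas and ln p(X)."""
    N, K = B.shape
    alpha_hat = np.empty((N, K))
    c = np.empty(N)
    a = pi * B[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()                  # c_n = p(x_n | x_1,..,x_{n-1})
        alpha_hat[n] = a / c[n]         # stays close to unity, no underflow
    return alpha_hat, np.log(c).sum()   # ln p(X) = sum_n ln c_n

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
alpha_hat, logpX = forward_scaled(pi, A, B)
```

The backward pass is rescaled with the same factors c_n, so that gamma = alpha_hat * beta_hat directly.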
53. Viterbi Algorithm for the HMM
   - A fragment of the HMM lattice showing two paths: the number of possible paths grows exponentially with the length of the chain.
   - Viterbi searches the space of paths efficiently, finding the most probable path at a computational cost linear in the length of the chain.

54. Deriving Viterbi from Max-Sum
   - Start with the simplified factor graph. Treat the variable z_N as the root node, passing messages to the root from the leaf nodes.
   - The messages passed are
     \omega_{z_n -> f_{n+1}}(z_n) = \omega_{f_n -> z_n}(z_n)
     \omega(z_{n+1}) = \max_{z_n} { ln f_{n+1}(z_n, z_{n+1}) + \omega(z_n) }

55. Other Topics on Sequential Data
   - Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
   - Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
   - Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf
   - Conditional Random Fields: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-ConditionalRandomFields.pdf
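The max-sum recursion of slides 52-54 can be sketched as a log-space Viterbi decoder with backtracking; delta and psi follow common textbook convention rather than the slides' notation, and the toy parameters are mine.

```python
# Log-space Viterbi sketch: most probable state path, O(K^2 N) like alpha.
import numpy as np

def viterbi(pi, A, B):
    """B[n, k] = p(x_n | z_n = k). Returns the most probable state path."""
    N, K = B.shape
    logA = np.log(A)
    delta = np.log(pi) + np.log(B[0])           # omega(z_1)
    psi = np.empty((N, K), dtype=int)           # argmax back-pointers
    for n in range(1, N):
        scores = delta[:, None] + logA          # scores[j, k] for j -> k
        psi[n] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[n])
    path = np.empty(N, dtype=int)
    path[-1] = delta.argmax()
    for n in range(N - 2, -1, -1):              # backtrack along psi
        path[n] = psi[n + 1, path[n + 1]]
    return path

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
path = viterbi(pi, A, B)                        # -> [0, 1, 0]
```

Working in log probabilities removes the need for the rescaling of 7(d), exactly as slide 52 observes.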