Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields


Transcript of Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields

Page 1: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields

Page 2: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Maximum Entropy Model
• x – observations
• y – class identity
• f_k – feature functions
• λ_k – trainable parameters

\lambda = [\lambda_1, \ldots, \lambda_K]^T

\log p(y, x) = \sum_k \lambda_k f_k(x, y) + \mathrm{const}

p(y, x) \propto \exp\Big\{\sum_k \lambda_k f_k(x, y)\Big\}

f(x, y) = [f_1(x, y), \ldots, f_K(x, y)]^T

\log p(y, x) = \lambda^T f(x, y) + \mathrm{const}

Page 3: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Maximum Entropy Model

• We train the parameters λ_k to maximize the conditional likelihood of the training data (i.e. minimize cross-entropy; maximize the MMI objective function)

\hat{\lambda} = \arg\max_{\lambda} \prod_r P_{\lambda}(y_r \mid x_r)

P(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x, y)}{\sum_{y'} p(x, y')}

P(y \mid x) = \frac{\exp\{\lambda^T f(x, y)\}}{\sum_{y'} \exp\{\lambda^T f(x, y')\}}
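As a concrete illustration of this objective, a minimal numpy sketch (hypothetical feature values; the function names are mine) that evaluates P(y|x) and the conditional log-likelihood being maximized:

```python
import numpy as np

def posterior(lam, feats):
    """P(y | x) for a MaxEnt model.
    feats: (num_classes, num_features) matrix whose rows are f(x, y) for each class y."""
    scores = feats @ lam            # lambda^T f(x, y) for every y
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def conditional_log_likelihood(lam, data):
    """sum_r log P_lambda(y_r | x_r); data is a list of (feats, y) pairs."""
    return sum(np.log(posterior(lam, feats)[y]) for feats, y in data)

# Hypothetical 2-class example with 3 feature functions
feats_r = np.array([[1.0, 0.2, 0.0],    # f(x_r, y=0)
                    [0.0, 0.5, 1.0]])   # f(x_r, y=1)
lam = np.zeros(3)
print(posterior(lam, feats_r))                          # uniform before training
print(conditional_log_likelihood(lam, [(feats_r, 1)]))
```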

Page 4: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Multiclass Logistic Regression

f(x, y) = [\delta(y{=}1)x_1,\ \delta(y{=}1)x_2,\ \ldots,\ \delta(y{=}1)x_N,\ \ldots,\ \delta(y{=}K)x_1,\ \delta(y{=}K)x_2,\ \ldots,\ \delta(y{=}K)x_N]^T

P(y \mid x) = \frac{\exp\{w_y^T x\}}{\sum_{y'} \exp\{w_{y'}^T x\}}

[w_{11},\ w_{12},\ \ldots,\ w_{1N},\ \ldots,\ w_{K1},\ w_{K2},\ \ldots,\ w_{KN}]^T = \lambda
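To make the connection explicit, a small numpy check (made-up weights and input) that the stacked indicator features δ(y=k)·x_i with λ = [w_1; …; w_K] give exactly the same class scores as w_y^T x:

```python
import numpy as np

K, N = 3, 4                        # number of classes, input dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(K, N))        # logistic-regression weight vectors w_y (rows)
x = rng.normal(size=N)

lam = W.reshape(-1)                # lambda = [w_1; w_2; ...; w_K]

def f(x, y):
    """MaxEnt feature vector: delta(y = k) * x_i stacked over classes k and dimensions i."""
    out = np.zeros(K * N)
    out[y * N:(y + 1) * N] = x
    return out

scores_maxent = np.array([lam @ f(x, y) for y in range(K)])
scores_logreg = W @ x
print(np.allclose(scores_maxent, scores_logreg))   # True
```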

Page 5: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

MaxEnt example
• A MaxEnt model can be initialized to simulate a recognizer where classes are modeled by Gaussians
• Example for two classes and 1-dimensional data

f(x, y) = [\delta(y{=}1),\ \delta(y{=}1)x,\ \delta(y{=}1)x^2,\ \delta(y{=}2),\ \delta(y{=}2)x,\ \delta(y{=}2)x^2]^T

\lambda = [\log P(1) - 0.5(\log 2\pi\sigma_1^2 + \mu_1^2/\sigma_1^2),\ \ \mu_1/\sigma_1^2,\ \ -0.5/\sigma_1^2,\ \ \log P(2) - 0.5(\log 2\pi\sigma_2^2 + \mu_2^2/\sigma_2^2),\ \ \mu_2/\sigma_2^2,\ \ -0.5/\sigma_2^2]^T

\log p(x, y) = \log P(y) + \log p(x \mid y)
            = \log P(y) + \log \mathcal{N}(x; \mu_y, \sigma_y^2)
            = \log P(y) - 0.5\log(2\pi\sigma_y^2) - 0.5(x - \mu_y)^2/\sigma_y^2
            = \log P(y) - 0.5\log(2\pi\sigma_y^2) - 0.5 x^2/\sigma_y^2 + x\mu_y/\sigma_y^2 - 0.5\mu_y^2/\sigma_y^2
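A quick numerical check of this initialization, with made-up means, variances, and priors: building λ from the Gaussian parameters as above and pushing the quadratic features through a softmax reproduces the Bayes posterior of the Gaussian classifier:

```python
import numpy as np

mu  = np.array([-1.0, 2.0])     # class means (hypothetical)
var = np.array([1.5, 0.5])      # class variances
P   = np.array([0.3, 0.7])      # class priors
x   = 0.4

# lambda built from the Gaussian parameters as on the slide
lam = np.concatenate([
    [np.log(P[y]) - 0.5 * (np.log(2 * np.pi * var[y]) + mu[y] ** 2 / var[y]),
     mu[y] / var[y],
     -0.5 / var[y]]
    for y in range(2)])

def f(x, y):
    """Feature vector [delta(y=1), delta(y=1)x, delta(y=1)x^2, delta(y=2), ...]."""
    out = np.zeros(6)
    out[3 * y: 3 * y + 3] = [1.0, x, x ** 2]
    return out

scores = np.array([lam @ f(x, y) for y in range(2)])
maxent_post = np.exp(scores - scores.max())
maxent_post /= maxent_post.sum()

# Bayes posterior computed directly from the Gaussians
lik = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
bayes_post = lik * P / np.sum(lik * P)

print(np.allclose(maxent_post, bayes_post))   # True
```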

Page 6: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Bayesian Networks
• A graph corresponds to a particular factorization of the joint probability distribution over a set of random variables
• Nodes are random variables, but the graph does not say what the distributions of the variables are
• The graph represents the set of distributions that conform to the factorization.

Page 7: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Bayesian Networks for GMM
• s is a discrete latent random variable identifying the Gaussian component generating observation x

[Diagram: node s with an arrow to node x]

P(x, s) = p(x \mid s)\, p(s)

• To compute the likelihood of the observed data, we need to marginalize over the latent variable s:

P(x) = \sum_s p(x \mid s)\, p(s)

Page 8: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Bayesian Networks for GMM

• Multiple observations:

[Diagram: nodes s_1, s_2, …, s_{N-1}, s_N, each with an arrow to x_1, x_2, …, x_{N-1}, x_N]

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = \prod_{n=1}^{N} p(x_n \mid s_n)\, p(s_n)

P(x_1, \ldots, x_N) = \sum_{S} \prod_{n=1}^{N} p(x_n \mid s_n)\, p(s_n) = \prod_{n=1}^{N} \sum_{s} p(x_n \mid s)\, p(s)
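A minimal sketch of this per-frame marginalization for a hypothetical 1-D, two-component GMM (all numbers made up):

```python
import numpy as np

# Hypothetical 1-D GMM: component priors p(s), means, and standard deviations
priors = np.array([0.4, 0.6])
means  = np.array([-1.0, 2.0])
stds   = np.array([0.8, 1.5])

x = np.array([0.3, -0.5, 2.2, 1.0])   # observed frames x_1..x_N

# p(x_n | s) for every frame n and component s -> shape (N, S)
lik = np.exp(-0.5 * (x[:, None] - means) ** 2 / stds ** 2) / np.sqrt(2 * np.pi * stds ** 2)

# P(x_1..x_N) = prod_n sum_s p(x_n | s) p(s); accumulate in the log domain for stability
log_lik = np.sum(np.log(lik @ priors))
print(log_lik)
```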

Page 9: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Bayesian Networks for HMM
• The s_i nodes are not "HMM states"; they are random variables (one for each frame) whose values say which state we are in for a particular frame i
• To evaluate the likelihood of the data p(x_1, …, x_N), we marginalize over all state sequences (all possible values of s_1, …, s_N), e.g. using Dynamic Programming

[Diagram: chain s_1 → s_2 → … → s_{N-1} → s_N, with each s_n pointing to its observation x_n]

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = p(s_1) \left[\prod_{n=2}^{N} p(s_n \mid s_{n-1})\right] \prod_{n=1}^{N} p(x_n \mid s_n)
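The Dynamic Programming marginalization is the standard forward algorithm; a minimal sketch, with made-up initial, transition, and per-frame emission likelihoods:

```python
import numpy as np

def hmm_log_likelihood(log_init, log_trans, log_emis):
    """Forward algorithm: log P(x_1..x_N), marginalizing over all state sequences.

    log_init:  (S,)   log p(s_1)
    log_trans: (S, S) log p(s_n | s_{n-1}), rows indexed by the previous state
    log_emis:  (N, S) log p(x_n | s_n) for each frame and state
    """
    log_alpha = log_init + log_emis[0]
    for n in range(1, len(log_emis)):
        # log_alpha[j] = logsumexp_i( log_alpha[i] + log_trans[i, j] ) + log_emis[n, j]
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_trans, axis=0) + log_emis[n]
    return np.logaddexp.reduce(log_alpha)

# Hypothetical 2-state example with precomputed emission likelihoods for 3 frames
log_init  = np.log([0.7, 0.3])
log_trans = np.log([[0.9, 0.1],
                    [0.2, 0.8]])
log_emis  = np.log([[0.5, 0.1],
                    [0.4, 0.2],
                    [0.1, 0.6]])
print(hmm_log_likelihood(log_init, log_trans, log_emis))
```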

Page 10: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Conditional independence
• Bayesian Networks allow us to read off conditional independence properties.
• But the opposite is true for: [the examples are given in the slide figures, not reproduced in this transcript]

Page 11: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Markov Random Fields
• Undirected graphical model
• Directly describes the conditional independence property
  – In the example: P(x_1, x_4 | x_2, x_3) = P(x_1 | x_2, x_3) P(x_4 | x_2, x_3)
  – x_1 and x_4 are independent given x_2 and x_3, as there is no path from x_1 to x_4 that does not lead through either x_2 or x_3.
• Subsets of nodes in which all nodes are connected with each other are called cliques
• The outline in blue is a maximal clique.
• When factorizing a distribution described by an MRF, variables not connected by a link must not appear in the same factor ⇒ let the factors correspond to (maximal) cliques.

Page 12: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

MRF – factorization
• The joint probability distribution over all random variables x can be expressed as a normalized product of potential functions ψ_C(x_C), which are positive-valued functions of the subsets of variables x_C corresponding to the maximal cliques C
• It is useful to express the potential functions in terms of energy functions E(x_C) ⇒ a sum of E(x_C) terms instead of a product
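Written out (the slide's own equations did not survive extraction; this is the standard form the bullets describe, consistent with the example on the next page):

P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)

\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\} \;\Rightarrow\; P(\mathbf{x}) = \frac{1}{Z} \exp\Big\{-\sum_C E(\mathbf{x}_C)\Big\}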

Page 13: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

MRF - factorization

P(x_1, x_2, x_3, x_4) = \frac{1}{Z} \exp\{-E(x_1, x_2, x_3) - E(x_2, x_3, x_4)\}

Page 14: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Checking conditional independence

P(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, \psi_1(x_1, x_2, x_3)\, \psi_2(x_2, x_3, x_4)

P(x_2, x_3) = \sum_{x_1, x_4} \frac{1}{Z}\, \psi_1(x_1, x_2, x_3)\, \psi_2(x_2, x_3, x_4)

P(x_1, x_4 \mid x_2, x_3) = \frac{P(x_1, x_2, x_3, x_4)}{P(x_2, x_3)}
= \frac{\psi_1(x_1, x_2, x_3)\, \psi_2(x_2, x_3, x_4)}{\sum_{x_1, x_4} \psi_1(x_1, x_2, x_3)\, \psi_2(x_2, x_3, x_4)}
= P(x_1 \mid x_2, x_3)\, P(x_4 \mid x_2, x_3)

Page 15: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Markov Random Fields for HMM

[Diagram: chain s_1 – s_2 – … – s_{n-1} – s_n, with each s_n linked to its observation x_n]

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = \frac{1}{Z} \left[\prod_{n=2}^{N} \tilde{\psi}_n(s_{n-1}, s_n)\right] \prod_{n=1}^{N} \psi_n(x_n, s_n)

For Z = 1 and

\tilde{\psi}_2(s_1, s_2) = p(s_2 \mid s_1)\, p(s_1), \qquad \tilde{\psi}_n(s_{n-1}, s_n) = p(s_n \mid s_{n-1}), \qquad \psi_n(x_n, s_n) = p(x_n \mid s_n)

we obtain the HMM model:

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = p(s_1) \left[\prod_{n=2}^{N} p(s_n \mid s_{n-1})\right] \prod_{n=1}^{N} p(x_n \mid s_n)

Page 16: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Markov Random Fields for HMM
• The HMM is only one of the possible distributions represented by the graph
• In the case of the HMM, the individual factors are already well-normalized distributions ⇒ Z = 1
• With general "unnormalized" potential functions, it would be difficult to compute Z, as we would have to integrate over all real-valued variables x_n.
• However, it is not difficult to evaluate the conditional probability:

P(S \mid X) = \frac{p(X, S)}{p(X)} = \frac{p(X, S)}{\sum_{S'} p(X, S')}

• The normalization terms Z in the numerator and denominator cancel
• The sum in the denominator is over all possible state sequences, but the terms in the sum are just products of factors, like for the HMM ⇒ we can use the same Dynamic Programming trick.
• We can also find the most likely sequence S using the familiar Viterbi algorithm.
• To train such a model (the parameters of the potential functions), we can directly maximize the conditional likelihood P(S|X) ⇒ discriminative training (like MMI or logistic regression)

Page 17: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Conditional Random Fields
• Let's consider a special form of the potential functions:

\tilde{\psi}(s_{n-1}, s_n) = \exp\Big\{\sum_k \tilde{\lambda}_k \tilde{f}_k(s_{n-1}, s_n)\Big\}, \qquad \psi(x_n, s_n) = \exp\Big\{\sum_l \lambda_l f_l(x_n, s_n)\Big\}

• f̃_k, f_l – predefined feature functions
• λ̃_k, λ_l – trainable parameters
• We can rewrite

P(S \mid X) \propto P(X, S) \propto \left[\prod_{n=2}^{N} \tilde{\psi}_n(s_{n-1}, s_n)\right] \prod_{n=1}^{N} \psi_n(x_n, s_n)

as

P(S \mid X) = \frac{\exp\Big(\sum_k \tilde{\lambda}_k \sum_{n=2}^{N} \tilde{f}_k(s_{n-1}, s_n) + \sum_l \lambda_l \sum_{n=1}^{N} f_l(x_n, s_n)\Big)}{\sum_{S'} \exp\Big(\sum_k \tilde{\lambda}_k \sum_{n=2}^{N} \tilde{f}_k(s'_{n-1}, s'_n) + \sum_l \lambda_l \sum_{n=1}^{N} f_l(x_n, s'_n)\Big)}

Page 18: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Conditional Random Fields
• The model can be rewritten in a form that is very similar to Maximum Entropy models (logistic regression)
• However, S and X are sequences here (not just a class identity and an input vector)

P(S \mid X) = \frac{\exp\{\lambda^T f(S, X)\}}{\sum_{S'} \exp\{\lambda^T f(S', X)\}}

\lambda = [\tilde{\lambda}_1, \ldots, \tilde{\lambda}_K,\ \lambda_1, \ldots, \lambda_L]^T

f(S, X) = \Big[\sum_{n=2}^{N} \tilde{f}_1(s_{n-1}, s_n),\ \ldots,\ \sum_{n=2}^{N} \tilde{f}_K(s_{n-1}, s_n),\ \ \sum_{n=1}^{N} f_1(x_n, s_n),\ \ldots,\ \sum_{n=1}^{N} f_L(x_n, s_n)\Big]^T
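A minimal sketch of evaluating this posterior, assuming the feature functions have already been collapsed into transition scores A[i,j] = Σ_k λ̃_k f̃_k(i,j) and per-frame emission scores B[n,j] = Σ_l λ_l f_l(x_n,j) (names and toy numbers are mine); the normalizer is computed with the same dynamic-programming recursion as the HMM forward pass, only on log-domain scores:

```python
import numpy as np

def crf_log_posterior(S, A, B):
    """log P(S | X) for a linear-chain CRF.

    S: state sequence of length N (integers 0..num_states-1)
    A: (num_states, num_states) transition scores  A[i, j] = sum_k lambda~_k f~_k(i, j)
    B: (N, num_states)          emission scores    B[n, j] = sum_l lambda_l  f_l(x_n, j)
    """
    N = len(S)
    # Numerator: total score lambda^T f(S, X) of the given state sequence
    score = B[0, S[0]] + sum(A[S[n - 1], S[n]] + B[n, S[n]] for n in range(1, N))
    # Denominator: log-sum-exp over all state sequences (forward recursion)
    log_alpha = B[0].copy()
    for n in range(1, N):
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + A, axis=0) + B[n]
    log_Z = np.logaddexp.reduce(log_alpha)
    return score - log_Z

# Toy example: 2 states, 3 frames
A = np.array([[1.0, -0.5], [0.0, 0.8]])
B = np.array([[0.2, 0.0], [0.1, 0.5], [0.0, 0.3]])
print(crf_log_posterior([0, 1, 1], A, B))
```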

Page 19: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Hidden Conditional Random Fields
• As with HMMs, we can use CRFs to model state sequences, but the state sequence is not really what we are interested in. We are interested in sequences of words.
• Maybe we can live with decoding the most likely sequence of states (as we do with HMMs anyway), but for training we usually only know the sequence of words (or phonemes), not the states.
• HCRFs therefore marginalize over all state sequences corresponding to a sequence of words, as written out below.
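In formula form (the slide's own equation is not in the extracted text; this is the standard HCRF marginalization, with \mathcal{S}(W) denoting the set of state sequences consistent with the word sequence W):

P(W \mid X) = \sum_{S \in \mathcal{S}(W)} P(S \mid X) = \frac{\sum_{S \in \mathcal{S}(W)} \exp\{\lambda^T f(S, X)\}}{\sum_{S'} \exp\{\lambda^T f(S', X)\}}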

Page 20: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Hidden Conditional Random Fields
• Still, we can initialize an HCRF to simulate HMMs whose states are modeled by Gaussians or GMMs

Page 21: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Hidden Conditional Random Fields
• We can still use Dynamic Programming to efficiently evaluate the normalizing constant and to decode
• similarly to HMMs

Page 22: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Segmental CRF for LVCSR

• Let's have some unit "detectors":
  – phone bigram recognizer
  – multi-phone unit recognizer
• If we knew the word boundaries, we would not care about any sequences; we would just train a Maximum Entropy model whose feature functions return quantities derived from the units detected in the word span

Page 23: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

SCRF for LVCSR features
• N-gram Existence Features
• N-gram Expectation Features
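The slide only names these feature types; as a rough illustration of the idea, here is a hypothetical sketch in which existence features indicate that a given unit n-gram was detected somewhere in the word span, and expectation features count agreements and disagreements between detected and expected (dictionary) units. The exact definitions used in the lecture are not in the extracted text, so treat this as an assumption:

```python
# Illustrative sketch only; the concrete feature definitions are assumed, not taken from the slides.

def existence_features(detected_units, vocab_ngrams, n=2):
    """Binary features: does a given unit n-gram occur among the units
    detected inside the hypothesized word span?"""
    detected_ngrams = {tuple(detected_units[i:i + n])
                       for i in range(len(detected_units) - n + 1)}
    return {ng: 1.0 if ng in detected_ngrams else 0.0 for ng in vocab_ngrams}

def expectation_features(detected_units, expected_units):
    """Compare detected units with the units expected from the word's pronunciation:
    counts of units expected and detected, expected but missed, detected but unexpected."""
    detected, expected = set(detected_units), set(expected_units)
    return {
        "correct_accept": len(detected & expected),
        "false_reject":   len(expected - detected),
        "false_accept":   len(detected - expected),
    }

# Hypothetical usage for a word span with detected phones "k ae t"
print(existence_features(["k", "ae", "t"], {("k", "ae"), ("ae", "t"), ("t", "s")}))
print(expectation_features(["k", "ae", "t"], ["k", "ae", "t"]))
```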

Page 24: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

SCRF for LVCSR features
• Levenshtein Features – compare the units detected in the segment/word span with the desired pronunciation
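A minimal sketch of such a comparison, assuming the feature value is simply the Levenshtein (edit) distance between the detected unit sequence and the dictionary pronunciation (the exact feature set used in the lecture is not in the extracted text):

```python
def levenshtein_features(detected, pronunciation):
    """Edit distance between the detected units and the desired pronunciation."""
    n, m = len(detected), len(pronunciation)
    # dp[i][j] = minimum edit distance between detected[:i] and pronunciation[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if detected[i - 1] == pronunciation[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # extra detected unit (deletion)
                           dp[i][j - 1] + 1,        # missed pronunciation unit (insertion)
                           dp[i - 1][j - 1] + cost) # match or substitution
    return {"edit_distance": dp[n][m]}

# Hypothetical usage: detected phones vs. dictionary pronunciation
print(levenshtein_features(["k", "ah", "t"], ["k", "ae", "t"]))  # {'edit_distance': 1}
```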

Page 25: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Segmental CRF for LVCSR
• State sequence = word sequence
• CRF observations are segments of frames
• All possible segmentations of frames into observations must, however, be taken into account

Page 26: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Segmental CRF for LVCSR
• For convenience, we make the observation depend also on the previous state/word ⇒ this simplifies the equation below
• We marginalize over all possible segmentations
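The referenced equation did not survive extraction; a hedged reconstruction of the usual segmental-CRF form it corresponds to, where W is the word sequence, q ranges over segmentations of the frames into |W| segments, and X_{q_i} denotes the frames assigned to segment i, is:

P(W \mid X) = \frac{\sum_{q:\,|q| = |W|} \prod_{i=1}^{|W|} \exp\Big\{\sum_k \lambda_k f_k(w_{i-1}, w_i, X_{q_i})\Big\}}{\sum_{W'} \sum_{q:\,|q| = |W'|} \prod_{i=1}^{|W'|} \exp\Big\{\sum_k \lambda_k f_k(w'_{i-1}, w'_i, X_{q_i})\Big\}}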

Page 27: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

State transition features
• LM features
• Baseline features

Page 28: Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.

Results