EM algorithm LING 572 Fei Xia 03/02/06

  • Slide 1
  • EM algorithm LING 572 Fei Xia 03/02/06
  • Slide 2
  • Outline: the EM algorithm; EM for PM models; three special cases: the inside-outside algorithm, the forward-backward algorithm, and the IBM models for MT
  • Slide 3
  • The EM algorithm
  • Slide 4
  • Basic setting in EM: X is a set of data points (the observed data); θ is a parameter vector. EM is a method to find the maximum-likelihood estimate θ_ML in settings where calculating P(X | θ) directly is hard, but calculating P(X, Y | θ) is much simpler, where Y is hidden data (or missing data).
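In symbols, the objective implied here is the standard maximum-likelihood objective (a reconstruction in assumed notation, not the slide's own formula):

```latex
\theta_{ML} \;=\; \arg\max_{\theta} P(X \mid \theta)
```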
  • Slide 5
  • The basic EM strategy: Z = (X, Y), where Z is the complete data (augmented data), X is the observed data (incomplete data), and Y is the hidden data (missing data). Given a fixed x, there can be many possible y's. Ex: given a sentence x, there can be many state sequences in an HMM that generate x.
  • Slide 6
  • Examples of EM:
      HMM:       X (observed) = sentences; Y (hidden) = state sequences; θ = a_ij, b_ijk; algorithm = forward-backward.
      PCFG:      X (observed) = sentences; Y (hidden) = parse trees; θ = P(A → BC); algorithm = inside-outside.
      MT:        X (observed) = parallel data; Y (hidden) = word alignments; θ = t(f|e), d(a_j | j, l, m), ...; algorithm = IBM models.
      Coin toss: X (observed) = head-tail sequences; Y (hidden) = coin id sequences; θ = p1, p2, ...; algorithm = N/A.
  • Slide 7
  • The log-likelihood function: L is a function of θ, while holding X constant:
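A standard way to write this function, assuming n i.i.d. data points x_1, ..., x_n (not necessarily the slide's exact notation):

```latex
L(\theta) \;=\; \log P(X \mid \theta)
\;=\; \sum_{i=1}^{n} \log P(x_i \mid \theta)
\;=\; \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)
```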
  • Slide 8
  • The iterative approach for MLE: in many cases, we cannot find the solution directly. An alternative is to find a sequence of estimates θ^0, θ^1, θ^2, ... s.t. L(θ^t) ≤ L(θ^(t+1)).
  • Slide 9
  • Jensen's inequality
  • Slide 10
  • log is a concave function
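A standard statement of Jensen's inequality for the concave log function, as it is usually given at this point (a reconstruction in assumed notation):

```latex
\log \sum_{i} \lambda_i x_i \;\ge\; \sum_{i} \lambda_i \log x_i ,
\qquad \lambda_i \ge 0,\ \ \sum_{i} \lambda_i = 1,\ \ x_i > 0 .
```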
  • Slide 11
  • Maximizing the lower bound: the Q-function
  • Slide 12
  • The Q-function. Define the Q-function (a function of θ): Y is a random vector; X = (x_1, x_2, ..., x_n) is a constant (vector); θ^t is the current parameter estimate and is a constant (vector); θ is the normal variable (vector) that we wish to adjust. The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.
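A standard way to write the Q-function just defined (assumed notation):

```latex
Q(\theta \mid \theta^{t})
\;=\; E_{Y}\!\left[\log P(X, Y \mid \theta) \;\middle|\; X, \theta^{t}\right]
\;=\; \sum_{y} P(y \mid X, \theta^{t}) \, \log P(X, y \mid \theta)
```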
  • Slide 13
  • The inner loop of the EM algorithm. E-step: calculate the Q-function Q(θ | θ^t). M-step: find the θ that maximizes it (see the formulas below).
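In formula form, a standard rendering of the two steps (assumed notation):

```latex
\text{E-step:}\quad Q(\theta \mid \theta^{t}) = \sum_{y} P(y \mid X, \theta^{t}) \log P(X, y \mid \theta)
\qquad
\text{M-step:}\quad \theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^{t})
```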
  • Slide 14
  • L(θ) is non-decreasing at each iteration: the EM algorithm will produce a sequence θ^0, θ^1, θ^2, ... It can be proved that L(θ^t) ≤ L(θ^(t+1)).
  • Slide 15
  • The inner loop of the Generalized EM algorithm (GEM). E-step: calculate Q(θ | θ^t). M-step: find a θ^(t+1) that improves, but need not maximize, the Q-function.
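The usual way to state the relaxed GEM M-step condition (standard formulation, assumed here rather than taken from the slide):

```latex
\text{M-step (GEM):}\quad \text{find } \theta^{t+1} \text{ such that }
Q(\theta^{t+1} \mid \theta^{t}) \;\ge\; Q(\theta^{t} \mid \theta^{t})
```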
  • Slide 16
  • Recap of the EM algorithm
  • Slide 17
  • Idea #1: find the θ that maximizes the likelihood of the training data
  • Slide 18
  • Idea #2: find the sequence θ^0, θ^1, θ^2, ... There is no analytical solution, so use an iterative approach: find θ^(t+1) s.t. L(θ^t) ≤ L(θ^(t+1))
  • Slide 19
  • Idea #3: find the θ^(t+1) that maximizes a tight lower bound of L(θ)
  • Slide 20
  • Idea #4: find the θ^(t+1) that maximizes the Q-function, which differs from the lower bound of L(θ) only by terms that do not depend on θ
  • Slide 21
  • The EM algorithm. Start with an initial estimate θ^0. Repeat until convergence: E-step: calculate Q(θ | θ^t); M-step: find the θ^(t+1) that maximizes Q(θ | θ^t).
  • Slide 22
  • Important classes of EM problems: products of multinomial (PM) models, exponential families, Gaussian mixtures
  • Slide 23
  • The EM algorithm for PM models
  • Slide 24
  • PM models: P(x, y | Θ) is a product of multinomial parameters (see the formula below), where Θ_1, ..., Θ_J is a partition of all the parameters, and for any j the parameters within Θ_j sum to one.
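One common way to write the PM model family, using the Count(x, y, r) notation that appears later in these slides (a reconstruction, not the slide's own formula):

```latex
P(x, y \mid \Theta) \;=\; \prod_{r} \theta_{r}^{\,\mathrm{Count}(x,\,y,\,r)},
\qquad \theta_r \ge 0,\ \ \sum_{r \in \Theta_j} \theta_r = 1 \ \text{ for every } j,
```

where Θ_1, ..., Θ_J is the partition of the parameters into multinomials.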
  • Slide 25
  • HMM is a PM
  • Slide 26
  • PCFG: each sample point is (x, y), where x is a sentence and y is a possible parse tree for that sentence.
  • Slide 27
  • PCFG is a PM
  • Slide 28
  • Q-function for PM
  • Slide 29
  • Maximizing the Q-function: maximize Q(θ | θ^t) subject to the constraint that the parameters within each multinomial sum to one; use Lagrange multipliers.
  • Slide 30
  • Optimal solution: each parameter is set to its expected count divided by a normalization factor (the total expected count of its multinomial).
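The resulting update has the familiar expected-count-over-normalization form (a standard result for PM models, written in assumed notation):

```latex
\theta_r^{\,t+1} \;=\;
\frac{\sum_{i}\sum_{y} P(y \mid x_i, \theta^{t}) \, \mathrm{Count}(x_i, y, r)}
     {\sum_{r' \in \Theta_j} \sum_{i}\sum_{y} P(y \mid x_i, \theta^{t}) \, \mathrm{Count}(x_i, y, r')}
\qquad (\theta_r \in \Theta_j)
```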
  • Slide 31
  • PM models: θ_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution. Count(x, y, r) is the number of times that θ_r is seen in the expression for P(x, y | Θ).
  • Slide 32
  • The EM algorithm for PM models: calculate expected counts; update the parameters
  • Slide 33
  • PCFG example: calculate expected counts; update the parameters
  • Slide 34
  • The EM algorithm for PM models (loop structure; a code sketch follows below):
      // for each iteration
      //   for each training example x_i
      //     for each possible y
      //       for each parameter: accumulate its expected count
      // then renormalize the expected counts within each multinomial to update the parameters
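Below is a minimal Python sketch of this loop. The callables candidate_ys, joint_prob, and count, and the multinomial_of mapping, are hypothetical stand-ins for model-specific code; they are not part of the original slides.

```python
# A sketch of the generic EM loop for a PM (product-of-multinomials) model.
# The callables candidate_ys, joint_prob, and count are hypothetical stand-ins
# for model-specific code (enumerating the possible hidden y's for x, computing
# P(x, y | theta), and returning Count(x, y, r)); they are not from the slides.
from collections import defaultdict

def em_for_pm(data, theta, multinomial_of, candidate_ys, joint_prob, count,
              num_iterations=10):
    """theta: dict r -> probability; multinomial_of: dict r -> multinomial id j."""
    for _ in range(num_iterations):                    # for each iteration
        expected = defaultdict(float)                  # E-step: expected counts
        for x in data:                                 # for each training example x_i
            ys = list(candidate_ys(x))
            joint = [joint_prob(x, y, theta) for y in ys]
            z = sum(joint)                             # P(x | theta)
            for y, p_xy in zip(ys, joint):             # for each possible y
                posterior = p_xy / z                   # P(y | x, theta)
                for r in theta:                        # for each parameter
                    expected[r] += posterior * count(x, y, r)
        # M-step: renormalize expected counts within each multinomial.
        totals = defaultdict(float)
        for r, c in expected.items():
            totals[multinomial_of[r]] += c
        theta = {r: (c / totals[multinomial_of[r]]
                     if totals[multinomial_of[r]] > 0 else theta[r])
                 for r, c in expected.items()}
    return theta
```

Enumerating every possible y explicitly is only feasible for toy problems; the special cases below (inside-outside, forward-backward, IBM models) compute the same expected counts with dynamic programming instead.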
  • Slide 35
  • Inside-outside algorithm
  • Slide 36
  • Inner loop of the inside-outside algorithm. Given an input sentence and the current parameters: 1. Calculate the inside probabilities (base case, then recursive case). 2. Calculate the outside probabilities (base case, then recursive case).
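One standard formulation of these recursions, assuming a CNF PCFG with nonterminals N^1 (the start symbol), ..., N^n and a sentence w_1 ... w_m (a textbook-style reconstruction, not necessarily the slide's exact notation):

```latex
\text{Inside (base):}\quad \beta_j(k, k) = P(N^j \rightarrow w_k)

\text{Inside (recursive):}\quad
\beta_j(p, q) = \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d{+}1, q)

\text{Outside (base):}\quad \alpha_1(1, m) = 1, \qquad \alpha_j(1, m) = 0 \ \text{ for } j \neq 1

\text{Outside (recursive):}\quad
\alpha_j(p, q) = \sum_{f,g}\Big[ \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \rightarrow N^j N^g)\, \beta_g(q{+}1, e)
 \;+\; \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \rightarrow N^g N^j)\, \beta_g(e, p{-}1) \Big]
```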
  • Slide 37
  • Inside-outside algorithm (cont.): 3. Collect the counts. 4. Normalize and update the parameters.
  • Slide 38
  • Expected counts for PCFG rules. The formula below is for a single sentence; add an outer sum over sentences if X contains multiple sentences.
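For a single sentence w_1 ... w_m, the expected rule counts are usually written in terms of the inside and outside probabilities above, with P(w_1m) = β_1(1, m) (a standard rendering, assumed here):

```latex
E[\mathrm{Count}(N^j \rightarrow N^r N^s)] \;=\;
\frac{1}{P(w_{1m})} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1}
\alpha_j(p, q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d{+}1, q)

E[\mathrm{Count}(N^j \rightarrow w)] \;=\;
\frac{1}{P(w_{1m})} \sum_{k\,:\,w_k = w} \alpha_j(k, k)\, P(N^j \rightarrow w)
```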
  • Slide 39
  • Expected counts (cont)
  • Slide 40
  • Relation to EM: PCFG is a PM model, and the inside-outside algorithm is a special case of the EM algorithm for PM models. X (observed data): each data point is a sentence w_1m. Y (hidden data): a parse tree Tr. Θ (parameters): the rule probabilities.
  • Slide 41
  • Forward-backward algorithm
  • Slide 42
  • The inner loop of the forward-backward algorithm. Given an input sequence and the current parameters: 1. Calculate the forward probabilities (base case, then recursive case). 2. Calculate the backward probabilities (base case, then recursive case). 3. Calculate the expected counts. 4. Update the parameters.
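A standard arc-emission formulation of these steps, assuming states 1..N, observations o_1 ... o_T, transition probabilities a_ij, arc-emission probabilities b_ijk, and initial probabilities π_i (a reconstruction, not necessarily the slide's exact notation):

```latex
\text{Forward (base):}\quad \alpha_i(1) = \pi_i
\qquad
\text{Forward (recursive):}\quad \alpha_j(t{+}1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}

\text{Backward (base):}\quad \beta_i(T{+}1) = 1
\qquad
\text{Backward (recursive):}\quad \beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t{+}1)

\text{Expected counts:}\quad
p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t{+}1)}{P(O)},
\qquad P(O) = \sum_{i} \alpha_i(T{+}1)

\text{Updates:}\quad
\hat a_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \sum_{j'} p_t(i, j')},
\qquad
\hat b_{ijk} = \frac{\sum_{t\,:\,o_t = k} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}
```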
  • Slide 43
  • Expected counts
  • Slide 44
  • Expected counts (cont)
  • Slide 45
  • Relation to EM: HMM is a PM model, and the forward-backward algorithm is a special case of the EM algorithm for PM models. X (observed data): each data point is an observation sequence O_1T. Y (hidden data): a state sequence X_1T. Θ (parameters): a_ij, b_ijk, π_i.
  • Slide 46
  • IBM models for MT
  • Slide 47
  • Expected counts for (f, e) pairs. Let Ct(f, e) be the fractional count of the (f, e) pair in the training data: a sum, over alignments, of the alignment probability times the actual count of times e and f are linked in (E, F) by alignment a.
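A standard rendering of this fractional count, summing over training pairs (E, F) and alignments a; for IBM Model 1 the alignment posterior factorizes into per-word link probabilities, where l is the length of E including the NULL word e_0 (a reconstruction in assumed notation):

```latex
\mathrm{Ct}(f, e) \;=\; \sum_{(E, F)} \sum_{a} P(a \mid E, F)\; \mathrm{count}(f, e \mid a, E, F)

\text{IBM Model 1:}\quad
P(\text{link } f_j \leftrightarrow e_i \mid E, F) \;=\; \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})}
```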
  • Slide 48
  • Relation to EM: IBM models are PM models, and the EM algorithm used in the IBM models is a special case of the EM algorithm for PM models. X (observed data): each data point is a sentence pair (F, E). Y (hidden data): a word alignment a. Θ (parameters): t(f|e), d(i | j, m, n), etc.
  • Slide 49
  • Summary. The EM algorithm: an iterative approach; L(θ) is non-decreasing at each iteration; an optimal solution in the M-step exists for many classes of problems. The EM algorithm for PM models: simpler formulae. Three special cases: the inside-outside algorithm, the forward-backward algorithm, and the IBM models for MT.
  • Slide 50
  • Relations among the algorithms: the generalized EM contains the EM algorithm as a special case; the EM algorithm covers PM models and Gaussian mixtures; and the EM algorithm for PM models in turn covers the inside-outside algorithm, the forward-backward algorithm, and the IBM models.
  • Slide 51
  • Strengths of EM. Numerical stability: in every iteration of the EM algorithm, the likelihood of the observed data does not decrease. EM handles parameter constraints gracefully.
  • Slide 52
  • Problems with EM. Convergence can be very slow on some problems and is intimately related to the amount of missing information. It is guaranteed to improve the probability of the training corpus, which is different from reducing errors directly. It is not guaranteed to reach a global maximum (it can get stuck at a local maximum, a saddle point, etc.). The initial estimate is important.
  • Slide 53
  • Additional slides
  • Slide 54
  • Lower bound lemma: if the Q-function does not decrease, then neither does the log-likelihood (statement and proof sketch below).
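One standard statement and proof sketch of the lemma, consistent with the surrounding slides (assumed notation):

```latex
\text{Lemma: if } Q(\theta \mid \theta^{t}) \ge Q(\theta^{t} \mid \theta^{t}),
\text{ then } L(\theta) \ge L(\theta^{t}).

\text{Proof sketch:}\quad
L(\theta) - L(\theta^{t})
= \log \frac{P(X \mid \theta)}{P(X \mid \theta^{t})}
= \log \sum_{y} P(y \mid X, \theta^{t}) \frac{P(X, y \mid \theta)}{P(X, y \mid \theta^{t})}
\;\ge\; \sum_{y} P(y \mid X, \theta^{t}) \log \frac{P(X, y \mid \theta)}{P(X, y \mid \theta^{t})}
= Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t})
\;\ge\; 0 ,
```

where the first inequality is Jensen's inequality applied to the concave log function.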
  • Slide 55
  • L(θ) is non-decreasing. Let θ^(t+1) = argmax_θ Q(θ | θ^t). Then Q(θ^(t+1) | θ^t) ≥ Q(θ^t | θ^t), and we have L(θ^(t+1)) ≥ L(θ^t) (by the lower bound lemma).