Language Modeling/Pronunciation Modeling Michael Picheny ... stanchen/fall12/e6870/slides/...


  • Lecture 6

    Language Modeling/Pronunciation Modeling

    Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen

    IBM T.J. Watson Research CenterYorktown Heights, New York, USA

    {picheny,bhuvana,stanchen}@us.ibm.com

    15 October 2012

  • Review: Acoustic Modeling

    x: observations; sequence of 40d feature vectors.
    ω: word sequence.
    HMM/GMM framework lets us model P(x|ω) . . .

    How likely feature vectors are given word sequence.

    2 / 141

  • The Fundamental Equation of ASR

    I HATE TO WAIT
    EYE HATE TWO WEIGHT

    ω* = arg max_ω P(ω|x) = arg max_ω P(ω) P(x|ω)

    What's new? Language model P(ω) describing . . .
    Frequency of each word sequence ω.

    3 / 141

  • Part I

    Language Modeling

    4 / 141

  • Language Modeling: Goals

    Describe which word sequences are likely.
    e.g., BRITNEY SPEARS vs. BRIT KNEE SPEARS.

    Analogy: multiple-choice test.
    LM restricts choices given to acoustic model.
    The fewer choices, the better you do.

    5 / 141

  • What Type of Model?

    Want probability distribution over sequence of symbols.
    (Hidden) Markov model!
    Hidden or non-hidden?

    For hidden, too hard to come up with topology.

    6 / 141

  • Where Are We?

    1 N-Gram Models

    2 Technical Details

    3 Smoothing

    4 Discussion

    7 / 141

  • What's an n-Gram Model?

    Markov model of order n − 1.
    To predict next word . . .

    Only need to remember last n − 1 words.

    8 / 141

  • What's a Markov Model?

    Decompose probability of sequence . . .
    Into product of conditional probabilities.

    e.g., trigram model ⇔ Markov order 2 . . .
    Remember last 2 words.

    P(w_1 · · · w_L) = ∏_{i=1}^{L} P(w_i | w_1 · · · w_{i−1}) = ∏_{i=1}^{L} P(w_i | w_{i−2} w_{i−1})

    P(I HATE TO WAIT) = P(I) P(HATE|I) P(TO|I HATE) P(WAIT|HATE TO)

    9 / 141

  • Sentence Begins and Ends

    Pad left with beginning-of-sentence tokens.
    e.g., w_{−1} = w_0 = <s>.
    Always condition on two words to left, even at start.

    Predict end-of-sentence token </s> at end.
    So true probability distribution, i.e., ∑_ω P(ω) = 1.

    P(w_1 · · · w_L) = ∏_{i=1}^{L+1} P(w_i | w_{i−2} w_{i−1}),  where w_{L+1} = </s>

    P(I HATE TO WAIT) = P(I | <s> <s>) P(HATE | <s> I) P(TO | I HATE) P(WAIT | HATE TO) P(</s> | TO WAIT)

    10 / 141
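The padded decomposition above can be sketched in a few lines of Python. This is not the lecture's code; the probability table `p` is an invented toy example (only the value 0.378 for P(TO | I HATE) is taken from a later slide), chosen just to make the loop concrete.

```python
# Toy trigram probabilities (hypothetical values for illustration).
p = {
    ("<s>", "<s>", "I"): 0.2,
    ("<s>", "I", "HATE"): 0.1,
    ("I", "HATE", "TO"): 0.378,
    ("HATE", "TO", "WAIT"): 0.05,
    ("TO", "WAIT", "</s>"): 0.3,
}

def sentence_prob(words, p):
    """P(w_1 ... w_L) = prod_{i=1}^{L+1} P(w_i | w_{i-2} w_{i-1})."""
    # Pad left with two <s> tokens, predict </s> at the end.
    padded = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p[(padded[i - 2], padded[i - 1], padded[i])]
    return prob

print(sentence_prob(["I", "HATE", "TO", "WAIT"], p))
```

Note that the product has L + 1 factors: one per word plus one for the end-of-sentence token.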

  • How to Set Probabilities?

    For each history w_{i−2} w_{i−1} . . .
    P(w_i | w_{i−2} w_{i−1}) is a multinomial distribution.

    Maximum likelihood estimation for multinomials:
    Count and normalize!

    P_MLE(w_i | w_{i−2} w_{i−1}) = c(w_{i−2} w_{i−1} w_i) / ∑_w c(w_{i−2} w_{i−1} w) = c(w_{i−2} w_{i−1} w_i) / c(w_{i−2} w_{i−1})

    11 / 141
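Count-and-normalize is short enough to sketch end to end. The two-sentence corpus below is invented for illustration; the estimator itself follows the formula above.

```python
from collections import Counter

# Toy training corpus (invented).
corpus = [
    ["I", "HATE", "TO", "WAIT"],
    ["I", "HATE", "TO", "RUN"],
]

tri, bi = Counter(), Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(2, len(padded)):
        tri[tuple(padded[i - 2:i + 1])] += 1  # c(w_{i-2} w_{i-1} w_i)
        bi[tuple(padded[i - 2:i])] += 1       # c(w_{i-2} w_{i-1})

def p_mle(w, h2, h1):
    """P_MLE(w | h2 h1) = c(h2 h1 w) / c(h2 h1)."""
    return tri[(h2, h1, w)] / bi[(h2, h1)]

print(p_mle("TO", "I", "HATE"))    # c(I HATE TO) / c(I HATE) = 2/2
print(p_mle("WAIT", "HATE", "TO")) # c(HATE TO WAIT) / c(HATE TO) = 1/2
```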

  • Example: Maximum Likelihood Estimation

    23M words of Wall Street Journal text.

    FEDERAL HOME LOAN MORTGAGE CORPORATION DASH ONE .POINT FIVE BILLION DOLLARS OF REAL ESTATE MORTGAGE -HYPHEN INVESTMENT CONDUIT SECURITIES OFFERED BY MERRILL LYNCH &AMPERSAND COMPANY

    NONCOMPETITIVE TENDERS MUST BE RECEIVED BY NOON EASTERN TIME THURSDAY AT THE TREASURY OR AT FEDERAL RESERVE BANKS OR BRANCHES
    . . .

    P(TO | I HATE) = c(I HATE TO) / c(I HATE) = 17/45 = 0.378

    12 / 141

  • Example: Bigram Model

    P(I HATE TO WAIT) = ???
    P(EYE HATE TWO WEIGHT) = ???
    Step 1: Collect all bigram counts, unigram history counts.

    history   EYE     I  HATE    TO   TWO  WAIT  WEIGHT  </s>  c(history)
    <s>         3  3234     5  4064  1339     8      22     0      892669
    EYE         0     0     0    26     1     0       0    52         735
    I           0     0    45     2     1     1       0     8       21891
    HATE        0     0     0    40     0     0       0     9         246
    TO          8     6    19    21  5341   324       4   221      510508
    TWO         0     5     0  1617   652     0       0  4213      132914
    WAIT        0     0     0    71     2     0       0    35         882
    WEIGHT      0     0     0    38     0     0       0    45         643

    13 / 141

  • Example: Bigram Model

    P(I HATE TO WAIT)
      = P(I|<s>) P(HATE|I) P(TO|HATE) P(WAIT|TO) P(</s>|WAIT)
      = (3234/892669) × (45/21891) × (40/246) × (324/510508) × (35/882)
      = 3.05 × 10^−11

    P(EYE HATE TWO WEIGHT)
      = P(EYE|<s>) P(HATE|EYE) P(TWO|HATE) P(WEIGHT|TWO) P(</s>|WEIGHT)
      = (3/892669) × (0/735) × (0/246) × (0/132914) × (45/643)
      = 0

    14 / 141
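The two products can be checked directly in Python, with the numbers copied from the bigram count table:

```python
# P(I HATE TO WAIT) under the bigram model.
p_good = (3234 / 892669) * (45 / 21891) * (40 / 246) * (324 / 510508) * (35 / 882)

# P(EYE HATE TWO WEIGHT): several bigrams were never seen in training.
p_bad = (3 / 892669) * (0 / 735) * (0 / 246) * (0 / 132914) * (45 / 643)

print(p_good)  # ≈ 3.05e-11
print(p_bad)   # 0.0 — any single unseen bigram zeroes the whole product
```

The zero result is the key observation motivating the smoothing section later: one unseen bigram is enough to rule out the entire sentence.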

  • Recap: N-Gram Models

    Simple formalism, yet effective.
    Discriminates between wheat and chaff.

    Easy to train: count and normalize.

    Generalizes: assigns nonzero probabilities to sentences . . .
    Not seen in training data, e.g., I HATE TO WAIT.

    15 / 141

  • Where Are We?

    1 N-Gram Models

    2 Technical Details

    3 Smoothing

    4 Discussion

    16 / 141

  • LMs and Training and Decoding

    Decoding without LMs.
    Word HMM encoding allowable word sequences.
    Replace each word with its HMM.

    [Figure: word graph with arcs ONE, TWO, THREE, . . . ; same graph with each word arc replaced by its HMM: HMM_one, HMM_two, HMM_three, . . . ]

    17 / 141

  • LMs and Training and Decoding

    Point: n-gram model is (hidden) Markov model.
    Can be expressed as word HMM.
    Replace each word with its HMM.
    Leave in language model probabilities.

    [Figure: word graph with LM probabilities on arcs: ONE/P(ONE), TWO/P(TWO), THREE/P(THREE), . . . ; same graph with each word arc replaced by its HMM: HMM_one/P(ONE), HMM_two/P(TWO), HMM_three/P(THREE), . . . ]

    Lots more details in lectures 7, 8.
    How do LMs impact acoustic model training?

    18 / 141

  • One Puny Prob versus Many?

    [Figure: loop over the digit words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO]

    19 / 141

  • The Acoustic Model Weight

    Not a fair fight.
    Solution: acoustic model weight α.

    ω* = arg max_ω P(ω) P(x|ω)^α

    α usually somewhere between 0.05 and 0.1.
    Important to tune for each LM, AM.
    Theoretically inelegant.

    Empirical performance trumps theory any day of the week.
    Is it the LM weight or the AM weight?

    20 / 141
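The weighted combination is usually done in log space. A minimal sketch of scoring two competing hypotheses; the log-probability values below are invented toy numbers (AM log-likelihoods really are orders of magnitude larger than LM log-probabilities, which is the "not a fair fight" point):

```python
def combined_score(log_p_lm, log_p_am, alpha=0.0625):
    # score(w) = log P(w) + alpha * log P(x|w), i.e. P(w) P(x|w)^alpha in log space
    return log_p_lm + alpha * log_p_am

# Hypothetical (LM log-prob, AM log-likelihood) pairs for two hypotheses.
hyps = {
    "I HATE TO WAIT": (-24.2, -3000.0),
    "EYE HATE TWO WEIGHT": (-35.8, -2990.0),
}
best = max(hyps, key=lambda w: combined_score(*hyps[w]))
print(best)
```

Without the weight (alpha = 1), the tiny AM difference would swamp the LM; with a small alpha the LM gets a real say.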

  • Real World Toy Example

    Test set: continuous digit strings.
    Unigram language model: P(ω) = ∏_{i=1}^{L+1} P(w_i).

    [Figure: bar chart of WER (%) for AM weight = 1 vs. AM weight = 0.1]

  • What is This Word Error Rate Thing?

    Most popular evaluation measure for ASR systems.
    Divide total number of errors in test set . . .
    By total number of words.

    WER = ∑_{utts u} (# errors in u) / ∑_{utts u} (# words in reference for u)

    What is number of errors in utterance?
    Minimum number of word insertions, deletions, and . . .
    Substitutions to transform reference to hypothesis.

    22 / 141

  • Example: Word Error Rate

    What is the WER?

    reference:  THE DOG IS HERE NOW
    hypothesis: THE UH BOG IS NOW

    Can WER be above 100%?
    What algorithm to compute WER?

    How many ways to transform reference to hypothesis?

    23 / 141
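The standard answer to "what algorithm?" is word-level Levenshtein distance via dynamic programming. A sketch (not the lecture's code), applied to the example above:

```python
def word_errors(ref, hyp):
    """Minimum insertions + deletions + substitutions turning ref into hyp."""
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)]

ref = "THE DOG IS HERE NOW".split()
hyp = "THE UH BOG IS NOW".split()
errs = word_errors(ref, hyp)
print(errs, errs / len(ref))  # 3 errors => WER = 3/5 = 60%
```

Here the best alignment inserts UH, substitutes BOG for DOG, and deletes HERE: 3 errors against a 5-word reference, so WER = 60%. Since insertions count, a hypothesis much longer than the reference can indeed push WER above 100%.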

  • Evaluating Language Models

    Best way: plug into ASR system; measure WER.
    Need ASR system.
    Expensive to compute (especially in old days).
    Results depend on acoustic model.

    Is there something cheaper that predicts WER well?

    24 / 141

  • Perplexity

    Basic idea: test set likelihood . . .
    Normalized so easy to interpret.

    Take (geometric) average probability p_avg . . .
    Assigned to each word in test data.

    p_avg = [∏_{i=1}^{L+1} P(w_i | w_{i−2} w_{i−1})]^{1/(L+1)}

    Invert it: PP = 1/p_avg.

    Interpretation:
    Given history, how many possible next words . . .
    (For acoustic model to choose from.)
    e.g., uniform unigram LM over V words ⇒ PP = V.

    25 / 141

  • Example: Perplexity

    P(I HATE TO WAIT)
      = P(I|<s>) P(HATE|I) P(TO|HATE) P(WAIT|TO) P(</s>|WAIT)
      = (3234/892669) × (45/21891) × (40/246) × (324/510508) × (35/882)
      = 3.05 × 10^−11

    p_avg = [∏_{i=1}^{L+1} P(w_i | w_{i−1})]^{1/(L+1)} = (3.05 × 10^−11)^{1/5} = 0.00789

    PP = 1/p_avg = 126.8

    26 / 141
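The perplexity arithmetic can be reproduced directly from the same counts:

```python
# Bigram probability of the test sentence (numbers from the count table).
p = (3234 / 892669) * (45 / 21891) * (40 / 246) * (324 / 510508) * (35 / 882)

n = 5  # 4 words plus the end-of-sentence token
p_avg = p ** (1 / n)  # geometric average probability per predicted token
pp = 1 / p_avg

print(p_avg, pp)  # ≈ 0.00789, ≈ 126.8
```

So even though the sentence probability is a hard-to-interpret 3.05 × 10⁻¹¹, perplexity says the model is, on average, as uncertain as a uniform choice among about 127 next words.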

  • Perplexity: Example Values

    case + type   domain      training data   PP
    human¹        biography   —               142
    machine²      Brown       600MW           790
    ASR³          WSJ         23MW            120

    Varies highly across domains, languages. Why?

    ¹ Jefferson the Virginian; Shannon game (Shannon, 1951).
    ² Trigram model (Brown et al., 1992).
    ³ Trigram model; 20kw vocabulary.

    27 / 141

  • Does Perplexity Predict Word-Error Rate?

    Not across different LM types.
    e.g., word n-gram model; class n-gram model; . . .

    OK within LM type.
    e.g., vary training set; model order; pruning; . . .

    28 / 141

  • Perplexity and Word-Error Rate

    [Figure: scatter plot of WER (%) vs. log PP]

  • Recap

    LM describes allowable word sequences.
    Used to build decoding graph.

    Need AM weight for LM to have full effect.
    Best to evaluate LMs using WER . . .

    But perplexity can be informative.
    Can you think of any problems with word error rate?

    What do we really care about in applications?

    30 / 141

  • Where Are We?

    1 N-Gram Models

    2 Technical Details

    3 Smoothing

    4 Discussion

    31 / 141

  • An Experiment

    Take 50M words of WSJ; shuffle sentences; split in two.
    Training set: 25M words.

    NONCOMPETITIVE TENDERS MUST BE RECEIVED BY NOON EASTERN TIME THURSDAY AT THE TREASURY OR AT FEDERAL RESERVE BANKS OR BRANCHES .PERIOD

    NOT EVERYONE AGREED WITH THAT STRATEGY .PERIOD
    . . .

    Test set: 25M words.

    NATIONAL PICTURE &AMPERSAND FRAME DASH INITIAL TWO MILLION ,COMMA TWO HUNDRED FIFTY THOUSAND SHARES ,COMMA VIA WILLIAM BLAIR .PERIOD

    THERE WILL EVEN BE AN EIGHTEEN -HYPHEN HOLE GOLF COURSE .PERIOD
    . . .

    32 / 141

  • An Experiment

    Count how often each word occurs in training; sort by count.

    word       count
    ,COMMA     1156259
    THE        1062057
    .PERIOD    877624
    OF         520374
    TO         510508
    A          455832
    AND        417364
    IN         385940
    . . .      . . .

    word       count
    . . .      . . .
    ZZZZ       2
    AAAAAHHH   1
    AAB        1
    AACHENER   1
    . . .      . . .
    ZYPLAST    1
    ZYUGANOV   1

    33 / 141

  • An Experiment

    For each word that occurs exactly once in training . . .
    Count how often it occurs in test set.
    Average this count across all such words.

    What is the actual value?
    1  Larger than 1.
    2  Exactly 1, more or less.
    3  Between 0.5 and 1.
    4  Between 0.1 and 0.5.

    What if
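The measurement itself can be sketched in a few lines. The two word lists below are invented toy data, not the WSJ training and test sets from the slide, so the number it prints says nothing about the answer to the question above:

```python
from collections import Counter

# Toy "training" and "test" word streams (invented).
train = "A B B C D D D E".split()
test = "A A C E F".split()

train_counts = Counter(train)
test_counts = Counter(test)

# Words seen exactly once in training (singletons).
singletons = [w for w, c in train_counts.items() if c == 1]

# Average test-set count across those singletons.
avg = sum(test_counts[w] for w in singletons) / len(singletons)
print(avg)  # on this toy data: (2 + 1 + 1) / 3
```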