Smoothing and interpolation


Transcript of Smoothing and interpolation

Page 1: Smoothing and interpolation

Smoothing and interpolation

Micha Elsner

January 9, 2014

Page 2: Smoothing and interpolation

Last time

- Estimation for the bigram model needs smoothing (regularization)
- ...to cope with sparse data
- Simple add-λ smoother:

P̂(w_{i+1} = b | w_i = a) = (#(a b) + λ) / (#(a •) + λ|V|)
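A minimal sketch of how this estimate might be computed (an illustration, not the Assignment 1 code; the count dictionaries and helper names are assumptions):

```python
from collections import defaultdict

def train_counts(sentences):
    """Collect bigram counts #(a b) and context counts #(a .) from tokenized sentences."""
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            bigram_counts[(a, b)] += 1
            context_counts[a] += 1
    return bigram_counts, context_counts, vocab

def p_add_lambda(b, a, bigram_counts, context_counts, vocab_size, lam):
    """Add-lambda estimate of P(w_{i+1} = b | w_i = a)."""
    return (bigram_counts[(a, b)] + lam) / (context_counts[a] + lam * vocab_size)
```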


Page 3: Smoothing and interpolation

Log-likelihood curve for a bigram model
This isn’t quite the one I’m outlining now; I used my code for Assignment 1, which is a bit more complex.

Finding optimal λ: line search
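A hedged sketch of the kind of search involved: evaluate the dev-set log-likelihood at a few candidate values of λ and keep the best. The candidate grid and helper names are assumptions, and a real line search would refine the interval rather than use a fixed grid:

```python
import math

def dev_log_likelihood(lam, dev_bigrams, bigram_counts, context_counts, vocab_size):
    """Sum of log P(b | a) over dev-set bigrams under the add-lambda model."""
    total = 0.0
    for a, b in dev_bigrams:
        p = (bigram_counts[(a, b)] + lam) / (context_counts[a] + lam * vocab_size)
        total += math.log(p)
    return total

def search_lambda(dev_bigrams, bigram_counts, context_counts, vocab_size,
                  candidates=(0.001, 0.01, 0.1, 0.5, 1.0, 2.0)):
    """Return the candidate lambda with the highest dev log-likelihood."""
    return max(candidates,
               key=lambda l: dev_log_likelihood(l, dev_bigrams, bigram_counts,
                                                context_counts, vocab_size))
```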

Page 4: Smoothing and interpolation

Line search


Page 5: Smoothing and interpolation

Mathematical optimization

Optimization: mathematical/computer-science techniques for solving max or min problems

- A large library of techniques
- We’ll cover some basic ones

Objective: what you are maximizing/minimizing

Key insight: the objective is separate from the optimization method!

- Be clear about what you want
- Then think about how to find it

Two methods so far: ratio-of-counts (a closed-form estimate) and one-dimensional line search (an iterative method)


Page 6: Smoothing and interpolation

Unknown words

Issue: how big is the vocabulary set V anyway?

- Common to take the set of words in the training data as the vocabulary
- Plus an “unknown word” symbol <UNK>
- This symbol is assigned to all words not appearing in train
- Of course it has no training counts...
- Train: The cat is on the mat.
- Test: The dog is on the sofa.
- → Transformed test: The <UNK> is on the <UNK>
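A minimal sketch of this transformation (an assumption about how one might implement it; the assignment may organize it differently):

```python
UNK = "<UNK>"

def build_vocab(train_sentences):
    """Vocabulary = every word type seen in training, plus the <UNK> symbol."""
    vocab = {UNK}
    for sent in train_sentences:
        vocab.update(sent)
    return vocab

def replace_unknowns(sentence, vocab):
    """Map out-of-vocabulary tokens to <UNK> before scoring with the model."""
    return [w if w in vocab else UNK for w in sentence]

# The slide's example (assuming whitespace tokenization):
vocab = build_vocab([["The", "cat", "is", "on", "the", "mat", "."]])
print(replace_unknowns(["The", "dog", "is", "on", "the", "sofa", "."], vocab))
# -> ['The', '<UNK>', 'is', 'on', 'the', '<UNK>', '.']
```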


Page 7: Smoothing and interpolation

More complicated UNK and vocabulary systems

- Possible to “spell out” unknown words letter by letter
- The d.o.g is on the s.o.f.a
- The letter probabilities can be fit with more held-out data
- Can have multiple kinds of <UNK> based on morphophonology or syntactic category
- For instance, what about numbers?
- What are some other examples?
- One simple way to do this (Stanford/Berkeley): orthographic shape
- Where are capital letters, punctuation, numbers?
- cassowary = “c”
- Stanford = “Cc”
- “3G-related” = “1C-c”
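One plausible way to compute such a shape signature (a sketch; the actual Stanford/Berkeley feature extractors differ in their details):

```python
def word_shape(word):
    """Collapse a word to an orthographic-shape signature:
    lowercase -> 'c', uppercase -> 'C', digits -> '1',
    anything else (punctuation) kept as-is, with repeated runs collapsed."""
    shape = []
    for ch in word:
        if ch.islower():
            sym = "c"
        elif ch.isupper():
            sym = "C"
        elif ch.isdigit():
            sym = "1"
        else:
            sym = ch
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return "".join(shape)

print(word_shape("cassowary"))   # c
print(word_shape("Stanford"))    # Cc
print(word_shape("3G-related"))  # 1C-c
```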



Page 9: Smoothing and interpolation

Observations of UNK

In order to use these more sophisticated UNK methods, it helps to learn some parameters for UNK

- Already discussed using held-out data
- Alternatively, we can use rare items from the data we have
- Transform very rare words into <UNK>
- We can also keep them as themselves... just count them twice?
- For instance, all words that appear only once in training
- These words could easily be unseen...
- So they’re probably similar to words that are actually unseen


Page 10: Smoothing and interpolation

Good-Turing estimates

From work at Bletchley in WWII
Basic idea: estimate the number of unseen items using the trend from:

- 1-count items, 2-count items, 3-count items... etc
- For instance, if there are 100 2-count items
- And 200 1-count items
- How many 0-count items should we expect?
- Not a fully-specified algorithm (must decide how to model the trend)

Transforming rare words into <UNK> makes the assumption that there are as many 0-count words as 1-count words...

- This is unsophisticated, but easy to implement
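A rough sketch of the Good-Turing bookkeeping (an illustration of the idea, not a full estimator; the function names are assumptions): compute the count-of-counts N_r, use the adjusted count r* = (r + 1) N_{r+1} / N_r, and note that the total probability mass reserved for unseen items comes out to N_1 / N.

```python
from collections import Counter

def count_of_counts(word_counts):
    """N_r = number of word types seen exactly r times."""
    return Counter(word_counts.values())

def good_turing_adjusted_count(r, n_r):
    """Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r.
    Undefined when N_{r+1} is 0; a real implementation smooths the N_r trend."""
    if n_r.get(r, 0) == 0 or n_r.get(r + 1, 0) == 0:
        raise ValueError("need a model of the N_r trend for this r")
    return (r + 1) * n_r[r + 1] / n_r[r]

def unseen_mass(word_counts):
    """Total probability mass assigned to unseen items: N_1 / N."""
    n_r = count_of_counts(word_counts)
    total_tokens = sum(word_counts.values())
    return n_r.get(1, 0) / total_tokens
```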


Page 11: Smoothing and interpolation

Problems with the add-λ smoother

- The add-λ smoother can behave oddly
- What follows a rare context word: e.g. depart, deepen?


Page 12: Smoothing and interpolation

This is a problem...

- Our own intuitions about rare words: we can usually make some guesses
- Makes it impossible to build high-order n-gram models
- Data sparsity gets worse as n-grams get longer...
- More and more data in the long tail
- Count for “to eat a fish” ≤ “eat a fish” ≤ “a fish”
- More complex model overfits more...


Page 13: Smoothing and interpolation

A better idea

- Fall back on unigram estimates
- “the” is really common, so perhaps “depart the” and “deepen the” are also common?
- Could explicitly switch models (“backoff”), but it is common to merge estimates together:

P̂(w_{i+1} = b | w_i = a) = (#(a b) + β P_unigram(b)) / (#(a •) + β)

- β is the total number of pseudocounts (i.e., λ|V|)
- P_unigram determines how we split up the pseudocounts
- Split up unfairly so more probable items get more of β
- If P_unigram is uniform, we recover our add-λ scheme
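A minimal sketch of this estimator (an illustration; the count dictionaries follow the earlier add-λ sketch and are assumptions):

```python
def p_unigram(b, unigram_counts, total_tokens, vocab_size, lam):
    """Add-lambda unigram estimate, used here as the backoff distribution."""
    return (unigram_counts.get(b, 0) + lam) / (total_tokens + lam * vocab_size)

def p_dirichlet_bigram(b, a, bigram_counts, context_counts, p_uni, beta):
    """Bigram estimate: beta pseudocounts, split up according to p_uni."""
    return ((bigram_counts.get((a, b), 0) + beta * p_uni(b))
            / (context_counts.get(a, 0) + beta))
```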


Page 14: Smoothing and interpolation

Interpolation

Interpolation:

- Estimate multiple distributions
- Combine them so that the final estimate is in between

Notice that for 0 ≤ λ ≤ 1 and probabilities P_1, P_2, the following is a probability (linear interpolation):

P(x) = λ P_1(x) + (1 − λ) P_2(x)
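To see that this is a probability: Σ_x P(x) = λ Σ_x P_1(x) + (1 − λ) Σ_x P_2(x) = λ + (1 − λ) = 1, and P(x) ≥ 0 because λ, 1 − λ, P_1(x), and P_2(x) are all nonnegative.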

P̂(w_{i+1} = b | w_i = a) = (#(a b) + β P_unigram(b)) / (#(a •) + β)

This one is sometimes called Dirichlet smoothing:

- It is a particular kind of interpolation
- The name comes from a derivation via Bayesian statistics


Page 15: Smoothing and interpolation

Estimates for unigrams

Where does Punigram come from?

P̂(w_i = b) = (#(b) + λ) / (#(•) + λ|V|)

(Note Charniak uses α instead of λ.)
This turns out to be a special case of the Dirichlet smoothing equation.



Page 17: Smoothing and interpolation

Proof: skip this

We want to show that add-λ is a special case of the Dirichlet smoother (when P_u is uniform):

(#(a b) + β P_unigram(b)) / (#(a •) + β) = (#(a b) + λ) / (#(a •) + λ|V|)

P_u uniform means P_u(w) = 1/|V|, so:

(#(a b) + β · (1/|V|)) / (#(a •) + β) = (#(a b) + λ) / (#(a •) + λ|V|)

Choose λ = β/|V|, thus β = λ|V|:

(#(a b) + λ|V| · (1/|V|)) / (#(a •) + λ|V|) = (#(a b) + λ) / (#(a •) + λ|V|)

And cancel |V| · (1/|V|) = 1, QED


Page 18: Smoothing and interpolation

Some example effects

depart: one real instance in train (“depart from”)
Effect of smoothing with unigrams vs add-λ:

- depart the: .058 vs 6e-5
- depart from: .01 vs .007
- depart of: .03 vs 6e-5
- depart jewel: 6e-6 vs 6e-5
- depart French: .0001 vs 6e-5

the:

- the senate: .04 vs .04
- the the: .0002 vs 3e-5
- the jewel: 1e-8 vs 3e-7


Page 19: Smoothing and interpolation

High-order ngram models

This kind of smoothing enables us to go beyond bigrams:

- For instance, possible to build a practical trigram (3-gram) model from a moderate-size dataset
- Construction is pretty simple...

P̂(w_{i+1} = c | w_{i−1} = a, w_i = b) = (#(a b c) + β P_bigram(c | b)) / (#(a b •) + β)

...and so on, recursively:

- P_bigram is based on P_unigram, etc.
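A sketch of how the recursive construction might look (an illustration under the assumption that counts for each order are stored in dictionaries keyed by tuples; the β values per order are hypothetical):

```python
def make_interpolated_model(ngram_counts, context_counts, p_backoff, beta):
    """Return a function P(word | context) that mixes this order's counts
    with the lower-order model p_backoff via beta pseudocounts."""
    def p(word, context):
        numer = ngram_counts.get(context + (word,), 0) + beta * p_backoff(word, context[1:])
        denom = context_counts.get(context, 0) + beta
        return numer / denom
    return p

# Building the hierarchy bottom-up (uniform -> unigram -> bigram -> trigram);
# the count dictionaries and beta values below are placeholders:
# p_uniform = lambda w, ctx: 1.0 / vocab_size
# p_uni = make_interpolated_model(uni_counts, {(): total_tokens}, p_uniform, beta0)
# p_bi  = make_interpolated_model(bi_counts,  uni_ctx_counts, p_uni, beta1)
# p_tri = make_interpolated_model(tri_counts, bi_ctx_counts,  p_bi,  beta2)
```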


Page 20: Smoothing and interpolation

The assignment

This basically covers the whole assignment.

- Read in the training data to compute unigram and bigram counts
- Compute the dev data log-likelihood for various values of α (λ)
- Remember to replace unknown words from dev with <UNK>
- Construct a bigram model using your best unigram model for smoothing
- Compute the dev data log-likelihood for various values of β
- Use the learned probabilities to make decisions in the good-bad dataset


Page 21: Smoothing and interpolation

An LM minipresentation

This is an example minipresentation of a paper on language models

- What is the task the paper is about?
- How does the technology (i.e., LMs) apply?
- Are there any novel extensions or clever tricks?
- If so, one sentence about what they seem to be doing
- What are the results like?


Page 22: Smoothing and interpolation

Language modeling for determiner selection

Jenine Turner and Eugene Charniak, NAACL 2007

- The task: adding determiners (“a/an”, “the”) to sentences
- Useful for machine translation, e.g. from Chinese
- Use an LM to check probabilities with all possible determiners
- “Kim bought the car” vs “Kim bought a car” vs “Kim bought car”
- The LM is actually based on syntax, not just n-grams
- We’ll cover this kind of model in the parsing section

Evaluation:

- It’s important to be fair in selecting test data
- Selecting sentences at random allows sentences from the same article to be in train and test
- This inflates results, since the article usually uses the same NPs


Page 23: Smoothing and interpolation

Results

This approach is 86.3% accurate
Some errors:

- The computers were crude by the today’s standards
- In addition the Apple II was the/an affordable $1298

Some “acceptable” errors:

- Crude as they were, these early PCs triggered an explosive product development in desktop models
- Highway officials insist the ornamental railings on old bridges aren’t strong enough

Sometimes you need to know semantics:

- IBM, a/the world leader in computers, didn’t offer its first PC until 1981


Page 24: Smoothing and interpolation

Modern language models: Kneser-Ney smoothing

Kneser-Ney smoothing:

- Still a competitive algorithm for midsize datasets
- Popular modification introduced by Chen and Goodman 99
- On Carmen if you want to read it
- Hierarchical Bayesian explanation recently given
- Teh and Goldwater/Griffiths each in 2006


Page 25: Smoothing and interpolation

Discounting estimators
So far, we’ve looked at smoothing by adding pseudo-observations

- the “fake” training counts are all strictly larger than the real counts

P̂(w_{i+1} = b | w_i = a) = #(a b) / (#(a •) + β) + (β / (#(a •) + β)) P_uni(b)

Kneser-Ney is based on discounting

- we subtract counts from real items and distribute them as pseudocounts

P̂(w_{i+1} = b | w_i = a) = max(0, #(a b) − d) / #(a •) + (D_a / #(a •)) P_uni(b)

and D_a = Σ_b min(d, #(a b))
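A sketch of an absolute-discounting bigram estimator following the formula above (an illustration, not a complete Kneser-Ney implementation; the discount d = 0.75 and the count dictionaries are assumptions):

```python
def p_discounted_bigram(b, a, bigram_counts, context_counts, p_uni, d=0.75):
    """Absolute discounting: subtract d from each seen bigram count for context a
    and redistribute the freed-up mass D_a according to the backoff distribution."""
    total = context_counts.get(a, 0)
    if total == 0:
        return p_uni(b)  # unseen context: fall back entirely on the backoff model
    # D_a = sum over b' of min(d, #(a b')); looping over all bigrams is slow but simple
    d_a = sum(min(d, c) for (ctx, _), c in bigram_counts.items() if ctx == a)
    discounted = max(0.0, bigram_counts.get((a, b), 0) - d)
    return discounted / total + (d_a / total) * p_uni(b)
```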


Page 26: Smoothing and interpolation

Why discounting?

Some words need more smoothing than others

- “San” (a few city names: Francisco, Diego)
- vs “green” (nearly anything: car, house, paint...)
- Counts for “green” are spread out
- And we anticipate many new possibilities
- For “San”, concentrated on a few options
- And we anticipate few more

D_a = Σ_b min(d, #(a b))


Page 27: Smoothing and interpolation

Kneser-Ney’s Puni

We can drive the intuition behind discounting further

- Our intuition then: some words are very productive contexts
- Others restrict what appears next
- Same idea: some words appear in many contexts
- While others are very restricted
- “Francisco” (in US English) usually appears after “San”
- “report” appears all over the place

P̂_KNuni(b) = Σ_a 1[(a b) ever appears] / Σ_{a,x} 1[(a x) ever appears]

Proportion of unique word types appearing before b
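A sketch of the continuation-count computation (an illustration; bigram_counts keyed by (a, b) tuples is an assumption carried over from the earlier sketches):

```python
from collections import defaultdict

def kn_continuation_unigram(bigram_counts):
    """P_KNuni(b): fraction of distinct bigram types whose second word is b."""
    left_contexts = defaultdict(set)   # b -> set of words a seen immediately before b
    for (a, b), count in bigram_counts.items():
        if count > 0:
            left_contexts[b].add(a)
    total_types = sum(len(ctxs) for ctxs in left_contexts.values())
    return {b: len(ctxs) / total_types for b, ctxs in left_contexts.items()}
```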


Page 28: Smoothing and interpolation

Very large datasets

“More data is better data”

Google 5-grams:

- File sizes: approx. 24 GB compressed (gzip’ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html


Page 29: Smoothing and interpolation

Dealing with massive data

Primary challenges are algorithmic:

- How do you store this stuff?
- How do you query it?
- Subfield of CL devoted to this; very mathematical
- Often enough to get approximate counts
- May only need to know if count is zero or nonzero
- Or save space by clustering/hashing similar words

Basic techniques (smoothing by interpolating high- and low-order estimates) haven’t changed much


Page 30: Smoothing and interpolation

More exotic LMs

- An LM is just a distribution over strings
- Doesn’t have to break down by n-grams
- Possible to incorporate more interesting syntax/semantics/etc.
- ...Worth thinking about how the rest of the course can inform LMs
- and people are trying
- But the tools we’ve covered so far aren’t sophisticated enough
