Smoothing and interpolation
Micha Elsner
January 9, 2014
Last time
- Estimation for the bigram model needs smoothing (regularization)
  - ...to cope with sparse data
- Simple add-λ smoother (a code sketch follows below):

$$\hat{P}(w_{i+1} = b \mid w_i = a) = \frac{\#(a\,b) + \lambda}{\#(a\,\bullet) + \lambda|V|}$$
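As a concrete reference, here is a minimal sketch of the add-λ estimate in Python (the function and variable names are my own, not from the assignment code):

```python
from collections import Counter

def add_lambda_bigram(bigram_counts, context_counts, vocab_size, lam):
    """Return a function computing
    P(b | a) = (#(a b) + lam) / (#(a .) + lam * |V|)."""
    def prob(a, b):
        return (bigram_counts[(a, b)] + lam) / \
               (context_counts[a] + lam * vocab_size)
    return prob

# Tiny usage example:
train = "the cat sat on the mat".split()
bigrams = Counter(zip(train, train[1:]))   # #(a b)
contexts = Counter(train[:-1])             # #(a .)
p = add_lambda_bigram(bigrams, contexts, vocab_size=len(set(train)), lam=0.1)
print(p("the", "cat"))
```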
Log-likelihood curve for a bigram model
This isn't quite the one I'm outlining now; I used my code for Assignment 1, which is a bit more complex.
Finding optimal λ: line search

Line search
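A minimal sketch of a grid-based line search, assuming we already have an objective function dev_log_likelihood(lam) that scores the held-out data (real implementations often use something cleverer, e.g. golden-section search):

```python
def line_search(objective, lo=1e-4, hi=10.0, steps=50):
    """Evaluate the objective at log-spaced points in [lo, hi]
    and return the best point and its score."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    scored = [(objective(lam), lam) for lam in grid]
    best_score, best_lam = max(scored)
    return best_lam, best_score

# best_lambda, best_ll = line_search(dev_log_likelihood)
```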
Mathematical optimization
Optimization: mathematical/computer science techniques for solving max or min problems
- A large library of techniques
- We'll cover some basic ones

Objective: what you are maximizing/minimizing

Key insight: the objective is separate from the optimization method!
- Be clear about what you want
- Then think about how to find it

Two methods so far: ratio-of-counts (a closed-form estimate) and one-dimensional line search (an iterative method)
Unknown words
Issue: how big is the vocabulary V anyway?
- Common to take the set of words in the training data as the vocabulary
- Plus an "unknown word" symbol <UNK>
  - This symbol is assigned to all words not appearing in train
  - Of course it has no training counts...
- Train: The cat is on the mat.
- Test: The dog is on the sofa.
- → Transformed test: The <UNK> is on the <UNK> (a code sketch follows below)
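A minimal sketch of this transformation (assuming whitespace-tokenized text):

```python
def replace_unknowns(tokens, vocab, unk="<UNK>"):
    """Map any token outside the training vocabulary to <UNK>."""
    return [t if t in vocab else unk for t in tokens]

train = "The cat is on the mat .".split()
test = "The dog is on the sofa .".split()
vocab = set(train) | {"<UNK>"}
print(replace_unknowns(test, vocab))
# ['The', '<UNK>', 'is', 'on', 'the', '<UNK>', '.']
```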
More complicated UNK and vocabulary systems
- Possible to "spell out" unknown words letter by letter
  - The d.o.g is on the s.o.f.a
  - The letter probabilities can be fit with more held-out data
- Can have multiple kinds of <UNK> based on morphophonology or syntactic category
  - For instance, what about numbers?
  - What are some other examples?
- One simple way to do this (Stanford/Berkeley): orthographic shape (a code sketch follows below)
  - Where are capital letters, punctuation, numbers?
  - cassowary = "c"
  - Stanford = "Cc"
  - "3G-related" = "1C-c"
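One way to compute such a shape, collapsing repeated character classes (a sketch; the exact Stanford/Berkeley conventions differ in their details):

```python
def word_shape(word):
    """Map lowercase runs to 'c', uppercase runs to 'C',
    digit runs to '1'; keep other characters as-is."""
    shape = []
    for ch in word:
        cls = ("c" if ch.islower() else
               "C" if ch.isupper() else
               "1" if ch.isdigit() else ch)
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)

print(word_shape("cassowary"))   # c
print(word_shape("Stanford"))    # Cc
print(word_shape("3G-related"))  # 1C-c
```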
Observations of UNK
In order to use these more sophisticated UNK methods, it helps to learn some parameters for UNK.
- Already discussed: using held-out data
- Alternately, we can use rare items from the data we have
  - Transform very rare words into <UNK> (a code sketch follows below)
  - We can also keep them as themselves... just count them twice?
- For instance, all words that appear only once in training
  - These words could easily be unseen...
  - So they're probably similar to words that are actually unseen
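A minimal sketch of the rare-word transform (here treating words with count 1 as <UNK>):

```python
from collections import Counter

def unk_rare_words(train_tokens, min_count=2, unk="<UNK>"):
    """Replace words seen fewer than min_count times with <UNK>,
    so that <UNK> gets real training counts of its own."""
    counts = Counter(train_tokens)
    return [t if counts[t] >= min_count else unk for t in train_tokens]
```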
Good-Turing estimates
From work at Bletchley Park in WWII. Basic idea: estimate the number of unseen items using the trend from:
- 1-count items, 2-count items, 3-count items... etc.
- For instance, if there are 100 2-count items
- And 200 1-count items
- How many 0-count items should we expect?
- Not a fully-specified algorithm (you must decide how to model the trend; a sketch of the bookkeeping follows below)

Transforming rare words into UNK makes the assumption that there are as many 0-count words as 1-count words...
- This is unsophisticated, but easy to implement
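A sketch of the count-of-counts bookkeeping behind this idea; the standard Good-Turing estimate assigns unseen words a total mass of N_1/N (once-seen types over total tokens), while modeling the full trend is left open, as noted above:

```python
from collections import Counter

def count_of_counts(train_tokens):
    """n[r] = number of distinct word types seen exactly r times."""
    return Counter(Counter(train_tokens).values())

def unseen_mass(train_tokens):
    """Good-Turing estimate of the total probability of unseen words."""
    n = count_of_counts(train_tokens)
    return n[1] / len(train_tokens)
```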
Problems with the add-λ smoother
- The add-λ smoother can behave oddly
- What follows a rare context word, e.g. depart, deepen?
This is a problem...
- Contrast our own intuitions about rare words: we can usually make some guesses
- Makes it impossible to build high-order n-gram models
  - Data sparsity gets worse as grams get longer...
  - More and more data in the long tail
  - Count for "to eat a fish" ≤ "eat a fish" ≤ "a fish"
  - More complex models overfit more...
A better idea
- Fall back on unigram estimates
  - "the" is really common, so perhaps "depart the" and "deepen the" are also common?
- Could explicitly switch models ("backoff"), but it is common to merge estimates together (sketched in code below):

$$\hat{P}(w_{i+1} = b \mid w_i = a) = \frac{\#(a\,b) + \beta\, P_{\text{unigram}}(w_{i+1})}{\#(a\,\bullet) + \beta}$$

- β is the total number of pseudocounts (i.e., λ|V|)
- P_unigram determines how we split up the pseudocounts
  - Split up unfairly so more probable items get more of β
  - If P_unigram is uniform, we recover our λ scheme
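A minimal sketch of this estimator, assuming unigram_prob is some already-smoothed unigram distribution:

```python
def dirichlet_bigram(bigram_counts, context_counts, unigram_prob, beta):
    """Return P(b | a) = (#(a b) + beta * P_unigram(b)) / (#(a .) + beta)."""
    def prob(a, b):
        return (bigram_counts[(a, b)] + beta * unigram_prob(b)) / \
               (context_counts[a] + beta)
    return prob
```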
Interpolation
- Estimate multiple distributions
- Combine so that the final estimate is in between

Notice that for 0 ≤ λ ≤ 1 and probabilities P_1, P_2, the following is a probability (linear interpolation):

$$P(x) = \lambda P_1(x) + (1 - \lambda) P_2(x)$$

$$\hat{P}(w_{i+1} = b \mid w_i = a) = \frac{\#(a\,b) + \beta\, P_{\text{unigram}}(w_{i+1})}{\#(a\,\bullet) + \beta}$$

This one is sometimes called Dirichlet smoothing
- It is a particular kind of interpolation
- The name comes from a derivation via Bayesian statistics
Estimates for unigrams
Where does P_unigram come from?

$$\hat{P}(w_i = b) = \frac{\#(b) + \lambda}{\#(\bullet) + \lambda|V|}$$

(Note: Charniak uses α instead of λ.) This turns out to be a special case of the Dirichlet smoothing equation.
Proof: skip this
We want to show that add-λ is a special case of the Dirichlet smoother (when P_u is uniform):

$$\frac{\#(a\,b) + \beta P_{\text{unigram}}(w_{i+1})}{\#(a\,\bullet) + \beta} = \frac{\#(a\,b) + \lambda}{\#(a\,\bullet) + \lambda|V|}$$

P_u uniform means $P_u(w) = \frac{1}{|V|}$, so the left-hand side becomes:

$$\frac{\#(a\,b) + \beta \frac{1}{|V|}}{\#(a\,\bullet) + \beta}$$

Choose $\lambda = \frac{\beta}{|V|}$, thus $\beta = \lambda|V|$:

$$\frac{\#(a\,b) + \lambda|V| \cdot \frac{1}{|V|}}{\#(a\,\bullet) + \lambda|V|} = \frac{\#(a\,b) + \lambda}{\#(a\,\bullet) + \lambda|V|}$$

And cancel $|V| \cdot \frac{1}{|V|} = 1$, QED.
Some example effects
"depart": one real instance in train ("depart from"). Effect of smoothing with unigrams vs add-λ:
- depart the: .058 vs 6e-5
- depart from: .01 vs .007
- depart of: .03 vs 6e-5
- depart jewel: 6e-6 vs 6e-5
- depart French: .0001 vs 6e-5

"the":
- the senate: .04 vs .04
- the the: .0002 vs 3e-5
- the jewel: 1e-8 vs 3e-7
High-order ngram models
This kind of smoothing enables us to go beyond bigrams:
- For instance, it is possible to build a practical trigram (3-gram) model from a moderate-size dataset
- Construction is pretty simple...

$$\hat{P}(w_{i+1} = c \mid w_{i-1} = a, w_i = b) = \frac{\#(a\,b\,c) + \beta\, P_{\text{bigram}}(c \mid b)}{\#(a\,b\,\bullet) + \beta}$$

...and so on, recursively (a sketch follows below)
- P_bigram is based on P_unigram, etc.
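A sketch of the recursion, assuming counts and context_counts are Counters keyed by tuples of every order and base_prob is the smoothed unigram estimate (names are illustrative):

```python
def smoothed_ngram_prob(counts, context_counts, word, context, base_prob, beta):
    """P(w | context) = (#(context + w) + beta * P(w | shorter context))
                        / (#(context + .) + beta),
    bottoming out in the smoothed unigram base_prob(w)."""
    if not context:
        return base_prob(word)
    lower = smoothed_ngram_prob(counts, context_counts, word,
                                context[1:], base_prob, beta)
    return (counts[context + (word,)] + beta * lower) / \
           (context_counts[context] + beta)
```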
The assignment
This basically covers the whole assignment.
- Read in the training data to compute unigram and bigram counts
- Compute the dev data log-likelihood for various values of α (λ)
  - Remember to replace unknown words from dev with <UNK>
- Construct a bigram model using your best unigram model for smoothing
- Compute the dev data log-likelihood for various values of β (a sketch of this computation follows below)
- Use the learned probabilities to make decisions in the good-bad dataset
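For the log-likelihood steps, a minimal sketch (assuming prob(a, b) is one of the conditional estimates above and the dev tokens already have <UNK> substituted):

```python
import math

def dev_log_likelihood(dev_tokens, prob):
    """Total log probability of all dev bigrams under prob(a, b)."""
    return sum(math.log(prob(a, b))
               for a, b in zip(dev_tokens, dev_tokens[1:]))
```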
An LM minipresentation
This is an example minipresentation of a paper on language models.
- What is the task the paper is about?
- How does the technology (i.e. LMs) apply?
- Are there any novel extensions or clever tricks?
  - If so, one sentence about what they seem to be doing
- What are the results like?
Language modeling for determiner selection
Jenine Turner and Eugene Charniak, NAACL 2007
- The task: adding determiners ("a/an", "the") to sentences
  - Useful for machine translation, e.g. from Chinese
- Use an LM to check probabilities with all possible determiners
  - "Kim bought the car" vs "Kim bought a car" vs "Kim bought car"
- The LM is actually based on syntax, not just n-grams
  - We'll cover this kind of model in the parsing section

Evaluation:
- It's important to be fair in selecting test data
- Selecting sentences at random allows sentences from the same article to be in both train and test
  - This inflates results, since the article usually uses the same NPs
Results
This approach is 86.3% accurate.
Some errors:
- The computers were crude by the today's standards
- In addition the Apple II was the/an affordable $1298
Some "acceptable" errors:
- Crude as they were, these early PCs triggered an explosive product development in desktop models
- Highway officials insist the ornamental railings on old bridges aren't strong enough
Sometimes you need to know semantics:
- IBM, a/the world leader in computers, didn't offer its first PC until 1981
Modern language models: Kneser-Ney smoothing
- Still a competitive algorithm for midsize datasets
- A popular modification was introduced by Chen and Goodman (1999)
  - On Carmen if you want to read it
- A hierarchical Bayesian explanation was given recently
  - Teh and Goldwater/Griffiths, each in 2006
Discounting estimators

So far, we've looked at smoothing by adding pseudo-observations
- The "fake" training counts are all strictly larger than the real counts:

$$\hat{P}(w_{i+1} = b \mid w_i = a) = \frac{\#(a\,b)}{\#(a\,\bullet) + \beta} + \frac{\beta}{\#(a\,\bullet) + \beta}\, P_{\text{uni}}(b)$$

Kneser-Ney is based on discounting
- We subtract counts from real items and distribute them as pseudocounts (see the sketch below):

$$\hat{P}(w_{i+1} = b \mid w_i = a) = \frac{\max(0, \#(a\,b) - d)}{\#(a\,\bullet)} + \frac{D_a}{\#(a\,\bullet)}\, P_{\text{uni}}(b)$$

where $D_a = \sum_b \min(d, \#(a\,b))$
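A minimal sketch of the discounted estimate, assuming 0 < d < 1 and that every queried context a appeared in training:

```python
def discounted_bigram(bigram_counts, context_counts, uni_prob, d=0.75):
    """P(b | a) = max(0, #(a b) - d) / #(a .) + (D_a / #(a .)) * P_uni(b),
    with D_a = sum_b min(d, #(a b)), the mass removed from context a."""
    D = {}
    for (a, _b), c in bigram_counts.items():
        D[a] = D.get(a, 0.0) + min(d, c)
    def prob(a, b):
        n_a = context_counts[a]
        return max(0.0, bigram_counts[(a, b)] - d) / n_a + \
               (D.get(a, 0.0) / n_a) * uni_prob(b)
    return prob
```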
Why discounting?
Some words need more smoothing than others
- "San" (a few city names: Francisco, Diego)
- vs "green" (nearly anything: car, house, paint...)
- Counts for "green" are spread out
  - And we anticipate many new possibilities
- For "San", counts are concentrated on a few options
  - And we anticipate few more

$$D_a = \sum_b \min(d, \#(a\,b))$$
Kneser-Ney's P_uni

We can drive the intuition behind discounting further
- Our intuition then: some words are very productive contexts
  - Others restrict what appears next
- Same idea: some words appear in many contexts
  - While others are very restricted
- "Francisco" (in US English) usually appears after "San"
- "report" appears all over the place

$$\hat{P}^{\text{KN}}_{\text{uni}}(b) = \frac{\sum_a \mathbf{1}[(a\,b)\ \text{ever appears}]}{\sum_{a,x} \mathbf{1}[(a\,x)\ \text{ever appears}]}$$

The proportion of unique word types appearing before b (see the sketch below)
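A sketch of this continuation probability, computed from bigram types rather than tokens:

```python
def kn_continuation_prob(bigram_counts):
    """P_KN_uni(b) = (number of distinct words seen before b)
                     / (total number of distinct bigram types)."""
    preceding = {}                     # b -> set of left neighbors
    for (a, b) in bigram_counts:
        preceding.setdefault(b, set()).add(a)
    total_types = len(bigram_counts)
    def prob(b):
        return len(preceding.get(b, ())) / total_types
    return prob
```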
Very large datasets
“More data is better data”
Google 5-grams:
- File sizes: approx. 24 GB compressed (gzip'ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dealing with massive data
Primary challenges are algorithmic
- How do you store this stuff?
- How do you query it?
- A subfield of CL is devoted to this; very mathematical
- Often enough to get approximate counts
  - May only need to know if a count is zero or nonzero (a sketch follows below)
  - Or save space by clustering/hashing similar words

Basic techniques (smoothing by interpolating high- and low-order estimates) haven't changed much
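The zero/nonzero question, for instance, is commonly answered approximately in small space with a Bloom filter; this is a standard technique rather than anything specific to these slides, sketched here:

```python
import hashlib

class BloomFilter:
    """Approximate set membership: 'not present' answers are always
    correct; 'present' answers may rarely be false positives."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
seen.add("eat a fish")
print("eat a fish" in seen)   # True
print("eat a shoe" in seen)   # almost certainly False
```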
More exotic LMs
- An LM is just a distribution over strings
  - It doesn't have to break down by n-grams
- Possible to incorporate more interesting syntax/semantics/etc.
- ...Worth thinking about how the rest of the course can inform LMs
  - And people are trying
  - But the tools we've covered so far aren't sophisticated enough