CS 136a Lecture 8: Yet More Language Modeling

September 16, 2016, Professor Meteer
Thanks to Dan Jurafsky & Josh Goodman for many of these slides


Overview (from Microsoft Tutorial)

- Caching
- Skipping
- Clustering
- Sentence-mixture models
- Structured language models
- Tools
- More on the author, Josh Goodman: http://research.microsoft.com/en-us/um/people/joshuago/icmldescription.htm


Caching

- If you say something, you are likely to say it again later.
- Interpolate trigram with cache:

  $$P(z \mid \text{history}) = \lambda P_{\text{smooth}}(z \mid xy) + (1 - \lambda) P_{\text{cache}}(z \mid \text{history})$$

  $$P_{\text{cache}}(z \mid \text{history}) = \frac{C(z \in \text{history})}{\text{length}(\text{history})}$$

- Trigram caches get almost twice the improvement of unigram caches
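A minimal Python sketch of the unigram cache interpolation above. The function names, the default λ = 0.9, and the toy history are assumptions for illustration; `p_smooth` stands in for whatever value the smoothed trigram model returns for P_smooth(z | xy).

```python
from collections import Counter

def cache_prob(z, history):
    """Unigram cache: C(z in history) / length(history)."""
    if not history:
        return 0.0
    return Counter(history)[z] / len(history)

def interpolated_prob(z, history, p_smooth, lam=0.9):
    """lam * P_smooth(z | xy) + (1 - lam) * P_cache(z | history).

    p_smooth is the smoothed trigram probability, computed elsewhere.
    lam = 0.9 is an arbitrary placeholder; in practice it is tuned on held-out data.
    """
    return lam * p_smooth + (1 - lam) * cache_prob(z, history)

# Toy history (assumed): "dow" was said twice in seven words.
history = "the dow rose and the dow fell".split()
p = interpolated_prob("dow", history, p_smooth=0.001)
```

A word that already occurred in the history gets a large boost over its trigram probability, which is the point of the cache.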


Caching: Variations

- N-gram caches:

  $$P_{\text{cache}}(z \mid \text{history}) = \frac{C(xyz \in \text{history})}{C(xy \in \text{history})}$$

- Conditional n-gram cache: use the n-gram cache only if xy ∈ history
- Remove function words from the cache, like “the”, “to”
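The conditional trigram cache can be sketched as follows. This is an illustration under assumptions: the function name, λ = 0.8, and the toy history are made up, and C(xy) is counted only where a following word exists so that the cache probabilities sum to one.

```python
from collections import Counter

def conditional_trigram_cache(z, x, y, history, p_smooth, lam=0.8):
    """Use the trigram cache only if the context xy occurred in the history.

    P_cache(z | history) = C(xyz in history) / C(xy in history),
    where C(xy) counts occurrences of xy that are followed by some word.
    """
    trigrams = Counter(zip(history, history[1:], history[2:]))
    context_count = sum(c for (a, b, _), c in trigrams.items() if (a, b) == (x, y))
    if context_count == 0:
        # Context never seen in the history: fall back to the smoothed model alone.
        return p_smooth
    p_cache = trigrams[(x, y, z)] / context_count
    return lam * p_smooth + (1 - lam) * p_cache

history = "party on tuesday and party on friday".split()
seen = conditional_trigram_cache("tuesday", "party", "on", history, p_smooth=0.01)
unseen = conditional_trigram_cache("tuesday", "dance", "on", history, p_smooth=0.01)
```

Function-word removal is not shown here; it would amount to filtering the history before counting.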


Skipping

- Capturing phrasal elements
  - Show John a good time → Show XXX a good time
- Standard 5-gram: P(z|…rstuvwxy) ≈ P(z|vwxy)
- Why not P(z|v_xy)? A “skipping” n-gram skips the value of the 3-back word
- Example: P(time | show John a good) → P(time | show ____ a good)
- P(z|…rstuvwxy) ≈ λP(z|vwxy) + µP(z|vw_y) + (1-λ-µ)P(z|v_xy)
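A small Python sketch of the skipping interpolation. Everything here is an assumption made for illustration: the toy sentences, the weights λ = 0.5 and µ = 0.25, and the use of unsmoothed maximum-likelihood estimates for each component (a real model would smooth each one).

```python
from collections import Counter

def collect(sentences, keep):
    """Count (context, word) pairs, keeping only some of the four context slots.

    keep lists which positions of the context (0=v, 1=w, 2=x, 3=y) are retained.
    """
    counts = Counter()
    for s in sentences:
        for i in range(4, len(s)):
            ctx = tuple(s[i - 4 + j] for j in keep)
            counts[(ctx, s[i])] += 1
    return counts

def ml_estimate(counts, ctx, z):
    """Unsmoothed maximum-likelihood estimate; 0.0 if the context is unseen."""
    total = sum(c for (k, _), c in counts.items() if k == ctx)
    return counts[(ctx, z)] / total if total else 0.0

def skipping_prob(z, v, w, x, y, full, skip_x, skip_w, lam=0.5, mu=0.25):
    """lam*P(z|vwxy) + mu*P(z|vw_y) + (1-lam-mu)*P(z|v_xy)."""
    return (lam * ml_estimate(full, (v, w, x, y), z)
            + mu * ml_estimate(skip_x, (v, w, y), z)
            + (1 - lam - mu) * ml_estimate(skip_w, (v, x, y), z))

# Toy training data (assumed).
sentences = [s.split() for s in (
    "show john a good time",
    "show mary a good time",
    "show us a good laugh",
)]
full = collect(sentences, keep=(0, 1, 2, 3))
skip_x = collect(sentences, keep=(0, 1, 3))   # vw_y
skip_w = collect(sentences, keep=(0, 2, 3))   # v_xy

# "bill" never appears in training, so only the v_xy component fires.
p = skipping_prob("time", "show", "bill", "a", "good", full, skip_x, skip_w)
```

The unseen name in the 3-back slot zeroes out the full 5-gram estimate, but the skipping component still recovers "show ____ a good time".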


Clustering

- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
  - Similar to P(Monday | party on)
  - Similar to P(Tuesday | celebration on)
- Put words in clusters:
  - WEEKDAY = Sunday, Monday, Tuesday, …
  - EVENT = party, celebration, birthday, …


Clustering overview

- Major topic, useful in many fields
- Kinds of clustering
  - Predictive clustering
  - Conditional clustering
  - IBM-style clustering
- How to get clusters
  - Be clever or it takes forever!


Predictive clustering

- Let “z” be a word, “Z” be its cluster
- One cluster per word: hard clustering
  - WEEKDAY = Sunday, Monday, Tuesday, …
  - MONTH = January, February, April, May, June, …
- P(z|xy) = P(Z|xy) × P(z|xyZ)
- P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
- Psmooth(z|xy) ≈ Psmooth(Z|xy) × Psmooth(z|xyZ)


Predictive clustering example

- Find P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY)
  - C(party on Tuesday) = 0
  - C(party on Wednesday) = 10
  - C(arriving on Tuesday) = 10
  - C(on Tuesday) = 100
- Psmooth(WEEKDAY | party on) is high
- Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY)
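The example can be traced numerically with a rough Python sketch. The four counts from the slide are used as given; every other count is an assumption added so the ratios are defined, and the "backoff" is a crude fall-through with no discounting, standing in for a real smoothing method.

```python
from collections import Counter

WEEKDAY = {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}

TRIGRAMS = Counter({
    ("party", "on", "Wednesday"): 10,   # from the slide
    ("arriving", "on", "Tuesday"): 10,  # from the slide
    ("party", "on", "the"): 10,         # assumed, so "party on" has a non-weekday continuation
})
BIGRAMS = Counter({
    ("on", "Tuesday"): 100,    # from the slide
    ("on", "Wednesday"): 100,  # assumed
    ("on", "the"): 300,        # assumed
})

def p_cluster(cluster, x, y):
    """P(Z | xy): share of continuations of xy that fall inside the cluster."""
    total = sum(c for (a, b, _), c in TRIGRAMS.items() if (a, b) == (x, y))
    inside = sum(c for (a, b, w), c in TRIGRAMS.items()
                 if (a, b) == (x, y) and w in cluster)
    return inside / total if total else 0.0

def p_word_in_cluster(z, x, y, cluster):
    """P(z | xyZ), falling back to P(z | yZ) when the trigram xyz is unseen."""
    if TRIGRAMS[(x, y, z)] > 0:
        inside = sum(c for (a, b, w), c in TRIGRAMS.items()
                     if (a, b) == (x, y) and w in cluster)
        return TRIGRAMS[(x, y, z)] / inside
    inside = sum(c for (b, w), c in BIGRAMS.items() if b == y and w in cluster)
    return BIGRAMS[(y, z)] / inside if inside else 0.0

def predictive_prob(z, x, y, cluster):
    return p_cluster(cluster, x, y) * p_word_in_cluster(z, x, y, cluster)

p = predictive_prob("Tuesday", "party", "on", WEEKDAY)
```

With these counts, P(WEEKDAY | party on) = 10/20 and the fallback P(Tuesday | on WEEKDAY) = 100/200, so the clustered estimate is 0.25 even though "party on Tuesday" was never observed.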


Conditional clustering

Condition on classes in the context:

P(z|xy) = P(z|xXyY)

P(Tuesday | party on) = P(Tuesday | party EVENT on PREPOSITION)

Psmooth(z|xy) ≈ Psmooth(z|xXyY), for example:

  λ PML(Tuesday | party EVENT on PREPOSITION)
  + µ PML(Tuesday | EVENT on PREPOSITION)
  + δ PML(Tuesday | on PREPOSITION)
  + γ PML(Tuesday | PREPOSITION)
  + (1 - λ - µ - δ - γ) PML(Tuesday)


Conditional clustering example

Eliminating redundancy: under hard clustering a word determines its class, so a class that appears next to its own word in the context adds no information and can be dropped.

  λ P(Tuesday | party EVENT on PREPOSITION)
  + µ P(Tuesday | EVENT on PREPOSITION)
  + δ P(Tuesday | on PREPOSITION)
  + γ P(Tuesday | PREPOSITION)
  + (1 - λ - µ - δ - γ) P(Tuesday)
=
  λ P(Tuesday | party on)
  + µ P(Tuesday | EVENT on)
  + δ P(Tuesday | on)
  + γ P(Tuesday | PREPOSITION)
  + (1 - λ - µ - δ - γ) P(Tuesday)


Combined clustering

- P(z|xy) ≈ Psmooth(Z|xXyY) × Psmooth(z|xXyYZ)
- P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party EVENT on PREPOSITION) × Psmooth(Tuesday | party EVENT on PREPOSITION WEEKDAY)
- Much larger than unclustered, somewhat lower perplexity.


IBM Clustering

- P(z|xy) ≈ Psmooth(Z|XY) × P(z|Z)
  - P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
  - Small, very smooth, mediocre perplexity
- P(z|xy) ≈ λ Psmooth(z|xy) + (1 - λ) Psmooth(Z|XY) × P(z|Z)
  - Bigger, better than no clusters, better than combined clustering.
  - Improvement: use P(z|XYZ) instead of P(z|Z)
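A sketch of the two IBM-style formulas in Python. The probability tables are invented numbers, and the names `CLUSTER_OF`, `P_CLASS`, and `P_WORD_IN_CLASS` are hypothetical; only the structure of the computation follows the slide.

```python
# Hard clustering: each word belongs to exactly one class (toy assignment, assumed).
CLUSTER_OF = {"party": "EVENT", "on": "PREPOSITION", "Tuesday": "WEEKDAY"}

# P(Z | X Y): class trigram model (made-up value).
P_CLASS = {("EVENT", "PREPOSITION"): {"WEEKDAY": 0.4}}

# P(z | Z): probability of a word given its own class (made-up value).
P_WORD_IN_CLASS = {"WEEKDAY": {"Tuesday": 0.15}}

def ibm_prob(z, x, y):
    """P(z | xy) ~ P(Z | XY) * P(z | Z)."""
    X, Y, Z = CLUSTER_OF[x], CLUSTER_OF[y], CLUSTER_OF[z]
    return P_CLASS[(X, Y)].get(Z, 0.0) * P_WORD_IN_CLASS[Z].get(z, 0.0)

def interpolated_ibm_prob(z, x, y, p_word_trigram, lam=0.7):
    """lam * P_smooth(z | xy) + (1 - lam) * P(Z | XY) * P(z | Z).

    p_word_trigram is the ordinary smoothed word trigram probability,
    computed elsewhere; lam = 0.7 is an arbitrary placeholder.
    """
    return lam * p_word_trigram + (1 - lam) * ibm_prob(z, x, y)

pure = ibm_prob("Tuesday", "party", "on")
mixed = interpolated_ibm_prob("Tuesday", "party", "on", p_word_trigram=0.0)
```

Even when the word trigram gives the event no probability, the class model contributes a nonzero estimate through the interpolation.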


Clustering by Position

- “A” and “AN”: same cluster or different cluster?
- Same cluster for predictive clustering
- Different clusters for conditional clustering
- Small improvement by using different clusters for conditional and predictive


Clustering: how to get them

- Build them by hand
  - Works OK when there is almost no data
- Part of Speech (POS) tags
  - Tends not to work as well as automatic
- Automatic clustering
  - Swap words between clusters to minimize perplexity


Clustering: automatic

- Minimize perplexity of P(z|Y)
- Mathematical tricks speed it up
- Use top-down splitting, not bottom-up merging!
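The swapping idea can be illustrated with a brute-force Python sketch. This is an assumption-laden toy: it recomputes the full log-likelihood after every trial move, uses none of the mathematical speedups the slide mentions, and does not do top-down splitting. Minimizing perplexity of P(z|Y) is equivalent to maximizing the training log-likelihood computed here.

```python
import math
from collections import Counter

def log_likelihood(bigrams, cluster_of):
    """Sum over bigrams (y, z) of C(y, z) * log P(z | Y), with Y = cluster of y."""
    joint, context = Counter(), Counter()
    for (y, z), n in bigrams.items():
        Y = cluster_of[y]
        joint[(Y, z)] += n
        context[Y] += n
    return sum(n * math.log(n / context[Y]) for (Y, z), n in joint.items())

def exchange(bigrams, cluster_of, num_clusters, sweeps=5):
    """Move one word at a time to whichever cluster gives the best likelihood."""
    cluster_of = dict(cluster_of)
    best = log_likelihood(bigrams, cluster_of)
    for _ in range(sweeps):
        moved = False
        for word in sorted(cluster_of):
            best_k = cluster_of[word]
            for k in range(num_clusters):
                cluster_of[word] = k
                ll = log_likelihood(bigrams, cluster_of)
                if ll > best + 1e-12:
                    best, best_k, moved = ll, k, True
            cluster_of[word] = best_k
        if not moved:
            break
    return cluster_of, best

# Toy bigram counts (assumed): weekday words precede "morning", event words precede "on".
bigrams = Counter({
    ("monday", "morning"): 5, ("tuesday", "morning"): 5,
    ("party", "on"): 5, ("celebration", "on"): 5,
})
# Deliberately bad starting assignment.
start = {"monday": 0, "party": 0, "tuesday": 1, "celebration": 1}
clusters, ll = exchange(bigrams, start, num_clusters=2)
```

On this toy data the search ends with the two weekday words in one cluster and the two event words in the other.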


Two actual WSJ classes

- MONDAYS
- FRIDAYS
- THURSDAY
- MONDAY
- EURODOLLARS
- SATURDAY
- WEDNESDAY
- FRIDAY
- TENTERHOOKS
- TUESDAY
- SUNDAY
- CONDITION
- PARTY
- FESCO
- CULT
- NILSON
- PETA
- CAMPAIGN
- WESTPAC
- FORCE
- CONRAN
- DEPARTMENT
- PENH
- GUILD


Sentence Mixture Models

- Lots of different sentence types:
  - Numbers (The Dow rose one hundred seventy three points)
  - Quotations (Officials said “quote we deny all wrong doing ”quote)
  - Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)
- Model each sentence type separately


Sentence Mixture Models

- Roll a die to pick sentence type, s_k, with probability λ_k
- Probability of sentence, given s_k:

  $$\prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1} s_k)$$

- Probability of sentence across types:

  $$\sum_{k=1}^{m} \lambda_k \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1} s_k)$$
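The sum over sentence types can be computed as in the following Python sketch. The two toy "models" are invented stand-ins that ignore their context; a real system would plug in one trigram model per sentence type. Working in log space with a log-sum-exp is a design choice of this sketch, made to avoid underflow on long sentences.

```python
import math

def sentence_mixture_logprob(words, weights, models):
    """log of sum_k lambda_k * prod_i P(w_i | w_{i-2} w_{i-1}, s_k).

    weights: mixture weights lambda_k (should sum to 1).
    models:  one callable per sentence type, model(w, u, v) -> P(w | u v, s_k).
    """
    padded = ["<s>", "<s>"] + list(words)
    per_type = []
    for lam, model in zip(weights, models):
        lp = math.log(lam)
        for i in range(2, len(padded)):
            lp += math.log(model(padded[i], padded[i - 2], padded[i - 1]))
        per_type.append(lp)
    # log-sum-exp over sentence types
    top = max(per_type)
    return top + math.log(sum(math.exp(lp - top) for lp in per_type))

# Toy per-type models (assumed): the "numbers" type likes the word "hundred".
def numbers_model(w, u, v):
    return 0.5 if w == "hundred" else 0.1

def quotes_model(w, u, v):
    return 0.1

logp = sentence_mixture_logprob(["one", "hundred"], [0.5, 0.5],
                                [numbers_model, quotes_model])
```

Here the total is 0.5 × (0.1 × 0.5) + 0.5 × (0.1 × 0.1) = 0.03, dominated by the type that fits the sentence.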


Sentence Model Smoothing

- Each topic model is smoothed with overall model.
- Sentence mixture model is smoothed with overall model (sentence type 0).

  $$\sum_{k=0}^{m} \lambda_k \prod_{i=1}^{n} \left[ \mu_k P(w_i \mid w_{i-2} w_{i-1} s_k) + (1 - \mu_k) P(w_i \mid w_{i-2} w_{i-1}) \right]$$


Sentence Mixture Results

[Figure: Sentence mixture models (10,000,000 training). Perplexity (y-axis, ticks from 108 to 126) versus log-2 number of mixtures (x-axis, 0 to 7), with one curve for the sentence mixture and one for the baseline. Annotation: 13% reduction.]


Sentence Clustering

- Same algorithm as word clustering
- Assign each sentence to a type, s_k
- Minimize perplexity of P(z|s_k) instead of P(z|Y)


Topic Examples - 0 (Mergers and acquisitions)

- JOHN BLAIR & COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD
- INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD
- JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD
- JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD
- MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD


Topic Examples - 1 (production, promotions, commas)

- MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD
- BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD
- JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD
- MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD


Topic Examples - 2 (Numbers)

- SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD
- THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD
- COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD
- INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD


Topic Examples - 3 (quotations)

- NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD
- THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD
- THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD
- THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD
- BUT THE COMPANY WOULDN'T ELABORATE .PERIOD


Structured Language Model

“The contract ended with a loss of 7 cents after”


How to get structured data?

- Use a treebank (a collection of sentences with hand-annotated structure), like the Wall Street Journal Penn Treebank.
- Problem: need a treebank.
- Or use a treebank (WSJ) to train a parser, then parse new training data (e.g. Broadcast News)
- Re-estimate parameters to get lower-perplexity models.


Structured Language Models

- Use structure of language to detect long-distance information
- Promising results
- But: time consuming; language is right branching; 5-grams and skipping capture similar information.


Some Experiments

- Goodman re-implemented all techniques
- Trained on 260,000,000 words of WSJ
- Optimize parameters on held-out set
- Test on separate test section
- Some combinations extremely time-consuming (days of CPU time)
  - Don’t try this at home, or in anything you want to ship
- Rescored N-best lists to get results
  - Maximum possible improvement from 10% word error rate absolute to 5%
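N-best rescoring in general can be sketched as below. This is not Goodman's setup: the scoring rule (recognizer log score plus a weighted LM log probability), the `lm_weight` parameter, and the toy scores are all assumptions chosen to show the idea of re-ranking a fixed hypothesis list with a new language model.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank hypotheses by recognizer score plus weighted LM log probability.

    nbest:      list of (sentence, recognizer_logscore) pairs.
    lm_logprob: callable returning the new LM's log probability of a sentence.
    Returns the hypotheses sorted best first.
    """
    return sorted(nbest,
                  key=lambda hyp: hyp[1] + lm_weight * lm_logprob(hyp[0]),
                  reverse=True)

# Toy scores (assumed): the recognizer slightly prefers the wrong hypothesis.
nbest = [("wreck a nice beach", -100.0), ("recognize speech", -101.0)]
toy_lm = {"wreck a nice beach": -12.0, "recognize speech": -8.0}

ranked = rescore_nbest(nbest, toy_lm.get)
best_sentence = ranked[0][0]
```

Because only the listed hypotheses can be chosen, the best achievable error rate is bounded by the best hypothesis in each list, which is why the slide quotes a maximum possible improvement.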


Overall Results: Perplexity


Overall Results: Word Accuracy


Conclusions

- Use trigram models
- Use any reasonable smoothing algorithm (Katz, Kneser-Ney)
- Use caching if you have correction information.
- Clustering, sentence mixtures, skipping not usually worth effort.


Tools: CMU Language Modeling Toolkit

- Can handle bigrams, trigrams, more
- Can handle different smoothing schemes
- Many separate tools; output of one tool is input to the next: easy to use
- Free for research purposes
- http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html


Using the CMU LM Tools


Tools: SRI Language Modeling Toolkit

- More powerful than CMU toolkit
- Can handle clusters, lattices, n-best lists, hidden tags
- Free for research use
- http://www.speech.sri.com/projects/srilm


IRSTLM

- (put in the link)
- Looks like it's mostly addressing the problem of really huge LMs

Thanks to Dan Jurafsky for these slides


Reality: The LM is only as good as the data

- Text normalization
  - What about “$3,100,000”? → convert to “Three million one hundred thousand dollars”, etc.
  - Need to do this for dates, numbers, maybe abbreviations.
- Some text-normalization tools come with the Wall Street Journal corpus, from LDC (Linguistic Data Consortium)
- Not much available
- Write your own (use Perl!)
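A small normalizer for whole-dollar amounts might look like the following. It is written in Python rather than Perl purely for illustration, the function names are made up, and it only covers integers below one trillion; cents, dates, and abbreviations would need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(10**9, "billion"), (10**6, "million"), (10**3, "thousand")]

def below_thousand(n):
    """Spell out an integer in the range 1..999 as a list of words."""
    parts = []
    if n >= 100:
        parts += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n:
        parts.append(ONES[n])
    return parts

def number_to_words(n):
    """Spell out a non-negative integer below one trillion."""
    if n == 0:
        return "zero"
    words = []
    for value, name in SCALES:
        if n >= value:
            words += below_thousand(n // value) + [name]
            n %= value
    if n:
        words += below_thousand(n)
    return " ".join(words)

def normalize_dollars(text):
    """Replace amounts like $3,100,000 with their spoken form (whole dollars only)."""
    def spell(match):
        amount = int(match.group(1).replace(",", ""))
        return number_to_words(amount) + " dollars"
    return re.sub(r"\$(\d+(?:,\d{3})*)", spell, text)

spoken = normalize_dollars("a surplus of $3,100,000 in February")
```

For the slide's example this produces "three million one hundred thousand dollars" in place of the digits.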
