
+

CS 136a Lecture 8 Yet More Language Modeling

September 16, 2016, Professor Meteer

Thanks to Dan Jurafsky & Josh Goodman for many of these slides

+ Overview (from Microsoft Tutorial)

n Caching

n Skipping

n Clustering

n Sentence-mixture models

n Structured language models

n Tools

n More on the author, Josh Goodman: http://research.microsoft.com/en-us/um/people/joshuago/icmldescription.htm


+ Caching

n  If you say something, you are likely to say it again later.

P(z | history) = λPsmooth(z | xy) + (1 − λ)Pcache(z | history)

n  Interpolate trigram with cache

n Trigram caches get almost twice the improvement of unigram caches

Pcache(z | history) = C(z ∈ history) / length(history)
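To make the interpolation concrete, here is a minimal Python sketch. `p_smooth` stands in for any smoothed trigram model, and λ is a tuned weight; both names and the weight value are illustrative assumptions, not from the slides.

```python
def p_cache_unigram(z, history):
    """Unigram cache: P_cache(z | history) = C(z in history) / length(history)."""
    return history.count(z) / len(history) if history else 0.0

def p_with_cache(z, x, y, history, p_smooth, lam=0.9):
    """Interpolate a smoothed trigram with the cache:
    P(z | history) = lam * P_smooth(z | x y) + (1 - lam) * P_cache(z | history)."""
    return lam * p_smooth(z, x, y) + (1.0 - lam) * p_cache_unigram(z, history)
```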

+ Caching: Variations

n N-gram caches:

n Conditional n-gram cache: use n-gram cache only if xy ∈ history

n Remove function-words from cache, like “the”, “to”

Pcache(z | history) = C(xyz ∈ history) / C(xy ∈ history)
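A sketch of the conditional trigram cache above, where `history` is a list of the preceding words: the trigram estimate is used only when the bigram xy has occurred in the history, as the slide specifies; the `fallback` model (e.g. the unigram cache from the previous sketch) is an assumed stand-in.

```python
def p_cache_conditional(z, x, y, history, fallback):
    """Conditional trigram cache: use C(xyz in history) / C(xy in history)
    only when the bigram (x, y) occurs in the history; otherwise defer to
    fallback(z, history). Per the slide, function words like "the" or "to"
    can be removed from `history` before calling this."""
    c_xy = sum(1 for i in range(len(history) - 1) if history[i:i + 2] == [x, y])
    if c_xy == 0:
        return fallback(z, history)
    c_xyz = sum(1 for i in range(len(history) - 2) if history[i:i + 3] == [x, y, z])
    return c_xyz / c_xy
```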

+ Skipping

n Capturing phrasal elements

n  Show John a good time → Show XXX a good time

n Standard 5-gram: P(z|…rstuvwxy) ≈ P(z|vwxy)

n Why not P(z|v_xy) – “skipping” n-gram – skips value of 3-back word

n Example: P(time | show John a good) → P(time | show ____ a good)

n P(z | …rstuvwxy) ≈ λP(z|vwxy) + µP(z|vw_y) + (1 − λ − µ)P(z|v_xy)
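A sketch of this three-way interpolation; `p4`, `p_skip_x`, and `p_skip_w` are assumed component models (e.g. separately smoothed n-grams), and the weights are illustrative values that would be tuned on held-out data.

```python
def p_skipping(z, v, w, x, y, p4, p_skip_x, p_skip_w, lam=0.5, mu=0.3):
    """lam * P(z|v w x y) + mu * P(z|v w _ y) + (1 - lam - mu) * P(z|v _ x y).
    p_skip_x ignores the 2-back word x; p_skip_w ignores the 3-back word w."""
    return (lam * p4(z, v, w, x, y)
            + mu * p_skip_x(z, v, w, y)
            + (1.0 - lam - mu) * p_skip_w(z, v, x, y))
```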


+ Clustering

n CLUSTERING = CLASSES (same thing)

n What is P(Tuesday | party on)?

n Similar to P(Monday | party on)

n Similar to P(Tuesday | celebration on)

n Put words in clusters:

n  WEEKDAY = Sunday, Monday, Tuesday, …

n  EVENT = party, celebration, birthday, …


+ Clustering overview

n Major topic, useful in many fields

n Kinds of clustering

n  Predictive clustering

n  Conditional clustering

n  IBM-style clustering

n How to get clusters

n  Be clever or it takes forever!


+ Predictive clustering

n Let “z” be a word, “Z” be its cluster

n One cluster per word: hard clustering

n  WEEKDAY = Sunday, Monday, Tuesday, …

n  MONTH = January, February, April, May, June, …

n P(z|xy) = P(Z|xy) × P(z|xyZ)

n P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)

n Psmooth(z|xy) ≈ Psmooth (Z|xy) × Psmooth (z|xyZ)
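In code, the decomposition is a single product. A minimal sketch, assuming a hard word-to-cluster map and two separately smoothed component models (`p_class` and `p_word` are illustrative names, not from the slides):

```python
def p_predictive(z, x, y, word2class, p_class, p_word):
    """Predictive clustering: P(z | x y) ~ P(Z | x y) * P(z | x y Z).
    E.g. P(Tuesday | party on) ~ P(WEEKDAY | party on)
                               * P(Tuesday | party on WEEKDAY)."""
    Z = word2class[z]  # hard clustering: one cluster per word
    return p_class(Z, x, y) * p_word(z, x, y, Z)
```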


+ Predictive clustering example

n Find P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY)

n Counts: C(party on Tuesday) = 0, C(party on Wednesday) = 10, C(arriving on Tuesday) = 10, C(on Tuesday) = 100

Psmooth (WEEKDAY | party on) is high

Psmooth (Tuesday | party on WEEKDAY) backs off to Psmooth (Tuesday | on WEEKDAY)


+ Conditional clustering

P(z|xy) = P(z|xXyY)

P(Tuesday | party on) =

P(Tuesday | party EVENT on PREPOSITION)

Psmooth(z|xy) ≈ Psmooth(z|xXyY), e.g.:

λPML(Tuesday | party EVENT on PREPOSITION) + µPML(Tuesday | EVENT on PREPOSITION) + δPML(Tuesday | on PREPOSITION) + γPML(Tuesday | PREPOSITION) + (1 − λ − µ − δ − γ) PML(Tuesday)


Condition on classes in the context
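A sketch of the conditional-clustering interpolation above. `p_ml` is an assumed maximum-likelihood estimator taking a word and a context tuple, and `lams` holds the five interpolation weights, which should sum to one:

```python
def p_conditional(z, x, y, word2class, p_ml, lams):
    """Interpolate ML estimates over progressively shorter class-augmented
    contexts: (x X y Y), (X y Y), (y Y), (Y), ()."""
    X, Y = word2class[x], word2class[y]
    contexts = [(x, X, y, Y), (X, y, Y), (y, Y), (Y,), ()]
    return sum(l * p_ml(z, ctx) for l, ctx in zip(lams, contexts))
```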

+ Conditional clustering example

λP (Tuesday | party EVENT on PREPOSITION) +

µ P(Tuesday | EVENT on PREPOSITION) +

δP(Tuesday | on PREPOSITION) +

γP(Tuesday | PREPOSITION) +

(1 − λ − µ − δ − γ) P(Tuesday)


Eliminating redundancy

λP(Tuesday | party on) + µP(Tuesday | EVENT on) + δP(Tuesday | on) + γP(Tuesday | PREPOSITION) + (1 − λ − µ − δ − γ) P(Tuesday)

+ Combined clustering

n P(z|xy) ≈ Psmooth(Z|xXyY) × Psmooth(z|xXyYZ)

P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party EVENT on PREPOSITION) × Psmooth(Tuesday | party EVENT on PREPOSITION WEEKDAY)

n Much larger than unclustered, somewhat lower perplexity.


+ IBM Clustering

n P(z|xy) ≈ Psmooth(Z|XY) × P(z|Z)

P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)

n Small, very smooth, mediocre perplexity

P(z|xy) ≈ λPsmooth(z|xy) + (1 − λ)Psmooth(Z|XY) × P(z|Z)

n Bigger, better than no clusters, better than combined clustering.

n  Improvement: use P(z|XYZ) instead of P(z|Z)
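A sketch of the interpolated form; all component models, the cluster map, and λ are assumed stand-ins for separately trained and tuned pieces:

```python
def p_ibm(z, x, y, word2class, p_word_trigram, p_class_trigram, p_emit, lam=0.7):
    """IBM clustering interpolated with a word trigram:
    P(z | x y) ~ lam * P_smooth(z | x y)
               + (1 - lam) * P_smooth(Z | X Y) * P(z | Z)."""
    X, Y, Z = word2class[x], word2class[y], word2class[z]
    return (lam * p_word_trigram(z, x, y)
            + (1.0 - lam) * p_class_trigram(Z, X, Y) * p_emit(z, Z))
```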


+ Clustering by Position

n “A” and “AN”: same cluster or different cluster?

n Same cluster for predictive clustering

n Different clusters for conditional clustering

n Small improvement by using different clusters for conditional and predictive


+ Clustering: how to get them

n Build them by hand

n  Works ok when almost no data

n Part of Speech (POS) tags

n  Tends not to work as well as automatic

n Automatic Clustering

n  Swap words between clusters to minimize perplexity


+ Clustering: automatic

n Minimize perplexity of P(z|Y)

n  Mathematical tricks speed it up

n  Use top-down splitting, not bottom-up merging!
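To show what "minimize perplexity of P(z|Y)" means operationally, here is a deliberately naive sketch of the "swap words between clusters" idea from the previous slide. It recomputes the class-bigram log-likelihood from scratch for every tentative move, which is exactly the "takes forever" version; real implementations use incremental count updates (and, per this slide, top-down splitting).

```python
import math
from collections import Counter

def class_ll(bigrams, w2c):
    """Log-likelihood of bigram counts under P(z | Y) = C(Y, z) / C(Y),
    where Y is the class of the preceding word y."""
    c_yz, c_y = Counter(), Counter()
    for (y, z), n in bigrams.items():
        c_yz[(w2c[y], z)] += n
        c_y[w2c[y]] += n
    return sum(n * math.log(c_yz[(w2c[y], z)] / c_y[w2c[y]])
               for (y, z), n in bigrams.items())

def exchange_cluster(bigrams, w2c, classes, iters=3):
    """Greedily move each word to the class that maximizes the likelihood,
    re-scoring naively on every candidate move (toy version only)."""
    for _ in range(iters):
        for w in list(w2c):
            w2c[w] = max(classes, key=lambda c: class_ll(bigrams, {**w2c, w: c}))
    return w2c
```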


+ Two actual WSJ classes

n  MONDAYS, FRIDAYS, THURSDAY, MONDAY, EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY, TENTERHOOKS, TUESDAY, SUNDAY, CONDITION

n  PARTY, FESCO, CULT, NILSON, PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN, DEPARTMENT, PENH, GUILD


+ Sentence Mixture Models

n Lots of different sentence types:

n  Numbers (The Dow rose one hundred seventy three points)

n  Quotations (Officials said “quote we deny all wrong doing ”quote)

n  Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)

n Model each sentence type separately


+ Sentence Mixture Models

n Roll a die to pick sentence type sk, with probability λk

n Probability of the sentence given sk:

∏i=1..n P(wi | wi−2 wi−1 sk)

n Probability of the sentence across types:

∑k=1..m λk ∏i=1..n P(wi | wi−2 wi−1 sk)

+ Sentence Model Smoothing

n Each topic model is smoothed with overall model.

n Sentence mixture model is smoothed with overall model (sentence type 0).


∑k=0..m λk ∏i=1..n [ µP(wi | wi−2 wi−1 sk) + (1 − µ) P(wi | wi−2 wi−1) ]
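A sketch of this smoothed mixture computation. `p_topic(z, x, y, k)` and `p_global(z, x, y)` are assumed component trigrams (with k = 0 playing the role of the overall model, per the slide), and the log-sum-exp at the end is added for numerical stability; it is not on the slide.

```python
import math

def sentence_mixture_logprob(words, lams, p_topic, p_global, mu=0.8):
    """log P(w_1..w_n) under the smoothed sentence mixture model:
    sum over types k of lam_k * prod_i [ mu * P(w_i | w_{i-2} w_{i-1} s_k)
                                       + (1 - mu) * P(w_i | w_{i-2} w_{i-1}) ].
    lams must be positive and sum to one."""
    per_type = []
    for k, lam in enumerate(lams):
        lp = math.log(lam)
        for i, z in enumerate(words):
            x = words[i - 2] if i >= 2 else "<s>"
            y = words[i - 1] if i >= 1 else "<s>"
            lp += math.log(mu * p_topic(z, x, y, k) + (1.0 - mu) * p_global(z, x, y))
        per_type.append(lp)
    top = max(per_type)  # log-sum-exp over sentence types
    return top + math.log(sum(math.exp(v - top) for v in per_type))
```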

+ Sentence Mixture Results

[Figure: perplexity vs. log2 number of sentence types (0–7), sentence mixture model vs. baseline, 10,000,000 words of training data; annotated with a 13% perplexity reduction.]


+ Sentence Clustering

n Same algorithm as word clustering

n Assign each sentence to a type, sk

n Minimize perplexity of P(z|sk ) instead of P(z|Y)


+ Topic Examples - 0 (Mergers and acquisitions)

n  JOHN BLAIR & COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD

n  INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD

n  JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD

n  JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD

n  MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD


+ Topic Examples - 1 (production, promotions, commas)

n  MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD

n  BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD

n  JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD

n  MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD


+ Topic Examples - 2 (Numbers)

n  SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD

n  THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD

n  COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD

n  INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD


+ Topic Examples – 3 (quotations)

n  NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD

n  THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD

n  THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD

n  THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD

n  BUT THE COMPANY WOULDN'T ELABORATE .PERIOD


+ Structured Language Model

[Figure: parse structure over the sentence “The contract ended with a loss of 7 cents after”]


+ How to get structured data?

n Use a Treebank (a collection of sentences with hand-annotated structure), like the Wall Street Journal portion of the Penn Treebank.

n Problem: need a treebank.

n Or – use a treebank (WSJ) to train a parser; then parse new training data (e.g. Broadcast News)

n Re-estimate parameters to get lower perplexity models.


+ Structured Language Models

n Use structure of language to detect long distance information

n Promising results

n But: time-consuming; language is mostly right-branching; 5-grams and skipping capture similar information.


+ Some Experiments

n Goodman re-implemented all techniques

n Trained on 260,000,000 words of WSJ

n Optimize parameters on heldout set

n Test on separate test section

n Some combinations extremely time-consuming (days of CPU time)

n  Don’t try this at home, or in anything you want to ship

n Rescored N-best lists to get results

n  Maximum possible improvement: from 10% absolute word error rate to 5%


+ Overall Results: Perplexity

+ Overall Results: Word Accuracy


+ Conclusions

n Use trigram models

n Use any reasonable smoothing algorithm (Katz, Kneser-Ney)

n Use caching if you have correction information.

n Clustering, sentence mixtures, and skipping are not usually worth the effort.


+ Tools: CMU Language Modeling Toolkit

n Can handle bigrams, trigrams, and more

n Can handle different smoothing schemes

n Many separate tools – output of one tool is input to next: easy to use

n Free for research purposes

n http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html


+ Using the CMU LM Tools

+ Tools: SRI Language Modeling Toolkit

n More powerful than CMU toolkit

n Can handle clusters, lattices, n-best lists, hidden tags

n Free for research use

n http://www.speech.sri.com/projects/srilm


+ IRSTLM

n (put in the link)

n Looks like it’s mostly addressing the problem of really huge LMs

Thanks to Dan Jurafsky for these slides

+ Reality: The LM is only as good as the data

n Text normalization

n  What about “$3,100,000”? Convert to “Three million one hundred thousand dollars”, etc.

n  Need to do this for dates, numbers, maybe abbreviations.

n Some text-normalization tools come with the Wall Street Journal corpus from the LDC (Linguistic Data Consortium)

n Not much available

n Write your own (use Perl!)
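The slide recommends Perl; purely for consistency with the other sketches in these notes, here is a minimal Python version of the dollar-amount case. The regex and the integers-only coverage are illustrative assumptions; real text normalization also needs dates, ordinals, abbreviations, and much more.

```python
import re

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def num_to_words(n):
    """Spell out a non-negative integer,
    e.g. 3100000 -> 'three million one hundred thousand'."""
    if n == 0:
        return "zero"
    parts = []
    for value, name in ((10**9, "billion"), (10**6, "million"), (10**3, "thousand")):
        if n >= value:
            parts.append(num_to_words(n // value) + " " + name)
            n %= value
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10]))
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)

def normalize_dollars(text):
    """Rewrite '$3,100,000' as 'three million one hundred thousand dollars'."""
    return re.sub(r"\$(\d[\d,]*)",
                  lambda m: num_to_words(int(m.group(1).replace(",", ""))) + " dollars",
                  text)
```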
