Model Combination for Machine Translation

John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Josef Och

Monday, June 7, 2010

Motivation

A statistical machine translation model scores derivations:

\theta \cdot \phi(d) = \theta \cdot \Big( \sum_{w \in \text{n-grams}(d)} \phi_{\text{lm}}(w) + \sum_{r \in \text{rules}(d)} \phi_{\text{tm}}(r) \Big)

The (log) model score sums language model and translation model features.

We can improve over max-derivation decoding:

• Consensus decoding (e.g., minimum Bayes risk)
• System combination (e.g., confusion networks)

In this work, we develop a technique that integrates both.
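
The additive decomposition above is easy to mirror in code. A minimal sketch, assuming toy dictionaries of per-n-gram language-model features and per-rule translation-model features; the feature names, rules, and values are invented for illustration, not taken from the talk:

```python
# Sketch of the additive model score theta · phi(d): phi(d) sums language-model
# features over the derivation's n-grams and translation-model features over
# its rules. All feature names and numbers below are illustrative.
from collections import Counter

def model_score(theta, ngram_feats, rule_feats, ngrams, rules):
    """Return theta · phi(d) for a derivation d with the given n-grams and rules."""
    phi = Counter()
    for g in ngrams:                      # phi_lm(w), summed over n-grams of d
        phi.update(ngram_feats.get(g, {}))
    for r in rules:                       # phi_tm(r), summed over rules of d
        phi.update(rule_feats.get(r, {}))
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())

# Toy weights and features (hypothetical).
theta = {"lm_logprob": 0.5, "tm_logprob": 1.0, "rule_count": -0.1}
ngram_feats = {("the", "dog"): {"lm_logprob": -1.2},
               ("dog", "ate"): {"lm_logprob": -2.0}}
rule_feats = {"el perro -> the dog": {"tm_logprob": -0.5, "rule_count": 1},
              "comi -> ate": {"tm_logprob": -0.8, "rule_count": 1}}

print(model_score(theta, ngram_feats, rule_feats,
                  ngrams=[("the", "dog"), ("dog", "ate")],
                  rules=["el perro -> the dog", "comi -> ate"]))
```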

Consensus Decoding

Derivation scores can be interpreted as probabilities:

P(d \mid f) = \frac{\exp(\theta \cdot \phi(d))}{\sum_{d'} \exp(\theta \cdot \phi(d'))}

We can query this distribution for:

Whole translations [Blunsom et al., ’08], where σ_e(d) is the translation yielded by derivation d:

e_{\text{max-trans}} = \arg\max_{e} \sum_{d : \sigma_e(d) = e} P(d \mid f)

N-gram overlap [Kumar and Byrne, ’04], with the expectation taken under P(d|f):

e_{\text{mbr}} = \arg\max_{e} \mathbb{E}\left[\text{BLEU}(e; \sigma_e(d))\right]

(Free BLEU points!)
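
A small sketch of both queries, hedged: the talk works over full translation forests, while here an n-best list of (score, yield) pairs stands in, and a crude n-gram overlap stands in for BLEU. All scores are invented:

```python
# Illustrative n-best sketch of the two consensus queries.
import math
from collections import Counter, defaultdict

def posteriors(scored):
    """P(d|f) = exp(theta·phi(d)) / sum_d' exp(theta·phi(d'))."""
    m = max(s for s, _ in scored)
    exps = [math.exp(s - m) for s, _ in scored]
    z = sum(exps)
    return [x / z for x in exps]

def ngram_sim(hyp, ref, max_n=2):
    """Toy similarity: clipped n-gram overlap fraction (a BLEU stand-in)."""
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        matches += sum(min(c, r[g]) for g, c in h.items())
        total += max(sum(h.values()), 1)
    return matches / total

nbest = [(1.2, "the dog ate my homework"),
         (1.0, "the dog ate homework"),
         (0.9, "the dog ate my homework"),   # same yield, different derivation
         (0.2, "dog ate my homework")]
probs = posteriors(nbest)

# Max-translation: sum posterior mass over derivations sharing a yield.
mass = defaultdict(float)
for (_, e), p in zip(nbest, probs):
    mass[e] += p
print("max-translation:", max(mass, key=mass.get))

# MBR: pick the yield maximizing expected similarity under P(d|f).
def expected_sim(e):
    return sum(p * ngram_sim(e.split(), e2.split()) for (_, e2), p in zip(nbest, probs))
print("MBR:", max(mass, key=expected_sim))
```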

System Combination

We often have multiple translation systems.

Source: el perro comí mi tarea

System 1: the dog bit my work
System 2: the dog ate homework
System 3: dog ate my homework

Combiners assume little about the systems.
Objectives are similar to consensus decoding.

(More BLEU points!)

Model Combination

Consensus decoding with multiple models
A distribution-driven approach to system combination
Unifies consensus and combination objectives

[Diagram: a source sentence f is decoded by systems 1–3 into posteriors P1(d|f), P2(d|f), P3(d|f); per-system consensus decoding yields e*1, e*2, e*3, and system combination merges them into a final output e.]

Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(With new algorithms and results)

Forest-Based Consensus Decoding (Review)

1. Build a forest that encodes the model posterior P(d|f)
2. Compute n-gram statistics from the posterior
3. Optimize a consensus objective using these statistics

[Example forest for "Yo vi al hombre con el telescopio": one node yields "I saw the man with {a,the} telescope", another the reordering "I saw with {a,the} telescope the man"; each hyperedge carries a posterior P(edge|f), from which n-gram posteriors such as P("telescope the" | f) follow.]
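
A minimal sketch of the bookkeeping behind step 1, on a toy forest whose structure and numbers are invented (the real forests come straight out of the decoders): an inside pass gives the normalizer, an outside pass gives derivation contexts, and together they yield the edge posteriors P(edge|f) from which n-gram statistics are read off.

```python
from collections import defaultdict

# Toy translation forest: each hyperedge is (head, tails, weight), with
# weight = exp(theta · phi(edge)). Edges are listed so that tail nodes are
# built before their heads (leaves first, root last).
edges = [
    ("I_saw",     (), 1.0),
    ("man_scope", (), 0.6),                    # "... with a telescope"
    ("man_scope", (), 0.4),                    # "... with the telescope"
    ("root", ("I_saw", "man_scope"), 0.2),     # "I saw the man with ..."
    ("root", ("man_scope", "I_saw"), 0.1),     # reordered derivation
]
root = "root"

# Inside pass: total weight of all derivations rooted at each node.
inside = defaultdict(float)
for head, tails, w in edges:
    prod = w
    for t in tails:
        prod *= inside[t]
    inside[head] += prod

# Outside pass: total weight of the contexts above each node.
outside = defaultdict(float)
outside[root] = 1.0
for head, tails, w in reversed(edges):
    for i, t in enumerate(tails):
        prod = outside[head] * w
        for j, u in enumerate(tails):
            if j != i:
                prod *= inside[u]
        outside[t] += prod

# Edge posteriors: P(edge|f) = outside(head) * weight * prod(inside(tails)) / Z.
Z = inside[root]
for head, tails, w in edges:
    prod = outside[head] * w
    for t in tails:
        prod *= inside[t]
    print(head, tails, round(prod / Z, 3))
```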

Types of Efficient Consensus Techniques (Review)

Techniques vary along two axes: whether the consensus objective is fixed or learned, and whether the n-gram statistics are posteriors or expected counts.

• Lattice Minimum Bayes-Risk Decoding [Tromble et al., ’08]
• Fast Consensus Decoding [DeNero et al., ’09]
• Minimum Bayes-Risk Decoding for Hypergraphs [Kumar et al., ’09]
• Variational Decoding for Machine Translation [Li et al., ’09]

Learned Objectives are Better than Fixed (Review)

C(d) = \sum_{n=1}^{4} w^{(n)} \sum_{g \in \text{n-grams}} c(g, d) \cdot P(g \mid f) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} \, \theta \cdot \phi(d)

The first term collects n-gram statistics, the second scores length, and the third is the base model score; the w's are the objective parameters.

Fixed: choose w such that C_w(d) ≈ E[BLEU(d)]
Learned: choose w to maximize BLEU(argmax_d C_w(d); e), where e are the reference translations [Kumar et al., ’09]

[Bar chart: consensus decoding versus max-derivation decoding across 39 pairs, counting pairs whose test-set BLEU increased or decreased by at least 0.2. With the fixed objective the gains and losses are split 18/12 over the 39 pairs; with the learned objective 26 pairs improve and only 5 degrade.]
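
A sketch of scoring a candidate with C_w(d), assuming toy n-gram posteriors P(g|f) and made-up weights w; in the talk the argmax is taken over a forest of derivations, while a two-candidate list stands in here:

```python
# Sketch of the consensus objective C(d): weighted n-gram posteriors, a length
# term, and the base model score. All weights and posteriors are illustrative.
def consensus_objective(ngrams, length, base_score, posterior, w_ngram, w_len, w_base):
    """C(d) = sum_n w^(n) sum_{g in n-grams(d)} c(g,d) P(g|f)
              + w^(len) |sigma_e(d)| + w^(b) theta·phi(d)."""
    score = w_len * length + w_base * base_score
    for g, count in ngrams.items():            # c(g, d) counts
        score += w_ngram.get(len(g), 0.0) * count * posterior.get(g, 0.0)
    return score

def ngram_counts(sentence, max_n=2):
    toks = sentence.split()
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            g = tuple(toks[i:i+n])
            counts[g] = counts.get(g, 0) + 1
    return counts

# Toy n-gram posteriors P(g|f) and two candidates with base scores theta·phi(d).
posterior = {("the",): 0.9, ("dog",): 0.95, ("ate",): 0.8,
             ("the", "dog"): 0.85, ("dog", "ate"): 0.7, ("bit",): 0.1}
w_ngram = {1: 0.2, 2: 0.6}                     # hypothetical objective parameters
candidates = {"the dog ate": -2.0, "the dog bit": -1.8}

best = max(candidates, key=lambda e: consensus_objective(
    ngram_counts(e), len(e.split()), candidates[e], posterior, w_ngram, 0.05, 0.1))
print("argmax_d C_w(d):", best)
```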

N-gram Posteriors can also be Computed Quickly

P(g \mid f) = 1 - \bar{P}(g \mid f), where \bar{P}(g \mid f) is the posterior mass of derivations that do not contain the n-gram g.

[Forest fragment: a hyperedge with score e^{θ·φ(edge)} = 0.2 combines "I saw" (inside score 0.1) with "the man with {a,the} telescope" (inside score 0.5) and introduces the bigram "saw the". Each node stores, for each n-gram, the inside mass of derivations that do not contain it: the children carry "with the": 0.1 and 0.2, so the parent has total inside 0.1 × 0.5 × 0.2 = 0.01, "with the": 0.1 × 0.2 × 0.2 = 0.004, and "saw the": 0, since every derivation through this edge contains "saw the".]

Audience challenge: What semiring computes n-gram posteriors?
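
The identity is easy to check on an explicitly enumerated toy posterior. The talk instead accumulates the "does not contain g" mass with an inside-style pass over the forest, which is what makes it fast; the probabilities below are invented:

```python
# Illustrates P(g|f) = 1 - Pbar(g|f), where Pbar is the posterior mass of
# derivations that do NOT contain n-gram g. Enumerating derivations is only
# feasible for toy examples; the forest algorithm avoids it.
def ngrams(tokens, n):
    return {tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)}

# (posterior probability, yield) pairs; probabilities sum to 1.
derivations = [(0.5, "I saw the man with a telescope"),
               (0.3, "I saw the man with the telescope"),
               (0.2, "I saw with the telescope the man")]

def ngram_posterior(g):
    mass_without_g = sum(p for p, e in derivations
                         if g not in ngrams(e.split(), len(g)))
    return 1.0 - mass_without_g

print(ngram_posterior(("with", "the")))       # 0.5: in the last two derivations
print(ngram_posterior(("telescope", "the")))  # 0.2: only in the reordered one
```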

Results for Learned Consensus Decoding

Constrained data track of the 2008 NIST MT task

Baseline: max-derivation decoding. Learned consensus decoding techniques: posterior probabilities and expected counts.

[Bar charts: BLEU for Phrase-Based, Hiero, and SAMT systems under the max-derivation baseline and the two learned consensus variants. Arabic-to-English scores fall between 43.3 and 44.6; Chinese-to-English scores fall between 25.4 and 28.8.]

Outline

Consensus decoding review
Our model combination technique (all in one slide!)
Comparison to system combination

Extending Consensus Decoding to Multiple Models

1. Build posterior forests: each system (e.g., a phrase-based and a Hiero-style model) decodes f into a forest encoding P1(d|f), P2(d|f), ...
2. Compute n-gram statistics from the forests
3. Union all forests
4. Optimize the multi-model consensus objective

C(d) = \sum_{i=1}^{I} \Big( \sum_{n=1}^{4} w_i^{(n)} v_i^{(n)}(d) + w_i^{(\alpha)} \alpha_i(d) \Big) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

The outer sum runs over models: v_i^{(n)}(d) = \sum_{g \in \text{n-grams}} c(g, d) \cdot P_i(g \mid f) are the n-gram statistics under model i, α_i(d) is the model-choice feature, |σ_e(d)| is the length term, and b(d) is the base model score.

Model choice: an indicator feature for the system that originally generated d
Base model: the model score of d under the system that generated it (its original features)
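
A sketch of the combined objective over a pooled candidate list, with two hypothetical models' n-gram posteriors, per-model weights w_i, and the model-choice indicator α_i(d); everything numeric is invented, and in the talk the search runs over the union forest rather than a flat list:

```python
# Sketch of the multi-model consensus objective. Each model i contributes its
# own n-gram posteriors P_i(g|f) and weights w_i; alpha_i(d) marks the system
# that produced the candidate, and b(d) is that system's score. Toy values.
def multi_model_objective(cand, model_posteriors, weights, w_len, w_base):
    toks = cand["yield"].split()
    grams = {}
    for n in (1, 2):
        for i in range(len(toks) - n + 1):
            g = tuple(toks[i:i+n])
            grams[g] = grams.get(g, 0) + 1
    score = w_len * len(toks) + w_base * cand["base_score"]
    for i, post in enumerate(model_posteriors):          # sum over models i
        w = weights[i]
        score += w["alpha"] * (1.0 if cand["origin"] == i else 0.0)  # model choice
        for g, c in grams.items():                       # w_i^(n) v_i^(n)(d)
            score += w["ngram"][len(g)] * c * post.get(g, 0.0)
    return score

# Two models' n-gram posteriors over the union of their forests (toy values).
model_posteriors = [{("the",): 0.9, ("dog",): 0.9, ("the", "dog"): 0.8, ("ate",): 0.7},
                    {("the",): 0.8, ("dog",): 0.95, ("the", "dog"): 0.7, ("bit",): 0.4}]
weights = [{"ngram": {1: 0.2, 2: 0.5}, "alpha": 0.3},
           {"ngram": {1: 0.1, 2: 0.4}, "alpha": 0.2}]

candidates = [{"yield": "the dog ate", "origin": 0, "base_score": -2.0},
              {"yield": "the dog bit", "origin": 1, "base_score": -1.5}]
best = max(candidates, key=lambda c: multi_model_objective(
    c, model_posteriors, weights, w_len=0.05, w_base=0.1))
print(best["yield"])
```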

Properties of Model Combination

(Using the multi-model objective C(d) above)

Reduces to consensus decoding when we have only one model
A linear model: w can be tuned to maximize output performance
No concept of a primary system
Every possible output was a derivation under some original model

Model Combination Experimental Results

Compared three in-house Google systems
Constrained data track of the 2008 NIST task
Parameters tuned on the NIST 2004 eval set

[Bar charts of max-derivation and consensus BLEU for the Phrase, Hiero, and SAMT systems, plus a single Model Combination bar. Arabic-to-English: model combination reaches 45.3, +1.4 over the best max-derivation system (43.9). Chinese-to-English: model combination reaches 29.0, a +0.6 gain.]

Sources of Improvement: Arabic-to-English

Do we need model choice features?
Should n-gram statistics be model specific?
("Union" merges the phrase-based and Hiero-style forests.)

Best max-derivation system: 43.9
Best single system with consensus decoding: 44.5
Union + consensus decoding: 44.5
Union + consensus + model choice features: 44.9
Union with model-specific n-gram statistics: 45.1
Model-specific & union n-gram statistics: 45.3

Outline

Consensus decoding review
Our model combination technique
Comparison to system combination (the final showdown)

System Combination Baselines

Two established system combination methods [Macherey & Och, ’07]

Sentence-level: choose among the systems' outputs using an MBR objective

e = \arg\max_{e \in \{e^*_1, \ldots, e^*_k\}} \mathbb{E}[\text{BLEU}(e)]

Word-level: confusion network approach
• All e*_i are aligned to a backbone e_b ∈ {e*_1, ..., e*_k}
• Sentences + alignments form a confusion network
• The output maximizes a consensus objective

[Diagram: each system i decodes f into P_i(d|f); consensus decoding yields e*_1, e*_2, e*_3, which system combination merges into a final e.]
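
A sketch of the sentence-level baseline on the toy outputs from the earlier example, using a crude n-gram overlap as a stand-in for the BLEU gain and weighting all systems uniformly; the actual baseline of Macherey & Och uses BLEU, and these outputs and numbers are only illustrative:

```python
# Sentence-level combination: choose among the systems' outputs e*_1..e*_k by
# an MBR-style objective, here expected n-gram overlap against the others.
from collections import Counter

def overlap(hyp, ref, max_n=4):
    """Toy gain function standing in for sentence-level BLEU."""
    h_toks, r_toks = hyp.split(), ref.split()
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h = Counter(tuple(h_toks[i:i+n]) for i in range(len(h_toks) - n + 1))
        r = Counter(tuple(r_toks[i:i+n]) for i in range(len(r_toks) - n + 1))
        matches += sum(min(c, r[g]) for g, c in h.items())
        total += max(sum(h.values()), 1)
    return matches / total

outputs = ["the dog bit my work",        # e*_1
           "the dog ate homework",       # e*_2
           "dog ate my homework"]        # e*_3

def expected_gain(e):
    return sum(overlap(e, e_prime) for e_prime in outputs) / len(outputs)

print(max(outputs, key=expected_gain))
```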

Model versus System Combination

Qualitative
• Single-system n-gram statistics are required in both methods
• Model combination searches only once under a consensus objective
• The inter-hypothesis alignment problem exists only in system combination

Quantitative (BLEU)
Arabic-to-English: best single-system consensus 44.5; sentence-level system combination 44.6; word-level system combination 44.7; model combination 45.3
Chinese-to-English: best single-system consensus 28.8; sentence-level 28.8; word-level 28.8; model combination 29.0

Conclusion

Consensus decoding with a learned objective extends naturally to multiple models.

Model combination provides consensus and combination effects in a unified, distribution-driven objective.

It outperforms a pipeline of consensus decoding followed by system combination, using less total computation.

It's easy, it's clean, and it works. Thanks!