Model Combination for Machine Translation

John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Josef Och

Monday, June 7, 2010


Motivation

A statistical machine translation model scores derivations

$$\theta \cdot \phi(d) = \theta \cdot \Big[ \sum_{w \in \text{n-grams}(d)} \phi_{lm}(w) + \sum_{r \in \text{rules}(d)} \phi_{tm}(r) \Big]$$

(log) model score sums language model and translation model

We can improve over max-derivation decoding

• Consensus decoding (e.g., minimum Bayes risk)

• System combination (e.g., confusion networks)

In this work, we develop a technique that integrates both
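The scoring formula on the Motivation slide can be illustrated with a minimal sketch. The feature names ("lm", "tm"), the weights, and the per-n-gram and per-rule feature values below are hypothetical placeholders, not values from the talk; the only point is that the derivation's feature vector is a sum of local feature vectors, so the score is a dot product with that sum.

```python
from collections import defaultdict

def model_score(theta, lm_feats, tm_feats):
    """Return theta . phi(d), where phi(d) sums the language model features
    of each n-gram and the translation model features of each rule."""
    phi = defaultdict(float)
    for feats in list(lm_feats) + list(tm_feats):
        for name, value in feats.items():
            phi[name] += value
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())

# Hypothetical weights and local features for one derivation d.
theta = {"lm": 1.0, "tm": 0.5}
lm_feats = [{"lm": -1.0}, {"lm": -2.0}]  # one feature dict per n-gram in d
tm_feats = [{"tm": -0.5}, {"tm": -1.5}]  # one feature dict per rule in d

score = model_score(theta, lm_feats, tm_feats)
```

Max-derivation decoding would simply pick the derivation with the highest such score.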

Consensus Decoding

Derivation scores can be interpreted as probabilities

$$P(d|f) = \frac{\exp(\theta \cdot \phi(d))}{\sum_{d'} \exp(\theta \cdot \phi(d'))}$$

We can query this distribution for:

Whole translations [Blunsom et al., ’08]:

$$e_{\text{max-trans}} = \arg\max_e \sum_{d : \sigma_e(d) = e} P(d|f)$$

N-gram overlap [Kumar and Byrne, ’04]:

$$e_{\text{mbr}} = \arg\max_e \mathbb{E}\left[\mathrm{bleu}(e; \sigma_e(d))\right]$$

Free BLEU Points
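A toy sketch of the three decision rules on this slide, over an explicit list of derivations rather than a full search space. The derivation scores are made up, and `overlap_gain` is a crude clipped n-gram match count standing in for sentence-level BLEU (an assumption for illustration, not the metric used in the talk).

```python
import math
from collections import Counter, defaultdict

def posterior(scores):
    """Softmax over derivation scores theta . phi(d)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def ngrams(words, n_max=2):
    """Count all n-grams of order 1..n_max."""
    return Counter(
        tuple(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    )

def overlap_gain(hyp, ref):
    """Clipped n-gram matches: a stand-in for bleu(hyp; ref)."""
    h, r = ngrams(hyp.split()), ngrams(ref.split())
    return sum((h & r).values())

# Hypothetical (translation, score) pairs; two derivations share a translation.
derivations = [
    ("the dog bit my work", 1.2),
    ("the dog ate my homework", 1.0),
    ("the dog ate my homework", 0.9),
    ("dog ate homework", 0.5),
]
probs = posterior([score for _, score in derivations])

# Max-derivation: the single best-scoring derivation.
e_max_deriv = derivations[max(range(len(probs)), key=probs.__getitem__)][0]

# Max-translation: sum P(d|f) over derivations that yield the same string.
trans_prob = defaultdict(float)
for (e, _), p in zip(derivations, probs):
    trans_prob[e] += p
e_max_trans = max(trans_prob, key=trans_prob.get)

# MBR: maximize expected gain under the posterior.
def expected_gain(e):
    return sum(p * overlap_gain(e, other) for other, p in trans_prob.items())

e_mbr = max(trans_prob, key=expected_gain)
```

In this toy case max-derivation picks "the dog bit my work", while both consensus rules pick "the dog ate my homework", because its probability mass is split across two derivations.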

System Combination

We often have multiple translation systems

Source: el perro comí mi tarea

Systems 1, 2, and 3 produce:

the dog bit my work

the dog ate homework

dog ate my homework

Combiners assume little about systems

Objectives similar to consensus decoding

More BLEU Points

Model Combination

Consensus decoding with multiple models

Distribution-driven approach to system combination

Unifies consensus and combination objectives

[Diagram: source f → systems 1, 2, 3 → posteriors P1(d|f), P2(d|f), P3(d|f) → Consensus (one per system) → e*1, e*2, e*3 → System combo → e]

Outline

Consensus decoding review

Our model combination technique

Comparison to system combination

With new algorithms and results

Forest-Based Consensus Decoding (Review)

1. Build a forest that encodes the model posterior P(d|f)

2. Compute n-gram statistics from the posterior

3. Optimize a consensus objective using these statistics

[Forest figure for “Yo vi al hombre con el telescopio”: nodes labeled “I ... telescope” and “I ... man”; translations “I saw the man with {a,the} telescope” and “I saw with {a,the} telescope the man”; an edge with posterior P(edge|f) that introduces the n-gram “telescope the”]
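Step 2 can be sketched on a tiny explicit distribution. This is a simplification: the talk computes these statistics over a forest, whereas the sketch below enumerates three translations with made-up posterior probabilities. It shows the difference between the two kinds of n-gram statistics: the posterior P(g|f) is the probability that g appears at all, while the expected count weights each occurrence.

```python
from collections import Counter, defaultdict

def ngram_counts(sentence, n_max=4):
    """Count every n-gram of order 1..n_max in a sentence."""
    words = sentence.split()
    return Counter(
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    )

def ngram_statistics(weighted_translations):
    """Return (posteriors, expected counts) for all n-grams."""
    post = defaultdict(float)
    expected = defaultdict(float)
    for sentence, p in weighted_translations:
        for g, c in ngram_counts(sentence).items():
            post[g] += p          # g occurs at least once
            expected[g] += p * c  # g occurs c times
    return post, expected

# Hypothetical posterior over three translations of the example sentence.
weighted = [
    ("I saw the man with a telescope", 0.5),
    ("I saw the man with the telescope", 0.3),
    ("I saw with the telescope the man", 0.2),
]
post, expected = ngram_statistics(weighted)
```

Here "the" has posterior 1.0 but expected count 1.5, and "telescope the" has posterior 0.2 because only the reordered translation contains it.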

Types of Efficient Consensus Techniques (Review)

Grid axes: consensus objective (learned or fixed) and n-gram statistics (posteriors or expected counts)

• [Tromble et al., ’08] Lattice Minimum Bayes-Risk Decoding

• [DeNero et al., ’09] Fast Consensus Decoding

• [Kumar et al., ’09] Minimum Bayes-Risk Decoding for Hypergraphs

• [Li et al., ’09] Variational Decoding for Machine Translation

Learned Objectives are Better than Fixed (Review)

$$C(d) = \sum_{n=1}^{4} w^{(n)} \sum_{g \in \text{n-grams}} c(g, d) \cdot P(g|f) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} \, \theta \cdot \phi(d)$$

Objective parameters: $w^{(n)}$, $w^{(\ell)}$, $w^{(b)}$. Terms: n-gram statistics, length, base model.

Fixed: choose w such that $C_w(d) \approx \mathbb{E}[\mathrm{bleu}(d)]$

Learned: choose w to maximize $\mathrm{bleu}\big(\arg\max_d C_w(d);\ e\big)$, where e denotes the references [Kumar et al., ’09]

[Chart: Consensus performance versus max-derivation decoding (39 pairs). Legend: increased test set BLEU by ≥0.2; decreased test set BLEU by ≥0.2. Bar labels: Fixed 12 and 18; Learned 5 and 26]
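The consensus objective C(d) can be evaluated directly once n-gram posteriors are available. The sketch below assumes the posteriors are given as a dictionary from n-gram strings to P(g|f); the weights and posteriors in the example are hypothetical. In the learned setting they would be tuned to maximize BLEU against references, which is not shown here.

```python
def consensus_score(sentence, base_score, ngram_post, w_n, w_len, w_base):
    """Evaluate C(d) for a derivation with yield `sentence` and base model
    score `base_score` (theta . phi(d)).

    Summing P(g|f) once per occurrence of g gives sum_g c(g, d) * P(g|f).
    """
    words = sentence.split()
    total = w_len * len(words) + w_base * base_score
    for n in range(1, 5):
        for i in range(len(words) - n + 1):
            g = " ".join(words[i:i + n])
            total += w_n[n] * ngram_post.get(g, 0.0)
    return total

# Hypothetical posteriors and weights.
ngram_post = {"a": 1.0, "b": 0.5, "a b": 0.5}
w_n = {1: 1.0, 2: 2.0, 3: 0.0, 4: 0.0}

c_ab = consensus_score("a b", -1.0, ngram_post, w_n, w_len=0.1, w_base=0.5)
c_a = consensus_score("a", -0.5, ngram_post, w_n, w_len=0.1, w_base=0.5)
```

Decoding then amounts to taking the argmax of this score over candidate derivations.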

N-gram Posteriors can also be Computed Quickly

$$P(g|f) = 1.0 - P(\bar{g}|f)$$

where $\bar{g}$ denotes derivations that don’t contain g (e.g., “with the”)

Audience challenge: What semiring computes n-gram posteriors?

[Forest figure: an edge with $e^{\theta \cdot \phi(\text{edge})} = 0.2$ introduces “saw the” and combines the node “I saw” with the node “the man with {a,the} telescope” into the node “I ... telescope”, which yields “I saw the man with {a,the} telescope”]

Node “I saw”: Inside: 0.1; “I saw” : 0; “with the” : 0.1; ...

Node “the man with {a,the} telescope”: Inside: 0.5; “with a” : 0.3; “with the” : 0.2; ...

Node “I ... telescope”: Inside: 0.01; “saw the” : 0; “I saw” : 0; “with the” : 0.004; ...
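The numbers on this slide are consistent with the following reading, sketched for a single hyperedge: each node stores its inside score plus, for each tracked n-gram, the mass of derivations that do not contain it. This is an interpretation of the slide rather than the authors' implementation; the function name and data layout are hypothetical, and a node with several incoming edges would sum these quantities over its edges.

```python
def combine_edge(edge_weight, introduced, children):
    """Combine child nodes along one hyperedge.

    children: list of (inside, not_g) pairs, where not_g maps an n-gram to
    the mass of that child's derivations NOT containing it. An n-gram absent
    from not_g is assumed to appear in none of the child's derivations, so
    its mass defaults to the child's full inside score.
    """
    inside = edge_weight
    for child_inside, _ in children:
        inside *= child_inside

    tracked = set(introduced)
    for _, not_g in children:
        tracked |= set(not_g)

    not_containing = {}
    for g in tracked:
        if g in introduced:
            # Every derivation through this edge contains g.
            not_containing[g] = 0.0
        else:
            mass = edge_weight
            for child_inside, not_g in children:
                mass *= not_g.get(g, child_inside)
            not_containing[g] = mass
    return inside, not_containing

# Values from the slide.
left = (0.1, {"I saw": 0.0, "with the": 0.1})
right = (0.5, {"with a": 0.3, "with the": 0.2})
inside, not_g = combine_edge(0.2, {"saw the"}, [left, right])

# At the root, P(g|f) = 1 - P(g-bar|f).
p_with_the = 1.0 - not_g["with the"] / inside
```

With the slide's values this gives an inside score of 0.01 and a mass of 0.004 for derivations without “with the”, matching the top node.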

Page 76: Model Combination for Machine Translationdenero/research/papers/naacl10... · dog ate my homework Combiners assume little about systems Objectives similar to consensus decoding 1

Results for Learned Consensus Decoding

Constrained data track of the 2008 NIST MT task

Baseline: max-derivation
Learned consensus decoding techniques: posterior probabilities, expected counts

[Bar chart, Arabic-to-English (axis 40 to 46), systems Phrase-Based, Hiero, SAMT. Bar labels in extraction order: 44.4, 43.8, 44.6, 44.5, 43.8, 44.6, 43.8, 43.3, 43.9]

[Bar chart, Chinese-to-English (axis 24 to 30), same systems. Bar labels in extraction order: 28.7, 28.2, 27.2, 28.8, 27.8, 27.3, 28.4, 27.2, 25.4]


Outline

Consensus decoding review

Our model combination technique

Comparison to system combination

[Callout: All in One Slide!]


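Extending Consensus Decoding to Multiple Models

1. Build posterior forests

[Figure: the source sentence f is translated by a phrase-based model and a Hiero-style model, giving P_1(d|f) and P_2(d|f).]

C(d) = \sum_{n=1}^{4} w^{(n)} \sum_{g \in n\text{-grams}} c(g, d) \cdot P(g \mid f) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} \theta \cdot \phi(d)

Terms, left to right: N-gram statistics, Length, Base model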
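As a rough sketch of how the single-model objective C(d) could be evaluated for one candidate, the function below scores a token sequence given a table of n-gram posteriors. The function and parameter names are hypothetical, the derivation is represented only by its English yield and a precomputed base score, and the posterior table is assumed to be keyed by token tuples.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[NGram]:
    """All order-n n-grams of the token sequence, with repeats."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def consensus_score(tokens: Sequence[str],
                    posteriors: Dict[NGram, float],
                    w_ngram: Dict[int, float],
                    w_len: float,
                    w_base: float,
                    base_score: float) -> float:
    """C(d) = sum_n w(n) sum_g c(g,d) P(g|f) + w(l) |sigma_e(d)| + w(b) theta.phi(d)."""
    score = w_len * len(tokens) + w_base * base_score
    for n in range(1, 5):
        counts = Counter(ngrams(tokens, n))          # c(g, d)
        v_n = sum(c * posteriors.get(g, 0.0) for g, c in counts.items())
        score += w_ngram.get(n, 0.0) * v_n
    return score


if __name__ == "__main__":
    toy_posteriors = {("I",): 1.0, ("saw",): 1.0, ("I", "saw"): 0.5}
    print(consensus_score("I saw the man".split(), toy_posteriors,
                          w_ngram={1: 1.0, 2: 2.0}, w_len=0.1,
                          w_base=0.5, base_score=-2.0))
```

Because the score is linear in the weights, the same function can be reused unchanged while the weights are tuned.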

Extending Consensus Decoding to Multiple Models

1. Build posterior forests

C(d) = \sum_{n=1}^{4} w^{(n)} v^{(n)}(d) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

Terms, left to right: N-gram statistics, Length, Base model

Here v^{(n)}(d) abbreviates the order-n n-gram term \sum_{g} c(g, d) \cdot P(g \mid f), and b(d) abbreviates the base model score \theta \cdot \phi(d).

Extending Consensus Decoding to Multiple Models

1. Build posterior forests

C(d) = \sum_{i=1}^{I} \sum_{n=1}^{4} w_i^{(n)} v_i^{(n)}(d) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

Terms, left to right: Sum over models, N-gram statistics, Length, Base model


Extending Consensus Decoding to Multiple Models

1. Build posterior forests

2. Compute n-gram statistics from forests

3. Union all forests

4. Optimize multi-model consensus objective

[Figure: the source sentence f is translated by a phrase-based model and a Hiero-style model, giving P_1(d|f) and P_2(d|f).]

C(d) = \sum_{i=1}^{I} \left[ \sum_{n=1}^{4} w_i^{(n)} v_i^{(n)}(d) + w_i^{(\alpha)} \alpha_i(d) \right] + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

Terms, left to right: Sum over models, N-gram statistics, Model choice, Length, Base model

Model choice: Indicator feature for the system that originally generated d

Base model: Model score under the system that generated d

[Callout: Original features]
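The four steps can be sketched end to end on a toy example. This is only an illustration under strong assumptions: the union of forests is replaced by an explicit list of candidate derivations, each model's n-gram posteriors are given as a dictionary, and the names `Derivation`, `combination_score`, and `combine` are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


@dataclass(frozen=True)
class Derivation:
    tokens: Tuple[str, ...]   # English yield sigma_e(d)
    model: int                # index i of the system that generated d
    base_score: float         # b(d): score under the generating system


def ngrams(tokens: Sequence[str], n: int) -> List[NGram]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combination_score(d: Derivation,
                      posteriors_by_model: List[Dict[NGram, float]],
                      w_ngram: List[List[float]],   # w_ngram[i][n-1]
                      w_alpha: List[float],         # model choice weights
                      w_len: float,
                      w_base: float) -> float:
    """Multi-model consensus objective C(d) from the slide."""
    score = w_len * len(d.tokens) + w_base * d.base_score
    for i, post in enumerate(posteriors_by_model):
        for n in range(1, 5):
            # v_i^(n)(d): repeated n-grams are summed once per occurrence
            v = sum(post.get(g, 0.0) for g in ngrams(d.tokens, n))
            score += w_ngram[i][n - 1] * v
        alpha = 1.0 if d.model == i else 0.0        # indicator alpha_i(d)
        score += w_alpha[i] * alpha
    return score


def combine(union: List[Derivation], **weights) -> Derivation:
    """Pick the highest-scoring derivation from the union of candidates."""
    return max(union, key=lambda d: combination_score(d, **weights))


if __name__ == "__main__":
    post = [{("a",): 0.9, ("b",): 0.1}, {("a",): 0.2, ("b",): 0.8}]
    union = [Derivation(("a",), 0, -1.0), Derivation(("b",), 1, -1.0)]
    best = combine(union, posteriors_by_model=post,
                   w_ngram=[[1, 0, 0, 0], [1, 0, 0, 0]],
                   w_alpha=[0.0, 0.0], w_len=0.0, w_base=0.0)
    print(best)
```

With a single model and a zero model choice weight, the score has the same form as the single-model consensus objective.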


Properties of Model Combination

C(d) = \sum_{i=1}^{I} \left[ \sum_{n=1}^{4} w_i^{(n)} v_i^{(n)}(d) + w_i^{(\alpha)} \alpha_i(d) \right] + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

Terms, left to right: Sum over models, N-gram statistics, Model choice, Length, Base model

Reduces to consensus decoding when we have only one model

A linear model: w can be tuned to maximize output performance

No concept of a primary system

Every possible output was a derivation under some original model


Model Combination Experimental Results

Compared three in-house Google systems

Constrained data track of 2008 NIST task

Parameters tuned on NIST 2004 eval set

[Bar chart, Arabic-to-English BLEU (axis 42 to 46), bars for Phrase, Hiero, SAMT, Combo; legend: Max-derivation, Consensus; annotation: +1.4. Bar labels in extraction order: 45.3, 44.5, 43.8, 44.6, 43.8, 43.3, 43.9]

[Bar chart, Chinese-to-English BLEU (axis 24 to 30), bars for Phrase, Hiero, SAMT, Combo; legend: Max-derivation, Consensus; annotation: +0.6. Bar labels in extraction order: 29.0, 28.8, 27.8, 27.3, 28.4, 27.2, 25.4]


Sources of Improvement: Arabic-to-English

Do we need model choice features?

Should n-gram statistics be model specific?

[Figure: f is translated by the phrase-based and Hiero-style models; their forests are combined into the “Union”.]

Configuration                                   BLEU (axis 43 to 46)
Best max-derivation system                      43.9
Best single-system with consensus decoding      44.5
Union + consensus decoding                      44.5
Union + consensus + model choice features       44.9
Union with model-specific n-gram statistics     45.1
Model-specific & union n-gram statistics        45.3


Outline

Consensus decoding review

Our model combination technique

Comparison to system combination

[Callout: The Final Showdown]


System Combination Baselines

Two established system combination methods [Macherey & Och, ’07]

[Figure: the source sentence f is translated by three models P_1(d|f), P_2(d|f), P_3(d|f); each is followed by its own Consensus step, giving outputs e^*_1, e^*_2, e^*_3; a System combo step then produces the final output e. The stages are numbered 1, 2, 3.]

Sentence-level: Choose among system outputs using an MBR objective

e = \arg\max_{e \in \{e^*_1, \ldots, e^*_k\}} E[\mathrm{bleu}(e)]

Word-level: Confusion network approach

All e^*_i are aligned to a backbone e_b \in \{e^*_1, \ldots, e^*_k\}

Sentences + alignments form a confusion network

Output maximizes a consensus objective
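The sentence-level baseline can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the method of Macherey & Och: the expectation is taken over the system outputs themselves (uniformly weighted unless probabilities are supplied), and a simple clipped n-gram precision stands in for sentence-level BLEU. The names `overlap_gain` and `select_sentence_level` are invented for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def ngram_counts(tokens: Sequence[str], max_n: int = 4) -> Counter:
    """Counts of all n-grams of order 1..max_n."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def overlap_gain(hyp: str, ref: str) -> float:
    """Clipped n-gram precision of hyp against ref (a stand-in for BLEU)."""
    h, r = ngram_counts(hyp.split()), ngram_counts(ref.split())
    total = sum(h.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, r[g]) for g, c in h.items())
    return matched / total


def select_sentence_level(outputs: List[str],
                          probs: Optional[List[float]] = None) -> str:
    """Return the system output with the highest expected gain."""
    if probs is None:
        probs = [1.0 / len(outputs)] * len(outputs)   # assumed uniform weights

    def expected_gain(e: str) -> float:
        return sum(p * overlap_gain(e, other) for other, p in zip(outputs, probs))

    return max(outputs, key=expected_gain)


if __name__ == "__main__":
    print(select_sentence_level(["the cat sat", "the cat sat down", "a dog ran"]))
```

The selected sentence is always one of the system outputs, which is what distinguishes this baseline from the word-level confusion network approach.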


Model versus System Combination

Qualitative

Inter-hypothesis alignment problem exists only in system combination

Single-system n-gram statistics are required in both methods

Model combination searches only once under a consensus objective

Quantitative

Approach                             Arabic-to-English   Chinese-to-English
Best single-system consensus         44.5                28.8
Sentence-level system combination    44.6                28.8
Word-level system combination        44.7                28.8
Model combination                    45.3                29.0


Conclusion

Consensus decoding with a learned objective extends naturally to multiple models

Model combination provides consensus and combination effects in a unified, distribution-driven objective

It outperforms a pipeline of consensus decoding followed by system combination, using less total computation

It’s easy, it’s clean, and it works

Thanks!