Model Combination for Machine Translation

John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Josef Och

Monday, June 7, 2010

Motivation

A statistical machine translation model scores derivations:

\theta \cdot \phi(d) = \theta \cdot \Big( \sum_{w \in \text{n-grams}(d)} \phi_{\text{lm}}(w) + \sum_{r \in \text{rules}(d)} \phi_{\text{tm}}(r) \Big)

The (log) model score sums language model and translation model features.

We can improve over max-derivation decoding:

• Consensus decoding (e.g., minimum Bayes risk)
• System combination (e.g., confusion networks)

In this work, we develop a technique that integrates both.
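
The additive decomposition above is easy to mirror in code. A minimal sketch, assuming toy dictionaries of per-n-gram language-model features and per-rule translation-model features; the feature names, rules, and values are invented for illustration, not taken from the talk:

```python
# Sketch of the additive model score theta · phi(d): phi(d) sums language-model
# features over the derivation's n-grams and translation-model features over
# its rules. All feature names and numbers below are illustrative.
from collections import Counter

def model_score(theta, ngram_feats, rule_feats, ngrams, rules):
    """Return theta · phi(d) for a derivation d with the given n-grams and rules."""
    phi = Counter()
    for g in ngrams:                      # phi_lm(w), summed over n-grams of d
        phi.update(ngram_feats.get(g, {}))
    for r in rules:                       # phi_tm(r), summed over rules of d
        phi.update(rule_feats.get(r, {}))
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())

# Toy weights and features (hypothetical).
theta = {"lm_logprob": 0.5, "tm_logprob": 1.0, "rule_count": -0.1}
ngram_feats = {("the", "dog"): {"lm_logprob": -1.2},
               ("dog", "ate"): {"lm_logprob": -2.0}}
rule_feats = {"el perro -> the dog": {"tm_logprob": -0.5, "rule_count": 1},
              "comi -> ate": {"tm_logprob": -0.8, "rule_count": 1}}

print(model_score(theta, ngram_feats, rule_feats,
                  ngrams=[("the", "dog"), ("dog", "ate")],
                  rules=["el perro -> the dog", "comi -> ate"]))
```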

Consensus Decoding

Derivation scores can be interpreted as probabilities:

P(d \mid f) = \frac{\exp(\theta \cdot \phi(d))}{\sum_{d'} \exp(\theta \cdot \phi(d'))}

We can query this distribution for:

Whole translations [Blunsom et al., ’08], where σ_e(d) is the translation yielded by derivation d:

e_{\text{max-trans}} = \arg\max_{e} \sum_{d : \sigma_e(d) = e} P(d \mid f)

N-gram overlap [Kumar and Byrne, ’04], with the expectation taken under P(d|f):

e_{\text{mbr}} = \arg\max_{e} \mathbb{E}\left[\text{BLEU}(e; \sigma_e(d))\right]

(Free BLEU points!)
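
A small sketch of both queries, hedged: the talk works over full translation forests, while here an n-best list of (score, yield) pairs stands in, and a crude n-gram overlap stands in for BLEU. All scores are invented:

```python
# Illustrative n-best sketch of the two consensus queries.
import math
from collections import Counter, defaultdict

def posteriors(scored):
    """P(d|f) = exp(theta·phi(d)) / sum_d' exp(theta·phi(d'))."""
    m = max(s for s, _ in scored)
    exps = [math.exp(s - m) for s, _ in scored]
    z = sum(exps)
    return [x / z for x in exps]

def ngram_sim(hyp, ref, max_n=2):
    """Toy similarity: clipped n-gram overlap fraction (a BLEU stand-in)."""
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        matches += sum(min(c, r[g]) for g, c in h.items())
        total += max(sum(h.values()), 1)
    return matches / total

nbest = [(1.2, "the dog ate my homework"),
         (1.0, "the dog ate homework"),
         (0.9, "the dog ate my homework"),   # same yield, different derivation
         (0.2, "dog ate my homework")]
probs = posteriors(nbest)

# Max-translation: sum posterior mass over derivations sharing a yield.
mass = defaultdict(float)
for (_, e), p in zip(nbest, probs):
    mass[e] += p
print("max-translation:", max(mass, key=mass.get))

# MBR: pick the yield maximizing expected similarity under P(d|f).
def expected_sim(e):
    return sum(p * ngram_sim(e.split(), e2.split()) for (_, e2), p in zip(nbest, probs))
print("MBR:", max(mass, key=expected_sim))
```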

System Combination

We often have multiple translation systems.

Source: el perro comí mi tarea

System 1: the dog bit my work
System 2: the dog ate homework
System 3: dog ate my homework

Combiners assume little about the systems.
Objectives are similar to consensus decoding.

(More BLEU points!)

Model Combination

Consensus decoding with multiple models
A distribution-driven approach to system combination
Unifies consensus and combination objectives

[Diagram: a source sentence f is decoded by systems 1–3 into posteriors P1(d|f), P2(d|f), P3(d|f); per-system consensus decoding yields e*1, e*2, e*3, and system combination merges them into a final output e.]

Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(With new algorithms and results)

Forest-Based Consensus Decoding (Review)

1. Build a forest that encodes the model posterior P(d|f)
2. Compute n-gram statistics from the posterior
3. Optimize a consensus objective using these statistics

[Example forest for "Yo vi al hombre con el telescopio": one node yields "I saw the man with {a,the} telescope", another the reordering "I saw with {a,the} telescope the man"; each hyperedge carries a posterior P(edge|f), from which n-gram posteriors such as P("telescope the" | f) follow.]
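
A minimal sketch of the bookkeeping behind step 1, on a toy forest whose structure and numbers are invented (the real forests come straight out of the decoders): an inside pass gives the normalizer, an outside pass gives derivation contexts, and together they yield the edge posteriors P(edge|f) from which n-gram statistics are read off.

```python
from collections import defaultdict

# Toy translation forest: each hyperedge is (head, tails, weight), with
# weight = exp(theta · phi(edge)). Edges are listed so that tail nodes are
# built before their heads (leaves first, root last).
edges = [
    ("I_saw",     (), 1.0),
    ("man_scope", (), 0.6),                    # "... with a telescope"
    ("man_scope", (), 0.4),                    # "... with the telescope"
    ("root", ("I_saw", "man_scope"), 0.2),     # "I saw the man with ..."
    ("root", ("man_scope", "I_saw"), 0.1),     # reordered derivation
]
root = "root"

# Inside pass: total weight of all derivations rooted at each node.
inside = defaultdict(float)
for head, tails, w in edges:
    prod = w
    for t in tails:
        prod *= inside[t]
    inside[head] += prod

# Outside pass: total weight of the contexts above each node.
outside = defaultdict(float)
outside[root] = 1.0
for head, tails, w in reversed(edges):
    for i, t in enumerate(tails):
        prod = outside[head] * w
        for j, u in enumerate(tails):
            if j != i:
                prod *= inside[u]
        outside[t] += prod

# Edge posteriors: P(edge|f) = outside(head) * weight * prod(inside(tails)) / Z.
Z = inside[root]
for head, tails, w in edges:
    prod = outside[head] * w
    for t in tails:
        prod *= inside[t]
    print(head, tails, round(prod / Z, 3))
```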

Types of Efficient Consensus Techniques (Review)

Techniques vary along two axes: whether the consensus objective is fixed or learned, and whether the n-gram statistics are posteriors or expected counts.

• Lattice Minimum Bayes-Risk Decoding [Tromble et al., ’08]
• Fast Consensus Decoding [DeNero et al., ’09]
• Minimum Bayes-Risk Decoding for Hypergraphs [Kumar et al., ’09]
• Variational Decoding for Machine Translation [Li et al., ’09]

Learned Objectives are Better than Fixed (Review)

C(d) = \sum_{n=1}^{4} w^{(n)} \sum_{g \in \text{n-grams}} c(g, d) \cdot P(g \mid f) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} \, \theta \cdot \phi(d)

The first term collects n-gram statistics, the second scores length, and the third is the base model score; the w's are the objective parameters.

Fixed: choose w such that C_w(d) ≈ E[BLEU(d)]
Learned: choose w to maximize BLEU(argmax_d C_w(d); e), where e are the reference translations [Kumar et al., ’09]

[Bar chart: consensus decoding versus max-derivation decoding across 39 pairs, counting pairs whose test-set BLEU increased or decreased by at least 0.2. With the fixed objective the gains and losses are split 18/12 over the 39 pairs; with the learned objective 26 pairs improve and only 5 degrade.]
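
A sketch of scoring a candidate with C_w(d), assuming toy n-gram posteriors P(g|f) and made-up weights w; in the talk the argmax is taken over a forest of derivations, while a two-candidate list stands in here:

```python
# Sketch of the consensus objective C(d): weighted n-gram posteriors, a length
# term, and the base model score. All weights and posteriors are illustrative.
def consensus_objective(ngrams, length, base_score, posterior, w_ngram, w_len, w_base):
    """C(d) = sum_n w^(n) sum_{g in n-grams(d)} c(g,d) P(g|f)
              + w^(len) |sigma_e(d)| + w^(b) theta·phi(d)."""
    score = w_len * length + w_base * base_score
    for g, count in ngrams.items():            # c(g, d) counts
        score += w_ngram.get(len(g), 0.0) * count * posterior.get(g, 0.0)
    return score

def ngram_counts(sentence, max_n=2):
    toks = sentence.split()
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            g = tuple(toks[i:i+n])
            counts[g] = counts.get(g, 0) + 1
    return counts

# Toy n-gram posteriors P(g|f) and two candidates with base scores theta·phi(d).
posterior = {("the",): 0.9, ("dog",): 0.95, ("ate",): 0.8,
             ("the", "dog"): 0.85, ("dog", "ate"): 0.7, ("bit",): 0.1}
w_ngram = {1: 0.2, 2: 0.6}                     # hypothetical objective parameters
candidates = {"the dog ate": -2.0, "the dog bit": -1.8}

best = max(candidates, key=lambda e: consensus_objective(
    ngram_counts(e), len(e.split()), candidates[e], posterior, w_ngram, 0.05, 0.1))
print("argmax_d C_w(d):", best)
```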

N-gram Posteriors can also be Computed Quickly

P(g \mid f) = 1 - \bar{P}(g \mid f), where \bar{P}(g \mid f) is the posterior mass of derivations that do not contain the n-gram g.

[Forest fragment: a hyperedge with score e^{θ·φ(edge)} = 0.2 combines "I saw" (inside score 0.1) with "the man with {a,the} telescope" (inside score 0.5) and introduces the bigram "saw the". Each node stores, for each n-gram, the inside mass of derivations that do not contain it: the children carry "with the": 0.1 and 0.2, so the parent has total inside 0.1 × 0.5 × 0.2 = 0.01, "with the": 0.1 × 0.2 × 0.2 = 0.004, and "saw the": 0, since every derivation through this edge contains "saw the".]

Audience challenge: What semiring computes n-gram posteriors?
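
The identity is easy to check on an explicitly enumerated toy posterior. The talk instead accumulates the "does not contain g" mass with an inside-style pass over the forest, which is what makes it fast; the probabilities below are invented:

```python
# Illustrates P(g|f) = 1 - Pbar(g|f), where Pbar is the posterior mass of
# derivations that do NOT contain n-gram g. Enumerating derivations is only
# feasible for toy examples; the forest algorithm avoids it.
def ngrams(tokens, n):
    return {tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)}

# (posterior probability, yield) pairs; probabilities sum to 1.
derivations = [(0.5, "I saw the man with a telescope"),
               (0.3, "I saw the man with the telescope"),
               (0.2, "I saw with the telescope the man")]

def ngram_posterior(g):
    mass_without_g = sum(p for p, e in derivations
                         if g not in ngrams(e.split(), len(g)))
    return 1.0 - mass_without_g

print(ngram_posterior(("with", "the")))       # 0.5: in the last two derivations
print(ngram_posterior(("telescope", "the")))  # 0.2: only in the reordered one
```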

Results for Learned Consensus Decoding

Constrained data track of the 2008 NIST MT task

Baseline: max-derivation decoding. Learned consensus decoding techniques: posterior probabilities and expected counts.

[Bar charts: BLEU for Phrase-Based, Hiero, and SAMT systems under the max-derivation baseline and the two learned consensus variants. Arabic-to-English scores fall between 43.3 and 44.6; Chinese-to-English scores fall between 25.4 and 28.8.]

Outline

Consensus decoding review
Our model combination technique (all in one slide!)
Comparison to system combination

Extending Consensus Decoding to Multiple Models

1. Build posterior forests: each system (e.g., a phrase-based and a Hiero-style model) decodes f into a forest encoding P1(d|f), P2(d|f), ...
2. Compute n-gram statistics from the forests
3. Union all forests
4. Optimize the multi-model consensus objective

C(d) = \sum_{i=1}^{I} \Big( \sum_{n=1}^{4} w_i^{(n)} v_i^{(n)}(d) + w_i^{(\alpha)} \alpha_i(d) \Big) + w^{(\ell)} |\sigma_e(d)| + w^{(b)} b(d)

The outer sum runs over models: v_i^{(n)}(d) = \sum_{g \in \text{n-grams}} c(g, d) \cdot P_i(g \mid f) are the n-gram statistics under model i, α_i(d) is the model-choice feature, |σ_e(d)| is the length term, and b(d) is the base model score.

Model choice: an indicator feature for the system that originally generated d
Base model: the model score of d under the system that generated it (its original features)
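
A sketch of the combined objective over a pooled candidate list, with two hypothetical models' n-gram posteriors, per-model weights w_i, and the model-choice indicator α_i(d); everything numeric is invented, and in the talk the search runs over the union forest rather than a flat list:

```python
# Sketch of the multi-model consensus objective. Each model i contributes its
# own n-gram posteriors P_i(g|f) and weights w_i; alpha_i(d) marks the system
# that produced the candidate, and b(d) is that system's score. Toy values.
def multi_model_objective(cand, model_posteriors, weights, w_len, w_base):
    toks = cand["yield"].split()
    grams = {}
    for n in (1, 2):
        for i in range(len(toks) - n + 1):
            g = tuple(toks[i:i+n])
            grams[g] = grams.get(g, 0) + 1
    score = w_len * len(toks) + w_base * cand["base_score"]
    for i, post in enumerate(model_posteriors):          # sum over models i
        w = weights[i]
        score += w["alpha"] * (1.0 if cand["origin"] == i else 0.0)  # model choice
        for g, c in grams.items():                       # w_i^(n) v_i^(n)(d)
            score += w["ngram"][len(g)] * c * post.get(g, 0.0)
    return score

# Two models' n-gram posteriors over the union of their forests (toy values).
model_posteriors = [{("the",): 0.9, ("dog",): 0.9, ("the", "dog"): 0.8, ("ate",): 0.7},
                    {("the",): 0.8, ("dog",): 0.95, ("the", "dog"): 0.7, ("bit",): 0.4}]
weights = [{"ngram": {1: 0.2, 2: 0.5}, "alpha": 0.3},
           {"ngram": {1: 0.1, 2: 0.4}, "alpha": 0.2}]

candidates = [{"yield": "the dog ate", "origin": 0, "base_score": -2.0},
              {"yield": "the dog bit", "origin": 1, "base_score": -1.5}]
best = max(candidates, key=lambda c: multi_model_objective(
    c, model_posteriors, weights, w_len=0.05, w_base=0.1))
print(best["yield"])
```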

Properties of Model Combination

(Using the multi-model objective C(d) above)

Reduces to consensus decoding when we have only one model
A linear model: w can be tuned to maximize output performance
No concept of a primary system
Every possible output was a derivation under some original model

Model Combination Experimental Results

Compared three in-house Google systems
Constrained data track of the 2008 NIST task
Parameters tuned on the NIST 2004 eval set

[Bar charts of max-derivation and consensus BLEU for the Phrase, Hiero, and SAMT systems, plus a single Model Combination bar. Arabic-to-English: model combination reaches 45.3, +1.4 over the best max-derivation system (43.9). Chinese-to-English: model combination reaches 29.0, a +0.6 gain.]

Sources of Improvement: Arabic-to-English

Do we need model choice features?
Should n-gram statistics be model specific?
("Union" merges the phrase-based and Hiero-style forests.)

Best max-derivation system: 43.9
Best single system with consensus decoding: 44.5
Union + consensus decoding: 44.5
Union + consensus + model choice features: 44.9
Union with model-specific n-gram statistics: 45.1
Model-specific & union n-gram statistics: 45.3

Outline

Consensus decoding review
Our model combination technique
Comparison to system combination (the final showdown)

System Combination Baselines

Two established system combination methods [Macherey & Och, ’07]

Sentence-level: choose among the systems' outputs using an MBR objective

e = \arg\max_{e \in \{e^*_1, \ldots, e^*_k\}} \mathbb{E}[\text{BLEU}(e)]

Word-level: confusion network approach
• All e*_i are aligned to a backbone e_b ∈ {e*_1, ..., e*_k}
• Sentences + alignments form a confusion network
• The output maximizes a consensus objective

[Diagram: each system i decodes f into P_i(d|f); consensus decoding yields e*_1, e*_2, e*_3, which system combination merges into a final e.]
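
A sketch of the sentence-level baseline on the toy outputs from the earlier example, using a crude n-gram overlap as a stand-in for the BLEU gain and weighting all systems uniformly; the actual baseline of Macherey & Och uses BLEU, and these outputs and numbers are only illustrative:

```python
# Sentence-level combination: choose among the systems' outputs e*_1..e*_k by
# an MBR-style objective, here expected n-gram overlap against the others.
from collections import Counter

def overlap(hyp, ref, max_n=4):
    """Toy gain function standing in for sentence-level BLEU."""
    h_toks, r_toks = hyp.split(), ref.split()
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h = Counter(tuple(h_toks[i:i+n]) for i in range(len(h_toks) - n + 1))
        r = Counter(tuple(r_toks[i:i+n]) for i in range(len(r_toks) - n + 1))
        matches += sum(min(c, r[g]) for g, c in h.items())
        total += max(sum(h.values()), 1)
    return matches / total

outputs = ["the dog bit my work",        # e*_1
           "the dog ate homework",       # e*_2
           "dog ate my homework"]        # e*_3

def expected_gain(e):
    return sum(overlap(e, e_prime) for e_prime in outputs) / len(outputs)

print(max(outputs, key=expected_gain))
```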

Model versus System Combination

Qualitative
• Single-system n-gram statistics are required in both methods
• Model combination searches only once under a consensus objective
• The inter-hypothesis alignment problem exists only in system combination

Quantitative (BLEU)
Arabic-to-English: best single-system consensus 44.5; sentence-level system combination 44.6; word-level system combination 44.7; model combination 45.3
Chinese-to-English: best single-system consensus 28.8; sentence-level 28.8; word-level 28.8; model combination 29.0

Conclusion

Consensus decoding with a learned objective extends naturally to multiple models.

Model combination provides consensus and combination effects in a unified, distribution-driven objective.

It outperforms a pipeline of consensus decoding followed by system combination, using less total computation.

It's easy, it's clean, and it works. Thanks!