Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based...

97
Exact decoding for phrase-based SMT Wilker Aziz 1 , Marc Dymetman 2 , Lucia Specia 1 1 University of Sheffield 2 Xerox Research Centre Europe October 27, 2014

Transcript of Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based...

Page 1: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Exact decoding for phrase-based SMT

Wilker Aziz1, Marc Dymetman2, Lucia Specia1

1University of Sheffield2Xerox Research Centre Europe

October 27, 2014

Page 2: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Outline

Introduction

Approach

Results

Conclusions

2 / 21

Page 3: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding

Viterbi decoding

d∗ = argmaxd∈D(x)

f (d)

= argmaxd∈D(x)

θ>H(d)

1 / 21

Page 4: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding

Viterbi decoding

d∗ = argmaxd∈D(x)

f (d)

= argmaxd∈D(x)

θ>H(d)

space of translation derivations compatible with the input x

1 / 21

Page 5: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding

Viterbi decoding

d∗ = argmaxd∈D(x)

f (d)

= argmaxd∈D(x)

θ>H(d)

we are looking for the best derivationunder a linear parameterisation (θ ∈ Rm)

1 / 21

Page 6: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding

Viterbi decoding

d∗ = argmaxd∈D(x)

f (d)

= argmaxd∈D(x)

θ>1 H1(d) + θ>2 H2(d)

“local” features assess steps in a derivation independently

H1(d) =∑e∈d

h1(e)

1 / 21

Page 7: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding

Viterbi decoding

d∗ = argmaxd∈D(x)

f (d)

= argmaxd∈D(x)

θ>1 H1(d) + θ>2 H2(d)

“local” features assess steps in a derivation independently

H1(d) =∑e∈d

h1(e)

“nonlocal” features make weaker independence assumptionse.g. HLM(d) = log pLM(yield(d))

1 / 21

Page 8: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Complexity

〈D(x), f (d)〉 can be seen as the intersection between

I a translation hypergraph G(x)locally parameterised

I and a target language model Aas a wFSA

Phrase-based SMT with a distortion limit (d) and an n-gram LM

|G(x)| ∝ I 22d

|A| ∝ |∆|n−1

ProblemThe intersection is too large for standard dynamic programming

2 / 21

Page 9: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Complexity

〈D(x), f (d)〉 can be seen as the intersection betweenI a translation hypergraph G(x)

locally parameterised

I and a target language model Aas a wFSA

Phrase-based SMT with a distortion limit (d) and an n-gram LM

|G(x)| ∝ I 22d

|A| ∝ |∆|n−1

ProblemThe intersection is too large for standard dynamic programming

2 / 21

Page 10: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Complexity

〈D(x), f (d)〉 can be seen as the intersection betweenI a translation hypergraph G(x)

locally parameterisedI and a target language model A

as a wFSA

Phrase-based SMT with a distortion limit (d) and an n-gram LM

|G(x)| ∝ I 22d

|A| ∝ |∆|n−1

ProblemThe intersection is too large for standard dynamic programming

2 / 21

Page 11: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Complexity

〈D(x), f (d)〉 can be seen as the intersection betweenI a translation hypergraph G(x)

locally parameterisedI and a target language model A

as a wFSAPhrase-based SMT with a distortion limit (d) and an n-gram LM

|G(x)| ∝ I 22d

|A| ∝ |∆|n−1

ProblemThe intersection is too large for standard dynamic programming

2 / 21

Page 12: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Complexity

〈D(x), f (d)〉 can be seen as the intersection betweenI a translation hypergraph G(x)

locally parameterisedI and a target language model A

as a wFSAPhrase-based SMT with a distortion limit (d) and an n-gram LM

|G(x)| ∝ I 22d

|A| ∝ |∆|n−1

ProblemThe intersection is too large for standard dynamic programming

2 / 21

Page 13: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Contribution

Previous work on exact decodingI compact modelsI simpler parameterisation using 3-gram LMs

This workI large modelsI realistic 5-gram LMs

3 / 21

Page 14: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Contribution

Previous work on exact decodingI compact modelsI simpler parameterisation using 3-gram LMs

This workI large modelsI realistic 5-gram LMs

3 / 21

Page 15: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A

I complete n-gram LM I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 16: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A

I complete n-gram LM

I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 17: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A = G(x) ∩( M⋂

i=1A(i)

)

I complete n-gram LM I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 18: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A = G(x) ∩( M⋂

i=1A(i)

)

I complete n-gram LM I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 19: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A = G(x) ∩( M⋂

i=1A(i)

)

I complete n-gram LM I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 20: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

IntuitionFull intersection is wasteful

G(x) ∩ A = G(x) ∩( M⋂

i=1A(i)

)

I complete n-gram LM I encodes one n-gram

Assumptionnot every n-gram participates in high-scoring derivations

Problemwhich n-grams are really necessary?and at which level of refinement?

Strategystart with strong independence assumptionsrevisit those assumptions as necessary

4 / 21

Page 21: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f

I upperbound the complex target f (d)

Page 22: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0

I upperbound the complex target f (d) by a simpler proposal g(d)

Page 23: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0

This is our goal

I we are interested in finding f ’s argmaxhowever we cannot search through f

Page 24: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0but because our proxy says this is the best

This is our goal

I we find d∗ = argmaxd g(d)

I finding our best solution thus far (according to f )I however we cannot guarantee exactness yet

Page 25: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0but because our proxy says this is the best

This is our goal

I we find d∗ = argmaxd g(d) and assess f (d∗)

I finding our best solution thus far (according to f )I however we cannot guarantee exactness yet

Page 26: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0

this is the best we have thus farThis is our goal

but because our proxy says this is the best

I we find d∗ = argmaxd g(d) and assess f (d∗)I finding our best solution thus far (according to f )

I however we cannot guarantee exactness yet

Page 27: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g0

Maximum error

I we find d∗ = argmaxd g(d) and assess f (d∗)I finding our best solution thus far (according to f )I however we cannot guarantee exactness yet

Page 28: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I so we refine g as to bring it closer to fe.g. by making g(d∗) = f (d∗)

I at the cost of some little complexity increase (extra nodes and edges)

Page 29: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I so we refine g as to bring it closer to fe.g. by making g(d∗) = f (d∗)

I at the cost of some little complexity increase (extra nodes and edges)

Page 30: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I as a consequence, g’s argmax has changedwe solve d∗ = argmaxd g(d)

I and assess f (d∗) againour best solution thus far might remain unchanged

Page 31: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I as a consequence, g’s argmax has changedwe solve d∗ = argmaxd g(d)

I and assess f (d∗) againour best solution thus far might remain unchanged

Page 32: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I we cannot yet guarantee exactness

I even though our maximum error is smaller

Page 33: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g1

I we cannot yet guarantee exactnessI even though our maximum error is smaller

Page 34: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g2

I but we can continue refining g

I for as long as g and f disagree on the maximumI until we have a certificate of optimality

Page 35: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g3

I but we can continue refining gI for as long as g and f disagree on the maximum

I until we have a certificate of optimality

Page 36: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g4

I but we can continue refining gI for as long as g and f disagree on the maximum

I until we have a certificate of optimality

Page 37: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

OS∗: illustration

5 / 21

0

5

10

15

20

the

blac

k ca

t

the

dark

cat

the

blac

k fe

line

the

dark

felin

e

the

cat b

lack

the

cat d

ark

the

felin

e bl

ack

the

felin

e da

rk

the

(bla

ck c

at)

cat t

he b

lack

cat t

he d

ark

felin

e th

e bl

ack

felin

e th

e da

rk

f g5

I but we can continue refining gI for as long as g and f disagree on the maximumI until we have a certificate of optimality

Page 38: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)

2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 39: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax

3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 40: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f

5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 41: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error

6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 42: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions

7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 43: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal

8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 44: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax

9: g∗ ← g(d∗)10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 45: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while

12: return g, d∗13: end function

6 / 21

Page 46: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Decoding algorithm

1: function Optimise(g, ε)2: d∗ ← argmaxd g(d) . proxy’s argmax3: g∗ ← g(d∗)4: f ∗ ← f (d∗) . observe a point in f5: while (q∗ − f ∗ ≥ ε) do . ε is the maximum error6: A← actions(g,d∗) . collect refinement actions7: g ← refine(g,A) . update proposal8: d∗ ← argmaxd g(d) . update argmax9: g∗ ← g(d∗)

10: f ∗ ← max(f ∗, f (d∗)) . update “best so far”11: end while12: return g, d∗13: end function

6 / 21

Page 47: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

ProposalIn g, the true LM pLM is replaced by an upperbound qLM

with stronger independence assumptions

I Suppose α = yJI is a substring of y = yM

1e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

I contribution of α to the true LM score of y

pLM(α) ≡J∏

k=Ip(yk |yk−1

1 )

e.g. p(black2|BOS0 the1)p(cat3|BOS0 the1 black2)I upperbound to pLM(α)

qLM(α) ≡ q(yI |ε)J∏

k=I+1q(yk |yk−1

I )

e.g. q(black2|ε)q(cat3|black2)

where q(z|P) ≡ maxH∈∆∗ p(z|HP)

7 / 21

Page 48: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

ProposalIn g, the true LM pLM is replaced by an upperbound qLM

with stronger independence assumptionsI Suppose α = yJ

I is a substring of y = yM1

e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

I contribution of α to the true LM score of y

pLM(α) ≡J∏

k=Ip(yk |yk−1

1 )

e.g. p(black2|BOS0 the1)p(cat3|BOS0 the1 black2)I upperbound to pLM(α)

qLM(α) ≡ q(yI |ε)J∏

k=I+1q(yk |yk−1

I )

e.g. q(black2|ε)q(cat3|black2)

where q(z|P) ≡ maxH∈∆∗ p(z|HP)

7 / 21

Page 49: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

ProposalIn g, the true LM pLM is replaced by an upperbound qLM

with stronger independence assumptionsI Suppose α = yJ

I is a substring of y = yM1

e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

I contribution of α to the true LM score of y

pLM(α) ≡J∏

k=Ip(yk |yk−1

1 )

e.g. p(black2|BOS0 the1)p(cat3|BOS0 the1 black2)

I upperbound to pLM(α)

qLM(α) ≡ q(yI |ε)J∏

k=I+1q(yk |yk−1

I )

e.g. q(black2|ε)q(cat3|black2)

where q(z|P) ≡ maxH∈∆∗ p(z|HP)

7 / 21

Page 50: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

ProposalIn g, the true LM pLM is replaced by an upperbound qLM

with stronger independence assumptionsI Suppose α = yJ

I is a substring of y = yM1

e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

I contribution of α to the true LM score of y

pLM(α) ≡J∏

k=Ip(yk |yk−1

1 )

e.g. p(black2|BOS0 the1)p(cat3|BOS0 the1 black2)I upperbound to pLM(α)

qLM(α) ≡ q(yI |ε)J∏

k=I+1q(yk |yk−1

I )

e.g. q(black2|ε)q(cat3|black2)

where q(z|P) ≡ maxH∈∆∗ p(z|HP)7 / 21

Page 51: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Max-ARPAAn n-gram is scored in its most optimistic context H

q(z|P) ≡ maxH∈∆∗

p(z|HP)

Efficiently computed using a Max-ARPA table MI start with an ARPA table

n-gram Pz — conditional log p(z|P) — backoff b(Pz)I compute an upperbound view of the last 2 columns

n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) =

p(z|P) Pz 6∈ M and P 6∈ Mp(z|P)×m(P) Pz 6∈ M and P ∈ Mq(z|P) Pz ∈ M

8 / 21

Page 52: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Max-ARPAAn n-gram is scored in its most optimistic context H

q(z|P) ≡ maxH∈∆∗

p(z|HP)

Efficiently computed using a Max-ARPA table M

I start with an ARPA tablen-gram Pz — conditional log p(z|P) — backoff b(Pz)

I compute an upperbound view of the last 2 columnsn-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) =

p(z|P) Pz 6∈ M and P 6∈ Mp(z|P)×m(P) Pz 6∈ M and P ∈ Mq(z|P) Pz ∈ M

8 / 21

Page 53: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Max-ARPAAn n-gram is scored in its most optimistic context H

q(z|P) ≡ maxH∈∆∗

p(z|HP)

Efficiently computed using a Max-ARPA table MI start with an ARPA table

n-gram Pz — conditional log p(z|P) — backoff b(Pz)

I compute an upperbound view of the last 2 columnsn-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) =

p(z|P) Pz 6∈ M and P 6∈ Mp(z|P)×m(P) Pz 6∈ M and P ∈ Mq(z|P) Pz ∈ M

8 / 21

Page 54: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Max-ARPAAn n-gram is scored in its most optimistic context H

q(z|P) ≡ maxH∈∆∗

p(z|HP)

Efficiently computed using a Max-ARPA table MI start with an ARPA table

n-gram Pz — conditional log p(z|P) — backoff b(Pz)I compute an upperbound view of the last 2 columns

n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) =

p(z|P) Pz 6∈ M and P 6∈ Mp(z|P)×m(P) Pz 6∈ M and P ∈ Mq(z|P) Pz ∈ M

8 / 21

Page 55: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Max-ARPAAn n-gram is scored in its most optimistic context H

q(z|P) ≡ maxH∈∆∗

p(z|HP)

Efficiently computed using a Max-ARPA table MI start with an ARPA table

n-gram Pz — conditional log p(z|P) — backoff b(Pz)I compute an upperbound view of the last 2 columns

n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) =

p(z|P) Pz 6∈ M and P 6∈ Mp(z|P)×m(P) Pz 6∈ M and P ∈ Mq(z|P) Pz ∈ M

8 / 21

Page 56: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph

Remark!Nodes do not store LM state

9 / 21

Page 57: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph

Remark!Nodes do not store LM state

9 / 21

Page 58: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph

Remark!Nodes do not store LM state

| 〈D(x), f (d)〉 | = |G(d) ∩ A| ∝ I 22d |∆|n−1

9 / 21

Page 59: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph

Remark!Nodes do not store LM state

| 〈D(x), g(d)〉 | = |G(d) ∩���upperbound

A| ∝ I 22d����|∆|n−1

9 / 21

Page 60: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

10 / 21

Page 61: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

10 / 21

Page 62: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

g′(d) ={

g(d) if d 6= argmax g(d)f (d) otherwise.

10 / 21

Page 63: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

g′(d) ={

g(d) if d 6= argmax g(d)f (d) otherwise.

Too local: one derivation at a time

10 / 21

Page 64: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

g′(d) = g(d)q(z|hP)q(z|P)

k0 1

aelse a/w(a)

z/w(z)where k counts occurrences of hPz in yield(d)

10 / 21

Page 65: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement

Goal: break independence assumptions1. making larger n-grams2. bringing g closer to f

Examples

g′(d) = g(d)q(z|hP)q(z|P)

k0 1

aelse a/w(a)

z/w(z)where k counts occurrences of hPz in yield(d)Too global: refines derivations which already score poorly

10 / 21

Page 66: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 67: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)

I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 68: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM states

I this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 69: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 70: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are split

I outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 71: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 72: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

LM state refinement1

Break independence assumptions (making larger n-grams)but not in every derivation

Example:I suppose the argmax is (X1(X2(X3the)black)cat)I and node X3 currently stores an empty LM statesI this motivates a refined node X3′ whose LM state is

the · LMState(X3)

I incoming edges to X3 are splitI outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3

∗(X3 ∗ the)z∗ with weight update q(z|the)q(z)

1[Li and Khudanpur, 2008, Heafield et al., 2013]11 / 21

Page 73: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement actions

Goal: lower g’s maximum as much as possible

Heuristic

Refine all nodes participating in the current argmaxI by extending a node’s LM (right) state by

exactly one word from yield(argmax g(d))

12 / 21

Page 74: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Refinement actions

Goal: lower g’s maximum as much as possible

Heuristic

Refine all nodes participating in the current argmaxI by extending a node’s LM (right) state by

exactly one word from yield(argmax g(d))

12 / 21

Page 75: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

13 / 21

0

5

10

15

20

(8 (6

(2 th

e) b

lack

) cat

)

(8 (6

(2 th

e) d

ark)

cat

)

(8 (6

(2 th

e) b

lack

) fel

ine)

(8 (6

(2 th

e) d

ark)

felin

e)

(5 (4

(2 th

e) c

at) b

lack

)

(5 (4

(2 th

e) c

at) d

ark)

(5 (4

(2 th

e) fe

line)

bla

ck)

(5 (4

(2 th

e) fe

line)

dar

k)

(5 (2

the)

bla

ck c

at)

(5 (7

(3 c

at) t

he) b

lack

)

(5 (7

(3 c

at) t

he) d

ark)

(5 (7

(3 fe

line)

the)

bla

ck)

(5 (7

(3 fe

line)

the)

dar

k)

f g

I argmax is (X8(X6(X2the)black)cat)

Page 76: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

13 / 21

0

5

10

15

20

(8 (6

(2"

the)

bla

ck) c

at)

(8 (6

(2"

the)

dar

k) c

at)

(8 (6

(2"

the)

bla

ck) f

elin

e)

(8 (6

(2"

the)

dar

k) fe

line)

(5 (4

(2"

the)

cat

) bla

ck)

(5 (4

(2"

the)

cat

) dar

k)

(5 (4

(2"

the)

felin

e) b

lack

)

(5 (4

(2"

the)

felin

e) d

ark)

(5 (2

" th

e) b

lack

cat

)

(5 (7

(3 c

at) t

he) b

lack

)

(5 (7

(3 c

at) t

he) d

ark)

(5 (7

(3 fe

line)

the)

bla

ck)

(5 (7

(3 fe

line)

the)

dar

k)

f g the

I the argmax is (X8(X6(X2the)black)cat)I refine spans continuing from X2 by conditioning on the

Page 77: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

13 / 21

0

5

10

15

20

(8 (6

" (2

" th

e) b

lack

) cat

)

(8 (6

(2"

the)

dar

k) c

at)

(8 (6

" (2

" th

e) b

lack

) fel

ine)

(8 (6

(2"

the)

dar

k) fe

line)

(5 (4

(2"

the)

cat

) bla

ck)

(5 (4

(2"

the)

cat

) dar

k)

(5 (4

(2"

the)

felin

e) b

lack

)

(5 (4

(2"

the)

felin

e) d

ark)

(5 (2

" th

e) b

lack

cat

)

(5 (7

(3 c

at) t

he) b

lack

)

(5 (7

(3 c

at) t

he) d

ark)

(5 (7

(3 fe

line)

the)

bla

ck)

(5 (7

(3 fe

line)

the)

dar

k)

f g the black

I the argmax is (X8(X6(X2the)black)cat)I refine spans continuing from X6 by conditioning on black

Page 78: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

13 / 21

0

5

10

15

20

(8"

(6"

(2"

the)

bla

ck) c

at)

(8"

(6 (2

" th

e) d

ark)

cat

)

(8 (6

" (2

" th

e) b

lack

) fel

ine)

(8 (6

(2"

the)

dar

k) fe

line)

(5 (4

(2"

the)

cat

) bla

ck)

(5 (4

(2"

the)

cat

) dar

k)

(5 (4

(2"

the)

felin

e) b

lack

)

(5 (4

(2"

the)

felin

e) d

ark)

(5 (2

" th

e) b

lack

cat

)

(5 (7

(3 c

at) t

he) b

lack

)

(5 (7

(3 c

at) t

he) d

ark)

(5 (7

(3 fe

line)

the)

bla

ck)

(5 (7

(3 fe

line)

the)

dar

k)

f g the black cat

I the argmax is (X8(X6(X2the)black)cat)I refine spans continuing from X8 by conditioning on cat

Page 79: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

13 / 21

0

5

10

15

20

(8"

(6"

(2"

the)

bla

ck) c

at)

(8"

(6 (2

" th

e) d

ark)

cat

)

(8 (6

" (2

" th

e) b

lack

) fel

ine)

(8 (6

(2"

the)

dar

k) fe

line)

(5 (4

(2"

the)

cat

) bla

ck)

(5 (4

(2"

the)

cat

) dar

k)

(5 (4

(2"

the)

felin

e) b

lack

)

(5 (4

(2"

the)

felin

e) d

ark)

(5 (2

" th

e) b

lack

cat

)

(5 (7

(3 c

at) t

he) b

lack

)

(5 (7

(3 c

at) t

he) d

ark)

(5 (7

(3 fe

line)

the)

bla

ck)

(5 (7

(3 fe

line)

the)

dar

k)

f g"

I obtaining a new hypergraph 〈D(x), g′(d)〉

Page 80: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Experiments

German-English (WMT14 data)I phrase extraction: 2.2M sentencesI maximum phrase length 5I maximum translation options 40I unpruned LMs: 25M sentencesI dev set: newstest2010 (LM interpolation and tuning)I batch-mira tuning (cube pruning beam 5000)I test set: newstest2012 (3,003 sentences)I distortion limit d = 4

14 / 21

Page 81: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Exact decoding

n build (s) total (s) N |V | |E |3 1.5 21 190 2.5 1594 1.5 50 350 4 2885 1.5 106 555 6.1 450

I time to build initial proposalI decoding timeI number of iterationsI size of the hypergraph (in thousands of nodes and edges)

15 / 21

Page 82: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Cube pruning

Search errors by beam size (k)

k n-gram LM3 4 5

10 2168 2347 2377102 613 999 1126103 29 102 167104 0 4 7

16 / 21

Page 83: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Translation quality with BLEU

Cube pruning and exact decoding with OS∗

k 3-gram LM 4-gram LM 5-gram LM10 20.47 20.71 20.69102 21.14 21.73 21.76103 21.27 21.89 21.91104 21.29 21.92 21.93OS∗ 21.29 21.92 21.93

17 / 21

Page 84: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decoding

I manageable timeI fraction of the search space

Search error curves for beam search and cube pruningI large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 85: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable time

I fraction of the search spaceSearch error curves for beam search and cube pruning

I large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 86: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable timeI fraction of the search space

Search error curves for beam search and cube pruningI large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 87: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable timeI fraction of the search space

Search error curves for beam search and cube pruning

I large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 88: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable timeI fraction of the search space

Search error curves for beam search and cube pruningI large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 89: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable timeI fraction of the search space

Search error curves for beam search and cube pruningI large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, however

I we demonstrate empirically that the algorithm is practicable

18 / 21

Page 90: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Conclusions

Exact decodingI manageable timeI fraction of the search space

Search error curves for beam search and cube pruningI large phrase tablesI large 5-gram LMs

Exactness at the cost of worst-case complexity, howeverI we demonstrate empirically that the algorithm is practicable

18 / 21

Page 91: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Recent developments

1. exact k-best2. exact sampling

19 / 21

Page 92: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Future work

OptimisationI early stop the searchI error safe pruning

SpeedupsI be more selective with refinementsI LR to deal with powerset constraints (allowing for higher d)I more grouping (partial edges)

Hiero models

20 / 21

Page 93: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Future work

OptimisationI early stop the searchI error safe pruning

SpeedupsI be more selective with refinementsI LR to deal with powerset constraints (allowing for higher d)I more grouping (partial edges)

Hiero models

20 / 21

Page 94: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Future work

OptimisationI early stop the searchI error safe pruning

SpeedupsI be more selective with refinementsI LR to deal with powerset constraints (allowing for higher d)I more grouping (partial edges)

Hiero models

20 / 21

Page 95: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Thanks!

Questions?

21 / 21

Page 96: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

Variable-order LM

n Nodes at level m LM states at level m0 1 2 3 4 1 2 3 4

3 0.4 1.2 0.5 - - 113 263 - -4 0.4 1.6 1.4 0.3 - 132 544 212 -5 0.4 2.1 2.4 0.7 0.1 142 790 479 103

Table : Average number of nodes (in thousands) whose LM state encodean m-gram, and average number of unique LM states of order m in thefinal hypergraph for different n-gram LMs (d = 4 everywhere).

22 / 21

Page 97: Exact decoding for phrase-based SMT › slides › emnlp2014.pdf · Exact decoding for phrase-based SMT Wilker Aziz1, Marc Dymetman2, Lucia Specia1 1University of Sheffield 2Xerox

References I

Kenneth Heafield, Philipp Koehn, and Alon Lavie. Groupinglanguage model boundary words to speed k-best extraction fromhypergraphs. In Proceedings of the 2013 Conference of theNorth American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages 958–968,Atlanta, Georgia, USA, June 2013.

Zhifei Li and Sanjeev Khudanpur. A scalable decoder forparsing-based machine translation with equivalent languagemodel state maintenance. In Proceedings of the SecondWorkshop on Syntax and Structure in Statistical Translation,SSST ’08, pages 10–18, Stroudsburg, PA, USA, 2008.Association for Computational Linguistics.

23 / 21