
Exact decoding for phrase-based SMT

Wilker Aziz1, Marc Dymetman2, Lucia Specia1

1 University of Sheffield, 2 Xerox Research Centre Europe

October 27, 2014

Outline

Introduction

Approach

Results

Conclusions

2 / 21

Decoding

Viterbi decoding

d∗ = argmax_{d ∈ D(x)} f(d) = argmax_{d ∈ D(x)} θ⊤H(d)

- D(x) is the space of translation derivations compatible with the input x
- we are looking for the best derivation under a linear parameterisation (θ ∈ R^m)

Splitting the features, θ⊤H(d) = θ1⊤H1(d) + θ2⊤H2(d):
- “local” features assess steps in a derivation independently: H1(d) = ∑_{e∈d} h1(e)
- “nonlocal” features make weaker independence assumptions, e.g. HLM(d) = log pLM(yield(d))

1 / 21
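The linear decomposition above can be made concrete with a toy scorer. This is an illustrative sketch, not the authors' code: the feature vectors, the unigram stand-in for pLM, and all names are invented.

```python
import math

def score(derivation, theta_local, theta_lm, lm):
    # "local" features: each edge e carries its own vector h1(e),
    # so H1(d) = sum of h1(e) over edges -- each step scored independently
    h1 = [sum(e["h1"][i] for e in derivation["edges"])
          for i in range(len(theta_local))]
    local = sum(t * h for t, h in zip(theta_local, h1))
    # "nonlocal" feature: H_LM(d) = log p_LM(yield(d)) needs the whole yield
    # (a unigram product stands in for a real n-gram LM here)
    h_lm = sum(math.log(lm.get(w, 1e-6)) for w in derivation["yield"])
    return local + theta_lm * h_lm

lm = {"the": 0.3, "black": 0.1, "cat": 0.2}
d = {"edges": [{"h1": [1.0, 0.5]}, {"h1": [0.0, 1.0]}],
     "yield": ["the", "black", "cat"]}
print(score(d, [0.5, -0.2], 1.0, lm))
```

The local part is a sum over edges, so it can be maximised by dynamic programming; the LM term couples the whole yield, which is what makes decoding hard.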

Complexity

⟨D(x), f(d)⟩ can be seen as the intersection between
- a translation hypergraph G(x), locally parameterised
- and a target language model A, represented as a wFSA

For phrase-based SMT with a distortion limit d and an n-gram LM over vocabulary Δ (I is the input length):

|G(x)| ∝ I² 2^d        |A| ∝ |Δ|^(n−1)

Problem: the intersection is too large for standard dynamic programming.

2 / 21
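The blow-up is easy to see numerically. A back-of-the-envelope sketch using the proportionalities above; the concrete numbers (I, d, vocabulary size) are illustrative assumptions, not figures from the paper.

```python
def hypergraph_size(I, d):
    # translation hypergraph with distortion limit d: |G(x)| ~ I^2 * 2^d
    return I**2 * 2**d

def lm_automaton_size(vocab, n):
    # wFSA for an n-gram LM: |A| ~ |Delta|^(n-1) states
    return vocab**(n - 1)

I, d, vocab, n = 20, 4, 50_000, 5
print(hypergraph_size(I, d))                                # manageable on its own
print(lm_automaton_size(vocab, n))                          # astronomically large
print(hypergraph_size(I, d) * lm_automaton_size(vocab, n))  # the full intersection
```

Even a 20-word sentence with a modest distortion limit yields an intersection far beyond what dynamic programming can enumerate once a 5-gram LM is involved.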

Contribution

Previous work on exact decoding:
- compact models
- simpler parameterisation using 3-gram LMs

This work:
- large models
- realistic 5-gram LMs

3 / 21

Intuition

Full intersection is wasteful:

G(x) ∩ A = G(x) ∩ ( ⋂_{i=1}^{M} A(i) )

where A is the complete n-gram LM and each A(i) encodes one n-gram.

Assumption: not every n-gram participates in high-scoring derivations.

Problem: which n-grams are really necessary? And at which level of refinement?

Strategy: start with strong independence assumptions; revisit those assumptions as necessary.

4 / 21

OS∗: illustration

[Figure: bar chart of model scores (0–20) for candidate translations such as “the black cat”, “the dark cat”, “the black feline”, “the cat black”, …, comparing the true objective f with successive proposals g0, g1, …, g5.]

- we are interested in finding f’s argmax; however, we cannot search through f directly
- so we upperbound the complex target f(d) by a simpler proposal g(d)
- we find d∗ = argmax_d g(d) and assess f(d∗): this gives our best solution thus far (according to f), but because the proxy’s maximum may not be f’s, we cannot guarantee exactness yet; g(d∗) − f(d∗) bounds the maximum error
- we therefore refine g so as to bring it closer to f, e.g. by making g(d∗) = f(d∗), at the cost of a small complexity increase (extra nodes and edges)
- as a consequence, g’s argmax may change; we solve d∗ = argmax_d g(d) and assess f(d∗) again (our best solution thus far might remain unchanged)
- we still cannot guarantee exactness, even though the maximum error is now smaller
- but we can continue refining g for as long as g and f disagree on the maximum, until we have a certificate of optimality

5 / 21

Decoding algorithm

1: function Optimise(g, ε)
2:   d∗ ← argmax_d g(d)            ▷ proxy’s argmax
3:   g∗ ← g(d∗)
4:   f∗ ← f(d∗)                    ▷ observe a point in f
5:   while g∗ − f∗ ≥ ε do          ▷ ε is the maximum error
6:     A ← actions(g, d∗)          ▷ collect refinement actions
7:     g ← refine(g, A)            ▷ update proposal
8:     d∗ ← argmax_d g(d)          ▷ update argmax
9:     g∗ ← g(d∗)
10:    f∗ ← max(f∗, f(d∗))         ▷ update “best so far”
11:  end while
12:  return g, d∗
13: end function

6 / 21
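The loop above can be exercised on a toy problem where f and its optimistic proxy g are explicit score tables. In this sketch a “refinement” simply pins g down to f at the proxy’s current argmax, whereas the paper refines a hypergraph; only the control flow is shared, and all values are invented.

```python
def optimise(f, g, candidates, eps=1e-9):
    g = dict(g)                                   # proposal: pointwise upperbound of f
    d_star = max(candidates, key=lambda d: g[d])  # proxy's argmax
    f_star = f[d_star]                            # observe a point in f
    while g[d_star] - f_star >= eps:              # eps bounds the maximum error
        g[d_star] = f[d_star]                     # refine: make g(d*) = f(d*)
        d_star = max(candidates, key=lambda d: g[d])  # update argmax
        f_star = max(f_star, f[d_star])           # update "best so far"
    return d_star, f_star                         # certificate: g's max meets f

f = {"a": 10.0, "b": 12.0, "c": 7.0}
g = {"a": 15.0, "b": 13.0, "c": 9.0}              # g >= f everywhere
print(optimise(f, g, ["a", "b", "c"]))
```

On this example the loop first lowers g at "a", discovers "b", pins g at "b", and terminates with f’s true argmax and a certificate that no candidate can score higher.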

Proposal

In g, the true LM pLM is replaced by an upperbound qLM with stronger independence assumptions.

- Suppose α = y_I^J is a substring of y = y_1^M,
  e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

- contribution of α to the true LM score of y:

  pLM(α) ≡ ∏_{k=I}^{J} p(y_k | y_1^{k−1})

  e.g. p(black2 | BOS0 the1) · p(cat3 | BOS0 the1 black2)

- upperbound to pLM(α):

  qLM(α) ≡ q(y_I | ε) ∏_{k=I+1}^{J} q(y_k | y_I^{k−1})

  e.g. q(black2 | ε) · q(cat3 | black2)

where q(z|P) ≡ max_{H∈Δ∗} p(z|HP)

7 / 21
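The upperbound property can be checked by brute force on a toy trigram LM (all probabilities below are invented, not from the paper). For a trigram model, only the last two words of H matter, so histories of length 2 suffice for the max.

```python
import itertools

VOCAB = ["BOS", "the", "black", "cat"]
P3 = {  # hand-made trigram table p(z | u v), with a uniform fallback
    ("BOS", "the", "black"): 0.5,
    ("BOS", "the", "cat"): 0.2,
    ("the", "black", "cat"): 0.7,
}

def p(z, hist):                       # trigram conditional with fallback
    return P3.get((*hist[-2:], z), 0.1)

def q(z, partial):                    # most optimistic context H for z given partial P
    best = p(z, partial)              # H = empty
    for h in itertools.product(VOCAB, repeat=2):  # bounded brute force over H
        best = max(best, p(z, list(h) + partial))
    return best

# q(black|eps) * q(cat|black) upperbounds p(black|BOS the) * p(cat|BOS the black)
q_score = q("black", []) * q("cat", ["black"])
p_score = p("black", ["BOS", "the"]) * p("cat", ["BOS", "the", "black"])
assert q_score >= p_score
print(q_score, p_score)
```

Because each factor of qLM maximises over the unseen context, the product dominates the true chained probability for any actual history, which is exactly what makes g an admissible proxy.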

Max-ARPA

An n-gram is scored in its most optimistic context H:

q(z|P) ≡ max_{H∈Δ∗} p(z|HP)

This is computed efficiently using a Max-ARPA table M:
- start with an ARPA table:
  n-gram Pz — conditional log p(z|P) — backoff b(Pz)
- compute an upperbound view of the last two columns:
  n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then, for an arbitrary n-gram Pz:

q(z|P) = p(z|P)          if Pz ∉ M and P ∉ M
       = p(z|P) × m(P)   if Pz ∉ M and P ∈ M
       = q(z|P)          if Pz ∈ M (stored max-conditional)

8 / 21
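The three-way dispatch can be sketched as a lookup routine in log space. The table entries below are invented toy values, a unigram fallback stands in for the full backed-off p(z|P), and constructing the max-conditional and max-backoff columns from a real ARPA file is omitted.

```python
LOG_P = {("the",): -1.0, ("the", "cat"): -0.5}   # log p(z|P), keyed by P + (z,)
MAX_Q = {("the",): -0.8, ("the", "cat"): -0.5}   # stored log q(z|P) (max-conditional)
MAX_B = {("the",): 0.3}                          # log max-backoff m(P)

def log_q(z, P):
    """log q(z|P) via the Max-ARPA three-case rule (log domain, so x becomes +)."""
    ngram = tuple(P) + (z,)
    if ngram in MAX_Q:                   # case Pz in M: stored max-conditional
        return MAX_Q[ngram]
    if tuple(P) in MAX_B:                # case P in M: back off optimistically
        return LOG_P.get((z,), -10.0) + MAX_B[tuple(P)]
    return LOG_P.get((z,), -10.0)        # neither in M: plain conditional

print(log_q("cat", ["the"]))   # stored entry
print(log_q("dog", ["the"]))   # backs off through m("the")
print(log_q("dog", ["a"]))     # unseen context, unseen n-gram
```

The point of the max-backoff m(P) is that it is precomputed once per context, so each q(z|P) query stays a constant number of table lookups, just like a standard ARPA query.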

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph.

Remark: nodes do not store LM state.

Compare the sizes:

|⟨D(x), f(d)⟩| = |G(x) ∩ A| ∝ I² 2^d |Δ|^(n−1)

|⟨D(x), g(d)⟩| = |G(x) ∩ A_upperbound| ∝ I² 2^d

(the |Δ|^(n−1) factor disappears because, under the upperbound, nodes carry no LM context)

9 / 21

Refinement

Goal: break independence assumptions by
1. making larger n-grams
2. bringing g closer to f

Examples

g′(d) = { g(d)  if d ≠ argmax_d g(d)
        { f(d)  otherwise

Too local: one derivation at a time.

g′(d) = g(d) · ( q(z|hP) / q(z|P) )^k

where k counts occurrences of hPz in yield(d)
[Figure: a two-state wFSA implementing the count, with transitions a/w(a) (else) and z/w(z)]

Too global: refines derivations which already score poorly.

10 / 21
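The second rule’s multiplicative correction is easy to state numerically. A minimal sketch with invented scores: extending the context from P to hP can only make q less optimistic, so each occurrence of hPz multiplies g(d) by a ratio at most 1.

```python
def refined_score(g_d, q_z_hP, q_z_P, k):
    """g'(d) = g(d) * (q(z|hP) / q(z|P)) ** k,
    where k counts occurrences of the extended n-gram hPz in yield(d)."""
    return g_d * (q_z_hP / q_z_P) ** k

# invented numbers: q(z|hP) = 0.2 <= q(z|P) = 0.5, two occurrences of hPz
print(refined_score(0.8, 0.2, 0.5, 2))
```

Since q(z|hP) ≤ q(z|P) by definition of the max, the refined proposal never rises, which is why applying it everywhere wastes effort on derivations that already score poorly.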

LM state refinement [Li and Khudanpur, 2008, Heafield et al., 2013]

Break independence assumptions (making larger n-grams), but not in every derivation.

Example:
- suppose the argmax is (X1 (X2 (X3 the) black) cat)
- node X3 currently stores an empty LM state
- this motivates a refined node X3′ whose LM state is the · LMState(X3)
- incoming edges to X3 are split
- outgoing edges from X3′ are reweighted copies of those leaving X3

This corresponds to an intersection local to X3:

∗ (X3 ∗ the) z ∗   with weight update q(z|the) / q(z)

11 / 21

Refinement actions

Goal: lower g’s maximum as much as possible.

Heuristic: refine all nodes participating in the current argmax, by extending each node’s LM (right) state by exactly one word from yield(argmax_d g(d)).

12 / 21

13 / 21

[Figure: bar chart of scores (0–20) of f and g over bracketed derivations such as (8 (6 (2 the) black) cat), (8 (6 (2 the) dark) cat), …, before and after each refinement step.]

- the argmax is (X8 (X6 (X2 the) black) cat)
- refine spans continuing from X2 by conditioning on “the”
- refine spans continuing from X6 by conditioning on “black”
- refine spans continuing from X8 by conditioning on “cat”
- this yields a new hypergraph ⟨D(x), g′(d)⟩

Experiments

German–English (WMT14 data)
- phrase extraction: 2.2M sentences
- maximum phrase length: 5
- maximum translation options: 40
- unpruned LMs: 25M sentences
- dev set: newstest2010 (LM interpolation and tuning)
- batch-mira tuning (cube pruning, beam 5000)
- test set: newstest2012 (3,003 sentences)
- distortion limit d = 4

14 / 21

Exact decoding

n   build (s)   total (s)   N     |V|   |E|
3   1.5         21          190   2.5   159
4   1.5         50          350   4.0   288
5   1.5         106         555   6.1   450

- build: time to build the initial proposal
- total: decoding time
- N: number of iterations
- |V|, |E|: size of the final hypergraph (in thousands of nodes and edges)

15 / 21

Cube pruning

Search errors by beam size (k)

k      3-gram   4-gram   5-gram
10     2168     2347     2377
10²    613      999      1126
10³    29       102      167
10⁴    0        4        7

16 / 21

Translation quality with BLEU

Cube pruning and exact decoding with OS∗

k      3-gram LM   4-gram LM   5-gram LM
10     20.47       20.71       20.69
10²    21.14       21.73       21.76
10³    21.27       21.89       21.91
10⁴    21.29       21.92       21.93
OS∗    21.29       21.92       21.93

17 / 21

Conclusions

Exact decoding
- in manageable time
- visiting only a fraction of the search space

Search error curves for beam search and cube pruning
- with large phrase tables
- and large 5-gram LMs

Exactness comes at the cost of worst-case complexity; however, we demonstrate empirically that the algorithm is practicable.

18 / 21

Recent developments

1. exact k-best2. exact sampling

19 / 21

Future work

Optimisation
- early stopping of the search
- error-safe pruning

Speedups
- be more selective with refinements
- LR to deal with powerset constraints (allowing for a higher distortion limit d)
- more grouping (partial edges)

Hiero models

20 / 21

Thanks!

Questions?

21 / 21

Variable-order LM

n   Nodes at level m (×1000)         LM states at level m
    m=0   m=1   m=2   m=3   m=4      m=1   m=2   m=3   m=4
3   0.4   1.2   0.5   -     -        113   263   -     -
4   0.4   1.6   1.4   0.3   -        132   544   212   -
5   0.4   2.1   2.4   0.7   0.1      142   790   479   103

Table: Average number of nodes (in thousands) whose LM state encodes an m-gram, and average number of unique LM states of order m in the final hypergraph, for different n-gram LMs (d = 4 everywhere).

22 / 21

References I

Kenneth Heafield, Philipp Koehn, and Alon Lavie. Grouping language model boundary words to speed k-best extraction from hypergraphs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 958–968, Atlanta, Georgia, USA, June 2013.

Zhifei Li and Sanjeev Khudanpur. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation, SSST ’08, pages 10–18, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

23 / 21