Exact decoding for phrase-based SMT
Wilker Aziz1, Marc Dymetman2, Lucia Specia1
1University of Sheffield2Xerox Research Centre Europe
October 27, 2014
Outline
Introduction
Approach
Results
Conclusions
2 / 21
Decoding

Viterbi decoding

d∗ = argmax_{d ∈ D(x)} f(d) = argmax_{d ∈ D(x)} θᵀH(d) = argmax_{d ∈ D(x)} θ₁ᵀH₁(d) + θ₂ᵀH₂(d)

▶ D(x) is the space of translation derivations compatible with the input x
▶ we are looking for the best derivation under a linear parameterisation (θ ∈ ℝᵐ)
▶ "local" features assess steps in a derivation independently: H₁(d) = Σ_{e∈d} h₁(e)
▶ "nonlocal" features make weaker independence assumptions, e.g. H_LM(d) = log p_LM(yield(d))

1 / 21
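The linear parameterisation above can be sketched as follows; this is a hypothetical toy (edges and feature values are invented, not from the slides) showing how a derivation's score decomposes over local features.

```python
# Toy sketch of the linear model score theta^T H(d), where the
# "local" feature vector decomposes as H1(d) = sum of h1(e) over
# the edges e of a derivation d.

def derivation_score(derivation, theta):
    m = len(theta)
    H1 = [0.0] * m
    for h1 in derivation:           # each edge contributes its h1(e)
        for j in range(m):
            H1[j] += h1[j]
    return sum(t * h for t, h in zip(theta, H1))

theta = [0.5, -1.0]                 # one weight per feature
d = [[1.0, 0.2], [2.0, 0.1]]        # two edges with their h1 vectors
print(round(derivation_score(d, theta), 6))  # 0.5*3.0 - 1.0*0.3 = 1.2
```

Nonlocal features such as H_LM break this edge-wise decomposition, which is what makes exact search hard.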
Complexity

〈D(x), f(d)〉 can be seen as the intersection between
▶ a translation hypergraph G(x), locally parameterised
▶ and a target language model A, as a wFSA

Phrase-based SMT with a distortion limit d and an n-gram LM:

|G(x)| ∝ I² 2ᵈ        |A| ∝ |∆|ⁿ⁻¹

Problem: the intersection is too large for standard dynamic programming

2 / 21
Contribution

Previous work on exact decoding
▶ compact models
▶ simpler parameterisation using 3-gram LMs

This work
▶ large models
▶ realistic 5-gram LMs

3 / 21
Intuition
Full intersection is wasteful

G(x) ∩ A = G(x) ∩ (⋂_{i=1}^{M} A⁽ⁱ⁾)

▶ A is the complete n-gram LM; each A⁽ⁱ⁾ encodes one n-gram

Assumption: not every n-gram participates in high-scoring derivations
Problem: which n-grams are really necessary? And at which level of refinement?
Strategy: start with strong independence assumptions; revisit those assumptions as necessary

4 / 21
OS∗: illustration

[Figure: bar charts of candidate translations (e.g. "the black cat", "the dark cat", "the black feline", "the cat black", "cat the black", …) scored by the true objective f and by successive proposals g₀ … g₅; the y-axis shows scores from 0 to 20.]

▶ we upper-bound the complex target f(d) by a simpler proposal g(d)
▶ we are interested in finding f's argmax; however, we cannot search through f directly
▶ so we find d∗ = argmax_d g(d), the best candidate according to the proxy, and assess f(d∗)
  ▶ this gives our best solution thus far (according to f)
  ▶ however, we cannot yet guarantee exactness: g(d∗) − f(d∗) bounds the maximum error
▶ so we refine g to bring it closer to f, e.g. by making g(d∗) = f(d∗)
  ▶ at the cost of a small complexity increase (extra nodes and edges)
▶ as a consequence, g's argmax may have changed; we solve d∗ = argmax_d g(d) and assess f(d∗) again
  ▶ our best solution thus far might remain unchanged
▶ we still cannot guarantee exactness, even though the maximum error is smaller
▶ but we can continue refining g for as long as g and f disagree on the maximum, until we have a certificate of optimality

5 / 21
Decoding algorithm

1:  function Optimise(g, ε)
2:    d∗ ← argmax_d g(d)          ▷ proxy's argmax
3:    g∗ ← g(d∗)
4:    f∗ ← f(d∗)                  ▷ observe a point in f
5:    while (g∗ − f∗ ≥ ε) do      ▷ ε is the maximum error
6:      A ← actions(g, d∗)        ▷ collect refinement actions
7:      g ← refine(g, A)          ▷ update proposal
8:      d∗ ← argmax_d g(d)        ▷ update argmax
9:      g∗ ← g(d∗)
10:     f∗ ← max(f∗, f(d∗))       ▷ update "best so far"
11:   end while
12:   return g, d∗
13: end function

6 / 21
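The loop above can be sketched over an explicit candidate list. This is a toy stand-in: in the paper the argmax is solved by dynamic programming over a hypergraph and refinement rewrites that hypergraph, whereas here `refine` simply pins the proxy to the true score at the current argmax, and the loop uses a strict gap test so that ε = 0 terminates with an exact certificate.

```python
# Toy OS*-style optimisation: g upper-bounds f; refine g at its
# argmax until g and f agree on the maximum (gap <= eps).

def optimise(candidates, f, g, eps=0.0):
    def refine(g, d0):
        # pin g to the true score at d0; g stays an upper bound of f
        # on every other candidate
        return lambda d, g=g, d0=d0: f(d) if d == d0 else g(d)

    d_star = max(candidates, key=g)
    f_best = f(d_star)
    while g(d_star) - f_best > eps:   # certificate once the gap closes
        g = refine(g, d_star)
        d_star = max(candidates, key=g)
        f_best = max(f_best, f(d_star))
    return d_star, f_best

f = {"a": 3, "b": 5, "c": 4}.__getitem__    # true scores
g0 = {"a": 7, "b": 6, "c": 9}.__getitem__   # optimistic proxy
print(optimise(["a", "b", "c"], f, g0))     # ('b', 5)
```

Note that candidate "b", the true optimum, is only discovered after the proxy's overestimates of "c" and "a" have been refined away.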
Proposal
In g, the true LM p_LM is replaced by an upper bound q_LM with stronger independence assumptions.

▶ Suppose α = y_I^J is a substring of y = y_1^M
  e.g. α = black₂ cat₃ in y = BOS₀ the₁ black₂ cat₃ EOS₄
▶ contribution of α to the true LM score of y:

  p_LM(α) ≡ ∏_{k=I}^{J} p(y_k | y_1^{k−1})

  e.g. p(black₂ | BOS₀ the₁) p(cat₃ | BOS₀ the₁ black₂)
▶ upper bound to p_LM(α):

  q_LM(α) ≡ q(y_I | ε) ∏_{k=I+1}^{J} q(y_k | y_I^{k−1})

  e.g. q(black₂ | ε) q(cat₃ | black₂)

where q(z|P) ≡ max_{H∈∆∗} p(z|HP)

7 / 21
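The upper-bound product q_LM(α) can be sketched as below. The table `q` is a toy stand-in for the Max-ARPA computation of the next slide, and the names are illustrative; note the real proposal also truncates contexts to the LM order n−1, which this sketch omits.

```python
import math

def q_lm_log(alpha, q):
    # q maps (word, in-phrase context tuple) -> max-conditional log-prob:
    # q_LM(alpha) = q(y_I | eps) * prod_{k>I} q(y_k | y_I .. y_{k-1})
    total = q[(alpha[0], ())]                      # q(y_I | epsilon)
    for k in range(1, len(alpha)):
        total += q[(alpha[k], tuple(alpha[:k]))]   # q(y_k | y_I..y_{k-1})
    return total

# toy max-conditionals for the slide's example alpha = black cat
q = {("black", ()): math.log(0.3),
     ("cat", ("black",)): math.log(0.6)}
print(round(q_lm_log(("black", "cat"), q), 4))     # log(0.18) = -1.7148
```

Because each factor maximises over unseen left context H, q_LM(α) ≥ p_LM(α) for any completion of the sentence, which is what makes g a sound upper bound on f.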
Max-ARPA
An n-gram is scored in its most optimistic context H:

q(z|P) ≡ max_{H∈∆∗} p(z|HP)

Efficiently computed using a Max-ARPA table M
▶ start with an ARPA table:
  n-gram Pz — conditional log p(z|P) — backoff b(Pz)
▶ compute an upper-bound view of the last 2 columns:
  n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then for an arbitrary n-gram Pz,

q(z|P) = p(z|P)           if Pz ∉ M and P ∉ M
       = p(z|P) × m(P)    if Pz ∉ M and P ∈ M
       = q(z|P)           if Pz ∈ M  (the stored max-conditional)

8 / 21
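The three-case dispatch can be sketched as a lookup; the table layout and names are illustrative assumptions (the slide does not fix a data structure), and `p_log` stands in for the ordinary backed-off LM probability.

```python
import math

def q_log(z, P, M, p_log):
    # Upper-bound log q(z|P) = max over histories H of log p(z|H P),
    # via a Max-ARPA table M: ngram -> (max-conditional log q,
    #                                   max-backoff log m)
    ngram = P + (z,)
    if ngram in M:
        return M[ngram][0]              # stored max-conditional
    if P in M:
        return p_log(z, P) + M[P][1]    # p(z|P) * max-backoff m(P)
    return p_log(z, P)                  # neither listed in the table

# toy table and stand-in backed-off LM
M = {("black", "cat"): (math.log(0.6), 0.0),
     ("the",):         (math.log(0.2), math.log(0.9))}
p_log = lambda z, P: math.log(0.1)

print(round(q_log("cat", ("black",), M, p_log), 3))  # -0.511, stored entry
print(round(q_log("dog", ("the",),   M, p_log), 3))  # -2.408, via m("the")
```

The point of the precomputed table is that each query costs one or two hash lookups, so scoring the proposal hypergraph stays cheap.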
Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph.

Remark! Nodes do not store LM state.

|〈D(x), f(d)〉| = |G(x) ∩ A| ∝ I² 2ᵈ |∆|ⁿ⁻¹
|〈D(x), g(d)〉| = |G(x) ∩ A| ∝ I² 2ᵈ    (with A replaced by the upper bound, the |∆|ⁿ⁻¹ factor is struck out)

9 / 21
Refinement

Goal: break independence assumptions
1. making larger n-grams
2. bringing g closer to f

Examples

g′(d) = { g(d)  if d ≠ argmax_d g(d)
        { f(d)  otherwise

Too local: one derivation at a time

g′(d) = g(d) · (q(z|hP) / q(z|P))ᵏ, where k counts occurrences of hPz in yield(d)

[Figure: a two-state automaton (states 0 and 1) implementing the count, with a transition z/w(z) and a default transition a/w(a) for every other symbol a.]

Too global: refines derivations which already score poorly

10 / 21
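The "too global" correction can be sketched in log space; the occurrence count k and the weight update are as on the slide, while the function names and toy numbers are illustrative assumptions.

```python
import math

def count_ngram(words, ngram):
    # k: number of occurrences of the extended n-gram hPz in yield(d)
    n = len(ngram)
    return sum(1 for i in range(len(words) - n + 1)
               if tuple(words[i:i + n]) == tuple(ngram))

def refined_log_score(log_g, yield_words, hPz, logq_ext, logq_old):
    # g'(d) = g(d) * (q(z|hP) / q(z|P))**k, in log space: each
    # occurrence swaps the old factor q(z|P) for the sharper q(z|hP)
    k = count_ngram(yield_words, hPz)
    return log_g + k * (logq_ext - logq_old)

y = ["the", "black", "cat", "and", "the", "black", "cat"]
print(count_ngram(y, ("the", "black", "cat")))  # 2
print(round(refined_log_score(0.0, y, ("the", "black", "cat"),
                              math.log(0.5), math.log(0.25)), 4))
```

Since q(z|hP) ≤ q(z|P), the update can only lower a derivation's proxy score, which is why applying it to every derivation containing hPz wastes effort on derivations that already score poorly.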
LM state refinement¹

Break independence assumptions (making larger n-grams), but not in every derivation.

Example:
▶ suppose the argmax is (X1 (X2 (X3 the) black) cat)
▶ and node X3 currently stores an empty LM state
▶ this motivates a refined node X3′ whose LM state is the · LMState(X3)
▶ incoming edges to X3 are split
▶ outgoing edges from X3′ are reweighted copies of those leaving X3

This is connected to an intersection local to X3:
∗(X3 ∗ the) z ∗ with weight update q(z|the) / q(z)

¹[Li and Khudanpur, 2008; Heafield et al., 2013]

11 / 21
Refinement actions

Goal: lower g's maximum as much as possible

Heuristic
Refine all nodes participating in the current argmax
▶ by extending a node's LM (right) state by exactly one word from yield(argmax_d g(d))

12 / 21
13 / 21

[Figure: bar charts of candidate derivations, e.g. (8 (6 (2 the) black) cat), (8 (6 (2 the) dark) cat), (5 (4 (2 the) cat) black), (5 (7 (3 feline) the) dark), …, scored by f and by the proposal g as it is successively refined with the contexts "the", "the black", "the black cat".]

▶ the argmax is (X8 (X6 (X2 the) black) cat)
▶ refine spans continuing from X2 by conditioning on "the"
▶ refine spans continuing from X6 by conditioning on "black"
▶ refine spans continuing from X8 by conditioning on "cat"
▶ obtaining a new hypergraph 〈D(x), g′(d)〉
Experiments
German-English (WMT14 data)
▶ phrase extraction: 2.2M sentences
▶ maximum phrase length: 5
▶ maximum translation options: 40
▶ unpruned LMs: 25M sentences
▶ dev set: newstest2010 (LM interpolation and tuning)
▶ batch-mira tuning (cube pruning beam 5000)
▶ test set: newstest2012 (3,003 sentences)
▶ distortion limit d = 4
14 / 21
Exact decoding
n   build (s)   total (s)   N     |V|   |E|
3   1.5         21          190   2.5   159
4   1.5         50          350   4.0   288
5   1.5         106         555   6.1   450

▶ build (s): time to build the initial proposal
▶ total (s): decoding time
▶ N: number of iterations
▶ |V|, |E|: size of the hypergraph (in thousands of nodes and edges)
15 / 21
Cube pruning
Search errors by beam size (k)
Search errors by beam size (k)

        n-gram LM
k       3       4       5
10      2168    2347    2377
10²     613     999     1126
10³     29      102     167
10⁴     0       4       7
16 / 21
Translation quality with BLEU
Cube pruning and exact decoding with OS∗
k      3-gram LM   4-gram LM   5-gram LM
10     20.47       20.71       20.69
10²    21.14       21.73       21.76
10³    21.27       21.89       21.91
10⁴    21.29       21.92       21.93
OS∗    21.29       21.92       21.93
17 / 21
Conclusions

Exact decoding
▶ manageable time
▶ fraction of the search space

Search error curves for beam search and cube pruning
▶ large phrase tables
▶ large 5-gram LMs

Exactness at the cost of worst-case complexity, however
▶ we demonstrate empirically that the algorithm is practicable

18 / 21
Recent developments
1. exact k-best
2. exact sampling
19 / 21
Future work

Optimisation
▶ early-stop the search
▶ error-safe pruning

Speedups
▶ be more selective with refinements
▶ LR to deal with powerset constraints (allowing for higher d)
▶ more grouping (partial edges)

Hiero models

20 / 21
Thanks!
Questions?
21 / 21
Variable-order LM
     Nodes at level m                  LM states at level m
n    0     1     2     3     4         1     2     3     4
3    0.4   1.2   0.5   -     -         113   263   -     -
4    0.4   1.6   1.4   0.3   -         132   544   212   -
5    0.4   2.1   2.4   0.7   0.1       142   790   479   103

Table: Average number of nodes (in thousands) whose LM state encodes an m-gram, and average number of unique LM states of order m in the final hypergraph, for different n-gram LMs (d = 4 everywhere).
22 / 21
References I
Kenneth Heafield, Philipp Koehn, and Alon Lavie. Grouping language model boundary words to speed k-best extraction from hypergraphs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 958-968, Atlanta, Georgia, USA, June 2013.

Zhifei Li and Sanjeev Khudanpur. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation, SSST '08, pages 10-18, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
23 / 21