
Exact decoding for phrase-based SMT

Wilker Aziz1, Marc Dymetman2, Lucia Specia1

1 University of Sheffield, 2 Xerox Research Centre Europe

October 27, 2014

Outline

Introduction

Approach

Results

Conclusions

2 / 21

Decoding

Viterbi decoding

d∗ = argmax_{d ∈ D(x)} f(d) = argmax_{d ∈ D(x)} θ⊤H(d)

- D(x) is the space of translation derivations compatible with the input x
- we are looking for the best derivation under a linear parameterisation (θ ∈ R^m)

Splitting the features, θ⊤H(d) = θ1⊤H1(d) + θ2⊤H2(d):
- “local” features assess steps in a derivation independently: H1(d) = ∑_{e∈d} h1(e)
- “nonlocal” features make weaker independence assumptions, e.g. HLM(d) = log pLM(yield(d))

1 / 21
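The linear decomposition above can be made concrete with a toy scorer. This is an illustrative sketch, not the authors' code: the feature vectors, the unigram stand-in for pLM, and all names are invented.

```python
import math

def score(derivation, theta_local, theta_lm, lm):
    # "local" features: each edge e carries its own vector h1(e),
    # so H1(d) = sum of h1(e) over edges -- each step scored independently
    h1 = [sum(e["h1"][i] for e in derivation["edges"])
          for i in range(len(theta_local))]
    local = sum(t * h for t, h in zip(theta_local, h1))
    # "nonlocal" feature: H_LM(d) = log p_LM(yield(d)) needs the whole yield
    # (a unigram product stands in for a real n-gram LM here)
    h_lm = sum(math.log(lm.get(w, 1e-6)) for w in derivation["yield"])
    return local + theta_lm * h_lm

lm = {"the": 0.3, "black": 0.1, "cat": 0.2}
d = {"edges": [{"h1": [1.0, 0.5]}, {"h1": [0.0, 1.0]}],
     "yield": ["the", "black", "cat"]}
print(score(d, [0.5, -0.2], 1.0, lm))
```

The local part is a sum over edges, so it can be maximised by dynamic programming; the LM term couples the whole yield, which is what makes decoding hard.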

Complexity

⟨D(x), f(d)⟩ can be seen as the intersection between
- a translation hypergraph G(x), locally parameterised
- and a target language model A, represented as a wFSA

For phrase-based SMT with a distortion limit d and an n-gram LM over vocabulary Δ (I is the input length):

|G(x)| ∝ I² 2^d        |A| ∝ |Δ|^(n−1)

Problem: the intersection is too large for standard dynamic programming.

2 / 21
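The blow-up is easy to see numerically. A back-of-the-envelope sketch using the proportionalities above; the concrete numbers (I, d, vocabulary size) are illustrative assumptions, not figures from the paper.

```python
def hypergraph_size(I, d):
    # translation hypergraph with distortion limit d: |G(x)| ~ I^2 * 2^d
    return I**2 * 2**d

def lm_automaton_size(vocab, n):
    # wFSA for an n-gram LM: |A| ~ |Delta|^(n-1) states
    return vocab**(n - 1)

I, d, vocab, n = 20, 4, 50_000, 5
print(hypergraph_size(I, d))                                # manageable on its own
print(lm_automaton_size(vocab, n))                          # astronomically large
print(hypergraph_size(I, d) * lm_automaton_size(vocab, n))  # the full intersection
```

Even a 20-word sentence with a modest distortion limit yields an intersection far beyond what dynamic programming can enumerate once a 5-gram LM is involved.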

Contribution

Previous work on exact decoding:
- compact models
- simpler parameterisation using 3-gram LMs

This work:
- large models
- realistic 5-gram LMs

3 / 21

Intuition

Full intersection is wasteful:

G(x) ∩ A = G(x) ∩ ( ⋂_{i=1}^{M} A(i) )

where A is the complete n-gram LM and each A(i) encodes one n-gram.

Assumption: not every n-gram participates in high-scoring derivations.

Problem: which n-grams are really necessary? And at which level of refinement?

Strategy: start with strong independence assumptions; revisit those assumptions as necessary.

4 / 21

OS∗: illustration

[Figure: bar chart of model scores (0–20) for candidate translations such as “the black cat”, “the dark cat”, “the black feline”, “the cat black”, …, comparing the true objective f with successive proposals g0, g1, …, g5.]

- we are interested in finding f’s argmax; however, we cannot search through f directly
- so we upperbound the complex target f(d) by a simpler proposal g(d)
- we find d∗ = argmax_d g(d) and assess f(d∗): this gives our best solution thus far (according to f), but because the proxy’s maximum may not be f’s, we cannot guarantee exactness yet; g(d∗) − f(d∗) bounds the maximum error
- we therefore refine g so as to bring it closer to f, e.g. by making g(d∗) = f(d∗), at the cost of a small complexity increase (extra nodes and edges)
- as a consequence, g’s argmax may change; we solve d∗ = argmax_d g(d) and assess f(d∗) again (our best solution thus far might remain unchanged)
- we still cannot guarantee exactness, even though the maximum error is now smaller
- but we can continue refining g for as long as g and f disagree on the maximum, until we have a certificate of optimality

5 / 21

Decoding algorithm

1: function Optimise(g, ε)
2:   d∗ ← argmax_d g(d)            ▷ proxy’s argmax
3:   g∗ ← g(d∗)
4:   f∗ ← f(d∗)                    ▷ observe a point in f
5:   while g∗ − f∗ ≥ ε do          ▷ ε is the maximum error
6:     A ← actions(g, d∗)          ▷ collect refinement actions
7:     g ← refine(g, A)            ▷ update proposal
8:     d∗ ← argmax_d g(d)          ▷ update argmax
9:     g∗ ← g(d∗)
10:    f∗ ← max(f∗, f(d∗))         ▷ update “best so far”
11:  end while
12:  return g, d∗
13: end function

6 / 21
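The loop above can be exercised on a toy problem where f and its optimistic proxy g are explicit score tables. In this sketch a “refinement” simply pins g down to f at the proxy’s current argmax, whereas the paper refines a hypergraph; only the control flow is shared, and all values are invented.

```python
def optimise(f, g, candidates, eps=1e-9):
    g = dict(g)                                   # proposal: pointwise upperbound of f
    d_star = max(candidates, key=lambda d: g[d])  # proxy's argmax
    f_star = f[d_star]                            # observe a point in f
    while g[d_star] - f_star >= eps:              # eps bounds the maximum error
        g[d_star] = f[d_star]                     # refine: make g(d*) = f(d*)
        d_star = max(candidates, key=lambda d: g[d])  # update argmax
        f_star = max(f_star, f[d_star])           # update "best so far"
    return d_star, f_star                         # certificate: g's max meets f

f = {"a": 10.0, "b": 12.0, "c": 7.0}
g = {"a": 15.0, "b": 13.0, "c": 9.0}              # g >= f everywhere
print(optimise(f, g, ["a", "b", "c"]))
```

On this example the loop first lowers g at "a", discovers "b", pins g at "b", and terminates with f’s true argmax and a certificate that no candidate can score higher.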

Proposal

In g, the true LM pLM is replaced by an upperbound qLM with stronger independence assumptions.

- Suppose α = y_I^J is a substring of y = y_1^M,
  e.g. α = black2 cat3 in y = BOS0 the1 black2 cat3 EOS4

- contribution of α to the true LM score of y:

  pLM(α) ≡ ∏_{k=I}^{J} p(y_k | y_1^{k−1})

  e.g. p(black2 | BOS0 the1) · p(cat3 | BOS0 the1 black2)

- upperbound to pLM(α):

  qLM(α) ≡ q(y_I | ε) ∏_{k=I+1}^{J} q(y_k | y_I^{k−1})

  e.g. q(black2 | ε) · q(cat3 | black2)

where q(z|P) ≡ max_{H∈Δ∗} p(z|HP)

7 / 21
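The upperbound property can be checked by brute force on a toy trigram LM (all probabilities below are invented, not from the paper). For a trigram model, only the last two words of H matter, so histories of length 2 suffice for the max.

```python
import itertools

VOCAB = ["BOS", "the", "black", "cat"]
P3 = {  # hand-made trigram table p(z | u v), with a uniform fallback
    ("BOS", "the", "black"): 0.5,
    ("BOS", "the", "cat"): 0.2,
    ("the", "black", "cat"): 0.7,
}

def p(z, hist):                       # trigram conditional with fallback
    return P3.get((*hist[-2:], z), 0.1)

def q(z, partial):                    # most optimistic context H for z given partial P
    best = p(z, partial)              # H = empty
    for h in itertools.product(VOCAB, repeat=2):  # bounded brute force over H
        best = max(best, p(z, list(h) + partial))
    return best

# q(black|eps) * q(cat|black) upperbounds p(black|BOS the) * p(cat|BOS the black)
q_score = q("black", []) * q("cat", ["black"])
p_score = p("black", ["BOS", "the"]) * p("cat", ["BOS", "the", "black"])
assert q_score >= p_score
print(q_score, p_score)
```

Because each factor of qLM maximises over the unseen context, the product dominates the true chained probability for any actual history, which is exactly what makes g an admissible proxy.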

Max-ARPA

An n-gram is scored in its most optimistic context H:

q(z|P) ≡ max_{H∈Δ∗} p(z|HP)

This is computed efficiently using a Max-ARPA table M:
- start with an ARPA table:
  n-gram Pz — conditional log p(z|P) — backoff b(Pz)
- compute an upperbound view of the last two columns:
  n-gram Pz — max-conditional log q(z|P) — max-backoff m(Pz)

Then, for an arbitrary n-gram Pz:

q(z|P) = p(z|P)          if Pz ∉ M and P ∉ M
       = p(z|P) × m(P)   if Pz ∉ M and P ∈ M
       = q(z|P)          if Pz ∈ M (stored max-conditional)

8 / 21
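The three-way dispatch can be sketched as a lookup routine in log space. The table entries below are invented toy values, a unigram fallback stands in for the full backed-off p(z|P), and constructing the max-conditional and max-backoff columns from a real ARPA file is omitted.

```python
LOG_P = {("the",): -1.0, ("the", "cat"): -0.5}   # log p(z|P), keyed by P + (z,)
MAX_Q = {("the",): -0.8, ("the", "cat"): -0.5}   # stored log q(z|P) (max-conditional)
MAX_B = {("the",): 0.3}                          # log max-backoff m(P)

def log_q(z, P):
    """log q(z|P) via the Max-ARPA three-case rule (log domain, so x becomes +)."""
    ngram = tuple(P) + (z,)
    if ngram in MAX_Q:                   # case Pz in M: stored max-conditional
        return MAX_Q[ngram]
    if tuple(P) in MAX_B:                # case P in M: back off optimistically
        return LOG_P.get((z,), -10.0) + MAX_B[tuple(P)]
    return LOG_P.get((z,), -10.0)        # neither in M: plain conditional

print(log_q("cat", ["the"]))   # stored entry
print(log_q("dog", ["the"]))   # backs off through m("the")
print(log_q("dog", ["a"]))     # unseen context, unseen n-gram
```

The point of the max-backoff m(P) is that it is precomputed once per context, so each q(z|P) query stays a constant number of table lookups, just like a standard ARPA query.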

Proposal hypergraph

The proposal g(d) can be efficiently represented by a hypergraph.

Remark: nodes do not store LM state.

Compare the sizes:

|⟨D(x), f(d)⟩| = |G(x) ∩ A| ∝ I² 2^d |Δ|^(n−1)

|⟨D(x), g(d)⟩| = |G(x) ∩ A_upperbound| ∝ I² 2^d

(the |Δ|^(n−1) factor disappears because, under the upperbound, nodes carry no LM context)

9 / 21

Refinement

Goal: break independence assumptions by
1. making larger n-grams
2. bringing g closer to f

Examples

g′(d) = { g(d)  if d ≠ argmax_d g(d)
        { f(d)  otherwise

Too local: one derivation at a time.

g′(d) = g(d) · ( q(z|hP) / q(z|P) )^k

where k counts occurrences of hPz in yield(d)
[Figure: a two-state wFSA implementing the count, with transitions a/w(a) (else) and z/w(z)]

Too global: refines derivations which already score poorly.

10 / 21
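The second rule’s multiplicative correction is easy to state numerically. A minimal sketch with invented scores: extending the context from P to hP can only make q less optimistic, so each occurrence of hPz multiplies g(d) by a ratio at most 1.

```python
def refined_score(g_d, q_z_hP, q_z_P, k):
    """g'(d) = g(d) * (q(z|hP) / q(z|P)) ** k,
    where k counts occurrences of the extended n-gram hPz in yield(d)."""
    return g_d * (q_z_hP / q_z_P) ** k

# invented numbers: q(z|hP) = 0.2 <= q(z|P) = 0.5, two occurrences of hPz
print(refined_score(0.8, 0.2, 0.5, 2))
```

Since q(z|hP) ≤ q(z|P) by definition of the max, the refined proposal never rises, which is why applying it everywhere wastes effort on derivations that already score poorly.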

LM state refinement [Li and Khudanpur, 2008, Heafield et al., 2013]

Break independence assumptions (making larger n-grams), but not in every derivation.

Example:
- suppose the argmax is (X1 (X2 (X3 the) black) cat)
- node X3 currently stores an empty LM state
- this motivates a refined node X3′ whose LM state is the · LMState(X3)
- incoming edges to X3 are split
- outgoing edges from X3′ are reweighted copies of those leaving X3

This corresponds to an intersection local to X3:

∗ (X3 ∗ the) z ∗   with weight update q(z|the) / q(z)

11 / 21

Refinement actions

Goal: lower g’s maximum as much as possible.

Heuristic: refine all nodes participating in the current argmax, by extending each node’s LM (right) state by exactly one word from yield(argmax_d g(d)).

12 / 21

13 / 21

[Figure: bar chart of scores (0–20) of f and g over bracketed derivations such as (8 (6 (2 the) black) cat), (8 (6 (2 the) dark) cat), …, before and after each refinement step.]

- the argmax is (X8 (X6 (X2 the) black) cat)
- refine spans continuing from X2 by conditioning on “the”
- refine spans continuing from X6 by conditioning on “black”
- refine spans continuing from X8 by conditioning on “cat”
- this yields a new hypergraph ⟨D(x), g′(d)⟩

Experiments

German–English (WMT14 data)
- phrase extraction: 2.2M sentences
- maximum phrase length: 5
- maximum translation options: 40
- unpruned LMs: 25M sentences
- dev set: newstest2010 (LM interpolation and tuning)
- batch-mira tuning (cube pruning, beam 5000)
- test set: newstest2012 (3,003 sentences)
- distortion limit d = 4

14 / 21

Exact decoding

n   build (s)   total (s)   N     |V|   |E|
3   1.5         21          190   2.5   159
4   1.5         50          350   4.0   288
5   1.5         106         555   6.1   450

- build: time to build the initial proposal
- total: decoding time
- N: number of iterations
- |V|, |E|: size of the final hypergraph (in thousands of nodes and edges)

15 / 21

Cube pruning

Search errors by beam size (k)

k      3-gram   4-gram   5-gram
10     2168     2347     2377
10²    613      999      1126
10³    29       102      167
10⁴    0        4        7

16 / 21

Translation quality with BLEU

Cube pruning and exact decoding with OS∗

k      3-gram LM   4-gram LM   5-gram LM
10     20.47       20.71       20.69
10²    21.14       21.73       21.76
10³    21.27       21.89       21.91
10⁴    21.29       21.92       21.93
OS∗    21.29       21.92       21.93

17 / 21

Conclusions

Exact decoding
- in manageable time
- visiting only a fraction of the search space

Search error curves for beam search and cube pruning
- with large phrase tables
- and large 5-gram LMs

Exactness comes at the cost of worst-case complexity; however, we demonstrate empirically that the algorithm is practicable.

18 / 21

Recent developments

1. exact k-best2. exact sampling

19 / 21

Future work

Optimisation
- early stopping of the search
- error-safe pruning

Speedups
- be more selective with refinements
- LR to deal with powerset constraints (allowing for a higher distortion limit d)
- more grouping (partial edges)

Hiero models

20 / 21

Thanks!

Questions?

21 / 21

Variable-order LM

n   Nodes at level m (×1000)         LM states at level m
    m=0   m=1   m=2   m=3   m=4      m=1   m=2   m=3   m=4
3   0.4   1.2   0.5   -     -        113   263   -     -
4   0.4   1.6   1.4   0.3   -        132   544   212   -
5   0.4   2.1   2.4   0.7   0.1      142   790   479   103

Table: Average number of nodes (in thousands) whose LM state encodes an m-gram, and average number of unique LM states of order m in the final hypergraph, for different n-gram LMs (d = 4 everywhere).

22 / 21

References I

Kenneth Heafield, Philipp Koehn, and Alon Lavie. Grouping language model boundary words to speed k-best extraction from hypergraphs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 958–968, Atlanta, Georgia, USA, June 2013.

Zhifei Li and Sanjeev Khudanpur. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation, SSST ’08, pages 10–18, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

23 / 21