Summary Step 1. Seed Paraphrase Acquisition Large multiple...

2
10 3 10 4 10 5 10 6 10 7 10 8 10 6 10 7 10 8 # of paraphrase pairs # of words in the English side of bilingual corpus P Raw P Raw (th p =0.01) P Seed (th p =ε, th s =ε) P Seed (th p =0.01, th s =ε) 10 3 10 4 10 5 10 6 10 7 10 8 10 6 10 7 10 8 # of paraphrase pairs # of words in the English side of bilingual corpus P Raw P Raw (th p =0.01) P Seed (th p =ε, th s =ε) P Seed (th p =0.01, th s =ε) 10 3 10 4 10 5 10 6 10 7 10 8 10 6 10 7 10 8 # of paraphrase pairs # of unique LHS phrases # of words in the English side of bilingual corpus Pair in P Hvst LHS in P Hvst Pair in P Seed LHS in P Seed 18.1M 1.22M 10 3 10 4 10 5 10 6 10 7 10 8 10 6 10 7 10 8 # of paraphrase pairs # of unique LHS phrases # of words in the English side of bilingual corpus Pair in P Hvst LHS in P Hvst Pair in P Seed LHS in P Seed 28.7M 1.41M 10 3 10 4 10 5 10 6 10 7 10 8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 # of paraphrase pairs Probability threshold th p P Hvst (Patent) P Seed (Patent) P Hvst (Europarl) P Seed (Europarl) 10 3 10 4 10 5 10 6 10 7 10 8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 # of paraphrase pairs Similarity threshold th s P Hvst (Patent) P Seed (Patent) P Hvst (Europarl) P Seed (Europarl) Monolingual Non-parallel Corpus Step 1. Seed Paraphrase Acquisition Step 2. Paraphrase Pattern Induction Step 3. Paraphrase Instance Acquisition Paraphrase acquisition Through generalization and instantiation Using both bilingual and monolingual data Human evaluation of phrase substitutions Europarl paraphrases on WMT “newstest” data Comparable to the state-of-the-art (A) Europarl + GigaFrEn , (B) NTCIR Patent data One-variable patterns and single words High leverage rate (# pair): 1580%, 2140% Paraphrases for many novel phrases 3 3.5 4 4.5 5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Avg. score Probability threshold th p Grammaticality (P Seed ) Grammaticality (P Hvst ) Meaning (P Seed ) Meaning (P Hvst ) 3 3.5 4 4.5 5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Avg. score Similarity threshold th s Grammaticality (P Seed ) Grammaticality (P Hvst ) Meaning (P Seed ) Meaning (P Hvst ) Large multiple of # of seeds Good quality P Seed P Hvst Total .85 .74 .76 G4 M3 Both 5-pt Binary .93 .67 .71 .78 .55 .58 G M 4.60 4.35 4.22 3.35 4.28 3.50 55 295 350 Pivot-based PA using generic SMT systems e.g., Phrase-based SMT system [Koehn, 03] 1. Clean up phrase table: sig. pruning [Johnson+, 07] 2. Pair phrases that get translated to the same phrases [Bannard and Callison-Burch, 05] 3. Filter paraphrase candidate pairs 3a. stop word differences, word super-sequences 3b. conditional probability and contextual similarity Identical words of LHS and RHS Variable slots Ignore morphological variation e.g., number (sg./pl.), gender, case, person, tense Bilingual Parallel Corpus health issue ||| problème de santé health problem ||| problème de santé look like ||| ressemble regional issue ||| problème régional regional problem ||| problème régional resemble ||| ressemble Translation table health issue health problem health problem health issue look like resemble regional issue regional problem regional problem regional issue resemble look like Seed paraphrases (P Seed ) X issue X problem X {“health”, “regional”} X problem X issue X {“health”, “regional”} Paraphrase patterns backlog issue backlog problem communal issue communal problem phishing issue phishing problem spatial issue spatial problem unrelated issue unrelated problem Novel paraphrases (P Hvst ) rp: control device lp: controller p(lp|rp) .153 lp: control apparatus .135 lp: the control apparatus .010 lp: control apparatus of .008 lp: controlling unit .004 lp: control equipment .002 lp: controller for a .001 lp: to the control apparatus .001 lp: control apparatus rp: control device p(rp|lp) .172 rp: control system .032 rp: the control device .015 rp: control device of the .005 rp: controlling device .004 rp: control system of .003 rp: a control system for an .001 rp: a controlling device .001 Harvest novel instances of the patterns 1. Collect expressions that match both sides of the pattern 2. Score each instance by contextual similarity Resources Corpora (bilingual parallel and monolingual) Tokenizer SMT system Lists of stop words (optional) Morphological resources Summary Step 1. Step 2. Step 3. Related work Develop patterns manually [Jacquemin, 99][Fujita+, 07] Add contextual constraints [Callison-Burch, 08][Zhao+, 09] Related work Learn class-dependent patterns [De Saeger+, 09;11] (Pattern-dependent) set expansion n

Transcript of Summary Step 1. Seed Paraphrase Acquisition Large multiple...

Page 1: Summary Step 1. Seed Paraphrase Acquisition Large multiple ...paraphrasing.org/~fujita/publications/mine/fujita... · Paraphrase patterns backlog issue ⇒ backlog problem communal

103

104

105

106

107

108

106 107 108

# o

f para

phra

se p

airs

# of words in the English side of bilingual corpus

PRawPRaw (thp=0.01)PSeed (thp=ε, ths=ε)PSeed (thp=0.01, ths=ε)

103

104

105

106

107

108

106 107 108

# o

f para

phra

se p

airs

# of words in the English side of bilingual corpus

PRawPRaw (thp=0.01)PSeed (thp=ε, ths=ε)PSeed (thp=0.01, ths=ε)

103

104

105

106

107

108

106 107 108

# o

f para

phra

se p

airs

# o

f uniq

ue L

HS

phra

ses

# of words in the English side of bilingual corpus

Pair in PHvstLHS in PHvstPair in PSeedLHS in PSeed

18.1M

1.22M

103

104

105

106

107

108

106 107 108

# o

f para

phra

se p

airs

# o

f uniq

ue L

HS

phra

ses

# of words in the English side of bilingual corpus

Pair in PHvstLHS in PHvstPair in PSeedLHS in PSeed

28.7M

1.41M

103

104

105

106

107

108

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

# o

f para

phra

se p

airs

Probability threshold thp

PHvst (Patent)PSeed (Patent)PHvst (Europarl)PSeed (Europarl)

103

104

105

106

107

108

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

# o

f para

phra

se p

airs

Similarity threshold ths

PHvst (Patent)PSeed (Patent)PHvst (Europarl)PSeed (Europarl)

Monolingual Non-parallel

Corpus

Step 1. Seed Paraphrase Acquisition

Step 2. Paraphrase Pattern Induction

Step 3. Paraphrase Instance Acquisition

!   Paraphrase acquisition   Through generalization and instantiation   Using both bilingual and monolingual data

!   Human evaluation of phrase substitutions   Europarl paraphrases on WMT “newstest” data   Comparable to the state-of-the-art

!   (A) Europarl + GigaFrEn, (B) NTCIR Patent data   One-variable patterns and single words   High leverage rate (# pair): ≧1580%, ≧2140%   Paraphrases for many novel phrases

3

3.5

4

4.5

5

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1A

vg. sc

ore

Probability threshold thp

Grammaticality (PSeed)Grammaticality (PHvst)Meaning (PSeed)Meaning (PHvst)

3

3.5

4

4.5

5

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Avg

. sc

ore

Similarity threshold ths

Grammaticality (PSeed)Grammaticality (PHvst)Meaning (PSeed)Meaning (PHvst)

Large multiple of # of seeds

Good quality

PSeed PHvst Total

.85

.74

.76

G≥4 M≥3 Both 5-pt Binary

.93

.67

.71

.78

.55

.58

G M 4.60 4.35 4.22 3.35 4.28 3.50

55 295 350

!   Pivot-based PA using generic SMT systems   e.g., Phrase-based SMT system [Koehn, 03] 1.  Clean up phrase table: sig. pruning [Johnson+, 07] 2.  Pair phrases that get translated to the same phrases

[Bannard and Callison-Burch, 05] 3.  Filter paraphrase candidate pairs

3a. stop word differences, word super-sequences 3b. conditional probability and contextual similarity

!   Identical words of LHS and RHS Variable slots   Ignore morphological variation

e.g., number (sg./pl.), gender, case, person, tense

Bilingual Parallel Corpus

health issue ||| problème de santé health problem ||| problème de santé look like ||| ressemble regional issue ||| problème régional regional problem ||| problème régional resemble ||| ressemble

Translation table

health issue ⇒ health problem health problem ⇒ health issue look like ⇒ resemble regional issue ⇒ regional problem regional problem ⇒ regional issue resemble ⇒ look like

Seed paraphrases (PSeed)

X issue ⇒ X problem X ∋ {“health”, “regional”}

X problem ⇒ X issue X ∋ {“health”, “regional”}

Paraphrase patterns

backlog issue ⇒ backlog problem communal issue ⇒ communal problem phishing issue ⇒ phishing problem spatial issue ⇒ spatial problem unrelated issue ⇒ unrelated problem …

Novel paraphrases (PHvst)

rp: control device

lp: controller p(lp|rp).153

lp: control apparatus.135

lp: the control apparatus.010

lp: control apparatus of .008

lp: controlling unit.004

lp: control equipment

.002

lp: controller for a

.001

lp: to the control apparatus

.001

lp: control apparatus

rp: control devicep(rp|lp).172

rp: control system.032

rp: the control device.015

rp: control device of the.005

rp: controlling device.004

rp: control system of

.003

rp: a control system for an

.001

rp: a controlling device

.001

!   Harvest novel instances of the patterns 1.  Collect expressions that match both sides of the pattern 2.  Score each instance by contextual similarity

!   Resources   Corpora (bilingual parallel and monolingual)   Tokenizer   SMT system   Lists of stop words   (optional) Morphological resources

Summary

Step 1.

Step 2.

Step 3.

!   Related work   Develop patterns manually [Jacquemin, 99][Fujita+, 07]   Add contextual constraints [Callison-Burch, 08][Zhao+, 09]

!   Related work   Learn class-dependent patterns [De Saeger+, 09;11]   (Pattern-dependent) set expansion

n

Page 2: Summary Step 1. Seed Paraphrase Acquisition Large multiple ...paraphrasing.org/~fujita/publications/mine/fujita... · Paraphrase patterns backlog issue ⇒ backlog problem communal

L1:to L2:approaches:to

L3:many:approaches:to L4:been:many:approaches:to

R1:between R2:between:words

R3:between:words:based R4:between:words:based:on

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of R

HS

phra

se

Length of LHS phrase

100

101

102

103

104

105

106

107

# o

f pairs

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of R

HS

phra

se

Length of LHS phrase

100

101

102

103

104

105

106

107

# o

f pairs

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of R

HS

phra

se

Length of LHS phrase

100

101

102

103

104

105

106

107

# o

f pairs

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of R

HS

phra

se

Length of LHS phrase

100

101

102

103

104

105

106

107

# o

f pairs

Europarl, PSeed

Europarl, PHvst

Patent, PSeed

Patent, PHvst

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of ta

rget la

nguage p

hra

se

Length of source language phrase

100

101

102

103

104

105

106

107

# o

f pairs

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

Length

of ta

rget la

nguage p

hra

se

Length of source language phrase

100

101

102

103

104

105

106

107

# o

f pairs

Europarl, Ttable Patent, Ttable

102

103

104

105

106

106 107 108

# o

f para

phra

se p

attern

s

# of words in the English side of bilingual corpus

All (Patent)1-var (Patent)All (Europarl)1-var (Europarl)

0

10

20

30

40

106 107 108C

ove

rage [%

]

# of words in the English side of bilingual corpus

All (Patent)1-var (Patent)All (Europarl)1-var (Europarl)

0

20

40

60

80

106 107 108

Ratio

of P

Hvs

t to P

Seed

# of words in the English side of bilingual corpus

LHS (Patent)Pair (Patent)LHS (Europarl)Pair (Europarl)

1

2

3

4

5

106 107 108

Avg

. # o

f R

HS

phra

ses

# of words in the English side of bilingual corpus

PHvst (Patent)PSeed (Patent)PHvst (Europarl)PSeed (Europarl)

!   Ingredients   Extract contextual features: adjacent n-grams

  cf. Bag-of-words (cheap but noisy)   cf. Dependency trees (accurate but expensive)

  Weight and filter features: nothing   Aggregate into a single value: cosine of vectors

Additional statistics Recipe for contextual similarity Human evaluation: details

Alternative for seed paraphrases

Limitations

!   Show 5 alternatives at the same time   To make results more consistent   To reduce the human labor

!   “Grammaticality” and “Meaning equiv.”   5-pt scales and binary prec. [Callison-Burch, 08]   [Callison-Burch, 08]

  Europarl (10 langs-En) + CCG + LM   WMT 2007 “newstest” data   Binary prec. : .68 for G, .62 for M, .55 for both   Ours (total): .76 for G, .71 for M, .58 for both

  [Chan+, 11]   Europarl (10 langs-En) + CCG + Google N-gram   Europarl (i.e., closed)   Score for 1-best: 4.2 pts for G and 3.7 pts for M   Ours (1-best): 4.57 pts for G and 3.96 pts for M

transitional {task, strategy, phase, costs, ...} ⇒ {task, strategy, phase, costs, ...} of transition

in the course of the last few {weeks, years, decades} ⇒ during recent {weeks, years, decades}

… There have been many approaches to compute the similarity between words based on their distribution in a corpus. …

Examples multi-lateral ⇒ multilateral i would like to start by congratulating ⇒ let me first of all congratulate transitional {process, year} ⇒ {process, year} of transition in the course of the last few {months} ⇒ during recent {months}

overall structure ⇒ entire configuration in accordance with the structure mentioned above ⇒ due to such a constitution {bypass, chip} condensers ⇒ {bypass, chip} capacitors will be described with reference to {drawings} ⇒ is explained based on the {drawings}

{layer, ceramic, ferroelectric, solid, ...} condensers ⇒ {layer, ceramic, ferroelectric, solid, ...} capacitors

will be described with reference to {embodiments} ⇒ is explained based on the {embodiments}

!   Any high-prec. set can be used as PSeed   e.g., Multiple definition sentences [Hashimoto+, 11]

... Osteoporosis is a disease that decreases the quantity of bone and makes bones fragile ... Osteoporosis is a disease that reduces bone mass and increases the risk of bone fracture. ...

decreases the quantity of bone ⇒ reduces bone mass makes bones fragile ⇒ increases the risk of bone fracture

decreases the quantity of {body, tissue, fat, gas, tumor, muscle, ...} ⇒ reduces {body, tissue, fat, gas, tumor, muscle, ...} mass

Future work !   In-depth analyses of the proposed method

  Similarity metrics   Paraphrase patterns with more than one variable   Size & type of monolingual corpora

Europarl, PSeed

Europarl, PHvst

Patent, PSeed

Patent, PHvst

!   Our method (current version)   Does not cover totally different expressions

!   Type-based approaches   Do not properly deal with polysemy   Tend to miss rare expressions

!   Corpus-based approaches   Do not acquire expressions that do not appear

!   Sophisticated paraphrase patterns   Hierarchical pattern induction   Deeper level of lexical correspondences

!   Use for NLP applications

!   Paraphrase patterns   Coverage depends on corpus/domain   Mostly 1-var patterns

15%

40%

!   Leverage rate   Small bilingual data High leverage

!   Relative yield  PSeed grows slowly  PHvst depends on patterns

!   Phrases tend to be short   Our filters tend to discard long phrases   Setting: 1-var patterns & single-word fillers

PSeed

PHvst