Approximate sentence matching and its applications in corpus-based research Rafał Jaworski INFuture2015, Zagreb, Croatia

Page 1:

Approximate sentence matching and its applications in corpus-based research

Rafał Jaworski

INFuture2015, Zagreb, Croatia

Page 2:

1. Approximate sentence matching – what is it?

2. Some information about the Roman goddess of agreement.

3. Thoughts on translating an entire text corpus… manually.

4. Why the Attic Greek word συνεργία is worth remembering.

Agenda

Page 3:

• ASM is a technique of retrieving sentences similar to a given input sentence from a large text corpus.

• If we search legal texts for the sentence "the agreement was concluded on 11th of March 2012", we expect ASM to find the sentences:

a) "the agreement was concluded on 25th of September 2014"

b) "the contract was signed on 11th of March 2012"

c) "the agreement was not concluded"

• Which sentences are similar depends on the similarity measure.

Approximate sentence matching
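
As a minimal sketch of how the choice of measure changes which sentences count as similar, the snippet below scores the three example sentences against the query with two stand-in measures: word-set (Jaccard) overlap and a token-sequence ratio from Python's difflib. Neither is claimed to be the measure used in the talk; the function names are hypothetical and the tokenizer is deliberately naive.

```python
from difflib import SequenceMatcher


def tokens(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def jaccard(a, b):
    """Share of distinct words the two sentences have in common."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def sequence_ratio(a, b):
    """Order-sensitive similarity of the two token sequences, in [0, 1]."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


query = "the agreement was concluded on 11th of March 2012"
candidates = [
    "the agreement was concluded on 25th of September 2014",
    "the contract was signed on 11th of March 2012",
    "the agreement was not concluded",
]

# Print both scores next to each candidate sentence.
for c in candidates:
    print(f"{jaccard(query, c):.2f}  {sequence_ratio(query, c):.2f}  {c}")
```

The word-set measure ignores word order, while the sequence ratio rewards long shared runs of words, so the two can rank the same candidates differently.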

Page 4:

• ASM is primarily used as a Computer-Aided Translation mechanism.

• When working on a sentence, a translator searches for similar sentences in a database of previously translated texts, the translation memory (TM).

• This technique is known to boost translation efficiency and to keep translations consistent.

• The drawback: it can be applied only rarely (only ca. 5% can be found in the TM).

ASM – translation memories
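
The classic TM lookup described above can be sketched as a best-match search with a cut-off. This is an illustration under assumptions, not the behaviour of any particular CAT tool: the similarity function is difflib's token-sequence ratio, the 0.7 threshold is arbitrary, and the memory entries with their Polish translations are made up for the example.

```python
from difflib import SequenceMatcher

# Toy translation memory: (source sentence, stored translation).
# The entries and the Polish translations are illustrative only.
TM = [
    ("the agreement was concluded on 25th of September 2014",
     "umowa została zawarta 25 września 2014 r."),
    ("the agreement was not concluded",
     "umowa nie została zawarta"),
]


def similarity(a, b):
    """Token-sequence similarity in [0, 1]; a stand-in for a real fuzzy score."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def tm_lookup(sentence, memory, threshold=0.7):
    """Return (score, source, translation) for the best match,
    or None if no stored sentence reaches the (assumed) threshold."""
    best = max(
        ((similarity(sentence, src), src, tgt) for src, tgt in memory),
        default=None,
    )
    if best is None or best[0] < threshold:
        return None
    return best


print(tm_lookup("the agreement was not concluded", TM))
print(tm_lookup("I can not repair the lawn mower", TM))
```

Sentences that fall below the threshold return nothing at all, which is one way to picture why whole-sentence lookup helps only for a small share of the text.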

Page 5:

• How can classic TM search be modified so that it retrieves more valuable information?

The goal

Image found at: http://www.commercebees.com

Page 6:

• How can translators be convinced to use new software instead of their favourite workbenches?

The goal

Image found at: http://www.commercebees.com

Page 7:

What it feels like…

Image found at: http://memegenerator.net, depicting a character from the Lord of the Rings film series

Page 8:

What it feels like…

Image found at: http://www.dailymail.co.uk

Page 9:

• The Concordia translation memory searcher was developed.

• It combines classical TM search with concordance searching (finding a single word in context).

• It takes its name from the Roman goddess of agreement, as it helps to produce translations that "agree" with each other.

Let’s not give up!
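
Concordance searching, as mentioned above, means finding a single word together with its surrounding context. A minimal keyword-in-context sketch follows; it is a generic illustration and not Concordia's code. The function name and the `width` parameter are hypothetical, and the sample sentences are borrowed from the translation memory example in this deck.

```python
def concordance(word, sentences, width=2):
    """Yield (left context, keyword, right context) for every hit of `word`.

    `width` is the number of context tokens kept on each side.
    """
    word = word.lower()
    for sentence in sentences:
        toks = sentence.split()
        for i, tok in enumerate(toks):
            # Compare case-insensitively, ignoring trailing punctuation.
            if tok.lower().strip(".,!?") == word:
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[i + 1:i + 1 + width])
                yield left, tok, right


memory = [
    "I just think it is impossible.",
    "He is not sure if it is needed.",
    "I want you to repair the car already!",
    "I can not repair the lawn mower.",
    "It might be impossible to do that.",
]

for left, keyword, right in concordance("repair", memory):
    print(f"{left:>10} [{keyword}] {right}")
```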

Page 10:

Concordia – example

Translation memory:

I just think it is impossible.

He is not sure if it is needed.

I want you to repair the car already!

I can not repair the lawn mower.

It might be impossible to do that.

Search: It is impossible to repair the car.

Page 11:

• All possible overlays are then scored.

• A good overlay covers as much of the input sentence as possible with as few fragments as possible.

• The translator is presented with translations of the longest fragments of the sentence he/she is working on.

• Productivity and usability experiments are under way!

Concordia
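
To make the idea of an overlay concrete, here is a rough sketch that treats "It is impossible to repair the car." as the searched sentence and the five sentences of the translation memory example as the memory. It is not Concordia's algorithm: the fragment search is brute force, the overlay is built greedily (longest fragment first) rather than by scoring all possible overlays, and the scoring formula (coverage divided by the number of fragments) is an assumption chosen only to reward high coverage with few fragments. All function names are hypothetical.

```python
def tokenize(sentence):
    """Lowercase tokens with surrounding punctuation stripped."""
    return [t.strip(".,!?").lower() for t in sentence.split()]


def contains(haystack, needle):
    """True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def matched_fragments(query, memory, min_len=2):
    """All (start, end) token spans of the query found verbatim in the memory."""
    q = tokenize(query)
    mem = [tokenize(s) for s in memory]
    spans = []
    for start in range(len(q)):
        for end in range(start + min_len, len(q) + 1):
            if any(contains(m, q[start:end]) for m in mem):
                spans.append((start, end))
    return spans


def greedy_overlay(spans):
    """Pick non-overlapping spans, longest first (a greedy shortcut)."""
    chosen = []
    for start, end in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if all(end <= cs or start >= ce for cs, ce in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def overlay_score(overlay, query_len):
    """Assumed score: fraction of the query covered, divided by fragment count."""
    if not overlay:
        return 0.0
    covered = sum(end - start for start, end in overlay)
    return covered / query_len / len(overlay)


memory = [
    "I just think it is impossible.",
    "He is not sure if it is needed.",
    "I want you to repair the car already!",
    "I can not repair the lawn mower.",
    "It might be impossible to do that.",
]
query = "It is impossible to repair the car."

q = tokenize(query)
overlay = greedy_overlay(matched_fragments(query, memory))
fragments = [" ".join(q[s:e]) for s, e in overlay]
print(fragments, overlay_score(overlay, len(q)))
```

On this toy data the greedy overlay covers the whole query with two fragments, "it is impossible" and "to repair the car", each of which occurs in a different memory sentence.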

Page 12:

• And now for something (completely) different…

• Let us assume we have a large collection of texts in just one language.

• We would like to build a TM (aka parallel corpus) by manually translating all our sentences.

• WHAT?!

Producing TMs

Page 13:

• It’s okay, we will not translate ALL the sentences!

• We will choose only the most representative ones and translate them.

• And how do we choose the most representative sentences of a monolingual corpus? Let’s make clever use of ASM, more precisely – the sentence similarity measure.

Producing TMs
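
The slides do not spell out how the representative sentences are chosen, so the following is only one plausible reading, sketched under assumptions: treat a sentence as "covering" every sentence whose similarity to it reaches a threshold, then greedily pick the sentence that covers the most still-uncovered ones. The similarity function (difflib's token-sequence ratio), the 0.5 threshold, the function names, and the toy corpus are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Token-sequence similarity in [0, 1]; a stand-in for the ASM measure."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def pick_representatives(sentences, threshold=0.5, limit=None):
    """Greedy cover: repeatedly pick the sentence that is similar to the
    largest number of sentences not yet covered by an earlier pick."""
    n = len(sentences)
    # neighbours[i] = indices of sentences similar enough to sentence i
    # (always includes i itself, since a sentence matches itself fully).
    neighbours = [
        {j for j in range(n) if similarity(sentences[i], sentences[j]) >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered and (limit is None or len(chosen) < limit):
        # Ties are broken in favour of the earlier sentence.
        best = max(range(n), key=lambda i: (len(neighbours[i] & uncovered), -i))
        chosen.append(sentences[best])
        uncovered -= neighbours[best]
    return chosen


corpus = [
    "the agreement was concluded on 25th of September 2014",
    "the agreement was concluded on 11th of March 2012",
    "the agreement was concluded on 3rd of May 2010",
    "I can not repair the lawn mower",
    "I can not repair the car",
]

for sentence in pick_representatives(corpus):
    print(sentence)
```

Only the picked sentences would then be sent for manual translation. The pairwise comparison is quadratic in corpus size, so a real corpus would need an index or sampling, which this sketch leaves out.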

Page 14:

Producing TMs

Page 15:

• This method proved effective in preparing high-quality specialized translation memories.

• Such TMs are much more beneficial for the translation process.

• They can also be used for other purposes, such as training statistical machine translation systems.

Producing TMs

Page 16:

• Now, what is so special about the word συνεργία?

• Transliterated, it is synergia: synergy, working together.

• Good NLP research requires synergy between linguists and computer scientists.

Greek word

Images found at: https://spectacledbookworm.wordpress.com/ and http://lemmino.deviantart.com

Page 17:

• Linguists do not seem to know much about how computer software is created and which techniques are easy to implement and which are not.

• However, to be fair, computer scientists probably know even less about the translation process.

• Moreover, the two groups are motivated differently – translators are primarily focused on the quality of their translation.

• Computer scientists, on the other hand, are focused on the performance of their software.

Synergy – problems

Page 18:

• Ideally, linguists and computer scientists should spend about 1-2 hours a week working together.

• They should exchange concepts and educate each other in their fields.

• The computer scientist should translate a document under supervision of the linguist.

• The translator should become familiar with the architecture of the system he/she uses for work.

• Ideas for new software features should come out of their joint thinking.

Synergy – solutions

Page 19:

• Only with this approach can one establish true synergy!

Synergy – solutions

Image found at: http://www.referenceforbusiness.com

Page 20:

Hvala lijepa! (Thank you very much!)

INFuture2015, Zagreb, Croatia