Post on 18-Jan-2016

Approximate sentence matching and its applications in corpus-based research

Rafał Jaworski

INFuture2015, Zagreb, Croatia

1. Approximate sentence matching – what is that?

2. Some information about the Roman goddess of agreement.

3. Thoughts on translating an entire text corpus… manually.

4. Why the Attic Greek word συνεργία is worth remembering.

Agenda

• ASM is a technique of retrieving sentences similar to a given input sentence from a large text corpus.

• If we search law texts for the sentence “the agreement was concluded on 11th of March 2012”, we expect ASM to find sentences such as:

a) “the agreement was concluded on 25th of September 2014”

b) “the contract was signed on 11th of March 2012”

c) “the agreement was not concluded”

• Which sentences are similar depends on the similarity measure.
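As a rough illustration of the point above, here is a minimal sketch of ASM with one possible similarity measure: word-level edit distance normalised by sentence length. The measure, the function names (`tokenize`, `word_edit_distance`, `similarity`, `find_similar`) and the threshold value are assumptions made for this example, not the measure used in the talk.

```python
def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def word_edit_distance(a, b):
    """Levenshtein distance computed over word tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(s1, s2):
    """Hypothetical measure: 1.0 for identical sentences, 0.0 for unrelated ones."""
    a, b = tokenize(s1), tokenize(s2)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest


def find_similar(query, corpus, threshold=0.3):
    """Return (score, sentence) pairs at or above the threshold, best first."""
    scored = [(similarity(query, s), s) for s in corpus]
    return sorted([p for p in scored if p[0] >= threshold], reverse=True)


corpus = [
    "the agreement was concluded on 25th of September 2014",
    "the contract was signed on 11th of March 2012",
    "the agreement was not concluded",
    "this sentence is unrelated",
]
for score, sentence in find_similar(
        "the agreement was concluded on 11th of March 2012", corpus):
    print(f"{score:.2f}  {sentence}")
```

With this particular measure, sentence b) ranks above a), and c) only just passes the threshold. A different measure (for example one that ignores dates and numbers) would rank them differently, which is exactly why the choice of measure matters.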

Approximate sentence matching

• ASM is primarily used as a Computer-Aided Translation mechanism.

• When a translator works on a sentence, he/she searches for similar sentences in a database of previously translated texts, the translation memory (TM).

• This technique is known to boost the efficiency of translation and to ensure the consistency of translations.

• The drawback: it can be used rather rarely (only ca. 5% of sentences can be found in the TM).

ASM – translation memories

• How can classic TM search be modified so that it retrieves more valuable information?

The goal

Image found at: http://www.commercebees.com

• How can translators be convinced to use new software instead of their favourite workbenches?

The goal

Image found at: http://www.commercebees.com

What it feels like…

Image found at: http://memegenerator.net, depicting a character from the Lord of the Rings film series

What it feels like…

Image found at: http://www.dailymail.co.uk

• The Concordia translation memory searcher was developed.

• It combines classical TM search with concordance searching (finding a single word in context).

• It takes its name from the Roman goddess of agreement, as it helps to produce translations that “agree” with each other.

Let’s not give up!

Concordia – example

search: It is impossible to repair the car.

Translation memory:

I just think it is impossible.

He is not sure if it is needed.

I want you to repair the car already!

I can not repair the lawn mower.

It might be impossible to do that.

• All possible overlays are then scored.

• A good overlay covers as much of the input sentence as possible with as few fragments as possible.

• The translator is presented with translations of the longest fragments of the sentence he/she is working on.

• Productivity and usability experiments are under way!
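The overlay idea can be sketched in a few lines, using the example sentences from the slide. This is not the Concordia implementation: the scoring formula (words covered minus a fixed penalty per fragment), the `fragment_penalty` value, the minimum fragment length and all function names are assumptions made for illustration. The sketch finds every fragment of the input that occurs verbatim in some TM sentence, then uses dynamic programming to pick the highest-scoring overlay among all combinations of non-overlapping fragments.

```python
def tokenize(sentence):
    """Lowercase words with surrounding punctuation stripped (naive)."""
    return [w.strip(".,!?").lower() for w in sentence.split()]


def contains(tokens, fragment):
    """True if `fragment` occurs as a contiguous run inside `tokens`."""
    n = len(fragment)
    return any(tokens[i:i + n] == fragment
               for i in range(len(tokens) - n + 1))


def matched_spans(query, memory, min_len=2):
    """Map each query span (start, end) found in the TM to the index of
    the first TM sentence that contains it."""
    spans = {}
    for start in range(len(query)):
        for end in range(start + min_len, len(query) + 1):
            for idx, sent in enumerate(memory):
                if contains(sent, query[start:end]):
                    spans[(start, end)] = idx
                    break
    return spans


def best_overlay(query, spans, fragment_penalty=0.5):
    """best[i] holds (score, fragments) for the first i query words.
    Assumed score: number of covered words minus a penalty per fragment."""
    best = [(0.0, [])]
    for end in range(1, len(query) + 1):
        candidate = best[end - 1]  # option: leave word end-1 uncovered
        for (s, e), idx in spans.items():
            if e == end:
                score = best[s][0] + (e - s) - fragment_penalty
                if score > candidate[0]:
                    candidate = (score, best[s][1] + [(s, e, idx)])
        best.append(candidate)
    return best[-1]


memory_sentences = [
    "I just think it is impossible.",
    "He is not sure if it is needed.",
    "I want you to repair the car already!",
    "I can not repair the lawn mower.",
    "It might be impossible to do that.",
]
memory = [tokenize(s) for s in memory_sentences]
query = tokenize("It is impossible to repair the car.")

score, fragments = best_overlay(query, matched_spans(query, memory))
for start, end, idx in fragments:
    print(" ".join(query[start:end]), "<-", memory_sentences[idx])
```

For the slide's example this selects two fragments, "it is impossible" (from the first TM sentence) and "to repair the car" (from the third), which together cover the whole input. A real system would need an index such as a suffix array rather than this brute-force scan to work on a large TM.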

Concordia

• And now for something (completely) different…

• Let us assume we have a large collection of texts in just one language.

• We would like to build a TM (aka parallel corpus) by manually translating all our sentences.

• WHAT?!

Producing TMs

• It’s okay, we will not translate ALL the sentences!

• We will only choose the most representative ones and translate them.

• And how do we choose the most representative sentences of a monolingual corpus? Let’s make clever use of ASM, or more precisely, of the sentence similarity measure.
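The slides do not spell out the selection procedure, so the following is only one plausible way to use a similarity measure for this purpose: repeatedly pick the sentence that is similar to the largest number of not-yet-covered sentences, then treat all of those as covered. The Jaccard word-overlap measure, the threshold and the names `jaccard` and `pick_representatives` are assumptions made for this sketch.

```python
def jaccard(s1, s2):
    """Hypothetical similarity measure: shared words / all words."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def pick_representatives(sentences, threshold=0.5, budget=None):
    """Greedily choose sentences that 'stand for' many similar ones.

    `budget` optionally caps how many sentences we are willing to
    translate manually; None means continue until all are covered.
    """
    uncovered = set(range(len(sentences)))
    chosen = []
    while uncovered and (budget is None or len(chosen) < budget):

        def neighbours(i):
            # Uncovered sentences similar enough to sentence i
            # (this always includes i itself).
            return {j for j in uncovered
                    if jaccard(sentences[i], sentences[j]) >= threshold}

        # Ties are broken in favour of the earliest sentence.
        best = max(sorted(uncovered), key=lambda i: len(neighbours(i)))
        chosen.append(sentences[best])
        uncovered -= neighbours(best)
    return chosen


corpus = [
    "the agreement was concluded on monday",
    "the agreement was concluded on friday",
    "the agreement was concluded on sunday",
    "payment is due within thirty days",
    "payment is due within sixty days",
]
for sentence in pick_representatives(corpus):
    print(sentence)
```

Here two of the five sentences are selected, one per group of near-duplicates. Translating only those two would give a TM in which every corpus sentence has a close match. The pairwise comparison is quadratic in corpus size, so a large corpus would need an indexed ASM search instead.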

Producing TMs

• This method proved effective in preparing high-quality specialized translation memories.

• Such TMs are much more beneficial for the translation process.

• They can also be used for other purposes, such as training statistical machine translation systems.

Producing TMs

• Now, what is so special about the word συνεργία?

• Transliterated, it is “synergia”: synergy, working together.

• Good NLP research requires synergy between linguists and computer scientists.

Greek word

Images found at: https://spectacledbookworm.wordpress.com/ and http://lemmino.deviantart.com

• Linguists do not seem to know much about how computer software is created and which techniques are easy to implement and which are not.

• However, to be fair, computer scientists probably know even less about the translation process.

• Moreover, the two groups are motivated differently – translators are primarily focused on the quality of their translation.

• Computer scientists, on the other hand, are focused on the performance of their software.

Synergy – problems

• Ideally, linguists and computer scientists should spend about 1-2 hours a week working together.

• They should exchange concepts and educate each other in their fields.

• The computer scientist should translate a document under the supervision of the linguist.

• The translator should become familiar with the architecture of the system he/she uses for work.

• Ideas for new features in the software should come out of their joint thinking.

Synergy – solutions

• Only with this approach can one establish true synergy!

Synergy – solutions

Image found at: http://www.referenceforbusiness.com

Hvala lijepa! (Thank you very much!)
