Maria daniele

Extending ranking with interword spacing analysis

Maria Carmela Daniele, Claudio Carpineto and Andrea Bernardini

Overview

I.  Word weighting based on interword spacing: σp

II.  Extension of quantistic weight through corpora analysis: σ*

III.  σ* application to ranking

IV.  Experiments

V.  Selective application of quantistic and frequentistic metrics based on:

a)  Document’s length

b)  Query hardness

Word weighting based on spacing between term occurrences: σp

•  A research branch that has developed over the last decade.

•  Builds on studies of the energy levels of disordered quantum systems; introduced by Ortuño et al. (2002)

•  Keyword extraction based on the distances between a term’s occurrences in a document, independent of any term-frequency analysis of the document.

•  Let’s see in more detail…

Reference Scenario

  As in a quantistic system, terms in a document are subject to attraction/repulsion phenomena, which are stronger among relevant terms than among common words.

  Reference document: Charles Darwin’s “The Origin of Species”

  In practice:

  Relevant words tend to cluster in documents (e.g., “INSTINCT”)

  Common words like “THE” are distributed uniformly

Definition of σp

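•  The weighting method is defined on the probability distribution of the distances between successive occurrences of a term

•  A more efficient method is characterized by the standard deviation of those distances, where µ is their mean:

$s = \sqrt{\frac{1}{n-1} \sum_{i} \left( (x_{i+1} - x_i) - \mu \right)^2}$

•  Normalizing with respect to the mean value:

$\sigma_p = \frac{s}{\mu}$

•  Example, with word positions numbered below the sentence:

A great scientist must be a good teacher and a good researcher

1 2 3 4 5 6 7 8 9 10 11 12

•  For the term “a” we get X = {1, 6, 10} and D = {0, 5, 4, 2} (d_i = x_{i+1} − x_i, with the first and last word positions, 1 and 12, taken as boundaries)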
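The worked example for the term “a” can be reproduced with a minimal Python sketch. It assumes that the boundaries are the first and last word positions (which is what yields D = {0, 5, 4, 2}) and that the standard deviation uses the n − 1 denominator over the distances; the function names are illustrative only.

```python
from statistics import mean, stdev


def spacing_distances(positions, first, last):
    """Distances between consecutive occurrences, with the text boundaries added."""
    points = [first] + sorted(positions) + [last]
    return [b - a for a, b in zip(points, points[1:])]


def sigma_p(positions, first, last):
    """Standard deviation of the distances divided by their mean."""
    d = spacing_distances(positions, first, last)
    return stdev(d) / mean(d)  # stdev divides by (number of distances - 1)


words = "A great scientist must be a good teacher and a good researcher".lower().split()
positions = [i for i, w in enumerate(words, start=1) if w == "a"]  # 1-based positions
distances = spacing_distances(positions, first=1, last=len(words))
score = sigma_p(positions, first=1, last=len(words))
print(positions, distances, round(score, 4))
```

A term spread evenly through the text has nearly equal distances and therefore a score close to 0, while a clustered term mixes short and long gaps and scores higher.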

Extension of quantistic weighting through corpora analysis: σ*

•  We propose to modify the original metric with a factor σf based on the variance of term frequencies across the collection (Salton 1975). The factor σf is analogous to σp and has a twofold goal:

1.  Penalize rare words, which often amount to ‘noise’ in real document collections yet tend to be overestimated by σp;

2.  Reward words that better discriminate a document from the rest of the collection, a feature that quantistic weighting lacks.

$s_f(w) = \sqrt{\frac{1}{N_D} \sum_{i=1}^{N_D} \left( f_i(w) - \mu_f \right)^2}$
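As a rough illustration of the frequency dispersion s_f(w), the sketch below assumes that N_D is the number of documents in the collection, f_i(w) is the frequency of w in document i, µ_f is the mean of those frequencies, and s_f is the square root of their variance. The function name and the counts are hypothetical.

```python
from statistics import mean, pstdev


def frequency_spread(freqs):
    """Return (s_f, mu_f) for the per-document frequencies of one word."""
    mu_f = mean(freqs)
    s_f = pstdev(freqs, mu_f)  # population form: divides by N_D
    return s_f, mu_f


# Hypothetical counts of one word in a four-document collection.
s_f, mu_f = frequency_spread([0, 0, 4, 0])
print(s_f, mu_f)
```

A word concentrated in few documents, as in this toy example, has a large spread relative to its mean, whereas a word used evenly everywhere has a spread near zero.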

Comparison between quantistic and frequentistic metrics

Rank  Tf-Idf  Tf-Idf*  σp         σ*
1     unto    lord     jesus      jesus
2     shall   god      christ     saul
3     lord    absalom  paul       absalom
4     thou    son      peter      jephthah
5     thy     king     disciples  jubile
6     thee    behold   faith      ascendeteh
7     him     man      john       abimelech
8     god     judah    david      elias
9     his     land     saul       joab
10    hath    men      gospel     haman

•  Tf-Idf (with and without stop words) is used as the frequency-based metric

•  σp and σ* are used for the quantistic weighting

•  Reference document: the King James Bible

•  Idf and σf require a collection; we use the TREC WT10g collection

Application of σ* to ranking (1)

•  Based on the complementary features of quantistic and frequentistic weighting metrics, we would like to combine these two metrics.

•  Using the σ* metric, it is possible to rank a collection of documents against a query q

Application of σ* to ranking (2)

•  The combined metric is obtained through a linear combination of the Okapi BM25 and σ* metrics

•  A prerequisite for the linear combination is that the two sets of scores lie in a similar range

•  The scores are therefore normalized through:

•  The normalized scores are combined by:
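The normalization and combination formulas did not survive in this transcript, so the following is only a sketch under stated assumptions: min-max normalization stands in for the unspecified normalization step, and the weight α is applied so that α = 1 keeps only BM25 and α = 0 keeps only σ*, which matches the endpoints of the α table in Experiments (2). All function names and scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


def combine(bm25, sigma_star, alpha):
    """alpha = 1 keeps only BM25; alpha = 0 keeps only sigma*."""
    b, s = min_max(bm25), min_max(sigma_star)
    docs = set(b) | set(s)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * s.get(d, 0.0) for d in docs}


def ranking(scores):
    """Document ids sorted by decreasing score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for three documents and one query.
bm25 = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
sigma_star = {"d1": 0.4, "d2": 1.9, "d3": 1.1}
for alpha in (1.0, 0.8, 0.0):
    print(alpha, ranking(combine(bm25, sigma_star, alpha)))
```

Normalizing first keeps one metric from dominating the sum merely because its raw scores are on a larger scale.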

Experiments (1)

•  Collections:

  Web Track (WT10g): about 1,690,000 documents

  Robust Track: more than 500,000 documents

•  Evaluation measure: MAP (mean average precision)

•  Lucene with the BM25 extension created by Perez-Iglesias
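For readers unfamiliar with the evaluation measure, MAP can be sketched in a few lines of Python. This follows the standard definition (the mean over queries of the average of the precision values at the ranks where relevant documents appear) and is not the evaluation code used in the experiments; the document ids are made up.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k that hold a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, set_of_relevant_doc_ids), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```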

Experiments (2)

•  The quantistic metric alone does not work well:

•  The combined method significantly improves on the performance of classical IR methods

•  We let the α parameter vary over the range [0,1]: α = 1 corresponds to BM25 alone and α = 0 to σ* alone.

•  The results suggest that the method is reasonably robust: the combined method performed well over a range of α values.

Collection  Topics            BM25   σ*     BM25+σ*
WT10g       501-550           0.143  0.057  0.153
Robust      301-450, 601-700  0.195  0.089  0.203

α             1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1     0
MAP (WT10g)   .1436  .1469  .1537  .1535  .1501  .1379  .1222  .096   .0819  .0679   .0547
MAP (Robust)  .1954  .2033  .2031  .1983  .1673  .1549  .1428  .1203  .1075  .09674  .0898

Query by query analysis

[Figure: query-by-query results for BM25+σ*. X-axis: query number.]

Selective application of quantistic and frequentistic techniques

1. Relying on predictors of query difficulty to choose which metric to use (rationale: the quantistic method should do better on difficult queries)

2. Relying on document length to choose which metric to use (rationale: the quantistic method should do better on long documents)

Query hardness (1)

•  We used two well-known query performance predictors:

•  Simplified Clarity Score

•  σ1
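The slides do not give the predictor formulas. As a hedged sketch, the Simplified Clarity Score is commonly defined as the sum over query terms of P(w|Q) · log2(P(w|Q) / P(w|C)), where P(w|Q) is the relative frequency of w in the query and P(w|C) its relative frequency in the collection. The code below assumes that definition and assumes every query term occurs in the collection; the names and counts are illustrative.

```python
from collections import Counter
from math import log2


def simplified_clarity_score(query_terms, collection_tf, collection_tokens):
    """Divergence between the query term distribution and the collection's."""
    ql = len(query_terms)
    score = 0.0
    for term, qtf in Counter(query_terms).items():
        p_q = qtf / ql                                  # P(w|Q)
        p_c = collection_tf[term] / collection_tokens   # P(w|C), assumed > 0
        score += p_q * log2(p_q / p_c)
    return score


# Hypothetical collection of 4 tokens in which each query term occurs once.
scs = simplified_clarity_score(["a", "b"], {"a": 1, "b": 1}, collection_tokens=4)
print(scs)
```

Queries made of terms that are rare in the collection get a higher score, which such predictors read as a sign of a more specific, easier query.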

[Figure: Robust. Series: BM25 and σ*.]

[Figure: WT10g. Series: BM25 and σ*, each with a linear trend line.]

Query hardness (2)

• WT10g with σ1 predictor

• Robust with SCS predictor

• X-axis: predictor value

• Y-axis: MAP (for both BM25 and σ*)

Document Length (1)

•  Why use document length? Because the quantistic method works better on long texts

                        BM25  σ*
Relevant retrieved      1544  3729
Relevant not retrieved  4239  2115

Document Length (2)

• Collection: WT10g

• σ*

• BM25

• X-Axis: document’s length expressed in number of words

• Y-Axis: Cumulative percentage of relevant documents (retrieved in Blue, not retrieved in Red)

Conclusions on using a selective application of frequentistic and quantistic weighting

•  Query hardness did not work.

•  Using document length was more promising

Conclusions and future work

•  Definition of an extended quantistic weighting method through corpora analysis.

•  Integration of quantistic and frequentistic ranking methods

•  A linear combination yielded a significant performance improvement over the classical frequentistic method

•  Selective application: query hardness not useful, document length useful

•  This method could be applied to other Information Retrieval tasks, e.g.:

•  Document Summarization: to create a short version of a text

•  Query Expansion: to expand the query phrase (e.g., using synonyms)

•  Search Result Clustering: to group results into clusters

Thanks for listening! Questions?

