Onur Koçak CS533 Fall 2017 HW5 Solutions -...

10
Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval Fall 2017 HW5 Solutions Q1 a) For λ = 1, we select documents based on similarity Thus, d1> d2> d4> d3 Start with d1 , S = {d1} R\S = { d2, d4, d3} MMR(d2) = 0.7 Maximum. Therefore S={ d1, d2 } = 0.67 MMR(d3) = 0.4 MMR(d4) = 0.6 b) For λ = 0 : Rank documents based on diversity S= { d1} without MMR ; R\S = { d2, d3, d4} MMR (d2 )= 0.67 = -0.67 MMR (d3 )= 0.5 = -0.5 MMR (d4 )= -0.2 = -0.2 Maximum. Therefore S={ d1, d4 } = 0.20 c) For λ = 0.5 : S= { d1} without MMR ; R\S = { d2, d3, d4} MMR (d2 )= 0.5 * 0.7 0.5 * 0.67 = 0.015 MMR (d3 )= 0.5 * 0.4 0.5 * 0.5 = -0.05 MMR (d4 )= 0.5 * 0.6 0.5 * 0.2 = 0.2 Maximum. Therefore S={ d1, d4 } = 0.20

Transcript of Onur Koçak CS533 Fall 2017 HW5 Solutions -...

Page 1: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

1

CS533 Information Retrieval – Fall 2017

HW5 Solutions

Q1

a) For λ = 1, we select documents based on similarity

Thus, d1> d2> d4> d3

Start with d1 , S = {d1} R\S = { d2, d4, d3}

MMR(d2) = 0.7 Maximum. Therefore S={ d1, d2 } = 0.67

MMR(d3) = 0.4

MMR(d4) = 0.6

b) For λ = 0 : Rank documents based on diversity

S= { d1} without MMR ; R\S = { d2, d3, d4}

MMR (d2 )= –0.67 = -0.67

MMR (d3 )= –0.5 = -0.5

MMR (d4 )= -0.2 = -0.2 Maximum. Therefore S={ d1, d4 } = 0.20

c) For λ = 0.5 :

S= { d1} without MMR ; R\S = { d2, d3, d4}

MMR (d2 )= 0.5 * 0.7 – 0.5 * 0.67 = 0.015

MMR (d3 )= 0.5 * 0.4 – 0.5 * 0.5 = -0.05

MMR (d4 )= 0.5 * 0.6 – 0.5 * 0.2 = 0.2 Maximum. Therefore S={ d1, d4 } = 0.20

Page 2: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

2

d) For λ = 1, we select documents based on similarity

From answer Q1a S={ d1, d2 } R\S = { d4, d3}

MMR(d3) = 0.4

MMR(d4) = 0.6 Maximum. Therefore S={ d1, d2, d4 }

S = { d1, d2} + { d2, d4 } + { d1, d4 } = 0.67 + 0.10 + 0.20 = 0.97

For λ = 0 : Rank documents based on diversity

From answer Q1b S={ d1, d4 } R\S = { d2, d3 }

MMR (d2 )= –0.67 = -0.67

MMR (d3 )= –0.5 = -0.5 Maximum. Therefore S={ d1, d4, d3 } = 0.7

S = { d1, d4} + { d1, d3 } + { d4, d3 } = 0.2 + 0.50 + 0.0 = 0.70

For λ = 0.5 :

From answer Q1c S={ d1, d4 } R\S = { d2, d3 }

MMR (d2 )= 0.5 * 0.7 – 0.5 * 0.67 = 0.015 Maximum. Therefore S={ d1, d4, d2} = 0.97

MMR (d3 )= 0.5 * 0.4 – 0.5 * 0.5 = -0.05

S = { d1, d4} + { d1, d2 } + { d4, d2 } = 0.2 + 0.67 + 0.1 = 0.97

λ K=2 K=3 1.0 0.67 0.97 0.5 0.20 0.97 0.0 0.20 0.70

Remarks:

As the below table shows, λ changes the relevance / diversity balance of results. As λ

increases, the total similarity increases, and as λ decreases, the total similarity decreases

and diversity increases. The results can better be observed with more number of

documents and terms.

λ Similarity K=2 Similarity K=3 1.0 0.67 0.97 0.5 0.20 0.97 0.0 0.20 0.70

Page 3: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

3

Q2

a) S-Recall can be defined as:

The unique number of topics covered until nth rank / Total number of unique topics

S-Recall @ 5 = 6 / 6 = 1.0

S-Recall @ 10 = 6 / 6 = 1.0

b) Precision -IA

Rank m1 m2 m3 m4 m5 m6

1 1

2 1

3 1 1

4 1 1

5 1

6 1

7 1

8 1

9 1

10 1

P@5 1/5 1/5 1/5 1/5 1/5 2/5

P@10 2/10 2/10 2/10 2/10 2/10 2/10

Precision-IA @ 5

Precision-IA @ 10

Page 4: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

4

Q3

a) Reciprocal Rank

(Divisions are based on the rank in each result set)

Page 5: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

5

b) Borda Count Method

The highest rank individual (in an n-way vote) gets n votes and each subsequent

gets one vote less (so #2 get n-1 …).

BC(a) = BCA(a) + BCB(a) + BCC(a) + BCD(a) = 3 +3 + 1 +4 = 11

BC(b) = BCA(b) + BCB(b) + BCC(b) + BCD(b) = 4 + 4 + 4 = 12

BC(c) = BCA(c) + BCB(c) + BCC(c) + BCD(c) = 1 +3 +3 = 7

BC(d) = BCA(d) + BCB(d) + BCC(d) + BCD(d) = 2 +2 +2 +2 = 8

BC(e) = BCA(e) + BCB(e) + BCC(e) + BCD(e) = 0 +0 +0 +1 = 1

BC(f) = BCA(d) + BCB(d) + BCC(d) + BCD(d) = 0 +1 +0 +0 = 1

Hence the result is b>a>d>c > e=f

c) Condorcet Method

* a b c d e f

a - 1,3,0 3,1,0 3,1,0 4,0,0 4,0,0

b 3,1,0 - 3,1,0 3,1,0 3,1,0 3,0,1

c 1,3,0 1,3,0 - 2,2,0 3,0,1 3,1,0

d 1,3,0 1,3,0 2,2,0 - 4,0,0 4,0,0

e 0,4,0 1,3,0 0,3,1 0,4,0 - 1,1,2

f 0,4,0 0,3,1 1,3,0 0,4,0 1,1,2 -

* Win Lose Tie

a 4 1 0

b 5 0 0

c 2 2 1

d 2 2 1

e 0 4 1

f 0 4 1

Final ranking of documents: b>a>c=d > e=f

Page 6: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

6

Q4

a)

File size = N * F

Each object is assigned with 1024 bits.

Filesize= 1024 * 512 000 /8 /1024= 64 000 Kbytes = 62.5 Mb

b)

Each memory element holds the bits of associated columns of object signatures.

There are 512000 objects with 1024 bits long.

Filesize = F*N = 1024 * 512 000 /8 /1024= 64 000 Kbytes = 62.5 Mb

Q5

Page size = 0.5 Kbytes

1024 bit signatures

512000 objects

0.5 Kbytes / 1024 bits = 4 signatures in each page

Sequential signature method requires all pages to be accessed.

512000 / 4 = 128 000 pages needs to be accessed in SS.

As long as the page signatures comply with the query signatures, the page must be accessed. As

the query is mostly 0, a high portion of the pages will be accessed. In the worst case, all 128000

will be accessed.

In BS, there are 1024 memory objects / 4 = 256 pages

Since there are 5 positive bits (1s) 5 of these pages will be accessed in the next iterations.

Page 7: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

7

Q6

a)

K=2

00 01 10 11 S5 S8

S1 S2

S6 S7 S9 S10

S3 S4

b)

i) & AND first 2 bits of each query with partition representative bits.

K=1 K=2 K=3 Q1 = 1110 1 (1) 1 (11) 2 (110, 111) Q2 = 0110 2 (0.1) 4 (00,01,10,11) 8 (011) Q3 = 1100 1 (1) 1 (11) 2 (110.111)

PAR (k=1) PAR k=2 PAR k=3 Q1 ½ ¼ 2/8 Q2 2/2 4/4 8/8 Q3 4/2 1/4 2/8

Page 8: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

8

ii) Turnaround time = completion time –arrival time.

Each query takes 1 unit time.

In sequential processing all queries arrives at time 0.

K=2 all queries arrives at t=0

Q1 Q2 Q2 Q2 Q2 Q3 1 2 3 4 5 6

Sequential Parallel Q1 1 1 Q2 2 2 Q3 6 3

Sequential Processing Avg. Turnaround Time = 1 + 3 +4 /3 = /2.66 tu

Parallel Avg. Turnaround Time: 1+2+3/3 = 2 tu

Parallel Speedup ratio =3/2 = 1.5

Q7

a) In EPP (extended prefix partitioning) the key length is chosen to be the shortest prefix

which contains a predefined number of zeros described by z.

For Z=2 partition structure is as follows:

Partition Signatures

P1 1110 0001 (Q1)

P2 0110 0011 (Q2)

P3 1100 1100 (Q3)

b) In FKP (fixed key partitioning) we examine each of the consecutive no overlapping k-

substrings of a signature and selects the leftmost substring that has the least amount of

1s.

For k=2 partition structure is as follows:

Q1 = 11 10 00 01

Q2 = 01 10 00 11

Q3 = 11 00 11 00

Partition Signatures

P1 11 10 00 01 01 10 00 11

P2 11 00 11 00

Page 9: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

9

c)

For signature query Q1 since the prefix of the query is ‘11100’ will have to access 1

partition for EPP and 2 partitions for FKP

For signature query Q2 since the prefix of the query is 0110 we have to access 2

partitions (p1 , p2)

For signature query Q3 because of the prefix we can access only P1 and P3 of EPP, but

we access 1 partitions of FKP.

Q8

Blocksize = 3

LoadFactor = 2/3

When we reach LF = 2/3 we can add 2 more bits and update the signature file.

Bv = 0, h=1

0 S1 S2 1 S3 S4

Lf = 4 / (2*3) = 2/3 insert 2 more

0 S1 S2 S5 1 S3 S4 S6

Page 10: Onur Koçak CS533 Fall 2017 HW5 Solutions - cs.bilkent.edu.trcanf/CS533/CS533Fall2017/HW5oKocak.pdf · Onur Koçak CS533 Fall 2017 HW5 Solutions 1 CS533 Information Retrieval –

Onur Koçak CS533 Fall 2017 HW5 Solutions

10

Bv = 1, h =1

01 S6 1 S3 S4 10 S1 S2 S5

Lf = 6 / (6*3) = 2/3 insert 2 more

01 S6 1 S3 S4 S8 10 S1 S2 S5

Bv = 0, h =1

00 S10 01 S6 10 S1 S2 S5 S7 S9 11 S3 S4 S8

Lf = 4 / (4*3) = 1/3 insert 1 more (S8 inserted)

Q1 1110 0001

Q2 0110 0011

Since las two bits of Q1 is 01 we need to access 01 and 11 and for Q2 we need to access

page 11.

Q9

In ranked key method, least frequent term in user profiles are more likely to appear

frequently in documents.