Bandwidth-Efficient Continuous Query Processing over DHTs

Bandwidth-Efficient Continuous Query Processing over DHTs Yingwu Zhu


Page 1: Bandwidth-Efficient Continuous Query Processing over DHTs

Bandwidth-Efficient Continuous Query Processing over DHTs

Yingwu Zhu

Page 2: Bandwidth-Efficient Continuous Query Processing over DHTs

Background

Instantaneous Query
Continuous Query

Page 3: Bandwidth-Efficient Continuous Query Processing over DHTs

Instantaneous Query (1)

Documents are indexed
  Node responsible for keyword t stores the IDs of documents containing that term (i.e., inverted lists)
Retrieve “one-time” relevant docs
  Latency is a top priority
Query Q = t1 Λ t2 …
  Fetch lists of doc IDs stored under t1, t2, …
  Intersect these lists
E.g.: Google search engine

Page 4: Bandwidth-Efficient Continuous Query Processing over DHTs

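Instantaneous Query (2)

[Figure: DHT nodes A, B, C, and D hold the inverted lists below and answer the query “cat Λ dog”.]

Inverted lists:
  cat: 1, 4, 7, 19, 20
  dog: 1, 5, 7, 26
  cow: 2, 4, 8, 18
  bat: 1, 8, 31

Query “cat Λ dog”:
  cat? returns cat: 1, 4, 7, 19, 20
  dog? returns dog: 1, 5, 7, 26
  Send result: Docs 1, 7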
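The lookup in this example can be sketched in a few lines of Python. This is a minimal single-process illustration, not the system's implementation: the `INDEX` dictionary stands in for inverted lists that would be spread over DHT nodes, and the function name `intersect_posting_lists` is an assumption made for this sketch.

```python
def intersect_posting_lists(index, terms):
    """Return the sorted doc IDs that appear in the inverted list of every term."""
    # Start from the shortest list so the running intersection stays small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = set(lists[0])
    for posting_list in lists[1:]:
        result &= set(posting_list)
    return sorted(result)


# Inverted lists taken from the slide; in a DHT each list lives on a different node.
INDEX = {
    "cat": [1, 4, 7, 19, 20],
    "dog": [1, 5, 7, 26],
    "cow": [2, 4, 8, 18],
    "bat": [1, 8, 31],
}

print(intersect_posting_lists(INDEX, ["cat", "dog"]))  # [1, 7]
```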

Page 5: Bandwidth-Efficient Continuous Query Processing over DHTs

Continuous Query (1)

Reverse the roles of documents and queries
Queries are indexed
  Query Q = t1 Λ t2 … is stored at one of the terms t1, t2, …
  Question 1: How is the index term selected? (query indexing)
“Push” new relevant docs (incrementally)
  Enabled by “long-lived” queries
E.g.: Google News Alert feature

Page 6: Bandwidth-Efficient Continuous Query Processing over DHTs

Continuous Query (2)

Upon insertion of a new doc D = t1 Λ t2
  Contact the nodes responsible for the inverted query lists of D’s keywords t1 and t2
  Question 2: How to locate these nodes (query nodes, QN)? (document announcement)
Resolve the query lists into the final list of queries satisfied by D
  Question 3: What is the resolution strategy? (query resolution)
  E.g., Term Dialogue, Bloom filters (Infocom’06)
Notify owners of satisfied queries

Page 7: Bandwidth-Efficient Continuous Query Processing over DHTs

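Query Resolution: Term Dialogue

[Figure: nodes A, B, C, and D. A new Doc with terms cat, dog, cow is announced to the node holding the inverted query list for “cat”; other nodes hold the inverted lists for “dog” and “cow”.]

Queries indexed under “cat” (other terms of each query):
  1. dog
  2. horse & dog
  3. horse & cow

Messages:
  1. Document announcement
  2. “dog” & “cow”
  3. “11” (bit vector)
  4. “horse”
  5. “0” (bit vector)
Notify owner of Q1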
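The message exchange in this example can be replayed with a short Python sketch. It is an illustration under assumptions, not the protocol from the Infocom’06 work: the function names are hypothetical, both sides run in one process, and the order in which terms are asked is simply copied from the slide rather than derived from any policy.

```python
def answer_terms(doc_terms, asked_terms):
    """Document side: one bit per asked term, '1' if the doc contains the term."""
    return "".join("1" if term in doc_terms else "0" for term in asked_terms)


def satisfied_queries(queries, known):
    """Query-node side: IDs of queries whose terms are all known to be in the doc."""
    return [qid for qid, terms in queries.items()
            if all(known.get(term, False) for term in terms)]


doc_terms = {"cat", "dog", "cow"}
# Queries indexed under "cat"; only their other terms still need resolving.
cat_queries = {1: ["dog"], 2: ["horse", "dog"], 3: ["horse", "cow"]}

known = {}
for asked in (["dog", "cow"], ["horse"]):    # messages 2 and 4 on the slide
    bits = answer_terms(doc_terms, asked)    # messages 3 ("11") and 5 ("0")
    known.update({term: bit == "1" for term, bit in zip(asked, bits)})

print(satisfied_queries(cat_queries, known))  # [1]
```

Only Q1 survives, because “horse” is absent from the document, which matches the notification on the slide.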

Page 8: Bandwidth-Efficient Continuous Query Processing over DHTs

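Query Resolution: Bloom filters

[Figure: nodes A, B, C, and D. A new Doc with terms cat, dog, cow is announced to the node holding the inverted query list for “cat”; other nodes hold the inverted lists for “dog” and “cow”.]

Queries indexed under “cat” (other terms of each query):
  1. dog
  2. horse & dog
  3. horse & cow

Messages:
  1. Doc announcement carrying “10110” (Bloom filter)
  2. “dog” (Term Dialogue)
  3. “1” (bit vector)
Notify owner of Q1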
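A rough Python sketch of this two-phase resolution follows. It assumes a generic Bloom filter built from SHA-256 (the filter size, hash count, class, and function names are all choices made for this illustration, not parameters from the talk), and the final membership check stands in for the Term Dialogue round that confirms candidates, since a Bloom filter can return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are illustrative choices."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, term):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def resolve(doc_terms, queries):
    """Return (candidate query IDs after filtering, query IDs confirmed as satisfied)."""
    bloom = BloomFilter()
    for term in doc_terms:
        bloom.add(term)
    # Phase 1: drop every query with a term the filter rules out.
    candidates = [qid for qid, terms in queries.items()
                  if all(bloom.might_contain(t) for t in terms)]
    # Phase 2: confirm the survivors against the real terms (the Term Dialogue step).
    satisfied = [qid for qid in candidates
                 if all(t in doc_terms for t in queries[qid])]
    return candidates, satisfied


doc_terms = {"cat", "dog", "cow"}
cat_queries = {1: ["dog"], 2: ["horse", "dog"], 3: ["horse", "cow"]}
candidates, satisfied = resolve(doc_terms, cat_queries)
print(satisfied)  # [1]
```

The bandwidth benefit comes from phase 1: terms of filtered-out queries never need to be asked about.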

Page 9: Bandwidth-Efficient Continuous Query Processing over DHTs

Motivation

Latency is not the primary concern, but bandwidth can be an important design issue
  Different query indexing schemes incur different costs
  Different query resolution strategies incur different costs
Design a bandwidth-efficient continuous query system with “proper” query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches

Page 10: Bandwidth-Efficient Continuous Query Processing over DHTs

Contributions

Novel query indexing schemes (Question #1)
  Focus of this talk!
Multicast-based document announcement (Question #2)
  In the paper
Adaptive query resolution (Question #3)
  Make intelligent decisions in resolving query terms
  Minimize the bandwidth cost
  In the full tech. report

Page 11: Bandwidth-Efficient Continuous Query Processing over DHTs

Design

Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ … Λ tn
Leverage DHTs
  Location & storage of documents and continuous queries
Query indexing
  How to choose index terms for queries?
Doc. announcement, query resolution
  Not covered in this talk!

Page 12: Bandwidth-Efficient Continuous Query Processing over DHTs

Current Indexing Schemes

Random Indexing (RI)
Optimal Indexing (OI)

Page 13: Bandwidth-Efficient Continuous Query Processing over DHTs

Random Indexing (RI)

Randomly chooses a term as index term
  Q = t1 Λ … Λ tm
  Index term ti is randomly selected
  Q is indexed in a DHT node responsible for ti
Pros: simple
Cons:
  Popular terms are more likely to be index terms for queries
  Load imbalance
  Introduces many irrelevant queries in query resolution, wasting bandwidth

Page 14: Bandwidth-Efficient Continuous Query Processing over DHTs

Optimal Indexing (OI)

Q = t1 Λ … Λ tm
  Index term ti is deterministically chosen: the most selective term, i.e., the one with the lowest frequency
  Q is indexed in a DHT node responsible for ti
Pros: maximizes load balance & minimizes bandwidth cost
Cons:
  Assumes perfect knowledge of term statistics
  Impractical, e.g., due to large number of documents, node churn, continuous doc updates, …
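The two baseline schemes can be contrasted in a short Python sketch. The function names and the `TERM_FREQ` numbers are made up for illustration; in particular, the frequency table is exactly the global knowledge that OI assumes and that a real DHT would not have.

```python
import random


def random_index_term(query_terms, rng=random):
    """RI: pick any query term uniformly at random."""
    return rng.choice(list(query_terms))


def optimal_index_term(query_terms, term_freq):
    """OI: pick the least frequent (most selective) term; ties broken by term name."""
    return min(query_terms, key=lambda term: (term_freq.get(term, 0), term))


# Hypothetical document frequencies, used only for this example.
TERM_FREQ = {"cat": 5000, "dog": 4200, "bat": 37}
query = ["cat", "dog", "bat"]

print(optimal_index_term(query, TERM_FREQ))  # bat
print(random_index_term(query, random.Random(0)) in query)  # True
```

With RI, two of the three possible choices here would place the query under a popular term, which is the load-imbalance problem noted above.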

Page 15: Bandwidth-Efficient Continuous Query Processing over DHTs

Solution 1: MHI

Minimum Hash Indexing
  Order query terms by their hashes
  Select the term with minimum hash as the index term
Q = t1 Λ … Λ tm
  Index term ti is deterministically chosen, s.t. h(ti) < h(tx) (for all x ≠ i)
  Q is indexed in a DHT node responsible for ti
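A minimal Python sketch of the selection rule is shown below. The talk does not say which hash function h is used, so SHA-1 (common for Chord-style identifiers) is assumed here, and the names `term_hash` and `mhi_index_term` are hypothetical.

```python
import hashlib


def term_hash(term):
    """Hash a term to an integer; SHA-1 is an assumption, not taken from the talk."""
    return int.from_bytes(hashlib.sha1(term.encode("utf-8")).digest(), "big")


def mhi_index_term(query_terms, h=term_hash):
    """MHI: the index term is the query term with the minimum hash value."""
    return min(query_terms, key=h)


query = ["cat", "dog", "cow"]
index_term = mhi_index_term(query)
# The choice depends only on the set of terms, not on their order in the query.
print(index_term == mhi_index_term(list(reversed(query))))  # True
```

Because every node computes the same hashes, any node can determine a query's index term without exchanging term statistics.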

Page 16: Bandwidth-Efficient Continuous Query Processing over DHTs

RI vs. MHI

Terms: t1 t2 t3 t4 t5 t6 t7, where h(ti) < h(tj) for i < j
D = {t2, t4, t5, t6}
3 queries, irrelevant to D:
  Q1 = t1 Λ t2 Λ t4
  Q2 = t3 Λ t4 Λ t5
  Q3 = t3 Λ t5 Λ t6
(1) RI
  Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (need to resolve terms t1, t2, t3, t4, t5, and t6)
(2) MHI
  All of them will be filtered out! Bandwidth savings!
  How?

Page 17: Bandwidth-Efficient Continuous Query Processing over DHTs

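MHI: filtering irrelevant queries!

[Figure: DHT nodes A to G. Document D = {t2, t4, t5, t6} is announced to the nodes responsible for its terms.]

Q1 = t1 Λ t2 Λ t4
Q2 = t3 Λ t4 Λ t5
Q3 = t3 Λ t5 Λ t6

Queries indexed under each term:
  t1: Q1
  t2: none (No action)
  t3: Q2, Q3
  t4: none (No action)
  t5: none (No action)
  t6: none (No action)

Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!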
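The numbers in this example can be checked with a few lines of Python. The sketch assumes, as the example states, that h(ti) < h(tj) for i < j, so the subscript of each term is used as a stand-in for its hash; the function names are hypothetical.

```python
from fractions import Fraction

D = {"t2", "t4", "t5", "t6"}
QUERIES = {
    "Q1": ["t1", "t2", "t4"],
    "Q2": ["t3", "t4", "t5"],
    "Q3": ["t3", "t5", "t6"],
}


def rank(term):
    """Stand-in for h: the subscript orders the terms exactly as the hash does."""
    return int(term[1:])


def ri_probability(query, doc):
    """RI: chance that the randomly chosen index term is one of the doc's terms."""
    return Fraction(sum(term in doc for term in query), len(query))


def mhi_considered(query, doc):
    """MHI: the query is reached only if its minimum-hash term is in the doc."""
    return min(query, key=rank) in doc


for name, query in QUERIES.items():
    print(name, ri_probability(query, D), mhi_considered(query, D))
```

Each query shares two of its three terms with D, so RI reaches it with probability 2/3 (about 67%). Under MHI the index terms are t1, t3, and t3, none of which is in D, so no node contacted by the announcement holds any of the three queries.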

Page 18: Bandwidth-Efficient Continuous Query Processing over DHTs

MHI

Pros:
  Simple and deterministic
  Does not require term stats
  Saves bandwidth over RI (up to 39.3% saving for various query types)
Cons:
  Some popular terms can still become index terms because they have the minimum hash in their queries!
  Load imbalance & irrelevant queries to process

Page 19: Bandwidth-Efficient Continuous Query Processing over DHTs

Solution 2: SAP-MHI

MHI is good but may still index queries under popular terms
SAmPling-based MHI (SAP-MHI)
  Sampling (synopsis of K popular terms) + MHI
  Avoid indexing queries under K popular terms
  Challenge: support duplicate-sensitive aggregates of popular terms, as synopses may be gossiped over multiple DHT overlay links and term frequencies may be overestimated!
  Borrow idea from duplicate-sensitive aggregation in sensor networks

Page 20: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI: Duplicate-sensitive aggregation

Goal: a synopsis of K popular terms
Based on coin tossing experiment CT(y)
  Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses
Each node a
  Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y))
  Gossips Sa to its neighbor nodes
  Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb, producing a new synopsis Sa (max() operations)
Thus, each node has a synopsis of K popular terms after a sufficient number of gossip rounds
Intuition: If a term appears in more documents, then its value produced by CT(y) will be larger than the values of rare terms
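The aggregation steps above might look roughly like the following Python sketch. Several details are assumptions, because the slide does not spell them out: one CT(y) experiment is run per (term, document) occurrence and the maximum is kept per term, ties are broken by term name, and the function names are hypothetical. The key property is that merging uses max(), so receiving the same synopsis twice over different overlay links does not inflate a term's value.

```python
import random


def ct(y, rng):
    """Coin tossing CT(y): tosses until the first head, capped at y tosses."""
    for toss in range(1, y + 1):
        if rng.random() < 0.5:  # head
            return toss
    return y  # y tosses without a head


def top_k(values, k):
    """Keep the k terms with the highest values (ties broken by term name)."""
    ranked = sorted(values.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def local_synopsis(docs, k, y, rng):
    """Assumed scheme: one CT(y) run per term occurrence, keeping the max per term."""
    values = {}
    for doc_terms in docs:
        for term in set(doc_terms):
            values[term] = max(values.get(term, 0), ct(y, rng))
    return top_k(values, k)


def merge(sa, sb, k):
    """Aggregate two synopses with max(); duplicates leave the result unchanged."""
    merged = dict(sa)
    for term, value in sb.items():
        merged[term] = max(merged.get(term, 0), value)
    return top_k(merged, k)


rng = random.Random(42)
docs_a = [["the", "cat"], ["the", "dog"], ["the", "cow"]]
docs_b = [["the", "bat"], ["the", "cat"]]
sa = local_synopsis(docs_a, k=2, y=32, rng=rng)
sb = local_synopsis(docs_b, k=2, y=32, rng=rng)
print(merge(sa, sb, k=2) == merge(merge(sa, sb, k=2), sb, k=2))  # True
```

A term seen in n documents gets the maximum of n coin-tossing runs, which tends to grow with n, so frequent terms tend to end up with the largest values in the synopsis.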

Page 21: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI: Indexing Example

Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
Synopsis S = {t1, t2}
Q is indexed on the node responsible for t3, instead of t1
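This example can be reproduced with a small Python sketch. The function name is hypothetical, the term subscript again stands in for the hash order given in the example, and the fallback to plain MHI when every query term is in the synopsis is an assumption, since the slide does not cover that case.

```python
def sap_mhi_index_term(query_terms, synopsis, h):
    """SAP-MHI: minimum-hash term among the terms not in the popular-term synopsis."""
    candidates = [term for term in query_terms if term not in synopsis]
    if not candidates:
        # Assumption: if every term is popular, fall back to plain MHI.
        candidates = list(query_terms)
    return min(candidates, key=h)


def rank(term):
    """Stand-in for h, matching h(t1) < h(t2) < ... < h(t5) in the example."""
    return int(term[1:])


Q = ["t1", "t2", "t3", "t4", "t5"]
S = {"t1", "t2"}
print(sap_mhi_index_term(Q, S, rank))      # t3
print(sap_mhi_index_term(Q, set(), rank))  # t1 (plain MHI)
```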

Page 22: Bandwidth-Efficient Continuous Query Processing over DHTs

Simulations

Parameter                   Value
DHT                         1000-node Chord
Document collection         TREC-1,2-AP
Mean of query sizes         5
# of continuous queries     100,000
# of docs                   10,000
# of unique terms           46,654
# of unique terms per doc   178
Query types                 Skew, Uniform, InverSkew
Query resolution            Term Dialogue, Bloom filters

Page 23: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI vs. MHI

SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.

Page 24: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI vs. MHI

[Figure: bandwidth saving (%) vs. synopsis size K (100, 500, 1000, 1500, 2000, 3000) for Skew, Uniform, and InverSkew queries.]

Bloom filters are used in query resolution.

Page 25: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI vs. MHI

[Figure: bandwidth saving (%) vs. synopsis size K (100, 500, 1000, 1500, 2000, 3000) for Skew, Uniform, and InverSkew queries.]

Term Dialogue is used in query resolution.

Page 26: Bandwidth-Efficient Continuous Query Processing over DHTs

SAP-MHI vs. MHI

[Figure: % of queries filtered vs. synopsis size K (100, 500, 1000, 1500, 2000, 3000) for Skew, Uniform, and InverSkew queries.]

This shows why SAP-MHI saves bandwidth over MHI!

Page 27: Bandwidth-Efficient Continuous Query Processing over DHTs

Summary

Focus on a simple keyword query model
Bandwidth is a top priority
Query indexing impacts bandwidth cost
  Goal: Sift out as many irrelevant queries as possible!
  MHI and SAP-MHI
SAP-MHI is a more viable solution
  Load is more balanced, more bandwidth saving!
  Sampling cost is controlled
    # of popular terms is relatively low
    Memberships of popular terms do not change rapidly
Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)

Page 28: Bandwidth-Efficient Continuous Query Processing over DHTs

Thank You!