Bandwidth-Efficient Continuous Query Processing over DHTs

Yingwu Zhu

Background

Instantaneous Query
Continuous Query

Instantaneous Query (1)

Documents are indexed
  Node responsible for keyword t stores the IDs of documents containing that term (i.e., inverted lists)
Retrieve “one-time” relevant docs
Latency is a top priority
Query Q = t1 Λ t2 …
  Fetch lists of doc IDs stored under t1, t2, …
  Intersect these lists
E.g.: Google search engine

Instantaneous Query (2)

[Figure: four DHT nodes A, B, C, and D store the inverted lists cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31. The query “cat Λ dog” fetches the lists for cat and dog and intersects them.]

Send Result: Docs 1, 7
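The lookup in this example can be sketched in a few lines of Python. This is only an illustration of the fetch-and-intersect step: the function name `instantaneous_query` is hypothetical, and the placement of lists on DHT nodes is not modeled.

```python
# Inverted lists from the example; which node stores which list is not modeled.
inverted_lists = {
    "cat": [1, 4, 7, 19, 20],
    "dog": [1, 5, 7, 26],
    "cow": [2, 4, 8, 18],
    "bat": [1, 8, 31],
}

def instantaneous_query(terms, index):
    """Fetch the doc-ID list of every query term and intersect the lists."""
    lists = [set(index.get(term, ())) for term in terms]
    if not lists:
        return []
    return sorted(set.intersection(*lists))

result = instantaneous_query(["cat", "dog"], inverted_lists)
print(result)  # [1, 7]
```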

Continuous Query (1)

Reverse the role of documents and queries
Queries are indexed
  Query Q = t1 Λ t2 … is stored at one of the terms t1, t2, …
  Question 1: How is the index term selected? (query indexing)
“Push” new relevant docs (incrementally)
Enabled by “long-lived” queries
E.g.: Google News Alert feature

Continuous Query (2)

Upon insertion of a new doc D = t1 Λ t2
  Contact the nodes responsible for the inverted query lists of D’s keywords t1 and t2
    Question 2: How to locate the nodes (query nodes, QN)? (document announcement)
  Resolve the query lists into the final list of queries satisfied by D
    Question 3: What is the resolution strategy? (query resolution)
    E.g., Term Dialogue, Bloom filters (Infocom’06)
  Notify owners of satisfied queries

Query Resolution: Term Dialogue

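[Figure: nodes A, B, C, and D. A new document with the terms cat, dog, and cow is announced to the node holding the inverted query list for “cat”, which stores three queries by their remaining terms: 1. dog; 2. horse & dog; 3. horse & cow. Other nodes hold the inverted query lists for “dog” and “cow”.]

Messages exchanged:
1. Document announcement
2. “dog” & “cow”
3. “11” (bit vector)
4. “horse”
5. “0” (bit vector)
Notify owner of Q1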
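The exchange in this example can be sketched as a small simulation. This is a reading of the slide, not the protocol specification: it assumes the query node asks about terms and the document node answers with one bit per term, and the rule for choosing which term to ask next (one unresolved term per pending query, alphabetically smallest) is a hypothetical choice that happens to reproduce the two rounds shown. It also assumes every stored query has at least one remaining term.

```python
def term_dialogue(doc_terms, query_list):
    """Resolve queries stored at one query node against one document.

    query_list maps a query id to the terms left after removing the index
    term, e.g. "cat AND horse AND dog" stored under "cat" is {"horse", "dog"}.
    Returns (satisfied query ids, list of (asked terms, reply bit vector)).
    """
    doc_terms = set(doc_terms)
    answers = {}  # term -> True/False, as learned from the replies
    pending = {qid: set(terms) for qid, terms in query_list.items()}
    satisfied, rounds = [], []
    while pending:
        # Query node: ask about one unresolved term per pending query.
        ask = sorted({min(terms) for terms in pending.values()})
        # Document node: reply with one bit per asked term.
        bits = "".join("1" if term in doc_terms else "0" for term in ask)
        rounds.append((ask, bits))
        answers.update({term: term in doc_terms for term in ask})
        for qid in list(pending):
            terms = pending[qid]
            if any(answers.get(term) is False for term in terms):
                del pending[qid]  # a required term is missing from the doc
                continue
            terms -= {term for term in terms if answers.get(term)}
            if not terms:
                satisfied.append(qid)  # every term confirmed
                del pending[qid]
    return satisfied, rounds

queries_under_cat = {"Q1": {"dog"}, "Q2": {"horse", "dog"}, "Q3": {"horse", "cow"}}
satisfied, rounds = term_dialogue({"cat", "dog", "cow"}, queries_under_cat)
print(satisfied)  # ['Q1']
print(rounds)     # [(['cow', 'dog'], '11'), (['horse'], '0')]
```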

Query Resolution: Bloom filters

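[Figure: nodes A, B, C, and D. A new document with the terms cat, dog, and cow is announced to the node holding the inverted query list for “cat”, which stores three queries by their remaining terms: 1. dog; 2. horse & dog; 3. horse & cow. Other nodes hold the inverted query lists for “dog” and “cow”.]

Messages exchanged:
1. Doc announcement “10110” (Bloom filter)
2. “dog” (Term Dialogue)
3. “1” (bit vector)
Notify owner of Q1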
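A minimal sketch of this strategy follows, assuming the announcement carries a Bloom filter of the document's terms, the query node prunes queries whose terms are certainly absent, and the survivors are confirmed exactly (the Term Dialogue step). The `BloomFilter` class, its size, and its SHA-1 based hash positions are illustrative assumptions, not parameters taken from the talk.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are arbitrary example values."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a Python int

    def _positions(self, term):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # False means "certainly absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

def candidate_queries(bloom, query_list):
    """Keep only queries whose remaining terms all pass the filter."""
    return [qid for qid, terms in query_list.items()
            if all(bloom.might_contain(term) for term in terms)]

def confirm(doc_terms, query_list, candidates):
    """Exact check that removes false positives (stands in for Term Dialogue)."""
    doc_terms = set(doc_terms)
    return [qid for qid in candidates if query_list[qid] <= doc_terms]

doc_terms = {"cat", "dog", "cow"}
bloom = BloomFilter()
for term in doc_terms:
    bloom.add(term)

queries_under_cat = {"Q1": {"dog"}, "Q2": {"horse", "dog"}, "Q3": {"horse", "cow"}}
candidates = candidate_queries(bloom, queries_under_cat)
satisfied = confirm(doc_terms, queries_under_cat, candidates)
print(satisfied)  # ['Q1']
```

The filter never drops a satisfied query, because a Bloom filter has no false negatives; the confirmation step is what keeps a false positive from triggering a wrong notification.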

Motivation

Latency is not the primary concern, but bandwidth can be one of the important design issues
  Various query indexing schemes incur different costs
  Various query resolution strategies incur different costs

Design a bandwidth-efficient continuous query system with “proper” query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches

Contributions

Novel query indexing schemes (Question #1)
  Focus of this talk!
Multicast-based document announcement (Question #2)
  In the paper
Adaptive query resolution (Question #3)
  Make intelligent decisions in resolving query terms
  Minimize the bandwidth cost
  In the full tech. report

Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ … Λ tn
Leverage DHTs
  Location & storage of documents and continuous queries
Query indexing
  How to choose index terms for queries?
Doc. announcement, query resolution
  Not covered in this talk!

Design

Current Indexing Schemes

Random Indexing (RI)
Optimal Indexing (OI)

Random Indexing (RI)

Randomly chooses a term as index term
  Q = t1 Λ … Λ tm
  Index term ti is randomly selected
  Q is indexed in a DHT node responsible for ti
Pros: simple
Cons:
  Popular terms are more likely to be index terms for queries
    Load imbalance
    Introduces many irrelevant queries in query resolution, wasting bandwidth

Optimal Indexing (OI)

Q = t1 Λ … Λ tm
  Index term ti is deterministically chosen: the most selective term, i.e., the one with the least frequency
  Q is indexed in a DHT node responsible for ti
Pros:
  Maximizes load balance & minimizes bandwidth cost
Cons:
  Assumes perfect knowledge of term statistics
  Impractical, e.g., due to large number of documents, node churn, continuous doc updates, …
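The two existing schemes differ only in how the index term is picked, which a short sketch can make concrete. The function names and the `doc_frequency` table are hypothetical; in particular, the table stands in for the global term statistics that OI assumes and that are hard to obtain in practice.

```python
import random

def random_indexing(query_terms, rng=random):
    """RI: pick any query term uniformly at random as the index term."""
    return rng.choice(sorted(query_terms))

def optimal_indexing(query_terms, doc_frequency):
    """OI: pick the most selective term, i.e. the one in the fewest documents."""
    return min(sorted(query_terms), key=lambda term: doc_frequency.get(term, 0))

# Hypothetical global statistics: number of documents containing each term.
doc_frequency = {"cat": 5, "dog": 4, "horse": 1}
query = {"cat", "dog", "horse"}

oi_term = optimal_indexing(query, doc_frequency)
ri_term = random_indexing(query, random.Random(0))
print(oi_term)  # horse
```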

Solution 1: MHI

Minimum Hash Indexing
  Order query terms by their hashes
  Select the term with minimum hash as the index term
Q = t1 Λ … Λ tm
  Index term ti is deterministically chosen, s.t. h(ti) < h(tx) (for all x ≠ i)
  Q is indexed in a DHT node responsible for ti
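MHI's selection rule is a one-line minimum over term hashes. The sketch below assumes SHA-1 as the hash function h, which is an illustrative choice; the slides do not name the hash used.

```python
import hashlib

def h(term):
    """Hash a term to an integer (SHA-1 is an assumption for illustration)."""
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def mhi_index_term(query_terms):
    """Return the query term with the minimum hash value."""
    return min(query_terms, key=h)

index_term = mhi_index_term(["cat", "dog", "horse"])
# The choice is deterministic and independent of term order in the query.
same_term = mhi_index_term(["horse", "cat", "dog"])
```

Because the rule needs only the query's own terms, any node can compute the index term without term statistics.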

RI vs. MHI

Terms: t1 t2 t3 t4 t5 t6 t7, where h(ti) < h(tj) for i < j
D = {t2, t4, t5, t6}
• 3 queries, irrelevant to D:
  • Q1 = t1 Λ t2 Λ t4
  • Q2 = t3 Λ t4 Λ t5
  • Q3 = t3 Λ t5 Λ t6
(1) RI:
  • Q1, Q2, and Q3 will each be considered in query resolution with probability 67%, since two of each query's three terms appear in D (need to resolve terms t1, t2, t3, t4, t5, and t6)
(2) MHI:
  • All of them will be filtered out! Bandwidth savings!
  • How?

MHI: filtering irrelevant queries!

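[Figure: document D = {t2, t4, t5, t6} is announced to the nodes responsible for t2, t4, t5, and t6 (nodes A to G appear in the diagram). None of these terms has a query indexed under it (t2: none; t4: none; t5: none; t6: none), so each contacted node takes no action. Q1 is indexed under t1, and Q2 and Q3 are indexed under t3.]

Disregarded in query resolution, saving bandwidth!

Q1 = t1 Λ t2 Λ t4
Q2 = t3 Λ t4 Λ t5
Q3 = t3 Λ t5 Λ t6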
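The numbers in this example can be checked with a short script. It represents each term ti by its index i and uses that index as the hash rank, which matches the stated assumption h(ti) < h(tj) for i < j; the helper names are hypothetical.

```python
from fractions import Fraction

# Term ti is represented by its index i; the index doubles as its hash rank.
D = {2, 4, 5, 6}
queries = {"Q1": {1, 2, 4}, "Q2": {3, 4, 5}, "Q3": {3, 5, 6}}

def ri_probability(query, doc):
    """Under RI, a query is examined iff its random index term is in the doc."""
    return Fraction(len(query & doc), len(query))

def mhi_considered(query, doc):
    """Under MHI, a query is examined iff its minimum-hash term is in the doc."""
    return min(query) in doc

ri = {qid: ri_probability(q, D) for qid, q in queries.items()}
mhi = {qid: mhi_considered(q, D) for qid, q in queries.items()}
print(ri)   # each query: Fraction(2, 3)
print(mhi)  # each query: False
```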

MHI

Pros:
  Simple and deterministic
  Does not require term stats
  Saves bandwidth over RI (up to 39.3% saving for various query types)
Cons:
  Some popular terms can be index terms by their minimum hashes in their queries!
    Load imbalance & irrelevant queries to process

Solution 2: SAP-MHI

MHI is good but may still index queries under popular terms
SAmPling-based MHI (SAP-MHI)
  Sampling (synopsis of K popular terms) + MHI
  Avoid indexing queries under K popular terms
  Challenge: support duplicate-sensitive aggregates of popular terms, as synopses may be gossiped over multiple DHT overlay links and term frequencies may be overestimated!
    Borrow idea from duplicate-sensitive aggregation in sensor networks

SAP-MHI Duplicate-sensitive aggregation

Goal: a synopsis of K popular terms
Based on the coin tossing experiment CT(y)
  Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses
Each node a
  Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y))
  Gossips Sa to its neighbor nodes
  Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb, producing a new synopsis Sa (max() operations)
Thus, each node has a synopsis of K popular terms after a sufficient number of gossip rounds
Intuition: if a term appears in more documents, then its value produced by CT(y) is likely to be larger than the values of rare terms
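A rough sketch of the sampling step follows. It assumes one CT(y) experiment per (term, document) occurrence with the per-term maximum kept, which is one plausible reading of the slide; the function names, the tie-breaking rule in `top_k`, and the example values are hypothetical.

```python
import random

def ct(y, rng):
    """Toss a fair coin until the first head or y tosses; return the toss count."""
    for tosses in range(1, y + 1):
        if rng.random() < 0.5:  # head
            return tosses
    return y  # y tosses without a head

def top_k(values, k):
    """Keep the k terms with the highest values (ties broken alphabetically)."""
    best = sorted(values.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(best)

def local_synopsis(docs, k, y, rng):
    """One CT(y) run per term occurrence in a document; keep the max per term."""
    values = {}
    for doc_terms in docs:
        for term in set(doc_terms):
            values[term] = max(values.get(term, 0), ct(y, rng))
    return top_k(values, k)

def merge(sa, sb, k):
    """Aggregate two synopses with max(); receiving sb twice changes nothing."""
    merged = dict(sa)
    for term, value in sb.items():
        merged[term] = max(merged.get(term, 0), value)
    return top_k(merged, k)

sa = {"the": 5, "of": 4}
sb = {"of": 6, "cat": 1}
merged = merge(sa, sb, k=2)
print(merged)                             # {'of': 6, 'the': 5}
print(merge(merged, sb, k=2) == merged)   # True
```

Because aggregation uses max() rather than addition, a synopsis that arrives over several overlay links does not inflate a term's value, which is the property the challenge above calls for.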

SAP-MHI: Indexing Example

Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
Synopsis S = {t1, t2}
Q is indexed on the node which is responsible for t3, instead of t1
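This example can be reproduced with a small sketch. The hash is replaced by the term's rank (the number after "t"), matching the stated ordering of hashes; the fallback to plain MHI when every query term is in the synopsis is an assumption, since the slides do not cover that case.

```python
def sap_mhi_index_term(query_terms, synopsis, h):
    """Pick the minimum-hash term among those not in the popular-term synopsis."""
    candidates = [term for term in query_terms if term not in synopsis]
    if not candidates:
        # Assumption: fall back to plain MHI when all terms are popular.
        candidates = list(query_terms)
    return min(candidates, key=h)

def rank_hash(term):
    """Stand-in hash for the example: h(t1) < h(t2) < ... by construction."""
    return int(term[1:])

query = ["t1", "t2", "t3", "t4", "t5"]
synopsis = {"t1", "t2"}
print(sap_mhi_index_term(query, synopsis, rank_hash))  # t3
print(sap_mhi_index_term(query, set(), rank_hash))     # t1 (plain MHI)
```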

Simulations

Parameter                     Value
DHT                           1000-node Chord
Document collection           TREC-1,2-AP
Mean of query sizes           5
# of continuous queries       100,000
# of docs                     10,000
# of unique terms             46,654
# of unique terms per doc     178
Query types                   Skew, Uniform, InverSkew
Query resolution              Term Dialogue, Bloom filters

SAP-MHI vs. MHI

SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.

SAP-MHI vs. MHI

[Figure: bandwidth saving (%), axis 0 to 70, versus synopsis size K (100, 500, 1000, 1500, 2000, 3000), with one series each for Skew, Uniform, and InverSkew queries.]

Bloom filters are used in query resolution.

SAP-MHI vs. MHI

[Figure: bandwidth saving (%), axis 0 to 100, versus synopsis size K (100, 500, 1000, 1500, 2000, 3000), with one series each for Skew, Uniform, and InverSkew queries.]

Term Dialogue is used in query resolution.

SAP-MHI vs. MHI

[Figure: % of queries filtered, axis 0 to 100, versus synopsis size K (100, 500, 1000, 1500, 2000, 3000), with one series each for Skew, Uniform, and InverSkew queries.]

This shows why SAP-MHI saves bandwidth over MHI!

Summary

Focus on a simple keyword query model
Bandwidth is a top priority
Query indexing impacts bandwidth cost
  Goal: Sift out as many irrelevant queries as possible!
MHI and SAP-MHI
  SAP-MHI is a more viable solution
    Load is more balanced, more bandwidth saving!
    Sampling cost is controlled
      # of popular terms is relatively low
      Memberships of popular terms do not change rapidly
Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)

Thank You!
