Efficient Query Evaluation over Temporally Correlated Probabilistic Streams

Bhargav Kanagal, Amol Deshpande

ΗΥ-562 Advanced Topics on Databases

Αλέκα Σεληνιωτάκη

Heraklion, 22/05/2012


Contents

1. Motivation

2. Previous Work

3. Approach

4. Markov Sequence

5. Formal Semantics

6. Operator Design

7. Query Planning Algorithm

8. Results

9. Future Work


Motivation

Correlated probabilistic streams occur in a large variety of applications, including sensor networks, information extraction and pattern/event detection.

Probabilistic data streams are highly correlated in both time and space.

These correlations impact the final query results.


Motivation

A habitat monitoring application:

Query example: Compute the likelihood that a nest was occupied for all 7 days in a week.

Other issues:

Handle high-rate data streams efficiently and produce accurate answers to continuous queries in real time

Query semantics become ambiguous since they are dealing with sequences of tuples and not sets of tuples.
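To see why temporal correlation matters for the nest query, here is a minimal sketch that assumes a hypothetical two-state Markov chain. All numbers and function names are invented for illustration and do not come from the paper. It compares the probability that the nest is occupied on all 7 days with the value obtained by wrongly multiplying the per-day marginals as if the days were independent.

```python
def prob_all_occupied(p_first, p_stay, days):
    """P(occupied on every day) under a first-order Markov chain.

    p_first: P(X_1 = occupied)
    p_stay:  P(X_{t+1} = occupied | X_t = occupied)
    Both values are hypothetical illustration parameters.
    """
    return p_first * p_stay ** (days - 1)


def occupied_marginals(p_first, p_stay, p_enter, days):
    """Per-day marginals P(X_t = occupied); p_enter = P(occupied | previously empty)."""
    marginals, p = [], p_first
    for _ in range(days):
        marginals.append(p)
        p = p * p_stay + (1.0 - p) * p_enter
    return marginals


def prob_all_occupied_independent(marginals):
    """The same query when the marginals are multiplied as if days were independent."""
    result = 1.0
    for p in marginals:
        result *= p
    return result


correlated = prob_all_occupied(0.5, 0.9, 7)
independent = prob_all_occupied_independent(occupied_marginals(0.5, 0.9, 0.1, 7))
```

With these assumed parameters every daily marginal is 0.5, yet the two answers differ by more than an order of magnitude, because occupancy on one day makes occupancy on the next day much more likely.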


Previous Work

Probabilistic databases (Mystiq, Trio, Orion, MayBMS) focus on static data and not streams, and cannot handle complex correlations.

Lahar, Caldera are applicable to probabilistic streams, but focus on pattern identification queries.

Unable to represent the types of correlations that probabilistic streams exhibit

Cannot be applied directly because of their complexity


Approach

Observation: Many naturally occurring probabilistic streams are both structured and Markovian in nature

Structured: The same set of correlations and independences repeats across time

Markovian: The values of variables at time t + 1 are independent of those at time t − 1 given their values at time t

Design compact, lightweight data structures to represent and modify them

Enable incremental query processing using the iterator framework

[Figure: A Markov sequence as a directed graphical model (DGM)]


Approach: Probabilistic Databases

Probabilistic databases exhibit two types of uncertainty:

Tuple existence: captures the uncertainty about whether a tuple exists in a database

Attribute value: captures the uncertainty about the value of an attribute

Equivalence between probabilistic databases and directed graphical models (DGMs):

Nodes in the graph denote the random variables

Edges correspond to the dependencies between the random variables. The dependencies are quantified by a conditional probability distribution (CPD) for each node, which describes how the value of that node depends on the values of its parents.


Approach: System Description

Input: a query on probabilistic streams

Output: the query’s results

Steps:

1. Query conversion to a probabilistic sequence algebra expression

2. Query plan construction by instantiating each of the operators with the schemas of their input sequences

3. Each operator executes its schema routine and computes its output schema

4. Check the input to the projection and determine if projection operator is safe.

5. Optimize query plan


Markov Sequence

A probabilistic sequence S^p is defined over a set of named attributes S = {V1, V2, …, Vk}, also called its schema, where each Vi is a discrete probabilistic value.

A probabilistic sequence is equivalent to a set of deterministic sequences, where each deterministic sequence is obtained by assigning a value to each random variable.

Two operators for conversion from a probabilistic sequence to a deterministic sequence:

MAP: returns the possible sequence that has the highest probability

ML: returns a sequence of items, each of which has the highest probability for its time instant.
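The difference between MAP and ML can be illustrated by brute force over the possible deterministic sequences. The sketch below assumes a tiny hypothetical distribution over sequences of length 2; the `worlds` dictionary and the function names are illustrative, and a real system would not enumerate all sequences.

```python
from collections import defaultdict


def map_sequence(worlds):
    """MAP: the single possible sequence with the highest probability."""
    return max(worlds, key=worlds.get)


def ml_sequence(worlds):
    """ML: for each time instant, the value with the highest marginal probability."""
    length = len(next(iter(worlds)))
    result = []
    for t in range(length):
        marginal = defaultdict(float)
        for seq, p in worlds.items():
            marginal[seq[t]] += p
        result.append(max(marginal, key=marginal.get))
    return tuple(result)


# Hypothetical distribution over deterministic sequences (X_1, X_2).
worlds = {(0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
map_answer = map_sequence(worlds)  # (0, 1)
ml_answer = ml_sequence(worlds)    # (1, 1)
```

Here the two operators disagree: (0, 1) is the most probable whole sequence, while the value 1 is individually most likely at each time instant.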


Markov Sequence

Special case of a probabilistic sequence. Completely determined by specifying successive joint distributions p(V_t, V_{t+1}) for all time instants

Efficient representation of Markov sequences using a combination of the schema graph and the clique list:

Schema graph: graphical representation of the two-step joint distribution that repeats continuously.

Clique list: the set of direct dependencies present between successive sets of random variables.

[Figure: schema graph and clique list]
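As a minimal sketch of why the successive joint distributions are enough, the function below computes the probability of one deterministic sequence from pairwise joints, using p(v_1..v_n) = p(v_1, v_2) · ∏_{t≥2} p(v_t, v_{t+1}) / p(v_t). It assumes a single variable per time step and a dictionary encoding of each joint; both are simplifications chosen for illustration, not the paper's data structures.

```python
from itertools import product


def sequence_probability(values, pair_joints):
    """Probability of one deterministic sequence under a Markov sequence.

    pair_joints[t] maps (v_t, v_next) -> p(V_t = v_t, V_{t+1} = v_next).
    """
    if len(values) != len(pair_joints) + 1:
        raise ValueError("need exactly one joint per consecutive pair")
    prob = 1.0
    for t, joint in enumerate(pair_joints):
        prob *= joint.get((values[t], values[t + 1]), 0.0)
        if t > 0:
            # Divide by the marginal p(V_t = v_t), which joints t-1 and t share.
            marginal = sum(p for (a, _), p in joint.items() if a == values[t])
            if marginal == 0.0:
                return 0.0
            prob /= marginal
    return prob


# Hypothetical joint that repeats at every step.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_111 = sequence_probability([1, 1, 1], [joint, joint])
total = sum(sequence_probability(list(s), [joint, joint]) for s in product([0, 1], repeat=3))
```

Summing over all eight possible sequences gives a total of 1 (up to rounding), which is a quick sanity check that the pairwise joints define a proper distribution over sequences.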


Formal Semantics

Possible World Semantics (+small modification for sequences)

Operators: Select, Project, Join, Aggregate, Windowing

MAP, ML: convert a Markovian sequence into a deterministic sequence

The set of Markov sequences is not closed under these operators, i.e., some operators (Projection, Windowing) return non-Markovian sequences

Op is safe for input schema I if and only if Op(I) is a Markovian sequence


Operator Design

Operators treat the input Markov sequence as a data stream, operating on one tuple (an array of numbers) at a time and producing the output in the same fashion (passed to the next operator).

Each operator implements two high-level routines:

Schema routine: the output schema is computed based on the input schema

get_next() routine: operates on each of the input data tuples to compute the output tuples.
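The two-routine iterator interface could be sketched as follows. The class names `Source` and `Rename`, the dictionary tuple layout, and the `drain` helper are hypothetical and only illustrate how a schema routine and a get_next() routine chain together; `Rename` is a toy operator that is not part of the system described here.

```python
class Source:
    """Leaf of a plan: replays a stored sequence one tuple at a time."""

    def __init__(self, attributes, tuples):
        self._attributes = list(attributes)
        self._tuples = iter(tuples)

    def schema(self):
        return list(self._attributes)

    def get_next(self):
        # Returns None once the input is exhausted.
        return next(self._tuples, None)


class Rename:
    """Toy operator: renames one attribute and passes tuples through unchanged."""

    def __init__(self, child, old, new):
        self.child, self.old, self.new = child, old, new

    def schema(self):
        # The output schema is derived from the child's schema alone.
        return [self.new if a == self.old else a for a in self.child.schema()]

    def get_next(self):
        return self.child.get_next()


def drain(op):
    """Pull tuples from the root operator until the stream ends."""
    return list(iter(op.get_next, None))
```

A plan is then a chain of such objects: calling `schema()` on the root resolves all schemas before any data flows, and repeated `get_next()` calls on the root pull tuples through the chain one at a time.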


Operator Design: Selection

Generally: Add a new boolean random variable to each slice. Always safe.

Schema routine steps:

1. Start with the DGM corresponding to the input schema

2. Add a new node corresponding to the exists variable to both time steps of the DGM

3. Connect this node to the variables that are part of the selection predicate through directed edges (in the example, the selection predicate is X > Y)

4. Update the clique list of the schema to include the newly created dependencies.


Operator Design: Selection

get_next() routine:

1. Determine the CPD of the newly created node

2. Add it to the input tuple’s CPD list

3. Return the new tuple

The algorithm does not change the Markovian property of the sequence, so selection is a safe operator.
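A minimal sketch of the selection get_next() routine, assuming discrete domains and a hypothetical tuple layout `{"time": ..., "cpds": [...]}`. Because the exists variable E is a deterministic function of the predicate's variables, its CPD contains only 0 and 1.

```python
from itertools import product


def exists_cpd(x_domain, y_domain, predicate):
    """Deterministic CPD p(E = 1 | X = x, Y = y) for a selection predicate."""
    return {(x, y): 1.0 if predicate(x, y) else 0.0
            for x, y in product(x_domain, y_domain)}


def select_get_next(tup, x_domain, y_domain, predicate):
    """Append the exists CPD to the tuple's CPD list and return a new tuple."""
    cpd = exists_cpd(x_domain, y_domain, predicate)
    # Build a new list so the input tuple is left untouched.
    return {"time": tup["time"], "cpds": tup["cpds"] + [("E", cpd)]}


# Hypothetical input tuple with an empty CPD list, and the predicate X > Y.
incoming = {"time": 1, "cpds": []}
outgoing = select_get_next(incoming, [1, 2], [1, 2], lambda x, y: x > y)
```

No existing CPD is modified and no variable is removed, which matches the observation that selection only adds a node to each time slice.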


Operator Design: Projection

Generally: Eliminate all variables not of interest. Unsafe for some schemas.

Schema routine steps:

Remove the nodes that are not in the projection list (an eliminate operation on the graphical model) and update the clique list with the eliminations

Determine if new edges need to be added to the schema graph


Operator Design: Projection

get_next() routine: perform the actual variable elimination procedure to eliminate the nodes that are not required

Projection is not safe for all input sequences: in some cases, even if the input is a Markov sequence, the output may not be a Markov sequence.
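The basic step of variable elimination is summing a factor over the variable being dropped. The sketch below assumes a factor stored as a dictionary from value tuples to probabilities; this encoding and the function name are illustrative only.

```python
from collections import defaultdict


def eliminate(factor, variables, drop):
    """Sum a factor over one variable and return the reduced factor and variable list."""
    idx = variables.index(drop)
    out = defaultdict(float)
    for assignment, p in factor.items():
        # Remove the dropped variable's position from the assignment key.
        out[assignment[:idx] + assignment[idx + 1:]] += p
    return dict(out), variables[:idx] + variables[idx + 1:]


# Hypothetical joint p(X, Y); projecting onto X eliminates Y.
joint_xy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
p_x, remaining = eliminate(joint_xy, ("X", "Y"), "Y")
```

This shows only the arithmetic of a single elimination. Whether the projected sequence is still Markovian depends on the schema, which this sketch does not check.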


Operator Design: Join

Schema routine: concatenate the schemas of the two sequences in order to determine the resulting output schema (combine the schema graphs, and concatenate the clique lists)

get_next() routine: concatenate the CPD lists of the two tuples, whenever both tuples have the same time value.

A join can be computed incrementally, and is always safe.


Operator Design: Aggregation

Schema routine: the output schema is just a single attribute corresponding to the aggregate

get_next() routine:

Case 1: no selection predicates (Gi = Agg(X1, X2, …, Xi)). In the figure, the dotted variables are added to the DGM and the boxed nodes are eliminated, computing p(Xi, Gi) as input tuples arrive. At the end of the sequence, we get p(Xn, Gn), from which we can obtain p(Gn) by eliminating Xn.

Case 2: with selection predicates. A value Xi contributes to the aggregate only if Ei (the exists attribute) is 1, and not otherwise.

Always a safe operator
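Case 1 can be sketched for the MAX aggregate as a forward recursion that keeps only p(Xi, Gi) in memory. The code assumes one variable per time step, an initial distribution, and one transition CPD p(x_next | x_prev) per step; this encoding is a simplification for illustration and not the paper's tuple format.

```python
from collections import defaultdict


def max_aggregate(initial, transitions):
    """Incrementally compute p(G_n) for G_i = MAX(X_1..X_i) over a Markov chain.

    initial:     {x: p(X_1 = x)}
    transitions: list of {(x_prev, x_next): p(x_next | x_prev)}, one per step
    """
    # p(X_1, G_1): after one value the running maximum is the value itself.
    joint = {(x, x): p for x, p in initial.items()}
    for cpd in transitions:
        nxt = defaultdict(float)
        for (x_prev, g_prev), p in joint.items():
            for (a, x_next), q in cpd.items():
                if a == x_prev:
                    # Eliminate X_i and G_i while adding X_{i+1} and G_{i+1}.
                    nxt[(x_next, max(g_prev, x_next))] += p * q
        joint = dict(nxt)
    # Eliminate X_n to obtain p(G_n).
    result = defaultdict(float)
    for (_, g), p in joint.items():
        result[g] += p
    return dict(result)


# Hypothetical chain of length 3 over values {0, 1}.
step = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}
p_max = max_aggregate({0: 0.5, 1: 0.5}, [step, step])
```

In this example the maximum is 0 only if all three values are 0, so p(G = 0) = 0.5 · 0.9 · 0.9 = 0.405.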


Operator Design: Sliding Window Aggregates

A sliding window aggregate query asks to compute the aggregate values over a window that shifts over the stream (characterized by the length of the window, the desired shift, and the type of aggregate)

[Figure: DGM for a sliding window aggregate. Gi denotes the aggregates to compute]

After eliminating the Xi variables, make a clique list on the Gi variables.

The aggregate value for a sliding window influences the aggregates for all of the future windows in the stream, so this is an unsafe operator.
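As a small illustration of the window parameters, the helper below lists which tuple positions each window covers for a given length and shift. The function name and the half-open index convention are assumptions made for this sketch.

```python
def window_bounds(n, length, shift):
    """Half-open index ranges [start, end) of the windows over a stream of n tuples."""
    if length <= 0 or shift <= 0:
        raise ValueError("length and shift must be positive")
    return [(start, start + length) for start in range(0, n - length + 1, shift)]


sliding = window_bounds(6, 3, 1)   # consecutive windows share tuples
tumbling = window_bounds(6, 3, 3)  # length == shift, so windows are disjoint
```

With a shift smaller than the length, consecutive windows cover some of the same tuples, so their aggregates are computed from shared variables.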


Operator Design: Sliding Window Aggregates

Approximate method:

1. Ignore the dependencies between the aggregate values produced at different time instants

2. Compute the distribution over each aggregate value independently

Split the sliding window DGM into separate graphical models (one for each window), run inference on each of them separately, and compute the results

In the figure, unmarked nodes are intermediate aggregates


Operator Design: Tumbling Window Aggregates

A tumbling window is a sliding window whose length is equal to its shift. Exact answers can be computed in a few cases.

For tumbling window aggregates, eliminate only the boxed nodes to obtain a Markov sequence.

Eliminating X3 and X5 is postponed to a later projection.


Operator Design: MAP and ML

MAP: takes in a Markov sequence and returns a deterministic sequence. It is usually the last operator in the query plan, and hence it does not have a schema routine. The get_next() routine uses the dynamic programming approach of Viterbi's algorithm.

ML: at each time step, compute the probability distribution for that time instant from each tuple. Based on this, eliminate the variables that are not required and determine the most likely values for the remaining variables. In other words, simply marginalize the joint distribution.
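A Viterbi-style dynamic program for MAP can be sketched as follows, assuming one variable per time step, an initial distribution, and one transition CPD per step. This encoding is chosen for illustration and is not the paper's tuple format; the function processes one transition at a time, in the spirit of get_next().

```python
def viterbi_map(initial, transitions):
    """Most probable sequence of a Markov chain via Viterbi-style dynamic programming.

    initial:     {x: p(X_1 = x)}
    transitions: list of {(x_prev, x_next): p(x_next | x_prev)}, one per step
    """
    # best[x] = (probability, path) of the best prefix ending in value x.
    best = {x: (p, (x,)) for x, p in initial.items()}
    for cpd in transitions:
        nxt = {}
        for (x_prev, x_next), q in cpd.items():
            if x_prev not in best:
                continue
            p, path = best[x_prev]
            candidate = (p * q, path + (x_next,))
            if x_next not in nxt or candidate[0] > nxt[x_next][0]:
                nxt[x_next] = candidate
        best = nxt
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


# Hypothetical two-step chain; its joint is (0,1): 0.4, (1,0): 0.3, (1,1): 0.3.
cpd = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 0.5}
map_path, map_prob = viterbi_map({0: 0.4, 1: 0.6}, [cpd])
```

Only the best prefix per final value is kept at each step, so memory does not grow with the number of possible sequences. ML on the same chain would pick the value 1 at both time instants, since the marginals are 0.6 and 0.7.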


Query Evaluation: Query Syntax

1. The user has the choice between using MAP or ML operators for converting the final probabilistic answer to a deterministic answer

2. Support for specifying sliding window parameters


Query Planning Algorithm

Unsafe operators: Projection; the tumbling window operator, which is reduced to a projection; and the sliding window operator, which uses the approximate method

Query planning = determining the correct position for the projection

Strategy: Pull up the projection operator until it is safe. If there is no safe position for the projection, check its parent: if it is ML, we can determine a safe plan; if it is MAP, we cannot determine a safe plan


Query Planning Algorithm

1. For a given query, convert it to a probabilistic sequence algebra expression. Q0: SELECT_MAP MAX(X) FROM SEQ WHERE Y<20 becomes MAP(G^p(Π^p_X(σ^p_{Y<20}(SEQ))))

2. Each operator then executes its schema routine and computes its output schema, which is used as the input schema for the next operator in the chain

3. Check the input to the projection, and determine if the projection operator is safe

4. If a projection-input pair is not safe, we pull up the projection operator through the aggregate and the windowing operators and continue with the rest of the query plan: MAP(Π^p_{MAX(X)}(G^p(σ^p_{Y<20}(SEQ))))

5. If the operator after the projection is ML, then determine the exact answer. If it is a MAP operator, replace both the projection and the MAP operator with the approximate-MAP operator
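The pull-up strategy can be sketched on a deliberately simplified plan representation: a bottom-up list of operator names. The operator names, the `is_safe` callback, and the list encoding are all hypothetical; a real planner would evaluate safety from the schema graphs rather than from a callback.

```python
def place_projection(ops, is_safe):
    """Reposition an unsafe projection in a bottom-up list of operator names.

    ops:     e.g. ["select", "project", "aggregate", "map"] (illustrative names)
    is_safe: hypothetical callback is_safe(ops, i) telling whether the projection
             at position i is safe for its current input schema
    """
    ops = list(ops)
    if "project" not in ops:
        return ops
    i = ops.index("project")
    # Pull the projection up through aggregate and windowing operators until safe.
    while (not is_safe(ops, i) and i + 1 < len(ops)
           and ops[i + 1] in ("aggregate", "window")):
        ops[i], ops[i + 1] = ops[i + 1], ops[i]
        i += 1
    # Still unsafe directly below MAP: replace both with approximate-MAP.
    if not is_safe(ops, i) and i + 1 < len(ops) and ops[i + 1] == "map":
        ops[i:i + 2] = ["approximate_map"]
    return ops


def never_safe(ops, i):
    return False


plan_map = place_projection(["select", "project", "aggregate", "map"], never_safe)
plan_ml = place_projection(["select", "project", "aggregate", "ml"], never_safe)
```

For the Q0-style plan with a projection that is never safe, the MAP variant ends with the approximate-MAP operator, while the ML variant keeps the pulled-up projection directly below ML.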


Results

Markov sequence generator: Generate Markov sequences for a given input schema

Capturing and reasoning about temporal correlations is critical for obtaining accurate query answers

Operators can process up to 500 tuples per second

The get_next() query processing framework, which exploits the structure in Markov sequences, is much more efficient than previous generic approaches


Results

The % error in query processing for various operators when temporal correlations are ignored.

Q1: SELECT MAP Agg(A) FROM S

Q3: SELECT MAP MAX(A) FROM S[size,size]


Future Work

Scalability improvement of the system by resorting to approximations with guarantees.


References

B. Kanagal, A. Deshpande: “Efficient Query Evaluation over Temporally Correlated Probabilistic Streams.”

http://www.cs.umd.edu/~bhargav/ICDE_0703.pdf

http://www.cs.umd.edu/~bhargav/ICDE09poster.pdf