Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)

Page 1:

Data in Motion

Michael Hoffman (Leicester)

S Muthukrishnan (Google)

Rajeev Raman (Leicester)

Page 2:

Models for moving data

• Reset model
• Delta model

Geometric and database motivations.

Given vector A[1..n]:
• A[i] is a point in R^d, d ≥ 1
• A is updated in a streaming manner

Probabilistic approximate computation of some function on A:
• ε: error parameter
• δ: confidence parameter
• Space and time: poly(log n, 1/ε, 1/δ)

Page 3:

Reset model

Given vector A[1..n]:
• A[i] is a point in R^d, d ≥ 1

Updates:
• reset(i, x): A[i] := x

Motivation:
• Location data streams (tracking passive/dumb objects).
• Query self-tuning in databases.

Page 4:

Reset Model

“Dynamic” geometric information.

Different from standard “dynamic” streams:
• insert(p), p in R^d
• delete(p)

In the reset model, points have identity.

delete(p) + insert(p’) gives more information than reset.

Page 5:

Delta Model

Given vector A[1..n]:
• A[i] is a point in R^d, d ≥ 2

Process updates (i, x1, x2, …, xd):
• A[i] := A[i] + (x1, x2, …, xd)

Motivation:
• Data is often multi-dimensional, e.g. <IP address, packet size, packet delay>
• Direct generalization of the turnstile model
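A minimal sketch of the delta update rule above, for illustration only. The class name `DeltaVector` and its fields are hypothetical, and the full vector is stored exactly here, which a real streaming algorithm would avoid.

```python
from collections import defaultdict


class DeltaVector:
    """Hypothetical exact store for A[1..n] under delta updates."""

    def __init__(self, d):
        self.d = d
        # Every A[i] starts at the origin of R^d.
        self.points = defaultdict(lambda: (0.0,) * d)

    def update(self, i, delta):
        """Apply A[i] := A[i] + delta, where delta = (x1, ..., xd)."""
        if len(delta) != self.d:
            raise ValueError("delta must have d coordinates")
        self.points[i] = tuple(a + b for a, b in zip(self.points[i], delta))
```

With d = 1 each update adds a signed number to one entry, which is the usual turnstile update.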

Page 6:

Delta Model

Problems involving several dimensions:
• “extent” of points (sum of distances of points from a given center)
• k-median, diameter, minimum enclosing ball, etc.?
• regression: correlation of packet size with delay

Page 7:

Problems

Reset model:
• Lp norm*
• Lp sampling*
• 1-median

Delta model:
• “Extent” of points
• 1-median

* monotone, d = 1

Page 8:

Lp norm: Reset Model

Assume wlog p = 1: required to estimate ||A||1 = Σ |A[i]|

Assume monotone updates:
• A[i] initially zero
• reset(i, x) implies A[i] ≤ x
• A[i] := max(A[i], x) [GC]

Estimation impossible if non-monotone:
• reduction to estimating |X| - |X ∩ Y|

Page 9:

L1 norm (reset model)

Reduction to counting distinct items.

[Figure: entries of A placed into buckets with boundaries (1+ε)^1, …, (1+ε)^i]

ni = number of distinct items in ith bucket

wi = width of ith bucket

Σ(wi*ni) ≤ ||A||1 ≤ (1+ε) Σ(wi*ni)
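The bucket reduction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: exact Python sets stand in for the small-space distinct counters, values are assumed to be at least 1, updates are assumed monotone, bucket j is taken to hold every index whose value has reached the boundary (1+ε)^j (starting at j = 0), and the names `ResetL1Estimator` and `max_value` are hypothetical.

```python
class ResetL1Estimator:
    """Estimate ||A||_1 under monotone resets via per-bucket distinct counts."""

    def __init__(self, eps, max_value):
        # Bucket boundaries 1, (1+eps), (1+eps)^2, ... up to max_value.
        self.bounds = []
        t = 1.0
        while t <= max_value:
            self.bounds.append(t)
            t *= 1.0 + eps
        # Assumption: an exact set per bucket replaces a distinct-count sketch.
        self.buckets = [set() for _ in self.bounds]

    def reset(self, i, x):
        """reset(i, x): index i now counts in every bucket with boundary <= x."""
        for j, bound in enumerate(self.bounds):
            if bound > x:
                break
            self.buckets[j].add(i)

    def estimate(self):
        """Return sum of w_j * n_j, with w_j the gap to the previous boundary."""
        total, prev = 0.0, 0.0
        for bound, bucket in zip(self.bounds, self.buckets):
            total += (bound - prev) * len(bucket)
            prev = bound
        return total
```

Each index contributes the largest boundary not exceeding its value, so the returned sum is at most ||A||1 and falls short of it by less than a factor of (1+ε).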

Page 10:

L1 norm (reset model)

Counting the number of distinct items in a stream ≡ L0 norm
• poly-log space and time [FM, CIM]

Need to keep only O((log n)/ε) buckets.

Can we detect if the input is non-monotone?

Page 11:

Lp sampling

Query: sample()
• Choose i from {1,…,n} with probability proportional to |A[i]|^p

Successive calls may return the same index if no updates happen.

Not known how to do this in the turnstile model.

Can be used to detect if ||A||1 ≤ (1 - ε) ||A*||1

Page 12:

Lp sampling (reset model)

Reduction to sampling distinct items.

[Figure: entries of A placed into buckets with boundaries (1+ε), …, (1+ε)^i]

ni = number of distinct items in ith bucket

wi = width of ith bucket

Sample a random (distinct) index from each bucket.

Return the sample from bucket i with probability proportional to wi * ni.

$$\Pr[i] = \sum_{j \,:\, i \in B_j} \frac{1}{n_j} \cdot \frac{w_j n_j}{\sum_k w_k n_k} \approx \frac{A[i]}{\|A\|}$$
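The two-stage sampling step can be sketched as below. As assumptions for illustration: exact sets replace the distinct-sampling sketches, values are at least 1, updates are monotone, bucket j holds every index whose value has reached (1+ε)^j, and the names `ResetL1Sampler`, `probability`, and `max_value` are hypothetical.

```python
import random


class ResetL1Sampler:
    """Sample index i with probability roughly A[i] / ||A||_1."""

    def __init__(self, eps, max_value):
        self.bounds = []
        t = 1.0
        while t <= max_value:
            self.bounds.append(t)
            t *= 1.0 + eps
        # Width of bucket j is the gap between consecutive boundaries.
        self.widths = [b - a for a, b in zip([0.0] + self.bounds, self.bounds)]
        self.buckets = [set() for _ in self.bounds]

    def reset(self, i, x):
        for j, bound in enumerate(self.bounds):
            if bound > x:
                break
            self.buckets[j].add(i)

    def probability(self, i):
        """Exact Pr[i] of this scheme: sum of w_j over buckets containing i."""
        total = sum(w * len(b) for w, b in zip(self.widths, self.buckets))
        mass = sum(w for w, b in zip(self.widths, self.buckets) if i in b)
        return mass / total

    def sample(self, rng):
        """Pick bucket j with weight w_j * n_j, then a uniform index in it."""
        weights = [w * len(b) for w, b in zip(self.widths, self.buckets)]
        j = rng.choices(range(len(self.buckets)), weights=weights)[0]
        return rng.choice(sorted(self.buckets[j]))
```

Calling `sample(random.Random(0))` after a few resets returns one of the indices seen so far; `probability` exposes the distribution so it can be compared with A[i]/||A||1 directly.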

Page 13:

1-median

Assume A[i] contains coordinates of a set S of 2-D points.

Problem: find c in R^2 s.t. Σp in S d(c,p) (Euclidean distance) is approximately minimized.

Monotonicity not required; cannot report Σp in S d(c,p).

Return (4/π + ε) ≈ (1.27 + ε) estimator
• boosting: see later.

Page 14:

1-median (reset model)

L1 1-median:
• find c in R^2 such that Σp in S d(c,p) is minimized.
• d(p,q) = L1 distance: d1(p,q) = |px - qx| + |py - qy|

L1 1-median c = (cx, cy):
• cx = median of x-coordinates
• cy = median of y-coordinates
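A small sketch of the coordinate-wise median, computed exactly over the full point set for illustration. The function names `l1_median` and `l1_cost` are hypothetical.

```python
import statistics


def l1_median(points):
    """Return (cx, cy): the medians of the x- and y-coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (statistics.median(xs), statistics.median(ys))


def l1_cost(c, points):
    """Sum of L1 distances d1(c, p) over all points p."""
    return sum(abs(c[0] - px) + abs(c[1] - py) for px, py in points)
```

The L1 cost separates into an x-term and a y-term, which is why each coordinate can be minimized independently by a 1-D median.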

Page 15:

1-median (reset model)

1-D median:
• sample O((1/ε) log (1/δ)) random indices; maintain positions of the sample.
• median mx of x-coordinates of the sample is a (1+ε)-approximation to the median of x-coordinates of S.
• a (1+ε)-approximate median is a (1+ε’)-approximate 1-median in 1-D.

Approximate L1 1-median:
• return (mx, my)
• may not be in S.
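The sampling scheme can be sketched as follows: fix a random set of indices up front, keep only their current positions as resets arrive, and report the coordinate-wise median of the tracked positions. The class name `SampledMedian` is hypothetical, indices run from 0 to n-1 here, and the caller chooses `sample_size` (the slide's O((1/ε) log(1/δ)) bound hides a constant that this sketch does not fix).

```python
import random
import statistics


class SampledMedian:
    """Track a fixed random sample of indices under reset(i, point)."""

    def __init__(self, n, sample_size, seed=None):
        rng = random.Random(seed)
        # Indices are chosen once; only these are ever stored.
        self.tracked = set(rng.sample(range(n), min(sample_size, n)))
        self.positions = {}

    def reset(self, i, point):
        """A[i] := point; ignored unless i is one of the sampled indices."""
        if i in self.tracked:
            self.positions[i] = point

    def median(self):
        """Return (mx, my) over the sampled points seen so far."""
        xs = [p[0] for p in self.positions.values()]
        ys = [p[1] for p in self.positions.values()]
        return (statistics.median(xs), statistics.median(ys))
```

No monotonicity is needed because a reset simply overwrites the stored position of a tracked index.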

Page 16:

Projections of points

The L1 1-median is a √2-approximation to the L2 (Euclidean) 1-median.

Consider projections of S to do better.

Page 17:

Let l be a line segment of length x, and let s be the sum of the lengths of the projections of l onto k equally spaced lines passing through the origin. Then πs/(2k) = x(1 ± Θ(1/k)).
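A quick numerical check of this statement, as a sketch. The lines are assumed to sit at angles jπ/k for j = 0, …, k-1, and the function name `projection_estimate` is hypothetical. The idea is that the projection onto a line at angle θ has length x·|cos(φ - θ)|, and the average of |cos| over a half-turn is 2/π.

```python
import math


def projection_estimate(x, phi, k):
    """Return pi * s / (2k) for a segment of length x at angle phi.

    s is the total projected length onto k lines through the origin,
    assumed here to be at angles j * pi / k for j = 0, ..., k - 1.
    """
    s = sum(x * abs(math.cos(phi - j * math.pi / k)) for j in range(k))
    return math.pi * s / (2 * k)
```

For example, `projection_estimate(3.0, 0.7, 64)` can be compared with the true length 3.0; the gap shrinks as k grows.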

Page 18:

1-median (reset model)

Consider L1 1-medians c1 … ck
• Σ d(ci,S) ≤ (4k/π + O(1/k)) d(c*,S)

One of the ci is a (4/π + ε) approx.
• Which one?
• λ d(p,S) + (1 - λ) d(q,S) ≥ d(λp + (1 - λ)q, S)
• return average of c1 … ck

Boosting confidence: take several independent samples, take mean.

Q: how good is the 1-median of the sample?

Similar to “projection median” [DK]
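A sketch of the averaging step, under assumptions the slides do not spell out: each ci is taken to be the L1 1-median computed in a coordinate frame rotated by jπ/(2k), and the medians are computed exactly over all points rather than from a sample. The function names are hypothetical.

```python
import math
import statistics


def rotated_l1_medians(points, k):
    """Return candidates c_1..c_k, one L1 1-median per rotated frame."""
    candidates = []
    for j in range(k):
        theta = j * math.pi / (2 * k)  # assumed spacing of the frames
        c, s = math.cos(theta), math.sin(theta)
        # Coordinates of each point in the rotated frame.
        rotated = [(c * x + s * y, -s * x + c * y) for x, y in points]
        u = statistics.median(p[0] for p in rotated)
        v = statistics.median(p[1] for p in rotated)
        # Rotate the frame's median back to the original coordinates.
        candidates.append((c * u - s * v, s * u + c * v))
    return candidates


def extent(c, points):
    """Sum of Euclidean distances d(c, p) over all points p."""
    return sum(math.dist(c, p) for p in points)


def averaged_median(points, k):
    """Return the average of c_1..c_k."""
    cands = rotated_l1_medians(points, k)
    return (sum(p[0] for p in cands) / k, sum(p[1] for p in cands) / k)
```

Because d(·, S) is convex, the extent of the average is at most the mean extent of the candidates, so averaging avoids having to identify which ci is the good one.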

Page 19:

Reset Model (conclusions)

Computed extent and approximate 1-median.

Many problems seem hard without some monotonicity assumptions:
• CH, k-center, k-median, k > 1

What assumptions?
• strict: points moving away from a known origin (min encl ball, [GC]).
• points moving away from an unknown origin.
• points moving monotonically along trajectories from a known class (e.g. lines).

Page 20:

Delta Model

A[1..n]; A[i] is a point in R^d, d ≥ 2
• S is the set of points

Updates (i, x1, x2, …, xd):
• A[i] := A[i] + (x1, x2, …, xd)

“Extent” query:
• Given c, estimate Σp in S d(c,p) (Euclidean distances)

Page 21:

Delta Model

Extent query:
• Use projections and 1-D L1 norm sketches
• (1+ε)-approximation to extent(c)

1-median:
• Use the L1 1-median to find a suitable search area.
• Using the above, search for the 1-median
• (1+ε)-approximation
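The extent query can be sketched in 2-D as follows. This is an illustration under assumptions: the k projection lines are placed at angles jπ/k, the projected coordinates are stored exactly where a real solution would keep a small-space 1-D L1 norm sketch per line, S is taken to be the set of indices that have received at least one update, and the name `ProjectedExtent` is hypothetical.

```python
import math


class ProjectedExtent:
    """Estimate the sum of Euclidean distances to c from k 1-D projections."""

    def __init__(self, k):
        self.k = k
        self.dirs = [(math.cos(j * math.pi / k), math.sin(j * math.pi / k))
                     for j in range(k)]
        # One map per line: index -> projection of A[i] onto that line.
        # Assumption: exact storage stands in for a 1-D L1 norm sketch.
        self.proj = [dict() for _ in range(k)]

    def update(self, i, dx, dy):
        """Delta update A[i] := A[i] + (dx, dy); projections are linear."""
        for (ux, uy), row in zip(self.dirs, self.proj):
            row[i] = row.get(i, 0.0) + dx * ux + dy * uy

    def extent(self, c):
        """Return (pi / 2k) * sum over lines of the 1-D L1 distance to c."""
        total = 0.0
        for (ux, uy), row in zip(self.dirs, self.proj):
            cp = c[0] * ux + c[1] * uy
            total += sum(abs(v - cp) for v in row.values())
        return math.pi * total / (2 * self.k)
```

Each per-line quantity is the L1 norm of a vector that changes linearly under delta updates, which is the property that lets a turnstile-style sketch replace the exact maps used here.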

Page 22:

Conclusions

Introduced (1+ε) new models for “geometric” computation.

Gave solutions to some basic problems.

Many open questions:
• appropriate monotonicity assumptions for the reset model
• statistical analysis of low-dimensional point sets for the delta model.