Data in Motion
Michael Hoffman (Leicester)
S Muthukrishnan (Google)
Rajeev Raman (Leicester)
Models for moving data
Reset model
Delta model
Geometric and database motivations
Given vector A[1..n]
A[i] is a point in R^d, d ≥ 1
A is updated in a streaming manner
Probabilistic approximate computation of some function on A:
• ε: error parameter
• δ: confidence parameter
Space and time: poly(log n, 1/ε, 1/δ)
Reset model
Given vector A[1..n]
A[i] is a point in R^d, d ≥ 1
Updates: reset(i, x)
• A[i] := x
Motivation:
• Location data streams (tracking passive/dumb objects)
• Query self-tuning in databases
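As a point of reference, the reset update can be sketched as an exact, non-streaming model. The class name `ResetVector` and the zero initial value are assumptions made for illustration; a streaming algorithm would not store A explicitly.

```python
class ResetVector:
    """Exact (non-streaming) reference model of A[1..n] under reset updates."""

    def __init__(self, n, d=1):
        # Assumption: every A[i] starts at the origin of R^d.
        self.points = {i: (0.0,) * d for i in range(1, n + 1)}

    def reset(self, i, x):
        # reset(i, x): overwrite A[i]. The old value is not part of the update.
        self.points[i] = tuple(float(v) for v in x)


tracker = ResetVector(n=3, d=2)
tracker.reset(2, (1.0, 4.0))
tracker.reset(2, (3.0, 5.0))  # the object with identity 2 has moved
```

The second call replaces the first position of object 2, which is the sense in which points "have identity" in this model.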
Reset Model
“Dynamic” geometric information
Different from standard “dynamic” streams:
• insert(p), p in R^d
• delete(p)
In reset model, points have identity
delete(p) + insert(p’) gives more information than reset
Delta Model
Given vector A[1..n]
A[i] is a point in R^d, d ≥ 2
Process updates (i, x1, x2, …, xd)
• A[i] := A[i] + (x1, x2, …, xd)
Motivation:
• Data is often multi-dimensional
• E.g. <IP address, packet size, packet delay>
Direct generalization of turnstile model
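A minimal exact model of the delta update, for comparison with the reset model. The class name `DeltaVector` and the example values (packet size, packet delay) are hypothetical; only the update rule A[i] := A[i] + (x1, …, xd) comes from the slide.

```python
class DeltaVector:
    """Exact (non-streaming) reference model of A[1..n] under delta updates."""

    def __init__(self, n, d=2):
        self.d = d
        # Assumption: every A[i] starts at the origin of R^d.
        self.points = {i: [0.0] * d for i in range(1, n + 1)}

    def update(self, i, delta):
        # (i, x1, ..., xd): add the vector to A[i], coordinate by coordinate.
        if len(delta) != self.d:
            raise ValueError("delta must have d coordinates")
        for axis, change in enumerate(delta):
            self.points[i][axis] += change


flows = DeltaVector(n=2, d=2)
flows.update(1, (1500.0, 0.25))   # hypothetical <size, delay> increments
flows.update(1, (-500.0, 0.5))    # negative changes are allowed, as in turnstile
```

With d = 1 this reduces to the usual turnstile update A[i] := A[i] + x.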
Delta Model
Problems involving several dimensions
“Extent” of points (sum of distances of points from a given center)
k-median, diameter, minimum enclosing ball, etc.?
Regression:
• correlation of packet size with delay
Problems
Reset model
• Lp norm*
• Lp sampling*
• 1-median
Delta model
• “Extent” of points
• 1-median
* monotone, d = 1
Lp norm: Reset Model
Assume wlog p = 1
Required to estimate ||A||1 = Σ |A[i]|
Assume monotone updates
• A[i] initially zero
• reset(i, x) implies A[i] ≤ x
• A[i] := max(A[i], x) [GC]
Estimation impossible if non-monotone
• reduction to estimating |X| - |X ∩ Y|
L1 norm (reset model)
Reduction to counting distinct items
[Figure: the values in A are grouped into buckets with boundaries at powers of (1+ε): (1+ε)^1, …, (1+ε)^i, …]
ni = number of distinct items in ith bucket
wi = width of ith bucket
Σ(wi*ni) ≤ ||A||1 ≤ (1+ε) Σ(wi*ni)
L1 norm (reset model)
Counting the number of distinct items in a stream ≡ L0 norm
• poly-log space and time [FM,CIM]
Need to keep only O((log n)/ε) buckets.
Can we detect if the input is non-monotone?
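The bucket reduction for the L1 norm can be sketched as follows, under one reading of the slides: bucket boundaries are powers of (1+ε), and an item counts toward every bucket its value has reached. The class `BucketedL1Estimator`, the extra first bucket [0, 1), and the assumption that nonzero values are at least 1 are illustrative choices. Exact Python sets stand in for the poly-log-space distinct-count sketches, so this shows the arithmetic of the reduction, not its space bound.

```python
import math


class BucketedL1Estimator:
    def __init__(self, eps):
        self.eps = eps
        # level j -> set of indices; a stand-in for a distinct-count sketch.
        self.level_items = {}

    def _boundary(self, j):
        return (1.0 + self.eps) ** j

    def _top_level(self, x):
        # Largest m with (1+eps)^m <= x, with guards against rounding.
        m = int(math.log(x) / math.log1p(self.eps))
        while self._boundary(m + 1) <= x:
            m += 1
        while m > 0 and self._boundary(m) > x:
            m -= 1
        return m

    def reset(self, i, x):
        # Monotone reset: inserting i again is harmless, as in distinct counting.
        if x < 1.0:
            return  # assumption: nonzero values are at least 1
        for j in range(self._top_level(x) + 1):
            self.level_items.setdefault(j, set()).add(i)

    def _width(self, j):
        # Bucket 0 is [0, 1); bucket j >= 1 spans two consecutive boundaries.
        return 1.0 if j == 0 else self._boundary(j) - self._boundary(j - 1)

    def estimate(self):
        return sum(self._width(j) * len(items)
                   for j, items in self.level_items.items())


est = BucketedL1Estimator(eps=0.1)
for index, value in [(1, 5.0), (2, 10.0), (1, 7.0), (3, 1.0)]:
    est.reset(index, value)
true_norm = 7.0 + 10.0 + 1.0  # final values of A[1], A[2], A[3]
```

Under these assumptions each item contributes the largest boundary not exceeding its value, so the estimate lies between ||A||1/(1+ε) and ||A||1.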
Lp sampling
Query: sample()
Choose i from {1,…,n} with probability proportional to |A[i]|^p
Successive calls may return same index, if no updates happen.
Not known how to do this in the turnstile model
Can be used to detect if ||A||1 ≤ (1 - ε) ||A*||1
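Lp sampling (reset model)
Reduction to sampling distinct items
[Figure: the values in A are grouped into buckets with boundaries at powers of (1+ε): (1+ε)^1, …, (1+ε)^i, …]
ni = number of distinct items in ith bucket
wi = width of ith bucket
Sample a random (distinct) index from each bucket
Return sample from bucket i with probability proportional to wi * ni
For i in bucket B_j:
$$\Pr[i] = \frac{1}{n_j} \cdot \frac{w_j n_j}{\sum_{j'} w_{j'} n_{j'}} \approx \frac{|A[i]|}{\|A\|}$$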
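The sampling reduction can be illustrated with a small exact computation. This is a sketch under assumptions: buckets are read as levels with boundaries at powers of (1+ε), an index belongs to every level its value has reached, nonzero values are at least 1, and plain lists stand in for distinct-sampling sketches. The helper names are hypothetical.

```python
import random


def level_sets(values, eps):
    """Map level j -> indices whose value has reached boundary (1+eps)^j."""
    levels = {}
    for i, x in values.items():
        j = 0
        while (1.0 + eps) ** j <= x:
            levels.setdefault(j, []).append(i)
            j += 1
    return levels


def width(j, eps):
    # Level 0 covers [0, 1); level j >= 1 spans two consecutive boundaries.
    return 1.0 if j == 0 else (1.0 + eps) ** j - (1.0 + eps) ** (j - 1)


def sample(values, eps, rng):
    levels = level_sets(values, eps)
    ids = sorted(levels)
    # Pick a level with probability proportional to w_j * n_j ...
    weights = [width(j, eps) * len(levels[j]) for j in ids]
    j = rng.choices(ids, weights=weights, k=1)[0]
    # ... then a uniformly random distinct index within that level.
    return rng.choice(levels[j])


def exact_probabilities(values, eps):
    """Probability that sample() returns each index, computed exactly."""
    levels = level_sets(values, eps)
    total = sum(width(j, eps) * len(items) for j, items in levels.items())
    probs = {i: 0.0 for i in values}
    for j, items in levels.items():
        for i in items:
            probs[i] += width(j, eps) / total  # (1/n_j) * (w_j * n_j) / total
    return probs


values = {1: 7.0, 2: 10.0, 3: 1.0}
probs = exact_probabilities(values, eps=0.1)
picked = sample(values, eps=0.1, rng=random.Random(0))
```

Under this reading, each probability is within a factor (1+ε) of |A[i]| / ||A||1.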
1-median
Assume A[i] contains coordinates of a set S of 2-D points
Problem: find c in R^2 s.t. Σ_{p in S} d(c,p) (Euclidean distance) is approximately minimized
Monotonicity not required; cannot report Σ_{p in S} d(c,p).
Return a (4/π + ε) ≈ (1.27 + ε) estimator
Boosting: see later.
1-median (reset model)
L1 1-median:
• find c in R^2 such that Σ_{p in S} d(c,p) is minimized
• d(p,q) = L1 distance: d1(p,q) = |px – qx| + |py – qy|
L1 1-median c = (cx, cy)
• cx = median of x-coordinates
• cy = median of y-coordinates
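The coordinate-wise characterization is easy to check on a small example. The function names and the point set below are made up for illustration, and the computation is exact rather than streaming.

```python
import statistics


def l1_cost(c, points):
    """Sum of L1 distances from c to every point."""
    return sum(abs(c[0] - px) + abs(c[1] - py) for px, py in points)


def l1_median(points):
    # The two coordinates can be optimized independently under the L1 metric.
    cx = statistics.median_low([px for px, _ in points])
    cy = statistics.median_low([py for _, py in points])
    return (cx, cy)


points = [(0, 0), (2, 1), (10, 5), (3, 9), (4, 2)]
center = l1_median(points)
```

Here the x-median and y-median come from different points, so the returned center is not itself a member of the point set.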
1-median (reset model)
1-D median
• sample O((1/ε) log (1/δ)) random indices; maintain position of sample.
• median mx of x-coordinates of sample is (1+ε)-approximation to median of x-coordinates of S.
• (1+ε)-approximate median is a (1+ε’)-approximate 1-median in 1-D
Approximate L1 1-median: return (mx, my)
• may not be in S.
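The sampling step for one coordinate might look like the sketch below. The class `SampledMedian` is hypothetical, and the sample size is left as a parameter rather than derived from ε and δ; the only idea taken from the slide is to fix a random set of indices up front and track their current positions under resets.

```python
import random
import statistics


class SampledMedian:
    def __init__(self, n, sample_size, rng):
        # Choose the tracked indices once; all other resets are ignored.
        self.tracked = {i: 0.0 for i in rng.sample(range(n), sample_size)}

    def reset(self, i, x):
        if i in self.tracked:
            self.tracked[i] = x

    def median(self):
        return statistics.median_low(list(self.tracked.values()))


n = 1000
sm = SampledMedian(n, sample_size=400, rng=random.Random(0))
for i in range(n):
    sm.reset(i, float(i))  # coordinate of point i is i, so the true median is near 500
approx_median = sm.median()
```

Running one such tracker on x-coordinates and another on y-coordinates, over the same sampled indices, gives the pair (mx, my).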
Projections of points
L1 1-median is a √2-approximation to L2 (Euclidean) 1-median; consider projections of S to do better:
Let l be a line segment of length x, and let s be the sum of the lengths of the projections of l on k equally-spaced lines passing through the origin. Then πs/(2k) = x(1 ± Θ(1/k)).
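The projection identity can be checked numerically. In this sketch the k lines are taken at angles jπ/k for j = 0, …, k-1, which is one natural reading of "equally-spaced"; the function names are illustrative.

```python
import math


def projection_sum(x, theta, k):
    """Sum of projection lengths of a segment (length x, angle theta) on k lines."""
    return sum(x * abs(math.cos(theta - j * math.pi / k)) for j in range(k))


def length_from_projections(x, theta, k):
    # Rescale by pi / (2k) to recover an estimate of the segment length.
    return math.pi * projection_sum(x, theta, k) / (2 * k)


estimates = [length_from_projections(3.0, theta, 64) for theta in (0.0, 0.3, 1.0)]
```

The sum is a Riemann sum for the integral of |cos| over one period, which equals 2, and that is where the factor π/(2k) comes from.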
1-median (reset model)
Consider L1 1-medians c1 … ck
Σ d(ci,S) ≤ (4k/π + O(1/k)) d(c*,S)
One of the ci is a (4/π + ε) approx.
• Which one?
• λ d(p,S) + (1-λ) d(q,S) ≥ d(λp + (1-λ)q, S)
• return average of c1 … ck
Boosting confidence: take several independent samples, take mean.
Q: how good is 1-median of sample?
Similar to “projection median” [DK]
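The averaging step can be sketched as below, assuming that c1 … ck are the L1 1-medians computed in k coordinate frames rotated by angles iπ/(2k). That choice of angles and the function names are assumptions, and the medians here are exact rather than sampled.

```python
import math
import statistics


def euclid_cost(c, points):
    """d(c, S): sum of Euclidean distances from c to every point."""
    return sum(math.hypot(c[0] - px, c[1] - py) for px, py in points)


def rotated_l1_medians(points, k):
    centers = []
    for i in range(k):
        theta = i * math.pi / (2 * k)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Coordinates of every point in the frame rotated by theta.
        us = [px * cos_t + py * sin_t for px, py in points]
        vs = [-px * sin_t + py * cos_t for px, py in points]
        u, v = statistics.median(us), statistics.median(vs)
        # Map the coordinate-wise median back to the original frame.
        centers.append((u * cos_t - v * sin_t, u * sin_t + v * cos_t))
    return centers


def averaged_center(points, k):
    centers = rotated_l1_medians(points, k)
    return (sum(c[0] for c in centers) / k, sum(c[1] for c in centers) / k)


pts = [(0, 0), (5, 1), (2, 7), (9, 3), (4, 4), (1, 8)]
centers = rotated_l1_medians(pts, 6)
avg = averaged_center(pts, 6)
```

Because d(·, S) is convex, the cost at the average is at most the average of the costs of the ci, which is why returning the average avoids having to decide which ci is the good one.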
Reset Model (conclusions)
Computed extent and approximate 1-median.
Many problems seem hard without some monotonicity assumptions
• CH, k-center, k-median, k > 1
What assumptions?
• strict: points moving away from known origin (min encl ball, [GC])
• points moving away from unknown origin
• points moving monotonically along trajectories from known class (lines, e.g.)
Delta Model
A[1..n]; A[i] is a point in R^d, d ≥ 2
S is set of points
Updates (i, x1, x2, …, xd)
• A[i] := A[i] + (x1, x2, …, xd)
“Extent” query:
• Given c, estimate Σ_{p in S} d(c,p) (Euclidean distances)
Delta Model
Extent query:
• Use projections and 1-D L1 norm sketches
• (1+ε)-approximation to extent(c)
1-median
• Use L1 1-median to find suitable search area.
• Using above, search for 1-median
• (1+ε)-approximation
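The extent query via projections can be illustrated in 2-D as follows. This is a sketch under assumptions: the class `ProjectedExtent` is hypothetical, indices are 0-based, the k lines are at angles jπ/k, and exact arrays stand in for the 1-D L1 norm sketches. What carries over to a sketch-based version is that each projection is linear in the delta updates.

```python
import math


class ProjectedExtent:
    def __init__(self, n, k):
        self.k = k
        self.dirs = [(math.cos(j * math.pi / k), math.sin(j * math.pi / k))
                     for j in range(k)]
        # proj[j][i] = projection of A[i] on line j (stand-in for an L1 sketch).
        self.proj = [[0.0] * n for _ in range(k)]

    def update(self, i, dx, dy):
        # A delta update changes each projection by a linear amount.
        for j, (ux, uy) in enumerate(self.dirs):
            self.proj[j][i] += dx * ux + dy * uy

    def extent(self, c):
        total = 0.0
        for j, (ux, uy) in enumerate(self.dirs):
            offset = c[0] * ux + c[1] * uy
            # 1-D L1 norm of the projected points, shifted by the center.
            total += sum(abs(value - offset) for value in self.proj[j])
        return math.pi * total / (2 * self.k)


pe = ProjectedExtent(n=3, k=64)
pe.update(0, 3.0, 4.0)
pe.update(1, -1.0, 2.0)
pe.update(1, 1.0, -2.0)   # point 1 returns to the origin
pe.update(2, 0.0, 5.0)
estimate = pe.extent((0.0, 0.0))
exact = 5.0 + 0.0 + 5.0   # Euclidean distances of the three points from the origin
```

Increasing k tightens the estimate at the cost of maintaining more 1-D structures.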
Conclusions
Introduced (1+ε) new models for “geometric” computation
Gave solutions to some basic problems
Many open questions:
• appropriate monotonicity assumptions for reset model
• statistical analysis of low-dimensional point set for delta model