
Page 1: Lp-Sampling

Lp-Sampling

David Woodruff, IBM Almaden
Joint work with Morteza Monemizadeh, TU Dortmund

Page 2: Lp-Sampling

• Given a stream of updates (i, a) to coordinates i of an n-dimensional vector x
  – |a| < poly(n)
  – a is an integer
  – stream length < poly(n)

• Output i with probability |x_i|^p / F_p, where F_p = |x|_p^p = Σ_{i=1}^n |x_i|^p

• Easy cases:
  – p = 1 and all updates of the form (i, 1) for some i
    Solution: choose a random update in the stream and output the coordinate it updates [Alon, Matias, Szegedy]. Generalizes to all positive updates (see the sketch below).
  – p = 0 and there are no deletions
    Solution: min-wise hashing; hash all distinct coordinates as you see them, maintaining the minimum hash value and its item [Broder, Charikar, Frieze, Mitzenmacher] [Indyk] [Cormode, Muthukrishnan]
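As a hedged illustration of the p = 1 easy case with positive updates (all names here are mine, not from the talk): a single weighted reservoir sample over the updates, treating an update (i, a) with a > 0 as weight a, outputs i with probability x_i / Σ_j x_j.

```python
import random

def l1_sample(stream):
    """Sample coordinate i with probability x_i / sum_j x_j,
    assuming every update (i, a) has a > 0."""
    total = 0
    sample = None
    for i, a in stream:
        total += a
        # Keep i with probability a / total; by induction, each update
        # survives to the end with probability proportional to its weight.
        if random.random() * total < a:
            sample = i
    return sample

# x = (3, 1, 6): coordinate 2 is returned with probability 6/10
print(l1_sample([(0, 3), (1, 1), (2, 6)]))
```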

Page 3: Lp-Sampling

Our main result

• For every 0 ≤ p ≤ 2, there is an algorithm that fails with probability ≤ n^-100, and otherwise outputs an I in [n] for which, for all j in [n],

Pr[I = j] = (1 ± ε) |x_j|^p / F_p

So, in any poly(n)-time algorithm, we can condition on every invocation succeeding

The algorithm is 1-pass, with poly(ε^-1 log n) space and update time, and also returns w_I = (1 ± ε) |x_I|^p / F_p

Generalizes to a 1-pass, n^{1-2/p} poly(ε^-1 log n)-space algorithm for p > 2

• "Additive-error" samplers with Pr[I = j] = |x_j|^p / F_p ± ε were given
  – explicitly in [Jayram, W]
  – implicitly in [Andoni, DoBa, Indyk, W]

Page 4: Lp-Sampling

Lp-sampling solves and unifies many well-studied streaming problems:

Page 5: Lp-Sampling

• Solves Sampling with Deletions:

– [Cormode, Muthukrishnan, Rozenbaum] want importance sampling with deletions: maintain a sample i with probability |x_i| / |x|_1

• Set p = 1 in our theorem

– [Chaudhuri, Motwani, Narasayya] ask to sample from the result of a SQL operation, e.g., self-join

• Set p = 2 in our theorem

– [Frahling, Indyk, Sohler] study maintaining approximate range spaces and costs of Euclidean spanning trees

• They need and obtain a routine to sample a point from a set undergoing insertions and deletions

• Alternatively, set p = 0 in our theorem

Page 6: Lp-Sampling

• Alternative solution to the Heavy Hitters Problem for any F_p:

  – Output all i for which |x_i|^p > φ F_p

  – Do not output any i for which |x_i|^p < (φ/2) F_p

  – Studied by Charikar, Chen, Cormode, Farach-Colton, Ganguly, Muthukrishnan, and many others

  – Invoke our algorithm Õ(1/φ) times and use the approximations to the values (see the sketch below)

  – Optimal up to poly(ε^-1 log n) factors
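A hedged sketch of this reduction (the sampler interface and the constants are my assumptions): draw Õ(1/φ) samples, so each φ-heavy coordinate appears with high probability, and the returned weight approximations separate the two thresholds.

```python
import math

def heavy_hitters(lp_sampler, phi, n):
    """lp_sampler() is assumed to return (i, w) with Pr[i] ~ |x_i|^p / F_p
    and w = (1 +- eps)|x_i|^p / F_p, as in the main theorem (eps small)."""
    reps = math.ceil((20 / phi) * math.log(n))  # O~(1/phi) invocations
    best = {}
    for _ in range(reps):
        i, w = lp_sampler()
        best[i] = max(best.get(i, 0.0), w)
    # 0.75*phi sits between the "must output" (> phi) and
    # "must not output" (< phi/2) thresholds, up to the (1 +- eps) error.
    return [i for i, w in best.items() if w > 0.75 * phi]
```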

Page 7: Lp-Sampling

• Solves Block Heavy Hitters: given an n x d matrix R, return the indices i of rows R_i with |R_i|_p^p > φ · Σ_j |R_j|_p^p

– [Andoni, DoBa, Indyk] study the case p = 1

– Used by [Andoni, Indyk, Krauthgamer] for constructing a small-size sketch for the Ulam metric under the edit distance

– Treat R as a big (nd)-dimensional vector

– Sample an entry (i, j) using our theorem for general p

– The probability a row i is sampled is |R_i|_p^p / Σ_j |R_j|_p^p, so we can recover the IDs of all the heavy rows.

– We do not use Cauchy random variables or Nisan's pseudorandom generator, so this could be more practical than [ADI]

Page 8: Lp-Sampling

• Alternative solution to F_k-Estimation for any k ≥ 2:

  – Optimal up to poly(ε^-1 log n) factors

  – Reduction given by [Coppersmith, Kumar] (see the sketch below):
    – Take r = O(n^{1-2/k}) L_2-samples w_{i_1}, …, w_{i_r}
    – In parallel, estimate F_2; call the estimate F_2'
    – Output (F_2'/r) · Σ_j w_{i_j}^{k-2}

Proof: second moment method

First algorithm not to use Nisan’s pseudorandom generator
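A hedged sketch of the [Coppersmith, Kumar] reduction (the two black boxes below are my assumed interfaces): under L_2-sampling, E[|x_i|^{k-2}] = Σ_i (x_i^2/F_2) |x_i|^{k-2} = F_k/F_2, so rescaling the empirical mean by F_2 estimates F_k.

```python
def estimate_fk(l2_sample, f2_estimate, k, r):
    """l2_sample() is assumed to return the value x_i of an index i drawn
    with Pr[i] ~ x_i^2 / F_2; f2_estimate is an assumed estimate F_2'.
    r = O(n^(1-2/k)) samples suffice by the second moment method."""
    total = sum(abs(l2_sample()) ** (k - 2) for _ in range(r))
    return (f2_estimate / r) * total
```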

Page 9: Lp-Sampling

• Solves Cascaded Moment Estimation:
  – Given an n x d matrix A, F_k(F_p)(A) = Σ_j |A_j|_p^{pk}

  – Problem initiated by [Cormode, Muthukrishnan]
    • They show F_2(F_0)(A) can be done in O(n^{1/2}) space if there are no deletions
    • They ask about the complexity for other k and p

  – For any p in [0, 2], our theorem gives O(n^{1-1/k}) space for F_k(F_p)(A)
    • We get entry (i, j) with probability |A_{i,j}|^p / Σ_{i',j'} |A_{i',j'}|^p
    • The probability row A_i is returned is F_p(A_i) / Σ_j F_p(A_j)
    • If 2 passes are allowed: take O(n^{1-1/k}) samples A_i in the 1st pass, compute F_p(A_i) in the 2nd pass, and feed the results into the F_k AMS estimator
    • To get 1 pass, feed the row IDs into an O(n^{1-1/k})-space algorithm of [Jayram, W] for estimating F_k based only on item IDs
    • The algorithm is space-optimal [Jayram, W]

  – Our theorem with p = 0 gives O(n^{1/2}) space for F_2(F_0)(A) even with deletions

Page 10: Lp-Sampling

Ok, so how does it work?

Page 11: Lp-Sampling

General Framework [Indyk, W]

• S_t = {i : |x_i| in [η^{t-1}, η^t)} for η = 1 + Θ(ε)

• S_t contributes if |S_t| η^{pt} ≥ ζ F_p(x), where ζ = poly(ε/log n) (assume p > 0 in this talk)

• Let h: [n] -> [n] be a hash function
  – Create log n substreams Stream_1, Stream_2, …, Stream_{log n}
  – Stream_j is the stream restricted to updates (i, c) with h(i) ≤ n/2^j

• Suppose 2^j ≈ |S_t|. Then
  – Stream_j contains about 1 item of S_t
  – F_p(Stream_j) ≈ F_p(x)/2^j
  – |S_t| η^{pt} ≥ ζ F_p(x) means η^{pt} ≥ ζ F_p(Stream_j)
  – Can find the item of S_t in Stream_j with an F_p-heavy hitters algorithm
  – Repeat the sampling poly(ε^-1 log n) times and count the number of times there was an item in Stream_j from S_t
  – Use this to estimate the sizes of the contributing S_t, and F_p(x) ≈ Σ_t |S_t| η^{pt}

1. Form streams by subsampling
2. Run a heavy hitters algorithm on the streams
3. Use the heavy hitters to estimate the contributing S_t
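A minimal sketch of the subsampling structure in step 1 (function names are mine): one hash h assigns each coordinate its deepest substream, so Stream_j contains i iff h(i) ≤ n/2^j and the substreams are nested.

```python
import random

def assign_levels(coords, n, seed=0):
    """Map each coordinate i to the largest j with h(i) <= n / 2^j,
    i.e., the deepest Stream_j that contains i. A stand-in pseudorandom
    h replaces the limited-independence hash of the real algorithm."""
    rng = random.Random(seed)
    levels = {}
    for i in coords:
        h_i = rng.randrange(1, n + 1)  # h: [n] -> [n]
        j = 0
        while h_i <= n / 2 ** (j + 1):
            j += 1
        levels[i] = j  # i appears in Stream_1, ..., Stream_j (none if j = 0)
    return levels
```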

Page 12: Lp-Sampling

Additive Error Sampler [Jayram, W]

• For each contributing S_t, we also get poly(ε^-1 log n) items from the heavy hitters routine

• If the sub-sampling is sufficiently random (Nisan's generator, min-wise independence), these items are random elements of S_t

• Since we have (1 ± ε)-approximations s'_t to all contributing |S_t|, we can (see the sketch below):
  – Choose a contributing t with probability s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'}
  – Output a random heavy hitter found in S_t

• For an item i in a contributing S_t:
  – Pr[i output] = [s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'}] · 1/|S_t| = (1 ± ε)|x_i|^p / F_p

• For an item i in a non-contributing S_t:
  – Pr[i output] = 0
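A hedged sketch of the class-selection step (the interface is mine): pick class t with probability proportional to s'_t η^{pt}.

```python
import random

def choose_class(s_prime, eta, p):
    """s_prime: dict t -> (1 +- eps)-approximation of |S_t| for the
    contributing classes; returns t w.p. s'_t eta^(pt) / sum_t' s'_t' eta^(pt')."""
    weights = {t: s * eta ** (p * t) for t, s in s_prime.items()}
    r = random.random() * sum(weights.values())
    for t, w in sorted(weights.items()):
        r -= w
        if r <= 0:
            return t
    return max(weights)  # guard against floating-point round-off
```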

Page 13: Lp-Sampling

Relative Error in Words

• Force all classes to contribute
  – Inject additional coordinates into each class whose purpose is to make every class contribute
  – Inject just enough so that, overall, F_p does not change by more than a (1+ε)-factor

• Run [Jayram, W]-sampling on the resulting vector (see the sketch below)
  – If the item sampled is an injected coordinate, forget about it
  – Repeat many times in parallel and take the first repetition that is not an injected coordinate

• Since the injected coordinates contribute only an O(ε) fraction of the F_p mass, a small number of repetitions suffices
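A minimal sketch of the rejection step (names are mine): run independent copies of the sampler in parallel and return the first sample that is not injected.

```python
def sample_real_coordinate(samplers, is_injected):
    """samplers: independent [Jayram, W]-sampler instances run in parallel;
    is_injected(i): whether coordinate i was artificially injected.
    Injected mass is O(eps), so few repetitions are needed in expectation."""
    for sampler in samplers:
        i = sampler()
        if not is_injected(i):
            return i
    return None  # every repetition hit an injected coordinate (unlikely)
```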

Page 14: Lp-Sampling

Some Minor Points

• Before seeing the stream, we don't know which classes contribute, so we inject coordinates into every class
  – For S_t = {i : |x_i| in [η^{t-1}, η^t)}, inject Θ(ε F_p / (η^{pt} · #classes)) coordinates, where #classes = O(ε^-1 log n)
  – Need to know F_p; just guess it, and verify the guess at the end of the stream

• For some classes, Θ(ε F_p / (η^{pt} · #classes)) < 1, e.g., if t is very large, so we can't inject any new coordinates
  – Find all elements in these classes, and (1 ± ε)-approximations to their frequencies, separately using a heavy hitters algorithm
  – When sampling, either choose such a heavy hitter with the appropriate probability, or select from the contributing sets using [Jayram, W]

Page 15: Lp-Sampling

There is a Problem

• The [Jayram, W]-sampler fails with probability ≥ poly(ε/log n), in which case it can output any item

• This is due to some of the subroutines of [Indyk, W] that it relies on, which only succeed with probability 1 − poly(ε/log n)

• So the large poly(ε/log n) additive error is still there

• Cannot repeat [Jayram, W] multiple times for amplification, since we get a collection of samples with no obvious way of detecting failure
  – On the other hand, one could just repeat [Indyk, W] and take the median for the simpler F_k-estimation problem

• Our solution:
  – Dig into the guts of the [Indyk, W] algorithm
  – Amplify the success probability of its subroutines to ≥ 1 − n^-100

Page 16: Lp-Sampling

A Technical Point About [Indyk, W]

• In [Indyk, W]:
  – Create log n substreams Stream_j, where Stream_j includes each coordinate independently with probability 2^-j
  – Can find the items of the contributing S_t in Stream_j with F_p-heavy hitters
  – Repeat the sampling poly(ε^-1 log n) times, and observe the fraction of repetitions in which there is an item in Stream_j from S_t

• Can use [Indyk, W] to estimate every |S_t|, since every class contributes
  – But there is an issue of misclassification:
    • S_t = {i : |x_i| in [η^{t-1}, η^t)}, and the F_p-heavy hitters algorithm only reports approximate frequencies of the items i it finds
    • If |x_i| = η^t, it may be classified into S_t or S_{t+1}; this doesn't matter
  – Simpler solution than in [Indyk, W]:
    • If an item is misclassified, just classify it consistently if we see it again
    • Equivalent to sampling from a vector x' with |x'|_p = (1 ± ε)|x|_p

• Can ensure that, with probability ≥ 1 − n^-100, we obtain s'_t = (1 ± ε)|S_t| for all t

Page 17: Lp-Sampling

A Technical Point About [Jayram, W]

• Since we have s'_t = (1 ± ε)|S_t| for all t:
  – Choose a class t with probability s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'}
  – Output a random heavy hitter found in S_t

• How do we output a random item in S_t?

• Use a min-wise independent hash function h (see the sketch below)
  – For each i in S_t, h(i) = min_{j in S_t} h(j) with probability (1 ± ε)/|S_t|
  – h can be an O(log 1/ε)-wise independent hash function

• We recover the i* in S_t for which h(i*) is minimum
  – Compatible with sub-sampling, where Stream_j consists of the items i for which h(i) ≤ n/2^j

• Our goal is to recover i* with probability ≥ 1 − n^-100

• We have s'_t, and look at the level j* where |S_t|/2^{j*} = Θ(log n)

• If h is O(log n)-wise independent, then with probability ≥ 1 − n^-100, i* is in Stream_{j*}

• A worry: maybe F_p(Stream_{j*}) >> F_p(x)/2^{j*}, so the heavy hitters algorithm doesn't work

• This can be resolved with enough independent repetitions
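A toy sketch of the min-hash selection (the stand-in hash is mine; the real algorithm uses a limited-independence family): keep the item whose hash value is smallest.

```python
import random

def minhash_pick(items, seed=0):
    """Return the i minimizing h(i). With a min-wise independent h, each
    i in items wins with probability (1 +- eps)/|items|; here a stand-in
    pseudorandom h replaces the O(log 1/eps)-wise independent family."""
    h = lambda i: random.Random(seed ^ hash(("h", i))).random()
    return min(items, key=h)

# In the streaming setting one maintains the argmin of h over items seen so
# far, which is what the heavy hitters routine recovers within Stream_{j*}.
```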

Page 18: Lp-Sampling

Beyond the Moraines: Sampling Records

• Given an n x d matrix M with rows M_1, …, M_n, sample i with probability |M_i|_X / Σ_j |M_j|_X, where X is a norm

• If i is sampled, return a vector v for which |v|_X = (1 ± ε)|M_i|_X

• Applications
  – Estimating planar EMD [Andoni, DoBa, Indyk, W]
  – Sampling records in a relational database

• Define classes S_t = {i : |M_i|_X in [η^{t-1}, η^t)} for η = 1 + Θ(ε)

• If we have a heavy hitters algorithm for rows of a matrix, then we can apply an approach similar to the one above

• The space should be d · poly(ε^-1 log n)

Page 19: Lp-Sampling

Heavy Hitters for Rows

• Algorithm in [Andoni, DoBa, Indyk, W] (see the sketch below)
  – Partition the rows into B buckets
  – In each bucket, maintain the vector sum of the rows hashed to it

• If |M_i|_X > γ Σ_j |M_j|_X, and v is the vector in the bucket containing M_i, then by the triangle inequality:
  – |v|_X < |M_i|_X + |Noise|_X ≈ |M_i|_X + Σ_j |M_j|_X / B
  – |v|_X > |M_i|_X − |Noise|_X ≈ |M_i|_X − Σ_j |M_j|_X / B

• For B large enough, the noise translates into a relative error
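A minimal sketch of the bucketing structure (class and parameter names are mine): row indices hash to one of B buckets whose coordinate-wise sums are maintained under additive updates.

```python
import random

class RowBuckets:
    """Hash row indices into B buckets and maintain, per bucket, the
    coordinate-wise sum of all rows mapped there. For a gamma-heavy row,
    the other rows in its bucket contribute ~ sum_j |M_j|_X / B of noise."""
    def __init__(self, B, d, seed=0):
        self.rng = random.Random(seed)
        self.B, self.d = B, d
        self.bucket_of = {}
        self.sums = [[0.0] * d for _ in range(B)]

    def update(self, i, delta):
        # delta: a length-d additive update to row M_i (deletions negative)
        b = self.bucket_of.setdefault(i, self.rng.randrange(self.B))
        for j, c in enumerate(delta):
            self.sums[b][j] += c

    def bucket_vector(self, i):
        # Approximates M_i up to the bucket's noise when row i is heavy
        return self.sums[self.bucket_of[i]]
```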

Page 20: Lp-Sampling

Lower Bounds

For every 0 ≤ p ≤ 2, there is a randomized algorithm that with probability ≤ n^-100 outputs FAIL, and otherwise outputs an I in [n] for which, for all j in [n],

Pr[I = j] = (1 ± ε) |x_j|^p / F_p

The algorithm is 1-pass, with poly(ε^-1 log n) space and time, and returns w_I = (1 ± ε)|x_I|^p / F_p

For p > 2, this gives n^{1-2/p} poly(ε^-1 log n) space.

Can we output FAIL with probability 0?

Requires Ω(n) space for any ε. Reduction from 2-party equality testing with no error

Given that we don’t output FAIL, can we get a sampler with ε = 0?

Yes for 2-pass algorithms, using rejection sampling.

1 pass requires Ω(n) space if the algorithm outputs the corresponding probability w_I (needed in many applications). Reduction from the 2-party INDEX problem

Can we use less space for p > 2?

Requires Ω(n^{1-2/p}) space for any ε. Reduction from L1-estimation

Can improve to Ω(n^{1-2/p} log n) using augmented L1-estimation [Jayram, W]

Page 21: Lp-Sampling

Some Open Questions

• 1-pass algorithms for Lp-sampling

– If we output FAIL with probability ≤ n^-100 and don't require outputting the sampled item's probability, can we get ε = 0 with low space?

– ε and log n factors are large. What is the optimal dependence on them?

• Useful for Fk-estimation for k > 2, and other applications

• Sampling from other distributions

– Given a vector (x_1, …, x_n) in a data stream, for which functions g can we sample from the distribution μ(i) = |g(x_i)| / Σ_j |g(x_j)|?

• E.g., random walks

Thank you