CS 261: Graduate Data Structures
Week 4: Streaming and sketching
David Eppstein
University of California, Irvine
Spring Quarter, 2020
Streaming and sketching
The main idea
Sometimes data is too big to fit into memory...
Sketching
Represent data by a structure of sublinear size
Preferably O(1); at most n^(1−ε) for some ε > 0
Keep enough information to solve what you want, either exactly or approximately
Streaming
Read through your data once, in some given order
Use a sketch to represent information about it
Produce a result from the sketch
Example: Sketch for the average
Sketch for the average of a collection of numbers: pair (n, t)
n is the number of elements in the collection; t is the sum of the elements
Operations:
I Incorporate new element x into collection: Add one to n and add x to t
I Remove element from collection: Subtract one from n and subtract x from t
I Compute average of collection: Return t/n
Total space is O(1) regardless of how big n is
(Time/operation is also O(1) but our main concern is space)
Example: Streaming average
To compute the average of a sequence S:
I Initialize an empty sketch for the average
I For each x in the sequence, incorporate x into the sketch
I Return the average of the sketch
Time for an n-element sequence is O(n)
Working space (not counting the space for the input) is O(1)
This works even when the sequence is generated somehow in a way that does not involve local storage (e.g. internet traffic)
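The (n, t) sketch and the streaming loop above can be written out in Python (a minimal sketch; the class and function names are my own):

```python
class AverageSketch:
    """O(1)-space sketch for the average of a collection: the pair (n, t)."""

    def __init__(self):
        self.n = 0  # number of elements in the collection
        self.t = 0  # sum of the elements

    def insert(self, x):
        self.n += 1
        self.t += x

    def remove(self, x):  # removals also work (turnstile model)
        self.n -= 1
        self.t -= x

    def average(self):
        return self.t / self.n


def streaming_average(stream):
    """One pass over the data, O(1) working space."""
    sketch = AverageSketch()
    for x in stream:
        sketch.insert(x)
    return sketch.average()
```

For example, streaming_average([1, 2, 3, 4]) returns 2.5, and the stream can be any iterable, including one generated on the fly.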
Cash register versus turnstile
Two different variations of streaming algorithms:
Cash register: Money goes in but doesn’t come back out
Meaning for streaming:
Input sequence has only insertions, no removals
Turnstile: Counts people passing in either direction
Meaning for streaming:
Input sequence can have both insertions and removals
Streaming average was presented in the cash register model, but the same sketch also works in the turnstile model
Images: File:Credit card terminal in Laos.jpg and File:Warszawa - Metro - Świętokrzyska (17009062241).jpg from Wikimedia Commons
Beyond streaming
Sketches can sometimes be combined to produce information about multiple data sets
Example: Can produce an average-sketch for the concatenation of two sequences by adding the sketches of the two sequences
If we have n sequences, of total length N, we can compute all pairwise averages in time O(n² + N), faster than computing each pairwise average separately
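Combining average-sketches is just componentwise addition of (n, t) pairs (a small illustration; the helper names are mine):

```python
def make_sketch(seq):
    """Average-sketch of a sequence: (number of elements, sum of elements)."""
    seq = list(seq)
    return (len(seq), sum(seq))


def combine(s1, s2):
    """Sketch of the concatenation = componentwise sum of the two sketches."""
    return (s1[0] + s2[0], s1[1] + s2[1])


def average(sketch):
    n, t = sketch
    return t / n
```

Combining make_sketch([1, 2, 3]) and make_sketch([4, 5]) gives (5, 15), whose average equals the average of the concatenated sequence.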
Medians and reservoir sampling
Medians
Median: Sort given list of n items; which one goes to position n/2?
Like average, an important measure of the central tendency of a sample, more robust against arbitrarily-corrupted outliers
Applications in facility location: Given n points on a line, place a center minimizing the average distance to the given points
Impossibility of streaming median
No streaming algorithm can compute the median!
Consider the sketch after seeing the first n/2 items in a stream
Depending on the remaining items, any one of these first n/2 could become the median of the whole stream
So we have to remember all their values
Space is Ω(n), not O(n^(1−ε)) for any ε > 0
Approximate medians
We want the item at position (1/2)n in the sorted sequence
Relaxation: δ-approximate median
Position should be between (1/2 − δ)n and (1/2 + δ)n
Easy randomized method: Choose a sample of Θ(δ^(−2)) sequence members and return the median of the sample
Chernoff ⇒ constant probability of being in desired position
Bigger constant in sample size ⇒ better probability
(There also exist non-random solutions, but they are more complicated and have non-constant space complexity)
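The randomized method can be sketched in a few lines (a minimal illustration; the constant factor in the Θ(δ⁻²) sample size is arbitrary here, and sampling is done with replacement for simplicity):

```python
import random


def approximate_median(seq, delta, rng=random):
    """Return an element whose sorted position is likely within
    (1/2 ± delta)·n, by taking the median of a Theta(delta^-2) sample."""
    k = max(1, round(1 / delta ** 2))  # sample size; constant factor assumed
    sample = sorted(rng.choice(seq) for _ in range(k))
    return sample[len(sample) // 2]
```

A larger constant in the sample size improves the success probability, per the Chernoff bound argument above.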
Reservoir sampling
How to maintain a sample of k random elements (without replacement) from a longer sequence?
Store an array A of size k, and the number n
For each element x:
If n < k: Store x in A[n]
Else: Choose a random integer i from 0 to n (inclusive); if i < k, store x in A[i]
Increment n
(More complex methods reduce the number of random number generation steps by calculating, after each replacement, when the next replacement should be made.)
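The loop above in Python; note the random index ranges over 0..n inclusive, so the (n+1)st element is kept with probability k/(n+1):

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform sample of k elements from a stream:
    one pass, O(k) working space."""
    sample = []  # the array A, grown lazily up to size k
    n = 0        # number of elements seen so far
    for x in stream:
        if n < k:
            sample.append(x)          # store x in A[n]
        else:
            i = rng.randrange(n + 1)  # random integer from 0 to n
            if i < k:
                sample[i] = x         # replace a random slot
        n += 1
    return sample
```

If the stream has fewer than k elements, the whole stream is returned.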
Majority and heavy hitters
Impossibility of most-frequent item
We’ve seen the mean and median; what about the mode (most frequent item) in a sequence?
Impossible in general!
Consider the state of a sketch after n − 2 items
Suppose we have already seen one item appear three times, and (n − 3)/2 pairs of other items
If the next two items are equal, with value x, then the result should be
I The triple, if x was not already seen
I The triple, if x has the same value as the triple
I x, if x has the same value as one of the pairs
So the sketch must remember all the pairs, which is too much memory
Impossibility of majority
Majority element: one that occurs as more than half of the sequence elements
Naive formulation of the problem:
Either find a majority element if one exists
Or return “None” if there is no majority
Still impossible!
If first n/2 elements are distinct, any one could become majority
We don’t have enough memory to store all of their values
Boyer–Moore majority vote algorithm
Maintain two variables, m and c, initially None and 0
For each x in the given sequence:
I If m = x : increment c
I Else if c = 0: set m = x and c = 1
I Else decrement c
Claims:
I If sequence has a majority, it will be the final value of m
I If no majority, m may be any element of the sequence (might not be frequently occurring)
I Algorithm cannot tell us whether m is majority element or not
Boyer & Moore, “MJRTY - A Fast Majority Vote Algorithm”, 1991
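The three-case loop above fits in a few lines of Python (a minimal sketch; the function name is mine):

```python
def majority_vote(seq):
    """Boyer-Moore majority vote: if seq has a majority element,
    it is the returned value; otherwise the result may be any element."""
    m, c = None, 0
    for x in seq:
        if m == x:
            c += 1          # x matches the stored candidate
        elif c == 0:
            m, c = x, 1     # counter empty: adopt x as the candidate
        else:
            c -= 1          # x cancels one vote for the candidate
    return m
```

As the claims state, the algorithm cannot itself tell whether the returned m is actually a majority element.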
Streaming majority example
With the sequence elements in left-to-right order:
x:              B  A  C  A  A  A  C  B  A
m:        None  B  B  C  C  A  A  A  A  A
c:        0     1  0  1  0  1  2  1  0  1
majority? N     Y  N  N  N  Y  Y  Y  N  Y
(We know when there is a majority but the algorithm doesn’t)
After the first step, B is in the majority, and the algorithm correctly reports B as its value of m.
After the third A (out of five sequence elements), A is in the majority, and stays in the majority for most of the remaining steps; the algorithm correctly reports A as its value of m.
When there is no majority (each element has at most half of the sequence) the algorithm may choose any element as its value of m.
Heavy hitters
Generalization of the same algorithm:
Store two arrays M and C, both of length k
Initialize cells of M to empty and C to zero
For each sequence item x :
I If x equals M[i] for some i: increment C[i]
I Else if some C[i] is zero: set M[i] = x and C[i] = 1
I Else decrement C[i] for all choices of i
Claim: In a sequence of n elements, if x occurs more than n/(k + 1) times, then x will be one of the elements stored in M
(Special case k = 1 is the correctness of the majority algorithm)
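This generalization is known as the Misra–Gries algorithm; a minimal version, using a dictionary in place of the arrays M and C (the function name is mine):

```python
def heavy_hitters(stream, k):
    """Misra-Gries sketch: any x occurring more than n/(k+1) times in a
    stream of n elements ends up among the (at most k) stored counters."""
    counters = {}  # keys play the role of M, values the role of C
    for x in stream:
        if x in counters:
            counters[x] += 1          # x is stored: increment its counter
        elif len(counters) < k:
            counters[x] = 1           # a free slot: store x with count 1
        else:
            for y in list(counters):  # no free slot: decrement every counter
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]   # freed slots become available again
    return counters
```

For "AAAABBCCD" with k = 2, the element A (4 occurrences out of 9 > 9/3) is guaranteed to survive in the returned dictionary.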
Why does it work?
This algorithm approximately counts occurrences for all sequence elements (not just the ones it stores)!
Define count(x) = actual number of occurrences of x, estimate(x) = C[i] (if M[i] = x for some i) or 0 otherwise.
Then count(x) − n/(k + 1) ≤ estimate(x) ≤ count(x).
Proof idea: Induction on number of steps of the algorithm
If a step ends with x in M, we increase its count and estimate
Otherwise, we decrease all k + 1 nonzero estimates
Number of increases ≤ n ⇒ number of decreases ≤ n/(k + 1)
Because heavy hitters have big estimates, they must be stored
MinHash
Hashing for reservoir sampling
We saw how to maintain a sample of k elements, using a random choice for each element
Instead, make a single random choice, of a hash function h(x)
Sample = the k elements with the smallest values of h
Two advantages:
I Less use of (possibly slow) random number generation
I Repeatable: if different agents use the same hash function to sample the same sets, they will get the same samples
We can take advantage of repeatability to use this for more applications!
Broder, “On the resemblance and containment of documents”, 1997
Combining sets with MinHash
To compute the union:
MinHash(S ∪ T) = MinHash(MinHash(S) ∪ MinHash(T))
Because any element that is one of the k smallest hashes in the union must also be one of the k smallest in either of the smaller sets that contain it
We can get useful information about sets S and T from this computation!
Key question: how many elements from MinHash(S) ∪ MinHash(T) survive into MinHash(S ∪ T)?
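A minimal MinHash in Python, using SHA-256 as the shared repeatable hash function (the function names are my own); the union identity above can then be checked directly:

```python
import hashlib


def h(x):
    """A fixed, repeatable hash function shared by all agents."""
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)


def minhash(s, k):
    """The k elements of s with the smallest hash values."""
    return set(sorted(s, key=h)[:k])
```

For any sets S and T with at least k elements each, minhash(S | T, k) == minhash(minhash(S, k) | minhash(T, k), k), since each of the k smallest hashes in the union is among the k smallest of whichever set contains it.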
Original motivation for MinHash
Before the success of Google, a different company (DEC) ran a different popular search engine (AltaVista)
They had a problem: Many web pages can be found in multiple similar (but not always identical) copies
Search engines should return only one copy of the same page, so the user can see other results without getting swamped by duplicates
⇒ Need to quickly test similarity of web pages
From documents to sets
“Bag of words” model: represent any web page by the set of words used in it
Similar documents (web pages) should have similar words
So after converting web pages to bags of words, the search engine problem becomes: quickly test similarity of pairs of sets
Jaccard similarity
A standard measure of how similar two sets are:
J(S, T) = |S ∩ T| / |S ∪ T|
Scale-invariant (normalized for how big the two sets are)
Ranges from 0 to 1
I 0 when sets are disjoint
I close to 0 when sets are not very similar
I close to 1 when sets are similar
I 1 when sets are identical
Estimating Jaccard similarity
Theorem:
J(S, T) = (1/k) E[|MinHash(S ∪ T) ∩ MinHash(S) ∩ MinHash(T)|]
The intersection consists of the elements of MinHash(S ∪ T) that also belong to MinHash(S ∩ T)
Linearity of expectation ⇒
E[...] = Σ_{x ∈ S∪T} Pr[x ∈ intersection of MinHashes]
       = Σ_{x ∈ S∪T} Pr[x ∈ S ∩ T] × Pr[x ∈ MinHash(S ∪ T)]
       = Σ_{x ∈ S∪T} (|S ∩ T| / |S ∪ T|) × (k / |S ∪ T|)
       = |S ∪ T| × J(S, T) × k / |S ∪ T| = k J(S, T)
How accurate is it?
Suppose we use
(1/k) |MinHash(S ∪ T) ∩ MinHash(S) ∩ MinHash(T)|
as an estimate for J(S, T)
Previous analysis shows that this has the correct expected value: it is an unbiased estimator
Our analysis decomposed the estimate into a sum of independent 0-1 random variables (is each member of S ∪ T in the intersection)
⇒ Can use Chernoff bounds for being within 1 ± ε of expectation
To get expected error |J(S, T) − estimate| ≤ ε, set k = Θ(1/ε²)
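The estimator can be written directly on top of a MinHash sketch (a self-contained sketch, again using SHA-256 as the shared hash; the function names are mine):

```python
import hashlib


def h(x):
    """Shared, repeatable hash function."""
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)


def minhash(s, k):
    """The k elements of s with the smallest hash values."""
    return set(sorted(s, key=h)[:k])


def jaccard_estimate(S, T, k):
    """Unbiased estimator of J(S, T): the fraction of MinHash(S ∪ T)
    that survives into both MinHash(S) and MinHash(T)."""
    u = minhash(S | T, k)
    return len(u & minhash(S, k) & minhash(T, k)) / k
```

Identical sets (with at least k elements) give exactly 1.0 and disjoint sets exactly 0.0; in between, choosing k = Θ(1/ε²) gives expected error at most ε.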
Summary of MinHash
Main idea: sketch any set by choosing the k elements that have the smallest hash values
Can reduce randomness in reservoir sampling
Sketches of unions can be computed from unions of sketches
Provides accurate estimates of Jaccard similarity
All-pairs similarity of n sets with total length N: time O(n² + N)
Count-min sketch
Main idea
We want to estimate frequencies of individual items (like the heavy hitters data structure) in the turnstile model, allowing removals
Use hashing to map each item to multiple cells in a small table (like a Bloom filter, but smaller)
Each cell counts how many elements map to it
I Include another copy of x : increment all cells for x
I Remove a copy of x : decrement all cells for x
Estimate of frequency of x : minimum of counts in cells for x
Cormode & Muthukrishnan, “An Improved Data Stream Summary: The Count-Min Sketch and its Applications”, 2005
Details
Set up according to two parameters ε and δ
I ε: how accurate the estimates should be
I δ: how likely estimates are to be accurate
2D table C[i, j], initially all zero
I Horizontal dimension (i): ⌈e/ε⌉ cells
I Vertical dimension (j): ⌈ln 1/δ⌉ cells
Hash functions h_j (for j = 0, ..., ⌈ln 1/δ⌉ − 1)
I Include x: increment C[h_j(x), j] for all j
I Remove x: decrement C[h_j(x), j] for all j
Estimate of count(x): min_j C[h_j(x), j]
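These details translate almost line for line into code (a minimal sketch; the class and method names are mine, and SHA-256-derived hash functions stand in for the independent hash functions h_j):

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)         # ceil(e/eps) cells per row
        self.d = math.ceil(math.log(1 / delta))  # ceil(ln 1/delta) rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, x):
        """Row-j hash of x, derived from SHA-256 for repeatability."""
        digest = hashlib.sha256(f"{j}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, x, delta=1):
        """Include a copy of x (delta=1) or remove one (delta=-1, turnstile)."""
        for j in range(self.d):
            self.table[j][self._h(j, x)] += delta

    def estimate(self, x):
        """Never less than count(x); with prob >= 1-delta, at most count(x)+eps*N."""
        return min(self.table[j][self._h(j, x)] for j in range(self.d))
```

The minimum over rows is what keeps the overestimate small: a collision must occur in every row to inflate the answer.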
Accuracy
Clearly, estimate ≥ count; let N = total number of elements
Claim: With probability ≥ 1− δ, estimate ≤ count + εN
Because row estimates are independent, this is equivalent to:
With probability ≥ 1 − 1/e, for each j, C[h_j(x), j] ≤ count + εN
(then multiplying row failure probabilities, (1/e)^(number of rows) ≤ δ)
Overestimate comes from collisions with other elements
Expected number of collisions = εN/e. Instead of Chernoff, use Markov’s inequality: Pr[X > c E[X]] ≤ 1/c
Turnstile medians
Data structure for approximate medians in the turnstile model, of a collection S of integers 0, 1, 2, ..., n (allowing repetitions):
Form sets S_i (i = 0 ... log_2 n) by rounding each element down to a multiple of 2^i, using the formula round(x) = x & ~((1 << i) - 1)
Construct log n count-min sketches, one for each S_i
(with accuracy and failure probability smaller by a logarithmic factor so that when we query all of them and sum, the overall failure probability and accuracy are still what we want)
Finding the approximate median
To estimate the number of elements in the interval [ℓ, r]:
Compute the numbers
I ℓ_i = round(ℓ − 1) + 1
I r_i = round(r)
The query interval can be decomposed into subintervals of length 2^i, consisting of everything that rounds to ℓ_i or r_i, for a subset of these numbers
Estimate occurrences of ℓ_i and r_i in S_i and sum the results
To find the median: binary search for an interval [0, m] containing approximately half of the elements
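One standard way to realize the decomposition into length-2^i subintervals (a standalone sketch with exact arithmetic in place of the count-min estimates; the function name is mine):

```python
def dyadic_decompose(l, r):
    """Split the integer interval [l, r] into aligned subintervals of
    length 2^i, each consisting of the values that round down (i.e.
    x & ~((1 << i) - 1)) to its left endpoint."""
    pieces = []
    while l <= r:
        i = 0
        # grow the piece while l stays aligned and the piece fits in [l, r]
        while l % (1 << (i + 1)) == 0 and l + (1 << (i + 1)) - 1 <= r:
            i += 1
        pieces.append((i, l))  # level i, left endpoint l
        l += 1 << i
    return pieces
```

For example, dyadic_decompose(3, 12) yields [(0, 3), (2, 4), (2, 8), (0, 12)], i.e. [3,3] ∪ [4,7] ∪ [8,11] ∪ [12,12]; querying each piece in the level-i sketch and summing estimates the number of elements in [ℓ, r].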
Set reconciliation
Find the missing element
Toy problem:
I give you a sequence of 999 different numbers, from 1 to 1000
Which number is missing?
Streaming solution: return 500500 − Σ(sequence)
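In code, using the fact that 1 + 2 + ... + 1000 = 1000 · 1001 / 2 = 500500 (the only state kept is the running sum):

```python
def find_missing(stream):
    """One pass, O(1) space: the missing number from 1..1000."""
    return 500500 - sum(stream)
```

For example, find_missing(x for x in range(1, 1001) if x != 437) returns 437.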
Straggler sketching
Given a parameter k, maintain a turnstile collection of objects (allowing repetitions), such that
I Total space is O(k)
I The set can have arbitrarily many elements
I Whenever it has ≤ k distinct elements, we can list them all (with their numbers of repetitions)
E.g. could solve the toy problem by:
I Initialize structure with k = 1
I Insert all numbers from 1 to 1000
I Remove all members of sequence
I Ask which element is left
Invertible Bloom filter
Structure:
I Table with O(k) cells
I O(1) hash functions h_i(x) mapping elements to cells (like a Bloom filter)
I “Fingerprint” hash function f(x) (like a cuckoo filter)
Each cell stores three values:
I Number of elements mapped to it (like count-min sketch)
I Sum of elements mapped to it
I Sum of fingerprints of elements mapped to it
Eppstein & Goodrich, “Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters”, 2011
How to list the elements
“Pure cell”: stores only a single distinct element (but maybe more than one copy of it)
If a pure cell stores element x, we can compute x as (sum of elements) / (number of elements)
Test whether a cell is pure: compute x and check that the stored sum of fingerprints equals the number of copies times fingerprint(x)
Decoding algorithm:
I While there is a pure cell, decode it, output it, and remove its elements from the collection
I If we reach an empty table (all cells zero), return
I Otherwise, decoding fails (table is too full)
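A toy invertible Bloom filter following this decoding algorithm (the cell count, hash multipliers, and fingerprint function below are illustrative choices of my own, and this minimal version assumes each element appears at most once, so a pure cell always has count exactly 1):

```python
# Toy parameters: 16 cells, 3 multiplicative hash functions, a simple fingerprint.
M = 16
MULTIPLIERS = [3, 5, 7]


def fingerprint(x):
    return (x * 101 + 17) % 997


class InvertibleBloomFilter:
    def __init__(self, m=M):
        self.m = m
        # each cell: [count, sum of elements, sum of fingerprints]
        self.cells = [[0, 0, 0] for _ in range(m)]

    def _indices(self, x):
        return [(x * a) % self.m for a in MULTIPLIERS]

    def insert(self, x):
        for i in self._indices(x):
            cell = self.cells[i]
            cell[0] += 1; cell[1] += x; cell[2] += fingerprint(x)

    def remove(self, x):
        for i in self._indices(x):
            cell = self.cells[i]
            cell[0] -= 1; cell[1] -= x; cell[2] -= fingerprint(x)

    def decode(self):
        """Repeatedly peel pure cells; succeeds when few elements remain."""
        out = []
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                # pure cell: exactly one element, and the fingerprint confirms it
                if cell[0] == 1 and cell[2] == fingerprint(cell[1]):
                    x = cell[1]
                    out.append(x)
                    self.remove(x)   # peeling may make other cells pure
                    progress = True
        if any(cell != [0, 0, 0] for cell in self.cells):
            raise ValueError("decoding failed: too many elements remain")
        return out
```

For example, inserting 1..20 and then removing everything except 7 and 13 leaves a structure whose decode() recovers exactly {7, 13}.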
Application: Set reconciliation
Problem: Two internet servers both have slightly different copies of the same collection (e.g. versions of a git repository)
We want to find how they differ, using communication proportional to the size of the difference (rather than the size of the whole set)
If we already know the size of the difference, k:
I Server A builds an invertible Bloom filter for its collection, with parameter k, and sends it to server B
I Server B removes everything in its collection from the filter and then decodes the result
I (Some elements might have negative numbers of copies! But the data structure still works)
Eppstein, Goodrich, Uyeda, and Varghese, “What’s the difference? Efficient set reconciliation without prior context”, 2011
How to estimate the size of the difference?
Use MinHash to find samples of server B’s set with numbers of elements smaller by factors of 1/2, 1/4, 1/8, ...
Sample i: elements whose hash has first i bits = zero
Server B chooses parameter k = O(1) and sends invertible Bloom filters of all the samples in a single message to server A
Server A uses the same hash function to sample its set, and removes sampled subsets from B’s filters
Estimate of the size of the difference = 1 / (sampling rate of the largest sample whose difference can be decoded)
Summary
Summary
I Sketch: Structure with useful information in sublinear space
I Stream: Loop through data once, using a sketch
I Cash register (addition only) and turnstile (addition and removal) models of sketching and streaming
I Mean and median; impossibility of exact median
I Maintaining a random sample of a stream and using it for approximate medians
I Majority voting, heavy hitters, and frequency estimation
I MinHash-based sampling and estimation of Jaccard distance
I Count-min sketch for turnstile-model frequency estimation
I Using count-min sketch for turnstile median estimation
I Stragglers, invertible Bloom filters, and set reconciliation