CS 261: Graduate Data Structures
Week 4: Streaming and sketching
David Eppstein
University of California, Irvine
Spring Quarter, 2020
Streaming and sketching
The main idea
Sometimes data is too big to fit into memory...
Sketching
Represent data by a structure of sublinear size
Preferably O(1); at most n^(1−ε) for some ε > 0
Keep enough information to solve what you want, either exactly or approximately
Streaming
Read through your data once, in some given order
Use a sketch to represent information about it
Produce a result from the sketch
Example: Sketch for the average
Sketch for the average of a collection of numbers: pair (n, t)
n is the number of elements in the collection; t is the sum of the elements
Operations:
I Incorporate new element x into collection: Add one to n and add x to t
I Remove element from collection: Subtract one from n and subtract x from t
I Compute average of collection: Return t/n
Total space is O(1) regardless of how big n is
(Time/operation is also O(1) but our main concern is space)
Example: Streaming average
To compute the average of a sequence S:
I Initialize an empty sketch for the average
I For each x in the sequence, incorporate x into the sketch
I Return the average of the sketch
Time for an n-element sequence is O(n)
Working space (not counting the space for the input) is O(1)
This works even when the sequence is generated somehow in a way that does not involve local storage (e.g. internet traffic)
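The (n, t) sketch and the streaming loop above can be written out in Python (a minimal sketch; the class and function names are my own):

```python
class AverageSketch:
    """O(1)-space sketch for the average of a collection: the pair (n, t)."""

    def __init__(self):
        self.n = 0  # number of elements in the collection
        self.t = 0  # sum of the elements

    def insert(self, x):
        self.n += 1
        self.t += x

    def remove(self, x):  # removals also work (turnstile model)
        self.n -= 1
        self.t -= x

    def average(self):
        return self.t / self.n


def streaming_average(stream):
    """One pass over the data, O(1) working space."""
    sketch = AverageSketch()
    for x in stream:
        sketch.insert(x)
    return sketch.average()
```

For example, streaming_average([1, 2, 3, 4]) returns 2.5, and the stream can be any iterable, including one generated on the fly.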
Cash register versus turnstile
Two different variations of streaming algorithms:
Cash register: Money goes in but doesn’t come back out
Meaning for streaming:
Input sequence has only insertions, no removals
Turnstile: Counts people passing in either direction
Meaning for streaming:
Input sequence can have both insertions and removals
Streaming average was presented in the cash register model, but the same sketch also works in the turnstile model
Images: File:Credit card terminal in Laos.jpg and File:Warszawa - Metro - Świętokrzyska (17009062241).jpg from Wikimedia Commons
Beyond streaming
Sketches can sometimes be combined to produce information about multiple data sets
Example: Can produce an average-sketch for the concatenation of two sequences by adding the sketches of the two sequences
If we have n sequences, of total length N, we can compute all pairwise averages in time O(n² + N), faster than computing each pairwise average separately
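Combining average-sketches is just componentwise addition of (n, t) pairs (a small illustration; the helper names are mine):

```python
def make_sketch(seq):
    """Average-sketch of a sequence: (number of elements, sum of elements)."""
    seq = list(seq)
    return (len(seq), sum(seq))


def combine(s1, s2):
    """Sketch of the concatenation = componentwise sum of the two sketches."""
    return (s1[0] + s2[0], s1[1] + s2[1])


def average(sketch):
    n, t = sketch
    return t / n
```

Combining make_sketch([1, 2, 3]) and make_sketch([4, 5]) gives (5, 15), whose average equals the average of the concatenated sequence.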
Medians and reservoir sampling
Medians
Median: Sort given list of n items; which one goes to position n/2?
Like average, an important measure of the central tendency of a sample, more robust against arbitrarily-corrupted outliers
Applications in facility location: Given n points on a line, place a center minimizing the average distance to the given points
Impossibility of streaming median
No streaming algorithm can compute the median!
Consider the sketch after seeing the first n/2 items in a stream
Depending on the remaining items, any one of these first n/2 could become the median of the whole stream
So we have to remember all their values
Space is Ω(n), not O(n^(1−ε)) for any ε > 0
Approximate medians
We want the item at position (1/2)n in the sorted sequence
Relaxation: δ-approximate median
Position should be between (1/2 − δ)n and (1/2 + δ)n
Easy randomized method: Choose a sample of Θ(δ^(−2)) sequence members and return the median of the sample
Chernoff ⇒ constant probability of being in desired position
Bigger constant in sample size ⇒ better probability
(There also exist non-random solutions, but they are more complicated and have non-constant space complexity)
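The randomized method can be sketched in a few lines (a minimal illustration; the constant factor in the Θ(δ⁻²) sample size is arbitrary here, and sampling is done with replacement for simplicity):

```python
import random


def approximate_median(seq, delta, rng=random):
    """Return an element whose sorted position is likely within
    (1/2 ± delta)·n, by taking the median of a Theta(delta^-2) sample."""
    k = max(1, round(1 / delta ** 2))  # sample size; constant factor assumed
    sample = sorted(rng.choice(seq) for _ in range(k))
    return sample[len(sample) // 2]
```

A larger constant in the sample size improves the success probability, per the Chernoff bound argument above.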
Reservoir sampling
How to maintain a sample of k random elements (without replacement) from a longer sequence?
Store an array A of size k, and the number n
For each element x:
If n < k: Store x in A[n]
Else: Choose a random integer i from 0 to n (inclusive); if i < k, store x in A[i]
Increment n
(More complex methods reduce the number of random number generation steps by calculating, after each replacement, when the next replacement should be made.)
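The loop above in Python; note the random index ranges over 0..n inclusive, so the (n+1)st element is kept with probability k/(n+1):

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform sample of k elements from a stream:
    one pass, O(k) working space."""
    sample = []  # the array A, grown lazily up to size k
    n = 0        # number of elements seen so far
    for x in stream:
        if n < k:
            sample.append(x)          # store x in A[n]
        else:
            i = rng.randrange(n + 1)  # random integer from 0 to n
            if i < k:
                sample[i] = x         # replace a random slot
        n += 1
    return sample
```

If the stream has fewer than k elements, the whole stream is returned.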
Majority and heavy hitters
Impossibility of most-frequent item
We’ve seen the mean and median; what about the mode (most frequent item) in a sequence?
Impossible in general!
Consider the state of a sketch after n − 2 items
Suppose we have already seen one item appear three times, and (n − 3)/2 pairs of other items
If the next two items are equal, with value x, then the result should be
I The triple, if x was not already seen
I The triple, if x has the same value as the triple
I x, if x has the same value as one of the pairs
So the sketch must remember all the pairs, which is too much memory
Impossibility of majority
Majority element: one that occurs as more than half of the sequence elements
Naive formulation of the problem:
Either find a majority element if one exists
Or return “None” if there is no majority
Still impossible!
If first n/2 elements are distinct, any one could become majority
We don’t have enough memory to store all of their values
Boyer–Moore majority vote algorithm
Maintain two variables, m and c, initially None and 0
For each x in the given sequence:
I If m = x : increment c
I Else if c = 0: set m = x and c = 1
I Else decrement c
Claims:
I If sequence has a majority, it will be the final value of m
I If no majority, m may be any element of the sequence (might not be frequently occurring)
I Algorithm cannot tell us whether m is majority element or not
Boyer & Moore, “MJRTY - A Fast Majority Vote Algorithm”, 1991
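The three-case loop above fits in a few lines of Python (a minimal sketch; the function name is mine):

```python
def majority_vote(seq):
    """Boyer-Moore majority vote: if seq has a majority element,
    it is the returned value; otherwise the result may be any element."""
    m, c = None, 0
    for x in seq:
        if m == x:
            c += 1          # x matches the stored candidate
        elif c == 0:
            m, c = x, 1     # counter empty: adopt x as the candidate
        else:
            c -= 1          # x cancels one vote for the candidate
    return m
```

As the claims state, the algorithm cannot itself tell whether the returned m is actually a majority element.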
Streaming majority example
With the sequence elements in left-to-right order:
x:              B  A  C  A  A  A  C  B  A
m:        None  B  B  C  C  A  A  A  A  A
c:        0     1  0  1  0  1  2  1  0  1
majority? N     Y  N  N  N  Y  Y  Y  N  Y
(We know when there is a majority but the algorithm doesn’t)
After the first step, B is in the majority, and the algorithm correctly reports B as its value of m.
After the third A (out of five sequence elements), A is in the majority, and stays in the majority for most of the remaining steps; the algorithm correctly reports A as its value of m.
When there is no majority (each element has at most half of the sequence) the algorithm may choose any element as its value of m.
Heavy hitters
Generalization of the same algorithm:
Store two arrays M and C, both of length k
Initialize cells of M to empty and C to zero
For each sequence item x :
I If x equals M[i] for some i: increment C[i]
I Else if some C[i] is zero: set M[i] = x and C[i] = 1
I Else decrement C[i] for all choices of i
Claim: In a sequence of n elements, if x occurs more than n/(k + 1) times, then x will be one of the elements stored in M
(Special case k = 1 is the correctness of the majority algorithm)
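This generalization is known as the Misra–Gries algorithm; a minimal version, using a dictionary in place of the arrays M and C (the function name is mine):

```python
def heavy_hitters(stream, k):
    """Misra-Gries sketch: any x occurring more than n/(k+1) times in a
    stream of n elements ends up among the (at most k) stored counters."""
    counters = {}  # keys play the role of M, values the role of C
    for x in stream:
        if x in counters:
            counters[x] += 1          # x is stored: increment its counter
        elif len(counters) < k:
            counters[x] = 1           # a free slot: store x with count 1
        else:
            for y in list(counters):  # no free slot: decrement every counter
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]   # freed slots become available again
    return counters
```

For "AAAABBCCD" with k = 2, the element A (4 occurrences out of 9 > 9/3) is guaranteed to survive in the returned dictionary.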
Why does it work?
This algorithm approximately counts occurrences for all sequence elements (not just the ones it stores)!
Define count(x) = actual number of occurrences of x, estimate(x) = C[i] (if M[i] = x for some i) or 0 otherwise.
Then count(x) − n/(k + 1) ≤ estimate(x) ≤ count(x).
Proof idea: Induction on number of steps of the algorithm
If a step ends with x in M, we increase its count and estimate
Otherwise, we decrease all k + 1 nonzero estimates
Number of increases ≤ n ⇒ number of decreases ≤ n/(k + 1)
Because heavy hitters have big estimates, they must be stored
MinHash
Hashing for reservoir sampling
We saw how to maintain a sample of k elements, using a random choice for each element
Instead, make a single random choice, of a hash function h(x)
Sample = the k elements with the smallest values of h
Two advantages:
I Less use of (possibly slow) random number generation
I Repeatable: if different agents use the same hash function to sample the same sets, they will get the same samples
We can take advantage of repeatability to use this for more applications!
Broder, “On the resemblance and containment of documents”, 1997
Combining sets with MinHash
To compute the union:
MinHash(S ∪ T) = MinHash(MinHash(S) ∪ MinHash(T))
Because any element that is one of the k smallest hashes in the union must also be one of the k smallest in either of the smaller sets that contain it
We can get useful information about sets S and T from this computation!
Key question: how many elements from MinHash(S) ∪ MinHash(T) survive into MinHash(S ∪ T)?
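A minimal MinHash in Python, using SHA-256 as the shared repeatable hash function (the function names are my own); the union identity above can then be checked directly:

```python
import hashlib


def h(x):
    """A fixed, repeatable hash function shared by all agents."""
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)


def minhash(s, k):
    """The k elements of s with the smallest hash values."""
    return set(sorted(s, key=h)[:k])
```

For any sets S and T with at least k elements each, minhash(S | T, k) == minhash(minhash(S, k) | minhash(T, k), k), since each of the k smallest hashes in the union is among the k smallest of whichever set contains it.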
Original motivation for MinHash
Before the success of Google, a different company (DEC) ran a different popular search engine (AltaVista)
They had a problem: Many web pages can be found in multiple similar (but not always identical) copies
Search engines should return only one copy of the same page, so the user can see other results without getting swamped by duplicates
⇒ Need to quickly test similarity of web pages
From documents to sets
“Bag of words” model: represent any web page by the set of words used in it
Similar documents (web pages) should have similar words
So after converting web pages to bags of words, the search engine problem becomes: quickly test similarity of pairs of sets
Jaccard similarity
A standard measure of how similar two sets are:
J(S, T) = |S ∩ T| / |S ∪ T|
Scale-invariant (normalized for how big the two sets are)
Ranges from 0 to 1
I 0 when sets are disjoint
I close to 0 when sets are not very similar
I close to 1 when sets are similar
I 1 when sets are identical
Estimating Jaccard similarity
Theorem:
J(S, T) = (1/k) E[|MinHash(S ∪ T) ∩ MinHash(S) ∩ MinHash(T)|]
The intersection consists of the elements of MinHash(S ∪ T) that also belong to MinHash(S ∩ T)
Linearity of expectation ⇒
E[...] = Σ_{x ∈ S∪T} Pr[x ∈ intersection of MinHashes]
       = Σ_{x ∈ S∪T} Pr[x ∈ S ∩ T] × Pr[x ∈ MinHash(S ∪ T)]
       = Σ_{x ∈ S∪T} (|S ∩ T| / |S ∪ T|) × (k / |S ∪ T|)
       = |S ∪ T| × J(S, T) × k / |S ∪ T| = k J(S, T)
How accurate is it?
Suppose we use
(1/k) |MinHash(S ∪ T) ∩ MinHash(S) ∩ MinHash(T)|
as an estimate for J(S, T)
Previous analysis shows that this has the correct expected value: it is an unbiased estimator
Our analysis decomposed the estimate into a sum of independent 0-1 random variables (is each member of S ∪ T in the intersection)
⇒ Can use Chernoff bounds for being within 1 ± ε of expectation
To get expected error |J(S, T) − estimate| ≤ ε, set k = Θ(1/ε²)
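The estimator can be written directly on top of a MinHash sketch (a self-contained sketch, again using SHA-256 as the shared hash; the function names are mine):

```python
import hashlib


def h(x):
    """Shared, repeatable hash function."""
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)


def minhash(s, k):
    """The k elements of s with the smallest hash values."""
    return set(sorted(s, key=h)[:k])


def jaccard_estimate(S, T, k):
    """Unbiased estimator of J(S, T): the fraction of MinHash(S ∪ T)
    that survives into both MinHash(S) and MinHash(T)."""
    u = minhash(S | T, k)
    return len(u & minhash(S, k) & minhash(T, k)) / k
```

Identical sets (with at least k elements) give exactly 1.0 and disjoint sets exactly 0.0; in between, choosing k = Θ(1/ε²) gives expected error at most ε.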
Summary of MinHash
Main idea: sketch any set by choosing the k elements that have the smallest hash values
Can reduce randomness in reservoir sampling
Sketches of unions can be computed from unions of sketches
Provides accurate estimates of Jaccard similarity
All-pairs similarity of n sets with total length N: time O(n² + N)
Count-min sketch
Main idea
We want to estimate frequencies of individual items (like the heavy hitters data structure) in the turnstile model, allowing removals
Use hashing to map each item to multiple cells in a small table (like a Bloom filter, but smaller)
Each cell counts how many elements map to it
I Include another copy of x : increment all cells for x
I Remove a copy of x : decrement all cells for x
Estimate of frequency of x : minimum of counts in cells for x
Cormode & Muthukrishnan, “An Improved Data Stream Summary: The Count-Min Sketch and its Applications”, 2005
Details
Set up according to two parameters ε and δ
I ε: how accurate the estimates should be
I δ: how likely estimates are to be accurate
2D table C[i, j], initially all zero
I Horizontal dimension (i): ⌈e/ε⌉ cells
I Vertical dimension (j): ⌈ln 1/δ⌉ cells
Hash functions h_j (for j = 0, ..., ⌈ln 1/δ⌉ − 1)
I Include x: increment C[h_j(x), j] for all j
I Remove x: decrement C[h_j(x), j] for all j
Estimate of count(x): min_j C[h_j(x), j]
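These details translate almost line for line into code (a minimal sketch; the class and method names are mine, and SHA-256-derived hash functions stand in for the independent hash functions h_j):

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)         # ceil(e/eps) cells per row
        self.d = math.ceil(math.log(1 / delta))  # ceil(ln 1/delta) rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, x):
        """Row-j hash of x, derived from SHA-256 for repeatability."""
        digest = hashlib.sha256(f"{j}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, x, delta=1):
        """Include a copy of x (delta=1) or remove one (delta=-1, turnstile)."""
        for j in range(self.d):
            self.table[j][self._h(j, x)] += delta

    def estimate(self, x):
        """Never less than count(x); with prob >= 1-delta, at most count(x)+eps*N."""
        return min(self.table[j][self._h(j, x)] for j in range(self.d))
```

The minimum over rows is what keeps the overestimate small: a collision must occur in every row to inflate the answer.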
Accuracy
Clearly, estimate ≥ count; let N = total number of elements
Claim: With probability ≥ 1− δ, estimate ≤ count + εN
Because row estimates are independent, this is equivalent to:
With probability ≥ 1 − 1/e, for each j, C[h_j(x), j] ≤ count + εN
(then multiplying row failure probabilities, (1/e)^(number of rows) ≤ δ)
Overestimate comes from collisions with other elements
Expected number of collisions = εN/e. Instead of Chernoff, use Markov’s inequality: Pr[X > c E[X]] ≤ 1/c
Turnstile medians
Data structure for approximate medians in the turnstile model, of a collection S of integers 0, 1, 2, ..., n (allowing repetitions):
Form sets S_i (i = 0 ... log_2 n) by rounding each element down to a multiple of 2^i, using the formula round(x) = x & ~((1 << i) - 1)
Construct log n count-min sketches, one for each S_i
(with accuracy and failure probability smaller by a logarithmic factor so that when we query all of them and sum, the overall failure probability and accuracy are still what we want)
Finding the approximate median
To estimate the number of elements in the interval [ℓ, r]:
Compute the numbers
I ℓ_i = round(ℓ − 1) + 1
I r_i = round(r)
The query interval can be decomposed into subintervals of length 2^i, consisting of everything that rounds to ℓ_i or r_i, for a subset of these numbers
Estimate occurrences of ℓ_i and r_i in S_i and sum the results
To find the median: binary search for an interval [0, m] containing approximately half of the elements
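One standard way to realize the decomposition into length-2^i subintervals (a standalone sketch with exact arithmetic in place of the count-min estimates; the function name is mine):

```python
def dyadic_decompose(l, r):
    """Split the integer interval [l, r] into aligned subintervals of
    length 2^i, each consisting of the values that round down (i.e.
    x & ~((1 << i) - 1)) to its left endpoint."""
    pieces = []
    while l <= r:
        i = 0
        # grow the piece while l stays aligned and the piece fits in [l, r]
        while l % (1 << (i + 1)) == 0 and l + (1 << (i + 1)) - 1 <= r:
            i += 1
        pieces.append((i, l))  # level i, left endpoint l
        l += 1 << i
    return pieces
```

For example, dyadic_decompose(3, 12) yields [(0, 3), (2, 4), (2, 8), (0, 12)], i.e. [3,3] ∪ [4,7] ∪ [8,11] ∪ [12,12]; querying each piece in the level-i sketch and summing estimates the number of elements in [ℓ, r].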
Set reconciliation
Find the missing element
Toy problem:
I give you a sequence of 999 different numbers, from 1 to 1000
Which number is missing?
Streaming solution: return 500500 − Σ(sequence)
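In code, using the fact that 1 + 2 + ... + 1000 = 1000 · 1001 / 2 = 500500 (the only state kept is the running sum):

```python
def find_missing(stream):
    """One pass, O(1) space: the missing number from 1..1000."""
    return 500500 - sum(stream)
```

For example, find_missing(x for x in range(1, 1001) if x != 437) returns 437.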
Straggler sketching
Given a parameter k, maintain a turnstile collection of objects (allowing repetitions), such that
I Total space is O(k)
I The set can have arbitrarily many elements
I Whenever it has ≤ k distinct elements, we can list them all (with their numbers of repetitions)
E.g. could solve the toy problem by:
I Initialize structure with k = 1
I Insert all numbers from 1 to 1000
I Remove all members of sequence
I Ask which element is left
Invertible Bloom filter
Structure:
I Table with O(k) cells
I O(1) hash functions h_i(x) mapping elements to cells (like a Bloom filter)
I “Fingerprint” hash function f(x) (like a cuckoo filter)
Each cell stores three values:
I Number of elements mapped to it (like count-min sketch)
I Sum of elements mapped to it
I Sum of fingerprints of elements mapped to it
Eppstein & Goodrich, “Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters”, 2011
How to list the elements
“Pure cell”: stores only a single distinct element (but maybe more than one copy of it)
If a pure cell stores element x, we can compute x as (sum of elements) / (number of elements)
Test whether a cell is pure: compute x and check that the stored sum of fingerprints equals the number of copies times fingerprint(x)
Decoding algorithm:
I While there is a pure cell, decode it, output it, and remove its elements from the collection
I If we reach an empty table (all cells zero), return
I Otherwise, decoding fails (table is too full)
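A toy invertible Bloom filter following this decoding algorithm (the cell count, hash multipliers, and fingerprint function below are illustrative choices of my own, and this minimal version assumes each element appears at most once, so a pure cell always has count exactly 1):

```python
# Toy parameters: 16 cells, 3 multiplicative hash functions, a simple fingerprint.
M = 16
MULTIPLIERS = [3, 5, 7]


def fingerprint(x):
    return (x * 101 + 17) % 997


class InvertibleBloomFilter:
    def __init__(self, m=M):
        self.m = m
        # each cell: [count, sum of elements, sum of fingerprints]
        self.cells = [[0, 0, 0] for _ in range(m)]

    def _indices(self, x):
        return [(x * a) % self.m for a in MULTIPLIERS]

    def insert(self, x):
        for i in self._indices(x):
            cell = self.cells[i]
            cell[0] += 1; cell[1] += x; cell[2] += fingerprint(x)

    def remove(self, x):
        for i in self._indices(x):
            cell = self.cells[i]
            cell[0] -= 1; cell[1] -= x; cell[2] -= fingerprint(x)

    def decode(self):
        """Repeatedly peel pure cells; succeeds when few elements remain."""
        out = []
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                # pure cell: exactly one element, and the fingerprint confirms it
                if cell[0] == 1 and cell[2] == fingerprint(cell[1]):
                    x = cell[1]
                    out.append(x)
                    self.remove(x)   # peeling may make other cells pure
                    progress = True
        if any(cell != [0, 0, 0] for cell in self.cells):
            raise ValueError("decoding failed: too many elements remain")
        return out
```

For example, inserting 1..20 and then removing everything except 7 and 13 leaves a structure whose decode() recovers exactly {7, 13}.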
Application: Set reconciliation
Problem: Two internet servers both have slightly different copies of the same collection (e.g. versions of a git repository)
We want to find how they differ, using communication proportional to the size of the difference (rather than the size of the whole set)
If we already know the size of the difference, k:
I Server A builds an invertible Bloom filter for its collection, with parameter k, and sends it to server B
I Server B removes everything in its collection from the filter and then decodes the result
I (Some elements might have negative numbers of copies! But the data structure still works)
Eppstein, Goodrich, Uyeda, and Varghese, “What’s the difference? Efficient set reconciliation without prior context”, 2011
How to estimate the size of the difference?
Use MinHash to find samples of server B’s set with numbers of elements smaller by factors of 1/2, 1/4, 1/8, ...
Sample i: elements whose hash has first i bits = zero
Server B chooses parameter k = O(1) and sends invertible Bloom filters of all the samples in a single message to server A
Server A uses the same hash function to sample its set, and removes sampled subsets from B’s filters
Estimate of the size of the difference = 1 / (sampling rate of the largest sample whose difference can be decoded)
Summary
Summary
I Sketch: Structure with useful information in sublinear space
I Stream: Loop through data once, using a sketch
I Cash register (addition only) and turnstile (addition and removal) models of sketching and streaming
I Mean and median; impossibility of exact median
I Maintaining a random sample of a stream and using it for approximate medians
I Majority voting, heavy hitters, and frequency estimation
I MinHash-based sampling and estimation of Jaccard distance
I Count-min sketch for turnstile-model frequency estimation
I Using count-min sketch for turnstile median estimation
I Stragglers, invertible Bloom filters, and set reconciliation