
Buffered Count-Min Sketch on SSD: Theory and Experiments

Mayank Goswami, Dzejla Medjedovic, Emina Mekic, and Prashant Pandey

ESA 2018, Helsinki, Finland

The heavy hitters problem (HH(k))

● Given stream of N items, report items whose frequency ≥ φN.

● General solution is “hard” in small space.

● Approximate solutions are employed.

Picture taken from: http://dmac.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf.

The approximate heavy hitters problem (ε-HH(k))

● Find all items with count ≥ φN, none with count < (φ−ε)N

● Error 0 < ε < 1, e.g., ε = 1/1000

● Related problem: estimate each frequency with error ±εN

Picture taken from: http://dmac.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf.

Sketch data structures

● A sketch is a compact representation of a data stream.

● It is typically lossy.

● It is useful for approximately answering analytical questions about a data stream, e.g.,

○ Heavy hitters

○ Quantile queries

○ Inner-product queries

Sketches are at the heart of stream analyses

● Financial market

● Sensor networks

● IP traffic

● Text analysis

● Cyber security

In this talk:

● The buffered count-min sketch (BCMS), an SSD-based sketch data structure.
○ The BCMS scales efficiently to large datasets while keeping the total estimation error bounded.

● Theoretical analysis of the BCMS for:
○ Update and estimate times on SSD
○ Bounded error

● Experimental evaluation of the BCMS.

Count-min sketch (CMS)[1]

A CMS consists of a 2-D counter array of depth d and width w, with d hash functions (one per row).

[1]. Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms.

Count-min sketch: update

UPDATE(ai, ci): for each row j = 1, …, d, add ci to the counter C[j][hj(ai)].

Count-min sketch: estimate

ESTIMATE(ai) = min over 1 ≤ j ≤ d of C[j][hj(ai)].
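The two operations above fit in a few lines of Python. This is an illustrative in-RAM sketch: seeded tuple hashing stands in for the pairwise-independent hash family the analysis assumes, and the class name and API are ours, not the paper's.

```python
import random

class CountMinSketch:
    """Plain count-min sketch: a d x w array of counters with one
    hash function per row (seeded tuple hashing as a stand-in for a
    pairwise-independent family)."""

    def __init__(self, width, depth, seed=0):
        self.w, self.d = width, depth
        self.counters = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]

    def _hash(self, row, item):
        return hash((self.seeds[row], item)) % self.w

    def update(self, item, count=1):
        # UPDATE(a_i, c_i): add c_i to one counter in every row.
        for j in range(self.d):
            self.counters[j][self._hash(j, item)] += count

    def estimate(self, item):
        # ESTIMATE(a_i) = min over rows j of C[j][h_j(a_i)]; counters
        # only overestimate, so the row minimum is the tightest bound.
        return min(self.counters[j][self._hash(j, item)]
                   for j in range(self.d))
```

Because every counter an item touches can only be inflated by collisions, the estimate never undercounts; the analysis slides that follow bound how much it can overcount.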

Count-min sketch: analysis

We want the estimation error to stay within εN with probability at least 1 − δ, i.e., Pr[Error(q) > εN] ≤ δ.

Setting d = ln(1/δ) and w = ⌈e/ε⌉ achieves this, for a CMS of ln(1/δ) · ⌈e/ε⌉ counters.
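The sizing rule is direct to compute. A minimal sketch using the formulas from the slide (the function name is ours):

```python
import math

def cms_parameters(eps, delta):
    """Width and depth that guarantee Pr[Error > eps*N] <= delta:
    w = ceil(e / eps), d = ceil(ln(1 / delta))."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1.0 / delta))
    return w, d

# delta = 0.01 (99% certainty) gives depth ceil(ln 100) = 5,
# the depth used throughout the evaluation section of this talk.
w, d = cms_parameters(eps=2**-21, delta=0.01)
```

Note that the width depends only on ε and the depth only on δ, which is why keeping the absolute overestimate εN fixed while N grows forces the width (and hence the sketch) to grow.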

CMS size vs dataset size when ε is constant

CMS size remains constant when ε is constant.

CMS size vs dataset size when εN is constant

CMS size grows linearly with dataset size.

The count-min sketch size grows with data set size

Consider the example: N = 2^30, with overestimate 512 (ε = 2^−21) and 99.9% certainty (δ = 0.001). Then the CMS size is ~3.36 GB (assuming 4-byte counters).

Now, suppose we want to keep the overestimate the same but double the data set size. Then the CMS size is ~6.72 GB.

For example, in a word-similarity application[1], for 90 GB of web data the count-min sketch size is 8 GB.

[1]. Amit Goyal, Jagadeesh Jagarlamudi, Hal Daumé, III, and Suresh Venkatasubramanian. Sketch techniques for scaling distributional similarity to the web. GEMS, 2010.

Update performance degrades when the CMS grows out of cache, and worsens further when it grows out of RAM.

Buffered count-min sketch (BCMS)

● Theoretical:

○ The buffered CMS is asymptotically faster for the estimate operation than the plain CMS on SSD.

○ The buffered CMS requires less than 1 I/O per update operation for most practical configurations on SSD.

○ The buffered CMS offers error guarantees similar to the plain CMS:

Pr[Error(q) > εN (1 + o(1))] ≤ δ + o(1).

I/O in the disk access machine (DAM) model

● How computation works:

○ Data is transferred in blocks between RAM and disk.

○ The number of block transfers dominates the running time.

● Goal: minimize the number of block transfers.

○ Performance bounds are parameterized by block size B, memory size M, and data size N.

[Figure: RAM of size M exchanging blocks of size B with disk.]

Plain count-min sketch on SSD

The d hash functions h1, …, hd send an item to d different blocks (B0, B1, B2, …) on SSD, so each operation incurs O(d) random I/Os.

Buffered count-min sketch: hash localization

Each row is divided into w/B blocks of size B. One hash h0 first maps the item to a single column of blocks (B0, B1, B2, …); the per-row hashes h1, …, hd then choose offsets inside that column. All d counters of an item therefore live in one block-aligned region, giving 1 I/O per estimate operation.
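The localization step can be sketched as follows. This is an illustrative simplification (one row of blocks rather than a column, and our own hash construction), not the paper's exact layout:

```python
import random

def localized_positions(item, w, d, block_size, seed=0):
    """Map an item to d counter positions that all fall inside one
    block of width `block_size`, so a single block read serves all
    d rows (hash localization). h0 picks one of the w/block_size
    blocks; the d remaining hashes pick offsets inside it.
    """
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(d + 1)]
    n_blocks = w // block_size
    block = hash((seeds[0], item)) % n_blocks    # h0: which block
    base = block * block_size
    offsets = [base + hash((seeds[j + 1], item)) % block_size
               for j in range(d)]                # within-block offsets
    return block, offsets
```

Since every returned position lies in [base, base + block_size), an estimate touches exactly one page of the on-SSD sketch instead of d random pages.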

Buffered count-min sketch: buffering

Updates are not applied to SSD immediately. RAM holds one sub-buffer per on-SSD block; an update (+ci at positions h1(ai), …, hd(ai)) is appended to the sub-buffer selected by h0(ai). When a sub-buffer is FULL, its updates are flushed to the corresponding block on SSD in a single write.

O(w/(MB)) I/Os per update operation, amortized!
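The buffering bookkeeping can be sketched as below. This is an illustrative in-memory model of the amortization only: the class name is ours, the counters themselves are elided, and `flush` stands in for the single block write the analysis charges.

```python
from collections import defaultdict

class BufferedUpdates:
    """One in-RAM sub-buffer per on-SSD block; a sub-buffer of
    capacity c is flushed with one I/O, so updates to a block cost
    1/c I/Os amortized."""

    def __init__(self, buffer_capacity):
        self.buffers = defaultdict(list)   # block id -> pending updates
        self.capacity = buffer_capacity
        self.io_count = 0                  # block writes performed

    def update(self, block, item, count=1):
        self.buffers[block].append((item, count))
        if len(self.buffers[block]) >= self.capacity:
            self.flush(block)

    def flush(self, block):
        # One I/O applies a whole sub-buffer of updates to its block.
        pending = self.buffers.pop(block, [])
        self.io_count += 1
        # ... apply `pending` to the on-SSD counters here ...
        return pending
```

With RAM of size M split over w/B sub-buffers, each holds about MB/w updates before flushing, which is where the O(w/(MB)) amortized bound on the slide comes from.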

Plain count-min sketch: error analysis

● In a stream of size N and a plain CMS of width w and depth d, for a query item i:

○ E[# items colliding with i in a row] = N/w — this is the expected error.

● Using Markov's inequality:

○ The probability of seeing more than e times the expected error in a row is at most 1/e, i.e., Pr[Error > e·N/w] ≤ 1/e.

● Each row has an independent hash function:

○ The probability that the error exceeds e times its expectation in every row is (1/e)^d.

For w = ⌈e/ε⌉ and d = ln(1/δ), this failure probability is δ, i.e., Pr[Error > εN] ≤ δ.
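The last step is a one-line computation worth checking numerically (the function name is ours):

```python
import math

def failure_bound(d):
    """Probability that all d independent rows overshoot e*N/w:
    (1/e)^d = exp(-d). For d = ln(1/delta) this is exactly delta."""
    return math.exp(-d)

# d = 5 gives e^-5 ~ 0.0067, which is why depth 5 suffices for
# delta = 0.01 in the evaluation section of this talk.
```

This also shows the depth need only grow logarithmically in 1/δ, while all the size pressure from large N falls on the width.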

Assumption: ε is subconstant in N (limN→∞ ε(N) = 0), so that the CMS size grows much more slowly than the dataset size.

Buffered count-min sketch: error analysis

● Each item i from the stream hashes into a buffer.

○ To bound the number of items in a buffer with high probability, we use a balls-and-bins analysis in which # balls >> # bins.

● Errors for a query q in different rows are no longer independent.

○ A high error in one row implies that more elements were hashed by h0 to the same bucket.

Buffered count-min sketch: evaluation

● Empirical:

○ The buffered CMS is 3.7X--4.7X faster on update operations compared to the plain CMS on SSD.

○ The buffered CMS is 4.3X faster on estimate operations compared to the plain CMS on SSD.

Evaluation parameters

● Update throughput

● Estimate throughput

● Effect of hash localization on estimation error

● Effect of changing RAM-to-sketch-size ratio

Evaluation setup

Size    Ratio   Width       Depth   #elements
128MB   2       3355444     5       9875188
256MB   4       6710887     5       19750377
512MB   8       13421773    5       39500754
1GB     16      26843546    5       79001508

In all our experiments, δ = 0.01 and overestimate (εN) = 8. RAM size: 64 MB.

Update performance

The BCMS is 3.7X--4.7X faster on update operations compared to the plain CMS.

Estimate performance

The BCMS is 4.3X faster on estimate operations compared to the plain CMS.

Accuracy

Overestimates in the BCMS are similar to the plain CMS.

Conclusion

● Techniques like hash localization and buffering can be applied to scale the plain CMS to SSDs.

● We showed, both theoretically and empirically, that if we keep the ε-error subconstant in N, then hash localization has a negligible effect (or none) on the overestimate.

● We leave deriving update lower bounds and/or an SSD-based data structure with faster update time for future work.