Mr Share 11 Sep 2010

MRShare: Sharing Across Multiple Queries in MapReduce

Tomasz Nykiel (University of Toronto)

Michalis Potamias (Boston University)

Chaitanya Mishra (University of Toronto, currently Facebook)

George Kollios (Boston University)

Nick Koudas (University of Toronto)

Data management landscape

efficiency

• Time performance

• Arbitrary data

• Large scale setups

MRShare – a sharing framework for Map Reduce

• MRShare framework:

– Inspired by sharing primitives from relational domain

– Introduces a cost model for Map Reduce jobs

– Searches for the optimal sharing strategies

– Does not change the Map Reduce computational model

Outline

• Introduction

• Map Reduce recap.

• MRShare – Sharing primitives in Map-Reduce

• MRShare – Cost based approach to sharing

• MRShare Evaluation

• Summary

Outline

• Map Reduce recap.

Map Reduce recap.

Map Reduce

Outline

• MRShare - Sharing primitives in Map-Reduce

Sharing primitives – sharing scans

• SELECT COUNT(*) FROM user GROUP BY hometown

• SELECT AVG(age) FROM user GROUP BY hometown

User_id Hometown Occupation Age

MRShare – sharing scans (map).

MRShare – sharing scans (reduce)

J1 J2 J3 J4 key value

Toronto 1

Toronto 17

Toronto 19

Toronto 2

Toronto 5
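The shared scan for the two queries above can be sketched as follows. This is a minimal in-memory illustration, not the MRShare implementation: the toy rows, the function names, and the use of a dictionary in place of the shuffle phase are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy rows of the `user` table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "student", 17),
    (2, "Toronto", "engineer", 19),
    (3, "Boston", "student", 25),
]

def shared_map(row):
    """One scan of a row feeds both queries; each output is tagged with its job."""
    _, hometown, _, age = row
    yield ("J1", hometown), 1      # COUNT(*) ... GROUP BY hometown
    yield ("J2", hometown), age    # AVG(age) ... GROUP BY hometown

def shared_reduce(job, values):
    """Dispatch on the job tag to apply that job's own reduce logic."""
    if job == "J1":
        return sum(values)
    return sum(values) / len(values)

def run_shared(rows):
    groups = defaultdict(list)
    for row in rows:                        # the input is read only once
        for key, value in shared_map(row):
            groups[key].append(value)       # stands in for shuffle and sort
    return {key: shared_reduce(key[0], vals) for key, vals in groups.items()}

results = run_shared(USERS)
```

The point of the sketch is that the input is scanned once, while the intermediate data still contains one record per job per row.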

Outline

• MRShare - Sharing primitives in Map-Reduce

– Sharing scans

– Sharing intermediate data

Sharing primitives - Sharing intermediate data.

• SELECT COUNT(*) FROM user WHERE occupation=‘student’ GROUP BY hometown

• SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown

User_id Hometown Occupation Age

Occupation ?= ‘student’ Age ?> 18

MRShare – sharing intermediate data (map).

MRShare – sharing intermediate data (reduce).

J1 J2 J3 J4 key value

Toronto 1

Toronto 4

Toronto 1

Toronto 2

Toronto 5
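Sharing intermediate data for the two filtered COUNT queries above can be sketched like this. It is a toy illustration under stated assumptions: the rows, the function names, and the representation of the job tags as a frozenset are hypothetical and chosen for readability.

```python
from collections import defaultdict

# Hypothetical toy rows of the `user` table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "student", 17),
    (2, "Toronto", "student", 19),
    (3, "Toronto", "engineer", 30),
    (4, "Toronto", "engineer", 16),
]

def shared_map(row):
    """Emit at most one record per row, tagged with every job it qualifies for."""
    _, hometown, occupation, age = row
    tags = set()
    if occupation == "student":    # predicate of the first query (J1)
        tags.add("J1")
    if age > 18:                   # predicate of the second query (J2)
        tags.add("J2")
    if tags:
        yield hometown, (frozenset(tags), 1)

def shared_reduce(values):
    """Add each value to the count of every job named in its tag set."""
    counts = defaultdict(int)
    for tags, value in values:
        for job in tags:
            counts[job] += value
    return dict(counts)

def run_shared(rows):
    groups, emitted = defaultdict(list), 0
    for row in rows:
        for key, value in shared_map(row):
            groups[key].append(value)       # stands in for shuffle and sort
            emitted += 1
    return {key: shared_reduce(vals) for key, vals in groups.items()}, emitted

results, emitted = run_shared(USERS)
```

On this toy input the shared map emits three records, whereas running the two jobs separately would emit four, because the row that satisfies both predicates is emitted only once.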

Outline

– Cost model for finding the optimal sharing strategy

– SplitJobs – cost based algorithm for sharing scans

– MultiSplitJobs – an improvement of SplitJobs

– γ-MultiSplitJobs – the algorithm for sharing intermediate data

Cost model for Map Reduce (single job)

• Reading – f(input size)

• Sorting – f(intermediate data size)

• Copying – f(intermediate data size)

• Writing – f(output size)
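As a rough illustration of the four components above, a single-job cost could be written as a sum of four terms. The linear form and the per-unit constants below are assumptions for the example; the slide only states which size each component depends on.

```python
from dataclasses import dataclass

@dataclass
class Job:
    input_size: float         # size of the data read by the map phase
    intermediate_size: float  # size of the map output
    output_size: float        # size of the final reduce output

# Hypothetical per-unit costs; real values would have to be measured.
C_READ, C_SORT, C_COPY, C_WRITE = 1.0, 2.0, 1.5, 1.0

def job_cost(job: Job) -> float:
    """Sum the reading, sorting, copying and writing components (assumed linear)."""
    read = C_READ * job.input_size
    sort = C_SORT * job.intermediate_size
    copy = C_COPY * job.intermediate_size
    write = C_WRITE * job.output_size
    return read + sort + copy + write

cost = job_cost(Job(input_size=100.0, intermediate_size=40.0, output_size=10.0))
```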

Cost of executing a group of jobs

Finding the optimal sharing strategy

• An optimization problem

“NoShare”

“GreedyShare”
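The two baseline strategies can be compared with a small sketch: "NoShare" is taken here to mean running every job on its own, and "GreedyShare" to mean merging all jobs into one group. The cost function, the merge-pass model for sorting, and all constants are assumptions for illustration only.

```python
# Hypothetical constants: size of the shared input, sort buffer, merge fan-in.
INPUT_SIZE, BUFFER, FAN_IN = 100.0, 50.0, 4

def merge_passes(size):
    """Assumed model: one more pass each time the sorted runs must be merged again."""
    passes, runs = 1, size / BUFFER
    while runs > 1:
        runs /= FAN_IN
        passes += 1
    return passes

def group_cost(jobs):
    """Cost of one merged job; each job is (intermediate_size, output_size)."""
    d = sum(inter for inter, _ in jobs)
    out = sum(o for _, o in jobs)
    # One scan of the input, then sorting, copying and writing.
    return INPUT_SIZE + d * merge_passes(d) + d + out

def no_share(jobs):
    return sum(group_cost([job]) for job in jobs)

def greedy_share(jobs):
    return group_cost(jobs)

light = [(40.0, 10.0)] * 3   # jobs with small intermediate data
heavy = [(90.0, 10.0)] * 4   # jobs with large intermediate data

light_costs = (no_share(light), greedy_share(light))
heavy_costs = (no_share(heavy), greedy_share(heavy))
```

Under these assumed numbers, merging the light jobs is cheaper than running them separately, while merging the heavy jobs is more expensive because the larger intermediate data needs an extra sorting pass.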

Outline

Sharing scans - cost based optimization

• Savings come from reduced number of scans

• The sorting cost might change

• The costs of copying and writing the output do not change

• We prove NP-hardness of the problem of finding the optimal sharing strategy

SplitJobs – a DP solution for sharing scans.

• We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.

• Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.
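The idea of splitting a sorted list of jobs with dynamic programming can be sketched as below. This is a generic split-point DP under stated assumptions, not the SplitJobs algorithm from the paper: the jobs are represented only by hypothetical intermediate data sizes, and the group cost function with its merge-pass model is invented for the example.

```python
# Hypothetical constants: size of the shared input, sort buffer, merge fan-in.
INPUT_SIZE, BUFFER, FAN_IN = 100.0, 50.0, 4

def merge_passes(size):
    """Assumed sorting model: count passes needed to merge the sorted runs."""
    passes, runs = 1, size / BUFFER
    while runs > 1:
        runs /= FAN_IN
        passes += 1
    return passes

def group_cost(sizes):
    """One scan of the input plus the sorting cost of the merged intermediate data."""
    d = sum(sizes)
    return INPUT_SIZE + d * merge_passes(d)

def split_jobs(jobs, cost):
    """Split a sorted list into contiguous groups with minimal total cost."""
    n = len(jobs)
    best = [0.0] + [float("inf")] * n   # best[i]: minimal cost of the first i jobs
    cut = [0] * (n + 1)                 # cut[i]: start index of the last group
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + cost(jobs[start:end])
            if candidate < best[end]:
                best[end], cut[end] = candidate, start
    groups, end = [], n
    while end > 0:                      # walk the split points backwards
        groups.append(jobs[cut[end]:end])
        end = cut[end]
    return best[n], groups[::-1]

jobs = sorted([90.0, 10.0, 20.0, 90.0, 10.0])   # intermediate data sizes
total, groups = split_jobs(jobs, group_cost)
```

On this toy input the sketch places the three small jobs in one group and the two large jobs in another, which is cheaper than both running every job alone and merging all five.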

Outline

MultiSplitJobs – an improvement of SplitJobs

Outline

Sharing intermediate data - cost based optimization

• The sorting and copying costs change – depending on the size of the intermediate data

We need to estimate the size of the intermediate data of all combinations of jobs.

γ-MultiSplitJobs – the solution for sharing intermediate data

• Approximate the size of the intermediate data

• γ-MultiSplitJobs – applies MultiSplitJobs with a modified cost function

• γ set heuristically

Outline

• MRShare Evaluation

Evaluation setup

• 40 EC2 small instance virtual machines

• Modified Hadoop engine

• 30 GB text dataset consisting of blogs

• Multiple grep-wordcount queries

– Counts words matching a regular expression

– Allows for variable intermediate data sizes

– Generic aggregation Map Reduce job

Evaluation goals

• Sharing is not always beneficial.

– ‘GreedyShare’ policy

• How much can we save on sharing scans?

– MRShare - MultiSplitJobs evaluation

• How much can we save on sharing intermediate data?

– MRShare - γ-MultiSplitJobs evaluation

Is sharing always beneficial? – ‘GreedyShare’ policy

Group of jobs    Group size    d = |intermediate data| / |input data|
H1               16            0.3 < d < 0.7
H2               16            0.7 < d
H3               16            0.9 < d

How much we save on sharing scans – MRShare MultiSplitJobs

Group of jobs    Group size    d
G1               16            0.7 < d
G2               16            0.2 < d < 0.7
G3               16            0.0 < d < 0.2
G4               16            0.0 < d < max
G5               64            0.0 < d < max

How much we save on sharing intermediate data – MRShare γ-MultiSplitJobs

Group of jobs    Group size    d
G1               16            0.7 < d
G2               16            0.2 < d < 0.7
G3               16            0.0 < d < 0.2

Summary

• We introduced MRShare – a framework for automatic work sharing in Map Reduce.

• We identified sharing primitives and demonstrated the implementation thereof in a Map Reduce engine.

• We established a cost model and solved several work sharing optimization problems.

• We demonstrated vast savings when using MRShare.

Thank you!!!

Questions?

Ongoing work – sharing expensive computation

• Sharing across multiple Map Reduce jobs with expensive predicates.

Ongoing work – dynamic sharing

• Dynamic sharing.
