# Mr Share 11 Sep 2010

Date posted: 09-Jul-2015. Category: Technology.


### Transcript of Mr Share 11 Sep 2010

MRShare: Sharing Across Multiple Queries in MapReduce

Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Data management landscape: efficiency, flexibility

Time performance, arbitrary data, large-scale setups

MRShare: a sharing framework for Map Reduce

MRShare framework:
- Inspired by sharing primitives from the relational domain
- Introduces a cost model for Map Reduce jobs
- Searches for the optimal sharing strategies
- Does not change the Map Reduce computational model

Outline
- Introduction
- Map Reduce recap
- MRShare: sharing primitives in Map-Reduce
- MRShare: cost-based approach to sharing
- MRShare: evaluation
- Summary

Map Reduce recap

Figure: inputs (I) in HDFS, Map, network, Reduce, output to HDFS.

Sharing primitives: sharing scans

SQL:
- SELECT COUNT(*) FROM user GROUP BY hometown
- SELECT AVG(age) FROM user GROUP BY hometown

Table user: User_id, Hometown, Occupation, Age

Map Reduce (figure):
- First query: Map turns a row (id1, student, Toronto) into (Toronto, 1). Reduce turns Toronto 1, Toronto 1, Toronto 1, Ottawa 1, Ottawa 1 into Toronto 3, Ottawa 2.
- Second query: Map turns a row (id1, student, Toronto) into (Toronto, 17). Reduce turns Toronto 17, Toronto 19, Montreal 20, Ottawa 23, Ottawa 25 into Toronto 18, Montreal 20, Ottawa 24.

MRShare: sharing scans (map)

Figure (meta-map): Input, then Map 1, Map 2, Map 3, Map 4, then Map output.

MRShare: sharing scans (reduce)

Figure (meta-reduce): key/value pairs tagged with J1, J2, J3, J4 (values shown for key Toronto: 1, 1, 1, 17, 19, 2, 5) are passed to Reduce 1, Reduce 2, Reduce 3, Reduce 4.

Outline, sharing primitives in Map-Reduce:
- Sharing scans
- Sharing intermediate data

Sharing primitives: sharing intermediate data

SQL:
- SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown
- SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown

Table user: User_id, Hometown, Occupation, Age

Map Reduce (figure):
- First query: Map tests occupation = student and turns a row (id1, student, Toronto) into (Toronto, 1). Reduce turns Toronto 1, Toronto 1, Toronto 1, Ottawa 1, Ottawa 1 into Toronto 3, Ottawa 2.
- Second query: Map tests age > 18 and turns a row (id1, student, Toronto) into (Toronto, 1). Reduce values as shown on the slide: Toronto 1, Toronto 1, Ottawa 1, Ottawa 1, Montreal 1; Toronto 2, Ottawa 1, Montreal 2.

MRShare: sharing intermediate data (map)

Figure (meta-map): Input, then Map 1, Map 2, Map 3, Map 4, then Map output.

MRShare: sharing intermediate data (reduce)

Figure (meta-reduce): key/value pairs tagged with J1, J2, J3, J4 (values shown for key Toronto: 1, 4, 1, 1, 2, 2, 5) are passed to Reduce 1, Reduce 2, Reduce 3, Reduce 4.

Outline, cost-based approach to sharing:
- Cost model for finding the optimal sharing strategy
- SplitJobs: cost-based algorithm for sharing scans
- MultiSplitJobs: an improvement of SplitJobs
- γ-MultiSplitJobs: the algorithm for sharing intermediate data

Cost model for Map Reduce (single job)
- Reading: f(input size)
- Sorting: f(intermediate data size)
- Copying: f(intermediate data size)
- Writing: f(output size)

Figure: reading input, sorting intermediate data, copying, writing output.

Cost of executing a group of jobs

Figure: J1, J2 and J3 each pay Read, Sort, Copy, Write; the merged job J1+J2+J3 pays Read, Sort, Copy, Write once (labels: potential costs, savings, potential savings).

Speaker note: talk about the different possibilities of arranging jobs, and the question of which one is optimal.

Finding the optimal sharing strategy
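A minimal sketch of the cost model above, in Python. The per-byte coefficients `READ`, `SORT`, `COPY`, `WRITE`, the `Job` fields, and the linear form of each term are assumptions made for illustration; the slides only say that each phase cost is some function f of the corresponding data size. The merged-job function further assumes that all jobs read the same input and that their intermediate and output data simply add up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    input_size: float         # data read by the map phase
    intermediate_size: float  # map output that is sorted and copied
    output_size: float        # data written by the reduce phase


# Hypothetical per-unit costs; real values depend on the cluster.
READ, SORT, COPY, WRITE = 1.0, 2.0, 1.5, 1.0


def job_cost(job):
    """Cost of one job run on its own: read + sort + copy + write."""
    return (READ * job.input_size
            + SORT * job.intermediate_size
            + COPY * job.intermediate_size
            + WRITE * job.output_size)


def no_share_cost(jobs):
    """Every job is executed separately, so the input is read once per job."""
    return sum(job_cost(j) for j in jobs)


def shared_scan_cost(jobs):
    """One merged job: a single scan, then every job's sort/copy/write work.

    Assumes all jobs read the same input and that the sort cost stays linear.
    """
    read = READ * jobs[0].input_size
    rest = sum((SORT + COPY) * j.intermediate_size + WRITE * j.output_size
               for j in jobs)
    return read + rest
```

Under these assumptions the merged job saves exactly one scan per additional job. With a non-linear sort term the merged job can also cost more than the separate jobs, which is what the "potential costs" label on the slide refers to.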

An optimization problem

Figure: alternative groupings of jobs J1, J2, J3, J4, J5 (labels: NoShare, GreedyShare).

Sharing scans: cost-based optimization
- Savings come from the reduced number of scans
- The sorting cost might change
- The costs of copying and writing the output do not change
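To make the source of the scan savings concrete, here is a small Python sketch of a merged job that serves the two GROUP BY hometown queries from the sharing-scans slides in a single pass. The sample rows, the job tags `J1`/`J2`, and the function names are hypothetical; the in-memory dictionary only stands in for the shuffle and sort phase and is not how Hadoop works internally.

```python
from collections import defaultdict

# Hypothetical sample of the user table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "student", 17),
    (2, "Toronto", "engineer", 19),
    (3, "Ottawa", "student", 23),
    (4, "Ottawa", "teacher", 25),
    (5, "Montreal", "student", 20),
]


def map_count(row):
    # J1: SELECT COUNT(*) FROM user GROUP BY hometown
    yield row[1], 1


def map_avg_age(row):
    # J2: SELECT AVG(age) FROM user GROUP BY hometown
    yield row[1], row[3]


def reduce_count(values):
    return sum(values)


def reduce_avg_age(values):
    return sum(values) / len(values)


JOBS = {"J1": (map_count, reduce_count), "J2": (map_avg_age, reduce_avg_age)}


def meta_map(row):
    """Run every job's map on one input row and tag each output with its job."""
    for tag, (mapper, _) in JOBS.items():
        for key, value in mapper(row):
            yield key, (tag, value)


def meta_reduce(tagged_values):
    """Route tagged values to the reduce function of the job that emitted them."""
    per_job = defaultdict(list)
    for tag, value in tagged_values:
        per_job[tag].append(value)
    return {tag: JOBS[tag][1](values) for tag, values in per_job.items()}


def run_shared(rows):
    groups = defaultdict(list)  # stands in for the shuffle/sort phase
    for row in rows:            # the input is scanned exactly once
        for key, tagged in meta_map(row):
            groups[key].append(tagged)
    return {key: meta_reduce(values) for key, values in groups.items()}
```

Calling `run_shared(USERS)` returns, per hometown, one result for each job. The input is read once, but the intermediate data now holds the output of both map functions, which is why the sorting cost of the merged job can differ from that of the individual jobs.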

Figure: J1, J2 and J3 each pay Read and Sort; the merged job J1+J2+J3 pays Read and Sort once (labels: potential costs, savings).

We prove NP-hardness of the problem of finding the optimal sharing strategy.

Sharing scans: approximating the cost of sorting

Figure: jobs J1 to J5 drawn with area ~ |intermediate data|.
- Compute the exact sorting cost for (J1+J2+J3)
- Approximate the sorting cost based on (J1+J2+J3)

SplitJobs: a DP solution for sharing scans
- We reduce the problem of grouping to the problem of splitting a sorted list of jobs by approximating the cost of sorting.
- Using our cost model and the approximation, we employ a dynamic programming (DP) algorithm to find the optimal split points.

Figure: SplitJobs splits the sorted list J1, J2, J3, J4, J5, J6 into groups G1, G2, G3.

MultiSplitJobs: an improvement of SplitJobs

Figure: repeated SplitJobs passes over jobs J1 to J8 yield groups G1, G2, G3, G4.

Sharing intermediate data: cost-based optimization
- The sorting and copying costs change depending on the size of the intermediate data.
- We need to estimate the size of the intermediate data of all combinations of jobs.
- Prohibitive cost of maintaining statistics.

Figure: J1, J2 and J3 each pay Read, Sort, Copy; the merged job J1+J2+J3 pays Read, Sort, Copy once (labels: potential costs or savings, savings, potential savings).

γ-MultiSplitJobs: the solution for sharing intermediate data
- Approximate the size of the intermediate data.
- γ-MultiSplitJobs applies MultiSplitJobs with a modified cost function; γ is set heuristically.

Figure: the intermediate data of J1, J2 and J3 combined into an approximate size for the merged job.

Evaluation setup
- 40 EC2 small instance virtual machines
- Modified Hadoop engine
- 30 GB text dataset consisting of blogs
- Multiple grep-wordcount queries
  - Counts words matching a regular expression
  - Allows for variable intermediate data sizes
  - Generic aggregation Map Reduce job
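As an illustration of the workload, a single grep-wordcount query can be sketched in plain Python as follows. The function name and the whitespace tokenization are assumptions; the slides do not show the actual Hadoop job.

```python
import re
from collections import Counter


def grep_wordcount(lines, pattern):
    """Count the words that fully match the regular expression `pattern`."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:                  # map side: emit (word, 1) for matches
        for word in line.split():
            if regex.fullmatch(word):
                counts[word] += 1       # reduce side: sum the ones per word
    return counts
```

A more permissive pattern lets more words through the map side, so the same input produces more intermediate data. This is how one query template can cover a range of intermediate data sizes.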

Evaluation goals
- Sharing is not always beneficial: GreedyShare policy
- How much can we save on sharing scans? MRShare: MultiSplitJobs evaluation
- How much can we save on sharing intermediate data? MRShare: γ-MultiSplitJobs evaluation
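For reference when reading the scan-sharing results, the SplitJobs step described earlier (splitting a sorted list of jobs into contiguous groups with dynamic programming) can be sketched as below. The function `split_jobs` and its `group_cost(start, end)` callback are hypothetical names, and the toy cost used in the usage note is invented for illustration; it is not the cost function from the talk.

```python
def split_jobs(n, group_cost):
    """Split jobs 0..n-1 (already sorted) into contiguous groups of minimal cost.

    group_cost(start, end) returns the cost of running jobs[start:end] as one
    merged job. Returns (total_cost, [(start, end), ...]).
    """
    best = [0.0] + [float("inf")] * n   # best[k]: cheapest way to cover jobs[:k]
    cut = [0] * (n + 1)                 # cut[k]: start index of the last group
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + group_cost(start, end)
            if candidate < best[end]:
                best[end], cut[end] = candidate, start
    groups, end = [], n
    while end > 0:                      # walk the split points backwards
        groups.append((cut[end], end))
        end = cut[end]
    return best[n], groups[::-1]
```

As a usage example, take five jobs sorted by a hypothetical intermediate-to-input ratio `d = [0.1, 0.2, 0.3, 2.0, 2.5]` and a toy cost of one scan plus a superlinear sort term, `lambda i, j: 1.0 + sum(d[i:j]) ** 2`. The sketch then groups the three small jobs together and leaves the two large ones alone, since merging them would cost more in sorting than the saved scan is worth.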

Is sharing always beneficial? GreedyShare policy

Table columns: group of jobs; group size; d = |intermediate data| / |input data|

H1; 16; 0.3 < d