Mr Share 11 Sep 2010

MRShare: Sharing Across Multiple Queries in MapReduce
Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
  • Date posted: 09-Jul-2015
  • Category: Technology


description

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reducing monetary charges incurred while utilizing the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Experiments in our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.

Transcript of Mr Share 11 Sep 2010

MRShare: Sharing Across Multiple Queries in MapReduce


Data management landscape
  • [Diagram labels: efficiency, flexibility, time performance, arbitrary data, large-scale setups]

MRShare: a sharing framework for MapReduce
The MRShare framework:
  • Inspired by sharing primitives from the relational domain
  • Introduces a cost model for MapReduce jobs
  • Searches for the optimal sharing strategies
  • Does not change the MapReduce computational model

Outline
  • Introduction
  • MapReduce recap
  • MRShare: sharing primitives in MapReduce
  • MRShare: cost-based approach to sharing
  • MRShare: evaluation
  • Summary

MapReduce recap
  • [Diagram labels: I (input splits), Map, network, Reduce, Output, HDFS]

Sharing primitives: sharing scans
  • Table: user(User_id, Hometown, Occupation, Age)
  • Query 1 (SQL): SELECT COUNT(*) FROM user GROUP BY hometown
      • Map: (id1, student, Toronto) → (Toronto, 1)
      • Reduce: (Toronto, 1), (Toronto, 1), (Toronto, 1), (Ottawa, 1), (Ottawa, 1) → (Toronto, 3), (Ottawa, 2)
  • Query 2 (SQL): SELECT AVG(age) FROM user GROUP BY hometown
      • Map: (id1, student, Toronto) → (Toronto, 17)
      • Reduce: (Toronto, 17), (Toronto, 19), (Montreal, 20), (Ottawa, 23), (Ottawa, 25) → (Toronto, 18), (Montreal, 20), (Ottawa, 24)

MRShare: sharing scans (map)
  • [Diagram: Input → Meta-map (Map 1, Map 2, Map 3, Map 4) → Map output]

MRShare: sharing scans (reduce)
  • [Diagram: Meta-reduce receives a table with columns J1, J2, J3, J4, key, value, with rows (Toronto, 1), (Toronto, 1), (Toronto, 1), (Toronto, 17), (Toronto, 19), (Toronto, 2), (Toronto, 5), and feeds Reduce 1, Reduce 2, Reduce 3, Reduce 4]

Outline
  • Introduction
  • MapReduce recap
  • MRShare: sharing primitives in MapReduce
      • Sharing scans
      • Sharing intermediate data
  • MRShare: cost-based approach to sharing
  • MRShare: evaluation
  • Summary

Sharing primitives: sharing intermediate data
  • Table: user(User_id, Hometown, Occupation, Age)
  • Query 1 (SQL): SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown
      • Map (filter: Occupation = student?): (id1, student, Toronto) → (Toronto, 1)
      • Reduce: (Toronto, 1), (Toronto, 1), (Toronto, 1), (Ottawa, 1), (Ottawa, 1) → (Toronto, 3), (Ottawa, 2)
  • Query 2 (SQL): SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown
      • Map (filter: Age > 18?): (id1, student, Toronto) → (Toronto, 1)
      • Reduce: (Toronto, 1), (Toronto, 1), (Ottawa, 1), (Ottawa, 1), (Montreal, 1) → (Toronto, 2), (Ottawa, 1), (Montreal, 2)

MRShare: sharing intermediate data (map)
  • [Diagram: Input → Meta-map (Map 1, Map 2, Map 3, Map 4) → Map output]

MRShare: sharing intermediate data (reduce)
  • [Diagram: Meta-reduce receives a table with columns J1, J2, J3, J4, key, value, with rows (Toronto, 1), (Toronto, 4), (Toronto, 1), (Toronto, 1), (Toronto, 2), (Toronto, 2), (Toronto, 5), and feeds Reduce 1, Reduce 2, Reduce 3, Reduce 4]

Outline
  • Introduction
  • MapReduce recap
  • MRShare: sharing primitives in MapReduce
  • MRShare: cost-based approach to sharing
      • Cost model for finding the optimal sharing strategy
      • SplitJobs: cost-based algorithm for sharing scans
      • MultiSplitJobs: an improvement of SplitJobs
      • γ-MultiSplitJobs: the algorithm for sharing intermediate data
  • MRShare: evaluation
  • Summary

Cost model for MapReduce (single job)
  • Reading input: f(input size)
  • Sorting intermediate data: f(intermediate data size)
  • Copying: f(intermediate data size)
  • Writing output: f(output size)

Cost of executing a group of jobs
  • [Diagram: Read, Sort, Copy, Write for each of J1, J2, J3 versus a single Read, Sort, Copy, Write for the merged job J1+J2+J3; labels: potential costs, savings, potential savings]
  • Speaker note: talk about the different possibilities of arranging jobs, and the question of which one is optimal.

Finding the optimal sharing strategy
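To make the question of which grouping is optimal concrete, here is a minimal Python sketch that applies a simplified read/sort/copy/write cost model to every possible grouping of a small batch. Everything specific in it is an assumption made for illustration: the unit costs (`READ_COST` and friends), the multi-pass sorting rule built from `BUFFER` and `FANIN`, the rule that a merged group's intermediate data is the sum of its jobs' intermediate data, and the job sizes. None of these values come from the slides or from Hadoop.

```python
from dataclasses import dataclass

# Hypothetical unit costs and sort parameters, chosen only for illustration.
READ_COST = 1.0
SORT_COST = 1.0   # cost per unit of intermediate data per sort pass
COPY_COST = 1.0
WRITE_COST = 1.0
BUFFER = 10.0     # assumed amount of data sortable in a single pass
FANIN = 10        # assumed merge fan-in for additional passes


@dataclass(frozen=True)
class Job:
    name: str
    intermediate: float  # size of the job's map output
    output: float        # size of the job's final output


def sort_passes(size):
    """Assumed rule: one pass, plus one merge pass per factor of FANIN over BUFFER."""
    passes, runs = 1, size / BUFFER
    while runs > 1:
        runs /= FANIN
        passes += 1
    return passes


def group_cost(group, input_size):
    """Cost of one merged job: the input is read once for the whole group."""
    inter = sum(j.intermediate for j in group)
    out = sum(j.output for j in group)
    return (READ_COST * input_size
            + SORT_COST * inter * sort_passes(inter)
            + COPY_COST * inter
            + WRITE_COST * out)


def plan_cost(plan, input_size):
    """A plan is a list of groups; groups run as independent jobs."""
    return sum(group_cost(g, input_size) for g in plan)


def partitions(jobs):
    """Yield every way of splitting the list of jobs into non-empty groups."""
    if not jobs:
        yield []
        return
    first, rest = jobs[0], jobs[1:]
    for plan in partitions(rest):
        yield [[first]] + plan                      # first job on its own
        for i, group in enumerate(plan):            # or added to an existing group
            yield plan[:i] + [[first] + group] + plan[i + 1:]


def best_plan(jobs, input_size):
    """Exhaustive search; only feasible for very small batches."""
    return min(partitions(jobs), key=lambda p: plan_cost(p, input_size))


if __name__ == "__main__":
    jobs = [Job("J1", 8, 1), Job("J2", 2, 1), Job("J3", 95, 1)]
    for label, plan in [("NoShare", [[j] for j in jobs]),
                        ("GreedyShare", [jobs]),
                        ("best", best_plan(jobs, 100))]:
        names = [[j.name for j in g] for g in plan]
        print(label, names, plan_cost(plan, 100))
```

With these made-up numbers, grouping J1 with J2 and running J3 alone is cheaper than both running every job separately and merging all three, because adding J3 pushes the merged intermediate data into an extra sort pass. Exhaustive enumeration grows very quickly with the number of jobs, so it only serves to illustrate the search space.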

  • An optimization problem
  • [Diagram: jobs J1 to J5 shown under alternative groupings; labels: NoShare, GreedyShare]

Sharing scans: cost-based optimization
  • Savings come from the reduced number of scans
  • The sorting cost might change
  • The costs of copying and writing the output do not change
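As a reminder of the mechanism whose cost is being weighed here, the following is a small in-memory Python sketch of a shared scan: two aggregation queries over the same user table are evaluated in one pass, with every map output tagged by the job that produced it so that the reduce side can separate them. This is a simulation under stated assumptions, not Hadoop code and not the authors' implementation; the names `meta_map`, `meta_reduce`, `run_shared`, the `JOBS` registry and the sample rows are hypothetical.

```python
from collections import defaultdict

# Sample rows (hometown, age), loosely following the AVG(age) example values.
USERS = [
    {"hometown": "Toronto", "age": 17},
    {"hometown": "Toronto", "age": 19},
    {"hometown": "Montreal", "age": 20},
    {"hometown": "Ottawa", "age": 23},
    {"hometown": "Ottawa", "age": 25},
]


def map_count(rec):            # SELECT COUNT(*) ... GROUP BY hometown
    yield rec["hometown"], 1


def reduce_count(key, values):
    return sum(values)


def map_avg(rec):              # SELECT AVG(age) ... GROUP BY hometown
    yield rec["hometown"], rec["age"]


def reduce_avg(key, values):
    return sum(values) / len(values)


# Hypothetical registry of the jobs merged into one group.
JOBS = {"J1": (map_count, reduce_count), "J2": (map_avg, reduce_avg)}


def meta_map(record):
    """Run every job's map function on the record and tag each output."""
    for tag, (mapper, _) in JOBS.items():
        for key, value in mapper(record):
            yield key, (tag, value)


def meta_reduce(key, tagged_values):
    """Split the values by tag and hand each subset to its own reduce function."""
    by_job = defaultdict(list)
    for tag, value in tagged_values:
        by_job[tag].append(value)
    return {tag: JOBS[tag][1](key, vals) for tag, vals in by_job.items()}


def run_shared(records):
    shuffled = defaultdict(list)
    for rec in records:                      # the input is scanned once
        for key, tagged in meta_map(rec):
            shuffled[key].append(tagged)     # stands in for sort and copy
    results = defaultdict(dict)
    for key, tagged_values in shuffled.items():
        for tag, out in meta_reduce(key, tagged_values).items():
            results[tag][key] = out
    return dict(results)


if __name__ == "__main__":
    print(run_shared(USERS))
```

Note that the scan is shared but the intermediate data is not: each job still contributes its own tagged pairs, which is why the sorting cost of the merged job can differ from the sum of the individual sorting costs.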

  • [Diagram: Read and Sort for each of J1, J2, J3 versus a single Read and Sort for the merged job J1+J2+J3; labels: potential costs, savings]
  • We prove NP-hardness of the problem of finding the optimal sharing strategy

Sharing scans: approximating the cost of sorting
  • Compute the exact sorting cost for (J1+J2+J3)
  • Approximate the sorting cost based on (J1+J2+J3)
  • [Diagram: jobs J1 to J5 and subsets such as J1 J2 J3, J2 J3, J1 J3; area ~ |intermediate data|]

SplitJobs: a DP solution for sharing scans
  • We reduce the problem of grouping to the problem of splitting a sorted list of jobs by approximating the cost of sorting.
  • Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.
  • [Diagram: SplitJobs splits the sorted list J1 to J6 into groups G1, G2, G3]

MultiSplitJobs: an improvement of SplitJobs
  • [Diagram: MultiSplitJobs applies SplitJobs repeatedly to jobs J1 to J8, producing groups G1, G2, G3, G4]

Sharing intermediate data: cost-based optimization
  • The sorting and copying costs change depending on the size of the intermediate data
  • [Diagram: Read, Sort, Copy for each of J1, J2, J3 versus a single Read, Sort, Copy for the merged job J1+J2+J3; labels: potential costs or savings, savings, potential savings]
  • We need to estimate the size of the intermediate data of all combinations of jobs
  • Prohibitive cost of maintaining statistics

γ-MultiSplitJobs: the solution for sharing intermediate data
  • Approximate the size of the intermediate data
  • [Diagram: approximating the intermediate data size of the merged job J1+J2+J3 from the sizes for J1, J2, J3]
  • γ-MultiSplitJobs applies MultiSplitJobs with a modified cost function
  • γ is set heuristically

Evaluation setup
  • 40 EC2 small-instance virtual machines
  • Modified Hadoop engine
  • 30 GB text dataset consisting of blogs
  • Multiple grep-wordcount queries
      • Counts words matching a regular expression
      • Allows for variable intermediate data sizes
      • Generic aggregation MapReduce job
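The grep-wordcount query used in the evaluation can be pictured with a short Python sketch: the map side keeps only the words that match a regular expression, and the reduce side counts them. The function name `grep_wordcount`, the whitespace tokenization, and the use of a word-count ratio as a rough proxy for relative intermediate data size are assumptions made for this example; the actual jobs ran on a modified Hadoop engine.

```python
import re
from collections import Counter


def grep_wordcount(lines, pattern):
    """Count the words that fully match `pattern`.

    Returns (counts, ratio), where ratio is the fraction of input words that
    reach the intermediate data. It is a rough stand-in for the relative
    size of the intermediate data, measured in words rather than bytes.
    """
    regex = re.compile(pattern)
    total_words = 0
    intermediate = []                       # map output: (word, 1) pairs
    for line in lines:
        for word in line.split():           # assumed tokenization: whitespace
            total_words += 1
            if regex.fullmatch(word):
                intermediate.append((word, 1))

    counts = Counter()                      # reduce: sum the ones per word
    for word, one in intermediate:
        counts[word] += one

    ratio = len(intermediate) / total_words if total_words else 0.0
    return dict(counts), ratio


if __name__ == "__main__":
    text = ["the cat sat", "a cat ate the hat"]
    for pattern in (r"c.*", r".*at", r".*"):
        print(pattern, grep_wordcount(text, pattern))
```

Changing the regular expression changes how much of the input survives the map phase, which is what makes this kind of query convenient for experiments that need variable intermediate data sizes.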

Evaluation goals
  • Sharing is not always beneficial: GreedyShare policy
  • How much can we save on sharing scans? MRShare: MultiSplitJobs evaluation
  • How much can we save on sharing intermediate data? MRShare: γ-MultiSplitJobs evaluation

Is sharing always beneficial? GreedyShare policy
  • Table columns: group of jobs; group size; d = |intermediate data| / |input data|
  • H1; 16; 0.3 < d