Information Retrieval in Cloud


University of Patras, Department of Computer Engineering & Informatics. Information Retrieval in Cloud. Diploma Thesis. Zois Vasileios, Registry No. (Α.Μ.): 4183. Presentation contents: Distributed Systems; Hadoop Distributed File System (HDFS); Distributed Database (HBase); MapReduce Programming Model.

Transcript of Information Retrieval in Cloud

Slide 1

Information Retrieval in Cloud
Zois Vasileios, Registry No. (Α.Μ.): 4183
University of Patras, Department of Computer Engineering & Informatics
Diploma Thesis

Presentation Contents
  • Distributed Systems
  • Hadoop Distributed File System (HDFS)
  • Distributed Database (HBase)
  • MapReduce Programming Model
  • Study of B, B+ Trees
  • Building Trees on HBase
  • Range Queries on B+ & B Trees
  • Experiments in the Construction of Trees
  • Analyzing Results
  • Conclusions

HDFS Architecture
  • Google File System (GFS)
      • Distributed file system used by Google
  • HDFS: open source implementation of GFS
  • Distributed file system
      • Management of large amounts of data
      • Failure detection & automatic recovery
      • Scalability
  • Designed using Java
      • Independent of the operating system
      • Runs on computers with different hardware

HBase Architecture
  • HBase: open source implementation of BigTable
  • NoSQL system (category: column family store)
  • Organizes data in tables
      • Tables divided into column families
  • Architecture similar to HDFS
  • Works on top of HDFS

MapReduce Programming Model
  • Distributed programming model
      • Data-intensive applications
      • Distributed computing on a cluster of machines
  • Functional programming
      • Map function
      • Reduce function
  • Operations
      • Data structured as (key, value) pairs
      • Input data processed in parallel (Mapper)
      • Intermediate results processed (Reducer)
  • Map(k1, v1) → list(k2, v2)
  • Reduce(k2, list(v2)) → list(v3)
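The Map and Reduce signatures above can be illustrated with a minimal single-process sketch. It only imitates the data flow (map, group by key, reduce) and is not Hadoop code. The names run_mapreduce, count_orders_mapper and count_orders_reducer, as well as the sample records, are assumptions made for illustration. The reducer here emits (key, value) pairs rather than bare values so the output stays readable.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (k1, v1), group by k2, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: each input pair may emit any number of (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: keys are visited in sorted order, as after a shuffle/sort.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_orders_mapper(order_id, cust_id):
    # Hypothetical job: count orders per customer.
    yield cust_id, 1

def count_orders_reducer(cust_id, ones):
    yield cust_id, sum(ones)

# Made-up (order_id, cust_id) records, loosely modelled on an orders table.
orders = [(1, "c7"), (2, "c3"), (3, "c7"), (4, "c7")]
result = run_mapreduce(orders, count_orders_mapper, count_orders_reducer)
print(result)
```

In a real cluster the grouping step is performed by the framework's shuffle phase across machines; here a dictionary plays that role.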

Building Tree with BulkInsert
  • Mapper
      • Input data processing
      • Pairing in the form (key, value)
  • Custom Partitioner
      • Data clustering
      • Specific range of values on each Reducer
  • Reducer
      • Tree building (BulkInsert, BulkLoading)
      • Some data kept in memory during the process
  • Cleanup
      • Write tree to HBase table
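The custom partitioner step, which sends a specific range of values to each reducer, could look roughly like the sketch below. It is written in plain Python rather than against the Hadoop Partitioner API, and the split points, the function name make_range_partitioner and the sample keys are all assumptions for illustration.

```python
from bisect import bisect_right

def make_range_partitioner(split_points):
    """Return a function mapping a key to a reducer index.

    split_points must be sorted. Reducer 0 receives keys below
    split_points[0]; reducer i receives keys k with
    split_points[i-1] <= k < split_points[i]; the last reducer
    receives everything from the final split point upwards.
    """
    def partition(key):
        return bisect_right(split_points, key)
    return partition

# Hypothetical split points giving four reducers.
partition = make_range_partitioner([100, 200, 300])

buckets = {}
for key in [5, 250, 100, 999, 150]:
    buckets.setdefault(partition(key), []).append(key)
print(buckets)
```

Because each reducer sees one contiguous key range, every reducer can build its own tree independently of the others.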

Building Tree with BulkLoading
  • More efficient
      • Lower physical memory requirements
      • Completion in fewer steps (n/B)
      • Relatively easy implementation
  • Execution steps
      • Sorted keys arrive from the Map phase
      • Divide the keys into leaves
      • Save information for the next level
      • Write created nodes when the buffer is full
      • Repeat the procedure until the root is reached
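The execution steps above can be sketched as a bottom-up build. This is a simplified in-memory illustration, not the thesis implementation: the function bulk_load, the dictionary node layout, and the list that stands in for the HBase table are assumptions. Each level keeps the last key of every node as the entry for the level above, and nodes are written out whenever the buffer fills.

```python
def bulk_load(sorted_keys, order, buffer_size=4):
    """Build tree levels bottom-up from sorted keys; return written nodes."""
    if order < 2:
        raise ValueError("order must be at least 2")
    written = []   # stands in for the HBase table (assumption)
    buffer = []    # nodes waiting to be written

    def flush():
        written.extend(buffer)
        buffer.clear()

    entries = list(sorted_keys)
    if not entries:
        return written
    level = 0      # level 0 holds the leaves
    while True:
        # Cut the current entries into nodes of at most `order` keys.
        nodes = [entries[i:i + order] for i in range(0, len(entries), order)]
        for keys in nodes:
            buffer.append({"level": level, "keys": keys})
            if len(buffer) >= buffer_size:
                flush()
        if len(nodes) == 1:
            break  # a single node at this level is the root
        # Save the last key of each node as input for the next level.
        entries = [keys[-1] for keys in nodes]
        level += 1
    flush()
    return written

nodes = bulk_load(range(1, 11), order=3)
for node in nodes:
    print(node)
```

With n keys and at most B keys per node, the leaf level consists of about n/B nodes, and every node is created exactly once, so no rebalancing takes place during the build.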

Organizing Data in Table
  • Tree node = row in table
  • Define node
      • Column family
      • Row key
  • Internal nodes
      • Row key: last key of the respective node
  • Leaves
      • Row key: a special tag added in front of the last key of the node (rows sort in lexicographic order)

Range Queries on B+ Trees
  • Check tree range
  • Find leaf
      • Leaf including the left end of the range
      • Leaf including the right end of the range
  • HBase table scan to find keys
      • Use the row key from each leaf to scan
  • Complexity
      • O(2 * (number of trees + log_B(E))), where E is the number of keys in a tree and B is the tree order
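The row key scheme and the leaf scan described above can be sketched together as follows. A plain dictionary stands in for the HBase table, and several details are assumptions not given in the slides: the tag character "L", the zero-padding width that makes lexicographic order agree with numeric order, non-negative integer keys, and the function names row_key and range_query.

```python
from bisect import bisect_left

KEY_WIDTH = 10   # assumed fixed width so string order matches numeric order
LEAF_TAG = "L"   # hypothetical tag; sorts after digits, so leaves group together

def row_key(last_key, is_leaf):
    """Row key of a node: its last key, with a tag in front for leaves."""
    padded = str(last_key).zfill(KEY_WIDTH)
    return LEAF_TAG + padded if is_leaf else padded

def range_query(table, lo, hi):
    """Return all stored keys k with lo <= k <= hi by scanning leaf rows."""
    leaf_rows = sorted(rk for rk in table if rk.startswith(LEAF_TAG))
    if not leaf_rows:
        return []
    # First leaf whose last key is >= lo holds the left end of the range.
    start = bisect_left(leaf_rows, row_key(lo, True))
    # First leaf whose last key is >= hi holds the right end of the range.
    stop = min(bisect_left(leaf_rows, row_key(hi, True)), len(leaf_rows) - 1)
    hits = []
    for rk in leaf_rows[start:stop + 1]:   # contiguous scan between the two
        hits.extend(k for k in table[rk] if lo <= k <= hi)
    return hits

# Toy table: four leaves and one internal (root) node.
table = {
    row_key(3, True): [1, 2, 3],
    row_key(6, True): [4, 5, 6],
    row_key(9, True): [7, 8, 9],
    row_key(10, True): [10],
    row_key(10, False): [3, 6, 9, 10],
}
print(sorted(table))
print(range_query(table, 5, 8))
```

In this sketch a binary search over the sorted leaf row keys replaces the descent through internal nodes; in HBase the contiguous part would correspond to a scan between a start row and a stop row.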

Range Queries on B Trees
  • Analogous to B+ trees
      • Find the trees covering the required range
      • Pinpoint the individual trees from start to end
      • Execute a depth-first search on each tree
  • Depth-first search
      • Retrieval of keys in internal nodes
  • Complexity
      • Depth-first search complexity: O(|V| + |E|)*

Experiments: Systems & Tools
  • Hadoop & HBase
      • Hadoop version 1.0.1
      • HBase version 0.94.1
  • Operating system
      • Debian Base 6.0.5
  • Machines (4), Okeanos
      • 4 CPUs (virtual) per machine
      • RAM: 2048 MB per machine
      • HDD: 40 GB per machine
  • Data
      • TPC-H Orders table (cust_id, order_id)

Experiments: Data & Observations
  • Experiment observation
      • Tree order
      • Execution time
      • Necessary storage space
      • Physical memory
      • Number of reducers

Experiments: Bulk Insert
  • Comparison of trees with order 5 & 101
  • Increased execution time
      • Rebalance operation
  • Physical memory & HDD space
      • Necessary information for the tree structure
  • Conclusion
      • Problem in scalability
      • Large physical memory requirements
      • Increased execution time

[Figure: Execution time distribution, order 5]
[Figure: Execution time distribution, order 101]

Experiments: Bulk Insert (results)

Tree Order 5                           B+ Tree      B-Tree
Input Data Size                        230 MB       230 MB
Output Tree Size                       2.2 GB       1.4 GB
Execution Time (sec)                   900          451
Median Execution Time Map (sec)        56.29        55
Median Execution Time Shuffle (sec)    28           28.75
Median Execution Time Reduce (sec)     125.5        88.25
Number of Reducers                     8            8
Physical Memory Allocated              19525 MB     15222 MB

Tree Order 101                         B+ Tree      B-Tree
Input Data Size                        230 MB       230 MB
Output Tree Size                       598.2 MB     256 MB
Execution Time (sec)                   263          246
Median Execution Time Map (sec)        52           49.86
Median Execution Time Shuffle (sec)    28.63        29.75
Median Execution Time Reduce (sec)     68.25        66.25
Number of Reducers                     8            8
Physical Memory Allocated              9501 MB      9286 MB

Experiments: Bulk Loading
  • BulkLoading vs BulkInsert comparison
      • Smaller execution time
      • Lower physical memory requirements
      • Smaller required space on HDD

  • Testing buffer fluctuation
      • Buffer 128, 512
      • Smaller execution time
      • Adjustable requirements for physical memory

[Figure: Execution time distribution, buffer 128]
[Figure: Execution time distribution, buffer 512]

Experiments: Bulk Loading (results)

Tree Order 101                         B+ Tree      B-Tree
Input Data Size                        230 MB       230 MB
Output Tree Size                       267.1 MB     256 MB
Execution Time (sec)                   132          125
Median Execution Time Map (sec)        51.14        53.57
Median Execution Time Reduce (sec)     43.5         37.75
Number of Reducers                     8            8
Buffer Size (Put Objects)              128          128
Physical Memory Allocated              6517         6165

Tree Order 101                         B+ Tree      B-Tree
Input Data Size                        230 MB       230 MB
Output Tree Size                       267.1 MB     256 MB
Execution Time (sec)                   114          108
Median Execution Time Map (sec)        52           55.14
Median Execution Time Reduce (sec)     33           30.63
Number of Reducers                     8            8
Buffer Size (Put Objects)              512          512
Physical Memory Allocated              6613         6678

Conclusions
  • Comparing building techniques
      • BulkInsert
          • Precise choice of tree order
          • Increased execution time with small-order trees due to constant rebalancing
          • High physical memory requirements
          • Not so scalable
      • BulkLoading
          • Created tree is full (the next insert could cause a tree rebalancing)
          • Smaller execution time
          • Adjustable requirements in physical memory
          • More complicated implementation
  • Why use B & B+ trees
      • In collaboration with pre-warm techniques
      • Less burden on the master
      • Communication between slaves

THANK YOU FOR YOUR ATTENTION!!!