Advanced Topics in Databases: MapReduce

MapReduce

Big Data
- 90% of today's data was created in the last 2 years
- Moore's law: data doubles every 18 months
- YouTube: 13 million hours and 700 billion views in 2010
- Facebook: 20 TB/day, compressed
- CERN/LHC: 40 TB/day (15 PB/year)
- Web logs, ...

640K ...
- 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes)
- 2010: 1.2 ZB (zettabyte) = 1200 EB
- 2020: 35 ZB (zettabyte = 10^21 bytes)

Scalability?

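Source: Wikipedia (IBM Roadrunner)

Divide and conquer
[Figure: the work is partitioned into units w1, w2, w3; each unit is handed to a worker; the partial results r1, r2, r3 are combined.]
- Partition the work, then combine the results
- The slide then poses a series of questions about the workers
- Common theme: the workers (..., ...)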
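The partition / worker / combine figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (partition, work, divide_and_conquer) and the sum-of-squares workload are hypothetical, and threads are used to keep the sketch short even though CPU-bound work in CPython would normally use processes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def work(chunk):
    """Hypothetical unit of work: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def divide_and_conquer(items, n_workers=3):
    chunks = partition(items, n_workers)             # w1, w2, w3
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))      # r1, r2, r3
    return sum(partials)                             # combine


print(divide_and_conquer(list(range(10))))  # prints 285
```

Even this small sketch has to decide how work is assigned, how results are gathered, and when everything is finished.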


Source: Ricardo Guimarães Herrmann

Workers must be synchronized. Synchronization primitives:
- Semaphores (lock, unlock)
- Condition variables (wait, notify, broadcast)
- Barriers

Problems that remain:
- Deadlock, livelock, race conditions...
- Dining philosophers, sleeping barbers, cigarette smokers...

Programming models:
- Shared memory (pthreads)
- Message passing (MPI)

Design patterns:
- Master-slaves
- Producer-consumer flows
- Shared work queues

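[Figure: message passing among processes P1 to P5; shared memory, with P1 to P5 attached to a single memory; a master with its slaves; producer-consumer pairs; a shared work queue.]

Concurrency
- At the scale of a datacenter (or across datacenters)
- In the presence of failures
- In practice (before MapReduce): custom solutions for every problem!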
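As an illustration of the shared work queue pattern named on the slide, here is a minimal sketch using Python's standard threading and queue modules. The function name run_work_queue, the squaring workload and the STOP sentinel are assumptions made for the example.

```python
import queue
import threading


def run_work_queue(tasks, n_consumers=3):
    """One producer fills a shared queue; consumers square each task."""
    work_q = queue.Queue()
    results = []
    results_lock = threading.Lock()   # protects the shared results list
    stop = object()                   # sentinel telling a consumer to exit

    def consumer():
        while True:
            item = work_q.get()
            if item is stop:
                break
            with results_lock:
                results.append(item * item)

    threads = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for task in tasks:                # the producer side
        work_q.put(task)
    for _ in threads:                 # one sentinel per consumer
        work_q.put(stop)
    for t in threads:
        t.join()
    return sorted(results)


print(run_work_queue([1, 2, 3, 4]))  # prints [1, 4, 9, 16]
```

The lock and the sentinel are exactly the kind of bookkeeping the programmer must get right by hand in this model.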

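What is MapReduce?
- A programming model for processing large data sets on clusters of computers
- Closed-source implementation at Google
- Scientific papers in '03 and '04
- Hadoop: open-source implementation based on the papers, http://hadoop.apache.org/

HDFS
- Distributed file system, e.g. a cluster with 10 ..., 100 ..., 10 PB storage
- Closed-source alternatives exist (MapR M5-M7)
- Files are spread over the cluster in blocks; typical block size 128 MB
- Replication: every block is stored on several DataNodes, default 3 (rack aware)

HDFS/MapReduce architecture: Master/Slave
- HDFS: one NameNode and many DataNodes
- NameNode: records which DataNode holds which block (like a FAT)
- DataNodes: servers that store the raw file chunks
- MapReduce: one JobTracker and many TaskTrackers
- NameNode and JobTracker run on the master
- DataNode and TaskTracker run on the slaves: data locality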
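The block and replication figures on this slide (128 MB blocks, 3 replicas) can be made concrete with a small sketch of a NameNode-style table. The round-robin placement below is an illustrative assumption and is not HDFS's actual rack-aware policy; num_blocks and place_blocks are hypothetical names.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, as on the slide
REPLICATION = 3                  # default replication factor


def num_blocks(file_size):
    """Number of blocks needed for a file of file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def place_blocks(file_size, datanodes, replication=REPLICATION):
    """Return a table: block index -> DataNodes holding a replica.

    Placement is plain round-robin (an assumption for illustration);
    rack awareness is not modelled.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n = len(datanodes)
    return {
        b: [datanodes[(b + i) % n] for i in range(replication)]
        for b in range(num_blocks(file_size))
    }


table = place_blocks(1024 ** 3, ["dn1", "dn2", "dn3", "dn4"])
print(len(table), table[0])  # prints 8 ['dn1', 'dn2', 'dn3']
```

A 1 GB file therefore occupies 8 blocks and 24 block replicas in total.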

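MapReduce: 2 phases, Map and Reduce
- Map: processes the input as key-value pairs, in parallel on many workers (the mappers)
- Reduce: aggregates the output of Map, on workers called reducers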

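When to use it?
- Good fit: processing/analysis of log files
- Poor fit: Fibonacci for 1,000,000 ..., replacing MySQL

How does it work?
- The data lives in a DFS
- MapReduce runs as Master/Slave (to be continued)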

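Initialization
- The input in the DFS is split into M pieces, typically 16 to 64 MB each
- One copy of the program becomes the master; the other copies are workers
- The master assigns the map and reduce tasks to idle workers

Master
- For every task the master stores its state and the worker it runs on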
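The master's bookkeeping can be sketched as a table of tasks. The three states (idle, in-progress, completed) follow the Dean and Ghemawat paper referenced by this deck; the class layout, the method names and the rule that reduce tasks are handed out only after every map task has completed are simplifying assumptions of this sketch.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


class Master:
    """Tracks M map tasks and R reduce tasks: task -> [state, worker]."""

    def __init__(self, m, r):
        self.tasks = {("map", i): [State.IDLE, None] for i in range(m)}
        self.tasks.update({("reduce", j): [State.IDLE, None] for j in range(r)})

    def assign(self, worker):
        """Hand the first eligible idle task to worker, or return None."""
        maps_done = all(
            entry[0] is State.COMPLETED
            for task, entry in self.tasks.items()
            if task[0] == "map"
        )
        for task, entry in self.tasks.items():
            if entry[0] is State.IDLE and (task[0] == "map" or maps_done):
                entry[0], entry[1] = State.IN_PROGRESS, worker
                return task
        return None

    def complete(self, task):
        self.tasks[task][0] = State.COMPLETED


master = Master(m=2, r=1)
print(master.assign("worker-A"))  # prints ('map', 0)
```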

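Map
- A map worker reads its piece of the input (input split) from GFS, parses it into key-value pairs and passes each pair to the map function
- The intermediate pairs produced by map are buffered in memory
- Periodically the buffered pairs are written to local disk, divided into R regions by the partition function
- The locations of these regions are reported to the master
- The master forwards the locations to the reduce workers

Reduce
- A reduce worker is told by the master where its intermediate data is, and reads it
- It sorts the data by key, so that all values of the same key are grouped together
- For each key it passes the key and its values to the reduce function
- The output of reduce is appended to the output file of this reduce partition
- When a task is done, the worker informs the master. When all tasks are done, the master returns control to the user program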
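The partition function and the sort-and-group step can be sketched as follows. The default partitioning in the Dean and Ghemawat paper is hash(key) mod R; CRC32 is used here only because Python's built-in hash for strings changes between runs. The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, r):
    """hash(key) mod R, with CRC32 as a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % r


def split_into_regions(pairs, r):
    """Map side: divide intermediate (key, value) pairs into R regions."""
    regions = [[] for _ in range(r)]
    for key, value in pairs:
        regions[partition(key, r)].append((key, value))
    return regions


def group_by_key(pairs):
    """Reduce side: sort by key, then collect the values of each key."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return dict(grouped)


pairs = [("b", 1), ("a", 1), ("a", 2)]
print(group_by_key(pairs))  # prints {'a': [1, 2], 'b': [1]}
```

Because the region depends only on the key, every pair with the same key ends up at the same reducer, whichever mapper produced it.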

[Figure: the Master coordinates the workers; the Input feeds three Map workers, whose output goes to three Reduce workers, which write Output Part 1, Part 2 and Part 3.]

Example: 1/3
- Problem: counting over webserver logfiles, e.g. how often each url appears
- Solution: MapReduce. Write a map and a reduce function; MapReduce takes care of the rest

Example: 2/3
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(result)

Example: 3/3
M=3 mappers, R=2 reducers

Input:
    (d1, w1 w2 w4)    (d2, w1 w2 w3 w4)    (d3, w2 w3 w4)
    (d4, w1 w2 w3)    (d5, w1 w3 w4)       (d6, w1 w4 w2 w2)    (d7, w4 w2 w1)
    (d8, w2 w2 w3)    (d9, w1 w1 w3 w3)    (d10, w2 w1 w4 w3)

Counts per mapper:
    mapper 1 (d1-d3):   (w1,2) (w2,3) (w3,2) (w4,3)
    mapper 2 (d4-d7):   (w1,4) (w2,4) (w3,2) (w4,3)
    mapper 3 (d8-d10):  (w1,3) (w2,3) (w3,4) (w4,1)

Output per reducer:
    reducer 1 (w1, w4): (w1,9) (w4,7)
    reducer 2 (w2, w3): (w2,10) (w3,8)

Example: the inputs "To be, or not to be" and "To be is to do" pass through two map tasks and two reduce tasks.
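The map and reduce pseudocode of the example can be turned into a runnable, single-process simulation. This is a sketch, not Hadoop or Google code: the driver function run, the CRC32 partitioning and the grouping of the documents d1 to d10 into three input splits are assumptions made for illustration.

```python
import zlib
from collections import defaultdict

SPLITS = [  # M = 3 input splits, one per mapper
    {"d1": "w1 w2 w4", "d2": "w1 w2 w3 w4", "d3": "w2 w3 w4"},
    {"d4": "w1 w2 w3", "d5": "w1 w3 w4", "d6": "w1 w4 w2 w2", "d7": "w4 w2 w1"},
    {"d8": "w2 w2 w3", "d9": "w1 w1 w3 w3", "d10": "w2 w1 w4 w3"},
]
R = 2  # number of reducers


def map_fn(key, value):
    """key: document name; value: text of document. Emits (word, 1)."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts. Returns (word, total)."""
    return key, sum(values)


def run(splits, r):
    """Run map, shuffle and reduce sequentially in one process."""
    regions = [defaultdict(list) for _ in range(r)]
    for split in splits:                              # map phase
        for name, text in split.items():
            for word, count in map_fn(name, text):
                region = zlib.crc32(word.encode("utf-8")) % r
                regions[region][word].append(count)   # shuffle
    output = {}
    for region in regions:                            # reduce phase
        for word in sorted(region):
            key, total = reduce_fn(word, region[word])
            output[key] = total
    return output


print(sorted(run(SPLITS, R).items()))
# prints [('w1', 9), ('w2', 10), ('w3', 8), ('w4', 7)]
```

The totals do not depend on which reducer a word is sent to, only on all pairs of the same word reaching the same reducer.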


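Fault tolerance
- The master pings every worker periodically
- A worker that does not answer in time is marked as failed
- The master reschedules the tasks of the failed worker on other workers: its in-progress tasks and its completed map tasks, whose output was on the failed machine's local disk
- If the master itself fails, the job is aborted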

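Locality
- Input files are stored in the DFS in blocks (64 MB each), replicated on several machines
- Move computation near the data: the master schedules each map task on a machine that holds a replica of its input, or otherwise as close to one as possible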
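"Move computation near the data" amounts to a preference order when the master picks a worker for a map task. The sketch below assumes that "near" means "on the same rack"; the function pick_worker and the rack_of mapping are hypothetical.

```python
def pick_worker(block_replicas, idle_workers, rack_of):
    """Choose a worker for a map task whose input block lives on block_replicas.

    Preference order: an idle worker that holds a replica, then an idle
    worker on the same rack as a replica, then any idle worker.
    Returns None when no worker is idle.
    """
    if not idle_workers:
        return None
    for worker in idle_workers:
        if worker in block_replicas:
            return worker                      # data is local
    replica_racks = {rack_of[node] for node in block_replicas}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker                      # data is one hop away
    return idle_workers[0]                     # data must cross racks


rack_of = {"a": "rack1", "b": "rack1", "c": "rack2", "d": "rack3"}
print(pick_worker({"a"}, ["c", "b"], rack_of))  # prints b
```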

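Backup tasks (speculative execution)
- Near the end of the job the master starts backup copies of the tasks still running; a task is complete as soon as either copy finishes

Partitioning, combining, shuffling
- The combiner runs on the map side and merges the map output before it is sent to the reducers
- It sits between map and reduce and usually reuses the code of the reducer

Monitoring
- The master offers a web interface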
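The effect of a combiner is easiest to see on word count: it merges the (word, 1) pairs of one mapper before they travel to the reducers. A minimal sketch, with hypothetical function names, using the document "w2 w2 w3" as input:

```python
from collections import Counter


def map_words(text):
    """Map side: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Local aggregation on the map side: one pair per distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


raw = map_words("w2 w2 w3")
combined = combine(raw)
print(len(raw), combined)  # prints 3 [('w2', 2), ('w3', 1)]
```

This works for word count because addition is associative and commutative, so summing partial sums gives the same total as summing the individual ones.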

Hadoop
- Open-source implementation of MapReduce
- Written in Java
- Uses HDFS
- http://hadoop.apache.org/
- http://wiki.apache.org/hadoop/
- Users: Yahoo!, Amazon, Facebook, Twitter, ...

Use cases 1/3

- Large Scale Image Conversions
- 100 Amazon EC2 instances, 4 TB raw TIFF data
- 11 million PDFs in 24 hours for $240

- Internal log processing
- Reporting, analytics and machine learning
- Cluster of 1110 machines, 8800 cores and 12 PB raw storage
- Open source contributors (Hive)

- Store and process tweets, logs, etc.
- Open source contributors (hadoop-lzo)

Use cases 2/3

- 100,000 CPUs in 25,000 computers
- Content/Ads Optimization, Search index
- Machine learning (e.g. spam filtering)
- Open source contributors (Pig)

- Natural language search (through Powerset)
- 400 nodes in EC2, storage in S3
- Open source contributors (!) to HBase

- ElasticMapReduce service
- On demand elastic Hadoop clusters for the Cloud

Use cases 3/3

- ETL processing, statistics generation
- Advanced algorithms for behavioral analysis and targeting

- Used for discovering People you May Know, and for other apps
- 3x30 node cluster, 16 GB RAM and 8 TB storage

- Leading Chinese language search engine
- Search log analysis, data mining
- 300 TB per week
- 10 to 500 node clusters

Reference
Dean, Jeff and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce-osdi04.pdf