R + 15 minutes = Hadoop cluster

useR Vignette: R + 15 minutes = Hadoop cluster Greater Boston useR Group February 2011 by Jeffrey Breen [email protected]

Description

Overview of JD Long's experimental "segue" package which marshals and manages Hadoop clusters (for non-Big Data problems) with Amazon's Elastic MapReduce service. Presented at the February 2011 meeting of the Greater Boston useR Group.

Transcript of R + 15 minutes = Hadoop cluster

Page 1: R + 15 minutes = Hadoop cluster

useR Vignette:

R + 15 minutes = Hadoop cluster

Greater Boston useR Group, February 2011

by

Jeffrey Breen, [email protected]

Page 2: R + 15 minutes = Hadoop cluster

Greater Boston useR Meeting, February 2011, Slide 2. useR Vignette: R + 15 minutes = Hadoop Cluster

Agenda

● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?

Page 3: R + 15 minutes = Hadoop cluster


MapReduce, Hadoop and Big Data

● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
● Designed to process huge data sets
  – "huge" = "all of Facebook's web logs"
  – Yahoo! sorted 1 TB in 62 seconds in May 2009
  – The HDFS distributed file system makes replication decisions based on knowledge of network topology
● Amazon Elastic MapReduce is the full Hadoop stack running on EC2

Page 4: R + 15 minutes = Hadoop cluster


MapReduce = Map + shuffle + Reduce

Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
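The map/shuffle/reduce pipeline in the diagram can be mimicked locally in plain R. The sketch below (not part of the original talk; the document names and data are illustrative) counts words across a few "documents": the map step emits (word, 1) pairs, the shuffle groups them by word, and the reduce sums each group.

```r
# Local sketch of the MapReduce pattern, using a toy word count
docs <- list("hadoop maps and reduces", "r maps too")

# Map: emit one (key, value) pair per word, as a named numeric vector
mapped <- unlist(lapply(docs, function(d) {
  words <- strsplit(d, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))

# Shuffle: group the values by key (the word)
shuffled <- split(unname(mapped), names(mapped))

# Reduce: sum the counts within each group
counts <- sapply(shuffled, function(v) Reduce(`+`, v))
counts["maps"]   # 2 ("maps" appears in both documents)
```

On a real cluster, each map task runs on a shard of the input and the shuffle moves pairs between nodes; here everything happens in one R session, but the three stages are the same.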

Page 5: R + 15 minutes = Hadoop cluster


But I don't have Big Data

● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run

● Had a key insight: the input could be a small amount of data (like 1:1000) serving as random seeds for the simulation code in the "mapper" function

● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.

● Use Amazon's cloud to scale up quickly as needed

Page 6: R + 15 minutes = Hadoop cluster


Load the segue library

> library(segue)

Loading required package: rJava

Loading required package: caTools

Loading required package: bitops

Segue did not find your AWS credentials. Please run the setCredentials() function.

> setCredentials('YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')

Page 7: R + 15 minutes = Hadoop cluster


Start the cluster

> myCluster <- createCluster(numInstances=5)

STARTING - 2011-01-04 15:07:53

[…]

BOOTSTRAPPING - 2011-01-04 15:11:28

[…]

WAITING - 2011-01-04 15:15:35

Your Amazon EMR Hadoop Cluster is ready for action.

Remember to terminate your cluster with stopCluster().

Amazon is billing you!

Page 8: R + 15 minutes = Hadoop cluster


Estimate π stochastically

> estimatePi <- function(seed){

set.seed(seed)

numDraws <- 1e6

r <- .5 #radius

x <- runif(numDraws, min=-r, max=r)

y <- runif(numDraws, min=-r, max=r)

inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)

return(sum(inCircle) / length(inCircle) * 4)

}
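Before paying for a cluster, the mapper can be sanity-checked locally with plain lapply. This check is not in the original talk; it simply re-runs estimatePi (reproduced from the slide above) on a handful of seeds and confirms the averaged estimate lands near π.

```r
# estimatePi as defined on the previous slide
estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- 0.5  # radius of the inscribed circle
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  # fraction of points inside the circle approximates pi/4
  inCircle <- ifelse(sqrt(x^2 + y^2) < r, 1, 0)
  sum(inCircle) / length(inCircle) * 4
}

# Ten seeds run in seconds locally; no cluster needed for a smoke test
localEstimates <- lapply(1:10, estimatePi)
localPi <- mean(unlist(localEstimates))
round(localPi, 2)  # close to pi
```

Because emrlapply has the same calling convention as lapply, a mapper that works here should behave identically on the cluster.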

Page 9: R + 15 minutes = Hadoop cluster


Run the simulation

> seedList <- as.list(1:1e3)

> myEstimates <- emrlapply( myCluster, seedList, estimatePi )

RUNNING - 2011-01-04 15:22:28

[…]

WAITING - 2011-01-04 15:32:18

> myPi <- Reduce(sum, myEstimates) / length(myEstimates)

> format(myPi, digits=10)

[1] "3.141586544"

> format(pi, digits=10)

[1] "3.141592654"

Page 10: R + 15 minutes = Hadoop cluster


Won't break the bank

● Total cost: $0.15

Standard On-Demand Instances:

Instance type     EC2 price per hour   Elastic MapReduce price per hour
Small (default)   $0.085               $0.015
Large             $0.34                $0.06
Extra Large       $0.68                $0.12

Page 11: R + 15 minutes = Hadoop cluster


Want to know more?

● JD Long's segue package
  – http://code.google.com/p/segue/
● Hadoop
  – http://hadoop.apache.org/
  – Book: http://oreilly.com/catalog/0636920010388
● My blog
  – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/