Data Processing at the Speed of 100 Gbps using Apache Crail

Transcript of a 41-slide presentation by Patrick Stuedi, IBM Research.

Page 1: Data Processing at the Speed of 100 Gbps using Apache Crail

Patrick Stuedi, IBM Research

Page 6: The CRAIL Project: Overview

● Data Processing Framework (e.g., Spark, TensorFlow, λ Compute)
● Modules on top of Crail Store:
– Spark-IO: shuffle/broadcast acceleration
– Albis: efficient storage of relational data
– Pocket: data sharing for serverless applications
● Interfaces: FS, HDFS, KV, Streaming
● Crail Store: fast sharing of ephemeral data
● Storage and network plugins: TCP, RDMA, NVMeF, SPDK
● Fast network, e.g., 100 Gbps RoCE (100 Gbps, 10 μsec)
● Hardware: DRAM, NVMe, PCM, GPU, ...

Page 7: Outline

● Why CRAIL
● Crail Store
● Workload-specific I/O processing
– File format, shuffle engine, serverless
● Use cases:
– Disaggregation
– Workloads: SQL, Machine Learning

Page 8: #1 Performance Challenge (1)

Page 10: #1 Performance Challenge (2)

Software stack of a sorting application (HotNets’16):
● Sorting Application
● Data Processing Framework: Sorter, Serializer, Netty
● JVM
● Network path: sockets, TCP/IP, Ethernet, NIC
● Storage path: filesystem, block layer, iSCSI, SSD

[Figure legend: fetch chunk over the network; process chunk in reduce task.]

Page 12: #1 Performance Challenge (2)

Same software stack as on the previous slide (HotNets’16).

Software overheads are spread over the entire stack

Page 13: #2 Diversity

Diverse hardware technologies / complex programming APIs / many frameworks

Example APIs: SPDK, RDMA Verbs

Page 14: #3 Ephemeral Data

Ephemeral data has unique properties (e.g., wide range of I/O size)

● Serverless (AWS Lambda) applications
● Spark applications

Page 17: CRAIL Approach (1)

Abstract hardware via high-level storage interface

● Hardware-specific plugins (storage tiers)
● Most I/O operations can conveniently be implemented on a storage abstraction
● What is the right API? (FS, KV, ?)

Page 18: CRAIL Approach (2)

Filesystem-like interface:
● Hierarchical namespace
– Helps to organize data (shuffle, tmp, etc.) for different jobs
● Separate data from metadata plane
– Reading/writing involves block metadata lookup
– Cheap on a low-latency network (few μsecs)
– Flexible: data objects can be of arbitrary size
● Specific data types
– KeyValue files: last create wins
– Shuffle files: efficient reading of multiple files in a directory
● Let applications control the details
– Data placement policy: which storage node or storage tier to use
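
The filesystem-like interface described on this slide can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the Crail API: `NamespaceSketch`, `MetadataServer`, `DataPlane`, and `BlockInfo` are hypothetical names invented here, and each file is reduced to a single block. It shows a hierarchical namespace, a read as "metadata lookup, then data access", last-create-wins semantics, and application-chosen placement.

```scala
import scala.collection.mutable

// Hypothetical toy model for illustration only; none of these names are Crail classes.
object NamespaceSketch {
  // Metadata plane entry: where the data of a path lives.
  final case class BlockInfo(storageNode: String, tier: String, blockId: Int)

  final class MetadataServer {
    private val files = mutable.Map.empty[String, Vector[BlockInfo]]
    private var nextBlock = 0

    // Creating a path that already exists replaces it ("last create wins").
    // The caller chooses the placement: storage node and storage tier.
    def create(path: String, node: String, tier: String): BlockInfo = {
      val info = BlockInfo(node, tier, nextBlock)
      nextBlock += 1
      files(path) = Vector(info)
      info
    }

    def lookup(path: String): Option[Vector[BlockInfo]] = files.get(path)

    // List the direct children of a directory, e.g. all files of one shuffle.
    def list(dir: String): List[String] = {
      val prefix = if (dir.endsWith("/")) dir else dir + "/"
      files.keys
        .filter(p => p.startsWith(prefix) && !p.drop(prefix.length).contains('/'))
        .toList
        .sorted
    }
  }

  // Data plane: blocks are addressed by id, independent of the namespace.
  final class DataPlane {
    private val blocks = mutable.Map.empty[Int, Array[Byte]]
    def write(info: BlockInfo, bytes: Array[Byte]): Unit = { blocks(info.blockId) = bytes }
    def read(info: BlockInfo): Array[Byte] = blocks(info.blockId)
  }

  def main(args: Array[String]): Unit = {
    val meta = new MetadataServer
    val store = new DataPlane
    store.write(meta.create("/job1/shuffle/f0", "node-a", "DRAM"), "hello".getBytes("UTF-8"))
    store.write(meta.create("/job1/shuffle/f1", "node-b", "NVMe"), "world".getBytes("UTF-8"))
    // A read is a metadata lookup followed by a data access.
    val located = meta.lookup("/job1/shuffle/f0").get
    println(new String(store.read(located.head), "UTF-8"))
    println(meta.list("/job1/shuffle"))
  }
}
```

In this toy model a replaced file's old block is simply left behind; a real store would have to reclaim it.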

Page 19: CRAIL Approach (3)

Careful software design:
● Leverage user-level APIs
– RDMA, NVMf, DPDK, SPDK
● Separate data from control operations
– Memory allocation, string parsing, etc.
● Efficient non-blocking operations
– Avoid army of threads, let the hardware do the work
● Leverage byte-addressable storage
– Transmit no more data than what is read/written
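
The non-blocking and byte-addressable points above can be sketched with standard Scala futures. This is an illustration only: `readAsync` and the in-memory `store` map are hypothetical stand-ins, and the `Future` here runs on the JVM's global thread pool purely so the example is runnable. In the design the slide describes, the completion would instead be signalled by the network or storage hardware.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical sketch; readAsync is not a Crail call.
object NonBlockingSketch {
  // Asynchronous range read: returns immediately with a handle to the result.
  def readAsync(store: Map[String, Array[Byte]], path: String,
                offset: Int, length: Int): Future[Array[Byte]] =
    Future {
      // Byte-addressable access: return only the requested range, not a whole block.
      store(path).slice(offset, offset + length)
    }

  def main(args: Array[String]): Unit = {
    val store = Map(
      "/tmp/f0" -> "abcdefgh".getBytes("UTF-8"),
      "/tmp/f1" -> "12345678".getBytes("UTF-8"))
    // Issue both reads first, then wait once for all of them.
    val pending = List(readAsync(store, "/tmp/f0", 2, 3), readAsync(store, "/tmp/f1", 0, 4))
    val results = Await.result(Future.sequence(pending), 5.seconds)
    results.foreach(bytes => println(new String(bytes, "UTF-8")))
  }
}
```

The design point is the calling pattern: the caller keeps several operations in flight and synchronizes once, rather than dedicating a blocked thread to each request.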

Page 21: Crail Store: Architecture

● Hierarchical namespace, multiple data types
● Distributed storage over DRAM and flash
● Files may span multiple storage tiers
● Can be read like a single file

Page 22: Crail Store: Deployment Modes

Node roles: application compute, DRAM storage server, flash storage server, metadata server

Deployment modes:
● Compute/storage co-located
● Compute/storage disaggregated
● Flash storage disaggregation

Page 23: Crail Store: Read Throughput

Performance of a single client running on one core only

[Figure: single-client (1 core) throughput in Gbit/s (0 to 100) versus buffer size (128B to 1MB), Crail compared with Alluxio.]

[Figure: throughput in GB/s (0 to 12) versus buffer size (128 to 512K) for DRAM and NVMf; series: NVMf - direct, NVMf - buffered, DRAM - buffered.]

Crail reaches line speed for an I/O size of 1K

Page 24: Crail Store: Read Latency

[Figure: read latency in μs (0 to 50) versus buffer size (4B to 256K); panels: Remote DRAM, NVMf (remote NVMe SSD, 3D XPoint).]

[Figure: latency versus key size; series: RAMCloud/read/C, RAMCloud/read/Java, Crail (lookup & read), Crail (lookup only).]

Crail remote read latencies (DRAM and NVM) are very close to the hardware latencies

Page 25: Metadata server scalability

A single metadata server can process 10M Crail lookup ops/sec

[Figure: IOPS in millions (0 to 35) versus number of clients (0 to 70); series: Namenode IOPS, 2 Namenodes IOPS, 4 Namenodes IOPS, Network interfaces.]

Page 26: Running Workloads: MapReduce

[The overview diagram of the CRAIL project is shown again: data processing framework, Spark-IO / Albis / Pocket, Crail Store, fast network, storage hardware.]

Page 27: Spark GroupBy (80M keys, 4K)

[Diagram: map-side executors SparkExec0, SparkExec1, ..., SparkExecN hash their records into per-bucket directories in Crail (/shuffle/1/ f0 f1 ... fN, /shuffle/2/ f0 f1 ... fN, /shuffle/3/ f0 f1 ... fN); reduce-side executors SparkExec0, SparkExec1, ..., SparkExecN read them back.]

1. map: HashAppend (file)
2. reduce: fetchBucket (dir)

[Figure: throughput in Gbit/s (0 to 100) versus elapsed time in seconds (0 to 120) for 1 core, 4 cores, and 8 cores; one panel for Spark/Vanilla and one for Spark/Crail. Annotated speedups: 5x, 2.5x, 2x.]

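val pairs = sc.parallelize(1 to tasks, tasks).flatMap(_ => {
  // tasks, numKeys and initValues are defined elsewhere in the benchmark
  var values = new Array[(Long, Array[Byte])](numKeys)
  values = initValues(values)
  values // return the generated key/value pairs so flatMap can emit them
}).cache().groupByKey().count()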
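
The two shuffle steps named on this slide (map: HashAppend to a file, reduce: fetchBucket on a directory) can be sketched without Spark or Crail. This is a minimal sketch under assumptions: the `Store` map stands in for the storage namespace, and `bucketOf`, `hashAppend`, and `fetchBucket` are hypothetical helpers written for this example rather than the Spark-IO implementation.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of the shuffle layout; not the Spark-IO plugin.
object ShuffleSketch {
  // Path -> records appended to that file.
  type Store = mutable.Map[String, Vector[(Long, String)]]

  def bucketOf(key: Long, numBuckets: Int): Int =
    java.lang.Math.floorMod(key, numBuckets.toLong).toInt

  // Map side: hash each record to a bucket and append it to this mapper's
  // file inside that bucket's directory (/shuffle/<bucket>/f<mapperId>).
  def hashAppend(store: Store, mapperId: Int,
                 records: Seq[(Long, String)], numBuckets: Int): Unit =
    records.foreach { case (key, value) =>
      val path = s"/shuffle/${bucketOf(key, numBuckets)}/f$mapperId"
      store(path) = store.getOrElse(path, Vector.empty[(Long, String)]) :+ (key -> value)
    }

  // Reduce side: read every file in one bucket directory and group by key.
  def fetchBucket(store: Store, bucket: Int): Map[Long, Vector[String]] = {
    val dir = s"/shuffle/$bucket/"
    store.toVector
      .filter { case (path, _) => path.startsWith(dir) }
      .sortBy(_._1)
      .flatMap(_._2)
      .groupBy(_._1)
      .map { case (key, kvs) => key -> kvs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val store: Store = mutable.Map.empty
    hashAppend(store, 0, Seq(1L -> "a", 2L -> "b"), numBuckets = 2)
    hashAppend(store, 1, Seq(1L -> "c", 3L -> "d"), numBuckets = 2)
    println(fetchBucket(store, 1))
  }
}
```

Because each reducer owns one directory, it needs a single directory read rather than one request per mapper, which is the property the "shuffle files" data type is meant to make efficient.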

Page 28: Sorting 12.8 TB on 128 nodes

Sorting rate of Crail/Spark only 27% slower than rate of Winner 2016 (native C distributed sorting benchmark)

[Figure: sorting time [sec] and network throughput [Gb/s] by task ID, Spark versus Spark/Crail.]

Page 29: DRAM & Flash Disaggregation

[Diagram: Spark executors (SparkExec0, SparkExec1, ..., SparkExecN) on compute nodes, separate storage nodes; label: Input/Output/Shuffle.]

[Figure: sorting runtime in seconds (0 to 200), split into Map and Reduce, for Crail/DRAM, Crail/NVMe, and Vanilla/Alluxio.]

Using Crail, a Spark 200GB sorting workload can be run with memory and flash disaggregated at no extra cost

Page 30: Running Workloads: SQL

[The overview diagram of the CRAIL project is shown again: data processing framework, Spark-IO / Albis / Pocket, Crail Store, fast network, storage hardware.]

Page 31: Reading Relational Data

None of the common file formats delivers a performance close to the hardware speed

[Figure: goodput in Gbps.]

Page 32: Revisiting Design Principles

● Traditional assumption: CPU is fast, I/O is slow
– Use compression, encoding, etc.
– Pack data and metadata together
– Avoid metadata lookups

Mismatch in case of fast I/O hardware!

● Albis: new file format designed for fast I/O hardware
● Albis design principles
– Avoid CPU pressure, i.e., no compression, encoding, etc.
– Simple metadata management

Page 33: Reading Relational with Albis

Albis/Crail delivers 2-30x performance improvements over other formats

Page 34: TPC-DS using Albis/Crail

Albis/Crail delivers up to 2.5x performance gains

Page 35: Running Workloads: Serverless

[The overview diagram of the CRAIL project is shown again: data processing framework, Spark-IO / Albis / Pocket, Crail Store, fast network, storage hardware.]

Page 36: Serverless Computing

● Data sharing implemented using remote storage
– Enables fast and fine-grained scaling
● Problem: existing storage platforms not suitable
– Slow (e.g., S3)
– No dynamic scaling (e.g., Redis)
– Designed for either small or large data sets
● Can we use Crail? Not as is.
– Most clouds don’t support RDMA, NVMf, etc.
– Lacks automatic & elastic resource management

Page 37: Pocket

● An elastic distributed data store for ephemeral data sharing in serverless analytics

[Figure: execution time (sec) versus resource usage ($/hr).]

Pocket dynamically rightsizes storage resources (nodes, media) in an attempt to find a spot with a good performance/price ratio

Page 38: Pocket: Resource Utilization

Pocket cost-effectively allocates resources based on user/framework hints

Page 39: Conclusions

● Effectively using high-performance I/O hardware for data processing is challenging
● Crail is an attempt to re-think how data processing systems should interact with network and storage hardware
– User-level I/O
– Storage disaggregation
– Memory/flash convergence
– Elastic resource provisioning

Page 40: References

● Crail: A High-Performance I/O Architecture for Distributed Data Processing, IEEE Data Engineering Bulletin 2017
● Albis: High-Performance File-format for Big Data, Usenix ATC’18
● Navigating Storage for Serverless Computing, Usenix ATC’18
● Pocket: Ephemeral Storage for Serverless Analytics, OSDI’18 (to appear)
● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit’17
● Serverless Machine Learning using Crail, Spark Summit’18
● Apache Crail, http://crail.apache.org

Page 41: Contributors

Animesh Trivedi, Jonas Pfefferle, Bernard Metzler, Adrian Schuepbach, Ana Klimovic, Yawen Wang, Michael Kaufmann, Yuval Degani, ...