Data Processing at the Speed of 100 Gbps using Apache Crail

Patrick Stuedi, IBM Research

Transcript of "Data Processing at the Speed of 100 Gbps using Apache Crail"


The CRAIL Project: Overview

[Architecture diagram: a data processing framework (e.g., Spark, TensorFlow, λ compute) sits on top of Crail Store, which exposes FS, HDFS, KV, and streaming interfaces over RDMA, TCP, NVMeF, and SPDK, on a fast network (e.g., 100 Gbps RoCE, ~10 μsec) backed by DRAM, NVMe, PCM, GPU memory, etc. On top of the store: Spark-IO for shuffle/broadcast acceleration, Albis for efficient storage of relational data, and Pocket for data sharing in serverless applications; the store itself provides fast sharing of ephemeral data.]

Outline

● Why CRAIL
● Crail Store
● Workload-specific I/O processing: file format, shuffle engine, serverless
● Use cases: disaggregation; workloads: SQL, machine learning

#1 Performance Challenge (1)

#1 Performance Challenge (2)

[Stack diagram (HotNets'16): a sorting application on a data processing framework (JVM, Sorter/Serializer, Netty, sockets) runs over TCP/IP, Ethernet, and the NIC on the network path, and over the filesystem, block layer, iSCSI, and the SSD on the storage path. Chunks are fetched over the network and processed in the reduce task.]


Software overheads are spread over the entire stack.

#2 Diversity

Diverse hardware technologies, complex programming APIs (e.g., SPDK, RDMA verbs), and many frameworks.

#3 Ephemeral Data

Ephemeral data has unique properties (e.g., a wide range of I/O sizes), in both serverless (AWS Lambda) applications and Spark applications.


CRAIL Approach (1)

Abstract hardware via a high-level storage interface, with hardware-specific plugins (storage tiers); see the sketch below. Most I/O operations can conveniently be implemented on top of a storage abstraction. But what is the right API? (FS, KV, ...?)
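One way to picture the plugin idea is a narrow tier interface that the store programs against, with DRAM, NVMf, etc. plugged in behind it. The following Scala sketch is illustrative only; the interface shown is an assumption, not Crail's actual storage SPI.

import java.nio.ByteBuffer

// Sketch of "hardware-specific plugins": storage tiers implement one
// narrow interface, so the layers above stay hardware-agnostic.
// (Illustrative interface, not Crail's actual SPI.)
trait StorageTier {
  def name: String
  def write(block: Long, data: ByteBuffer): Unit
  def read(block: Long, dst: ByteBuffer): Unit
}

// A DRAM tier backed by a simple in-memory map.
final class DramTier extends StorageTier {
  private val blocks = scala.collection.mutable.Map.empty[Long, ByteBuffer]
  val name = "DRAM"
  def write(block: Long, data: ByteBuffer): Unit =
    blocks(block) = data.duplicate()
  def read(block: Long, dst: ByteBuffer): Unit =
    dst.put(blocks(block).duplicate())
}

// An NVMf or SPDK tier would implement the same trait but issue user-level
// I/O instead of copying memory; the namespace and metadata logic above
// never changes.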

CRAIL Approach (2)

Filesystem-like interface (sketched below):
● Hierarchical namespace
– Helps to organize data (shuffle, tmp, etc.) for different jobs
● Separate data plane from metadata plane
– Reading/writing involves a block metadata lookup
– Cheap on a low-latency network (a few μsecs)
– Flexible: data objects can be of arbitrary size
● Specific data types
– Key-value files: last create wins
– Shuffle files: efficient reading of multiple files in a directory
● Let applications control the details
– Data placement policy: which storage node or storage tier to use
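To make the interface concrete, here is a minimal client sketch in the shape of the org.apache.crail API: create a file (a metadata-plane operation), write it, then look it up and read it back. Constructor and method signatures vary across Crail versions, so treat the exact calls below as an approximation and check them against the Crail javadocs.

import org.apache.crail._
import org.apache.crail.conf.CrailConfiguration

object CrailClientSketch {
  def main(args: Array[String]): Unit = {
    // Connect to the namenode configured in crail-site.conf.
    val store = CrailStore.newInstance(new CrailConfiguration())

    // create() only touches the metadata plane; on a low-latency
    // network this costs a few microseconds.
    val file = store
      .create("/tmp/example", CrailNodeType.DATAFILE,
              CrailStorageClass.DEFAULT, CrailLocationClass.DEFAULT)
      .get()
      .asFile()

    // Data goes straight to the storage tier chosen by the placement policy.
    val out = file.getBufferedOutputStream(4096)
    out.writeInt(42)
    out.close()

    // Reading = one metadata lookup, then RDMA/NVMf reads of the blocks.
    val in = store.lookup("/tmp/example").get().asFile()
      .getBufferedInputStream(4)
    println(in.readInt())
    in.close()
    store.close()
  }
}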

CRAIL Approach (3)

Careful software design (illustrated below):
● Leverage user-level APIs: RDMA, NVMf, DPDK, SPDK
● Separate data operations from control operations: memory allocation, string parsing, etc.
● Efficient non-blocking operations: avoid an army of threads, let the hardware do the work
● Leverage byte-addressable storage: transmit no more data than what is read/written
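The non-blocking point is easiest to see in code: issue many reads up front, then harvest completions, instead of parking one thread per outstanding request. The sketch below uses plain Java futures to stay self-contained; readBlock is a hypothetical stand-in for an asynchronous Crail stream read.

import java.nio.ByteBuffer
import java.util.concurrent.{CompletableFuture, Executors}

object NonBlockingSketch {
  private val pool = Executors.newFixedThreadPool(2)

  // Hypothetical stand-in for an async read: in Crail this would post an
  // RDMA read and return immediately with a future for the completion.
  def readBlock(offset: Long, len: Int): CompletableFuture[ByteBuffer] =
    CompletableFuture.supplyAsync(() => ByteBuffer.allocate(len), pool)

  def main(args: Array[String]): Unit = {
    // Post 64 reads back-to-back; the hardware works on them in parallel.
    val pending = (0 until 64).map(i => readBlock(i * 4096L, 4096))
    // Collect completions afterwards, with no per-request thread.
    val bytes = pending.map(_.join().capacity).sum
    println(s"read $bytes bytes")
    pool.shutdown()
  }
}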


Crail Store: Architecture

Hierarchical namespace with multiple data types; distributed storage over DRAM and flash. Files may span multiple storage tiers and can still be read like a single file.

Crail Store: Deployment Modes

[Diagram: application compute nodes, DRAM storage servers, flash storage servers, and a metadata server, arranged in three modes: compute/storage co-located, compute/storage disaggregated, and flash-storage disaggregation.]

Crail Store: Read Throughput

[Plots: single-client (1 core) read throughput vs. buffer size. Left: throughput (Gbit/s) for 128B-1MB buffers, Crail vs. Alluxio. Right: throughput (GB/s) for 128B-512K buffers, for NVMf direct, NVMf buffered, and DRAM buffered.]

Crail reaches line speed at an I/O size of 1KB, with a single client running on one core only.

Crail Store: Read Latency

Remote DRAMNVMf

Buffer size

0

10

20

30

40

50

4B 1K 4K 16K 64K 256K

late

ncy

[u

s]

key size

124RAMCloud/read/CRAMCloud/read/JavaCrail (lookup & read)

Crail (lookup only)

Buffer size Buffer size

Remote NVMe SSD (3D XPoint)

Crail remote read latencies (DRAM and NVM) are very close to the hardware latencies

Buffer sizeBuffer size

Metadata Server Scalability

[Plot: metadata IOPS (millions) vs. number of clients (0-70), for 1, 2, and 4 namenodes and varying numbers of network interfaces.]

A single metadata server can process 10M Crail lookup ops/sec.

Running Workloads: MapReduce

[Overview diagram again, this time highlighting Spark-IO.]

Spark GroupBy (80M keys, 4K)

[Diagram: Spark executors 0..N hash-partition their map output into Crail directories /shuffle/1/, /shuffle/2/, /shuffle/3/, each holding files f0, f1, ..., fN. Step 1, map: HashAppend to a file; step 2, reduce: fetchBucket on a directory.]

[Plots: network throughput (Gbit/s) vs. elapsed time (seconds) for Spark/Vanilla and Spark/Crail with 1, 4, and 8 cores; annotated speedups of 5x, 2.5x, and 2x.]

The workload is generated with:

val pairs = sc.parallelize(1 to tasks, tasks).flatMap { _ =>
  // Each task materializes numKeys (Long, Array[Byte]) pairs.
  var values = new Array[(Long, Array[Byte])](numKeys)
  values = initValues(values)
  values
}.cache().groupByKey().count()
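The HashAppend/fetchBucket pattern from the diagram can be sketched as follows; this is a self-contained illustration with an in-memory map standing in for Crail files, and the helper names are hypothetical rather than Crail's shuffle-engine API.

import scala.collection.mutable

object ShuffleSketch {
  // "/shuffle/<bucket>/f<task>" -> appended records; a stand-in for
  // Crail files written by the map tasks.
  private val store =
    mutable.Map.empty[String, mutable.Buffer[(Long, Array[Byte])]]

  // Map side: hash each record and append it to this task's file in the
  // target bucket directory.
  def hashAppend(taskId: Int, numBuckets: Int,
                 records: Iterator[(Long, Array[Byte])]): Unit =
    records.foreach { case rec @ (key, _) =>
      val bucket = (key.hashCode & Int.MaxValue) % numBuckets
      store.getOrElseUpdate(s"/shuffle/$bucket/f$taskId",
                            mutable.Buffer.empty) += rec
    }

  // Reduce side: fetch the whole bucket directory; in Crail the files of
  // a directory can be read efficiently as if they were a single file.
  def fetchBucket(bucket: Int): Iterator[(Long, Array[Byte])] =
    store.collect {
      case (path, recs) if path.startsWith(s"/shuffle/$bucket/") => recs
    }.flatten.iterator
}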

Sorting 12.8 TB on 128 Nodes

[Plots: per-task sorting time (sec) and network throughput (Gb/s), vanilla Spark vs. Spark/Crail.]

The sorting rate of Spark/Crail is only 27% slower than that of the 2016 winner, a native C distributed sorting benchmark.

DRAM & Flash Disaggregation

[Diagram: Spark executors on compute nodes access input/output/shuffle data on separate storage nodes. Plot: sorting runtime (sec), split into map and reduce phases, for Crail/DRAM, Crail/NVMe, and Vanilla/Alluxio.]

Using Crail, a 200GB Spark sorting workload can be run with memory and flash disaggregated at no extra cost.

Running Workloads: SQL

[Overview diagram again, this time highlighting Albis.]

Reading Relational Data

[Plot: goodput (Gbps) when reading relational data with common file formats.]

None of the common file formats delivers performance close to the hardware speed.

Revisiting Design Principles

● Traditional assumption: the CPU is fast, I/O is slow
– Use compression, encoding, etc.
– Pack data and metadata together
– Avoid metadata lookups
– A mismatch in the case of fast I/O hardware!
● Albis: a new file format designed for fast I/O hardware
● Albis design principles (sketched below)
– Avoid CPU pressure, i.e., no compression, encoding, etc.
– Simple metadata management
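A toy encoder makes the principles concrete: fixed binary row layout, no compression or encoding, and trivially simple metadata (here just a fixed row size). This is an illustrative sketch, not the actual Albis on-disk format.

import java.nio.ByteBuffer

object PlainRowFormat {
  case class Row(id: Long, amount: Double, flag: Byte)

  // Fixed-size rows keep the metadata trivial: row i lives at i * RowSize.
  val RowSize: Int = 8 + 8 + 1

  // Write raw binary rows, with no compression and no encoding step,
  // so the CPU stays out of the way of a 100 Gbps NIC or NVMe device.
  def write(rows: Seq[Row]): ByteBuffer = {
    val buf = ByteBuffer.allocate(rows.length * RowSize)
    rows.foreach { r =>
      buf.putLong(r.id); buf.putDouble(r.amount); buf.put(r.flag)
    }
    buf.flip()
    buf
  }

  // Reading a row is one bounded slice and three field reads; there is
  // no decompression or dictionary decode on the critical path.
  def read(buf: ByteBuffer, i: Int): Row = {
    val b = buf.duplicate()
    b.position(i * RowSize)
    Row(b.getLong, b.getDouble, b.get)
  }
}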

Reading Relational Data with Albis

Albis on Crail delivers 2-30x performance improvements over the other formats.

TPC-DS using Albis/Crail

Albis/Crail delivers up to 2.5x performance gains

Running Workloads: Serverless

[Overview diagram again, this time highlighting Pocket.]

Serverless Computing

● Data sharing is implemented using remote storage
– Enables fast and fine-grained scaling
● Problem: existing storage platforms are not suitable
– Slow (e.g., S3)
– No dynamic scaling (e.g., Redis)
– Designed for either small or large data sets
● Can we use Crail? Not as is.
– Most clouds don't support RDMA, NVMf, etc.
– Crail lacks automatic & elastic resource management

Pocket

● An elastic distributed data store for ephemeral data sharing in serverless analytics (see the sketch below)
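From a lambda's point of view, such a store is essentially put/get on ephemeral objects plus job-level registration that carries rightsizing hints. The sketch below mimics that shape with an in-memory map; the API names are hypothetical, not Pocket's actual client interface.

import scala.collection.concurrent.TrieMap

object EphemeralStoreSketch {
  // In-memory stand-in for distributed DRAM/flash storage nodes.
  private val objects = TrieMap.empty[String, Array[Byte]]

  // Jobs register with capacity/throughput hints, which the controller
  // uses to rightsize storage nodes and media for the job.
  def registerJob(jobId: String, capacityGb: Int, peakGbps: Int): Unit =
    println(s"job $jobId: provision ~${capacityGb}GB at ~${peakGbps}Gbps")

  // Lambdas exchange ephemeral objects via put/get.
  def put(jobId: String, key: String, value: Array[Byte]): Unit =
    objects.put(s"$jobId/$key", value)

  def get(jobId: String, key: String): Option[Array[Byte]] =
    objects.get(s"$jobId/$key")

  // Deregistering lets the store reclaim the job's resources immediately.
  def deregisterJob(jobId: String): Unit =
    objects.keys.filter(_.startsWith(jobId + "/")).foreach(objects.remove)
}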

Pocket: Resource Utilization

[Plot: execution time (sec) vs. resource usage ($/hr). Pocket dynamically rightsizes storage resources (nodes, media) in an attempt to find a spot with a good performance/price ratio.]

Pocket cost-effectively allocates resources based on user/framework hints.

Conclusions

● Effectively using high-performance I/O hardware for data processing is challenging
● Crail is an attempt to re-think how data processing systems should interact with network and storage hardware
– User-level I/O
– Storage disaggregation
– Memory/flash convergence
– Elastic resource provisioning

References

● Crail: A High-Performance I/O Architecture for Distributed Data Processing, IEEE Data Engineering Bulletin, 2017
● Albis: High-Performance File Format for Big Data Systems, USENIX ATC'18
● Navigating Storage for Serverless Computing, USENIX ATC'18
● Pocket: Ephemeral Storage for Serverless Analytics, OSDI'18 (to appear)
● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit'17
● Serverless Machine Learning using Crail, Spark Summit'18
● Apache Crail, http://crail.apache.org

Contributors

Animesh Trivedi, Jonas Pfefferle, Bernard Metzler, Adrian Schuepbach, Ana Klimovic, Yawen Wang, Michael Kaufmann, Yuval Degani, ...