Transcript of Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Page 1: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

P. Balaji†, W. Feng‡, Q. Gao†, R. Noronha†, W. Yu†, D. K. Panda†

†Network Based Computing Lab, Ohio State University
‡Advanced Computing Lab, Los Alamos National Lab

Page 2: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Ethernet Trends

• Ethernet is the most widely used network architecture today

• Traditionally, Ethernet has been notorious for performance issues
– Near an order-of-magnitude performance gap compared to InfiniBand and Myrinet

• Cost-conscious architecture
– Relied on host-based TCP/IP for network and transport layer support
– Compatibility with existing infrastructure (switch buffering, MTU)

• Used by 42.4% of the Top500 supercomputers

• Key: extremely high performance per unit cost
– GigE can give about 900Mbps (performance) / $0 (cost)

• 10-Gigabit Ethernet (10GigE) recently introduced
– 10-fold (theoretical) increase in performance while retaining existing features
– Can 10GigE bridge the performance gap between Ethernet and InfiniBand/Myrinet?

Page 3: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

InfiniBand, Myrinet and 10GigE: Brief Overview

• InfiniBand (IBA)
– Industry-standard network architecture
– Supports 10Gbps and higher network bandwidths
– Offloaded protocol stack
– Rich feature set (one-sided, zero-copy communication, multicast, etc.)

• Myrinet
– Proprietary network by Myricom
– Supports up to 4Gbps with dual ports (10G adapter announced!)
– Offloaded protocol stack and rich feature set like IBA

• 10-Gigabit Ethernet (10GigE)
– The next step for the Ethernet family
– Supports up to 10Gbps link bandwidth
– Offloaded protocol stack
– Promises a richer feature set too with the upcoming iWARP stack

Page 4: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Characterizing the Performance Gap

• Each high performance interconnect has its own interface
– Characterizing the performance gap is no longer straightforward

• Portability in application development
– Portability across various networks is a must
– Message Passing Interface (MPI)
• De facto standard for scientific applications
– Sockets interface
• Legacy scientific applications
• Grid-based or heterogeneous computing applications
• File and storage systems
• Other commercial applications

• Sockets and MPI are the right choices to characterize the gap
– In this paper we concentrate only on the Sockets interface

Page 5: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Presentation Overview

• Introduction and Motivation

• Protocol Offload Engines

• Experimental Evaluation

• Conclusions and Future Work

Page 6: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Interfacing with Protocol Offload Engines

[Figure: Two approaches to interfacing with a Protocol Offload Engine. Left (High Performance Sockets): the application or library uses either the traditional sockets interface over host TCP/IP or a High Performance Sockets layer over a user-level protocol; both paths go through the device driver to the high performance network adapter and its network features (e.g., offloaded protocol). Right (TCP Stack Override): the application keeps the traditional sockets interface, and the host TCP/IP stack is overridden (via toedev/TOM) to hand connections to the adapter's offloaded protocol.]

Page 7: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

High Performance Sockets

• Pseudo Sockets-like Interface

– Smooth transition for existing sockets applications

• Existing applications do not have to be rewritten or recompiled!

– Improved performance by using the network's offloaded protocol stack

• High Performance Sockets exist for many networks

– Initial implementations on VIA

– Follow-up implementations on Myrinet and Gigabit Ethernet

– Sockets Direct Protocol is an industry standard specification for IBA

• Implementations by Voltaire, Mellanox and OSU exist

[shah98:canpc, kim00:cluster, balaji02:hpdc]

[balaji02:cluster]

[balaji03:ispass, mellanox05:hoti]
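To make the "no rewrite, no recompile" point concrete, below is a minimal sketch (not from the paper) of an ordinary C sockets client. With a High Performance Sockets layer such as SDP, the very same binary can typically be redirected over the offloaded stack, for example by preloading an SDP library such as libsdp; the host address and port are placeholders.

```c
/* Minimal sketch: a plain sockets client. With an SDP-style High Performance
 * Sockets layer, the same binary can be redirected over the offloaded stack
 * (e.g., via a preload library), so no rewrite or recompile is needed.
 * Host/port are illustrative placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* ordinary TCP socket */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(5001);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.0.1", &srv.sin_addr);   /* placeholder host */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect"); return 1;
    }

    const char msg[] = "hello";
    write(fd, msg, sizeof(msg));                /* send a small message */
    char buf[64];
    read(fd, buf, sizeof(buf));                 /* receive the echo */

    close(fd);
    return 0;
}
```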

Page 8: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

TCP Stack Override

• Similar to the High Performance Sockets approach, but…
– Overrides the TCP layer instead of the Sockets layer

• Advantages compared to High Performance Sockets
– Sockets features do not have to be duplicated
• E.g., buffer management, memory registration
– Features implemented in the Berkeley sockets implementation can be used

• Disadvantages compared to High Performance Sockets
– A kernel patch is required
– Some TCP functionality has to be duplicated

• This approach is used by Chelsio in their 10GigE adapters

Page 9: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Presentation Overview

• Introduction and Motivation

• High Performance Sockets over Protocol Offload Engines

• Experimental Evaluation

• Conclusions and Future Work

Page 10: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Experimental Test-bed and Evaluation

• 4 node cluster: Dual Xeon 3.0GHz; SuperMicro SUPER X5DL8-GG nodes

• 512 KB L2 cache; 2GB of 266MHz DDR SDRAM memory; PCI-X 64-bit 133MHz

• InfiniBand
– Mellanox MT23108 dual-port 4x HCAs (10Gbps link bandwidth); MT43132 24-port switch
– Voltaire IBHost-3.0.0-16 stack

• Myrinet
– Myrinet-2000 dual-port adapters (4Gbps link bandwidth)
– SDP/Myrinet v1.7.9 over GM v2.1.9

• 10GigE
– Chelsio T110 adapters (10Gbps link bandwidth); Foundry 16-port SuperX switch
– Driver v1.2.0 for the adapters; firmware v2.2.0 for the switch

• Experimental results
– Micro-benchmarks (latency, bandwidth, bidirectional bandwidth, multi-stream, hot-spot, fan tests)
– Application-level evaluation

Page 11: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Ping-Pong Latency Measurements

[Figure: Ping-pong latency vs. message size (1 byte to 16 KB), event-based (left) and polling-based (right). Y-axis: latency (us). Series: 10GigE, SDP/IBA, SDP/Myrinet (event); SDP/IBA, SDP/Myrinet (poll).]

• SDP/Myrinet achieves the best small-message latency at 11.3us
• 10GigE and IBA achieve latencies of 17.7us and 24.4us respectively
• As message size increases, IBA performs the best (Myrinet cards are 4Gbps links right now!)
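For reference, ping-pong numbers of this kind are typically gathered with a simple sockets-level ping-pong loop. The sketch below is illustrative rather than the paper's harness; the connected socket descriptor fd, the message size, and the iteration count are assumed inputs.

```c
/* Minimal ping-pong latency sketch (client side). Assumes `fd` is a
 * connected stream socket and that the server echoes every message back.
 * One-way latency is estimated as half the average round-trip time. */
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double ping_pong_latency_us(int fd, size_t msg_size, int iters)
{
    char buf[16384];                 /* large enough for the sizes tested */
    memset(buf, 'a', msg_size);

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < iters; i++) {
        /* send one message ... */
        size_t sent = 0;
        while (sent < msg_size) {
            ssize_t n = write(fd, buf + sent, msg_size - sent);
            if (n <= 0) return -1.0;
            sent += (size_t)n;
        }
        /* ... and wait for the full echo */
        size_t rcvd = 0;
        while (rcvd < msg_size) {
            ssize_t n = read(fd, buf + rcvd, msg_size - rcvd);
            if (n <= 0) return -1.0;
            rcvd += (size_t)n;
        }
    }
    gettimeofday(&end, NULL);

    double total_us = (end.tv_sec - start.tv_sec) * 1e6 +
                      (end.tv_usec - start.tv_usec);
    return total_us / iters / 2.0;   /* one-way latency in microseconds */
}
```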

Page 12: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Unidirectional Throughput Measurements

[Figure: Unidirectional throughput vs. message size (1 byte to 256 KB), event-based (left) and polling-based (right). Y-axis: throughput (Mbps). Series: 10GigE, SDP/IBA, SDP/Myrinet (event); SDP/IBA, SDP/Myrinet (poll).]

• 10GigE achieves the highest throughput at 6.4Gbps
• IBA and Myrinet achieve about 5.3Gbps and 3.8Gbps respectively (Myrinet is only a 4Gbps link)
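Similarly, a unidirectional throughput number can be obtained with a simple streaming loop. The sender-side sketch below is illustrative, not the paper's harness; it assumes an already-connected socket fd and a receiver that simply drains the data.

```c
/* Minimal unidirectional throughput sketch (sender side). Assumes `fd` is a
 * connected stream socket; the receiver just drains the data. Reports Mbps
 * as (bits sent) / (elapsed seconds) / 1e6. */
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double stream_throughput_mbps(int fd, size_t msg_size, int iters)
{
    char buf[262144];                 /* up to 256 KB messages */
    memset(buf, 'a', msg_size);

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < iters; i++) {
        size_t sent = 0;
        while (sent < msg_size) {
            ssize_t n = write(fd, buf + sent, msg_size - sent);
            if (n <= 0) return -1.0;
            sent += (size_t)n;
        }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1e6;
    double bits = (double)msg_size * iters * 8.0;
    return bits / secs / 1e6;         /* megabits per second */
}
```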

Page 13: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Snapshot Results (“Apples-to-Oranges” comparison)

• Platforms: 10GigE: Opteron 2.2GHz; IBA: Xeon 3.6GHz; Myrinet: Xeon 3.0GHz
• Provide a reference point! Only valid for THIS slide!

[Figure: Left: Unidirectional Throughput (Event), Mbps vs. message size (1 byte to 256 KB), for 10GigE (Opteron), SDP/IBA-DDR, and SDP/MX. Right: Ping-Pong Latency, us vs. message size (1 byte to 1 KB), for 10GigE (Opteron), SDP/IBA-DDR, SDP/MX (Event), and SDP/MX (Poll).]

• SDP/Myrinet with MX allows polling and achieves about 4.6us latency (event-based is better for SDP/GM, ~11.3us)
• 10GigE achieves the lowest event-based latency of 8.9us on Opteron systems
• IBA achieves 9Gbps throughput with their DDR cards (link speed of 20Gbps)

Page 14: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Bidirectional and Multi-Stream Throughput

[Figure: Left: Bidirectional Throughput, Mbps vs. message size (1 byte to 64 KB). Right: Multi-Stream Throughput, Mbps vs. number of streams (1 to 12). Series in both: 10GigE, SDP/IBA, SDP/Myrinet.]

Page 15: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Hot-Spot Latency

[Figure: Hot-Spot Latency, latency (us) vs. number of clients (3, 6, 9, 12), for 10GigE, SDP/IBA, and SDP/Myrinet.]

• 10GigE and IBA demonstrate similar scalability with increasing number of clients
• Myrinet's performance deteriorates faster than the other two
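For context, a hot-spot test of this kind can be run by launching several copies of a client like the sketch below, all ping-ponging small messages against the same server; the address, port, message size, and iteration count are illustrative assumptions, not the paper's parameters.

```c
/* Minimal hot-spot client sketch: every client process connects to the same
 * "hot" server and runs an independent small-message ping-pong, so the
 * reported latency reflects contention at the server. All constants below
 * are illustrative assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "192.168.0.1", &srv.sin_addr);   /* the "hot" server */
    if (fd < 0 || connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect"); return 1;
    }

    enum { MSG = 1024, ITERS = 10000 };
    char buf[MSG];
    memset(buf, 'a', MSG);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        if (write(fd, buf, MSG) != MSG) return 1;   /* sketch: no partial-write handling */
        for (int got = 0; got < MSG; ) {            /* read back the echoed message */
            ssize_t n = read(fd, buf + got, MSG - got);
            if (n <= 0) return 1;
            got += (int)n;
        }
    }
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average round-trip under contention: %.1f us\n", us / ITERS);
    close(fd);
    return 0;
}
```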

Page 16: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Fan-in and Fan-out Throughput Tests

[Figure: Fan-in (left) and fan-out (right) test results, bandwidth (Mbps) vs. number of streams (3, 6, 9, 12), for 10GigE, SDP/IBA, and SDP/Myrinet.]

• 10GigE and IBA achieve similar performance for the fan-in test
• 10GigE performs slightly better for the fan-out test

Page 17: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Data-Cutter Run-time Library (Software Support for Data-Driven Applications)


[Figure: Combined data/task parallelism in DataCutter. A filter pipeline is spread across three clusters: filters E0…EK and EK+1…EN on hosts 1-2 (Cluster 1), replicated filters Ra0-Ra2 on hosts 3-5 (Cluster 2), and filters R0-R2 plus a merging filter M on hosts 1-2 (Cluster 3).]

http://www.datacutter.org

• Designed by Univ. of Maryland

• Component framework

• User-defined pipeline of components

• Each component is a filter

• The collection of components is a filter group

• Replicated filters as well as filter groups

• Illusion of a single stream in the filter group

• Stream based communication

• Flow control between components

• Unit of Work (UOW) based flow control

• Each UOW contains about 16 to 64KB of data

• Several applications supported

• Virtual Microscope

• ISO Surface Oil Reservoir Simulator
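DataCutter's actual API is not reproduced here; the sketch below only illustrates, under assumed names (UOW_SIZE, send_all, forward_filter), the stream-based, UOW-sized (16-64 KB) forwarding between filters described above.

```c
/* Illustrative only: not the DataCutter API. Shows the general idea of a
 * filter pushing data downstream in Unit-of-Work (UOW) sized chunks over a
 * stream socket. UOW_SIZE, send_all and forward_filter are hypothetical. */
#include <string.h>
#include <unistd.h>

#define UOW_SIZE (64 * 1024)          /* UOWs are roughly 16 to 64 KB */

static int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n <= 0) return -1;
        sent += (size_t)n;
    }
    return 0;
}

/* A "filter" that forwards its input to the next filter one UOW at a time. */
static int forward_filter(int in_fd, int out_fd)
{
    char uow[UOW_SIZE];
    for (;;) {
        ssize_t n = read(in_fd, uow, sizeof(uow));
        if (n < 0)  return -1;        /* read error */
        if (n == 0) return 0;         /* upstream closed: end of stream */
        if (send_all(out_fd, uow, (size_t)n) < 0) return -1;
    }
}
```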

Page 18: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Data-Cutter Performance Evaluation

[Figure: Left: Virtual Microscope over Data-Cutter, execution time (secs) vs. dataset dimensions (512x512 to 4096x4096). Right: ISO Surface over Data-Cutter, execution time (secs) vs. dataset dimensions (1024x1024 to 8192x8192). Series: 10GigE TOE, SDP/IBA, SDP/Myrinet.]

• InfiniBand performs the best for both Data-Cutter applications (especially Virtual Microscope)

• The filter-based approach makes the environment sensitive to medium-message latency

Page 19: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Parallel Virtual File System (PVFS)

[Figure: PVFS architecture. Compute nodes connect over the network to a meta-data manager (holding the metadata) and to multiple I/O server nodes (each holding a portion of the data).]

• Designed by ANL and Clemson University
• Relies on striping of data across different nodes
• Tries to aggregate I/O bandwidth from multiple nodes
• Utilizes the local file system on the I/O server nodes
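As a rough illustration of the striping idea only (not PVFS's actual data layout or API), the sketch below maps a file offset to an I/O server and a local offset under simple round-robin striping; the 64 KB stripe size and three servers are assumptions for the example.

```c
/* Rough illustration of round-robin striping, not PVFS's actual layout or
 * API: a file offset is mapped to (server index, offset within that server's
 * local file). Stripe size and server count are illustrative assumptions. */
#include <stdio.h>

#define STRIPE_SIZE  (64 * 1024)   /* assumed stripe unit */
#define NUM_SERVERS  3             /* assumed number of I/O servers */

struct stripe_loc {
    int  server;                   /* which I/O server holds this byte */
    long local_off;                /* offset inside that server's file */
};

static struct stripe_loc map_offset(long file_off)
{
    long stripe = file_off / STRIPE_SIZE;         /* global stripe number */
    struct stripe_loc loc;
    loc.server    = (int)(stripe % NUM_SERVERS);  /* round-robin placement */
    loc.local_off = (stripe / NUM_SERVERS) * STRIPE_SIZE
                  + file_off % STRIPE_SIZE;
    return loc;
}

int main(void)
{
    long off = 200 * 1024;         /* example: byte 200 KB of the file */
    struct stripe_loc loc = map_offset(off);
    printf("offset %ld -> server %d, local offset %ld\n",
           off, loc.server, loc.local_off);
    return 0;
}
```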

Page 20: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

PVFS Contiguous I/O

[Figure: PVFS write bandwidth (left) and read bandwidth (right), in Mbps, for the 1S/3C and 3S/1C configurations (servers/clients), with 10GigE, SDP/IBA, and SDP/Myrinet.]

• Performance trends are similar to the throughput test
• The experiment is throughput intensive

Page 21: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

MPI-Tile-IO (PVFS Non-contiguous I/O)

[Figure: MPI-tile-IO over PVFS, write and read bandwidth (Mbps) for 10GigE, SDP/IBA, and SDP/Myrinet.]

• 10GigE and IBA perform roughly equally
• Myrinet is very close behind in spite of being only a 4Gbps network

Page 22: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Ganglia Cluster Management Infrastructure

[Figure: Ganglia setup. A management node connects to the Ganglia daemon on each cluster node, requests information, and receives the data.]

• Developed by UC Berkeley
• For each transaction, one connection and one medium-sized message (~6KB) transfer is required
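To show why connection setup dominates the cost, the sketch below mimics one such monitoring transaction (connect, send a request, receive a medium-sized reply, close); the address, port, and request format are placeholders, and this is not Ganglia's actual wire protocol.

```c
/* Sketch of one monitoring transaction as described above: open a new
 * connection, issue a request, read a medium-sized (~6 KB) reply, and close.
 * Connection setup is paid on every transaction, which is why connection
 * time drives the transactions-per-second results. Address, port and message
 * contents are placeholders; this is not Ganglia's actual wire protocol. */
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static int one_transaction(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &srv.sin_addr);

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {  /* per-transaction cost */
        close(fd); return -1;
    }

    const char req[] = "request";                /* placeholder request */
    write(fd, req, sizeof(req));

    char reply[8192];                            /* ~6 KB of monitoring data */
    ssize_t n;
    while ((n = read(fd, reply, sizeof(reply))) > 0)
        ;                                        /* drain until the daemon closes */

    close(fd);
    return 0;
}

int main(void)
{
    /* Example: 100 back-to-back transactions against a placeholder daemon. */
    for (int i = 0; i < 100; i++)
        if (one_transaction("192.168.0.1", 5001) < 0)
            return 1;
    return 0;
}
```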

Page 23: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Ganglia Performance Evaluation

[Figure: Ganglia transactions per second (TPS) vs. cluster size (3 to 18 processors), for 10GigE TOE, SDP/IBA, and SDP/Myrinet.]

• Performance is dominated by connection time
• IBA and Myrinet take about a millisecond to establish connections, while 10GigE takes about 60us
• Optimizations for SDP/IBA and SDP/Myrinet, such as connection caching, are possible (being implemented)

Page 24: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Presentation Overview

• Introduction and Motivation

• High Performance Sockets over Protocol Offload Engines

• Experimental Evaluation

• Conclusions and Future Work

Page 25: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Concluding Remarks

• Ethernet has traditionally been notorious for performance reasons
– Close to an order-of-magnitude performance gap compared to IBA/Myrinet

• 10GigE: recently introduced as a successor in the Ethernet family
– Can 10GigE help bridge the gap between Ethernet and IBA/Myrinet?

• We showed comparisons between Ethernet, IBA, and Myrinet
– The Sockets interface was used for the comparison
– Micro-benchmark as well as application-level evaluations were shown

• 10GigE performs quite comparably with IBA and Myrinet
– Better in some cases and worse in others, but in the same ballpark
– Quite ubiquitous in Grid and WAN environments
– Comparable performance in SAN environments

Page 26: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Continuing and Future Work

• Sockets is only one end of the comparison chart
– Other middleware is quite widely used too (e.g., MPI)
– IBA/Myrinet might have an advantage due to RDMA/multicast capabilities

• Network interfaces and software stacks change
– Myrinet is coming out with 10Gig adapters
– 10GigE might release a PCI-Express based card
– IBA has a zero-copy sockets interface for improved performance
– IBM's 12x InfiniBand adapters increase the performance of IBA threefold
– These results keep changing with time; more snapshots are needed for fairness

• Multi-NIC comparison for Sockets/MPI
– An awful lot of work at the host
– Scalability might be bound by the host

• iWARP compatibility and features for 10GigE TOE adapters

Page 27: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Acknowledgements

Our research is supported by the following organizations

• Funding support by

• Equipment donations by

Page 29: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

Backup Slides

Page 30: Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines

10GigE: Technology Overview

• Broken into three levels of technologies

– Regular Ethernet adapters
• Layer-2 adapters
• Rely on host-based TCP/IP to provide network/transport functionality
• Could achieve high performance with optimizations

– TCP Offload Engines (TOEs)
• Layer-4 adapters
• Have the entire TCP/IP stack offloaded onto hardware
• Sockets layer retained in the host space

– iWARP-aware adapters
• Layer-4 adapters
• Entire TCP/IP stack offloaded onto hardware
• Support more features than TCP Offload Engines
– No sockets! Richer iWARP interface!
– E.g., out-of-order placement of data, RDMA semantics

[feng03:hoti, feng03:sc, balaji03:rait]

[balaji05:hoti]