µπ: A Scalable & Transparent System for Simulating MPI Programs

Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology

SimuTools, Malaga, Spain
March 17, 2010
Managed by UT-Battelle for the U.S. Department of Energy – SimuTools'10 Presentation – Perumalla (ORNL)
Motivation & Background
Software & Hardware Lifetimes
• Lifetime of a large parallel machine: 5 years
• Lifetime of useful parallel code: 20 years
• Port, analyze, optimize
• Ease of development: obviate actual scaled hardware
• Energy efficiency: reduce failed runs at actual scale

Software & Hardware Design
• Co-design: e.g., cost/benefit of a 1 μs barrier
• Hardware: e.g., load from application
• Software: scaling, debugging, testing, customizing
μπ Performance Investigation System
μπ = micro parallel performance investigator
– Performance prediction for MPI, Portals, and other parallel applications
– Actual application code executed on the real hardware
– Platform simulated at large virtual scale
– Timing customized by a user-defined machine model

• Scale is the key differentiator
– Target: 1,000,000 virtual cores
– E.g., 1,000,000 virtual MPI ranks in the simulated MPI application

• Based on µsik (micro simulator kernel)
– Highly scalable PDES engine
Generalized Interface & Timing Framework
• Accommodates an arbitrary level of timing detail
– Compute time: can use a full system simulation (instruction-level) on the side, or a model with cache effects, corrected processor speed, etc., depending on the user's accuracy-cost trade-off
– Communication time: can use a network simulator, queueing and congestion models, etc., depending on the user's accuracy-cost trade-off
Compiling MPI application with μπ
• Modify #include and recompile
– Change #include <mpi.h> to #include <mupi.h>
• Relink against the μπ library
– Use -lmupi instead of -lmpi
Executing MPI application over μπ
• Run the modified MPI application (a μπ simulation):
– mpirun -np 4 test -nvp 32
– Runs test with 32 virtual MPI ranks; the simulation uses 4 real cores
– μπ itself uses multiple real cores to run the simulation in parallel
Interface Support
Existing, Sufficient
• MPI_Init(), MPI_Finalize()
• MPI_Comm_rank(), MPI_Comm_size()
• MPI_Barrier()
• MPI_Send(), MPI_Recv()
• MPI_Isend(), MPI_Irecv()
• MPI_Waitall()
• MPI_Wtime()
• MPI_COMM_WORLD

Planned, Optional
• Other wait variants
• Other send/recv variants
• Other collectives
• Group communication

Other, Performance-Oriented
• MPI_Elapse_time(dt)
– Added for simulation speed
– Avoids actual computation; instead simply elapses virtual time
Performance Study
• Benchmarks
– Zero lookahead
– 10 μs lookahead
• Platform
– Cray XT5, 226K cores
• Scaling Results
– Event cost
– Synchronization overhead
– Multiplexing gain
Experimentation Platform: Jaguar*
* Data and images from http://nccs.gov
Event Cost
[Chart: microseconds per μπ (MUPI) event vs. number of real cores (10 to 1,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 4, 8, 16, 64, 128]
Synchronization Speed
[Chart: number of global virtual time computations vs. number of real cores (10 to 1,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 4, 8, 16, 64, 128]
Multiplexing Gain
[Chart: factor of speed gain relative to VPX=4 vs. number of virtual MPI ranks (10,000 to 10,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 16, 64, 128]
μπ Summary - Quantitative
• Unprecedented scalability
– 27,648,000 virtual MPI ranks on 216,000 actual cores
• Optimal multiplex factor observed
– 64 virtual ranks per real rank
• Low slowdown even in zero-lookahead scenarios
– Even on fast virtual networks
μπ Summary - Qualitative
• The only available simulator for highly scaled MPI runs
– Suitable for source-available, trace-driven, or modeled applications
• Configurable hardware timing
– User-specified latencies, bandwidths, arbitrary inter-network models
• Executions are repeatable and deterministic
– Global time-stamped ordering
– Deterministic timing model
– Purely discrete event simulation
• Most suitable for applications whose MPI communication can be trapped, instrumented, or modeled
– Trapped: on-line, live actual execution
– Instrumented: off-line trace generation, trace-driven on-line execution
– Modeled: model-driven computation and MPI communication patterns
• Nearly zero perturbation with unlimited instrumentation
Ongoing Work
• NAS Benchmarks– E.g., FFT
• Actual at-scale application– E.g., Chemistry
• Optimized implementation of certain MPI primitives– E.g., MPI_Barrier(), MPI_Reduce()
• Tie to other important phenomena– E.g., energy consumption models