µπ: A Scalable & Transparent System for Simulating MPI Programs

Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology

SimuTools, Malaga, Spain
March 17, 2010
Managed by UT-Battelle for the U.S. Department of Energy – SimuTools'10 Presentation – Perumalla (ORNL)
Motivation & Background
Software & Hardware Lifetimes
• Lifetime of a large parallel machine: 5 years
• Lifetime of useful parallel code: 20 years
• Port, analyze, optimize
• Ease of development: obviate actual scaled hardware
• Energy efficiency: reduce failed runs at actual scale

Software & Hardware Design
• Co-design: e.g., cost/benefit of a 1 μs barrier
• Hardware: e.g., load from application
• Software: scaling, debugging, testing, customizing
μπ Performance Investigation System
μπ = micro parallel performance investigator
– Performance prediction for MPI, Portals, and other parallel applications
– Actual application code executed on the real hardware
– Platform simulated at large virtual scale
– Timing customized by a user-defined machine model

• Scale is the key differentiator
– Target: 1,000,000 virtual cores
– E.g., 1,000,000 virtual MPI ranks in the simulated MPI application

• Based on µsik (micro simulator kernel)
– Highly scalable PDES engine
Generalized Interface & Timing Framework
• Accommodates an arbitrary level of timing detail
– Compute time: can use a full system simulation (instruction-level) on the side, or a model with cache effects, corrected processor speed, etc., depending on the user's accuracy-cost trade-off
– Communication time: can use a network simulator, queueing and congestion models, etc., depending on the user's accuracy-cost trade-off
Compiling MPI application with μπ
• Modify #include and recompile
– Change #include <mpi.h> to #include <mupi.h>
• Relink against the μπ library
– Use -lmupi instead of -lmpi
Executing MPI application over μπ
• Run the modified MPI application (a μπ simulation):
– mpirun -np 4 test -nvp 32
– Runs test with 32 virtual MPI ranks; the simulation uses 4 real cores
– μπ itself uses multiple real cores to run the simulation in parallel
Interface Support
Existing, Sufficient
• MPI_Init(), MPI_Finalize()
• MPI_Comm_rank(), MPI_Comm_size()
• MPI_Barrier()
• MPI_Send(), MPI_Recv()
• MPI_Isend(), MPI_Irecv()
• MPI_Waitall()
• MPI_Wtime()
• MPI_COMM_WORLD

Planned, Optional
• Other wait variants
• Other send/recv variants
• Other collectives
• Group communication

Other, Performance-Oriented
• MPI_Elapse_time(dt)
– Added for simulation speed
– Avoids actual computation; instead simply elapses virtual time
Performance Study
• Benchmarks
– Zero lookahead
– 10 μs lookahead
• Platform
– Cray XT5, 226K cores
• Scaling Results
– Event cost
– Synchronization overhead
– Multiplexing gain
Experimentation Platform: Jaguar*
* Data and images from http://nccs.gov
Event Cost
[Chart: microseconds per μπ (MUPI) event vs. number of real cores (10 to 1,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 4, 8, 16, 64, 128]
Synchronization Speed
[Chart: number of global virtual time computations vs. number of real cores (10 to 1,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 4, 8, 16, 64, 128]
Multiplexing Gain
[Chart: factor of speed gain relative to VPX=4 vs. number of virtual MPI ranks (10,000 to 10,000,000), one curve per setting of virtual MPI ranks per real core (VPX) = 16, 64, 128]
μπ Summary - Quantitative
• Unprecedented scalability
– 27,648,000 virtual MPI ranks on 216,000 actual cores
• Optimal multiplex factor observed
– 64 virtual ranks per real rank
• Low slowdown even in zero-lookahead scenarios
– Even on fast virtual networks
μπ Summary - Qualitative
• The only available simulator for highly scaled MPI runs
– Suitable for source-available, trace-driven, or modeled applications
• Configurable hardware timing
– User-specified latencies, bandwidths, arbitrary inter-network models
• Executions are repeatable and deterministic
– Global time-stamped ordering
– Deterministic timing model
– Purely discrete event simulation
• Most suitable for applications whose MPI communication can be trapped, instrumented, or modeled
– Trapped: on-line, live actual execution
– Instrumented: off-line trace generation, trace-driven on-line execution
– Modeled: model-driven computation and MPI communication patterns
• Nearly zero perturbation with unlimited instrumentation
Ongoing Work
• NAS Benchmarks– E.g., FFT
• Actual at-scale application– E.g., Chemistry
• Optimized implementation of certain MPI primitives– E.g., MPI_Barrier(), MPI_Reduce()
• Tie to other important phenomena– E.g., energy consumption models