Introductory Parallel Applications Courtesy: Dr. David Walker, Cardiff University.

Introductory Parallel Applications

Courtesy: Dr. David Walker, Cardiff University

Example1 – wave equationCourtesy: David Walker, Cardiff University

Problem – Vibrating string of length L fixed at both ends given an initial displacement.

To determine Ψ(x, t)

displacement

x=0x=L

Vibrating String Problem To solve ∂2Ψ/c2∂t2 - ∂2Ψ/∂x2 = 0 Subject to Ψ(0, t) = Ψ(L, t) = 0; Ψ(x,

0) = u(x) Numerical approximation: Ψi(t+∆t) = 2Ψi(t) – Ψi(t-∆t) + tau2(Ψi-1(t) -2Ψi(t) + Ψ i+1(t)) where

tau = c ∆t/ ∆x and u(x) = sinx

Vibrating String Problem – Sequential Problemfor(i=0; i<npoints; i++){ x = 2*pi*i/(npoints-1); x = sin(x); ps[i] = old_ps[i] = x;}

for(j=0; j<nsteps; j++){ for(i=1; i<npoints-1; i++){ new_ps[i] = 2*ps[i] – old_ps[i] + tau*tau*(ps[i-1] + 2ps[i] + ps[i+1]); }

for(i=0; i<npoints; i++){ old_ps[i] = ps[i]; ps[i] = new_ps[i]; }}

Vibrating String – Parallel Formulation The points are divided across

processes Communication between neighboring

processors Data structure – an array for every

processor with sizes local_size+2

Vibrating String – Parallel Code/* Set up 1D topology */ MPI_Cart_create(MPI_COMM_WORLD,1,&nprocs,&periods,

reorder,&new_comm); MPI_Cart_coords (new_comm, rank, 1, &mypos) MPI_Cart_shift (new_comm, 0, -1, &right, &left);

/* Initialise array */ local_npts = npoints/nprocs; lbound = mypos*local_npts;

for(i=0;i<local_npts;i++){ x = 2.0*PI*(double)(lbound+i)/(double)(npoints-1); x = sin(x); psi[i+1] = old_psi[i+1] = x; }

Vibrating String – Parallel Code/* Do the updates */ istart = (mypos==0) ? 2 : 1; iend = (mypos==nprocs-1) ? local_npts-1 : local_npts; for(j=0;j<NSTEPS;j++){ MPI_Sendrecv (&psi[1], 1, MPI_DOUBLE, left, 111,

&psi[local_npts+1], 1, MPI_DOUBLE, right, 111, new_comm, &status);

MPI_Sendrecv (&psi[local_npts], 1, MPI_DOUBLE, right, 111, &psi[0], 1, MPI_DOUBLE, left, 111, new_comm, &status);

for(i=istart;i<=iend;i++){ new_psi[i] = 2.0*psi[i] - old_psi[i] + tau*tau*(psi[i-1]- 2.0*psi[i]

+psi[i+1]); } for(i=1;i<=local_npts;i++){ old_psi[i] = psi[i]; psi[i] = new_psi[i]; } }

Example 2 – Laplace equation problem

Electric field around a conducting object in a box

To solve ∂2Ψ/∂x2 + ∂2Ψ/∂y2 = 0 Subject to Ψ(0, y) = Ψ(L1, y) = 0;

Ψ(x, 0) = Ψ(x, L2) = 0; Ψ(x, y) =1 on S(x, y)

S(x, y)

x=0 x=L1y=0

Laplace equation – Numerical approximation We consider a square problem, L1 =

L2 = L Ψi,j

k = ¼(Ψi-1,jk-1 + Ψi+1,j

k-1 + Ψi,j-1k-1 + Ψi,j+1

Laplace Equation – Sequential codedouble phi[NXMAX][NYMAX];double old_phi[NXMAX][NYMAX];int mask[NXMAX][NYMAX];

setup_grid(phi, nptsx, nptsy, mask);

for(k=0; k<nsteps; k++){

for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ old_phi[i][j] = phi[i, j]; } }

for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ if(mask[i][j]) phi[i][j] = ¼*(old_phi[i][j-1] + old_phi[i][j+1] +

old_phi[i-1][[j] + old_phi[i][[j-1]); } }

Laplace Equation – Sequential codesetup_grid(phi, nptsx, nptsy, mask){

for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ phi[i, j] = 0.0; mask[i, j] = 1; } }

mask[0…nptsx-1][0] = mask[0..nptsx-1][nptsy-1] = mask[0][0..nptsy-1] = mask[nptsx-1][0..nptsy-1] = 0;

mask[nptsx/2, nptsx/2-1][nptsy/2, nptsy/2-1] = 0; phi[nptsx/2, nptsx/2-1][nptsy/2, nptsy/2-1] = 1;

Laplace Equation - Parallelization A 2-D topology of processes are considered The updates of boundary values for a given

processor may need communication from its neighboring processors.

Following are main data structures double phi[NYMAX+2][NXMAX+2]; double oldphi[NYMAX+2][NXMAX+2]; int mask[NYMAX+2][NXMAX+2];

LAPLACE Equation - Parallelization/* Work out number of processes in each direction of the process mesh */ dims[0] = dims[1] = 0; MPI_Dims_create (nprocs, 2, dims); npy = dims[0]; npx = dims[1];

/* Set up 2D topology */ periods[0] = periods[1] = 0; MPI_Cart_create (MPI_COMM_WORLD, 2, dims, periods, 1, &new_comm); MPI_Cart_coords (new_comm, rank, 2, coords); myposy = coords[0]; myposx = coords[1];

/* Determine neighbouring processes for communication shifts */ MPI_Cart_shift (new_comm, 0, -1, &down, &up); MPI_Cart_shift (new_comm, 1, -1, &right, &left);

/* Initialise arrays */ setup_grid (phi, npx, npy, myposx, myposy, nptsx, nptsy, &local_nptsx,

&local_nptsy, mask);

Laplace Equation - Parallelization/* Iterate to find solution */for(k=1;k<=nsteps;k++){ for(i=1;i<=local_nptsy;i++) for(j=1;j<=local_nptsx;j++) oldphi[i][j] = phi[i][j];

MPI_Sendrecv (&oldphi[1][1], local_nptsy, MPI_DOUBLE, up, 111, &oldphi[local_nptsx+1][1], local_nptsy, MPI_DOUBLE,down, 111, new_comm, &status);

MPI_Sendrecv (&oldphi[local_nptsx][1],local_nptsy,MPI_DOUBLE,down,112, &oldphi[0][1], local_nptsy, MPI_DOUBLE, up, 112, new_comm, &status);

Laplace Equation - Parallelizationfor(i=1;i<=local_nptsx;i++) sbuf[i-1] = oldphi[i][1]; MPI_Sendrecv (sbuf, local_nptsx, MPI_DOUBLE, left, 113, rbuf, local_nptsx,

MPI_DOUBLE, right, 113, new_comm, &status); for(j=1;j<=local_nptsy;j++) oldphi[i][local_nptsy+1] = rbuf[i-1];

for(i=1;i<=local_nptsx;i++) sbuf[i-1] = oldphi[i][local_nptsy]; MPI_Sendrecv (sbuf, local_nptsx, MPI_DOUBLE, right, 114, rbuf, local_nptsx,

MPI_DOUBLE, left, 114, new_comm, &status); for(i=1;i<=local_nptsx;i++) oldphi[i][0] = rbuf[i-1];

for(i=1;i<=local_nptsx;i++) for(j=1;j<=local_nptsy;j++) if (mask[i][j]) phi[i][j] = 0.25*(oldphi[i][j-1] + oldphi[i][j+1] + oldphi[i-1][j] +

oldphi[i+1][j]); }

Laplace Equation - Parallelizationsetup_grid (phi, npx, npy, myposx, myposy, nptsx, nptsy, local_ptrx,

local_ptry, mask){

local_nptsx = nptsx/npx; local_nptsy = nptsy/npy; for(j=0;j<=local_nptsy+1;j++) for(i=0;i<=local_nptsx+1;i++){ phi[i][j] = 0.0; mask[i][j] = 1;}

if (myposx == 0) mask[1][1..local_nptsy] = 0;if (myposx == npx-1) mask[local_nptsx][1..local_nptsy] = 0;if (myposy == 0) mask[1..local_nptsx][1] =0;if (myposy == npy-1) mask[1..local_nptsx][local_nptsy] = 0;

for(i=1;i<=local_nptsx;i++){ global_x = local_nptsx*myposx + i - 1; if (global_x == nptsx/2 || global_x == nptsx/2-1){ for(j=1;j<=local_nptsx;j++){ global_y = local_nptsy*myposy + j - 1; if (global_y == nptsy/2 || global_y == nptsy/2-1){ mask[i][j] = 0; phi[i][j] = 1.0; } } }}

*local_ptrx = local_nptsx; *local_ptry = local_nptsy;

Laplace Equation – Performance Analysis For n2 grid size, N = PxQ processor

mesh, tcalc time for a flop, tshift time for communication per grid point

Tseq = T(N) =

4n2tcalc

(4n2/N)tcalc + (2n/P)tshift + (2n/Q)tshift

A more irregular problem – Molecular Dynamics Simulation

In the previous problems, communication requirements are known in advance

In the current Molecular Dynamics simulation problem, the amount of data that are communicated between processors are not known in advance

The communication is slightly irregular

The Problem A domain consisting of number of particles

(molecules) Each molecule, i is exerted a force, fij by

another molecule, j The sum of all the forces, Fi = ∑jfij makes the

particles assume a new position and velocity

Particles that are r distance apart do not influence each other

Given initial velocities and positions of particles, their movements are followed for discrete time steps

MD - Solution The cutoff distance, r is used to

reduce the time for summation from O(n2)

rDomain decomposed into cells of size rxrParticles in one cell interact with particles in the neighbouring 8 cells and particles in the same cell

MD - SolutionData structures:An array of all particles. Each element holds <position,

velocity>A 2D array of linked lists, one for each cell. Each element of a

linked list contains pointers to particles.struct particle{ double position[2]; double velocity[2];} Particles[MAX_PARTICLES];

struct list{ particle* part; struct list* next;}*List[MAX_CELLSX][MAX_CELLSY];

Linked ListParticles

MD – Sequential LogicInitialize Particles and Lists;

for each time step for each particle i Let cell(k, l) hold i F[i] = 0; for each particle j in this cell and neighboring 8 cells, and are

r distance from i{ F[i]+= f[i, j]; } update particle[i].{position, velocity} due to F[i]; if new position in new cell (m,n) update Lists[k,l] and

Lists[m,n]

MD – Parallel Version

A 2D array of processors similar to LaplaceEach processor holds a set of cells

Differences:•A processor can communicate with the diagonal neighbors•Amount of data communicated varies over time steps•Receiver does not know the amount of data

MDS – parallel solution Steps

1. Communication – Each processor communicates parameters of the particles on the boundary cells to its 8 neighboring cells

Challenges – to communicate diagonal cells2. Update – Each processor calculates new particle velocities and positions3. Migration – Particles may migrate to cells in other processors

Other challenges:1. Appropriate packing of data.2. Particles may have to go through several hops during migrationAssumptions:1. For simplicity, let us assume that particles are transported to

only neighboring cells during migration

MDS – parallel solution – 1st step Communication of boundary data

a a aaaa

b b bbbb

c c cccc

d d dddd

a a aaaa

b b bbbb

c c cccc

d d dddd

a a aaaa

C Cc c c

bb b b

A Aa a a B Bb b b

D Dd d d

a a aaaa

b b bbbb

c c cccc

d d dddd

a a aaaa

C Cc c c

bb b b

A Aa a a B Bb b b

D Dd d d

Can be achieved by ?Shift left, shift right, shift up, shift down

MDS – parallel solution – 1st stepLeft shiftnsend = 0;for(i=0; i<local_cellsx; i++){ for each particle p in cell (i, 1){ pack position of p in sbuf nsend += 2 }}

MPI_Sendrecv(sbuf, nsend, …, left,..rbuf, max_particles*2, …, right, &status);

MPI_Getcount(status, MPI_DOUBLE, &nrecv);particles = nrecv/2;for(i=0; i<particles; i++){ read (x,y) from next 2 positions in rbuf; add (x,y) to particles[local_particle_count+i]; determine cell k, l for the particle Add it to list (k, l);}

MDS – parallel solution – 2nd stepUpdate: Similar to sequential program. A processor has all the required information

for calculating Fi for all its particles Thus new position and velocity determined. If new position belongs to the same cell in

the same processor, do nothing If new position belongs to the different cell

in the same processor, update link lists for old and new cells.

MDS – parallel solution – 3rd step If new position belongs to the different cell in a

different processor – particle migrationfor each particle p update {position, velocity} determine new cell if new cell # old cell delete p from list of old cell if(different processor) pack p into appropriate communication buffer remove p from particle array

Shift leftShift rightShift upShift down

MDS – parallel solution – 3rd step This shifting is a bit different from the previous shifting A processor may just act as a transit point for a particle Hence particles have to be packed with careShift left:MPI_Sendrecv(leftbuf, nsend_left, …, left

rbuf, max_size*4, .., right, &status);MPI_Getcount(status, MPI_DOUBLE, &nrecv);particles = nrecv/4;

for(i=0; i<particles; i++){ read next 4 numbers in {x, y vx, vy} if(particle in this process) add particle to particle array determine cell add particle to list for the cell else put data in the appropriate comm. buffer for the next up or down

shifts}

MDS – comments Generic solution

A particle can move to any cell Force can be from any distance

Load balancing

A Dynamical System- WaTor Courtesy: Dr. David Walker, Cardiff

A 2-D ocean in which sharks and fish survive

2 important featuresa. Need for dynamic load distributionb. Potential conflicts due to updates by

different processors Features shared by other advanced

parallel applications

WaTor – The problem Ocean divided into grids Each grid cell can be empty or have a

fish or a shark Grid initially populated with fishes and

sharks in a random manner Population evolves over discrete time

steps according to certain rules

WaTor - RulesFish: At each time step, a fish tries to move

to a neighboring empty cell. If not empty, it stays

If a fish reaches a breeding age, when it moves, it breeds, leaving behind a fish of age 0. Fish cannot breed if it doesn’t move.

Fish never starves

WaTor - RulesShark: At each time step, if one of the neighboring

cells has a fish, the shark moves to that cell eating the fish. If not and if one of the neighboring cells is empty, the shark moves there. Otherwise, it stays.

If a shark reaches a breeding age, when it moves, it breeds, leaving behind a shark of age 0. shark cannot breed if it doesn’t move.

Sharks eat only fish. If a shark reaches a startvation age (time steps since last eaten), it dies.

Inputs and Data StructuresInputs: Size of the grid Distribution of sharks and fishes Shark and fish breeding ages Shark starvation ageData structures:A 2-D grid of cells struct ocean{ int type /* shark or fish or empty */ struct swimmer* occupier; }ocean[MAXX][MAXY]A linked list of swimmers struct swimmer{ int type; int x,y; int age; int last_ate; int iteration; swimmer* prev; swimmer* next; } *List;

Sequential Code Logic•Initialize ocean array and swimmers list•In each time step, go through the swimmers in the order in which they are stored and perform updates

Towards a Parallel Code 2-D data distribution similar to

Laplace and molecular dynamics is used. Each processor holds a grid of ocean cells.

For communication, each processor needs data from 4 neighboring processors.

2 new challenges – potential for conflicts, load balancing

1st Challenge – Potential for Conflicts Unlike previous problems,

border cells may change during updates due to fish or shark movement

Border cells need to be communicated back to the original processor. Hence update step involves communication

In the meantime, the original processor may have updated the border cell. Hence potential conflicts

Time T

Time T+1

S F SF

2 Techniques Rollback updates for those particles

(fish or shark) that have crossed processor boundary and are in potential conflicts.

May lead to several rollbacks until a free space is found.

2nd technique is synchronization during updates to avoid conflicts in the first place.

2 Techniques During update, a processor

x sends its data first to processor y, allows y to perform its updates, get the updates from y, and then performs its own updates.

Synchronization can be done by sub-partitioning.

Divide a grid owned by a processor into sub-grids.

1 23 4

update

Load Imbalance The workload distribution changes

over time 2-D block distribution is not optimal

Techniques: Dynamic load balancer Static load balancing by a different data

distribution

Dynamic load balancing Performed at each time step Orthogonal recursive bisection

Problems: Complexity in finding the neighbors and data for communication

Static Data Distribution Cyclic

0,0 0,1 0,2 0,3 0,0 0,1 0,2 0,3 1,0 1,1 1,2 1,3 1,0 1,1 1,2 1,3 2,0 2,1 2,2 2,3 2,0 2,1 2,2 2,3 3,0 3,1 3,2 3,3 3,0 3,1 3,2 3,3

Problems: Increase in boundary data; increase in communication

Introductory Parallel Applications Courtesy: Dr. David Walker, Cardiff University.

Documents

Transcript of Introductory Parallel Applications Courtesy: Dr. David Walker, Cardiff University.

Courtesy CDC - · Leishmania sp. Numerous ovoid to round small (2-6 μm) parasites in a bone marrow aspirate. In some parasites the typical rod shaped kinetoplast is seen beside

Analysis of Cisplatin and its Variants in Water · 2020. 2. 14. · Analysis of Cisplatin and its Variants in Water COSMOSIL πNAP Technical Note 18 Data courtesy of Prof. Ryuji Kato,

Courtesy of The Steroid Hormone Cascade€¦ · The Steroid Hormone Cascade Pregnenolone Progesterone Pregnanediol Pregnanetriol 11-Deoxycortisol Cortisone Cortisol Tetrahydrocortisone

| 1 tx | courtesy of Nathan Long, MD ProFractional-XC · ProFractional-XC™ Parameters: 175 ˜m ablation, 11% density, 2 passes on lower eyelid & upper lip. 1 pass on the upper eyelid.

James Walker Australasia 1930 – 2005 year detailed.pdf · best we could find to meet all criteria, including price. Nowadays, you would not give the place a second thought, let

Cardiff Universityorca.cf.ac.uk/118080/7/Alberti2019_Article_Essential...Ann. Henri Poincar´e Online First c 2019 The Author(s) Annales Henri Poincar´e Essential Spectrum for Maxwell’s

Mike Holenderski, m.holenderski@tue.nl Communication 2IN60: Real-time Architectures (for automotive systems) (many slides are by courtesy of Damir Isovic)

Time series, spring 2014 - Chalmersrootzen/timeseries/timeseries... · The Yule-Walker estimates were found by applying the Durbin-Levinson Algorithm with the ACF replaced by the

Hydraprobe calibration O. Merlin, J. Walker, R. Panciera, H. Meade, D. Biasioni, R. Young, L. Siriwardena, A. Western 3 rd NAFE Workshop 17-18 Sept. 2007.

11 Phys I Stiliaris...ΦΥΣΙΚΗΙ ALONSO FINN GIANCOLI HALLIDAY‐RESNICK WALKER YOUNG FREEDMAN ΡΕΥΣΤΑΣΕ ΗΡΕΜΙΑ 13.1 έως 13.4 14.1 έως14.5 14.1 ΑΡΧΗΤΟΥ

The clustering of galaxies detected by neutral hydrogen emission Sean Passmoor Prof. Catherine Cress Image courtesy of NRAO/AUI and Fabian Walter, Max.

R.S. RAM, K. TERESZCHUK, I. GORDON 1, K.A. WALKER 2 and P. F. BERNATH Department of Chemistry, University of York, Heslington, York YO10 5DD, UK. 1 Harvard-Smithsonian.

Prof. Dr. JoséPedro Donoso - gradadm.ifsc.usp.br sonoras 2.pdf · Halliday, Resnick e Walker ... 2 ππππ/3 ou 120 o Halliday, Resnick, Walker, ... J.S. Rigden, 2nd edition (Wiley

„Killer apps‟ of Abaqus in the Offshore Industry · PDF file„Killer apps‟ of Abaqus in the Offshore Industry ... •Multiphysics analysis including thermal, ... Crack (courtesy

SRW HISTORY I SRW DESIGNncma-br.org/pdfs/masterlibrary/SRWmarketHISTORY_3 Design.pdfRetaining Wall Figure 3. Earth Pressure for a Vertical Wall Figure 4. Gravity Retaining Wall (Courtesy

Hollow Fiber Element 0.01 HAW 0.03 AT ... - … Fiber Element Nominal filtration rating: ... Courtesy of Steven Engineering, Inc.-230 Ryan Way, ... 1 Bracket set SFD-BR100 With 2 countersunk

Walker-Wagner Syndrome 2095 W transactivation of MITF ... · W w: A symbol of map distance. θ, map distance, cM, recombination, mapping function W Chromosome: This corresponds to

Advances in Laser Fusion EnergyAtzeni & Schiavi, PPCF 2004 Titolo Laser NIF 31 3D simulation of a NIF ignition experiment S. Haan et al., NF 44, S171 (2004), courtesy of LLNL 37 Alternative

Solar Neutrino Results from Phase III of the Sudbury Neutrino … · 2008-06-18 · Image courtesy National Geographic Sudbury Neutrino Observatory (SNO) 1 kt D 2O Nucl. Inst. Meth.

How cold is cold dark matter? - Institute of Astronomy€¦ · How cold is cold dark matter? Mark Wilkinson with Matt Walker, Justin Read, Walter Dehnen, ... R /kpc LP(R,v los) v