Challenges in Fully Generating Multigrid Solvers for the Simulation of non-Newtonian Fluids
Sebastian Kuckuk
FAU Erlangen-Nürnberg
18.01.2016
HiStencils 2016, Prague, Czech Republic
Outline
● Scope and Motivation
● Project ExaStencils
● From Poisson to Fluid Dynamics
● 1st Challenge – Code Generation
● 2nd Challenge – Optimization
● 3rd Challenge – Parallelization
● Preliminary Results
● Conclusion and Outlook
Scope and Motivation
PDEs
● Goal: Solve a partial differential equation approximately by solving a discretized form
[Figures: Particles in Flow; Optical Flow (LIC Visualization)]
$\Delta u = f$ in $\Omega$, $u = g$ on $\partial \Omega$
$A u_h = f_h$
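For illustration (not from the slides): discretizing the 2-D Poisson equation with finite differences on a uniform grid with spacing h yields the familiar 5-point stencil, so one Jacobi sweep on the discretized system can be sketched in Scala as follows; all names are illustrative.

// One Jacobi sweep for the 5-point finite difference discretization of
// Δu = f on a uniform, square n x n grid with spacing h;
// Dirichlet boundary values stay fixed.
def jacobiSweep(u: Array[Array[Double]], f: Array[Array[Double]], h: Double): Array[Array[Double]] = {
  val n = u.length
  val uNew = u.map(_.clone)
  for (i <- 1 until n - 1; j <- 1 until n - 1)
    uNew(i)(j) = 0.25 * (u(i - 1)(j) + u(i + 1)(j)
                       + u(i)(j - 1) + u(i)(j + 1) - h * h * f(i)(j))
  uNew
}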
Multigrid
● Smoothing of high-frequency errors
● Coarsened representation of low-frequency errors
Multigrid V-Cycle
● Smoothing
● Restriction
● Coarse grid solving
● Prolongation & correction
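A minimal, self-contained sketch of a recursive V-cycle for the 1-D model problem $-u'' = f$ (illustrative Scala, not ExaSlang or generated code; assumes a fine grid of size 2^k + 1):

object VCycleSketch {
  // Gauss-Seidel smoothing for -u'' = f with fixed Dirichlet boundaries
  def smooth(u: Array[Double], f: Array[Double], h: Double, steps: Int): Unit =
    for (_ <- 0 until steps; i <- 1 until u.length - 1)
      u(i) = 0.5 * (u(i - 1) + u(i + 1) + h * h * f(i))

  def residual(u: Array[Double], f: Array[Double], h: Double): Array[Double] = {
    val r = new Array[Double](u.length)
    for (i <- 1 until u.length - 1)
      r(i) = f(i) - (2 * u(i) - u(i - 1) - u(i + 1)) / (h * h)
    r
  }

  // full-weighting restriction to the next coarser grid
  def restrict(r: Array[Double]): Array[Double] = {
    val rc = new Array[Double](r.length / 2 + 1)
    for (i <- 1 until rc.length - 1)
      rc(i) = 0.25 * r(2 * i - 1) + 0.5 * r(2 * i) + 0.25 * r(2 * i + 1)
    rc
  }

  // linear interpolation of the coarse grid error to the fine grid
  def prolongate(ec: Array[Double], fineLen: Int): Array[Double] = {
    val e = new Array[Double](fineLen)
    for (i <- ec.indices) e(2 * i) = ec(i)
    for (i <- 1 until fineLen - 1 by 2) e(i) = 0.5 * (e(i - 1) + e(i + 1))
    e
  }

  def vCycle(u: Array[Double], f: Array[Double], h: Double, level: Int): Unit = {
    if (level == 0) { smooth(u, f, h, 50); return } // coarse grid solving
    smooth(u, f, h, 2)                              // pre-smoothing
    val rc = restrict(residual(u, f, h))            // restriction
    val ec = new Array[Double](rc.length)
    vCycle(ec, rc, 2 * h, level - 1)                // recurse on the error
    val e = prolongate(ec, u.length)                // prolongation
    for (i <- u.indices) u(i) += e(i)               // correction
    smooth(u, f, h, 2)                              // post-smoothing
  }
}

In practice, vCycle is repeated until a residual criterion is met.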
Computational Grids
● Regular grids
● Block-structured grids -> deferred for now
Main Focus: HPC
● Highly optimized and highly scalable geometric multigrid solvers
● ‘Traditional’ supercomputers like JUQUEEN and SuperMUC
● Alternative architectures like Piz Daint and Mont Blanc
Project ExaStencils
● Sebastian Kuckuk
● Harald Köstler
● Ulrich Rüde
● Christian Schmitt
● Frank Hannig
● Jürgen Teich
● Hannah Rittich
● Matthias Bolten
● Alexander Grebhahn
● Sven Apel
● Stefan Kronawitter
● Armin Größlinger
● Christian Lengauer
Project ExaStencils
● Multi-layered DSL for the specification of
● PDEs
● Discretizations
● (MG) Solver components
● Parallel behavior and application specifics
● Fully automatic code generation fueled by our Scala framework
● Automatic application of low-level optimizations
● Powerful prediction of performance characteristics based on
● Local Fourier Analysis (LFA) and
● Software Product Lines (SPL) technology
From Poisson to Fluid Dynamics
Overall Goal
● Simulation of non-isothermal / non-Newtonian fluid flows
● Suspensions of particles or macromolecules
● E.g. pastes, gels, foams, drilling fluids, food products, blood, etc.
● Important in the mining, chemical and food industries as well as in medical applications
https://www.youtube.com/watch?v=G1Op_1yG6lQ
Overall Goal cont'd
● FORTRAN90 code
● Finite volume discretization
● Staggered grid
● SIMPLE algorithm
● TDMA solvers
● OpenMP parallelization
Fully generate a replica of the source code for "Parallel finite volume method simulation of three-dimensional fluid flow and convective heat transfer for viscoplastic non-Newtonian fluids", D. A. Vasco, N. O. Moraga and G. Haase
Governing Equations (w/o BC)
Poisson:
$\Delta u = f$

NNF:
$-\nabla^T (H \nabla v) + D \, (v^T \cdot \nabla) \, v + \nabla p + D \, (0, \theta, 0)^T = 0$
$\nabla^T v = 0$
$-\nabla^T (\nabla \theta) + G \, (v^T \cdot \nabla) \, \theta = 0$

with $G$, $D$ dependent on physical properties and the temperature-dependent density, and viscosity $H = \Gamma(v)$ dependent on physical properties and computed w.r.t. various models
The SIMPLE Algorithm
● Semi-Implicit Method for Pressure Linked Equations
● Concept:
Solve for velocity components (u, v, w)
Solve for pressure correction
Apply pressure correction
Update properties
The SIMPLE Algorithm
● Temperature can be added as a separate step
Solve for velocity components (u, v, w)
Solve for pressure correction
Apply pressure correction
Solve for temperature
Update properties
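A minimal sketch of this extended loop structure (illustrative Scala with stub routines; the actual solver components are generated):

object SimpleSketch {
  // stubs standing in for the generated solver components
  def solveMomentum(component: String): Unit = println(s"solve momentum $component")
  def solvePressureCorrection(): Unit = println("solve pressure correction")
  def applyPressureCorrection(): Unit = println("apply pressure correction")
  def solveTemperature(): Unit = println("solve temperature")
  def updateProperties(): Unit = println("update properties")

  // one outer SIMPLE iteration, repeated until convergence
  def simpleStep(): Unit = {
    Seq("u", "v", "w").foreach(solveMomentum) // one solve per velocity component
    solvePressureCorrection()
    applyPressureCorrection()
    solveTemperature()                        // temperature as a separate step
    updateProperties()
  }
}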
Staggered grid
● Finite volume discretization on a staggered grid
● values associated with the x-staggered grid, e.g. U
● values associated with the y-staggered grid, e.g. V
● values associated with the cell centers, e.g. p and θ
● control volumes associated with cell-centered values
● control volumes associated with x-staggered values
● control volumes associated with y-staggered values
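The different localizations translate directly into different array extents; a minimal sketch for a 2-D grid of nx x ny cells (illustrative Scala, not the generated data layout):

object StaggeredLayout {
  val (nx, ny) = (64, 64)
  val p     = Array.ofDim[Double](nx,     ny)     // cell-centered: p
  val theta = Array.ofDim[Double](nx,     ny)     // cell-centered: θ
  val u     = Array.ofDim[Double](nx + 1, ny)     // x-staggered: one extra column
  val v     = Array.ofDim[Double](nx,     ny + 1) // y-staggered: one extra row
}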
Differences to Poisson
● From model problem to application
Poisson -> NNF:
● $\Delta u = f$ -> coupled momentum, continuity and temperature equations (see above)
● Linear -> Non-linear
● Scalar PDE (one unknown) -> 3 velocity components, pressure and temperature with varying localizations
● Simple boundary conditions -> Mixed boundary conditions
● Finite differences -> Finite volumes
● One grid -> Staggered grids
● Uniform spacing -> Local refinement
1st Challenge – Code Generation
Extensions
Necessary preparations:
● Interface from/to FORTRAN
● Data and functions
Allows replacing single portions of the code, one step at a time
● Data layouts for staggered grids
● Requires, apart from the language extension itself,
● adapted field layouts
● new communication routines
● specialized boundary conditions
● (refined coarsening and interpolation stencils)
Porting code
● Straight-forward for simple kernels
subroutine advance_fields ()
  !$omp parallel do &
  !$omp private(i,j,k) &
  !$omp firstprivate(l1,m1,n1) &
  !$omp shared(rho,rho0) &
  !$omp schedule(static) default(none)
  do k=1,n1
    do j=1,m1
      do i=1,l1
        rho0(i,j,k)=rho(i,j,k)
      end do
    end do
  end do
  !$omp end parallel do
end subroutine advance_fields
Function AdvanceFields@finest () : Unit {
  loop over rho@current {
    rho[next]@current = rho[active]@current
  }
  advance rho@current
}
Porting code
● But what about more complicated code? How do we get from this …
! if not at the boundary
fl  = xcvi(i) * v(i,jp,k) * (fy(jp)*rho(i,jp,k) + fym(jp)*rho(i,j,k))
flm = xcvip(im) * v(im,jp,k) * (fy(jp)*rho(im,jp,k) + fym(jp)*rho(im,j,k))
flownu = zcv(k) * (fl+flm)
gm  = xcvi(i) * vis(i,j,k) * vis(i,jp,k) / (ycv(j)*vis(i,jp,k) + ycv(jp)*vis(i,j,k) + 1.e-30)
gmm = xcvip(im) * vis(im,j,k) * vis(im,jp,k) / (ycv(j)*vis(im,jp,k) + ycv(jp)*vis(im,j,k) + 1.e-30)
diff = 2. * zcv(k) * (gm+gmm)
call diflow(flownu,diff,acof)
adc = acof + max(0.,flownu)
anu(i,j,k) = adc - flownu
Porting code
● … to this?
flownu@current = integrateOverXStaggeredNorthFace (
v[active]@current * rho[active]@current )
Val diffnu : Real = integrateOverXStaggeredNorthFace (
evalAtXStaggeredNorthFace ( vis@current, "harmonicMean" ) )
/ vf_stagCVWidth_y@current@[0, 1, 0]
AuStencil@current:[0, 1, 0] = -1.0 * ( calc_diflow ( flownu@current, diffnu )
+ max ( 0.0, flownu@current ) - flownu@current )
Extended boundary conditions
Function ApplyBC_u@finest ( ) : Unit {
  loop over u@current only duplicate [ 1, 0] on boundary {
    u@current = 0.0
  }
  loop over u@current only duplicate [-1, 0] on boundary {
    u@current = 0.0
  }
  loop over u@current only duplicate [ 0, 1] on boundary {
    u@current = wall_velocity
  }
  loop over u@current only duplicate [ 0, -1] on boundary {
    u@current = -1.0 * wall_velocity
  }
}

Field Solution < global, DefNodeLayout, ApplyBC_u >@finest
Virtual fields
● Rework and extension of the way geometric information is used
● Many virtual fields (positions of (staggered) nodes/ faces/ cells,
interpolation and integration parameters, etc.)
● Tradeoff between on-the-fly calculation and explicit storage
// previous access through functions doesn't allow offset access
rhs@current = sin ( nodePosition_x@current ( ) )

// newly introduced virtual fields allow virtually identical behavior ...
rhs@current = sin ( vf_nodePosition_x@current )

// ... and in addition allow offset accesses
Val dif : Real = ( vf_nodePosition_x@current@[ 1, 0, 0]
                 - vf_nodePosition_x@current@[ 0, 0, 0] )
Specialized functions
● Specialized evaluation and integration functions
// evaluate with respect to cells of the grid
evalAtSouthFace ( rho[active]@current )

// integrate expressions across faces of grid cells
integrateOverEastFace ( u[active]@current * rho[active]@current )

// integrate expressions across faces of cells of the staggered grid
integrateOverXStaggeredEastFace ( u[active]@current * rho[active]@current )

// integrate expressions using specialized interpolation schemes
integrateOverZStaggeredEastFace ( evalAtZStaggeredEastFace ( vis@current, "harmonicMean" ) )
2nd Challenge – Optimization
Optimization of Solver Components
● Very similar to what we usually do (when using SIMPLE)
● 7-point stencils with variable coefficients
● ‘Standard’ options for low-level optimization are viable
● address pre-calculation
● (temporal and spatial) blocking
● vectorization
● polyhedral loop transformations
● etc.
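As one example from this list, a sketch of (spatial) blocking of a stencil sweep, the kind of transformation the generator applies automatically (illustrative Scala):

// blocked sweep of a 2-D 5-point stencil with tile size bs
def blockedSweep(in: Array[Array[Double]], out: Array[Array[Double]]): Unit = {
  val (nx, ny, bs) = (in.length, in(0).length, 32)
  for (ib <- 1 until nx - 1 by bs; jb <- 1 until ny - 1 by bs)
    for (i <- ib until math.min(ib + bs, nx - 1);
         j <- jb until math.min(jb + bs, ny - 1))
      out(i)(j) = 0.25 * (in(i - 1)(j) + in(i + 1)(j)
                        + in(i)(j - 1) + in(i)(j + 1))
}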
Performance distribution
● Rough breakdown of time spent
● Preliminary test with 64³ cells, without optimizations applied
[Chart: breakdown of time spent in LSE update, solve and property update for each component: velocity u, velocity v, velocity w, pressure correction, temperature]
Optimization of LSE Updates
● Has to be redone many times -> costly
● No temporal blocking
● Reuse due to symmetry in some subexpressions possible
● Vectorization challenging (but possible)
Other Open Points
● The choice of discretization specifics, grid spacing and numerical components is crucial
● Representation of the LSEs on coarser levels
● Optimization of the solver is considerably harder when handling all components at once (see the sketch below)
● Multiple unknowns are combined in vectors
● Stencil coefficients become matrices (e.g. 8x8)
● Inverting the center coefficient becomes a matrix inversion
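A sketch of what the last point means in code: with matrix-valued coefficients, the point-wise update u_c = C^{-1} (rhs - Σ A_n u_n) needs a small dense solve instead of a scalar division (illustrative Scala, naive Gaussian elimination without pivoting):

// solve C x = b for a small k x k center coefficient block C
def solveSmall(c: Array[Array[Double]], b: Array[Double]): Array[Double] = {
  val n = b.length
  val m = c.map(_.clone); val x = b.clone
  for (p <- 0 until n; r <- p + 1 until n) {  // forward elimination
    val fac = m(r)(p) / m(p)(p)
    for (col <- p until n) m(r)(col) -= fac * m(p)(col)
    x(r) -= fac * x(p)
  }
  for (p <- n - 1 to 0 by -1) {               // back substitution
    var s = x(p)
    for (col <- p + 1 until n) s -= m(p)(col) * x(col)
    x(p) = s / m(p)(p)
  }
  x
}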
3rd Challenge – Parallelization
Parallelization
● Free for shared memory parallelism
● OpenMP code is emitted by our generator without further effort
● Distributed memory parallelism is also manageable
● Communication patterns and routines are available
● Points of communication have to be annotated (to be automated in the future)
● More difficult: domain partitioning of non-uniform grids
=> approach from the reference work is strictly serial
Domain Partitioning
● Easy for regular domains
● Each domain consists of one or more blocks
● Each block consists of one or more fragments
● Each fragment consists of several data points / cells
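This hierarchy can be mirrored in a simple nesting of types (illustrative Scala, not the generator's actual data structures):

case class Fragment(cells: Array[Double])     // several data points / cells
case class Block(fragments: Vector[Fragment]) // one or more fragments
case class Domain(blocks: Vector[Block])      // one or more blocks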
Parallelization
● Communication statements are added automatically when
transforming Layer 3 to Layer 4 where they may be reviewed or
adapted
/* communicates all applicable layers */
communicate Solution@current

/* communicates only ghost layers */
communicate ghost of Solution[active]@current

/* communicates duplicate and first two ghost layers */
communicate dup, ghost[0, 1] of Solution[active]@current

/* asynchronous communicate */
begin communicate Residual@current
// ...
finish communicating Residual@current
Preliminary results
● Initial version without non-Newtonian model
● Comparison is difficult due to the different exit criteria, e.g.
● Solving the single components
● Reference: fixed to 1 solve step (TDMA)
● Our implementation
● At least one step (RBGS or GMG)
● SIMPLE
● Reference: a certain percentage of all cells (including boundaries) doesn't change beyond a fixed threshold
● Our implementation: convergence for all components is achieved after updating the LSE but before starting the solve routine
$\|r\| \le \alpha \, (1 + \beta \, \|b\|)$ and $|r_{i+1} - r_i| < \alpha$
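The two criteria can be read as simple predicates; a sketch under the assumption that r and b denote residual and right-hand-side norms and α, β are thresholds (names illustrative):

// residual criterion: ||r|| <= α (1 + β ||b||)
def converged(rNorm: Double, bNorm: Double, alpha: Double, beta: Double): Boolean =
  rNorm <= alpha * (1 + beta * bNorm)

// change criterion: |r_{i+1} - r_i| < α
def stagnated(rNext: Double, rPrev: Double, alpha: Double): Boolean =
  math.abs(rNext - rPrev) < alpha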
Preliminary results
● One socket of our local cluster (Intel Xeon E7-4830)
● 16 OMP threads
● 100 timesteps
[Chart: execution time per time step in seconds, log scale from 0.01 to 1000, over problem sizes 16³, 32³, 64³ and 128³, comparing Reference, RBGS and Multigrid]
Conclusion and Outlook
Conclusion
● Generating solvers for simulating non-Newtonian fluids: 95 %
● Optimization: 30 %
● Parallelization: 75 %
Outlook
● Enhanced performance optimizations
● Comparison with analytical models
● Study of parallel performance characteristics
● DSL extensions for relevant concepts
● More complex numeric components
● (Multi-) GPU support
Thank you for your attention!
Questions?