
FETI and HFETI Solvers

Optimization and Highly Parallel Implementation of Domain Decomposition Based Algorithms

[Figure: hierarchical decomposition of the domain Ω, first into clusters (cluster size H_c) and then into subdomains (subdomain size H_s; element size h); λ_c, λ_g and λ_d label the different types of Lagrange multipliers.]

industry.it4i.cz   www.it4i.cz   am.vsb.cz

FETI and HFETI Solvers

• C++ implementation based on Intel MKL sparse and dense BLAS routines and the MKL version of the PARDISO sparse direct solver
• Parallelization tools and strategies
  - hybrid parallelization for multi-socket, multi-core compute nodes
    - enables over-subscription of cores with multiple subdomains
  - distributed-memory parallelization
    - use of MPI 3.0 non-blocking collective operations for global reductions (Intel MPI 5.0)
  - shared-memory parallelization using Intel Cilk Plus
    - enables parallel reduction using custom reduce operations on C++ classes (see the sketch below)
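A minimal sketch of the shared-memory reduction pattern mentioned above, using Intel Cilk Plus (cilk_for plus the standard reducer_opadd reducer); the per-subdomain vector layout and the dot_product name are illustrative only, not the solver's actual classes:

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>
    #include <vector>

    // Dot product of two distributed vectors stored as per-subdomain slices.
    double dot_product(const std::vector<std::vector<double>>& x,
                       const std::vector<std::vector<double>>& y)
    {
        cilk::reducer_opadd<double> sum(0.0);             // race-free parallel reduction
        cilk_for (std::size_t s = 0; s < x.size(); ++s) { // one strand per subdomain
            double local = 0.0;
            for (std::size_t i = 0; i < x[s].size(); ++i)
                local += x[s][i] * y[s][i];
            sum += local;                                 // per-strand views merged by the runtime
        }
        return sum.get_value();
    }

The same reducer mechanism also accepts user-defined monoids, which is how reductions over custom C++ classes (e.g. whole solution vectors) can be parallelized without explicit locking.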

Reduction of global communication for (H)FETI

[Figure: block structure of the gluing matrices B_1, ..., B_4 and of the Lagrange multiplier vector λ over the subdomains Ω_1, ..., Ω_4; each of the four CPUs stores only the λ entries coupled to its own subdomain.]

Benchmark Systems

Hiding latencies in Krylov Solvers

Intercluster Processing

FETI and HFETI per iteration time

Total FETI - Large scale benchmarks

Preconditioned pipelined Conjugate Gradient (CG) algorithm

 1: r_0 := b - A x_0;  u_0 := M^{-1} r_0;  w_0 := A u_0
 2: for i = 0, 1, 2, ... do
 3:   γ_i := (r_i, u_i)
 4:   δ := (w_i, u_i)
 5:   m_i := M^{-1} w_i
 6:   n_i := A m_i
 7:   if i > 0 then
 8:     β_i := γ_i / γ_{i-1};   α_i := γ_i / (δ - β_i γ_i / α_{i-1})
 9:   else
10:     β_i := 0;   α_i := γ_i / δ
11:   end if
12:   z_i := n_i + β_i z_{i-1}
13:   q_i := m_i + β_i q_{i-1}
14:   s_i := w_i + β_i s_{i-1}
15:   p_i := u_i + β_i p_{i-1}
16:   x_{i+1} := x_i + α_i p_i
17:   r_{i+1} := r_i - α_i s_i
18:   u_{i+1} := u_i - α_i q_i
19:   w_{i+1} := w_i - α_i z_i
20: end for

• Sparse matrix-vector product – nearest-neighbor communication, good scaling
• Dot product – global communication, scales as log(P)
• Scalar-vector multiplication, vector-vector addition – no communication

In the pipelined CG above, the two dot products (steps 3-4) are interleaved with the preconditioner and matrix-vector product (steps 5-6), so the global reduction can be overlapped with local work; see the sketch below.
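A minimal sketch, assuming MPI 3.0 (Intel MPI 5.0 as listed above), of how the two dot products of one pipelined CG iteration can be posted as a single non-blocking reduction and overlapped with the local preconditioner and matrix-vector work; the functions apply_preconditioner, spmv and local_dot are placeholders, not the solver's real API:

    #include <mpi.h>

    // Placeholders for the local per-rank kernels (not the solver's real API).
    void apply_preconditioner(const double* w, double* m, int n_local);
    void spmv(const double* m, double* n_out, int n_local);
    double local_dot(const double* a, const double* b, int n_local);

    void pipelined_cg_step(const double* r, const double* u, const double* w,
                           double* m, double* n_out, int n_local,
                           double& gamma, double& delta, MPI_Comm comm)
    {
        // Steps 3-4: start ONE non-blocking reduction for both dot products.
        double local[2]  = { local_dot(r, u, n_local), local_dot(w, u, n_local) };
        double global[2] = { 0.0, 0.0 };
        MPI_Request req;
        MPI_Iallreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm, &req);

        // Steps 5-6: local work overlapped with the global reduction.
        apply_preconditioner(w, m, n_local);   // m_i := M^{-1} w_i
        spmv(m, n_out, n_local);               // n_i := A m_i

        // The reduction must complete before alpha/beta are formed (steps 7-11).
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        gamma = global[0];
        delta = global[1];
    }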

P. Ghysels, T. Ashby, K. Meerbergen and W. Vanroose: Hiding global communication latency in the GMRES algorithm on massively parallel machines.

http://www.prace-ri.eu/    https://projects.imec.be/exa2ct/

Superlinear Strong Scaling of FETI

3D CUBE - HFETI elasticity benchmark

3D CUBE - Large Scale FETI Benchmark

• FETI – decomposition into subdomains (H_c)
• HFETI – decomposition into clusters (H_c) and subdomains (H_s)

• highly parallel generator scales up to thousands of subdomains
• large subdomains from 120,000 to 160,000 DOFs
• sparse direct solver uses most of the memory and CPU time (99%)
• no preconditioner

----------------------------------------------------------------------
#subdomains           125      (Nx Ny Nz) = (5 5 5)
#elements per subd.   8000     (nx ny nz) = (20 20 20)
#all elements:        1000000  (nx ny nz) = (100 100 100)
#coordinates:         1030301
#DOFs undecomp.:      3090903
#DOFs decomp.:        3472875
----------------------------------------------------------------------
mesh: done
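As a consistency check, the totals in the mesh log follow directly from the decomposition parameters ((Nx,Ny,Nz) subdomains of (nx,ny,nz) elements each, 3 DOFs per node):

\begin{align*}
\#\text{elements}       &= (5 \cdot 20)^3 = 100^3 = 1\,000\,000,\\
\#\text{coordinates}    &= (100 + 1)^3 = 1\,030\,301,\\
\#\text{DOFs undecomp.} &= 3 \cdot 101^3 = 3\,090\,903,\\
\#\text{DOFs decomp.}   &= 125 \cdot 3 \cdot (20 + 1)^3 = 125 \cdot 27\,783 = 3\,472\,875.
\end{align*}

The decomposed DOF count is larger because nodes on subdomain interfaces are duplicated in every subdomain that touches them.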

• each MPI process iterates only over the lambdas associated with the given subdomain
  - FETI – required for the multiplication with the restriction of the B matrix to the given subdomain
  - HFETI – required for the multiplication with the restriction of the B1 matrix to the given subdomain
• the global update of the vector λ therefore becomes only a nearest-neighbor type of communication with good scalability (see the sketch below)
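A minimal sketch of such a nearest-neighbor λ update, assuming each MPI process already knows its neighboring ranks and which of its local λ entries it shares with each of them; the Neighbor structure and index bookkeeping are illustrative only, not the solver's actual data structures:

    #include <mpi.h>
    #include <vector>

    struct Neighbor {
        int rank;                     // neighboring MPI process
        std::vector<int> shared;      // local indices of lambdas shared with it
    };

    // Accumulate the shared lambda entries with the neighbors only;
    // no global collective operation is involved in this update.
    void update_lambda(std::vector<double>& lambda,
                       const std::vector<Neighbor>& neighbors, MPI_Comm comm)
    {
        std::vector<MPI_Request> req(2 * neighbors.size());
        std::vector<std::vector<double>> send(neighbors.size()), recv(neighbors.size());

        for (std::size_t k = 0; k < neighbors.size(); ++k) {
            const Neighbor& nb = neighbors[k];
            send[k].resize(nb.shared.size());
            recv[k].resize(nb.shared.size());
            for (std::size_t j = 0; j < nb.shared.size(); ++j)
                send[k][j] = lambda[nb.shared[j]];
            MPI_Isend(send[k].data(), (int)send[k].size(), MPI_DOUBLE, nb.rank, 0, comm, &req[2 * k]);
            MPI_Irecv(recv[k].data(), (int)recv[k].size(), MPI_DOUBLE, nb.rank, 0, comm, &req[2 * k + 1]);
        }
        MPI_Waitall((int)req.size(), req.data(), MPI_STATUSES_IGNORE);

        for (std::size_t k = 0; k < neighbors.size(); ++k)     // sum the received contributions
            for (std::size_t j = 0; j < neighbors[k].shared.size(); ++j)
                lambda[neighbors[k].shared[j]] += recv[k][j];
    }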

Communication layer performance evaluation on FETI

- CG algorithm with 2 reductions – the standard version of the CG algorithm
- CG algorithm with 1 reduction – based on the preconditioned pipelined CG algorithm, with the projector used in place of the preconditioner (this algorithm is ready to use non-blocking global reductions for further performance improvements, available in Intel MPI 5.0)
- GGTINV – parallelizes the coarse problem solve and merges the two Gather and Scatter global operations into a single AllGather (see the sketch below)
- subdomain size 5^3 elements, i.e. 3*(5+1)^3 = 648 DOFs – a small subdomain size is chosen on purpose to expose all communication bottlenecks of the solver
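A minimal sketch of the GGTINV idea under the assumptions above: each MPI process stores its block of rows of the explicitly formed (G G^T)^{-1}, so a single MPI_Allgatherv of the local right-hand-side pieces replaces the Gather / sequential coarse solve / Scatter sequence; the block-row storage and all names here are illustrative, not the solver's API:

    #include <mpi.h>
    #include <vector>

    // Coarse problem: given the local piece e_local of the right-hand side e,
    // return the local piece of x = (G G^T)^{-1} e using a locally stored
    // block of rows of the explicit inverse (dense, row-major, n_local x n_global).
    std::vector<double> coarse_solve(const std::vector<double>& ggt_inv_rows,
                                     const std::vector<double>& e_local,
                                     const std::vector<int>& counts,  // piece sizes per rank
                                     const std::vector<int>& displs,  // piece offsets per rank
                                     MPI_Comm comm)
    {
        int n_global = displs.back() + counts.back();
        int n_local  = (int)e_local.size();
        std::vector<double> e(n_global);

        // One collective: every rank obtains the complete vector e.
        MPI_Allgatherv(e_local.data(), n_local, MPI_DOUBLE,
                       e.data(), counts.data(), displs.data(), MPI_DOUBLE, comm);

        // Local block-row of the inverse times the full vector; no further communication.
        // (In the real solver this would be a dense MKL BLAS gemv call.)
        std::vector<double> x_local(n_local, 0.0);
        for (int i = 0; i < n_local; ++i)
            for (int j = 0; j < n_global; ++j)
                x_local[i] += ggt_inv_rows[(std::size_t)i * n_global + j] * e[j];
        return x_local;
    }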

[Chart: "Engine 2.5 million DOFs – FETI – iteration time"; iteration time [s] (log scale) vs. number of subdomains (32, 64, 128, 256, 512, 1024); series: REGCG - LUMPED - GGTINV, PIPECG - NOPREC - GGTINV, REGCG - LUMPED - NOGGTINV, PIPECG - NOPREC - NOGGTINV, and a linear strong scaling reference (based on 32 subdomains); the iteration times drop from roughly 0.11–0.13 s on 32 subdomains to 0.0015–0.0028 s on 1024 subdomains.]

[Chart: "FETI iteration time (measured on Anselm)"; iteration time [s] vs. number of subdomains (up to 2000); series: FETI - CG with 2 reductions, FETI - CG with 1 reduction, FETI - CG with 1 reduction - GGTINV.]

• 2.5 million DOFs solved with the FETI solver
• decomposed into 32, 64, 128, 256, 512 and 1024 subdomains
• measured on the Anselm supercomputer
  – 128 nodes with 16 CPU cores (2×8); 8 subdomains per node

SurfSara.nl – Cartesius – up to ~8600 cores – non-blocking island of 360 nodes each with:

• 2× 12-core Intel Xeon E5-2695 v2 (Ivy Bridge), 2.4 GHz and 64 GB RAM

– InfiniBand FDR network - 56 Gbit/s inter-node bandwidth

IT4Innovations – www.it4i.cz – Anselm – up to ~3300 cores – non-blocking cluster of 209 nodes each with:

• 2× 8-core Intel Xeon E5-2665 (Sandy Bridge), 2.4 GHz and 64 GB RAM

– InfiniBand QDR network - 40 Gbit/s inter-node bandwidth

Cluster size: ~120 000 DOFs; optimal decomposition into 27 subdomains

domain size   subdomains   size     avg        sum        prec        iter
     4           343       128625   0.123941   5.949182   75.503219    48
     5           216       139968   0.072914   3.718598   37.732576    51
     6           125       128625   0.040325   2.217873   16.381135    55
     8            64       139968   0.030154   1.718762   10.04497     57
    11            27       139968   0.034425   1.893399    7.490033    55
    17             8       139968   0.064372   3.347334    7.876874    52

Cluster size: ~330 000 DOFs; optimal decomposition into 27 subdomains

domain size   subdomains   size     avg        sum         prec         iter
     5           512       331776   0.309891   17.973677   338.035839    58
     6           343       352947   0.198848   12.328554   180.041968    62
     7           216       331776   0.118965    7.613787    90.402162    64
    11            64       331776   0.080177    5.13133     27.456335    64
    15            27       331776   0.095755    6.03256     19.761467    63
    23             8       331776   0.156146    8.431877    29.455252    54

Note: the domain size ‘N’ is the number of elements per edge of the cubic subdomain; the real subdomain size is 3*(N+1)^3 DOFs (e.g. N = 4 with 343 subdomains gives 343 * 3*5^3 = 128 625 DOFs per cluster, the ‘size’ column above).

Note: avg – average iteration time [s]; sum – total time of all iterations = solution time; prec – cluster preprocessing time; iter – number of iterations;

[Chart: "HFETI vs FETI – one iteration time (measured on Anselm)"; iteration time [s] vs. number of subdomains (up to 2000); series: FETI - CG with 1 reduction, HFETI - CG with 1 reduction.]

[Chart: "HFETI vs FETI – one iteration time (measured on Cartesius)"; iteration time [s] vs. number of subdomains (up to 5000); series: FETI, HFETI.]

[Chart: "HFETI vs FETI – one iteration time (measured on Anselm)"; iteration time [s] vs. number of subdomains (up to 2000); series: FETI - CG with 1 reduction - GGTINV, HFETI - CG with 1 reduction, HFETI - CG with 2 reductions, HFETI - CG with 1 reduction - GGTINV.]

Weak scaling – the domain size and the cluster size are fixed; the number of domains goes from 1 to 2000
- domain size 5^3 elements, i.e. 3*(5+1)^3 = 648 DOFs; cluster size 16 domains (1 cluster per node; 1 subdomain per core)
- efficient communication algorithms help both FETI and HFETI
- CG with 1 reduction + coarse problem solved using the distributed inverse matrix (GGTINV)

Weak scaling – the domain size and the cluster size are fixed; the number of domains goes from 1 to 2000
- both FETI and HFETI use CG with 1 reduction, and the coarse problem is solved using the distributed inverse matrix of GGT
- Cartesius: domain size 5^3 elements, i.e. 3*(5+1)^3 = 648 DOFs; cluster size 8 domains (3 clusters per node; 1 subdomain per core)
- Anselm: domain size 5^3 elements, i.e. 3*(5+1)^3 = 648 DOFs; cluster size 16 domains (1 cluster per node; 1 subdomain per core)

[Chart: "Preprocessing, number of iterations and solution time"; solution and preprocessing time [s] vs. problem size [DOFs] (up to ~1.5 billion); series: GGT time [s] and solution time [s] for subdomain sizes of 164 616, 177 957, 192 000 and 206 763 DOFs; the number of iterations grows from 24 to about 51 with increasing problem size.]

[Chart: "Coarse problem preprocessing time, K factorization time, solution time and number of iterations – subdomain size 192 000 DOFs (39^3 elements)"; series: K factorization time [s], GGT time [s] and solution time [s]; the K factorization time stays roughly constant at 53–59 s per subdomain, the GGT (coarse problem) preprocessing time grows from 0.08 s to about 144 s, and the solution time grows from about 15.5 s (24 iterations) to about 49.8 s (50 iterations).]

- ready to use MIC accelerators
