FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an...

58
FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and Implicitly Dealiased Convolutions John C. Bowman Malcolm Roberts University of Alberta Advanced Micro Devices Feb 14, 2020 www.math.ualberta.ca/bowman/talks 1

Transcript of FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an...

Page 1: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

FFTW++: A Hybrid OpenMP/MPIImplementation of FFTs and Implicitly

Dealiased Convolutions

John C. Bowman Malcolm Roberts

University of Alberta Advanced Micro Devices

Feb 14, 2020

www.math.ualberta.ca/∼bowman/talks1

Page 2: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Discrete Cyclic Convolution

•The FFT provides an efficient tool for computing the discretecyclic convolution

N−1∑p=0

FpGk−p,

where the vectors F and G have period N .

•Define the N th primitive root of unity:

ζN = exp

(2πi

N

).

•The fast Fourier transform method exploits the properties thatζrN = ζN/r and ζNN = 1.

•However, the pseudospectral method requires a linearconvolution.

2

Page 3: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•The unnormalized backwards discrete Fourier transform of{Fk : k = 0, . . . , N} is

fj.=

N−1∑k=0

ζjkN Fk j = 0, . . . , N − 1.

•The corresponding forward transform is

Fk.=

1

N

N−1∑j=0

ζ−kjN fj k = 0, . . . , N − 1.

•The orthogonality of this transform pair follows from

N−1∑j=0

ζ`jN =

N if ` = sN for s ∈ Z,1− ζ`NN1− ζ`N

= 0 otherwise.

3

Page 4: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolution Theorem

N−1∑j=0

fjgjζ−jkN =

N−1∑j=0

ζ−jkN

N−1∑p=0

ζjpN Fp

N−1∑q=0

ζjqNGq

=

N−1∑p=0

N−1∑q=0

FpGq

N−1∑j=0

ζ(−k+p+q)jN

= N∑s

N−1∑p=0

FpGk−p+sN .

•The terms indexed by s 6= 0 are aliases; we need to removethem!

• If only the firstm entries of the input vectors are nonzero, aliasescan be avoided by zero padding input data vectors of length mto length N ≥ 2m− 1.

•Explicit zero padding prevents mode m− 1 from beating withitself and wrapping around to contaminate modeN = 0 modN .

4

Page 5: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

5

Page 6: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

{Fk}m−1k=0 {0}m−1

k=0 {Gk}m−1k=0 {0}m−1

k=0

5

Page 7: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

{Fk}m−1k=0 {0}m−1

k=0 {Gk}m−1k=0 {0}m−1

k=0

{fj}2m−1j=0

FFT−1

{gj}2m−1j=0

FFT−1

5

Page 8: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

{Fk}m−1k=0 {0}m−1

k=0 {Gk}m−1k=0 {0}m−1

k=0

{fj}2m−1j=0

FFT−1

{gj}2m−1j=0

FFT−1

{fjgj}2m−1j=0

5

Page 9: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

{Fk}m−1k=0 {0}m−1

k=0 {Gk}m−1k=0 {0}m−1

k=0

{fj}2m−1j=0

FFT−1

{gj}2m−1j=0

FFT−1

{fjgj}2m−1j=0

{F ∗G}m−1k=0

FFT

5

Page 10: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

• Since FFT sizes with small prime factors in practice yieldthe most efficient implementations, the padding is normallyextended to N = 2m:

{Fk}m−1k=0 {Gk}m−1

k=0

{Fk}m−1k=0 {0}m−1

k=0 {Gk}m−1k=0 {0}m−1

k=0

{fj}2m−1j=0

FFT−1

{gj}2m−1j=0

FFT−1

{fjgj}2m−1j=0

{F ∗G}m−1k=0

FFT

F ∗G

5

Page 11: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding

•Let N = 2m. For j = 0, . . . , 2m− 1 we want to compute

fj =

2m−1∑k=0

ζjk2mFk.

• If Fk = 0 for k ≥ m, one can easily avoid looping over theunwanted zero Fourier modes by decimating in wavenumber:

f2` =

m−1∑k=0

ζ2`k2mFk =

m−1∑k=0

ζ`kmFk, ` = 0, 1, . . .m− 1.

f2`+1 =

m−1∑k=0

ζ(2`+1)k2m Fk =

m−1∑k=0

ζ`km ζk2mFk,

•This requires computing two subtransforms, each of size m,for an overall computational scaling of order 2m log2m =N log2m.

6

Page 12: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•Odd and even terms of the convolution can then be computedseparately, multiplied term-by-term, and transformed again toFourier space:

2mFk =

2m−1∑j=0

ζ−kj2m fj

=

m−1∑`=0

ζ−k2`2m f2` +

m−1∑`=0

ζ−k(2`+1)2m f2`+1

=

m−1∑`=0

ζ−k`m f2` + ζ−k2m

m−1∑`=0

ζ−k`m f2`+1 k = 0, . . . ,m− 1.

•No bit reversal is required at the highest level.

•A 1D implicitly padded convolution is implemented in ourFFTW++ library.

•This in-place convolution was written to use six out-of-placetransforms, thereby avoiding bit reversal at all levels.

7

Page 13: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•The computational complexity is 6Km log2m.

•The numerical error is similar to explicit padding and thememory usage is identical.

{Fk}m−1k=0 {Gk}m−1

k=0

8

Page 14: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•The computational complexity is 6Km log2m.

•The numerical error is similar to explicit padding and thememory usage is identical.

{Fk}m−1k=0 {Gk}m−1

k=0

{f2ℓ}m−1ℓ=0 {f2ℓ+1}m−1

ℓ=0 {g2ℓ}m−1ℓ=0 {g2ℓ+1}m−1

ℓ=0

8

Page 15: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•The computational complexity is 6Km log2m.

•The numerical error is similar to explicit padding and thememory usage is identical.

{Fk}m−1k=0 {Gk}m−1

k=0

{f2ℓ}m−1ℓ=0 {f2ℓ+1}m−1

ℓ=0 {g2ℓ}m−1ℓ=0 {g2ℓ+1}m−1

ℓ=0

{f2ℓg2ℓ}m−1ℓ=0 {f2ℓ+1g2ℓ+1}m−1

ℓ=0

8

Page 16: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•The computational complexity is 6Km log2m.

•The numerical error is similar to explicit padding and thememory usage is identical.

{Fk}m−1k=0 {Gk}m−1

k=0

{f2ℓ}m−1ℓ=0 {f2ℓ+1}m−1

ℓ=0 {g2ℓ}m−1ℓ=0 {g2ℓ+1}m−1

ℓ=0

{f2ℓg2ℓ}m−1ℓ=0 {f2ℓ+1g2ℓ+1}m−1

ℓ=0

{(F ∗G)k}m−1k=0

8

Page 17: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Input: vector f, vector gOutput: vector fu← fft−1(f);v← fft−1(g);u← u ∗ v;for k = 0 to m− 1 dof[k]← ζk2mf[k];

g[k]← ζk2mg[k];end

v← fft−1(f);f ← fft−1(g);v← v ∗ f;f ← fft(u);u← fft(v);

for k = 0 to m− 1 dof[k]← f[k] + ζ−k2mu[k];

endreturn f/(2m);

9

Page 18: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 1D

2

3

4

5

6

7

time/(m

log2m)(ns)

102 103 104 105 106m

explicit T=1

implicit T=1

explicit T=4

implicit T=4

10

Page 19: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F G

11

Page 20: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F GF G

11

Page 21: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F GF G

f g

11

Page 22: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F GF G

f g

fg

11

Page 23: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F GF G

f g

fg

F ∗G

11

Page 24: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Convolutions in Higher Dimensions

•An explicitly padded convolution in 2 dimensions requires 12padded FFTs, and 4 times the memory of a cyclic convolution.

F GF G

f g

fg

F ∗GF ∗G

11

Page 25: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Recursive Convolution

•Naive way to compute a multiple-dimensional convolution:

FN1,...,Nd multiply F−1N1,...,Nd

•The technique of recursive convolution allows one to avoidcomputing and storing the entire Fourier image of the data:

FNdNd× convolveN1 ,...,Nd−1 F−1

Nd

12

Page 26: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

F G

13

Page 27: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

FFT−1x {F}

nx even

FFT−1x {F}

nx odd

FFT−1x {G}

nx even

FFT−1x {G}

nx odd

13

Page 28: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

13

Page 29: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

13

Page 30: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

13

Page 31: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

FFT−1x {F ∗G}nx even

FFT−1x {F ∗G}nx odd

13

Page 32: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

•Extra work memory need not be contiguous with the data.

F ∗G

13

Page 33: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 2D

5

10

15

time/(m

2log2m

2)(ns)

102 103m

explicit T=1

implicit T=1

explicit T=4

implicit T=4

14

Page 34: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Implicit Padding in 3D

10

20time/(m

3log2m

3)(ns)

101 102m

explicit T=1

implicit T=1

explicit T=4

implicit T=4

15

Page 35: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Centered (Pseudospectral) Convolutions

•For a centered convolution, the Fourier origin (k = 0) iscentered in the domain:

m−1∑p=k−m+1

fpgk−p

•Need to pad to N ≥ 3m− 2 to remove aliases.

•The ratio (2m− 1)/(3m− 2) of the number of physical to totalmodes is asymptotic to 2/3 for large m .

•A Hermitian convolution arises since the input vectors are real:

f−k = fk.

16

Page 36: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

1D Implicit Hermitian Convolution

2

3

4

5

time/(m

log2m)(ns)

102 103 104 105 106m

explicit T=1

implicit T=1

explicit T=4

implicit T=4

17

Page 37: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Distributed-Memory Parallelization

•The pseudospectral method uses a matrix transpose to localizethe computation of the multi-dimensional FFTs onto individualprocessors.

•Parallel generalized slab/pencil decompositions have recentlybeen developed for distributed-memory architectures.

•We have compared several distributed matrix transposealgorithms, both blocking and nonblocking, under pure MPIand hybrid MPI/OpenMP architectures.

•Local transposition is not required within a single MPI node.

•We have developed an adaptive algorithm, dynamically tunedto choose the optimal block size.

18

Page 38: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 39: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 40: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 41: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 42: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 43: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 44: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

8× 8 Block Transpose over 8 processors

0

1

2

3

4

5

6

7

Process

19

Page 45: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Advantages of Hybrid MPI/OpenMP

•Use hybrid OpenMP/MPI with the optimal number of threads:

– yields larger communication block size;

– local transposition is not required within a single MPI node;

– allows smaller problems to be distributed over a large numberof processors;

– for 3D FFTs, allows for more slab-like than pencil-like models,reducing the size of or even eliminating the need for a secondtranspose;

– sometimes more efficient (by a factor of 2) than pure MPI.

•The use of nonblocking MPI communications allows us tooverlap computation with communication: this can yield upto an additional 32% performance gain for implicitly dealiasedconvolutions, for which a natural parallelism exists betweencommunication and computation. 20

Page 46: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Pure MPI 2D Convolutions

1

2

3

time/(m

2log2m

2)(ns)

102 103 104m

implicit P=24

implicit P=192

explicit P=24

explicit P=192

21

Page 47: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Pure MPI 3D Convolutions

1

2

3

4time/(m

3log2m

3)(ns)

102 103m

implicit P=24

implicit P=192

explicit P=24

explicit P=192

22

Page 48: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Hybrid MPI 3D Adaptive Transpose Timing

23

Page 49: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Hybrid MPI 3D Adaptive Transpose Speedup

24

Page 50: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Communication Costs: Direct Transpose

• Suppose an N ×N matrix is distributed over P processes withP | N .

•Direct transposition involves P−1 communications per process,each of size N 2/P 2, for a total per-process data transfer of

P − 1

P 2N 2.

25

Page 51: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Block Transpose

•Let P = ab. Subdivide N ×M matrix into a × a blocks eachof size N/a×M/a.

• Inner: Over each team of b processes, transpose the a individualN/a×M/a matrices, grouping all a communications with thesame source and destination together.

•Outer: Over each team of a processes, transpose the a×amatrixof N/a×M/a blocks.

26

Page 52: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Communication Costs

•Let τ` be the typical latency of a message and τd be the timerequired to send each matrix element, so that the time to senda message consisting of n matrix elements is

τ` + nτd

.

•The time required to perform a direct transpose is

TD = τ`(P − 1) + τdP − 1

P 2NM = (P − 1)

(τ` + τd

NM

P 2

),

whereas a block transpose requires

TB(a) = τ`

(a +

P

a− 2

)+ τd

(2P − a− P

a

)NM

P 2.

•Let L = τ`/τd be the effective communication block length.

27

Page 53: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Direct vs. Block Transposes

• Since

TD − TB = τd

(P + 1− a− P

a

)(L− NM

P 2

),

we see that a direct transpose is preferred when NM ≥ P 2L,whereas a block transpose should be used when NM < P 2L.

•To find the optimal value of a for a block transpose consider

T ′B(a) = τd

(1− P

a2

)(L− NM

P 2

).

•For NM < P 2L, we see that TB is convex, with a minimum ata =√P .

28

Page 54: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Optimal Number of Threads

•The minimum value of TB is

TB(√P ) = 2τd

(√P − 1

)(L +

NM

P 3/2

)∼ 2 τd

√P

(L +

NM

P 3/2

), P � 1.

•The global minimum of TB over both a and P occurs at

P ≈ (2NM/L)2/3.

• If the matrix dimensions satisfy NM > L, as is typicallythe case, this minimum occurs above the transition value(NM/L)1/2.

29

Page 55: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Transpose Communication Costs

104

105

106

Com

municationCost

101 102 103

P

100

101

102

103

Zero Latency

Direct

Block

Threads

30

Page 56: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

Conclusions

•For centered convolutions in d dimensions implicit paddingasymptotically uses (2/3)d−1 of the conventional storage.

•The factor of 2 speedup is largely due to increased data locality.

•Highly optimized and parallelized implicit dealiasing routineshave been implemented as a software layer FFTW++ (v 2.05) ontop of the FFTW library and released under the Lesser GNUPublic License: http://fftwpp.sourceforge.net/

•Hybrid MPI/OpenMP is often more efficient than pure MPI fordistributed matrix transposes.

•The hybrid paradigm provides an optimal setting for nonlocalcomputationally intensive operations found in applications likethe fast Fourier transform.

•The advent of implicit dealiasing of convolutions makesoverlapping transposition with FFT computation feasible.

31

Page 57: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

•Writing of a high-performance dealiased pseudospectral code isnow a relatively straightforward exercise. For example, see theprotodns project at

http://github.com/dealias/dns

32

Page 58: FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an e cient tool for computing the discrete cyclic convolution NX 1 p=0 FpGk p; where

References[Bowman & Roberts 2011] J. C. Bowman & M. Roberts, SIAM J. Sci. Comput., 33:386, 2011.

[Bowman & Roberts 2016] J. C. Bowman & M. Roberts, to be submitted to Parallel computing, 2016.

[Orszag 1971] S. A. Orszag, Journal of the Atmospheric Sciences, 28:1074, 1971.

[Patterson Jr. & Orszag 1971] G. S. Patterson Jr. & S. A. Orszag, Physics of Fluids, 14:2538, 1971.

[Roberts & Bowman 2016] M. Roberts & J. C. Bowman, submitted to SIAM J. Sci. Comput., 2016.