FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and … · 2020. 2. 28. · The FFT provides an...

FFTW++: A Hybrid OpenMP/MPI Implementation of FFTs and Implicitly Dealiased Convolutions

John C. Bowman (University of Alberta)

Malcolm Roberts (Advanced Micro Devices)

Feb 14, 2020

www.math.ualberta.ca/~bowman/talks

Discrete Cyclic Convolution

• The FFT provides an efficient tool for computing the discrete cyclic convolution

$$\sum_{p=0}^{N-1} F_p G_{k-p},$$

where the vectors $F$ and $G$ have period $N$.

• Define the $N$th primitive root of unity:

$$\zeta_N = \exp\left(\frac{2\pi i}{N}\right).$$

• The fast Fourier transform method exploits the properties that $\zeta_N^r = \zeta_{N/r}$ and $\zeta_N^N = 1$.

• However, the pseudospectral method requires a linear convolution.

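• The unnormalized backwards discrete Fourier transform of $\{F_k : k = 0, \ldots, N-1\}$ is

$$f_j \doteq \sum_{k=0}^{N-1} \zeta_N^{jk} F_k, \qquad j = 0, \ldots, N-1.$$

• The corresponding forward transform is

$$F_k \doteq \frac{1}{N}\sum_{j=0}^{N-1} \zeta_N^{-kj} f_j, \qquad k = 0, \ldots, N-1.$$

• The orthogonality of this transform pair follows from

$$\sum_{j=0}^{N-1} \zeta_N^{\ell j} =
\begin{cases}
N & \text{if } \ell = sN \text{ for } s \in \mathbb{Z},\\[4pt]
\dfrac{1-\zeta_N^{\ell N}}{1-\zeta_N^{\ell}} = 0 & \text{otherwise.}
\end{cases}$$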
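The transform pair and the orthogonality relation can be checked numerically. The following is a minimal sketch using naive $O(N^2)$ sums rather than an FFT; the function names and the sample vector are illustrative assumptions and are not part of FFTW++.

```python
import cmath

def zeta(N, power=1):
    """Return zeta_N**power, where zeta_N = exp(2*pi*i/N)."""
    return cmath.exp(2j * cmath.pi * power / N)

def backward_dft(F):
    """Unnormalized backward transform: f_j = sum_k zeta_N^{jk} F_k."""
    N = len(F)
    return [sum(zeta(N, j * k) * F[k] for k in range(N)) for j in range(N)]

def forward_dft(f):
    """Normalized forward transform: F_k = (1/N) sum_j zeta_N^{-kj} f_j."""
    N = len(f)
    return [sum(zeta(N, -k * j) * f[j] for j in range(N)) / N for k in range(N)]

def root_sum(N, ell):
    """sum_{j=0}^{N-1} zeta_N^{ell*j}: N if N divides ell, otherwise 0."""
    return sum(zeta(N, ell * j) for j in range(N))

# Hypothetical sample data: a forward transform undoes a backward transform.
F = [1 + 2j, -0.5, 3j, 2.0, -1 - 1j, 0.25]
roundtrip = forward_dft(backward_dft(F))
max_error = max(abs(a - b) for a, b in zip(F, roundtrip))
```

Here `max_error` should be at the level of rounding error, and `root_sum(6, 12)` and `root_sum(6, 5)` illustrate the two cases of the orthogonality relation.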

Convolution Theorem

$$\begin{aligned}
\sum_{j=0}^{N-1} f_j g_j \zeta_N^{-jk}
&= \sum_{j=0}^{N-1} \zeta_N^{-jk} \left(\sum_{p=0}^{N-1} \zeta_N^{jp} F_p\right) \left(\sum_{q=0}^{N-1} \zeta_N^{jq} G_q\right)\\
&= \sum_{p=0}^{N-1} \sum_{q=0}^{N-1} F_p G_q \sum_{j=0}^{N-1} \zeta_N^{(-k+p+q)j}\\
&= N \sum_{s} \sum_{p=0}^{N-1} F_p G_{k-p+sN}.
\end{aligned}$$

• The terms indexed by $s \ne 0$ are aliases; we need to remove them!

• If only the first $m$ entries of the input vectors are nonzero, aliases can be avoided by zero padding input data vectors of length $m$ to length $N \ge 2m - 1$.

• Explicit zero padding prevents mode $m - 1$ from beating with itself and wrapping around to contaminate mode $N = 0 \bmod N$.
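The effect of the aliases can be seen on a three-term example. The sketch below is an illustration under stated assumptions: it uses a naive $O(N^2)$ DFT, and the helper names and input values are hypothetical. It compares the unpadded result with the result obtained after padding to $N = 2m - 1$.

```python
import cmath

def dft(x, sign):
    """Unnormalized DFT with kernel exp(sign*2*pi*i*j*k/N)."""
    N = len(x)
    return [sum(cmath.exp(sign * 2j * cmath.pi * j * k / N) * x[k]
                for k in range(N)) for j in range(N)]

def cyclic_convolution(F, G):
    """Direct O(N^2) cyclic convolution: sum_p F_p G_{(k-p) mod N}."""
    N = len(F)
    return [sum(F[p] * G[(k - p) % N] for p in range(N)) for k in range(N)]

def dft_convolution(F, G):
    """Cyclic convolution via the convolution theorem."""
    N = len(F)
    f, g = dft(F, +1), dft(G, +1)               # backward transforms
    product = [a * b for a, b in zip(f, g)]     # pointwise multiplication
    return [h / N for h in dft(product, -1)]    # forward transform with 1/N

m = 3
F, G = [1, 2, 3], [4, 5, 6]

# N = m: the s != 0 terms wrap around and contaminate every mode.
aliased = dft_convolution(F, G)

# N = 2m - 1: the first m modes now hold the linear convolution.
N = 2 * m - 1
pad = [0] * (N - m)
dealiased = dft_convolution(F + pad, G + pad)[:m]
```

Without padding the result matches `cyclic_convolution(F, G)`, namely 31, 31, 28; with padding the first $m$ entries are 4, 13, 28, the coefficients of the linear convolution.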

• Since FFT sizes with small prime factors in practice yield the most efficient implementations, the padding is normally extended to $N = 2m$:

[Diagram: $\{F_k\}_{k=0}^{m-1}$, $\{G_k\}_{k=0}^{m-1}$ → zero pad → $\{F_k\}_{k=0}^{m-1}\{0\}_{k=0}^{m-1}$, $\{G_k\}_{k=0}^{m-1}\{0\}_{k=0}^{m-1}$ → FFT$^{-1}$ → $\{f_j\}_{j=0}^{2m-1}$, $\{g_j\}_{j=0}^{2m-1}$ → multiply → $\{f_j g_j\}_{j=0}^{2m-1}$ → FFT → $\{(F*G)_k\}_{k=0}^{m-1}$]
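With $N = 2m$ and $m$ a power of two, the padded transforms can use a radix-2 FFT. The following is a minimal sketch of the explicitly padded pipeline, assuming a textbook recursive radix-2 transform; it is not the FFTW or FFTW++ implementation, and the function names and data are illustrative.

```python
import cmath

def fft(x, sign):
    """Recursive radix-2 transform with kernel exp(sign*2*pi*i*j*k/N).

    The length of x must be a power of two. No normalization is applied.
    """
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2], sign)
    odd = fft(x[1::2], sign)
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

def explicit_convolution(F, G):
    """Linear convolution of two length-m vectors via zero padding to 2m."""
    m = len(F)
    zeros = [0j] * m
    f = fft(list(F) + zeros, +1)                 # backward transform of padded F
    g = fft(list(G) + zeros, +1)                 # backward transform of padded G
    h = fft([a * b for a, b in zip(f, g)], -1)   # forward transform of product
    return [h[k] / (2 * m) for k in range(m)]    # normalize, keep first m modes

# Hypothetical data with m = 4.
result = explicit_convolution([1, 2, 3, 4], [5, 6, 7, 8])
```

For these inputs the first four linear convolution coefficients are 5, 16, 34, 60. Half of each padded input and half of the final output carry no information, which is the overhead that implicit padding avoids.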

Implicit Padding

• Let $N = 2m$. For $j = 0, \ldots, 2m - 1$ we want to compute

$$f_j = \sum_{k=0}^{2m-1} \zeta_{2m}^{jk} F_k.$$

• If $F_k = 0$ for $k \ge m$, one can easily avoid looping over the unwanted zero Fourier modes by decimating in wavenumber:

$$f_{2\ell} = \sum_{k=0}^{m-1} \zeta_{2m}^{2\ell k} F_k = \sum_{k=0}^{m-1} \zeta_m^{\ell k} F_k,$$

$$f_{2\ell+1} = \sum_{k=0}^{m-1} \zeta_{2m}^{(2\ell+1)k} F_k = \sum_{k=0}^{m-1} \zeta_m^{\ell k} \zeta_{2m}^{k} F_k, \qquad \ell = 0, 1, \ldots, m - 1.$$

• This requires computing two subtransforms, each of size $m$, for an overall computational scaling of order $2m \log_2 m = N \log_2 m$.
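The decimation can be verified against an explicitly padded transform. This is a minimal sketch using a naive DFT, with hypothetical function names and input data: the even-indexed outputs come from a size-$m$ transform of $F_k$ and the odd-indexed outputs from a size-$m$ transform of $\zeta_{2m}^k F_k$.

```python
import cmath

def backward(F):
    """Unnormalized backward DFT: f_j = sum_k exp(2*pi*i*j*k/n) F_k."""
    n = len(F)
    return [sum(cmath.exp(2j * cmath.pi * j * k / n) * F[k] for k in range(n))
            for j in range(n)]

def implicit_backward(F):
    """Return (even, odd) = (f_{2l}, f_{2l+1}) for l = 0..m-1, with no padding."""
    m = len(F)
    even = backward(F)
    # Multiply by the twiddle factor zeta_{2m}^k before the second subtransform.
    twisted = [cmath.exp(2j * cmath.pi * k / (2 * m)) * F[k] for k in range(m)]
    odd = backward(twisted)
    return even, odd

F = [1 + 1j, 2, -1j, 0.5, 3]          # hypothetical data, m = 5
m = len(F)
padded = backward(F + [0] * m)        # explicit padding to 2m for comparison
even, odd = implicit_backward(F)
err = max(max(abs(padded[2 * l] - even[l]), abs(padded[2 * l + 1] - odd[l]))
          for l in range(m))
```

The two size-$m$ transforms reproduce all $2m$ outputs of the padded transform to rounding error, without ever storing or reading the zero modes.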

• Odd and even terms of the convolution can then be computed separately, multiplied term-by-term, and transformed again to Fourier space:

$$\begin{aligned}
2m F_k &= \sum_{j=0}^{2m-1} \zeta_{2m}^{-kj} f_j\\
&= \sum_{\ell=0}^{m-1} \zeta_{2m}^{-k2\ell} f_{2\ell} + \sum_{\ell=0}^{m-1} \zeta_{2m}^{-k(2\ell+1)} f_{2\ell+1}\\
&= \sum_{\ell=0}^{m-1} \zeta_m^{-k\ell} f_{2\ell} + \zeta_{2m}^{-k} \sum_{\ell=0}^{m-1} \zeta_m^{-k\ell} f_{2\ell+1}, \qquad k = 0, \ldots, m - 1.
\end{aligned}$$

• No bit reversal is required at the highest level.

• A 1D implicitly padded convolution is implemented in our FFTW++ library.

• This in-place convolution was written to use six out-of-place transforms, thereby avoiding bit reversal at all levels.

• The computational complexity is $6Km \log_2 m$.

• The numerical error is similar to explicit padding and the memory usage is identical.

[Diagram: $\{F_k\}_{k=0}^{m-1}$, $\{G_k\}_{k=0}^{m-1}$ → $\{f_{2\ell}\}_{\ell=0}^{m-1}$, $\{f_{2\ell+1}\}_{\ell=0}^{m-1}$, $\{g_{2\ell}\}_{\ell=0}^{m-1}$, $\{g_{2\ell+1}\}_{\ell=0}^{m-1}$ → $\{f_{2\ell} g_{2\ell}\}_{\ell=0}^{m-1}$, $\{f_{2\ell+1} g_{2\ell+1}\}_{\ell=0}^{m-1}$ → $\{(F*G)_k\}_{k=0}^{m-1}$]

Input: vector f, vector g
Output: vector f
u ← fft$^{-1}$(f);
v ← fft$^{-1}$(g);
u ← u ∗ v;
for k = 0 to m − 1 do
    f[k] ← $\zeta_{2m}^{k}$ f[k];
    g[k] ← $\zeta_{2m}^{k}$ g[k];
end
v ← fft$^{-1}$(f);
f ← fft$^{-1}$(g);
v ← v ∗ f;
f ← fft(u);
u ← fft(v);
for k = 0 to m − 1 do
    f[k] ← f[k] + $\zeta_{2m}^{-k}$ u[k];
end
return f/(2m);
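The algorithm above can be transcribed into runnable form. The following is a minimal sketch, assuming naive $O(m^2)$ transforms in place of the six FFTs and using ordinary lists for the work arrays u and v; it follows the pseudocode step by step but is not the FFTW++ routine, and the input data are hypothetical.

```python
import cmath

def dft(x, sign):
    """Unnormalized DFT with kernel exp(sign*2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(cmath.exp(sign * 2j * cmath.pi * j * k / n) * x[k]
                for k in range(n)) for j in range(n)]

def implicit_convolution(F, G):
    """Implicitly padded linear convolution of two length-m vectors."""
    m = len(F)
    f, g = list(F), list(G)
    u = dft(f, +1)                               # even terms f_{2l}
    v = dft(g, +1)                               # even terms g_{2l}
    u = [a * b for a, b in zip(u, v)]            # even products
    for k in range(m):
        twiddle = cmath.exp(2j * cmath.pi * k / (2 * m))   # zeta_{2m}^k
        f[k] *= twiddle
        g[k] *= twiddle
    v = dft(f, +1)                               # odd terms f_{2l+1}
    f = dft(g, +1)                               # odd terms g_{2l+1}
    v = [a * b for a, b in zip(v, f)]            # odd products
    f = dft(u, -1)                               # forward transform, even part
    u = dft(v, -1)                               # forward transform, odd part
    for k in range(m):
        f[k] += cmath.exp(-2j * cmath.pi * k / (2 * m)) * u[k]   # zeta_{2m}^{-k}
    return [value / (2 * m) for value in f]

def direct_convolution(F, G):
    """Reference result: sum_{p=0}^{k} F_p G_{k-p} for k = 0..m-1."""
    m = len(F)
    return [sum(F[p] * G[k - p] for p in range(k + 1)) for k in range(m)]

F = [1, 2, 3, 4, 5]
G = [2, -1, 0.5, 1j, 3]
error = max(abs(a - b) for a, b in
            zip(implicit_convolution(F, G), direct_convolution(F, G)))
```

Only transforms of size $m$ appear, six in total, and the result agrees with the direct sum to rounding error.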

Implicit Padding in 1D

[Figure: time/($m \log_2 m$) in ns versus $m$, for $m$ from $10^2$ to $10^6$; curves: explicit T=1, implicit T=1, explicit T=4, implicit T=4.]

Convolutions in Higher Dimensions

• An explicitly padded convolution in 2 dimensions requires 12 padded FFTs, and 4 times the memory of a cyclic convolution.

[Diagram: $F$, $G$ → zero-padded $F$, $G$ → $f$, $g$ → $fg$ → $F*G$]

Recursive Convolution

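• Naive way to compute a multiple-dimensional convolution:

$$\mathcal{F}_{N_1,\ldots,N_d} \;\rightarrow\; \text{multiply} \;\rightarrow\; \mathcal{F}^{-1}_{N_1,\ldots,N_d}$$

• The technique of recursive convolution allows one to avoid computing and storing the entire Fourier image of the data:

$$\mathcal{F}_{N_d} \;\rightarrow\; N_d \times \text{convolve}_{N_1,\ldots,N_{d-1}} \;\rightarrow\; \mathcal{F}^{-1}_{N_d}$$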
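The ordering of operations in a recursive convolution can be illustrated in two dimensions: transform in $x$ only, perform one 1D convolution in $y$ for each transformed $x$ index, then transform back in $x$. The sketch below is an assumption-laden illustration: it uses explicit padding and a naive DFT in $x$, and a direct sum as a stand-in for the inner 1D dealiased convolution. It shows the structure only and does not attempt the memory savings described above; all names and data are hypothetical.

```python
import cmath

def dft(x, sign):
    """Unnormalized DFT with kernel exp(sign*2*pi*i*j*k/N)."""
    N = len(x)
    return [sum(cmath.exp(sign * 2j * cmath.pi * j * k / N) * x[k]
                for k in range(N)) for j in range(N)]

def convolve1d(F, G):
    """Truncated linear convolution in y (direct sum used as a stand-in)."""
    m = len(F)
    return [sum(F[p] * G[k - p] for p in range(k + 1)) for k in range(m)]

def x_transform(A):
    """Zero pad A[kx][ky] to 2*mx in x, backward transform in x; returns a[j][ky]."""
    mx, my = len(A), len(A[0])
    columns = [dft([A[kx][ky] for kx in range(mx)] + [0j] * mx, +1)
               for ky in range(my)]
    return [[columns[ky][j] for ky in range(my)] for j in range(2 * mx)]

def recursive_convolve2d(F, G):
    """2D linear convolution: x transform, 2*mx convolutions in y, x transform."""
    mx, my = len(F), len(F[0])
    N = 2 * mx
    f, g = x_transform(F), x_transform(G)
    h = [convolve1d(f[j], g[j]) for j in range(N)]   # one 1D convolution per j
    out = [[0j] * my for _ in range(mx)]
    for ky in range(my):
        column = dft([h[j][ky] for j in range(N)], -1)
        for kx in range(mx):
            out[kx][ky] = column[kx] / N
    return out

def direct_convolve2d(F, G):
    """Reference O(m^4) double sum."""
    mx, my = len(F), len(F[0])
    return [[sum(F[px][py] * G[kx - px][ky - py]
                 for px in range(kx + 1) for py in range(ky + 1))
             for ky in range(my)] for kx in range(mx)]

F = [[1, 2, 0.5], [0, -1, 3], [2, 1j, 1]]
G = [[1, -1, 2], [0.5, 0, 1], [3, 2, -2j]]
expected = direct_convolve2d(F, G)
result = recursive_convolve2d(F, G)
error = max(abs(result[kx][ky] - expected[kx][ky])
            for kx in range(3) for ky in range(3))
```

Because each inner convolution touches a single $x$ index, a production code can process these slices one at a time instead of holding the fully transformed array.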


• Extra work memory need not be contiguous with the data.

[Diagram: $F$, $G$ → $\mathrm{FFT}_x^{-1}\{F\}$ and $\mathrm{FFT}_x^{-1}\{G\}$ for $n_x$ even and $n_x$ odd → $\mathrm{FFT}_x^{-1}\{F*G\}$ for $n_x$ even and $n_x$ odd → $F*G$]


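[Figure: time/($m^2 \log_2 m^2$) in ns versus $m$, for $m$ from $10^2$ to $10^3$; curves: explicit T=1, implicit T=1, explicit T=4, implicit T=4.]

[Figure: time/($m^3 \log_2 m^3$) in ns versus $m$, for $m$ from $10^1$ to $10^2$; curves: explicit T=1, implicit T=1, explicit T=4, implicit T=4.]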

Centered (Pseudospectral) Convolutions

• For a centered convolution, the Fourier origin ($k = 0$) is centered in the domain:

$$\sum_{p=k-m+1}^{m-1} f_p g_{k-p}$$

• Need to pad to $N \ge 3m - 2$ to remove aliases.

• The ratio $(2m - 1)/(3m - 2)$ of the number of physical to total modes is asymptotic to $2/3$ for large $m$.

• A Hermitian convolution arises when the underlying physical-space data are real, so that the Fourier-space inputs satisfy

$$f_{-k} = \overline{f_k}.$$
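The padding requirement $N = 3m - 2$ can be checked on a small Hermitian example. This is a minimal sketch under stated assumptions: modes $k = -m+1, \ldots, m-1$ are stored at index $k \bmod N$ of a zero-filled array, a naive DFT is used, and the helper names and data are hypothetical.

```python
import cmath

def dft(x, sign):
    """Unnormalized DFT with kernel exp(sign*2*pi*i*j*k/N)."""
    N = len(x)
    return [sum(cmath.exp(sign * 2j * cmath.pi * j * k / N) * x[k]
                for k in range(N)) for j in range(N)]

def hermitian(values):
    """Build {k: F_k} for k = -m+1..m-1 from k >= 0 data, with F_{-k} = conj(F_k)."""
    F = {k: complex(v) for k, v in enumerate(values)}
    for k in range(1, len(values)):
        F[-k] = F[k].conjugate()
    return F

def embed(A, m, N):
    """Place mode k at index k mod N of a zero-filled array of length N."""
    x = [0j] * N
    for k in range(-m + 1, m):
        x[k % N] = A[k]
    return x

def centered_direct(F, G, m):
    """sum_{p=k-m+1}^{m-1} F_p G_{k-p} for k = 0..m-1."""
    return [sum(F[p] * G[k - p] for p in range(k - m + 1, m)) for k in range(m)]

def centered_padded(F, G, m):
    """Same convolution via transforms of size N = 3m - 2."""
    N = 3 * m - 2
    f = dft(embed(F, m, N), +1)
    g = dft(embed(G, m, N), +1)
    h = dft([a * b for a, b in zip(f, g)], -1)
    return [h[k] / N for k in range(m)]

m = 4
F = hermitian([1.0, 2 - 1j, 0.5j, -1 + 3j])   # F_0 must be real
G = hermitian([2.0, 1j, 1 + 1j, 0.25])
error = max(abs(a - b) for a, b in
            zip(centered_direct(F, G, m), centered_padded(F, G, m)))
# Hermitian symmetry implies that the physical-space values are real.
imag_part = max(abs(v.imag) for v in dft(embed(F, m, 3 * m - 2), +1))
```

For $m = 4$ the transform length is 10, of which $2m - 1 = 7$ entries carry data. Only modes $k \ge 0$ need to be computed, since the rest follow from Hermitian symmetry.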

1D Implicit Hermitian Convolution

[Figure: time/($m \log_2 m$) in ns versus $m$, for $m$ from $10^2$ to $10^6$; curves: explicit T=1, implicit T=1, explicit T=4, implicit T=4.]

Distributed-Memory Parallelization

• The pseudospectral method uses a matrix transpose to localize the computation of the multi-dimensional FFTs onto individual processors.

• Parallel generalized slab/pencil decompositions have recently been developed for distributed-memory architectures.

• We have compared several distributed matrix transpose algorithms, both blocking and nonblocking, under pure MPI and hybrid MPI/OpenMP architectures.

• Local transposition is not required within a single MPI node.

• We have developed an adaptive algorithm, dynamically tuned to choose the optimal block size.

8 × 8 Block Transpose over 8 Processors

[Diagram: 8 × 8 block transpose, with blocks labeled by process number 0 to 7.]

Advantages of Hybrid MPI/OpenMP

• Use hybrid OpenMP/MPI with the optimal number of threads:

– yields larger communication block size;

– local transposition is not required within a single MPI node;

– allows smaller problems to be distributed over a large number of processors;

– for 3D FFTs, allows for more slab-like than pencil-like models, reducing the size of or even eliminating the need for a second transpose;

– sometimes more efficient (by a factor of 2) than pure MPI.

• The use of nonblocking MPI communications allows us to overlap computation with communication: this can yield up to an additional 32% performance gain for implicitly dealiased convolutions, for which a natural parallelism exists between communication and computation.

Pure MPI 2D Convolutions

[Figure: time/($m^2 \log_2 m^2$) in ns versus $m$, for $m$ from $10^2$ to $10^4$; curves: implicit P=24, implicit P=192, explicit P=24, explicit P=192.]

Pure MPI 3D Convolutions

[Figure: time/($m^3 \log_2 m^3$) in ns versus $m$, for $m$ from $10^2$ to $10^3$; curves: implicit P=24, implicit P=192, explicit P=24, explicit P=192.]

Hybrid MPI 3D Adaptive Transpose Timing


https://www.math.ualberta.ca/~bowman/asygl/siam20/adaptiveTranspose.html

Hybrid MPI 3D Adaptive Transpose Speedup


https://www.math.ualberta.ca/~bowman/asygl/siam20/speedupTranspose.html

Communication Costs: Direct Transpose

• Suppose an $N \times N$ matrix is distributed over $P$ processes with $P \mid N$.

• Direct transposition involves $P - 1$ communications per process, each of size $N^2/P^2$, for a total per-process data transfer of

$$\frac{P-1}{P^2} N^2.$$

Block Transpose

• Let $P = ab$. Subdivide the $N \times M$ matrix into $a \times a$ blocks, each of size $N/a \times M/a$.

• Inner: Over each team of $b$ processes, transpose the $a$ individual $N/a \times M/a$ matrices, grouping all $a$ communications with the same source and destination together.

• Outer: Over each team of $a$ processes, transpose the $a \times a$ matrix of $N/a \times M/a$ blocks.

Communication Costs

• Let $\tau_\ell$ be the typical latency of a message and $\tau_d$ be the time required to send each matrix element, so that the time to send a message consisting of $n$ matrix elements is

$$\tau_\ell + n\tau_d.$$

• The time required to perform a direct transpose is

$$T_D = \tau_\ell (P - 1) + \tau_d \frac{P-1}{P^2} NM = (P - 1)\left(\tau_\ell + \tau_d \frac{NM}{P^2}\right),$$

whereas a block transpose requires

$$T_B(a) = \tau_\ell \left(a + \frac{P}{a} - 2\right) + \tau_d \left(2P - a - \frac{P}{a}\right)\frac{NM}{P^2}.$$

• Let $L = \tau_\ell/\tau_d$ be the effective communication block length.

Direct vs. Block Transposes

• Since

$$T_D - T_B = \tau_d \left(P + 1 - a - \frac{P}{a}\right)\left(L - \frac{NM}{P^2}\right),$$

we see that a direct transpose is preferred when $NM \ge P^2 L$, whereas a block transpose should be used when $NM < P^2 L$.

• To find the optimal value of $a$ for a block transpose consider

$$T_B'(a) = \tau_d \left(1 - \frac{P}{a^2}\right)\left(L - \frac{NM}{P^2}\right).$$

• For $NM < P^2 L$, we see that $T_B$ is convex, with a minimum at $a = \sqrt{P}$.
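The cost model can be evaluated directly to see where the crossover lies. The following is a minimal sketch of the two formulas; the latency, per-element time, process count, and matrix sizes are hypothetical values chosen only so that one case falls on each side of $NM = P^2 L$.

```python
def direct_cost(P, NM, tau_l, tau_d):
    """T_D = (P - 1) * (tau_l + tau_d * NM / P^2)."""
    return (P - 1) * (tau_l + tau_d * NM / P**2)

def block_cost(a, P, NM, tau_l, tau_d):
    """T_B(a) = tau_l*(a + P/a - 2) + tau_d*(2P - a - P/a)*NM/P^2."""
    return tau_l * (a + P / a - 2) + tau_d * (2 * P - a - P / a) * NM / P**2

def best_block_factor(P, NM, tau_l, tau_d):
    """Search the divisors a of P for the cheapest block transpose."""
    divisors = [a for a in range(1, P + 1) if P % a == 0]
    return min(divisors, key=lambda a: block_cost(a, P, NM, tau_l, tau_d))

# Hypothetical parameters: L = tau_l / tau_d = 1000 elements, P = 64 processes.
tau_l, tau_d, P = 1000.0, 1.0, 64
NM_small = 1024 * 1024      # NM < P^2 L: block transpose should win
NM_large = 8192 * 8192      # NM > P^2 L: direct transpose should win

best_small = best_block_factor(P, NM_small, tau_l, tau_d)
best_large = best_block_factor(P, NM_large, tau_l, tau_d)
```

For the small matrix the search returns $a = 8 = \sqrt{P}$. For the large matrix it returns an endpoint ($a = 1$ or $a = P$), and since $T_B(1) = T_B(P) = T_D$ in these formulas, that choice costs the same as a direct transpose.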

Optimal Number of Threads

• The minimum value of $T_B$ is

$$T_B(\sqrt{P}) = 2\tau_d \left(\sqrt{P} - 1\right)\left(L + \frac{NM}{P^{3/2}}\right) \sim 2\tau_d \sqrt{P}\left(L + \frac{NM}{P^{3/2}}\right), \qquad P \gg 1.$$

• The global minimum of $T_B$ over both $a$ and $P$ occurs at

$$P \approx (2NM/L)^{2/3}.$$

• If the matrix dimensions satisfy $NM > L$, as is typically the case, this minimum occurs above the transition value $(NM/L)^{1/2}$.
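The location of the global minimum follows from the large-$P$ form of $T_B(\sqrt{P})$. The steps below are a worked sketch that treats $P$ as a continuous variable, which is an assumption made only for the purpose of differentiation:

```latex
\begin{align*}
T_B(\sqrt{P}) &\sim 2\tau_d\left(L\sqrt{P} + \frac{NM}{P}\right),\\
\frac{\partial}{\partial P}\left(L\sqrt{P} + \frac{NM}{P}\right)
  &= \frac{L}{2\sqrt{P}} - \frac{NM}{P^2} = 0
  \;\Longrightarrow\; P^{3/2} = \frac{2NM}{L}
  \;\Longrightarrow\; P = \left(\frac{2NM}{L}\right)^{2/3},\\
\frac{(2NM/L)^{2/3}}{(NM/L)^{1/2}} &= 2^{2/3}\left(\frac{NM}{L}\right)^{1/6} > 1
  \quad\text{whenever } NM > L.
\end{align*}
```

The last line confirms that the minimizing $P$ lies above the transition value $(NM/L)^{1/2}$, that is, in the regime $NM < P^2 L$ where the block transpose is the preferred algorithm.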

Transpose Communication Costs

[Figure: communication cost ($10^4$ to $10^6$) versus $P$ ($10^1$ to $10^3$); labels: Zero Latency, Direct, Block, Threads.]

Conclusions

• For centered convolutions in $d$ dimensions implicit padding asymptotically uses $(2/3)^{d-1}$ of the conventional storage.

• The factor of 2 speedup is largely due to increased data locality.

• Highly optimized and parallelized implicit dealiasing routines have been implemented as a software layer FFTW++ (v 2.05) on top of the FFTW library and released under the GNU Lesser General Public License: http://fftwpp.sourceforge.net/

• Hybrid MPI/OpenMP is often more efficient than pure MPI for distributed matrix transposes.

• The hybrid paradigm provides an optimal setting for nonlocal computationally intensive operations found in applications like the fast Fourier transform.

• The advent of implicit dealiasing of convolutions makes overlapping transposition with FFT computation feasible.

• Writing a high-performance dealiased pseudospectral code is now a relatively straightforward exercise. For example, see the protodns project at

http://github.com/dealias/dns

References

[Bowman & Roberts 2011] J. C. Bowman & M. Roberts, SIAM J. Sci. Comput., 33:386, 2011.

[Bowman & Roberts 2016] J. C. Bowman & M. Roberts, to be submitted to Parallel Computing, 2016.

[Orszag 1971] S. A. Orszag, Journal of the Atmospheric Sciences, 28:1074, 1971.

[Patterson Jr. & Orszag 1971] G. S. Patterson Jr. & S. A. Orszag, Physics of Fluids, 14:2538, 1971.

[Roberts & Bowman 2016] M. Roberts & J. C. Bowman, submitted to SIAM J. Sci. Comput., 2016.
