PARATEC and the Generation of the Empty States (Starting point for GW/BSE)


Andrew Canning, Computational Research Division, LBNL, and Chemical Engineering and Materials Science Dept., UC Davis

GW/BSE Method Overview

DFT Kohn-Sham (SCF and NSCF): $\{\phi^{DFT}_{nk}(r), E^{DFT}_{nk}\}$

Compute Dielectric Function{ }

GW: Quasiparticle Properties: $\{\phi^{QP}_{nk}(r), E^{QP}_{nk}\}$

BSE: Construct Kernel (coarse grid) K(k,c,v,k',c',v')

Interpolate Kernel to Fine Grid / Diagonalize BSE Hamiltonian: $\{A^{s}_{cvk}, E^{s}_{cvk}\}$

Expt. G.E. Jellison, M.F. Chisholm, S.M. Gorbatkin, Appl. Phys. Lett. 62, 3348 (1993).

Computational Cost: GW Method for nanotube

• 80 carbon atoms, cell 80 x 80 x 4.6 a.u.

• 160 occupied (valence) bands, 800 unoccupied (conduction) bands

• k-points: 1x1x32 (coarse), 1x1x256 (fine)

• Running on Cray XE6 Hopper

• Generation of empty states is ~30% of the computational cost and the most expensive step in terms of wall clock time

• Scaling issues when running DFT codes for a large number of bands (on a relatively small system)

Features of Different Codes for generation of empty states (what to use for GW/BSE?)

• SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms)
    Basis set: LCAO (Linear Combination of Atomic Orbitals)
    Less accurate basis allows larger systems to be studied (thousands of atoms)
    Good for non-periodic systems, large molecules
    O(N) algorithms implemented in LCAO basis

• PARSEC (Pseudopotential Algorithm for Real-Space Electronic structure Calculations)
    Grid-based real-space representation, finite-difference approach
    Easy to implement non-periodic boundary conditions
    Good for large molecules etc.

• Quantum Espresso
    Plane wave basis set (same as BerkeleyGW code)
    PAW (Projector Augmented Wavefunctions) option
    Hybrid functionals

• PARATEC (PARAllel Total Energy Code)
    Plane wave basis set (same as BerkeleyGW code)
    Good for periodic systems (crystals etc., metallic systems)
    Hybrid functionals
    static-COHSEX
    OpenMP/MPI hybrid implementation

PARATEC (PARAllel Total Energy Code)

• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set

• Written in F90 and MPI

• Designed to run on large parallel machines (Cray, IBM etc.) but also runs on PCs

• PARATEC uses an all-band CG approach to obtain the wavefunctions of the electrons (blocked communications, specialized 3D FFTs)

• Generally obtains a high percentage of peak on different platforms (uses BLAS3 and 1D FFT libraries)

• Developed by Louie and Cohen groups (UCB, LBNL) in collaboration with CRD, NERSC

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC

$$\left\{-\tfrac{1}{2}\nabla^2 + \int \frac{\rho(r')}{|r-r'|}\,dr' - \sum_I \frac{Z_I}{|r-R_I|} + V_{XC}(\rho(r))\right\}\psi_j(r) = E_j\,\psi_j(r)$$

Solution: eigenpairs $(\psi_j(r), E_j)$

Computational Task (CG solver)      Scaling
Orthogonalization                   $MN^2$
Subspace diagonalization            $N^3$
3D FFTs (most communications)       $NM\log M$
Nonlocal pseudopotential            $MN^2$ ($N^2$ in real space)

N: number of eigenpairs required (lowest in spectrum)

M: matrix (Hamiltonian) dimension, basis set size ($M \sim 100\text{-}200\,N$)

Load Balancing, Parallel Data Layout

• Wavefunctions stored as spheres of points (due to energy cutoff)

• Data-intensive parts (BLAS) proportional to number of Fourier components

• Pseudopotential calculation and orthogonalization scale as $N^3$ (for an N-atom system)

• FFT part scales as $N^2 \log N$

[Figure: FFTs connect the Fourier-space representation, where the kinetic term $-\tfrac{1}{2}\nabla^2\psi_i(r)$ is applied, and the real-space representation, where $V(r)$ is applied.]

Data distribution: load balancing constraints (Fourier space):

• each processor should have the same number of Fourier coefficients ($N^3$ calculations)

• each processor should have complete columns of Fourier coefficients (3D FFT)

Give out sets of columns of data to each processor
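The two constraints above (equal numbers of Fourier coefficients, whole columns only) can be illustrated with a small greedy assignment. This is a minimal sketch, not PARATEC's actual routine: the function names, the integer grid units, and the longest-column-first heuristic are all assumptions made for illustration.

```python
import heapq

def column_lengths(n_max, cutoff_radius):
    """Count Fourier coefficients in each (gx, gy) column of a cutoff sphere.

    Integer grid units are assumed; a point belongs to the sphere when
    gx^2 + gy^2 + gz^2 <= cutoff_radius^2.
    """
    r2 = cutoff_radius ** 2
    lengths = {}
    for gx in range(-n_max, n_max + 1):
        for gy in range(-n_max, n_max + 1):
            count = sum(1 for gz in range(-n_max, n_max + 1)
                        if gx * gx + gy * gy + gz * gz <= r2)
            if count:
                lengths[(gx, gy)] = count
    return lengths

def distribute_columns(lengths, n_procs):
    """Give whole columns to processors, longest first, always to the
    processor that currently holds the fewest coefficients."""
    heap = [(0, p) for p in range(n_procs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    for col, length in sorted(lengths.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(col)
        heapq.heappush(heap, (load + length, p))
    loads = [sum(lengths[c] for c in assignment[p]) for p in range(n_procs)]
    return assignment, loads

if __name__ == "__main__":
    lengths = column_lengths(8, 8.0)
    _, loads = distribute_columns(lengths, 4)
    print("coefficients per processor:", loads)
```

With this heuristic the loads of any two processors differ by at most the length of one column, which keeps the BLAS-dominated parts balanced while every column stays on a single processor for the 1D FFTs along it.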

PARATEC: Performance

Grid size $252^3$

All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)

ES achieves the highest overall performance: 5.5 Tflop/s on 2048 procs (5.3 Tflop/s on XT4 on 2048 procs in single-proc.-per-node mode)

FFT used as a benchmark for NERSC procurements (run on up to 18K procs on Cray XT4, weak scaling)

Vectorisation directives and multiple 1D FFTs required for NEC SX6

Developed with Louie and Cohen's groups (UCB, LBNL), also work with L. Oliker, J. Carter

Problem: 488-atom CdSe quantum dot. Each entry is Gflops/proc (% of peak); "-" means no value reported.

Procs   Bassi NERSC     Jacquard NERSC   Thunder        Franklin NERSC   NEC ES        IBM BG/L
        (IBM Power5)    (Opteron)        (Itanium2)     (Cray XT4)       (SX6)
128     5.49 (72%)      -                2.8 (51%)      -                5.1 (64%)     -
256     5.52 (73%)      1.98 (45%)       2.6 (47%)      3.36 (65%)       5.0 (62%)     1.21 (43%)
512     5.13 (67%)      0.95 (21%)       2.4 (44%)      3.15 (61%)       4.4 (55%)     1.00 (35%)
1024    3.74 (49%)      -                1.8 (32%)      2.93 (56%)       3.6 (46%)     -
2048    -               -                -              2.37 (46%)       2.7 (35%)     -

Parallelization in PW DFT codes: four levels (k-points, bands, PWs, OpenMP)

Band parallelization: n nodes divided into groups

k-point parallelization: divide k-points among groups of nodes (limited for large systems, molecules, nanostructures etc)

PW parallelization: each group parallelizes over PWs

OpenMP, Threaded Libs on the node/chip
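The nesting of the first three levels can be sketched as a mapping from a global MPI rank to a (k-point group, band group, plane-wave rank) triple. This is only an illustration under assumed conventions: the function name, the row-major ordering, and the group sizes are hypothetical and are not taken from any particular code. OpenMP threads live inside each MPI rank and so do not appear in the mapping.

```python
def decompose_rank(rank, n_kgroups, n_bandgroups, n_pw):
    """Map a global rank to (k-point group, band group, plane-wave rank).

    Assumes n_kgroups * n_bandgroups * n_pw ranks in total, laid out so that
    plane-wave ranks vary fastest and k-point groups vary slowest.
    """
    total = n_kgroups * n_bandgroups * n_pw
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} outside 0..{total - 1}")
    k_group, rest = divmod(rank, n_bandgroups * n_pw)
    band_group, pw_rank = divmod(rest, n_pw)
    return k_group, band_group, pw_rank

if __name__ == "__main__":
    # Hypothetical layout: 2 k-point groups x 3 band groups x 4 plane-wave ranks
    for rank in (0, 13, 23):
        print(rank, decompose_rank(rank, 2, 3, 4))
```

For a system with very few k-points (large systems, molecules, nanostructures) the first factor is close to 1, so the remaining parallelism has to come from bands, plane waves and threads.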

OpenMP, Threading for on-node/chip parallelism

• fewer MPI messages avoids communication bottlenecks

• aggregation of messages per node reduces latency issues

• smaller memory footprint (from code and MPI buffers)

• no on-node MPI messaging

• extra level of parallelism to improve scaling to larger core counts

Timing results for the threaded version of the PARATEC code used to generate VB and CB states for input to the GW code: 686 Si atoms on Jaguar (Cray XT5 at ORNL, 224,162 cores). Node: 2 AMD Istanbul 2.6 GHz 6-core chips (12 cores in total, 2 x 6 cores).

[Figure: time (s, axis from 0 to 2000) split into FFT, "DGEMM" and MPI contributions, for OpenMP threads / MPI tasks = 1/768, 2/384, 3/256, 6/128, 12/64.]

Non-SCF problem to generate empty CB states

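$$\left\{-\tfrac{1}{2}\nabla^2 + V(\rho, r)\right\}\psi_i(r) = E_i\,\psi_i(r), \qquad \rho(r) = \sum_{i}^{N_{VB}} |\psi_i(r)|^2$$

Solve self-consistently for $N_{VB}$ valence states

$$\left\{-\tfrac{1}{2}\nabla^2 + V(\rho, r)\right\}\psi_i(r) = E_i\,\psi_i(r)$$

Solve non-self-consistently for $N_{VB}+N_{CB}$ states

Output for GW/BSE codes: $\{\psi_i, E_i\}_{i=1,..,N_{VB}+N_{CB}}$, $\rho$, $V_{XC}$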

• Non-SCF problem is like the simulation of a metallic system (no gap above the top of the spectrum)

• Slow convergence

• Requires convergence criteria for the empty states

• $N_{VB}+N_{CB}$ can be very large

• Operations on the subspace matrix can dominate

• High percentage of eigenpairs calculated compared to the SCF calculation

• Typically almost all the time is spent in the Non-SCF calculation
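The reason the empty states can be generated non-self-consistently is that the density is built from the valence states only, so adding conduction states does not change the potential. A toy sketch of that point, assuming analytic particle-in-a-box orbitals in place of real Kohn-Sham wavefunctions and ignoring spin degeneracy (all names here are hypothetical):

```python
import math

def box_orbital(n, x, length=1.0):
    """n-th normalized particle-in-a-box orbital (stand-in for psi_i)."""
    return math.sqrt(2.0 / length) * math.sin(n * math.pi * x / length)

def density(x, n_vb, length=1.0):
    """rho(x) = sum over the n_vb lowest (occupied) orbitals of |psi_i(x)|^2."""
    return sum(box_orbital(i, x, length) ** 2 for i in range(1, n_vb + 1))

def integrate(f, a, b, n=2000):
    """Midpoint-rule integral of f on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

if __name__ == "__main__":
    n_vb, n_cb = 4, 20
    # The density depends on n_vb only; the n_cb empty states never enter it,
    # so computing them requires no further update of rho or V.
    print("integrated density:", integrate(lambda x: density(x, n_vb), 0.0, 1.0))
```

The integrated density equals the number of occupied orbitals regardless of how many empty states are requested afterwards, which is why the second solve can reuse the converged potential unchanged.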

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC

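$$\left\{-\tfrac{1}{2}\nabla^2 + \int \frac{\rho(r')}{|r-r'|}\,dr' - \sum_I \frac{Z_I}{|r-R_I|} + V_{XC}(\rho(r))\right\}\psi_j(r) = E_j\,\psi_j(r)$$

Solution: eigenpairs $(\psi_j(r), E_j)$

Computational Task (CG solver)      Scaling
Orthogonalization                   $MN^2$
Subspace diagonalization            $N^3$
3D FFTs (most communications)       $NM\log M$
Nonlocal pseudopotential            $MN^2$ ($N^2$ in real space)

Values are given for the NSCF calculation for GW/BSE, with the standard SCF values in parentheses:

N: number of eigenpairs required, $N = N_{VB}+N_{CB}$ ($N = N_{VB}$)

M: matrix (Hamiltonian) dimension, basis set size, $M \sim 10\text{-}20\,N$ ($M \sim 100\text{-}200\,N$)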
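The shift in balance between the tasks can be made concrete by plugging band counts into the scalings listed above. This is a rough sketch only: all constant prefactors are set to 1 (including the base of the logarithm), the band counts are those of the nanotube example (160 valence, 800 conduction), and the basis size M = 100 N_VB is an assumed value chosen to lie in the quoted range.

```python
import math

def cg_step_costs(n_states, basis_size):
    """Leading-order operation counts per task, constant prefactors set to 1."""
    n, m = n_states, basis_size
    return {
        "orthogonalization": m * n ** 2,
        "subspace diagonalization": n ** 3,
        "3d FFTs": n * m * math.log2(m),
        "nonlocal pseudopotential": m * n ** 2,
    }

def cost_shares(n_states, basis_size):
    """Fraction of the total estimated cost taken by each task."""
    costs = cg_step_costs(n_states, basis_size)
    total = sum(costs.values())
    return {task: cost / total for task, cost in costs.items()}

if __name__ == "__main__":
    n_vb, n_cb = 160, 800        # band counts from the nanotube example
    basis_size = 100 * n_vb      # assumed M, not a value from the slides
    for label, n in (("SCF", n_vb), ("NSCF", n_vb + n_cb)):
        shares = cost_shares(n, basis_size)
        print(label, {task: round(share, 4) for task, share in shares.items()})
```

Under these assumptions the relative weight of the subspace diagonalization is several times larger in the NSCF case than in the SCF case. Real prefactors and parallel efficiency differ between tasks, so the numbers indicate a trend rather than measured timings.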

PARATEC features for Non-SCF problem

• Efficient distributed implementation of operations on subspace matrix using Scalapack

• Extra states calculated above the required number to improve convergence of CG solver

• Option to use a direct solver on the Hamiltonian: when the percentage of eigenpairs required is high (>10%), this can be faster than the CG iterative solver (P. Zhang)

Scaling of an iterative solver (e.g. CG) $\propto N^2 M$, compared to a direct solver (LAPACK, ScaLAPACK) $\propto M^3$ (M = matrix size (basis, number of PWs), N = number of states)
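The crossover between the two solvers follows from equating the two scalings. As a hedged sketch, assume the iterative solver needs `n_iterations` passes, each costing about $N^2 M$, while the direct solver costs about $M^3$ once; the prefactors and the iteration count below are hypothetical parameters, not measured values.

```python
import math

def iterative_cost(n_states, basis_size, n_iterations=1, prefactor=1.0):
    """Estimated cost of an iterative solver: prefactor * n_iterations * N^2 * M."""
    return prefactor * n_iterations * n_states ** 2 * basis_size

def direct_cost(basis_size, prefactor=1.0):
    """Estimated cost of a direct dense solver: prefactor * M^3."""
    return prefactor * basis_size ** 3

def crossover_fraction(n_iterations, iter_prefactor=1.0, direct_prefactor=1.0):
    """Fraction N/M above which the direct estimate is the cheaper one.

    From n_it * c_it * N^2 * M = c_d * M^3 it follows that
    N/M = sqrt(c_d / (c_it * n_it)).
    """
    return math.sqrt(direct_prefactor / (iter_prefactor * n_iterations))

if __name__ == "__main__":
    # With unit prefactors and an assumed 100 iterations, the estimated
    # crossover sits at N/M = 0.1, i.e. 10% of the eigenpairs.
    print(crossover_fraction(100))
```

The 10% threshold quoted above is therefore consistent with an effective iteration count of order 100 in this simple model, although the actual crossover depends on prefactors, convergence behaviour and parallel efficiency.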

[Figure: subspace matrix $S_{ij}$ distributed over processors 0, 1, 2, 3 in a block-block data layout.]

Block size chosen for optimal performance

PARATEC summary and future developments

• PARATEC optimized for large parallel machines (Cray, IBM)

• OpenMP/threaded version under development (important to get more parallelism, particularly for small systems for GW/BSE; gives faster time to solution)

• Hybrid functionals, static-COHSEX (starting point for GW/BSE)

• Some optimization for generation of empty states for GW/BSE

• Direct diagonalization of H for cases when a high % of eigenstates is required for GW/BSE (to be in the released version soon)
