Memory Optimizations for Graphics Processing Units


Transcript of Memory Optimizations for Graphics Processing Units

Page 1: Memory Optimizations for Graphics Processing Units

λFernando Magno Quintão Pereira

PROGRAMMING LANGUAGES LABORATORY
Universidade Federal de Minas Gerais - Department of Computer Science

PROGRAM ANALYSIS AND OPTIMIZATION – DCC888

MEMORY OPTIMIZATIONS FOR GRAPHICS

PROCESSING UNITS

The material in these slides has been taken from the NVIDIA manuals (Best Practices Guide & Optimizing Matrix Transpose in CUDA), and from a paper by Ryoo et al [Ryoo12]. See "A bit of History" in the last slide

Page 2: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

WHAT ARE GRAPHICS PROCESSING UNITS

Page 3: Memory Optimizations for Graphics Processing Units

The Wheel of Reincarnation

• In the good old days, the graphics hardware was just the VGA. All the processing was in software.

• People started complaining: software is slow…
  – But what do you want to run at the hardware level?

A scandalously brief history of GPUs

1) Do you know how the frame buffer works?

2) Can you program the VGA standard in any way?

Page 4: Memory Optimizations for Graphics Processing Units

The Wheel of Reincarnation

• Some functions, like the rasterizer, are heavily used. What is rasterization?

• Better to implement these functions in hardware.

A scandalously brief history of GPUs

1) How can we implement a function at the hardware level?

2) What is the advantage of implementing a function at the hardware level?

3) Is there any drawback?

4) Can we program (in any way) this hardware used to implement a specific function?

Page 5: Memory Optimizations for Graphics Processing Units

Graphics Pipeline

• Graphics can be processed in a pipeline.
  – Transform, project, clip, display, etc.
• Some functions, although different, can be implemented by very similar hardware.
• Add a graphics API to program the shaders.
  – But this API is so specific… and the hardware is so powerful… what a waste!

A scandalously brief history of GPUs

Shading is an example. Do you know what a shader is?

Page 6: Memory Optimizations for Graphics Processing Units

General Purpose Hardware

• Let's add an instruction set to the shader.
  – Let's augment this hardware with general-purpose integer operations.
  – What about adding some branching machinery too?
• Hmm… add a high-level language on top of this stuff.
  – Plus a lot of documentation. Advertise it! It should look cool!
• Oh boy: we now have two general-purpose processors.
  – We should unify them. The rant starts all over again…

A scandalously brief history of GPUs

Page 7: Memory Optimizations for Graphics Processing Units

1.5 turns around the wheel

• Let's add a display processor to the display processor.
  – After all, there are some operations that are really specific and performance critical…

Dedicated rasterizer

A scandalously brief history of GPUs

Page 8: Memory Optimizations for Graphics Processing Units

Brief Timeline

Year   Transistors   Model             Tech
1999   25M           GeForce 256       DX7, OpenGL
2001   60M           GeForce 3         Programmable Shader
2002   125M          GeForce FX        Cg programs
2006   681M          GeForce 8800      C for CUDA
2008   1.4G          GeForce GTX 280   IEEE FP
2010   3.0G          Fermi             Cache, C++

A scandalously brief history of GPUs

Page 9: Memory Optimizations for Graphics Processing Units

Computer Organization

• GPUs show different types of parallelism:
  – Single Instruction, Multiple Data (SIMD)
  – Single Program, Multiple Data (SPMD)
• In the end, we have MSIMD hardware: multiple SIMD units.

1) Why are GPUs so parallel?

2) Why do traditional CPUs not exhibit all this parallelism?

We can think of SIMD hardware as a firing squad: we have a captain and a row of soldiers. The captain issues orders, such as set, aim, fire! And all the soldiers, upon hearing one of these orders, perform the same action. They all do the same thing, yet they use different guns and bullets.

An outrageously concise overview of the programming model.

Page 10: Memory Optimizations for Graphics Processing Units

The Programming Environment
An outrageously concise overview of the programming model.

• There are two main programming languages used to program graphics processing units today: OpenCL and C for CUDA.
• These are not the first languages developed for GPUs. They came after Cg or HLSL, for instance.
  – But they are much more general and expressive.
• We will focus on C for CUDA.
• This language lets the programmer explicitly write code that will run on the CPU and code that will run on the GPU.
  – It is a heterogeneous programming language.

Page 11: Memory Optimizations for Graphics Processing Units

From C to CUDA in one Step
An outrageously concise overview of the programming model.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

• This program, written in C, performs a typical vector operation: it reads two arrays and updates one of them in place (y ← αx + y).

• We will translate this program to C for CUDA.

1) What is the asymptotic complexity of this program?

2) How much can we parallelize this program? In a world with many – really many – processors, e.g., the PRAM world, what would be the complexity of this program?

Page 12: Memory Optimizations for Graphics Processing Units

The first Cuda program

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

What happened to the loop in the CUDA program?

An outrageously concise overview of the programming model.

Page 13: Memory Optimizations for Graphics Processing Units

Understanding the Code

• Threads are grouped in warps, blocks, and grids.
• Threads in different grids do not talk to each other.
  – Grids are divided into blocks.
• Threads in the same block share memory and barriers.
  – Blocks are divided into warps.
• Threads in the same warp follow the SIMD model.

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

An outrageously concise overview of the programming model.
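To make the "shared memory + barriers" point concrete, here is a minimal hedged sketch (not taken from the slides; the kernel and its launch configuration are illustrative): each thread of a block writes one slot of a shared buffer, the block synchronizes, and then every thread reads a slot written by another thread of the same block.

__global__ void reverse_in_block(float *v) {
  __shared__ float buf[256];              // shared memory: one slot per thread of the block
  int t    = threadIdx.x;
  int base = blockIdx.x * blockDim.x;
  buf[t] = v[base + t];                   // every thread fills its own slot
  __syncthreads();                        // barrier: all slots are written before anyone reads
  v[base + t] = buf[blockDim.x - 1 - t];  // read a slot written by another thread
}

// Assumed launch: 256 threads per block, n a multiple of 256:
// reverse_in_block<<<n / 256, 256>>>(d_v);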

Page 14: Memory Optimizations for Graphics Processing Units

Raising the level

• CUDA programs contain CPU code plus kernels.

• Kernels are called via a special syntax:

• The C part of the program is compiled as traditional C.

• The kernel part is first translated into PTX, and then this high-level assembly is translated into SASS.

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float Pvalue = 0.0;
  for (int k = 0; k < w; ++k) {
    Pvalue += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = Pvalue;
}

void Mul(const float* A, const float* B, int width, float* C) {
  int size = width * width * sizeof(float);
  // Load A and B to the device
  float* Ad; cudaMalloc((void**)&Ad, size);
  cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
  float* Bd; cudaMalloc((void**)&Bd, size);
  cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
  // Allocate C on the device
  float* Cd; cudaMalloc((void**)&Cd, size);
  // Compute the execution configuration assuming
  // the matrix dimensions are multiples of BLOCK_SIZE
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(width / dimBlock.x, width / dimBlock.y);
  // Launch the device computation
  matMul1<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, width);
  // Read C from the device
  cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
  // Free device memory
  cudaFree(Ad); cudaFree(Bd); cudaFree(Cd);
}

kernel<<<dGrd, dBck>>>(A,B,w,C);

An outrageously concise overview of the programming model.

Page 15: Memory Optimizations for Graphics Processing Units

Lowering the level

• CUDA assembly is called Parallel Thread Execution (PTX)

.entry saxpy_GPU (n, a, x, y) {
  .reg .u16  %rh<4>;
  .reg .u32  %r<6>;
  .reg .u64  %rd<8>;
  .reg .f32  %f<6>;
  .reg .pred %p<3>;
$LBB1__Z9saxpy_GPUifPfS_:
  mov.u16       %rh1, %ctaid.x;
  mov.u16       %rh2, %ntid.x;
  mul.wide.u16  %r1, %rh1, %rh2;
  cvt.u32.u16   %r2, %tid.x;
  add.u32       %r3, %r2, %r1;
  ld.param.s32  %r4, [n];
  setp.le.s32   %p1, %r4, %r3;
  @%p1 bra      $Lt_0_770;
  .loc 28 13 0
  cvt.u64.s32   %rd1, %r3;
  mul.lo.u64    %rd2, %rd1, 4;
  ld.param.u64  %rd3, [y];
  add.u64       %rd4, %rd3, %rd2;
  ld.global.f32 %f1, [%rd4+0];
  ld.param.u64  %rd5, [x];
  add.u64       %rd6, %rd5, %rd2;
  ld.global.f32 %f2, [%rd6+0];
  ld.param.f32  %f3, [alpha];
  mad.f32       %f4, %f2, %f3, %f1;
  st.global.f32 [%rd4+0], %f4;
$Lt_0_770:
  exit;
}

__global__ void
saxpy_parallel(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

What do you think an assembly language for parallel programming should have?

An outrageously concise overview of the programming model.

Page 16: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

MEMORY OPTIMIZATIONS

Page 17: Memory Optimizations for Graphics Processing Units

A Brief Overview of the GPU Threading Model

• Each thread has local registers and local memory.
• Threads are grouped into warps.
  – Warps run in SIMD fashion.
• Warps are grouped into blocks.
  – Shared memory + barrier synchronization.
• Blocks are grouped into grids.
  – Each grid represents a kernel.

Page 18: Memory Optimizations for Graphics Processing Units

1) How do different grids communicate?

2) How do threads in the same block communicate?

3) What determines the size of the block of threads?

4) What determines the size of the grid of threads?

5) What is the effect of branches in the warp execution?

A Brief Overview of the GPU Threading Model

Page 19: Memory Optimizations for Graphics Processing Units

Going to the Archives

• GPUs are much more memory intensive than traditional CPUs. Let's look at an example:
  – The GeForce 8800 processes 32 pixels per clock. Each pixel contains a color (3 bytes) and a depth (4 bytes), which are both read and written. On average, 16 extra bytes of information are read for each pixel. How many bytes are processed per clock?

To put these numbers in perspective, how much data is processed in each cycle of an ordinary x86 CPU?
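For reference, one way to tally the slide's numbers (a back-of-the-envelope count, nothing more):

\[
\underbrace{(3+4)}_{\text{read}} + \underbrace{(3+4)}_{\text{written}} + \underbrace{16}_{\text{extra reads}} = 30 \text{ bytes per pixel}, \qquad 30 \times 32 = 960 \text{ bytes per clock}.
\]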

Page 20: Memory Optimizations for Graphics Processing Units

The GPU Archive

• Registers: fast, yet few. Private to each thread

• Shared memory: used by threads in the same block

• Local memory: off-chip and slow. Private to each thread

• Global memory: off-chip and slow. Used to provide communication between blocks and grids.
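A minimal sketch of how each of these memory spaces shows up in C for CUDA (the kernel and variable names are illustrative, not from the slides):

__device__ float g_table[1024];            // global memory: visible to every block and kernel

__global__ void where_data_lives(float *out) {
  __shared__ float s_buf[256];             // shared memory: one copy per block
  float r = 2.0f * threadIdx.x;            // scalar locals usually live in registers
  float big[64];                           // large or dynamically indexed arrays may be
                                           // spilled to slow, off-chip local memory
  big[threadIdx.x % 64] = r;
  s_buf[threadIdx.x] = g_table[threadIdx.x] + big[threadIdx.x % 64];
  __syncthreads();
  out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x];
}

// Assumed launch: at most 256 threads per block, e.g. where_data_lives<<<4, 256>>>(d_out);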

Page 21: Memory Optimizations for Graphics Processing Units

The GPU Archive

1) Why can't we leave all the data in registers?

2) Why can't we leave all the data in shared memory?

3) The CPU also has a memory hierarchy. Do you remember what this hierarchy looks like?

4) Why do we have a memory hierarchy also in the CPU?

• Registers: fast, yet few. Private to each thread

• Shared memory: used by threads in the same block

• Local memory: off-chip and slow. Private to each thread

• Global memory: off-chip and slow. Used to provide communication between blocks and grids.

Page 22: Memory Optimizations for Graphics Processing Units

The Traveler’s Journal

The interplanetary trip – commuting data between host and device

The silk road trip – commuting data between global and shared memory

The bakery walk – commuting data between shared memory and registers

Reaching a book on the table – reading/writing data to registers

Page 23: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

INTER-DEVICE COMMUNICATION

Page 24: Memory Optimizations for Graphics Processing Units

The Interplanetary Trip

• Copying data between GPU and CPU is pretty slow. CUDA provides some library functions for this:

– cudaMalloc: allocates data in the GPU memory space

– cudaMemset: fills a memory area with a value

– cudaFree: frees the data in the GPU memory space

– cudaMemcpy: copies data from CPU to GPU, or vice-versa
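These calls all return a cudaError_t; a small hedged helper (not from the slides) makes failures visible during development:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call and abort with a readable message on failure.
#define CUDA_CHECK(call)                                                 \
  do {                                                                   \
    cudaError_t err = (call);                                            \
    if (err != cudaSuccess) {                                            \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,                 \
              cudaGetErrorString(err));                                  \
      exit(EXIT_FAILURE);                                                \
    }                                                                    \
  } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&a_d, nbytes));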

Page 25: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int nbytes = 1024 * sizeof(int);
int *a_d = 0;
cudaMalloc((void**)&a_d, nbytes);
cudaMemset(a_d, 0, nbytes);
cudaFree(a_d);

What is each of these calls doing?

Page 26: Memory Optimizations for Graphics Processing Units

This program copies data from the host to the device, and then moves this data inside the device, and finally brings the data back to the host memory.

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

Page 27: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

float *a_h, *b_h;float *a_d, *b_d;int N = 14, nBytes, i;

Page 28: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

nBytes = N * sizeof(float);a_h = (float*)malloc(nBytes);b_h = (float*)malloc(nBytes);

Host

a_h

b_h

Page 29: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMalloc((void**)&a_d, nBytes);cudaMalloc((void**)&b_d, nBytes);

Host

a_h

b_h

Device

a_d

b_d

Page 30: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemset(a_d, 0, nBytes);

Host

a_h

b_h

Device

a_d

b_d

Page 31: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

Host

a_h

b_h

Device

a_d

b_d

Page 32: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);

Host

a_h

b_h

Device

a_d

b_d

Page 33: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

Host

a_h

b_h

Device

a_d

b_d

Page 34: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

for (i=0; i< N; i++) { ASSERT(a_h[i] == b_h[i]);}

Host

a_h

b_h

Device

a_d

b_d

Page 35: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

free(a_h); free(b_h);

Host Device

a_d

b_d

Page 36: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

Host Device

cudaFree(a_d); cudaFree(b_d);

Page 37: Memory Optimizations for Graphics Processing Units

Inter-device Communication

• Inter-device communication, i.e., communication between the CPU and the GPU, should be minimized as much as possible.

• Inter-device communication is orders of magnitude slower than reading data from shared memory, for instance.

• That is why GPUs are not a good fit for applications that require constant, fine-grained interaction with the CPU.

Page 38: Memory Optimizations for Graphics Processing Units

Avoid traveling whenever you can

cudaMalloc((void**) &d_vec, mem_size);

cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);

kernel0<<< gridSize0, blockSize0 >>>(d_vec, vec_size);

kernel1<<< gridSize1, blockSize1 >>>(d_vec, vec_size);

cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);

cudaFree(d_vec);

From maxSort, available on the course webpage.

d_vec does not change between kernel calls.

Therefore, there is no need to send it again!
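A hedged sketch of the anti-pattern this slide warns against (the extra copies are shown only for contrast; the kernel and variable names are the ones used above):

// Wasteful: round-tripping d_vec through the host between the two kernels,
// even though only the GPU ever touches it.
kernel0<<<gridSize0, blockSize0>>>(d_vec, vec_size);
cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);   // unnecessary
cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);   // unnecessary
kernel1<<<gridSize1, blockSize1>>>(d_vec, vec_size);

// Better, as in the maxSort excerpt: launch kernel1 directly, because d_vec
// is still sitting in device memory, unchanged, after kernel0 finishes.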

Page 39: Memory Optimizations for Graphics Processing Units

Keep data on the GPU

• Once data is sent to the GPU, it stays in the device DRAM, even after the kernel has finished executing.

• Try invoking kernels on data already on the GPU

1) Can you think about a situation in which it is better to leave a kernel, do some computation on the CPU, and then call another kernel?

2) By the way, can you think about a problem that is inherently sequential?

Page 40: Memory Optimizations for Graphics Processing Units

The GPU deserves complex work

__global__ void matSumKernel(float* S, float* A, float* B, int side) {
  int ij = threadIdx.x + threadIdx.y * side;
  S[ij] = A[ij] + B[ij];
}

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float v = 0.0;
  for (int k = 0; k < w; ++k) {
    v += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = v;
}

1) What is the complexity of copying data from the CPU to the GPU?

2) Is it worth doing matrix sum on the GPU?

3) Is it worth doing matrix multiplication on the GPU?

Page 41: Memory Optimizations for Graphics Processing Units

Matrix Sum × Matrix Mul

• Matrix Sum: the transfer is O(N²) and the arithmetic is O(N²), so the copy dominates the computation.

• Matrix Mul: the transfer is O(N²) but the arithmetic is O(N³), so the computation amortizes the cost of the copy.

Page 42: Memory Optimizations for Graphics Processing Units

The ballerina’s waltz

• Start working as soon as data is available – use a pipeline!

cudaMemcpy(dst, src, N * sizeof(float), dir);
kernel<<<N / nThreads, nThreads>>>(dst);

• C for CUDA has an API for asynchronous transfers:

sz = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
  offset = i * N / nStreams;
  cudaMemcpyAsync(dst + offset, src + offset, sz, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
  gridSize = N / (nThreads * nStreams);
  offset = i * N / nStreams;
  kernel<<<gridSize, nThreads, 0, stream[i]>>>(dst + offset);
}

What is the glue between data and computation in this example?
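The snippet above assumes the streams and the buffers already exist; a hedged sketch of that setup (sizes and names are illustrative):

const int nStreams = 4;
cudaStream_t stream[nStreams];
float *src, *dst;
cudaMallocHost((void**)&src, N * sizeof(float));   // pinned host memory is required for
                                                   // copies that truly overlap with kernels
cudaMalloc((void**)&dst, N * sizeof(float));
for (int i = 0; i < nStreams; i++)
  cudaStreamCreate(&stream[i]);

// ... the cudaMemcpyAsync calls and kernel launches from the slide go here ...

for (int i = 0; i < nStreams; i++)
  cudaStreamSynchronize(stream[i]);                // wait for each stage of the pipeline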

Page 43: Memory Optimizations for Graphics Processing Units

The ballerina’s waltz

• Asynchronous memory copy overlaps data transfer and GPU processing

This technique for obtaining parallelism, namely pipeline parallelism, is a pattern used in many different scenarios. Could you name other examples of pipeline parallelism?

Page 44: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

GLOBAL MEMORY ACCESS

Page 45: Memory Optimizations for Graphics Processing Units

The Silk Road Trip

• Reading or writing to the global memory is also slow.
  – But not as slow as reading/writing between host and device.
  – The global memory is on-board (on the graphics card, though off-chip).

Page 46: Memory Optimizations for Graphics Processing Units

The Matrix Multiplication Kernel

void matmult(float* B, float* C, float* A, int w) {
  for (unsigned int i = 0; i < w; ++i) {
    for (unsigned int j = 0; j < w; ++j) {
      A[i * w + j] = 0.0;
      for (unsigned int k = 0; k < w; ++k) {
        A[i * w + j] += B[i * w + k] * C[k * w + j];
      }
    }
  }
}

1) In the PRAM model, what is the asymptotic complexity of the matrix multiplication problem?

2) Could you translate this program to C for CUDA?

Page 47: Memory Optimizations for Graphics Processing Units

Matrix Multiplication Kernel

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

From matMul, available on the course webpage.

1) What is the asymptotic complexity of this program?

2) Given width = 10, how many accesses to the global memory does this program perform?

3) How can we estimate how many floating-point operations per second this program will perform?

In this example, each thread is responsible for multiplying one line of B by one column of C, to produce an element of A.

Page 48: Memory Optimizations for Graphics Processing Units

GFLOPS

  mov.f32       %f1, 0f00000000;
$Lt_0_1282:
  cvt.u64.u32   %rd3, %r7;
  mul.lo.u64    %rd4, %rd3, 4;
  ld.param.u64  %rd2, [B];
  add.u64       %rd5, %rd2, %rd4;
  ld.global.f32 %f2, [%rd5+0];
  cvt.u64.u32   %rd6, %r9;
  mul.lo.u64    %rd7, %rd6, 4;
  ld.param.u64  %rd1, [C];
  add.u64       %rd8, %rd1, %rd7;
  ld.global.f32 %f3, [%rd8+0];
  mad.f32       %f1, %f2, %f3, %f1;
  add.u32       %r7, %r7, 1;
  ld.param.s32  %r3, [w];
  add.u32       %r9, %r3, %r9;
  setp.ne.s32   %p2, %r7, %r8;
  @%p2 bra      $Lt_0_1282;

1) How many instructions do we find in block $Lt_0_1282 of this code?

2) How many floating point operations does this program perform in the inner loop?

3) What is, then, the proportion of floating point operations per GPU operation?

4) If the GTX 8800 can perform 172.8 Gflops, how many GFlops could we expect from this code?

5) But if we get a much lower number, what could be the reasons for this bad performance behavior?

Page 49: Memory Optimizations for Graphics Processing Units

Coalesced Access to the Global Memory

• The global memory is divided into segments of 16 cells. If the 16 threads of a half-warp read data from the same segment, the memory access takes only one trip (one transaction).

• However, if each thread reads from a different segment, we may have a slow access to the global memory.

In order to know if we are accessing the same segment, we can do the following check (sketched in code below):

• take a thread t whose id is 16 × n

• find out the segment s that thread t is accessing

• for every thread t + i, 1 ≤ i ≤ 15, see if t + i is accessing segment s
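A host-side sketch of that check for one group of 16 threads (purely illustrative; SEGMENT_SIZE and the index function stand in for whatever global-memory index the kernel computes for each thread):

#define SEGMENT_SIZE 16   /* cells per segment, as in the slide */

/* Returns 1 if threads 16*n .. 16*n+15 all touch the segment of thread 16*n. */
int coalesced(int n, int (*index)(int)) {
  int s = index(16 * n) / SEGMENT_SIZE;           /* segment of the first thread  */
  for (int i = 1; i <= 15; i++)
    if (index(16 * n + i) / SEGMENT_SIZE != s)    /* some thread leaves segment s */
      return 0;
  return 1;
}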

Page 50: Memory Optimizations for Graphics Processing Units

The Anatomy of a Block

[Figure: a block of threads laid out by (tid.x, tid.y); the threads of one row – consecutive tid.x, same tid.y – form a warp.]

Warps should read data in the same segment.

Page 51: Memory Optimizations for Graphics Processing Units

Checking for coalesced access

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

1) What is the work that each thread (tx, ty) is doing?

2) Which accesses to the global memory are coalesced, and which ones are not?

Page 52: Memory Optimizations for Graphics Processing Units

Checking Memory Accesses

• Assuming Width = 640 and k = 0
• Warp: (0, 0) (1, 0) (2, 0) … (31, 0)
• B[tx * Width + k]: B[0], B[640], B[1280], …, B[19840]
• C[k * Width + ty]: C[0], C[0], C[0], …, C[0]
• A[ty + tx * Width]: A[0], A[640], A[1280], …, A[19840]

...
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

We have a lot of uncoalesced accesses. How can we improve this situation?

Page 53: Memory Optimizations for Graphics Processing Units

Change the indexes a tiny bit…

New Code:

__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

Old Code:

for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

Page 54: Memory Optimizations for Graphics Processing Units

Why is it so much better?

• Assuming Width = 640 and k = 0
• Warp: (0, 0) (1, 0) (2, 0) … (31, 0)
• B[ty * Width + k]: B[0], B[0], B[0], …, B[0]
• C[k * Width + tx]: C[0], C[1], C[2], …, C[31]
• A[tx + ty * Width]: A[0], A[1], A[2], …, A[31]

...
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[ty * Width + k] * C[k * Width + tx];
}
A[tx + ty * Width] = Pvalue;

How much speedup do you think we get from this small change?

Page 55: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

THE SHARED MEMORY

Page 56: Memory Optimizations for Graphics Processing Units

Second law of performance

• Avoid going to the global or local memory. Ideally threads should be able to share as much data as possible in shared memory.

• Remember: the shared memory is on-chip; the global memory is off-chip.

• Can we improve this matrix multiplication by sharing data among threads?

Page 57: Memory Optimizations for Graphics Processing Units

Moving data to the shared memory

__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

1) How many accesses to the global memory take place in this kernel?

2) Again: what is the shared memory?

3) How to use the shared memory to improve this program?

Page 58: Memory Optimizations for Graphics Processing Units

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}

TILE_WIDTH = blockDim.x = blockDim.y

Why do we have to synchronize threads at this point?
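A hedged sketch of a matching launch configuration (assuming, as the slide does, that Width is a multiple of TILE_WIDTH and that Bd, Cd, and Ad are device arrays of Width × Width floats):

#define TILE_WIDTH 16
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // blockDim.x == blockDim.y == TILE_WIDTH
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per output tile
matMul6<<<dimGrid, dimBlock>>>(Bd, Cd, Ad, Width);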

Page 59: Memory Optimizations for Graphics Processing Units

Tiling

Can you tell if each of these indices gives us coalesced access or not?

Page 60: Memory Optimizations for Graphics Processing Units

The Tiles of Scheherazade

1) How many accesses to the global memory does the original program perform? Answer in terms of N, the width of the matrix.

2) What about the tiled program? Answer in terms of N, the matrix width, and T, the tile width.

3) What if we assume that T = N/2?

Page 61: Memory Optimizations for Graphics Processing Units

[Figure: an N × N matrix divided into tiles of width N/2.]

The Tiles of Scheherazade

• How many reads with no tiling?
• How many reads with tiling?
  – Size of each tile:
  – Number of times each tile is read:
  – Number of tiles:

Page 62: Memory Optimizations for Graphics Processing Units


The Tiles of Scheherazade

• How many reads with no tiling? 2N × N²
• How many reads with tiling?
  – Size of each tile:
  – Number of times each tile is read:
  – Number of tiles:

Page 63: Memory Optimizations for Graphics Processing Units


The Tiles of Scheherazade

• How many reads with no tiling? 2N × N²
• How many reads with tiling?
  – Size of each tile: N²/4
  – Number of times each tile is read: N/(N/2) = 2
  – Number of tiles: (N/(N/2))² × 2 = 8

So: N²/4 × 2 × 8 = 4N²
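In general (a sketch, assuming square N × N matrices and a tile width T that divides N): each block of T × T threads loads two T × T tiles in each of its N/T phases, and there are (N/T)² blocks, so

\[
\text{reads with tiling} \;=\; \left(\frac{N}{T}\right)^2 \cdot \frac{N}{T} \cdot 2T^2 \;=\; \frac{2N^3}{T},
\]

which matches the slide's numbers: 2N × N² = 2N³ reads without tiling, and 4N² reads when T = N/2.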

Page 64: Memory Optimizations for Graphics Processing Units

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}

1) Can we optimize this program with loop unrolling?

2) We have two loops. Which loop is more reasonable to unroll?

Loop Unrolling

Page 65: Memory Optimizations for Graphics Processing Units

float Pvalue = 0;
for (int m = 0; m < w/8; ++m) {
  Mds[tx][ty] = Md[Row * w + (m * 8 + ty)];
  Nds[tx][ty] = Nd[Col + (m * 8 + tx) * w];
  __syncthreads();
  Pvalue += Mds[tx][0] * Nds[0][ty];
  Pvalue += Mds[tx][1] * Nds[1][ty];
  Pvalue += Mds[tx][2] * Nds[2][ty];
  Pvalue += Mds[tx][3] * Nds[3][ty];
  Pvalue += Mds[tx][4] * Nds[4][ty];
  Pvalue += Mds[tx][5] * Nds[5][ty];
  Pvalue += Mds[tx][6] * Nds[6][ty];
  Pvalue += Mds[tx][7] * Nds[7][ty];
  __syncthreads();
}
Pd[Row * Width + Col] = Pvalue;

Loop Unrolling

1) What is the proportion of floating-point to non-floating-point operations in the innermost loop of the optimized program?

2) What do we gain, at the hardware level, from having fewer branches to execute?

3) About ½ of the instructions in the innermost loop are floating-point operations. What GFlops rate could we now expect on the GTX 8800, which has an upper limit of 172.8 GFlops?

Page 66: Memory Optimizations for Graphics Processing Units

Shared Bank Conflicts

• The shared memory has 16 access doors (banks).

• If two threads of a half-warp read from the same door, then a conflict happens.

[Figure: three 4 × 4 diagrams of the 16 banks (0–15). Ideal cases: every thread of the half-warp hits a different bank (no conflict), or all threads read the very same address. Very bad case: several threads hit the same bank.]
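A minimal hedged sketch of the two extremes, assuming 16 banks of 4-byte words (the kernel exists only to illustrate the access patterns):

__global__ void bank_demo(float *out) {
  __shared__ float s[16 * 16];
  int t = threadIdx.x;                 // consider one half-warp: t = 0 .. 15
  s[t] = (float)t;
  __syncthreads();
  float good = s[t];                   // stride 1: each thread hits a different bank
  float bad  = s[t * 16];              // stride 16: every thread hits bank 0, a 16-way conflict
  out[t] = good + bad;
}

// Assumed launch: bank_demo<<<1, 16>>>(d_out);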

Page 67: Memory Optimizations for Graphics Processing Units

Matrix transpose

__global__ void transpose1(float* In, float* Out, int Width) {   // opening line assumed, mirroring transpose2 below
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;

  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;

  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

BLOCK_WIDTH = 16

From transpose, available in the course webpage.

1) Is access to the global memory coalesced?

2) Where is the bank conflict to access shared memory?

Page 68: Memory Optimizations for Graphics Processing Units

[Figure: 4 × 4 bank-layout diagrams for tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x]; in the first, the warp's accesses all fall on the same access door, while in the second they spread across different doors.]

• Let's assume a warp with 4 threads and 4 access doors.
• Each color is an access door.
• The red square marks a warp.
• If the warp falls on cells having the same color, then we have a conflict.

tile[threadIdx.x][threadIdx.y] = In[index_in];
__syncthreads();
Out[index_out] = tile[threadIdx.y][threadIdx.x];

P.S.: in the figure, (x, y) is the thread index within the warp, not the data index.

Page 69: Memory Optimizations for Graphics Processing Units

A very simple solution:

• You may not quite believe me…

__global__ void transpose2(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH + 1];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;

  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;

  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

Page 70: Memory Optimizations for Graphics Processing Units

[Figure: the 4 × 4 diagrams for tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x] redrawn with the extra padding column; in both cases the warp now falls on different access doors.]

The extra column breaks the bad alignment in both cases.
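A sketch of why the padding works, assuming BLOCK_WIDTH = 16 and 16 banks of 4-byte words (the general argument, not taken from the slides): with the padded row length of 17, element tile[y][x] sits at offset 17y + x, so

\[
\text{bank}\big(\texttt{tile}[y][x]\big) = (17y + x) \bmod 16 = (x + y) \bmod 16,
\]

and whichever of the two indices varies across the warp while the other stays fixed, the bank number now takes 16 distinct values instead of repeatedly hitting the same bank.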

Page 71: Memory Optimizations for Graphics Processing Units

Stay Tuned

• We have seen a suite of memory optimizations that speed up code meant to run on a GPU.

• There is another phenomenon that plays a very important role in the execution of GPU programs: control-flow divergence.

• Control flow divergences are the subject of our next class.