Memory Optimizations for Graphics Processing Units


Transcript of Memory Optimizations for Graphics Processing Units

Page 1: Memory Optimizations for Graphics Processing Units

λFernando Magno Quintão Pereira

PROGRAMMING LANGUAGES LABORATORY
Universidade Federal de Minas Gerais - Department of Computer Science

PROGRAM ANALYSIS AND OPTIMIZATION – DCC888

MEMORY OPTIMIZATIONS FOR GRAPHICS

PROCESSING UNITS

The material in these slides has been taken from the NVIDIA manuals (Best Practices Guide & Optimizing Matrix Transpose in CUDA), and from a paper by Ryoo et al [Ryoo12]. See "A bit of History" in the last slide

Page 2: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

WHAT ARE GRAPHICS PROCESSING UNITS

Page 3: Memory Optimizations for Graphics Processing Units

The Wheel of Reincarnation

• In the good old days, the graphics hardware was just the VGA. All the processing was in software.

• People started complaining: software is slow…
  – But what do you want to run at the hardware level?

A scandalously brief history of GPUs

1) Do you know how the frame buffer works?

2) Can you program the VGA standard in any way?

Page 4: Memory Optimizations for Graphics Processing Units

The Wheel of Reincarnation

• Some functions, like the rasterizer, are heavily used. What is rasterization?

• Better to implement these functions in hardware.

A scandalously brief history of GPUs

1) How can we implement a function at the hardware level?

2) What is the advantage of implementing a function at the hardware level?

3) Is there any drawback?

4) Can we program (in any way) this hardware used to implement a specific function?

Page 5: Memory Optimizations for Graphics Processing Units

Graphics Pipeline

• Graphics can be processed in a pipeline.
  – Transform, project, clip, display, etc.
• Some functions, although different, can be implemented by very similar hardware.
• Add a graphics API to program the shaders.
  – But this API is so specific… and the hardware is so powerful… what a waste!

A scandalously brief history of GPUs

Shading is an example. Do you know what a shader is?

Page 6: Memory Optimizations for Graphics Processing Units

General Purpose Hardware

• Let's add an instruction set to the shader.
  – Let's augment this hardware with general-purpose integer operations.
  – What about adding some branching machinery too?
• Hmm… add a high-level language on top of this stuff.
  – Plus a lot of documentation. Advertise it! It should look cool!
• Oh boy: we now have two general-purpose processors.
  – We should unify them. The rant starts all over again…

A scandalously brief history of GPUs

Page 7: Memory Optimizations for Graphics Processing Units

1.5 turns around the wheel

• Let's add a display processor to the display processor.
  – After all, there are some operations that are really specific and performance critical…

Dedicated rasterizer

A scandalously brief history of GPUs

Page 8: Memory Optimizations for Graphics Processing Units

Brief Timeline

Year   Transistors   Model             Tech
1999   25M           GeForce 256       DX7, OpenGL
2001   60M           GeForce 3         Programmable Shader
2002   125M          GeForce FX        Cg programs
2006   681M          GeForce 8800      C for CUDA
2008   1.4G          GeForce GTX 280   IEEE FP
2010   3.0G          Fermi             Cache, C++

A scandalously brief history of GPUs

Page 9: Memory Optimizations for Graphics Processing Units

Computer Organization

• GPUs show different types of parallelism:
  – Single Instruction, Multiple Data (SIMD)
  – Single Program, Multiple Data (SPMD)
• In the end, we have MSIMD hardware: multiple SIMD units.

1) Why are GPUs so parallel?

2) Why do traditional CPUs not exhibit all this parallelism?

We can think of SIMD hardware as a firing squad: we have a captain and a row of soldiers. The captain issues orders, such as set, aim, fire! And all the soldiers, upon hearing one of these orders, perform the same action. They all do the same thing, yet they use different guns and bullets.

An outrageously concise overview of the programming model.

Page 10: Memory Optimizations for Graphics Processing Units

The Programming Environment
An outrageously concise overview of the programming model.

• There are two main programming languages used to program graphics processing units today: OpenCL and C for CUDA.
• These are not the first languages developed for GPUs. They came after Cg or HLSL, for instance.
  – But they are much more general and expressive.
• We will focus on C for CUDA.
• This language lets the programmer explicitly write code that will run on the CPU and code that will run on the GPU.
  – It is a heterogeneous programming language.

Page 11: Memory Optimizations for Graphics Processing Units

From C to CUDA in one Step
An outrageously concise overview of the programming model.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

• This program, written in C, performs a typical vector operation: it reads two arrays and updates one of them in place (y ← αx + y).

• We will translate this program to C for CUDA.

1) What is the asymptotic complexity of this program?

2) How much can we parallelize this program? In a world with many – really many – processors, e.g., the PRAM world, what would be the complexity of this program?

Page 12: Memory Optimizations for Graphics Processing Units

The first Cuda program

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

What happened to the loop in the CUDA program?

An outrageously concise overview of the programming model.

Page 13: Memory Optimizations for Graphics Processing Units

Understanding the Code

• Threads are grouped in warps, blocks, and grids.
• Threads in different grids do not talk to each other.
  – Grids are divided into blocks.
• Threads in the same block share memory and barriers.
  – Blocks are divided into warps.
• Threads in the same warp follow the SIMD model.

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}

// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

An outrageously concise overview of the programming model.
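To make the "shared memory + barriers" point concrete, here is a minimal hedged sketch (not taken from the slides; the kernel and its launch configuration are illustrative): each thread of a block writes one slot of a shared buffer, the block synchronizes, and then every thread reads a slot written by another thread of the same block.

__global__ void reverse_in_block(float *v) {
  __shared__ float buf[256];              // shared memory: one slot per thread of the block
  int t    = threadIdx.x;
  int base = blockIdx.x * blockDim.x;
  buf[t] = v[base + t];                   // every thread fills its own slot
  __syncthreads();                        // barrier: all slots are written before anyone reads
  v[base + t] = buf[blockDim.x - 1 - t];  // read a slot written by another thread
}

// Assumed launch: 256 threads per block, n a multiple of 256:
// reverse_in_block<<<n / 256, 256>>>(d_v);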

Page 14: Memory Optimizations for Graphics Processing Units

Raising the level

• CUDA programs contain CPU code plus kernels.

• Kernels are called via a special syntax:

• The C part of the program is compiled as traditional C.

• The kernel part is first translated into PTX, and then this high-level assembly is translated into SASS.

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float Pvalue = 0.0;
  for (int k = 0; k < w; ++k) {
    Pvalue += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = Pvalue;
}

void Mul(const float* A, const float* B, int width, float* C) {
  int size = width * width * sizeof(float);
  // Load A and B to the device
  float* Ad; cudaMalloc((void**)&Ad, size);
  cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
  float* Bd; cudaMalloc((void**)&Bd, size);
  cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
  // Allocate C on the device
  float* Cd; cudaMalloc((void**)&Cd, size);
  // Compute the execution configuration assuming
  // the matrix dimensions are multiples of BLOCK_SIZE
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(width / dimBlock.x, width / dimBlock.y);
  // Launch the device computation
  matMul1<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, width);
  // Read C from the device
  cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
  // Free device memory
  cudaFree(Ad); cudaFree(Bd); cudaFree(Cd);
}

kernel<<<dGrd, dBck>>>(A,B,w,C);

An outrageously concise overview of the programming model.

Page 15: Memory Optimizations for Graphics Processing Units

Lowering the level

• CUDA assembly is called Parallel Thread Execution (PTX)

.entry saxpy_GPU (n, a, x, y) {
  .reg .u16  %rh<4>;
  .reg .u32  %r<6>;
  .reg .u64  %rd<8>;
  .reg .f32  %f<6>;
  .reg .pred %p<3>;
$LBB1__Z9saxpy_GPUifPfS_:
  mov.u16       %rh1, %ctaid.x;
  mov.u16       %rh2, %ntid.x;
  mul.wide.u16  %r1, %rh1, %rh2;
  cvt.u32.u16   %r2, %tid.x;
  add.u32       %r3, %r2, %r1;
  ld.param.s32  %r4, [n];
  setp.le.s32   %p1, %r4, %r3;
  @%p1 bra      $Lt_0_770;
  .loc 28 13 0
  cvt.u64.s32   %rd1, %r3;
  mul.lo.u64    %rd2, %rd1, 4;
  ld.param.u64  %rd3, [y];
  add.u64       %rd4, %rd3, %rd2;
  ld.global.f32 %f1, [%rd4+0];
  ld.param.u64  %rd5, [x];
  add.u64       %rd6, %rd5, %rd2;
  ld.global.f32 %f2, [%rd6+0];
  ld.param.f32  %f3, [alpha];
  mad.f32       %f4, %f2, %f3, %f1;
  st.global.f32 [%rd4+0], %f4;
$Lt_0_770:
  exit;
}

__global__ void
saxpy_parallel(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

What do you think an assembly language for parallel programming should have?

An outrageously concise overview of the programming model.

Page 16: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

MEMORY OPTIMIZATIONS

Page 17: Memory Optimizations for Graphics Processing Units

A Brief Overview of the GPU Threading Model

• Each thread has local registers and local memory.
• Threads are grouped into warps.
  – Warps run in SIMD fashion.
• Warps are grouped into blocks.
  – Shared memory + barrier synchronization.
• Blocks are grouped into grids.
  – Each grid represents a kernel.

Page 18: Memory Optimizations for Graphics Processing Units

1) How do different grids communicate?

2) How do threads in the same block communicate?

3) What determines the size of the block of threads?

4) What determines the size of the grid of threads?

5) What is the effect of branches in the warp execution?

A Brief Overview of the GPU Threading Model

Page 19: Memory Optimizations for Graphics Processing Units

Going to the Archives

• GPUs are much more memory intensive than traditional CPUs. Let's look at an example:
  – The GeForce 8800 processes 32 pixels per clock. Each pixel contains a color (3 bytes) and a depth (4 bytes), which are both read and written. On average, 16 extra bytes of information are read for each pixel. How many bytes are processed per clock?

To put these numbers in perspective, how much data is processed in each cycle of an ordinary x86 CPU?
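For reference, one way to tally the slide's numbers (a back-of-the-envelope count, nothing more):

\[
\underbrace{(3+4)}_{\text{read}} + \underbrace{(3+4)}_{\text{written}} + \underbrace{16}_{\text{extra reads}} = 30 \text{ bytes per pixel}, \qquad 30 \times 32 = 960 \text{ bytes per clock}.
\]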

Page 20: Memory Optimizations for Graphics Processing Units

The GPU Archive

• Registers: fast, yet few. Private to each thread

• Shared memory: used by threads in the same block

• Local memory: off-chip and slow. Private to each thread

• Global memory: off-chip and slow. Used to provide communication between blocks and grids.
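A minimal sketch of how each of these memory spaces shows up in C for CUDA (the kernel and variable names are illustrative, not from the slides):

__device__ float g_table[1024];            // global memory: visible to every block and kernel

__global__ void where_data_lives(float *out) {
  __shared__ float s_buf[256];             // shared memory: one copy per block
  float r = 2.0f * threadIdx.x;            // scalar locals usually live in registers
  float big[64];                           // large or dynamically indexed arrays may be
                                           // spilled to slow, off-chip local memory
  big[threadIdx.x % 64] = r;
  s_buf[threadIdx.x] = g_table[threadIdx.x] + big[threadIdx.x % 64];
  __syncthreads();
  out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x];
}

// Assumed launch: at most 256 threads per block, e.g. where_data_lives<<<4, 256>>>(d_out);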

Page 21: Memory Optimizations for Graphics Processing Units

The GPU Archive

1) Why can't we leave all the data in registers?

2) Why can't we leave all the data in shared memory?

3) The CPU also has a memory hierarchy. Do you remember what this hierarchy looks like?

4) Why do we have a memory hierarchy also in the CPU?

• Registers: fast, yet few. Private to each thread

• Shared memory: used by threads in the same block

• Local memory: off-chip and slow. Private to each thread

• Global memory: off-chip and slow. Used to provide communication between blocks and grids.

Page 22: Memory Optimizations for Graphics Processing Units

The Traveler’s Journal

The interplanetary trip – commuting data between host and device

The silk road trip – commuting data between global and shared memory

The bakery walk – commuting data between shared memory and registers

Reaching a book on the table – reading/writing data to registers

Page 23: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

INTER-DEVICE COMMUNICATION

Page 24: Memory Optimizations for Graphics Processing Units

The Interplanetary Trip

• Copying data between GPU and CPU is pretty slow. CUDA provides some library functions for this:

– cudaMalloc: allocates data in the GPU memory space

– cudaMemset: fills a memory area with a value

– cudaFree: frees the data in the GPU memory space

– cudaMemcpy: copies data from CPU to GPU, or vice-versa
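These calls all return a cudaError_t; a small hedged helper (not from the slides) makes failures visible during development:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call and abort with a readable message on failure.
#define CUDA_CHECK(call)                                                 \
  do {                                                                   \
    cudaError_t err = (call);                                            \
    if (err != cudaSuccess) {                                            \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,                 \
              cudaGetErrorString(err));                                  \
      exit(EXIT_FAILURE);                                                \
    }                                                                    \
  } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&a_d, nbytes));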

Page 25: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int nbytes = 1024 * sizeof(int);
int *a_d = 0;
cudaMalloc((void**)&a_d, nbytes);
cudaMemset(a_d, 0, nbytes);
cudaFree(a_d);

What is each of these calls doing?

Page 26: Memory Optimizations for Graphics Processing Units

This program copies data from the host to the device, and then moves this data inside the device, and finally brings the data back to the host memory.

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

Page 27: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

float *a_h, *b_h;float *a_d, *b_d;int N = 14, nBytes, i;

Page 28: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

nBytes = N * sizeof(float);a_h = (float*)malloc(nBytes);b_h = (float*)malloc(nBytes);

Host

a_h

b_h

Page 29: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMalloc((void**)&a_d, nBytes);cudaMalloc((void**)&b_d, nBytes);

Host

a_h

b_h

Device

a_d

b_d

Page 30: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemset(a_d, 0, nBytes);

Host

a_h

b_h

Device

a_d

b_d

Page 31: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

Host

a_h

b_h

Device

a_d

b_d

Page 32: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);

Host

a_h

b_h

Device

a_d

b_d

Page 33: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

Host

a_h

b_h

Device

a_d

b_d

Page 34: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

for (i=0; i< N; i++) { ASSERT(a_h[i] == b_h[i]);}

Host

a_h

b_h

Device

a_d

b_d

Page 35: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

free(a_h); free(b_h);

Host Device

a_d

b_d

Page 36: Memory Optimizations for Graphics Processing Units

The interplanetary trip

int main(int argc, char** argv) { float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i ;

nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);

cudaMemset(a_d, 0, nBytes);

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) { ASSERT( a_h[i] == b_h[i] ); } free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return EXIT_SUCCESS;}

Host Device

cudaFree(a_d); cudaFree(b_d);

Page 37: Memory Optimizations for Graphics Processing Units

Inter-device Communication

• Inter-device communication, i.e., communication between the CPU and the GPU, should be minimized as much as possible.

• Inter-device communication is orders of magnitude slower than reading data from shared memory, for instance.

• That is why GPUs are not a good fit for applications that require constant, fine-grained interaction with the CPU.

Page 38: Memory Optimizations for Graphics Processing Units

Avoid traveling whenever you can

cudaMalloc((void**) &d_vec, mem_size);

cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);

kernel0<<< gridSize0, blockSize0 >>>(d_vec, vec_size);

kernel1<<< gridSize1, blockSize1 >>>(d_vec, vec_size);

cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);

cudaFree(d_vec);

From maxSort, available on the course webpage.

d_vec does not change between kernel calls.

Therefore, there is no need to send it again!
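A hedged sketch of the anti-pattern this slide warns against (the extra copies are shown only for contrast; the kernel and variable names are the ones used above):

// Wasteful: round-tripping d_vec through the host between the two kernels,
// even though only the GPU ever touches it.
kernel0<<<gridSize0, blockSize0>>>(d_vec, vec_size);
cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);   // unnecessary
cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);   // unnecessary
kernel1<<<gridSize1, blockSize1>>>(d_vec, vec_size);

// Better, as in the maxSort excerpt: launch kernel1 directly, because d_vec
// is still sitting in device memory, unchanged, after kernel0 finishes.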

Page 39: Memory Optimizations for Graphics Processing Units

Keep data on the GPU

• Once data is sent to the GPU, it stays in the device DRAM, even after the kernel has finished executing.

• Try invoking kernels on data already on the GPU

1) Can you think about a situation in which it is better to leave a kernel, do some computation on the CPU, and then call another kernel?

2) By the way, can you think about a problem that is inherently sequential?

Page 40: Memory Optimizations for Graphics Processing Units

The GPU deserves complex work

__global__ void matSumKernel(float* S, float* A, float* B, int side) {
  int ij = threadIdx.x + threadIdx.y * side;
  S[ij] = A[ij] + B[ij];
}

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float v = 0.0;
  for (int k = 0; k < w; ++k) {
    v += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = v;
}

1) What is the complexity of copying data from the CPU to the GPU?

2) Is it worth doing matrix sum on the GPU?

3) Is it worth doing matrix multiplication on the GPU?

Page 41: Memory Optimizations for Graphics Processing Units

Matrix Sum × Matrix Mul

• Matrix Sum: the transfer is O(N²) and the arithmetic is O(N²), so the copy dominates the computation.

• Matrix Mul: the transfer is O(N²) but the arithmetic is O(N³), so the computation amortizes the cost of the copy.

Page 42: Memory Optimizations for Graphics Processing Units

The ballerina’s waltz

• Start working as soon as data is available – use a pipeline!

cudaMemcpy(dst, src, N * sizeof(float), dir);
kernel<<<N / nThreads, nThreads>>>(dst);

• C for CUDA has an API for asynchronous transfers:

sz = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
  offset = i * N / nStreams;
  cudaMemcpyAsync(dst + offset, src + offset, sz, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
  gridSize = N / (nThreads * nStreams);
  offset = i * N / nStreams;
  kernel<<<gridSize, nThreads, 0, stream[i]>>>(dst + offset);
}

What is the glue between data and computation in this example?
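The snippet above assumes the streams and the buffers already exist; a hedged sketch of that setup (sizes and names are illustrative):

const int nStreams = 4;
cudaStream_t stream[nStreams];
float *src, *dst;
cudaMallocHost((void**)&src, N * sizeof(float));   // pinned host memory is required for
                                                   // copies that truly overlap with kernels
cudaMalloc((void**)&dst, N * sizeof(float));
for (int i = 0; i < nStreams; i++)
  cudaStreamCreate(&stream[i]);

// ... the cudaMemcpyAsync calls and kernel launches from the slide go here ...

for (int i = 0; i < nStreams; i++)
  cudaStreamSynchronize(stream[i]);                // wait for each stage of the pipeline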

Page 43: Memory Optimizations for Graphics Processing Units

The ballerina’s waltz

• Asynchronous memory copy overlaps data transfer and GPU processing

This technique for obtaining parallelism, namely pipeline parallelism, is a pattern used in many different scenarios. Could you name other examples of pipeline parallelism?

Page 44: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

GLOBAL MEMORY ACCESS

Page 45: Memory Optimizations for Graphics Processing Units

The Silk Road Trip

• Reading or writing to the global memory is also slow.
  – But not as slow as reading/writing between host and device.
  – The global memory is on-board (on the graphics card, though off-chip).

Page 46: Memory Optimizations for Graphics Processing Units

The Matrix Multiplication Kernel

void matmult(float* B, float* C, float* A, int w) {
  for (unsigned int i = 0; i < w; ++i) {
    for (unsigned int j = 0; j < w; ++j) {
      A[i * w + j] = 0.0;
      for (unsigned int k = 0; k < w; ++k) {
        A[i * w + j] += B[i * w + k] * C[k * w + j];
      }
    }
  }
}

1) In the PRAM model, what is the asymptotic complexity of the matrix multiplication problem?

2) Could you translate this program to C for CUDA?

Page 47: Memory Optimizations for Graphics Processing Units

Matrix Multiplication Kernel

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

From matMul, available on the course webpage.

1) What is the asymptotic complexity of this program?

2) Given width = 10, how many accesses to the global memory does this program perform?

3) How can we estimate how many floating-point operations per second this program will perform?

In this example, each thread is responsible for multiplying one line of B by one column of C, to produce an element of A.

Page 48: Memory Optimizations for Graphics Processing Units

GFLOPS

  mov.f32       %f1, 0f00000000;
$Lt_0_1282:
  cvt.u64.u32   %rd3, %r7;
  mul.lo.u64    %rd4, %rd3, 4;
  ld.param.u64  %rd2, [B];
  add.u64       %rd5, %rd2, %rd4;
  ld.global.f32 %f2, [%rd5+0];
  cvt.u64.u32   %rd6, %r9;
  mul.lo.u64    %rd7, %rd6, 4;
  ld.param.u64  %rd1, [C];
  add.u64       %rd8, %rd1, %rd7;
  ld.global.f32 %f3, [%rd8+0];
  mad.f32       %f1, %f2, %f3, %f1;
  add.u32       %r7, %r7, 1;
  ld.param.s32  %r3, [w];
  add.u32       %r9, %r3, %r9;
  setp.ne.s32   %p2, %r7, %r8;
  @%p2 bra      $Lt_0_1282;

1) How many instructions do we find in block $Lt_0_1282 of this code?

2) How many floating point operations does this program perform in the inner loop?

3) What is, then, the proportion of floating point operations per GPU operation?

4) If the GTX 8800 can perform 172.8 Gflops, how many GFlops could we expect from this code?

5) But if we get a much lower number, what could be the reasons for this bad performance behavior?

Page 49: Memory Optimizations for Graphics Processing Units

Coalesced Access to the Global Memory

• The global memory is divided into segments of 16 cells. If the 16 threads of a half-warp read data from the same segment, the memory access takes only one trip (one transaction).

• However, if each thread reads from a different segment, we may have a slow access to the global memory.

In order to know if we are accessing the same segment, we can do the following check (sketched in code below):

• take a thread t whose id is 16 × n

• find out the segment s that thread t is accessing

• for every thread t + i, 1 ≤ i ≤ 15, see if t + i is accessing segment s
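A host-side sketch of that check for one group of 16 threads (purely illustrative; SEGMENT_SIZE and the index function stand in for whatever global-memory index the kernel computes for each thread):

#define SEGMENT_SIZE 16   /* cells per segment, as in the slide */

/* Returns 1 if threads 16*n .. 16*n+15 all touch the segment of thread 16*n. */
int coalesced(int n, int (*index)(int)) {
  int s = index(16 * n) / SEGMENT_SIZE;           /* segment of the first thread  */
  for (int i = 1; i <= 15; i++)
    if (index(16 * n + i) / SEGMENT_SIZE != s)    /* some thread leaves segment s */
      return 0;
  return 1;
}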

Page 50: Memory Optimizations for Graphics Processing Units

The Anatomy of a Block

[Figure: a block of threads laid out by (tid.x, tid.y); the threads of one row – consecutive tid.x, same tid.y – form a warp.]

Warps should read data in the same segment.

Page 51: Memory Optimizations for Graphics Processing Units

Checking for coalesced access

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

1) What is the work that each thread (tx, ty) is doing?

2) Which accesses to the global memory are coalesced, and which ones are not?

Page 52: Memory Optimizations for Graphics Processing Units

Checking Memory Accesses

• Assuming Width = 640 and k = 0
• Warp: (0, 0) (1, 0) (2, 0) … (31, 0)
• B[tx * Width + k]: B[0], B[640], B[1280], …, B[19840]
• C[k * Width + ty]: C[0], C[0], C[0], …, C[0]
• A[ty + tx * Width]: A[0], A[640], A[1280], …, A[19840]

...
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

We have a lot of uncoalesced accesses. How can we improve this situation?

Page 53: Memory Optimizations for Graphics Processing Units

Change the indexes a tiny bit…

New Code:

__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

Old Code:

for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

Page 54: Memory Optimizations for Graphics Processing Units

Why is it so much better?

• Assuming Width = 640 and k = 0
• Warp: (0, 0) (1, 0) (2, 0) … (31, 0)
• B[ty * Width + k]: B[0], B[0], B[0], …, B[0]
• C[k * Width + tx]: C[0], C[1], C[2], …, C[31]
• A[tx + ty * Width]: A[0], A[1], A[2], …, A[31]

...
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[ty * Width + k] * C[k * Width + tx];
}
A[tx + ty * Width] = Pvalue;

How much speedup do you think we get from this small change?

Page 55: Memory Optimizations for Graphics Processing Units

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

THE SHARED MEMORY

Page 56: Memory Optimizations for Graphics Processing Units

Second law of performance

• Avoid going to the global or local memory. Ideally threads should be able to share as much data as possible in shared memory.

• Remember: the shared memory is on-chip; the global memory is off-chip.

• Can we improve this matrix multiplication by sharing data among threads?

Page 57: Memory Optimizations for Graphics Processing Units

Moving data to the shared memory

__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

1) How many accesses to the global memory take place in this kernel?

2) Again: what is the shared memory?

3) How to use the shared memory to improve this program?

Page 58: Memory Optimizations for Graphics Processing Units

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}

TILE_WIDTH = blockDim.x = blockDim.y

Why do we have to synchronize threads at this point?
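A hedged sketch of a matching launch configuration (assuming, as the slide does, that Width is a multiple of TILE_WIDTH and that Bd, Cd, and Ad are device arrays of Width × Width floats):

#define TILE_WIDTH 16
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // blockDim.x == blockDim.y == TILE_WIDTH
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per output tile
matMul6<<<dimGrid, dimBlock>>>(Bd, Cd, Ad, Width);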

Page 59: Memory Optimizations for Graphics Processing Units

Tiling

Can you tell if each of these indices gives us coalesced access or not?

Page 60: Memory Optimizations for Graphics Processing Units

The Tiles of Scheherazade

1) How many accesses to the global memory does the original program perform? Answer in terms of N, the width of the matrix.

2) What about the tiled program? Answer in terms of N, the matrix width, and T, the tile width.

3) What if we assume that T = N/2?

Page 61: Memory Optimizations for Graphics Processing Units

[Figure: an N × N matrix divided into tiles of width N/2.]

The Tiles of Scheherazade

• How many reads with no tiling?
• How many reads with tiling?
  – Size of each tile:
  – Number of times each tile is read:
  – Number of tiles:

Page 62: Memory Optimizations for Graphics Processing Units


The Tiles of Scheherazade

• How many reads with no tiling? 2N × N²
• How many reads with tiling?
  – Size of each tile:
  – Number of times each tile is read:
  – Number of tiles:

Page 63: Memory Optimizations for Graphics Processing Units


The Tiles of Scheherazade

• How many reads with no tiling? 2N × N²
• How many reads with tiling?
  – Size of each tile: N²/4
  – Number of times each tile is read: N/(N/2) = 2
  – Number of tiles: (N/(N/2))² × 2 = 8

So: N²/4 × 2 × 8 = 4N²
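In general (a sketch, assuming square N × N matrices and a tile width T that divides N): each block of T × T threads loads two T × T tiles in each of its N/T phases, and there are (N/T)² blocks, so

\[
\text{reads with tiling} \;=\; \left(\frac{N}{T}\right)^2 \cdot \frac{N}{T} \cdot 2T^2 \;=\; \frac{2N^3}{T},
\]

which matches the slide's numbers: 2N × N² = 2N³ reads without tiling, and 4N² reads when T = N/2.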

Page 64: Memory Optimizations for Graphics Processing Units

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}

1) Can we optimize this program with loop unrolling?

2) We have two loops. Which loop is more reasonable to unroll?

Loop Unrolling

Page 65: Memory Optimizations for Graphics Processing Units

float Pvalue = 0;
for (int m = 0; m < w/8; ++m) {
  Mds[tx][ty] = Md[Row * w + (m * 8 + ty)];
  Nds[tx][ty] = Nd[Col + (m * 8 + tx) * w];
  __syncthreads();
  Pvalue += Mds[tx][0] * Nds[0][ty];
  Pvalue += Mds[tx][1] * Nds[1][ty];
  Pvalue += Mds[tx][2] * Nds[2][ty];
  Pvalue += Mds[tx][3] * Nds[3][ty];
  Pvalue += Mds[tx][4] * Nds[4][ty];
  Pvalue += Mds[tx][5] * Nds[5][ty];
  Pvalue += Mds[tx][6] * Nds[6][ty];
  Pvalue += Mds[tx][7] * Nds[7][ty];
  __syncthreads();
}
Pd[Row * Width + Col] = Pvalue;

Loop Unrolling

1) What is the proportion of floating-point to non-floating-point operations in the innermost loop of the optimized program?

2) What do we gain, at the hardware level, from having fewer branches to execute?

3) About ½ of the instructions in the innermost loop are floating-point operations. What GFlops rate could we now expect on the GTX 8800, which has an upper limit of 172.8 GFlops?

Page 66: Memory Optimizations for Graphics Processing Units

Shared Bank Conflicts

• The shared memory has 16 access doors (banks).

• If two threads of a half-warp read from the same door, then a conflict happens.

[Figure: three 4 × 4 diagrams of the 16 banks (0–15). Ideal cases: every thread of the half-warp hits a different bank (no conflict), or all threads read the very same address. Very bad case: several threads hit the same bank.]
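A minimal hedged sketch of the two extremes, assuming 16 banks of 4-byte words (the kernel exists only to illustrate the access patterns):

__global__ void bank_demo(float *out) {
  __shared__ float s[16 * 16];
  int t = threadIdx.x;                 // consider one half-warp: t = 0 .. 15
  s[t] = (float)t;
  __syncthreads();
  float good = s[t];                   // stride 1: each thread hits a different bank
  float bad  = s[t * 16];              // stride 16: every thread hits bank 0, a 16-way conflict
  out[t] = good + bad;
}

// Assumed launch: bank_demo<<<1, 16>>>(d_out);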

Page 67: Memory Optimizations for Graphics Processing Units

Matrix transpose

__global__ void transpose1(float* In, float* Out, int Width) {   // opening line assumed, mirroring transpose2 below
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;

  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;

  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

BLOCK_WIDTH = 16

From transpose, available in the course webpage.

1) Is access to the global memory coalesced?

2) Where is the bank conflict to access shared memory?

Page 68: Memory Optimizations for Graphics Processing Units

[Figure: 4 × 4 bank-layout diagrams for tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x]; in the first, the warp's accesses all fall on the same access door, while in the second they spread across different doors.]

• Let's assume a warp with 4 threads and 4 access doors.
• Each color is an access door.
• The red square marks a warp.
• If the warp falls on cells having the same color, then we have a conflict.

tile[threadIdx.x][threadIdx.y] = In[index_in];
__syncthreads();
Out[index_out] = tile[threadIdx.y][threadIdx.x];

P.S.: in the figure, (x, y) is the thread index within the warp, not the data index.

Page 69: Memory Optimizations for Graphics Processing Units

A very simple solution:

• You may not quite believe me…

__global__ void transpose2(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH + 1];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;

  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;

  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

Page 70: Memory Optimizations for Graphics Processing Units

[Figure: the 4 × 4 diagrams for tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x] redrawn with the extra padding column; in both cases the warp now falls on different access doors.]

The extra column breaks the bad alignment in both cases.
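A sketch of why the padding works, assuming BLOCK_WIDTH = 16 and 16 banks of 4-byte words (the general argument, not taken from the slides): with the padded row length of 17, element tile[y][x] sits at offset 17y + x, so

\[
\text{bank}\big(\texttt{tile}[y][x]\big) = (17y + x) \bmod 16 = (x + y) \bmod 16,
\]

and whichever of the two indices varies across the warp while the other stays fixed, the bank number now takes 16 distinct values instead of repeatedly hitting the same bank.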

Page 71: Memory Optimizations for Graphics Processing Units

Stay Tuned

• We have seen a suite of memory optimizations that speed up code meant to run on a GPU.

• There is another phenomenon that plays a very important role in the execution of GPU programs: control-flow divergence.

• Control flow divergences are the subject of our next class.