EPL372 Parallel Processing
Parallel Architectures: SIMD, GPU
Yiannos Sazeides, Spring Semester 2014
READING
1. Paper on GPUs: Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Volume 28, Issue 2, March-April 2008, pages 39-55.
2. HW#2
Slides based on Elsevier Material for H&P 5th edition. Copyright © 2012, Elsevier Inc. All rights reserved.
Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
SIMD Extensions
• Media applications operate on data types narrower than the native word size
  – Example: disconnect carry chains to “partition” the adder
• Limitations, compared to vector instructions:
  – Number of data operands is encoded into the opcode
  – No sophisticated addressing modes (strided, scatter-gather)
  – No mask registers
SIMD Instruction Set Extensions for Multimedia
SIMD Implementations
• Implementations:
  – Intel MMX (1996)
    • Eight 8-bit integer ops or four 16-bit integer ops
  – Streaming SIMD Extensions (SSE) (1999)
    • Eight 16-bit integer ops
    • Four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Advanced Vector Extensions (AVX) (2010)
    • Four 64-bit integer/fp ops
  – Operands must be in consecutive and aligned memory locations
DAXPY on 64 elements with non-vector/SIMD instructions
      L.D    F0,a        ;load scalar a
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.D    F4,0[Rx]    ;load X[i]
      MUL.D  F4,F4,F0    ;a×X[i]
      L.D    F8,0[Ry]    ;load Y[i]
      ADD.D  F8,F8,F4    ;a×X[i]+Y[i]
      S.D    0[Ry],F8    ;store into Y[i]
      DADDIU Rx,Rx,#8    ;increment index to X
      DADDIU Ry,Ry,#8    ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
• Requires almost 600 instructions on MIPS
Vector Architectures
With Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add vector to a scalar
• LV/SV: vector load and vector store from address
• Example: DAXPY
      L.D     F0,a      ; load scalar a
      LV      V1,Rx     ; load vector X
      MULVS.D V2,V1,F0  ; vector-scalar multiply
      LV      V3,Ry     ; load vector Y
      ADDVV.D V4,V2,V3  ; add
      SV      Ry,V4     ; store the result
• Requires 6 instructions vs. almost 600 for MIPS
Example SIMD Code (assume 256-bit extension)
• Example DAXPY:
      L.D    F0,a        ;load scalar a
      MOV    F1, F0      ;copy a into F1 for SIMD MUL
      MOV    F2, F0      ;copy a into F2 for SIMD MUL
      MOV    F3, F0      ;copy a into F3 for SIMD MUL
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.4D   F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4,F4,F0    ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
      L.4D   F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D   0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx,Rx,#32   ;increment index to X
      DADDIU Ry,Ry,#32   ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
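For comparison, the scalar and the 4-wide DAXPY loops can be sketched in plain C. This is only an illustration: the function names daxpy_scalar and daxpy_4wide are made up here, and the 4-wide version simply writes out the four lanes by hand, assuming n is a multiple of 4 as in the 64-element example.

```c
#include <stddef.h>

/* Scalar DAXPY: one element per loop iteration, like the MIPS loop. */
void daxpy_scalar(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* 4-wide DAXPY: each iteration handles four consecutive doubles,
 * which is the work one L.4D / MUL.4D / ADD.4D / S.4D group performs.
 * Assumption: n is a multiple of 4. */
void daxpy_4wide(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```

With 64 elements the 4-wide loop runs 16 iterations instead of 64, which is where the reduction in executed instructions comes from.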
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read (word)
Examples
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Perf.)
• Helps understand whether a kernel is compute bound or memory bound
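The roofline formula is small enough to write down directly. The sketch below is a minimal C version; the function name roofline_gflops and the parameter names are chosen here for illustration, and units are assumed to be GB/s, FLOPs/byte, and GFLOP/s.

```c
/* Attainable GFLOP/s under the roofline model:
 * min(peak memory bandwidth x arithmetic intensity, peak FP performance). */
double roofline_gflops(double peak_bw_gbytes_per_s,
                       double flops_per_byte,
                       double peak_gflops)
{
    double memory_bound = peak_bw_gbytes_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a hypothetical machine with 16 GB/s of memory bandwidth and a 100 GFLOP/s peak, a kernel with 0.25 FLOPs/byte is memory bound at 4 GFLOP/s, while a kernel with 10 FLOPs/byte reaches the 100 GFLOP/s compute roof.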
Graphical Processing Units
• Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
    • Energy efficient
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is “Single Instruction Multiple Data” (SIMD)
  – NVIDIA: SIMT (single instruction, multiple threads), a flexible SIMD
Dedicated GPU
• Separate GPU cards with own DRAM
• Separate GPU card but sharing DRAM with CPU– Less cost but less BW
GPU Processors
Integrated GPU Processors
• Integrated GPU processors
  – Less DRAM BW vs. dedicated GPUs
High Level GPU Programming
Source:Wikipedia
Serious bottleneck if this latency is not hidden
Threads and Blocks
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Lower-overhead thread management vs. pthreads (suitable for finer-grain parallelism)
• Execution of threads as SIMD instructions when possible gives the best performance
GPU vs CPU threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little overhead
  – GPU needs 1000s of threads for full efficiency (switch to hide latency)
    • Multi-core CPU needs only a few
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the Host launches Kernel 1 and Kernel 2; on the Device, Kernel 1 runs as Grid 1 and Kernel 2 as Grid 2. Grid 1 contains Blocks (0, 0) through (2, 1); Block (1, 1) is expanded into Threads (0, 0) through (4, 2). Courtesy: NVIDIA]
Block and Thread IDs
• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
  – Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
[Figure: Device running Grid 1 with Blocks (0, 0) through (2, 1); Block (1, 1) is expanded into Threads (0, 0) through (4, 2). Courtesy: NVIDIA]
An example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
    • Many threads to cover long memory latency
      – On a stall, the HW switches quickly between threads
  – Thus grid size = 16 blocks
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
    • Fermi GPUs had 16 multithreaded SIMD processors
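The decomposition above can be emulated on the CPU to see how each thread finds its element. The sketch below is plain C, not CUDA: the two nested loops stand in for the hardware scheduling of blocks and threads, and the names multiply_thread and multiply_grid are invented for this illustration.

```c
#define N 8192
#define THREADS_PER_BLOCK 512
#define NUM_BLOCKS (N / THREADS_PER_BLOCK)  /* 8192 / 512 = 16 blocks */

/* Work of one emulated CUDA thread: derive a global element index from
 * the block ID and the thread ID within the block, then process it. */
static void multiply_thread(int block_idx, int thread_idx,
                            const float *a, const float *b, float *c)
{
    int i = block_idx * THREADS_PER_BLOCK + thread_idx;
    c[i] = a[i] * b[i];
}

/* Emulated grid: on a GPU the blocks and threads run in parallel and in
 * no guaranteed order; here they are simply visited sequentially. */
void multiply_grid(const float *a, const float *b, float *c)
{
    for (int block = 0; block < NUM_BLOCKS; block++)
        for (int thread = 0; thread < THREADS_PER_BLOCK; thread++)
            multiply_thread(block, thread, a, b, c);
}
```

Because every thread writes a distinct element, the result does not depend on the order in which blocks or threads execute.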
What Programmer Expresses in CUDA
[Figure: HOST (CPU) and DEVICE (GPU), each with processors (P) and memory (M); interconnect between devices and memories]
Basics
• Computation partitioning (where does computation occur?)
  – Declarations on functions:
    • __host__: runs on the CPU
    • __global__: kernel, called by the host
    • __device__: code executed by the device, called by a global or another device function
  – Mapping of kernels to the device: compute<<<gs, bs>>>()
    • gs is the grid size, bs the block size (gs * bs = threads of one kernel)
• Data partitioning (where does data reside, who may access it and how?)
  – Declarations on data: __shared__, __device__, __constant__, …
• Data management and orchestration
  – Copying to/from host: e.g., cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost)
• Concurrency management
  – E.g., __syncthreads() (block granularity)
Synchronous vs Asynchronous
• Does a call return immediately, before the called function completes?
  – Yes: asynchronous
  – No: synchronous
Streams
• A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.
• Operations within a stream are guaranteed to execute in the prescribed order
• Operations in different streams can be interleaved and, when possible, they can even run concurrently.
• How to specify stream: pass as argument, use functions that take stream as parameter
• Useful to hide cpu-gpu latency
Minimal Extensions to C + API
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

  __device__ float filter[N];

  __global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
  }

  // Allocate GPU memory
  float *myimage;
  cudaMalloc((void **) &myimage, bytes);

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>> (myimage);
NVCC Compiler’s Role: Partition Code and Compile for Device
[Figure: nvcc partitions mycode.cu into three parts. Host only: int main_data; Main() { }; __host__ hfunc() { int hdata; … }, compiled by the native compiler (gcc, icc, cc). Interface: __global__ gfunc() { int gdata; }. Device only: __shared__ int sdata; __device__ dfunc() { int ddata; }, compiled by the nvcc compiler.]
Simple working code example
• What does it do?
  – Checks the elements of an integer array (each any of 0 to 9)
  – How many times does “6” appear?
  – Array of 16 elements, each thread examines 4 elements, 1 block in grid, 1 grid
[Figure: in_array, 16 single-digit elements]
threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
threadIdx.x = 3 examines in_array elements 3, 7, 11, 15
Known as a cyclic data distribution
CUDA Pseudo-Code
MAIN PROGRAM:
  Initialization
  • Allocate memory on host for input and output
  • Assign random numbers to input array
  Call host function
  Calculate final output from per-thread output
  Print result
HOST FUNCTION:
  Allocate memory on device for copy of input and output
  Copy input to device
  Set up grid/block
  Call global function
  Copy device output to host
GLOBAL FUNCTION:
  Thread scans subset of array elements
  Call device function to compare with “6”
  Compute local result
DEVICE FUNCTION:
  Compare current element and “6”
  Return 1 if same, else 0
Main Program: Preliminaries
  #include
  #define SIZE 16
  #define BLOCKSIZE 4

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    …
  }
Main Program: Invoke Global Function
  #include
  #define SIZE 16
  #define BLOCKSIZE 4
  __host__ void outer_compute (int *in_arr, int *out_arr);

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    /* initialization */ …
    outer_compute(in_array, out_array);
    …
  }
Main Program: Calculate Output & Print Result
  #include
  #define SIZE 16
  #define BLOCKSIZE 4
  __host__ void outer_compute (int *in_arr, int *out_arr);

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    int sum = 0;
    /* initialization */ …
    outer_compute(in_array, out_array);
    for (int i=0; i
Host Function: Preliminaries & Allocation
__host__ void outer_compute (int *h_in_array, int *h_out_array) {
int *d_in_array, *d_out_array;
cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
…}
Host Function: Copy Data To/From Host
  __host__ void outer_compute (int *h_in_array, int *h_out_array) {
    int *d_in_array, *d_out_array;
    cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
    cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice);
    … do computation ...
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  }
Host Function: Setup & Call Global Function
  __host__ void outer_compute (int *h_in_array, int *h_out_array) {
    int *d_in_array, *d_out_array;
    cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
    cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice);
    /* 1 block in the grid, BLOCKSIZE threads in the block */
    compute<<<1, BLOCKSIZE>>> (d_in_array, d_out_array);
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  }
Global Function
  __global__ void compute(int *d_in,int *d_out) {
    d_out[threadIdx.x] = 0;
    for (int i=0; i
Device Function
  __device__ int compare(int a, int b) {
    if (a == b) return 1;
    return 0;
  }
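The whole “count 6” program can be emulated in plain C to check the data distribution. This is a sketch under stated assumptions: the loop body of the global function is cut off in these slides, so the cyclic indexing below (element i*BLOCKSIZE + thread index) is reconstructed from the description “thread 0 examines elements 0, 4, 8, 12”, and the names compute_thread and count_sixes are invented for this illustration.

```c
#define SIZE 16
#define BLOCKSIZE 4

/* Device function from the slides: 1 if the values match, else 0. */
static int compare(int a, int b)
{
    return a == b ? 1 : 0;
}

/* One emulated thread of the global function. Assumed cyclic
 * distribution: thread t examines elements t, t+4, t+8, t+12. */
static void compute_thread(int thread_idx, const int *d_in, int *d_out)
{
    d_out[thread_idx] = 0;
    for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
        int val = d_in[i * BLOCKSIZE + thread_idx];
        d_out[thread_idx] += compare(val, 6);
    }
}

/* Host side: run the BLOCKSIZE emulated threads one after another,
 * then add up the per-thread counts, as the main program does. */
int count_sixes(const int *in_array)
{
    int out_array[BLOCKSIZE];
    int sum = 0;
    for (int t = 0; t < BLOCKSIZE; t++)
        compute_thread(t, in_array, out_array);
    for (int t = 0; t < BLOCKSIZE; t++)
        sum += out_array[t];
    return sum;
}
```

Each thread writes only its own d_out entry, so no synchronization is needed until the final sum on the host.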
Memory Model
Memory Model
• Host copies to/from device (can be slow)
• Device does not support coherence
  – Any caches in device must be disabled if coherence is needed
What if we computed sum on GPU?
• Global, device functions and excerpts from host, main

  __host__ void outer_compute (int *h_in_array, int *h_sum) {
    …
    compute<<<1, BLOCKSIZE>>> (d_in_array, d_sum);
    cudaThreadSynchronize();
    cudaMemcpy(h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
  }

  main(int argc, char **argv) {
    …
    int sum; // an integer
    outer_compute(in_array, &sum);
    printf ("Result = %d\n", sum);
  }

  __device__ int compare(int a, int b) {
    if (a == b) return 1;
    return 0;
  }

  __global__ void compute(int *d_in, int *sum) {
    *sum = 0;
    for (i=0; i
Gathering Results on GPU for “Count 6”
  __global__ void compute(int *d_in, int *d_out) {
    d_out[threadIdx.x] = 0;
    for (i=0; i
Synchronize
• Kernel execution is normally asynchronous: while the GPU device is executing your kernel, the CPU can continue to work on other commands, issue more instructions to the device, etc., instead of waiting
• Host-device memory transfers are synchronous (blocking)
  – Points of synchronization (slow)
Programmer’s View: Memory Spaces
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• Threads in a block share shared memory
• Blocks in a grid share global, constant, and texture memory
[Figure: a Grid with Block (0, 0) and Block (1, 0); each block has its own Shared Memory and Threads (0, 0) and (1, 0), each thread with Registers and Local Memory; Global, Constant, and Texture Memory belong to the grid and are reachable from the Host]
• The host can read/write global, constant, and texture memory
Terminology Review
• Device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

  Memory    Location  Cached                          Access      Who
  Local     Off-chip  No (1.x) / Yes (2.x and later)  Read/write  One thread
  Shared    On-chip   N/A - resident                  Read/write  All threads in a block
  Global    Off-chip  No (1.x) / Yes (2.x and later)  Read/write  All threads + host
  Constant  Off-chip  Yes                             Read        All threads + host
  Texture   Off-chip  Yes                             Read        All threads + host
Reuse and Locality
• Consider how data is accessed
  – Data reuse:
    • Same data used multiple times
    • Intrinsic in computation
  – Data locality:
    • Data is reused and is present in “fast memory”
    • Same data or same data transfer
• If a computation has reuse, what can we do to get locality?
  – Appropriate data placement and layout
  – Code reordering transformations
Access Times
• Register – dedicated HW, single cycle
• Constant and Texture caches – possibly single cycle, proportional to addresses accessed by warp
• Shared Memory – dedicated HW, single cycle if no “bank conflicts”
• Local Memory – DRAM; *slow* if not cached, fast if cached
• Global Memory – DRAM; *slow* if not cached, fast if cached
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory – DRAM, cached
Data Placement: Conceptual
• Copies from host to device go to some part of global memory (possibly constant or texture memory)
• How to use SP shared memory
  – Must be constructed in, or copied from global memory by, the kernel program
• How to use constant or texture cache
  – Read-only “reused” data can be placed in constant & texture memory by host
• Also, how to use registers
  – Most locally-allocated data is placed directly in registers
  – Even array variables can use registers if the compiler understands access patterns
  – Can allocate “superwords” to registers, e.g., float4
  – Excessive use of registers will “spill” data to local memory
• Local memory
  – Deals with capacity limitations of registers and shared memory
  – Eliminates worries about race conditions (per thread)
  – … but SLOW if not in cache
Data Placement: Syntax
• Through type qualifiers
  – __constant__, __shared__, __local__, __device__
• Through cudaMemcpy calls
  – Type of call and symbolic constant designate where to copy
• Implicit default behavior
  – Device memory without qualifier is global memory
  – Host by default copies to global memory
  – Thread-local variables go into registers unless capacity is exceeded, then local memory
Language Extensions: Variable Type Qualifiers

  Declaration                               Memory    Scope   Lifetime
  __device__ __local__ int LocalVar;        local     thread  thread
  __device__ __shared__ int SharedVar;      shared    block   block
  __device__ int GlobalVar;                 global    grid    application
  __device__ __constant__ int ConstantVar;  constant  grid    application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays that reside in local memory
Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  – Allocated in the host and passed to the kernel:
      __global__ void KernelFunc(float* ptr)
  – Obtained as the address of a global variable:
      float* ptr = &GlobalVar;
• How to place data in constant and shared memory
• Constant works fast if all threads in a warp access same address
Constant Memory Example
  – Apply an input signal (a vector) to a set of precomputed transform matrices
  – Compute M1V, M2V, …, MnV
  – Note same constant value across threads

  __constant__ float d_signalVector[M];
  __device__ float R[N][M];

  __host__ void outerApplySignal () {
    float *h_inputSignal;
    dim3 dimGrid(N);
    dim3 dimBlock(M);
    cudaMemcpyToSymbol (d_signalVector, h_inputSignal, M*sizeof(float));
    // assume input matrix is in d_mat
    ApplySignal<<<dimGrid, dimBlock>>>(d_mat, M);
  }

  __global__ void ApplySignal (float * d_mat, int M) {
    float result = 0.0; /* register */
    for (j=0; j
More on Constant Cache
• Example from previous slide
  – All threads in a block access the same element of the signal vector
  – Brought into cache on first access, then latency equivalent to a register access
[Figure: processors P0, P1, …, PM‐1, each with registers (Reg), driven by one Instruction Unit and sharing a Constant Cache; all execute LD signalVector[j]]
Now Let’s Look at Shared Memory
• Common Programming Pattern (5.1.2 of CUDA manual)
  – Load data into shared memory
  – Synchronize (if necessary)
  – Operate on data in shared memory
  – Synchronize (if necessary)
  – Write intermediate results to global memory
  – Repeat until done
[Figure: data moving between global memory and shared memory]
Mechanics of Using Shared Memory
• __shared__ type qualifier required
• Must be allocated from global/device function, or as “extern”
• Examples:

  /* a form of dynamic allocation */
  /* MEMSIZE is size of per‐block */
  /* shared memory */
  __host__ void outerCompute() {
    compute<<<gs, bs, MEMSIZE>>>();
  }
  __global__ void compute() {
    extern __shared__ float d_s_array[];
    d_s_array[i] = …;
  }

  __global__ void compute2() {
    __shared__ float d_s_array[M];
    /* create or copy from global memory */
    d_s_array[j] = …;
    /* write result back to global memory */
    d_g_array[j] = d_s_array[j];
  }
Matrix Transpose (from SDK)
  __global__ void transpose(float *odata, float *idata, int width, int height)
  {
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int index_in = yIndex * width + xIndex;
    block[threadIdx.y][threadIdx.x] = idata[index_in];

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    unsigned int index_out = yIndex * height + xIndex;
    odata[index_out] = block[threadIdx.x][threadIdx.y];
  }
• odata and idata are in global memory
• Rearrange in shared memory and write back efficiently to global memory
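The index arithmetic of the tiled transpose can be checked on the CPU. The sketch below is plain C, not the SDK code: loops over (bx, by) and (tx, ty) stand in for blockIdx and threadIdx, the local array tile plays the role of the shared-memory block, and the end of the first inner loop nest corresponds to __syncthreads(). The function name transpose_tiled and the tile size of 4 are choices made for this illustration, and width and height are assumed to be multiples of BLOCK_DIM.

```c
#define BLOCK_DIM 4

/* idata has `height` rows of `width` elements;
 * odata receives `width` rows of `height` elements. */
void transpose_tiled(float *odata, const float *idata, int width, int height)
{
    /* +1 column of padding, as in the kernel (there it avoids
     * shared-memory bank conflicts; here it is only kept for fidelity). */
    float tile[BLOCK_DIM][BLOCK_DIM + 1];

    for (int by = 0; by < height / BLOCK_DIM; by++) {
        for (int bx = 0; bx < width / BLOCK_DIM; bx++) {
            /* Phase 1: every "thread" reads one element of the tile. */
            for (int ty = 0; ty < BLOCK_DIM; ty++)
                for (int tx = 0; tx < BLOCK_DIM; tx++) {
                    int x = bx * BLOCK_DIM + tx;
                    int y = by * BLOCK_DIM + ty;
                    tile[ty][tx] = idata[y * width + x];
                }
            /* (barrier: all reads finish before any write starts) */
            /* Phase 2: every "thread" writes one transposed element. */
            for (int ty = 0; ty < BLOCK_DIM; ty++)
                for (int tx = 0; tx < BLOCK_DIM; tx++) {
                    int x = by * BLOCK_DIM + tx;
                    int y = bx * BLOCK_DIM + ty;
                    odata[y * height + x] = tile[tx][ty];
                }
        }
    }
}
```

In both phases consecutive tx values touch consecutive addresses of idata and odata, which is the point of staging the tile: the transposition itself happens inside the fast on-chip buffer.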
Overview of Texture Memory
• Recall, texture cache of read-only data
• Optimize for spatial locality
  – Can be fast for many sequential addresses in the same warp
• Special protocol for allocating and copying to GPU
  – texture texRef;
    • Dim: 1, 2 or 3D objects
• Special protocol for accesses
  – tex2D(,dim1,dim2);
• Will not cover in more detail
NVIDIA GPU Execution/Arch
I. SIMD execution of warpsize=M threads (from a single block)
  – Result is a set of instruction streams roughly equal to # threads in a block divided by warpsize
  – Warp SIMD execution
II. Multithreaded execution across different instruction streams within a block
  – Also possibly across different blocks
III. Each block mapped to a single SM (streaming multiprocessor)
  – No direct interaction across SMs
[Figure: Device with Multiprocessors 1 to N and Device memory; each multiprocessor contains an Instruction Unit, Processors 1 to M with their Registers, Shared Memory, a Constant Cache, and a Texture Cache]
CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size 1 to 512 (1024) concurrent threads
  – Block shape 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!
[Figure: CUDA Thread Block with thread IDs 0, 1, 2, 3, …, m running the thread program. Courtesy: John Nickolls, NVIDIA]
Simplified block diagram of a Multithreaded SIMD Processor. •It has 16 SIMD lanes. •The SIMD Thread Scheduler has, say, 48 warps that it schedules with a table of 48 PCs.
Streaming Multiprocessor (SM)
NVIDIA GPU Memory Structures
• Each thread has a private section of off-chip DRAM
  – “Private memory”
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory
Device Memory Hierarchy
[Figure: Device with Multiprocessors 1 to N and Device memory; each multiprocessor contains an Instruction Unit, Processors 1 to M with their Registers, Shared Memory, a Constant Cache, and a Texture Cache]
Optimizing the Memory Hierarchy on GPUs, Overview
• Device memory access times are non-uniform, so data placement significantly affects performance.
  – But controlling data placement may require additional copying, so consider overhead.
• Optimizations to increase memory bandwidth. Idea: maximize utility of each memory access.
Hardware Implementation: Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory (DRAM)
• Each multiprocessor has:
  – A set of 32-bit registers per processor
  – On-chip shared memory
    • Where the shared memory space resides
  – A read-only constant cache
    • To speed up access to the constant memory space
  – A read-only texture cache
    • To speed up access to the texture memory space
[Figure: Device with Multiprocessors 1 to N; each contains an Instruction Unit, Processors 1 to M with Registers, Shared Memory, a Constant Cache, and a Texture Cache; Device memory holds the global, constant, and texture memories]
Terminology
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor (another scheduler):
  – 16 (Tesla) to 32 (Fermi) SIMD lanes
  – Wide and shallow compared to vector processors
• Warp: a collection of threads in a block
• Thread scheduler uses a scoreboard to dispatch warps
  – No data dependencies between threads of a warp!
  – But control flow can diverge
  – Best performance when there is no divergence (all threads follow the same control flow)
  – Keeps track of up to 48 warps
    • Hides memory latency
Example SIMD Execution
“Count 6” kernel function
  d_out[threadIdx.x] = 0;
  for (int i=0; i
Multithreading: Motivation
• Each arithmetic instruction includes the following sequence:

  Activity       Cost                      Note
  Load operands  As much as O(100) cycles  Depends on location
  Compute        O(1) cycles               Accesses registers
  Store result   As much as O(100) cycles  Depends on location

• Memory latency, the time in cycles to access memory, limits utilization of compute engines
Thread-Level Parallelism
• Motivation:
  – A single thread leaves a processor under-utilized most of the time
• Strategies for thread-level parallelism:
  – Multi-Threading: multiple threads share the same large processor; reduces under-utilization, efficient resource allocation
  – Multi-Processing: each thread executes on its own mini processor; simple design, low interference between threads
How It All Comes Together
G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors in block granularity
  – Up to 8 blocks to each SM as resource allows
  – SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with an MT IU, SPs, and Shared Memory, each running blocks of threads t0 t1 t2 … tm]
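The two G80 limits quoted above (at most 8 blocks and at most 768 threads per SM) give a quick estimate of how many blocks fit on one SM. The C sketch below is illustrative only: the function names are invented here, the warp size of 32 is an assumption not stated on this slide, and register and shared-memory limits ("as resource allows") are deliberately ignored.

```c
#define G80_MAX_THREADS_PER_SM 768
#define G80_MAX_BLOCKS_PER_SM  8
#define WARP_SIZE              32   /* assumption: NVIDIA warp size */

/* Blocks that can be resident on one G80 SM, considering only the
 * thread-count and block-count limits from the slide. */
int g80_blocks_per_sm(int threads_per_block)
{
    int by_threads = G80_MAX_THREADS_PER_SM / threads_per_block;
    return by_threads < G80_MAX_BLOCKS_PER_SM ? by_threads
                                              : G80_MAX_BLOCKS_PER_SM;
}

/* Warps (SIMD instruction streams) in one block, rounded up. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

With 256 threads per block this gives 3 blocks, with 128 it gives 6, matching the slide; with 64 threads per block the 8-block limit applies first, so only 512 of the 768 thread slots are used.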
Details of Mapping
• If #blocks in a grid exceeds the number of SMs,
  – multiple blocks are mapped to an SM
  – they are treated independently
  – this provides more warps to the scheduler, so it is good as long as resources are not exceeded
  – possibly context switching overhead when scheduling between blocks (registers and shared memory)
• Thread Synchronization
  – Within a block, threads observe the SIMD model, and synchronize using __syncthreads()
  – Across blocks, interaction through global memory
Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors
• Each block can execute in any order relative to other blocks.
[Figure: a kernel grid of Blocks 0 to 7 run on two devices over time; the smaller device executes two blocks at a time (Blocks 0-1, then 2-3, 4-5, 6-7), the larger device four at a time (Blocks 0-3, then 4-7)]
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers
– Mask registers
– Large register files
• Differences:
– No scalar processor
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
– SIMD+SPMD
NVIDIA Instruction Set Arch.
• ISA is an abstraction of the hardware instruction set
– “Parallel Thread Execution (PTX)”
– Uses virtual registers
– Translation to machine code is performed in software
– Example:
shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])
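What one CUDA thread does in the PTX example can be modelled on the host in plain C. This is a sketch, not device code: the function names are hypothetical, the block size of 512 is taken from the PTX comment, and two nested host loops stand in for the grid of threads.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_SIZE_LOG2 = 9 };  /* 512 threads per block, as in the PTX comment */

/* i = blockIdx * 512 + threadIdx, using the same shift as shl.s32. */
static size_t global_index(unsigned block_idx, unsigned thread_idx)
{
    assert(thread_idx < (1u << BLOCK_SIZE_LOG2));
    return ((size_t)block_idx << BLOCK_SIZE_LOG2) + thread_idx;
}

/* Work of one CUDA thread: Y[i] = a * X[i] + Y[i]. */
static void daxpy_thread(unsigned block_idx, unsigned thread_idx,
                         double a, const double *x, double *y)
{
    size_t i = global_index(block_idx, thread_idx);
    y[i] = a * x[i] + y[i];
}

/* Host loops standing in for the grid: every (block, thread) pair runs once.
 * x and y must hold at least num_blocks * 512 elements. */
static void daxpy_grid(unsigned num_blocks, double a,
                       const double *x, double *y)
{
    for (unsigned b = 0; b < num_blocks; ++b)
        for (unsigned t = 0; t < (1u << BLOCK_SIZE_LOG2); ++t)
            daxpy_thread(b, t, a, x, y);
}
```

Because each thread touches only its own element i, the order in which the host loops visit the threads does not affect the result.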
How code is compiled
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• I.e. which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
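SIMD Execution of Control Flow
Control flow example
if (threadIdx >= 2) {
    out[threadIdx] += 100;
} else {
    out[threadIdx] += 10;
}
[Figure: one Instruction Unit drives processors P0, P1, …, PM-1, each with its own registers (Reg), all connected to Memory. Instruction being issued: compare threadIdx,2]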
SIMD Execution of Control Flow
Control flow example
if (threadIdx.x >= 2) {
    out[threadIdx.x] += 100;
} else {
    out[threadIdx.x] += 10;
}
[Figure: one Instruction Unit drives processors P0, P1, …, PM-1, each with its own registers (Reg), all connected to Memory.]
/* Condition code cc = true branch set by predicate execution */
(CC) LD R5, &(out+threadIdx.x)
(CC) ADD R5, R5, 100
(CC) ST R5, &(out+threadIdx.x)
Lanes committing results: X X ✔ ✔
SIMD Execution of Control Flow
Control flow example
if (threadIdx >= 2) {
    out[threadIdx] += 100;
} else {
    out[threadIdx] += 10;
}
[Figure: one Instruction Unit drives processors P0, P1, …, PM-1, each with its own registers (Reg), all connected to Memory.]
/* possibly predicated using CC */
(not CC) LD R5, &(out+threadIdx)
(not CC) ADD R5, R5, 10
(not CC) ST R5, &(out+threadIdx)
Lanes committing results: ✔ ✔ X X
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else X[i] = Z[i];

ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
 ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: , *Pop ; pop to restore old mask
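The effect of the *Push, *Comp and *Pop markers on the lane mask can be sketched with a host-side C model of one 32-lane warp. This is a simplification under stated assumptions: `warp_if_else` and the stack depth are hypothetical, and the model always walks both paths, whereas the `bra` instructions let the hardware skip a path when no lane needs it.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 32, STACK_DEPTH = 8 };

/* Model of:  if (X[i] != 0) X[i] = X[i] - Y[i]; else X[i] = Z[i];
 * for one warp. Returns the active mask after the final pop. */
static uint32_t warp_if_else(double *x, const double *y, const double *z)
{
    uint32_t stack[STACK_DEPTH];
    int top = 0;
    uint32_t active = 0xFFFFFFFFu;      /* all lanes enabled on entry */

    /* setp.neq: one predicate bit per lane. */
    uint32_t p1 = 0;
    for (int lane = 0; lane < LANES; ++lane)
        if (x[lane] != 0.0)
            p1 |= 1u << lane;

    /* *Push: save the old mask, then enable only lanes where P1 is true. */
    assert(top < STACK_DEPTH);
    stack[top++] = active;
    active &= p1;
    for (int lane = 0; lane < LANES; ++lane)
        if ((active >> lane) & 1u)
            x[lane] = x[lane] - y[lane];        /* THEN path */

    /* *Comp: complement the mask bits, restricted to the saved mask. */
    active = stack[top - 1] & ~active;
    for (int lane = 0; lane < LANES; ++lane)
        if ((active >> lane) & 1u)
            x[lane] = z[lane];                  /* ELSE path */

    /* *Pop: restore the mask that was active before the branch. */
    active = stack[--top];
    return active;
}
```

Every lane's result is written by exactly one of the two passes, which is what the mask guarantees.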
Terminology
• Divergent paths
– Different threads within a warp take different control flow paths within a kernel function
– N divergent paths in a warp?
• An N-way divergent warp is serially issued over the N different paths using a hardware stack and per-thread predication logic to only write back results from the threads taking each divergent path.
• Performance decreases by about a factor of N
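The "factor of N" can be illustrated by counting how many passes a warp needs when its lanes take different paths. The C sketch below is a host-side model only: `serialized_passes` is a hypothetical name, paths are identified by arbitrary integer labels, and each pass is assumed to cost the same.

```c
#include <assert.h>
#include <stdint.h>

enum { WARP_SIZE = 32 };

/* path_of_lane[i] is a label for the control-flow path lane i takes.
 * The warp is issued once per distinct path, with only the lanes on
 * that path enabled; the return value is the number of passes. */
static int serialized_passes(const int path_of_lane[WARP_SIZE])
{
    uint32_t pending = 0xFFFFFFFFu;     /* lanes that have not run yet */
    int passes = 0;
    while (pending != 0) {
        /* Take the path of the lowest pending lane. */
        int lead = 0;
        while (((pending >> lead) & 1u) == 0)
            ++lead;
        /* Run (and retire) every pending lane that is on the same path. */
        for (int lane = lead; lane < WARP_SIZE; ++lane)
            if (((pending >> lane) & 1u) &&
                path_of_lane[lane] == path_of_lane[lead])
                pending &= ~(1u << lane);
        ++passes;
    }
    return passes;
}
```

A warp whose lanes all agree needs one pass; the `threadIdx >= 2` example needs two; a warp split by `threadIdx % 4` needs four.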
Recall: Serialized Gathering of Results on GPU for “Count 6”
__global__ void compute(int *d_in, int *d_out) {
d_out[threadIdx.x] = 0;
for (i=0; i
Tree-Structured Computation
out[0] += out[2]
out[0] += out[1] out[2] += out[3]
out[0] out[1] out[2] out[3]
Tree-structured results-gathering phase, where independent threads collect their results in parallel.
Assume SIZE=16 and BLOCKSIZE(elements computed per thread)=4.
A possible implementation for just the reduction

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        d_out[t] += d_out[t+stride];
}
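Vector Reduction with Branch Divergence
[Figure: tree reduction over array elements 0 to 11 (and beyond), performed by the even-numbered threads (Thread 0, 2, 4, 6, 8, 10). Iteration 1 forms the pair sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7, 8..15.]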
Some Observations
• In each iteration, two control flow paths will be sequentially traversed for each warp
– Threads that perform addition and threads that do not
– Threads that do not perform addition may cost extra cycles depending on the implementation of divergence
• No more than half of threads will be executing at any time
– All odd index threads are disabled right from the beginning!
– On average, less than ¼ of the threads will be activated for all warps over time.
– After the 5th iteration, entire warps in each block will be disabled, poor resource utilization but no divergence.
• This can go on for a while, up to 4 more iterations (512/32=16=2^4), where each iteration only has one thread activated until all warps retire
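These counts can be checked with a small host-side C model of the `t % (2*stride) == 0` test. It is a sketch with hypothetical function names, assuming a 512-thread block and 32-thread warps as in the observations above.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Threads in the block that perform the addition at this stride. */
static unsigned active_threads(unsigned block_dim, unsigned stride)
{
    unsigned n = 0;
    for (unsigned t = 0; t < block_dim; ++t)
        if (t % (2 * stride) == 0)
            ++n;
    return n;
}

/* Warps that still contain at least one adding thread at this stride. */
static unsigned warps_with_work(unsigned block_dim, unsigned stride)
{
    unsigned n = 0;
    for (unsigned w = 0; w < block_dim / WARP_SIZE; ++w) {
        unsigned any = 0;
        for (unsigned t = w * WARP_SIZE; t < (w + 1) * WARP_SIZE; ++t)
            if (t % (2 * stride) == 0)
                any = 1;
        n += any;
    }
    return n;
}

/* Sum of active threads over all iterations of the reduction loop. */
static unsigned total_active(unsigned block_dim)
{
    unsigned total = 0;
    for (unsigned stride = 1; stride < block_dim; stride *= 2)
        total += active_threads(block_dim, stride);
    return total;
}
```

For a 512-thread block the loop runs 9 iterations and activates 511 thread slots out of 9 * 512 = 4608, roughly 11%, which is consistent with the "less than ¼" estimate. All 16 warps still hold an adding thread through stride 16 (the 5th iteration); from stride 32 onward the number of warps with work halves each iteration (8, 4, 2, 1).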
What’s Wrong?

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        d_out[t] += d_out[t+stride];
}
BAD: Divergence due to interleaved branch decisions
A better implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        d_out[t] += d_out[t+stride];
}
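No Divergence until < 16 sub-sums
[Figure: reduction with contiguous active threads over array elements 0, 1, 2, 3, …, 19, …. In iteration 1, Thread 0 computes 0+16 and the last active thread computes 15+31; later iterations (labelled 3 and 4) work on the shrinking front of the array.]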
A shared memory implementation
• Assume we have already loaded the array into
__shared__ float partialSum[];

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
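The `t < stride` reduction can be checked on the host with a plain C model. This is a sketch rather than device code: `reduce_in_place` is a hypothetical name, the inner loop over t plays the role of the threads in a block, and finishing that loop before the next stride plays the role of `__syncthreads()`. It assumes one array element per thread and a power-of-two block size.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Reduces partial_sum[0..block_dim-1] into partial_sum[0].
 * Returns the number of iterations in which some warp has both adding
 * and idle threads. Threads 0..stride-1 are the active ones, so this
 * happens only when stride is not a multiple of the warp size. */
static unsigned reduce_in_place(float *partial_sum, unsigned block_dim)
{
    assert(block_dim != 0 && (block_dim & (block_dim - 1)) == 0);
    unsigned divergent_iterations = 0;
    for (unsigned stride = block_dim >> 1; stride >= 1; stride >>= 1) {
        /* Thread t reads elements t and t+stride and writes element t;
         * no thread writes an element another thread reads in the same
         * iteration, so a sequential loop gives the same result. */
        for (unsigned t = 0; t < stride; ++t)
            partial_sum[t] += partial_sum[t + stride];
        if (stride % WARP_SIZE != 0)
            ++divergent_iterations;
    }
    return divergent_iterations;
}
```

For a 512-element block of ones, the model leaves 512 in element 0 and reports mixed warps only in the last five iterations, where stride is below the warp size.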
Stream Management
http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc
Blocking (synchronous) functions are shown in red on the slide; in these examples they are the cudaMemcpy calls.
Ex.1
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
Ex.2
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
myCpuFunction(b);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
Create and Destroy Stream
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
result = cudaStreamDestroy(stream1);
Transfer data to stream (asynchronous: overlap transfer with compute):
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
Issue kernel to a stream
increment<<<1,N,0,stream1>>>(d_a);
Example
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
nStreams concurrent execution
Note: all commands issued for a stream execute in order