ΕΠΛ372 Parallel Processing


  • ΕΠΛ372 Parallel Processing

    Parallel Architectures: SIMD, GPU

    Yiannos Sazeides, Spring Semester 2014

    READING
    1. Paper on GPUs: Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Volume 28, Issue 2, March-April 2008, Pages 39-55.
    2. HW#2

    Slides based on Elsevier material for H&P 5th edition, Copyright © 2012, Elsevier Inc. All rights reserved.
    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

  • SIMD Extensions

    • Media applications operate on data types narrower than the native word size
      – Example: disconnect carry chains to "partition" the adder (see the intrinsics sketch below)

    • Limitations, compared to vector instructions:
      – Number of data operands encoded into the opcode
      – No sophisticated addressing modes (strided, scatter-gather)
      – No mask registers
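    To make the "partitioned adder" idea concrete, the sketch below uses x86 SSE2 intrinsics (an illustration, not part of the slides): one 128-bit register holds eight 16-bit integers, and a single instruction performs all eight additions with the carry chain cut at the 16-bit lane boundaries.

          #include <emmintrin.h>   /* SSE2 intrinsics */
          #include <stdint.h>

          /* Add two arrays of eight 16-bit integers with a single SIMD add:
             the 128-bit adder is "partitioned" into eight 16-bit lanes. */
          void add8x16(const int16_t *a, const int16_t *b, int16_t *c)
          {
              __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 8 x 16-bit values */
              __m128i vb = _mm_loadu_si128((const __m128i *)b);
              __m128i vc = _mm_add_epi16(va, vb);                /* 8 additions in one op */
              _mm_storeu_si128((__m128i *)c, vc);                /* store 8 results */
          }

    Note, as the limitations above say, that the operand count (eight 16-bit lanes) is fixed by the instruction itself, and there are no strided or scatter-gather accesses.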


  • SIMD Implementations

    • Implementations:
      – Intel MMX (1996)
        • Eight 8-bit integer ops or four 16-bit integer ops
      – Streaming SIMD Extensions (SSE) (1999)
        • Eight 16-bit integer ops
        • Four 32-bit integer/fp ops or two 64-bit integer/fp ops
      – Advanced Vector Extensions (2010)
        • Four 64-bit integer/fp ops
      – Operands must be in consecutive and aligned memory locations


  • DAXPY over 64 elements with non-vector/SIMD instructions

          L.D     F0,a          ;load scalar a
          DADDIU  R4,Rx,#512    ;last address to load
    Loop: L.D     F4,0(Rx)      ;load X[i]
          MUL.D   F4,F4,F0      ;a×X[i]
          L.D     F8,0(Ry)      ;load Y[i]
          ADD.D   F8,F8,F4      ;a×X[i]+Y[i]
          S.D     0(Ry),F8      ;store into Y[i]
          DADDIU  Rx,Rx,#8      ;increment index to X
          DADDIU  Ry,Ry,#8      ;increment index to Y
          DSUBU   R20,R4,Rx     ;compute bound
          BNEZ    R20,Loop      ;check if done

    • Requires almost 600 instructions for MIPS
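    For reference, the loop above is the scalar form of DAXPY; in C it is simply the following (a minimal sketch, with the length of 64 doubles and the names a, x, y taken from the slide):

          /* DAXPY: Y = a*X + Y over 64 double-precision elements,
             one element per iteration, as in the scalar MIPS loop above. */
          void daxpy(double a, const double *x, double *y)
          {
              for (int i = 0; i < 64; i++)
                  y[i] = a * x[i] + y[i];
          }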


  • DAXPY with Vector Instructions

    • ADDVV.D: add two vectors
    • ADDVS.D: add vector to a scalar
    • LV/SV: vector load and vector store from address

    • Example: DAXPY
          L.D      F0,a       ; load scalar a
          LV       V1,Rx      ; load vector X
          MULVS.D  V2,V1,F0   ; vector-scalar multiply
          LV       V3,Ry      ; load vector Y
          ADDVV.D  V4,V2,V3   ; add
          SV       Ry,V4      ; store the result

    • Requires 6 instructions vs. almost 600 for MIPS


  • Example SIMD Code (assume a 256-bit extension)

    • Example DAXPY:
          L.D     F0,a          ;load scalar a
          MOV     F1,F0         ;copy a into F1 for SIMD MUL
          MOV     F2,F0         ;copy a into F2 for SIMD MUL
          MOV     F3,F0         ;copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512    ;last address to load
    Loop: L.4D    F4,0(Rx)      ;load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D    F8,0(Ry)      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D    0(Ry),F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32     ;increment index to X
          DADDIU  Ry,Ry,#32     ;increment index to Y
          DSUBU   R20,R4,Rx     ;compute bound
          BNEZ    R20,Loop      ;check if done
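    The same four-elements-per-iteration scheme can be written today with 256-bit AVX intrinsics on x86 (an illustrative sketch, not from the slides; n is assumed to be a multiple of 4):

          #include <immintrin.h>   /* AVX intrinsics */

          /* DAXPY with 256-bit SIMD: four doubles per iteration,
             mirroring the L.4D / MUL.4D / ADD.4D / S.4D loop above. */
          void daxpy_avx(double a, const double *x, double *y, int n)
          {
              __m256d va = _mm256_set1_pd(a);             /* broadcast a into all 4 lanes
                                                             (the MOV F1..F3 copies above) */
              for (int i = 0; i < n; i += 4) {
                  __m256d vx = _mm256_loadu_pd(&x[i]);    /* load X[i..i+3] */
                  __m256d vy = _mm256_loadu_pd(&y[i]);    /* load Y[i..i+3] */
                  __m256d vr = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
                  _mm256_storeu_pd(&y[i], vr);            /* store Y[i..i+3] */
              }
          }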


  • Roofline Performance Model

    • Basic idea:
      – Plot peak floating-point throughput as a function of arithmetic intensity
      – Ties together floating-point performance and memory performance for a target machine

    • Arithmetic intensity
      – Floating-point operations per byte (word) of memory read


  • Examples

    • Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
    • Helps understand whether a kernel is compute bound or memory bound (see the worked example below)
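    As a worked example of the formula (the machine numbers are illustrative assumptions, not from the slides):

          #include <math.h>

          /* Roofline: attainable GFLOP/s = min(peak memory BW x arithmetic intensity,
                                                peak floating-point performance). */
          double attainable_gflops(double peak_bw_gb_s, double peak_fp_gflops,
                                   double flops_per_byte)
          {
              return fmin(peak_bw_gb_s * flops_per_byte, peak_fp_gflops);
          }

          /* Assumed machine: 16 GB/s memory BW, 100 GFLOP/s peak FP.
             At 2 FLOPs/byte: min(16*2, 100) = 32  GFLOP/s  -> memory bound.
             At 8 FLOPs/byte: min(16*8, 100) = 100 GFLOP/s  -> compute bound. */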


  • Graphical Processing Units

    • Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

    • Basic idea:
      – Heterogeneous execution model
        • CPU is the host, GPU is the device
        • Energy efficient
      – Develop a C-like programming language for the GPU
      – Unify all forms of GPU parallelism as the CUDA thread
      – Programming model is "Single Instruction Multiple Data" (SIMD)
      – NVIDIA: SIMT – single instruction, multiple threads; a flexible form of SIMD


  • Dedicated GPU

    • Separate GPU card with its own DRAM
    • Separate GPU card but sharing DRAM with the CPU – lower cost but less bandwidth

  • GPU Processors

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

  • Integrated GPU Processors

    • Less DRAM bandwidth than dedicated GPUs

  • High Level GPU Programming

    [Figure omitted; source: Wikipedia]

    • Serious bottleneck if this latency is not hidden

  • Threads and Blocks

    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
    • GPU hardware handles thread management, not applications or the OS
      – Lower-overhead thread management vs. pthreads (suitable for finer-grain parallelism)
    • Execution of threads as SIMD instructions, when possible, gives the best performance (see the kernel sketch below)
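    As a sketch of this model (standard CUDA, though this particular kernel is not from the slides), each thread computes one element of DAXPY and the host launches enough blocks of threads to cover the whole array:

          // One CUDA thread per data element: y[i] = a*x[i] + y[i].
          __global__ void daxpy(int n, double a, const double *x, double *y)
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
              if (i < n)                                      // guard: n may not fill the last block
                  y[i] = a * x[i] + y[i];
          }

          // Host side: 256 threads per block, enough blocks to cover n elements.
          // daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);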


  • GPU vs CPU Threads

    • Differences between GPU and CPU threads
      – GPU threads are extremely lightweight
        • Very little overhead
      – A GPU needs 1000s of threads for full efficiency (switching among them to hide latency)
        • A multi-core CPU needs only a few

  • Thread Batching: Grids and Blocks

    • A kernel is executed as a grid of thread blocks
      – All threads share the data memory space
    • A thread block is a batch of threads that can cooperate with each other by:
      – Synchronizing their execution
        • For hazard-free shared memory accesses
      – Efficiently sharing data through a low-latency shared memory (see the sketch below)
    • Two threads from two different blocks cannot cooperate
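    A minimal sketch of such intra-block cooperation (illustrative, not from the slides; assumes the kernel is launched with 256 threads per block): the threads of one block stage data in low-latency shared memory and synchronize before reading it back.

          // Threads of one block cooperate through shared memory;
          // __syncthreads() makes all writes visible (hazard-free) before any reads.
          __global__ void reverse_block(int *data)
          {
              __shared__ int tile[256];                  // per-block shared memory
              int base = blockIdx.x * blockDim.x;
              int t    = threadIdx.x;
              tile[t] = data[base + t];                  // each thread writes one element
              __syncthreads();                           // barrier within the block
              data[base + t] = tile[blockDim.x - 1 - t]; // read an element another thread wrote
          }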

    [Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid consists of thread blocks, Block (0,0) … Block (2,1), and each block consists of threads, Thread (0,0) … Thread (4,2). Courtesy: NVIDIA]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • Block and Thread IDs

    • Threads and blocks have IDs
      – So each thread can decide what data to work on
      – Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
      – Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
    • Simplifies memory addressing when processing multidimensional data (see the indexing sketch below)
      – Image processing
      – Solving PDEs on volumes
      – …
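    For example (standard CUDA indexing, shown as a sketch), block and thread IDs combine to give each thread its own pixel of a 2D image:

          // 2D indexing: each thread derives its pixel coordinates from its IDs.
          __global__ void scale_image(float *img, int width, int height, float s)
          {
              int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
              int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
              if (x < width && y < height)
                  img[y * width + x] *= s;
          }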

    [Figure: a grid of thread blocks, Block (0,0) … Block (2,1), with one block expanded into its threads, Thread (0,0) … Thread (4,2). Courtesy: NVIDIA]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • An Example

    • Multiply two vectors of length 8192
      – Code that works over all elements is the grid
      – Thread blocks break this down into manageable sizes
        • 512 threads per block
        • Many threads to cover the long memory latency
          – On a stall, the hardware switches quickly between threads
      – Thus grid size = 8192 / 512 = 16 blocks
      – Each block is assigned to a multithreaded SIMD processor by the thread block scheduler (see the launch sketch below)
        • Fermi GPUs had 16 multithreaded SIMD processors
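    In CUDA terms (a sketch consistent with the numbers above; the kernel and variable names are assumed), the element-wise multiply of two 8192-element vectors is launched as a grid of 16 blocks of 512 threads:

          // One thread per element; 8192 = 16 blocks x 512 threads, so no guard is needed.
          __global__ void vec_mul(const double *a, const double *b, double *c)
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              c[i] = a[i] * b[i];
          }

          // Grid of 16 thread blocks, 512 threads each; each block is assigned to one
          // multithreaded SIMD processor by the thread block scheduler.
          // vec_mul<<<16, 512>>>(d_a, d_b, d_c);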


  • What Programmer Expresses in CUDA

    [Figure: the HOST (CPU) and the DEVICE (GPU), each with processors (P) and memories (M), joined by an interconnect between devices and memories]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • Basics

    • Computation partitioning (where does computation occur?)
      – Declarations on functions:
        • __host__   : runs on the CPU
        • __global__ : kernel, called by the host
        • __device__ : code executed on the device, called by a global or another device function
      – Mapping of kernels to the device:
        • compute<<<gs, bs>>>() – gs = grid size (blocks), bs = block size (threads per block); gs × bs = threads of one kernel launch
    • Data partitioning (where does data reside, who may access it, and how?)
      – Declarations on data: __shared__, __device__, __constant__, …
    • Data management and orchestration
      – Copying to/from the host, e.g. with cudaMemcpy (see the sketch below)
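    Putting these pieces together, a minimal sketch of the host/device split (names, sizes, and error handling are assumptions for illustration, not from the slides):

          #include <cuda_runtime.h>

          __device__ float square(float v) { return v * v; }  // callable only from device code

          __global__ void squares(float *d, int n)             // kernel: launched by the host
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) d[i] = square(d[i]);
          }

          void run(float *h, int n)                            // __host__ (implicit): runs on the CPU
          {
              float *d;
              cudaMalloc((void **)&d, n * sizeof(float));
              cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
              int bs = 256, gs = (n + bs - 1) / bs;            // gs * bs threads cover n elements
              squares<<<gs, bs>>>(d, n);                       // kernel<<<grid size, block size>>>()
              cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
              cudaFree(d);
          }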