
Transcript
  • ΕΠΛ372 Parallel Processing

    Parallel Architectures: SIMD, GPU

    Γιάννος Σαζεϊδης, Spring Semester 2014

    READING
    1. Paper on GPUs: Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Volume 28, Issue 2, March-April 2008, pages 39-55.

    2. HW#2. Slides based on Elsevier material for H&P 5th edition. Copyright © 2012, Elsevier Inc. All rights reserved.

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

  • SIMD Extensions
    • Media applications operate on data types narrower than the native word size
      – Example: disconnect carry chains to "partition" the adder
    • Limitations, compared to vector instructions:
      – Number of data operands encoded into opcode
      – No sophisticated addressing modes (strided, scatter-gather)
      – No mask registers

    SIMD Instruction Set Extensions for Multimedia

  • SIMD Implementations
    • Implementations:
      – Intel MMX (1996)
        • Eight 8-bit integer ops or four 16-bit integer ops
      – Streaming SIMD Extensions (SSE) (1999)
        • Eight 16-bit integer ops
        • Four 32-bit integer/fp ops or two 64-bit integer/fp ops
      – Advanced Vector Extensions (2010)
        • Four 64-bit integer/fp ops
      – Operands must be consecutive and aligned memory locations
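    As an illustration of how these packed operations look to the programmer, here is a minimal C sketch using SSE intrinsics; the intrinsics (_mm_load_ps, _mm_add_ps, _mm_store_ps) are standard SSE, but this example is not taken from the slides.

        #include <xmmintrin.h>   /* SSE intrinsics */

        /* Add two groups of 4 packed single-precision floats.
           Pointers must be 16-byte aligned, matching the
           "consecutive and aligned" restriction above. */
        void add4(const float *a, const float *b, float *c) {
            __m128 va = _mm_load_ps(a);      /* load 4 floats */
            __m128 vb = _mm_load_ps(b);
            __m128 vc = _mm_add_ps(va, vb);  /* 4 adds in one instruction */
            _mm_store_ps(c, vc);
        }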

    SIMD Instruction Set Extensions for Multimedia

  • DAXPY (64 elements) with non-vector/SIMD instructions

          L.D     F0,a        ;load scalar a
          DADDIU  R4,Rx,#512  ;last address to load
    Loop: L.D     F4,0[Rx]    ;load X[i]
          MUL.D   F4,F4,F0    ;a×X[i]
          L.D     F8,0[Ry]    ;load Y[i]
          ADD.D   F8,F8,F4    ;a×X[i]+Y[i]
          S.D     0[Ry],F8    ;store into Y[i]
          DADDIU  Rx,Rx,#8    ;increment index to X
          DADDIU  Ry,Ry,#8    ;increment index to Y
          DSUBU   R20,R4,Rx   ;compute bound
          BNEZ    R20,Loop    ;check if done

    • Requires almost 600 instructions for MIPS
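    For reference, the computation being encoded is the standard DAXPY loop; a minimal C sketch (not from the slides):

        /* DAXPY: Y = a*X + Y over 64 double-precision elements */
        void daxpy64(double a, const double *X, double *Y) {
            for (int i = 0; i < 64; i++)
                Y[i] = a * X[i] + Y[i];
        }

    The MIPS loop body above has 9 instructions, so 64 iterations plus the setup give roughly 578 instructions, hence "almost 600."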

    Vector Architectures

  • With Vector Instructions
    • ADDVV.D: add two vectors
    • ADDVS.D: add vector to a scalar
    • LV/SV: vector load and vector store from address
    • Example: DAXPY

          L.D     F0,a      ; load scalar a
          LV      V1,Rx     ; load vector X
          MULVS.D V2,V1,F0  ; vector-scalar multiply
          LV      V3,Ry     ; load vector Y
          ADDVV   V4,V2,V3  ; add
          SV      Ry,V4     ; store the result

    • Requires 6 instructions vs. almost 600 for MIPS

    Vector Architectures

  • Example SIMD Code (assume a 256-bit extension)
    • Example DAXPY:

          L.D     F0,a        ;load scalar a
          MOV     F1,F0       ;copy a into F1 for SIMD MUL
          MOV     F2,F0       ;copy a into F2 for SIMD MUL
          MOV     F3,F0       ;copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512  ;last address to load
    Loop: L.4D    F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D    F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D    0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32   ;increment index to X
          DADDIU  Ry,Ry,#32   ;increment index to Y
          DSUBU   R20,R4,Rx   ;compute bound
          BNEZ    R20,Loop    ;check if done

    SIMD Instruction Set Extensions for Multimedia

  • Roofline Performance Model
    • Basic idea:
      – Plot peak floating-point throughput as a function of arithmetic intensity
      – Ties together floating-point performance and memory performance for a target machine
    • Arithmetic intensity
      – Floating-point operations per byte (word) read

    SIMD Instruction Set Extensions for Multimedia

  • Examples
    • Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Perf.)
    • Helps understand whether a kernel is compute bound or memory bound
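    A quick numeric sketch of the roofline formula above; the machine numbers are made up for illustration and are not from the slides.

        #include <stdio.h>

        /* Attainable GFLOP/s = min(peak memory BW x arithmetic intensity, peak FP perf.) */
        double attainable_gflops(double peak_bw_gbs, double intensity_flops_per_byte,
                                 double peak_fp_gflops) {
            double bw_bound = peak_bw_gbs * intensity_flops_per_byte;
            return bw_bound < peak_fp_gflops ? bw_bound : peak_fp_gflops;
        }

        int main(void) {
            /* hypothetical machine: 150 GB/s memory BW, 1000 GFLOP/s peak */
            printf("AI = 0.5: %.0f GFLOP/s (memory bound)\n",  attainable_gflops(150, 0.5, 1000));
            printf("AI = 16 : %.0f GFLOP/s (compute bound)\n", attainable_gflops(150, 16,  1000));
            return 0;
        }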

    SIMD Instruction Set Extensions for Multimedia

  • Graphical Processing Units
    • Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
    • Basic idea:
      – Heterogeneous execution model
        • CPU is the host, GPU is the device
        • Energy efficient
      – Develop a C-like programming language for the GPU
      – Unify all forms of GPU parallelism as the CUDA thread
      – Programming model is "Single Instruction Multiple Data" (SIMD); NVIDIA calls it SIMT (single instruction, multiple threads), a flexible form of SIMD

    Graphical Processing Units

  • Dedicated GPU
    • Separate GPU cards with their own DRAM
    • Separate GPU card but sharing DRAM with the CPU
      – Less cost but less BW


    GPU Processors

  • Integrated GPU Processors
      – Less DRAM BW vs. dedicated GPUs

  • High Level GPU Programming

    [Figure; Source: Wikipedia]

    Serious bottleneck if this latency is not hidden

  • Threads and Blocks
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
    • GPU hardware handles thread management, not applications or the OS
      – Lower overhead thread management vs. pthreads (suitable for finer-grain parallelism)
    • Execution of threads as SIMD instructions, when possible, gives the best performance

    Graphical Processing Units

  • GPU vs CPU threads
    • Differences between GPU and CPU threads
      – GPU threads are extremely lightweight
        • Very little overhead
      – GPU needs 1000s of threads for full efficiency (switching between threads hides latency)
        • A multi-core CPU needs only a few

  • Thread Batching: Grids and Blocks
    • A kernel is executed as a grid of thread blocks
      – All threads share data memory space
    • A thread block is a batch of threads that can cooperate with each other by:
      – Synchronizing their execution
        • For hazard-free shared memory accesses
      – Efficiently sharing data through a low-latency shared memory
    • Two threads from two different blocks cannot cooperate

    [Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks (e.g., Grid 1 with Blocks (0,0)..(2,1)), and each block (e.g., Block (1,1)) contains a 2D arrangement of threads (0,0)..(4,2). Courtesy: NVIDIA]

  • Block and Thread IDs
    • Threads and blocks have IDs
      – So each thread can decide what data to work on
      – Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
      – Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
    • Simplifies memory addressing when processing multidimensional data
      – Image processing
      – Solving PDEs on volumes
      – …
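    As an illustration of how these IDs are typically combined into a data index (a standard CUDA idiom, not taken from these slides):

        // Each thread handles one element of a 1D array.
        __global__ void scale(float *data, float alpha, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
            if (i < n)                                      // guard against overshoot
                data[i] *= alpha;
        }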

    [Figure: grid of blocks and the 2D arrangement of threads within one block, as on the previous slide. Courtesy: NVIDIA]

  • An example
    • Multiply two vectors of length 8192
      – Code that works over all elements is the grid
      – Thread blocks break this down into manageable sizes
        • 512 threads per block
        • Many threads to cover long memory latency
          – On a stall, the HW switches quickly between threads
      – Thus grid size = 16 blocks
      – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
        • Fermi GPUs had 16 multithreaded SIMD processors
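    A minimal sketch of what that launch configuration could look like in CUDA; the kernel and variable names are made up for illustration.

        #define N       8192
        #define THREADS 512              // threads per block
        #define BLOCKS  (N / THREADS)    // = 16 blocks in the grid

        // y[i] = x[i] * y[i], one element per thread
        __global__ void vmul(double *x, double *y) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            y[i] = x[i] * y[i];
        }

        // launched from the host as:
        //   vmul<<<BLOCKS, THREADS>>>(d_x, d_y);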

    Graphical Processing Units

  • What Programmer Expresses in CUDA

    [Figure: HOST (CPU) with its memory and DEVICE (GPU) with its memory, connected by an interconnect between devices and memories]

  • Basics
    • Computation partitioning (where does computation occur?)
      – Declarations on functions:
          __host__   : runs on the CPU
          __global__ : kernel, called by the host
          __device__ : code executed on the device, called by a global or another device function
      – Mapping of kernels to the device:
          compute<<<gs, bs>>>(...)   // gs = grid size, bs = block size; gs × bs = threads of one kernel
    • Data partitioning (where does data reside, who may access it and how?)
      – Declarations on data: __shared__, __device__, __constant__, …
    • Data management and orchestration
      – Copying to/from host: e.g., cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost)
    • Concurrency management
      – E.g., __syncthreads() – block granularity

  • Synchronous vs Asynchronous
    • Does a call return immediately (before the called function completes)?
      – Yes: asynchronous
      – No: synchronous

  • Streams
    • A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.
    • Operations within a stream are guaranteed to execute in the prescribed order.
    • Operations in different streams can be interleaved and, when possible, they can even run concurrently.
    • How to specify a stream: pass it as an argument, using functions that take a stream as a parameter.
    • Useful to hide CPU-GPU latency.

  • Minimal Extensions to C + API
    • Declspecs
      – global, device, shared, local, constant
    • Keywords
      – threadIdx, blockIdx
    • Intrinsics
      – __syncthreads
    • Runtime API
      – Memory, symbol, execution management
    • Function launch

        __device__ float filter[N];

        __global__ void convolve (float *image) {
            __shared__ float region[M];
            ...
            region[threadIdx.x] = image[i];
            __syncthreads();
            ...
            image[j] = result;
        }

        // Allocate GPU memory
        void *myimage;
        cudaMalloc(&myimage, bytes);

        // 100 blocks, 10 threads per block
        convolve<<<100, 10>>>(myimage);

  • NVCC Compiler's Role: Partition Code and Compile for Device

    mycode.cu:

        int main_data;
        __shared__ int sdata;

        __device__ dfunc() {
            int ddata;
        }

        __global__ gfunc() {
            int gdata;
            dfunc();
        }

        __host__ hfunc() {
            int hdata;
            gfunc<<<g, b>>>();   /* the kernel launch inside hfunc was truncated in the transcript; reconstructed */
        }

        main() { }

    Device-only code, compiled by the nvcc compiler:
        __shared__ sdata;
        __device__ dfunc() { int ddata; }

    Interface:
        __global__ gfunc() { int gdata; dfunc(); }

    Host-only code, compiled by a native compiler (gcc, icc, cc):
        int main_data;
        main() { }
        __host__ hfunc() { int hdata; gfunc<<<g, b>>>(); }

  • Simple working code example
    • What does it do?
      – Checks the elements of an integer array (each element is 0 to 9)
      – How many times does "6" appear?
      – Array of 16 elements, each thread examines 4 elements, 1 block in the grid, 1 grid

      3 6 5 7 3 5 2 6 0 9 6 3 9 1 7 2

      threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
      threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
      threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
      threadIdx.x = 3 examines in_array elements 3, 7, 11, 15

      Known as a cyclic data distribution

  • CUDA Pseudo-Code

    MAIN PROGRAM:
      Initialization
        • Allocate memory on host for input and output
        • Assign random numbers to input array
      Call host function
      Calculate final output from per-thread output
      Print result

    HOST FUNCTION:
      Allocate memory on device for copy of input and output
      Copy input to device
      Set up grid/block
      Call global function
      Copy device output to host

    GLOBAL FUNCTION:
      Thread scans subset of array elements
      Call device function to compare with "6"
      Compute local result

    DEVICE FUNCTION:
      Compare current element and "6"
      Return 1 if same, else 0

  • Main Program: Preliminaries

    MAIN PROGRAM:
      Initialization
        • Allocate memory on host for input and output
        • Assign random numbers to input array
      Call host function
      Calculate final output from per-thread output
      Print result

        #include <stdio.h>   /* header name truncated in the transcript; stdio.h assumed */
        #define SIZE 16
        #define BLOCKSIZE 4

        int main(int argc, char **argv)
        {
            int *in_array, *out_array;
            …
        }

  • Main Program: Invoke Global Function

    MAIN PROGRAM:
      Initialization (OMIT)
        • Allocate memory on host for input and output
        • Assign random numbers to input array
      Call host function
      Calculate final output from per-thread output
      Print result

        #include <stdio.h>
        #define SIZE 16
        #define BLOCKSIZE 4

        __host__ void outer_compute(int *in_arr, int *out_arr);

        int main(int argc, char **argv)
        {
            int *in_array, *out_array;
            /* initialization */ …
            outer_compute(in_array, out_array);
            …
        }

  • Main Program: Calculate Output & Print Result

    MAIN PROGRAM:
      Initialization (OMIT)
        • Allocate memory on host for input and output
        • Assign random numbers to input array
      Call host function
      Calculate final output from per-thread output
      Print result

        #include <stdio.h>
        #define SIZE 16
        #define BLOCKSIZE 4

        __host__ void outer_compute(int *in_arr, int *out_arr);

        int main(int argc, char **argv)
        {
            int *in_array, *out_array;
            int sum = 0;
            /* initialization */ …
            outer_compute(in_array, out_array);
            /* remainder truncated in the transcript; presumably sums the per-thread results: */
            for (int i = 0; i < BLOCKSIZE; i++)
                sum += out_array[i];
            printf("Result = %d\n", sum);
        }

  • Host Function: Preliminaries & Allocation

    HOST FUNCTION:
      Allocate memory on device for copy of input and output
      Copy input to device
      Set up grid/block
      Call global function
      Copy device output to host

        __host__ void outer_compute(int *h_in_array, int *h_out_array)
        {
            int *d_in_array, *d_out_array;

            cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
            cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
            …
        }

  • Host Function: Copy Data To/From Host

    HOST FUNCTION:
      Allocate memory on device for copy of input and output
      Copy input to device
      Set up grid/block
      Call global function
      Copy device output to host

        __host__ void outer_compute(int *h_in_array, int *h_out_array)
        {
            int *d_in_array, *d_out_array;

            cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
            cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
            cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice);

            … do computation ...

            cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
        }

  • Host Function: Setup & Call Global Function

    HOST FUNCTION:
      Allocate memory on device for copy of input and output
      Copy input to device
      Set up grid/block
      Call global function
      Copy device output to host

        __host__ void outer_compute(int *h_in_array, int *h_out_array)
        {
            int *d_in_array, *d_out_array;

            cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
            cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
            cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice);

            /* 1 block of BLOCKSIZE threads; the <<<...>>> launch syntax was lost in the transcript */
            compute<<<1, BLOCKSIZE>>>(d_in_array, d_out_array);

            cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
        }

  • Global Function

    GLOBAL FUNCTION:
      Thread scans subset of array elements
      Call device function to compare with "6"
      Compute local result

        __global__ void compute(int *d_in, int *d_out)
        {
            d_out[threadIdx.x] = 0;
            /* loop body truncated in the transcript; each thread scans its subset
               (cyclic distribution) and accumulates matches against "6": */
            for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
                int val = d_in[i*BLOCKSIZE + threadIdx.x];
                d_out[threadIdx.x] += compare(val, 6);
            }
        }

  • Device Function

    DEVICE FUNCTION:
      Compare current element and "6"
      Return 1 if same, else 0

        __device__ int compare(int a, int b)
        {
            if (a == b) return 1;
            return 0;
        }
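    Putting the fragments above together, here is a minimal self-contained sketch of the whole "count 6" program. The kernel-launch syntax and the truncated loop bodies are reconstructed assumptions consistent with the pseudo-code, not the exact original listing.

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        #define SIZE 16
        #define BLOCKSIZE 4

        __device__ int compare(int a, int b) { return a == b ? 1 : 0; }

        /* each thread counts the 6s in its (cyclic) subset of the array */
        __global__ void compute(int *d_in, int *d_out) {
            d_out[threadIdx.x] = 0;
            for (int i = 0; i < SIZE/BLOCKSIZE; i++)
                d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
        }

        __host__ void outer_compute(int *h_in, int *h_out) {
            int *d_in, *d_out;
            cudaMalloc((void **)&d_in,  SIZE*sizeof(int));
            cudaMalloc((void **)&d_out, BLOCKSIZE*sizeof(int));
            cudaMemcpy(d_in, h_in, SIZE*sizeof(int), cudaMemcpyHostToDevice);
            compute<<<1, BLOCKSIZE>>>(d_in, d_out);        /* 1 block, 4 threads */
            cudaMemcpy(h_out, d_out, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
            cudaFree(d_in); cudaFree(d_out);
        }

        int main(void) {
            int in[SIZE], out[BLOCKSIZE], sum = 0;
            for (int i = 0; i < SIZE; i++) in[i] = rand() % 10;   /* values 0..9 */
            outer_compute(in, out);
            for (int i = 0; i < BLOCKSIZE; i++) sum += out[i];    /* combine per-thread results */
            printf("Result = %d\n", sum);
            return 0;
        }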

  • Memory Model


  • Memory Model
    • Host copies to/from device (can be slow)
    • Device does not support coherence
      – Any caches in the device must be disabled if coherence is needed

  • What if we computed sum on GPU?
    • Global, device functions and excerpts from host, main

        __host__ void outer_compute (int *h_in_array, int *h_sum) {
            …
            compute<<<1, BLOCKSIZE>>>(d_in_array, d_sum);   /* <<<...>>> launch syntax lost in the transcript */
            cudaThreadSynchronize();
            cudaMemcpy(h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
        }

        main(int argc, char **argv) {
            …
            int *sum;   // an integer
            outer_compute(in_array, sum);
            printf("Result = %d\n", *sum);
        }

        __device__ int compare(int a, int b) {
            if (a == b) return 1;
            return 0;
        }

        __global__ void compute(int *d_in, int *sum) {
            *sum = 0;
            /* loop truncated in the transcript; every thread presumably adds its
               matches directly into the single sum (note the unsynchronized update): */
            for (int i = 0; i < SIZE/BLOCKSIZE; i++)
                *sum += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
        }

  • Gathering Results on GPU for "Count 6"

        __global__ void compute(int *d_in, int *d_out) {
            d_out[threadIdx.x] = 0;
            for (int i = 0; i < SIZE/BLOCKSIZE; i++)
                d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
            /* the rest of this slide (how the per-thread counts are then gathered
               on the GPU) was truncated in the transcript; see the serialized and
               tree-structured gathering slides later in the deck */
        }

  • Synchronize
    • Kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting.
    • Host-device memory transfers are synchronous (blocking)
      – Points of synchronization (slow)
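    A small sketch of the behaviour described above, using a made-up kernel; a standard CUDA pattern rather than code from the slides.

        #include <cuda_runtime.h>
        #include <stdio.h>

        __global__ void myKernel(int *d) { d[threadIdx.x] += 1; }

        int main(void) {
            int h[32] = {0}, *d;
            cudaMalloc(&d, sizeof(h));
            cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

            myKernel<<<1, 32>>>(d);      /* returns immediately (asynchronous) */
            /* ... the CPU could do other useful work here ... */

            /* blocking copy = point of synchronization: the host waits for the kernel */
            cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
            printf("h[0] = %d\n", h[0]);
            cudaFree(d);
            return 0;
        }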

  • Programmer's View: Memory Spaces
    • Each thread can:
      – Read/write per-thread registers
      – Read/write per-thread local memory
      – Read/write per-block shared memory
      – Read/write per-grid global memory
      – Read only per-grid constant memory
      – Read only per-grid texture memory
    • Threads in a block share the block's shared memory
    • Blocks in a grid share global, constant, and texture memory
    • The host can read/write global, constant, and texture memory

    [Figure: grid of blocks; each block has its own shared memory, each thread has its own registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the host]

  • Terminology Review
    • device = GPU = set of multiprocessors
    • Multiprocessor = set of processors & shared memory
    • Kernel = GPU program
    • Grid = array of thread blocks that execute a kernel
    • Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

      Memory    Location   Cached                  Access      Who
      Local     Off-chip   No (1.x) / Yes (>=2)    Read/write  One thread
      Shared    On-chip    N/A (resident)          Read/write  All threads in a block
      Global    Off-chip   No (1.x) / Yes (>=2)    Read/write  All threads + host
      Constant  Off-chip   Yes                     Read        All threads + host
      Texture   Off-chip   Yes                     Read        All threads + host

  • Reuse and Locality
    • Consider how data is accessed
      – Data reuse:
        • Same data used multiple times
        • Intrinsic in the computation
      – Data locality:
        • Data is reused and is present in "fast memory"
        • Same data or same data transfer
    • If a computation has reuse, what can we do to get locality?
      – Appropriate data placement and layout
      – Code reordering transformations

  • Access Times
    • Register – dedicated HW – single cycle
    • Constant and texture caches – possibly single cycle, proportional to addresses accessed by the warp
    • Shared memory – dedicated HW – single cycle if no "bank conflicts"
    • Local memory – DRAM; *slow* if not cached, fast if cached
    • Global memory – DRAM; *slow* if not cached, fast if cached
    • Constant memory – DRAM, cached; 1…10s…100s of cycles, depending on cache locality
    • Texture memory – DRAM, cached; 1…10s…100s of cycles, depending on cache locality
    • Instruction memory – DRAM, cached

  • Data Placement: Conceptual
    • Copies from host to device go to some part of global memory (possibly constant or texture memory)
    • How to use SP shared memory
      – Must be constructed or copied from global memory by the kernel program
    • How to use the constant or texture cache
      – Read-only "reused" data can be placed in constant & texture memory by the host
    • Also, how to use registers
      – Most locally-allocated data is placed directly in registers
      – Even array variables can use registers if the compiler understands access patterns
      – Can allocate "superwords" to registers, e.g., float4
      – Excessive use of registers will "spill" data to local memory
    • Local memory
      – Deals with capacity limitations of registers and shared memory
      – Eliminates worries about race conditions (per thread)
      – … but SLOW if not in cache

  • Data Placement: Syntax
    • Through type qualifiers
      – __constant__, __shared__, __local__, __device__
    • Through cudaMemcpy calls
      – The type of call and its symbolic constant designate where to copy
    • Implicit default behavior
      – Device memory without a qualifier is global memory
      – The host by default copies to global memory
      – Thread-local variables go into registers unless capacity is exceeded, then into local memory

  • Language Extensions: Variable Type Qualifiers

                                                  Memory     Scope   Lifetime
      __device__ __local__    int LocalVar;       local      thread  thread
      __device__ __shared__   int SharedVar;      shared     block   block
      __device__              int GlobalVar;      global     grid    application
      __device__ __constant__ int ConstantVar;    constant   grid    application

    • __device__ is optional when used with __local__, __shared__, or __constant__
    • Automatic variables without any qualifier reside in a register
      – Except arrays, which reside in local memory

  • Variable Type Restrictions
    • Pointers can only point to memory allocated or declared in global memory:
      – Allocated in the host and passed to the kernel:
          __global__ void KernelFunc(float* ptr)
      – Obtained as the address of a global variable:
          float* ptr = &GlobalVar;

  • How to place data in constant and shared memory
    • Constant memory is fast if all threads in a warp access the same address

  • Constant Memory Example
    – Apply a signal (vector) to a set of precomputed transform matrices
    – Compute M1V, M2V, …, MnV
    – Note the same constant value across threads

        __constant__ float d_signalVector[M];
        __device__   float R[N][M];

        __host__ void outerApplySignal () {
            float *h_inputSignal;
            dim3 dimGrid(N);
            dim3 dimBlock(M);
            cudaMemcpyToSymbol(d_signalVector, h_inputSignal, M*sizeof(float));
            // assume input matrix is in d_mat
            ApplySignal<<<dimGrid, dimBlock>>>(d_mat, M);
        }

        __global__ void ApplySignal (float *d_mat, int M) {
            float result = 0.0;  /* register */
            /* loop truncated in the transcript; presumably each thread forms the dot
               product of its matrix row with the signal vector (indexing reconstructed): */
            for (int j = 0; j < M; j++)
                result += d_mat[blockIdx.x*M*M + threadIdx.x*M + j] * d_signalVector[j];
            R[blockIdx.x][threadIdx.x] = result;
        }

  • More on Constant Cache
    • Example from the previous slide
      – All threads in a block access the same element of the signal vector
      – Brought into the cache on the first access; after that, latency is equivalent to a register access

    [Figure: SIMD lanes P0 … PM-1 with their registers and a shared instruction unit; LD signalVector[j] is serviced for all lanes by the shared constant cache]

  • Now Let's Look at Shared Memory
    • Common Programming Pattern (5.1.2 of CUDA manual)
      – Load data from global memory into shared memory
      – Synchronize (if necessary)
      – Operate on the data in shared memory
      – Synchronize (if necessary)
      – Write intermediate results back to global memory
      – Repeat until done

  • Mechanics of Using Shared Memory
    • __shared__ type qualifier required
    • Must be allocated from a global/device function, or as "extern"
    • Examples:

        /* a form of dynamic allocation           */
        /* MEMSIZE is the size of the per-block   */
        /* shared memory                          */
        __host__ void outerCompute() {
            compute<<<gs, bs, MEMSIZE>>>();   /* launch syntax lost in the transcript; third parameter = dynamic shared memory size */
        }
        __global__ void compute() {
            extern __shared__ float d_s_array[];
            d_s_array[i] = …;
        }

        __global__ void compute2() {
            __shared__ float d_s_array[M];
            /* create or copy from global memory */
            d_s_array[j] = …;
            /* write result back to global memory */
            d_g_array[j] = d_s_array[j];
        }

  • Matrix Transpose (from SDK)

        __global__ void transpose(float *odata, float *idata, int width, int height)
        {
            __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

            // read the matrix tile into shared memory
            unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
            unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
            unsigned int index_in = yIndex * width + xIndex;
            block[threadIdx.y][threadIdx.x] = idata[index_in];

            __syncthreads();

            // write the transposed matrix tile to global memory
            xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
            yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
            unsigned int index_out = yIndex * height + xIndex;
            odata[index_out] = block[threadIdx.x][threadIdx.y];
        }

    odata and idata are in global memory; the tile is rearranged in shared memory and written back efficiently to global memory.
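    A possible way to launch this transpose kernel; the grid/block setup follows the usual SDK tiling pattern and is written here as an assumption, not copied from the slides.

        #define BLOCK_DIM 16

        // width and height assumed to be multiples of BLOCK_DIM
        dim3 threads(BLOCK_DIM, BLOCK_DIM);
        dim3 grid(width / BLOCK_DIM, height / BLOCK_DIM);
        transpose<<<grid, threads>>>(d_odata, d_idata, width, height);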

  • Overview of Texture Memory
    • Recall: the texture cache holds read-only data
    • Optimized for spatial locality
      – can be fast for many sequential addresses in the same warp
    • Special protocol for allocating and copying to the GPU
      – texture<…> texRef;   (template arguments lost in the transcript)
    • Dim: 1, 2 or 3D objects
    • Special protocol for accesses
      – tex2D(texRef, dim1, dim2);
    • Will not cover in more detail

  • NVIDIA GPU Execution/Arch
    I. SIMD execution of warpsize = M threads (from a single block)
       – Result is a set of instruction streams roughly equal to (# threads in a block) divided by warpsize
       – Warp SIMD execution
    II. Multithreaded execution across different instruction streams within a block
       – Also possibly across different blocks
    III. Each block is mapped to a single SM (streaming multiprocessor)
       – No direct interaction across SMs

    [Figure: device with Multiprocessors 1..N; each multiprocessor has Processors 1..M with registers, a shared instruction unit, shared memory, a constant cache, and a texture cache, all above device memory]

  • CUDA Thread Block Overview
    • All threads in a block execute the same kernel program (SPMD)
    • Programmer declares block:
      – Block size: 1 to 512 (1024) concurrent threads
      – Block shape: 1D, 2D, or 3D
      – Block dimensions in threads
    • Threads have thread id numbers within the block
      – The thread program uses the thread id to select work and address shared data
    • Threads in the same block share data and synchronize while doing their share of the work
    • Threads in different blocks cannot cooperate
      – Each block can execute in any order relative to other blocks!

    [Figure: CUDA thread block with thread ids 0, 1, 2, 3, …, m all running the same thread program. Courtesy: John Nickolls, NVIDIA]

  • Simplified block diagram of a Multithreaded SIMD Processor
    • It has 16 SIMD lanes.
    • The SIMD thread scheduler has, say, 48 warps that it schedules with a table of 48 PCs.

    Streaming Multiprocessor (SM)

  • NVIDIA GPU Memory Structures
    • Each thread has a private section of off-chip DRAM
      – "Private memory"
      – Contains the stack frame, spilled registers, and private variables
    • Each multithreaded SIMD processor also has local memory
      – Shared by the SIMD lanes / threads within a block
    • Memory shared by the SIMD processors is GPU memory
      – The host can read and write GPU memory

    Graphical Processing Units

  • Device Memory Hierarchy

    [Figure: device with Multiprocessors 1..N; each multiprocessor contains Processors 1..M with registers, an instruction unit, shared memory, a constant cache, and a texture cache; all multiprocessors sit above device memory]

  • Optimizing the Memory Hierarchy on GPUs: Overview
    • Device memory access times are non-uniform, so data placement significantly affects performance.
    • But controlling data placement may require additional copying, so consider the overhead.
    • Optimizations to increase memory bandwidth. Idea: maximize the utility of each memory access.

  • Hardware Implementation: Memory Architecture
    • The local, global, constant, and texture spaces are regions of device memory (DRAM)
    • Each multiprocessor has:
      – A set of 32-bit registers per processor
      – On-chip shared memory
        • Where the shared memory space resides
      – A read-only constant cache
        • To speed up access to the constant memory space
      – A read-only texture cache
        • To speed up access to the texture memory space

    [Figure: same device diagram as before; the global, constant, and texture memories live in off-chip device memory]

  • Terminology
    • The thread block scheduler schedules blocks to SIMD processors
    • Within each SIMD processor (another scheduler):
      – 16 (Tesla) to 32 (Fermi) SIMD lanes
      – Wide and shallow compared to vector processors
    • Warps: collections of threads in a block
    • The thread scheduler uses a scoreboard to dispatch warps
      – No data dependences between threads of a warp!
      – But control flow can diverge
      – Best performance when there is no divergence (all threads follow the same control flow)
      – Keeps track of up to 48 warps
    • Hides memory latency

    Graphical Processing Units

  • Example SIMD Execution: "Count 6" kernel function

        d_out[threadIdx.x] = 0;
        for (int i = 0; i < SIZE/BLOCKSIZE; i++)
            d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);

    (This example was shown over several slides, stepping through how each statement is issued to all SIMD lanes in lockstep; the step-by-step figures were not captured in the transcript.)

  • Multithreading: Motivation
    • Each arithmetic instruction includes the following sequence:

        Activity        Cost                       Note
        Load operands   As much as O(100) cycles   Depends on location
        Compute         O(1) cycles                Accesses registers
        Store result    As much as O(100) cycles   Depends on location

    • Memory latency, the time in cycles to access memory, limits utilization of the compute engines

  • Thread-Level Parallelism
    • Motivation:
      – A single thread leaves a processor under-utilized for most of the time
    • Strategies for thread-level parallelism:
      – Multiple threads share the same large processor: reduces under-utilization, efficient resource allocation (Multi-Threading)
      – Each thread executes on its own mini processor: simple design, low interference between threads (Multi-Processing)

  • How it all Comes Together (G80 Example): Executing Thread Blocks
    • Threads are assigned to Streaming Multiprocessors at block granularity
      – Up to 8 blocks to each SM, as resources allow
      – An SM in G80 can take up to 768 threads
        • Could be 256 (threads/block) × 3 blocks
        • Or 128 (threads/block) × 6 blocks, etc.
    • Threads run concurrently
      – The SM maintains thread/block id #s
      – The SM manages/schedules thread execution

    [Figure: two SMs (SM 0, SM 1), each with an MT issue unit, SPs, and shared memory, each holding several blocks of threads t0 t1 t2 … tm]

  • Details of Mapping
    • If #blocks in a grid exceeds the number of SMs:
      – multiple blocks are mapped to an SM
      – treated independently
      – provides more warps to the scheduler, so good as long as resources are not exceeded
      – possibly context-switching overhead when scheduling between blocks (registers and shared memory)
    • Thread Synchronization
      – Within a block, threads observe the SIMD model and synchronize using __syncthreads()
      – Across blocks, interaction is through global memory

  • Transparent Scalability
    • Hardware is free to assign blocks to any processor at any time
      – A kernel scales across any number of parallel processors
    • Each block can execute in any order relative to other blocks.

    [Figure: the same kernel grid of Blocks 0-7 scheduled two at a time on a 2-SM device, or four at a time on a 4-SM device, over time]

  • NVIDIA GPU Architecture
    • Similarities to vector machines:
      – Works well with data-level parallel problems
      – Scatter-gather transfers
      – Mask registers
      – Large register files
    • Differences:
      – No scalar processor
      – Uses multithreading to hide memory latency
      – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
      – SIMD + SPMD

    Graphical Processing Units

  • NVIDIA Instruction Set Architecture
    • The ISA is an abstraction of the hardware instruction set
      – "Parallel Thread Execution (PTX)"
      – Uses virtual registers
      – Translation to machine code is performed in software
      – Example (one DAXPY iteration):

          shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
          add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
          ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
          ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
          mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
          add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
          st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

    Graphical Processing Units
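    For comparison, the CUDA source this PTX fragment corresponds to is essentially the DAXPY kernel from H&P; a sketch, with the launch line written as an assumption:

        __global__ void daxpy(int n, double a, double *X, double *Y) {
            int i = blockIdx.x * 512 + threadIdx.x;   // 512 threads per block
            if (i < n)
                Y[i] = a * X[i] + Y[i];
        }
        // launched from the host roughly as: daxpy<<<n/512, 512>>>(n, a, d_X, d_Y);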

  • How code is compiled


  • Conditional Branching
    • Like vector architectures, GPU branch hardware uses internal masks
    • Also uses:
      – Branch synchronization stack
        • Entries consist of masks for each SIMD lane
        • i.e., which threads commit their results (all threads execute)
      – Instruction markers to manage when a branch diverges into multiple execution paths
        • Push on divergent branch
      – …and when paths converge
        • Act as barriers
        • Pop the stack
    • Per-thread-lane 1-bit predicate register, specified by the programmer

    Graphical Processing Units

  • SIMD Execution of Control Flow
    Control flow example:

        if (threadIdx.x >= 2) {
            out[threadIdx.x] += 100;
        }
        else {
            out[threadIdx.x] += 10;
        }

    [Figure: lanes P0 … PM-1 with their registers, a shared instruction unit, and memory; first, all lanes execute "compare threadIdx, 2"]

  • SIMD Execution of Control Flow (taken branch)
    Control flow example:

        if (threadIdx.x >= 2) {
            out[threadIdx.x] += 100;
        }
        else {
            out[threadIdx.x] += 10;
        }

    Lane activity: X X ✔ ✔ (only lanes 2 and 3 execute the "then" path)

        /* Condition code cc = true branch, set by predicate execution */
        (CC) LD  R5, &(out+threadIdx.x)
        (CC) ADD R5, R5, 100
        (CC) ST  R5, &(out+threadIdx.x)

  • SIMD Execution of Control Flow (else path)
    Control flow example:

        if (threadIdx.x >= 2) {
            out[threadIdx.x] += 100;
        }
        else {
            out[threadIdx.x] += 10;
        }

    Lane activity: ✔ ✔ X X (only lanes 0 and 1 execute the "else" path)

        /* possibly predicated using CC */
        (not CC) LD  R5, &(out+threadIdx.x)
        (not CC) ADD R5, R5, 10
        (not CC) ST  R5, &(out+threadIdx.x)

  • Example

        if (X[i] != 0)
            X[i] = X[i] - Y[i];
        else
            X[i] = Z[i];

                ld.global.f64  RD0, [X+R8]     ; RD0 = X[i]
                setp.neq.s32   P1, RD0, #0     ; P1 is predicate register 1
                @!P1, bra ELSE1, *Push         ; Push old mask, set new mask bits
                                               ; if P1 false, go to ELSE1
                ld.global.f64  RD2, [Y+R8]     ; RD2 = Y[i]
                sub.f64        RD0, RD0, RD2   ; Difference in RD0
                st.global.f64  [X+R8], RD0     ; X[i] = RD0
                @P1, bra ENDIF1, *Comp         ; complement mask bits
                                               ; if P1 true, go to ENDIF1
        ELSE1:  ld.global.f64  RD0, [Z+R8]     ; RD0 = Z[i]
                st.global.f64  [X+R8], RD0     ; X[i] = RD0
        ENDIF1: <next instruction>, *Pop       ; pop to restore old mask

    Graphical Processing Units

  • Terminology
    • Divergent paths
      – Different threads within a warp take different control flow paths within a kernel function
      – N divergent paths in a warp?
        • An N-way divergent warp is serially issued over the N different paths, using a hardware stack and per-thread predication logic to only write back results from the threads taking each divergent path.
        • Performance decreases by about a factor of N

  • Recall: Serialized Gathering of Results on GPU for "Count 6"

        __global__ void compute(int *d_in, int *d_out, int *d_sum)
        {
            d_out[threadIdx.x] = 0;
            for (int i = 0; i < SIZE/BLOCKSIZE; i++)
                d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
            /* remainder truncated in the transcript; the "serialized gathering"
               presumably has one thread sum the per-thread counts, e.g.: */
            __syncthreads();
            if (threadIdx.x == 0) {
                *d_sum = 0;
                for (int i = 0; i < BLOCKSIZE; i++)
                    *d_sum += d_out[i];
            }
        }

  • Tree-Structured Computation

        out[0]        out[1]        out[2]        out[3]
           out[0] += out[1]            out[2] += out[3]
                       out[0] += out[2]

    Tree-structured results-gathering phase, where independent threads collect their results in parallel.
    Assume SIZE = 16 and BLOCKSIZE (elements computed per thread) = 4.

  • A possible implementation for just the reduction

        unsigned int t = threadIdx.x;
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            if (t % (2*stride) == 0)
                d_out[t] += d_out[t+stride];
        }

  • Vector Reduction with Branch Divergence

    [Figure: array elements 0..11 across iterations; iteration 1 forms the pairs 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 (threads 0, 2, 4, 6, 8, 10 active); iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7, 8..15]

  • Some Observations
    • In each iteration, two control flow paths will be sequentially traversed for each warp
      – Threads that perform the addition and threads that do not
      – Threads that do not perform the addition may cost extra cycles, depending on the implementation of divergence
    • No more than half of the threads will be executing at any time
      – All odd-index threads are disabled right from the beginning!
      – On average, fewer than 1/4 of the threads will be activated for all warps over time.
      – After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence.
        • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire

  • What's Wrong?

        unsigned int t = threadIdx.x;
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            if (t % (2*stride) == 0)
                d_out[t] += d_out[t+stride];
        }

    BAD: Divergence due to interleaved branch decisions

  • A better implementation

        unsigned int t = threadIdx.x;
        for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
            __syncthreads();
            if (t < stride)
                d_out[t] += d_out[t+stride];
        }

  • [Figure: improved reduction pattern; in the first iteration, thread t adds element t+16 into element t (0+16, …, 15+31); no divergence occurs until fewer than 16 sub-sums remain]

  • A shared memory implementation
    • Assume we have already loaded the array into
        __shared__ float partialSum[];

        unsigned int t = threadIdx.x;
        for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
            __syncthreads();
            if (t < stride)
                partialSum[t] += partialSum[t+stride];
        }
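    Putting the pieces together, here is a minimal self-contained sketch of a per-block shared-memory reduction along the lines described above; the kernel and variable names are illustrative, not from the slides.

        #define BLOCKSIZE 512

        // Each block reduces BLOCKSIZE elements of d_in into one partial sum.
        __global__ void block_sum(const float *d_in, float *d_block_sums) {
            __shared__ float partialSum[BLOCKSIZE];

            unsigned int t = threadIdx.x;
            partialSum[t] = d_in[blockIdx.x * blockDim.x + t];   // load into shared memory
            __syncthreads();

            // divergence-friendly pattern: active threads stay contiguous
            for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
                if (t < stride)
                    partialSum[t] += partialSum[t + stride];
                __syncthreads();
            }

            if (t == 0)                                  // thread 0 writes the block's result
                d_block_sums[blockIdx.x] = partialSum[0];
        }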

  • Stream Management
    http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc

    The cudaMemcpy calls (shown in red on the original slide) are blocking (synchronous); the kernel launch is not.

    Ex. 1:
        cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
        increment<<<1, N>>>(d_a);     /* <<<...>>> launch configuration lost in the transcript; restored following the cited post */
        cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

    Ex. 2:
        cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
        increment<<<1, N>>>(d_a);
        myCpuFunction(b);
        cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

  • Create and Destroy Stream

        cudaStream_t stream1;
        cudaError_t result;
        result = cudaStreamCreate(&stream1);
        result = cudaStreamDestroy(stream1);

    Transfer data in a stream (asynchronous; overlaps transfer with compute):
        result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);

    Issue a kernel to a stream (the stream is the fourth launch parameter; the <<<...>>> syntax was lost in the transcript):
        increment<<<1, N, 0, stream1>>>(d_a);

  • Example

        /* memcpy direction arguments and <<<...>>> launch parameters were lost in the
           transcript; restored following the cited blog post */
        for (int i = 0; i < nStreams; ++i) {
            int offset = i * streamSize;
            cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
            kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
            cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
        }

    nStreams concurrent executions.
    Note: all commands issued to a stream execute in order.
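    To wait for the work in each stream to finish before using the results, the host can synchronize per stream; a small sketch (not on the slides):

        for (int i = 0; i < nStreams; ++i) {
            cudaStreamSynchronize(stream[i]);   // block until all work in stream[i] is done
            cudaStreamDestroy(stream[i]);
        }
        // cudaDeviceSynchronize() would instead wait for *all* outstanding device work at once.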