ΕΠΛ372 Parallel Processing


  • ΕΠΛ372 Parallel Processing

    Parallel Architectures: SIMD, GPU

    Yiannos Sazeides, Spring Semester 2014

    READING
    1. Paper on GPUs: Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Volume 28, Issue 2, March-April 2008, Pages 39-55.
    2. HW#2

    Slides based on Elsevier material for H&P 5th edition, Copyright © 2012, Elsevier Inc. All rights reserved.
    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

  • SIMD Extensions

    • Media applications operate on data types narrower than the native word size
      – Example: disconnect carry chains to "partition" the adder (see the intrinsics sketch below)

    • Limitations, compared to vector instructions:
      – Number of data operands encoded into the opcode
      – No sophisticated addressing modes (strided, scatter-gather)
      – No mask registers
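    To make the "partitioned adder" idea concrete, the sketch below uses x86 SSE2 intrinsics (an illustration, not part of the slides): one 128-bit register holds eight 16-bit integers, and a single instruction performs all eight additions with the carry chain cut at the 16-bit lane boundaries.

          #include <emmintrin.h>   /* SSE2 intrinsics */
          #include <stdint.h>

          /* Add two arrays of eight 16-bit integers with a single SIMD add:
             the 128-bit adder is "partitioned" into eight 16-bit lanes. */
          void add8x16(const int16_t *a, const int16_t *b, int16_t *c)
          {
              __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 8 x 16-bit values */
              __m128i vb = _mm_loadu_si128((const __m128i *)b);
              __m128i vc = _mm_add_epi16(va, vb);                /* 8 additions in one op */
              _mm_storeu_si128((__m128i *)c, vc);                /* store 8 results */
          }

    Note, as the limitations above say, that the operand count (eight 16-bit lanes) is fixed by the instruction itself, and there are no strided or scatter-gather accesses.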


  • SIMD Implementations

    • Implementations:
      – Intel MMX (1996)
        • Eight 8-bit integer ops or four 16-bit integer ops
      – Streaming SIMD Extensions (SSE) (1999)
        • Eight 16-bit integer ops
        • Four 32-bit integer/fp ops or two 64-bit integer/fp ops
      – Advanced Vector Extensions (2010)
        • Four 64-bit integer/fp ops
      – Operands must be in consecutive and aligned memory locations


  • DAXPY over 64 elements with non-vector/SIMD instructions

          L.D     F0,a          ;load scalar a
          DADDIU  R4,Rx,#512    ;last address to load
    Loop: L.D     F4,0(Rx)      ;load X[i]
          MUL.D   F4,F4,F0      ;a×X[i]
          L.D     F8,0(Ry)      ;load Y[i]
          ADD.D   F8,F8,F4      ;a×X[i]+Y[i]
          S.D     0(Ry),F8      ;store into Y[i]
          DADDIU  Rx,Rx,#8      ;increment index to X
          DADDIU  Ry,Ry,#8      ;increment index to Y
          DSUBU   R20,R4,Rx     ;compute bound
          BNEZ    R20,Loop      ;check if done

    • Requires almost 600 instructions for MIPS
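    For reference, the loop above is the scalar form of DAXPY; in C it is simply the following (a minimal sketch, with the length of 64 doubles and the names a, x, y taken from the slide):

          /* DAXPY: Y = a*X + Y over 64 double-precision elements,
             one element per iteration, as in the scalar MIPS loop above. */
          void daxpy(double a, const double *x, double *y)
          {
              for (int i = 0; i < 64; i++)
                  y[i] = a * x[i] + y[i];
          }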


  • DAXPY with Vector Instructions

    • ADDVV.D: add two vectors
    • ADDVS.D: add vector to a scalar
    • LV/SV: vector load and vector store from address

    • Example: DAXPY
          L.D      F0,a       ; load scalar a
          LV       V1,Rx      ; load vector X
          MULVS.D  V2,V1,F0   ; vector-scalar multiply
          LV       V3,Ry      ; load vector Y
          ADDVV.D  V4,V2,V3   ; add
          SV       Ry,V4      ; store the result

    • Requires 6 instructions vs. almost 600 for MIPS


  • Example SIMD Code (assume a 256-bit extension)

    • Example DAXPY:
          L.D     F0,a          ;load scalar a
          MOV     F1,F0         ;copy a into F1 for SIMD MUL
          MOV     F2,F0         ;copy a into F2 for SIMD MUL
          MOV     F3,F0         ;copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512    ;last address to load
    Loop: L.4D    F4,0(Rx)      ;load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D    F8,0(Ry)      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D    0(Ry),F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32     ;increment index to X
          DADDIU  Ry,Ry,#32     ;increment index to Y
          DSUBU   R20,R4,Rx     ;compute bound
          BNEZ    R20,Loop      ;check if done
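    The same four-elements-per-iteration scheme can be written today with 256-bit AVX intrinsics on x86 (an illustrative sketch, not from the slides; n is assumed to be a multiple of 4):

          #include <immintrin.h>   /* AVX intrinsics */

          /* DAXPY with 256-bit SIMD: four doubles per iteration,
             mirroring the L.4D / MUL.4D / ADD.4D / S.4D loop above. */
          void daxpy_avx(double a, const double *x, double *y, int n)
          {
              __m256d va = _mm256_set1_pd(a);             /* broadcast a into all 4 lanes
                                                             (the MOV F1..F3 copies above) */
              for (int i = 0; i < n; i += 4) {
                  __m256d vx = _mm256_loadu_pd(&x[i]);    /* load X[i..i+3] */
                  __m256d vy = _mm256_loadu_pd(&y[i]);    /* load Y[i..i+3] */
                  __m256d vr = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
                  _mm256_storeu_pd(&y[i], vr);            /* store Y[i..i+3] */
              }
          }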


  • Roofline Performance Model

    • Basic idea:
      – Plot peak floating-point throughput as a function of arithmetic intensity
      – Ties together floating-point performance and memory performance for a target machine

    • Arithmetic intensity
      – Floating-point operations per byte (word) of memory read


  • Examples

    • Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
    • Helps understand whether a kernel is compute bound or memory bound (see the worked example below)
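    As a worked example of the formula (the machine numbers are illustrative assumptions, not from the slides):

          #include <math.h>

          /* Roofline: attainable GFLOP/s = min(peak memory BW x arithmetic intensity,
                                                peak floating-point performance). */
          double attainable_gflops(double peak_bw_gb_s, double peak_fp_gflops,
                                   double flops_per_byte)
          {
              return fmin(peak_bw_gb_s * flops_per_byte, peak_fp_gflops);
          }

          /* Assumed machine: 16 GB/s memory BW, 100 GFLOP/s peak FP.
             At 2 FLOPs/byte: min(16*2, 100) = 32  GFLOP/s  -> memory bound.
             At 8 FLOPs/byte: min(16*8, 100) = 100 GFLOP/s  -> compute bound. */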


  • Graphical Processing Units

    • Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

    • Basic idea:
      – Heterogeneous execution model
        • CPU is the host, GPU is the device
        • Energy efficient
      – Develop a C-like programming language for the GPU
      – Unify all forms of GPU parallelism as the CUDA thread
      – Programming model is "Single Instruction Multiple Data" (SIMD)
      – NVIDIA: SIMT – single instruction, multiple threads; a flexible form of SIMD


  • Dedicated GPU

    • Separate GPU card with its own DRAM
    • Separate GPU card but sharing DRAM with the CPU – lower cost but less bandwidth

  • GPU Processors

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

  • Integrated GPU Processors

    • Less DRAM bandwidth than dedicated GPUs

  • High Level GPU Programming

    [Figure omitted; source: Wikipedia]

    • Serious bottleneck if this latency is not hidden

  • Threads and Blocks

    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
    • GPU hardware handles thread management, not applications or the OS
      – Lower-overhead thread management vs. pthreads (suitable for finer-grain parallelism)
    • Execution of threads as SIMD instructions, when possible, gives the best performance (see the kernel sketch below)
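    As a sketch of this model (standard CUDA, though this particular kernel is not from the slides), each thread computes one element of DAXPY and the host launches enough blocks of threads to cover the whole array:

          // One CUDA thread per data element: y[i] = a*x[i] + y[i].
          __global__ void daxpy(int n, double a, const double *x, double *y)
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
              if (i < n)                                      // guard: n may not fill the last block
                  y[i] = a * x[i] + y[i];
          }

          // Host side: 256 threads per block, enough blocks to cover n elements.
          // daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);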


  • GPU vs CPU Threads

    • Differences between GPU and CPU threads
      – GPU threads are extremely lightweight
        • Very little overhead
      – A GPU needs 1000s of threads for full efficiency (switching among them to hide latency)
        • A multi-core CPU needs only a few

  • Thread Batching: Grids and Blocks

    • A kernel is executed as a grid of thread blocks
      – All threads share the data memory space
    • A thread block is a batch of threads that can cooperate with each other by:
      – Synchronizing their execution
        • For hazard-free shared memory accesses
      – Efficiently sharing data through a low-latency shared memory (see the sketch below)
    • Two threads from two different blocks cannot cooperate
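    A minimal sketch of such intra-block cooperation (illustrative, not from the slides; assumes the kernel is launched with 256 threads per block): the threads of one block stage data in low-latency shared memory and synchronize before reading it back.

          // Threads of one block cooperate through shared memory;
          // __syncthreads() makes all writes visible (hazard-free) before any reads.
          __global__ void reverse_block(int *data)
          {
              __shared__ int tile[256];                  // per-block shared memory
              int base = blockIdx.x * blockDim.x;
              int t    = threadIdx.x;
              tile[t] = data[base + t];                  // each thread writes one element
              __syncthreads();                           // barrier within the block
              data[base + t] = tile[blockDim.x - 1 - t]; // read an element another thread wrote
          }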

    [Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid consists of thread blocks, Block (0,0) … Block (2,1), and each block consists of threads, Thread (0,0) … Thread (4,2). Courtesy: NVIDIA]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • Block and Thread IDs

    • Threads and blocks have IDs
      – So each thread can decide what data to work on
      – Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
      – Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
    • Simplifies memory addressing when processing multidimensional data (see the indexing sketch below)
      – Image processing
      – Solving PDEs on volumes
      – …
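    For example (standard CUDA indexing, shown as a sketch), block and thread IDs combine to give each thread its own pixel of a 2D image:

          // 2D indexing: each thread derives its pixel coordinates from its IDs.
          __global__ void scale_image(float *img, int width, int height, float s)
          {
              int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
              int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
              if (x < width && y < height)
                  img[y * width + x] *= s;
          }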

    [Figure: a grid of thread blocks, Block (0,0) … Block (2,1), with one block expanded into its threads, Thread (0,0) … Thread (4,2). Courtesy: NVIDIA]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • An Example

    • Multiply two vectors of length 8192
      – Code that works over all elements is the grid
      – Thread blocks break this down into manageable sizes
        • 512 threads per block
        • Many threads to cover the long memory latency
          – On a stall, the hardware switches quickly between threads
      – Thus grid size = 8192 / 512 = 16 blocks
      – Each block is assigned to a multithreaded SIMD processor by the thread block scheduler (see the launch sketch below)
        • Fermi GPUs had 16 multithreaded SIMD processors
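    In CUDA terms (a sketch consistent with the numbers above; the kernel and variable names are assumed), the element-wise multiply of two 8192-element vectors is launched as a grid of 16 blocks of 512 threads:

          // One thread per element; 8192 = 16 blocks x 512 threads, so no guard is needed.
          __global__ void vec_mul(const double *a, const double *b, double *c)
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              c[i] = a[i] * b[i];
          }

          // Grid of 16 thread blocks, 512 threads each; each block is assigned to one
          // multithreaded SIMD processor by the thread block scheduler.
          // vec_mul<<<16, 512>>>(d_a, d_b, d_c);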


  • What Programmer Expresses in CUDA

    [Figure: the HOST (CPU) and the DEVICE (GPU), each with processors (P) and memories (M), joined by an interconnect between devices and memories]

    Slides based on © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

  • Basics

    • Computation partitioning (where does computation occur?)
      – Declarations on functions:
        • __host__   : runs on the CPU
        • __global__ : kernel, called by the host
        • __device__ : code executed on the device, called by a global or another device function
      – Mapping of kernels to the device:
        • compute<<<gs, bs>>>() – gs = grid size (blocks), bs = block size (threads per block); gs × bs = threads of one kernel launch
    • Data partitioning (where does data reside, who may access it, and how?)
      – Declarations on data: __shared__, __device__, __constant__, …
    • Data management and orchestration
      – Copying to/from the host, e.g. with cudaMemcpy (see the sketch below)
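    Putting these pieces together, a minimal sketch of the host/device split (names, sizes, and error handling are assumptions for illustration, not from the slides):

          #include <cuda_runtime.h>

          __device__ float square(float v) { return v * v; }  // callable only from device code

          __global__ void squares(float *d, int n)             // kernel: launched by the host
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) d[i] = square(d[i]);
          }

          void run(float *h, int n)                            // __host__ (implicit): runs on the CPU
          {
              float *d;
              cudaMalloc((void **)&d, n * sizeof(float));
              cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
              int bs = 256, gs = (n + bs - 1) / bs;            // gs * bs threads cover n elements
              squares<<<gs, bs>>>(d, n);                       // kernel<<<grid size, block size>>>()
              cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
              cudaFree(d);
          }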