BLAS and Vectorization extensions


Page 1: BLAS and Vectorization extensions

BLAS and Vectorization extensions.

Carlos Pachajoa

December 5, 2012

Page 2: BLAS and Vectorization extensions

Contents

BLAS

Vectorization extensions for X86

GPGPU

Page 3: BLAS and Vectorization extensions

BLAS

Stands for Basic Linear Algebra Subprograms

Is an interface for linear algebra operations. BLAS itself is a specification for Fortran; the equivalent interface in C is CBLAS.

Use the local implementation with #include <cblas.h>

Page 4: BLAS and Vectorization extensions

BLAS levels

The operations are divided in three levels:

- Level 1: Vector operations like y ← αx + y, with x, y ∈ ℝ^N, and also dot products and vector norms.

- Level 2: Matrix-vector operations like y ← αAx + βy, with x, y ∈ ℝ^N, A ∈ ℝ^(M×N), and solutions of triangular systems, among others.

- Level 3: Matrix-matrix operations like C ← αAB + βC, with C ∈ ℝ^(M×N), A ∈ ℝ^(M×P), B ∈ ℝ^(P×N).

Different calls exist for different precisions and for real or complex numbers.

Page 5: BLAS and Vectorization extensions

BLAS function naming conventions

The first letter specifies precision:

- S for real, single precision.

- D for real, double precision.

- C for complex, single precision.

- Z for complex, double precision.

The first letter is followed by a function name; for example, xAXPY is y ← αx + y from level one. Here, x represents a space and a precision. Therefore, SAXPY will perform the operation using single-precision floating point numbers.
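
As an illustration (not part of the original slides), a minimal C sketch of a level-1 SAXPY call might look as follows; the link flag depends on the CBLAS implementation installed:

/* y <- alpha*x + y in single precision, using CBLAS. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float alpha = 2.0f;

    /* N = 4 elements, unit strides (incX = incY = 1) */
    cblas_saxpy(4, alpha, x, 1, y, 1);

    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);   /* prints 12 24 36 48 */
    printf("\n");
    return 0;
}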

Page 6: BLAS and Vectorization extensions

CBLAS data representation

CBLAS receives data as contiguous positions in memory plus a size. Both matrices and vectors are stored in this way. To specify a matrix, one additionally has to provide a stride (leading dimension) and define whether it is row- or column-major.

The array {1, 2, 3, 4, 5, 6, 7, 8, 9} will be

1 2 3
4 5 6
7 8 9

in row-major order and

1 4 7
2 5 8
3 6 9

in column-major order.

The ordering is given using this enumeration: enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
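
A small sketch of how the same nine-element buffer is indexed under the two orderings (variable names are illustrative):

#include <stdio.h>

int main(void)
{
    /* The 9-element buffer from above, viewed as a 3x3 matrix. */
    float a[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    int n = 3;          /* leading dimension of a dense 3x3 matrix */
    int i = 1, j = 2;   /* element in row 1, column 2 (0-based) */

    float row_major = a[i * n + j];   /* row-major:    6 */
    float col_major = a[j * n + i];   /* column-major: 8 */

    printf("%g %g\n", row_major, col_major);
    return 0;
}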

Page 7: BLAS and Vectorization extensions

A function signature

y ← αAx + βy or y ← αAᵀx + βy

void cblas_sgemv(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const int M,
                 const int N,
                 const float alpha,
                 const float *A,
                 const int lda,
                 const float *X,
                 const int incX,
                 const float beta,
                 float *Y,
                 const int incY);
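
A hedged usage sketch for this signature, computing y ← αAx + βy with a 2×3 row-major matrix (values chosen only for illustration):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    float A[6] = {1, 2, 3,
                  4, 5, 6};   /* 2x3 matrix, row-major */
    float x[3] = {1, 1, 1};
    float y[2] = {0, 0};

    /* M = 2 rows, N = 3 columns, lda = 3 (row length in row-major order) */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0f, A, 3, x, 1, 0.0f, y, 1);

    printf("%g %g\n", y[0], y[1]);   /* prints 6 15 */
    return 0;
}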

Page 8: BLAS and Vectorization extensions

Enumeration types

enum CBLAS_ORDER {CblasRowMajor=101,     /* row-major arrays */
                  CblasColMajor=102};    /* column-major arrays */

enum CBLAS_TRANSPOSE {                   /* whether to work with the transpose */
                  CblasNoTrans=111,      /* trans='N' */
                  CblasTrans=112,        /* trans='T' */
                  CblasConjTrans=113};   /* trans='C' */

enum CBLAS_UPLO {                        /* the matrix is upper or lower triangular */
                  CblasUpper=121,        /* uplo='U' */
                  CblasLower=122};       /* uplo='L' */

enum CBLAS_DIAG {                        /* whether the matrix is unit triangular */
                  CblasNonUnit=131,      /* diag='N' */
                  CblasUnit=132};        /* diag='U' */

enum CBLAS_SIDE {                        /* order of matrix multiplication */
                  CblasLeft=141,         /* side='L' */
                  CblasRight=142};       /* side='R' */

Page 9: BLAS and Vectorization extensions

Some CBLAS implementations

- ATLAS (Automatically Tuned Linear Algebra Software)

- MKL (Math Kernel Library)

- CUBLAS

Page 11: BLAS and Vectorization extensions

SIMD

Single Instruction, Multiple Data

[Figure: SIMD operation diagram, taken from http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif]

Page 12: BLAS and Vectorization extensions

SSE

Streaming SIMD Extensions.

Additional registers in the processor and operations in the architecture.

[Figure: XMM registers, from http://en.wikipedia.org/wiki/File:XMM_registers.svg]

8 registers with 128 bits each; 4 single-precision floating point numbers fit in each register.

Page 13: BLAS and Vectorization extensions

Some instructions

; All instructions end in S.
; The penultimate letter denotes scalar or vector:
; S stands for scalar, P for vector.
; Operands are XMM registers.

; Adds all elements of arrays op1 and op2 into op1
ADDPS op1, op2

; Adds the first elements of op1 and op2 into
; the first position of op1
ADDSS op1, op2

Page 14: BLAS and Vectorization extensions

Some SSE instructions

vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

; xmm0 = v1.w | v1.z | v1.y | v1.x
movaps xmm0, [v1]

; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
addps xmm0, [v2]

movaps [vec_res], xmm0

http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
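
The same four-float addition can also be written in C with SSE intrinsics from <xmmintrin.h>. A minimal sketch, assuming a GCC-style alignment attribute to obtain the 16-byte alignment that movaps/_mm_load_ps requires:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float v1[4] __attribute__((aligned(16))) = {1, 2, 3, 4};
    float v2[4] __attribute__((aligned(16))) = {10, 20, 30, 40};
    float vec_res[4] __attribute__((aligned(16)));

    __m128 a = _mm_load_ps(v1);     /* like: movaps xmm0, [v1] */
    __m128 b = _mm_load_ps(v2);
    __m128 r = _mm_add_ps(a, b);    /* like: addps xmm0, [v2] */
    _mm_store_ps(vec_res, r);       /* like: movaps [vec_res], xmm0 */

    printf("%g %g %g %g\n", vec_res[0], vec_res[1], vec_res[2], vec_res[3]);
    return 0;
}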

Page 15: BLAS and Vectorization extensions

SSE upgrades

SSE2: Allows multiple types of data to be represented in the vectors, including integers and characters, and provides the corresponding operations. AMD's 64-bit implementation also doubled the number of XMM registers.

SSE3: Adds horizontal operations within the XMM registers, such as data reduction.

SSSE3: Additional instructions for SSE3.

SSE4: Introduces Dword multiply, allowing two pairs of 32-bit integers to be multiplied to produce two 64-bit numbers, and vector dot products.

Page 16: BLAS and Vectorization extensions

AVX

Intel's extension to SSE for the Sandy Bridge microarchitecture, introduced in 2011. Also available in AMD's Bulldozer.

[Figure: AVX registers, from http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg]

Page 17: BLAS and Vectorization extensions

Automatic vectorization

The Intel compiler can, under certain conditions, vectorize loops in the code.

This can be activated by using the -vec option.

for (i = 0; i < SIZE; i++)
    A[i] = B[i] + C[i];
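
A minimal complete version of this loop, as a sketch (the array size and names are illustrative; whether ICC vectorizes it still depends on the compilation options):

#define SIZE 1024

float A[SIZE], B[SIZE], C[SIZE];

void add_arrays(void)
{
    int i;
    /* Contiguous, independent iterations: a candidate for auto-vectorization. */
    for (i = 0; i < SIZE; i++)
        A[i] = B[i] + C[i];
}

Because the iterations are independent and the accesses contiguous, the compiler can map several single-precision additions at a time onto the vector registers; the report options that show whether this happened vary between ICC versions.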

Page 18: BLAS and Vectorization extensions

Obstacles to vectorization

- Non-contiguous memory access:
  for (i = 0; i < SIZE; i += stride) A[i] = B[i] + C[i];

- Data dependencies:
  for (i = 0; i < SIZE; i++) A[i] = A[i-1] + B[i];

Page 19: BLAS and Vectorization extensions

Guiding ICC vectorization

- Pragmas, for example #pragma ivdep, among others, to control when to vectorize a loop (see the sketch below).

- Keywords, such as restrict.

- Switches passed to the compiler, such as optimization levels.

Look at the ICC automatic vectorization documentation [11].
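
A hedged sketch combining the two hints mentioned above, #pragma ivdep and the C99 restrict qualifier (function and variable names are illustrative):

/* restrict tells the compiler the pointers do not alias;
 * #pragma ivdep asserts there are no loop-carried dependencies,
 * so the loop is safe to vectorize. */
void scale_add(float *restrict a, const float *restrict b,
               const float *restrict c, int n)
{
    int i;
#pragma ivdep
    for (i = 0; i < n; i++)
        a[i] = b[i] + 2.0f * c[i];
}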

Page 21: BLAS and Vectorization extensions

GPGPU and CPU

CPU

- General purpose

- Pipelines

- Few threads

- Lots of cache (correlation)

GPGPU

- Specialized for local vector operations

- Many cores and threads

- Little cache

- Lower power consumption relative to a CPU

Page 22: BLAS and Vectorization extensions

CUDA

Stands for Compute Unified Device Architecture.

Effectively, a programming model to access and control GPUs using a virtual instruction set, in a similar manner to a CPU.

Only supported on NVIDIA cards.

Uses the NVIDIA compiler, and can be programmed using CUDA C/C++, languages based on C/C++.

Page 23: BLAS and Vectorization extensions

OpenCL

Stands for Open Computing Language

It also provides access to the GPU.

- It's an open standard, supported by NVIDIA and AMD, among others.

- Provides a language based on C99.

- Functionality provided by a driver.

- Compilation handled by linking to the correct library.

Page 24: BLAS and Vectorization extensions

References

[1] http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

[2] http://www.netlib.org/blas/

[3] http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html

[4] http://math-atlas.sourceforge.net/faq.html

[5] http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm

[6] https://developer.nvidia.com/cublas

[7] http://www.godevtool.com/TestbugHelp/XMMfpins.htm

[8] http://software.intel.com/en-us/avx

[9] http://www.khronos.org/opencl/

[10] http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdf

[11] http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm