
CHARIS THEOCHARIDES (ttheocharides@ucy.ac.cy)
ΗΜΥ 656 Advanced Computer Architecture
Spring Semester 2007

LECTURE 7: Parallel Computer Systems

Ack: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler et al., Morgan Kaufmann


What is Parallel Architecture?

A parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:
– Resource allocation:
  • how large a collection?
  • how powerful are the elements?
  • how much memory?
– Data access, communication and synchronization:
  • how do the elements cooperate and communicate?
  • how are data transmitted between processors?
  • what are the abstractions and primitives for cooperation?
– Performance and scalability:
  • how does it all translate into performance?
  • how does it scale?

Why Study Parallel Architecture?

• Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
• Parallelism:
  – Provides an alternative to a faster clock for performance
  – Applies at all levels of system design
  – Is a fascinating perspective from which to view architecture
  – Is increasingly central in information processing

Why Study it Today?

• History: diverse and innovative organizational structures, often tied to novel programming models
• Rapidly maturing under strong technological constraints
  – The "killer micro" is ubiquitous
  – Laptops and supercomputers are fundamentally similar!
  – Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
  – In the mainstream
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
  – Naming, ordering, replication, communication performance

Inevitability of Parallel Computing

• Application demands: our insatiable need for cycles
  – Scientific computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, TP, ...
• Technology trends
  – Number of transistors on chip growing rapidly
  – Clock rates expected to go up only slowly
• Architecture trends
  – Instruction-level parallelism valuable but limited
  – Coarser-level parallelism, as in MPs, the most viable approach
• Economics
• Current trends:
  – Today's microprocessors have multiprocessor support
  – Servers and even PCs becoming MP: Sun, SGI, COMPAQ, Dell, ...
  – Tomorrow's microprocessors are multiprocessors

Application Trends

• Demand for cycles fuels advances in hardware, and vice versa
  – Cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder: most demanding applications
• Range of performance demands
  – Need range of system performance with progressively increasing cost (platform pyramid)
• Goal of applications in using parallel machines: speedup

  Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time

  Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
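As a minimal illustration of these definitions, the C sketch below computes fixed-problem speedup and parallel efficiency from measured wall-clock times; the function names and example timings are hypothetical, not from the slides.

#include <stdio.h>

/* Fixed-problem speedup: time on 1 processor divided by time on p processors. */
static double speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: speedup divided by the number of processors used. */
static double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void) {
    double t1 = 120.0;   /* hypothetical: seconds on 1 processor  */
    double tp = 18.0;    /* hypothetical: seconds on 8 processors */
    int p = 8;
    printf("speedup = %.2f, efficiency = %.2f\n", speedup(t1, tp), efficiency(t1, tp, p));
    return 0;
}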

Scientific Computing Demand

Summary of Application Trends

• Transition to parallel computing has occurred for scientific and engineering computing
• In rapid progress in commercial computing
  – Database and transaction processing as well as financial
  – Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase

Technology Trends

• Commodity microprocessors have caught up with supercomputers.

[Figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995 (log scale, 0.1 to 100)]

Architectural Trends

• Architecture translates technology's gifts into performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  – Helps build intuition about design issues of parallel machines
  – Shows fundamental role of parallelism even in "sequential" computers
• Four generations of architectural history: tube, transistor, IC, VLSI
  – Here focus only on the VLSI generation
• Greatest delineation in VLSI has been in the type of parallelism exploited

Arch. Trends: Exploiting Parallelism

• Greatest trend in the VLSI generation is the increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32 bit
    • adoption of 64-bit now under way, 128-bit far off (not a performance issue)
    • great inflection point when a 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    • pipelining and simple instruction sets, plus compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out-of-order execution, speculation, prediction
      – to deal with control transfer and latency problems
  – Next step: thread-level parallelism

Phases in VLSI Generation

• How good is instruction-level parallelism?
• Thread-level needed in microprocessors?

[Figure: transistor counts (1,000 to 100,000,000) of microprocessors from 1970 to 2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, R10000, Pentium), annotated with the bit-level, instruction-level, and thread-level (?) parallelism phases]

Architectural Trends: ILP

• Reported speedups for superscalar processors:
  – Horst, Harris, and Jardine [1990] .............. 1.37
  – Wang and Wu [1988] ............................. 1.70
  – Smith, Johnson, and Horowitz [1989] ............ 2.30
  – Murakami et al. [1989] ......................... 2.55
  – Chang et al. [1991] ............................ 2.90
  – Jouppi and Wall [1989] ......................... 3.20
  – Lee, Kwok, and Briggs [1991] ................... 3.50
  – Wall [1991] .................................... 5
  – Melvin and Patt [1991] ......................... 8
  – Butler et al. [1991] ........................... 17+
• Large variance due to differences in
  – application domain investigated (numerical versus non-numerical)
  – capabilities of the processor modeled

ILP Ideal Potential

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies

[Figures: fraction of total cycles (%) versus number of instructions issued (0 to 6+); speedup versus instructions issued per cycle (0 to 15)]

Results of ILP Studies

• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that for more parallelism, one must look across threads

[Figure: speedups (1x to 4x) reported by Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91, with one branch unit/real prediction versus perfect branch prediction]

Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate the bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers

[Figures: number of processors in fully configured commercial bus-based shared-memory systems, 1984-1998 (e.g. Sequent B8000/B2100, Symmetry, SGI Challenge/PowerChallenge, Sun SC2000/E6000/E10000, CRAY CS6400, AS8400, HP K400, P-Pro); shared bus bandwidth (MB/s) over the same period for the same systems]

Economics

• Commodity microprocessors are not only fast but CHEAP
  – Development cost is tens of millions of dollars ($5M-$100M typical)
  – BUT many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
  – Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs a commodity
• Desktop: a few smaller processors versus one larger one?
  – Multiprocessor on a chip

History

• Historically, parallel architectures tied to programming models
  – Divergent architectures, with no predictable pattern of growth.

[Figure: application software and system software built directly on divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]

• Uncertainty of direction paralyzed parallel software development!

Today

• Extension of "computer architecture" to support communication and cooperation
  – OLD: Instruction Set Architecture
  – NEW: Communication Architecture
• Defines
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today

History

• "Mainframe" approach:
  – Motivated by multiprogramming
  – Extends the crossbar used for memory bandwidth and I/O
  – Originally processor cost limited it to small scale
    • later, the cost of the crossbar
  – Bandwidth scales with p
  – High incremental cost; use multistage networks instead
• "Minicomputer" approach:
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost

[Figures: crossbar-based organization (processors with caches and I/O connected through a crossbar to memory modules) and bus-based organization (processors with caches, I/O, and memories sharing one bus)]

Example: Intel Pentium Pro Quad

• All coherence and multiprocessing glue in the processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth

[Figure: four P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller/MIU to 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI buses and PCI I/O cards]

Example: SUN Enterprise

• 16 cards of either type: processors + memory, or I/O
• All memory accessed over the bus, so symmetric
• Higher bandwidth, higher latency bus

[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors with L2 caches, memory controller, bus interface/switch) and I/O cards (bus interface, SBUS slots, 2 FiberChannel, 100bT, SCSI)]

Scaling Up

• Problem is the interconnect: cost (crossbar) or bandwidth (bus)
  – Dance-hall: bandwidth still scalable, but lower cost than crossbar
    • latencies to memory uniform, but uniformly large
  – Distributed memory or non-uniform memory access (NUMA)
    • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
  – Caching shared (particularly nonlocal) data?

[Figure: "dance hall" organization (processors with caches on one side of the network, memories on the other) versus distributed-memory organization (a memory attached to each processor node)]

Example: Cray T3E

• Scales up to 1024 processors, 480 MB/s links
• Memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)

[Figure: T3E node (processor, cache, memory controller and network interface, local memory, external I/O) attached to a switch in a 3-D (X, Y, Z) network]

Message Passing Architectures

• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model:
  – directly access only the private address space (local memory)
  – communicate via explicit messages (send/receive)
• High-level block diagram similar to distributed-memory SAS
  – But communication integrated at the I/O level, need not be put into the memory system
  – Like networks of workstations (clusters), but tighter integration
  – Easier to build than scalable SAS
• Programming model further from basic hardware operations
  – Library or OS intervention

Message Passing Abstraction

• Send specifies the buffer to be transmitted and the receiving process
• Recv specifies the sending process and the application storage to receive into
• Memory-to-memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves a pairwise synchronization event
  – Other variants too
• Many overheads: copying, buffer management, protection

[Figure: process P executes Send X, Q, t and process Q executes Receive Y, P, t; the matching send/receive copies address X in P's local address space to address Y in Q's local address space]
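This send/receive pairing maps directly onto message-passing libraries such as MPI. The short C sketch below is a minimal illustration of my own, not taken from the slides: a tagged send matched by a receive that names the sending process and the tag.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, y = 0, tag = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* process P: send local X to process 1 with tag t */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {          /* process Q: receive into local Y, naming sender and tag */
        MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}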

Evolution of Message Passing

• Early machines: FIFO on each link
  – Hardware close to programming model
    • synchronous ops
  – Replaced by DMA, enabling non-blocking ops
    • Buffered by the system at the destination until recv
• Diminishing role of topology
  – Store & forward routing: topology important
  – Introduction of pipelined routing made it less so
  – Cost is in the node-network interface
  – Simplifies programming

[Figure: 3-D hypercube with nodes labeled 000 through 111]

Example: IBM SP-2

• Made out of essentially complete RS6000 workstations
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)

[Figure: IBM SP-2 node (Power 2 CPU, L2 cache, memory bus, memory controller, 4-way interleaved DRAM) with a NIC (i860, NI, DMA) on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]

Example: Intel Paragon

[Figure: Intel Paragon node (two i860 processors with L1 caches, memory bus (64-bit, 50 MHz), memory controller, 4-way interleaved DRAM, driver, NI with DMA) attached to a 2-D grid network (8 bits, 175 MHz, bidirectional) with a processing node at every switch]

Sandia's Intel Paragon XP/S-based supercomputer

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct a global address space on MP using hashing
  – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
  – Tighter NI integration even for MP (low-latency, high-bandwidth)
  – At a lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines

Data Parallel Systems

• Programming model:
  – Operations performed in parallel on each element of a data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
• Architectural model:
  – Array of many simple, cheap processors with little memory each
    • Processors don't sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Original motivation:
  – Matches simple differential equation solvers
  – Centralizes the high cost of instruction fetch & sequencing

[Figure: control processor broadcasting instructions to a 2-D array of processing elements (PEs)]

Application of Data Parallelism

• Each PE contains an employee record with his/her salary

  If salary > 100K then
      salary = salary * 1.05
  else
      salary = salary * 1.10

  – Logically, the whole operation is a single step
  – Some processors enabled for the arithmetic operation, others disabled
  (a C sketch of the equivalent per-element loop follows this slide)
• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...
• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2
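As referenced above, here is a minimal sketch of my own (not from the slides) of how the same per-element salary update looks as a loop on a conventional machine, using an OpenMP pragma to distribute the independent iterations; on a true SIMD array each element would instead sit in its own PE and the branch would become an enable mask.

/* Per-element salary update: the data-parallel operation applied to every record.
   With OpenMP each iteration is independent, mirroring "one PE per element";
   without -fopenmp the pragma is ignored and the loop runs sequentially. */
void update_salaries(double *salary, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        if (salary[i] > 100000.0)
            salary[i] *= 1.05;   /* PEs whose salary > 100K enabled for this branch */
        else
            salary[i] *= 1.10;   /* the remaining PEs enabled for the other branch  */
    }
}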

Parallel Computer Architectures

• Flynn's Classification
• Legacy Parallel Computers
• Current Parallel Computers
• Trends in Supercomputers
• Converged Parallel Computer Architecture

Parallel Architectures Tied to Programming Models

[Figure: application software and system software built directly on divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]

• Uncertainty of direction paralyzed parallel software development!
• Divergent architectures, no predictable pattern of growth

Flynn's Classification

• Based on notions of instruction and data streams (1972)
  – SISD (Single Instruction stream over a Single Data stream)
  – SIMD (Single Instruction stream over Multiple Data streams)
  – MISD (Multiple Instruction streams over a Single Data stream)
  – MIMD (Multiple Instruction streams over Multiple Data streams)
• Popularity: MIMD > SIMD > MISD

SISD (Single Instruction Stream over a Single Data Stream)

• SISD: conventional sequential machines

[Figure: CU sends the IS to the PU, which exchanges the DS with the MU and I/O]
IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit

SIMD (Single Instruction Stream over Multiple Data Streams)

• SIMD
  – Vector computers
  – Special-purpose computations

[Figure: SIMD architecture with distributed memory: CU broadcasts the IS to PE1..PEn, each with its local memory LM1..LMn; the program and the data sets are loaded from the host]
PE: Processing Element, LM: Local Memory

MISD (Multiple Instruction Streams over a Single Data Stream)

• MISD
  – Processor arrays, systolic arrays
  – Special-purpose computations

[Figure: MISD architecture (the systolic array): memory (program, data) feeds a single data stream through PU1, PU2, ..., PUn, each controlled by its own CU and instruction stream]

MIMD (Multiple Instruction Streams over Multiple Data Streams)

• MIMD
  – General-purpose parallel computers

[Figure: MIMD architecture with shared memory: CU1..CUn drive PU1..PUn, whose data streams go to a shared memory and I/O]

Dataflow Architectures

• Represent computation as a graph of essential dependences
  – A logical processor at each node, activated by the availability of operands
  – Messages (tokens) carrying the tag of the next instruction are sent to the next processor
  – Tag compared with others in the matching store; a match fires execution

Example dataflow graph for:
  a = (b + 1) × (b − c)
  d = c × e
  f = a × d

[Figures: dataflow graph of the expressions above; machine organization with token queue, waiting/matching store, instruction fetch from the program store, execute, form token, and network]

Convergence: General Parallel Architecture

• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within a framework
  – Integration of the assist with the node, what operations, how efficiently...

[Figure: a generic modern multiprocessor: nodes (processor, cache, memory, communication assist (CA)) connected by a scalable network]

Data Flow vs. Control Flow

• Control-flow computer
  – Program control is explicitly determined by the instruction flow in the program
  – Basic components
    • PC (Program Counter)
    • Shared memory
    • Control sequencer

Data Flow vs. Control Flow (Cont'd)

• Advantages
  – Full control
  – Complex data and control structures are easily implemented
• Disadvantages
  – Less efficient
  – Difficult to program
  – Difficult to prevent runtime errors

Data Flow vs. Control Flow (Cont'd)

• Data-flow computer
  – Execution of instructions is driven by data availability
  – Basic components
    • Data are held directly inside instructions
    • Data availability check unit
    • Token matching unit
    • Chain reaction of asynchronous instruction executions

Data Flow vs. Control Flow (Cont'd)

• Advantages
  – Very high potential for parallelism
  – High throughput
  – Free from side effects
• Disadvantages
  – Time lost waiting for unneeded arguments
  – High control overhead
  – Difficult to manipulate data structures

Dataflow Machine (Manchester Dataflow Computer)

• First actual hardware implementation
• Token: <data, tag, dest, marker>
• Match: <tag, dest>

[Figure: Manchester dataflow pipeline: I/O switch (from/to host), token queue, matching unit with overflow unit, instruction store, function units func1..funck, connected by a network]

Execution on Control Flow Machines

• Assume all the external inputs are available before entering the do loop
• + : 1 cycle, * : 2 cycles, / : 3 cycles
• Sequential execution on a uniprocessor takes 24 cycles

[Figure: sequential schedule of the a(i), b(i), c(i) operations on one processor]

Execution on a Data Flow Machine

• Data-driven execution on a 4-processor dataflow computer takes 9 cycles

[Figure: dataflow schedule of a1..a4, b1..b4, c1..c4 on four processors]

• Parallel execution on a shared-memory 4-processor system takes 7 cycles, using partial sums:
  s1 = b1 + b2, t1 = s1 + b3
  s2 = b3 + b4, t2 = s1 + s2

[Figure: shared-memory schedule of a1..a4, b1..b4, s1, s2, t1, t2 on four processors]

Problems

• Excessive copying of large data structures in dataflow operations
  – I-structure: a tagged memory unit for overlapped usage by the producer and consumer
  – Retreat from the pure dataflow approach (shared memory)
• Handling complex data structures
• Chain reaction control is difficult to implement
  – Complexity of the matching store and memory units
• Exposes too much parallelism (?)

Convergence of Dataflow Machines

• Converged to use conventional processors and memory
  – Support for a large, dynamic set of threads to map to processors
  – Operations have locality across them, useful to group together
• Typically a shared address space as well
  – But separation of programming model from hardware (like data-parallel)
• Contributions
  – Integration of communication with thread (handler) generation
  – Tightly integrated communication and fine-grained synchronization
    • Each instruction represents a synchronization operation
  – Absorb the communication latency and minimize the losses due to synchronization waits

Systolic Architectures

• Replace a single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access

[Figure: conventional organization (memory feeding one PE) versus systolic organization (memory feeding a chain of PEs)]

• Different from pipelining:
  – Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in a regular pattern

Systolic Arrays (Cont)

• Example: systolic array for 1-D convolution

  y(i) = Σ_{j=1}^{k} w(j) * x(i-j)

[Figure: weights W(1)..W(k) held in the PEs; x(i+1), x(i), x(i-1), ..., x(i-k) stream in one direction while partial results y(i), y(i+1), ..., y(i+k), y(i+k+1) stream in the other]

• Practical realizations (e.g. iWARP) use quite general processors
  – Enable a variety of algorithms on the same hardware
• But dedicated interconnect channels
  – Data transfer directly from register to register across a channel
• Specialized, and same problems as SIMD
  – General-purpose systems work well for the same algorithms (locality etc.)

Systolic Architectures

• Orchestrate data flow for high throughput with less memory access
• Different from pipelining
  – Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory
• Different from SIMD
  – Each PE may do something different
• Initial motivation
  – VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in a regular pattern

Systolic Architectures

[Figure: conventional organization (memory and one PE) versus systolic arrays (memory and a chain of PEs)]

• Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth

Two Communication Styles

[Figure: systolic communication (CPUs passing data directly to neighboring CPUs, each with its own local memory) versus memory communication (CPUs exchanging data through their local memories)]

Characteristics

• Practical realizations (e.g. Intel iWARP) use quite general processors
  – Enable a variety of algorithms on the same hardware
• But dedicated interconnect channels
  – Data transfer directly from register to register across a channel
• Specialized, and same problems as SIMD
  – General-purpose systems work well for the same algorithms (locality etc.)

Matrix Multiplication

  y(i) = Σ_{j=1}^{n} a(i,j) * x(j),   i = 1, ..., n

• Recursive algorithm:

  for i = 1 to n
      y(i,0) = 0
      for j = 1 to n
          y(i,0) = y(i,0) + a(i,j) * x(j,0)

• Use the following PE (weight w held inside; x streams through, y accumulates):

  xout = x
  x    = xin
  yout = yin + w * xin
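A minimal C sketch of my own (not from the slides) of the recursive algorithm above, computing y = A·x directly with the same accumulation each PE performs; the systolic array realizes the identical recurrence by streaming x and the partial sums through the PE chain.

#include <stddef.h>

/* Direct implementation of the slide's recurrence y(i) = sum_j a(i,j) * x(j).
   The matrix a is stored row-major as an n*n array. */
void matvec(const double *a, const double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            y[i] += a[i * n + j] * x[j];   /* same multiply-accumulate as yout = yin + w * xin */
    }
}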

Systolic Array Representation of Matrix Multiplication

[Figure: x1, x2, x3, x4 stream into a linear array of PEs whose outputs y1, y2, y3, y4 are initialized to 0]

Example of Convolution

  y(i) = w1 * x(i) + w2 * x(i+1) + w3 * x(i+2) + w4 * x(i+3)
  y(i) is initialized to 0.

[Figure: weights w1..w4 held in four PEs; the x stream (x1, x2, ..., x8) enters from one side while the partial results y1, y2, y3 move in the opposite direction]

• Each PE implements:

  xout = x
  x    = xin
  yout = yin + w * xin
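As a check on the formula, here is a direct C sketch of my own (not from the slides) that computes the same y(i) values with a plain loop; the systolic array produces identical results but keeps each weight fixed in a PE and streams x and y through the chain.

#include <stddef.h>

/* Direct evaluation of y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3),
   with the slide's 1-based indices mapped to 0-based C arrays.
   x has nx elements, so nx - 3 outputs are produced. */
void convolve4(const double w[4], const double *x, double *y, size_t nx) {
    for (size_t i = 0; i + 3 < nx; i++) {
        y[i] = 0.0;                       /* y(i) is initialized to 0 */
        for (size_t j = 0; j < 4; j++)
            y[i] += w[j] * x[i + j];      /* same multiply-accumulate each PE performs */
    }
}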

Data Parallel Systems

• Programming model
  – Operations performed in parallel on each element of a data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element

Data Parallel Systems (Cont'd)

• SIMD architectural model
  – Array of many simple, cheap processors with little memory each
  – Processors don't sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization

Evolution & Convergence

• Popular due to cost savings of a centralized sequencer
  – Replaced by vector processors in the mid-70s
  – Revived in the mid-80s when a 32-bit datapath slice fit on a chip
• Old machines
  – Thinking Machines CM-1, CM-2
  – Maspar MP-1 and MP-2

Evolution & Convergence (Cont'd)

• Drawbacks
  – Low applicability
  – Simple, regular applications can do well anyway on MIMD
• Convergence: SPMD (Single Program Multiple Data)
  – Fast global synchronization is needed

Vector Processors

• Merits of vector processors
  – Very deep pipeline without data hazards
    • The computation of each result is independent of the computation of previous results
  – Instruction bandwidth requirement is reduced
    • A vector instruction specifies a great deal of work
  – Control hazards are nonexistent
    • A vector instruction represents an entire loop
    • No loop branch

Vector Processors (Cont'd)

  – The high latency of initiating a main memory access is amortized
    • A single access is initiated for the entire vector rather than for a single word
    • Known access pattern
    • Interleaved memory banks
• Vector operations are faster than a sequence of scalar operations on the same number of data items!

Vector Programming Example: Y = a * X + Y

RISC machine (the loop body repeats 64 times; 512 = 8 bytes * 64 elements):

      LD    F0, a            ; load scalar a
      ADDI  R4, Rx, #512     ; last address to load
Loop: LD    F2, 0(Rx)        ; load X(i)
      MULTD F2, F0, F2       ; a x X(i)
      LD    F4, 0(Ry)        ; load Y(i)
      ADDD  F4, F2, F4       ; a x X(i) + Y(i)
      SD    F4, 0(Ry)        ; store into Y(i)
      ADDI  Rx, Rx, #8       ; increment index to X
      ADDI  Ry, Ry, #8       ; increment index to Y
      SUB   R20, R4, Rx      ; compute bound
      BNZ   R20, Loop        ; check if done

Vector Programming Example (Cont'd)

Vector machine (6 instructions: low instruction bandwidth):

      LD     F0, a           ; load scalar
      LV     V1, Rx          ; load vector X
      MULTSV V2, F0, V1      ; vector-scalar multiply
      LV     V3, Ry          ; load vector Y
      ADDV   V4, V2, V3      ; add
      SV     Ry, V4          ; store the result

Basic Vector Architecture

• Vector-register processor
  – All vector operations except load and store are among the vector registers
  – The major vector computers
• Memory-memory vector processor
  – All vector operations are memory to memory
  – The first vector computer

A Vector-Register Architecture (DLXV)

[Figure: main memory feeding a vector load-store unit; vector registers and scalar registers connected through crossbars to pipelined functional units such as FP add/subtract]

Vector Machines

Machine          Registers   Elements per register   Load/Store units   Functional units
CRAY-1           8           64                      1                  6
CRAY-2           8           64                      1                  5
CRAY X-MP        8           64                      2 Ld / 1 St        8
CRAY Y-MP        8           64                      2 Ld / 1 St        8
CRAY C-90        8           128                     4                  8
NEC SX/2         8 + 8192    256                     8                  16
NEC SX/4         8 + 8192    256                     8                  16
Fujitsu VP200    8 - 256     32 - 1024               2                  3
Hitachi S820     32          256                     4                  4
Convex C-1       8           128                     1                  4

Convoy

• Convoy
  – The set of vector instructions that could begin execution in one clock period
  – The instructions in a convoy must not contain any structural or data hazards
• Chime
  – An approximate measure of execution time for a vector sequence (one chime per convoy)

Convoy Example

  LV     V1, Rx       ; load vector X
  MULTSV V2, F0, V1   ; vector-scalar multiply
  LV     V3, Ry       ; load vector Y
  ADDV   V4, V2, V3   ; add
  SV     Ry, V4       ; store the result

Convoys:
  1. LV
  2. MULTSV  LV
  3. ADDV
  4. SV

Chime = 4

Strip Mining

• Strip mining: when a vector has a length greater than that of the vector registers, segment the long vector into fixed-length segments (size = MVL, the maximum vector length).

• Total execution time:

  T_n = ceil(n / MVL) * (T_loop + T_start) + n * T_chime

  Example: T_200 = ceil(200 / 64) * (15 + 49) + 200 * 4 = 4 * 64 + 800 = 1056 cycles

Strip Mining Example

Original loop:

      do 10 i = 1, n
   10 Y(i) = a * X(i) + Y(i)

Strip-mined version:

      low = 1
      VL = (n mod MVL)             ; find the odd-size piece
      do 1 j = 0, (n/MVL)          ; outer loop
          do 10 i = low, low+VL-1  ; runs for length VL
              Y(i) = a*X(i) + Y(i) ; main operation
   10     continue
          low = low + VL           ; start of next vector
          VL = MVL                 ; reset the length to max
    1 continue
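For comparison, a small C sketch of my own (assuming MVL = 64; not from the slides) of the same strip-mining idea: process the odd-size piece first, then full MVL-length strips.

#define MVL 64   /* assumed maximum vector length */

/* Strip-mined DAXPY: Y = a*X + Y, processing the odd-size piece first,
   then strips of exactly MVL elements, mirroring the Fortran version above. */
void daxpy_stripmined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;                            /* odd-size piece (may be 0) */
    while (low < n) {
        for (int i = low; i < low + vl; i++)     /* one strip = one vector-length chunk */
            y[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;                                /* all remaining strips are full length */
    }
}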

Performance Enhancement Techniques

• Chaining
• Conditional execution
• Sparse matrices

Chaining

• Chaining
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  – Permits the operations to be scheduled in the same convoy and reduces the number of chimes required

Chaining Example

Start-up overheads:

  Unit                  Start-up overhead
  Load and store unit   12 cycles
  Multiply unit         7 cycles
  Add unit              6 cycles

Vector sequence (vector length 64):

  MULTV V1, V2, V3
  ADDV  V4, V1, V5

Unchained: the ADDV waits until the entire MULTV completes:
  7 + 64 + 6 + 64 = 141 cycles

Chained: the ADDV starts as soon as the first MULTV result is available:
  7 + 64 + 6 = 77 cycles

Conditional Execution

• Vector mask register
  – Any vector instruction executed operates only on the vector elements whose corresponding entries in the vector-mask register are 1
  – Requires execution time even when the condition is not satisfied
  – The elimination of a branch and the associated control dependences can make a conditional instruction faster

Conditional Execution Example

      do 10 i = 1, 64
          if (A(i) .ne. 0) then
              A(i) = A(i) - B(i)
          endif
   10 continue

  LV    V1, Ra        ; load vector A into V1
  LV    V2, Rb        ; load vector B
  LD    F0, #0        ; load FP zero into F0
  SNESV F0, V1        ; sets the VM to 1 if V1(i) != F0
  SUBV  V1, V1, V2    ; subtract under the VM
  CVM                 ; set the VM to all 1s
  SV    Ra, V1        ; store the result in A

Sparse Matrix

• Sparse matrix
  – Only a small number of elements are non-zero
• Scatter-gather operations using index vectors
  – Move between a dense representation and the normal representation of a sparse matrix
  – Gather: build the dense representation
  – Scatter: return to the sparse representation

Sparse Matrix Example

      do 10 i = 1, n
   10 A(K(i)) = A(K(i)) + C(M(i))

  LV   Vk, Rk          ; load K
  LVI  Va, (Ra+Vk)     ; load A(K(i)) - gather
  LV   Vm, Rm          ; load M
  LVI  Vc, (Rc+Vm)     ; load C(M(i)) - gather
  ADDV Va, Va, Vc      ; add them
  SVI  (Ra+Vk), Va     ; store A(K(i)) - scatter

  LVI: Load Vector Indexed
  SVI: Store Vector Indexed
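A minimal C sketch of my own (not from the slides) of what the indexed loads and stores accomplish: gather through the index vectors, add, then scatter back.

/* Scalar equivalent of the gathered/scattered vector sequence above:
   A(K(i)) = A(K(i)) + C(M(i)) for i = 0..n-1 (0-based indices here). */
void sparse_update(double *a, const double *c, const int *k, const int *m, int n) {
    for (int i = 0; i < n; i++) {
        double va = a[k[i]];      /* LVI: gather A(K(i)) */
        double vc = c[m[i]];      /* LVI: gather C(M(i)) */
        a[k[i]] = va + vc;        /* ADDV, then SVI: scatter back */
    }
}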

Current Parallel Computer Architectures

• MIMD
  – Multiprocessor (shared address space)
    • PVP
    • SMP
    • DSM
  – Multicomputer (message passing)
    • MPP
    • Constellation
    • Cluster

Programming Models

• What does the programmer use in coding applications?
• Specifies communication and synchronization
• Classification
  – Uniprocessor model: von Neumann model
  – Multiprogramming: no communication and synchronization at program level
    • (ex) CAD
  – Shared address space
    • Symmetric multiprocessor model
    • CC-NUMA model
  – Message passing
  – Data parallel

Communication Architecture

• User/System interface
  – The communication primitives exposed to user level realize the programming model
• Implementation
  – How to implement the primitives: HW or OS
  – How optimized are they?
  – Network structure

Communication Architecture (Cont'd)

• Goals
  – Performance and cost
  – Programmability
  – Scalability
  – Broad applicability

Shared Address Space Architecture

• Any processor can directly reference any memory location
  – Communication occurs implicitly through loads and stores
• Natural extension of the uniprocessor model
  – Location transparency
  – Good throughput on multiprogrammed workloads
• OS used shared memory to coordinate processes
• Shared memory multiprocessors
  – SMP: every processor has equal access to the shared memory, the I/O devices, and the OS services; UMA architecture
  – NUMA: distributed shared memory

Shared Address Space Model

• Virtual address spaces for a collection of processes communicating via shared addresses

[Figure: the shared portion of each process's virtual address space (P0..Pn) maps to common physical addresses in the machine physical address space, while each private portion maps to private physical memory; a store by one process to a shared address is seen by a load in another]
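To illustrate "communication occurs implicitly through loads and stores", here is a small POSIX-threads C sketch of my own (not from the slides): one thread stores to a shared variable and another loads it, with a mutex and condition variable providing the synchronization that the shared address space itself does not.

#include <pthread.h>
#include <stdio.h>

/* Shared portion of the address space: both threads reference the same variables. */
static int shared_value = 0;
static int ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;            /* communication is just a store ...              */
    ready = 1;
    pthread_cond_signal(&cond);   /* ... plus explicit synchronization              */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    printf("consumer loaded %d\n", shared_value);   /* ... and a load */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}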

SAS History

• "Mainframe" approach
  – Motivated by multiprogramming
  – Extends the crossbar used for memory bandwidth and I/O
  – Bandwidth scales with p
  – High incremental cost; use multistage networks instead

[Figure: crossbar switch connecting processors (with caches) and I/O to multiple memory modules]

• "Minicomputer" approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost

[Figure: bus connecting processors (with caches), I/O, and memory modules]

Scaling Up

• Problem is the interconnect: cost (crossbar) or bandwidth (bus)
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
  – Latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
  – Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?

Organizations

[Figures: dance-hall organization (processors with caches on one side of the network, memories on the other) versus distributed-memory organization (each processor-cache pair has its own memory, all connected by the network)]

Multiprocessors (Shared Address Space Architecture)

• PVP (Parallel Vector Processor)
  – A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
  – A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
  – Similar to SMP, but the memory is physically distributed among the nodes

PVP (Parallel Vector Processor)

[Figure: vector processors (VP) connected by a crossbar switch to shared-memory modules (SM)]
VP: Vector Processor, SM: Shared Memory

SMP (Symmetric Multi-Processor)

[Figure: microprocessor/cache nodes (P/C) connected by a bus or crossbar switch to shared-memory modules (SM)]
P/C: Microprocessor and Cache

DSM (Distributed Shared Memory)

[Figure: nodes (P/C, local memory LM, cache directory DIR, NIC, memory bus MB) connected by a custom-designed network]
DIR: Cache Directory

MPP (Massively Parallel Processing)

[Figure: nodes (P/C, local memory LM, NIC, memory bus MB) connected by a custom-designed network]
MB: Memory Bus, NIC: Network Interface Circuitry

Cluster

[Figure: nodes (P/C, memory M, bridge, local disk LD, NIC on the I/O bus IOB) connected by a commodity network (Ethernet, ATM, Myrinet, VIA)]
LD: Local Disk, IOB: I/O Bus

Constellation

[Figure: nodes that are themselves shared-memory multiprocessors with >= 16 processors (P/C, shared memory SM, hub, local disk LD, NIC, I/O controller IOC) connected by a custom or commodity network]
IOC: I/O Controller

Trend in Parallel Computer Architectures

[Figure: number of HPC systems (0 to 400) by architecture class (MPPs, constellations, clusters, SMPs) over the years 1997-2002]

Food-Chain of High-Performance Computers

Converged Architecture of Current Supercomputers

[Figure: multiple multiprocessor nodes (four processors sharing a memory each) connected by an interconnection network]