ΗΜΥ 656 Advanced Computer Architecture, Spring Semester 2007
ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected])
Lecture 7 (ΔΙΑΛΕΞΗ 7): Parallel Computer Systems
Ack: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler et al., Morgan Kaufmann

Page 1

ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected])
ΗΜΥ 656 Advanced Computer Architecture, Spring Semester 2007
Lecture 7: Parallel Computer Systems
Ack: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler et al., Morgan Kaufmann

Page 2

What is Parallel Architecture?

A parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:
– Resource Allocation:
  • how large a collection?
  • how powerful are the elements?
  • how much memory?
– Data access, Communication and Synchronization:
  • how do the elements cooperate and communicate?
  • how are data transmitted between processors?
  • what are the abstractions and primitives for cooperation?
– Performance and Scalability:
  • how does it all translate into performance?
  • how does it scale?

Page 3

Why Study Parallel Architecture?

• Role of a computer architect:
  To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.
• Parallelism:
  – Provides alternative to faster clock for performance
  – Applies at all levels of system design
  – Is a fascinating perspective from which to view architecture
  – Is increasingly central in information processing

Page 4

Why Study it Today?

• History:
  – diverse and innovative organizational structures, often tied to novel programming models
  – Rapidly maturing under strong technological constraints
• The "killer micro" is ubiquitous
  – Laptops and supercomputers are fundamentally similar!
  – Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
  – In the mainstream
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
  – Naming, Ordering, Replication, Communication performance

Page 5

Inevitability of Parallel Computing

• Application demands: our insatiable need for cycles
  – Scientific computing: CFD, Biology, Chemistry, Physics, ...
  – General-purpose computing: Video, Graphics, CAD, Databases, TP...
• Technology Trends
  – Number of transistors on chip growing rapidly
  – Clock rates expected to go up only slowly
• Architecture Trends
  – Instruction-level parallelism valuable but limited
  – Coarser-level parallelism, as in MPs, the most viable approach
• Economics
• Current trends:
  – Today's microprocessors have multiprocessor support
  – Servers & even PCs becoming MP: Sun, SGI, COMPAQ, Dell, ...
  – Tomorrow's microprocessors are multiprocessors

Page 6

Application Trends

• Demand for cycles fuels advances in hardware, and vice-versa
  – Cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder: most demanding applications
• Range of performance demands
  – Need range of system performance with progressively increasing cost (platform pyramid)
• Goal of applications in using parallel machines: Speedup

  Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

  Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)

Page 7

Scientific Computing Demand

Page 8

Summary of Application Trends

• Transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
  – Database and transactions as well as financial
  – Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase

Page 9

Technology Trends

• Commodity microprocessors have caught up with supercomputers.

[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]

Page 10

Architectural Trends

• Architecture translates technology's gifts to performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  – Helps build intuition about design issues of parallel machines
  – Shows fundamental role of parallelism even in "sequential" computers
• Four generations of architectural history: tube, transistor, IC, VLSI
  – Here focus only on the VLSI generation
• Greatest delineation in VLSI has been in type of parallelism exploited

Page 11

Arch. Trends: Exploiting Parallelism

• Greatest trend in VLSI generation is increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32-bit
    • adoption of 64-bit now under way, 128-bit far (not a performance issue)
    • great inflection point when 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    • pipelining and simple instruction sets, + compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out of order execution, speculation, prediction
      – to deal with control transfer and latency problems
  – Next step: thread-level parallelism

Page 12

Phases in VLSI Generation

• How good is instruction-level parallelism?
• Is thread-level parallelism needed in microprocessors?

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, marking the bit-level, instruction-level, and thread-level (?) parallelism eras; processors shown include i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]

Page 13

Architectural Trends: ILP

• Reported speedups for superscalar processors:
  • Horst, Harris, and Jardine [1990] ................ 1.37
  • Wang and Wu [1988] .............................. 1.70
  • Smith, Johnson, and Horowitz [1989] ............. 2.30
  • Murakami et al. [1989] .......................... 2.55
  • Chang et al. [1991] ............................. 2.90
  • Jouppi and Wall [1989] .......................... 3.20
  • Lee, Kwok, and Briggs [1991] .................... 3.50
  • Wall [1991] ..................................... 5
  • Melvin and Patt [1991] .......................... 8
  • Butler et al. [1991] ............................ 17+
• Large variance due to difference in
  – application domain investigated (numerical versus non-numerical)
  – capabilities of processor modeled

Page 14

ILP Ideal Potential

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies

[Figure: left, fraction of total cycles (%) versus number of instructions issued (0 to 6+); right, speedup (0 to 3) versus instructions issued per cycle (0 to 15)]

Page 15

Results of ILP Studies

• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that for more parallelism, one must look across threads

[Figure: speedup (1x to 4x) for Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, Melvin_91, comparing 1 branch unit / real prediction against perfect branch prediction]

Page 16

Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers

[Figure: number of processors (0 to 70) in fully configured commercial bus-based shared-memory systems, 1984 to 1998; systems include Sequent B8000/B2100, Symmetry 81/21, SS690MP 120/140, SS10, SS20, SS1000/SS1000E, SE10/SE30/SE60/SE70, SC2000/SC2000E, AS2100, AS8400, HP K400, SGI PowerSeries, SGI Challenge, SGI PowerChallenge/XL, CRAY CS6400, Sun E6000, Sun E10000, P-Pro]

Page 17

Bus Bandwidth

[Figure: shared bus bandwidth (MB/s, log scale, 10 to 100,000) versus year, 1984 to 1998, for the same bus-based systems as the previous figure]

Page 18

Economics

• Commodity microprocessors not only fast but CHEAP
  – Development cost is tens of millions of dollars ($5-100M typical)
  – BUT, many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
  – Exotic parallel architectures are now no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs a commodity
• Desktop: few smaller processors versus one larger one?
  – Multiprocessor on a chip

Page 19

History

• Historically, parallel architectures tied to programming models
  – Divergent architectures, with no predictable pattern of growth

[Figure: application software and system software sitting atop divergent architectures: SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]

• Uncertainty of direction paralyzed parallel software development!

Page 20

Today

• Extension of "computer architecture" to support communication and cooperation
  – OLD: Instruction Set Architecture
  – NEW: Communication Architecture
• Defines
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today

Page 21

History

• "Mainframe" approach:
  – Motivated by multiprogramming
  – Extends crossbar used for memory bandwidth and I/O
  – Originally processor cost limited to small scale
    • later, cost of crossbar
  – Bandwidth scales with p
  – High incremental cost; use multistage instead
• "Minicomputer" approach:
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost

[Figure: crossbar organization (processors with caches and I/O connected to memory banks through a crossbar) versus bus organization (processors with caches and I/O sharing a bus to memory)]

Page 22

Example: Intel Pentium Pro Quad

• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth

[Figure: four P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges to PCI buses with PCI I/O cards]

Page 23

Example: SUN Enterprise

• 16 cards of either type: processors + memory, or I/O
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus

[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors with L2 caches, memory controller, bus interface) and I/O cards (bus interface/switch, SBUS slots, 2 FiberChannel, 100bT, SCSI)]

Page 24

Scaling Up

• Problem is interconnect: cost (crossbar) or bandwidth (bus)
  – Dance-hall: bandwidth still scalable, but lower cost than crossbar
    • latencies to memory uniform, but uniformly large
  – Distributed memory or non-uniform memory access (NUMA)
    • Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?

[Figure: "Dance hall" organization (processors with caches on one side of the network, memories on the other) versus distributed-memory organization (a memory attached to each processor node)]

Page 25

Example: Cray T3E

• Scales up to 1024 processors, 480 MB/s links
• Memory controller generates communication request for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)

[Figure: T3E node with processor and cache, memory controller and network interface, local memory, and a switch with X, Y, Z links and external I/O]

Page 26

Message Passing Architectures

• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model:
  – directly access only private address space (local memory)
  – communicate via explicit messages (send/receive)
• High-level block diagram similar to distributed-memory SAS
  – But communication integrated at I/O level, need not put into memory system
  – Like networks of workstations (clusters), but tighter integration
  – Easier to build than scalable SAS
• Programming model further from basic hardware operations
  – Library or OS intervention

Page 27

Message Passing Abstraction

• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory-to-memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves a pairwise synchronization event
  • Other variants too
• Many overheads: copying, buffer management, protection

[Figure: process P executes Send X, Q, t and process Q executes Receive Y, P, t; the matching pair copies data from address X in P's local address space to address Y in Q's local address space]
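
As a concrete rendering of the send/recv abstraction above, here is a minimal two-process sketch in C using MPI; the library choice, the tag value 42, and the variable names are illustrative assumptions, not part of the slide.

    /* Process 0 plays P (sender), process 1 plays Q (receiver). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double x = 3.14, y = 0.0;          /* "address X" and "address Y" */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send X, Q, t : buffer, count, type, destination process, tag */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive Y, P, t : matched on (source, tag) */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %f\n", y);
        }
        MPI_Finalize();
        return 0;
    }

In this sketch the blocking MPI_Recv is what provides the pairwise synchronization event mentioned above.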

Page 28

Evolution of Message Passing

• Early machines: FIFO on each link
  – Hardware close to programming model
    • synchronous ops
  – Replaced by DMA, enabling non-blocking ops
    • Buffered by system at destination until recv
• Diminishing role of topology
  – Store & forward routing: topology important
  – Introduction of pipelined routing made it less so
  – Cost is in node-network interface
  – Simplifies programming

[Figure: 3-dimensional hypercube with nodes labeled 000 through 111]

Page 29

Example: IBM SP-2

• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)

[Figure: IBM SP-2 node with Power 2 CPU, L2 cache, memory bus, memory controller, 4-way interleaved DRAM, and a NIC (i860, NI, DMA) on the MicroChannel I/O bus; nodes connected by a general interconnection network formed from 8-port switches]

Page 30

Example: Intel Paragon

[Figure: Intel Paragon node with two i860 processors (each with L1 cache), memory controller, 4-way interleaved DRAM, and NI/DMA/driver on the memory bus (64-bit, 50 MHz); a processing node is attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Sandia's Intel Paragon XP/S-based supercomputer.]

Page 31

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct global address space on MP using hashing
  – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
  – Tighter NI integration even for MP (low-latency, high-bandwidth)
  – At lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines

Page 32

Data Parallel Systems

• Programming model:
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
• Architectural model:
  – Array of many simple, cheap processors with little memory each
    • Processors don't sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Original motivation:
  – Matches simple differential equation solvers
  – Centralize high cost of instruction fetch & sequencing

[Figure: control processor broadcasting instructions to a 2D array of PEs]

Page 33

Application of Data Parallelism

• Each PE contains an employee record with his/her salary (a sequential sketch of this update appears after the bullets below):

  if salary > 100K then
      salary = salary * 1.05
  else
      salary = salary * 1.10

• Logically, the whole operation is a single step
  – Some processors enabled for arithmetic operation, others disabled
• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...
• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2
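
A sequential C sketch of the salary update above (an illustration, not from the slide; the record type and the count N = 1024 are assumptions). On a data parallel machine the loop body is broadcast once and each PE applies it to its own element, with the comparison enabling or disabling the PE for the appropriate multiply.

    #define N 1024

    struct employee { double salary; };

    void raise_salaries(struct employee pe[N]) {
        for (int i = 0; i < N; i++) {          /* one PE per element, logically one step */
            if (pe[i].salary > 100000.0)
                pe[i].salary *= 1.05;
            else
                pe[i].salary *= 1.10;
        }
    }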

Page 34

Parallel Computer Architectures

• Flynn's Classification
• Legacy Parallel Computers
• Current Parallel Computers
• Trends in Supercomputers
• Converged Parallel Computer Architecture

Page 35

Parallel Architectures Tied to Programming Models

[Figure: application software and system software atop divergent architectures: SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]

• Uncertainty of direction paralyzed parallel software development!
• Divergent architectures, no predictable pattern of growth

Page 36

Flynn's Classification

• Based on notions of instruction and data streams (1972)
  – SISD (Single Instruction stream over a Single Data stream)
  – SIMD (Single Instruction stream over Multiple Data streams)
  – MISD (Multiple Instruction streams over a Single Data stream)
  – MIMD (Multiple Instruction streams over Multiple Data streams)
• Popularity
  – MIMD > SIMD > MISD

Page 37

SISD (Single Instruction Stream over a Single Data Stream)

• SISD: conventional sequential machines

[Figure: CU sends the instruction stream (IS) to the PU; the PU exchanges the data stream (DS) with the MU and I/O. CU: Control Unit, PU: Processing Unit, MU: Memory Unit, IS: Instruction Stream, DS: Data Stream]

Page 38

SIMD (Single Instruction Stream over Multiple Data Streams)

• SIMD
  – Vector computers
  – Special purpose computations

[Figure: SIMD architecture with distributed memory: the CU broadcasts the instruction stream to PE1..PEn, each with its local memory LM1..LMn; the program is loaded from the host and data sets are loaded from the host. PE: Processing Element, LM: Local Memory]

Page 39

MISD (Multiple Instruction Streams over a Single Data Stream)

• MISD
  – Processor arrays, systolic arrays
  – Special purpose computations

[Figure: MISD architecture (the systolic array): memory (program, data) feeds a single data stream through PU1, PU2, ..., PUn, each controlled by its own CU and instruction stream]

Page 40

MIMD (Multiple Instruction Streams over Multiple Data Streams)

• MIMD
  – General purpose parallel computers

[Figure: MIMD architecture with shared memory: CU1..CUn each drive their own PU; the PUs access shared memory and I/O with independent instruction and data streams]

Page 41

Dataflow Architectures

• Represent computation as a graph of essential dependences
  – Logical processor at each node, activated by availability of operands
  – Message (tokens) carrying tag of next instruction sent to next processor
  – Tag compared with others in matching store; match fires execution

Example dataflow graph for:
  a = (b + 1) × (b − c)
  d = c × e
  f = a × d

[Figure: dataflow graph of the three expressions above, and a dataflow processor pipeline: network, token queue, waiting-matching (token store), instruction fetch (program store), execute, form token, network]
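
For reference, the same three expressions written as ordinary C (an illustration, not from the slide); in a dataflow machine the first two statements can fire concurrently as soon as tokens for b, c and e arrive, and the final multiply fires only when tokens for a and d meet in the matching store.

    double dataflow_example(double b, double c, double e) {
        double a = (b + 1) * (b - c);   /* fires when b and c are available */
        double d = c * e;               /* fires when c and e are available */
        double f = a * d;               /* fires only after both a and d exist */
        return f;
    }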

Page 42

Convergence: General Parallel Architecture

• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
  – Integration of assist with node, what operations, how efficiently...

[Figure: a generic modern multiprocessor: nodes of processor, cache, memory and communication assist (CA) connected by a scalable network]

Page 43

Data Flow vs. Control Flow

• Control-flow computer
  – Program control is explicitly controlled by instruction flow in program
  – Basic components
    • PC (Program Counter)
    • Shared memory
    • Control sequencer

Page 44

Data Flow vs. Control Flow (Cont'd)

• Advantages
  – Full control
  – Complex data and control structures are easily implemented
• Disadvantages
  – Less efficient
  – Difficult programming
  – Difficult to prevent runtime errors

Page 45

Data Flow vs. Control Flow (Cont'd)

• Data-flow computer
  – Execution of instructions is driven by data availability
  – Basic components
    • Data are held directly inside instructions
    • Data availability check unit
    • Token matching unit
    • Chain reaction of asynchronous instruction executions

Page 46

Data Flow vs. Control Flow (Cont'd)

• Advantages
  – Very high potential for parallelism
  – High throughput
  – Free from side effects
• Disadvantages
  – Time lost waiting for unneeded arguments
  – High control overhead
  – Difficult to manipulate data structures

Page 47

Dataflow Machine (Manchester Dataflow Computer)

• First actual hardware implementation
• Token: <data, tag, dest, marker>
• Match: <tag, dest>

[Figure: ring of I/O switch (from/to host), token queue, matching unit with overflow unit, instruction store, and functional units func1..funck, connected by a network]

Page 48

Execution on Control Flow Machines

• Assume all the external inputs are available before entering the do loop
• + : 1 cycle, * : 2 cycles, / : 3 cycles
• Sequential execution on a uniprocessor takes 24 cycles

[Figure: sequential schedule of the a(i), b(i), c(i) computations on one processor]

Page 49

Execution on a Data Flow Machine

• Data-driven execution on a 4-processor dataflow computer takes 9 cycles
• Parallel execution on a shared-memory 4-processor system takes 7 cycles
  – using partial sums s1 = b1+b2, t1 = s1+b3, s2 = b3+b4, t2 = s1+s2

[Figure: schedules of the a(i), b(i), c(i) computations on the 4-processor dataflow machine and on the 4-processor shared-memory machine]

Page 50

Problems

• Excessive copying of large data structures in dataflow operations
  – I-structure: a tagged memory unit for overlapped usage by the producer and consumer
  – Retreat from pure dataflow approach (shared memory)
• Handling complex data structures
• Chain reaction control is difficult to implement
  – Complexity of matching store and memory units
• Exposes too much parallelism (?)

Page 51

Convergence of Dataflow Machines

• Converged to use conventional processors and memory
  – Support for large, dynamic set of threads to map to processors
  – Operations have locality across them, useful to group together
• Typically shared address space as well
• But separation of programming model from hardware (like data-parallel)

Page 52

Contributions

• Integration of communication with thread (handler) generation
• Tightly integrated communication and fine-grained synchronization
  – Each instruction represents a synchronization operation
  – Absorb the communication latency and minimize the losses due to synchronization waits

Page 53

Systolic Architectures

• Replace single processor with array of regular processing elements
  – Orchestrate data flow for high throughput with less memory access

[Figure: conventional organization (memory feeding one PE) versus systolic organization (memory feeding a chain of PEs)]

• Different from pipelining:
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern

Page 54

Systolic Arrays (Cont)

• Example: systolic array for 1-D convolution

  y(i) = Σ_{j=1}^{k} w(j) * x(i-j)

[Figure: linear systolic array with weight cells W(1), W(2), ..., W(k); x values x(i+1), x(i), x(i-1), ..., x(i-k) flow in one direction while partial results y(i), y(i+1), ..., y(i+k), y(i+k+1) flow in the other]

• Practical realizations (e.g. iWARP) use quite general processors
  – Enable variety of algorithms on same hardware
• But dedicated interconnect channels
  – Data transfer directly from register to register across channel
• Specialized, and same problems as SIMD
  – General purpose systems work well for same algorithms (locality etc.)

Page 55

Systolic Architectures

• Orchestrate data flow for high throughput with less memory access
• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD
  – Each PE may do something different
• Initial motivation
  – VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in regular pattern

Page 56

Systolic Architectures

• Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth

[Figure: conventional organization (memory feeding one PE) versus systolic array (memory feeding a chain of PEs)]

Page 57

Two Communication Styles

[Figure: memory communication (each CPU exchanges data through its local memory) versus systolic communication (CPUs also pass data directly to neighboring CPUs, each with its own local memory)]

Page 58

Characteristics

• Practical realizations (e.g. Intel iWARP) use quite general processors
  – Enable variety of algorithms on same hardware
• But dedicated interconnect channels
  – Data transfer directly from register to register across channel
• Specialized, and same problems as SIMD
  – General purpose systems work well for same algorithms (locality etc.)

Page 59

Matrix Multiplication

  y(i) = Σ_{j=1}^{n} a(i,j) * x(j),   i = 1, ..., n

• Recursive algorithm:

  for i = 1 to n
      y(i,0) = 0
      for j = 1 to n
          y(i,0) = y(i,0) + a(i,j) * x(j,0)

• Use the following PE:

  xout = x
  x = xin
  yout = yin + w * xin

[Figure: PE cell with stored weight w and register x; inputs xin, yin and outputs xout, yout]
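
A sequential C sketch of the PE recurrence above (assumptions: 0-based indexing and a fixed size N = 4); it applies the cell update element by element rather than simulating the array cycle by cycle, but each pe_step call performs exactly the xout / x / yout update shown on the slide.

    #define N 4

    /* One systolic cell: xout = x; x = xin; yout = yin + w * xin */
    static double pe_step(double w, double *x_reg, double xin, double yin, double *xout) {
        *xout = *x_reg;
        *x_reg = xin;
        return yin + w * xin;
    }

    /* y(i) = sum over j of a(i,j) * x(j): each row i behaves like one chain of
     * PE steps, with the weights w = a(i,j) streaming against the x(j) values. */
    void matvec(const double a[N][N], const double x[N], double y[N]) {
        for (int i = 0; i < N; i++) {
            double y_partial = 0.0, x_reg = 0.0, x_out;
            for (int j = 0; j < N; j++)
                y_partial = pe_step(a[i][j], &x_reg, x[j], y_partial, &x_out);
            y[i] = y_partial;
        }
    }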

Page 60

Systolic Array Representation of Matrix Multiplication

[Figure: linear systolic array computing y1..y4; inputs x1..x4 stream in from one side while partial results y1..y4, initialized to 0, accumulate as they move through the cells]

Page 61

Example of Convolution

  y(i) = w1 * x(i) + w2 * x(i+1) + w3 * x(i+2) + w4 * x(i+3)
  y(i) is initialized as 0.

• Each PE implements:

  xout = x
  x = xin
  yout = yin + w * xin

[Figure: four-cell systolic array with weights w4, w3, w2, w1; x values (x1, x3, x5, x7 and x2, x4, x6, x8) stream through the cells while results y1, y2, y3 emerge]

Page 62

Data Parallel Systems

• Programming model
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element

Page 63

Data Parallel Systems (Cont'd)

• SIMD architectural model
  – Array of many simple, cheap processors with little memory each
  – Processors don't sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization

Page 64

Evolution & Convergence

• Popular due to cost savings of centralized sequencer
  – Replaced by vector processors in the mid-70s
  – Revived in the mid-80s when a 32-bit datapath slice fit on a chip
• Old machines
  • Thinking Machines CM-1, CM-2
  • Maspar MP-1 and MP-2

Page 65

Evolution & Convergence (Cont'd)

• Drawbacks
  – Low applicability
  – Simple, regular applications can do well anyway in MIMD
• Convergence: SPMD (Single Program Multiple Data)
  – Fast global synchronization is needed

Page 66

Vector Processors

• Merits of vector processors
  – Very deep pipeline without data hazards
    • The computation of each result is independent of the computation of previous results
  – Instruction bandwidth requirement is reduced
    • A vector instruction specifies a great deal of work
  – Control hazards are nonexistent
    • A vector instruction represents an entire loop
    • No loop branch

Page 67

Vector Processors (Cont'd)

  – The high latency of initiating a main memory access is amortized
    • A single access is initiated for the entire vector rather than a single word
    • Known access pattern
    • Interleaved memory banks
• Vector operations are faster than a sequence of scalar operations on the same number of data items!

Page 68

Vector Programming Example

• Y = a * X + Y on a RISC machine (64 elements of 8 bytes, so the last address is base + 8 * 64 = base + 512):

        LD    F0, a          ; load scalar a
        ADDI  R4, Rx, #512   ; last address to load
  Loop: LD    F2, 0(Rx)      ; load X(i)
        MULTD F2, F0, F2     ; a x X(i)
        LD    F4, 0(Ry)      ; load Y(i)
        ADDD  F4, F2, F4     ; a x X(i) + Y(i)
        SD    F4, 0(Ry)      ; store into Y(i)
        ADDI  Rx, Rx, #8     ; increment index to X
        ADDI  Ry, Ry, #8     ; increment index to Y
        SUB   R20, R4, Rx    ; compute bound
        BNZ   R20, loop      ; check if done

• The loop body repeats 64 times.
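
For reference, a plain C rendering of the same Y = a * X + Y computation (not on the slide); the fixed length of 64 matches the 64 iterations of the assembly loop.

    /* The classic DAXPY kernel over 64-element vectors. */
    void daxpy(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }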

Page 69

Vector Programming Example (Cont'd)

• Y = a * X + Y on a vector machine: 6 instructions (low instruction bandwidth)

  LD     F0, a        ; load scalar a
  LV     V1, Rx       ; load vector X
  MULTSV V2, F0, V1   ; vector-scalar multiply
  LV     V3, Ry       ; load vector Y
  ADDV   V4, V2, V3   ; add
  SV     Ry, V4       ; store the result

Page 70

Basic Vector Architecture

• Vector-register processor
  – All vector operations except load and store are among the vector registers
  – The major vector computers
• Memory-memory vector processor
  – All vector operations are memory to memory
  – The first vector computers

Page 71

A Vector-Register Architecture (DLXV)

[Figure: main memory connected through a vector load-store unit and crossbars to the vector registers and scalar registers, which feed multiple pipelined functional units (FP add/subtract, etc.)]

Page 72

Vector Machines

  Machine         Registers   Elements per register   Load/Store units   Functional units
  CRAY-1          8           64                      1                  6
  CRAY-2          8           64                      1                  5
  CRAY X-MP       8           64                      2 Ld / 1 St        8
  CRAY Y-MP       8           64                      2 Ld / 1 St        8
  CRAY C-90       8           128                     4                  8
  NEC SX/2        8 + 8192    256                     8                  16
  NEC SX/4        8 + 8192    256                     8                  16
  Fujitsu VP200   8 - 256     32 - 1024               2                  3
  Hitachi S820    32          256                     4                  4
  Convex C-1      8           128                     1                  4

Page 73

Convoy

• Convoy
  – The set of vector instructions that could begin execution in one clock period
  – The instructions in a convoy must not contain any structural or data hazards
• Chime
  – An approximate measure of execution time for a vector sequence (roughly, the time to execute one convoy)

Page 74

Convoy Example

  LV     V1, Rx       ; load vector X
  MULTSV V2, F0, V1   ; vector-scalar multiply
  LV     V3, Ry       ; load vector Y
  ADDV   V4, V2, V3   ; add
  SV     Ry, V4       ; store the result

• Convoys:
  1. LV
  2. MULTSV  LV
  3. ADDV
  4. SV

• Chime = 4

Page 75

Strip Mining

• Strip mining: when a vector has a length greater than that of the vector registers, segment the long vector into fixed-length segments (size = MVL, maximum vector length).

• Total execution time:

  Tn = ceil(n / MVL) * (Tloop + Tstart) + n * Tchime

  Ex) T200 = ceil(200 / 64) * (15 + 49) + 200 * 4 = 4 * 64 + 800 = 1056 cycles

Page 76

Strip Mining Example

• Original loop:

      do 10 i = 1, n
   10 Y(i) = a * X(i) + Y(i)

• Strip-mined version:

      low = 1
      VL = (n mod MVL)             ; find the odd-size piece
      do 1 j = 0, (n / MVL)        ; outer loop
         do 10 i = low, low+VL-1   ; runs for length VL
            Y(i) = a*X(i) + Y(i)   ; main operation
   10    continue
         low = low + VL            ; start of next vector
         VL = MVL                  ; reset the length to max
    1 continue
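
The same strip-mining pattern sketched in C (an illustration; the MVL value of 64 is an assumption): the first, possibly shorter strip handles the n mod MVL leftover elements, and every later strip is a full MVL-element vector operation.

    #define MVL 64

    void daxpy_stripmined(int n, double a, const double *x, double *y) {
        int low = 0;
        int vl = n % MVL;                        /* odd-size piece first */
        if (vl == 0) vl = (n > 0) ? MVL : 0;
        while (low < n) {
            for (int i = low; i < low + vl; i++) /* one vector operation of length vl */
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;                            /* all remaining strips are full length */
        }
    }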

Page 77

Performance Enhancement Techniques

• Chaining
• Conditional execution
• Sparse matrix

Page 78

Chaining

• Chaining
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  – Permits the operations to be scheduled in the same convoy and reduces the number of chimes required

Page 79

Chaining Example

• Start-up overheads:

  Unit                  Start-up overhead
  Load and store unit   12 cycles
  Multiply unit         7 cycles
  Add unit              6 cycles

• Vector sequence:

  MULTV V1, V2, V3
  ADDV  V4, V1, V5

• Unchained: MULTV runs for 7 + 64 cycles, then ADDV for 6 + 64 cycles; total = 141 cycles
• Chained: ADDV starts as soon as the first MULTV result is available; total = 7 + 64 + 6 = 77 cycles

Page 80

Conditional Execution

• Vector mask register
  – Any vector instruction executed operates only on the vector elements whose corresponding entries in the vector-mask register are 1
  – Requires execution time even when the condition is not satisfied
  – Still, the elimination of a branch and the associated control dependences can make a conditional instruction faster

Page 81

Conditional Execution Example

      do 10 i = 1, 64
         if (A(i) .ne. 0) then
            A(i) = A(i) - B(i)
         endif
   10 continue

  LV     V1, Ra       ; load vector A into V1
  LV     V2, Rb       ; load vector B
  LD     F0, #0       ; load FP zero into F0
  SNESV  F0, V1       ; sets the VM to 1 if V1(i) != F0
  SUBV   V1, V1, V2   ; subtract under the VM
  CVM                 ; set the VM to all 1s
  SV     Ra, V1       ; store the result in A
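
What the masked vector sequence computes, sketched in plain C (an illustration, not from the slide): the mask is built from the comparison, and the subtraction is applied only where the mask is set. The array length 64 matches the Fortran loop.

    void masked_sub(double *a, const double *b) {
        int mask[64];
        for (int i = 0; i < 64; i++)
            mask[i] = (a[i] != 0.0);   /* SNESV: set mask where A(i) != 0 */
        for (int i = 0; i < 64; i++)
            if (mask[i])
                a[i] = a[i] - b[i];    /* SUBV under the vector mask */
    }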

Page 82

Sparse Matrix

• Sparse matrix
  – Only a small number of elements are non-zero
• Scatter-gather operations using index vectors
  – Moving between a dense representation and a normal representation of a sparse matrix
  – Gather: build a dense representation
  – Scatter: return to the sparse representation

Page 83

Sparse Matrix Example

      do 10 i = 1, n
   10 A(K(i)) = A(K(i)) + C(M(i))

  LV   Vk, Rk         ; load K
  LVI  Va, (Ra+Vk)    ; load A(K(i)) - gather
  LV   Vm, Rm         ; load M
  LVI  Vc, (Rc+Vm)    ; load C(M(i)) - gather
  ADDV Va, Va, Vc     ; add them
  SVI  (Ra+Vk), Va    ; store A(K(i)) - scatter

  LVI : Load Vector Indexed
  SVI : Store Vector Indexed
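
A plain C sketch of the gather / add / scatter sequence (an illustration; the 0-based indexing adapts the 1-based Fortran loop):

    void sparse_update(int n, double *a, const double *c, const int *k, const int *m) {
        for (int i = 0; i < n; i++) {
            double va = a[k[i]];       /* LVI: gather A(K(i)) */
            double vc = c[m[i]];       /* LVI: gather C(M(i)) */
            a[k[i]] = va + vc;         /* ADDV + SVI: add and scatter back */
        }
    }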

Page 84

Current Parallel Computer Architectures

• MIMD
  – Multiprocessor (Shared Address Space)
    • PVP
    • SMP
    • DSM
  – Multicomputer (Message Passing)
    • MPP
    • Constellation
    • Cluster

Page 85

Programming Models

• What does programmer use in coding applications?
• Specifies communication and synchronization
• Classification
  – Uniprocessor model: Von Neumann model
  – Multiprogramming: no communication and synchronization at program level
    • (ex) CAD
  – Shared address space
    – Symmetric multiprocessor model
    – CC-NUMA model
  – Message passing
  – Data parallel

Page 86

Communication Architecture

• User/System interface
  – Communication primitives exposed at user level realize the programming model
• Implementation
  – How to implement primitives: HW or OS
  – How optimized are they?
  – Network structure

Page 87

Communication Architecture (Cont'd)

• Goals
  – Performance and cost
  – Programmability
  – Scalability
  – Broad applicability

Page 88

Shared Address Space Architecture

• Any processor can directly reference any memory location
  – Communication occurs implicitly by loads and stores (a small sketch follows this list)
• Natural extension of the uniprocessor model
  – Location transparency
  – Good throughput on multiprogrammed workloads
• OS uses shared memory to coordinate processes
• Shared memory multiprocessors
  – SMP: every processor has equal access to the shared memory, the I/O devices, and the OS services; UMA architecture
  – NUMA: distributed shared memory
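
A minimal sketch (not from the slides) of communication by ordinary loads and stores in a shared address space, using POSIX threads; the variable names and the join-based synchronization are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>

    static double shared_value;            /* lives in the shared address space */

    static void *producer(void *arg) {
        (void)arg;
        shared_value = 42.0;               /* communication is just a store */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);             /* synchronization point */
        printf("%f\n", shared_value);      /* and the matching load */
        return 0;
    }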

Page 89

Shared Address Space Model

• Virtual address spaces for a collection of processes communicating via shared addresses

[Figure: the private portion of each process's virtual address space (P0..Pn private) maps to distinct physical addresses, while the shared portion maps to common physical addresses; a store by one process and a load by another to the shared portion communicate through the machine physical address space]

Page 90

SAS History

• "Mainframe" approach
  – Motivated by multiprogramming
  – Extends crossbar used for memory bandwidth and I/O
  – Bandwidth scales with p
  – High incremental cost; use multistage instead

Page 91

Crossbar Switch

[Figure: processors (each with a cache) and I/O ports connected to multiple memory banks through a crossbar switch]

Page 92

Minicomputer

• "Minicomputer" approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost

Page 93

Bus Connection

[Figure: processors (each with a cache) and I/O connected to memory banks over a shared bus]

Page 94

Scaling Up

• Problem is interconnect: cost (crossbar) or bandwidth (bus)
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
  – Latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
  – Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?

Page 95

Organizations

[Figure: "Dance hall" organization (processors with caches on one side of the network, memory banks on the other) versus distributed-memory organization (a memory attached to each processor/cache node, all connected by the network)]

Page 96

Multiprocessors (Shared Address Space Architecture)

• PVP (Parallel Vector Processor)
  – A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
  – A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
  – Similar to SMP, but the memory is physically distributed among nodes

Page 97

PVP (Parallel Vector Processor)

[Figure: vector processors (VP) connected to shared memory modules (SM) through a crossbar switch. VP: Vector Processor, SM: Shared Memory]

Page 98

SMP (Symmetric Multi-Processor)

[Figure: microprocessor/cache nodes (P/C) connected to shared memory modules (SM) by a bus or crossbar switch. P/C: Microprocessor and Cache]

Page 99

DSM (Distributed Shared Memory)

[Figure: nodes of P/C, local memory (LM), cache directory (DIR) and NIC on a memory bus (MB), connected by a custom-designed network. DIR: Cache Directory]

Page 100

MPP (Massively Parallel Processing)

[Figure: nodes of P/C, local memory (LM) and NIC on a memory bus (MB), connected by a custom-designed network. MB: Memory Bus, NIC: Network Interface Circuitry]

Page 101

Cluster

[Figure: nodes of P/C, memory (M), bridge, local disk (LD) and NIC on an I/O bus (IOB), connected by a commodity network (Ethernet, ATM, Myrinet, VIA). LD: Local Disk, IOB: I/O Bus]

Page 102

Constellation

[Figure: each node is itself an SMP with >= 16 processors (P/Cs and shared memory on a hub, with local disk, NIC and I/O controller), and the nodes are connected by a custom or commodity network. IOC: I/O Controller]

Page 103

Trend in Parallel Computer Architectures

[Figure: number of HPC systems (0 to 400) by architecture class, 1997 to 2002, for MPPs, constellations, clusters and SMPs]

Page 104

Food-Chain of High-Performance Computers

Page 105

Converged Architecture of Current Supercomputers

[Figure: several multiprocessor nodes, each with four processors sharing a memory, connected by an interconnection network]