ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected]) ΗΜΥ 656 Advanced Computer Architecture, Spring Semester 2007. Lecture 5: Chip Multiprocessors – The New Era in Processor Architectures. Acknowledgements: Wen-Mei Hwu, Kunle Olukotun, Shih-Hao Hung, Dezső Sima and the Stanford Hydra Group.


Page 2:

Microarchitecture: Overview

[Figure: Instruction Supply, Execution Mechanism, Data Supply]

Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow

Page 3:

Microarchitecture, 1990

• Short pipelines
• On-chip I and D caches, blocking
• Simple prediction

Page 4:

Microarchitecture, 2000

• Mechanisms to find parallel instructions
  – dynamic scheduling
  – static scheduling
• On-chip cache hierarchies, with non-blocking, higher-bandwidth caches
• Sophisticated branch prediction

Page 5:

Future Microarchitecture: One Perspective

[Figure: Instruction Supply, Execution Mechanism, Data Supply]

Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow

Page 6:

Where are we headed?

• More ILP: even wider, deeper
  – enabling technology: speculation, predication, compiler transformations, binary re-optimization, complexity-effective design
• Multithreading
  – enabling technology: speculation, subordinate threads, discovery of thread-level parallelism
• Chip Multiprocessors
  – enabling technology: speculation, discovery of thread-level, coarse-grained parallelism

Page 7:

More ILP

• Instruction Supply
  – Branches, cache misses, partial fetches
• Data Supply
  – Higher bandwidth, lower latency, memory ordering, non-blocking caches
• Execution
  – Reduction of redundant work, design complexity and partitioning
• Tolerating Latency
  – Can some things just take a long time?

Page 8:

Multithreading [Burton Smith, 1978]

[Figure: pipeline with Fetch, Execute, and Write Back stages]

This is a snapshot of the pipeline during a single cycle. Each color represents instructions from a different thread.

B. Smith’s original concept was for a single-wide pipeline, but it extends naturally to a multiple-issue pipeline.

Page 9:

Simultaneous Multithreading [W. Yamamoto, 1994 / D. Tullsen, 1995]

[Figure: pipeline with Fetch, Execute, and Write Back stages]

Page 10:

Simultaneous Multithreading, possible implementation

[Figure: pipeline divided into Front End and Back End]

• Intel Hyperthreading in Pentium 4 [HotChips’14] is the first realization with two threads
• Small ISA register file minimizes effect of replication
• Replicated retirement logic
• Minimal hardware overhead but major increase in verification cost

Page 11:

Chip Multiprocessor [K. Olukotun, 1996]

[Figure: Fetch, Execute, and Write Back stages; processors Proc A, Proc B, Proc C, and Proc D; Shared L2 Cache]

A single processor die contains multiple CPUs, all of which share some amount of resources, such as an L2 cache and chip pins.

Page 12:

Hardware Accelerators

Page 13:

Existing Solutions…

[Figure: block diagrams of the Intel IXP1200 Network Processor, Philips Nexperia (Viper), and IBM Cell; block labels: ARM, Micro-engines, Access Ctl., MIPS, MPEG, VLIW, Video, MSP]

… what’s next? …

Page 14:

Discussion/Thought Exercise

• What are the essential differences between the SMT model of execution and the CMP model?
  – What resources are shared and in what manner?
  – What type of data movement exists in one but not the other?
  – What types of applications/situations are the best-case situations for each model?

Page 15:

The Advent of Superscalar Architecture

Transistor densities increased at a stunning pace.

Is there any method to increase computing performance using those transistors?

Put more than one ALU on a chip

The RS6000 from IBM, released in 1990: the world's first superscalar CPU

Most general purpose CPUs developed since about 1998 are superscalar

Page 16:

Technology ↔ Architecture

• Transistors are cheap, plentiful and fast
  – Moore’s law
  – 100 million transistors by 2000
• Wires are cheap, plentiful and slow
  – Wires get slower relative to transistors
  – Long cross-chip wires are especially slow
• Architectural implications
  – Plenty of room for innovation
  – Single-cycle communication requires localized blocks of logic
  – High communication bandwidth across the chip is easier to achieve than low latency

Page 17:

Exploiting Program Parallelism

[Figure: levels of parallelism (Instruction, Loop, Thread, Process) plotted against grain size in instructions, on an axis from 1 to 1M]

Page 18:

Future Processors to use Coarse-Grain Parallelism

• Today’s microprocessors utilize instruction-level parallelism through a deep instruction pipeline and the superscalar or VLIW multiple-issue techniques.
• Today’s (2001) technology: approx. 40 M transistors per chip. In the future (2012): 1.4 G transistors per chip.

What next?

• Two directions:
  – Increase of single-thread performance
    --> use of more speculative instruction-level parallelism
  – Increase of multi-thread (multi-task) performance
    --> utilize thread-level parallelism in addition to instruction-level parallelism
  A “thread” in this lecture means a “HW thread”, which can be a SW (POSIX) thread, a process, ...
• Far future (??): increase of single-thread performance by use of speculative instruction-level and thread-level parallelism
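As a quick check on the transistor figures quoted on this slide (40 M in 2001, 1.4 G projected for 2012), the sketch below computes the growth factor and the doubling time those two numbers imply. The function name is hypothetical and the calculation simply assumes steady exponential growth between the two data points.

```python
import math

def doubling_time_years(n_start, n_end, years):
    """Years per doubling implied by growing from n_start to n_end in `years`,
    assuming steady exponential growth (an assumption, not a slide claim)."""
    doublings = math.log2(n_end / n_start)
    return years / doublings

growth = 1.4e9 / 40e6                                  # figures from the slide
t = doubling_time_years(40e6, 1.4e9, 2012 - 2001)
print(f"growth factor: {growth:.0f}x")                 # growth factor: 35x
print(f"implied doubling time: {t:.2f} years")         # implied doubling time: 2.14 years
```

Under that assumption the projection corresponds to a doubling roughly every two years.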

Page 19:

Advanced Superscalar Processors for Billion Transistor Chips in Year 2005 - Characteristics

• Aggressive speculation, such as a very aggressive dynamic branch predictor,
• a large trace cache,
• very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
• a large number of reservation stations to accommodate 2,000 instructions,
• 24 to 48 highly optimized, pipelined functional units,
• sufficient on-chip data cache, and
• sufficient resolution and forwarding logic.

– see: Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, Jaret Stark: One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, September 1997, pp. 51-57.

Page 20:
Page 21:

Requirements and Solutions

• Delivering optimal instruction bandwidth requires:
  – a minimal number of empty fetch cycles,
  – a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle,
  – and a minimal number of cycles in which the instructions fetched are subsequently discarded.
• Consuming this instruction bandwidth requires:
  – sufficient data supply,
  – and sufficient processing resources to handle the instruction bandwidth.
• Suggestions:
  – an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses),
  – a large trace cache for providing a logically contiguous instruction stream,
  – an aggressive multi-hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling.
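The three instruction-bandwidth requirements above can be combined into a rough back-of-the-envelope estimate. The sketch below is only an illustrative model, not something taken from the slides: it assumes the three loss sources are independent and multiplies them together, and all of the example fractions are hypothetical.

```python
def useful_fetch_bandwidth(width, empty_cycle_frac, slot_fill_frac, discarded_frac):
    """Estimated useful instructions per cycle.

    width            -- issue width (16 or 32 on the slide)
    empty_cycle_frac -- fraction of cycles in which nothing is fetched (hypothetical)
    slot_fill_frac   -- average fraction of issue slots filled in a fetch cycle (hypothetical)
    discarded_frac   -- fraction of fetched instructions later discarded (hypothetical)
    """
    return width * (1 - empty_cycle_frac) * slot_fill_frac * (1 - discarded_frac)

# 16-wide machine, 10% empty fetch cycles, 80% slot fill, 15% discarded work
print(round(useful_fetch_bandwidth(16, 0.10, 0.80, 0.15), 3))  # 9.792
```

Even with fairly mild losses, the useful bandwidth falls well below the nominal issue width, which is why the slide asks for all three factors to be minimized together.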

Page 22:

Future Processors to use Coarse-Grain Parallelism

• Chip multiprocessors (CMPs) or multiprocessor chips
  – integrate two or more complete processors on a single chip,
  – every functional unit of a processor is duplicated
• Simultaneous multithreaded (SMT) processors
  – store multiple contexts in different register sets on the chip,
  – the functional units are multiplexed between the threads,
  – instructions of different contexts are simultaneously executed

Page 23:

Chip Multiprocessors (CMPs): Principal Chip Multiprocessor Alternatives

• symmetric multiprocessor (SMP),
• distributed shared memory multiprocessor (DSM),
• message-passing shared-nothing multiprocessor.

Page 24:

Organizational principles of multiprocessors

[Figure: three organizations arranged by memory (global vs. physically distributed) and address space (shared vs. distributed). (SMP) symmetric multiprocessor: processors connected through an interconnection to a shared memory (global memory, shared address space). (DSM) distributed-shared-memory multiprocessor: processors with local memories connected by an interconnection (physically distributed memory, shared address space). Message-passing (shared-nothing) multiprocessor: processors with local memories communicating by send and receive over an interconnection (physically distributed memory, distributed address spaces). The remaining combination is marked empty.]

Page 25:

Typical SMP

[Figure: four processors, each with its own primary cache and secondary cache, connected by a bus to a global memory]

Page 26:

Shared memory candidates for CMPs

[Figure: two organizations of four processors. In the first, each processor has its own primary cache and secondary cache, and all share the global memory. In the second, each processor has its own primary cache, and all share one secondary cache in front of the global memory.]

Shared-main memory and shared-secondary cache

Page 27:

Shared memory candidates for CMPs

[Figure: four processors sharing one primary cache, backed by a secondary cache and a global memory]

and shared-primary cache

Page 28:

Grain-levels for CMPs

• multiple processes in parallel
• multiple threads from a single application
  ⇒ implies a common address space for all threads
• extracting threads of control dynamically from a single instruction stream
  ⇒ see last lecture, multiscalar, trace processors, ...

Page 29:

Hydra: A Single-Chip Multiprocessor

[Figure: a single chip containing CPU 0 to CPU 3, each with a primary I-cache, a primary D-cache, and a memory controller; centralized bus arbitration mechanisms; an on-chip secondary cache; a Rambus memory interface, an off-chip L3 interface, and an I/O bus interface with DMA. Off chip: cache SRAM array, DRAM main memory, I/O device.]

Page 30:

Multithreaded Processors

• Aim: Latency tolerance
• What is the problem?
• Load access latencies measured on an Alpha Server 4100 SMP with four 300 MHz Alpha 21164 processors are:
  – 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,
  – 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,
  – 80 cycles for a miss that is served by the memory, and
  – 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
• Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.
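To put the cycle counts above on a wall-clock scale, the following sketch converts them to nanoseconds at the 300 MHz clock quoted on the slide. The dictionary keys and the function name are illustrative labels, not terms from the original measurements.

```python
CLOCK_HZ = 300e6  # 300 MHz Alpha 21164, as quoted on the slide

# Load latencies in cycles, as listed on the slide (keys are illustrative labels)
LATENCY_CYCLES = {
    "L1 miss, L2 hit": 7,
    "L2 miss, L3 hit": 21,
    "served by memory": 80,
    "dirty miss (other processor's cache)": 125,
}

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a latency in processor cycles to nanoseconds."""
    return cycles / clock_hz * 1e9

for name, cycles in LATENCY_CYCLES.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

A dirty miss therefore costs on the order of 400 ns, time in which a single-threaded pipeline has nothing to issue unless another thread is available.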

Page 31:

Multithreaded Processors

[Figure: four threads (Thread 1 to Thread 4), each with its own register set (Register set 1 to 4) and its own PC and PSR (PC PSR 1 to 4); label FP]

• Multithreading:
  – Provide several program counter registers (and usually several register sets) on chip
  – Fast context switching by switching to another thread of control

Page 32:

Approaches of Multithreaded Processors

• Cycle-by-cycle interleaving
  – An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block-interleaving
  – The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
• Simultaneous multithreading
  – Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  – Combines wide-issue superscalar instruction issue with multithreading.
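The first two approaches differ only in when the processor switches threads. The toy scheduler below makes that difference concrete for a scalar pipeline; it is a minimal sketch with hypothetical function names, it uses simple round-robin thread selection, and it treats the set of cycles with latency events as a given input rather than modeling caches.

```python
def cycle_by_cycle(num_threads, cycles):
    """Thread chosen in each cycle when the processor switches every cycle."""
    return [c % num_threads for c in range(cycles)]

def block_interleaving(num_threads, cycles, latency_event_cycles):
    """Thread chosen in each cycle when the processor switches only after a
    latency-causing event. latency_event_cycles is a hypothetical set of cycle
    indices at which the running thread hits such an event."""
    schedule, current = [], 0
    for c in range(cycles):
        schedule.append(current)
        if c in latency_event_cycles:
            current = (current + 1) % num_threads  # context switch
    return schedule

print(cycle_by_cycle(3, 6))                # [0, 1, 2, 0, 1, 2]
print(block_interleaving(3, 6, {1, 4}))    # [0, 0, 1, 1, 1, 2]
```

Simultaneous multithreading is not shown here because it needs a multiple-issue model, where instructions from several threads can be issued in the same cycle.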

Page 33:

Comparison of Multithreading with Non-Multithreading Approaches:

(a) single-threaded scalar
(b) cycle-by-cycle interleaving multithreaded scalar
(c) block interleaving multithreaded scalar

[Figure: for each of (a), (b), and (c), instruction issue over time (processor cycles); context switches are marked in (b) and (c)]

Page 34:

Comparison of Multithreading with Non-Multithreading Approaches:

(a) superscalar
(b) VLIW
(c) cycle-by-cycle interleaving
(d) cycle-by-cycle interleaving VLIW

[Figure: for each of (a) to (d), issue slots over time (processor cycles); slots marked N appear in the VLIW panels (b) and (d); context switches are marked in (c) and (d)]

Page 35:

Comparison of Multithreading with Non-Multithreading:

simultaneous multithreading (SMT) and chip multiprocessor (CMP)

[Figure: issue slots over time (processor cycles) for SMT and CMP]

Page 36:

Cycle-by-Cycle Interleaving

• the processor switches to a different thread after each instruction fetch
• pipeline hazards cannot arise and the processor pipeline can be easily built without the necessity of complex forwarding paths
• context-switching overhead is zero cycles
• memory latency is tolerated by not scheduling a thread until the memory transaction has completed
• requires at least as many threads as pipeline stages in the processor
• degrades single-thread performance if not enough threads are present
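The last two bullets can be illustrated with a simplified model. It assumes that each thread may have at most one instruction in the pipeline at a time (which is what removes hazards and forwarding paths) and it ignores memory stalls; the function names are hypothetical.

```python
def pipeline_utilization(num_threads, pipeline_stages):
    """Fraction of issue cycles used, assuming one in-flight instruction per thread."""
    return min(1.0, num_threads / pipeline_stages)

def per_thread_rate(num_threads, pipeline_stages):
    """Instructions per cycle completed by one thread under round-robin issue."""
    return 1.0 / max(num_threads, pipeline_stages)

stages = 8  # hypothetical pipeline depth
for n in (1, 4, 8, 16):
    print(n, pipeline_utilization(n, stages), per_thread_rate(n, stages))
```

With a single thread on an 8-stage pipeline, this model gives 12.5% utilization, which is the single-thread degradation the slide refers to; the pipeline only fills once there are at least as many threads as stages.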

Page 37:

Cycle-by-Cycle Interleaving - Improving single-thread performance

• The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  – The scheduler feeds instructions of the same thread that are not data or control dependent successively into the pipeline.
• The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.

Page 38:

Single-core computer

Page 39:

Single-core CPU chip

[Figure: a CPU chip with the single core labeled]

Page 40:

Multi-core architectures

• This lecture is about a new trend in computer architecture: replicate multiple processor cores on a single die.

[Figure: multi-core CPU chip containing Core 1, Core 2, Core 3, and Core 4]

Page 41:

Multi-core CPU chip

• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)

[Figure: chip with core 1, core 2, core 3, and core 4]

Page 42:

The cores run in parallel

[Figure: core 1 to core 4 running thread 1 to thread 4, one thread per core]

Page 43:

Within each core, threads are time-sliced (just like on a uniprocessor)

[Figure: core 1 to core 4, each running several threads]

Page 44:

Interaction with OS

• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major OSes support multi-core today
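One way to see the first bullet in practice is to ask the operating system how many processors it believes it has. The sketch below uses Python's standard `os.cpu_count()`; the wrapper function name is hypothetical, and the fallback to 1 covers the case where the count cannot be determined.

```python
import os

def logical_processors():
    """Number of logical processors the OS reports, or 1 if it cannot tell."""
    return os.cpu_count() or 1

print(f"The OS reports {logical_processors()} logical processor(s)")
```

On a machine with SMT enabled, this number typically counts hardware threads rather than physical cores, since the OS sees each simultaneous thread as its own processor.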

Page 45:

Why multi-core?

• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed of light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)

Page 46:

Instruction-level parallelism

• Parallelism at the machine-instruction level
• The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

Page 47:

Thread-level parallelism (TLP)

• This is parallelism on a coarser scale
• Server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and physics in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
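The server example above (one thread per client) can be sketched with Python's standard thread pool. This is only an illustration of the programming pattern: `handle_client` is a hypothetical stand-in for real request handling, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_client(client_id):
    """Hypothetical request handler standing in for real per-client work."""
    return f"client {client_id} served"

def serve(client_ids, workers=4):
    """Serve each client in a separate worker thread and collect the replies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_client, client_ids))

print(serve(range(3)))
# ['client 0 served', 'client 1 served', 'client 2 served']
```

Each task here is independent, which is what makes it thread-level parallelism that a multi-core chip can exploit. Note that in standard CPython the global interpreter lock limits how much CPU-bound Python code actually runs in parallel across threads, so this shows the structure of the workload rather than a speedup.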

Page 48:

General context: Multiprocessors

• A multiprocessor is any computer with several processors
• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data

[Figure: Lemieux cluster, Pittsburgh Supercomputing Center]

Page 49:

Multiprocessor memory types

• Shared memory: In this model, there is one (large) common shared memory for all processors
• Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else

Page 50:

Multi-core processor is a special kind of a multiprocessor:

All processors are on the same chip

• Multi-core processors are MIMD: Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
• Multi-core is a shared memory multiprocessor: All cores share the same memory

Page 51:

What applications benefit from multi-core?

• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)

Each can run on its own core

Page 52:

More examples

• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize

Page 53:

A technique complementary to multi-core: Simultaneous multithreading

• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating point (or integer) operation
  – Waiting for data to arrive from memory

Other execution units wait unused

[Figure: processor block diagram with BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, uCode ROM, BTB, L2 Cache and Control, and Bus. Source: Intel]

Page 54:

Simultaneous multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core
• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units

Page 55:

Without SMT, only a single thread can run at any given time

[Figure: processor block diagram (as on Page 53), with Thread 1: floating point]

Page 56:

Without SMT, only a single thread can run at any given time

[Figure: processor block diagram (as on Page 53), with Thread 2: integer operation]

Page 57:

SMT processor: both threads can run concurrently

[Figure: processor block diagram (as on Page 53), with Thread 1: floating point and Thread 2: integer operation running at the same time]

Page 58:

But: Can’t simultaneously use the same functional unit

[Figure: processor block diagram (as on Page 53), with Thread 1 and Thread 2 marked IMPOSSIBLE]

This scenario is impossible with SMT on a single core (assuming a single integer unit)
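The constraint on this slide and the previous one can be expressed as a tiny issue model: in a given cycle each functional unit serves at most one thread. The sketch below is a deliberate simplification with hypothetical names; it assumes one integer unit and one floating point unit, one pending operation per thread, and a fixed priority order.

```python
def smt_issue(requests, units=("integer", "floating_point")):
    """Pick which pending operations issue this cycle.

    requests -- list of (thread, unit) pairs, one pending operation per thread,
                in priority order (hypothetical input format)
    units    -- functional units available, one copy of each (an assumption)
    Returns the (thread, unit) pairs that issue; each unit serves one thread.
    """
    free = set(units)
    issued = []
    for thread, unit in requests:
        if unit in free:
            free.remove(unit)
            issued.append((thread, unit))
    return issued

# Different units: both threads issue in the same cycle
print(smt_issue([("T1", "floating_point"), ("T2", "integer")]))
# [('T1', 'floating_point'), ('T2', 'integer')]

# Same unit: only one thread issues, the other waits
print(smt_issue([("T1", "integer"), ("T2", "integer")]))
# [('T1', 'integer')]
```

On a multi-core chip the second case would not conflict, because each core brings its own copy of the unit.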

Page 59:

SMT not a “true” parallel processor

• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources

Page 60:

Multi-core: threads can run on separate cores

[Figure: two copies of the processor block diagram (as on Page 53), one per core, running Thread 1 and Thread 3]

Page 61:

Multi-core: threads can run on separate cores

[Figure: two copies of the processor block diagram (as on Page 53), one per core, running Thread 2 and Thread 4]

Page 62:

Combining Multi-core and SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads

Page 63:

SMT Dual-core: all four threads can run concurrently

[Figure: two copies of the processor block diagram (as on Page 53), one per core, running Thread 1, Thread 2, Thread 3, and Thread 4]

Page 64:

Comparison: multi-core vs SMT

• Multi-core:
  – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level parallelism

Page 65:

Concepts – Multithreading

• Data Access Latency
  – Cache misses (L1, L2)
  – Memory latency (remote, local)
  – Often unpredictable
• Multithreading (MT)
  – Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work.

Page 66:

Why Multithreading Today?

• ILP is exhausted, TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip.
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.

Page 67:

Classical Problem, 1960s & 1970s

• I/O latency prompted multitasking
• IBM mainframes
• Multitasking
• I/O processors
• Caches within disk controllers

Requirements of Multithreading

• Storage is needed to hold each context’s PC, registers, status word, etc.
• Coordination to match an event with a saved context
• A way to switch contexts
• Long latency operations must use resources not in use


Processor Utilization vs. Latency

R = the run length to a long latency event

L = the amount of latency
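The relationship between R and L can be made concrete with a small model. The sketch below is an illustration under stated assumptions, not the formula from the original slide figure: each context runs for R cycles, pays an assumed switch cost C, then waits L cycles, and the function names and parameters are hypothetical.

```python
def utilization(run_length, latency, contexts=1, switch_cost=0.0):
    """Fraction of processor cycles spent on useful work.

    Assumed model: every context repeats "run R cycles, switch (C cycles),
    wait L cycles". With few contexts the processor idles during the
    latency; with enough contexts the latency is fully hidden and only
    the switch cost is lost.
    """
    if contexts < 1:
        raise ValueError("need at least one context")
    # Linear region: useful work of all contexts within one full period.
    linear = contexts * run_length / (run_length + switch_cost + latency)
    # Saturated region: the processor is always busy running or switching.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 5, 8):
        print(n, round(utilization(run_length=10, latency=40, contexts=n), 2))
```

With R = 10 and L = 40, a single context keeps the processor busy 20% of the time; under this model five contexts are enough to hide the latency completely when switching is free.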

Problem of the 1980s

• Problem was revisited due to the advent of graphics workstations
– Xerox Alto, TI Explorer
– Concurrent processes are interleaved to allow the workstations to be more responsive.
– These processes could drive or monitor display, input, file system, network, user processing
– Process switch was slow, so the subsystems were microprogrammed to support multiple contexts

Scalable Multiprocessor (1990s)

• Dance hall – a shared interconnect with memory on one side and processors on the other.
• Or processors may have local memory

How do the processors communicate?

• Shared Memory
– Potential long latency on every load
– Cache coherency becomes an issue
– Examples include NYU’s Ultracomputer, IBM’s RP3, BBN’s Butterfly, MIT’s Alewife, and later Stanford’s Dash.
– Synchronization occurs through shared variables, locks, flags, and semaphores.
• Message Passing
– Programmer deals with latency. This lets them minimize the number of messages while maximizing their size, and minimize delay by sending a message so that it reaches the receiver at the time the receiver expects it.
– Examples include Intel’s iPSC and Paragon, Caltech’s Cosmic Cube, and Thinking Machines’ CM-5
– Synchronization occurs through send and receive


Cycle-by-Cycle Interleaved Multithreading

• Denelcor HEP1 (1982), HEP2

• Horizon, which was never built

• Tera, MTA

Cycle-by-Cycle Interleaved Multithreading

• Features
– An instruction from a different context is launched at each clock cycle
– No interlocks or bypasses thanks to a non-blocking pipeline
• Optimizations:
– Leaving context state in the processor (PC, register #, status)
– Assigning tags to remote requests and then matching them on completion
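The issue pattern of cycle-by-cycle interleaving can be sketched with a toy simulation. This is a simplified illustration, not a model of the HEP or Tera hardware: the `interleave` function and the instruction labels are hypothetical, and a context that runs out of instructions simply leaves the rotation.

```python
from collections import deque


def interleave(contexts):
    """Issue one instruction per cycle, rotating round-robin over contexts.

    `contexts` maps a context name to its list of instructions.
    Returns a trace of (cycle, context, instruction) tuples.
    """
    rotation = deque(
        (name, deque(instrs)) for name, instrs in contexts.items() if instrs
    )
    trace = []
    cycle = 0
    while rotation:
        name, instrs = rotation.popleft()
        trace.append((cycle, name, instrs.popleft()))
        if instrs:
            # Context still has work: it rejoins at the back of the rotation.
            rotation.append((name, instrs))
        cycle += 1
    return trace


if __name__ == "__main__":
    program = {"T0": ["a0", "a1"], "T1": ["b0"], "T2": ["c0", "c1"]}
    for step in interleave(program):
        print(step)
```

Note that two consecutive instructions of the same context are always separated by instructions from other contexts while those contexts remain active, which is why such a pipeline can avoid interlocks and bypasses.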

Challenges with this approach

• I-Cache:
– Instruction bandwidth
– I-Cache misses: since instructions are fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
• Register file access time:
– Register file access time increases because the register file had to grow significantly to accommodate many separate contexts.
– In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.
• Single thread performance
– Single thread performance is significantly degraded, since the processor is forced to switch contexts every cycle even if no other thread is available.
• Very high bandwidth network, which is fast and wide
• Retries on load empty or store full

Improving Single Thread Performance

• Do more operations per instruction (VLIW)
• Allow multiple instructions to issue into the pipeline from each context.
– This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution.
– For Horizon & Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context when one is detected.
• Switch on load
• Switch on miss
– Switching on load or miss will increase the context switch time.

Simultaneous Multithreading (SMT)

• Tullsen et al. (U. of Washington), ISCA ’95
• A way to utilize the pipeline with increased parallelism from multiple threads.


Simultaneous Multithreading

SMT Architecture

• Straightforward extension to conventional superscalar design:
– multiple program counters and some mechanism by which the fetch unit selects one each cycle,
– a separate return stack for each thread for predicting subroutine return destinations,
– per-thread instruction retirement, instruction queue flush, and trap mechanisms,
– a thread id with each branch target buffer entry to avoid predicting phantom branches, and
– a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
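The slide leaves open which mechanism the fetch unit uses to pick a program counter each cycle. One possible policy, sketched below as an assumption rather than as the design of any specific processor, is to fetch from the non-stalled thread with the fewest instructions in flight, similar in spirit to the ICOUNT heuristic from the SMT literature. The `ThreadContext` fields, fetch width, and instruction size are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadContext:
    tid: int
    pc: int
    in_flight: int = 0    # fetched but not yet retired (hypothetical counter)
    stalled: bool = False  # e.g. waiting on an I-cache miss


def select_fetch_thread(threads):
    """Pick the non-stalled thread with the fewest in-flight instructions."""
    candidates = [t for t in threads if not t.stalled]
    if not candidates:
        return None
    # Ties are broken by the lower thread id to keep the choice deterministic.
    return min(candidates, key=lambda t: (t.in_flight, t.tid))


def fetch_cycle(threads, fetch_width=4, instr_bytes=4):
    """Fetch one block for the selected thread; return (tid, fetch address)."""
    chosen = select_fetch_thread(threads)
    if chosen is None:
        return None
    address = chosen.pc
    chosen.pc += fetch_width * instr_bytes  # per-thread program counter
    chosen.in_flight += fetch_width
    return chosen.tid, address
```

Favoring the thread with the fewest in-flight instructions tends to keep one slow thread from filling the shared instruction queues; a plain round-robin selector would be a simpler alternative.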

SMT Performance (Tullsen ’96)

Commercial Machines w/ MT Support

• Intel Hyper-Threading (HT)
– Dual threads
– Pentium 4, Xeon
• Sun CoolThreads
– UltraSPARC T1
– 4 threads per core
• IBM
– POWER5

IBM Power5
http://www.research.ibm.com/journal/rd/494/mathis.pdf

IBM Power5
http://www.research.ibm.com/journal/rd/494/mathis.pdf

SMT Summary

• Pros:
– Increased throughput w/o adding much cost
– Fast response for multitasking environment
• Cons:
– Slower single processor performance

Multicore

• Multiple processor cores on a chip
– Chip multiprocessor (CMP)
– Sun’s Chip Multithreading (CMT)
  • UltraSPARC T1 (Niagara)
– Intel’s Pentium D
– AMD dual-core Opteron
• Also a way to utilize TLP, but
– 2 cores cost 2X
– No good for single thread performance
• Can be used together with SMT


Chip Multithreading (CMT)


Sun UltraSPARC T1 Processor

http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers

8 Cores vs 2 Cores

• Is 8 cores too aggressive?
– Good for server applications, given
  • Lots of threads
  • Scalable operating environment
  • Large memory space (64-bit)
– Good for power efficiency
  • Simple pipeline design for each core
– Good for availability
– Not intended for PCs, gaming, etc.


INTRODUCTION SINGLE CHIPS COMPARISON CONCLUSION

The Case for a Single-Chip Multiprocessor

Kunle Olukotun
Lance Hammond
Basem A. Nayfeh
Ken Wilson
Kunyung Chang

Computer Systems Laboratory
Stanford University
Stanford, CA 94305-4070
Hydra Group
http://www-hydra.stanford.edu


Multiple Single Chip

Introduction

Microprocessor performance growth has been driven by:
• Increase in integration density
• Higher clock rates
• New microarchitectural innovations
– Multiple instruction issue
– Dynamic scheduling
– Speculative execution
– Non-blocking caches

The Limits of the Superscalar Approach

• Recent design trend: CPUs with multiple instruction issue (dynamic scheduling)
• Must track register dependencies between instructions

Superscalar Approach

[Figure: A dynamic superscalar CPU (labels: Instruction, Execution array)]

Superscalar Approach: Instructions

3 factors constrain instruction fetch:
• Mispredicted branches
• Instruction misalignment
• Cache misses

2 ways to implement register renaming:
• Use an explicit table for mapping architectural registers to physical registers
• Use a combination reorder buffer/instruction queue

An instruction is inserted into the instruction issue queue, and is issued for execution once all of its operands are ready.

Superscalar Approach: Instructions

• O: the number of operands per instruction
• W: the issue width of the machine
• O x W: the number of access ports required by the mapping table structure
• Ex: an eight-wide issue machine with three operands per instruction requires a 24-port mapping table

Superscalar Approach: Instructions

• n: the number of bits required to encode a register identifier
• Q: the size of the instruction issue queue
• n x Q x O x W: the number of comparators, which grows with the size of the instruction queue and the issue width
• Ex: an eight-wide issue machine with three operands per instruction, a 64-entry instruction queue, and 6-bit comparisons requires 8 x 3 x 64 x 6 = 9216 comparators
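The two cost formulas from these slides (O x W mapping-table ports and n x Q x O x W comparators) are easy to evaluate for other design points. The helper names below are hypothetical; only the formulas and the eight-wide example come from the slides.

```python
def mapping_table_ports(operands, issue_width):
    """O x W: access ports needed by the register mapping table."""
    return operands * issue_width


def issue_queue_comparators(tag_bits, queue_entries, operands, issue_width):
    """n x Q x O x W: comparators needed by the instruction issue queue."""
    return tag_bits * queue_entries * operands * issue_width


if __name__ == "__main__":
    # Eight-wide machine, three operands, 64-entry queue, 6-bit register ids.
    print(mapping_table_ports(operands=3, issue_width=8))
    print(issue_queue_comparators(tag_bits=6, queue_entries=64,
                                  operands=3, issue_width=8))
    # A two-issue machine with a smaller, 16-entry queue (assumed size).
    print(issue_queue_comparators(tag_bits=6, queue_entries=16,
                                  operands=3, issue_width=2))
```

Because the queue size usually has to grow along with the issue width, the comparator count rises much faster than linearly as the machine gets wider.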

Superscalar Approach: Instructions

• As instruction issue widths increase, larger windows of instructions are required to find independent instructions that can issue in parallel and maintain the full issue bandwidth
-> A quadratic increase in the size of the instruction issue queue
-> will limit the cycle time of the processor
-> will limit the performance of wide-issue superscalar machines

Single-Chip Multiprocessor: Motivation

Technology push
• Technical issues, especially
– the delay of the complex issue queue
– multi-port register files
will limit the performance returns from a wide superscalar execution model.
• Need for a decentralized microarchitecture that maintains the performance growth of microprocessors

Single-Chip Multiprocessor: Motivation

Application pull
• The microarchitecture that works best depends on the amount and characteristics of the parallelism in the applications.
• In terms of parallelism, two different classes of applications require different execution models
• Application pull towards a single-chip multiprocessor

Single-Chip Multiprocessor: Application parallelism comparison

Class 1: applications with low to moderate amounts of parallelism
• Under 10 instructions/cycle
VS
Class 2: applications with large amounts of parallelism
• Greater than 40 instructions/cycle
• Floating point apps, loop-level parallelism

Single-Chip Multiprocessor: Application parallelism comparison

Class 1: applications with low to moderate amounts of parallelism
• Little parallelism to exploit
• These applications work best on moderately superscalar processors with very high clock rates

Single-Chip Multiprocessor: Application parallelism comparison

Class 2: applications with large amounts of parallelism (greater than 40 instructions/cycle)
• These applications see performance benefits from a variety of methods designed to exploit parallelism, such as superscalar, VLIW, and vector processing
• However, recent advances in parallel compilers make a multiprocessor an efficient and flexible way to exploit the parallelism in these programs

1st way to use a multiprocessor

Execution of multiple processes in parallel to increase throughput in a multiprogramming environment
• Runs under the control of a multiprocessor-aware operating system; a number of commercially available OSes (Windows NT, IRIX, Sun Solaris, etc.) have this capability.
• The increasingly widespread use of visualization and multimedia applications tends to increase the number of active processes or independent threads on a desktop machine or server at a particular point in time.

2nd way to use a multiprocessor

Execution of multiple threads in parallel that come from a single application.

Ex1: transaction processing
• The threads communicate using shared memory.
• Designed to run on parallel machines with communication latencies in the hundreds of CPU clock cycles.
• The threads do not communicate in a very fine-grained manner.

2nd way to use a multiprocessor

Execution of multiple threads in parallel that come from a single application.

Ex2: hand-parallelized floating point scientific applications
• Parallelism appears when the instruction window size is very large and the branch prediction is perfect (because the existing parallelism is widely distributed).
• The parallelism exposed in this fine-grained manner cannot be exploited by a conventional multiprocessor architecture.
-> To exploit it, a single-chip multiprocessor architecture is available.

Hand code (code before compile program

3rd way to use a multiprocessor

Accelerating the execution of sequential applications without manual intervention
• Automatic parallelization technology was shown to be effective.
• Usually used in Fortran
• Automatic parallelization derives significant performance benefits from the low-latency inter-processor communication provided by a single-chip multiprocessor.

The difference between the two systems: key characteristics of the two microarchitectures

• The two-issue CPU will have a higher clock rate than the six-issue CPU
• But assume that the two processors have the same clock.

The difference between the two systems: key characteristics of the two microarchitectures

6-Way Superscalar Architecture

[Figure: Floor plan for the six-issue dynamic superscalar]
• Big overhead

6-Way Superscalar Architecture: Terms

TLB: The translation lookaside buffer (TLB) is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory. It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed.
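The "hot list" behaviour of a TLB can be sketched as a small cache in front of the page table. This is an illustrative software model only: the 4 KB page size, the LRU replacement policy, and the class and method names are assumptions, and real TLBs are hardware structures with their own organization.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Tiny fully associative TLB model with LRU replacement (assumed)."""

    def __init__(self, capacity=4):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_table):
        """Translate a virtual address, consulting the page table on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = page_table[vpn]  # slow path: page table lookup
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset
```

Repeated accesses to the same few pages hit in the TLB and skip the page table lookup, which is the locality the slide's "most recently accessed" wording refers to.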

4x2-Way Superscalar Multiprocessor Architecture

[Figure: Floor plan for the four-way single-chip multiprocessor]
• Each processor also has a floating point unit and an integer unit, with smaller overhead
• L1 caches in each processor
• Switch for data communication among processors and memory through the cache

Simulation Environment

Sample Applications

• Integer applications: compress, eqntott, m88ksim, MPsim
• Floating point applications: applu, apsi, swim, tomcatv
• Multiprogramming application: pmake

Sample Applications

Integer applications
• compress – compresses and uncompresses a file in memory
• eqntott – translates logic equations into truth tables
• m88ksim – Motorola 88000 CPU simulator
• MPsim – VCS compiled Verilog simulation of a multiprocessor

Floating point applications
• applu – solver for parabolic/elliptic partial differential equations
• apsi – solves problems of temperature, wind, velocity, and distribution of pollutants
• swim – shallow water model with 1K x 1K grid
• tomcatv – mesh generation with Thompson solver

Multiprogramming application
• pmake – parallel make of gnuchess using C compiler

Performance Comparison: how to compare

• MPCI: misses per completed instruction – the cache miss rates
• IPC: instructions per cycle
• BP rate: branch prediction rate
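The comparison metrics can be computed directly from simulator counters. The sketch below uses hypothetical counter names and made-up sample values purely to show the arithmetic; none of the numbers come from the study.

```python
def ipc(completed_instructions, cycles):
    """Instructions per cycle."""
    return completed_instructions / cycles


def mpci(cache_misses, completed_instructions):
    """Cache misses per completed instruction."""
    return cache_misses / completed_instructions


def bp_rate(correct_predictions, branches):
    """Fraction of branches that were predicted correctly."""
    return correct_predictions / branches


if __name__ == "__main__":
    # Hypothetical counters from one simulated run.
    print("IPC    :", ipc(completed_instructions=1500, cycles=1000))
    print("MPCI   :", mpci(cache_misses=30, completed_instructions=1500))
    print("BP rate:", bp_rate(correct_predictions=180, branches=200))
```

Normalizing misses by completed instructions rather than by memory references makes the miss figures of processors with different issue widths directly comparable.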

Performance Comparison: performance of a single 2-issue superscalar processor

[Figure: IPC breakdown for a single 2-issue processor]

Performance Comparison: performance of the 6-issue superscalar processor

[Figure: IPC breakdown for the 6-issue processor]

Performance Comparison: performance of the 4x2-issue processor

Performance Comparison: comparison of the two microarchitectures

[Figure: Performance comparison of SS and MP (Hydra)]

Conclusion

In selecting the way to design microprocessors:
• A single-chip MP exploits parallelism more effectively at some levels than an SS microprocessor
• The single-chip MP architecture is more efficient than the superscalar architecture in the same physical space
• Provides up to 2x performance on applications with higher levels of parallelism

Case examples

Intel MCPs (1)

[Figure 5.1: The move to Intel multi-core. Roadmap for 2005, 2006, and 2007+ by platform: Itanium® processor, MP Server, DP Server / WS, Desktop Client, Mobile Client. All products and dates are preliminary and subject to change without notice; refer to the ‘fact sheet’ for specific product timings.]
Source: A. Loktu: Itanium 2 for Enterprise Computing
http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps

Intel MCPs (2)

Figure 5.2: Processor specifications of Intel’s Pentium D family (90 nm)
Source: http://www.intel.com/products/processor/index.htm

Intel MCPs (3)

EIST: Enhanced Intel SpeedStep Technology
First delivered in Intel’s mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased average power consumption and decreased average heat production.

VT: Virtualization Technology
It is a set of hardware enhancements to Intel’s server and client platforms that can improve the performance and robustness of traditional software-based virtualization solutions. Virtualization solutions will allow a platform to run multiple operating systems and applications in independent partitions. Using virtualization capabilities, one computer system can function as multiple "virtual" systems.

ED: Execute Disable Bit
Malicious buffer overflow attacks pose a significant security threat. In a typical attack, a malicious worm creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers. Execute Disable Bit allows the processor to classify areas in memory by where application code can execute and where it cannot. When a malicious worm attempts to insert code in the buffer, the processor disables code execution, preventing damage and worm propagation. It can help prevent certain classes of malicious buffer overflow attacks when combined with a supporting operating system.

Intel MCPs (4)

Figure 5.3: Processor specifications of Intel’s Pentium D family (65 nm)
Source: http://www.intel.com/products/processor/index.htm

Intel MCPs (5)

Figure 5.4: Specifications of Intel’s Pentium Processor Extreme Edition models 840/955/965
Source: http://www.intel.com/products/processor/index.htm

Intel MCPs (6)

Figure 5.5: Processor specifications of Intel’s Yonah Duo (Core Duo) family
Source: http://www.intel.com/products/processor/index.htm

Intel MCPs (7)

Figure 5.6: Specifications of Intel’s Core Processors
Source: http://www.intel.com/products/processor_number/chart/core2duo.htm

Intel MCPs (8)

Category | Code Name | Cores | Cache | Market
Desktop | Kentsfield | Dual core, multi-die | 4 MB | Mid 2007
Desktop | Conroe | Dual core, single die | 4 MB shared | End 2006
Desktop | Allendale | Dual core, single die | 2 MB shared | End 2006
Desktop | Cedar Mill (NetBurst/P4) | Single core | 512 kB, 1 MB, 2 MB | Early 2006
Desktop | Presler (NetBurst/P4) | Dual core, dual die | 4 MB | Early 2006
Desktop/Mobile | Millville | Single core | 1 MB | Early 2007
Mobile | Yonah2 | Dual core, single die | 2 MB | Early 2006
Mobile | Yonah1 | Single core | 1/2 MB | Mid 2006
Mobile | Stealey | Single core | 512 kB | Mid 2007
Mobile | Merom | Dual core, single die | 2/4 MB shared | End 2006
Enterprise | Sossaman | Dual core, single die | 2 MB | Early 2006
Enterprise | Woodcrest | Dual core, single die | 4 MB | Mid 2006
Enterprise | Clovertown | Quad core, multi-die | 4 MB | Mid 2007
Enterprise | Dempsey (NetBurst/Xeon) | Dual core, dual die | 4 MB | Mid 2006
Enterprise | Tulsa | Dual core, single die | 4/8/16 MB | End 2006
Enterprise | Whitefield | Quad core, single die | 8 MB, 16 MB shared | Early 2008

Figure 5.7: Future 65 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered

Intel MCPs (9)

Category | Codename | Cores | Cache | Market
Desktop | Wolfdale | Dual core, single die | 3 MB shared | 2008
Desktop | Ridgefield | Dual core, single die | 6 MB shared | 2008
Desktop | Yorkfield | 8 cores, multi-die | 12 MB shared | 2008+
Desktop | Bloomfield | Quad core, single die | - | 2008+
Desktop/Mobile | Perryville | Single core | 2 MB | 2008
Mobile | Penryn | Dual core, single die | 3 MB, 6 MB shared | 2008
Mobile | Silverthorne | - | - | 2008+
Enterprise | Harpertown | 8 cores, multi-die | 12 MB shared | 2008

Figure 5.8: Future 45 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered

Athlon 64 X2

Figure 5.9: AMD Athlon 64 X2 dual-core processor architecture
Source: AMD Athlon 64 X2 Dual-Core Processor for Desktop – Key Architecture Features, http:///www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041.00.html

Page 131: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those

Sun’s UltraSPARC IV/IV+ (1)

Figure 5.10: UltraSPARC IV (Jaguar)

Source: C. Boussard: Architecture des processeurs, http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf

ARB: Arbiter

Page 132

Sun’s UltraSPARC IV/IV+ (2)

Figure 5.11: UltraSPARC IV+ (Panther)

Source: C. Boussard: Architecture des processeurs, http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf

Page 133

POWER4/POWER5 (1)

Figure 5.12: POWER4 chip logical view

Source: J.M. Tendler, S. Dodson, S. Fields, H. Le, B. Sinharoy: Power4 System Microarchitecture, IBM Server, Technical White Paper, October 2001, http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.pdf

Figure labels: Built-In Self-Test; Service Processor; Power-On Reset; Core Interface Unit (crossbar); Non-Cacheable Unit; Multi-Chip Module

Page 134

POWER4/POWER5 (2)

Figure 5.13: POWER4 chip

Source: R. Kalla, B. Sinharoy, J. Tendler: Simultaneous Multi-threading Implementation in Power5 – IBM’s Next Generation POWER Microprocessor, 2003

http://www.hotchips.org/archives/hc15/3_Tue/11.ibm.pdf

Page 135

POWER4/POWER5 (3)

Figure 5.14: POWER4 and POWER5 system structures

Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.

Fabric Controller

Page 136

Cell

Figure 5.15: Cell (BE) microarchitecture

Source: IBM: „Cell Broadband Engine™ processor-based systems”, IBM Corp. 2006

SPE: Synergistic Processing Element

EIB: Element Interconnect Bus

MFC: Memory Flow Controller

PPE: Power Processing Element

AUC: Atomic Update Cache

Page 137

Cell (2)

Figure 5.16: Cell SPE architecture

Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html

Page 138

Cell (3)

Figure 5.17: Cell floorplan

Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html

Page 139

Issues and Challenges

• Memory Organization
– Distributed? Shared?
• Interconnect and Communication Protocols
• Coherency and Consistency – Memory/Cache
• Scheduling, Load Balancing and Synchronization
• Reliability? Energy Efficiency?
• We will see all of these throughout the rest of the course!

Page 140

Next Week

• We will talk about Scheduling and Load Balancing issues for systems with multiple processing nodes (which in effect covers CMPs as well).

• We will also talk about the possibility of quizzes…