CHARIS THEOCHARIDES ([email protected])
ΗΜΥ 656 ADVANCED COMPUTER ARCHITECTURE
Spring Semester 2007, LECTURE 5: Chip Multiprocessors – The New Era in Processor Architectures
Acknowledgements:
Wen-Mei Hwu, Kunle Olukotun, Shih-Hao Hung, Dezső Sima, and the Stanford Hydra Group.
Microarchitecture: Overview
Instruction Supply → Execution Mechanism → Data Supply

"Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution" – paraphrased from M. Alsup, AMD Fellow
Microarchitecture, 1990
• Short pipelines
• On-chip I and D caches, blocking
• Simple prediction
Microarchitecture, 2000
• Mechanisms to find parallel instructions
  – dynamic scheduling
  – static scheduling
• On-chip cache hierarchies, with non-blocking, higher-bandwidth caches
• Sophisticated branch prediction
Future Microarchitecture: One Perspective
Instruction Supply → Execution Mechanism → Data Supply

"Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution" – paraphrased from M. Alsup, AMD Fellow
Where are we headed?
• More ILP: even wider, deeper
  – enabling technology: speculation, predication, compiler transformations, binary re-optimization, complexity-effective design
• Multithreading
  – enabling technology: speculation, subordinate threads, discovery of thread-level parallelism
• Chip Multiprocessors
  – enabling technology: speculation, discovery of thread-level, coarse-grained parallelism
More ILP
• Instruction Supply
  – branches, cache misses, partial fetches
• Data Supply
  – higher bandwidth, lower latency, memory ordering, non-blocking caches
• Execution
  – reduction of redundant work, design complexity and partitioning
• Tolerating Latency
  – Can some things just take a long time?
Multithreading [Burton Smith, 1978]
Fetch → Execute → Write Back
This is a snapshot of the pipeline during a single cycle. Each color represents instructions from a different thread.
B. Smith's original concept was for a single-wide pipeline, but it extends naturally to a multiple-issue pipeline.
Simultaneous Multithreading [W. Yamamoto, 1994 / D. Tullsen, 1995]
Fetch → Execute → Write Back
Simultaneous Multithreading, possible implementation
Front End Back End
• Intel Hyper-Threading in the Pentium 4 [Hot Chips 14] is the first realization, with two threads
• Small ISA register file minimizes the effect of replication
• Replicated retirement logic
• Minimal hardware overhead, but a major increase in verification cost
Chip Multiprocessor [K. Olukotun, 1996]
Fetch → Execute → Write Back

[Diagram: Proc A, Proc B, Proc C, and Proc D sharing an L2 cache]

A single processor die contains multiple CPUs, all of which share some amount of resources, such as an L2 cache and chip pins.
Hardware Accelerators
Existing Solutions…
• Intel IXP1200 Network Processor [block diagram: ARM core, microengines, access control]
• Philips Nexperia (Viper) [block diagram: MIPS core, MPEG, VLIW video, MSP]
• IBM Cell ... what's next? ...
Discussion/Thought Exercise
• What are the essential differences between the SMT model of execution and the CMP model?
  – What resources are shared, and in what manner?
  – What type of data movement exists in one but not the others?
  – What types of applications/situations are the best cases for each model?
The Advent of Superscalar Architecture
Transistor densities increased at a stunning pace.
Is there any way to use those transistors to increase computing performance?
Put more than one ALU on a chip.
The IBM RS/6000, released in 1990, was the world's first superscalar CPU.
Most general-purpose CPUs developed since about 1998 are superscalar.
Technology ↔ Architecture
• Transistors are cheap, plentiful, and fast
  – Moore's law
  – 100 million transistors by 2000
• Wires are cheap, plentiful, and slow
  – Wires get slower relative to transistors
  – Long cross-chip wires are especially slow
• Architectural implications
  – Plenty of room for innovation
  – Single-cycle communication requires localized blocks of logic
  – High communication bandwidth across the chip is easier to achieve than low latency
Exploiting Program Parallelism
[Figure: levels of parallelism – instruction, loop, thread, process – plotted against grain size, from 1 to 1M instructions]
Future Processors to use Coarse-Grain Parallelism
• Today's microprocessors utilize instruction-level parallelism by a deep instruction pipeline and by the superscalar or VLIW multiple-issue techniques
• Today's (2001) technology: approx. 40 M transistors per chip; in the future (2012): 1.4 G transistors per chip
What next?
• Two directions:
  – increase of single-thread performance --> use of more speculative instruction-level parallelism
  – increase of multi-thread (multi-task) performance --> utilize thread-level parallelism in addition to instruction-level parallelism
  A "thread" in this lecture means a "HW thread", which can be a SW (Posix) thread, a process, ...
• Far future (??): increase of single-thread performance by use of speculative instruction-level and thread-level parallelism
Advanced Superscalar Processors for Billion Transistor Chips in Year 2005 - Characteristics
• Aggressive speculation, such as a very aggressive dynamic branch predictor,
• a large trace cache,
• very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
• a large number of reservation stations, to accommodate 2,000 instructions,
• 24 to 48 highly optimized, pipelined functional units,
• sufficient on-chip data cache, and
• sufficient resolution and forwarding logic.
– see: Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, Jared Stark: One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, September 1997, pp. 51-57.
Requirements and Solutions
• Delivering optimal instruction bandwidth requires:
  – a minimal number of empty fetch cycles,
  – a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle,
  – and a minimal number of cycles in which the instructions fetched are subsequently discarded.
• Consuming this instruction bandwidth requires:
  – sufficient data supply,
  – and sufficient processing resources to handle the instruction bandwidth.
• Suggestions:
  – an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses),
  – a large trace cache for providing a logically contiguous instruction stream,
  – an aggressive Multi-Hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling; a selector sketch follows below.
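To make the hybrid-predictor idea concrete, here is a minimal tournament-style selector sketch. This is our illustration, not the Patt et al. Multi-Hybrid design: real hybrids use more components and history-based indexing, and the class names here are made up.

```python
# Minimal tournament-style hybrid branch predictor sketch (illustrative only).
class TwoBitCounter:
    """Saturating 2-bit counter: values 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, value=2):
        self.value = value
    def predict(self):
        return self.value >= 2
    def train(self, taken):
        self.value = min(3, self.value + 1) if taken else max(0, self.value - 1)

class HybridPredictor:
    def __init__(self, entries=1024):
        self.p0 = [TwoBitCounter() for _ in range(entries)]       # one component predictor
        self.p1 = [TwoBitCounter() for _ in range(entries)]       # a second component
        self.chooser = [TwoBitCounter() for _ in range(entries)]  # picks p0 vs. p1 per branch
        self.mask = entries - 1
    def predict(self, pc):
        i = pc & self.mask
        return self.p1[i].predict() if self.chooser[i].predict() else self.p0[i].predict()
    def update(self, pc, taken):
        i = pc & self.mask
        c0 = self.p0[i].predict() == taken
        c1 = self.p1[i].predict() == taken
        if c0 != c1:                   # train the chooser only when the components disagree
            self.chooser[i].train(c1)  # move toward whichever component was right
        self.p0[i].train(taken)
        self.p1[i].train(taken)
```

The design point this illustrates: each component can be tuned to a different class of branches, and the chooser learns per-branch which component to trust.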
Future Processors to use Coarse-Grain Parallelism
• Chip multiprocessors (CMPs) or multiprocessor chips
  – integrate two or more complete processors on a single chip,
  – every functional unit of a processor is duplicated
• Simultaneous multithreaded (SMT) processors
  – store multiple contexts in different register sets on the chip,
  – the functional units are multiplexed between the threads,
  – instructions of different contexts are executed simultaneously
Chip Multiprocessors (CMPs): Principal Chip Multiprocessor Alternatives
• symmetric multiprocessor (SMP),
• distributed shared memory multiprocessor (DSM),
• message-passing shared-nothing multiprocessor.
Organizational principles of multiprocessors:
[Figure: three organizations. SMP (symmetric multiprocessor): processors connected by an interconnection to a shared global memory – shared address space. DSM (distributed-shared-memory multiprocessor): processors with local memories over an interconnection – physically distributed memory, shared address space. Message-passing (shared-nothing) multiprocessor: processors with local memories communicating by send/receive – distributed address spaces.]
Typical SMP
[Figure: four processors, each with a primary cache and a secondary cache, connected by a bus to a global memory]
Shared memory candidates for CMPs
[Figures: shared-memory candidates. (a) Shared main memory: four processors, each with private primary and secondary caches, in front of one global memory. (b) Shared secondary cache: four processors with private primary caches sharing one secondary cache in front of the global memory.]
Shared memory candidates for CMPs
[Figure: (c) shared primary cache: four processors sharing one primary cache, one secondary cache, and the global memory]
Grain levels for CMPs
• multiple processes in parallel
• multiple threads from a single application
  ⇒ implies a common address space for all threads
• extracting threads of control dynamically from a single instruction stream
  ⇒ see last lecture: multiscalar, trace processors, ...
Hydra: A Single-Chip Multiprocessor
[Figure: the Hydra die – a single chip. Four CPUs, each with a primary I-cache, primary D-cache, and its own memory controller, connected through centralized bus arbitration mechanisms to an on-chip secondary cache, a Rambus memory interface (off-chip DRAM main memory), an off-chip L3 interface, an I/O bus interface (I/O devices), and DMA.]
Multithreaded Processors
• Aim: latency tolerance
• What is the problem?
• Load access latencies measured on an AlphaServer 4100 SMP with four 300 MHz Alpha 21164 processors are:
  – 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,
  – 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,
  – 80 cycles for a miss that is served by the memory, and
  – 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory (a back-of-the-envelope use of these numbers follows below).
• Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.
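Plugging the measured penalties into a back-of-the-envelope average-load-latency calculation shows why these misses dominate. Only the cycle counts come from the slide; the miss fractions below are assumptions chosen for illustration.

```python
# Average load latency on the AlphaServer 4100 numbers above, for assumed
# (illustrative) miss fractions at each level -- not measured values.
l1_hit = 1           # cycles, assumed L1 hit time
l2_penalty = 7       # L1 miss hitting in the on-chip L2
l3_penalty = 21      # L2 miss hitting in the board-level L3
mem_penalty = 80     # miss served by memory
dirty_penalty = 125  # dirty miss served from another processor's cache

# Assumed fractions of all loads (illustrative only):
m_l1, m_l2, m_mem, m_dirty = 0.05, 0.02, 0.01, 0.002

amat = (l1_hit + m_l1 * l2_penalty + m_l2 * l3_penalty
        + m_mem * mem_penalty + m_dirty * dirty_penalty)
print(f"average load latency ~ {amat:.2f} cycles")  # ~2.82 cycles
```

Even with these modest miss fractions, misses nearly triple the effective load latency; that stalled time is exactly what a multithreaded processor fills with work from other threads.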
[Figure: four on-chip register sets, each with its own PC and PSR, holding the contexts of threads 1-4]
Multithreaded Processors
• Multithreading:
  – provide several program counters (and usually several register sets) on chip
  – fast context switching by switching to another thread of control
Approaches of Multithreaded Processors
• Cycle-by-cycle interleaving
  – An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block interleaving
  – The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
• Simultaneous multithreading
  – Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  – Combines wide superscalar instruction issue with multithreading.
Comparison of Multithreading with Non-Multithreading Approaches:
(a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block-interleaving multithreaded scalar
[Figure: issue slots over time (processor cycles) for (a)-(c); context switches are marked in (b) and (c)]
Comparison of Multithreading with Non-Multithreading Approaches:
(a) superscalar, (b) VLIW, (c) cycle-by-cycle interleaving, (d) cycle-by-cycle interleaving VLIW
[Figure: issue slots over time (processor cycles) for (a)-(d), with N marking empty issue slots and context switches marked in (c) and (d)]
Comparison of Multithreading with Non-Multithreading:
simultaneous multithreading (SMT) and chip multiprocessor (CMP)
[Figure: issue slots over time (processor cycles) for SMT and CMP]
Cycle-by-Cycle Interleaving
• The processor switches to a different thread after each instruction fetch
• Pipeline hazards cannot arise, and the processor pipeline can be easily built without the necessity of complex forwarding paths
• Context-switching overhead is zero cycles
• Memory latency is tolerated by not scheduling a thread until the memory transaction has completed
• Requires at least as many threads as pipeline stages in the processor
• Degrades single-thread performance if not enough threads are present
Cycle-by-Cycle Interleaving – Improving single-thread performance
• The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  – The scheduler feeds instructions of the same thread that are not data- or control-dependent successively into the pipeline.
• The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.
Single-core computer
[Figure: a single-core CPU chip, with the single core highlighted]
Multi-core architectures
• This lecture is about a new trend in computer architecture: replicate multiple processor cores on a single die.
[Figure: multi-core CPU chip with cores 1-4]

Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
The cores run in parallel
[Figure: threads 1-4, one running on each of the four cores]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: several threads time-multiplexed on each of the four cores]
Interaction with OS
• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major OSes support multi-core today (a quick user-space check is sketched below)
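A quick way to observe this from user space. A sketch: os.cpu_count() reports logical processors, and os.sched_setaffinity is available on Linux only, hence the guard.

```python
import os

print("logical processors visible to the OS:", os.cpu_count())

# On Linux, a process can be pinned to a subset of cores; the OS scheduler
# then maps its threads only onto those cores.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0, 1})               # restrict this process to cores 0 and 1
    print("now runnable on cores:", os.sched_getaffinity(0))
```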
Why multi-core?
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed-of-light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server, database server); a minimal thread-pool sketch follows below
• A computer game can do AI, graphics, and physics in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
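The server example maps directly onto a thread pool. A minimal sketch: handle_client and the integer request IDs are hypothetical stand-ins for real request handling.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_client(request_id):
    # Hypothetical per-client work: parse the request, query a database, build a reply.
    return f"response to request {request_id}"

# One worker thread per in-flight client; on a multi-core CPU the OS can run
# these threads on different cores, exploiting thread-level parallelism.
with ThreadPoolExecutor(max_workers=8) as pool:
    for response in pool.map(handle_client, range(32)):
        print(response)
```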
General context: Multiprocessors
• A multiprocessor is any computer with several processors
• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data

[Photo: Lemieux cluster, Pittsburgh Supercomputing Center]
Multiprocessor memory types
• Shared memory: in this model, there is one (large) common shared memory for all processors
• Distributed memory: in this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
A multi-core processor is a special kind of multiprocessor: all processors are on the same chip
• Multi-core processors are MIMD: different cores execute different threads (multiple instructions), operating on different parts of memory (multiple data)
• Multi-core is a shared-memory multiprocessor: all cores share the same memory (a contrast of the two styles is sketched below)
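The two memory models show up directly in programming style. A sketch using Python's standard library: the threads share one address space, while the processes exchange data only through an explicit queue (message passing). Note that CPython's GIL limits true parallel execution of threads, but the memory-sharing contrast still holds.

```python
import threading, multiprocessing

def thread_worker(i, shared):
    shared[i] = i * i          # shared memory: writes are visible to every thread

def process_worker(i, q):
    q.put((i, i * i))          # message passing: explicit "send"; no memory is shared

if __name__ == "__main__":
    # Shared memory: threads operate on one common list (one address space).
    shared = [0] * 4
    threads = [threading.Thread(target=thread_worker, args=(i, shared)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("shared memory:", shared)

    # Message passing: separate processes, data moves through a queue.
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=process_worker, args=(i, q)) for i in range(4)]
    for p in procs: p.start()
    results = [q.get() for _ in range(4)]   # explicit "receive"
    for p in procs: p.join()
    print("message passing:", sorted(results))
```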
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
Each can run on its own core
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize
A technique complementary to multi-core: Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
  – waiting for the result of a long floating-point (or integer) operation
  – waiting for data to arrive from memory
  While one unit is stalled, the other execution units wait unused.
[Figure: Pentium 4 pipeline – BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, BTB, L2 cache and control, bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple "threads" on the same core
• Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
[Figure: the same pipeline occupied by Thread 1 (floating point) only]
Without SMT, only a single thread can run at any given time
[Figure: the pipeline occupied by Thread 2 (integer operation) only]
SMT processor: both threads can run concurrently
[Figure: the pipeline occupied by Thread 1 (floating point) and Thread 2 (integer operation) at the same time]
But: Can’t simultaneously use the same functional unit
[Figure: both threads trying to use the same integer unit at once]
This scenario is impossible with SMT on a single core (assuming a single integer unit).
SMT is not a "true" parallel processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Multi-core: threads can run on separate cores
[Figures: two cores, each with its own complete pipeline. One snapshot shows Thread 1 and Thread 3 running on the two cores; another shows Thread 2 and Thread 4.]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT-enabled cores – Threads 1 and 2 share core 1 while Threads 3 and 4 share core 2]
Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT:
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level parallelism
Concepts – Multithreading
• Data Access Latency
  – Cache misses (L1, L2)
  – Memory latency (remote, local)
  – Often unpredictable
• Multithreading (MT)
  – Tolerate or mask long and often unpredictable latency operations by switching to another context that is able to do useful work.
Why Multithreading Today?
• ILP is exhausted, TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip.
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.
Classical Problem, 60’ & 70’• I/O latency prompted multitasking • IBM mainframes • Multitasking • I/O processors • Caches within disk controllers
Requirements of Multithreading
• Storage to hold multiple contexts' PC, registers, status word, etc.
• Coordination to match an event with a saved context
• A way to switch contexts
• Long-latency operations must use resources not in use
(a software analogy for these requirements is sketched below)
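A software analogy for these four requirements. A sketch, not real hardware: each context holds a PC, registers, and a status word; a long-latency event parks the context under a tag, and completion matches the tag back to the saved context.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Context:                     # the state the hardware must hold per thread
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    status: str = "ready"          # "ready" or "waiting"
    pending_tag: int = -1          # tag of the outstanding long-latency request

ready = deque(Context(pc=i) for i in range(4))
waiting = {}                       # tag -> context, to match an event with a saved context

def on_long_latency(ctx, tag):
    ctx.status, ctx.pending_tag = "waiting", tag
    waiting[tag] = ctx             # save the context; the pipeline is free for others

def on_completion(tag):
    ctx = waiting.pop(tag)         # coordination: match the event to its context
    ctx.status = "ready"
    ready.append(ctx)              # the context is a switch target again

ctx = ready.popleft()              # ctx issues a load that misses...
on_long_latency(ctx, tag=17)       # ...and is parked until the data returns
on_completion(17)                  # memory responds; ctx becomes runnable again
```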
Processor Utilization vs. Latency
R = the run length to a long-latency event
L = the amount of latency
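With these two quantities the classic utilization argument can be stated directly. This is the standard back-of-the-envelope model, assuming negligible switch cost; it is not a formula from the slide:

```latex
U_1 = \frac{R}{R+L},
\qquad
U_N \approx \min\!\left(1,\; \frac{N\,R}{R+L}\right),
\qquad
\text{saturation at } N \ge 1 + \frac{L}{R}.
```

For example, with R = 40 cycles of useful work before each L = 120-cycle event, a single thread achieves only 25% utilization, while four threads suffice to hide the latency entirely.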
Problem of 80’• Problem was revisited due to the advent of
graphics workstations– Xerox Alto, TI Explorer – Concurrent processes are interleaved to allow for
the workstations to be more responsive. – These processes could drive or monitor display,
input, file system, network, user processing – Process switch was slow so the subsystems
were microprogrammed to support multiple contexts
Scalable Multiprocessor (90’)• Dance hall – a shared interconnect with memory on one side
and processors on the other. • Or processors may have local memory
How do the processors communicate?• Shared Memory • Potential long latency on every load
– Cache coherency becomes an issue – Examples include NYU’s Ultracomputer, IBM’s RP3, BBN’s
Butterfly, MIT’s Alewife, and later Stanford’s Dash. – Synchronization occurs through share variables, locks, flags, and
semaphores. • Message Passing
– Programmer deals with latency. This enables them to minimize the number of messages, while maximizing the size, and this scheme allows for delay minimization by sending a message so that it reaches the receiver at the time it expects it.
– Examples include Intel’s PSC and Paragon, Caltech’s Cosmic Cube, and Thinking Machines’ CM-5
– Synchronization occurs through send and receive
Cycle-by-Cycle Interleaved Multithreading
• Denelcor HEP-1 (1982), HEP-2
• Horizon, which was never built
• Tera MTA
Cycle-by-Cycle Interleaved Multithreading
• Features
  – An instruction from a different context is launched at each clock cycle
  – No interlocks or bypasses, thanks to a non-blocking pipeline (a toy model follows below)
• Optimizations:
  – leaving context state in the processor (PC, register #, status)
  – assigning tags to remote requests and then matching them on completion
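A toy model of this behavior. A sketch under simplified assumptions: single-issue pipeline, fixed memory latency, one outstanding request per thread, and a made-up rule for which instructions miss.

```python
# Toy barrel-processor model: issue one instruction from a different thread
# each cycle (round robin), skipping threads stalled on a memory request.
import itertools

NUM_THREADS, MEM_LATENCY, CYCLES = 8, 6, 64
ready_at = [0] * NUM_THREADS       # cycle at which each thread may issue again
issued = 0

rr = itertools.cycle(range(NUM_THREADS))
for cycle in range(CYCLES):
    for _ in range(NUM_THREADS):   # find the next ready thread, round robin
        t = next(rr)
        if ready_at[t] <= cycle:
            issued += 1
            if (cycle + t) % 4 == 0:           # pretend every 4th instruction misses
                ready_at[t] = cycle + MEM_LATENCY
            break                              # one issue slot per cycle

print(f"utilization: {issued / CYCLES:.0%}")   # near 100% with enough threads
```

By the time a given thread issues again, its previous operation has long since completed, which is why the real machines need no interlocks; drop NUM_THREADS below the pipeline depth or latency and the utilization falls, matching the single-thread-degradation point above.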
Challenges with this approach
• I-cache:
  – instruction bandwidth
  – I-cache misses: since instructions are being grabbed from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
• Register file access time:
  – Register file access time increases because the regfile had to grow significantly to accommodate many separate contexts.
  – In fact, the HEP and Tera use SRAM to implement the regfile, which means longer access times.
• Single-thread performance:
  – Single-thread performance is significantly degraded, since the processor is forced to switch to a new thread even if no other thread is available.
• Very high bandwidth network, which is fast and wide
• Retries on load-empty or store-full
Improving Single-Thread Performance
• Do more operations per instruction (VLIW)
• Allow multiple instructions to issue into the pipeline from each context
  – This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution
  – For Horizon & Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context when one is detected
• Switch on load
• Switch on miss
  – Switching on load or miss will increase the context-switch time
Simultaneous Multithreading (SMT)
• Tullsen et al. (U. of Washington), ISCA '95
• A way to utilize the pipeline with increased parallelism from multiple threads
Simultaneous Multithreading
SMT Architecture
• Straightforward extension to a conventional superscalar design:
  – multiple program counters and some mechanism by which the fetch unit selects one each cycle,
  – a separate return stack for each thread for predicting subroutine return destinations,
  – per-thread instruction retirement, instruction queue flush, and trap mechanisms,
  – a thread ID with each branch target buffer entry to avoid predicting phantom branches, and
  – a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
SMT Performance [Tullsen '96]
Commercial Machines w/ MT Support
• Intel Hyper-Threading (HT)
  – dual threads
  – Pentium 4, Xeon
• Sun CoolThreads
  – UltraSPARC T1
  – 4 threads per core
• IBM
  – POWER5
IBM POWER5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
SMT Summary
• Pros:
  – increased throughput w/o adding much cost
  – fast response for multitasking environments
• Cons:
  – slower single-thread performance
Multicore
• Multiple processor cores on a chip
  – chip multiprocessor (CMP)
  – Sun's Chip Multithreading (CMT)
    • UltraSPARC T1 (Niagara)
  – Intel's Pentium D
  – AMD dual-core Opteron
• Also a way to utilize TLP, but
  – 2 cores means roughly 2x cost
  – no good for single-thread performance
• Can be used together with SMT
Chip Multithreading (CMT)
Sun UltraSPARC T1 Processor
http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers
8 Cores vs 2 Cores
• Is 8 cores too aggressive?
  – Good for server applications, given:
    • lots of threads
    • a scalable operating environment
    • a large memory space (64-bit)
  – Good for power efficiency
    • simple pipeline design for each core
  – Good for availability
  – Not intended for PCs, gaming, etc.
The Case for a Single-Chip Multiprocessor
Kunle Olukotun ([email protected]), Lance Hammond ([email protected]), Basem A. Nayfeh ([email protected]), Ken Wilson, Kunyung Chang
Computer Systems Laboratory, Stanford University, Stanford, CA 94305-4070
Hydra Group: http://www-hydra.stanford.edu
Introduction
• Increase in integration density
• Higher clock rates
• New microarchitectural innovations: multiple instruction issue, dynamic scheduling, speculative execution, non-blocking caches
--> Microprocessor performance growth
The Limits of the Superscalar Approach
• Recent design trend: CPUs with multiple instruction issue (dynamic scheduling)
• Must track register dependencies between instructions
[Figure: a dynamic superscalar CPU – instruction fetch feeding an execution array]
• 3 factors constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses
• 2 ways to implement register renaming:
  – use an explicit table for mapping architectural registers to physical registers
  – use a combined reorder buffer / instruction queue
• An instruction is inserted into the instruction issue queue, and is issued for execution once all of its operands are ready
Mapping-table cost:
• O: the number of operands per instruction
• W: the issue width of the machine
• O x W: the number of access ports required by the mapping-table structure
• Ex: an eight-wide issue machine with three operands per instruction requires a 24-port mapping table
Issue-queue cost:
• n: the number of bits required to encode a register identifier
• Q: the size of the instruction issue queue
• n x Q x O x W: the number of comparators grows with the size of the instruction queue and the issue width
• Ex: an eight-wide issue machine with three operands per instruction, a 64-entry instruction queue, and 6-bit comparisons requires 8 x 3 x 64 x 6 = 9,216 comparators (a small calculator for both formulas follows below)
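Both scaling rules fit in one small calculation. A sketch: the constants are the slide's example values, while the window sizes in the loop are our own choices for illustration.

```python
# Port and comparator counts for register renaming / instruction wakeup,
# using the slide's formulas: ports = O x W, comparators = n x Q x O x W.
def rename_ports(O, W):
    return O * W                     # mapping-table access ports

def wakeup_comparators(n, Q, O, W):
    return n * Q * O * W             # single-bit comparators in the issue queue

print(rename_ports(O=3, W=8))                    # 24 ports (slide example)
print(wakeup_comparators(n=6, Q=64, O=3, W=8))   # 9216 comparators (slide example)

# Widening the issue also forces a larger window (Q grows roughly with W),
# so the wakeup cost grows roughly quadratically with issue width:
for W, Q in [(4, 32), (8, 64), (16, 128), (32, 256)]:
    print(W, Q, wakeup_comparators(n=6, Q=Q, O=3, W=W))
```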
• As instruction issue widths increase, larger windows of instructions are required to find independent instructions that can issue in parallel and maintain the full issue bandwidth
--> a quadratic increase in the size of the instruction issue queue
--> will limit the cycle time of the processor
--> will limit the performance of wide-issue superscalar machines
Motivation: technology push
• Technical issues – especially the delay of the complex issue queue and multi-ported register files – will limit the performance returns from a wide superscalar execution model
• Need for a decentralized microarchitecture that maintains the performance growth of microprocessors
--> Single-Chip Multiprocessor
Motivation: application pull
• The microarchitecture that works best depends on the amount and characteristics of the parallelism in the applications
• Applications with different amounts of parallelism require different execution models; this pulls toward a single-chip multiprocessor
Application parallelism comparison
• Class 1: applications with low to moderate amounts of parallelism – under 10 instructions/cycle
• Class 2: applications with large amounts of parallelism – greater than 40 instructions/cycle (floating-point apps, loop-level parallelism)
• Class 1: little parallelism to exploit – these applications work best on moderately superscalar processors with very high clock rates
• Class 2: large amounts of parallelism, with performance benefits from a variety of methods designed to exploit parallelism, such as superscalar, VLIW, and vector processing. However, recent advances in parallel compilers make a multiprocessor an efficient and flexible way to exploit the parallelism in these programs.
1st way to use a multiprocessor: execution of multiple processes in parallel to increase throughput in a multiprogramming environment
• Requires a multiprocessor-aware operating system; a number of commercially available OSes (Windows NT, IRIX, Sun Solaris, etc.) have this capability
• The increasingly widespread use of visualization and multimedia applications tends to increase the number of active processes or independent threads on a desktop machine or server at a particular point in time
2nd way to use a multiprocessor: execution of multiple threads in parallel that come from a single application
• Ex. 1: transaction processing
  – The threads communicate using shared memory
  – Designed to run on parallel machines with communication latencies in the hundreds of CPU clock cycles
  – The threads do not communicate in a very fine-grained manner
• Ex. 2: hand-parallelized floating-point scientific applications (parallelized by hand, before compiling the program)
  – Additional fine-grained parallelism appears only when the instruction window is very large and branch prediction is perfect (because the existing parallelism is widely distributed)
  – Parallelism exposed in this fine-grained manner cannot be exploited by a conventional multiprocessor architecture
  --> to exploit it, a single-chip multiprocessor architecture is available
3rd way to use a multiprocessor: accelerating the execution of sequential applications without manual intervention
• Automatic parallelization technology was shown to be effective (usually used on Fortran programs)
• Automatic parallelization derives significant performance benefits from the low-latency inter-processor communication provided by a single-chip multiprocessor
The difference of the two systems
• A two-issue CPU will have a higher clock rate than the six-issue CPU, but assume that the two processors have the same clock.
[Table: key characteristics of the two microarchitectures]
6-Way Superscalar Architecture
[Figure: floorplan for the six-issue dynamic superscalar processor; the issue logic is a big overhead]
TERMS
TLB: The translation lookaside buffer (TLB) is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory. It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed. (A functional sketch follows below.)
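Functionally, a TLB behaves like a small cache in front of the page table. A dict-based sketch that ignores associativity, permissions, process IDs, and replacement; the toy page-table entries are made up.

```python
PAGE_SIZE = 4096                  # bytes; virtual page number = address // PAGE_SIZE

tlb = {}                          # virtual page number -> physical frame number
page_table = {0: 7, 1: 3, 2: 9}   # assumed toy page table

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                # TLB hit: fast path, no page-table access
        pfn = tlb[vpn]
    else:                         # TLB miss: walk the page table, refill the TLB
        pfn = page_table[vpn]
        tlb[vpn] = pfn
    return pfn * PAGE_SIZE + offset

print(hex(translate(0x1234)))     # vpn 1 -> frame 3: prints 0x3234
```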
4x2-Way Superscalar Multiprocessor Architecture
[Figure: floorplan for the four-way single-chip multiprocessor. Each CPU also has its own floating-point unit, integer unit, and L1 caches, with smaller overhead; a switch handles data communication among the processors and memory through the cache.]
Simulation Environment
[Tables: simulation environment parameters]
Applications: Sample Applications
• Floating-point applications: applu, apsi, swim, tomcatv
• Multiprogramming applications: pmake
• Integer applications: compress, eqntott, m88ksim, MPsim
Sample Applications
• compress – compresses and uncompresses a file in memory
• eqntott – translates logic equations into truth tables
• m88ksim – Motorola 88000 CPU simulator
• MPsim – VCS-compiled Verilog simulation of a multiprocessor
• applu – solver for parabolic/elliptic partial differential equations
• swim – shallow-water model with a 1K x 1K grid
• tomcatv – mesh generation with Thomson solver
• apsi – solves problems of temperature, wind, velocity, and distribution of pollutants
• pmake – parallel make of gnuchess using the C compiler
Performance Comparison: how to compare
• MPCI: misses per completed instruction – the cache miss rate
• IPC: instructions per cycle
• BP rate: branch prediction rate
(a sketch of how these fall out of raw counts follows below)
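How the three metrics are computed from raw simulation counts. A sketch with made-up counts, under the common definitions assumed here (BP rate taken as the fraction of branches predicted correctly).

```python
# Comparison metrics from raw simulation counts (counts are illustrative).
completed_instructions = 1_000_000
cycles                 = 1_400_000
cache_misses           =    25_000
branches, mispredicts  =   180_000, 9_000

ipc     = completed_instructions / cycles          # instructions per cycle
mpci    = cache_misses / completed_instructions    # misses per completed instruction
bp_rate = 1 - mispredicts / branches               # branch prediction rate

print(f"IPC = {ipc:.2f}, MPCI = {mpci:.4f}, BP rate = {bp_rate:.1%}")
```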
Performance of a single 2-issue superscalar processor
[Figure: IPC breakdown for a single 2-issue processor]
Performance of the 6-issue superscalar processor
[Figure: IPC breakdown for the 6-issue processor]
Performance of the 4x2-issue processor
[Figure: performance of the 4x2-issue processor]
Performance comparison of SS and MP
[Figure: performance comparison, SS vs. MP (Hydra)]
Conclusion: in selecting the way to design microprocessors,
• a single-chip MP exploits parallelism more effectively at some levels than an SS microprocessor,
• a single-chip MP architecture is more efficient than a superscalar architecture in the same physical space,
• and it provides up to 2x performance on applications with higher levels of parallelism.
Case examples
Intel MCPs (1)
The Move to Intel Multi-core
[Timeline figure: platforms (Itanium® processor, MP Server, DP Server / WS, Desktop Client, Mobile Client) moving to multi-core across 2005, 2006, and 2007+, with a "today" marker]
All products and dates are preliminary and subject to change without notice. Refer to 'fact sheet' for specific product timings.
Figure 5.1: The move to Intel multi-core
Source: A. Loktu: Itanium 2 for Enterprise Computing
http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
Intel MCPs (2)
Figure 5.2: Processor specifications of Intel's Pentium D family (90 nm)
Source: http://www.intel.com/products/processor/index.htm
EIST: Enhanced Intel SpeedStep Technology
First delivered in Intel's mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased average power consumption and decreased average heat production.

VT: Virtualization Technology
A set of hardware enhancements to Intel's server and client platforms that can improve the performance and robustness of traditional software-based virtualization solutions. Virtualization solutions allow a platform to run multiple operating systems and applications in independent partitions. Using virtualization capabilities, one computer system can function as multiple "virtual" systems.

ED: Execute Disable Bit
Malicious buffer overflow attacks pose a significant security threat. In a typical attack, a malicious worm creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers. The Execute Disable Bit allows the processor to classify areas in memory by where application code can execute and where it cannot. When a malicious worm attempts to insert code in the buffer, the processor disables code execution, preventing damage and worm propagation. It can help prevent certain classes of malicious buffer overflow attacks when combined with a supporting operating system.
Intel MCPs (3)
Intel MCPs (4)
Figure 5.3: Processor specifications of Intel's Pentium D family (65 nm)
Source: http://www.intel.com/products/processor/index.htm
Intel MCPs (5)
Figure 5.4: Specifications of Intel's Pentium Processor Extreme Edition models 840/955/965
Source: http://www.intel.com/products/processor/index.htm
Intel MCPs (6)
Figure 5.5: Processor specifications of Intel's Yonah Duo (Core Duo) family
Source: http://www.intel.com/products/processor/index.htm
Source: http://www.intel.com/products/processor_number/chart/core2duo.htm
Intel MCPs (7)
Figure 5.6: Specifications of Intel's Core Processors
Intel MCPs (8)
Category Code Name Cores Cache Market
Desktop Kentsfield Dual core multi-die 4 MB Mid 2007
Desktop Conroe Dual core single die 4 MB shared End 2006
Desktop Allendale Dual core single die 2 MB shared End 2006
Desktop Cedar Mill (NetBurst/P4) Single core 512 kB, 1 MB, 2 MB Early 2006
Desktop Presler (NetBurst/P4) Dual core, dual die 4 MB Early 2006
Desktop/Mobile Millville Single core 1 MB Early 2007
Mobile Yonah2 Dual core, single die 2 MB Early 2006
Mobile Yonah1 Single core 1/2 MB Mid 2006
Mobile Stealey Single core 512 kB Mid 2007
Mobile Merom Dual core, single die 2/4 MB shared End 2006
Enterprise Sossaman Dual core, single die 2 MB Early 2006
Enterprise Woodcrest Dual core, single die 4 MB Mid 2006
Enterprise Clovertown Quad core, multi-die 4 MB Mid 2007
Enterprise Dempsey (NetBurst/Xeon) Dual core, dual die 4 MB Mid 2006
Enterprise Tulsa Dual core single die 4/8/16 MB End 2006
Enterprise Whitefield Quad core single die 8 MB, 16 MB shared Early 2008
Figure 5.7: Future 65 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
Category Codename Cores Cache Market
Desktop Wolfdale Dual core, single die 3 MB shared 2008
Desktop Ridgefield Dual core single die 6 MB shared 2008
Desktop Yorkfield 8 cores multi-die 12 MB shared 2008+
Desktop Bloomfield Quad core, single die - 2008+
Desktop/Mobile Perryville Single core 2 MB 2008
Mobile Penryn Dual core single die 3 MB, 6 MB shared 2008
Mobile Silverthorne - - 2008+
Enterprise Harpertown 8 cores multi-die 12 MB shared 2008
Figure 5.8: Future 45 nm processors (overview)
Intel MCPs (9)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered, www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
Athlon 64 X2
Figure 5.9: AMD Athlon 64 X2 dual-core processor architecture
Source: AMD Athlon 64 X2 Dual-Core Processor for Desktop – Key Architecture Features, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041.00.html
Sun’s UltraSPARC IV/IV+ (1)
Figure 5.10: UltraSPARC IV (Jaguar)
Source: C. Boussard: Architecture des processeurs, http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
ARB: Arbiter
Sun’s UltraSPARC IV/IV+ (2)
Figure 5.11: UltraSPARC IV+ (Panther)
Source: C. Boussard: Architecture des processeurs, http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
POWER4/POWER5 (1)
Figure 5.12: POWER4 chip logical view
Source: J.M. Tendler, S. Dodson, S. Fields, H. Le, B. Sinharoy: POWER4 System Microarchitecture, IBM Server Technical White Paper, October 2001, http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.pdf
[Figure labels: Built-In Self-Test, Service Processor / Power-On Reset, Core Interface Unit (crossbar), Non-Cacheable Unit, Multi-Chip Module]
POWER4/POWER5 (2)
Figure 5.13: POWER4 chip
Source: R. Kalla, B. Sinharoy, J. Tendler: Simultaneous Multi-threading Implementation in Power5 – IBM's Next Generation POWER Microprocessor, 2003
http://www.hotchips.org/archives/hc15/3_Tue/11.ibm.pdf
POWER4/POWER5 (3)
Figure 5.14: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core Multithreaded Processor, IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.
Fabric Controller
Cell
Figure 5.15: Cell (BE) microarchitecture
Source: IBM: "Cell Broadband Engine™ processor-based systems", IBM Corp., 2006
SPE: Synergistic Processing Element
EIB: Element Interconnect Bus
MFC: Memory Flow Controller
PPE: Power Processing Element
AUC: Atomic Update Cache
Cell (2)
Figure 5.16: Cell SPE architecture
Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html
Cell (3)
Figure 5.17: Cell floorplan
Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html
Issues and Challenges
• Memory organization – distributed? shared?
• Interconnect and communication protocols
• Coherency and consistency – memory/cache
• Scheduling, load balancing, and synchronization
• Reliability? Energy efficiency?
• Will see all of these through the rest of the course!
Next Week
• We will talk about scheduling and load balancing issues with respect to multiple processing nodes (which in effect covers CMPs as well).
• We will also talk about the possibility of quizzes…