ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ( [email protected])
ΗΜΥ 656: ADVANCED COMPUTER ARCHITECTURE
Spring Semester 2007
LECTURE 1b: Review – Key Aspects of Architecture
Class Schedule
• Lectures – Tuesdays, 6-9pm
– Room 002, Lecture Hall 01, UCY Campus.
• Books and Literature:
– Hennessy, Patterson: Computer Architecture: A Quantitative Approach
– Research Papers given in class
• Other questions on Administrivia?
Review
What’s Computer Architecture?
• Architecture (in general) =
– Design of a functional structure
• Computer Architecture (CA) =
– Design of the logical structure and functional organization of a computer system.
• Especially its CPU and associated components
• Computer Architecture does not traditionally include other aspects of computer system design…
– Enclosures, styling, packaging, applications, power supplies, cooling systems, peripheral devices…
• But these are all important in designing real-world products!
What is a Computer?
• A computer is (most generally) any information processing system!
– Today, this almost always means a digital system…
• Though simple analog “computers” do exist…
– Also, today we usually mean a general-purpose, universal, or at least programmable computer
• Although a wide range of non-programmable digital components exist that perform fixed functions
– These could be considered simple special-purpose computers
Not Just This!
Medieval astrolabe
Types of Computers
• In this course, a “computer” could be anything from the simplest embedded microprocessor…
• …to the largest supercomputer!
– We will discuss architectural techniques for parallel computing if time permits…
Intel 4004 (1971) (4-bit, 740 kHz)
IBM Blue Gene/L (2005) (65,536 processors, 136 TFlops, 1MW, 300 tons)
Levels of Computer Architecture
• Computer architects may deal with design elements at a variety of different levels…
– Custom logic circuit & functional-unit designs.
– CPU datapath pipelines, memory hierarchies.
– Instruction-Set Architectures (ISAs)
• Or other programming models.
– Special compiler & operating system support.
– Multiprocessing systems, interconnection networks, distributed systems...
Levels of Design & Abstractions
[Figure labels: Computer Architecture; HW/SW interface; Hardware description languages; Useful Real-World Products]
Processor example: Intel Itanium 2 (McKinley) 64b Processor
• 221 million transistors! (~US adult population)
• How are they used?
• What will we do as transistor counts grow?
Most of chip is used for memories, inst. decoding, dynamic scheduling…
• Why is it done this way?
• How much more efficient could it be if more of area went to actual processing?
Dual-Core CPUs
Intel “Smithfield” Pentium D die photo
Introduction
• Computer Architecture refers to the attributes of a system visible to a programmer, i.e., attributes that have a direct impact on the logical execution of a program.
• Architectural Attributes include the instruction set, the number of bits used to represent various data types, I/O mechanisms, and techniques for addressing memory.
Introduction
• Computer Organization refers to the operational units and their interconnections that realize the architectural specifications.
• Organizational Attributes include those hardware details transparent to the programmer, such as control signals, interfaces between the computer and peripherals, and the memory technology used.
Introduction
• Computer Hardware refers to the detailed hardware design: logic design and the implementation (packaging, power, cooling, ...).
Introduction
• It is an architectural design issue whether a computer will have a multiply instruction.
• It is an organizational issue whether that instruction will be implemented by a special multiply unit or by a mechanism that makes repeated use of the add unit of the system.
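A minimal Python sketch of the organizational choice on this slide, assuming unsigned operands and a hypothetical 16-bit word. Both functions realize the same architectural multiply; only the mechanism differs. The names `mul_dedicated` and `mul_shift_add` are illustrative and do not come from the lecture.

```python
WORD_BITS = 16                      # assumed word size, for illustration only
MASK = (1 << WORD_BITS) - 1

def mul_dedicated(a, b):
    """Stand-in for a special multiply unit: one step, result truncated to a word."""
    return (a * b) & MASK

def mul_shift_add(a, b):
    """Same result using only the add unit: shift-and-add, one pass per bit of b."""
    result = 0
    for _ in range(WORD_BITS):
        if b & 1:                   # low multiplier bit set: add the shifted multiplicand
            result = (result + a) & MASK
        a = (a << 1) & MASK         # next bit position
        b >>= 1
    return result
```

A program that issues the multiply cannot tell which version ran, apart from timing; that is the sense in which the choice is organizational rather than architectural.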
Introduction
• Many computer manufacturers offer a family of computer models, all with the same architecture but with differences in organization.
• An architecture may survive many years but its organization changes with changing technology.
• For a good design, architecture (instruction set design), organization, and hardware as well as software (compiler and operating system) issues must be considered.
Introduction
• A computer architect is concerned about:
– The form in which programs are represented to and interpreted by the underlying machine,
– The methods with which these programs address the data, and
– The representation of data.
Introduction
• A computer architect should:
– Analyze the requirements and criteria (functional requirements)
– Study the previous attempts
– Design the conceptual system
– Define the detailed issues of the design
– Tune the design (balancing software and hardware)
– Evaluate the design
– Implement the design (technological trend)
Introduction
Cost
Functional requirements
Performance
Technological Trends
Design Complexity
How Do the Pieces Fit Together?
[Figure labels: Application; Operating System; Compiler; Firmware; Instruction Set Architecture; Instr. Set Proc.; I/O system; Memory system; Datapath & Control; Digital Design; Circuit Design]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, measurement, and evaluation
The Stored Program Computer
• 1943: ENIAC
– Presper Eckert and John Mauchly -- first general electronic computer. (or was it John V. Atanasoff in 1939?)
– Hard-wired program -- settings of dials and switches.
• 1944: Beginnings of EDVAC
– among other improvements, includes program stored in memory
• 1945: John von Neumann
– wrote a report on the stored program concept, known as the First Draft of a Report on EDVAC
• The basic structure proposed in the draft became known as the “von Neumann machine” (or model).
– a memory, containing instructions and data
– a processing unit, for performing arithmetic and logical operations
– a control unit, for interpreting instructions
For more history, see http://www.maxmon.com/history.htm
Von Neumann Model
[Figure: MEMORY (MAR, MDR); PROCESSING UNIT (ALU, TEMP); CONTROL UNIT (IR, PC); INPUT (Keyboard, Mouse, Scanner, Disk); OUTPUT (Monitor, Printer, LED, Disk)]
Von Neumann Processor Organization
• Control needs to
1. input instructions from Memory
2. issue signals to control the information flow between the Datapath components and to control what operations they perform
3. control instruction sequencing
• Datapath needs to have the
– components – the functional units and storage (e.g., register file) needed to execute instructions
– interconnects - components connected so that the instructions can be accomplished and so that data can be loaded from and stored to Memory
[Figure labels: CPU (Control, Datapath); Memory; Devices (Input, Output); Fetch, Decode, Exec]
Memory
• 2^k x m array of stored bits
• Address
– unique (k-bit) identifier of location
• Contents
– m-bit value stored in location
• Basic Operations:
• LOAD
– read a value from a memory location
• STORE
– write a value to a memory location
[Figure: memory array with 4-bit addresses 0000 through 0110 and 1101 through 1111 shown; example contents 00101101 and 10100010]
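The memory model on this slide can be sketched in a few lines of Python. This is an illustration only: the class name `Memory` and the choice of k = 4, m = 8 are assumptions made to match the 4-bit addresses and 8-bit contents in the figure.

```python
class Memory:
    """2^k locations, each holding an m-bit value."""

    def __init__(self, k, m):
        self.k, self.m = k, m
        self.cells = [0] * (1 << k)          # one entry per k-bit address

    def _check(self, address):
        if not 0 <= address < (1 << self.k):
            raise ValueError("address does not fit in k bits")

    def load(self, address):
        """LOAD: read the value stored at a location."""
        self._check(address)
        return self.cells[address]

    def store(self, address, value):
        """STORE: write a value to a location, keeping only its low m bits."""
        self._check(address)
        self.cells[address] = value & ((1 << self.m) - 1)

mem = Memory(k=4, m=8)
mem.store(0b1101, 0b00101101)
```

After the final line, `mem.load(0b1101)` returns the stored 8-bit pattern, and any address outside the 16 available locations raises an error.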
CSE431 L18 Memory Hierarchy.24 Irwin, PSU, 2005
Processor-Memory Performance Gap
[Figure: Performance (log scale, 1 to 10000) vs. Year (1980 to 2004). µProc: 55%/year (2X/1.5yr), “Moore’s Law”. DRAM: 7%/year (2X/10yrs). Processor-Memory Performance Gap (grows 50%/year).]
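As a rough check of the figures quoted on this slide, the following Python sketch converts the annual growth rates into doubling times and into a growth rate for the gap itself. It assumes simple compound growth; the function name `doubling_time_years` is illustrative.

```python
import math

def doubling_time_years(annual_growth):
    """Years needed to double at a compound annual growth rate (0.55 means 55%)."""
    return math.log(2) / math.log(1 + annual_growth)

cpu_growth, dram_growth = 0.55, 0.07          # rates quoted on the slide

# The gap is the ratio of the two curves, so its yearly growth is the ratio of the factors.
gap_growth = (1 + cpu_growth) / (1 + dram_growth) - 1

print(round(doubling_time_years(cpu_growth), 2))
print(round(doubling_time_years(dram_growth), 2))
print(round(gap_growth, 3))
```

Under these assumptions the doubling times come out close to the slide's 1.5 and 10 years, and the gap grows by roughly 45% per year, which the slide appears to round to 50%.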
The “Memory Wall”
Logic vs DRAM speed gap continues to grow
[Figure: Clocks per instruction (Core) and Clocks per DRAM access (Memory), log scale from 0.01 to 1000, for VAX/1980, PPro/1996, and 2010+]
The Memory Hierarchy Goal
Fact: Large memories are slow and fast memories are small
How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
With hierarchy
With parallelism
A Typical Memory Hierarchy
[Figure: On-Chip Components (Control, Datapath, RegFile, Instr Cache, Data Cache, ITLB, DTLB); Second Level Cache (SRAM); eDRAM; Main Memory (DRAM); Secondary Memory (Disk)]
Speed (%cycles): ½’s 1’s 10’s 100’s 1,000’s
Size (bytes): 100’s K’s 10K’s M’s G’s to T’s
Cost: highest … lowest
By taking advantage of the principle of locality
Can present the user with as much memory as is available in the cheapest technology
at the speed offered by the fastest technology
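To make the role of locality concrete, here is a small Python sketch of a direct-mapped cache that only counts hits. The cache geometry (64 lines of 32 bytes) and the two access patterns are assumptions chosen for illustration; they are not taken from the slides.

```python
def hit_rate(addresses, num_lines=64, block_bytes=32):
    """Hit rate of a direct-mapped cache for a sequence of byte addresses."""
    resident = [None] * num_lines        # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_bytes      # which memory block the byte belongs to
        index = block % num_lines        # the single line that block may occupy
        if resident[index] == block:
            hits += 1
        else:
            resident[index] = block      # miss: bring the block in, evicting the old one
    return hits / len(addresses)

sequential = list(range(4096))                  # walk a 4 KiB region byte by byte
strided = [i * 2048 for i in range(4096)]       # every access lands in a different block

print(hit_rate(sequential), hit_rate(strided))
```

The sequential walk misses once per block and then hits on the remaining bytes of that block, while the strided pattern never reuses a block and all its accesses map to the same line. The hierarchy only delivers the speed of the fastest level when accesses look like the first pattern.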
Characteristics of the Memory Hierarchy
[Figure: Processor, L1$, L2$, Main Memory, Secondary Memory, in order of increasing distance from the processor in access time; (relative) size of the memory at each level]
Transfer units between levels: 4-8 bytes (word); 8-32 bytes (block); 1 to 4 blocks; 1,024+ bytes (disk sector = page)
Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
Memory Hierarchy Technologies
Caches use SRAM for speed and technology compatibility
Low density (6 transistor cells), high power, expensive, fast
Static: content will last “forever” (until power turned off)
Main Memory uses DRAM for size (density)
High density (1 transistor cells), low power, cheap, slow
Dynamic: needs to be “refreshed” regularly (~ every 8 ms)
- 1% to 2% of the active cycles of the DRAM
Addresses divided into 2 halves (row and column)
- RAS or Row Access Strobe triggering row decoder
- CAS or Column Access Strobe triggering column selector
[Figure: SRAM 2M x 16 with a 21-bit Address input, 16-bit Din[15-0] and Dout[15-0], and Chip select, Output enable, and Write enable signals]
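The 21 address lines in the SRAM figure follow directly from the 2M x 16 organization. A short Python sketch of that arithmetic, with variable names chosen for illustration:

```python
words = 2 * 2**20          # "2M" locations
width_bits = 16            # each location is 16 bits wide (Din/Dout[15-0])

# Number of address bits needed to name every location: log2(words) for a power of two.
address_bits = (words - 1).bit_length()

capacity_bytes = words * width_bits // 8

print(address_bits, capacity_bytes // 2**20, "MiB")
```

The same reasoning, applied to DRAM, explains why splitting the address into row and column halves roughly halves the number of address pins, at the cost of sending the address in two steps (RAS, then CAS).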
Interface to Memory
• How does processing unit get data to/from memory?
• MAR: Memory Address Register
• MDR: Memory Data Register
• To LOAD a location (A):
1. Write the address (A) into the MAR.
2. Send a “read” signal to the memory.
3. Read the data from MDR.
• To STORE a value (X) to a location (A):
1. Write the data (X) to the MDR.
2. Write the address (A) into the MAR.
3. Send a “write” signal to the memory.
[Figure: MEMORY with MAR and MDR]
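The LOAD and STORE sequences above can be sketched as follows. This is a minimal Python model, not a description of any particular machine; the class `MemoryInterface` and its `read_signal` / `write_signal` methods are hypothetical names for the two control signals.

```python
class MemoryInterface:
    """Memory reachable only through the MAR and MDR registers."""

    def __init__(self, size):
        self.cells = [0] * size
        self.mar = 0                     # Memory Address Register
        self.mdr = 0                     # Memory Data Register

    def read_signal(self):
        self.mdr = self.cells[self.mar]  # memory places the addressed value in MDR

    def write_signal(self):
        self.cells[self.mar] = self.mdr  # memory stores MDR at the addressed location

def load(mem, a):
    mem.mar = a                          # 1. write the address into the MAR
    mem.read_signal()                    # 2. send a "read" signal
    return mem.mdr                       # 3. read the data from the MDR

def store(mem, a, x):
    mem.mdr = x                          # 1. write the data to the MDR
    mem.mar = a                          # 2. write the address into the MAR
    mem.write_signal()                   # 3. send a "write" signal
```

For example, `store(m, 3, 99)` followed by `load(m, 3)` returns 99, and the processing unit never touches `cells` directly.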
Processing Unit
• Functional Units
– ALU = Arithmetic and Logic Unit
– could have many functional units, some of them special-purpose (multiply, square root, …)
– LC-3 performs ADD, AND, NOT
• Registers
– Small, temporary storage
– Operands and results of functional units
– LC-3 has eight registers (R0, …, R7), each 16 bits wide
• Word Size
– number of bits normally processed by ALU in one instruction
– also width of registers
– LC-3 is 16 bits
[Figure: PROCESSING UNIT with ALU and TEMP]
Input and Output
• Devices for getting data into and out of computer memory
• Each device has its own interface, usually a set of registers like the memory’s MAR and MDR
– keyboard: data register (KBDR) and status register (KBSR)
– monitor: data register (DDR) and status register (DSR)
• Some devices provide both input and output
– disk, network
[Figure: INPUT (Keyboard, Mouse, Scanner, Disk); OUTPUT (Monitor, Printer, LED, Disk)]
Control Unit
• Orchestrates execution of the program
• Instruction Register (IR) contains the current instruction.
• Program Counter (PC) contains the address of the next instruction to be executed.
• Control unit:
– reads an instruction from memory
• the instruction’s address is in the PC
– interprets the instruction, generating signals that tell the other components what to do
• an instruction may take many machine cycles to complete
[Figure: CONTROL UNIT with IR and PC]
Instruction Processing
Fetch instruction from memory
Decode instruction
Fetch operands from memory
Execute operation
Store result
Instruction
• The instruction is the fundamental unit of work.
• Specifies two things:
– opcode: operation to be performed
– operands: data/locations to be used for operation
• An instruction is encoded as a sequence of bits. (Just like data!)
– Often, but not always, instructions have a fixed length, such as 16 or 32 bits.
– Control unit interprets instruction: generates sequence of control signals to carry out operation.
– Operation is either executed completely, or not at all.
• A computer’s instructions and their formats are known as its Instruction Set Architecture (ISA).
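As an illustration of "an instruction is encoded as a sequence of bits", the sketch below decodes a 16-bit LC-3 ADD instruction into its opcode and operand fields. The field layout (opcode in bits 15-12, DR in 11-9, SR1 in 8-6, bit 5 selecting register or 5-bit immediate mode) is the LC-3 ADD format as I understand it and is not spelled out in the slides; the function name `decode_add` is hypothetical.

```python
def decode_add(instr):
    """Split a 16-bit LC-3 ADD instruction into its fields."""
    opcode = (instr >> 12) & 0xF
    if opcode != 0b0001:
        raise ValueError("not an ADD instruction")
    fields = {"dr": (instr >> 9) & 0x7, "sr1": (instr >> 6) & 0x7}
    if instr & 0x20:                          # bit 5 set: second operand is an immediate
        imm5 = instr & 0x1F
        fields["imm5"] = imm5 - 32 if imm5 & 0x10 else imm5   # sign-extend 5 bits
    else:                                     # bit 5 clear: second operand is a register
        fields["sr2"] = instr & 0x7
    return fields

print(decode_add(0x14C4))   # ADD R2, R3, R4
print(decode_add(0x127F))   # ADD R1, R1, #-1
```

The same 16 bits could equally be read as a number or two characters; only the control unit's interpretation makes them an instruction.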
Changing the Sequence of Instructions
• In the FETCH phase, we increment the Program Counter by 1.
• What if we don’t want to always execute the instruction that follows this one?
– examples: loop, if-then, function call
• Need special instructions that change the contents of the PC.
• These are called control instructions.
– jumps are unconditional -- they always change the PC
– branches are conditional -- they change the PC only if some condition is true (e.g., the result of an ADD is zero)
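A toy interpreter shows how control instructions override the default PC increment. The instruction names (`ADDI`, `JMP`, `BNEZ`, `HALT`) and the tuple encoding are invented for this sketch and do not correspond to LC-3 or any real ISA.

```python
def run(program, registers, max_steps=1000):
    """Execute a list of (op, *args) tuples; returns the register list."""
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *args = program[pc]          # FETCH the instruction the PC points at
        pc += 1                          # ...and increment the PC by 1
        if op == "ADDI":                 # computational: rd = rs + immediate
            rd, rs, imm = args
            registers[rd] = registers[rs] + imm
        elif op == "JMP":                # jump: always changes the PC
            pc = args[0]
        elif op == "BNEZ":               # branch: changes the PC only if register != 0
            reg, target = args
            if registers[reg] != 0:
                pc = target
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return registers

# Loop: add 2 to r1 while counting r0 down to zero.
program = [
    ("ADDI", 1, 1, 2),    # 0: r1 = r1 + 2
    ("ADDI", 0, 0, -1),   # 1: r0 = r0 - 1
    ("BNEZ", 0, 0),       # 2: if r0 != 0 go back to instruction 0
    ("HALT",),            # 3
]
print(run(program, [3, 0]))
```

Without the `BNEZ` at address 2, the PC would simply advance to `HALT` after one pass; the branch is what turns the straight-line sequence into a loop.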
Instruction Processing Summary
• Instructions look just like data -- it’s all interpretation.
• Three basic kinds of instructions:
– computational instructions (ADD, AND, …)
– data movement instructions (LD, ST, …)
– control instructions (JMP, BRnz, …)
• Five basic phases of instruction processing:
• F → D → EX → M → WB
– not all phases are needed by every instruction
– phases may take variable number of machine cycles
Pipelining
Pipeline
• Optimal Pipeline
– Each stage is executing part of an instruction each clock cycle.
– One instruction finishes during each clock cycle.
– On average, execute far more quickly.
• What makes this work?
– Similarities between instructions allow us to use same stages for all instructions (generally).
– Each stage takes about the same amount of time as all others: little wasted time.
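The "one instruction finishes during each clock cycle" claim can be quantified with a simple cycle count. The sketch assumes an ideal k-stage pipeline with equal stage times and no hazards or stalls; the function names are illustrative.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction occupies the whole datapath for all of its stages."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    """Fill the pipeline once, then one instruction completes per cycle."""
    return n_stages + (n_instructions - 1)

def speedup(n_instructions, n_stages):
    return cycles_unpipelined(n_instructions, n_stages) / cycles_pipelined(n_instructions, n_stages)

print(cycles_pipelined(1000, 5), round(speedup(1000, 5), 2))
```

For long instruction streams the speedup approaches the number of stages, which is why balanced stage times matter: the clock period is set by the slowest stage.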
Pipeline
• Pipelining a Big Idea: widely used concept
• What makes it less than perfect?
– Structural hazards: suppose we had only one cache? ⇒ Need more HW resources
– Control hazards: need to worry about branch instructions? ⇒ Delayed branch or branch prediction
– Data hazards: an instruction depends on a previous instruction?
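A data hazard of the read-after-write kind can be detected by comparing each instruction's source registers with the destinations of the instructions just ahead of it. The sketch below is a simplified illustration: the `Instr` record and the `window` parameter (how many earlier instructions are still in flight) are assumptions, and a real pipeline would also account for forwarding.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")

def raw_hazards(instrs, window=2):
    """Pairs (i, j) where instruction j reads a register written by a recent instruction i."""
    found = []
    for j, later in enumerate(instrs):
        for i in range(max(0, j - window), j):
            dest = instrs[i].dest
            if dest is not None and dest in later.srcs:
                found.append((i, j))
    return found

prog = [
    Instr("ADD", "R1", ("R2", "R3")),   # writes R1
    Instr("SUB", "R4", ("R1", "R5")),   # reads R1 before the ADD has written it back
    Instr("AND", "R6", ("R7", "R8")),   # independent
]
print(raw_hazards(prog))
```

Here only the ADD/SUB pair is flagged; the AND touches unrelated registers and can proceed without waiting.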
RISC - Reduced Instruction Set Computer
• RISC philosophy
– fixed instruction lengths
– load-store instruction sets
– limited addressing modes
– limited operations
• MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Intel (Compaq) Alpha, …
• Instruction sets are measured by how well compilers use them as opposed to how well assembly language programmers use them
Design goals: speed, cost (design, fabrication, test, packaging), size, power consumption, reliability, memory space (embedded systems)
Towards CISC
• Wired logic → microcode control
– Temptingly easy extensibility
• Performance tuning
– HW implementation of some high-level functions
• Marketing
– Add successful instructions of competitors
– “New feature” hype
– Compatibility: only extensions are possible
CISC Problems
• Performance tuning unsuccessful
– Rarely used high-level instructions
– Sometimes slower than equivalent sequence
• High complexity
– Pipelining bottlenecks → lower clock rates
– Interrupt handling can complicate even more
• Marketing
– Prolonged design time and frequent microcode errors hurt competitiveness
RISC Features
• Low complexity
– Generally results in overall speedup
– Less error-prone implementation by hardwired logic or simple microcode
• VLSI implementation advantages
– Fewer transistors
– Extra space: more registers, cache
• Marketing
– Reduced design time, fewer errors, and more options increase competitiveness
RISC Compiler Issues
• The compilers themselves
– Computationally more complex
– More portable
• The compiler writer
– Fewer instructions → probably easier job
– Simpler instructions → probably fewer bugs
– Can reuse optimization techniques
Ideally…
• A RISC pipeline should execute at least one instruction per cycle.
• However, due to hazards, etc. it doesn’t.
• What can be done?
Instruction Level Parallelism
• Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
• Equally applicable to RISC & CISC
• In practice usually RISC
What if we had the hardware?
• Many pipeline stages need less than half a clock cycle
• Double internal clock speed gets two tasks per external clock cycle
• Superscalar allows parallel fetch execute
• SUPERPIPELINE!
The Superscalar Engine
But…
• Superscalar Issues:
– Complexity…
– Energy…
– ---
Hence, Static ILP (a.k.a. VLIW)
• Instruction level parallelism
– Explicit in machine instruction
– Not determined at run time by processor
• Long or very long instruction words (LIW/VLIW)
• Branch predication (not the same as branch prediction)
• Speculative loading
• Intel & HP call this Explicit Parallel Instruction Computing (EPIC)
• IA-64 is an instruction set architecture intended for implementation on EPIC
• Itanium is first Intel product
VLIW Architecture
But…
• More transistors, more hardware!
• More logic, more memory, more more more!
• Do we use our resources efficiently?
From CMPs…
…to Systems on Chip!