Computer Architecture The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor

53
Computer Architecture 2011 – P6 uArch 1 Computer Architecture The P6 Micro- Architecture An Example of an Out-Of-Order Micro- processor Lihu Rappoport and Adi Yoaz

description

Computer Architecture The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor. Lihu Rappoport and Adi Yoaz. The P6 Family. Features Out Of Order execution Register renaming Speculative Execution Multiple Branch prediction Super pipeline: 12 pipe stages. *off die. - PowerPoint PPT Presentation

Transcript of Computer Architecture The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor

Page 1: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 1

Computer Architecture

The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor

Lihu Rappoport and Adi Yoaz

Page 2: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 2

The P6 Family· Features

– Out Of Order execution– Register renaming– Speculative Execution– Multiple Branch prediction– Super pipeline: 12 pipe stages

Processor Year Freq (MHz) Bus (MHz) L2 cache Process (μ)Pentium® Pro 1995 150~200 60/66 256/512K* 0.5, 0.35Pentium® II 1997 233~450 66/100 512K* 0.35, 0.25Pentium® III 1999 450~1400 100/133 256/512K 0.25, 0.18, 0.13Pentium® M 2003 900~2260 400/533 1M / 2M 0.13, 90nmCoreTM 2005 1660~2330 667 2M 65nmCoreTM 2 2006 1800~2930 800/1066 2/4/8M 65nm

*off die

Page 3: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 3

· In-Order Front End– BIU: Bus Interface Unit– IFU: Instruction Fetch Unit (includes IC)– BPU: Branch Prediction Unit– ID: Instruction Decoder– MS: Micro-Instruction Sequencer– RAT: Register Alias Table

· Out-of-order Core– ROB: Reorder Buffer– RRF: Real Register File– RS: Reservation Stations– IEU: Integer Execution Unit– FEU: Floating-point Execution Unit – AGU: Address Generation Unit– MIU: Memory Interface Unit– DCU: Data Cache Unit– MOB: Memory Order Buffer– L2: Level 2 cache

· In-Order Retire

P6 Arch

MS

AGU

MOB

External Bus

IEU

MIU

FEU

BPU

BIU

IFU

ID

RAT

RS

L2

DCU

ROB

Page 4: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 4

O1 O3

R1 R2

Ex

I1 I2 I3 I4 I5 I6 I7 I8

NextIP

RegRen

RSWrIcache Decode

RS disp

Retirement

In-Order Front End

Out-of-order Core

In-order Retirement

1: Next IP2: ICache lookup3: ILD (instruction length decode)4: rotate5: ID16: ID27: RAT- rename sources, ALLOC-assign destinations8: ROB-read sources RS-schedule data-ready uops for dispatch9: RS-dispatch uops10:EX11:Retirement

P6 Pipeline

Page 5: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 5

In-Order Front End

· BPU – Branch Prediction Unit – predict next fetch address

· IFU – Instruction Fetch Unit– iTLB translates virtual to physical address (access PMH on miss)– IC supplies 16byte/cyc (access L2 cache on miss)

· ILD – Induction Length Decode – split bytes to instructions

· IQ – Instruction Queue – buffer the instructions

· ID – Instruction Decode – decode instructions into uops

· MS – Micro-Sequencer – provides uops for complex instructions

Next IPMux

BPU

ID

MS

ILD IQ IDQIFU

Bytes Instructions uops

Page 6: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 6

Branch Prediction · Implementation

– Use local history to predict direction– Need to predict multiple branches Þ Need to predict branches before previous branches are resolvedÞ Branch history updated first based on prediction, later based on

actual execution (speculative history)– Target address taken from BTB

· Prediction rate: ~92%– ~60 instructions between mispredictions– High prediction rate is very crucial for long pipelines– Especially important for OOOE, speculative execution:

On misprediction all instructions following the branch in the instruction window are flushed

Effective size of the window is determined by prediction accuracy· RSB used for Call/Return pairs

Page 7: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 7

Branch Prediction – Clustering· Need to provide predictions for the entire fetch line each cycle

– Predict the first taken branch in the line, following the fetch IP

· Implemented by – Splitting IP into offset within line, set, and tag– If the tag of more than one way matches the fetch IP

The offsets of the matching ways are ordered Ways with offset smaller than the fetch IP offset are discarded The first branch that is predicted taken is chosen as the predicted branch

Jump into the fetch line

Jump out of the line

jmp jmpjmpjmpPredict

not takenPredicttaken

Predicttaken

Predicttaken

Page 8: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 8

The P6 BTB· 2-level, local histories, per-set counters· 4-way set associative: 512 entries in 128 sets

IP Tag Hist

1001

Pred=msb of counter

9

0

15

Way 0

Target

Way 2

Way 3

9 432

counters

128sets

PTV

21 1 32

LRR

2

Per-Set

Branch Type00- cond01- ret10- call11- uncond

ReturnStackBufferWay

1

Prediction bit

4

ofst

· Up to 4 branches can have a tag match

Page 9: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 9

Determine where each IA instruction starts

In-Order Front End: Decoder

Buffer Instructions

Convert instructions Into uops• If inst aligned with dec1/2/3

decodes into >1 uops, defer it to next cycle

IQ

Instruction Length Decode

16 Instruction bytes from IFU

1 uop

D1

IDQ

D2

1 uop

D0

4 uops

Buffers uops• Smooth decoder’s variable

throughput

D3

1 uop

Page 10: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 10

Micro Operations (Uops)· Each CISC inst is broken into one or more RISC uops

– Simplicity: Each uop is (relatively) simple Canonical representation of src/dest (2 src, 1 dest)

– Increased ILP e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]

· Simple instructions translate to a few uops– Typical uop count (it is not necessarily cycle count!)

Reg-Reg ALU/Mov inst: 1 uopMem-Reg Mov (load) 1 uopMem-Reg ALU (load + op) 2 uopsReg-Mem Mov (store) 2 uops (st addr, st data)Reg-Mem ALU (ld + op + st) 4 uops

· Complex instructions need ucode

Page 11: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 11

Out-of-order Core

MIS

AGU

MOB

External Bus

IEU

MIU

FEU

BTB

BIU

IFU

ID

RAT

RS

L2

DCU

ROB

· Reorder Buffer (ROB):– Holds “not yet retired” instructions– 40 entries, in-order

· Reservation Stations (RS): – Holds “not yet executed” instructions– 20 entries

· Execution Units– IEU: Integer Execution Unit– FEU: Floating-point Execution Unit

· Memory related units– AGU: Address Generation Unit MIU:

Memory Interface Unit– DCU: Data Cache Unit– MOB: Orders Memory operations– L2: Level 2 cache

Page 12: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 12

Alloc & Rat· Perform register allocation and renaming for ≤4 uops/cyc· The Register Alias Table (RAT)

– Maps architectural registers into physical registers For each arch reg, holds the number of latest phy reg that updates it

– When a new uop that writes to a arch reg R is allocated Record phy reg allocated to the uop as the latest reg that updates R

· The Allocator (Alloc) – Assigns each uop an entry number in the ROB / RS– For each one of the sources (architectural registers) of the uop

Lookup the RAT to find out the latest phy reg updating it Write it up in the RS entry

– Allocate Load & Store buffers in the MOB

Arch reg #reg LocationEAX 0 RRFEBX 19 ROBECX 23 ROB

Page 13: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 13

Re-order Buffer (ROB)· Hold 40 uops which are not yet committed

– At the same order as in the program· Provide a large physical register space for register renaming

– One physical register per each ROB entry physical register number = entry number Each uop has only one destination Buffer the execution results until retirement

– Valid data is set after uop executed and result written to physical reg

#entry EntryValid

DataValid

Physical Reg Data

Architectural dest. reg

0 1 1 12H EBX

1 1 1 33H ECX

2 1 0 xxx ESI

39 0 0 xxx XXX

Page 14: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 14

RRF – Real Register File

· Holds the Architectural Register File– Architectural Register are numbered: 0 – EAX, 1 – EBX, …

· The value of an architectural register – is the value written to it by the last instruction committed

which writes to this register

RRF: #entry Arch Reg Data

0 (EAX) 9AH

1 (EBX) F34H

Page 15: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 15

Uop flow through the ROB· Uops are entered in order

– Registers renamed by the entry number· Once assigned: execution order unimportant· After execution:

– Entries marked “executed” and wait for retirement– Executed entry can be “retired” once all prior instruction have retired– Commit architectural state only after speculation (branch, exception)

has resolved· Retirement

– Detect exceptions and mispredictions Initiate repair to get machine back on right track

– Update “real registers” with value of renamed registers– Update memory– Leave the ROB

Page 16: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 16

Reservation station (RS)· Pool of all “not yet executed” uops

– Holds the uop attributes and the uop source data until it is dispatched

· When a uop is allocated in RS, operand values are updated– If operand is from an architectural register, value is taken from the

RRF– If operand is from a phy reg, with data valid set, value taken from ROB– If operand is from a phy reg, with data valid not set, wait for value

· The RS maintains operands status “ready/not-ready”– Each cycle, executed uops make more operands “ready”

The RS arbitrate the WB busses between the units The RS monitors the WB bus to capture data needed by awaiting uops Data can be bypassed directly from WB bus to execution unit

– Uops whose all operands are ready can be dispatched for execution Dispatcher chooses which of the ready uops to execute next Dispatches chosen uops to functional units

Page 17: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 17

Register Renaming example

RS

RAT / Alloc

Þ

IDQ

Add EAX, EBX, EAX

#regEAX 0 RRFEBX 19 ROBECX 23 ROB

Add ROB37, ROB19, RRF0

ROB

Þ# Data

Valid Data DST19 V V 12H EBX23 V V 33H ECX37 I x xxx XXX38 I x xxx XXX

v src1 v src2 Pdst

add 1 97H 1 12H 37

RRF: 0 EAX 97H

# Data Valid Data DST

19 V V 12H EBX23 V V 33H ECX37 V I xxx EAX38 I x xxx XXX

#regEAX 37 ROBEBX 19 ROBECX 23 ROB

Page 18: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 18

Register Renaming example (2)

RS

RAT / Alloc

Þ

IDQ

sub EAX, ECX, EAX

#regEAX 37 ROBEBX 19 ROBECX 23 ROB

sub ROB38, ROB23, ROB37

ROB

Þ# Data

Valid Data DST19 V V 12H EBX23 V V 33H ECX37 I x xxx XXX38 I x xxx XXX

v src1 v src2 Pdst

add 1 97H 1 12H 37

sub 0 rob37 1 33H 38

RRF: 0 EAX 97H

# Data Valid Data DST

19 V V 12H EBX23 V V 33H ECX37 V I xxx EAX38 V I xxx EAX

#regEAX 38 ROBEBX 19 ROBECX 23 ROB

Page 19: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 19

MIU

Port 0

Port 1

Port 2

Port 3,4

SHFFMU

FDIVIDIV

FAUIEU

JEUIEU

AGU

AGU

Load Address

Store Address

Out-of-order Core: Execution Unitsinternal 0-dealy bypass within each EU

2nd bypass in RS

RS

1st bypass in MIU

DCUSDB

Page 20: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 20

In-Order Retire

MIS

AGU

MOB

External Bus

IEU

MIU

FEU

BTB

BIU

IFU

ID

RAT

RS

L2

DCU

ROB

· ROB:– Retires up to 3 uops per clock– Copies the values to the RRF– Retirement is done In-order– Performs exception checking

Page 21: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 21

In-order Retirement· The process of committing the results to the architectural

state of the processor· Retires up to 3 uops per clock· Copies the values to the RRF· Retirement is done In Order· Performs exception checking· An instruction is retired after the following checks

– Instruction has executed– All previous instructions have retired– Instruction isn’t mis-predicted– no exceptions

Page 22: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 22

Pipeline: Fetch

· Fetch 16B from I$· Length-decode instructions within 16B· Write instructions into IQ

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule EX Retire

Page 23: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 23

Pipeline: Decode

· Read 4 instructions from IQ· Translate instructions into uops

– Asymmetric decoders (4-1-1-1)· Write resulting uops into IDQ

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule EX Retire

Page 24: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 24

Pipeline: Allocate

· Allocate, port bind and rename 4 uops· Allocate ROB/RS entry per uop

– If source data is available from ROB or RRF, write data to RS– Otherwise, mark data not ready in RS

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule EX Retire

Page 25: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 25

Pipeline: EXE

· Ready/Schedule– Check for data-ready uops if needed functional unit available – Select and dispatch ≤6 ready uops/clock to EXE– Reclaim RS entries

· Write back results into RS/ROB– Write results into result buffer– Snoop write-back ports for results that are sources to uops in RS– Update data-ready status of these uops in the RS

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule EX Retire

Page 26: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 26

Pipeline: Retire

· Retire ≤4 oldest uops in ROB– Uop may retire if

its ready bit is set it does not cause an exception all preceding candidates are eligible for retirement

– Commit results from result buffer to RRF– Reclaim ROB entry

· In case of exception– Nuke and restart

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule EX Retire

Page 27: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 27

Jump Misprediction – Flush at Retire· Flush the pipe when the mispredicted jump retires

– retirement is in-orderÞ all the instructions preceding the jump have already retiredÞ all the instructions remaining in the pipe follow the jumpÞ flush all the instructions in the pipe

· Disadvantage– Misprediction known after the jump was executed– Continue fetching instructions from wrong path until the branch

retires

Page 28: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 28

Jump Misprediction – Flush at Execute· When the JEU detects jump misprediction it

– Flush the in-order front-end– Instructions already in the OOO part continue to execute

Including instructions following the wrong jump, which take execution resource, and waste power, but will never be committed

– Start fetching and decoding from the “correct” path The “correct” path still be wrong

A preceding uop that hasn’t executed may cause an exception A preceding jump executed OOO can also mispredict

– The “correct” instruction stream is stalled at the RAT The RAT was wrongly updated also by wrong path instruction

· When the mispredicted branch retires– Resets all state in the Out-of-Order Engine (RAT, RS, RB, MOB, etc.)

Only instruction following the jump are left – they must all be flushed Reset the RAT to point only to architectural registers

– Un-stalls the in-order machine– RS gets uops from RAT and starts scheduling and dispatching them

Page 29: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 29

Fetch DecodeIQ IDQ

AllocROBRS

RetireSchedule JEU

Pipeline: Branch gets to EXE

Page 30: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 30

· Flush front-end and re-steer it to correct path· RAT state already updated by wrong path

– Block further allocation· Block younger branches from clearing· Update BPU

Fetch DecodeIQ IDQ

AllocROBRS

RetireSchedule JEU

Flush

Pipeline: Mispredicted Branch EXE

Page 31: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 31

· When mispredicted branch retires– Flush OOO– Allow allocation of uops from correct path

Fetch DecodeIQ IDQ

AllocROBRS

RetireSchedule JEU

Pipeline: Mispredicted Branch Retires

Clear

Page 32: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 32

Instant Reclamation

· Allow a faster recovery after jump misprediction– Allow execution/allocation of uops from correct path before

mispredicted jump retires

· Every few cycles take a checkpoint of the RAT

· In case of misprediction– Flush the frontend and re-steer it to the correct path– Recover RAT to latest checkpoint taken prior to misprediction– Recover RAT to exact state at misprediction

Rename 4 uops/cycle from checkpoint and until branch– Flush all uops younger than the branch in the OOO

Page 33: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 33

Instant Reclamation Mispredicted Branch EXE

· JEClear raised on mispredicted macro-branches

Clear

AllocROBRS

Schedule

JEU RetirePredict/Fetch Decode

IQ IDQ

Page 34: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 34

Instant Reclamation Mispredicted Branch EXE

· JEClear raised on mispredicted macro-branches– Flush frontend and re-steer it to the correct path– Flush all younger uops in OOO– Update BPU– Block further allocation

Clear

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule Retire

BPU UpdateJE

U

Page 35: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 35

Pipeline: Instant Reclamation: EXE

· Restore RAT from latest check-point before branch· Recover RAT to its states just after the branch

– Before any instruction on the wrong path· Meanwhile front-end starts fetching and decoding instructions

from the correct path

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule Retire

JEU

Page 36: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 36

Pipeline: Instant Reclamation: EXE

· Once done restoring the RAT – allow allocation of uops from correct path

Predict/Fetch DecodeIQ IDQ

AllocROBRS

Schedule Retire

JEU

Page 37: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 37

Large ROB and RS are Important· Large RS

– Increases the window in which looking for impendent instructions Exposes more parallelism potential

· Large ROB– The ROB is a superset of the RS Þ ROB size ≥ RS size– Allows for of covering long latency operations (cache miss, divide)

· Example– Assume there is a Load that misses the L1 cache

Data takes ~10 cycles to return Þ ~30 new instrs get into pipeline– Instructions following the Load cannot commit Þ Pile up in the ROB– Instructions independent of the load are executed, and leave the RS

As long as the ROB is not full, we can keep executing instructions– A 40 entry ROB can cover for an L1 cache miss

Cannot cover for an L2 cache miss, which is hundreds of cycles

Page 38: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 38

OOO Execution of Memory Operations

Page 39: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 39

P6 Caches· Blocking caches severely hurt OOO

– A cache miss prevents from other cache requests (which could possibly be hits) to be served

– Hurts one of the main gains from OOO – hiding caches misses

· Both L1 and L2 cache in the P6 are non-blocking– Initiate the actions necessary to return data to cache miss while

they respond to subsequent cached data requests– Support up to 4 outstanding misses

Misses translate into outstanding requests on the P6 bus The bus can support up to 8 outstanding requests

· Squash subsequent requests for the same missed cache line– Squashed requests not counted in number of outstanding requests– Once the engine has executed beyond the 4 outstanding requests

subsequent load requests are placed in the load buffer

Page 40: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 40

Memory Operations· Loads

– Loads need to specify memory address to be accessed width of data being retrieved the destination register

– Loads are encoded into a single uop· Stores

– Stores need to provide a memory address, a data width, and the data to be written

– Stores require two uops: One to generate the address One to generate the data

– Uops are scheduled independently to maximize concurrency Can calculate the address before the data to be stored is known must re-combine in the store buffer for the store to complete

Page 41: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 41

The Memory Execution Unit· As a black box, the MEU is just another execution unit

– Receives memory reads and writes from the RS– Returns data back to the RS and to the ROB

· Unlike many execution units, the MEU has side effects – May cause a bus request to external memory

· The RS dispatches memory uops – When all the data used for address calculation is ready– Both to the MOB and to the Address Generation Unit (AGU) are free– The AGU computes the linear address:

Segment-Base + Base-Address + (Scale*Index) + Displacement Sends linear address to MOB, where it is stored with the uop in

the appropriate Load Buffer or Store Buffer entry

Page 42: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 42

The Memory Execution Unit (cont.)· The RS operates based on register dependencies

– Cannot detect memory dependencies, even as simple as: i1: movl -4(%ebp), %ebx # MEM[ebp-4] ← ebxi2: movl %eax, -4(%ebp) # eax ← MEM[ebp-4]

– The RS may dispatch these operations in any order· MEU is in-charge of memory ordering

– If MEU blindly executes above operations, eax may get wrong value

· Could require memory operations dispatch non-speculatively– x86 has small register set Þ uses memory often– Preventing Stores from passing Stores/Loads: 3%~5% perf. loss

P6 chooses not allow Stores to pass Stores/Loads– Preventing Loads from passing Loads/Stores Big performance loss

Need to allow loads to pass stores, and loads to pass loads

· MEU allows speculative execution as much as possible

Page 43: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 43

OOO Execution of Memory Operations· Stores are not executed OOO

– Stores are never performed speculatively there is no transparent way to undo them

– Stores are also never re-ordered among themselves The Store Buffer dispatches a store only when

the store has both its address and its data, and there are no older stores awaiting dispatch

· Resolving memory dependencies– Some memory dependencies can be resolved statically

store r1,a load r2,b

– Problem: some cannot store r1,[r3]; load r2,b

Þ can advance load before store

Þ load must wait till r3 is known

Page 44: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 44

Memory Order Buffer (MOB)· Every memory uop is allocated an entry in order

– De-allocated when retires· Address & data (for stores), are updated when known· Load is checked against all previous stores

– Waits if store to same address exist, but data not ready If store data exists, just use it: Store → Load forwarding

– Waits till all previous store addresses are resolved In case of no address collision – go to memory

· Memory Disambiguation– MOB predicts if a load can proceed despite unknown STAs

Predict colliding block Load if there is unknown STA (as usual) Predict non colliding execute even if there are unknown STAs

– In case of wrong prediction The entire pipeline is flushed when the load retires

Page 45: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 45

Pipeline: Load: Allocate

· Allocate ROB/RS, MOB entries· Assign Store ID to enable ordering

IDQ

Alloc

ROBRS

RetireSchedule

LB

AGU LB Write

DTLB DCU WBMOB

Page 46: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 46

Pipeline: Bypassed Load: EXE

· RS checks when data used for address calculation is ready· AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp. · Write load into Load Buffer· DTLB Virtual → Physical + DCU set access· MOB checks blocking and forwarding· DCU read / Store Data Buffer read (Store → Load forwarding)· Write back data / write block code

IDQ

Alloc

ROBRS

RetireSchedule

LB

AGU LB Write

DTLB DCU WBMOB

Page 47: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 47

Pipeline: Blocked Load Re-dispatch

· MOB determines which loads are ready, and schedules one· Load arbitrates for MEU · DTLB Virtual → Physical + DCU set access· MOB checks blocking/forwarding· DCU way select / Store Data Buffer read· write back data / write block code

IDQ

Alloc

ROBRS

RetireSchedule

LB

AGU LB Write

DTLB DCU WBMOB

Page 48: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 48

Pipeline: Load: Retire

· Reclaim ROB, LB entries· Commit results to RRF

IDQ

Alloc

ROBRS

RetireSchedule

LB

AGU LB Write

DTLB DCU WBMOB

Page 49: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 49

DCU Miss

· If load misses in the DCU, the DCU– Marks the write-back data as invalid – Assigns a fill buffer to the load – Starts MLC access– When critical chunk is returned

DCU writes it back on WB bus

Page 50: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 50

Pipeline: Store: Allocate

· Allocate ROB/RS· Allocate Store Buffer entry

IDQ RS

Alloc Schedule AGU SB

DTLB

ROB

SB

Retire

Page 51: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 51

Pipeline: Store: STA EXE

· RS checks when data used for address calculation is ready– dispatches STA to AGU

· AGU calculates linear address· Write linear address to Store Buffer· DTLB Virtual → Physical · Load Buffer Memory Disambiguation verification· Write physical address to Store Buffer

IDQ RS

ScheduleAlloc AGU SBV.A.

ROB

DTLB SBP.A.

SB

Retire

Page 52: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 52

Pipeline: Store: STD EXE

· RS checks when data for STD is ready– dispatches STD

· Write data to Store Buffer

IDQ RS

ScheduleAlloc SB data

ROB

SB

Retire

Page 53: Computer Architecture The P6 Micro-Architecture  An Example of an Out-Of-Order Micro-processor

Computer Architecture 2011 – P6 uArch 53

Pipeline: Senior Store Retirement

· When STA (and thus STD) retires– Store Buffer entry marked as senior

· When DCU idle Þ MOB dispatches senior store· Read senior entry

– Store Buffer sends data and physical address· DCU writes data· Reclaim SB entry

SB

IDQ RS

ScheduleAlloc

ROB

Retire

SB DCUMOB