Διασωλήνωση - Pipelining · C D 6 PM 7 8 9 T a s k O r d e r Time 30 40 40 40 40 20...
Transcript of Διασωλήνωση - Pipelining · C D 6 PM 7 8 9 T a s k O r d e r Time 30 40 40 40 40 20...
1
Διασωλήνωση - Pipelining
Pedro Trancoso Appendix A
(και διαφάνειες από Prof. David Culler, Berkeley)
Βιοµηχανία Αυτοκινήτων
2
Βιοµηχανία Αυτοκινήτων
Βιοµηχανία Αυτοκινήτων
3
Βιοµηχανία Αυτοκινήτων
Βιοµηχανία Αυτοκινήτων
1
4
Βιοµηχανία Αυτοκινήτων
2 1
Βιοµηχανία Αυτοκινήτων
3 2 1
5
Βιοµηχανία Αυτοκινήτων
4 3 2 1
Βιοµηχανία Αυτοκινήτων
5 4 3 2
6
Ας πλύνουµε τα ρούχα! Σειριακή Μεθόδος
n Sequential laundry takes 6 hours for 4 loads n If they learned pipelining, how long would laundry take?
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
T a s k O r d e r
Time
Χρησιµοποιώντας Διασωλήνωση...
n Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
T a s k O r d e r
Time
30 40 40 40 40 20
7
Μάθαµε ότι... n Pipelining doesn’t help latency (χρόνος αναµονής) of single task, it helps throughput (ρυθµοαπόδοση) of entire workload
n Pipeline rate limited by slowest pipeline stage
n Multiple tasks operating simultaneously
n Potential speedup = Number pipe stages
n Unbalanced lengths of pipe stages reduces speedup
n Time to “fill” pipeline and time to “drain” it reduces speedup
A
B
C
D
6 PM 7 8 9
T a s k O r d e r
Time
30 40 40 40 40 20
Διασωλήνωση Εντολών n Execute billions of instructions, so throughput is
what matters n What is desirable in the ISA (ΑΣΕ) for pipelining?
n Variable length instructions vs. all instructions same length?
n Memory operands part of any operation vs. memory operands only in loads or stores?
n Register operand (τελεστέος) many places in instruction format vs. registers located in same place?
8
Κύκλος Εκτέλεσης Instruction
Fetch
Instruction Decode
Operand Fetch
Execute
Result Store
Next Instruction
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor instruction
Στάδια Διασωλήνωσης του DLX (1) 1. Instruction Fetch (IF) 2. Instruction Decode / Register Fetch (ID) 3. Execution / Effective Address (EX) 4. Memory Access / Branch Completion
(MEM) 5. Write-back (WB) (go to 1!)
9
Στάδια Διασωλήνωσης του DLX (2) 1. Instruction Fetch (IF)
IR ! Mem[PC] NPC ! PC+4
2. Instruction Decode / Register Fetch (ID) A ! Regs[IR6..10] B ! Regs[IR11..15] Imm ! ((IR16)^16##IR16..31)
3. Execution / Effective Address (EX) Mem ref: ALU output ! A+Imm Reg-reg (ALUop): ALUoutput ! A op B Reg-imm (ALU op): ALU output ! A op Imm Branch: ALU output ! NPC+Imm
cond ! (A op 0)
Στάδια Διασωλήνωσης του DLX (3) 4. Memory Access / Branch Completion (MEM)
Mem access: LMD ! Mem[ALU output] Mem[ALU output] ! B
Branch: if (cond) PC ! ALU output else PC ! NPC
5. Write-back (WB) Reg-reg ALU instr: Regs[IR16..20] ! ALU output Reg-imm ALU instr: Regs[IR11..15] ! ALU output Load instr: Regs[IR11..15] ! LMD
(go to 1!)
10
Διασωλήνωση του DLX Cycle number 1
Instruction I IF
Instruction I+1
Instruction I+2
Instruction I+3
Instruction I+4
DLX Pipeline Cycle number 1 2
Instruction I IF ID
Instruction I+1 IF
Instruction I+2
Instruction I+3
Instruction I+4
11
DLX Pipeline Cycle number 1 2 3
Instruction I IF ID EX
Instruction I+1 IF ID
Instruction I+2 IF
Instruction I+3
Instruction I+4
DLX Pipeline Cycle number 1 2 3 4
Instruction I IF ID EX MEM
Instruction I+1 IF ID EX
Instruction I+2 IF ID
Instruction I+3 IF
Instruction I+4
12
DLX Pipeline Cycle number 1 2 3 4 5
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM
Instruction I+2 IF ID EX
Instruction I+3 IF ID
Instruction I+4 IF
DLX Pipeline Cycle number 1 2 3 4 5 6
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM WB
Instruction I+2 IF ID EX MEM
Instruction I+3 IF ID EX
Instruction I+4 IF ID
13
DLX Pipeline Cycle number 1 2 3 4 5 6 7 8 9
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM WB
Instruction I+2 IF ID EX MEM WB
Instruction I+3 IF ID EX MEM WB
Instruction I+4 IF ID EX MEM WB
Παράδειγµα: MIPS
Op 31 26 0 15 16 20 21 25
Rs1 Rd immediate
Op 31 26 0 25
Op 31 26 0 15 16 20 21 25
Rs1 Rs2
target
Rd Opx
Register-Register 5 6 10 11
Register-Immediate
Op 31 26 0 15 16 20 21 25
Rs1 Rs2/Opx immediate
Branch
Jump / Call
14
5 Βήµατα της Διόδου Δεδοµένων (Datapath) του MIPS
Memory Access
Write Back
Instruction Fetch
Instr. Decode Reg. Fetch
Execute Addr. Calc
L M D
ALU
MU
X
Mem
ory
Reg File
MU
X MU
X
Data
Mem
ory
MU
X
Sign Extend
4
Adder
Zero?
Next SEQ PC
Address
Next PC
WB Data
Inst
RD
RS1 RS2
Imm
Memory Access
Write Back
Instruction Fetch
Instr. Decode Reg. Fetch
Execute Addr. Calc
ALU
Mem
ory
Reg File
MU
X MU
X
M
emory
MU
X
Sign Extend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
Address
RS1 RS2
Imm
MU
X
Datapath
Control Path
5 Βήµατα της Διόδου Δεδοµένων (Datapath) του MIPS
15
Memory Access
Write Back
Instruction Fetch
Instr. Decode Reg. Fetch
Execute Addr. Calc
ALU
Mem
ory
Reg File
MU
X MU
X
Mem
ory
MU
X
Sign Extend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
Address
RS1 RS2
Imm
MU
X
Datapath
Control Path
Inst
1
Inst
1
Inst
2
Inst
1
Inst
2
Inst
3
5 Βήµατα της Διόδου Δεδοµένων (Datapath) του MIPS
Εκτέλεση µε Διασωλήνωση
I n s t r. O r d e r
Time (clock cycles)
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5
16
Όρια της Διασωλήνωσης
n Hazards (Κίνδυνοι): circumstances that would cause incorrect execution if next instruction were launched n Structural hazards (κίνδυνος δοµής): Attempting to
use the same hardware to do two different things at the same time
n Data hazards (κίνδυνος δεδοµένων): Instruction depends on result of prior instruction still in the pipeline
n Control hazards (κίνδυνος ελέγχου): Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Παράδειγµα: µια πόρτα µνήµης
I n s t r. O r d e r
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5
DMem
Structural Hazard
17
Λύσεις για κίνδυνους δοµής n Defn: attempt to use same hardware for
two different things at the same time n Solution 1: Wait
⇒ must detect the hazard ⇒ must have mechanism to stall
n Solution 2: Throw more hardware at the problem
Εύρεση και λύση ενός κινδύνου δοµής
I n s t r. O r d e r
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5
Reg ALU
DMem Ifetch Reg
Bubble Bubble Bubble Bubble Bubble
18
Λύση των Κίνδυνων Δοµών στο σχεδιασµό
ALU
Instr Cache
Reg File
MU
X MU
X
Data
Cache
MU
X
Sign Extend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
Address
RS1 RS2
Imm
MU
X
Datapath
Control Path
Σηµασία του ΑΣΕ στην λύση των Κινδύνων Δοµής n Simple to determine the sequence of
resources used by an instruction n opcode tells it all
n Uniformity in the resource usage n Compare MIPS to IA32? n MIPS approach => all instructions flow
through same 5-stage pipeling
19
I n s t r. O r d e r
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9 xor r10,r1,r11
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Κίνδυνοι Δεδοµένων Time (clock cycles)
IF ID/RF EX MEM WB
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
n Read After Write (RAW) (Διάβασµα µετά από γράψιµο) InstrJ tries to read operand before InstrI writes it
Caused by a “Data Dependence” (εξάρτηση δεδοµένων). This hazard results from an actual need for communication.
Τρεις Είδους Κίνδυνοι Δεδοµένων
I: add r1,r2,r3 J: sub r4,r1,r3
20
n Write After Read (WAR) (Γράψιµο µετά από διάβασµα) InstrJ writes operand before InstrI reads it
n Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
n Can it happen in the MIPS 5 stage pipeline? n Can’t happen in MIPS 5 stage pipeline because:
n All instructions take 5 stages, and n Reads are always in stage 2, and n Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7
Τρεις Είδους Κίνδυνοι Δεδοµένων
n Write After Write (WAW) (Γράψιµο µετά από γράψιµο) InstrJ writes operand before InstrI writes it.
n Called an “output dependence” (εξάρτηση εξόδου) by compiler writers. This also results from the reuse of name “r1”.
n Can it happen in the MIPS 5 stage pipeline? n Can’t happen in MIPS 5 stage pipeline because:
n All instructions take 5 stages, and n Writes are always in stage 5
n Will see WAR and WAW in later more complicated pipes
I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7
Τρεις Είδους Κίνδυνοι Δεδοµένων
21
Time (clock cycles)
Μεταβίβαση (Forwarding) για αποφυγή Κινδύνου Δεδοµένων
I n s t
r.
O r d e r
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Αλλαγές στο υλικό για Forwarding
MEM
/WR
ID/EX
EX/M
EM
Data Memory
ALU
mux
mux
Registers
NextPC
Immediate
mux
22
Time (clock cycles)
I n s t r. O r d e r
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Κίνδυνοι Δεδοµένων και µε Forwarding
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Λήξη του κινδύνου της εντολής φόρτωσης
n Adding hardware? ... not n Detection? n Compilation techniques?
n What is the cost of load delays?
23
Time (clock cycles)
or r8,r1,r9
I n s t r. O r d e r
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg ALU
DMem Ifetch Reg
Reg Ifetch ALU
DMem Reg Bubble
Ifetch ALU
DMem Reg Bubble Reg
Ifetch ALU
DMem Bubble Reg
How is this different from the instruction issue stall?
Λήξη του κινδύνου της εντολής φόρτωσης
Χρονοδροµολόγηση Λογισµικού για αποφυγή κίνδυνου δεδοµένων Software Scheduling to Avoid Load Hazards
Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd
Try producing fast code for a = b + c; d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd
STALL
STALL
24
Σχέση µε τη ΕΣΑ n What is exposed about this organizational
hazard in the instruction set? n k cycle delay?
n bad, CPI is not part of ISA n k instruction slot delay
n load should not be followed by use of the value in the next k instructions
n Nothing, but code can reduce run-time delays n MIPS did the transformation in the assembler
Κίνδυνος Ελέγχου από εντολές διακλάδωσης (Branches) => Three Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
Reg ALU
DMem Ifetch Reg
25
Παράδειγµα: Branch Stall Impact
n If 30% branch, Stall 3 cycles significant (CPI=?) n Two part solution:
n Determine branch taken or not sooner, AND n Compute taken branch address earlier
n MIPS branch tests if register = 0 or ≠ 0 n MIPS Solution:
n Move Zero test to ID/RF stage n Adder to calculate new PC in ID/RF stage n 1 clock cycle penalty for branch versus 3
Adder
IF/ID
Διασωλήνωση της διόδου δεδοµένων του MIPS
Memory Access
Write Back
Instruction Fetch
Instr. Decode Reg. Fetch
Execute Addr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
Sign Extend
Zero?
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC
RD RD RD WB
Dat
a
• Data stationary control – local decode for each instruction phase / pipeline stage
Next PC
Address
RS1 RS2
Imm
MU
X
ID/EX
26
Τέσσερις Επιλογές για Κίνδυνους Ελέγχου
#1: Stall until branch direction is clear #2: Predict Branch Not Taken
n Execute successor instructions in sequence n “Squash” instructions in pipeline if branch actually taken n Advantage of late pipeline state update n 47% MIPS branches not taken on average (CPI=?) n PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken n 53% MIPS branches taken on average n But haven’t calculated branch target address in MIPS
n MIPS still incurs 1 cycle branch penalty n Other machines: branch target known before outcome
#4: Delayed Branch n Define branch to take place AFTER a following
instruction
branch instruction sequential successor1 sequential successor2 ........ sequential successorn
........ branch target if taken
n 1 slot delay allows proper decision and branch target address in 5 stage pipeline
n MIPS uses this
Branch delay of length n
Τέσσερις Επιλογές για Κίνδυνους Ελέγχου
27
Καθυστερηµένη Εντολή Διακλάδωσης (Delayed Branch)
n Where to get instructions to fill branch delay slot? n Before branch instruction n From the target address: only valuable when branch taken n From fall through: only valuable when branch not taken n Canceling branches allow more slots to be filled
n Compiler effectiveness for single branch delay slot: n Fills about 60% of branch delay slots n About 80% of instructions executed in branch delay slots
useful in computation n About 50% (60% x 80%) of slots usefully filled
n Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Καθυστερηµένη Εντολή Διακλάδωσης (Delayed Branch)
28
Recall: Speed Up συνάρτηση για διασωλήνωση
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline CPI Ideal
depth Pipeline CPI Ideal Speedup ×+×
=
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup ×+
=
Instper cycles Stall Average CPI Ideal CPIpipelined +=
For simple RISC pipeline, CPI = 1:
Παράδειγµα: Αξιολόγηση της Διαφορετικής Υλοποίησης της Εντολής Διακλάδωσης
Assume: Conditional & Unconditional = 14%, 65% change PC Scheduling Branch CPI speedup v. scheme penalty stall
Stall pipeline 3 1.42 1.0 Predict taken 1 1.14 1.26 Predict not taken 1 1.09 1.29 Delayed branch 0.5 1.07 1.31
Pipeline speedup = Pipeline depth1 +Branch frequency×Branch penalty
29
Άλλες Δυσκολίες... n Χρόνος αναµονής της κρυφής µνήµης n Διακοπές και Εξαιρέσεις (Interrupts and
Exceptions) n ΑΣΕ (π.χ. ADD3 42(R1),56(R1)+,@(R1)) n Λειτουργίες Πολύ-κύκλων (Multicycle
Operations)
Σχέση µεταξύ Κρυφής Μνήµης και Διασωλήνωσης
WB
Dat
a
Adder
IF/ID
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X Sign Extend
Zero? MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC
RD RD RD
Next PC
Address
RS1 RS2
Imm
MU
X
ID/EX
I-$ D-$
Memory
30
Λειτουργίες Πολύ-κύκλων
Λειτουργίες Πολύ-κύκλων
31
Παράδειγµα: MIPS R4000
Στάδια (8): IF – IS – RF – EX – DF – DS – TC - WB
Άλλα Παραδείγµατα n PowerPC G4 (4):
n PowerPC G4e (7):
n Pentium 4 (20):
FETCH DECODE / DISPATCH
EXECUTE COMPLETE / WRITE-BACK
FETCH FETCH DECODE / DISP
ISSUE EXE COMPL WRITE - BACK
TCNIP
TCNIP
TCF
TCF
DRIVE
A&R
A&R
A&R
Q SCHE
SCHE
SCHE
DISP
DISP
REG
REG
EXE
FLAGS
BRAN
DRIVE