Speculative Data-Driven Multithreading (an implementation of pre-execution)
description
Transcript of Speculative Data-Driven Multithreading (an implementation of pre-execution)
"DDMT", Amir Roth, HPCA-7 1
Speculative Data-Driven Multithreading (an implementation of pre-execution)
Amir Roth and Gurindar S. SohiHPCA-7
Jan. 22, 2001
"DDMT", Amir Roth, HPCA-7 2
Pre-Execution
Goal: high single-thread performance
Problem: μarchitectural latencies of “problem” instructions (PIs) Memory: cache misses, Pipeline: mispredicted branches
Solution: decouple μarchitectural latencies from main thread Execute copies of PI computations in parallel with whole
program Copies execute PIs faster than main thread “pre-execute”
Why? Fewer instructions Initiate Cache misses earlier Pre-computed branch outcomes, relay to main thread
DDMT: an implementation of pre-execution
"DDMT", Amir Roth, HPCA-7 3
Pre-Execution is a Supplement
Fundamentally: tolerating non-execution latencies (pipeline, memory) requires values faster than execution can provide them
Ways of providing values faster than execution Old: Behavioral prediction: table-lookup
+ small effort per value, - less than perfect (“problems”) New: Pre-execution: executes fewer instructions
+ perfect accuracy, - more effort per value
Solution: supplement behavioral prediction with pre-execution Key: behavioral prediction must handle majority of cases Good news: it already does
"DDMT", Amir Roth, HPCA-7 4
Data-Driven Multithreading (DDMT)
DDMT: an implementation of pre-execution Data-Driven Thread (DDT): pre-executed computation of PI
Implementation: extension to simultaneous multithreading (SMT)
SMT is a reality (21464) Low static cost: minimal additional hardware
Pre-execution siphons execution resources Low dynamic cost: fine-grain, flexible bandwidth partitioning
Take only as much as you need Minimize contention, overhead
Paper: Metrics, algorithms and mechanics Talk: Mostly mechanics
"DDMT", Amir Roth, HPCA-7 5
Talk Outline
Working example in 3 parts Some details Numbers, numbers, numbers
"DDMT", Amir Roth, HPCA-7 6
Example.1 Identify PIs
Use profiling to find PIs
for (node=list; node; node = node->next)
node->val -= node->neighbor->val * node->coeff; if (node->neighbor != NULL)
STATIC CODE
ldt f1, 16(r2) I5
ldt f0, 16(r1)I4
ldt f2, 24(r1)I6
ldt f1, 16(r2)I5
ldt f0, 16(r1)I4
ldq r1, 0(r1)I10stt f0, 16(r1)I9subt f0, f3, f0I8mult f1, f2, f3I7
beq r2, I10I3
ldq r2, 8(r1)I2beq r1, I12I1br I1I11
beq r2, I10I3
ldq r2, 8(r1)I2beq r1, I12I1br I1I5ldq r1, 0(r1)I3INSTPC
DY
NA
MIC
IN
SN
STR
EA
M
Running example: same as the paper Simplified loop from EM3D
Few static PIs cause most dynamic “problems”
Good coverage with few static DDTs
"DDMT", Amir Roth, HPCA-7 7
ldt f1, 16(r2) I5
ldt f0, 16(r1)I4
ldt f2, 24(r1)I6
ldt f1, 16(r2) I5
ldt f0, 16(r1)I4
ldq r1, 0(r1)I10stt f0, 16(r1)I9subt f0, f3, f0I8mult f1, f2, f3I7
beq r2, I10I3
ldq r2, 8(r1)I2beq r1, I12I1br I1I11
beq r2, I10I3
ldq r2, 8(r1)I2beq r1, I12I1br I1I11ldq r1, 0(r1)I10INSTPC
Example.2 Extract DDTs
Use first instruction as DDT trigger Dynamic trigger instances signal DDT fork
ldt f1, 16(r2) I5
beq r2, I10I3
ldq r2, 8(r1)I2
ldq r1, 0(r1)I10
ldq r1, 0(r1)I10
ldt f1, 16(r2) I5beq r2, I10I3ldq r2, 8(r1)I2ldq r1, 0(r1)I10INSTPC
DDTC (static)
I10:
Examine program traces Start with PIs Work backwards, gather backward-slices Eventually stop. When? (see paper)
Pack last N-1 slice instructions into DDT
Load DDT into DDTC (DDT$)
"DDMT", Amir Roth, HPCA-7 8
Example.3 Pre-Execute DDTs
Main thread (MT)
ldt f0, 16(r1)I4
beq r1, I12I1br I1I11ldq r1, 0(r1)I10
ldq r2, 8(r1)I2beq r2, I10I3
ldt f1, 16(r2) I5
stt f0, 16(r1)I9subt f0, f3, f0I8ldt f1, 16(r2) I5
mult f1, f2, f3I7beq r2, I10I3ldt f2, 24(r1)I6ldq r2, 8(r1)I2
INSTPC ldt f1, 16(r2) I5ldq r1, 0(r1)I10
DD
T
ldt f0, 16(r1)I4
beq r2, I10I3
ldq r2, 8(r1)I2beq r1, I12I1br I1I11ldq r1, 0(r1)I10INSTPC
…ldq r1, 0(r1)I10INSTPC
DDTC (static)
I10: MT
MT integrates DDT results Instr’s not re-executed reduces
contention Shortens MT critical path Pre-computed branch avoids mis-
prediction
MT, DDT execute in parallel
Executed a trigger instr?
Fork DDT (μarch)
DDT initiates cache miss
“Absorbs” latency
"DDMT", Amir Roth, HPCA-7 9
Details.1 More About DDTs
Composed of instr’s from original program Required by integration
Should look like normal instr’s to processor
Data-driven: instructions are not “sequential” No explicit control-flow How are they sequenced? Pack into “traces” (in DDTC) Execute all instructions (branches too) Save results for integration
No runaway threads, better overhead control “Contain” any control-flow (e.g. unrolled
loops)
ldt f1, 16(r2) I5beq r2, I10I3ldq r2, 8(r1)I2ldq r1, 0(r1)I10INSTPC
I10:
ldt f1, 16(r2) I5beq r2, I10I3ldq r2, 8(r1)I2
ldq r1, 0(r1)I10INSTPC
I10:ldq r1, 0(r1)I10ldq r1, 0(r1)I10ldq r1, 0(r1)I10
"DDMT", Amir Roth, HPCA-7 10
Details.2 More About Integration
Centralized physical register file Use for DDT-MT communication
beq r1, I12br I1ldq r1, 0(r1)
ldq r2, 8(r1)ldq r1, 0(r1)
stt f0, 16(r1)subt f0, f3, f0mult f1, f2, f3ldt f2, 24(r1)
ldt f1, 16(r2)
ldt f0, 16(r1)
beq r2, I10
ldq r2, 8(r1)beq r1, I12br I1ldq r1, 0(r1)
ldq r2, 8(r1) ldq r1, 0(r1)
r1
ldq r1, 0(r1)
ldq r2, 8(r1)ldq r2, 8(r1)
More on integration: implementation of squash-reuse [MICRO-33]
Fork: copy MT rename map to DDT DDT locates MT values via
lookups “Roots” integration
Integration: match PC/physical registers to establish DDT-MT instruction correspondence
MT “claims” physical registers allocated by DDT
Modify register-renaming to do this
"DDMT", Amir Roth, HPCA-7 11
Details.3 More About DDT Selection
Very important problem Very important problem
Fundamental aspects Metrics, algorithms Promising start See paper
Practical aspects Who implements algorithm? How do DDTs get into DDTC? Paper: profile-driven, offline, executable annotations Open question
"DDMT", Amir Roth, HPCA-7 12
Performance Evaluation
SPEC2K, Olden, Alpha EV6, –O3 –fast Chose programs with problems
SimpleScalar-based simulation environment
DDT selection phase: functional simulation on small input
DDT measurement phase: timing simulation on larger input 8-wide, superscalar, out-of-order core 128 ROB, 64 LDQ, 32 STQ, 80 RS (shared) Pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 2
load 32KB I$/64KB D$ (2-way), 1MB L2$ (4-way), mem b/w: 8 b/cyc.
DDTC: 16 DDTs, 32 instructions (max) per DDT
"DDMT", Amir Roth, HPCA-7 13
Numbers.1 The Bottom Line
Cache misses Speedups vary, 10-15% DDT “unrolling”:
increases latency tolerance (paper)
0
5
10
15
20
25
30
parser mcf gzip vpr em3d mstExec
utio
n Ti
me
Save
d (%
)
0
5
10
15
20
25
30
eon crafty gzip vpr em3d bhExe
cutio
n Ti
me
Sav
ed (%
)
Branch mispredictions Speedups lower, 5-10% More PIs, lower coverage Branch integration
!= perfect branch prediction
Effects mix
"DDMT", Amir Roth, HPCA-7 14
Numbers.2 A Closer Look
DDT overhead: fetch utilization
~5% (reasonable) Fewer MT fetches (always)
Contention Fewer total fetches
Early branch resolution0
20
40
60
80
100
120
mcf.l vpr.l mst.l eon.b gzip.b em3d.b
Fetc
hed
DD
MT/
Bas
e (%
)
DDT
MT
0
20
40
60
80
100
mcf.l vpr.l mst.l eon.b gzip.b em3d.bDD
T In
teg
rate
d/F
etch
ed (
%) NOT COMPLETED
COMPLETED
DDT utility: integration rates
Vary, mostly ~30% (low) Completed: well done Not completed: a little late Input-set differences Greedy DDTs, no early
exits
"DDMT", Amir Roth, HPCA-7 15
Numbers.3 Is Full DDMT necessary?
How important is integration? Important for branch resolution, less so for prefetching
How important is decoupling? What if we just priority-scheduled? Extremely. No data-driven sequencing no speedup
-5
0
5
10
15
20
25
30
mcf.l vpr.l mst.l eon.b gzip.b em3d.bExe
cutio
n Ti
me
Sav
ed (%
) FULL DDMT
NO INTEGRATION
NO DATA DRIVEN SEQUENCING
"DDMT", Amir Roth, HPCA-7 16
Summary
Pre-execution: supplements behavioral prediction Decouple/absorb μarchitectural latencies of “problem”
instructions
DDMT: an implementation of pre-execution An extension to SMT
Few hardware changes Bandwidth allocation flexibility reduces overhead
Future: DDT selection Fundamental: better metrics/algorithms to increase
utility/coverage Practical: an easy implementation
Coming to a university/research-lab near you