Speculative Data-Driven Multithreading (an implementation of pre-execution)


Transcript of Speculative Data-Driven Multithreading (an implementation of pre-execution)

Page 1: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 1

Speculative Data-Driven Multithreading (an implementation of pre-execution)

Amir Roth and Gurindar S. Sohi, HPCA-7

Jan. 22, 2001

Page 2: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 2

Pre-Execution

- Goal: high single-thread performance
- Problem: μarchitectural latencies of “problem” instructions (PIs)
  - Memory: cache misses
  - Pipeline: mispredicted branches
- Solution: decouple μarchitectural latencies from the main thread
  - Execute copies of PI computations in parallel with the whole program
  - Copies execute PIs faster than the main thread: they “pre-execute”
  - Why faster? Fewer instructions
  - Initiate cache misses earlier; pre-compute branch outcomes and relay them to the main thread
- DDMT: an implementation of pre-execution

Page 3: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 3

Pre-Execution is a Supplement

- Fundamentally: tolerating non-execution latencies (pipeline, memory) requires values faster than execution can provide them
- Ways of providing values faster than execution:
  - Old: behavioral prediction (table lookup): + small effort per value, − less than perfect (“problems”)
  - New: pre-execution (executes fewer instructions): + perfect accuracy, − more effort per value
- Solution: supplement behavioral prediction with pre-execution
  - Key: behavioral prediction must handle the majority of cases
  - Good news: it already does

Page 4: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 4

Data-Driven Multithreading (DDMT)

- DDMT: an implementation of pre-execution
- Data-Driven Thread (DDT): the pre-executed computation of a PI
- Implementation: an extension to simultaneous multithreading (SMT)
  - SMT is a reality (Alpha 21464)
  - Low static cost: minimal additional hardware
- Pre-execution siphons execution resources
  - Low dynamic cost: fine-grain, flexible bandwidth partitioning
  - Take only as much as you need; minimize contention and overhead
- Paper: metrics, algorithms, and mechanics. Talk: mostly mechanics

Page 5: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 5

Talk Outline

- Working example in 3 parts
- Some details
- Numbers, numbers, numbers

Page 6: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 6

Example.1 Identify PIs

Use profiling to find PIs

    for (node = list; node; node = node->next)
        if (node->neighbor != NULL)
            node->val -= node->neighbor->val * node->coeff;

STATIC CODE:

    I1:  beq  r1, I12      ; exit loop if node == NULL
    I2:  ldq  r2, 8(r1)    ; r2 = node->neighbor
    I3:  beq  r2, I10      ; skip body if neighbor == NULL
    I4:  ldt  f0, 16(r1)   ; f0 = node->val
    I5:  ldt  f1, 16(r2)   ; f1 = neighbor->val
    I6:  ldt  f2, 24(r1)   ; f2 = node->coeff
    I7:  mult f1, f2, f3
    I8:  subt f0, f3, f0
    I9:  stt  f0, 16(r1)   ; node->val = f0
    I10: ldq  r1, 0(r1)    ; node = node->next
    I11: br   I1

[Figure: the same instructions shown as a DYNAMIC INSN STREAM over several iterations, with the problem loads highlighted]

- Running example: same as the paper, a simplified loop from EM3D
- Few static PIs cause most dynamic “problems”
- Good coverage with few static DDTs
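The profiling step above can be sketched as follows. This is an illustrative reconstruction, not the paper's tool: all names (`identify_pis`, `coverage`) are invented, but it captures the observation that a few static PCs cause most dynamic problems, so greedily ranking PCs by problem count yields good coverage with few static PIs.

```python
# Hypothetical sketch of profile-driven PI identification: rank static
# PCs by how many dynamic "problems" (cache misses, mispredictions)
# they cause, and keep the few PCs that cover most of them.
from collections import Counter

def identify_pis(problem_events, coverage=0.9):
    """problem_events: iterable of static PCs, one per dynamic problem.
    Returns a small list of PCs covering `coverage` of all events."""
    counts = Counter(problem_events)
    total = sum(counts.values())
    pis, covered = [], 0
    for pc, n in counts.most_common():
        if covered / total >= coverage:
            break
        pis.append(pc)
        covered += n
    return pis

# A profile in which two static loads (I10, I5) cause most misses:
events = ["I10"] * 60 + ["I5"] * 30 + ["I2"] * 6 + ["I4"] * 4
print(identify_pis(events))  # ['I10', 'I5']
```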

Page 7: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 7

[Figure: dynamic instruction stream for two iterations, with the backward slice of the problem load I5 (I10, I2, I3, I5) highlighted]

Example.2 Extract DDTs

- Use the first instruction as the DDT trigger: dynamic trigger instances signal a DDT fork
- Examine program traces: start with PIs, work backwards, gather backward slices; eventually stop. When? (see paper)
- Pack the last N-1 slice instructions into the DDT
- Load the DDT into the DDTC (DDT$)

DDTC (static), trigger I10:

    I10: ldq r1, 0(r1)
    I2:  ldq r2, 8(r1)
    I3:  beq r2, I10
    I5:  ldt f1, 16(r2)

Page 8: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 8

Example.3 Pre-Execute DDTs

[Figure: the main thread (MT) and the DDT fetching in parallel; at a dynamic instance of trigger I10, the MT forks the DDT from the DDTC, and the DDT runs ahead to execute I10, I2, I3, and the problem load I5]

- MT and DDT execute in parallel
- Executed a trigger instruction? Fork the DDT (μarch)
- The DDT initiates the cache miss early and “absorbs” its latency
- The MT integrates DDT results
  - Instructions are not re-executed: reduces contention
  - Shortens the MT critical path
  - A pre-computed branch avoids a misprediction
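The latency-absorption argument can be made concrete with a toy timing model. All numbers here are invented for illustration: the point is only that forking the DDT at the trigger starts the PI's cache miss early, so the main thread sees only the remaining latency when it reaches the PI.

```python
# Toy timing sketch (all cycle counts invented) of latency absorption:
# the DDT executes few instructions, so it issues the PI's miss almost
# immediately after the fork, while the MT still has work to do.
MISS_LATENCY = 100           # assumed cycles for a cache miss

def mt_stall(trigger_to_pi_cycles, pre_execute=True):
    """Cycles the main thread stalls on the PI's miss.
    trigger_to_pi_cycles: MT work between the DDT trigger and the PI."""
    if not pre_execute:
        return MISS_LATENCY                  # full miss exposed
    # The DDT overlaps the miss with the MT's remaining work:
    return max(0, MISS_LATENCY - trigger_to_pi_cycles)

print(mt_stall(80))                      # 20: most of the miss absorbed
print(mt_stall(80, pre_execute=False))   # 100: full miss exposed
```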

Page 9: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 9

Details.1 More About DDTs

- Composed of instructions from the original program
  - Required by integration
  - Should look like normal instructions to the processor
- Data-driven: instructions are not “sequential”, no explicit control-flow
  - How are they sequenced? Pack into “traces” (in the DDTC)
  - Execute all instructions (branches too); save results for integration
  - No runaway threads, better overhead control
  - “Contain” any control-flow (e.g. unrolled loops)

[Figure: DDTC entries for trigger I10; an unrolled DDT repeats ldq r1, 0(r1) (I10) to run several links ahead of the main thread]
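Data-driven sequencing can be sketched as a simple table lookup. The structures below are invented for illustration: the DDTC maps a trigger PC to a packed, fixed-length trace, so a forked DDT has no PC-driven control flow and cannot run away.

```python
# Minimal sketch (invented structures) of data-driven sequencing: a DDT
# is fetched as one packed trace from the DDTC; every instruction in
# it, branches included, is injected exactly once.
DDTC = {"I10": ["ldq r1, 0(r1)", "ldq r2, 8(r1)",
                "beq r2, I10",   "ldt f1, 16(r2)"]}

def fork_ddt(trigger_pc):
    """On a dynamic trigger instance, return the whole trace to inject;
    the trace has a fixed length, bounding DDT overhead."""
    return list(DDTC.get(trigger_pc, []))

print(fork_ddt("I10"))   # all four instructions, in packed order
```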

Page 10: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 10

Details.2 More About Integration

- Centralized physical register file: used for DDT-MT communication

[Figure: MT and DDT rename maps pointing into the shared physical register file; matching PC/register pairs link DDT instructions to their MT counterparts]

- More on integration: an implementation of squash-reuse [MICRO-33]
- Fork: copy the MT rename map to the DDT
  - The DDT locates MT values via lookups; this “roots” integration
- Integration: match PC/physical registers to establish a DDT-MT instruction correspondence
  - The MT “claims” physical registers allocated by the DDT
  - Modify register renaming to do this
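The matching step can be sketched as below. This is a hedged reconstruction, not the paper's hardware: the dictionary keyed by (PC, input physical registers) stands in for the integration lookup, and all names are invented. On a match, the MT claims the DDT's output register instead of re-executing the instruction.

```python
# Sketch of integration at rename: the MT checks whether the DDT
# already executed this instruction with the same input physical
# registers; if so it claims the DDT's output register (no re-execute),
# otherwise it allocates a fresh register and executes normally.
def rename_with_integration(pc, src_pregs, ddt_results, alloc_preg):
    """ddt_results: {(pc, src_preg_tuple): dest_preg} left by the DDT."""
    key = (pc, tuple(src_pregs))
    if key in ddt_results:
        return ddt_results.pop(key), True   # integrated: reuse result
    return alloc_preg(), False              # normal rename: execute later

ddt_results = {("I10", ("p7",)): "p21"}     # DDT executed ldq r1, 0(r1)
next_free = iter(["p30", "p31"])
preg, integrated = rename_with_integration("I10", ["p7"], ddt_results,
                                           lambda: next(next_free))
print(preg, integrated)   # p21 True: MT claims the DDT's result
```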

Page 11: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 11

Details.3 More About DDT Selection

- A very important problem
- Fundamental aspects: metrics, algorithms; a promising start (see paper)
- Practical aspects: who implements the algorithm? How do DDTs get into the DDTC?
  - Paper: profile-driven, offline, executable annotations
  - Open question

Page 12: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 12

Performance Evaluation

- SPEC2K, Olden; Alpha EV6, -O3 -fast; chose programs with problems
- SimpleScalar-based simulation environment
  - DDT selection phase: functional simulation on a small input
  - DDT measurement phase: timing simulation on a larger input
- 8-wide, superscalar, out-of-order core
  - 128 ROB, 64 LDQ, 32 STQ, 80 RS (shared)
  - Pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 2 load
  - 32KB I$ / 64KB D$ (2-way), 1MB L2$ (4-way), mem b/w: 8 B/cyc.
- DDTC: 16 DDTs, 32 instructions (max) per DDT

Page 13: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 13

Numbers.1 The Bottom Line

- Cache misses: speedups vary, 10-15%
  - DDT “unrolling” increases latency tolerance (paper)
- Branch mispredictions: speedups lower, 5-10%
  - More PIs, lower coverage
  - Branch integration != perfect branch prediction
  - Effects mix

[Charts: execution time saved (%), 0-30. Left: cache-miss benchmarks (parser, mcf, gzip, vpr, em3d, mst). Right: branch benchmarks (eon, crafty, gzip, vpr, em3d, bh)]

Page 14: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 14

Numbers.2 A Closer Look

- DDT overhead: fetch utilization ~5% (reasonable)
  - Fewer MT fetches (always): contention
  - Fewer total fetches: early branch resolution
- DDT utility: integration rates vary, mostly ~30% (low)
  - Completed integrations: well done
  - Not completed: a little late
  - Input-set differences; greedy DDTs, no early exits

[Charts, for mcf.l, vpr.l, mst.l, eon.b, gzip.b, em3d.b. Left: instructions fetched, DDMT/base (%), 0-120, split into DDT and MT. Right: DDT instructions integrated/fetched (%), 0-100, split into completed and not completed]

Page 15: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 15

Numbers.3 Is Full DDMT necessary?

- How important is integration? Important for branch resolution, less so for prefetching
- How important is decoupling? What if we just priority-scheduled?
  - Extremely important: with no data-driven sequencing, there is no speedup

[Chart: execution time saved (%), -5 to 30, for mcf.l, vpr.l, mst.l, eon.b, gzip.b, em3d.b, comparing FULL DDMT, NO INTEGRATION, and NO DATA-DRIVEN SEQUENCING]

Page 16: Speculative Data-Driven Multithreading  (an implementation of pre-execution)

"DDMT", Amir Roth, HPCA-7 16

Summary

- Pre-execution: supplements behavioral prediction
  - Decouple/absorb μarchitectural latencies of “problem” instructions
- DDMT: an implementation of pre-execution
  - An extension to SMT: few hardware changes
  - Bandwidth allocation flexibility reduces overhead
- Future: DDT selection
  - Fundamental: better metrics/algorithms to increase utility/coverage
  - Practical: an easy implementation
- Coming to a university/research-lab near you