UPC Compiler Support for Trace-Level Speculative Multithreaded Architectures Antonio González λ,ф...


Page 1: UPC Compiler Support for Trace-Level Speculative Multithreaded Architectures Antonio González λ,ф Carlos Molina ψ Jordi Tubella ф INTERACT-9, San Francisco.

UPC

Compiler Support for Trace-Level Speculative Multithreaded Architectures

Antonio González λ,ф

Carlos Molina ψ

Jordi Tubella ф

INTERACT-9, San Francisco (USA) - February 13, 2005

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

{antonio,jordit}@ac.upc.edu

ψ Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

Page 2:

Trace Level Speculation

Avoids serialization caused by data dependences

Skips multiple instructions in a row

Predicts values based on the past

Introduces penalties due to misspeculations

Two variants of Trace Level Speculation
– with Live Output Test
– with Live Input Test

Page 3:

Trace Level Speculation with Live Output Test

[Figure: timeline of the NST and ST threads. Labels: buffer; live output update & trace speculation; trace misspeculation detection & recovery actions; instruction execution; not executed; live output validation]

Page 4:

TSMA Block Diagram

[Figure: block diagram. Components: I Cache; Fetch Engine; Decode & Rename; Functional Units; Branch Predictor; Trace Speculation Engine; Verification Engine; Look Ahead Buffer; NST Reorder Buffer; ST Reorder Buffer; NST Ld/St Queue; ST Ld/St Queue; NST I Window; ST I Window; L1NSDC; L2NSDC; L1SDC; Data Cache; NST Arch. Register File; ST Arch. Register File]

Page 5:

Motivation

Two orthogonal issues
– microarchitecture support for trace speculation
– control and data speculation techniques
  – prediction of initial and final points
  – prediction of live output values

TSMA
– does not introduce significant misspeculation penalties
– does not impose constraints to build or predict traces

This work focuses on developing effective trace selection schemes for TSMA based on static analysis that uses profiling data

Page 6:

Outline

Trace Selection

Graph Construction

Graph Analysis

Performance Evaluation

Conclusions

Page 7:

Graph Construction

Test input set of the analyzed benchmarks

Abstract data structure is built based on
– control flow graph
– data dependence graph
– predictability of values

Each node represents a static instruction
– type of instruction, number of dynamic executions
– pointers and frequencies to succeeding instructions
– pointers and frequencies to preceding instructions
– predictability of live output values and dead values
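As a rough illustration of the node contents listed above, the following Python sketch builds such a graph from a profiled instruction stream. It is only a sketch under stated assumptions: the names `InstrNode` and `build_graph`, the `(pc, kind)` input format, and the single `value_predictability` field are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InstrNode:
    """One node per static instruction (all field names are hypothetical)."""
    pc: int
    kind: str                          # type of instruction, e.g. "alu", "branch"
    exec_count: int = 0                # number of dynamic executions
    successors: Dict[int, int] = field(default_factory=dict)    # next pc -> frequency
    predecessors: Dict[int, int] = field(default_factory=dict)  # previous pc -> frequency
    value_predictability: float = 0.0  # placeholder; a profiler would fill this in


def build_graph(stream):
    """Build the graph from a profiled stream of (pc, kind) pairs."""
    graph = {}
    prev_pc = None
    for pc, kind in stream:
        node = graph.setdefault(pc, InstrNode(pc, kind))
        node.exec_count += 1
        if prev_pc is not None:
            # Record the dynamic edge prev_pc -> pc in both directions.
            prev = graph[prev_pc]
            prev.successors[pc] = prev.successors.get(pc, 0) + 1
            node.predecessors[prev_pc] = node.predecessors.get(prev_pc, 0) + 1
        prev_pc = pc
    return graph
```

Keeping frequencies on both successor and predecessor edges lets a later pass walk the graph forwards (to extend a trace) or backwards (to find the producer of an operand) without rescanning the profile.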

Page 8:

Graph Analysis

Two important issues
– initial and final point of a trace
  – maximize trace length & minimize control flow misspeculations
– predictability of live output values
  – prediction accuracy and utilization degree

Three basic heuristics
– Procedure Trace Heuristic
– Loop Trace Heuristic
– Instruction Chaining Trace Heuristic

Page 9:

Procedure Trace Heuristic

Procedures are relatively frequent

Computations that follow a subroutine are fairly independent of the subroutine
– except for return values and some memory locations

It is quite easy to predict the end of a trace

Page 10:

Procedure Trace Heuristic

[Figure: example control-flow graph with instructions I1 to I14, a call, a return, and branches with taken (T) and not-taken (NT) edges]

1. The call instruction is marked as the initial point of the trace (I3 in the example).

2. The return address is marked as the final point of the trace (I11 in the example).

3. N instructions after the final point of the trace are checked. Only significant paths are considered (I11, I12, I13, I14 in the example).

4. For each instruction in a significant path, it is checked whether any of its operands is produced by an instruction of the procedure.

5. In this case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated.

6. If it does not achieve a certain threshold, the trace is discarded.
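Steps 3 to 6 of the procedure trace heuristic can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name, the input format, and the idea of passing one precomputed score per producer are assumptions, because the slides do not say exactly how utilization degree and predictability are combined at this point.

```python
def keep_procedure_trace(proc_pcs, following, producer_score, threshold):
    """Decide whether a call-to-return trace is kept (hypothetical helper).

    proc_pcs:       set of PCs that belong to the procedure body
    following:      the N instructions after the return address on significant
                    paths, each given as (pc, [PCs producing its operands])
    producer_score: producer PC -> score in [0, 1]; how utilization degree and
                    predictability are folded into one score is an assumption
                    left to the caller
    threshold:      minimum acceptable score
    """
    for _pc, producers in following:
        for prod in producers:
            # Only operands produced inside the procedure matter: they are
            # the live outputs that the speculation has to predict.
            if prod in proc_pcs and producer_score.get(prod, 0.0) < threshold:
                return False  # discard the trace
    return True
```

Operands produced outside the procedure are ignored, which reflects the observation that computations following a subroutine are mostly independent of it.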

Page 11:

Loop Trace Heuristic

Traditional source of parallelization and speculation

We consider the whole execution of a loop as a trace

The objective is to detect loops whose live-output values after their whole execution are predictable

Page 12:

Loop Trace Heuristic

[Figure: example control-flow graph with instructions I1 to I8, a backward branch, and a branch with taken (T) and not-taken (NT) edges]

1. The backward branch target is marked as the initial point of the trace (I2 in the example).

2. The fall-through instruction of the same backward branch is marked as the final point of the trace (I8 in the example).

3. N instructions after the final point of the trace are checked. Same behaviour as the procedure trace heuristic.
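The choice of end points for loop traces can be sketched as below. The function name and the `(branch_pc, target_pc)` input are hypothetical, and the 4-byte instruction size is an assumption that fits a fixed-width ISA such as Alpha.

```python
def loop_trace_endpoints(branches, instr_size=4):
    """Return (initial_pc, final_pc) for every backward branch.

    branches:   iterable of (branch_pc, taken_target_pc) pairs
    instr_size: bytes per instruction; 4 is an assumption (fixed-width ISA)
    """
    traces = []
    for branch_pc, target_pc in branches:
        if target_pc <= branch_pc:           # backward branch: closes a loop
            initial = target_pc              # first instruction of the loop
            final = branch_pc + instr_size   # fall-through, i.e. the loop exit
            traces.append((initial, final))
    return traces
```

For example, `loop_trace_endpoints([(0x20, 0x08)])` marks `0x08` as the initial point and `0x24` as the final point; forward branches are skipped.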

Page 13:

IChaining Trace Heuristic

Goal: to identify large sequences of dynamic instructions besides procedures and loops

A trace is identified by:
– initial point
– final point
– behaviour of conditional branches within the trace

Page 14:

IChaining Trace Heuristic

[Figure: example control-flow graph with instructions I1 to I12 and three conditional branches with taken (T) and not-taken (NT) edges]

1. Taken and not-taken targets of all conditional branches are considered as initial points of a trace (I2 and I3, I7 and I8, I9 and I10 in the example).

Page 15:

IChaining Trace Heuristic

[Figure: same control-flow graph, with a trace starting at I3 and extended to I5]

2. Given an initial point, a trace is extended by adding successive instructions.

3. Every time a conditional branch is found, the trace is split into two.

Page 16:

IChaining Trace Heuristic

[Figure: same control-flow graph, with the candidate trace I3, I5, I7, I11, I12 highlighted]

Page 17:

IChaining Trace Heuristic

[Figure: same control-flow graph, with the candidate trace I3, I5, I7, I11, I12 highlighted]

Page 18:

IChaining Trace Heuristic

[Figure: same control-flow graph, with the candidate trace I3, I5, I7, I11, I12 highlighted and I12 marked]

4. The final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump.
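Steps 2 to 4 (extend, split at conditional branches, stop at a final point) can be sketched as a small worklist algorithm. The control-flow-graph encoding, the names, and the decision to record the stopping instruction as the final point are assumptions made for illustration only.

```python
def enumerate_traces(cfg, start, max_size):
    """Enumerate candidate traces from one initial point (hypothetical helper).

    cfg: pc -> (kind, [successor pcs]) with kind in {"plain", "cond", "indirect"};
         a conditional branch lists both its taken and not-taken targets.
    Returns a list of (instructions in trace, final point) pairs.
    """
    finished = []
    work = [[start]]
    while work:
        trace = work.pop()
        _kind, succs = cfg[trace[-1]]
        if not succs:                       # nothing follows: close the trace
            finished.append((trace, None))
            continue
        for nxt in succs:                   # two successors: the trace splits
            stop = (nxt in trace                    # already belongs to the trace
                    or len(trace) >= max_size       # maximum size reached
                    or cfg[nxt][0] == "indirect")   # indirect jump
            if stop:
                finished.append((trace, nxt))
            else:
                work.append(trace + [nxt])
    return finished
```

The number of candidates can grow quickly with the number of conditional branches, which is one reason a maximum trace size is useful in a sketch like this.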

Page 19:

IChaining Trace Heuristic

[Figure: same control-flow graph, with the candidate trace I3, I5, I7, I11, I12 highlighted and I12 marked]

5. Live-output values are determined and their predictability is checked for every trace candidate (the higher of prediction accuracy and utilization degree).

6. A trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold.

7. If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size).
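The predictability test and the trimming loop of steps 5 to 7 might look like the sketch below. The function name and the `live_out_scores` callback are hypothetical; the callback is assumed to return one score in [0, 1] per live-output value of the candidate trace, already computed as described in step 5.

```python
from math import prod


def trim_until_predictable(trace, live_out_scores, threshold, min_size):
    """Shorten a candidate trace until its live outputs are predictable.

    trace:           list of PCs, in execution order
    live_out_scores: callable(trace) -> list of per-live-output scores in [0, 1]
                     (hypothetical; the profiler would supply it)
    Returns the accepted trace, or None if it shrinks below min_size.
    """
    candidate = list(trace)
    while len(candidate) >= min_size:
        # Product of the per-value percentages, compared against the threshold.
        if prod(live_out_scores(candidate)) > threshold:
            return candidate
        candidate.pop()  # remove the final instruction and try again
    return None
```

Because the scores are multiplied, a single poorly predictable live output can sink an otherwise good trace, and adding more live outputs can only lower the product.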

Page 20:

Trace Speculation Engine

Traces are communicated to the hardware at program loading time, filling a special hardware structure (trace table)

Each entry of the trace table contains
– initial PC
– final PC
– branch history
– live-output values information
– frequency counter
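A software model of the trace table could look like the following sketch. The field list follows the slide; everything else is an assumption made for illustration: the class names, indexing by word address, the least-frequently-used replacement, and returning every entry that matches an initial PC. The default sizes (128 entries, 4-way) are the ones given in the simulation parameters of this talk.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: int                 # bit i = outcome of the i-th conditional branch
    live_outputs: dict = field(default_factory=dict)  # location -> value information
    frequency: int = 0


class TraceTable:
    """Set-associative table of traces (hypothetical software model)."""

    def __init__(self, entries=128, ways=4):
        self.ways = ways
        self.sets = [[] for _ in range(entries // ways)]

    def _set_for(self, pc):
        # Index by word address; the indexing function is an assumption.
        return self.sets[(pc >> 2) % len(self.sets)]

    def insert(self, entry):
        s = self._set_for(entry.initial_pc)
        if len(s) >= self.ways:
            # Evict the least frequently used entry (policy is an assumption).
            s.remove(min(s, key=lambda e: e.frequency))
        s.append(entry)

    def lookup(self, pc):
        # Several traces may share an initial PC and differ in branch history.
        return [e for e in self._set_for(pc) if e.initial_pc == pc]
```

Returning a list from `lookup` reflects that the instruction chaining heuristic can produce more than one trace with the same initial PC.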

Page 21:

Experimental Framework

Simulator
– Alpha version of the SimpleScalar Toolset

Benchmarks
– Spec2000, ref input

Maximum Optimization Level
– DEC C & F77 compilers with -non_shared -O5

Statistics
– Collected for 250 million instructions
– Skipping an initial part of 500 million

Page 22:

Simulation Parameters

Base microarchitecture
– out of order machine, 4 instructions per cycle
– I cache: 16KB, D cache: 16KB, L2 shared: 256KB
– bimodal predictor

TSMA additional structures
– each thread: I window, reorder buffer, register file
– speculative data cache: 1KB
– verification engine: up to 8 instructions per cycle
– trace table: 128 entries, 4-way set associative
– look ahead buffer: 128 entries

Page 23:

Profiling Analysis Parameters

Value Predictors: Stride & Context

Minimum size of trace: 16

Maximum size of trace: 1024

Maximum number of live-outputs: 32

Threshold to consider a set of live outputs (LO) predictable: 25%

Significant path (minimum frequency): 10%

Page 24:

Type of Speculated Instructions

[Figure: stacked bar chart, 0 % to 100 %, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean). Legend: Loop Heuristic, Procedure Heuristic, Ichaining Heuristic]

Page 25:

Type of Speculated Instructions

The share of procedure and loop traces is relatively low

But their sizes are significantly larger than those of Ichaining traces

Some statistics:
– procedure trace size: 97.3
– loop trace size: 215.8
– Ichaining trace size: 36.4
– average size of speculated traces: 65.7
– average number of live output values: 16.4
– branches within a trace (Ichaining): 5.3
– traces with same initial PC (Ichaining): 1.57

Page 26:

Type of Speculations

[Figure: stacked bar chart, 0 % to 100 %, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean). Legend: Spec KO, Path KO; Spec KO, Path OK; Spec OK, Path KO; Spec OK, Path OK]

Page 27:

Type of Speculations

Correct speculations: up to 70%
– 65% for correctly predicted paths
– 7% for incorrectly predicted paths (positive misprediction)

Incorrect speculations: close to 30%
– 20% for correctly predicted paths
– 8% for incorrectly predicted paths

This confirms that the mechanism proposed to predict paths and final points provides significant accuracy

Page 28:

Speedup

[Figure: bar chart of speedup, y-axis from 1.00 to 1.45, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean)]

Page 29:

Speedup

Average speedup close to 38%

In spite of misspeculating close to 30%

Page 30:

Type of Cycles of ST

[Figure: stacked bar chart, 0 % to 100 %, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean). Legend: ST cannot speculate; ST can speculate]

Page 31:

Type of Cycles of ST

25% of the time ST can speculate but does not find a trace to be speculated
– performance could be improved with further analysis

75% of the time ST cannot speculate because NST is executing and verifying a speculated trace
– speculation may be performed only when NST catches up with ST

Page 32:

Type of Cycles of NST

[Figure: stacked bar chart, 0 % to 100 %, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean). Legend: NST is verifying instructions; NST is executing instructions]

Page 33:

Type of Cycles of NST

65% of the time NST is executing traces speculated by ST
– more speculated instructions imply more time executing instructions

35% of the time NST is verifying instructions from the look ahead buffer
– verifying instructions is faster than executing them

Page 34:

Useless Cycles of ST

[Figure: bar chart, 0 % to 100 %, one bar per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean)]

Page 35:

Useless Cycles of ST

Up to 20% of the time ST is executing instructions beyond the misspeculation point
– ST is wasting up to 20% of the time executing instructions that will be discarded

The ideal scenario would be when this percentage is negligible

Page 36:

Branch Behaviour Distribution

[Figure: chart with x-axis ticks 50, 60, 70, 80, 90 and y-axis from 0 % to 100 %]

Page 37:

Branch Behaviour Distribution

The instruction chaining heuristic does not provide many traces with the same initial point
– despite the significant number of branches within a trace (5.3 on average)

The study concludes that the majority of branches almost always take the same direction

Close to 80% of the branches take the same direction more than 90% of the time
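A statistic of this kind could be computed from profile counts as in the sketch below. The function name and input format are hypothetical, and counting each static branch once (rather than weighting by dynamic executions) is an assumption, since the slides do not say which was used.

```python
def fraction_strongly_biased(branch_counts, bias=0.90):
    """Fraction of executed branches that go one way more than `bias` of the time.

    branch_counts: pc -> (taken_count, not_taken_count) from a profile run
    """
    executed = [(t, n) for t, n in branch_counts.values() if t + n > 0]
    if not executed:
        return 0.0
    # A branch is strongly biased if its dominant direction exceeds `bias`.
    biased = sum(1 for t, n in executed if max(t, n) / (t + n) > bias)
    return biased / len(executed)
```

Strongly biased branches explain why splitting a trace at every conditional branch yields few distinct traces per initial point: one of the two paths is rarely frequent enough to be kept.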

Page 38:

Conclusions

Profile-guided analysis to support TSMA
– identifies large and highly predictable traces
– reduces hardware complexity

Three basic heuristics are proposed
– procedure trace heuristic
– loop trace heuristic
– instruction chaining heuristic

Results show a speedup of 38% with a misprediction rate of 30%

Future work
– aggressive trace level predictors
– generalization to multiple threads

Page 39:

UPC

Questions & Answers

INTERACT-9, San Francisco (USA) - February 13, 2005