Compiler Support for Trace-Level Speculative Multithreaded Architectures

Antonio González λ,ф

Carlos Molina ψ

Jordi Tubella ф

INTERACT-9, San Francisco (USA) - February 13, 2005

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

{antonio,jordit}@ac.upc.edu

ψ Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

Trace Level Speculation

Avoids serialization caused by data dependences

Skips multiple instructions in a row

Predicts values based on the past

Introduces penalties due to misspeculations

[Figure: two variants of trace level speculation, with live output test and with live input test; the diagram includes a buffer.]

Trace Level Speculation with Live Output Test

[Figure: execution timeline of the non-speculative thread (NST) and the speculative thread (ST). Labels: live output update & trace speculation; trace misspeculation detection & recovery actions. Legend: instruction execution, not executed, live output validation.]

TSMA Block Diagram

[Figure: block diagram. Components: I Cache, Fetch Engine, Decode & Rename, Functional Units, Branch Predictor, Trace Speculation Engine, NST Reorder Buffer, ST Reorder Buffer, NST Ld/St Queue, ST Ld/St Queue, NST I Window, ST I Window, Look Ahead Buffer, Verification Engine, L1NSDC, L2NSDC, L1SDC, Data Cache, NST Arch. Register File, ST Arch. Register File.]

Motivation

Two orthogonal issues:

– microarchitecture support for trace speculation

– control and data speculation techniques: prediction of initial and final points, and prediction of live output values

TSMA:

– does not introduce significant misspeculation penalties

– does not impose constraints to build or predict traces

This work focuses on developing effective trace selection schemes for TSMA, based on static analysis that uses profiling data

Outline

Trace Selection

Graph Construction

Graph Analysis

Performance Evaluation

Conclusions

Graph Construction

Profile data is collected with the test input set of the analyzed benchmarks

An abstract data structure is built based on:

– control flow graph

– data dependence graph

– predictability of values

Each node represents one static instruction and records:

– type of instruction and number of dynamic executions

– pointers and frequencies to succeeding instructions

– pointers and frequencies to preceding instructions

– predictability of live output values and dead values
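The node layout described above could be sketched as follows. This is only an illustration under assumptions: the class name `InstructionNode`, its field names, and the helper `record_transition` are hypothetical and do not come from the slides, and predictability is reduced to a single number per instruction.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InstructionNode:
    """One node per static instruction (all names here are hypothetical)."""
    pc: int
    kind: str                      # type of instruction, e.g. "alu", "call", "branch"
    executions: int = 0            # number of dynamic executions in the profile
    successors: Dict[int, int] = field(default_factory=dict)    # successor pc -> count
    predecessors: Dict[int, int] = field(default_factory=dict)  # predecessor pc -> count
    value_predictability: float = 0.0  # assumed: fraction of correctly predicted values

    def successor_frequency(self, succ_pc: int) -> float:
        """Fraction of profiled executions that continued at succ_pc."""
        total = sum(self.successors.values())
        return self.successors.get(succ_pc, 0) / total if total else 0.0


def record_transition(graph: Dict[int, InstructionNode], src_pc: int, dst_pc: int) -> None:
    """Update both edge directions for one profiled transition src -> dst."""
    src, dst = graph[src_pc], graph[dst_pc]
    src.executions += 1
    src.successors[dst_pc] = src.successors.get(dst_pc, 0) + 1
    dst.predecessors[src_pc] = dst.predecessors.get(src_pc, 0) + 1
```

Keeping both successor and predecessor counts lets the analysis walk forward from a candidate initial point and backward from a consumer to its producers.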

Graph Analysis

Two important issues:

– initial and final point of a trace: maximize trace length & minimize control flow misspeculations

– predictability of live output values: prediction accuracy and utilization degree

Three basic heuristics:

– Procedure Trace Heuristic

– Loop Trace Heuristic

– Instruction Chaining Trace Heuristic

Procedure Trace Heuristic

Procedures are relatively frequent

Computations that follow a subroutine are fairly independent of the subroutine, except for return values and some memory locations

It is quite easy to predict the end of a trace

[Figure: example control flow graph with instructions I1 to I14, a call, a return, and conditional branches with taken (T) and not-taken (NT) edges.]

1. The call instruction is marked as the initial point of the trace (I3 in the figure).

2. The return address is marked as the final point of the trace (I11 in the figure).

3. N instructions after the final point of the trace are checked. Only significant paths are considered (I11, I12, I13 and I14 in the figure).

4. For each instruction in a significant path, it is checked whether any of its operands are produced by any instruction of the procedure.

5. In this case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated.

6. If it does not achieve a certain threshold, the trace is discarded.
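Steps 3 to 6 can be pictured with a short sketch. It rests on assumptions that the slides do not state: the profile is a plain dictionary of successor counts, the final point counts as the first of the N instructions, and each procedure-produced value has already been reduced to one score between 0 and 1. The function names are hypothetical.

```python
def instructions_after(successors, final_pc, n, min_freq=0.10):
    """Collect instructions within n steps of final_pc, following only
    significant paths (edges taken at least min_freq of the time).

    successors maps pc -> {successor pc: profiled count} (assumed layout).
    """
    seen = set()
    frontier = [final_pc]
    for _ in range(n):
        next_frontier = []
        for pc in frontier:
            if pc in seen:
                continue
            seen.add(pc)
            counts = successors.get(pc, {})
            total = sum(counts.values())
            for succ, count in counts.items():
                if total and count / total >= min_freq:
                    next_frontier.append(succ)
        frontier = next_frontier
    return seen


def keep_trace(scores, threshold):
    """Discard the trace if any consumed value scores below the threshold.

    How utilization degree and predictability combine into one score is
    an assumption left to the caller.
    """
    return all(score >= threshold for score in scores)
```

The 10% default for `min_freq` mirrors the minimum frequency of a significant path listed in the profiling parameters.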

Loop Trace Heuristic

Traditional source of parallelization and speculation

We consider the whole execution of a loop as a trace

The objective is to detect loops whose live-output values after their whole execution are predictable

[Figure: example control flow graph with instructions I1 to I8, a backward branch, and a conditional branch with taken (T) and not-taken (NT) edges.]

1. The backward branch target is marked as the initial point of the trace (I2 in the figure).

2. The fall-through instruction of the same backward branch is marked as the final point of the trace (I8 in the figure).

3. N instructions after the final point of the trace are checked. Same behaviour as the procedure trace heuristic.
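Steps 1 and 2 amount to a simple address calculation. The sketch below assumes fixed 4-byte instructions, which holds for the Alpha ISA used in the evaluation; the function name `loop_trace_bounds` is hypothetical.

```python
INSTR_BYTES = 4  # Alpha instructions are fixed-width, 32 bits each


def loop_trace_bounds(branch_pc, target_pc):
    """Return (initial_pc, final_pc) for a loop trace, or None if the
    branch is not a backward branch.

    The initial point is the backward branch target; the final point is
    the fall-through instruction that follows the branch.
    """
    if target_pc > branch_pc:
        return None  # forward branch: does not close a loop
    return target_pc, branch_pc + INSTR_BYTES
```

For example, a branch at `0x120` that jumps back to `0x100` yields a trace that starts at `0x100` and ends at `0x124`.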

Ichaining Trace Heuristic

Goal: to identify large sequences of dynamic instructions besides procedures and loops

A trace is identified by:

– initial point

– final point

– behaviour of conditional branches within the trace

[Figure: example control flow graph with instructions I1 to I12 and three conditional branches with taken (T) and not-taken (NT) edges; the example trace I3, I5, I7, I11, I12 is highlighted.]

1. Taken and not-taken targets of all conditional branches are considered as initial points of a trace (I2 and I3, I7 and I8, I9 and I10 in the figure).

2. Given an initial point, a trace is extended by adding successive instructions (I3, then I5 in the figure).

3. Every time a conditional branch is found, the trace is split into two.

4. The final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump (I12 in the figure).

5. Live-output values are determined and their predictability is checked for every trace candidate (highest between prediction accuracy and utilization degree).

6. A trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold.

7. If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size).
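Steps 5 to 7 describe a trimming loop, sketched below under assumptions. The per-value score follows the slide's wording literally (the highest of prediction accuracy and utilization degree, both taken as fractions between 0 and 1). The callback `live_out_scores` is a hypothetical stand-in for the real live-output analysis. The default threshold and minimum size are the values listed in the profiling parameters.

```python
from math import prod


def live_output_score(accuracy, utilization):
    """Score of one live-output value, read literally from the slide."""
    return max(accuracy, utilization)


def trim_until_predictable(trace, live_out_scores, threshold=0.25, min_size=16):
    """Shorten a candidate trace until its live outputs are predictable.

    trace is a list of instruction addresses; live_out_scores(trace)
    returns one score per live-output value of that trace (hypothetical
    callback). Returns the accepted trace, or None if the candidate
    drops below min_size without passing the threshold.
    """
    candidate = list(trace)
    while len(candidate) >= min_size:
        if prod(live_out_scores(candidate)) > threshold:
            return candidate
        candidate.pop()  # remove the final instruction and try again
    return None
```

Multiplying the scores penalizes traces with many live outputs, so shortening a trace can raise the product even when each remaining value is no easier to predict.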

Trace Speculation Engine

Traces are communicated to the hardware at program loading time, filling a special hardware structure (trace table)

Each entry of the trace table contains:

– initial PC

– final PC

– branch history

– live-output values information

– frequency counter
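A software model of such a table might look like the sketch below. The entry fields follow the list above, and the default geometry (128 entries, 4-way set associative) is the one given in the simulation parameters. Everything else is an assumption for illustration: the class and method names, the index function, and the replacement policy (evict the entry with the lowest frequency counter) are not described in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: Tuple[bool, ...]  # outcome of each conditional branch in the trace
    live_outputs: Dict[str, int] = field(default_factory=dict)  # assumed encoding
    frequency: int = 0


class TraceTable:
    """Set-associative table indexed by initial PC (hypothetical model)."""

    def __init__(self, entries: int = 128, ways: int = 4):
        self.ways = ways
        self.sets: List[List[TraceEntry]] = [[] for _ in range(entries // ways)]

    def _set_for(self, pc: int) -> List[TraceEntry]:
        # Assumed index function: drop the 2 alignment bits, then take the modulo.
        return self.sets[(pc >> 2) % len(self.sets)]

    def insert(self, entry: TraceEntry) -> None:
        ways = self._set_for(entry.initial_pc)
        if len(ways) >= self.ways:
            # Assumed policy: evict the least frequently used trace.
            ways.remove(min(ways, key=lambda e: e.frequency))
        ways.append(entry)

    def lookup(self, pc: int) -> List[TraceEntry]:
        """Return every trace that starts at pc (there may be several)."""
        return [e for e in self._set_for(pc) if e.initial_pc == pc]
```

Returning a list from `lookup` reflects that Ichaining traces can share an initial PC and differ only in branch history.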

Experimental Framework

Simulator: Alpha version of the SimpleScalar Toolset

Benchmarks: Spec2000, ref input

Maximum optimization level: DEC C & F77 compilers with -non_shared -O5

Statistics: collected for 250 million instructions, skipping an initial part of 500 million

Simulation Parameters

Base microarchitecture:

– out of order machine, 4 instructions per cycle

– I cache: 16KB, D cache: 16KB, L2 shared: 256KB

– bimodal predictor

TSMA additional structures:

– each thread: I window, reorder buffer, register file

– speculative data cache: 1KB

– verification engine: up to 8 instructions per cycle

– trace table: 128 entries, 4-way set associative

– look ahead buffer: 128 entries

Profiling Analysis Parameters

Value Predictors: Stride & Context

Minimum size of trace: 16

Maximum size of trace: 1024

Maximum number of live-outputs: 32

Threshold to consider a set of LO predictable: 25%

Significant path (minimum frequency): 10%

Type of Speculated Instructions

[Figure: bar chart per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 0% to 100%. Legend: Loop Heuristic, Procedure Heuristic, Ichaining Heuristic.]

The share of procedure and loop traces is relatively low

But their sizes are significantly larger than those of Ichaining traces

Some statistics:

– procedure trace size: 97.3

– loop trace size: 215.8

– Ichaining trace size: 36.4

– average size of speculated traces: 65.7

– average number of live output values: 16.4

– branches within a trace (Ichaining): 5.3

– traces with same initial PC (Ichaining): 1.57

Type of Speculations

[Figure: bar chart per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 0% to 100%. Legend: Spec KO, Path KO; Spec KO, Path OK; Spec OK, Path KO; Spec OK, Path OK.]

Correct speculations: up to 70%

– 65% for correctly predicted paths

– 7% for incorrectly predicted paths (positive misprediction)

Incorrect speculations: close to 30%

– 20% for correctly predicted paths

– 8% for incorrectly predicted paths

This confirms that the mechanism proposed to predict paths and final points provides significant accuracy

Speedup

[Figure: speedup per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 1.00 to 1.45.]

Average speedup close to 38%

This is achieved in spite of a misspeculation rate close to 30%

Type of Cycles of ST

[Figure: bar chart per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 0% to 100%. Legend: ST can not speculate, ST can speculate.]

25% of the time ST can speculate but does not find a trace to be speculated

– performance could be improved with further analysis

75% of the time ST cannot speculate because NST is executing and verifying a speculated trace

– speculation may be performed only when NST catches up with ST

Type of Cycles of NST

[Figure: bar chart per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 0% to 100%. Legend: NST is verifying instructions, NST is executing instructions.]

65% of the time NST is executing traces speculated by ST

– more speculated instructions imply more time executing instructions

35% of the time NST is verifying instructions from the look ahead buffer

– verifying instructions is faster than executing them

Useless Cycles of ST

[Figure: bar chart per benchmark (Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean); y-axis from 0% to 100%.]

Up to 20% of the time ST is executing instructions beyond the misspeculation point

– ST is wasting up to 20% of the time executing instructions that will be discarded

The ideal scenario would be when this percentage is negligible

Branch Behaviour Distribution

[Figure: branch behaviour distribution; x-axis values 50, 60, 70, 80, 90; y-axis from 0% to 100%.]

The instruction chaining heuristic does not provide many traces with the same initial point

– despite the significant number of branches within a trace (5.3 on average)

The study concludes that the majority of branches almost always take the same direction

Close to 80% of the branches take the same direction more than 90% of the time
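The statistic quoted above can be computed from per-branch taken and not-taken counts. The sketch below assumes that "takes the same direction" means the share of the more frequent direction; the function names and the input layout are hypothetical.

```python
def branch_bias(taken, not_taken):
    """Share of executions that went in the branch's dominant direction."""
    total = taken + not_taken
    return max(taken, not_taken) / total if total else 0.0


def share_of_biased_branches(counts, bias_threshold=0.90):
    """Fraction of branches whose bias exceeds bias_threshold.

    counts is a list of (taken, not_taken) pairs, one per static branch.
    """
    if not counts:
        return 0.0
    biased = sum(branch_bias(t, n) > bias_threshold for t, n in counts)
    return biased / len(counts)
```

A strongly biased branch rarely forces the heuristic to keep both split traces, which is consistent with the low number of traces per initial point.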

Conclusions

Profile guided analysis to support TSMA

– identifies large and highly predictable traces

– reduces hardware complexity

Three basic heuristics are proposed:

– procedure trace heuristic

– loop trace heuristic

– instruction chaining heuristic

Results show a speedup of 38% with a misprediction rate of 30%

Future work:

– aggressive trace level predictors

– generalization to multiple threads


Questions & Answers
