Compiler Support for Trace-Level Speculative Multithreaded Architectures
Antonio González λ,ф
Carlos Molina ψ
Jordi Tubella ф
INTERACT-9, San Francisco (USA) - February 13, 2005
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
{antonio,jordit}@ac.upc.edu
ψ Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
Trace Level Speculation
Avoids serialization caused by data dependences
Skips multiple instructions in a row
Predicts values based on the past
Introduces penalties due to misspeculations
Trace Level Speculation
  with Live Input Test
  with Live Output Test
[Figure: the two variants of trace level speculation; the only other surviving label is BUFFER]
Trace Level Speculation with Live Output Test
[Figure. Labels: NST; ST; Live Output Update & Trace Speculation; Trace Miss Speculation Detection & Recovery Actions. Legend: instruction execution; not executed; live output validation]
TSMA Block Diagram
[Figure: block diagram. Components: I Cache, Fetch Engine, Decode & Rename, Functional Units, Branch Predictor, Trace Speculation Engine, NST Reorder Buffer, ST Reorder Buffer, NST Ld/St Queue, ST Ld/St Queue, NST I Window, ST I Window, Look Ahead Buffer, Verification Engine, L1NSDC, L2NSDC, L1SDC, Data Cache, NST Arch. Register File, ST Arch. Register File]
Motivation
Two orthogonal issues
  microarchitecture support for trace speculation
  control and data speculation techniques
    – prediction of initial and final points
    – prediction of live output values
TSMA
  does not introduce significant misspeculation penalties
  does not impose constraints to build or predict traces
This work focuses on developing effective trace selection schemes for TSMA, based on static analysis that uses profiling data
Outline
Trace Selection
Graph Construction
Graph Analysis
Performance Evaluation
Conclusions
Graph Construction
Test input set of the analyzed benchmarks
Abstract data structure is built based on
  control flow graph
  data dependences graph
  predictability of values
Each node represents a static instruction
  type of instruction, number of dynamic executions
  pointers and frequencies to succeeding instructions
  pointers and frequencies to preceding instructions
  predictability of live output values and dead values
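
A minimal sketch of such a node and of how a profiled instruction stream could populate it. The names InstrNode, build_graph and successor_frequency, and every field name, are illustrative assumptions rather than the authors' data structure; predictability is left as a placeholder that a value-prediction profile would fill in.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class InstrNode:
    """One node per static instruction (all field names are illustrative)."""
    pc: int
    kind: str = "other"                                  # type of instruction
    exec_count: int = 0                                  # number of dynamic executions
    succ: Dict[int, int] = field(default_factory=dict)   # successor pc -> times followed
    pred: Dict[int, int] = field(default_factory=dict)   # predecessor pc -> times preceded
    predictability: float = 0.0                          # placeholder for profiled value predictability


def build_graph(trace: Iterable[int], kinds: Dict[int, str]) -> Dict[int, InstrNode]:
    """Build the graph from a dynamic stream of program counters."""
    graph: Dict[int, InstrNode] = {}
    prev = None
    for pc in trace:
        if pc not in graph:
            graph[pc] = InstrNode(pc, kinds.get(pc, "other"))
        node = graph[pc]
        node.exec_count += 1
        if prev is not None:
            # Record the edge in both directions, with its frequency.
            graph[prev].succ[pc] = graph[prev].succ.get(pc, 0) + 1
            node.pred[prev] = node.pred.get(prev, 0) + 1
        prev = pc
    return graph


def successor_frequency(node: InstrNode, succ_pc: int) -> float:
    """Relative frequency with which succ_pc follows this instruction."""
    total = sum(node.succ.values())
    return node.succ.get(succ_pc, 0) / total if total else 0.0


graph = build_graph([1, 2, 3, 1, 2, 4], {2: "branch"})
print(successor_frequency(graph[2], 3))  # 0.5
```

Keeping frequencies on both successor and predecessor edges lets the later analysis walk forward from a candidate final point and backward from a consumer to its producer.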
Graph Analysis
Two important issues
  initial and final point of a trace
    – maximize trace length & minimize control flow misspeculations
  predictability of live output values
    – prediction accuracy and utilization degree
Three basic heuristics
  Procedure Trace Heuristic
  Loop Trace Heuristic
  Instruction Chaining Trace Heuristic
Procedure Trace Heuristic
Procedures are relatively frequent
Computations that follow a subroutine are fairly independent of the subroutine
  except for return values and some memory locations
The end of a trace is quite easy to predict
[Figure: control flow graph with instructions I1 to I14, including a Call, a Return and conditional branches with taken (T) and not-taken (NT) edges]
1. Call instruction is marked as initial point of the trace (highlighted: I3)
2. Return address is marked as final point of the trace (highlighted: I11)
3. N instructions after the final point of the trace are checked; only significant paths are considered (highlighted: I11, I12, I13, I14)
4. For each instruction in a significant path, it is checked whether any of its operands is produced by an instruction of the procedure
5. In this case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated
6. If they do not achieve a certain threshold, the trace is discarded
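
The "significant paths" check on the N instructions after the final point can be sketched as follows. This is only an illustration: the function name significant_paths, the successor-count dictionary, and the reading of "significant" as a per-edge relative frequency are assumptions; the 10% default is taken from the minimum-frequency value listed under Profiling Analysis Parameters.

```python
from typing import Dict, List


def significant_paths(succ: Dict[int, Dict[int, int]], start: int, n: int,
                      min_freq: float = 0.10) -> List[List[int]]:
    """Enumerate paths of up to n instructions beginning at `start`,
    following only edges whose relative frequency is at least min_freq."""
    paths: List[List[int]] = []

    def walk(pc: int, prefix: List[int]) -> None:
        path = prefix + [pc]
        edges = succ.get(pc, {})
        total = sum(edges.values())
        nxt = [s for s, count in edges.items() if total and count / total >= min_freq]
        if len(path) == n or not nxt:
            paths.append(path)        # path is long enough or has no significant successor
            return
        for s in nxt:
            walk(s, path)

    walk(start, [])
    return paths


# Hypothetical profile: after instruction 12, the edge to 14 is followed only 5% of the time.
profile = {11: {12: 100}, 12: {13: 95, 14: 5}}
print(significant_paths(profile, start=11, n=3))  # [[11, 12, 13]]
```

Each instruction on the returned paths would then be checked for operands produced inside the procedure, as in steps 4 to 6.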
Loop Trace Heuristic
Traditional source of parallelization and speculation
We consider the whole execution of a loop as a trace
The objective is to detect loops whose live-output values after their whole execution are predictable
[Figure: control flow graph with instructions I1 to I8, including a conditional branch and a backward branch with taken (T) and not-taken (NT) edges]
1. Backward branch target is marked as initial point of the trace (highlighted: I2)
2. Fall-through instruction of the same backward branch is marked as final point of the trace (highlighted: I8)
3. N instructions after the final point of the trace are checked; same behaviour as procedure trace heuristic
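
Steps 1 and 2 amount to a simple address computation, sketched below. The function name loop_trace_points is hypothetical, and treating a branch as backward when its target is not above its own address is an assumption; the 4-byte instruction size matches the Alpha ISA used in the evaluation.

```python
from typing import Optional, Tuple


def loop_trace_points(branch_pc: int, target_pc: int,
                      instr_size: int = 4) -> Optional[Tuple[int, int]]:
    """Return (initial point, final point) for a backward branch, else None."""
    if target_pc > branch_pc:
        return None                       # forward branch: not a loop candidate
    initial_point = target_pc             # loop entry: target of the backward branch
    final_point = branch_pc + instr_size  # fall-through instruction after the loop
    return initial_point, final_point


print(loop_trace_points(0x120, 0x100) == (0x100, 0x124))  # True
```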
Ichaining Trace Heuristic
Goal: to identify large sequences of dynamic instructions besides procedures and loops
A trace is identified by
  initial point
  final point
  behaviour of conditional branches within the trace
[Figure, repeated on each animation step: control flow graph with instructions I1 to I12 and three conditional branches with taken (T) and not-taken (NT) edges; the example trace highlighted is I3, I5, I7, I11, I12]
1. Taken and not-taken targets of all conditional branches are considered as initial points of a trace (highlighted: I2, I3, I7, I8, I9, I10)
2. Given an initial point, a trace is extended by adding successive instructions (highlighted: I3, I5)
3. Every time a conditional branch is found, the trace is split into two
4. Final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump (highlighted: I12)
5. Live-output values are determined and their predictability is checked for every trace candidate (highest between prediction accuracy and utilization degree)
6. Trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold
7. If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size)
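
Steps 5 to 7 can be sketched as a shrink-until-predictable loop. This is a hedged illustration, not the authors' implementation: the function names, the live_outs_of callback and the toy profile are assumptions; the per-value score takes "highest between prediction accuracy and utilization degree" literally as max(); the defaults reuse the 25% threshold and minimum size of 16 listed under Profiling Analysis Parameters.

```python
from math import prod
from typing import Callable, List, Optional, Sequence, Tuple

LiveOut = Tuple[float, float]  # (prediction accuracy, utilization degree), each in [0, 1]


def trace_score(live_outs: Sequence[LiveOut]) -> float:
    """Product over all live outputs of max(accuracy, utilization)."""
    return prod(max(acc, util) for acc, util in live_outs)


def shrink_until_predictable(trace: Sequence[int],
                             live_outs_of: Callable[[List[int]], Sequence[LiveOut]],
                             threshold: float = 0.25,
                             min_size: int = 16) -> Optional[List[int]]:
    """Drop final instructions until the trace is predictable or too short."""
    candidate = list(trace)
    while len(candidate) >= min_size:
        if trace_score(live_outs_of(candidate)) > threshold:
            return candidate
        candidate.pop()                   # remove the final instruction and retry
    return None                           # no predictable trace of at least min_size


# Toy profile (hypothetical): instructions 3, 5 and 12 produce live outputs.
profile = {3: (0.9, 0.2), 5: (0.8, 0.1), 12: (0.3, 0.2)}
live = lambda t: [profile[i] for i in t if i in profile]
print(shrink_until_predictable([3, 5, 7, 11, 12], live, min_size=2))  # [3, 5, 7, 11]
```

In the toy run the poorly predicted value produced by instruction 12 pulls the product below 25%, so the trace is cut back by one instruction. A real analysis would recompute the live-output set from data dependences each time the trace shrinks, which is why it is passed in as a callback here.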
Trace Speculation Engine
Traces are communicated to the hardware at program loading time
  filling a special hardware structure (trace table)
Each entry of the trace table contains
  initial PC
  final PC
  branch history
  live-output values information
  frequency counter
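
A rough software model of such a table is sketched below, using the 128-entry, 4-way set-associative organization listed under Simulation Parameters. The slides do not specify field encodings, the index function or the replacement policy, so the class names, the string branch history, the (pc >> 2) index and the evict-lowest-frequency policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: str                                        # e.g. "TNT" (assumed encoding)
    live_outputs: Dict[str, str] = field(default_factory=dict)  # location -> predictor info (assumed)
    frequency: int = 0                                          # frequency counter


class TraceTable:
    def __init__(self, entries: int = 128, ways: int = 4) -> None:
        self.ways = ways
        self.sets: List[List[TraceEntry]] = [[] for _ in range(entries // ways)]

    def _set_for(self, pc: int) -> List[TraceEntry]:
        return self.sets[(pc >> 2) % len(self.sets)]           # assumed index function

    def insert(self, entry: TraceEntry) -> None:
        ways = self._set_for(entry.initial_pc)
        if len(ways) == self.ways:
            ways.remove(min(ways, key=lambda e: e.frequency))  # assumed replacement policy
        ways.append(entry)

    def lookup(self, pc: int) -> List[TraceEntry]:
        """All traces starting at pc; several may share one initial PC."""
        return [e for e in self._set_for(pc) if e.initial_pc == pc]


table = TraceTable()
table.insert(TraceEntry(0x1000, 0x1040, "TN", frequency=5))
table.insert(TraceEntry(0x1000, 0x1080, "TT", frequency=9))
print(len(table.lookup(0x1000)))  # 2
```

Returning a list from lookup reflects that traces with the same initial PC are told apart by their branch history.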
Experimental Framework
Simulator
  Alpha version of the SimpleScalar Toolset
Benchmarks
  Spec2000, ref input
Maximum Optimization Level
  DEC C & F77 compilers with -non_shared -O5
Statistics
  Collected for 250 million instructions
  Skipping an initial part of 500 million
Simulation Parameters
Base microarchitecture
  out of order machine, 4 instructions per cycle
  I cache: 16KB, D cache: 16KB, L2 shared: 256KB
  bimodal predictor
TSMA additional structures
  each thread: I window, reorder buffer, register file
  speculative data cache: 1KB
  verification engine: up to 8 instructions per cycle
  trace table: 128 entries, 4-way set associative
  look ahead buffer: 128 entries
Profiling Analysis Parameters
Value Predictors: Stride & Context
Minimum size of trace: 16
Maximum size of trace: 1024
Maximum number of live-outputs: 32
Threshold to consider a set of live outputs (LO) predictable: 25%
Significant path (minimum frequency): 10%
Type of Speculated Instructions
[Figure: breakdown of speculated instructions (0% to 100%) per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean. Legend: Loop Heuristic, Procedure Heuristic, Ichaining Heuristic]
The share of instructions speculated by procedure and loop traces is relatively low
  but their sizes are significantly larger than those of Ichaining traces
Some statistics
  procedure trace size: 97.3
  loop trace size: 215.8
  Ichaining trace size: 36.4
  average size of speculated traces: 65.7
  average number of live output values: 16.4
  branches within a trace (Ichaining): 5.3
  traces with same initial PC (Ichaining): 1.57
Type of Speculations
[Figure: breakdown of speculations (0% to 100%) per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean. Legend: Spec KO, Path KO; Spec KO, Path OK; Spec OK, Path KO; Spec OK, Path OK]
Correct speculations: up to 70%
  65% for correctly predicted paths
  7% for incorrectly predicted paths (positive misprediction)
Incorrect speculations: close to 30%
  20% for correctly predicted paths
  8% for incorrectly predicted paths
This confirms that the mechanism proposed to predict paths and final points provides significant accuracy
Speedup
[Figure: speedup per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean; y-axis from 1.00 to 1.45]
Average speedup close to 38%
  in spite of misspeculating close to 30%
Type of Cycles of ST
[Figure: breakdown of ST cycles (0% to 100%) per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean. Legend: ST cannot speculate; ST can speculate]
25% of the time ST can speculate but does not find a trace to be speculated
  performance could be improved with further analysis
75% of the time ST cannot speculate because NST is executing and verifying a speculated trace
  speculation may be performed only when NST catches up with ST
Type of Cycles of NST
[Figure: breakdown of NST cycles (0% to 100%) per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean. Legend: NST is verifying instructions; NST is executing instructions]
65% of the time NST is executing traces speculated by ST
  more speculated instructions imply more time executing instructions
35% of the time NST is verifying instructions from the look ahead buffer
  verifying instructions is faster than executing them
Useless Cycles of ST
[Figure: useless ST cycles (0% to 100%) per benchmark: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean]
Up to 20% of the time ST is executing instructions beyond the misspeculation point
  ST is wasting up to 20% of the time executing instructions that will be discarded
The ideal scenario would be for this percentage to be negligible
Branch Behaviour Distribution
[Figure: distribution of branches (0% to 100%) over the x-axis values 50, 60, 70, 80, 90]
Instruction chaining heuristic does not provide many traces with the same initial point
  despite the significant number of branches within a trace (5.3 on average)
The study concludes that the majority of branches almost always take the same direction
  close to 80% of the branches take the same direction more than 90% of the time
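
A statistic of this kind can be derived from per-branch taken and not-taken counts, as in the sketch below. The function names and the sample profile are made up for illustration, and counting static branches (rather than weighting by dynamic executions) is an assumption, since the slides do not say which was measured.

```python
from typing import Dict, Tuple


def direction_bias(taken: int, not_taken: int) -> float:
    """Fraction of executions that go in the branch's dominant direction."""
    return max(taken, not_taken) / (taken + not_taken)


def fraction_strongly_biased(counts: Dict[int, Tuple[int, int]],
                             cutoff: float = 0.90) -> float:
    """Share of executed branches whose dominant direction exceeds the cutoff."""
    executed = [(t, nt) for t, nt in counts.values() if t + nt > 0]
    if not executed:
        return 0.0
    strong = sum(1 for t, nt in executed if direction_bias(t, nt) > cutoff)
    return strong / len(executed)


# Hypothetical profile: branch pc -> (taken count, not-taken count)
profile = {0x10: (95, 5), 0x20: (50, 50), 0x30: (1, 99), 0x40: (0, 0), 0x50: (80, 20)}
print(fraction_strongly_biased(profile))  # 0.5
```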
Conclusions
Profile guided analysis to support TSMA
  identifies large and highly predictable traces
  reducing hardware complexity
Three basic heuristics are proposed
  procedure trace heuristic
  loop trace heuristic
  instruction chaining heuristic
Results show a speedup of 38% with a 30% misprediction rate
Future work
  aggressive trace level predictors
  generalization to multiple threads
Questions & Answers