Compiler Support for Trace-Level Speculative

Multithreaded Architectures

Antonio González λ,ф

Carlos Molina ψ

Jordi Tubella ф

INTERACT-9, San Francisco (USA) - February 13, 2005

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

antoniox.gonzalez@intel.com

ф Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

{antonio,jordit}@ac.upc.edu

ψ Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

carlos.molina@urv.net

Trace Level Speculation

Avoids serialization caused by data dependences

Skips multiple instructions in a row

Predicts values based on the past

Introduces penalties due to misspeculations

[Figure: Trace Level Speculation with Live Output Test and with Live Input Test; diagram label: BUFFER]

Trace Level Speculation with Live Output Test

[Figure: phases labeled Live Output Update & Trace Speculation and Trace Miss Speculation Detection & Recovery Actions; legend: Instruction Execution, Not Executed, Live Output Validation]

TSMA Block Diagram

[Figure: block diagram with the following components: I Cache, Fetch Engine, Decode & Rename, Functional Units, Branch Predictor, Trace Speculation Engine, Verification Engine, Look Ahead Buffer, NST and ST Reorder Buffers, NST and ST Ld/St Queues, NST and ST I Windows, NST and ST Arch. Register Files, L1NSDC, L2NSDC, L1SDC, Data Cache]

Motivation

Two orthogonal issues

– microarchitecture support for trace speculation
– control and data speculation techniques
    – prediction of initial and final points
    – prediction of live output values

TSMA

– does not introduce significant misspeculation penalties
– does not impose constraints to build or predict traces

This work focuses on developing effective trace selection schemes for TSMA based on static analysis that uses profiling data

Outline

Trace Selection

Graph Construction

Graph Analysis

Performance Evaluation

Conclusions

Graph Construction

Profiling uses the test input set of the analyzed benchmarks

An abstract data structure is built based on

– the control flow graph
– the data dependence graph
– the predictability of values

Each node represents one static instruction and records

– type of instruction and number of dynamic executions
– pointers and frequencies to succeeding instructions
– pointers and frequencies to preceding instructions
– predictability of live output values and dead values
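
The node contents listed above could be held in a structure like the following. This is only a sketch: the class name InstructionNode, its field names, and the helpers record_edge and successor_frequency are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InstructionNode:
    """One node per static instruction (all names here are hypothetical)."""
    pc: int
    kind: str                     # type of instruction, e.g. "load", "call"
    dyn_count: int = 0            # number of dynamic executions while profiling
    successors: Dict[int, int] = field(default_factory=dict)    # next PC -> count
    predecessors: Dict[int, int] = field(default_factory=dict)  # prev PC -> count
    output_predictability: float = 0.0  # fraction of produced values predicted


def record_edge(graph: Dict[int, InstructionNode], src: int, dst: int) -> None:
    """Count one dynamic transfer of control from src to dst."""
    graph[src].successors[dst] = graph[src].successors.get(dst, 0) + 1
    graph[dst].predecessors[src] = graph[dst].predecessors.get(src, 0) + 1


def successor_frequency(node: InstructionNode, dst: int) -> float:
    """Relative frequency with which dst follows this instruction."""
    total = sum(node.successors.values())
    return node.successors.get(dst, 0) / total if total else 0.0
```

Storing counts on both ends of each edge lets later passes walk the graph forwards (to extend a trace) or backwards (to find producers) without a second profiling run.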

Graph Analysis

Two important issues

– initial and final point of a trace
    – maximize trace length & minimize control flow misspeculations
– predictability of live output values
    – prediction accuracy and utilization degree

Three basic heuristics

– Procedure Trace Heuristic
– Loop Trace Heuristic
– Instruction Chaining Trace Heuristic

Procedure Trace Heuristic

Procedures are relatively frequent

Computations that follow a subroutine are fairly independent of the subroutine, except for return values and some memory locations

The end of a trace is quite easy to predict

[Figure: control flow example with Call, Branch, and Return instructions and nodes labeled I13 and I14]

The call instruction is marked as the initial point of the trace.

The return address is marked as the final point of the trace.

N instructions after the final point of the trace are checked. Only significant paths are considered.

For each instruction in a significant path, the analysis checks whether any of its operands is produced by an instruction of the procedure.

In that case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated.

If they do not reach a certain threshold, the trace is discarded.
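
The check described above might look roughly like this. It is a sketch under stated assumptions: Instr, procedure_trace_ok, and the score table (one number per producer instruction, already combining utilization degree and predictability) are hypothetical, and the rule that a location overwritten after the return stops counting as a procedure value is an added assumption.

```python
from typing import Dict, List, NamedTuple, Set


class Instr(NamedTuple):
    pc: int
    sources: Set[str]   # locations read (registers or memory names)
    dest: str           # location written, "" if none


def procedure_trace_ok(body: List[Instr], after_return: List[Instr],
                       score: Dict[int, float], threshold: float, n: int) -> bool:
    """Keep the call-to-return trace only if every procedure value that is
    consumed within n instructions after the return scores >= threshold."""
    # Last producer inside the procedure for each written location.
    produced = {i.dest: i.pc for i in body if i.dest}
    for instr in after_return[:n]:
        for src in instr.sources:
            if src in produced and score[produced[src]] < threshold:
                return False          # consumed value is not predictable enough
        # Assumption: once overwritten, the location no longer holds a
        # value produced by the procedure.
        produced.pop(instr.dest, None)
    return True
```

Here after_return stands for one significant path; a caller would repeat the check for each path whose frequency passes the significance cutoff.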

Loop Trace Heuristic

Traditional source of parallelization and speculation

We consider the whole execution of a loop as a trace

The objective is to detect loops whose live-output values after their whole execution are predictable

[Figure: control flow example with a Backward Branch and a Branch]

The backward branch target is marked as the initial point of the trace.

The fall-through instruction of the same backward branch is marked as the final point of the trace.

N instructions after the final point of the trace are checked, with the same behaviour as in the procedure trace heuristic.
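
Marking the two points of a loop trace can be illustrated as below. The names Branch and loop_trace_points are hypothetical, and the fixed 4-byte instruction size is an assumption that matches the Alpha ISA used in the evaluation.

```python
from typing import List, NamedTuple, Tuple

INSTR_BYTES = 4  # assumption: fixed-size 4-byte instructions, as on Alpha


class Branch(NamedTuple):
    pc: int       # address of the branch instruction
    target: int   # address it jumps to when taken


def loop_trace_points(branches: List[Branch]) -> List[Tuple[int, int]]:
    """Return (initial_pc, final_pc) for each backward branch.

    initial point = branch target (the loop head)
    final point   = fall-through instruction after the branch (the loop exit)
    """
    return [(b.target, b.pc + INSTR_BYTES)
            for b in branches if b.target <= b.pc]
```

Forward branches are ignored here; the predictability check on the instructions after the final point would then proceed as in the procedure heuristic.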

Ichaining Trace Heuristic

Goal: to identify large sequences of dynamic instructions besides procedures and loops

A trace is identified by

– initial point
– final point
– behaviour of conditional branches within the trace

[Figure: control flow example with conditional branches, their taken (T) and not taken (NT) targets, and nodes labeled I9 and I10]

Taken and not taken targets of all conditional branches are considered as initial points of a trace.

Given an initial point, a trace is extended by adding successive instructions.

Every time a conditional branch is found, the trace is split into two.

The final point is reached if

– the new instruction already belongs to the trace,
– the trace reaches a maximum size, or
– the new instruction is an indirect jump.
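
The extension and splitting rules above can be sketched as a small worklist algorithm. The control-flow encoding (Node, cfg) and the function name enumerate_traces are illustrative assumptions; the three stop conditions follow the slide, and stopping at a successor missing from the graph is an extra safeguard.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    kind: str                # "plain", "cond" (conditional branch) or "indirect"
    succs: Tuple[int, ...]   # plain: (next,)  cond: (taken, not_taken)  indirect: ()


# (PCs in the trace, branch outcomes such as "TN", final point)
Trace = Tuple[List[int], str, Optional[int]]


def enumerate_traces(cfg: Dict[int, Node], start: int, max_size: int) -> List[Trace]:
    """Grow traces from start, splitting in two at every conditional branch."""
    done: List[Trace] = []
    work: List[Tuple[List[int], str]] = [([start], "")]
    while work:
        pcs, outcomes = work.pop()
        node = cfg[pcs[-1]]
        if node.kind == "cond":
            steps = [(node.succs[0], outcomes + "T"),
                     (node.succs[1], outcomes + "N")]
        else:
            steps = [(s, outcomes) for s in node.succs]
        if not steps:                       # nothing known to follow
            done.append((pcs, outcomes, None))
        for nxt, out in steps:
            stop = (nxt in pcs                      # already in the trace
                    or len(pcs) >= max_size         # maximum size reached
                    or nxt not in cfg               # safeguard (assumption)
                    or cfg[nxt].kind == "indirect") # indirect jump
            if stop:
                done.append((pcs, out, nxt))
            else:
                work.append((pcs + [nxt], out))
    return done
```

Each result carries the branch outcomes along its path, which is the "behaviour of conditional branches" part of a trace's identity; the maximum size keeps the number of split traces bounded.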

For every trace candidate, the live-output values are determined and their predictability is checked (taking the higher of prediction accuracy and utilization degree).

A trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold.

If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size).
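
The predictability test and the trimming loop can be written out as follows. This is one reading of the slide, not a confirmed implementation: trace_score, trim_until_predictable, and the live_outputs_of callback (which recomputes the live outputs of a shortened candidate) are hypothetical names.

```python
from math import prod
from typing import Callable, List, Optional, Sequence, Tuple

# One live output: (prediction accuracy, utilization degree), both in [0, 1].
LiveOutput = Tuple[float, float]


def trace_score(live_outputs: Sequence[LiveOutput]) -> float:
    """Product over all live outputs of max(accuracy, utilization)."""
    return prod(max(acc, util) for acc, util in live_outputs)


def trim_until_predictable(trace: List, live_outputs_of: Callable[[List], Sequence[LiveOutput]],
                           threshold: float, min_size: int) -> Optional[List]:
    """Drop final instructions until the trace is predictable.

    Returns the (possibly shortened) trace, or None if it would fall
    below min_size before becoming predictable.
    """
    while len(trace) >= min_size:
        if trace_score(live_outputs_of(trace)) > threshold:
            return trace
        trace = trace[:-1]   # remove the final instruction and retry
    return None
```

Because the score is a product of fractions, every extra live output can only lower it, so shorter traces with fewer live outputs tend to pass more easily.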

Trace Speculation Engine

Traces are communicated to the hardware at program loading time by filling a special hardware structure (the trace table)

Each entry of the trace table contains

– initial PC
– final PC
– branch history
– live-output values information
– frequency counter
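
A software model of such a table might look like this. The entry fields come from the slide; everything else is an assumption made for illustration: the names TraceEntry and TraceTable, indexing sets by word-aligned initial PC, refusing a load when a set is full, and preferring the most frequent entry when several traces share an initial PC. The default geometry matches the 128-entry, 4-way configuration listed under the simulation parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: str                       # e.g. "TNT" for branches in the trace
    live_outputs: Dict[str, int] = field(default_factory=dict)  # location -> value info
    frequency: int = 0                        # frequency counter


class TraceTable:
    """Set-associative table filled at program loading time (sketch)."""

    def __init__(self, entries: int = 128, ways: int = 4) -> None:
        self.ways = ways
        self.sets: List[List[TraceEntry]] = [[] for _ in range(entries // ways)]

    def _set(self, pc: int) -> List[TraceEntry]:
        # Assumption: index with the word-aligned PC (4-byte instructions).
        return self.sets[(pc >> 2) % len(self.sets)]

    def load(self, entry: TraceEntry) -> bool:
        """Insert an entry; returns False if its set is already full."""
        target = self._set(entry.initial_pc)
        if len(target) >= self.ways:
            return False
        target.append(entry)
        return True

    def lookup(self, pc: int) -> Optional[TraceEntry]:
        """Return the most frequent trace starting at pc, if any."""
        hits = [e for e in self._set(pc) if e.initial_pc == pc]
        return max(hits, key=lambda e: e.frequency) if hits else None
```

For example, loading two traces that both start at 0x1000 and then calling lookup(0x1000) returns the one with the larger frequency counter.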

Experimental Framework

Simulator: Alpha version of the SimpleScalar Toolset

Benchmarks: Spec2000, ref input

Maximum optimization level: DEC C & F77 compilers with -non_shared -O5

Statistics: collected for 250 million instructions, after skipping an initial part of 500 million

Simulation Parameters

Base microarchitecture

– out of order machine, 4 instructions per cycle
– I cache: 16KB, D cache: 16KB, L2 shared: 256KB
– bimodal predictor

TSMA additional structures

– each thread: I window, reorder buffer, register file
– speculative data cache: 1KB
– verification engine: up to 8 instructions per cycle
– trace table: 128 entries, 4-way set associative
– look ahead buffer: 128 entries

Profiling Analysis Parameters

Value predictors: Stride & Context

Minimum size of trace: 16

Maximum size of trace: 1024

Maximum number of live outputs: 32

Threshold to consider a set of live outputs predictable: 25%

Significant path (minimum frequency): 10%

Type of Speculated Instructions

[Figure: chart per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean); series: Loop Heuristic, Procedure Heuristic, Ichaining Heuristic]

The share of instructions speculated through procedure and loop traces is relatively low, but their sizes are significantly larger than those of Ichaining traces

Some statistics

– procedure trace size: 97.3
– loop trace size: 215.8
– Ichaining trace size: 36.4
– average size of speculated traces: 65.7
– average number of live output values: 16.4
– branches within a trace (Ichaining): 5.3
– traces with same initial PC (Ichaining): 1.57

Type of Speculations

[Figure: chart per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean); series: Spec KO, Path KO; Spec KO, Path OK; Spec OK, Path KO; Spec OK, Path OK]

Correct speculations: up to 70%

– 65% for correctly predicted paths
– 7% for incorrectly predicted paths (positive misprediction)

Incorrect speculations: close to 30%

– 20% for correctly predicted paths
– 8% for incorrectly predicted paths

This confirms that the proposed mechanism for predicting paths and final points provides significant accuracy

Speedup

[Figure: speedup per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean)]

Average speedup is close to 38%, in spite of a misspeculation rate close to 30%

Type of Cycles of ST

[Figure: chart per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean); series: ST can not speculate, ST can speculate]

25% of the time ST can speculate but does not find a trace to be speculated

– performance could be improved with further analysis

75% of the time ST cannot speculate because NST is executing and verifying a speculated trace

– speculation may be performed only when NST catches up

Type of Cycles of NST

[Figure: chart per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean); series: NST is verifying instructions, NST is executing instructions]

65% of the time NST is executing traces speculated by ST

– more speculated instructions imply more time executing instructions

35% of the time NST is verifying instructions from the look ahead buffer

– verifying instructions is faster than executing them

Useless Cycles of ST

[Figure: useless cycles of ST per benchmark (Crafty, Eon, Equake, Gcc, Mcf, Sixtra, Vortex, A_Mean)]

Up to 20% of the time ST is executing instructions beyond the misspeculation point

– ST is wasting up to 20% of the time executing instructions that will be discarded

Ideally, this percentage would be negligible

Branch Behaviour Distribution

[Figure: branch behaviour distribution; axis ticks from 50 to 90]

The instruction chaining heuristic does not provide many traces with the same initial point

– despite the significant number of branches within a trace (5.3 on average)

The study concludes that most branches almost always take the same direction

Close to 80% of the branches take the same direction more than 90% of the time
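
A figure of this kind could be derived from profile counts as sketched below. The function name fraction_biased and the input layout are hypothetical, and counting each static branch once (rather than weighting by execution count) is an assumption; the slides do not say which was used.

```python
from typing import Dict, Tuple


def fraction_biased(branch_counts: Dict[int, Tuple[int, int]],
                    bias: float = 0.9) -> float:
    """Fraction of executed static branches that go the same way more
    than `bias` of the time.

    branch_counts maps a branch PC to (times taken, times not taken).
    """
    executed = [(t, n) for t, n in branch_counts.values() if t + n > 0]
    if not executed:
        return 0.0
    biased = sum(1 for t, n in executed if max(t, n) / (t + n) > bias)
    return biased / len(executed)
```

With counts {1: (95, 5), 2: (50, 50), 3: (0, 10), 4: (91, 9)}, three of the four branches exceed the 90% bias, giving 0.75.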

Conclusions

Profile guided analysis to support TSMA

– identifies large and highly predictable traces
– reduces hardware complexity

Three basic heuristics are proposed

– procedure trace heuristic
– loop trace heuristic
– instruction chaining heuristic

Results show a speedup of 38% with a misspeculation rate of 30%

Future work

– aggressive trace level predictors
– generalization to multiple threads

Questions & Answers
