Compiler Construction of Idempotent Regions and Applications in Architecture Design

118
Compiler Construction of Idempotent Regions and Applications in Architecture Design Marc de Kruijf Advisor: Karthikeyan Sankaralingam PhD Defense 07/20/2012

description

Compiler Construction of Idempotent Regions and Applications in Architecture Design. Marc de Kruijf Advisor: Karthikeyan Sankaralingam. PhD Defense 07/20/2012. Example. source code. int sum( int *array, int len ) { int x = 0; for ( int i = 0; i < len ; ++ i ) - PowerPoint PPT Presentation

Transcript of Compiler Construction of Idempotent Regions and Applications in Architecture Design

Page 1: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Compiler Construction of Idempotent Regions and Applications in Architecture

DesignMarc de Kruijf

Advisor: Karthikeyan Sankaralingam

PhD Defense 07/20/2012

Page 2: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Example

2

int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x;}

source code

Page 3: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Example

3

R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3

assembly code

F F F F0

faults

exceptions

x

load ?

mis-speculations

Page 4: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Example

4

R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3

assembly code

BAD STUFF HAPPENS!

Page 5: Compiler Construction of Idempotent Regions and Applications in Architecture Design

R0 and R1 are unmodified

R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3

Example

5

assembly code

just re-execute!

convention:use checkpoints/buffers

Page 6: Compiler Construction of Idempotent Regions and Applications in Architecture Design

It’s Idempotent!

6

idempoh… what…?

int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x;}

=

Page 7: Compiler Construction of Idempotent Regions and Applications in Architecture Design

7

Thesis

idempotent regionsALL THE TIME

specifically… – using compiler analysis (intra-procedural) – transparent; no programmer intervention – hardware/software co-design – software analysis hardware execution

the thing that I am defending

Page 8: Compiler Construction of Idempotent Regions and Applications in Architecture Design

8

Thesisprelim.pptx defense.pptx

preliminary exam (11/2010) – idempotence: concept and simple empirical analysis – compiler: preliminary design & partial implementation – architecture: some area and power savings…?

defense (07/2012) – idempotence: formalization and detailed empirical analysis – compiler: complete design, source code release*

– architecture: compelling benefits (various)

* http://research.cs.wisc.edu/vertical/iCompiler

Page 9: Compiler Construction of Idempotent Regions and Applications in Architecture Design

9

Contributions & Findingsa summary

contribution areas – idempotence: models and analysis framework – compiler: design, implementation, and evaluation – architecture: design and evaluation

findings – potentially large idempotent regions exist in applications – for compilation, larger is better – small regions (5-15 instructions), 10-15% overheads – large regions (50+ instructions), 0-2% overheads – enables efficient exception and hardware fault recovery

Page 10: Compiler Construction of Idempotent Regions and Applications in Architecture Design

10

Overview

❶ Idempotence Models in Architecture

❷ Compiler Design & Evaluation

❸ Architecture Design & Evaluation

Page 11: Compiler Construction of Idempotent Regions and Applications in Architecture Design

11

Idempotence Models

MODEL AAn input is a variable that is live-in to the region. A region

preserves an input if the input is not overwritten.

idempotence: what does it mean?

DEFINITION(1) a region is idempotent iff its re-execution has no side-effects

(2) a region is idempotent iff it preserves its inputs

OK, but what does it mean to preserve an input?four models (next slides): A, B, C, & D

Page 12: Compiler Construction of Idempotent Regions and Applications in Architecture Design

12

Idempotence Model A

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

Live-ins:{all registers}

a starting point

{all memory}\{R1}

?? = mem[R4]

?

Page 13: Compiler Construction of Idempotent Regions and Applications in Architecture Design

13

Idempotence Model A

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

Live-ins:{all registers}

a starting point

{all memory}

Page 14: Compiler Construction of Idempotent Regions and Applications in Architecture Design

14

?? = mem[R4]

?

Idempotence Model A

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

Live-ins:{all registers}

a starting point

TWO REGIONS. SOME STUFF NOT

COVERED.

{all memory} \{mem[R4]}

Page 15: Compiler Construction of Idempotent Regions and Applications in Architecture Design

15

?? = mem[R4]

?

Idempotence Model B

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

live-in but dynamically dead at time of write– OK to overwrite if control flow invariable

varying control flow assumptions

ONE REGION. SOME STUFF STILL NOT

COVERED.

Page 16: Compiler Construction of Idempotent Regions and Applications in Architecture Design

16

Idempotence Model C

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

allow final instruction to overwrite input(to include otherwise ineligible instructions)

varying sequencing assumptions

ONE REGION. EVERYTHING COVERED.

Page 17: Compiler Construction of Idempotent Regions and Applications in Architecture Design

17

Idempotence Model D

R1 = R2 + R3if R1 > 0

mem[R4] = R1

SP = SP - 16

false

true

may be concurrently read in another thread– consider as input

varying isolation assumptions

TWO REGIONS. EVERYTHING COVERED.

Page 18: Compiler Construction of Idempotent Regions and Applications in Architecture Design

18

Idempotence Modelsan idempotence taxonomy

sequencing axis

control axis

isolation axis

(Model C)

(Model B)

(Model D)

Page 19: Compiler Construction of Idempotent Regions and Applications in Architecture Design

19

what are the implications?

Page 20: Compiler Construction of Idempotent Regions and Applications in Architecture Design

20

Empirical Analysismethodology

measurement – dynamic region size (path length) – subject to axis constraints – x86 dynamic instruction count (using PIN)

benchmarks – SPEC 2006, PARSEC, and Parboil suites

experimental configurations – unconstrained: ideal upper bound (Model C) – oblivious: actual in normal compiled code (Model C) – X-constrained: ideal upper bound constrained by axis X

Page 21: Compiler Construction of Idempotent Regions and Applications in Architecture Design

21

Empirical Analysis

SPEC INT SPEC FP PARSEC Parboil OVERALL1

10

100

1000

obliviousunconstrained

*geometrically averaged across suites

oblivious vs. unconstrained

5.2

160.2

30X IMPROVEMENT ACHIEVABLE*

aver

age

regi

on si

ze*

Page 22: Compiler Construction of Idempotent Regions and Applications in Architecture Design

22

Empirical Analysis

SPEC INT SPEC FP PARSEC Parboil OVERALL1

10

100

1000

obliviousunconstrainedcontrol-constrained

control axis sensitivity

160.2

5.2

40.1

*geometrically averaged across suites

aver

age

regi

on si

ze*

4X DROP IF VARIABLE CONTROL FLOW

Page 23: Compiler Construction of Idempotent Regions and Applications in Architecture Design

23

Empirical Analysis

SPEC INT SPEC FP PARSEC Parboil OVERALL1

10

100

1000

obliviousunconstrainedcontrol-constrainedisolation-constrained

isolation axis sensitivity

160.2

5.2

40.127.4

*geometrically averaged across suites

aver

age

regi

on si

ze*

1.5X DROP IF NO ISOLATION

Page 24: Compiler Construction of Idempotent Regions and Applications in Architecture Design

24

non-

idem

pote

nt in

stru

ction

s*

Empirical Analysis

SPEC INT SPEC FP PARSEC Parboil OVERALL0%

1%

2%

sequencing axis sensitivity

0.19%

*geometrically averaged across suites

INHERENTLY NON-IDEMPOTENT

INSTRUCTIONS:

RARE AND OF LIMITED TYPE

Page 25: Compiler Construction of Idempotent Regions and Applications in Architecture Design

25

Idempotence Modelsa summary

a spectrum of idempotence models – significant opportunity: 100+ sizes possible – 4x reduction constraining control axis – 1.5x reduction constraining isolation axis

two models going forward – architectural idempotence & contextual idempotence – both are effectively the ideal case (Model C) – architectural idempotence: invariable control always – contextual idempotence: variable control w.r.t. locals

Page 26: Compiler Construction of Idempotent Regions and Applications in Architecture Design

26

Overview

❶ Idempotence Models in Architecture

❷ Compiler Design & Evaluation

❸ Architecture Design & Evaluation

Page 27: Compiler Construction of Idempotent Regions and Applications in Architecture Design

27

Compiler Design

PARTITION: ANALYZE:

choose your own adventure

CODE GEN:

PARTITION

ANALYZE CODE GENX

COMPILER EVALUATION

identify semantic clobber antidependencescut semantic clobber antidependencespreserve semantic idempotence

Page 28: Compiler Construction of Idempotent Regions and Applications in Architecture Design

28

Compiler Evaluationpreamble

NOT FreeFreeFree

WHAT DO YOU MEAN:PERFORMANCE OVERHEADS?

PARTITION: ANALYZE:

CODE GEN:

identify semantic clobber antidependencescut semantic clobber antidependencespreserve semantic idempotence

Page 29: Compiler Construction of Idempotent Regions and Applications in Architecture Design

29

Compiler Evaluationpreamble

region size

over

head

register pressure

- preserve input values in registers- spill other values (if needed)

- spill input values to stack- allocate other values to registers

Page 30: Compiler Construction of Idempotent Regions and Applications in Architecture Design

30

Compiler Evaluation

compiler implementation – LLVM, support for both x86 and ARM

methodology

measurements – performance overhead: dynamic instruction count – for x86, using PIN – for ARM, using gem5 (just for ISA comparison at end)

– region size: instructions between boundaries (path length) – x86 only, using PIN

benchmarks – SPEC 2006, PARSEC, and Parboil suites

Page 31: Compiler Construction of Idempotent Regions and Applications in Architecture Design

31

Results, Take 1/3

SPEC INT SPEC FP PARSEC Parboil OVERALL0%2%4%6%8%

10%12%14%16%

initial results – overheadpe

rform

ance

ove

rhea

d

percentage increase in x86 dynamic instruction countgeometrically averaged across suites

13.1%

NON-TRIVIAL OVERHEAD

Page 32: Compiler Construction of Idempotent Regions and Applications in Architecture Design

0%

10%

20%

30%

40%

50%

Results, Take 1/3

region size

over

head

YOU ARE HERE(typically 10-30 instructions)

?

analysis of trade-offs

32

10+ instructions register pressure

Page 33: Compiler Construction of Idempotent Regions and Applications in Architecture Design

0%

10%

20%

30%

40%

50%

Results, Take 1/3

region size

over

head

register pressure

analysis of trade-offs

33

detectionlatency

?

𝒍= 𝒅𝒓+𝒅

?

Page 34: Compiler Construction of Idempotent Regions and Applications in Architecture Design

34

Results, Take 2/3

SPEC INT SPEC FP PARSEC Parboil OVERALL0%2%4%6%8%

10%12%14%16%

minimizing register pressurepe

rform

ance

ove

rhea

d

13.1%

STILL NON-TRIVIAL OVERHEAD

11.1%

Before After

Page 35: Compiler Construction of Idempotent Regions and Applications in Architecture Design

0%

10%

20%

30%

40%

50%

Results, Take 2/3

region size

over

head

register pressure

analysis of trade-offs

35

detectionlatency

𝒍= 𝒅𝒓+𝒅 𝒆∝( 𝟏

𝟏−𝒑 )𝒓

re-executiontime

?

Page 36: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

36

how do we get there?

Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations – boundaries in loops are bad for everyone – loop blocking, fission/fusion, inter-change, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures – awareness of array access patterns can help

Problem #4: intra-procedural scope – limited scope aggravates all effects listed above

Page 37: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

37

how do we get there?

solutions can be automated – a lot of work… what would be the gain?

ad hoc for now – consider PARSEC and Parboil suites as a case study – aliasing annotations – manual loop refactoring, scalarization, etc. – partitioning algorithm refinements (application-specific) – inlining annotations

Page 38: Compiler Construction of Idempotent Regions and Applications in Architecture Design

38

Results Take 3/3

PARSEC Parboil OVERALL0%2%4%6%8%

10%12%14%

big regionspe

rform

ance

ove

rhea

d 13.1%

0.06%

Before After

GAIN = BIG; TRIVIAL OVERHEAD

Page 39: Compiler Construction of Idempotent Regions and Applications in Architecture Design

39

1 10 100 1000 10000 100000 1000000100000000%

10%

20%

30%

40%

50%

Results Take 3/350+ instructions is good enough

region size

over

head

registerpressure

50+ instructions

(mis-optimized)

Page 40: Compiler Construction of Idempotent Regions and Applications in Architecture Design

40

ISA Sensitivityyou might be curious

ISA matters? (1) two-address (e.g. x86) vs. three-address (e.g. ARM) (2) register-memory (e.g. x86) vs. register-register (e.g. ARM) (3) number of available registers

the short version ( ) – impact of (1) & (2) not significant (+/- 2% overall) – even less significant as regions grow larger – impact of (3): to get same performance w/ idempotence – increase registers by 0% (large) to ~60% (small regions)

Page 41: Compiler Construction of Idempotent Regions and Applications in Architecture Design

41

Compiler Design & Evaluationa summary

design and implementation – static analysis algorithms, modular and perform well – code-gen algorithms, modular and perform well – LLVM implementation source code available*

findings – pressure-related performance overheads range from: – 0% (large regions) to ~15% (small regions) – greatest opportunity: loop-intensive applications – ISA effects are insignificant* http://research.cs.wisc.edu/vertical/iCompiler

Page 42: Compiler Construction of Idempotent Regions and Applications in Architecture Design

42

Overview

❶ Idempotence Models in Architecture

❷ Compiler Design & Evaluation

❸ Architecture Design & Evaluation

Page 43: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Architecture Recovery: It’s Real

43

safety first speed first (safety second)

Page 44: Compiler Construction of Idempotent Regions and Applications in Architecture Design

44

1 Fetch

4 Write-back

3 Execute

2 Decode

Architecture Recovery: It’s Reallots of sharp turns

1 Fetch 2 Decode 3 Execute 4 Write-back

closer to the truth

Page 45: Compiler Construction of Idempotent Regions and Applications in Architecture Design

45

1 Fetch

4 Write-back

3 Execute

2 Decode

1 Fetch 2 Decode 3 Execute 4 Write-back

!!!

Architecture Recovery: It’s Reallots of interaction

too late!

Page 46: Compiler Construction of Idempotent Regions and Applications in Architecture Design

46

Architecture Recovery: It’s Realbad stuff can happen

mis-speculation

(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,

etc.

hardware faults

(d) wear-out fault,(e) particle strike,(f) voltage spike,

etc.

exceptions

(g) page fault,(h) divide-by-zero,

(i) mis-aligned access,etc.

Page 47: Compiler Construction of Idempotent Regions and Applications in Architecture Design

47

Architecture Recovery: It’s Realbad stuff can happen

mis-speculation

(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,

etc.

hardware faults

(d) wear-out fault,(e) particle strike,(f) voltage spike,

etc.

exceptions

(g) page fault,(h) divide-by-zero,

(i) mis-aligned access,etc.

register pressure

detectionlatency

re-executiontime

Page 48: Compiler Construction of Idempotent Regions and Applications in Architecture Design

48

Architecture Recovery: It’s Realbad stuff can happen

mis-speculation

(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,

etc.

hardware faults

(d) wear-out fault,(e) particle strike,(f) voltage spike,

etc.

exceptions

(g) page fault,(h) divide-by-zero,

(i) mis-aligned access,etc.

Page 49: Compiler Construction of Idempotent Regions and Applications in Architecture Design

49

Architecture Recovery: It’s Realbad stuff can happen

hardware faultsexceptions

(d) wear-out fault,(e) particle strike,(f) voltage spike,

etc.

(g) page fault,(h) divide-by-zero,

(i) mis-aligned access,etc.

integrated GPU

low-power CPU high-reliability systems

Page 50: Compiler Construction of Idempotent Regions and Applications in Architecture Design

50

GPU Exception Support

Page 51: Compiler Construction of Idempotent Regions and Applications in Architecture Design

51

GPU Exception Supportwhy would we want it?

GPU/CPU integration– unified address space: support for demand paging– numerous secondary benefits as well…

Page 52: Compiler Construction of Idempotent Regions and Applications in Architecture Design

52

GPU Exception Supportwhy is it hard?

the CPU solution

pipeline registersbuffers

Page 53: Compiler Construction of Idempotent Regions and Applications in Architecture Design

53

GPU Exception Supportwhy is it hard?

CPU: 10s of registers/core

GPU: 10s of registers/thread32 threads/warp48 warps per “core”

10,000s of registers/core

Page 54: Compiler Construction of Idempotent Regions and Applications in Architecture Design

54

GPU Exception Supportidempotence on GPUs

GPUs hit the sweet spot (1) extractably large regions (low compiler overheads) (2) detection latencies long or hard to bound (large is good) (3) exceptions are infrequent (low re-execution overheads)

registerpressure

detectionlatency

re-executiontime

Page 55: Compiler Construction of Idempotent Regions and Applications in Architecture Design

55

GPU Exception Supportidempotence on GPUs

GPU design topics – compiler flow – hardware support – exception live-lock – bonus: fast context switching

DETAILS

GPUs hit the sweet spot (1) extractably large regions (low compiler overheads) (2) detection latencies long or hard to bound (large is good) (3) exceptions are infrequent (low re-execution overheads)

Page 56: Compiler Construction of Idempotent Regions and Applications in Architecture Design

56

GPU Exception Supportevaluation methodology

compiler – LLVM targeting ARM

benchmarks – Parboil GPU benchmarks for CPUs, modified

simulation – gem5 for ARM: simple dual-issue in-order (e.g. Fermi) – 10-cycle page fault detection latency

measurement – performance overhead in execution cycles

Page 57: Compiler Construction of Idempotent Regions and Applications in Architecture Design

57

GPU Exception Support

cutcp fft histo mri-q sad tpacf gmean0.0%0.2%0.4%0.6%0.8%1.0%1.2%1.4%1.6%

evaluation resultspe

rform

ance

ove

rhea

d

0.54%NEGLIGIBLE OVERHEADS

Page 58: Compiler Construction of Idempotent Regions and Applications in Architecture Design

58

CPU Exception Support

Page 59: Compiler Construction of Idempotent Regions and Applications in Architecture Design

59

CPU Exception Supportwhy is it a problem?

the CPU solution

pipeline registersbuffers

Page 60: Compiler Construction of Idempotent Regions and Applications in Architecture Design

60

CPU Exception Supportwhy is it a problem?

Decode, Rename,& Issue

Fetch

Integer

Integer

Multiply

Load/Store

RF

Branch

FP…

IEEE FP

Bypass

Replay queue

Flush?Replay?

Before

Page 61: Compiler Construction of Idempotent Regions and Applications in Architecture Design

61

CPU Exception Supportwhy is it a problem?

Fetch Decode & Issue

Integer

Integer

Branch

Multiply

Load/Store

FP

RF

After

Page 62: Compiler Construction of Idempotent Regions and Applications in Architecture Design

62

CPU Exception Supportidempotence on CPUs

CPU design simplification – in ARM Cortex-A8 (dual-issue in-order) can remove: – bypass / staging register file, replay queue – rename pipeline stage – IEEE-compliant floating point unit – pipeline flush for exceptions and replays – all associated control logic

DETAILS

leaner hardware – bonus: cheap (but modest) OoO issue

Page 63: Compiler Construction of Idempotent Regions and Applications in Architecture Design

63

CPU Exception Supportevaluation methodology

compiler – LLVM targeting ARM, minimize pressure (take 2/3)

benchmarks – SPEC 2006 & PARSEC suites (unmodified)

simulation – gem5 for ARM: aggressive dual-issue in-order (e.g. A8) – stall on potential in-flight exception

measurement – performance overhead in execution cycles

Page 64: Compiler Construction of Idempotent Regions and Applications in Architecture Design

64

CPU Exception Support

SPEC INT SPEC FP PARSEC OVERALL0%2%4%6%8%

10%12%14%

evaluation resultspe

rform

ance

ove

rhea

d

9.1%

LESS THAN 10%... BUT NOT AS GOOD AS

GPU (REGIONS TOO SMALL)

Page 65: Compiler Construction of Idempotent Regions and Applications in Architecture Design

65

Hardware Fault Tolerance

Page 66: Compiler Construction of Idempotent Regions and Applications in Architecture Design

66

Hardware Fault Tolerancewhat is the opportunity?

reliability trends – CMOS reliability is a growing problem – future CMOS alternatives are no better

architecture trends – hardware power and complexity are premium – desire for simple hardware + efficient recovery

application trends – emerging workloads consist of large idempotent regions – increasing levels of software abstraction

Page 67: Compiler Construction of Idempotent Regions and Applications in Architecture Design

67

Hardware Fault Tolerancedesign topics

hardware organizations – homogenous: idempotence everywhere – statically heterogeneous: e.g. accelerators – dynamically heterogeneous: adaptive cores

FAULT MODEL

fault detection capability – fine-grained in hardware (e.g. Argus, MICRO ‘07) or – fine-grained in software (e.g. instruction/region DMR)

fault model (aka ISA semantics) – similar to pipeline-based (e.g. ROB) recovery

Page 68: Compiler Construction of Idempotent Regions and Applications in Architecture Design

68

Hardware Fault Toleranceevaluation methodology

compiler – LLVM targeting ARM (compiled to minimize pressure)

benchmarks – SPEC 2006, PARSEC, and Parboil suites (unmodified)

simulation – gem5 for ARM: simple dual-issue in-order – DMR detection; compare against checkpoint/log and TMR

measurement – performance overhead in execution cycles

Page 69: Compiler Construction of Idempotent Regions and Applications in Architecture Design

69

Hardware Fault Toleranceevaluation results

SPEC INTSPEC FP

PARSECParboil

OVERALL0%5%

10%15%20%25%30%35%

idempotencecheckpoint/logTMR9.1

22.229.3

perfo

rman

ce o

verh

ead

IDEMPOTENCE: BEATS THE

COMPETITION

Page 70: Compiler Construction of Idempotent Regions and Applications in Architecture Design

70

Overview

❶ Idempotence Models in Architecture

❷ Compiler Design & Evaluation

❸ Architecture Design & Evaluation

Page 71: Compiler Construction of Idempotent Regions and Applications in Architecture Design

71

RELATED WORK

CONCLUSIONS

Page 72: Compiler Construction of Idempotent Regions and Applications in Architecture Design

72

Conclusions

idempotence: not good for everything – small regions are expensive – preserving register state is difficult with limited flexibility – large regions are cheap – preserving register state is easy with amortization effect – preserving memory state is mostly “for free”

idempotence: synergistic with modern trends – programmability (for GPUs) – low power (for everyone) – high-level software efficient recovery (for everyone)

Page 73: Compiler Construction of Idempotent Regions and Applications in Architecture Design

73

The End

Page 74: Compiler Construction of Idempotent Regions and Applications in Architecture Design

74

Back-Up: Chronology

MapReduce for CELL

Time

SELSE ’09: Synergy

ISCA ‘10: Relax

MICRO ’11:Idempotent Processors

PLDI ’12:Static Analysis and Compiler Design

ISCA ’12:iGPU

DSN ’10:TS model

CGO ??:Code Gen

TACO ??:Models

prelim defense

Page 75: Compiler Construction of Idempotent Regions and Applications in Architecture Design

75

Choose Your Own Adventure Slides

Page 76: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Idempotence Analysis

76

Yes

2

is this idempotent?

Page 77: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Idempotence Analysis

77

No

2

how about this?

Page 78: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Idempotence Analysis

78

Yes

2

maybe this?

Page 79: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Idempotence Analysis

79

operation sequence dependence chain idempotent?

write

read, write

write, read, write Yes

No

Yes

it’s all about the data dependences

Page 80: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Idempotence Analysis

80

operation sequence dependence chain idempotent?

write, read

read, write

write, read, write Yes

No

Yes

it’s all about the data dependences

CLOBBER ANTIDEPENDENCEantidependence with an exposed read

Page 81: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Semantic Idempotence

81

(1) local (“pseudoregister”) state:can be renamed to remove clobber antidependences*

does not semantically constrain idempotence

two types of program state

(2) non-local (“memory”) state:cannot “rename” to avoid clobber antidependences

semantically constrains idempotence

semantic idempotence = no non-local clobber antidep.

preserve local state by renaming and careful allocation

Page 82: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Region Partitioning Algorithm

82

steps one, two, and three

Step 1: transform functionremove artificial dependences, remove non-clobbers

Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior

Step 2: construct regions around antidependencescut all non-local antidependences in the CFG

Page 83: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 1: Transform

83

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

clobber antidependences

region boundaries

region identification

But we still have a problem:

depends on

Page 84: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 1: Transform

84

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

[x] = a;b = [x];[x] = c;

[x] = a;b = a;[x] = c;

non-clobber antidependences… GONE!

Page 85: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 1: Transform

85

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

clobber antidependences

region boundaries

region identification depends on

Page 86: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Region Partitioning Algorithm

86

steps one, two, and three

Step 1: transform functionremove artificial dependences, remove non-clobbers

Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior

Step 2: construct regions around antidependencescut all non-local antidependences in the CFG

Page 87: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 2: Cut the CFG

87

cut, cut, cut…

construct regions by “cutting” non-local antidependences

antidependence

Page 88: Compiler Construction of Idempotent Regions and Applications in Architecture Design

larger is (generally) better:large regions amortize the cost of input preservation

88

region size

over

head sources of overhead

optimal region size?

Step 2: Cut the CFG

rough sketch

but where to cut…?

Page 89: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 2: Cut the CFG

89

but where to cut…?

goal: the minimum set of cuts that cuts allantidependence paths

intuition: minimum cuts fewest regions large regions

approach: a series of reductions:minimum vertex multi-cut (NP-complete)minimum hitting set among pathsminimum hitting set among “dominating nodes”

details omitted

Page 90: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Region Partitioning Algorithm

90

steps one, two, and three

Step 1: transform functionremove artificial dependences, remove non-clobbers

Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior

Step 2: construct regions around antidependencescut all non-local antidependences in the CFG

Page 91: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Step 3: Loop-Related Refinements

91

correctness: Not all local antidependences removed by SSA…

loops affect correctness and performance

loop-carried antidependences may clobberdepends on boundary placement; handled as a post-pass

performance: Loops tend to execute multiple times…

to maximize region size, place cuts outside of loopalgorithm modified to prefer cuts outside of loops

details omitted

Page 92: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithms

92

idempotence preservation

background & concepts:live intervals, region intervals, and shadow intervals

compiling for architectural idempotence:invariable control flow upon re-execution

compiling for contextual idempotence:potentially variable control flow upon re-execution

Page 93: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithmslive intervals and region intervals

x = ...

... = f(x)y = ...

93

region boundaries

region interval

x’s live interval

Page 94: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithmsshadow intervals

94

shadow intervalthe interval over which a variable must not be overwritten

specifically to preserve idempotence

different for architectural and contextual idempotence

X

Page 95: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithmsfor contextual idempotence

x = ...

... = f(x)y = ...

95

region boundaries

x’s shadow interval

x’s live interval

Page 96: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithmsfor architectural idempotence

x = ...

... = f(x)y = ...

96

region boundaries

x’s shadow interval

x’s live interval

Page 97: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Code Generation Algorithmsfor architectural idempotence

x = ...

... = f(x)y = ...

97

region boundaries

x’s shadow interval

x’s live interval

y’s live interval

Page 98: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

98

Re: Problem #2 (cut in loops are bad)

i0 = φ(0, i1)

i1 = i0 + 1if (i1 < X)

for (i = 0; i < X; i++) { ...}

C code CFG + SSA

Page 99: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

99

Re: Problem #2 (cut in loops are bad)

R0 = 0

R0 = R0 + 1if (R0 < X)

for (i = 0; i < X; i++) { ...}

C code machine code

NO BOUNDARIES = NO PROBLEM

Page 100: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

100

Re: Problem #2 (cut in loops are bad)

R0 = 0

R0 = R0 + 1if (R0 < X)

for (i = 0; i < X; i++) { ...}

C code machine code

Page 101: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

101

Re: Problem #2 (cut in loops are bad)

R1 = 0

R0 = R1

R1 = R0 + 1if (R1 < X)

for (i = 0; i < X; i++) { ...}

C code machine code

– “redundant” copy– extra boundary (pressure)

Page 102: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

102

Re: Problem #3 (array access patterns)

[x] = a;b = [x];[x] = c;

[x] = a;b = a;[x] = c;

non-clobber antidependences… GONE!

algorithm makes this simplifying assumption:

cheap for scalars, expensive for arrays

Page 103: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

103

Re: Problem #3 (array access patterns)

not really practical for large arraysbut if we don’t do it, non-clobber antidependences remain

solution: handle potential non-clobbers in a post-pass(same way we deal with loop clobbers in static analysis)

// initialize:int[100] array;memset(&array, 100*4, 0);// accumulate:for (...) array[i] += foo(i);

Page 104: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

104

Benchmark Problems Size Before Size After

blackscholes ALIASING, SCOPE 78.9 >10,000,000

canneal SCOPE 35.3 187.3

fluidanimate ARRAYS, LOOPS, SCOPE

9.4 >10,000,000

streamcluster ALIASING 120.7 4,928

swaptions ALIASING, ARRAYS 10.8 211,000

cutcp LOOPS 21.9 612.4

fft ALIASING 24.7 2,450

histo ARRAYS, SCOPE 4.4 4,640,000

mri-q – 22,100 22,100

sad ALIASING 51.3 90,000

tpacf ARRAYS, SCOPE 30.2 107,000

results: sizes

Page 105: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

105

Benchmark Problems Overhead Before Overhead After

blackscholes ALIASING, SCOPE -2.93% -0.05%

canneal SCOPE 5.31% 1.33%

fluidanimate ARRAYS, LOOPS, SCOPE

26.67% -0.62%

streamcluster ALIASING 13.62% 0.00%

swaptions ALIASING, ARRAYS 17.67% 0.00%

cutcp LOOPS 6.344% -0.01%

fft ALIASING 11.12% 0.00%

histo ARRAYS, SCOPE 23.53% 0.00%

mri-q – 0.00% 0.00%

sad ALIASING 4.17% 0.00%

tpacf ARRAYS, SCOPE 12.36% -0.02%

results: overheads

Page 106: Compiler Construction of Idempotent Regions and Applications in Architecture Design

Big Regions

106

problem labels

Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations – boundaries in loops are bad for everyone – loop blocking, fission/fusion, inter-change, peeling, blocking, unrolling, scalarization, etc. can all help

Problem #3: large array structures – awareness of array access patterns can help

Problem #4: intra-procedural scope – limited scope aggravates all effects listed above

(ALIASING)

(LOOPS)

(ARRAYS)

(SCOPE)

Page 107: Compiler Construction of Idempotent Regions and Applications in Architecture Design

107

ISA Sensitivity

SPEC INT SPEC FP PARSEC Parboil OVERALL0

5

10

15

20

perc

enta

ge o

verh

ead

x86-64 ARMv7

same configuration as take 1/3

x86-64 vs. ARMv7

Page 108: Compiler Construction of Idempotent Regions and Applications in Architecture Design

108

ISA Sensitivitygeneral purpose register (GPR) sensititivity

14-GPR 12-GPR 10-GPR take 2/302468

101214

ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT

perc

enta

ge o

verh

ead

Page 109: Compiler Construction of Idempotent Regions and Applications in Architecture Design

ISA Sensitivity

109

more registers isn’t always enough

x = 0;if (y > 0) x = 1;z = x + y;

C codeR0 = 0

if (R1 > 0)

R0 = 1

R2 = R0 + R1

Page 110: Compiler Construction of Idempotent Regions and Applications in Architecture Design

ISA Sensitivity

110

more registers isn’t always enough

R0 = 0

if (R1 > 0)

R3 = R0

x = 0;if (y > 0) x = 1;z = x + y;

C code

R3 = 1

R2 = R3 + R1

Page 111: Compiler Construction of Idempotent Regions and Applications in Architecture Design

111

GPU Exception Supportcompiler flow & hardware support

kernel source

source code

compilerIR

device code generator

partitioning preservation

idempotent device code

L2 cache

core

L1, TLB

general purpose registers RPCs

fetchFU

… decodeFU FU

compiler

hardware

Page 112: Compiler Construction of Idempotent Regions and Applications in Architecture Design

112

GPU Exception Supportexception live-lock and fast context switching

bonus: fast context switching – boundary locations are configurable at compile time – observation 1: save/restore only live state – observation 2: place boundaries to minimize liveness

exception live-lock – multiple recurring exceptions can cause live-lock – detection: save PC and compare – recovery: single-stepped re-execution or re-compilation

Page 113: Compiler Construction of Idempotent Regions and Applications in Architecture Design

113

CPU Exception Supportdesign simplification

idempotence OoO retirement – simplify result bypassing – simplifies exception support for long latency instructions – simplifies scheduling of variable-latency instructions

OoO issue?

Page 114: Compiler Construction of Idempotent Regions and Applications in Architecture Design

114

CPU Exception Supportdesign simplification

what about branch prediction, etc.? high re-execution costs; live-lock issues

registerpressure

detectionlatency

re-executiontime

!!!

region placement to minimize re-execution...?

Page 115: Compiler Construction of Idempotent Regions and Applications in Architecture Design

115

CPU Exception Support

SPEC INT SPEC FP PARSEC OVERALL0

5

10

15

20

25

minimizing branch re-execution costpe

rcen

tage

ove

rhea

d

9.1%

take 2/3 cut at branch

18.1%

Page 116: Compiler Construction of Idempotent Regions and Applications in Architecture Design

116

Hardware Fault Tolerancefault semantics

hardware fault model (fault semantics) – side-effects are temporally contained to region execution – side-effects are spatially contained to target resources – control flow is legal (follows static CFG edges)

Page 117: Compiler Construction of Idempotent Regions and Applications in Architecture Design

117

Related Workon idempotence

Very Related Year Domain

Sentinel Scheduling 1992 Speculative memory re-ordering

Reference Idempotency 2006 Reducing speculative storage

Restart Markers 2006 Virtual memory in vector machines

Encore 2011 Hardware fault recovery

Somewhat Related Year Domain

Multi-Instruction Retry 1995 Branch and hardware fault recovery

Atomic Heap Transactions 1999 Atomic memory allocation

Page 118: Compiler Construction of Idempotent Regions and Applications in Architecture Design

118

Related Workon idempotence

what’s new? – idempotence model classification and analysis – first work to decompose entire programs – static analysis in terms of clobber (anti-)dependences – static analysis and code generation algorithms – overhead analysis: detection, pressure, re-execution – comprehensive (and general) compiler implementation – comprehensive compiler evaluation – a spectrum of architecture designs & applications