Page 1: x =0i Building Custom Arithmetic Operators with the ...perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/... · oating-point arithmetic) The operator as a circuit.....

Building Custom Arithmetic Operators with the FloPoCo Generator

Florent de Dinechin

[Title-slide illustration: a collage of example operator formulas, including e^x, √(x^2 + y^2 + z^2), πx and the sum of x_i for i = 0..n]


Outline

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x^2 + y^2) operator

Conclusion

F. de Dinechin A FloPoCo tutorial 2


Introduction: custom arithmetic


Two different ways of wasting silicon

Here are two universally programmable chips.

Who’s best for (insert your computation here)?


Are FPGAs any good at floating-point?

Long ago (1995), people ported the basic operations: +, −, ×

Versus the highly optimized FPU in the processor, each operator is 10x slower in an FPGA

This is the unavoidable overhead of programmability.

If you lose according to a metric, change the metric.

Peak figures for double-precision floating-point exponential

Pentium core: 20 cycles / DPExp @ 4 GHz: 200 MDPExp/s

FPExp in FPGA: 1 DPExp/cycle @ 400 MHz: 400 MDPExp/s

Chip vs chip: 6 Pentium cores vs 150 FPExp/FPGA

Power consumption also better

Single precision data better

(Intel MKL vector libm, vs FPExp in FloPoCo version 2.0.0)
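The peak figures above follow from simple arithmetic, which the short Python sketch below reproduces. The per-unit numbers are the ones quoted on the slide; the chip-level totals are derived here under the assumption that every core or operator runs at its peak rate at the same time, which is an idealization. The function name is illustrative only.

```python
# Peak-throughput arithmetic for the figures quoted on the slide.
# Chip-level totals assume all units run at peak simultaneously (an idealization).

def throughput_mexp_per_s(freq_mhz: float, cycles_per_result: float) -> float:
    """Millions of results per second for one core or one pipelined operator."""
    return freq_mhz / cycles_per_result

# One Pentium core: 20 cycles per exponential at 4 GHz.
pentium_core = throughput_mexp_per_s(freq_mhz=4000, cycles_per_result=20)
# One FPGA operator: one exponential per cycle at 400 MHz.
fpga_operator = throughput_mexp_per_s(freq_mhz=400, cycles_per_result=1)

# Chip versus chip: 6 cores against 150 operators.
pentium_chip = 6 * pentium_core
fpga_chip = 150 * fpga_operator

print("per unit (MDPExp/s):", pentium_core, fpga_operator)
print("per chip (MDPExp/s):", pentium_chip, fpga_chip)
print("chip-level ratio   :", fpga_chip / pentium_chip)
```

Per unit the FPGA operator is only twice as fast; the large gap appears at chip level, where many operators fit on one FPGA.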


Dura Amdahl lex, sed lex

SPICE Model-Evaluation, cut from Kapre and DeHon (FPL 2009)


Custom arithmetic (not your Pentium’s)

Never compute 1 bit more accurately than needed!

[Figure: block diagram of a floating-point exponential operator. Blocks: shift to fixed-point, constant multipliers (× 1/log(2) and × log(2)), precomputed ROM (e^A), generic polynomial evaluator (e^Z − Z − 1), truncated multiplier, normalize / round. Signals: input X with fields SX, EX, FX; fixed-point X; E; Y; A; Z; output R. Each bus is annotated with its width in terms of wE, wF, g and k, for example 1 + wF + g, wF + g − k, wF + g + 1 − 2k, wE + 1 and wE + wF + g + 1.]

Need a generator

Useful operators that make sense in a processor

Should a processor include elementary functions?
  Yes (Paul & Wilson, 1976), No since the transition to RISC

Should a processor include a divider and square root?
  Yes (Oberman et al, Arith, 1997), No since the transition to FMA (IBM then HP then Intel)

Should a processor include decimal hardware?
  Yes say IBM, No say Intel

Should a processor include a multiplier by log(2)?
  No of course.


Useful operators that make sense in an FPGA or ASIC

Elementary functions?
  Yes iff your application needs it

Divider or square root?
  Yes iff your application needs it

Decimal hardware?
  Yes iff your application needs it

A multiplier by log(2)?
  Yes iff your application needs it

In FPGAs, useful means: useful to one application.


Enough work to keep me busy to retirement

Arithmetic operators useful to at least one application :

Elementary functions (sine, exponential, logarithm...)

Algebraic functions (x / √(x^2 + y^2), polynomials, ...)

Compound functions (log2(1 ± 2^x), e^(−K t^2), ...)

Floating-point sums, dot products, sums of squares

Specialized operators : constant multipliers, squarers, ...

Complex arithmetic

LNS arithmetic

Decimal arithmetic

Interval arithmetic

...

Oh yes, basic operations, too.


What do we call arithmetic operators ?

An arithmetic operation is a function (in the mathematical sense)

few well-typed inputs and outputs

no memory or side effect (usually)

An operator is the implementation of such a function

IEEE-754 FP standard : operator(x) = rounding(operation(x))

→ Clean mathematical definition (even for floating-point arithmetic)

The operator as a circuit...

... is a directed acyclic graph (DAG):

easy to build and pipeline

easy to test against its mathematical specification
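The definition operator(x) = rounding(operation(x)) can be checked directly for addition. The Python sketch below is only an illustration of that definition: it computes the exact sum as a rational number, rounds it once to double precision, and compares with the hardware addition. It relies on CPython's integer true division being correctly rounded; the function names are made up for this example, and overflow cases are not handled.

```python
from fractions import Fraction

def exact_add(x: float, y: float) -> Fraction:
    """The mathematical operation: an exact rational sum, no rounding."""
    return Fraction(x) + Fraction(y)

def round_to_double(value: Fraction) -> float:
    """The rounding step: nearest double, ties to even.

    CPython's int / int true division is correctly rounded,
    so dividing numerator by denominator rounds exactly once.
    """
    return value.numerator / value.denominator

def fp_add_operator(x: float, y: float) -> float:
    """operator(x, y) = rounding(operation(x, y))."""
    return round_to_double(exact_add(x, y))

# The IEEE-754 adder of the host processor must agree with the definition.
for x, y in [(0.1, 0.2), (1e16, 1.0), (1.0, 2.0 ** -53), (-3.5, 3.25)]:
    print(x, y, fp_add_operator(x, y) == x + y)
```

The same pattern (exact reference, then one rounding) is what makes an operator easy to test against its mathematical specification.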


The benefits of custom computing

Example : a floating-point sum of squares

x^2 + y^2 + z^2

(not a toy example but a useful building block)

A square is simpler than a multiplication: half the hardware required

x^2, y^2, and z^2 are positive: one half of your FP adder is useless

Accuracy can be improved: 5 rounding errors in the floating-point version; (x^2 + y^2) + z^2 is asymmetrical

The FloPoCo Recipe

Floating-point interface for convenience

Clear accuracy specification for computing just right

Fixed-point internal architecture for efficiency
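The accuracy argument can be made concrete with a small experiment. The Python sketch below compares the classic floating-point evaluation (three products and two sums, hence five roundings) with a reference that computes the exact sum and rounds once. It models only the accuracy specification, not the fixed-point hardware; the function names and the random test data are assumptions of this example.

```python
from fractions import Fraction
import random

def sos_classic(x: float, y: float, z: float) -> float:
    """(x^2 + y^2) + z^2 in double precision: five roundings."""
    return (x * x + y * y) + z * z

def sos_single_rounding(x: float, y: float, z: float) -> float:
    """Exact sum of squares, rounded once to the nearest double."""
    exact = Fraction(x) ** 2 + Fraction(y) ** 2 + Fraction(z) ** 2
    return exact.numerator / exact.denominator  # correctly rounded in CPython

random.seed(0)
triples = [(random.random(), random.random(), random.random())
           for _ in range(1000)]

# The classic version depends on the order of its arguments...
asymmetric = sum(sos_classic(x, y, z) != sos_classic(z, y, x)
                 for x, y, z in triples)
# ...and sometimes differs from the correctly rounded result.
inexact = sum(sos_classic(x, y, z) != sos_single_rounding(x, y, z)
              for x, y, z in triples)

print("order-dependent results:", asymmetric, "out of", len(triples))
print("not correctly rounded  :", inexact, "out of", len(triples))
```

The single-rounding version is symmetrical by construction, since the exact sum does not depend on the order of the terms.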


A floating-point adder

[Figure: dual-path floating-point adder. Inputs x and y, output z. Blocks: exponent difference / swap; a close path (|mx − my|, 1-bit shift, LZC/shift); a far path (shift, sticky, prenorm 2-bit shift); rounding, normalization and exception handling. Buses are annotated with widths p, p + 1 and 2p + 2.]


A fixed-point architecture

[Figure: fixed-point sum-of-squares architecture. Inputs X, Y, Z are unpacked into exponents EX, EY, EZ and mantissas MX, MY, MZ. Blocks: unpack, sort, three squarers, two shifters, add, normalize/pack, producing R. Buses are annotated with widths 1 + wF, 2 + wF + g, 4 + wF + g and wE + wF + g.]


The benefits of custom computing

A few results for floating-point sum-of-squares on Virtex4:

Single precision     area                    performance
LogiCore classic     1282 slices, 20 DSP     43 cycles @ 353 MHz
FloPoCo classic      1188 slices, 12 DSP     29 cycles @ 289 MHz
FloPoCo custom        453 slices,  9 DSP     11 cycles @ 368 MHz

Double precision     area                    performance
FloPoCo classic      4480 slices, 27 DSP     46 cycles @ 276 MHz
FloPoCo custom       1845 slices, 18 DSP     16 cycles @ 362 MHz

all performance metrics improved, FLOP/s/area more than doubled

Plus : custom operator more accurate, and symmetrical
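The "more than doubled" claim can be checked from the table. The Python sketch below is a rough estimate, assuming that each pipelined operator delivers one result per cycle and measuring area in slices only (DSP blocks are ignored, which is a simplification of this example). It also converts the pipeline depth into a latency in nanoseconds.

```python
# (slices, cycles, MHz) taken from the results table; DSP blocks are ignored here.
results = {
    "single": {"classic": (1188, 29, 289), "custom": (453, 11, 368)},
    "double": {"classic": (4480, 46, 276), "custom": (1845, 16, 362)},
}

def throughput_per_slice(slices: int, cycles: int, mhz: float) -> float:
    """Millions of results per second and per slice (one result per cycle assumed)."""
    return mhz / slices

def latency_ns(slices: int, cycles: int, mhz: float) -> float:
    """Time for one result to traverse the pipeline, in nanoseconds."""
    return cycles * 1000.0 / mhz

gain = {}
for precision, rows in results.items():
    classic, custom = rows["classic"], rows["custom"]
    gain[precision] = throughput_per_slice(*custom) / throughput_per_slice(*classic)
    print(precision,
          "throughput/slice gain: %.2f" % gain[precision],
          "latency: %.1f ns -> %.1f ns" % (latency_ns(*classic), latency_ns(*custom)))
```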


Custom also means : custom pipeline

[Figure: the fixed-point sum-of-squares architecture (unpack, sort, squarers, shifters, add, normalize/pack), repeated here to illustrate pipelining options]

One operator does not fit all

Low frequency, low resource consumption

Faster but larger (more registers)

Combinatorial


History of the FloPoCo project

Initial goal : FPGA arithmetic the way it should be

that is : open-ended, unlike processor arithmetic

an open-ended list of custom operators

open-ended data formats : all operators fully parameterized

open-ended performance trade-off : flexible pipeline

General philosophy : computing just right

A generator framework (written in C++, outputting VHDL)

Objectives :

portable, target-agnostic (more on this later)

VHDL should be simulable and synthesizable using free tools

Design philosophy :

VHDL generation by printing arbitrary VHDL from C++

automate repetitive and error-prone tasks (such as pipelining)
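The idea of generating VHDL by printing it from a program can be shown in a few lines. The sketch below is in Python for brevity, whereas FloPoCo itself is written in C++; the function and the entity name are hypothetical and are not part of FloPoCo. It only shows how one parameter (the width) turns into a family of VHDL entities.

```python
def gen_int_adder(width: int) -> str:
    """Return the VHDL text of a combinational adder of the given width.

    Hypothetical generator for illustration; not FloPoCo's API or output.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    name = f"IntAdderSketch_{width}"
    msb = width - 1
    lines = [
        "library ieee;",
        "use ieee.std_logic_1164.all;",
        "use ieee.numeric_std.all;",
        "",
        f"entity {name} is",
        f"  port (X, Y : in  std_logic_vector({msb} downto 0);",
        f"        R    : out std_logic_vector({msb} downto 0));",
        "end entity;",
        "",
        f"architecture arch of {name} is",
        "begin",
        "  R <= std_logic_vector(unsigned(X) + unsigned(Y));",
        "end architecture;",
    ]
    return "\n".join(lines)

# One generator call per data format: here a 27-bit adder.
print(gen_int_adder(27))
```

A real generator adds what this sketch leaves out, in particular the automatic insertion of pipeline registers.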


Beyond the plan

Custom arithmetic for ASIC and HLS

the FloPoCo framework was successfully used to design the FPU of the Kalray processor

FloPoCo provides the floating-point back-end to the PandA project (Politecnico di Milano)

Not enough “custom” yet to my taste


FloPoCo for application developers


Here should come a demo

FloPoCo is freely available from

http://flopoco.gforge.inria.fr/

Command line syntax : a sequence of operator specifications

Options : target frequency, target hardware, ...

Output : synthesizable VHDL.


First something classical

A single precision floating-point adder (8-bit exponent and 23-bit mantissa)

flopoco -pipeline=no FPAdder 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

|---Entity IntAdder_27_f400_uid7

|---Entity LZCShifter_28_to_28_counting_32_uid14

|---Entity IntAdder_34_f400_uid17

Entity FPAdder_8_23_uid2

Output file: flopoco.vhdl

To probe further :

flopoco -pipeline=no FPAdder 11 52    (double precision)

flopoco -pipeline=no FPAdder 9 34    (just right for you)


Classical floating-point, continued

A complete single-precision FPU in a single VHDL file:

flopoco -pipeline=no FPAdder 8 23 FPMultiplier 8 23 23 FPDiv 8 23 FPSqrt 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

|---Entity IntAdder_27_f400_uid7

|---Entity LZCShifter_28_to_28_counting_32_uid14

|---Entity IntAdder_34_f400_uid17

Entity FPAdder_8_23_uid2

Entity Compressor_2_2

Entity Compressor_3_2

| |---Entity IntAdder_49_f400_uid39

|---Entity IntMultiplier_UsingDSP_24_24_48_unsigned_uid26

|---Entity IntAdder_33_f400_uid47

Entity FPMultiplier_8_23_8_23_8_23_uid24

Entity FPDiv_8_23

Entity FPSqrt_8_23

Output file: flopoco.vhdl


Frequency-directed pipelining

The same FPAdder, pipelined for 300 MHz:

flopoco -pipeline=yes -frequency=300 FPAdder 8 23

FloPoCo interface to pipeline construction

“Please pipeline this operator to work at 200MHz”

Not the choice made by other core generators...

... but better because compositional

When you assemble components working at frequency f, you obtain a component working at frequency f.
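A minimal sketch of frequency-directed pipelining, assuming a simple chain of blocks with known combinational delays: registers are inserted greedily whenever the accumulated delay would exceed the target clock period. The delays below are invented for illustration, the function name is hypothetical, and FloPoCo's actual scheduling is more elaborate than this.

```python
def pipeline_depth(block_delays_ns, frequency_mhz):
    """Number of register levels needed so that no stage exceeds the clock period.

    Greedy model: blocks are atomic and traversed in order along a single chain.
    """
    period_ns = 1000.0 / frequency_mhz
    depth = 0
    accumulated = 0.0
    for delay in block_delays_ns:
        if delay > period_ns:
            raise ValueError("a single block is slower than the target period")
        if accumulated + delay > period_ns:
            depth += 1           # insert a register before this block
            accumulated = delay  # the block starts a new pipeline stage
        else:
            accumulated += delay
    return depth

# Hypothetical delays (ns) of six blocks chained inside an operator.
delays = [1.2, 0.9, 1.5, 0.8, 1.1, 0.7]

for f in (100, 200, 400):
    print(f, "MHz -> pipeline depth", pipeline_depth(delays, f))
```

Because every stage of every component meets the same period, chaining two components built for frequency f gives a design that still meets f, which is the compositional property stated above.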


Examples of pipeline

flopoco -pipeline=yes -frequency=400 FPAdder 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

| Pipeline depth = 1

|---Entity IntAdder_27_f400_uid7

| Pipeline depth = 1

|---Entity LZCShifter_28_to_28_counting_32_uid14

| Pipeline depth = 4

|---Entity IntAdder_34_f400_uid17

| Pipeline depth = 1

Entity FPAdder_8_23_uid2

Pipeline depth = 10

flopoco -pipeline=yes -frequency=200 FPAdder 8 23

Final report:

(...)

Pipeline depth = 4


Of course the frequency depends on the target FPGA

flopoco -pipeline=yes -frequency=200 -target=Spartan3 FPAdder 8 23

Final report:

(...)

Pipeline depth = 14

flopoco -pipeline=yes -frequency=200 -target=Virtex6 FPAdder 8 23

Final report:

(...)

Pipeline depth = 3

Altera and Xilinx targets currently supported (at various levels of accuracy): Spartan3, Virtex4, Virtex5, Virtex6, StratixII, StratixIII, StratixIV, StratixV, CycloneII, CycloneIII, CycloneIV, CycloneV.


Also match the architecture to the target FPGA

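Compare

flopoco -pipeline=no -target=Virtex4 IntConstDiv 16 3 -1

flopoco -pipeline=no -target=Virtex6 IntConstDiv 16 3 -1

[Figure: constant divider architecture. A chain of four LUTs processes the input digits x3, x2, x1, x0 and produces the quotient digits q3, q2, q1, q0; the remainders r4 = 0, r3, r2, r1 and r0 = r are passed along the chain. Buses are annotated with widths 4 and 2.]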

Architecture specificities

LUTs

DSP blocks

memory blocks

And FloPoCo attempts to model the delays of these blocks for efficient pipelining
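The LUT chain in the figure can be modelled in software. The Python sketch below assumes the usual digit-by-digit scheme: the input is cut into 4-bit digits, and each table maps (incoming remainder, digit) to (quotient digit, outgoing remainder), starting from a zero remainder at the most significant digit. The digit size of 4 bits and the function names are assumptions of this example; on a real target the digit size would be chosen to match the LUT input count.

```python
def build_lut(divisor: int = 3, digit_bits: int = 4) -> dict:
    """Table indexed by (remainder_in, digit), giving (quotient_digit, remainder_out)."""
    return {
        (r, d): divmod((r << digit_bits) + d, divisor)
        for r in range(divisor)
        for d in range(1 << digit_bits)
    }

def divide_by_const(x: int, width: int = 16, divisor: int = 3, digit_bits: int = 4):
    """Divide a width-bit unsigned integer by a small constant, one digit per table."""
    lut = build_lut(divisor, digit_bits)
    mask = (1 << digit_bits) - 1
    remainder = 0   # r4 = 0 in the figure
    quotient = 0
    # Walk the digits from most significant (x3) to least significant (x0).
    for i in reversed(range(width // digit_bits)):
        digit = (x >> (i * digit_bits)) & mask
        q_digit, remainder = lut[(remainder, digit)]
        quotient = (quotient << digit_bits) | q_digit
    return quotient, remainder   # the final remainder is r0 = r

print(divide_by_const(1000))    # quotient and remainder of 1000 / 3
```

Each table has a 2-bit remainder input and a 4-bit digit input, which matches the bus widths 2 and 4 annotated in the figure.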


Frequency-directed pipelining in practice

We do our best but we know it’s hopeless

The actual frequency obtained will depend on the whole application (placement, routing pressure, etc.)...

best-effort philosophy,

aiming to be accurate to 10% for an operator synthesized alone

asking for a higher frequency provides a deeper pipeline

And a big TODO : VLSI targets.


Non-standard operators

Correctly rounded divider by 3: flopoco FPConstDiv 8 23 3

Floating-point exponential: flopoco FPExp 8 23

Multiplication of a 32-bit signed integer by the constant 1234567 (two algorithms, your mileage may vary):

flopoco IntIntKCM 32 1234567 0

flopoco IntConstMult 32 1234567

Full list in the documentation, or by typing just flopoco.

Sorry for the sometimes incomplete or inconsistent interface.

Don’t trust us

Two operators, TestBench and TestBenchFile, generate test benches for the operator preceding them on the command line

flopoco FPExp 8 23 TestBenchFile 10000 generates 10000 random tests

flopoco IntConstDiv 16 3 -1 TestBenchFile -2 generates an exhaustive test

Specification-based test bench generation

Not by simulation of the generated architecture!

Helper functions for encoding/decoding FP format, if you want to check the testbench...

fp2bin 9 36 3.1415926

bin2fp 9 36 010100000000100100100001111110110100110100010011
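To make the bit string above easier to read, here is a minimal sketch of an fp2bin-like encoder. It is not FloPoCo's implementation: the function `encodeFP` is hypothetical, and it assumes the layout of 2 exception bits (00 zero, 01 normal, 10 infinity, 11 NaN), 1 sign bit, wE exponent bits biased by 2^(wE-1)-1, and wF fraction bits, with no subnormals. The slide's string is consistent with that assumption: it is 48 = 2+1+9+36 bits long and starts with 01, 0, 100000000.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>

// Append the `width` low-order bits of `value` to `out`, most significant bit first.
static void appendBits(std::string& out, uint64_t value, int width) {
    for (int i = width - 1; i >= 0; --i)
        out.push_back(((value >> i) & 1u) ? '1' : '0');
}

// Encode x with wE exponent bits and wF fraction bits (assumed layout, see above).
// The input is a binary64 value rounded to nearest; a real tool would parse the
// decimal string in higher precision.
std::string encodeFP(int wE, int wF, double x) {
    assert(wE >= 2 && wE <= 11 && wF >= 1 && wF <= 52);
    const int64_t bias = (int64_t(1) << (wE - 1)) - 1;
    const int64_t maxBiased = (int64_t(1) << wE) - 1;
    const uint64_t sign = std::signbit(x) ? 1 : 0;
    uint64_t exn = 1, biased = 0, fraction = 0;   // default: normal number

    if (std::isnan(x)) {
        exn = 3;
    } else if (std::isinf(x)) {
        exn = 2;
    } else if (x == 0.0) {
        exn = 0;
    } else {
        int e = 0;
        const double m = std::frexp(std::fabs(x), &e);   // |x| = m * 2^e, 0.5 <= m < 1
        // Significand 2m in [1, 2), scaled to an integer with wF fraction bits.
        uint64_t sig = static_cast<uint64_t>(std::nearbyint(std::ldexp(m, wF + 1)));
        int64_t exponent = e - 1;
        if (sig == (uint64_t(1) << (wF + 1))) {          // rounding carried up to 2.0
            sig >>= 1;
            ++exponent;
        }
        const int64_t b = exponent + bias;
        if (b > maxBiased) {
            exn = 2;                                     // overflow: infinity
        } else if (b < 0) {
            exn = 0;                                     // underflow: flushed to zero
        } else {
            biased = static_cast<uint64_t>(b);
            fraction = sig & ((uint64_t(1) << wF) - 1);  // drop the implicit leading 1
        }
    }

    std::string bits;
    appendBits(bits, exn, 2);
    appendBits(bits, sign, 1);
    appendBits(bits, biased, wE);
    appendBits(bits, fraction, wF);
    return bits;
}
```

For example, `encodeFP(3, 4, 1.0)` gives `0100110000`: exception 01, sign 0, exponent 011 (the bias is 3), fraction 0000.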

Open-ended operators

A polynomial evaluator for arbitrary functions

Example: flopoco FunctionEvaluator "(sin(x*Pi/2))^2" 32 32 4

4 is the degree of the polynomial; it lets you express a memory/multiplier trade-off

Works for the set of functions for which it works

Another one is HOTBM, all this is still work in progress

FPPipeline

Example: flopoco FPPipeline pipeline.txt 8 23

where pipeline.txt contains something like

r = sqrt(sqr(x0-x1)+sqr(y0-y1)); output r;

⊕ Properly synchronizes the various sub-components

⊕ Makes it easy to explore various precisions

Using this is against the philosophy of FloPoCo!

which is: let’s build a custom operator instead

Quick and dirty implementation

not all operators supported, constants not always specialized

no TestBench support (not a trivial thing to add)

Anybody in the audience doing serious HLS?

Computing just right

A bit of plain common sense

Common wisdom

The more accurately you compute, the more expensive it gets

In practice

We (hopefully) notice it when our computation is not accurate enough.

But do we notice it when it is too accurate for our needs?

Reconciling performance and accuracy?

Or, regain performance by computing just right?

Double precision spoils us

The standard binary64 format (formerly known as double-precision) provides roughly 16 decimal digits.

Why should anybody need such accuracy?

Count the digits in the following

Definition of the second: the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium 133 atom.

Definition of the metre: the distance travelled by light in vacuum in 1/299,792,458 of a second.

Most accurate measurement ever (another atomic frequency): to 14 decimal places

Most accurate measurement of the Planck constant to date: to 7 decimal places

The gravitation constant G is known to 3 decimal places only
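The "roughly 16 decimal digits" figure follows from the 53-bit significand of binary64 (52 stored fraction bits plus the implicit leading 1). As a quick check:

```latex
\[
  53 \log_{10} 2 \approx 53 \times 0.30103 \approx 15.95
  \qquad\text{and}\qquad
  2^{-53} \approx 1.1 \times 10^{-16}.
\]
```

The second value is the relative error bound of a single correctly rounded operation in round-to-nearest mode.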

Parenthesis : then why binary64 ?

This PC computes $10^{9}$ operations per second (1 gigaflops)

An allegory due to Kulisch

print the numbers in 100 lines of 5 columns double-sided: 1000 numbers/sheet

1000 sheets ≈ a heap of 10 cm

$10^{9}$ flops ≈ the heap grows at 100 m/s, or 360 km/h

A teraflops ($10^{12}$ op/s) prints to the moon in about one hour

Current top 500 computers reach the petaflop ($10^{15}$ op/s)

each operation may involve a relative error of $10^{-16}$, and they accumulate.

Doesn’t this sound wrong ?

We would use these 16 digits just to accumulate garbage in them ?
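The arithmetic behind the allegory, worked out from the figures on the slide (1000 numbers per sheet, 1000 sheets per 10 cm):

```latex
\begin{align*}
\text{sheets per second} &= \frac{10^{9}\ \text{numbers/s}}{10^{3}\ \text{numbers/sheet}}
                          = 10^{6}\ \text{sheets/s} \\
\text{heap growth}       &= 10^{6}\ \text{sheets/s} \times \frac{0.1\ \text{m}}{10^{3}\ \text{sheets}}
                          = 100\ \text{m/s} = 360\ \text{km/h} \\
\text{at } 10^{12}\ \text{op/s}:\quad
                         & 100\ \text{m/s} \times 10^{3} = 100\ \text{km/s}
\end{align*}
```

Taking the Earth-Moon distance as roughly 384,000 km (a figure that is not on the slide), a heap growing at 100 km/s covers it in about 3,840 s, that is, about an hour.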

Back to the point

Mastering accuracy for performance

When implementing a “computing core”

A goal: never compute more accurately than needed

A tool: flexible arithmetic

Qualitatively: use FPConstDiv instead of a divider

Quantitatively: tailor your data formats to your problem

Sub-goals

Know what accuracy you need

Know how accurately you compute

“Computing cores” considered so far: elementary functions, sums of products, linear algebra, Euclidean lattice algorithms, ...

Hopelessly difficult in the general case... but at least ask this question systematically

My current crusade

The evil (at least on FPGAs)

A fast implementation of single-precision floating-point exponential (but accurate to $2^{-8}$ only)

Do you see why it is wrong ?

A line I shall have in each of my talks until the world is saved

Save routing ! Save power ! Don’t move useless bits around !

Or maybe this one

Do you really need to compute this bit ?

FloPoCo for developers of HLS software

Using FloPoCoLib

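From file src/main_minimal.cpp:

    #include <fstream>
    #include <iostream>
    #include "FloPoCo.hpp"
    using namespace std;
    using namespace flopoco;

    int main(int argc, char* argv[]) {
        // Pick the FPGA family the operator is optimized for
        Target* target = new Virtex4();

        int wE = 9;   // exponent width in bits
        int wF = 33;  // fraction width in bits

        // Floating-point adder; the three (wE, wF) pairs presumably
        // describe the two inputs and the result
        Operator* op = new FPAdderSinglePath(target, wE, wF, wE, wF, wE, wF);

        // Write the generated VHDL to a file
        ofstream file;
        file.open("FPAdder.vhdl", ios::out);
        op->outputVHDLToFile(file);
        file.close();

        cerr << endl << "Final report:" << endl;
        op->outputFinalReport(0);
        return 0;
    }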

FloPoCo class hierarchy

[Figure: class diagram. Recoverable content:]

Signal: +width, +cycle, +lifeSpan

Operator: +signalList, +vhdl, +outputVHDL(), +emulate(), +buildStandardTestCases()

FPAdder: +wE, +wF

IntAdder: +size

Shifters

Collision: +wE, +wF

TestBench

Targets: Virtex4, StratixII

Custom arithmetic for HLS (an outsider’s point of view)

All the optimizations you do when compiling for a processor, plus

a+a should be replaced with 2*a, not the opposite!

fewer optimizations prevented by bit-for-bit or corner-case issues

a C compiler is not allowed to replace x-0.0 with x

or 2*x/2 with x

both issues can be solved by (very small) corner-case-fix logic

more opportunities for operator specialization:

multiplication and division by a constant

a squarer for a*a,

a sine on one period only...

endless opportunities for operator fusion

did I mention $\sqrt{x^2+y^2}$?

bit-for-bit compatibility lost, replaced with guaranteed accuracy

Eventually we will need new languages.
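The 2*x/2 case mentioned above is easy to reproduce in software. The following sketch uses hypothetical helper names (`doubleThenHalve`, `rewriteChangesResult`) and evaluates the expression as written in binary64, showing the corner case that forbids the rewrite: when 2*x overflows, the result is infinity, not x.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Evaluate 2*x/2 exactly as written.
// `volatile` keeps the compiler from simplifying the expression at compile time.
double doubleThenHalve(double x) {
    volatile double doubled = 2.0 * x;   // overflows to infinity when |x| > max/2
    return doubled / 2.0;
}

// True when rewriting "2*x/2" as "x" would change the result for this x.
bool rewriteChangesResult(double x) {
    const double y = doubleThenHalve(x);
    if (std::isnan(x)) return !std::isnan(y);   // NaN stays NaN: no change
    return y != x;
}
```

For ordinary values such as 1.5 the two expressions agree; for the largest finite binary64 value the intermediate product overflows. A hardware operator can apply the rewrite and handle the overflow range separately, which is presumably the kind of small corner-case fix the slide refers to.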

FloPoCo for developers of custom arithmetic

A modestly object-oriented approach

FloPoCo is not a C++-based HDL

VHDL generation is “print-based”

vhdl << "SoS <= EA(wE-1 downto 0) & Fraction;";

easy to port existing work (FPLibrary)

easy learning curve for the VHDL-literate

at least the expressive power of VHDL!

Many helper functions help with the prints

Example: VHDL signal declaration

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE-1 downto 0) & Fraction;";
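To show how a declare-style helper can sit inside a print, here is a self-contained toy. It is a sketch under stated assumptions, not FloPoCo's Operator class: `ToyOperator`, `architecture` and `buildExample` are hypothetical names, and the real helper certainly does more (the class diagram lists cycle and lifeSpan attributes on signals, for instance).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy stand-in for an operator under construction.
class ToyOperator {
public:
    std::ostringstream vhdl;   // architecture body, filled by prints

    // Record a signal of the given width and return its name,
    // so that the call can be used directly inside a print.
    std::string declare(const std::string& name, int width) {
        assert(width > 0);
        signals_[name] = width;
        return name;
    }

    // Emit the collected signal declarations, then the printed body.
    std::string architecture() const {
        std::ostringstream out;
        for (const auto& s : signals_) {
            out << "signal " << s.first << " : std_logic_vector("
                << s.second - 1 << " downto 0);\n";
        }
        out << "begin\n" << vhdl.str() << "end architecture;\n";
        return out.str();
    }

private:
    std::map<std::string, int> signals_;   // signal name -> width in bits
};

// Concatenate an exponent field and a fraction, with widths computed in C++.
std::string buildExample(int wE, int wF) {
    ToyOperator op;
    op.vhdl << "  " << op.declare("SoS", wE + wF)
            << " <= EA(" << wE - 1 << " downto 0) & Fraction;\n";
    return op.architecture();
}
```

The point of the design is that the widths are ordinary C++ integers: the same generator code produces a correctly sized declaration for any (wE, wF), which plain VHDL text could not do without generics.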

Workflow of designing a new operator in FloPoCo

1. Create an operator skeleton

copy src/UserDefinedOperator.cpp (heavily documented)

add your file in CMakeLists.txt and src/FloPoCo.hpp

add command-line interface + doc in src/main.cpp

2. Define the specification: write the emulate() method

syntax is quite OOgly but imitation is your friend

a few lines to change from a type-matching operator

... then TestBench and TestBenchFile will work

3. Write the code generating combinatorial VHDL

Basically embed your VHDL in C++ prints

4. Add code for self-pipelining

Roughly the same effort as it would take to get one pipeline instance

... but you get a frequency-directed pipeline for all the FloPoCo targets
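Step 2 is what makes test benches specification-based: expected outputs come from the mathematical definition of the operator, not from the generated circuit. The sketch below illustrates the idea for an unsigned divider by a constant. It does not use FloPoCo's emulate() interface; `TestCase`, `specify` and `exhaustiveTests` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One test vector: input x and the expected outputs q and r.
struct TestCase {
    uint32_t x;
    uint32_t q;
    uint32_t r;
};

// Specification of the operator: Euclidean division of x by the constant d,
// with no reference to how the hardware computes it.
TestCase specify(uint32_t x, uint32_t d) {
    assert(d != 0);
    return TestCase{x, x / d, x % d};
}

// All expected input/output pairs for a width-bit input (exhaustive test).
std::vector<TestCase> exhaustiveTests(unsigned width, uint32_t d) {
    assert(width >= 1 && width <= 20);   // keep the list small
    std::vector<TestCase> tests;
    tests.reserve(static_cast<std::size_t>(1) << width);
    for (uint32_t x = 0; x < (1u << width); ++x) {
        tests.push_back(specify(x, d));
    }
    return tests;
}
```

A test bench then drives the generated VHDL with each x and compares its outputs against q and r, so a bug in the architecture cannot leak into the expected values.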

F. de Dinechin A FloPoCo tutorial 44

Page 90: x =0i Building Custom Arithmetic Operators with the ...perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/... · oating-point arithmetic) The operator as a circuit.....

Workflow of designing a new operator in FloPoCo

1. Create an operator skeleton

copy src/UserDefinedOperator.cpp (heavily documented)add your file in CMakeList.txt and src/FloPoCo.hpp

add command-line interface + doc in src/main.cpp

2. Define the specification : write the emulate() method

syntax is quite OOgly but imitation is your frienda few lines to change from a type-matching operator

... then TestBench and TestBenchFile will work

3. Write the code generating combinatorial VHDL

Basically embedd your VHDL in C++ prints

4. Add code for self-pipelining

Roughly the same effort as would take to get one pipeline instance... but you get frequency-directed pipeline for all the FloPoCo targets

F. de Dinechin A FloPoCo tutorial 44


My personal record

Two weeks from the first intuition of the algorithm to a complete pipelined FloPoCo implementation and paper submission (Division by small integer constants, ARC 2012)

Implementation time

10 minutes to obtain a testbench generator

1/2 day for the integer Euclidean division

20 min for its flexible pipeline

1/2 day for the FP divider by 3

and again 20 min

Time savers:

Test bench generator, to get the combinatorial operator correct

Pipeline framework (for filling result tables in the paper)

(Time wasters: maintaining all this instead of writing new operators)


Pipeline made easy

[Figure: datapath of a floating-point operator with inputs X, Y, Z and output R. Blocks: unpack (exponents EX, EY, EZ and mantissas MX, MY, MZ), sort, shifters, squarers, add, normalize/pack. Bus widths are annotated as 1 + wF, 2 + wF + g, 4 + wF + g and wE + wF + g.]

[In the figure, the pipeline levels are numbered from 0 to 6.]

Notion of current cycle during VHDL output

Each signal has a cycle attribute (the cycle at which it is defined)

vhdl << declare("EA", wE) << " <= ... ;";  // cycle(EA)=0
nextCycle();
vhdl << declare("EB", wE) << " <= ... ;";  // cycle(EB)=1
setCycle(0);
vhdl << ... ;

[Figure annotation: between the definition of EA and its use on the right-hand side of SoS, RHS("EA"), insert 6 registers.]

FloPoCo looks for signal names on the right-hand side and delays them by (currentCycle - defCycle) cycles:

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE-1 downto 0) & Fraction;";

Generated VHDL:

SoS <= EA_d6(wE-1 downto 0) & Fraction_d1;

(FloPoCo transparently declares and builds the needed shift registers)
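The renaming rule can be sketched with a toy model. This is not FloPoCo's implementation: the CycleModel type and its rhs() helper are hypothetical, and Fraction is assumed to have been declared at cycle 5, which is what the _d1 suffix in the generated VHDL suggests.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model (hypothetical, not FloPoCo code): remembers at which cycle each
// signal was declared and renames uses that happen at a later cycle.
struct CycleModel {
    int currentCycle = 0;
    std::map<std::string, int> defCycle;

    void declare(const std::string& name) { defCycle[name] = currentCycle; }
    void nextCycle() { ++currentCycle; }
    void setCycle(int n) { currentCycle = n; }

    // Name to use for `name` on a right-hand side at the current cycle:
    // delayed by (currentCycle - defCycle) register levels.
    std::string rhs(const std::string& name) const {
        int delay = currentCycle - defCycle.at(name);
        return delay > 0 ? name + "_d" + std::to_string(delay) : name;
    }
};

// Rebuilds the assignment to SoS from the slide.
std::string sosExample() {
    CycleModel m;
    m.declare("EA");        // cycle(EA) = 0
    m.setCycle(5);
    m.declare("Fraction");  // assumption: cycle(Fraction) = 5
    m.setCycle(6);          // SoS is computed at cycle 6
    return "SoS <= " + m.rhs("EA") + "(wE-1 downto 0) & "
         + m.rhs("Fraction") + ";";
}
```

A real generator also has to emit the shift registers behind each _dN name; the model only shows where the suffix comes from.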

Managing the current cycle:

n = getCycle();

setCycle(n);

nextCycle();

syncCycleWithSignal("EA");

Frequency-directed pipelining:

manageCriticalPath( target->adderDelay(n) );

manageCriticalPath(delay)

A currentCriticalPath variable is maintained during VHDL generation

manageCriticalPath(delay) adds delay to currentCriticalPath

if the sum is larger than 1/targetFrequency:

currentCycle++

and reset currentCriticalPath to delay

delay is an estimation of the delay of the following block of VHDL

you (the developer) have to perform this estimation

... typically computed using timing methods of the Target class:

adderDelay(int n)

localRoutingDelay(int fanout)

This is how the pipeline adapts to the actual target.
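A minimal sketch of this bookkeeping, assuming nothing beyond the rule stated on the slide. PipelineModel and stagesFor are hypothetical names, and the delays are plain numbers in seconds rather than values returned by a Target.

```cpp
#include <cassert>

// Toy model of the rule above (hypothetical, not FloPoCo's Operator class).
struct PipelineModel {
    double period;                     // 1 / targetFrequency, in seconds
    int currentCycle = 0;
    double currentCriticalPath = 0.0;

    explicit PipelineModel(double targetFrequencyHz)
        : period(1.0 / targetFrequencyHz) {}

    // Account for the next block of combinatorial VHDL, whose delay the
    // developer has estimated.
    void manageCriticalPath(double delay) {
        if (currentCriticalPath + delay > period) {
            ++currentCycle;                // insert a pipeline level
            currentCriticalPath = delay;   // the block opens the new stage
        } else {
            currentCriticalPath += delay;  // the block still fits
        }
    }
};

// Number of pipeline levels inserted for `blocks` blocks of equal delay.
int stagesFor(double targetFrequencyHz, double blockDelay, int blocks) {
    PipelineModel m(targetFrequencyHz);
    for (int i = 0; i < blocks; ++i) m.manageCriticalPath(blockDelay);
    return m.currentCycle;
}
```

With five blocks of 4 ns each, this model inserts 2 levels at 100 MHz and none at 10 MHz: the same generator code yields a different pipeline depth for each frequency.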

The magic explained

The pipeline adapts to arbitrary insertions of pipeline levels anywhere

during development, tinkering, fine-tuning

or during frequency-directed, target-optimized pipelining

Graceful degradation to a combinatorial operator

in which case the vhdl stream is just output untouched

Suits the “print VHDL” philosophy

FloPoCo code is much shorter than the VHDL it generates.


From the developer point of view

Clean separation of functionality and performance issues

You write only combinatorial VHDL to the vhdl stream

FloPoCo parses it to insert registers (in two passes)

Timing is managed outside the VHDL code

Subcomponents also delay their outputs with respect to their inputs, etc.

Correct-by-construction pipelining

You cannot obtain a non-functional operator when starting from a working combinatorial one

(at least not without very loud warnings)

This part of the framework can be trusted

nothing more complicated than integer subtractions and comparisons

and of course, test bench generators adapt to the pipeline

You can obtain a bad pipeline, though

unbalanced, over-pipelined, missing the target frequency, ...


[Figure: how an operator is generated. In C++, the Constructor (recursively calling the constructors of all sub-components to know their pipeline depth) produces functional information (the signal list with bitwidths etc., the subcomponent list, and the vhdl stream, a combinatorial description in structural VHDL syntax) and pipeline information (for each signal, its cycle value and its lifeSpan value). Operator.outputVHDL() uses this information for the generation of VHDL declarations, of VHDL register code, and of VHDL architecture code (second pass on the vhdl stream, delaying right-hand side signals). The output file contains the entity with its ports, the architecture with component and signal declarations, a process(clk) for the registers, and the assignments (DEXY <= ..., DEYZ <= ...).]

We could have done it better

The history of the development shows a bit

Do we really need an interface with explicit cycle management?

In FloPoCo, time = (cycle, critical path within the cycle)

but the user interface should consist only of delay information

A delay should be systematically associated with each signal declaration

Too much code to change it now...

Matching the architecture to the target

Perfectly valid code:

if (target == "StratixII") {
  // output some VHDL
}
else if (target == "Virtex4") {
  // output other VHDL
}
else ...

Try to avoid it:

Model architecture details

LUTs

DSP blocks

memory blocks

Model delays

Definitely an endless effort.

One slide on testing

emulate() performs bit-accurate emulation

Not by architecture simulation!

By expressing the mathematical specification...

typically in a few lines of MPFR (www.mpfr.org)

buildStandardTestCases() for corner cases and regression tests

buildRandomTestCases() to bias the random generator in an operation-specific way

the default is uniform on the bits

Example 1: the useful range of the exponential is very small, so bias the random generator towards it

Example 2: cancellation cases in floating-point additions

may (should) be overloaded by each Operator

The special TestBench and TestBenchFile operators invoke these methods


A tutorial: building an integer divider by 3


Anybody here remember how we compute divisions?

[Long division of 776 by 3: 7 = 3 × 2 + 1, then 17 = 3 × 5 + 2, then 26 = 3 × 8 + 2; quotient 258, remainder 2.]

Iteration body: Euclidean division of a 2-digit decimal number by 3

The first digit is a remainder from the previous iteration: its value is 0, 1 or 2

Possible implementation as a look-up table that, for each value from 00 to 29, gives the quotient and the remainder of its division by 3.
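The schoolbook iteration can be sketched in software as follows. The function name divideBy3 and the decimal-string interface are illustrative assumptions; the 30-entry table is the one described on the slide.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Divide a decimal digit string by 3, one digit per iteration.
// Returns (quotient digits, final remainder).
std::pair<std::string, int> divideBy3(const std::string& digits) {
    // Look-up table indexed by the 2-digit number 10*r + digit (00 to 29).
    int quotientOf[30], remainderOf[30];
    for (int y = 0; y < 30; ++y) {
        quotientOf[y] = y / 3;
        remainderOf[y] = y % 3;
    }

    std::string q;
    int r = 0;  // remainder from the previous iteration: 0, 1 or 2
    for (char c : digits) {
        int y = 10 * r + (c - '0');      // 2-digit number, at most 29
        q += char('0' + quotientOf[y]);  // one quotient digit, at most 9
        r = remainderOf[y];
    }
    return {q, r};
}
```

For the input "776" the three table look-ups are 07, 17 and 26, giving the quotient digits 2, 5, 8 and the final remainder 2.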


Mathematical obfuscation, in binary-friendly radix

Representation of x in radix $\beta = 2^\alpha$

(splitting the binary decomposition of x into k chunks of $\alpha$ bits)

procedure ConstantDiv(x, d)
  $r_k \leftarrow 0$
  for $i = k-1$ down to 0 do
    $y_i \leftarrow x_i + 2^\alpha r_{i+1}$ (this + is a concatenation)
    $(q_i, r_i) \leftarrow (\lfloor y_i/d \rfloor, y_i \bmod d)$ (read from a table)
  end for
  return $q = \sum_{i=0}^{k-1} q_i \cdot 2^{\alpha i}$, $r_0$
end procedure

Each iteration

consumes $\alpha$ bits of x, and a remainder of size $\gamma = \lceil \log_2 d \rceil$

produces $\alpha$ bits of q, and a remainder of size $\gamma$

Implemented as a table with $\alpha + \gamma$ bits in, $\alpha + \gamma$ bits out
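A software sketch of the procedure above, assuming that chunk $x_i$ has weight $2^{\alpha i}$ and that x fits in 64 bits. The function name constantDiv and its parameter order are illustrative, and the table is rebuilt on every call only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Euclidean division of a (k*alpha)-bit x by a small constant d.
// Returns (q, r) such that x = q*d + r and 0 <= r < d.
std::pair<uint64_t, uint64_t> constantDiv(uint64_t x, unsigned d,
                                          unsigned alpha, unsigned k) {
    // Remainder width gamma = ceil(log2 d).
    unsigned gamma = 0;
    while ((1u << gamma) < d) ++gamma;

    // Table with alpha+gamma bits in: entry y holds (floor(y/d), y mod d).
    std::vector<std::pair<uint64_t, uint64_t>> table(uint64_t(1) << (alpha + gamma));
    for (uint64_t y = 0; y < table.size(); ++y) table[y] = {y / d, y % d};

    const uint64_t mask = (uint64_t(1) << alpha) - 1;
    uint64_t q = 0, r = 0;                        // r_k = 0
    for (int i = int(k) - 1; i >= 0; --i) {
        uint64_t xi = (x >> (alpha * i)) & mask;  // chunk i of x
        uint64_t yi = (r << alpha) | xi;          // concatenation r_{i+1} & x_i
        q |= table[yi].first << (alpha * i);      // q_i has weight 2^(alpha*i)
        r = table[yi].second;                     // r_i feeds the next iteration
    }
    return {q, r};
}
```

Since $r_{i+1} \le d-1$, each $y_i$ is below $d \cdot 2^\alpha$, so every $q_i$ fits in $\alpha$ bits and the chunks of q never overlap. Calling constantDiv(776, 3, 4, 4) reproduces the decimal example: quotient 258, remainder 2.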


Sequential implementation

[Figure: a single LUT with inputs $x_i$ ($\alpha$ bits) and $r_{i+1}$ ($\gamma$ bits) and outputs $q_i$ ($\alpha$ bits) and $r_i$ ($\gamma$ bits), with a register (clk, reset) on the remainder.]

Unrolled implementation

[Figure: a chain of four LUTs. Each LUT takes a 4-bit chunk $x_i$ and a 2-bit remainder $r_{i+1}$, and outputs a 4-bit $q_i$ and a 2-bit $r_i$, for i = 3 down to 0, with $r_4 = 0$ and $r_0 = r$.]

Setting the parameters for LUTs

For instance, assuming 6-input LUTs (e.g. LUT6)

A 6-bit in, 6-bit out table consumes 6 LUT6

Size of remainder is $\gamma = \lceil \log_2 d \rceil$

If $d < 2^5$, very efficient architecture: $\alpha = 6 - \gamma$

The smaller d, the better

Easy to pipeline (one register behind each LUT for free)
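The parameter choice can be written down as a few lines of arithmetic. This is a sketch under the slide's assumptions (6-input LUTs, $d < 2^5$). DividerParams and chooseParams are hypothetical names, and the LUT6 count is a rough estimate of one LUT6 per table output bit, not a synthesis result.

```cpp
#include <cassert>

struct DividerParams {
    int gamma;   // remainder width: ceil(log2 d)
    int alpha;   // bits of x consumed by each table
    int tables;  // number of chained tables for a w-bit input
    int lut6;    // rough estimate: one LUT6 per table output bit
};

// Assumes 6-input LUTs and 2 <= d < 32, so that alpha = 6 - gamma >= 1.
DividerParams chooseParams(int d, int w) {
    DividerParams p{};
    while ((1 << p.gamma) < d) ++p.gamma;
    p.alpha = 6 - p.gamma;
    p.tables = (w + p.alpha - 1) / p.alpha;   // ceil(w / alpha)
    p.lut6 = p.tables * (p.alpha + p.gamma);  // 6 LUT6 per table
    return p;
}
```

For d = 3 and w = 32 this gives $\gamma = 2$, $\alpha = 4$ and 8 tables; for d = 5 it gives $\gamma = 3$, $\alpha = 3$ and 11 tables, which illustrates why a smaller d is better.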

Step 0: get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge

Tutorial files are in a tgz on the tutorial webpage.

Step 1: build a skeleton

We want to implement flopoco IntConstDiv w d

cd src

cp UserDefinedOperator.cpp IntConstDiv.cpp

cp UserDefinedOperator.hpp IntConstDiv.hpp

Inside the newly created files,

globally replace UserDefinedOperator with IntConstDiv

empty the cpp of all flesh except the initial declarations

Edit CMakeLists.txt (the file that defines the build process),

add src/IntConstDiv somewhere

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

Compile and fix

Check that ./flopoco IntConstDiv 32 3 does nothing

Step 2: overload emulate()

Your d should be stored somewhere in the IntConstDiv

What emulate() does

a TestCase is a pair (input, allowed outputs)

emulate() takes a TestCase with only the input defined

and defines the allowed outputs (here, two)

Inputs and outputs are semantically bit vectors, held in GMP integers.

OOgly methods convert floating-point values from / to arbitrary-precision floating-point (MPFR)

no need here

You may define several acceptable output values for an input

typically for faithful rounding (two values)

After this step, wonder at:

./flopoco IntConstDiv 10 3 TestBench 1000
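The idea of a specification that fills in the allowed outputs can be sketched without FloPoCo. Everything here is a stand-in: ToyTestCase is not FloPoCo's TestCase class, plain 64-bit integers replace GMP integers, and the signal names X, Q and R are assumptions (the "two" outputs are read here as quotient and remainder).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a test case: one value per input name and a
// set of allowed values per output name.
struct ToyTestCase {
    std::map<std::string, uint64_t> inputs;
    std::map<std::string, std::set<uint64_t>> allowedOutputs;
};

// Specification of the divider by d, written as plain mathematics:
// the quotient and the remainder of the Euclidean division.
void emulateIntConstDiv(ToyTestCase& tc, unsigned d) {
    uint64_t x = tc.inputs.at("X");
    tc.allowedOutputs["Q"].insert(x / d);
    tc.allowedOutputs["R"].insert(x % d);
}

// True if a simulated output value is acceptable for this test case.
bool accepts(const ToyTestCase& tc, const std::string& output, uint64_t value) {
    auto it = tc.allowedOutputs.find(output);
    return it != tc.allowedOutputs.end() && it->second.count(value) > 0;
}
```

Holding a set of values per output is what makes room for faithful rounding, where two results are acceptable; the integer divider only ever inserts one value per output.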


Step 3: write the combinatorial VHDL

Use the abstract Table operator

All you need is to overload the function method

Other useful generic operators:

Shifters and leading zero counters for floating-point

Pipelined adders, multipliers, ...
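The "overload one method" pattern can be illustrated with a stand-in class. ToyTable below is hypothetical and does not reproduce the interface of FloPoCo's Table; the packing of the output as the concatenation of $q_i$ and $r_i$ is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an abstract table operator: wIn input bits,
// wOut output bits, contents defined by function().
class ToyTable {
public:
    ToyTable(int wIn, int wOut) : wIn(wIn), wOut(wOut) {}
    virtual ~ToyTable() = default;
    virtual uint64_t function(int x) const = 0;
    const int wIn, wOut;
};

// Table for one iteration of the division by d. The input is the
// concatenation r_{i+1} & x_i, the output the concatenation q_i & r_i.
// Inputs y >= d * 2^alpha never occur, since r_{i+1} < d.
class DivTable : public ToyTable {
public:
    DivTable(int d, int alpha, int gamma)
        : ToyTable(alpha + gamma, alpha + gamma), d(d), gamma(gamma) {}

    uint64_t function(int y) const override {
        uint64_t q = uint64_t(y / d), r = uint64_t(y % d);
        return (q << gamma) | r;
    }

private:
    int d, gamma;
};
```

With d = 3, $\alpha = 4$ and $\gamma = 2$, the entry for y = 26 is q = 8 and r = 2, packed as 34.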

Step 3.5: overload buildStandardTestCases()

First debug until the VHDL works for all $x_i = 0$

... then for all $x_i = 1$

Step 4: pipeline

using only nextCycle() and sync*() in this case

A tutorial: building a faithful FIR


Specification

Function

$r = \sum_{i=1}^{n} c_i x_i$

The $c_i$ are constants provided with arbitrary accuracy on the command line

the $x_i$ are signed fixed-point inputs in (1,p) format

Accuracy: faithful rounding of the exact result

$x_i$ and $c_i$ considered exact

return one of the two numbers surrounding the exact result:

next-best after correct rounding (which is too expensive)

last-bit accurate (all returned bits are useful),

more accurate than a naive assembly of multipliers and adders


Algorithm and error budget

sum the products truncated to precision $2^{-p-g}$, i.e. with g guard bits,

add one bit at weight $2^{-p-1}$,

and truncate to precision $2^{-p}$, i.e. to the result format

... where $g = 1 + \log_2(n - 1)$

Design space exploration

The sum can be implemented as

a rake of adders (simplest, let’s start with that)

a tree of adders (slightly more complex, good we have C++)

a single BitHeap object (on the edge)
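The error budget can be checked exhaustively on small formats with an integer-only model. This is a sketch under stated assumptions, not FloPoCo code: the constants are given as integers scaled by $2^q$ (so they are exact), the (1,p) inputs are read as integers in $[-2^p, 2^p - 1]$ scaled by $2^p$, the logarithm in g is rounded up, and all function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Floor division by 2^s, well defined for negative values.
static int64_t floorShift(int64_t v, int s) {
    int64_t d = int64_t(1) << s;
    int64_t q = v / d;
    if (v % d != 0 && v < 0) --q;
    return q;
}

// g = 1 + ceil(log2(n-1)): smallest g >= 1 with n-1 <= 2^(g-1).
static int guardBits(int n) {
    int g = 1;
    while ((int64_t(1) << (g - 1)) < n - 1) ++g;
    return g;
}

// c[i] scaled by 2^q, x[i] scaled by 2^p; result scaled by 2^p. Needs q >= g.
int64_t faithfulFIR(const std::vector<int64_t>& c,
                    const std::vector<int64_t>& x, int p, int q) {
    (void)p;  // p only fixes the scale of x and of the result
    int g = guardBits(int(c.size()));
    int64_t sum = 0;  // scaled by 2^(p+g)
    for (std::size_t i = 0; i < c.size(); ++i)
        sum += floorShift(c[i] * x[i], q - g);  // truncate each product
    sum += int64_t(1) << (g - 1);               // one bit at weight 2^(-p-1)
    return floorShift(sum, g);                  // truncate to 2^(-p)
}

// True if |result - exact| < 2^(-p) for every input vector.
bool faithfulForAllInputs(const std::vector<int64_t>& c, int p, int q) {
    const int64_t lo = -(int64_t(1) << p), hi = (int64_t(1) << p) - 1;
    std::vector<int64_t> x(c.size(), lo);
    while (true) {
        int64_t exact = 0;  // scaled by 2^(p+q)
        for (std::size_t i = 0; i < c.size(); ++i) exact += c[i] * x[i];
        int64_t err = faithfulFIR(c, x, p, q) * (int64_t(1) << q) - exact;
        if (std::llabs(err) >= (int64_t(1) << q)) return false;
        std::size_t i = 0;  // odometer: move to the next input vector
        while (i < x.size() && x[i] == hi) x[i++] = lo;
        if (i == x.size()) return true;
        ++x[i];
    }
}
```

The budget works out as follows: the n truncations lose less than $n \cdot 2^{-p-g}$ in total, the final rounding loses at most $2^{-p-1}$ in one direction and gains back one unit of $2^{-p-g}$ in the other, so the result stays strictly within $2^{-p}$ of the exact sum as soon as $(n-1) \cdot 2^{-g} \le 1/2$.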

Step 0: get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge

Step 1: build a skeleton

We want to implement flopoco FixedPointFIR n w a1 ... an

cd src

cp UserDefinedOperator.cpp FixedPointFIR.cpp

cp UserDefinedOperator.hpp FixedPointFIR.hpp

Inside the newly created files,

globally replace UserDefinedOperator with FixedPointFIR

empty the cpp of all flesh except the initial declarations

Edit CMakeLists.txt (the file that defines the build process),

add src/FixedPointFIR somewhere

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

NewCompressorTree has a parameter list of arbitrary size

Compile and fix

Check that ./flopoco FixedPointFIR 4 8 1 1 1 1 does nothing

Page 137: x =0i Building Custom Arithmetic Operators with the ...perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/... · oating-point arithmetic) The operator as a circuit.....

Step 2 : overload emulate()

Your ai should be stored in the FixedPointFIR somewhere

What emulate() does

a TestCase is a pair (input, allowed outputs)

emulate() inputs a TestCase with only the input defined

and defines the allowed outputs – here two

Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert floating-point from / to arbitrary precision floating-point (MPFR)

no need here

After this step, wonder at : ./flopoco FixedPointFIR 4 8 1 1 1 1 TestBench 1000
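The idea behind emulate() can be modelled in a few lines. This is a sketch, not FloPoCo's C++ API: the function name `emulate_fir` is hypothetical, Python's exact `Fraction` stands in for GMP, and the input/output format (integers scaled by 2**p) is an assumption. Given only the inputs, it returns the allowed outputs: the two values surrounding the exact sum, or a single one when the exact sum is representable.

```python
from fractions import Fraction
import math


def emulate_fir(coeffs, xs, p):
    """Allowed outputs of a faithful FIR, as integers scaled by 2**p.

    coeffs: exact coefficients a_i (Fractions)
    xs:     inputs x_i as integers, each standing for xs[i] / 2**p
    """
    exact = sum(a * Fraction(x, 2 ** p) for a, x in zip(coeffs, xs))
    scaled = exact * 2 ** p
    # faithful rounding accepts the representable value just below
    # and the one just above the exact result
    return {math.floor(scaled), math.ceil(scaled)}


third = Fraction(1, 3)
print(emulate_fir([third] * 3, [1, 2, 2], p=3))  # {1, 2}
print(emulate_fir([third] * 3, [3, 3, 3], p=3))  # {3}
```

Because the reference is computed exactly, the test bench never depends on the architecture chosen for the sum.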


Step 3 : Write the combinatorial VHDL

use one of the existing constant multipliers in FloPoCo

Half the room should use FixRealKCM

The other half should use IntConstMult

First write the additions as “+” in standard VHDL

not very efficient, but tutorially good (several better alternatives to discuss later)

All datapath on 1 + w + g bits

here also, a small optimization is possible
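As a worked example of the datapath size, the width 1 + w + g can be computed from n and w using the guard-bit formula g = 1 + log2(n − 1) given earlier. The function names are hypothetical, and rounding the logarithm up when n − 1 is not a power of two is an assumption of this sketch.

```python
def guard_bits(n):
    """g = 1 + ceil(log2(n - 1)) for an n-tap filter (the ceiling is an assumption)."""
    if n < 2:
        raise ValueError("need at least two taps")
    # for m >= 1, (m - 1).bit_length() equals ceil(log2(m)); here m = n - 1
    return 1 + (n - 2).bit_length()


def datapath_width(n, w):
    """Total datapath width 1 + w + g."""
    return 1 + w + guard_bits(n)


print(guard_bits(4), datapath_width(4, 8))  # 3 12
```

For the running example `FixedPointFIR 4 8 1 1 1 1`, this gives a 12-bit datapath.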


Step 3.5 : overload BuildStandardTestCases

First debug until VHDL works for all $x_i = 0$

... then for all $x_i = 1$


Step 4 : pipeline

using nextCycle() and sync*() only in this case


A tutorial : building a $\sqrt{x^2 + y^2}$ operator


Specification

We are going to build a floating-point operator for $\sqrt{x^2 + y^2}$

Black box : two inputs, one output, just like an adder or a multiplier.

Mathematical specification : faithful rounding of the exact result

x and y considered exact

return one of the FP numbers surrounding the exact result

less accurate than correct rounding, but

last-bit accurate (all returned bits are useful), and

more accurate than a combination of two multipliers, one adder and a square root (4 rounding errors)

Algorithm : sum of squares, followed by a square root of the mantissa

with some guard bits for faithful rounding

expecting savings mostly in the sum of squares


Disclaimer (for hard-core arithmeticians)

It should be possible to fuse the whole computation in a single digit recurrence

more efficient for ASIC and DSP-less FPGAs

see works by Ercegovac, Muller, Lang, Bruguera, Takagi ...

... but beyond the scope of this work

and less interesting from a tutorial point of view : no component reuse


Step 0 : get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge


Step 1 : build a skeleton

We want to implement flopoco FP2DNorm wE wF

cd src

cp UserDefinedOperator.cpp FP2DNorm.cpp

cp UserDefinedOperator.hpp FP2DNorm.hpp

Inside the newly created files,

global replace UserDefinedOperator with FP2DNorm

empty the cpp of all flesh except initial declarations

Edit CMakeLists.txt (the file that defines the building process),

look for src/FPSqrt (because the interface is similar) and add src/FP2DNorm close to it

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

Compile and fix

Check that ./flopoco FP2DNorm 8 23 does nothing


Step 2 : overload emulate()

Here, get inspiration from src/FPSumOfSquares or src/FPAdderSinglePath.

a TestCase is a pair (input, allowed outputs)

emulate() inputs a TestCase with only the input defined

and defines the allowed outputs

Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert from / to arbitrary precision floating-point (MPFR)

After this step, test : ./flopoco FP2DNorm 5 10 TestBench 1000
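What emulate() has to compute here can be sketched as follows. The function name `faithful_norm` is hypothetical, and Python's exact `Fraction` and `math.isqrt` are used in place of MPFR. The sketch ignores the exponent range wE and the exceptional cases (zero, infinity, NaN), which a real emulate() must handle. It returns the two floating-point numbers with wF fraction bits that surround the exact result; they coincide when the result is representable.

```python
from fractions import Fraction
import math


def faithful_norm(x, y, wF):
    """Return (lo, hi) surrounding sqrt(x*x + y*y); x, y exact, not both zero."""
    s = Fraction(x) ** 2 + Fraction(y) ** 2       # exact sum of squares
    # find e such that 2**wF <= sqrt(s) / 2**e < 2**(wF + 1),
    # i.e. 4**wF <= s / 4**e < 4**(wF + 1)
    e = 0
    while s / Fraction(4) ** e >= 4 ** (wF + 1):
        e += 1
    while s / Fraction(4) ** e < 4 ** wF:
        e -= 1
    scaled = s / Fraction(4) ** e
    m_lo = math.isqrt(math.floor(scaled))          # floor of the scaled square root
    m_hi = m_lo if m_lo * m_lo == scaled else m_lo + 1
    ulp = Fraction(2) ** e
    return m_lo * ulp, m_hi * ulp


print(faithful_norm(3, 4, wF=3))  # (Fraction(5, 1), Fraction(5, 1))
print(faithful_norm(1, 1, wF=3))  # (Fraction(11, 8), Fraction(3, 2))
```

In the second call the exact result is the square root of 2, which lies strictly between 11/8 and 3/2, so both are allowed outputs.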


Step 3 : Write the combinatorial VHDL

use an exponent difference, a shifter, two squarers, and a square root

exponent difference is a subtraction, but a small one

followed by swap

Shift before or after the square ?

before enables single bit-heap implementation

... but requires a slightly larger squarer

after requires a slightly larger shifter

Hack the floating-point square root

no fixed-point version available
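A bit-level software model of this datapath can help before writing any VHDL. The sketch below takes the "shift before the square" option. The function name `norm2d_datapath`, the input encoding (an integer mantissa m with wF fraction bits and an exponent e, standing for m · 2^(e − wF)) and the number of guard bits g are all assumptions. The final normalisation and rounding to wF bits are left out, so the result still carries its guard bits.

```python
from fractions import Fraction
import math


def norm2d_datapath(ex, mx, ey, my, wF, g):
    """Model of: exponent difference, swap, shift, two squarers, square root."""
    # exponent difference followed by swap: (ex, mx) gets the larger exponent
    if ex < ey:
        ex, mx, ey, my = ey, my, ex, mx
    d = ex - ey
    big = mx << g                     # add g guard bits
    small = (my << g) >> d            # align before squaring (truncating shift)
    total = big * big + small * small # two squarers and one addition
    root = math.isqrt(total)          # integer square root of the sum
    return Fraction(root) * Fraction(2) ** (ex - wF - g)


# x = 3 is mantissa 12 with exponent 1, y = 4 is mantissa 8 with exponent 2 (wF = 3)
print(norm2d_datapath(1, 12, 2, 8, wF=3, g=4))  # 5
```

Comparing such a model against the exact reference for random inputs is one way to check that the chosen g is large enough.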


Conclusion


The “no killer app” theorem

For 20 years, the FPGA community has been waiting for the “killer application”.

(The widely useful application on which the FPGA is so much better)

Theorem : we’ll wait forever.

Proof : When such an application pops up,

either it is indeed widely useful, and next year’s Pentium will do it in hardware 10x faster than the FPGA, so it won’t be an FPGA killer app next year,

or the FPGA remains competitive next year, but it means that it was not a killer app.

The killer feature of FPGAs is flexibility

To exploit it, we do need infinitely many arithmetic operators.


Computing just right

In a Pentium

the choice is between

an integer SUV, or

a floating-point SUV.

In an FPGA

If all I need is a bicycle, I have the possibility to build a bicycle

(and I’m usually faster to destination)

Save routing ! Save power ! Don’t move useless bits around !


An almost virgin land

Most of the arithmetic literature addresses the construction of SUVs.


My personal record

Division by 3 (ARC 2012) : two weeks from the first intuition of the algorithm to complete pipelined FloPoCo implementation + paper submission.

Implementation time

10 minutes to obtain a testbench generator

2 days for the fully parametric combinatorial operator

(less than VHDL)

20 minutes for its fully flexible pipeline


So when do we have an FPGA in every PC ?

When they become as easy to program as processors ?

(now that’s a challenge)

(or do we quietly wait for processors to become as messy to program as FPGAs ?)


Thanks for your attention

The following people have contributed to FloPoCo : S. Banescu, N. Brunie, S. Collange, J. Detrey, P. Echeverría, F. Ferrandi, M. Grad, K. Illyes, M. Istoan, M. Joldes, C. Klein, D. Mastrandrea, B. Pasca, B. Popa, X. Pujol, D. Thomas, R. Tudoran, A. Vasquez.


http://flopoco.gforge.inria.fr/
