
Building Custom Arithmetic Operators with the FloPoCo Generator

Florent de Dinechin

(Title-slide formulas: √(x² + y² + z²), πx, sin(x + y), ∑_{i=0}^{n} x_i, √x · log x)

Outline

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion

F. de Dinechin A FloPoCo tutorial 2

Introduction: custom arithmetic

Two different ways of wasting silicon

Here are two universally programmable chips.

Who’s best for (insert your computation here)?


Are FPGAs any good at floating-point?

Long ago (1995), people ported the basic operations: +, −, ×. Versus the highly optimized FPU in the processor, each operator was 10x slower in an FPGA.

This is the unavoidable overhead of programmability.

If you lose according to a metric, change the metric.

Peak figures for double-precision floating-point exponential

Pentium core: 20 cycles / DPExp @ 4 GHz: 200 MDPExp/s

FPExp in FPGA: 1 DPExp/cycle @ 400 MHz: 400 MDPExp/s

Chip vs chip: 6 Pentium cores vs 150 FPExp/FPGA

Power consumption is also better; single-precision data even better.

(Intel MKL vector libm, vs FPExp in FloPoCo version 2.0.0)


Dura Amdahl lex, sed lex

SPICE Model-Evaluation, cut from Kapre and DeHon (FPL 2009)


Custom arithmetic (not your Pentium’s)

Never compute 1 bit more accurately than needed!

[Figure: architecture of a floating-point exponential — shift to fixed-point, constant multipliers by 1/log(2) and log(2), a truncated multiplier, a precomputed ROM, a generic polynomial evaluator, then normalize/round; datapath widths expressed in wE, wF and g guard bits]

Need a generator!

Useful operators that make sense in a processor

Should a processor include elementary functions? Yes (Paul & Wilson, 1976); No since the transition to RISC.

Should a processor include a divider and square root? Yes (Oberman et al., Arith, 1997); No since the transition to FMA (IBM, then HP, then Intel).

Should a processor include decimal hardware? Yes say IBM, No say Intel.

Should a processor include a multiplier by log(2)? No, of course.

Useful operators that make sense in an FPGA or ASIC

Elementary functions? Yes, iff your application needs it.

Divider or square root? Yes, iff your application needs it.

Decimal hardware? Yes, iff your application needs it.

A multiplier by log(2)? Yes, iff your application needs it.

In FPGAs, useful means: useful to one application.

Enough work to keep me busy until retirement

Arithmetic operators useful to at least one application :

Elementary functions (sine, exponential, logarithm...)

Algebraic functions (√(x² + y²), polynomials, ...)

Compound functions (log₂(1 ± 2ˣ), e^(−Kt²), ...)

Floating-point sums, dot products, sums of squares

Specialized operators : constant multipliers, squarers, ...

Complex arithmetic

LNS arithmetic

Decimal arithmetic

Interval arithmetic

...

Oh yes, basic operations, too.


What do we call arithmetic operators?

An arithmetic operation is a function (in the mathematical sense):

few well-typed inputs and outputs; no memory or side effects (usually)

An operator is the implementation of such a function

IEEE-754 FP standard: operator(x) = rounding(operation(x))

→ Clean mathematical definition (even for floating-point arithmetic)

The operator as a circuit...

... is a directed acyclic graph (DAG):

easy to build and pipeline

easy to test against its mathematical specification

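The definition operator(x) = rounding(operation(x)) can be checked numerically. A minimal Python sketch (an illustration, not FloPoCo code) uses exact rational arithmetic to verify that a floating-point square root is correctly rounded, i.e. that no neighbouring double is closer to the exact result:

```python
from fractions import Fraction
import math

def is_correctly_rounded_sqrt(x: float) -> bool:
    """Check operator(x) == rounding(operation(x)) for sqrt:
    no adjacent double may be closer to the exact square root of x."""
    r = math.sqrt(x)
    exact = Fraction(x)          # x is exactly representable as a rational
    err = abs(Fraction(r) ** 2 - exact)
    # compare against the two neighbouring doubles of r
    for n in (math.nextafter(r, 0.0), math.nextafter(r, math.inf)):
        if abs(Fraction(n) ** 2 - exact) < err:
            return False
    return True
```

IEEE-conformant square roots pass this check for ordinary inputs; a truncated or faithful-only implementation would not.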

The benefits of custom computing

Example : a floating-point sum of squares

x² + y² + z²

(not a toy example but a useful building block)

A square is simpler than a multiplication: half the hardware required.

x², y², and z² are positive: one half of your FP adder is useless.

Accuracy can be improved: 5 rounding errors in the floating-point version; (x² + y²) + z² is asymmetrical.

The FloPoCo Recipe

Floating-point interface for convenience

Clear accuracy specification for computing just right

Fixed-point internal architecture for efficiency

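The accuracy claim is easy to reproduce in plain Python (an illustration of the rounding behaviour, not of FloPoCo's architecture): the naive floating-point evaluation rounds five times and depends on operand order, whereas a custom operator that sums exactly and rounds once does not.

```python
from fractions import Fraction

def fp_sum_of_squares(x, y, z):
    """Naive double-precision version: 3 squarings + 2 additions,
    i.e. 5 correctly rounded operations, evaluated left to right."""
    return (x * x + y * y) + z * z

def custom_sum_of_squares(x, y, z):
    """Behaviour of the custom operator: exact sum, one final rounding."""
    return float(Fraction(x)**2 + Fraction(y)**2 + Fraction(z)**2)

# the naive version loses the small squares and depends on operand order:
print(fp_sum_of_squares(1e8, 1.0, 1.0))      # 1e+16
print(fp_sum_of_squares(1.0, 1.0, 1e8))      # 1.0000000000000002e+16
print(custom_sum_of_squares(1e8, 1.0, 1.0))  # 1.0000000000000002e+16
```

With x = 10⁸, the small squares are absorbed when added last but survive when added first; the exact-then-round version gives the symmetrical, correctly rounded answer in both orders.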

A floating-point adder

[Figure: classical floating-point adder datapath — exponent difference / swap, a close path (1-bit shift, leading-zero count/shift) and a far path (alignment shift, guard/round/sticky bits, prenormalization), then rounding, normalization and exception handling; internal widths around p + 1 bits]

A fixed-point architecture

[Figure: fixed-point sum-of-squares datapath — unpack, sort operands by exponent, three squarers, two alignment shifters, a single wide adder, then normalize/pack; internal widths expressed in wE, wF and g guard bits]

The benefits of custom computing

A few results for floating-point sum-of-squares on Virtex4:

Single precision    | area                 | performance
LogiCore classic    | 1282 slices, 20 DSP  | 43 cycles @ 353 MHz
FloPoCo classic     | 1188 slices, 12 DSP  | 29 cycles @ 289 MHz
FloPoCo custom      |  453 slices,  9 DSP  | 11 cycles @ 368 MHz

Double precision    | area                 | performance
FloPoCo classic     | 4480 slices, 27 DSP  | 46 cycles @ 276 MHz
FloPoCo custom      | 1845 slices, 18 DSP  | 16 cycles @ 362 MHz

All performance metrics improved; FLOP/s per area more than doubled.

Plus: the custom operator is more accurate, and symmetrical.

Custom also means: custom pipeline

[Figure: the same fixed-point sum-of-squares datapath, pipelined three different ways]

One operator does not fit all:

Combinatorial

Low frequency, low resource consumption

Faster but larger (more registers)

History of the FloPoCo project

Initial goal: FPGA arithmetic the way it should be

that is: open-ended, unlike processor arithmetic

an open-ended list of custom operators

open-ended data formats: all operators fully parameterized

open-ended performance trade-offs: flexible pipeline

General philosophy: computing just right

A generator framework (written in C++, outputting VHDL)

Objectives:

portable, target-agnostic (more on this later)

VHDL should be simulable and synthesizable using free tools

Design philosophy:

VHDL generation by printing arbitrary VHDL from C++

automate repetitive and error-prone tasks (such as pipelining)


Beyond the plan

Custom arithmetic for ASIC and HLS

the FloPoCo framework was successfully used to design the FPU of the Kalray processor

FloPoCo provides the floating-point back-end to the PandA project (Politecnico di Milano)

Not enough “custom” yet to my taste


FloPoCo for application developers

Here should come a demo

FloPoCo is freely available from

http://flopoco.gforge.inria.fr/

Command-line syntax: a sequence of operator specifications

Options: target frequency, target hardware, ...

Output: synthesizable VHDL.

First something classical

A single-precision floating-point adder (8-bit exponent and 23-bit mantissa)

flopoco -pipeline=no FPAdder 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

|---Entity IntAdder_27_f400_uid7

|---Entity LZCShifter_28_to_28_counting_32_uid14

|---Entity IntAdder_34_f400_uid17

Entity FPAdder_8_23_uid2

Output file: flopoco.vhdl

To probe further:

flopoco -pipeline=no FPAdder 11 52   (double precision)

flopoco -pipeline=no FPAdder 9 34   (just right for you)

Classical floating-point, continued

A complete single-precision FPU in a single VHDL file:

flopoco -pipeline=no FPAdder 8 23 FPMultiplier 8 23 23 FPDiv 8 23 FPSqrt 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

|---Entity IntAdder_27_f400_uid7

|---Entity LZCShifter_28_to_28_counting_32_uid14

|---Entity IntAdder_34_f400_uid17

Entity FPAdder_8_23_uid2

Entity Compressor_2_2

Entity Compressor_3_2

| |---Entity IntAdder_49_f400_uid39

|---Entity IntMultiplier_UsingDSP_24_24_48_unsigned_uid26

|---Entity IntAdder_33_f400_uid47

Entity FPMultiplier_8_23_8_23_8_23_uid24

Entity FPDiv_8_23

Entity FPSqrt_8_23

Output file: flopoco.vhdl


Frequency-directed pipelining

The same FPAdder, pipelined for 300 MHz:

flopoco -pipeline=yes -frequency=300 FPAdder 8 23

FloPoCo interface to pipeline construction

“Please pipeline this operator to work at 200MHz”

Not the choice made by other core generators...

... but better because compositional

When you assemble components working at frequency f, you obtain a component working at frequency f.


Examples of pipeline

flopoco -pipeline=yes -frequency=400 FPAdder 8 23

Final report:

|---Entity FPAdder_8_23_uid2_RightShifter

| Pipeline depth = 1

|---Entity IntAdder_27_f400_uid7

| Pipeline depth = 1

|---Entity LZCShifter_28_to_28_counting_32_uid14

| Pipeline depth = 4

|---Entity IntAdder_34_f400_uid17

| Pipeline depth = 1

Entity FPAdder_8_23_uid2

Pipeline depth = 10

flopoco -pipeline=yes -frequency=200 FPAdder 8 23

Final report:

(...)

Pipeline depth = 4


Of course the frequency depends on the target FPGA

flopoco -pipeline=yes -frequency=200 -target=Spartan3 FPAdder 8 23

Final report:

(...)

Pipeline depth = 14

flopoco -pipeline=yes -frequency=200 -target=Virtex6 FPAdder 8 23

Final report:

(...)

Pipeline depth = 3

Altera and Xilinx targets are currently supported (at various levels of accuracy): Spartan3, Virtex4, Virtex5, Virtex6, StratixII, StratixIII, StratixIV, StratixV, CycloneII, CycloneIII, CycloneIV, CycloneV.

Also match the architecture to the target FPGA

Compare:

flopoco -pipeline=no -target=Virtex4 IntConstDiv 16 3 -1

flopoco -pipeline=no -target=Virtex6 IntConstDiv 16 3 -1

[Figure: constant division by 3 as a chain of LUTs — each LUT takes a 4-bit chunk of x and the 2-bit remainder from the previous stage, and produces a 4-bit quotient chunk q_i and a 2-bit remainder, from r4 = 0 down to r0 = r]

Architecture specificities:

LUTs

DSP blocks

memory blocks

And FloPoCo attempts to model the delays of these blocks for efficient pipelining.
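The LUT chain for division by 3 implements a simple digit recurrence. A Python sketch of the same algorithm (an illustration, not generated code): process x in 4-bit chunks from the most significant end, where each step — a map from (remainder, chunk) to (quotient chunk, remainder) — is exactly what one LUT computes in hardware.

```python
def int_const_div3(x: int, n_chunks: int = 4):
    """Divide a 16-bit integer by 3 via a radix-16 digit recurrence.
    Each iteration is one small table lookup in the LUT-based architecture."""
    r = 0                                   # initial remainder r4 = 0
    q = 0
    for i in reversed(range(n_chunks)):     # most significant chunk first
        chunk = (x >> (4 * i)) & 0xF        # 4-bit chunk x_i
        y = (r << 4) | chunk                # 6-bit value: 2-bit rem, 4-bit chunk
        q = (q << 4) | (y // 3)             # y < 48, so y // 3 fits in 4 bits
        r = y % 3                           # next remainder, 2 bits
    return q, r

print(int_const_div3(100))                  # (33, 1)
```

An exhaustive check over all 16-bit inputs (like TestBenchFile -2 does for the generated VHDL) confirms the recurrence agrees with ordinary integer division.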

Frequency-directed pipelining in practice

We do our best but we know it’s hopeless

The actual frequency obtained will depend on the whole application (placement, routing pressure, etc.)...

best-effort philosophy,

aiming to be accurate to 10% for an operator synthesized alone

asking for a higher frequency yields a deeper pipeline

And a big TODO : VLSI targets.


Non-standard operators

Correctly rounded divider by 3: flopoco FPConstDiv 8 23 3

Floating-point exponential: flopoco FPExp 8 23

Multiplication of a 32-bit signed integer by the constant 1234567 (two algorithms, your mileage may vary):

flopoco IntIntKCM 32 1234567 0

flopoco IntConstMult 32 1234567

Full list in the documentation, or by typing just flopoco. Sorry for the sometimes incomplete or inconsistent interface.
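A constant multiplier like IntConstMult is built from shifts and additions rather than a generic multiplier. As a rough illustration of the idea (not FloPoCo's actual recoding algorithm), here is a Python sketch using a canonical signed-digit recoding of the constant, which minimizes the number of adders/subtractors:

```python
def csd_digits(c: int):
    """Canonical signed-digit recoding of a positive constant:
    digits in {-1, 0, +1}, LSB first, with no two adjacent nonzero digits."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)      # +1 if c % 4 == 1, else -1
            digits.append(d)
            c -= d
        else:
            digits.append(0)
        c >>= 1
    return digits

def const_mult(x: int, c: int) -> int:
    """Multiply x by the constant c using only shifts and add/subtract:
    one adder or subtractor per nonzero recoded digit."""
    acc = 0
    for i, d in enumerate(csd_digits(c)):
        if d:
            acc += d * (x << i)
    return acc
```

For c = 7 the recoding is 8 − 1 (one subtractor instead of two adders); for a large constant like 1234567 the nonzero-digit count determines the adder cost of the generated circuit.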

Don’t trust us

Two operators, TestBench and TestBenchFile, generate test benches for the operator preceding them on the command line.

flopoco FPExp 8 23 TestBenchFile 10000 generates 10000 random tests

flopoco IntConstDiv 16 3 -1 TestBenchFile -2 generates an exhaustive test

Specification-based test bench generation

Not by simulation of the generated architecture !

Helper functions for encoding/decoding the FP format, if you want to check the testbench...

fp2bin 9 36 3.1415926

bin2fp 9 36

010100000000100100100001111110110100110100010011

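FloPoCo's FP format (as manipulated by fp2bin/bin2fp above) prefixes the usual sign/exponent/fraction fields with two exception bits. A small Python decoder for the normal-number case — a sketch written from this field layout, which is an assumption worth checking against the FloPoCo documentation:

```python
def flopoco_bin2fp(bits: str, wE: int, wF: int) -> float:
    """Decode a FloPoCo FP bit string: 2 exception bits, 1 sign bit,
    wE exponent bits (biased), wF fraction bits.
    Only the normal-number case (exception bits '01') is handled here."""
    assert len(bits) == 2 + 1 + wE + wF
    exc, sign = bits[:2], bits[2]
    assert exc == "01", "only normal numbers handled in this sketch"
    e = int(bits[3:3 + wE], 2) - (2 ** (wE - 1) - 1)   # remove the bias
    frac = int(bits[3 + wE:], 2) / 2.0 ** wF           # fraction in [0, 1)
    return (-1.0) ** int(sign) * (1.0 + frac) * 2.0 ** e
```

Decoding the 48-bit string printed on the slide with wE = 9, wF = 36 recovers 3.1415926 (up to the 36-bit rounding), which is how one can sanity-check a generated testbench by hand.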

Open-ended operators

A polynomial evaluator for arbitrary functions

Example: flopoco FunctionEvaluator "(sin(x*Pi/2))^2" 32 32 4

4 is the degree of the polynomial, which lets you express a memory/multiplier trade-off.

Works for the set of functions for which it works.

Another one is HOTBM; all this is still work in progress.

FPPipeline

Example: flopoco FPPipeline pipeline.txt 8 23

where pipeline.txt contains something like

r = sqrt(sqr(x0-x1)+sqr(y0-y1)); output r;

⊕ Properly synchronizes the various sub-components

⊕ Makes it easy to explore various precisions

Using this is against the philosophy of FloPoCo!

which is: let’s build a custom operator instead

Quick and dirty implementation:

not all operators supported, constants not always specialized; no TestBench support (not a trivial thing to add)

Anybody in the audience doing serious HLS?


Computing just right

A bit of plain common sense

Common wisdom

The more accurate you compute, the more expensive it gets

In practice

We (hopefully) notice it when our computation is not accurate enough.

But do we notice it when it is too accurate for our needs?

Reconciling performance and accuracy?

Or: regain performance by computing just right?


Double precision spoils us

The standard binary64 format (formerly known as double precision) provides roughly 16 decimal digits.

Why should anybody need such accuracy ?

Count the digits in the following

Definition of the second: the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom.

Definition of the metre: the distance travelled by light in vacuum in 1/299,792,458 of a second.

Most accurate measurement ever (another atomic frequency): to 14 decimal places

Most accurate measurement of the Planck constant to date: to 7 decimal places

The gravitational constant G is known to 3 decimal places only

Parenthesis: then why binary64?

This PC computes 10⁹ operations per second (1 gigaflop/s)

An allegory due to Kulisch

print the numbers in 100 lines of 5 columns, double-sided: 1000 numbers/sheet

1000 sheets ≈ a heap of 10 cm

10⁹ flop/s ≈ a heap growing at 100 m/s, or 360 km/h

A teraflop/s (10¹² op/s) prints to the moon in one second

Current top-500 computers reach the petaflop/s (10¹⁵ op/s)

Each operation may involve a relative error of 10⁻¹⁶, and these errors accumulate.

Doesn’t this sound wrong?

Would we use these 16 digits just to accumulate garbage in them?


Back to the point

Mastering accuracy for performance

When implementing a “computing core”

A goal : never compute more accurately than needed

A tool : flexible arithmetic

Qualitatively : use FPConstDiv instead of a divider
Quantitatively : tailor your data formats to your problem

Sub-goals

Know what accuracy you need
Know how accurate you compute

“Computing cores” considered so far : elementary functions, sums of products, linear algebra, Euclidean lattice algorithms, ...

Hopelessly difficult in the general case ... but at least ask this question systematically

F. de Dinechin A FloPoCo tutorial 36

My current crusade

The evil (at least on FPGAs)

A fast implementation of single-precision floating-point exponential (but accurate to 2^−8 only)

Do you see why it is wrong ?

A line I shall have in each of my talks until the world is saved

Save routing ! Save power ! Don’t move useless bits around !

Or maybe this one

Do you really need to compute this bit ?

F. de Dinechin A FloPoCo tutorial 37

FloPoCo for developers of HLSsoftware

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion

F. de Dinechin A FloPoCo tutorial 38

Using FloPoCoLib

From file src/main_minimal.cpp :

#include "FloPoCo.hpp"
using namespace std;
using namespace flopoco;

int main(int argc, char* argv[]) {
    Target* target = new Virtex4();

    int wE = 9;
    int wF = 33;
    Operator *op = new FPAdderSinglePath(target, wE,wF, wE,wF, wE,wF);

    ofstream file;
    file.open("FPAdder.vhdl", ios::out);
    op->outputVHDLToFile(file);
    file.close();

    cerr << endl << "Final report:" << endl;
    op->outputFinalReport(0);
}

F. de Dinechin A FloPoCo tutorial 39

FloPoCo class hierarchy

[Class diagram]

Operator : signalList, vhdl ; outputVHDL(), emulate(), buildStandardTestCases()

Subclasses of Operator : FPAdder (wE, wF), IntAdder (size), Shifters, Collision (wE, wF), TestBench

Signal : width, cycle, lifeSpan

Targets : Virtex4, StratixII

F. de Dinechin A FloPoCo tutorial 40

Custom arithmetic for HLS (an outsider point of view)

All the optimizations you do when compiling for a processor, plus

a+a should be replaced with 2*a, not the opposite !

fewer optimizations prevented by bit-for-bit or cornercase issues

a C compiler is not allowed to replace x-0.0 with x

or 2*x/2 with x

both issues can be solved by (very small) cornercase-fix logic

more opportunities of operator specialization :

multiplication and division by a constant
a squarer for a*a,
a sine on one period only...

endless opportunities of operator fusion

did I mention √(x² + y²) ?
bit-for-bit compatibility lost, replaced with guaranteed accuracy

Eventually we will need new languages.

F. de Dinechin A FloPoCo tutorial 41

FloPoCo for developers of customarithmetic

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion

F. de Dinechin A FloPoCo tutorial 42

A modestly object-oriented approach

FloPoCo is not a C++-based HDL

VHDL generation is “print-based”

vhdl << "SoS <= EA(wE-1 downto 0) & Fraction;" ;

easy to port existing work (FPLibrary)
easy learning curve for the VHDL-literate
at least the expressive power of VHDL !

Many helper functions help doing the prints
Example : VHDL signal declaration

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE-1 downto 0) & Fraction;" ;

F. de Dinechin A FloPoCo tutorial 43
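For readers unfamiliar with the idea, here is a minimal stand-alone C++ sketch of this print-based style (a toy model of my own, not FloPoCo's actual Operator class) : declare() records the signal for the later declaration section and returns the name, so it can be streamed directly into the architecture body.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy model of print-based VHDL generation (names are illustrative,
// not the real FloPoCo API).
struct MiniOperator {
    std::ostringstream vhdl;                          // architecture body
    std::vector<std::pair<std::string,int>> signals;  // (name, width)

    // Record the signal, then return its name so that it can be used
    // in the very statement that defines it.
    std::string declare(const std::string& name, int width) {
        signals.push_back({name, width});
        return name;
    }

    // Emit the collected declarations, then the streamed body.
    std::string outputVHDL() const {
        std::ostringstream out;
        for (const auto& s : signals)
            out << "signal " << s.first << " : std_logic_vector("
                << s.second - 1 << " downto 0);\n";
        out << "begin\n" << vhdl.str();
        return out.str();
    }
};
```

Streaming `op.vhdl << op.declare("SoS", 42) << " <= EA(wE-1 downto 0) & Fraction;\n";` thus both declares SoS and uses it in one statement, which is the convenience the real declare() helper provides.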

Workflow of designing a new operator in FloPoCo

1. Create an operator skeleton

copy src/UserDefinedOperator.cpp (heavily documented)
add your file in CMakeLists.txt and src/FloPoCo.hpp

add command-line interface + doc in src/main.cpp

2. Define the specification : write the emulate() method

syntax is quite OOgly but imitation is your friend
a few lines to change from a type-matching operator

... then TestBench and TestBenchFile will work

3. Write the code generating combinatorial VHDL

Basically embed your VHDL in C++ prints

4. Add code for self-pipelining

Roughly the same effort as it would take to get one pipeline instance
... but you get a frequency-directed pipeline for all the FloPoCo targets

F. de Dinechin A FloPoCo tutorial 44

My personal record

Two weeks from the first intuition of the algorithm
to complete pipelined FloPoCo implementation + paper submission
(Division by small integer constants, ARC 2012)

Implementation time

10 minutes to obtain a testbench generator

1/2 day for the integer Euclidean division

20 min for its flexible pipeline

1/2 day for the FP divider by 3

and again 20 min

Time savers :

Test bench generator for correct combinatorial operation

Pipeline framework (for filling result tables in the paper)

(Time wasters : maintaining all this instead of writing new operators)

F. de Dinechin A FloPoCo tutorial 45

Pipeline made easy

[Figure : the pipelined √(x² + y²) datapath — unpack, sorts, squarers, shifters, add, normalize/pack — annotated with signal widths (1+wF, 2+wF+g, 4+wF+g, wE+wF+g) and cycle numbers 0 to 6]

Notion of current cycle during VHDL output

Each signal has a cycle attribute (at which cycle it is defined)

vhdl << declare("EA", wE) << " <= ... ;" ; // cycle(EA)=0
nextCycle();
vhdl << declare("EB", wE) << " <= ... ;" ; // cycle(EB)=1
setCycle(0);
vhdl << ... ;

FloPoCo looks for signal names on the right-hand side
and delays them by (currentCycle − defCycle) cycles :

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE-1 downto 0) & Fraction;" ;

Generated VHDL :

SoS <= EA_d6(wE-1 downto 0) & Fraction_d1;

(the needed shift registers are transparently declared and built)

Managing the current cycle :

n = getCycle();
setCycle(n);
nextCycle();
syncCycleWithSignal("EA");

Frequency-directed pipelining :

manageCriticalPath( target->adderDelay(n) );

F. de Dinechin A FloPoCo tutorial 46
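The right-hand-side delaying can be modelled in a few lines of stand-alone C++ (an illustrative sketch, not FloPoCo's actual two-pass parser ; the function name delayRHS is mine) : every known signal appearing in an expression is renamed name_dN, with N the difference between the current cycle and the cycle where the signal was defined.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Toy model of the second generation pass: rewrite a right-hand side
// so that each signal defined in an earlier cycle is replaced by its
// registered copy name_dN, where N = currentCycle - definition cycle.
std::string delayRHS(const std::string& rhs,
                     const std::map<std::string,int>& defCycle,
                     int currentCycle) {
    std::string out = rhs;
    for (const auto& [name, cyc] : defCycle) {
        int d = currentCycle - cyc;
        if (d <= 0) continue;  // defined in the current cycle: no delay
        std::regex word("\\b" + name + "\\b");
        out = std::regex_replace(out, word,
                                 name + "_d" + std::to_string(d));
    }
    return out;
}
```

With EA defined at cycle 0, Fraction at cycle 5, and the current cycle being 6, the expression "EA(wE-1 downto 0) & Fraction" becomes "EA_d6(wE-1 downto 0) & Fraction_d1", which is exactly the shape of the generated VHDL.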

manageCriticalPath(delay)

A currentCriticalPath variable is maintained during VHDL generation

manageCriticalPath(delay)

adds delay to currentCriticalPath
if the sum is larger than 1/targetFrequency :

    currentCycle++
    and currentCriticalPath is reset to delay

delay is an estimation of the delay of the following block of VHDL

you (the developer) have to perform this estimation

... typically computed using timing methods of the Target class

adderDelay(int n)

localRoutingDelay(int fanout)

This is how the pipeline adapts to the actual target.

F. de Dinechin A FloPoCo tutorial 47
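This bookkeeping fits in a few lines ; here is a toy C++ model of it (my reconstruction of the behaviour described above, not the actual FloPoCo source) :

```cpp
#include <cassert>

// Toy model of manageCriticalPath(): accumulate combinatorial delay;
// when it would exceed the target clock period, start a new cycle
// (i.e. insert a register level) and restart the critical path with
// the delay of the incoming block.
struct PipelineModel {
    double period;              // 1 / target frequency
    int    currentCycle = 0;
    double criticalPath = 0.0;

    void manageCriticalPath(double delay) {
        if (criticalPath + delay > period) {
            ++currentCycle;        // register inserted here
            criticalPath = delay;  // new cycle starts with this block
        } else {
            criticalPath += delay;
        }
    }
};
```

With a 2.5 ns period and three successive 1 ns blocks, the third call overflows the period : a register level is inserted and the critical path restarts at 1 ns. Raising the target frequency shortens the period and thus produces a deeper pipeline, which is how the pipeline adapts to the target.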

The magic explained

[Figure : the pipelined √(x² + y²) datapath again]

Pipeline adapts to random insertions of pipeline levels anywhere

during development, tinkering, fine-tuning

or during frequency-directed, target-optimized pipelining

Graceful degradation to a combinatorial operator

in which case the vhdl stream is just output untouched

Suits the “print VHDL” philosophy

FloPoCo code is much shorter than the VHDL it generates.

F. de Dinechin A FloPoCo tutorial 48

From the developer point of view

Clean separation of functionality and performance issues

You write to the vhdl stream only combinatorial VHDL

FloPoCo parses it to insert registers (in two passes)

Timing is managed out of the VHDL code

Subcomponents also delay their outputs WRT their inputs, etc

Correct-by-construction pipelining

You cannot obtain a non-functional operator, starting from a working combinatorial one

(at least not without very loud warnings)

This part of the framework can be trusted

nothing more complicated than integer subtractions and comparisons
and of course, test bench generators adapt to the pipeline

You can obtain a bad pipeline, though

unbalanced, over-pipelined, missing the target frequency, ...

F. de Dinechin A FloPoCo tutorial 49

[Flow diagram : how an operator is generated]

The constructor (C++) — recursively calling the constructors of all sub-components to know their pipeline depth — produces the functional information :

the signal list (bitwidth, etc)
the subcomponent list
the vhdl stream (combinatorial description in structural VHDL syntax)

and the pipeline information :

for each signal, a cycle value
for each signal, a lifeSpan value

Operator.outputVHDL() then uses all this to produce the output file :

generation of VHDL declarations (entity, ports, components, signals)
generation of VHDL register code (process(clk) ... end process)
generation of VHDL architecture code (second pass on the vhdl stream, delaying right-hand side signals)

F. de Dinechin A FloPoCo tutorial 50

We could have done it better

History of development shows a bit

Do we really need an interface with explicit cycle management ?

In FloPoCo, time=(cycle, critical path within the cycle)

but the user interface should consist only of delay information

A delay should be associated systematically to signal declaration

Too much code to change it now...

F. de Dinechin A FloPoCo tutorial 51

Matching the architecture to the target

Perfectly valid code :

if (target == "StratixII") {
    // output some VHDL
}
else if (target == "Virtex4") {
    // output other VHDL
}
else ...

Try to avoid it :

Model architecture details

LUTs
DSP blocks
memory blocks

Model delays

Definitely an endless effort.

F. de Dinechin A FloPoCo tutorial 52

One slide on testing

emulate() performs bit-accurate emulation

Not by architecture simulation !
By expressing the mathematical specification...
in typically a few lines of MPFR (www.mpfr.org)

buildStandardTestCases() for corner cases and regression tests

buildRandomTestCases() to bias the random generator in an operation-specific way

default is uniform on the bits.
Example 1 : the exponential's useful range is very small, bias the random generator towards it
Example 2 : cancellation cases in floating-point additions

may (should) be overloaded by each Operator

The special TestBench and TestBenchFile operators invokethese methods

F. de Dinechin A FloPoCo tutorial 53
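As an illustration of “several acceptable values” : a faithfully rounded operator may return either of the two representable values bracketing the exact result, so emulate() adds both to the test case. A stand-alone C++ sketch (a hypothetical helper of mine, not the real TestCase API) :

```cpp
#include <cassert>
#include <cmath>
#include <set>

// Faithful rounding to integers (stand-in for any fixed-point grid):
// both floor and ceiling of the exact value are acceptable outputs.
// When the exact value is representable, the set has a single element.
std::set<long long> faithfulOutputs(double exact) {
    long long lo = static_cast<long long>(std::floor(exact));
    long long hi = static_cast<long long>(std::ceil(exact));
    return { lo, hi };
}
```

A test bench built this way accepts any implementation whose result lands on one of the two enclosing grid points, which is precisely the freedom faithful rounding grants to the architecture.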

A tutorial : building an integer divider by 3

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion

F. de Dinechin A FloPoCo tutorial 54

Anybody here remembers how we compute divisions ?

[Worked example : the schoolbook long division 776 ÷ 3 = 258, remainder 2, computed one digit at a time]

iteration body : Euclidean division of a 2-digit decimal number by 3

The first digit is a remainder from previous iteration :its value is 0, 1 or 2

Possible implementation as a look-up table that, for each value from 00 to 29, gives the quotient and the remainder of its division by 3.

F. de Dinechin A FloPoCo tutorial 55

Mathematical obfuscation, in binary-friendly radix

Representation of x in radix β = 2^α

(splitting the binary decomposition of x into k chunks of α bits)

procedure ConstantDiv(x, d)
    r_k ← 0
    for i = k − 1 down to 0 do
        y_i ← x_i + 2^α · r_{i+1}    (this + is a concatenation)
        (q_i, r_i) ← (⌊y_i/d⌋, y_i mod d)    (read from a table)
    end for
    return q = Σ_{i=0}^{k−1} q_i · 2^{αi}, r_0
end procedure

Each iteration

consumes α bits of x, and a remainder of size γ = ⌈log2 d⌉
produces α bits of q, and a remainder of size γ

Implemented as a table with α + γ bits in, α + γ bits out

F. de Dinechin A FloPoCo tutorial 56
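The recurrence can be checked in plain stand-alone C++, with the hardware table lookup replaced by an actual division of the small value y_i (the function name and chunk handling are mine, following the slide's notation) :

```cpp
#include <cassert>
#include <cstdint>

// Software model of the ConstantDiv recurrence: process x in k chunks
// of alpha bits, most significant first; each step divides the value
// y_i = (r_{i+1} concatenated with x_i) by d. In hardware, q_i and r_i
// would be read from a LUT indexed by y_i.
// Returns the quotient; writes the final remainder through *rem.
uint64_t constantDiv(uint64_t x, unsigned d, unsigned alpha,
                     unsigned k, uint64_t* rem) {
    uint64_t q = 0, r = 0;
    const uint64_t mask = (uint64_t(1) << alpha) - 1;
    for (int i = int(k) - 1; i >= 0; --i) {
        uint64_t xi = (x >> (alpha * unsigned(i))) & mask;
        uint64_t yi = (r << alpha) + xi;  // concatenate r and x_i
        q = (q << alpha) + yi / d;        // LUT output: quotient digit
        r = yi % d;                       // LUT output: next remainder
    }
    *rem = r;
    return q;
}
```

For x = 776, d = 3, α = 2 and k = 5 (a 10-bit input), this returns q = 258 and r = 2, matching the decimal worked example of the previous slide.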

Sequential implementation

[Figure : sequential implementation — a single LUT with clk and reset, consuming x_i (α bits) and the fed-back remainder r_{i+1} (γ bits), producing q_i (α bits) and r_i (γ bits) each cycle]

F. de Dinechin A FloPoCo tutorial 57

Unrolled implementation

[Figure : unrolled implementation — four chained LUTs, r_4 = 0 fed into the first, inputs x_3 ... x_0, outputs q_3 ... q_0 and the final remainder r_0 = r]

F. de Dinechin A FloPoCo tutorial 58

Setting the parameters for LUTs

[Figure : the unrolled implementation again]

For instance, assuming 6-input LUTs (e.g. LUT6)

A 6-bit-in, 6-bit-out table consumes 6 LUT6

Size of the remainder is γ = ⌈log2 d⌉

If d < 2^5, very efficient architecture : α = 6 − γ
The smaller d, the better

Easy to pipeline (one register behind each LUT for free)

F. de Dinechin A FloPoCo tutorial 59
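The sizing rule above is easy to mechanize ; a small stand-alone C++ helper (my own sketch, under the one-LUT-per-output-bit assumption stated on the slide) :

```cpp
#include <cassert>
#include <cmath>

// Back-of-the-envelope sizing of the divide-by-d iteration table on
// l-input LUTs: gamma = ceil(log2 d) remainder bits, alpha = l - gamma
// bits of x consumed per iteration; the table has alpha+gamma inputs
// and alpha+gamma outputs, hence alpha+gamma LUTs per iteration.
struct LutCost { unsigned alpha, gamma, lutsPerIter; };

LutCost constDivCost(unsigned d, unsigned lutInputs) {
    unsigned gamma = static_cast<unsigned>(
        std::ceil(std::log2(static_cast<double>(d))));
    unsigned alpha = lutInputs - gamma;
    return { alpha, gamma, alpha + gamma };
}
```

For d = 3 on LUT6 this gives γ = 2, α = 4 and 6 LUT6 per iteration, consistent with the slide ; a w-bit dividend then needs ⌈w/α⌉ such tables in the unrolled architecture.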

Step 0 : get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge

Tutorial files in a tgz on the tutorial webpage.


Step 1 : build a skeleton

We want to implement flopoco IntConstDiv w d

cd src

cp UserDefinedOperator.cpp IntConstDiv.cpp

cp UserDefinedOperator.hpp IntConstDiv.hpp

Inside the newly created files,

global replace UserDefinedOperator with IntConstDiv

empty the cpp of all flesh except initial declarations

Edit CMakeLists.txt (the file that defines the building process),

add src/IntConstDiv somewhere

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

Compile and fix

Check that ./flopoco IntConstDiv 32 3 does nothing


Step 2 : overload emulate()

Your d should be stored in the IntConstDiv somewhere

What emulate() does

a TestCase is a pair (input, allowed outputs)

emulate() inputs a TestCase with only the input defined

and defines the allowed outputs – here two

Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)

no need here

You may define several acceptable values for an input

typically for faithful rounding (two values)

After this step, wonder at :
./flopoco IntConstDiv 10 3 TestBench 1000

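The "several acceptable values" idea can be sketched independently of FloPoCo's TestCase machinery (a hypothetical helper, not the actual API):

```python
from fractions import Fraction
import math

def faithful_outputs(exact, p):
    """Acceptable output values (scaled by 2^p) when the result must be
    a faithful rounding of `exact` to precision 2^-p: both neighbours
    are allowed, collapsing to one value when `exact` is representable."""
    scaled = Fraction(exact) * (1 << p)
    lo = math.floor(scaled)
    return {lo} if scaled == lo else {lo, lo + 1}
```

For an exact operator such as the constant divider, emulate() registers a single expected value per output instead.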


Step 3 : Write the combinatorial VHDL

Use the abstract Table operator

All you need is to overload the function method

Other useful generic operators :

Shifters and leading-zero counters for floating-point
Pipelined adders, multipliers, ...


Step 3.5 : overload BuildStandardTestCases

First debug until VHDL works for all xi = 0

... then for all xi = 1


Step 4 : pipeline

using nextCycle() and sync*() only in this case


A tutorial : building a faithful FIR

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion


Specification

Function

r = Σ_{i=1}^{n} c_i x_i

The c_i are constants provided with arbitrary accuracy on the command line

the x_i are signed fixed-point inputs in (1, p) format

Accuracy : faithful rounding of the exact result

x_i and c_i considered exact

return one of the two numbers surrounding the exact result :

next-best after correct rounding (which is too expensive)
last-bit accurate (all returned bits are useful),
more accurate than naive assembly of multipliers and adders


Algorithm and error budget

sum the products truncated to precision 2^(−p−g), i.e. g guard bits,

add one bit at weight 2^(−p−1),

and truncate to precision 2^(−p), i.e. to the result format

... where g = 1 + ⌈log2(n − 1)⌉

Design space exploration

The sum can be implemented as

a rake of adders (simplest, let’s start with that)

a tree of adders (slightly more complex, good we have C++)

a single BitHeap object (on the edge)

Step 0 : get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge


Step 1 : build a skeleton

We want to implement flopoco FixedPointFIR n w a1 ... an

cd src

cp UserDefinedOperator.cpp FixedPointFIR.cpp

cp UserDefinedOperator.hpp FixedPointFIR.hpp

Inside the newly created files,

global replace UserDefinedOperator with FixedPointFIR

empty the cpp of all flesh except initial declarations

Edit CMakeLists.txt (the file that defines the building process),

add src/FixedPointFIR somewhere

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

NewCompressorTree has an arbitrary-size parameter list

Compile and fix

Check that ./flopoco FixedPointFIR 4 8 1 1 1 1 does nothing


Step 2 : overload emulate()

Your ai should be stored in the FixedPointFIR somewhere

What emulate() does

a TestCase is a pair (input, allowed outputs)

emulate() inputs a TestCase with only the input defined

and defines the allowed outputs – here two

Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)

no need here

After this step, wonder at :
./flopoco FixedPointFIR 4 8 1 1 1 1 TestBench 1000



Step 3 : Write the combinatorial VHDL

use one of the existing constant multipliers in FloPoCo

Half the room should use FixRealKCM

The other half should use IntConstMult

First write the additions as “+” in standard VHDL

not very efficient, but tutorially good
(several better alternatives to discuss later)

All datapath on 1 + w + g bits

here also, a small optimization is possible

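The table-based constant multiplication idea behind FixRealKCM can be sketched with an integer constant for simplicity (FixRealKCM additionally rounds tables of a real constant; this is a model of the idea, not the FloPoCo class):

```python
def kcm(x, c, alpha, k):
    """KCM constant multiplication: x is split into k chunks of alpha
    bits; table i holds c * chunk, pre-shifted to the chunk's weight;
    the table outputs are summed."""
    tables = [[(c * t) << (alpha * i) for t in range(1 << alpha)]
              for i in range(k)]
    return sum(tables[i][(x >> (alpha * i)) & ((1 << alpha) - 1)]
               for i in range(k))
```

Each table maps α input bits to one partial product, which is exactly the shape of an FPGA LUT level followed by an adder tree.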

Step 3.5 : overload BuildStandardTestCases

First debug until VHDL works for all xi = 0

... then for all xi = 1


Step 4 : pipeline

using nextCycle() and sync*() only in this case


A tutorial : building a √(x² + y²) operator

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion


Specification

We are going to build a floating-point operator for √(x² + y²)

Black box : two inputs, one output, just like an adder or a multiplier.

Mathematical specification : faithful rounding of the exact result

x and y considered exact
return one of the FP numbers surrounding the exact result
less accurate than correct rounding, but
last-bit accurate (all returned bits are useful), and
more accurate than a combination of two multipliers, one adder and a square root (4 rounding errors)

Algorithm : sum of squares, followed by a square root of the mantissa

with some guard bits for faithful rounding
expecting savings mostly in the sum of squares


Disclaimer (for hard-core arithmeticians)

It should be possible to fuse the whole computation in a single digit recurrence

more efficient for ASIC and DSP-less FPGAs

see works by Ercegovac, Muller, Lang, Bruguera, Takagi ...

... but beyond the scope of this work

and less interesting from a tutorial point of view : no component reuse


Step 0 : get a working FloPoCo distribution

Google FloPoCo, then

first use the 1-line install to get all dependencies

optionally do an svn checkout to live on the edge


Step 1 : build a skeleton

We want to implement flopoco FP2DNorm wE wF

cd src

cp UserDefinedOperator.cpp FP2DNorm.cpp

cp UserDefinedOperator.hpp FP2DNorm.hpp

Inside the newly created files,

global replace UserDefinedOperator with FP2DNorm

empty the cpp of all flesh except initial declarations

Edit CMakeLists.txt (the file that defines the building process),

look for src/FPSqrt (because the interface is similar)
and add src/FP2DNorm close to it

Edit src/FloPoCo.hpp and do similar imitation

Edit src/main.cpp and do similar imitation

Compile and fix

Check that ./flopoco FP2DNorm 8 23 does nothing


Step 2 : overload emulate()

Here, get inspiration from src/FPSumOfSquares or src/FPAdderSinglePath.

a TestCase is a pair (input, allowed outputs)

emulate() inputs a TestCase with only the input defined

and defines the allowed outputs

Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert from / to arbitrary-precision floating-point (MPFR)

After this step, test :
./flopoco FP2DNorm 5 10 TestBench 1000



Step 3 : Write the combinatorial VHDL

use an exponent difference, a shifter, two squarers, and a square root

exponent difference is a subtraction, but a small one
followed by a swap

Shift before or after the square ?

before enables a single bit-heap implementation
... but requires a slightly larger squarer
after requires a slightly larger shifter

Hack the floating-point square root

no fixed-point version available

Conclusion

Introduction: custom arithmetic

FloPoCo for application developers

Computing just right

FloPoCo for developers of HLS software

FloPoCo for developers of custom arithmetic

A tutorial: building an integer divider by 3

A tutorial: building a faithful FIR

A tutorial: building a √(x² + y²) operator

Conclusion


The “no killer app” theorem

For 20 years, the FPGA community has been waiting for the “killer application”.
(The widely useful application on which the FPGA is so much better)

Theorem : we’ll wait forever.

Proof : When such an application pops up,

either it is indeed widely useful, and next year’s Pentium will do it
in hardware 10x faster than the FPGA, so it won’t be an FPGA killer
app next year,

or the FPGA remains competitive next year, but it means that it
was not a killer app.

The killer feature of FPGAs is flexibility

To exploit it, we do need infinitely many arithmetic operators.


Computing just right

In a Pentium

the choice is between

an integer SUV, or

a floating-point SUV.

In an FPGA

If all I need is a bicycle, I have the possibility to build a bicycle

(and I’m usually faster to destination)

Save routing ! Save power ! Don’t move useless bits around !



An almost virgin land

Most of the arithmetic literature addresses the construction of SUVs.


My personal record

Division by 3 (ARC 2012) : Two weeks from the first intuition of the algorithm
to complete pipelined FloPoCo implementation + paper submission.

Implementation time

10 minutes to obtain a testbench generator

2 days for the fully parametric combinatorial operator

(less than VHDL)

20 min for its fully flexible pipeline


So when do we have an FPGA in every PC ?

When they become as easy to program as processors ?

(now that’s a challenge)

(or do we quietly wait for processors to become as messy to program as FPGAs ?)


Thanks for your attention

The following people have contributed to FloPoCo :
S. Banescu, N. Brunie, S. Collange, J. Detrey,
P. Echeverría, F. Ferrandi, M. Grad, K. Illyes,
M. Istoan, M. Joldes, C. Klein, D. Mastrandrea,
B. Pasca, B. Popa, X. Pujol, D. Thomas,
R. Tudoran, A. Vasquez.


http://flopoco.gforge.inria.fr/
