Opening & Plenary Session #1 Hardware


Transcript of Opening & Plenary Session #1 Hardware

Page 1: Opening & Plenary Session #1 Hardware

Opening & Plenary Session #1 – Hardware

November 17, 2020

Page 2: Opening & Plenary Session #1 Hardware

Prof. Shouyi Yin
Department of Micro- and Nano-electronics, Tsinghua University

Deploying AI Everywhere: mW/μW Level Neural Network Processor

Page 3: Opening & Plenary Session #1 Hardware

Today's AI Services: Almost All in the Cloud

Applications: AI healthcare, image recognition, translation, speech-to-text, smart home
AI cloud providers: Google, Baidu, IBM, Microsoft, Alibaba

Page 4: Opening & Plenary Session #1 Hardware

From AI in Cloud to AI Everywhere

(Figure, after Google Brain @ISSCC'18 and Rob Aitken @Arm: on the order of 10^8 cloud servers run Big ML in the data center; mobile devices and Wi-Fi clients in between run Medium ML at the edge; and 10^10 to 10^12 "things" call for Little ML on micro-chips.)

Page 5: Opening & Plenary Session #1 Hardware

AI Chipset Revenue, 2016–2025 (from a Tractica report)

Page 6: Opening & Plenary Session #1 Hardware

Computation Requirements vs. Power Constraints

(Figure: computation (ops) plotted against power budget: roughly 100 GOPS at 1 mW for always-on smartphone features, wearable devices, and smart sensors; 1 TOPS at 100 mW for smartphones, smart home, smart toys, smart glasses, and Wi-Fi cameras; 2 TOPS at 2 W for home appliances, video surveillance, industry, and agriculture; 20 TOPS at 100 W for automobiles and data centers.)

Page 7: Opening & Plenary Session #1 Hardware

Challenges for AI Chips

① Programmability
  Example 1: LeNet for handwriting recognition
  Example 2: AlexNet for image classification
  Example 3: LRCN for image captioning

② Neural computing & general computing
  Example: face detection and recognition pipeline: face detection (NN) → resize → NMS → landmark detection (NN) → facial alignment → resize → face recognition (NN) → Euclidean distance over embeddings (identifying, e.g., Leonardo DiCaprio)
  Neural computing: convolutional networks, fully connected networks, recurrent networks, ...
  General computing: image signal processing, visual processing, sound signal processing, ...

③ High energy efficiency for edge computing (~TOPS/W at mW or μW power levels)

Page 8: Opening & Plenary Session #1 Hardware

Dual Trends for Energy-Efficient NN Computing

Trend 1: Compact NN models, toward compact parallel computing, to make algorithms more flexible. Network pruning, compression, quantization, low-bit representations, ...
  Pruning (2016.2) → BWN (2016.8) → TWN (2016.11) → Low-bit training: DoReFa-Net (2018.2) → Low-bit adaptive quantization: LQ-Nets (2018.9)

Trend 2: Domain-specific architectures, toward massive basic computation, to keep hardware busy. Data granularity, programming/storage model, ...
  ASIP: DianNao (2014) → RS dataflow: MIT Eyeriss (2016) → Systolic array: Google TPU (2017.1) → Sparsity-aware: NVIDIA SCNN (2017.6) → Flexible bitwidth: KAIST UNPU (2018)

Algorithm design in conjunction with hardware design: less latency, more power efficiency, more generality. Algorithm & hardware co-design, co-optimization, co-verification!

Page 9: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above, highlighting the compact-NN-model trend.)

Page 10: Opening & Plenary Session #1 Hardware


Binary & Ternary Weight Neural Networks

Page 11: Opening & Plenary Session #1 Hardware

Training Techniques for Extremely Low-Bit NNs

• Reduce memory footprint: various weight quantization techniques
• Increase representational capability: shortcut connections
• Reduce variation: weight approximation, batch normalization

(Figure: top-1 classification accuracy of ResNet-18 on ImageNet, with the weights of all networks binary/ternary, tracking the evolution of these networks against the 69.3% full-precision baseline: BinaryNet (2016) 42.2%, XNOR-Net (2016) 51.2%, DoReFa-Net (2016) 59.2%, BWN (2016) 60.8%, ABC-Net (2017) 65.0%, TTQ (2017) 66.6%, LQ-Nets (2018) 68.0%.)

Page 12: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above.)

Page 13: Opening & Plenary Session #1 Hardware

What Kind of Computing Architecture?

Programmability vs. energy efficiency vs. compact models.

Prior answers: ASIP (Cambricon), systolic array (TPU), RS dataflow (Eyeriss), sparsity (SCNN), flexible bit-width (UNPU). What is next?

Page 14: Opening & Plenary Session #1 Hardware

Coarse-Grained Reconfigurable Architecture

9.8 Emerging Computing Architecture

Page 15: Opening & Plenary Session #1 Hardware

Software Defined Hardware

Build runtime-reconfigurable hardware and software that enable near-ASIC performance without sacrificing programmability for data-intensive algorithms:
• build a processor that is reconfigurable at runtime;
• build programming languages and compilers that optimize software and hardware at runtime.

Reconfiguration times: 300 - 1,000 ns at runtime.

Page 16: Opening & Plenary Session #1 Hardware

Energy Efficiency Comparison

(Figure, from Prof. Kunle Olukotun's ISCA 2018 keynote, Stanford University: energy efficiency in MOPS/mW on a log scale from 0.1 to 10,000 across 16 chips, grouped as CPUs and multi-core CPUs (more programmable), GPU, CPU+GPU, and FPGA (less programmable), and dedicated chips (not programmable). Software Defined Hardware (CGRA) targets dedicated-chip efficiency while staying programmable.)

Page 17: Opening & Plenary Session #1 Hardware

CGRA History

The idea dates to Prof. Gerald Estrin, "Organization of Computer Systems: The Fixed Plus Variable Structure Computer," Proc. Western Joint Computer Conf., New York, 1960, pp. 33-40.

Exploration period: PADDI (1990), PADDI-2 (1993), DP-FPGA (1994), MATRIX (1996), RaPiD (1996), Pleiades (1997), Garp (1997), RAW (1997), REMARC (1998), PipeRench (1998), CHESS (1999), MorphoSys (1999)
Expansion period: DReAM (2000), MorphICs (2000), ADRES (2003), XPP (2003), Zippy (2003)
High-speed expansion and a new start: EGRA (2009), DySER (2012)

Page 18: Opening & Plenary Session #1 Hardware

Reconfigurable Computing Research at Tsinghua University

Led by Prof. Shaojun Wei, IEEE Fellow, Director of the IME, THU.

Timeline (2006 → 2010 → 2015 → 2018 → 2020), progressing from theory and basic architecture, through domain-specific reconfigurable processors, to NN processors:
• NSFC: Reconfigurable Architecture
• 863 Program: Reconfigurable Multimedia Processor
• 863 Program: General-Purpose Reconfigurable Computing
• NSFC: Reconfigurable Vision Processor
• NSFC: Reconfigurable Networks-on-Chip
• Major S&T Program: Reconfigurable Graphics Processor
• Thinker: Reconfigurable NN Processor
• China-UK: Reconfigurable Cloud Computing
• NSFC: Reconfigurable Hybrid AI Processor

Page 19: Opening & Plenary Session #1 Hardware

Architecture Overview

(Figure: a host controller, context memory, and data memory (scratchpad) serve a 4×4 array of PEs; each PE contains an ALU, registers, and local memory and runs from dual supply rails VddH/VddL. Operation alternates over time between configuration and execution phases.)

Spatial architecture, distributed memory, no instructions, flexible bit-width.

Arithmetic and logic operations:
  A + B      A >> B        A > B         (A+B)>>C
  A - B      A + (B>>C)    A == B        (A+B)<<C
  A & B      A             A < B         (A-B)>>C
  A | B      A + (B<<C)    A >= B        (A+B)<<C
  A ^ B      A - (B<<C)    A <= B        A×B_H
  A ~^ B     |A-B|         A != B        Clip(A, -B, B)
  ~A         (A>>C)-B      A - (B>>C)    Clip(A, 0, B)
  A << B     A×B_L         (A<<C)-B      C?A:B

Page 20: Opening & Plenary Session #1 Hardware

Software-Defined Datapath

Page 21: Opening & Plenary Session #1 Hardware


Programming Model

Compilation

Page 22: Opening & Plenary Session #1 Hardware

Compiling Applications onto CGRA

A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code (Computer Architecture: A Quantitative Approach), so compilation for CGRAs concentrates on loops.

A loop:

    for (i = 0; i < N; i++) {
      S1: a[i] = b[i-1] + 1;
      S2: b[i] = a[i] / 2;
      S3: c[i] = b[i] + 3;
      S4: d[i] = c[i];
    }

Its data dependence graph (DDG) carries a (dif, min) annotation on each edge: S1→S2, S2→S3, and S3→S4 with (0,1), plus the loop-carried dependence S2→S1 with (1,1). Loop pipelining overlaps iterations on the time-extended PE array, e.g., S1 and S2 on PE1 and S3 and S4 on PE2, staggered across times T and T+1, so that the repeating kernel executes with initiation interval II.

Key problems:
1. Exploiting operator-level parallelism
2. Finding a better OP-to-PE binding according to the CGRA's architectural features
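To make the pipelined kernel concrete, here is a minimal C sketch of the same four statements after software pipelining (the b array is shifted by one element so the loop-carried read b[i-1] needs no negative index; the PE assignment is modeled only by the statement grouping, and sizes are illustrative):

    #include <stdio.h>

    #define N 8

    /* Sketch: software-pipelined version of the S1..S4 loop. b[k] here
     * stores the original b[k-1]. The steady-state body is the repeating
     * kernel: S3/S4 of iteration i run alongside S1/S2 of iteration i+1
     * (e.g., on PE2 and PE1 respectively). */
    int main(void) {
        int a[N], b[N + 1], c[N], d[N];
        b[0] = 0;                       /* original b[-1] */

        a[0] = b[0] + 1;                /* prologue: S1(0) */
        b[1] = a[0] / 2;                /*           S2(0) */

        for (int i = 0; i < N - 1; i++) {   /* kernel (steady state) */
            c[i]     = b[i + 1] + 3;    /* S3(i)   */
            d[i]     = c[i];            /* S4(i)   */
            a[i + 1] = b[i + 1] + 1;    /* S1(i+1) */
            b[i + 2] = a[i + 1] / 2;    /* S2(i+1) */
        }

        c[N - 1] = b[N] + 3;            /* epilogue: S3(N-1) */
        d[N - 1] = c[N - 1];            /*           S4(N-1) */

        for (int i = 0; i < N; i++) printf("d[%d] = %d\n", i, d[i]);
        return 0;
    }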

Page 23: Opening & Plenary Session #1 Hardware

Loops Classification

Loops are classified along three axes: iteration domain (affine vs. non-affine; non-affine accesses are converted to loads/stores), dependence (uniform vs. non-uniform), and structure (single-level vs. multi-level; with or without branches; perfect vs. imperfect nesting). We mainly focus on mapping four categories of loops onto CGRAs, which cover most real-life applications: (1) single-level loops without branches, (2) single-level loops with branches, (3) perfect multi-level nests, (4) imperfect multi-level nests.

Examples:

    // Multi-level, perfect, affine, non-uniform dependence
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        A[i][j] = A[2*i+3*j][j+1] + ...

    // Multi-level, imperfect, affine, uniform dependence [1,2]
    for (i = 0; i < N; i++) {
      a[i][0] = ...
      for (j = 0; j < M; j++)
        A[i][j] = A[i+1][j+2] + ...
    }

    // Single-level with branch
    for (i = 0; i < N; i++) {
      a = b + c;
      if (a > 0) { d = a + c; }
    }

Page 24: Opening & Plenary Session #1 Hardware

Compilation Flow

Incoming loops pass through control logic and a decision tree (Nested? Perfect? Branch?) that routes each loop to the appropriate technique; the flow covers auto-parallelization, memory partitioning & mapping, low-power optimization, and configuration-context generation.

• PolyMap: polyhedral-model-based mapping (DAC'13/TVLSI'14)
• DualPL: multi-level loop pipelining (TPDS'16)
• TRMap: trigger-aware mapping (ICCAD'15/TVLSI'15)
• MEMMap: memory partitioning & mapping (TVLSI'15)
• DualVdd: voltage scheduling (TCAD'16)

Page 25: Opening & Plenary Session #1 Hardware

Thinker: Reconfigurable AI Computing Architecture

Features:
1. Heterogeneous PE arrays (general PEs and super PEs) supporting data reuse
2. Two types of reconfigurable PE providing programmability
3. Bit-width-adaptive MAC units exploiting the full computing power for low-bit NNs

Page 26: Opening & Plenary Session #1 Hardware

(Slide repeats the Thinker feature overview above and introduces its three levels of reconfigurability.)

Page 27: Opening & Plenary Session #1 Hardware

Three Levels of Reconfigurability

1. Reconfigurable MAC unit
2. Reconfigurable PE
3. Reconfigurable PE array

Page 28: Opening & Plenary Session #1 Hardware

Reconfigurable MAC Unit: Bit-Width Adaptive

• Supports neural networks with flexible bit-widths
• Fully exploits the computing power of each PE for low-bit neural networks
• Two modes: bit-serial computation and subword-parallel computation
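As a rough software analogy for subword-parallel computation (a hypothetical packing, not the actual Thinker MAC circuit): two 4-bit activations can share one multiplier if guard bits keep their products in separate lanes.

    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* Two 4-bit unsigned activations share one 32-bit multiply. Each lane
     * is 8 bits wide because a 4x4-bit product fits in 8 bits, so the two
     * products cannot overlap. */
    static uint32_t mac2x4(uint8_t a0, uint8_t a1, uint8_t w) {
        assert(a0 < 16 && a1 < 16 && w < 16);
        uint32_t packed = (uint32_t)a0 | ((uint32_t)a1 << 8);
        return packed * w;              /* lane 0 = a0*w, lane 1 = a1*w */
    }

    int main(void) {
        uint32_t p = mac2x4(7, 13, 9);
        printf("a0*w = %u, a1*w = %u\n", p & 0xFF, (p >> 8) & 0xFF);
        /* prints 63 and 117 */
        return 0;
    }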

Page 29: Opening & Plenary Session #1 Hardware

Reconfigurable PE: Supports Different OPs in NNs

Page 30: Opening & Plenary Session #1 Hardware

Reconfigurable PE Array: On-Demand Partitioning

① CNN, FCN, and LSTM execute sequentially; ② CNN and FCN+LSTM execute in parallel.

Page 31: Opening & Plenary Session #1 Hardware

Thinker-I: Watt-Level AI Processor

• General-purpose NN processing
• Heterogeneous PEs
• Supports CNN/FCN/RNN and hybrid NNs

Technology: TSMC 65nm LP
Supply voltage: 0.67 V – 1.29 V
Area: 4.4 mm × 4.4 mm
SRAM: 348 KB
Frequency: 10 – 200 MHz
Power: 4 mW – 447 mW
Energy efficiency: 1.06 – 5.09 TOPS/W

Recognition: 2017 ACM/IEEE ISLPED Design Contest Award; 2017 VLSI Symposia on Technology and Circuits (VLSIC); 2017 IEEE TVLSI Most Popular Article; 2018 IEEE Journal of Solid-State Circuits (JSSC); 2018 MIT Technology Review.

Page 32: Opening & Plenary Session #1 Hardware

The Requirement for mW-Level AI Processors

(Figure: the computation-vs-power chart from earlier, highlighting the tens-of-mW budget typical of always-on smartphone features, wearable devices, smart sensors, smart home devices, smart toys, smart glasses, and Wi-Fi cameras.)

Page 33: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above, now emphasizing low-bit NN & hardware co-design.)

Page 34: Opening & Plenary Session #1 Hardware

Binary/Ternary Weight Neural Networks (BTNNs)

Hardware friendly:
• No multiplication
• Low memory footprint and capacity
• Satisfactory accuracy

Accuracy loss (%) relative to full precision, by weight (W) and activation (A) bit-width:

BWNNs:
  Binary Connect:               W=1, A=32: MNIST 0.7,  CIFAR-10 2.78, ImageNet 19.20
  Binary Weight Network:        W=1, A=32: MNIST -,    CIFAR-10 0.76, ImageNet 0.8
  Binary Neural Network:        W=1, A=1:  MNIST 0.37, CIFAR-10 3.03, ImageNet 29.8
  XNOR-Net:                     W=1, A=1:  MNIST -,    CIFAR-10 3.05, ImageNet 11
TWNNs:
  Ternary Connect:              W=2, A=32: MNIST 0.56, CIFAR-10 4.89, ImageNet -
  Ternary Weight Network:       W=2, A=32: MNIST 0.06, CIFAR-10 0.32, ImageNet 0.8
  Trained Ternary Quantization: W=2, A=32: MNIST -,    CIFAR-10 -,    ImageNet 0.6
  Ternary Neural Network:       W=2, A=2:  MNIST 1.08, CIFAR-10 4.99, ImageNet -
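The "no multiplication" property is easy to see in code: with weights restricted to {-1, 0, +1}, a dot product degenerates to adds, subtracts, and skips. A minimal C sketch (toy sizes, not tied to any particular accelerator):

    #include <stdio.h>

    /* Ternary-weight dot product: no multiplier needed, the weight value
     * only selects add, skip, or subtract. */
    static int ternary_dot(const int *x, const signed char *w, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            if (w[i] == 1)       acc += x[i];
            else if (w[i] == -1) acc -= x[i];
            /* w[i] == 0: skipped entirely (a redundant operation avoided) */
        }
        return acc;
    }

    int main(void) {
        int x[4] = {1, 2, 3, 4};
        signed char w[4] = {1, -1, 0, 1};
        printf("%d\n", ternary_dot(x, w, 4));   /* 1 - 2 + 4 = 3 */
        return 0;
    }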

Page 35: Opening & Plenary Session #1 Hardware

New Opportunity to Optimize B/T Weight Convolutions

Binary/ternary kernels take only 2 or 3 distinct weight values, so the kernels in a kernel group apply many identical partial computations to the same input feature maps. These redundant operations (ROPs) cost extra power; removing ROPs across kernel groups computes each shared subexpression once.

Example: for input values 1, 2, 3, 4 and a group of four kernels, the dot products 1+2-3+0, 0+2-3+4, 1-2+3+0, and -1+2+0+4 rewrite to 1+(2-3)+0, 0+(2-3)+4, (1-2)+3+0, and -(1-2)+0+4, so the shared terms (2-3) and (1-2) are each evaluated once.

Page 36: Opening & Plenary Session #1 Hardware

Special Optimization for B/T Weight NNs: Kernel Transformation + Feature Reconstruction (KTFR)

Standard convolution of a 4×4 input feature map (values 1..16) with two 3×3 binary kernels K1 and K2 costs 64 OPs. KTFR transforms the kernel pair first:

K1′ = (K1 + K2) / 2
K2′ = (K1 − K2) / 2

Since binary kernels either agree or disagree elementwise, K1′ and K2′ are ternary and contain many zeros. Convolving the ifmap with K1′ and K2′ yields (O1+O2)/2 and (O1−O2)/2, and feature reconstruction recovers the original ofmaps with one add and one subtract per output element: O1 = conv(K1′) + conv(K2′), O2 = conv(K1′) − conv(K2′). In the example this cuts the cost from 64 OPs to 36 OPs.

Example from the slide: K1 = [1 1 -1; -1 1 -1; 1 1 1] and K2 = [-1 1 1; 1 1 1; -1 -1 1] transform to K1′ = [0 1 0; 0 1 0; 0 0 1] and K2′ = [1 0 -1; -1 0 -1; 1 1 0]; the intermediate ofmaps are [19 22; 31 34] and [5 5; 5 5], which reconstruct to O1 = [24 27; 36 39] and O2 = [14 17; 26 29].
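Convolution is linear, which is all KTFR relies on. The following minimal C sketch (sizes and kernel values taken from the slide's example; conv3x3 is a helper written here for illustration) reproduces the numbers above:

    #include <stdio.h>

    /* Valid 3x3 convolution of a 4x4 input, producing a 2x2 output. */
    static void conv3x3(int in[4][4], int k[3][3], int out[2][2]) {
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++) {
                int acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += in[r + i][c + j] * k[i][j];
                out[r][c] = acc;
            }
    }

    int main(void) {
        int ifmap[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
        int k1[3][3] = {{1,1,-1},{-1,1,-1},{1,1,1}};
        int k2[3][3] = {{-1,1,1},{1,1,1},{-1,-1,1}};
        int k1t[3][3], k2t[3][3];               /* transformed kernels */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                k1t[i][j] = (k1[i][j] + k2[i][j]) / 2;  /* sparse, ternary */
                k2t[i][j] = (k1[i][j] - k2[i][j]) / 2;
            }
        int p1[2][2], p2[2][2];
        conv3x3(ifmap, k1t, p1);                /* (O1+O2)/2 */
        conv3x3(ifmap, k2t, p2);                /* (O1-O2)/2 */
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++)         /* feature reconstruction */
                printf("O1=%d O2=%d\n", p1[r][c] + p2[r][c],
                                        p1[r][c] - p2[r][c]);
        return 0;                               /* O1: 24 27 36 39; O2: 14 17 26 29 */
    }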

Page 37: Opening & Plenary Session #1 Hardware

Special Optimization for B/T Weight NNs: FIBC + KBWI

(Figure: a standard convolution of a 3×3 input feature map with the original ±1 kernels costs 24 OPs. The optimized flow splits the work into an integral calculation over the input (KBWI, a feature summation computed once), a small remaining convolution with transformed FIBC kernels and an RRC term, and an integral-fusion step that combines the two; it produces the same 2×2 output feature maps at 16 OPs in the example.)

Page 38: Opening & Plenary Session #1 Hardware

Thinker-II: mW-Level AI Processor

• Ultra-low power
• Load balancing and scheduling
• Low-bit-width weights

Technology: TSMC 28nm HPC
Supply voltage: 0.58 V – 0.9 V
Area: 1.7 mm × 2.7 mm
SRAM: 225 KB
Frequency: 20 – 400 MHz
Power: < 100 mW
Energy efficiency: 20 TOPS/W @ binary AlexNet

(Die photo: data memories, weight memory, PLL, controller, processing engine.)

Recognition: 2018 IEEE ISSCC SRP; 2018 International Symposium on Computer Architecture (ISCA); 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2019 IEEE Journal of Solid-State Circuits (JSSC).

Page 39: Opening & Plenary Session #1 Hardware

The Requirement for µW-Level AI Processors

(Figure: the same computation-vs-power chart, now highlighting the < 1 mW region of always-on smartphone features, wearable devices, and smart sensors.)

Page 40: Opening & Plenary Session #1 Hardware

Prevailing Human-Machine Speech Interfaces

○ Apple Siri, Google Now, and Microsoft Cortana on mobile phones, wearable devices, and IoT devices (smart earphones, wall switches, ...)
○ General speech recognition procedure: voice activity detection → feature extraction → acoustic model → decoding

Page 41: Opening & Plenary Session #1 Hardware

Binary Convolutional Neural Networks (BCNNs)

○ Activations and weights quantized to 1 bit, saving memory footprint
○ Expensive multiplications replaced by XNORs, saving power and area
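To see why XNOR replaces multiplication: encode +1 as bit 1 and -1 as bit 0; the product of two values is then +1 exactly when the bits match. A minimal C sketch of a 32-element binary dot product (using the GCC/Clang __builtin_popcount builtin; not the Thinker-S circuit):

    #include <stdint.h>
    #include <stdio.h>

    /* 32-wide binary dot product: XNOR marks matching signs, popcount
     * converts matches into the sum: dot = 2*popcount(~(x^w)) - 32. */
    static int bdot32(uint32_t x, uint32_t w) {
        uint32_t matches = ~(x ^ w);          /* 1 where signs agree */
        return 2 * __builtin_popcount(matches) - 32;
    }

    int main(void) {
        /* All bits agree -> dot = +32; all disagree -> -32. */
        printf("%d %d\n", bdot32(0xFFFFFFFFu, 0xFFFFFFFFu),
                          bdot32(0xFFFFFFFFu, 0x00000000u));
        return 0;
    }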

Page 42: Opening & Plenary Session #1 Hardware

Frame-Level Reuse in BCNN

□ Exploit temporal data locality to eliminate redundant computing.

(Figure: consecutive BCNN input feature maps are sliding windows over the speech features, 11 frames per fmap with 1×40 features per frame, so adjacent fmaps overlap in 10 of 11 frames and their convolution results with the 3×3×64 kernels largely coincide. The overlapped results are kept in a buffer and reused for the next feature map; only the outputs involving the newly arrived frame are computed, and the stored results are updated before being passed to the next layer.)
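A rough C sketch of the idea with a toy 1-D kernel (FRAMES, KW, and the summing kernel are illustrative assumptions, not the Thinker-S configuration): each new feature map shifts the buffered columns and computes only the column that involves the newly arrived frame.

    #include <stdio.h>
    #include <string.h>

    #define FRAMES 11     /* frames per feature map   */
    #define KW 3          /* kernel width (time axis) */

    /* One output column of a toy convolution: sum of KW frames. */
    static int conv_col(const int *frames, int t) {
        int acc = 0;
        for (int k = 0; k < KW; k++) acc += frames[t + k];
        return acc;
    }

    int main(void) {
        int stream[32];
        for (int i = 0; i < 32; i++) stream[i] = i;  /* incoming frames */
        int cols[FRAMES - KW + 1];                   /* buffered results */

        /* First map: compute all columns. */
        for (int t = 0; t <= FRAMES - KW; t++) cols[t] = conv_col(stream, t);

        /* Later maps: shift the buffer, compute only the newest column. */
        for (int start = 1; start + FRAMES <= 14; start++) {
            memmove(cols, cols + 1, (FRAMES - KW) * sizeof cols[0]);
            cols[FRAMES - KW] = conv_col(stream + start, FRAMES - KW);
            printf("map %d: last col = %d\n", start, cols[FRAMES - KW]);
        }
        return 0;
    }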

Page 43: Opening & Plenary Session #1 Hardware

Bit-Level BCNN Weight Regularization

□ Regularize and compress the bits in BCNN 3-D weight matrices: after pruning, many 32-bit weight rows are mostly zero bits, and regularization groups them so the zeros can be dropped from storage and re-completed ("zero completion") on read.

(Figure: rows are stored in four banks, a 16-zero bank holding 16 payload bits, a 12-zero bank holding 20 bits, an 8-zero bank holding 24 bits, and a full 32-bit bank, selected by a 2-bit entry in a per-row flag table; a 2-to-4 decoder turns the flag into a bank enable and address.)

Measured over the benchmarks, rows fall roughly 5% / 72% / 4% / 19% and 5% / 80% / 5% / 10% into the 16-zero / 12-zero / 8-zero / full banks, yielding 24.25% and 27.50% reductions in both storage and memory accesses at 1% and 2% precision loss, respectively.
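One plausible reading of the read-out path, sketched in C (the bank payload widths follow the slide's 16/20/24/32-bit labels; the names and layout are my assumptions, not the verified design): a 2-bit flag per row selects the bank, and the dropped zeros are re-completed on fetch.

    #include <stdint.h>
    #include <stdio.h>

    enum { BANK16 = 0, BANK12 = 1, BANK8 = 2, BANK_FULL = 3 };

    /* Re-complete a 32-bit weight row from its bank: the flag says how
     * many payload bits were stored; the rest are restored as zeros. */
    static uint32_t fetch_row(uint8_t flag, uint32_t stored) {
        switch (flag) {
        case BANK16: return stored & 0xFFFFu;    /* 16 payload bits */
        case BANK12: return stored & 0xFFFFFu;   /* 20 payload bits */
        case BANK8:  return stored & 0xFFFFFFu;  /* 24 payload bits */
        default:     return stored;              /* full 32-bit row */
        }
    }

    int main(void) {
        printf("0x%08X\n", (unsigned)fetch_row(BANK16, 0xDEADBEEF));
        /* prints 0x0000BEEF: upper 16 zeros re-completed */
        return 0;
    }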

Page 44: Opening & Plenary Session #1 Hardware

Approximate Adder

□ Additions in BCNN are dominated by "+1": roughly 95.9% of additions are 1-bit adds and only 4.1% are 16-bit adds.
□ Truncate the high-bit carry chain to shorten the critical path.

(Figure: the 16-bit adder is split into an exact lowest-5-bit addition producing Sum4:0, plus ripple-carry segments for bits 8:5 and 12:9 (4-bit RCAs) and bits 15:13 (3-bit RCA), each fed by its own sub-carry unit. The lowest 5 bits must be exact because an increment can carry across all l = 5 low bits, as in 11111 + 00001 = 100000, i.e., whenever the result reaches 32.)

Page 45: Opening & Plenary Session #1 Hardware

46

□ Circuit diagram of the lowest-5-bit adder: it reduces delay by 49.28% and power-delay product (PDP) by 48.08%. The exact lowest 5 bits reduce power while guaranteeing the correctness of incremental addition.

Carry chain of the lowest 5 bits (g = generate, p = propagate, as drawn for the increment case 11111 + 00001 = 100000, l = 5, result ≥ 32):
c0 = g0
c1 = p1 & g0
c2 = p2 & p1 & g0
c3 = p3 & p2 & p1 & g0
c4 = p4 & p3 & p2 & p1 & g0

Evaluation flow: TSMC 28nm, 0.9 V → DC → netlist → HSPICE → delay/power. Benchmarks: TIDIGIT, TIMIT, RM, WSJ, Spoken Number.
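As a rough behavioral model (an ETAII-style carry-speculation sketch under my reading of the slide, not the verified circuit): the lowest 5 bits add exactly, while each upper segment takes a carry-in predicted from the operand bits of the segment directly below it, truncating longer carry chains. Carries that would have to ripple across more than one segment boundary are dropped, which is where the (rare) errors live.

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t approx_add16(uint16_t a, uint16_t b) {
        /* Exact lowest 5 bits: keeps the dominant "+1" adds correct. */
        uint16_t low   = (a & 0x1F) + (b & 0x1F);
        uint16_t sum   = low & 0x1F;
        uint16_t carry = low >> 5;      /* exact carry into bit 5 */

        /* Segments 8:5, 12:9, 15:13, matching the slide's split. */
        struct { int lo, bits; } seg[3] = { {5, 4}, {9, 4}, {13, 3} };
        for (int s = 0; s < 3; s++) {
            uint16_t mask = (1u << seg[s].bits) - 1;
            uint16_t sa = (a >> seg[s].lo) & mask;
            uint16_t sb = (b >> seg[s].lo) & mask;
            uint16_t t  = sa + sb + carry;
            sum |= (t & mask) << seg[s].lo;
            /* Truncated chain: the next segment's carry-in is speculated
             * from this segment's operands alone (own carry-in ignored). */
            carry = (sa + sb) >> seg[s].bits;
        }
        return sum;
    }

    int main(void) {
        printf("%u (exact %u)\n", approx_add16(1234, 1), 1234u + 1u);
        return 0;
    }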

Page 46: Opening & Plenary Session #1 Hardware

Thinker-S: µW-Level AI Processor

• Ultra-low power for speech
• Always-on & real-time
• Wake-up, command, and speech recognition

Technology: TSMC 28nm HPC
Supply voltage: 0.52 V – 0.9 V
Area: 1.74 mm × 0.74 mm
SRAM: 27 KB
Frequency: 2 – 50 MHz
Power: 0.2 – 5 mW
Energy efficiency: 304 nJ/frame

(Die photo: decoder unit, controller, data memories, weight memory, BCNN unit, pre-processing.)

Recognition: 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2018 Design Automation Conference (DAC); 2019 IEEE Transactions on Circuits and Systems I: Regular Papers.

Page 47: Opening & Plenary Session #1 Hardware

The Next Step for AI Chips

(Figure: energy efficiency (TOPS/W) of AI chips, 2016-2020: MIT Eyeriss (RS dataflow) 0.25; Google TPU (systolic) 2.3; ICT/Cambricon Cambricon-X (sparsity) 0.54; KU Leuven Envision (DVAFS) 2.75; KAIST UNPU (reconfigurable) 11.6; THU Thinker-II (reconfigurable) 19.9; MIT CONV-RAM (7×1 bit) 28.1; SEU Sandwich (8×1 bit) 55.)

Phase I (digital architectures):
√ Innovation in architecture drives energy efficiency
√ Reconfigurable architectures have good programmability
× Digital architectures face the "memory wall" bottleneck

Phase II (in-memory computing):
√ In-memory computing shows great potential
× Only supports the basic MAC operations

Phase III (next?): reconfigurable architecture + in-memory computing.

Page 48: Opening & Plenary Session #1 Hardware

Thank you for your attention

Page 49: Opening & Plenary Session #1 Hardware

Sponsors

Premier Sponsor & tinyML Strategic Partner

Gold Sponsor

Silver Sponsor

Page 50: Opening & Plenary Session #1 Hardware

© 2020 Arm Limited (or its affiliates)

Arm: The Software and Hardware Foundation for tinyML

Optimized models for embedded:
Application
Runtime (e.g., TensorFlow Lite Micro)
Optimized low-level NN libraries (i.e., CMSIS-NN)
Arm Cortex-M CPUs and microNPUs

1. Connect to high-level frameworks: AI ecosystem partners
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK
3. Connect to runtime: RTOS such as Mbed OS

Resources: developer.arm.com/solutions/machine-learning-on-arm

Stay connected: @ArmSoftwareDevelopers, @ArmSoftwareDev

Page 51: Opening & Plenary Session #1 Hardware

Dynamic Neural Accelerator™: The Next Generation of AI Processor for the Embedded Edge

• Tight coupling between AI software & hardware with automated co-design
• 10x more compute with a single DNA engine
• More than 20x better energy efficiency
• Ultra-low latency
• Fully programmable with INT8 support

Target markets: automotive, robotics, drones, smart cities, Industry 4.0

www.edgecortix.com
© 2020 EdgeCortix. All rights reserved.

Page 52: Opening & Plenary Session #1 Hardware

SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile, and edge devices. We design systems for real-time, always-on smart sensing for audio, vision, IMUs, bio-signals, and more.

https://SynSense.ai

Page 53: Opening & Plenary Session #1 Hardware

Partners

Conference Partner

Media Partners

Page 54: Opening & Plenary Session #1 Hardware

Questions?

To join the official tinyML WeChat group, please add a staff member on WeChat and mention "tinyML".

Page 55: Opening & Plenary Session #1 Hardware

Copyright Notice

This presentation in this publication was presented at tinyML® Asia 2020. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by the tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org