Opening & Plenary Session #1 Hardware


Transcript of Opening & Plenary Session #1 Hardware

Page 1: Opening & Plenary Session #1 Hardware

Opening & Plenary Session #1 – Hardware

November 17, 2020

Page 2: Opening & Plenary Session #1 Hardware

Prof. Shouyi Yin
Department of Micro- and Nano-electronics, Tsinghua University

Deploying AI Everywhere: mW/μW Level Neural Network Processor

Page 3: Opening & Plenary Session #1 Hardware

Today's AI Services: Almost All in the Cloud

Applications: AI healthcare, image recognition, translation, speech-to-text, smart home
AI cloud providers: Google, Baidu, IBM, Microsoft, Alibaba

Page 4: Opening & Plenary Session #1 Hardware

From AI in Cloud to AI Everywhere

(Figure, after Google Brain @ISSCC'18 and Rob Aitken @Arm: on the order of 10^8 cloud servers run Big ML in the data center; mobile devices and Wi-Fi clients in between run Medium ML at the edge; and 10^10 to 10^12 "things" call for Little ML on micro-chips.)

Page 5: Opening & Plenary Session #1 Hardware

AI Chipset Revenue, 2016–2025 (from a Tractica report)

Page 6: Opening & Plenary Session #1 Hardware

Computation Requirements vs. Power Constraints

(Figure: computation (ops) plotted against power budget: roughly 100 GOPS at 1 mW for always-on smartphone features, wearable devices, and smart sensors; 1 TOPS at 100 mW for smartphones, smart home, smart toys, smart glasses, and Wi-Fi cameras; 2 TOPS at 2 W for home appliances, video surveillance, industry, and agriculture; 20 TOPS at 100 W for automobiles and data centers.)

Page 7: Opening & Plenary Session #1 Hardware

Challenges for AI Chips

① Programmability
  Example 1: LeNet for handwriting recognition
  Example 2: AlexNet for image classification
  Example 3: LRCN for image captioning

② Neural computing & general computing
  Example: face detection and recognition pipeline: face detection (NN) → resize → NMS → landmark detection (NN) → facial alignment → resize → face recognition (NN) → Euclidean distance over embeddings (identifying, e.g., Leonardo DiCaprio)
  Neural computing: convolutional networks, fully connected networks, recurrent networks, ...
  General computing: image signal processing, visual processing, sound signal processing, ...

③ High energy efficiency for edge computing (~TOPS/W at mW or μW power levels)

Page 8: Opening & Plenary Session #1 Hardware

Dual Trends for Energy-Efficient NN Computing

Trend 1: Compact NN models, toward compact parallel computing, to make algorithms more flexible. Network pruning, compression, quantization, low-bit representations, ...
  Pruning (2016.2) → BWN (2016.8) → TWN (2016.11) → Low-bit training: DoReFa-Net (2018.2) → Low-bit adaptive quantization: LQ-Nets (2018.9)

Trend 2: Domain-specific architectures, toward massive basic computation, to keep hardware busy. Data granularity, programming/storage model, ...
  ASIP: DianNao (2014) → RS dataflow: MIT Eyeriss (2016) → Systolic array: Google TPU (2017.1) → Sparsity-aware: NVIDIA SCNN (2017.6) → Flexible bitwidth: KAIST UNPU (2018)

Algorithm design in conjunction with hardware design: less latency, more power efficiency, more generality. Algorithm & hardware co-design, co-optimization, co-verification!

Page 9: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above, highlighting the compact-NN-model trend.)

Page 10: Opening & Plenary Session #1 Hardware


Binary & Ternary Weight Neural Networks

Page 11: Opening & Plenary Session #1 Hardware

Training Techniques for Extremely Low-Bit NNs

• Reduce memory footprint: various weight quantization techniques
• Increase representational capability: shortcut connections
• Reduce variation: weight approximation, batch normalization

(Figure: top-1 classification accuracy of ResNet-18 on ImageNet, with the weights of all networks binary/ternary, tracking the evolution of these networks against the 69.3% full-precision baseline: BinaryNet (2016) 42.2%, XNOR-Net (2016) 51.2%, DoReFa-Net (2016) 59.2%, BWN (2016) 60.8%, ABC-Net (2017) 65.0%, TTQ (2017) 66.6%, LQ-Nets (2018) 68.0%.)

Page 12: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above.)

Page 13: Opening & Plenary Session #1 Hardware

What Kind of Computing Architecture?

Programmability vs. energy efficiency vs. compact models.

Prior answers: ASIP (Cambricon), systolic array (TPU), RS dataflow (Eyeriss), sparsity (SCNN), flexible bit-width (UNPU). What is next?

Page 14: Opening & Plenary Session #1 Hardware

Coarse-Grained Reconfigurable Architecture

9.8 Emerging Computing Architecture

Page 15: Opening & Plenary Session #1 Hardware

Software Defined Hardware

Build runtime-reconfigurable hardware and software that enable near-ASIC performance without sacrificing programmability for data-intensive algorithms:
• build a processor that is reconfigurable at runtime;
• build programming languages and compilers that optimize software and hardware at runtime.

Reconfiguration times: 300 - 1,000 ns at runtime.

Page 16: Opening & Plenary Session #1 Hardware

Energy Efficiency Comparison

(Figure, from Prof. Kunle Olukotun's ISCA 2018 keynote, Stanford University: energy efficiency in MOPS/mW on a log scale from 0.1 to 10,000 across 16 chips, grouped as CPUs and multi-core CPUs (more programmable), GPU, CPU+GPU, and FPGA (less programmable), and dedicated chips (not programmable). Software Defined Hardware (CGRA) targets dedicated-chip efficiency while staying programmable.)

Page 17: Opening & Plenary Session #1 Hardware

CGRA History

The idea dates to Prof. Gerald Estrin, "Organization of Computer Systems: The Fixed Plus Variable Structure Computer," Proc. Western Joint Computer Conf., New York, 1960, pp. 33-40.

Exploration period: PADDI (1990), PADDI-2 (1993), DP-FPGA (1994), MATRIX (1996), RaPiD (1996), Pleiades (1997), Garp (1997), RAW (1997), REMARC (1998), PipeRench (1998), CHESS (1999), MorphoSys (1999)
Expansion period: DReAM (2000), MorphICs (2000), ADRES (2003), XPP (2003), Zippy (2003)
High-speed expansion and a new start: EGRA (2009), DySER (2012)

Page 18: Opening & Plenary Session #1 Hardware

Reconfigurable Computing Research at Tsinghua University

Led by Prof. Shaojun Wei, IEEE Fellow, Director of the IME, THU.

Timeline (2006 → 2010 → 2015 → 2018 → 2020), progressing from theory and basic architecture, through domain-specific reconfigurable processors, to NN processors:
• NSFC: Reconfigurable Architecture
• 863 Program: Reconfigurable Multimedia Processor
• 863 Program: General-Purpose Reconfigurable Computing
• NSFC: Reconfigurable Vision Processor
• NSFC: Reconfigurable Networks-on-Chip
• Major S&T Program: Reconfigurable Graphics Processor
• Thinker: Reconfigurable NN Processor
• China-UK: Reconfigurable Cloud Computing
• NSFC: Reconfigurable Hybrid AI Processor

Page 19: Opening & Plenary Session #1 Hardware

Architecture Overview

(Figure: a host controller, context memory, and data memory (scratchpad) serve a 4×4 array of PEs; each PE contains an ALU, registers, and local memory and runs from dual supply rails VddH/VddL. Operation alternates over time between configuration and execution phases.)

Spatial architecture, distributed memory, no instructions, flexible bit-width.

Arithmetic and logic operations:
  A + B      A >> B        A > B         (A+B)>>C
  A - B      A + (B>>C)    A == B        (A+B)<<C
  A & B      A             A < B         (A-B)>>C
  A | B      A + (B<<C)    A >= B        (A+B)<<C
  A ^ B      A - (B<<C)    A <= B        A×B_H
  A ~^ B     |A-B|         A != B        Clip(A, -B, B)
  ~A         (A>>C)-B      A - (B>>C)    Clip(A, 0, B)
  A << B     A×B_L         (A<<C)-B      C?A:B

Page 20: Opening & Plenary Session #1 Hardware

Software-Defined Datapath

Page 21: Opening & Plenary Session #1 Hardware


Programming Model

Compilation

Page 22: Opening & Plenary Session #1 Hardware

Compiling Applications onto CGRA

A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code (Computer Architecture: A Quantitative Approach), so compilation for CGRAs concentrates on loops.

A loop:

    for (i = 0; i < N; i++) {
      S1: a[i] = b[i-1] + 1;
      S2: b[i] = a[i] / 2;
      S3: c[i] = b[i] + 3;
      S4: d[i] = c[i];
    }

Its data dependence graph (DDG) carries a (dif, min) annotation on each edge: S1→S2, S2→S3, and S3→S4 with (0,1), plus the loop-carried dependence S2→S1 with (1,1). Loop pipelining overlaps iterations on the time-extended PE array, e.g., S1 and S2 on PE1 and S3 and S4 on PE2, staggered across times T and T+1, so that the repeating kernel executes with initiation interval II.

Key problems:
1. Exploiting operator-level parallelism
2. Finding a better OP-to-PE binding according to the CGRA's architectural features
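To make the pipelined kernel concrete, here is a minimal C sketch of the same four statements after software pipelining (the b array is shifted by one element so the loop-carried read b[i-1] needs no negative index; the PE assignment is modeled only by the statement grouping, and sizes are illustrative):

    #include <stdio.h>

    #define N 8

    /* Sketch: software-pipelined version of the S1..S4 loop. b[k] here
     * stores the original b[k-1]. The steady-state body is the repeating
     * kernel: S3/S4 of iteration i run alongside S1/S2 of iteration i+1
     * (e.g., on PE2 and PE1 respectively). */
    int main(void) {
        int a[N], b[N + 1], c[N], d[N];
        b[0] = 0;                       /* original b[-1] */

        a[0] = b[0] + 1;                /* prologue: S1(0) */
        b[1] = a[0] / 2;                /*           S2(0) */

        for (int i = 0; i < N - 1; i++) {   /* kernel (steady state) */
            c[i]     = b[i + 1] + 3;    /* S3(i)   */
            d[i]     = c[i];            /* S4(i)   */
            a[i + 1] = b[i + 1] + 1;    /* S1(i+1) */
            b[i + 2] = a[i + 1] / 2;    /* S2(i+1) */
        }

        c[N - 1] = b[N] + 3;            /* epilogue: S3(N-1) */
        d[N - 1] = c[N - 1];            /*           S4(N-1) */

        for (int i = 0; i < N; i++) printf("d[%d] = %d\n", i, d[i]);
        return 0;
    }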

Page 23: Opening & Plenary Session #1 Hardware

Loops Classification

Loops are classified along three axes: iteration domain (affine vs. non-affine; non-affine accesses are converted to loads/stores), dependence (uniform vs. non-uniform), and structure (single-level vs. multi-level; with or without branches; perfect vs. imperfect nesting). We mainly focus on mapping four categories of loops onto CGRAs, which cover most real-life applications: (1) single-level loops without branches, (2) single-level loops with branches, (3) perfect multi-level nests, (4) imperfect multi-level nests.

Examples:

    // Multi-level, perfect, affine, non-uniform dependence
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        A[i][j] = A[2*i+3*j][j+1] + ...

    // Multi-level, imperfect, affine, uniform dependence [1,2]
    for (i = 0; i < N; i++) {
      a[i][0] = ...
      for (j = 0; j < M; j++)
        A[i][j] = A[i+1][j+2] + ...
    }

    // Single-level with branch
    for (i = 0; i < N; i++) {
      a = b + c;
      if (a > 0) { d = a + c; }
    }

Page 24: Opening & Plenary Session #1 Hardware

Compilation Flow

Incoming loops pass through control logic and a decision tree (Nested? Perfect? Branch?) that routes each loop to the appropriate technique; the flow covers auto-parallelization, memory partitioning & mapping, low-power optimization, and configuration-context generation.

• PolyMap: polyhedral-model-based mapping (DAC'13/TVLSI'14)
• DualPL: multi-level loop pipelining (TPDS'16)
• TRMap: trigger-aware mapping (ICCAD'15/TVLSI'15)
• MEMMap: memory partitioning & mapping (TVLSI'15)
• DualVdd: voltage scheduling (TCAD'16)

Page 25: Opening & Plenary Session #1 Hardware

Thinker: Reconfigurable AI Computing Architecture

Features:
1. Heterogeneous PE arrays (general PEs and super PEs) supporting data reuse
2. Two types of reconfigurable PE providing programmability
3. Bit-width-adaptive MAC units exploiting the full computing power for low-bit NNs

Page 26: Opening & Plenary Session #1 Hardware

(Slide repeats the Thinker feature overview above and introduces its three levels of reconfigurability.)

Page 27: Opening & Plenary Session #1 Hardware

Three Levels of Reconfigurability

1. Reconfigurable MAC unit
2. Reconfigurable PE
3. Reconfigurable PE array

Page 28: Opening & Plenary Session #1 Hardware

Reconfigurable MAC Unit: Bit-Width Adaptive

• Supports neural networks with flexible bit-widths
• Fully exploits the computing power of each PE for low-bit neural networks
• Two modes: bit-serial computation and subword-parallel computation
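As a rough software analogy for subword-parallel computation (a hypothetical packing, not the actual Thinker MAC circuit): two 4-bit activations can share one multiplier if guard bits keep their products in separate lanes.

    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* Two 4-bit unsigned activations share one 32-bit multiply. Each lane
     * is 8 bits wide because a 4x4-bit product fits in 8 bits, so the two
     * products cannot overlap. */
    static uint32_t mac2x4(uint8_t a0, uint8_t a1, uint8_t w) {
        assert(a0 < 16 && a1 < 16 && w < 16);
        uint32_t packed = (uint32_t)a0 | ((uint32_t)a1 << 8);
        return packed * w;              /* lane 0 = a0*w, lane 1 = a1*w */
    }

    int main(void) {
        uint32_t p = mac2x4(7, 13, 9);
        printf("a0*w = %u, a1*w = %u\n", p & 0xFF, (p >> 8) & 0xFF);
        /* prints 63 and 117 */
        return 0;
    }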

Page 29: Opening & Plenary Session #1 Hardware

Reconfigurable PE: Supports Different OPs in NNs

Page 30: Opening & Plenary Session #1 Hardware

Reconfigurable PE Array: On-Demand Partitioning

① CNN, FCN, and LSTM execute sequentially; ② CNN and FCN+LSTM execute in parallel.

Page 31: Opening & Plenary Session #1 Hardware

Thinker-I: Watt-Level AI Processor

• General-purpose NN processing
• Heterogeneous PEs
• Supports CNN/FCN/RNN and hybrid NNs

Technology: TSMC 65nm LP
Supply voltage: 0.67 V – 1.29 V
Area: 4.4 mm × 4.4 mm
SRAM: 348 KB
Frequency: 10 – 200 MHz
Power: 4 mW – 447 mW
Energy efficiency: 1.06 – 5.09 TOPS/W

Recognition: 2017 ACM/IEEE ISLPED Design Contest Award; 2017 VLSI Symposia on Technology and Circuits (VLSIC); 2017 IEEE TVLSI Most Popular Article; 2018 IEEE Journal of Solid-State Circuits (JSSC); 2018 MIT Technology Review.

Page 32: Opening & Plenary Session #1 Hardware

The Requirement for mW-Level AI Processors

(Figure: the computation-vs-power chart from earlier, highlighting the tens-of-mW budget typical of always-on smartphone features, wearable devices, smart sensors, smart home devices, smart toys, smart glasses, and Wi-Fi cameras.)

Page 33: Opening & Plenary Session #1 Hardware

(Slide repeated: "Dual Trends for Energy-Efficient NN Computing", as above, now emphasizing low-bit NN & hardware co-design.)

Page 34: Opening & Plenary Session #1 Hardware

Binary/Ternary Weight Neural Networks (BTNNs)

Hardware friendly:
• No multiplication
• Low memory footprint and capacity
• Satisfactory accuracy

Accuracy loss (%) relative to full precision, by weight (W) and activation (A) bit-width:

BWNNs:
  Binary Connect:               W=1, A=32: MNIST 0.7,  CIFAR-10 2.78, ImageNet 19.20
  Binary Weight Network:        W=1, A=32: MNIST -,    CIFAR-10 0.76, ImageNet 0.8
  Binary Neural Network:        W=1, A=1:  MNIST 0.37, CIFAR-10 3.03, ImageNet 29.8
  XNOR-Net:                     W=1, A=1:  MNIST -,    CIFAR-10 3.05, ImageNet 11
TWNNs:
  Ternary Connect:              W=2, A=32: MNIST 0.56, CIFAR-10 4.89, ImageNet -
  Ternary Weight Network:       W=2, A=32: MNIST 0.06, CIFAR-10 0.32, ImageNet 0.8
  Trained Ternary Quantization: W=2, A=32: MNIST -,    CIFAR-10 -,    ImageNet 0.6
  Ternary Neural Network:       W=2, A=2:  MNIST 1.08, CIFAR-10 4.99, ImageNet -
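The "no multiplication" property is easy to see in code: with weights restricted to {-1, 0, +1}, a dot product degenerates to adds, subtracts, and skips. A minimal C sketch (toy sizes, not tied to any particular accelerator):

    #include <stdio.h>

    /* Ternary-weight dot product: no multiplier needed, the weight value
     * only selects add, skip, or subtract. */
    static int ternary_dot(const int *x, const signed char *w, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            if (w[i] == 1)       acc += x[i];
            else if (w[i] == -1) acc -= x[i];
            /* w[i] == 0: skipped entirely (a redundant operation avoided) */
        }
        return acc;
    }

    int main(void) {
        int x[4] = {1, 2, 3, 4};
        signed char w[4] = {1, -1, 0, 1};
        printf("%d\n", ternary_dot(x, w, 4));   /* 1 - 2 + 4 = 3 */
        return 0;
    }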

Page 35: Opening & Plenary Session #1 Hardware

New Opportunity to Optimize B/T Weight Convolutions

Binary/ternary kernels take only 2 or 3 distinct weight values, so the kernels in a kernel group apply many identical partial computations to the same input feature maps. These redundant operations (ROPs) cost extra power; removing ROPs across kernel groups computes each shared subexpression once.

Example: for input values 1, 2, 3, 4 and a group of four kernels, the dot products 1+2-3+0, 0+2-3+4, 1-2+3+0, and -1+2+0+4 rewrite to 1+(2-3)+0, 0+(2-3)+4, (1-2)+3+0, and -(1-2)+0+4, so the shared terms (2-3) and (1-2) are each evaluated once.

Page 36: Opening & Plenary Session #1 Hardware

Special Optimization for B/T Weight NNs: Kernel Transformation + Feature Reconstruction (KTFR)

Standard convolution of a 4×4 input feature map (values 1..16) with two 3×3 binary kernels K1 and K2 costs 64 OPs. KTFR transforms the kernel pair first:

K1′ = (K1 + K2) / 2
K2′ = (K1 − K2) / 2

Since binary kernels either agree or disagree elementwise, K1′ and K2′ are ternary and contain many zeros. Convolving the ifmap with K1′ and K2′ yields (O1+O2)/2 and (O1−O2)/2, and feature reconstruction recovers the original ofmaps with one add and one subtract per output element: O1 = conv(K1′) + conv(K2′), O2 = conv(K1′) − conv(K2′). In the example this cuts the cost from 64 OPs to 36 OPs.

Example from the slide: K1 = [1 1 -1; -1 1 -1; 1 1 1] and K2 = [-1 1 1; 1 1 1; -1 -1 1] transform to K1′ = [0 1 0; 0 1 0; 0 0 1] and K2′ = [1 0 -1; -1 0 -1; 1 1 0]; the intermediate ofmaps are [19 22; 31 34] and [5 5; 5 5], which reconstruct to O1 = [24 27; 36 39] and O2 = [14 17; 26 29].
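Convolution is linear, which is all KTFR relies on. The following minimal C sketch (sizes and kernel values taken from the slide's example; conv3x3 is a helper written here for illustration) reproduces the numbers above:

    #include <stdio.h>

    /* Valid 3x3 convolution of a 4x4 input, producing a 2x2 output. */
    static void conv3x3(int in[4][4], int k[3][3], int out[2][2]) {
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++) {
                int acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += in[r + i][c + j] * k[i][j];
                out[r][c] = acc;
            }
    }

    int main(void) {
        int ifmap[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
        int k1[3][3] = {{1,1,-1},{-1,1,-1},{1,1,1}};
        int k2[3][3] = {{-1,1,1},{1,1,1},{-1,-1,1}};
        int k1t[3][3], k2t[3][3];               /* transformed kernels */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                k1t[i][j] = (k1[i][j] + k2[i][j]) / 2;  /* sparse, ternary */
                k2t[i][j] = (k1[i][j] - k2[i][j]) / 2;
            }
        int p1[2][2], p2[2][2];
        conv3x3(ifmap, k1t, p1);                /* (O1+O2)/2 */
        conv3x3(ifmap, k2t, p2);                /* (O1-O2)/2 */
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++)         /* feature reconstruction */
                printf("O1=%d O2=%d\n", p1[r][c] + p2[r][c],
                                        p1[r][c] - p2[r][c]);
        return 0;                               /* O1: 24 27 36 39; O2: 14 17 26 29 */
    }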

Page 37: Opening & Plenary Session #1 Hardware

Special Optimization for B/T Weight NNs: FIBC + KBWI

(Figure: a standard convolution of a 3×3 input feature map with the original ±1 kernels costs 24 OPs. The optimized flow splits the work into an integral calculation over the input (KBWI, a feature summation computed once), a small remaining convolution with transformed FIBC kernels and an RRC term, and an integral-fusion step that combines the two; it produces the same 2×2 output feature maps at 16 OPs in the example.)

Page 38: Opening & Plenary Session #1 Hardware

Thinker-II: mW-Level AI Processor

• Ultra-low power
• Load balancing and scheduling
• Low-bit-width weights

Technology: TSMC 28nm HPC
Supply voltage: 0.58 V – 0.9 V
Area: 1.7 mm × 2.7 mm
SRAM: 225 KB
Frequency: 20 – 400 MHz
Power: < 100 mW
Energy efficiency: 20 TOPS/W @ binary AlexNet

(Die photo: data memories, weight memory, PLL, controller, processing engine.)

Recognition: 2018 IEEE ISSCC SRP; 2018 International Symposium on Computer Architecture (ISCA); 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2019 IEEE Journal of Solid-State Circuits (JSSC).

Page 39: Opening & Plenary Session #1 Hardware

The Requirement for µW-Level AI Processors

(Figure: the same computation-vs-power chart, now highlighting the < 1 mW region of always-on smartphone features, wearable devices, and smart sensors.)

Page 40: Opening & Plenary Session #1 Hardware

Prevailing Human-Machine Speech Interfaces

○ Apple Siri, Google Now, and Microsoft Cortana on mobile phones, wearable devices, and IoT devices (smart earphones, wall switches, ...)
○ General speech recognition procedure: voice activity detection → feature extraction → acoustic model → decoding

Page 41: Opening & Plenary Session #1 Hardware

Binary Convolutional Neural Networks (BCNNs)

○ Activations and weights quantized to 1 bit, saving memory footprint
○ Expensive multiplications replaced by XNORs, saving power and area
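To see why XNOR replaces multiplication: encode +1 as bit 1 and -1 as bit 0; the product of two values is then +1 exactly when the bits match. A minimal C sketch of a 32-element binary dot product (using the GCC/Clang __builtin_popcount builtin; not the Thinker-S circuit):

    #include <stdint.h>
    #include <stdio.h>

    /* 32-wide binary dot product: XNOR marks matching signs, popcount
     * converts matches into the sum: dot = 2*popcount(~(x^w)) - 32. */
    static int bdot32(uint32_t x, uint32_t w) {
        uint32_t matches = ~(x ^ w);          /* 1 where signs agree */
        return 2 * __builtin_popcount(matches) - 32;
    }

    int main(void) {
        /* All bits agree -> dot = +32; all disagree -> -32. */
        printf("%d %d\n", bdot32(0xFFFFFFFFu, 0xFFFFFFFFu),
                          bdot32(0xFFFFFFFFu, 0x00000000u));
        return 0;
    }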

Page 42: Opening & Plenary Session #1 Hardware

Frame-Level Reuse in BCNN

□ Exploit temporal data locality to eliminate redundant computing.

(Figure: consecutive BCNN input feature maps are sliding windows over the speech features, 11 frames per fmap with 1×40 features per frame, so adjacent fmaps overlap in 10 of 11 frames and their convolution results with the 3×3×64 kernels largely coincide. The overlapped results are kept in a buffer and reused for the next feature map; only the outputs involving the newly arrived frame are computed, and the stored results are updated before being passed to the next layer.)
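A rough C sketch of the idea with a toy 1-D kernel (FRAMES, KW, and the summing kernel are illustrative assumptions, not the Thinker-S configuration): each new feature map shifts the buffered columns and computes only the column that involves the newly arrived frame.

    #include <stdio.h>
    #include <string.h>

    #define FRAMES 11     /* frames per feature map   */
    #define KW 3          /* kernel width (time axis) */

    /* One output column of a toy convolution: sum of KW frames. */
    static int conv_col(const int *frames, int t) {
        int acc = 0;
        for (int k = 0; k < KW; k++) acc += frames[t + k];
        return acc;
    }

    int main(void) {
        int stream[32];
        for (int i = 0; i < 32; i++) stream[i] = i;  /* incoming frames */
        int cols[FRAMES - KW + 1];                   /* buffered results */

        /* First map: compute all columns. */
        for (int t = 0; t <= FRAMES - KW; t++) cols[t] = conv_col(stream, t);

        /* Later maps: shift the buffer, compute only the newest column. */
        for (int start = 1; start + FRAMES <= 14; start++) {
            memmove(cols, cols + 1, (FRAMES - KW) * sizeof cols[0]);
            cols[FRAMES - KW] = conv_col(stream + start, FRAMES - KW);
            printf("map %d: last col = %d\n", start, cols[FRAMES - KW]);
        }
        return 0;
    }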

Page 43: Opening & Plenary Session #1 Hardware

Bit-Level BCNN Weight Regularization

□ Regularize and compress the bits in BCNN 3-D weight matrices: after pruning, many 32-bit weight rows are mostly zero bits, and regularization groups them so the zeros can be dropped from storage and re-completed ("zero completion") on read.

(Figure: rows are stored in four banks, a 16-zero bank holding 16 payload bits, a 12-zero bank holding 20 bits, an 8-zero bank holding 24 bits, and a full 32-bit bank, selected by a 2-bit entry in a per-row flag table; a 2-to-4 decoder turns the flag into a bank enable and address.)

Measured over the benchmarks, rows fall roughly 5% / 72% / 4% / 19% and 5% / 80% / 5% / 10% into the 16-zero / 12-zero / 8-zero / full banks, yielding 24.25% and 27.50% reductions in both storage and memory accesses at 1% and 2% precision loss, respectively.
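One plausible reading of the read-out path, sketched in C (the bank payload widths follow the slide's 16/20/24/32-bit labels; the names and layout are my assumptions, not the verified design): a 2-bit flag per row selects the bank, and the dropped zeros are re-completed on fetch.

    #include <stdint.h>
    #include <stdio.h>

    enum { BANK16 = 0, BANK12 = 1, BANK8 = 2, BANK_FULL = 3 };

    /* Re-complete a 32-bit weight row from its bank: the flag says how
     * many payload bits were stored; the rest are restored as zeros. */
    static uint32_t fetch_row(uint8_t flag, uint32_t stored) {
        switch (flag) {
        case BANK16: return stored & 0xFFFFu;    /* 16 payload bits */
        case BANK12: return stored & 0xFFFFFu;   /* 20 payload bits */
        case BANK8:  return stored & 0xFFFFFFu;  /* 24 payload bits */
        default:     return stored;              /* full 32-bit row */
        }
    }

    int main(void) {
        printf("0x%08X\n", (unsigned)fetch_row(BANK16, 0xDEADBEEF));
        /* prints 0x0000BEEF: upper 16 zeros re-completed */
        return 0;
    }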

Page 44: Opening & Plenary Session #1 Hardware

Approximate Adder

□ Additions in BCNN are dominated by "+1": roughly 95.9% of additions are 1-bit adds and only 4.1% are 16-bit adds.
□ Truncate the high-bit carry chain to shorten the critical path.

(Figure: the 16-bit adder is split into an exact lowest-5-bit addition producing Sum4:0, plus ripple-carry segments for bits 8:5 and 12:9 (4-bit RCAs) and bits 15:13 (3-bit RCA), each fed by its own sub-carry unit. The lowest 5 bits must be exact because an increment can carry across all l = 5 low bits, as in 11111 + 00001 = 100000, i.e., whenever the result reaches 32.)

Page 45: Opening & Plenary Session #1 Hardware

46

□ Circuit diagram of the lowest-5-bit adder: it reduces delay by 49.28% and power-delay product (PDP) by 48.08%. The exact lowest 5 bits reduce power while guaranteeing the correctness of incremental addition.

Carry chain of the lowest 5 bits (g = generate, p = propagate, as drawn for the increment case 11111 + 00001 = 100000, l = 5, result ≥ 32):
c0 = g0
c1 = p1 & g0
c2 = p2 & p1 & g0
c3 = p3 & p2 & p1 & g0
c4 = p4 & p3 & p2 & p1 & g0

Evaluation flow: TSMC 28nm, 0.9 V → DC → netlist → HSPICE → delay/power. Benchmarks: TIDIGIT, TIMIT, RM, WSJ, Spoken Number.
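As a rough behavioral model (an ETAII-style carry-speculation sketch under my reading of the slide, not the verified circuit): the lowest 5 bits add exactly, while each upper segment takes a carry-in predicted from the operand bits of the segment directly below it, truncating longer carry chains. Carries that would have to ripple across more than one segment boundary are dropped, which is where the (rare) errors live.

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t approx_add16(uint16_t a, uint16_t b) {
        /* Exact lowest 5 bits: keeps the dominant "+1" adds correct. */
        uint16_t low   = (a & 0x1F) + (b & 0x1F);
        uint16_t sum   = low & 0x1F;
        uint16_t carry = low >> 5;      /* exact carry into bit 5 */

        /* Segments 8:5, 12:9, 15:13, matching the slide's split. */
        struct { int lo, bits; } seg[3] = { {5, 4}, {9, 4}, {13, 3} };
        for (int s = 0; s < 3; s++) {
            uint16_t mask = (1u << seg[s].bits) - 1;
            uint16_t sa = (a >> seg[s].lo) & mask;
            uint16_t sb = (b >> seg[s].lo) & mask;
            uint16_t t  = sa + sb + carry;
            sum |= (t & mask) << seg[s].lo;
            /* Truncated chain: the next segment's carry-in is speculated
             * from this segment's operands alone (own carry-in ignored). */
            carry = (sa + sb) >> seg[s].bits;
        }
        return sum;
    }

    int main(void) {
        printf("%u (exact %u)\n", approx_add16(1234, 1), 1234u + 1u);
        return 0;
    }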

Page 46: Opening & Plenary Session #1 Hardware

Thinker-S: µW-Level AI Processor

• Ultra-low power for speech
• Always-on & real-time
• Wake-up, command, and speech recognition

Technology: TSMC 28nm HPC
Supply voltage: 0.52 V – 0.9 V
Area: 1.74 mm × 0.74 mm
SRAM: 27 KB
Frequency: 2 – 50 MHz
Power: 0.2 – 5 mW
Energy efficiency: 304 nJ/frame

(Die photo: decoder unit, controller, data memories, weight memory, BCNN unit, pre-processing.)

Recognition: 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2018 Design Automation Conference (DAC); 2019 IEEE Transactions on Circuits and Systems I: Regular Papers.

Page 47: Opening & Plenary Session #1 Hardware

The Next Step for AI Chips

(Figure: energy efficiency (TOPS/W) of AI chips, 2016-2020: MIT Eyeriss (RS dataflow) 0.25; Google TPU (systolic) 2.3; ICT/Cambricon Cambricon-X (sparsity) 0.54; KU Leuven Envision (DVAFS) 2.75; KAIST UNPU (reconfigurable) 11.6; THU Thinker-II (reconfigurable) 19.9; MIT CONV-RAM (7×1 bit) 28.1; SEU Sandwich (8×1 bit) 55.)

Phase I (digital architectures):
√ Innovation in architecture drives energy efficiency
√ Reconfigurable architectures have good programmability
× Digital architectures face the "memory wall" bottleneck

Phase II (in-memory computing):
√ In-memory computing shows great potential
× Only supports the basic MAC operations

Phase III (next?): reconfigurable architecture + in-memory computing.

Page 48: Opening & Plenary Session #1 Hardware

Thank you for your attention

Page 49: Opening & Plenary Session #1 Hardware

Sponsors

Premier Sponsor & tinyML Strategic Partner

Gold Sponsor

Silver Sponsor

Page 50: Opening & Plenary Session #1 Hardware

© 2020 Arm Limited (or its affiliates)

Arm: The Software and Hardware Foundation for tinyML

Optimized models for embedded:
Application
Runtime (e.g., TensorFlow Lite Micro)
Optimized low-level NN libraries (i.e., CMSIS-NN)
Arm Cortex-M CPUs and microNPUs

1. Connect to high-level frameworks: AI ecosystem partners
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK
3. Connect to runtime: RTOS such as Mbed OS

Resources: developer.arm.com/solutions/machine-learning-on-arm

Stay connected: @ArmSoftwareDevelopers, @ArmSoftwareDev

Page 51: Opening & Plenary Session #1 Hardware

Dynamic Neural Accelerator™: The Next Generation of AI Processor for the Embedded Edge

• Tight coupling between AI software & hardware with automated co-design
• 10x more compute with a single DNA engine
• More than 20x better energy efficiency
• Ultra-low latency
• Fully programmable with INT8 support

Target markets: automotive, robotics, drones, smart cities, Industry 4.0

www.edgecortix.com
© 2020 EdgeCortix. All rights reserved.

Page 52: Opening & Plenary Session #1 Hardware

SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile, and edge devices. We design systems for real-time, always-on smart sensing for audio, vision, IMUs, bio-signals, and more.

https://SynSense.ai

Page 53: Opening & Plenary Session #1 Hardware

Partners

Conference Partner

Media Partners

Page 54: Opening & Plenary Session #1 Hardware

Questions?

To join the official tinyML WeChat group, please add a staff member on WeChat and mention "tinyML".

Page 55: Opening & Plenary Session #1 Hardware

Copyright Notice

This presentation in this publication was presented at tinyML® Asia 2020. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by the tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org