Opening & Plenary Session #1 Hardware

Opening & PlenarySession #1 – Hardware

November 17, 2020

Prof. Shouyi YinDepartment of Micro- and Nano-electronics

Tsinghua University

Deploying AI EverywheremW/μW Level Neural Network Processor

AI HealthcareImage Recognition

Translation Speech-to-TextSmart home

Google Baidu IBM MicrosoftAlibaba

AI Cloud

Today’s AI Services almost All in Cloud

Things

WIFI Clients

MobileDevices

10101012

From AI in Cloud to AI Everywhere

[Google Brain @ISSCC’18]

Little MLMicro-chips

Medium MLEdge, Mobile

Big MLData Center

[Rob Aitken @ARM]

AI Chipsets Revenue 2016 - 2025

From Tractica Report

Computation requirements v.s. Power constraints

Computation (ops)

Smartphone (Always on)Wearable deviceSmart sensors

Smart phoneSmart homeSmart toysSmart glassesWiFi camera

Homeappliance

Video surveillanceIndustryAgriculture

AutomobileData center

1mW 100mW 2W 100W

① Programmability

Example 1：LeNet for Handwriting Recognition

Example 2：AlexNet for Image Classification

Example 3：LRCN for Image Captioning

② Neural Computing & General computing

③ High energy efficiency for edge computing ( ~ TOPS/W @ mW or μW level)

Challenges for AI Chips

FaceDetection

Resize

LandMark(NN) Facial

alignment

Resize

Euclidean Distance

Resize

FaceRecognition

NNLeonardo DiCaprio

Network Computation General Computation

Face detection and Recognition

Neural Computing: • Convolutional Network• Fully connected Network• Recurrent Network• ……

General Computing: • Image Signal Processing• Visual Processing• Sound Signal Processing• ……

RS DataflowMIT Eyeriss

ASIPDiannao

Domain Specific Architecture

Towards massive basic computation

Algorithm design in conjunction with hardware design:

less latency, more power efficient, more general !

Towards compact parallel computing

Network pruning, compression, quantization, low-bit …

Data granularity , program

ming/storage m

odel, …

Algorithm & hardware co-design, co-optimization, co-verification !

��

� � �

Systolic ArrayGoogle TPU

2017.1

Low-bit Adaptive quant. LQ-Nets, 2018.9

Low-bit TrainingDoReFa-Net, 2018.2

TWN2016.11

BWN2016.8

Pruning2016.2

Sparsity-awareNVIDIA SCNN

2017.6

Flexible BitwidthKAIST UNPU

To make algorithms more flexible

To make hardware more busy

Dual Trends for Energy Efficient NN Computing

Compact NN Model

ASIPDiannao

Compact NN Model

ming/storage m

odel, …

��

� � �

2017.1

TWN2016.11

BWN2016.8

Pruning2016.2

2017.6

Binary & Ternary Weight Neural Networks

Training Techniques for extreme low-bit NN

² Reduce memory footprint: various weight quantization techniques

² Increase representational capability: shortcut

² Reduce variation: weight approximation, batch-normalization

Top1 classification accuracy of ResNet-18 on ImageNet(weights of all networks are binary/ternary)

Evolvement of networks

Accuracy

XNOR-Net2016

LQ-Nets2018 � � �

DoReFa-Net2016

Full precision69.3%

66.6% 68.0%

BWN2016

ABC-Net2017

TTQ2017

BinaryNet2016

ASIPDiannao

Compact NN Model

ming/storage m

odel, …

��

� � �

2017.1

TWN2016.11

BWN2016.8

Pruning2016.2

2017.6

What Kind of Computing Architecture?

Programmabilityv.s.

Energy Efficiencyv.s.

Compact Model

ASIP: Cambricon Systolic Array: TPURS Dataflow: Eyeriss

Sparsity: SCNN What is Next ?Flexible Bit: UNPU

Coarse-Grained Reconfigurable Architecture

9.8 EMERGING COMPUTINGARCHITECTURE

Software Defined Hardware

Build runtime reconfigurable hardware and software thatenables near ASIC performance without sacrificingprogrammability for data-intensive algorithms.

l to build a processor that is reconfigurable at runtime.l to build programming languages and compilers that

optimize software and hardware at runtime.

Reconfiguration times: 300 - 1,000 ns

at runtime

Energy Efficiency Comparison

GPUFPGA

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

CPU CPUs GPU CPU+GPU FPGA Dedicated

Dedicated

CPU+GPU

moreprogrammable

lessprogrammable

notprogrammable

Chip number

[Stanford University, Prof. Kunle Olukotun, ISCA 2018, Keynote Speech]

Software Defined Hardware(CGRA)

New Start探索期发展期 High Speed ExpansionExpansionExploration

PADDI1990

PADDI-21993

Matrix1996 RaPID1996

Pleiades1997

REMARC1998

PipeRench1998

DP-FPGA 1994

Garp 1997

RAW1997

CHESS 1999

MorphoSys1999

DReAM2000

MorphICs 2000

ADRES2003XPP2003Zippy2003

EGRA2009

DySER2012

CGRA History

Prof. Gerald Estrin

“Organization of Computer Systems-The Fixed Plus Variable Structure Computer,”Proc. Western Joint Computer Conf., New York, 1960, pp. 33-40.

863 Program: General Purpose Reconfigurable

Computing

NSFC: Reconfigurable Vision Processor

NSFC: Reconfigurable Networks-on-Chip

Major S&T Program: Reconfigurable Graphic Processor

20062010

20152018

863 Program: Reconfigurable Multimedia

Processor

NSFC: Reconfigurable

Architecture

Thinker: Reconfigurable NN Processor

China-UK: Reconfigurable Cloud Computing

NSFC: Reconfigurable Hybrid AI Processor

NN Processor

Domain Specific Reconfigurable Processor

Theory and Basic Arch

Prof. Shaojun Wei

IEEE FellowDirector of IME, THU

Reconfigurable Computing Research in Tsinghua Univ.

Data memory (Scratchpad)

trolle

VddH VddL

Architecture Overview

Config

Execution

Config

Execution

Arithmetic and Logic OperationsA + B A >> B A > B (A+B)>>C

A - B A + (B>>C) A == B (A+B)<<C

A & B A A < B (A-B)>>C

A | B A + (B<<C) A >= B (A+B)<<C

A ^ B A - (B<<C) A <= B A×B_H

A ~^ B |A-B| A != B Clip(A, -B, B)

~A (A>>C)-B A - (B>>C) Clip(A, 0, B)

A << B A×B_L (A<<C)-B C?A:B

n Spatial Arch, Distributed Mem, No-Instruction, flexible bit-width

Software defined Datapath

Programming Model

Compilation

Compiling Applications onto CGRA

DDG (dif, min)A loop

Kernel

II: Initiation IntervalS1

Loop Pipelining

PE1 PE2

Time Extended PE Array

Key ProblemsExploiting operator level parallelism

Finding better OP-PE binding according to CGRA’s arch features

for(i=0; i<N; i++){S1: a[i] = b[i-1] + 1;S2: b[i] = a[i] / 2;S3: c[i] = b[i] + 3;S4: d[i] = c[i];}

Loop carried dependence

A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code.《Computer Architecture: A Quantitative Approach》

Non-affine Affine

Non-uniform Uniform

Multi-level Single-level

Branchw/o branch 2. Branch 1. w/o branch

3. Perfect 4. Imperfect

Loops fall into many categories and we mainly focus on 4 categories loop mapping to CGRAs,which cover most of real life applications.

convertload/store

innermost

for(i=0; i<N; i++)for(j=0;j<M;j++)

A[i][j]=A[2i+3j][j+1]+..

for(i=0; i<N; i++)a[i][0] = …for(j=0;j<M;j++)

A[i][j]=A[i+1][j+2]+..

for(i=0; i<N; i++)a = b + c;if(a>0){

d=a+c}

Multi-levelPerfectAffineNon-uniform

Multi-levelImperfectAffineUniform

Single-levelBranch

Iteration domain

Dependence

Structure

Loops Classification

TRMap(ICCAD’15/TVLSI’15)

MEMMap (TVLSI’15)

PolyMap(DAC’13/TVLSI’14)DualPL(TPDS’16)

ConfigurationContext

Nested?

Perfect ?

Branch ?

DualVdd (TCAD’16)

Auto Parallelization

Memory partitioning & mapping

Low-power Optimization

Compilation Flow

• PolyMap: Polyhedral model based mapping

• DualPL: Multi-level loop pipelining

• TRMap: Trigger-aware mapping

• MEMMap: Memory partitioning & mapping

• DualVdd: Voltage scheduling

Thinker: Reconfigurable AI Computing Architecture

Features:

1. Heterogeneous PE arrays supporting data reuse

2. Two types of reconfigurable PE providing programmability

3. Bitwidth adaptive MAC unit exploiting computing power for low bit NN

General PE

Super PE

Thinker: Reconfigurable AI Computing Architecture

Features:

1. Heterogeneous PE arrays supporting data reuse

2. Two types of reconfigurable PE providing programmability

3. Bitwidth adaptive MAC unit exploiting computing power for low bit NN

General PE

Super PE

Three levels of Reconfigurability

1. Reconfigurable MAC Unit

2. Reconfigurable PE 3. Reconfigurable PE Array

Three Levels of Reconfigurability

Reconfigurable MAC Unit: Bit-width Adaptive

• Support flexible bit-width neural networks

• Fully exploit the computing power of PE for low-bit neural networks

Bit-Serial ComputationSubword-Parallel Computation

Reconfigurable PE: Support different OPs in NN

Reconfigurable PE Array: On-demand Partitioning

CNN, FCN, LSTM execute sequentially CNN, FCN+LSTM execute in parallel

①②

2017 ACM/IEEE ISLPED Design Contest Award2017 VLSI Symposia on Technology and Circuits (VLSIC)2017 IEEE TVLSI Most Popular Article2018 IEEE Journal of Solid-State Circuits (JSSC)2018 MIT Technology Review

Thinker-I: Watt level AI processor

Technology TSMC 65nm LP

Supply voltage 0.67V~1.29V

Area 4.4mm×4.4mm

SRAM 348KB

Frequency 10 ~ 200MHz

Power 4mW ~ 447mW

Energy efficiency 1.06 ~ 5.09 TOPs/W

• General Purpose NN• Heterogeneous PE• Support CNN/FCN/RNN, Hybrid-NN

Thinker I

Computation (ops)

Homeappliance

1mW 100mW 2W 100W

Tens of mW

The requirement of mW level AI processor

ASIPDiannao

ming/storage m

odel, …

��

� � �

2017.1

TWN2016.11

BWN2016.8

Pruning2016.2

2017.6

Compact NN Model

Low-bit NN & HW co-design

Binary/Ternary Weight Neural Networks (BTNNs)

p No Multiplication p Low Memory Footprint and Capacityp Satisfied Accuracy

Bitwidth Accuracy Loss (%)

Weight Activation MNIST CIFAR-10 ImageNet

Binary Weight Neural Networks

(BWNNs)

Binary Connect 1 32 0.7 2.78 19.20Binary Weight Network 1 32 - 0.76 0.8Binary Neural Network 1 1 0.37 3.03 29.8

XNOR-Net 1 1 - 3.05 11

Ternary WeightNeural Networks

(TWNNs)

Ternary Connect 2 32 0.56 4.89 -Ternary Weight Network 2 32 0.06 0.32 0.8

Trained Ternary Quantization 2 32 - - 0.6

Ternary Neural Network 2 2 1.08 4.99 -

Hardware Friendly

Input Feature Maps

Redundant Operations

(ROPs)

Higher Power Consumption

Remove ROPs for different

kernel groups

Kernel Group

1 1-1 0

0 1-1 1

1 -11 0

-1 10 1

2 or 3 types of weight values

Kernel 2

Kernel 3

Kernel 4

1 + 2 - 3 + 0

0 + 2 - 3 + 4

1 - 2 + 3 + 0

-1 + 2 + 0 + 4

1 + (2 – 3) + 0

0 + (2 – 3) + 4

(1 – 2) + 3+ 0

-(1 – 2) +0 +4

1 23 4

Kernel 1

New opportunity to optimize B/T weight convolutions

Standard Convolution

1 2 35 6 79 1011

131415

1 1 -1-1 1 -11 1 1

-1 1 11 1 1-1 -1 1

0 1 00 1 00 0 1

1 0 -1-1 0 -11 1 0

19223134

5 55 5

24273639

14172629

24273639

1 2 35 6 79 1011

131415

14172629

Original Kernels

K1′ K2′

O1+O2 O1-O2

64 OPs 36 OPs

KTFR Kernels

OfmapFeature

Reconstruction

Kernel Transformation

K1′ =K1 + K2

K2′ =K1 − K2

Special Optimization for B/T weight NNsKernel-Transformation-Feature-Reconstruction (KTFR)

Feature Summation

Remaining Convolution

IntegralFusion

Integral Calculation KBWI

1 2 32 1 23 3 4

6 89 10

Originalkernels

OfmapStandard Convolution24 OPs

-2 01 0

4 43 2

4 44 5

1 23 4

RRC②

0 11 0

0 00 1

1 -1-1 1

1 11 -1

1 2 32 1 23 3 4

-2 01 0

4 43 2

6 89 10

4 44 5

1 23 4

FIBCkernels

RRC Ofmap

16 OPs

Special Optimization for B/T weight NNs

2018 IEEE ISSCC SRP2018 International Symposium on Computer Architecture (ISCA)2018 VLSI Symposia on Technology and Circuits (VLSIC)2019 IEEE Journal of Solid-State Circuits (JSSC)

Technology TSMC 28nm HPC

Area 1.7mm×2.7mm

SRAM 225KB

Frequency 20 ~ 400MHz

Power < 100mW

Energy efficiency 20 TOPs/W @ Binary AlexNet

Thinker II• Ultra-low Power• Load Balancing and Scheduling• Low bit-width Weights

Data Memory

yWeight Memory

Processing Engine

Thinker-II: mW level AI processor

Computation (ops)

Homeappliance

1mW 100mW 2W 100W

< 1 mW

The requirement of µW level AI processor

Prevailing Human-Machine Speech Interfaces

Apple Siri Google Now

Microsoft Cortana

○ Mobile phones, wearable devices, IoT devices…

Smart earphones

Wall switches

○ General speech recognition procedure

FeatureExtraction

AcousticModel DecodingVoice Activity

Detection

Binary Convolutional Neural Networks (BCNN)

○ Activation and weights quantized to 1 bit--save memory footprint○ Replace expensive multiplications by XNORs--save power and area

Frame-Level Reuse in BCNN□ Exploit temporal data locality to eliminate redundant computing

Overlapped SpeechFeature Maps

OverlappedConvolutional results

Output reuse ofConsecutive feature maps

3×3Kernel3×3

Kernel

... ×

Kernel3×3Kernel

Redundancy

Buffer

10 3×3Kernel3×3

Kernel

3×31112

Buffer

Results reuse

Update stored results

NextLayer

121110

......

BCNN fmap 1BCNN fmap 2

BCNN fmap 3

(1×40 features per frame)

(11 frames per fmap)

Bit-Level BCNN Weight Regularization

After pruning

Channel32 bit

Before pruning

□ Regularize and compress the bits in BCNN 3-d weight matrices

Before regularization After regularization

Zero completion

00...0

00 01 10 11

00...0110100...

2-4 decoder

0 1 0 0D0 D1 D2 D3

1EnAddr

NextFlag

110100...

Address Generator

16-zeroBank

12-zeroBank

8-zeroBank

FullBank

FlagTable

32 bit

EnAddr

16 bit 20 bit

32 bit 32 bit 32 bit

EnAddr

24 bit 32 bit 2 bit

EnAddr Addr

24.25% 24.25% 1%27.50% 27.50% 2%

Storage Reduction

MEM Access Reduction

Precision Loss

5% 72% 4% 19%5% 80% 5% 10%

16-zero 12-zero 8-zero Full

Lowest 5-bit Addition

A4:0 B4:0

11111 + 00001100000

≥ 32

3-bitSub Carry

3-bit RCAAddition

A15:13 B15:13

4-bit Sub Carry

4-bit RCAAddition

A12:9 B12:9

Sum15:13 Sum12:9

4-bit Sub Carry

4-bit RCAAddition

Sum4:0

Sum8:5

□ Additions in BCNN are dominated by “+1”

□ Truncate high-bit carry-chain to shorten critical path.

1-bit Add

16-bit Add

Proportion of Addition operations in BCNN

Approximate Adder

□ Circuit diagram of lowest 5 bits adder. Reduce delay by 49.28%, power-delay-product (PDP) by 48.08%.

Lowest 5-bit to reduce power and guarantee correctness of

Incremental Addition.

Lowest 5-bit Addition

A4:0 B4:0

11111 + 00001100000

l=5≥ 32

Carry Chain:c0 = g0 c1 = p1&g0c2 = p2&p1&g0

c3 = p3& p2 &p1&g0c4 = p4&p3&p2

&p1&g0

TSMC-28nm, 0.9V→DC→Netlist→Hspice→Delay/Power* Benchmarks: TIDIGIT,TIMIT,RM,WSJ,Spoken Number

Sum4:0

Approximate Adder

Technology TSMC 28nm HPC

Area 1.74mm×0.74mm

SRAM 27KB

Frequency 2 ~ 50MHz

Power 0.2 ~ 5mW

Energy efficiency 304 nJ/Frame

Thinker S• Ultra-low power for speech • Always-on & Real-time• Wakeup & Command & Speech

Data Memory

Weight Memory

BCNN Unit

Pre-processing

2018 VLSI Symposia on Technology and Circuits (VLSIC)2018 Design Automation Conference (DAC)2019 IEEE Transactions on Circuits and Systems I: Regular Papers

Thinker-S: µW level AI processor

2.75 11.6 19.9

2016 2017 2018 2019 2020

MITEyerissRS Flow

KU LeuvenEnvisionDVAFS

MITCONV-RAM7×1 bit

SEUSandwich8×1 bit

KAISTUNPUReconfigurable

Phase-II√ In-memory computing

shows great potential× Only supporting the

basic MAC operations

Phase-I√ Innovation in architecture drives

energy efficiency√ Reconfigurable architecture has

good programmability× Digital architecture faces

“Memory wall” bottleneck

Next ?

THUThinker-IIReconfigurable

GoogleTPUSystolic

ICT/CambrianCambricon-XSparsity

Phase-IIIReconfigurable Architecture

+In-memory computing

The next step of AI chips

Thank youfor your attention

SponsorsPremier Sponsor & tinyML Strategic Partner

Gold Sponsor

Silver Sponsor

Optimized models for embedded

Application

Runtime(e.g. TensorFlow Lite Micro)

Optimized low-level NN libraries(i.e. CMSIS-NN)

Arm Cortex-M CPUs and microNPUs

Profiling and

debugging

tooling such as

Arm Keil MDK

Connect to high-level

frameworks

Supported byend-to-end tooling

RTOS such as Mbed OS

Connect toRuntime

Arm: The Software and Hardware Foundation for tinyML

AI Ecosystem

Partners

Resources: developer.arm.com/solutions/machine-learning-on-arm

Stay Connected

@ArmSoftwareDevelopers

@ArmSoftwareDev

Dynamic Neural Accelerator™

Tight coupling between AI

software & hardware with

automated co-design

THE NEXT GENERATION OF AI PROCESSOR FOR THE EMBEDDED EDGE

10x more compute with

single DNA engine

More than 20x better

energy-efficiency

Ultra-low latency

Fully-programmable with

INT 8bit support

www.edgecortix.com

� Automotive

� Robotics

� Drones

� Smart Cities

� Industry 4.0

TARGET

MARKETS

SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design

systems for real-time always-on smart sensing, for audio, vision,

IMUs, bio-signals and more.

https://SynSense.ai

Partners

Conference Partner

Media Partners

Questions?

Or to join tinyML WeChat Group

添加工作人员进官方微信群(注明tinyML)

Please add staff to join our official tinyML WeChat Group

Copyright Notice

This presentation in this publication was presented at tinyML® Asia 2020. The content reflects the

opinion of the author(s) and their respective companies. The inclusion of presentations in this

publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of

the authors and their respective companies and may contain copyrighted material. As such, it is strongly

encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions

regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org

Opening & Plenary Session #1 Hardware

Documents

Transcript of Opening & Plenary Session #1 Hardware

Opening titles research

ΔΟΥΛΕΥΟΝΤΑΣ ΜΕ ΤΟ ΥΛΙΚΟ ΜΕΡΟΣ (HARDWARE) ΤΟΥ Η/Υ … · Βαική δομή ου Personal Computer (PC) - Hardware: Το Υλικό ου Υπολογιή

Stereoselective photoredox ring-opening polymerization of O … · 2020-01-29 · ARTICLE Stereoselective photoredox ring-opening polymerization of O-carboxyanhydrides Quanyou Feng

€¦ · Web viewCountry Fire Authority, Victoria Ventilation: The opening type can also influence the ventilation of your home. Select the opening type that ensures adequate ventilation

BLK 142 opening Balkan

AGATA: Advanced Gamma Tracking Arrayagata.pd.infn.it/LLP_Carrier/AGATA_Week_2007_pdf_private/Plenary... · AGATA: Advanced Gamma Tracking Array Presented by Andres Gadea INFN-LNL

Abstracts Part I: Plenary Lectures€¦ · context of classical orthogonal expansions, namely, expansions in Jacobi polyno-mials, spherical harmonics, orthogonal polynomials on the

A S-Sn Lewis Pair-Mediated Ring-Opening Polymerization of … · 2018-12-19 · A S-Sn Lewis Pair-Mediated Ring-Opening Polymerization of α‑Amino Acid N‑Carboxyanhydrides: Fast

Hardware Developers Didactic Galactic 0xb: Capacitors

PROGRAMME - Σύγχρονη Γαλλική Φιλοσοφία Τhe ase of the Sentence ... Jazz Blues Night . ... Plenary Lecture F • Theory of Thinking: Methods of Inquiry ...

ZXSDR BS8800 U240(V4[1].00.30) Hardware Description

08 CBB_H08_E3 Principle and Hardware Structure of ZXSDR BS8900 C100 74

Hardware Datapath Veriﬁcation using Algebraic Geometrykalla/ECE6745/book.pdf · 2014. 10. 30. · Hardware Datapath Veriﬁcation using Algebraic Geometry Priyank Kalla idealJ =

8086/8088 Hardware Specifications - LSU 5.pdf · 8086/8088 Hardware Specifications ... Indicates if the processor is accessing a memory address ... peripheral interface is a

Plenary lectures - Ruđer Bošković Institute · 2010. 9. 22. · Plenary lectures Circular surfaces CS(α,p) Sonja Gorjanc Faculty of Civil Engineering, University of Zagreb, Zagreb,

Squeezing The Hardware To Make Performance Juice

Regular expressions to fsms \hardware and software techniques Paul Cockshott.

764M FX2NC DSSDS Hardware - Mitsubishi Electric · i FX2NC (DSS/DS) Series Programmable Controllers Hardware Manual This manual confers no industrial property rights or any rights

Ring-Opening Polymerization of γ-(4-Vinylbenzyl ...

The Hardware of the ATLAS Pixel Detector Control System