14.3 A 43pJ/cycle non‑volatile microcontroller with 4.7μs ......14.3 A 43pJ/cycle Non-volatile...

5
This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg) Nanyang Technological University, Singapore. 14.3 A 43pJ/cycle non‑volatile microcontroller with 4.7μs shutdown/wake‑up integrating 2.3‑bit/cell resistive RAM and resillence techniques Wu, Tony F.; Le, Binh Q.; Radway, Robert; Bartolo, Andrew; Hwang, William; Jeong, Seungbin; Li, Haitong; Tandon, Pulkit; Vianello, Elisa; Vivet, Pascal; Nowak, Etienne; Wootters, Mary K.; Wong, Philip H.‑S.; Mohamed M. Sabry Aly; Beigne, Edith; Mitra, Subhasish 2019 Wu, T. F., Le, B. Q., Radway, R., Bartolo, A., Hwang, W., Jeong, S., ... Mitra, S. (2019). 14.3 A 43pJ/cycle non‑volatile microcontroller with 4.7μs shutdown/wake‑up integrating 2.3‑bit/cell resistive RAM and resillence techniques. Proceedings of 2019 IEEE International Solid‑State Circuits Conference (ISSCC 2019), 226‑228. doi:10.1109/ISSCC.2019.8662402 https://hdl.handle.net/10356/143358 https://doi.org/10.1109/ISSCC.2019.8662402 © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ISSCC.2019.8662402. Downloaded on 10 Jun 2021 05:23:26 SGT

Transcript of 14.3 A 43pJ/cycle non‑volatile microcontroller with 4.7μs ......14.3 A 43pJ/cycle Non-volatile...

  • This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg)Nanyang Technological University, Singapore.

    14.3 A 43pJ/cycle non‑volatile microcontrollerwith 4.7μs shutdown/wake‑up integrating2.3‑bit/cell resistive RAM and resillencetechniques

    Wu, Tony F.; Le, Binh Q.; Radway, Robert; Bartolo, Andrew; Hwang, William; Jeong,Seungbin; Li, Haitong; Tandon, Pulkit; Vianello, Elisa; Vivet, Pascal; Nowak, Etienne;Wootters, Mary K.; Wong, Philip H.‑S.; Mohamed M. Sabry Aly; Beigne, Edith; Mitra,Subhasish

    2019

    Wu, T. F., Le, B. Q., Radway, R., Bartolo, A., Hwang, W., Jeong, S., ... Mitra, S. (2019). 14.3 A43pJ/cycle non‑volatile microcontroller with 4.7μs shutdown/wake‑up integrating2.3‑bit/cell resistive RAM and resillence techniques. Proceedings of 2019 IEEE InternationalSolid‑State Circuits Conference (ISSCC 2019), 226‑228. doi:10.1109/ISSCC.2019.8662402

    https://hdl.handle.net/10356/143358

    https://doi.org/10.1109/ISSCC.2019.8662402

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must beobtained for all other uses, in any current or future media, includingreprinting/republishing this material for advertising or promotional purposes, creating newcollective works, for resale or redistribution to servers or lists, or reuse of any copyrightedcomponent of this work in other works. The published version is available at:https://doi.org/10.1109/ISSCC.2019.8662402.

    Downloaded on 10 Jun 2021 05:23:26 SGT

  • 14.3 A 43pJ/cycle Non-volatile Microcontroller with 4.7µs Shutdown/Wake-up integrating 2.3 bits-per-cell Resistive RAM and Resilience Techniques

    Tony F. Wu1, Binh Q. Le1, Andrew Bartolo1, William Hwang1, Seungbin Jeong1, Haitong Li1, Robert M. Radway1, Pulkit Tandon1, Elisa Vianello2, Pascal Vivet2, Etienne Nowak2, Mary K. Wootters1, H.-S. Philip Wong1, Mohamed M. Sabry Aly3, Edith Beigne2, Subhasish Mitra1 1Stanford University, Stanford, CA, 2CEA-LETI, Grenoble France, 3Nanyang Technological University, Singapore Non-volatility is emerging as an essential on-chip memory characteristic across a wide range of application domains, from edge nodes for the Internet of Things (IoT) to large computing clusters. On-chip non-volatile memory (NVM) is critical for low-energy operation, real-time responses, privacy and security, operation in unpredictable environments, and fault-tolerance [1]. Existing on-chip NVMs (e.g., Flash, FRAM, EEPROM) suffer from high read/write energy/latency, density, and integration challenges [1]. For example, an ideal IoT edge system would employ fine-grained temporal power gating (i.e., shutdown) between active modes. However, existing on-chip Flash can have long latencies (> 23ms latency for erase followed by write) while inter-sample arrival times can be short (e.g., 2ms in [2]). Our chip monolithically integrates two heterogeneous technologies: 18KBytes of on-chip Resistive RAM (emerging on-chip NVM, technology details in Fig. 14.3.1) on top of commercial 130nm silicon CMOS (16-bit general-purpose microcontroller core with 8Kbytes of SRAM). For various applications (in machine learning, control, and cryptography), we demonstrate active mode average energy of 43pJ/cycle (up to 5.7X lower vs. similar chips at similar speeds / technology nodes using on-chip Flash and FRAM), fine-grained temporal power gating (0.25µW during shutdown) with up to 8µs (average 4.7µs) transition from active to shutdown mode (up to 5,878X quicker vs. on-chip Flash), and 2-clock cycle (200ns) transition from shutdown to active mode. We also demonstrate, for the first time, a complete chip that stores multiple bits per on-chip RRAM cell (5 resistance values, i.e., 2.3 bits per cell) and processes stored information correctly (vs. previous demonstrations using standalone RRAM cells or few cells in standalone RRAM array). Such multi-bit storage improves the accuracy of neural network inference (2.3X for MNIST) on same hardware (vs. 1 bit per cell). RRAM (like other emerging NVMs such as phase change memory) exhibits write failures [1]. We overcome these challenges through the critical combination of two resilience techniques: 1. dynamic address remapping, which overcomes write failures during system operation with 0.5% active-mode energy increase and negligible execution time impact; 2. periodic ENDUrance REsiliency using random Remapping (ENDURER – Fig. 14.3.5) [3], a new technique implemented here for the first time. This combination enables our chip to achieve a 10-year functional lifetime when running MNIST inference continuously. To demonstrate fine-grained temporal power gating enabled by on-chip RRAM, our chip operates as follows (Fig. 14.3.1). During active mode, instructions are read from the on-chip 12KByte instruction RRAM and executed by the microcontroller core (MSP430 instruction set). During this time, data is accessed from peripheral ports (e.g., off-chip sensors), on-chip 4KByte data RRAM, or on-chip 8KByte scratchpad SRAM (loop counters, temporary variables with repeated writes: memory-mapped using compiler). After the data is processed, to transition to shutdown mode, results are written back to the 4KByte on-chip data RRAM (consuming 168pJ over 5 clock cycles per 16-bit word, Fig. 14.3.2) and the hardware scheduler unit power-gates (i.e. turns off power) the core, memory controllers, and memory. Our chip performs this transition 5,878X quicker than those with on-chip Flash due to the low write latency of RRAM (500ns vs 23ms for Flash). The chip returns to active mode upon data arrival (e.g., from sensors). We run 5 applications representing machine learning (logistic regression, support vector machine, convolutional neural network), control (Kalman filter) and cryptography (SHA256 hash) to demonstrate the effectiveness of our chip (Fig. 14.3.2). To put our results into perspective, we select a similar clock rate for our chip (10MHz, vs. industry chips with existing on-chip NVM such as FRAM and Flash) that is sufficient for fine-grained temporal power-gating while avoiding excessive energy consumption. The active mode power of our chip varies between 407µW to 477µW (average active mode energy: 43pJ/cycle). We achieve average 4.7µs/1.6nJ transition from active to shutdown mode and a 200ns/152pJ transition from shutdown to active mode (Fig. 14.3.2). Although the industry chips might be

    engineered to include additional margins, the overall benefits demonstrated by our chip are expected to stay significant even after margins are taken into consideration. We store multiple resistance levels (up to 5 in our chip) inside on-chip RRAM cells (e.g., neural network model weights, only read during inference) by special algorithms that change wordline voltage (VWL) and bitline voltage (VBL) in addition to modifying the pulse width (Fig. 14.3.3) and allocating larger resistance windows for levels with higher resistance values. With greater effective memory capacity (2.3 bits vs. 1 bit per RRAM cell) on the same hardware, higher-precision weights (e.g., 4-bit vs 8-bit) or larger neural network models (e.g., 6,490 vs. 9,402 weights) can be used (Fig. 14.3.3). Despite errors (cells with resistance values outside its intended resistance window) in 5 levels-per-cell storage, we achieve a 2.3X improvement in inference accuracy (i.e., 2.3X decrease in inference error) for neural networks (on the MNIST dataset, Fig. 14.3.3) when the weights are encoded as follows: two 5-level cells for magnitude and one 2-level cell for sign bit. RRAM is subject to temporary write failures (TWFs) and permanent write failures (PWFs, resulting in limited endurance: maximum number of successful writes to a cell) [4] that degrade application accuracy over time (Fig. 14.3.4). Cell-level parameter adjustment to improve write failures isn’t sufficient [4]. To address TWFs, we employ a write-verify scheme with retries [4]. If a write to an RRAM address is unsuccessful after 4 retries, we map that address (during runtime) to another location in a separate backup RRAM array using dynamic address remapping (Figs. 14.3.1, 14.3.4). Our chip contains a backup RRAM array (256 16-bit words) for every 4KBytes of RRAM; 128 words of that backup array are used for this mapping. The mapping information is stored in a 128-entry volatile look up table (volatile LUT, implemented using flip-flops, Fig. 14.3.1). During transition from active to shutdown mode, the contents of each volatile LUT are stored in the remaining 128 words of the corresponding backup array (non-volatile LUT). A write failure to a non-volatile LUT entry results in that entry marked invalid (majority vote over 5 RRAM bits decides entry validity). When the chip boots, the contents of the volatile LUTs are loaded from the corresponding non-volatile LUT. We use dynamic address remapping for our data RRAM, incurring 0.5% energy and negligible (0.005%) execution time costs; our data RRAM tolerates TWFs and PWFs in 17.3% and 2% words, respectively (Fig. 14.3.4). We use stronger programming conditions (higher voltage, more retries) to mitigate TWFs and insert dummy instructions to avoid PWFs in instruction memory (as writes occur only during programming). Despite limited write endurance of the 4 Kbyte data RRAM, we achieve 10-year lifetime using ENDURER (Fig. 14.3.5, software on FPGA + our chip) combined with dynamic address remapping, when running our neural network application (MNIST dataset) continuously (Fig. 14.3.6). We accelerate our tests to account for 10 years of running an application by first obtaining a sequence of all writes to RRAM (which account for 258 out of 617,669 total memory operations for a single inference) for the application. Then, we repeatedly perform the sequence of writes, through the ENDURER module on the FPGA, on the RRAM (skipping any read operations, writes to non-RRAM, and computation to save time). In our implementation of ENDURER, remapping is performed every 30 minutes and we use an SRAM buffer of 8 16-bit words. On-chip RRAM NVM enables significantly lower energy during active mode (vs. existing on-chip NVM such as Flash and FRAM), fine-grained temporal power gating, and multiple bits per RRAM cell. Correct computation using multi-bit RRAM cells in a complete chip, demonstrated for the first time, successfully improves neural network inference accuracy. Effective resilience techniques enable chips with on-chip RRAM to achieve 10-year lifetime (for neural network inference applications) despite write failures in the underlying RRAM. Our results can be further enhanced through domain-specific accelerators, bit-cost scalable 3D Vertical RRAM [5], and monolithic 3D integration of multiple RRAM layers [5]. The

  • presented techniques (fine-grained temporal power gating, resilience) may be used for other emerging on-chip NVM (e.g., phase change) technologies as well. Acknowledgements Work supported in part by DARPA, NSF/NRI/GRC E2CDA, and the Stanford SystemX Alliance. References [1] A. Chen, “A review of emerging non-volatile memory (NVM) technologies and applications,” Solid-State Electronics, 125 pp. 25-38, 2016. [2] R. Braojos, et al., "Nano-engineered architectures for ultra-low power wireless body sensor nodes," CODES+ISSS, 2016. [3] M. M. S. Aly et al., " The N3XT Approach to Energy-Efficient Abundant-Data Computing", in Proc. IEEE, 2019. [4] A. Grossi, et al. “Fundamental Variability Limits of Filament-based RRAM,” IEDM, 2016. [5] H.-S. P. Wong, et al., “Memory leads way to better computing,” Nat. Nanotech., 2015.

  • Figure 14.3.1: Block diagram of our chip.

    Figure 14.3.2: Benefits of using our chip with on-chip RRAM.

    Figure 14.3.3: Using 2.3 bits per RRAM cell for convolutional neural network applications.

    Figure 14.3.4: Dynamic address remapping with write-verify loop.

    Figure 14.3.5: ENDURER test setup and remapping, read, and write algorithms.

    Figure 14.3.6: ENDURER and Dynamic Address Remapping improves chip lifetime to 10 years.

    Programming Ports

    Set Voltage 1.55VReset Voltage 2.5VSet Wordline Voltage 1.6V

    RRAM Device (TiN/HfOx/Ti/TiN) ParametersReset Wordline Voltage 3.2VSet/Reset pulse time 50nsRead Voltage 0.2V

    Minimum VDD 0.71VMinimum RRAM Peripheral Logic VDD 3.2VClock Frequency 10MHz

    Chip Operating Conditions

    Valid BadAddressValid BadAddressValid BadAddress

    0

    127

    09 8

    PriorityEncoder

    Valid & Match

    RRAM Memory Controller

    Address

    Volatile LUT (in FFs)

    Data to/from RRAM

    Backup Addr.

    Addr. To RRAM

    Data

    Address Lookup

    16-bitData16-bitData16-bitData

    Valid(5 bits) Address(11 bits)Valid(5bits) Address(11bits)

    Valid(5 bits) Address(11bits)

    0

    127

    128

    255…

    BackupMem.

    NV LUT

    0111015

    open

    MSP

    430

    Cor

    e(6

    66 F

    Fs, 5

    203

    std

    cells

    )

    Periph.Bus

    0

    1Instr. RRAM(12KByte + 1536 Byte Backup)

    Instr. Bus

    Instr. Mem. Busy

    Instr. Mem. Controller w/Dynamic Addr. Remap

    (3x 128 entry LUTs)

    Data RRAM(4KByte + 512 Byte Backup)

    Data Scratchpad (8KByte SRAM)

    0

    1

    Data Mem. Busy

    Data Mem. Controller w/Dynamic Addr. Remap

    (128 entry LUT)

    0

    1Data Bus

    Data Addr.

    Sche

    dulin

    gH

    /W B

    lock

    CoreReset

    Reset

    Wakeup

    Clock

    Chip Only on during Active mode (powered off during shutdown mode)Always On

    Dynamic Address Remapping 512 Byte Backup Array (128 16-bit words

    NV: Non-volatile

    01

    To C

    ore

    6605 FFs, 32,957 standard cells, 18KByte RRAM, 8KByte SRAM

    1.55.0 5.0 8.0 4.01.4

    5.1 5.1 5.1 5.1

    192640 640

    24024 23512

    18 60 6096 48

    110

    1001000

    10000100000

    SVM LogisticRegression

    Convolutional NN SHA256 Hash Kalman Filter

    0100200300400500 Core + SRAM RRAM

    (1) MSP430FR5739, (2) MSP430F5529, (3) MN101L (4) http://www.ti.com/tool/SHA-256

    128x

    6 Bytes 20 Bytes 20 Bytes 32 Bytes 16 BytesData Written to NVM

    5,878

    1 1 1 1 14.4 5.5 5.7 4.2 4.4

    23

    1511

    2128

    8.6 9.1 9.0 7.9 9.3

    0

    10

    20

    30

    Rel

    ativ

    e M

    easu

    red

    Activ

    e En

    ergy

    CNN1 (Fig. 14.3.3, MNIST) (4)

    0.5 1.7 1.7 2.7 1.30.6 1.9 1.9 3.1 1.52790 90

    1447268

    227 227363

    181

    0

    200

    400

    Activ

    e to

    shu

    tdow

    n m

    ode

    Mea

    sure

    d En

    ergy

    (nJ)

    Mea

    sure

    dTi

    me

    (µs)

    Our Chip (RRAM) (130nm) 130nm FRAM (1) 130nm Flash (2) 180nm Industry RRAM(3)

    Operating Frequency (MHz) 10 8 8 10Amount of NVM (KBytes) 18 16 128 64Read/Write latency (ns) 23/50 (500 w/ write-verify(5)) 50/50 140/64µs (Prgm)/23ms (Erase) Not ReportedRead/Write energy (pJ/bit) 1.76/10.9 (Set) / 10.1 (Reset) 10.5/10.5 30.5/562.5 Not ReportedShutdown mode power (µW) 0.25 0.64 0.54 0.42

    (5) Write-verify: read → set → verify set → reset → verify reset

    Mea

    sure

    d Av

    g.

    Activ

    e Po

    wer

    for o

    ur c

    hip

    (µW)

    Shutdown to active mode energy/time for our chip: 152 pJ / 200 ns

    3 6 15 35 60Resistance (k+)

    0

    500

    1000

    1500

    Num

    ber o

    f Cel

    ls

    Tpulse (ns) = 100 70 70 70 100VWL(V) = 1.6 11 0.94 0.9 3.2

    Set Voltage (V) = 1.55 1.3 1.3 1.3 -2.5 (Reset)Error (%) 2.1 0.87 0.63 5.1 2.2

    ‘4’ ’3’ ’2’ ’1’ ’0’

    Mean Error = 2.2%

    95%96%97%98%99%

    4-bi

    t Wei

    ghts

    8-bi

    t Wei

    ghts

    Larg

    er C

    NN

    (5.6

    -bit

    Wei

    ghts

    )

    Mea

    sure

    d In

    fere

    nce

    Accu

    racy

    9,402 weights6,490 weights

    +/- & MSBs

    2 bits/cell

    2 RRAM cells

    2 RRAM cells

    = 8 bits

    = 4 cells

    2 bits/cell

    LSBs

    8-bit Weights2.3 bitsMSBs LSBs

    2.3 bits

    1 RRAM cell

    +/-1 RRAM

    cell1 RRAM

    cell

    1 bit = 5.6 bits

    = 3 cells

    5.6-bit Weights

    Convolutional Neural Network (CNN), MNIST dataset

    (1 b

    it/ce

    ll)

    2.3X

    Cel

    ls in

    Dat

    a R

    RAM

    Error: cells w/ resistance outside window

    CN

    N1

    CN

    N1

    CN

    N2

    Output SizeLayer Name CNN1 CNN2conv1 26x26x16 24x24x16maxpool1 13x13x16 12x12x16conv2 11x11x16 8x8x16maxpool2 5x5x16 4x4x16dense 1x10 1x10

    102 104 106Cycles

    0

    50

    100

    Mea

    sure

    d To

    tal F

    aile

    d W

    ords

    (%)

    102 104 106Cycles

    0

    50

    100

    Mea

    sure

    d To

    tal F

    aile

    d W

    ords

    (%)

    w/ Dyn. Remapping + Write-Verify Loopw/o Dyn. Remapping or Write-Verify Loop

    431 433

    0200400600

    DynamicRemapping

    Off

    DynamicRemapping

    On

    Mea

    sure

    d A

    vera

    ge P

    ower

    W)

    0.5%

    Mea

    sure

    d To

    tal

    Faile

    d W

    ords

    (%)

    0%

    PWF: Permanent Word FailTWF: Temporary Word Fail

    Mea

    sure

    dFa

    iled

    Wor

    ds (%

    )

    RRAM write failures w/o Dynamic Remapping or Write-verify Loop

    Allocate new LUT entry

    Address in LUT and

    Valid?

    Memory Request (MR)

    Route MR to main RRAM

    arrays

    Yes

    Yes

    No

    NV LUT valid field := 0Yes

    Done

    No

    Write Fail?

    No

    Route MR to backup

    RRAM arrays

    LUT Full?

    Memory System Failure

    Write Request?

    Shutdown to Active mode

    To Shutdown Mode

    Load LUTfrom RRAM

    Service memory requests

    Store LUTback into RRAM

    No, check if write to LUT successful

    Write-verify loop: read → set → verify set → reset → verify reset → repeat if any verify fails

    102 104 106Cycles

    0

    20

    40

    60

    80

    100

    Faile

    d W

    ords

    (%) PWF

    TWFTotal

    Max Retries

    Allowed?

    Retry Write

    Yes

    MR in main RRAM arrays?

    No

    No

    Yes

    Data RRAM

    Our chip (Fig.14.3.1)

    Software Implementation of ENDURERENDURER remapping algorithm (once every 30 minutes)Inputs: memory size M = 4kByte, Offset!ß create random number (); " ß count trailing zeros (!); Offset ß (Offset + !) modulo M; N_iterations ß 2";For (#=0; #

  • Figure 14.3.7: Die micrograph

    Data Scratch-

    padCore

    MemCtrlrs

    4.5m

    m

    2.5mm

    Instr.Addr.

    RemapDataAddr

    Remap

    Backup Arrays

    4KB

    yte

    Dat

    a R

    RA

    M

    12KByte Instruction

    RRAM

    Transistor500nm

    TiN40nm

    TiNHfOx Ti

    RRAM cell

    This work Liu, et al. [6] Su, et al. [7] Chen, et al. [8]

    Year 2019 2016 2017 2018Supply Voltage (V) 0.71 - 1.2 0.8 0.8 1Technology node (nm) 130 65 150 65Clock Frequency (MHz) 10 100 20 64Memory Technology RRAM RRAM RRAM RRAMAmount of NVM(1) (KBytes) 18 12.2 1.3 128# of bits/cell demonstrated 2.3 1 1 1Type 16-bit

    microcontroller8-bit NV(2)Processor

    8-bit NV Proc.+ Accelerator

    In-memory compute macro

    Applications (energy, pJ/cycle) / (Active to shutdown mode time) / (Active to shutdown mode energy):CNN (5x5 images, 4 class)CNN (MNIST, 28x28 images)SVMLinear RegressionKalman FilterSHA256 HashCounter

    Dataset Not Avail.42/5 µs/1.68 nJ44/1.5 µs/0.5 nJ42/5 µs/1.68 nJ41/4 µs/1.34 nJ48/8 µs/2.69 nJ24.2/0.5 µs/0.3 nJ

    NoNoNoNoNoNo33/4µs - 1.02ms/400 nJ

    110/0.1ms/0.5 µJNoNoNoNoNoNo

    Dataset Not Avail.Yes*NoNoNoNoNo

    Active Power @ 25C (mW) 0.47 @ 0.71V 3.3 22 Not ReportedFine-grained Temporal Power-gating demonstrated

    Yes Yes Yes N/A

    Shutdown to Active mode time 200 ns 130 ns 50 ns N/AShutdown to Active mode energy 152 pJ 450 pJ 510 pJ N/ARRAM Read/Write Latency (ns) 23/50 Not Reported Not Reported 5/(Not Reported)RRAM Read/Write energy (pJ/bit) 1.76/10.9 (Set) /

    10.1 (Reset)Not Reported /99

    Not Reported/46.1

    Not Reported

    Resilience addressed by Dynamic Addr. Remapping & ENDURER

    None None None

    Lifetime (years) 10 Not Reported Not Reported Not Reported[6] Y. Liu et al., ISSCC, 2016. [7] F. Su et al., VLSI Circuits, 2017. [8] W. Chen et al., ISSCC, 2018.

    (1) NVM: Non-volatile memory(2) NV: Non-volatile

    * Values Not reported