This document is downloaded from DR-NTU (https://dr.ntu.edu.sg), Nanyang Technological University, Singapore.
14.3 A 43pJ/cycle non-volatile microcontroller with 4.7μs shutdown/wake-up integrating 2.3-bit/cell resistive RAM and resilience techniques
Wu, Tony F.; Le, Binh Q.; Radway, Robert; Bartolo, Andrew; Hwang, William; Jeong, Seungbin; Li, Haitong; Tandon, Pulkit; Vianello, Elisa; Vivet, Pascal; Nowak, Etienne; Wootters, Mary K.; Wong, Philip H.-S.; Mohamed M. Sabry Aly; Beigne, Edith; Mitra, Subhasish
2019
Wu, T. F., Le, B. Q., Radway, R., Bartolo, A., Hwang, W., Jeong, S., ... Mitra, S. (2019). 14.3 A 43pJ/cycle non-volatile microcontroller with 4.7μs shutdown/wake-up integrating 2.3-bit/cell resistive RAM and resilience techniques. Proceedings of 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 226-228. doi:10.1109/ISSCC.2019.8662402
https://hdl.handle.net/10356/143358
https://doi.org/10.1109/ISSCC.2019.8662402
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ISSCC.2019.8662402.
Downloaded on 10 Jun 2021 05:23:26 SGT
-
14.3 A 43pJ/cycle Non-volatile Microcontroller with 4.7µs Shutdown/Wake-up integrating 2.3 bits-per-cell Resistive RAM and Resilience Techniques
Tony F. Wu¹, Binh Q. Le¹, Andrew Bartolo¹, William Hwang¹, Seungbin Jeong¹, Haitong Li¹, Robert M. Radway¹, Pulkit Tandon¹, Elisa Vianello², Pascal Vivet², Etienne Nowak², Mary K. Wootters¹, H.-S. Philip Wong¹, Mohamed M. Sabry Aly³, Edith Beigne², Subhasish Mitra¹
¹Stanford University, Stanford, CA, ²CEA-LETI, Grenoble, France, ³Nanyang Technological University, Singapore

Non-volatility is emerging as an essential on-chip memory characteristic across a wide range of application domains, from edge nodes for the Internet of Things (IoT) to large computing clusters. On-chip non-volatile memory (NVM) is critical for low-energy operation, real-time responses, privacy and security, operation in unpredictable environments, and fault-tolerance [1]. Existing on-chip NVMs (e.g., Flash, FRAM, EEPROM) suffer from high read/write energy and latency, as well as density and integration challenges [1]. For example, an ideal IoT edge system would employ fine-grained temporal power gating (i.e., shutdown) between active modes. However, existing on-chip Flash can have long latencies (>23ms for an erase followed by a write) while inter-sample arrival times can be short (e.g., 2ms in [2]).

Our chip monolithically integrates two heterogeneous technologies: 18KBytes of on-chip Resistive RAM (RRAM, an emerging on-chip NVM; technology details in Fig. 14.3.1) on top of commercial 130nm silicon CMOS (16-bit general-purpose microcontroller core with 8KBytes of SRAM). For various applications (in machine learning, control, and cryptography), we demonstrate active mode average energy of 43pJ/cycle (up to 5.7X lower vs. similar chips at similar speeds / technology nodes using on-chip Flash and FRAM), fine-grained temporal power gating (0.25µW during shutdown) with up to 8µs (average 4.7µs) transition from active to shutdown mode (up to 5,878X quicker vs. on-chip Flash), and 2-clock cycle (200ns) transition from shutdown to active mode.
We also demonstrate, for the first time, a complete chip that stores multiple bits per on-chip RRAM cell (5 resistance values, i.e., 2.3 bits per cell) and processes stored information correctly (vs. previous demonstrations using standalone RRAM cells or a few cells in a standalone RRAM array). Such multi-bit storage improves the accuracy of neural network inference (2.3X for MNIST) on the same hardware (vs. 1 bit per cell).

RRAM (like other emerging NVMs such as phase change memory) exhibits write failures [1]. We overcome these challenges through the critical combination of two resilience techniques:
1. dynamic address remapping, which overcomes write failures during system operation with 0.5% active-mode energy increase and negligible execution time impact;
2. periodic ENDUrance REsiliency using random Remapping (ENDURER, Fig. 14.3.5) [3], a new technique implemented here for the first time.
This combination enables our chip to achieve a 10-year functional lifetime when running MNIST inference continuously.

To demonstrate fine-grained temporal power gating enabled by on-chip RRAM, our chip operates as follows (Fig. 14.3.1). During active mode, instructions are read from the on-chip 12KByte instruction RRAM and executed by the microcontroller core (MSP430 instruction set). During this time, data is accessed from peripheral ports (e.g., off-chip sensors), the on-chip 4KByte data RRAM, or the on-chip 8KByte scratchpad SRAM (loop counters and temporary variables with repeated writes, memory-mapped by the compiler). After the data is processed, to transition to shutdown mode, results are written back to the 4KByte on-chip data RRAM (consuming 168pJ over 5 clock cycles per 16-bit word, Fig. 14.3.2) and the hardware scheduler unit power-gates (i.e., turns off power to) the core, memory controllers, and memory. Our chip performs this transition 5,878X quicker than those with on-chip Flash due to the low write latency of RRAM (500ns vs. 23ms for Flash).
The chip returns to active mode upon data arrival (e.g., from sensors).

We run 5 applications representing machine learning (logistic regression, support vector machine, convolutional neural network), control (Kalman filter) and cryptography (SHA256 hash) to demonstrate the effectiveness of our chip (Fig. 14.3.2). To put our results into perspective, we select a similar clock rate for our chip (10MHz, vs. industry chips with existing on-chip NVM such as FRAM and Flash) that is sufficient for fine-grained temporal power-gating while avoiding excessive energy consumption. The active mode power of our chip varies between 407µW and 477µW (average active mode energy: 43pJ/cycle). We achieve an average 4.7µs/1.6nJ transition from active to shutdown mode and a 200ns/152pJ transition from shutdown to active mode (Fig. 14.3.2). Although the industry chips might be engineered to include additional margins, the overall benefits demonstrated by our chip are expected to stay significant even after margins are taken into consideration.

We store multiple resistance levels (up to 5 in our chip) inside on-chip RRAM cells (e.g., neural network model weights, only read during inference) by special algorithms that change wordline voltage (VWL) and bitline voltage (VBL) in addition to modifying the pulse width (Fig. 14.3.3) and allocating larger resistance windows for levels with higher resistance values. With greater effective memory capacity (2.3 bits vs. 1 bit per RRAM cell) on the same hardware, higher-precision weights (e.g., 4-bit vs. 8-bit) or larger neural network models (e.g., 6,490 vs. 9,402 weights) can be used (Fig. 14.3.3). Despite errors (cells with resistance values outside their intended resistance window) in 5 levels-per-cell storage, we achieve a 2.3X improvement in inference accuracy (i.e., 2.3X decrease in inference error) for neural networks (on the MNIST dataset, Fig. 14.3.3) when the weights are encoded as follows: two 5-level cells for magnitude and one 2-level cell for the sign bit.

RRAM is subject to temporary write failures (TWFs) and permanent write failures (PWFs, resulting in limited endurance: maximum number of successful writes to a cell) [4] that degrade application accuracy over time (Fig. 14.3.4). Cell-level parameter adjustment alone is not sufficient to reduce write failures [4]. To address TWFs, we employ a write-verify scheme with retries [4]. If a write to an RRAM address is unsuccessful after 4 retries, we map that address (during runtime) to another location in a separate backup RRAM array using dynamic address remapping (Figs. 14.3.1, 14.3.4). Our chip contains a backup RRAM array (256 16-bit words) for every 4KBytes of RRAM; 128 words of that backup array are used for this mapping. The mapping information is stored in a 128-entry volatile look-up table (volatile LUT, implemented using flip-flops, Fig. 14.3.1). During transition from active to shutdown mode, the contents of each volatile LUT are stored in the remaining 128 words of the corresponding backup array (non-volatile LUT). A write failure to a non-volatile LUT entry results in that entry being marked invalid (majority vote over 5 RRAM bits decides entry validity). When the chip boots, the contents of the volatile LUTs are loaded from the corresponding non-volatile LUT. We use dynamic address remapping for our data RRAM, incurring 0.5% energy and negligible (0.005%) execution time costs; our data RRAM tolerates TWFs and PWFs in 17.3% and 2% of words, respectively (Fig. 14.3.4). We use stronger programming conditions (higher voltage, more retries) to mitigate TWFs and insert dummy instructions to avoid PWFs in instruction memory (as writes occur only during programming).

Despite the limited write endurance of the 4KByte data RRAM, we achieve a 10-year lifetime using ENDURER (Fig. 14.3.5, software on FPGA + our chip) combined with dynamic address remapping, when running our neural network application (MNIST dataset) continuously (Fig. 14.3.6). We accelerate our tests to account for 10 years of running an application by first obtaining the sequence of all writes to RRAM (which account for 258 out of 617,669 total memory operations for a single inference) for the application. Then, we repeatedly perform the sequence of writes, through the ENDURER module on the FPGA, on the RRAM (skipping any read operations, writes to non-RRAM, and computation to save time). In our implementation of ENDURER, remapping is performed every 30 minutes and we use an SRAM buffer of eight 16-bit words.

On-chip RRAM NVM enables significantly lower energy during active mode (vs. existing on-chip NVM such as Flash and FRAM), fine-grained temporal power gating, and multiple bits per RRAM cell. Correct computation using multi-bit RRAM cells in a complete chip, demonstrated for the first time, successfully improves neural network inference accuracy.
Effective resilience techniques enable chips with on-chip RRAM to achieve 10-year lifetime (for neural network inference applications) despite write failures in the underlying RRAM. Our results can be further enhanced through domain-specific accelerators, bit-cost scalable 3D Vertical RRAM [5], and monolithic 3D integration of multiple RRAM layers [5]. The presented techniques (fine-grained temporal power gating, resilience) may be used for other emerging on-chip NVM (e.g., phase change) technologies as well.

Acknowledgements:
Work supported in part by DARPA, NSF/NRI/GRC E2CDA, and the Stanford SystemX Alliance.

References:
[1] A. Chen, "A review of emerging non-volatile memory (NVM) technologies and applications," Solid-State Electronics, vol. 125, pp. 25-38, 2016.
[2] R. Braojos, et al., "Nano-engineered architectures for ultra-low power wireless body sensor nodes," CODES+ISSS, 2016.
[3] M. M. S. Aly, et al., "The N3XT Approach to Energy-Efficient Abundant-Data Computing," Proc. IEEE, 2019.
[4] A. Grossi, et al., "Fundamental Variability Limits of Filament-based RRAM," IEDM, 2016.
[5] H.-S. P. Wong, et al., "Memory leads way to better computing," Nat. Nanotech., 2015.
-
Figure 14.3.1: Block diagram of our chip.
Figure 14.3.2: Benefits of using our chip with on-chip RRAM.
Figure 14.3.3: Using 2.3 bits per RRAM cell for convolutional neural network applications.
Figure 14.3.4: Dynamic address remapping with write-verify loop.
Figure 14.3.5: ENDURER test setup and remapping, read, and write algorithms.
Figure 14.3.6: ENDURER and Dynamic Address Remapping improve chip lifetime to 10 years.
Fig. 14.3.1 content recovered from the block diagram:
RRAM device (TiN/HfOx/Ti/TiN) parameters: set voltage 1.55V; reset voltage 2.5V; set wordline voltage 1.6V; reset wordline voltage 3.2V; set/reset pulse time 50ns; read voltage 0.2V.
Chip operating conditions: minimum VDD 0.71V; minimum RRAM peripheral logic VDD 3.2V; clock frequency 10MHz.
Chip blocks: MSP430 core (666 FFs, 5203 std cells); instruction RRAM (12KByte + 1536 Byte backup); instruction memory controller w/ dynamic address remapping (3x 128-entry LUTs); data RRAM (4KByte + 512 Byte backup); data memory controller w/ dynamic address remapping (128-entry LUT); data scratchpad (8KByte SRAM); scheduling H/W block (core reset, reset, wakeup, clock); programming ports; peripheral bus, instruction bus, data bus.
Legend: blocks only on during active mode (powered off during shutdown mode); blocks always on.
RRAM memory controller, address lookup: volatile LUT (in FFs), entries 0 to 127, each with Valid and Bad Address fields; a priority encoder selects the entry with Valid & Match and supplies the backup address sent to RRAM.
Dynamic address remapping, 512 Byte backup array: words 0 to 127 are backup memory (16-bit data); words 128 to 255 are the NV LUT (Valid: 5 bits, Address: 11 bits). NV: non-volatile.
Chip totals: 6605 FFs, 32,957 standard cells, 18KByte RRAM, 8KByte SRAM.
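The non-volatile LUT entry described in the text (an 11-bit address plus a validity flag decided by majority vote over 5 RRAM bits) can be illustrated with a short sketch. The bit layout (valid bits in the upper 5 bits of the 16-bit word) and the function names `pack_entry` and `unpack_entry` are assumptions made for illustration, not the chip's documented format.

```python
VALID_BITS = 5   # redundant copies of the valid flag (from the text)
ADDR_BITS = 11   # remapped address width (from Fig. 14.3.1)


def pack_entry(valid: bool, address: int) -> int:
    """Pack a LUT entry into one 16-bit word (assumed layout: valid bits on top)."""
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 11 bits")
    valid_field = (1 << VALID_BITS) - 1 if valid else 0
    return (valid_field << ADDR_BITS) | address


def unpack_entry(word: int) -> tuple[bool, int]:
    """Recover (valid, address); validity is a majority vote over the 5 valid bits."""
    address = word & ((1 << ADDR_BITS) - 1)
    valid_field = (word >> ADDR_BITS) & ((1 << VALID_BITS) - 1)
    ones = bin(valid_field).count("1")
    return ones > VALID_BITS // 2, address


if __name__ == "__main__":
    word = pack_entry(True, 0x2A5)
    corrupted = word ^ (0b11 << ADDR_BITS)  # two of the five valid bits flipped
    print(unpack_entry(word), unpack_entry(corrupted))
```

With five copies of the flag, up to two corrupted valid bits still yield the stored validity, which is one plausible reason for spending 5 bits on a single flag.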
Fig. 14.3.2 content recovered from the charts and table (bar values omitted):
Applications: SVM, logistic regression, convolutional NN (CNN1, Fig. 14.3.3, MNIST), SHA256 hash, Kalman filter.
Data written to NVM (same order): 6 Bytes, 20 Bytes, 20 Bytes, 32 Bytes, 16 Bytes.
Chart panels: relative measured active energy; active to shutdown mode measured energy (nJ) and measured time (µs); measured avg. active power for our chip (µW), split into core + SRAM and RRAM.
Comparison table, columns: Our Chip (RRAM) (130nm) | 130nm FRAM (1) | 130nm Flash (2) | 180nm Industry RRAM (3)
Operating frequency (MHz): 10 | 8 | 8 | 10
Amount of NVM (KBytes): 18 | 16 | 128 | 64
Read/write latency (ns): 23/50 (500 w/ write-verify (5)) | 50/50 | 140/64µs (Prgm)/23ms (Erase) | Not Reported
Read/write energy (pJ/bit): 1.76/10.9 (Set) / 10.1 (Reset) | 10.5/10.5 | 30.5/562.5 | Not Reported
Shutdown mode power (µW): 0.25 | 0.64 | 0.54 | 0.42
(1) MSP430FR5739, (2) MSP430F5529, (3) MN101L, (4) http://www.ti.com/tool/SHA-256
(5) Write-verify: read → set → verify set → reset → verify reset
Shutdown to active mode energy/time for our chip: 152 pJ / 200 ns
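As a rough consistency check, the active-to-shutdown transition figures can be related to the amount of data written back. The sketch below assumes that the transition is dominated by the RRAM write-back (5 clock cycles and 168pJ per 16-bit word at 10MHz, as stated in the text) and that the "Data written to NVM" values in Fig. 14.3.2 are listed in the same order as the applications; the function name `shutdown_cost` is hypothetical.

```python
CYCLE_NS = 100            # one clock period at 10MHz
CYCLES_PER_WORD = 5       # write-verify cycles per 16-bit word (from the text)
ENERGY_PER_WORD_PJ = 168  # write-back energy per 16-bit word (from the text)
BYTES_PER_WORD = 2

# Data written to NVM per application, read off Fig. 14.3.2 (assumed ordering).
DATA_WRITTEN_BYTES = {
    "SVM": 6,
    "Logistic regression": 20,
    "Convolutional NN": 20,
    "SHA256 hash": 32,
    "Kalman filter": 16,
}


def shutdown_cost(bytes_written: int) -> tuple[float, float]:
    """Return (time in microseconds, energy in nanojoules) for the write-back."""
    words = bytes_written // BYTES_PER_WORD
    time_us = words * CYCLES_PER_WORD * CYCLE_NS / 1000
    energy_nj = words * ENERGY_PER_WORD_PJ / 1000
    return time_us, energy_nj


if __name__ == "__main__":
    costs = {app: shutdown_cost(n) for app, n in DATA_WRITTEN_BYTES.items()}
    for app, (t, e) in costs.items():
        print(f"{app}: {t:.1f} us, {e:.2f} nJ")
    mean_t = sum(t for t, _ in costs.values()) / len(costs)
    mean_e = sum(e for _, e in costs.values()) / len(costs)
    print(f"average: {mean_t:.1f} us, {mean_e:.1f} nJ")
```

Under these assumptions the per-application results line up with the reported maximum of 8µs and the reported averages of 4.7µs and 1.6nJ.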
Fig. 14.3.3 content recovered from the charts and tables:
Resistance histogram: number of cells in data RRAM vs. resistance (kΩ), axis marks at 3, 6, 15, 35, 60. Error: cells w/ resistance outside window. Mean error = 2.2%.
Level: '4' | '3' | '2' | '1' | '0'
Tpulse (ns): 100 | 70 | 70 | 70 | 100
VWL (V): 1.6 | 11 | 0.94 | 0.9 | 3.2
Set voltage (V): 1.55 | 1.3 | 1.3 | 1.3 | -2.5 (Reset)
Error (%): 2.1 | 0.87 | 0.63 | 5.1 | 2.2
Weight encodings:
8-bit weights at 2 bits/cell: 2 RRAM cells (+/- & MSBs) + 2 RRAM cells (LSBs) = 8 bits = 4 cells.
5.6-bit weights: 1 RRAM cell (MSBs, 2.3 bits) + 1 RRAM cell (LSBs, 2.3 bits) + 1 RRAM cell (+/-, 1 bit) = 5.6 bits = 3 cells.
Measured inference accuracy chart (convolutional neural network, MNIST dataset; axis 95% to 99%): 4-bit weights (1 bit/cell), 8-bit weights, larger CNN (5.6-bit weights); network labels CNN1, CNN1, CNN2; weight counts 6,490 and 9,402; annotated improvement 2.3X.
Output size by layer:
Layer name: CNN1 | CNN2
conv1: 26x26x16 | 24x24x16
maxpool1: 13x13x16 | 12x12x16
conv2: 11x11x16 | 8x8x16
maxpool2: 5x5x16 | 4x4x16
dense: 1x10 | 1x10
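The 5.6-bit weight format (two 5-level cells for magnitude, one 2-level cell for sign) can be sketched as a base-5 encoding. The mapping of a signed integer weight in the range -24 to 24 onto the cells, and the names `encode_weight`, `decode_weight`, and `cells_needed`, are assumptions for illustration; the text does not specify how weights are quantized or ordered across cells.

```python
import math

LEVELS = 5                          # resistance levels per multi-level cell
BITS_PER_CELL = math.log2(LEVELS)   # about 2.32, quoted as 2.3 bits per cell
MAX_MAGNITUDE = LEVELS * LEVELS - 1  # two base-5 digits cover 0..24


def encode_weight(weight: int) -> tuple[int, int, int]:
    """Split a weight into (sign cell, high magnitude cell, low magnitude cell)."""
    if abs(weight) > MAX_MAGNITUDE:
        raise ValueError("magnitude does not fit in two 5-level cells")
    sign = 1 if weight < 0 else 0
    high, low = divmod(abs(weight), LEVELS)
    return sign, high, low


def decode_weight(sign: int, high: int, low: int) -> int:
    """Inverse of encode_weight."""
    magnitude = high * LEVELS + low
    return -magnitude if sign else magnitude


def cells_needed(num_weights: int, cells_per_weight: int) -> int:
    """RRAM cells consumed by a model with a given per-weight cell count."""
    return num_weights * cells_per_weight


if __name__ == "__main__":
    print(f"bits per weight: {2 * BITS_PER_CELL + 1:.1f}")
    print(encode_weight(-17), decode_weight(*encode_weight(-17)))
    print(cells_needed(6490, 4), cells_needed(9402, 3))
```

Under this reading, the larger 9,402-weight network at 3 cells per weight uses a similar number of cells as the 6,490-weight network at 4 cells per weight, which matches the claim that multi-level storage allows larger models on the same hardware.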
Fig. 14.3.4 content recovered from the charts and flowchart:
Plots: measured total failed words (%) vs. cycles (10^2, 10^4, 10^6), w/ dynamic remapping + write-verify loop and w/o dynamic remapping or write-verify loop; RRAM write failures w/o dynamic remapping or write-verify loop, split into PWF, TWF, and total.
PWF: Permanent Word Fail. TWF: Temporary Word Fail.
Measured average power: 431µW with dynamic remapping off, 433µW with dynamic remapping on (0.5%).
Operating sequence: shutdown to active mode → load LUT from RRAM → service memory requests → store LUT back into RRAM → to shutdown mode.
Flowchart decisions for a memory request (MR): Address in LUT and valid? Write request? Write fail? Max retries allowed? MR in main RRAM arrays? LUT full?
Flowchart actions: route MR to main RRAM arrays; route MR to backup RRAM arrays; retry write; allocate new LUT entry; NV LUT valid field := 0; check if write to LUT successful; memory system failure; done.
Write-verify loop: read → set → verify set → reset → verify reset → repeat if any verify fails.
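The dynamic address remapping flow (write-verify with up to 4 retries, then remapping the failing address to a backup word through a 128-entry LUT) can be sketched in software as follows. This is a behavioural model only: the class `RemappedRRAM`, the `write_ok` fault-injection callback, and the choice to raise an error when a backup word itself fails are assumptions for illustration, since the exact branch targets of the flowchart are not recoverable here.

```python
class MemorySystemFailure(Exception):
    """Raised when a failed write cannot be remapped."""


class RemappedRRAM:
    MAX_RETRIES = 4     # retries after the first write attempt (from the text)
    LUT_ENTRIES = 128   # backup words usable for remapping (from the text)

    def __init__(self, main_words, write_ok=None):
        self.main = [0] * main_words
        self.backup = [0] * self.LUT_ENTRIES
        self.lut = {}  # failing main address -> backup word index
        # Hypothetical fault model: write_ok(array_name, index, attempt) -> bool
        self.write_ok = write_ok or (lambda array_name, index, attempt: True)

    def _write_verify(self, array_name, cells, index, value):
        """Attempt a write, retrying until it verifies or retries run out."""
        for attempt in range(1 + self.MAX_RETRIES):
            if self.write_ok(array_name, index, attempt):
                cells[index] = value
                return True
        return False

    def read(self, address):
        if address in self.lut:
            return self.backup[self.lut[address]]
        return self.main[address]

    def write(self, address, value):
        if address in self.lut:
            slot = self.lut[address]
            if not self._write_verify("backup", self.backup, slot, value):
                raise MemorySystemFailure("backup word failed")
            return
        if self._write_verify("main", self.main, address, value):
            return
        if len(self.lut) >= self.LUT_ENTRIES:
            raise MemorySystemFailure("LUT full")
        self.lut[address] = len(self.lut)  # allocate the next backup word
        self.write(address, value)


if __name__ == "__main__":
    # Word 7 of the main array never accepts writes (a permanent failure).
    mem = RemappedRRAM(16, write_ok=lambda name, i, attempt: not (name == "main" and i == 7))
    mem.write(7, 0xBEEF)
    print(hex(mem.read(7)), mem.lut)
```

A temporary failure that clears within the retry budget never reaches the LUT, so backup words are spent only on addresses that keep failing.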
Fig. 14.3.5 content recovered from the diagram:
Test setup: data RRAM on our chip (Fig. 14.3.1); software implementation of ENDURER.
ENDURER remapping algorithm (once every 30 minutes)
Inputs: memory size M = 4KByte, Offset
r ← create random number ();
k ← count trailing zeros (r);
Offset ← (Offset + r) modulo M;
N_iterations ← 2^k;
For (i = 0; i
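The recoverable part of the remapping algorithm (a random increment to an address offset, with an iteration count of 2 raised to the number of trailing zeros of that increment) is consistent with an in-place rotation of the memory contents: for a power-of-two memory size, rotating by r decomposes into exactly 2^ctz(r) independent cycles. The sketch below is one plausible reading under that assumption, not the authors' implementation; the class `OffsetRemappedMemory` and its methods are hypothetical, and the SRAM write buffer mentioned in the text is not modelled.

```python
import random


def count_trailing_zeros(x: int) -> int:
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1


class OffsetRemappedMemory:
    def __init__(self, num_words: int, seed: int = 0):
        if num_words & (num_words - 1):
            raise ValueError("num_words must be a power of two")
        self.m = num_words
        self.offset = 0
        self.cells = [0] * num_words
        self.rng = random.Random(seed)

    def _physical(self, logical: int) -> int:
        return (logical + self.offset) % self.m

    def read(self, logical: int) -> int:
        return self.cells[self._physical(logical)]

    def write(self, logical: int, value: int) -> None:
        self.cells[self._physical(logical)] = value

    def remap(self) -> None:
        """Shift every stored word forward by a random amount r, in place."""
        r = self.rng.randrange(1, self.m)
        n_cycles = 1 << count_trailing_zeros(r)  # equals gcd(r, m) for power-of-two m
        cycle_len = self.m // n_cycles
        for start in range(n_cycles):
            carried = self.cells[start]
            pos = start
            for _ in range(cycle_len):
                pos = (pos + r) % self.m
                # Drop the carried word here and pick up the one it displaces.
                carried, self.cells[pos] = self.cells[pos], carried
        self.offset = (self.offset + r) % self.m


if __name__ == "__main__":
    mem = OffsetRemappedMemory(16, seed=1)
    for addr in range(16):
        mem.write(addr, addr * 3)
    mem.remap()
    print(mem.offset, [mem.read(a) for a in range(16)])
```

Because each logical address lands on a different physical word after every remap, repeated writes to a hot address are spread across the array over time, which is the wear-levelling effect the technique relies on.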
-
Figure 14.3.7: Die micrograph
Die micrograph labels: die size 4.5mm x 2.5mm; data scratchpad; core; memory controllers; instruction address remap; data address remap; backup arrays; 4KByte data RRAM; 12KByte instruction RRAM. Cross-section insets: transistor (500nm scale bar); RRAM cell, TiN/HfOx/Ti/TiN (40nm scale bar).
Comparison table, columns: This work | Liu, et al. [6] | Su, et al. [7] | Chen, et al. [8]
Year: 2019 | 2016 | 2017 | 2018
Supply voltage (V): 0.71 - 1.2 | 0.8 | 0.8 | 1
Technology node (nm): 130 | 65 | 150 | 65
Clock frequency (MHz): 10 | 100 | 20 | 64
Memory technology: RRAM | RRAM | RRAM | RRAM
Amount of NVM (1) (KBytes): 18 | 12.2 | 1.3 | 128
# of bits/cell demonstrated: 2.3 | 1 | 1 | 1
Type: 16-bit microcontroller | 8-bit NV (2) processor | 8-bit NV proc. + accelerator | In-memory compute macro
Applications, given as (energy, pJ/cycle) / (active to shutdown mode time) / (active to shutdown mode energy):
CNN (5x5 images, 4 class): Dataset Not Avail. | No | 110/0.1ms/0.5 µJ | Dataset Not Avail.
CNN (MNIST, 28x28 images): 42/5 µs/1.68 nJ | No | No | Yes*
SVM: 44/1.5 µs/0.5 nJ | No | No | No
Linear Regression: 42/5 µs/1.68 nJ | No | No | No
Kalman Filter: 41/4 µs/1.34 nJ | No | No | No
SHA256 Hash: 48/8 µs/2.69 nJ | No | No | No
Counter: 24.2/0.5 µs/0.3 nJ | 33/4µs - 1.02ms/400 nJ | No | No
Active power @ 25C (mW): 0.47 @ 0.71V | 3.3 | 22 | Not Reported
Fine-grained temporal power-gating demonstrated: Yes | Yes | Yes | N/A
Shutdown to active mode time: 200 ns | 130 ns | 50 ns | N/A
Shutdown to active mode energy: 152 pJ | 450 pJ | 510 pJ | N/A
RRAM read/write latency (ns): 23/50 | Not Reported | Not Reported | 5/(Not Reported)
RRAM read/write energy (pJ/bit): 1.76/10.9 (Set) / 10.1 (Reset) | Not Reported/99 | Not Reported/46.1 | Not Reported
Resilience addressed by: Dynamic Addr. Remapping & ENDURER | None | None | None
Lifetime (years): 10 | Not Reported | Not Reported | Not Reported
[6] Y. Liu et al., ISSCC, 2016. [7] F. Su et al., VLSI Circuits, 2017. [8] W. Chen et al., ISSCC, 2018.
(1) NVM: Non-volatile memory
(2) NV: Non-volatile
* Values not reported
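The headline 43pJ/cycle figure can be cross-checked against the per-application energies in the comparison table and the reported active power range. The sketch below assumes the average is an unweighted mean over the five applications with reported energies for this work; the helper `energy_per_cycle_pj` is a hypothetical name, and it relies only on the identity that power in µW divided by clock frequency in MHz gives energy per cycle in pJ.

```python
from statistics import mean

CLOCK_MHZ = 10

# Active energy per cycle (pJ) for this work, taken from the comparison table.
ENERGY_PJ_PER_CYCLE = {
    "CNN (MNIST)": 42,
    "SVM": 44,
    "Linear Regression": 42,
    "Kalman Filter": 41,
    "SHA256 Hash": 48,
}


def energy_per_cycle_pj(power_uw: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Energy per clock cycle in pJ: (1e-6 W) / (1e6 Hz) = 1e-12 J."""
    return power_uw / clock_mhz


if __name__ == "__main__":
    print(f"mean over applications: {mean(ENERGY_PJ_PER_CYCLE.values()):.1f} pJ/cycle")
    print(f"407 uW -> {energy_per_cycle_pj(407):.1f} pJ/cycle")
    print(f"477 uW -> {energy_per_cycle_pj(477):.1f} pJ/cycle")
```

The reported power range of 407µW to 477µW at 10MHz brackets the per-application energies of 41 to 48pJ/cycle, and their unweighted mean rounds to the quoted 43pJ/cycle.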