Design of Current-Mode 8T SRAM Compute-In-Memory Macro For Processing Neural Networks
Abstract - A novel 8T SRAM bitcell is proposed for computing dot-products using current-mode accumulation. The write disturb issue is eliminated by adding two extra transistors to a standard 6T SRAM bitcell. In addition, we embed a column ADC in the bitcell array, and its resolution is reconfigurable from 1 to 5bit. A test-chip is fabricated in a 65nm process, and the energy efficiency of the bitwise operation is 490-to-15.8TOPS/W at 1-5bit.

Keywords - SRAM, compute-in-memory, artificial neural network, hardware accelerator, current-mode accumulation.
2020 International SoC Design Conference (ISOCC) | 978-1-7281-8331-2/20/$31.00 ©2020 IEEE | DOI: 10.1109/ISOCC50952.2020.9332992
I. INTRODUCTION

Compute-in-memory (CIM) architecture has gained a lot of attention as an alternative to the conventional von Neumann architecture for processing data-intensive applications such as deep neural networks. The CIM approach can achieve higher energy efficiency and computation performance thanks to the elimination of off-chip data movement between memory and processor. The reduced amount of off-chip data communication is also beneficial for privacy and security.

Recently, mixed-signal CIM macros [1-7] based on current- or voltage-mode accumulation have been introduced as energy- and area-efficient hardware accelerators for massively-parallel dot-products. J. Zhang et al. [1] presented a machine-learning classifier based on the standard 6T SRAM bitcell; however, it suffers from challenges such as write disturbance and ADC overhead. A. Biswas et al. [2] added four extra transistors to the standard 6T SRAM to address the write disturbance issue by decoupling its voltage-mode accumulation from the SRAM write path. However, the ADC overhead issue remains unsolved, and the memory density is reduced. In this paper, we present a novel 8T SRAM CIM macro [3], which addresses both the write disturbance and the ADC overhead issues using a novel 8T SRAM bitcell and a column ADC architecture.

II. COMPUTE-IN-MEMORY MACRO DESIGN

Fig. 1(left) shows a fully-connected layer of a neural network comprising 64x inputs, 128x outputs (neurons), and 64x128 weights. The proposed CIM macro, which corresponds to the fully-connected layer, is shown in Fig. 1(right). Each column-based neuron consists of a pre-charge driver (a pair of PMOS transistors), 128x novel SRAM bitcells, and a sense amplifier (SA). Each 8T SRAM bitcell consists of a standard 6T SRAM and a pair of NMOS transistors for discharging one of the read bitlines.

[Fig. 1. A fully-connected neural-network layer (left) and the proposed CIM macro (right). Each column comprises Offset Cal (32x), ADC Ref (32x), and Dot-Product (64x) bitcells sharing WBL/WBLb and RBL/RBLb pairs.]

[Fig. 2. Basic operating principle of the proposed 8T bitcell: (a) (W,X)=(-1,0) -> VRBL - VRBLb = 0; (b) (W,X)=(-1,1) -> VRBL - VRBLb = -dV; (c) (W,X)=(+1,0) -> VRBL - VRBLb = 0; (d) (W,X)=(+1,1) -> VRBL - VRBLb = +dV.]
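The multiply behavior of the bitcell in Fig. 2 can be summarized in a small model (a sketch, not the authors' code; `delta_v` is left symbolic since the physical unit drop depends on the discharge current and bitline capacitance):

```python
def bitcell_multiply(w, x, delta_v=1.0):
    """Differential read-bitline contribution of one 8T bitcell.

    w: stored weight, -1 or +1 (one bit, held in the 6T core)
    x: input activation, 0 or 1 (drives the read wordline)
    Returns VRBL - VRBLb in units of delta_v, per Fig. 2: current
    flows only when x == 1, and the discharged bitline (RBL or
    RBLb) is selected by the stored weight.
    """
    assert w in (-1, +1) and x in (0, 1)
    return w * x * delta_v
```

This reproduces the four cases (a)-(d) of Fig. 2: a zero input yields no voltage difference, while a high input yields -dV or +dV depending on the stored weight.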
TABLE I. PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS

                        SOVC'16 [1]    ISSCC'18 [2]   SOVT'18 [4]    This Work
Accumulation            Current-Mode   Voltage-Mode   Voltage-Mode   Current-Mode
Array Size              128x128        64x256         256x64         128x128
Input/Output Bit #      5/1            6/6            1.59/3.46      1/1-5
Weight Bit #            1              1              1              1
Energy-Efficiency
  [TOPS/W] (b)          11.5           51.3           139            490-15.8
# ADCs / # Neurons      N/A            16/256         1/64           128/128
ML Algorithm            SVM            CNN            MLP            MLP
ML Dataset              MNIST          MNIST          MNIST          MNIST
Accuracy                90%            98%            98.3%          96.2% (a)

(a) Accuracy based on MC (1K runs) simulation results (sigma=6.35mV).
(b) 1-5bit (1-31 cycles/OP), 200MHz.

[Fig. 3. Operating principle of the proposed column ADC. A column comprises 13 bitcells for a 9-input dot-product and a 2bit column ADC. Example: with the dot-product fixed at -1 and the ADC reference swept +2 -> 0 -> -2 over cycles 0-2, the thermometer bits are TH[0]=0, TH[1]=0, TH[2]=1, which encode to B[1:0]=01.]

[Fig. 4. Measured ADC outputs vs. dot-product results (5bit, room temperature; left: no calibration, right: with calibration). Note that the accumulator input is swept from -32 to +32 (i.e., half range).]

[Fig. 5. A 65nm test-chip die micrograph (234um x 234um macro with WBL drivers, WWL/RWL drivers, 128x128 8T SRAM bitcell array, and sense amplifiers) and the 1.83um x 1.83um bitcell layout, with a chip summary:

Process             65nm
Supply Voltage      0.8V (RWL/PCH) / 0.45V (SRAM)
Frequency           200MHz
Bitcell             8T SRAM
Bitcell Size        1.83 x 1.83 um2
Array Size          128 x 128 (16K)
Energy/OP           2.04fJ/OP (1bit)
Energy Efficiency   490-15.8TOPS/W @ 1-5bit
ML Algorithm        MLP (3-layer), 784-128-128-10
Dataset/Accuracy    MNIST / 96.2% (-0.4% vs. baseline)]
Depending on the stored weight (W) and the input (X), a discharging current flows through the transistors (Fig. 2). A unit discharging current of a bitcell leads to a small voltage drop (i.e., dV) on the read bitline. As for the accumulate operation, we add up all the voltage drops due to the 128x individual operations of the bitcells connected to the shared read bitlines. Finally, the read bitline voltages settle, and the resulting voltage difference between RBL and RBLb represents the dot-product output of a single neuron. The proposed current-mode accumulation achieves high linearity by using a limited dynamic range (~200mV).
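The accumulate operation described above can be sketched as a sum of per-bitcell contributions on the shared differential bitlines (hypothetical helper names; the ~200mV dynamic-range limit is modeled here as a hard clamp, which is a simplification of the analog settling behavior):

```python
def column_dot_product(weights, inputs, delta_v=1.0, v_range=None):
    """Current-mode accumulation on a shared RBL/RBLb pair.

    weights: list of -1/+1 stored weights (up to 128 per column)
    inputs:  list of 0/1 activations, same length
    Returns the settled differential voltage VRBL - VRBLb (in units
    of delta_v), which encodes the column's dot-product output.
    """
    diff = sum(w * x * delta_v for w, x in zip(weights, inputs))
    if v_range is not None:
        # Limited dynamic range of the bitline swing (~200mV on chip),
        # modeled as a simple clamp.
        diff = max(-v_range, min(v_range, diff))
    return diff
```

For example, a column storing (+1, +1, -1) driven by all-high inputs settles at a differential voltage of +1 unit, matching the signed sum of the weights.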
B. Column ADC and Offset Calibration

Fig. 3 shows the operating principle of the proposed ADC with a 2bit resolution. In this example, a single neuron consists of 9x bitcells for the dot-product and 4x for the ADC reference. The dot-product result is fixed at -1 while the reference is swept from +2 to -2 with a step size of two. A 2bit operation takes 3 cycles since each cycle produces one bit of a thermometer code. Finally, the generated 3bit thermometer code is converted to a 2bit binary code. Note that the proposed macro implements 32x reference bitcells per column for 1-5bit ADC operation. We calibrate offsets using replica bitcells in each column. Fig. 4 shows the measured 5bit ADC linearity with and without offset calibration. The variation of ADC outputs is reduced by 3-4x after calibration (Fig. 4, right).
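The reference sweep in Fig. 3 behaves like a ramp-style conversion. The following sketch generalizes the 2bit example; extending the step size of two to other resolutions is our assumption, and the names are illustrative:

```python
def column_adc(dp, nbits=2):
    """Ramp-style column ADC: sweep the reference, compare each cycle.

    dp:    dot-product value to digitize
    nbits: output resolution (1-5 on the chip)
    An n-bit conversion takes 2**n - 1 cycles (1-31 cycles/OP); each
    cycle emits one thermometer bit, and the thermometer code is
    converted to binary by counting its ones.
    """
    n_levels = 2**nbits - 1                 # 3 cycles for 2 bit
    # Reference swept high-to-low in steps of two, e.g. [+2, 0, -2].
    refs = [2 * (n_levels // 2 - i) for i in range(n_levels)]
    thermometer = [1 if dp >= r else 0 for r in refs]
    return sum(thermometer)                 # thermometer -> binary
```

With the paper's example (dot-product fixed at -1, references +2/0/-2), the thermometer bits are (0, 0, 1) and the binary output is 01, as in Fig. 3.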
III. EXPERIMENT RESULTS

A 65nm macro test-chip is fabricated to demonstrate the proposed 8T SRAM-based CIM. The macro uses two supply voltages (0.8V for pre-charge and RWL, and 0.45V for the SRAM bitcells). The energy consumption of a bitwise operation in a single 8T SRAM bitcell is 2.04fJ at 200MHz. The energy efficiency is 490-to-15.8TOPS/W at 1-5bit; the 1bit figure scales down with the 2^N - 1 = 31 cycles required per 5bit operation, giving roughly 15.8TOPS/W. The areas of the proposed bitcell and the compute-in-memory macro are 3.35um2 and 0.055mm2, respectively. A performance comparison with state-of-the-art designs is summarized in Table I. A 65nm test-chip die micrograph and a summary table are shown in Fig. 5.

REFERENCES

[1] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," Symposium on VLSI Circuits (SOVC), pp. C252-C253, Jun. 2016.
[2] A. Biswas and A. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," IEEE International Solid-State Circuits Conference (ISSCC), pp. 488-489, Feb. 2018.
[3] C. Yu, et al., "A 16K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5bit column ADC," IEEE Custom Integrated Circuits Conference (CICC), Mar. 2020.
[4] Z. Jiang, et al., "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," Symposia on VLSI Technology (SOVT), pp. 173-174, Jun. 2018.
[5] H. Valavi, et al., "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," Symposium on VLSI Circuits (SOVC), pp. 141-142, Jun. 2018.
[6] H. Kim, Q. Chen, and B. Kim, "A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC," IEEE Asian Solid-State Circuits Conference (ASSCC), Nov. 2019.
[7] T. Yoo, et al., "A logic compatible 4T dual embedded DRAM array for in-memory computation of deep neural networks," ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), Jul. 2019.