
Design of Current-Mode 8T SRAM Compute-In-Memory Macro for Processing Neural Networks

Chengshuo Yu, Taegeun Yoo, Tony Tae-Hyoung Kim, Bongjin Kim
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798

Chengshuo Yu, Kevin Chai Tshun Chuan
Institute of Microelectronics, Agency for Science, Technology and Research, Singapore 138634

Abstract—A novel 8T SRAM bitcell is proposed for computing dot-products using current-mode accumulation. The write disturb issue is eliminated by adding two extra transistors to a standard 6T SRAM bitcell. In addition, we embed a column ADC in each column-based neuron to address the ADC overhead issue of conventional analog compute-in-memory macros. The ADC resolution is reconfigurable from 1 to 5 bits. A test-chip is fabricated in 65nm, and the energy-efficiency of a bitwise operation is 490-to-15.8TOPS/W at 1-5bit.
Keywords—SRAM, compute-in-memory, artificial neural network,
hardware accelerator, current-mode accumulation.
2020 International SoC Design Conference (ISOCC) | 978-1-7281-8331-2/20/$31.00 ©2020 IEEE | DOI: 10.1109/ISOCC50952.2020.9332992

I. INTRODUCTION

Compute-in-memory (CIM) architecture has gained significant attention as an alternative to the conventional von Neumann architecture for processing data-intensive applications such as deep neural networks. The CIM approach can achieve higher energy-efficiency and computation performance thanks to the elimination of off-chip data movement between memory and processor. The reduced amount of off-chip data communication is also beneficial for privacy and security.

Recently, mixed-signal CIM macros [1-7] based on current- or voltage-mode accumulation have been introduced as energy- and area-efficient hardware accelerators for massively-parallel dot-products. J. Zhang et al. [1] presented a machine-learning classifier based on a standard 6T SRAM bitcell; however, it suffers from challenges such as write disturbance and ADC overhead. A. Biswas et al. [2] added four extra transistors to the standard 6T SRAM to address the write disturbance issue by decoupling its voltage-mode accumulation from the SRAM write path. However, the ADC overhead issue remains unsolved, and the memory density is reduced. In this paper, we present a novel 8T SRAM bitcell [3] that addresses both the write disturbance and ADC overhead issues using the 8T bitcell together with a column ADC architecture.

Fig. 2. Basic operating principle of the proposed 8T bitcell: (a) (W,X)=(-1,0) → VRBL - VRBLb = 0; (b) (W,X)=(-1,1) → VRBL - VRBLb = -ΔV; (c) (W,X)=(+1,0) → VRBL - VRBLb = 0; (d) (W,X)=(+1,1) → VRBL - VRBLb = +ΔV.

II. COMPUTE-IN-MEMORY MACRO DESIGN

Fig. 1 (left) shows a fully-connected neural network layer comprising 64 inputs, 128 outputs (neurons), and 64×128 weights. The proposed CIM macro, which corresponds to the fully-connected layer, is shown in Fig. 1 (right). Each column-based neuron consists of a pre-charge driver (a pair of PMOS transistors), 128 novel SRAM bitcells, and a sense amplifier (SA). Each 8T SRAM bitcell consists of a standard 6T SRAM and a pair of NMOS transistors for discharging one of the read bitlines when the bitcell is enabled for computing. Note that 64 bitcells are reserved for a 64-input dot-product operation, while the 32 bitcells in the middle are used as an ADC reference. The remaining 32 bitcells at the bottom are assigned for offset calibration.

Fig. 1. A fully-connected layer with 64 inputs and 128 neurons (left) is equivalent to 128 column neurons (right). Each neuron comprises 64 bitcells for dot-product, 32 for ADC reference, and 32 for offset calibration.

A. Multiplication and Accumulation

Fig. 2 shows the basic operating principle of the proposed 8T SRAM bitcell. Note that a binary weight (-1 or +1) is stored in the bitcell, and a binary input (0 or 1) is applied through a read wordline (RWL). Fig. 2 illustrates the four possible operations based on the programmed weight and input values, together with the corresponding multiplication result as a voltage drop on the read bitlines (RBL and RBLb). When a negative short pulse is applied to RWL, a read NMOS transistor (with its gate node connected to an SRAM internal node storing a high level) is turned on and discharges a read bitline (RBL or RBLb). When RWL is high (H), both NMOS transistors are turned off, and no


current flows through the transistors. A unit discharging current of a bitcell leads to a small voltage drop (i.e., ΔV) on the read bitline. As for the accumulate operation, we add up all the voltage drops due to the 128 individual operations of the bitcells that are connected to the shared read bitlines. Finally, the read bitline voltages settle, and the resulting voltage difference between RBL and RBLb represents the dot-product output of a single neuron. The proposed current-mode accumulation achieves high linearity by using a limited dynamic range (~200mV).

B. Column ADC and Offset Calibration

Fig. 3 shows the operating principle of the proposed ADC with a 2bit resolution. In this example, a single neuron consists of 9 bitcells for dot-product and 4 for ADC reference. The dot-product result is fixed at -1 while the reference is swept from +2 to -2 with a step size of two. A 2bit operation takes 3 cycles since each cycle produces one thermometer-code bit. Finally, the generated 3bit thermometer code is converted to a 2bit binary code. Note that the proposed macro implements 32 reference bitcells per column for 1-5b ADC operation. We calibrate offsets using replica bitcells in each column. Fig. 4 shows the measured 5bit ADC linearity with and without offset calibration. The variation of ADC outputs is reduced by 3-4× after calibration.

Fig. 3. Operating principle of the proposed column ADC. A column comprises 13 bitcells for a 9-input dot-product and a 2bit column ADC. In the illustrated sweep, the fixed dot-product DP=-1 compared against references +2/0/-2 yields thermometer bits TH[0]=0, TH[1]=0, TH[2]=1, i.e., binary code B[1:0]=01.

Fig. 4. Measured ADC outputs vs. dot-product results (5bit, room temperature), without (left) and with (right) offset calibration. Note that the accumulator input is swept from -32 to +32 (i.e., half range).

III. EXPERIMENT RESULTS

A 65nm macro test-chip is fabricated to demonstrate the proposed 8T SRAM-based CIM. The macro uses two supply voltages (0.8V for pre-charge & RWL, and 0.45V for SRAM bitcells). The energy consumption of a bitwise operation from a single 8T SRAM bitcell is 2.04fJ at 200MHz. The energy-efficiency is 490-to-15.8TOPS/W at 1-5bit. The areas of the proposed bitcell and the compute-in-memory macro are 3.35μm2 and 0.055mm2, respectively. A performance comparison with the state-of-the-arts is summarized in Table I. A 65nm test-chip die micrograph and a summary table are shown in Fig. 5.

TABLE I. PERFORMANCE COMPARISON WITH STATE-OF-THE-ARTS

                          SOVC'16 [1]   ISSCC'18 [2]  SOVT'18 [4]   This Work
  Technology              130nm         65nm          65nm          65nm
  Bitcell                 6T SRAM       10T SRAM      12T SRAM      8T SRAM
  Accumulation            Current-Mode  Voltage-Mode  Voltage-Mode  Current-Mode
  Array Size              128x128       64x256        256x64        128x128
  Input/Output Bit #      5/1           6/6           1.59/3.46     1/1-5
  Weight Bit #            1             1             1             1
  Energy-Eff. [TOPS/W] b  11.5          51.3          139           490-15.8
  # ADCs/Neurons          N/A           16/256        1/64          128/128
  ML Algorithm            SVM           CNN           MLP           MLP
  ML Dataset              MNIST         MNIST         MNIST         MNIST
  Accuracy                90%           98%           98.3%         96.2% a

a. Accuracy based on MC (1K runs) simulation results (σ=6.35mV). b. 1-5b (1-31 cycles/OP), 200MHz.

Fig. 5. A 65nm test-chip die micrograph (234μm × 234μm macro) and bitcell layout (1.83μm × 1.83μm). Chip summary: Process 65nm; Supply Voltage 0.8V (RWL/PCH) / 0.45V (SRAM); Frequency 200MHz; Bitcell 8T SRAM, 1.83x1.83μm2; Array Size 128x128 (16K); Energy/OP 2.04fJ/OP (1bit); Energy-Efficiency 490-15.8TOPS/W @ 1-5bit; ML Algorithm MLP (3-layer, 784-128-128-10); Dataset/Accuracy MNIST/96.2% (-0.4% vs. baseline).

REFERENCES

[1] J. Zhang, Z. Wang, N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," Symposium on VLSI Circuits (SOVC), pp. C252-C253, Jun. 2016.
[2] A. Biswas, A. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," IEEE International Solid-State Circuits Conference (ISSCC), pp. 488-489, Feb. 2018.
[3] C. Yu, et al., "A 16K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5bit column ADC," IEEE Custom Integrated Circuits Conference (CICC), Mar. 2020.
[4] Z. Jiang, et al., "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," Symposia on VLSI Technology (SOVT), pp. 173-174, Jun. 2018.
[5] H. Valavi, et al., "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," Symposium on VLSI Circuits (SOVC), pp. 141-142, Jun. 2018.
[6] H. Kim, Q. Chen, B. Kim, "A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC," IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2019.
[7] T. Yoo, et al., "A logic compatible 4T dual embedded DRAM array for in-memory computation of deep neural networks," ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), Jul. 2019.
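As an illustration of the multiply-and-accumulate scheme of Section II.A, the bitcell behavior of Fig. 2 can be sketched as a simple behavioral model in Python. This is a simplification for exposition, not the authors' circuit: the pre-charge voltage and the per-bitcell ΔV value are arbitrary assumptions, and settling effects are ignored.

```python
def dot_product_voltage(weights, inputs, vdd=0.8, delta_v=0.005):
    """Behavioral sketch of current-mode accumulation (Fig. 2).

    weights: binary weights (-1 or +1) stored in the bitcells.
    inputs:  binary inputs (0 or 1) applied via RWL pulses.
    Returns the (v_rbl, v_rblb) bitline voltages after accumulation.
    """
    v_rbl, v_rblb = vdd, vdd              # both read bitlines pre-charged
    for w, x in zip(weights, inputs):
        assert w in (-1, +1) and x in (0, 1)
        if x == 0:
            continue                      # RWL not pulsed: no discharge
        if w == +1:
            v_rblb -= delta_v             # (W,X)=(+1,1): VRBL - VRBLb = +dV
        else:
            v_rbl -= delta_v              # (W,X)=(-1,1): VRBL - VRBLb = -dV
    return v_rbl, v_rblb

# The differential bitline voltage is proportional to the dot product
# of the weights with the active inputs:
ws = [+1, -1, +1, +1]
xs = [ 1,  1,  0,  1]
v_rbl, v_rblb = dot_product_voltage(ws, xs)
dp = sum(w * x for w, x in zip(ws, xs))   # = +1 - 1 + 0 + 1 = 1
assert round((v_rbl - v_rblb) / 0.005) == dp
```

In the real macro the "sum" is performed in parallel by charge sharing on the two bitlines, which is what limits the dynamic range (~200mV) and motivates the linearity claim.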

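The column-ADC reference sweep of Section II.B (a fixed dot-product compared against a stepped reference, one thermometer bit per cycle, over 2^n - 1 cycles) can likewise be sketched. The evenly spaced reference levels generalizing the paper's +2/0/-2 two-bit example are an assumption on our part.

```python
def column_adc(dp, nbits):
    """Sketch of the reference-sweep column ADC (Fig. 3).

    Each cycle the sense amplifier compares the dot-product column
    against a reference column, producing one thermometer bit;
    2**nbits - 1 cycles are needed (1-31 cycles for 1-5 bits).
    """
    ncycles = 2 ** nbits - 1
    # References swept from high to low; for nbits=2 this reproduces
    # the paper's +2, 0, -2 sweep (assumed even spacing in general).
    refs = [ncycles - 1 - 2 * i for i in range(ncycles)]
    thermometer = [1 if dp > r else 0 for r in refs]
    return sum(thermometer)      # thermometer-to-binary: count the ones

# Paper's 2bit example: DP = -1 vs. refs +2/0/-2
# -> thermometer TH[2:0] with one '1' -> binary code B[1:0] = 01
assert column_adc(-1, 2) == 1
```

Note how the cycle count, not the comparator, sets the resolution/throughput trade-off: this is why the measured efficiency scales from 490TOPS/W at 1bit (1 cycle) down to 15.8TOPS/W at 5bit (31 cycles).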

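The reported efficiency figures are mutually consistent and can be checked with a few lines of arithmetic: one operation per 2.04fJ corresponds to about 490 TOPS/W, and a 5-bit result costs 31 cycles (Table I, footnote b), dividing that by 31.

```python
ENERGY_PER_OP_J = 2.04e-15                 # 2.04 fJ per 1bit bitwise OP

# 1 OP / 2.04 fJ, converted from OPS/W to TOPS/W:
tops_per_watt_1b = 1.0 / ENERGY_PER_OP_J / 1e12
assert round(tops_per_watt_1b) == 490      # matches the 490TOPS/W figure

# A 5bit conversion takes 2**5 - 1 = 31 cycles per result,
# scaling the per-result energy by 31x:
tops_per_watt_5b = tops_per_watt_1b / (2 ** 5 - 1)
assert round(tops_per_watt_5b, 1) == 15.8  # matches the 15.8TOPS/W figure
```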
