A Brain-Inspired ADC-Free SRAM-Based In-Memory Computing Macro With High-Precision MAC For AI Application
4, APRIL 2023
I. INTRODUCTION

DEEP neural networks (DNNs) show strong vitality in realizing complex artificial intelligence (AI) applications, such as image recognition, autopilot, object detection, and natural language processing (NLP) [1]. A large number of accelerators based on the Von Neumann architecture have been proposed to achieve high-performance DNN algorithm acceleration. Among them, general-purpose DNN accelerators represented by DaDianNao [2] and domain-specific CNN accelerators represented by Eyeriss [3] have been widely reported. These DNN accelerators reduce off-chip memory access by enhancing data reuse to achieve effective algorithm acceleration. However, limited by the inherent "memory wall" problem of the Von Neumann architecture, which has separate memory and processing elements (PEs), these accelerators cannot meet the growing computing requirements of DNNs with continually increasing parameters by reusing data alone (see Fig. 1 (a)).

Fig. 1. (a) Difference between Von Neumann architecture and SRAM-based in-memory computing architecture. (b) The conventional approach adopts an ADC as the readout circuit, whereas the proposed approach adopts a spiking neuron as the readout circuit.

In-memory computing (IMC) can solve this problem by performing in situ multiply-and-accumulate (MAC) operations in memory, which minimizes the energy overhead caused by off-chip memory access [4]. It is a promising candidate approach to breaking through the limitations of the Von Neumann architecture and achieving a low-power, highly parallel, and high-throughput computing system. Previous works have explored novel IMC macros based on different memory technologies [5], [6]. Among the many alternatives, the static random-access memory (SRAM) based IMC macro stands out because of its stability and maturity. Previous works [7], [8], [9], [10] have demonstrated the robustness of analog SRAM-based IMC macros in achieving energy-efficient, low-latency, and highly parallel computation. In the analog-domain approach, the result of the calculation is represented as a continuous voltage signal and converted into a digital output by an analog-to-digital converter (ADC) (see Fig. 1 (b)). However, the limited voltage margin and the excessive overhead of ADCs can limit the system's accuracy and energy-efficiency gains.

In this brief, we propose an ADC-free SRAM-based IMC macro, which adopts a spiking neuron as a readout circuit

Manuscript received 30 June 2022; revised 2 September 2022 and 29 October 2022; accepted 19 November 2022. Date of publication 23 November 2022; date of current version 29 March 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2019YFB2204800, and in part by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDB44000000. This brief was recommended by Associate Editor J. Kulkarni. (Corresponding author: Yi Kang.)
The authors are with the School of Microelectronics, University of Science and Technology of China, Hefei 230026, Anhui, China (e-mail: zh11@mail.ustc.edu.cn; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2022.3224049.
Digital Object Identifier 10.1109/TCSII.2022.3224049
1549-7747 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on August 08,2023 at 09:59:45 UTC from IEEE Xplore. Restrictions apply.
XUAN et al.: BRAIN-INSPIRED ADC-FREE SRAM-BASED IMC MACRO 1277
$$I_{MAC} = D_{MAC} \cdot I_0 \tag{1}$$

where $I_0$ is the unit current. To obtain the corresponding digital code $D_{MAC}$, the current $I_{MAC}$ is traditionally converted into a MAC voltage $V_{MAC}$ via a capacitor and then quantized by an ADC circuit. The relationship between $V_{MAC}$, $I_{MAC}$, and $D_{MAC}$ can be expressed as:

$$V_{MAC} = \frac{I_{MAC} \cdot t}{C} = D_{MAC} \cdot \frac{I_0 \cdot t}{C} \tag{2}$$

where $C$ and $t$ are the integration capacitance and the integration time, respectively. However, because of the limited voltage margins

B. Local Computing and Global Reference Block Circuit

In Section II, we found that temporal-coding spiking neurons have clear advantages in low-power computing. In this subsection, we propose a novel local computing block (LCB) circuit to perform energy-efficient sub-column MAC operations without ADC overhead. As shown in Fig. 4 (a), each LCB consists of 32 8T-SRAM cells and a spiking neuron circuit. Each 8T-SRAM cell is equivalent to an analog 1-bit multiplier. When the multiplication result is "1", the cell generates a unit current $I_0$; otherwise, no current is generated. This unit current is pooled on the local read bit line (RBL)
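As a reading aid, the current-domain relations (1) and (2) and the 1-bit multiplier behaviour of the 8T-SRAM cells can be sketched numerically. The following is a minimal behavioural sketch, not the authors' circuit model: the function names and the operating point (`I0`, `T_INT`, `C_INT`) are hypothetical placeholders chosen only for illustration, and inputs and weights are assumed to be encoded as 0/1.

```python
from typing import Sequence


def lcb_mac_code(inputs: Sequence[int], weights: Sequence[int]) -> int:
    """Return D_MAC: the number of cells whose 1-bit product is 1."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # For 0/1 operands, the 1-bit product equals a logical AND.
    return sum(x & w for x, w in zip(inputs, weights))


def mac_current(d_mac: int, i0: float) -> float:
    """Eq. (1): I_MAC = D_MAC * I0."""
    return d_mac * i0


def mac_voltage(d_mac: int, i0: float, t: float, c: float) -> float:
    """Eq. (2): V_MAC = I_MAC * t / C = D_MAC * I0 * t / C."""
    return mac_current(d_mac, i0) * t / c


# Placeholder operating point (assumed values, not taken from the brief).
I0 = 1e-6      # unit current in amperes
T_INT = 1e-8   # integration time in seconds
C_INT = 1e-12  # integration capacitance in farads

x = [1, 0, 1, 1] * 8  # 32 one-bit inputs, one per cell of an LCB
w = [1, 1, 0, 1] * 8  # 32 one-bit stored weights
d_mac = lcb_mac_code(x, w)
print(d_mac, mac_voltage(d_mac, I0, T_INT, C_INT))
```

Under these assumptions each code step corresponds to a voltage step of $I_0 \cdot t / C$, so all codes of a sub-column must share one fixed voltage swing, which is one way to read the voltage-margin limitation of ADC-based readout mentioned above.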
1278 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 70, NO. 4, APRIL 2023
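The summation-and-accumulation (S&A) stage discussed on the following page combines per-weight-bit column MACs into PSUM according to (5) and then combines the bit-serial PSUMs into SUM according to (6). A minimal software sketch of that arithmetic is given here, assuming 4-bit two's-complement operands. The function names are hypothetical, the exponent Sign∗MSB in (6) is read as "negate only the MSB term when the operand is signed" (an assumption), and the loop runs LSB-first for indexing simplicity, whereas the macro feeds inputs MSB-first; the order does not change the sum.

```python
from typing import List, Sequence


def bit_plane(values: Sequence[int], k: int) -> List[int]:
    """Bit k (LSB = 0) of each value; Python's >> yields two's-complement bits."""
    return [(v >> k) & 1 for v in values]


def weighted_sum(terms: Sequence[int], signed: bool) -> int:
    """Common form of (5) and (6): sum of 2^k * term_k, MSB term negated if signed."""
    msb = len(terms) - 1
    low = sum(term << k for k, term in enumerate(terms[:msb]))
    high = terms[msb] << msb
    return low - high if signed else low + high


def macro_mac(inputs: Sequence[int], weights: Sequence[int],
              n_bits: int = 4, signed: bool = True) -> int:
    """Bit-serial dot product built from 1-bit column MACs (behavioural only)."""
    psums = []
    for j in range(n_bits):  # one input bit plane per cycle
        x_j = bit_plane(inputs, j)
        # MAC_0..MAC_3: 1-bit-input by 1-bit-weight column MACs
        macs = [sum(a & b for a, b in zip(x_j, bit_plane(weights, i)))
                for i in range(n_bits)]
        psums.append(weighted_sum(macs, signed))  # WSC, Eq. (5)
    return weighted_sum(psums, signed)            # WAC, Eq. (6)


xs = [3, -2, 7, -8]
ws = [-1, 5, -8, 7]
print(macro_mac(xs, ws), sum(x * w for x, w in zip(xs, ws)))
```

The two printed values agree because each operand is expanded as a weighted sum of its bit planes, with the MSB plane carrying a negative weight in the signed case; this is the same decomposition that (5) applies to weights and (6) applies to inputs.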
Fig. 6. S&A consists of (a) a weighted summation circuit (WSC) for MAC operation with multi-bit programmable signed/unsigned weights and (b) a weighted accumulation circuit (WAC) for MAC operation with multi-bit programmable signed/unsigned inputs.

$\mathrm{MAC}_3$) are fed into the WSC to generate a column-MAC@1bIN-4bW (PSUM) result. The relationship between the output and input of the WSC can be expressed as follows:

$$\mathrm{PSUM} = 2^0\,\mathrm{MAC}_0 + \cdots + 2^2\,\mathrm{MAC}_2 + (-1)^{\mathrm{Sign}}\,2^3\,\mathrm{MAC}_3 \tag{5}$$

Input activations are simultaneously fed into the SRAM array in an MSB-first bit-serial manner. The WAC is required to accumulate the column MAC@1bIN-4bW results of each cycle. The WAC, shown in Fig. 6 (b), consists of a shifter for weighting, a 21-bit adder for accumulation, an O2C block for input sign extension, and some registers for pipelining. For input activations with 4-bit precision, 5 cycles are required to complete a column MAC@4bIN-4bW. The final output (SUM) of the WAC can be expressed in terms of the bit-serial input as:

$$\mathrm{SUM} = 2^0\,\mathrm{PSUM}_0 + \cdots + 2^2\,\mathrm{PSUM}_2 + (-1)^{\mathrm{Sign} \ast \mathrm{MSB}}\,2^3\,\mathrm{PSUM}_3 \tag{6}$$

Through the WSC and WAC scheme, the macro can flexibly support MAC computation with adjustable weight and input-activation precision.

IV. RESULT AND DISCUSSION

Fig. 7 illustrates the relationship between the number of activated RWLs ($N_{MAC}$), $t_{MAC}$, and $D_{MAC}$ of the proposed LCB circuit. The red scatter plot shows the results of the Monte-Carlo simulation at room temperature (25 °C) and the TT corner. It shows the offset of the output time $t_{MAC}$ caused by non-ideal factors, such as process mismatch and variation. The envelope between the two grey lines covers all possible relationships between $D_{MAC}$ and $N_{MAC}$. As shown in Fig. 8 (a), we have obtained the relationship between the digital output code $D_{MAC}$ and the output error. The red and blue lines represent the maximum positive error and the maximum negative error, respectively. The maximum absolute error is less than 1.5 LSB, and the output precision of an LCB is greater than 4 bits. To further illustrate the LCB output accuracy, we provide simulation results for different process corners, operating voltages, and temperatures. As shown in Fig. 8 (b), (c), and (d), lowering the operating voltage and the fast-NMOS (ff, fnsp) process corners lead to a drop in the output accuracy of the LCB to 3 bits. These results show that the output precision of our proposed LCB circuit can reach at least 3 bits. This conclusion remains valid under worst-case conditions (125 °C, FF corner, 1 V supply voltage).

Fig. 7. The relationship between the number of activated RWLs, $t_{MAC}$, and $D_{MAC}$ of the proposed LCB circuit by Monte-Carlo simulation at room temperature and TT corner.

Fig. 8. (a) The simulated output maximum positive error (red line) and negative error (blue line) under TT corner and room temperature. The LCB output precision from simulation at different (b) operating voltages, (c) temperatures, and (d) process corners.

Fig. 9 shows the layout photograph, area and energy breakdown, and performance summary of the proposed IMC macro, which was implemented in 0.18 µm CMOS technology and occupies 3.41 mm² with 16 Kb of SRAM. Each sub-module of the proposed macro is individually post-layout simulated in the Cadence tool. The accuracy, delay, and energy efficiency of the whole macro are obtained by aggregating the simulation results of the sub-modules. The energy efficiency of our SRAM-based IMC macro reaches 10.8-13.5 TOPS/W and the throughput reaches 10.24 GOPS for a nearly full output ratio (4bIN-4bW-14bOUT). The time for the macro to complete one vector-matrix multiplication (VMM) is 370 ns. We adopt VGG-8 as a benchmark (the software baseline is 93.5%), and our design achieves 92.5% and 93.4% inference accuracy on the CIFAR-10 dataset using 4-bit and 8-bit input and weight precision, respectively.

Design assessment shows that the spike-time approach has excellent technology scaling capability. We also estimated the performance parameters of the entire IMC macro in 28 nm technology by pre-layout simulation, in which the average energy efficiency can reach 98 TOPS/W with 4-bit MAC
under a 0.65 V analog-domain supply voltage and a 0.81 V digital supply voltage. The proposed spike-time-based IMC macro is at most 1.9× and 2.04× more energy-efficient than the state-of-the-art digital- and analog-domain macros, respectively, and is on par with the state-of-the-art time-domain macro. Table I compares this work with previous SRAM-based IMC works [7], [8], [10], [13], [14], [15], and Fig. 10 compares the Figure-of-Merit (FoM = $EF_{MAC}$ × input precision × weight precision × output precision).

TABLE I
FEATURE SUMMARY AND COMPARISON TO PRIOR WORKS

Fig. 9. Layout photograph, area and energy breakdown, and summary of proposed macro.

Fig. 10. Figure of Merit (FoM) Comparison.

V. CONCLUSION

This brief presents an ADC-free SRAM-based IMC macro to support high-precision MAC operation for AI applications based on brain-inspired computing. A 0.18 µm 16 Kb IMC SRAM macro implementation is demonstrated, and simulation shows that the energy efficiency reaches 10.8-13.5 TOPS/W while performing MAC operations with 4-bit input, 4-bit weight, and 14-bit output precision.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., Dec. 2014, pp. 609–622, doi: 10.1109/MICRO.2014.58.
[3] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017, doi: 10.1109/JSSC.2016.2616357.
[4] S. Yu, H. Jiang, S. Huang, X. Peng, and A. Lu, "Compute-in-memory chips for deep learning: Recent trends and prospects," IEEE Circuits Syst. Mag., vol. 21, no. 3, pp. 31–56, 3rd Quart., 2021, doi: 10.1109/MCAS.2021.3092533.
[5] T. P. Xiao, C. H. Bennett, B. Feinberg, S. Agarwal, and M. J. Marinella, "Analog architectures for neural network acceleration based on non-volatile memory," Appl. Phys. Rev., vol. 7, no. 3, 2020, Art. no. 31301, doi: 10.1063/1.5143815.
[6] M. Kang, S. K. Gonugondla, and N. R. Shanbhag, "Deep in-memory architectures in SRAM: An analog approach to approximate computing," Proc. IEEE, vol. 108, no. 12, pp. 2251–2275, Dec. 2020, doi: 10.1109/JPROC.2020.3034117.
[7] X. Si et al., "A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 189–202, Jan. 2020.
[8] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2018, pp. 488–490, doi: 10.1109/ISSCC.2018.8310397.
[9] Z. Chen et al., "CAP-RAM: A charge-domain in-memory computing 6T-SRAM for accurate and precision-programmable CNN inference," IEEE J. Solid-State Circuits, vol. 56, no. 6, pp. 1924–1935, Jun. 2021, doi: 10.1109/JSSC.2021.3056447.
[10] X. Si et al., "A local computing cell and 6T SRAM-based computing-in-memory macro with 8-b MAC Operation for edge AI chips," IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2817–2831, Sep. 2021, doi: 10.1109/JSSC.2021.3073254.
[11] W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski, Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[12] J.-M. Hung et al., "An 8-Mb DC-current-free binary-to-8b precision ReRAM nonvolatile computing-in-memory macro using time-space-readout with 1286.4-21.6TOPS/W for edge-AI devices," presented at the IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022.
[13] Y.-D. Chih et al., "An 89TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in memory macro in 22nm for machine-learning edge applications," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 64, 2021, pp. 252–254.
[14] A. Sayal, S. S. T. Nibhanupudi, S. Fathima, and J. P. Kulkarni, "A 12.08-TOPS/W all-digital time-domain CNN engine using bi-directional memory delay lines for energy efficient edge computing," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 60–75, Jan. 2020, doi: 10.1109/JSSC.2019.2939888.
[15] P.-C. Wu et al., "A 28nm 1Mb time-domain computing-in-memory 6T-SRAM macro with a 6.6ns latency, 1241GOPS and 37.01TOPS/W for 8b-MAC operations for edge-AI devices," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 65, 2022, pp. 1–3.