
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/359331985

A review on SRAM-based computing in-memory: Circuits, functions, and applications
Article in Journal of Semiconductors · March 2022
DOI: 10.1088/1674-4926/43/3/031401


A review on SRAM-based computing in-memory: Circuits, functions, and applications
Zhiting Lin, Zhongzhen Tong, Jin Zhang, Fangming Wang, Tian Xu, Yue Zhao, Xiulong Wu, Chunyu Peng, Wenjuan Lu, Qiang Zhao, and Junning Chen

Citation: Z T Lin, Z Z Tong, J Zhang, F M Wang, T Xu, Y Zhao, X L Wu, C Y Peng, W J Lu, Q Zhao, and J N Chen, A review on SRAM-based computing in-memory: Circuits, functions, and applications[J]. J. Semicond., 2022, 43(3).
View online: https://doi.org/10.1088/1674-4926/43/3/031401

Journal of Semiconductors (2022) 43, 031401
REVIEWS
doi: 10.1088/1674-4926/43/3/031401

A review on SRAM-based computing in-memory: Circuits, functions, and applications

Zhiting Lin, Zhongzhen Tong, Jin Zhang, Fangming Wang, Tian Xu, Yue Zhao, Xiulong Wu†, Chunyu Peng, Wenjuan Lu, Qiang Zhao, and Junning Chen
School of Integrated Circuits, Anhui University, Hefei 230601, China

Abstract: Artificial intelligence (AI) processes data-centric applications with minimal effort. However, it poses new challenges to system design in terms of computational speed and energy efficiency. The traditional von Neumann architecture cannot meet the requirements of heavily data-centric applications due to the separation of computation and storage. The emergence of computing in-memory (CIM) is significant in circumventing the von Neumann bottleneck. A commercialized memory architecture, static random-access memory (SRAM), is fast and robust, consumes less power, and is compatible with state-of-the-art technology. This study investigates the research progress of SRAM-based CIM technology at three levels: circuit, function, and application. It also outlines the problems, challenges, and prospects of SRAM-based CIM macros.

Key words: static random-access memory (SRAM); artificial intelligence (AI); von Neumann bottleneck; computing in-memory (CIM); convolutional neural network (CNN)

Citation: Z T Lin, Z Z Tong, J Zhang, F M Wang, T Xu, Y Zhao, X L Wu, C Y Peng, W J Lu, Q Zhao, and J N Chen, A review on SRAM-based computing in-memory: Circuits, functions, and applications[J]. J. Semicond., 2022, 43(3), 031401. https://doi.org/10.1088/1674-4926/43/3/031401

 
 

1.  Introduction

Recently, with the breakthrough of key technologies such as big data and artificial intelligence (AI), emerging intelligent applications represented by edge computing and intelligent life have arisen in this era of rapid development[1, 2]. These applications often need to access memory frequently when handling events. However, the von Neumann architecture, the most commonly used architecture for data processing, separates the memory banks from the computing elements. Massive volumes of data are exchanged between the memory and processing units, which consumes large amounts of energy. Furthermore, the memory bandwidth limits the computing throughput. The resulting memory limitation increases energy wastage and latency and decreases efficiency, and it is even more severe in resource-constrained devices. Therefore, it is important to explore solutions to the "memory wall" problem.

As a computing paradigm that may address the von Neumann bottleneck, researchers have proposed computing in-memory (CIM) technology. CIM is a new architecture and technology for computing directly in memory. It breaks through the limitation of the traditional architecture: by optimizing and integrating the storage and logic units, it avoids the cumbersome process of transmitting data to processor registers for calculation and then back to memory, thus significantly reducing the delay and energy consumption of the chip[3]. Research on static random-access memory (SRAM) has become an active topic in CIM because of the robustness and access speed of its cells.

To improve the performance of SRAM-based CIM, the SRAM bitcell structure has been modified and auxiliary peripheral circuits have been developed. For example, read–write isolation 8T[4, 5], 9T[6−8], and 10T[9−11] cells were proposed to prevent the storage damage caused by multirow reading and calculation. Transposable cells were proposed to overcome the limitations of storage arrangement[12−14]. Peripheral circuits, such as word-line[15, 16] and bit-line[10] digital-to-analog conversion (DAC), redundant reference columns[1, 5], and multiplexed analog-to-digital conversion (ADC)[17], were proposed to convert between analog and digital signals. With these modified bitcells and additional peripheral circuits for SRAM CIM, researchers have achieved various computational operations, including in-memory Boolean logic[5, 7, 18−27], content-addressable memory (CAM)[5, 19, 20, 28, 29], Hamming distance[8], multiplication and accumulation (MAC)[1, 2, 9, 10, 15, 17, 30−34], and the sum of absolute differences (SAD)[35, 36]. These in-memory operations can expedite AI algorithms. For example, Sinangil et al. implemented a multibit MAC based on a 7-nm FinFET technique to accelerate the CNN algorithm, achieving a recognition accuracy of 98.3%[37, 38]. Agrawal et al. proposed a novel "read–compute–store" scheme, wherein the XOR calculation results were used to accelerate the AES algorithm without having to latch data and perform subsequent write operations[18].

As illustrated in Fig. 1, we review SRAM-based CIM at three levels: circuit, function, and application. The circuit level is reviewed from two aspects: 1) bitcell structures, which include read–write separation structures, transposable structures, and compact coupling structures, and 2) peripheral auxiliary circuits.

Correspondence to: X L Wu, [email protected]
Received 28 AUGUST 2021; revised 4 NOVEMBER 2021.
©2022 Chinese Institute of Electronics
Fig. 1. (Color online) Overall framework of static random-access memory (SRAM)-based computing in-memory (CIM) for the review: (a) various functions implemented in CIM, (b) operation functions realizable with CIM, and (c) application scenarios of CIM.

The peripheral auxiliary circuits include analog-to-digital conversion circuits, digital-to-analog conversion circuits, redundant reference columns, digital auxiliary circuits, and analog auxiliary circuits (Fig. 1(a)). The function level is reviewed from two aspects: 1) pure digital CIM, which includes Boolean logic and CAM, and 2) mixed-signal CIM functions, which include MAC, Hamming distance, and SAD (Fig. 1(b)). The application level mainly reviews the acceleration of the CNN, classifier, k-NN, and AES algorithms (Fig. 1(c)). Finally, the challenges and future development prospects of SRAM-based CIM are discussed at these three levels.

2.  Memory cell in static random-access memory (SRAM)-based computing in-memory (CIM)

In the core module of the SRAM, the memory cell occupies most of the SRAM area. Irrespective of the complexity of the operations implemented in the memory unit, the primary challenge is to fully leverage the memory cells. In this section, we analyze and summarize the cell structures in SRAM-based CIM. Additionally, we compare the performance of the reconstructed SRAM cells with that of the traditional structure and discuss possible research directions for CIM in terms of cell structure.

2.1.  Structure of the 6T cell

2.1.1.  Standard 6T-SRAM structure

Standard 6T structures have been adopted in most system-on-chips (SoCs) for their high robustness and access speed. Previous studies on CIM, including Refs. [15, 16, 19, 20, 24−26, 29, 33, 35, 39−53] and Ref. [34], used standard 6T cells considering the area overhead. Fig. 2(a) illustrates a schematic of the standard 6T SRAM cell. The 6T storage cell is composed of two P-channel metal–oxide–semiconductor (PMOS) and four N-channel metal–oxide–semiconductor (NMOS) transistors, in which P1, N1, P2, and N2 constitute two cross-coupled inverters to store data stably. To perform CIM with the conventional 6T SRAM cell, the operands are commonly represented by the word line (WL) voltage and the storage node data. The processing results are often reflected by the voltage difference between BL and BLB.

2.1.2.  Dual-split 6T cell

Khwa et al.[1, 2] designed a 6T cell with double separation (WL separation and ground-line separation), as shown in Fig. 2(b), in contrast to the standard 6T structure. When the dual-split 6T cell performs basic SRAM read and write functions, it connects the left WL (WLL) and right WL (WLR), as well as the ground lines CVSS1 and CVSS2. The read and write operations are then consistent with those of standard 6T cells. However, this structure can achieve more sophisticated functions owing to the separated WL and GND, because the separated lines allow different voltages to represent various types of information.

2.1.3.  4+2T SRAM cell

Dong et al.[20] proposed a 4+2T SRAM cell (Fig. 2(c)). The 4+2T memory cells were used to decouple data from the read path. The read operation was similar to that of the standard 6T SRAM, but the write operation was different.
 
Fig. 2. (Color online) (a) Standard 6T SRAM cell, (b) dual-split 6T SRAM cell, and (c) 4+2T SRAM cell.
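The 6T operating principle described above, with operands encoded on the WL voltages and the stored bits, and the result sensed as a BL/BLB voltage difference, can be captured in a short behavioral model. This is an illustrative sketch only (the linear discharge and the constants are assumptions, not values from the reviewed designs):

```python
# Behavioral sketch of CIM with standard 6T cells: operands enter as
# word-line (WL) activation strengths, weights are the stored bits, and
# the result appears as a BL/BLB voltage difference.

def cim_6t_dot_product(wl_inputs, stored_bits, vdd=1.0, dv_per_unit=0.05):
    """Idealized multi-row 6T CIM read.

    wl_inputs   -- analog WL activation per row, normalized to [0, 1]
    stored_bits -- bit stored in each cell of the column (0 or 1)
    Each activated row pulls BL down if it stores 1, or BLB down if it
    stores 0; the discharge is assumed linear in the WL drive.
    """
    v_bl, v_blb = vdd, vdd
    for x, w in zip(wl_inputs, stored_bits):
        if w == 1:
            v_bl -= dv_per_unit * x
        else:
            v_blb -= dv_per_unit * x
    return v_bl - v_blb  # sensed differentially

# Three rows activated at full strength, storing 1, 1, 0:
# BL drops two units, BLB drops one unit.
print(round(cim_6t_dot_product([1.0, 1.0, 1.0], [1, 1, 0]), 3))  # -0.05
```

The model also makes the multi-row hazard visible: with many rows active, the accumulated discharge is an analog quantity, which is why the standard 6T cell risks storage disturbance and why the read–write separated cells of Section 2.2.1 were proposed.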
 

Fig. 3. (Color online) SRAM cells with separated read and write: (a) standard 8T SRAM bitcell, (b) 7T SRAM cell, (c) 9T SRAM cell, and (d) 10T SRAM cell.

The write operation adopted the N-well as the write word line (WWL), and the source ends of the two pull-up PMOS transistors were used as the write bit line (WBL) and write bit line bar (WBLB). The design was implemented in a deeply depleted channel (DDC) process, where the bulk effect of the circuit is more significant; to write data, the threshold voltage of the PMOS can be significantly altered by changing the N-well and PMOS source voltages. During CIM operation, different voltages on the WL and storage node represent the different operands.

2.2.  Cell structure with additional devices for SRAM-based CIM

The simple 6T structure cannot realize complex computing operations and does not fully meet the requirements of CIM. Therefore, studies on CIM have modified the traditional 6T structure, and these modifications can be roughly divided into the following four categories.

2.2.1.  SRAM cells with separated read and write paths

Compared with the 6T cell, SRAM cells with separated read and write paths do not suffer from read–write disturbance; hence, within the memory array, multiple read WLs can be activated simultaneously to complete an operation. These cells contain a standard 6T cell and an additional read port composed of extra transistors to separate reading from writing. The write operation is similar to that of a traditional 6T cell. For the read operation, these cells read data through a read bit line (RBL) discharge or charge.

Fig. 3(a) presents a schematic of a standard 8T cell, where M2 and M1 form the additional read port. A large number of pioneering works have used standard 8T cells to complete CIM operations, including a multibit dot-product engine for computing acceleration[3], X-SRAM[18] and double-WL SRAM[5] for Boolean logic operations, a 7-nm CIM SRAM macro for multibit-by-multibit multiplication[37, 38], and a computational SRAM (C-SRAM) for vector processing[54]. To reduce the area cost, Jiang Hongwu et al.[55] proposed a 7T cell to complete the dot-product operation, as shown in Fig. 3(b). To achieve better performance and execute more complex operations, Srinivasa et al. utilized 9T SRAM cells to perform CAM operations[6]. Fig. 3(c) presents a schematic of the 9T SRAM cell. In Ref. [8], Ali et al. computed the Hamming distance based on a 9T SRAM cell. In Refs. [9, 10, 56, 57] and Ref. [17], 10T SRAM cells were used to perform CIM of the dot product and XNOR, as shown in Fig. 3(d). These studies demonstrated the advantages of the read–write separation structure in CIM, but the additional transistors also reduce the storage density.

2.2.2.  SRAM cells based on capacitive coupling

SRAM cells based on capacitive coupling add capacitances inside the cell to perform the operations. Jiang et al.[31, 58] proposed a C3SRAM (capacitive-coupling computing SRAM) cell, which is composed of a standard 6T cell, a pair of transmission transistors whose gates are controlled by the storage data, and a capacitor, as shown in Fig. 4(a). In contrast to C3SRAM, Jia[59, 60] and Valavi et al.[61] proposed a multiplying bitcell (M-BC) circuit, where the transmission transistors are controlled by the storage data from the drain or the source, as shown in Fig. 4(b). These cells can execute a dot-product operation between the input vector and the stored weights. The calculation result is represented by the number of capacitors storing charge; therefore, cells based on capacitive coupling are less prone to process variations. However, the introduction of capacitance into the memory cell increases the overall power consumption and area overhead.

In Refs. [31, 58], the capacitances in the cells can be directly coupled and shared through the MBL.
Fig. 4. (Color online) SRAM cells based on capacitive coupling: (a) C3SRAM bitcell and (b) M-BC bitcell.
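The charge-domain computation of these capacitive-coupling cells, where the result is represented by how many capacitors hold charge, can be sketched behaviorally. The ideal charge sharing and the bitwise matching condition below are illustrative assumptions, not the exact switching scheme of either cited cell:

```python
# Sketch of a charge-domain dot product in the spirit of capacitive-
# coupling bitcells: each cell conditionally charges its local unit
# capacitor, and the shared line settles to the average cap voltage.

def charge_domain_mac(inputs, weights, vdd=1.0):
    """inputs, weights -- binary vectors; each cell drives its capacitor
    to VDD when input XNOR weight is true (a bipolar multiply), else to
    0 V. All unit caps then share charge on one line, so the line
    voltage is VDD * (matching cells) / (total cells)."""
    matches = sum(1 for x, w in zip(inputs, weights) if x == w)
    return vdd * matches / len(inputs)

print(charge_domain_mac([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 of 4 match -> 0.5
```

Because the output depends on a capacitor count rather than on transistor drive strength, this model also illustrates why the text calls such cells less prone to process variations.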
 

Fig. 5. (Color online) (a) Transposable bitcell containing two pairs of access transistors, (b) separated read–write transposable bitcell, and (c) schematic of the transposable 10T bitcell.

However, in Refs. [59, 60], the capacitors inside the cells need to be coupled with each other through additional switches; therefore, in the latter work, additional transistors must be introduced, which increases the area of the cell. Regarding the choice of capacitor type, Refs. [31, 58] selected MOSCAP, which constitutes 27% of the bitcell area, whereas Refs. [59, 60] selected MOMCAP, which is formed using metal-fringing structures. MOMCAP can be placed on top of the cell, so there is no additional area overhead. However, compared with MOSCAP, MOMCAP has a lower capacitance density and is not suitable for large capacitances.

2.2.3.  Transposable SRAM cell

Fig. 5 presents three different forms of transposable bitcells. 1) Wang et al.[12, 13] proposed a transposable bitcell containing two pairs of access transistors, as shown in Fig. 5(a). In the basic SRAM functions of this structure, there are two ways of writing and reading: one by columns and the other by rows. When performing CIM operations, the structure can therefore compute in two directions. 2) Similarly, Jiang et al.[14, 55] proposed a separated read–write transposable bitcell, as shown in Fig. 5(b). This structure can read data by rows or by columns. In addition, it can perform CIM operations in both the row and column dimensions simultaneously using a row bit line (R_BL) or a column bit line (C_BL). 3) Lin et al. proposed a 10T bitcell to avoid vertical data storage and improve the stability of CIM, as shown in Fig. 5(c)[11]. In this study, vector logic operations can be performed with multirow or multicolumn parallel activation, and CAM can be achieved in two directions. Because of their row–column symmetry, these cells break the limitations of conventional SRAM storage arrangements. Thus, forward- and backward-propagation algorithms can be flexibly mapped onto them.

2.2.4.  Compact coupling structure

The basic reading and writing operations of the compact-coupling structure are consistent with those of traditional SRAM, but exclusive and independent structures are developed for the CIM operations. Yin et al.[17] proposed a 12T cell, as shown in Fig. 6(a), where T1–T6 constitute a 6T SRAM circuit, T7–T10 perform the XNOR function, and T11 and T12 decide whether the XNOR function is executed. Su et al.[43, 62] proposed a two-way transpose multibit cell structure (Fig. 6(b)) that contains a common functional operation unit shared by 16 6T cells to execute multiplication in the vertical or horizontal direction.

Both structures are based on traditional 6T cells and perform operations through the added compact-coupled computing units. This strategy maintains the basic read and write capabilities of SRAM while allowing it to perform complex operations. However, the compact-coupling module also increases the area overhead and the complexity of the peripheral control circuit.

Table 1 summarizes various types of SRAM-based CIM bitcells. The area efficiency reported in previous studies is related to the structure of the basic cells.
Fig. 6. (Color online) Compact coupling structure: (a) 12T cell and (b) two-way transpose multibit cell.
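The XNOR function performed by T7–T10 is what makes a binary dot product possible: with bits mapped to bipolar values (0 to −1, 1 to +1), the XNOR of two bits equals their product. A small sketch of this identity (illustrative only, not a model of the 12T circuit itself):

```python
# The XNOR-as-multiply trick behind XNOR-based CIM cells: in the
# bipolar mapping 0 -> -1, 1 -> +1, XNOR(x, w) is exactly x * w,
# so an XNOR cell plus accumulation yields a binary dot product.

def bipolar(b):
    return 1 if b else -1

def xnor_dot(inputs, weights):
    acc = 0
    for x, w in zip(inputs, weights):
        xnor = 1 - (x ^ w)  # 1 when the bits match, else 0
        assert bipolar(xnor) == bipolar(x) * bipolar(w)
        acc += bipolar(xnor)  # accumulate +1 / -1 per cell
    return acc

print(xnor_dot([1, 1, 0, 1], [1, 0, 0, 1]))  # 3 matches, 1 mismatch -> 2
```

The accumulated value equals (matches − mismatches), which is the quantity the analog accumulation lines of such macros represent.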

Table 1. SRAM bitcells in CIM. (Precision is given as input/weight/output. TWT-MC: two-way transpose multibit cell; DDC: deeply depleted channel; FDSOI: fully depleted silicon-on-insulator; NA: not available.)

Structure of the 6T cell:
- Standard 6T, Ref. [19]: 6T cell; 28-nm FDSOI; no added circuit; read–write disturb: yes; area efficiency: high.
- Dual-split 6T, Refs. [1, 2]: 6T cell; 65 nm; no added circuit; read–write disturb: yes; area efficiency: high; 33.13 TOPS/mm²; 30.49–55.8 TOPS/W; precision 1/1/1 b.
- 4+2T, Ref. [20]: 6T cell; 55-nm DDC; two read ports; read–write disturb: no; area efficiency: high; 41.4 TOPS/W; precision 1/1/1 b.

Cell structures with additional devices:
- Read–write separating, Ref. [18]: 8T cell; 45 nm; one read port; read–write disturb: no; area efficiency: medium.
- Read–write separating, Ref. [67]: 9T cell; 65 nm; one read port; read–write disturb: no; area efficiency: medium.
- Read–write separating, Ref. [57]: 10T cell; 28 nm; two read ports; read–write disturb: no; area efficiency: medium; 170 TOPS/mm²; 1002 TOPS/W; precision 1/1/5 b.
- Capacitive coupling, Refs. [31, 58]: 8T1C cell; 65 nm; two transistors and one capacitor; read–write disturb: no; area efficiency: low; 20.2 TOPS/mm²; 671.5 TOPS/W; precision 1/1/5 b.
- Capacitive coupling, Refs. [59, 60, 61]: 10T1C cell; 65 nm; four transistors and one capacitor; read–write disturb: no; area efficiency: low; 0.6 TOPS/mm²; 192–400 TOPS/W; precision 1/1/1 b.
- Transposable, Refs. [12, 13]: 8T cell; 28 nm; two read ports; read–write disturb: no; area efficiency: medium; 27.3 TOPS/mm²; 0.56/5.27 TOPS/W; precision: arbitrary.
- Transposable, Refs. [14, 55]: 8T cell; 7 nm; one read port; read–write disturb: no; area efficiency: low; 7.2–61.1 TOPS/W; precision 2,4,8/4,8/10,12,16,20 b.
- Transposable, Ref. [11]: 10T cell; 28 nm; two read ports; read–write disturb: no; area efficiency: high; 6.02 TOPS/W; precision 1–8 b.
- Compact coupling, Ref. [17]: 12T cell; 65 nm; pull-up/pull-down circuits; read–write disturb: no; area efficiency: low; 66.7 TOPS/W; precision 1/1/3.46 b.
- Compact coupling, Ref. [43]: TWT-MC; 28 nm; multiply cell; read–write disturb: yes; area efficiency: low; 5.461 TOPS/mm²; 403 TOPS/W; precision 8/1/11 b.

In CIM, two main approaches are used to design the bitcell: 1) maintaining the standard 6T cell and 2) reconstructing the basic unit by adding transistors or capacitors. Currently, the main purpose of cell design in CIM is to realize novel operations; the automatic restoration of operation results has not yet been realized. In the future, unit design will focus on the processing of operation results.

3.  Peripheral auxiliary circuits of the SRAM-based CIM

With the basic cells in the array alone, only limited digital computing functions can be achieved. Peripheral circuits, such as high-precision ADCs, weight processing, reference-voltage generation, pulse modulation, and near-memory multiplication, must be used with the SRAM system to achieve high-performance memory operations or analog-domain computation.

3.1.  Analog-to-digital conversion (ADC) circuit

An ADC circuit is indispensable for processing the computational results of an array (mostly analog values represented by the BL voltage). It has two main roles: 1) quantifying the BL voltage and 2) weighting the BL voltage.

3.1.1.  Quantifying BL voltage

Fig. 7(a) shows an asymmetrically sized sense amplifier (SA)[18]. This circuit differs from the traditional SA, which is only used to read data. If MBL in the circuit is sized larger than MBLB, the BL discharges faster than the BLB under the same conditions. This difference in discharging speed distinguishes the two input cases '01' and '10' in Boolean logic. Thus, in a single memory read cycle, a class of bitwise Boolean operations can be obtained by reading directly from the asymmetrically sized SA outputs.

Sinangil et al.[24, 37] proposed a 4-bit flash ADC. This ADC uses SAs to save area and reduce power consumption. Each 4-bit flash ADC uses 15 SAs simultaneously and quantizes the analog voltage of the RBL against 15 different reference voltage levels (Fig. 7(b)).

In Ref. [32], Si et al. used several capacitors with different values to generate different reference voltages to quantify the analog voltage of the RBL. The RBL shares its charge with the capacitor array after multicycle operations (Fig. 7(c)). In this design, capacitors with different values are sequentially connected to one of the two ends of the SA through a switch.
Fig. 7. (Color online) (a) Asymmetric differential sense amplifier (SA), (b) flash ADC, and (c) successive approximation ADC.
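The two quantizer styles in Fig. 7, flash conversion (parallel references) and successive approximation (stepwise comparison), can be modeled behaviorally. The reference range and test voltage below are illustrative, not taken from the cited designs:

```python
# Behavioral models of the two RBL quantizers: a flash ADC compares
# against all references at once, a SAR ADC binary-searches with one
# comparator over several cycles.

def flash_adc(v, vref=1.0, bits=4):
    # 2**bits - 1 comparators fire in parallel (15 for 4 bits);
    # the output code is the count of references at or below the input.
    levels = [vref * i / 2**bits for i in range(1, 2**bits)]
    return sum(1 for lv in levels if v >= lv)

def sar_adc(v, vref=1.0, bits=4):
    # one comparator, `bits` sequential trial cycles (binary search)
    code = 0
    for step in range(bits - 1, -1, -1):
        trial = code | (1 << step)
        if v >= vref * trial / 2**bits:
            code = trial
    return code

v = 0.42
print(flash_adc(v), sar_adc(v))  # both quantize 0.42 V to code 6
```

The models make the tradeoff discussed below concrete: the flash converter spends 15 comparators on a single-cycle answer, while the SAR converter reaches the same code with one comparator and four cycles.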
 

Fig. 8. (Color online) (a) Weighted array with different capacitor sizes and (b) multi-period weighting technique using capacitors of the same size.
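Both weighting schemes in Fig. 8 reduce to simple arithmetic on the shared charge. The following sketch assumes ideal unit capacitors and no parasitics beyond the stated BL capacitance (the numeric values are illustrative):

```python
# Idealized models of the two capacitor weighting schemes.

def binary_size_weighting(dv_per_bitline):
    """Fig. 8(a) style: the bitline drops share charge onto binary-sized
    computation caps (8C/4C/2C/1C), so each drop is weighted 8:4:2:1
    before the final summation."""
    weights = [8, 4, 2, 1]
    return sum(w * dv for w, dv in zip(weights, dv_per_bitline))

def multiperiod_weighting(v_bl, periods):
    """Fig. 8(b) style: the BL capacitance equals each C_DIV, so sharing
    with three C_DIVs divides the voltage by 4 in the first period, and
    each further equal split halves it (1/8, 1/16, ...)."""
    v = v_bl / 4          # first period: share with three C_DIVs
    for _ in range(periods - 1):
        v = v / 2         # each further period halves the charge
    return v

print(round(binary_size_weighting([0.1, 0.1, 0.1, 0.1]), 3))  # 1.5
print(round(multiperiod_weighting(0.8, 2), 3))                # 0.1
```

The first scheme pays in capacitor area that grows with the weight ratio; the second pays in extra operation periods, which mirrors the demerits discussed in the text.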

That is, a stepwise comparison is performed through multiple operation stages to quantify the analog voltage of the RBL.

The quantifying circuits process the final calculation result, which is especially important for the calculation accuracy of the entire system. Most researchers choose a flash ADC[14, 37, 38, 55] or a successive-approximation (SAR) ADC[44, 51, 52, 60, 63] to quantify the BL voltage. Flash ADCs usually require multiple comparators and reference voltages; thus, at the same bit accuracy, the area and power consumption of a flash ADC are relatively larger than those of a SAR ADC. Sometimes a high-accuracy flash ADC may even be larger than the whole array. A SAR ADC needs only one comparator to complete the quantization, but it requires multiple cycles for comparison; consequently, its speed is far lower than that of a flash ADC. When selecting the type of ADC in CIM, the tradeoff between quantization accuracy and the overhead of area and power consumption is considered first. For example, the C3SRAM used a flash ADC with multiple comparators and voltage references, achieving a relatively high throughput of 1638 GOPS[58]. The ADC consumed 15.28% of the area and 22% of the power of the whole chip. In this work, 256 rows can be activated for the dot-product calculation; thus, the full resolution of the partial convolution results was 8 bit. However, considering area, power, and latency, a flash ADC with a lower, 5-bit accuracy was used, which was still able to maintain the final accuracy. In contrast, the authors in Ref. [57] used a SAR ADC achieving a throughput of 8.5 GOPS, which consumed 43.1% of the area and 24.1% of the power of the entire chip. From the presented data, this SAR ADC seems to have a larger area and power than the above-mentioned flash ADC; however, in this work it had 8-bit accuracy to increase the inference accuracy. To reduce the interference of PVT variations, the SRAM-based CIM architecture needs to be co-designed with the ADC. Therefore, ADC design is significantly important in CIM works.

3.1.2.  Weighting BL voltage circuits

One type of weighting BL voltage circuit is the capacitor-array weighting circuit. It has higher linearity and is often used when high-precision operations are required. Capacitor-array weighting techniques are broadly grouped into two categories:

1) Weighting by different capacitor sizes[37, 38], where the voltage of each RBL decreases by Δv after calculation and is shared by the capacitors connected to each column. The contributions of the four bit lines are binary weighted, with N3 on RBL[3] carrying the largest weight and N0 on RBL[0] the smallest, which is achieved by the binary-sized capacitors (8Cu, 4Cu, 2Cu, and 1Cu), as shown in Fig. 8(a). Finally, the weighted shared charge is transmitted to an ADC composed of SAs.

2) Weighting by sharing charge across multiperiod operations, as shown in Fig. 8(b). In Ref. [64], the parasitic capacitance on the BL was the same as that of C_DIV in the capacitor array, so that the charge was shared equally among the capacitors. When 1/8 weighting was required, in the first period the BL was connected to port 2, sharing charge with three C_DIVs; thus, the voltage decreased to 1/4 of its original value.
 
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
Journal of Semiconductors   doi: 10.1088/ 1674-4926/43/3/031401 7



 
Fig. 9. (Color online) Schematic of the column-wise GBL_DAC circuit: (a) circuit of the constant current source, (b) two-stage MUX, and (c) waveform of the column-wise GBL_DAC circuit. (d) Schematic and waveform of the pulse-height modulation circuit.
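The multiperiod charge-sharing weighting discussed in this section can be sketched numerically. Below is a minimal behavioral model assuming ideal, equally sized capacitors and lossless switches; the voltage value is illustrative, not taken from Ref. [64].

```python
# Ideal model of multiperiod charge-sharing weighting: sharing the BL
# charge with k equal capacitors scales the voltage by 1/(1+k), so a
# sequence of sharing steps builds up binary fractions such as 1/8.

def share(v_bl: float, k_caps: int) -> float:
    """Charge on the BL cap is redistributed over (1 + k_caps) equal caps."""
    return v_bl / (1 + k_caps)

# 1/8 weighting in two periods, as described for Ref. [64]:
v0 = 0.8           # BL voltage after calculation (V), assumed
v1 = share(v0, 3)  # period 1: share with three C_DIVs -> v0/4
v2 = share(v1, 1)  # period 2: share with one more cap -> v0/8
assert abs(v1 - v0 / 4) < 1e-12
assert abs(v2 - v0 / 8) < 1e-12
```

Other fractional weights follow from different sharing sequences; in the real circuit, the achievable accuracy is limited by capacitor mismatch and parasitics rather than by this ideal arithmetic.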

shared charge was divided equally. Therefore, the original charge was ultimately divided into 1/8 of the original charge. Charge can be shared in multiperiod operations in a similar manner to achieve other weights.

The two types of weighting techniques have their demerits. In the first technique, the capacitance increases exponentially with the number of input bits, which increases the area overhead exponentially; it is also difficult to realize in the manufacturing process. However, in that work, the unit computation capacitance is formed by the inherent capacitance of the sense amplifier (SA) inside the 4-bit flash ADC, which saves area and minimizes the kick-back effect. In the second technique, the capacitance of the circuit remains unchanged. However, multiple operation periods are required to complete the weighting, which decreases the computation speed. In general, whether it is the equivalent capacitance of the SA, a MOMCAP, or a MOSCAP, a large capacitance in the CIM system will have a certain impact on the robustness of the operations.

3.2.  Digital-to-analog (DAC) conversion circuit

The purpose of the DAC circuit is to convert a digital input into the corresponding analog quantity as the pulse width or pulse height. The BL voltage can be decreased or increased proportionately by controlling the pulse width or pulse height in proportion to the digital input. The precise generation of these widths or heights is crucial to multibit calculation.

For example, for an input of 6 bits, the circuit design requires 64 different pulse widths. Generating such a variety of pulse widths in the memory consumes an immense area and power. Therefore, the solution to this problem requires a delicate circuit. Biswas et al.[10] proposed a global read bitline (GBL) DAC circuit consisting of a cascade PMOS stack biased in the saturation region to act as a constant current source (Fig. 9(a)) and a two-level data selector (Fig. 9(b)). The circuit controls the opening time of the transmission gates according to the input data so that the bitline is charged to the corresponding voltage value. The traditional scheme would require 64 types of signals and a 64 : 1 MUX for a 6-bit input (XIN[5:0]). However, in Ref. [10], a two-level data selector was proposed to solve the problem. The first level has three 2 : 1 MUXes, and the second level an 8 : 1 MUX. The first stage TS56 chooses XIN[5:3] as the input, and the second stage XIN[2:0]. The two stages have pulse widths of 56t0 and 7t0, re-
 
 

spectively. Finally, 64 proportional time-series signals are generated by combining the eight different widths of the 8 : 1 MUX input. The operation waveform is illustrated in Fig. 9(c). The 64 different pulse widths act on the charging time of the BL through the constant current source, such that the BL has 64 precharge voltage levels corresponding to the input. Finally, this circuit can achieve up to six bits of DAC input.

As shown in Fig. 9(d), the DAC circuit used for pulse-height modulation is composed of a binary-weighted current source and a copy unit[15, 16]. In the current source circuit, a fixed voltage VBIS is applied to the gates of transistors with different width-to-length ratios. It produces different proportions of current according to the input data and then passes the current through the diode-connected MA,R, generating a weighted voltage at the WL to realize proportional BL discharge. Similarly, Ref. [65] used a current-type DAC to realize different pulse heights and applied it to the BL, thereby realizing multibit input.

The DAC precisely controls the proportional discharge of transistors. However, the transistor itself is a nonlinear device, so the discharge rate cannot be controlled exactly proportionally. Thus, both techniques encounter the problem of nonlinear calculation results. Solving this is particularly important for SRAM-based CIM.

3.3.  Redundant reference column circuit

The in-memory Boolean logic, CAM operation, and MAC require multiple reference voltages. In practice, multiple reference voltages cannot be simultaneously connected due to the limited external pins of the chip. To address this, multiple reference voltages must be implemented in the array. A common technique for generating the required reference voltages in an array is the redundant reference column circuit.

Fig. 10. (Color online) Redundant reference column technology.

As shown in the orange rectangle in Fig. 10[5], the BL and BLB of the redundant reference column are shorted together. Therefore, the parasitic capacitance of the redundant reference column is twice that of the main array. The BL voltage of the redundant reference column is thus reduced by half relative to the BL voltage of the main array, generating the desired reference voltage. The redundant column with buffers generates the required reference voltages, and it also tracks the PVT variations in the memory array, thereby increasing the sensing margin. Si et al.[1, 2] proposed a dynamic input-aware reference generation scheme to generate an appropriate reference for the binary dot-product accumulation mode. Two redundant columns were used as the reference columns. The BLs of the reference columns were selectively connected through switches to generate different reference voltages. Generating multiple reference voltages in the memory array increased the functionality and enhanced the accuracy of the entire system.

3.4.  Digital auxiliary computing circuit

Although most of the repeated operation functions are completed in the analog domain, the accumulation of the analog results is technically difficult. Digital-domain accumulation is a good choice, and digital auxiliary circuits can significantly improve the computational accuracy of the system. Fig. 11(a) illustrates a digital-aided technique for 2- and 3-bit multiplication. The Boolean logic "AND" and "NOR" functions are realized by an SA sensing the voltage of column CBL or CBLB[12, 13]. Then, the addition operation is implemented by the Boolean logic. Finally, the trigger and data selector are used to accumulate the results of the addition operation to execute multiplication.

The digital circuit not only assists in the execution of multiplication but also the MAC, as shown in Fig. 11(b)[22, 31, 58]. The bit-tree adder is used to obtain the cumulative sum of the BL quantization results. For an array with N numbers to be accumulated, a log(N)-layer full adder is required to perform the population count.

Digital calculations are highly precise. However, this method requires multiple cycles and has a relatively large area overhead and power consumption. The efficient use of digital circuits is a research direction in in-memory computing.

3.5.  Analog auxiliary computing circuit

The analog-domain auxiliary circuit has a lower calculation accuracy than its digital counterpart; however, it can achieve a higher computing ability with limited area and energy consumption. Kang et al.[35, 36, 45, 50, 66] proposed a signed multiplier in the analog domain. The timing diagram is illustrated in Fig. 12(a), and the schematic in Fig. 12(b). First, the analog voltage corresponding to the 8-bit weight data is shared with five capacitors of equal size through ϕdump connected to the multiplier. Then, ϕ2,0–ϕ2,3 are driven by the 4-bit digital input XLSB: if the corresponding input bit is 0, the switch is deactivated; otherwise, it is activated. Finally, ϕ3,0–ϕ3,3 are activated in turn to redistribute the charge, and the final voltage is proportional to the product of the input XLSB and the weight. An 8-bit input is realized through two operation cycles, with one cycle processing 4-bit data. The array is not used to complete the multiplication; hence, it does not require reconstruction, preserving its storage density and robustness.

4.  Computational functions of the SRAM-based CIM

Because the internal cells of an SRAM array are repetitive, the operation in memory must be simple and repeatable. The existing SRAM-based CIM can be classified into two types: pure digital CIM and mixed-signal CIM. The pure digital CIM mainly includes Boolean operation and CAM, and the mixed-signal CIM the processing of the Hamming distance, MAC, and SAD.

4.1.  Digital SRAM-based CIM

4.1.1.    Boolean logic (AND, OR, NAND, NOR, XNOR, XOR, and IMP)

Implementing the Boolean logic in memory is relatively
 
 

 
Fig. 11. (Color online) (a) In-/near-memory computing peripherals and (b) a bit-tree adder.
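The bit-tree adder of Fig. 11(b) can be sketched in a few lines: the 1-bit sense-amplifier outputs are summed pairwise, layer by layer, so N inputs need log2(N) adder layers to produce the population count. This is a behavioral sketch, not a gate-level description.

```python
# Sketch of the bit-tree adder in Fig. 11(b): N single-bit values are
# accumulated by a log2(N)-deep tree of adders (population count).

def bit_tree_popcount(bits: list[int]) -> int:
    """Sum N single-bit values with a log2(N)-layer adder tree."""
    assert bits and len(bits) & (len(bits) - 1) == 0, "N must be a power of two"
    layer = bits
    while len(layer) > 1:  # each layer halves the number of partial sums
        layer = [layer[i] + layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

assert bit_tree_popcount([1, 0, 1, 1]) == 3
```

For N inputs the loop runs log2(N) times, matching the log(N)-layer full-adder depth stated in Section 3.4.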
 

 
Fig. 12. (Color online) Signed 4-b × 8-b least significant bit (LSB) multiplier: (a) timing diagram and (b) circuit schematic.
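The multiply-by-charge-redistribution idea behind Fig. 12 can be illustrated with a simplified serial model: each cycle, one input bit either couples the weight voltage onto a unit capacitor or leaves it discharged, and charge sharing with an equal capacitor halves the running sum. This is an idealization (no parasitics, ideal switches, unsigned values), not the exact five-capacitor switch sequence of Refs. [35, 36].

```python
# Idealized charge-redistribution multiplier: after processing `bits`
# input bits LSB-first, the output voltage equals v_weight * x / 2**bits.

def charge_redistribution_multiply(v_weight: float, x: int, bits: int = 4) -> float:
    assert 0 <= x < 2 ** bits
    v_out = 0.0
    for i in range(bits):           # LSB first
        b = (x >> i) & 1
        # share charge between the accumulation cap and a cap holding
        # either v_weight (bit = 1) or 0 V (bit = 0): result is the mean
        v_out = (v_out + b * v_weight) / 2
    return v_out

assert abs(charge_redistribution_multiply(1.0, 5) - 5 / 16) < 1e-12
```

An 8-bit input would be handled as in the text: two 4-bit passes, with the LSB result scaled by 1/16 before being combined with the MSB result.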

simple and accurate because its operation is completed in the digital domain. Fig. 13(a) illustrates a basic construct for performing in-place bitwise logical operations using SRAM[19, 20, 22, 24]. To realize the logic operations of A and B, the BL and BLB are first precharged to VDD. Then, the WLs of the corresponding cells are turned on, and the two BLs are discharged according to the stored data. If AB = 11, BL will not discharge; if AB = 01/10, BL will discharge to a certain level; if AB = 00, BL will discharge the most, as shown in Fig. 13(b). The SA can realize different logic functions, such as AND and NOR, by setting different VREF values. Finally, XNOR can be realized by combining 'AND' and 'NOR' through an OR gate to implement the full Boolean logic. Fig. 13(c) shows the implication logic (IMP) and XOR logic[18]. In the CIM mode, SL1 is connected to the VDD supply, while SL2 is grounded, forming a voltage divider. RWL1 and RWL2 are initially grounded, and RDBL is precharged to Vpre (at 400 mV). The voltages of RWL1/RWL2 represent the input data. If Q1 and Q2 store data '1' and the input is '00/11' (RWL1 = 0, RWL2 = 0; RWL1 = 1, RWL2 = 1), RDBL remains at Vpre; if the input is '01' (RWL1 = 0, RWL2 = 1), RDBL discharges to ground; and if the input is '10' (RWL1 = 1, RWL2 = 0), RDBL charges to VDD. Finally, the results of IMP and XOR can be obtained by using two skewed inverters to sense the RDBL voltage.

Most of the existing Boolean logics are realized by opening two rows of cells and sensing the BL voltage by setting the SA reference voltage[5, 7, 21, 23−27]. Implementing multiple-input logic operations in a single cycle, and XNOR without additional combinational logic gates, is challenging. Surana et al. proposed a 12T dual-port dual-interlocked-storage-cell SRAM to implement the essential Boolean logic in a single cycle[23]. Lin et al. leveraged the three types of BLs of a traditional 8T-SRAM to simultaneously realize four-input logic operations[5]. Zhang et al. proposed an 8T-SRAM with dual complementary access transistors[64]. It utilizes the threshold voltages of the NMOS and PMOS, precharges the BL voltage to 0.5 VDD, and finally configures different reference voltages to realize the Boolean logic. In addition, a composite logic operation can be realized without additional combinational logic. The existing methods add additional cycles to store results in other cells[18]; however, this decreases the speed and storage density of the system. The future direction of in-memory Boolean logic is to effectively store the calculated results.

4.1.2.    Content addressable memory (CAM)

The CAM is a special type of memory that can automatically compare the input data with all the data stored in the array simultaneously to determine whether the input data matches the data in the array. The realization of CAM in SRAM can re-
 
 

 
Fig. 13. (Color online) Boolean operation: (a) Boolean logical operations using an SRAM array, (b) histogram of AND and NOR operation voltages,
and (c) schematic of the 8T-SRAM for implementing the IMP and XOR operations.

 
Fig. 14. (Color online) Column-wise BCAM: (a) search example in 3D-CAM and (b) 4+2T. Row-wise TCAM: (c) organization based on 10T and (d) organization based on 6T.
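The search behavior illustrated in Fig. 14(a) reduces to a simple rule: every mismatching cell opens a discharge path from the precharged match line, so the line stays high only if all bits match. A purely behavioral sketch (no electrical details):

```python
# Behavioral CAM search: one match flag per stored word;
# True models a match line that remains precharged (no discharge path).

def cam_search(stored_words: list[list[int]], search: list[int]) -> list[bool]:
    return [all(s == q for s, q in zip(word, search)) for word in stored_words]

array = [[0, 1, 1, 0],
         [0, 1, 0, 1],
         [1, 1, 1, 0]]
assert cam_search(array, [0, 1, 1, 0]) == [True, False, False]
```

All words are compared in the same cycle, which is the source of the CAM's parallelism over a conventional read-and-compare loop.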

duce data transmission and avoid a large amount of energy consumption. The CAM operation can be divided into binary CAM (BCAM) and fault-tolerant ternary CAM (TCAM). In addition, there are two different search modes: row-wise search and column-wise search. The row-wise search is defined as the case where the input search data are represented by lines connected along rows. Similarly, the column-wise search is defined as the case where the input search data are represented by lines connected along columns. In Fig. 14, (a) and (b) are the column-wise BCAM operations, and (c) and (d) are the row-wise TCAM operations. Srinivasa et al.[6] used the array structure shown in Fig. 14(a) to execute a column-wise BCAM operation. In the CAM operation, the data to be searched are stored in the array, and the search data are represented by the voltage values of the source line (Sl) and Slb. If the data to be searched match the search data, the precharged match line remains not-discharged, and vice versa. To save area during CAM operations, Dong et al.[20] pro-
 
 

Table 2.   Summary of chip parameters and performance of in-memory Boolean logic and CAM

Parameter | Ref. [19] | Ref. [28] | Ref. [20] | Ref. [29] | Ref. [5] | Ref. [68] | Ref. [11]
Technology | 28-nm FDSOI | 180 nm | 55-nm DDC | 28-nm FDSOI | 65 nm | 28 nm | 28 nm
Cell type | 6T | 8T | 4+2T | 6T | 8T | 14T | 10T
Array size | 64×64 | 8×8 | 128×128 | 128×64 | 128×128 | 1024×320 | 64×64
Supply voltage (V) | 1 | 1.2 | 0.8 | 0.9 | 1.2 | 0.9 | 0.9
CAM freq. (MHz) | 370 (1 V), 8.90 (0.38 V) | NA | 270 (0.8 V) | 1560 (0.9 V) | 813 (1.2 V) | 1330 (0.9 V) | 262 (0.9 V)1, 256 (0.9 V)2
CAM energy (fJ/bit) | 0.6 (1 V) | NA | 0.45 (0.8 V) | 0.13 (0.9 V) | 0.85 (1.2 V) | 0.422 (0.9 V) | 1.025 (0.9 V)1, 0.635 (0.7 V)1; 1.02 (0.9 V)2, 0.632 (0.7 V)2
Logic freq. (MHz) | NA | NA | 230 (0.8 V) | NA | 793 (1.2 V) | NA | ~300 (0.9 V)
Logic energy (fJ/bit) | ~22.5 (1 V), 16.6 (0.8 V) | NA | 24.1 (0.8 V) | NA | ~31 (1.2 V), ~12.5 (0.8 V) | NA | ~15 (0.9 V), ~10.5 (0.7 V)
Search mode | 1 | 1 | 2 | 2 | 1 | 2 | 1, 2
Function | SRAM/CAM/Logic | SRAM/Left shift/Right shift | TCAM/SRAM/CAM/Logic | BCAM/SRAM/Pseudo-TCAM | SRAM/CAM/Logic | SRAM/TCAM | SRAM/CAM/Logic/Matrix transpose

1Row-wise search. 2Column-wise search. DDC, deeply depleted channel; FDSOI, fully depleted silicon on insulator.

posed a 4+2T SRAM. The data to be searched are also stored in an array, and the search data are represented by the BL voltage. The SA is used to capture the voltage value of the match line to obtain the result, as shown in Fig. 14(b). Because the two studies represent the search data by the BL voltage, all the memory units are column-wise addressable.

The difference between TCAM and BCAM is that the former has a do-not-care state. Owing to the presence of three states in the TCAM, two cells are used to represent one data point. The three states, 0/1/X (where X is an independent, do-not-care state), are represented by 00/11/01, respectively, whose implementation for column-wise search is shown in Fig. 14(c). The data to be searched are represented by two adjacent cells in a row, and the search data by the SL/SLB voltage. The intermediate ML1'/ML2 of two adjacent columns are shielded, and the matching result is determined by whether the other two MLs are discharged[11]. In addition, to match the characteristics of the data stored in rows in an SRAM, this work used a 10T-SRAM to simultaneously implement CAM in both the row and column dimensions. In order to save area, Jeloka et al. used the standard 6T cell to achieve TCAM operations, as shown in Fig. 14(d)[19], where the search data are represented by the WL voltages and the nodes in two adjacent cells in a row represent the searched data. Similarly, the search results are obtained by the SA sensing the BL voltages. However, unlike Refs. [11, 19], Ref. [29] used the virtual ground wire technique as the sensing mechanism. The virtual and actual grounds are connected via a diode, and the SA detects the voltage of the virtual ground, obtaining the ternary addressing result.

Lin et al.[5] and Chen et al.[28] utilized the 8T unit to complete the CAM operation. The difference between the two studies is that the former used the BL as the data search line, whereas the latter used the WL. Ref. [67] used CAM auxiliary circuit techniques to improve the operational speed of the system under ultra-low voltage. In Ref. [68], the CAM function was realized by combining two 6T cells with two additional control transistors.

In Table 2, the performances of several CIM Boolean logics and CAM are compared, and the different search modes are presented. Most studies have implemented both Boolean operations and CAM by modifying standard cells, indicating the compatibility of these two functions. However, few have achieved the CAM function in two directions simultaneously. Currently, in-memory Boolean logic is realized at the cost of reduced parallelism as opposed to its analog counterpart. Therefore, improving the parallelism of Boolean logic and using it to achieve sophisticated calculations in memory will be a direction for research.

4.2.  Mixed-signal SRAM-based CIM

The mixed-signal SRAM-based CIM is primarily categorized into two types: 1) single-bit operation, including binary and ternary dot products and Hamming distance, and 2) multibit operation, including multibit multiplication and SAD.

4.2.1.    Single-bit operation

A. Binary dot product

The multiplication operation of a single bit is also a dot-product operation. As mentioned previously, there are two subtypes of this operation: binary and ternary dot products. Chiu et al. executed (1,0) × (1,0) and (1,0) × (+1, −1) dot-product operations with standard 6T cells[33]. The input is represented by the WL voltage, and the weight by the 6T storage data. A truth table is used to summarize the combinations of different inputs and weights, as shown in Fig. 15(a). The binary dot-product results are represented by the voltage difference between BL and BLB. The BL voltage changes according to the result of the multiplication of the input and weight. Sun et al. proposed a dedicated 8T-SRAM cell for the parallel computing of a binary dot product[30, 69] (Fig. 15(b)). There are two complementary WLs (i.e., WL and WLB) and two pairs of pass gates (PGs) to achieve the operation function. The first pair of PGs is controlled by the WL and connects Q and QB to BLB and BL, respectively, while the second pair of PGs is controlled by WLB and connects Q and QB to BL and BLB, respectively. The proposed 8T-SRAM always contains a non-zero voltage difference between BL and BLB, which represents the results of the binary-weight multi-
 
 

 
Fig. 15. (Color online) Schematic and truth table of the binary dot product: (a) 6T-SRAM binary dot product and (b) 8T-SRAM binary dot product.
(c) Ternary dot product: operation of ternary multiplication and XNOR value mapping table.
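The accumulation mechanism behind the binary dot product can be sketched as follows: activating several word lines at once lets each (input AND weight) pair sink one unit of discharge current, so the bitline voltage drop accumulates the dot product. The unit voltage step is an illustrative assumption, not a value from the cited designs.

```python
# Behavioral sketch of a multi-row binary dot product: inputs drive the
# WLs, weights are the stored bits, and ΔV_BL is proportional to Σ x_i*w_i.

STEP = 0.05  # BL voltage drop per discharging cell (V), assumed

def binary_dot_product_dv(inputs: list[int], weights: list[int]) -> float:
    """ΔV on the BL after simultaneously activating all rows."""
    return STEP * sum(x & w for x, w in zip(inputs, weights))

assert abs(binary_dot_product_dv([1, 1, 0, 1], [1, 0, 1, 1]) - 2 * STEP) < 1e-12
```

In practice the linearity of this accumulation is limited, which is why the quantizing ADCs and reference schemes of Section 3 matter.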

plication operations. The truth table in Fig. 15(b) reveals the combinations of different inputs and weights, and the corresponding changes caused in the BL discharge current (Δi) and voltage (Δv). In addition, according to the input neuron vector, a WL switch matrix is used to simultaneously activate multiple WLs. The discharge currents from several bitcells in the same column decrease the voltage of the BL (BL or BLB). Thus, the voltage difference between BL and BLB can be used to determine the weighted sum.

The cumulative result of the above study is on the BL; therefore, the operation should be performed by columns. However, in the traditional SRAM storage mode, data are stored row-wise. The CIM mode conflicts with the storage mode, which reduces the operation efficiency. To address this problem, Agrawal et al. used a 10T SRAM to perform a row-wise binary dot product, and the results were reflected on the horizontal source lines[9], conforming to the traditional SRAM storage mode. However, the storage density is reduced by the in-
 
 

 
Fig. 16. (Color online) Row of 9T SRAM cells for calculating the Hamming distance.
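The read-out of Fig. 16 can be modeled as an ideal resistor divider: matching cells pull the shared source line toward VDD, mismatching cells toward GND, so the SL settles near VDD times the fraction of matching bits. The sketch below ignores device resistances and treats the divider as ideal.

```python
# Hamming distance and the idealized SL divider voltage of Fig. 16.

VDD = 1.0

def hamming_distance(a: list[int], b: list[int]) -> int:
    """Number of corresponding bits with different values."""
    return sum(x != y for x, y in zip(a, b))

def sl_voltage(stored: list[int], query: list[int]) -> float:
    n = len(stored)
    matches = n - hamming_distance(stored, query)
    return VDD * matches / n  # divider between VDD pull-ups and GND pull-downs

# The example from the text: distance from 1101 to 0111 is 2.
assert hamming_distance([1, 1, 0, 1], [0, 1, 1, 1]) == 2
assert abs(sl_voltage([1, 1, 0, 1], [0, 1, 1, 1]) - 0.5) < 1e-12
```

A lower SL voltage thus directly encodes a larger Hamming distance, which is what the subsequent quantization stage reads out.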

troduction of additional transistors in the basic cell.

B. Ternary dot product

Yin et al. achieved a ternary dot product[17] (Fig. 15(c)). For the XNOR operation, if the result is +1, the PMOS provides a strong pull-up and the NMOS a weak pull-up on the RBL. If the result equals −1, then the NMOS provides a strong pull-down and the PMOS a weak pull-down. A resistor divider is formed by the cells on the same column, with the RBL as the output. The voltage on the RBL represents the accumulation of the XNOR results in that column. For XNOR with ternary activation inputs (+1, 0, or −1) and binary weights, an additional case of '0' input must be considered. As shown in Fig. 15(c), if the '0' input is in an even row, the PMOS provides a weak pull-down and the NMOS a weak pull-up on the RBL. If the '0' input is in an odd row, the PMOS provides a strong pull-up and the NMOS a strong pull-down on the RBL. Assuming there is a sufficiently large number of inputs, half of the inputs will be in the even rows and the other half in the odd rows. Thus, the cumulative increase and decrease in the RBL voltage are 0. The result of the ternary multiplication is summarized in the truth table in Fig. 15(c).

As the input and weight of the dot-product operation are single bits, no additional auxiliary circuits are required to process the weight, and the quantization of the operation results is simplified. The dot-product operation can be applied to the binary neural network (BNN) algorithm, where the inputs can be restricted to either +1/−1 or 0/1. When the inputs are restricted to 0 or 1, there is a 0.03% loss of accuracy compared with +1/−1 inputs, as tested on the MNIST dataset[1]. However, because these studies used the analog voltage of the BLs to reflect the operation results, they could not obtain results as accurate as those of the digital CIM.

C. Hamming distance

The Hamming distance algorithm is widely used in signal processing and pattern recognition. The Hamming distance between any two vectors of the same length is defined as the number of corresponding bits with different values. It is computed by comparing two words of the same length bitwise (a bitwise XOR) and accumulating the result bits. For example, the Hamming distance from 1101 to 0111 is 2. Because this algorithm also requires high data access, it consumes a significant amount of energy when used in the traditional architecture.

Ali et al. proposed a 9T SRAM to calculate the Hamming distance[8], as illustrated in Fig. 16. One of the vectors is stored in the memory array, while the other vector is used to drive the RBL/RBLB. For example, if the input vector is '1', the RBL is connected to VDD, and the RBLB to GND. If the data in one cell match the input, the SL is connected to VDD through M1,n. If they mismatch the input, the SL is connected to GND through M2,n. Hence, this circuit forms a voltage divider at the SL. Finally, the results of the Hamming distance accumulate on the SL.

Unlike the row-wise Hamming distance operations, Kang et al. proposed a column-wise Hamming-distance macro based on a 6T SRAM array[70]. In this design, two inputs are stored in two different rows in the same column. Then, the logic operation between the two inputs is executed by the SA. The final Hamming distance value is obtained by combining and accumulating the outputs of the SA. Because the result of the study is represented by BLs connected by columns, the Hamming distance is calculated column-wise.

In CIM, most of the results are reflected on the column-wise-connected BLs. Therefore, vertical data storage is generally required, increasing the implementation complexity for the SRAM writing mode. For example, Jeloka et al. proposed a strategy of column-wise write[19]. It writes the 1s in the first cycle and the 0s in the latter cycle, which decreases the written-data throughput and the writing speed. In addition, as a basic and important operation, matrix transposition is generally realized by data reading, moving, and writing back, a complicated operation with high power consumption. Thus, it will be a research direction to realize the Hamming distance operation in both the row and column directions.

4.2.2.    Multibit operation

Unlike single-bit operations, wherein the operands are limited to only 0, −1, and 1, multibit operations can obtain more precise in-memory computations, which meets the requirements of various AI algorithms. There are two main categories of multibit operations: 1) multibit multiplication and 2) SAD.

A. Multibit multiplication

The key to multibit multiplication is the weighting strategy. The weighting strategies include pulse width[10, 35, 36, 39, 40, 45, 50, 56, 64, 66, 71], pulse height[10, 15, 16, 44, 56], number of pulses[37, 38], width-to-length ratio of transistors[3, 32, 42, 43, 62], capacitor array weighting[37, 38, 62, 63, 72, 73], and precharge time weighting[44]. The specific implementation strategy of the capacitor array weighting technology is introduced in Section 3. The multibit multiplication is reviewed from the following
 
 

 
Fig. 17. (Color online) (a) Precharge weighting technology, (b) pulse width weighting, (c) pulse height weighting, and (d) pulse number weighting.

five aspects:

1) Precharge time weighting

Fig. 17 illustrates various pulse-based weighting techniques. Ali et al.[44] used the standard 6T cell and controlled the BL precharge time to achieve multibit multiplication operations (Fig. 17(a)). In this study, the 4-bit weight is stored within adjacent 6T SRAM cells in a row, and the 4-bit input is reflected in the WL. When performing a 4-bit multiplication operation, the pulse width of the WL is 8T and its amplitude is dependent on the input. The closing times of the precharge circuit from the first to last columns are [0, 8T+Tch-sh], [4T, 8T+Tch-sh], [6T, 8T+Tch-sh], and [7T, 8T+Tch-sh], respectively, where Tch-sh is the charge-sharing (merging) period. Because the opening time of the WL is 8T, the discharge times of the BL from the first to last columns are approximately 8T, 4T, 2T, and T, respectively.

2) Pulse width weighting

During the SRAM read operation, when the BL voltage is maintained within a certain range, its discharge voltage is proportional to the WL turn-on time; that is, the decrease of the BL voltage can be controlled by controlling the opening time of the WL. Therefore, by doubling the opening width of each WL, a proportional increase in the BL voltage change can be obtained. Sujan et al.[45, 66] proposed a functional read technology based on a pulse-width modulation strategy for 4b weighting (Fig. 17(b)). In this work, the turn-on times of WL0–WL3 can be modulated to 8 : 4 : 2 : 1 to execute the 4b-weighted multiplication operation.

3) Pulse height weighting

As shown in Fig. 17(c), different pulse heights can be applied to the WL or BL. Zhang et al.[15, 16] used the WL DAC technology to implement a machine learning (ML) classifier. Different voltages applied to the WL represent different weights, and the access transistor is considered to discharge approximately linearly according to the WL voltage, thereby realizing multibit multiplication. Biswas et al.[10, 56] realized the weighting of the dot-product sum based on a 10T SRAM array. They executed the multibit multiplication by charging the BL to different voltage values with different pulse heights.

4) Weighting strategy based on the number of pulses

Dong et al.[37, 38] designed a 7-nm FinFET CIM chip for ML that uses a pulse number modulation circuit (Fig. 17(d)). The 4-bit input is represented by the number of read word-line (RWL) pulses. These pulses are generated by a counter according to the value of the input data and are applied to the RWL to turn on the corresponding cells. The discharge amount of the BL is proportional to the number of times the RWL is turned on, thereby executing multibit multiplication.

5) Weighting strategy based on the transistor width–length ratio

As depicted in Fig. 18(a), in the 8T array, the sizes of the read access transistors in different columns are adjusted in proportion to the weights of the input data[3]. The width–length ratios of M1 and M2 in the first, second, third, and fourth columns are 8, 4, 2, and 1 (W/LM1,2 = 8 : 4 : 2 : 1), respectively. Therefore, the sum of the multiple bits multiplied by one bit can be obtained. The study combines the BL discharge currents of the four columns and converts the current value into a voltage value through an operational amplifier, which is ultimately quantified by the ADC. Similarly, Si et al. proposed a twin-8T structure based on the traditional 8T[32] and adjusted the width–length ratio of one group of reading transistors to twice that of the other group to achieve 2b input weighting (Fig. 18(b)). Su et al.[43, 62] proposed a transposable arithmetic cell structure and quantified its internal weighting by the width–length ratio of the transistors, achieving simultaneous bidirectional calculations.

Fig. 18. (Color online) (a) 8T-SRAM memory array for computing dot products with 4-bit weight precision and (b) Twin-8T cell.

Different designs have tradeoffs among time cost, implementation difficulty, linearity, area cost, and process. 1) Precharge time weighting: when the precharge circuit and the word line are turned on simultaneously, there will be a relatively large current, which increases power consumption. 2) Pulse width weighting: its operation is relatively simple. However, when the input bits increase, the corresponding pulse width increases proportionally; thus, the time increases exponentially, resulting in a significant decrease in calculation speed. Moreover, when the bit line voltage is relatively low, linearity will also be a problem. 3) Pulse height weighting: the V–I characteristic of metal–oxide–semiconductor (MOS) devices is that the current between the source and drain increases proportionally to the square of the gate voltage. Therefore, unlike 2), the pulse height cannot be increased proportionally to ensure proportional discharge of the bit line, so it is difficult to set the pulse height. In addition, the linearity is relatively poor compared with 2). 4) Pulse number weighting: as this design controls the discharge numbers of the access transistor, it is basically consistent with 2); thus, there will be similar problems. 5) Weighting based on the transistor width–length ratio: the proportional discharge of the bit line can be realized by adjusting the width–length ratios of the transistors. However, increasing the size of the transistors also increases the area overhead, brings difficulties to the layout process, and causes mismatch due to the oversized array ratio. All of these weighting methods are therefore tradeoffs among the competing constraints of circuit design.

B. Sum of absolute difference (SAD)

To implement the SAD, we must first obtain the absolute difference (AD), |D − P|, defined by Eq. (1).

|D − P| = max(D − P, P − D) = max(D + P̄ + 1, P + D̄ + 1) ⇒ max(D + P̄, P + D̄),  (1)

where P̄ and D̄ are the 1's complements of P and D, respectively. Note that D̄ and P̄ are available for free because of the complementary nature of the SRAM bitcell.

Kang et al. executed a SAD operation based on a 6T SRAM array without sacrificing storage density[35, 50]. In the SAD, the template pattern P is stored with a polarity opposite to that of D. Both P and D are stored in four adjacent cells in a column, as shown in Fig. 19(a). The decimal data can be read out on the BL through multirow read technology with WL pulse modulation. For example, if the data stored in four consecutive cells in a column are d = '1111', the decimal number is D = 15. When reading the data with four weighted pulse WLs (WL0–WL3), the BL voltage decreases by 15ΔV, as shown in Fig. 19(b). Because P and D are stored in the array in the opposite manner, the AD results can be obtained by comparing the voltages of the two BLs. These outputs are summed via a capacitive network using a charge-transfer mechanism to generate the SAD.

Table 3 summarizes the performance of existing single- and multibit operations. The numbers of input and output bits reflect the performance of the operation. However, the difficulty lies in improving the effective number of bits (ENOB) of the final output result without using a high-ENOB ADC, because the overhead of a high-ENOB ADC is unacceptable and deviates from the original intention of low-overhead in-memory computing.
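Eq. (1) can be checked in software: with n-bit values, D + P̄ = (2^n − 1) + (D − P), and symmetrically P + D̄ = (2^n − 1) + (P − D), so the larger of the two complementary sums is (2^n − 1) + |D − P|. A minimal plain-Python sanity check of this identity (illustrative only, not the hardware implementation; the circuit simply compares the two BL voltages, so the constant offset never needs to be subtracted explicitly):

```python
def abs_diff_via_complements(D, P, n=4):
    """|D - P| from Eq. (1): compare D + ~P against P + ~D.

    comp(x) is the n-bit 1's complement, which the SRAM bitcell
    provides for free on its complementary storage node.
    """
    comp = lambda x: (1 << n) - 1 - x
    # D + comp(P) = (2**n - 1) + (D - P), and symmetrically for the
    # other sum, so the larger one is (2**n - 1) + |D - P|.
    return max(D + comp(P), P + comp(D)) - ((1 << n) - 1)

# Exhaustive check over all 4-bit operand pairs.
assert all(abs_diff_via_complements(D, P) == abs(D - P)
           for D in range(16) for P in range(16))
```

In the circuit the offset is common to both bit lines, so directly comparing the two BL voltages selects the correct branch of the max.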
 

 
 

Table 3.   Summary of chip parameters and performance of single- and multibit operations.

| Parameter | Refs. [1, 2] | Ref. [10] | Refs. [15, 16] | Ref. [32] | Ref. [31] | Ref. [44] | Ref. [33] | Refs. [24, 37] | Ref. [34] |
|---|---|---|---|---|---|---|---|---|---|
| Technology | 65-nm CMOS | 65-nm CMOS | 130-nm CMOS | 55-nm CMOS | 65-nm CMOS | 65-nm CMOS | 55-nm TSMC | 7-nm FinFET | 65-nm CMOS |
| Cell structure | DCS 6T | 10T | 6T | Twin-8T | 8T1C | 6T | 6T | 8T | 6T |
| Array size | 4 Kb | 16 Kb | 16 Kb | 64×60 b | 2 KB | 64 Kb | 4 Kb | 4 Kb | 64 Kb |
| Chip area (μm²) | NA | 6.3×10⁴ | 2.67×10⁵ | 4.69×10⁴ | 8.1×10⁴ | NA | 5.94×10⁶ | 3.2×10³ | 1.75×10⁵ |
| Input precision (bit) | 1 | 6 | 5 | 1, 2, 4 | 1 | 5 | 1, 2, 7, 8 | 4 | 4 |
| Weight precision (bit) | 1 | 1 | 1 | 2, 5 | 1 | 5 | 1, 2, 8 | 4 | 1, 2, 3, 4, 5, 8 |
| Output precision (bit) | 1 | 6 | NA | 3, 5, 7 | 5 | NA | 3, 7, 10, 19 | 4 | NA |
| Computing mechanism | Analog | Digital+Analog | Digital+Analog | Analog | Analog | Analog | Digital+Analog | Analog | Analog |
| Model | XNORNN/MBNN | CNN (LeNet-5) | Classify | CNN | CNN | VGG | CNN | VGG-9 NN | CNN |
| Energy efficiency (TOPS/W) | 30.49–55.8 | 40.3 (1 V), 51.3 (0.8 V) | NA | 18.37–72.03 (0.8 V) | 671.5 | NA | 0.6–40.2 | 351 | 49.4 (input: 4b, weight: 1b) |
| Throughput (GOPS) | 278.2 | 8 (1 V), 1 (0.4 V) | NA | 21.2–67.5 | 1638 | NA | 5.14–329.14 | 372.4 (0.8 V) | 573.4 (input: 4b, weight: 2b) |
| Accuracy (MNIST) | 96.5% (XNORNN), 95.1% (MBNN) | 98% (0.8 V), 98.3% (1 V) | 90% | 90.02%–99.52% | 98.30% | 99% | 98.56%–99.59% | 98.51%–99.99% | 98.80% |
| Accuracy (CIFAR-10) | NA | NA | NA | 85.56%–90.42% | 85.50% | 88.83% | 85.97%–91.93% | 22.89%–96.76% | 89.00% |
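As a software reference for the XNOR-and-accumulate Hamming-distance scheme discussed at the start of this section (a plain-Python equivalent; the function name is illustrative):

```python
def hamming_distance(a: int, b: int, n: int = 4) -> int:
    """Hamming distance via the XNOR-and-accumulate scheme.

    XNOR marks the matching bit positions; accumulating those
    matches and subtracting from the vector length n gives the
    number of differing bits, i.e., the Hamming distance.
    """
    mask = (1 << n) - 1
    xnor = ~(a ^ b) & mask            # 1 where the bits agree
    matches = bin(xnor).count("1")    # accumulate the XNOR results
    return n - matches

print(hamming_distance(0b1101, 0b0111))  # prints 2, as in the text
```

The same accumulate-and-compare step is what a CIM macro performs on the bit line after the in-array XNOR.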

 
Fig. 19. (Color online) (a) Schematic of SAD circuit and (b) sequence diagram.
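The weighted multirow read of Fig. 19(b) can be modeled behaviorally: with WL0–WL3 pulse widths weighted 8 : 4 : 2 : 1, a column storing bits d3–d0 lowers its BL by D·ΔV. A sketch of this idealized linear model (DELTA_V is an illustrative value, and real discharge is nonlinear, as discussed in Section 6):

```python
DELTA_V = 0.02  # idealized BL drop per unit weight (V); illustrative value

def bl_drop_units(bits):
    """Weighted multirow read: BL voltage drop in units of DELTA_V.

    bits = (d3, d2, d1, d0), MSB first, one bit per activated WL;
    the WL0-WL3 pulse widths are weighted 8:4:2:1, so the total
    drop equals the stored decimal word D times DELTA_V.
    """
    return sum(w * b for w, b in zip((8, 4, 2, 1), bits))

# d = '1111' -> D = 15: the BL drops by 15*DELTA_V, as in Fig. 19(b).
print(bl_drop_units((1, 1, 1, 1)))  # prints 15
```

Storing the template P with the opposite polarity in the companion cells then makes the two bit-line drops realize the AD comparison of Eq. (1).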

5.  Application scenarios for CIM

The computing functions realizable by CIM have been widely used in various fields, including image and voice recognition, which require data exchange between encryption and decryption algorithms for data security. In this section, the application scenarios of CIM, including CNN, AES, k-NN, and classifier algorithms, are introduced.

5.1.  Application in CNNs

Inspired by biological neural networks, artificial neural networks are used in image processing and speech recognition. A deep neural network (DNN) contains at least two layers of nonlinear neural units connected via adjustable synaptic weights. These weights can be updated according to the input data to optimize the output. In 1990, a dedicated hybrid chip was developed for multilayer neural networks[74]. However, owing to its lack of flexibility, it was eventually abandoned and replaced by general-purpose programmable chips, such as field programmable gate arrays and general-purpose graphics processors. The implementation of the DNN algorithm requires the transmission of massive volumes of data between the memory and the CPU, and the resulting delay and power consumption limit the further development of DNNs.

CNN algorithms, which evolved from DNN algorithms, are fully applied in image processing. A CNN algorithm can be mapped onto multiple intercommunicating SRAM arrays, as illustrated in Fig. 20(a). Each layer of the CNN contains several repetitive MAC operations, and implementing these MACs in the traditional von Neumann architecture results in a large overhead. In CIM, the weights of the layer are stored in SRAM cells, and the input is represented by WLs or BLs. A large amount of data flowing between the processor and the memory is eliminated. The final accumulation results are of-
 
 

 
Fig. 20. (Color online) Implementation of (a) CNN and (b) AES on multiple SRAM arrays.

ten reflected in the BL voltage. Quantifying the analog voltage of the BL is key to the entire operation.

The weights and inputs of the CNN are usually multibit. However, to realize multibit operation in memory, it is necessary to change the cell structure or add auxiliary circuits. A twin-8T structure was proposed to realize 1b, 2b, and 4b inputs and 1b, 2b, and 5b weights, with an output of up to 7b[32]. The test accuracy of the system using the MNIST dataset was as high as 99.52%. To simplify circuit design, researchers have developed the BNN, with binary inputs and weights, i.e., "+1" or "–1." In simple application scenarios, it has nearly the same accuracy as the traditional CNN algorithm[1, 2]. Chih et al.[75] proposed another solution that uses all-digital CIM to execute MAC operations and has high energy efficiency and throughput. To reduce computational costs, Sie et al.[76] proposed a software and hardware co-design approach to design MARS. In this study, an SRAM-based CIM CNN accelerator that can utilize multiple SRAM CIM macros as processing units and support a sparse CNN was also proposed. With the proposed hardware-software co-designed method, MARS can reach over 700 and 400 FPS for CIFAR-10 and CIFAR-100, respectively. Although these studies have realized in-memory MAC, they could not execute the entire CNN process in memory. Therefore, the execution of the entire process in memory can be a possible research direction.

5.2.  Application in encryption algorithms

With the development of AI, the amount of data that requires processing has surged, and concerns about data security have increased. For example, the convolution kernels and convolution step in the CNN algorithm are obtained by training with a large volume of data. However, the entire weight is stored in the array and, therefore, can be easily read, causing data leakage. Encryption is crucial to big data. However, the power consumption and delay in implementing a set of encryption algorithms in the digital domain will limit the overall performance of the system; hence, researchers have proposed to implement these algorithms in memory.

Fig. 20(b) illustrates the process of the AES algorithm in four steps: byte replacement, row shift, column mixing, and round key addition. Byte replacement is used to replace the input plaintext with a look-up table and implement the first round of encryption. The row-shift operation shifts the transformed matrix by a certain rule. Column mixing is the XOR operation of the target and fixed matrices. Round key addition performs an iterative XOR between the data and the key matrix.

To implement the AES algorithm in an SRAM array, first, the plaintext matrix that needs to be encrypted is stored in the array. Then, the plaintext and key matrices are encrypted using a peripheral auxiliary circuit. The most repeated operation in the AES algorithm is the XOR operation; hence, implementing XOR in memory, storing the result, and continuing to perform the XOR operation with the input are key steps in the execution of the AES algorithm. Agrawal et al.[18] proposed a "read-calculation and storage" strategy based on the 8T SRAM, in which the obtained XOR result was stored in another row through a data selector at the end of each column,
 
 

 
Fig. 21. (Color online) Application in the k-NN algorithm.

which implemented part of the iterative XOR operation for the AES. However, this strategy increases the area overhead of the storage array and consequently requires an additional empty row in the array to store the calculation results. Jaiswal et al. proposed interleaved WLs (i-SRAM)[22] as the basic structure for embedding bitwise XOR computations in SRAM arrays, which improved the throughput of AES algorithms by a factor of three. Huang et al. modified the 6T SRAM bitcell with dual WLs to implement XOR without compromising the parallel computation efficiency[46], which can protect the DNN model in CIM. These studies realized part of the operations in encryption, and the realization of the entire AES process in memory remains a research direction.

5.3.  Application in k-nearest neighbor (k-NN) algorithms

The k-NN algorithm is one of the simplest, most basic ML algorithms and can be used for both classification and regression. The concept of the algorithm is as follows: if most of the k most similar samples in the feature space belong to a category, the sample also belongs to that category. There are two methods to measure the distance between two samples: the Euclidean distance and the Manhattan distance.

In Fig. 21, I1 and I2 are the pixel values of the target and test images, respectively. First, the pixel values at the same positions of the test and training images are subtracted, followed by calculating the absolute values and summing them. The sum represents the similarity between the test and target images: the smaller the value, the higher the similarity. The k-NN algorithm finds the first k images that are most similar to the target image. Kang et al. executed the SAD for application in the k-NN algorithm[35, 50]. They used handwritten number recognition to test the accuracy of the algorithm on the MNIST dataset. The results showed that the CIM-based k-NN algorithm has a high recognition accuracy of 92%.

5.4.  Application in classifier algorithms

Classification is a significantly important method of data mining. The concept of classification is to learn a classification function or construct a classification model (that is, what we usually call a classifier) on the basis of existing data. However, it is challenging to implement an energy-efficient classifier algorithm in resource-constrained devices. If the classifier algorithm can be realized with CIM methods, frequent data access will be greatly reduced.

As depicted in Fig. 22, a high-precision boosted strong classifier can be realized by combining the column-based weak classifiers C1–CM. Column-wise calculation is a typical characteristic of CIM, which maps perfectly onto a column-based weak classifier. Zhang et al.[15, 16] achieved an ML classifier based on the 6T cell. Because of the nonlinearity in column-wise CIM, the result of each column can only form a weak classifier. In memory, the boosted strong classifier can be obtained through adder or subtractor circuits that process the column-wise results, which reduces the non-ideal characteristics of analog CIM. In addition, this design reduces the energy consumption by 113 times when using a standard training algorithm. Similarly, a random forest (RF) machine learning classifier can be enabled by a standard 6T SRAM in Ref. [40]. It achieved massively parallel processing, thereby minimizing memory fetches and reducing the energy-delay product (EDP).

6.  Challenges and prospects

With the rapid development of AI, requirements for computing power have become stringent. The CIM architecture has ushered in unprecedented development opportunities. All CIM strategies make a compromise among bandwidth, delay, area overhead, energy consumption, and accuracy. The following is an analysis of several typical problems in the CIM architecture.

6.1.  Read-disturb issue

In CIM, it is often necessary to access multiple rows for simultaneous computation to increase the throughput of data processing. However, turning on multiple rows synchronously connects the storage nodes directly to the BL, which can cause data to be flipped during the reading process. This results in an error in the final calculation result; even worse, it destroys the stored data. As shown in Fig. 23, when the BL voltage drops significantly and reaches the write margin, the cell storing '1' will be mistakenly written to '0'. To address this issue, Kang et al.[35] reduced the discharge speed of the BL by reducing the turn-on voltage of the WL. When the WL voltage is reduced to a certain extent, the full-swing BL voltage can be achieved when multiple rows are opened simultaneously. Researchers have suggested using read-write decoupled cells to achieve CIM operations, such as 8T[3, 5, 18, 28, 37, 38, 65, 77, 78] and 10T[9−11, 56, 57], which can also obtain the full RBL voltage swing (from VDD
 
 

Fig. 22. (Color online) Application in classifier algorithms.

to ground). However, the bitcell adds transistors at the expense of the storage density of the array.

Fig. 23. (Color online) Read disturb issue.

6.2.  Linearity and consistency problems in the SRAM-based CIM

The demerit of the SRAM-based CIM is the problem of preserving linearity and consistency, which directly determines the final calculation accuracy. In Fig. 24(a), only one row is activated at a time under the basic SRAM read operation, and the voltage difference between BL and BLB is detected by the SA to achieve a full-swing output. In contrast, in Fig. 24(b), four rows of WLs are activated simultaneously to realize the calculation function, and the pulse widths of the WLs are 8T, 4T, 2T, and 1T. For example, if the input on the WL is '1000', the WL with an 8T width is activated. Ideally, the BL should drop by 8ΔV; however, in practice, it may only drop by 5ΔV, leading to erroneous calculations. Moreover, the asynchronization of the pulse width caused by latency and the imbalance of pulse width on each WL can cause a nonlinear BL discharge. In Fig. 24(c), the SRAM array is assumed to store identical data in each column, and the discharge is assumed to be 6ΔV in the column closest to the pulse-width generator. Given the pulse-width distortion, the farther the column from the pulse-width generator, the smaller the discharge. For instance, the last column may only be discharged by 2ΔV. Therefore, the linearity and consistency problems will affect the calculation results in a multiline read structure[35, 45, 66].

To address these problems, Lin et al.[79, 80] proposed a cascode current mirror (CCM) peripheral circuit and applied it to the bottom of each BL. The CCM clamps the BL voltage and proportionally duplicates the BL current in the additional capacitor, increasing the calculation linearity. In addition, they also proposed a double-WL structure to reduce the pulse-width delay on the WL and increase the consistency of calculations. They demonstrated the ability of the CCM circuit to reduce the integral nonlinearity by approximately 70% at a 0.8 V supply and improve the computational consistency by 56.84% at a 0.9 V supply.

6.3.  Challenge of the array size under the CIM architecture

To increase the data throughput, the array size of the storage is often expanded; however, a series of factors limit this expansion. For example, to achieve specific functions, researchers introduced capacitors in each storage unit[31, 58, 60, 61]. Although the introduction of capacitors increases the linearity of calculations, it also increases the power consumption and latency of the system.

Researchers including Jiang have proposed the C3SRAM (capacitive-coupling computing) structure to execute single-bit multiplication[31, 58]. Because the cell uses a capacitor, the calculation result is ideally linear. However, each C3SRAM cell has a capacitor coupled with an RBL. Thus, 256 cells in a column are equivalent to 256 internal capacitors in that column, leading to a capacitance surge in the RBL.

The storage and computing architecture based on the 8T1C structure proposed by Jia et al. also uses capacitors inside cells[60, 61]. The BL capacitance has a cumulative effect similar to that of the C3SRAM array; therefore, the overall array size is limited. Surplus capacitors in the array increase the precharge time and reduce the computational speed of the entire system.

Yin et al. designed an XNOR-SRAM cell for ternary multiplication[17], and the result is represented by the activation of the transistor. If the transistor is turned on, the result is equivalent to a resistor Ron. This design has 256 XNOR-SRAM cells in one column. During operation, 512 transistors are turned on, which is equivalent to 512 Ron connected in parallel on the BL, which causes the resistance on the BL to be insignificant,
 
BL, which causes the resistance on the BL to be insignificant,
 
 

Fig. 24. (Color online) (a) Single row activation during normal SRAM read operation, (b) multirow read and nonlinearity during CIM, and (c) incon-
sistent CIM calculation.

producing a large current and increasing the system's power consumption.

6.4.  Area overhead and energy efficiency challenges of peripheral circuits

The architecture based on CIM requires several peripheral auxiliary circuits to perform additional computing functions. For example, when implementing logic operations in digital functions, two SAs are required on each BL to distinguish different inputs. When performing multiplication and accumulation in analog operations, the calculation results are reflected on the BL as an analog voltage. Therefore, peripheral auxiliary circuits are also needed to quantify these voltage values. The use of multiple high-precision ADCs for the quantification greatly increases the proportion of peripheral circuits. In addition, multibit input modules, such as multi-pulse-width or pulse-height-generation circuits, occupy a large part of the area.

Yin et al.[17] designed an ADC shared by 64 columns, quantizing them 64 times through a data selector, and added a fault-tolerant algorithm to further reduce the requirement of quantization and the overhead of the peripheral circuits. Kim et al.[81] used low-bit ADCs to achieve ResNet-style BNN algorithms via aggressive partial-sum quantization and input-splitting combined with retraining. The area overhead is reduced due to the use of low-bit ADCs. Maintaining the proportion of peripheral circuits within an acceptable range comes with a loss in the accuracy of the final output digits. Interestingly, the complexity of the auxiliary circuits needed to achieve more complex calculations cannot be dismissed in CIM system-level design.

In the future, peripheral circuits may be stacked in 3D to form a pool of peripheral circuit resources shared by CIM arrays. In addition, the use of peripheral circuits can be greatly reduced by time-division multiplexing, which further reduces the area of the SRAM-based CIM architecture.

6.5.  Research prospect

6.5.1.    More efficient mapping from the common operator set to the actual circuit set

In CIM, it is necessary to design various circuits to realize the operations of the algorithm. The same operator can be realized by several circuits. In contrast, one circuit can also be used by multiple operators. Therefore, the problem of selecting the circuit set and effectively mapping the common operator set onto it must be studied. As shown in Fig. 25, this problem can be studied from three aspects.

1) Refining and merging common operator sets. First, an initial set of common operators ϕ (such as multiplication, SAD, addition and subtraction, and MAC) is extracted according to multiple factors, including the requirements of AI algorithms and computational complexity and efficiency. Then, part of the operators must be split based on the initial set ϕ, which will facilitate the further integration of some operators and circuit implementation. For example, the SAD can be split into difference, absolute-value, and addition operators.

2) Exploring appropriate circuit sets. The design process of the circuit set must consider, for example, the coverage of the circuit set, its redundancy, the accuracy of the calculation, the cost of the circuit area, and circuit latency and power consumption. The circuit set can be subdivided by computational complexity and data requirements. On this basis, a unified interface is designed for the circuit module, which is convenient for the upper architecture to organize and implement.

3) Designing a word-column/row-block hierarchical sharing architecture. A flexible and configurable hierarchical architecture is key to realizing the mapping from a common operator set to a circuit set. The memory array can be divided into a word-column/row-block three-tier system. The lowest layer is the word-level memory computing layer, which has the tightest memory-computing coupling and requires the lowest area cost and power among all the layers. It can directly read and write the data of a single word, transmit it rapidly, and perform lightweight operations. The implementation of simple operators in word-level memory computing can improve energy efficiency. The middle layer is a column/row-level memory-computing layer and has column- or row-shared operation units, such as a linearity compensation module and a consistency compensation module, that improve the performance of multibyte operations. This layer, in which more complex operators can be implemented, balances the energy effi-
 
 

Fig. 25. (Color online) Approach of mapping from the common operator set to the actual circuits.

ciency, area cost, and power consumption. The uppermost layer is the block-level memory computing layer. Although memory and computing in this layer are loosely coupled, the layer has the highest tolerance for area, power consumption, and delay cost, and it suffers latency and demands a large amount of power in performing operations, data caching, and quantization. The implementation of complex operators in the block-level memory-computing layer can enrich the functions of the CIM system and provide a smooth transition between the CIM system and the traditional von Neumann architecture.

6.5.2.    Optimize the CIM process

The primary steps of CIM include reading and computing, along with a series of processes of writing, quantization, and writeback. The optimization of the entire process, which is key to realizing energy-efficient, high-throughput, and low-area-overhead CIM, can be performed from the following three aspects, as shown in Fig. 26.

1) A horizontal computing channel can be introduced to implement a bidirectional memory computing system. The intent of SRAM-based CIM is to introduce computing units into the storage array, reduce data movement, and break through the storage wall. However, in-memory calculations mainly rely on vertical parallel reading techniques; thus, the bidirectional CIM architecture can be implemented with a small area cost.

2) A low-powered fast-migration channel is introduced, and efficient data-storage patterns are designed to reduce the power consumed by migration. It was found that the data used by two adjacent calculations overlap significantly in the CNN. To improve data utilization and reduce the volume of reloaded data, it is necessary to study the discontinuous data-storage mode to meet the requirements of the algorithm and improve the coupling between computing and storage. Additionally, a fast data-migration channel can perform multiple calculations continuously without reloading data.

3) Memory circuits can be designed based on the reuse/reconstruction/transformation strategy. First, the existing modules of the SRAM memory are fully harnessed. The reusable modules include sense amplifiers, BLs, WLs, redundant columns, and decoding circuits. Second, the existing module structure is subtly modified to induce new functions into existing SRAM modules at a markedly low area cost. With the SA as an example, an appropriate configuration transistor can be included to not only induce the amplification and comparison functions into the SA but also generate the SIGMOID func-
tical accumulative paths and can only be performed after stor- tion and reconstruct it as part of the ADC (Fig. 25). With the re-
age rearrangement, which complicates the calculation pro- dundant column of a duplicate BL as an example, a small num-
cess, changing from write → read → calculation → write- ber of switches can be included to transform the redundant
back in the von Neumann architecture to write → read → stor- column into a pulse-width-generation module that can track
age rearrangement → write → read–calculation integration the process voltage and temperature variations. The same cir-
→ quantification → writeback. The entire process consumed cuit can be used as part of the operators with different func-
significantly more energy than the original consumption with tions through a time-division multiplexing strategy, which
von Neumann architecture. Therefore, as shown in Fig. 25, hori- greatly reduces the area cost incurred by CIM. Third, the calcu-
zontal computational channels can be used to enable CIM lation mode is changed from analog domain calculation to di-
without storage rearrangement. Simultaneously, the vertical cu- gital–analog hybrid calculation, which not only preserves the
mulative path is preserved to make it compatible with the ver-
 
advantages of analog calculation but also remarkably re-
 
 
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
22 Journal of Semiconductors   doi: 10.1088/ 1674-4926/43/3/031401

Replica Bitline Fast data migration channel


Redundant
column
BL VDD

Disc ontinuous data storage m ode

Horizontal calculation channel


RC SA

Sense Amplifier
DC

Reconfiguring
DC

RBL SA
Replica 1
WL 1T

VDD

Replica 2 SA
WL 2T

Replica 4
WL 4T

Sense Amplifier Reusable


Replica 8
WL 8T Compatible multi-row reading
 
Fig. 26. (Color online) Architecture of the bidirectional CIM system, including a reusable and reconfigurable module.
 

Pipeline processor
Memory
Memory
CIM
CIM Memory
Memory
f CIM
f CIM

f
f

Bus
 
Fig. 27. (Color online) Multithreaded CIM macro based on a pipeline processor.

duces the difficulty of circuit implementation.
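The dataflow change described in aspect 1) can be made concrete with a short sketch. This is our own illustration, not code from the paper: the step names are taken verbatim from the text above, and the step count is only a rough proxy for the extra data movement that storage rearrangement causes.

```python
# Sketch of the CIM dataflows discussed in aspect 1) (illustrative only).
VON_NEUMANN_FLOW = ["write", "read", "calculation", "writeback"]

# Vertical-only CIM must rearrange storage before it can compute.
VERTICAL_CIM_FLOW = [
    "write", "read", "storage rearrangement", "write",
    "read-calculation integration", "quantification", "writeback",
]

# A horizontal computing channel removes the rearrangement steps while
# the vertical accumulative path is kept for parallel reading.
BIDIRECTIONAL_CIM_FLOW = [
    "write", "read-calculation integration", "quantification", "writeback",
]

# Every step touches the array, so fewer steps means less data movement.
print(len(VERTICAL_CIM_FLOW))       # 7 steps
print(len(BIDIRECTIONAL_CIM_FLOW))  # 4 steps, matching the von Neumann flow
```

The bidirectional flow recovers the four-step structure of the von Neumann pipeline while keeping the computation inside the array.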

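The overlap between adjacent calculations mentioned in aspect 2) is easy to quantify for a convolution window sliding across a feature map. The function below is our own sketch; the name and the kernel/stride values are assumptions for illustration, not taken from the paper.

```python
def window_overlap_fraction(kernel: int, stride: int) -> float:
    """Fraction of a kernel x kernel input window that is reused by the
    next window when it slides horizontally by `stride` pixels."""
    if stride >= kernel:
        return 0.0  # windows are disjoint; every input must be reloaded
    return (kernel - stride) / kernel

# A common 3x3 kernel with stride 1 reuses two of its three input
# columns per step, so only one new column must be fetched.
print(window_overlap_fraction(3, 1))  # 2/3 of the window is reused
print(window_overlap_fraction(3, 3))  # 0.0: no reuse without overlap
```

With such overlap, a discontinuous data-storage mode plus a fast migration channel lets consecutive calculations proceed without reloading the shared columns.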

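A first-order model shows why the replica-bitline reuse in aspect 3) yields a configurable, PVT-tracking pulse width: enabling 1, 2, 4, or 8 replica cells (the Replica WL 1T–8T branches labeled in Fig. 26) scales the discharge current. All component values below are hypothetical placeholders, not figures from the paper.

```python
def pulse_width_s(n_cells: int, c_bl_f: float = 50e-15,
                  dv_v: float = 0.2, i_cell_a: float = 10e-6) -> float:
    """First-order bitline discharge time t = C * dV / (N * I_cell).
    Capacitance, voltage swing, and cell current are placeholder values."""
    return c_bl_f * dv_v / (n_cells * i_cell_a)

# Binary-weighted replica wordlines give a binary-scaled pulse width.
widths = {n: pulse_width_s(n) for n in (1, 2, 4, 8)}
print(round(widths[1] / widths[8]))  # 8: the 1T pulse is 8x the 8T pulse

# Because the replica cells sit in the same array as the storage cells,
# their current tracks process, voltage, and temperature, so the
# generated pulse width tracks PVT variations automatically.
```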
6.5.3.  Realize the programmability of the SRAM-based CIM architecture

Although traditional computing architectures (such as CPUs and GPUs) are limited in terms of energy efficiency and memory bandwidth, their appeal lies in their general-purpose functions and programmability: they can perform various arithmetic operations and execute different algorithms. Existing in-memory technology can achieve several computing functions. However, these CIM macros tend to have poor compatibility with software and limited bit-width accuracy; therefore, they cannot execute complex programmable functions and are limited to specific applications.

Interestingly, research on the development of programmable CIM is underway. For example, Wang et al. proposed a general hybrid in-memory/near-memory computing structure[12], which supports the development of neural networks and software algorithms and offers flexibility and programmability. Jia et al. proposed a programmable heterogeneous microprocessor[60] and a programmable neural-network inference accelerator[59], both of which realize a scalable bit width and programmable functions.

The functions that can be realized by existing CIM are relatively simple, and multiple complex operations cannot be performed simultaneously. However, the novelty of CIM is that it solves the problem of the storage wall. Therefore, studies have placed more stringent requirements on the universality and programmability of the CIM architecture. As shown in Fig. 27, to leverage all its advantages, the CIM architecture must be able to implement a multithreaded CIM macro combined with a pipeline processor.

7.  Conclusion

CIM technology addresses the limitation of the traditional architecture (i.e., separate storage and computation) and effectively implements AI algorithms. To allow CIM to perform complex operations, the basic cells in the circuit may be modified, and peripheral circuits must be added. This paper detailed the different basic cell structures and peripheral auxiliary circuits of CIM. It also investigated the various computing functions that can be realized by the existing CIM framework and their applications. Finally, the challenges encountered by current SRAM-based CIM macros and the future scope of CIM were analyzed. To improve the computational accuracy and capability of the CIM architecture, we recommend efficiently mapping the common operator set to a circuit set and optimizing the CIM process under spatiotemporal constraints. Enhancing the programmability of the CIM architecture will improve its compatibility with general-purpose CPUs and enable its wide usage across industries.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018YFB2202602), the State Key Program of the National Natural Science Foundation of China (No. 61934005), the National Natural Science Foundation of China (No. 62074001), and the Joint Funds of the National Natural Science Foundation of China under Grant U19A2074.

References

[1] Si X, Khwa W S, Chen J J, et al. A dual-split 6T SRAM-based computing-in-memory unit-macro with fully parallel product-sum operation for binarized DNN edge processors. IEEE Trans Circuits Syst I, 2019, 66, 4172
[2] Khwa W S, Chen J J, Li J F, et al. A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors. 2018 IEEE International Solid-State Circuits Conference, 2018, 496
[3] Jaiswal A, Chakraborty I, Agrawal A, et al. 8T SRAM cell as a multibit dot-product engine for beyond von Neumann computing. IEEE Trans Very Large Scale Integr VLSI Syst, 2019, 27, 2556
[4] Lu L, Yoo T, Le V L, et al. A 0.506-pJ 16-kb 8T SRAM with vertical read wordlines and selective dual split power lines. IEEE Trans Very Large Scale Integr VLSI Syst, 2020, 28, 1345
[5] Lin Z T, Zhan H L, Li X, et al. In-memory computing with double word lines and three read ports for four operands. IEEE Trans Very Large Scale Integr VLSI Syst, 2020, 28, 1316
[6] Srinivasa S, Chen W H, Tu Y N, et al. Monolithic-3D integration augmented design techniques for computing in SRAMs. 2019 IEEE International Symposium on Circuits and Systems, 2019, 1
[7] Zeng J M, Zhang Z, Chen R H, et al. DM-IMCA: A dual-mode in-memory computing architecture for general purpose processing. IEICE Electron Express, 2020, 17, 20200005
[8] Ali M, Agrawal A, Roy K. RAMANN: in-SRAM differentiable memory computations for memory-augmented neural networks. Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, 2020, 61
[9] Agrawal A, Jaiswal A, Roy D, et al. Xcel-RAM: Accelerating binary neural networks in high-throughput SRAM compute arrays. IEEE Trans Circuits Syst I, 2019, 66, 3064
[10] Biswas A, Chandrakasan A P. CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. IEEE J Solid State Circuits, 2019, 54, 217
[11] Lin Z T, Zhu Z Y, Zhan H L, et al. Two-direction in-memory computing based on 10T SRAM with horizontal and vertical decoupled read ports. IEEE J Solid State Circuits, 2021, 56, 2832
[12] Wang J C, Wang X W, Eckert C, et al. A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing. IEEE J Solid State Circuits, 2020, 55, 76
[13] Wang J C, Wang X W, Eckert C, et al. A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration. 2019 IEEE International Solid-State Circuits Conference, 2019, 224
[14] Jiang H W, Peng X C, Huang S S, et al. CIMAT: a transpose SRAM-based compute-in-memory architecture for deep neural network on-chip training. Proceedings of the International Symposium on Memory Systems, 2019, 490
[15] Zhang J T, Wang Z, Verma N. In-memory computation of a machine-learning classifier in a standard 6T SRAM array. IEEE J Solid State Circuits, 2017, 52, 915
[16] Zhang J T, Wang Z, Verma N. A machine-learning classifier implemented in a standard 6T SRAM array. 2016 IEEE Symposium on VLSI Circuits, 2016, 1
[17] Jiang Z W, Yin S H, Seok M, et al. XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. 2018 IEEE Symposium on VLSI Technology, 2018, 173
[18] Agrawal A, Jaiswal A, Lee C, et al. X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories. IEEE Trans Circuits Syst I, 2018, 65, 4219
[19] Jeloka S, Akesh N B, Sylvester D, et al. A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory. IEEE J Solid State Circuits, 2016, 51, 1009
[20] Dong Q, Jeloka S, Saligane M, et al. A 4+2T SRAM for searching and in-memory computing with 0.3-V VDDmin. IEEE J Solid State Circuits, 2018, 53, 1006
[21] Rajput A K, Pattanaik M. Implementation of Boolean and arithmetic functions with 8T SRAM cell for in-memory computation. 2020 International Conference for Emerging Technology, 2020, 1
[22] Jaiswal A, Agrawal A, Ali M F, et al. I-SRAM: Interleaved wordlines for vector Boolean operations using SRAMs. IEEE Trans Circuits Syst I, 2020, 67, 4651
[23] Surana N, Lavania M, Barma A, et al. Robust and high-performance 12-T interlocked SRAM for in-memory computing. 2020 Design, Automation & Test in Europe Conference & Exhibition, 2020, 1323
[24] Simon W A, Qureshi Y M, Rios M, et al. BLADE: an in-cache computing architecture for edge devices. IEEE Trans Comput, 2020, 69, 1349
[25] Chen J, Zhao W F, Ha Y J. Area-efficient distributed arithmetic optimization via heuristic decomposition and in-memory computing. 2019 IEEE 13th International Conference on ASIC, 2019, 1
[26] Lee K, Jeong J, Cheon S, et al. Bit parallel 6T SRAM in-memory computing with reconfigurable bit-precision. 2020 57th ACM/IEEE Design Automation Conference, 2020, 1
[27] Simon W, Galicia J, Levisse A, et al. A fast, reliable and wide-voltage-range in-memory computing architecture. Proceedings of the 56th Annual Design Automation Conference, 2019, 1
[28] Chen H C, Li J F, Hsu C L, et al. Configurable 8T SRAM for enabling in-memory computing. 2019 2nd International Conference on Communication Engineering and Technology, 2019, 139
[29] Gupta N, Makosiej A, Vladimirescu A, et al. 1.56GHz/0.9V energy-efficient reconfigurable CAM/SRAM using 6T-CMOS bitcell. ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference, 2017, 316
[30] Sun X Y, Liu R, Peng X C, et al. Computing-in-memory with SRAM and RRAM for binary neural networks. 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology, 2018, 1
[31] Jiang Z W, Yin S H, Seo J S, et al. C3SRAM: in-memory-computing SRAM macro based on capacitive-coupling computing. IEEE Solid State Circuits Lett, 2019, 2, 131
[32] Si X, Chen J J, Tu Y N, et al. A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors. IEEE J Solid State Circuits, 2020, 55, 189
[33] Chiu Y C, Zhang Z X, Chen J J, et al. A 4-kb 1-to-8-bit configurable 6T SRAM-based computation-in-memory unit-macro for CNN-based AI edge processors. IEEE J Solid State Circuits, 2020, 55, 2790
[34] Chen Z Y, Yu Z H, Jin Q, et al. CAP-RAM: A charge-domain in-memory computing 6T-SRAM for accurate and precision-programmable CNN inference. IEEE J Solid State Circuits, 2021, 56, 1924
[35] Kang M G, Gonugondla S K, Patil A, et al. A multi-functional in-memory inference processor using a standard 6T SRAM array. IEEE J Solid State Circuits, 2018, 53, 642
[36] Kang M, Gonugondla S K, Keel M, et al. An energy-efficient memory-based high-throughput VLSI architecture for convolutional networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, 1037
[37] Dong Q, Sinangil M E, Erbagci B, et al. A 351TOPS/W and 372.4GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine-learning applications. 2020 IEEE International Solid-State Circuits Conference, 2020, 242
[38] Sinangil M E, Erbagci B, Naous R, et al. A 7-nm compute-in-memory SRAM macro supporting multi-bit input, weight and output and achieving 351 TOPS/W and 372.4 GOPS. IEEE J Solid State Circuits, 2021, 56, 188
[39] Kang M G, Gonugondla S, Patil A, et al. A 481pJ/decision 3.4M decision/s multifunctional deep in-memory inference processor using standard 6T SRAM array. arXiv: 1610.07501, 2016
[40] Kang M G, Gonugondla S K, Shanbhag N R. A 19.4 nJ/decision 364K decisions/s in-memory random forest classifier in 6T SRAM array. ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference, 2017, 263
[41] Chang J, Chen Y H, Chan G, et al. A 5nm 135Mb SRAM in EUV and high-mobility-channel FinFET technology with metal coupling and charge-sharing write-assist circuitry schemes for high-density and low-VMIN applications. 2020 IEEE International Solid-State Circuits Conference, 2020, 238
[42] Si X, Tu Y N, Huang W H, et al. A 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips. 2020 IEEE International Solid-State Circuits Conference, 2020, 246
[43] Su J W, Si X, Chou Y C, et al. A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips. 2020 IEEE International Solid-State Circuits Conference, 2020, 240
[44] Ali M, Jaiswal A, Kodge S, et al. IMAC: in-memory multi-bit multiplication and ACcumulation in 6T SRAM array. IEEE Trans Circuits Syst I, 2020, 67, 2521
[45] Gonugondla S K, Kang M G, Shanbhag N. A 42pJ/decision 3.12TOPS/W robust in-memory machine learning classifier with on-chip training. 2018 IEEE International Solid-State Circuits Conference, 2018, 490
[46] Huang S S, Jiang H W, Peng X C, et al. XOR-CIM: compute-in-memory SRAM architecture with embedded XOR encryption. Proceedings of the 39th International Conference on Computer-Aided Design, 2020, 1
[47] Kim H, Chen Q, Kim B. A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC. 2019 IEEE Asian Solid-State Circuits Conference, 2019, 35
[48] Jain S, Lin L Y, Alioto M. Broad-purpose in-memory computing for signal monitoring and machine learning workloads. IEEE Solid State Circuits Lett, 2020, 3, 394
[49] Bose S K, Mohan V, Basu A. A 75kb SRAM in 65nm CMOS for in-memory computing based neuromorphic image denoising. 2020 IEEE International Symposium on Circuits and Systems, 2020, 1
[50] Kang M G, Keel M S, Shanbhag N R, et al. An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, 8326
[51] Gong M X, Cao N Y, Chang M Y, et al. A 65nm thermometer-encoded time/charge-based compute-in-memory neural network accelerator at 0.735pJ/MAC and 0.41pJ/update. IEEE Trans Circuits Syst II, 2021, 68, 1408
[52] Lee E, Han T, Seo D, et al. A charge-domain scalable-weight in-memory computing macro with dual-SRAM architecture for precision-scalable DNN accelerators. IEEE Trans Circuits Syst I, 2021, 68, 3305
[53] Kim J, Koo J, Kim T, et al. Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array. 2019 Symposium on VLSI Circuits, 2019, C118
[54] Noel J P, Pezzin M, Gauchi R, et al. A 35.6 TOPS/W/mm2 3-stage pipelined computational SRAM with adjustable form factor for highly data-centric applications. IEEE Solid State Circuits Lett, 2020, 3, 286
[55] Jiang H W, Peng X C, Huang S S, et al. CIMAT: A compute-in-memory architecture for on-chip training based on transpose SRAM arrays. IEEE Trans Comput, 2020, 69, 944
[56] Biswas A, Chandrakasan A P. Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. 2018 IEEE International Solid-State Circuits Conference, 2018, 488
[57] Nguyen V T, Kim J S, Lee J W. 10T SRAM computing-in-memory macros for binary and multibit MAC operation of DNN edge processors. IEEE Access, 2021, 9, 71262
[58] Jiang Z W, Yin S H, Seo J S, et al. C3SRAM: an in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism. IEEE J Solid State Circuits, 2020, 55, 1888
[59] Jia H Y, Ozatay M, Tang Y Q, et al. A programmable neural-network inference accelerator based on scalable in-memory computing. 2021 IEEE International Solid-State Circuits Conference, 2021, 236
[60] Jia H Y, Valavi H, Tang Y Q, et al. A programmable heterogeneous microprocessor based on bit-scalable in-memory computing. IEEE J Solid State Circuits, 2020, 55, 2609
[61] Valavi H, Ramadge P J, Nestler E, et al. A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement. 2018 IEEE Symposium on VLSI Circuits, 2018, 141
[62] Su J W, Chou Y C, Liu R H, et al. A 28nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips. 2021 IEEE International Solid-State Circuits Conference, 2021, 250
[63] Khaddam-Aljameh R, Francese P A, Benini L, et al. An SRAM-based multibit in-memory matrix-vector multiplier with a precision that scales linearly in area, time, and power. IEEE Trans Very Large Scale Integr VLSI Syst, 2020, 29, 372
[64] Zhang J, Lin Z T, Wu X L, et al. An 8T SRAM array with configurable word lines for in-memory computing operation. Electronics, 2021, 10, 300
[65] Nasrin S, Ramakrishna S, Tulabandhula T, et al. Supported-BinaryNet: Bitcell array-based weight supports for dynamic accuracy-energy trade-offs in SRAM-based binarized neural network. 2020 IEEE International Symposium on Circuits and Systems, 2020, 1
[66] Gonugondla S K, Kang M G, Shanbhag N R. A variation-tolerant in-memory machine learning classifier via on-chip training. IEEE J Solid State Circuits, 2018, 53, 3163
[67] Wang B, Nguyen T Q, Do A T, et al. Design of an ultra-low voltage 9T SRAM with equalized bitline leakage and CAM-assisted energy efficiency improvement. IEEE Trans Circuits Syst I, 2015, 62, 441

[68] Xue C X, Zhao W C, Yang T H, et al. A 28-nm 320-kb TCAM macro using split-controlled single-load 14T cell and triple-margin voltage sense amplifier. IEEE J Solid State Circuits, 2019, 54, 2743
[69] Jiang H W, Liu R, Yu S M. 8T XNOR-SRAM based parallel compute-in-memory for deep neural network accelerator. 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems, 2020, 257
[70] Kang M G, Shanbhag N R. In-memory computing architectures for sparse distributed memory. IEEE Trans Biomed Circuits Syst, 2016, 10, 855
[71] Jain S, Lin L Y, Alioto M. ±CIM SRAM for signed in-memory broad-purpose computing from DSP to neural processing. IEEE J Solid State Circuits, 2021, 56, 2981
[72] Yue J S, Feng X Y, He Y F, et al. A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating. 2021 IEEE International Solid-State Circuits Conference, 2021, 238
[73] Yang X X, Zhu K R, Tang X Y, et al. An in-memory-computing charge-domain ternary CNN classifier. 2021 IEEE Custom Integrated Circuits Conference, 2021, 1
[74] LeCun Y. Deep learning hardware: Past, present, and future. 2019 IEEE International Solid-State Circuits Conference, 2019, 12
[75] Chih Y D, Lee P H, Fujiwara H, et al. An 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications. 2021 IEEE International Solid-State Circuits Conference, 2021, 252
[76] Sie S H, Lee J L, Chen Y R, et al. MARS: multi-macro architecture SRAM CIM-based accelerator with co-designed compressed neural networks. IEEE Trans Comput Aided Des Integr Circuits Syst, 2021, in press
[77] Agrawal A, Kosta A, Kodge S, et al. CASH-RAM: Enabling in-memory computations for edge inference using charge accumulation and sharing in standard 8T-SRAM arrays. IEEE J Emerg Sel Top Circuits Syst, 2020, 10, 295
[78] Yue J S, Yuan Z, Feng X Y, et al. A 65nm computing-in-memory-based CNN processor with 2.9-to-35.8TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse. 2020 IEEE International Solid-State Circuits Conference, 2020, 234
[79] Lin Z T, Zhan H L, Chen Z W, et al. Cascade current mirror to improve linearity and consistency in SRAM in-memory computing. IEEE J Solid State Circuits, 2021, 56, 2550
[80] Lin Z T, Fang Y Q, Peng C Y, et al. Current mirror-based compensation circuit for multi-row read in-memory computing. Electron Lett, 2019, 55, 1176
[81] Kim Y, Kim H, Park J, et al. Mapping binary resnets on computing-in-memory hardware with low-bit ADCs. 2021 Design, Automation & Test in Europe Conference & Exhibition, 2021, 856

Zhiting Lin (SM'16) received the B.S. and Ph.D. degrees in electronics and information engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively. From 2015 to 2016, he was a visiting scholar with the Engineering and Computer Science Department, Baylor University, Waco, TX, USA. In 2011, he joined the Department of Electronics and Information Engineering, Anhui University, Hefei, Anhui. He is currently a professor at the Department of Integrated Circuit, Anhui University. He has published about 50 articles and holds over 20 Chinese patents. His research interests include pipeline analog-to-digital converters and high-performance static random-access memory.

Xiulong Wu received the B.S. degree in computer science from the University of Science and Technology of China (USTC), Hefei, China, in 2001, and the M.S. and Ph.D. degrees in electronic engineering from Anhui University, Hefei, in 2005 and 2008, respectively. From 2013 to 2014, he was a visiting scholar with the Engineering Department, The University of Texas at Dallas, Richardson, TX, USA. He is a professor at Anhui University. He has published about 60 articles and holds over 10 Chinese patents. His research interests include high-performance static random-access memory and mixed-signal ICs.