A review on SRAM-based computing in-memory: Circuits, functions, and applications
Journal of Semiconductors (2022) 43, 031401
REVIEWS
doi: 10.1088/1674-4926/43/3/031401
Abstract: Artificial intelligence (AI) processes data-centric applications with minimal effort. However, it poses new challenges to system design in terms of computational speed and energy efficiency. The traditional von Neumann architecture cannot meet the requirements of heavily data-centric applications due to the separation of computation and storage. The emergence of computing in-memory (CIM) is significant in circumventing the von Neumann bottleneck. A commercialized memory architecture, static random-access memory (SRAM), is fast and robust, consumes less power, and is compatible with state-of-the-art technology. This study investigates the research progress of SRAM-based CIM technology at three levels: circuit, function, and application. It also outlines the problems, challenges, and prospects of SRAM-based CIM macros.

Key words: static random-access memory (SRAM); artificial intelligence (AI); von Neumann bottleneck; computing in-memory (CIM); convolutional neural network (CNN)
Citation: Z T Lin, Z Z Tong, J Zhang, F M Wang, T Xu, Y Zhao, X L Wu, C Y Peng, W J Lu, Q Zhao, and J N Chen, A review on SRAM-based computing in-memory: Circuits, functions, and applications[J]. J. Semicond., 2022, 43(3), 031401. https://doi.org/10.1088/1674-4926/43/3/031401
Correspondence to: X L Wu, [email protected]
Received 28 August 2021; revised 4 November 2021.
©2022 Chinese Institute of Electronics

1. Introduction

Recently, with breakthroughs in key technologies such as big data and artificial intelligence (AI), emerging intelligent applications represented by edge computing and intelligent life have arisen in this era of rapid development[1, 2]. These emerging intelligent applications often need to access the memory frequently when dealing with events. However, the von Neumann architecture, the most commonly used architecture for data processing, separates memory banks from computing elements. Massive volumes of data are exchanged between the memory and processing units, which consumes large amounts of energy. Furthermore, the memory bandwidth limits the computing throughput. The resulting memory limitation increases energy wastage and latency and decreases efficiency. This limitation is more severe in resource-constrained devices. Therefore, it is important to explore solutions to overcome the issue of the "memory wall."

As a computing paradigm that may address the von Neumann bottleneck, researchers have proposed computing in-memory (CIM) technology. CIM is a new architecture and technology for computing directly in memory. It breaks through the limitation of the traditional architecture, optimizes the structures of the storage and logic units, integrates them, and avoids the cumbersome process of transmitting data to processor registers for calculation and then back to memory, thus significantly reducing the delay and energy consumption of the chip[3]. Research on static random-access memory (SRAM) has become an active topic in CIM for the robustness and access speed of its cells.

To improve the performance of SRAM-based CIM, the SRAM bitcell structure has been modified and auxiliary peripheral circuits have been developed. For example, read–write isolation 8T[4, 5], 9T[6−8], and 10T[9−11] cells were proposed to prevent the storage damage caused by multirow reading and calculation. Transposable cells were proposed to overcome the limitations of storage arrangement[12−14]. Peripheral circuits, such as word-line[15, 16] and bit-line[10] digital-to-analog conversion (DAC), redundant reference columns[1, 5], and multiplexed analog-to-digital conversion (ADC)[17], were proposed to convert between analog and digital signals. With these modified bitcells and additional peripheral circuits for SRAM CIM, researchers have achieved various computational operations, including in-memory Boolean logic[5, 7, 18−27], content-addressable memory (CAM)[5, 19, 20, 28, 29], Hamming distance[8], multiplication and accumulation (MAC)[1, 2, 9, 10, 15, 17, 30−34], and the sum of absolute difference (SAD)[35, 36]. These in-memory operations can expedite AI algorithms. For example, Sinangil et al. implemented a multibit MAC based on a 7-nm FinFET technology to accelerate the CNN algorithm, which achieved a recognition accuracy of 98.3%[37, 38]. Agrawal et al. proposed a novel "read–compute–store" scheme, wherein the XOR calculation results were used to accelerate the AES algorithm without having to latch data and perform subsequent write operations[18].

As illustrated in Fig. 1, we reviewed three levels of SRAM-based CIM: circuit, function, and application. The circuit level is reviewed from two aspects: 1) bitcell structures, which include read–write separation structures, transposable structures, and compact coupling structures, and 2) peripheral auxiliary circuits, which include analog-to-digital conversion circuits, digital-to-analog conversion circuits, redundant reference columns, digital auxiliary circuits, and analog auxiliary circuits (Fig. 1(a)). The second level is reviewed from two aspects: 1) pure digital CIM, which includes Boolean logic and CAM, and 2) mixed-signal CIM functions, which include MAC, Hamming distance, and SAD (Fig. 1(b)). The third level is mainly reviewed from the aspects of accelerating the CNN, classifier, k-NN, and AES algorithms (Fig. 1(c)). Finally, the challenges and future development prospects of SRAM-based CIM are discussed from these three levels.

Fig. 1. (Color online) Overall framework of static random-access memory (SRAM)-based computing in-memory (CIM) for the review: (a) various functions implemented in CIM, (b) operation functions realizable with CIM, and (c) application scenarios of CIM.

[…] Refs. [24−26, 29, 33, 35, 39−53] and Ref. [34] used standard 6T cells considering the area overhead. Fig. 2(a) illustrates a schematic of the standard 6T SRAM cell. The 6T storage cell is composed of two P-channel metal–oxide–semiconductor (PMOS) transistors and four N-channel metal–oxide–semiconductor (NMOS) transistors, in which P1, N1, P2, and N2 constitute two cross-coupled inverters to store data stably. To perform CIM with the conventional 6T SRAM cell, the operands are commonly represented by the word line (WL) voltage and the storage node data. The processing results are often reflected by the voltage difference between BL and BLB.
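The 6T-based CIM operation just described, with operands on the WL and the storage node and the result read as a bitline voltage difference, can be sketched behaviorally. This is an illustrative model and not a circuit from the paper; the function name, the linear-discharge assumption, and the value of `dv_per_unit` are ours:

```python
# Hypothetical behavioral sketch of a 6T-based CIM column (assumptions:
# linear discharge, per-unit swing dv_per_unit). Each activated row whose
# cell stores '1' discharges the bitline in proportion to its WL input.

def column_mac(stored_bits, wl_pulses, v_pre=1.0, dv_per_unit=0.02):
    """Return the BL voltage after a multi-row CIM access.

    stored_bits : list of 0/1 weights held in the 6T cells of one column
    wl_pulses   : per-row input magnitudes encoded as WL pulse widths
    """
    discharge = sum(b * x * dv_per_unit for b, x in zip(stored_bits, wl_pulses))
    return v_pre - discharge

# Analog dot product: weights 1,0,1,1 with inputs 3,2,1,4 -> 3+0+1+4 = 8,
# so the BL drops by 8 * 0.02 = 0.16 V from its precharge value.
v_bl = column_mac([1, 0, 1, 1], [3, 2, 1, 4])
print(round(1.0 - v_bl, 2))  # -> 0.16
```

In a real macro, this analog swing would then be digitized by the peripheral ADC circuits discussed later.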
Fig. 2. (Color online) (a) Standard 6T SRAM cell, (b) dual-Split 6T SRAM cell, and (c) 4+2T SRAM cell.
Fig. 3. (Color online) SRAM cells with separated read and write: (a) standard 8T SRAM bit-cell, (b) 7T SRAM cell, (c) 9T SRAM cell, and (d) 10T SRAM cell.
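A toy model can illustrate why the decoupled read ports of these cells matter for CIM. The functions and the disturb threshold below are assumptions made purely for illustration; real disturb behavior depends on transistor sizing and voltages:

```python
# Illustrative sketch (not the paper's model): in a 6T cell the read path
# passes through the storage node, so activating many rows at once can
# corrupt weakly held data; an 8T-style cell reads through a separate
# buffer and never exposes the storage node.

def read_6t(cells, active_rows, disturb_threshold=2):
    """Multi-row 6T read: too many simultaneously active rows corrupt data."""
    if len(active_rows) > disturb_threshold:
        for r in active_rows:      # storage damage from charge sharing
            cells[r] = None        # data is no longer reliable
    return [cells[r] for r in active_rows]

def read_8t(cells, active_rows):
    """8T read via a decoupled port: storage nodes stay untouched."""
    return [cells[r] for r in active_rows]

cells = [1, 0, 1, 1]
print(read_8t(cells, [0, 1, 2, 3]))  # -> [1, 0, 1, 1], data intact
```

The price of this robustness, as the text notes, is the reduced storage density from the extra transistors.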
[…] the N-well as the write word line (WWL), and the source ends of the two pull-up PMOS transistors were used as the write bit line (WBL) and write bit line bar (WBLB). The design was implemented in a deeply depleted channel (DDC) process, where the bulk effect of the circuit is more significant. To write data, the threshold voltage of the PMOS can be significantly altered by changing the N-well and PMOS source voltages. During CIM operation, different voltages on the WL and storage node represent the different operands.

2.2. Cell structure with additional devices for SRAM-based CIM

The simple 6T structure cannot realize complex computing operations and does not fully meet the requirements of CIM. Therefore, studies on CIM have modified the traditional 6T structure; the modifications can be roughly divided into the following four categories.

[…] macro for multibit-by-multibit multiplication[37, 38], and a computational SRAM (C-SRAM) for vector processing[54]. To reduce the area cost, Jiang Hongwu et al.[55] proposed a 7T cell to complete the dot-product operation, as shown in Fig. 3(b). To achieve better performance and execute more complex operations, Srinivasa et al. utilized 9T SRAM cells to perform CAM operations[6]. Fig. 3(c) presents a schematic of the 9T SRAM cell. In Ref. [8], Ali et al. completed the Hamming distance based on a 9T SRAM cell. In Refs. [9, 10, 56, 57] and Ref. [17], 10T SRAM cells were used to perform CIM of the dot product and XNOR, as shown in Fig. 3(d). These studies demonstrated the advantages of the read–write separation structure in CIM, but the additional transistors also reduced the storage density.

2.2.2. SRAM cells based on capacitive coupling

SRAM cells based on capacitive coupling add additional capacitors […]
Fig. 4. (Color online) SRAM cells based on capacitive coupling: (a) C3SRAM bitcell and (b) M-BC bitcell.
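The capacitive-coupling idea behind bitcells such as the C3SRAM cell of Fig. 4 can be sketched behaviorally: each cell computes XNOR(input, weight) and couples a charge packet onto a shared line, so the line voltage encodes the match count, i.e., a binary dot product. This is an illustrative model with an assumed per-match step `dv`, not the published circuit:

```python
# Behavioral sketch of capacitive-coupling CIM (assumption: each XNOR
# match couples an equal charge packet of dv onto the accumulation line).

def coupled_mac(weights, inputs, dv=0.01):
    """Return (line voltage shift, match count) for one binary dot product."""
    matches = sum(1 for w, x in zip(weights, inputs) if w == x)
    return matches * dv, matches

v, n = coupled_mac([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(n)  # -> 3 matches, so the line moves by 3 * dv
```

An ADC on the shared line then converts this analog match count back into a digital partial sum.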
In Refs. [59, 60], the capacitors inside the unit need to be coupled with each other through additional switches. Therefore, in the latter works, additional transistors need to be introduced, which increases the area of the cell. In the selection of capacitance type, Refs. [31, 58] selected MOSCAP, which constitutes 27% of the bitcell area. However, Refs. [59, 60] selected MOMCAP, which is formed using metal-fringing structures. MOMCAP can be placed on top of the cell, so there is no additional area overhead. However, compared with MOSCAP, MOMCAP has a lower capacitance density and is not suitable for large capacitances.

[…] CAM can be achieved in two directions. Because of the row–column symmetry, these cells break the limitations of conventional SRAM storage arrangements. Thus, the algorithms of forward and backward propagation can be flexibly applied to them.

2.2.4. Compact coupling structure

The basic reading and writing operations of the compact-coupling structure are consistent with those of traditional SRAM. However, exclusive and independent structures have been developed for CIM operations. Yin et al.[17] proposed a 12T cell […]
Fig. 6. (Color online) Compact coupling structure: (a) 12T cell and (b) two-way transpose multibitcell.
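The benefit of a two-way transposable array like that in Fig. 6 can be illustrated in software: the same stored weight matrix serves forward propagation row-wise and backward propagation column-wise, with no rearrangement of the stored data. The functions below are an illustrative sketch, not the published circuit:

```python
# Illustrative sketch: two-way access to one stored weight matrix W.

def forward(W, x):
    """Row-wise access: y_i = sum_j W[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def backward(W, delta):
    """Column-wise access: g_j = sum_i W[i][j] * delta[i]."""
    cols = zip(*W)  # read the array in the transposed direction
    return [sum(w * d for w, d in zip(col, delta)) for col in cols]

W = [[1, 0, 1],
     [0, 1, 1]]
print(forward(W, [1, 2, 3]))   # -> [4, 5]
print(backward(W, [1, 1]))     # -> [1, 1, 2]
```

In a non-transposable macro, the backward pass would instead require writing a second, transposed copy of W or moving data off-macro.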
[…] 1) retaining the standard 6T cell and 2) reconstructing basic units by adding additional transistors or capacitors. Currently, the main purpose of cell design in CIM is to realize novel operations; the automatic restoration of operation results is not yet realized. In the future, cell design will focus on the processing of operation results.

3. Peripheral auxiliary circuits of the SRAM-based CIM

Depending on the basic cells in the array alone, only limited digital computing functions can be achieved. Peripheral circuits, such as high-precision ADCs, weight processing, reference-voltage generation, pulse modulation, and near-memory multiplication, must be used with the SRAM system to achieve high-performance memory operations or analog-domain computation.

3.1.1. Quantifying BL voltage

Fig. 7(a) shows an asymmetrically sized sense amplifier (SA)[18]. This circuit differs from a traditional SA, which is only used to read data. If the MBL in the circuit is sized larger than MBLB, the BL will discharge faster than the BLB under the same conditions. This difference in discharging speed distinguishes the two input cases '01' and '10' in the Boolean logic. Thus, in a single memory read cycle, a class of bitwise Boolean operations can be obtained by reading directly from the asymmetrically sized SA outputs.

Sinangil et al.[24, 37] proposed a 4-bit flash ADC. This ADC uses SAs to save area and reduce power consumption. Each 4-bit flash ADC uses 15 SAs simultaneously and quantizes the analog voltage of the RBL by setting 15 different reference voltage levels (Fig. 7(b)).
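The 15-reference flash quantization described above can be modeled behaviorally. Uniform reference spacing is an assumption here; the actual reference levels are a design choice of the cited work:

```python
# Behavioral model of a 4-bit flash ADC: 15 SA comparators compare the
# RBL voltage against 15 references; the thermometer code gives the
# output code (uniform reference spacing is assumed for illustration).

def flash_adc_4b(v_rbl, v_full=1.0):
    """Quantize v_rbl with 15 uniformly spaced references into 0..15."""
    refs = [v_full * k / 16 for k in range(1, 16)]        # 15 levels
    thermometer = [1 if v_rbl > r else 0 for r in refs]   # 15 SA outputs
    return sum(thermometer)                               # 4-bit code

print(flash_adc_4b(0.40))  # -> 6  (0.40 V lies between 6/16 and 7/16)
```

All 15 comparisons happen in one cycle, which is exactly why flash conversion is fast but comparator-hungry compared with the SAR approach discussed below.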
Fig. 8. (Color online) (a) Weighted array with different capacitor sizes and (b) multi-period weighting technique using capacitors of the same size.
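The two weighting styles of Fig. 8 can be sketched numerically. Both functions are illustrative simplifications (the charge-sharing model below halves the voltage once per period, whereas the cited work reaches 1/8 in two periods by sharing with three capacitors at once):

```python
# Numerical sketch of capacitor weighting (values are illustrative).

def size_weighted_sum(dvs, unit=1):
    """Fig. 8(a) style: bit n carries weight 2**n via a 2**n * Cu cap."""
    return sum(dv * (2 ** n) * unit for n, dv in enumerate(dvs))

def share_weight(v_bl, periods):
    """Fig. 8(b) style, simplified: each sharing period halves the BL
    voltage, since equal capacitors split the charge evenly."""
    for _ in range(periods):
        v_bl /= 2
    return v_bl

print(size_weighted_sum([1, 0, 1, 1]))  # -> 13 (1 + 4 + 8)
print(share_weight(0.8, 3))             # -> 0.1 (1/8 weighting)
```

The tradeoff the text goes on to discuss is visible here: the first style needs exponentially larger capacitors per bit, while the second keeps the capacitance fixed but spends extra operation periods.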
[…] that is, a stepwise comparison is performed through multiple operation stages to quantify the analog voltage of the RBL.

The quantifying circuits process the final calculation result, which is especially important for the calculation accuracy of the entire system. Most researchers choose flash ADCs[14, 37, 38, 55] or successive-approximation (SAR) ADCs[44, 51, 52, 60, 63] to quantify the BL voltage. Flash ADCs usually require multiple comparators and reference voltages; thus, at the same bit accuracy, the area and power consumption of a flash ADC are relatively larger than those of a SAR ADC. Sometimes a high-accuracy flash ADC may even be larger than the whole array. A SAR ADC needs only one comparator to complete the quantization operation. However, it needs multiple cycles for comparison. Consequently, its speed is far lower than that of a flash ADC. To select the type of ADC in CIM, the tradeoff between quantization accuracy and the overhead of area and power consumption is considered first. For example, the C3SRAM used a flash ADC with multiple comparators and voltage references, achieving a relatively high throughput of 1638 GOPS[58]. The ADC consumed 15.28% of the area and 22% of the power of the whole chip. In this work, 256 rows can be activated for dot-product calculation; thus, the full resolution of the partial convolution results was 8 bits. However, considering the area, power, and latency, a flash ADC with a lower, 5-bit accuracy was used, and the design was still able to maintain the final accuracy. In contrast, the authors in Ref. [57] used a SAR ADC achieving a throughput of 8.5 GOPS, which consumed 43.1% of the area and 24.1% of the power of the entire chip. From the presented data, this SAR ADC seems to have larger area and power than the above-mentioned flash ADC. However, in this work, it had 8-bit accuracy to increase the inference accuracy. To reduce the interference of PVT variations, the SRAM-based CIM architecture needs to be integrated with the ADC. Therefore, ADC design is significantly important in CIM works.

3.1.2. Weighting BL voltage circuits

One type of weighting BL voltage circuit is the capacitor-array weighting circuit. It has higher linearity and is often used when high-precision operations are required. Capacitor-array weighting techniques are broadly grouped into two categories:

1) Weighting by different capacitor sizes[37, 38], where the voltage of the RBL decreases by Δv after calculation and is shared by the capacitors connected to each column. The voltage contributions corresponding to N3 on RBL[3] down to N0 are binary-weighted fractions of Δv, achieved by the different capacitor sizes, as shown in Fig. 8(a). Finally, the weighted shared charge is transmitted to an ADC composed of the SA.

2) Weighting by sharing charge across multiperiod operations, as shown in Fig. 8(b). In Ref. [64], the parasitic capacitance on the BL was the same as that of C_DIV in the capacitor array, so that the charge was shared equally among the capacitors. When 1/8 weighting was required, in the first period, the BL was connected to port 2, achieving charge sharing with three C_DIVs. Thus, the voltage decreased to 1/4. In the second period, the BL was connected to port 1, and the
shared charge was divided equally. Therefore, the original charge was ultimately divided down to 1/8 of its original value. Charge can be shared in multiperiod operations in a similar manner to achieve other weights.

The two types of weighting techniques have their demerits. In the first technique, the capacitance increases exponentially with the number of input bits, which increases the area overhead exponentially. However, in that work, the unit computation capacitance is formed by the inherent capacitance of the sense amplifier (SA) inside the 4-bit flash ADC, which saves area and minimizes the kick-back effect. Moreover, it is difficult to realize in the manufacturing process. In the second technique, the capacitance of the circuit remains unchanged. However, multiple operation periods are required to complete the weighting, which decreases the computation speed. In general, whether it is the equivalent capacitance of the SA, MOMCAP, or MOSCAP, a large capacitance in the CIM system will have a certain impact on the robustness of the operations.

3.2. Digital-to-analog (DAC) conversion circuit

The purpose of the DAC circuit is to convert a digital input into the corresponding analog quantity, namely the pulse width or pulse height. The BL voltage can be decreased or increased proportionately by controlling the pulse width or pulse height in proportion to the digital input. The precise generation of these widths or heights is crucial to multibit calculation.

For example, an input of 6 bits requires 64 different pulse widths. Generating such a variety of pulse widths in the memory consumes an immense area and power. Therefore, the solution to this problem requires a delicate circuit. Biswas et al.[10] proposed a global read bitline (GBL) DAC circuit consisting of a cascaded PMOS stack biased in the saturation region to act as a constant current source (Fig. 9(a)) and a two-level data selector (Fig. 9(b)). The circuit controls the opening time of the transmission gates according to the input data so that the bitline is charged to the corresponding voltage value. With the traditional scheme, 64 types of signals and a 64 : 1 MUX would be required for a 6-bit input (XIN[5:0]). However, in Ref. [10], a two-level data selector was proposed to solve this problem. The first level has three 2 : 1 MUXes, and the second level an 8 : 1 MUX. The first stage TS56 chooses XIN[5:3] as the input, and the second stage XIN[2:0]. The two stages have pulse widths of 56t0 and 7t0, respectively. Finally, 64 proportional time-series signals are generated by combining eight different widths at the 8 : 1 MUX input. The operation waveform is illustrated in Fig. 9(c). The 64 different pulse widths can act on the charging time of the BL through the constant current source such that the BL has 64 types of precharging voltage corresponding to the input. Finally, this circuit can achieve up to six bits of DAC input.

As shown in Fig. 9(d), the DAC circuit used for pulse-height modulation is composed of a binary-weighted current source and a copy unit[15, 16]. In a current source circuit, a fixed voltage VBIS is applied to the gates of transistors with different width-to-length ratios. It produces different proportions of current according to the input data and then passes the current through the diode-connected MA,R, generating a weighted voltage at […]

As shown in the orange rectangle in Fig. 10[5], the BL and BLB of the redundant reference column are shorted together. Therefore, the parasitic capacitance of the redundant reference column is twice that of the main array. The BL voltage of the redundant reference column is reduced by half relative to the BL voltage of the main array, generating the desired reference voltage. The redundant column with buffers generates the required reference voltages, as well as tracks the PVT variations in the memory array, thereby increasing the sensing margin. Si et al.[1, 2] proposed a dynamic input-aware reference generation scheme to generate an appropriate reference for the binary dot-product accumulation mode. Two redundant columns were used as the reference column. The BLs of the reference columns were selectively connected […]. Generating multiple reference voltages in a memory array increased the functionality and enhanced the accuracy of the entire system.

Fig. 10. (Color online) Redundant reference column technology.

[…]tion is a good choice, and digital auxiliary circuits can significantly improve the computational accuracy of the system.

Fig. 11(a) illustrates a digital-aided technique for 2- and 3-bit multiplication. The Boolean logic "AND" and "NOR" functions are realized by an SA sensing the voltage of column CBL or CBLB[12, 13]. Then, the addition operation is implemented by the Boolean logic. Finally, the trigger and data selector are used to accumulate the results of the addition operation to execute multiplication.

The digital circuit not only assists in the execution of multiplication but also the MAC, as shown in Fig. 11(b)[22, 31, 58]. The bit-tree adder is used to obtain the cumulative sum of the BL quantization results. For an array with N numbers to be accumulated, a log(N)-layer full adder is required to perform the population count.

Digital calculations are highly precise. However, this method requires multiple cycles and has a relatively large area overhead and power consumption. The efficient use of digital circuits is a research direction in in-memory computing.

4. Computational functions of the SRAM-based CIM

Because the internal cells of an SRAM array are repetitive, the operations in memory must be simple and repeatable. The existing SRAM-based CIM can be classified into two types: pure digital CIM and mixed-signal CIM. The pure digital CIM mainly includes Boolean operations and CAM, and the mixed-signal CIM the processing of the Hamming distance, MAC, and SAD.

4.1. Digital SRAM-based CIM

4.1.1. Boolean logic (AND, OR, NAND, NOR, XNOR, XOR, and IMP)

Implementing the Boolean logic in memory is relatively
T
EN
Near-Memory Part XNORed vector(from SAs)
Q D
T
DR
CBL
Cout
SA
AB/A+B/A+B Layer 1 + + +
AB Cout
Q D Layer 2 +
Cin C
DR EN
CBLB
A+ B
Bit-Tree
SA
A+B
Sum
+ Adder
Layer log (N)
SAE Latch_EN
(a) (b) Popcount
Fig. 11. (Color online) (a) In-/near-memory computing peripherals and (b) a bit-tree adder.
XLSB
0 X0 X1 X2 X3
XLSB
DEMUX Sign(W)
ϕ2,X
ϕ2,0 ϕ2,3 Δ VM
ϕ2,1 Vpre
ϕ2,2 ϕ3,3
ϕ2,2
Vpre
ϕ2,3
ϕ2,1 ϕ3,2 C2
ϕ3,0
Vpre
ϕ3,1 ϕ2,0 ϕ3,1 C1
ϕ3,2 Vpre
ϕrail Vpre
ϕdump CX
(a) (b) ΔVBLMUX
Fig. 12. (Color online) Signed 4-b × 8-b least significant bit (LSB) multiplier: (a) timing diagram and (b) circuit schematic.
simple and accurate because its operation is completed in Most of the existing Boolean logics are realized by open-
the digital domain. Fig. 13(a) illustrates a basic construct ing two rows of cells and sensing the BL voltage by setting
for performing in-place bitwise logical operations using the SA reference voltage[5, 7, 21, 23−27]. Implementing multiple in-
SRAM[19, 20, 22, 24]. To realize the logic operations of A and B, put logic operations in a single cycle and XNOR without addi-
the BL and BLB are first precharged to VDD. Then, the WL of tional combinational logic gates is challenging. Surana et al.
the corresponding cell is turned on, and the two BLs are dis- proposed a 12T dual port dual interlocked storage cell SRAM
charged according to the data stored. If AB = 11, BL will not to implement the essential Boolean logic in a single cycle[23].
discharge; if AB = 01/10, BL will discharge to a certain level; if Lin et al. leveraged the three types of BLs of a traditional 8T-
AB = 00, BL will discharge to the maximum, as shown in Fig. SRAM to simultaneously realize four input logic operations[5].
13(b). The SA can realize different logic functions, such as Zhang et al. proposed an 8T-SRAM with dual complementary
AND and NOR, by setting different VREF values. Finally, XNOR access transistors[64]. It utilizes the threshold voltage of NMOS
can be realized by combining ‘AND’ and ‘NOR’ through the and PMOS, precharges the BL voltage to 0.5 VDD, and finally
OR gate to implement the full Boolean logic. Fig. 13(c) shows configures different reference voltages to realize the Boolean
the implication logic (IMP) and XOR logic[18]. In the CIM logic. In addition, a composite logic operation can be real-
mode, SL1 is connected to the VDD supply, while SL2 is groun- ized without additional combinational logic. The existing meth-
ded, forming a voltage divider. RWL1 and RWL2 are initially ods add additional cycles to store results in other cells[18];
grounded, and RDBL is precharged to Vpre (at 400 mV). The however, it decreases the speed and storage density of the sys-
voltages of RWL1/ RWL2 represent the input data. If Q1 and tem. The future direction of in-memory Boolean logic is to ef-
Q2 store data ‘1’ and the input is ‘00/11’ (RWL1 = 0, RWL2 = fectively store the calculated results.
0; RWL1 = 1, RWL2 = 1), RDBL remains Vpre; if the input is ‘01’ 4.1.2. Content addressable memory (CAM)
(RWL1 = 0, RWL2 = 1), RDBL discharges to ground; and if the in- The CAM is a special type of memory that can automatic-
put is ‘10’ (RWL1 = 1, RWL2 = 0), RDBL charges to VDD. Fi- ally compare input data with all the data stored in the array
nally, the results of IMP and XOR can be realized by using simultaneously to determine whether the input data matches
two skewed inverters to sense the RDBL voltage.
the data in the array. The realization of CAM in SRAM can re-
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
10 Journal of Semiconductors doi: 10.1088/ 1674-4926/43/3/031401
WL AND operation
VBL VREF
RWL1 RDBL
Cell1 VDD
BL BLB
A A Q1
M2
Q1 M1
M1
WL VDD INV1 INV3
SL1
00 01 10 11 RWL2 RWL1 M2
Bit combinations Cell2 10
VREF NOR operation RDBL
VREF B B M4
INV2
VBLB VREF
Q2 RWL2 M4
01
M3
SA SL2
SA Q2 M3
Vref
Q QB Q QB
0 1 1 0
Vdd
-
SA
pre
+
Keeping High 1
(Match)
0
T1 T1 Match 0 1 0 1 (Mismatch)
+SA -
Slb Sl Slb Sl
BLB<4>
BLB<0>
BL<4>
BL<0>
Discharging
(a) (b)
BLB_1 BL_1 BLB_2 BL_2 BLB_3 BL_3 BLB_4 BL_4
CWL Driver
Column 1 Column 2 Column 3 Column 4
Sea rch Da ta Inp u t 10 0 1 / RWL Dr iver
S_1 1 0 1 0 1 0 0 1
1 0 1 0 0 1 1 0
SLB1
0
SL1
1
0 1 0 1 1 0 1 0
Search driver
S_2 1 0 1 0 1 0 1 0
SLB2
1
SL2
0
0 1 0 1 1 0 1 0
SLB3
1
SL3 S_3 0 1 0 1 0 1 0 1
0
1 0 1 0 1 0 1 0
SLB4
0
SL4
ML1 ML1' ML3 ML3'
1
1 0 VR EF SA SA SA SA SA SA SA SA
Match Mismatch
(c) (d)
Fig. 14. (Color online) Column-wise BCAM: (a) search example in 3D-CAM and (b) 4+2T. Row-wise TCAM: (c) organization based on 10T and (d) or-
ganization based on 6T.
duce data transmission and avoid a large amount of energy (a) and (b) are the column-wise BCAM operations, and (c) and
consumption. The CAM operation can be divided into binary (d) are the row-wise TCAM operations. Srinivasa et al.[6] used
CAM (BCAM) and fault-tolerant ternary CAM (TCAM). In addi- the array structure shown in Fig. 14(a) to execute a column-
tion, there are two different search modes, including row- wise BCAM operation. In the CAM operation, the data to be
wise search and column-wise search. The row-wise search is searched are stored in an array, and the search data are repres-
defined as the cases where the input searching data are repres- ented by the voltage values of the source line (Sl) and Slb. If
ented by lines connected by rows. Similarly, the column-wise the data to be searched matches the search data, the pre-
search is defined as the cases where the input searching data charged match line is considered not-discharged and vice
are represented by lines connected by columns. In Figs. 14,
versa. To save area during CAM operations, Dong et al.[20] pro-
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
Journal of Semiconductors doi: 10.1088/ 1674-4926/43/3/031401 11
Table 2. Summary of chip parameters and performance of in-memory Boolean logic and CAM.

Parameter            | Ref. [19]                 | Ref. [28] | Ref. [20]   | Ref. [29]    | Ref. [5]    | Ref. [68]     | Ref. [11]
Technology           | 28-nm FDSOI               | 180 nm    | 55-nm DDC   | 28-nm FDSOI  | 65 nm       | 28 nm         | 28 nm
Cell type            | 6T                        | 8T        | 4+2T        | 6T           | 8T          | 14T           | 10T
Array size           | 64×64                     | 8×8       | 128×128     | 128×64       | 128×128     | 1024×320      | 64×64
Supply voltage (V)   | 1                         | 1.2       | 0.8         | 0.9          | 1.2         | 0.9           | 0.9
CAM freq. (MHz)      | 370 (1 V), 8.90 (0.38 V)  | NA        | 270 (0.8 V) | 1560 (0.9 V) | 813 (1.2 V) | 1330 (0.9 V)  | 262/256 (0.9 V)
CAM energy (fJ/bit)  | 0.6 (1 V)                 | NA        | 0.45 (0.8 V)| 0.13 (0.9 V) | 0.85 (1.2 V)| 0.422 (0.9 V) | 1.025/1.02 (0.9 V), 0.635/0.632 (0.7 V)
Logic freq. (MHz)    | NA                        | NA        | 230 (0.8 V) | NA           | 793 (1.2 V) | NA            | ~300 (0.9 V)
Logic energy (fJ/bit)| NA                        | NA        | 24.1 (0.8 V)| NA           | ~31 (1.2 V), ~22.5 (1 V), 16.6 (0.8 V) | NA | ~15 (0.9 V), ~12.5 (0.8 V), ~10.5 (0.7 V)
Search mode          | 1                         | 1         | 2           | 2            | 1           | 2             | 1, 2
Function             | SRAM/CAM/Logic            | SRAM/CAM/Left shift/Right shift | SRAM/TCAM/Logic | SRAM/Pseudo-TCAM | BCAM/SRAM/Logic | SRAM/TCAM | SRAM/CAM/Logic/Matrix transpose

1 Row-wise search. 2 Column-wise search. DDC, deeply depleted channel; FDSOI, fully depleted silicon on insulator.
posed a 4 + 2T SRAM. The data to be searched are also stored in an array, and the search data are represented by the BL voltage. The SA is used to capture the voltage value of the match line to obtain the result, as shown in Fig. 14(b). Because the two studies represent the search data by the BL voltage, all the memory units are column-wise addressable.
The difference between TCAM and BCAM is that the former has a do-not-care state. Two cells are used to represent one data point owing to the presence of three states in the TCAM. The three states, 0/1/X (where X is an independent do-not-care state), are distinguished by whether the other two MLs are discharged[11]. In addition, to match the characteristics of data stored in rows in an SRAM, this work used a 10T-SRAM to simultaneously implement CAM in the row and column dimensions. In order to save area, Jeloka et al. used standard 6T cells to achieve TCAM operations, as shown in Fig. 14(d)[19], where the search data are represented by the WL voltages and the nodes of two adjacent cells in a row represent the stored data. Similarly, the search results are obtained by the SA sensing the BL voltages. However, unlike Refs. [11, 19], Ref. [29] used the virtual ground wire technique as the sensing mechanism. The virtual and actual grounds are connected via a diode, and the SA detects the voltage of the virtual ground, obtaining the ternary addressing result.
Lin et al.[5] and Chen et al.[28] utilized the 8T unit to complete the CAM operation. The difference between the two studies is that the former used the BL as the data search line, whereas the latter used the WL. Ref. [67] used CAM auxiliary circuit techniques to improve the operational speed of the system under ultra-low voltage. In Ref. [68], the CAM function was realized by combining two 6T cells with two additional control transistors.
In Table 2, the performances of several CIM Boolean logic and CAM designs are compared and the different search modes are presented. Most studies have implemented both Boolean operations and CAM by modifying standard cells, indicating the compatibility of these two functions. However, few have achieved the CAM function in two directions simultaneously. Currently, in-memory Boolean logic is realized at the cost of reduced parallelism as opposed to its analog counterpart. Therefore, improving the parallelism of Boolean logic and using it to achieve sophisticated calculations in memory will be a direction for research.
4.2.1. Single-bit operation
A. Binary dot product
The multiplication operation of a single bit is also a dot-product operation. As mentioned previously, there are two subtypes of this operation: binary and ternary dot products. Chiu et al. executed (1, 0) × (1, 0) and (1, 0) × (+1, −1) dot-product operations with standard 6T cells[33]. The input is represented by the WL voltage, and the weight by the 6T storage data. A truth table summarizes the combinations of different inputs and weights, as shown in Fig. 15(a). The binary dot-product results are represented by the voltage difference between BL and BLB. The BL voltage changes according to the result of the multiplication of the input and weight. Sun et al. proposed a dedicated 8T-SRAM cell for the parallel computing of a binary dot product[30, 69] (Fig. 15(b)). There are two complementary WLs (i.e., WL and WLB) and two pairs of pass gates (PGs) to achieve the operation function. The first pair of PGs is controlled by the WL and connects Q and QB to BLB and BL, respectively, while the second pair of PGs is controlled by WLB and connects Q and QB to BL and BLB, respectively. The proposed 8T-SRAM always contains a non-zero voltage difference between BL and BLB, which represents the results of the binary-weight multi-
[Fig. 15 graphic. The recoverable content is the truth table of the 8T-SRAM binary dot product:
Neuron | Weight | Product | iBL | iBLB | ΔVBL | ΔVBLB
  +1   |   −1   |   −1    | Δi  |  0   |  ΔV  |  0
  −1   |   −1   |   +1    |  0  |  Δi  |  0   |  ΔV
  +1   |   +1   |   +1    |  0  |  Δi  |  0   |  ΔV
  −1   |   +1   |   −1    | Δi  |  0   |  ΔV  |  0 ]
Fig. 15. (Color online) Schematic and truth table of the binary dot product: (a) 6T-SRAM binary dot product and (b) 8T-SRAM binary dot product.
(c) Ternary dot product: operation of ternary multiplication and XNOR value mapping table.
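The single-bit behavior of Fig. 15 can be cross-checked with a small numerical model (a sketch assuming ideal, perfectly linear bit-line accumulation; all names are illustrative). The same XNOR-accumulate machinery underlies the Hamming distance discussed later in this subsection:

```python
# Ideal model of the binary/ternary dot product of Fig. 15.
# Inputs take values in {+1, -1} (binary) or {+1, 0, -1} (ternary input);
# weights are {+1, -1}. Perfectly linear BL discharge is assumed.

def xnor_mult(a, w):
    """Single-bit 'XNOR' multiply: +1 if signs agree, -1 if they differ, 0 for a 0 input."""
    return a * w

def column_accumulate(inputs, weights):
    """Accumulated column result: models the net BL/RBL voltage shift."""
    return sum(xnor_mult(a, w) for a, w in zip(inputs, weights))

# Hamming distance is the same accumulation on {0,1} vectors:
# count the positions where bitwise XNOR yields 0.
def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

print(column_accumulate([+1, -1, +1, 0], [+1, +1, -1, -1]))  # -1
print(hamming([1, 1, 0, 1], [0, 1, 1, 1]))  # 2 (the 1101 vs 0111 example)
```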
plication operations. The truth table in Fig. 15(b) reveals the combination of different inputs and weights, and the corresponding changes caused in the BL discharge current (Δi) and voltage (Δv). In addition, according to the input neuron vector, a WL switch matrix is used to simultaneously activate multiple WLs. The discharge current from several bitcells in the same column decreases the voltage of the BL (BL or BLB). Thus, the voltage difference between BL and BLB can be used to determine the weighted sum.
The cumulative result of the above study is on the BL; therefore, the operation should be performed by columns. However, in the traditional SRAM storage mode, data are stored row-wise. The CIM mode conflicts with the storage mode, which reduces the operation efficiency. To address this problem, Agrawal et al. used a 10T SRAM to perform a binary line dot product, and the results were reflected on the horizontal source lines[9], conforming to the traditional SRAM storage mode. However, the storage density is reduced by the in-
troduction of additional transistors in the basic cell.
B. Ternary dot product
Yin et al. achieved a ternary dot product[17] (Fig. 15(c)). For the XNOR operation, if the result is +1, the PMOS provides a strong pull-up and the NMOS a weak pull-up on the RBL. If the result equals −1, then the NMOS provides a strong pull-down and the PMOS a weak pull-down. A resistor divider is formed for cells on the same column, with the RBL as the output. The voltage on the RBL represents the accumulation of the XNOR results in that column. For XNOR with ternary activation inputs (+1, 0, or −1) and binary weights, an additional case of '0' input must be considered. As shown in Fig. 15(c), if the '0' input is in an even row, the PMOS provides a weak pull-down and the NMOS a weak pull-up on the RBL. If the '0' input is in an odd row, the PMOS provides a strong pull-up and the NMOS a strong pull-down on the RBL. Assuming there is a sufficiently large number of inputs, half of the inputs will be in the even rows and the other half in the odd rows. Thus, the cumulative increase and decrease in the RBL voltage are 0. The result of the ternary multiplication is summarized in the truth table in Fig. 15(c).
As the input and weight of the dot-product operation are single bits, no additional auxiliary circuits are required to process the weight, and the quantization of the operation results is simplified. The dot-product operation can be applied to the binary neural network (BNN) algorithm, where the inputs can be restricted to either +1/−1 or 0/1. When the inputs are restricted to 0 or 1, there is a 0.03% loss of accuracy compared with +1/−1 inputs, as tested on the MNIST dataset[1]. However, because these studies used the analog voltage of the BLs to reflect the operation results, they could not obtain results as accurate as those of digital CIM.
C. Hamming distance
The Hamming distance algorithm is widely used in signal processing and pattern recognition. The Hamming distance between any two vectors of the same length is defined as the number of corresponding bits with different values. The principle is that two words of the same length are bitwise XNORed, and these XNOR results are then accumulated. For example, the Hamming distance from 1101 to 0111 is 2. Because this algorithm also requires high data access, it consumes a significant amount of energy when used in the traditional architecture.
Ali et al. proposed a 9T SRAM to calculate the Hamming distance[8], as illustrated in Fig. 16. One of the vectors is stored in the memory array, while the other vector is used to drive the RBL/RBLB. For example, if the input vector is '1', the RBL is connected to VDD and the RBLB to GND. If the data in one cell match the input, the SL is connected to VDD through M1,n. If they mismatch the input, the SL is connected to GND through M2,n. Hence, this circuit forms a voltage divider at the SL. Finally, the results of the Hamming distance accumulate on the SL.
Unlike the row-wise Hamming distance operations, Kang et al. proposed a column-wise Hamming-distance macro based on a 6T SRAM array[70]. In this design, two inputs are stored in two different rows in the same column. Then, the logic operation between the two inputs is executed by the SA. The final Hamming distance value is obtained by combining and accumulating the outputs of the SA. Because the result of this study is represented by BLs connected by columns, the Hamming distance is calculated column-wise.
In CIM, most of the results are reflected on BLs of column-wise connection. Therefore, vertical data storage is generally required, increasing the implementation complexity for the SRAM writing mode. For example, Jeloka et al. proposed a strategy of column-wise write[19]. It writes 1 in the first cycle and 0 in the latter cycle, which decreases the written-data throughput and the writing speed. In addition, as a basic and important operation, matrix transposition is generally realized by data reading, moving, and writing back, a complicated operation with high power consumption. Thus, realizing the Hamming distance operation in both the row and column directions will be a research direction.
4.2.2. Multibit operation
Unlike single-bit operations, wherein the operands are limited to only 0, −1, and 1, multibit operations can obtain more precise in-memory computations, which meets the requirements of various AI algorithms. There are two main categories of multibit operations: 1) multibit multiplication and 2) SAD.
A. Multibit multiplication
The key to multibit multiplication is the weighting strategy. The weighting strategies include pulse width[10, 35, 36, 39, 40, 45, 50, 56, 64, 66, 71], pulse height[10, 15, 16, 44, 56], number of pulses[37, 38], width-to-length ratio of transistors[3, 32, 42, 43, 62], capacitor array weighting[37, 38, 62, 63, 72, 73], and precharge time weighting[44]. The specific implementation strategy of the capacitor array weighting technology is introduced in Section 3. The multibit multiplication is reviewed from the following
aspects.
[Fig. 17 graphic: multibit input weighting circuits. The surviving labels show a charge-sharing scheme (Vch-sh, t = 8T + Tch-sh) in panel (a), weight cells W0,8–W1,14 driven by WL1–WL3 in panel (b), and the counter-based WL driver generating RWL pulses for 8T cells in panel (d).]
[Fig. 18 graphic: (a) an 8T array whose read transistors are sized W/LM1,2 = 8, 4, 2, 1 across the four columns, with inputs Vi0/Vi1 on RWL0/RWL1 and a shared sensing circuit summing the IRBL currents into Iout; (b) the twin-8T cell with most-significant (M8T, 2X) and least-significant (L8T, 1X) read ports.]
4) Weighting strategy based on the number of pulses. Dong et al.[37, 38] designed a 7-nm FinFET CIM chip that uses a pulse number modulation circuit (Fig. 17(d)). The 4-bit input is represented by the number of read word-line (RWL) pulses. These pulses are generated by a counter according to the value of the input data and are applied to the RWL to turn on the corresponding cells. The discharge amount of the BL is proportional to the number of times the RWL is turned on, thereby executing multibit multiplication.
5) Weighting strategy based on the transistor width–length ratio. As depicted in Fig. 18(a), in the 8T array, the sizes of the read access transistors in different columns are adjusted in proportion to the weights of the input data[3]. The width–length ratios of M1 and M2 in the first, second, third, and fourth columns are 8 (W/LM1,2 = 8), 4, 2, and 1, respectively. Therefore, the sum of the multiple bits multiplied by one bit can be obtained. The study combines the BL discharge currents of the four columns and converts the current value into a voltage value through an operational amplifier, which is ultimately quantified by the ADC. Similarly, Si et al. proposed a twin-8T structure based on the traditional 8T[32] and adjusted the width–length ratio of one group of reading transistors to twice that of the other group to achieve 2-bit input weighting (Fig. 18(b)). Su et al.[43, 62] proposed a transposable arithmetic cell structure and quantified its internal weighting by the width–length ratio of the transistors, achieving simultaneous bidirectional calculations.
Different designs have tradeoffs among time cost, implementation difficulty, linearity, area cost, and process. 1) Precharge-time weighting: when the precharge and the word line are turned on simultaneously, there will be a relatively large current, which will increase power consumption. 2) Pulse-width weighting: its operation is relatively simple. However, when the input bits increase, the corresponding pulse width increases proportionally; thus, the time increases exponentially, resulting in a significant decrease in calculation speed. Moreover, when the bit-line voltage is relatively low, linearity will also be a problem. 3) Pulse-height weighting: the V–I characteristic of metal–oxide–semiconductor (MOS) devices is that the current between the source and drain increases proportionally to the square of the gate voltage. Therefore, unlike 2), the pulse height cannot be increased proportionally to ensure proportional discharge of the bit line, so it is difficult to set the pulse height. In addition, the linearity is relatively poor compared with 2). 4) Pulse-number weighting: as this design controls the discharge number directly, it avoids the difficulty of setting precise pulse widths or heights.
B. Sum of absolute differences (SAD)
The absolute difference (AD) between data D and a pattern P can be expressed as
|D − P| = max(D − P, P − D) = max(D + P̄ + 1, P + D̄ + 1) ⇒ max(D + P̄, P + D̄),   (1)
where D̄ and P̄ are the 1's complements of D and P, respectively. Note that D̄ and P̄ are available because of the complementary nature of the SRAM bitcell.
Kang et al. executed a SAD operation based on a 6T SRAM array without sacrificing storage density[35, 50]. In the SAD, the template pattern P is stored with a polarity opposite to that of D. Both P and D are stored in four adjacent cells in a column, as shown in Fig. 19(a). The decimal data can be read out on the BL through multirow read technology with WL pulse modulation. For example, if the data stored in four consecutive cells in a column are d = '1111', the decimal number is D = 15. When reading the data with four weighted pulse WLs (WL0–WL3), the BL voltage decreases by 15Δv, as shown in Fig. 19(b). Because P and D are stored in the array in the opposite manner, the AD operation results can be obtained by comparing the voltages of the two BLs. These outputs are summed via a capacitive network using a charge-transfer mechanism to generate the SAD.
Table 3 summarizes the performances of existing single- and multibit operations. The numbers of input and output bits reflect the performance of the operation. However, the difficulty lies in improving the effective number of bits (ENOB) of the final output result without using a high-ENOB ADC. This is because the overhead of a high-ENOB ADC is unacceptable, as it deviates from the original intention of low-overhead in-memory computing.
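Eq. (1) can be verified numerically (a sketch; the 4-bit width is an assumption matching the d = '1111' example in the text):

```python
# Numerical check of Eq. (1): |D - P| computed from 1's complements,
# which an SRAM bitcell provides for free via its Q/QB nodes.

N = 4  # assumed bit width

def ones_complement(x, n=N):
    return (1 << n) - 1 - x

def sad_ad(d, p, n=N):
    """Absolute difference via the max-of-sums identity of Eq. (1)."""
    m = max(d + ones_complement(p, n) + 1, p + ones_complement(d, n) + 1)
    return m - (1 << n)  # both sums carry the same 2^n offset; remove it

# Exhaustive check over all 4-bit operand pairs.
for d in range(1 << N):
    for p in range(1 << N):
        assert sad_ad(d, p) == abs(d - p)
```

The identity holds because d + P̄ + 1 = d − p + 2^n in n-bit arithmetic, so taking the maximum of the two sums and discarding the common 2^n term leaves |d − p|.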
Table 3. Summary of chip parameters and performance of single- and multibit operations.

Parameter                 | Refs. [1, 2]  | Ref. [10]     | Refs. [15, 16] | Ref. [32]    | Ref. [31] | Ref. [44] | Ref. [33]     | Refs. [24, 37] | Ref. [34]
Technology                | 65-nm CMOS    | 65-nm CMOS    | 130-nm CMOS    | 55-nm CMOS   | 65-nm CMOS| 65-nm CMOS| 55-nm TSMC    | 7-nm FinFET    | 65-nm CMOS
Cell structure            | DCS 6T        | 10T           | 6T             | Twin-8T      | 8T1C      | 6T        | 6T            | 8T             | 6T
Array size                | 4 Kb          | 16 Kb         | 16 Kb          | 64×60 b      | 2 KB      | 64 Kb     | 4 Kb          | 4 Kb           | 64 Kb
Chip area (μm²)           | NA            | 6.3×10⁴       | 2.67×10⁵       | 4.69×10⁴     | 8.1×10⁴   | NA        | 5.94×10⁶      | 3.2×10³        | 1.75×10⁵
Input precision (bit)     | 1             | 6             | 5              | 1, 2, 4      | 1         | 5         | 1, 2, 7, 8    | 4              | 4
Weight precision (bit)    | 1             | 1             | 1              | 2, 5         | 1         | 5         | 1, 2, 4, 5, 8 | 4              | 1, 2, 3
Output precision (bit)    | 1             | 6             | NA             | 3, 5, 7      | 5         | NA        | 3, 7, 10, 19  | 4              | NA
Computing mechanism       | Analog        | Digital+analog| Digital+analog | Analog       | Analog    | Analog    | Digital+analog| Analog         | Analog
Model                     | XNORNN/MBNN   | CNN LeNet-5   | Classifier     | CNN          | CNN       | VGG       | CNN           | VGG-9 NN       | CNN
Energy efficiency (TOPS/W)| 30.49–55.8    | 40.3 (1 V), 51.3 (0.8 V) | NA  | 18.37–72.03 (0.8 V) | 671.5 | NA | 0.6–40.2    | 351 (4-b input, 1-b weight) | 49.4
Throughput (GOPS)         | 278.2         | 8 (1 V), 1 (0.4 V) | NA        | 21.2–67.5    | 1638      | NA        | 5.14–329.14   | 372.4 (0.8 V; 4-b input, 2-b weight) | 573.4
Accuracy (MNIST)          | 96.5% (XNORNN), 95.1% (MBNN) | 98% (0.8 V), 98.3% (1 V) | 90% | 90.02%–99.52% | 98.30% | 99% | 98.56%–99.59% | 98.51%–99.99% | 98.80%
Accuracy (CIFAR-10)       | NA            | NA            | NA             | 85.56%–90.42%| 85.50%    | 88.83%    | 85.97%–91.93% | 22.89%–96.76%  | 89.00%
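The weighting strategies compared above all aim to make the bit-line discharge proportional to input × weight. An idealized sketch of two of them (pulse-number and W/L-ratio weighting; perfectly linear discharge is assumed, whereas real silicon deviates, which is why output ENOB is the hard part):

```python
# Idealized model of two multibit weighting strategies.
# All constants are assumptions for illustration only.

def pulse_number_discharge(input_val, unit_drop=1.0):
    """Pulse-number weighting: BL drop ~ number of RWL pulses = input value."""
    return input_val * unit_drop

def wl_ratio_sum(bits, ratios=(8, 4, 2, 1)):
    """W/L-ratio weighting: per-column read currents scaled 8:4:2:1 sum to
    the value of a 4-bit word multiplied by a 1-bit weight."""
    return sum(b * r for b, r in zip(bits, ratios))

assert pulse_number_discharge(9) == 9.0
assert wl_ratio_sum([1, 0, 1, 1]) == 11   # '1011' -> 11
```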
[Fig. 19 graphic: a 6T SRAM column stores the 4-bit data D (cells d3–d0) and the opposite-polarity pattern P in adjacent cells; after precharge (VPRE), weighted WL pulses VWL0–VWL3 discharge the BL in steps of ΔV, 2ΔV, 4ΔV, and 8ΔV, so that d = '1111' produces a 15ΔV drop.]
Fig. 19. (Color online) (a) Schematic of SAD circuit and (b) sequence diagram.
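The weighted multirow read of Fig. 19 can be sketched as follows (an idealized model; the unit drop ΔV and the bit patterns are illustrative assumptions):

```python
# Idealized model of the SAD read-out of Fig. 19 (assumed linear discharge).
# Weighted WL pulses convert the stored bits of a column into an analog BL drop.

DV = 0.01  # assumed unit bit-line drop (V) per weighted unit

def bl_drop(bits, weights=(8, 4, 2, 1), dv=DV):
    """BL voltage drop after one weighted multirow read."""
    return sum(b * w for b, w in zip(bits, weights)) * dv

d_bits = [1, 1, 1, 1]   # D = 15
p_bits = [0, 1, 0, 0]   # P = 4, stored with opposite polarity on the other BL
ad = abs(bl_drop(d_bits) - bl_drop(p_bits)) / DV  # absolute difference in LSBs
```

Comparing the two BL drops yields the absolute difference (here 11 LSBs, within floating-point rounding), which the capacitive network then accumulates into the SAD.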
[Fig. 20 graphic: (a) a CNN (Layer 1 to Layer 4 plus a full-connection layer classifying dog/cat/cattle) mapped onto SRAM arrays with peripheral circuits; (b) the AES flow: 128-bit input data and a 128-bit key pass through SubBytes (LUT-based, performed on the S0–S15 state matrix), ShiftRows, MixColumns (100% XOR), and AddRoundKey (50% XOR) for N ≤ 10 rounds before data out.]
Fig. 20. (Color online) Implementation of (a) CNN and (b) AES on multiple SRAM arrays.
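Of the four AES steps in Fig. 20(b), AddRoundKey is the one repeated every round, and it is a pure bytewise XOR, which is why it maps so naturally to in-memory XOR. A simplified sketch (the real SubBytes S-box and MixColumns polynomials are omitted; only the XOR step is bit-exact):

```python
# Simplified AES round skeleton (illustrative; S-box and MixColumns constants
# omitted). AddRoundKey, the XOR-dominated step, is shown bit-exactly.

def add_round_key(state, round_key):
    """Bit-exact AddRoundKey: XOR of the 16-byte state with the round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))

state = bytes(range(16))          # S0..S15 of the 4x4 state matrix
key = bytes([0xAA] * 16)          # illustrative 128-bit round key
once = add_round_key(state, key)
twice = add_round_key(once, key)  # XOR with the same key restores the state
assert twice == state
```

The self-inverting property checked in the last line is what lets an in-memory XOR macro both apply and strip round keys with the same operation.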
…often reflected in the BL voltage. Quantifying the analog voltage of the BL is key to the entire operation.
The weights and inputs of the CNN are usually multibit. However, to realize multibit operation in memory, it is necessary to change the cell structure or add auxiliary circuits. A twin-8T structure was proposed to realize 1-, 2-, and 4-bit inputs and 1-, 2-, and 5-bit weights, with an output of up to 7 bits[32]. The test accuracy of the system on the MNIST dataset was as high as 99.52%. To simplify circuit design, researchers have developed the BNN, with binary inputs and weights, i.e., "+1" or "−1". In simple application scenarios, it has nearly the same accuracy as the traditional CNN algorithm[1, 2]. Chih et al.[75] proposed another solution that uses all-digital CIM to execute MAC operations with high energy efficiency and throughput. To reduce computational costs, Sie et al.[76] proposed a software–hardware co-design approach, MARS: an SRAM-based CIM CNN accelerator that utilizes multiple SRAM CIM macros as processing units and supports a sparse CNN. With the proposed hardware–software co-designed method, MARS can reach over 700 and 400 FPS for CIFAR-10 and CIFAR-100, respectively. Although these studies realized in-memory MAC, they could not execute the entire CNN process in memory. Therefore, the execution of the entire process in memory is a possible research direction.
Meanwhile, requirements for data security have increased. For example, the convolution kernels and convolution steps of the CNN algorithm are obtained by training with a large volume of data. However, the entire weight matrix is stored in the array and, therefore, can be easily read, causing data leakage. Encryption is crucial to big data. However, the power consumption and delay of implementing a set of encryption algorithms in the digital domain will limit the overall performance of the system; hence, researchers have proposed implementing these algorithms in memory.
Fig. 20(b) illustrates the process of the AES algorithm in four steps: byte replacement, row shift, column mixing, and round-key addition. Byte replacement replaces the input plaintext with a look-up table and implements the first round of encryption. The row-shift operation shifts the transformed matrix by a certain rule. Column mixing is the XOR operation of the target and fixed matrices. Round-key addition performs an iterative XOR between the data and the key matrix.
To implement the AES algorithm in an SRAM array, first, the plaintext matrix that needs to be encrypted is stored in the array. Then, the plaintext and key matrices are encrypted using a peripheral auxiliary circuit. The most repeated operation in the AES algorithm is the XOR operation; hence, implementing XOR in memory, storing the result, and continuing to perform the XOR operation with the input are key steps in
[Fig. 21 graphic: k-nearest neighbor (k-NN) example — pixel arrays di of the stored images and pi of the test images I1 and I2 are subtracted element by element, the absolute differences are summed, and the smallest sums identify the most similar images.]
Fig. 21. (Color online) Application in the k-NN algorithm.
which implemented part of the iterative XOR operation for the AES. However, this strategy increases the area overhead of the storage array and consequently requires an additional empty row in the array to store the calculation results. Jaiswal et al. proposed interleaving WLs (i-SRAM)[22] as the basic structure for embedding bitwise XOR computations in SRAM arrays, which improved the throughput of AES algorithms by a factor of three. Huang et al. modified the 6T SRAM bitcell with dual WLs to implement XOR without compromising the parallel computation efficiency[46], which can protect the DNN model in CIM. These studies realized part of the operations in encryption, and the realization of the entire AES process in memory is a research direction.
5.3. Application in the k-NN algorithm
The k-nearest neighbor (k-NN) algorithm is commonly used for classification and regression. The concept of the algorithm is as follows: if most of the k most similar samples in the feature space belong to a category, the sample also belongs to that category. There are two methods to measure the distance between two samples: the Euclidean distance and the Manhattan distance.
In Fig. 21, I1 and I2 are the pixel values of the target and test images, respectively. First, the pixel values at the same position of the test and training images are subtracted, followed by calculating the absolute values and summing them. The sum represents the similarity between the test and target images: the smaller the value, the higher the similarity. The k-NN algorithm finds the first k images that are most similar to the target image. Kang et al. executed the SAD for application in the k-NN algorithm[35, 50]. They used handwritten-number recognition to test the accuracy of the algorithm on the MNIST dataset. The results showed that the CIM-based k-NN algorithm has a high recognition accuracy of 92%.
5.4. Application in classifier algorithms
Classification is a significantly important method of data mining. The concept of classification is to learn a classification function or construct a classification model (that is, what we usually call a classifier) on the basis of existing data. However, it is challenging to implement an energy-efficient classifier algorithm in resource-constrained devices. If the classifier algorithm can be realized with CIM methods, frequent data access will be greatly reduced.
As depicted in Fig. 22, a high-precision boosted strong classifier can be realized by combining the column-based weak classifiers C1–CM. There is a typical characteristic of column-wise calculation in CIM, which can be perfectly mapped into a column-based weak classifier. Zhang et al.[15, 16] achieved an ML classifier based on the 6T cell. Because of the nonlinearity in column-wise CIM, the result of each column can only form a weak classifier. In memory, the boosted strong classifier can be obtained through adder or subtractor circuits that process the column-wise results, which reduces the non-ideal characteristics of analog CIM. In addition, this design reduces the energy consumption by 113 times when using a standard implementation as the baseline.
6. Challenges and prospects
With the rapid development of AI, requirements for computing power have become stringent. The CIM architecture has ushered in unprecedented development opportunities. All CIM strategies make a compromise among bandwidth, delay, area overhead, energy consumption, and accuracy. The following is an analysis of several typical problems in the CIM architecture.
6.1. Read-disturb issue
In CIM, it is often necessary to access multiple rows for simultaneous computation to increase the throughput of data processing. However, turning on multiple rows synchronously connects the storage nodes directly to the BL, which can cause data to be flipped during the reading process. This strategy therefore introduces errors into the final calculation result; even worse, it destroys the stored data. As shown in Fig. 23, when the BL voltage drops significantly and reaches the write margin, a cell storing '1' will be mistakenly written to '0'. To address this issue, Kang et al.[35] reduced the discharge speed of the BL by reducing the turn-on voltage of the WL. When the WL voltage is reduced to a certain extent, the full-swing BL voltage can be achieved when multiple rows are opened simultaneously. Researchers have also suggested using read–write decoupled cells to achieve CIM operations, such as 8T[3, 5, 18, 28, 37, 38, 65, 77, 78] and 10T[9−11, 56, 57], which can also obtain the full RBL voltage swing (from VDD to GND).
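The SAD-based k-NN flow of Fig. 21 (subtract, take absolute values, sum, rank) can be sketched as follows (illustrative 4-pixel "images"; the 92% MNIST figure quoted above comes from the cited silicon, not from this sketch):

```python
# Sketch of SAD-based k-NN matching (illustrative data only).

def sad(img_a, img_b):
    """Sum of absolute differences: smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b))

def knn(target, training, k=1):
    """Indices of the k training images most similar to the target."""
    ranked = sorted(range(len(training)), key=lambda i: sad(target, training[i]))
    return ranked[:k]

training = [[18, 24, 133, 32], [17, 14, 100, 20], [90, 90, 90, 90]]
print(knn([18, 24, 130, 30], training, k=1))  # [0]
```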
[Fig. 22 graphic: column-based weak classifiers C2, C3, …, CM (bit cells BC2,1–BC128,1) produce outputs Z1–ZM, which are weighted and summed (Σ) by an adder/subtractor outside the columns to form the boosted strong classifier in memory.]
Fig. 22. (Color online) Application in classifier algorithms.
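The combination in Fig. 22 is a weighted vote over per-column weak decisions. A minimal sketch (the ±1 votes and the weights are assumptions; the silicon forms the sum with adder/subtractor circuits acting on analog column outputs):

```python
# Sketch of boosting column-based weak classifiers (assumed +/-1 votes).
# Each column i yields a weak decision z_i; the strong classifier is the
# sign of their weighted sum.

def strong_classify(column_outputs, alphas):
    """Weighted vote over per-column weak decisions z_i in {+1, -1}."""
    total = sum(a * z for a, z in zip(alphas, column_outputs))
    return 1 if total >= 0 else -1

z = [+1, -1, +1, +1]          # weak decisions Z1..ZM
alpha = [0.2, 0.5, 0.4, 0.1]  # per-column confidence weights
print(strong_classify(z, alpha))  # 1
```

Averaging many noisy column results in this way is also what suppresses the non-ideal analog behavior of any single column.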
[Fig. 23 graphic: two 6T cells (Cell 0 storing '1', Cell n storing '0') share BL/BLB; with WL0 = WLn = 1, simultaneous activation discharges the BL toward the write margin, causing a read disturb at the cell storing '1'.]
A cascode current mirror (CCM) peripheral circuit has been proposed and applied at the bottom of each BL. The CCM clamps the BL voltage and proportionally duplicates the BL current into an additional capacitor, increasing the calculation linearity. In addition, the authors also proposed a double-WL structure to reduce the pulse-width delay on the WL and increase the consistency of circuit calculations. They demonstrated the ability of the CCM circuit to reduce the integral nonlinearity by approximately 70% at a 0.8 V supply and to improve the computational consistency by 56.84% at a 0.9 V supply.
[Fig. 24 graphic: with identical data stored, activating a single row gives a clean SA/digital output, whereas activating multiple rows produces a nonlinear BL voltage (steps of 5ΔV–8ΔV compressing toward 2ΔV) and inconsistent results across columns.]
Fig. 24. (Color online) (a) Single-row activation during normal SRAM read operation, (b) multirow read and nonlinearity during CIM, and (c) inconsistent CIM calculation.
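The read-disturb condition of Fig. 23 and the multirow nonlinearity of Fig. 24 can be illustrated with a toy discharge model (every number here is an assumption for illustration, not a measurement):

```python
# Toy model: multirow activation discharges the BL; if the accumulated drop
# reaches the write margin, a cell storing '1' can be flipped (read disturb).
# A saturating discharge also makes the multirow sum nonlinear (Fig. 24(b)).

VDD, WRITE_MARGIN = 1.0, 0.6   # assumed supply and disturb threshold (V)
DROP_PER_ROW = 0.08            # assumed ideal per-row BL drop (V)

def bl_voltage(active_ones, saturating=True):
    """BL voltage after activating 'active_ones' discharging rows."""
    drop = active_ones * DROP_PER_ROW
    if saturating:             # discharge slows as the BL falls: nonlinearity
        drop = min(drop, 0.5 + 0.1 * drop)
    return VDD - drop

def disturbed(active_ones):
    return bl_voltage(active_ones) <= WRITE_MARGIN
```

With these assumed numbers, one active row is safe, while eight active rows push the BL past the write margin, and the saturated drop is visibly smaller than the ideal linear sum, the error an ADC would quantize.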
producing a large current and increasing the system's power consumption.
6.4. Area overhead and energy efficiency challenges of peripheral circuits
The architecture based on CIM requires several peripheral auxiliary circuits to perform additional computing functions. For example, when implementing logic operations in digital functions, two SAs are required on each BL to distinguish different inputs. When performing multiplication and accumulation in analog operations, the calculation results are reflected on the BL as an analog voltage. Therefore, peripheral auxiliary circuits are also needed to quantify these voltage values. The use of multiple high-precision ADCs for the quantification greatly increases the proportion of peripheral circuits. In addition, multibit input modules, such as multi-pulse-width or pulse-height-generation circuits, occupy a large part of the area.
Yin et al.[17] designed a shared ADC for 64 columns, quantized 64 times through a data selector, and added a fault-tolerant algorithm to further reduce the quantization requirement and the overhead of the peripheral circuits. Kim et al.[81] used low-bit ADCs to achieve ResNet-style BNN algorithms via aggressive partial-sum quantization and input-splitting combined with retraining. The area overhead is reduced due to the use of low-bit ADCs. Maintaining the proportion of peripheral circuits within an acceptable range comes with a loss of accuracy in the final output digits. Notably, the complexity of the auxiliary circuits needed to achieve more complex calculations cannot be dismissed in CIM system-level design.
In the future, peripheral circuits may be stacked in 3D to form a pool of peripheral circuit resources shared by CIM arrays. In addition, the use of peripheral circuits can be greatly reduced by time-division multiplexing, which further reduces the area of the SRAM-based CIM architecture.
6.5. Research prospect
6.5.1. More efficient mapping from the common operator set to the actual circuit set
In CIM, it is necessary to design various circuits to realize the operations of the algorithm. The same operator can be realized by several circuits. In contrast, one circuit can also be used by multiple operators. Therefore, the problem of selecting the circuit set and effectively mapping the common operator set to it must be studied. As shown in Fig. 25, this problem can be studied from three aspects.
1) Refining and merging common operator sets. First, an initial set of common operators ϕ (such as multiplication, SAD, addition and subtraction, and MAC) is extracted according to multiple factors, including the requirements of AI algorithms and computational complexity and efficiency. Then, part of the operators must be split based on the initial set ϕ, which will facilitate the further integration of some operators and circuit implementation. For example, the SAD can be split into difference, absolute value, and addition operators.
2) Exploring the appropriate circuit sets. The design process of the circuit set must consider, for example, the coverage of the circuit set, its redundancy, the accuracy of the calculation, the cost of the circuit area, and circuit latency and power consumption. The circuit set can be subdivided by computational complexity and data requirements. On this basis, a unified interface is designed for the circuit module, which is convenient for the upper architecture to organize and implement.
3) Designing word-column/row-block hierarchical sharing architecture. A flexible and configurable hierarchical architecture is key to realizing mapping from a common operator set to a circuit set. The memory array can be divided into a word-column/row-block three-tier system. The lowest layer is the word-level memory computing layer, which has the tightest memory-computing coupling and requires the lowest area cost and power among all the layers. It can directly read and write the data of a single word, transmit it rapidly, and perform lightweight operations. The implementation of simple operators in word-level memory computing can improve energy efficiency. The middle layer is a column/row-level memory-computing layer and has column- or row-shared operation units, such as a linearity compensation module and a consistency compensation module, that improve the performance of multibyte operations. This layer, in which more complex operators can be implemented, balances the energy effi-
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
Journal of Semiconductors doi: 10.1088/ 1674-4926/43/3/031401 21
configurable layered
+ Area
SA constraint
Flexible and
architecture
Recurrent Neural
Network (RNN)
Long/Short Term
Memory ((LSTM)
-
Explore appropriate
merging common
operators set
Refining and
circuit sets
Denoising AE Redundancy Coverage
Auto Encoder (AE) Variational AE
(VAE) (DAE) ADC
Markov Chain
(MC)
Nopfieeld Network
(HN)
Boltzmann
Machine (BM)
* Delay Power
Consumption
constraint constraint
3 Low Complex
Word
Block sharing layer
Column (Row) Coupling
Operation
Block Column (Row) sharing layer degree
Layered sharing
Word sharing layer
High Simple
Fig. 25. (Color online) Approach of mapping from the common operator set to the actual circuits.
ciency, area cost, and power consumption. The uppermost lay- tical parallel reading techniques; thus, the bidirectional CIM ar-
er is the block-level memory computing layer. Although chitecture can be implemented with a small area cost.
memory and computing in this layer are loosely coupled, the 2) A low-powered fast-migration channel is introduced,
layer has the highest tolerance for area, power consumption, and efficient data-storage patterns are designed to reduce
and delay cost, and it suffers latency and demands a large the power consumed by migration. It was found that the
amount of power in performing operations, data caching, data used by two adjacent calculations overlapped signific-
and quantization. The implementation of complex operators antly in the CNN. To improve data utilization and reduce the
in the block-level memory-computing layer can enrich the volume of reloaded data, it is necessary to study the discon-
functions of the CIM system and provide a smooth transition tinuous data-storage mode to meet the requirements of the
between the CIM system and the traditional von Neumann ar- algorithm and improve the coupling between computing
chitecture.
and storage. Additionally, a fast data-migration channel can
6.5.2. Optimize the CIM process perform multiple calculations continuously without reload-
The primary steps of CIM include reading and comput- ing data.
ing, along with a series of processes of writing, quantization, 3) Memory circuits can be designed based on the reuse/re-
and writeback. The optimization of the entire process, which construction/transformation strategy. First, the existing mod-
is key to realizing energy-efficient, high-throughput, and low- ules of the SRAM memory are fully harnessed. The reusable
area-overhead CIM, can be performed from the following modules include sensitive amplifiers, BLs, WLs, redundant
three aspects, as shown in Fig. 26. columns, and decoding circuits. Second, the existing module
1) A horizontal computing channel can be introduced to structure is subtly modified to induce new functions into exist-
implement a bidirectional memory computing system. The in- ing SRAM modules at a markedly low area cost. With the SA
tent of SRAM is to introduce computing units into the stor- as an example, an appropriate configuration transistor can be
age array, reduce data movement, and break through the stor- included to not only induce the amplification and comparis-
age wall. However, in-memory calculations mainly rely on ver- on functions into the SA but also generate the SIGMOID func-
tical accumulative paths and can only be performed after stor- tion and reconstruct it as part of the ADC (Fig. 25). With the re-
age rearrangement, which complicates the calculation pro- dundant column of a duplicate BL as an example, a small num-
cess, changing from write → read → calculation → write- ber of switches can be included to transform the redundant
back in the von Neumann architecture to write → read → stor- column into a pulse-width-generation module that can track
age rearrangement → write → read–calculation integration the process voltage and temperature variations. The same cir-
→ quantification → writeback. The entire process consumed cuit can be used as part of the operators with different func-
significantly more energy than the original consumption with tions through a time-division multiplexing strategy, which
von Neumann architecture. Therefore, as shown in Fig. 25, hori- greatly reduces the area cost incurred by CIM. Third, the calcu-
zontal computational channels can be used to enable CIM lation mode is changed from analog domain calculation to di-
without storage rearrangement. Simultaneously, the vertical cu- gital–analog hybrid calculation, which not only preserves the
mulative path is preserved to make it compatible with the ver-
advantages of analog calculation but also remarkably re-
Z T Lin et al.: A review on SRAM-based computing in-memory: Circuits, functions, and applications
22 Journal of Semiconductors doi: 10.1088/ 1674-4926/43/3/031401
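The column-shared, time-multiplexed ADC readout discussed in Section 6.5 (one converter serving 64 bitlines through a data selector, as in the shared-ADC scheme of Yin et al.[17]) can be sketched behaviorally. In this minimal Python model only the 64-column sharing factor comes from the text; the 3-bit resolution, 1-V full scale, and random bitline voltages are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def adc_quantize(v, bits, v_max=1.0):
    """Uniformly quantize an analog value in [0, v_max] to `bits` bits."""
    levels = 2 ** bits - 1
    code = np.clip(np.round(v / v_max * levels), 0, levels)
    return code / levels * v_max

def shared_adc_readout(bitline_voltages, adc_bits=3):
    """One ADC multiplexed over all columns: the data selector sweeps the
    bitlines, so 64 sequential conversions replace 64 parallel ADCs."""
    out = np.empty_like(bitline_voltages)
    for col, v in enumerate(bitline_voltages):  # data-selector sweep
        out[col] = adc_quantize(v, adc_bits)
    return out

# 64 columns of analog MAC partial sums (assumed voltages in [0, 1) V).
partial_sums = rng.random(64)
digital = shared_adc_readout(partial_sums, adc_bits=3)

# The low-bit quantization error that retraining or a fault-tolerant
# algorithm must absorb is bounded by half an LSB.
max_err = np.abs(digital - partial_sums).max()
print(f"max quantization error with a 3-bit shared ADC: {max_err:.4f} V")
```

Sharing one converter trades conversion latency (64 sequential steps) for a large cut in ADC area, which is the proportion-of-peripherals trade-off the text describes.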
Fig. 27. (Color online) Multithreaded CIM macro based on a pipeline processor.
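Point 2) of Section 6.5.2 rests on the observation that adjacent CNN calculations share most of their input data. A small Python check of that overlap; the 3×3 kernel, stride of 1, and 8×8 feature map are illustrative assumptions, not values from the text:

```python
import numpy as np

def window_overlap_fraction(kernel, stride):
    """Fraction of a kernel x kernel input window that is reused when the
    window slides horizontally by `stride` pixels."""
    shared_cols = max(kernel - stride, 0)
    return shared_cols / kernel

# A 3x3 window with stride 1 reuses two of its three columns, so only one
# new column of activations must be migrated per step.
frac = window_overlap_fraction(kernel=3, stride=1)
print(f"reused fraction per slide: {frac:.2f}")

# Cross-check with an explicit element count on a feature map.
fmap = np.arange(8 * 8).reshape(8, 8)
w0 = set(fmap[0:3, 0:3].ravel())       # window at column 0
w1 = set(fmap[0:3, 1:4].ravel())       # window after a one-pixel slide
assert len(w0 & w1) / len(w0) == frac  # 6 of the 9 elements are shared
```

A discontinuous storage mode that keeps the shared columns resident, combined with a fast-migration channel that loads only the one new column, is how the text proposes to exploit this reuse.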
tailed the different basic cell structures and peripheral auxiliary circuits of CIM. It also investigated various computing functions that can be realized by the existing CIM framework and their applications. Finally, the challenges encountered by current CIM macros based on SRAM and the future scope of CIM were analyzed. To improve the computational accuracy and capability of the CIM architecture, we recommend efficiently mapping the common operator set to a circuit set and optimizing the CIM process under spatiotemporal constraints. Enhancing the programmability of the CIM architecture will enhance its compatibility with general CPUs and enable its wide usage across industries.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018YFB2202602), the State Key Program of the National Natural Science Foundation of China (No. 61934005), the National Natural Science Foundation of China (No. 62074001), and the Joint Funds of the National Natural Science Foundation of China under Grant U19A2074.

References

[1] Si X, Khwa W S, Chen J J, et al. A dual-split 6T SRAM-based computing-in-memory unit-macro with fully parallel product-sum operation for binarized DNN edge processors. IEEE Trans Circuits Syst I, 2019, 66, 4172
[2] Khwa W S, Chen J J, Li J F, et al. A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors. 2018 IEEE International Solid-State Circuits Conference, 2018, 496
[3] Jaiswal A, Chakraborty I, Agrawal A, et al. 8T SRAM cell as a multibit dot-product engine for beyond von Neumann computing. IEEE Trans Very Large Scale Integr VLSI Syst, 2019, 27, 2556
[4] Lu L, Yoo T, Le V L, et al. A 0.506-pJ 16-kb 8T SRAM with vertical read wordlines and selective dual split power lines. IEEE Trans Very Large Scale Integr VLSI Syst, 2020, 28, 1345
[5] Lin Z T, Zhan H L, Li X, et al. In-memory computing with double word lines and three read ports for four operands. IEEE Trans Very Large Scale Integr VLSI Syst, 2020, 28, 1316
[6] Srinivasa S, Chen W H, Tu Y N, et al. Monolithic-3D integration augmented design techniques for computing in SRAMs. 2019 IEEE International Symposium on Circuits and Systems, 2019, 1
[7] Zeng J M, Zhang Z, Chen R H, et al. DM-IMCA: A dual-mode in-memory computing architecture for general purpose processing. IEICE Electron Express, 2020, 17, 20200005
[8] Ali M, Agrawal A, Roy K. RAMANN: in-SRAM differentiable memory computations for memory-augmented neural networks. Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, 2020, 61
[9] Agrawal A, Jaiswal A, Roy D, et al. Xcel-RAM: Accelerating binary neural networks in high-throughput SRAM compute arrays. IEEE Trans Circuits Syst I, 2019, 66, 3064
[10] Biswas A, Chandrakasan A P. CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. IEEE J Solid State Circuits, 2019, 54, 217
[11] Lin Z T, Zhu Z Y, Zhan H L, et al. Two-direction in-memory computing based on 10T SRAM with horizontal and vertical decoupled read ports. IEEE J Solid State Circuits, 2021, 56, 2832
[12] Wang J C, Wang X W, Eckert C, et al. A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing. IEEE J Solid State Circuits, 2020, 55, 76
[13] Wang J C, Wang X W, Eckert C, et al. A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration. 2019 IEEE International Solid-State Circuits Conference, 2019, 224
[14] Jiang H W, Peng X C, Huang S S, et al. CIMAT: a transpose SRAM-based compute-in-memory architecture for deep neural network on-chip training. Proceedings of the International Symposium on Memory Systems, 2019, 490
[15] Zhang J T, Wang Z, Verma N. In-memory computation of a machine-learning classifier in a standard 6T SRAM array. IEEE J Solid State Circuits, 2017, 52, 915
[16] Zhang J T, Wang Z, Verma N. A machine-learning classifier implemented in a standard 6T SRAM array. 2016 IEEE Symposium on VLSI Circuits, 2016, 1
[17] Jiang Z W, Yin S H, Seok M, et al. XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. 2018 IEEE Symp VLSI Technol, 2018, 173
[18] Agrawal A, Jaiswal A, Lee C, et al. X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories. IEEE Trans Circuits Syst I, 2018, 65, 4219
[19] Jeloka S, Akesh N B, Sylvester D, et al. A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory. IEEE J Solid State Circuits, 2016, 51, 1009
[20] Dong Q, Jeloka S, Saligane M, et al. A 4+2T SRAM for searching and in-memory computing with 0.3-V VDDmin. IEEE J Solid State Circuits, 2018, 53, 1006
[21] Rajput A K, Pattanaik M. Implementation of Boolean and arithmetic functions with 8T SRAM cell for in-memory computation. 2020 International Conference for Emerging Technology, 2020, 1
[22] Jaiswal A, Agrawal A, Ali M F, et al. I-SRAM: Interleaved wordlines for vector Boolean operations using SRAMs. IEEE Trans Circuits Syst I, 2020, 67, 4651
[23] Surana N, Lavania M, Barma A, et al. Robust and high-performance 12-T interlocked SRAM for in-memory computing. 2020 Design, Automation & Test in Europe Conference & Exhibition, 2020, 1323
[24] Simon W A, Qureshi Y M, Rios M, et al. BLADE: an in-cache computing architecture for edge devices. IEEE Trans Comput, 2020, 69, 1349
[25] Chen J, Zhao W F, Ha Y J. Area-efficient distributed arithmetic optimization via heuristic decomposition and in-memory computing. 2019 IEEE 13th International Conference on ASIC, 2019, 1
[26] Lee K, Jeong J, Cheon S, et al. Bit parallel 6T SRAM in-memory computing with reconfigurable bit-precision. 2020 57th ACM/IEEE Design Automation Conference, 2020, 1
[27] Simon W, Galicia J, Levisse A, et al. A fast, reliable and wide-voltage-range in-memory computing architecture. Proceedings of the 56th Annual Design Automation Conference, 2019, 1
[28] Chen H C, Li J F, Hsu C L, et al. Configurable 8T SRAM for enabling in-memory computing. 2019 2nd International Conference on Communication Engineering and Technology, 2019, 139
[29] Gupta N, Makosiej A, Vladimirescu A, et al. 1.56GHz/0.9V energy-efficient reconfigurable CAM/SRAM using 6T-CMOS bitcell. ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference, 2017, 316
[30] Sun X Y, Liu R, Peng X C, et al. Computing-in-memory with SRAM and RRAM for binary neural networks. 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology, 2018, 1
[31] Jiang Z W, Yin S H, Seo J S, et al. C3SRAM: in-memory-computing SRAM macro based on capacitive-coupling computing. IEEE Solid State Circuits Lett, 2019, 2, 131
[32] Si X, Chen J J, Tu Y N, et al. A twin-8T SRAM computation-in-
[68] Xue C X, Zhao W C, Yang T H, et al. A 28-nm 320-kb TCAM macro using split-controlled single-load 14T cell and triple-margin voltage sense amplifier. IEEE J Solid State Circuits, 2019, 54, 2743
[69] Jiang H W, Liu R, Yu S M. 8T XNOR-SRAM based parallel compute-in-memory for deep neural network accelerator. 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems, 2020, 257
[70] Kang M G, Shanbhag N R. In-memory computing architectures for sparse distributed memory. IEEE Trans Biomed Circuits Syst, 2016, 10, 855
[71] Jain S, Lin L Y, Alioto M. ±CIM SRAM for signed in-memory broad-purpose computing from DSP to neural processing. IEEE J Solid State Circuits, 2021, 56, 2981
[72] Yue J S, Feng X Y, He Y F, et al. A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating. 2021 IEEE International Solid-State Circuits Conference, 2021, 238
[73] Yang X X, Zhu K R, Tang X Y, et al. An in-memory-computing charge-domain ternary CNN classifier. 2021 IEEE Custom Integrated Circuits Conference, 2021, 1
[74] LeCun Y. Deep learning hardware: Past, present, and future. 2019 IEEE International Solid-State Circuits Conference, 2019, 12
[75] Chih Y D, Lee P H, Fujiwara H, et al. 16.4 An 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications. 2021 IEEE International Solid-State Circuits Conference, 2021, 252
[76] Sie S H, Lee J L, Chen Y R, et al. MARS: multi-macro architecture SRAM CIM-based accelerator with co-designed compressed neural networks. IEEE Trans Comput Aided Des Integr Circuits Syst, 2021, in press
[77] Agrawal A, Kosta A, Kodge S, et al. CASH-RAM: Enabling in-memory computations for edge inference using charge accumulation and sharing in standard 8T-SRAM arrays. IEEE J Emerg Sel Top Circuits Syst, 2020, 10, 295
[78] Yue J S, Yuan Z, Feng X Y, et al. A 65nm computing-in-memory-based CNN processor with 2.9-to-35.8TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse. 2020 IEEE International Solid-State Circuits Conference, 2020, 234
[79] Lin Z T, Zhan H L, Chen Z W, et al. Cascade current mirror to improve linearity and consistency in SRAM in-memory computing. IEEE J Solid State Circuits, 2021, 56, 2550
[80] Lin Z T, Fang Y Q, Peng C Y, et al. Current mirror-based compensation circuit for multi-row read in-memory computing. Electron Lett, 2019, 55, 1176
[81] Kim Y, Kim H, Park J, et al. Mapping binary resnets on computing-in-memory hardware with low-bit ADCs. 2021 Design, Automation & Test in Europe Conference & Exhibition, 2021, 856

Zhiting Lin (SM'16) received the B.S. and Ph.D. degrees in electronics and information engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively. From 2015 to 2016, he was a visiting scholar with the Engineering and Computer Science Department, Baylor University, Waco, TX, USA. In 2011, he joined the Department of Electronics and Information Engineering, Anhui University, Hefei, Anhui. He is currently a professor at the Department of Integrated Circuit, Anhui University. He has published about 50 articles and holds over 20 Chinese patents. His research interests include pipeline analog-to-digital converters and high-performance static random-access memory.

Xiulong Wu received the B.S. degree in computer science from the University of Science and Technology of China (USTC), Hefei, China, in 2001, and the M.S. and Ph.D. degrees in electronic engineering from Anhui University, Hefei, in 2005 and 2008, respectively. From 2013 to 2014, he was a visiting scholar with the Engineering Department, The University of Texas at Dallas, Richardson, TX, USA. He is a professor at Anhui University. He has published about 60 articles and holds over 10 Chinese patents. His research interests include high-performance static random-access memory and mixed-signal ICs.