
Reconfigurable 2T2R ReRAM Architecture For Versatile Data Storage and Computing In-Memory

The document proposes a reconfigurable 2T2R ReRAM architecture that can support TCAM, LiM, and IM-DP operations. It allows the architecture to operate as both a 2T2R and 1T1R configuration. The proposed LiM full adder improves delay, static power, and dynamic power compared to state-of-the-art designs. Different optimization techniques are used to reduce energy for TCAM search and 1T1R access.


2636 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 28, NO. 12, DECEMBER 2020

Reconfigurable 2T2R ReRAM Architecture for Versatile Data Storage and Computing In-Memory

Yuzong Chen, Lu Lu, Student Member, IEEE, Bongjin Kim, Member, IEEE, and Tony Tae-Hyoung Kim, Senior Member, IEEE

Abstract: Nonvolatile memory (NVM)-based computing in-memory (CIM) is a promising solution to data-intensive applications. This work proposes a 2T2R resistive random access memory (ReRAM) architecture that supports three types of CIM operations: 1) ternary content addressable memory (TCAM); 2) logic in-memory (LiM) primitives and arithmetic blocks such as full adder (FA) and full subtractor; and 3) in-memory dot-product for neural networks. The proposed architecture allows the NVM operations in both 2T2R and conventional 1T1R configurations. The proposed LiM full adder (LiM-FA) improves the delay, the static power, and the dynamic power by 3.2×, 1.2×, and 1.6×, respectively, compared with state-of-the-art LiM-FAs. Furthermore, based on different optimization techniques and robustness analysis, a lower precharge voltage is set for each mode. This reduces the TCAM search energy and 1T1R ReRAM access energy by 1.6× and 1.14×, respectively, compared with the case without optimizations.

Index Terms: Computing in-memory (CIM), nonvolatile memory (NVM), resistive random access memory (ReRAM), robustness analysis, ternary content addressable memory (TCAM).

Fig. 1. Traditional von-Neumann architectures. (a) Block diagram, (b) latency, and (c) energy cost of a 32-bit integer ALU addition and different stages of data movements.
Manuscript received May 28, 2020; revised August 31, 2020; accepted September 25, 2020. Date of publication October 20, 2020; date of current version November 24, 2020. This work was supported by RIE2020 ASTAR AME IAF-ICP Grant under Grant I1801E0030. (Corresponding author: Yuzong Chen.)
The authors are with the Centre for Integrated Circuits and Systems (CICS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2020.3028848
1063-8210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National Central University. Downloaded on October 19,2023 at 09:10:13 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

As more attention in our daily life shifts toward emerging applications such as machine learning and big-data processing, existing computing paradigms face unprecedented challenges in executing required tasks with high energy efficiency. The scaling trend of CMOS performance has slowed down because of the power wall and slower voltage scaling. Moreover, traditional von-Neumann computing architecture suffers from long latency and high energy consumption because of expensive data movements between memory and arithmetic-logic units (ALUs). These latency and energy overheads become more severe as the memory hierarchy goes from a register file to cache, main memory (e.g., DRAM), and nonvolatile storage (e.g., FLASH). As shown in Fig. 1, a typical ALU operation (e.g., 32-bit integer addition) takes less than 1 ns and consumes less than 1 pJ while a data movement from a register file and cache takes comparable latency [1] but dissipates tens of pJ [2]. When considering off-chip memory, the latency and energy can increase to tens of ns [1] and a few nJ [3], respectively.

To continuously improve computing performance and energy efficiency, people have focused on beyond-CMOS devices and beyond-von-Neumann architectures. For example, resistive random access memory (ReRAM) [4] is an attractive nonvolatile memory (NVM) candidate for the next-generation storage system. It shows good read/write performance, low programming voltage, good scalability, and compatibility with the CMOS fabrication process. It also allows monolithic 3-D integration [5], [6] with logic devices to achieve high integration density and fine-grained connectivity between logic and memory circuits.

Meanwhile, beyond-von-Neumann architectures such as computing in-memory (CIM) are also under active research to tackle the expensive latency and energy cost associated with data movements. Various CIM works have been reported in the literature [7]–[20]. The ternary content addressable memory (TCAM) operations in [7] and [16]–[20] perform parallel bit-wise XOR/XNOR between the search key and the stored data to achieve fast search. The logic in-memory (LiM) operations in [7] and [10]–[12] realize Boolean logic functions between two words in/near memory without reading out operands and sending them to the ALU. Application-specific CIM designs executing in-memory dot-product (IM-DP) for neural networks are available in [8], [9], and [13]–[15]. While many CIM designs support only one functionality in addition to the normal memory functionality, several research works implement two or more types of the aforementioned CIM functions on the same array [7], [20] with various tradeoffs such as cell area and data placement strategy. Instead of designing separate CIM macros for each required task, a multifunctional CIM macro can use the same memory array with combined peripheral supports to accelerate different tasks. This improves the versatility of the memory and reduces the cost.
Given the advantages of ReRAM and CIM, it brings significant value to implement robust and high-density ReRAM-based CIM (R-CIM) systems. However, some prior works only focused on the functionality of R-CIM without using conventional memory architectures (i.e., a memory array with decoders). For example, the CIM processor in [20] is based on a field-programmable gate array (FPGA)-like architecture and requires a customized compilation framework to map the dataflow. Besides, it requires three arrays (two as row and column decoders) to implement the storage of one array in the memory mode, leading to large hardware overheads. Zheng et al. [16] and Chang et al. [17] reported reliable ReRAM-based TCAMs because of the high ION-to-IOFF ratio of the additional access transistors in the bit-cell. However, the large cell area (i.e., 5T2R) prevents them from being utilized in high-density storage systems. Ly et al. [18] extensively characterized the robustness of a high-density 2T2R TCAM, but the analysis did not consider circuit-level variation sources such as the sense amplifier (SA) offset.

To address these limitations, this article proposes a reconfigurable 2T2R R-CIM architecture. It can support TCAM, LiM, and IM-DP operations. Besides, the 2T2R architecture can also operate as a conventional 1T1R ReRAM under the situations where CIM operations are not required. We optimize the proposed architecture using existing and novel design techniques. We propose a novel LiM full adder (LiM-FA) which is more efficient compared with state-of-the-art LiM-FA [10], [12] in terms of the delay (3.2×), the static power (1.2×), and the dynamic power (1.6×). To characterize the robustness of the proposed R-CIM system, we first provide a quantitative approach for the sensing margin with respect to the precharge voltage and the ReRAM ON/OFF ratio. Combining the proposed optimizations with robustness analysis, we can set a lower precharge voltage for each operation mode and this reduces the TCAM search energy and the 1T1R ReRAM access energy by 1.6× and 1.14×, respectively.

The rest of this article is organized as follows. Section II provides a background of ReRAM and the challenges associated with R-CIM. Section III introduces the proposed 2T2R R-CIM architecture and explains its operations including TCAM, LiM, and IM-DP operations as well as the configurable data storage. Section IV presents several optimizations for the proposed R-CIM using existing and novel design techniques. In Section V, we evaluate the proposed R-CIM system based on optimizations and robustness analysis. And finally, we conclude this article in Section VI.

II. RERAM BASICS AND DESIGN CHALLENGES

A. ReRAM Device and 1T1R Bit-Cell

A ReRAM device is typically formed by a metal–insulator–metal stack as shown in Fig. 2(a). It can switch from a high-resistance state (HRS) to a low-resistance state (LRS) by the SET operation and LRS to HRS by the RESET operation. ReRAM devices have two types of switching: 1) unipolar switching where the switching direction depends on the amplitude of the applied programming voltage but not on the voltage polarity and 2) bipolar switching where the switching direction also depends on the polarity of the programming voltage. This article focuses on the bipolar ReRAM device.

Fig. 2. (a) Three-layer ReRAM device. (b) I–V characteristic of a bipolar ReRAM device with LRS = 10 kΩ and HRS = 1 MΩ. (c) Schematic and biasing conditions for different operations of the 1T1R bipolar ReRAM bit-cell.

For simulation and analysis, we developed a Verilog-A compact model like [22] for the ReRAM device. The model is based on the conductive filament switching mechanism [23]. The I–V relationship of the ReRAM model can be expressed as

\[ I = I_0 \exp\left(-\frac{g}{g_0}\right) \sinh\left(\frac{V}{V_0}\right) \tag{1} \]

where g is the conductive filament gap distance and V is the voltage applied to the ReRAM device. I0, g0, and V0 are fitting parameters. Fig. 2(b) shows the I–V characteristic curve of a bipolar ReRAM device used in this work with LRS = 10 kΩ and HRS = 1 MΩ. The ReRAM device is integrated with transistors in 40-nm technology.

One common way to realize a ReRAM bit-cell is to integrate a ReRAM device with one transistor (1T1R). Fig. 2(c) shows the 1T1R bit-cell schematic and the biasing conditions of the word-line (WL), the bit-line (BL), and the source-line (SL) for different operations. The read operation can be done in two different sensing ways: voltage-mode and current-mode. Current-mode sensing applies prefixed read voltage to BL and measures the generated current at BL. It is faster than voltage-mode sensing for long BL length [21]. Voltage-mode sensing precharges BL to a level (VREAD) and discharges at different rates depending on the ReRAM’s resistance state. Generally, the voltage-mode sensing consumes less energy and is suitable for low-power applications [21]. In this article, we employ the voltage-mode sensing scheme.
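As a rough illustration of voltage-mode sensing, the sketch below treats the precharged line as a capacitor discharging through the cell resistance, V(t) = VREAD·exp(−t/(R·C)). The LRS and HRS values (10 kΩ and 1 MΩ) are the ones quoted for Fig. 2(b); the 100 fF line capacitance, the 0.5 V precharge level, the 1 ns sensing time, and the function names are hypothetical choices made only for this example, and the access-transistor resistance is ignored.

```python
import math

LRS_OHM = 10e3      # low-resistance state, value quoted in the text
HRS_OHM = 1e6       # high-resistance state, value quoted in the text
C_BL_F = 100e-15    # assumed line capacitance (hypothetical)
V_READ = 0.5        # assumed precharge voltage in volts (hypothetical)


def bl_voltage(t_s, r_ohm, v0=V_READ, c_f=C_BL_F):
    """Voltage left on a precharged line after t_s seconds of RC discharge."""
    return v0 * math.exp(-t_s / (r_ohm * c_f))


def sense(t_s, r_ohm, v_ref):
    """Return 'HRS' if the line is still above v_ref at time t_s, else 'LRS'."""
    return "HRS" if bl_voltage(t_s, r_ohm) > v_ref else "LRS"


if __name__ == "__main__":
    t = 1e-9  # assumed sensing time (hypothetical)
    for name, r in (("LRS", LRS_OHM), ("HRS", HRS_OHM)):
        print(f"{name}: V({t:.0e} s) = {bl_voltage(t, r):.3f} V")
```

With these assumed numbers the LRS time constant is about 1 ns and the HRS time constant about 100 ns, so a reference voltage placed between the two curves separates the states shortly after the word-line is asserted.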

B. Challenges of R-CIM

The first challenge of R-CIM comes from the limited HRS-to-LRS ratio (HLR) of the ReRAM device. Fig. 3 shows the relationship between LRS and the set voltage, and HRS and the reset voltage of the ReRAM devices in several published ReRAM works [18], [24]–[33]. All listed technologies have HLRs > 10 while several technologies present HLRs > 100 [18], [24]–[27]. Although HLR of ReRAM is typically higher than that of spin-transfer torque memory [10], it is still much smaller than that of SRAM (>10^5), leading to a significant degradation in the sensing margin. This phenomenon is more obvious when multiple rows are activated in CIM where the equivalent HLR is lower than that of a single ReRAM bit-cell [18], [20].

Fig. 3. (a) LRS resistance versus set voltage and (b) HRS resistance versus reset voltage in recent ReRAM technologies.

Realizing set and reset voltages lower than the nominal supply voltage of the main-stream fabrication technology is another challenge to be addressed. Recently, several works have achieved low set and reset voltages (~0.6 V) for ReRAM at advanced technology nodes [33], [34]. In this work, we set the target SET and RESET voltages as 0.7 V after comprehensive SPICE simulations using 40-nm technology. The key parameters of the ReRAM device used for simulation in this work are summarized in Table I.

TABLE I. RERAM DEVICE PARAMETERS FOR SIMULATION

Another challenge comes from the read-disturb. As depicted in Fig. 2(b), VREAD is applied to BL during a read operation. This is similar to the set condition except that VREAD and VWL-READ are smaller than VSET and VWL-SET, respectively. This bias condition does not affect the resistance of an ideal ReRAM device. However, the resistance of a realistic ReRAM device is partially affected by VREAD applied on BL, which is called a read-disturb. The read-disturb also occurs when VREAD is applied to SL [35], [38], too. To prevent such read disturbance, it is necessary to limit the upper bound of VREAD. Chien et al. [31] and Lv et al. [36] suggested an upper bound of 0.5 V for VREAD without having a disturbance on HRS. In [18], the TCAM search endurance is improved from 90k to 450k cycles by lowering VREAD from 0.6 to 0.4 V. However, VREAD cannot be too low when considering the BL sensing margin, as a higher VREAD is necessary to provide more sensing margin when the HLR is small [37].

III. PROPOSED R-CIM ARCHITECTURE

A. 2T2R ReRAM Bit-Cell

Fig. 4(a) shows the 2T2R bit-cell for the proposed R-CIM architecture. It consists of two 1T1R bit-cells sharing a common BL, which is similar to several prior works using a common-SL scheme [35], [38]. Each row in a ReRAM array shares a WL pair (WLL and WLR) while each column shares a BL and an SL pair (SL and SLB). The ReRAM device pair (Q, QB) in the bit-cell represents data “1” with (Q, QB) = (HRS, LRS) and data “0” with (Q, QB) = (LRS, HRS). For TCAM operations, the additional “don’t care” state (“X”) is represented by (Q, QB) = (HRS, HRS) [16]–[20]. When the proposed bit-cell is used for normal NVM storage, the complementary representation provides significant benefits such as lower bit-error rate and faster sensing compared with the conventional 1T1R ReRAM [13].

Fig. 4. Proposed 2T2R bit-cell. (a) Schematic, (b) read operation, and (c) writing operation for writing data “1.”

Fig. 4(b) explains the read operation of the 2T2R bit-cell. To read the data, BL is grounded, and SL and SLB are precharged to VREAD and left floating. Once WLL and WLR are asserted, SL and SLB will discharge at different rates depending on the stored data. After the voltages of SL and SLB develop for some time, a differential SA is enabled and detects this voltage difference to output the data.
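The complementary encoding and the differential read described for the 2T2R bit-cell can be summarized in a small behavioral sketch. The (Q, QB) table is taken from the text; the function names are hypothetical, and the model assumes that Q discharges SL, QB discharges SLB, and the SA reports “1” when SL stays higher than SLB. It captures only the ordering of the two discharge rates, not real waveforms.

```python
HRS, LRS = "HRS", "LRS"

# (Q, QB) encoding from the text; "X" is the TCAM don't-care state.
ENCODE = {"1": (HRS, LRS), "0": (LRS, HRS), "X": (HRS, HRS)}
RESISTANCE_OHM = {LRS: 10e3, HRS: 1e6}  # values quoted for Fig. 2(b)


def write_cell(symbol):
    """Return the (Q, QB) resistance states that store the given symbol."""
    return ENCODE[symbol]


def differential_read(cell):
    """Behavioral 2T2R read: the line behind the higher resistance
    discharges more slowly and therefore ends at the higher voltage."""
    q, qb = cell
    r_q, r_qb = RESISTANCE_OHM[q], RESISTANCE_OHM[qb]
    if r_q == r_qb:
        # (HRS, HRS) gives no differential signal; it is only meaningful
        # as a TCAM don't-care, not as stored binary data.
        raise ValueError("no differential signal: cell holds no binary data")
    return "1" if r_q > r_qb else "0"


if __name__ == "__main__":
    for bit in ("0", "1"):
        print(bit, write_cell(bit), differential_read(write_cell(bit)))
```

Because the two lines always discharge through opposite states for binary data, the read does not depend on an absolute reference voltage, which is one way to see the bit-error-rate benefit claimed for the complementary representation.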


The 2T2R ReRAM bit-cell requires two cycles for writing data. Fig. 4(c) shows an example of writing data “1.” Since the bit-cell contains two identical 1T1R cells, the same biasing conditions for writing the 1T1R bit-cell can be utilized. However, the shared BL architecture can disturb the ReRAM device programmed in the first cycle. To address this issue, we propose to erase both ReRAM devices by setting them in the first cycle. In the second cycle, either Q or QB is reset to HRS depending on the data to be written. The LRS device is not disturbed in the second cycle since the corresponding SL and BL are grounded.

B. TCAM Operations

TCAM is a critical component in many systems where fast searching is required. The proposed 2T2R ReRAM can operate as a 2T2R TCAM like the ones in [18] and [19] by storing words column-wise (Fig. 5). For TCAM search operation, BL is precharged while SL and SLB are grounded, which is different from the NVM access mode where SL and SLB are precharged and BL is grounded. Search data and inverted search data are applied to WLL and WLR, respectively. If all bits are matched, BL stays at the precharged level or discharges slowly because of the leakage current. If there is a mismatch, one of BL and BLB, or both BL and BLB will discharge quickly through the ReRAM device in LRS. The SA compares the BL voltage against a reference voltage VREF and generates the search result.

Fig. 5. TCAM operations. (a) Search example (the right column is a mismatch). (b) Simulated waveforms for match and 1-b-mismatch cases of 2T2R TCAM, the 1-b mismatch curve is only shown for WDL = 32 since it is dominated by LRS and is similar for different WDLs. (c) Write example (write data = (0, X)).

Fig. 5(a) explains an example of searching (0, 1). The left column stores (X, 1) and the right column stores (1, 1). Since the first column stores the matched data, BL[1] will be slowly discharged through the leakage current of the bit-cells in HRS, which is recognized as match[1] = “1.” However, BL[2] will be discharged below VREF quickly through the bit-cell in LRS and produce a search result indicated by match[2] = “0.” If more bits are mismatched, the discharge speed will be higher. Therefore, the worst case scenario is when a 1-bit mismatch occurs. It should be noted that the leakage current during the match case limits the maximum TCAM word-length (WDL). Fig. 5(b) shows the BL voltage waveforms under a 1-bit mismatch case and match cases for different WDLs when BL precharge voltage = 0.5 V. With the increased WDL, it becomes more difficult to distinguish between the match case and 1-bit mismatch case due to the reduced voltage difference.

The write operation for TCAM takes two phases to write a word column-wise. HRS states are written in the first phase and LRS states are written in the second phase. An additional column decoder is necessary to select a column to be written. Since multiple cells in a column share the same BL and SL/SLB, the number of cells that can be written in parallel per cycle in each phase depends on the strength of the BL and SL/SLB drivers [39]. Therefore, it may take multiple cycles to program one TCAM word given the area constraints of TCAM write drivers. However, many TCAM applications such as neuromorphic circuits require infrequent writes, and the proposed TCAM is well-suited for such applications due to its nonvolatile feature that consumes zero standby power [18]. Fig. 5(c) illustrates writing a two-bit string (0, X). In phase 1, BL = 0 and SL = SLB = VRESET. In phase 2, BL = VSET and SL = SLB = 0. The WLL and WLR states in each phase are determined by the data to be written as shown in Table II.

C. LiM Operation

The proposed 2T2R structure can also compute Boolean logic functions between two words (X and Y) in memory. We utilize two address decoders to turn on two rows at the same time and use two single-ended SAs to generate LiM
primitives. As shown in Fig. 6(a), the AND/NAND operations can be performed by activating WLLs of two rows. If X or Y is in LRS, SL will discharge quickly. If both are in HRS, SL will discharge much more slowly. The left SA compares the SL voltage against VREF and generates the AND/NAND results. The NOR/OR operations can be performed similarly by activating WLRs of two rows. The right SA compares the SLB voltage against VREF and produces the NOR/OR results. The XOR operation can be realized by connecting the AND/NOR results to a NOR gate. These primitives allow a LiM-FA to be implemented with a few additional logic gates [10], [12]. The FA operation can be computed in one cycle since the proposed 2T2R array structure simultaneously calculates all required primitives for FA as indicated by the FA equation in Fig. 6(c).

TABLE II. BIASING FOR TCAM WRITE OPERATION

Fig. 6. LiM operations between two words X and Y. (a) By precharging SL/SLB and grounding BL. (b) By precharging BL and grounding SL/SLB. (c) FA and FS equations.

The proposed 2T2R structure can also compute the logic primitives required by a full subtractor (FS) by precharging and sensing BL, and grounding SL/SLB. As shown in Fig. 6(b), by activating WLL for row X and WLR for row Y, the two SAs can both implement $X\overline{Y}$ and $(\overline{X} + Y)$ in one cycle. Similarly, by activating WLR for row X and WLL for row Y, the two SAs can both implement $\overline{X}Y$ and $(X + \overline{Y})$ in one cycle. Note that $X\overline{Y}$ and $\overline{X}Y$ are both required as indicated by the FS equation in Fig. 6(c), but they cannot be computed simultaneously. Therefore, the FS will take two cycles to complete by latching the SA results separately (e.g., latch $X\overline{Y}$ to the left SA in the first cycle and latch $\overline{X}Y$ to the right SA in another cycle) to get all required primitives for FS.

D. IM-DP for Binary Neural Networks

Neural networks are powerful tools to achieve state-of-the-art results for many smart applications such as computer vision and natural language processing. But they consume significant storage and computation resources. Recently, binarized neural networks (BNNs) [40] have been proposed to reduce storage and computation demands by restricting weights and input activations (IAs) to +1 and −1. As a result, the dot-product operation (the dominant operation in neural networks) in BNNs is replaced by the simple XNOR-popcount operation.

Several works have implemented IM-DP for BNNs based on the 2T2R ReRAM structure [8], [13]. Chen et al. [8] used multilevel current-mode SAs to accomplish this, while Bocquet et al. [13] used differential voltage-mode sensing. However, unlike the proposed architecture, the above 2T2R ReRAMs do not support other functions.

Fig. 7. IM-DP for BNNs. (a) Sensing scheme. (b) Simplified architecture.

Fig. 7 explains how the proposed ReRAM architecture executes IM-DP for BNNs. The ReRAM array implemented with the proposed 2T2R bit-cells stores weights (W) in a complementary format. Additional pass transistors controlled by the BNN layer’s IAs (represented by signal F in a complementary format) connect SL and SLB to a differential SA as in [13]. Therefore, the XNOR/XOR between the weight and the IA can be computed during a differential read as shown in Fig. 7(a). Fig. 7(b) shows the simplified architecture to perform the IM-DP. All BLs are grounded (not drawn in Fig. 7 for simplicity). In each cycle, one row is activated by enabling its WLL and WLR. The bitwise XNOR/XOR results between weights and IAs are computed by the differential SA and sent to a digital Wallace tree adder to perform the popcount operation. Although this approach has a throughput degradation compared with other implementations of IM-DP that activate many rows at the same time [8], [9], this purely digital implementation does not need analog-to-digital converters (ADCs), which incur high area overhead. For example, the work in [9] requires 64 bit-cells in every column to implement a 5-bit column ADC.

E. Configurable Data Storage

Although the proposed 2T2R structure can perform different types of CIM operations, a system may need the ReRAM as a pure storage block, e.g., program instruction storage. In such situations, it is desirable to minimize the area overhead coming from the 2T2R bit-cell structure. To address this issue, the
proposed 2T2R bit-cell can be reconfigured so that it can operate as two conventional 1T1R bit-cells. Fig. 8 shows a diagram of the read operation in the 1T1R mode. While the array has 4 rows × 2 columns in the 2T2R mode, it is configured into 8 rows × 2 columns in the 1T1R mode by configuring one 2T2R bit-cell into two symmetric 1T1R bit-cells sharing one BL. The SL pair SL/SLB in the 2T2R mode becomes SLs in the 1T1R mode. Although the two SLs in one column are not physically connected, they have the same column index.

Fig. 8. Read operation of the 2T2R structure configured as a 1T1R array.

Fig. 9. Schematic of the reconfigurable SA modified from [7] in the TCAM or LiM mode. Two switches “sw_on” are always ON.
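The row doubling in the 1T1R mode can be pictured as a simple address split. The sketch below assumes the least-significant address bit selects between WLL and WLR while the remaining bits select the physical 2T2R row, which is the ADDR[0] / ADDR[2:1] convention used for Fig. 8; the function name and the 4-row array size are only illustrative.

```python
def map_1t1r_row(row_1t1r):
    """Map a 1T1R row address to (2T2R row index, word-line side)."""
    row_2t2r = row_1t1r >> 1                     # upper bits pick the 2T2R row
    side = "WLR" if row_1t1r & 1 else "WLL"      # LSB picks the half of the cell
    return row_2t2r, side


ROWS_2T2R = 4  # array height in the 2T2R mode, as in the 4 x 2 example

# Every 1T1R address lands on a distinct (row, side) pair.
mapping = [map_1t1r_row(r) for r in range(2 * ROWS_2T2R)]

if __name__ == "__main__":
    for addr, (row, side) in enumerate(mapping):
        print(f"1T1R row {addr} -> 2T2R row {row}, {side}")
```

Under this convention the existing row decoder is reused unchanged, and only the word-line driver needs one extra input to choose which half of the 2T2R cell is accessed.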
The read operation follows that of the conventional 1T1R ReRAM by grounding SL and precharging BL. Since the number of rows is doubled, one additional row-address bit is necessary to differentiate WLR from WLL. In Fig. 8, the 2-bit address ADDR[2:1] is sent to the row decoder as for 2T2R read. The additional address bit ADDR[0] is sent to the word-line driver (WD) that contains WLL/WLR logic. If ADDR[0] = “0,” WLL (red line) is turned on in that row; if ADDR[0] = “1,” WLR (blue line) is turned on in that row. Although BL is connected to two single-ended SAs, we can use only one of them. The biasing for the write operation of the 1T1R configuration is identical to that of the 1T1R ReRAM as shown in Fig. 2(c).

Fig. 10. LiM-FA designs (a) before and (b) after optimization. (c) Implementation of an 8-bit RCA using LiM-FAs.

IV. OPTIMIZATION OF THE PROPOSED R-CIM

A. Reconfigurable SA

As explained in Section III, the interconnects between the R-CIM array and the SAs need to be controlled based on the selected functions. Fig. 9 shows the schematic of the reconfigurable SA modified from [7]. It consists of two pairs of cross-coupled inverters (SA1 and SA2). The 2:1 MUX connected to SL/SLB and BL is controlled by a select signal (BL_sel) and connects them to the SA. The output latches (L and R) are controlled by additional control signals “en_L” and “en_R” to sample the results of two single-ended SAs separately or together as required by the LiM-FA or FS operation. In the TCAM and 1T1R storage modes (i.e., BL_sel = “1” and diff = “0”), BL is compared with VREF in the SA as depicted in Figs. 5(a) and 8, respectively. In the LiM mode (i.e., diff = “0”), either SL/SLB or BL is compared with VREF as shown in Fig. 6(a) and (b), respectively. For 2T2R ReRAM read and IM-DP operations (i.e., BL_sel = “0” and diff = “1”), SA1 and SA2 are connected in parallel to operate as a differential SA.

The original design in [7] uses PMOS transistors controlled by “diff” and “diff_b.” Since PMOS can only pass a weak logic “0,” we use transmission-gate (TG) switches in our design to obtain strong differential outputs for both cross-coupled inverters. These four outputs provide all the primitives required for the LiM-FA and FS. Note that the TGs controlled by “diff” and “diff_b” in each small SA have a limited size for area efficiency. Therefore, their ability to pass voltages during sensing will affect the SA offset. Since one of them is ON during the sensing time, we put one additional switch (sw_on) that is always ON in each SA to compensate for the offset.
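To make the claim about the four SA outputs concrete, the following sketch checks at the logic level that AND, NAND, OR, and NOR of two stored bits are enough to build a full adder. It is a behavioral check only: the XOR construction (a NOR of the AND and NOR results) follows the text, the sum and carry expressions are the standard full-adder identities, and the function names are hypothetical. It does not reproduce the gate-level netlists of Fig. 10.

```python
from itertools import product


def sa_outputs(x, y):
    """The four single-ended SA outputs for stored bits x and y."""
    and_ = x & y
    or_ = x | y
    return {"AND": and_, "NAND": 1 - and_, "OR": or_, "NOR": 1 - or_}


def full_adder_from_primitives(x, y, c_in):
    """Build sum and carry-out from the SA outputs plus a few extra gates."""
    p = sa_outputs(x, y)
    xor_xy = 1 - (p["AND"] | p["NOR"])      # XOR = NOR(AND, NOR)
    s = xor_xy ^ c_in                       # sum = X xor Y xor C_in
    c_out = p["AND"] | (p["OR"] & c_in)     # carry = XY + (X + Y) C_in
    return s, c_out


# Exhaustive check against integer addition for all eight input combinations.
for x, y, c in product((0, 1), repeat=3):
    s, c_out = full_adder_from_primitives(x, y, c)
    assert 2 * c_out + s == x + y + c
```

Since all four primitives come out of a single array access, only the gates after the SA sit on the carry path, which is the path the FA optimization targets.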


B. Optimization of FA

Prior LiM-FA design [10], [12], as shown in Fig. 10(a), is based on the Boolean equations in Fig. 6(c). However, the carry propagation is the critical path in a ripple-carry adder (RCA) and therefore should be optimized to reduce latency. This work improves the latency of the carry-out by modifying the carry-out expression to be compatible with the proposed 2T2R array architecture. The Boolean equation for COUT can be written as a sum-of-products

\[ C_{OUT} = XY + XC_{IN} + YC_{IN} = XY + (X + Y)C_{IN} \tag{2} \]

where X and Y are two inputs, and CIN is carry-in. Using De Morgan’s theorem, (2) can be reorganized as

\[ C_{OUT} = \overline{\overline{XY + (X + Y)C_{IN}}} = \overline{\overline{XY} \cdot \overline{(X + Y)C_{IN}}}. \tag{3} \]

This shows that two primitives, $\overline{XY}$ and $(X + Y)$, can implement COUT using only two additional NAND gates. Note that the above primitives and the two primitives for computing the FA’s sum ($XY$ and $\overline{X + Y}$) are all available at the reconfigurable SA’s outputs as shown in Fig. 6(a). Fig. 10(a) and (b) shows the LiM-FA designs before and after optimization, respectively. The design before optimization requires a five-level delay for carry propagation while the optimized design requires only a two-level delay. The optimized FA can also be reused as an FS without changing the wiring between the reconfigurable SA’s outputs and the additional four logic gates. By chaining LiM-FAs, an RCA can be implemented as shown in Fig. 10(c).

C. Inherent SA Redundancy for TCAM and 1T1R Read

For 2T2R TCAM search operations, the sensing margin is severely degraded compared to the conventional CMOS TCAMs because of the large leakage current through the HRS devices in the match case [18], [19]. For 1T1R ReRAM read operations, the sensing margin also needs to be large due to single-ended sensing. Therefore, it is important to reduce the SA offset to enhance the robustness of TCAM search and 1T1R read operations.

In the proposed reconfigurable SA, BL is connected to two single-ended SAs in the TCAM mode and the 1T1R storage mode. This inherently allows the use of SA redundancy [41]. The main benefit of SA redundancy is that the exclusive selection of multiple small SAs can improve the offset compared to one large SA. Fig. 11 shows simulated SA offsets based on 10 000 Monte-Carlo points. The small SA is the single-ended SA1 or SA2 in the reconfigurable SA presented in Fig. 9. The large SA is the reconfigurable SA in differential mode (i.e., SA1 and SA2 connected in parallel). Each of the two small SAs has an offset of 12.5 mV while the large SA has an offset of 8.9 mV. By employing SA redundancy and selecting one SA with a smaller offset from the two small SAs, the new offset becomes 7.5 mV. The SA redundancy improves the offset by 1.67× and 1.19× compared to the small SA and the large

TABLE III. WLL/WLR LOGIC TABLE

D. Additional Circuits for Versatile CIM Functions

Implementation of versatile CIM functions on a conventional ReRAM array requires additional architectural support. We propose several architectural optimizations below.

1) WLL/WLR Logic: This logic block should be designed carefully since it is replicated in every row. In this work, we use a 2-bit operation code (OPC) and two row decoder outputs (RD_1[j] and RD_2[j]) to control WLL and WLR as shown in Fig. 12(a). RD_1[j] and RD_2[j] represent the outputs from two row decoders for row j. Table III shows the detailed logic table for WLL/WLR. The simplified logic expressions for WLL and WLR are “AX+BY” and “AY+BX,” respectively. Here, “A” and “B” represent the 2-bit OPC while “X” and “Y” are the outputs from the row decoders. These two simple representations allow the WLL/WLR logic to be designed with only a few logic gates and the area overhead can be small.

2) Multicycle TCAM Search: One challenge for high-density ReRAM arrays is that the SA cannot fit into the column pitch. Therefore, column multiplexing is generally employed. In this work, a column multiplexing ratio of four is used for a 256 × 128 array (Section V) to read out 32-bit words. Such a design choice increases the area efficiency but prevents parallel search in the TCAM mode. To better utilize the column multiplexing feature, we propose to store each TCAM word across multiple columns, which requires multiple cycles for a search operation. In addition, the maximum WDL in the TCAM mode depends on the HLR of the ReRAM device. For HLR = 100, WDL can be 32 [19]. Therefore, a 128-bit TCAM word needs to be stored across the first 32 rows and 4 columns in the 256 × 128 array. Other rows can still be used for other CIM operations. Note that this design choice depends on the application (the required TCAM WDL), the HLR of the ReRAM device, and the memory size. For a complete search, partial search results from four cycles are combined [39]. Fig. 12(b) shows an example where a 64-bit word is searched over two cycles. Two columns store the 64-bit word (i.e., 32 bits in each column). The partial result combiner after the SA consists of just an SR flip-flop and a NOR gate.

E. Sensing Margin Considerations

Fig. 13 shows the simulated read power breakdown of a conventional 1T1R ReRAM array using the DESTINY
SA, respectively. The improved offset obtained from the SA simulator [42]. The array size is 256 × 256 with a column
redundancy will be used for the evaluation of the proposed multiplexing ratio of two. The BL precharge voltage is set to
R-CIM in Section V. 0.3 V at room temperature. Due to the large BL capacitance,

Authorized licensed use limited to: National Central University. Downloaded on October 19,2023 at 09:10:13 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: RECONFIGURABLE 2T2R RERAM ARCHITECTURE FOR VERSATILE DATA STORAGE AND CIM 2643

BL precharge and discharge consume more than half of the read power. For a low precharge voltage, a full BL swing is generally required to get the maximum sensing margin across process variations [43]. Therefore, the energy consumed by BL switching can be expressed as

$$E_{BL} = C_{BL} \times V_{READ}^{2} \tag{4}$$

where CBL is the BL capacitance and VREAD is the BL precharge voltage. As described by (4), the reduction in VREAD can reduce the BL switching power. But this comes at the cost of BL sensing margin degradation. As explained in Section II, VREAD also disturbs the ReRAM device resistance during read operations and affects reliability. Therefore, VREAD needs to be determined carefully after considering power, sensing margin, speed, and read-disturb.

Fig. 11. SA offsets for two small SAs (SA1 and SA2 in Fig. 9), one large SA (SA1 and SA2 connected in parallel), and SA redundancy.

Fig. 12. Architectural support. (a) WLL/WLR logic block. It receives an OPC and results from two row decoders to control WLL and WLR of row j. (b) Example for multicycle TCAM that searches a 64-bit word in two cycles and combines partial results.

Fig. 13. Simulated power breakdown of ReRAM read with 0.3-V BL precharge voltage at room temperature.

In the voltage sensing mode, the sensing margin (VSM) is determined by the voltage difference between two BL voltage levels that need to be distinguished. These two voltage levels come from two RC discharging circuits with high resistance and low resistance, respectively, which is given in the following:

$$V_{SM}(t) = V_{READ} \cdot \left[ \exp\left(\frac{-t}{R_H C_{BL}}\right) - \exp\left(\frac{-t}{R_L C_{BL}}\right) \right] \tag{5}$$

Here, VSM(t) is the sensing margin with respect to time, VREAD is the precharge voltage, RH and RL are the high resistance and the low resistance seen from BL, and CBL is the BL capacitance. Note that RH and RL may not be the same as HRS and LRS. For example, for the TCAM search operation, RH and RL are the worst case equivalent resistance values seen from BL in the match case and the 1-bit mismatch case, respectively. When considering n-bit search data, RH is obtained from n HRS cells while RL is from (n − 1) HRS cells and one LRS cell. The derivative of (5) with respect to time t is as follows:

$$\frac{dV_{SM}}{dt} = V_{READ} \cdot \exp\left(\frac{-t}{R_L C_{BL}}\right) \cdot \frac{1}{R_L C_{BL}} - V_{READ} \cdot \exp\left(\frac{-t}{R_H C_{BL}}\right) \cdot \frac{1}{R_H C_{BL}} \tag{6}$$

To calculate the maximum VSM, (6) should be set to 0, which generates the delay to get the maximum VSM

$$t_{\max(V_{SM})} = \frac{R_H C_{BL}}{R_H/R_L - 1} \cdot \ln\left(\frac{R_H}{R_L}\right) \tag{7}$$

From (7), tmax(VSM) only relies on RH/RL, not on VREAD. Fig. 14(a) shows the normalized tmax(VSM) versus RH/RL with a specific time constant RH CBL. It can be observed that tmax(VSM) decreases exponentially when RH/RL increases. This indicates that when RH/RL is small, the SAs need more delay and the enabling time is sensitive to the variations in RH/RL. However, when RH/RL is relatively large, the SAs can be enabled earlier and tmax(VSM) is not sensitive to the variations in RH/RL, which is more desirable. After applying (7) back to (5), the maximum VSM becomes

$$\max(V_{SM}) = V_{READ} \cdot \left[ \left(\frac{R_H}{R_L}\right)^{\frac{1}{1 - R_H/R_L}} - \left(\frac{R_H}{R_L}\right)^{\frac{1}{R_L/R_H - 1}} \right] \tag{8}$$
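As a numerical cross-check of (4), (5), (7), and (8), the relations can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation flow used in this work: the function names are hypothetical, and the example values (CBL = 100 fF, RH = 1 MΩ, RL = 10 kΩ, VREAD = 0.3 V) are assumptions chosen for illustration. In particular, RH and RL seen from BL generally differ from the device HRS and LRS, as noted above.

```python
import math


def bl_switching_energy(c_bl, v_read):
    """Eq. (4): energy of a full BL swing, E_BL = C_BL * V_READ^2."""
    return c_bl * v_read ** 2


def sensing_margin(t, v_read, r_h, r_l, c_bl):
    """Eq. (5): gap between the slow (R_H) and fast (R_L) BL discharge curves."""
    return v_read * (math.exp(-t / (r_h * c_bl)) - math.exp(-t / (r_l * c_bl)))


def t_max(r_h, r_l, c_bl):
    """Eq. (7): sensing delay at which the margin of Eq. (5) peaks."""
    ratio = r_h / r_l
    return r_h * c_bl / (ratio - 1.0) * math.log(ratio)


def max_sensing_margin(v_read, r_h, r_l):
    """Eq. (8): peak margin, a function of V_READ and R_H/R_L only."""
    k = r_h / r_l
    return v_read * (k ** (1.0 / (1.0 - k)) - k ** (1.0 / (1.0 / k - 1.0)))


if __name__ == "__main__":
    # Assumed, illustrative values (not taken from the simulations in this work).
    v_read = 0.3           # BL precharge voltage in volts
    r_h, r_l = 1e6, 10e3   # equivalent high/low resistance seen from BL, in ohms
    c_bl = 100e-15         # BL capacitance in farads (hypothetical)

    t_opt = t_max(r_h, r_l, c_bl)
    print("t_max (s):", t_opt)
    print("V_SM at t_max (V):", sensing_margin(t_opt, v_read, r_h, r_l, c_bl))
    print("max V_SM from (8) (V):", max_sensing_margin(v_read, r_h, r_l))
    print("E_BL (J):", bl_switching_energy(c_bl, v_read))
```

Evaluating (5) at the delay given by (7) should reproduce (8), which is a quick way to confirm that the three expressions are mutually consistent for any chosen RH/RL.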


Fig. 14(b) shows the relationship between max(VSM) and RH/RL at different VREAD levels. When RH/RL is low, VSM shows a larger sensitivity to RH/RL. When RH/RL is high, VSM is less sensitive to RH/RL. In summary, it is desirable to provide high RH/RL for reliable BL sensing.

Fig. 14. (a) Normalized tmax(VSM) versus RH/RL. (b) max(VSM) versus RH/RL at different VREAD.

V. EVALUATION OF THE PROPOSED 2T2R ReRAM

In this section, we evaluate the proposed 2T2R ReRAM under different operating modes. We first analyze the area and energy overheads of the proposed R-CIM architecture using Hspice and a modified version of DESTINY [42]. Then we present how different optimization techniques help improve the robustness of the R-CIM system and reduce the energy consumption by setting a lower VREAD. The ReRAM device is modeled with Verilog-A and the model is calibrated using the data from [24] to get default LRS = 10 kΩ and HRS = 1 MΩ. The modeled ReRAM devices are integrated with transistors in 40-nm CMOS technology. We design an R-CIM array of 256 × 128 as shown in Fig. 15. Note that WLLs/WLRs are controlled by tristate buffers and signal “TCAM_EN.” In TCAM mode, WLLs/WLRs are driven by TCAM drivers instead of row decoders and the maximum allowed search WDL is 32 as explained in Section IV-D.

The sensing margin must be larger than the SA offset for reliable sensing. The SA offsets from Fig. 11 are summarized in Table IV. We consider SA offset variations of 4-sigma for an acceptable yield. Besides the SA offset, variations of access transistors of the 2T2R cell also need to be considered. We perform 1000 Monte-Carlo simulations for TCAM search and ReRAM access operations for different VREAD. Fig. 16(a) and (b) shows the simulation results for TCAM (VREAD = 0.5 V) and ReRAM access (VREAD = 0.3 V), respectively. The maximum variations of 32-bit match and 1-bit mismatch for TCAM have standard deviations equal to 5.8 and 7.3 mV, respectively. The variations due to access transistors for normal ReRAM read have smaller standard deviations. The transistor variations for LiM are similar to those of ReRAM read because they only access a small number of cells. The standard deviations for accessing HRS and LRS devices are 0.3 and 6.5 mV, respectively. As a result, TCAM has the worst VSM degradation due to transistor variations of the 2T2R bit-cell.

Fig. 15. Block diagram of the 256 × 128 array for simulation.

TABLE IV
OFFSETS OF DIFFERENT SA CONFIGURATIONS

In the following analysis, results for the achievable VSM are based on the voltage difference developed between the worst Monte-Carlo case of two voltage levels that need to be distinguished, e.g., the lowest curve of 32-bit match and the highest curve of 1-bit mismatch in Fig. 16(a). To achieve the maximum sensing margin, the time to enable sensing is calculated using (7). Note that (7) can be close to the optimal sensing time since the resistance due to access transistors is much smaller compared with the ReRAM resistances. For single-ended sensing, the reference voltage is put in the middle of two different voltage levels after determining the sensing time.

For ReRAM variations, the LRS value is 10 kΩ with a 20% variation through all simulations. We use different HRS values to generate HLRs of 150, 100, and 50 to evaluate VSM with respect to HRS variations. This HRS variation is adopted from [18] that characterizes a 2T2R TCAM and the reported ±2.5σ HRS corresponds to 50% variation with respect to the


mean HRS value. Note that our assumption on HRS is more pessimistic compared with [18] since we use the worst 50% variation for all HRS when HLR = 50.

Fig. 16. BL discharge variations due to access transistors for (a) TCAM and (b) ReRAM access operations.

A. Area and Energy Analysis

The proposed R-CIM system supports versatile operations at the cost of additional hardware resources and energy. Fig. 17(a) shows the area breakdown of the baseline (a 2T2R array without CIM support) and the proposed R-CIM architecture. Compared with the baseline, the R-CIM overheads (16.6%) due to TCAM, LiM, and IM-DP operations are 3.1%, 10.7%, and 2.1%, respectively. An additional 3.7% overhead comes from the read voltage generators and the output MUX logic that sends SA outputs to different near-memory blocks in different modes. TCAM overhead includes the tristate buffers (0.6%), the column decoder (1.2%), and the partial result combiner (1.3%). LiM overhead includes the optimized LiM-FA circuit (0.8%), the additional row decoder (2.2%), and the WLL/WLR logic block (7.7%). IM-DP overhead includes the XNOR circuit (0.3%) and the popcount circuit (1.8%). Note that the WLL/WLR logic incurs the highest area overhead since it needs to be replicated in every row. However, this overhead can be further mitigated if the number of columns increases.

Fig. 17. Overhead of the proposed R-CIM. (a) Area. (b) Energy.

Fig. 17(b) shows the energy overhead of various circuit blocks for different CIM operations. For TCAM search operations, the overhead, which comes from the partial result combiner, is 6.3% at VREAD = 0.6 V. LiM operations at VREAD = 0.3 V incur a slightly higher energy overhead (23.8%) compared to the baseline (one ReRAM read) due to the additional row decoder (7%), the WLL/WLR logic (1.5%), and the LiM-FA circuit (15.3%). However, the LiM energy overhead allows versatile LiM operations between two operands. With a naïve implementation that reads out two operands and sends them to ALUs, these operations will cost at least the energy of two ReRAM reads. For IM-DP at VREAD = 0.3 V, the energy overhead (17.4%) comprises 4% from the XNOR circuit and 13.4% from the popcount circuit.

B. TCAM Robustness Evaluation

Fig. 18(a) shows the achievable VSM for different VREAD in TCAM mode. As expected, increasing HLR improves VSM at a given VREAD. Also, VSM is more sensitive to HLR in TCAM mode since TCAM provides relatively small RH/RL as explained in Section IV-D. Without employing SA redundancy, the minimum voltage difference between BL and VREF needs to be at least 50 mV to overcome the SA offset (i.e., 12.5 mV, 4-sigma variation). Therefore, the minimum voltage difference between the match case and the 1-bit mismatch case should be twice the voltage difference between BL and VREF, which equals 100 mV. When employing the SA redundancy, the minimum VSM becomes 60 mV, which is a 1.67× improvement.

Since TCAM search gives low RH/RL, we further characterize its robustness under temperature variations. Fig. 18(b) shows the achievable VSM at different temperatures considering the nominal case (HLR = 100) and HRS degradation (HLR = 50). The achievable VSM degrades as the temperature


increases. With nominal HLR, low VREAD (e.g., 400 mV) still gives enough VSM at high temperatures. However, if considering HRS degradation, a high VREAD (e.g., 600 mV) must be used to provide enough VSM at high temperatures if not employing SA redundancy. On the contrary, our optimization employing SA redundancy reduces the VSM requirement and allows TCAM to operate at VREAD = 400 mV under high temperatures. Note that TCAM has the worst VSM, requiring higher VREAD. The other operations of the proposed ReRAM architecture can have lower VREAD.

The normalized TCAM search energy (including peripheral logics) with respect to VREAD is presented in Fig. 18(c). When HLR = 150 at room temperature, SA redundancy allows VREAD to be lowered from 300 to 200 mV as depicted in Fig. 18(a). This improves the TCAM search energy by 1.2×. When the HLR decreases because of HRS degradation, the SA redundancy can lower VREAD from 600 to 400 mV at 90 °C as shown in Fig. 18(b), improving the TCAM search energy by 1.6×. Note that a VREAD of 600 mV will generate significant read disturbance and degrade TCAM search endurance [18]. Therefore, instead of setting a default VREAD, it is necessary to adjust VREAD smartly depending on the HLR and its degradation to avoid the energy overhead coming from the overdesign.

Fig. 18. TCAM robustness evaluation. (a) VSM versus VREAD for different HLRs at room temperature. (b) VSM at different temperatures and HLRs. (c) Search energy (normalized) with respect to VREAD.

C. LiM Evaluation

We evaluate the robustness of the proposed LiM primitive operations (e.g., in-memory AND/NOR) as well as the performance and energy consumption of the LiM-FA. Fig. 19 shows the achievable VSM versus VREAD for LiM primitive operations. The VSM of LiM primitive operations is much larger than that of the TCAM search operation because of the increased RH/RL ratio obtained from the same HRS and LRS values. For example, when computing the in-memory AND function, RH is obtained from two HRS in parallel while RL is one LRS and one HRS in parallel, which is approximately one LRS. This causes only a 2× reduction in RH/RL compared to the HLR of the ReRAM device. Moreover, VSM is much less sensitive to HRS degradation because of the high RH/RL ratio (>15). Based on the proposed SA analysis, the target VSM is 100 mV without employing the SA redundancy, giving the minimum VREAD of 150 mV for HLR = 50.

Fig. 19. VSM versus VREAD for LiM primitive operations.

Table V compares various FA designs. The LiM-FA [10], [12] and CMOS FA [47] are simulated with 40-nm technology. Data for other designs are directly obtained from the relevant articles. Compared with other LiM-FAs [10], [12], the proposed LiM-FA achieves 3.2×, 1.2×, and 1.6× improvements in the delay, the static power, and the dynamic power, respectively. Compared with CMOS FA, the proposed LiM-FA has slightly worse dynamic power due to more levels of transition, but the delay and the static power are 1.34× and 8.9× better with fewer transistors. We also compare our LiM-FA with several FAs based on nonvolatile devices such as magnetic tunnel junctions (MTJs) [44], ferroelectric tunnel junctions (FTJs) [45], and ferroelectric field-effect transistors (FeFETs) [46]. Regarding performance, our LiM-FA is only slightly worse than the FA based on FeFET [46]. This is because the FA in [46] uses the dynamic logic design style with only a pull-down NMOS network and therefore has less capacitive load. The drawback is that it requires an additional clock signal. Regarding power consumption, the proposed LiM-FA consumes the least power among the compared FAs based on nonvolatile devices.

TABLE V
DELAY AND POWER CONSUMPTION OF DIFFERENT FA DESIGNS

D. ReRAM Read and IM-DP Evaluation

Fig. 20(a) shows the achievable VSM versus VREAD for ReRAM access. Like LiM primitive operations, the VSM for 1T1R ReRAM access is larger than that of TCAM. For 1T1R ReRAM read, only single-ended sensing using a small SA is allowed. Considering 4-sigma variation for SA offset, the target VSM for 1T1R access is 100 mV without employing SA redundancy, which is achievable with VREAD of 150 mV. However, if SA redundancy is employed, the target VSM becomes 60 mV and VREAD can be reduced to 100 mV. For 2T2R ReRAM read and IM-DP, the reconfigurable SA allows differential sensing. Therefore, the voltage difference between SL and SLB can be directly utilized as VSM rather than being degraded by a factor of 2 as in single-ended sensing. In this case, the SA offset sigma becomes 8.9 mV for a large differential SA and the target VSM is ∼40 mV. Therefore, VREAD can be set to 100 mV, which is the same as 1T1R read employing SA redundancy. Fig. 20(b) shows the normalized energy (including peripheral logics) versus VREAD for 1T1R ReRAM access. For 1T1R single-ended sensing, SA redundancy reduces VREAD from 150 to 100 mV and achieves 1.14× lower access energy.

Fig. 20. ReRAM read evaluation. (a) VSM versus VREAD at room temperature. (b) Normalized 1T1R read energy versus VREAD.

E. Discussion and Comparison With Prior Works

Table VI summarizes the VREAD used in this work and recent ReRAM and R-CIM works based on voltage-mode sensing. The proposed R-CIM structure supports versatile CIM operations using voltage-mode sensing. There are three VREAD required in this work and the area overhead for generating different VREAD is found to be only 1.2%.

TABLE VI
COMPARISON WITH RECENT ReRAM AND R-CIM WORKS

The work in [13] uses 2T2R ReRAM cells to implement IM-DP during normal memory access with VREAD = 0.1 V. Although we adopt the IM-DP method in [13] and use the same VREAD, we show that VREAD = 0.1 V is sufficient to provide enough sensing margin for ReRAM read (in both 1T1R and 2T2R configurations) and IM-DP after considering


different types of variations. Note that the lowest VREAD = 0.1 V for NVM read will affect the read speed [35]. Fortunately, 0.1 V is far less than the SET/RESET voltage of ReRAM devices (>0.5 V) and it leaves a large space for the tradeoff between speed and power while ensuring the ReRAM reliability. As reported in [35], which is also based on ReRAM integrated with 40-nm technology, the read speed is improved by more than 2× when VREAD increases from 0.18 to 0.26 V.

The work in [18] uses VREAD = 0.6 V with RH/RL ratio = during TCAM search. However, with VREAD = 0.6 V, significant read disturbance may occur during TCAM search and degrade the ReRAM reliability [31], [36]. On the contrary, we perform optimizations for TCAM using SA redundancy and lower the required VREAD to 0.4 V. Although a lower VREAD will decrease the search speed, it improves the reliability of ReRAM devices and reduces the search energy. When the robustness of R-CIM systems is of major concern, designers should not just care about speed and power consumption since the reliability of ReRAM devices is also important when performing different CIM operations. Therefore, it is necessary to perform different optimizations to ensure that optimal VREAD can be selected without significantly disturbing ReRAM devices. This work provides a guide for such optimizations.

VI. CONCLUSION

In this article, we proposed a reconfigurable 2T2R ReRAM architecture to support three types of CIM operations: 1) TCAM; 2) LiM; and 3) IM-DP. We proposed a configurable data storage strategy to allow the 2T2R ReRAM to operate as conventional 1T1R ReRAM in situations where CIM is not required. We performed optimizations for the proposed R-CIM using existing and novel design techniques to improve its robustness and efficiency. We quantitatively analyzed the robustness of the proposed R-CIM with respect to the precharge voltage (VREAD) and the ReRAM ON/OFF ratio. With the proposed optimizations, the TCAM search energy can be reduced by 1.6× with better reliability thanks to the lower VREAD. The proposed LiM-FA improves the delay (3.2×), the static power (1.2×), and the dynamic power (1.6×) compared with the state-of-the-art LiM-FA. Combining optimizations with robustness analysis, the same VREAD for ReRAM access can be set in 2T2R and 1T1R configurations. A lower VREAD in 1T1R configuration gives 1.14× lower access energy.

REFERENCES

[1] T.-K.-J. Ting et al., “An 8-channel 4.5 Gb 180GB/s 18ns-row-latency RAM for the last level cache,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 404–405.
[2] J. Wang et al., “A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing,” IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 76–86, Jan. 2020.
[3] M. F. Ali, A. Jaiswal, and K. Roy, “In-memory low-cost bit-serial addition using commodity DRAM technology,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 1, pp. 155–165, Jan. 2020.
[4] Y. Chen, “ReRAM: History, status, and future,” IEEE Trans. Electron Devices, vol. 67, no. 4, pp. 1420–1433, Apr. 2020.
[5] M. M. S. Aly et al., “The N3XT approach to energy-efficient abundant-data computing,” Proc. IEEE, vol. 107, no. 1, pp. 19–48, Jan. 2019.
[6] T. F. Wu et al., “A 43pJ/cycle non-volatile microcontroller with 4.7μs shutdown/wake-up integrating 2.3-bit/cell resistive RAM and resilience techniques,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2019, pp. 226–228.
[7] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory,” IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
[8] W.-H. Chen et al., “A 65 nm 1 Mb nonvolatile computing-in-memory ReRAM macro with sub-16 ns multiply-and-accumulate for binary DNN AI edge processors,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 494–495.
[9] C. Yu, T. Yoo, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, “A 16K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5bit column ADC,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Mar. 2020, pp. 1–4.
[10] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory with spin-transfer torque magnetic RAM,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018.
[11] D. Reis, M. Niemier, and X. S. Hu, “Computing in memory with FeFETs,” in Proc. Int. Symp. Low Power Electron. Design, Jul. 2018, pp. 1–6.
[12] S. K. Thirumala, S. Jain, A. Raghunathan, and S. K. Gupta, “Non-volatile memory utilizing reconfigurable ferroelectric transistors to enable differential read and energy-efficient in-memory computation,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2019, pp. 1–6.
[13] M. Bocquet et al., “In-memory and error-immune differential RRAM implementation of binarized deep neural networks,” in IEDM Tech. Dig., Dec. 2018, pp. 20-1–20-6.
[14] C.-X. Xue et al., “A 1 Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2019, pp. 388–389.
[15] W. Wan et al., “A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2020, pp. 498–499.
[16] L. Zheng, S. Shin, and S.-M.-S. Kang, “Memristors-based ternary content addressable memory (mTCAM),” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 2253–2256.
[17] M.-F. Chang et al., “Designs of emerging memory based non-volatile TCAM for Internet-of-Things (IoT) and big-data processing: A 5T2R universal cell,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1142–1145.
[18] D. R. B. Ly et al., “In-depth characterization of resistive memory-based ternary content addressable memories,” in IEDM Tech. Dig., Dec. 2018, pp. 20-1–20-3.
[19] J. Li, R. K. Montoye, M. Ishii, and L. Chang, “1 Mb 0.41 μm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing,” IEEE J. Solid-State Circuits, vol. 49, no. 4, pp. 896–907, Apr. 2014.
[20] Y. Zha, E. Nowak, and J. Li, “Liquid silicon: A nonvolatile fully programmable processing-in-memory processor with monolithically integrated ReRAM for big data/machine learning applications,” in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C206–C207.
[21] M.-F. Chang et al., “Challenges and circuit techniques for energy-efficient on-chip nonvolatile memory using memristive devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 5, no. 2, pp. 183–193, Jun. 2015.
[22] P.-Y. Chen and S. Yu, “Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design,” IEEE Trans. Electron Devices, vol. 62, no. 12, pp. 4022–4028, Dec. 2015.
[23] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-based resistive switching memories–nanoionic mechanisms, prospects, and challenges,” Adv. Mater., vol. 21, nos. 25–26, pp. 2632–2663, Jul. 2009.
[24] K.-S. Li et al., “Utilizing sub-5 nm sidewall electrode technology for atomic-scale resistive memory fabrication,” in Symp. VLSI Technol. (VLSI-Technol.), Dig. Tech. Papers, Jun. 2014, pp. 1–2.
[25] H. Y. Lee et al., “Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust HfO2 based RRAM,” in IEDM Tech. Dig., Dec. 2008, pp. 1–4.
[26] W. Kim et al., “Forming-free nitrogen-doped AlOX RRAM with sub-μA programming current,” in IEEE Symp. VLSI Technol. Dig. Tech. Papers, Jun. 2011, pp. 22–23.
[27] A. Grossi et al., “Fundamental variability limits of filament-based RRAM,” in IEDM Tech. Dig., Dec. 2016, pp. 4-1–4-7.
[28] E. Vianello et al., “Resistive memories for ultra-low-power embedded computing design,” in IEDM Tech. Dig., Dec. 2014, pp. 6-1–6-3.


[29] A. Fantini et al., “Intrinsic program instability in HfO2 RRAM and consequences on program algorithms,” in IEDM Tech. Dig., Dec. 2015, pp. 7-1–7-5.
[30] Y. Chen et al., “Balancing SET/RESET pulse for >10^10 endurance in HfO2/Hf 1T1R bipolar RRAM,” IEEE Trans. Electron Devices, vol. 59, no. 12, pp. 3243–3249, Dec. 2012.
[31] W. C. Chien et al., “A forming-free WOx resistive memory using a novel self-aligned field enhancement feature with excellent reliability and scalability,” in IEDM Tech. Dig., Dec. 2010, pp. 19-1–19-2.
[32] B. Govoreanu et al., “10×10 nm² Hf/HfOx crossbar resistive ram with excellent performance, reliability and low-energy operation,” in IEDM Tech. Dig., Dec. 2011, pp. 31-1–31-6.
[33] P. Jain et al., “A 3.6 Mb 10.1 Mb/mm² embedded non-volatile ReRAM macro in 22 nm FinFET technology with adaptive forming/set/reset schemes yielding down to 0.5 V with sensing time of 5ns at 0.7 V,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2019, pp. 212–213.
[34] O. Golonzka et al., “Non-volatile RRAM embedded into 22FFL FinFET technology,” in Proc. Symp. VLSI Technol., Jun. 2019, pp. 230–231.
[35] C.-C. Chou et al., “An N40 256K × 44 embedded RRAM macro with SL-precharge SA and low-voltage current limiter to improve read and write performance,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 478–479.
[36] H. Lv et al., “BEOL based RRAM with one extra-mask for low cost, highly reliable embedded application in 28 nm node and beyond,” in IEDM Tech. Dig., Dec. 2017, pp. 2–4.
[37] M.-F. Chang et al., “Embedded 1Mb ReRAM in 28 nm CMOS with 0.27-to-1 V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp. 332–333.
[38] Q. Liu et al., “A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2020, pp. 500–501.
[39] Q. Guo, X. Guo, Y. Bai, and E. Ipek, “A resistive TCAM accelerator for data-intensive computing,” in Proc. 44th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2011, pp. 339–350.
[40] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. 30th Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 4107–4115.
[41] N. Verma and A. P. Chandrakasan, “A 65 nm 8T sub-Vt SRAM employing sense-amplifier redundancy,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 328–329.
[42] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2015, pp. 1543–1546.
[43] M. E. Sinangil and A. P. Chandrakasan, “Application-specific SRAM design using output prediction to reduce bit-line switching activity and statistically gated sense amplifiers for up to 1.9× lower energy/access,” IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 107–117, Jan. 2014.
[44] S. Matsunaga et al., “Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic tunnel junctions,” Appl. Phys. Express, vol. 1, Aug. 2008, Art. no. 091301.
[45] Z. Wang et al., “A physics-based compact model of ferroelectric tunnel junction for memory and logic design,” J. Phys. D, Appl. Phys., vol. 47, no. 4, Dec. 2013, Art. no. 045001.
[46] X. Yin, X. Chen, M. Niemier, and X. S. Hu, “Ferroelectric FETs-based nonvolatile logic-in-memory circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 1, pp. 159–172, Jan. 2019.

Lu Lu (Student Member, IEEE) received the B.E. degree from the School of Computer and Information, Hefei University of Technology, Hefei, China, in 2007, and the M.E. degree from the School of Microelectronics and Solid-State Electronics, Xiamen University, Xiamen, China, in 2010. She is currently working toward the Ph.D. degree at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
Her research interests include low-power SRAM and SRAM-based physical unclonable function (PUF).
Ms. Lu was a recipient of the IEEE SSCS Singapore Chapter Award in 2018.

Bongjin Kim (Member, IEEE) received the B.S. and M.S. degrees from POSTECH, Pohang, South Korea, in 2004 and 2006, respectively, and the Ph.D. degree from the University of Minnesota, Minneapolis, MN, USA, in 2014.
He spent two years with Rambus, Sunnyvale, CA, USA, where he was a Senior Staff Member and worked on the research of high-speed serial link circuits and microarchitectures. He worked as a Postdoctoral Fellow with Stanford University, Stanford, CA, for a year. From 2006 to 2010, he was with Samsung Electronics, Yongin, South Korea, where he performed the research on clock generators for high-speed serial links. He was also a Research Intern with Texas Instruments, Dallas, TX, USA, IBM TJ Watson Research, Yorktown Heights, NY, USA, and Rambus, during his Ph.D., from 2012 to 2014. He joined Nanyang Technological University (NTU), Singapore, in September 2017, as an Assistant Professor. His current research interests include memory-centric computing circuits and architectures, hardware accelerators, and mixed-signal circuit design techniques and methodologies.
Dr. Kim was a recipient of the Prestigious Doctoral Dissertation Fellowship Award for his Ph.D. study, the Low Power Design Contest Award from ISLPED, and the Intel/IBM/Catalyst Foundation Award from CICC Conference. His research works appeared at top circuit conferences and journals including ISSCC, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, CICC, ESSCIRC, and IEEE JOURNAL OF SOLID-STATE CIRCUITS (JSSC).

Tony Tae-Hyoung Kim (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, South Korea, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Minnesota, Minneapolis, MN, USA, in 2009.
From 2001 to 2005, he was with Samsung Electronics, Hwasung, South Korea, where he performed the research on the design of high-speed SRAM memories, clock generators, and IO interface circuits. From 2007 to 2009, he was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, and Broadcom Corporation, Edina,
[47] Deepa and V. S. Kumar, “Analysis of energy efficient PTL based full MN, USA, where he performed the research on circuit reliability, low-power
adders using different nanometer technologies,” in Proc. 2nd Int. Conf. SRAM, and battery-backed memory design. In 2009, he joined Nanyang
Technological University, Singapore, where he is currently an Associate
Electron. Commun. Syst. (ICECS), Feb. 2015, pp. 310–315.
Professor. He has authored or coauthored over 160 journal and conference
articles and holds 17 U.S. and Korean patents registered. His current research
interests include low-power and high-performance digital, mixed-mode, and
memory circuit design, ultralow-voltage circuits and systems design, variation
and aging-tolerant circuits and systems, and circuit techniques for 3-D ICs.
Dr. Kim received the Best Demo Award at APCCAS2016, the Low Power
Yuzong Chen received the B.Eng. degree in electri- Design Contest Award at ISLPED2016, the best paper awards at 2014
cal and electronic engineering from Nanyang Tech- and 2011 ISOCC, the AMD/CICC Student Scholarship Award at the IEEE
nological University, Singapore, in 2019. CICC2008, the Departmental Research Fellowship from the University of
He is currently a Project Officer with the Cen- Minnesota in 2008, the DAC/ISSCC Student Design Contest Award in 2008,
tre for Integrated Circuits and Systems (CICS), the Samsung Humantec Thesis Award in 2008, 2001, and 1999, and the ETRI
Nanyang Technological University. His research Journal Paper of the Year Award in 2005. He was the Chair of the IEEE
interests include resistive random access mem- Solid-State Circuits Society Singapore Chapter. He has served on numerous
ory (ReRAM) circuits design and in-memory conferences as a Committee Member. He serves as an Associate Editor for
computing. the IEEE T RANSACTIONS ON V ERY L ARGE S CALE I NTEGRATION (VLSI)
S YSTEMS , IEEE A CCESS , and the IEIE Journal of Semiconductor Technology
and Science.

