
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 30, NO. 6, JUNE 2022 769

A Reliable 8T SRAM for High-Speed Searching and Logic-in-Memory Operations

Jian Chen, Graduate Student Member, IEEE, Wenfeng Zhao, Member, IEEE, Yuqi Wang, Yuhao Shu, Graduate Student Member, IEEE, Weixiong Jiang, Graduate Student Member, IEEE, and Yajun Ha, Senior Member, IEEE

Abstract— To efficiently implement searching and logic functions with the SRAM-based in-memory computing (IMC), we need to perform computations on bitlines (BLs) (called compute access) via multiple wordline (WL) activations. However, this may cause prominent read disturbance when the IMC is implemented with the standard 6T SRAM. To address this reliability issue, existing solutions adopt either auxiliary assistance circuits or alternative bitcell topologies, but they lead to substantial overheads in access speed or array density. In this article, we propose a novel 8T compute SRAM (CSRAM) for reliable and high-speed in-memory searching and compound logic-in-memory computations. Our 8T CSRAM features a pair of pMOS access transistors and split-WLs dedicated to the compute access. A thorough circuit-level analysis reveals that the pMOS-based compute access port is essential for significantly mitigating the read disturbance. Moreover, we propose an elevated precharge voltage scheme and a low-skewed inverter-based sensing amplifier to improve the sensing speed. We have validated the proposed 8T CSRAM design in a 16 Kb array with a 28-nm CMOS technology. Compared to the state-of-the-art 8T CSRAM, results show that our design is not only reliable but also 3.1 times faster, with a maximum operating frequency of up to 2.44 GHz.

Index Terms— Content addressable memory (CAM), in-memory computing (IMC), read disturbance, SRAM.

Manuscript received August 25, 2021; revised December 18, 2021 and February 4, 2022; accepted March 31, 2022. Date of publication April 20, 2022; date of current version May 23, 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 62074101 and Grant 62150710549, and in part by the Shanghai Science and Technology Commission Funding under Grant 19511131200 and Grant 20ZR1435800. (Corresponding author: Yajun Ha.)
Jian Chen is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, also with Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China.
Wenfeng Zhao is with the Department of Electrical and Computer Engineering, Binghamton University SUNY, Binghamton, NY 13902 USA.
Yuqi Wang, Yuhao Shu, and Weixiong Jiang are with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China.
Yajun Ha is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, and also with Shanghai Engineering Research Center of Energy Efficient and Custom AI IC, Shanghai 201210, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2022.3164756.
Digital Object Identifier 10.1109/TVLSI.2022.3164756
1063-8210 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

The surge of data-intensive applications such as artificial intelligence has an ever-increasing demand for high-throughput and energy-efficient computing architecture. However, conventional von Neumann architecture needs to move data back and forth between memory and processing elements, which results in limited throughputs and substantial energy overhead [1]. The in-memory computing (IMC) architectures have been proposed to circumvent the von Neumann bottleneck by reducing data transfers and performing computations directly inside or near the memory. Recently, different levels of memory hierarchies, including SRAM [2], [3], DRAM [4], and nonvolatile memories (such as RRAM [5], [6], STT-MRAM [7], and Flash [8]) have been explored to implement IMC systems. In this article, we focus on the compute SRAM (CSRAM, as an SRAM-based IMC), as it is compatible with commercial CMOS technologies and has an opportunity to utilize existing large on-chip SRAM caches. The well-known analog-based CSRAM designs performing multiplication and accumulation (MAC)/dot-product computation have been presented in [9]–[21]. However, these works only support specific error-resilient applications such as convolutional neural networks (CNN).

To efficiently implement some types of popular applications (such as searching and logic functions) with the digital-based CSRAM designs, we need to perform accurate bit-wise computations on bitlines (BLs, called compute access) via multiple wordline (WL) activations [22]–[27]. The multiword activation is an essential operation for digital-based CSRAM designs to achieve high throughput and energy efficiency. However, this may cause prominent read disturbance when the IMC is implemented with the standard 6T SRAM bitcell. Because the 6T SRAM has a shared read and write BL, the bitcells of the same column may be flipped as the BL voltage goes to a relatively low value. Therefore, the read disturbance significantly reduces the reliability of CSRAM designs.

Various techniques have been used to mitigate the read disturbance of CSRAM designs. First, a hierarchical 6T CSRAM [28] and an interleaved structure [29] have been proposed to circumvent the read disturbance at the architecture level. Nevertheless, both designs have rigid data layout requirements and are not suitable for searching [i.e., content addressable memory (CAM)] operations. Second, designs in [25], [26] have eliminated the read disturbance issue for logic-in-memory computation by maintaining the BL voltage at a high level. Again, both schemes do not support CAM operations well. Third, other assist techniques and alternative topologies, which aim to address the general


TABLE I
SUMMARY FEATURES OF PREVIOUS ASSIST SCHEMES FOR CSRAM

read disturbance issue with multiple words simultaneously activated, are summarized in Table I. In [22], a WL under-drive technique is adopted and can efficiently mitigate the read disturbance by weakening the access transistors of 6T CSRAM, however, at the cost of access performance. Agrawal [30] introduce a staggered WL scheme, which activates WLs successively to avoid the short circuit path. This results in degraded speed and introduces complicated signal control overheads. Bitcells with more transistors have also been investigated, such as 8T [24], [30], 9T [30], and 10T [31]. For the standard 8T, Lin et al. [30] and Agrawal et al. [24] circumvent the read disturbance by utilizing the decoupled 2T read ports combined with different sensing schemes. However, both sensing schemes need to distinguish two logic functions from a single BL, which results in a low sensing margin and harms the access performance. Last, although the 9T [30] and 10T [31] with differential read ports provide reliable compute access, both schemes suffer from an area penalty compared to 8T CSRAM. In summary, the aforementioned works do not adequately address the trade-off among the compute access performance, reliability, and bitcell area when dealing with the read disturbance of CSRAM. It is pivotal to design a novel CSRAM cell and architecture that can operate reliably and with high access performance.

In this article, we have proposed a novel 8T CSRAM for reliable and high-speed in-memory searching and compound logic operation. Our 8T CSRAM features a pair of pMOS access transistors and split WLs dedicated to the compute access. The pMOS-access-based SRAMs have been proved to be feasible, especially for ultralow voltage 6T [32] and 9T [33]. However, these topologies cannot address the new reliability issue of CSRAM, which makes them inapplicable to IMC applications. Our proposed bitcell and its corresponding peripheral circuit have been optimized for IMC applications so that the pMOS access transistor can be effectively utilized in compute access. To the authors’ knowledge, this is the first time that pMOS-access-based CSRAM has been proposed.

Our main contributions are summarized as follows.
1) First, we propose a novel and reliable 8T CSRAM with differential pMOS-based access transistors and split WLs dedicated to the compute access port. We perform in-depth circuit-level analysis to reveal that our CSRAM can address the trade-off between reliability and performance better than the previous CSRAM.
2) Second, we propose an elevated precharge voltage scheme and a low-skewed inverter-based sense amplifier (SA) to improve the compute access performance.
3) Third, we apply the new 8T CSRAM to realize reliable and high-speed CAM for searching operations. Moreover, the proposed 8T CSRAM is able to perform compound logic-in-memory functions in only one cycle.

The proposed 8T CSRAM is implemented in a 28-nm CMOS technology and has a similar area to a standard 8T bitcell [30]. A 16 Kb CSRAM macro has been post-layout validated, which is considerably faster than the state-of-the-art designs.

The remainder of the article is organized as follows. Section II introduces the read disturbance issue of conventional nMOS-access-based CSRAM. Section III presents the design and operating principle of the proposed 8T bitcell. The reliability and access performance analysis of the proposed 8T CSRAM are also covered in this section. Section IV presents the elevated precharge scheme and low-skewed inverter-based sensing scheme to improve the sensing speed. Section V introduces the principle of the CAM functions and the compound logic-in-memory functions based on the proposed 8T CSRAM. Section VI presents the experimental results and analysis of the proposed design, and Section VII concludes the article.

II. READ DISTURBANCE OF CSRAM

Fig. 2 illustrates a 6T CSRAM design which activates two WLs simultaneously (i.e., a compute access). Based on the stored data, the BL will be discharged to three different voltage levels, as shown in Fig. 2(b). By sensing the three voltage


Fig. 1. (a) Schematic of conventional 6T CSRAM with two words selected simultaneously. (b) Dynamic behavior of equivalent half circuit during a compute access. (c) BL transient behavior and formulated direct current during a typical corner. (d) Stored datum “B” is flipped under the worst corner (fast nMOS slow pMOS).
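
As a rough functional illustration of the two-WL compute access discussed in this section, the sketch below counts how many of the two activated cells pull each bitline down and maps that count to the three bitline levels. It is only a behavioral model under stated assumptions: the level names, the single sensing threshold (anything below the undischarged level counts as discharged), and the inverting sense output are illustrative choices, not details taken from the paper's circuit.

```python
# Behavioral sketch (not a circuit simulation) of a 6T compute access with
# two wordlines active. Assumption: a cell storing 0 discharges BL and a
# cell storing 1 discharges BLB, because the low storage node pulls its
# bitline down through the access transistor.

def bitline_levels(a: int, b: int) -> tuple[str, str]:
    """Return qualitative (BL, BLB) levels for stored bits a and b."""
    names = {0: "high", 1: "mid", 2: "low"}   # 0, 1 or 2 discharging cells
    bl_pulldowns = (1 - a) + (1 - b)          # cells storing 0 discharge BL
    blb_pulldowns = a + b                     # cells storing 1 discharge BLB
    return names[bl_pulldowns], names[blb_pulldowns]

def sense(a: int, b: int) -> tuple[int, int]:
    """Assumed inverting sense: output 1 whenever the line has discharged."""
    bl, blb = bitline_levels(a, b)
    nand_out = int(bl != "high")   # BL stays high only when a = b = 1
    or_out = int(blb != "high")    # BLB stays high only when a = b = 0
    return nand_out, or_out

for a in (0, 1):
    for b in (0, 1):
        print(a, b, bitline_levels(a, b), sense(a, b))
```

In this model the intermediate level appears only for the 01/10 combinations, which is the case in which the text locates the read disturbance.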

levels with an SA, the BL behavior forms a NAND operation. A similar analysis can be applied to BLB, which will implement an OR operation. With the basic Boolean logic operations, complex functions such as addition/multiplication can be realized through either bit-serial or bit-parallel techniques [25], [34], [35]. However, the CSRAM generates a short circuit path between two storage nodes (between A and B, or between Ā and B̄). This causes a read disturbance issue when the stored data is “01/10” (i.e., Case 01/10), in which the stored data may be flipped when BL is discharged to a lower value.

Fig. 2. Example of conventional 6T CSRAM design: (a) schematic, (b) timing diagram during a compute access, and (c) truth table for Boolean logic operation.

In Fig. 1, 6T CSRAM is used to illustrate the detailed BL dynamics and read disturbance of a CSRAM with nMOS access transistors. It is worth noting that the read disturbance only occurs when selected bitcells store different data (i.e., 01/10), as shown in Fig. 1(a). Fig. 1(b) shows the equivalent half circuit during a compute access. BL is first precharged to VDD. Then, both the N1 and N2 access transistors are activated during the sensing phase. When the BL voltage is at a higher level, N1 is strongly on while N2 is off. This results in BL discharging through N1-N3. Once the BL is discharged to a lower level, N2 will gradually turn on and charge the BL. As a result, in a typical process corner, the BL will finally converge to a nonzero value, and the direct current path (P2-N2-N1-N3) will generate a substantial current overhead (around 45 µA), as shown in Fig. 1(c). More seriously, to ensure a normal SRAM write access operation, the nMOS access transistor (N2) is designed to be stronger than the pull-up pMOS (P2). Thus, a pseudo-write through P2-N2 likely occurs during a compute access, which corrupts the stored datum and is well known as the read disturbance issue. As shown in Fig. 1(d), the stored datum “B” is corrupted in the worst process corner (fast nMOS slow pMOS).

To mitigate this issue, conventional assist schemes, e.g., WL under-drive [22], aim to adjust the ratio balance between N2 and P2. However, as the nMOS access transistor is intrinsically stronger than the pull-up transistor, extremely low WL voltage is required to mitigate the read disturbance. Thus, the access performance of nMOS-access-based CSRAM is substantially degraded. In summary, for the nMOS-access-based CSRAM, it is difficult to deal with the trade-off between reliability and performance.

III. PROPOSED 8T CSRAM BITCELL

In this section, we will first introduce the proposed 8T CSRAM and the working principle. Then, we analyze the read disturbance mitigation feature with a detailed circuit-level analysis. Last, we present the distinct advantage of pMOS-access-based CSRAM by comparing it to the conventional nMOS-access-based CSRAM.

A. Bitcell Topology and Operating Principle

Fig. 3 shows the schematic, layout, and timing diagram of the proposed 8T CSRAM design. The proposed bitcell consists of a standard 6T and a pair of pMOS transistors (i.e., P3 and P4) connecting to RBL and RBLB. To perform the CAM operation, the RWL is split into RWLL and RWLR, which are used to input the search data. The bitcell layout


TABLE II
OPERATION CONDITION OF THE PROPOSED 8T CSRAM
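
The split read-WL encoding used for a CAM search can be sketched as follows. The mapping follows the example given in the text (a search bit of 1 drives RWLL to GND and RWLR to VDD, which turns P3 on and P4 off), with the opposite polarity for a search bit of 0. The function name, the dictionary keys, and the pairing of P3 with RWLL and P4 with RWLR are assumptions made for illustration.

```python
# Sketch of the split read-wordline encoding for one search bit.
# Assumptions: P3 is gated by RWLL and P4 by RWLR, and a pMOS access
# transistor conducts when its gate is at GND (logic 0).

def encode_search_bit(bit: int) -> dict:
    """Map a search bit to wordline levels and access-transistor states."""
    if bit not in (0, 1):
        raise ValueError("search bit must be 0 or 1")
    rwll = 0 if bit == 1 else 1   # search 1: RWLL pulled down to GND
    rwlr = 1 if bit == 1 else 0   # search 1: RWLR pulled up to VDD
    return {
        "RWLL": rwll,
        "RWLR": rwlr,
        "P3_on": rwll == 0,       # active-low pMOS gate
        "P4_on": rwlr == 0,
    }

print(encode_search_bit(1))
print(encode_search_bit(0))
```

Exactly one of the two pMOS access transistors conducts for either search value, which matches the one-sided activation that the CAM operation relies on.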

based on logic rules is shown in Fig. 3(a). A compact cell layout area of 0.739 µm² (0.39 µm × 1.895 µm) can be achieved by maximally sharing the diffusion and contact, which occupies an area similar to that of the standard 8T bitcell designs in the chosen 28-nm CMOS technology.

Fig. 3. Proposed 8T CSRAM. (a) Bitcell schematic and layout. (b) Simulated timing diagram for different operation cycles.

The timing diagram of the proposed 8T CSRAM is shown in Fig. 3(b). In a write cycle, only one WL is selected to write data by turning on nMOS access transistors N1 and N2. In a read cycle, although both ports (i.e., BLs and RBLs) can be accessed, the respective precharging and activating logic are different. The BLs are precharged to VDD as in the conventional 6T SRAM, while the RBLs connecting to pMOS access transistors are precharged to GND level instead. Therefore, normal memory read is achieved via BL/BLB discharging through the nMOS access transistors, while the compute access is achieved via RBL/RBLB charging through the pMOS access transistors, respectively.

With the extra pMOS-based compute access ports (i.e., RBLs) and split read WLs, the proposed 8T CSRAM can be configured to perform SRAM, CAM, and logic operations. A detailed truth table for different operations is illustrated in Table II. The corresponding operating mode of four access transistors is also summarized. For the normal SRAM function, only WLs and nMOS-based read ports (N1, N2) will be activated to perform a normal write or read operation. To perform the CAM function, RWLLs and RWLRs will be configured to input search data. For example, if the search data is 1, the RWLL is pulled down to GND while the RWLR is pulled up to VDD. In this case, P3 is on while P4 is off. Thus, multiple XNOR computations can be realized in single RBLs to form a CAM configuration. More details about the CAM configuration will be demonstrated in Section V-A. For a multiword logic operation, such as OR, AND, and XOR, only RWLs are activated to perform computation on RBLs. We can also utilize two read ports (i.e., BLs and RBLs) to perform a compound logic function by activating WLs and RWLs simultaneously (not presented in the table). The read disturbance issues from BLs can be well addressed with a low-swing BL scheme [25], [26]. More details about the logic-in-memory functions are covered in Section V-B.

B. Principle of Read Disturbance Mitigation

An important feature of the proposed 8T CSRAM is that the read disturbance arising from IMC operations (i.e., CAM and logic-in-memory) can be inherently mitigated instead of using any assisting techniques. To better understand the mitigation principle, we present a detailed circuit-level analysis on the proposed 8T CSRAM, as shown in Fig. 4. The equivalent half circuit is shown in Fig. 4(b). At first, the RBLs are precharged to GND level. During the access phase, only the pMOS access transistors (P3 and P4) are turned on by asserting RWL signals. As the RBLs are precharged to GND level, P4 is strongly on to charge the RBL, while P3 is off. Once the RBL is charged to a relatively high value, P3 will gradually turn on to discharge the RBL slowly. As a result, the RBL will converge to a non-VDD value, as shown in Fig. 4(c). It is worth noting that the maximum direct current (around 25 µA) is reduced by 44% compared to 6T CSRAM as this current is determined by the series-connected pMOS (P2-P4) transistors instead of nMOS (N1-N3) ones. More importantly, the pseudo-write operation through the pMOS access transistor is successfully avoided. As shown in Fig. 4(d), the stored data are not corrupted even in the worst process corner (slow nMOS fast pMOS).

The reliability of the proposed 8T CSRAM stems from the formation of a different pseudo-write path. For the 8T CSRAM, the possible pseudo-write path is changed to P3-N3 as shown in Fig. 4(b). The data corruption only occurs in the


Fig. 4. (a) Schematic of proposed 8T CSRAM with two words selected simultaneously. (b) Dynamic behavior of equivalent half circuit during a compute access. (c) BL transient behavior and formulated direct current during a typical corner. (d) Stored data are NOT flipped under the worst corner (slow nMOS fast pMOS).
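
The reliability argument of this section (a datum is corrupted only when the access transistor overpowers the transistor holding the storage node) can be illustrated with a toy Monte-Carlo estimate. This is a sketch under loud assumptions: device strengths are reduced to unitless Gaussian samples, the means and sigma are hypothetical values chosen for illustration, and the flip criterion is a simple comparison rather than a circuit simulation. It does not reproduce the paper's simulated yields.

```python
import random

def estimate_yield(access_mean, opposing_mean, sigma, points=2000, seed=0):
    """Fraction of Monte-Carlo samples in which the cell keeps its data.

    Toy criterion: the datum flips when the sampled access-transistor
    strength exceeds the strength of the transistor holding the node.
    All strengths are unitless, hypothetical numbers.
    """
    rng = random.Random(seed)
    kept = 0
    for _ in range(points):
        access = rng.gauss(access_mean, sigma)
        opposing = rng.gauss(opposing_mean, sigma)
        if access <= opposing:
            kept += 1
    return kept / points

# nMOS access vs. pull-up pMOS: the access device is stronger on average.
print("nMOS-access toy yield:", estimate_yield(1.0, 0.6, 0.1))
# pMOS access vs. pull-down nMOS: the access device is weaker on average.
print("pMOS-access toy yield:", estimate_yield(0.5, 1.0, 0.1))
```

The only point of the sketch is the direction of the effect: when the access device is weaker on average than the device it fights, nearly all samples keep their data, and the opposite holds when it is stronger.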

case that the pMOS access transistor overpowers the pull-down nMOS transistor, as opposed to nMOS-access-based CSRAM. In fact, the pMOS access transistor has a higher threshold (Vth), lower mobility (µ), and smaller size than the pull-down nMOS transistor. Consequently, the pMOS access transistor is naturally weaker than the pull-down nMOS transistor in the proposed 8T CSRAM, making it resilient to the read disturbance. Therefore, no extra assisting schemes are required for our design. Moreover, pMOS transistors have a relatively large driving strength in charging the RBLs, as pMOS transistors are strong devices in passing logic “1” (i.e., charging the RBLs). Therefore, by utilizing the pMOS transistors, we have an opportunity to eliminate the read disturbance while keeping high-frequency accesses simultaneously.

Through analyzing the BL transient behavior when two words are selected, we have observed that the 8T CSRAM using the pMOS access transistors is much more reliable than the conventional CSRAM using the nMOS access transistors. Besides, as the sensing phase is changed to charging RBL through pMOS, the access speed can be maintained simultaneously. To better demonstrate the advantage of 8T CSRAM, we make a comparative analysis between nMOS-access-based CSRAM and pMOS-access-based CSRAM.

Fig. 5. (a) One column of nMOS-access-based 6T CSRAM with 16 words accessed simultaneously. (b) Transient behavior of a 2000-point Monte-Carlo simulation showing bitcell “A” is flipped under 99.5% of cases.

Fig. 6. (a) One column of proposed pMOS-access-based 8T CSRAM with 16 words accessed simultaneously. (b) Transient behavior of a 2000-point Monte-Carlo simulation showing bitcell “A” is NOT flipped under all cases.

C. Comparative Analysis Between nMOS-Access-Based CSRAM and pMOS-Access-Based CSRAM

1) Reliability Analysis: Monte-Carlo simulations, which include both process corners (global variations) and device mismatch (local variations), are performed to evaluate the read disturbance effects for IMC applications with multiword activation. During the simulation, all transistors fluctuate according to the Gaussian distribution models. As multiword activation is essential for CAM operation, we will evaluate the read disturbance for this application. The feature of the CAM operation is that only one side of the access transistors is activated for each bitcell [22], [24]. To make a practical comparison, we present the effect of read disturbance with 16-word activation for both nMOS-access-based CSRAM and pMOS-access-based CSRAM, as shown in Figs. 5 and 6, respectively. As annotated in the schematic, the worst case happens when there is only one bitcell storing a different value compared to all the other 15 accessed bitcells sharing the same column. In this case, the uniquely different bitcell is the most vulnerable. From a 2000-point Monte-Carlo simulation, with a nominal WL voltage, 99.5% of the stored data are flipped in the nMOS-access-based CSRAM (see Fig. 5), which means that the yield is only 0.5%. However, for the pMOS-access-based CSRAM, to which a nominal WL voltage is also applied, there is no data flipping; thus, the yield is 100% (see Fig. 6). The results indicate that the proposed


Fig. 7. Access delay comparison under TT corner. (a) Proposed pMOS-access-based 8T CSRAM charges RBLs to 0.2 V in 0.2 ns. (b) Conventional nMOS-access-based CSRAM [22] with WL under-drive discharges BLs by 0.2 V in 1 ns.

Fig. 8. (a) Voltage-latched single-ended SA [36]. (b) Timing diagram during a sensing process.
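
The RBL swing requirements discussed in Section IV can be checked with a few lines of arithmetic. The sketch below uses the values quoted in the text (a 0.15 VDD sensing margin, a 0.1 VDD elevated precharge level, and a 0.3 VDD trip point for the low-skewed inverter). The function names, and the assumption that Vref sits exactly one margin above the GND precharge level, are illustrative.

```python
# Worked arithmetic for the required RBL swing, in units of VDD.

def single_ended_swing(margin: float) -> float:
    """RBL starts at GND; Vref is assumed to sit one margin above GND,
    and a charged RBL must exceed Vref by one more margin."""
    vref = margin
    return vref + margin

def skewed_inverter_swing(precharge: float, trip_point: float) -> float:
    """RBL starts at the elevated precharge level and must reach the
    trip point of the low-skewed inverter."""
    return trip_point - precharge

conventional = single_ended_swing(0.15)
proposed = skewed_inverter_swing(0.1, 0.3)
print(f"single-ended SA swing: {conventional:.2f} VDD")
print(f"elevated precharge + low-skewed inverter swing: {proposed:.2f} VDD")
```

The required swing drops from 0.3 VDD to 0.2 VDD. The text reports a charging-delay reduction of at least 16%, so the delay should not be assumed to scale linearly with this swing.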
CSRAM mitigates the read disturbance effectively, even for IMC applications with multiword activation.

2) Access Performance Analysis: Compared to previous assisting schemes that try to weaken the nMOS access transistor in conventional CSRAM, the pMOS-access-based 8T CSRAM shows a much higher access performance. The WL under-drive scheme has been widely adopted in nMOS-access-based CSRAM, which mitigates read disturbance by weakening access transistors [22]. In order to support multioperand activation, a WL voltage lower than half-VDD is required in [22], which degrades the access performance severely. In contrast, a nominal WL voltage can be applied to the proposed design. The access delay comparison between the proposed pMOS-access-based CSRAM and the conventional nMOS-access-based CSRAM with half-VDD WL voltage is presented in Fig. 7. To build a sufficient voltage difference (0.3 VDD in this case) for reliable sensing, the proposed CSRAM only needs to turn on WL for 0.2 ns while the conventional nMOS-access-based CSRAM with WL under-drive needs 1 ns. The results suggest that the proposed CSRAM achieves a much better access performance than conventional CSRAM using nMOS access transistors.

Another advantage of the proposed 8T CSRAM is that a pair of differential ports is available, compared to the conventional standard 8T, which suffers from single port sensing. As a result, a reduced BL voltage swing can be achieved in the proposed design, which brings higher performance and applicability in low voltage applications as well.

IV. ELEVATED PRECHARGE SCHEME AND LOW-SKEWED INVERTER-BASED SENSING SCHEME

Although we have demonstrated that the proposed CSRAM with pMOS access transistors can offer both reliable compute access operation and improved access performance as compared to the design with the nMOS counterpart, it is still crucial to further optimize the access speed. In this section, we first demonstrate the drawback of the widely-used single-ended SA. Then, to improve the access speed, we propose a novel elevated precharge voltage scheme and a low-skewed inverter-based sensing scheme.

A widely-used sensing scheme for CSRAM is single-ended sensing with a differential SA, which compares its BL/BLB voltage to a reference voltage (Vref) [24], [34], as shown in Fig. 2. The voltage-latched single-ended SA [36], as shown in Fig. 8(a), is suitable for the proposed CSRAM in which the RBL is precharged to GND level. Once the SA is enabled (i.e., SAE = 1), the SA will amplify the voltage difference between RBL and Vref to rail-to-rail output voltages (OUT and OUTB). To reach a reasonable sensing yield, the voltage difference should be larger than a specific value. In this design, we define the target voltage difference as the one for which we can get 100% yield in a 15 k-point Monte-Carlo simulation. The simulation results indicate that the voltage difference has to be developed larger than 0.15 VDD in this design, as illustrated in Fig. 8(b). As can be observed, due to the introduction of the additional reference voltage (Vref), the RBL has to be charged to twice the voltage difference (i.e., 0.3 VDD) during a compute access. Thus, the single-ended sensing scheme deteriorates the access speed because of a larger RBL charging delay.

Fig. 9(a) shows the schematic of the proposed scheme, which consists of an elevated precharge voltage circuitry, a low-skewed inverter, an adaptive replica column, and a timing control unit (TCU). The elevated precharge circuitry will charge the RBLs to 0.1 VDD before SA is enabled, while the low-skewed inverter can sense the RBL at around 0.3 VDD. Thus, the charging delay can be reduced as the RBL only needs to boost an additional 0.2 VDD during the access phase. The adaptive replica column consists of dummy bitcells with the same parasitics as the array columns. By activating RWL_dum or both RWL_dum and RWL_CAM, the replica column mimics the worst sensing delay of real columns for logic functions or CAM functions. The configuration of logic and CAM functions will be detailed in Section V. Fig. 9(c) shows the timing waveform during a sensing process. With the replica-column technique, we can ensure that the real column can be reliably sensed before read WLs (RWLL/RWLR) are disabled. Note that the read scheme for BLs connecting to the nMOS access port is not shown here, as it can be addressed with a traditional timing scheme [26] and is not the critical timing path in this design.

Fig. 10 presents the sensing delay comparison between the proposed scheme and the conventional single-ended sensing scheme. The sensing delay is defined as the delay from the time that RWL is enabled to the time that RBL is correctly sensed. It is important to minimize the sensing delay as it accounts for around 2/3 of the total access delay in our design. As mentioned before, for the single-ended SA, the RBL has to be at least 0.3 VDD to overpower the transistor


Fig. 11. BCAM search example in a 2 × 4 subarray. The transistor in red is activated to charge the corresponding RBL.
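
The BCAM search procedure of Section V-A can be sketched functionally as below. The model keeps only the logic described in the text: a mismatching cell charges one of the two read bitlines of its column, each low-skewed inverter outputs 0 for a charged line, and an AND gate combines the two inverter outputs. Which storage node each pMOS exposes to RBL or RBLB is an assumption of this sketch, and the stored contents are hypothetical rather than the values drawn in Fig. 11.

```python
# Functional sketch of a BCAM search over column-wise stored words.
# Assumption: the left pMOS (on for search bit 1) exposes the complement
# node to RBL, and the right pMOS (on for search bit 0) exposes the true
# node to RBLB, so only a mismatching cell can charge a line.

def search_column(stored: list[int], key: list[int]) -> bool:
    """Return True when every stored bit equals the search bit on its row."""
    if len(stored) != len(key):
        raise ValueError("stored word and key must have equal length")
    rbl_charged = any(k == 1 and s == 0 for s, k in zip(stored, key))
    rblb_charged = any(k == 0 and s == 1 for s, k in zip(stored, key))
    inv_rbl = int(not rbl_charged)     # inverter outputs 1 while the line stays low
    inv_rblb = int(not rblb_charged)
    return bool(inv_rbl & inv_rblb)    # AND gate: 1 means match

def bcam_search(columns: list[list[int]], key: list[int]) -> list[bool]:
    """Compare the key against every column in parallel."""
    return [search_column(col, key) for col in columns]

# Hypothetical 2-row x 4-column contents (not the values in Fig. 11).
columns = [[1, 0], [1, 1], [0, 1], [0, 0]]
print(bcam_search(columns, key=[1, 1]))
```

With the key [1, 1], only the second column in this hypothetical array reports a match, because every other column contains at least one cell that charges RBL.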

mismatch under all process corners. In contrast, to enable reliable sensing with the low-skewed inverter, the RBL only needs to be larger than the trip voltage point of the low-skewed inverter. Actually, the trip point of the low-skewed inverter is only around 0.3 VDD across different process corners. As we adopt the elevated precharge of RBL to 0.1 VDD, only an extra 0.2 VDD RBL charging is required. Thus, the total RBL charging delay can be reduced by at least 16% across different process corners, as shown in Fig. 10.

Fig. 9. Read scheme for the RBLs. (a) Overall schematic. (b) Detailed schematic of precharge circuitry and low-skewed inverter. (c) Timing waveform during a sensing process.

Fig. 10. Charging delay comparison between the proposed scheme and single-ended SA scheme.

V. APPLICATION IN CAM AND LOGIC-IN-MEMORY

In this section, we present two IMC operations exploiting the proposed 8T CSRAM. First, we show that the proposed design is reliable and efficient for searching applications when configured as binary CAM (BCAM) and ternary CAM (TCAM). Second, the multioperand logic-in-memory operations will also be demonstrated.

A. Searching in CAM

One of the promising IMC applications is the searching function in CAM [22], [24]. The conventional CAM cell consists of a 6T storage part and a 4T dynamic XNOR circuit. Unlike conventional CAM, the IMC-based CAM can perform area-efficient bit-wise XNOR through access transistors and BLs to indicate a string match or mismatch.

1) Binary CAM: A BCAM example on a 2 × 4 subarray is presented in Fig. 11. To support CAM operation, the read-WL (RWL) is split into RWLR and RWLL. The data to be searched are stored in a column-wise format and buffered with search drivers to compare with all the columns by driving row-wise WLs (i.e., RWLR or RWLL). Unlike designs in [22] and [24], which apply a lower supply voltage on the search drivers, the proposed design can adopt a nominal voltage to boost the access performance. If the input datum is “0,” the RWLRs will be low to turn on the right pMOS access transistors, while the RWLLs will be high to cut off the left pMOS access transistors. An opposite scenario occurs when the input datum is “1.” For each column, a pair of low-skewed inverters is used to detect the RBL/RBLB behavior, and an AND gate connected to two inverters produces a match or mismatch output. For a search mismatch, RBL in the first column is charged by the bit stored on the second row, as shown in Fig. 11. Then, the inverter connected to the charged RBL will generate a logic “0.” Hence, the AND result of two inverters is logic “0,” which indicates


Fig. 12. TCAM search example in a 4 × 4 subarray. The bitcells enclosed by purple boxes are used to represent “don’t care” state.

a mismatch. For a match, as the second column showing in TABLE III


Fig. 11, the RBLs will not be charged and keep at low. Then, S UMMARY OF S UPPORTED L OGIC F UNCTION
the AND result of two inverters is logic “1,” indicating a match.
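The BCAM search described above (a bit-wise comparison on every row, followed by an AND of the two inverter outputs in each column) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit itself: the function name `bcam_search`, the list-of-columns data layout, the example data, and the mapping of key bits to the left or right read bitline are all hypothetical and abstract away which storage node drives which bitline.

```python
from typing import List, Sequence


def bcam_search(columns: Sequence[Sequence[int]], key: Sequence[int]) -> List[bool]:
    """Return one match flag per stored column for a binary CAM search.

    Behavioral model only: words are stored column-wise, and the key is
    applied row-wise to every column at once.
    """
    results = []
    for word in columns:
        if len(word) != len(key):
            raise ValueError("stored word and search key must have equal length")
        # Two read bitlines per column; both start low (not charged).
        rbl_left_charged = False
        rbl_right_charged = False
        for stored_bit, key_bit in zip(word, key):
            mismatch = stored_bit != key_bit  # complement of bit-wise XNOR
            if key_bit == 0:
                # Key "0" enables the right access port (assumed polarity).
                rbl_right_charged = rbl_right_charged or mismatch
            else:
                # Key "1" enables the left access port (assumed polarity).
                rbl_left_charged = rbl_left_charged or mismatch
        # Each low-skewed inverter outputs 1 while its bitline stays low.
        inv_left = not rbl_left_charged
        inv_right = not rbl_right_charged
        # The AND gate reports a match only if neither bitline was charged.
        results.append(inv_left and inv_right)
    return results


# Hypothetical 2 x 4 example: four 2-bit words, one per column.
stored = [[0, 0], [1, 0], [1, 1], [0, 1]]
matches = bcam_search(stored, [1, 0])
```

In this model a single mismatching row is enough to charge a bitline and force the column output to "0," which mirrors the wired behavior of the shared read bitlines; only the second column of the example data matches the key.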
2) Ternary CAM: A TCAM search example is shown in
Fig. 12. As the TCAM has three states, two bits are needed
to represent states 0, 1, and X (i.e., “don’t care”). Thus, two
columns are used to store each word. State X is represented
by “10,” which is enclosed by purple boxes, while states 0 and 1
are represented by “00” and “11,” respectively. The sensing
procedure is similar to that of BCAM. For each stored word,
a search result can be generated by ANDing the output of the
first inverter and the fourth inverter. A selected cell with X
will not affect the result of the first and fourth inverters; thus
it always represents a matched cell. When there is a match,
no BL will be charged, as shown in the first two columns of
Fig. 12. When there is a search mismatch, as in the third bits
of the last two columns in Fig. 12, the BL will be charged
and the inverter will generate a logic “0,” thereby detecting a
mismatch.

B. Logic-in-Memory
The multioperand compound logic function is useful in
many applications like Hamming encoding [24]. However,
because the conventional 6T CSRAM can only support a
two-word logic function per cycle, more than two cycles are
required to operate a compound logic-in-memory computation.
By utilizing the two read ports of the proposed 8T CSRAM,
four words can be accessed simultaneously to perform a
compound logic operation in one cycle. As shown in Fig. 13(a),
two RWL pairs are accessed to perform a logic function in
RBLs, while two WLs are also accessed to perform another
logic function in BLs. With a number of logic gates introduced
as shown in Fig. 13(a), versatile compound logic operations
can be achieved. For the bitcells accessed through the pMOS
access transistors, the proposed 8T CSRAM achieves reliable
compute access without any read disturbance issue, as shown
in Fig. 13(b). For the bitcells accessed by the nMOS transistors,
as there are only two words activated, a low-swing BL
scheme [25], [26] can be applied to efficiently mitigate the
read disturbance from BLs. As shown in Fig. 13(c), there is
also no data flipping when compute access is performed on
the nMOS read port.
Table III summarizes the supported logic functions of the
proposed design. If a two-operand Boolean logic function is
required, either RBLs or BLs can be used. If a four-operand
compound Boolean logic function is required, both the RBLs
and BLs are used as presented in Fig. 13. Last, if we want to
perform a multioperand simple Boolean logic function, only
the RBLs can be used to support this mode.

VI. EXPERIMENTAL RESULTS
To verify the proposed 8T CSRAM, we have designed
a 16 Kb (128 × 128) CSRAM array that can be configured
to perform CAM or logic-in-memory operations. Fig. 14
shows the overall architecture and layout view of the memory
macro. Compared with the conventional SRAM, two WL
driver blocks are used to drive the pMOS access and the nMOS
access transistors, respectively. The proposed CSRAM was
implemented and post-layout validated in a commercial 28-nm
CMOS technology, occupying a total area of 0.025 mm²
(80 µm × 310 µm). The array efficiency of this design is
58.8%, which is comparable to industrial standards [28].
Monte-Carlo simulation (with both process variations and
mismatch) is adopted for read disturbance analysis. Moreover,
both speed and energy with layout parasitics are taken into
consideration.

Fig. 13. (a) Four-operand compound logic operation utilizing two read ports
(RBLs and BLs). (b) From a 2000-point Monte-Carlo simulation, the read
disturbance from RBLs does not cause data flipping. (c) Read disturbance
from BLs does not cause data flipping.

Fig. 14. Proposed 16 Kb (128 × 128) array. (a) Overall architecture.
(b) Layout view.

A. Reliability
For the compound logic-in-memory functions, the read
disturbance is not prominent as there are only two words
activated per BL, as shown in Fig. 13. However, for the CAM
operation, the read disturbance is more severe as 128 words
are simultaneously accessed. Therefore, we only analyzed the
read disturbance of CAM operation. Taking one column as an
example, the worst case of CAM operation is that only one
accessed bitcell stores datum “0,” while the other 127 accessed
bitcells store datum “1.” When the 128 bitcells sharing the
same column are activated simultaneously, the RBLs will be
charged to a high value quickly. In this case, the stored “0” is
the datum most likely to be corrupted.
Fig. 15 presents the results of a 15k-point Monte-Carlo
simulation for the worst case CAM operation. The Monte-Carlo
simulation is performed with both global (process) and
local (mismatch) variation, at 27 °C and 80 °C. As can be
observed in the left figure of Fig. 15, the RBLs are charged to
the VDD level quickly when WLs are activated. Nevertheless,
the pseudo-write effect through the pMOS access is weak due
to a large driving strength gap between the pMOS access
transistor and the pull-down nMOS transistor, as illustrated
in Section III-C. Thus, no data flipping occurs at either
27 °C or 80 °C, as shown in Fig. 15, which means the
read disturbance is successfully mitigated in this design.

Fig. 15. Transient behavior of CAM operation for a 15k-point Monte-Carlo
simulation with global (process) and local (mismatch) variations and
VDD = 0.9 V. (a) There is no data flipping. The yield is 100% at 27 °C.
(b) There is also no data flipping. The yield is 100% at 80 °C.

B. Access Performance and Energy Evaluation
Fig. 16 shows the operating frequency of logic and CAM
operation with respect to the supply voltage under the typical
process corner and 27 °C. The measured frequency of logic
operation is for the case of a four-operand compound logic
function, while the measured frequency of CAM operation
is for the worst case that only has a one-bit mismatch
in a column. As can be observed, the logic operation is slightly
faster than the CAM operation because a multiword activation
of CAM operation introduces large voltage overshoots caused
by the gate–drain capacitance of the access transistors, which
has a negative impact on the access performance. At the
minimum supply voltage (VDD = 0.5 V), the logic operation
frequency is 180 MHz, while the CAM frequency is 123 MHz.

Fig. 16. Operating frequency of CAM and logic operation across VDD.

Fig. 17 illustrates the CAM operation energy and logic
operation energy versus the supply voltage. The CAM energy is
measured under the case that all column data are mismatched.
At the minimum supply, the BCAM consumes 0.32 fJ/bit.
As two bits are needed to represent a state in TCAM, the
energy consumption per bit is double that of BCAM.
For the four-operand compound logic operation, the measured
energy is averaged over all kinds of data patterns. The
minimum energy consumption is only 5.7 fJ/bit when VDD =
0.5 V.

Fig. 17. (a) BCAM and TCAM energy across VDD. (b) Compound logic
operation energy across VDD.

Table IV summarizes a comparison between the proposed
8T CSRAM and state-of-the-art CSRAM designs. As we
adopt a logic-rule during the layout, the proposed design
has a larger capacitance compared to other push-rule-based
designs. Therefore, for the CAM operation, the proposed
design has an energy overhead. However, the overhead can be
effectively alleviated if the proposed 8T is implemented with
push-rule. In addition, the proposed design still has an energy
benefit for logic operation as we implement low-swing
sensing. Due to the introduction of differential pMOS-based
read ports, the proposed CSRAM has a larger sense margin
compared to the standard 8T, which uses only a single read
port. Thus, a reduced BL swing can also be implemented to
boost the access performance. As can be observed from the
table, when this work is validated in a post-layout fashion,
the maximum frequency of BCAM is up to 1.90 GHz, which
is 2.3× faster than the standard 8T [24], which is only validated
with presimulation, and the maximum frequency of logic
operation is 2.44 GHz, which is also 3.1× faster. Considering
that the standard 8T in the reference work [24] is implemented
in 65-nm technology, we normalized the frequency to 28-nm
technology. Even with an ideal transformation, the projected
frequency of [24] is about 1.63 and 1.59 GHz for CAM and
logic operation, which is still substantially lower than the
proposed design.

TABLE IV
COMPARISON WITH PREVIOUS CSRAM WORKS

Compared to the 6T CSRAM, although the proposed 8T
CSRAM will have around 26% area overhead if they are
implemented with the same technology and logic rule, its
operating frequency is greatly improved. When compared to
a 6T CSRAM adopting staggered WLs [30], our design has
improved the access speed by 7.3×. In addition, as a nominal
WL voltage can be applied in this design, the maximum CAM
and logic operation frequency is also 5.1× and 3.1× higher
than a 6T design [22] which adopts a WL under-drive scheme.

VII. CONCLUSION
The read disturbance issue is critical for IMC applications
with multiword activation. In this article, we propose a reliable
8T bitcell with extra differential pMOS access transistors,
which is dedicated to multiword IMC. Comprehensive analysis
shows that CSRAM adopting pMOS access transistor is more
reliable and faster than conventional CSRAM adopting nMOS
access transistor. Moreover, a read BL elevated precharge
circuitry and low-skewed inverter-based sensing scheme is
proposed to accelerate the accessing speed. To illustrate the
effectiveness, a 16 Kb CSRAM, which can be configured to
CAM and logic-in-memory operations, is designed and
post-layout validated. The results show that the operating frequency
of CAM and logic-in-memory operation is 2.3× and 3.1×
faster compared to the state-of-the-art design based on the
standard 8T CSRAM, respectively.

REFERENCES
[1] M. Horowitz, “1.1 Computing’s energy problem (and what we can do
about it),” in IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[2] N. Verma et al., “In-memory computing: Advances and prospects,” IEEE
Solid State Circuits Mag., vol. 11, no. 3, pp. 43–55, Aug. 2019.
[3] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and
R. Das, “Compute caches,” in Proc. IEEE Int. Symp. High Perform.
Comput. Archit. (HPCA), Austin, TX, USA, Feb. 2017, pp. 481–492.
[4] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie,
“DRISA: A dram-based reconfigurable in-situ accelerator,” in Proc. 50th
Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2017,
pp. 288–301.
[5] W. H. Chen et al., “A 65 nm 1 Mb nonvolatile computing-in-memory
ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN
AI edge processors,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2018,
pp. 494–496.
[6] Y. Chen, L. Lu, B. Kim, and T. T.-H. Kim, “Reconfigurable 2T2R
ReRAM architecture for versatile data storage and computing in-memory,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28,
no. 12, pp. 2636–2649, Dec. 2020.
[7] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic RAM,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 26, no. 3, pp. 470–483, Dec. 2017.
[8] P. Wang et al., “Three-dimensional NAND flash for vector–matrix
multiplication,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 27, no. 4, pp. 988–991, Apr. 2019.
[9] X. Si et al., “A dual-split 6T SRAM-based computing-in-memory unit-macro
with fully parallel product-sum operation for binarized DNN edge
processors,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 11,
pp. 4172–4185, Nov. 2019.
[10] W.-S. Khwa et al., “A 65 nm 4 Kb algorithm-dependent computing-in-memory
SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel
product-sum operation for binary DNN edge processors,” in IEEE ISSCC
Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[11] A. Biswas and A. P. Chandrakasan, “CONV-SRAM: An energy-efficient
SRAM with in-memory dot-product computation for low-power convolutional
neural networks,” IEEE J. Solid-State Circuits, vol. 54, no. 1,
pp. 217–230, Jan. 2019.
[12] X. Si et al., “24.5 A twin-8T SRAM computation-in-memory macro for
multiple-bit CNN-based machine learning,” in IEEE ISSCC Dig. Tech.
Papers, Feb. 2019, pp. 396–398.
[13] J. Yang et al., “24.4 sandwich-RAM: An energy-efficient in-memory
BWN architecture with pulse-width modulation,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2019, pp. 394–396.
[14] A. Jaiswal, I. Chakraborty, A. Agrawal, and K. Roy, “8T SRAM cell as
a multibit dot-product engine for beyond von Neumann computing,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11,
pp. 2556–2567, Nov. 2019.
[15] M. Ali, A. Jaiswal, S. Kodge, A. Agrawal, I. Chakraborty, and K. Roy,
“IMAC: In-memory multi-bit multiplication and accumulation in 6T
SRAM array,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8,
pp. 2521–2531, Aug. 2020.
[16] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of
a machine-learning classifier in a standard 6T SRAM array,”
IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[17] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag,
“A multi-functional in-memory inference processor using a standard 6T
SRAM array,” IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 642–655,
Feb. 2018.
[18] X. Si et al., “15.5 A 28 nm 64 Kb 6T SRAM computing-in-memory
macro with 8b MAC operation for AI edge chips,” in IEEE ISSCC Dig.
Tech. Papers, Oct. 2020, pp. 246–248.
[19] M. Kang, S. K. Gonugondla, and N. R. Shanbhag, “Deep in-memory
architectures in SRAM: An analog approach to approximate computing,”
Proc. IEEE, vol. 108, no. 12, pp. 2251–2275, Dec. 2020.
[20] Z. Liu et al., “NS-CIM: A current-mode computation-in-memory architecture
enabling near-sensor processing for intelligent IoT vision nodes,”
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 2909–2922,
Sep. 2020.
[21] C. Yu, T. Yoo, T. T.-H. Kim, K. C. Tshun Chuan, and B. Kim,
“A 16 K current-based 8T SRAM compute-in-memory macro with
decoupled read/write and 1-5bit column ADC,” in Proc. IEEE Custom
Integr. Circuits Conf. (CICC), Mar. 2020, pp. 1–4.
[22] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm
configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit
cell enabling logic-in-memory,” IEEE J. Solid-State Circuits, vol. 51,
no. 4, pp. 1009–1021, Apr. 2016.
[23] Q. Dong et al., “A 4+2T SRAM for searching and in-memory computing
with 0.3-V VDDmin,” IEEE J. Solid-State Circuits, vol. 53, no. 4,
pp. 1006–1015, Apr. 2018.
[24] Z. Lin et al., “In-memory computing with double word lines and three
read ports for four operands,” IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 28, no. 5, pp. 1316–1320, May 2020.
[25] K. Lee, J. Jeong, S. Cheon, W. Choi, and J. Park, “Bit parallel 6T SRAM
in-memory computing with reconfigurable bit-precision,” in Proc. 57th
ACM/IEEE Design Autom. Conf. (DAC), Dec. 2020, pp. 1–6.
[26] J. Chen, W. Zhao, Y. Wang, and Y. Ha, “Analysis and optimization
strategies toward reliable and high-speed 6T compute SRAM,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 4, pp. 1520–1531,
Apr. 2021.
[27] Z. Lin et al., “Two-direction in-memory computing based on 10T SRAM
with horizontal and vertical decoupled read ports,” IEEE J. Solid-State
Circuits, vol. 56, no. 9, pp. 2832–2844, Sep. 2021.
[28] W. Simon, J. Galicia, A. Levisse, M. Zapater, and D. Atienza, “A fast,
reliable and wide-voltage-range in-memory computing architecture,”
in Proc. 56th ACM/IEEE Annu. Design Autom. Conf. (DAC). ACM,
Jun. 2019, pp. 1–6.
[29] A. Jaiswal, A. Agrawal, M. F. Ali, S. Sharmin, and K. Roy, “I-SRAM:
Interleaved wordlines for vector Boolean operations using SRAMs,”
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4651–4659,
Dec. 2020.
[30] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling in-memory
Boolean computations in CMOS static random access memories,”
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12,
pp. 4219–4232, Dec. 2018.
[31] Y. Zhang, L. Xu, Q. Dong, J. Wang, D. Blaauw, and D. Sylvester,
“Recryptor: A reconfigurable cryptographic cortex-M0 processor with
in-memory and near-memory computing for IoT security,” IEEE J. Solid-State
Circuits, vol. 53, no. 4, pp. 995–1005, Apr. 2018.
[32] M. Nabavi and M. Sachdev, “A 290-mV, 3.34-MHz, 6T SRAM with
pMOS access transistors and boosted wordline in 65-nm CMOS technology,”
IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 656–667,
Feb. 2018.
[33] S. Lutkemeier, T. Jungeblut, H. K. O. Berge, S. Aunet, M. Porrmann,
and U. Ruckert, “A 65 nm 32 b subthreshold processor with 9T
multi-Vt SRAM and adaptive supply voltage control,” IEEE J. Solid-State
Circuits, vol. 48, no. 1, pp. 8–19, Jan. 2013.
[34] J. Wang et al., “A 28-nm compute SRAM with bit-serial logic/arithmetic
operations for programmable in-memory vector computing,” IEEE
J. Solid-State Circuits, vol. 55, no. 1, pp. 76–86, Jan. 2020.
[35] H. Kim, T. Yoo, T. T.-H. Kim, and B. Kim, “Colonnade: A reconfigurable
SRAM-based digital bit-serial compute-in-memory macro for
processing neural networks,” IEEE J. Solid-State Circuits, vol. 56, no. 7,
pp. 2221–2233, Jul. 2021.
[36] T. Na, S.-H. Woo, J. Kim, H. Jeong, and S.-O. Jung, “Comparative study
of various latch-type sense amplifiers,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 22, no. 2, pp. 425–429, Feb. 2014.

Jian Chen (Graduate Student Member, IEEE)
received the B.S. degree from Huazhong University
of Science and Technology, Wuhan, China, in 2016.
He is currently working toward the Ph.D. degree at
ShanghaiTech University, Shanghai, China, Shanghai
Institute of Microsystem and Information Technology,
Chinese Academy of Sciences, Shanghai,
and the University of Chinese Academy of Sciences,
Beijing, China.
His current research interests include memory
design, in-memory computing, and ultralow power
VLSI designs.

Wenfeng Zhao (Member, IEEE) received the B.S.
and M.S. degrees from Huazhong University of
Science and Technology, Wuhan, China, in 2007 and
2009, respectively, and the Ph.D. degree from
the National University of Singapore, Singapore,
in 2014.
He is currently an Assistant Professor of Electrical
and Computer Engineering with Binghamton University,
State University of New York, Binghamton,
NY, USA. He was a Postdoctoral Associate with the
Department of Biomedical Engineering, University
of Minnesota, Twin Cities, Minneapolis, MN, USA. His research interests
include neural engineering and neural signal processing, compressed sensing,
in-memory computing, and ultralow power VLSI designs.

Yuqi Wang received the B.E. degree from ShanghaiTech
University, Shanghai, China, in 2019, where
he is currently working toward the master’s degree.
His current research interests include memory
design, in-memory computing, and cryo-CMOS VLSI
design.

Yuhao Shu (Graduate Student Member, IEEE)
received the B.S. degree from Hefei University
of Technology, Hefei, China, in 2019. He is currently
working toward the Ph.D. degree at the
Reconfigurable and Intelligent Computing Laboratory,
ShanghaiTech University, Shanghai, China.
His current research interests include memory
design, in-memory computing, and ultralow power
VLSI designs.

Weixiong Jiang (Graduate Student Member, IEEE)
received the B.S. degree from Harbin Institute of
Technology, Harbin, China, in 2017. He is currently
working toward the Ph.D. degree at the Reconfigurable
and Intelligent Computing Laboratory,
ShanghaiTech University, Shanghai, China.
His current research interests include energy efficient
DNN acceleration and online slack measurement
on FPGA.

Yajun Ha (Senior Member, IEEE) received the
B.S. degree from Zhejiang University, Hangzhou,
China, in 1996, the M.Eng. degree from the National
University of Singapore, Singapore, in 1999, and
the Ph.D. degree from Katholieke Universiteit Leuven,
Leuven, Belgium, in 2004, all in electrical
engineering.
He is currently a Professor with ShanghaiTech
University, Shanghai, China, and also the Director
of the Shanghai Engineering Research Center
of Energy Efficient and Custom AI IC, Shanghai.
Before this, he was a Scientist and the Director of the I2R-BYD Joint
Laboratory, Institute for Infocomm Research, Singapore; and an Adjunct
Associate Professor with the Department of Electrical and Computer Engineering,
National University of Singapore, Singapore. Prior to this, he was an
Assistant Professor with the National University of Singapore. His research
interests include reconfigurable computing, ultralow power digital circuits
and systems, embedded system architecture, and design tools for applications
in robots, smart vehicles, and intelligent systems. He has published around
100 internationally peer-reviewed journal/conference papers on these topics.
Dr. Ha was a recipient of two IEEE/ACM best paper awards. He has
served a number of positions in the professional communities. He has served
as the TPC Co-Chair for ISICAS 2020, the General Co-Chair for ASP-DAC
2014, the Program Co-Chair for FPT 2010 and 2013, the Chair for
the Singapore Chapter of the IEEE Circuits and Systems (CAS) Society in
2011 and 2012, and a member for the ASP-DAC Steering Committee and
the IEEE CAS VLSI and Applications Technical Committee. He has been
a Program Committee Member for a number of well-known conferences
in the fields of FPGAs and design tools, such as DAC, DATE, ASP-DAC,
FPGA, FPL, and FPT. He also serves as the Associate Editor-in-Chief for the
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS
from 2020 to 2021, and an Associate Editor for the IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS from 2016 to 2019, the
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS
from 2011 to 2013, the IEEE TRANSACTIONS ON VERY LARGE SCALE
INTEGRATION (VLSI) SYSTEMS from 2013 to 2014, and the Journal of Low
Power Electronics since 2009.
