
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 6, JUNE 2021

CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference

Zhiyu Chen, Student Member, IEEE, Zhanghao Yu, Student Member, IEEE, Qing Jin, Yan He, Student Member, IEEE, Jingyu Wang, Member, IEEE, Sheng Lin, Student Member, IEEE, Dai Li, Student Member, IEEE, Yanzhi Wang, Senior Member, IEEE, and Kaiyuan Yang, Member, IEEE

Abstract— A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) that eliminates sample-and-hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512 × 128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga-operations-per-second (GOPS) peak throughput and a 49.4-tera-operations-per-second-per-watt (TOPS/W) energy efficiency.

Index Terms— CMOS, convolutional neural networks (CNNs), deep learning accelerator, in-memory computation, mixed-signal computation, static random-access memory (SRAM).

I. INTRODUCTION

DEEP convolutional neural networks (CNNs) achieve unprecedented success in countless artificial intelligence (AI) applications due to their powerful feature extraction capabilities [1]–[3]. In many real-time applications, CNN models are typically pre-trained in the cloud and then deployed in edge devices, such as mobile phones and Internet-of-Things (IoT) devices, for fast and energy-efficient local inference. Because of the very limited computing resources and energy budget, specialized real-time, yet low-power CNN inference hardware is highly desired.

The fundamental and computationally dominant operation of CNNs is convolution. A single convolution step can be expressed by a multiply-and-accumulate (MAC)

    Y = Σ_{i=1}^{R×R×C} W_i · X_i    (1)

where Y, X, and W refer to output activations, input activations, and weights, respectively. R × R represents the kernel size, and C is the number of input channels. It is well known that the energy bottleneck of such computations lies in the overwhelming data movement, rather than the arithmetic operations [4]. The energy to access DRAMs and static random-access memories (SRAMs) is approximately 8 × 10^4 times and 3 × 10^3 times higher, respectively, than that of an 8-bit integer addition in 45 nm [4], leading to the so-called memory wall. The memory wall is particularly severe for data-intensive computing, such as deep learning. State-of-the-art digital CNN accelerators are all optimized for energy-efficient dataflows and reduced memory access, by exploiting data locality and reuse [5]–[7].

To further alleviate the memory wall, emerging non-Von Neumann CNN accelerators that perform computing directly inside the memory by accessing and computing multiple rows in parallel attract significant interest [8]–[19]. In these in-memory computing (IMC) designs, the data movement is significantly reduced, and the read energy is amortized by the parallel access, as shown in Fig. 1. IMC with on-chip SRAMs was first proposed in [20] and first implemented in silicon by Zhang et al. [9], which turns on multiple standard 6T SRAM cells at the same time and accumulates current on the bitline to perform energy-efficient MAC computing.

Fig. 1. Conventional and IMC accelerators.

Manuscript received April 17, 2020; revised August 11, 2020 and October 29, 2020; accepted January 21, 2021. Date of current version May 26, 2021. This article was approved by Associate Editor Jonathan Chang. (Corresponding author: Kaiyuan Yang.)

Zhiyu Chen, Zhanghao Yu, Yan He, Jingyu Wang, Dai Li, and Kaiyuan Yang are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005 USA (e-mail: [email protected]).

Qing Jin, Sheng Lin, and Yanzhi Wang are with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115 USA.

Digital Object Identifier 10.1109/JSSC.2021.3056447
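As a plain companion to (1), the convolution-as-MAC arithmetic can be sketched in a few lines of Python. This is our own toy illustration of the equation only, not the analog hardware's dataflow; all names are ours:

```python
# One output activation Y of a convolution layer, computed as the flat
# MAC of eq. (1): Y = sum_{i=1}^{R*R*C} W_i * X_i, where the R x R x C
# input window and the kernel weights are flattened to 1-D lists.
def mac(weights, inputs):
    assert len(weights) == len(inputs)
    return sum(w * x for w, x in zip(weights, inputs))

# Example: a 3 x 3 kernel over C = 2 input channels -> R*R*C = 18 products.
R, C = 3, 2
W = [1] * (R * R * C)        # all-ones kernel
X = list(range(R * R * C))   # toy input window 0..17
Y = mac(W, X)                # sum of 0..17 = 153
```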
This current-domain IMC technique is then applied to the fully connected layers of binary neural networks [19]. Variants of the current-domain IMC are developed to support multi-bit weights by modulating the pulsewidth of wordline signals [8], and quantized XNOR-nets [21] with customized 12T SRAM cells [13]. A novel 8T SRAM [14] is proposed recently to maintain memory density while preventing read disturbance. While these IMC techniques achieve higher efficiency than their digital counterparts, their application and performance are restricted by the computing inaccuracy caused by process variations. To improve accuracy, charge-domain IMC is developed in [11] and [10], where the analog multiplication is performed on local capacitors and the accumulation is performed by charge sharing among all local capacitors. This charge-domain computation can also be implemented with capacitive coupling to simplify the cell structure [15]. In order to accommodate increasingly diverse CNN structures and bit precisions, reconfigurable in-SRAM computing accelerators [12], [16]–[18], [22] are developed to support multi-bit input activations and weights, pushing IMC toward a universal computing architecture. Recently, advanced techniques, such as sparsity-aware computation [23] and on-chip CNN training [24], are also achieved by IMC.

Since it is difficult to achieve full-precision analog computation inside the memory, precision-configurable IMC architectures work in a bit-serial fashion to support multi-bit computation

    Y = Σ_{p=1}^{B_W} 2^p Σ_{q=1}^{B_X} 2^q Σ_{i=1}^{R×R×C} W_i^p · X_i^q    (2)

where B_W and B_X are the bitwidths of the weights and inputs, respectively. The one-bit MAC operation Σ_{i=1}^{R×R×C} W_i^p · X_i^q is typically done inside the memory in the analog domain, and the rest of the calculation is simply shift-and-add that can be processed by peripheral circuits. To optimize the tradeoff between computing complexity and throughput, X_i^q can be made multi-bit by adopting digital-to-analog converters (DACs), such as in [11], resulting in a modified (2)

    Y = Σ_{p=1}^{B_W} 2^p Σ_{q=1}^{B̂_X} 2^{qh} Σ_{i=1}^{R×R×C} W_i^p · X̂_i^q    (3)

where X̂_i^q is an h-bit number and B̂_X = B_X / h.

While state-of-the-art IMC SRAMs show superior energy efficiency over digital accelerators by leveraging parallelism, fewer memory accesses, and efficient analog computing, they face tradeoffs among computing accuracy, memory density, and precision configurability. With the goal of simultaneously achieving all desired properties of an IMC SRAM, this article presents a Compact, Accurate, and Precision-configurable charge-domain SRAM macro (CAP-RAM) using standard 6T cells. The enabling techniques of the IMC macro include: 1) a compact memory structure supporting lossless charge-domain in-memory computation; 2) a fully reconfigurable semi-parallel computing scheme supporting eight levels of input activations and six levels of weights; and 3) a high-speed and energy-efficient charge-injection SAR (ciSAR) analog-to-digital converter (ADC) avoiding the power-hungry sampling and drivers for reference voltages. Moreover, we show that the proposed macro design is compatible with state-of-the-art structured pruning and quantization training methods. The combination of a high-density on-chip IMC macro and reduced CNN models promises fully on-chip weight storage and highly efficient inference.

The rest of the article is organized as follows. Section II presents the key ideas and design considerations of the core IMC circuits. Section III covers the details of the CAP-RAM implementation, followed by measurement results in Section IV. Section V concludes this article.

II. PRINCIPLES OF THE PROPOSED CHARGE-DOMAIN COMPUTING WITH 6T SRAMS

A. Principles of the Core Circuits

The core unit in CAP-RAM is an SRAM cluster for weight storage and charge-domain MAC computing, as shown in Fig. 2. Each cluster consists of: 1) eight standard 6T SRAM cells to store weights; 2) switches and one metal-oxide-metal (MOM) capacitor to perform charge-domain analog MAC computing; and 3) precharge and read/write circuits for normal SRAM operations. For simplicity, the wordline and bitline for the access transistors on the right-hand side of the 6T cell, which are only used for normal read/write, are omitted in Fig. 2.

Fig. 2. Proposed 6T charge-domain IMC cluster and operating waveforms.

Fig. 2 illustrates the operating principles of the four IMC phases. In the first phase, the local bitline (LBL), the local MOM capacitors C_MOM, and the parasitic wire capacitor C_P on the output line are precharged via the global BL (GBL), the input BL (IBL), and a global PMOS P1, respectively (precharge phase). Next, the 4-bit digital input is converted to a voltage V_IN on the IBL by a DAC and sampled on C_MOM by turning on S_IN (DAC phase). Only one of the eight WLs is activated, so that the stored data in the selected cell controls M1 via the LBL. The voltage on C_MOM is either pulled up to Vdd (equivalent to multiplying by 0) or keeps V_IN (multiplying by 1) based on the ON–OFF state of M1, where the 1b × 4b analog multiplication is performed (multiplication phase). Notice that the analog operations are all referenced to VDD as logic "0." The LBL becomes floating when the accessed cell holds "1." However, the leakage currents from other cells storing "0" cause less than a 0.3-mV change on C_MOM in the worst case (FF corner,
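The four IMC phases can be mimicked with an idealized charge-conservation model. The sketch below is a behavioral model of our own, assuming ideal switches with no coupling, charge injection, or leakage; C_MOM = 1.2 fF and C_P = 80 fF are taken from the text, and VDD is normalized to 1 V:

```python
VDD = 1.0            # supply, normalized
C_MOM = 1.2e-15      # per-cluster MOM capacitor (value from the text)
C_P = 80e-15         # output-line parasitic capacitance (value from the text)

def charge_domain_mac(weights, v_in):
    """weights: the selected bit of each cluster (0/1); v_in: DAC voltages."""
    # DAC + multiplication phases: C_MOM keeps the sampled V_IN if the
    # selected cell stores 1 (M1 off), or is pulled back to VDD if it
    # stores 0 -- the 1b x 4b multiply, referenced to VDD as logic "0".
    v_mom = [v if w == 1 else VDD for w, v in zip(weights, v_in)]
    # Accumulation phase: S_OUT closes and charge is shared across C_P
    # (precharged to VDD) and all the C_MOMs on the output line.
    q_total = C_P * VDD + sum(C_MOM * v for v in v_mom)
    return q_total / (C_P + len(weights) * C_MOM)

# With a common V_IN, the drop below VDD is exactly linear in the number
# of clusters that multiply by 1 -- the linearity the paper relies on.
outs = [charge_domain_mac([1] * k + [0] * (128 - k), [0.6] * 128)
        for k in range(129)]
steps = [outs[k] - outs[k + 1] for k in range(128)]
```

An all-zero weight vector returns exactly VDD, and every additional "1" removes the same increment of output voltage, which is why charge sharing avoids the current-domain non-linearity discussed in Section II-B.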
all other cells storing "0"). Finally, addition is performed by turning off M1 and turning on S_OUT, and hence, the local charge is shared across C_P and all the C_MOM capacitors connected to the output line (accumulation phase). The 6T cells in the cluster save 30% area compared with 8T cells and do not suffer from read disturbance because the cells are only connected to the LBL with low capacitance (1.5 fF). In our importance-sampling-based transient Monte Carlo failure analysis, the failure rate is beyond ten sigmas at the nominal condition. During normal SRAM read/write operations, the switch S_P is turned on to connect all cells directly to the GBL, and the rest of the operations are the same as in standard SRAMs.

To save energy and area, all switches except S_P are implemented with only a PMOS. This is feasible because the DAC output is already forced to be above half Vdd for linearity considerations. S_P is a transmission gate with split P/N control for SRAM read and write, but only the PMOS is turned on to precharge the LBL during computing. One potential problem of the PMOS-only implementation is the charge-injection and capacitive-coupling effects during switching, which change the voltage on C_MOM due to the small capacitance. Fortunately, the coupling effects caused by edges ①–③ in Fig. 2 can be mitigated because of their opposite polarities. The coupling error on C_MOM is calculated by

    V_CC,MOM = V_DD · C_GS / (C_GS + C_MOM).    (4)

Notice that S_OUT brings a coupling error not only to the MOM cap but also to the parasitic cap on the output line

    V_CC,P = V_DD · C_GS / (C_GS + C_P/128)    (5)

where C_P is the total capacitance of the output line. V_CC,MOM and V_CC,P are constant since the related gate control voltages (V_LBL, S_OUT, and S_IN) are rail-to-rail. On the other hand, charge injection comes from the residue charge in the channel, so it only happens during switching off (edges ① and ② in Fig. 2). The error is calculated by

    V_CI = C_OX · W · L · (V_DD − V_TH − V_MOM) / (2 · C_MOM).    (6)

Therefore, the final error that appears on the output line after charge sharing is

    ΔV = V_ON–OFF − V_OFF–ON = (C_P · V_CC,P − 128 · C_MOM · V_CI) / (128 · C_MOM + C_P).    (7)

As a result, the coupling error on the local MOM cap (4) can be perfectly cancelled, while the charge-injection error is partially mitigated by the coupling error on the output line (5). In practice, we tune the size of the three transistors to make sure ΔV = 0 when the inputs are all "0000," which cancels the offset of the analog computation; 1000 Monte Carlo simulations show that the standard deviation of ΔV is only 0.6 mV, so process variation will not significantly affect the cancellation. MOM capacitors of 1.2 fF are implemented for each cluster under tradeoffs between energy and the dynamic range of the analog output.

B. Design Analysis and Related Work

1) Current-Domain Versus Charge-Domain Computing: One of the key design choices for an IMC macro is the analog computing mechanism. Computing in the current domain was first proposed in [9], which accesses multiple 6T SRAM cells simultaneously and accumulates their discharging currents on the bitline. To prevent write disturbance and promote computing accuracy, recent papers introduce 8T cells as storage and computing units [12], [16], as shown in Fig. 3(a).

Fig. 3. Existing IMC cell designs: (a) current-domain computation with an 8T SRAM cell, (b) charge-domain computation with wordline input, and (c) charge-domain computation with bitline input.

The main benefit of current-domain computing cells is their simplicity and compatibility with standard SRAM cells. However, they suffer from relatively low computing accuracy due to the inherently non-linear I_DS dependence and process variations of the access transistor [M1 in Fig. 3(a)]. The linearity will be even worse if M1 enters the linear region when the bitline voltage becomes low during computing. Fig. 4 shows the simulated linearity and variation (100 Monte Carlo simulations) of a 128 × 128 8T IMC SRAM in the 65-nm CMOS process. The access time of each cell directly affects the linearity; it is set to 200 ps in our simulations and is limited by the drivers and parasitic capacitance. To meet the saturation condition of M1 (V_DS > V_GS − V_TH), the input voltage is set to 600 mV, and 64 rows are activated. To cover the full dynamic range, all cells store "1," and the inputs are 1 bit. It is evident from Fig. 4 that the analog computing output is not linear and presents larger variations at lower output voltages.

Fig. 4. Simulated linearity of a current-domain 128 × 128 IMC SRAM, with 0.6-V input voltage, 200-ps access time, and 64 activated rows to avoid M1 entering the linear region.
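Equations (4)–(7) can be evaluated numerically. In the sketch below, C_MOM and C_P come from the text, but C_GS, the channel-capacitance term C_OX·W·L, V_TH, and V_MOM are placeholder values of our own choosing, purely to illustrate how the output-line coupling term in (7) partially offsets the charge-injection term:

```python
VDD = 1.0
C_MOM, C_P, N = 1.2e-15, 80e-15, 128   # values from the text
# Hypothetical device values (NOT from the paper), for illustration only:
C_GS = 0.1e-15     # gate capacitance of the PMOS switch
C_CH = 0.2e-15     # C_OX * W * L, channel capacitance of the switch
V_TH, V_MOM = 0.4, 0.5

# (4) coupling error on the local MOM cap -- cancelled by the
#     opposite-polarity switching edges, so it drops out of (7)
v_cc_mom = VDD * C_GS / (C_GS + C_MOM)
# (5) coupling error on the output-line parasitic cap (C_P/128 per cluster)
v_cc_p = VDD * C_GS / (C_GS + C_P / N)
# (6) charge-injection error from the residue channel charge
v_ci = C_CH * (VDD - V_TH - V_MOM) / (2 * C_MOM)
# (7) net error after charge sharing: the C_P coupling term partially
#     offsets the charge-injection term on the 128 MOM caps
dv = (C_P * v_cc_p - N * C_MOM * v_ci) / (N * C_MOM + C_P)
```

With these placeholder values the two contributions in (7) have opposite signs in the numerator, matching the paper's observation that the residual ΔV can then be nulled by transistor sizing.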
The systematic and random non-linearities will both affect computing accuracy. Fig. 5 shows the simulated error distribution with a nearly 2-least-significant-bit (LSB) standard deviation. In this experiment, the same IMC macro above is quantized by an ideal 6-bit ADC model. To cover the full output range better than pure random inputs, the 128 inputs are divided into 16 groups, where the kth group has 4k "1"s and 64 − 4k "0"s at random locations.

Fig. 5. Simulated histogram of MAC computing error of a current-domain 128 × 128 IMC SRAM.

The linearity concerns also limit the number of WLs that can be turned on in parallel in current-domain IMC, leading to degraded throughput and efficiency because of the reduced parallelism [25]. The simulation results in Fig. 6 depict the tradeoff between parallelism and computing accuracy: a higher input voltage reduces the variation of M1 but makes it easier for M1 to enter the linear region, which ultimately restricts the parallelism. The simulations are done on the same design as Fig. 4, with a fast 200-ps access time.

Fig. 6. Simulated variations (mean over sigma value) of the output line voltage in 100 Monte Carlo simulations (when 16 rows are accessed) and maximum active rows, under different input voltages.

In comparison, charge-domain IMC achieves better computing accuracy and higher parallelism [10], [11]. The computation [as shown in Fig. 3(b) and (c)] is performed on capacitors, which have much less variation than the currents of minimum-sized access transistors. Meanwhile, the charge-sharing-based operation is not affected by the transistors' operating regions, and therefore, a greater number of cells can be turned on together for higher throughput and efficiency gains. No significant linearity degradation is observed even in measurements of CAP-RAM (see Section IV-A1). Therefore, charge-domain computing is a clear choice for accurate and higher-precision IMCs.

2) Wordline Versus Bitline Inputs: Fig. 3 depicts three categories of designs with two approaches to supply the inputs for convolutions: wordline and bitline inputs [10]–[12]. Current-domain IMC is typically done with wordline inputs. The pulsewidth-modulated WL signal can represent multi-bit inputs.

Charge-domain cells support both approaches. In cells with wordline inputs [see Fig. 3(b)], a logic AND is performed between the input and the stored value with the output switch off. Then, the switch is turned on so that the charge is shared across all the capacitors connected to the output line. This is a simple and efficient scheme for binary inputs but requires high-precision circuits to support multi-bit inputs using pulsewidth or current modulation, because the sampling capacitors are tiny (a few fF) for energy considerations. On the other hand, IMC with bitline inputs refers to sampling the input signals on a local capacitor [see Fig. 3(c)]. This scheme supports multi-bit inputs. When the input is an h-bit digital signal, it is first converted to a voltage V_IN on the input line by a DAC. The h-bit × 1-bit multiplication is performed by closing the RWL switch, similar to Fig. 3(b), and the output-line switches are later turned on to finish the accumulation. As indicated in (3), an h-bit input architecture achieves nearly h times the throughput of the pure bit-serial scheme, which needs h loops to perform the same operations. Thus, CAP-RAM (Fig. 2) adopts bitline inputs to support higher throughput.

3) Clustering Structure: Conventionally, IMC SRAMs are expected to activate all rows simultaneously to maximize energy efficiency and compute density, while CAP-RAM groups several 6T cells and one analog computing module into a cluster (as shown in Fig. 7), and only one of those cells is selected at each operation. This is the result of a design compromise to amortize the large bitcells needed for charge-domain in-SRAM computing. Larger bitcells not only reduce compute density but also increase energy and delay over an ideal fully parallel IMC 6T-SRAM. On the other hand, as discussed in Section I, analog MAC with multi-bit inputs can linearly increase the throughput and energy efficiency over bit-by-bit serial computing [15], [26], which is leveraged in CAP-RAM together with the clustering structure to offer a higher macro-level compute density than fully parallel macros. For instance, state-of-the-art charge-based serial computing cells, even for bit-by-bit serial computing, are about two to three times the area of a logic-rule 6T SRAM cell [15], [16], [26].

Fig. 7. Comparison between the fully parallel structure and the clustering structure.

Comparatively, the standard 6T SRAM cell is used in CAP-RAM. Our implementation is in logic rule, but "push-rule" cells with ∼50% less cell area can be easily adopted. The switching circuit is around three times the size of a logic-rule 6T cell. The area overhead of the switching circuit can be greatly amortized by the clustering structure. For example, a CAP-RAM cluster of three non-push-rule cells will take the same area as two or three cells doing bit-by-bit serial computation. If a 4-bit input × 1-bit weight MAC is performed within the same-sized array, CAP-RAM will provide the highest total number of operations (bitwise multiplies and adds) per cycle. In this iso-area comparison, the CAP-RAM will
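The throughput argument for bitline (DAC) inputs can be sanity-checked against (2) and (3) in a few lines. This is our own toy model, not the silicon datapath: with B_X = 4 and h = 4, one h-bit analog sample per weight bit replaces four 1-bit inner loops, yet both schemes reconstruct the same full-precision MAC. Bit planes here are 0-indexed, i.e. the weight value is Σ_p 2^p W^p, a deliberate simplification of the paper's indexing:

```python
import random

random.seed(0)
N, BW, BX = 128, 4, 4   # vector length, weight bits, input bits (h = BX)

Wb = [[random.randint(0, 1) for _ in range(N)] for _ in range(BW)]  # weight bit planes
Xb = [[random.randint(0, 1) for _ in range(N)] for _ in range(BX)]  # input bit planes

wv = [sum(2**p * Wb[p][i] for p in range(BW)) for i in range(N)]
xv = [sum(2**q * Xb[q][i] for q in range(BX)) for i in range(N)]
y_ref = sum(w * x for w, x in zip(wv, xv))   # full-precision MAC, the target

# Eq. (2): pure bit-serial -- BW * BX one-bit analog MACs plus shift-and-add
y_serial = sum(2**p * 2**q * sum(Wb[p][i] * Xb[q][i] for i in range(N))
               for p in range(BW) for q in range(BX))

# Eq. (3): h-bit DAC inputs -- with h = BX the whole input becomes one
# analog sample, so only BW analog MACs are needed: an h x reduction
# of the inner loop, matching the claimed throughput gain
y_dac = sum(2**p * sum(Wb[p][i] * xv[i] for i in range(N)) for p in range(BW))

assert y_ref == y_serial == y_dac
```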
simultaneously achieve the highest compute density and throughput, and also a higher weight storage density. If push-rule layouts are considered, the gaps become even larger because custom cells are unlikely to take full advantage of push rules. A previous multi-bit MAC computing IMC SRAM [11] also adopts the clustering structure, but the use of 10T SRAM cells and more transistors in the switching circuits leads to ∼50% more cell area and less compute density than CAP-RAM.

Meanwhile, the clustering IMC macro enables a candidate architecture for edge inference, where the IMC macro (i.e., the PE) stores the entire model. Compared to the classical architecture with a large on-chip buffer or embedded non-volatile memory to store weights, CAP-RAM potentially offers a storage density comparable to an SRAM buffer and eliminates the latency and energy associated with weight loading. When CAP-RAMs are used in such an architecture, each cluster can be used to store different CNN layers that are not executed in parallel. This further alleviates the penalty on parallelism due to clustering. It is worth mentioning that full parallelism of all MAC operations is possible with inter-layer pipelining [27], [28], and the imbalanced speed/throughput of each pipeline stage can be solved by mapping techniques, such as replication. However, such architectures come with high overall power and pipelining overhead and also require the pipelines to be filled by streaming inputs, both of which may not be acceptable in many edge applications. To this end, the proposed clustering structure offers a tuning knob for the tradeoff between weight storage density and compute density. Generally, the greater the number of cells in a cluster, the lower the compute density, but the higher the storage density. For specific applications, CAP-RAM-based hardware can be moved on this optimization plane through co-design of the cluster size and the mapping strategy. In our prototype, eight-cell clusters are employed to maximize on-chip storage density and allow efficient layer-by-layer execution of LeNet and ResNet models.

III. IMPLEMENTATION

The proposed CAP-RAM macro consists of (see Fig. 8): a 6T SRAM array with charge-domain IMC switches, current-steering DACs for each column, ciSAR ADCs for each pair of slices, and digital peripherals after the ADCs. In the prototype, a 512 × 128 memory divided into 32 pairs of slices is implemented. Each slice consists of 128 clusters and performs one MAC computation on its output line, which is connected to one of the differential inputs of an ADC.

Fig. 8. System diagram of the proposed charge-domain IMC architecture.

IMC and analog-to-digital conversion are done serially in one global clock cycle (70 MHz). During IMC, 128 4-bit inputs are first converted to voltages via 128 DACs and transmitted to 64 clusters on the input line. The output-line voltage is then quantized by a 7-bit SAR ADC. The digital codes are registered and sent to the digital periphery, which consists of serial accumulators and adder trees that accumulate partial sums and combine them to support programmable input and weight precision.

A. Reconfigurable Support of Two Data Encoding Schemes

The proposed architecture supports both 2's complement and ternary encodings of weights, which are the most common in CNN quantizations. This is achieved by grouping two neighboring slices into a pair ("+" and "−" slices in Fig. 8) and connecting their outputs to the differential ADC inputs. The computation and quantization can be programmed to single-ended or differential modes to support the two encoding schemes in different CNN layers.

The 2's complement encoded data are binary and, thus, only require a single SRAM cell for each bit. To conduct IMC for k-bit 2's complement weights, they are stored in the same column but in k neighboring "+" or "−" slices, as shown in Fig. 9(a). The two slices in each pair alternately turn on with the ADC set to single-input mode. To obtain the final MAC result, k consecutive ADC outputs are shifted and summed up in the digital periphery. Since the most significant bit (MSB) is negative, the sign of the corresponding partial sum needs to be reversed. The 2's complement encoding is preferred for storage efficiency because k bitcells contain k bits of information.

Fig. 9. Illustration of the mapping of (a) 2's complement encoding and (b) ternary encoding.

On the other hand, each ternary encoding unit (−1/0/+1) requires two binary SRAM cells. For example, 6 = 2^3 × (+1) + 2^2 × (−1) + 2^1 × (+1) + 2^0 × (0). As shown in Fig. 9(b), two cells with the same index within a slice pair store one encoding unit: "01" for "+1," "10" for "−1," and "00" for "0." The ADCs run in differential-input mode and, hence, naturally perform the subtraction of the positive and negative results without extra processing. In general, to store k bits of information, 2k − 2 bitcells are needed for ternary encoding, but only k − 1 ADCs are required. This is particularly useful for ternary neural networks with only −1/0/+1 weights, because they require the same memory footprint as 2-bit 2's complement weights, but save one ADC for each MAC.
CHEN et al.: CAP-RAM: A CHARGE-DOMAIN IMC 6T-SRAM 1929

Fig. 11. Operating waveforms of the ciSAR ADC.


Fig. 10. Diagram of the ciSAR ADC. The global SAR control logic and
clock are shared by 32 ADCs.

B. Compact and Driver-Less ciSAR ADC


ADCs are critical in the design of IMC macros because
of the stringent area and energy constraints. Flash ADCs are
widely used for low-precision IMC [13], [15], [29] due to their
simplicity, but its area scales quickly with a resolution. A serial
ADC [11], which uses a single dummy cell as a reference, Fig. 12. Diagram of the current-steering DAC and the simulated linearity.
achieves 7-bit resolution and high area-efficiency, but the serial
nature results in low throughput. MAC output is sampled on a capacitor, an input buffer with
SAR ADCs offer balanced area, throughput, and energy in a sufficient slew rate is necessary, which consumes significant
the 5–8-bit precision range that is most common for CNN energy and area. Comparatively, the ciSAR ADC only requires
quantizations. For example, a 5-bit SAR ADC is adopted in the signal to be sampled on a separate large capacitor. Thus,
[12]. However, conventional capacitor-based SAR ADCs face no extra S&H circuits are necessary for CAP-RAM because
several challenges in IMC applications. First, the capacitive the analog MAC outputs are already computed and stored on
DAC is large so that the total area of ADCs can be as large 128 local CMOM ’s and C P after charge sharing.
as that of the SRAM array [12]. Second, powerful reference Since CI cells can only deduct the charge from the main
voltage buffers with large output currents are necessary to drive the bottom plates of all capacitive DACs in the same macro. Finally, in charge-domain IMC, the analog computing output is in charge and, thus, requires an input analog buffer to drive sample-and-hold (S&H) circuits. It is worth mentioning that direct charge sharing from the local C_MOM's to the top plate of the capacitor array (about 160 fF for a 7-bit ADC) is prohibitive due to significant dynamic range loss. Assuming C_MOM = 1.2 fF and C_P = 80 fF, the input range reduces to 1.2 × 128/(1.2 × 128 + 80 + 160) = 39% for a 7-bit ADC and further drops to 28% for an 8-bit ADC.

This work exploits ciSAR ADCs, first proposed in [30], to address the challenges above and to meet the specific requirements of CAP-RAM. Capacitors in the DACs are replaced by transistors with a long channel length (see M5 in Fig. 10) in the charge-injection cells (CI cells), which take much less area than unit capacitors in conventional SAR ADCs. M5 behaves like a capacitor because it can store charge in the channel. Despite its compact area (429 μm²), the ciSAR ADC achieves a 6.85 effective number of bits (ENOB) in transient noise simulations. Unlike conventional SAR ADCs, which are referenced by voltages connected to the bottom plate of the capacitive DAC, the ciSAR ADC controls the conversion step through a bias voltage V_B that sets the current of M3; therefore, no powerful reference driver is required. Moreover, in conventional SAR ADCs, the input voltage is sampled on the DAC's capacitor with an input buffer and S&H circuits. Because the conversion can only remove charge from the computing capacitor, the SAR logic follows the monotonic switching procedure proposed by Liu et al. [31]. Part of the SAR control module is shared among the 32 ADCs so that its area is amortized. The MSB of the 7-bit differential ADC is processed by discharging all 16 CI cells twice in order to reduce the number of CI cells (see Fig. 11). At each SAR step, either M1 or M2 is turned on based on the comparison result. The charge is then shared from the C_MOM's and C_P to the top node of the CI cell. This charge transfer is fast because M5 has a faster settling speed than MOM capacitors [30]. M6 resets the "Top Node" before each conversion cycle. CI cells for different SAR steps are unary coded to mitigate mismatch. The ADC can easily be switched to the single-input mode by setting S_OUT of the disabled slice to "0" and precharging its C_MOM's and C_P to Vdd. The rest of the operations are identical to the differential case. Naturally, the single-ended ADC has 6-bit resolution because the sign bit of the 7-bit output code is discarded.

C. Current-Steering DAC

A basic current DAC using biased transistors is used to generate the 4-bit input signal for each column, as shown in Fig. 12. The input bitlines are first precharged and then discharged by the DAC. The biased transistors are up-sized to reduce process variation and the effect of channel-length modulation yet still match the pitch of a 6T cell. The 4-bit input digital code controls the binary-sized current paths.
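The dynamic-range penalty quantified in Section III-B for sharing the MAC charge directly into a conventional SAR capacitor array can be checked with a few lines of arithmetic. This is a sketch using only the capacitor values stated in the text; treating the 8-bit array as doubling to roughly 320 fF is an assumption made here for illustration.

```python
# Fraction of the ADC input range that survives direct charge sharing
# from the 128 local C_MOM's into a conventional SAR capacitor array.
C_MOM_FF = 1.2   # fF, one local MOM capacitor (value from the text)
N_COLS = 128     # number of C_MOM's sharing charge
C_P_FF = 80.0    # fF, parasitic capacitance (value from the text)

def surviving_range(c_array_ff: float) -> float:
    c_signal = C_MOM_FF * N_COLS  # 153.6 fF of signal capacitance
    return c_signal / (c_signal + C_P_FF + c_array_ff)

print(f"7-bit array (160 fF): {surviving_range(160.0):.0%}")           # 39%
print(f"8-bit array (320 fF, assumed): {surviving_range(320.0):.0%}")  # 28%
```

The sharp loss with each added bit of resolution is the motivation, per the text, for replacing the conventional capacitive DAC with charge-injection cells rather than buffering the MAC charge onto a capacitor array.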
Authorized licensed use limited to: California State University Fresno. Downloaded on July 01,2021 at 00:09:01 UTC from IEEE Xplore. Restrictions apply.
1930 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 6, JUNE 2021
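Before the simulation results below, the current-steering DAC described in Section III-C can be sketched behaviorally: the 4-bit code selects binary-sized current paths that discharge a precharged bitline for a fixed time. All numeric values in this sketch are hypothetical placeholders; only the structure comes from the text.

```python
# Behavioral sketch (not the actual circuit) of the 4-bit current-steering
# DAC: a precharged input bitline is discharged by binary-sized current
# paths selected by the input code. V_PRE, I_UNIT, T_DIS, and C_BL are
# illustrative placeholder values, not numbers from the paper.
V_PRE = 1.2     # V, bitline precharge level (placeholder)
I_UNIT = 1e-6   # A, LSB current path (placeholder)
T_DIS = 1e-9    # s, discharge window (placeholder)
C_BL = 50e-15   # F, bitline capacitance (placeholder)

def bitline_voltage(code: int) -> float:
    """Bitline voltage after discharging with a 4-bit input code."""
    assert 0 <= code <= 15, "4-bit input code"
    i_total = code * I_UNIT  # binary-sized paths sum to code * I_UNIT
    return V_PRE - i_total * T_DIS / C_BL

# Ideally the discharge is linear in the code (uniform LSB steps); this
# is the property the Monte Carlo variation analysis below examines.
steps = [bitline_voltage(c) - bitline_voltage(c + 1) for c in range(15)]
assert all(abs(s - steps[0]) < 1e-12 for s in steps)

# The placeholders keep the bitline above 600 mV, mirroring the text's
# condition for the biased transistors to stay in saturation.
assert bitline_voltage(15) > 0.6
```

In the real circuit, mismatch and channel-length modulation perturb the per-code discharge current, which is why the transistors are up-sized and the bias point is constrained as described next.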

Monte Carlo simulations show that the standard deviation of the input line voltage is only around 1.56 mV in the worst scenario. In order for the biased transistor to remain in the saturation region, its biasing voltage and discharging time are set to keep the bitline voltage above 600 mV. Simulation results demonstrate excellent linearity with R² = 0.9999 (see Fig. 12).

D. Digital Processing Periphery

Fig. 13. Diagram of the adder-tree-based digital processing periphery.

The digital periphery after the ADCs is responsible for shifting and adding the partial sums based on (3). As shown in Fig. 13, the ADC output code is first passed through a 2's complement transformation module, where the sign of the MSB's partial sum is reversed when 2's complement encoding is used. Notice that the "+1" operation in the transformation is done by the adder inside the accumulator. To support up to 8-bit input activations, the system takes two global cycles to add the two partial sums in the accumulator, while inputs of less than 4 bits only require one clock cycle. The adder tree sums up the results of each accumulator to combine the partial MAC results calculated with different bits of the weights. The adder tree is programmable with four levels of outputs (see Fig. 13) to support six weight bitwidths (1/2/3/4/5/8). At output level i, 2^i slices are combined, so 2^i-bit weights are supported by 2's complement encoding (1/2/4/8) and (2^(i−1) + 1)-bit weights (i ≠ 0) by ternary encoding (2/3/5).

Compared to switch-capacitor-based analog shift-adders like [8], the adder-tree-based approach has the reconfigurability to process multiple weight bitwidths. More importantly, analog shift-adders require the voltage to be sampled on separate switch capacitors by charge sharing, but this operation would significantly reduce the input dynamic range of the ADCs since the analog computing voltage has already been sampled on the C_MOM's and C_P's.

IV. MEASUREMENT RESULTS

Fig. 14. Micrograph of the prototype chip.

A prototype chip, shown in Fig. 14, is fabricated in a 65-nm LP CMOS process. The total area of the CAP-RAM macro is 0.179 mm². The 8KB SRAM array occupies 62.6% of the area, and the amortized bitcell area is about 38% larger than that of the standard 6T cell. The ciSAR ADCs and current DACs only occupy 8.2% and 3.6% of the total macro area, respectively; 5.8% of the area is used by control signal drivers and read/write circuitry, and 13.7% is occupied by the digital processing periphery and readout registers. In all experiments, the test chip is interfaced with a host PC through digital I/O devices.

A. Linearity

The accuracy of analog IMC is largely decided by the linearity of each component in the computing pipeline, i.e., the DAC, the multiplication and addition, and the ADC. We thoroughly evaluate the linearity of each component and propose a post-silicon calibration approach to enhance the linearity.

Fig. 15. Measured computing linearity of charge-domain IMC.

1) Linearity of Charge-Domain MAC Computing: Fig. 15 shows the measured output line voltage of one slice under different input patterns. The ideal output code refers to the exact MAC computing results quantized by an ideal 6-bit ADC. To isolate the impact of the nonideality of the DACs, the input code of each DAC is either "0000" or "1111." Due to the charge-sharing mechanism, almost perfect linearity is achieved, with a 0.9999 R² value.

2) Linearity of ADCs: The linearity of the ADC is measured indirectly by controlling the MAC inputs. Given the high linearity of charge-domain IMC, a reasonable estimation of the ADC's linearity can be obtained. The ADCs operate in the 7-bit differential mode, but only half of the ADC dynamic range (6 bit) is used because the computing results with a certain weight pattern only span half of the full range. As shown in Fig. 16(a), the 32 slices in one chip show offset and gain variations. The comparators' offset in the ADCs is the main source of offset. The gain variations are caused by variations of the sampled charge, the charge-sharing ratio, and the ADC's conversion step among different slices. The sampled charge Σ_{i=1}^{128} C_MOM × V_IN^(i) of different slices varies due to mismatch of the C_MOM's, while mismatch of the C_MOM-to-C_P ratio results in different charge-sharing ratios. Inconsistent ADC conversion steps are caused by the mismatch of the CI cells and input capacitors (128 × C_MOM + C_P). The observed mismatch is expected due to the area and power constraints of the implementation. Many known analog techniques can mitigate the variations, but, in this work, we focus on digital calibration.

3) Calibration: We propose a two-step low-cost calibration to mitigate the variations and improve linearity. The calibration utilizes the measured data from Fig. 16(a). First, the 32 curves

CHEN et al.: CAP-RAM: A CHARGE-DOMAIN IMC 6T-SRAM 1931

Fig. 16. Measured linearity of the CAP-RAM arrays in two prototype chips. (a) Raw transfer curves of 64 ADCs. (b) Transfer curves of 64 ADCs after two-step calibration. (c) Linearity of the complete analog computing pipeline over 64 slices. (d) INLs of 64 slices.

are linear fitted with y_i = k_i·x + b_i, where y_i is the measured output of the ith ADC and x is the ideal output. Each raw output code ŷ_i is calibrated by (ŷ_i − b_i)/k_i to remove the offset and gain error. Furthermore, a master curve, which is widely used in low-power analog applications, such as temperature sensors, can be applied to alleviate the systematic nonlinearity. The final two-step calibrated MAC result with better linearity is shown in Fig. 16(b). Integrating the calibration module on the chip requires some extra arithmetic logic in the periphery, but its area and energy will be much smaller than those of the existing convolution and batch normalization steps.

4) Linearity of the Analog Computing Slices: We further analyze the linearity of the complete analog computing chain by including the DAC's nonideality. Instead of examining the linearity of a single analog chain as in most ADC studies, it is more important for IMC applications to examine the distribution of INLs and transfer curves across different computing chains and different chips. Therefore, the mean and three-sigma spread of the transfer curves and INLs of 64 slices in two prototype chips are plotted in Fig. 16(c) and (d). Fig. 16(c) is obtained by sweeping the pre-ADC inputs from 0 to 1920. Fig. 16(d) indicates that the largest linearity error of the system is expected to be less than two LSBs. Note that the nonlinearity here includes contributions from the DAC, the analog computation, and the ADC. The measurement also confirms that the DAC's nonideality does not significantly degrade the system's linearity, validating the analysis in Section III-C.

Fig. 17. Measured system error distribution over 524 288 samples (a) after linear fitting and (b) after linear fitting and master-curve calibration with noise filtered out.

Fig. 18. Measured error distribution of eight prototype chips.

B. Computing Accuracy

The nonidealities described above (analog computing error, DAC nonlinearity, ADC nonlinearity, and offset/gain variations), together with thermal noise, decide the final computing errors. In addition to linearity tests, we directly assess MAC computing errors by feeding random sets of inputs to the system and comparing the outputs against the expected ones.

1) Error Distribution: If the inputs are uniformly sampled, the MAC outputs will mostly appear around the center of the dynamic range as a result of the central limit theorem. To alleviate such bias in the measured error distribution, 16 random input sets with different distributions are used. In the kth set, 64 different input patterns are randomly sampled from N(k − 1, 2). In total, 524 288 (16 × 64 × 32 × 16) samples are collected for Fig. 17 because there are 32 ADCs and every measurement is repeated 16 times. Fig. 17(a) shows the error distribution after the system is calibrated by linear fitting. The spread is further reduced when noise is filtered by averaging the outputs of multiple runs with the same inputs and the master-curve calibration is performed, as shown in Fig. 17(b). Compared with the simulated error distribution of the current-domain system in Section II-B, the error of CAP-RAM is still smaller despite the ideal ADC and DAC (1-bit) assumption made there. Fig. 18 shows the error distribution over eight chips.

2) Random Errors: The thermal noise in the ADCs and DACs is another source of computing errors. Fig. 19(a) shows the rms errors of one ADC over the entire input range. The average rms error is 0.35 LSB. Spikes can be observed when the input voltage of the ADC is close to a transition threshold. The variation of rms noise across the 32 ADCs in the same macro is shown in Fig. 19(b). This noise level is acceptable for ADCs targeting


TABLE I
PRUNED AND QUANTIZED LENET-5 STRUCTURE AND MAPPING

TABLE II
QUANTIZED RESNET-20 STRUCTURE AND MAPPING

Fig. 19. Measured rms error (a) of one ADC over pre-ADC values (i.e., analog MAC outputs) and (b) 32 ADCs at four pre-ADC values; rms errors are tested over 128 runs with repeated inputs.

low power and compact area, but it can be further improved in case of relaxed area and power requirements.

3) Inference on MNIST: The system achieves 98.8% inference accuracy on the MNIST data set using a compressed LeNet-5 [32], identical to the software baseline, with 500 images tested. The CNN is specifically trained for CAP-RAM and pruned by 95.6% over the baseline model using the alternating direction method of multipliers-neural network (ADMM-NN) framework [33]. The pruned model has only 19 149 parameters; therefore, the whole network can be stored in a single CAP-RAM macro. The mapping strategy for 2's complement encoded weights follows three rules.

1) The bits of a weight are assigned to the same column but neighboring "+" or "−" slices, as described in Section III-A.
2) Within a single filter, the bits from different weights but the same bit location are mapped to the same row. Different filters are mapped to different slices. If the filter size (R × R × C) is larger than the row size, one filter is mapped as multiple filters.
3) One layer occupies one row in each slice. If the rows across 64 slices cannot accommodate a layer, a new row in each slice is occupied in the same way.

The mapping strategy for ternary encoding is similar, except that its encoding unit (+1/0/−1) utilizes two bitcells, which are paired in the same column and adjacent slices. Table I summarizes the different hardware utilization (rows per slice) for the four layers. In particular, F5 occupies four rows in each slice but only takes four cycles to process in this mapping. This is trivial compared to the convolutional layers (576 cycles for C1), and a huge amount of area is saved due to the clustering structure. The system is in the single-ended mode for C1 for efficient storage, while the other layers use the differential mode because they have ternary weights. It is worth mentioning that only the linear-fitting calibration is applied here, but one can also apply both calibration steps in case the CNN model is more sensitive to computing errors.

4) Inference on CIFAR-10: A quantized ResNet-20 [3] (see Table II) is deployed on CAP-RAM for the CIFAR-10 data set. The same mapping strategy is applied to multiple macros when one macro cannot hold a whole layer (layers 15–19 occupy two macros). One of the challenges of this model is to mitigate the effect of quantization errors, because the output levels of the ADCs (6 bit in the single-ended mode) are much smaller than the voltage levels (about 11 bit) of the inputs. Therefore, a quantization-aware training approach is utilized to tolerate the quantization errors. CAP-RAM achieves 89.0% inference accuracy with 500 images tested and shows 1.6% degradation compared to the software baseline. Compared with recent IMC architectures, Jia et al. [26] show higher accuracy since their design uses 8-bit ADCs and a VGG-like network with 7.04 times more operations than ResNet-20. In [12] and [22], 90.42% and 92.02% accuracies are achieved with ResNet-20, but parallelism is sacrificed to avoid quantization errors: in both designs, a 5-bit ADC only processes fewer than 16 rows of analog computing, limiting the throughput and efficiency gains.

C. Energy and Throughput

The CAP-RAM prototype operates at 70 MHz and achieves 573.4 giga operations per second (GOPS) peak throughput and 3.4 tera operations per second (TOPS)/mm² compute density for convolution with 4-bit inputs and binary (1-b) or ternary (2-b) weights at a 1.2-V supply voltage and 25 °C. The compute density is higher than that of state-of-the-art programmable charge-domain IMC [26], [34], even though CAP-RAM has 1-to-8-row parallelism and multibit analog computing circuits. This is because: 1) the ciSAR ADC has high speed and 2) the clustered 6T cells have a highly compact layout, so the reduced parallelism does not degrade the compute density significantly. Although C3SRAM [15] has a higher compute density (20.2 TOPS/mm²), one operation in C3SRAM is only a 1-b by 1-b addition/multiplication, and a flash ADC with only 11 levels is used for quantization. Since the throughput of bit-serial architectures naturally scales with the bitwidth, a bitwise compute density (1 OP = a 1-b by 1-b addition or multiplication) is defined to make apples-to-apples comparisons, and CAP-RAM becomes the best (27.2 TOPS/mm²) in this metric.

The whole system consumes 11.62 mW with random 4-bit inputs and 1-bit weights. The SRAM array, DACs, and timing controller consume 3.60 mW in the single-ended mode and 6.35 mW in the differential mode with all 32 slices turned on. The power consumption of the 32 ADCs and the shared control


TABLE III
PERFORMANCE SUMMARY OF CAP-RAM AND COMPARISON WITH STATE-OF-THE-ART IN-MEMORY COMPUTING SRAMS

and timing module is 7.56 mW. For the digital periphery, the accumulators and 2's complement modules take 0.78 mW in the accumulation mode and 0.46 mW in the single-cycle mode, while the adder tree consumes 0.04/0.10/0.19 mW at output levels 1/2/3. Different from the fully bit-serial architectures, CAP-RAM's energy efficiency is based on MAC computation with 4-b inputs. To achieve this, more rowwise control signals and 128 4-b DACs are involved. Similar to the definition above, CAP-RAM becomes more competitive in terms of bitwise energy efficiency. More importantly, there exists a tradeoff between storage density and energy efficiency. The target of CAP-RAM is not to achieve the highest energy efficiency but to design a compact, accurate, and programmable architecture while maintaining competitive energy efficiency. The detailed performance comparison is summarized in Table III.

V. CONCLUSION

In summary, this work presents and demonstrates a charge-domain IMC SRAM macro with 6T cells. The charge-sharing mechanism ensures good accuracy, while the semi-parallel architecture provides best-in-class weight storage density. Meanwhile, the digital processing periphery provides input/weight bitwidth configurability, and a ciSAR ADC specifically designed for CAP-RAM further boosts the energy and area performance. A 65-nm prototype demonstrates excellent computing linearity and accuracy. The pruned and quantized LeNet-5 and ResNet-20 are mapped to CAP-RAM macros, achieving 98.8% inference accuracy on MNIST and 89.0% on CIFAR-10, respectively. The system achieves 49.3-TOPS/W energy efficiency and 573.4-GOPS throughput.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.

[6] H.-J. Yoo, S. Park, K. Bong, D. Shin, J. Lee, and S. Choi, "A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 80–81.
[7] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in Proc. 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 92–104.
[8] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, "A multi-functional in-memory inference processor using a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 642–655, Feb. 2018.
[9] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[10] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
[11] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[12] X. Si et al., "A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 189–202, Jan. 2020.
[13] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[14] C. Yu, T. Yoo, T. T.-H. Kim, K. C. Tshun Chuan, and B. Kim, "A 16 K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5 bit column ADC," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Mar. 2020.
[15] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: In-memory-computing SRAM macro based on capacitive-coupling computing," IEEE Solid-State Circuits Lett., vol. 2, no. 9, pp. 131–134, Sep. 2019.
[16] S. Okumura, M. Yabuuchi, K. Hijioka, and K. Nose, "A ternary based bit scalable, 8.80 TOPS/W CNN accelerator with many-core processing-in-memory architecture with 896K synapses/mm²," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C248–C249.
[17] H. Jia, Y. Tang, H. Valavi, J. Zhang, and N. Verma, "A microprocessor implemented in 65 nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing," 2018, arXiv:1811.04047. [Online]. Available: http://arxiv.org/abs/1811.04047
[18] J. Wang, X. Wang, C. Eckert, A. Subramaniyan, R. Das, D. Blaauw, and D. Sylvester, "A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 224–226.
[19] W.-S. Khwa et al., "A 65 nm 4 Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[20] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., Springer, 2016, pp. 525–542.
[22] X. Si et al., "15.5 A 28 nm 64 Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 246–248.
[23] J. Yue et al., "14.3 A 65 nm computing-in-memory-based CNN processor with 2.9-to-35.8TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 234–236.
[24] J.-W. Su et al., "15.2 A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 240–242.
[25] N. Verma et al., "In-memory computing: Advances and prospects," IEEE Solid-State Circuits Mag., vol. 11, no. 3, pp. 43–55, Aug. 2019.
[26] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[27] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[28] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 541–552.
[29] R. Guo et al., "A 5.1 pJ/neuron 127.3 us/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65 nm CMOS," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C120–C121.
[30] K. D. Choo, J. Bell, and M. P. Flynn, "Area-efficient 1GS/s 6b SAR ADC with charge-injection-cell-based DAC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 460–461.
[31] C.-C. Liu, S.-J. Chang, G.-Y. Huang, and Y.-Z. Lin, "A 10-bit 50-MS/s SAR ADC with a monotonic capacitor switching procedure," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 731–740, Apr. 2010.
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[33] A. Ren et al., "ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2019, pp. 925–938.
[34] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Jun. 2019.
[35] S. K. Gonugondla, M. Kang, and N. Shanbhag, "A 42 pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 490–492.

Zhiyu Chen (Student Member, IEEE) received the B.E. degree in electrical engineering from Nanjing University, Nanjing, China, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Rice University, Houston, TX, USA.
His research interests include digital and mixed-signal circuit design for machine learning accelerators.

Zhanghao Yu (Student Member, IEEE) received the B.E. degree in integrated circuit design and integrated system from the University of Electronic Science and Technology of China, Chengdu, China, in 2016 and the M.S. degree in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Rice University, Houston, TX, USA.
His current research interests include analog and mixed-signal integrated circuit design for power management, bio-electronics, and security.

Qing Jin received the M.S. degree in computer engineering from Texas A&M University, College Station, TX, USA, in 2018 and the B.S. and M.S. degrees in microelectronics from Nankai University, Tianjin, China, in 2009 and 2012, respectively.
He was a Research Assistant with Tsinghua University, Beijing, China, between 2010 and 2012. From 2013 to 2017, he was with the School of Microelectronics, Xi'an Jiaotong University, Xi'an, China. He is currently pursuing the Ph.D. degree with Northeastern University, Boston, MA, USA.


Yan He (Student Member, IEEE) received the B.S. degree in electronic science and technology from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Rice University, Houston, TX, USA.
His current research interests include analog and mixed-signal integrated circuit design for power management and hardware security.

Jingyu Wang (Member, IEEE) received the B.S. degree in electronic science and technology and the M.S. and Ph.D. degrees in microelectronics from Xidian University, Xi'an, China, in 2010, 2013, and 2017, respectively.
His current interests include mixed-signal integrated circuits, ADCs, image sensors and their applications, biomedical circuits and systems, and RF integrated circuits.

Sheng Lin (Student Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2013, the M.S. degree from Syracuse University, Syracuse, NY, USA, in 2015, and the Ph.D. degree in computer engineering from Northeastern University, Boston, MA, USA, in 2020, under the supervision of Prof. Yanzhi Wang.
His current research interests include privacy-preserving machine learning, energy-efficient artificial intelligence systems, model compression, and mobile acceleration of deep learning applications.

Dai Li (Student Member, IEEE) received the B.S. and M.S. degrees in electronics engineering from Tsinghua University, Beijing, China, and the M.S. degree in electrical and computer engineering from Rice University, Houston, TX, USA, in 2010, 2013, and 2017, respectively, where he is currently pursuing the Ph.D. degree.
His research interests include VLSI circuits, hardware security, mixed-signal integrated circuits, and low-power circuits.

Yanzhi Wang (Senior Member, IEEE) received the B.S. degree from Tsinghua University, Beijing, China, in 2009 and the Ph.D. degree from the University of Southern California, Los Angeles, CA, USA, in 2014.
He is currently an Assistant Professor with the Department of ECE, Northeastern University, Boston, MA, USA. His research interests focus on model compression and platform-specific acceleration of deep learning applications. His research has maintained the highest model compression rates on representative deep neural networks (DNNs) since September 2018. His work on adiabatic quantum-flux-parametron (AQFP) superconducting-based DNN acceleration achieves by far the highest energy efficiency among all hardware devices. His recent research achievement, CoCoPIE, can achieve real-time performance on almost all deep learning applications using off-the-shelf mobile devices, outperforming competing frameworks by up to 180× acceleration. His work has been published broadly in top conference and journal venues and has been cited over 8500 times.
Dr. Wang has received five Best Paper and Top Paper Awards, another ten Best Paper Nominations, and four Popular Paper Awards. He has received the U.S. Army Young Investigator Program (YIP) Award, the Massachusetts Acorn Innovation Award, the Ming Hsieh Scholar Award, and other research awards from Google, MathWorks, and others. Three of his former Ph.D./postdoctoral students have become tenure-track faculty members at the University of Connecticut, Storrs, CT, USA; Clemson University, Clemson, SC, USA; and Texas A&M University-Corpus Christi, Corpus Christi, TX, USA.

Kaiyuan Yang (Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2012 and the Ph.D. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2017.
He is an Assistant Professor of electrical and computer engineering at Rice University, Houston, TX, USA. His research interests include digital and mixed-signal circuits for secure and low-power systems, hardware security, and circuit/system design with emerging devices.
Dr. Yang received the Distinguished Paper Award at the 2016 IEEE International Symposium on Security and Privacy (Oakland), the Best Student Paper Award (first place) at the 2015 IEEE International Symposium on Circuits and Systems (ISCAS), the Best Student Paper Award Finalist at the 2019 IEEE Custom Integrated Circuits Conference (CICC), and the 2016 Pwnie Most Innovative Research Award Finalist. His Ph.D. research was recognized with the 2016–2017 IEEE Solid-State Circuits Society (SSCS) Predoctoral Achievement Award.
Predoctoral Achievement Award.
