CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 6, JUNE 2021

Abstract— … The mapped networks achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4 giga-operations per second (GOPS) peak throughput and a 49.4 tera-operations per second per watt (TOPS/W) energy efficiency.

Index Terms— CMOS, convolutional neural networks (CNNs), deep learning accelerator, in-memory computation, mixed-signal computation, static random-access memory (SRAM).

Manuscript received April 17, 2020; revised August 11, 2020 and October 29, 2020; accepted January 21, 2021. Date of current version May 26, 2021. This article was approved by Associate Editor Jonathan Chang. (Corresponding author: Kaiyuan Yang.)
Zhiyu Chen, Zhanghao Yu, Yan He, Jingyu Wang, Dai Li, and Kaiyuan Yang are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005 USA (e-mail: [email protected]).
Qing Jin, Sheng Lin, and Yanzhi Wang are with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115 USA.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2021.3056447.
Digital Object Identifier 10.1109/JSSC.2021.3056447

I. INTRODUCTION

DEEP convolutional neural networks (CNNs) achieve unprecedented success in countless artificial intelligence (AI) applications due to their powerful feature extraction capabilities [1]–[3]. In many real-time applications, CNN models are typically pre-trained in the cloud and then deployed in edge devices, such as mobile phones and Internet-of-Things (IoT) devices, for fast and energy-efficient local inference. Because of the very limited computing resources and power budgets of edge devices, energy-efficient CNN inference hardware is essential. The dominant computation in CNN inference is the multiply-and-accumulate (MAC) operation of convolutions

Y = \sum_{i=1}^{R \times R \times C} W_i X_i    (1)

where Y, X, and W refer to output activations, input activations, and weights, respectively. R × R represents the kernel size, and C is the number of input channels. It is well known that the energy bottleneck of such computations lies in the overwhelming data movement rather than the arithmetic operations [4]. The energy to access DRAMs and static random-access memories (SRAMs) is approximately 8 × 10^4 times and 3 × 10^3 times higher, respectively, than that of an 8-bit integer addition in 45 nm [4], leading to the so-called memory wall. The memory wall is particularly severe for data-intensive computing, such as deep learning. State-of-the-art digital CNN accelerators are all optimized for energy-efficient dataflows and reduced memory access by exploiting data locality and reuse [5]–[7].

To further alleviate the memory wall, emerging non-Von Neumann CNN accelerators that perform computing directly inside the memory by accessing and computing multiple rows in parallel have attracted significant interest [8]–[19]. In these in-memory computing (IMC) designs, data movement is significantly reduced, and the read energy is amortized by the parallel access, as shown in Fig. 1. IMC with on-chip SRAMs was first proposed in [20] and first implemented in silicon by Zhang et al. [9], which turns on multiple standard 6T SRAM cells at the same time and accumulates current on the bitline to perform energy-efficient MAC computing.
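To make the MAC formulation in (1) concrete, the short sketch below (a minimal NumPy model, not code from the paper) computes one output activation from an R × R × C weight kernel and the matching input patch; the bit widths are illustrative assumptions.

```python
import numpy as np

# Minimal illustration of (1): one output activation is the MAC of an
# R x R x C kernel with the corresponding input patch.
R, C = 3, 16                              # kernel size and number of input channels
rng = np.random.default_rng(0)

W = rng.integers(-8, 8, size=(R, R, C))   # weights (e.g., 4-bit signed, illustrative)
X = rng.integers(0, 16, size=(R, R, C))   # input activations (e.g., 4-bit unsigned)

# Y = sum_{i=1}^{R*R*C} W_i * X_i
Y = int(np.sum(W * X))
print("output activation Y =", Y)
```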
Fig. 3. Existing IMC cell designs: (a) current-domain computation with an 8T SRAM cell, (b) charge-domain computation with wordline input, and (c) charge-domain computation with bitline input.
Fig. 5. Simulated histogram of MAC computing error of a current-domain 128 × 128 IMC SRAM.

Fig. 6. Simulated variations (mean over sigma value) of the output line voltage in 100 Monte Carlo simulations (when 16 rows are accessed) and maximum active rows, under different input voltages.

Fig. 7. Comparison between the fully parallel structure and the clustering structure.

The systematic and random non-linearities will both affect computing accuracy. Fig. 5 shows the simulated error distribution with a nearly 2-least-significant-bit (LSB) standard deviation. In this experiment, the same IMC macro above is quantized by an ideal 6-bit ADC model. To cover the full output range better than pure random inputs, the 128 inputs are divided into 16 groups, where the kth group has 4k "1"s and 64 − 4k "0"s at random locations.

The linearity concerns also limit the number of WLs that can be turned on in parallel in current-domain IMC, leading to degraded throughput and efficiency because of reduced parallelism [25]. The simulation results in Fig. 6 depict the tradeoff between parallelism and computing accuracy: a higher input voltage reduces the variation of M1 but makes it easier for M1 to enter the linear region, which ultimately restricts the parallelism. The simulations are done on the same design as Fig. 4, with a fast 200-ps access time.

In comparison, charge-domain IMC achieves better computing accuracy and higher parallelism [10], [11]. The computation [as shown in Fig. 3(b) and (c)] is performed on capacitors, which have much less variation than the current of minimum-sized access transistors. Meanwhile, the charge-sharing-based operation is not affected by the transistors' operating regions, and therefore, a greater number of cells can be turned on together for higher throughput and efficiency gains. No significant linearity degradation is observed even in measurements of CAP-RAM (see Section IV-A1). Therefore, charge-domain computing is a clear choice for accurate and higher-precision IMC.

2) Wordline Versus Bitline Inputs: Fig. 3 depicts three categories of designs with two approaches to supply the inputs for convolutions: wordline and bitline inputs [10]–[12]. Current-domain IMC is typically done with wordline inputs; the pulsewidth-modulated WL signal can represent multi-bit inputs.

Charge-domain cells support both approaches. In cells with wordline inputs [see Fig. 3(b)], a logic AND is performed between the input and the stored value with the output switch off. Then, the switch is turned on so that the charge is shared across all the capacitors connected to the output line. This is a simple and efficient scheme for binary inputs, but it requires high-precision circuits to support multi-bit inputs using pulsewidth modulation or current modulation, because the sampling capacitors are tiny (a few fF) for energy reasons. On the other hand, IMC with bitline inputs refers to sampling the input signals on a local capacitor [see Fig. 3(c)]. This scheme naturally supports multi-bit inputs: when the input is an h-bit digital signal, it is first converted to a voltage VIN on the input line by a DAC. The h-bit × 1-bit multiplication is performed by closing the RWL switch, similar to Fig. 3(b), and the output line switches are later turned on to finish the accumulation. As indicated in (3), an h-bit input architecture achieves nearly h times the throughput of a pure bit-serial scheme that needs h loops to perform the same operations. Thus, CAP-RAM (Fig. 2) adopts bitline inputs to support higher throughput.

3) Clustering Structure: Conventionally, IMC SRAMs are expected to activate all rows simultaneously to maximize energy efficiency and compute density, while CAP-RAM groups several 6T cells and one analog computing module into a cluster (as shown in Fig. 7), and only one of those cells is selected in each operation. This design compromise amortizes the large bitcells needed for charge-domain in-SRAM computing; larger bitcells not only reduce compute density but also increase energy and delay over an ideal fully parallel IMC 6T-SRAM. On the other hand, as discussed in Section I, analog MAC with multi-bit inputs can linearly increase the throughput and energy efficiency over bit-by-bit serial computing [15], [26], which CAP-RAM leverages together with the clustering structure to offer higher macro-level compute density than fully parallel macros. For instance, state-of-the-art charge-based serial computing cells, even for bit-by-bit serial computing, are about two to three times the area of a logic-rule 6T SRAM cell [15], [16], [26].

Comparatively, the standard 6T SRAM cell is used in CAP-RAM. Our implementation is in logic rule, but "push-rule" cells with ∼50% less cell area can be easily adopted. The switching circuit is around three times the size of a logic-rule 6T cell, and its area overhead can be greatly amortized by the clustering structure. For example, a CAP-RAM cluster of three non-push-rule cells takes the same area as two or three cells doing bit-by-bit serial computation. If a 4-bit input × 1-bit weight MAC is performed within the same-sized array, CAP-RAM provides the highest total number of operations (bitwise multiplies and adds) per cycle. In this iso-area comparison, the CAP-RAM will …
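To make the iso-area comparison above concrete, the following back-of-envelope sketch plugs in the area ratios quoted in the text (a switching circuit of roughly three 6T-cell areas shared by a three-cell cluster, and charge-based serial bitcells of roughly two to three 6T-cell areas). The exact figures are illustrative assumptions, not measured values.

```python
# Iso-area, operations-per-cycle sketch (illustrative numbers from the text):
# - a CAP-RAM cluster: 3 logic-rule 6T cells + 1 switching circuit (~3 cell areas)
# - a charge-based bit-serial cell: ~2-3 logic-rule 6T cell areas
cluster_area = 3 + 3            # in units of one logic-rule 6T cell area
serial_cell_area = 2.5          # midpoint of the 2-3x range quoted in the text

# In the same footprint as one cluster, how many bit-serial cells fit?
serial_cells_in_same_area = cluster_area / serial_cell_area   # ~2.4

# Per cycle, one cluster performs a 4-bit-input x 1-bit-weight MAC,
# i.e., roughly four bitwise multiply-adds, while each bit-serial cell
# performs one bitwise multiply-add per cycle.
ops_cluster_per_cycle = 4
ops_serial_per_cycle = 1 * serial_cells_in_same_area

print(f"cluster: {ops_cluster_per_cycle} bitwise ops/cycle")
print(f"bit-serial cells in the same area: ~{ops_serial_per_cycle:.1f} bitwise ops/cycle")
```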
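The bitline-input charge-domain MAC described in the wordline-versus-bitline discussion above (DAC conversion, sampling onto a local capacitor gated by the stored bit, and charge sharing across the output line) can also be captured with a simple behavioral model. The sketch below assumes ideal, identical capacitors and an ideal DAC; the function and variable names are illustrative, not the paper's.

```python
import numpy as np

def charge_domain_mac(inputs, weights, vref=1.0, levels=16):
    """Behavioral model of a bitline-input charge-domain MAC.

    inputs  : h-bit digital inputs (0..levels-1), one per cluster
    weights : stored 1-bit weights (0 or 1)
    Returns the shared-charge output voltage and the ideal digital MAC.
    """
    # 1) DAC: each h-bit input is converted to a voltage VIN on the input line.
    vin = np.asarray(inputs) / (levels - 1) * vref
    # 2) Multiply: the RWL switch samples VIN onto the local capacitor only
    #    where the stored bit is 1 (otherwise the capacitor holds 0 V).
    sampled = vin * np.asarray(weights)
    # 3) Accumulate: closing the output switches shares charge across all
    #    (equal) capacitors, so the output line settles to the average voltage.
    vout = sampled.mean()
    return vout, int(np.dot(inputs, weights))

inputs = np.random.default_rng(1).integers(0, 16, size=64)
weights = np.random.default_rng(2).integers(0, 2, size=64)
vout, mac = charge_domain_mac(inputs, weights)
# Under these ideal assumptions, vout is exactly proportional to the digital MAC.
print(vout, mac, mac / (64 * 15))
```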
Fig. 9. Illustration of the mapping of (a) 2's complement encoding and (b) ternary encoding.
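Fig. 9 concerns how signed weights are encoded for bitwise in-memory MACs. As a generic illustration of the 2's complement case (the specific CAP-RAM mapping in Fig. 9 is not reproduced here, and the ternary case is omitted), the sketch below decomposes signed weights into bit planes, computes one bitwise MAC per plane, and recombines the partial sums with the MSB carrying a negative significance.

```python
import numpy as np

def signed_mac_via_bitplanes(W, X, nbits=4):
    """Recombine per-bit-plane MACs into a signed (2's complement) MAC."""
    W = np.asarray(W)
    X = np.asarray(X)
    # Bit i of each weight in its nbits-wide 2's complement representation.
    planes = [((W & (1 << i)) >> i) for i in range(nbits)]
    total = 0
    for i, plane in enumerate(planes):
        partial = int(np.dot(plane, X))        # bitwise (0/1-weight) MAC
        sign = -1 if i == nbits - 1 else 1     # MSB plane has negative weight
        total += sign * (1 << i) * partial
    return total

rng = np.random.default_rng(3)
W = rng.integers(-8, 8, size=32)               # 4-bit signed weights
X = rng.integers(0, 16, size=32)               # unsigned inputs
assert signed_mac_via_bitplanes(W, X) == int(np.dot(W, X))
```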
Fig. 13. Diagram of the adder-tree-based digital processing periphery.

Fig. 15. Measured computing linearity of charge-domain IMC.
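Fig. 13 shows an adder-tree-based digital periphery. As a generic illustration of that idea (not the exact CAP-RAM periphery, whose output levels 1/2/3 are discussed later), the sketch below sums partial results with a pairwise adder tree, which takes ceil(log2(N)) levels of two-input additions.

```python
def adder_tree(values):
    """Sum a list with a pairwise (binary) adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:        # pad odd-length levels with a zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

# Example: combining eight partial sums takes three levels of two-input adders.
print(adder_tree([3, 1, 4, 1, 5, 9, 2, 6]))   # 31
```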
Fig. 16. Measured linearity of the CAP-RAM arrays in two prototype chips. (a) Raw transfer curves of 64 ADCs. (b) Transfer curves of 64 ADCs after two-step calibration. (c) Linearity of the complete analog computing pipeline over 64 slices. (d) INLs of 64 slices.

… are linear fitted with y_i = k_i x + b_i, where y_i is the measured output of the ith ADC and x is the ideal output. Each raw output code ŷ_i is calibrated by (ŷ_i − b_i)/k_i to remove the offset and gain error. Furthermore, a master curve, which is widely used in low-power analog applications, such as temperature sensors, can be applied to alleviate the systematic nonlinearity. The final two-step calibrated MAC result with better linearity is shown in Fig. 16(b). Integrating the calibration module on the chip requires some extra arithmetic logic in the periphery, but its area and energy will be much smaller than those of the existing convolution and batch-normalization steps.

4) Linearity of the Analog Computing Slices: We further analyze the linearity of the complete analog computing chain by including the DAC's nonideality. Instead of examining the linearity of a single analog chain as in most ADC studies, it is more important for IMC applications to examine the distribution of INLs and transfer curves across different computing chains and different chips. Therefore, the mean and three-sigma spread of the transfer curves and INLs of 64 slices in two prototype chips are plotted in Fig. 16(c) and (d). Fig. 16(c) is obtained by sweeping the pre-ADC inputs from 0 to 1920. Fig. 16(d) indicates that the largest linearity error of the system is expected to be less than two LSBs. Note that the nonlinearity here includes contributions from the DAC, …

Fig. 17. Measured system error distribution over 524 288 samples (a) after linear fitting and (b) after linear fitting and master-curve calibration with noise filtered out.

B. Computing Accuracy

The nonidealities described above (analog computing error, DAC nonlinearity, ADC nonlinearity, and offset/gain variations), together with thermal noise, decide the final computing errors. In addition to the linearity tests, we directly assess MAC computing errors by feeding random sets of inputs to the system and comparing the outputs against the expected ones.

1) Error Distribution: If the inputs are uniformly sampled, the MAC outputs will mostly appear around the center of the dynamic range as a result of the central limit theorem. To alleviate such bias in the measured error distribution, 16 random input sets with different distributions are used. In the kth set, 64 different input patterns are randomly sampled from N(k − 1, 2). In total, 524 288 (16 × 64 × 32 × 16) samples are collected for Fig. 17 because there are 32 ADCs and every measurement is repeated 16 times. Fig. 17(a) shows the error distribution after the system is calibrated by the linear fitting. The spread is further reduced when noise is filtered by averaging the outputs of multiple runs with the same inputs and the master-curve calibration is performed, as shown in Fig. 17(b). Compared with the simulated error distribution of the current-domain system in Section II-B, the error of CAP-RAM is still smaller, despite the ideal ADC and DAC (1-bit) assumption made there. Fig. 18 shows the error distribution over eight chips.

2) Random Errors: Thermal noise in the ADCs and DACs is another source of computing errors. Fig. 19(a) shows the rms errors of one ADC over the entire input range; the average rms is 0.35 LSB. Spikes can be observed when the input voltage of the ADC is close to a transition threshold. The variation of rms noise across the 32 ADCs in the same macro is shown in Fig. 19(b). This noise level is acceptable for ADCs targeting …
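The per-ADC offset/gain calibration described above fits each ADC's transfer curve with y_i = k_i x + b_i and then inverts the fit on raw codes as (ŷ_i − b_i)/k_i. A minimal sketch of that step is shown below; the master-curve step is approximated here by the average residual across slices, which is an assumption about its exact form. Array shapes and names are illustrative, not the paper's.

```python
import numpy as np

def calibrate_adcs(x_ideal, y_measured):
    """Two-step calibration sketch: per-ADC linear fit, then a master curve.

    x_ideal    : (N,) ideal MAC outputs swept over the dynamic range
    y_measured : (num_adc, N) raw codes of each ADC for the same sweep
    """
    num_adc, _ = y_measured.shape
    k = np.empty(num_adc)
    b = np.empty(num_adc)
    y_lin = np.empty_like(y_measured, dtype=float)
    for i in range(num_adc):
        # Step 1: fit y_i = k_i * x + b_i, then remove offset/gain: (y_i - b_i) / k_i
        k[i], b[i] = np.polyfit(x_ideal, y_measured[i], 1)
        y_lin[i] = (y_measured[i] - b[i]) / k[i]
    # Step 2 (master curve, assumed form): the average residual versus x captures
    # the systematic nonlinearity shared by all slices and is subtracted out.
    master = (y_lin - x_ideal).mean(axis=0)
    y_cal = y_lin - master
    return y_cal, k, b, master
```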
TABLE I
Pruned and Quantized LeNet-5 Structure and Mapping
Fig. 19. Measured rms error (a) of one ADC over pre-ADC values (i.e., analog MAC outputs) and (b) 32 ADCs at four pre-ADC values; rms errors are tested over 128 runs with repeated inputs.

TABLE II
Quantized ResNet-20 Structure and Mapping
TABLE III
Performance Summary of CAP-RAM and Comparison With State-of-the-Art In-Memory Computing SRAMs
… and timing module is 7.56 mW. For the digital periphery, the accumulators and 2's complement modules take 0.78 mW in the accumulation mode and 0.46 mW in the single-cycle mode, while the adder tree consumes 0.04/0.10/0.19 mW at output levels 1/2/3. Different from the fully bit-serial architectures, CAP-RAM's energy efficiency is based on MAC computation with 4-bit inputs. To achieve this, more row-wise control signals and 128 4-bit DACs are involved. Similar to the definition above, CAP-RAM becomes more competitive in the bitwise energy efficiency. More importantly, there exists a tradeoff between storage density and energy efficiency. The target of CAP-RAM is not to achieve the highest energy efficiency but to design a compact, accurate, and programmable architecture while maintaining competitive energy efficiency. The detailed performance comparison is summarized in Table III.

V. CONCLUSION

In summary, this work presents and demonstrates a charge-domain IMC SRAM macro with 6T cells. The charge-sharing mechanism ensures good accuracy, while the semi-parallel architecture provides best-in-class weight storage density. Meanwhile, the digital processing periphery provides input/weight bitwidth configurability, and a ciSAR ADC specifically designed for CAP-RAM further boosts the energy and area performance. A 65-nm prototype demonstrates excellent computing linearity and accuracy. The pruned and quantized LeNet-5 and ResNet-20 are mapped to CAP-RAM macros, achieving 98.8% inference accuracy on MNIST and 89.0% on CIFAR-10, respectively. The system achieves 49.3 TOPS/W energy efficiency and 573.4-GOPS throughput.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[6] H.-J. Yoo, S. Park, K. Bong, D. Shin, J. Lee, and S. Choi, "A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 80–81.
[7] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in Proc. 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 92–104.
[8] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, "A multi-functional in-memory inference processor using a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 642–655, Feb. 2018.
[9] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[10] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
[11] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[12] X. Si et al., "A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 189–202, Jan. 2020.
[13] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[14] C. Yu, T. Yoo, T. T.-H. Kim, K. C. Tshun Chuan, and B. Kim, "A 16 K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5 bit column ADC," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Mar. 2020.
[15] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: In-memory-computing SRAM macro based on capacitive-coupling computing," IEEE Solid-State Circuits Lett., vol. 2, no. 9, pp. 131–134, Sep. 2019.
[16] S. Okumura, M. Yabuuchi, K. Hijioka, and K. Nose, "A ternary based bit scalable, 8.80 TOPS/W CNN accelerator with many-core processing-in-memory architecture with 896K synapses/mm²," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C248–C249.
[17] H. Jia, Y. Tang, H. Valavi, J. Zhang, and N. Verma, "A microprocessor implemented in 65 nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing," 2018, arXiv:1811.04047. [Online]. Available: http://arxiv.org/abs/1811.04047
[18] J. Wang, X. Wang, C. Eckert, A. Subramaniyan, R. Das, D. Blaauw, and D. Sylvester, "A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 224–226.
[19] W.-S. Khwa et al., "A 65 nm 4 Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[20] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 525–542.
[22] X. Si et al., "15.5 A 28 nm 64 Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 246–248.
[23] J. Yue et al., "14.3 A 65 nm computing-in-memory-based CNN processor with 2.9-to-35.8TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 234–236.
[24] J.-W. Su et al., "15.2 A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 240–242.
[25] N. Verma et al., "In-memory computing: Advances and prospects," IEEE Solid State Circuits Mag., vol. 11, no. 3, pp. 43–55, Aug. 2019.
[26] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[27] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[28] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 541–552.
[29] R. Guo et al., "A 5.1 pJ/neuron 127.3 us/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65 nm CMOS," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C120–C121.
[30] K. D. Choo, J. Bell, and M. P. Flynn, "Area-efficient 1GS/s 6b SAR ADC with charge-injection-cell-based DAC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 460–461.
[31] C.-C. Liu, S.-J. Chang, G.-Y. Huang, and Y.-Z. Lin, "A 10-bit 50-MS/s SAR ADC with a monotonic capacitor switching procedure," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 731–740, Apr. 2010.
[32] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[33] A. Ren et al., "ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2019, pp. 925–938.
[34] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Jun. 2019.
[35] S. K. Gonugondla, M. Kang, and N. Shanbhag, "A 42 pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 490–492.

Zhiyu Chen (Student Member, IEEE) received the B.E. degree in electrical engineering from Nanjing University, Nanjing, China, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Rice University, Houston, TX, USA.
His research interests include digital and mixed-signal circuit design for machine learning accelerators.

Zhanghao Yu (Student Member, IEEE) received the B.E. degree in integrated circuit design and integrated system from the University of Electronic Science and Technology of China, Chengdu, China, in 2016 and the M.S. degree in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Rice University, Houston, TX, USA.
His current research interests include analog and mixed-signal integrated circuits design for power management, bio-electronics, and security.

Qing Jin received the M.S. degree in computer engineering from Texas A&M University, College Station, TX, USA, in 2018 and the B.S. and M.S. degrees in microelectronics from Nankai University, Tianjin, China, in 2009 and 2012, respectively.
He was working as a Research Assistant with Tsinghua University, Beijing, China, between 2010 and 2012. From 2013 to 2017, he was working with the School of Microelectronics, Xi'an Jiaotong University, Xi'an, China. He is currently pursuing the Ph.D. degree with Northeastern University, Boston, MA, USA.
Yan He (Student Member, IEEE) received the B.S. degree in electronic science and technology from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Rice University, Houston, TX, USA.
His current research interests include analog and mixed-signal integrated circuits design for power management and hardware security.

Jingyu Wang (Member, IEEE) received the B.S. degree in electronic science and technology and the M.S. and Ph.D. degrees in microelectronics from Xidian University, Xi'an, China, in 2010, 2013, and 2017, respectively.
His current interests include mixed-signal integrated circuits, ADCs, image sensors and their applications, biomedical circuits and systems, and RF integrated circuits.

Sheng Lin (Student Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2013, the M.S. degree from Syracuse University, Syracuse, NY, USA, in 2015, and the Ph.D. degree in computer engineering from Northeastern University, Boston, MA, USA, in 2020, under the supervision of Prof. Yanzhi Wang.
His current research interests include privacy-preserving machine learning, energy-efficient artificial intelligence systems, model compression, and mobile acceleration of deep learning applications.

Dai Li (Student Member, IEEE) received the B.S. and M.S. degrees in electronics engineering from Tsinghua University, Beijing, China, and the M.S. degree in electrical and computer engineering from Rice University, Houston, TX, USA, in 2010, 2013, and 2017, respectively, where he is currently pursuing the Ph.D. degree.
His research interests include VLSI circuits, hardware security, mixed-signal integrated circuits, and low-power circuits.

Yanzhi Wang (Senior Member, IEEE) received the B.S. degree from Tsinghua University, Beijing, China, in 2009 and the Ph.D. degree from the University of Southern California, Los Angeles, CA, USA, in 2014.
He is currently an Assistant Professor with the Department of ECE, Northeastern University, Boston, MA, USA. His research interests focus on model compression and platform-specific acceleration of deep learning applications. His research has maintained the highest model compression rates on representative Deep Neural Networks (DNNs) since 09/2018. His work on Adiabatic Quantum-Flux-Parametron (AQFP) superconducting-based DNN acceleration is by far the highest in energy efficiency among all hardware devices. His recent research achievement, CoCoPIE, can achieve real-time performance on almost all deep learning applications using off-the-shelf mobile devices, outperforming competing frameworks by up to 180X acceleration. His work has been published broadly in top conference and journal venues and has been cited above 8500 times.
Dr. Wang has received five Best Paper and Top Paper Awards, another ten Best Paper Nominations, and four Popular Paper Awards. He has received the U.S. Army Young Investigator Program (YIP) Award, the Massachusetts Acorn Innovation Award, the Ming Hsieh Scholar Award, and other research awards from Google, MathWorks, and others. Three of his former Ph.D./postdoc students became tenure-track faculty members at the University of Connecticut, Storrs, CT, USA, Clemson University, Clemson, SC, USA, and Texas A&M University-Corpus Christi, Corpus Christi, TX, USA.

Kaiyuan Yang (Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2012 and the Ph.D. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2017.
He is an Assistant Professor of electrical and computer engineering at Rice University, Houston, TX, USA. His research interests include digital and mixed-signal circuits for secure and low-power systems, hardware security, and circuit/system design with emerging devices.
Dr. Yang received the Distinguished Paper Award at the 2016 IEEE International Symposium on Security and Privacy (Oakland), the Best Student Paper Award (first place) at the 2015 IEEE International Symposium on Circuits and Systems (ISCAS), the Best Student Paper Award Finalist at the 2019 IEEE Custom Integrated Circuits Conference (CICC), and the 2016 Pwnie Most Innovative Research Award Finalist. His Ph.D. research was recognized with the 2016–2017 IEEE Solid-State Circuits Society (SSCS) Predoctoral Achievement Award.