
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 58, NO. 5, MAY 2023

A Fully Bit-Flexible Computation in Memory Macro Using Multi-Functional Computing Bit Cell and Embedded Input Sparsity Sensing

Chun-Yen Yao, Tsung-Yen Wu, Han-Chung Liang, Yu-Kai Chen, and Tsung-Te Liu, Member, IEEE

Abstract— Computation in memory (CIM) overcomes the von Neumann bottleneck by minimizing the communication overhead between memory and processing elements. However, using conventional CIM architectures to realize multiply-accumulate operations (MACs) with flexible input and weight bit precision is extremely challenging. This article presents a fully bit-flexible CIM design with a compact area and high energy efficiency. The proposed CIM macro employs a novel multi-functional computing bit cell design by integrating the MAC and the A/D conversion to maximize efficiency and flexibility. Moreover, an embedded input sparsity sensing and a self-adaptive dynamic range (DR) scaling scheme are proposed to minimize the energy-consuming A/D conversions in CIM. Finally, the proposed CIM macro implementation utilizes an interleaved placement structure to enhance the weight-updating bandwidth and the layout symmetry. The proposed CIM design fabricated in standard 28-nm CMOS technology achieves an area efficiency of 27.7 TOPS/mm² and an energy efficiency of 291 TOPS/W, demonstrating a highly energy-area-efficient flexible CIM solution.

Index Terms— Area-efficient, bit scalability, computation in memory (CIM), deep neural network (DNN), energy-efficient, in-memory A/D conversion, sparsity sensing.

Manuscript received 18 February 2022; revised 24 July 2022; accepted 16 November 2022. Date of publication 9 January 2023; date of current version 25 April 2023. This article was approved by Associate Editor Kathryn Wilcox. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2218-E-002-034-MBK and Grant MOST 111-2218-E-002-018-MBK; in part by the Intelligent and Sustainable Medical Electronics Research Fund in National Taiwan University; and in part by MediaTek Inc. under Contract MTKC-2022-0125. (Corresponding author: Tsung-Te Liu.)

Chun-Yen Yao was with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 10617, Taiwan. He is now with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).

Tsung-Yen Wu was with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 10617, Taiwan. He is now with MediaTek Inc., Taipei 11491, Taiwan (e-mail: [email protected]).

Han-Chung Liang, Yu-Kai Chen, and Tsung-Te Liu are with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3224363.
Digital Object Identifier 10.1109/JSSC.2022.3224363
0018-9200 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

MODERN machine learning (ML) algorithms, such as deep neural networks (DNNs), require substantial parameters and computations mainly composed of high-dimensional multiply-accumulate operations (MACs). The corresponding hardware implementations have become practical because of the higher computing bandwidth and transistor density with the advancement of semiconductor technology. However, further improvement in energy efficiency using the traditional von Neumann architecture is limited, since the computation and storage are separate, and the corresponding data access has dominated the overall energy consumption [1]. This "memory wall" has become the performance bottleneck of efficient implementations for ML algorithms and applications.

Computation in memory (CIM) is proposed to deal with the memory bottleneck by maximizing the utilization rate of the stored data with massively parallel and local processing inside the memory macro. As a result, the CIM architecture can achieve over 10× higher energy efficiency than the state-of-the-art digital accelerators [2]. A typical CIM is commonly accomplished by storing the weight parameters in the memory and then feeding the input activations into the CIM macro to generate the corresponding MAC outputs. This approach is suitable for applications that require only fixed weight and input bit precision [3], [4], [5], [6], [7], [8], [9], [10], [11], [12] but fails to serve the applications demanding MACs with flexible bit precision. To solve this issue, several CIM works first split the weight parameters and input activations into bit groups with different weighting representations for low-bit MACs. These partial MAC results are then processed by MAC aggregation or near-memory computing (NMC) circuitry to complete full-precision MACs [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. Based on this design principle, the CIM design employing 1-b × 1-b MACs together with MAC aggregation offers a promising solution to realize fully bit-flexible multi-bit MACs, which can be explained by the equation below:

y = x \cdot w = \sum_{n=0}^{N-1} x_n w_n = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{n=0}^{N-1} (-1)^k 2^{p+q} x_n[p] w_n[q]    (1)

where x is an N-dimensional input vector whose bit precision in each scalar term x_n is P, w is an N-dimensional weight vector whose bit precision in each scalar term w_n is Q, and k is an integer term to handle negative conditions for two's complement operations. By (1), it is clear that a general MAC consists of only two classes of components: common terms (\sum_{n=0}^{N-1} x_n[p] w_n[q]) and scaling terms ((-1)^k 2^{p+q}). As a result, the CIM design utilizing a 1-b × 1-b MAC scheme to compute the common terms can achieve full bit flexibility if all scaling terms are computed via MAC aggregation.
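The decomposition in (1) can be exercised numerically with a short sketch (an illustrative software model, not the macro's circuitry; the unsigned case k = 0 is assumed for simplicity). Each inner product of bit planes is exactly the kind of 1-b × 1-b MAC a CIM block evaluates, and the 2^(p+q) weighting is what the MAC aggregator applies:

```python
import numpy as np

def bit_flexible_mac(x, w, P, Q):
    """Rebuild an unsigned P-b x Q-b dot product from 1-b x 1-b partial MACs,
    following the decomposition in (1) with k = 0 (unsigned operands)."""
    y = 0
    for p in range(P):                       # input bit planes
        for q in range(Q):                   # weight bit planes
            xp = (x >> p) & 1                # x_n[p], one bit per element
            wq = (w >> q) & 1                # w_n[q]
            common = int(np.dot(xp, wq))     # 1-b x 1-b MAC: the "common term"
            y += (1 << (p + q)) * common     # scaling term 2^(p+q) via aggregation
    return y

rng = np.random.default_rng(0)
N, P, Q = 256, 8, 8                          # a 256-D MAC with 8-b inputs and weights
x = rng.integers(0, 2**P, N)
w = rng.integers(0, 2**Q, N)
assert bit_flexible_mac(x, w, P, Q) == int(np.dot(x, w))
```

Signed operation only changes the (−1)^k sign attached to the most significant bit planes, which the aggregator likewise applies digitally.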

Authorized licensed use limited to: National Taiwan University. Downloaded on June 27,2024 at 03:00:40 UTC from IEEE Xplore. Restrictions apply.
Several recent works [13], [15], [16], [19], [20] have taken advantage of 1-b × 1-b MACs with MAC aggregation to maximize the bit flexibility. Among these designs, analog CIM approaches can potentially achieve a compact CIM macro area without the bulky digital adders required in digital CIM counterparts. However, the analog CIM architecture demands a multi-bit A/D conversion process to reconstruct the mixed-signal MAC results back to digital codes. Due to the required A/D conversion and multi-level reference circuitry, its efficiency can be significantly limited. Moreover, the A/D conversion process consumes a significant amount of energy, seriously impacting the CIM energy efficiency. Finally, since the completion of an ML task requires both standard memory access and CIM operations, the performance of standard memory access is also critical. Although the standard write access and the CIM operation use the same number of rows, a single write is not a multi-row access function, causing low weight-updating bandwidth and severe degradation of the system latency.

To overcome the design challenges above, this work proposes a fully bit-flexible CIM macro with the following design features.

1) A highly compact CIM computing bit cell (CIMC) can support standard read/write access, 1-b × 1-b MAC, reference voltage generation, and in-memory A/D conversion to maximize the area efficiency by reducing the A/D and reference circuitry overheads.
2) An embedded input sparsity sensing and an automatic on-chip reference voltage generation scheme can realize a self-adaptive dynamic range (DR) scaling based on the real-time input sparsity characteristics to minimize the expensive A/D conversions and maximize the energy efficiency.
3) An interleaved CIMC placement structure can simultaneously accelerate the weight-updating process, maintain the symmetric layout implementation, and support the ping-pong operation for higher weight-updating bandwidth.

The proposed fully bit-flexible CIM macro was implemented and verified in standard 28-nm CMOS technology. The proposed CIM design achieves an area efficiency of 27.7 TOPS/mm², representing an 8.15× improvement compared with the previous work. Moreover, the proposed embedded input sparsity sensing and the self-adaptive DR scaling minimize the expensive A/D conversions, realizing 27.4%–30.2% energy reduction and a measured peak energy efficiency of 383 TOPS/W. Finally, the proposed interleaved CIMC placement topology can further enable a 32.7% reduction in operating cycles, substantially improving the system latency performance.

This article is organized as follows. Section II introduces the previous bit-flexible CIM works. Section III describes the proposed CIM architecture and the operating principle of in-memory A/D conversion. Section IV introduces the proposed embedded input sparsity sensing and self-adaptive DR scaling techniques. Section V describes the proposed interleaved CIMC placement. The chip implementation and measurement results are shown in Section VI, where comparisons with the state-of-the-art results are also provided. Section VII concludes this article.

II. RELATED WORKS

The CIM designs employing 1-b × 1-b MACs with MAC aggregation can be mainly classified into two categories according to their computing scheme: current-based computation [13], [16] and charge-based computation [15]. The current-based CIM design performs one 1-b × 1-b MAC by summing the total discharging currents on the bitlines. This computing scheme features a short CIM delay that benefits from the short bitline discharging time but suffers from high nonlinearity. Okumura et al. [13] proposed a 17T ternary bit cell that consists of two standard 6T SRAM cells and a 5T discharging circuit. The multi-level reference circuitry for the following successive-approximation register (SAR) A/D operation was realized via binary-weighted reference cells replicating the same discharging path of the ternary bit cells. However, only one of the four banks in the MAC operating block can simultaneously access these area-consuming reference cells, significantly deteriorating the memory utilization and the overall area efficiency. Besides, Chiu et al. [16] exploited the small footprint of the standard 6T SRAM and directly built the discharging paths via the access transistors. The corresponding digital codes can then be reconstructed via a self-timed tracking and sensing scheme through four pairs of replica bitlines. However, this approach requires additional compensation bit cells to ensure the correct MAC function. The required number of compensation cells equals the maximum number of activated WLs, causing a huge area overhead for CIM designs. In addition, the required linear-search A/D process slows down the A/D operation and requires more comparisons than a SAR A/D approach, significantly degrading its energy performance.

On the other hand, the charge-based CIM approach demonstrates better linearity than the current-based designs by performing the computations on capacitors. Besides, it features high integration, since the metal-oxide-metal (MOM) capacitors can be placed right above the transistors. Jia et al. [15] exploited this feature and proposed an 8T1C bit cell array for 1-b × 1-b MACs via charge sharing. The outputs are then converted into digital codes via typical capacitor-switching SAR analog-to-digital converters (ADCs) outside the CIM array. However, this approach can cause a large area penalty due to the additional sampling capacitors. Moreover, the sample-and-hold process can result in severe voltage swing reduction [11], [19], seriously deteriorating the sense margin of the comparators and the corresponding CIM performance. In summary, the A/D conversion and the reference circuits have clearly become the performance bottleneck for further efficiency improvement in the CIM designs. Therefore, this work tackles this critical issue by proposing a novel multi-functional computing bit cell architecture that integrates the MAC and A/D conversion together to maximize the CIM efficiency. In addition, the energy-consuming A/D conversions are minimized with the proposed embedded input sparsity sensing and self-adaptive DR scaling to further enhance the CIM energy efficiency.
YAO et al.: FULLY BIT-FLEXIBLE CIM MACRO 1489

Fig. 1. (a) Overall architecture of the proposed CIM design. (b) Schematic of CIMC and its supporting functions.

III. PROPOSED CIM ARCHITECTURE

Fig. 1(a) shows the overall architecture of the proposed bit-flexible CIM design with a size of 16-kb CIMCs and the reconfigurable MAC aggregator. The proposed CIM macro is partitioned into 32 CIM block pairs, each of which supports a 256-D 1-b × 1-b MAC. The MAC result can be directly reconstructed back to digital codes through the in-memory A/D conversion process. Moreover, a sparsity sensor is implemented to analyze the real-time CIM input characteristic. This further reduces the CIM energy consumption with the proposed self-adaptive DR scaling, which will be explained in detail in Section IV. After the CIM macro completes its operation, the computed results are sent to the reconfigurable MAC aggregator. The reconfigurable MAC aggregator then performs parallel and sequential shift-add computation to support parallel weights (4/8/16 b) and bit-serial inputs.

Fig. 2. Operating example of the proposed CIMC 1-b × 1-b MACs in BlockA0.

A. Multi-Functional CIMC

The proposed CIMC, as shown in Fig. 1(b), consists of a 6T SRAM for weight storage, a stacked MOM capacitor (Cu) above the transistors, and a 6T AND–OR–INV gate serving as the computation logic and capacitor driver. Based on different ways to activate the control signals, including WL, RWL, CTRL, and RST, each CIMC can support the following four functions: 1) standard read/write access; 2) 1-b × 1-b mixed-signal MAC; 3) reference voltage generation; and 4) SAR A/D capacitor switching. As a result, the proposed highly integrated bit cell design can realize the required CIM functions within a compact cell structure. In this way, we avoid the additional ADC and reference circuit overhead in conventional CIM architectures and significantly reduce the overall area by 24.1%, as described in Section VI.

An operation of the proposed CIM consists of two steps. Fig. 2 illustrates an operation example of a 256-D 1-b × 1-b MAC computation in BlockA0. Fig. 3 shows the corresponding timing diagram. In the first step, BlockA0 computes the mixed-signal MAC, while BlockB0 generates the reference voltage Vref. The capacitive coupling mechanism in both blocks can then be expressed as follows:

V_{RBL} = \frac{C_u V_{DD}}{256 C_u + C_{RBL}} \sum_{i=0}^{255} Z_i    (2)

Fig. 3. Timing diagram of the proposed CIMC 1-b × 1-b MACs.

Fig. 4. (a) Proposed bottom-plate sampling CIMC. (b) Alternative top-plate sampling CIMC.

where C_u is the capacitance of Cu, C_RBL is the load capacitance on RBLA0 or RBLB0, and Z_i indicates whether the node Z of cell i [annotated in Fig. 1(b)] is pulled to ground (Z_i = 1) or not (Z_i = 0) in this step. After the activation pulse is sent from RWL or CTRL, the corresponding MAC result and reference voltage level are capacitively coupled onto RBLA0 and RBLB0 in BlockA0 and BlockB0, respectively. After that, BlockA0 and BlockB0 together perform an in-memory SAR A/D conversion, as shown in Fig. 3.

B. In-Memory SAR A/D Conversion

As discussed in Section II, the A/D conversion and reference circuits become the performance bottleneck in CIM designs. Therefore, the proposed CIM design minimizes this overhead by reusing the capacitors in the CIM blocks for A/D conversion and embedded reference voltage generation. The proposed in-memory SAR A/D conversion architecture employs several design techniques to optimize its performance. First, it uses the monotonic switching procedure to minimize the circuit complexity. The resulting optimized CIMC, which realizes the MAC, Vref generation, and A/D switching functions, requires only six additional transistors, as shown in Fig. 1(b). Moreover, the asynchronous timing method is employed in the SAR A/D conversion to further enhance the throughput performance [26]. Finally, the proposed design uses the bottom-plate sampling topology, as shown in Fig. 4(a). Compared with the top-plate sampling counterpart shown in Fig. 4(b), the proposed design exhibits lower circuit complexity with fewer control signals, leading to a compact CIMC area. In addition, the top-plate sampling approach incurs higher parasitic capacitance, since the node RBL connects to more switches. This can severely degrade the voltage swing and the performance of the A/D conversion.

Fig. 5. Schematic of SAR Ctrl block with the comparator.

Fig. 5 shows the schematic of the SAR Ctrl block in Fig. 2 with a comparator. The comparator employs a StrongARM topology similar to the design in [27]. The SAR Ctrl block consists of an asynchronous pulse generator for local timing control and a SAR Logic module. After the ADStart signal is triggered at the negative edge, the SAR Ctrl block generates its local timing pulses and performs the corresponding logic operation. At each positive edge of CLKSAR, the registers in the SAR Logic are updated to generate the next CTRL bus configurations. After the conversion process finishes, all of the local timing pulses are disabled by the Done signal, and the final digital output Dout is available.

IV. EMBEDDED INPUT SPARSITY SENSING AND SELF-ADAPTIVE DR SCALING

Fig. 6. Simulated energy consumption given different DR scales.

To improve the CIM energy efficiency, the proposed design further exploits the input characteristics to reduce the CIM energy consumption resulting from the A/D conversion process. Fig. 6 compares the simulated MAC and A/D energy as a function of different DR scales for a uniformly distributed input at 0.8 V. This result clearly demonstrates that the A/D conversion dominates the total energy consumption in a CIM operation. However, the CIM computation results seldom reach the whole DR of MACs, especially when the input data characteristic is sparse.
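The link between (2), the SAR search, and the sparsity-driven DR scaling described next can be captured in a small behavioral model (illustrative Python with our own naming; a plain binary search stands in for the monotonic-switching SAR sequence, the capacitance values are assumed, and all device non-idealities are ignored). Because V_RBL is linear in the number of active cells, clamping the conversion range to the largest count the sensed sparsity allows reduces the number of comparisons roughly logarithmically:

```python
VDD, CU, CRBL = 0.8, 1.0, 8.0         # supply (V) and unit-normalized capacitances (assumed)

def v_rbl(count, n=256):
    """Eq. (2): coupled RBL voltage when `count` of the n cells pull node Z low."""
    return CU * VDD / (n * CU + CRBL) * count

def sar_convert(count, max_count):
    """Behavioral SAR conversion over a DR clamped to [0, max_count]."""
    lo, hi, ncomp = 0, max_count, 0
    while lo < hi:
        mid = (lo + hi) // 2
        ncomp += 1                     # one comparator decision per step
        if v_rbl(count) > v_rbl(mid):  # MAC bitline vs. reference bitline
            lo = mid + 1
        else:
            hi = mid
    return lo, ncomp

code_full, n_full = sar_convert(37, 256)  # full DR of a 256-D 1-b x 1-b MAC
code_dr, n_dr = sar_convert(37, 40)       # DR clamped to the sensed number of ones
assert code_full == code_dr == 37 and n_dr < n_full
```

Fewer comparisons mean fewer capacitor-switching and comparator events, which is where the measured energy reduction reported in Section VI comes from.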

Fig. 7. Modeled input sparsity of CIFAR-10 task with ResNet-18.

Fig. 8. Proposed sparsity sensor.

Fig. 9. Concept of self-adaptive DR scaling given (a) dense and (b) sparse input.

Fig. 10. Proposed automatic DR scaling algorithm.

Fig. 7 illustrates the modeled probability distribution of the bit-level input sparsity for a typical CIFAR-10 classification task with ResNet-18. The value of the input sparsity mainly resides in the range where the bit-level input sparsity is lower than 64. This distribution suggests that the corresponding DR could be lower for an input with high sparsity, and the number of voltage comparisons Ncomp can be reduced accordingly.

The information of the input sparsity can actually be analyzed in advance during run time, since these data must be filled into the CIM input buffer shown in Fig. 1 before entering the CIM macro. With the real-time sparsity information, Ncomp can thus be minimized accordingly during run time to reduce the overall energy consumption. Therefore, a self-adaptive DR scaling is proposed to exploit this characteristic by using an embedded bit-serial input sparsity sensor together with an automatic voltage generation scheme.

The proposed input sparsity sensor shown in Fig. 8 can estimate the real-time input sparsity by detecting the change in the number of ones whenever new data are fed into the CIM input buffer. Then, different reference voltage generation levels are automatically configured according to the estimated input sparsity characteristic. Fig. 9 illustrates the concept of the self-adaptive DR scaling scheme given different input sparsity characteristics. The reference voltage level is generated to be half of the adaptive DR to ensure functionality. Ncomp is then determined to realize a sparsity-aware, energy-efficient A/D conversion. Fig. 9(b) illustrates that the optimal DR and Ncomp can be reduced accordingly for a sparse input to minimize the A/D conversion energy.

The self-adaptive DR scaling algorithm is shown in Fig. 10. Given the information of Ncomp estimated by the sparsity sensor, the corresponding reference voltage Vref can be generated by setting up the number of switched capacitors according to (2) and activating the CTRL bus shown in Fig. 2 during the reference voltage generation step. In the following step, the in-memory SAR A/D conversion is completed through the asynchronous monotonic switching process described in Section III-B.

V. INTERLEAVED CIMC PLACEMENT

Both standard memory access and CIM operations can impact the performance of CIM executing an ML task. Conventional standard write access and CIM computation use the same number of rows. However, a single write is not a multi-row access function, causing a low weight-updating bandwidth and a substantial increase in the system latency. Therefore, this work proposes an interleaved CIMC placement structure, shown in Fig. 11, to accelerate the weight-updating process. The proposed structure shapes the 256 CIMCs into two columns with interleaved RWLs for CIM MAC access. In this way, the CIMCs connected to one WL are doubled compared with the conventional designs without interleaved RWLs. As a result, the corresponding write speed can be twice as fast as that of the conventional non-interleaved design. Moreover, the RBL can be sandwiched between the symmetric capacitor array to maximize the axial symmetry with the proposed placement structure. In this way, the impact of random mismatch and the non-ideal coupling effect from other CIM blocks can be minimized. Fig. 12 shows the CIM performance improvements using the proposed interleaved placement structure with three types of DNN implementations for a CIFAR-10 task.
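The 2× write-speed claim reduces to simple cycle arithmetic, sketched below (a toy model; the 16-CIMCs-per-WL baseline is a hypothetical geometry chosen for illustration, not the macro's actual layout):

```python
def write_cycles(n_cimc, cells_per_wl):
    """Cycles to refresh all weights when one WL row is written per cycle."""
    return -(-n_cimc // cells_per_wl)           # ceiling division

# Hypothetical baseline: 16 CIMCs behind each WL before interleaving.
conventional = write_cycles(256, cells_per_wl=16)
interleaved = write_cycles(256, cells_per_wl=2 * 16)  # two columns share one WL
assert conventional == 2 * interleaved                # 2x weight-updating bandwidth
```

The ping-pong operation compounds this: blocks not currently computing a MAC absorb these writes concurrently, which is where the further operating-cycle reductions reported with Fig. 12 come from.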

Fig. 11. Interleaved CIMC placement, where the dots denote the wire access to the CIMCs.

Fig. 12. Operating cycle reduction using the proposed interleaved CIMC placement structure.

Fig. 13. Ping-pong operation using the proposed CIM macro.

The proposed approach can significantly reduce the overall operating cycles by 30.4%, 19.5%, and 32.7% in the VGG-16, ResNet-18, and ResNet-50 topologies, respectively.

In addition, the proposed CIMC and the interleaved placement structure can readily support the efficient ping-pong operation [22], which realizes CIM and weight-updating operations at the same time. As shown in Fig. 1(b), the weight storage in the proposed CIMC can be decoupled from the AND–OR–INV gate when the RWL is not activated. In this way, the CIM blocks that do not perform MAC operations can update their stored weights simultaneously when performing the reference voltage generation or the SAR A/D switching operation, as illustrated in Fig. 13. Overall, the operating cycles can be further reduced by 49.3%, 39.0%, and 57.6% in the VGG-16, ResNet-18, and ResNet-50 topologies, respectively, as shown in Fig. 12.

VI. CHIP IMPLEMENTATION AND MEASUREMENT RESULTS

Fig. 14. Die photograph of the proposed CIM design.

TABLE I
CHIP PERFORMANCE SUMMARY

The proposed CIM architecture was designed and implemented with standard 28-nm CMOS technology. Fig. 14 shows the die photograph of the CIM test chip. Table I summarizes the measured CIM performance results with different input and weight precisions. By integrating several critical CIM functions, the bit cell area of the proposed 12T CIMC demonstrates a compact footprint of 1.215 µm² with an area overhead of 66.7% compared with the standard 6T SRAM cell. As a result, the proposed design achieves a high area efficiency of 27.7 TOPS/mm².
TABLE II
PERFORMANCE COMPARISON WITH THE STATE-OF-THE-ART CIM DESIGNS

Fig. 15. Area breakdown between the two different CIM macro designs.

Fig. 16. Measured energy improvement with the proposed embedded input sparsity sensing and self-adaptive DR scaling scheme.

Moreover, with the proposed embedded input sparsity sensing and self-adaptive DR scaling scheme, our design achieves a peak energy efficiency of 383 TOPS/W. Given the CIFAR-10 dataset, the proposed design can achieve 91% classification accuracy with ResNet-18.

Fig. 15 compares the area breakdown of the proposed CIM macro with the baseline design. The proposed CIM architecture with in-memory A/D conversion significantly decreases the macro area from 43 200 to 32 800 µm², representing a 24.1% area reduction. To evaluate the energy improvement with the proposed embedded input sparsity sensing and self-adaptive DR scaling scheme, an additional test mode that can disable the sparsity sensor was implemented in the CIM test chip. Fig. 16 compares the measured MAC energy of the proposed design with the embedded sparsity sensor enabled and disabled. For a CIFAR-10 task with the ResNet-18 topology, the proposed design with the input sparsity sensing and self-adaptive DR scaling realizes 27.4% and 30.2% energy reduction at 0.7 and 0.8 V, respectively, effectively minimizing the A/D conversion energy to enhance the overall CIM efficiency.

Table II compares the proposed CIM architecture with the state-of-the-art designs that support flexible input and weight bit precision. The proposed multi-functional CIMC implemented in an advanced CMOS technology process enables a highly integrated and compact CIM design, demonstrating 8.15× higher area efficiency and 5.89× higher energy efficiency than the previous work with the highest area efficiency [19]. Moreover, with the proposed embedded input sparsity sensing and self-adaptive DR scaling scheme, the proposed CIM design achieves similar energy efficiency while realizing 72.1× higher area efficiency than the most energy-efficient CIM [15]. Finally, the proposed CIM macro utilizes an interleaved placement structure to maintain a symmetric layout implementation and accelerate the weight-updating process, substantially reducing the overall latency and benefiting large-scale CIM-based computation systems.
VII. CONCLUSION

This article presents an energy-area-efficient CIM design that can support emerging ML applications with flexible bit precision. The highly compact multi-functional CIMC design with in-memory SAR A/D conversion can maximize the CIM efficiency and flexibility. In addition, the proposed embedded input sparsity sensing and self-adaptive DR scaling scheme minimize the expensive A/D conversions effectively. Finally, the interleaved placement structure is proposed to improve the weight-updating bandwidth and maintain the layout symmetry simultaneously. The measurement results show that the proposed CIM design achieves high energy and area efficiencies of 291 TOPS/W and 27.7 TOPS/mm², respectively, representing a highly efficient bit-flexible CIM solution.

ACKNOWLEDGMENT

The authors would like to thank Taiwan Semiconductor Manufacturing Company (TSMC), Hsinchu, Taiwan, and the Taiwan Semiconductor Research Institute (TSRI), Hsinchu, for providing chip fabrication and technical support. They would also like to thank Bing-Chen Wu for providing suggestions on this manuscript.

REFERENCES

[1] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[2] N. Verma et al., "In-memory computing: Advances and prospects," IEEE Solid-State Circuits Mag., vol. 11, no. 3, pp. 43–55, Summer 2019.
[3] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[4] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, "A variation-tolerant in-memory machine learning classifier via on-chip training," IEEE J. Solid-State Circuits, vol. 53, no. 11, pp. 3163–3173, Nov. 2018.
[5] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[6] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Jun. 2019.
[12] S. Xie, C. Ni, A. Sayal, P. Jain, F. Hamzaoglu, and J. P. Kulkarni, "eDRAM-CIM: Compute-in-memory design with reconfigurable embedded-dynamic-memory array realizing adaptive data converters and charge-domain computing," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 248–250.
[13] S. Okumura, M. Yabuuchi, K. Hijioka, and K. Nose, "A ternary based bit scalable, 8.80 TOPS/W CNN accelerator with many-core processing-in-memory architecture with 896K synapses/mm²," in Proc. Symp. VLSI Technol., Jun. 2019, pp. 248–249.
[14] X. Si et al., "A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 189–202, Jan. 2020.
[15] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[16] Y.-C. Chiu et al., "A 4-Kb 1-to-8-bit configurable 6T SRAM-based computation-in-memory unit-macro for CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2790–2801, Oct. 2020.
[17] J.-W. Su et al., "A 28 nm 64 Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 240–242.
[18] C.-X. Xue et al., "A 22 nm 2 Mb ReRAM compute-in-memory macro with 121–28TOPS/W for multibit MAC computing for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 244–246.
[19] Z. Chen et al., "CAP-RAM: A charge-domain in-memory computing 6T-SRAM for accurate and precision-programmable CNN inference," IEEE J. Solid-State Circuits, vol. 56, no. 6, pp. 1924–1935, Jun. 2021.
[20] H. Kim, T. Yoo, T. T.-H. Kim, and B. Kim, "Colonnade: A reconfigurable SRAM-based digital bit-serial compute-in-memory macro for processing neural networks," IEEE J. Solid-State Circuits, vol. 56, no. 7, pp. 2221–2233, Jul. 2021.
[21] X. Si et al., "A local computing cell and 6T SRAM-based computing-in-memory macro with 8-b MAC operation for edge AI chips," IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2817–2831, Sep. 2021.
[22] J. Yue et al., "A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 238–240.
[23] C.-X. Xue et al., "A 22 nm 4 Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7TOPS/W for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 245–247.
[24] J.-W. Su et al., "A 28 nm 384 kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 250–252.
[25] Y.-D. Chih et al., "An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 252–254.
[26] S.-W. M. Chen and R. W. Brodersen, "A 6-bit 600-MS/s 5.3-mW asynchronous ADC in 0.13-µm CMOS," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2669–2680, Dec. 2006.
[7] H. Kim, Q. Chen, and B. Kim, “A 16K SRAM-based mixed-signal in- [27] Q. Fan, Y. Hong, and J. Chen, “A time-interleaved SAR ADC with
memory computing macro featuring voltage-mode accumulator and row- bypass-based opportunistic adaptive calibration,” IEEE J. Solid-State
by-row ADC,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Circuits, vol. 55, no. 8, pp. 2082–2093, Aug. 2020.
Nov. 2019, pp. 35–36.
[8] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, “C3SRAM: An in-memory-
computing SRAM macro based on robust capacitive coupling computing
mechanism,” IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897,
Jul. 2020. Chun-Yen Yao received the B.S. degree in electrical
[9] Q. Dong et al., “A 351TOPS/W and 372.4GOPS compute-in-memory engineering with a minor in mechanical engineer-
SRAM macro in 7 nm FinFET CMOS for machine-learning applica- ing and the M.S. degree in electronics engineering
tions,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, from National Taiwan University, Taipei, Taiwan, in
Feb. 2020, pp. 242–244. 2019 and 2022, respectively. He is currently pursu-
[10] C. Yu, T. Yoo, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, “A 16K ing the Ph.D. degree in electrical engineering with
current-based 8T SRAM compute-in-memory macro with decoupled the University of California at Berkeley, Berkeley,
read/write and 1–5 bit column ADC,” in Proc. IEEE Custom Integr. CA, USA.
Circuits Conf. (CICC), Mar. 2020, pp. 1–4. His current research interests include low-noise
[11] Y.-T. Hsu, C.-Y. Yao, T.-Y. Wu, T.-D. Chiueh, and T.-T. Liu, “A high- current sensing and adaptive sensing for biomedical
throughput energy–area-efficient computing-in-memory SRAM using applications.
unified charge-processing network,” IEEE Solid-State Circuits Lett., Mr. Yao received the National Taiwan University Outstanding Youth Award
vol. 4, pp. 146–149, 2021. in 2021.


Tsung-Yen Wu received the B.S. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 2019, and the M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2022.
He is currently with MediaTek Inc., Taipei. His current research interests include computing in memory for energy-efficient machine learning applications.

Han-Chung Liang received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2021, where he is currently pursuing the M.S. degree with the Graduate Institute of Electronics Engineering.
His current research interests include computation in memory circuit design and energy-efficient circuit design.

Yu-Kai Chen received the B.S. degree in electrical engineering from National Chung Hsing University, Taichung, Taiwan, in 2021. He is currently pursuing the M.S. degree with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan.
His research interests include computation in memory and mixed-signal circuit designs.

Tsung-Te Liu (Member, IEEE) received the B.S. degree in electrical engineering and the M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2002 and 2004, respectively, and the Ph.D. degree in electrical engineering from the University of California at Berkeley, Berkeley, CA, USA, in 2012.
From 2004 to 2005, he was with MediaTek Inc., Hsinchu, Taiwan, where he was involved in circuit and system design for wireless communications. From 2005 to 2012, he was a member of the Berkeley Wireless Research Center (BWRC), University of California at Berkeley. From 2012 to 2014, he was with Interuniversity Microelectronics Centre (IMEC), Leuven, Belgium, where he conducted research on circuit development for advanced CMOS technology. In 2014, he joined the faculty of National Taiwan University, where he is currently an Associate Professor with the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering.
Dr. Liu was a recipient of several design and teaching awards. His research interests involve energy-efficient circuit and system designs.

