IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 69, NO. 10, OCTOBER 2022

BR-CIM: An Efficient Binary Representation Computation-In-Memory Design

Zhiheng Yue, Yabing Wang, Yubin Qin, Leibo Liu, Senior Member, IEEE, Shaojun Wei, Fellow, IEEE, and Shouyi Yin, Member, IEEE

Abstract— Deep neural network (DNN) has recently attracted tremendous attention in various fields, but the computing requirements and the memory bottleneck limit the energy efficiency of hardware implementations. Binary quantization has been proposed to relieve the pressure on hardware design, and Computing-In-Memory (CIM) is regarded as a promising method to resolve the memory-wall challenge. However, the binary computing paradigm is mismatched with the CIM scheme, which forced previous works to use complex circuits and peripherals to realize binary operations. To overcome these issues, this work presents Binary Representation Computation-In-Memory (BR-CIM) with several key features. (1) A lightweight computation unit is realized within the 6T SRAM array to accelerate binary computing and enlarge the signal margin; (2) a reconfigurable computing scheme and mapping method support extendable bit precision to satisfy the accuracy requirements of various applications; (3) simultaneous computing and weight loading are supported by the column circuitry, which shortens the data loading latency. Several experiments are conducted to evaluate algorithm accuracy, computing latency, and power consumption. The energy efficiency reaches up to 1280 TOPS/W for binary representation, and the algorithm accuracy achieves 97.82%/76.4% on the MNIST/CIFAR-100 datasets.

Index Terms— Computation-in-memory, artificial intelligence, deep neural network, binary neural network, SRAM, reconfigurable design.

Manuscript received 21 March 2022; revised 5 June 2022 and 18 June 2022; accepted 19 June 2022. Date of publication 11 July 2022; date of current version 29 September 2022. This work was supported in part by NSFC under Grant 62125403, Grant U19B2041, and Grant 92164301; in part by the National Key Research and Development Program under Grant 2018YFB2202600; in part by the Beijing Science and Technology Project under Grant Z191100007519016; and in part by the Beijing Advanced Innovation Center for Integrated Circuits. This article was recommended by Associate Editor M.-F. Chang. (Corresponding author: Shouyi Yin.)
The authors are with the School of Integrated Circuits, the Beijing Innovation Center for Future Chip, and the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2022.3185135.
Digital Object Identifier 10.1109/TCSI.2022.3185135

I. INTRODUCTION

Deep Neural Network (DNN) has become prevalent in multiple fields, such as face detection, image classification, object recognition, autonomous vehicles, and language translation. However, network sizes grow explosively, and storing the network is a great issue. Frequent memory access lowers the throughput and energy efficiency of digital designs, which limits the deployment of DNNs on low-power mobile devices. To address this challenge, several lightweight approaches have been proposed to reduce the memory access and computation requirements, such as bit-precision quantization, weight compression, and pruning. Among those methods, the binary neural network (BNN) is efficient and compresses the parameter size as much as possible.

Fig. 1. BNN SRAM-CIM Design Challenges.

Multiple digital designs have been proposed for binary neural networks. For example, an energy-efficient reconfigurable processor was designed to accelerate binary- and ternary-weight neural networks [1]. But digital designs are still limited by frequent memory access, which cannot be fundamentally avoided. Computation-in-memory (CIM) is a prominent method to realize in-situ computing without data movement. Multiple memory technologies have demonstrated the feasibility of performing computing operations within the cell array. A DRAM-CIM design activates triple array rows to perform logic operations [2]. Emerging non-volatile memory technologies such as resistive random-access memory (RRAM) also realize computing functions within the cell array [3], [4]. Among those memory technologies, SRAM has the advantages of lower read/write energy and latency and better data endurance. While multiple SRAM-CIM works show promising energy efficiency for multi-bit multiplication and accumulation (MAC), there are several challenges in deploying an SRAM-based CIM scheme for binarized input/weight computing.

First, the operations performed by a BNN are XNOR and bit count, which differ from the basic computing paradigm of the compact 6T SRAM cell array. As explained in Fig. 1, the WL and the data in the bit cell control the charge of the BL: only when the bit cell contains 1 and the WL is activated does current charge the BL. The read operation can therefore be seen as a bit-wise multiplication happening on the BL. Theoretically, partial results accumulate on the BL when multiple WLs are activated simultaneously [5].

The fundamental computing operation of SRAM-CIM is column-wise multiplication and accumulation (MAC). Therefore, an extra computing unit is needed to realize the XNOR operation. A mixed-signal accelerator was implemented based on SRAM in which 2 gate transistors and 1 MOM capacitor are added to each bit cell [6]. XNOR-SRAM realized an in-memory-computing SRAM macro for binary/ternary DNNs, but each 6T cell is expanded to 12T [7].

Second, the sense margin represents the voltage difference between two adjacent results, as Fig. 1 shows. A limited sense margin makes it difficult for the ADC to distinguish and quantize two neighboring values. Each bit-wise computing result is inaccurate due to non-linearly functioning transistors, and the sense margin shrinks further with larger accumulated values, because the sum operation accumulates deviated results, which scatters the occurrence probability of the outcome voltage. The distribution of a high accumulated value becomes flat and wide, which further degrades the margin. This challenge is thoroughly discussed in Section II.B.

Third, the largest SRAM-CIM storage size is around 5 Mb [8], which is far less than the growth of neural network size. Though binary representation reduces the parameter size, the CIM is not large enough to store all parameters, so the weights within the CIM must be updated frequently, and intermediate results must be written back to SRAM for pipelined computation. The latency of loading fresh weights degrades the overall throughput. A ping-pong CIM was proposed to support simultaneous computing and weight-update operations [9]. However, a replicated cell array and extra input wires are needed, which increases the area overhead and the difficulty of layout routing.

To overcome the limitations of previous computation-in-memory architectures, this work presents Binary Representation Computation-In-Memory (BR-CIM), which performs binary computation within compact 6T SRAM. The contributions of this work can be summarized as below.
• A computation-in-memory design performs binarized computing within an SRAM array, eliminating frequent data movement. The symmetric computing implementation supports variable bit precision and enlarges the readout margin;
• Based on the intrinsic column scheme, a column-wise MUX supports simultaneous computing and weight loading, which shortens the latency of data loading;
• A reconfigurable digital peripheral and mapping method support different computing precisions for varying algorithm requirements.

The remainder of this article is organized as follows. Section II introduces the background of binary neural network algorithms and computation-in-memory design. Section III describes the architecture of the proposed computation-in-memory. Section IV presents the configurable mapping method. Section V presents the experiment setup and evaluation results. Section VI concludes this work.

II. BACKGROUND

A. Binary Neural Network

Binary Neural Network (BNN) has demonstrated its efficiency in reducing the computation and memory storage requirements, which makes it possible to deploy neural networks on low-power AI edge devices. BinaryConnect first constrained the weights to either +1 or −1 during propagation [10]. XNOR-Net took bit reduction a step further, compressing both weight and activation precision; its classification accuracy on MNIST reached 98.74%. The multiplication and accumulation (MAC) is replaced by an XNOR operation and a bit count of the XNOR results [11]. Binarized Neural Network introduced a training kernel on GPU, which realized 7 times the throughput of an unoptimized GPU kernel without suffering any loss in classification accuracy [12]. XOR-Net proposed a network computing scheme that is up to 17% faster and 19% more energy-efficient than XNOR-Net [13]. More works contributed to improving the accuracy of binary neural networks in different applications. The performance of both classification and semantic segmentation tasks is improved by Group-Net [14]. Research on infrared image detection shows comparable performance of binarized networks versus full-precision networks [15]. There is also an attempt to utilize binary neural networks for visual place recognition on resource-constrained platforms, such as small robots or drones [16].

B. SRAM-CIM Circuit Design Challenge

Different device technologies have already provided viable computation-in-memory designs. Among those memory technologies, SRAM stands out with lower read/write energy and high data endurance, and it also scales with the newest fabrication processes. Several SRAM-CIMs have been proposed for energy-efficient AI applications. A 10T SRAM-based CIM, Conv-RAM, is utilized to accelerate CNNs [17]. A 6T SRAM CIM performed multiplication and accumulation (MAC) of 8-b weights/activations for DNNs [18]. To accelerate lower-bit-precision computing, a compact 6T CIM was implemented for BNNs [19]. However, there are several design challenges for SRAM-based CIM architecture.

1) Read Disturb: When a large number of WLs are activated at the same time, the charge current in the BL increases. When the bit line voltage drops below the writing threshold, a pseudo-write operation happens and a bit-flip occurs in a store-1 SRAM cell, especially for 6T SRAM [19]. To mitigate the read disturb issue, the BL voltage must be clamped (above the write trigger voltage) or the number of activated WLs must be limited. Another method is to use read/write-decoupled SRAM structures, such as 8T and 10T [17], which avoid disturbing the stored data by isolating the read BL from the write BL. Albeit this method eliminates read disturb directly, it increases the cell array area greatly.

2) Sense Margin: The sense margin determines the difficulty of ADC quantization and the computing accuracy. The voltage swing of accumulated results is bounded by the supply voltage. With higher bit precision of the accumulated result, the number of discrete values that the result can represent grows exponentially, and more voltage nodes are needed to represent all discrete values. Eventually, the voltage difference between two neighboring values shrinks or even overlaps. This issue can be traced back to each activated cell.


TABLE I
TRUTH TABLE OF BINARY REPRESENTATION XNOR OPERATION

Fig. 2. Logic Analog Value Mismatch.

Though the theoretical values are equal, the current contributed by each bit cell differs due to the 'asymmetric' circuit design. For example, a partial result that is logically 0 corresponds to two possible cases in a 6T SRAM-CIM [20]. When the input equals 0, the WL is not activated and the current path from the cell array to the BL is cut off. But the bit cell still contributes discharge current to the BL when the weight equals 0, as a leakage current path still exists between the bit cell and the BL. Therefore, the significance of weight and activation is not equivalent, i.e., non-symmetric. The deviation then accumulates in the SUM operation and leads to quantization error.

As illustrated in the previous XOR-CIM [21], multiple input combinations may produce the same logical partial result but acquire different analog values. As Fig. 2 shows, two cases contain identical logic computing results, but the actual accumulated analog currents differ, because the bottom two bit cells of the right column still contribute current, whereas the current path is totally cut off when the WL is not activated, as for the bit cell in the left column. This logic-analog mismatch leads to sense margin degradation.

3) ADC Overhead: The accumulated bit-wise current is converted to a digital signal by an ADC or sense amplifier. In general, the area and power overhead of an ADC is proportional to its quantization precision and computing latency. A high-resolution ADC is not desirable for a CIM design because of its large area overhead. In addition, the maximum bit precision is restricted by the sense margin of the computing circuit, which degrades severely at higher bit-widths.

III. MACRO ARCHITECTURE

A. Binary Representation

The weight and activation are quantized to one bit in binary representation, each logically equal to +1 or −1. But it is difficult to find a negative physical quantity, which necessitates a logic-to-physical conversion. Previous works utilized the relative strength of pull-up and pull-down networks [7], [19] to represent positive and negative numbers, but can hardly ensure that the current contributions of the two paths are identical, which leads to the logic-analog mismatch challenge discussed in the previous section. Therefore, it is natural to map the signed representation (+1/−1) to an unsigned representation (1/0). Observing the XNOR results in Table I, a linear relationship between the two representations can be found: IN_sign × W_sign = 2(IN_unsign ⊙ W_unsign) − 1, where ⊙ denotes XNOR. Then Equ. 1 is derived to prove that the signed multiplication and accumulation (MAC) can be linearly mapped to an unsigned XOR accumulation (XAC). After the linear transform, each operand can be represented by a physical quantity such as voltage (VDD/0), and the XNOR operation is replaced by an XOR operation (⊕), which will be explained in the last paragraph of Section III.C.

Result_MAC = Σ IN_sign × W_sign
           = Σ (2(IN_unsign ⊙ W_unsign) − 1)
           = 2 Σ (IN_unsign ⊙ W_unsign) − N
           = 2(N − Σ (IN_unsign ⊕ W_unsign)) − N
           = N − 2 Σ (IN_unsign ⊕ W_unsign)    (1)

where the sums run over the N activated input/weight pairs, ⊙ denotes XNOR, and ⊕ denotes XOR.

Fig. 3. Binary MAC and Multi-bit Binary Representation.

To accommodate algorithm accuracy requirements, the binary representation is extended to a multi-bit binary representation (MBR) in this work. Specifically, each single bit takes a binary value (+1/−1) and keeps its bit significance. For example, 1001 equals (+1) × 2³ + (−1) × 2² + (−1) × 2¹ + (+1) × 2⁰, i.e., a decimal value of 3. The relationship between the 4-bit binary representation and the corresponding decimal logic value is shown in Fig. 3 (right). The benefit of MBR is that the range is extended up to two times larger than the 2's complement representation because of the wider interval. Moreover, the basic operation for multiplication is still XNOR, so the computing unit for binary representation can be reused.
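To make the mapping of Equ. 1 concrete, the following minimal Python sketch (ours, not the authors' code; names are illustrative) checks the signed-to-unsigned transform exhaustively for random ±1 vectors:

```python
import random

def signed_mac(inputs, weights):
    # Direct signed MAC with +1/-1 operands.
    return sum(i * w for i, w in zip(inputs, weights))

def xac_recover(inputs, weights):
    # Map +1 -> 1, -1 -> 0, accumulate XOR results (XAC),
    # then recover the signed MAC via Equ. 1: N - 2 * XAC.
    u_in = [(i + 1) // 2 for i in inputs]
    u_w = [(w + 1) // 2 for w in weights]
    xac = sum(a ^ b for a, b in zip(u_in, u_w))
    return len(inputs) - 2 * xac

for _ in range(1000):
    n = 32  # one SBR-CA activates up to 32 columns (Section III.B)
    ins = [random.choice((-1, 1)) for _ in range(n)]
    ws = [random.choice((-1, 1)) for _ in range(n)]
    assert signed_mac(ins, ws) == xac_recover(ins, ws)
```

The assertion never fires, confirming that the XAC result plus the fixed offset N fully determines the signed MAC.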


Fig. 3 presents one example of binary-mode MAC computing. In binary mode, 4 groups of multiplications are accumulated vertically. The logic values (+1/−1) are mapped to 1/0 and the multiplication is replaced by an XOR operation. Then the result after accumulation is recovered to a 2's complement value as illustrated in Equ. 1.

Fig. 4. Multi-bit Binary Representation MAC.

Fig. 4 shows an example of a multi-bit binary representation (MBR) MAC in which two groups of 4-bit multiplications are added together. The input is split and computed in a bit-serial manner. In cycle 1, the least significant bit of input 0 computes XOR with weight 0; similarly, input 1 performs XOR with weight 1. All XOR results are then accumulated according to each bit position to acquire partial results, which are transformed to decimal values based on Equ. 1. In cycle 2, the second least significant bit is selected to perform XOR with the stationary weight. After exhaustively iterating, the computing results from all cycles are shifted and merged to acquire the final result.

B. Top Level Architecture

The explosive parameter sizes and operation counts of neural networks place a high demand on memory bandwidth. Though binary quantization and computation-in-memory design greatly reduce data movement, previous implementations encounter the challenge of sense margin degradation or incur large design overhead. Besides, the data precision of binary accelerators is fixed, so they cannot be deployed for deeper networks. Therefore, an extendable bit-width binary CIM is presented to realize energy-efficient binary representation computation. In addition, a reconfigurable digital peripheral post-processes partial results based on various pattern-dependent mapping methods.

Fig. 5. Top Level Architecture Design.

The CIM macro contains 8 separate CIM slices, and each individual slice performs 1- to 8-bit-precision XOR and accumulation (XAC) operations. As Fig. 5 shows, up to 32 inputs are delivered into a multi-bit binary representation compute array (MBR-CA) and shared by all single binary representation compute arrays (SBR-CA). The bit-wise computation is realized in the computing unit of each SBR-CA. The local computing units are embedded in the SRAM array and shared by two array columns. Up to 32 local computing units in each SBR-CA can be activated simultaneously to perform the XOR operation, and the results are accumulated horizontally in the form of current. Then a flash ADC converts the value from an analog voltage intensity to digital bits. The partial result is recovered and merged with the values from the other SBR-CAs in the recover-reconfigurable merge unit (R-RMU). In the macro peripheral, the neural network and corresponding dataset size determine the CIM mapping strategy, the input/weight bit precision, and the configuration unit setup. The CIM computing results are stored in the data buffer, and the macro also accepts off-chip data from the input buffer. The data allocator decides the input to the CIM macro based on the configuration unit.

C. Single Binary Representation Compute Array

The single binary representation compute array (SBR-CA) includes a 16 × 64 6T SRAM cell array. Two columns of the cell array form one column group and share one load-compute MUX (LC-MUX) and a column-wise local computing unit (LCU), which is employed to support synchronous data loading and in-situ computing. The macro also consists of a Flash ADC, row/column circuitry, and other peripherals.

The SBR-CA performs vector-wise analog computing in CIM mode. One WL at a time is selected and activated by the macro peripheral to avoid the read disturb issue. The data read out from the bit cell is selected by the load-compute MUX and pulled rail-to-rail. Then the local computing unit performs a bitwise XOR operation between the stored weight and the input data. All the bitwise results between SRAM cell weights and input activations are accumulated horizontally and quantized by the Flash ADC. Fig. 7 illustrates the implementation details of the Flash ADC. The fundamental operation mechanism is similar to previous Flash ADC works [22], [23], in which the accumulated voltage is compared with multiple voltage ladders. The comparator component is clock-enabled to reduce power consumption and employs a three-stage differential amplifier and latch to realize rail-to-rail signal conversion.
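As a behavioral reference, the following Python sketch (illustrative only, not the authors' model) captures one SBR-CA compute cycle: bitwise XOR against a stored weight row, current-mode accumulation idealized as a sum, flash-ADC quantization, and the Equ. 1 recovery performed downstream:

```python
def sbr_ca_cycle(input_bits, weight_bits, adc_levels=33):
    """Idealized model of one SBR-CA cycle.

    input_bits/weight_bits: lists of 0/1, up to 32 activated columns.
    adc_levels: the flash ADC must resolve XAC values 0..32,
                hence 33 levels (an assumption consistent with the text).
    """
    assert len(input_bits) == len(weight_bits) <= 32
    # Each activated LCU contributes one unit of current when
    # input XOR weight = 1; the accumulate line sums them.
    xac = sum(i ^ w for i, w in zip(input_bits, weight_bits))
    # Flash ADC: ideal quantization of the accumulated level.
    quantized = min(xac, adc_levels - 1)
    # Recover unit (R-RMU): signed MAC = N - 2 * XAC per Equ. 1.
    n = len(input_bits)
    return n - 2 * quantized

# Example with 8 activated columns: prints the signed partial sum.
print(sbr_ca_cycle([1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1]))
```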


Fig. 6. Single Binary Representation Compute Array design.

Fig. 7. Flash ADC design.

The circuit scheme and design choices of the local computing unit are introduced in Fig. 6. The weight data stored in the bit cell controls the gates of PMOS T0 and NMOS T3. The input data decides the on/off state of NMOS T1 and PMOS T2, and T4 is an enable footer to realize power gating. The local computing unit can be viewed as attaching two interleaved current paths to a partial accumulate line. Both paths are cut off once the footer is not enabled or the input equals the weight. Otherwise, one of the paths conducts and contributes current to the horizontal accumulate line. Therefore, an XOR operation is realized by this x-shaped circuit scheme. The symmetric design ensures that the logical value is matched with the actual analog current intensity. For example, logical 0 is represented by zero current contribution when both paths are theoretically cut off. The relationship between the logic value and the actual physical quantity is unique. This logic-analog matching guarantees that the same accumulation result, even with different input combinations, still has identical current intensity. This resolves the challenge of voltage shifting introduced in Section II.B and enlarges the sense margin. Though a leakage path exists, the typical current on the accumulate line is around 1.5 µA when a current path is conducting, which is much higher than the leakage current (<350 pA). Compared with other memory technologies (RRAM on/off ratio of 10∼1000), the on/off ratio of SRAM-CIM is high enough to guarantee the computing accuracy and makes the LCU more resistant to process variation.

Fig. 8. Computing Unit Current Combination.

Multiple input/weight combinations are shown in Fig. 8. When input and weight are both 0, neither path can contribute current to the horizontal accumulate line, as the two NMOS transistors are off. Similarly, both PMOS transistors are cut off for a 1/1 (input/weight) combination. The red line presents an available current path when input and weight differ from each other. For example, the right path opens when the input equals 0 and the weight is 1. The computing unit is totally shut off once the enable signal is 0, which reduces power consumption in low-CIM-utilization situations. The current from all horizontal computing units merges on the partial accumulate line and is quantized by the ADC. Then the R-RMU decodes the partial results for post-stage computing.

Both cases have identical results after accumulation, as shown in Fig. 8. Though the specific input and weight combinations vary, the current intensity on the partial accumulate line is equal. This symmetric XOR computing unit ensures that all data patterns with the same logical computing result have identical analog current intensity, which eliminates the voltage shifting challenge and the need for a compensation circuit. It might seem simple to modify the circuit scheme by moving all PMOS (NMOS) transistors to one side, which would turn the computing operation into an XNOR. However, the currents contributed by the NMOS path and the PMOS path are not equivalent, which would lead to deviation after accumulation. This phenomenon explains why a further transformation from XNOR to XOR is required in Equ. 1.
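To make the symmetry argument explicit, this small sketch (ours; unit current normalized to 1) enumerates the LCU's operating points and confirms the current contribution follows the XOR truth table:

```python
def lcu_current(inp, weight, enable):
    # X-shaped local computing unit, idealized:
    # a unit current flows only when the footer is enabled
    # and exactly one of the two interleaved paths conducts,
    # i.e., when input != weight.
    if not enable:
        return 0  # power-gated: both paths cut off
    return inp ^ weight  # 1 unit of current iff input XOR weight = 1

print("in w en | I")
for inp in (0, 1):
    for weight in (0, 1):
        for enable in (0, 1):
            print(f" {inp}  {weight}  {enable} | {lcu_current(inp, weight, enable)}")
```

Because every conducting combination injects the same unit current, any set of columns with the same XOR count produces the same accumulated current, which is exactly the symmetry property argued above.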


D. Load-Compute MUX

This feature is based on the observation that not all columns within a row are activated at the same time in previous CIM designs, usually 8/16/32 of 64 columns per cycle [24]. This choice accommodates the sense margin and computing accuracy, as discussed in Section II.B. Furthermore, a large number of activated columns would require a high-precision ADC to quantize the analog result, which degrades computing latency and increases area overhead. To handle this trade-off, at least half of the columns remain idle during CIM computing mode. Therefore, this work makes it possible to update weights in the non-activated columns.

As Fig. 6 shows, two columns (one column group) share a load-compute MUX to realize simultaneous data loading and computing, which greatly reduces the loading latency. For the entire CIM macro, only two SBR-CAs per cycle can load fresh data due to the limited off-chip I/O bandwidth (64 bits/cycle). But it should be noticed that an activated weight kernel is shared by multiple input activations during sliding. Thus, the stationary weight in the CIM macro enjoys a long data reuse period (>100 cycles) for convolution computing, which leaves enough time to write weights into the same row of all 64 SBR-CAs (32 cycles).

Fig. 9. Load and Compute Column MUX Design.

The capability of handling simultaneous load-compute is facilitated by the adoption of the LC-MUX. Fig. 9 presents the circuit design of the LC-MUX and the corresponding control signals. In load-compute mode, the write enable (WE) is on, and Waddr and Caddr then activate opposite columns. Fresh data is loaded through the write driver while the value of the bit cell in the neighboring column is read out to the local computing unit (LCU). When the CIM switches to compute mode, WE is cut off and the path from the write pin to BL/BLB is invalidated; only computing operations are performed within the CIM array. The size of the write driver is configured to be larger than the access transistor for a better driving force. The area overhead of introducing parallel loading and computing is reduced by reusing the intrinsic SRAM column MUX and the corresponding input pins of the SRAM. And one LC-MUX circuit is shared by two array columns, which helps maintain the column-wise pitch matching.

Throughput_system = (T_compute / (T_compute + T_load)) × Throughput_peak    (2)

Fig. 10. Load and compute pipeline improvement.

Fig. 10 details the load-compute pipeline benefits. Assuming the CIM macro is unable to store all weight parameters, the kernels are divided into multiple groups that perform group-wise convolution computing. In load-compute mode, one column is selected as the computing column and performs XAC with the input vectors. In the same cycle, the LC-MUX activates the write path for the other column and loads data from off-chip. The left snapshot of the CIM array indicates that kernel group 0 performs computing while the weights of kernel group 1 are written into the other column during the same period. In compute mode, the CIM macro cuts off the writing path and focuses on XAC computing, as the right example in Fig. 10 shows. The original pipeline without the LC-MUX cannot hide the latency of weight loading, which degrades the overall throughput; the theoretical system performance is limited when loading latency is considered, as Equ. 2 shows. By introducing the LC-MUX and pipelining, the system throughput is close to peak performance, since only the initial weight loading cannot be hidden by the CIM macro pipeline.

Fig. 11. Circuit Design of R-RMU.

E. Recover-Reconfigurable Merge Unit

The partial results generated by each SBR-CA are first recovered to the original XNOR values by Equ. 1. In addition, partial results from the SBR-CAs are shifted and accumulated based on their respective bit positions in the multi-bit binary representation.

The data precision requirement is algorithm- and dataset-dependent, which further determines the corresponding mapping strategy. A digital peripheral is needed to perform flexible shifting and accumulating. The recover-reconfigurable merge unit is employed in this design to finish the digital post-processing; it consists of a recover unit and a reconfigurable merge unit.

First, the recover unit converts the unsigned (1/0) XOR-and-accumulation (XAC) results into signed (+1/−1) multiplication-and-accumulation (MAC) results based on Equ. 1. The recovered result is acquired by subtracting the shifted SBR-CA computing result from an offset, as N − 2 × CA_result, where N represents the number of activated columns and CA_result is the quantized value from the compute array. Considering the XAC value range, 6 bits are sufficient to represent the recovered value. Next, the 8 separate recovered results are accumulated within the reconfigurable merge unit. The configuration unit sets up the shifting value for each 6-bit recovered result based on the specific mapping method. For instance, 1-bit XAC computing selects the non-shifting path, and partial results are directly bypassed to the output without shifting. In other scenarios, each stage shifts and merges two neighboring partial results according to their bit positions in the multi-bit representation. A 3-stage shift/adder is implemented in the reconfigurable merge unit to accommodate power-of-2 bit-precision computation.
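A behavioral sketch of the R-RMU post-processing might look as follows (ours, not the authors' RTL; the loop stands in for the 3-stage shift/adder tree, and the LSB-first ordering of the SBR-CA results is an assumption):

```python
def r_rmu(xac_results, n_cols):
    """Recover-reconfigurable merge unit, behavioral sketch.

    xac_results: quantized XAC value from each SBR-CA, ordered
                 LSB-first by weight bit position (assumed).
    n_cols:      number of activated columns N (<= 32).
    """
    merged = 0
    for k, xac in enumerate(xac_results):
        # Recover unit: signed bit-plane MAC = N - 2 * XAC (Equ. 1);
        # for N <= 32 this fits in the 6-bit recovered value.
        recovered = n_cols - 2 * xac
        # Merge unit: shift by the weight-bit significance and add.
        # Hardware uses a 3-stage shift/adder tree; this loop is the
        # functional equivalent for power-of-2 precisions.
        merged += recovered << k
    return merged

# 1-bit XAC selects the non-shifting path: a single SBR-CA result
# passes through with k = 0.
print(r_rmu([12], n_cols=32))  # -> 32 - 2*12 = 8
```

For multi-bit inputs processed bit-serially, each cycle's merged result would additionally be shifted by the input bit position before the final accumulation, mirroring the shift-and-merge flow of Fig. 4.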


IV. DATA MAPPING FOR BR-CIM ARRAY

Fig. 12. Data Mapping for Binary Representation.

The data mapping determines the computing flow and the array utilization. The configurable CIM macro provides more freedom for data mapping with the assistance of the digital R-RMU. The data mapping for the binary neural network is illustrated in Fig. 12, which uses a convolution layer as an example. Assuming the weights are stationary in the CIM array, the weight kernels iteratively slide over the entire input feature, and each iteration is a three-dimensional convolution. One sliding convolution requires R × R × C computations, where R represents the panel size of each filter and C equals the input channel number. In addition, the input feature is shared by K separate kernels to extract various information. The iteration provides large computing parallelism, which can be handled by a regularized, structured memory array. The fully connected layer can be viewed as a special case in which the weight kernel size is equivalent to the input feature size, and data parallelism still exists between one input datum and a large intermediate connection. Thus, the mapping method described in this work is also suitable for FC layers. The specific mapping method is shown in Algorithm 1, which is divided into two sections: binary representation and multi-bit binary representation.

A. Binary Representation

For a single-bit binary neural network, the weights along the channel-wise direction are stored in the same row of an SBR-CA. The local computing unit performs XAC between the weight kernel and the input feature in the channel dimension. For example, the first SBR-CA stores the weights from the same channel of kernel 0. As different kernels share the same input feature, the input to the CIM macro can perform XAC with 64 kernels simultaneously. The weights of one channel are allocated to the same row of the cell array, since the XOR results with channel-wise inputs are accumulated horizontally. The weights in the channel dimension are evenly split and stored in multiple rows once the channel depth is larger than 32 (the largest channel size supported by the CIM per cycle). Conversely, weights from multiple channels are allocated to the identical row of the array if the channel number is much smaller than the number of array columns. This scenario is common for the first layer of a network, as the input RGB image generally has only 3 channels. The other column that shares the same LC-MUX stores weights in the panel-wise dimension. In load-compute mode, the adjacent column can be written with weights from other kernels or from the next layers, so as to avoid extra loading latency. For the computing columns, all bit-wise results from the computing units merge horizontally in the form of current. A Flash ADC converts the analog results to digital values based on the sampled voltage of the accumulate line, which requires parallel comparison against a pre-defined voltage ladder. Then the quantized results are decoded to logical values and directly bypassed to the output buffer by selecting the non-shifting path in the R-RMU.

Fig. 13. Data Mapping for Multi-bit Binary Representation.

B. Multi-Bit Binary Representation

The binary representation is a special case of the multi-bit binary representation. This work supports integer power-of-2 bit-width precision by employing the R-RMU. Taking 8 bits as an example, input data is asserted to the CIM in a bit-serial pattern. Kernel weights are bit-wise split and allocated to 8 SBR-CAs (1 MBR-CA), as shown in Fig. 13. Each SBR-CA performs bitwise XAC in the channel dimension, and then the partial results from the 8 compute arrays are recovered and accumulated by one R-RMU according to each bit position. The other CIM slices are responsible for computing different kernels that share the same input feature. The weight allocation rule within an SBR-CA is equivalent to the single-binary situation.

The mapping method can also be configured to perform fully connected computing. Considering the parallelism of matrix multiplication, the input feature is aligned with multiple weight rows, and the corresponding weight parameters are mapped to arrays or slices that share the input data. Fully connected layer acceleration is achieved by massive CIM parallelism. Similarly, the weights are bit-split and stored in separate SBR-CAs, and the partial results are recovered and merged based on the specific bit precision by the R-RMU.

Algorithm 1 Data Mapping for Binary Representation CIM
ADDRESS COMPUTING (W, Array):
  If MULTI-BIT BINARY REPRESENTATION:
    For w : 0 → Bit_width(W)
      For k : 0 → kernel_num(W)
        For j : 0 → kernel_width²(W)
          For c : 0 → channel(W)
            If (j % 2 == 1):
              Row = c(j + 1)/(length(Array)/2)
              Col = 2(c % (length(Array)/2)) + 1
            Else:
              Row = c(j + 1)/(length(Array)/2)
              Col = 2(c % (length(Array)/2))
            Bank = wk
            WEIGHT LOADING(W, Array)
  Else (SINGLE BINARY REPRESENTATION):
    w = 1
    For k : 0 → kernel_num(W)
      For j : 0 → kernel_width²(W)
        For c : 0 → channel(W)
          If (j % 2 == 1):
            Row = c(j + 1)/(length(Array)/2)
            Col = 2(c % (length(Array)/2)) + 1
          Else:
            Row = c(j + 1)/(length(Array)/2)
            Col = 2(c % (length(Array)/2))
          Bank = k
          WEIGHT LOADING(W, Array)
WEIGHT LOADING (W, Array):
  Write W[w][k][j][c] → Array[Bank][Row][Col]
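For readers who prefer an executable form, here is a hedged Python transcription of Algorithm 1 (the floor division and the reading of "Bank = wk" as a product are our interpretation of the pseudocode, not confirmed by the authors):

```python
def address_computing(array_len, bit_width, kernel_num,
                      kernel_width, channels, multi_bit=True):
    """Yield (bank, row, col, (w, k, j, c)) placements per Algorithm 1.

    Assumptions (ours): c(j+1)/(length(Array)/2) is floor division,
    and 'Bank = wk' is read as the product w * k.
    """
    half = array_len // 2
    bit_range = range(bit_width) if multi_bit else (1,)
    for w in bit_range:
        for k in range(kernel_num):
            for j in range(kernel_width ** 2):
                for c in range(channels):
                    row = (c * (j + 1)) // half
                    col = 2 * (c % half) + (1 if j % 2 == 1 else 0)
                    bank = w * k if multi_bit else k
                    yield bank, row, col, (w, k, j, c)

# Example: a 3-channel first layer, 3x3 kernels, single-bit weights.
for placement in address_computing(64, 1, 2, 3, 3, multi_bit=False):
    print(placement)
```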


V. IMPLEMENTATION AND EVALUATION RESULTS

A. Experimental Setup

An illustrative schematic and layout of BR-CIM is implemented in TSMC 28nm CMOS technology. The entire CIM consists of 64 single binary representation compute arrays (SBR-CA), and each SBR-CA contains 16 × 64 bit cells. The CIM macro is designed to operate between 0.8∼0.99 V. The algorithm and dataset benchmarks include binary-quantized LeNet/ResNet-18 for the MNIST/CIFAR-100 datasets. The algorithm accuracy proves the effectiveness of the binary representation achieved by the CIM macro. The linearity of the computing unit is analyzed by measuring the sense margin under various power supplies and temperatures. Experiments on computing latency and energy efficiency have been conducted to show the detailed methodology and efficiency of BR-CIM. The power and area breakdown introduces the design choices and the corresponding overhead. Lastly, a comparison with previous CIM works is listed to illustrate the improvement of this work in area and power efficiency. For a fair comparison, most of the selected CIM works can also perform binary computation or the XNOR operation.

Fig. 14. Test Accuracy on LeNet.

B. Algorithm Analysis

This section elaborates on algorithm accuracy with different bit-precision binary representations. To validate the accuracy, LeNet-5 and ResNet-18 are applied to the MNIST and CIFAR-100 datasets, respectively, for recognition. Several input and weight bit-precision combinations are explored in this work, and all of them are quantized to single- or multi-bit binary representation. The simulation results indicate that the accuracy loss with binary representation is acceptable for lightweight networks, while higher bit-width representations can reach the baseline FP32 accuracy level for deeper networks and larger datasets. For LeNet in detail, all precision combinations incur less than 10% accuracy degradation, as Fig. 14 shows. The performance of the multi-bit binary representation, such as 8b weight/8b activation, reaches up to 97.82% accuracy, which is even better than the baseline performance. We then compare the performance of different bit precisions for deeper networks and larger datasets; ResNet-18 and the CIFAR-100 dataset are selected. As Fig. 15 shows, the accuracy loss induced by higher bit-widths is much lower than that of the binary representation. The 8-b weight and 8-b activation combination achieves accuracy almost equal to the baseline (76.4% versus 76.79%), but the one-bit binary representation incurs much more error and cannot be directly deployed. The results prove that the multi-bit binary representation is more suitable for complex task scenarios and larger datasets, while it remains feasible to apply lower bit precision for lightweight neural networks. To conclude, the CIM design is required to adapt to different bit-precision combinations based on the specific application scenario.


Fig. 15. Test Accuracy on ResNet.

C. Computing Unit Linearity Analysis

The ideal sense margin remains stable and distinguishable even as the accumulated voltage increases. But as more current merges into the horizontal accumulation line, the transistors within the local computing units are forced to leave the linear region and cannot function as expected. The computing deviations of all transistors merge, degrade the entire sense margin, and lead to quantization error. To guarantee the CIM accuracy and relieve the burden on the Flash ADC, the maximum number of activated columns of one SBR-CA is constrained to 32.

Fig. 16. Interval Voltage vs XAC Value.

Fig. 16 illustrates the relationship between the interval voltage and the XOR accumulation value (XACV) under different supply voltages. With an ideal computing unit, the accumulated voltage grows linearly with the number of activated columns (XACV), and the interval between two neighboring voltage nodes remains stable; this interval is regarded as the theoretical sense margin. The y-axis shows the interval voltage normalized to the initial interval voltage, between XAC 0 and XAC 1. For supply voltages above 0.8 V, the interval voltage decreases slowly and is distinguishable for ADC quantization even at XAC 32. The most difficult duty for the ADC is to distinguish between the all-current-paths-open case (XACV 32) and the only-one-path-cut-off case (XACV 31). The interval shrinks when the transistors begin to behave non-linearly, as discussed in the previous section. At the 0.7 V node, the interval voltage drop remains trivial before XAC 16 but declines suddenly after XAC 18, which indicates that the sense margin is shrinking and the ideal operating range is less than 16; specifically, the number of local computing units activated at the same time should be restricted to below 16. At a 0.6 V supply voltage, the interval voltage decreases much earlier and the sense margin eventually degrades to 20% of the initial signal margin, which proves that the proper function of the local computing unit is also limited by the supply voltage. Considering the energy efficiency and accuracy trade-off, the CIM macro is required to operate at 0.8∼0.99 V for acceptable computing linearity, relieving the pressure on ADC quantization.

Fig. 17. Coefficient of Determination vs XAC Value.

The coefficient of determination (CoD) is utilized to estimate the deviation from the ideal computing curve. The CoD is the square of the correlation (r) between predicted results and actual results; thus, it ranges between 0 and 1. When r² equals 1, the actual output can be predicted without error from the independent variable; conversely, an r² of 0 means the scatter plot can hardly be predicted by the fitted model. The CoD between the measured voltage nodes on the partial accumulate line and the ideal linear plot describes the estimated accuracy of the computing unit. Fig. 17 presents the variation of the CoD with XACV; the initial CoD at XACV 0 is equal to 1. For a 0.7 V supply voltage, the accuracy ratio degrades obviously after XAC 24, which incurs an order of magnitude higher deviation rate. The performance under a 0.6 V supply voltage is even worse due to the limited voltage swing. In contrast, the CoD sustains above 99% between 0.8 and 0.99 V supply voltage, which ensures the quantization accuracy. Overall, this design limits the minimum supply voltage to mitigate the error rates of the computing results.

CoD = R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²    (3)

where ŷ represents the predicted value, y is the corresponding actual value, and ȳ is the mean of the actual values.
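For reference, a small hedged sketch (ours, with made-up numbers rather than the measured data) computes the CoD as the squared correlation described above:

```python
def coefficient_of_determination(actual, predicted):
    # CoD = r^2: squared Pearson correlation between the measured
    # voltages and the ideal linear prediction (Equ. 3).
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)

ideal = list(range(33))  # ideal linear voltage levels, XACV 0..32
# A curve whose margin compresses after XACV 24 (illustrative only).
measured = [v if v < 24 else 24 + 0.8 * (v - 24) for v in ideal]
print(coefficient_of_determination(measured, ideal))
```

The squared-correlation form is used here because it is always bounded in [0, 1], matching the text's definition of the metric.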


The previous experiments characterized the sense margin and computing linearity of BR-CIM at a normal temperature of 27 degrees. Other temperature nodes are then selected to estimate the circuit linearity under temperature variation. The supply voltage of the temperature experiment is fixed at 0.9 V. Fig. 18 describes the relationship between computing linearity and multiple temperature nodes. As before, the interval voltage between two adjacent XAC values is normalized to the first interval (between XAC 0 and XAC 1). The functionality of the CIM at 27 degrees has been proved by the previous signal margin and coefficient of determination analysis. The results show that the interval voltage at the other temperature nodes does not shrink greatly with increasing XACV. Compared with 27 degrees, the signal margin at −40 degrees shrinks by less than 15%, and the sense margin at 80 and 120 degrees even improves by 13.6% and 21.7%, respectively. The experiment proves the CIM computing circuit is able to maintain linearity and operate at temperature nodes from −40∼120 degrees.

Fig. 18. Temperature Effects on Computing Linearity.

D. Algorithm Loading and Computing Latency

Fig. 19. Load and Compute Latency.

Fig. 19 presents the execution latency of each convolutional layer of ResNet-18, which consists of data loading and CIM computing. We estimate the baseline latency using this work in compute mode, where the CIM has to complete weight writing and computing sequentially. By employing the LC-MUX, the latency caused by data loading vanishes in load-compute mode: while the activated weights iteratively compute with multiple activations, the write column path is enabled and replaces retired weights with fresh off-chip data. This experiment reveals the portion of latency that can be removed by simultaneous loading and computing. With a fixed DRAM writing I/O bandwidth, the data writing bandwidth is identical in all CIM modes.

As Fig. 19 shows, the weight loading duration does not account for a large proportion of the total latency in the first few layers of ResNet-18. The reason is that the panel-wise size of the feature map is large, which introduces more data reuse: the kernel weights can stay stationary in the CIM for a longer period, which mitigates the need for fresh weight accesses. With down-sampling layers, the feature map shrinks, and the period of a weight from activation to retirement also decreases. In the last few layers, data reuse cannot amortize the latency of data loading, and the CIM macro is required to load weight parameters more frequently. Therefore, the significance of parallel loading and computing is noticeable, improving the latency by up to 36.6%. The average latency reduction is 21.5% over all convolution layers in ResNet-18.

The latency of finishing algorithm computing with different input/weight bit precisions in the CIM macro is shown in Fig. 20 with logarithmic coordinates, in which the performance is normalized to the 1b/1b binary representation of each algorithm. The specific precision combination is listed above each plot. The latency of ResNet-18 strictly follows the linear curve with increasing precision. However, the latency of LeNet does not scale with weight precision. For instance, the latency with 8b weights is not twice the period of 4b weights, which is caused by limited CIM utilization: the lower bit-width cannot fully activate the CIM macro, as the channel depth or kernel number is bounded in the lightweight network. Instead, ResNet-18 fills the CIM array and achieves algorithm acceleration with all CIM slices activated. For both algorithms, the latency grows linearly with the activation precision, since the CIM macro employs the bit-serial manner.
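To see how Equ. 2 plays out numerically, a small hedged calculation (illustrative numbers, not measured data) compares the sequential and load-compute pipelines:

```python
def system_throughput(t_compute, t_load, peak):
    # Equ. 2: loading latency scales down the achievable throughput.
    return t_compute / (t_compute + t_load) * peak

peak_tops = 1.0  # normalized peak throughput
# Early ResNet-18 layer: long weight reuse, loading well amortized.
print(system_throughput(t_compute=100.0, t_load=5.0, peak=peak_tops))
# Late layer: short reuse period, loading dominates without LC-MUX.
print(system_throughput(t_compute=20.0, t_load=10.0, peak=peak_tops))
# With the LC-MUX pipeline, t_load is hidden, so throughput -> peak.
print(system_throughput(t_compute=20.0, t_load=0.0, peak=peak_tops))
```

The three cases echo the trend in Fig. 19: the benefit of the LC-MUX grows as the weight reuse period shrinks in the later layers.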


Fig. 20. Latency of Network Computing with Different Bit Precision.

E. Energy Efficiency of CIM Design

Fig. 21. Energy Efficiency of Different Bit Precision.

Fig. 21 indicates the energy efficiency of the networks under multiple precision combinations, where the definition of one operation depends on the precision combination. In the ResNet-18 case, the throughput performance is linearly related to the bit-width because of full CIM utilization. The highest energy efficiency (1280 TOPS/W) occurs at 1b/1b under 0.8 V. The working cycle time is 3 ns, which satisfies the computing linearity requirement and achieves a CoD higher than 0.99 for all XACV. But the energy efficiency evaluation of LeNet does not scale with bit precision, for a reason similar to the analysis of the algorithm computing latency: this phenomenon is caused by the 'dark silicon' of the CIM array. The lower bit-width of the lightweight LeNet network causes limited CIM utilization, which reduces the throughput of the system. Though the computing units of an idle memory array are not activated, the corresponding cell array is still powered to retain valid data. The highest energy efficiency for LeNet occurs at the 1b input and 4b weight combination.

F. CIM Array Power and Area Breakdown

The schematic and layout of BR-CIM are implemented in TSMC 28nm CMOS technology. The overall array size reaches 64 Kb and contains 64 flash ADCs for converting the compute array data and 8 digital recover-reconfigurable merge units. The power breakdown is shown in Fig. 22, where the computing units (CU) account for nearly half of the power consumption, as the computing units have to merge current from up to 32 paths. The flash ADC consumes a quarter of the power, since multiple parallel comparators are activated at the same time to acquire a thermometer code during quantization. The remaining power belongs to the cell array, power reference, digital unit, and other CIM peripherals.

Fig. 22. CIM Macro Power Breakdown.

Fig. 23. CIM Macro Area Breakdown.

The layout of the CIM macro is shown in Fig. 23, which presents the floorplan of the subunits and the corresponding area breakdown. Each slice contains a cell array, computing units, an ADC, and other column/row circuitry. The ADC occupies a large proportion of the chip area since a Flash ADC is employed in this design; the short quantization latency comes at the expense of multiple parallel comparators. A clock-enabled comparator is implemented in this work, and the reference voltage for each comparator is acquired from a resistor ladder. Considering that the Flash ADC is shared by the entire compute array, the area overhead of the ADC can be amortized by scaling the memory array size.

G. Comparison


TABLE II
COMPUTING-IN-MEMORY COMPARISON

Multiple SRAM-based computation-in-memory works are listed in Table II, which compares the cell array size, activation and weight precision, macro area and cell area, operation function, energy efficiency, and evaluation algorithm. For a fair comparison, most of the CIMs are selected because those works are optimized for binary neural networks or can perform energy-efficient XNOR operations.

The results indicate that the performance of this work is superior to other SRAM-based CIMs designed for binary computation in power and area efficiency. In addition, this work makes an effort to enlarge the signal margin and improve the computing linearity. Previous works illustrate the effectiveness of single-bit binary representation for lightweight networks and datasets, like MNIST; however, reduced-bit-precision algorithms can hardly achieve acceptable accuracy when encountering a larger dataset. Based on this observation, the weight/activation precision is extendable in this work to achieve acceptable accuracy when handling deeper networks. The results show that the multi-bit binary representation can realize prediction accuracy equivalent to the full-precision representation and better than an 8b MAC CIM array. In addition, the multi-bit binary representation is based on the same basic operation (XOR) and can be realized by the intrinsic computing unit without incurring complex overhead. The energy efficiency of BR-CIM also outperforms other binary or multi-bit multiplication-and-accumulation CIMs. The best energy efficiency is estimated under 0.8 V and a 3 ns cycle time.

VI. CONCLUSION

This work implements binary representation computation-in-memory (BR-CIM) for the binary number system. The cell array realizes in-situ XOR and accumulation (XAC) computing and eliminates frequent data movement. The computing unit employs a symmetric circuit design to enlarge the sense margin and ensure computing linearity. Based on the intrinsic column peripheral, a load-compute MUX is utilized to support parallel weight loading and computing, which enables hiding the write access latency. The binary number is then extended to a multi-bit representation to meet the higher accuracy requirements of larger datasets. Lastly, flexible data mapping is proposed to explore data parallelism under different precision combinations, and a reconfigurable digital peripheral completes partial result recovery and post-processing. The design is implemented in a TSMC 28nm process. The area and energy efficiency outperform previous mixed-signal CIM works. We deploy binary-quantized LeNet/ResNet-18 on the CIM macro and achieve 97.82%/76.4% accuracy on the MNIST/CIFAR-100 datasets.

REFERENCES

[1] S. Yin et al., "An energy-efficient reconfigurable processor for binary- and ternary-weight neural networks with flexible data bit width," IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1120–1136, Apr. 2019.
[2] V. Seshadri et al., "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2017, pp. 273–287.
[3] S. Yu and P.-Y. Chen, "Emerging memory technologies: Recent trends and prospects," IEEE Solid State Circuits Mag., vol. 8, no. 2, pp. 43–56, Spring 2016.
[4] C.-X. Xue et al., "16.1 A 22 nm 4Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7TOPS/W for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 245–247.


[5] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[6] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
[7] Z. Jiang, S. Yin, M. Seok, and J.-S. Seo, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," in Proc. IEEE Symp. VLSI Technol., Jun. 2018, pp. 173–174.
[8] H. Jia et al., "15.1 A programmable neural-network inference accelerator based on scalable in-memory computing," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 236–238.
[9] J. Yue, X. Feng, Y. He, Y. Huang, and Y. Liu, "15.2 A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 238–240.
[10] M. Courbariaux, Y. Bengio, and J. P. David, BinaryConnect: Training Deep Neural Networks With Binary Weights During Propagations. Cambridge, MA, USA: MIT Press, 2015.
[11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. Cham, Switzerland: Springer, 2016.
[12] I. Hubara, D. Soudry, and R. E. Yaniv, "Binarized neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, p. 29.
[13] S. Zhu, L. H. K. Duong, and W. Liu, "XOR-Net: An efficient computation pipeline for binary neural network inference on edge devices," in Proc. IEEE 26th Int. Conf. Parallel Distrib. Syst. (ICPADS), Dec. 2020, pp. 124–131.
[14] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid, "Structured binary neural networks for accurate image classification and semantic segmentation," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 413–422.
[15] J. Kung, D. Zhang, G. van der Wal, S. Chai, and S. Mukhopadhyay, "Efficient object detection using embedded binarized neural networks," J. Signal Process. Syst., vol. 90, no. 6, pp. 877–890, Jun. 2017.
[16] B. Ferrarini, M. J. Milford, K. D. McDonald-Maier, and S. Ehsan, "Binary neural networks for memory-efficient and effective visual place recognition in changing environments," IEEE Trans. Robot., early access, Mar. 2, 2022, doi: 10.1109/TRO.2022.3148908.
[17] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 488–490.
[18] X. Si et al., "15.5 A 28 nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 246–248.
[19] J. Kim et al., "Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C118–C119.
[20] Z. Zhang et al., "A 55 nm 1-to-8 bit configurable 6T SRAM based computing-in-memory unit-macro for CNN-based AI edge processors," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2019, pp. 217–218.
[21] S. Huang, H. Jiang, X. Peng, W. Li, and S. Yu, "XOR-CIM: Compute-in-memory SRAM architecture with embedded XOR encryption," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2020, pp. 1–6.
[22] V. H.-C. Chen and L. Pileggi, "An 8.5 mW 5 GS/s 6b flash ADC with dynamic offset calibration in 32 nm CMOS SOI," in Proc. Symp. VLSI Circuits, Jun. 2013, pp. C264–C265.
[23] S. Park, Y. Palaskas, and M. P. Flynn, "A 4-GS/s 4-bit flash ADC in 0.18-µm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1865–1872, Sep. 2007.
[24] J. Yue et al., "15.2 A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 238–240.
[25] W.-S. Khwa et al., "A 65 nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[26] R. Liu et al., "Parallelizing SRAM arrays with customized bit-cell for binary neural networks," in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[27] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[28] S. Yin, B. Zhang, M. Kim, J. Saikia, and J. S. Seo, "PIMCA: A 3.4-Mb programmable in-memory computing accelerator in 28nm for on-chip DNN inference," in Proc. Symp. VLSI Circuits, Jun. 2021, pp. 1–2.
[29] H. Kim, T. Yoo, T. T.-H. Kim, and B. Kim, "Colonnade: A reconfigurable SRAM-based digital bit-serial compute-in-memory macro for processing neural networks," IEEE J. Solid-State Circuits, vol. 56, no. 7, pp. 2221–2233, Jul. 2021.

Zhiheng Yue received the B.S. degree in electronic science and technology from the Beijing University of Posts and Telecommunications, Beijing, China, in 2017, and the M.S. degree in electrical and computer engineering from the University of Michigan, Ann Arbor, MI, USA, in 2019. He is currently pursuing the Ph.D. degree in electronic science and technology with Tsinghua University, Beijing. His current research interests include deep learning, computation-in-memory, AI accelerators, and very-large-scale integration (VLSI) design.

Yabing Wang received the B.S. degree in electronic science and technology from Xidian University, Xi'an, China, in 2018. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University, Beijing, China. His current research interests include deep learning, computation-in-memory, and very-large-scale integration (VLSI) design.

Yubin Qin received the B.S. degree from the School of Electronic Science and Engineering, Southeast University, Nanjing, China, in 2020. He is currently pursuing the Ph.D. degree with the School of Integrated Circuits, Tsinghua University, Beijing, China. His current research interests include deep learning, very-large-scale integration (VLSI) design, and hardware–software co-design.

Leibo Liu (Senior Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, in 2004. He is currently a Professor with the School of Integrated Circuits, Tsinghua University. His research interests include reconfigurable computing, mobile computing, and very-large-scale integration digital signal processing (VLSI DSP).


Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from the Faculté Polytechnique de Mons, Mons, Belgium, in 1991. He became a Professor at the Institute of Microelectronics, Tsinghua University, Beijing, China, in 1995. His main research interests include VLSI SoC design, electronic design automation (EDA) methodology, and communication application-specific integrated circuit (ASIC) design. Dr. Wei is a Senior Member of the Chinese Institute of Electronics (CIE).

Shouyi Yin (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000, 2002, and 2005, respectively. He was a Research Associate with Imperial College London, London, U.K. He is currently a Full Professor and the Vice-Director of the School of Integrated Circuits, Tsinghua University. He has published more than 100 journal articles and more than 50 conference papers. His research interests include reconfigurable computing, AI processors, and high-level synthesis. Dr. Yin has served as a Technical Program Committee Member of the top very-large-scale integration (VLSI) and electronic design automation (EDA) conferences, such as the Asian Solid-State Circuits Conference (A-SSCC), the IEEE/ACM International Symposium on Microarchitecture (MICRO), the Design Automation Conference (DAC), the International Conference on Computer-Aided Design (ICCAD), and the Asia and South Pacific Design Automation Conference (ASP-DAC). He is also an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, ACM Transactions on Reconfigurable Technology and Systems (TRETS), and Integration, the VLSI Journal.
