
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 8, AUGUST 2021

TD-SRAM: Time-Domain-Based In-Memory Computing Macro for Binary Neural Networks

Jiahao Song, Student Member, IEEE, Yuan Wang, Member, IEEE, Minguang Guo, Xiang Ji, Kaili Cheng, Yixuan Hu, Xiyuan Tang, Member, IEEE, Runsheng Wang, Member, IEEE, and Ru Huang, Fellow, IEEE

Abstract— In-Memory Computing (IMC), which takes advantage of analog multiplication-accumulation (MAC) inside memory, is promising for alleviating the von Neumann bottleneck and improving the energy efficiency of deep neural networks (DNNs). Since time-domain (TD) computing is also an energy-efficient analog computing paradigm, we present an 8kb mixed-signal IMC macro, TD-SRAM, that combines IMC with TD computing. A dual-edge single-input (DESI) TD computing topology is proposed, which significantly improves the area and power efficiency of the TD cell. The TD-SRAM bitcell, consisting of a 6T DESI-based TD cell and a 6T SRAM cell, supports binary DNNs. In the IMC mode, 60 columns work in parallel and 96-input binary-MAC operations are processed in each column. Implemented in a standard 40-nm CMOS process, the TD-SRAM achieves a high energy efficiency of 537 TOPS/W at a 0.9-V supply. With different DNN topologies, the test chips achieve accuracies of 95.90%-98.00% on the MNIST dataset with a dual 2-bit time-to-digital converter (TDC).

Index Terms— In-memory computing (IMC), time-domain (TD), deep neural networks (DNNs), SRAM, binary operations.

I. INTRODUCTION

IN RECENT years, deep neural networks (DNNs) have exhibited state-of-the-art performance and achieved significant breakthroughs in various fields. However, DNN algorithms always come with a huge number of multiplication-accumulation (MAC) and memory-access operations, making them too energy-hungry to run on energy-constrained platforms. Thus, energy-efficient hardware is urgently needed to deploy DNN algorithms in edge devices.
Manuscript received February 9, 2021; revised May 4, 2021; accepted May 20, 2021. Date of publication June 9, 2021; date of current version July 13, 2021. This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFB2205502, in part by the Joint Funds of the National Natural Science Foundation of China under Grant U20A20204, and in part by the 111 Project under Grant B18001. This article was recommended by Associate Editor M.-F. Chang. (Corresponding authors: Yuan Wang; Runsheng Wang.)

Jiahao Song, Yuan Wang, Kaili Cheng, Yixuan Hu, Xiyuan Tang, Runsheng Wang, and Ru Huang are with the Key Laboratory of Microelectronic Devices and Circuits (MOE), Institute of Microelectronics, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]).

Minguang Guo and Xiang Ji are with the School of Software and Microelectronics, Peking University, Beijing 100871, China.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3083275.

Fig. 1. Illustrations of (a) in-memory computing with an analog processing engine (APE) and (b) time-domain computing.

The combination of analog computing and memory-centric design / in-memory computing (IMC) [1]–[25] has been regarded as a promising route to reduce the operating energy of MAC and memory access, as illustrated in Fig. 1. Among these approaches, time-domain (TD) computing [1]–[8] is compatible with digital circuits and, owing to its analog nature, has a low-power benefit in low-precision computing. The current-starve method is usually utilized in TD computing to convert the multiplication result to an analog delay value in each stage,
1549-8328 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on June 20,2022 at 09:57:16 UTC from IEEE Xplore. Restrictions apply.

which is controlled by the input activations and local weights. The delay is gradually accumulated at each stage after a computing edge arrives. Although the delay value has variations due to transistor mismatch, TD computing is more energy efficient than digital circuits, because it treats transistors in a more flexible way (low SNR but high efficiency). On the other hand, voltage- or current-domain computing needs a carefully engineered analog-to-digital converter (ADC) to convert the analog MAC outputs back to the digital domain, whereas standard digital library cells such as the D flip-flop (DFF) can be used to convert MAC outputs from the time domain to the digital domain, which largely reduces the design time.
At the algorithm level, DNN computing is mainly composed of MAC operations. To reduce the computational complexity, low-precision neural networks such as BinaryConnect (weights binarized to “+1/−1”), BNN, and XNOR-Net (both activations and weights binarized to “+1/−1”) have been proposed [26]–[28]. Although these low-precision neural networks show relatively low accuracy on complex tasks (e.g., the ImageNet dataset), they perform well on small datasets such as MNIST and CIFAR10, so these extremely quantized neural networks (EQNNs) are suitable for simple internet-of-things (IoT) applications. In addition, analog computing cells can be used to execute binary operations for EQNNs. Compared with digital floating-point arithmetic and logic units (ALUs), analog binary cells can attain state-of-the-art energy efficiency at the cost of a limited loss of accuracy. Recently, various hardware accelerators for EQNNs have been proposed [15]–[17], [29], [30], all of which have been demonstrated to be very energy efficient (beyond 100 TOPS/W). Therefore, EQNNs are promising algorithms to implement in hardware.
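As a concrete illustration of the arithmetic these networks rely on, here is a minimal sketch of a ±1 binary MAC with sign binarization in the style of BinaryConnect/XNOR-Net; the input values are made up for illustration:

```python
import numpy as np

def binarize(x):
    """Binarize to +1/-1 by sign (zero mapped to +1), as in BinaryConnect/XNOR-Net."""
    return np.where(x >= 0, 1, -1)

def binary_mac(activations, weights):
    """One binary multiply-accumulate: elementwise +1/-1 products, then a sum."""
    a = binarize(activations)
    w = binarize(weights)
    return int(np.sum(a * w))

a = np.array([0.3, -1.2, 0.7, -0.1])
w = np.array([-0.5, -0.9, 0.4, 0.2])
# products: (+1)(-1) + (-1)(-1) + (+1)(+1) + (-1)(+1) = -1 + 1 + 1 - 1
print(binary_mac(a, w))  # → 0
```

Because every product is ±1, hardware only needs an XNOR-like cell per product and an accumulator, which is what makes analog binary cells attractive.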
The time-based designs in previous works possessed ingenious computing cells to realize TD computing, yet they still require a large number of transistors, resulting in large area and power overhead. To solve this issue, we propose a dual-edge single-input (DESI) cell topology, which utilizes one inverter to implement TD computing. Based on DESI, a 12T IMC cell that supports binary computing is developed, and dual 2-bit DFF-based TDCs are employed to convert the dual-edge TD MACs to digital. The TD-SRAM macro consists of 128 × 60 IMC cells, and 60 columns can work in parallel, which eliminates the energy consumption and delay of row-by-row memory reading. The fabricated 40-nm TD-SRAM macro prototype reaches a peak energy efficiency of 537 TOPS/W and accuracies of 95.90%-98.00% on the MNIST dataset with different DNN topologies. The contributions of this work are as follows: (1) at the cell level, DESI makes TD cells smaller (12T, 1.2∼4.6× smaller than prior TD cells); (2) at the macro level, this work realizes a 1.4∼5.6× energy-efficiency improvement.

II. TD-SRAM MACRO ANALYSIS AND DESIGN

A. TD Cell Analysis

TD computing converts digital signal-processing results to time, or delays, by TD cells. For algorithms with numerous accumulation operations, such as DNNs, the accumulations can be directly completed by a cascade of TD cells.

Fig. 2. Time-domain computing cells: (a) variable-capacitance load (VC) cell; (b) current-starve (CS) cell; (c) tap-based (TB) cell (parasitic capacitances Cmid and Cout at the output of the inverter are omitted for simplicity). (d) I–V characteristics and variations in a 40-nm CMOS process.

Based on the digital-to-delay mechanism, TD cells can be classified into three categories: variable-capacitance (VC) cells, current-starve (CS) cells, and tap-based (TB) cells, as shown in Fig. 2(a)-(c). The VC cells convert the digital output to a cell delay by tuning the capacitive load according to the corresponding digital code. The CS cells change the cell delay by limiting the pull-up or pull-down current of the inverter. The TB cells consist of several delay stages, and only one tap is connected to the output depending on the computation result. Several examples are summarized as follows.


A time-based integrate-and-fire (IAF) neuromorphic core is presented in [2]. The neuromorphic core consists of 64 digitally controlled oscillator (DCO) circuits, and each DCO has 128 VC cells. The oscillation frequency of each DCO represents the MAC value of its 128 cells, and the DCO output is digitized by the IAF circuit. This core obtains a handwritten-digit recognition accuracy of 91% and consumes 320.4 μW per DCO under a 1.2-V supply.
Sandwich-RAM [7] is designed for binary-weight networks (BWNs) and uses an analog computing engine based on pulse-width modulation (PWM). Two 4-bit PWM units (an MSB unit and an LSB unit) accomplish one MAC operation, and each 4-bit PWM unit is composed of two 2-bit CS cells. To further reduce power consumption, a bypass mechanism is adopted: when bypass is enabled, the LSB chain is bypassed when the feature data is relatively large, and the MSB chain is bypassed when the feature data is relatively small. The prototype chip achieved a peak energy efficiency of 119.7 TOPS/W at 0.6 V.
A one-shot time-based accelerator is presented in [3], employing 3-bit one-hot-encoded weights and 1-bit activations. At the cell level, the TB cell is used, with the output tap realized as a complex tristate gate. In addition, a dynamic threshold error correction (DTEO) technique is adopted to improve the classification accuracy. An energy efficiency of 104.8 TOPS/W is achieved at 0.7 V.
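The first-order dynamic-energy argument that ranks these three cell families can be sketched as E ≈ C·VDD² per switched node. The capacitance values and stage count below are illustrative assumptions for the sketch, not figures from this work:

```python
# First-order dynamic-energy sketch for the three TD cell types: E ≈ C_switched * VDD^2
# per computing edge. All capacitance values are normalized, assumed numbers.
VDD = 0.9          # supply voltage [V]
C_OUT = 1.0        # output-node capacitance, assumed equal for all cell types
C_MID = 1.0        # CS cell: intermediate node has the same components as C_out
C_TUNE = 3.0       # VC cell: tuning capacitance, larger than C_mid for adjustment range
N_TAPS = 4         # TB cell: number of delay stages (energy scales linearly with N)

energy = {
    "CS": (C_MID + C_OUT) * VDD**2,       # two switched nodes with small load
    "VC": (C_TUNE + C_OUT) * VDD**2,      # large tunable load dominates
    "TB": N_TAPS * (2 * C_OUT) * VDD**2,  # every stage switches on each edge
}
best = min(energy, key=energy.get)
print(best)  # → CS
```

Under these assumptions the CS cell comes out lowest, which matches the choice made in the analysis that follows.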
The first-order energy of the TD cells is analyzed as follows. The main part of TD-cell power consumption is dynamic power, as illustrated in Fig. 2. Cout consists of the drain/source capacitance of the output inverter, the metal-wire parasitic capacitance, and the gate capacitance of the next input inverter. We assume that Cout has the same value for the different cells. For CS cells, the components of Cmid are the same as those of Cout. For VC cells, Ctune is larger than Cmid to provide a sufficient adjustment range. The energy consumption of the TB cells has a linear relationship with the number of delay stages N.

According to the above analysis, the CS cells tend to exhibit the best energy efficiency due to their low capacitive load, so CS cells are adopted in our design. Although the CS cells show good energy efficiency, their variation is larger than that of the other two cells, especially when a long delay is needed and the transistors are biased in the sub-threshold region, as shown in Fig. 2(d). Besides, note that each delay cell has two inverters to retain a consistent computing-edge polarity between stages. This is critical because the rising-edge and falling-edge delays are not matched. The two-inverter design solves the edge-mismatch issue, but leads to additional power consumption.

Fig. 3. Comparison of (a) conventional time-domain cell topology and (b) proposed time-domain cell topology (parasitic capacitances at the output of the inverter are omitted for simplicity).

B. TD-SRAM Bitcell and Macro Design

In the conventional TD computing cell design shown in Fig. 3(a), the single-edge single-input (SESI) cell, which consists of a current-starve inverter followed by a plain inverter, is used to retain a consistent delayed-edge polarity between stages. In other words, only a PMOS or an NMOS current-starve device is used in the cell, and the second inverter merely flips back the polarity of the delayed edge. Consequently, only one of the clock edges is utilized. A dual-edge dual-input (DEDI) cell can obtain better efficiency, but there is delay mismatch between odd and even stages due to PMOS/NMOS mismatch. Take the rising edge for example, as shown in Fig. 3(a): the rising edge of the input pulse is delayed by the NMOS transistor in the first stage, and the polarity of the computing pulse is reversed. In the second stage, the rising edge of the input computing pulse (a falling edge at the second stage's input) is delayed by the PMOS. As the computing pulse passes through the whole delay chain, the rising edge is delayed by NMOS and PMOS devices alternately.

To improve the energy efficiency of TD computing and prevent the mismatch issue, the DESI TD computing cell is proposed, as shown in Fig. 3(b). When a computing pulse arrives at a DESI cell, only the falling-edge delay is controlled by the multiplication result; the rising edge has a constant delay independent of the multiplication result. After passing through a DESI cell, the polarity of the computing pulse is reversed. If a positive computing pulse comes in, the delay of the falling edge in each cell of the delay chain can


Fig. 4. TD-SRAM bitcell design and binary operations (DE_x = T_multiplication + T_fix, but only T_multiplication is shown in the figure for simplicity).

be formulated as (1).

$$T_{D,\mathrm{falling\ edge}}(x) = \begin{cases} T_{\mathrm{multiplication}}, & x = 1, 3, 5, \ldots \\ T_{\mathrm{fix}}, & x = 2, 4, 6, \ldots \end{cases} \tag{1}$$

First, the falling edge is delayed (by T_multiplication) and reversed to a rising edge by the multiplication-controlled PMOS. In the next stage, the rising edge is delayed (by T_fix, which is less correlated with the multiplication result; we thus assume its value is maintained) and reversed back to a falling edge by an NMOS. This procedure repeats until the pulse passes through the whole delay chain. Similarly, the delay of the rising edge in each cell can be formulated as (2).

$$T_{D,\mathrm{rising\ edge}}(x) = \begin{cases} T_{\mathrm{fix}}, & x = 1, 3, 5, \ldots \\ T_{\mathrm{multiplication}}, & x = 2, 4, 6, \ldots \end{cases} \tag{2}$$

So the stages of the delay chain can be divided into two groups: odd stages and even stages. All odd-stage cells convert half of the MAC outputs to the delay of the rising edge between the input and output computing pulses, and all even-stage cells convert the rest of the MAC outputs to the delay of the falling edge. After the input pulse passes through all cells, the total multiplication values are accumulated in TD_ODD and TD_EVEN, as described in (3) and (4), respectively.

$$T_{D,\mathrm{odd}} = \sum_{x=1}^{n} T_{D,\mathrm{falling\ edge}}(x) = \sum_{x=1}^{n/2} DE(2x-1) \tag{3}$$

$$T_{D,\mathrm{even}} = \sum_{x=1}^{n} T_{D,\mathrm{rising\ edge}}(x) = \sum_{x=1}^{n/2} DE(2x) \tag{4}$$

Compared to the conventional topology, the proposed topology does not need the two-inverter design to maintain the clock-edge polarity in each stage. As a result, the energy efficiency and the area efficiency of the TD cells are improved.

Combined with a 6T SRAM cell, a 12T binary TD computing cell is developed, as shown in Fig. 4. Because only the two storage nodes are connected to the two gates of the delay cell, the SRAM mode and the IMC mode are isolated, and no read-disturb issues will occur. The weight in the SRAM cell determines whether to open the left charge path or the right one. If the weight is “0”, the right charge path is turned on, and the input falling-edge delay of the cell is controlled by VM. Otherwise, the left path is turned on, and the cell delay is controlled by VIN. VIN takes one of two values (VL and VH) and is controlled by the binary input. VL, VM, and VH are set to ensure that the delays at these three bias voltages follow the rule in formula (5), so that the binary computing operation (activations = “+1/−1”, weights = “0/1”) can be executed.

$$DE_L = DE_M/2 = DE_H/3 \tag{5}$$

Fig. 5(a) illustrates the architecture of the proposed TD-SRAM macro, which can process 60 128-input binary-MAC operations in parallel. In computing, 60 columns are configured as computing columns, 3 columns (REFL, REFM, REFH) generate reference delay levels, and the remaining column is not used in the IMC mode. As shown in Fig. 5(b), the reference cells have the same structure as the computing cells, which can compensate the fixed delays T_fix in the computing cells. The reference cells are biased at different voltages and are reconfigurable by writing the corresponding SRAM bitcell values. Take the REFL cells for example: the bias voltage of a REFL cell is fixed at VL or VM depending on the storage bit in its 6T SRAM cell, so the delay of a REFL cell is DE_L or DE_M (“−1/0” in digital). Like the computing cells, the reference cells are also divided into two groups: odd stages and even stages. The odd stages generate the reference (the falling edge in REFL) for the computing cells of odd stages, while the even stages generate the reference (the rising edge in REFL) for the computing cells of even stages. If the storage bits in the 128 REFL cells are all set to “1”, the reference REFL is −64; if they are all set to “0”, the reference REFL is 0. So the adjustment range of REFL is from −64 to 0. Similarly, the adjustment range of REFM is from −64 to 64, and that of REFH is from 0 to 64.


Fig. 5. (a) Proposed TD-SRAM macro architecture. (b) Reference cell design. (c) Dual TDC design. (d) Timing diagram of the computing mode.

The dual TDC is shown in Fig. 5(c). The TDC_EVEN and the TDC_ODD are identical. A single TDC consists of three DFFs and one multiplexer. The output pulse from the array samples the reference delays by the DFFs. The MSB of the TDC output is decided by REFM, and the LSB is decided by REFL or REFH: if the MSB is “0”, the DFF output of REFL is selected as the LSB by the multiplexer; otherwise, the DFF output of REFH is selected.

When the input data are ready, a computing pulse passes through the 60 128-stage delay chains, and the two edge delays feed into the dual TDCs, which convert the analog delay signals to digital outputs; the timing diagram is shown in Fig. 5(d). The inputs of TDC_ODD (delayed computing pulse, reference pulse) are reversed, because the DFFs in the TDC are positive-edge triggered. To avoid computing errors, half the period of the computing pulse should be larger than the maximum total delay.

C. Behavioral Simulation

As mentioned above, the delay value has variations, which will influence the algorithm performance. To evaluate this influence and the TDC quantization effect, a behavioral model is built. In this model, the statistical parameter σ is the delay variation, and the deterministic parameter is the quantization bit width N. The model uses a 4-layer BNN (2 convolution layers and 2 fully-connected layers) and the MNIST dataset. To achieve an acceptable classification accuracy, the model's first and last layers are operated in floating-point precision. Besides, the weights are transformed from “−1/+1” to “0/1” by formula (6), according to the computation logic mapping table of the proposed cell. Thus, the partial sum (PSUM) also needs a transform, as described in (7), where Acount is the accumulation of all input activations.

$$W_{\mathrm{transformed}} = (1 + W_{\mathrm{original}})/2 \tag{6}$$

$$PSUM_{\mathrm{original}} = A_{\mathrm{count}} + 2 \times PSUM_{\mathrm{transformed}} \tag{7}$$

We assume all MAC operations are processed in the proposed TD macro, and other operations such as activation functions and batch normalizations are processed in the digital domain. The model will generate random values for each delay
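A minimal version of such a behavioral model, with a Gaussian delay variation σ per cell and an N-bit uniform TDC, might look as follows; all parameter values are illustrative assumptions, not the paper's model:

```python
import numpy as np

# Behavioral-model sketch: add Gaussian delay variation (sigma) to each cell's
# contribution, then quantize the MAC result with an N-bit TDC.
rng = np.random.default_rng(0)

def noisy_mac(products, sigma):
    """Each product contributes its nominal delay plus N(0, sigma) variation."""
    return float(np.sum(products + rng.normal(0.0, sigma, size=len(products))))

def tdc_quantize(value, n_bits, full_range):
    """Uniform ('normal') quantization into 2^n_bits levels over [-full_range, +full_range]."""
    levels = 2 ** n_bits
    step = 2 * full_range / levels
    return int(np.clip(np.floor((value + full_range) / step), 0, levels - 1))

products = np.array([1.0, -1.0, 1.0, 1.0])   # ideal MAC = 2
code = tdc_quantize(noisy_mac(products, sigma=0.1), n_bits=2, full_range=4.0)
print(code)
```

Sweeping sigma and n_bits in such a model is what produces accuracy-versus-variation and accuracy-versus-quantization curves of the kind reported in Fig. 6.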


Fig. 6. (a) MNIST accuracy as a function of delay variations. (b) Cell delay variation (biased at VH ) as a function of the transistor gate area. (c) MNIST
accuracy as a function of TDC quantization bits. (d) Illustrations of Normal quantization and LR quantization.

cell according to the given σ, and the PSUM will be quantized with the given N.

Fig. 6(a) plots classification accuracy over a range of delay variations. The accuracy decreases sharply when σ/μ exceeds 50%. We can take this curve as the design guideline for the delay cell. To avoid overly large delay variations and leave enough margin, the transistors in the TD cells are slightly sized up compared to the minimum size. The Monte Carlo simulation result of the cell delay biased at VH is shown in Fig. 6(b). Note that the long delay DE_H has a large σ/μ because the bias voltage is close to the threshold; its impact on the classification accuracy is set as the design point.

The classification accuracy over different quantization bit widths is shown in Fig. 6(c). Two quantization methods, normal and limited-range (LR) [17], are evaluated. With normal quantization, the reference delays are uniformly distributed over the full range, whereas with LR quantization the TDC reference range is small, so the quantization range can cover more data with a limited number of quantization levels, as shown in Fig. 6(d). Trading off TDC area against classification accuracy, and to keep the TDC pitch matched with the memory array, 2-bit LR quantization is finally used in the circuit implementation.

III. MEASUREMENT RESULTS AND ANALYSIS

The prototype chip for the TD-SRAM macro is implemented in a standard 40-nm CMOS process. As shown in Fig. 7(a), the capacity of the macro is 8kb and the core area is 0.0875 mm². The area of the TD-SRAM cell is 4.97 μm², and the cell layout can be further optimized. Due to pitch-match considerations, dual 2-bit quantization is adopted and only 5% of the macro area is occupied by TDCs. As a result, the array area efficiency is almost 70%. 68.9% of the total power is consumed by TD cells, 25% by TDCs, and 8% by other peripherals, as shown in Fig. 7(b). In 100-MHz SRAM mode, read/write failures occur when VDD is below 0.9 V, so we set the minimum operation voltage (MPV) of the prototype chip as 0.9 V. Although the NN can tolerate a small rate of read/write failures [31], [32] and the MPV could probably be lower than 0.9 V, this is not discussed in this article, to simplify the testing process.

The measured power consumption is shown in Fig. 8. To measure the power, we first set all computation cells to “1” and apply the corresponding input patterns. As mentioned in Section II-A, the majority of the TD-cell power consumption is the dynamic power that flips the intermediate nodes of the inverter chains, so the array power remains almost constant when the input/output changes. Furthermore, benefiting from the small capacitive load on the intermediate node, the total average power of the array is as low as 10.73 μW at VDD = 0.9 V. The macro can process 60 columns of 128-input MAC operations in parallel. At VDD = 0.9 V, the computing frequency is 0.5 MHz. We treat one MAC as two operations (1 multiplication and 1 addition), and the energy efficiency is 716 TOPS/W. Due to insufficient access to the internal signals of the prototype chip, the transfer function of the TD-SRAM macro is
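The 716 TOPS/W figure can be sanity-checked directly from the numbers quoted in the text:

```python
# Sanity check of the reported peak efficiency: 60 columns x 128-input MACs per cycle,
# 2 ops per MAC, 0.5 MHz computing frequency, 10.73 uW total array power.
ops_per_cycle = 60 * 128 * 2            # one MAC = 1 multiplication + 1 addition
ops_per_second = ops_per_cycle * 0.5e6  # at 0.5 MHz
tops_per_watt = ops_per_second / 10.73e-6 / 1e12
print(round(tops_per_watt))  # → 716
```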


Fig. 7. (a) 40-nm TD-SRAM prototype chip micrograph and layout of the TD-SRAM cell. (b) Power and area breakdown (CTRL includes the control block and WL decoder); the total power is 10.73 μW and the total area is 0.0875 mm².

Fig. 8. Measured data-independent power and post-layout simulation of the transfer function.

a post-layout simulation result, which shows good linearity in Fig. 8.

The bias voltages are supplied from external voltage references, and the actual set values of (VL, VM, VH) are (0.00 V, 0.325 V, 0.381 V) at VDD = 1.0 V, and (0.00 V, 0.262 V, 0.313 V) at VDD = 0.9 V. To disclose the actual delay condition, three replica test delay chains are placed on chip near the computing macro, as shown in Fig. 9(a). The test delay chains share the same VL, VM, and VH with the TD-SRAM macro, and the cell layout in the test delay chains is almost the same as that of the TD-SRAM cell, in order to minimize the mismatch between the test circuit and the macro. The difference is that the 6T SRAM cells in the test chains are hardwired to “1” and the 6T delay-cell biases are fixed at VL, VM, and VH, respectively. Each test delay chain is composed of two 128-cell chains: one propagates the delay from top to bottom, the other from bottom to top. The two 128-cell chains are placed very close together and connected at the bottom, so the cells in the test chain have the same capacitive load as those in the TD-SRAM macro, thus resulting in the same delay. The input and output then pass through an OR gate; the output of the OR gate is observed, and the increase in pulse width corresponds to the delay.

Fig. 9. (a) Delay test circuit design. (b) Measured delay of four chips with the same bias.

Fig. 9(b) shows the measured delays of four chips under the same biases; a small delay difference between the four chips is visible. Therefore, fine-tuning of the bias voltage for each chip is needed before the IMC-mode test.

A. Calibration and Compensation

Before the IMC test, a one-time calibration and compensation is needed to set the three reference delay lines and compensate the mismatch between computing columns. The references should be appropriately set to minimize the accuracy loss in analog computing. The delay of a reference cell can be changed by adjusting the value of the corresponding SRAM cell; thus, the calibrated value can also be stored. In the calibration mode, all computing columns are set to the same value. First, the computing-column cells are all set to “0” and the REFM column is set to half “0” and half “1”. Then a computing pulse is fed in, and the output of each computing column is compared with REFM. The comparison is


Fig. 10. Measured TDC output of different columns: (a) before compensation and calibration; (b) after compensation and calibration.

Fig. 11. Comparison of TDC trip-point variations: (a) before compensation and calibration; (b) after compensation and calibration.

TABLE I
COMPARISON OF DIFFERENT NEURAL NETWORKS

enabled by the TDCs. The value in REFM is changed until the outputs of the 60 TDCs are almost 50% “0” and 50% “1”. REFL and REFH are calibrated using a similar process.
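The reference search described above can be sketched in software as follows. The integer column-delay model and the exhaustive search are illustrative stand-ins for the hardware procedure, which adjusts stored bits and compares via the TDCs:

```python
# Sketch of the one-time reference calibration: sweep the reference setting (the number
# of "1" bits stored in the reference column sets its delay) until the per-column
# comparisons split as close to 50/50 as possible.
def calibrate_ref(column_delays, ref_min=-64, ref_max=64):
    """Return the reference setting whose delay splits the columns closest to 50/50."""
    best_ref, best_gap = ref_min, len(column_delays)
    for ref in range(ref_min, ref_max + 1):
        ones = sum(1 for d in column_delays if d > ref)  # one TDC comparison per column
        gap = abs(ones - len(column_delays) // 2)
        if gap < best_gap:
            best_ref, best_gap = ref, gap
    return best_ref

# 60 hypothetical column delays in arbitrary units
delays = list(range(60))
print(calibrate_ref(delays))  # → 29
```

Because the calibrated setting lives in SRAM bits, it survives as long as the reference column is not rewritten, which is why a one-time calibration suffices.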
32 rows of the macro are used for compensation, and the inputs of these rows are fixed at “1”, so the delay of a compensation cell is either DE_M or DE_H depending on the weight in its SRAM cell. Note that odd and even computation rows need to be compensated separately due to the DESI characteristics; as a result, 16 compensation rows serve the even cells and the other 16 serve the odd cells.

The compensation is completed according to the following steps. First, we set all weights in the computation rows to “1” so that the delay depends only on the input pattern. In the compensation rows, the weights in half of the rows are set to “1” and the others to “0” as the initial compensation data. Next, we run an input-pattern sweep (from all “0” to all “1”) and record the corresponding TDC outputs, which gives the transfer-function characteristic (with TDC) of each column. From these transfer curves, we find the input trip point at which the 2-bit TDC output of each column flips from “10” to “01”. We then calculate the mean and standard deviation of the 60 trip points. For each column, if the trip point exceeds the mean value, we reduce


TABLE II
COMPARISON TABLE FOR TIME-DOMAIN COMPUTING

“0” or all “1”) corresponds to a 2σ deviation. That means that even if the deviation of some columns exceeds 2σ, the compensation code of those columns saturates at the 2σ value.

The measured transfer functions (with TDC) of the 60 columns are shown in Fig. 10. Before compensation, the curves are widely scattered due to large column-to-column variations; after compensation, the distribution is narrow. Fig. 11 shows the variations of the TDC trip point before and after compensation for four chips. Although dedicating a portion of the macro to compensation leads to an energy-efficiency reduction (from 716 TOPS/W to 537 TOPS/W), it makes the variability of the macro smaller.
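The trip-point compensation above (a linear mapping of each column's deviation to a code, saturating at 2σ) can be sketched as follows; the trip-point values are hypothetical:

```python
import statistics

# Sketch of the per-column compensation: map each column's trip-point deviation from
# the mean to a compensation code linearly, clipping beyond +/-2 sigma (the maximum
# code, i.e. all-"0" or all-"1" in the 16 compensation rows). A positive code here
# stands for removing that many "1"s from the compensation rows, and vice versa.
def compensation_codes(trip_points, max_code=16):
    mu = statistics.mean(trip_points)
    sigma = statistics.pstdev(trip_points)   # assumed nonzero for real measurements
    codes = []
    for tp in trip_points:
        dev = max(-2 * sigma, min(2 * sigma, tp - mu))   # saturate at 2-sigma
        codes.append(round(dev / (2 * sigma) * max_code))  # linear map to code range
    return codes

trips = [48, 50, 52, 50, 60, 40]   # hypothetical trip points for six columns
print(compensation_codes(trips))   # → [-3, 0, 3, 0, 14, -14]
```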

B. DNN Performance
The DNN accuracy test is performed with an FPGA board and a PC, as shown in Fig. 12. The first-layer activations of all test pictures are pre-computed by the PC in MATLAB and Python, and then loaded to the FPGA together with the second-layer weights, which are written into the test chip in SRAM mode. Note that the weights are transformed from “−1/+1” to “0/1” in software. Next, input activations followed by a computing pulse are given to the chip frame by frame, the TD-SRAM cells perform computing according to the computation logic mapping table in Fig. 4, and the output activations are collected by the FPGA and sent off-chip to the PC through the serial port. Then the rest of the layers and the classification accuracy are computed on the PC.

Fig. 12. Hardware and software block diagram of the test environment.

Owing to the throughput limitation, we evaluated the DNN accuracy on the MNIST dataset using a small MLP with three hidden layers (784-192-60-60-10). The second-layer weights are divided into two parts, which are put into the macro and computed sequentially. The PSUMs of the two parts are collected and accumulated off chip to generate the final output of the second layer. The rest of the layers are then processed in digital simulations. Four chips achieve an average accuracy of 95.90%. To improve the accuracy, we test a wider MLP (784-288-120-60-10) and a CNN, as detailed in Table I. The test process of MLP2 is the same as that of MLP1 but needs more time. For the CNN test, the methodology in [15] is used,


wherein the hardware features of each chip are extracted for
the simulation. The simulation program randomly repeats
10 times, and the mean value of the accuracies is reported.
The accuracies of the four chips with the three DNN topologies
are shown in Fig. 13. Among the four chips, chip 2 and chip 4
achieve relatively high accuracy, ascribed to their small
variability after compensation.

Fig. 13. Accuracy characterization for MNIST.

C. Comparison and Discussion

A comparison with other TD computing designs is summarized
in Table II. Benefiting from the DESI cell design and
the BNN algorithm, the proposed TD bitcell contains only
12 transistors, the smallest count among all the TD-based
designs. The improved cell-area efficiency leads to lower
power consumption owing to the smaller load. As a result,
our macro achieves the best energy efficiency among TD
computing designs. Although all TD computing designs show
good energy efficiency, their accuracy on DNN tasks is somewhat
lower, because the computing results are sensitive to device
mismatches, power-supply noise, and bias-voltage noise.
Ref. [33] analyzed the noise resilience of different DNNs
and hypothesized that larger layers with more weights have
more redundancy, which makes them more robust to noise.
Our accuracy measurements of three different DNNs (Table I)
are consistent with this trend, indicating a design tradeoff
between accuracy and operations/energy, as shown in Fig. 14.

Fig. 14. Test accuracy and operations of different neural networks.

IV. CONCLUSION

In this article, we present an 8kb TD-computing-based
IMC macro titled “TD-SRAM”, which is optimized for BNNs.
At the cell level, we proposed a DESI scheme that uses a
single inverter to complete the TD computing. Compared with
conventional two-inverter designs, our DESI scheme converts
the MAC result to the delays of both edges of the computing
pulse. DESI further improves the energy efficiency of TD
computing while avoiding the mismatch between rising-edge
and falling-edge propagation. Based on DESI, a 12T IMC
cell supporting binary operations is developed; its transistor
count is 1.2∼4.6× smaller than that of prior TD cells. Dual
2-bit DFF-based TDCs are employed to convert the TD MAC
results to digital. Our 40-nm prototype achieves an energy
efficiency of 537 TOPS/W for binary operations, a 1.4∼5.6×
improvement over prior TD computing designs. To deal with
the variability in the macro, 32 rows are used for compensation.
After compensation, test accuracies of 95.90%-98.00% on the
MNIST dataset are achieved with different DNN topologies.

ACKNOWLEDGMENT

The authors would like to thank Hongfei Ye, Ming He,
Jian Cao, Kaixuan Du, and Zhixuan Wang for the valuable
discussion and help.

REFERENCES

[1] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, “A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing,” IEEE J. Solid-State Circuits, vol. 52, no. 10, pp. 2679–2689, Oct. 2017.
[2] M. Liu, L. R. Everson, and C. H. Kim, “A scalable time-based integrate-and-fire neuromorphic core with brain-inspired leak and local lateral inhibition capabilities,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Austin, TX, USA, Apr. 2017, pp. 1–4.
[3] L. R. Everson, M. Liu, N. Pande, and C. H. Kim, “An energy-efficient one-shot time-based neural network accelerator employing dynamic threshold error correction in 65 nm,” IEEE J. Solid-State Circuits, vol. 54, no. 10, pp. 2777–2785, Oct. 2019.
[4] S. Gopal et al., “A spatial multi-bit Sub-1-V time-domain matrix multiplier interface for approximate computing in 65-nm CMOS,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 506–518, Sep. 2018.
[5] Z. Chen and J. Gu, “A time-domain computing accelerated image recognition processor with efficient time encoding and non-linear logic operation,” IEEE J. Solid-State Circuits, vol. 54, no. 11, pp. 3226–3237, Nov. 2019.
[6] Z. Chen and J. Gu, “High-throughput dynamic time warping accelerator for time-series classification with pipelined mixed-signal time-domain computing,” IEEE J. Solid-State Circuits, vol. 56, no. 2, pp. 624–635, Feb. 2021.
[7] J. Yang et al., “Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2019, pp. 394–396.


[8] Z. Chen, S. Fu, Q. Cao, and J. Gu, “A mixed-signal time-domain generative adversarial network accelerator with efficient subthreshold time multiplier and mixed-signal on-chip training for low power edge devices,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2020, pp. 1–2.
[9] J. Yue et al., “A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 238–240.
[10] H. Jia et al., “A programmable neural-network inference accelerator based on scalable in-memory computing,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 236–238.
[11] J.-W. Su et al., “A 28 nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 250–252.
[12] Y.-D. Chih et al., “An 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 252–254.
[13] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6T SRAM array,” IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[14] A. Biswas and A. P. Chandrakasan, “CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[15] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, “XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks,” IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[16] J. Kim et al., “Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array,” in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C118–C119.
[17] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, “C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism,” IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[18] X. Si et al., “A dual-split 6T SRAM-based computing-in-memory unit-macro with fully parallel product-sum operation for binarized DNN edge processors,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 11, pp. 4172–4185, Nov. 2019.
[19] W.-H. Chen et al., “A 65 nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 494–496.
[20] M. Kang, Y. Kim, A. D. Patil, and N. R. Shanbhag, “Deep in-memory architectures for machine learning–accuracy versus efficiency trade-offs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 5, pp. 1627–1639, May 2020.
[21] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “A variation-tolerant in-memory machine learning classifier via on-chip training,” IEEE J. Solid-State Circuits, vol. 53, no. 11, pp. 3163–3173, Nov. 2018.
[22] X. Peng, R. Liu, and S. Yu, “Optimizing weight mapping and data flow for convolutional neural networks on processing-in-memory architectures,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 4, pp. 1333–1343, Apr. 2020.
[23] M. F. Ali, A. Jaiswal, and K. Roy, “In-memory low-cost bit-serial addition using commodity DRAM technology,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 1, pp. 155–165, Jan. 2020.
[24] M. Ali, A. Jaiswal, S. Kodge, A. Agrawal, I. Chakraborty, and K. Roy, “IMAC: In-memory multi-bit multiplication and ACcumulation in 6T SRAM array,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2521–2531, Aug. 2020.
[25] Z. Liu et al., “NS-CIM: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 2909–2922, Sep. 2020.
[26] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 3123–3131.
[27] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 4107–4115.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[29] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, “An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
[30] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), San Diego, CA, USA, Apr. 2018, pp. 1–4.
[31] X. Sun et al., “Low-VDD operation of SRAM synaptic array for implementing ternary neural network,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2962–2965, Oct. 2017.
[32] L. Yang, D. Bankman, B. Moons, M. Verhelst, and B. Murmann, “Bit error tolerance of a CIFAR-10 binarized convolutional neural network processor,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Florence, Italy, May 2018, pp. 1–5.
[33] T.-J. Yang and V. Sze, “Design considerations for efficient deep neural networks on processing-in-memory accelerators,” in IEDM Tech. Dig., San Francisco, CA, USA, Dec. 2019, p. 22.
