TD-SRAM: Time-Domain-Based In-Memory Computing Macro for Binary Neural Networks
Fig. 4. TD-SRAM bitcell design and binary operations (DE_x = T_multiplication + T_fix; only T_multiplication is shown in the figure for simplicity).

be formulated as (1).

$$T_{D,\text{falling edge}}(x) = \begin{cases} T_{\text{multiplication}}, & x = 1, 3, 5, \ldots \\ T_{\text{fix}}, & x = 2, 4, 6, \ldots \end{cases} \tag{1}$$

First, the falling edge is delayed (by T_multiplication) and inverted into a rising edge by the multiplication-controlled PMOS. In the next stage, the rising edge is delayed (by T_fix, which is largely uncorrelated with the multiplication result and is therefore assumed constant) and inverted back into a falling edge by an NMOS. This procedure repeats until the pulse passes through the whole delay chain. Similarly, the delay of the rising edge in each cell can be formulated as (2).

$$T_{D,\text{rising edge}}(x) = \begin{cases} T_{\text{fix}}, & x = 1, 3, 5, \ldots \\ T_{\text{multiplication}}, & x = 2, 4, 6, \ldots \end{cases} \tag{2}$$

So the stages of the delay chain can be divided into two groups: odd stages and even stages. All odd-stage cells convert half of the MAC outputs into the delay of the rising edge between the input and output computing pulses, and all even-stage cells convert the rest of the MAC outputs into the delay of the falling edge. After the input pulse passes through all the cells, the total multiplication values are accumulated in TD_odd and TD_even, as described in (3) and (4), respectively.

$$TD_{\text{odd}} = \sum_{x=1}^{n} T_{D,\text{falling edge}}(x) = \sum_{x=1}^{n/2} DE(2x-1) \tag{3}$$

$$TD_{\text{even}} = \sum_{x=1}^{n} T_{D,\text{rising edge}}(x) = \sum_{x=1}^{n/2} DE(2x) \tag{4}$$
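To make the dual-edge accumulation of (1)-(4) concrete, the following is a minimal behavioral sketch; the function name, chain length, and delay values are illustrative assumptions, not values from the design.

```python
# Behavioral sketch of the DESI delay chain in (1)-(4). Delay values are
# illustrative; t_mult[x-1] stands for stage x's multiplication delay.

def chain_delays(t_mult, t_fix):
    """Accumulate falling- and rising-edge delays over an n-stage chain.

    Per (1)/(2): at odd stages the falling edge picks up T_multiplication
    and the rising edge picks up T_fix; even stages swap the two roles.
    """
    td_odd = 0.0    # total falling-edge delay, TD_odd in (3)
    td_even = 0.0   # total rising-edge delay, TD_even in (4)
    for x, tm in enumerate(t_mult, start=1):
        if x % 2 == 1:          # odd stage
            td_odd += tm
            td_even += t_fix
        else:                   # even stage
            td_odd += t_fix
            td_even += tm
    return td_odd, td_even

# Example: an 8-stage chain with arbitrary per-stage multiplication delays.
print(chain_delays([1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 2.0, 3.0], t_fix=0.5))
```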
Compared to the conventional topology, the proposed topology does not need a two-inverter design to maintain the clock edge polarity in each stage. As a result, the energy efficiency and the area efficiency of the TD cells are improved.

Combined with a 6T SRAM cell, a 12T binary TD computing cell is developed, as shown in Fig. 4. Because only the two storage nodes are connected to the two gates of the delay cell, the SRAM mode and the IMC mode are isolated, and no read-disturb issues occur. The weight in the SRAM determines whether the left charge path or the right one opens. If the weight is "0", the right charge path is turned on, and the falling-edge delay of the cell is controlled by VM. Otherwise, the left path is turned on, and the cell delay is controlled by VIN. VIN takes one of two values (VL and VH) selected by the binary input. VL, VM, and VH are set so that the delays at these three bias voltages follow the relationship in formula (5).

$$DE_L = DE_M/2 = DE_H/3 \tag{5}$$

With this, the binary computing operation (activations = "+1/−1", weights = "0/1") can be executed.

Fig. 5(a) illustrates the architecture of the proposed TD-SRAM macro, which can process 60 128-input binary-MAC operations in parallel. In computing mode, 60 columns are configured as computing columns, 3 columns (REFL, REFM, REFH) generate the reference delay levels, and the remaining column is not used in the IMC mode. As shown in Fig. 5(b), the reference cells have the same structure as the computing cells, which compensates the fixed delays T_fix in the computing cells. The reference cells are biased at different voltages and are reconfigurable by writing the corresponding SRAM bit-cell values. Take the REFL cells as an example: the bias voltage of a REFL cell is fixed at VL or VM depending on the storage bit in its 6T SRAM cell, so the delay of a REFL cell is DE_L or DE_M ("−1/0" in digital). Like the computing cells, the reference cells are also divided into two groups, odd stages and even stages. The odd stages generate the reference (the falling edge in REFL) for the computing cells of odd stages, while the even stages generate the reference (the rising edge in REFL) for the computing cells of even stages. If the storage bits in the 128 REFL cells are all set to "1", the REFL reference is −64; if they are all set to "0", the REFL reference is 0. So the adjustment range of REFL is from −64 to 0. Similarly, the adjustment range of REFM is from −64 to 64 and the adjustment range of REFH is from 0 to 64.
Fig. 5. (a) Proposed TD-SRAM macro architecture. (b) Reference cell design. (c) Dual TDC design. (d) Timing diagram of the computing mode.
The dual TDC is shown in Fig. 5(c). The TDCEVEN and the TDCODD are identical. A single TDC consists of three DFFs and one multiplexer. The output pulse from the array samples the reference delays through the DFFs. The MSB of the TDC output is decided by REFM, and the LSB is decided by REFL or REFH: if the MSB is "0", the DFF output of REFL is selected as the LSB by the multiplexer; otherwise, the DFF output of REFH is selected.
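The selection logic can be summarized in a few lines. The boolean flags below stand in for the DFF samples; this is our logical reading of Fig. 5(c), not a transistor-level description.

```python
# Logical sketch of one 2-bit TDC: three DFFs sample whether the array's
# output edge arrives after each reference edge; a multiplexer picks the
# LSB source based on the MSB, as described above.

def tdc_output(after_refl, after_refm, after_refh):
    """Each flag is True if the computing edge arrives after that
    reference edge (i.e., the corresponding DFF sampled '1')."""
    msb = int(after_refm)                              # MSB decided by REFM
    lsb = int(after_refh) if msb else int(after_refl)  # mux selects LSB
    return (msb << 1) | lsb

# With monotonically increasing delay (REFL < REFM < REFH), the four
# possible sample patterns map to the four output codes 0..3.
for flags in [(False, False, False), (True, False, False),
              (True, True, False), (True, True, True)]:
    print(flags, "->", tdc_output(*flags))
```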
When the input data are ready, a computing pulse passes through the 60 128-stage delay chains, and the two edge delays feed into the dual TDCs, which convert the analog delay signal into digital outputs; the timing diagram is shown in Fig. 5(d). The inputs of TDCODD (the delayed computing pulse and the reference pulse) are inverted because the DFFs in the TDC are positive-edge triggered. To avoid computing errors, half the period of the computing pulse should be larger than the maximum total delay.

C. Behavioral Simulation

As mentioned above, the delay values have variations, which influence the algorithm performance. To evaluate this influence and the TDC quantization effect, a behavioral model is built. In this model, the statistical parameter σ is the delay variation and the deterministic parameter is the quantization bit width N. The model uses a 4-layer BNN (2 convolutional layers and 2 fully-connected layers) and the MNIST dataset. To achieve an acceptable classification accuracy, the model's first and last layers operate in floating-point precision. Besides, the weights are transformed from "−1/+1" to "0/1" by formula (6), according to the computation logic mapping table of the proposed cell. Thus, the partial sum (PSUM) also needs a transform, as described in (7), where A_count is the accumulation of all the input activations.

$$W_{\text{transformed}} = (1 + W_{\text{original}})/2 \tag{6}$$

$$PSUM_{\text{original}} = A_{\text{count}} + 2 \times PSUM_{\text{transformed}} \tag{7}$$

We assume all MAC operations are processed in the proposed TD macro, while other operations such as activation functions and batch normalization are processed in the digital domain. The model generates a random value for each delay cell according to the given σ, and the PSUM is quantized with the given N.
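As a minimal sketch of the model's weight and PSUM transforms, the snippet below implements (6) and (7) exactly as the formulas are printed; the vector sizes and random data are illustrative assumptions.

```python
import numpy as np

# Weight transform (6) and PSUM transform (7), as printed above.

def transform_weights(w_original):
    """Map weights from {-1, +1} to {0, 1} per formula (6)."""
    return (1 + w_original) // 2

def original_psum(psum_transformed, a_count):
    """Formula (7): a_count is the accumulation of all input activations."""
    return a_count + 2 * psum_transformed

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=128)          # binary input activations
w = rng.choice([-1, 1], size=128)          # binary weights (original domain)
psum_t = int(a @ transform_weights(w))     # what the macro accumulates
print(original_psum(psum_t, int(a.sum())))
```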
Fig. 6. (a) MNIST accuracy as a function of delay variations. (b) Cell delay variation (biased at VH) as a function of the transistor gate area. (c) MNIST accuracy as a function of TDC quantization bits. (d) Illustrations of normal quantization and LR quantization.
Fig. 6(a) plots the classification accuracy over a range of delay variations. The accuracy drops sharply when σ/μ exceeds 50%; this curve can be taken as the design guideline for the delay cell. To avoid excessive delay variation and leave enough margin, the transistors in the TD cells are slightly sized up compared to the minimum size. The Monte Carlo simulation result of the cell delay biased at VH is shown in Fig. 6(b). Note that the long delay DE_H has a large σ/μ because its bias voltage is close to the threshold voltage; its impact on the classification accuracy is therefore set as the design point.
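To illustrate how σ/μ enters the model, here is a small Monte Carlo sketch; the chain length, delay levels, and error metric are our illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Monte Carlo sketch: perturb each cell delay by a Gaussian with relative
# spread sigma/mu and observe the error of the delay-encoded PSUM.

rng = np.random.default_rng(1)
DE_L, DE_M, DE_H = 1.0, 2.0, 3.0             # delay levels per formula (5)

def mean_psum_error(delays, sigma_over_mu, trials=1000):
    """Mean absolute error (in units of DE_L) of the encoded PSUM."""
    n = len(delays)
    ideal = delays.sum() - n * DE_M           # ideal PSUM in delay units
    noise = 1 + sigma_over_mu * rng.standard_normal((trials, n))
    measured = (delays * noise).sum(axis=1) - n * DE_M
    return np.mean(np.abs(measured - ideal)) / DE_L

delays = rng.choice([DE_L, DE_M, DE_H], size=64)   # one 64-cell edge
for s in (0.1, 0.3, 0.5):
    print(f"sigma/mu = {s}: error = {mean_psum_error(delays, s):.2f}")
```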
The classification accuracy over different quantization bit widths is shown in Fig. 6(c). Two quantization methods, normal and limited-range (LR) [17], are evaluated. With normal quantization, the reference delays are uniformly distributed over the full range, while with LR quantization the TDC reference range is small, so the quantization range covers more of the data with a limited number of quantization levels, as shown in Fig. 6(d). Trading off TDC area against classification accuracy, and to keep the TDC pitch-matched with the memory array, 2-bit LR quantization is finally used in the circuit implementation.
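The difference between the two schemes can be sketched numerically. The PSUM distribution and the LR reference window below are illustrative assumptions, not the chip's calibrated values.

```python
import numpy as np

# Normal vs. limited-range (LR) quantization for a 2-bit TDC: three
# reference levels split the PSUM axis into four codes.

def quantize(psums, refs):
    """2-bit code per PSUM: the number of references it exceeds."""
    return (psums[:, None] > np.asarray(refs)[None, :]).sum(axis=1)

normal_refs = (-32, 0, 32)      # uniform over the full +/-64 range
lr_refs = (-8, 0, 8)            # narrow window where most PSUMs fall

rng = np.random.default_rng(2)
psums = rng.normal(0, 8, size=10_000)        # assumed PSUM distribution
for name, refs in (("normal", normal_refs), ("LR", lr_refs)):
    hist = np.bincount(quantize(psums, refs), minlength=4) / len(psums)
    print(name, np.round(hist, 2))           # LR uses the codes more evenly
```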
III. MEASUREMENT RESULTS AND ANALYSIS

The prototype chip of the TD-SRAM macro is implemented in a standard 40-nm CMOS process. As shown in Fig. 7(a), the capacity of the macro is 8kb and the core area is 0.0875 mm². The area of the TD-SRAM cell is 4.97 μm², and the cell layout can be further optimized. Due to pitch-matching considerations, the dual 2-bit quantization is adopted, and only 5% of the macro area is occupied by the TDCs; as a result, the array area efficiency is almost 70%. 68.9% of the total power is consumed by the TD cells, 25% by the TDCs, and 8% by other peripherals, as shown in Fig. 7(b). In 100 MHz SRAM mode, read/write failures occur when VDD is below 0.9 V, so we set the minimum operation voltage (MPV) of the prototype chip to 0.9 V. Although the NN can tolerate a small rate of read/write failures [31], [32] and the MPV could probably be lower than 0.9 V, this is not explored in this article, to simplify the testing process.

The measured power consumption is shown in Fig. 8. To measure the power, we first set all computation cells to "1" and apply the corresponding input patterns. As mentioned in Section II-A, the majority of the TD-cell power consumption is the dynamic power that flips the intermediate nodes of the inverter chains, so the array power stays almost constant as the inputs/outputs change. Furthermore, benefiting from the small capacitive load at the intermediate node, the total average power of the array is as low as 10.73 μW at VDD = 0.9 V. The macro can process 60 columns of 128-input MAC operations in parallel. At VDD = 0.9 V, the computing frequency is 0.5 MHz. Treating one MAC as two operations (1 multiplication and 1 addition), the energy efficiency is 716 TOPS/W.
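The quoted figure follows directly from these measurements; a quick arithmetic check using only the numbers stated above:

```python
# Sanity check of the 716 TOPS/W figure from the quoted measurements,
# counting one MAC as two operations (1 multiply + 1 add).

cols, inputs = 60, 128                     # parallel columns, inputs each
ops_per_cycle = cols * inputs * 2          # 15,360 operations per pulse
freq_hz = 0.5e6                            # computing frequency at 0.9 V
power_w = 10.73e-6                         # measured average array power

tops_per_w = ops_per_cycle * freq_hz / power_w / 1e12
print(f"{tops_per_w:.0f} TOPS/W")          # -> 716
```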
Fig. 9. (a) Delay test circuit design. (b) Measured delay of four chips with the same bias.
Fig. 10. Measured TDC output of different columns: (a) before compensation and calibration; (b) after compensation and calibration.

Fig. 11. Comparison of TDC trip point variations: (a) before compensation and calibration; (b) after compensation and calibration.
TABLE I: COMPARISON OF DIFFERENT NEURAL NETWORKS

Due to the insufficient access to the internal signals of the prototype chip, the transfer function of the TD-SRAM macro is obtained through the TDCs. We change the value in REFM until the outputs of the 60 TDCs are almost 50% "0" and 50% "1". REFL and REFH are calibrated using a similar process.

32 rows of the macro are used for compensation, and the inputs of these rows are fixed at "1", so the delay of a compensation cell is either DE_M or DE_H depending on the weight in its SRAM cell. Note that the odd computation rows and the even computation rows need to be compensated separately, due to the DESI characteristics; as a result, 16 compensation rows serve the even cells and the other 16 serve the odd cells.

The compensation is completed according to the following steps. First, we set all weights in the computation rows to "1" so that the delay depends only on the input pattern. In the compensation rows, the weights in half of the rows are set to "1" and the other half to "0" as the initial compensation data. Next, we run an input-pattern sweep (from all "0" to all "1") and record the corresponding TDC outputs, which gives the transfer-function characteristic (with TDC) of each column. From these transfer curves, we find each column's input trip point, where the 2-bit TDC output flips from "10" to "01". Then we calculate the mean and standard deviation of the 60 trip points. For each column, if the trip point exceeds the mean value, we reduce the number of "1"s in its compensation rows; if the trip point is lower than the mean value, we increase the number of "1"s. The deviation of the trip point is mapped to the compensation code through a linear transformation.
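The per-column adjustment can be sketched as follows; the linear-mapping gain, the trip-point values, and the clipping are our illustrative assumptions.

```python
import numpy as np

# Sketch of the trip-point compensation: columns whose trip point is above
# the mean get fewer '1' weights in their compensation rows, columns below
# the mean get more, via a linear deviation-to-code mapping.

def compensation_codes(trip_points, n_comp_rows=16, gain=1.0):
    """Number of '1' weights to program into each column's compensation
    rows (starting from the half-'1' initial code)."""
    t = np.asarray(trip_points, dtype=float)
    deviation = t - t.mean()
    codes = n_comp_rows // 2 - np.rint(gain * deviation).astype(int)
    return np.clip(codes, 0, n_comp_rows)

trips = [61, 64, 66, 63, 70, 58]           # illustrative column trip points
print(compensation_codes(trips))           # above-mean columns get fewer '1's
```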
To utilize the compensation rows more efficiently, we set the maximum compensation code (all
TABLE II: COMPARISON TABLE FOR TIME-DOMAIN COMPUTING
B. DNN Performance

The DNN accuracy test is performed with an FPGA board and a PC, as shown in Fig. 12. The first-layer activations of all test pictures are pre-computed on the PC in MATLAB and Python and then loaded to the FPGA together with the second-layer weights, which are written into the test chip in SRAM mode. Note that the weights are transformed from "−1/+1" to "0/1" in software. Next, the input activations, followed by a computing pulse, are given to the chip frame by frame; the TD-SRAM cells perform the computation according to the computation logic mapping table in Fig. 4, and the output activations are collected by the FPGA and sent to the PC in succession through the serial port. Then the remaining layers and the classification accuracy are computed on the PC.

Fig. 12. Hardware and software block diagram of the test environment.

Owing to the throughput limitation, we evaluated the DNN accuracy on the MNIST dataset using a small MLP with three hidden layers (784-192-60-60-10). The second-layer weights are divided into two parts, which are put into the macro and computed sequentially. The PSUMs of the two parts are collected and accumulated off chip to generate the final output of the second layer; the remaining layers are then processed in digital simulations. Four chips achieve an average accuracy of 95.90%.
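Since the 192-input second layer exceeds the macro's 128 rows, the split-and-accumulate step works as below; the matmul stand-in for the macro and the random data are our assumptions.

```python
import numpy as np

# Sketch of the split second-layer computation: weights are divided into
# two parts, each part is computed in the (<=128-input, 60-column) macro,
# and the PSUMs are accumulated off chip.

rng = np.random.default_rng(3)
a = rng.choice([-1, 1], size=192)           # first-layer activations
W = rng.choice([-1, 1], size=(192, 60))     # second-layer weights

def macro_mac(a_part, w_part):
    """Stand-in for one TD-SRAM pass."""
    assert a_part.shape[0] <= 128            # macro row limit
    return a_part @ w_part

psum = macro_mac(a[:128], W[:128]) + macro_mac(a[128:], W[128:])
assert np.array_equal(psum, a @ W)           # off-chip accumulation is exact
print(psum[:5])
```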
To improve the accuracy, we test a wider MLP (784-288-120-60-10) and a CNN, as detailed in Table I. The test process of MLP2 is the same as that of MLP1 but needs more time. For the CNN test, the methodology in [15] is used,
wherein the hardware features of each chip are extracted for the simulation. The simulation program is randomly repeated 10 times, and the mean of the accuracies is reported. The accuracies of the four chips with the three DNN topologies are shown in Fig. 13. Among the four chips, chip 2 and chip 4 achieve relatively high accuracy, ascribed to their small residual variability after compensation.

Fig. 13. Accuracy characterization for MNIST.

Fig. 14. Test accuracy and operations of different neural networks.

C. Comparison and Discussion

A comparison with other TD computing designs is summarized in Table II. Benefiting from the DESI cell design and the BNN algorithm, the proposed TD bitcell contains only 12 transistors, the smallest count among the TD-based designs. The improved cell-area efficiency leads to lower power consumption due to the smaller load. As a result, our macro achieves the best energy efficiency among TD computing designs. Although all TD computing designs show good energy efficiency, their accuracy on DNN tasks is somewhat lower, because the computing results are sensitive to device variations.

IV. CONCLUSION

In this article, we present an 8kb TD-computing-based IMC macro, "TD-SRAM", optimized for BNNs. At the cell level, we propose a DESI scheme that utilizes a single inverter to complete the TD computing. Compared with conventional two-inverter designs, our DESI scheme converts the MAC result into the delays of both edges of the computing pulse. DESI further improves the energy efficiency of TD computing while avoiding the rising-edge/falling-edge propagation mismatch. Based on DESI, a 12T IMC cell supporting binary operations is developed; its transistor count is 1.2∼4.6× smaller than that of prior TD cells. Dual 2-bit DFF-based TDCs are employed to convert the TD MAC results to digital. Our 40-nm prototype achieves an energy efficiency of 537 TOPS/W for binary operations, a 1.4∼5.6× improvement over prior TD computing designs. To deal with the variability in the macro, 32 rows are used for compensation. After compensation, a test accuracy of 95.90%-98.00% is achieved on the MNIST dataset with different DNN topologies.

ACKNOWLEDGMENT

The authors would like to thank Hongfei Ye, Ming He, Jian Cao, Kaixuan Du, and Zhixuan Wang for the valuable discussion and help.

REFERENCES

[1] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, "A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing," IEEE J. Solid-State Circuits, vol. 52, no. 10, pp. 2679–2689, Oct. 2017.
[2] M. Liu, L. R. Everson, and C. H. Kim, "A scalable time-based integrate-and-fire neuromorphic core with brain-inspired leak and local lateral inhibition capabilities," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Austin, TX, USA, Apr. 2017, pp. 1–4.
[3] L. R. Everson, M. Liu, N. Pande, and C. H. Kim, "An energy-efficient one-shot time-based neural network accelerator employing dynamic threshold error correction in 65 nm," IEEE J. Solid-State Circuits, vol. 54, no. 10, pp. 2777–2785, Oct. 2019.
[4] S. Gopal et al., "A spatial multi-bit sub-1-V time-domain matrix multiplier interface for approximate computing in 65-nm CMOS," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 506–518, Sep. 2018.
[5] Z. Chen and J. Gu, "A time-domain computing accelerated image recognition processor with efficient time encoding and non-linear logic operation," IEEE J. Solid-State Circuits, vol. 54, no. 11, pp. 3226–3237, Nov. 2019.
[6] Z. Chen and J. Gu, "High-throughput dynamic time warping accelerator for time-series classification with pipelined mixed-signal time-domain computing," IEEE J. Solid-State Circuits, vol. 56, no. 2, pp. 624–635, Feb. 2021.
[7] J. Yang et al., "Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2019, pp. 394–396.
[8] Z. Chen, S. Fu, Q. Cao, and J. Gu, "A mixed-signal time-domain generative adversarial network accelerator with efficient subthreshold time multiplier and mixed-signal on-chip training for low power edge devices," in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2020, pp. 1–2.
[9] J. Yue et al., "A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 238–240.
[10] H. Jia et al., "A programmable neural-network inference accelerator based on scalable in-memory computing," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 236–238.
[11] J.-W. Su et al., "A 28 nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 250–252.
[12] Y.-D. Chih et al., "An 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2021, pp. 252–254.
[13] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[14] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[15] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[16] J. Kim et al., "Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C118–C119.
[17] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[18] X. Si et al., "A dual-split 6T SRAM-based computing-in-memory unit-macro with fully parallel product-sum operation for binarized DNN edge processors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 11, pp. 4172–4185, Nov. 2019.
[19] W.-H. Chen et al., "A 65 nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 494–496.
[20] M. Kang, Y. Kim, A. D. Patil, and N. R. Shanbhag, "Deep in-memory architectures for machine learning: Accuracy versus efficiency trade-offs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 5, pp. 1627–1639, May 2020.
[21] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, "A variation-tolerant in-memory machine learning classifier via on-chip training," IEEE J. Solid-State Circuits, vol. 53, no. 11, pp. 3163–3173, Nov. 2018.
[22] X. Peng, R. Liu, and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on processing-in-memory architectures," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 4, pp. 1333–1343, Apr. 2020.
[23] M. F. Ali, A. Jaiswal, and K. Roy, "In-memory low-cost bit-serial addition using commodity DRAM technology," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 1, pp. 155–165, Jan. 2020.
[24] M. Ali, A. Jaiswal, S. Kodge, A. Agrawal, I. Chakraborty, and K. Roy, "IMAC: In-memory multi-bit multiplication and ACcumulation in 6T SRAM array," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2521–2531, Aug. 2020.
[25] Z. Liu et al., "NS-CIM: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 2909–2922, Sep. 2020.
[26] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 3123–3131.
[27] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 4107–4115.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[29] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, "An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
[30] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, "BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), San Diego, CA, USA, Apr. 2018, pp. 1–4.
[31] X. Sun et al., "Low-VDD operation of SRAM synaptic array for implementing ternary neural network," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2962–2965, Oct. 2017.
[32] L. Yang, D. Bankman, B. Moons, M. Verhelst, and B. Murmann, "Bit error tolerance of a CIFAR-10 binarized convolutional neural network processor," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Florence, Italy, May 2018, pp. 1–5.
[33] T.-J. Yang and V. Sze, "Design considerations for efficient deep neural networks on processing-in-memory accelerators," in IEDM Tech. Dig., San Francisco, CA, USA, Dec. 2019, p. 22.