
2023 IEEE 66th International Midwest Symposium on Circuits and Systems (MWSCAS)

Phoenix, Arizona, USA, August 6-9, 2023

ADC-less 3D-NAND Compute-in-Memory Architecture using Margin Propagation

Aswin Chowdary Undavalli¹, Gert Cauwenberghs², Arun Natarajan³, Shantanu Chakrabartty¹ and Aravind Nagulu¹,*
¹Department of Electrical and Systems Engineering, Washington University in St. Louis, MO, USA
²Department of Bioengineering, University of California at San Diego, CA, USA
³School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
*email - [email protected]

979-8-3503-0210-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/MWSCAS57524.2023.10406082

Abstract—Compute-In-Memory (CIM) has gained significant attention in recent years due to its potential to overcome the memory bottleneck in von Neumann computing architectures. While most CIM architectures use non-volatile memory elements in a NOR-based configuration, NAND-based configurations, and in particular 3D-NAND flash memories, are attractive because of their potential to achieve ultra-high memory density and ultra-low cost per bit of storage. Unfortunately, the standard multiply-and-accumulate (MAC) CIM paradigm cannot be directly applied to NAND-flash memories. In this paper, we report a NAND-Flash-based CIM architecture that combines conventional 3D-NAND flash with a Margin-Propagation (MP) based approximate computing technique. We show its application for implementing matrix-vector multipliers (MVMs) that do not require analog-to-digital converters (ADCs) for read-out. Using simulation results, we show that this approach has the potential to provide a 100× improvement in compute density, read speed, and computation efficiency compared to the current state-of-the-art.

Index Terms—Compute-In-Memory, Multiply and Accumulate, NAND Flash, Margin Propagation, Matrix Vector Multipliers.

I. INTRODUCTION

Conventional processors based on the von Neumann architecture suffer from the memory-wall bottleneck [1], which arises from the energy and bandwidth limitations of memory access and of data movement between the memory and the processor. The compute-in-memory (CIM) paradigm [2] can potentially alleviate this bottleneck by integrating some core computing function with the memory and by exploiting highly parallel analog computing techniques. In the literature, several CIM architectures have been proposed and applied to different types of volatile and non-volatile memories (NVMs) [3] as Multiply Compute Elements (MCEs). NVM CIMs are more desirable because the stored parameters, once programmed, can be retained across brown-outs and reused without incurring any data upload costs. Amongst all the non-volatile memory technologies, 3D NAND flash boasts one of the highest integration densities, with memory capacities exceeding 14 GB/mm². Also, the technology can vertically stack more than 200 layers of floating-gate memory cells, which implies each stack occupies a cross-sectional area of less than 0.01F² per bit. Thus, if it were possible to implement an analog compute-in-memory architecture on a 3D NAND flash array, similar to a NOR configuration [2], then the resulting design would easily exceed the state-of-the-art performance specifications. The key technical challenge is that, unlike a NOR configuration, where floating-gate [3] or FeFET transistors are connected in parallel to a current summation line, a NAND flash comprises a cascade of floating-gate transistors. Thus, traditional parallel multiply-and-accumulate (MAC) approaches cannot be applied to NAND architectures to compute inner products or matrix-vector multiplications (MVMs). However, our recent results have shown that inner product computation, like any pattern-matching computation, has sufficient ensemble redundancy to tolerate minor approximation errors in individual MCEs. We leveraged this observation to develop a 3D NAND-based Multiply-Accumulate Macro (MAM) to achieve multi-bit precision for MVM computations. Specifically, we use a Margin-Propagation (MP) based approximate computing architecture [4] whose operational principle perfectly matches the operating physics of a 3-D NAND flash memory, as shown in Fig. 1(a).

Fig. 1. (a) Analog bottom-up, top-down co-design that exploits the computational redundancy in inner-product and computational primitives inherent in margin-propagation. (b) System-level architecture of the differential MP-based 3-D NAND Flash Matrix Vector Multiplier.

The proposed approach also naturally lends itself to an ADC-less readout. As shown in Fig. 1(b), when the NAND stack is coupled to a pre-charged peripheral capacitor, the temporal voltage decay on the capacitor can be used to time-encode the MAC output. In this manner, the proposed NAND-based architecture also eliminates the need for peripheral ADCs. The proposed 3D NAND-based in-memory compute array can therefore result in:

• Highest memory density of all non-volatile memories, by leveraging the density of the 3D NAND stack.
• Parallel computation and single-shot readout per MAM, thus significantly reducing the read time and enhancing compute efficiency.
• ADC-less digital output and readout, by leveraging a capacitor discharging/charging scheme to directly encode the MAM output into time.

Overall, the proposed scheme advances state-of-the-art architectures in NVM computing.

II. PRINCIPLE OF OPERATION

A. 3D NAND Flash

3D NAND flash is a variation on traditional floating-gate non-volatile memories in which the floating-gate transistors are vertically stacked on top of each other, as shown in Fig. 2. This not only increases the storage density but also reduces interference across memory cells, which is important for CIM architectures. The memory cells are vertically arranged in a grid of rows and columns, with each cell consisting of a transistor and a floating gate, as shown in Fig. 2. A thin oxide layer insulates the floating gate from the transistor and can store electrical charges that can represent multi-bit information [5].

Fig. 2. (a) 3D-NAND Flash Topology, where DL = ‘Data Line’, CL = ‘Control Line’, GSL = ‘Ground Select Line’ and CEL = ‘Chip Enable Line’. (b) Single-cell NAND Flash Structure.

To read data from a NAND flash memory cell, a voltage is applied to the cell’s control gate, which switches on the transistor and allows current to flow through the channel. The amount of current that flows depends on the amount of charge stored in the floating gate. This current is detected by a sense amplifier and converted into a digital signal that represents the data stored in the cell. To write data to a NAND flash memory cell, a voltage is applied to the control gate, and another voltage is applied to the cell’s drain, creating an electric field that allows electrons to tunnel through the oxide layer and into the floating gate. NAND flash memory is organized into blocks, with each block consisting of multiple memory cells. To erase data from a block, a higher voltage is applied to the block’s control gate, which releases all electrical charges from the floating gates in the block. Overall, the working of NAND flash involves complex interactions between electrical charges and transistors, but its ability to store data without power and its fast read and write times have made it a popular storage technology in a wide range of applications.

B. System Architecture

In the proposed CIM architecture, we exploit the 3D NAND stacking to efficiently perform a simultaneous Inner Product (IP) calculation between a vector X, with ‘n’ components, and ‘m’ different weight vectors (Wi, for i = 1 to m) in a single read cycle. We propose to use non-volatile 3-D NAND vertical stacks to store the weight vectors (Wi = [w1i w2i ... wni]) along the memory cells in a vertical stack. Multiple weight vectors (Wi, for i = 1 to m) are stored in parallel NAND stacks (as illustrated in Fig. 1(b) in pink ink). The input vector (X = [x1 x2 ... xn]) is applied simultaneously to all the stacks, and the individual inputs (xp, for p = 1 to n) are directed to individual memory cells within each stack (as shown in Fig. 1(b) in cyan ink). From MP-computation theory, we have shown that the difference in resistance between the positive and negative stacks indicates the value of the inner product (discussed in Sections III-A & III-B). During the sense phase, pre-charged capacitors are connected to the output nodes, and the decay of their charge is tracked to encode the inner product value into the pulse width of the digital output pulse (as illustrated in Fig. 1(b) in red ink). In summary, the envisioned system can simultaneously activate all the vertical NAND stacks and the individual memory cells within each stack to perform the inner product calculation between X and Wi, for all i = 1 to m, in a single readout cycle. This architecture enables substantial improvements in efficiency, readout speed, and compute density, making it well suited for in-memory computing applications.

C. Multiplier-less MCEs based on Margin Propagation

In [6], it was demonstrated that an inner product between two vectors X and W can be approximated using a differential MP architecture that computes z+ = MP([X + W, −X − W], γ), z− = MP([X − W, −X + W], γ), and (z+ − z−) ≈ XᵀW. Previously, MP-based vector multipliers were implemented using both digital [7] and analog current-domain [4] methods. In this work, we extend this concept to an in-memory compute architecture based on NAND flash memory, as depicted in Fig. 1.

Starting with a method to realize an MCE: the multi-level input differential weight (w and −w) is stored on 2 units of 2-stacked NAND flash memory cells, and the differential input (x and −x) is incident on the input gates with the appropriate gate bias (Vg) (as shown in Fig. 3(a)). In the common-source configuration, the source terminals (S+ and S−) are grounded, and a current Isense is injected into the drain terminals. When biased in the triode region, the ON-resistance of the floating-gate transistor is RDS = 1/[kn(VGS − VT)]. So the ON-resistance of the left branch can be written as RDS+ = R1+ + R2+, where R1+ and R2+ are the device resistances with x + w and −x − w as effective inputs.
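The differential MP approximation of multiplication described above can be sketched numerically. The following is our own scalar illustration based on the formulation in [6], not the authors' code; the bisection solver and the value of γ are assumptions made for the sketch.

```python
# Sketch (not from the paper) of the differential margin-propagation (MP)
# approximation used for multiplier-less MCEs. MP(v, gamma) is the unique z
# solving sum_i max(v_i - z, 0) = gamma, found here by bisection.

def mp(v, gamma, iters=60):
    lo, hi = min(v) - gamma, max(v)   # the root is bracketed in [lo, hi]
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(vi - z, 0.0) for vi in v) > gamma:
            lo = z                    # constraint sum too large -> raise z
        else:
            hi = z
    return 0.5 * (lo + hi)

def mp_multiply(x, w, gamma=0.1):
    """Differential MP approximation of x * w (up to a scale factor)."""
    z_pos = mp([x + w, -x - w], gamma)
    z_neg = mp([x - w, -x + w], gamma)
    return z_pos - z_neg

# The approximation preserves the sign of the true product:
for x in (-0.8, -0.3, 0.4, 0.9):
    for w in (-0.7, 0.5):
        assert mp_multiply(x, w) * (x * w) > 0
```

For small γ the difference reduces to |x + w| − |x − w|, which is monotonic in x·w; the residual nonlinearity is part of the approximation error that the ensemble redundancy of the inner product tolerates.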
Fig. 3. (a) Schematic of a 2-stacked NAND MCE. Output of a 2-stacked NAND MCE across varying x and w when operating in (b) common-source configuration, and (c) common-drain configuration.

These effective inputs correspond to the actual gate input plus the VT change due to the trapped charges, and the resistances can be expressed as

R1+ = 1/[kn(VgT + (x − w))],   R2+ = 1/[kn(VgT − (x − w))].

Similarly, the ON-resistance of the right branch can be written as RDS− = R1− + R2−, where R1− and R2− are the device resistances with −x + w and x − w as inputs and can be expressed as

R1− = 1/[kn(VgT + (x + w))],   R2− = 1/[kn(VgT − (x + w))].

Finally, we can show that

RDS+ = 2VgT/[kn(VgT² − (x − w)²)],   RDS− = 2VgT/[kn(VgT² − (x + w)²)],

VD = −[8 VgT Isense / (kn(VgT² − (x + w)²)(VgT² − (x − w)²))] · (x × w),

VD ∝ (x × w) for large VgT,

where RDS+ and RDS− represent the resistances of the positive and negative stacks respectively, and VgT = Vg − VT. As can be seen, the differential drain voltage tracks the product, VD ∝ x × w (CS-NAND MCE). Similar behavior is also observed in our simulations shown in Fig. 3(b). Due to the lack of availability of NAND flash technology at present, the simulations were performed using conventional 22nm transistors, where the threshold change of the flash memory cells is emulated using a voltage source in series with the transistor gate.

Additionally, the MP-based NAND Flash MCE can also operate in the common-drain configuration, where the drain terminals are biased at VDD and the current Isense is drawn out of the source terminals. The analysis of such a system is more involved and has been omitted for brevity. Essentially, in this case, the differential output voltage VS = (VS+ − VS−) tracks the multiplication and is monotonic in (x × w) (CD-NAND MCE), as shown in Fig. 3(c). The nonlinearity of the output with respect to the product (x × w) can be easily calibrated during the calibration phase. In summary, to realize a single analog multiplication, we use four non-volatile flash memory units (either in the CD or CS configuration) occupying a feature size of 2F × 2F.

III. MP-BASED MVMs

A. MVMs Using 3D MP-NAND Flash

Expanding upon the single-unit NAND-based MCE, vector inner product calculations can be easily achieved by stacking unit cells on top of each other, as illustrated in Fig. 4(a). This approach would be highly efficient in terms of area utilization if implemented with 3-D NAND Flash technology. In fact, a single read operation of a 128-stacked NAND flash memory can evaluate an inner product of a vector with 64 elements. As the number of NAND cells per stack increases, the mismatch errors are averaged. When the 128-stacked NAND flash architecture is biased in the CD configuration, the MP-based NAND Flash MVM yielded an inner-product computation with an RMS error of −30.5 dB, equivalent to 5-bit precision (see Fig. 4(b)). When biased in the CS configuration, the MP-based NAND Flash MAM yielded an inner-product computation with an RMS error of −43.5 dB, equivalent to 7.2-bit precision (see Fig. 4(c)). The reduced computing precision of the CD-NAND MVM compared to the CS-NAND MVM is due to the higher non-linearity in the CD-NAND MCE, as shown in Fig. 3(c).

B. Parallel Computation and Single-Shot Readout per MVM

In the most recent NAND flash-based in-memory compute blocks, only one unit cell within the vertical stack is activated per read cycle, and this approach limits the read time and computing efficiency [8]. However, our proposed scheme improves upon this limitation by activating all the unit cells in a vertical stack. This allows the inner product between the stored weights and the input vector to be computed within a single read cycle, significantly improving the read time and compute efficiency by a factor of (0.5 × no. of stacked unit cells) (discussed in Section II-C). Additionally, when implemented on a 3-D NAND flash technology, this improvement can enable extremely compact implementations. For instance, with the current state-of-the-art 3-D NAND flash, it is possible to stack up to 256 unit cells, resulting in a 128× improvement in read time, compute efficiency, and implementation area.

C. ADC-Less Time-Encoded Output and Readout

Peripheral circuits, such as the output ADC, can limit the efficiency of a system. To address this issue, we propose an ADC-less readout scheme, shown in Fig. 5(a).
Fig. 4. (a) Schematic of the MP-based 3-D NAND Flash MAM. Representative simulation of a 128-stacked NAND Flash MAM across varying inner product
when operating in (b) common-drain configuration resulting in 5-bit precision and (c) common-source configuration resulting in 7.2-bit precision.
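The CS-stack behavior behind these results can be sanity-checked with a short numerical model. This is our own sketch, not the paper's simulation: the per-cell resistances use the closed-form expressions from Section II-C, and the kn, VgT, and Isense values are purely illustrative.

```python
# Sketch of the CS-configuration stack model: each vector element adds a
# series resistance to the positive and negative branches, and the
# differential drain voltage VD = Isense * (R+ - R-) tracks the inner
# product X.W (with a negative scale factor, per the derivation).

def stack_vd(x, w, vgt=2.0, kn=1.0, isense=1.0):
    """Differential drain voltage of a CS-NAND stack (triode model)."""
    r_pos = sum(2 * vgt / (kn * (vgt**2 - (xi - wi)**2)) for xi, wi in zip(x, w))
    r_neg = sum(2 * vgt / (kn * (vgt**2 - (xi + wi)**2)) for xi, wi in zip(x, w))
    return isense * (r_pos - r_neg)

x = [0.1, -0.2, 0.15]
w = [0.05, 0.1, -0.1]
dot = sum(xi * wi for xi, wi in zip(x, w))      # true inner product: -0.03
ideal = -8 * 1.0 * dot / (1.0 * 2.0**3)         # large-VgT limit of VD
vd = stack_vd(x, w)
assert abs(vd - ideal) < 0.05 * abs(ideal)      # within a few % for small x, w
```

The residual deviation from the ideal product grows as |x ± w| approaches VgT, which is the nonlinearity that bounds the achievable precision reported above.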

The scheme encodes the output in the pulse width of a digital signal. Initially, the output capacitors are pre-charged. During the compute phase, the output capacitors discharge through the MCE with different time constants in the positive and negative branches, causing the output nodes to reach the comparator reference voltage at different time instances. We can extract the difference between these instances using XOR and AND gates. Simulations suggest that the pulse width can successfully encode the multiplication of x and w. Fig. 5(b) illustrates the pulse width of the XOR output across varying inner-product values between two vectors X and W, creating a profile equivalent to an ideal multiplication. For a total discharge time (i.e., readout time per vector inner product) of 45 ns, the overall pulse width ranges from −4 ns to +4 ns, as shown in Fig. 5(b).

Fig. 5. (a) Schematic representation of a 2-stacked NAND MCE with time-encoded output, including a calibration phase that calibrates the pulse width due to comparator offsets. (b) The output of a CS-NAND MCE across varying inner product between two vectors X and W.
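The time encoding above can be illustrated with a first-order RC model. This is our own sketch, not the paper's circuit simulation, and the component values are arbitrary placeholders.

```python
import math

# First-order model of the ADC-less readout: a capacitor pre-charged to V0
# discharges through the branch resistance R and crosses the comparator
# reference Vref at t = R*C*ln(V0/Vref). The XOR pulse width is the
# difference of the two crossing times, so it is linear in (R+ - R-),
# which in turn tracks the inner product.

def crossing_time(r, c=1.0, v0=1.0, vref=0.5):
    """Time for an RC discharge from v0 to cross vref."""
    return r * c * math.log(v0 / vref)

def pulse_width(r_pos, r_neg):
    """Width of the XOR pulse between the two comparator flips."""
    return crossing_time(r_pos) - crossing_time(r_neg)

# Pulse width is proportional to the branch-resistance difference:
assert abs(pulse_width(1.2, 1.0) - 0.2 * math.log(2)) < 1e-12
```

The sign of the pulse width (which comparator flips first) carries the sign of the inner product, which is why the AND gate is needed alongside the XOR in the readout logic.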
This pulse width can be digitized to 8-bit accuracy assuming a time-to-digital converter with an LSB of ∼31 ps, which can be accomplished with sub-mW power in scaled CMOS technologies [9]. This concept can be easily extended to the 3D-NAND MAM-based MVM system discussed earlier, resulting in an ADC-less readout scheme, thereby enhancing the overall system efficiency.

IV. CONCLUSION

Combining the proposed MP-based MAM with 3-D NAND Flash will enable a 100× increase in read speed without compromising the computing efficiency. The NAND flash MAM's parallel compute and single-stack architecture can enable a 100× improvement in memory density compared to CIM architectures based on NOR Flash memory. The proposed ADC-less readout scheme, which encodes the output in the pulse width of a digital signal, would reduce the energy overhead from the peripheral circuits.

REFERENCES

[1] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE ISSCC, pp. 10–14, 2014.
[2] N. R. Shanbhag and S. K. Roy, "Comprehending in-memory computing trends via proper benchmarking," in IEEE CICC, pp. 1–7, 2022.
[3] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog VLSI trainable pattern classifier," IEEE JSSC, vol. 42, no. 5, pp. 1169–1179, 2007.
[4] M. Gu and S. Chakrabartty, "Synthesis of bias-scalable CMOS analog computational circuits using margin propagation," IEEE TCAS I: Regular Papers, vol. 59, no. 2, pp. 243–254, 2011.
[5] A. Goda, "Recent progress on 3D NAND flash technologies," Electronics, vol. 10, no. 24, p. 3156, 2021.
[6] A. R. Nair et al., "Multiplierless MP-kernel machine for energy-efficient edge devices," IEEE Trans. on VLSI Systems, vol. 30, no. 11, pp. 1601–1614, 2022.
[7] M. Gu and S. Chakrabartty, "A 100 pJ/bit, (32,8) CMOS analog low-density parity-check decoder based on margin propagation," IEEE JSSC, vol. 46, no. 6, pp. 1433–1442, 2011.
[8] M. Kim et al., "An embedded NAND flash-based compute-in-memory array demonstrated in a standard logic process," IEEE JSSC, vol. 57, no. 2, pp. 625–638, 2022.
[9] H. Kim et al., "19.3 A 2.4GHz 1.5mW digital MDLL using pulse-width comparator and double injection technique in 28nm CMOS," in IEEE ISSCC, pp. 328–329, 2016.
