ADC-Less 3D-NAND Compute-In-Memory Architecture Using Margin Propagation
2 Department of Bioengineering, University of California at San Diego, CA, USA
3 School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
* email - [email protected]
Abstract—Compute-In-Memory (CIM) has gained significant attention in recent years due to its potential to overcome the memory bottleneck in Von-Neumann computing architectures. While most CIM architectures use non-volatile memory elements in a NOR-based configuration, NAND-based configurations, and in particular 3D-NAND flash memories, are attractive because of their potential to achieve ultra-high memory density and ultra-low cost-per-bit storage. Unfortunately, the standard multiply-and-accumulate (MAC) CIM paradigm cannot be directly applied to NAND-flash memories. In this paper, we report a NAND-Flash-based CIM architecture that combines conventional 3D-NAND flash with a Margin-Propagation (MP) based approximate computing technique. We show its application for implementing matrix-vector multipliers (MVMs) that do not require analog-to-digital converters (ADCs) for read-out. Using simulation results, we show that this approach has the potential to provide a 100× improvement in compute density, read speed, and computation efficiency compared to the current state-of-the-art.

Index Terms—Compute-In-Memory, Multiply and Accumulate, NAND Flash, Margin Propagation, Matrix Vector Multipliers.

I. INTRODUCTION

Conventional processors based on the Von-Neumann architecture suffer from the memory-wall bottleneck [1], which arises from the energy and bandwidth limitations of memory access and of data movement between the memory and the processor. The compute-in-memory (CIM) paradigm [2] can potentially alleviate this bottleneck by integrating some core computing function with the memory and by exploiting highly parallel analog computing techniques. In the literature, several CIM architectures have been proposed and applied to different types of volatile and non-volatile memories (NVMs) [3] as Multiply Compute Elements (MCEs). NVM CIMs are more desirable because the stored parameters, once programmed, can be retained across brown-outs and reused without incurring any data-upload costs. Amongst all the non-volatile memory technologies, 3D NAND flash boasts one of the highest integration densities, with memory capacities exceeding 14 GB/mm². Also, the technology can vertically stack more than 200 layers of floating-gate memory cells, which implies each stack occupies a cross-sectional area of less than 0.01F² per bit. Thus, if it is possible to implement an analog compute-in-memory architecture on a 3D NAND flash array, similar to a NOR configuration [2], then the resulting design would easily exceed the state-of-the-art performance specifications. The key technical challenge is that, unlike a NOR configuration, where floating-gate [3] or FeFET transistors are connected in parallel to a current-summation line, a NAND flash comprises a cascade of floating-gate transistors. Thus, traditional parallel multiply-and-accumulate (MAC) approaches cannot be applied to NAND architectures to compute inner products or matrix-vector multiplications (MVMs). However, our recent results have shown that inner-product computation, like any pattern-matching computation, has sufficient ensemble redundancy to tolerate minor approximation errors in individual MCEs. We leverage this observation to develop a 3D NAND-based Multiply-Accumulate Macro (MAM) that achieves multi-bit precision for MVM computations. Specifically, we use a Margin-Propagation (MP) based approximate computing architecture [4] whose operational principle perfectly matches the operating physics of a 3-D NAND-flash memory, as shown in Fig. 1(a).

Fig. 1. (a) Analog bottom-up, top-down co-design that exploits the computational redundancy in inner-product and the computational primitives inherent in margin-propagation. (b) System-level architecture of the differential MP-based 3-D NAND Flash Matrix Vector Multiplier.

The proposed approach also naturally lends itself to an ADC-less readout. As shown in Fig. 1(b), when the NAND stack is coupled to a pre-charged peripheral capacitor, the temporal voltage decay on the capacitor can be used to time-encode the MAC output. In this manner, the proposed NAND-based architecture also eliminates the need for peripheral ADCs. The proposed 3D NAND-based in-memory compute array can therefore result in
Authorized licensed use limited to: Indian Institute of Technology Indore. Downloaded on March 21,2024 at 06:59:20 UTC from IEEE Xplore. Restrictions apply.
• Highest memory density of all non-volatile memories, by leveraging the density of the 3D NAND stack.
• Parallel computation and single-shot readout per MAM, thus significantly reducing the read time and enhancing compute efficiency.
• ADC-less digital output and readout, by leveraging a capacitor discharging/charging scheme to directly encode the MAM output in time.

Overall, the proposed scheme advances state-of-the-art architectures in NVM computing.

II. PRINCIPLE OF OPERATION

A. 3D NAND Flash

3D NAND flash is a variation on traditional floating-gate non-volatile memories in which the floating-gate transistors are vertically stacked on top of each other, as shown in Fig. 2. This not only increases the storage density but also reduces the interference across memory cells, which is important for CIM architectures. The memory cells are vertically arranged in a grid of rows and columns, with each cell consisting of a transistor and a floating gate, as shown in Fig. 2. A thin oxide layer insulates the floating gate from the transistor and can store electrical charges that represent multi-bit information [5]. NAND flash involves complex interactions between electrical charges and transistors, but its ability to store data without power and its fast read and write times have made it a popular storage technology in a wide range of applications.

B. System Architecture

In the proposed CIM architecture, we exploit 3D NAND stacking to efficiently perform a simultaneous inner-product (IP) calculation between a vector X, with n components, and m different weight vectors (Wi, for i = 1 to m) in a single read cycle. We propose to use non-volatile 3-D NAND vertical stacks to store the weight vectors (Wi = [w1i w2i . . . wni]) along the memory cells in a vertical stack. Multiple weight vectors (Wi, for i = 1 to m) are stored in parallel NAND stacks (as illustrated in Fig. 1(b) in pink ink). The input vector (X = [x1 x2 . . . xn]) is applied simultaneously to all the stacks, and the individual inputs (xp, for p = 1 to n) are directed to individual memory cells within each stack (as shown in Fig. 1(b) in cyan ink). From MP-computation theory, we have shown that the difference in resistance between the positive and negative stacks indicates the value of the inner product (discussed in Sections III-A & III-B). During the sense phase, pre-charged capacitors are connected to the output nodes, and the decay of their charge is tracked to encode the inner-product value into the pulse width of the digital output pulse (as illustrated in Fig. 1(b) in red ink). In summary, the envisioned system can simultaneously activate all the vertical NAND stacks and the individual memory cells within each stack to perform the inner-product calculation between X and Wi, for all i = 1 to m, in a single readout cycle. This architecture enables substantial improvements in efficiency, readout speed, and compute density, making it highly suitable for in-memory computing applications.
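As a behavioral reading aid (our own sketch, not from the paper; the dimensions and random values are hypothetical placeholders), the single-read-cycle operation described above, m inner products evaluated simultaneously, is functionally the ideal MVM below:

```python
import random

random.seed(0)
n, m = 64, 8  # hypothetical: n cells per vertical stack, m parallel stacks

# Each NAND stack stores one weight vector W_i along its vertical cells.
W = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [random.uniform(-1.0, 1.0) for _ in range(n)]

# One "read cycle": the input vector is applied to all stacks at once,
# so all m inner products <x, W_i> are produced in a single readout.
y = [sum(xp * wp for xp, wp in zip(x, Wi)) for Wi in W]
```

In the actual array, each product xp·wp is realized by a differential MCE and the accumulation by the series stack; only the input/output relationship is captured here.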
Fig. 3. (a) Schematic of a 2-stacked NAND MCE. Output of a 2-stacked NAND MCE across varying x and w when operating in (b) common-source configuration, and (c) common-drain configuration.
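The derivations in this section build on margin propagation [4], but the MP primitive itself is not restated in the text. As background, the formulation commonly used in the MP literature (our paraphrase of [4]; treat the exact form as an assumption) computes, for scores L_1..L_n and a margin hyper-parameter γ, the value z satisfying Σ_i max(L_i − z, 0) = γ, a piecewise-linear surrogate for log-sum-exp. A minimal sketch:

```python
def margin_propagation(scores, gamma, iters=100):
    """Solve sum_i max(L_i - z, 0) = gamma for z by bisection.

    The left-hand side is continuous and decreasing in z, so for any
    gamma > 0 a root is bracketed by [min(L) - gamma, max(L)].
    """
    lo, hi = min(scores) - gamma, max(scores)
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(s - z, 0.0) for s in scores) > gamma:
            lo = z  # constraint excess too large -> raise z
        else:
            hi = z
    return 0.5 * (lo + hi)
```

For example, margin_propagation([1.0, 2.0, 3.5], 0.5) converges to z = 3.0, since only the top score then contributes (3.5 − 3.0) = 0.5 = γ.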
effective inputs, i.e., the actual gate input plus the $V_T$ change due to trapped charges, and can be expressed as

$R_1^+ = \frac{1}{k_n (V_{gT} + (x - w))}, \quad R_2^+ = \frac{1}{k_n (V_{gT} - (x - w))}.$

Similarly, the right-branch ON-resistance can be written as $R_{DS}^- = R_1^- + R_2^-$, where $R_1^-$ and $R_2^-$ are the device resistances with $(x + w)$ and $-(x + w)$ as inputs and can be expressed as

$R_1^- = \frac{1}{k_n (V_{gT} + (x + w))}, \quad R_2^- = \frac{1}{k_n (V_{gT} - (x + w))}.$

Finally, we can show that

$R_{DS}^+ = \frac{2 V_{gT}}{k_n (V_{gT}^2 - (x - w)^2)}, \quad R_{DS}^- = \frac{2 V_{gT}}{k_n (V_{gT}^2 - (x + w)^2)},$

$V_D = -\frac{8 V_{gT} I_{sense}}{k_n (V_{gT}^2 - (x + w)^2)(V_{gT}^2 - (x - w)^2)} (x \times w),$

$V_D \propto (x \times w)$ for large $V_{gT}$,

where $R_{DS}^+$ and $R_{DS}^-$ represent the resistances of the positive and negative stacks, respectively, and $V_{gT} = V_g - V_T$. As can be seen, the differential drain voltage tracks the product, $V_D \propto x \times w$ (CS-NAND MCE). Similar behavior is also observed in our simulations, shown in Fig. 3(b). Due to the present lack of availability of NAND flash technology, the simulations were performed using conventional 22nm transistors, where the threshold change of the flash memory cells is emulated using a voltage source in series with the transistor gate.

Additionally, the MP-based NAND Flash MCE can also operate in the common-drain configuration, where the drain terminals are biased at $V_{DD}$ and the current $I_{sense}$ is drawn out of the source terminals. The analysis of such a system is more involved and has been omitted for brevity. Essentially, in this case, the differential output voltage $V_S = (V_S^+ - V_S^-)$ tracks the multiplication and is monotonic in $(x \times w)$ (CD-NAND MCE), as shown in Fig. 3(c). The nonlinearity of the output with respect to the product $(x \times w)$ can be corrected during the calibration phase. In summary, to realize a single analog multiplication, we use four non-volatile flash memory units (either in the CD or CS configuration), occupying a feature size of $2F \times 2F$.

III. MP-BASED MVMs

A. MVMs Using 3D MP-NAND Flash

Expanding upon the single-unit NAND-based MCE, vector inner-product calculations can be easily achieved by stacking unit cells on top of each other, as illustrated in Fig. 4(a). This approach would be highly efficient in terms of area utilization if implemented with 3-D NAND Flash technology. In fact, a single read operation of a 128-stacked NAND flash memory can evaluate an inner product of a vector with 64 elements. As the number of NAND cells per stack increases, the mismatch errors are averaged out. When the 128-stacked NAND flash architecture is biased in the CD configuration, the MP-based NAND Flash MVM yielded an inner-product computation with an RMS error of -30.5 dB, equivalent to 5-bit precision (see Fig. 4(b)). When biased in the CS configuration, the MP-based NAND Flash MAM yielded an inner-product computation with an RMS error of -43.5 dB, equivalent to 7.2-bit precision (see Fig. 4(c)). The reduced computing precision of the CD-NAND MVM compared to the CS-NAND MVM is due to the higher non-linearity of the CD-NAND MCE, as shown in Fig. 3(c).

B. Parallel Computation and Single-Shot Readout per MVM

In the most recent NAND flash-based in-memory compute blocks, only one unit cell within the vertical stack is activated per read cycle, and this approach limits the read time and computing efficiency [8]. Our proposed scheme improves upon this limitation by activating all the unit cells in a vertical stack. This allows the inner product between the stored weights and the input vector to be computed within a single read cycle, significantly improving the read time and compute efficiency by a factor of (0.5 × no. of stacked unit cells) (discussed in Section II-C). Additionally, when implemented on a 3-D NAND flash technology, this improvement can enable extremely compact implementations. For instance, with the current state-of-the-art 3-D NAND flash, it is possible to stack up to 256 unit cells, resulting in a 128× improvement in read time, compute efficiency, and implementation area.

C. ADC-Less Time-Encoded Output and Readout

Peripheral circuits, such as the output ADC, can limit the efficiency of a system. To address this issue, we propose an ADC-less readout scheme in Fig. 5(a) that encodes the output
Fig. 4. (a) Schematic of the MP-based 3-D NAND Flash MAM. Representative simulation of a 128-stacked NAND Flash MAM across varying inner product
when operating in (b) common-drain configuration resulting in 5-bit precision and (c) common-source configuration resulting in 7.2-bit precision.
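The closed-form CS-NAND MCE expressions above can be cross-checked numerically. The sketch below is our own; the values of $k_n$, $V_{gT}$, and $I_{sense}$ are illustrative placeholders, not device parameters from the paper. It composes the four device ON-resistances and verifies that $I_{sense}(R_{DS}^+ - R_{DS}^-)$ equals the closed-form $V_D$:

```python
def r_on(kn, vgt, u):
    # ON-resistance of a single floating-gate device with effective input u
    return 1.0 / (kn * (vgt + u))

def vd_stacks(x, w, kn=1e-3, vgt=5.0, i_sense=1e-6):
    # Positive stack sees +(x - w) and -(x - w); negative stack sees
    # +(x + w) and -(x + w). V_D = I_sense * (R_DS+ - R_DS-).
    r_pos = r_on(kn, vgt, x - w) + r_on(kn, vgt, -(x - w))
    r_neg = r_on(kn, vgt, x + w) + r_on(kn, vgt, -(x + w))
    return i_sense * (r_pos - r_neg)

def vd_closed_form(x, w, kn=1e-3, vgt=5.0, i_sense=1e-6):
    # V_D = -8*VgT*Isense*(x*w) / (kn*(VgT^2-(x+w)^2)*(VgT^2-(x-w)^2))
    num = -8.0 * vgt * i_sense * (x * w)
    den = kn * (vgt**2 - (x + w)**2) * (vgt**2 - (x - w)**2)
    return num / den
```

For large $V_{gT}$ the denominator is dominated by $V_{gT}^4$, recovering $V_D \propto (x \times w)$.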
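Although the readout description continues beyond this excerpt, the capacitor-based time-encoding introduced in Fig. 1(b) and Sec. III-C admits a simple first-order model: a pre-charged capacitor discharges through the stack resistance, and the time for its voltage to fall to a comparator threshold, t = R·C·ln(V0/Vth), becomes the output pulse width. A sketch with hypothetical component values (our own, not the paper's circuit sizing):

```python
import math

def pulse_width(r_stack, c=1e-12, v0=1.0, vth=0.5):
    # First-order RC decay: v(t) = v0 * exp(-t / (r_stack * c)).
    # The comparator trips when v(t) = vth; the elapsed time is the
    # digital pulse width that time-encodes the MAC output (no ADC).
    return r_stack * c * math.log(v0 / vth)
```

Because the pulse width is linear in the stack resistance, the differential resistance (and hence the inner product it encodes) maps monotonically to a pulse width.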