Processing-in-Memory for AI
From Circuits to Systems

Editors:
Joo-Young Kim, Korea Advanced Institute of Science and Technology, Daejeon, Korea (Republic of)
Bongjin Kim, University of California, Santa Barbara, CA, USA
Tony Tae-Hyoung Kim
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023, corrected publication 2023
Contents

1 Introduction
Joo-Young Kim
2 Backgrounds
Chengshuo Yu, Hyunjoon Kim, Bongjin Kim, and Tony Tae-Hyoung Kim
3 SRAM-Based Processing-in-Memory (PIM)
Hyunjoon Kim, Chengshuo Yu, and Bongjin Kim
4 DRAM-Based Processing-in-Memory
Donghyuk Kim and Joo-Young Kim
5 ReRAM-Based Processing-in-Memory (PIM)
Tony Tae-Hyoung Kim, Lu Lu, and Yuzong Chen
6 PIM for ML Training
Jaehoon Heo and Joo-Young Kim
7 PIM Software Stack
Donghyuk Kim and Joo-Young Kim
8 Conclusion
Joo-Young Kim, Bongjin Kim, and Tony Tae-Hyoung Kim
Correction to: Processing-in-Memory for AI
Chapter 1
Introduction
Joo-Young Kim
Artificial intelligence (AI) and machine learning (ML) technologies enable computers to mimic cognitive tasks once believed to be exclusive to humans, such as recognition, understanding, and reasoning [1]. A deep ML model named AlexNet [2], which uses eight layers in total, won the famous large-scale image recognition competition ImageNet by a significant margin over shallow ML models in 2012. Since then, the deep learning (DL) revolution has been ignited and has spread to many other domains such as speech recognition [3], natural language processing [4], virtual assistance [5], autonomous vehicles [6], and robotics [7]. With significant successes in these domains, DL has revolutionized a wide range of industry sectors such as information technology, mobile communication, automotive, and manufacturing [8]. However, as more industries adopt the new technology and more people use it daily, we face an ever-increasing demand for a new type of hardware for these workloads. Conventional hardware platforms such as CPUs and GPUs are not well suited to the new workloads: CPUs cannot cope with the tremendous amount of data transfer and computation required in ML workloads, while GPUs consume large amounts of power, incurring high operating costs.
An AI chip, or AI accelerator, is hardware that enables faster and more energy-efficient processing of AI workloads (Fig. 1.1). Over the past few years, many AI accelerators have been developed to serve the new workloads, targeting everything from battery-powered edge devices [9–11] to datacenter servers [12]. As McKinsey predicted in its report [13], the AI semiconductor industry is expected to grow by 18–19% every year to $65 billion, accounting for about 19% of the entire semiconductor market in 2025. So far, the AI hardware industry has been led by big tech companies. Google developed its own AI chip, named the tensor processing unit (TPU), that works with the TensorFlow [14] software framework. Amazon developed the Inferentia chip [15] for high-performance ML inference. Microsoft's BrainWave [16] uses FPGA infrastructure to accelerate ML workloads at scale. Even the electric car maker Tesla developed the full self-driving (FSD) chip for autonomous vehicles. There are also many start-up companies in this domain. Habana Labs, acquired by Intel in late 2019, developed the Gaudi processor for AI training. Graphcore has developed the intelligence processing unit (IPU) [17] and deployed it in datacenters. Groq's tensor streaming processor [18] optimizes data streaming and computations with fixed task scheduling. Cerebras's wafer-scale engine [19] uses a whole wafer as an ML processor to keep a large model on chip without external memories.
In this section, we introduce the basic models of deep neural networks (DNNs) and their computations. A DNN model is composed of multiple layers of artificial neurons, where the neurons of each layer are interconnected with the neurons in the neighboring layers. The mathematical model of the neuron comes from Frank Rosenblatt's perceptron model [20], as shown in Fig. 1.2. Inspired by the biological neuron, it receives multiple inputs from the input neurons and accumulates their weighted sum with a bias. It then decides the output through an activation function, where the activation function is non-linear and differentiable, with a step-like characteristic shape. As a result, the output of a neuron is expressed with the following equation:
y = f (Σi wi xi + b)   (1.1)
Based on the network connectivity, there are three major layer types in DNN models, which determine the actual computations: the fully connected, convolutional, and recurrent layers.
Figure 1.3 shows the fully connected layer that interconnects the neurons in the input layer to the neurons in the next layer. The input vector holds the values of the input neurons, a 3 × 1 vector in this case, and the output vector holds the values of the output neurons, a 4 × 1 vector. Each connection of the fully connected network between the two layers represents a weight parameter in the model. For example, W01 represents the weight parameter of the connection between input neuron 0 and output neuron 1. Collectively, the network becomes a 4 × 3 weight matrix. In addition, each output neuron has a bias, so the layer has a 4 × 1 bias vector. For each output neuron, we can write the output value y using Eq. 1.1. As a result, the equations can be formulated into a matrix-vector equation as follows:

y = f (W x + b)   (1.2)
If the model includes multiple layers, which is the case for deep neural networks, the matrix operations are cascaded one after another. Traditional multi-layer perceptron
(MLP) models as well as the latest transformer models [21] are based on the fully
connected layer.
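To make the matrix-vector formulation concrete, the short sketch below (an illustrative NumPy example of our own, matching the 3-input, 4-output layer described above; the sigmoid is just one possible choice of f) evaluates Eq. (1.2) directly. Cascading several such calls, one per layer, gives the multi-layer behavior described above.

```python
import numpy as np

def fully_connected(x, W, b):
    """Fully connected layer, Eq. (1.2): y = f(Wx + b), here with a sigmoid as f."""
    z = W @ x + b                        # weighted sums plus bias, one entry per output neuron
    return 1.0 / (1.0 + np.exp(-z))      # element-wise non-linear activation

x = np.array([0.5, -1.0, 2.0])           # 3 x 1 input vector (values of the input neurons)
W = np.random.randn(4, 3)                # 4 x 3 weight matrix (row i holds the weights of output neuron i)
b = np.zeros(4)                          # 4 x 1 bias vector
y = fully_connected(x, W, b)             # 4 x 1 output vector (values of the output neurons)
```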
The convolutional layer iteratively performs 3D convolution operations on the input layer using multiple weight kernels to generate the output layer, as illustrated in Fig. 1.4. The input layer consists of multiple 2D input feature maps, sized H × H × C in total, and the size of each kernel is K × K × C. For computation, it performs 3D convolution operations from top-left to bottom-right for each kernel with a stride of U. A single convolution operation accumulates all the inner products between the input window and the kernel. As a result of scanning with one kernel, it produces a single output feature map sized E × E. By repeating this process for all kernels, the convolutional layer produces the final output layer, sized E × E × M. The equation for an output point in the convolutional layer is as follows:
O[u][x][y] = B[u] + Σ_{k=0}^{C−1} Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} I[k][U·x + i][U·y + j] · W[u][k][i][j],   (1.3)
0 ≤ u < M,  0 ≤ x, y < E,  E = (H − K + U)/U
Many convolutional neural network (CNN) models use a number of convolutional layers followed by a few fully connected layers at the end for image classification and recognition tasks [22].
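The nested-loop sketch below is a plain NumPy illustration of Eq. (1.3) (the sizes are arbitrary assumptions, and real frameworks use far more optimized kernels): every output point accumulates the inner products between a K × K × C window of the input and one of the M kernels, with stride U and a per-kernel bias.

```python
import numpy as np

def conv_layer(I, W, B, U=1):
    """Naive convolutional layer following Eq. (1.3).
    I: input feature maps (C, H, H); W: kernels (M, C, K, K); B: biases (M,).
    Returns O of shape (M, E, E) with E = (H - K) // U + 1."""
    C, H, _ = I.shape
    M, _, K, _ = W.shape
    E = (H - K) // U + 1
    O = np.zeros((M, E, E))
    for u in range(M):                   # one kernel (output channel) at a time
        for x in range(E):
            for y in range(E):
                acc = B[u]
                for k in range(C):       # input channels
                    for i in range(K):
                        for j in range(K):
                            acc += I[k, U * x + i, U * y + j] * W[u, k, i, j]
                O[u, x, y] = acc
    return O

I = np.random.randn(3, 8, 8)             # C = 3 input feature maps of size H x H = 8 x 8
W = np.random.randn(4, 3, 3, 3)          # M = 4 kernels of size K x K x C = 3 x 3 x 3
B = np.zeros(4)
O = conv_layer(I, W, B, U=1)             # output layer of size E x E x M = 6 x 6 x 4
```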
Figure 1.5 shows the recurrent layer, which has a feedback loop from the output back to the input in the fully connected setting. In this layer, the cell state of the previous timestep affects the current state. Its computation is also matrix-vector multiplication but involves multiple steps with dependencies. Its cell state and output value are expressed with recurrence equations over timesteps, and the hyperbolic tangent is usually used as the activation function in the recurrent layer. Recurrent neural networks (RNNs) such as GRUs [23] and LSTMs [24] are based on this type of layer and are popularly used for speech recognition.
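Since the recurrence equations themselves are not reproduced here, the sketch below assumes the common Elman (vanilla RNN) form rather than the book's exact formulation: the cell state h_t mixes the current input with the previous state through two weight matrices and a tanh activation, and the output is a linear read-out of h_t. GRUs and LSTMs add gating on top of this basic pattern.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One time step of a vanilla recurrent layer (assumed Elman form):
    the new cell state depends on the current input and the previous state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # cell state update with tanh activation
    y_t = W_hy @ h_t + b_y                            # output read-out of the cell state
    return h_t, y_t

n_in, n_hidden, n_out = 3, 5, 2           # input, hidden, and output sizes (illustrative)
W_xh = np.random.randn(n_hidden, n_in)
W_hh = np.random.randn(n_hidden, n_hidden)
W_hy = np.random.randn(n_out, n_hidden)
b_h, b_y = np.zeros(n_hidden), np.zeros(n_out)
h = np.zeros(n_hidden)                    # initial cell state
for x_t in np.random.randn(4, n_in):      # a short input sequence of 4 time steps
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```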
The von Neumann bottleneck has been mitigated with a hierarchical memory structure. Processors include the fastest but smallest SRAM-based caches on-chip to leverage temporal and spatial locality. Outside the processor chip, the system's main memory is based on DRAM, which is slower than on-chip SRAM but offers much larger capacity. Beyond that, the system has solid-state drives (SSDs) for high storage capacity. However, as DNN models get deeper and bigger, toward the terabyte level, ML workloads require even higher bandwidth between the processor chip and the main memory. Even worse, process technology scaling faces strong challenges with the end of Moore's law below the 10 nm technology node [27].
To overcome the von Neumann bottleneck and the slowdown of process scaling, many companies propose array-type architectures to accelerate data-intensive ML processing, along with a 3D-stacked DRAM technology called high-bandwidth memory (HBM) [28] that provides higher bandwidth between the compute and memory devices. Google developed its own AI chip, named the TPU, to serve inference and training workloads in datacenters with better cost and energy efficiency [12]. Intel recently released the NNP-T processor [29] and the Habana Labs Gaudi processor [30] for training workloads. Start-up companies such as Graphcore [31] and Groq [18] also build on this architecture. However, although these AI accelerators with HBM technology can mitigate the bandwidth bottleneck up to a couple of TB/s, they cannot eliminate it, as they still follow the von Neumann architecture. In addition, HBM suffers from high power dissipation and low capacity [32]. Table 1.1 summarizes the hardware specifications of the latest AI accelerators [33].
1.4.2 Challenges
Although PIM looks promising, it has many challenges, as it needs to integrate logic units into the memory module. The three notable challenges in PIM design are process accessibility, architecture and design under physical constraints, and the software stack and usability.
Among the many memory technologies, SRAM is the only memory type that can be built using a commercially available logic process. This is why many PIM prototypes are based on SRAM [37–40]: it can be fabricated with a logic process, and both the memory cell and the peripheral circuits are easy to customize. Since the SRAM cell size is the largest among these memories, SRAM-based PIM has the least area restriction on logic integration. Unlike SRAM, DRAM and non-volatile memory (NVM) processes are difficult to access. Memory vendors such as Samsung, SK Hynix, and Micron have their own memory processes, but they are not open to outsiders. Since the process design kit (PDK) is not accessible, most researchers cannot even simulate the basic circuits. There have been many PIM architecture proposals for DRAM [41, 42]; however, they only evaluate the architectures at the performance-simulator level without much physical design. Since the DRAM process is vastly different from the logic process, focusing on increasing cell capacity and cell density, it is hard to establish from simulations alone that the proposed PIM architectures are feasible to fabricate.
In PIM design, it is imperative for chip designers to choose which functions to put into the memory. They cannot implement many functions or overly generic logic, as the silicon area is limited. In addition, the chip loses memory capacity in proportion to the area of the merged logic. Another challenge is that the logic design should be physically aligned with the memory cell design to maximize the internal bandwidth.
The software stack is the last hurdle in PIM design. It is essential for the widespread adoption of PIM as a new device. Unlike traditional memory devices, PIM is no longer a passive device, as it can perform logic operations at the same time. This means we need a fundamental change on the software side as well. For real PIM system optimizations, we need to revisit the whole software stack, including the programming language, compiler, driver, and run-time. Otherwise, PIM will not be able to outperform the existing von Neumann computer in performance and usability. Table 1.2 summarizes the opportunities and challenges of PIM technology.
References
3. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,
P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech
recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97
(2012)
4. Y. Goldberg, Neural network methods for natural language processing. Synth. Lect. Hum.
Lang. Technol. 10(1), 1–309 (2017)
5. V. Kepuska, G. Bohouta, Next-generation of virtual personal assistants (Microsoft Cortana,
Apple Siri, Amazon Alexa and Google Home), In 2018 IEEE 8th Annual Computing and
Communication Workshop and Conference (CCWC). IEEE, Piscataway (2018), pp. 99–103
6. M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, R. Urtasun, MultiNet: real-time joint
semantic reasoning for autonomous driving, in 2018 IEEE Intelligent Vehicles Symposium (IV).
IEEE, Piscataway (2018), pp. 1013–1020
7. H.A. Pierson, M.S. Gashler, Deep learning in robotics: a review of recent research. Adv. Robot.
31(16), 821–835 (2017)
8. M.I. Jordan, T.M. Mitchell, Machine learning: trends, perspectives, and prospects. Science
349(6245), 255–260 (2015)
9. Y.H. Chen, T. Krishna, J.S. Emer, V. Sze, Eyeriss: an energy-efficient reconfigurable accel-
erator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127–138
(2016)
10. Z. Yuan, Y. Liu, J. Yue, Y. Yang, J. Wang, X. Feng, J. Zhao, X. Li, H. Yang, STICKER: an
energy-efficient multi-sparsity compatible accelerator for convolutional neural networks in 65-
nm CMOS. IEEE J. Solid-State Circuits 55(2), 465–477 (2019)
11. J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, H.J. Yoo, UNPU: A 50.6 TOPS/W unified deep
neural network accelerator with 1b-to-16b fully-variable weight bit-precision, in 2018 IEEE
International Solid-State Circuits Conference-(ISSCC). IEEE, Piscataway (2018), pp. 218–220
12. N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N.
Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau,
J. Dean, B. Gelb, T.V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C.R. Ho, D.
Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D.
Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A.
Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami,
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek,
E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan,
G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D.H.
Yoon, In-datacenter performance analysis of a tensor processing unit, in Proceedings of the
44th Annual International Symposium on Computer Architecture (2017), pp. 1–12
13. G. Batra, Z. Jacobson, S. Madhav, A. Queirolo, N. Santhanam, Artificial-Intelligence Hard-
ware: New Opportunities for Semiconductor Companies. McKinsey and Company (2019)
14. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G.
Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D.G. Murray, B. Steiner,
P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, TensorFlow: a system for
large-scale machine learning, in 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16) (2016), pp. 265–283
15. Amazon Web Services, AWS Inferentia. https://aws.amazon.com/machine-learning/inferentia/. Accessed 21 Oct 2021
16. E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D.
Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O.
Firestein, A. Forin, K.S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K.
Kagi, R.K. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A.G. Rapsang,
S.K. Reinhardt, B.D. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L.
Woods, P.Y. Xiao, D. Zhang, R. Zhao, D. Burger, Serving DNNs in real time at datacenter
scale with Project Brainwave. IEEE Micro 38(2), 8–20 (2018)
17. Graphcore Ltd., IPU processors. https://www.graphcore.ai/products/ipu. Accessed 21 Oct 2021
Chapter 2
Backgrounds
Chengshuo Yu, Hyunjoon Kim, Bongjin Kim, and Tony Tae-Hyoung Kim
Table 2.1 Device characteristics of mainstream and emerging memory technologies [1]

                     Mainstream memories                           Emerging memories
                     SRAM      DRAM      NOR         NAND          MRAM      PCRAM     RRAM
Cell area            >100F2    6F2       10F2        <4F2 (3D)     6–50F2    4–30F2    4–12F2
Multi-bit            1         1         2           3             1         2         2
Voltage              <1 V      <1 V      >10 V       >10 V         <1.5 V    3 V       3 V
Read time            ~1 ns     ~10 ns    ~50 ns      ~10 μs        <10 ns    <10 ns    <10 ns
Write time           ~1 ns     ~10 ns    10 μs–1 ms  0.1–1 ms      <10 ns    ~50 ns    <10 ns
Retention            NA        ~64 ms    >10 year    >10 year      >10 year  >10 year  >10 year
Endurance            >1E16     >1E16     >1E5        >1E4          >1E15     >1E9      >1E6–1E12
Write energy (/bit)  ~fJ       ~100 fJ   ~100 pJ     ~10 fJ        ~0.1 pJ   ~10 pJ    ~0.1 pJ

F: feature size of the lithography
Fig. 2.1 SRAM cell operation: (a) write and (b) read
turned off. Therefore, SRAM does not require refresh or write-back operations. Another key advantage of SRAM is that it is fully compatible
with CMOS process technology, which allows SRAM to be easily embedded with
computing blocks. However, as CMOS technology scaling continues, SRAM also
faces various challenges such as insufficient stability margin, increased leakage
current, and difficult supply voltage scaling [2]. Various design techniques have
been reported to address these issues [3–11].
Figure 2.1 shows the schematic of the conventional 6T SRAM cell and its write and read operations. The typical SRAM cell consists of six transistors: four forming two cross-coupled inverters and two serving as access transistors. Write operation starts by loading data onto a bitline pair, followed by turning on a wordline. Then, the data on the bitline pair are driven into the SRAM cell nodes through the access transistors. For example, as
shown in Fig. 2.1a, if the data in Bitline is “0” and the data in /Bitline is “1,” Q will
be lowered through the access transistor and QB will be raised by /Bitline. Then, the
SRAM will store Q = “0.” SRAM write operation is mainly limited by the path of
writing “0” because the NMOS access transistors can pass low voltage better than
high voltage. Therefore, the access transistors need to be stronger than the PMOS
transistors to lower Q below the trip point of the inverters in the SRAM cell. SRAM
read operation starts by turning on a wordline after precharging bitline pairs. One
of the differential bitlines decreases depending on the data stored in the SRAM cell.
For example, in Fig. 2.1b, Bitline decreases and /Bitline remains at VDD since Q
is “0.” A sense amplifier amplifies the differential bitline voltage and generates an
output signal.
Figures 2.2 and 2.3 depict a sample SRAM architecture. In general, SRAM
consists of an array, row decoding, column multiplexing, sense amplifiers, write
drivers, and a controller. During read operation, an accessed cell generates differen-
tial voltage at a bitline pair. The differential bitline voltage is connected to a sense
amplifier through a column multiplexer. Unlike DRAM, SRAM has sense amplifiers
that are shared by multiple columns. Therefore, only one column is connected to
a sense amplifier for amplification. No write-back operation is necessary in the
unselected columns because SRAM cells can regenerate the stored data through the
cross-coupled inverters. During write operation, write drivers send the write data to
the selected bitlines through the column multiplexer. However, the access transistors
in the unselected columns will be on, which can cause unwanted write operations. To
mitigate this, the bitlines of unselected columns are precharged to VDD, so that the
SRAM cells in the selected row and the unselected columns undergo read operation.
Fig. 2.4 DRAM cell operation: (a) write and (b) read
In DRAM write operation, the selected wordline is turned on, the bitline is driven by the write driver, and the written data is stored in the capacitor. However, the stored data at the
capacitor changes over time because of the leakage current flowing through the
access transistor. Multiple leakage paths are formed in the DRAM cell depending
on the stored voltage. The stored data will be lost once the stored voltage deviates
significantly from the original values and cannot be read through read operation.
To tackle this inevitable issue, refresh operation is employed in DRAM to maintain the stored data. Refresh operation reads data from the selected DRAM cells and writes the read data back to the same cells using strong "1" and "0" levels. DRAM
technologies have developed various techniques such as stacked capacitors [14, 15]
and trench capacitors [16, 17] that can achieve the same or larger capacitance after
technology scaling.
In DRAM, read operation starts by precharging the bitline with VDD/2. After
precharging, the bitline will be floating at VDD/2. Then, the selected wordline is
turned on, which will connect the capacitor node to the bitline through the access
transistor. The floating bitline voltage will increment or decrement slightly through
charge sharing, depending on the cell data. The final voltage formed by the charge
sharing can be calculated by the following expression.
Final Voltage = (CBL × VDD/2 + CCELL × VCELL) / (CBL + CCELL)   (2.1)

Here, CBL, CCELL, and VCELL are the bitline capacitance, the cell capacitance, and
the cell voltage. Note that the bitline is assumed to be precharged to VDD/2. In
general, CBL is much larger than CCELL because a few hundred cells share a bitline.
Therefore, the number of cells per bitline should be determined carefully after
considering the minimum voltage swing requirement for reliable sensing.
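As a quick worked example of Eq. (2.1), the snippet below (with assumed, purely illustrative capacitance and supply values) computes the bitline voltage after charge sharing for a stored "1" and "0" and the resulting sensing margin around the VDD/2 precharge level.

```python
def bitline_after_charge_sharing(c_bl, c_cell, v_cell, vdd):
    """Final bitline voltage from Eq. (2.1); the bitline is precharged to VDD/2."""
    return (c_bl * vdd / 2 + c_cell * v_cell) / (c_bl + c_cell)

VDD = 1.2
C_BL, C_CELL = 200e-15, 20e-15            # assumed values; CBL >> CCELL as stated in the text
v1 = bitline_after_charge_sharing(C_BL, C_CELL, VDD, VDD)   # cell stores '1' (VCELL = VDD)
v0 = bitline_after_charge_sharing(C_BL, C_CELL, 0.0, VDD)   # cell stores '0' (VCELL = 0 V)
swing = v1 - VDD / 2                      # roughly (VDD/2) * CCELL / (CBL + CCELL), ~55 mV here
print(f"read '1': {v1:.3f} V, read '0': {v0:.3f} V, swing: +/-{swing*1000:.1f} mV")
```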
Figure 2.5 depicts a sample DRAM data path. Before read operation, the
equalizer precharges the bitline pair with VDD/2. In read operation, a DRAM cell
increments or decrements the bitline voltage, and the bitline voltage is amplified
by a sense amplifier in each bitline. The amplified voltage will be transferred
to the output (DOUT) through the selection signal (BSel). The read operation
is destructive since the cell voltage after charge sharing becomes the value given by Eq. (2.1).
Therefore, the amplified voltage will also be written back to the selected DRAM
cell through the bitline to maintain the data.
A sample DRAM architecture is shown in Fig. 2.6. When a row is selected for
read operation by turning on a wordline, all the DRAM cells in the selected row
will generate voltage increment or decrement in the bitlines. The sense amplifier
in each bitline will amplify the small voltage change and generate read data. The
read data generated in all the columns cannot be sent to the output ports over one
clock cycle because of the limited data width. Therefore, it is necessary to use
multiple cycles for reading them out. If not, only a part of the read data will be sent
to the outputs. The write operation also happens in the selected columns. However, the access transistors in the unselected columns will also be on, generating voltage increments or decrements in their bitlines. Therefore, it is necessary to activate the sense
amplifiers and write the strong data after amplification back to the corresponding
DRAM cells.
Fig. 2.8 SET, RESET, and read operation of 1T1R ReRAM cell
The SET operation applies VSET to the bitline (BL) and GND at the source line (SL). Turning on the NMOS access transistor will make
current from BL to SL through the ReRAM device. This current will change the
ReRAM state from HRS to LRS. One important parameter to consider is the
ReRAM set voltage (VR-SET ), which is the minimum required voltage across the
ReRAM device for SET. When current flows the 1T1R cell, the drain node of
the access transistor goes up. Therefore, only a part of VSET is observed across
the ReRAM device (Vreram ). For proper SET, Vreram should be larger than VR-SET .
RESET occurs when BL is grounded and VRESET is applied to SL. In this case,
current flows in the opposite direction of the SET operation, and the ReRAM state
switches from LRS to HRS. Like SET, Vreram should be larger than the ReRAM reset voltage (VR-RESET). However, the NMOS access transistor has a larger voltage drop in this direction, so Vreram during RESET can be smaller than Vreram during SET when the same voltage is used at the WL. Therefore, it is necessary to employ a technique that reduces the voltage drop across the access transistor. A boosted WL voltage can reduce the voltage drop at the cost of reliability degradation. ReRAM read operation can
be implemented in two different modes (i.e., voltage mode and current mode). In
the voltage mode, BL is precharged to the read voltage (VREAD) and is discharged at different rates based on the ReRAM state. A sense amplifier detects the discharging
rates and produces an output signal. VREAD needs to be small, so that the ReRAM
state is not disturbed. Since the bias condition of the ReRAM read operation is
similar to SET, VREAD needs to be much smaller than VSET to avoid unwanted SET.
In the current mode, predefined read current (IREAD ) is supplied to the selected
ReRAM cell. This current generates a voltage that depends on the ReRAM state, which is sensed by a sense amplifier. The current needs to be small enough to keep the voltage across the ReRAM device below VR-SET with enough margin. It is also
necessary to regulate the voltage at BL below a certain level to prevent ReRAM
state disturbance.
Figure 2.9 describes a sample 1T1R ReRAM array architecture. A row is selected
by applying high voltage to the selected wordline. During SET and RESET, the
bitlines and the source lines need to be biased properly. The ReRAM architecture in
Fig. 2.9 cannot execute SET and RESET at the same time in the selected row. In the
SET operation, some ReRAM cells can be set while the others remain unchanged.
Fig. 2.9 1T1R ReRAM array architecture
Similarly, some ReRAM cells will be reset in the RESET operation while the
rest will remain unchanged. This occurs because each source line is shared by a
row. SET and RESET can be executed simultaneously when the source lines run
vertically like the bitlines. Here, the bitline and the source line of each column
can be controlled independently, which facilitates SET and RESET over one cycle.
However, this architecture consumes more power than the architecture in Fig. 2.9.
Therefore, the ReRAM array architecture needs to be selected carefully based on
the system requirement. In the read operation, current will flow from each bitline to
the selected source line as shown in Fig. 2.9. The ReRAM states will determine the
magnitude of the current in each bitline. For example, Icell1 will be smaller than Icell2
in Fig. 2.9 since Icell1 and Icell2 are generated by HRS and LRS, respectively. The
current in each bitline will be compared with the average value of Icell1 and Icell2 by
a sense amplifier. The 1T1R ReRAM architecture has faced various challenges in terms of scaling. First, the resistance of the ReRAM devices should be properly
defined so that the voltage across the ReRAM devices is large enough to set or reset.
If the ReRAM resistance is too low, the voltage drop across the access transistor
becomes large. This requires boosted wordline voltage, which will deteriorate the
ReRAM device reliability. The lowest ReRAM resistance value can be determined
by RESET. Another challenge is ReRAM device parameter scaling. It is well known
that ReRAM programming current does not show good scalability. ReRAM set and reset voltages also need to be scaled at a rate similar to CMOS scaling. If ReRAM device parameters are not scalable, additional circuit techniques should address the
scalability issues. Figure 2.10 summarizes LRS, HRS, ReRAM set voltage, and
ReRAM reset voltage reported in the literature. It is obvious that ReRAM set and reset voltages
are still too high when considering the supply voltage levels of the mainstream
CMOS technologies. They must be scaled below 1 V, so that they can be integrated
with advanced CMOS technologies.
Fig. 2.10 Literature survey of LRS, HRS, ReRAM set voltage, and ReRAM reset voltage [31]
Another popular ReRAM array is the crossbar architecture where ReRAM cells
without access transistors are sandwiched between rows and columns as illustrated
in Fig. 2.11 [32–34]. Since no transistor is used, the crossbar architecture provides
higher area efficiency compared to the 1T1R architecture. The ReRAM cell for
the crossbar architecture includes a selector device to cut the sneak current in the
unselected ReRAM cells. Programming and reading in the crossbar architecture
require more careful design considerations because of the sneak paths formed by
unselected ReRAM cells.
Figure 2.12a, b show two popular ReRAM programming schemes in the crossbar
architecture. The selected row and the selected column are biased with the writing voltage (VDD in Fig. 2.12) and GND, or vice versa, depending on the intended programming data. Programming current will flow through the selected cell, switching its resistance state. However, there are additional current paths whose current (i.e., sneak current) is not negligible. These additional current paths, named sneak paths,
can vary when employing different write schemes. In the VDD/2 writing scheme
(Fig. 2.12a), the unselected cells in the selected row and the unselected cells in the
selected column generate the sneak current. However, in the VDD/3 write scheme
Fig. 2.12 Write and read operations in the crossbar: (a) write using VDD/2, (b) write using VDD/3, and (c) read operation
(Fig. 2.12b), all the unselected cells will contribute to the sneak current. However,
the total current required in each row driver will be less than that of the VDD/2 write
scheme. The VDD/3 write scheme has VDD/3 as the potential difference across the
unselected cells in the selected row, while the VDD/2 write scheme has VDD/2
for the same cells. The total amount of the sneak current is also data-dependent.
The programming current and the sneak current should be provided by the wordline
driver in each row. Therefore, the actual array size and the number of programmed
cells per cycle needs to be decided carefully after considering the driving capability
and the area of the driver. Figure 2.12c explains the read operation in the crossbar
architecture. GND is applied to the selected row, and the rest signals are connected
to read voltage (VREAD ). Read current will flow from the bitlines to the grounded
selected row through the selected cells. In principle, all the data in the selected row
can be read out, which requires a sense amplifier in each column. In actual scenarios,
only a part of the row data will be transferred to sense amplifiers. However, all the
columns will still consume read current like SRAM.
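A small counting sketch (assuming an n × m crossbar with a single selected cell, and ignoring data dependence) contrasts the two write schemes discussed above: with VDD/2 biasing only the half-selected cells on the selected row and column are stressed, each at VDD/2, whereas with VDD/3 biasing every unselected cell is stressed, but only at VDD/3.

```python
def crossbar_write_stress(n_rows, n_cols, vdd=1.0):
    """Cells disturbed during a single-cell write in an n_rows x n_cols crossbar.
    VDD/2 scheme: only half-selected cells on the selected row/column are stressed, at VDD/2.
    VDD/3 scheme: every unselected cell is stressed, but only at VDD/3."""
    half_selected = (n_rows - 1) + (n_cols - 1)
    fully_unselected = (n_rows - 1) * (n_cols - 1)
    return {
        "VDD/2": {"stressed_cells": half_selected, "stress_voltage": vdd / 2},
        "VDD/3": {"stressed_cells": half_selected + fully_unselected, "stress_voltage": vdd / 3},
    }

print(crossbar_write_stress(128, 128))    # 254 cells at 0.5 V vs. 16383 cells at ~0.33 V
```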
Figure 2.13 shows a sample data path of ReRAM. When a wordline is selected,
the cells in the selected row will generate bitline voltages. Multiple columns share
a sense amplifier (SA), so a column multiplexer (CSel in Fig. 2.13) will connect the
bitline voltage of the selected column to the sense amplifier for further amplification.
Sense amplifiers need a reference voltage for comparison. The reference voltage
can be provided through an external signal after testing the fabricated ReRAM.
However, this requires comprehensive and time-consuming test sequences, which
is not practical. As shown in Fig. 2.13, it is more desirable to implement on-chip
reference voltage. One way of generating on-chip reference voltage is using ReRAM
replicas. In Fig. 2.13, the Replica ReRAM cell (RCell) is programmed to make the
reference voltage higher than the bitline voltage for LRS and lower than the bitline
voltage for HRS. The programming of RCell can be designed in various ways. One
common way is to use two RCells connected in parallel. One RCell is programmed
with HRS, and the other RCell is programmed with LRS. In this case, the equivalent
resistance becomes as follows.
RREFERENCE = (RHRS × RLRS) / (RHRS + RLRS)   (2.2)
Another way is to use one RCell and program it with (RHRS + RLRS )/2. However,
this requires a complicated control on the programming voltage and the pulse width
for accurate programming. Therefore, it is more desirable to use multiple RCells
programmed with either HRS or LRS.
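A minimal numerical sketch of the two reference options just described, using assumed HRS/LRS values: Eq. (2.2) gives the equivalent resistance of two parallel replica cells, while the single-RCell alternative targets the (RHRS + RLRS)/2 midpoint.

```python
def parallel_reference(r_hrs, r_lrs):
    """Equivalent resistance of one HRS and one LRS replica cell in parallel, Eq. (2.2)."""
    return (r_hrs * r_lrs) / (r_hrs + r_lrs)

R_HRS, R_LRS = 200e3, 10e3                          # assumed HRS/LRS resistances in ohms
r_parallel = parallel_reference(R_HRS, R_LRS)       # ~9.5 kOhm from two parallel RCells
r_midpoint = (R_HRS + R_LRS) / 2                    # ~105 kOhm, the single-RCell programming target
print(f"parallel RCells: {r_parallel/1e3:.1f} kOhm, midpoint target: {r_midpoint/1e3:.1f} kOhm")
```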
Fig. 2.14 PIM macro using common memory cells (standard 6T SRAM/1T1C DRAM/1T1R ReRAM)
Fig. 2.15 Processing a fully connected layer using an SRAM-based PIM macro: (a) a multiplication; (b) a dot-product; (c) a vector-matrix multiplication
Fig. 2.16 Processing a convolutional layer using an SRAM-based PIM macro: (a) a multiplication; (b) a dot-product for a 2D filter; (c) a dot-product for a 3D filter; (d) a vector-matrix multiplication for a 4D filter; (e) after 16 cycles of vector-matrix operations
Three-dimensional (3D) filter weights and input feature maps, composed of multiple channels of 2D
filters and input feature maps, are mapped to the entire column of the PIM macro, as
depicted in Fig. 2.16c. Note that the 3D filter and input feature map can be mapped
to multiple macro columns when the number of filter and input feature map element
pairs is larger than the number of bitcells in a single macro column. Figure 2.16d
expands one more dimension of the convolutional layer processing (i.e., output
channels or the channels of 3D filters), which can be processed in parallel using
multiple columns in the PIM macro. Each column output corresponds to a pixel of
each 2D output feature map. To complete the 3D output feature map, we reuse the same PIM macro over multiple cycles while sliding the window across the input feature map, as illustrated in Fig. 2.16e.
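The mapping just described is essentially an im2col-style reformulation of the convolution; the sketch below (our own illustration with arbitrary sizes, not the referenced macro) stores one flattened 3D filter per macro column and feeds one flattened input window per cycle, so each cycle produces one output pixel for all M output channels.

```python
import numpy as np

C, H, K, M, U = 3, 6, 3, 16, 1            # assumed channel, feature-map, kernel, and stride sizes
E = (H - K) // U + 1                      # output feature maps are E x E
I = np.random.randn(C, H, H)              # input feature maps
W = np.random.randn(M, C, K, K)           # M 3D filters

W_cols = W.reshape(M, -1).T               # each macro column stores one flattened 3D filter (C*K*K weights)
O = np.zeros((M, E, E))
for x in range(E):                        # one sliding-window position per "cycle"
    for y in range(E):
        window = I[:, U*x:U*x+K, U*y:U*y+K].reshape(-1)   # flattened input window driven onto the rows
        O[:, x, y] = window @ W_cols      # all M column outputs: one pixel of every 2D output map
```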
Input and output delivery of the PIM macro adds extra operation latency and energy consumption. Although the I/O and periphery blocks are not emphasized in many state-of-the-art macro designs, it is important to note these key components and their associated challenges.
Among I/O and periphery blocks, a sense amplifier (SA) is one of the most
critical components in many SRAM-based PIM macro designs that utilize bitline discharge for MAC operation. Figure 2.17a is the standard latch-type SA with the minimum number of transistors. However, the voltage drop across the pass transistors limits the usable input voltage range, ultimately reducing the noise margin
and the voltage swing of the bitline. Figure 2.17b is an improved version of the
StrongARM Latch [35] to achieve low static power, produce higher output dynamic
range, and minimize the offset caused by the input differential pair. However, the
circuit operation phases include voltage gain, which introduces an amplified offset
issue from the VN and VP transistor mismatch. The separate SR latch added to the
right part of Fig. 2.17b is one of the techniques to cancel the offset by establishing
different discharge rates from the input pair. Figure 2.17c also provides offset cancellation using programmable capacitors. However, to control the random offset values, the number of capacitor switches increases, which ultimately raises power dissipation and lowers the operation speed.
In many cases, the bitline output requires multi-bit analog-to-digital (A2D) conversion, whereas the SA by itself only functions as an analog comparator. Thus, the ADC is another critical block in PIM macros that can directly affect the performance of the PIM macro.
Figure 2.18a describes the operating principle of a single-slope ADC operation
[37]. For illustration purposes, a column of bitcells is simplified to have 13 PIM
bitcells for a nine-input dot-product and a 2-bit single-slope column ADC. Thus, an
N-bit single-slope ADC requires 2^N − 1 cycles to complete a single data conversion.
The operating principle of the binary-searching ADC [38] is shown in Fig. 2.18b.
The top 384 bitcells are simplified to a black box with a fixed dot-product result
(+27), and the following 96 bitcells are separated into two groups for representing
weight “+1” (white boxes) and weight “−1” (gray boxes) when operating as the
binary-searching ADC. Each group has four input signals via RWLs that control 6,
6, 12, and 24 bitcells, respectively. The binary-searching ADC takes five cycles to
generate a 5-bit output code, as shown at the bottom of Fig. 2.18b.
The 4-bit flash ADC [36] offers a good tradeoff among power consumption, performance, and area. Moreover, the 15 reference voltage levels of the flash ADC can be easily adjusted by changing the voltage input of the resistor ladder, so that the read bitline (RBL) dynamic range can also be tuned easily. Note that
the Flash ADC comprises clocked comparators that can be readily implemented
in the column-pitch of the PIM macro. The connection of 15 comparators to 15 read
bitlines (RBLs), which also serve as part of the computation capacitors, is described in detail in Fig. 2.18c.
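To contrast the conversion latencies mentioned above, the sketch below gives simple behavioral models (our own simplifications, not the referenced circuits): a single-slope ADC sweeps its reference and needs 2^N − 1 compare cycles, while a binary-searching ADC resolves one bit per cycle and needs only N cycles.

```python
def single_slope_adc(v_in, n_bits, v_full_scale=1.0):
    """Single-slope conversion: one comparison per reference step, 2**n_bits - 1 cycles."""
    step = v_full_scale / 2**n_bits
    code = 0
    for cycle in range(2**n_bits - 1):               # sweep the reference one step per cycle
        if v_in > (cycle + 1) * step:
            code += 1
    return code

def binary_search_adc(v_in, n_bits, v_full_scale=1.0):
    """Binary-searching conversion: one bit resolved per cycle, n_bits cycles."""
    code, ref, step = 0, v_full_scale / 2, v_full_scale / 4
    for _ in range(n_bits):
        if v_in >= ref:
            code, ref = (code << 1) | 1, ref + step
        else:
            code, ref = code << 1, ref - step
        step /= 2
    return code

v = 0.37
print(single_slope_adc(v, 4), binary_search_adc(v, 4))   # both give code 5; 15 cycles vs. 4 cycles
```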
Fig. 2.17 Detailed circuit of (a) standard latch-type sense amplifier, (b) an improved version of the StrongARM latch [35], and (c) an offset cancellation latch [36]
Fig. 2.18 Detailed circuit of (a) single-slope column ADC [37]; (b) binary-searching column ADC [38]; (c) 4-bit flash ADC [36]; (d) charge-sharing ADC [39]
While the analog PIM macros present outstanding efficiency numbers, serious
design challenges exist. The most well-known issues are process, voltage, and temperature (PVT) variation-induced computation nonlinearity and DAC/ADC overhead.
Figure 2.19a depicts input offset error of analog circuits in PIM macro (i.e.,
bitcell, SA, and ADC) caused by process variation. Figure 2.19a, left, illustrates the
error distribution of the output ADC code for identical MAC operations. Although
the memory bitcells in the PIM array have a regular structure, the difference in
MAC results exists due to process variation in the fabrication of the memory bitcells. The variation of a single bitcell and of a whole column is shown in Fig. 2.19a, right. The top-right part of Fig. 2.19a describes the distribution of discharge current when one bitcell performs a multiplication using current discharge, and the bottom-right part shows the bitline (BL) voltage distribution after completing the dot-product operation in one column-based neuron. Overall, process variation perturbs the bitline voltage representing the dot-product result and increases the possibility of producing an incorrect output ADC code. In terms of neural networks, the generated output code becomes the new input activation for the next layer and is used to calculate another dot-product. Thus, a wrong output code in one layer propagates through several computations and generates classification errors at the end, reducing the application task's accuracy.
The computation nonlinearity happens when more rows are activated in parallel
to improve the computation efficiency, as shown in Fig. 2.19b. The bitline voltage
representing the dot-product result decreases as more "1"s in the column are added, which limits the dynamic range. The accumulation linearity is significantly degraded if the bitline voltage drops too low, as shown by the red dotted line in Fig. 2.19b, right.
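A toy model of this effect (entirely illustrative; the per-cell discharge step and the clamping level are assumptions): if the bitline can only discharge down to some minimum voltage, the mapping from the number of active "1" rows to bitline voltage is linear at first and then flattens, which is the degraded accumulation linearity described above.

```python
def bitline_voltage(num_ones, vdd=1.0, delta_v=0.03, v_min=0.3):
    """Ideal linear discharge clamped at v_min, modeling the dynamic-range limit."""
    return max(vdd - num_ones * delta_v, v_min)

for k in (0, 5, 10, 20, 30):
    ideal = 1.0 - k * 0.03
    print(k, round(bitline_voltage(k), 2), round(ideal, 2))   # the two diverge once the clamp is hit
```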
The overhead of the digital-to-analog and analog-to-digital converters (DAC/ADC) for data transmission is also a significant concern for PIM macros. As shown in Fig. 2.19c, the DAC/ADC not only consumes large circuit area and energy but also increases the latency of the neural network accelerator. In addition, the typical ADC has a fixed bit-precision, which limits reconfigurability.
On the other hand, digital PIM macros suffer from different critical issues: low area efficiency and high power consumption. Figure 2.20a describes a modern neural network accelerator containing an all-digital processing element (PE) array that processes massive MAC operations synchronously [41]. With the help of a hierarchical memory and a data reuse strategy, this work improves computation efficiency while also saving energy, since memory access energy exceeds the energy of MAC operations. Figure 2.20b illustrates one PIM column with a parallel adder tree, which performs massively parallel accumulation without additional registers to store input activations and partial sums [40]. Note that the bit-serial multiplication also improves energy efficiency at the cost of operation latency. In addition, the fully digital approach avoids the compute nonlinearity and poor scaling of analog circuits. However, the fully digital PE comprises more arithmetic circuits, which not only occupy a larger area but also consume more static/dynamic energy than the bitcells of analog PIM.
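As a behavioral illustration of such an all-digital column (a sketch under our own assumptions, not the circuit of [40]), inputs are streamed bit-serially, each bit plane is multiplied (AND-ed) with the stored weights, the per-cycle partial sums are reduced as an adder tree would, and a shift/add stage weights each bit position.

```python
import numpy as np

def digital_pim_column(inputs, weights, n_bits=4):
    """Bit-serial dot product: one input bit plane per cycle, an adder-tree-style
    reduction per cycle, then shift/add accumulation across bit positions."""
    acc = 0
    for bit in range(n_bits):                        # LSB first, one cycle per bit plane
        bit_plane = (inputs >> bit) & 1              # current bit of every input activation
        partial = int(np.sum(bit_plane * weights))   # 1-bit multiplies reduced by the adder tree
        acc += partial << bit                        # weight the partial sum by its bit position
    return acc

inputs = np.array([3, 7, 2, 5], dtype=np.int64)      # 4-bit unsigned activations
weights = np.array([1, -1, 1, 1], dtype=np.int64)    # stored weights
assert digital_pim_column(inputs, weights) == int(np.dot(inputs, weights))
```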
Fig. 2.19 Challenges of analog PIM macro: (a) process variation; (b) nonlinearity; (c) ADC overhead
Fig. 2.20 (a) Simplified block diagram of a typical digital DNN accelerator (off-chip DRAM: ~100 MB, ~40–100 pJ/byte; global buffers: ~1–100 KB, ~1–10 pJ/byte; typical PE buffers: 32 B–1 KB, ~0.33–1 pJ/byte); (b) a column-based dot-product circuit using digital PIM [40]
References
1. S. Yu, P.-Y. Chen, Emerging memory technologies: Recent trends and prospects. IEEE Solid-
State Circuits Magaz. 8(2), 43–56 (2016)
2. K. Zhang, F1: Embedded memory design for nano-scale VLSI systems, in 2008 IEEE
international solid-state circuits conference—Digest of technical papers (2008), pp. 650–651
3. N. Shibata et al., 1-V 100-MHz embedded SRAM techniques for battery-operated MTC-
MOS/SIMOX ASICs. IEEE J. Solid State Circuits 35(10), 1396–1407 (2000)
4. K. Agawa et al., A bitline leakage compensation scheme for low-voltage SRAMs. IEEE J.
Solid State Circuits 36(5), 726–734 (2001)
5. K. Nii et al., A 90-nm low-power 32-kB embedded SRAM with gate leakage suppression
circuit for mobile applications. IEEE J. Solid State Circuits 39(4), 684–693 (2004)
6. T.-H. Kim et al., A 0.2V, 480kb subthreshold SRAM with 1k cells per bitline for ultra-low
voltage computing. IEEE J. Solid State Circuits 43(2), 518–529 (2008)
7. T.-H. Kim et al., A voltage scalable 0.26V, 64kb 8T SRAM with Vmin lowering techniques
and deep sleep mode. IEEE J. Solid State Circuits 44(6), 1785–1795 (2009)
8. T. Kim et al., Design of a temperature-aware low voltage SRAM with self-adjustable sensing
margin enhancement for high temperature applications up to 300◦ C. IEEE J. Solid State
Circuits 49(11), 2534–2546 (2014)
9. B. Wang et al., Design of an ultra-low voltage 9T SRAM with equalized bitline leakage and
CAM-assisted energy efficiency improvement. IEEE Trans. Circuits Syst. TCAS-I Regul. Pap.
62(2), 441–448 (2015)
10. A. Do et al., 0.2 V 8T SRAM with PVT-aware bit-line sensing and column-based data
randomization. IEEE J. Solid State Circuits 51(6), 1487–1498 (2016)
11. C. Duan et al., Energy-efficient reconfigurable SRAM: Reducing read power through data
statistics. IEEE J. Solid State Circuits 52(10), 2703–2711 (2017)
12. T. Kirihata et al., An 800 MHz embedded DRAM with a concurrent refresh mode, in IEEE int.
solid-state circuits conference (ISSCC), (IEEE, Piscataway, 2004), pp. 206–523
13. M. Kumar et al., A simple and high-performance 130 nm SOI EDRAM technology
using floating-body pass-gate transistor in trench-capacitor cell for system-on-a-chip (SoC)
applications, in IEEE int. electron devices meeting (IEDM), (IEEE, Piscataway, 2003), pp.
17.4.1–17.4.4
14. S. Yamamichi et al., A stacked capacitor technology with ECR plasma MOCVD
(Ba,Sr)TiO/sub 3/and RuO/sub 2//Ru/TiN/TiSi/sub x/ storage nodes for Gb-scale DRAMs.
IEEE Trans. Electron Devices 44(7), 1076–1083 (1997)
15. S. Yamamichi et al., An ECR MOCVD (Ba,Sr)TiO/sub 3/based stacked capacitor technology
with RuO/sub 2//Ru/TiN/TiSi/sub x/storage nodes for Gbit-scale DRAMs, in IEEE int. electron
devices meeting (IEDM), (IEEE, Piscataway, 1995), pp. 119–122
16. G. Aichmayr et al., Carbon/high-k trench capacitor for the 40nm DRAM generation, in IEEE
symp. on VLSI technology, (IEEE, Piscataway, 2007), pp. 186–187
17. T.S. Boscke et al., Tetragonal phase stabilization by doping as an enabler of thermally stable
HfO2 based MIM and MIS capacitors for sub 50nm deep trench DRAM. Int. Electron Devices
Meet. 2006, 1–4 (2006)
18. H.-S.P. Wong et al., Metal-oxide RRAM. Proc. IEEE 100(6), 1951–1970 (2012)
19. M.-F. Chang et al., Low VDDmin swing-sample-and-couple sense amplifier and energy-
efficient self-boost-write-termination scheme for embedded ReRAM macros against resistance
and switch-time variations. IEEE J. Solid State Circuits 50(11), 2786–2795 (2015)
20. S. Zuloaga et al., Scaling 2-layer RRAM cross-point array towards 10 nm node: A device-
circuit co-design, in IEEE int. symp. on circuits and systems (ISCAS), (2015), pp. 193–196
21. A. Bricalli et al., SiOx-based resistive switching memory (RRAM) for crossbar storage/select
elements with high on/off ratio, in IEEE int. electron devices meeting (IEDM), (2016), pp.
4.3.1–4.3.4
40 C. Yu et al.
22. Y. Chen et al., Reconfigurable 2T2R ReRAM architecture for versatile data storage and
computing in-memory. IEEE Trans. VLSI Syst. 28(12), 2636–2649 (2020)
23. H.Y. Lee et al., Low power and high speed bipolar switching with a thin reactive Ti buffer
layer in robust HfO2 based RRAM, in IEEE int. electron devices meeting (IEDM), (IEEE,
Piscataway, 2008), pp. 1–4
24. C. Zambelli et al., Electrical characterization of read window in ReRAM arrays under different
SET/RESET cycling conditions, in IEEE 6th int. memory workshop, (IEEE, Piscataway, 2014),
pp. 1–4
25. E. Vianello et al., Resistive memories for ultra-low-power embedded computing design, in
IEEE int. electron devices meeting (IEDM), (IEEE, Piscataway, 2014), pp. 6.3.1–6.3.4
26. A. Fantini et al., Intrinsic program instability in HfO2 RRAM and consequences on program
algorithms, in IEEE int. electron devices meeting (IEDM), (IEEE, Piscataway, 2015), pp. 7.5.1–
7.5.4
27. Z.-Q. Wang et al., Cycling-induced degradation of metal-oxide resistive switching memory
(RRAM), in IEEE int. electron devices meeting (IEDM), (IEEE, Piscataway, 2015), pp. 7.6.1–
7.6.4
28. H.B. Lv et al., BEOL based RRAM with one extra-mask for low cost, highly reliable embedded
application in 28nm node and beyond, in IEEE int. electron devices meeting (IEDM), (IEEE,
Piscataway, 2017), pp. 2.4.1–2.4.4
29. P.-Y. Chen et al., Design tradeoffs of vertical RRAM-based 3-D cross-point array. IEEE Trans.
Very Large-Scale Integr. Syst. 24(12), 3460–3467 (2016)
30. P.-Y. Chen, Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array
design. IEEE Trans. Electron Devices 62(12), 4022–4028 (2015)
31. L. Lu et al., ReRAM device and circuit co-design challenges in nano-scale CMOS technology,
in 16th IEEE Asia Pacific conference on circuits and systems, (IEEE, Piscataway, 2020), pp.
213–216
32. Y. Youn et al., Investigation on the worst read scenario of a ReRAM crossbar array. IEEE
Trans. Very Large Scale Integr. Syst. 25(9), 2402–2410 (2017)
33. H. Lim et al., ReRAM crossbar array: Reduction of access time by reducing the parasitic
capacitance of the selector device. IEEE Trans. Electron Devices 63(2), 873–876 (2016)
34. P. Ma et al., High-performance InGaZnO-based ReRAMs. IEEE Trans. Electron Devices 66(6),
2600–2605 (2019)
35. B. Razavi, The StrongARM latch [a circuit for all seasons]. IEEE Solid-State Circuits Magaz.
7(2), 12–17 (2015)
36. M.E. Sinangil et al., A 7-nm compute-in-memory SRAM macro supporting multi-bit input,
weight and output and achieving 351 TOPS/W and 372.4 GOPS. IEEE J. Solid State Circuits
56(1), 188–198 (2021)
37. C. Yu, T. Yoo, T. Kim, K. Chai, B. Kim, A 16K current-based 8T SRAM compute-in-memory
macro with decoupled read/write and 1-5bit column ADC, in IEEE custom integrated circuits
conference (CICC), (IEEE, Piscataway, 2020), pp. 1–4
38. C. Yu, K. Chai, T. Kim, B. Kim, A zero-skipping reconfigurable SRAM in-memory computing
macro with binary-searching ADC, in IEEE 47th European solid-state circuits conference
(ESSCIRC), (IEEE, Piscataway, 2021), pp. 1–4
39. A. Biswas, A.P. Chandrakasan, CONV-SRAM: An energy-efficient SRAM with in-memory
dot-product computation for low-power convolutional neural networks. IEEE J. Solid State
Circuits 54(1), 217–230 (2019)
40. Y.-D. Chih et al., 16.4 an 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-
precision compute-in memory macro in 22nm for machine-learning edge applications, in 2021
IEEE international solid-state circuits conference (ISSCC), (IEEE, Piscataway, 2021), pp. 252–
254
41. B. Zimmer et al., A 0.32–128 TOPS, scalable multi-chip-module based deep neural network
inference accelerator with ground-referenced signaling in 16 nm. IEEE J. Solid State Circuits
55(4), 920–932 (2020)
Chapter 3
SRAM-Based Processing-in-Memory
(PIM)
Hyunjoon Kim, Chengshuo Yu, and Bongjin Kim

3.1 Introduction
H. Kim · C. Yu
Nanyang Technological University, Singapore, Singapore
e-mail: [email protected]; [email protected]
B. Kim
University of California Santa Barbara (UCSB), Santa Barbara, CA, USA
e-mail: [email protected]
A standard 6T SRAM cell, a conventional embedded cache memory cell for storing
single-bit data, can process a binary MAC operation in memory. For implementing
a MAC operation in the SRAM cell, a binary input is applied to a WL, and a binary
weight is stored in the internal storage nodes of the SRAM cell, Q and Qb, as shown
in Fig. 3.1a. A DC low signal is applied to the WL to represent a zero input, and
a short positive pulse is applied to the WL to represent +1. Note that a pulse is
required (instead of a DC high) to accumulate element-wise multiplication results
from the SRAM cells sharing the same bitlines. If an SRAM storage node Q stores
a “high” (or a “low”), the stored weight is +1 (or −1). As soon as the input (either
a DC low or a short positive pulse) is applied to the WL, a binary multiplication
in SRAM cell is performed right away based on the input and the stored weight
values, and it contributes to the accumulation result (i.e., a voltage difference in BL
and BLb). For instance, if the input is 0, the SRAM cell is disabled due to the DC
low signal in the WL, and hence, the bitline voltages do not change, as shown in
Fig. 3.1b. If the multiplication result is +1 or −1, one of the bitlines (BL or BLb) will discharge its
capacitance by developing a discharging path between the bitline and the ground, as
shown in Fig. 3.1c, d. As a result, the bitline voltage will drop, and the magnitude
of the voltage drop is proportional to the pulse width of the input signal. Both BL
and BLb are initially precharged to a high voltage level (VDD in Fig. 3.1), and the
magnitude of the voltage drop due to a single SRAM cell is ΔV (i.e., a unit voltage
drop per SRAM cell).
Fig. 3.1 A standard 6T SRAM cell as a PIM cell: (a) SRAM cell schematic with input and output for PIM operation; binary MAC operations when (b) the input is zero; (c) the input is +1, and the weight is +1; (d) the input is +1, and the weight is −1
The element-wise binary multiplication results (i.e., a bitline voltage drop of as much as ΔV for +1 and −1 multiplication results) from each bitcell in the same column are accumulated, resulting in an aggregated voltage drop in both
bitlines. Figure 3.2, left, illustrates a column of 6T SRAM cells that accumulates
element-wise bitline voltage drops when the number of SRAM cells having the
multiplication results of +1 (or −1) is P (or N). Based on P and N values, we
can estimate the column accumulation result as a bitline voltage difference, V(BL)
− V(BLb) = (P − N)•ΔV. Figure 3.2, right, plots the BL (or BLb) voltage as a
function of P (or N). Note that a dynamic range is set to ensure a linear accumulate
operation and no disturbance issue (i.e., a false SRAM write operation due to a wide
bitline dynamic range).
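To make the accumulation model concrete, the short Python sketch below mirrors the relation V(BL) − V(BLb) = (P − N)·ΔV described above. The supply voltage, unit drop ΔV, the clamping voltage, and the convention that BL collects the −1 results while BLb collects the +1 results are illustrative assumptions rather than values from any particular design.

```python
def column_accumulate(inputs, weights, vdd=1.0, dv=0.01, v_min=0.4):
    """Behavioral model of a 6T SRAM PIM column.

    inputs:  list of 0/1 wordline values (0 = DC low, 1 = short pulse)
    weights: list of +1/-1 values stored in the cells (Q high = +1)
    Returns (V_BL, V_BLb, estimated dot product).
    """
    assert len(inputs) == len(weights)
    # P: cells whose multiplication result is +1, N: cells whose result is -1.
    p = sum(1 for x, w in zip(inputs, weights) if x == 1 and w == +1)
    n = sum(1 for x, w in zip(inputs, weights) if x == 1 and w == -1)
    # Each active cell discharges its bitline by one unit drop dv; the drop is
    # clamped to model the limited, disturb-free bitline dynamic range.
    v_bl  = max(vdd - n * dv, v_min)    # BL  discharged by the '-1' results
    v_blb = max(vdd - p * dv, v_min)    # BLb discharged by the '+1' results
    dot   = round((v_bl - v_blb) / dv)  # (P - N) recovered from the difference
    return v_bl, v_blb, dot

if __name__ == "__main__":
    x = [1, 1, 0, 1, 1]
    w = [+1, -1, +1, +1, -1]
    print(column_accumulate(x, w))  # dot product = +1 - 1 + 0 + 1 - 1 = 0
```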
The standard 6T SRAM cell uses bitlines for write and read (or compute for PIM)
operations. As a result, there is a disturbance issue that could overwrite SRAM cells
with unintended values. Recently, custom SRAM cells with extra transistors have
been developed and used for processing MAC operations to prevent the SRAM cell
read disturbance issue by decoupling the SRAM write and read ports.
Fig. 3.2 A column of 6T SRAM cells accumulating element-wise bitline voltage drops (left) and the BL (or BLb) voltage, VDD − N·ΔV (or VDD − P·ΔV), as a function of the number of −1 (or +1) multiplication results N (or P) (right)
Fig. 3.3 A foundry 8T SRAM cell with a decoupled read port (RWL, RBL) as a PIM unit cell [5]
An 8-transistor (8T) foundry SRAM cell with a decoupled read port was used as a PIM unit cell [5]. Two extra NMOS transistors and an additional read wordline/read bitline pair (RWL, RBL) have been added to decouple the read operation from the write path, as shown in Fig. 3.3. The two
NMOS transistors are connected in series, and they are used to create a read bitline
(RBL) discharging path to the ground. When both the internal SRAM node (Qb) and the read wordline (RWL) input are high, the RBL discharging path is enabled. While
the foundry 8T SRAM cell prevents the SRAM read disturbance issue, there are
drawbacks, including a single-ended read operation and a larger bitcell area than
the 6T standard cell.
A custom 8T SRAM cell has been developed [6] to improve the read operation
through differential accumulation nodes. As illustrated in Fig. 3.4, two extra NMOS
transistors are added to realize a differential read bitline (RBL and RBLb). Instead of
connecting the NMOS transistors to the ground, an input node, RWL, is connected
to the source nodes of both the read NMOS transistors. Therefore, a discharging
path is created when the RWL is low and the internal circuit node (Q or Qb) is high.
The dynamic range of the bitline discharge is improved from the standard 6T
design while the size of the bitcell is increased. Aside from the area increase, short
pulse-width control for the RWL is an issue, while the data integrity concern is
resolved by the separate read/write ports. The SRAM supply is lowered, improving the linearity and reducing the energy consumption, while the precharge voltage for RWL and RBL/RBLb is set to a higher voltage to guarantee the operation linearity and to resolve leakage concerns.
Fig. 3.4 A custom 8T SRAM PIM cell with differential decoupled read ports [6]
Fig. 3.5 A custom 10T SRAM PIM cell with hierarchical bitlines (LBLT, LBLF) [10]
The custom 10T SRAM cell shown in Fig. 3.5 provides improvements similar to the custom 8T SRAM in terms of the read operation, with differential read and decoupled read/write ports; however, it uses extra NMOS transistors to load the multiplication result onto local bitlines (LBLT and LBLF) instead of using the read bitline directly as the accumulation node. The 10T implementation further improves the bitline dynamic range and also resolves the data integrity issue, while sacrificing the bitcell area.
A custom dual 7T SRAM was developed from the custom 8T design to enable
reconfigurable weight with 3–15 precision levels in the analog PIM macro, while
also de-coupling the write/read operations as shown in Fig. 3.6. The dual 7T
can store ternary (3-level) weight values that can form multiple SRAM stacks to
represent up to 15-level weight values (3× 7T SRAMs). Also, zero-skipping is
implemented for both zero-weight and zero-input for further energy reduction.
In summary, the dual 7T SRAM keeps strengths such as the differential read scheme and the improved bitline dynamic range (vs. 6T), with the added features of precision reconfigurability and zero-input/weight skipping, while sacrificing the bitcell area.
The SRAM-based PIM with bitline-discharging operations suffers from a limited dynamic range. As the bitline dynamic range increases, the accumulation linearity degrades and the risk of a disturbance (false write) grows. Custom SRAM PIM cells using voltage-mode accumulation (Fig. 3.8) have been developed to mitigate these issues by driving the read bitlines with CMOS pull-up/pull-down paths instead of discharging a precharged bitline.
Fig. 3.6 A dual 7T SRAM PIM cell with separate read bitlines (RBLL, RBLR) [11]
Fig. 3.8 Custom SRAM-based PIM cells for voltage-mode accumulation: (a) single-ended [7] and (b) differential accumulation [8]
Fig. 3.9 Charge-domain SRAM-based PIM cells: (a) charge-sharing-based; (b) charge-redistribution-based (or capacitive-coupling-based) accumulation
Charge-domain SRAM-based PIM cells [9, 25] have been developed to minimize
the residual analog nonidealities. Instead of using pull-down NMOS transistors (for
bitline discharging) or CMOS (pull-up or pull-down) drivers (for voltage mode
accumulation), the charge-domain SRAM-based PIM cells use passive capacitors
for sharing or redistributing charges. Figure 3.9a shows a PIM cell for charge-
sharing-based accumulation. The cell requires two read wordlines (RWL and
RWLb), a read bitline (RBL), a unit capacitor, and three switches. Figure 3.9b shows
a charge-domain PIM cell with a charge-redistribution-based accumulation using
two read wordlines (RWL and RWLb), a read bitline (RBL), two switches, and a
unit capacitor.
Despite the recent efforts, there are more challenges in the design of analog
and mixed-signal SRAM-based PIM macros, such as data conversion overhead
(i.e., DAC for input conversion and ADC for output conversion), limited recon-
figurability, and the residual analog nonidealities (device mismatch, variation, and
nonlinearity). To overcome such limitations, digital SRAM-based PIM cells are
developed [22, 25]. Instead of using analog MAC circuits, the digital PIM cells
use all-digital MAC circuits using an XNOR (or AND) gate and a full-adder.
Hence, the digital PIM is free from analog nonidealities and data conversion
overhead challenges.
Fig. 3.10 Digital SRAM-based PIM cells: (a) a fixed weight precision PIM cell [25]; (b) a reconfigurable weight precision PIM cell [22]
Figure 3.10a shows a digital PIM cell composed of a standard 6T SRAM cell and a NOR gate working as an AND gate for a bitwise multiplier.
The accumulation is performed in a digital adder tree at the macro-level. Figure
3.10b shows a reconfigurable digital PIM cell that embeds a bitwise XNOR-gate
multiplier and a full-adder for accumulation. Note that a digital PIM macro based
on the reconfigurable PIM cell can be reconfigured to operate with a 1–16b variable
weight precision. A bit-serial computing scheme is applied to the inputs of both digital PIM cells, saving significant area compared to conventional bit-parallel computing. The bit-serial computing is also intrinsically reconfigurable, as the number of bit-serial operation cycles determines the serialized input bit precision.
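The bit-serial scheme can be summarized functionally in a few lines of Python: each cycle processes one input bit against the stored multi-bit weights, and the per-cycle partial sums are shift-accumulated, so the number of cycles directly sets the input precision. The function below is only a behavioral sketch, not the circuit-level data path of [22] or [25].

```python
def bit_serial_mac(inputs, weights, in_bits=4):
    """Bit-serial MAC: inputs are unsigned in_bits-wide integers fed LSB-first,
    one bit per cycle; weights are multi-bit integers stored in the array.
    Each cycle performs only 1-bit products, and the per-cycle partial sums
    are shift-accumulated to rebuild the full-precision result."""
    acc = 0
    for cycle in range(in_bits):                 # one cycle per input bit
        bit_slice = [(x >> cycle) & 1 for x in inputs]
        partial = sum(b * w for b, w in zip(bit_slice, weights))  # 1b x Wb dot product
        acc += partial << cycle                  # weight the cycle by 2^cycle
    return acc

if __name__ == "__main__":
    xs = [3, 7, 0, 12]           # 4b inputs
    ws = [2, -1, 5, 3]           # weights stored in the PIM cells
    ref = sum(x * w for x, w in zip(xs, ws))
    assert bit_serial_mac(xs, ws) == ref
    print(bit_serial_mac(xs, ws), ref)
```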
Figure 3.11 shows four columns of foundry 8T SRAM cells for processing a dot-
product computation between 4b inputs and 4b weights stored in the SRAM cells. A
4b weight is stored bit-by-bit into four SRAM cells in the same row. A 4b serialized
input is applied to a horizontal RWL as multiple positive short pulses.
Fig. 3.11 Four columns of foundry 8T SRAM cells for a dot-product PIM operation between 4b inputs and 4b weights [5]
The multiple
pulse scheme can guarantee better linearity than the pulse-width modulation (PWM)
scheme. Each column first accumulates element-wise multiplication results based
on the bitline discharging method. Then all four column accumulation results are
combined by charge-sharing across the bitlines from the four columns.
The detailed bitline-discharging and charge-sharing operations using four
SRAM-based PIM columns are illustrated in Fig. 3.12. It consists of four-step
operations: (1) RBL precharging; (2) RBL discharging (column accumulations);
(3) charge-sharing; (4) analog-to-digital conversion of the combined accumulation
results. RBL[3:0] is first precharged to a high level (step 1) and then discharged based on the multiplication results between a series of multiple input pulses (RWL[63:0]) and the weights stored in each column (step 2). After the
column-by-column accumulation based on RBL discharging, charge-sharing (step
3) combines the accumulated results. Finally, the combined analog voltage is
converted to a digital output code (step 4).
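A behavioral sketch of the four-step operation is given below, assuming an ideal unit voltage drop per pulse, the 8C/4C/2C/1C binary-weighted capacitors of Fig. 3.12, and an ideal converter in place of the 4b flash ADC. The numeric constants are placeholders chosen only to illustrate the precharge, discharge, charge-share, and convert flow.

```python
def foundry8t_mav(inputs, weights, vdd=0.8, dv=0.004):
    """Behavioral sketch of the 4-step MAV operation over four RBL columns.

    inputs  : unsigned 4b activations, applied as that many RWL pulses
    weights : unsigned 4b weights, stored bit-by-bit across the four columns
    Step 1: precharge; Step 2: per-column discharge; Step 3: charge-sharing
    across binary-weighted caps (8C/4C/2C/1C); Step 4: conversion."""
    caps = [8, 4, 2, 1]                                   # MSB..LSB column caps
    rbl = [vdd] * 4                                       # step 1: precharge
    for col, cap in enumerate(caps):                      # step 2: discharge
        bit = 3 - col                                     # column 0 holds the MSB
        hits = sum(x for x, w in zip(inputs, weights) if (w >> bit) & 1)
        rbl[col] = vdd - hits * dv                        # pulses x stored bit
    v_shared = sum(c * v for c, v in zip(caps, rbl)) / sum(caps)   # step 3
    # Step 4: an ideal converter recovering the dot product from the averaged
    # voltage (the real design uses a 4b flash ADC, i.e., a coarser quantization).
    dot_est = round((vdd - v_shared) / dv * sum(caps))
    return rbl, v_shared, dot_est

if __name__ == "__main__":
    xs = [3, 1, 2, 0]
    ws = [5, 9, 2, 7]
    print(foundry8t_mav(xs, ws))
    print("reference dot product:", sum(x * w for x, w in zip(xs, ws)))
```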
Figure 3.13 shows a PIM macro architecture using the foundry 8T SRAM with a column-wise multiply-and-average (MAV) scheme that uses 64× 4b inputs and 16× 4b weights in a single cycle to produce 16× 4b outputs. Each group of four RBLs (i.e., 4× SRAMs in the same row realizing a 4b weight) shares one 4b flash ADC, while the
RWL counters produce pulse control signals to realize an input precision of 4b.
Fig. 3.12 Four-step bitline-discharging and charge-sharing operations to combine accumulation results from four columns [5]: Step 1, RBL precharging; Step 2, RBL discharging; Step 3, charge-sharing; Step 4, analog-to-digital conversion (transient voltages at nodes A, B, C, and D)
The RWL
voltage is sampled through the compensation cap, then the averaging is processed
by charge sharing from the binary-weighted computation caps (included in the
flash ADC), while the inherent capacitance of the sense amplifier inside the 4b flash ADCs is used as the unit capacitance that effectively produces the MAV result
(averaged voltage) at the output. The architecture achieves high energy efficiency
using advanced technology and novel computation blocks; however, the design
could not completely resolve the limited dynamic range of the bitline as well as
the process variation induced nonlinearity.
Figure 3.14 illustrates a column-based circuit for processing a dot-product using
custom 8T SRAM PIM cells (Fig. 3.6) [6] with differential read ports for bitline-
discharging-based accumulation. A column consists of a precharge PMOS circuit,
128× SRAM PIM cells (64× for dot-product, 32× for ADC, and 32× for offset
calibration) and a sense amplifier (SA) for a single-slope column ADC operation.
Fig. 3.13 A PIM macro architecture using the foundry 8T SRAM with the column-wise multiply-and-average (MAV) scheme: RWL drivers, compensation caps, 4b flash ADCs, and 16× 4b outputs [5]
Figure 3.14, right, plots transient waveforms for maximum and minimum RBL
voltages based on the number of cells discharging RBL or RBLb. The bitline
capacitances are discharged for a short RWL negative pulse width.
In this particular design, the custom 8T SRAM offered reduced ADC overhead
by using the replica bitcells, improved the bitline dynamic range compared to the
standard 6T SRAM and addressed variation-induced nonlinearity through offset
calibration blocks. On the other hand, the dynamic range improvement is not
significant, ADC overhead is not completely resolved, and the parallel MAC
utilization is reduced by half due to the bitcells that are assigned as column ADCs
and offset calibration blocks.
A PIM macro architecture using the custom 8T SRAM is shown in Fig. 3.15,
and its column MAC structure is illustrated in Fig. 3.14. 128× input rows are
divided into three functional blocks for inputs, ADC reference, and offset calibration
to perform column-wise MAC operation to output 128× 1b outputs. The custom
8T SRAM macro utilizes wordline pulse width modulation and reconfigurable
ADC precision through replica PIM cells to avoid larger ADC overhead while
sacrificing the input parallelism. Compared to the architecture in Fig. 3.13, the
custom 8T SRAM macro provides higher area efficiency from the non-flash ADC
implementation and higher number of input columns.
Single-ended voltage-mode SRAM PIM cells [7] are used for processing a dot-product in a column-based memory array, as shown in Fig. 3.16. XNOR-based multiplications are performed between the stored binary weights (in a 6T standard SRAM cell embedded in each PIM cell) and the inputs applied on the read wordlines, which are connected to each cell's pull-up and pull-down read paths.
Fig. 3.14 A column-based dot-product circuit using custom 8T SRAM PIM cells with an embedded single-slope column ADC based on replica bitcells [6]
Fig. 3.15 A PIM macro architecture of a 128 × 128 custom 8T SRAM array with input, ADC reference, and offset calibration rows and column sense amplifiers [6]
Fig. 3.16 A column-based dot-product circuit using single-ended voltage-mode SRAM PIM cells in Fig. 3.8a and the measured accumulated RBL voltage [7]
Fig. 3.17 A PIM macro architecture of a 256 × 64 single-ended voltage-mode SRAM cell array with a 64:1 analog mux and a 3.46bit flash ADC producing a 10b thermometer code output [7]
In the differential voltage-mode design [8] (Fig. 3.8b), each PIM cell drives RBL and RBLb through pull-up or pull-down paths according to its XNOR multiplication result. Figure 3.18, bottom-left, shows the equivalent circuits rep-
resenting the pull-up and pull-down resistors that eventually determine the voltages
for RBL and RBLb. The Monte-Carlo simulation result of the pseudo-differential
RBL voltage (i.e., V(RBL)-V(RBLb)) is shown in Fig. 3.18, bottom-right. Note
that a dynamic range is doubled and the transfer characteristic is symmetric while
residual nonlinearities and variations are observed.
The differential voltage-mode SRAM PIM macro architecture is shown in Fig. 3.19.
Similar to the custom 8T SRAM PIM, 128× inputs are divided into the same three
functional parts in row-wise compute scheme. Although the operational difference
from the custom 8T SRAM PIM exists due to the voltage mode operation as opposed
to the bitline discharge, a similar performance trade-off of reduced parallelism exists
from the column bitcells that are assigned for row ADC and offset calibration. Also, the low-precision weight SRAMs limit application mapping, and the PVT-variation-induced nonlinearity causes output reliability issues.
Compared to the architecture described in Fig. 3.17, differential voltage-mode
SRAM PIM provides higher throughput due to the elimination of the output analog
MUX and minimized ADC overhead from the replica bitcells that realize parallel
row ADCs.
Passive capacitors have been used to implement charge-domain SRAM-based
PIM macros to minimize analog nonidealities in processing accumulations using
bitline-discharging or voltage-mode accumulation. Figure 3.20 shows a column
of custom SRAM PIM cells for charge-sharing-based accumulation. A PIM cell
consists of a standard 6T SRAM cell, a pair of switches, and two read wordline
inputs (RWL and RWLb) for an XNOR-based binary multiplication. A unit passive
capacitor and a switch are used for sharing charges across the bitcells in the same column.
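The two charge-domain accumulation styles of Figs. 3.20 and 3.21 can be modeled with simple capacitor arithmetic, as in the Python sketch below. The ±1 XNOR convention, the 1 V drive level, and the ideal (lossless) switches are assumptions; under these ideal assumptions both styles settle to the same N/256-style capacitive-divider voltage, which is precisely why the charge-domain approach minimizes analog nonidealities.

```python
def xnor_products(inputs, weights):
    """Binary XNOR multiplication: +1 when input and weight agree, else -1."""
    return [1 if x == w else -1 for x, w in zip(inputs, weights)]

def charge_share_rbl(inputs, weights, v_high=1.0):
    """Charge-sharing accumulation (Fig. 3.20 style): each cell charges its unit
    cap to v_high for a +1 product or to 0 V for a -1 product, then all caps are
    shorted onto RBL, which settles to the average of the cell voltages."""
    prods = xnor_products(inputs, weights)
    n_plus = sum(1 for p in prods if p == +1)        # N caps charged to v_high
    return n_plus * v_high / len(prods)

def charge_redistribution_rbl(inputs, weights, v_high=1.0):
    """Charge-redistribution / capacitive-coupling accumulation (Fig. 3.21 style):
    the bottom plate of each unit cap is driven to v_high or 0 V, so RBL sees a
    capacitive divider giving V_RBL = N/total * v_high (e.g., N/256 for 256 rows)."""
    prods = xnor_products(inputs, weights)
    n_plus = sum(1 for p in prods if p == +1)
    c_up = n_plus                       # N*Cu driven to v_high
    c_dn = len(prods) - n_plus          # (total - N)*Cu held at 0 V
    return v_high * c_up / (c_up + c_dn)

if __name__ == "__main__":
    import random
    random.seed(0)
    x = [random.randint(0, 1) for _ in range(256)]
    w = [random.randint(0, 1) for _ in range(256)]
    print(charge_share_rbl(x, w), charge_redistribution_rbl(x, w))
```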
Fig. 3.18 A column-based dot-product circuit using differential voltage-mode SRAM PIM cells in Fig. 3.8b and the Monte-Carlo simulated differential RBL voltage [8]
Fig. 3.19 A PIM macro architecture of a 128 × 128 fused 6T+XNOR (differential voltage-mode) SRAM cell array [8]
Fig. 3.20 A column-based dot-product circuit using SRAM-based PIM cells in Fig. 3.9a for accumulation based on charge-sharing and the resulting RBL voltage
Fig. 3.21 A column-based dot-product circuit using SRAM-based PIM cells in Fig. 3.9b for accumulation based on charge-redistribution (or capacitive coupling); the equivalent capacitive divider gives VRBL = N/256 [V]
Fig. 3.23 10T SRAM-based PIM architecture: (a) a row-wise charge sharing architecture; (b) its multiply-and-average (MAV) operation diagram [10]
Fig. 3.24 A PIM macro architecture of a 256 × 64 10T SRAM cell array with 64× 7b inputs and 16× 7b outputs [10]
Fig. 3.25 A digital SRAM PIM dot-product circuit using 256 × 4 PIM cells and an adder tree [25]
Despite the larger SRAM area, the area efficiency did not suffer, thanks to the small size of the charge-sharing ADC (CSH_ADC). However, hardware scalability, low-precision weight operation, and analog variation issues remain as design challenges.
Analog PIM works [1–4, 13, 16] achieve outstanding energy efficiency. However, analog computing issues such as ADC/DAC overhead
and PVT variation-induced nonlinearities remain as major concerns. Digital PIM
architectures address both issues by directly processing the digital input bits without
the data conversion and utilizing the binary abstraction that reduces sensitivity to
any physical variation.
Figures 3.25 and 3.26 illustrate a digital PIM macro [25] that utilizes 256× 4b weights per column, producing 64× 12b MAC outputs. Each PIM cell is composed of a fused 6T SRAM and a two-input NOR gate producing a binary multiplication result, while the accumulate operation is processed separately in a dedicated adder tree that takes the 256× 4b multiply results as inputs to produce a single 12b output. Although the weight bit precision is fixed to 4b, the bit-width can be further reconfigured to 8b, 12b, and 16b by combining multiple macros, with an area trade-off.
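A compact functional sketch of this fixed-precision digital column is shown below, assuming a 1b (bit-serial) input per row, 4b unsigned weights, and a pairwise adder tree modeled in software. It is meant only to illustrate the AND-then-adder-tree data path, not the exact pipeline of [25].

```python
def adder_tree(values):
    """Pairwise adder-tree reduction, as in the macro-level accumulator."""
    while len(values) > 1:
        if len(values) % 2:                     # pad odd levels with a zero
            values = values + [0]
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def digital_pim_column(input_bits, weights4b):
    """One cycle of the fixed-precision digital PIM column: each cell ANDs its
    stored 4b weight with the 1b input of its row, and the 256 4b products are
    reduced by the adder tree into a single 12b partial sum."""
    assert len(input_bits) == len(weights4b) == 256
    products = [w if b else 0 for b, w in zip(input_bits, weights4b)]  # 1b x 4b
    psum = adder_tree(products)
    assert psum < (1 << 12)                     # 256 x 15 = 3840 fits in 12 bits
    return psum

if __name__ == "__main__":
    import random
    random.seed(1)
    bits = [random.randint(0, 1) for _ in range(256)]
    ws = [random.randint(0, 15) for _ in range(256)]
    print(digital_pim_column(bits, ws))
```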
Figures 3.27 and 3.28 present another digital PIM macro [22] with fully reconfigurable inputs, weights, and outputs. Each PIM cell is composed of a 6T SRAM, an XNOR gate, and a full-adder. The PIM cells can be stacked together to form a 1–16b unit column MAC, achieving a reconfigurable processing element (PE). Despite the regular design of the unit PIM cells in a multi-bit column MAC, a different function is assigned to each cell depending on its location within the unit column.
Fig. 3.26 A PIM macro architecture of a 256 × 256 digital SRAM PIM cell array with adder trees [25]
Fig. 3.27 A reconfigurable 1–16b unit column MAC formed by stacking digital PIM cells [22]
For example, the PIM cell located at the top of the column represents the LSB weight, and the gray cells shown in Fig. 3.26 are accumulation-only PIM cells. The gray PIM cells produce sign-extended outputs as well as partial-sum propagation through
all the columns in the macro array. With 128 columns, each unit column MAC requires 7 accumulation-only PIM cells to propagate all the partial sums to the output. A challenge associated with the digital SRAM PIM [22] is the hardware redundancy caused by enabled/disabled memory and compute blocks in the regular structure that processes the reconfigurable MAC operation. As a result of the redundant circuit blocks, the unit PIM cell becomes large and the SRAM capacity of the macro is degraded.
Fig. 3.28 A PIM macro of 128 × 128 reconfigurable digital SRAM PIM cells [22]
3.4 Summary
References
1. C. Eckert et al., Neural cache: Bit-serial in-cache acceleration of deep neural networks, in
ACM/IEEE 45th annual international symposium on computer architecture (ISCA), (ACM,
New York, 2018), pp. 383–396
2. J. Wang et al., 14.2 a compute SRAM with bit-serial integer/floating-point operations for
programmable in-memory vector acceleration, in 2019 IEEE international solid-state circuits
conference - (ISSCC), (IEEE, Piscataway, 2019), pp. 224–226
3. X. Si et al., 24.5 a twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based
machine learning, in 2019 IEEE international solid-state circuits conference - (ISSCC), (IEEE,
Piscataway, 2019), pp. 396–398
4. Q. Dong, S. Jeloka, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, M. Yasuda, D. Blaauw,
D. Sylvester, A 4 + 2T SRAM for searching and in-memory computing with 0.3-V VDDmin .
IEEE J. Solid State Circuits 53(4), 1006–1015 (2018)
5. Q. Dong, M.E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao, Y. Wang, J. Chang, A
351TOPS/W and 372.4GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS
for machine-learning applications, in IEEE int. solid-state circuits conf. (ISSCC), (IEEE,
Piscataway, 2020), pp. 242–244
6. C. Yu, T. Yoo, T. Kim, K. Chai, B. Kim, A 16K current-based 8T SRAM compute-in-memory
macro with decoupled read/write and 1–5bit column ADC, in IEEE custom integrated circuits
conference (CICC), (IEEE, Piscataway, 2020), pp. 1–4
7. S. Yin, Z. Jiang, J.-S. Seo, M. Seok, XNOR-SRAM: In-memory computing SRAM macro for
binary/ternary deep neural networks. IEEE J. Solid State Circuits 55(6), 1733–1743 (2020)
8. H. Kim, Q. Chen, B. Kim, A 16K SRAM-based mixed-signal in-memory computing macro
featuring voltage-mode accumulator and row-by-row ADC, in IEEE Asian solid-state circuit
conference (ASSCC), (IEEE, Piscataway, 2019), pp. 35–36
9. Z. Jiang, S. Yin, J. Seo, M. Seok, C3SRAM: An in-memory-computing SRAM macro based
on robust capacitive coupling computing mechanism. IEEE J. Solid State Circuits 55(7), 1888–
1897 (2020)
10. A. Biswas, A.P. Chandrakasan, CONV-SRAM: An energy-efficient SRAM with in-memory
dot-product computation for low-power convolutional neural networks. IEEE J. Solid State
Circuits 54(1), 217–230 (2019)
11. C. Yu, K. Chai, T. Kim, B. Kim, A zero-skipping reconfigurable SRAM in-memory com-
puting macro with binary-searching ADC, in IEEE European solid-state circuit conference
(ESSCIRC), (IEEE, Piscataway, 2021), pp. 131–134
12. J. Kim, J. Lee, J. Heo, J.Y. Kim, Z-PIM: A sparsity-aware processing-in-memory architecture
with fully variable weight bit-precision for energy-efficient deep neural networks. IEEE J. Solid
State Circuits 56(4), 1093–1104 (2021)
13. W. Khwa et al., 31.5 a 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-
macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN
edge processors, in 2018 IEEE international solid-state circuits conference - (ISSCC), (IEEE,
Piscataway, 2018), pp. 496–498
14. X. Si et al., 15.5 a 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation
for AI edge chips, in 2020 IEEE international solid-state circuits conference - (ISSCC), (IEEE,
Piscataway, 2020), pp. 246–248
15. J. Su et al., 15.2 a 28nm 64Kb inference-training two-way transpose multibit 6T SRAM
compute-in-memory macro for AI edge chips, in 2020 IEEE international solid-state circuits
conference - (ISSCC), (IEEE, Piscataway, 2020), pp. 240–242
16. H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, N. Verma, 15.1 a programmable
neural-network inference accelerator based on scalable in-memory computing, in 2021 IEEE
international solid-state circuits conference - (ISSCC), (IEEE, Piscataway, 2021), pp. 236–238
17. J. Yue et al., 15.2 a 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-
associate block-wise zero skipping and ping-pong CIM with simultaneous computation and
weight updating, in 2021 IEEE international solid-state circuits conference - (ISSCC), (IEEE,
Piscataway, 2021), pp. 238–240
18. R. Guo et al., 15.4 a 5.99-to-691.1TOPS/W tensor-train in-memory-computing processor
using bit-level-sparsity-based optimization and variable-precision quantization, in 2021 IEEE
international solid-state circuits conference - (ISSCC), (2021), pp. 242–244
19. J. Su et al., 16.3 a 28nm 384kb 6T-SRAM computation-in-memory macro with 8b precision
for AI edge chips, in 2021 IEEE international solid-state circuits conference - (ISSCC), (IEEE,
Piscataway, 2021), pp. 250–252
20. M. Kang, S. Gonugondla, A. Patil, N. Shanbhag, A multi-functional in-memory inference
processor using a standard 6T SRAM array. IEEE J. Solid State Circuits 53(2), 642–655 (2018)
21. J. Zhang, Z. Wang, N. Verma, In-memory computation of a machine-learning classifier in a
standard 6T SRAM array. IEEE J. Solid State Circuits 52(4), 915–924 (2017)
22. H. Kim, T. Yoo, T. Kim, B. Kim, Colonnade: A reconfigurable SRAM-based digital bit-serial
compute-in-memory macro for processing neural networks. IEEE J. Solid State Circuits 56(7),
2221–2233 (2021)
23. H. Valavi, P. Ramadge, E. Nestler, N. Verma, A 64-tile 2.4-Mb in-memory-computing CNN
accelerator employing charge-domain compute. IEEE J. Solid State Circuits 54(6), 1789–1799
(2019)
24. K. Ando et al., BRein memory: A single-chip binary/ternary reconfigurable in-memory deep
neural network accelerator achieving 1.4 TOPS at 0.6 W. IEEE J. Solid State Circuits 53(4),
983–994 (2018)
25. Y. Chih et al., 16.4 an 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision
compute-in memory macro in 22nm for machine-learning edge applications, in 2021 IEEE
international solid-state circuits conference (ISSCC), (IEEE, Piscataway, 2021), pp. 252–254
26. H. Sharma et al., Bit fusion: Bit-level dynamically composable architecture for accelerating
deep neural network, in Proc. 45th annual IEEE/ACM international symposium on computer
architecture (ISCA), (ACM, New York, 2018), pp. 764–775
27. P. Judd et al., Stripes: Bit-serial deep neural network computing, in Proc. 49th Annual
IEEE/ACM international symposium on microarchitecture (MICRO), (ACM, New York, 2016),
pp. 1–12
Chapter 4
DRAM-Based Processing-in-Memory
4.1 Introduction
Figure 4.2 shows the typical organization of a modern DRAM chip. It broadly
consists of control logic, multiple banks, and data IO circuitry. The DRAM bank is
made of DRAM mat, the basic 2-dimensional array structure with periphery circuits
to access the cells. In the mat, the row decoder specifies a single wordline and
drives it based on the address. Then, all the access transistors of the DRAM cells
connected to the wordline are activated, and the values are loaded into the bitlines.
Each DRAM cell, composed of a single transistor and a capacitor, starts to share its stored charge with the bitline, which is pre-charged to half VDD, when its access transistor is turned on. The slight voltage difference on each bitline caused by the charge sharing is amplified by the bitline sense amplifiers at the bottom of the mat. This row of sense amplifiers is also called the row buffer because it stores the cell values of the entire row. Once an entire row is buffered in
the row buffer, the column decoder chooses one or more bitlines to transfer data to
the IO pads.
In this section, we study DRAM-based PIMs that embed logic at the level of the memory cells and bitline sense amplifiers, which is the lowest level at which logic can possibly be added. In order to maximally use the internal read bandwidth, they enable multiple rows at once and perform low-level logic operations such as AND, OR, or NOR on entire rows at the bitline sense amplifiers. This is called bulk bitwise processing.
4.3.1 AMBIT
4.3.1.1 Triple Row Activation
As shown in Fig. 4.3, AMBIT activates three wordlines simultaneously, unlike a regular DRAM, which only activates a single row at a time. From the perspective of a single bitline, three cells connected to the bitline are accessed in the same cycle, and all the charges from the cells are shared on the bitline. If the number of cells charged to VDD (i.e., logical 1) is larger than the number of cells with no charge (i.e., logical 0), the net charge injected into the bitline will be positive.
Since the bitline is already pre-charged to half-VDD, the final voltage level after charge sharing will be slightly higher than half-VDD and is eventually driven to VDD by the sense amplifier. In other words, if the bitline has two or three charged cells, the final voltage will be VDD. On the other hand, if the bitline has zero or one charged cell, the final voltage will be 0. Equation 4.1 gives the bitline voltage after charge sharing among the three cells under triple row activation (TRA), where Cc and Cb are the cell capacitance and bitline capacitance, respectively, and k is the number of 1s among the three cells:

VBL = (k · Cc · VDD + Cb · VDD/2) / (3 · Cc + Cb)     (4.1)
The final value of the TRA becomes 1 if the number of 1s among the three cells is greater than or equal to 2, which is exactly the majority function. Among the three cell values (i.e., A, B, and C), if A and B, B and C, or C and A are 1, then the final value will be 1. This is AB + BC + CA as a simple Boolean equation, which is the same as C(A + B) + ∼C(AB). Based on this Boolean equation, we can
easily implement AND or OR function by controlling the C value. If C is set to 0,
the first term will be gone, so the final value will be AB. This is the AND operation
for all the bits between the entire two rows. Otherwise, if C is set to 1, the final
value will be A + B, the OR operation for the entire two rows. By presetting C and
enabling three rows simultaneously, we can implement AND or OR operation for
the entire bits of the two rows. As an entire row of a DRAM bank can be multiple
kilobytes (usually 1KB or 2KB), this TRA scheme enables multi-kilobytes bitwise
AND/OR operation. This is the main idea of the AMBIT. AMBIT utilizes the TRA
and the property of charge sharing in the bitlines, as it is difficult to integrate an
AND or OR gate within a pitch of the tiny DRAM cell because each gate requires
six transistors. Thanks to TRA, AMBIT implements bulk bitwise operations without
any transistors added to the sense amplifiers.
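The TRA-based AND/OR trick reduces to the majority function, which the short Python model below captures at the logical level; charge sharing, the destructive overwrite of the activated rows, and the row copies are intentionally left out of this sketch.

```python
def tra(a, b, c):
    """Triple row activation modeled at one bitline: the sense amplifier resolves
    to the majority of the three stored bits, and the result is written back."""
    return 1 if (a + b + c) >= 2 else 0      # MAJ(A, B, C) = AB + BC + CA

def bulk_and(row_a, row_b):
    """Bulk bitwise AND: preset the control row C to all 0s, then apply TRA."""
    return [tra(a, b, 0) for a, b in zip(row_a, row_b)]

def bulk_or(row_a, row_b):
    """Bulk bitwise OR: preset the control row C to all 1s, then apply TRA."""
    return [tra(a, b, 1) for a, b in zip(row_a, row_b)]

if __name__ == "__main__":
    ra = [1, 0, 1, 1, 0, 0, 1, 0]
    rb = [1, 1, 0, 1, 0, 1, 1, 0]
    assert bulk_and(ra, rb) == [a & b for a, b in zip(ra, rb)]
    assert bulk_or(ra, rb)  == [a | b for a, b in zip(ra, rb)]
    print(bulk_and(ra, rb), bulk_or(ra, rb))
```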
Although AMBIT efficiently implements the bulk bitwise AND/OR operation using
TRA, it has some issues. First, the TRA re-writes the final result to the original cells
like a normal memory read does. As it activates three rows at the same time and gets
AND/OR values, it destroys the original values of the cells in the three rows. The
second issue is the cost of TRA. Since it needs to activate three rows simultaneously,
the decoding logic needs to decode three addresses at once. Because it causes a
linear increase in address bus and row decoding logic, the TRA puts a burden on the
control logic of the memory cell array.
To solve the above issues, AMBIT divides the row address space of the memory
subarray into three groups: (1) bitwise group (B-group), (2) control group (C-
group), and (3) data group (D-group), as shown in Fig. 4.4. B-group has eight
designated rows for bulk bitwise AND/OR operations with special decoding logic
for TRA. C-group pre-stores 0 and 1 to select AND and OR operation, respectively.
D-group stores the original data, occupying the most rows in the subarray. For C-
group and D-group, AMBIT uses the regular row decoder that does not require any
changes in design. For the operation, it copies the two rows of data from D-group to
the designated rows in B-group (i.e., T0 and T1). These two rows will be the input
operands of the bulk bitwise operation. It also initializes the designated row T2 to 0
(=AND) or 1 (=OR) to choose the operation. Then, it simultaneously activates the
three designated rows, T0, T1, and T2, for computation. Finally, it copies the result
row T0 to a row in the D-group. Using three copies between the main D-group and
special-purpose B-group, AMBIT completes the bulk bitwise operation.
AMBIT requires a lot of row-wide copies between the main D-group and the
special-purpose B-group designated for TRA. To reduce the long latency of the
row copies between the two groups, the authors utilize the method of RowClone-
FPM (Fast Parallel Mode) [8]. For a row-wide copy, a regular DRAM requires
an activation command followed by many column-read commands and the final
pre-charge command to read an entire row. It also needs an activation command
followed by many column-write commands and the final pre-charge command to
write a destination row. Unlike the regular DRAM, which requires many commands with a very long latency of more than 1000 ns, RowClone-FPM uses only three commands: source row activation, destination row activation, and the pre-charge, as shown in Fig. 4.5. By activating the destination row right after the source row has been amplified into the row buffer, RowClone-FPM copies the entire source row to the destination row very efficiently, reducing the latency by more than 10 times.
To support a bulk bitwise NOT operation, AMBIT adds a row of dual-contact cells (DCCs), each of which connects to the sense amplifier through two wordlines, a d-wordline and an n-wordline. For a NOT operation, AMBIT first activates the source row so that the sense amplifier evaluates the cell value. Then, it enables the n-wordline of the DCC,
which connects the inverted node of the sense amplifier and the cell in the DCC. If
the inverted value of the bitline is 0, the charge in the cell will be discharged to the
ground. Hence, the cell value of DCC is the same as the inverted value. On the other
hand, if the inverted value is 1, the cell will be charged to VDD . Therefore, AMBIT
successfully moves the inverted value of the source row to the DCC using the n-
wordline and its transistor. DCC uses d-wordline and its transistor when it needs to
copy the row of DCCs, the result of NOT operation, to another row.
The actual implementation of the DCC may not be feasible, as it requires one more wordline and transistor to fit within the pitch of a DRAM cell. In DRAM, the pitch of a single cell is already optimized for having only a single transistor and a capacitor, so adding another transistor and a wordline to it can be extremely challenging (Table 4.1).
Although the B-group has only eight physical rows, it contains 16 reserved
addresses, B0–B15. Table 4.1 lists the 16 addresses and the corresponding word-
lines. Among 16 addresses, the first four addresses, B0–B3, are designated for
operands (i.e., T0–T3), and the next four addresses, B4–B7, are assigned to two
DCC rows for bulk NOT operation. B8–B11 each activate two wordlines simultaneously for a copy. For example, B8 activates T0 and the n-wordline of DCC0 to move negated values to DCC0. B12–B15 each activate three wordlines together for bulk bitwise
AND/OR operation. For example, B12 activates T0, T1, and T2 at the same time,
and the computed value will be re-written to all the rows.
To execute the proposed bulk bitwise operation, which involves row-wide data
copies and logical computation, AMBIT supports a fused complex command
primitive named AAP (Activate–Activate–Pre-charge). By combining row activate,
row activate, and pre-charge back to back, the AAP primitive reduces the number
of required commands significantly. Hence, it reduces the total latency. Figure 4.7
shows how the basic logical operations can be done with the AAP primitive. AAP
(Di, B0) means that it activates Di from the D-group and activates B0 from the
B-group to copy Di to T0 and pre-charges for the following command. Likewise,
AAP (Dj, B1) copies the Dj from the D-group to T1. AAP(C0, B2) sets the
T2 to 0 for bulk AND operation. Finally, AAP(B12, Dk) executes TRA with B12,
and the final AND result will be copied to Dk in the D-group.
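The command sequence of Fig. 4.7 can be walked through with a toy software model, shown below. The row names, the 8-bit row width, and the dictionary-based "subarray" are made-up conveniences, and the TRA step is modeled as an ideal majority followed by a write-back.

```python
# A toy model of the AMBIT command sequence for a bulk AND: Dk = Di AND Dj.
# Rows are 8-bit lists; B-group addresses map to designated rows as in Table 4.1.
ROWS = {
    "Di": [1, 0, 1, 1, 0, 0, 1, 0],
    "Dj": [1, 1, 0, 1, 0, 1, 1, 0],
    "Dk": [0] * 8,
    "C0": [0] * 8,                     # control row pre-storing 0s (selects AND)
    "T0": [0] * 8, "T1": [0] * 8, "T2": [0] * 8,
}

def aap(src, dst):
    """Activate src, activate dst, pre-charge: a RowClone-style row copy."""
    ROWS[dst] = list(ROWS[src])

def aap_tra(dst):
    """Activate the B12 address (T0, T1, T2 together): every bitline resolves to
    the majority of the three cells, the result is written back to the three rows,
    and the following activation copies it to dst."""
    maj = [1 if (a + b + c) >= 2 else 0
           for a, b, c in zip(ROWS["T0"], ROWS["T1"], ROWS["T2"])]
    ROWS["T0"], ROWS["T1"], ROWS["T2"] = list(maj), list(maj), list(maj)
    ROWS[dst] = list(maj)

aap("Di", "T0")     # AAP(Di, B0): copy first operand
aap("Dj", "T1")     # AAP(Dj, B1): copy second operand
aap("C0", "T2")     # AAP(C0, B2): preset control row to 0 -> AND
aap_tra("Dk")       # AAP(B12, Dk): TRA and copy the result out
assert ROWS["Dk"] == [a & b for a, b in zip(ROWS["Di"], ROWS["Dj"])]
print(ROWS["Dk"])
```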
4.3.1.7 Evaluation
To evaluate the AMBIT architecture, the authors compare the raw throughput
of bulk bitwise operations of AMBIT, such as NOT, AND/OR, NAND/NOR,
and XOR/XNOR, against Intel Skylake CPU and NVIDIA GeForce GTX 745
GPU. As expected, AMBIT, implemented with DDR3 memories, outperforms the others by 32×.
4.3.2 DRISA
4.3.2.1 Motivation
The goal of DRISA is to merge the strengths of memory-rich processors, such as GPUs and ASIC-based neural processing units (NPUs) [9], and compute-capable PIMs [1]. The memory-rich processors show high performance using abundant memory
bandwidth but have little memory capacity. On the other hand, compute-capable
PIMs suffer from low performance. To have both strengths, DRISA builds a PIM
accelerator based on DRAM technology. Like AMBIT, DRISA adds the logic
operations at the level of bitline sense amplifiers to leverage the maximal internal
bandwidth while minimizing the design changes from regular DRAM.
Figure 4.8 shows the overview of DRISA. The highlighted regions with green and
blue depict the building blocks that require design changes from a regular DRAM.
At the chip level, DRISA modifies the group and bank buffers to facilitate internal
data transfers. It also modifies the bank controller to control logic processing in
multiple subarrays in each bank. At the lowest cell matrix level, it adds logic gates
and shifters at the bottom of the DRAM cells.
Unlike AMBIT, which uses the same DRAM cell architecture as regular DRAMs for feasibility/manufacturability, DRISA proposes three different DRAM cell architectures: 3T1C, 1T1C-NOR/MIX, and 1T1C-ADDER (Fig. 4.9).
1T1C-NOR/MIX adds a NOR gate or other gates below each bitline sense
amplifier with a latch. It performs bitwise logic operation between the read operand
and the latched operand. On the other hand, 1T1C-ADDER adds the latches and a
parallel adder below multiple sense amplifiers. However, both of them are difficult to realize considering the extremely narrow DRAM cell pitch. The simplest case
requires 4 transistors for a NOR gate and 8 transistors for a latch. Having to route
metal connections as well, it is not trivial to integrate the logic within the DRAM
cell pitch, even for the simplest case.
The 3T1C cell, illustrated in more detail in Fig. 4.10, was used in early DRAM
design. It has separated wordlines, one for write and the other for read operation,
and two transistors to connect them (M1 and M3). The M2 transistor decouples
the other two transistors, and its gate is connected to the cell capacitor. If the M3
transistor is enabled, the cell is connected to the read bitline BL2, with the cell capacitor value as its input. From the bitline perspective, the M2 transistors connected in parallel implement a NOR operation: if at least one cell value is 1, the bitline is pulled down to the ground. Using this native NOR configuration of the 3T1C cell, DRISA can perform bulk bitwise NOR operations between two activated rows.
While AMBIT utilizes bitwise AND/OR and NOT operation to implement logic
functions, DRISA only uses bitwise NOR operation as it is functionally complete.
As explained in the previous section, DRISA activates two rows simultaneously
to compute bitwise NOR operation using the native NOR connection of the 3T1C
cells. Figure 4.11 illustrates the DRISA’s NOR-based selector, or multiplexer, logic
implementation. The Boolean equation for the selector is R = SX+ ∼ SY ,and this
can be re-written as R =∼ NOR(NOR(∼ S, ∼ X), N OR(S, ∼ Y ) using 3 NOR
operations and 4 NOT operations. NOT operation can also be computed using NOR
by having one of the input operands to 0 (NOT (X) = NOR(0, X)). As a result,
the bulk bitwise selector logic can be done in 7 steps in DRISA.
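The NOR-only decomposition above is easy to verify exhaustively, as in the following sketch, where every operation other than the NOR primitive is built from it.

```python
def nor(a, b):
    """The only primitive that DRISA's 3T1C array provides natively."""
    return 0 if (a or b) else 1

def not_(a):
    """NOT built from NOR by tying one operand to 0: NOT(X) = NOR(0, X)."""
    return nor(0, a)

def selector(s, x, y):
    """R = S*X + ~S*Y expressed as ~NOR(NOR(~S, ~X), NOR(S, ~Y)):
    3 NOR operations plus 4 NOT operations (7 bulk steps in DRISA)."""
    return not_(nor(nor(not_(s), not_(x)), nor(s, not_(y))))

if __name__ == "__main__":
    for s in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                assert selector(s, x, y) == ((s and x) or ((not s) and y))
    print("NOR-only selector matches S*X + ~S*Y for all inputs")
```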
As the bulk bitwise operation applies the same low-level logic operations to all
the bits, mapping higher-level logic functions is not straightforward. To partially address this problem, DRISA includes shifters under the bitline sense amplifiers for data communication among neighboring bitlines. As a simple but essential use case, the shifter can propagate carry-out signals to the neighboring bitlines during addition.
Specifically, DRISA supports three types of shifting operations: intra-lane, inter-
lane, and forwarding, as shown in Fig. 4.12. As the name implies, intra-lane shift is
a single-bit shift to the neighbor bitlines inside the lane, and inter-lane shift is a shift
in a lane unit such as byte shift or word shift. The lane means a unit of data, such as
8 bits or 16 bits. Forwarding is just a read without any shift applied.
As the shifter implementation can be complex, DRISA proposes transistor-level
shifter circuits. Figure 4.13 shows the 4-bit intra-lane shifter circuits with the
example of a left shift by 2 and a right shift by 3, where rBL, wBL, and FL mean read bitline, write bitline, and filling line, respectively. According to the control lines at the bottom (L0, L1, L2, R1, and R3), only the necessary transistors are enabled for
barrel shifting operations. For example, the read bitlines of columns 3 and 4 are
enabled, and the read values are transferred to columns 1 and 2, respectively, for the
left shift by 2.
4.3.2.4 Evaluation
The bulk bitwise operation that implements logic at the bitline sense amplifier level
is the best in terms of internal data bandwidth. However, area constraint for this
method is the toughest because of the narrow cell pitch, which has kept shrinking to integrate more cells for higher capacity. The next possible level
is bank level, which integrates processing logic after column decoder and selector.
Since the processing logic can enjoy the whole width of the cell array, not a single
cell pitch, it is more affordable to add logic functions in the space. In addition, as
every commercial DRAM has column selectors, this method is less invasive as it
does not change any design at the cell array matrix. Also, it utilizes the possible
maximum bandwidth of the existing DRAM architectures.
4.4.1 Newton
4.4.1.1 Motivation
4.4.1.2 Architecture
Figure 4.15 shows the overall architecture of Newton in a single DRAM die. It has
a total of 16 banks, where each bank includes 16 multipliers, 16 adders, and a 16-bit accumulation register. As mentioned earlier, it integrates this arithmetic logic after the column decoder (i.e., the 32:1 column mux in the diagram) to make it feasible while minimizing the changes in the memory bank design. Unlike AMBIT or DRISA, it
does not have to change the design in row decoder and row drivers because it does
not require multi-row activation.
Newton activates a single row in a bank, like a normal DRAM; the activated row has a size of 1KB. Among them, only 32 bytes are selected after the 32:1 column select. As Newton uses the half-precision floating-point data type (FP16), it reads 16 FP16 values at a time out of the 512 FP16 values in a row. Each FP16 value enters one input of a multiplier, while the other input comes from the global buffer. The multipliers multiply the 16 FP16 values from the global buffer with the 16 FP16 values from the bank, and the 16 products are accumulated through the adder tree into a single FP16 result.
In Newton, the global buffer broadcasts an input vector to the memory banks,
while the banks store different parts of the weight matrix, as illustrated in Fig. 4.16.
The large weight matrix is chunked into tiles, whose size is 16 rows by 512 FP16
data, and the rows in a tile are interleaved over the multiple banks. The input vector
is also segmented into the groups of 512 FP16 data, and they are distributed to the
banks for matrix-vector operation. To increase the internal read bandwidth, Newton
activates multiple banks at the same time. This multi-bank activation, or bank-level
parallelism, is a key differentiator from a regular DRAM. It increases the internal
read bandwidth and, hence, its compute bandwidth.
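The per-bank dataflow can be sketched functionally as below, assuming FP16 arithmetic via NumPy, a 1 KB (512 × FP16) activated row, 32 column selects of 16 values each, and an ideal adder tree; the command sequencing, multi-bank interleaving, and accumulation-register details are simplified away.

```python
import numpy as np

def newton_bank_matvec(weight_tile, in_vec, lanes=16):
    """Behavioral sketch of one Newton bank processing one weight tile.

    weight_tile: (rows, 512) FP16 weights stored in the bank's DRAM rows
    in_vec     : 512 FP16 inputs broadcast from the global buffer
    Each column select returns 16 FP16 weights; 16 multipliers and an adder
    tree reduce them, and a register accumulates across the 32 column selects."""
    weight_tile = weight_tile.astype(np.float16)
    in_vec = in_vec.astype(np.float16)
    outputs = []
    for row in weight_tile:                      # one activated 1KB row at a time
        acc = np.float16(0.0)
        for sel in range(0, 512, lanes):         # 32:1 column select -> 16 values
            w = row[sel:sel + lanes]
            x = in_vec[sel:sel + lanes]
            acc = np.float16(acc + np.float16(np.sum(w * x)))  # 16 MULs + adder tree
        outputs.append(acc)
    return np.array(outputs, dtype=np.float16)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tile = rng.standard_normal((16, 512)).astype(np.float16)   # 16-row weight tile
    x = rng.standard_normal(512).astype(np.float16)
    print(newton_bank_matvec(tile, x)[:4])
```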
Figure 4.17 shows the overall operation of Newton, including new commands and
multi-bank activation for PIM operations. First, it loads the global buffer with the
input vector data using GWRITE command. Then, it activates multiple banks using
G_ACT command. Although activating all 16 banks would be the best option for
achieving high throughput, it is difficult because of power delivery and voltage drop issues. In this design, Newton activates four banks at a time and needs an interval time between consecutive group activations.
4.4.1.4 Evaluation
4.4.2 HBM-PIM
4.4.2.1 Motivation
In the HBM-PIM architecture, a programmable computing unit (PCU) is placed between an even bank and an odd bank, so that two banks share one PCU. In addition, the PIM controller for operating the PCUs
is integrated in the TSV area. To maximize its internal bandwidth, HBM-PIM uses
the bank-level parallelism, as Newton does. As two banks share one compute unit to
limit the power and area cost, it activates half of the banks (i.e., even banks or odd
banks) in the die at the same time.
From the host side, the proposed HBM-PIM is seen exactly the same as a
regular HBM, being compatible with the existing DRAM interfaces. With the PIM
instructions stored in the command register file in the PCU, the host can control
every PIM instruction with conventional load and store instructions to specific
memory addresses. The only thing the controller inside the DRAM needs to do is the mode change between the normal and PIM modes.
As shown in Fig. 4.22, the internal PIM controller decodes specific combinations
between the command and address to generate a mode change signal. For example,
if the row activate (ACT) command comes with a specific address in bank 0,
PIM_Even signal is asserted. Likewise, if the same command and an address come
in for bank 1, PIM_Odd signal is asserted. Only if both signals are asserted, the
mode changes to the PIM mode. With the start of PIM mode, the PCU gets its clock
to run the PIM instructions. Once the PIM operations finish, the controller changes
the PIM mode to normal, again with the combination of a command and a specific
address.
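A toy model of this mode-change decoding is sketched below; the reserved row address and the exact command/address combination are hypothetical placeholders, since the real decoding rules are internal to the device.

```python
class PimModeController:
    """Toy model of the HBM-PIM mode-change decoding described above: a row
    activate (ACT) to a designated address in bank 0 asserts PIM_Even, the same
    for bank 1 asserts PIM_Odd, and only when both are asserted does the mode
    switch to PIM."""
    PIM_ENTRY_ROW = 0x27F          # hypothetical reserved row address

    def __init__(self):
        self.pim_even = False
        self.pim_odd = False
        self.mode = "NORMAL"

    def on_command(self, cmd, bank, row):
        if cmd == "ACT" and row == self.PIM_ENTRY_ROW:
            if bank == 0:
                self.pim_even = True
            elif bank == 1:
                self.pim_odd = True
        if self.pim_even and self.pim_odd:
            self.mode = "PIM"      # the PCUs now receive their clock
        return self.mode

if __name__ == "__main__":
    ctrl = PimModeController()
    print(ctrl.on_command("ACT", bank=0, row=0x27F))   # still NORMAL
    print(ctrl.on_command("ACT", bank=1, row=0x27F))   # switches to PIM
```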
In the execution model, the major difference between Newton and HBM-PIM is that
Newton adds a few special PIM commands to use the integrated arithmetic units
with a fixed function, while HBM-PIM changes the mode to activate the compute
unit named PCU. PCU is a programmable unit with its own instructions. Figure 4.23
shows the block diagram of the PCU, consisting of an interface unit, execution
unit, and register group. The interface unit receives control and data signals from
the memory’s command controller. The execution unit includes a pair of 16 FP16
multipliers and adders. Each of them has a 5-stage pipeline and works in parallel
with single-instruction multiple-data (SIMD) fashion.
The register group includes the command register file (CRF), general-purpose
register file (GRF), and scalar register file (SRF). The CRF buffers up to 32 32-
bit PIM instructions. The GRF composed of sixteen 256-bit registers is evenly
divided into GRF_A and GRF_B for even bank and odd bank, respectively. The
SRF replicates a scalar value to a vector and performs scalar multiplications or scalar
additions to a source operand from GRF. Like other in-order cores, the PCU fetches
a PIM instruction from the CRF, decodes it, reads source operands to the SIMD
FP units, and stores the result back to the GRF. Table 4.2 shows the overall 9 PIM
instructions.
HBM-PIM has three operational steps. First, the host stores input data to the DRAM
cell arrays. As it is set to normal mode with initialization, the host accesses each
bank as a regular DRAM. Then, the host changes the operation mode from normal
to PIM. Second, the host sends instructions and weight data to PCUs via the DQ
interface. The PCU can save up to 32 instructions in the CRF, and its program
counter reads the instructions one by one from address 0. Third, once each PCU finishes executing its instructions, the controller switches the mode back to normal and the host reads the results from the memory.
As HBM-PIM introduces PCUs between even and odd banks, it needs data buses
to transfer data among them. The local data buses are responsible for data transfers
between PCUs and banks. With the MOVE command, the PCU can load data from
the cell array to the GRF or store data from the GRF to the cell array. This is a multi-
bank operation; either even or odd banks can be enabled simultaneously. The host
uses the bank group global bus to issue instructions and weight data to the PCUs.
Figure 4.25 depicts the two types of buses for data movements in HBM-PIM.
HBM-PIM is the first PIM chip ever fabricated in HBM using a 20 nm DRAM
process. Figure 4.26 shows the chip micrograph and measurement results. It
achieves 2.4 Gbps/pin operation, without power consumption increase from HBM2,
and PCU operation at 300 MHz. In addition, an FPGA-based test platform and an
emulation environment confirm the system performance can improve by 2.1× for
DeepSpeech2 benchmark [21] while reducing the system energy by 71% compared
to a typical GPU system using HBM2.
In this section, we look into the PIM architectures using full 3-d vertical stacking of
memory and logic. The HBM-PIM that we describe in Sect. 4.4 is a 2.5-d solution;
it puts the logic module and HBM side by side via silicon interposer. On the other
hand, the full 3-d stacking means the integration of the logic die in the bottom with
the stacked memories, like an HBM, on top of it. Since the main compute die and the memory dies are stacked directly, it goes a step further than HBM, promising more energy-efficient data communication between the two entities. Hybrid memory
cube (HMC) is the main example of this. However, the realization of 3-d PIM can be
difficult due to tight physical and timing constraints among 3-d stacked dies. All the
proposed 3-d PIM architectures are evaluated only using simulation. In this section,
we briefly review a few works based on the 3-d PIM architecture.
4.5.1 Neurocube
Neurocube [5] is one of the earliest architectures that demonstrates the feasibility
and performance benefits of using a 3-d high-density memory package for deep learning acceleration.
4.5.2 Tetris
4.5.3 iPIM
iPIM [7] combines the 3-d PIM approach that Neurocube and Tetris used with the bank-level PIM approach that Newton and HBM-PIM used, in order to increase effective compute bandwidth and reduce the energy spent on data movements via TSVs. As Fig. 4.29 shows, iPIM's vault architecture decouples the roles of control and execution. The logic die includes the iPIM core that performs complex control
operations such as instruction decoding and issuing, and memory bank controls. On
the other hand, the process group (PG), integrated into each DRAM die of a vault,
performs simple but memory-intensive operations at near bank. To enable massive
bank-level concurrent execution, iPIM proposes single-instruction multiple-bank
(SIMB) instructions, including computation, index calculation, intra/inter-vault data
movement, and synchronization operation.
References
14. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
15. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
16. OpenAI, GPT-3 powers the next generation of apps, In: OpenAI (2021). https://openai.com/
blog/gpt-3-apps/. Accessed 5 Nov 2021
17. M. Naumov, D. Mudigere, H.-J.M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U.
Gupta, C.-J. Wu, A.G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R.
Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L.
Xiong, and M. Smelyanskiy, Deep learning recommendation model for personalization and
recommendation systems (2019). arXiv preprint arXiv:1906.00091
18. P. Rosenfeld, E. Cooper-Balis, B. Jacob, DRAMSim2: a cycle accurate memory system
simulator. IEEE Comput. Archit. Lett 10(1), 16–19 (2011)
19. A. Bakhoda, G.L. Yuan, W.W. Fung, H. Wong, T.M. Aamodt, Analyzing CUDA workloads
using a detailed GPU simulator, in 2009 IEEE International Symposium on Performance
Analysis of Systems and Software. IEEE, Piscataway (2009), pp. 163–174
20. Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q.
Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato,
T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J.
Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's neural machine
translation system: bridging the gap between human and machine translation (2016). arXiv
preprint arXiv:1609.08144
21. D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B.
Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G.
Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A.
Hannun, T. Han, L. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li,
X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V.
Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang,
J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D.
Yogatama, B. Yuan, J. Zhan, Z. Zhu, Deep speech 2: End-to-end speech recognition in English
and mandarin, in International Conference on Machine Learning (2016), pp. 173–182. PMLR
Chapter 5
ReRAM-Based Processing-in-Memory
(PIM)
5.1 Introduction
This chapter discusses ReRAM-based PIM designs, covering unit cells, circuit techniques, and architectures. Besides MAC, other essential logic-in-memory functions will also be discussed.
Figure 5.1 illustrates a typical ReRAM array for PIM. In general, the ReRAM array
for PIM looks the same as the ReRAM array for regular memory operation. Since
the 1T1R ReRAM cell is compact and can provide high density, many ReRAM PIM
accelerators are designed using the typical 1T1R ReRAM cell. The key difference
between normal ReRAM and ReRAM PIM is the number of rows that are activated
simultaneously. In normal ReRAM, only one row is accessed at a given time for
programming and read operation. However, ReRAM PIM accesses multiple rows
for reading operation, which is the most frequently executed operation for neural
networks. Activating multiple rows will create multiple current components in each
bitline, which is used as an analog multiply-and-accumulate (MAC) result. The
analog MAC output can be represented as follows:
v
v
MAC = INi × Wi = IMC[i]
i=0 i=0
Here, INi , Wi , IMC[i] , and v are input signal, weight, cell current, and the
number of selected rows, respectively. INi affects WLs in Fig. 5.1 and is usually
represented by multiple voltage levels or time durations, depending on the required number of bits.
Fig. 5.1 A typical 1T1R ReRAM array for PIM with wordlines (WL), sourcelines (SL), and bitlines (BL)
Wi is stored in ReRAM devices and also requires multiple cells
for realizing a multi-bit weight. The multiple cells can be located at multiple rows
in the same column or at multiple columns in the same row, which is determined
by the employed ReRAM macro architecture. When one column is used for MAC
operation, the bitline current of each column (BL[i]) represents a MAC result and
is digitized by an analog-to-digital converter (ADC). Using multiple columns needs
additional control circuits for merging the bitline currents from the multiple columns
and generating the final output current. Digitization will be performed using the
final output current. Recently, multi-bit ReRAM devices have been reported, which
allows a ReRAM cell to store a multi-bit weight [18–21]. The number of ReRAM
cells for generating a MAC result will decrease when employing the multi-bit
ReRAM devices. Therefore, higher density ReRAM PIM macros can be realized
using the same ReRAM array density. However, multi-bit ReRAM technology is not
mature and shows large variations. Therefore, its application is still limited.
Table 5.1 shows how a ReRAM cell can be used for binary multiply. When two
binary bits, input, and weight, are multiplied, the output will be either logic “1” or
logic “0.” In 1T1R ReRAM cell, the input is applied to the wordline (WL[i] in Fig.
5.1), and the weight is stored in the ReRAM device. In general, the high resistance
state (HRS) and the low resistance state (LRS) represent “0” and “1,” respectively.
When the input is “0,” the access transistor in the 1T1R cell is off, and no current
will flow through the cell (IMC = 0). When the input is “1,” the access transistor is
on, and the current will be determined by the ReRAM device state. If the ReRAM
device is in HRS (Weight = “0”), the current (IMC) will be IHRS. When the ReRAM
device is in LRS, ILRS will flow through the cell. Here, it needs to be noted that
the binary multiply result of “0” is represented by two different current values (i.e.,
0 and IHRS ), which will degrade the sensing margin. The impact of IHRS will be
more significant when a large number of rows are activated for the multiply-and-
accumulate (MAC) operation. In this case, IHRS in multiple ReRAM devices will
be added. It becomes more difficult to generate accurate MAC results because of
the unwanted IHRS . Therefore, the number of rows to be accessed at the same time
should be decided carefully after considering the impact of IHRS . The impact of IHRS
will be mitigated by increasing the ratio of ILRS to IHRS .
Table 5.1 Binary multiply in ReRAM

  Input (IN)   Weight (W)   Product (IN × W)   IMC
  0            0 (HRS)      0                  0
  0            1 (LRS)      0                  0
  1            1 (LRS)      1                  ILRS
  1            0 (HRS)      0                  IHRS
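To make the impact of IHRS concrete, the following short Python sketch models the accumulated bitline current for the binary case in Table 5.1. The current values I_LRS and I_HRS are arbitrary example numbers, not measurements of a particular device, and the analog summation is idealized.

```python
# Minimal sketch of the analog MAC on one bitline, following Table 5.1.
# I_LRS and I_HRS are assumed example values (in microamperes), not device data.
I_LRS = 10.0   # cell current when input = 1 and weight = 1 (LRS)
I_HRS = 0.5    # parasitic cell current when input = 1 and weight = 0 (HRS)

def cell_current(inp, weight_lrs):
    """Return the 1T1R cell current for a binary input and binary weight."""
    if inp == 0:
        return 0.0                      # access transistor off
    return I_LRS if weight_lrs else I_HRS

def bitline_current(inputs, weights):
    """Accumulate the currents of all activated rows on one bitline."""
    return sum(cell_current(i, w) for i, w in zip(inputs, weights))

inputs  = [1, 1, 1, 0, 1, 1, 1, 1, 1]   # 8 activated rows contribute current
weights = [1, 0, 0, 0, 0, 0, 0, 0, 0]   # ideal MAC value = 1

i_bl = bitline_current(inputs, weights)
ideal = sum(i * w for i, w in zip(inputs, weights)) * I_LRS
print(f"bitline current = {i_bl:.1f} uA, ideal = {ideal:.1f} uA")
# The 7 selected HRS cells add 3.5 uA of unwanted current, eroding the sensing
# margin between adjacent MAC values; a larger I_LRS/I_HRS ratio mitigates this.
```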
Multi-bit weights can be realized by using multiple ReRAM cells. Table 5.2 shows
an example of realizing multiplication of 1-bit input (IN) with a ternary weight
[11]. The ternary weight is implemented with two ReRAM cells, one in the positive
array (nvCIM-P) and the other in the negative array (nvCIM-N). The weight in the
positive array includes only “+1(LRS)” and “0(HRS)” while that in the negative
array includes “-1(LRS)” and “0(HRS).” The ternary multiply result is obtained by
combining the results from the positive and negative arrays as depicted in Table 5.2.
The multiply result will be “1” only when the input and the weight in the positive
array are “1” and “1(LRS)” and the weight in the negative array is “0(HRS).” The
multiply result of “−1” is defined by the opposite case where the input and the
weight are “1” and “−1(LRS)” in the negative array and the weight in the positive
array is “0(HRS).” Similar to Table 5.1, various cases generate IHRS even though
their multiply results must be “0.” The impact of IHRS needs to be carefully handled
to meet the required output precision.
Multiplication of multi-bit inputs and multi-bit weights can be done in various ways.
Figure 5.2 illustrates three different ways for realizing 2-bit input (IN[1:0]) and 2-
bit weight (WM WL ) multiplication. Figure 5.2a, b use one cycle, while Fig. 5.2c
uses two cycles. However, Fig. 5.2a generates the multiplication result using one
column while Fig. 5.2b, c use multiple columns for generating a MAC result. In
Fig. 5.2a, the 2-bit weight for IN[0] (LSB) can be realized by using three cells, one
cell for WL (LSB) and two cells for WM (MSB). However, for IN[1] (MSB), six
cells are necessary since IN[1] is 2 × IN[0] when both are “1s.” Therefore, a total of
nine ReRAM cells are necessary to realize the multiplication of the 2-bit input and
the 2-bit weight. This will produce the maximum current of 9Icell . Even though
Fig. 5.2a can realize the multi-bit multiplication using one column in principle, it is
challenging to generate large bitline current accurately. One of the main reasons is
that the range of the bitline voltage during read operation is limited by the ReRAM
device characteristics. If the bitline voltage is relatively high, the accessed ReRAM
devices are under weak set or reset conditions, which can partially change the
ReRAM resistance. To avoid this, the bitline voltage should be maintained low so
that no disturbance occurs in the ReRAM resistance. However, when the bitline
voltage is maintained low, the current precision will also be affected, which limits
the overall multiplication accuracy. Besides the low bitline voltage, ReRAM device
variations also make it challenging to generate accurate bitline current proportional
to the MAC result.
Another way of improving the multiplication accuracy is to use one macro over
multiple cycles, which is called “Serial-Input-Parallel-Weight (SIPW).” As shown
in Fig. 5.2c, the 2-bit input signals are applied to the macro bit by bit over two cycles.
The maximum bitline current is 3 × Icell in each cycle. To merge the currents over
two cycles, the current from the first cycle needs to be sampled in a capacitor so
that it can be merged with the current generated in the second cycle. In addition, the
ratio of the current produced by the MSB to the current produced by the LSB needs
to be considered before merging. This architecture is called “Serial-Input-Parallel-Weight
(SIPW).”

Fig. 5.2 Multiplication of multi-bit inputs and weights: (a) one cycle and one macro, (b) one cycle and multiple macros, and (c) multiple cycles and one macro. WM and WL denote the MSB and LSB weight bits, and Icell denotes the cell current [19]
The MAC results from PIPW (parallel-input-parallel-weight) and SIPW can be written as follows:
Table 5.3 summarizes the comparison of PIPW and SIPW [19]. The number
of ReRAM cells can be reduced when multi-bit ReRAM devices are utilized. The
multi-bit ReRAM devices can employ any multiplication methods in Fig. 5.2.
However, it is challenging to implement multi-bit ReRAM devices accurately.
Therefore, the output precision of the multi-bit ReRAM-based multiplication is still
limited.
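The cycle-by-cycle behavior of the serial-input scheme in Fig. 5.2c can be captured in a short functional sketch. This is an idealized digital model, assuming the per-cycle partial sums are digitized and then scaled by the input-bit weight; it does not model the analog capacitor-based sampling described above.

```python
# Functional sketch of serial-input (SIPW) multi-bit MAC: the input is applied
# bit-serially (LSB first), and each per-cycle result is weighted by 2**cycle.
# Weights are assumed to be stored as unsigned multi-bit values per column.

def sipw_mac(inputs, weights, input_bits=2):
    """inputs, weights: lists of unsigned ints; returns sum(in * w)."""
    total = 0
    for cycle in range(input_bits):              # one cycle per input bit
        in_bits = [(x >> cycle) & 1 for x in inputs]
        # Per-cycle "bitline" result: accumulation over all activated rows.
        partial = sum(b * w for b, w in zip(in_bits, weights))
        total += partial << cycle                # weight the cycle by 2**cycle
    return total

inputs  = [3, 1, 2, 0]        # 2-bit inputs
weights = [2, 3, 1, 3]        # 2-bit weights
assert sipw_mac(inputs, weights) == sum(i * w for i, w in zip(inputs, weights))
print(sipw_mac(inputs, weights))   # 11
```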
5.4.1 Introduction
This section introduces an overview of ReRAM PIM architectures. Convolu-
tional neural networks (CNNs) have demonstrated high accuracy in various artificial
intelligence (AI) tasks. CNNs consist of multiple convolutional layers and fully
connected layers as shown in Fig. 5.3a. Dot product and multiply-and-accumulate
(MAC) operations are the basic operations that are heavily executed in CNNs as
illustrated in Fig. 5.3b. These functions consume excessive energy in Von Neumann
architecture because of the massive data transfer between memory and processing
elements. Processing-in-memory (PIM) can address this issue by merging memory
and processing elements together. However, typical PIMs utilize analog signals to
represent MAC results, which requires careful design, particularly when high output
precision is necessary. As explained in Fig. 5.2, the summation of the multiplication
results can be implemented in various ways depending on the accuracy of the current
in each bitline. The activation function (f ) in Fig. 5.3b is generally realized by an
analog-to-digital converter (ADC) for multi-bit precision or a comparator for one-bit
precision.
Typical deep neural networks (DNNs) store weights in separate non-volatile
memories and transfer them to processing elements through multiple levels of the
memory hierarchy.

Fig. 5.3 (a) CNN structure with convolutional (CONV) layers for feature extraction followed by fully connected (FC) layers for classification, and (b) multiply-and-accumulate operation y = f(Σ_{i=1}^{n} w_i x_i) [22]
Fig. 5.4 (a) PIM processor architecture with pooling and activation (P + A) units and a lightweight processing element (Lite-PE), and (b) inference energy comparison between a Von Neumann architecture and nvCIM, showing a 10–1000× difference [11]
Fig. 5.5 ReRAM PIM macro with a wordline decoder, reference generator, write control, and analog-to-digital output circuits
The reference generation circuit can adjust the conversion range automatically after tracking the actual variations. In general,
reference voltage levels are generated by using ReRAM replicas for tracking the
systematic variations. However, ReRAM devices show large variations compared
to CMOS transistors. Therefore, the output precision of ReRAM-based PIMs is
still worse than that of CMOS counterparts. Mature ReRAM technology with less
device-to-device mismatches will reduce the precision gap between CMOS-based
PIMs and ReRAM-based PIMs.
Fig. 5.6 (a) Traditional ReRAM-based PIM architecture with input/output buffers, DACs, a ReRAM array, ADCs, and timing control, and (b) sample power and area breakdown, in which the 8-bit ADCs dominate both breakdowns (61% and 91%), followed by the 2-bit DACs (24% and 7%) and the ReRAM array (15% and 2%) [14]
5.5.1 Architecture
Fig. 5.7 ReRAM PIM coprocessor architecture with a RISC CPU, SRAM instruction and data memory, ADCs and DACs, an input buffer, a ReRAM crossbar, and a 32-bit bus with memory control [26]
Fig. 5.8 Mixed-signal interface with a read DAC, write DACs (Write_L and Write_H), a 3-bit DAC select, ADC/DAC enable control, and the ADC output path to the crossbar pad [26]
The write DACs generate high-voltage pulses to the accessed ReRAM cells for programming. During
PIM operation, the DACs for read apply input pulses to the crossbar, and each
column generates bitline current or voltage as a vector-matrix multiplication result.
The ADCs connected to the selected columns convert the analog signal in each
column into digital codes for further processing. The unused ADCs and DACs are
disabled to reduce unnecessary power consumption.
Figure 5.9 depicts the operations of the ADCs and DACs during the programming
and PIM modes of the ReRAM PIM coprocessor [26]. During programming (Fig.
5.9a), the DACs in the selected row and the selected column generate a train of
differential pulses so that the selected ReRAM device undergoes either positive
or negative programming voltage depending on the programming data. The DAC
output voltage levels and the number of pulses are programmed into registers for
flexible control. The DACs in the unselected rows and columns generate common-
mode voltage (e.g., 1 V in [26]) to avoid unwanted programming. The driving
strength of the DACs should be designed carefully so that the sneak currents flowing
from the selected row to the unselected columns and from the unselected rows to
the selected column do not affect the DAC outputs for the selected row and the
selected column significantly. Since the worst-case scenarios need to be considered,
the above requirement will increase the power and the area of the DACs. Figure
5.9b shows the ADCs and the DACs in the PIM mode where the DACs for rows are
selectively activated relying on the applied input signal, and the ADCs for columns
convert the conductance of each column into multi-bit digital outputs. The DACs
for columns are disabled by the mixed-signal interface as shown in Fig. 5.8.

Fig. 5.9 ADC and DAC operation of the ReRAM PIM coprocessor in (a) the programming mode and (b) the PIM mode [26]

The number of pulses going to each selected row is controlled by the controller (i.e.,
3b DAC Select in Fig. 5.8). The DACs for unselected rows generate 1.2 V as the
common-mode voltage. The DACs use the pulse amplitude of 0.6 V that can also be
adjusted depending on the ReRAM characteristics and the resolution of the ADCs.
The column outputs excited with 0.6 V pulses are digitized by the ADCs in parallel.
Fig. 5.10 Two implementations of transposable PIM macros with drivers, weight arrays, and ADC/neuron circuits for the inference (forward) and backpropagation (backward) paths [27]
Transposable PIM macros supporting both forward and backward propagation are
necessary for neural networks that perform inference and training [27]. In the inference mode, the forward
operation will be performed. In the training mode, weights will be updated through
backward propagation. Both propagation cases execute the convolution of weights
and inputs. However, the weights matrix will be transposed in backward propagation
compared with the feedforward one. Thus, the transpose weight matrix is necessary
for computation. To realize transposable PIM macros, the memory array storing
weights needs to be accessed vertically and horizontally. Figure 5.10 depicts two
different architectures of the transposable PIM macros [27]. The first implemen-
tation includes dedicated drivers, ADCs, and neurons for forward propagation and
backward propagation, respectively. However, the ADCs and the neurons can be
used only one direction at a time, which facilitates sharing by the two propagation
directions. The second implementation shows that the ADCs and the neurons are
shared by the inference propagation and the backpropagation. This reduces the
energy, latency, and area overheads coming from the ADCs and the neurons.
ReRAM PIMs face various challenges such as large bitline current, large offset in
sensing, overlap in bitline current for different MAC values, etc. Fig. 5.11a shows
a ReRAM PIM macro activating multiple wordlines simultaneously.

Fig. 5.11 (a) ReRAM PIM macro for parallel computation and (b) distribution of bitline current (IBL) over MAC values from 1 to 9, labeled by the number of LRS and HRS cells contributing to the current (1L0H to 9L0H) [11]

The bitline current distribution of each MAC value can be estimated depending on the ReRAM
device status. Figure 5.11b illustrates an example of the bitline current distribution
when assuming that the maximum MAC value from each column is 9 [11]. Here,
the smallest bitline current for the MAC value of “1” is “1L0H” (one ILRS and no
IHRS ), which can happen when only one wordline is turned on for MAC operation.
The maximum bitline current for the same MAC value is “1L8H” (one ILRS and
eight IHRS ) where 9 wordlines are turned on and 8 ReRAM devices are in the
HRS state. The selected ReRAM devices in the HRS state will generate IHRS even
though the computed MAC value has no difference, which needs to be considered
when sensing or digitizing the accumulated bitline current. Sensing margins can therefore be degraded by the unwanted IHRS contribution and the ReRAM current variations.
To tackle the sensing margin degradation issue caused by the ReRAM current
variations, an input-aware dynamic reference generation scheme is proposed in [11].
This scheme considers the reference current dependency on the input signal.
Therefore, instead of using fixed reference currents, the reference currents are
dynamically generated by input-aware replica rows. The replica rows are controlled
by the number of wordlines for the input signal, which is counted by a counter.
Figure 5.12 illustrates the distributions of the bitline current over various input
values. It is evident that the optimal reference current for sensing MAC values relies
on the number of activated wordlines (NWL).

Fig. 5.12 Bitline current distributions for MAC values 0–7 at different numbers of activated wordlines (NWL = 1, 2, 7, and 9), comparing fixed reference currents (IREF_FIX[7:1]) with input-aware reference currents (IREF_IA[7:1]) [11]

Fig. 5.13 Read path of the multibit ReRAM macro with SINWP, DSWCT, PN-ISUB, and a comparator-based sense amplifier (TMCSA) generating DOUTSIGN and DOUT[2:0] [18]

The input-aware reference current generation
is more critical for higher MAC values where the current difference between MAC
values is smaller because of the variations in the bitline current. It is reported that
the input-aware dynamic reference generation reduces the error rate by 50 times
compared to the conventional fixed reference generation scheme.
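The idea behind the input-aware references can be illustrated behaviorally: each reference level is placed between adjacent MAC levels and shifted according to the number of activated wordlines, which is what the replica rows in [11] achieve in circuit form. The current values below are assumed purely for illustration.

```python
# Behavioral sketch of input-aware references: each reference current sits
# halfway between adjacent MAC levels and shifts with the HRS current of the
# remaining activated rows. I_LRS and I_HRS are assumed example values.
I_LRS, I_HRS = 10.0, 0.5

def bitline_current(mac_value, n_wl):
    """Ideal bitline current for a given MAC value with n_wl activated rows."""
    return mac_value * I_LRS + (n_wl - mac_value) * I_HRS

def input_aware_refs(n_wl, max_mac):
    """Reference currents placed between adjacent MAC levels."""
    return [(bitline_current(m, n_wl) + bitline_current(m + 1, n_wl)) / 2
            for m in range(max_mac)]

for n_wl in (1, 2, 7, 9):
    refs = input_aware_refs(n_wl, max_mac=min(3, n_wl))
    print(n_wl, [round(r, 2) for r in refs])
# The reference set shifts upward with NWL because more activated HRS cells add
# current; a fixed reference set would misclassify MAC values at large NWL.
```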
Fig. 5.14 Schematic of the ReRAM array for SINWP (wordlines WL[0]–WL[i], bitlines BL[0]–BL[n], source lines SL[0]–SL[n]) [18]
Figure 5.14 shows the schematic of the ReRAM array for SINWP. The 2-bit input
is applied to the array at different timings. LSB is applied first, while MSB is
applied later. The array stores 2-bit weights using two ReRAM bit cells that are
implemented in two columns. The bitline currents from two columns are non-
weighted. Therefore, they need to be further processed to generate final weighted
current for digitization. The array for the negative weights also generates bitline
currents in the same way as the array for the positive weights.
Table 5.4 explains the realization of the positive and negative weights in SINWP.
Note that one resistance combination of two ReRAM bit cells indicates two weight
values with the same absolute value. The combination of (MCM = LRS, MCL =
LRS) is used for the weights of “3” and “−3” since it produces the largest bitline
current. Similarly, the weights of “1” and “−1” are realized by the combination of
(MCM = HRS, MCL = HRS) for generating the smallest bitline current.
Figure 5.15 shows how two bitline currents from each weight group are weighted
by a circuit called down-scaling weighted current translator (DSWCT). As shown
in Fig. 5.15a, the bitline current generated by the MSB weight (IMSB ) is downscaled
by 2, while the bitline current generated by the LSB weight (ILSB ) is downscaled
by 4. The schematic of DSWCT is presented in Fig. 5.15b. The weighted current
is generated by N0, N1, P0, P1, P2, and P3. The bitline current generated by MSB
weight (IDL_MSB[0] ) flows through P0 and N0 forming analog voltage at the gate of
P0. The gate voltage of P0 is shared with P1 whose size is half of P0. Therefore
the current flowing through P1 is half of IDL_MSB[0] . The bitline current generated
by LSB weight (IDL_LSB[0] ) is processed in a similar way except that the size of
P3 is a quarter of P2. Therefore, only a quarter of IDL_LSB[0] flows through P3.
As shown in Fig. 5.14, IDL_MSB[0] and IDL_LSB[0] are generated when IN[0] is
Fig. 5.15 (a) Operation principle of DSWCT, which produces IMSB × 1/2 + ILSB × 1/4, where IMSB and ILSB are the MSB and LSB current summations, and (b) circuit implementation [18]
Fig. 5.16 Read path current reduction: without the proposed techniques, with SINWP + DSWCT (3.6×), and with SINWP + DSWCT + PN-ISUB (3.83×) [18]
applied to the ReRAM array. These two current components need to be merged
with the current generated by IN[1]. Therefore, IWDL_MSB[0] and IWDL_LSB[0] are
stored in the capacitors after converting them into voltage at the gate node of N2
and N5. After this, IN[1] is applied to the array and generates bitline currents,
IDL_MSB[1] and IDL_LSB[1] . Note that these currents are not weighted when compared
with IDL_MSB[0] and IDL_LSB[0] . N3, N4, N6, and N7 merge IDL_MSB[0] , IDL_LSB[0] ,
IDL_MSB[1] , and IDL_LSB[1] after considering their weights. Since the weight of IN[1]
is 2× of IN[0], IWDL_MSB[0] and IWDL_LSB[0] are downscaled by 4 and IDL_MSB[1]
and IDL_LSB[1] are downscaled by 2 for proper merging. This is realized by selecting
the device size of N3 and N6 a quarter of N2 and N5. Consequently, the merged
current (IDL_P ) can be written as follows.
IDL_P = (1/4) (IWDL_LSB[0] + IWDL_MSB[0]) + (1/2) (IWDL_LSB[1] + IWDL_MSB[1])
      = (1/4) (IDL_LSB[0]/4 + IDL_MSB[0]/2) + (1/2) (IDL_LSB[1]/4 + IDL_MSB[1]/2)
The merged current from the negative weight group (IDL_N ) is also generated by
the same way as IDL_P . Finally, “IDL_P − IDL_N ” is computed by PN-ISUB whose
output is digitized. Figure 5.16 shows the read path current reduction achieved by
SINWP, DSWCT, and PN-ISUB. The improvement of 3.83× is obtained.
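The weighting performed by SINWP and DSWCT can be checked numerically. The sketch below assumes ideal, variation-free cell currents proportional to the stored bit values and applies the 1/2 and 1/4 scaling factors described above; it verifies only the arithmetic, not the analog circuit.

```python
# Numerical check that the SINWP + DSWCT scaling factors reproduce IN x W up to
# a constant. Ideal currents proportional to the bit products are assumed;
# HRS leakage and device variation are ignored in this sketch.

def dswct_merge(in_val, wm, wl):
    """2-bit input (0..3), 1-bit weight MSB/LSB; returns the merged current."""
    total = 0.0
    for j, cycle_scale in ((0, 1/4), (1, 1/2)):       # input LSB cycle, then MSB
        in_bit = (in_val >> j) & 1
        i_msb = in_bit * wm                            # non-weighted column currents
        i_lsb = in_bit * wl
        total += cycle_scale * (i_lsb / 4 + i_msb / 2) # DSWCT 1/4 and 1/2 scaling
    return total

for in_val in range(4):
    for w in range(4):
        wm, wl = (w >> 1) & 1, w & 1
        assert abs(dswct_merge(in_val, wm, wl) - in_val * w / 16) < 1e-12
print("merged current is proportional to IN x W (scale factor 1/16)")
```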
Fig. 5.17 2T2R ReRAM macro with a 256 × 128 array, two row decoders (ADDR_1 and ADDR_2), WLL/WLR logic, and a TCAM driver
Figure 5.18 shows the 2T2R bit cell and its normal ReRAM operation employed
in Fig. 5.17. It comprises two 1T1R ReRAM bit cells using a common source line
(SL) scheme [14, 31] and stores differential data. For read operation (Fig. 5.18b),
both WLL and WLR are enabled so that SL and SLB generate differential voltage
that can be sensed by a sense amplifier. The 2T2R bit cell employs a two-cycle
write operation. During the first cycle, BL is set to the set voltage (VSET ), and SL
and SLB are grounded. This will set the ReRAM devices to LRS. At the second
cycle, one of SL and SLB is set to the set voltage (VSET ) while the unselected
SL or SLB, and BL are grounded. This will make the selected ReRAM device
transit to HRS. Therefore, differential data can be written into the 2T2R bit cell
over two cycles. This may not be preferred in applications where the frequent write
operation is executed. However, in PIMs, read operation in various modes is much
more frequent. Therefore, it is acceptable to sacrifice one cycle for the differential
data writing.
Fig. 5.18 2T2R ReRAM bitcell: (a) schematic, (b) read, and (c) two-cycle write (cycle 1: erase, setting Q and QB to LRS; cycle 2: reset of the selected device) [15]
Fig. 5.19 2T2R ReRAM operation in the TCAM mode with search data (0, 1), sense amplifiers referenced to VREF, and match outputs (e.g., match[1] = 1 and match[2] = 0)
Figure 5.19 explains the 2T2R ReRAM operation in the TCAM mode. Search data
are loaded into WLL (e.g., (0, 1) in Fig. 5.19), and the bitlines (BL[i]) and the source
lines (SL[i] and SLB[i]) are precharged to VDD and grounded, respectively. When
there is a mismatch, the corresponding bitline is discharged quickly through the
mismatched cell and the sense amplifier in the corresponding bitline will produce
“0” as a result. If the number of mismatched cells increases, the discharging speed
will be higher. The overall search operation result will be generated through the
sense amplifiers.
The 2T2R ReRAM in Fig. 5.17 also supports logic-in-memory operation as shown
in Fig. 5.20. AND/NAND operations are executed by enabling two WLLs in two
rows with grounded BLs. Similarly, OR/NOR operations are performed by enabling
two WLRs in two rows with grounded BLs. XOR operation (Fig. 5.20a) can be
done by combining the results of the AND and NOR operations. XNOR operation
(Fig. 5.20b) can also be realized in a similar way with grounded SLs. Here, BLs are
connected to the sense amplifiers through reconfiguration. Figure 5.20c shows the
logic functions required for a full adder (FA) and a full subtractor (FS). Note that all
the logic functions can be achieved by the circuit configurations explained in Fig.
5.20a, b. However, FA requires two cycles since the sense amplifier output from Fig.
5.20a needs to be latched before using the configuration of Fig. 5.20b.
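The full-adder construction can be followed at the logic level with a small functional model. The in-memory AND, NOR, and derived XOR/XNOR operations are modeled here as plain Boolean functions, and the two-cycle latching of the array is only noted in comments; this is not a model of the 2T2R circuit itself.

```python
# Functional model of the 2T2R logic-in-memory primitives and a full adder
# built from them (the array needs two cycles, latching the first
# sense-amplifier output, as described for Fig. 5.20).

def and_op(x, y):  return x & y            # two WLLs enabled, BLs grounded
def nor_op(x, y):  return 1 - (x | y)      # two WLRs enabled, BLs grounded
def xor_op(x, y):  return 1 - (and_op(x, y) | nor_op(x, y))   # combine AND and NOR
def xnor_op(x, y): return 1 - xor_op(x, y)

def full_adder(a, b, cin):
    """Sum = a XOR b XOR cin, Carry = (a AND b) OR (cin AND (a XOR b))."""
    p = xor_op(a, b)                        # cycle 1 result, latched
    s = xor_op(p, cin)                      # cycle 2
    cout = and_op(a, b) | and_op(p, cin)
    return s, cout

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert s + 2 * cout == a + b + cin
print("full adder verified for all input combinations")
```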
Fig. 5.20 Logic-in-memory operations of the 2T2R ReRAM array: (a) XOR and (b) XNOR configurations, and (c) logic functions for a full adder (FA) and a full subtractor (FS)
This reduces power and area overheads significantly. Table 5.5 compares various
ReRAM macros supporting PIM functions without MAC operation.
5.9 Summary
ReRAM-based PIMs face design challenges such as large device variations and nonlinearity. This chapter introduces various design techniques, including cell-
level techniques, ReRAM PIM architectures, and CMOS circuit techniques for
addressing the aforementioned challenges. ReRAM PIMs will be more impactful
when the ReRAM fabrication technology becomes more mature and the key
ReRAM device parameters are improved.
References
16. Y. Chen et al., A reconfigurable 4T2R ReRAM computing in-memory macro for efficient edge
applications. IEEE Open J. Circuits Syst. 2, 210–222 (2021)
17. C.-X. Xue et al., A 22nm 4Mb 8b-precision ReRAM computing-in-memory macro with 11.91
to 195.7TOPS/W for tiny AI edge devices, in Proc. IEEE int. solid-state circuits conf. (ISSCC),
(IEEE, Piscataway, 2021), pp. 245–247
18. C.-X. Xue et al., A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel
MAC computing time for CNN based AI edge processors, in Proc. IEEE int. solid-state circuits
conf. (ISSCC), (IEEE, Piscataway, 2019), pp. 388–390
19. W. Lee et al., Multilevel resistive-change memory operation of Al-doped ZnO thin-film
transistor. IEEE Electron. Dev. Lett. 37(8), 1014–1017 (2016)
20. R. Yasuhara et al., Reliability issues in analog ReRAM based neural-network processor, in
IEEE international reliability physics symposium (IRPS), (IEEE, Piscataway, 2019), pp. 1–5
21. R. Mochida et al., A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-
network processor with cell current controlled writing and flexible network architecture. IEEE
Symp. VLSI Technol. 2018, 175–176 (2018)
22. A. Biswas et al., Conv-RAM: An energy-efficient SRAM with embedded convolution compu-
tation for low-power CNN-based machine learning applications, in Proc. IEEE int. solid-state
circuits conf. (ISSCC), (IEEE, Piscataway, 2018), pp. 488–490
23. C. Yu et al., A 16K current-based 8T SRAM compute-in-memory macro with decoupled
read/write and 1-5bit column ADC, in Proc. IEEE custom integrated circuits conference
(CICC), (2020)
24. C. Yu et al., A zero-skipping reconfigurable SRAM in-memory computing macro with binary-
searching ADC, in Proc. IEEE Eur. solid state circuits conf. (ESSCIRC), (2021)
25. C. Yu et al., A logic-compatible eDRAM compute-in-memory with embedded ADCs for
processing neural networks. IEEE Trans. Circuits Syst. I Regul. Pap. 68(2), 667–679 (2021)
26. J.M. Correll et al., A fully integrated reprogrammable CMOS-RRAM compute-in-memory
coprocessor for neuromorphic applications. IEEE J. Explor. Solid-State Computat. Devices
Circuits 6(1), 36–44 (2020)
27. W. Wan et al., A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically
reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models,
in Proc. IEEE intl. solid-state circuits conference (ISSCC), (2020), pp. 498–499
28. L. Zheng et al., Memristors-based ternary content addressable memory (mTCAM), in IEEE
int. symp. on circuits and systems (ISCAS), (2014), pp. 2253–2256
29. M. Chang et al., Designs of emerging memory based non-volatile TCAM for Internet-of-
Things (IoT) and big-data processing: A 5T2R universal cell, in IEEE int. symp. on circuits
and systems (ISCAS), (IEEE, Piscataway, 2016), pp. 1142–1145
30. D. Ly et al., In-depth characterization of resistive memory-based ternary content addressable
memories, in IEEE int. electron devices meeting (IEDM), (IEEE, Piscataway, 2018), pp.
20.3.1–20.3.4
31. C. Chou et al., An N40 256K×44 embedded RRAM macro with SL-precharge SA and low-
voltage current limiter to improve read and write performance, in Proc. IEEE int. solid-state
circuits conf. (ISSCC), (IEEE, Piscataway, 2018), pp. 478–479
32. M. Bocquet et al., In-memory and error-immune differential RRAM implementation of
binarized deep neural networks, in IEEE intl. electron devices meeting (IEDM), (IEEE,
Piscataway, 2018), pp. 20.6.1–20.6.4
Chapter 6
PIM for ML Training
6.1 Introduction
Machine learning (ML) inference is the evaluation process of a trained model for a
given input. To this end, it reads the input data and sends it through the various
ML layers, such as fully connected, convolutional, and recurrent layers, which
involve data-intensive computations with model parameters to get the final result.
It is a read-only and unidirectional process. On the other hand, ML training is the
process of finding the network’s weight and bias parameters that can perform a target
task. Mathematically speaking, it defines the cost function and updates the model
parameters to minimize the cost for the given training data set consisting of many
pairs of inputs and outputs. It involves numerous parameter updates with iterative
forward and backward propagation.
Due to its algorithmic complexity and limited usage, there are not many commercial
products available or under development for ML training, except in the cloud
datacenter domain. This is why the GPU is still the most dominant platform for
training, unlike inference, where many accelerators are challenging the GPU.
As discussed in the previous chapters, processing-in-memory (PIM) architecture
can improve both performance and energy efficiency in various ML workloads by
addressing the data movement bottleneck between the compute and memory device.
Since the training process generates more intermediate data and requires higher
bandwidth than the inference, PIM has greater opportunities in training, despite its
computational complexity. In this chapter, we will review the training computations
and look into the latest PIM works designed for ML training.
Unlike inference, which has a data-intensive but simple computational flow, the training
process has a complex flow with many iterations. The goal of training is to find all
the weight and bias parameters that minimize the cost function written in Eq. 6.1.
It represents the overall distance between the predicted outputs y_o and the training
sample outputs y_t.

argmin C = (1/2) ‖y_o − y_t‖²    (6.1)
Once the target model’s weight and bias parameters are initialized, the training
iterates the following steps for each training sample to minimize the cost function.
First, it performs feed-forward propagation (FP). It is the same as the inference
process; an input vector goes through the network for evaluation. It then computes
the error at the output by subtracting the training sample output from the evaluation result.
Second, it propagates the error backward from the output to the input (backward
propagation, BP). Third, it calculates the gradient for each layer to reduce the overall
difference between the predicted outputs and the ground truths (i.e., training set).
Lastly, it updates the weight and bias parameters. The above process is repeated
until the network converges, which means the magnitude of the updates becomes
smaller than a threshold.
We use the gradient descent method for optimization, which iteratively moves
in the steepest descent direction defined by the gradient’s negative to minimize the
cost function. With a stochastic process using mini-batching, stochastic gradient
descent (SGD) is the de facto standard in training as it enables fast convergence.
Equation 6.2 shows how the weights are updated in the SGD.
W^+ = W − η · ∂E/∂W    (6.2)
The input data is propagated through multiple layers in FP, executing multiply-and-
accumulate (MAC) operations between the input and weight data for each layer,
as discussed in Chap. 1. Eqs. 6.3 and 6.4 show the operation of the fully connected
layer and the convolutional layer, respectively, where f is a non-linear activation
function. For convolutional layers, down-sampling pooling functions can be placed
between layers.
Y^l = W^l H^{l−1} + b,   Z^l = f(Y^l)    (6.3)

Y^l = W^l ∗ H^{l−1} + b,   Z^l = f(Y^l)    (6.4)

where the symbol ∗ denotes convolution.
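For reference, Eqs. 6.3 and 6.4 translate directly into NumPy. The sketch below assumes a ReLU activation, a single-channel convolution with unit stride and no padding, and arbitrary layer sizes; these choices are illustrative and not part of the equations.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def fc_forward(w, h_prev, b):
    """Eq. 6.3: Y = W H + b, Z = f(Y) for a fully connected layer."""
    y = w @ h_prev + b
    return y, relu(y)

def conv2d_forward(w, h_prev, b):
    """Eq. 6.4 (single channel, no padding, stride 1): Y = W * H + b, Z = f(Y)."""
    kh, kw = w.shape
    oh, ow = h_prev.shape[0] - kh + 1, h_prev.shape[1] - kw + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(w * h_prev[i:i + kh, j:j + kw]) + b
    return y, relu(y)

rng = np.random.default_rng(0)
y_fc, z_fc = fc_forward(rng.normal(size=(4, 8)), rng.normal(size=8), np.zeros(4))
y_cv, z_cv = conv2d_forward(rng.normal(size=(3, 3)), rng.normal(size=(6, 6)), 0.0)
print(z_fc.shape, z_cv.shape)    # (4,) (4, 4)
```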
Once FP is done, the final layer’s output is the predicted result and is used for
calculating the error against a labeled sample of the training set. Although the
computation itself is the same as the inference, FP of the training process needs to
keep the intermediate results because they will be re-used in the later steps. For each
layer, it needs to store both activation and activation gradient, the gradient of the
layer’s activation function with respect to the layer’s activation, as they are necessary
for backward propagation and gradient calculation (GC). For the convolution layers
with pooling, selected positions out of the pooling window are needed for error
propagation during BP.
In BP, the calculated error δ is propagated layer by layer from the output to
the input side. Based on the chain rule, we calculate the first-order derivative of
the cost function with regard to each parameter to compute the propagated error
and gradient matrix for each layer. Equation 6.5 shows the equation for fully
connected layers, which multiplies the transposed weight matrix to the propagated
error from the previous layer and applies the Hadamard product (i.e., element-
wise multiplication) with the activation gradient. Likewise, Eq. 6.6 shows the
equation for convolutional layers, which applies deconvolution instead of transposed
multiplication. Deconvolution is the same as convolution with 180° rotated weights
after zero padding.
δ^L = (Y − Y_t) ⊙ f′^L(Z^L)    if output layer L
δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′^l(Z^l)    if l < L
(6.5)

δ^L = (Y − Y_t) ⊙ f′^L(Z^L)    if output layer L
δ^l = W^{l+1} ⊛ δ^{l+1} ⊙ f′^l(Z^l)    if l < L
(6.6)

where ⊙ denotes the Hadamard product, f′ the activation gradient, and ⊛ deconvolution.
The error propagation process is different for average pooling and max pooling. The
error is divided evenly by the square of the window size in the case of average pooling. For
the max pooling case, the error only propagates to the max positions stored during
the FP stage.
To calculate the gradient matrix, each layer operates the propagated error and the
activation. As shown in Eqs. 6.7 and 6.8, outer product and convolution are applied
for the fully connected and convolutional layer, respectively. Like in BP, GC utilizes
the activation results stored during FP.
W^{l+} = W^l − η · (δ^l ⊗ H^{l−1})    (6.7)

W^{l+} = W^l − η · (δ^l ∗ H^{l−1})    (6.8)
Finally, the weight parameters are updated by subtracting the multiplied product
scaled by the learning rate. The learning rate is a hyperparameter that decides the
magnitude of a moving step in SGD, deciding how fast the learning would be, while
the gradient represents the direction of movement to minimize the loss. If the mini-
batch size is more than one, the calculated gradient matrices should be averaged
before the weight update is performed.
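Putting Eqs. 6.1–6.7 together, one SGD iteration for a small two-layer fully connected network can be sketched as follows. The ReLU hidden layer, linear output layer, layer sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

def relu(y):       return np.maximum(y, 0.0)
def relu_grad(y):  return (y > 0).astype(float)

def sgd_step(weights, biases, x, y_t, lr=0.01):
    """One FP/BP/GC/WU iteration for a two-layer fully connected network
    (ReLU hidden layer, linear output layer), following Eqs. 6.2-6.7."""
    # Forward propagation (Eq. 6.3), keeping the intermediate results.
    y1 = weights[0] @ x + biases[0]
    h1 = relu(y1)
    y_o = weights[1] @ h1 + biases[1]
    # Backward propagation of the error (Eq. 6.5).
    delta2 = y_o - y_t                                  # output-layer error
    delta1 = (weights[1].T @ delta2) * relu_grad(y1)    # propagated error
    # Gradient calculation (Eq. 6.7) and weight update (Eq. 6.2).
    grads_w = [np.outer(delta1, x), np.outer(delta2, h1)]
    grads_b = [delta1, delta2]
    for l in range(2):
        weights[l] -= lr * grads_w[l]
        biases[l]  -= lr * grads_b[l]
    return 0.5 * np.sum((y_o - y_t) ** 2)               # cost of Eq. 6.1

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(8, 4)), rng.normal(scale=0.5, size=(2, 8))]
biases  = [np.zeros(8), np.zeros(2)]
x, y_t  = rng.normal(size=4), np.array([1.0, 0.0])
for _ in range(200):
    cost = sgd_step(weights, biases, x, y_t)
print(f"cost after 200 iterations: {cost:.6f}")         # decreases over iterations
```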
Although SRAM-based PIM suffers from low memory density, its logic-compatible
process is best suited to implementing high-speed circuits. Some SRAM-based PIM works [1–3]
have been proposed for on-device training, achieving high energy efficiency with
good training accuracy.
Su et al. [1] suggest the first SRAM-based PIM supporting both FP and BP stages
of the training. Before this work, previous PIM chips mostly focused on low-energy
inference scenarios for intelligent edge applications. This paper proposes a two-
way transpose (TWT) SRAM macro that supports multi-bit MAC operations for FP
and BP with high energy efficiency and compact area. It also contains customized
sense amplifiers called small-offset gain-enhancement amplifiers to reduce energy
consumption. The authors fabricated a chip in a 28 nm CMOS technology to verify
the proposed design. Even though it is the first fabricated SRAM-based PIM that
performs both FP and BP, it is impractical to be used in actual ML training because
it does not cover the whole training process, missing the GC and WU stages.
Figure 6.1 shows the overall design of the TWT SRAM PIM macro. It consists of FP
input driver, BP input driver, and 32 × 16 multibit-weight-product-units (MWPUs),
whose total memory size is 64K bits. To support both multi-bit computation and
digital conversion of the computed analog signal, the macro contains 16 multibit-
readout units for FP (MRU-F) at the bottom and 32 MRUs for BP (MRU-B) on the
right of the MWPU array. Each MWPU is composed of 8 bit-wise product units
(BWPUs), in which each BWPU contains 16 SRAM cells in a column. An 8-bit
weight is stored in the 8 cells on the same rows across the BWPUs. At the bottom of
the BWPU, a TWT multiply cell (TWT-MC) exists to multiply the 1-bit weight from
the cells and 2-bit input, utilizing voltage variation of bitline. The macro iterates
multiple phases of 2-bit multiplication if the input bit width is greater than 2 and a
multiple of 2.
Figure 6.2 illustrates how TWT macro performs in-memory multiplication in the
BWPU. The TWT-MC inside BWPU has 2 pass transistors (N1 and N2 ) and 2×3
multiply transistors (N3 -N5 and N6 -N8 ). For the case of FP, the macro precharges
column-read-bitline (C-RBL) to VDD and sets row-read-bitline (R-RBL) to the
ground. Then, it injects the two input bits, which is an activation data, via the
forward wordline of MSB (FWLM) and LSB (FWLL). The FWLM and FWLL
are connected to N3 and N7 , respectively, and the weight is connected to both N5
and N6 . Their values decide to either connect the path from the pre-charged bitline,
C-RBL, to the ground or disconnect the path. If connected, the voltage drop occurs
in C-RBL. To differentiate the amount of voltage drop according to the bit position
of the input bits, the width of the multiply transistors whose gates are connected to
a higher input bit, i.e., N5 transistors, is double that of those connected to the lower
input bit, i.e., N6 transistors. This is because the current of a transistor increases with
the gate width. As a result, the voltage drop that ranges from 3ΔV to 0 is generated
on C-RBL, and its value is equivalent to the multiplication between the 2-bit input
and 1-bit weight, as delineated in the table. Since the 32 BWPUs across the MWPUs
on the same column share the C-RBL, their voltage drops are all accumulated via
the charge sharing. Once this analog computation is done across all the columns of
BWPUs, the results are transferred to the MRU-Fs in which each contains SOGE-
SA, shifter, and digital adder. SOGE-SA converts the analog values to digital values,
and the shifter performs shifting considering their bit positions. The digital adder
accumulates the shifted results and generates the final value in 20 bits.
During BP, the macro precharges R-RBL to VDD and sets C-RBL to ground.
Then, it injects the two input bits, which is an error in this case, through the
backward wordline of MSB (BWLM) and LSB (BWLL). The post steps are similar
to FP, except MWPUs on the same row are added across different columns through
charge sharing of R-RBL to implement a transposed multiplication. In addition, the
computation results are transferred horizontally to MRU-B.
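The bit-position weighting by transistor width can be mirrored in a simple behavioral model, where the C-RBL voltage drop of each BWPU is an integer multiple of a unit ΔV and the drops of all BWPUs on a column accumulate. The charge sharing and SOGE-SA conversion are idealized here, and the unit ΔV is an assumed value.

```python
# Behavioral model of the TWT bit-wise product unit (BWPU): a 2-bit input and a
# 1-bit weight produce a C-RBL voltage drop between 0 and 3*dV, and all BWPUs
# on the same column accumulate their drops (charge sharing is idealized).
DV = 1.0   # assumed unit voltage drop

def bwpu_drop(in2bit, weight_bit):
    """Drop is (2*MSB + LSB) * dV when the weight bit is 1, else 0."""
    msb, lsb = (in2bit >> 1) & 1, in2bit & 1
    return (2 * msb + lsb) * DV if weight_bit else 0.0

def column_drop(inputs_2bit, weight_bits):
    return sum(bwpu_drop(i, w) for i, w in zip(inputs_2bit, weight_bits))

inputs  = [3, 1, 2, 0]       # 2-bit activations on 4 BWPUs of one column
weights = [1, 1, 0, 1]       # 1-bit weight slices stored in the cells
print(column_drop(inputs, weights))   # 4.0 = (3 + 1 + 0 + 0) * dV
```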
6.3.2 CIMAT
Fig. 6.3 (a) 7T transpose SRAM cell (b) 8T transpose SRAM cell (c) Overall architecture of
CIMAT
CIMAT covers all four stages of training (i.e., FP, BP, GC, and WU), unlike the TWT SRAM PIM, which only
covers the first two. Modeled in a 7 nm CMOS technology, CIMAT successfully
trains ResNet-18 on ImageNet, achieving an energy efficiency of
10.79 TOPS/W with an area of 121.51 mm². It is simulated based on NeuroSim [4].
CIMAT proposes custom 7T and 8T transpose SRAM cells for in-memory processing
while having standard 6T SRAM cells for generic usage as well. Figure 6.3a
shows the proposed 7T SRAM cell, which allows bidirectional read, horizontal
and vertical, and read-disturb-free access. It feeds the activation data to the weight
data stored in the cell array differently according to its operation mode. During FP,
column read wordline (C_RWL) is used for activation injection, and column-read-
bitline (C_RBL) is used as a bitline for vertical partial sum read-out. The value
of C_RBL becomes 1 only if both the weight bit on Q and the injected bit on
C_RWL are 1. Therefore, the in-memory computation implements AND operation,
and CIMAT uses this as a basic operation. For BP, their roles are changed. An error
bit is injected through R_RWL, while R_RBL is used for horizontal partial sum
read-out.
CIMAT also proposes an 8T SRAM cell that can execute FP and BP concurrently
for higher throughput as shown in Fig. 6.3b. Occupying an additional area, the 8T
SRAM cell design adds a PMOS transistor whose gate is connected to QB to support
read accesses from both sides of the cell. In addition, R_RWL and R_RBL are added
and connected to the PMOS transistor. C_RWL and R_RWL are used for activation
and error input, respectively. Likewise, C_RBL and R_RBL are used for reading out
partial sum by column and row, respectively.
Figure 6.3c shows the overall architecture of CIMAT, having a memory array
based on the proposed 7T/8T SRAM cells with extra periphery circuits. To enable
read/write and in-memory operation of the SRAM cell array, CIMAT has wordline
writers in both directions for injecting activations via C_RWLs and errors via
R_RWLs to the cells, a wordline decoder for writing weights to the cells, and
a precharger for pre-charging write bitlines (WBLs) and bitline bars (WBLBs).
To compute the result of the SRAM cell array, it has ADCs for the partial sum
quantization and shifter and adder for accumulation of digital partial sums in bit-
serial arithmetic. Two groups of periphery circuits exist at the bottom and the right
side of the SRAM array. There is a special row of 6T SRAM cells at the top side of
the SRAM array for WU.
CIMAT flattens a 3-d kernel in the direction of input channels and maps the
flattened elements from different filters to different columns in a sub-array. Different
locations of the elements in a filter are stored across different sub-arrays (e.g., 9
sub-arrays are required for 3 × 3 × N filters). CIMAT uses an adder tree to add
the partial sums from different sub-arrays. To perform MAC operation using the
in-memory operation, CIMAT pre-writes weights to the 7T/8T SRAM cells and
injects activations through read-wordlines, C_RWL in Fig. 6.3. CIMAT adds the
results of in-cell AND operations among different rows through the read bitlines and
accumulates the results of different sub-arrays through the adder tree. Even though
the adder tree can be an additional cost, it adopts this spatial accumulation scheme to
make the FP and BP operation symmetric. Without this mapping, an accumulation
result inside a sub-array is asymmetric because the result of FP is an entire partial
sum while the result of BP is just part of the partial sum.
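The mapping and spatial accumulation described above can be expressed compactly: one sub-array per kernel position, one column per output channel, and an adder tree summing the per-position partial sums. The sketch below reproduces only the arithmetic of this mapping with assumed tensor sizes; the in-cell AND and bit-serial details are abstracted away.

```python
import numpy as np

def cimat_conv_output(x_patch, kernels):
    """x_patch: (kh, kw, cin) input window; kernels: (kh, kw, cin, cout).
    Each (i, j) kernel position is mapped to one sub-array whose columns hold
    the flattened input-channel weights of the different output channels."""
    kh, kw, cin, cout = kernels.shape
    total = np.zeros(cout)
    for i in range(kh):
        for j in range(kw):                       # one sub-array per position
            sub_array = kernels[i, j]             # shape (cin, cout)
            partial = x_patch[i, j] @ sub_array   # accumulation inside the sub-array
            total += partial                      # adder tree across sub-arrays
    return total

rng = np.random.default_rng(0)
x_patch = rng.normal(size=(3, 3, 16))
kernels = rng.normal(size=(3, 3, 16, 8))
ref = np.einsum('ijk,ijkc->c', x_patch, kernels)
assert np.allclose(cimat_conv_output(x_patch, kernels), ref)
print("adder-tree accumulation matches the direct convolution result")
```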
The transpose SRAM cell design of CIMAT removes the overhead of transposing
the weight matrix of BP. With the proposed array design, the mapping scheme
of the weight matrix to the SRAM array does not need to change. Instead, the
accumulation direction should be reversed, from vertical to horizontal. With the
same periphery circuits, the rest of the computation is the same as FP. With the
proposed 8T SRAM design, FP and BP can be performed simultaneously within the
same sub-array.
CIMAT uses extra non-transpose 6T SRAM arrays during GC to perform
convolution operations between the error maps and corresponding activation maps.
It first saves the error data in the SRAM array and loads the activation from the
off-chip DRAM to the on-chip buffer. Each plane of error data is stretched into one
column, and the next column stores the following output channel elements. CIMAT
executes bit-wise multiplication and accumulation using the periphery circuits used
in FP. The results of the columns form the entire gradient matrix. The multi-batch
mode sends the gradients to off-chip DRAM to store the data. At the end of each
batch, gradients are loaded back and accumulated on-chip.
After GC is done, each row of the gradient results is fetched to the additional 6T
SRAM row residing above the 7T/8T SRAM. The data are fed into the shift registers
row-by-row in a read-modify-write mode to multiply the learning rate. After that,
the 6T row and a paired weight row of 7T or 8T are activated simultaneously. The
subtraction of the two rows is done in the weight update module. To speed up this
row-by-row data processing, CIMAT proposes an array-level pipelined architecture
that updates different significant bits at different stages, as shown in Fig. 6.4.
In the case of 7T SRAM, CIMAT uses a 7-stage layer-level pipelining during FP and
BP for achieving high throughput (Fig. 6.5a). Each stage computes multiple layers
of a different image, while multiple images are processed concurrently throughout the pipeline.
6.3.3 HFP-CIM
Most PIM macros compute with fixed-point numbers for hardware efficiency.
However, using fixed-point numbers in ML training can harm its accuracy, which is
a critical problem.
HFP-CIM suggests three key features to process floating-point numbers in PIM
efficiently: (1) heterogeneous floating-point computing architecture and hardware
design, (2) computing algorithm that reduces the communication between exponent
and mantissa, and (3) data mapping and additional computing unit to support ML
training. HFP-CIM is fabricated and verified in a 28 nm CMOS technology.
Fig. 6.6 (a) Conventional floating-point computing (b) Heterogeneous floating-point computing
With this approach, the CIM can execute a floating-point MAC within two cycles. Figure 6.6b shows the
concept of the proposed heterogeneous floating-point computing.
Figure 6.7a shows the overall architecture of ECIM. It comprises CIM local
arrays (CLAs), CLA decoder, wordline driver, normal I/O interface, and the
peripherals for exponent computations. Its in-memory processing is composed of
four steps: (1) storing the weight’s exponent value in SRAM cells, (2) executing
in-cell AND/NOR operation on local bitline (LBL) and bitline bar (LBLB) by
enabling wordline of computing row after pre-charging LBL with exponent value
of an input, (3) transferring the value of LBL and LBLB to global bitline bar
(GBLB) and bitline (GBL) through drivers, and (4) loading the results in the global
lines to peripheral circuits for further processing such as addition, subtraction,
and comparison. During this process, the ECIM reduces power consumption with
the following two features. First, it only precharges a single bitline, while the
conventional memory precharges both bitlines (i.e., LBL and LBLB) to VDD for
read operation. ECIM either precharges LBL or LBLB depending on the input
value. Second, it reuses the charge in GBL to reduce the switching power of ECIM.
Not precharging every cycle, GBL reuses its charge from the previous cycle if the
current cycle’s in-cell AND/NOR result is the same. Since DNN computations tend
to produce similar values, exploiting temporal locality in memory is effective for
energy-efficient hardware design.
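The split between exponent handling (in ECIM) and mantissa handling (in the MPE) can be illustrated with ordinary Python floats: exponents are added, mantissas are multiplied, and partial products are aligned to a common exponent before accumulation. This mirrors only the arithmetic decomposition, not the in-memory AND/NOR implementation or its optimizations.

```python
import math

def float_mac(inputs, weights):
    """Accumulate sum(inputs[i] * weights[i]) by handling exponents and
    mantissas separately, mimicking the exponent/mantissa split of HFP-CIM."""
    products = []
    for x, w in zip(inputs, weights):
        mx, ex = math.frexp(x)        # x = mx * 2**ex, with 0.5 <= |mx| < 1
        mw, ew = math.frexp(w)
        products.append((mx * mw, ex + ew))   # mantissa multiply, exponent add
    # Align all partial products to the largest exponent before accumulation.
    e_max = max(e for _, e in products)
    acc = sum(m * 2.0 ** (e - e_max) for m, e in products)   # shifted mantissas
    return acc * 2.0 ** e_max

inputs, weights = [0.75, -1.5, 2.25, 0.125], [1.5, 0.5, -0.25, 4.0]
ref = sum(i * w for i, w in zip(inputs, weights))
assert abs(float_mac(inputs, weights) - ref) < 1e-12
print(float_mac(inputs, weights), ref)
```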
Even though ECIM reduces the power consumption of exponent computation,
mantissa computation in MPE still has two problems. First, the normalization result
of mantissa has to be transferred to ECIM every cycle for exponent update. This
process causes throughput degradation due to massive communication between the
mantissa and exponent part. Second, the power consumption of mantissa’s expen-
sive arithmetic units, such as the shifter and leading-one detector, is still high even after
the ECIM is adopted. To solve these problems, HFP-CIM develops a mantissa-free
exponent computation scheme (MFEC).
Figure 6.8a shows the overall processor design using HFP-CIM, composed of
the top RISC controller, multiple instances of heterogeneous-exponent-mantissa-
training-core (HEMTC), aggregation & activation core (AAC) that accumulates
Fig. 6.8 (a) Overall architecture of proposed processor (b) Data mapping
partial sums from HEMTCs, and 1-D SIMD core computing element-wise multi-
plication. The HEMTC contains the ECIM macro and MPEs. Utilizing these units,
the processor supports the ML training composed of FP, BP, GC, and WU. It devises
a data mapping that can additionally support the convolutional layer as well as
maximally reuse partial sums for MFEC.
Figure 6.8b shows the data flow and data mapping of the proposed processor.
There are two kinds of accumulation, and the first one is input channel accumulation.
Each column of CLAs in ECIM contains the weight of different output channels,
which is first flattened in the input channel direction. These values between all the
rows are accumulated. After the input channel accumulation, the weight value in
ECIM is changed to another element of the kernel window, the corresponding input
value is injected, and the same accumulation process is performed. This second step is
called image accumulation, and during these two steps, the partial sums are accumulated in
the same memory space. The paper does not describe in detail how the processor executes
all the numerous functions of ML training or how the intermediate data are moved.
As previous works [6, 7] exploit the sparsity of data for high energy efficiency,
HFP-CIM processor also performs zero-skipping. The zero skip controller in
HEMTC gets non-zero encoded exponent, mantissa, and bitmap from memory. Then
it converts them into non-zero value feature maps with their corresponding indexes,
saves them in queues, and feeds the data into ECIM and MPE. With these data
feeding schemes, the HEMTC reduces energy consumption and latency by skipping
the calculation of zero values.
6.4.1 PipeLayer
Song et al. [10] propose PipeLayer, a ReRAM-based PIM accelerator for CNN.
The authors claim that previous ReRAM-based PIM works, PRIME [8] and ISAAC
[9], cannot support training for a few reasons. Neither work considers the
complex data dependency of training, and their data organization and data mapping
for training are unclear.
Figure 6.9 shows the overall architecture of PipeLayer. Note that it does not
include any processing units for computations as the ReRAM cell arrays substitute
them. The in-memory computation of PipeLayer is similar to that of ISAAC.
It uses a weight spike coding scheme for input/output signaling to remove the
overhead of ADCs and DACs. Unlike ISAAC, which still needs ADCs for output
spikes, PipeLayer does not need both DACs and ADCs thanks to the spike driver
and integration-and-fire circuits, respectively. When the input is N bit, the spike
driver iterates N times and generates a sequence of weighted spikes by looking up
reference voltage at each cycle. Then, it feeds them to the ReRAM cell array (i.e.,
Morp). The weight data is stored in the ReRAM cell array as cell conductance,
and cells are located at the cross points of the wordlines and bitlines. Since the
multiplication result of conductance and voltage is a current value, a current flowing
at each cell can be viewed as a multiplication result of the input value from the spike
driver and the weight value in the cell. Then, PipeLayer accumulates the in-cell
multiplication results by sharing the current on a bitline. After accumulation, it uses
integration-and-fire (I&F) unit, which integrates input current and generates output
spikes. Furthermore, the counter connected to the output spikes finally converts the
spikes to digital values. For the network that needs high resolution, it accumulates
the partial sums after shifting.
As shown in Fig. 6.10a, PipeLayer flattens weight kernels and stores them in
Morp. Each column of the cell array includes the weights from a flattened kernel.
PipeLayer feeds the input data after flattening them with the same method as the
kernel’s. For the input case whose width and height are set to 114, PipeLayer feeds
the input data through wordlines and accumulates the multiplication results through
bitlines. Because it takes only a single cycle for element-wise multiplications
between a flattened input and each of the weights by using all the cross points,
it needs 112 × 112(=12544) cycles to finish all the outputs. Since the mapping of
all kernels to a single huge ReRAM sub-array is unrealistic, PipeLayer partitions it
into smaller ReRAM sub-arrays with a size of 128 × 128.
To improve performance, PipeLayer can compute multiple flattened inputs at
the same time for the weights in the same layer. This strategy is called intra-layer
parallelism. To this end, PipeLayer needs to store the same weight data G times
(Fig. 6.10b). For an extreme case, the results of the layer could be generated in just
one cycle if G is 12544. The authors set G to 256, considering the linear
increase in hardware cost. Since it is only simulation-based, we are not certain that
PipeLayer can be efficiently implemented.
In addition, PipeLayer exploits inter-layer parallelism where it computes multiple
layers from different images in a mini-batch in parallel. For this parallelism, the
proposed design pipelines computations from different images that have
no data dependency between them. Figure 6.10c compares a conventional
design and the proposed strategy. In the conventional design, the computation is
Fig. 6.10 (a) Data mapping of PipeLayer (b) Intra-layer parallelism (c) Inter-layer parallelism
sequential and has a long latency since there is no pipelining. The pipelining design
of PipeLayer increases the throughput by computing different images at different
Morp concurrently. However, the PipeLayer needs more buffers to enable this high-
performance pipelining design, which causes extra area overhead.
6.4.2 FloatPIM
FloatPIM performs digital in-memory logic by conditionally switching an output
memristor based on the states of the inputs. If the state of all parallel-connected input devices is ROFF, the state
of the output device does not change. However, if one of the input devices' states
changes to RON, the output memristor is switched from RON to ROFF. Since an
assertion among the input devices causes the output to be de-asserted, it implements the
NOR operation. As NOR operation is functionally complete, other arithmetic oper-
ations such as addition and multiplication can be implemented. Each row contains
cells for storing both operands to read simultaneously and separate processing cells
only for storing intermediate results. In-memory operation of ReRAM-based PIM
is slower than CMOS-based PIM due to the slow switching speed of the memristor.
To overcome this, FloatPIM suggests even more parallelism during computation. It
can compute addition and multiplication in parallel, irrespective of the number of
rows.
Figure 6.11b depicts the overview of FloatPIM. It is composed of crossbar
memory blocks, in which each block contains data from a different layer of
DNN. The memory blocks store only weight data during inference. However,
during training, they store weights, activation gradients (derivatives of activation
functions), and results of activation functions. Each block sends computation results
to the next block through the switch that aligns the data structure for the data transfer
phase.
When a vector is read from the block, the switch connects each row data point to the corresponding
driver's column for the write operation to the next block. The shifter is inside the
memory block to support convolution operation. The controller computes the loss
function and controls data drivers and switches.
FloatPIM uses different parallelism schemes in FP and BP. For FP, it computes
each batch in each tile at the same time. For BP, the FloatPIM has 2 configurations:
low-power FloatPIM (FloatPIM-LP) and high-power FloatPIM (FloatPIM-HP). It
determines the parallelism strategy considering the trade-off among speed, energy
efficiency, and memory size. In the FloatPIM-LP, a single memory block iteratively
computes all data points in a mini-batch, generates gradients, and subtracts gener-
ated gradients from the current weights. In contrast, in the FloatPIM-HP, multiple
blocks compute different data points in a mini-batch in parallel, sum up the gradients
across different blocks, and update the weights. Even though FloatPIM-HP performs
computation faster, the weights need to be duplicated in each block. This duplication
makes FloatPIM-HP consume large memory and energy.
During FP, FloatPIM processes the input data in a pipeline stage. While the value
of a single batch passes through each data point, FloatPIM stores gradients and the
results of activation functions in each data point. For BP, FloatPIM measures the
loss function in the last output layer and updates the weights of each layer using the
previously stored data, while error propagates each data point.
Figure 6.13a shows how FloatPIM performs two key operations of CNN: matrix-
vector multiplication and convolution operation. FloatPIM stores multiple copies
of the input vector in multiple rows and stores the weight matrix in the transposed
shape for matrix-vector multiplication. It first performs the multiplication between
the inputs and weights and then accumulates the multiplication results horizontally.
While FloatPIM performs the computation in a row-parallel way for high perfor-
mance, it always needs extra memory for input vector copy, which is memory area
overhead.
FloatPIM performs convolution using weight interconnect logic, which is a
barrel shifter. It prevents frequent memory write operation, which is a considerable
overhead in ReRAM-based PIM since its memory write operation is slow. During
the convolution, FloatPIM stores all convolution weights in a single row and
copies them to other rows. It first multiplies the corresponding inputs and weights
considering the convolution window and computes the next multiplication with
shifted inputs. Then, it performs accumulation with the results for the final result.
Specifically, for the N × N convolution window, the number of the shift operations
is N − 1.
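The shift-based convolution can be reproduced functionally in one dimension: the weights stay in place, the inputs are shifted N − 1 times, and each alignment contributes one row-parallel multiply-accumulate step. The sizes below are arbitrary, and the 2-D case follows the same pattern per kernel row.

```python
import numpy as np

def shift_conv1d(x, w):
    """Row-parallel 1-D convolution: each row r computes one output position.
    The weights stay fixed; the input vector is realigned N-1 times and one
    multiply-accumulate step is performed per alignment."""
    n = len(w)
    rows = len(x) - n + 1
    acc = np.zeros(rows)
    for k in range(n):                       # initial alignment plus n-1 shifts
        shifted = x[k:k + rows]              # input aligned for this step
        acc += w[k] * shifted                # multiply in all rows in parallel
    return acc

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([0.5, -1.0, 2.0])
assert np.allclose(shift_conv1d(x, w), np.correlate(x, w, mode='valid'))
print(shift_conv1d(x, w))
```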
Since FloatPIM performs frequent data copy during both operations, it supports
an optimized data copy operation, which writes the same value to all rows within
two cycles. FloatPIM supports the Sigmoid function by using three terms of the
Taylor expansion and the ReLU function for the activation. The max/min pooling
Fig. 6.13 (a) Matrix-vector multiplication and convolution (b) Training of FloatPIM
of FloatPIM first compares the exponent and then compares the mantissa with the
same maximum/minimum exponent.
During BP, GC, and WU, the error vector propagates to corresponding memory
blocks to access the required data points for weight update. For BP, FloatPIM multiplies the copied error vector by the transposed weight matrix and then by the activation gradient stored during FP. The resulting error propagates to the next memory block. For GC, FloatPIM multiplies the same copied error vector by the activation-function result scaled with the learning rate and finally updates
the weight matrix. The whole process of CNN training in FloatPIM is summarized
in Fig. 6.13b.
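These steps correspond to the standard backpropagation relations for a fully connected layer l, written below in our own notation (not FloatPIM's): the error is propagated through the transposed weights and the stored activation gradient, and the weights are updated with the error scaled by the learning rate η and the stored activation result.

$$\delta_{l-1} = \left(W_l^{\mathsf T}\,\delta_l\right)\odot g'(z_{l-1}), \qquad W_l \leftarrow W_l - \eta\,\delta_l\,a_{l-1}^{\mathsf T}$$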
Chapter 7
PIM Software Stack
This chapter discusses a software stack for PIM and the challenges that must be overcome for PIM to fit well into conventional computer architecture. In standard terms, a software stack consists of layers of software components that together form a complete platform for supporting applications without requiring any additional components. It provides an interface between hardware and programmers through the layers of application, framework, library, runtime, and device driver. In order to adopt PIM efficiently into the conventional architecture, a software stack needs modification on these layers, as shown in Fig. 7.1.
The primary purpose of the PIM software stack is not just to make an application run on the PIM hardware; the application must be optimized for the PIM hardware in a seamless manner with respect to hardware utilization, scheduling, and code optimization. The stack must also aim for high programmability and optimization across a variety of applications and system architectures, which makes PIM hardware convenient for programmers and system architects.
In order to adopt PIM properly, we must thoroughly address all of the software architecture layers, including the PIM application, PIM library, and PIM device driver. An application is high-level software code that users write. However, not every type of application can exploit PIM well. Due to the high internal bandwidth
CRF, while the host is responsible for providing instructions to each PIM execution
unit. HBM-PIM’s instructions and instruction format are illustrated in Fig. 7.4.
The software stack shown in Fig. 7.2 is modified to utilize the PIM execution
unit efficiently. It supports basic linear algebra subprograms (BLAS), runtime, and
a device driver to allow users to run the original source code without any modifica-
tions. As another programming option, it also supports PIM custom operations that directly invoke the PIM hardware. These are PIM-BLAS-based TensorFlow operations: addition (ADD), multiplication (MUL), rectified linear unit (ReLU), long short-term memory (LSTM), general matrix-vector multiplication (GEMV), and batch normalization (BN). They explicitly call the corresponding PIM BLAS library. The PIM BLAS then calls the PIM kernel to generate PIM micro-kernel
codes and execute them. This path allows users to use the PIM execution unit manually and directly. When manual programming with PIM custom operations is not used, the PIM runtime is responsible for a seamless connection. It optimizes the TensorFlow operations and invokes a PIM kernel without requiring the user to modify the original source code. The PIM runtime consists of a pre-processor, memory manager,
and executor. The pre-processor analyzes the TensorFlow operations to find which
operations to offload to the PIM execution unit at runtime. The memory manager
manages the PIM operations and also maps PIM micro-kernel code and operand
data to the memory space allocated by the PIM device driver. The key is to match the data location and the execution unit to minimize the data movement overhead. The PIM executor calls and configures the PIM kernel. The PIM device driver reserves memory for the PIM execution unit. It forces the reserved memory space to be uncacheable to guarantee that DRAM memory accesses from the host processor reach PIM. It also avoids cache coherence issues between the host and the PIM execution unit by not caching the shared data.
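The two programming paths can be contrasted with a short sketch. The snippet below is a hypothetical Python illustration: the names pim_blas.gemv and pim_runtime.offload are placeholders, not the actual HBM-PIM API.

```python
import numpy as np

# Path 1: explicit PIM custom operation (hypothetical API). The programmer calls a
# PIM BLAS routine directly, which generates and runs the PIM micro-kernel on the
# reserved, uncacheable memory region:
#   y = pim_blas.gemv(weight_matrix, input_vector)

# Path 2: transparent runtime path. The user code stays unmodified; the runtime's
# pre-processor decides at runtime which operations to offload.
def maybe_offload(op_name, *tensors, pim_friendly_ops=("gemv", "add", "mul", "bn")):
    if op_name in pim_friendly_ops:
        # pim_runtime.offload(op_name, *tensors)   # placeholder for the PIM executor call
        return f"offloaded {op_name} to the PIM execution unit"
    return f"executed {op_name} on the host"

print(maybe_offload("gemv", np.zeros((4, 4)), np.zeros(4)))
```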
The programming model for the PIM execution unit is to execute the PIM
micro-kernel with a valid memory request to DRAM. The programming model is
depicted in Fig. 7.5. The most important aspects of the PIM kernel are utilizing
the whole internal HBM-PIM compute bandwidth and ensuring the order of the
memory requests to keep the execution order of the PIM micro-kernel. First, to
fully utilize the compute bandwidth of the HBM-PIM, it generates enough threads
to map kernels to all of the PIM execution units. Each thread can send a memory
request, and there should be a sufficient number of threads to utilize the GRF access size, which is 256 B in total. Some threads are grouped to form a thread group.
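As a back-of-the-envelope sketch (our own illustration with an assumed per-request size and unit count, not HBM-PIM's published parameters), the number of threads needed to cover the GRF access size can be estimated as follows:

```python
def threads_for_full_bandwidth(num_pim_units, grf_access_bytes=256, bytes_per_request=32):
    # Each thread issues one memory request; enough threads must be grouped so that
    # their combined requests cover the full GRF access size of every PIM execution unit.
    threads_per_unit = grf_access_bytes // bytes_per_request
    return num_pim_units * threads_per_unit

# Example with 64 PIM execution units and 32-B requests (assumed values):
print(threads_for_full_bandwidth(64))   # -> 512 threads
```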
The layers of the software stack must be modified to program PIM efficiently.
However, modifying the software stack for PIM introduces several challenges. First,
we must identify which executions to offload to PIM. Certain types of applications or executions run better on PIM, and it is important to know what properties they have and how to distinguish them from the rest. Second, data mapping must be appropriately managed to reduce the data movement overhead. Mapping data in an appropriate format is crucial for some types of PIM hardware; if data mapping is done properly, it can relieve the internal data movement bottleneck. Third, the PIM executions must be scheduled for efficiency and the best PIM performance. Scheduling helps optimize the utilization of PIM and the host processor by offloading execution kernels to both concurrently.
The first challenge is to identify which code should be executed on PIM. Since not all code runs efficiently on PIM, we need to distinguish PIM-friendly code that effectively exploits PIM. PIM-friendly code can be statically assigned to PIM cores by programmers manually, but this requires a deep understanding of the PIM architecture, the code's properties, and the benefit of offloading the code to PIM. The decision of what code to offload also varies depending on what kind of PIM execution unit is designed in memory and how. Identifying PIM-friendly codes is
relatively straightforward for PIM cores with custom logic if the PIM logic is specialized for a certain function. For example, a PIM core in Newton [2] consists of 16 multipliers and a reduction adder tree with a fixed data flow. This PIM architecture is specifically designed for a particular operation, matrix-vector multiplication in this case. Utilizing Newton for matrix-vector multiplications yields promising performance with its specialized logic unit in each DRAM bank. It gains wide internal bandwidth to each PIM core while reducing the amount of data transferred from the memory to the processor, which eases the memory bottleneck. The energy consumption of off-chip data transfer is
significantly reduced, and the throughput is increased by utilizing higher internal
DRAM bandwidth.
On the other hand, identifying PIM-friendly codes in general-purpose PIM cores
is much more difficult. The code must be analyzed to confirm that it is memory-intensive, meaning there is a bottleneck on the off-chip bandwidth between the memory and the processor. Typically, PIM copes well with memory-intensive, low-data-locality applications, whereas the CPU has an advantage for compute-intensive and cache-friendly applications. Such memory-intensive applications require tremendous amounts of data from memory: if the processor reuses each piece of data only a few times and requests more data from memory than the given off-chip bandwidth can supply, a memory bottleneck occurs. Boroumand et al. [3] propose an efficient tool flow rather than manual analysis.
Fig. 7.6 Energy and execution time breakdown on various models on CPU
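A simple way to capture this intuition is to compare a kernel's data reuse and demanded memory traffic against the available off-chip bandwidth, as in the rough heuristic below (our own sketch with illustrative thresholds, not the tool flow of [3]):

```python
def is_pim_friendly(bytes_moved, ops, runtime_s, offchip_bw_gbps, reuse_threshold=4.0):
    """Flag a kernel as a PIM offload candidate when it has low data reuse and its
    demanded memory traffic approaches the off-chip bandwidth limit."""
    ops_per_byte = ops / bytes_moved                  # arithmetic intensity (data reuse proxy)
    demanded_bw_gbps = bytes_moved / runtime_s / 1e9  # memory bandwidth actually requested
    return ops_per_byte < reuse_threshold and demanded_bw_gbps > 0.8 * offchip_bw_gbps

# A streaming kernel moving 90 GB in 1 s over a 100-GB/s link with little reuse:
print(is_pim_friendly(bytes_moved=90e9, ops=160e9, runtime_s=1.0, offchip_bw_gbps=100))  # True
```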
The second challenge is to manage data mapping for efficient programming. Data
mapping strategy must consider the PIM hardware architecture and its target appli-
cation. An inappropriate data mapping scheme creates an even worse data movement bottleneck in the system and degrades the overall performance through inefficient data access patterns by the PIM computation units. The best data mapping strategy must be optimized differently for different types of PIM architectures and applications since it affects the computing performance of PIM.
Data mapping strategy depends on the size of the data granularity of PIM com-
putation units. Data granularity varies along with the location of PIM computation
units across different architecture levels, from DRAM's subarray level to the bank level. The computation units can be placed inside each DRAM bank and access only selected data after the column decoder, or they can sit before the column decoder and access a whole row of the subarray. For example, one type of PIM hardware is located at DRAM's subarray level, and its PIM units have a data granularity of an entire row [4–6] with bulk-bitwise operations.
The issue here is that they require data to be aligned in the same row and located
in a certain subarray in order for PIM to execute accurately. It can be managed by
generating sequentially aligned physical addresses from given virtual addresses in
the operating system and exposing subarray’s address information to the memory
controller. These approaches ensure that the data can be physically located in a
specific DRAM subarray within the same row.
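The same-row requirement can be made concrete with a small address-decomposition sketch. The bit layout below is an assumption chosen for illustration; real DRAM address mappings differ by device and memory controller.

```python
def decode_addr(phys_addr, col_bits=10, row_bits=15, bank_bits=4):
    # Split a physical address into (column, row, bank) under a simple assumed linear mapping.
    col = phys_addr & ((1 << col_bits) - 1)
    row = (phys_addr >> col_bits) & ((1 << row_bits) - 1)
    bank = (phys_addr >> (col_bits + row_bits)) & ((1 << bank_bits) - 1)
    return col, row, bank

def same_row_aligned(addr_a, addr_b):
    # Bulk-bitwise subarray-level PIM needs both operands in the same bank and row.
    _, row_a, bank_a = decode_addr(addr_a)
    _, row_b, bank_b = decode_addr(addr_b)
    return row_a == row_b and bank_a == bank_b

print(same_row_aligned(0x1234, 0x1238))  # same row, different column -> True
```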
Another type of PIM hardware is a bank-level PIM, where each PIM core is
located in each bank, and computation is done after the column decoder. Bank-
level PIM has fewer issues in aligning data since it requires a smaller size of data
granularity. However, its biggest issue comes from irregular data access patterns. Unlike the CPU, bank-level PIM has only limited caches or registers in which data can be held locally. Consequently, data access time is dominated by accessing memory cells rather than each PIM core's local registers. Additionally, the target data may reside either in the same bank as the PIM computation unit or in another bank, and this relative distance causes data movement overhead. First, a sequential memory access pattern guarantees the shortest data read latency within a DRAM bank. While accessing data at a different row address requires additional row-to-row delay due to DRAM's precharge and activation commands, a sequential memory access pattern can keep reading the same row without that additional delay. Second, memory accesses from a PIM core to a neighboring bank's memory burden the global data bus, and many such requests from different PIM cores can cause bottlenecks between banks. This inter-bank data movement can go through the global data bus if the PIM architecture supports bank-to-bank data transmission. Otherwise, it must be done by the
memory copy function, which moves data all the way from the source memory
address to the host and back to the destination memory address. A solution to this
issue can come from optimizing data mapping and assigning the proper PIM core to
execute with the corresponding data. By matching data location and the execution
of code to a specific PIM core, it is possible to alleviate the burden on the data
movement from one memory bank to another.
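The "match data to the core" principle reduces to a simple placement rule, sketched below with a hypothetical bank-index function (the interleaving granularity and bank count are assumptions for illustration; the actual mapping depends on the memory controller):

```python
NUM_BANKS = 16

def bank_of(addr, interleave_bytes=8192):
    # Assumed bank interleaving at 8-KiB granularity (illustrative only).
    return (addr // interleave_bytes) % NUM_BANKS

def assign_pim_core(kernel_operand_addrs):
    # Schedule the kernel on the PIM core of the bank holding most of its operands,
    # so that cross-bank traffic over the global data bus is minimized.
    counts = {}
    for addr in kernel_operand_addrs:
        b = bank_of(addr)
        counts[b] = counts.get(b, 0) + 1
    return max(counts, key=counts.get)

print(assign_pim_core([0x0000, 0x2000, 0x2100, 0x4000]))  # most operands in bank 1 -> 1
```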
In order to alleviate these challenges, Hsieh et al. [7] propose a new programmer-
transparent data mapping mechanism. It co-locates offloaded code and data in the
same PIM computation unit by exploiting predictability in the memory access
patterns out of offloaded code blocks. Figure 7.8 shows the memory access patterns
for various memory-intensive workloads selected for offloading candidate code
blocks. They are backward propagation (BP), BFS graph traversal (BF), K-means
(KM), and CFD solver (CFD) from Rodinia 3.0 [8], LIBOR Monte Carlo (LIB) and
RAY tracing (RAY) from GPGPU-Sim [9], and Fast Walsh-Hadamard transform
(FWT), scalar product (SP), and parallel reduction (RD) from CUDA SDK. The
results show that 85% of all offloaded code blocks have a fixed offset between access addresses, generating a predictable access pattern. Given this predictability, observing only a small fraction, 0.1%, of the initial offloading candidate instances achieves the same effect as observing all offloading candidate instances. Although the mechanism modifies the data mapping for PIM-offloaded code blocks, it keeps the original memory mapping for the rest of the data executed on the main CPU/GPU to maximize bandwidth utilization.
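The underlying observation can be reproduced with a tiny stride detector, a deliberate simplification of the mechanism in [7]: if the first few sampled accesses of an offloading candidate share a constant offset, the rest of its accesses are predicted to follow the same pattern.

```python
def constant_stride(sampled_addrs):
    # Return the fixed offset between consecutive accesses, or None if the pattern is irregular.
    if len(sampled_addrs) < 2:
        return None
    strides = {b - a for a, b in zip(sampled_addrs, sampled_addrs[1:])}
    return strides.pop() if len(strides) == 1 else None

# Observing only a few initial accesses of a candidate code block:
print(constant_stride([0x1000, 0x1040, 0x1080, 0x10C0]))  # -> 64 (predictable pattern)
print(constant_stride([0x1000, 0x1F00, 0x1004]))          # -> None (irregular pattern)
```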
In this section, we will discuss the third challenge, the dynamic scheduling of PIM
offloaded code. Section 7.2 discussed PIM offloading execution in a static manner
such that a compiler statically identifies what code to offload to PIM with analytical
energy and memory models. Along with the static decision, a dynamic decision can further optimize the offloaded code to maximize the utilization of PIM and the host processor. In particular, there has been research on scheduling for the GPU-PIM architecture, as shown in Fig. 7.9. A GPU-PIM architecture consists of multiple 3D-stacked memories and a main GPU with multiple streaming multiprocessors (SMs). Each 3D-stacked memory is a PIM module; its logic layer contains SMs that serve as the PIM computation units, and it is topped with memory layers.
Hsieh et al. introduce two issues in scheduling GPU-PIM execution and propose a dynamic decision mechanism for scheduling the GPU-PIM architecture. First, when a large number of offloading transactions are queued for a PIM computation unit that cannot handle them fast enough, a performance bottleneck occurs: the GPU ends up waiting on the PIM computation unit to complete all of its executions. Second, a discrepancy in the bandwidth savings of the off-chip data transmissions causes a memory bottleneck, meaning that offloading such transactions might burden only one of the receive (RX) or transmit
Fig. 7.10 Kernel offloading mechanism and concurrent kernel management mechanism
Table 7.1 Metrics for predicting compute engine affinity and execution time

Primary category                     | Predictive metric                           | Static/dynamic
Memory intensity of kernel           | Memory-to-compute ratio                     | Static
                                     | Number of compute instructions              | Static
                                     | Number of memory instructions               | Static
Available parallelism in the kernel  | Number of CTAs                              | Dynamic
                                     | Total number of threads                     | Dynamic
                                     | Number of thread instructions               | Dynamic
Shared memory intensity of kernel    | Total number of shared memory instructions  | Static
$$\delta(t) = \frac{e^{t}}{e^{t} + 1} \tag{7.2}$$

$$\begin{cases} \delta(t) = \text{Model output } (0 \text{ if } \delta(t) < 0.5,\; 1 \text{ if } \delta(t) \ge 0.5) \\ t = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + \alpha_5 x_5 + \alpha_6 x_6 + \alpha_7 x_7 \\ \alpha_i = \text{Coefficients of the regression model} \\ x_i = \text{Predictive metrics/variables (Table 7.1)} \end{cases} \tag{7.3}$$
The regression model uses a total of 25 applications, of which 60% are used for training and 40% for testing. As a result, the model predicts a kernel's compute engine affinity with 83% accuracy.
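Equations 7.2 and 7.3 amount to a plain logistic regression over the Table 7.1 metrics. The sketch below shows how the affinity decision would be evaluated; the coefficients and metric values are made up for illustration and are not the values fitted in the original work.

```python
import math

def predict_affinity(metrics, coeffs):
    """metrics: the seven predictive variables x1..x7 from Table 7.1;
    coeffs: alpha0..alpha7. Returns 1 if the logistic output is >= 0.5, else 0."""
    t = coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], metrics))
    delta = math.exp(t) / (math.exp(t) + 1.0)    # Eq. 7.2
    return 1 if delta >= 0.5 else 0              # Eq. 7.3 thresholding

example_metrics = [3.2, 120, 400, 64, 2048, 9000, 15]            # hypothetical kernel statistics
example_coeffs = [-2.0, 0.8, -0.001, 0.002, 0.01, 0.0001, 0.0001, 0.01]
print(predict_affinity(example_metrics, example_coeffs))
```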
For the other runtime technique, the authors propose a new concurrent kernel management mechanism for both GPU and PIM computation units based on three key pieces of information: kernel dependency information, an affinity prediction model, and an execution time prediction model. The kernel dependence graph is obtained from read-after-write (RAW) dependencies across the kernels by profiling the whole application's kernel execution; it helps determine which kernels can execute in parallel. The affinity prediction model is obtained from the logistic regression model described above. It
determines which computation cores can execute each kernel. The execution time
prediction model predicts the execution time of a kernel on each computation core.
The equation for the execution time prediction is shown in Eq. 7.4 and is obtained
by the linear regression model.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 \tag{7.4}$$

$$\begin{cases} y = \text{Model output (predicted execution time)} \\ \beta_i = \text{Coefficients of the regression model} \\ x_i = \text{Predictive metrics/variables (Table 7.1)} \\ \text{Bins} = \begin{cases} 1\ (\text{very low}) & \text{if } y < 10\text{K} \\ 2\ (\text{low}) & \text{if } 10\text{K} < y < 500\text{K} \\ 3\ (\text{medium}) & \text{if } 500\text{K} < y < 5\text{M} \\ 4\ (\text{high}) & \text{if } 5\text{M} < y < 50\text{M} \\ 5\ (\text{very high}) & \text{if } 50\text{M} < y \end{cases} \end{cases} \tag{7.5}$$
The equation uses the same metrics used in the affinity prediction model in
Table 7.1. The information on execution time helps balance the kernels between
computation cores and minimizes the under-utilization issue. For example, if two independent kernels have an affinity toward the same computation core while the other computation core has no kernel to execute, the hardware resources are underutilized. In this case, it is better to offload the kernel with the lower predicted execution time to the underutilized computation core, reducing the total execution time of the kernels and mitigating the under-utilization issue.
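The balancing rule can be sketched as follows; this is a simplified illustration of the idea, not the scheduler of the cited work. When all ready kernels prefer the same engine and the other engine is idle, the kernel with the smallest predicted execution-time bin is moved to the idle engine.

```python
def schedule(ready_kernels):
    """ready_kernels: list of (name, affinity, time_bin), where affinity is 'GPU' or 'PIM'
    and time_bin follows Eq. 7.5 (1 = very low ... 5 = very high)."""
    gpu = [k for k in ready_kernels if k[1] == "GPU"]
    pim = [k for k in ready_kernels if k[1] == "PIM"]
    if len(gpu) >= 2 and not pim:
        shortest = min(gpu, key=lambda k: k[2])     # move the shortest kernel to the idle PIM
        gpu.remove(shortest); pim.append(shortest)
    elif len(pim) >= 2 and not gpu:
        shortest = min(pim, key=lambda k: k[2])     # move the shortest kernel to the idle GPU
        pim.remove(shortest); gpu.append(shortest)
    return [k[0] for k in gpu], [k[0] for k in pim]

print(schedule([("k1", "GPU", 4), ("k2", "GPU", 2)]))   # -> (['k1'], ['k2'])
```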
In a uni-processor computer system, a single core does all the work and manages
the data in memories. A change of a value at any memory location does not affect the correctness of the computation since the single core always sees the latest data. On the other hand, in a multi-processor computer system, this may not always be true. When multiple processors work simultaneously on the same data locations, they can access them freely as long as no core modifies the data. However, once one processor modifies the data, the other processors may not observe the change in their local caches. This is where the cache coherence protocol comes into play. The cache coherence protocol manages write permission for each core's cache to update or invalidate stale data values. It
also handles the arbitration of requests from multiple cores to the same memory
address. As a result, all the cores can see the valid data anytime, even if they
simultaneously work on the same data.
The cache coherence mechanism also applies to PIM architectures and is a primary challenge for enabling general-purpose PIM execution. If we treat PIM units as a conventional multi-processor architecture, we can apply a traditional cache coherence protocol: PIM can then be programmed like a multi-core system based on traditional shared memory with the host processor. As a result, the PIM programming model becomes simple, and the PIM architecture can easily serve as a general-purpose system. However, applying a traditional cache coherence protocol to PIM causes significant overhead on off-chip memory bandwidth due to many fine-grained coherence message transactions. This reverses the main benefit of PIM, namely high-bandwidth and low-latency execution. A traditional cache coherence protocol in a conventional multi-processor architecture does not have this issue since it can exploit the wide bandwidth of the on-chip shared interconnect. Several solutions proposed in previous research [1, 7, 11–13] place restrictions on the programming model using cache bypass policies, writebacks, and message-passing-based mechanisms. For example, Ahn et al. [11] propose to use message passing to
communicate between PIM cores and CPU caches. HBM-PIM [1] and GraphPIM
[13] use a cache bypassing policy for the offloading targets. It makes a part of the memory
region uncacheable and lets all memory requests bypass the cache hierarchy and
send write requests directly to memory. Ahn et al. [12] use back-invalidation or
writeback for the cache block before and after PIM execution. Hsieh et al. [7]
propose to use write-through for cache coherence between GPU and PIM. These
mechanisms work well for applications that share little data between PIM and the host. However, when a large amount of data is shared between PIM and the host, all these restriction-based mechanisms can degrade performance by frequently forcing data to be written back or written through to memory rather than staying in a cache.
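The cost of such restriction-based schemes is easy to see in pseudocode. The sketch below is a generic illustration (not any specific system's implementation): every dirty cache line of the shared region is written back before each PIM launch, which is exactly the repeated write-back traffic described above.

```python
def offload_with_writeback_coherence(cache, shared_region, pim_kernel):
    """cache: dict mapping address -> (value, dirty); shared_region: set of addresses
    the PIM execution may touch."""
    writebacks = 0
    for addr, (value, dirty) in list(cache.items()):
        if addr in shared_region and dirty:
            # write_to_memory(addr, value)          # placeholder for the actual write-back
            cache[addr] = (value, False)
            writebacks += 1
    pim_kernel(shared_region)                        # PIM now sees up-to-date data in memory
    return writebacks                                # grows with the amount of shared data

cache = {0x10: (1, True), 0x20: (2, False), 0x30: (3, True)}
print(offload_with_writeback_coherence(cache, {0x10, 0x20, 0x30, 0x40}, lambda region: None))  # -> 2
```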
Regarding this issue, Boroumand et al. [14] propose a new coherence mechanism
called coherence for near data accelerators (CoNDA), as shown in Fig. 7.11. The
authors also analyze three different existing coherence mechanisms for near data
accelerator (NDA): Non-cacheable approach (NC), coarse-grained coherence (CG),
and fine-grained coherence (FG).
First, the non-cacheable mechanism forces the CPU to write data to memory whenever it updates data, enabling the NDA always to see valid data. It works well in the specific case where the CPU rarely accesses the NDA memory, but it works poorly in most cases, where the CPU accesses the NDA
memory frequently. The authors evaluate three different graph applications from a
multi-threaded graph framework called Ligra [15]: Connected Components, Radii,
and PageRank. For the dataset, it uses arXiv and Gnutella25 [16]. Figure 7.12
shows the memory system’s energy consumption and speedup graph of different
coherence mechanisms on different applications. As shown in the graphs, the
energy consumption and the performance of the non-cacheable approach are worse
than CPU-only due to frequent accesses made by the CPU threads to NDA
memory.
Fig. 7.12 Energy consumption and speedup of different existing cache coherence protocols
Second, coarse-grained coherence is another type of coherence mechanism that optimizes the enforcement of conventional coherence. It monitors coherence over much larger memory regions, which helps avoid unnecessary broadcasts and cache-tag look-ups. This mechanism works well when the NDA shares only limited data with the CPU threads, but it works poorly when much more data is shared, as it causes unnecessary data movement between the CPU and the NDA.
The CPU must write all cache lines within the same NDA memory region even
though the NDA only accesses a few memory addresses. As shown in the graph,
coarse-grained coherence is 0.4% slower than CPU-only and is still not a good
fit for NDA for many applications. Third, fine-grained coherence is a traditional
protocol and works well with the applications involving irregular memory accesses.
However, the limited off-chip bandwidth between the memory and the CPU cannot
handle unnecessary off-chip data movements.
Instead of applying the existing coherence protocols, the authors propose a new
efficient cache coherence protocol for NDAs called CoNDA, which runs the NDA in an optimistic execution mode, as shown in Fig. 7.13. The new protocol works as follows. First, the NDA always executes in optimistic execution mode. During this mode, the NDA stops issuing any coherence request to the CPU and keeps track of its memory accesses. It also assumes that it always has coherence permission from the CPU without even checking the CPU coherence directory. CoNDA guarantees that none of the data modified during the optimistic execution mode is written to memory. Second, after the NDA completes optimistic execution, it starts handling the coherence requests that could not be issued during that mode. It only works on the shared data that was actually used during the execution, using the memory access tracking information, which includes the addresses of all NDA reads, NDA writes, and CPU writes. CoNDA compares this information to find the necessary coherence requests. Depending on the case, CoNDA either makes the NDA invalidate or re-execute all the uncommitted updates or has the CPU resolve the necessary coherence requests. The authors evaluate the CoNDA
protocol by comparing it to NDA execution using NC, CG, FG, and ideal-NDA.
Ideal-NDA is a reference model that does not account for any coherence overhead and thus gives the best possible performance. The speedup analysis is also shown in Fig. 7.12. Both CG and NC hardly benefit from PIM due to their high cost of maintaining coherence, whereas FG achieves 44.9% of Ideal-NDA's performance benefits. CoNDA achieves the largest benefits among these coherence protocols, improving performance over CPU-only by 66.0%. The authors show that CoNDA effectively reduces the number of unnecessary coherence requests that travel over off-chip buses.
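CoNDA's commit-time check can be summarized in a few lines. The sketch below is our simplified rendering of the signature comparison; the real design uses compressed hardware signatures rather than Python sets, and the exact conflict-resolution policy is more involved.

```python
def conda_commit(nda_reads, nda_writes, cpu_writes):
    """Compare the address sets tracked during optimistic execution. A conflict exists
    when the CPU wrote an address the NDA read or wrote; otherwise the uncommitted
    NDA updates can simply be committed without any coherence traffic."""
    conflicts = (nda_reads | nda_writes) & cpu_writes
    if conflicts:
        return "resolve-coherence-and-reexecute", conflicts
    return "commit", set()

print(conda_commit({0xA0, 0xB0}, {0xC0}, {0xB0}))   # CPU wrote 0xB0, which the NDA read -> conflict
print(conda_commit({0xA0}, {0xC0}, {0xD0}))         # disjoint address sets -> safe to commit
```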
References
1. S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J.
Kim, S. O, A. Iyer, D. Wang, K. Sohn, N.S. Kim, Hardware architecture and software stack
for PIM based on commercial DRAM technology: industrial product, in 2021 ACM/IEEE 48th
Annual International Symposium on Computer Architecture (ISCA). IEEE, Piscataway (2021),
pp. 43–56
2. M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, T.N. Vijaykumar, Newton:
a DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning, in 2020
53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE,
Piscataway (2020), pp. 372–385
3. A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A.
Kuusela, A. Knies, P. Ranganathan, O. Mutlu, Google workloads for consumer devices:
mitigating data movement bottlenecks, in Proceedings of the Twenty-Third International
Conference on Architectural Support for Programming Languages and Operating Systems
(2018), pp. 316–331
4. Y. Kim, V. Seshadri, D. Lee, J. Liu, O. Mutlu, A case for exploiting subarray-level parallelism
(SALP) in DRAM, in 2012 39th Annual International Symposium on Computer Architecture
(ISCA). IEEE, Piscataway (2012), pp. 368–379
5. V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu,
P.B. Gibbons, T.C. Mowry, Ambit: in-memory accelerator for bulk bitwise operations using
commodity DRAM technology, in 2017 50th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, Piscataway (2017), pp. 273–287
6. V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu,
P.B. Gibbons, M.A. Kozuch, T.C. Mowry, RowClone: Fast and energy-efficient in-DRAM
bulk data copy and initialization, in Proceedings of the 46th Annual IEEE/ACM International
Symposium on Microarchitecture (2013), pp. 185–197
7. K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, ..., S.W.
Keckler, Transparent offloading and mapping (TOM) enabling programmer-transparent near-
data processing in GPU systems. ACM SIGARCH Comput. Archit. News 44(3), 204–216
(2016)
8. S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, S.H. Lee, K. Skadron, Rodinia: a
benchmark suite for heterogeneous computing, in 2009 IEEE International Symposium on
Workload Characterization (IISWC). IEEE, Piscataway (2009), pp. 44–54
9. A. Bakhoda, G.L. Yuan, W.W. Fung, H. Wong, T.M. Aamodt, Analyzing CUDA workloads
using a detailed GPU simulator, in 2009 IEEE International Symposium on Performance
Analysis of Systems and Software. IEEE, Piscataway (2009), pp. 163–174
10. A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A.K. Mishra, M.T. Kandemir, O. Mutlu, C.R.
Das, Scheduling techniques for GPU architectures with processing-in-memory capabilities, in
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
(2016), pp. 31–44
11. J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A scalable processing-in-memory accelerator
for parallel graph processing, in Proceedings of the 42nd Annual International Symposium on
Computer Architecture (2015), pp. 105–117
12. J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-enabled instructions: a low-overhead, locality-aware
processing-in-memory architecture, in 2015 ACM/IEEE 42nd Annual International Symposium
on Computer Architecture (ISCA). IEEE, Piscataway (2015), pp. 336–348
13. L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, H. Kim, GraphPIM: enabling instruction-level
PIM offloading in graph computing frameworks, in 2017 IEEE International Symposium on
High Performance Computer Architecture (HPCA). IEEE, Piscataway (2017), pp. 457–468
14. A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh,
N. Hajinazar, K.T. Malladi, H. Zheng, O. Mutlu, CoNDA: efficient cache coherence support
for near-data accelerators, in Proceedings of the 46th International Symposium on Computer
Architecture (2019), pp. 629–642
15. J. Shun, G.E. Blelloch, Ligra: a lightweight graph processing framework for shared memory,
in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (2013), pp. 135–146
16. SNAP: Stanford Network Analysis Project. http://snap.stanford.edu/
Chapter 8
Conclusion
PIM is especially effective for data-intensive workloads like AI/ML and big data
applications. However, there are also many challenges ahead: the memory fabrication process is not easily accessible, and chip designers need to carefully design the merged logic within the physical design constraints to maximize internal memory bandwidth. In this book, we categorize and investigate PIM technology according to the memory type it is based on: SRAM-based PIM, DRAM-based PIM, and ReRAM-based PIM.
For each PIM category, we thoroughly cover the basic operation of the memory,
circuit component designs, macro designs, and entire architectures and operations.
The summary for each category is as follows.
SRAMs are typically used as cache memory placed near the processor in
conventional digital systems following the von Neumann architecture. Compatibility with the standard logic CMOS process and the low density resulting from the larger bitcell size (compared to other memory technologies) have driven the technology to be used as a small-capacity, high-speed on-chip memory. SRAM-based PIM naturally became a popular choice because this solid integration compatibility enables the additional computing feature on top of the regular memory operation without raising new concerns about manufacturing feasibility or about its efficacy in the assigned role as a processing element.
challenges similar to the challenges in the design of SRAM for cache memory. For
example, low storage density, limited bitline dynamic range, noise margin issues, and variation-induced nonlinearity are challenges for both SRAM-based caches and PIM macros. Other challenges raised by the additional computing task assigned to SRAM PIM macros are data conversion overhead, voltage–frequency scaling, macro scalability for different DNNs, and DNN parameter data compression. For most PIM implementations of DNN applications, the function of the PIM macro is a simple multiply-and-accumulate (MAC) or multiply-and-average (MAV).
However, additional mathematical operations must be processed in the preceding or
following digital blocks in the system (or system-on-chip). As a result, PIM macros
are rarely designed as a stand-alone processor but instead are built as a co-processor
or an accelerator in larger ASICs, FPGAs, or SoCs. For this reason, PIM macros that operate in the mixed-signal domain must provide some form of data conversion layer for their input and output data in the functional data flow pipeline. The data conversion in mixed-signal SRAM-based PIM is a significant design overhead in latency, energy consumption, and hardware footprint. In addition, application flexibility is another concern because the conversion blocks, such as DACs and ADCs, are implemented with fixed precision. Custom-designed SRAM cells typically aim to resolve SRAM-specific issues and utilize highly optimized supply voltage and timing schemes while trading off the configuration flexibility of the PIM macro.
architecture-level improvements are required to achieve the scalability of the PIM
macro. Digital implementation of the PIM macro alleviates many of the stated
concerns, but it is not a complete solution and remains an active research area.
The application parameter data storage is another critical issue for SRAM-based
PIM due to its low-density memory bitcell array. While there is no physical
solution to map millions of high-precision parameters into the SRAM without
sacrificing efficiency, algorithmic improvements can resolve the issue through data
full potential should be easily exploited by end-users. For that, a renewed software
stack for PIM covering programming language, library, runtime, and the device
driver is essential. We hope this book can help readers understand PIM technology
with a holistic view, from lower-level circuit implementation to system integration.
We also hope that this book can give readers ideas and directions for their future
research.
Correction to: Processing-in-Memory
for AI
Correction to:
J.-Y. Kim et al. (eds.), Processing-in-Memory for AI,
https://doi.org/10.1007/978-3-030-98781-7
An author name in this chapter was inadvertently published with a typographical error. The author name Donghyuck Kim has been corrected to Donghyuk Kim, and the book has been updated with this change.