Article
Towards Integration of a Dedicated Memory Controller and Its
Instruction Set to Improve Performance of Systems Containing
Computational SRAM
Kévin Mambu *, Henri-Pierre Charles *, Maha Kooli * and Julie Dumas *
CEA, LIST, Université Grenoble Alpes, F-38000 Grenoble, France
* Correspondence: [email protected] (K.M.); [email protected] (H.-P.C.); [email protected] (M.K.);
[email protected] (J.D.)
Abstract: In-memory computing (IMC) aims to close the performance gap between the CPU and memories introduced by the memory wall. However, it does not address the energy wall problem caused by data transfers across the memory hierarchy. This paper proposes the data-locality management unit (DMU) to efficiently transfer data from a DRAM memory to a computational SRAM (C-SRAM) memory capable of IMC operations. The DMU is tightly coupled with the C-SRAM and aligns data structures so that in-memory computation can be performed effectively. We propose a dedicated instruction set within the DMU to issue data transfers. Compared to a reference scalar system architecture, a system integrating the C-SRAM with the DMU shows an increase from ×5.73 to ×11.01 in speed-up and from ×29.49 to ×46.67 in energy reduction, relative to a system integrating the C-SRAM without any transfer mechanism.
Keywords: in-memory computing; energy modeling; non-von Neumann; instruction set; compilation; stencils; convolutions; SRAM; energy wall; memory wall

1. Introduction

Von Neumann architectures are limited by the performance bottleneck characterized by the "memory wall", i.e., the performance limitation of memory units compared to the CPU, and the "energy wall", i.e., the gap between the energies consumed for computation and for data transfers between different system components.

Figure 1a exposes the energy discrepancy between the components of a standard von Neumann architecture. The energy cost increases by ×100 between the CPU and the cache memory, and by ×10,000 between the CPU and the DRAM memory [1]. In-memory computing (IMC) is a solution to implement non-von Neumann architectures and mitigate the memory wall by moving computation directly into memory units [2]. It reduces data transfers and thus energy consumption. However, the efficiency of IMC depends on the proper arrangement of data structures. Indeed, to be correctly computed in the memory, data should be arranged in a precise order (e.g., aligned in memory rows) imposed by IMC hardware design constraints.

While various state-of-the-art works propose IMC solutions, very few take into account their integration into complete computer systems or describe efficient methods to transfer data from IMC to high-latency memories or peripherals. This lack of consideration can be explained by the majority of IMC architectures being currently specialized for a few use cases, i.e., AI and big data, which limits their efficiency for general-purpose computing. We propose the data-locality management unit (DMU), a transfer block presented in Figure 1b, coupled to an SRAM-based IMC unit to perform efficient data transfer and reorganization through a dedicated instruction set. As the IMC architecture, we consider the computational SRAM (C-SRAM), an SRAM-based bit-parallel IMC architecture detailed in [2–4], able to perform logical and arithmetic operations in parallel thanks to an arithmetic and logic unit (ALU) in its periphery. We integrate it within a system containing a CPU and a DRAM as main memory.
Figure 1. (a) Performance bottlenecks of the von Neumann architecture; energy costs are based on [1]. (b) Proposed architecture with IMC to mitigate the "memory wall" and a DMU block to mitigate the "energy wall" between the IMC unit and the main memory.
2. Related Work
2.1. In-Memory Computing (IMC)
State-of-the-art IMC architectures can be differentiated by their technology and programming model [5]. Volatile-memory-based IMC architectures include DRAM and SRAM technologies. DRAM-based IMC architectures enhance DRAM memories with bulk-bitwise computation operators. These solutions offer cost and area efficiency and large parallelism, although their arithmetic support is limited to logical or specialized operators [6,7]. SRAM-based IMC architectures are less scalable than DRAM-based solutions in terms of design, but they implement more elaborate computation operators, either through strict IMC using bit-lines and sense amplifiers, or through near-memory computing using an arithmetic logic unit (ALU) in the periphery of the bit-cell array [3,8,9]. Other approaches using emerging technologies such as MRAM or ReRAM [10,11] have been explored. They present interesting opportunities in terms of access latency and nonvolatile capability, but have drawbacks in terms of cycle-to-cycle variability and analog-to-digital conversion of input data.
3. DMU Specification
3.1. Overview
In this section, we present the DMU, a memory controller architecture that provides IMC with memory access instructions to efficiently transfer and reorganize data before computation. The DMU controls source and destination offsets to enable fine-grain data reorganization in both the IMC unit and the DRAM memory, addressing the alignment constraints required by certain applications, and it implements two operating modes that make online data padding available. Finally, the DMU provides a dedicated instruction set to program data transfers in a single clock cycle, in contrast to classical DMA solutions. This instruction set is implemented as a subset of the C-SRAM instruction set architecture. The DMU controller is tightly coupled in the periphery of the IMC unit, as shown in Figure 2. This means that there is a direct interface between the DMU and the IMC unit without going through the system bus, which is one of the main differences compared to existing DMA controllers.
Figure 2. The integration of DMU to IMC offers an instruction set for efficient data transfers as well
as a dedicated transfer bus with the main memory.
Table 1. The DMU instruction set and its parameters.

Operation                                Parameters
SET_SRC_DRAM_REGION (nonblocking)        DRAM base address, region width, element size
SET_DST_DRAM_REGION (nonblocking)        DRAM base address, region width, element size
READ_TRANSFER (MEM→IMC, nonblocking)     source X position, source Y position, dest. IMC address, length, source offset, dest. offset, operating mode
COPY (IMC→IMC, nonblocking)              source IMC address, dest. IMC address, source offset, dest. offset, operating mode
WRITE_TRANSFER (IMC→MEM, nonblocking)    dest. X position, dest. Y position, source IMC address, length, source offset, dest. offset, operating mode
BLOCKING_WAIT                            none
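To make the calling convention concrete, the following C sketch shows how a host program might issue the sequence above. The wrapper names mirror the table, but the memory-mapped command interface (the DMU_MMIO base address and register layout) is a hypothetical assumption for illustration, not the published hardware interface.

```c
#include <stdint.h>

/* Assumed encodings of the two operating modes (see Figure 4). */
enum dmu_mode { DMU_ZERO_PADDING = 0, DMU_OVERWRITING = 1 };

/* Hypothetical memory-mapped DMU command registers; base address and
 * layout are illustrative assumptions, not the published interface. */
static volatile uint32_t *const DMU_MMIO = (volatile uint32_t *)0x40000000u;

static void set_src_dram_region(uint32_t base, uint32_t width, uint32_t esize)
{
    DMU_MMIO[0] = base; DMU_MMIO[1] = width; DMU_MMIO[2] = esize;
}

static void read_transfer(uint32_t x, uint32_t y, uint32_t dst_row,
                          uint32_t len, uint32_t soff, uint32_t doff,
                          enum dmu_mode mode)
{
    DMU_MMIO[3] = x;   DMU_MMIO[4] = y;    DMU_MMIO[5] = dst_row;
    DMU_MMIO[6] = len; DMU_MMIO[7] = soff; DMU_MMIO[8] = doff;
    DMU_MMIO[9] = (uint32_t)mode;          /* transfer_start register */
}

static void blocking_wait(void)
{
    while (DMU_MMIO[10] != 0u) { }         /* assumed busy flag */
}

/* Bring 128 bytes of a DRAM image row into C-SRAM row 0, zero-padding
 * each byte to 16 bits so wider in-memory arithmetic can follow. */
void load_one_row(void)
{
    set_src_dram_region(0x80000000u, 1024u, 1u);
    read_transfer(0u, 0u, 0u, 128u, 1u, 2u, DMU_ZERO_PADDING);
    blocking_wait();   /* all transfer instructions are nonblocking */
}
```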
READ_TRANSFER, WRITE_TRANSFER and COPY transfer data and can perform online reorganization. For example, parameterizing the source and destination offsets allows the data to be padded upon arrival in the C-SRAM. To cover most use-cases induced by the configuration of the destination offset, we implement in the DMU two operating modes through the transfer_start register, illustrated in Figure 4. A zero-padding mode fills the blanks between destination data with zeros to perform unsigned byte extension, while an overwriting mode preserves the data present in the destination C-SRAM row and updates only the relevant bytes. The former is destructive but enables online byte extension to perform higher-precision arithmetic for workloads such as image processing or machine learning, while the latter is more suitable for nondestructive data movements. Since most iterative codes such as convolutions induce strong data redundancy, COPY can be used to duplicate data and mitigate accesses to the DRAM for better energy efficiency.
Algorithm 1 describes the side effects generated by the READ_TRANSFER instruction,
according to its parameters and the offset mechanism described in Figure 4.
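Algorithm 1 itself is not reproduced here; as a reading aid, the following behavioral C model captures the side effects described above. The exact offset semantics (element-stride source offset, byte-stride destination offset) are our interpretation of Table 1 and Figure 4, not a verbatim transcription of the algorithm.

```c
#include <stdint.h>
#include <string.h>

enum dmu_mode { DMU_ZERO_PADDING = 0, DMU_OVERWRITING = 1 };

/* Behavioral model (not RTL) of READ_TRANSFER: copy `length` elements
 * from a 2-D DRAM region into one C-SRAM row. */
void read_transfer_model(const uint8_t *dram, uint32_t region_width,
                         uint32_t elem_size, uint32_t src_x, uint32_t src_y,
                         uint8_t *csram_row, uint32_t row_bytes,
                         uint32_t length, uint32_t src_off, uint32_t dst_off,
                         enum dmu_mode mode)
{
    /* Zero-padding mode is destructive: clearing the row first makes the
     * gaps between elements read as zeros (unsigned byte extension).
     * Overwriting mode keeps the bytes already present between elements. */
    if (mode == DMU_ZERO_PADDING)
        memset(csram_row, 0, row_bytes);

    const uint8_t *src =
        dram + ((size_t)src_y * region_width + src_x) * elem_size;
    for (uint32_t i = 0; i < length; i++) {
        memcpy(csram_row + (size_t)i * dst_off, src, elem_size);
        src += (size_t)src_off * elem_size;  /* source offset: element stride */
    }
}
```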
Figure 4. DMU operating modes and their impact on the destination memory, here an SRAM IMC memory. (a) Zero-padding mode. (b) Overwriting mode.
Figure 5. Experimental memory architectures for the evaluation. All cache units use a write-through policy. Our proposed architecture substitutes the 16 kB L1 data cache with an 8 kB L1 data cache, an 8 kB C-SRAM and a DMU. (a) Reference architecture; (b) proposed architecture.
Table 2. Memory parameters of the reference and proposed architecture, used for the experimental
evaluation.
4.2. Applications

We consider three applications to evaluate our proposed architecture (IMC-DMU) versus the reference scalar architecture (REF):
• Frame differencing is used in computer vision to perform motion detection [15]. It performs a saturated subtraction between two (or more) consecutive frames of a video stream to highlight pixel differences (see the sketch after this list). It has linear complexity in both computing and memory.
• A Sobel filter applies two 3 × 3 convolution kernels to an input image to generate an edge-highlighted output. It is a standard operator in image processing and computer vision for edge detection [16]. It has linear arithmetic complexity and shows constant data redundancy (2 × 9 reads per input pixel, on average).
• Matrix-matrix multiplication is used in various domains such as signal processing and physics modeling, and is standardized in linear algebra as the gemm operator [17]. It has cubic (O(n³)) complexity in computing and memory and shows quadratic (O(n²)) data redundancy.
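As announced in the first item above, frame differencing reduces to a per-pixel saturated subtraction. The following scalar C version is our own minimal sketch of that operation (not the paper's benchmark code), of the kind the reference architecture executes pixel by pixel and the C-SRAM vectorizes row-wise.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar frame differencing: per-pixel saturated subtraction, clamping
 * at zero instead of wrapping around on unsigned underflow. */
void frame_diff(const uint8_t *curr, const uint8_t *prev,
                uint8_t *out, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++)
        out[i] = (curr[i] > prev[i]) ? (uint8_t)(curr[i] - prev[i]) : 0u;
}
```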
We are currently developing a mechanism to automatically generate data transfers and duplication from programmable memory access patterns to effortlessly achieve quasi-optimal energy efficiency. We will soon publish our specifications and our results.
Figure 7. Example code transferring convolution windows to the C-SRAM using our DMU instruction set.
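Only the closing lines of the Figure 7 listing (the loop braces and an `nrow += 3` step) survive the extraction, so the sketch below is a hedged reconstruction of what such a transfer loop could look like, reusing the hypothetical wrappers introduced after Table 1; the C-SRAM row allocation and the in-memory compute step are our assumptions.

```c
#include <stdint.h>

enum dmu_mode { DMU_ZERO_PADDING = 0, DMU_OVERWRITING = 1 };

/* Wrappers from the earlier sketch (hypothetical, see after Table 1). */
void read_transfer(uint32_t x, uint32_t y, uint32_t dst_row, uint32_t len,
                   uint32_t soff, uint32_t doff, enum dmu_mode mode);
void blocking_wait(void);

/* Stream the three image rows under each 3x3 window position into three
 * consecutive C-SRAM rows; `nrow += 3` matches the surviving fragment.
 * Row recycling once the C-SRAM fills up is omitted for brevity. */
void transfer_conv_windows(uint32_t width, uint32_t height)
{
    uint32_t nrow = 0;                      /* destination C-SRAM row index */
    for (uint32_t y = 0; y + 3 <= height; y++) {
        for (uint32_t k = 0; k < 3; k++)    /* one image row per C-SRAM row */
            read_transfer(0u, y + k, nrow + k, width,
                          1u, 1u, DMU_OVERWRITING);
        blocking_wait();                    /* wait before in-memory compute */
        /* ... issue C-SRAM multiply-accumulate on rows nrow..nrow+2 ... */
        nrow += 3;
    }
}
```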
4.4. Results and Discussion

We evaluated the performance of our proposed architecture, relative to a reference architecture, in terms of speed-ups and energy reductions. We considered three scenarios: (1) using the C-SRAM without the DMU (C-SRAM-only), (2) using the C-SRAM with the proposed DMU controller to fetch input data strictly from the main memory (C-SRAM+DMU), and (3) using the C-SRAM with the proposed DMU controller to perform data transfers and data reuse whenever possible. In case 1, the data are transferred from the L1 data cache to the C-SRAM by the CPU, while in cases 2 and 3, the CPU issues data transfers directly between the main memory and the C-SRAM using the DMU. Case 3 is particularly relevant to the Sobel filter, which presents data redundancy due to the application of the convolution filters on the input images.

Figure 8 shows the energy reduction and speed-up for the three applications, compared to the reference scalar architecture. The X-axis represents the size of the inputs, and the Y-axis represents the improvement factors evaluated for each application (higher is better). Table 3 shows the average of the maximum speed-ups and energy reductions evaluated for each implementation across all applications. While the C-SRAM-only implementation already shows an improvement compared to the scalar system, the integration of the DMU with the C-SRAM improves the speed-up from ×5.73 to ×11.01 and the energy reduction from ×29.49 to ×46.67.
Figure 8. Energy reduction and speed-up for all applications compared to the reference scalar
architecture. The X and Y axes of the plots are, respectively, the data sizes and the improvement
factors, i.e., higher is better.
Table 3. Average maximum speed-up and energy reduction per evaluated implementation.
5. Conclusions
We presented the DMU, a programmable memory controller architecture to efficiently transfer and reorganize data between an SRAM IMC memory and the main memory. We integrated the DMU in a C-SRAM architecture and evaluated the energy reduction and speed-up for three applications, compared to a reference scalar architecture. The integration of the DMU with the C-SRAM improved the speed-up from ×5.73 to ×11.01 and the energy reduction from ×29.49 to ×46.67.
Our future work includes the physical implementation of the DMU on a test chip to validate our experiments, and compiler support for its ISA to implement an efficient programming model at the language level. We also plan to specify a more elaborate instruction set, able to transfer complex data structures such as stencil kernels and convolution windows using pattern descriptors, in order to automate transfer optimizations at the hardware level.
Author Contributions: Conceptualization, K.M., H.-P.C., M.K. and J.D.; methodology, all authors; software, all authors; validation, all authors; formal analysis, all authors; investigation, all authors; resources, all authors; data curation, all authors; writing—original draft, K.M.; writing—review and editing, H.-P.C., M.K. and J.D.; visualization, all authors; supervision, H.-P.C. and M.K.; project administration, H.-P.C. and M.K.; funding acquisition, H.-P.C. and M.K. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the EU H2020 project 955606 “DEEPSEA” – Software for
Exascale Architectures.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
2. Noel, J.P.; Pezzin, M.; Gauchi, R.; Christmann, J.F.; Kooli, M.; Charles, H.P.; Ciampolini, L.; Diallo, M.; Lepin, F.; Blampey, B.;
et al. A 35.6 TOPS/W/mm2 3-Stage Pipelined Computational SRAM with Adjustable Form Factor for Highly Data-Centric
Applications. IEEE Solid-State Circuits Lett. 2020, 3, 286–289. [CrossRef]
3. Kooli, M.; Charles, H.P.; Touzet, C.; Giraud, B.; Noel, J.P. Smart Instruction Codes for In-Memory Computing Architectures
Compatible with Standard SRAM Interfaces. In Proceedings of the 2018 Design, Automation & Test in Europe Conference &
Exhibition (DATE), Dresden, Germany, 19–23 March 2018; p. 6.
4. Gauchi, R.; Egloff, V.; Kooli, M.; Noel, J.P.; Giraud, B.; Vivet, P.; Mitra, S.; Charles, H.P. Reconfigurable tiles of computing-in-
memory SRAM architecture for scalable vectorization. In Proceedings of the ACM/IEEE International Symposium on Low Power
Electronics and Design, Boston, MA, USA, 10–12 August 2020; pp. 121–126.
5. Bavikadi, S.; Sutradhar, P.R.; Khasawneh, K.N.; Ganguly, A.; Pudukotai Dinakarrao, S.M. A Review of In-Memory Computing
Architectures for Machine Learning Applications. In Proceedings of the 2020 on Great Lakes Symposium on VLSI, Virtual, China,
7–9 September 2020; pp. 89–94.
6. Seshadri, V.; Lee, D.; Mullins, T.; Hassan, H.; Boroumand, A.; Kim, J.; Kozuch, M.A.; Mutlu, O.; Gibbons, P.B.; Mowry, T.C. Ambit:
In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In Proceedings of the 2017 50th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Boston, MA, USA, 14–17 October 2017; pp. 273–287.
7. Deng, Q.; Zhang, Y.; Zhang, M.; Yang, J. LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in
DRAM-based CNN Accelerator. In Proceedings of the 56th Annual Design Automation Conference, Las Vegas, NV, USA, 2–6 June
2019; pp. 1–6.
8. Fujiki, D.; Mahlke, S.; Das, R. Duality Cache for Data Parallel Acceleration. In Proceedings of the 46th International Symposium
on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; pp. 1–14.
9. Lee, K.; Jeong, J.; Cheon, S.; Choi, W.; Park, J. Bit Parallel 6T SRAM In-memory Computing with Reconfigurable Bit-Precision.
In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 20–24 July 2020;
pp. 1–6.
10. Bhattacharjee, D.; Devadoss, R.; Chattopadhyay, A. ReVAMP: ReRAM based VLIW architecture for in-memory computing. In
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March
2017; pp. 782–787.
11. Ezzadeen, M.; Bosch, D.; Giraud, B.; Barraud, S.; Noel, J.P.; Lattard, D.; Lacord, J.; Portal, J.M.; Andrieu, F. Ultrahigh-Density 3-D
Vertical RRAM with Stacked Junctionless Nanowires for In-Memory-Computing Applications. IEEE Trans. Electron Devices 2020,
67, 4626–4630. [CrossRef]
12. Ahn, J.; Yoo, S.; Mutlu, O.; Choi, K. Pim-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture.
In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR,
USA, 13–17 June 2015; pp. 336–348.
13. Mambu, K.; Charles, H.-P.; Dumas, J.; Kooli, M. Instruction Set Design Methodology for In-Memory Computing through
QEMU-Based System Emulator. 2021. Available online: https://hal.archives-ouvertes.fr/hal-03449840/document (accessed on
14 December 2021).
14. Muralimanohar, N.; Balasubramonian, R.; Jouppi, N.P. Cacti 6.0: A Tool to Model Large Caches; HP Laboratories: Palo Alto, CA,
USA, 2009.
15. Zhang, H.; Wu, K. A Vehicle Detection Algorithm Based on Three-Frame Differencing and Background Subtraction. In Proceedings
of the 2012 Fifth International Symposium on Computational Intelligence and Design, Hangzhou, China, 28–29 October 2012;
Volume 1, pp. 148–151.
16. Khronos Vision Working Group. The OpenVX Specification Version 1.2. Available online: https://www.khronos.org/registry/OpenVX/specs/1.2/OpenVX_Specification_1_2.pdf (accessed on 11 October 2017).
17. Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard, Basic Linear Algebra Subprograms Technical Forum. Available online: http://www.netlib.org/utk/people/JackDongarra/PAPERS/135_2002_basic-linear-algebra-subprograms-techinal-blas-forum-standard.pdf (accessed on 21 August 2001).