Heterogeneous Multi-Functional

Look-Up-Table-based Processing-in-Memory
Architecture for Deep Learning Acceleration
2023 24th International Symposium on Quality Electronic Design (ISQED) | 979-8-3503-3475-3/23/$31.00 © 2023 IEEE | DOI: 10.1109/ISQED57927.2023.10129338

Sathwika Bavikadi∗ , Purab Ranjan Sutradhar† , Amlan Ganguly† and Sai Manoj Pudukotai Dinakarrao∗
∗ Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA, USA
†Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY, USA.
∗ {sbavikad, spudukot}@gmu.edu, † {ps9525,axgeec}@rit.edu

Abstract—Emerging applications including deep neural networks (DNNs) and convolutional neural networks (CNNs) employ massive amounts of data to perform computations and data analysis. Such applications often lead to resource constraints and impose large overheads in data movement between memory and compute units. Several architectures, such as Processing-in-Memory (PIM), have been introduced to alleviate the bandwidth bottlenecks and inefficiency of traditional computing architectures. However, the existing PIM architectures represent a trade-off between power, performance, area, energy efficiency, and programmability. To better achieve the energy-efficiency and flexibility criteria simultaneously in hardware accelerators, we introduce a multi-functional look-up-table (LUT)-based reconfigurable PIM architecture in this work. The proposed architecture is a many-core architecture in which each core comprises processing elements (PEs), a stand-alone processor with programmable functional units built using high-speed reconfigurable LUTs. The proposed LUTs can perform various operations, including the convolutional, pooling, and activation operations required for CNN acceleration. Additionally, the proposed LUTs are capable of providing multiple outputs relating to different functionalities simultaneously, without the need to design different LUTs for different functionalities, which leads to optimized area and power overheads. Furthermore, we also design special-function LUTs, which can provide simultaneous outputs for multiplication and accumulation as well as special activation functions such as hyperbolics and sigmoids. We have evaluated various CNNs such as LeNet, AlexNet, and ResNet-18, -34, and -50. Our experimental results demonstrate that when AlexNet is implemented on the proposed architecture, it shows a maximum of 200× higher energy efficiency and 1.5× higher throughput than a DRAM-based LUT-based PIM architecture.
I. INTRODUCTION
The rapid advancements in hardware fabrication and integration, along with those in software applications, have led to the development of various fields, including computer vision, image processing, artificial intelligence (AI), and natural language processing. These emerging applications have led to an eventual increase in the demand for performance and efficiency, along with the amount of data to be observed and analyzed. Machine learning (ML) and deep learning (DL) are introduced as a panacea to process and analyze such vast amounts of data [1]–[3].

To meet hardware efficiency and other performance requirements, several architectural innovations have been proposed in recent years. Custom-designed accelerators such as application-specific integrated circuits (ASICs) [4], though energy efficient and optimized, have extremely low flexibility. On the other hand, field-programmable gate array (FPGA) accelerators [5], [6] address the programmability challenge but are hindered by low energy efficiency, complexity, and volatility. For executing DL/ML applications, central processing units (CPUs) are less energy-efficient than ASICs. Thus, conventional von Neumann architecture-based computing systems, including general-purpose processors (GPPs), CPUs, and graphics processing units (GPUs) [7], have extremely low energy efficiency and high latency [2], [8]. This excessive cost of computing efficiency [2], [8] is associated with the expensive memory access and data movement caused by the physical separation between the processing unit and the memory unit inside a conventional von Neumann architecture.

"Non-von Neumann" architectural paradigms [9], including processing-in-memory (PIM), also known as in-memory computing (IMC), and near-data processing (NDP), have been introduced to alleviate this data transfer bottleneck [10]. IMC architectures [11] perform the computations on the memory chip itself and exhibit higher energy efficiency compared to other paradigms due to their intra-memory communication and computation. Numerous PIM designs are implemented on a wide range of memory technologies, such as traditional volatile static random access memory (SRAM) [12] and dynamic random access memory (DRAM) [11], [13]–[16], as well as non-volatile memory technologies like Resistive RAM (ReRAM) [17] and Magnetic RAM (STT/SOT-MRAM) [18]. Among these, DRAM is the most widely used memory technology for manufacturing external memory devices due to its higher memory density, lower power consumption, and lower cost of production compared to other memory technologies [19].

To overcome the limited processing speed of IMC, look-up-table (LUT)-based PIMs have emerged as a panacea [6], [20]. Numerous works introduced in recent years use memory LUTs for performing arithmetic and logical computations [14], [16]. Although several designs propose implementing PIM architectures utilizing LUTs, the existing architectures are confined to specific applications and operations, i.e., they lack the flexibility and programmability to be adapted to other applications.
The LUTs and PIM systems are designed to support only one type of functionality and only one type of application, either compute-intensive or memory-intensive, with limited performance when executing other types of applications [20]. Therefore, a flexible hardware platform that supports a variety of CNN/DNN operations is required.
To address these challenges and offer a larger degree of functional flexibility and programmability, we introduce a DRAM-based multi-functional look-up-table-based reconfigurable PIM architecture that supports existing and emerging applications with low overheads and high programmability. The proposed architecture consists of multiple clusters embedded with many heterogeneous reconfigurable LUT cores. Each cluster comprises three types of LUT cores: an ALU LUT core, a special ALU (S-ALU) LUT core, and a special-function (SF) LUT core.
Unlike the existing works [14], [16], [21]–[23], the proposed LUT cores are heterogeneous multi-functional special LUT cores, i.e., each of these cores is capable of performing operations distinct from the others and can provide multiple outputs corresponding to multiple functionalities in a multiplexed manner; they are therefore called multi-functional LUTs. This approach not only reduces the number of LUTs but also increases the utilization efficiency and functional support offered by the LUTs. The ALU-LUT cores are specifically programmed to implement the MAC operations in the PIM. The special ALU (S-ALU) LUTs can provide multiple outputs relating to different functionalities simultaneously, without the need to design different LUTs for different functionalities. For instance, S-ALU LUT cores can be programmed to perform both multiplication and addition on the same given input in a single clock cycle, providing the outputs of both operations without programming two cores separately. This leads to optimized area and power overheads. Finally, the special-function (SF) LUTs are designed to implement special-function operations such as hyperbolic, sigmoid, and ReLU. In order to provide inherent computing support for MAC operations and for activation operations such as sigmoid, hyperbolic, and ReLU, a nine-LUT-core design exploration is adopted in each cluster.
To summarize, the novel contributions of this work are: based in-memory computing techniques are introduced and
• We propose a novel heterogeneous multi-functional look-up-table-based reconfigurable processing-in-memory architecture to address the energy-efficiency and flexibility criteria for computing architectures.
• We present a flexible architecture by introducing reconfigurable LUTs capable of performing the multi-functional operations required to process different layers of a neural network for CNN acceleration.
• We propose special heterogeneous multi-function LUTs capable of producing multiple outputs for multiplication, accumulation, sigmoid, hyperbolic, and ReLU operations.
• We propose three different kinds of LUT cores with different functionality: the ALU LUT core, the S-ALU LUT core, and the SF LUT core, which are specially designed to tackle the multi-functional operations required for CNN acceleration.
• We evaluate the proposed architecture on various CNN architectures including LeNet, AlexNet, and ResNet-18, -34, and -50, and show that it outperforms state-of-the-art techniques in terms of throughput, energy efficiency, and accuracy.

II. BACKGROUND AND RELATED WORKS

Deep neural network algorithms are dominated by a large number of simplistic, data-parallel computations, such as convolutions and matrix multiplications. These operations can be executed with a very high level of operational parallelism in hardware. Non-von Neumann architectures such as processing-in-memory, also known as in-memory computing devices, are being widely investigated for DNN/CNN applications. PIM architectures are able to perform massively parallel simple computations at remarkably low latency and high energy efficiency. PIM architectures are memory-centric architectures [11], [13], which are entirely implemented on a memory chip. PIM devices have been demonstrated to offer better parallel performance than most CPUs and ASIC devices [7], as well as better energy efficiency than GPUs [16]. This virtually eliminates the off-chip data bandwidth bottleneck otherwise suffered by state-of-the-art processing devices [2].

Recently, numerous in-memory computing hardware accelerators using conventional CMOS and emerging memory devices have been proposed. To overcome the large latency overheads caused by frequent data transfer between memory and logic units, IMC is seen as an efficient alternative for executing data-intensive ML applications. Despite their efficiency in terms of energy consumption, in-memory computations, including addition and multiplication operations, are orders of magnitude slower than traditional CMOS-based hardware accelerators. In addition to the large area and power overheads of DRAM-based in-memory accelerators, they require significant modifications to the memory-bank architecture, such as activation of multiple rows, high-precision timers, and novel sense amplifiers, to enable efficient IMC.

With the emergence of non-volatile memory (NVM), NVM-based in-memory computing techniques have been introduced and adopted in academia and industry. NVMs achieve higher integration densities (i.e., lower area) and offer better scalability and lower power consumption compared to standard DRAM technology, which makes them attractive candidates for hardware accelerator design. There exist multiple emerging NVMs which can potentially replace their CMOS counterparts, such as ReRAM [17], Phase-Change Memory (PCM), Spin-Transfer Torque (STT)-MRAM [18], and Spin-Orbit Torque (SOT)-MRAM [24]. Numerous IMC hardware accelerators that support ML applications have been introduced in the literature [20]. However, due to their low-voltage operation, the asymmetric read/write currents of emerging NVMs cause noise-margin issues; they are highly vulnerable to reliability concerns and are therefore not a viable option for CNN acceleration.

A majority of the IMC works [11], [13]–[15] focus on performing faster computations and do not consider the reconfigurability and networking concerns of the accelerators.

Fig. 1. Hierarchical Architecture showing the cluster arrangement and multi-functional heterogeneous core organization inside the cluster.

However, the functionality of these architectures is almost exclusively limited by their application, reconfigurability, overheads, latency, and CNN/DNN inference capability. To overcome the aforementioned challenges, the proposed work introduces a multi-functional LUT-based reconfigurable PIM architecture that achieves high-speed reconfiguration for accelerating various ML algorithms.
III. PROPOSED MULTI-FUNCTIONALITY LUT-BASED HETEROGENEOUS DL ACCELERATOR ARCHITECTURE
Figure 1 shows the hardware architecture of the proposed heterogeneous multi-functional LUT-based reconfigurable DL accelerator. The reconfigurable LUTs are capable of supporting different data precisions, and fewer LUTs are required for reduced-precision operations. Using lower-precision LUTs for computational operations improves latency and energy efficiency without compromising the accuracy of CNN algorithms. The architecture is composed of multiple clusters; each cluster comprises nine reconfigurable heterogeneous cores, which facilitate multi-functional programmable operations on a pair of 4-bit operands or a single 8-bit input. We chose this precision because most computer vision applications perform reliably at this precision with minimal accuracy loss compared to higher precision [25]. The nine reconfigurable heterogeneous cores consist of special multi-functional LUTs (ALU-LUT, S-ALU-LUT, SF-LUT) that are grouped together and interconnected by a router to form a single cluster. Each cluster can be programmed to perform a wide range of operations, such as multiply-and-accumulate, substitution, comparison, bit-wise logic, and hyperbolic, sigmoid, and ReLU activation operations. Therefore, an array of these clusters can be utilized to implement the different layers of CNNs and DNNs, such as convolutional layers, fully-connected layers, activation, and pooling layers, for various CNN inference applications.

Fig. 2. Microarchitectures of the heterogeneous LUT-based PIM cores: (a) microarchitecture of the ALU-LUT core and SF-LUT core; (b) microarchitecture of the S-ALU-LUT core.
A. LUT Core Architecture either AND or XOR operation on the input data. The cluster
The primary goal of the proposed nine-LUT-core design exploration in a cluster is to facilitate intrinsic computational support for MAC, sigmoid, hyperbolic, and ReLU operations. The LUT-based design approach of our PIM core provides the functional flexibility to configure the cores to do any arbitrary operation. The LUTs are implemented using 8-bit 256-to-1 multiplexers. For example, in order to perform an activation operation with an 8-bit operand, the 8-bit MUX in the PIM core is used to perform a look-up operation and provide an 8-bit output. Each LUT core can support either a single 8-bit operand or a pair of 4-bit operands. Consequently, our proposed heterogeneous multi-functional LUT cores can perform any kind of in-memory computation needed to implement the different layers of a neural network for ML acceleration. The functionalities of the proposed cluster and core design are shown in Figure 1 with corresponding color codes.

ALU-LUT core (Cores 1 to 6): The blue squares in Figure 1 represent the multi-functional LUT-based PIM cores that are programmed to perform 4-bit AND or XOR operations on a pair of 4-bit inputs and provide a 4-bit output. A multiplexer is used to select the functionality required by the different operations of the CNN algorithm, i.e., whether to perform an XOR or an AND operation on the inputs, as shown in Figure 2(a). Based on the multiplexer input, the multi-functional core performs either the AND or the XOR operation on the input data. The cluster is accommodated with six of these ALU-LUT cores.

S-ALU LUT core (Cores 7 and 8): The second kind of core used in the cluster, represented by the red squares in Figure 1, is the special LUT-based PIM core. The cluster contains two of these cores, which are programmed such that the output

consists of two entirely different operations (XOR and AND) on the same pair of inputs. Although the S-ALU-LUT core supports the same operations (XOR and AND) as the ALU-LUT core, its functionality is entirely different. This core is used in the special scenario where both the XOR and the AND of the same input data are needed, mainly during the accumulation process. The core is programmed to produce 8-bit output data for a pair of 4-bit inputs: the upper half of the core output represents the 4-bit XOR of the input data, while the lower half represents the 4-bit AND of the same input data, as shown in Figure 2(b). Thus, without the need to create separate LUT cores for various purposes, this unique S-ALU-LUT core can deliver several outputs pertaining to different functionalities concurrently.
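To make the packed-output idea concrete, the following is a minimal behavioral sketch (ours, not the authors' implementation) of the ALU-LUT and S-ALU-LUT cores, modeling each core as a 256-entry table indexed by a pair of concatenated 4-bit operands; in the hardware, the lookup corresponds to a 256:1 multiplexer select.

```python
# Behavioral sketch of the LUT cores (not the hardware itself): each core is
# modeled as a 256-entry table indexed by two concatenated 4-bit operands.

def build_alu_lut(op):
    """Program an ALU-LUT core: one 4-bit result per operand pair."""
    return [op(a, b) & 0xF for a in range(16) for b in range(16)]

def build_s_alu_lut():
    """Program an S-ALU-LUT core: XOR in the upper nibble, AND in the
    lower nibble, so a single lookup yields both results at once."""
    return [(((a ^ b) & 0xF) << 4) | ((a & b) & 0xF)
            for a in range(16) for b in range(16)]

def lookup(lut, a, b):
    # The hardware realizes this as a 256:1 multiplexer select.
    return lut[(a << 4) | b]

xor_lut = build_alu_lut(lambda a, b: a ^ b)   # one ALU-LUT personality
s_alu = build_s_alu_lut()

out = lookup(s_alu, 0b1010, 0b0110)
assert out >> 4 == lookup(xor_lut, 0b1010, 0b0110)   # packed XOR half
assert out & 0xF == 0b0010                           # packed AND half
```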
SF-LUT core (Core 9): The third kind of heterogeneous core used in the proposed architecture is represented by the green square in Figure 1. The cluster contains only one of these special multi-functional LUT-based PIM cores, which is programmed to perform 8-bit special-function activation operations such as sigmoid, hyperbolic, and ReLU using 8-bit LUTs. Similar to the ALU-LUT core, a multiplexer is used to select the activation operation to be implemented in the SF-LUT, as shown in Figure 2(a). The core is programmed to produce an 8-bit output from an 8-bit input. Based on the multiplexer input, the multi-functional core performs either the sigmoid, the hyperbolic, or the ReLU activation operation on the input.
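How such an 8-bit activation table might be populated is sketched below; the fixed-point input/output scaling is our assumption for illustration, since the paper does not state its quantization scheme.

```python
import math

def build_sf_lut(f, in_scale=16.0, out_scale=255.0):
    """Populate a 256-entry activation LUT for one 8-bit operand.
    Inputs are treated as signed fixed-point in [-8, 8); outputs are
    mapped to unsigned 8-bit codes (scaling choices are illustrative)."""
    table = []
    for code in range(256):
        x = (code - 128) / in_scale          # decode signed 8-bit input
        y = f(x)
        table.append(max(0, min(255, round(y * out_scale))))
    return table

sigmoid_lut = build_sf_lut(lambda x: 1.0 / (1.0 + math.exp(-x)))
relu_lut = build_sf_lut(lambda x: max(0.0, x), out_scale=255.0 / 8.0)

# A multiplexer input selects which programmed personality is read out.
tables = {0: sigmoid_lut, 1: relu_lut}
def sf_core(select, operand):
    return tables[select][operand]

print(sf_core(0, 128))  # sigmoid(0) -> roughly half scale (128)
```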
Each of these cores is capable of performing operations distinct from the others and can provide multiple outputs corresponding to multiple functionalities in a multiplexed manner; they are therefore called heterogeneous multi-functional LUTs. This provides the functional flexibility for the PIM to support the various operations required for CNN acceleration. The ALU-LUT and S-ALU-LUT cores are specifically programmed to implement the MAC operations in the PIM, whereas the SF-LUT core is designed to implement special-function activation operations such as hyperbolic, sigmoid, and ReLU. Therefore, with the proposed nine-LUT-core design exploration in the cluster, the proposed PIM can provide the computational support required for CNN acceleration.
B. Cluster Architecture proposed architecture easily distribute a particular task among
As shown in Figure 1, the cluster formed by the nine LUT cores is placed inside the memory banks in order to allow the quickest access to the memory data and to perform the in-memory operations with significantly lower latency. The nine cores in the proposed PIM cluster comprise six ALU-LUT cores that support either AND or XOR operations and two S-ALU-LUT cores that can perform both AND and XOR operations on the given input data, whereas the SF-LUT core is programmed to support activation operations such as hyperbolic, sigmoid, and ReLU.

These nine heterogeneous multi-functional LUT cores inside the cluster are programmed in a specific way and interconnected by a routing mechanism in order to perform the complex operations required for CNN acceleration, such as MAC, sigmoid, hyperbolic, and ReLU operations. These operations can be performed in a multi-staged pipeline by organizing a series of micro-operations across the nine LUT cores with the help of the routing mechanism.

The distribution of the operands during every stage of an operation is performed with the help of the router. The router enables parallel communication by connecting every component of the cluster, including the cores and the read/write ports, so that any core's data can be accessed at any point during execution; it therefore plays a vital role in the implementation. The memory read/write buffer of the cluster is used to read input data from the memory and write outputs back into the memory in order to perform the operations required for CNN acceleration.

Data communication among clusters inside the memory chip is also achieved through the routing mechanism. This lets the proposed architecture easily distribute a particular task among multiple clusters. At the same time, different clusters inside the memory bank can execute parallel and independent tasks in a single-instruction-multiple-data (SIMD) fashion.
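A minimal sketch of this SIMD execution model follows (cluster internals and the router are abstracted away; all names and data here are illustrative, not from the paper):

```python
# Minimal sketch of the cluster-level SIMD model: the same programmed
# kernel is broadcast to several clusters, each working on its own
# slice of a feature map.

def run_simd(kernel, slices):
    # In hardware the clusters run concurrently; this loop merely
    # stands in for that parallelism.
    return [kernel(s) for s in slices]

def mac_kernel(pairs):
    # Stand-in for the LUT-based MAC pipeline of one cluster.
    return sum(a * b for a, b in pairs) & 0xFF

slices = [[(1, 2), (3, 4)], [(5, 6), (7, 8)], [(2, 2), (3, 3)]]
print(run_simd(mac_kernel, slices))  # three clusters, three partial sums
```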

IV. OPERATIONS SUPPORTED BY THE PROPOSED HETEROGENEOUS MULTI-FUNCTIONAL ARCHITECTURE

The main benefit of the proposed architecture is that its LUTs can be programmed to implement virtually any type of computation. This equips it with the functional flexibility required to implement the different operations needed by various DL applications, such as linear algebraic operations, activation, and pooling. Among the nine heterogeneous LUT cores, eight (six ALU-LUT cores and two S-ALU-LUT cores) are used for performing MAC operations, and the remaining one (the SF-LUT core) is designed to implement activation operations using the memory look-up approach. These operations are carried out within the cluster by executing a multi-stage pipeline across the nine heterogeneous LUT cores, coupled together by a routing mechanism. Each core is capable of performing operations on a pair of 4-bit operands or a single 8-bit operand; the MAC operations are performed on pairs of 4-bit data in parallel to obtain the outputs for 8-bit inputs using the ALU-LUT and S-ALU-LUT cores. The output of the MAC operation is then passed to the SF-LUT core to perform activation functions such as sigmoid, hyperbolic, and ReLU.
Fig. 3. Overview of the dataflow for the MAC operation in the multi-functional heterogeneous PIM architecture.

In order to perform the multiplication-and-accumulation operation on two 4-bit operands, both inputs A and B are first split into sections A3, A2, A1, A0 and B3, B2, B1, B0, respectively. The 4-bit multiplication is then performed analogously to decimal long multiplication. As demonstrated in Figure 3, a special routing mechanism is used to perform the MAC operation in a multi-stage pipeline. Figure 3 also illustrates how each process in the dataflow is assigned a special tag consisting of a letter and a number for ease of implementation and testing. The numbers 0, 1, 2, and 3 denote the parallel operations carried out in each clock cycle, the letters I, J, K, L, M, N, O, Q, R, and S denote the clock steps of the LUT operations, and P0–P7 represent the MAC operation output. During runtime, P0–P7 of the MAC operation are accumulated using the S-ALU-LUT cores and passed to the SF-LUT core for the activation operation.
The MAC operation inside the cluster is implemented in a combinational manner by utilizing the LUT cores, such that the multiplication is implemented as a series of AND logic operations performed by the ALU-LUT cores and the accumulation process is handled by the S-ALU-LUT cores, as shown in Figure 3. Utilizing the multi-functional S-ALU-LUT instead of the ALU-LUT for the accumulation process improves the area, power, and latency overheads of the proposed architecture. To further improve core utilization, two consecutive accumulations can be overlapped in parallel when executing the MAC operation.
For the 4-bit inputs A and B, partial products are obtained by multiplying each bit of input B with the entire 4 bits of input A. The first partial product is obtained by multiplying B0 with A3, A2, A1, A0, and the second by multiplying B1 with A3, A2, A1, A0; the third and fourth partial products are formed likewise. Each partial product can therefore be implemented with the AND operator using an ALU-LUT core, as shown in Figure 3: the ALU-LUT core takes two 4-bit input operands and performs logical AND operations using the LUTs to provide a 4-bit output. All of these operations can be performed in a single clock cycle during execution. The partial products are then added using the 4-bit S-ALU-LUT cores to parallelize the addition process. The first partial product is added to the second, the result is added to the next partial product with carry-out, and so on until the final partial product. This produces an 8-bit output representing the MAC value of the two 4-bit input operands. A combined multiplication and addition process can be executed in a 9-clock-cycle pipeline, as shown in Figure 3.
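This partial-product scheme can be sanity-checked with a short functional model (ours, not the authors' code); each AND row below stands in for one ALU-LUT lookup, and each addition for one S-ALU-LUT accumulation step:

```python
def mul4_via_partial_products(a, b):
    """4-bit x 4-bit multiply built the way the text describes: each bit
    of B ANDs the whole of A (one ALU-LUT row), and the shifted rows
    are then accumulated sequentially (the S-ALU-LUT's job)."""
    assert 0 <= a < 16 and 0 <= b < 16
    rows = []
    for i in range(4):
        bit = (b >> i) & 1
        row = (a if bit else 0) << i   # AND of A with bit Bi, then shift
        rows.append(row)
    acc = 0
    for row in rows:                   # accumulation with carry
        acc = (acc + row) & 0xFF       # the result fits in 8 bits
    return acc

# Exhaustive check over all 256 operand pairs:
assert all(mul4_via_partial_products(a, b) == a * b
           for a in range(16) for b in range(16))
```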
The output of the MAC operation is passed to the multi-functional SF-LUT core to implement the activation functions. A multiplexer is used to select the activation operation to be implemented in the SF-LUT; based on the multiplexer input, the multi-functional core performs either the sigmoid, the hyperbolic, or the ReLU activation on the input data. This operation can be performed in a single clock cycle during execution. The router enables the chain of operations required for the MAC and activation operations inside the cluster. A key advantage of the proposed architecture is that it enables a special routing scheme and a parallelization process that efficiently utilize the cores inside the cluster. Moreover, the LUTs in the proposed architecture can be reprogrammed at run-time to perform the complex computational operations needed to implement CNNs at ultra-low latency.

V. EVALUATION

A. Design Verification

We verified the architecture as an ASIC via a Verilog HDL implementation. We evaluate the performance using different metrics (operational latency, power consumption, and active area) obtained from HDL synthesis on Synopsys Design Compiler using a 28 nm standard cell library from TSMC; the results are presented in Table I. Within a cluster, a single 8-bit MAC requires computations inside the PIM cores as well as communication across cores, which adds to the delay, whereas the cluster's power consumption is the sum of each core's power and that of the core-to-core communication. The power and delay figures for intra- and inter-subarray data transfers are obtained from [15] and [26]. These metrics are used in the system-level performance evaluation.

TABLE I
CHARACTERISTICS OF THE MULTI-FUNCTIONAL HETEROGENEOUS HARDWARE ACCELERATOR AND ITS COMPONENTS IN THE 28 NM TECHNOLOGY NODE

Component | Delay (ns) | Power (mW) | Active Area (µm²)
ALU-LUT Core | 0.10 | 0.00177 | 8010
S-ALU-LUT Core | 0.26 | 0.00497 | 13210
SF-LUT Core | 0.7 | 0.01853 | 141304
Multi-functional Heterogeneous Cluster | 1.62 | 0.05539 | 199764
LUT Core [16] | 0.8 | 2.7 | 4196.64
LUT Cluster (MAC Operation) [16] | 6.4 | 8.2–11 | 37769.81
Intra-Subarray Communication [26]* | 63.0 | 0.028 µJ/comm | N/A
Inter-Subarray Communication [15] for subarrays 1/7/15 hops away* | 148.5/196.5/260.5 | 0.09/0.12/0.17 µJ/comm | N/A

*Represented in the 28 nm technology node.
Firstly, from Table I it is observed that, owing to the different operational support they provide, the heterogeneous cores have different delay, area, and power metrics. Since the SF-LUT processes 8-bit data on 8-bit memory LUTs, unlike the ALU and S-ALU cores, it has the highest delay, area, and power consumption. The ALU-LUT core is designed to process a pair of 4-bit operands on 4-bit memory LUTs and has the lowest delay, area, and power consumption. However, compared to the LUT core of [16], the proposed cores have

relatively lower delay and power consumption, although the active area is 2× greater. However, the proposed heterogeneous PIM core can provide multiple functionalities simultaneously, whereas multiple traditional LUT cores would be required to provide the same functionalities. This indicates that the increased area overhead is well justified and minimal for systems that perform complex operations.
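Using the statement above that cluster power is the sum of the core powers plus core-to-core communication, the Table I numbers can be decomposed as a back-of-envelope check; attributing the residual to the router interconnect is our inference, not a figure from the paper:

```python
# Back-of-envelope decomposition of the cluster power in Table I.
# Assumes, per the text, cluster power = sum of core powers + routing.
alu, s_alu, sf = 0.00177, 0.00497, 0.01853   # mW per core (Table I)
cluster_total = 0.05539                       # mW (Table I)

core_sum = 6 * alu + 2 * s_alu + 1 * sf      # 6 + 2 + 1 cores per cluster
routing = cluster_total - core_sum           # residual: core-to-core comm.

print(f"cores: {core_sum:.5f} mW, attributed to routing: {routing:.5f} mW")
# cores: 0.03909 mW, attributed to routing: 0.01630 mW
```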
Nine of these cores are grouped together as discussed in Section III-A, forming a single cluster. From the system-level perspective, the PIM requires 256 of these PIM clusters in order to perform computational operations at 8-bit data precision. To facilitate this, we consider infusing one PIM bank with 256 PIM clusters per DRAM chip across the entire rank of DRAM chips in a DIMM (dual in-line memory module).
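For a rough sense of the resulting scale (our arithmetic, not a figure reported in the paper), combining the 256 clusters per chip with the 1.62 ns cluster delay from Table I gives:

```python
clusters_per_chip = 256          # stated system-level configuration
cluster_delay_s = 1.62e-9        # 8-bit MAC + activation, Table I

macs_per_second = clusters_per_chip / cluster_delay_s
print(f"{macs_per_second:.3e} 8-bit MAC+activation ops/s per DRAM chip")
# ~1.580e+11 ops/s, assuming every cluster stays busy every cycle
```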
For the cluster characteristics when implementing the 8-bit MAC operation together with an activation operation on the proposed architecture, the delay is observed to be 1.62 ns, whereas for the LUT cluster of [16] to perform just the MAC operation the delay is 6.4 ns, which makes the MAC implementation on the multi-functional cores almost 4× faster than on the LUT cluster of [16]. Therefore, the multi-functional architecture is highly suitable for ultra-low-latency, low-power applications such as real-time IoT and edge devices. Even though the proposed architecture occupies more area than the IMC LUT-based design of [16], it is still observed to achieve a lower area in the case of edge devices.
B. Performance Evaluation

In this subsection, we perform a comparative performance analysis of the proposed architecture in terms of throughput and energy efficiency on the LeNet, AlexNet, and ResNet-18, -34, and -50 CNN algorithms for a batch size of 64. Energy efficiency is defined as the number of frames processed per unit of energy (Joules).
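Both metrics are straightforward ratios; the helper below only pins down the units (the numbers in the example are placeholders, not results from Figure 4):

```python
def throughput_fps(frames, seconds):
    """Inference throughput in frames per second."""
    return frames / seconds

def energy_efficiency_fpj(frames, joules):
    """Energy efficiency as defined above: frames processed per Joule."""
    return frames / joules

# Placeholder numbers purely to show the units:
print(throughput_fps(64, 0.5), "Frames/s;",
      energy_efficiency_fpj(64, 2.0e3), "Frames/Joule")
```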
Figure 4 presents comparisons of the throughput (in Frames per second) and the energy efficiency (in Frames per Joule) of inference for all of these CNNs deployed on the proposed multi-functional heterogeneous architecture.
Fig. 4. Comparison of energy efficiency (Frames/Joule) and throughput (Frames/second) for LeNet, AlexNet, ResNet-18, -34, and -50 on the proposed multi-functional heterogeneous architecture.

Firstly, Figure 4 shows that the energy efficiency of the CNN algorithms is proportional to the depth of the network. As the number of layers increases, more MAC and activation operations need to be performed, which implies more parallelization to perform these operations. Therefore, the deeper the CNN algorithm, the higher the achieved energy efficiency. It is observed that LeNet, AlexNet, and ResNet-18 achieve inference energy efficiencies of 0.0011 Frames/Joule, 0.024 Frames/Joule, and 0.038 Frames/Joule, respectively.
Figure 4 also shows that the proposed architecture achieves better performance for CNN algorithms with a comparatively lower computational workload, such as LeNet. For AlexNet, with 8 layers, the proposed architecture achieves an inference throughput of 150.3 Frames/s, and the 50-layer ResNet achieves an inference throughput of 45.9 Frames/s. It can therefore be said that the proposed architecture achieves impressive performance while implementing the MAC and activation operations of the convolutional layers in CNNs/DNNs very efficiently. For instance, ResNet-50, the largest network implemented on the proposed architecture, consists of 50 layers with thirty-eight billion computations, which can be processed within 10 ms on the proposed architecture.
C. Inference Accuracy

We evaluate our proposed architecture on various state-of-the-art deep neural networks, namely LeNet [27], AlexNet [1], and ResNet-18, -34, and -50 [28]. These deep learning algorithms are implemented on the proposed hardware accelerator using the MNIST [29] (28×28×1 dimensions) and CIFAR-10 [30] (32×32×3 dimensions) datasets. Figure 5 shows the Top-5 accuracy comparison plots for 16-bit floating-point (FP) and 8-bit fixed-point data precision on both datasets.

Fig. 5. Comparison of the Top-5 accuracies of LeNet, AlexNet, and ResNet-18, -34, and -50 on the MNIST and CIFAR-10 datasets for 16-bit and 8-bit data precision.
It is observed that the accuracies obtained on the evaluated networks are very similar for 16-bit and 8-bit precision data (inputs and weights). The Top-1 accuracy obtained on the MNIST dataset with AlexNet is 98.89% and 99.43% for 16-bit and 8-bit precision, respectively. On the other hand, the Top-1 accuracy obtained on the CIFAR-10 dataset with AlexNet is 83.5% and 82% for 16-bit and 8-bit precision, respectively. It is also observed that the CNN accuracies on the CIFAR-10 dataset are noticeably lower than on the MNIST dataset, as also shown in Figure 5. The performance degradation

is around 10%–15% for all of the deployed CNNs. The accuracy on the CIFAR-10 dataset is, in general, significantly lower than on the MNIST dataset due to the comparatively higher complexity of the dataset. Although higher accuracy on CIFAR-10 has been reported in the literature, it is obtained with higher data precision than that adopted in this paper [25].
D. Performance Comparison with State-of-the-Art Hardware Accelerators for CNN Implementation
Performance is evaluated by comparing the proposed architecture with state-of-the-art PIM accelerator architectures in terms of power consumption (Watt) and throughput (Frames/second), as shown in Figure 6.
Fig. 6. Comparative performance analysis of the proposed multi-functional heterogeneous architecture with respect to state-of-the-art PIM architectures in terms of throughput (Frames/second) and power consumption (Watt).
As a proof of concept, we evaluate and implement AlexNet [1] on the proposed architecture with 8-bit precision. The PIM architectures under comparison in this section include the DRAM-based bulk bit-wise processing devices DRISA [11] and DrAcc [13], the SRAM-implemented Neural Cache [12], and LUT-based PIMs implemented on DRAM platforms, namely LAcc [14] and the pPIM architecture [16].
Among the PIMs studied here, Neural Cache [12] is the slowest due to its limited processing capabilities and its comparatively slower bit-serial computing mechanism. On the other hand, a relatively higher throughput is observed for DRISA [11] due to its ability to parallelize operations across multiple banks, whereas DrAcc [13] implements 8-bit ternary-precision inference through very minimal circuit modifications, which allows it to obtain high performance similar to that of pPIM [16]. The benefits of adopting LUTs in order to utilize pre-calculated results instead of performing in-memory logic operations are convincingly demonstrated by LAcc [14] and pPIM [16], which achieve impressive inference performance.
The proposed architecture, on the other hand, utilizes the multi-functional heterogeneous memory LUTs to perform the CNN algorithms and is observed to have a relatively higher AlexNet throughput than the LUT-based PIMs under comparison. It is also observed to have a much higher throughput than the other PIM architectures, such as DRISA, DrAcc, and Neural Cache, as shown in Figure 6. A similar trend is observed in the power-consumption comparison: the proposed architecture exhibits lower power consumption than the other PIM architectures, as shown in Figure 6. It is also observed that the proposed architecture outperforms LAcc and pPIM by almost 1.5× in AlexNet inference throughput, and it achieves a maximum of 200× higher energy efficiency than the LAcc and pPIM implementations for AlexNet inference.

VI. CONCLUSION

In order to address the energy-efficiency and flexibility requirements of computer architectures, we present a novel multi-functional heterogeneous look-up-table-based reconfigurable PIM architecture in this work. The proposed architecture is aimed at CNN and DNN inference applications and supports existing and emerging applications with low overheads and high programmability. The proposed hardware accelerator's heterogeneous reconfigurable LUTs enable multi-functional programming to carry out almost any arithmetic or logical operation. As a result, it can process the convolutional, fully-connected, activation, and pooling layers of a CNN/DNN algorithm. Performance is evaluated by comparing the proposed architecture with state-of-the-art PIM architectures. We have evaluated various CNNs such as LeNet, AlexNet, and ResNet-18, -34, and -50 on the proposed architecture. Our experimental results have demonstrated that when AlexNet is implemented on the proposed architecture, it shows a maximum of 200× higher energy efficiency and 1.5× higher throughput than a DRAM-based LUT-based PIM architecture. Although the proposed architecture is primarily designed for CNN acceleration, its heterogeneous multi-functionality, reconfiguration, and ultra-low-latency implementation make it suitable for a wider range of application domains such as real-time IoT, edge devices, mobile applications, automated robots, and automated computers.

VII. ACKNOWLEDGEMENTS

This work was supported in part by the US National Science Foundation (NSF) Grant CNS-2228239. The views, opinions, and/or findings contained in this article are those of the author(s) and should not be interpreted as representing the official views or policies, either expressed or implied, of the US NSF.

REFERENCES

[1] M. Z. Alom et al., "The history began from alexnet: A comprehensive survey on deep learning approaches," arXiv, 2018.
[2] S.-L. Lu et al., "Scaling the "memory wall": Designer track," in 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2012, pp. 271–272.
[3] S. Rafatirad et al., Machine Learning for Computer Scientists and Data Analysts: From an Applied Perspective. Springer Nature, 2022.
[4] Y. Chen et al., "Dadiannao: A machine-learning supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609–622.
[5] J. Fowers et al., "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, 2012.
[6] S. Bavikadi et al., "A survey on machine learning accelerators and evolutionary hardware platforms," IEEE Design & Test, vol. 39, no. 3, pp. 91–116, 2022.
[7] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," CoRR, vol. abs/1704.04760, 2017. [Online]. Available: http://arxiv.org/abs/1704.04760
[8] O. Villa et al., "Scaling the power wall: A path to exascale," in SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2014, pp. 830–841.

[9] A. Ganguly, R. Muralidhar, and V. Singh, "Towards energy efficient non-von neumann architectures for deep learning," in Int. Symp. on Quality Electronic Design (ISQED), 2019.
[10] M. Gao, G. Ayers, and C. Kozyrakis, "Practical near-data processing for in-memory analytics frameworks," Oct 2015, pp. 113–124.
[11] S. Li et al., "Drisa: A dram-based reconfigurable in-situ accelerator," in 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 288–301.
[12] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 383–396.
[13] Q. Deng et al., "Dracc: a dram based accelerator for accurate cnn inference," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018, pp. 1–6.
[14] Q. Deng et al., "Lacc: Exploiting lookup table-based fast and accurate vector multiplication in dram-based cnn accelerator," in 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.
[15] K. K. Chang et al., "Low-cost inter-linked subarrays (lisa): Enabling fast inter-subarray data movement in dram," in IEEE Int. Symp. on High Performance Computer Architecture (HPCA), March 2016, pp. 568–580.
[16] P. R. Sutradhar et al., "pPIM: A programmable processor-in-memory architecture with precision-scaling for deep learning," IEEE Computer Architecture Letters, vol. 19, no. 2, pp. 118–121, 2020.
[17] L. Song et al., "Pipelayer: A pipelined reram-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 541–552.
[18] S. Angizi et al., "Mrima: An mram-based in-memory accelerator," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 5, pp. 1123–1136, 2020.
[19] T. P. Morgan, "Accelerating compute by cramming it into dram memory," Oct 2019. [Online]. Available: https://www.upmem.com/nextplatform-com-2019-10-03-accelerating-compute-by-cramming-it-into-dram/
[20] S. Bavikadi et al., "A review of in-memory computing architectures for machine learning applications," ser. GLSVLSI '20, 2020.
[21] P. R. Sutradhar et al., "Look-up-table based processing-in-memory architecture with programmable precision-scaling for deep learning applications," IEEE TPDS, 2021.
[22] S. Bavikadi et al., "upim: Performance-aware online learning capable processing-in-memory," in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2021, pp. 1–4.
[23] S. Bavikadi et al., "Polar: Performance-aware on-device learning capable programmable processing-in-memory architecture for low-power ml applications," in 2022 25th Euromicro Conference on Digital System Design (DSD), 2022, pp. 889–898.
[24] G. Yuan et al., "A sot-mram-based processing-in-memory engine for highly compressed dnn implementation," 2019. [Online]. Available: https://arxiv.org/abs/1912.05416
[25] K. Vasquez et al., "Activation density based mixed-precision quantization for energy efficient neural networks," arXiv e-prints, p. arXiv:2101.04354, Jan. 2021.
[26] V. Seshadri et al., "Rowclone: Fast and energy-efficient in-dram bulk data copy and initialization," in 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2013, pp. 185–197.
[27] Y. Lecun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[28] K. He et al., "Deep residual learning for image recognition," arXiv, 2015.
[29] L. Deng, "The mnist database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, Nov 2012.
[30] A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009.
