
High-throughput Near-Memory Processing on CNNs with

3D HBM-like Memory

NAEBEOM PARK and SUNGJU RYU, Pohang University of Science and Technology
JAEHA KUNG, DGIST
JAE-JOON KIM, Pohang University of Science and Technology

This article discusses a high-performance near-memory neural network (NN) accelerator architecture utilizing the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most of the previously reported 3D memory-based near-memory NN accelerator designs used the Hybrid Memory Cube (HMC) memory, we first focus on identifying the key differences between HBM and HMC in terms of near-memory NN accelerator design. One of the major differences between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels while HMC has distributed TSV channels for separate vaults. Based on this observation, we introduce the Round-Robin Data Fetching and Groupwise Broadcast schemes, which exploit the centralized TSV channels to improve the data feeding rate for the processing elements. Using designs synthesized in a 28-nm CMOS technology, the performance and energy consumption of the proposed architectures with various dataflow models are evaluated. Experimental results show that the proposed schemes reduce the runtime by 16.4–39.3% on average and the energy consumption by 2.1–5.1% on average compared to conventional data fetching schemes.
CCS Concepts: • Computer systems organization → Neural networks; • Hardware → 3D integrated
circuits; Hardware accelerators; Application specific processors;
Additional Key Words and Phrases: Neural network accelerator, HBM
ACM Reference format:
Naebeom Park, Sungju Ryu, Jaeha Kung, and Jae-Joon Kim. 2021. High-throughput Near-Memory Processing
on CNNs with 3D HBM-like Memory. ACM Trans. Des. Autom. Electron. Syst. 26, 6, Article 48 (June 2021),
20 pages.
https://doi.org/10.1145/3460971

1 INTRODUCTION
In parallel with the rapid improvement in deep learning algorithms, there has been a great effort in
hardware domain to improve the throughput/energy efficiency of the training or inference of the
complex deep learning models [39]. Owing to the innovations from both software and hardware,

This work was in part supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1A5A1027055, NRF-2020R1A2C2004329) and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01309, Development of artificial intelligence deep learning processor technology for complex transaction processing server).
Authors' addresses: N. Park, S. Ryu, and J.-J. Kim, Pohang University of Science and Technology, Pohang; emails: {naebeom.park, sungju.ryu, jaejoon}@postech.ac.kr; J. Kung, DGIST, Daegu; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
1084-4309/2021/06-ART48 $15.00
https://doi.org/10.1145/3460971


deep learning has shown the best performance in a wide range of applications, such as object clas-
sification, object segmentation, natural language processing, robot control, and game competition,
to name a few [12, 15, 24, 33, 34, 42]. Unfortunately, the models to achieve the best accuracy in each
application domain require a significantly large number of computations or size of storage [41].
To accommodate the large number of computations, we could put more processing elements
(PEs) in the processor as in GPUs and/or use multiple processors together [3, 20]. Due to the large
array of simple stream processors in GPUs, the most recent Turing architecture shows the perfor-
mance of 113.8 Tensor TFLOPS [1]. The same approach, but with higher energy-efficiency, applies
to an ASIC or FPGA implementation of deep learning accelerators. The improved energy-efficiency
comes from the customized hardware architecture optimized for deep learning algorithms [6, 13].
Yet, we are not allowed to put as many PEs as we want due to the limited memory bandwidth
provided to the processor [21, 40]. The processing speed does not scale proportional to the num-
ber of PEs after some point as the PE utilization is constrained by the amount of data fetched
by a given memory bandwidth. Prior works overcome this bandwidth bottleneck by improving
the dataflow model of architectures [6, 21, 29] or exploiting the sparsity residing in deep learning
models [13, 14, 26, 45].
To directly solve the memory bottleneck, we can use DRAM with higher memory bandwidth for
each processor [17, 19, 20] or connect processors with high-speed links [20]. The high-end GPUs
or Neural Network (NN) accelerators utilize laterally placed three-dimensional (3D) memory
stacks on the same interposer making 2.5D design [2, 8, 20, 23]. Another approach other than the
2.5D design is implementing an NN processor/accelerator on logic die within a 3D memory stack
(3D design), called near-memory processing [4, 11, 22]. By placing an NN accelerator on the logic
die, we are able to compute deep learning algorithms without sending data via expensive external
links.
Until now, there have been two 3D memory specifications: (i) High Bandwidth Memory
(HBM) [19, 27] and (ii) Hybrid Memory Cube (HMC) [17, 18]. The HBM has been fabricated
by Samsung and SK Hynix [27, 36] and HMC by Micron [17]. Very recently, however, Micron an-
nounced that they will also focus on the production of HBM as a high-performance memory [32].
Research works from academia have used only HMC as the default model for near-memory processing to accelerate deep learning [4, 11, 22].
In this article, we show that an efficient data fetching scheme from DRAM dies to the NN accelerator on the logic die in HBM is different from the data fetching scheme that is suitable for NN accelerators in HMC. To be more exact, the bottom buffer die of the current generation of HBM is implemented in a DRAM process. Because a logic process is needed to implement a high-performance NN accelerator, we assume that the bottom die of HBM is implemented in a logic process in our study. Therefore, we use the term HBM instead of HBM-like memory for brevity throughout the article.
We analyzed the overall efficiency of the inference of five representative Convolutional Neural Network (CNN) algorithms. By using the architecture with the proposed schemes, we improve the performance by 16.4–39.3% and the energy efficiency by 2.1–5.1% on average compared to the architecture with conventional data fetching schemes based on HMC.
The main contributions of our study are as follows:
(1) Data fetching strategy with HBM: We propose that, for the data movement from DRAM stacks to NN Engines on the logic die in HBM, round-robin data fetching is more efficient than the conventional distributed data fetching. The round-robin data fetching
exploits the characteristics of the centralized through-silicon-via (TSV) structure in HBM
while the distributed data fetching is more suitable for the HMC-based designs, in which
distributed TSV structure is used.


(2) Local dataflow model within the NN Engine: In the row stationary dataflow model, fetching data from the global buffer to the PEs takes most of the processing time, because the dataflow actively reuses data in the global buffer and register files within the NN Engine. Therefore, to reduce the overall processing time, the bit-width of the data fetched from the global buffer to the PEs must be increased. However, with conventional architectures using the multicast scheme, increasing the bit-width leads to a huge routing overhead for connecting wide I/Os to all PEs. On the contrary, the proposed architecture using the Groupwise Broadcast scheme with pre-determined and grouped interconnects can remove this routing overhead.

2 BACKGROUND AND MOTIVATION


2.1 Neural Network Accelerator
Deep Neural Networks (DNNs) are being actively used for tasks such as image recognition, local-
ization, natural language processing, and machine translation, showing remarkable accuracy [25].
Depending on the type of interconnection between neurons, there are various types of layers: con-
volutional (CONV), fully-connected (FC), pooling, and activation layers. Among them, CNNs,
which are widely used for image classification, typically have more than 90% of the operations
concentrated on the CONV layers [39].
The operation in a CONV layer can be described as follows:
\[
  O[b][v] = f\left(\sum_{u=0}^{N_i-1} I[b][u] \ast W[u][v] + B[v]\right), \tag{1}
\]

where I is the input feature map (ifmap), W is the filter weight, B is the bias, and O is the output feature map (ofmap). N_i is the number of input neurons required to compute one output neuron. The variables b, v, and u represent the index of a two-dimensional feature map (fmap), a channel in the ofmaps, and a channel in the ifmaps, respectively. f(.) is a non-linear function that determines the activation of an output neuron. In Equation (1), the number of operations for one output neuron is determined by N_i. The operations can be performed with simple Multiply-Accumulate (MAC) units, because only multiplication and accumulation are used for operations in CONV layers [5, 6, 44].
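For illustration, a minimal NumPy sketch of Equation (1) for a single output neuron is given below; the tensor shapes and the function and variable names are our own assumptions and not part of any accelerator described in this article.

```python
import numpy as np

def conv_output_neuron(ifmap_patch, weight, bias, f=lambda x: np.maximum(x, 0.0)):
    """Compute one ofmap pixel as in Equation (1): N_i multiply-accumulate
    (MAC) operations followed by a non-linear activation f(.).

    ifmap_patch: (C_in, K, K) input window covered by the filter
    weight     : (C_in, K, K) filter for one output channel v
    bias       : scalar B[v]
    """
    psum = 0.0
    for u in range(ifmap_patch.shape[0]):            # loop over input channels
        psum += np.sum(ifmap_patch[u] * weight[u])   # K*K MACs per channel
    return f(psum + bias)

# Example: a 3x3 filter over a 64-channel window -> N_i = 64*3*3 = 576 MACs
patch = np.random.rand(64, 3, 3).astype(np.float32)
filt  = np.random.rand(64, 3, 3).astype(np.float32)
print(conv_output_neuron(patch, filt, bias=0.1))
```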
AlexNet, one of the earlier CNN models for image classification with the ImageNet dataset, consists of 5 CONV and 3 FC layers [24]. A more recent model, ResNet, consists of 151 CONV and 1 FC layers [15]. Recent trends show that the number of layers in a DNN and the number of filters tend to increase, while the filter size tends to decrease. As the number of filters increases, the number of channels of a fmap increases, and the number of operands required to compute one pixel of the ofmap increases.
The NN accelerator is a hardware that consists of a large number of computation and control
logic for the specific operations used in the NN. To run the large number of operations, the NN
accelerator typically uses a spatial architecture that consists of a large number of identical PEs.
The PEs perform the same operations with different operands in parallel [6]. There are various
dataflow models optimized for processing NNs in the NN accelerators such as output stationary
(OS), weight stationary (WS), and row stationary (RS) models. The architectures reuse different types of data within the PE. In the (OS, WS) model, one PE stores and reuses (partial sum (psum), weight) values in the register file [21, 29]. In the RS model, one PE stores and reuses parts of the ifmap, filter, and psum values in the register files [6]. With those dataflow models, the NN accelerators process NNs faster and consume less power than general-purpose computing platforms such as CPUs and GPUs.
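The reuse patterns of the OS and WS models can be sketched as two small loop nests from a single PE's point of view; this is only a behavioral illustration under our own naming, not the accelerators' actual control logic.

```python
# Behavioral sketch of per-PE data reuse in the OS and WS dataflow models.

def output_stationary_pe(ifmap_stream, weight_stream):
    """OS: the psum stays in the PE's register file; a new ifmap/weight
    pair streams in every cycle and the psum is written out only once."""
    psum = 0.0
    for x, w in zip(ifmap_stream, weight_stream):
        psum += x * w                     # one MAC per cycle, psum stationary
    return psum

def weight_stationary_pe(weight, ifmap_stream, psum_stream):
    """WS: one weight value stays in the PE; ifmaps and psums stream through."""
    return [p + weight * x for x, p in zip(ifmap_stream, psum_stream)]

print(output_stationary_pe([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))     # 32.0
print(weight_stationary_pe(2.0, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```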


Fig. 1. Structural difference between (a) HBM and (b) HMC.

Fig. 2. Widely used model partitioning schemes: (a) Fmap partitioning and (b) Output partitioning scheme.

The number of MAC operations for processing an NN for image classification with ImageNet
dataset varies from 724M to 15.5G depending on the network [39]. As the number of PEs increases,
the number of operations that can be performed in a cycle increases. However, the large number
of PEs does not always guarantee the fast processing because of the bottleneck in memory band-
width [21]. To make a large number of PEs operate simultaneously, it is necessary to feed the
operands to all the PEs continuously. Therefore, as the number of PEs increases, the required
bandwidth for fetching data to PEs also increases [11]. As a result, NN accelerators prefer to use
high-bandwidth DRAM as an external memory [6]. To increase the bandwidth of the external mem-
ory, multiple two-dimensional (2D) DRAMs with lower bandwidth or a single high-bandwidth 3D
DRAM such as HBM [19] or HMC [18] can be used.

2.2 Near-Memory NN Accelerators Using 3D Memory


In addition to the high-bandwidth I/O, 3D memory offers another promising feature for NN ac-
celeration as it has a logic die inside the 3D DRAM stack. If computations can be done inside the
memory module, then energy consumption caused by the communication between the memory
and NN accelerators can be greatly reduced.
HBM and HMC are similar in that both of them consist of multiple DRAM dies and one logic
die. The dies are connected with TSVs. Except those similarities, HBM and HMC have significant
differences in terms of structure. Figure 1 shows the structure of HBM and HMC. In HBM, all TSVs
are placed at the center of dies. However, the dies in HMC are divided into 16 vaults, and TSVs
are positioned independently in each vault. Also, a memory controller is located in each vault of
HMC [17–19, 27]. Our proposal focuses on the difference of the TSV connections in HBM and
HMC.
Figure 2 shows model partitioning schemes that are used to process NNs with multiple NN En-
gines. There are several methods to divide workloads and assign them to multiple NN Engines [22].
Figure 2(a) shows fmap partitioning (FP) scheme that divides a fmap into multiple sub-fmap
tiles. If we assign an NN to 4 NN Engines, then a fmap is divided into four sub-fmap tiles and each


Fig. 3. The overall architecture of the proposed near-memory processing: (a) “NN Accelerator” on the logic
die of HBM. (b) A “NN Engine” with the proposed data fetching schemes.

sub-fmap tile is assigned to each NN Engine. When using the FP scheme, each NN Engine receives
different ifmap data and the same filter data. Figure 2(b) shows output partitioning (OP) scheme
in which ofmap channels are divided into multiple groups. When the OP scheme is used, each NN
Engine has the same ifmap data and different filter data.
In addition to these two partitioning schemes, there are two more partitioning schemes: input partitioning and batch partitioning [11]. When the input partitioning scheme is used, the ifmap channels are partitioned into groups. In this case, the outputs from the NN Engines are psums for an ofmap element, and hence a large amount of psum communication between NN Engines is required. With the batch partitioning scheme, the CNN model must be replicated in each NN Engine, which can impose a heavy burden if the CNN model is large.
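As a simple illustration of the FP and OP schemes in Figure 2, the sketch below splits a CONV layer across four NN Engines; the tensor layout and the omission of halo rows at tile borders are our own simplifications.

```python
import numpy as np

def fmap_partition(ifmap, n_engines):
    """FP: split the fmap spatially into sub-fmap tiles (here along the height);
    every engine receives a different ifmap tile but the same filters."""
    return np.array_split(ifmap, n_engines, axis=1)       # ifmap: (C, H, W)

def output_partition(filters, n_engines):
    """OP: split the ofmap channels into groups; every engine receives the
    same ifmap but a different group of filters."""
    return np.array_split(filters, n_engines, axis=0)     # filters: (C_out, C_in, K, K)

ifmap   = np.zeros((64, 56, 56), dtype=np.float32)
filters = np.zeros((256, 64, 3, 3), dtype=np.float32)
print([t.shape for t in fmap_partition(ifmap, 4)])        # four (64, 14, 56) tiles
print([g.shape for g in output_partition(filters, 4)])    # four (64, 64, 3, 3) groups
```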
Neurocube [22] and TETRIS [11] are two well-known near-memory NN accelerator architec-
tures based on HMC. Neurocube [22] uses the FP scheme to process all CONV layers. The NN
Engine in each vault can process a part of the NN independently using the memory controller
dedicated to each vault. Because of a unique characteristic of CONV layer, ifmap data that are
required to compute the adjacent ofmaps can be partially overlapped between two vaults. In addi-
tion, the same filter data are delivered 16 times as the replicated data need to be delivered to each
NN Engine via separate TSV channels for each vault.
In contrast, TETRIS [11] uses a hybrid partitioning (HP) scheme to process CONV layers. HP
scheme chooses between FP and OP schemes for each layer to find one that consumes lower energy
to minimize the total energy consumption. According to Reference [11], FP scheme worked better
in earlier layers and OP worked better in deeper layers, because the large size of fmap is the decid-
ing factor in earlier layers, and the size of filters is dominant in the deeper layers [15, 24, 35, 43].

3 NEAR-MEMORY PROCESSING WITH HBM


3.1 Overall Architecture
Figure 3 shows the proposed NN accelerator on the logic die of HBM, consisting of multiple NN
Engines. An NN Engine consists of systolic PE array, global buffer, and control logic for the


elements, which is a typical configuration for NN processing engines. A PE receives operand data
from the adjacent PEs or global buffer every cycle to perform MAC operations. Global buffer is an
intermediate memory between DRAM dies and systolic PE array. Reusing the operand data in the
global buffer reduces the number of DRAM accesses, which occupy most of the energy consump-
tion of the NN accelerator.
We tested three different dataflows (OS, WS, and RS) for an NN Engine. The number of PEs and
the size of a global buffer in an NN Engine are adopted from [6, 21, 29] with appropriate scale. The
number of PEs and the size of a global buffer (for single buffering) of (OS, WS, RS) dataflow is (256,
256, 168) and (64, 96, 108) KB, respectively. We used (8, 16, 8) NN Engines that use (OS, WS, RS)
dataflow model in NN accelerator. The rationale behind selecting the numbers of NN Engines in NN
accelerators is to maximize performance and energy-efficiency and meet the thermal constraint of
the inherent 3D structure (refer to Section 3.3).
In the proposed architecture, we present two data delivery schemes to maximize the benefit from
the centralized TSV connection in the HBM structure. The first scheme, Round-Robin Data Fetching
(Section 3.2), is for the data fetching from DRAM stack to a global buffer residing in each NN
Engine. Second, we present Groupwise broadcast for the data movement from the global buffer to
PEs inside the RS dataflow-based NN Engine (Section 4.1). The groupwise broadcast alleviates the
routing overhead due to the wider bus (2,048 bit) while maintaining high internal data bandwidth.

3.2 Round-robin Data Fetching


3.2.1 Definition. The main difference between HBM and HMC is the location of TSV connec-
tions [17, 19]. To fully exploit the benefit of centralized TSV connections in HBM, we propose
a new data fetching scheme called round-robin data fetching. Note that HMC is divided into 16
vaults, each vault with its own TSVs and memory controller [17]. This may result in out-of-order
data arrival or data congestion in Network-on-Chip (NoC) buffers with HMC [22]. In compar-
ison, there are eight channels for data delivery in HBM, each with 128 b data bus-width and all
the data are communicated through the TSV farm at the center. Therefore, 1,024 b (128b × 8 CH)
data can be fetched on the HBM logic die every strobe and can be accessed by each NN Engine in
a single cycle. The centered TSVs in HBM have 1,024-bit I/O with DDR interface providing 2,048
bit to an NN Engine per cycle (256 GB/s in HBM 2.0) [19, 27, 36].
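As a quick sanity check on these figures (our own arithmetic, assuming a 1-GHz HBM 2.0 interface clock with DDR signaling):

```python
# Sanity check of the HBM 2.0 bandwidth quoted above (assumed 1-GHz interface clock).
io_pins        = 1024             # centralized TSV I/O width in bits
bits_per_cycle = io_pins * 2      # DDR: two transfers per clock -> 2,048 bit/cycle
clock_hz       = 1.0e9            # assumed HBM 2.0 interface clock
bandwidth_GBps = bits_per_cycle * clock_hz / 8 / 1e9
print(bandwidth_GBps)             # 256.0 GB/s, matching the cited peak bandwidth
```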
In the previous HMC-based near-memory NN accelerators, the 1,024-b data from the memory core were evenly partitioned into N_Engine groups and distributed to each vault. An NN Engine that is located in a vault receives (1024 / N_Engine) b each. Let us call this scheme the distributed data fetching. Then, each NN Engine waits until a single execution_set is queued in its own global buffer. Here, N_Engine represents the number of NN Engines implemented on the logic die of HMC. The
size of an execution_set depends on the desired level of data reuse and the layer configuration. With
this scheme, all NN Engines fill in the global buffer at the same rate and start the execution at the
same time, and hence NN Engines need to stall while data are being fed to them.
To reduce the stall problem, we propose to use the round-robin data fetching scheme, with which
1,024-b data are delivered to a single NN Engine at a time. The global buffer in each NN Engine
is connected to 1,024 TSV channels and the connections are controlled by multiplexers (MUXs).
The control logic including the MUXs selects the NN Engine that receives the data from TSV
channels at every strobe to implement the round-robin data delivery. In this scheme, an NN Engine starts computing when its global buffer receives a sufficient amount of data, equal to the size of an execution_set. At the same time, delivering 1,024-b data to the next NN Engine
starts. Because an NN Engine can receive the data while other NN Engines are doing computation,
the data load overhead can be mostly hidden. The MUXs are required to implement the round-robin
data fetching, but the additional logic devices for MUXs are much smaller than the NN Engines.
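A minimal sketch of the MUX control implied by this scheme is shown below: the centralized bus is steered to one NN Engine until that engine's execution_set is buffered, and then moves on to the next engine. The function and parameter names are ours.

```python
import itertools

def round_robin_schedule(n_engines, strobes_per_set):
    """Yield (strobe, engine) pairs: the MUXes steer the centralized TSV bus to
    one NN Engine for an entire execution_set, then advance to the next engine
    while the previous one starts computing on its freshly filled buffer."""
    strobe = 0
    while True:
        yield strobe, (strobe // strobes_per_set) % n_engines
        strobe += 1

# E.g., 8 NN Engines and 4 strobes per execution_set: engine 0 is fed during
# strobes 0-3, engine 1 during strobes 4-7, and so on.
for strobe, engine in itertools.islice(round_robin_schedule(8, 4), 10):
    print(strobe, engine)
```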


Fig. 4. Timing diagram for processing the third layer of AlexNet [24] by using Distributed and Round-Robin
data fetching with (a) OS, (b) WS, and (c) RS dataflow model.

Note that double buffering can also be used to hide the cycles for fetching data from exter-
nal memory to NN Engines [31]. However, the double buffering typically incurs significant area
overhead. The proposed round-robin data scheme with single buffering can achieve comparable
performance to double-buffering while avoiding area overhead. Detailed comparison between the
round-robin fetching with single buffering and double buffering are given in Section 6.1.1.
3.2.2 Evaluation. For analysis, we divide the instruction stages for processing NN in an NN
Engine into three stages: “DRAM to Engine (A),” “Engine Processing (B),” and “Engine to DRAM
(C).” In the A stage, a global buffer in an NN Engine receives operands required for convolutions
from DRAM. If the global buffer receives all the data for an execution_set, then the NN Engine
enters into B stage. In the B stage, PEs perform MAC operations with the operands received from
the global buffer and send the output to the global buffer. In the C stage, the output data stored
in the global buffer are sent to DRAM. When C stage is finished, the NN Engine becomes idle and
waits for next required operands to be fetched from DRAM.
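The effect of the two fetching schemes on these stages can be reproduced qualitatively with a small stage-level timing model; the model below and its cycle counts are our own simplification (single buffering, C-stage bus contention ignored), not the simulator used in Section 5.

```python
def finish_cycles(n_engines, n_sets, a, b, c, round_robin=True):
    """a: A-stage cycles per execution_set over the full-width bus,
    b: B-stage (compute) cycles, c: C-stage cycles over the full-width bus.
    Distributed fetching gives each engine only 1/n of the bus width."""
    if not round_robin:
        # single buffering: every engine stalls through its narrow A and C stages
        return n_sets * (n_engines * a + b + n_engines * c)
    bus_free, engine_free = 0, [0] * n_engines
    for _ in range(n_sets):
        for e in range(n_engines):
            start = max(bus_free, engine_free[e])  # needs the bus and an empty buffer
            bus_free = start + a                   # bus occupied during the A stage
            engine_free[e] = bus_free + b + c      # then B and C stages locally
    return max(engine_free)

# Placeholder per-set cycle counts in the spirit of Figure 4(a):
print(finish_cycles(8, 100, a=189, b=1165, c=2, round_robin=False))
print(finish_cycles(8, 100, a=189, b=1165, c=2, round_robin=True))
```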
For a quantitative analysis, we simulated a system with eight NN Engines processing the third
CONV layer in Alexnet [24] using three different dataflow models: OS, WS, and RS. We compared
the distributed data fetching and the round-robin data fetching for those dataflow models in terms
of processing time and the results are provided in Figure 4.
OS dataflow model. Each PE with OS dataflow model stores a psum value for computation
of one ofmap pixel and receives one ifmap and one filter values for accumulation on the psum
every cycle. Figure 4(a) shows the runtime comparison between the distributed fetching case and
the round-robin fetching case for OS dataflow. In the distributed fetching case, all the NN Engines
consume the same number of clock cycles as they start computing at the same time after receiving
the entire execution_set for NN Engines. In contrast, in the round-robin fetching case, an NN Engine
starts to receive the operands when the previous NN Engine finishes receiving the corresponding
execution_set and each NN Engine immediately starts computing without waiting for other NN
Engines to receive data.
In this example, it is observed that the B stage requires 1,165 cycles for both data fetching
schemes as the architecture of an NN Engine for both schemes remains the same. However, the A
stage consumes 1,512 cycles in the distributed case while it takes only 189 cycles in the round-robin
case. The C stage requires 13 and 2 cycles with distributed and round-robin data fetching schemes,
respectively. The C stage is not clearly seen in Figure 4(a), because it takes negligible time compared
to the other stages. In summary, in the OS dataflow case, the round-robin data fetching scheme


Fig. 5. NoC for data delivery between vaults with round-robin data fetching in HMC assuming that N_Engine is (a) 16 and (b) 8. The number in the boxes indicates the number of adjacent vaults that can provide the data into the vault every hop.

shows 45.25% improvement (from 218,868 to 119,841) in terms of runtime compared to the dis-
tributed fetching case thanks to the reduced cycles for feeding the data from DRAM to NN Engines.
WS dataflow model. In WS dataflow, a filter weight value is stored in each PE and the PE
receives an ifmap and a psum every cycle to compute MAC operation. As shown in Figure 4(b),
WS dataflow shows similar trend to OS dataflow as feeding the data from DRAM to NN Engines
consumes substantial cycles and the round-robin case requires much smaller cycles than the dis-
tributed case for the A and C stages.
RS dataflow model. In RS dataflow, each PE receives a part of rows for ifmap, filter, and psum
values during multiple cycles and computes the psum values using the received ifmap and filter
values. After that, the dataflow locally reuses the part of ifmap and filter data and receives new
ifmap data to compute other psum values. Compared to OS and WS dataflow models, RS dataflow
reuses ifmap and filter data more extensively at the local register file level. Due to the local reuse
characteristics, B stage takes much longer time than A stage in RS dataflow different from OS and
WS dataflow models. As a result, the runtime difference between distributed and round-robin cases
in RS dataflow is much smaller than that of OS or WS dataflow shown in Figure 4(c).

From the evaluation, we can see that the proposed round-robin fetching is very effective for
OS and WS dataflow in HBM-based near-memory processing of CNNs. However, the benefit of
reducing the cycles for A stage with round-robin fetching is relatively small in RS dataflow, because
input data reuse is heavily exploited in the dataflow to make the system compute bound rather than
memory bound. This observation implies that near-memory processing of RS dataflow is not as
attractive as that of OS and WS dataflow, since DRAM communication overhead is relatively small
in RS dataflow with a large size local register within a PE. To make the near-memory processing
of RS dataflow more efficient, we propose a methodology for improving data movement inside the
NN Engine (explained in Section 4).
3.2.3 Round-Robin Data Fetching on HMC. In the previous section, we discussed the advantage
of the round-robin data fetching scheme over the distributed data fetching scheme. One can won-
der whether such an assessment is valid for near-memory NN processors with any 3D-memories
including the HMC-based one. Here, we argue that the round-robin data fetching scheme has a
merit with the centralized TSV channel as in HBM but not with the distributed TSV channel as in
HMC.
Unlike the HBM in which all the data delivered through TSV can be transferred to an NN Engine
each cycle, the HMC requires multi-cycle NoC routing to transfer the data from a vault to an NN


Fig. 6. Relative runtime for processing five representative CNN algorithms with different numbers of NN
Engines.

Engine in another vault. Let us assume that each vault has 1 NN Engine and a NoC router with
the 4 × 4 2D mesh network for the 16 vaults in HMC memory [22]. Typically, it takes 1 cycle for data to hop from a router to an adjacent router. In general, data in a specific vault can be transferred to adjacent vaults in four directions. More precisely, the number of adjacent vaults in the real configuration varies from 2 to 4 depending on the position of the vault, as shown in Figure 5(a). As a result, it takes 4 to 8 cycles to deliver the data from the other 15 vaults to one vault using the round-robin data fetch scheme in the HMC case.
If the number of NN Engines is set to 8, then the number of adjacent vaults varies from 3 to 4
and it takes 4 or 5 cycles to feed an NN Engine from other vaults (Figure 5(b)). The NN Engine (A)
(colored in dark blue in Figure 5(b)) has 5 adjacent vaults but it can only receive the data from 4
vaults (solid lines), because the vault in the lower-right corner (V15) cannot feed the NN Engine
(A). The vault V15 needs to receive the data from the vault V11 to feed the NN Engine (A) but it
cannot have the new data values while the vault V11 feeds the NN Engine (A). Hence, the proposed
round-robin data fetching can be made effective only when the TSV farm can deliver the data to
any location on the logic die of the 3D memory as quickly as possible as in the HBM case. Detailed
simulation results for the comparison between HBM and HMC in terms of the effectiveness of
round-robin data fetching are given in Section 6.3.
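The 4-to-8-cycle figure follows from the number of mesh links entering the destination vault; a small check of this argument (our own model, one datum per incoming link per cycle) is shown below.

```python
import math

def gather_cycles(dest, rows=4, cols=4):
    """Cycles to stream data from all other vaults into vault `dest` of a 4x4
    mesh, limited by the number of links entering that vault (Figure 5(a))."""
    r, c = divmod(dest, cols)
    links = sum(0 <= r + dr < rows and 0 <= c + dc < cols
                for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)])
    return math.ceil((rows * cols - 1) / links)

print(sorted({gather_cycles(v) for v in range(16)}))  # [4, 5, 8]: 4 to 8 cycles
```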

3.3 Design Space Exploration


The number of NN Engines is a critical design parameter in maximizing system performance. In-
creasing the number of NN Engines improves the parallelism of the near-memory NN accelerator.
To allow similar level of data reuse with more NN Engines, the size of register files and global
buffers needs to be scaled accordingly. Larger on-chip memory, however, naturally increases the
area and static power consumption. Thus, it is important to find the optimal number of NN En-
gines in terms of energy consumption to operate an accelerator with the maximum performance
under thermal bound of a 3D-stacked DRAM [10, 22].
Figure 6 shows the average of relative runtime in processing various CNN algorithms with
different numbers of NN Engines using three different dataflow models (OS, WS, RS). Among the
tested partitioning schemes (FP, OP, HP), we show the data using the HP scheme as it gave us the
highest performance. The simulation assumes that operands are being fetched from DRAM using
round-robin data fetching providing 2,048-bit/cycle to the global buffer when requested.
The runtime to process a CNN algorithm is reduced by increasing the number of NN Engines.
The additional speed-up by doubling up the NN Engines decreases as the number of NN Engines
increases (Figure 6). This is mainly due to the limited bandwidth provided by the external memory.
As we move from 2 to 4 NN Engines, the system performance increases 2.0× as PEs are fully
utilized. When the number of NN Engines exceeds 8 (or 16), however, the memory bandwidth
cannot catch up with the data request rate from all NN Engines using OS and RS (or WS) dataflow.


Fig. 7. Two methods of delivering data in an NN Engine: (a) Multicast [6] and (b) Groupwise Broadcast. The
number in each PE represents an ID of the data that the PE requires.

Then, the speed-up becomes limited and the performance improvement from doubling up the NN Engines becomes less than 1.2× beyond a certain number of NN Engines.
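The saturation can be summarized with a roofline-style bound, which is our own simplification rather than the simulation model: speed-up grows linearly with the number of NN Engines until their aggregate data demand reaches the 2,048-bit/cycle interface.

```python
def ideal_speedup(n_engines, demand_bits_per_cycle_per_engine, bus_bits=2048):
    """Idealized speed-up over one engine: compute-bound (linear in engines)
    until the aggregate demand hits the external memory bandwidth."""
    return min(float(n_engines), bus_bits / demand_bits_per_cycle_per_engine)

# If each engine needs ~256 bit/cycle on average, scaling saturates beyond 8
# engines; a dataflow needing ~128 bit/cycle would saturate beyond 16.
print([ideal_speedup(n, 256) for n in (2, 4, 8, 16, 32)])  # [2.0, 4.0, 8.0, 8.0, 8.0]
```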
For power analysis, we increased the size of global buffers and register files proportional to the
increase in the number of PEs, not exceeding the area constraint imposed by HBM logic die [27].
As the size of the on-chip memory as well as PE array increases, the total power consumption is
expected to increase. Meanwhile, it is important to make sure that the total power consumption of
3D near-memory computing system does not exceed the Thermal Design Point (TDP). As HBM
has a 3D structure, the thermal bound must be always considered [10, 22]. A research work from
AMD simulates a wide range of thermal environments on a PIM-based architecture [10]. With a passive heat sink, the maximum power that a logic die can consume was reported to be 8.5 W while not exceeding 85°C. The logic die with (8, 16, 8) NN Engines using the (OS, WS, RS) dataflow models consumes (4.1, 7.2, 4.9) W, respectively, and these power numbers do not exceed the TDP (8.5 W).
Backed up by the performance-power-thermal analysis, we selected the number of NN Engines in
(OS, WS, RS) dataflow model to be (8, 16, 8) in our proposed system.

4 DATAFLOW IN AN NN ENGINE
In the NN Engines using the OS and WS dataflow models, every PE performs a single MAC operation each cycle with a pair of operands received in that cycle. However, in the NN Engine using the RS dataflow model, PEs perform a set of MAC operations over several cycles. PEs first receive a set of operands over multiple cycles, and then perform the corresponding MAC operations by reusing the given set of operands. In other words, the performance of processing CNNs with the RS dataflow model can be improved by reducing the cycles spent fetching the pairs of operands from the global buffer to all PEs within the NN Engine (i.e., by increasing the internal bandwidth). In this section, we propose Groupwise Broadcast, a new data fetching scheme from the global buffer to PEs for the RS dataflow.

4.1 Multicast vs. Groupwise Broadcast


4.1.1 Multicast. Eyeriss, which uses 2D DRAM as an external memory, delivers data to PEs in
multicast manner [6]. Figure 7(a) shows a simple example of multicast with 256-bit wires connected
to all 4 × 3 PEs. With multicast, data concatenated with an ID are delivered to all PEs (in an NN
Engine), and each PE receives the data if the ID matches with the PE’s own ID. In Eyeriss with 2D
DRAM, 144-bit data are connected to all 168 PEs in a PE array (= an NN Engine without a global
buffer) [6]. An RS-dataflow design such as Eyeriss consumes a large number of cycles processing CNNs within an NN Engine. Therefore, the processing time can be reduced by increasing the bit-width of the data fetched from the on-chip global buffer to the PEs within the NN Engine. Figure 8 shows that 2,048-b multicast
has 19% improvement in runtime over 256-b multicast case. However, to enable the operation of
RS dataflow with high bit-width (2,048-bit) data fetching, the "2,048-bit data + ID" bundle should be connected


Fig. 8. Runtime comparison between 256-b and 2,048-b multicast. Runtime for 2,048-b groupwise broadcast
is also shown.

Fig. 9. An example of dataflow in an NN Engine with control signals (C[0 : 3] = 1010) assuming the filter
size of 5 × 5.

with more than 2,048-bit wires to receive the data. This incurs area inefficiency as well as routing
overhead in the hardware.
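A behavioral sketch of the ID-matched multicast described above is given below; the word format and function names are our own illustration.

```python
def multicast(bus_words, pe_ids):
    """Multicast as in Figure 7(a): every (ID, data) word on the shared wide bus
    is seen by all PEs, and a PE latches a word only if the ID matches its own.
    Returns the list of data words accepted by each PE."""
    received = {pe: [] for pe in pe_ids}
    for word_id, data in bus_words:          # every word reaches every PE
        for pe in pe_ids:
            if pe == word_id:                # per-PE ID comparator
                received[pe].append(data)
    return received

bus = [(0, 0xA), (1, 0xB), (0, 0xC), (2, 0xD)]
print(multicast(bus, pe_ids=range(4)))       # PE 3 receives nothing
```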
We compared the area with two different sizes of bit-width connecting the global buffer to PEs:
256 bit and 2,048 bit. Then, we estimated the area of the PEs using Synopsys PrimeTime after
synthesizing the PEs with 28-nm CMOS technology [37]. We also considered the minimum space
between metal lines for routing by using the design rules of 28-nm CMOS technology. The wiring
overhead with 256-bit multicast was 10.4% and that with 2,048 bit was 83.2% compared to the area
of PEs. The area overhead with 2,048-bit wiring over 256-bit wiring becomes 60.3%. Therefore, it
is not practical to deliver high bandwidth (2,048-bit) data with multicast manner and alternative
scheme with less area overhead is needed.

4.1.2 Groupwise Broadcast. Rather than connecting all PEs with high-bandwidth bus, we can
partition the data into multiple groups and use a low-bandwidth bus for area/power efficiency. We call this data delivery scheme from the global buffer to the PEs groupwise broadcast. Each group of partitioned
data can be assigned to PEs that require the same data. By using the proposed groupwise broadcast,
the wires connected to each PE can be significantly reduced.
Figure 7(b) shows the proposed groupwise broadcast in which 64-bit-wide bus, instead of
256 bit, is connected to each group of 1 × 3 PEs. After reading 256-bit data from the global buffer,
the data are partitioned into four groups of 64-bit data. Then each partitioned datum is delivered to
each PE group without comparing IDs. This eliminates the hardware controller for comparing IDs
as well. Using the groupwise broadcast, larger data can be efficiently delivered to multiple PEs as


Fig. 10. The mappings of ifmap data when convolution stride = 2 with (a) multicast and (b) groupwise broad-
cast. The filter data for convolution should be assigned in an interleaved fashion for the correct operation.

small groups. Also, we can eliminate the unnecessary data movement when the data are delivered
to the PEs having different operation IDs.
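A minimal sketch of this delivery is shown below: one 256-bit word read from the global buffer is split into four 64-bit slices, and slice g is hard-wired to every PE in group g with no ID comparison (the word widths follow Figure 7(b); the code organization is our own).

```python
def groupwise_broadcast(word_256b, n_groups=4):
    """Split a 256-bit global-buffer word into four 64-bit slices; slice g is
    broadcast to all PEs of group g over a fixed, pre-routed 64-bit bus."""
    group_width = 256 // n_groups                      # 64 bits per group
    mask = (1 << group_width) - 1
    return [(word_256b >> (g * group_width)) & mask for g in range(n_groups)]

word = int.from_bytes(bytes(range(32)), "little")      # a 256-bit test word
for g, data in enumerate(groupwise_broadcast(word)):
    print(f"PE group {g} receives 0x{data:016x}")      # no ID matching needed
```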
Figure 9 shows the dataflow and the corresponding bus topology for delivering ifmap, filter and
bias data to PEs. To use the groupwise broadcast, the wires for delivering data to the PEs are pre-
determined. This is possible as we fix the PE array size to 3 × 14 instead of dynamically changing it.
The row size of PE array is related to the row size of filter, because the psum values computed on
the PEs should be accumulated in vertical direction in RS dataflow model [6]. We set the row size of
PE as 3, because the row size of filters in most of the layers of the state-of-the-art CNN algorithms
is 3 [15, 24, 35, 43]. We assumed that the register file attached to each PE has a size similar to that in the Eyeriss architecture [6]. With these system configurations, we have selected the bit-widths for data fetching as (ifmap : filter : bias) = (256 : 768 : 1024) bit.
In Figure 9, if the row size of filter does not exceed 3, then all four PE arrays receive the same
ifmap data (kernel-level parallelism). If the row size of filter exceeds 3, then multiple PE arrays
are utilized to complete a 2D convolution operation at the same time. For instance, two PE arrays
are used to handle 5 × 5 convolution kernels. Then, one of two PE arrays selectively accepts the
required ifmap data by using a cluster of AND gates. The data selection is performed by a 4-bit
signal controlling the operation of four AND gate clusters.
In Figure 9, each operand is colored differently for the illustration purpose. To deliver data using
groupwise broadcast, we make groups of PEs that should receive the same data within an NN
Engine. For ifmap data, the PEs located on the diagonal line form a group. For filter (or bias/psum)
data, the PEs on the horizontal line form a group. Note that the bias/psum data are delivered to the
bottom PEs of the PE array, and the remaining top rows do not get any bias/psum data. Psums are
collected at the top row of each PE array and stored back to the global buffer for the next operation.

4.2 Additional Features for Groupwise Broadcast


If the data read from the global buffer are delivered to PEs in a deterministic manner, then we
may lose the flexibility in processing a wide range of layer configurations. In this sub-section, we present methods to recover the flexibility that is slightly sacrificed by groupwise broadcast.
If the row size of PE array is less than that of filter, then one PE array cannot complete a single
convolution by itself. An example is shown in Figure 9. To solve this problem, we added small
register files (14 × 16 × 16 bit) and adders near the global buffer to accumulate the psum values
prior to storing the results. If the size of PE array is 3 × 14 and the filter size is 5 × 5, then we should
use two PE arrays to process the entire filter. Then, the psum values computed from two PE arrays
need accumulation, which can be done by the adders near the global buffer. The size of the register
files (0.44 KB) used for this approach is negligible compared to that of the global buffer (108 KB).
In addition, the small number of additional adders causes only minor overhead in terms of area and power consumption.


Another challenge in using groupwise broadcast is in handling various stride values. If the stride
of layer is 1, then ifmap or filter data can be delivered in the same way as Eyeriss. When the stride
of layer becomes 2 or more, however, we need a slight modification in delivering data to PEs.
Figure 10(a) shows the 5 × 4 PE array processing a CONV layer with 5 × 5 kernel by stride = 2
using multicast. The number in each PE represents the ID concatenated to the ifmap data delivered
to the PE array. With stride = 2, the ifmap data that have the same ID in the PE array exist in every other row. Such a data configuration cannot be produced by groupwise broadcast as-is, because PEs on the same diagonal line are hardwired to receive the same ifmap data.
If the PEs in odd rows with red boxes in Figure 10(a) are grouped, then the PEs in the diagonal
line will have the same ID (Figure 10(b)). If we permute the row orders in Figure 10(a) to the one in
Figure 10(b), then the correct convolution result with stride > 1 can be computed. Here, we need
to change the row order of the filter data as well so that the proper ifmap-filter pairs are assigned
to PEs.
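The permutation can be sketched as an index shuffle applied identically to the PE-row ordering and to the filter rows (a minimal illustration of Figure 10 with our own indexing):

```python
def interleaved_row_order(n_rows, stride=2):
    """Row permutation for stride > 1: take rows 0, stride, 2*stride, ... first,
    then rows 1, 1+stride, ...  For 5 rows and stride 2 this yields
    [0, 2, 4, 1, 3], matching the regrouping in Figure 10(b)."""
    return [r for phase in range(stride) for r in range(phase, n_rows, stride)]

order = interleaved_row_order(5, stride=2)
print(order)                               # [0, 2, 4, 1, 3]

# The same permutation is applied to the filter rows so that the proper
# ifmap-filter row pairs still meet in the same PE rows.
filter_rows = ["w0", "w1", "w2", "w3", "w4"]
print([filter_rows[r] for r in order])     # ['w0', 'w2', 'w4', 'w1', 'w3']
```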

5 EXPERIMENTAL METHODOLOGY
Workloads. We used various representative CNNs to evaluate the proposed architectures:
AlexNet, ZFNet, VGG-16, VGG-19, and ResNet [15, 24, 35, 43]. The ImageNet dataset has been
used for testing all the networks [9].
Hardware Analysis. We used a 28-nm CMOS technology for evaluation of the NN accelerators
on the logic die in HBM. The NN Engines for each dataflow were synthesized to run at 900MHz.
The areas of register file and global buffer with different capacities were modeled using CACTI-P
[28]. The computation and control logic for PEs excluding the register file and global buffer were
synthesized using Synopsys Design Compiler [38]. Area, critical path delay, and power consump-
tion of the computation and control logic were measured using the Synopsys PrimeTime [37].
We used three different architectures using OS, WS, and RS dataflow models as an NN Engine.
The number of PEs and the size of a global buffer in an NN Engine are adopted from References [6,
21, 29] with appropriate scale. The number of PEs and the size of a global buffer (single buffering)
of architectures using (OS, WS) dataflow model is (256, 256) and (64, 96) KB, respectively. When
the same area budget was given for an NN Engine using RS dataflow with both multicast and
groupwise broadcast schemes, the NN Engine with multicast scheme includes one 14 × 14 PE
array, global buffer (108 KB), and control logic. In comparison, a single NN Engine with groupwise
broadcast scheme consists of four 3 × 14 PE arrays, a global buffer (108 KB) and control logic.
For the same area budget, the baseline (multicast) design has 28 more PEs than the proposed (groupwise broadcast) design, because the proposed design receives a larger amount of data at once than the baseline, which requires a larger area for the larger number of flip-flops.
Performance and Energy Consumption. We obtained the maximum operating frequency by
measuring the critical path delay of the PE logic with Synopsys PrimeTime and the latency for ac-
cess to register file and global buffer with CACTI-P. The total runtime was obtained by multiplying
clock period by the number of cycles for total operations measured from the simulation.
Total energy consumption was obtained by adding the energy consumption measured from
several components as follows [11]. “PE Dynamic” represents the energy consumption for the
dynamic operation of logic in NN Engine except for register files and global buffers. “TSV-NN En-
gine” represents the energy consumed for fetching data from TSV buffer on the HBM logic die
to NN Engines. The power consumption for “PE Dynamic” and “TSV-NN Engine” was measured
with Synopsys PrimeTime, and energy consumption was calculated by multiplying the power con-
sumption by the runtime. “Total Static” indicates the energy consumption due to static leakage
current in the NN accelerator. Static power consumption for the register file and global buffer was


Fig. 11. Runtime for processing CNNs with round-robin fetching applied for OS, WS, RS dataflows. The
runtime is normalized to the ones using distributed data fetching.

Fig. 12. Runtime comparison between single and double buffering for distributed and round-robin data fetch-
ing schemes.

measured using CACTI-P and static power for PE logic was measured with Synopsys PrimeTime.
“RF/GB Dynamic” indicates the energy consumption due to read/write from/to register files and
global buffers. “RF/GB Dynamic” was measured with CACTI-P. “Memory Dynamic” represents
the energy consumption due to read/write from/to DRAM stacks (HBM) used as external memory.
“Memory Dynamic” value was adopted from Reference [30].
Comparison. We compared the runtime and energy consumption for processing CNNs with
the architectures using round-robin data fetching in HBM with distributed data fetching and HP
schemes. Also, we compared the runtime and area for processing NNs with architectures using
single and double buffering.
We assumed that the simulation results for the distributed data fetching cases on HBM are the same as the results for distributed data fetching on HMC-based NN processing. For distributed data fetching in HBM-based NN processing, each NN Engine is connected to (1024 / N_Engine) TSV channels independently. It has the same structure as the NN accelerator on HMC with separate vaults. With HMC, however, there exists a NoC burden for delivering the overlapped input data between adjacent vaults on the logic (base) die. If input data are duplicated across the adjacent vaults on HMC, then the NoC burden can be eliminated [22]. Therefore, for a fair comparison between HBM and HMC, we assumed that near-memory processing with HMC used the input duplication scheme.

6 EVALUATION
6.1 Performance Analysis
6.1.1 Round-Robin Data Fetching. Figure 11 compares the round-robin data fetching with dis-
tributed data fetching in terms of the runtime for processing five state-of-the-art CNNs [15, 24, 35,
43] with three dataflow models. Applying round-robin data fetching reduces the runtime of (OS,
WS, RS) dataflows by (39.3, 37.5, 3.2)% compared to distributed data fetching scheme.


Fig. 13. Runtime comparison between multicast (256 b) and groupwise broadcast (2,048 b) with round-robin
fetch (Normalized to multicast (256 b) with distributed fetch case).

As explained in the Section 3.2.2, round-robin data fetching improves the performance of OS
and WS substantially, because operands required for processing CNN are delivered to NN Engine
faster, so that each Engine can start the operations earlier than the case with the distributed data
fetching scheme.
Compared to the other dataflow models, the RS dataflow model has much less improvement with
the round-robin data fetching scheme (Figure 11), because internal processing time is dominant in
RS dataflow with heavy local reuse [6]. The RS dataflow model focuses on reusing the operands
in a large size of register files on a PE, so it consumes most of the cycles for processing CNNs
within the NN Engine, not fetching data from external memory. The round-robin data fetching
scheme reduces the cycles for fetching data from/to external memory to NN Engine. Therefore, the
speedup of the architectures using RS dataflow model with round-robin data fetching is relatively
small compared to other dataflow models.
Meanwhile, double buffering is another scheme that hides the cycles for fetching data from
external memory to Engines [31]. Therefore, we compared the runtime between single buffering
and double buffering for different dataflows and data fetching schemes. As shown in Figure 12,
the proposed round-robin data fetching scheme with single buffering can produce almost the same performance as the distributed data fetching with double buffering. The runtime difference is
only (0.8, 0.5, 0.4)% for (OS, WS, RS) dataflows, respectively. An important consideration is that the
architectures using double buffering with (OS, WS, RS) dataflow models require (21.3, 46.4, 13.3)%
larger area than the architectures using single buffering. Because the double buffering requires
large area overhead, we argue that round-robin data fetching with single buffering is the better
solution than distributed data fetching with double buffering.

6.1.2 Groupwise Broadcast. Figure 13 compares the runtime between the architecture with RS
dataflow using groupwise broadcast scheme and multicast scheme. As shown in Figure 13, with
the groupwise broadcast scheme, the runtime is reduced by 13.5% compared to the design with
the multicast scheme. The main reason for the improvement in runtime is that the PEs can initiate
CONV operations earlier by fast delivery of operands with wider local bandwidth with groupwise
broadcast than with multicast. This result indicates that increasing the local internal bandwidth is
important if RS dataflow is adopted in HBM-like 3D memory based near-memory NN processing.

6.2 Energy Consumption


Figure 14 compares the energy consumption between round-robin data fetching and distributed
data fetching and Figure 15 compares the energy consumption between groupwise broadcast and
multicast schemes in RS dataflow. As shown in Figure 14 and Figure 15, the energy consumption
has a similar tendency to the runtime for processing CNNs. With the round-robin data fetching
scheme, the total energy consumption was reduced by (2.1, 5.1, 1.7)% with (OS, WS, RS) dataflow


Fig. 14. Energy breakdown with two data fetching schemes: distributed and round-robin.

Fig. 15. Energy consumption breakdown for multicast and groupwise broadcast scheme for RS dataflow.

Fig. 16. Runtime comparison of Round-Robin data fetching scheme with architectures using three dataflow
models in HBM and HMC (Normalized to Distributed data fetching case).

model compared to the distributed data fetching case. With the groupwise broadcast scheme, the
total energy consumption was reduced by 3.1% compared to the multicast scheme case. The reduc-
tion of the total energy consumption almost coincides with the reduction of the energy consumption due to static power (Total Static), as the dynamic energy is almost the same. The energy consumption due to static power is reduced because of the reduced runtime.


6.3 Effective Data Fetching on HBM vs. HMC


For the quantitative analysis of the effectiveness of the round-robin data fetching on HBM and
HMC, we simulated the three different dataflows with the same hardware configurations (Fig-
ure 16). On HBM-based design, the runtime with round-robin data fetching is ∼40% smaller than
distributed data fetching. In contrast, distributed fetching is ∼2× faster than round-robin fetching
on HMC-based design due to the heavy burden in NoC routing for inter-vault data transfer in
HMC, as discussed in Section 3.2.3. The above assessment holds for OS and WS dataflows, but the performance of RS dataflow is relatively insensitive to the data fetching scheme, as the compute time rather than the data fetching time is dominant in RS dataflow. The results suggest that, for 3D-
memory based near-memory processing, (1) OS/WS dataflows are more suitable than RS dataflow,
and (2) round-robin data fetching is more efficient on HBM and distributed data fetching is more
efficient on HMC.

7 DISCUSSION
7.1 Scalability of NN Accelerators on HBM
With a larger die size of HBM, the overall throughput is expected to increase with the larger number of PEs. However, a higher bandwidth of HBM is also required to increase the overall throughput, because an increased number of PEs without any change in bandwidth causes the accelerator to become memory-bound. HBM2 achieves a higher peak bandwidth and a larger die size compared to HBM1 [7]. The higher peak bandwidth is achieved with a faster pin speed with the same number of TSV channels. In terms of scalability, we focus on the structure of the NN accelerator using the round-robin data fetching with a higher peak bandwidth and a larger die size.
If the peak bandwidth increases, then the number of cycles consumed by data fetching from
DRAM to NN Engines decreases. In short, the accelerator becomes compute-bound. The number
of NN Engines should be increased until cycles for data fetching from DRAM and NN processing on
PE arrays are balanced. The increased number of NN Engines does not affect the implementation of the round-robin data fetching scheme; it only requires additional connections between the TSV channels and the
additional NN Engines. In addition to the performance of the accelerators, the power budget for
thermal constraints and die size should be considered to decide the number of NN Engines [10].
With higher peak bandwidth of HBM, the NN accelerator can be implemented with an increased
number of NN Engines within the power and area budget.

7.2 Round-robin Data Fetching Applied to Other Networks


Our experimental results in Section 6 show the performance improvement of our proposals on large representative networks. Meanwhile, MobileNets are garnering interest thanks to their reduced number of parameters and operations for mobile platforms [16]. A main characteristic of the MobileNet structure is the depthwise separable convolution, which divides a general convolution into a depthwise convolution and a pointwise convolution. The filters for a depthwise convolutional layer are 2-dimensional and applied to only one corresponding channel. The pointwise convolutional layer is equivalent to a general convolutional layer with a 1×1 filter.
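For reference, a compact NumPy sketch of a depthwise separable convolution (a per-channel 2D depthwise filter followed by a 1×1 pointwise convolution) is given below; the shapes, 'valid' padding, and unit stride are our own simplifications.

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """x: (C, H, W); dw_filters: (C, K, K), one 2D filter per input channel;
    pw_filters: (C_out, C), the 1x1 pointwise filters.  No bias/activation."""
    C, H, W = x.shape
    K = dw_filters.shape[-1]
    Ho, Wo = H - K + 1, W - K + 1
    dw_out = np.zeros((C, Ho, Wo), dtype=x.dtype)
    for c in range(C):                    # depthwise: channels stay separate
        for i in range(Ho):
            for j in range(Wo):
                dw_out[c, i, j] = np.sum(x[c, i:i+K, j:j+K] * dw_filters[c])
    # pointwise: a 1x1 convolution = per-pixel matrix multiply across channels
    return np.tensordot(pw_filters, dw_out, axes=([1], [0]))   # (C_out, Ho, Wo)

y = depthwise_separable_conv(np.random.rand(16, 8, 8),
                             np.random.rand(16, 3, 3),
                             np.random.rand(32, 16))
print(y.shape)   # (32, 6, 6)
```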
We measured the relative runtime for processing MobileNetV3 with the accelerators using WS
dataflow model, as shown in Figure 17. The convolutional layers (C1, C17) require a general con-
volution, so there is no difference from other large networks. We analyzed the depthwise and
pointwise convolutions of other layers (C2–C16), separately.
The results show that we can reduce the runtime with round-robin data fetching in the pointwise
convolution layers, similar to other large networks. However, there is little runtime difference between the two data fetching schemes in the depthwise convolution layers, except for the C2 layer. This is


Fig. 17. Relative runtime for processing MobileNetV3 [16] with the accelerator using round-robin data fetch-
ing scheme. The runtime was normalized to distributed case.

because the amount of data required to process a depthwise convolution layer is small enough to be stored in the global buffer of an NN Engine. Combining the effects of the depthwise and pointwise convolutions, the overall runtime was reduced by 24.9% with round-robin data fetching.
Meanwhile, accelerators for processing sparse NNs have been studied with a focus on workload balancing among a large number of PEs [13, 14, 26, 45]. The sparse NN accelerators deal with the data movement between the global buffer and the processing core circuits, while the proposed round-robin data fetching scheme deals with the data transfer between external DRAM and the global buffer. Therefore, the proposed data fetching scheme can be applied to both sparse NN processing and dense NN processing.

8 CONCLUSION
We started this work with two observations: (1) all previous papers on near-memory NN acceler-
ators were about designs based on HMC, and (2) HBM is enjoying more commercial success than
HMC. The question that naturally arose was whether the design philosophy that works best for HMC-
based NN accelerators is also the best for an HBM-based accelerator. We identified that the cen-
tralized TSV farm in HBM provides a unique opportunity to increase the data feeding rate compared
to the separate TSV channels used for each vault in HMC. Utilizing the centralized TSV connections
in HBM, we proposed round-robin data fetching for efficient data movement from DRAM to the
NN Engines to increase the performance of the system. In HMC, data at different vaults must travel over
multiple hops within the NoC to arrive at a target vault, which keeps HMC from adopting round-robin data
fetching. We also proposed the groupwise broadcast scheme to address the routing overhead inside
the NN Engine, which occurs when high bit-width multicast is used for the RS dataflow. Compared
to NN accelerators with conventional distributed data fetching and low data-bit multicast, the proposed
schemes reduce the runtime by 16.4–39.3% and the energy consumption by 2.1–5.1% on average.

REFERENCES
[1] [n.d.]. NVIDIA Turing Architecture In-Depth. Retrieved December 7, 2018 from https://devblogs.nvidia.com/nvidia-
turing-architecture-in-depth/.
[2] [n.d.]. Reinventing Memory Technology. Retrieved December 7, 2018 from https://www.amd.com/en/technologies/
hbm/.
[3] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. 2017. Extremely large minibatch SGD: Training ResNet-50 on Ima-
geNet in 15 minutes. arXiv:1711.04325. Retrieved from https://arxiv.org/abs/1711.04325.
[4] Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. 2017. Neurostream: Scalable and energy efficient deep
learning with smart memory cubes. IEEE Trans. Parallel Distrib. Syst. 29, 2 (2017), 420–434.
[5] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun,
and Olivier Temam. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM
International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
[6] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks. IEEE J. Solid-State Circ. 52, 1 (2017), 127–138.


[7] Jin Hee Cho, Jihwan Kim, Woo Young Lee, Dong Uk Lee, Tae Kyun Kim, Heat Bit Park, Chunseok Jeong, Myeong-Jae
Park, Seung Geun Baek, Seokwoo Choi, Byung Kuk Yoon, Young Jae Choi, Kyo Yun Lee, Daeyong Shim, Jonghoon
Oh, Jinkook Kim, and Seok-Hee Lee. 2018. A 1.2 V 64Gb 341GB/s HBM2 stacked DRAM with spiral point-to-point
TSV structure and improved bank group data control. In Proceedings of the 2018 IEEE International Solid-State Circuits
Conference (ISSCC’18). IEEE, 208–210.
[8] Jeff Dean. 2017. Recent advances in artificial intelligence and the implications for computer system design. In Proceed-
ings of the IEEE Hot Chips 29 Symposium (HCS’17). 1–116.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image
database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 248–255.
[10] Yasuko Eckert, Nuwan Jayasena, and Gabriel H. Loh. 2014. Thermal feasibility of die-stacked processing in memory.
In 2nd Workshop on Near-Data Processing (WoNDP’14).
[11] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and efficient neural
network acceleration with 3D memory. ACM SIGOPS Operat. Syst. Rev. 51, 2 (2017), 751–764.
[12] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. Deep reinforcement learning for robotic manip-
ulation with asynchronous off-policy updates. In IEEE International Conference On Robotics And Automation (ICRA’17).
3389–3396. https://doi.org/10.1109/ICRA.2017.7989385
[13] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE:
Efficient inference engine on compressed deep neural network. In Proceedings of the 2016 ACM/IEEE 43rd Annual
International Symposium on Computer Architecture (ISCA’16). IEEE, 243–254.
[14] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2017. ESE:
Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 75–84.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu,
Ruoming Pang, Vijay Vasudevan, et al. 2019. Searching for mobilenetv3. In Proceedings of the IEEE International Con-
ference on Computer Vision. 1314–1324.
[17] J. Thomas Pawlowski. 2011. Hybrid memory cube (HMC). In IEEE Hot Chips 23 Symposium (HCS’11). IEEE, 1–24.
[18] Joe Jeddeloh and Brent Keeth. 2012. Hybrid memory cube new DRAM architecture increases density and performance.
In Proceedings of the 2012 Symposium on VLSI Technology (VLSIT’12). IEEE, 87–88.
[19] JEDEC Standard. 2013. High Bandwidth Memory (HBM) DRAM. JESD235 (2013).
[20] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Archi-
tecture via Microbenchmarking. Technical Report. Citadel.
[21] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh
Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley,
Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann,
C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander
Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law,
Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony,
Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana
Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov,
Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma,
Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter
performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th Annual International Symposium
on Computer Architecture (ISCA’17). IEEE, 1–12.
[22] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A pro-
grammable digital neuromorphic architecture with high-density 3D memory. In Proceedings of the ACM/IEEE 43rd
Annual International Symposium on Computer Architecture (ISCA’16). IEEE, 380–392.
[23] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable, Oguz Elibol, Stewart Hall,
Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai, and Naveen Rao. 2017. Flexpoint: An adaptive numerical
format for efficient training of deep neural networks. arXiv:1711.02213. Retrieved from https://arxiv.org/abs/1711.
02213.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing Systems. 1097–1105.
[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
[26] Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, and Hoi-Jun Yoo. 2019. 7.7 LNPU: A 25.3
TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In Pro-
ceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC’19). IEEE, 142–144.


[27] Jong Chern Lee, Jihwan Kim, Kyung Whan Kim, Young Jun Ku, Dae Suk Kim, Chunseok Jeong, Tae Sik Yun, Hongjung
Kim, Ho Sung Cho, Yeon Ok Kim, Jae Hwan Kim, Jin Ho Kim, Sangmuk Oh, Hyun Sung Lee, Ki Hun Kwon, Dong Beom
Lee, Young Jae Choi, Jeajin Lee, Hyeon Gon Kim, Jun Hyun Chun, Jonghoon Oh, and Seok Hee Lee. 2016. 18.3 A 1.2
V 64Gb 8-channel 256GB/s HBM DRAM with peripheral-base-die architecture and small-swing technique on heavy
load interface. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC’16). IEEE, 318–319.
[28] Sheng Li, Ke Chen, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2011. CACTI-P: Architecture-level modeling
for SRAM-based structures with advanced leakage reduction techniques. In Proceedings of the International Conference
on Computer-Aided Design. IEEE Press, 694–701.
[29] Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst. 2017. 14.5 Envision: A 0.26-to-10 TOPS/W
subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm
FDSOI. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC’17). IEEE, 246–247.
[30] Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W. Keckler, and William J.
Dally. 2017. Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems. In Proceedings of the 50th
Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 41–54.
[31] José Carlos Sancho and Darren J. Kerbyson. 2008. Analysis of double buffering on two different multicore architectures:
Quad-core Opteron and the Cell-BE. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed
Processing. IEEE, 1–12.
[32] Andreas Schlapka. 2018. Micron Announces Shift in High-Performance Memory Roadmap Strategy. Retrieved
from https://www.micron.com/about/blogs/2018/august/micron-announces-shift-in-high-performance-memory-
roadmap-strategy.
[33] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. 2013. OverFeat: Inte-
grated recognition, localization and detection using convolutional networks. arXiv:1312.6229. Retrieved from https:
//arxiv.org/abs/1312.6229.
[34] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John
Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016),
484–503.
[35] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556.
[36] Kyomin Sohn, Won-Joo Yun, Reum Oh, Chi-Sung Oh, Seong-Young Seo, Min-Sang Park, Dong-Hak Shin, Won-Chang
Jung, Sang-Hoon Shin, Je-Min Ryu, Yum Hye-Seung, Jae-Hun Jung, Hyunui Lee, Seok-Yong Kang, Young-Soo Sohn,
Jung-Hwan Choi, Yong-Cheol Bae, Seong-Jin Jang, and Gyoyoung Jin. 2017. A 1.2 V 20 nm 307 GB/s HBM DRAM with
at-speed wafer-level IO test scheme and adaptive refresh considering temperature distribution. IEEE J. Solid-State Circ.
52, 1 (2017), 250–260.
[37] Synopsys. 2017. PrimeTime Static Timing Analysis. Retrieved from http://www.synopsys.com/Tools/Implementation/
RTLSynthesis/DesignCompiler/Pages/default.aspx.
[38] Synopsys. 2018. Design Compiler. Retrieved from http://www.synopsys.com/Tools/Implementation/RTLSynthesis/
DesignCompiler/Pages/default.aspx.
[39] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[40] S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick. 2008. The roofline model: A pedagogical tool for program
analysis and optimization. In Proceedings of the IEEE Hot Chips 20 Symposium (HCS’08). 1–71.
[41] Xiaowei Xu, Yukun Ding, Sharon Xiaobo Hu, Michael Niemier, Jason Cong, Yu Hu, and Yiyu Shi. 2018. Scaling for
edge inference of deep neural networks. Nat. Electr. 1 (Apr. 2018).
[42] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017. Recent trends in deep learning based
natural language processing. arXiv:1708.02709. Retrieved from https://arxiv.org/abs/1708.02709.
[43] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the
European Conference on Computer Vision. Springer, 818–833.
[44] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accel-
erator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays. ACM, 161–170.
[45] Jie-Fang Zhang, Ching-En Lee, Chester Liu, Yakun Sophia Shao, Stephen W. Keckler, and Zhengya Zhang. 2019. SNAP:
A 1.67–21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in
16nm CMOS. In Proceedings of the 2019 Symposium on VLSI Circuits. IEEE, C306–C307.

Received April 2021; revised October 2020; accepted April 2021
