
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)

A Multi-Neural Network Acceleration Architecture


Eunjin Baek, Dongup Kwon, and Jangwoo Kim‡
Department of Electrical and Computer Engineering, Seoul National University
{ebaek, dongup, jangwoo}@snu.ac.kr

Abstract—A cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, as emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients using a single AI accelerator to improve its cost effectiveness. Therefore, an ideal next-generation neural network accelerator should support simultaneous multi-neural network execution while fully utilizing its hardware resources. However, existing accelerators, which are optimized for a single neural network execution, can suffer from severe resource underutilization when running multiple neural networks, mainly due to the load imbalance between computation and memory-access tasks from different neural networks.

In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture which enables a cost-effective, high-performance multi-neural network execution. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by matching compute- and memory-intensive tasks from different networks and executing them in parallel. However, it is highly challenging to find and schedule the best load-matching tasks from different neural networks during runtime without significantly increasing the size of on-chip memory. To overcome the challenges, AI-MT first creates fine-grain tasks at compile time by dividing each layer into multiple identical sub-layers. During runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.

I. INTRODUCTION

Neural networks (NNs) have been applied to a wide range of applications thanks to their high accuracy and performance (e.g., image classification [30], [32], [35], [50], [56] and speech recognition [55]). However, executing a modern neural network can incur an extremely large number of operations and data movements due to the ever-increasing size of the network model and input data. Therefore, when modern general-purpose processors (e.g., CPU, GPU) run a large-scale neural network, they can suffer from low performance and cost-effectiveness due to their limited amount of computing resources and high power consumption [13], [16], [18].

To address this issue, researchers have made a lot of effort to design various neural network accelerators which execute a single neural network in the most cost-effective way [13], [16], [18], [27], [33], [42], [54]. For example, to efficiently support the basic operations of modern neural networks (e.g., matrix and vector operations), those accelerators often adopt two-dimensional processing element (PE) arrays [14], [15], [33]. The accelerators aim to utilize as many PEs as possible concurrently and also to minimize the data movements between various units. For this purpose, researchers have also proposed various dataflow mechanisms to find the best neural network-to-PE array mapping for the maximum hardware utilization and the minimum data movements [14], [36], [60].

While many AI services benefit from accelerators targeting single neural network executions, the cost-effective execution of multi-tenant neural networks is becoming increasingly important, in particular for cloud providers, for the following reasons. First, a modern cloud service consists of many AI algorithms executing many heterogeneous neural networks. For example, Google's search, vision, and translation services require MLP, CNN, and RNN executions, respectively [10], [19], [28], [29], [33], [49], [55], and emerging complex services such as self-driving and natural language processing need to run many different neural networks simultaneously [5], [6], [8], [9], [21], [24], [34], [53], [58]. Second, cloud providers must minimize their huge operation costs by running as many applications on a given server as possible, while satisfying the quality of each service [2], [11], [12], [40]. Therefore, server architects are now in dire need of a new AI acceleration architecture to enable a cost-effective, high-performance multi-neural network execution.

The existing neural network accelerators, however, cannot support a cost-effective multi-tenant neural network execution. A conventional accelerator targeting single neural network executions can naively enable a multi-neural network execution by executing different neural networks in sequence or layers from different neural networks iteratively. However, this approach creates many long periods of either compute- or memory-intensive execution, which leads to long periods of severe underutilization of a specific resource. Furthermore, the inter-layer dependency within a single network prevents the following layers from even starting their execution until the current layer's completion, which also leads to resource underutilization at every layer transition.

To enable a cost-effective multi-neural network execution, this paper aims (1) to make a conventional AI accelerator support a multi-neural network execution at minimum cost and (2) to maximize its hardware utilization by creating fine-grained, dependency-free tasks with different resource intensities and executing them in parallel.

To achieve these goals, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture which enables a cost-effective, high-performance multi-neural network execution.
‡ Corresponding author. a novel accelerator architecture which enables a cost-effective,

The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by (1) creating fine-grain compute- and memory-intensive tasks from different networks, (2) finding heterogeneous tasks with similar execution latency during runtime, and (3) executing them in parallel. AI-MT also (4) minimizes its on-chip memory capacity requirement by evicting large allocations as early as possible.

To create fine-grain tasks, AI-MT first divides each layer into multiple identical sub-layers at compile time. As the size of the sub-layer is statically determined by the PE array's mapping granularity, each sub-layer's SRAM requirement is small and identical. During a sub-layer execution, we define the phase of loading its weights to the on-chip SRAM as Memory Block (MB) execution and the phase of input processing with the loaded weights as Compute Block (CB) execution.
To dynamically execute MBs and CBs for the best resource load matching during runtime, AI-MT exploits a hardware-based sub-layer scheduler. The scheduler dynamically schedules MBs and CBs as long as their dependency is satisfied. First, it can fetch dependency-free MBs early (memory block prefetching) to fully utilize the memory bandwidth available. Second, it can group dependency-free CBs (compute block merging) to fully utilize the computing resources available. Lastly, it can early schedule and evict SRAM capacity-critical MBs (memory block eviction) to minimize the on-chip memory's capacity requirement.

To evaluate our AI-MT architecture, we use representative neural network workloads taken mainly from the MLPerf benchmark. The results show that AI-MT successfully utilizes its processing elements and memory bandwidth, while minimizing the SRAM capacity requirement. Thanks to the higher resource utilization, AI-MT achieves up to 1.57x speedup over the baseline scheduling method. The sensitivity analysis for workload batching and memory capacity shows that AI-MT significantly reduces the on-chip memory's capacity requirement.

In summary, our work makes the following contributions:
• Multi-Neural Network Acceleration: We propose AI-MT, a novel AI acceleration architecture to enable a cost-effective multi-neural network execution.
• Efficient Scheduling Methods: AI-MT exploits three hardware-based scheduling methods to efficiently schedule dependency-free sub-layer tasks.
• High Performance and Resource Utilization: AI-MT significantly improves the hardware utilization, which also leads to significant performance improvement.
• Minimum SRAM Requirement: AI-MT significantly reduces its on-chip SRAM capacity requirement, which is a critical contribution for future scalability.

To the best of our knowledge, this is the first work to propose an accelerator architecture to enable a cost-effective multi-neural network execution.

Fig. 1: Example neural network and CONV layer. (a) An example network consisting of CONV, POOL, and FC layers; (b) a CONV layer with filters (ic × k × k × oc), input features (ic × ih × iw), and output features (oc × oh × ow).

II. BACKGROUND

A. Neural Networks

Neural networks consist of various layers connected to each other. Figure 1a shows an example neural network which consists of convolutional (CONV), fully connected (FC), and pooling (POOL) layers. In the network, each layer's output features are passed to the next layer as its input features.

Depending on the layer's type, different layers perform different operations on their input features. For example, Figure 1b shows the operations of the example CONV layer. To produce a single output value, the layer computes the (ic × k × k) dot products using the given input feature and filter, and adds the results with a bias value. By moving the filters to cover the entire input features, the layer generates the final output features. As another example, an FC layer performs matrix multiplications of input features and weight matrices. The FC layer can be considered as a special form of CONV layer which computes the dot product of the single-dimension input features (ic × 1 × 1) and the single-dimension kernels (ic × 1 × 1). Performing this operation oc times produces an (oc × 1) output. Lastly, a POOL layer reduces the input feature sizes. For example, when applying a (2 × 2) min pooling layer, the layer takes the minimum value in each (2 × 2) region of the input feature values. The pooling can also be used to take the maximum value or the average of all values.
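To make the arithmetic above concrete, the short Python sketch below counts the dot-product length and total multiply-accumulate (MAC) operations of a CONV layer and shows how an FC layer reduces to the (ic × 1 × 1) special case. It assumes unit stride and no padding, and all function names are ours for illustration only.

import math

def conv_output_dims(ih, iw, k):
    # Output height/width of a k x k convolution with stride 1, no padding.
    return ih - k + 1, iw - k + 1

def conv_macs(ic, ih, iw, k, oc):
    # Each output value needs an (ic*k*k)-long dot product; there are oc*oh*ow outputs.
    oh, ow = conv_output_dims(ih, iw, k)
    return oc * oh * ow * (ic * k * k)

def fc_macs(ic, oc, batch=1):
    # An FC layer is the CONV special case with (ic x 1 x 1) inputs and kernels.
    return batch * oc * (ic * 1 * 1)

if __name__ == "__main__":
    # e.g., a 3x3 CONV over a 56x56x64 input producing 128 output channels
    print(conv_macs(ic=64, ih=56, iw=56, k=3, oc=128))
    print(fc_macs(ic=4096, oc=1000))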
B. Baseline Neural Network Accelerator Architecture

To construct our baseline architecture, we adopt a conventional systolic array architecture based on Google's TPU [33] and scale the architecture up for our purpose (Figure 2). Each core of the latest TPU model has two processing element (PE) arrays, and each array has 128 × 128 16-bit bfloat multiply-and-accumulate (MAC) units [1]. In addition, each TPU core incorporates HBM (300 GB/s for each TPUv2 core) and on-chip SRAM buffers to mitigate the memory bottleneck in both inference and training [20].

Fig. 2: Baseline neural network accelerator architecture (HBM, decoupled input/output/weight on-chip buffers, weight and input fetchers, PE arrays, and accumulator/normalization/activation/pooling units).

Fig. 3: Mapping example of a convolution layer over a systolic array architecture.

Fig. 4: Layer and sub-layer granularity executions. (a) Layer execution; (b) sub-layer execution; (c) sub-layer execution with prefetching.

In this paper, we scale up the number of PE arrays to 16 as we target server-scale neural network inference, which commonly utilizes a reduced bit precision (8-bit integer, 2x less bit width than TPUv3's) and high-end HBM technologies [4] (450 GB/s for each core, 1.5x higher than TPUv2's¹). We assume that the accelerator has three physically decoupled buffers for input features, output features, and filters (or weight matrices). Each buffer has multiple banks to simultaneously feed the input and weight values to the PE array and write the output values to the buffer.

To start a neural network execution, our baseline architecture first loads the weight values from HBM to the unified buffers. After feeding the weight values into the PE arrays, it performs the layer's operations using the arrays. Next, the intermediate results are passed to the dedicated units which perform subsequent operations (e.g., activation, normalization, pooling). Lastly, the accelerator passes the output values to the buffer, which can be reused as the input values for the next layer. Our baseline architecture supports double-buffering to prefetch the weights to the PE array during its computation for hiding the fetching latency [33].

C. Systolic Array Architecture

A systolic array architecture is a two-dimensional PE array that maximizes data reuse by directly forwarding data between adjacent PEs. Therefore, a systolic array architecture can reduce energy consumption by reducing data movements between the PE array and on-chip memories.

Figure 3 shows how the systolic array operates for an example CONV layer. First, a set of filters (or weights) is mapped to the PE array before starting the CONV operations (②). Next, the input values (③) are mapped to the different columns of the PE array and fed into the array in a streamed manner (④). Then, every PE multiplies the received input by the stored weight, and adds the result to its partial result (⑤). Finally, every PE passes the updated partial result to the next row, and it also passes its input values to the right column (⑥). On finishing the input set, the PE array repeats this process with another filter set.

In this scheme, the layer's type and size determine the number of iterations and memory bandwidth requirements. For an FC layer, the weight matrix (ic × oc) is considered as oc filters where each filter has size 1 × 1 × ic, and each filter is mapped to a column of the array. When a filter size (1 × 1 × ic) is larger than the PE array's row size, the mapping is repeated ⌈(1 × 1 × ic) / PE_dim⌉ times to complete the required number of dot-product operations. In addition, the total number of columns in the PE arrays (PE_dim × #PE_array) determines the number of filters runnable concurrently, which leads to ⌈oc / (PE_dim × #PE_array)⌉ iterations to execute all filters. Therefore, the systolic array performs ⌈(1 × 1 × ic) / PE_dim⌉ × ⌈oc / (PE_dim × #PE_array)⌉ iterations for each FC layer. For a CONV layer, which reuses a relatively small number of filters, we assume that all PE arrays share the same weight mapping and each PE array has partitioned input feature streams. In this case, the systolic array takes ⌈(k × k × ic) / PE_dim⌉ × ⌈oc / PE_dim⌉ iterations to execute the layer.

D. Baseline Scheduling and Prefetching Granularity

To describe our scheduling granularity, we define a sub-layer as a single PE array mapping (i.e., a single iteration). As the accelerator configuration determines the size of a sub-layer, a layer is divided into a number of equal-sized sub-layers. Figures 4a and 4b show example layer-granularity and sub-layer-granularity scheduling, in which MB indicates the memory bandwidth usage to fetch the weights and CB indicates the PE array usage.

In this paper, we adopt sub-layer granularity scheduling to create fine-grain tasks. We also adopt weight prefetching to fetch the weight values for the next sub-layer mapping during the previous CB's execution (Figure 4c).

¹TPUv3 memory bandwidth is not officially disclosed. We assume its HBM bandwidth per core is equal to TPUv2's (i.e., 300 GB/s).
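As a concrete restatement of the iteration counts above, the sketch below computes the number of PE-array mappings (sub-layers) for FC and CONV layers. PE_DIM and NUM_PE_ARRAYS mirror the baseline configuration (128 and 16); the function names are ours, not the paper's.

import math

PE_DIM = 128          # rows/columns of one systolic PE array (baseline)
NUM_PE_ARRAYS = 16    # scaled-up baseline configuration

def fc_iterations(ic, oc, pe_dim=PE_DIM, n_arrays=NUM_PE_ARRAYS):
    # Each 1x1xic filter is folded over the array rows, and the filters are
    # spread over all columns of all PE arrays.
    return math.ceil(ic / pe_dim) * math.ceil(oc / (pe_dim * n_arrays))

def conv_iterations(ic, k, oc, pe_dim=PE_DIM):
    # All PE arrays share one weight mapping and split the input feature
    # streams, so only a single array's columns hold distinct filters.
    return math.ceil((k * k * ic) / pe_dim) * math.ceil(oc / pe_dim)

# Example: a 3x3x256 CONV layer with 512 filters
print(conv_iterations(ic=256, k=3, oc=512))   # 18 * 4 = 72 sub-layers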

Fig. 5: Ratio of computation and memory-prefetching latency for each layer of VGG16.

Fig. 6: Multi-neural network execution examples. (a) Sub-layer FIFO scheduling; (b) sub-layer RR scheduling; (c) sub-layer Greedy scheduling.

III. MOTIVATION

A. Multiple Neural Network Execution

A conventional accelerator targeting single neural network executions creates many long periods of either compute- or memory-intensive execution, which leads to long periods of severe underutilization of a specific resource. For example, Figure 5 shows the portion of computation and memory prefetching latency of each layer in VGG16 [50]. As the earlier layers are compute-intensive, the memory bandwidth utilization remains significantly low during this period. On the other hand, as the later layers are highly memory-intensive, the utilization of the computation resources quickly drops. In addition, the inter-layer dependency prevents the following layers from starting their execution until the current layer's completion, incurring resource underutilization at every layer transition. Note that dividing a layer into sub-layers does not improve this underutilization, as sub-layers share the same resource intensity (Figure 4b).

Running multiple neural networks together has the potential to alleviate the problem, as layers from different neural networks can be freely scheduled without any dependency issue. To achieve this goal, the baseline accelerator should support an optimal multi-neural network scheduling method which can create many fine-grained, dependency-free tasks with different resource intensities and execute them in parallel in a way that maximizes the resource utilization. However, it is highly challenging to make a conventional accelerator support such scheduling in a cost-effective way.

B. Resource Idleness in Neural Network Execution

To analyze the resource underutilization while running multiple neural networks, we consider three reasonable scheduling mechanisms (i.e., First-In-First-Out (FIFO), Round-Robin (RR), and a greedy algorithm) in Figure 6. For simplicity, we assume that each scenario runs three neural networks, each showing only one layer which is divided into multiple sub-layers. In each scenario, a sub-layer is shown as its MB and CB executions.

Figure 6a shows the FIFO scheduling timeline. Since the FIFO mechanism first performs the neural network which came in first, it is effectively the same as the network-wise serial execution. In the network-wise serial execution, consecutive sub-layers coming from the same layer have similar resource intensities, which is likely to incur high resource idleness.

Next, Figure 6b shows the RR scheduling timeline. The RR scheduling repeatedly selects one sub-layer from each neural network in a determined order, which provides fairness among networks. This scheduling can outperform the FIFO scheduling if two back-to-back scheduled sub-layers happen to have different resource intensities. However, this static scheduling is more likely to lead to severe resource underutilization due to mismatching resource intensities.

To enable more flexible scheduling, we consider a simple greedy algorithm which dynamically selects the MB whose size is most similar to the currently executing CB (Figure 6c). This algorithm is likely to outperform the FIFO and RR mechanisms, but it can still suffer from resource idleness due to mismatching resource intensities.

To see the resource idleness problems in real workloads, we run neural network benchmarks included in MLPerf Inference [3] and VGG16 [50], and measure their hardware utilization using our cycle-accurate simulator. To produce various multi-neural network executions, we co-locate neural networks having distinct resource-utilization characteristics. For instance, we combine memory-intensive networks (e.g., VGG16 with large FC layers and GNMT) with multiple compute-intensive networks (e.g., ResNet34, ResNet50, and MobileNet). To provide a balanced distribution of CBs and MBs, we iteratively run memory-intensive workloads to properly match the amount of CBs produced by compute-intensive workloads.

We run the synthesized workloads on our cycle-accurate neural network acceleration simulator (as described in Section V-A) and measure both the performance and the resource utilization using three different scheduling mechanisms (i.e., RR, Greedy, and Shortest-Job-First (SJF)). For the SJF mechanism, we choose the next block to be the smallest one, where the size is determined by max(MB cycle, CB cycle).
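For reference, the following sketch illustrates how the baseline selection rules discussed above could pick the next sub-layer block. It assumes each pending block exposes its MB and CB cycle counts, and it only mirrors the selection policies, not the simulator's code.

from collections import deque

# Illustrative pending-block queues: one deque per network, each entry is
# (network_id, mb_cycles, cb_cycles).

def pick_fifo(queues):
    # FIFO: drain the earliest-arrived network first (network-wise serial).
    for q in queues:
        if q:
            return q.popleft()
    return None

def pick_rr(queues, last_idx):
    # Round-robin: one sub-layer from each network in a fixed order.
    n = len(queues)
    for step in range(1, n + 1):
        idx = (last_idx + step) % n
        if queues[idx]:
            return queues[idx].popleft(), idx
    return None, last_idx

def pick_greedy(queues, executing_cb_cycles):
    # Greedy: the MB whose cycle count is closest to the executing CB's.
    candidates = [(abs(q[0][1] - executing_cb_cycles), i)
                  for i, q in enumerate(queues) if q]
    if not candidates:
        return None
    return queues[min(candidates)[1]].popleft()

def pick_sjf(queues):
    # SJF: smallest block, where size = max(MB cycles, CB cycles).
    candidates = [(max(q[0][1], q[0][2]), i) for i, q in enumerate(queues) if q]
    if not candidates:
        return None
    return queues[min(candidates)[1]].popleft()

# Example: two networks with different resource intensities.
nets = [deque([(0, 100, 400)]), deque([(1, 300, 80)])]
print(pick_greedy(nets, executing_cb_cycles=350))  # (1, 300, 80): its MB best matches the 350-cycle CB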

Fig. 7: Compute and memory bandwidth utilization under the round-robin scheduling algorithm (RN34: ResNet34, RN50:
ResNet50, MN: MobileNet)

Fig. 8: Speedup of different scheduling mechanisms over the baseline sub-layer FIFO scheduling.

Fig. 9: Impact of the limited SRAM capacity on multi-neural network execution. (a) Infinite SRAM capacity; (b) limited SRAM capacity.

Figure 7 shows the resource utilization obtained with the sub-layer granularity RR mechanism. As reasoned by the simple scenario examples, the RR mechanism incurs severe resource underutilization due to the frequent mismatches of MBs and CBs, even though workloads with different resource demands (memory- and compute-intensive) are co-located. Running GNMT with other CNNs shows slightly higher resource utilization, but the RR scheduling mechanism still suffers significantly from its limited resource matching capability.

Figure 8 shows the relative performance of the three scheduling mechanisms over the FIFO scheduling mechanism. We observe that SJF also suffers from high resource idleness because repeatedly scheduling short jobs from the same network eventually performs a FIFO-like network-wise scheduling. In general, the greedy mechanism outperforms the FIFO mechanism, but the improvement is far from the optimal performance due to its still limited capability to find the best-matching MB and CB pairs.

C. Limited SRAM Capacity

In Section II, we introduced the weight prefetching to fetch the weight values for the next sub-layer mapping during the previous CB's execution (Figure 4c). With such prefetching enabled, one of the easiest ways to amortize both computation and memory-bandwidth underutilization is to schedule all the compute-intensive sub-layers first and then all the memory-intensive sub-layers. Note that this method ignores fairness.

In terms of the performance improvement only, Figure 9a shows the impact of this prefetch-aware scheduling method for the example used in Figure 6. Compared to the FIFO scheduling mechanism (Figure 6a), the proposed prefetching method significantly improves both compute- and memory-bandwidth utilization.

However, this scheduling method has a fundamental limitation due to its significant SRAM capacity requirement. The key idea of this scheme is to prefetch as many MBs as possible to the SRAM until all long-running CBs are completed. Therefore, if the prefetched MBs are not evicted during the long periods of large CB executions, the SRAM becomes full quickly, which disables any further MB prefetching (Figure 9b).

Figure 10 shows the required SRAM capacity to prefetch weight values while executing the layers in the MLPerf workloads. We estimate the accumulated latency for all CBs and assume the accelerator prefetches MBs from later layers during each layer's execution. The result shows that even a single-batch layer execution can require over 10 MB of SRAM to fully utilize the memory bandwidth, and this capacity pressure can accumulate during multi-neural network execution. For this reason, we limit the default SRAM capacity to 1 MB in this work and propose a solution to reduce the capacity requirement.
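The estimate behind Figure 10 can be approximated as in the rough sketch below, which assumes that during a layer's accumulated CB time the accelerator keeps prefetching later layers' weights at full memory bandwidth, so the buffer must hold whatever arrives before it is consumed. The formula and names are our simplification of the paper's description, not its exact model.

def prefetch_buffer_per_layer(cb_cycles, weight_bytes, bw_bytes_per_cycle):
    # cb_cycles[i]: accumulated compute time of layer i's CBs
    # weight_bytes[i]: size of layer i's weights (its MBs)
    # Returns, per layer, how many bytes of later layers' weights full-bandwidth
    # prefetching could bring in while layer i is still computing.
    sizes = []
    for i, cycles in enumerate(cb_cycles):
        fetchable = cycles * bw_bytes_per_cycle
        later = sum(weight_bytes[i + 1:])
        sizes.append(min(fetchable, later))
    return sizes

# Toy numbers only: 450 GB/s at 1 GHz is 450 bytes per cycle.
print(prefetch_buffer_per_layer([200_000, 50_000, 10_000],
                                [2_000_000, 8_000_000, 16_000_000], 450))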


Fig. 10: Required prefetch SRAM buffer size for each layer in MLPerf inference benchmarks.

D. Design Goals

From the above observations, we set three key design goals to architect a cost-effective multi-neural network execution accelerator as follows. First, it should support a cost-effective multi-neural network execution. Second, it should fully utilize the hardware resources using an effective scheduling method. Third, it should minimize the SRAM capacity requirement for future scalability.

IV. AI-MULTITASKING ARCHITECTURE

In this section, we propose AI-MultiTasking (AI-MT), a simultaneous multi-neural network execution processor architecture. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by (1) creating fine-grain computation and memory-access tasks from different networks, (2) scheduling the best load-matching tasks during runtime, and (3) executing them in parallel.

At compile time, AI-MT divides each layer into multiple identical sub-layers to create fine-grain computation and memory-access tasks. Then, AI-MT generates the sub-layer scheduling table to keep the sub-layer level metadata required for fine-grain task management at runtime. Assuming Google's TPU-like CISC instructions [33] which utilize sub-layer granularity operations (i.e., matrix multiplication based on a PE-array-sized block of weight values), AI-MT can naturally divide each layer into sub-layers at compile time without sacrificing performance.

At runtime, AI-MT dynamically schedules fine-grain computation and memory-access tasks from multiple networks. AI-MT tracks the dependency-free MBs and CBs by referring to the sub-layer scheduling tables, and then applies the load balancing scheduling mechanism (Section IV-B). The load balancing scheduling mechanism fetches dependency-free MBs early to fully utilize the accelerator's memory bandwidth, and groups dependency-free CBs to fully utilize the computing resources available. On top of the load balancing scheduling mechanism, AI-MT applies early MB eviction, which schedules and evicts SRAM capacity-critical MBs early to minimize the on-chip memory's capacity requirement.

Algorithm 1: Latency estimation model
  1: if LayerType = CONV then
  2:   MB.cycle = read_cyc_per_array
  3:   CB.cycle = ⌈(ow × oh) / #PE_array⌉ × batch + filling_time
  4:   #iters = ⌈(ic × k × k) / PE_dim⌉ × ⌈oc / PE_dim⌉
  5: else if LayerType = FC then
  6:   MB.cycle = read_cyc_per_array × #PE_array
  7:   CB.cycle = batch + filling_time
  8:   #iters = ⌈(ic × 1 × 1) / PE_dim⌉ × ⌈oc / (PE_dim × #PE_array)⌉

A. Overview

Figure 11 shows the overview of AI-MT. To support fine-grain task management, AI-MT has three key features: Sub-Layer Scheduling Tables, Candidate Queues, and a Weight Management Table.

First, each sub-layer scheduling table keeps the sub-layer level metadata for the fine-grain task execution of each neural network. The sub-layer scheduling table has as many rows as the number of layers, and each row includes the MB and CB information of the layer (e.g., cycles, #sub-layers, dependencies). AI-MT initializes the sub-layer scheduling table at compile time with statically determined information such as the cycles taken by the MB and CB of the layer.

Second, AI-MT has an MB candidate queue and a CB candidate queue, each of which tracks the dependency-free MBs and CBs. AI-MT finds the dependency-free MBs and CBs by referring to the sub-layer scheduling table and enters each task into the corresponding candidate queue.

Lastly, AI-MT keeps the weight address for each sub-layer using the weight management table. As AI-MT prefetches the weight values during the MB's execution and consumes them during the CB's execution, AI-MT needs to track the weight address for the CB's execution.
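A compact way to picture these structures is the sketch below: one row per layer with the columns shown in Figure 11, plus per-network candidate queues. The class and field names are ours and only illustrate the bookkeeping, not the hardware layout.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class LayerRow:
    # One sub-layer scheduling table row (columns as in Figure 11).
    mb_cycles: int            # cycles of one memory block (weight fetch)
    cb_cycles: int            # cycles of one compute block
    mb_iters: int             # remaining MB sub-layers of this layer
    cb_iters: int             # remaining CB sub-layers of this layer
    mb_indegree: int          # unresolved previous-layer dependencies (MB side)
    cb_indegree: int          # unresolved previous-layer dependencies (CB side)
    post_layer_ids: tuple     # first/last post-layer ids to notify on completion
    i_addr: int = 0           # input feature address
    w_head: int = -1          # first prefetched weight block id
    w_tail: int = -1          # last prefetched weight block id
    sr_cycles: int = 0        # remaining cycles of a halted CB (for resume)
    si_addr: int = 0          # input address at which a halted CB resumes

@dataclass
class NetworkState:
    table: list                                           # list[LayerRow], one table per co-running network
    mb_candidates: deque = field(default_factory=deque)   # dependency-free MBs
    cb_candidates: deque = field(default_factory=deque)   # dependency-free CBs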
1) Sub-Layer Scheduling Table Initialization: At compile time, AI-MT initializes several columns of the sub-layer scheduling tables in Figure 11.

AI-MT estimates the cycles taken by the MB and CB of each layer (i.e., the cycles column) and the total number of sub-layers (i.e., the #iters column) using Algorithm 1. In the algorithm, read_cyc_per_array indicates the cycles to prefetch the weight values for one PE array, and filling_time indicates the cycles taken from the first input value being injected into the PE array until the first output value is generated.

As mentioned in Section II-C, we adopt different mechanisms for a CONV layer and an FC layer, as their data reuse properties are different. For a CONV layer, which reuses a small number of filters, we assume that all PE arrays share the same weight mapping and each PE array has partitioned input feature streams. In this case, the cycles required by the MB equal the cycles to prefetch the weight values from HBM to SRAM for one PE array, and the cycles required by the CB equal the cycles to operate on the partitioned input features (i.e., shared weight values, partitioned input features).

Fig. 11: Overview of AI-MT (per-network sub-layer scheduling tables with columns such as #indegree, #iters, cycles, i_addr, w_head, w_tail, sr_cycles, si_addr, and post-layer id; MB/CB candidate queues; a scheduled CB queue; and the weight management table with its free list).

Otherwise, for an FC layer, we assume that each PE array has individual weight values, since an FC layer reuses the weight values only by the batch size, which is relatively smaller than the reuse amount of a CONV layer. In this case, the cycles required by the MB equal the cycles to prefetch the weight values from HBM to SRAM for all PE arrays, and the cycles required by the CB equal the cycles to operate on all the input features (i.e., individual weight values, shared input features).
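The per-layer estimation of Algorithm 1 translates directly into the following Python sketch; read_cyc_per_array and filling_time follow the definitions above, and the constants mirror the baseline configuration. It is a sketch of the model, not the compiler's actual code.

import math

PE_DIM, NUM_PE_ARRAYS = 128, 16

def estimate_blocks(layer_type, ic, oc, k, oh, ow, batch,
                    read_cyc_per_array, filling_time):
    # Returns (MB.cycle, CB.cycle, #iters) per Algorithm 1.
    if layer_type == "CONV":
        mb_cycle = read_cyc_per_array
        cb_cycle = math.ceil((ow * oh) / NUM_PE_ARRAYS) * batch + filling_time
        iters = math.ceil((ic * k * k) / PE_DIM) * math.ceil(oc / PE_DIM)
    elif layer_type == "FC":
        mb_cycle = read_cyc_per_array * NUM_PE_ARRAYS
        cb_cycle = batch + filling_time
        iters = math.ceil((ic * 1 * 1) / PE_DIM) * math.ceil(oc / (PE_DIM * NUM_PE_ARRAYS))
    else:
        raise ValueError("only CONV and FC layers are estimated here")
    return mb_cycle, cb_cycle, iters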
AI-MT also updates the #indegree and layer_id columns, which are used to track the dependencies between layers in the later sections. AI-MT sets #indegree to the number of dependencies on previous layers (i.e., the number of its previous layers), and the layer_id column to the post-layer number. When a layer has multiple consecutive post-layers, AI-MT sets two layer_id columns to the first and last post-layer's number.

2) Candidate Queue Insertion: By referring to the #indegree, #iters and layer_id columns, AI-MT finds the dependency-free MBs and CBs and inserts them into the corresponding candidate queue.

AI-MT uses #indegree and layer_id to track the dependencies between the previous layer's last sub-layer and the next layer's first sub-layer. Whenever the last sub-layer's MB or CB finishes, AI-MT refers to the layer_id row of the sub-layer scheduling table (i.e., the next layer's row) to resolve the dependency between the layers. Then, AI-MT decreases the #indegree of the next layer's MB or CB by 1, and inserts the next layer's first MB or CB into the corresponding candidate queue if #indegree becomes 0.

AI-MT also uses #iters to track the dependencies between the sub-layers in the same layer. #iters indicates the remaining sub-layer blocks for the MB or CB, and AI-MT can execute the blocks sequentially by decreasing it one by one. Also, when the #iters of the MB is smaller than the #iters of the CB, it indicates that the CB's corresponding MB has already been scheduled. Therefore, AI-MT can track both dependencies (1) between the previous MB and the next MB, and (2) between an MB and the corresponding CB. For example, whenever an MB or CB is scheduled, AI-MT decreases the corresponding #iters by 1, and adds the next block to the candidate queue. At this time, AI-MT checks if the next CB's dependency is resolved by comparing the remaining #iters of the MBs and CBs. If the #iters of the MB is smaller than that of the CB, AI-MT adds the next CB to the CB candidate queue and decreases the #iters of the CBs.
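The dependency bookkeeping just described can be sketched as two small update rules; the table here is a plain list of per-layer dicts and the function names are ours. This is a simplification of the rule for illustration, not the scheduler's logic.

def on_mb_scheduled(table, layer_id, cb_candidates):
    # One MB sub-layer of `layer_id` was issued: consume one MB iteration and,
    # if its paired CB is now dependency-free (fewer MB iterations remain than
    # CB iterations), expose that CB to the CB candidate queue.
    row = table[layer_id]
    row["mb_iters"] -= 1
    if row["cb_indegree"] == 0 and row["mb_iters"] < row["cb_iters"]:
        cb_candidates.append(layer_id)
        row["cb_iters"] -= 1

def on_layer_finished(table, layer_id, mb_candidates):
    # The last sub-layer of `layer_id` finished: resolve the inter-layer
    # dependency of its post-layers and expose their first MBs when ready.
    for nxt in table[layer_id]["post_layers"]:
        table[nxt]["mb_indegree"] -= 1
        if table[nxt]["mb_indegree"] == 0:
            mb_candidates.append(nxt)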
3) Prefetched Weight Management: As AI-MT prefetches the weight values during the MB's execution and consumes them during the CB's execution, AI-MT needs to track the weight address for the CB's execution. To support this, AI-MT has three main features: (1) a Free List, (2) a Weight Management Table, and (3) the w_head and w_tail columns in the sub-layer scheduling table. First, the free list keeps the empty block ids in the weight buffer. Next, the weight management table keeps a list where the index is a weight block id and the value refers to the next weight block id. Lastly, the w_head and w_tail columns indicate the start and end block ids of the prefetched weight blocks for the layer's execution.

By combining the three features, AI-MT can easily track the weight addresses for the CBs to be executed. For example, when AI-MT prefetches the weight values, the memory controller allocates block ids by referring to the free list and stores the weights to the blocks. After that, AI-MT refers to the weight management table using w_tail as an index and updates the value to the newly allocated weight block id. AI-MT also updates w_tail to the newly allocated weight block id to keep the last weight block id for the layer. This mechanism enables finding the CB's weight address sequentially using only w_head and w_tail. When a CB is scheduled, AI-MT refers to w_head to find the corresponding weight blocks for the CB, and updates w_head to the next weight block by referring to the weight management table for the next CB.
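The free list plus next-pointer table behave like a singly linked list over weight-buffer blocks; the sketch below shows the prefetch-side append (w_tail) and the CB-side pop (w_head). Class and method names are illustrative, and the Layer stand-in only carries the two columns used here.

class WeightBlockTracker:
    # Tracks which weight-buffer blocks hold each layer's prefetched weights.

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free list of empty block ids
        self.next_of = [-1] * num_blocks      # weight management table: block -> next block

    def prefetch(self, layer):
        # Append a newly filled block to the layer's chain (updates w_tail).
        block = self.free.pop(0)
        if layer.w_tail >= 0:
            self.next_of[layer.w_tail] = block
        else:
            layer.w_head = block              # first block of this layer
        layer.w_tail = block
        return block                          # the memory controller fills this block

    def consume(self, layer):
        # Pop the block for the next CB (updates w_head) and free it.
        block = layer.w_head
        layer.w_head = self.next_of[block]
        self.next_of[block] = -1
        self.free.append(block)
        return block

class Layer:                                  # minimal stand-in with the two columns used here
    w_head, w_tail = -1, -1

tracker = WeightBlockTracker(num_blocks=8)
fc = Layer()
tracker.prefetch(fc)
tracker.prefetch(fc)
print(tracker.consume(fc))                    # 0: the first prefetched block feeds the first CB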
B. Load Balancing Scheduling

Figure 12a shows an example scheduling scenario when applying the RR scheduling mechanism. The different sizes of the MBs and CBs incur resource idleness: memory bandwidth idleness when the CB is larger than the MB (Part-1), and PE array idleness when the MB is larger than the CB (Part-2).

To increase resource utilization in both memory bandwidth and PE arrays, we propose the load balancing scheduling mechanism using two schemes: MB prefetching and CB merging. MB prefetching reduces memory bandwidth idleness by aggressively prefetching the MBs of later sub-layers. It also improves CB resource utilization, as the earlier MB execution resolves the dependency of the corresponding CB earlier. However, when MB prefetching is not enough to resolve the next CB's dependency, it incurs PE array idleness until the dependency is resolved. Our CB merging scheme improves PE array utilization by keeping the MB and CB sizes at a similar pace.

1) Memory Block (MB) Prefetching: MB prefetching prefetches MBs whenever the remaining SRAM capacity

Fig. 12: Examples of load balancing scheduling. (a) Baseline RR scheduling; (b) Memory Block (MB) prefetching; (c) Compute Block (CB) merging.

Algorithm 2: MB prefetching and CB merging
  1: [Whenever the previous MB finishes]
  2: target = None
  3: for MB in MB_CQ do
  4:   if MB.cycle < RM_C then
  5:     if AVL_CB < threshold then
  6:       if MB's CB.cycle > MB.cycle then
  7:         target = MB; break
  8:     else
  9:       target = MB; break
 10: if target == None then
 11:   stall until executing CB finishes
 12:   AVL_CB decreases for the executed CBs
 13: else
 14:   MB_CQ.pop(target)
 15:   MB_C += target.cycle; RM_C -= target.cycle
 16:   AVL_CB = max(AVL_CB - target.cycle, 0)
 17:            + target's CB.cycle
 18:   for CB in CB_CQ do
 19:     if CB_C < MB_C then
 20:       CB_CQ.pop(CB)
 21:       CB_C += CB.cycle
 22:       CB_SQ.push(CB)
 23: [Whenever the previous CB finishes]
 24: target = CB_SQ.pop(first)
 25: RM_C += target.cycle

is enough to serve them, regardless of the sub-layer boundary. Figure 12b presents the MB prefetching mechanism and its effect. MB prefetching prefetches the MB in Part-2 just after the MB in Part-1 finishes. It successfully amortizes the resource requirements by utilizing the idle memory bandwidth in Part-1. Moreover, as the earlier execution of the MB in Part-2 resolves the corresponding CB earlier, we see that the PE array idleness of Part-2 is also reduced.

2) Compute Block (CB) Merging: When the prefetched MBs become larger than the corresponding CBs, PE array underutilization occurs due to the dependency (Figure 12b). In this case, the PE arrays are idle until the scheduled MB finishes, resolves the dependency, and generates a new CB candidate. To further improve PE array utilization, CB merging ensures the available CBs are large enough to schedule the next MB.

CB merging schedules MBs and CBs whenever the executing MB finishes. CB merging first selects the next MB, and then schedules multiple CBs until the total cycles taken by the scheduled CBs become larger than the scheduled MB. This mechanism keeps the PE arrays busy during the selected MB's execution. To ensure the available CBs are large enough, CB merging has a variable called AVL_CB which tracks the total cycles required by the available CBs whose dependency with the corresponding MB is resolved. If AVL_CB becomes smaller than a pre-defined threshold, CB merging extends it by giving high priority to MBs whose cycles are smaller than those of the corresponding CBs. Then, AVL_CB increases faster than the scheduled MBs, and CB merging becomes able to keep AVL_CB large enough to schedule the next MB.

Figure 12c shows an example scheduling scenario of CB merging, and the numbers written in the blocks indicate the scheduling order that CB merging selects. For example, when CB merging selects block-2, it also selects block-3 to make the scheduled CBs larger than the scheduled MB. Next, CB merging selects block-4 rather than block-5, which was originally selected (Figure 12b), as the available CBs (i.e., blocks 3, 6, and 7) are not enough to cover block-5. As block-4's corresponding CB (block-7) is larger than block-4, CB merging successfully extends the available CBs for the large MBs. At this time, CB merging does not select any CB, as the scheduled CBs are already enough to cover block-4. In practice, we only track the total cycles that will be taken by the available CBs rather than tracking the list of them.

3) Combining MB Prefetching and CB Merging: Algorithm 2 shows how AI-MT's scheduler assigns MBs and CBs based on MB prefetching and CB merging. In the algorithm, the scheduler has three kinds of variables.

First, the scheduler has MB_C and CB_C to track the total cycles taken by the scheduled MBs and CBs. Whenever the previous MB or CB finishes, the scheduler adds the block cycles to MB_C or CB_C.
merging, and the numbers written in the blocks indicate
the cycles taken by the scheduled MB as it will be used to

Second, the scheduler has AVL_CB to track the total cycles required by the available CBs. AVL_CB increases when an MB resolves the corresponding CB, and decreases whenever an MB finishes (lines 16-17). Note that AVL_CB decreases by the cycles taken by the scheduled MB, as it will be used to overlap that MB. If AVL_CB is too small to overlap the scheduled MB, the scheduler sets AVL_CB to zero (line 16). Also, when the scheduler fails to find the next MB, it stalls until the executing CB finishes. In this case, AVL_CB decreases by the cycles required by the executing CB (line 12).

Third, the scheduler has RM_C, which indicates the MB cycles required to fill the remaining SRAM capacity. The scheduler uses RM_C to check whether the remaining SRAM capacity is enough to prefetch the next MB. At initialization, RM_C is set to the cycles required to fill the given SRAM capacity, and it then increases or decreases whenever an MB is allocated or a CB finishes.

Using the three variables, the scheduler schedules the next MB and CBs. First, the scheduler assigns the next MB whenever the executing MB finishes (MB prefetching). The scheduler iterates over the MB candidate queue and selects the next MB (target in Algorithm 2) whose required cycles are smaller than RM_C (lines 3-4, 9). In case AVL_CB is smaller than a pre-defined threshold, the scheduler extends AVL_CB quickly by giving high priority to MBs whose required cycles are smaller than those of the corresponding CBs (lines 5-7). If no MB in the candidate queue is smaller than RM_C, the scheduler waits until the executing CB finishes to recover the corresponding SRAM capacity (line 11). After the scheduler selects the next MB (target), it finds CBs to overlap the selected MB's execution (lines 18-22). The selected CBs are inserted into CB_SQ (the CB Selected Queue) and wait until the earlier scheduled CBs finish (lines 24-25).
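Putting the three variables together, the scheduler's decisions can be sketched as the functions below, which mirror the structure of Algorithm 2 (candidate-queue scan, AVL_CB threshold test, RM_C capacity test, and CB merging into the selected queue). This is a behavioral sketch with illustrative data structures, not the hardware schedule logic.

def schedule_on_mb_finish(state):
    # state: dict with lists MB_CQ, CB_CQ, CB_SQ (blocks are dicts carrying
    # "cycle" and "cb_cycle") and counters MB_C, CB_C, AVL_CB, RM_C, threshold.
    target = None
    for mb in state["MB_CQ"]:                       # lines 3-9: pick the next MB
        if mb["cycle"] < state["RM_C"]:
            if state["AVL_CB"] < state["threshold"]:
                if mb["cb_cycle"] > mb["cycle"]:    # prefer MBs that grow AVL_CB
                    target = mb
                    break
            else:
                target = mb
                break
    if target is None:                              # lines 10-12: no room, stall
        return None
    state["MB_CQ"].remove(target)                   # lines 14-17: commit the MB
    state["MB_C"] += target["cycle"]
    state["RM_C"] -= target["cycle"]
    state["AVL_CB"] = max(state["AVL_CB"] - target["cycle"], 0) + target["cb_cycle"]
    while state["CB_CQ"] and state["CB_C"] < state["MB_C"]:  # lines 18-22: CB merging
        cb = state["CB_CQ"].pop(0)
        state["CB_C"] += cb["cycle"]
        state["CB_SQ"].append(cb)
    return target

def on_cb_finish(state):
    # lines 23-25: the finished CB releases the SRAM capacity its weights held.
    done = state["CB_SQ"].pop(0)
    state["RM_C"] += done["cycle"]
    return done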
Fig. 13: SRAM capacity aware scheduling. (a) Resource idleness under SRAM capacity shortage; (b) priority mechanism of the early MB eviction; (c) split mechanism of the early MB eviction.

C. SRAM Capacity Aware Scheduling

When the SRAM is short of free space, the load balancing scheduling mechanism can incur memory bandwidth idleness during a large CB's execution (Figure 13a-1). During the large CB's execution, AI-MT can continuously prefetch multiple MBs to fully utilize the memory bandwidth, which requires a large SRAM capacity. Also, the large CB frees its corresponding MB only after its long execution. Especially if the remaining SRAM capacity is smaller than a large MB (i.e., an FC sub-layer's MB), the problem worsens, as the scheduler can continuously select only small MBs (i.e., CONV sub-layers' MBs) (lines 4 and 9 in Algorithm 2, Figure 13a-2).

To alleviate the memory bandwidth idleness coming from the limited on-chip SRAM capacity, we introduce early MB eviction, which schedules and evicts SRAM capacity-critical MBs to minimize the on-chip memory's capacity requirement. When selecting the next MB and CB, Algorithm 2 simply iterates from the first element in the corresponding candidate queue (lines 9 and 20). Rather than finding the next blocks from the front, early MB eviction gives high priority to the SRAM capacity-critical MBs whose cycles are larger than the cycles of the corresponding CBs (i.e., FC sub-layers' MBs). As the large MB occupies a large SRAM capacity and the corresponding CB is relatively small, this can recover a large amount of SRAM capacity quickly. Also, AI-MT schedules the smallest CBs first when the SRAM is short of free space. Small CBs recover SRAM capacity quickly by freeing the corresponding weight values in the SRAM. Figure 13b shows the effects of the mechanism.

Next, when a CB is too long due to a large batch size or the network configuration, the SRAM capacity becomes full during the CB's execution, which also incurs memory bandwidth idleness (Figure 13a-1). To alleviate this issue, early MB eviction halts the current long CB execution and schedules smaller CBs first to recover SRAM capacity quickly (Figure 13c), rather than waiting until the executing large CB finishes (line 11 in Algorithm 2).

To halt and resume a CB later, AI-MT records the status of the target CB and recovers the CB candidate queue. First, AI-MT inserts the target CB into the candidate queue again and pops the target's next CB, which has a dependency on the target CB. This prevents the next CB from executing earlier than the target CB. Then, AI-MT records the currently executing input's address to the si_addr column and the remaining cycles to the sr_cycles column in the sub-layer scheduling table (Figure 11). When the scheduler resumes the target CB later, the execution starts by referring to si_addr.

This mechanism has two main overheads, which can be acceptable to improve resource utilization. First, it requires penalty cycles to fill the PE arrays again when the halted CB resumes, which can reduce PE array resource utilization. However, considering that the filling cycles are relatively small compared to the large CB execution, it has more potential to improve the total resource utilization. Next, the CB split reads the corresponding weight values again when the scheduler resumes the CB, which incurs additional SRAM read energy consumption.

Still, the additional energy consumption can be acceptable, as the large CB usually comes from a compute-intensive layer (i.e., a CONV layer), which requires a small amount of weight values by sharing the same weight values across all the PE arrays (Algorithm 1).

D. Alternative Software Implementation of AI-MT

In this work, we propose a hardware implementation of AI-MT, but it is also possible to implement our mechanism only in software by implementing the scheduling layer between the software framework (e.g., TensorFlow) and the accelerator. A software-based implementation is expected to achieve a similar performance improvement if the following conditions are met. First, the software-based scheduling should be fast. Second, the sub-layer execution should be long enough (i.e., coarse-grain sub-layers) to hide the software scheduling overhead. Third, sub-layer splits should occur infrequently enough to avoid over-populating the transfer medium (e.g., PCIe). For our current execution environment, we observe coarse-grain sub-layer executions (i.e., thousands of cycles) and infrequent sub-layer splits, which make the software-based implementation as effective as the hardware-based implementation.

Aside from the software implementation's potential, we believe that our hardware implementation will be more effective than the alternative software implementation for more aggressive multi-NN execution scenarios, where a faster accelerator would aim to run a larger number of heterogeneous neural network models. This execution model would get more benefits from finer-grain splits, but it would increase the gap between the software-based scheduling and hardware-based execution, the SRAM capacity contention, and the transfer medium contention. Therefore, the hardware-based fast scheduling can get more performance benefits without increasing the transfer medium bandwidth. But, with the hardware implementation overhead considered, we believe hardware/software collaborative approaches will be promising as another research direction.

V. EVALUATION

A. Evaluation Setup

In this work, we evaluate our AI-MT by extending a systolic array cycle-accurate simulator to enable multi-neural network execution support. We take the baseline hardware parameters from recent TPU specifications [1] and scale up the number of PE arrays from two to 16 to model the effects of the reduced bit precision (8-bit integer, 2x less than TPUv3's) and high-end HBM technologies (450 GB/s for each core, 1.5x higher than TPUv2's) [4]. Our simulation framework also implements physically decoupled buffers for input features, output features, and weights (Section II-B), the weight-stationary dataflow (Section II-C), and sub-layer-granularity scheduling (Section II-D). We set the size of the SRAM buffer used for prefetching and storing weight values to 1 MB. The 18 MB on-chip SRAM buffers are used for storing input and output features from multiple concurrent neural network workloads. The architectural parameters are summarized in Table I.

TABLE I: Hardware and architecture parameters
  Parameter                         | Value
  Processing Element Dimension      | 128 x 128
  # Processing Element Arrays       | 16
  Frequency                         | 1 GHz
  Memory Bandwidth                  | 450 GB/s
  On-Chip SRAM Size (Input/Output)  | 18 MB
  On-Chip SRAM Size (Weight)        | 1 MB

TABLE II: Neural network workloads and their configurations
  Name      | FC Layers | CONV Layers | Batch Size
  ResNet34  | 1         | 36          | 1-32
  ResNet50  | 1         | 53          | 1-32
  VGG16     | 3         | 13          | 1-32
  MobileNet | 1         | 27          | 1-32
  GNMT      | 6         | -           | 1-32

We evaluate our AI-MT with multi-neural network benchmarks synthesized from MLPerf Inference [3] and VGG16 [50]. Table II summarizes the configurations and characteristics of the target workloads. Our target workloads contain four CNN and one RNN workloads, each of which has representative layer characteristics and covers typical neural network workloads running on clouds as reported by service providers [33]. In the case of GNMT, we assume that the embedding lookup remains on the CPU. We co-locate neural networks which have two distinct resource-utilization characteristics. For instance, we combine memory-intensive networks (e.g., VGG16 with large FC layers and GNMT) and multiple compute-intensive networks (e.g., ResNet34, ResNet50, and MobileNet). To provide a balanced distribution of CBs and MBs, we iteratively run memory-intensive workloads to properly match the amount of CBs produced by compute-intensive workloads.

B. Multi-NN Execution Latency

Figure 14 shows the speedup results for all co-located neural network workloads over the FIFO mechanism (Figure 6a) using a single batch. We first evaluate the performance impact of MB prefetching by applying it to the RR mechanism introduced in Section III-B. AI-MT achieves speedups of up to 1.34x and 1.05x over the baseline when co-locating CNNs with GNMT and VGG16, respectively. For the overall workload set, AI-MT achieves a geomean performance improvement of 1.13x. Even though VGG16 has large memory-intensive FC layers, its compute-intensive CONV layers, which come earlier than the memory-intensive layers, reduce the opportunity to overlap the two different resource-demand sub-layers. On the other hand, co-locating GNMT achieves a higher performance improvement, as it consists of many memory-intensive FC layers which can be overlapped with sub-layers of different resource demands.

Fig. 14: Speedup of AI-MT over network-serial multi-neural network execution (Baseline)

Fig. 15: Sensitivity test with different batch sizes
Fig. 16: Sensitivity test with different SRAM sizes

Then, we apply both CB merging and MB prefetching to the neural network workloads. CB merging achieves speedups of up to 1.57x compared to the baseline by fully utilizing the computing resources, and it also improves the geomean performance improvement to 1.33x for all the workloads. Similar to MB prefetching, its effect varies depending on the co-located neural networks due to their different resource-demand requirements.

Lastly, we measure the performance impact of early MB eviction, and AI-MT achieves a geomean performance improvement of 1.33x over the baseline. With a small batch size, the cycles taken by CBs are small and the performance improvement is marginal compared to the other scheduling mechanisms. To further show the impact of early MB eviction with large CBs, we show a sensitivity test of different batch sizes in Section V-C.

C. Sensitivity Test: Batch Size

Figure 15 shows the sensitivity results for different batch sizes when multiple CNNs are running with GNMT. To verify the impact of early MB eviction, we assume that the SRAM capacity is large enough to accommodate the input and output features for executions with large batch sizes (i.e., larger than 8). The results show that attaching early MB eviction (i.e., AI-MT (All)) further improves performance over just applying MB prefetching and CB merging at the large batch sizes. With the RN34 and GNMT workloads, for example, applying only MB prefetching and CB merging achieves 1.29x speedup, while applying all three schemes achieves 1.47x speedup over the baseline. When the batch size increases, CBs become larger and the limited SRAM capacity creates a bottleneck for fully utilizing the memory bandwidth. As early MB eviction quickly recovers the SRAM capacity by releasing the SRAM capacity-critical MBs first, it successfully achieves the performance improvement with the large batch sizes. The reduction of the maximum performance improvement in the figure comes from the PCIe transaction overhead, as transferring input and output features becomes dominant in the total computation latency.

D. Sensitivity Test: On-chip SRAM Size

Figure 16 shows the speedup results over the network-serial multi-neural network execution under various on-chip SRAM sizes. For this experiment, we assume that the input and output buffers are enough to accommodate the feature maps when the batch size is 8. We also execute our workloads iteratively to see the impact of the long execution of neural networks, like a cloud server environment where multiple neural networks come into the accelerator continuously.

To see the required SRAM capacity, we apply two baseline scheduling mechanisms with MB prefetching. First, we apply the naive scheduling mechanism (Figure 9a), which schedules the compute-intensive neural networks first and then schedules the memory-intensive neural networks. As it can prefetch all the weight values during a large amount of CBs, it achieves ideal performance when the SRAM capacity is infinite.

E. Power & Area Overheads

We estimate the static power and area overheads of the memory blocks required to support AI-MT using CACTI 7.0 [39] with a 28-nm technology (Table III). In the estimation, we assume that five neural networks can be executed concurrently, as in our evaluation setup. The results show that the overheads of the additional elements to support AI-MT are negligible compared to the large input and output buffers.

TABLE III: Power and area overheads of on-chip memory blocks

Memory Block                              Power [mW]   Area [mm^2]
Input/Output buffer (18 MB)               3575.872     119.399
Weight buffer (1 MB)                      170.408      3.843
Sub-layer scheduling table (3 KB * 5)     2.897        0.0592
CQs and SQ (64 Bytes)                     0.0172       0.000261
Weight management table (64 Bytes)        0.0168       0.000244
Free list (64 Bytes)                      0.0168       0.000244

The small overheads of AI-MT also make it easy to scale the number of networks. AI-MT keeps one sub-layer scheduling table per neural network, and the maximum number of entries in the candidate queue equals the number of networks. The major per-network overhead is the sub-layer scheduling table (3 KB), which is small enough to scale.

Our scheduling mechanism also incurs a negligible amount of computation overhead. To run a CB, the accelerator performs 128 × 128 MAC operations per PE array at every cycle. In contrast, our scheduling mechanism requires only dozens of addition and comparison operations for each MB scheduling decision (Algorithm 2), which amounts to far fewer operations performed far less frequently.
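To give a concrete sense of the scale of this overhead, the sketch below shows the kind of additions and comparisons such a per-MB decision involves. It is a minimal illustration that abstracts away the details of Algorithm 2; the MemoryBlock fields, the free-SRAM and outstanding-CB counters, and the constants in the example are all assumptions.

```python
# A minimal sketch of the per-MB bookkeeping described above; it abstracts
# away the details of Algorithm 2, and every field name and constant here is
# illustrative. The decision reduces to a few additions and comparisons:
# does the candidate memory block (MB) fit in the free weight-buffer space,
# and can its DRAM transfer be hidden behind the compute blocks (CBs) that
# are still outstanding?

from dataclasses import dataclass

@dataclass
class MemoryBlock:
    weight_bytes: int   # SRAM space the MB's weights occupy
    load_cycles: int    # cycles needed to fetch the MB from DRAM

def can_schedule(mb: MemoryBlock,
                 free_sram_bytes: int,
                 outstanding_cb_cycles: int,
                 inflight_load_cycles: int) -> bool:
    """True if prefetching `mb` now neither overflows the weight buffer nor
    exposes memory latency (its load finishes behind the pending CBs)."""
    fits_in_sram = mb.weight_bytes <= free_sram_bytes
    latency_hidden = (inflight_load_cycles + mb.load_cycles
                      <= outstanding_cb_cycles)
    return fits_in_sram and latency_hidden

# Example: a 256 KB MB taking 2,000 cycles to load is schedulable while 1 MB
# of weight buffer is free and 10,000 cycles of CB work remain outstanding.
print(can_schedule(MemoryBlock(256 * 1024, 2_000), 1 << 20, 10_000, 3_000))
```

Such a check runs once per scheduled MB rather than every cycle, which is why its cost stays far below the MAC throughput of the PE arrays.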
VI. DISCUSSION

A. Input and Output Buffer Capacity

This work focuses on reducing the SRAM capacity requirement by storing only a minimal set of weight values when executing multiple neural networks. However, increasing the number of neural networks or increasing the batch size can increase the SRAM capacity requirement as well. We currently assume that the SRAM has a dedicated space large enough to store the required input and output features. If the required size becomes larger than the available SRAM capacity, it might be useful to deploy a preemption mechanism that keeps only a minimal working set of input and output features, along with ways to mitigate the associated overhead.
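The sketch below illustrates one form such a preemption mechanism could take: keep the feature maps of the networks that will be resumed soonest on chip and spill the rest to DRAM. The policy, the per-network fields, and the sizes in the example are assumptions made for illustration, not part of AI-MT.

```python
# A hedged sketch of the preemption idea above: if the co-running networks'
# input/output feature maps no longer fit in the on-chip buffer, keep only a
# minimal working set (the soonest-needed features) and spill the rest to
# DRAM. The policy, the tuple fields, and all sizes below are illustrative
# assumptions, not part of AI-MT.

def working_set(networks, buffer_bytes):
    """networks: list of (name, feature_bytes, next_use_cycle) tuples.
    Returns (kept, spilled) such that the kept feature maps fit within
    buffer_bytes, preferring the networks that are needed soonest."""
    kept, spilled, used = [], [], 0
    for name, feat_bytes, _ in sorted(networks, key=lambda n: n[2]):
        if used + feat_bytes <= buffer_bytes:
            kept.append(name)
            used += feat_bytes
        else:
            spilled.append(name)   # preempted: features written back to DRAM
    return kept, spilled

# Hypothetical mix: three networks sharing an 18 MB input/output buffer.
nets = [("RN34", 6 << 20, 100), ("GNMT", 9 << 20, 400), ("MobileNet", 12 << 20, 250)]
print(working_set(nets, 18 << 20))   # -> (['RN34', 'MobileNet'], ['GNMT'])
```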
B. Spatial Aspects of the PE Array Utilization

This work focuses on exploiting the temporal aspects of the PE arrays to improve resource utilization. To further improve utilization, we believe that the spatial aspects of the PE array can be addressed together. Depending on the dimensions of the neural networks and their mapping strategies, underutilization along the spatial dimension can become a non-trivial issue. For example, when a CB is too small to fully utilize all the MAC units of the PE array, it might be possible to perform multiple CBs at the same time to improve the resource utilization within the PE array.
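As a rough illustration of this direction, the sketch below packs CBs that each occupy only a fraction of a 128-column PE array into shared passes using a first-fit policy; the widths and the packing policy are hypothetical, and AI-MT does not implement this today.

```python
# An illustrative sketch (not implemented in AI-MT): if a CB's weight tile
# occupies only some of the 128 columns of a PE array, several such CBs could
# be mapped to disjoint column groups and executed in the same pass. The
# first-fit-decreasing packing and the CB widths below are assumptions.

PE_COLS = 128

def pack_cbs(cb_widths):
    """Group CBs (by the number of PE columns each needs) so that every
    group fits within one 128-column PE array."""
    groups = []                       # each entry: [used_columns, [widths]]
    for width in sorted(cb_widths, reverse=True):
        for group in groups:
            if group[0] + width <= PE_COLS:
                group[0] += width     # place the CB next to earlier ones
                group[1].append(width)
                break
        else:
            groups.append([width, [width]])
    return [g[1] for g in groups]

# Four narrow CBs share two array passes instead of four mostly idle passes.
print(pack_cbs([96, 64, 48, 32]))     # -> [[96, 32], [64, 48]]
```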
VII. RELATED WORK

A. Systolic Array DNN Accelerators

Systolic array architectures are useful for computing DNN models dominated by matrix computations. For example, Google's TPU [33] comprises two-dimensional systolic arrays and embraces the weight-stationary dataflow. It preloads weight values onto the matrix-multiply units and performs multiply-accumulate operations by forwarding and reducing partial sums along rows or columns. Moreover, recent TPUs [1] have a high bandwidth memory (HBM) dedicated to each core and thus avoid the critical memory bandwidth bottleneck incurred by the earlier versions of the architecture. Similarly, Eyeriss [14] utilizes a systolic array to exploit the data reuse characteristics of modern CNNs, but leverages the row-stationary dataflow to maximize data reuse and accumulation at the RF level. Its follow-up work, Eyeriss v2 [15], handles various layer dimensions and sizes through a flexible on-chip network. Also, Xilinx's FPGA-based xDNN systolic-array architecture provides two operational modes, throughput-optimized and latency-optimized, by adopting different mapping strategies for CNNs' early layers and adjusting its pipeline stages.
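The functional sketch below mirrors this weight-stationary data movement: weights stay resident in a K x N "PE grid" while activations stream through and partial sums are reduced along the columns. It ignores the cycle-level skewing of a real systolic array and is meant only to illustrate the dataflow, not the TPU's implementation.

```python
# A functional sketch of the weight-stationary dataflow: each "PE" holds one
# weight of W, input activations stream across the array one row of X at a
# time, and partial sums are reduced down the columns. This mirrors the data
# movement only; real systolic arrays skew inputs cycle by cycle.

import numpy as np

def weight_stationary_matmul(X, W):
    """Compute X @ W (X: MxK, W: KxN) with W held fixed in a KxN PE grid."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"
    Y = np.zeros((M, N))
    for m in range(M):                     # one input row per pass
        partial = np.zeros(N)              # column-wise running sums
        for k in range(K):                 # row k of the PE grid
            partial += X[m, k] * W[k, :]   # each PE multiplies its held weight
        Y[m] = partial
    return Y

X, W = np.random.rand(4, 128), np.random.rand(128, 128)
assert np.allclose(weight_stationary_matmul(X, W), X @ W)
```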
Although they demonstrated that using a systolic array architecture is highly effective at dealing with varying DNNs, simultaneous multi-network (or multi-threading) techniques have not been fully explored. SMT-SA [48] employs a simultaneous multi-threading technique to address the underutilization problem caused by zero-valued inputs. However, it still cannot handle the underutilization caused by layers' different resource-usage characteristics and scheduling methods.

B. Optimized DNN Dataflows

Hardware accelerators employ various dataflows to optimize data access and communication patterns. Chen et al. classified neural network accelerators into dataflow categories based on their data reuse characteristics [14]. For example, the weight-stationary [7], [13], [16], [33], [43], [46], [47], [51], [52], [59], output-stationary [22], [51], and row-stationary [14], [15], [26] dataflows determine spatial and temporal data mappings to PEs differently. In addition, Yang et al. [57] provided a formal taxonomy and covered more dataflows employed in recent NN accelerators [23], [25], [31], [37], [38], [41], [43]–[45].
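To make this taxonomy concrete, the two loop nests below compute the same matrix product but keep a different operand resident in the innermost loop; they illustrate the categories only and do not correspond to any particular accelerator's mapping.

```python
# Two loop nests for the same product C = A @ B; only the loop order (and
# hence which operand stays resident, i.e., "stationary") differs. Purely an
# illustration of the dataflow categories, not any accelerator's mapping.

import numpy as np

def weight_stationary(A, B):
    """Each B[k, n] (a "weight") is fetched once and reused across all rows m."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for k in range(K):
        for n in range(N):
            b = B[k, n]                 # weight held stationary
            for m in range(M):
                C[m, n] += A[m, k] * b
    return C

def output_stationary(A, B):
    """Each C[m, n] accumulates locally until complete, then is written once."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0                   # output partial sum held stationary
            for k in range(K):
                acc += A[m, k] * B[k, n]
            C[m, n] = acc
    return C

A, B = np.random.rand(8, 16), np.random.rand(16, 8)
assert np.allclose(weight_stationary(A, B), A @ B)
assert np.allclose(output_stationary(A, B), A @ B)
```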

mRNA [60] and MAESTRO [36] explore the costs and benefits of various dataflow strategies for varying hardware configurations based on energy-performance trade-off analysis. MAESTRO [36], using the trade-off analysis results, introduces a set of compiler directives to specify the preferred DNN dataflow. However, they face challenges in supporting multiple networks or multiple threads, since they do not take into account situations in which concurrent execution is allowed. We believe that our simultaneous multi-DNN support can further improve single-network performance as well, by adopting such dataflow optimizations when we split DNN layers into multiple iteration blocks.
C. Multi-DNN Scheduling

There is a huge demand for system-level and architectural support for multi-DNN scheduling to maximize hardware utilization and reduce the cost of running large-scale production systems. For example, NVIDIA's TensorRT provides concurrent DNN execution support, with which users can run multiple DNNs on the same GPUs simultaneously. In addition, Baymax [12] and Prophet [11] address QoS and utilization problems of current multi-DNN execution on GPUs. On the other hand, PREMA [17] proposes a preemptive scheduling algorithm with multi-DNN support and explores various preemption mechanisms to reduce the overhead. Although they demonstrate that diverse multi-DNN scheduling optimizations are effective at meeting restricted latency requirements and increasing hardware utilization, they do not handle the simultaneous execution of multiple DNN models and thus cannot obtain optimal performance under SMT support and decoupled architectures.
VIII. CONCLUSION

In this paper, we propose AI-MT, a novel processor architecture which enables cost-effective, high-performance multi-neural network execution. Motivated by the severe underutilization of existing accelerators, mainly due to the load imbalance between computation and memory-access tasks, AI-MT proposes memory block prefetching and compute block merging for the best resource load matching. Then, to minimize its on-chip SRAM capacity requirement during runtime, AI-MT applies memory block eviction, which schedules and evicts SRAM-capacity-critical MBs early. Combining all these methods, AI-MT successfully achieves its performance improvement with the minimum SRAM capacity.
ACKNOWLEDGMENT

This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1901-12. We also appreciate the support from the Automation and Systems Research Institute (ASRI) and the Inter-university Semiconductor Research Center (ISRC) at Seoul National University.

REFERENCES

[1] "https://cloud.google.com/tpu/docs/system-architecture," Google Cloud TPU System Architecture.
[2] "https://developer.nvidia.com/tensorrt," NVIDIA TensorRT: Programmable Inference Accelerator.
[3] "https://mlperf.org/inference-overview," MLPerf Inference Benchmark.
[4] "https://news.skhynix.com/hbm2e-opens-the-era-of-ultra-speed-memory-semiconductors," HBM2E Opens the Era of Ultra-Speed Memory Semiconductors.
[5] F. Altché and A. de La Fortelle, "An lstm network for highway trajectory prediction," in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 353–359.
[6] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, "Road scene segmentation from a single image," in European Conference on Computer Vision. Springer, 2012, pp. 376–389.
[7] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer cnn accelerators," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
[8] M. Aly, "Real time detection of lane markers in urban streets," in 2008 IEEE Intelligent Vehicles Symposium. IEEE, 2008, pp. 7–12.
[9] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[10] L. A. Barroso, U. Hölzle, and P. Ranganathan, "The datacenter as a computer: Designing warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 13, no. 3, pp. i–189, 2018.
[11] Q. Chen, H. Yang, M. Guo, R. S. Kannan, J. Mars, and L. Tang, "Prophet: Precise qos prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers," ACM SIGOPS Operating Systems Review, vol. 51, no. 2, pp. 17–32, 2017.
[12] Q. Chen, H. Yang, J. Mars, and L. Tang, "Baymax: Qos awareness and increased utilization for non-preemptive accelerators in warehouse scale computers," ACM SIGPLAN Notices, vol. 51, no. 4, pp. 681–696, 2016.
[13] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM Sigplan Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.
[14] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 367–379.
[15] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.
[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., "Dadiannao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[17] Y. Choi and M. Rhu, "Prema: A predictive multi-task scheduling algorithm for preemptible neural processing units," arXiv preprint arXiv:1909.04548, 2019.
[18] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman et al., "Serving dnns in real time at datacenter scale with project brainwave," IEEE Micro, vol. 38, no. 2, pp. 8–20, 2018.
[19] J. Clark, "Google turning its lucrative web search over to ai machines," Bloomberg Technology, vol. 26, 2015.
[20] J. Dean, "Recent advances in artificial intelligence via machine learning and the implications for computer system design," in 2017 IEEE Hot Chips 29 Symposium, 2017.
[21] E. Derman and A. A. Salah, "Continuous real-time vehicle driver authentication using convolutional neural network based face recognition," in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 577–584.
[22] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: Shifting vision processing closer to the sensor," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 92–104.
[23] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 383–396.
[24] H. M. Eraqi, M. N. Moustafa, and J. Honer, "End-to-end deep learning for steering autonomous vehicles considering temporal dependencies," arXiv preprint arXiv:1710.03804, 2017.
[25] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al., "A configurable cloud-scale dnn processor for real-time ai," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 1–14.
[26] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3d memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751–764.
[27] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "Eie: efficient inference engine on compressed deep neural network," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
[28] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge, R. G. Dreslinski, J. Mars, and L. Tang, "Djinn and tonic: Dnn as a service and its implications for future warehouse scale computers," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 27–40.
[29] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro et al., "Applied machine learning at facebook: A datacenter infrastructure perspective," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[31] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. Fletcher, "Ucnn: Exploiting computational reuse in deep neural networks via weight repetition," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 674–687.
[32] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[33] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
[34] J. Kocić, N. Jovičić, and V. Drndarević, "An end-to-end deep neural network for autonomous driving designed for embedded automotive platforms," Sensors, vol. 19, no. 9, p. 2064, 2019.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[36] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2019, pp. 754–768.
[37] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance fpga-based accelerator for large-scale convolutional neural networks," in 2016 26th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2016, pp. 1–9.
[38] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 553–564.
[39] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Cacti 6.0: A tool to model large caches," HP Laboratories, vol. 27, p. 28, 2009.
[40] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "Tensorflow-serving: Flexible, high-performance ml serving," arXiv preprint arXiv:1712.06139, 2017.
[41] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27–40.
[42] E. Park, D. Kim, and S. Yoo, "Energy-efficient neural network accelerator based on outlier-aware low-precision computation," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 688–698.
[43] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded fpga platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, pp. 26–35.
[44] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, "vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.
[45] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to fpgas," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
[46] Y. Shen, M. Ferdman, and P. Milder, "Overcoming resource underutilization in spatial cnn accelerators," in 2016 26th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2016, pp. 1–4.
[47] ——, "Maximizing cnn accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 535–547.
[48] G. Shomron, T. Horowitz, and U. Weiser, "Smt-sa: Simultaneous multithreading in systolic arrays," IEEE Computer Architecture Letters, vol. 18, no. 2, pp. 99–102, 2019.
[49] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[50] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[51] M. Song, J. Zhang, H. Chen, and T. Li, "Towards efficient microarchitectural design for accelerating unsupervised gan-based deep learning," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 66–77.
[52] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, pp. 16–25.
[53] R. Valiente, M. Zaman, S. Ozer, and Y. P. Fallah, "Controlling steering angle for cooperative self-driving vehicles utilizing cnn and lstm-based deep networks," in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 2423–2428.
[54] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey et al., "Scaledeep: A scalable compute architecture for learning and evaluating deep networks," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 13–26, 2017.
[55] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[56] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[57] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. E. Bell, J. O. Setter, K. Cao, H. Ha, C. Kozyrakis et al., "Dnn dataflow choice is overrated," arXiv preprint arXiv:1809.04070, 2018.
[58] Z. Yang, Y. Zhang, J. Yu, J. Cai, and J. Luo, "End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 2289–2294.
[59] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.
[60] Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, and T. Krishna, "mrna: Enabling efficient mapping space exploration for a reconfigurable neural accelerator," in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 282–292.
