An Efficient Hardware Architecture For Exploiting Sparsity in Neural Networks

Master Thesis

by

Abstract
Sparsity – the presence of many zero values – is a pervasive property of modern deep neural networks, as it
is inherently induced by state-of-the-art algorithmic optimizations. Recent efforts in hardware design for
acceleration of neural networks have targeted the structure of computation of these workloads. However,
when run on these value-agnostic accelerators, value sparsity is not exploited to provide performance or
efficiency benefits, and instead results in wasted computation. In this thesis, we present architectural
optimizations that efficiently leverage value sparsity in network weights in order to achieve significant
performance benefits, with minimal hardware overhead. The culmination of this work is a hardware
front-end (data fetching and staging unit) which, when paired with our novel, co-designed software
scheduling algorithm, achieves more than a 2× speedup on average for the networks studied, with just
an 8.2% overhead in logic area.
Acknowledgements

I’m extremely lucky to have had a battalion of amazing people support me throughout my graduate
studies at the University of Toronto. To my supervisor, Andreas Moshovos, thank you for being the
most friendly, knowledgeable, and sincere supervisor and person I’ve ever had the honour of working
with. To the friends I’ve made here, and the ones I brought with me in spirit from home, thank you for
being a consistent and welcome source of distraction, camaraderie, motivation, and guidance. To my
parents, whose support, in every sense of the word, has been the one constant in my life, thank you for
everything. Finally, thank you Philippa for encouraging me to take on graduate school, knowing that it
would mean being apart for so long, and yet sticking with me nonetheless. This thesis wouldn’t exist
without your support, encouragement, companionship, transatlantic excursions, and daily phone calls.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Value Sparsity in Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Hardware Acceleration for Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 12
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Value-Agnostic CNN ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Value-Aware CNN ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Evaluation 35
4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Front-End Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Effect of Scheduling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Sensitivity to Sparsity Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Area Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.4 Alternative Interconnect Configurations . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Alternative Back-End Designs - TCTp and TCTe . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusion 48
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography 50
List of Tables
List of Figures
4.4 Effect of the scheduling algorithm on the networks studied . . . . . . . . . . . . . . . . . . 41
4.5 Performance of multiple interconnect patterns at varying sparsity levels . . . . . . . . . . 41
4.6 Performance variation of the T⟨2, 5⟩ design with different scheduling approaches as spar-
sity level changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Performance of additional interconnect configurations relative to the baseline architecture. 44
4.8 Performance as lookaside connections are pruned, with lookahead fixed at 2. . . . . . . . . 44
4.9 Performance of TCTp and TCTe normalized to DaDianNao . . . . . . . . . . . . . . . . . 46
4.10 Speedup of TCT, TCTp, and TCTe over DaDianNao++ using the T8⟨2, 5⟩ configuration
with various main memory technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 1
Introduction
Deep learning is a machine learning (ML) technique that has enjoyed widespread attention from industry
and academia in recent years. Deep neural network (DNN) models have emerged as powerful tools in
a wide range of fields in which traditional algorithms have struggled to achieve satisfactory proficiency.
Perhaps the most widely deployed type of DNN model is the convolutional neural network (CNN), which
is dominant in computer vision tasks [1, 2, 3, 4], but has also seen success in fields as varied as speech
recognition [5], reinforcement learning [6], and text translation [7].
From a hardware perspective, the deep neural network models that are used to implement deep
learning represent a compelling workload for acceleration due to their widespread deployment in con-
sumer and commercial settings, along with their unique computational structure and dataflow. The
vast amount of computation required to run modern DNNs during inference (often on the order of
tens of giga-operations (GOPs) [8]), along with their large memory footprint (commonly hundreds of
MBs [8]) also make them a prime target for custom hardware. Indeed, many recent works have tackled
the design and evaluation of hardware architectures for the acceleration of CNN inference processing.
Some seminal works have investigated architectural techniques for efficiently exploiting the structure
and forms of parallelism present in CNNs, with many highly influential application-specific integrated
circuit (ASIC) architectures as well as field-programmable gate array (FPGA) implementations targeting
CNNs, multi-layer perceptrons (MLPs), recurrent neural networks (RNNs), and other DNN types.
Alongside their basic computational structure, neural networks exhibit unique and interesting value
properties – the distribution of values that appears at runtime for these workloads. Certain
value properties can offer further opportunities for optimizing the hardware architectures designed to
run these networks. Reminiscent of classical architectural approaches for exploiting workload character-
istics such as cache hierarchies, which leverage the spatio-temporal locality of memory accesses, much
investigation has been put into leveraging the implicit value properties of neural networks in hardware.
This is an attractive prospect for several reasons, not least of which is that the massive computational
complexity of neural network inference poses problems in terms of latency, energy, and power constraints,
meaning techniques that can reduce this computational complexity are highly valuable. Additionally,
the excessive memory footprint of modern CNNs is problematic for a multitude of reasons, including
high memory transfer latency and energy, and large on-chip memory requirements. This makes model
compression methods commonplace, many of which introduce even more opportunities for value-aware
computation engines.
One value property that has become prevalent in CNNs in recent years due to model compression
efforts is weight sparsity – a phenomenon in which a large proportion of the network weights are equal
to zero. It is well established that DNNs are often severely over-parameterized [9, 10]. This observation
is used to motivate network pruning, in which a large proportion of the weights in a neural network
are set to zero, resulting in sparse weight tensors. The goal of network pruning is to reduce the overall
memory footprint and computational complexity of the model, without significantly affecting network
accuracy – indeed, the pruning can act as a regularizer, sometimes increasing accuracy [11]. One effect
of weight pruning is that it results in many ineffectual computations, which are computations that do
not affect the final output, and thus may be omitted without negatively impacting the accuracy of the
neural network. These ineffectual computations occur in a sparse neural network any time one of the
pruned, zero-valued weights is multiplied with an activation, and the resulting product is accumulated
to an output. However, when run on value-agnostic hardware, weight pruning applied naively will not
confer any computational benefits. Designing hardware that can skip the ineffectual computations due to
weight sparsity is an enticing endeavour, as such hardware would benefit from greatly reduced execution
time when running these networks.
The goal of this thesis is to describe Bit-Tactical (TCT), an area-efficient front-end architecture (a
hardware architecture that performs data fetching and staging) for a hardware accelerator which exploits
weight sparsity as an avenue for performance improvements, along with a design space exploration
and low-level optimizations for this architecture, and a novel scheduling algorithm which improves the
performance of the front-end by up to 28%. The best performing configuration of Bit-Tactical achieves a
2.03× speedup over an equivalently provisioned value-agnostic architecture, whilst incurring just an 8.2%
overhead in terms of logic area. Further, the Bit-Tactical approach does not impose any restrictions
on the distribution of sparsity in DNNs, meaning it requires no additional effort on the part of the ML
developer when training a sparse neural network. Additional optimizations will be described which can
increase TCT’s performance by as much as 18%, without incurring increased hardware costs. Memory
optimizations will also be described which can reduce the memory overheads associated with TCT’s
front-end scheduling metadata by up to 82%.
1.1 Motivation
Given the widespread deployment of DNNs, many recent algorithmic efforts have focused on decreasing
their computational complexity and storage requirements as a means to improve a number of key metrics
associated with their usage, including energy consumption, inference latency, training time, and ability to
be run on resource-constrained hardware. Along with works that employ quantization [12] and efficient
model design [13], many works have explored model pruning as a means of reducing both the memory
footprint of DNNs, as well as the number of Multiply-Accumulate (MAC) operations required during
inference [10, 11, 14, 15, 16]. In spite of these sparsification efforts, weight sparsity does not confer any
performance improvements for neural network inference when run on value agnostic hardware. That is,
without hardware support for replacing the zero-valued weights with non-zero weights, no reduction in
run time is seen. Rather, the zero-valued weights are still processed like any other weight, despite not
contributing to the output of the network.
Figure 1.1 shows how a sparse weight tensor might be processed using simple parallel hardware
with 4 multipliers. Despite significant weight sparsity, the products of the 4 non-zero weights and
Figure 1.1: How a sparse weight tensor is processed on a simple value-agnostic architecture with parallel
multipliers. Zero valued weights are represented by blank positions. Weights and activations of the
same color need to be multiplied together. Activations paired with zero-valued weights are greyed out
as “don’t care” terms. Values scheduled to be processed in the first cycle appear in the ‘Current Cycle’
window.
corresponding activations will still take 3 cycles to compute on the 4 available multipliers. To remedy
this, recent hardware designs have been proposed which take advantage of weight sparsity using dynamic
routing of values to or from multipliers.
There are many ways of using dynamic routing hardware in order to exploit value sparsity. The
primary requirements are that the correct values to be multiplied together must be routed to the correct
multiplier in the correct cycle, and must be accumulated to the correct output. Cambricon-X is one ar-
chitecture which exploits weight sparsity by allowing near-arbitrary packing of non-zero weights, thereby
requiring fully associative logic to route activations to the correct multipliers, similar to the approach
shown in Figure 1.2. This routing has considerable overheads, however, with the Indexing Module logic
that performs this function accounting for over 31% of total die area, and over 34% of on-chip power
consumption [17].
Another approach to exploiting weight sparsity is that taken by SCNN, which is an accelerator
for sparse CNNs that also targets activation value sparsity [18]. This method makes use of the fact
that the product of any two non-zero values within a given channel in a CNN is a useful computation,
provided it is accumulated to the correct output value. The routing of products between multipliers and
accumulators is shown in Figure 1.3, and requires a large crossbar which scales poorly and can account
for over 21% of on-chip area alone.
The large overheads of the fully-associative, dynamic routing required to fully exploit weight sparsity
in pruned neural networks motivates the need for a new approach, which leverages the large speedup
potential of weight sparsity using lightweight hardware. The opportunity to design such a machine comes
from noticing that, as with many challenges in hardware design, previous approaches lose efficiency by
Figure 1.2: The dense dynamic activation routing approach to exploiting weight sparsity. The non-zero
weights of Figure 1.1 are packed densely in memory, requiring fully-associative activation routing.
Figure 1.3: The dynamic output routing approach to exploiting weight and activation sparsity. All
non-zero values of Figure 1.1 are packed densely in memory, and products are routed to the correct
accumulator.
Figure 1.4: The proposed approach for exploiting weight sparsity with limited overheads using small
multiplexers. Extra wires coming from activation memory are used to deliver multiple inputs to each
multiplexer.
attempting to exploit all of the potential performance without regard for hardware complexity. Instead,
the opportunity for efficiency comes from settling for most of the potential performance gains, whilst
overall offering a much more attractive complexity-to-performance ratio.
This thesis proposes Bit-Tactical, a front-end architecture and interconnect designed with the goal
of exploiting most of the available weight sparsity present in DNNs, with minimal hardware overheads.
Figure 1.4 gives an overview of Bit-Tactical’s approach, in which limited routing flexibility provided
by small multiplexers before the input of each multiplier allows activations to be routed to the correct
multipliers, provided we place constraints on how weights can deviate from their original position in the
dense schedule. The hope is that this constrained routing ability will still provide enough flexibility to
extract most of the available performance from weight sparsity, provided it is designed intelligently.
1.2 Contributions
The focus of this thesis is to explore ways of efficiently leveraging value sparsity in neural networks using
minimal hardware, in order to achieve performance improvements without large hardware overheads. To
this end, the primary contributions made in this thesis are as follows:
• A novel front-end interconnect that allows value sparsity in neural networks to be exploited using
a sparse shuffling network, achieving up to a 2.62× speedup over an equivalent value-agnostic
accelerator, with logic area overheads of just 8.2%.
• A heuristic scheduling algorithm co-designed with the front-end interconnect, capable of increasing
front-end performance by up to 28% on the DNNs studied.
• Hardware and dataflow optimizations which reduce the memory overheads of the scheduling meta-
data by up to 82%, and increase performance by up to 18%, respectively.
The majority of the work related to the front-end architecture described in Chapter 3 appears in
Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks,
published in the proceedings of the 24th International Conference on Architectural Support for Program-
ming Languages and Operating Systems, in April 2019. Further results appear in Accelerating Image-
Sensor-Based Deep Learning Applications, published in IEEE Micro, Volume 39, Issue 5, in September
2019.
Chapter 2

Background
This chapter will introduce the background and prior works necessary to understand, and give context
to, the contributions of this thesis. First we will introduce the target workload itself (neural networks
and their derivatives), before outlining the relevant characteristics of the workload that make it a prime
target for hardware acceleration. Finally, a comprehensive survey of related works will place this work
in the context of prior state-of-the-art research in the field.
Figure 2.1: The McCulloch-Pitts model of a neuron, in which the dot product of a vector of input
activations and synaptic weights is passed through a non-linear function f to produce an output
activation [19].
Neural networks are loosely inspired by the operation of biological neurons in the brain, in that the
artificial neurons in a neural network produce a positive output only if the sum of their weighted inputs
exceeds some threshold, as modelled by the McCulloch-Pitts neuron in Figure 2.1 [19]. Neural networks
are built up of layers of artificial neurons, with each layer containing many neurons. Neurons in adjacent
layers are connected to one another by synaptic weights, with the output of one layer becoming the input
of the next layer, and so on. In a fully connected (FC) layer, all output activations are connected by
synapses to every neuron of the following layer. Other forms of inter-layer connectivity are possible.
By building up many layers, a deep neural network is created.
Neural network computation has two phases: training and inference. During training, the synaptic
weights are ‘learned’ by a training algorithm which attempts to maximize the accuracy of the network in
completing some task. Commonly, modern networks are trained by a gradient descent-based algorithm,
with gradients computed by the backpropagation algorithm, which exploits the fact that the network
forms a directed acyclic graph (DAG) to apply automatic differentiation efficiently.
Once a network is trained, its weights and structure are fixed, and it can be deployed to make inferences.
During inference, a neural network takes in an input, computes the activations for every layer and feeds
them forward through the network, eventually producing an output at the last layer. The input and
output depend on the task that the network has been trained for. Commonly, inference is performed on
a single input at a time, rather than ‘batching’ multiple inputs together, as inference is often a latency-
critical task [20]. However, where latency constraints and hardware permit it, performing inference on a
batch of inputs is desirable, as it allows weights to be reused on each of the inputs, reducing the number
of times they have to be read from off-chip. This reduces total inference energy over the batch, as off-chip
transfers are 2 to 3 orders of magnitude more energy intensive and slower than integer multiplies.
Figure 2.2: A CNN composed of 3 convolutional layers, 1 pooling layer, and 1 fully connected layer.
A neural network consists of layers, each of which performs a linear transformation on its input, before
applying an element-wise non-linearity to each output. In CNNs, many of these layers consist of linear
transformations that take the form of convolution operations between the input and a set of filters.
The filters in each layer act as learned feature detectors (hence activations are sometimes referred to as
‘feature maps’). By stacking multiple convolutional layers, as in Figure 2.2, the neural network acts as
a hierarchical feature extractor, in which each layer can detect features that represent more and more
complex patterns in the original input. Inputs may represent any arbitrary information, and can take
the form of RGB images, graphs, sentences, or audio spectrograms, to name a few examples.
Figure 2.3: A convolutional layer, in which an input volume of size X × Y × C is convolved with K
filters, each of size R × S × Ck, to produce an output of size W × H × K. Note that filter channel depth
(Ck) need not necessarily be the same as input channel depth (C). The rectified linear unit (ReLU)
non-linear activation function is applied element-wise to the output of the convolution operation.
The filters in a layer are typically composed of a 4-dimensional tensor of synaptic weights (or param-
eters). Each 3-dimensional filter is convolved with the entire input space to produce an output channel,
as shown in Figure 2.3. Parameters of the convolution include stride (how many elements the filter is
stepped across the input by) and padding (additional zero-valued elements added to the edges of each
input channel to maintain output size).
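To make the effect of these parameters concrete, the short sketch below computes the spatial output dimensions of the convolution in Figure 2.3 using the usual output-size formula; the function name and the example sizes are illustrative only.

def conv_output_size(X, Y, R, S, stride=1, pad=0):
    # Each spatial dimension shrinks by the filter extent, grows with the padding,
    # and is then sub-sampled by the stride.
    W = (X - R + 2 * pad) // stride + 1
    H = (Y - S + 2 * pad) // stride + 1
    return W, H

# A 224x224 input convolved with a 3x3 filter, stride 1 and padding 1,
# keeps its spatial size:
print(conv_output_size(224, 224, 3, 3, stride=1, pad=1))  # (224, 224)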
CNNs can contain layers other than convolutional layers, and commonly contain pooling layers,
which perform down-sampling on the input, either in the form of average pooling or max pooling on
small windows of the input space. Almost all classification neural networks contain an FC layer as a final
classification layer, followed by a ‘softmax’ function. The softmax function is given by Equation 2.1.
Softmax(y_i) = e^{y_i} / Σ_j e^{y_j}    (2.1)
Softmax is used to convert a vector of positive and negative real numbers to a distribution of prob-
abilities, which can be interpreted as the probability of each class being the correct class, given the
input.
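As a minimal illustration of Equation 2.1 (written with NumPy, and using the standard max-subtraction trick for numerical stability, which does not change the result):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))  # stable evaluation of e^{y_i}
    return e / e.sum()         # normalize so the outputs sum to 1

logits = np.array([2.0, -1.0, 0.5])  # arbitrary example class scores
print(softmax(logits))               # approx. [0.786 0.039 0.175], a probability distribution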
It has been shown that it is better, in terms of memory footprint, to train a large neural network and prune it than it is to
train a smaller neural network with a similar number of final non-zero parameters. The intuition behind
the findings of Frankle & Carbin [28] is that there exists a small ‘subnetwork’ within any large DNN
that has been initialized in such a way that it is amenable to training to convergence successfully – called
a ‘winning ticket’ subnetwork – and that is responsible for most of the accuracy of the network. Pruning
reveals these subnetworks without affecting accuracy by removing redundant weights. Training a larger
dense network increases the likelihood of there being a winning ticket subnetwork, thus it will always be
easier to train a large network and prune away dense connections than it is to train a compact, dense
network from scratch. This corroborates the findings of Zhu & Gupta [10], and suggests that weight
sparsity is a value property that will likely continue to pervade neural networks in the future.
Figure 2.4: Example of weight pruning, which deletes weights using a heuristic algorithm, resulting in
a sparse network. If all of the input weights of a neuron are removed, that neuron is effectively removed
as well.
Modern pruning algorithms are implemented either as a post-processing step after the unpruned
network has converged (with retraining to recover any lost accuracy), or as a part of the training
process [10, 9]. A number of heuristics can be used to decide which weights to eliminate, the most
common being the magnitude of the weight’s value [10]. Others remove weights to which the output has
the least sensitivity first [14, 15]; however, computing complex metrics like this is too costly for modern
DNNs. In part because weights are randomly initialized at the start of training, the heuristics used
in pruning result in sparsity that is relatively uniformly randomly distributed throughout the weight
tensors [10]. This leads to irregularities in the network computation, which some works have tried
to address by imposing constraints on how weights can be pruned, forcing them to be eliminated in
groups [30, 31, 32]. These algorithms lead to structured sparsity, where either an entire contiguous group
of weights (e.g., an entire filter channel) are zero, or none of them are. However, though designed to be
more hardware-friendly, in practice these structured pruning techniques are rarely used as they make
training a network to convergence without accuracy loss much more difficult. Mao et al. find that
unstructured sparsity can reach much higher sparsity levels whilst maintaining model accuracy when
compared to pruning at the granularity of filter-rows, filter-channels, or entire filters [33, 34].
Algorithm 1 gives a high-level, generic description of (one class of) pruning algorithms which prune
iteratively during training. The procedure will operate on the network as defined by its weight tensors,
W , the training data inputs X and labels Y , the learning rate LR, the number of training epochs, and
the target sparsity, S. Before each forward pass, a binary mask is applied to the weights to zero-out
pruned weights. The weights are updated using whatever learning algorithm is desired (vanilla gradient
descent is shown in the algorithm). Finally, a new mask is generated (i.e., more weights are pruned) as
a function of the current epoch and the final target sparsity, as the sparsity level is gradually increased
during training. The GenerateMask() function is one of the key defining factors of a pruning algorithm,
and its operation is what distinguishes different pruning approaches. A simple implementation might
sort all elements of W by magnitude and prune the smallest S × (1 − (Epochs − t − 1)/Epochs) fraction
of weights, meaning sparsity scales linearly as training progresses through epochs.
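A minimal sketch of this class of pruning loop is given below, assuming vanilla gradient descent and the linear sparsity ramp described above. The grad_fn argument is a hypothetical stand-in for the backpropagation step, and all names are illustrative rather than taken from any particular pruning framework.

import numpy as np

def generate_mask(W, t, epochs, target_sparsity):
    # Magnitude heuristic: prune the smallest-magnitude weights, ramping the
    # sparsity level up linearly over the training epochs.
    sparsity = target_sparsity * (1 - (epochs - t - 1) / epochs)
    k = int(sparsity * W.size)                    # number of weights to prune
    if k == 0:
        return np.ones_like(W)
    threshold = np.sort(np.abs(W).ravel())[k - 1]
    return (np.abs(W) > threshold).astype(W.dtype)

def prune_train(W, X, Y, lr, epochs, target_sparsity, grad_fn):
    mask = np.ones_like(W)
    for t in range(epochs):
        W = W * mask                      # zero out pruned weights before the forward pass
        W = W - lr * grad_fn(W, X, Y)     # vanilla gradient descent update
        mask = generate_mask(W, t, epochs, target_sparsity)
    return W * mask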
Figure 2.5: Activation sparsity of a 2D activation plane due to the ReLU element-wise non-linearity.
Another form of value sparsity present in modern neural networks is activation sparsity, where a
significant fraction of activation values are equal to zero. Activation sparsity primarily occurs due to
the rectified linear unit (ReLU) activation function, which is applied element-wise to the output of a
convolutional or fully connected layer of a CNN, and clamps all negative values to zero, whilst letting
positive values pass through unaffected, as shown in Figure 2.5. Activation sparsity is typically between
40% and 50% per-layer in a modern CNN [35].
2.2 Hardware Acceleration for Convolutional Neural Networks

CNN inference represents a compelling target for hardware acceleration for several reasons:
• Widespread deployment
• Computationally intensive
• Latency constrained
As such, a large amount of research and development activity has taken place in the CNN accelerator
space in the last few years.
Figure 2.6: An abstract view of hardware architectures for CNN acceleration.
The majority of these designs are inference accelerators, though some more recent works also target
the training phase of CNNs. Seminal works in this space explored how best to exploit the structure of
CNN computation and leverage the large amount of parallelism in the workloads efficiently [42, 43, 44],
however many more recent designs target the unique value properties of CNNs in order to increase the
performance potential and efficiency of the hardware [17, 18, 35, 45, 46, 47, 48]. Thus, most hardware in
the accelerator space can be classified as either value-aware or value-agnostic. Accelerators of both types
share many basic traits. The primary operation in a CNN is the MAC, of which there are potentially
billions per inference – see Table 2.1. All CNN accelerators will therefore support a large amount
of parallel multiply and accumulate throughput in hardware. Figure 2.6 shows the basic structure of
most modern CNN accelerators, the organization of which contains an array of processing elements to
exploit the abundant data parallelism, on-chip buffers and scratchpads to exploit data reuse, and a
high-bandwidth interface to main memory to load inputs and weights without causing a bottleneck.
The exact specification of the PEs and their specific architecture and organization can vary widely,
and is at least partially a function of the desired dataflow – of which there are many possible. As
in Figure 2.7, PEs may be simple MAC units, or even just multipliers as is the case in some systolic
array-based designs [20], or may contain internal buffers or scratchpads with broadcast/multicast con-
nectivity from memory [43], or multiple multipliers with accumulation capability, as in coarse-grained
reconfigurable array (CGRA) style architectures [49]. PEs may also contain additional functional units
to apply activation functions or pooling operations.
Dataflow and memory hierarchy are important aspects of the accelerator design, as together they
define the memory access energy associated with inference, which can be a major contributor to energy
efficiency. Table 2.2 shows the relative energy costs of arithmetic and memory access operations. Nearly
all recent works in this space target 16-bit or 8-bit fixed point arithmetic as a baseline, due to the fact
that modern neural networks suffer little-to-no accuracy loss at this data width compared to their native
32-bit floating point format [47], whilst integer multipliers and adders are many times more area efficient
than their floating-point counterparts.
Table 2.2: Relative energy cost of various operations in 45nm 0.9V CMOS technology [50].
Eyeriss proposes the row-stationary dataflow – in which 1D convolutions of a single row of filter weights and activations
are assigned to a given PE at a time – as an energy efficient option that, when mapped to their PE array,
allows flexibility in the mapping of computations to PEs in order to avoid severe under-utilization. As
with DaDianNao, the authors note that data movement is the leading contributor to energy inefficiency,
and so the Eyeriss design makes extensive use of an on-chip memory hierarchy co-designed with their
dataflow to minimize this energy cost. Implemented on their architecture, the row-stationary dataflow
is between 1.4× and 2.5× more energy efficient than the other dataflows studied.
Google’s Tensor Processing Unit (TPU) evaluation [20] is of particular interest to the architecture
community, as it provides an insight into the measured, rather than modelled, performance and energy
metrics of a DNN inference accelerator in large-scale datacenter deployment running real workloads. The
TPU uses a systolic array architecture, with large on-chip buffers and a large main memory for weight
storage. Compared to modern CPU and GPU servers used in Google datacenters, the TPU provides
between 17×−31× and 14×−25× average performance per Watt, respectively, depending on the DRAM
specification used.
In contrast to Cambricon-S, SCNN [18] does not impose any restrictions on the weight sparsity distribution. Instead, SCNN
exploits convolutional reuse, making use of the fact that, within a given channel, the product of
any weight and any activation is a valid and useful computation (modulo effects of stride and padding).
Therefore, by adopting a channel-wise dataflow (a single channel is processed at a time) and sending only
non-zero values to each PE, ineffectual computations can be almost completely eliminated. However,
this strategy requires calculating which output each product belongs to dynamically, and routing these
products to the correct accumulator within a PE using an over-provisioned crossbar, which is a costly
design approach. Additionally, the design incurs performance overheads for dense (and mostly-dense)
networks, meaning it requires high sparsity to achieve a speedup, and punishes low-sparsity networks.
The crossbar alone accounts for over 21% of PE area, with the over-provisioned accumulator banks ac-
counting for another 29%. Nevertheless, for sufficiently sparse networks, SCNN achieves a 2.7× speedup
and 2.3× energy efficiency improvement over an equivalent dense design. However, the design does not
perform well on FC layers, with at most 25% utilization due to its Cartesian product multiplier array
that broadcasts weights.
The Efficient Inference Engine (EIE) [48] is an architecture that targets weight and activation spar-
sity in networks compressed using the Deep Compression algorithm [9]. The authors target only fully
connected layers, as in older networks they contain the majority of the weights and thus account for the
majority of off-chip traffic and energy. However, in modern networks this effect is less pronounced, and
in both old and new networks, FC layers account for a small fraction of computation time (0.05% of
inference time on ResNet-50 running on a DaDianNao tile), thus speedup potential is extremely limited.
2.4 Summary
This chapter has outlined the key compute and value characteristics of DNNs that are required to under-
stand state-of-the-art techniques in hardware accelerators which target these workloads. Additionally,
relevant hardware research which targets DNN value properties has been described, in order to frame
the contributions of this thesis in the context of appropriate prior work.
Chapter 3

A Low-Overhead Architecture for Sparse Neural Networks
With the explosion of deep learning applications and DNN deployment, ASICs targeting DNN inference
have flourished [18, 20, 42, 43, 44, 48]. The widespread potential for these algorithms, and their compute-
and memory-intensive nature, has made hardware efficiency and performance especially important. Tar-
geting weight sparsity as a means of extracting additional performance during neural network inference
is attractive for a number of reasons. Principally, many recent works have supported the hypothesis that
sparsity is a manifestation of how neural networks inherently learn to classify information, and therefore
it will remain a common value property available for exploitation going forward [28, 29]. Additionally, as
will be shown in Chapter 4, the potential performance improvements to be gained by targeting sparsity
in neural networks are significant, at up to 7.35×.
Figure 3.2: Dataflow used by DaDianNao, showing an example of the N activations and N weights
whose inner product is computed in a single PE each cycle.
Zeros will be inserted in the weight tensor as ‘padding’ to ensure values are aligned correctly, as dictated
by the dataflow.
Though DaDianNao was originally conceived as a multi-node accelerator with enough on-chip eDRAM
to store all weights and activations per-layer on-chip during inference, this design choice is inefficient and
over-provisioned. Instead, we size the activation memory to be large enough to keep input and output
activations on-chip at all times using double buffering, but size our weight memory to store only one
working set of filters at a time, and hide off-chip latency using double buffering, using our previously
proposed heuristics [55]. We refer to this modified baseline design as DaDianNao++.
A multiplication with a zero-valued weight contributes nothing to the output of a convolution. That is to say, acc + w × a = acc if w = 0, leaving acc unchanged. The same is of course
true for zero activations, however as these are far less numerous in sparse networks compared to zero
weights, in general they are less attractive as a prospect for acceleration. However, the relative sparsity
of weights and activations will change on a per-layer basis, and it may be the case that activation sparsity
offers a higher speedup potential for some layers and networks.
Figure 3.3: Naive approach to exploiting weight sparsity. White elements indicate zero weights. Non-
zero weights must be multiplied by the corresponding activation of the same colour. (a) shows how a
value-agnostic machine would compute these multiplies, with poor utilization. (b) shows how non-zero
weights can be packed densely in memory, but activations then require dynamic, arbitrary routing to
the multipliers, which is costly.
Architecturally, exploiting weight sparsity for performance gains implies replacing ineffectual MACs
with effectual MACs. On the weight side, replacing zero-valued weights with non-zero weights in memory
is trivial, as weights are static at run-time. After a network has been trained and pruned, the weights
will not change, so rearranging individual weights in memory using any sparse compression format can
be done offline. However, issues arise due to the fact that the non-zero weights must be paired with
their corresponding activations at run-time. This requires the ability to fetch potentially arbitrary
activation values from the activation stream, implying hardware that facilitates expensive dynamic
routing capabilities, as seen in Figure 3.3. However, as discussed in Section 2.3, this design strategy
leads to expensive crossbars within PEs, which account for a large proportion of area and energy – above
30% for some designs [17]. Similarly, packing both sparse weights and activations densely into PEs in
SCNN requires costly crossbar routing between the multiplier array and accumulators [18]. Therefore,
when trying to exploit sparsity, we would ideally like hardware that can skip the ineffectual multiplications
caused by zero-valued weights, whilst avoiding expensive, fully-associative routing of activations or products.
The following sections will describe and evaluate a mechanism for achieving these goals, along with
optimizations surrounding the mechanism.
Figure 3.4: Example of how lookahead exploits sparsity. Value superscripts denote their multiplier lane
position in the dense schedule, and subscripts represent the cycle each would be processed in using the
dense schedule. w12 is processed in cycle 0 by moving it in memory statically, and selecting a21 as an
input to the correct multiplier that same cycle, as indicated by the red wire.
Figure 3.5: Example of how lookaside exploits sparsity and reduces multiplier lane imbalance. w11 is
processed one cycle early, and by a different multiplier than the dense schedule dictated, by moving it
in memory statically. The corresponding activation movement is indicated by the red wire.
to the input multiplexers of neighbouring multiplier lanes, meaning lookaside is comparatively cheaper
than lookahead. Lookaside also comes with the added benefit of reducing inter-multiplier imbalance, as
non-zero weights that would be serialized to a single multiplier can instead be allocated to neighbouring
multipliers within a PE. With these two weight movement primitives, we can construct many different
frontend interconnect patterns by connecting activation wires to multiplexer inputs. We define lookahead
distance as the number of time steps ahead of the original schedule a weight may advance. Lookahead
distance defines how many rows of activations are broadcast per cycle. Similarly, the number of lookaside
connections determines the number of additional multiplier lanes a weight may appear at in a given
cycle. Note that within this design space, we can recreate fully associative scheduling by setting
the lookahead distance to be as deep as the weight memory, and using all lookaside connections in
the lookahead window. This would obviously be a very costly design, and as we will show would not
represent a competitive cost-reward trade-off compared to a more pragmatic design.
A speedup is achieved when there are no effectual weights scheduled to be processed in a given
cycle (either due to them being processed ahead of their dense cycle, or due to sparsity), and so the
lookahead window can advance by more than one row of activations. This row of ineffectual weights must
span an entire tile of PEs, as that is the granularity at which PEs are synchronized due to activations
being broadcast. This initially appears to be a very limiting constraint on the speedup potential, as
Figure 3.6: How TCT achieves a speedup despite synchronization constraints. Weights promoted to
appear in cycle i leave ’gaps’ in the schedule, leading to more promotion opportunities in cycle i + 1.
the likelihood of all weight lanes across an entire tile containing only zero weights seems small, even
with lookahead and lookaside. However, Figure 3.6 demonstrates how weight promotions leave gaps in
the schedule, which can propagate backwards until a row of zeros appears, at which point the lookahead
window can advance multiple steps, resulting in a speedup. In Figure 3.6, an ineffectual weight appears
in cycle 0 (C0 ). A weight from C1 is promoted, leaving a bubble into which a weight from C2 can be
promoted. This continues until C7 , which now contains only ineffectual weights in all three multiplier
lanes, meaning it can be skipped and the lookahead window can slide onwards by 2. At no point during
execution did weights from more than 2 cycles in the original dense schedule appear at the same cycle
in the new sparse schedule.
Figure 3.7: The Weight Skipping Unit (WSU) which routes activations to the correct multiplier at
runtime.
Figure 3.9: The Activation Select Unit (ASU) which implements the sliding activation window.
The activation wires of the WSU are fed by the Activation Select Unit (ASU), which implements the sliding window required to implement
lookahead and gain a performance advantage. The WSU communicates to the ASU using the Activation
Lane Control field (ALC), a small field stored alongside each row of N weights in weight memory.
For each weight, there are h + 1 activations in the lookahead window, which need to appear in
‘lookahead order’ (from lookahead distance 0 to lookahead distance h) on the wires within the WSU
every cycle for the front-end to work, as the connectivity is fixed. The ASU is detailed in Figure 3.9,
which shows how activations are routed to activation wires, facilitating lookahead and a sliding window
whilst maintaining logical ordering of activations. The ASU buffers activations in the Activation Buffer
Registers (ABRs), of which there are h + 1, before constructing the lookahead window by routing them
through h + 1 (h+1)-to-1 multiplexers, such that they appear on the activation wires in the order
that the WSU requires. Each ABR therefore contains N 16-bit activations – 1 for each multiplier in a
PE. The lookahead multiplexers are controlled by the Activation Control (AC) logic, which keeps track
of which ABR is at the head (i.e., is at a lookahead distance of 0), and uses the ALC field from weight
memory to determine how far ahead to slide the lookahead window each cycle, creating a circular queue.
This circumvents the need to copy data between ABRs when sliding the lookahead window. The ABRs
are fed by the activation buffer, which contains h + 1 banks, each with a dedicated read port. When
sliding the lookahead window, any number of the ABRs can therefore be updated independently.
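A behavioural sketch of this circular queue is shown below; it models only the head-pointer bookkeeping described above. The activation_buffer.next_row() call is a hypothetical stand-in for a read from one of the h+1 activation buffer banks, and is not part of the actual design description.

class ASU:
    def __init__(self, h, N):
        self.h = h
        self.abr = [[0] * N for _ in range(h + 1)]  # h+1 Activation Buffer Registers
        self.head = 0                               # ABR currently at lookahead distance 0

    def read(self, d):
        # The lookahead multiplexers present the ABRs in 'lookahead order':
        # distance d maps to the ABR d places past the head, wrapping around.
        return self.abr[(self.head + d) % (self.h + 1)]

    def advance(self, alc, activation_buffer):
        # Slide the window by 'alc' rows: only the ABRs that fall out of the
        # window are refilled, and the head pointer moves on -- no data is
        # copied between ABRs.
        for i in range(alc):
            self.abr[(self.head + i) % (self.h + 1)] = activation_buffer.next_row()
        self.head = (self.head + alc) % (self.h + 1)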
Figure 3.10: Two potential interconnect patterns. The ‘L’ pattern is contiguous and unidirectional. The
‘Trident’ pattern, denoted ‘T’, is sparse and bidirectional.
limited area budget. Constructing this interconnect means defining, in weight memory-space, a search
window for each multiplier lane within which weights can be stolen, and promoted to that multiplier.
The adder-tree design means we must constrain this search window to be entirely within a single filter
lane, so that all products calculated within a PE contribute to the same output activation.
Naively, for a given lookahead and lookaside distance, denoted ⟨h, s⟩, the search window could be
a contiguous pattern h deep and s wide. An example of this is given in Figure 3.10 with the ⟨2, 5⟩L
pattern, which has a search window that extends 2 steps ahead in time, and across 5 multiplier lanes
to the side. A connectivity pattern is also defined by how many steal locations it allows, which has a
hardware cost reflected in the size of the input multiplexer. For example, both patterns in Figure 3.10
have 7 steal locations, plus the original input location, meaning they require an 8-input multiplexer
before each multiplier. This can be denoted in the L-shaped pattern shorthand as L8⟨2, 5⟩. Note
that lookaside connections wrap around at the edge of the PE, such that a lookaside connection at a
distance of 1 may allow multiplier N − 1 to steal weights from multiplier lane 0 – recall that lookaside
connections are not costly compared to lookahead connections.
A design’s maximum speedup potential is determined by its lookahead distance h, and is given by
Equation 3.1.
Speedup_Max = h + 1    (3.1)
To see why this is the case, consider a design with h = 1. This can at best schedule weights from two
cycles in the original dense schedule to be processed in a single cycle, obtaining a speedup of 2×, before
the sliding window advances by two rows. There is also a maximum potential speedup per-network,
determined by taking the inverse of the network density, e.g., a network that has 75% compute sparsity
Figure 3.11: A toy example of two possible schedules for a machine with 3 weight lanes, and an inter-
connect with lookahead of 1 and lookaside of 1 (biased right). A poor schedule would take two cycles to
process the 3 effectual weights, whereas an optimal schedule can process them in a single cycle.
has a maximum speedup of 1/(1 − 0.75) = 4×. The geometric mean of the compute sparsity across the
networks studied is 69%, corresponding to a 3.2× potential speedup. Given this observation, designs
with a lookahead distance of 2 (i.e., a maximum speedup of 3×) seem to represent a good trade off
between hardware cost and performance potential.
Looking at Figure 3.10, it is clear that for the L shaped interconnect, neighbouring lanes will have a
lot of overlap in their search windows, increasing contention for effectual work. This design also heavily
favours weights at a lookahead distance of 1 over those at a distance of 2, which can only be assigned to
a single possible multiplier. Both of these issues are addressed in the Trident shaped interconnect, which
still has a lookahead distance of 2 and an 8-input multiplexer, but spreads its search window between
the two lookahead rows, resulting in a sparse shuffling network, with little overlap between neighbouring
multiplier lanes in order to minimize contention. These features are by design, as the Trident pattern was
hand-designed by examining the types of issues encountered by the L shaped pattern when scheduling
real weight sparsity distributions with pen and paper. Compared to the L shaped pattern, in which
neighbouring lanes share 5 out of 7 steal locations, neighbouring lanes in the Trident pattern share only
2 out of 7 steal locations.
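The overlap counts above can be reproduced by describing an interconnect as a list of (lookahead, lookaside) offsets, the same representation the scheduling algorithm in Section 3.3 consumes. The sketch below assumes one plausible reading of the L8⟨2, 5⟩ offsets (the straight-ahead positions at lookahead 1 and 2, plus five unidirectional lookaside positions at lookahead 1); the exact Trident offsets are not reproduced here, but would be plugged into the same helper and share only 2 of 7 locations between neighbours.

def steal_window(offsets, lane, num_lanes=16):
    # The set of (lookahead, source lane) positions this multiplier lane can
    # steal weights from; lookaside connections wrap around within the PE.
    return {(la, (lane + ls) % num_lanes) for (la, ls) in offsets}

def neighbour_overlap(offsets, num_lanes=16):
    # How many steal locations two adjacent lanes share (a proxy for contention).
    return len(steal_window(offsets, 0, num_lanes) & steal_window(offsets, 1, num_lanes))

# Assumed L8<2,5> offsets: 7 steal locations in addition to the original position.
L_pattern = [(1, 0), (2, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
print(neighbour_overlap(L_pattern))  # 5 -- neighbouring lanes contend for 5 of their 7 locations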
Figure 3.12: An example of how the scheduling algorithm operates. In this example, lookahead is 1 and
lookaside is 2 (bidirectional, with wraparound connections).
Figure 3.11 demonstrates the need for a scheduling algorithm in cases where there are multiple
multipliers on which a weight could potentially be scheduled. This scheduling problem relates to a form
of the Job Shop Problem, or job shop scheduling, which is known to be NP-hard [56]. In essence, we
are trying to minimize the makespan, where weights are jobs (or units of work) and multipliers are
the job-shop machines. To this end, we note that in Figure 3.11, the key characteristic that makes it
desirable to move w11 to multiplier lane 2 is that w11 is the only weight that lane 2 can possibly process
that cycle. We term this an exclusive promotion location, in that lane 2 is exclusive to w11 (it is the only
weight that can be moved there), despite the fact that multiple lanes (1 and 2) could process w11 . The
scheduling algorithm we design is based on making exclusive promotions first, in order to reduce the
amount of sub-optimal weight movements. This algorithm is co-designed with the interconnect, which
is intended to reduce contention between nearby lanes, resulting in more of these exclusive promotion
opportunities.
An example of how the scheduling algorithm operates is given in Figure 3.12. At each step of the
scheduling algorithm, a tally keeps track of the number of candidate weights that could be promoted to
a given multiplier lane for processing that cycle – these candidates are denoted with red arrows, with the
tally equalling the number of incoming arrows to a given position. The scheduler then makes promotions
to the lanes that have the smallest candidate tally, which in the common case will be equal to 1. The
tally is updated, and the process repeats until there are no candidate weights left, or the multipliers
become fully utilized that cycle.
The heuristic weight scheduling algorithm is described in Algorithm 2. The algorithm is shown for a
single ‘warp’ of filters. A warp is defined as a set of K filters assigned to K PEs simultaneously, before
the next K filters are processed, i.e., PEs are synchronized on a warp boundary. Inputs to the algorithm
include N , which is the number of multipliers per PE, the matrix of weights W , which is assumed to
already be in in-memory layout and has dimensionality R × L, where R is the number of ‘rows’ of
weights to be processed, and is equivalent to the number of cycles the filter would take to process on
DaDianNao++, and where L is the total number of multiplier lanes available in the accelerator, which
for a single tile is equal to k × N . The interconnect is described by a list, I, of (lookahead, lookaside)
8:  procedure schedule(W, I, N)
9:      WS ← 0
10:     for r = 0 : R − 1 do
11:         if containsNonZero(W[r, 0 : L − 1]) then
12:             numCandidates ← countCandidates(W, I, N, r)
13:             while containsNonZero(numCandidates) do
14:                 W, WS ← promote(W, WS, I, N, r, numCandidates)
15:                 numCandidates ← countCandidates(W, I, N, r)
16:         else
17:             deleteRow(W, r)
18:             deleteRow(WS, r)
19:     return W, WS
coordinates, e.g., I = [(1, −1), (1, 0), (1, 1)] would describe an interconnect with a lookahead of 1 and
2 lookaside connections. The algorithm outputs the modified weight matrix, and a matrix of the same
shape containing the mux select signals, WS . The main loop of the scheduling algorithm is contained in
the schedule procedure, from lines 8 – 19. Here, the rows of the weight matrix are iterated through. If a
row contains only zeros, we can delete it from the weight matrix and the schedule signal matrix, which in
hardware implies a speedup (lines 17 and 18). Otherwise, we keep a tally of how many candidate weights
could be promoted to each multiplier lane that cycle, using the countCandidates(...) helper function.
While there continue to be promotions that could be made, the promote(...) function is called and
makes promotions to locations that have a numCandidates tally equal to the minimum non-zero tally.
This process repeats for all rows of weights in weight memory format.
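For illustration, the sketch below applies the exclusive-promotions-first idea to a single PE (one filter lane of N multipliers) rather than to a full tile of k synchronized PEs as Algorithm 2 does; the data layout, tie-breaking, and helper behaviour are simplifying assumptions, not the exact implementation.

import numpy as np

def schedule_pe(W, I):
    # W: (R, N) weight matrix for one filter in in-memory layout.
    # I: interconnect offsets, e.g. [(1, -1), (1, 0), (1, 1)].
    W = W.copy()
    R, N = W.shape
    out_rows, out_sel = [], []
    for r in range(R):
        sel = [None] * N
        while True:
            # Tally the candidate weights that could be promoted into each empty lane.
            candidates = {}
            for lane in range(N):
                if W[r, lane] != 0:
                    continue  # lane already has effectual work this cycle
                cands = [(la, ls) for (la, ls) in I
                         if r + la < R and W[r + la, (lane + ls) % N] != 0]
                if cands:
                    candidates[lane] = cands
            if not candidates:
                break
            # Fill the lane with the fewest candidates first (exclusive promotions).
            lane = min(candidates, key=lambda l: len(candidates[l]))
            la, ls = candidates[lane][0]
            src = (r + la, (lane + ls) % N)
            W[r, lane], W[src] = W[src], 0   # promote the weight in memory
            sel[lane] = (la, ls)             # record the mux select for this position
        if np.any(W[r] != 0):
            out_rows.append(W[r].copy())
            out_sel.append(sel)
        # else: the whole row is ineffectual and is deleted, which is where the speedup comes from
    return np.array(out_rows), out_sel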
For completeness, in Section 4.3.1 we compare against a greedy algorithm that iterates through
each multiplier lane in each cycle in some scan order, and iterates through the search window for each
multiplier (also in some scan order) until it finds a non-zero weight to promote. It performs that weight
promotion before continuing the process for the remaining multiplier lanes. The algorithm is essentially
making the first promotion possible on each iteration, and will be heavily affected by scan order.
The runtime of the scheduler is shown in Table 3.1, with measurements taken on a machine with an
Intel Core i7-8700 processor and 64GB of memory, and the scheduler implemented in the Python
programming language using the NumPy numerical programming package [57]. Note that the
algorithm’s complexity is linear in the number of network weights, and so the runtime is slowest for
networks with large FC layers (AlexNet and Bi-LSTM), but very fast for even reasonably deep CNNs,
such as ResNet-50. In any case, the scheduling time is negligible compared to the time of training a
large DNN, and so is immaterial to the practicality of DNN deployment.
Figure 3.13: How filter shuffling increases performance. (a) Unshuffled warps are slowed down by
‘stragglers’ (highlighted). (b) Shuffled warps reduce synchronization idle time.
Figure 3.14: Example of how filter shuffling (bottom) changes the computation, but results in the same
output as unshuffled filters (top).
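The prose for this optimization is summarized by the captions above, so the following is only an assumption-laden illustration of the idea rather than the thesis's exact shuffling scheme: grouping filters with similar scheduled lengths into the same warp keeps any one 'straggler' filter from stalling its warp.

def form_warps(schedule_lengths, K):
    # schedule_lengths[f]: number of scheduled weight rows for filter f.
    # Sorting by length places similarly-long filters into the same warp.
    order = sorted(range(len(schedule_lengths)), key=lambda f: schedule_lengths[f])
    warps = [order[i:i + K] for i in range(0, len(order), K)]
    # PEs synchronize on warp boundaries, so each warp costs as much as its slowest filter.
    cycles = sum(max(schedule_lengths[f] for f in warp) for warp in warps)
    return warps, cycles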
Figure 3.15: Weight memory interface (a) without and (b) with the Mux Select Table (MST).
Figure 3.16: MST size vs. signal combination coverage. Signal combinations are weighted by reuse.
Figure 3.17: Memory overhead of scheduling metadata normalized to the naive implementation. Results
are for a 32 entry MST and 8-input muxes.
Figure 3.18: Mux signal coverage vs. MST size, with signal combinations unweighted.
Given the limited front-end connectivity, we should expect that a lot of schedule patterns are seen many times within the processing
of a given filter. Indeed, as Figure 3.16 shows, no filter in any network in our benchmark suite requires
more than 53 unique scheduling steps/mux signal combinations – much less than the 2^48 expressible in
the naive design. A 32 entry MST would require only a 5-bit field per group of 16 weights, and would
cover more than 99.5% of combinations for all networks except the two pruned versions of AlexNet (the
oldest and most over-provisioned network we benchmarked). The cumulative distribution is weighted
by filter reuse, which directly corresponds with processing time, meaning that this would imply at most
a 0.5% performance degradation for the newer networks. The performance degradation arises when the
scheduler cannot select one of the 32 MST entries to drive the muxes, and so must revert to the dense
schedule in 0.5% of cycles. In reality, due to the filter synchronization described in Section 3.4, the
performance overhead will likely be even less.
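As a rough illustration of the analysis behind Figure 3.16, the following Python sketch estimates, for a single filter, the fraction of its reuse-weighted scheduling steps that an MST of a given size would cover. The encoding of each scheduling step as a hashable mux-select combination is an assumption made purely for illustration.

    from collections import Counter

    def mst_coverage(step_signals, reuse_counts, mst_size=32):
        """Fraction of scheduling steps covered by the mst_size most common signals.

        step_signals : per-step mux-select combinations (e.g. the concatenated 3-bit
                       WS fields of a group of 16 weights), one entry per step.
        reuse_counts : how many cycles each step is replayed due to filter reuse.
        """
        tally = Counter()
        for signal, reuse in zip(step_signals, reuse_counts):
            tally[signal] += reuse
        total = sum(tally.values())
        covered = sum(count for _, count in tally.most_common(mst_size))
        return covered / total if total else 1.0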
Figure 3.17 shows the relative metadata memory overheads of the naive implementation, which uses a single 3-bit WS signal per weight, and of the MST optimization, which uses a 32-entry, 48-bit-wide MST per PE that is updated with each filter loaded from off-chip. Updating the MST represents an overhead; however, the reduction in the memory overhead of the WS signals greatly outweighs it for all networks but GoogLeNet-ES, which sees only a modest 5% reduction in metadata overhead, as the MST refresh overhead dilutes any improvements.
We also show signal coverage with combinations unweighted in Figure 3.18. These results reflect the raw number of scheduling decisions that an MST of a given size can cover, and are therefore more indicative of the memory traffic savings that are possible. We see that the two AlexNet networks require a larger MST than most networks, partially due to their large, and often quite sparse, FC layers.
3.6 Summary
This chapter has presented the Bit-Tactical front-end architecture and interconnect, which exploits
weight sparsity in neural networks, and has highlighted the modifications made to the baseline DaDi-
anNao architecture to implement this design. The outline of the architecture has also shown precisely
how the hardware is capable of achieving a speedup on sparse neural networks. Alongside the hardware,
the weight scheduling algorithm has also been detailed, demonstrating how TCT’s hardware/software
co-design approach can extract additional performance with limited hardware complexity. Further hard-
ware and software optimizations have been illustrated which can increase the performance and memory
compression achieved by TCT.
Chapter 4
Evaluation
Here, we provide an evaluation of the Bit-Tactical front-end architecture on a variety of neural network
benchmarks. We will first describe the evaluation methodology in Section 4.1, along with the benchmark
suite used for evaluation in Section 4.2. In Section 4.3, we will show how the best front-end design
evaluated achieves a 2.03× geomean speedup across the networks studied. The effect of the scheduling
algorithm, which can improve performance by up to 28%, is discussed in Section 4.3.1. We will also show
how robust this combination of front-end design and scheduler is with a sensitivity study in Section 4.3.2.
We will show the area overheads of the TCT front-end in Section 4.3.3, demonstrating that the design
offers compelling performance-area trade-offs. Finally, for comparison purposes, Section 4.3.4 will show
the performance of various alternative interconnect designs.
4.1 Methodology
The performance of Bit-Tactical is evaluated using a custom cycle-accurate simulator which models the
performance of the front-end and allows the exploration of various interconnect designs and scheduling
algorithms. The simulator also provides detailed performance counters that allow analysis of the results.
Table 4.1: Hardware configuration of the baseline DaDianNao++ design and of TCT.

DaDianNao++ or TCT:
  Tiles: 4                              Filters/Tile: 16
  AS/Tile: 32KB × 32 banks              Weights/Filter (N): 16
  WS/Tile: 2KB × 32 banks               Precision: 16b
  Act. Buffer/Tile: 1KB × (h + 1)       Frequency: 1GHz
  Main Memory: 8GB (various technologies)   Tech Node: 65nm
  Lookahead: 0-4                        Lookaside: 0-6
DaDianNao++ only:
  Peak Compute BW: 2 TOPS    Area: 61.29 mm²    Power: 5.92 W
The hardware configuration for the baseline design and TCT is outlined in Table 4.1. Area and energy
measurements are performed post-layout using representative circuit activity. Layouts are generated for
a TSMC 65nm technology using Cadence Innovus, after synthesis with Synopsys Design Compiler.
SRAMs are modeled via CACTI [58]. Off-chip memory energy consumption is modeled using Micron’s
DDR4 power calculator [59] along with access counts from the cycle-accurate simulations. All designs
operate at 1GHz, with pipelining of the datapath as needed to reach this target frequency. Both TCT
and DaDianNao++ use k = 16 PEs per tile, with N = 16 multipliers per PE, all operating on 16-bit
fixed point inputs. We initially show results assuming sufficient off-chip bandwidth so that no off-chip
stalls occur, but later show the effect of various main memory technologies. We use run-length based
zero compression as in [18] for weights, and fine-grain per group precision as in [60] for activations to
reduce off-chip bandwidth for all layers.
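For reference, a minimal sketch of run-length zero compression in the spirit of [18] is shown below; the exact encoding (field widths, handling of long zero runs) is an assumption for illustration and not necessarily the format used in our simulations.

    def rle_zero_compress(weights, max_run=15):
        """Encode a weight stream as (zeros-skipped, value) pairs.

        Runs of zeros longer than max_run are split by emitting an explicit zero value,
        so the run-length field of every pair fits in a fixed number of bits.
        """
        encoded, run = [], 0
        for w in weights:
            if w == 0 and run < max_run:
                run += 1
            else:
                encoded.append((run, w))
                run = 0
        if run:                                   # flush a trailing run of zeros
            encoded.append((run - 1, 0))
        return encoded

    # Example: two zeros, a 3, three zeros, then a 5.
    assert rle_zero_compress([0, 0, 3, 0, 0, 0, 5]) == [(2, 3), (3, 5)]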
4.2 Benchmarks
For our benchmarks, we take open-source pruned models from Yang et al. [11] (models denoted with
the ‘-ES’ suffix) and from Park et al. [16] (models denoted with the ‘-SS’ suffix). We do not modify
these networks and use them as-is. We also perform magnitude-based threshold pruning, as proposed by [61], on an open-source pre-trained MobileNet v1 and a Bi-LSTM, targeting 75% sparsity for each.
For MobileNet, we only prune the pointwise convolution layers, in line with [10], as these contain 99%
of the network parameters.
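As a reference for how such a threshold can be chosen, the following NumPy sketch zeroes the smallest-magnitude entries of a single weight tensor until a target sparsity is reached. This shows only the thresholding step; the approach of [61], like standard practice, also fine-tunes the network afterwards to recover accuracy, which is not shown here.

    import numpy as np

    def magnitude_prune(weights, target_sparsity=0.75):
        """Zero out the smallest-magnitude weights so that roughly target_sparsity of
        the entries become zero (per-tensor threshold chosen as a quantile of |w|)."""
        threshold = np.quantile(np.abs(weights), target_sparsity)
        return np.where(np.abs(weights) > threshold, weights, 0.0)

    # Example: prune a randomly initialized layer to roughly 75% zeros.
    w = np.random.randn(512, 512)
    print((magnitude_prune(w) == 0).mean())       # approximately 0.75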
Table 4.2 lists the networks studied and their sparsity levels after pruning. Note that we provide
two measures of sparsity, labelled ‘Storage Sparsity’ and ‘Compute Sparsity’. Storage sparsity refers to
the total sparsity of the weight tensors themselves, which directly relates to the total memory footprint
reduction potential due to sparsity. However, storage sparsity is not a good measure of performance
potential, as some layers may have high sparsity and large weight tensors but contribute only a small proportion of the total network MACs, and vice versa. For example, fully connected layers can contain a
large proportion of the network weights, but each weight only participates in a single multiply, whereas
some filters in convolutional layers have a very small memory footprint, but each weight participates
in thousands of multiplies – in other words, they have high computational intensity. Compute sparsity,
then, is the proportion of total valid multiply operations in which the weight is zero. This metric has a
more direct correlation with performance potential. Indeed, the ‘Potential Speedup’ column of Table 4.2
is calculated as the inverse of compute density, as in Equation 4.1.
\mathrm{potential} = \frac{1}{1 - \mathrm{sparsity}} \qquad (4.1)
This is equivalent to the speedup of a hypothetical machine with a single MAC unit that processes
one weight-activation pair per cycle and never stalls. Of course, any real hardware architecture will
suffer from under-utilization for myriad reasons.
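To make the distinction between the two measures concrete, the short Python sketch below computes both of them, along with the potential speedup of Equation 4.1, from per-layer statistics. The layer numbers in the example are made up for illustration and are not taken from Table 4.2.

    def sparsity_metrics(layers):
        """layers: list of (zero_weights, total_weights, macs_per_weight) per layer."""
        total_weights = sum(w for _, w, _ in layers)
        zero_weights = sum(z for z, _, _ in layers)
        total_macs = sum(w * m for _, w, m in layers)
        zero_macs = sum(z * m for z, _, m in layers)

        storage_sparsity = zero_weights / total_weights   # zeros in the stored tensors
        compute_sparsity = zero_macs / total_macs         # MACs whose weight is zero
        potential = 1.0 / (1.0 - compute_sparsity)        # Equation 4.1
        return storage_sparsity, compute_sparsity, potential

    # A very sparse FC layer (1 MAC per weight) next to a denser conv layer whose
    # weights are each reused 3,000 times: storage sparsity is high, compute sparsity is not.
    print(sparsity_metrics([(50_000_000, 60_000_000, 1), (100_000, 1_000_000, 3_000)]))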
Figure 4.1: Speedup over the DaDianNao++ baseline for a range of front-end configurations (L8⟨6, 1⟩S, L8⟨5, 2⟩S, L8⟨4, 3⟩S, L8⟨3, 4⟩S, L8⟨2, 5⟩S, L8⟨1, 6⟩S, L4⟨1, 2⟩S, T8⟨2, 5⟩S, and X⟨∞, 15⟩) on each network; the lower portion of each stacked bar is the speedup achieved by lookahead alone.
Figure 4.2: Speedup of the T8⟨2, 5⟩ configuration over the baseline, with and without filter shuffling.
We compare the performance of a variety of front-end designs against the baseline DaDianNao++ architecture across our benchmark suite in Figure 4.1. The lower portion of each stacked bar in the figure is the performance achieved by lookahead alone. All results use the scheduling algorithm described in Section 3.3.4.
The figure shows the diminishing returns of increased lookahead, with L8⟨3, 4⟩ having the best geomean performance of the L-shaped designs studied, despite having an ideal performance potential of only 4×, compared to the 7× potential of the L8⟨6, 1⟩ configuration.
The best performing design, T8⟨2, 5⟩, achieves a 2.03× geomean speedup on the networks studied. Though this is significantly less than the 3.39× speedup of the hypothetical X⟨∞, 15⟩ configuration, that configuration represents a prohibitively costly design consisting of a very large crossbar and many activation wires. In this context, the fact that T8⟨2, 5⟩ achieves up to 76% and 82% of the performance of this ideal configuration on GoogLeNet-SS and ResNet50-SS, respectively, at greatly reduced hardware complexity is promising. On average, it achieves 60% of this speedup potential.
Across networks, the speedup of TCT over the baseline varies widely. This is in part due to the
varying levels of compute sparsity. Networks with higher compute sparsity, like AlexNet-ES (86.4%
sparsity), will see a larger speedup than networks with lower compute sparsity, like ResNet50-SS (44.9%
sparsity). The specific pruning algorithm employed can make a large difference to performance as well.
The '-ES' networks of Yang et al. [11] are pruned using 'energy-aware pruning', which takes into account the energy-saving potential of removing each weight during the pruning process. This potential correlates with filter re-use, and thus these networks have a high compute sparsity compared to the '-SS' networks. As
a point of comparison, from Table 4.2 we see that GoogLeNet-SS has a higher storage sparsity (more
zeros in weight tensors) than GoogLeNet-ES (78.0% versus 66.1%), but GoogLeNet-ES manages to have
more compute sparsity than GoogLeNet-SS (74.8% versus 61.6%).
One network with interesting characteristics is MobileNet, which utilizes a type of convolutional layer called the depthwise separable convolution. The depthwise layers contain filters with an effective channel depth of 1, making them perform poorly on DaDianNao, given its channel-first dataflow, which requires a multiple of 16 filter channels for full utilization.
Figure 4.3: Breakdown of execution time normalized to dense time for representative layers of each network.
TCT can benefit these layers because the padding used to lay out the depthwise filters correctly in memory is effectively a form of sparsity that lookaside can promote effectual weights into. Hence, lookahead alone does not provide much speedup (1.03× across all designs evaluated), but lookaside proves very effective. This is an extreme case of how TCT may provide a speedup even for dense networks through zero-padding.
We also explore the effect of filter shuffling on front-end performance, and find that this optimization can yield up to an 18% performance increase (Bi-LSTM) on the studied networks, at no extra
hardware cost. Figure 4.2 shows the performance results due to filter shuffling across networks. For
the best performing network, AlexNet-ES, filter shuffling boosts performance by nearly 10%, increas-
ing its speedup to 2.62×. This optimization always provides a performance increase, even if modest
(2.2% for GoogLeNet-ES). The speedups suggest that in most sparse network layers, sparsity is reason-
ably uniformly distributed, as there isn’t a large disparity between the slowest and the fastest filters to
process.
Figure 4.3 explores the execution time breakdown for representative layers of each neural network,
and for the total network. Multiplier cycles are normalized to dense execution time for each network and
layer, and are categorized into four classes: ineffectual multiplier cycles spent processing ineffectual (zero) weights, effectual weights promoted using lookahead, effectual weights promoted using lookaside, and effectual weights that remained unpromoted, i.e., weights that stay in their original position as in the dense schedule.
The figure also shows the proportion of the original, dense time that was spent processing zero-padding
within filters due to filter sizes not being precisely aligned with the on-chip memory layout. This zero-
padding is shown on the same axis for illustrative purposes, as it serves to show how much of the original
layer sparsity was due to padding. Note that the depthwise (dw) layers of Mobilenet are not sparsified,
but have a large amount of zero padding as they have an effective filter depth of 1, which isn’t well
aligned to the depth-first dataflow used by TCT and DaDianNao. This is evident to a lesser degree in
the ‘Conv 1’ layers of the networks studied, which have some padding due to the fact that the channel
depth of the input to these layers is 3 (RGB input images). This highlights TCT's ability to handle padding and other forms of irregularity in neural networks, potentially benefiting even dense networks.
Table 4.3: Proportion of zero values removed after scheduling.
Related to the execution time breakdown is the amount of sparsity that the TCT scheduler is able to
effectively remove from execution. Though this information is derivable from Figure 4.3, it is explicitly
listed per-network in Table 4.3. TCT manages to remove approximately 2/3 of the ineffectual work due
to zero weights and padding for all of our benchmarks, leaving at most 35.4% of ineffectual work on the
table (GoogLeNet-ES).
Figure 4.4: Effect of the scheduling algorithm on the networks studied, with execution time normalized
to that achieved using Algorithm 2.
also offers little opportunity for the heuristic scheduler to excel, as the greedy algorithm already extracts
nearly all of the potential performance from the network. On the more sparse networks, however, the
heuristic scheduler offers significant improvements, outperforming the greedy approach by up to 28% on
GoogLeNet-SS. On average, the heuristic algorithm achieves a modest 8% performance improvement on
the networks studied. This increases to 14% if we consider only the networks for which the choice of
scheduler has any impact on performance.
Figure 4.5: Performance of multiple interconnect patterns at varying sparsity levels. Solid lines represent
the average speedup, with bands showing the range.
When evaluating the front-end design space and scheduling algorithm, it is useful to understand how
both behave in isolation, and the interplay between them, as sparsity varies. To this end, we perform
sensitivity studies in which the sparsity level is swept from 0% to 90% in 10% increments. At each sparsity
level, 100 sets of 16 filters, each of size 3 × 3 with 512 channels, are randomly generated and simulated
on variants of the front-end design.
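A sketch of how one such randomly sparsified filter set could be generated is shown below; the use of NumPy and the exact tensor layout are assumptions for illustration, since only the positions of the zeros matter to the front-end simulator.

    import numpy as np

    def random_sparse_filters(sparsity, n_filters=16, k=3, channels=512, seed=None):
        """Generate n_filters random k x k filters with `channels` channels, where each
        weight is independently zero with probability `sparsity`."""
        rng = np.random.default_rng(seed)
        shape = (n_filters, channels, k, k)
        keep = rng.random(shape) >= sparsity      # True where the weight is non-zero
        return rng.standard_normal(shape) * keep

    # One of the 100 filter sets generated for the 80% sparsity point.
    filters = random_sparse_filters(0.8, seed=0)
    print(1.0 - (filters != 0).mean())            # approximately 0.8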
Figure 4.6: Performance variation of the T⟨2, 5⟩ design with different scheduling approaches as the sparsity level changes, compared against an L-shaped interconnect using the scheduling algorithm.
Figure 4.5 shows the speedup of various TCT front-end designs, with varying lookahead distance and input multiplexer size, as sparsity is swept. All of the studied designs use a variant of the Trident connectivity shape. The T8⟨2, 5⟩ design is the best performing across all sparsity levels below 0.8, after which point the T8⟨3, 4⟩ design, with its increased lookahead distance, becomes dominant. The additional lookahead does, however, make it a more expensive design. Given that all but one of the networks studied have a sparsity level less than 0.8, this extra hardware cost is not justified. On the other end of the design spectrum, the cheaper T8⟨1, 6⟩ design is substantially slower across most sparsity levels, with the T8⟨2, 5⟩ design being 1.45× faster at a sparsity level of 0.8.
Figure 4.6 illustrates the efficacy of the combination of the scheduling algorithm and the co-designed T8⟨2, 5⟩ interconnect. The greedy scheduler's performance depends on the scan order, which in this experiment is set to target lookaside first. This explains why it performs slightly better at lower levels of sparsity than the heuristic scheduler, which is designed to make more globally optimal scheduling decisions across the search window.
One key observation from these results is that the performance of the front-end is robust to changes in the sparsity distribution, with the range of speedups achieved by the T8⟨2, 5⟩ design never deviating from the average by more than 6%. Additionally, though it may seem unremarkable, it is useful to note that Bit-Tactical never decreases performance below the baseline design – even at 0% sparsity, it achieves the same performance as DaDianNao++. This is not the case for SCNN, which suffers more than a 20% slowdown over its value-agnostic baseline design on dense networks, and only achieves speedups when weight and activation sparsity each surpass 15%. This further validates the design principle of the Bit-Tactical front-end, which uses simple hardware and in doing so avoids potential performance overheads.
Table 4.4: Logic area overhead of TCT with a lookahead of 2 and a varying number of lookaside connections.
  Lookaside Connections:   0       3       5
  Area Overhead:           4.0%    6.8%    8.2%
Table 4.4 shows the logic area overhead of TCT configured with a lookahead of 2, and various numbers
of lookaside connections. The relative cost of lookahead and lookaside can be seen in the marginal cost of
adding lookaside connections. For example, a design with a lookahead of 2 and no lookaside connections
has a 4% area overhead compared to the baseline, whereas adding an additional 3 lookaside connections
only increases the total logic area by 2.8%. The best performing design we evaluate, the T8⟨2, 5⟩, has
a modest area overhead of just 8.2%. Note also that this is only accounting for compute logic area. If
on-chip memory is included, the area overhead of TCT is diluted to just 1.9%, due to the large on-chip
activation memory used. For reference, SCNN’s area overhead compared to their dense baseline design
is 33.9% [18], and Cambricon-X uses 2.11× the area of their baseline design, DianNao [17].
Figure 4.7: Performance of additional interconnect configurations relative to the baseline architecture.
Figure 4.8: Performance as lookaside connections are pruned, with lookahead fixed at 2.
Each custom interconnect is derived by starting from a densely connected design, with connections in place for every position up to a distance of 4 to either side and 2 ahead. Then, for a given network, the scheduler is run, and a tally is kept of how many promotions each interconnect connection makes. After scheduling the entire network, the connection with the lowest tally is removed. This process is repeated until only 7 connections remain in the interconnect, resulting in a design that
uses 8-input muxes and is customized for the given network. The heuristic of removing the connection that participated in the fewest promotions is chosen because it should serve as a proxy for the connection that contributed the least to the performance gain. This heuristic works well for GoogLeNet-ES, ResNet50-SS, Bi-LSTM and MobileNet, leading to a geomean performance improvement of 5.1% over T8⟨2, 5⟩. However, it is not a perfect proxy, as evidenced by the networks for which the custom interconnects perform worse than the T8⟨2, 5⟩ design. The complexities of scheduling are likely responsible for this very slight performance deficit: a connection that does not perform many promotions itself may nevertheless be instrumental in the scheduling decisions that are made. The high relative performance of the T8⟨2, 5⟩ design also highlights its robustness, making it general purpose across networks yet performant for individual networks.
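The following Python sketch captures the pruning loop described above. The schedule_and_tally callable is a placeholder that stands in for running the scheduler over the whole network with a given set of connections and returning how many promotions each connection performed; the names and types here are illustrative, not an interface exposed by the actual tool flow.

    from typing import Callable, Dict, List, Tuple

    Connection = Tuple[int, int]   # (lookahead offset, lane offset), illustrative encoding

    def customize_interconnect(
        schedule_and_tally: Callable[[List[Connection]], Dict[Connection, int]],
        initial_connections: List[Connection],
        keep: int = 7,
    ) -> List[Connection]:
        """Iteratively drop the connection that participated in the fewest promotions."""
        connections = list(initial_connections)
        while len(connections) > keep:
            tally = schedule_and_tally(connections)             # re-schedule the network
            weakest = min(connections, key=lambda c: tally.get(c, 0))
            connections.remove(weakest)                         # prune the least-used connection
        return connections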
For comparison, results for a second hand-designed interconnect, the checkers configuration, are also shown. This design also uses 8-input muxes, with connections coming from every second position in each lookahead row and with the second row offset from the first by one position, so that it resembles the alternating pattern of a checkerboard.
Figure 4.8 shows how performance varies as the number of lookaside connections is decreased. For all configurations, the maximum lookahead distance is 2. The process is similar to the one used to generate the 'Custom' interconnects described above, except that the two lookahead connections are not permitted to be removed. The plot helps to justify the number of lookaside connections used in the T8⟨2, 5⟩ configuration, as 4 connections represents a knee in the curve for most networks: with any fewer connections, performance begins to degrade substantially. For example, going from 4 lookaside connections to 3 degrades performance for AlexNet-ES by 12.4%. The extra cost of the 5th lookaside input is justified because 4 lookaside connections (along with the 2 lookahead connections and the original weight position) would already require a 3-bit mux select signal, which can express 8 possible inputs.
Figure 4.9: Performance of TCTp and TCTe normalized to DaDianNao. The ⟨2, 5⟩ configurations use the Trident interconnect, whilst the other configurations use the L-shaped interconnect.
processing, but uses Booth encoding and the multiplication approach of Albericio et al. [46] to only
process effectual activation terms (powers of two), thus exploiting ‘term sparsity’. Both designs make
use of the fact that the bit-serial back-ends occupy less area than their bit-parallel counterparts, and so
many more PEs can be placed within a similar area budget, resulting in improved performance.
Figure 4.9 shows the speedup of TCTp and TCTe over DaDianNao. The additional benefits of ex-
ploiting EoP and term sparsity prove to be complementary, with the speedups being almost multiplicative
with the speedups from exploiting weight sparsity alone. This motivates improvements to the front-end
architecture, as any additional performance gains will be realized many times over when integrated with
the bit-serial back-ends. The 11.3× geomean speedup of the Trident design integrated with TCTe is almost 5.6× higher than that of the T8⟨2, 5⟩ front-end alone. However, both TCTp and TCTe come
with significant area overheads compared to the front-end alone, with the logic alone occupying 2.95×
and 6.94× the area compared to that of DaDianNao. This is significant compared to TCT’s modest
front-end logic area overhead. Admittedly, when the area of the on-chip memories is included, the impact of the logic overheads is diluted significantly; however, this is true for both TCTp/e and the TCT front-end.
Also shown in Figure 4.10 is the performance of the TCT family of accelerators when taking into
account the off-chip bandwidth provided by various memory technologies. TCT can reach its full po-
tential speedup using a very modest LPDDR3-1600 single channel main memory for all networks except
Bi-LSTM, whose LSTM layers contain many parameters but have little value re-use, making off-chip
traffic a performance bottleneck. Additionally, all networks can be run at a reasonably high frame rate, or equivalently, a low per-inference latency of at most 2.8ms (ResNet50-SS). This corresponds to well over 60 FPS, the highest frame rate at which modern digital video is commonly recorded, making the design suitable for real-time inference in applications such as virtual reality, augmented reality, and self-driving cars.
Figure 4.10: Speedup of TCT, TCTp, and TCTe over DaDianNao++ using the T8⟨2, 5⟩ configuration with various main memory technologies (listed in the legend – x1 means a single memory channel is used). Inference frames per second (FPS) and effective tera-operations per second (TOPS) are also shown on top of each bar for infinite off-chip bandwidth.
4.5 Summary
This chapter has detailed the performance evaluation of TCT on a suite of sparse neural networks. A
range of front-end configurations were evaluated, along with the impacts of various algorithmic opti-
mizations. We show how the co-designed scheduling algorithm can improve performance by up to 28%,
and how filter shuffling can increase performance on an already scheduled network by as much as an
additional 18%. A sensitivity study highlights the robustness of TCT to different sparsity patterns. The
best performing design across the networks studied is the T8⟨2, 5⟩ configuration, which has a geomean
speedup of 2.03× over the baseline, and a maximum speedup with filter shuffling of 2.62×, with an 8.2%
logic area overhead.
Chapter 5
Conclusion
The large computational complexity and memory requirements of modern deep neural networks motivate algorithmic techniques to reduce both. One prevalent technique is weight pruning, which sets a large fraction of network weights to zero, with the goal of reducing the memory footprint and traffic of DNN models whilst creating a large amount of potential for speedup
during inference. This thesis has motivated the need for more efficient approaches to leveraging weight
sparsity in DNN inference. The Bit-Tactical front-end architecture is presented, and a thorough design
space exploration and optimizations have been detailed. By designing and optimizing a lightweight
front-end interconnect, we show how to judiciously leverage weight sparsity in hardware. Novel algorith-
mic optimizations which can improve the performance and reduce synchronization overheads are shown,
including a scheduling algorithm which increases the performance of the front-end design by up to 28%.
Additionally, we present further scheduling and hardware improvements which increase performance and
decrease memory overheads by up to an additional 18%, and by up to 82%, respectively. Combining both
the front-end hardware with the scheduling algorithm and optimizations results in a front-end accelera-
tor design which can achieve up to a 2.62× speedup over a similarly provisioned value-agnostic baseline
design, with just an 8.2% logic area overhead. In addition, despite targeting sparse neural networks, the design presented suffers no performance degradation on dense networks (unlike other sparse accelerators, which do), and may even offer slight performance improvements due to zero values introduced by weight padding. Equivalently, Bit-Tactical's performance is robust
across all sparsity levels, and so encourages weight pruning wherever possible, even if very high sparsity
levels are not attainable. In summary, Bit-Tactical’s novel, pragmatic approach to exploiting weight
sparsity offers a compelling trade-off between hardware complexity and attainable performance that we
hope motivates similar future efforts in value-aware acceleration for ML and other domains.
Exploiting dynamic activation sparsity in addition to weight sparsity is one target for future work that has the potential to provide about a 2× performance increase. However, the challenge with activation sparsity is that it is a dynamic, input-dependent property, and thus the offline scheduling used by Bit-Tactical is not practical. Instead, the novelty of such work would lie in finding a way to perform the value scheduling on-the-fly, in hardware.
Going further, value sparsity is known to appear not just at inference, but also during training.
The ReLU activation function applied during training causes sparsity in both activations and gradients,
providing a large speedup opportunity if a hardware scheduler were to be developed for Bit-Tactical.
On the algorithmic side, pruning approaches that tailor the structure of the weight sparsity to the
underlying hardware are an active field of research [31, 30]. This thesis focuses on generality, and
thus doesn’t impose any structure or constraints on the induced sparsity, instead evaluating on out-of-
the-box sparsified models from other research groups. However, one exciting direction for future work would be to design a pruning algorithm that works in tandem with Bit-Tactical's software scheduler in order to tailor the sparsity distribution to the interconnect. Similarly, co-optimizing the network and the interconnect together would be interesting. These efforts would allow for increased performance at a given sparsity
level, reducing the amount of sparsity that is left on the table by Bit-Tactical currently. Indeed, the
pruning constraints used by the Cambricon-S design [30], which uses depth-first pruning, would lead
to entire rows of weights in TCT’s weight memory being removed, resulting in perfect utilization and
maximum performance potential using lookahead alone (no lookaside would be necessary).
Other avenues for improvement include further optimizing the front-end scheduling algorithm, which
currently uses a heuristic approach. Integer linear programming (ILP) techniques are commonly em-
ployed to achieve a near-optimal schedule in makespan minimization problems. That said, the heuristic
algorithm described in this thesis achieves very close to the maximum achievable speedup for the net-
works studied, as discussed in Section 4.3.4, so it is unlikely that any such improvements would be significant.
Similarly, it may be possible to further optimize the front-end interconnect by employing a hetero-
geneous connectivity pattern, as opposed to the homogeneous connectivity patterns considered in this
thesis in which the relative connectivity is the same for each multiplier lane. A heterogeneous intercon-
nect makes the optimization problem significantly more difficult, increasing the dimensionality of the
problem, and would potentially make scheduling a more challenging task; nevertheless, the performance
gains from such a design could be significant.
Finally, it is an interesting research question to consider how the pragmatic approach of constrained
routing connectivity could be applied to other architectures and organizations. DaDianNao serves as
a strong baseline architecture to prove that the concept of constrained routing works for exploiting
weight sparsity, but there is no reason to believe that this same approach could not be applied to other
organizations, such as systolic array architectures.
Bibliography
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks.
In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, NIPS 25, pages 1097–1105.
Curran Associates, Inc., 2012.
[3] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2808–2817,
July 2017.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
lutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
[5] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural net-
works for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(10):1533–1545, Oct 2014.
[6] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learn-
ing. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd Interna-
tional Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research,
pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.
[7] Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. A convolutional encoder
model for neural machine translation. CoRR, abs/1611.02344, 2016.
[8] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network
models for practical applications. CoRR, abs/1605.07678, 2017.
[9] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. In 4th International Conference on Learning
Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings,
2016.
[10] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for
model compression. arXiv e-prints, page arXiv:1710.01878, Oct 2017.
[11] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide reduced-precision
networks. In International Conference on Learning Representations, 2018.
[13] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for
mobile vision applications. CoRR, abs/1704.04861, 2017.
[14] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural
Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
[15] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain
surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information
Processing Systems 5, pages 164–171. Morgan-Kaufmann, 1993.
[16] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey.
Faster CNNs with Direct Sparse Convolutions and Guided Pruning. In 5th International Conference
on Learning Representations (ICLR), 2017.
[17] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and
Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In Proceedings of the 49th
International Symposium on Microarchitecture, 2016.
[18] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan,
Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. Scnn: An accelerator for
compressed-sparse convolutional neural networks. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pages 27–40, New York, NY, USA, 2017. ACM.
[19] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5(4):115–133, Dec 1943.
[20] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa,
Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao,
Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaem-
maghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg,
John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan,
Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James
Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana
Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy
Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt
Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter,
Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,
Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter
performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pages 1–12, 2017.
[21] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for
raw audio. CoRR, abs/1609.03499, 2016.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object
detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
779–788, June 2016.
[23] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In International Conference on Learning Representations, 2019.
[24] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning
of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, July
2017.
[25] K. Zhang, W. Zuo, and L. Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image
denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, Sep. 2018.
[26] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and
denoising. ACM Trans. Graph., 35(6):191:1–191:12, November 2016.
[27] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional
networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
1646–1654, June 2016.
[28] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable
neural networks. In International Conference on Learning Representations, 2019.
[29] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery
ticket hypothesis at scale. CoRR, abs/1903.01611, 2019.
[30] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. Cambricon-
s: Addressing irregularity in sparse neural networks through a cooperative software/hardware ap-
proach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
pages 15–28, Oct 2018.
[31] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke.
Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In Proceedings of the
44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 548–560, New
York, NY, USA, 2017. ACM.
[32] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural
networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.
[33] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring
the regularity of sparse structure in convolutional neural networks. CoRR, abs/1705.08922, 2017.
[34] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. C-lstm:
Enabling efficient lstm using structured compression techniques on fpgas. In Proceedings of the 2018
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’18, pages 11–
20, New York, NY, USA, 2018. ACM.
[35] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and
Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In 2016
IEEE/ACM International Conference on Computer Architecture (ISCA), 2016.
[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolu-
tional neural networks. In Proceedings of the 25th International Conference on Neural Information
Processing Systems - Volume 1, pages 1097–1105, 2012.
[37] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1–9, June 2015.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556v6, 2014.
[39] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269,
July 2017.
[40] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path
networks. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, NIPS’17, pages 4470–4478, USA, 2017. Curran Associates Inc.
[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
[42] T Chen, Z Du, N Sun, J Wang, C Wu, Y Chen, and O Temam. Diannao: A small-footprint high-
throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international
conference on Architectural support for programming languages and operating systems, 2014.
[43] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on
Computer Architecture, ISCA ’16, pages 367–379, 2016.
[44] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen,
Zhiwei Xu, Ninghui Sun, and O. Temam. Dadiannao: A machine-learning supercomputer. In
Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages
609–622, Dec 2014.
[45] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, and Andreas Moshovos. Stripes:
Bit-serial Deep Neural Network Computing . In Proceedings of the 49th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, MICRO-49, 2016.
[46] Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O’Leary, Roman Genov,
and Andreas Moshovos. Bit-pragmatic deep neural network computing. In Proceedings of the 50th
Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 382–394,
2017.
[47] Z. Du, K. Palem, A. Lingamneni, O. Temam, Y. Chen, and C. Wu. Leveraging the error resilience
of machine-learning applications for designing highly energy efficient accelerators. In 2014 19th Asia
and South Pacific Design Automation Conference (ASP-DAC), pages 201–206, Jan 2014.
[48] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J.
Dally. Eie: Efficient inference engine on compressed deep neural network. In Proceedings of the
43rd International Symposium on Computer Architecture, ISCA ’16, pages 243–254, Piscataway,
NJ, USA, 2016. IEEE Press.
[49] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ar-
davan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture
for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer
Architecture, ISCA ’17, pages 389–402, New York, NY, USA, 2017. ACM.
[50] M. Horowitz. Computing’s energy problem (and what we can do about it). In 2014 IEEE In-
ternational Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb
2014.
[51] Zidong Du, R. Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen,
and O. Temam. ShiDianNao: Shifting vision processing closer to the sensor. In 2015 ACM/IEEE
42nd Annual International Symposium on Computer Architecture (ISCA), pages 92–104, June 2015.
[52] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen.
Cambricon: An instruction set architecture for neural networks. In 2016 IEEE/ACM International
Conference on Computer Architecture (ISCA), 2016.
[53] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and
Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable architecture for accelerating
deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer
Architecture, ISCA ’18, pages 764–775, Piscataway, NJ, USA, 2018. IEEE Press.
[54] E. Park, D. Kim, and S. Yoo. Energy-efficient neural network accelerator based on outlier-aware low-
precision computation. In 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA), pages 688–698, June 2018.
[55] K. Siu, D. M. Stuart, M. Mahmoud, and A. Moshovos. Memory requirements for convolutional
neural network hardware accelerators. In 2018 IEEE International Symposium on Workload Char-
acterization (IISWC), pages 111–121, Sep. 2018.
[56] Michael L. Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer Publishing Company,
Incorporated, 3rd edition, 2008.
[57] Travis E Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.
[58] Naveen Muralimanohar and Rajeev Balasubramonian. Cacti 6.0: A tool to understand large caches.
[59] Micron. Calculating Memory Power for DDR4 SDRAM. Technical Note TN-40-07.
https://www.micron.com/resource-details/868646c5-7ee2-4f6c-aaf4-7599bd5952df, 2017.
[60] Alberto Delmas, Patrick Judd, Sayeh Sharify, and Andreas Moshovos. Dynamic stripes: Exploiting
the dynamic precision requirements of activation values in neural networks. CoRR, abs/1706.00504,
2017.
[61] D. Yu, F. Seide, G. Li, and L. Deng. Exploiting sparseness in deep neural networks for large
vocabulary speech recognition. In 2012 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 4409–4412, March 2012.
[62] Cheng Wang, Haojin Yang, and Christoph Meinel. Image captioning with deep bidirectional lstms
and multi-task learning. ACM Trans. Multimedia Comput. Commun. Appl., 14(2s):40:1–40:20, April
2018.
[63] A. Delmas, S. Sharify, P. Judd, K. Siu, M. Nikolic, and A. Moshovos. DPRed: Making Typical
Activation Values Matter In Deep Learning Computing. ArXiv e-prints, December 2018.