An Efficient Hardware Architecture For Exploiting Sparsity in Neural Networks

Master Thesis

by

Abstract
Sparsity – the presence of many zero values – is a pervasive property of modern deep neural networks, as it
is inherently induced by state-of-the-art algorithmic optimizations. Recent efforts in hardware design for
acceleration of neural networks have targeted the structure of computation of these workloads. However,
when run on these value-agnostic accelerators, value sparsity is not exploited to provide performance or
efficiency benefits, and instead results in wasted computation. In this thesis, we present architectural
optimizations that efficiently leverage value sparsity in network weights in order to achieve significant
performance benefits, with minimal hardware overhead. The culmination of this work is a hardware
front-end (data fetching and staging unit) which, when paired with our novel, co-designed software
scheduling algorithm, achieves more than a 2× speedup on average for the networks studied, with just
an 8.2% overhead in logic area.
Acknowledgements

I’m extremely lucky to have had a battalion of amazing people support me throughout my graduate
studies at the University of Toronto. To my supervisor, Andreas Moshovos, thank you for being the
most friendly, knowledgeable, and sincere supervisor and person I’ve ever had the honour of working
with. To the friends I’ve made here, and the ones I brought with me in spirit from home, thank you for
being a consistent and welcome source of distraction, camaraderie, motivation, and guidance. To my
parents, whose support, in every sense of the word, has been the one constant in my life, thank you for
everything. Finally, thank you Philippa for encouraging me to take on graduate school, knowing that it
would mean being apart for so long, and yet sticking with me nonetheless. This thesis wouldn’t exist
without your support, encouragement, companionship, transatlantic excursions, and daily phone calls.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Value Sparsity in Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Hardware Acceleration for Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 12
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Value-Agnostic CNN ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Value-Aware CNN ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Evaluation 35
4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Front-End Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Effect of Scheduling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Sensitivity to Sparsity Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Area Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.4 Alternative Interconnect Configurations . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Alternative Back-End Designs - TCTp and TCTe . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusion 48
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography 50
List of Tables
List of Figures
4.4 Effect of the scheduling algorithm on the networks studied . . . . . . . . . . . . . . . . . . 41
4.5 Performance of multiple interconnect patterns at varying sparsity levels . . . . . . . . . . 41
4.6 Performance variation of the T⟨2, 5⟩ design with different scheduling approaches as spar-
sity level changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Performance of additional interconnect configurations relative to the baseline architecture. 44
4.8 Performance as lookaside connections are pruned, with lookahead fixed at 2. . . . . . . . . 44
4.9 Performance of TCTp and TCTe normalized to DaDianNao . . . . . . . . . . . . . . . . . 46
4.10 Speedup of TCT, TCTp, and TCTe over DaDianNao++ using the T8⟨2, 5⟩ configuration
with various main memory technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 1
Introduction
Deep learning is a machine learning (ML) technique that has enjoyed widespread attention from industry
and academia in recent years. Deep neural network (DNN) models have emerged as powerful tools in
a wide range of fields in which traditional algorithms have struggled to achieve satisfactory proficiency.
Perhaps the most widely deployed type of DNN model is the convolutional neural network (CNN), which
is dominant in computer vision tasks [1, 2, 3, 4], but has also seen success in fields as varied as speech
recognition [5], reinforcement learning [6], and text translation [7].
From a hardware perspective, the deep neural network models that are used to implement deep
learning represent a compelling workload for acceleration due to their widespread deployment in con-
sumer and commercial settings, along with their unique computational structure and dataflow. The
vast amount of computation required to run modern DNNs during inference (often on the order of
tens of giga-operations (GOPs) [8]), along with their large memory footprint (commonly hundreds of
MBs [8]) also make them a prime target for custom hardware. Indeed, many recent works have tackled
the design and evaluation of hardware architectures for the acceleration of CNN inference processing.
Some seminal works have investigated architectural techniques for efficiently exploiting the structure
and forms of parallelism present in CNNs, with many highly influential application-specific integrated
circuit (ASIC) architectures as well as field-programmable gate array (FPGA) implementations targeting
CNNs, multi-layer perceptrons (MLPs), recurrent neural networks (RNNs), and other DNN types.
Alongside their basic computational structure, neural networks exhibit unique and interesting value
properties – the distribution of values that appears at runtime for these workloads. Certain
value properties can offer further opportunities for optimizing the hardware architectures designed to
run these networks. Reminiscent of classical architectural approaches for exploiting workload character-
istics such as cache hierarchies, which leverage the spatio-temporal locality of memory accesses, much
investigation has been put into leveraging the implicit value properties of neural networks in hardware.
This is an attractive prospect for several reasons, not least of which is that the massive computational
complexity of neural network inference poses problems in terms of latency, energy, and power constraints,
meaning techniques that can reduce this computational complexity are highly valuable. Additionally,
the excessive memory footprint of modern CNNs is problematic for a multitude of reasons, including
high memory transfer latency and energy, and large on-chip memory requirements. This makes model
compression methods commonplace, many of which introduce even more opportunities for value-aware
computation engines.
One value property that has become prevalent in CNNs in recent years due to model compression
efforts is weight sparsity – a phenomenon in which a large proportion of the network weights are equal
to zero. It is well established that DNNs are often severely over-parameterized [9, 10]. This observation
is used to motivate network pruning, in which a large proportion of the weights in a neural network
are set to zero, resulting in sparse weight tensors. The goal of network pruning is to reduce the overall
memory footprint and computational complexity of the model, without significantly affecting network
accuracy – indeed, the pruning can act as a regularizer, sometimes increasing accuracy [11]. One effect
of weight pruning is that it results in many ineffectual computations, which are computations that do
not affect the final output, and thus may be omitted without negatively impacting the accuracy of the
neural network. These ineffectual computations occur in a sparse neural network any time one of the
pruned, zero-valued weights is multiplied with an activation, and the resulting product is accumulated
to an output. However, when run on value-agnostic hardware, weight pruning applied naively will not
confer any computational benefits. Designing hardware that can skip the ineffectual computations due to
weight sparsity is an enticing endeavour, as such hardware would benefit from greatly reduced execution
time when running these networks.
The goal of this thesis is to describe Bit-Tactical (TCT), an area-efficient front-end architecture (a
hardware architecture that performs data fetching and staging) for a hardware accelerator which exploits
weight sparsity as an avenue for performance improvements, along with a design space exploration
and low-level optimizations for this architecture, and a novel scheduling algorithm which improves the
performance of the front-end by up to 28%. The best performing configuration of Bit-Tactical achieves a
2.03× speedup over an equivalently provisioned value-agnostic architecture, whilst incurring just an 8.2%
overhead in terms of logic area. Further, the Bit-Tactical approach does not impose any restrictions
on the distribution of sparsity in DNNs, meaning it requires no additional effort on the part of the ML
developer when training a sparse neural network. Additional optimizations will be described which can
increase TCT’s performance by as much as 18%, without incurring increased hardware costs. Memory
optimizations will also be described which can reduce the memory overheads associated with TCT’s
front-end scheduling metadata by up to 82%.
1.1 Motivation
Given the widespread deployment of DNNs, many recent algorithmic efforts have focused on decreasing
their computational complexity and storage requirements as a means to improve a number of key metrics
associated with their usage, including energy consumption, inference latency, training time, and ability to
be run on resource-constrained hardware. Along with works that employ quantization [12] and efficient
model design [13], many works have explored model pruning as a means of reducing both the memory
footprint of DNNs, as well as the number of Multiply-Accumulate (MAC) operations required during
inference [10, 11, 14, 15, 16]. In spite of these sparsification efforts, weight sparsity does not confer any
performance improvements for neural network inference when run on value agnostic hardware. That is,
without hardware support for replacing the zero-valued weights with non-zero weights, no reduction in
run time is seen. Rather, the zero-valued weights are still processed like any other weight, despite not
contributing to the output of the network.
Figure 1.1 shows how a sparse weight tensor might be processed using simple parallel hardware
with 4 multipliers. Despite significant weight sparsity, the products of the 4 non-zero weights and
Figure 1.1: How a sparse weight tensor is processed on a simple value-agnostic architecture with parallel
multipliers. Zero valued weights are represented by blank positions. Weights and activations of the
same color need to be multiplied together. Activations paired with zero-valued weights are greyed out
as “don’t care” terms. Values scheduled to be processed in the first cycle appear in the ‘Current Cycle’
window.
corresponding activations will still take 3 cycles to compute on the 4 available multipliers. To remedy
this, recent hardware designs have been proposed which take advantage of weight sparsity using dynamic
routing of values to or from multipliers.
There are many ways of using dynamic routing hardware in order to exploit value sparsity. The
primary requirements are that the correct values to be multiplied together must be routed to the correct
multiplier in the correct cycle, and must be accumulated to the correct output. Cambricon-X is one ar-
chitecture which exploits weight sparsity by allowing near-arbitrary packing of non-zero weights, thereby
requiring fully associative logic to route activations to the correct multipliers, similar to the approach
shown in Figure 1.2. This routing has considerable overheads, however, with the Indexing Module logic
that performs this function accounting for over 31% of total die area, and over 34% of on-chip power
consumption [17].
Another approach to exploiting weight sparsity is that taken by SCNN, which is an accelerator
for sparse CNNs that also targets activation value sparsity [18]. This method makes use of the fact
that the product of any two non-zero values within a given channel in a CNN is a useful computation,
provided it is accumulated to the correct output value. The routing of products between multipliers and
accumulators is shown in Figure 1.3, and requires a large crossbar which scales poorly and can account
for over 21% of on-chip area alone.
The large overheads of the fully-associative, dynamic routing required to fully exploit weight sparsity
in pruned neural networks motivates the need for a new approach, which leverages the large speedup
potential of weight sparsity using lightweight hardware. The opportunity to design such a machine comes
from noticing that, as with many challenges in hardware design, previous approaches lose efficiency by
Figure 1.2: The dense dynamic activation routing approach to exploiting weight sparsity. The non-zero
weights of Figure 1.1 are packed densely in memory, requiring fully-associative activation routing.
Figure 1.3: The dynamic output routing approach to exploiting weight and activation sparsity. All
non-zero values of Figure 1.1 are packed densely in memory, and products are routed to the correct
accumulator.
Figure 1.4: The proposed approach for exploiting weight sparsity with limited overheads using small
multiplexers. Extra wires coming from activation memory are used to deliver multiple inputs to each
multiplexer.
attempting to exploit all of the potential performance without regard for hardware complexity. Instead,
the opportunity for efficiency comes from settling for most of the potential performance gains, whilst
overall offering a much more attractive complexity-to-performance ratio.
This thesis proposes Bit-Tactical, a front-end architecture and interconnect designed with the goal
of exploiting most of the available weight sparsity present in DNNs, with minimal hardware overheads.
Figure 1.4 gives an overview of Bit-Tactical’s approach, in which limited routing flexibility provided
by small multiplexers before the input of each multiplier allows activations to be routed to the correct
multipliers, provided we place constraints on how weights can deviate from their original position in the
dense schedule. The hope is that this constrained routing ability will still provide enough flexibility to
extract most of the available performance from weight sparsity, provided it is designed intelligently.
1.2 Contributions
The focus of this thesis is to explore ways of efficiently leveraging value sparsity in neural networks using
minimal hardware, in order to achieve performance improvements without large hardware overheads. To
this end, the primary contributions made in this thesis are as follows:
• A novel front-end interconnect that allows value sparsity in neural networks to be exploited using
a sparse shuffling network, achieving up to a 2.62× speedup over an equivalent value-agnostic
accelerator, with logic area overheads of just 8.2%.
• A heuristic scheduling algorithm co-designed with the front-end interconnect, capable of increasing
front-end performance by up to 28% on the DNNs studied.
• Hardware and dataflow optimizations which reduce the memory overheads of the scheduling meta-
data by up to 82%, and increase performance by up to 18%, respectively.
The majority of the work related to the front-end architecture described in Chapter 3 appears in
Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks,
published in the proceedings of the 24th International Conference on Architectural Support for Program-
ming Languages and Operating Systems, in April 2019. Further results appear in Accelerating Image-
Sensor-Based Deep Learning Applications, published in IEEE Micro, Volume 39, Issue 5, in September
2019.
Chapter 2

Background
This chapter will introduce the background and prior works necessary to understand, and give context
to, the contributions of this thesis. First we will introduce the target workload itself (neural networks
and their derivatives), before outlining the relevant characteristics of the workload that make it a prime
target for hardware acceleration. Finally, a comprehensive survey of related works will place this work
in the context of prior state-of-the-art research in the field.
Figure 2.1: The McCulloch-Pitts model of a neuron, in which the dot product of a vector of input
activations and synaptic weights is passed through a non-linear function f to produce an output
activation [19].
Neural networks are loosely inspired by the operation of biological neurons in the brain, in that the
artificial neurons in a neural network produce a positive output only if the sum of their weighted inputs
exceeds some threshold, as modelled by the McCulloch-Pitts neuron in Figure 2.1 [19]. Neural networks
are built up of layers of artificial neurons, with each layer containing many neurons. Neurons in adjacent
layers are connected to one another by synaptic weights, with the output of one layer becoming the input
of the next layer, and so on. In a fully connected (FC) layer, all output activations are connected by
synapses to every neuron of the following layer. Other forms of inter-layer connectivity are possible.
By building up many layers, a deep neural network is created.
Neural network computation has two phases: training and inference. During training, the synaptic
weights are ‘learned’ by a training algorithm which attempts to maximize the accuracy of the network in
completing some task. Commonly, modern networks are trained by a gradient descent-based algorithm,
with gradients computed by the backpropagation algorithm, which exploits the fact that the network
forms a directed acyclic graph (DAG) to apply automatic differentiation efficiently.
Once a network is trained, its weights and structure are fixed, and it can be deployed to make inferences.
During inference, a neural network takes in an input, computes the activations for every layer and feeds
them forward through the network, eventually producing an output at the last layer. The input and
output depend on the task that the network has been trained for. Commonly, inference is performed on
a single input at a time, rather than ‘batching’ multiple inputs together, as inference is often a latency-
critical task [20]. However, where latency constraints and hardware permit it, performing inference on a
batch of inputs is desirable, as it allows weights to be reused on each of the inputs, reducing the number
of times they have to be read from off-chip. This reduces total inference energy over the batch, as off-chip
transfers are 2 to 3 orders of magnitude more energy intensive and slower than integer multiplies.
Figure 2.2: A CNN composed of 3 convolutional layers, 1 pooling layer, and 1 fully connected layer.
A neural network consists of layers, each of which performs a linear transformation on its input, before
applying an element-wise non-linearity to each output. In CNNs, many of these layers consist of linear
transformations that take the form of convolution operations between the input and a set of filters.
The filters in each layer act as learned feature detectors (hence activations are sometimes referred to as
‘feature maps’). By stacking multiple convolutional layers, as in Figure 2.2, the neural network acts as
a hierarchical feature extractor, in which each layer can detect features that represent more and more
complex patterns in the original input. Inputs may represent any arbitrary information, and can take
the form of RGB images, graphs, sentences, or audio spectrograms, to name a few examples.
Figure 2.3: A convolutional layer, in which an input volume of size X × Y × C is convolved with K
filters, each of size R × S × Ck, to produce an output of size W × H × K. Note that filter channel depth
(Ck) need not necessarily be the same as input channel depth (C). The rectified linear unit (ReLU)
non-linear activation function is applied element-wise to the output of the convolution operation.
The filters in a layer are typically composed of a 4-dimensional tensor of synaptic weights (or param-
eters). Each 3-dimensional filter is convolved with the entire input space to produce an output channel,
as shown in Figure 2.3. Parameters of the convolution include stride (how many elements the filter is
stepped across the input by) and padding (additional zero-valued elements added to the edges of each
input channel to maintain output size).
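To make the effect of these parameters concrete, the short sketch below computes the spatial output dimensions of the convolution in Figure 2.3 using the usual output-size formula; the function name and the example sizes are illustrative only.

def conv_output_size(X, Y, R, S, stride=1, pad=0):
    # Each spatial dimension shrinks by the filter extent, grows with the padding,
    # and is then sub-sampled by the stride.
    W = (X - R + 2 * pad) // stride + 1
    H = (Y - S + 2 * pad) // stride + 1
    return W, H

# A 224x224 input convolved with a 3x3 filter, stride 1 and padding 1,
# keeps its spatial size:
print(conv_output_size(224, 224, 3, 3, stride=1, pad=1))  # (224, 224)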
CNNs can contain layers other than convolutional layers, and commonly contain pooling layers,
which perform down-sampling on the input, either in the form of average pooling or max pooling on
small windows of the input space. Almost all classification neural networks contain an FC layer as a final
classification layer, followed by a ‘softmax’ function. The softmax function is given by Equation 2.1.
Softmax(y_i) = e^{y_i} / Σ_j e^{y_j}    (2.1)
Softmax is used to convert a vector of positive and negative real numbers to a distribution of prob-
abilities, which can be interpreted as the probability of each class being the correct class, given the
input.
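As a minimal illustration of Equation 2.1 (written with NumPy, and using the standard max-subtraction trick for numerical stability, which does not change the result):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))  # stable evaluation of e^{y_i}
    return e / e.sum()         # normalize so the outputs sum to 1

logits = np.array([2.0, -1.0, 0.5])  # arbitrary example class scores
print(softmax(logits))               # approx. [0.786 0.039 0.175], a probability distribution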
It has been shown that it is better, in terms of memory footprint, to train a large neural network and prune it than it is to
train a smaller neural network with a similar number of final non-zero parameters. The intuition behind
the findings of Frankle & Carbin [28] is that there exists a small ‘subnetwork’ within any large DNN
that has been initialized in such a way that it is amenable to training to convergence successfully – called
a ‘winning ticket’ subnetwork – and that is responsible for most of the accuracy of the network. Pruning
reveals these subnetworks without affecting accuracy by removing redundant weights. Training a larger
dense network increases the likelihood of there being a winning ticket subnetwork, thus it will always be
easier to train a large network and prune away dense connections than it is to train a compact, dense
network from scratch. This corroborates the findings of Zhu & Gupta [10], and suggests that weight
sparsity is a value property that will likely continue to pervade neural networks in the future.
Figure 2.4: Example of weight pruning, which deletes weights using a heuristic algorithm, resulting in
a sparse network. If all of the input weights of a neuron are removed, that neuron is effectively removed
as well.
Modern pruning algorithms are implemented either as a post-processing step after the unpruned
network has converged (with retraining to recover any lost accuracy), or as a part of the training
process [10, 9]. A number of heuristics can be used to decide which weights to eliminate, the most
common being the magnitude of the weight’s value [10]. Others remove weights to which the output has
the least sensitivity first [14, 15]; however, computing complex metrics like this is too costly for modern
DNNs. In part because weights are randomly initialized at the start of training, the heuristics used
in pruning result in sparsity that is relatively uniformly randomly distributed throughout the weight
tensors [10]. This leads to irregularities in the network computation, which some works have tried
to address by imposing constraints on how weights can be pruned, forcing them to be eliminated in
groups [30, 31, 32]. These algorithms lead to structured sparsity, where either an entire contiguous group
of weights (e.g., an entire filter channel) are zero, or none of them are. However, though designed to be
more hardware-friendly, in practice these structured pruning techniques are rarely used as they make
training a network to convergence without accuracy loss much more difficult. Mao et al. find that
unstructured sparsity can reach much higher sparsity levels whilst maintaining model accuracy when
compared to pruning at the granularity of filter-rows, filter-channels, or entire filters [33, 34].
Algorithm 1 gives a high-level, generic description of (one class of) pruning algorithms which prune
iteratively during training. The procedure will operate on the network as defined by its weight tensors,
W , the training data inputs X and labels Y , the learning rate LR, the number of training epochs, and
the target sparsity, S. Before each forward pass, a binary mask is applied to the weights to zero-out
pruned weights. The weights are updated using whatever learning algorithm is desired (vanilla gradient
descent is shown in the algorithm). Finally, a new mask is generated (i.e., more weights are pruned) as
a function of the current epoch and the final target sparsity, as the sparsity level is gradually increased
during training. The GenerateMask() function is one of the key defining factors of a pruning algorithm,
and its operation is what distinguishes different pruning approaches. A simple implementation might
sort all elements of W by magnitude and prune the smallest S × (1 − (Epochs − t − 1)/Epochs) fraction
of weights, meaning sparsity scales linearly as training progresses through epochs.
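A minimal sketch of this class of pruning loop is given below, assuming vanilla gradient descent and the linear sparsity ramp described above. The grad_fn argument is a hypothetical stand-in for the backpropagation step, and all names are illustrative rather than taken from any particular pruning framework.

import numpy as np

def generate_mask(W, t, epochs, target_sparsity):
    # Magnitude heuristic: prune the smallest-magnitude weights, ramping the
    # sparsity level up linearly over the training epochs.
    sparsity = target_sparsity * (1 - (epochs - t - 1) / epochs)
    k = int(sparsity * W.size)                    # number of weights to prune
    if k == 0:
        return np.ones_like(W)
    threshold = np.sort(np.abs(W).ravel())[k - 1]
    return (np.abs(W) > threshold).astype(W.dtype)

def prune_train(W, X, Y, lr, epochs, target_sparsity, grad_fn):
    mask = np.ones_like(W)
    for t in range(epochs):
        W = W * mask                      # zero out pruned weights before the forward pass
        W = W - lr * grad_fn(W, X, Y)     # vanilla gradient descent update
        mask = generate_mask(W, t, epochs, target_sparsity)
    return W * mask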
Figure 2.5: Activation sparsity of a 2D activation plane due to the ReLU element-wise non-linearity.
Another form of value sparsity present in modern neural networks is activation sparsity, where a
significant fraction of activation values are equal to zero. Activation sparsity primarily occurs due to
the rectified linear unit (ReLU) activation function, which is applied element-wise to the output of a
convolutional or fully connected layer of a CNN, and clamps all negative values to zero, whilst letting
positive values pass through unaffected, as shown in Figure 2.5. Activation sparsity is typically between
40% and 50% per-layer in a modern CNN [35].
2.2 Hardware Acceleration for Convolutional Neural Networks

CNN inference represents a compelling target for hardware acceleration for several reasons:
• Widespread deployment
• Computationally intensive
• Latency constrained
As such, a large amount of research and development activity has taken place in the CNN accelerator
space in the last few years.
Figure 2.6: An abstract view of hardware architectures for CNN acceleration.
The majority of these designs are inference accelerators, though some more recent works also target
the training phase of CNNs. Seminal works in this space explored how best to exploit the structure of
CNN computation and leverage the large amount of parallelism in the workloads efficiently [42, 43, 44],
however many more recent designs target the unique value properties of CNNs in order to increase the
performance potential and efficiency of the hardware [17, 18, 35, 45, 46, 47, 48]. Thus, most hardware in
the accelerator space can be classified as either value-aware or value-agnostic. Accelerators of both types
share many basic traits. The primary operation in a CNN is the MAC, of which there are potentially
billions per inference – see Table 2.1. All CNN accelerators will therefore support a large amount
of parallel multiply and accumulate throughput in hardware. Figure 2.6 shows the basic structure of
most modern CNN accelerators, the organization of which contains an array of processing elements to
exploit the abundant data parallelism, on-chip buffers and scratchpads to exploit data reuse, and a
high-bandwidth interface to main memory to load inputs and weights without causing a bottleneck.
The exact specification of the PEs and their specific architecture and organization can vary widely,
and is at least partially a function of the desired dataflow – of which there are many possible. As
in Figure 2.7, PEs may be simple MAC units, or even just multipliers as is the case in some systolic
array-based designs [20], or may contain internal buffers or scratchpads with broadcast/multicast con-
nectivity from memory [43], or multiple multipliers with accumulation capability, as in coarse-grained
reconfigurable array (CGRA) style architectures [49]. PEs may also contain additional functional units
to apply activation functions or pooling operations.
Dataflow and memory hierarchy are important aspects of the accelerator design, as together they
define the memory access energy associated with inference, which can be a major contributor to energy
efficiency. Table 2.2 shows the relative energy costs of arithmetic and memory access operations. Nearly
all recent works in this space target 16-bit or 8-bit fixed point arithmetic as a baseline, due to the fact
that modern neural networks suffer little-to-no accuracy loss at this data width compared to their native
32-bit floating point format [47], whilst integer multipliers and adders are many times more area efficient
than their floating-point counterparts.
Table 2.2: Relative energy cost of various operations in 45nm 0.9V CMOS technology [50].
Eyeriss proposes the row-stationary dataflow – in which 1D convolutions of a single row of filter weights and activations
are assigned to a given PE at a time – as an energy efficient option that, when mapped to their PE array,
allows flexibility in the mapping of computations to PEs in order to avoid severe under-utilization. As
with DaDianNao, the authors note that data movement is the leading contributor to energy inefficiency,
and so the Eyeriss design makes extensive use of an on-chip memory hierarchy co-designed with their
dataflow to minimize this energy cost. Implemented on their architecture, the row-stationary dataflow
is between 1.4× and 2.5× more energy efficient than the other dataflows studied.
Google’s Tensor Processing Unit (TPU) evaluation [20] is of particular interest to the architecture
community, as it provides an insight into the measured, rather than modelled, performance and energy
metrics of a DNN inference accelerator in large-scale datacenter deployment running real workloads. The
TPU uses a systolic array architecture, with large on-chip buffers and a large main memory for weight
storage. Compared to modern CPU and GPU servers used in Google datacenters, the TPU provides
between 17×−31× and 14×−25× average performance per Watt, respectively, depending on the DRAM
specification used.
In contrast to Cambricon-S, SCNN [18] does not impose any restrictions on the weight sparsity distribution. Instead, SCNN
exploits convolutional reuse, making use of the fact that, within a given channel, the product of
any weight and any activation is a valid and useful computation (modulo effects of stride and padding).
Therefore, by adopting a channel-wise dataflow (a single channel is processed at a time) and sending only
non-zero values to each PE, ineffectual computations can be almost completely eliminated. However,
this strategy requires calculating which output each product belongs to dynamically, and routing these
products to the correct accumulator within a PE using an over-provisioned crossbar, which is a costly
design approach. Additionally, the design incurs performance overheads for dense (and mostly-dense)
networks, meaning it requires high sparsity to achieve a speedup, and punishes low-sparsity networks.
The crossbar alone accounts for over 21% of PE area, with the over-provisioned accumulator banks ac-
counting for another 29%. Nevertheless, for sufficiently sparse networks, SCNN achieves a 2.7× speedup
and 2.3× energy efficiency improvement over an equivalent dense design. However, the design does not
perform well on FC layers, with at most 25% utilization due to its Cartesian product multiplier array
that broadcasts weights.
The Efficient Inference Engine (EIE) [48] is an architecture that targets weight and activation spar-
sity in networks compressed using the Deep Compression algorithm [9]. The authors target only fully
connected layers, as in older networks they contain the majority of the weights and thus account for the
majority of off-chip traffic and energy. However, in modern networks this effect is less pronounced, and
in both old and new networks, FC layers account for a small fraction of computation time (0.05% of
inference time on ResNet-50 running on a DaDianNao tile), thus speedup potential is extremely limited.
2.4 Summary
This chapter has outlined the key compute and value characteristics of DNNs that are required to under-
stand state-of-the-art techniques in hardware accelerators which target these workloads. Additionally,
relevant hardware research which targets DNN value properties has been described, in order to frame
the contributions of this thesis in the context of appropriate prior work.
Chapter 3

A Low-Overhead Architecture for Sparse Neural Networks
With the explosion of deep learning applications and DNN deployment, ASICs targeting DNN inference
have flourished [18, 20, 42, 43, 44, 48]. The widespread potential for these algorithms, and their compute-
and memory-intensive nature, has made hardware efficiency and performance especially important. Tar-
geting weight sparsity as a means of extracting additional performance during neural network inference
is attractive for a number of reasons. Principally, many recent works have supported the hypothesis that
sparsity is a manifestation of how neural networks inherently learn to classify information, and therefore
it will remain a common value property available for exploitation going forward [28, 29]. Additionally, as
will be shown in Chapter 4, the potential performance improvements to be gained by targeting sparsity
in neural networks are significant, at up to 7.35×.
Figure 3.2: Dataflow used by DaDianNao, showing an example of the N activations and N weights
whose inner product is computed in a single PE each cycle.
Zeros will be inserted in the weight tensor as ‘padding’ to ensure values are aligned correctly, as dictated
by the dataflow.
Though DaDianNao was originally conceived as a multi-node accelerator with enough on-chip eDRAM
to store all weights and activations per-layer on-chip during inference, this design choice is inefficient and
over-provisioned. Instead, we size the activation memory to be large enough to keep input and output
activations on-chip at all times using double buffering, but size our weight memory to store only one
working set of filters at a time, and hide off-chip latency using double buffering, using our previously
proposed heuristics [55]. We refer to this modified baseline design as DaDianNao++.
A multiplication with a zero-valued weight contributes nothing to the output of a convolution. That is to say, acc + w × a = acc if w = 0, leaving acc unchanged. The same is of course
true for zero activations, however as these are far less numerous in sparse networks compared to zero
weights, in general they are less attractive as a prospect for acceleration. However, the relative sparsity
of weights and activations will change on a per-layer basis, and it may be the case that activation sparsity
offers a higher speedup potential for some layers and networks.
Figure 3.3: Naive approach to exploiting weight sparsity. White elements indicate zero weights. Non-
zero weights must be multiplied by the corresponding activation of the same colour. (a) shows how a
value-agnostic machine would compute these multiplies, with poor utilization. (b) shows how non-zero
weights can be packed densely in memory, but activations then require dynamic, arbitrary routing to
the multipliers, which is costly.
Architecturally, exploiting weight sparsity for performance gains implies replacing ineffectual MACs
with effectual MACs. On the weight side, replacing zero-valued weights with non-zero weights in memory
is trivial, as weights are static at run-time. After a network has been trained and pruned, the weights
will not change, so rearranging individual weights in memory using any sparse compression format can
be done offline. However, issues arise due to the fact that the non-zero weights must be paired with
their corresponding activations at run-time. This requires the ability to fetch potentially arbitrary
activation values from the activation stream, implying hardware that facilitates expensive dynamic
routing capabilities, as seen in Figure 3.3. However, as discussed in Section 2.3, this design strategy
leads to expensive crossbars within PEs, which account for a large proportion of area and energy – above
30% for some designs [17]. Similarly, packing both sparse weights and activations densely into PEs in
SCNN requires costly crossbar routing between the multiplier array and accumulators [18]. Therefore,
when trying to exploit sparsity, we would ideally like hardware that can skip the ineffectual multiplications
caused by zero-valued weights, whilst avoiding expensive, fully-associative routing of activations or products.
The following sections will describe and evaluate a mechanism for achieving these goals, along with
optimizations surrounding the mechanism.
Figure 3.4: Example of how lookahead exploits sparsity. Value superscripts denote their multiplier lane
position in the dense schedule, and subscripts represent the cycle each would be processed in using the
dense schedule. w12 is processed in cycle 0 by moving it in memory statically, and selecting a21 as an
input to the correct multiplier that same cycle, as indicated by the red wire.
Figure 3.5: Example of how lookaside exploits sparsity and reduces multiplier lane imbalance. w11 is
processed one cycle early, and by a different multiplier than the dense schedule dictated, by moving it
in memory statically. The corresponding activation movement is indicated by the red wire.
to the input multiplexers of neighbouring multiplier lanes, meaning lookaside is comparatively cheaper
than lookahead. Lookaside also comes with the added benefit of reducing inter-multiplier imbalance, as
non-zero weights that would be serialized to a single multiplier can instead be allocated to neighbouring
multipliers within a PE. With these two weight movement primitives, we can construct many different
frontend interconnect patterns by connecting activation wires to multiplexer inputs. We define lookahead
distance as the number of time steps ahead of the original schedule a weight may advance. Lookahead
distance defines how many rows of activations are broadcast per cycle. Similarly, the number of lookaside
connections determines the number of additional multiplier lanes a weight may appear at in a given
cycle. Note that within this design space, we can recreate fully associative scheduling by setting
the lookahead distance to be as deep as the weight memory, and using all lookaside connections in
the lookahead window. This would obviously be a very costly design, and as we will show would not
represent a competitive cost-reward trade-off compared to a more pragmatic design.
A speedup is achieved when there are no effectual weights scheduled to be processed in a given
cycle (either due to them being processed ahead of their dense cycle, or due to sparsity), and so the
lookahead window can advance by more than one row of activations. This row of ineffectual weights must
span an entire tile of PEs, as that is the granularity at which PEs are synchronized due to activations
being broadcast. This initially appears to be a very limiting constraint on the speedup potential, as
Figure 3.6: How TCT achieves a speedup despite synchronization constraints. Weights promoted to
appear in cycle i leave ’gaps’ in the schedule, leading to more promotion opportunities in cycle i + 1.
the likelihood of all weight lanes across an entire tile containing only zero weights seems small, even
with lookahead and lookaside. However, Figure 3.6 demonstrates how weight promotions leave gaps in
the schedule, which can propagate backwards until a row of zeros appears, at which point the lookahead
window can advance multiple steps, resulting in a speedup. In Figure 3.6, an ineffectual weight appears
in cycle 0 (C0 ). A weight from C1 is promoted, leaving a bubble into which a weight from C2 can be
promoted. This continues until C7 , which now contains only ineffectual weights in all three multiplier
lanes, meaning it can be skipped and the lookahead window can slide onwards by 2. At no point during
execution did weights from more than 2 cycles in the original dense schedule appear at the same cycle
in the new sparse schedule.
Figure 3.7: The Weight Skipping Unit (WSU) which routes activations to the correct multiplier at
runtime.
Figure 3.9: The Activation Select Unit (ASU) which implements the sliding activation window.
The activation wires of the WSU are fed by the Activation Select Unit (ASU), which implements the sliding window required to implement
lookahead and gain a performance advantage. The WSU communicates to the ASU using the Activation
Lane Control field (ALC), a small field stored alongside each row of N weights in weight memory.
For each weight, there are h + 1 activations in the lookahead window, which need to appear in
‘lookahead order’ (from lookahead distance 0 to lookahead distance h) on the wires within the WSU
every cycle for the front-end to work, as the connectivity is fixed. The ASU is detailed in Figure 3.9,
which shows how activations are routed to activation wires, facilitating lookahead and a sliding window
whilst maintaining logical ordering of activations. The ASU buffers activations in the Activation Buffer
Registers (ABRs), of which there are h + 1, before constructing the lookahead window by routing them
through h + 1 (h+1)-to-1 multiplexers, such that they appear on the activation wires in the order
that the WSU requires. Each ABR therefore contains N 16-bit activations – 1 for each multiplier in a
PE. The lookahead multiplexers are controlled by the Activation Control (AC) logic, which keeps track
of which ABR is at the head (i.e., is at a lookahead distance of 0), and uses the ALC field from weight
memory to determine how far ahead to slide the lookahead window each cycle, creating a circular queue.
This circumvents the need to copy data between ABRs when sliding the lookahead window. The ABRs
are fed by the activation buffer, which contains h + 1 banks, each with a dedicated read port. When
sliding the lookahead window, any number of the ABRs can therefore be updated independently.
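A behavioural sketch of this circular queue is shown below; it models only the head-pointer bookkeeping described above. The activation_buffer.next_row() call is a hypothetical stand-in for a read from one of the h+1 activation buffer banks, and is not part of the actual design description.

class ASU:
    def __init__(self, h, N):
        self.h = h
        self.abr = [[0] * N for _ in range(h + 1)]  # h+1 Activation Buffer Registers
        self.head = 0                               # ABR currently at lookahead distance 0

    def read(self, d):
        # The lookahead multiplexers present the ABRs in 'lookahead order':
        # distance d maps to the ABR d places past the head, wrapping around.
        return self.abr[(self.head + d) % (self.h + 1)]

    def advance(self, alc, activation_buffer):
        # Slide the window by 'alc' rows: only the ABRs that fall out of the
        # window are refilled, and the head pointer moves on -- no data is
        # copied between ABRs.
        for i in range(alc):
            self.abr[(self.head + i) % (self.h + 1)] = activation_buffer.next_row()
        self.head = (self.head + alc) % (self.h + 1)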
Figure 3.10: Two potential interconnect patterns. The ‘L’ pattern is contiguous and unidirectional. The
‘Trident’ pattern, denoted ‘T’, is sparse and bidirectional.
limited area budget. Constructing this interconnect means defining, in weight memory-space, a search
window for each multiplier lane within which weights can be stolen, and promoted to that multiplier.
The adder-tree design means we must constrain this search window to be entirely within a single filter
lane, so that all products calculated within a PE contribute to the same output activation.
Naively, for a given lookahead and lookaside distance, denoted ⟨h, s⟩, the search window could be
a contiguous pattern h deep and s wide. An example of this is given in Figure 3.10 with the ⟨2, 5⟩L
pattern, which has a search window that extends 2 steps ahead in time, and across 5 multiplier lanes
to the side. A connectivity pattern is also defined by how many steal locations it allows, which has a
hardware cost reflected in the size of the input multiplexer. For example, both patterns in Figure 3.10
have 7 steal locations, plus the original input location, meaning they require an 8-input multiplexer
before each multiplier. This can be denoted in the L-shaped pattern shorthand as L8⟨2, 5⟩. Note
that lookaside connections wrap around at the edge of the PE, such that a lookaside connection at a
distance of 1 may allow multiplier N − 1 to steal weights from multiplier lane 0 – recall that lookaside
connections are not costly compared to lookahead connections.
A design’s maximum speedup potential is determined by its lookahead distance h, and is given by
Equation 3.1.
Speedup_Max = h + 1    (3.1)
To see why this is the case, consider a design with h = 1. This can at best schedule weights from two
cycles in the original dense schedule to be processed in a single cycle, obtaining a speedup of 2×, before
the sliding window advances by two rows. There is also a maximum potential speedup per-network,
determined by taking the inverse of the network density, e.g., a network that has 75% compute sparsity
Figure 3.11: A toy example of two possible schedules for a machine with 3 weight lanes, and an inter-
connect with lookahead of 1 and lookaside of 1 (biased right). A poor schedule would take two cycles to
process the 3 effectual weights, whereas an optimal schedule can process them in a single cycle.
has a maximum speedup of 1/(1 − 0.75) = 4×. The geometric mean of the compute sparsity across the
networks studied is 69%, corresponding to a 3.2× potential speedup. Given this observation, designs
with a lookahead distance of 2 (i.e., a maximum speedup of 3×) seem to represent a good trade off
between hardware cost and performance potential.
Looking at Figure 3.10, it is clear that for the L shaped interconnect, neighbouring lanes will have a
lot of overlap in their search windows, increasing contention for effectual work. This design also heavily
favours weights at a lookahead distance of 1 over those at a distance of 2, which can only be assigned to
a single possible multiplier. Both of these issues are addressed in the Trident shaped interconnect, which
still has a lookahead distance of 2 and an 8-input multiplexer, but spreads its search window between
the two lookahead rows, resulting in a sparse shuffling network, with little overlap between neighbouring
multiplier lanes in order to minimize contention. These features are by design, as the Trident pattern was
hand-designed by examining the types of issues encountered by the L shaped pattern when scheduling
real weight sparsity distributions with pen and paper. Compared to the L shaped pattern, in which
neighbouring lanes share 5 out of 7 steal locations, neighbouring lanes in the Trident pattern share only
2 out of 7 steal locations.
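The overlap counts above can be reproduced by describing an interconnect as a list of (lookahead, lookaside) offsets, the same representation the scheduling algorithm in Section 3.3 consumes. The sketch below assumes one plausible reading of the L8⟨2, 5⟩ offsets (the straight-ahead positions at lookahead 1 and 2, plus five unidirectional lookaside positions at lookahead 1); the exact Trident offsets are not reproduced here, but would be plugged into the same helper and share only 2 of 7 locations between neighbours.

def steal_window(offsets, lane, num_lanes=16):
    # The set of (lookahead, source lane) positions this multiplier lane can
    # steal weights from; lookaside connections wrap around within the PE.
    return {(la, (lane + ls) % num_lanes) for (la, ls) in offsets}

def neighbour_overlap(offsets, num_lanes=16):
    # How many steal locations two adjacent lanes share (a proxy for contention).
    return len(steal_window(offsets, 0, num_lanes) & steal_window(offsets, 1, num_lanes))

# Assumed L8<2,5> offsets: 7 steal locations in addition to the original position.
L_pattern = [(1, 0), (2, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
print(neighbour_overlap(L_pattern))  # 5 -- neighbouring lanes contend for 5 of their 7 locations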
Figure 3.12: An example of how the scheduling algorithm operates. In this example, lookahead is 1 and
lookaside is 2 (bidirectional, with wraparound connections).
Figure 3.11 demonstrates the need for a scheduling algorithm in cases where there are multiple
multipliers on which a weight could potentially be scheduled. This scheduling problem relates to a form
of the Job Shop Problem, or job shop scheduling, which is known to be NP-hard [56]. In essence, we
are trying to minimize the makespan, where weights are jobs (or units of work) and multipliers are
the job-shop machines. To this end, we note that in Figure 3.11, the key characteristic that makes it
desirable to move w11 to multiplier lane 2 is that w11 is the only weight that lane 2 can possibly process
that cycle. We term this an exclusive promotion location, in that lane 2 is exclusive to w11 (it is the only
weight that can be moved there), despite the fact that multiple lanes (1 and 2) could process w11 . The
scheduling algorithm we design is based on making exclusive promotions first, in order to reduce the
amount of sub-optimal weight movements. This algorithm is co-designed with the interconnect, which
is intended to reduce contention between nearby lanes, resulting in more of these exclusive promotion
opportunities.
An example of how the scheduling algorithm operates is given in Figure 3.12. At each step of the
scheduling algorithm, a tally keeps track of the number of candidate weights that could be promoted to
a given multiplier lane for processing that cycle – these candidates are denoted with red arrows, with the
tally equalling the number of incoming arrows to a given position. The scheduler then makes promotions
to the lanes that have the smallest candidate tally, which in the common case will be equal to 1. The
tally is updated, and the process repeats until there are no candidate weights left, or the multipliers
become fully utilized that cycle.
The heuristic weight scheduling algorithm is described in Algorithm 2. The algorithm is shown for a
single ‘warp’ of filters. A warp is defined as a set of K filters assigned to K PEs simultaneously, before
the next K filters are processed, i.e., PEs are synchronized on a warp boundary. Inputs to the algorithm
include N , which is the number of multipliers per PE, the matrix of weights W , which is assumed to
already be in in-memory layout and has dimensionality R × L, where R is the number of ‘rows’ of
weights to be processed, and is equivalent to the number of cycles the filter would take to process on
DaDianNao++, and where L is the total number of multiplier lanes available in the accelerator, which
for a single tile is equal to k × N . The interconnect is described by a list, I, of (lookahead, lookaside)
8:  procedure schedule(W, I, N)
9:      WS ← 0
10:     for r = 0 : R − 1 do
11:         if containsNonZero(W[r, 0 : L − 1]) then
12:             numCandidates ← countCandidates(W, I, N, r)
13:             while containsNonZero(numCandidates) do
14:                 W, WS ← promote(W, WS, I, N, r, numCandidates)
15:                 numCandidates ← countCandidates(W, I, N, r)
16:         else
17:             deleteRow(W, r)
18:             deleteRow(WS, r)
19:     return W, WS
coordinates, e.g., I = [(1, −1), (1, 0), (1, 1)] would describe an interconnect with a lookahead of 1 and
2 lookaside connections. The algorithm outputs the modified weight matrix, and a matrix of the same
shape containing the mux select signals, WS . The main loop of the scheduling algorithm is contained in
the schedule procedure, from lines 8 – 19. Here, the rows of the weight matrix are iterated through. If a
row contains only zeros, we can delete it from the weight matrix and the schedule signal matrix, which in
hardware implies a speedup (lines 17 and 18). Otherwise, we keep a tally of how many candidate weights
could be promoted to each multiplier lane that cycle, using the countCandidates(...) helper function.
While there continue to be promotions that could be made, the promote(...) function is called and
makes promotions to locations that have a numCandidates tally equal to the minimum non-zero tally.
This process repeats for all rows of weights in weight memory format.
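For illustration, the sketch below applies the exclusive-promotions-first idea to a single PE (one filter lane of N multipliers) rather than to a full tile of k synchronized PEs as Algorithm 2 does; the data layout, tie-breaking, and helper behaviour are simplifying assumptions, not the exact implementation.

import numpy as np

def schedule_pe(W, I):
    # W: (R, N) weight matrix for one filter in in-memory layout.
    # I: interconnect offsets, e.g. [(1, -1), (1, 0), (1, 1)].
    W = W.copy()
    R, N = W.shape
    out_rows, out_sel = [], []
    for r in range(R):
        sel = [None] * N
        while True:
            # Tally the candidate weights that could be promoted into each empty lane.
            candidates = {}
            for lane in range(N):
                if W[r, lane] != 0:
                    continue  # lane already has effectual work this cycle
                cands = [(la, ls) for (la, ls) in I
                         if r + la < R and W[r + la, (lane + ls) % N] != 0]
                if cands:
                    candidates[lane] = cands
            if not candidates:
                break
            # Fill the lane with the fewest candidates first (exclusive promotions).
            lane = min(candidates, key=lambda l: len(candidates[l]))
            la, ls = candidates[lane][0]
            src = (r + la, (lane + ls) % N)
            W[r, lane], W[src] = W[src], 0   # promote the weight in memory
            sel[lane] = (la, ls)             # record the mux select for this position
        if np.any(W[r] != 0):
            out_rows.append(W[r].copy())
            out_sel.append(sel)
        # else: the whole row is ineffectual and is deleted, which is where the speedup comes from
    return np.array(out_rows), out_sel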
For completeness, in Section 4.3.1 we compare against a greedy algorithm that iterates through
each multiplier lane in each cycle in some scan order, and iterates through the search window for each
multiplier (also in some scan order) until it finds a non-zero weight to promote. It performs that weight
promotion before continuing the process for the remaining multiplier lanes. The algorithm is essentially
making the first promotion possible on each iteration, and will be heavily affected by scan order.
The runtime of the scheduler is shown in Table 3.1, with measurements taken on a machine with an
Intel Core i7-8700 processor and 64GB of memory, and the scheduler implemented in the Python
programming language using the NumPy numerical programming package [57]. Note that the
algorithm’s complexity is linear in the number of network weights, and so the runtime is slowest for
networks with large FC layers (AlexNet and Bi-LSTM), but very fast for even reasonably deep CNNs,
such as ResNet-50. In any case, the scheduling time is negligible compared to the time of training a
large DNN, and so is immaterial to the practicality of DNN deployment.
Figure 3.13: How filter shuffling increases performance. (a) Unshuffled warps are slowed down by
‘stragglers’ (highlighted). (b) Shuffled warps reduce synchronization idle time.
Figure 3.14: Example of how filter shuffling (bottom) changes the computation, but results in the same
output as unshuffled filters (top).
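The prose for this optimization is summarized by the captions above, so the following is only an assumption-laden illustration of the idea rather than the thesis's exact shuffling scheme: grouping filters with similar scheduled lengths into the same warp keeps any one 'straggler' filter from stalling its warp.

def form_warps(schedule_lengths, K):
    # schedule_lengths[f]: number of scheduled weight rows for filter f.
    # Sorting by length places similarly-long filters into the same warp.
    order = sorted(range(len(schedule_lengths)), key=lambda f: schedule_lengths[f])
    warps = [order[i:i + K] for i in range(0, len(order), K)]
    # PEs synchronize on warp boundaries, so each warp costs as much as its slowest filter.
    cycles = sum(max(schedule_lengths[f] for f in warp) for warp in warps)
    return warps, cycles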
Figure 3.15: Weight memory interface (a) without and (b) with the Mux Select Table (MST).
Figure 3.16: MST size vs. signal combination coverage. Signal combinations are weighted by reuse.
Figure 3.17: Memory overhead of scheduling metadata normalized to the naive implementation. Results
are for a 32 entry MST and 8-input muxes.
Figure 3.18: Mux signal coverage vs. MST size, with signal combinations unweighted.
Given the limited front-end connectivity, we should expect that a lot of schedule patterns are seen many times within the processing
of a given filter. Indeed, as Figure 3.16 shows, no filter in any network in our benchmark suite requires
more than 53 unique scheduling steps/mux signal combinations – much less than the 2^48 expressible in
the naive design. A 32 entry MST would require only a 5-bit field per group of 16 weights, and would
cover more than 99.5% of combinations for all networks except the two pruned versions of AlexNet (the
oldest and most over-provisioned network we benchmarked). The cumulative distribution is weighted
by filter reuse, which directly corresponds with processing time, meaning that this would imply at most
a 0.5% performance degradation for the newer networks. The performance degradation arises when the
scheduler cannot select one of the 32 MST entries to drive the muxes, and so must revert to the dense
schedule in 0.5% of cycles. In reality, due to the filter synchronization described in Section 3.4, the
performance overhead will likely be even less.
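As a rough illustration of the analysis behind Figure 3.16, the following Python sketch estimates, for a single filter, the fraction of its reuse-weighted scheduling steps that an MST of a given size would cover. The encoding of each scheduling step as a hashable mux-select combination is an assumption made purely for illustration.

    from collections import Counter

    def mst_coverage(step_signals, reuse_counts, mst_size=32):
        """Fraction of scheduling steps covered by the mst_size most common signals.

        step_signals : per-step mux-select combinations (e.g. the concatenated 3-bit
                       WS fields of a group of 16 weights), one entry per step.
        reuse_counts : how many cycles each step is replayed due to filter reuse.
        """
        tally = Counter()
        for signal, reuse in zip(step_signals, reuse_counts):
            tally[signal] += reuse
        total = sum(tally.values())
        covered = sum(count for _, count in tally.most_common(mst_size))
        return covered / total if total else 1.0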
Figure 3.17 shows the relative metadata memory overheads of the naive implementation, which uses a single 3-bit WS signal per weight, and of the MST optimization, which uses a 32-entry, 48-bit-wide MST per PE that is updated with each filter loaded from off-chip. Updating the MST represents an overhead; however, the reduction in the memory overhead of the WS signals greatly outweighs it for all networks but GoogLeNet-ES, which sees only a modest 5% reduction in metadata overhead, as the MST refresh overhead dilutes any improvements.
We also show signal coverage with combinations unweighted in Figure 3.18. These results reflect the raw number of scheduling decisions that an MST of a given size can cover, and are therefore more indicative of the memory traffic savings that are possible. We see that the two AlexNet networks require a larger MST than most networks, partially due to their large, and often quite sparse, FC layers.
3.6 Summary
This chapter has presented the Bit-Tactical front-end architecture and interconnect, which exploits
weight sparsity in neural networks, and has highlighted the modifications made to the baseline DaDi-
anNao architecture to implement this design. The outline of the architecture has also shown precisely
how the hardware is capable of achieving a speedup on sparse neural networks. Alongside the hardware,
the weight scheduling algorithm has also been detailed, demonstrating how TCT’s hardware/software
co-design approach can extract additional performance with limited hardware complexity. Further hard-
ware and software optimizations have been illustrated which can increase the performance and memory
compression achieved by TCT.
Chapter 4
Evaluation
Here, we provide an evaluation of the Bit-Tactical front-end architecture on a variety of neural network
benchmarks. We will first describe the evaluation methodology in Section 4.1, along with the benchmark
suite used for evaluation in Section 4.2. In Section 4.3, we will show how the best front-end design
evaluated achieves a 2.03× geomean speedup across the networks studied. The effect of the scheduling
algorithm, which can improve performance by up to 28%, is discussed in Section 4.3.1. We will also show
how robust this combination of front-end design and scheduler is with a sensitivity study in Section 4.3.2.
We will show the area overheads of the TCT front-end in Section 4.3.3, demonstrating that the design
offers compelling performance-area trade-offs. Finally, for comparison purposes, Section 4.3.4 will show
the performance of various alternative interconnect designs.
4.1 Methodology
The performance of Bit-Tactical is evaluated using a custom cycle-accurate simulator which models the
performance of the front-end and allows the exploration of various interconnect designs and scheduling
algorithms. The simulator also provides detailed performance counters that allow analysis of the results.
Table 4.1: Hardware configuration of the baseline DaDianNao++ design and of TCT.

DaDianNao++ or TCT:
  Tiles: 4                              Filters/Tile: 16
  AS/Tile: 32KB × 32 banks              Weights/Filter (N): 16
  WS/Tile: 2KB × 32 banks               Precision: 16b
  Act. Buffer/Tile: 1KB × (h + 1)       Frequency: 1GHz
  Main Memory: 8GB (various technologies)   Tech Node: 65nm
  Lookahead: 0-4                        Lookaside: 0-6
DaDianNao++ only:
  Peak Compute BW: 2 TOPS    Area: 61.29 mm²    Power: 5.92 W
The hardware configuration for the baseline design and TCT is outlined in Table 4.1. Area and energy
measurements are performed post-layout using representative circuit activity. Layouts are generated for
a TSMC 65nm technology using Cadence Innovus, after synthesis with Synopsys Design Compiler.
SRAMs are modeled via CACTI [58]. Off-chip memory energy consumption is modeled using Micron’s
DDR4 power calculator [59] along with access counts from the cycle-accurate simulations. All designs
operate at 1GHz, with pipelining of the datapath as needed to reach this target frequency. Both TCT
and DaDianNao++ use k = 16 PEs per tile, with N = 16 multipliers per PE, all operating on 16-bit
fixed point inputs. We initially show results assuming sufficient off-chip bandwidth so that no off-chip
stalls occur, but later show the effect of various main memory technologies. We use run-length based
zero compression as in [18] for weights, and fine-grain per group precision as in [60] for activations to
reduce off-chip bandwidth for all layers.
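For reference, a minimal sketch of run-length zero compression in the spirit of [18] is shown below; the exact encoding (field widths, handling of long zero runs) is an assumption for illustration and not necessarily the format used in our simulations.

    def rle_zero_compress(weights, max_run=15):
        """Encode a weight stream as (zeros-skipped, value) pairs.

        Runs of zeros longer than max_run are split by emitting an explicit zero value,
        so the run-length field of every pair fits in a fixed number of bits.
        """
        encoded, run = [], 0
        for w in weights:
            if w == 0 and run < max_run:
                run += 1
            else:
                encoded.append((run, w))
                run = 0
        if run:                                   # flush a trailing run of zeros
            encoded.append((run - 1, 0))
        return encoded

    # Example: two zeros, a 3, three zeros, then a 5.
    assert rle_zero_compress([0, 0, 3, 0, 0, 0, 5]) == [(2, 3), (3, 5)]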
4.2 Benchmarks
For our benchmarks, we take open-source pruned models from Yang et al. [11] (models denoted with
the ‘-ES’ suffix) and from Park et al. [16] (models denoted with the ‘-SS’ suffix). We do not modify
these networks and use them as-is. We also perform magnitude-based threshold pruning, as proposed by [61], on an open-source pre-trained MobileNet v1 and a Bi-LSTM, targeting 75% sparsity for each.
For MobileNet, we only prune the pointwise convolution layers, in line with [10], as these contain 99%
of the network parameters.
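As a reference for how such a threshold can be chosen, the following NumPy sketch zeroes the smallest-magnitude entries of a single weight tensor until a target sparsity is reached. This shows only the thresholding step; the approach of [61], like standard practice, also fine-tunes the network afterwards to recover accuracy, which is not shown here.

    import numpy as np

    def magnitude_prune(weights, target_sparsity=0.75):
        """Zero out the smallest-magnitude weights so that roughly target_sparsity of
        the entries become zero (per-tensor threshold chosen as a quantile of |w|)."""
        threshold = np.quantile(np.abs(weights), target_sparsity)
        return np.where(np.abs(weights) > threshold, weights, 0.0)

    # Example: prune a randomly initialized layer to roughly 75% zeros.
    w = np.random.randn(512, 512)
    print((magnitude_prune(w) == 0).mean())       # approximately 0.75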
Table 4.2 lists the networks studied and their sparsity levels after pruning. Note that we provide
two measures of sparsity, labelled ‘Storage Sparsity’ and ‘Compute Sparsity’. Storage sparsity refers to
the total sparsity of the weight tensors themselves, which directly relates to the total memory footprint
reduction potential due to sparsity. However, storage sparsity is not a good measure of performance
potential, as some layers may have high sparsity and large weight tensors but contribute only a small proportion of the total network MACs, and vice versa. For example, fully connected layers can contain a
large proportion of the network weights, but each weight only participates in a single multiply, whereas
some filters in convolutional layers have a very small memory footprint, but each weight participates
in thousands of multiplies – in other words, they have high computational intensity. Compute sparsity,
then, is the proportion of total valid multiply operations in which the weight is zero. This metric has a
more direct correlation with performance potential. Indeed, the ‘Potential Speedup’ column of Table 4.2
is calculated as the inverse of compute density, as in Equation 4.1.
\mathrm{potential} = \frac{1}{1 - \mathrm{sparsity}} \qquad (4.1)
This is equivalent to the speedup of a hypothetical machine with a single MAC unit that processes
one weight-activation pair per cycle and never stalls. Of course, any real hardware architecture will
suffer from under-utilization for myriad reasons.
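To make the distinction between the two measures concrete, the short Python sketch below computes both of them, along with the potential speedup of Equation 4.1, from per-layer statistics. The layer numbers in the example are made up for illustration and are not taken from Table 4.2.

    def sparsity_metrics(layers):
        """layers: list of (zero_weights, total_weights, macs_per_weight) per layer."""
        total_weights = sum(w for _, w, _ in layers)
        zero_weights = sum(z for z, _, _ in layers)
        total_macs = sum(w * m for _, w, m in layers)
        zero_macs = sum(z * m for z, _, m in layers)

        storage_sparsity = zero_weights / total_weights   # zeros in the stored tensors
        compute_sparsity = zero_macs / total_macs         # MACs whose weight is zero
        potential = 1.0 / (1.0 - compute_sparsity)        # Equation 4.1
        return storage_sparsity, compute_sparsity, potential

    # A very sparse FC layer (1 MAC per weight) next to a denser conv layer whose
    # weights are each reused 3,000 times: storage sparsity is high, compute sparsity is not.
    print(sparsity_metrics([(50_000_000, 60_000_000, 1), (100_000, 1_000_000, 3_000)]))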
Figure 4.1: Speedup over the DaDianNao++ baseline for a range of front-end configurations (L8⟨6, 1⟩S, L8⟨5, 2⟩S, L8⟨4, 3⟩S, L8⟨3, 4⟩S, L8⟨2, 5⟩S, L8⟨1, 6⟩S, L4⟨1, 2⟩S, T8⟨2, 5⟩S, and X⟨∞, 15⟩) on each network; the lower portion of each stacked bar is the speedup achieved by lookahead alone.
Figure 4.2: Speedup of the T8⟨2, 5⟩ configuration over the baseline, with and without filter shuffling.
We compare the performance of a variety of front-end designs against the baseline DaDianNao++ architecture across our benchmark suite in Figure 4.1. The lower portion of each stacked bar in the figure is the performance achieved by lookahead alone. All results use the scheduling algorithm described in Section 3.3.4.
The figure shows the diminishing returns of increased lookahead, with L8⟨3, 4⟩ having the best geomean performance of the L-shaped designs studied, despite having an ideal performance potential of only 4×, compared to the 7× potential of the L8⟨6, 1⟩ configuration.
The best performing design, T8⟨2, 5⟩, achieves a 2.03× geomean speedup on the networks studied. Though this is significantly less than the 3.39× speedup of the hypothetical X⟨∞, 15⟩ configuration, that configuration represents a prohibitively costly design consisting of a very large crossbar and many activation wires. In this context, the fact that T8⟨2, 5⟩ achieves up to 76% and 82% of the performance of this ideal configuration on GoogLeNet-SS and ResNet50-SS, respectively, at greatly reduced hardware complexity is promising. On average, it achieves 60% of this speedup potential.
Across networks, the speedup of TCT over the baseline varies widely. This is in part due to the
varying levels of compute sparsity. Networks with higher compute sparsity, like AlexNet-ES (86.4%
sparsity), will see a larger speedup than networks with lower compute sparsity, like ResNet50-SS (44.9%
sparsity). The specific pruning algorithm employed can make a large difference to performance as well.
The '-ES' networks of Yang et al. [11] are pruned using 'energy-aware pruning', which takes into account the energy-saving potential of removing each weight during the pruning process. This potential correlates with filter re-use, and thus these networks have a high compute sparsity compared to the '-SS' networks. As
a point of comparison, from Table 4.2 we see that GoogLeNet-SS has a higher storage sparsity (more
zeros in weight tensors) than GoogLeNet-ES (78.0% versus 66.1%), but GoogLeNet-ES manages to have
more compute sparsity than GoogLeNet-SS (74.8% versus 61.6%).
One network with interesting characteristics is MobileNet, which utilizes a type of convolutional layer called the depthwise separable convolution. The depthwise layers contain filters with an effective channel depth of 1, making them perform poorly on DaDianNao, given its channel-first dataflow, which requires a multiple of 16 filter channels for full utilization.
Figure 4.3: Breakdown of execution time normalized to dense time for representative layers of each network.
TCT can benefit these layers because the padding used to lay out the depthwise filters correctly in memory is effectively a form of sparsity that lookaside can promote effectual weights into. Hence, lookahead alone does not provide much speedup (1.03× across all designs evaluated), but lookaside proves very effective. This is an extreme case of how TCT may provide a speedup even for dense networks through zero-padding.
We also explore the effect of filter shuffling on front-end performance, and find that this optimization can yield up to an 18% performance increase (Bi-LSTM) on the studied networks, at no extra
hardware cost. Figure 4.2 shows the performance results due to filter shuffling across networks. For
the best performing network, AlexNet-ES, filter shuffling boosts performance by nearly 10%, increas-
ing its speedup to 2.62×. This optimization always provides a performance increase, even if modest
(2.2% for GoogLeNet-ES). The speedups suggest that in most sparse network layers, sparsity is reason-
ably uniformly distributed, as there isn’t a large disparity between the slowest and the fastest filters to
process.
Figure 4.3 explores the execution time breakdown for representative layers of each neural network,
and for the total network. Multiplier cycles are normalized to dense execution time for each network and
layer, and are categorized into four classes: ineffectual multiplier cycles spent processing ineffectual (zero) weights, effectual weights promoted using lookahead, effectual weights promoted using lookaside, and effectual weights that remained unpromoted, i.e., weights that stay in their original position as in the dense schedule.
The figure also shows the proportion of the original, dense time that was spent processing zero-padding
within filters due to filter sizes not being precisely aligned with the on-chip memory layout. This zero-
padding is shown on the same axis for illustrative purposes, as it serves to show how much of the original
layer sparsity was due to padding. Note that the depthwise (dw) layers of Mobilenet are not sparsified,
but have a large amount of zero padding as they have an effective filter depth of 1, which isn’t well
aligned to the depth-first dataflow used by TCT and DaDianNao. This is evident to a lesser degree in
the ‘Conv 1’ layers of the networks studied, which have some padding due to the fact that the channel
depth of the input to these layers is 3 (RGB input images). This highlights TCT's ability to handle padding and other forms of irregularity in neural networks, potentially benefiting even dense networks.
Table 4.3: Proportion of zero values removed after scheduling.
Related to the execution time breakdown is the amount of sparsity that the TCT scheduler is able to
effectively remove from execution. Though this information is derivable from Figure 4.3, it is explicitly
listed per-network in Table 4.3. TCT manages to remove approximately 2/3 of the ineffectual work due
to zero weights and padding for all of our benchmarks, leaving at most 35.4% of ineffectual work on the
table (GoogLeNet-ES).
Figure 4.4: Effect of the scheduling algorithm on the networks studied, with execution time normalized
to that achieved using Algorithm 2.
also offers little opportunity for the heuristic scheduler to excel, as the greedy algorithm already extracts
nearly all of the potential performance from the network. On the more sparse networks, however, the
heuristic scheduler offers significant improvements, outperforming the greedy approach by up to 28% on
GoogLeNet-SS. On average, the heuristic algorithm achieves a modest 8% performance improvement on
the networks studied. This increases to 14% if we consider only the networks for which the choice of
scheduler has any impact on performance.
Figure 4.5: Performance of multiple interconnect patterns at varying sparsity levels. Solid lines represent
the average speedup, with bands showing the range.
When evaluating the front-end design space and scheduling algorithm, it is useful to understand how
both behave in isolation, and the interplay between them, as sparsity varies. To this end, we perform
sensitivity studies in which the sparsity level is swept from 0% to 90% in 10% increments. At each sparsity
level, 100 sets of 16 filters, each of size 3 × 3 with 512 channels, are randomly generated and simulated
on variants of the front-end design.
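A sketch of how one such randomly sparsified filter set could be generated is shown below; the use of NumPy and the exact tensor layout are assumptions for illustration, since only the positions of the zeros matter to the front-end simulator.

    import numpy as np

    def random_sparse_filters(sparsity, n_filters=16, k=3, channels=512, seed=None):
        """Generate n_filters random k x k filters with `channels` channels, where each
        weight is independently zero with probability `sparsity`."""
        rng = np.random.default_rng(seed)
        shape = (n_filters, channels, k, k)
        keep = rng.random(shape) >= sparsity      # True where the weight is non-zero
        return rng.standard_normal(shape) * keep

    # One of the 100 filter sets generated for the 80% sparsity point.
    filters = random_sparse_filters(0.8, seed=0)
    print(1.0 - (filters != 0).mean())            # approximately 0.8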
Figure 4.6: Performance variation of the T⟨2, 5⟩ design with different scheduling approaches as the sparsity level changes, compared against an L-shaped interconnect using the scheduling algorithm.
Figure 4.5 shows the speedup of various TCT front-end designs, with varying lookahead distance and input multiplexer size, as sparsity is swept. All of the studied designs use a variant of the Trident connectivity shape. The T8⟨2, 5⟩ design is the best performing across all sparsity levels below 0.8, after which point the T8⟨3, 4⟩ design, with its increased lookahead distance, becomes dominant. The additional lookahead does, however, make it a more expensive design. Given that all but one of the networks studied have a sparsity level less than 0.8, this extra hardware cost is not justified. On the other end of the design spectrum, the cheaper T8⟨1, 6⟩ design is substantially slower across most sparsity levels, with the T8⟨2, 5⟩ design being 1.45× faster at a sparsity level of 0.8.
Figure 4.6 illustrates the efficacy of the combination of the scheduling algorithm and the co-designed T8⟨2, 5⟩ interconnect. The greedy scheduler's performance depends on the scan order, which in this experiment is set to target lookaside first. This explains why it performs slightly better at lower levels of sparsity than the heuristic scheduler, which is designed to make more globally optimal scheduling decisions across the search window.
One key observation from these results is that the performance of the front-end is robust to changes in the sparsity distribution, with the range of speedups achieved by the T8⟨2, 5⟩ design never deviating from the average by more than 6%. Additionally, though it may seem unremarkable, it is useful to note that Bit-Tactical never decreases performance below the baseline design – even at 0% sparsity, it achieves the same performance as DaDianNao++. This is not the case for SCNN, which suffers more than a 20% slowdown over its value-agnostic baseline design on dense networks, and only achieves speedups when weight and activation sparsity each surpass 15%. This further validates the design principle of the Bit-Tactical front-end, which uses simple hardware and in doing so avoids potential performance overheads.
Table 4.4: Logic area overhead of TCT with a lookahead of 2 and a varying number of lookaside connections.
  Lookaside Connections:   0       3       5
  Area Overhead:           4.0%    6.8%    8.2%
Table 4.4 shows the logic area overhead of TCT configured with a lookahead of 2, and various numbers
of lookaside connections. The relative cost of lookahead and lookaside can be seen in the marginal cost of
adding lookaside connections. For example, a design with a lookahead of 2 and no lookaside connections
has a 4% area overhead compared to the baseline, whereas adding an additional 3 lookaside connections
only increases the total logic area by 2.8%. The best performing design we evaluate, the T8⟨2, 5⟩, has
a modest area overhead of just 8.2%. Note also that this is only accounting for compute logic area. If
on-chip memory is included, the area overhead of TCT is diluted to just 1.9%, due to the large on-chip
activation memory used. For reference, SCNN’s area overhead compared to their dense baseline design
is 33.9% [18], and Cambricon-X uses 2.11× the area of their baseline design, DianNao [17].
Figure 4.7: Performance of additional interconnect configurations relative to the baseline architecture.
Figure 4.8: Performance as lookaside connections are pruned, with lookahead fixed at 2.
Each custom interconnect is derived by starting from a densely connected design, with connections in place for every position up to a distance of 4 to either side and 2 ahead. Then, for a given network, the scheduler is run, and a tally is kept of how many promotions each interconnect connection makes. After scheduling the entire network, the connection with the lowest tally is removed. This process is repeated until only 7 connections remain in the interconnect, resulting in a design that
uses 8-input muxes and is customized for the given network. The heuristic of removing the connection that participated in the fewest promotions is chosen because it should serve as a proxy for the connection that contributed the least to the performance gain. This heuristic works well for GoogLeNet-ES, ResNet50-SS, Bi-LSTM and MobileNet, leading to a geomean performance improvement of 5.1% over T8⟨2, 5⟩. However, it is not a perfect proxy, as evidenced by the networks for which the custom interconnects perform worse than the T8⟨2, 5⟩ design. The complexities of scheduling are likely responsible for this very slight performance deficit: a connection that does not perform many promotions itself may nevertheless be instrumental in the scheduling decisions that are made. The high relative performance of the T8⟨2, 5⟩ design also highlights its robustness, making it general purpose across networks yet performant for individual networks.
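The following Python sketch captures the pruning loop described above. The schedule_and_tally callable is a placeholder that stands in for running the scheduler over the whole network with a given set of connections and returning how many promotions each connection performed; the names and types here are illustrative, not an interface exposed by the actual tool flow.

    from typing import Callable, Dict, List, Tuple

    Connection = Tuple[int, int]   # (lookahead offset, lane offset), illustrative encoding

    def customize_interconnect(
        schedule_and_tally: Callable[[List[Connection]], Dict[Connection, int]],
        initial_connections: List[Connection],
        keep: int = 7,
    ) -> List[Connection]:
        """Iteratively drop the connection that participated in the fewest promotions."""
        connections = list(initial_connections)
        while len(connections) > keep:
            tally = schedule_and_tally(connections)             # re-schedule the network
            weakest = min(connections, key=lambda c: tally.get(c, 0))
            connections.remove(weakest)                         # prune the least-used connection
        return connections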
For comparison, results for a second hand-designed interconnect, the checkers configuration, are also shown. This design also uses 8-input muxes, with connections coming from every second position in each lookahead row and with the second row offset from the first by one position, so that it resembles the alternating pattern of a checkerboard.
Figure 4.8 shows how performance varies as the number of lookaside connections is decreased. For all configurations, the maximum lookahead distance is 2. The process is similar to the one used to generate the 'Custom' interconnects described above, except that the two lookahead connections are not permitted to be removed. The plot helps to justify the number of lookaside connections used in the T8⟨2, 5⟩ configuration, as 4 connections represents a knee in the curve for most networks: with any fewer connections, performance begins to degrade substantially. For example, going from 4 lookaside connections to 3 degrades performance for AlexNet-ES by 12.4%. The extra cost of the 5th lookaside input is justified because 4 lookaside connections (along with the 2 lookahead connections and the original weight position) would already require a 3-bit mux select signal, which can express 8 possible inputs.
Figure 4.9: Performance of TCTp and TCTe normalized to DaDianNao. The ⟨2, 5⟩ configurations use the Trident interconnect, whilst the other configurations use the L-shaped interconnect.
processing, but uses Booth encoding and the multiplication approach of Albericio et al. [46] to only
process effectual activation terms (powers of two), thus exploiting ‘term sparsity’. Both designs make
use of the fact that the bit-serial back-ends occupy less area than their bit-parallel counterparts, and so
many more PEs can be placed within a similar area budget, resulting in improved performance.
Figure 4.9 shows the speedup of TCTp and TCTe over DaDianNao. The additional benefits of ex-
ploiting EoP and term sparsity prove to be complementary, with the speedups being almost multiplicative
with the speedups from exploiting weight sparsity alone. This motivates improvements to the front-end
architecture, as any additional performance gains will be realized many times over when integrated with
the bit-serial back-ends. The 11.3× geomean speedup of the Trident design integrated with TCTe is almost 5.6× higher than that of the T8⟨2, 5⟩ front-end alone. However, both TCTp and TCTe come
with significant area overheads compared to the front-end alone, with the logic alone occupying 2.95×
and 6.94× the area compared to that of DaDianNao. This is significant compared to TCT’s modest
front-end logic area overhead. Admittedly, when the area of the on-chip memories is included, the impact of the logic overheads is diluted significantly; however, this is true for both TCTp/e and the TCT front-end.
Also shown in Figure 4.10 is the performance of the TCT family of accelerators when taking into
account the off-chip bandwidth provided by various memory technologies. TCT can reach its full po-
tential speedup using a very modest LPDDR3-1600 single channel main memory for all networks except
Bi-LSTM, whose LSTM layers contain many parameters but have little value re-use, making off-chip
traffic a performance bottleneck. Additionally, all networks can be run at a reasonably high frame rate, or equivalently, a low per-inference latency of at most 2.8ms (ResNet50-SS). This corresponds to well over 60 FPS, the highest frame rate at which modern digital video is commonly recorded, making the design suitable for real-time inference in applications such as virtual reality, augmented reality, and self-driving cars.
Figure 4.10: Speedup of TCT, TCTp, and TCTe over DaDianNao++ using the T8⟨2, 5⟩ configuration with various main memory technologies (listed in the legend – x1 means a single memory channel is used). Inference frames per second (FPS) and effective tera-operations per second (TOPS) are also shown on top of each bar for infinite off-chip bandwidth.
4.5 Summary
This chapter has detailed the performance evaluation of TCT on a suite of sparse neural networks. A
range of front-end configurations were evaluated, along with the impacts of various algorithmic opti-
mizations. We show how the co-designed scheduling algorithm can improve performance by up to 28%,
and how filter shuffling can increase performance on an already scheduled network by as much as an
additional 18%. A sensitivity study highlights the robustness of TCT to different sparsity patterns. The
best performing design across the networks studied is the T8⟨2, 5⟩ configuration, which has a geomean
speedup of 2.03× over the baseline, and a maximum speedup with filter shuffling of 2.62×, with an 8.2%
logic area overhead.
Chapter 5
Conclusion
The large computational complexity and memory requirements of modern deep neural networks motivate algorithmic techniques to reduce both. One prevalent technique is weight pruning, which sets a large fraction of network weights to zero, with the goal of reducing the memory footprint and traffic of DNN models whilst creating a large amount of potential for speedup
during inference. This thesis has motivated the need for more efficient approaches to leveraging weight
sparsity in DNN inference. The Bit-Tactical front-end architecture is presented, and a thorough design
space exploration and optimizations have been detailed. By designing and optimizing a lightweight
front-end interconnect, we show how to judiciously leverage weight sparsity in hardware. Novel algorith-
mic optimizations which can improve the performance and reduce synchronization overheads are shown,
including a scheduling algorithm which increases the performance of the front-end design by up to 28%.
Additionally, we present further scheduling and hardware improvements which increase performance and
decrease memory overheads by up to an additional 18%, and by up to 82%, respectively. Combining both
the front-end hardware with the scheduling algorithm and optimizations results in a front-end accelera-
tor design which can achieve up to a 2.62× speedup over a similarly provisioned value-agnostic baseline
design, with just an 8.2% logic area overhead. In addition, despite targeting sparse neural networks, the design presented suffers no performance degradation on dense networks (unlike other sparse accelerators, which do), and may even offer slight performance improvements due to zero values introduced by weight padding. Equivalently, Bit-Tactical's performance is robust
across all sparsity levels, and so encourages weight pruning wherever possible, even if very high sparsity
levels are not attainable. In summary, Bit-Tactical’s novel, pragmatic approach to exploiting weight
sparsity offers a compelling trade-off between hardware complexity and attainable performance that we
hope motivates similar future efforts in value-aware acceleration for ML and other domains.
Exploiting dynamic activation sparsity in addition to weight sparsity is one target for future work that has the potential to provide about a 2× performance increase. However, the challenge with activation sparsity is that it is a dynamic, input-dependent property, and thus the offline scheduling used by Bit-Tactical is not practical. Instead, the novelty of such work would lie in finding a way to perform the value scheduling on-the-fly, in hardware.
Going further, value sparsity is known to appear not just at inference, but also during training.
The ReLU activation function applied during training causes sparsity in both activations and gradients,
providing a large speedup opportunity if a hardware scheduler were to be developed for Bit-Tactical.
On the algorithmic side, pruning approaches that tailor the structure of the weight sparsity to the
underlying hardware are an active field of research [31, 30]. This thesis focuses on generality, and
thus doesn’t impose any structure or constraints on the induced sparsity, instead evaluating on out-of-
the-box sparsified models from other research groups. However, one exciting direction for future work would be to design a pruning algorithm that works in tandem with Bit-Tactical's software scheduler in order to tailor the sparsity distribution to the interconnect. Similarly, co-optimizing the network and the interconnect together would be interesting. These efforts would allow for increased performance at a given sparsity
level, reducing the amount of sparsity that is left on the table by Bit-Tactical currently. Indeed, the
pruning constraints used by the Cambricon-S design [30], which uses depth-first pruning, would lead
to entire rows of weights in TCT’s weight memory being removed, resulting in perfect utilization and
maximum performance potential using lookahead alone (no lookaside would be necessary).
Other avenues for improvement include further optimizing the front-end scheduling algorithm, which
currently uses a heuristic approach. Integer linear programming (ILP) techniques are commonly em-
ployed to achieve a near-optimal schedule in makespan minimization problems. That said, the heuristic
algorithm described in this thesis achieves very close to the maximum achievable speedup for the net-
works studied, as discussed in Section 4.3.4, so it is unlikely that any such improvements would be significant.
Similarly, it may be possible to further optimize the front-end interconnect by employing a hetero-
geneous connectivity pattern, as opposed to the homogeneous connectivity patterns considered in this
thesis in which the relative connectivity is the same for each multiplier lane. A heterogeneous intercon-
nect makes the optimization problem significantly more difficult, increasing the dimensionality of the
problem, and would potentially make scheduling a more challenging task; nevertheless, the performance
gains from such a design could be significant.
Finally, it is an interesting research question to consider how the pragmatic approach of constrained
routing connectivity could be applied to other architectures and organizations. DaDianNao serves as
a strong baseline architecture to prove that the concept of constrained routing works for exploiting
weight sparsity, but there is no reason to believe that this same approach could not be applied to other
organizations, such as systolic array architectures.
Bibliography
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks.
In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, NIPS 25, pages 1097–1105.
Curran Associates, Inc., 2012.
[3] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2808–2817,
July 2017.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
lutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
[5] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural net-
works for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(10):1533–1545, Oct 2014.
[6] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learn-
ing. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd Interna-
tional Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research,
pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.
[7] Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. A convolutional encoder
model for neural machine translation. CoRR, abs/1611.02344, 2016.
[8] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network
models for practical applications. CoRR, abs/1605.07678, 2017.
[9] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. In 4th International Conference on Learning
Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings,
2016.
[10] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for
model compression. arXiv e-prints, page arXiv:1710.01878, Oct 2017.
[11] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide reduced-precision
networks. In International Conference on Learning Representations, 2018.
[13] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for
mobile vision applications. CoRR, abs/1704.04861, 2017.
[14] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural
Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
[15] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain
surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information
Processing Systems 5, pages 164–171. Morgan-Kaufmann, 1993.
[16] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey.
Faster CNNs with Direct Sparse Convolutions and Guided Pruning. In 5th International Conference
on Learning Representations (ICLR), 2017.
[17] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and
Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In Proceedings of the 49th
International Symposium on Microarchitecture, 2016.
[18] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan,
Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. Scnn: An accelerator for
compressed-sparse convolutional neural networks. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pages 27–40, New York, NY, USA, 2017. ACM.
[19] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5(4):115–133, Dec 1943.
[20] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa,
Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao,
Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaem-
maghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg,
John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan,
Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James
Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana
Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy
Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt
Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter,
Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,
Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter
performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pages 1–12, 2017.
[21] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for
raw audio. CoRR, abs/1609.03499, 2016.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object
detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
779–788, June 2016.
[23] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In International Conference on Learning Representations, 2019.
[24] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning
of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, July
2017.
[25] K. Zhang, W. Zuo, and L. Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image
denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, Sep. 2018.
[26] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and
denoising. ACM Trans. Graph., 35(6):191:1–191:12, November 2016.
[27] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional
networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
1646–1654, June 2016.
[28] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable
neural networks. In International Conference on Learning Representations, 2019.
[29] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery
ticket hypothesis at scale. CoRR, abs/1903.01611, 2019.
[30] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. Cambricon-
s: Addressing irregularity in sparse neural networks through a cooperative software/hardware ap-
proach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
pages 15–28, Oct 2018.
[31] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke.
Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In Proceedings of the
44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 548–560, New
York, NY, USA, 2017. ACM.
[32] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural
networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.
[33] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring
the regularity of sparse structure in convolutional neural networks. CoRR, abs/1705.08922, 2017.
[34] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. C-lstm:
Enabling efficient lstm using structured compression techniques on fpgas. In Proceedings of the 2018
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’18, pages 11–
20, New York, NY, USA, 2018. ACM.
[35] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and
Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In 2016
IEEE/ACM International Conference on Computer Architecture (ISCA), 2016.
[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolu-
tional neural networks. In Proceedings of the 25th International Conference on Neural Information
Processing Systems - Volume 1, pages 1097–1105, 2012.
[37] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1–9, June 2015.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556v6, 2014.
[39] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269,
July 2017.
[40] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path
networks. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, NIPS’17, pages 4470–4478, USA, 2017. Curran Associates Inc.
[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
[42] T Chen, Z Du, N Sun, J Wang, C Wu, Y Chen, and O Temam. Diannao: A small-footprint high-
throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international
conference on Architectural support for programming languages and operating systems, 2014.
[43] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on
Computer Architecture, ISCA ’16, pages 367–379, 2016.
[44] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen,
Zhiwei Xu, Ninghui Sun, and O. Temam. Dadiannao: A machine-learning supercomputer. In
Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages
609–622, Dec 2014.
[45] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, and Andreas Moshovos. Stripes:
Bit-serial Deep Neural Network Computing . In Proceedings of the 49th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, MICRO-49, 2016.
[46] Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O’Leary, Roman Genov,
and Andreas Moshovos. Bit-pragmatic deep neural network computing. In Proceedings of the 50th
Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 382–394,
2017.
[47] Z. Du, K. Palem, A. Lingamneni, O. Temam, Y. Chen, and C. Wu. Leveraging the error resilience
of machine-learning applications for designing highly energy efficient accelerators. In 2014 19th Asia
and South Pacific Design Automation Conference (ASP-DAC), pages 201–206, Jan 2014.
[48] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J.
Dally. Eie: Efficient inference engine on compressed deep neural network. In Proceedings of the
43rd International Symposium on Computer Architecture, ISCA ’16, pages 243–254, Piscataway,
NJ, USA, 2016. IEEE Press.
[49] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ar-
davan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture
for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer
Architecture, ISCA ’17, pages 389–402, New York, NY, USA, 2017. ACM.
[50] M. Horowitz. Computing’s energy problem (and what we can do about it). In 2014 IEEE In-
ternational Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb
2014.
[51] Zidong Du, R. Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen,
and O. Temam. ShiDianNao: Shifting vision processing closer to the sensor. In 2015 ACM/IEEE
42nd Annual International Symposium on Computer Architecture (ISCA), pages 92–104, June 2015.
[52] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen.
Cambricon: An instruction set architecture for neural networks. In 2016 IEEE/ACM International
Conference on Computer Architecture (ISCA), 2016.
[53] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and
Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable architecture for accelerating
deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer
Architecture, ISCA ’18, pages 764–775, Piscataway, NJ, USA, 2018. IEEE Press.
[54] E. Park, D. Kim, and S. Yoo. Energy-efficient neural network accelerator based on outlier-aware low-
precision computation. In 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA), pages 688–698, June 2018.
[55] K. Siu, D. M. Stuart, M. Mahmoud, and A. Moshovos. Memory requirements for convolutional
neural network hardware accelerators. In 2018 IEEE International Symposium on Workload Char-
acterization (IISWC), pages 111–121, Sep. 2018.
[56] Michael L. Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer Publishing Company,
Incorporated, 3rd edition, 2008.
[57] Travis E Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.
[58] Naveen Muralimanohar and Rajeev Balasubramonian. Cacti 6.0: A tool to understand large caches.
[59] Micron. Calculating Memory Power for DDR4 SDRAM. Technical Note TN-40-07.
https://www.micron.com/resource-details/868646c5-7ee2-4f6c-aaf4-7599bd5952df, 2017.
[60] Alberto Delmas, Patrick Judd, Sayeh Sharify, and Andreas Moshovos. Dynamic stripes: Exploiting
the dynamic precision requirements of activation values in neural networks. CoRR, abs/1706.00504,
2017.
[61] D. Yu, F. Seide, G. Li, and L. Deng. Exploiting sparseness in deep neural networks for large
vocabulary speech recognition. In 2012 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 4409–4412, March 2012.
[62] Cheng Wang, Haojin Yang, and Christoph Meinel. Image captioning with deep bidirectional lstms
and multi-task learning. ACM Trans. Multimedia Comput. Commun. Appl., 14(2s):40:1–40:20, April
2018.
[63] A. Delmas, S. Sharify, P. Judd, K. Siu, M. Nikolic, and A. Moshovos. DPRed: Making Typical
Activation Values Matter In Deep Learning Computing. ArXiv e-prints, December 2018.