Deep Learning Systems
Algorithms, Compilers, and
Processors for Large-Scale Production
Synthesis Lectures on
Computer Architecture
Editor
Natalie Enright Jerger, University of Toronto
Editor Emerita
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page books on topics pertaining to
the science and art of designing, analyzing, selecting, and interconnecting hardware components to
create computers that meet functional, performance, and cost goals. The scope will largely follow
the purview of premier computer architecture conferences, such as ISCA, HPCA, MICRO, and
ASPLOS.
Analyzing Analytics
Rajesh Bordawekar, Bob Blainey, and Ruchir Puri
2015
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015
Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015
Shared-Memory Synchronization
Michael L. Scott
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Deep Learning Systems: Algorithms, Compilers, and Processors for Large-Scale Production
Andres Rodriguez
www.morganclaypool.com
DOI 10.2200/S01046ED1V01Y202009CAC053
Lecture #53
Editor: Natalie Enright Jerger, University of Toronto
Editor Emerita: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
Deep Learning Systems
Algorithms, Compilers, and
Processors for Large-Scale Production
Andres Rodriguez
Intel
Morgan & Claypool Publishers
ABSTRACT
This book describes deep learning systems: the algorithms, compilers, and processor components
to efficiently train and deploy deep learning models for commercial applications.
The exponential growth in computational power is slowing at a time when the amount of
compute consumed by state-of-the-art deep learning (DL) workloads is rapidly growing. Model
size, serving latency, and power constraints are a significant challenge in the deployment of DL
models for many applications. Therefore, it is imperative to codesign algorithms, compilers, and
hardware to accelerate advances in this field with holistic system-level and algorithm solutions
that improve performance, power, and efficiency.
Advancing DL systems generally involves three types of engineers: (1) data scientists that
utilize and develop DL algorithms in partnership with domain experts, such as medical, eco-
nomic, or climate scientists; (2) hardware designers that develop specialized hardware to ac-
celerate the components in the DL models; and (3) performance and compiler engineers that
optimize software to run more efficiently on a given hardware. Hardware engineers should be
aware of the characteristics and components of production and academic models likely to be
adopted by industry to guide design decisions impacting future hardware. Data scientists should
be aware of deployment platform constraints when designing models. Performance engineers
should support optimizations across diverse models, libraries, and hardware targets.
The purpose of this book is to provide a solid understanding of (1) the design, training, and
applications of DL algorithms in industry; (2) the compiler techniques to map deep learning
code to hardware targets; and (3) the critical hardware features that accelerate DL systems.
This book aims to facilitate co-innovation for the advancement of DL systems. It is written for
engineers working in one or more of these areas who seek to understand the entire system stack
in order to better collaborate with engineers working in other parts of the system stack.
The book details advancements and adoption of DL models in industry, explains the train-
ing and deployment process, describes the essential hardware architectural features needed for
today’s and future models, and details advances in DL compilers to efficiently execute algorithms
across various hardware targets.
Unique in this book is the holistic exposition of the entire DL system stack, the emphasis
on commercial applications, and the practical techniques to design models and accelerate their
performance. The author is fortunate to work with hardware, software, data science, and re-
search teams across many high-technology companies with hyperscale data centers. These com-
panies employ many of the examples and methods provided throughout the book.
KEYWORDS
deep learning, machine learning, artificial intelligence, distributed training systems,
inference, accelerators, processors, architectures, compilers, optimizations
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Deep Learning in Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 AI, ML, NN, and DL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Brief History of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Types of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 Unsupervised and Self-Supervised Learning . . . . . . . . . . . . . . . . . . . . . 7
1.4.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Types of Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.3 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.4 Transformer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.5 Graph Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.6 Adversarial Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.7 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.8 Bayesian Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.9 Spiking Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Training and Serving a Simple Neural Network . . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 Memory and Computational Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Hardware Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.9 Software Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.10 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Affine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Recurrent Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.9 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Training a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1 Generalizing from Training to Production Datasets . . . . . . . . . . . . . . . . . . . . . 73
4.2 Weight Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Optimization Algorithms: Minimizing the Cost . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Training Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.1 Training Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Designing a Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.3 Debugging Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5.4 Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6 Transfer Learning Via Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.7 Training with Limited Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 Distributed Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Model Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Pipeline Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Collective Communication Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.1 Moore, Dennard, and Amdahl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Memory and Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2.1 Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.3 Roofline Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4 Processor Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 High-Performance Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.5.1 Physical Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.6 Processors in Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.7 Platforms Strengths and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.8 Evaluating Devices and Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Preface
Many concepts throughout the book are interdependent and often introduced iteratively with
a reference to the section covering the concept in detail. If you are new to this field, read the
introductory chapter in its entirety and each chapter’s introductory section and concluding para-
graph to capture some of the key takeaways. Then, go back and read each chapter in its entirety.
A background in linear algebra, calculus, programming, compilers, and computer architecture
may be helpful for some parts but is not required. The book is organized as follows:
Chapter 1 starts with an introduction to essential concepts detailed throughout the book.
We review the history and applications of deep learning (DL). We discuss various types of
topologies employed in industry and academia across multiple domains. We also provide an
example of training a simple DL model and introduce some of the architectural design consid-
erations.
Chapter 2 covers the building blocks of models used in production. We describe which of
these building blocks are compute bound and which are memory bandwidth bound.
Chapter 3 covers the applications benefiting the most from DL, the prevalent models
employed in industry, as well as academic trends likely to be adopted commercially over the
next few years. We review recommender system, computer vision, natural language processing
(NLP), and reinforcement learning (RL) models.
Chapter 4 covers the training process domain experts should follow to adopt DL algo-
rithms successfully. We review topology design considerations employed by data scientists, such
as weight initialization, objective functions, optimization algorithms, training with a limited
dataset, dealing with data imbalances, and training with limited memory. We also describe the
mathematical details behind the backpropagation algorithm to train models.
Chapter 5 covers distributed algorithms adopted in data centers and edge devices (known
as federated learning). We discuss the progress and challenges with data and model parallelism.
We also review communication primitives and AllReduce algorithms.
Chapter 6 covers the lower numerical formats used in production and academia. These
formats can provide computational performance advantages over the standard 32-bit single-
precision floating-point, sometimes at the expense of lower statistical performance (accuracy).
We also discuss pruning and compression techniques that further reduce the memory footprint.
Chapter 7 covers hardware architectural designs. We review the basics of computer archi-
tecture, reasons for the slower growth in computational power, and ways to partially mitigate
this slowdown. We explain the roofline model and the important hardware characteristics for
serving and multinode training. We also discuss CPUs, GPUs, CGRAs, FPGAs, DSPs, and
ASICs, their advantages and disadvantages, and the prominent DL processors and platforms
available in the market or in development.
Chapter 8 covers high-level languages and compilers. We review language types and ex-
plain the basics of the compilation process. We discuss front-end compilers that transform a
program to an LLVM intermediate representation (IR) and the LLVM back-end compiler. We also
describe the standard compiler optimization passes for DL workloads.
Chapter 9 covers the frameworks and DL compilers. We review in detail the TensorFlow
and PyTorch frameworks and discuss various DL compilers in development.
Chapter 10 concludes with a look at future opportunities and challenges. We discuss the
opportunities to use machine learning algorithms to advance various parts of the DL system
stack. We discuss some challenges, such as security, interpretability, and the social impact of
these systems. We also offer some concluding remarks.
Acknowledgments
The task of writing this book would not have been possible without the support of many indi-
viduals. At the risk of unintentionally leaving some out, I want to thank the following for their
guidance, time, and help discussing and improving the content.
Thanks to Michael Morgan at Morgan & Claypool for inviting me to write this book as
well as Natalie Enright Jerger, Christine Kiilerich, and C.L. Tondo for their continuous en-
couragement and excellent support throughout the writing and publishing process. Thanks to
Parthasarathy Ranganathan for his guidance and for introducing me to Michael and Natalie.
Thanks to Cliff Young, Natalie Enright Jerger, Nikhil Murthy, Nicholas Lee, and Joanne
Yuan for providing detailed comments that were invaluable in improving the entire manuscript.
Thanks to Robert Zak, Wolff Daniel Dobson, Vinay Phegade, Roman Dubtsov, Kreig
DuBose, Matthew Brookhart, Chinnikrishna Kothapalli, Steven (Cap) Rogers, Dheevatsa
Mudigere, Carlos Escapa, Tatiana Shpeisman, Vijayakumar (Kumar) Bhagavatula, Mourad
Gouicem, AG Ramesh, Jian Hui Li, Romir Desai, Vadim Pirogov, Michael Goldsmith, Chris
Browning, Gal Novik, Sherine Abdelhak, Abigail Wen, Md Faijul Amin, Tristan Webb, Marcel
Nassar, Koichi Yamada, Anna Bethke, Ivan Kuzmin, Anahita Bhiwandiwalla, Mariano Phe-
lipp, Brian Retford, Tiffany Shih, Jayaram Bobba, Edward Groden, Anthony Reina, Bin Wei,
Jacek Czaja, Etay Meiri, Luke Ge, Ran Cohen, Derek Osborne, Jayarama Shenoy, Michael
Greenfield, Madhusudhan Rangarajan, Eric Lin, Lei Xia, Albert Hu, Brinda Ganesh, Diego
Caballero, Tatyana Primak, Naveen Mellempudi, Mohammad Ashraf Bhuiyan, Rajesh Poor-
nachandran, Rinat Rappoport, Xin Wang, Yoann Foucher, Pujiang He, Jun Jin, Eric Gardner,
Adam Straw, Scott Cyphers, Brian Golembiewski, Clayne Robison, Sangeeta Bhattacharya,
Ravi Panchumarthy, Patric Zhao, Derssie Mebratu, Anisha Gartia, Shamima Najnin, Rajen-
drakumar Chinnaiyan, Zhenlin Luo, Chris Banyai, Vinod Devarampati, Alex Heinecke, Evarist
Fomenko, Milind Pandit, Lei Shao, Yong Wu, Sameh Gobriel, Andrey Nikolaev, Nawab Ali,
Bernhard Friebe, Nikhil Deshpande, Shashikant Kale, Amir Gholaminejad, Indu Kalyanara-
man, and Greg Leeming for their excellent comments on portions of the book relevant to their
respective expertise, the discussions to ensure correctness, or proofreading early drafts to improve
clarity. Thanks to Roberto Gauna and Travis Belnap for their assistance making and improving
many of the figures.
My biggest thanks goes to Mary-Kathryn for her unwavering friendship and fervent sup-
port, which made the writing of this book possible. And my second biggest thanks to our chil-
dren Isabel, Tomas, and David, who inspire me, and who were patient and forgiving when Papi
was busy writing and could not play. This book is dedicated to you. May the progress in artificial intelligence and deep learning systems make the world for you and your generation a better
place.
Disclaimer: The views expressed in the book are my own and do not represent those of the
Intel Corporation, previous employers, or those acknowledged. Details regarding software and
hardware products come from publicly disclosed information, which may not represent the latest
status of those products.
Andres Rodriguez
October 2020
CHAPTER 1
Introduction
A deep learning (DL) model is a function that maps input data to an output prediction. To im-
prove the accuracy of the prediction in complex tasks, DL models are increasingly requiring more
compute, memory, bandwidth, and power, particularly during training. The number of compu-
tations required to train and deploy state-of-the-art models doubles every 3.4 months [DH18].
The required computation scales at least as a fourth-order polynomial with respect to the accu-
racy and, for some tasks, as a ninth-order polynomial [TGL+20]. This appetite for more com-
pute far outstrips the compute growth trajectory in hardware and is unsustainable. In addition,
the main memory bandwidth is becoming a more significant bottleneck; computational capacity
is growing much faster than memory bandwidth, and many algorithms are already bandwidth
bound.
The evolution of computational growth is driving innovations in DL architectures. Im-
provements in transistor design and manufacturing no longer result in the previously biennial
2× general-purpose computational growth. The amount of dark silicon, where transistors can-
not operate at the nominal voltage, is increasing. This motivates the exploitation of transistors
for domain-specific circuitry.
Data scientists, optimization (performance) engineers, and hardware architects must col-
laborate on designing DL systems to continue the current pace of innovation. They need to be
aware of the algorithmic trends and design DL systems with a 3–5 year horizon. These designs
should balance general-purpose and domain-specific computing and accommodate for unknown
future models.
The characteristics of DL systems vary widely depending on the end-user and operating
environment. Researchers experimenting with a broad spectrum of new topologies (also known
as DL algorithms or neural networks) require higher flexibility and programmability than en-
gineers training and deploying established topologies. Furthermore, even established topologies
have vastly different computational profiles. For instance, an image classification model may
have a compute-to-data ratio three orders of magnitude higher than that of a language transla-
tion model.
A mixture of specialized hardware, higher bandwidth, compression, sparsity, smaller nu-
merical representations, multichip communication, and other innovations is required to satisfy
the appetite for DL compute. Each 2× in performance gain requires new hardware, compiler,
and algorithmic co-innovations.
Advances in software compilers are critical to support the Cambrian explosion in DL
hardware and to effectively compile models to different hardware targets. Compilers are essen-
tial to mitigate the cost of evaluating or adopting various hardware designs. A good compiler
generates code that executes efficiently and quickly. That is, the generated code takes ad-
vantage of the computational capacity and memory hierarchy of the hardware so the compute
units have high utilization. Several efforts, detailed in Chapter 9, are ongoing toward making
this possible.
The purposes of this book are (1) to provide a solid understanding of the design, training,
and applications of DL algorithms, the compiler techniques, and the critical processor features
to accelerate DL systems, and (2) to facilitate co-innovation and advancement of DL systems.
In this chapter, we introduce the fundamental concepts detailed throughout the book. We
review the history, applications, and types of DL algorithms. We provide an example of training
a simple model and introduce some of the architectural design considerations. We also introduce
the mathematical notation used throughout parts of the book.
Figure 1.1: Deep learning is a subset of neural networks, which is a subset of machine learning,
which is a subset of artificial intelligence.
While each inference pass requires far less compute than training, the total compute spent on inference on a given model dwarfs that of training over the
entire life span of the model.
Serving is typically latency bounded. Product recommendations, search results, voice as-
sistant queries, and pedestrian detection in autonomous vehicles require results within a prespec-
ified latency constraint. Thus, during serving, only one or a few samples are typically processed
to meet the latency requirement. Effectively parallelizing the serving computations for one data
sample across a large number of cores is challenging. For this reason, GPUs (and CPUs to a
lesser extent) suffer from poor compute utilization during serving. There is an opportunity for
hardware architects to design better low-latency processors and minimize idle compute cycles,
detailed in Chapter 7.
Serving in data centers typically happens on CPUs due to their higher availability, higher
core frequency, and higher compute utilization for small batches. Given the parallelization chal-
lenges when using one data sample, fewer faster cores in a CPU may be advantageous over many
slower cores in a GPU. Using more cores can further reduce the latency at the expense of lower
core utilization (due to the core-to-core communication overhead). However, as models grow
and require more compute, some companies are transitioning to GPUs or experimenting with
dedicated processors for inference. In addition, low-power (smaller) GPUs or GPUs with virtualization reduce the number of cores allocated to a workload, which improves core utilization.
Figure 1.2: A machine learning algorithm maps the input data to a space or manifold where the
data can be classified with a linear classifier. Source: [Wik11] (CC BY-SA 4.0).
AI is any program or system that can learn, act, or adapt. The recent popularity of AI comes
from advances in ML algorithms, specifically in DL. An ML model is a program that learns a
function that maps the input data (or features extracted from the input data) to a desired output.
Geometrically, this mapping is from a vector space where the data is not linearly separable to
a vector space where the data is linearly separable, as illustrated in Figure 1.2. These vector
spaces are formally known as Hilbert spaces or manifolds. The mapping function or statistical
performance (accuracy) of the model usually improves with more data.
NN models, also called artificial neural networks (ANNs), are typically composed of sim-
ple nonlinear functions, called layers, stacked together to represent complex functions or map-
pings. Stacking multiple linear functions results in one linear function that can be represented
with one layer, and would negate the benefit of multilayer mappings. Thus, the need for non-
linear functions. DL models, sometimes called deep neural networks (DNNs), are NN models
with more than three layers and are end-to-end differentiable. Traditional machine learning
(non-NN ML) models and NN models with 1–3 layers are also called shallow models.
A difference between traditional ML and most of DL is that traditional ML relies on do-
main experts to specify key features to use for the given task. In contrast, DL typically learns
these features at the expense of requiring more data and compute. For decades, computer vision
experts spent significant efforts studying image features to improve detection [FGM+10]. DL
practitioners with limited computer vision expertise demonstrated that NNs were able to learn
features with increasing complexity at each layer and outperform state-of-the-art techniques in
image detection and classification tasks [KSH12].
DL models are particularly advantageous, although requiring much more data and com-
pute, over traditional ML models for workloads where the relationship between features cannot
be reasonably approximated, such as with visual, text, or speech data. Traditional ML mod-
els continue to be popular with tabular or structured data where the feature relationships can
be approximated, for instance, using a Bayesian model to encode the hierarchical relationships
manually (do not worry if you are unfamiliar with Bayesian models) [DWO+19].
1.3 BRIEF HISTORY OF NEURAL NETWORKS
NNs were popularized in the 1960s and used for binary classification. Their popularity dimin-
ished in the 1970s when NNs did not deliver on the hype. Interest in NNs increased in the
mid-1980s when the backpropagation algorithm (detailed in Chapter 4) was rediscovered, fa-
cilitating the training of multilayer NNs to learn more complex classifiers. In the mid-1990s,
most of the AI focus shifted toward support vector machines (SVMs), a class of ML algorithms
with theoretical performance bounds. The NN community refers to the 1970s as the first AI
winter and the mid-1990s to early 2000s as the second AI winter due to the limited funding of
and progress in NNs (these should be called NN winters since AI is bigger than NNs).
During the past decade, there has been a revived interest as NNs have vastly outperformed
other techniques, particularly for vision and natural language processing tasks. This recent suc-
cess is due to faster and cheaper hardware, more massive datasets, improved algorithms, and
open-source software [SAD+20]. Researchers from competing companies often publish their
algorithms and training methodologies (but typically not their trained models or datasets); thus,
they build on each other’s knowledge and accelerate progress.
Examples of supervised learning tasks with input data and labels are (task: input data → label):
• Image classification: pixels → class label of the object in an image
• Image detection: pixels → bounding box around each object in an image and the class label of those objects
• Machine translation: sentence in the source language → sentence in the target language
• Regression analysis: house size, local school rating → price of the house
Figure 1.4: One-hot vector. All entries are zero except a one at the vector entry corresponding
to the word.
this sparse vector to a small and dense vector representation. Other examples are learning dense
vector representations for persons in a social network and products in a large catalog. These dense
vector representations are often the inputs into a supervised learning model.
Figure 1.5: A computation graph takes a tensor input and produces a tensor output. Serving
involves typically one forward propagation. Training involves numerous forward and backward
iteration cycles.
Figure 1.6: An MLP with four layers: the input layer, two hidden layers, and the output layer.
This model maps the 784 pixel values to a probability distribution over 10 possible classes.
Figure 1.7: A CNN model with several layers maps the input image to a probability distribution
across multiple possible labels. Source: [Wik15] (CC BY-SA 4.0).
Figure 1.8: An RNN topology can be represented as an FFNN topology with the same weights
W across all the layers.
The RNNs inputs and outputs can vary in length, unlike in MLP and CNN models. For
instance, in machine translation, the input and output sentences have a different number of
words. An RNN can be unrolled and represented as an FFNN sharing the same weights across
the layers, as shown in Figure 1.8.
RNN models can be stacked with multiple layers and also bidirectional, as shown in Fig-
ure 1.9. The main building block of an RNN model is a recurrent unit that captures a represen-
tation or “memory” of the aggregated relevant data from previous steps. There are three main
types of RNN models depending on the type of recurrent units they use: vanilla RNN, LSTM,
and GRU units, detailed in Section 2.5. In the literature, the term RNN denotes either a vanilla
RNN or, more broadly, these three types of models. In this book, when referring to a vanilla
RNN model, we explicitly use the term vanilla RNN to prevent confusion. LSTM and GRU
models are usually favored over vanilla RNN models for their superior statistical performance.
RNN models have two significant challenges: (1) capturing the dependencies in long se-
quences and (2) parallelizing the computation (due to the sequential nature where the output at a
given timestep depends on the previous timestep). Using attention layers, detailed in Section 2.8,
mitigates these challenges. Concatenating multiple sequential outputs from the first layer in the
stack and passing those as inputs to the second layer in the stack improves the computational
efficiency in a model with multiple layers [HSP+19].
Figure 1.10: A GNN operates on graphs rather than tensors. This GNN has four layers, an
input, output, and two hidden layers. Based on [Jad19].
common in many applications, such as in social networks to represent persons and their connec-
tions, in molecular biology to represent atoms and bonds, in recommender systems to represent
users, items, and ratings, in telecommunications to represent networks, and in drug discovery to
represent the compound structure and protein-enzyme interactions. Graphs of graphs are also
common; one example is web document classification with a graph of web documents where
the edges are the hyperlinks, and each node is a graph with XML-formatted elements for each
document. GNNs provide the structure to learn and make predictions on graphs, often with
sparsely labeled data. Given the sparse representation of the adjacency matrix in GNNs, it is
beneficial to advance work in nonsequential memory access retrieval to accelerate GNNs.
GNNs were introduced in 2009 and have recently seen astronomical growth in
academia [SGT+09]. Given the many real-world graph applications, rapid growth in the in-
dustry over the next few years is expected. Large-scale recommender systems, such as Pinter-
est’s PinSage, already use GNNs [YKC+18]. Hyperscalers are developing platforms, such as
Alibaba’s AliGraph, Microsoft’s NeuGraph, and Amazon’s Deep Graph Library (DGL) to fa-
cilitate GNN industry adoption [ZZY+19, MYM+19, WVP+19]. PyTorch Geometric (PyG)
is primarily targeting academic research [FL19].
Figure 1.11: A generative adversarial network has a discriminator and a generator network that
compete with each other. Based on [Gha17].
set. The discriminator evaluates the candidates as authentic or synthetic (generated). The gen-
erator’s objective is to increase the error rate of the discriminator. It generates data to fool the
discriminator into classifying it as authentic. The discriminator is initially trained with a training
dataset. Then it is tuned as it competes with the generator. As the model trains, the generated
data becomes more authentic-like, and the discriminator improves at recognizing synthetic data.
Yann LeCun, likely the most prominent DL scientist, described GANs as “the coolest
idea in machine learning in the last twenty years” [Lec16]. GANs were initially proposed for
unsupervised learning and are now used across all types of learning. Wasserstein GAN
(WGAN) improves the stability of learning the model, and Weng provides a detailed explana-
tion of the mathematics used in WGAN [ACB17, Wen17].
GANs are also used to model physics-based simulations in particle physics and cosmology, reducing the simulation time by orders of magnitude [PdO+18, RKL+18]. Section 3.2.5 discusses various uses of GANs for image generation.
1.5.7 AUTOENCODER
An AE is a class of unsupervised learning topology that learns a low-dimensional representation
(an encoding) of the input. The AE learns to reconstruct the input data in the output layer and
uses the output of the bottleneck layer (usually the middle-most layer) as the low-dimensional
representation. The number of units typically decreases in each layer until the bottleneck layer,
as shown in Figure 1.12.
AEs are used (1) as a compact (compressed) representation of the input data; (2) as a pre-
processing step to a classification problem where the data is first encoded and then passed to the
classifier; (3) in a data matching problem by comparing the encoding of two data samples; (4) to
denoise data by learning a mapping from a noisy input to a clean output; and (5) as a generative model to generate data using the decoder (known as a variational autoencoder (VAE)).
Figure 1.12: An autoencoder learns to reconstruct the input data in the output layer. The output of the bottleneck layer is often used as a low-dimensional representation of the input.
4. Backpropagate the gradient of the cost with respect to each layer’s weights and activations.
6. Return to Step 2, or stop if the validation error is less than some threshold or is not de-
creasing.
During training, the dataset is processed in batches. The completion of a cycle through
steps 2–6 for a batch is called an iteration, and each cycle through the entire training dataset is
called an epoch. For instance, if the dataset has 1M samples and a batch has 100 samples, it takes
10K iterations to complete an epoch.
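A one-line check of this arithmetic (illustrative only):

```python
num_samples, batch_size = 1_000_000, 100
iterations_per_epoch = num_samples // batch_size
print(iterations_per_epoch)  # 10,000 iterations per epoch
```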
Training a model may require tens of epochs to learn a good set of weights. After training,
the validation (also called out-of-sample) performance is measured using a validation dataset.
The validation dataset contains labeled data not used during training and should be as similar
as possible to the serving data the model encounters when deployed. The performance on this
validation dataset is a good indicator of the performance in deployment and helps to determine if
the model overfits the training dataset. Overfitting occurs when a model learns features unique
to the training data and, therefore, does not generalize to data outside the training dataset.
Regularization techniques to mitigate overfitting are discussed in Section 4.1.
During serving, the model processes a micro-batch. The data is propagated forward
through the network to compute the output. Serving is also known as inference since the model
is inferring the label of the data sample. Step 2 above is inference; that is, inference is a step
in the training process but usually with a smaller batch size and some optimizations specific to
serving.
The following example illustrates the training process. The task is to classify handwritten
digits from the MNIST dataset using an MLP model [LBB+98]. Figure 1.13 shows a small
subset of the 70,000 gray-scaled 28×28 pixel images in the MNIST dataset. Typically with
MNIST, 60,000 images are used for training and 10,000 images are used for validation. In
practice, a CNN model would be a better choice for image classification, but a simple MLP
model is used to introduce some fundamental concepts.
Each layer in the MLP is composed of units (neurons) that linearly combine the weighted
outputs or activations from the previous layer plus a bias weight, as shown in Figure 1.14 for
one unit. The output from this affine transformation is passed to a nonlinear activation function
g(·). An activation function refers to the nonlinear function, an activation input is the input to
the activation function, and an activation (short for activation output) refers to the output of
an activation function. Common activation functions are the rectified linear unit (ReLU) and
variants of ReLU, the sigmoid and its generalization, the softmax, and the hyperbolic tangent
(tanh), which are all detailed in Section 2.1.
Figure 1.13: Examples from the MNIST dataset. Each digit image has 28×28 pixels.
Source: [Wik17] (CC BY-SA 4.0).
Figure 1.14: A neural unit at layer $(l+1)$ applies a nonlinear transformation or function to the weighted sum of the activations from the previous layer $(l)$.
The MLP model used for this digit classification task, shown in Figure 1.6, has 784 units
in the input layer (Layer 0) corresponding to the number of pixel values in each image. The
output layer has 10 units corresponding to the probability distribution of the possible 0–9 labels.
This MLP has two hidden layers with 128 and 32 units, respectively. The choice for the number
of hidden layers and the number of units in each layer requires experimentation. In Section 4.5,
we discuss techniques to choose an appropriate topology.
To train the model, the 28×28 image pixel values are reordered as a 784×1 vector and
normalized to zero-mean and unit-norm (the benefits of normalization are explained in Sec-
tion 2.6). This is the input to the NN and can be thought of as the activations of Layer 0. The
input $z_i^{(1)}$ to unit $i$ in Layer 1 is the weighted sum of the activations of Layer 0 plus a bias. The activation $a_i^{(1)}$ of unit $i$ is a nonlinear transformation of the unit's activation input $z_i^{(1)}$:

$$a_i^{(1)} = g\left(z_i^{(1)}\right) = \max\left(0, z_i^{(1)}\right),$$

where $g(\cdot)$ is the ReLU activation function, and

$$z_i^{(1)} = \sum_{k=0}^{783} w_{ik}^{(0)} x_k + b_i^{(0)}$$

is the output of the affine transformation (also known as the activation input in Layer 1), where $x_k$ represents the $k \in [0, 783]$th pixel value. In this example, the activation functions are ReLU
for Layers 1 and 2, and softmax for the output layer. The ReLU function zeros out negative
values and keeps the positive values unchanged. The softmax function is used in the output layer
to map a vector of values to a probability distribution where the values are all between 0 and 1
and sum to 1. The $i$th output value can be computed as follows:

$$\hat{y}_i = \frac{\exp\left(z_i^{(3)}\right)}{\sum_{k=0}^{9} \exp\left(z_k^{(3)}\right)},$$

where $\hat{y}_i$ represents the probability the input image corresponds to class $i$. There is no bias term
in a softmax layer.
This softmax output is compared with the ground truth. For this task, the ground truth
is a one-hot vector with the nonzero index corresponding to the correct class label. The cross-
entropy loss is:
$$-\sum_{k=0}^{9} y_k \log(\hat{y}_k),$$

where $\log$ represents the natural logarithm (log base-$e$), $y_k$ is 1 if the sample belongs to class $k \in [0, 9]$ and 0 otherwise, and $\hat{y}_k$ is the model's prediction (as a probability) that the sample belongs
to class $k$. Figure 1.15 depicts the expected and actual output for a sample image corresponding to the digit 4. In the figure, the model's output $\hat{\mathbf{y}}$ incorrectly indicates digit 8 is the most likely
inferred interpretation. Additional training iterations are needed to reduce this loss.
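To make the forward pass and loss computation concrete, below is a minimal NumPy sketch of this 784-128-32-10 MLP for a single sample. The weights are random placeholders rather than trained values, so the predicted probabilities are roughly uniform; the structure, not the numbers, is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights for the 784-128-32-10 MLP (not trained values).
W0, b0 = rng.standard_normal((128, 784)) * 0.01, np.zeros((128, 1))
W1, b1 = rng.standard_normal((32, 128)) * 0.01, np.zeros((32, 1))
W2 = rng.standard_normal((10, 32)) * 0.01           # no bias in the softmax layer

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))                        # numerically stable softmax
    return e / np.sum(e)

x = rng.standard_normal((784, 1))                    # one normalized input image
y = np.zeros((10, 1)); y[4] = 1.0                    # one-hot label for digit 4

a1 = relu(W0 @ x + b0)                               # Layer 1 activations
a2 = relu(W1 @ a1 + b1)                              # Layer 2 activations
y_hat = softmax(W2 @ a2)                             # output class probabilities

loss = -np.sum(y * np.log(y_hat))                    # cross-entropy loss
print(y_hat.ravel(), loss)
```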
The gradients of the cost with respect to all the layers’ activations and weights are com-
puted using the chain rule from the last layer and moving backward layer by layer toward the first
layer. Hence, the name backpropagation. The gradients provide a measurement of the contribu-
tion of each weight and activation to the cost. In practice, all of the activations for a given batch
and a given layer are simultaneously computed using matrix algebra. For these computations,
data scientists use software libraries optimized for the particular hardware target.
During training, the activations are saved for the backpropagation computations. There-
fore, hardware for training requires a larger memory capacity than hardware for inference. The
required memory is proportional to the batch size.
Figure 1.15: A batch of size 1 containing a sample image of the digit 4 is passed through the model. The actual output $\hat{\mathbf{y}}$ and the expected output (ground truth) $\mathbf{y}$ are used to compute the cost $J(\mathbf{w})$. The model performs poorly in this example and predicts digit 8 with 40% probability and digit 4 with 10% probability. The cross-entropy loss is $-\log(0.1)$.
$$N_g = N_{a_{L_1}} + N_{a_{L_2}} = (128 + 32)N = 160N.$$
Thus, the total memory requirement for training, using 4 bytes for each value, is:

$$TM = (N_w + N_a + N_g) \times 4 = (104{,}934 + 1114N) \times 4 = 419{,}736 + 4456N.$$
Assuming a batch of $N = 128$, the required memory for training is 1.0 MB.
The total memory requirement for inference, using 4 bytes for each value, is:

$$TM = (N_w + N_a) \times 4 = (104{,}934 + (784 + 128)N) \times 4 = 419{,}736 + 3648N.$$
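As a quick check, the snippet below simply evaluates these closed-form expressions for a batch of N = 128:

```python
# Evaluate the closed-form memory expressions above for a batch of N = 128.
N = 128
train_bytes = 419736 + 4456 * N      # (N_w + N_a + N_g) * 4
infer_bytes = 419736 + 3648 * N      # (N_w + N_a) * 4
print(train_bytes / 1e6)             # ~0.99 MB, i.e., about 1.0 MB for training
print(infer_bytes / 1e6)             # ~0.89 MB for inference
```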
Figure 1.16: Typical CNNs, MLPs, RNNs, and embeddings differ by orders of magnitude in
storage, operational intensity, and memory access irregularities. Based on [Haz20].
efficiency or utilization (the percentage of used compute cycles vs. the total compute capacity)
is low for workloads bottlenecked by bandwidth (also known as bandwidth bound), and adding
more compute capacity does not improve their performance. Keeping the data close to the com-
pute can alleviate this bottleneck. In order of decreasing access time and increasing die area,
the storage types are nonvolatile memory (flash memory, magnetic disk), DRAM (HBM2/E,
GDDR6, DDR4, LPDDR4/5), SRAM (scratchpad, cache), and registers. DRAM is often
called main memory and SRAM local memory.
The design of a balanced platform is complicated by the spectrum of workloads with di-
verse compute, memory, and bandwidth requirements. For instance, the CNNs, MLPs, RNNs,
and embeddings used at Facebook (and similar at other hyperscalers) differ by orders of magni-
tude in these requirements, as shown in Figure 1.16 [Haz20]. Operational intensity is a measure
of the number of operations performed per byte read from memory. The last level cache (LLC)
miss rate, measured in misses per 1000 instructions (MPKI), is a standard metric to analyze
the local memory (SRAM)’s efficient use and can be a metric for the irregular memory access
patterns of a workload.
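As an illustration of operational intensity (a made-up example, not a measurement from the figure), the sketch below estimates the ops-per-byte ratio of an fp32 matrix multiplication, counting each multiply-accumulate as two operations and assuming each input and output element moves between memory and the compute units exactly once:

```python
# Hypothetical operational-intensity estimate for an (M x K) by (K x N) fp32 GEMM,
# assuming each input/output element is read or written exactly once.
def gemm_operational_intensity(M, K, N, bytes_per_element=4):
    ops = 2 * M * N * K                                   # multiply + add per MAC
    bytes_moved = (M * K + K * N + M * N) * bytes_per_element
    return ops / bytes_moved

print(gemm_operational_intensity(1024, 1024, 1024))       # ~170 ops/byte
print(gemm_operational_intensity(1, 1024, 1024))          # ~0.5 ops/byte (bandwidth bound)
```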
The numerical format is another design consideration that can impact the computational
(speed) performance and statistical (accuracy) performance. Figure 1.17 shows various numerical
formats, detailed in Section 6.1. A numerical representation with fewer bytes can improve the
number of operations per cycle and reduce power consumption but may result in lower statistical
performance. Training uses single-precision floating-point (fp32) with half-precision floating-
point (fp16) and bfloat16 (bf16) rapidly gaining adoption. Inference uses fp16 and bf16 with 8-
bit integer (int8) gaining adoption for some applications. A research area is developing numerical
representations that can better represent values using 8 bits, such as fp8, discussed in Section 6.1,
and can be efficiently implemented in silicon. Other techniques to reduce the memory and
bandwidth requirements are increasing the sparsity and compressing the data.
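As a small illustration of how a 16-bit format trades precision for footprint, the sketch below truncates fp32 values to bfloat16 precision by zeroing the lower 16 bits (simple round-toward-zero; production hardware typically rounds to nearest even):

```python
import numpy as np

def fp32_to_bf16_truncate(x):
    """Zero the low 16 bits of each fp32 value, keeping the sign bit, the 8
    exponent bits, and the top 7 mantissa bits (the bf16 fields), then return
    the result widened back to fp32 for easy inspection."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 0.001, 65504.0], dtype=np.float32)
print(x)
print(fp32_to_bf16_truncate(x))   # roughly 2-3 decimal digits of precision remain
```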
Figure 1.17: Numerical formats. Green is the sign bit. Brown are the exponent bits. Blue are the mantissa bits.
A MAC unit computes the product of two values and aggregates the result to a running sum of products. The numerical format of the output (the accumulation) may be different
from the input. Computations involving dot products, such as in matrix multiplications and
convolutions, typically use MACs. When describing MAC units, the notation used is MAC-
input-format → MAC-accumulate-format. For instance, int8 → int32 means the int8 values are
multiplied and accumulated as int32 values. Accumulating values in a large numerical format
mitigates numerical overflows.
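The toy sketch below mimics an int8 → int32 MAC: the int8 operands are widened before the multiply-accumulate so the running sum does not overflow the narrow input format (sizes and values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=1024, dtype=np.int8)   # int8 activations
w = rng.integers(-128, 128, size=1024, dtype=np.int8)   # int8 weights

acc = np.int32(0)
for x, y in zip(a, w):
    acc += np.int32(x) * np.int32(y)     # widen, then multiply-accumulate in int32

# Vectorized equivalent: widen to int32 before the dot product.
assert acc == np.dot(a.astype(np.int32), w.astype(np.int32))
print(acc)
```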
Different hardware usages have different requirements. Table 1.1 shows the high-level
requirements for common usages by hyperscalers: topology design, training established produc-
tion models (Trn. Prod.), data center inference (Inf. DC), and edge inference (Inf. Edge). In
the table, format refers to the number of bits to represent the weights and activations. Train-
ing requires more memory and bandwidth than inference to transfer and store the activations.
Another use case not shown in Table 1.1 is for hardware design, which requires reconfigurable
hardware (for example, FPGAs) or hardware simulators.
• operating systems.
The primary software stack design goals are ease-of-use and high performance across various
models and hardware devices.
A deployment and training management system facilitates taking a model across the
pipeline stages: data preparation, topology exploration, experiment tracking, model packaging,
at-scale model deployment, and retraining. The management system is designed to meet the
needs of the data scientist and the infrastructure team. It provides a collaborative and secure
environment, and access to the latest ML libraries, such as TensorFlow and PyTorch.
At the core of the software stack are compilers to transform the programmer’s high-level
code into executable code that runs efficiently on a target device. Frameworks and inference
engines (IEs), such as TensorFlow, PyTorch, OpenVINO, and TensorRT, provide a high-level
abstraction to the operators used across DL models. They use graph optimizers (either built-
in or external) to optimize the model. The framework’s scheduler relies on low-level DL and
Math libraries, such as oneDNN (formerly called Intel MKL-DNN), Nvidia cuDNN, Eigen,
or OpenBLAS, or in tensor compilers for optimizations to standard DL functions. Frameworks
also have a code generation path to supplement these libraries with other compilers, such as
LLVM.
The ISA defines the operators, data types, and memory management for an abstract com-
puter architecture. A particular implementation of an ISA is called a microarchitecture. For in-
stance, Intel and AMD CPUs use the x86 or x86-64 ISA across different microarchitecture
implementations and CPU generations. Programs are binary compatible across all microarchi-
tecture implementations of a particular ISA. Different microarchitectures can have different
properties that can affect their performance, such as instruction latencies and cache hierar-
chies. A specific microarchitecture can be available in various flavors with different frequencies
and cache sizes.
The operating system (OS) manages all the hardware and software in a compute device;
it allocates hardware resources, such as compute and memory, to the software applications. An
overview of operating systems is beyond the scope of this book.
Chapter 8 introduces programming languages and compiler techniques, and Chapter 9
details the prevalent DL graph and tensor compilers. Chapter 10 highlights higher-level plat-
forms used by hyperscalers to manage training and deployment.
1.10 NOTATION
This section references the notation used throughout this book to represent input data, labels,
weights, affine transformations, activations, and outputs. Recall that the compute operations
in training and serving boil down to multiplications and additions. Linear algebra is used to
represent groups of multiplications and additions as a single matrix-matrix or matrix-vector or
vector-vector operation. While helpful, a background in linear algebra is not required; the reader
can overlook the equations without a significant impact on the other parts of the book.
In DL literature, the output from an affine transformation can be equivalently represented
as either
$$z_j^{(l+1)} = \sum_{i=0}^{D^{(l)}-1} w_{ji}^{(l)} a_i^{(l)} + b_j^{(l)}$$

or as

$$z_j^{(l+1)} = \sum_{i=0}^{D^{(l)}} w_{ji}^{(l)} a_i^{(l)},$$

where, in the second form, the bias is folded into the weight matrix as an additional weight multiplying a constant activation of 1.

• $w_{ji}^{(l)} \in \mathbf{W}^{(l)}$: weight from output $i$ in Layer $l$ to input $j$ in Layer $l+1$, where $i \in [0, D^{(l)}-1]$ and $j \in [0, D^{(l+1)}-1]$

• $\mathbf{a}^{(l)} = g(\mathbf{z}^{(l)}) \in \mathbb{R}^{D^{(l)}}$: activations of Layer $l \in [0, L-1]$

• $\mathbf{z}^{(l+1)} = \mathbf{W}^{(l)}\mathbf{a}^{(l)} + \mathbf{b}^{(l)} = [\mathbf{W}^{(l)}\ \mathbf{b}^{(l)}][\mathbf{a}^{(l)}; 1]$, where $[\mathbf{W}^{(l)}\ \mathbf{b}^{(l)}]$ represents a matrix with $\mathbf{b}^{(l)}$ right-appended to matrix $\mathbf{W}^{(l)}$, and $[\mathbf{a}^{(l)}; 1]$ represents a vector with a 1 bottom-appended to vector $\mathbf{a}^{(l)}$

• $\mathbf{X} = [\mathbf{x}^{[0]}, \ldots, \mathbf{x}^{[N-1]}] \in \mathbb{R}^{D^{(0)} \times N}$

• $\mathbf{Y} = [\mathbf{y}^{[0]}, \ldots, \mathbf{y}^{[N-1]}] \in \mathbb{R}^{M \times N}$

• $\hat{\mathbf{Y}} = [\hat{\mathbf{y}}^{[0]}, \ldots, \hat{\mathbf{y}}^{[N-1]}] \in \mathbb{R}^{M \times N}$

• $\mathbf{Z}^{(l)} = [\mathbf{z}^{(l)[0]}, \ldots, \mathbf{z}^{(l)[N-1]}] \in \mathbb{R}^{D^{(l)} \times N}$

• $\mathbf{A}^{(l)} = [\mathbf{a}^{(l)[0]}, \ldots, \mathbf{a}^{(l)[N-1]}] \in \mathbb{R}^{D^{(l)} \times N}$
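A minimal NumPy sketch of this notation with made-up layer sizes, showing that the batched affine transformation and its bias-folded form produce the same $\mathbf{Z}^{(l+1)}$:

```python
import numpy as np

# Made-up sizes: D_l = 4 input units, D_lp1 = 3 output units, N = 5 samples.
D_l, D_lp1, N = 4, 3, 5
W = np.random.randn(D_lp1, D_l)      # W^(l), one row per output unit
b = np.random.randn(D_lp1, 1)        # b^(l)
A = np.random.randn(D_l, N)          # A^(l), one column per sample

Z = W @ A + b                        # Z^(l+1) = W^(l) A^(l) + b^(l) (bias broadcast)

# Equivalent form with the bias folded into the weight matrix: [W b][A; 1]
Z_folded = np.hstack([W, b]) @ np.vstack([A, np.ones((1, N))])
assert np.allclose(Z, Z_folded)
```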
CHAPTER 2
Building Blocks
There are four main types of NN topologies used in commercial applications: multilayer percep-
trons (MLPs), convolution neural networks (CNNs), recurrent neural networks (RNNs), and
transformer-based topologies. These topologies are directed graphs with nodes and edges, where
a node represents an operator, and an edge represents a data-dependency between the nodes, as
shown in Figure 1.5.
A node, also called primitive (short for primitive function), layer, expression, or kernel,
is the building block of a topology. While the number of functions developed by researchers
continues to grow, for example, the popular TensorFlow framework supports over 1,000 opera-
tors, the number of functions used in commercial applications is comparatively small. Examples
of these functions are ReLU, sigmoid, hyperbolic tangent, softmax, GEMM, convolution, and
batch normalization.
There are three types of compute functions: dense linear functions (e.g., GEMM and con-
volution), nonlinear functions (e.g., ReLU and sigmoid), and reduction functions (e.g., pooling).
A dense linear function is typically implemented as a matrix-wise operator and a nonlinear func-
tion as an element-wise operator. A reduction function reduces the input vector to one scalar
value.
Matrix-wise operators are compute-intensive and (depending on the hardware and the
amount of data reuse) can be compute bound (referred to as Math bound in some GPU litera-
ture). Element-wise operators are compute-light and memory bandwidth bound. The inputs to
these functions are read from memory, and the results are written back to memory; there is no
data reuse.
A common technique to improve the compute efficiency of a model is to fuse a compute-
light element-wise operator into a compute-intensive matrix-wise operator. Thus, the interme-
diate results are not written to and then read from main memory. The element-wise computa-
tions happen immediately after the matrix-wise computations while the data is in the registers or
the storage closest to the computing unit. Chapter 8 details this and other techniques to improve
the efficiency via software optimizations.
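The sketch below illustrates the idea in plain Python/NumPy (purely conceptual; actual fusion is performed by DL libraries and compilers, not at this level): the unfused version makes a second pass over the matmul output to apply ReLU, while the fused version applies ReLU to each output element right after its dot product is computed, before the value is stored.

```python
import numpy as np

def matmul_then_relu(A, B):
    """Unfused: materialize the full matmul output, then make a second pass
    over memory to apply the element-wise ReLU."""
    C = A @ B
    return np.maximum(C, 0.0)

def fused_matmul_relu(A, B):
    """Fused (conceptual): apply ReLU to each output element right after its
    dot product is computed, before it is written back, so the intermediate
    never round-trips through main memory."""
    M, K = A.shape
    _, N = B.shape
    out = np.empty((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            out[i, j] = acc if acc > 0.0 else 0.0   # ReLU fused into the store
    return out

A = np.random.randn(8, 16).astype(np.float32)
B = np.random.randn(16, 4).astype(np.float32)
assert np.allclose(matmul_then_relu(A, B), fused_matmul_relu(A, B), atol=1e-5)
```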
In this and the next chapter, we follow a bottom-up approach. In this chapter, we introduce
the standard primitives in popular models used at hyperscalers. In the next chapter, we discuss
the actual models and applications built using these primitives. Readers that prefer a top-down
approach may first read Chapter 3 to better understand the types of models and applications
before diving into the building blocks in this chapter. A review of the notation introduced in
Section 1.10 can help understand the equations presented in this chapter.
where M is the number of classes. The activation input z to the softmax layer is called the logit
vector or score, which corresponds to the unnormalized model predictions, and should not be
confused with the logit (sigmoid) function.
Applying the exponential function to large logits magnifies the numerical errors. There-
fore, it is a common practice to subtract the maximum logit m from all the logits before using
the softmax function [BHH20]. The result is mathematically equivalent:
$$\frac{e^{x-m}}{e^{x-m} + e^{y-m} + e^{m-m}} = \frac{e^x e^{-m}}{(e^x + e^y + e^m)e^{-m}} = \frac{e^x}{e^x + e^y + e^m}.$$
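A minimal NumPy sketch of this max-subtraction trick; the naive version overflows for large logits while the shifted version does not:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))             # subtract the maximum logit m
    return e / np.sum(e)

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))                         # well-defined probabilities
print(np.exp(z) / np.sum(np.exp(z)))      # naive version overflows to nan (with a warning)
```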
2.2 AFFINE
An affine transformation (also known as fully-connected, feedforward, or GEMM layer) pro-
vides a weighted sum of the inputs plus a bias. Figure 2.2 illustrates an affine transformation
$$z_j^{(l+1)} = \sum_{i=0}^{D^{(l)}-1} w_{ji}^{(l)} a_i^{(l)} + b_j^{(l)},$$
An affine transformation can be formulated as a general matrix multiply (GEMM) for all the
samples in a batch and for all the units in a layer, as shown in the last equation in Section 1.10.
An affine transformation is called a linear primitive in DL literature (slightly abusing the term
since a linear function should not have a bias).
Using a bias is always recommended even in large networks where a bias term may have a
negligible impact on performance; removing the bias has little computational or memory savings.
Note that when the affine layer is followed by a batch normalization (BN) layer (discussed in
Section 2.6), the bias has no statistical impact as BN cancels out the bias.
2.3 CONVOLUTION
Convolution kernels (commonly called filters) are widely adopted in computer vision and used
with 2D images, 3D volumetric images, such as MRI scans, and 3D temporal images or video.
Tasks where there is a correlation associated with the spatial or temporal proximity in the input
values, such as in images, video, and spectrograms (discussed in Section 2.4), can use convolution
kernels.
The term convolution has different meanings in the DL and signal processing literature.
A convolution operator in the DL literature, and the one used in this book, is a cross-correlation
operator between the weights and input activations (the input values to the convolutional layer).
Each convolution output value is a dot product of the filter and a subset of the input. The entire
convolution output is computed by shifting this filter across all the subsets in the input.
A 1D convolution using one filter follows:
z_i^{(l+1)} = \sum_{h=0}^{H-1} a_{h+i}^{(l)} w_h^{(l)} + b_i^{(l)},
where H is the length of filter w .l/ . This equation can be easily extended to a 2D convolution,
which is more common in DL. Typically, multiple filters are used in each layer. Figure 2.3 il-
lustrates K 1D convolutions and K 2D convolutions (the biases are omitted to simplify the
figure).
The output is smaller if the input is not padded or if there is a stride between each filter
shift. It is a common practice to extend or pad the input with zeros to enforce that the output
size matches the input size (assuming the stride is 1). Another padding technique is using partial
convolution, which generates a more fitting padding and is discussed elsewhere [LRS+18].
To demonstrate a 2D convolution, assume, for illustration purposes, a 6 × 6 gray-scaled input tensor (in practice, the input is usually much bigger) and a 5 × 5 filter, as shown in Figure 2.4.
Figure 2.3: (a) K 1D convolutions and (b) K 2D convolutions. The results across all filters are
concatenated across a new dimension. Thus, the output tensor of the K 2D convolutions has a
depth (number of channels) of K .
Figure 2.4: A 2D convolution operation. The top-left value in the output tensor (right) is the
dot product of the values in the filter (center) with the upper left values in input tensor (left)
in the red square. The input tensor is first zero-padded so the output tensor height and width
dimensions equal those of the input tensor. Credit: Joanne Yuan.
The input is padded with zeros to ensure the output size equals the input size. The upper-left value of the 2D output array is the dot product of the 5 × 5 filter with the upper-left 5 × 5 pixels in the zero-padded input tensor (marked in red). Note that in this book and the DL literature, the dot product's definition includes the aggregated sum of the Hadamard product (element-wise multiplication) of two 2D arrays. The next output value is computed using the next 5 × 5 values in the input tensor (marked in green). This pattern continues across the entire input array to compute all the output values.
Figure 2.5: K 2D convolutions with an H × W input with C channels. Each filter also has C
channels. The output tensor has K channels, each one corresponding to the convolution output
of each filter with the input tensor.
An H × W color image has 3 channels (red, green, blue), also known as feature channels or tensor depth. The dimension of the image is represented as 3 × H × W. The filters have the same number of channels as the input, as illustrated in Figure 2.5. Assuming K 5 × 5 filters with 3 channels (represented as 3 × 5 × 5), each one of the K 2D convolutions is the dot product between a 3 × 5 × 5 filter and all the 3 × 5 × 5 subsets of the input shifted across the height and
width. In 2D convolution, the filters do not shift across the depth (channels). Note that filter
sizes are often described only by their height and width; the depth is inferred: it is the number
of channels of the input tensor.
A convolutional layer has a bank of filters, each detecting different features in the input.
To illustrate, suppose the input is a 3 × 224 × 224 tensor, and the layer has a bank of 64 filters. Each filter produces one 224 × 224 output. Each output contains the features detected in the input by the corresponding filter. The aggregated layer output is a 64 × 224 × 224 tensor, and all
the filters in the next convolutional layer have 64 channels.
In practice, a convolution layer typically uses 4D input, filter, and output tensors. The usual
way tensors are arranged in (1D) memory, known as the data layout, is as NCHW or NHWC for
the input tensors, where N is the number of samples in the batch, C is the input depth (or
equivalently, the number of channels or features), W is the input width, and H is the input
height. The filters are arranged as RSCK or KCRS, where K is the number of filters (also known
as the number of output feature channels), R is the filter height, and S is the filter width. The
C in NCHW and KCRS is the same. Note that KCRS is sometimes denoted as OIHW in some literature but not in this book to avoid confusion with the H and W used for the input tensor. In the example above, the input has NCHW dimensions 1 × 3 × 224 × 224, the filter has KCRS dimensions 64 × 3 × 5 × 5, and the output has N K H̃ W̃ dimensions 1 × 64 × 224 × 224.
The convolution is computed along seven dimensions: batch size N, output channels K, input channels C, output height H̃, output width W̃, filter height R, and filter width S. It can be implemented naively as seven for loops, as shown in Algorithm 2.1, where k, h̃, and w̃ represent the channel, height, and width indices of the output tensor Z. For simplicity, the
stride is assumed to be 1. There are more efficient implementations that account for a device’s
memory hierarchy and parallel computing capabilities [DAM+16].
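The sketch below mirrors that seven-loop structure in plain Python/NumPy (stride 1, no padding, illustrative only; it is not Algorithm 2.1 verbatim, and production libraries use blocked, vectorized implementations instead):

    import numpy as np

    def naive_conv2d(X, F, bias):
        """X: input (N, C, H, W); F: filters (K, C, R, S); bias: (K,)."""
        N, C, H, W = X.shape
        K, _, R, S = F.shape
        H_out, W_out = H - R + 1, W - S + 1            # stride 1, no padding
        Z = np.zeros((N, K, H_out, W_out), dtype=X.dtype)
        for n in range(N):                             # batch
            for k in range(K):                         # output channels
                for h in range(H_out):                 # output height
                    for w in range(W_out):             # output width
                        for c in range(C):             # input channels
                            for r in range(R):         # filter height
                                for s in range(S):     # filter width
                                    Z[n, k, h, w] += X[n, c, h + r, w + s] * F[k, c, r, s]
                        Z[n, k, h, w] += bias[k]
        return Z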
2.4 POOLING
Pooling or subsampling reduces the size of the input tensor across the height and width, typ-
ically without affecting the number of channels. Pooling often follows a convolutional layer.
The common implementation, known as max pooling, is to select the maximum value in a small
region. A 2D pooling layer uses 2 × 2 nonoverlapping regions and reduces the tensor size by a
factor of 4, as illustrated in Figure 2.7.
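A minimal sketch of 2 × 2 max pooling on an NCHW tensor (assuming even height and width; illustrative only):

    import numpy as np

    def max_pool_2x2(X):
        """X: (N, C, H, W) with even H and W; returns (N, C, H/2, W/2)."""
        N, C, H, W = X.shape
        X = X.reshape(N, C, H // 2, 2, W // 2, 2)
        return X.max(axis=(3, 5))     # maximum over each 2x2 window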
The main benefit of pooling is that filters after a pooling layer have a larger receptive field
or coverage on the original input image. For example, a 3 × 3 filter maps to a 6 × 6 portion of the input image after one 2 × 2 pooling layer. A 3 × 3 filter deeper in the model, after five convolutional and pooling layers, maps to a 96 × 96 (note that 3 × 2⁵ = 96) portion of the input
image and can learn more complex features. Another benefit of pooling is that it reduces the
number of operations.
Other forms of pooling include average pooling, stochastic pooling, and spatial pyramid pool-
ing (SPP). Average pooling and stochastic pooling are similar to max pooling. Average pooling
computes the average of the values in a small region. Stochastic pooling samples a value based on
the distribution of the values in the small region [ZF13]. SPP is used after the last convolution
layer to generate fixed-length outputs when the input images are not of a fixed size [HZR+15].
In Section 3.2.3, we provide an example of SPP used in a production model for image segmen-
tation.
Figure 2.7: A (left) 1 × 20 × 6 × 6 tensor (in NCHW layout) input into a 2 × 2 pooling layer produces a (right) 1 × 20 × 3 × 3 tensor output. Credit: Joanne Yuan.
Figure 2.8: LSTM and GRU units have soft gates that control how the memory cell values are
modified. Based on [Phi18].
Figure 2.9: The cost space as a function of two weights for (left) unnormalized data and (right)
normalized data. Each contour represents a set of weights with equal cost and the minimum is
in the inner contour. Normalizing the data results in faster learning because each parameter can
make a similar contribution.
2.6 NORMALIZATION
A common ML technique that improves training is normalizing the input data by subtracting
the mean of the data and dividing it by the standard deviation. Normalization improves learn-
ing in single layer models as each parameter can make similar contributions to the learning, as
illustrated in Figure 2.9. It is also beneficial to carefully normalize the inputs of some of the layers.
The distribution of the inputs to each layer through the network can vary widely, resulting
in some gradients that have little impact on the learning process. Normalizing the inputs or
outputs of the activation functions improves training stability, enables the training of larger
Figure 2.10: Different normalization methodologies normalize across different portions of the
tensor. The tensor values colored in green are normalized by their mean and variance. Based
on [WH18].
models, and results in faster convergence. The reason is that the gradient of the weights in a
given layer is somewhat proportional to the magnitude of the layer inputs. Having gradients
with similar magnitudes (1) reduces the effects of exploding and diminishing gradients when
backpropagating through a deep network and (2) prevents some of the partial derivatives from
skewing the overall gradient in a particular direction.
The most common techniques to normalize activations are batch normalization, batch
renormalization, layer normalization, and group normalization, shown in Figure 2.10. In prac-
tice, we recommend using batch normalization for non-recurrent models when the batch size is
greater than or equal to 32 and group normalization otherwise.
Batch normalization (BN) was a breakthrough technique enabling the training of deeper
and more accurate models and is widely adopted in production [IS15]. BN can be applied to the
input or output of an activation function. Based on empirical data, the latter is recommended
and used in the analysis below.
The activations a^{(l)} in layer l are normalized by the mean E and variance V across a batch of samples. Each BN layer has two trainable parameters: γ and β, which scale and then shift the normalized activations. These parameters provide flexibility over the amount of normalization in a BN layer to maximize statistical performance. Note that data scientists can remove the bias term in the fully-connected or convolutional layer with no statistical effect, as the shift term β effectively cancels out the bias term.
At the end of the training process, the mean and variance for each BN layer are computed
using statistics from the entire training set or a large representative subset. These values are fixed
and used during serving; they are not recomputed in each serving batch. During inference, the
BN output is:
BN(a^{(l+1)}) = \frac{\gamma\,(a^{(l+1)} - E\mathbf{1})}{\sqrt{V}} + \beta\mathbf{1} = \frac{\gamma\,(g(W^{(l)} a^{(l)}) - E\mathbf{1})}{\sqrt{V}} + \beta\mathbf{1}
= g\!\left(\frac{\gamma}{\sqrt{V}} W^{(l)} a^{(l)}\right) + \left(\beta - \frac{\gamma E}{\sqrt{V}}\right)\mathbf{1} = g(W' a^{(l)}) + b',
where g(·) is the activation function, W' = (γ/√V) W^{(l)}, and b' = (β − γE/√V) 1. That is, during inference the BN can be incorporated directly into the weights by multiplying them by γ/√V in the preceding convolutional or fully-connected layer and adding the bias b' to the activations.
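As an illustration of this folding, the sketch below (a hypothetical NumPy helper, not the book's code) folds the BN statistics into a preceding affine layer for the simpler case where BN is applied directly to the affine output; the post-activation case above additionally relies on ReLU passing the positive scale factor through.

    import numpy as np

    def fold_bn_into_affine(W, b, gamma, beta, mean, var, eps=1e-5):
        # Rescale the weights and adjust the bias so that BN(W a + b) becomes
        # a single affine layer W' a + b' at serving time.
        scale = gamma / np.sqrt(var + eps)   # one scale factor per output unit
        W_folded = W * scale[:, None]        # W has shape (d_out, d_in)
        b_folded = (b - mean) * scale + beta
        return W_folded, b_folded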
There are two drawbacks to batch normalization. First, it requires training with batches
of sufficient samples (usually 32 or more) to capture adequate statistics representative of the
entire training set. This requirement limits distributed training algorithms when the batch per
device is small. Second, batch normalization cannot be used in recurrent layers because the
statistics change with each recurrent step, but the BN parameters are shared across all steps.
Batch renormalization, layer normalization, and group normalization address these drawbacks.
Batch renormalization constrains the mean and standard deviation of BN to reduce the
large difference when the batch size is small [Iof17]. Batch renormalization allows training with
small batches.
Layer normalization computes the mean and standard deviation across all the activation
values in a layer in a data sample. Therefore, different data samples have different normalization
terms [BKH16].
Group normalization is a generalization of layer normalization. It uses the mean and
variance across groups of channels, rather than across all the channels [WH18]. The number
of groups is a hyperparameter chosen by the user. Both of these methods also include the two
trainable parameters as in batch normalization. Empirical results show group normalization
works much better than BN for small batch sizes, and only slightly worse than BN for large
batch sizes [WH18].
Local response normalization (LRN) square-normalizes the values using the statistics
in a small neighborhood [KSH12]. LRN is not a trainable layer. It was used in older models
before batch normalization gained adoption.
2.7 EMBEDDINGS
An embedding (also known as encoding or thought-vector) is a low-dimensional dense vector
representation of a data sample. It is often used as the input to a language or recommender model.
An embedding layer maps high-dimensional sparse data to an embedding. To illustrate, suppose
a dictionary has 10,000 words. A 10,000-dimensional vector of all zeros except for a one at the
corresponding index represents a word. This is called a one-hot vector. Unsupervised learning
algorithms, such as word2vec or GloVe, can learn to map a corpus of words to low-dimensional
dense representations [MSC+13, PSM14]. Other usages are learning dense representations for people in a social network or products in a retail business with a large catalog. In images, the
activations of the last or second-to-last layer of a CNN model are often used to embed the image.
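Conceptually, multiplying a one-hot vector by an embedding matrix selects one row, so embedding layers are implemented as table lookups. A small sketch with hypothetical sizes:

    import numpy as np

    vocab_size, embed_dim = 10_000, 128
    E = np.random.randn(vocab_size, embed_dim)   # embedding table (trainable in practice)

    token_ids = np.array([42, 7, 42, 9001])      # token indices instead of one-hot vectors
    embeddings = E[token_ids]                    # shape (4, 128): a row lookup, no dense matmul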
The embeddings often demonstrate data associations, and vector embeddings of similar
words are closer to each other. For instance, using their learned vector representations, v_queen ≈ v_woman + v_king − v_man, as shown in Figure 2.11.
Figure 2.11: 3D word embeddings. Word embeddings often capture word associations. Based on [Goo20].
2.8 ATTENTION
An attention layer learns how the input vectors influence each output vector, as shown in Fig-
ure 2.12. Some attention layers also capture how the neighboring output vectors influence each
other. Attention layers are popular in language models to determine the associations between
input and output tokens [VSP+17]. A token is a word, a subword, or a symbol, such as a question
mark. Attention layers can be computationally expensive as each layer may require computing
an attention value for each combination of input and output tokens. This additional computa-
tion may increase the serving latency in some workloads beyond what is acceptable for a given
application. Nevertheless, using attention layers can improve the statistical performance.
Attention layers are also used in some recommenders to learn a context vector that captures
the influence between users. They are also used in image captioning to focus the decoder on the relevant parts of the image [CZZ+19, YHG+15]. Attention can improve interpretability. The
attention layer may be used to observe and explain how an input feature influences a particular
output.
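A compact sketch of the scaled dot-product attention used in these layers [VSP+17] (single head, NumPy, illustrative only):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q: (T_out, d); K, V: (T_in, d). Returns one context vector per output position."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                   # similarity of each query to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the input positions
        return weights @ V                              # weighted sum of the value vectors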
2.9 DROPOUT
Dropout is designed to reduce overfitting and may be used in fully connected non-recurrent
layers. During training, a percentage of the weights in a layer is ignored (dropped) for an it-
eration, as shown in Figure 2.13. At each iteration, a new set of weights is randomly ignored,
which reduces overfitting by reducing cooperation between the weights. During inference, all
the weights are used.
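The standard formulation masks activation units; a minimal sketch of inverted dropout (illustrative, not the book's code):

    import numpy as np

    def dropout(activations, drop_prob=0.5, training=True):
        if not training:
            return activations            # all units are used when serving
        keep_prob = 1.0 - drop_prob
        mask = np.random.rand(*activations.shape) < keep_prob
        # Scale by 1/keep_prob so the expected activation magnitude is unchanged.
        return activations * mask / keep_prob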
RNN-based models can use dropout after the embedding layers and in-between RNN
stacks. While dropout could be used across temporal units if the same set of weights across all
Figure 2.12: Attention-based models capture how each output token is influenced by all the input tokens. The blue and green rectangles are cells corresponding to the encoder and decoder, respectively. To generate the French word suis, the attention weight is the highest for the corre-
sponding English word am. Based on [Syn17].
Figure 2.13: (left) Original model before dropout. (right) A percentage of the weights in a layer
are dropped during each training iteration.
the timesteps are dropped, it is usually not used. CNN layers typically do not use dropout given
that those layers already have few weights.
In practice, normalization techniques are preferred to reduce overfitting, and newer mod-
els do not use dropout. Based on empirical evaluations, normalization and dropout should not
be jointly used as they have a negative convergence impact [LCH+19].
SLIDE is an extension to dropout [CMF+20]. In dropout, the weights are randomly
dropped. In SLIDE, the weights that produce small activations are dropped. The percentage of
dropped weights in SLIDE can be 90–95%, whereas in dropout it is usually 50%. Thus, only
the most relevant weights are updated in each training iteration. The challenge that SLIDE
addresses is predicting which weight vectors produce large activations. SLIDE uses locality sensitive hashing (LSH) to select the weight vectors most similar to the input activation vectors.
SLIDE has a CPU affinity for two main reasons. First, LSH relies on branching, for
which CPUs are well optimized. Second, the LSH tables require a large memory capacity,
which also favors CPUs. There is ongoing research toward making hashing more efficient on
GPUs [AFO18].
Similarly to dropout, SLIDE is primarily beneficial for fully-connected, non-recurrent
layers with many units, since these are typically overparameterized and can be heavily sparsified.
It is particularly beneficial for the affine layer feeding the softmax layer in extreme classification tasks (common in recommender systems), where the softmax may have hundreds of thousands of units. Similar to
dropout, jointly using SLIDE and normalization is not recommended. Finally, SLIDE is rel-
atively new and has not been adopted in production environments. Further work is needed to
facilitate adoption.
In this chapter, we detailed the standard building blocks of topologies used in commercial ap-
plications and explained their purpose. These building blocks or layers have different hardware
needs. Typically, embedding layers need large memory and memory bandwidth, convolutional
layers need large compute, and recurrent layers need large memory bandwidth. We introduced
the concept of a graph with nodes and edges as a representation for a topology. A standard
graph optimization technique, detailed in Chapter 8, is to merge dense linear nodes, such as
GEMM and convolutions, with element-wise nodes, such as ReLU, to reduce memory accesses
and improve performance. We recommended using batch normalization for non-recurrent lay-
ers when the batch size is greater than or equal to 32 and group normalization otherwise. Also,
normalization is preferable over dropout and both should not be used jointly. In the next chap-
ter, we discuss foundational topologies composed of these building blocks and their applications
by hyperscalers.
CHAPTER 3
MODELS AND APPLICATIONS
Recommender systems fall into two broad categories:
(1) Content-based systems recommend items to a user based on their profile and their user-
item interaction.
(2) Collaborative filtering recommends items to a user based on the user-item interactions
from similar users.
The input can be structured data, such as databases, or unstructured data, such as text and
images. CNNs and RNNs can be applied to images and text, respectively, to extract features to
input into a recommender. The recommended items can be ads to click, merchandise to purchase,
videos to watch, songs to listen to, social contacts to add, and news and social media posts to
read. Recommender systems recommend items based on user features, item features, user-item
ratings, and contexts, such as the time, day, and season. User-item ratings can be explicit or
implicit based on user-item interaction. Implicit feedback includes the amount of time spent
reading a news article, listening to a song, or viewing a clip. Details on the recent advances in
context-aware recommender systems (CARS) are available elsewhere [RD19].
The total number of user-item combinations can reach quintillions, and adding con-
text further increases that number. Netflix has around 200 million users and over 13,000 ti-
tles [Lov19]. Google Play has over 1 billion users and over 1 million apps [CKH+16]. eBay has
more than 180 million buyers and over 1.3 billion listings [Don19, Pad19]. Alibaba has 2 billion
products and serves as many as 500 million customers per day [Fel19]. This huge catalog results in
memory size bottlenecks on the hardware platform. If every combination requires one byte, then
the total user-item combinations would require 1 exabyte of memory, which is 4× more than the total storage of the largest supercomputer, Summit. eBay clusters items into categories and
utilizes user-category (rather than user-item) to reduce the number of combinations [Bro19].
Rather than ranking potentially billions of items, a large-scale recommender system breaks
the process into two stages to meet the latency requirement and reduce the number of compu-
tations. First, a recall (candidate generator) stage selects several items that may be of interest to
the user. Second, a ranking stage scores each item and selects those shown to the user [Bro19].
The recall step selects a set of items (for instance, 1000) that may be of interest to a particular
user, each one represented as a vector. The dot products between the vector representing the user
and each of the 1000 item-vectors are computed. The items producing the highest dot products
are then recommended to the user.
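A sketch of that recall scoring step with hypothetical shapes: the user vector is compared against all candidate item vectors with one matrix-vector product, and the highest-scoring items are kept.

    import numpy as np

    num_candidates, dim = 1000, 64
    user_vec = np.random.randn(dim)                    # embedding of the user
    item_vecs = np.random.randn(num_candidates, dim)   # embeddings of the recalled items

    scores = item_vecs @ user_vec                      # one dot product per candidate item
    top_items = np.argsort(-scores)[:10]               # indices of the 10 highest-scoring items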
Despite using a two-stage approach, a significant challenge of large-scale recommender
systems is the large memory required, in particular for the embedding tables to embed users and
item features. Baidu’s Phoenix Nest online advertising recommender models can exceed 10 TB,
dwarfing the capacity of a GPU. Therefore, the model is partitioned into embeddings on the
CPU and the NN on the GPU on Baidu’s AIBox [ZZX+19].
Content-based recommenders use features, known as metadata, for each item (such as
movie genres and IMDb ratings) and recommend items based on the similarities to other items
the user has liked. A profile for each user is learned based on their likes and used to evaluate the
recalled items to make a recommendation. Other features can include embedding representa-
tions (using RNN, CNN, or hand-engineered features) of written reviews, movie summaries,
and still images. Content-based recommenders do not use information or ratings from other
users.
Collaborative filtering (CF) recommends items based on user-item interactions across
multiple users. Collaborative filtering uses no metadata or domain knowledge about the items;
instead, it learns all the feature vectors. This eliminates the dependency of manually chosen
features at the expense of requiring more data. A rating matrix R, also known as the utility matrix
or user-interaction matrix, contains the ratings across various users and items. Collaborative
filtering learns a user matrix U and an item matrix V composed of user and item feature vectors
of equal dimension, respectively, such that the squared difference between R and the dense matrix R̂ = UVᵀ is minimized. This is known as matrix factorization. R̂ provides a metric of similarity between the items and users. In practice, for large rating matrices, only a subset of entries is used. The alternating least squares (ALS) algorithm can perform matrix factorization
by alternating between holding constant one of the matrices and adjusting the other one to
minimize the error. Singular Value Decomposition (SVD) is another commonly used matrix
factorization algorithm.
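A minimal sketch of the matrix factorization objective, optimized here with gradient descent on the observed entries only (ALS instead alternates closed-form solves for U and V; this code is purely illustrative):

    import numpy as np

    def factorize(R, mask, dim=16, lr=0.01, reg=0.1, steps=1000):
        """R: ratings matrix; mask: 1 where a rating is observed, 0 elsewhere."""
        num_users, num_items = R.shape
        U = 0.1 * np.random.randn(num_users, dim)   # user feature vectors
        V = 0.1 * np.random.randn(num_items, dim)   # item feature vectors
        for _ in range(steps):
            E = mask * (U @ V.T - R)                # error on the observed entries only
            U -= lr * (E @ V + reg * U)             # gradient step on the user factors
            V -= lr * (E.T @ U + reg * V)           # gradient step on the item factors
        return U, V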
Figure 3.1: A (left) linear and (right) deep learning model. Based on [CKH+16].
Neural recommenders typically use a hybrid approach. They are trained with large datasets
across multiple user and item pairs. Standard neural recommenders are Wide and Deep (W&D),
Neural collaborative filtering (NCF), Deep Interest Evolution Network (DIEN), and Deep
Learning Recommender Model (DLRM). GNNs are also gaining adoption for recommenders.
Other recommenders include autoencoders to encode implicit ratings or feedback, GANs,
and deep RL to tackle dynamic changes in items and users’ preferences [LKH+18, WYZ+17,
ZZZ+18].
Wide & Deep (W&D) combines the output from a linear model (referred to as wide)
and a deep model, as shown in Figure 3.1 [CKH+16]. This was originally developed to improve
Google Play’s app recommendation. The probability the recommended app is chosen given the
input vector is:
P(v_i = 1 \mid x) = \sigma\!\left(w_{wide}^T \phi(x) + f_{deep}(W_{deep}, x) + b\right),
where σ(·) is the logit (sigmoid) function, b is the bias term, φ(x) are the features on the linear model, f_deep(·) is the deep model, w_wide is the weight vector for the linear model, and W_deep is the set
of weights for the deep model. The input vector x has user features (for instance, country, lan-
guage, demographics), contextual features (for instance, device, hour of the day, and day of the
week), and item features (for instance, app age, and historical statistics of an app). Sparse dis-
crete high-dimensional categorical feature vectors are embedded into a low-dimensional dense
representation and concatenated as one input vector into an MLP.
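A sketch of how the wide and deep logits are combined into a single probability (the feature vectors, w_wide, and the mlp callable are all hypothetical placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def wide_and_deep_prob(x_wide, x_deep, w_wide, mlp, b):
        """x_wide: linear-model features phi(x); x_deep: concatenated dense embeddings;
        mlp: any callable returning a scalar logit for the deep part (assumed)."""
        wide_logit = w_wide @ x_wide
        deep_logit = mlp(x_deep)
        return sigmoid(wide_logit + deep_logit + b)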
Similar models to W&D are the MLP model used for YouTube recommendations, which
incorporates the mixture-of-experts ML technique, and the DeepFM model, which shares the
input with its “wide” and “deep” parts [CAS16, ZHW+19, GTY+17]. Another similar model
is Deep & Cross Network (DCN) used for ad click prediction. It applies feature crossing and,
unlike W&D, does not require manually selecting the features to cross.
Neural Collaborative Filtering (NCF) is a CF-based recommender that generalizes the
popular matrix factorization algorithm [HLZ+17]. A one-layer linear NN can represent matrix
Figure 3.2: A neural collaborative filtering (NCF) model with one embedding layer for the user
and one for the items, and an MLP model. Based on [HLZ+17].
factorization. NCF augments this linear NN with multiple layers, as shown in Figure 3.2, to
model complex nonlinearities in the data, which improves the learned features and recommen-
dations.
Deep Interest Evolution Network (DIEN) and Behavior Sequence Transformer
(BST) are used in production at Alibaba’s Taobao to recommend advertisements [ZMF+18,
CZL+19]. They use a GRU- and a transformer-based topology, respectively, to model user be-
havior through time. A similar model, Recurrent Recommender Network (RRN), uses LSTM
units [WAB+17].
Deep Learning Recommendation Model (DLRM) is a class of models used by Face-
book. DLRM improves the handling of categorical features [NMS+19]. The dot product of pairs
of embedding vectors and processed dense features are post-processed through another MLP,
as shown in Figure 3.3, to predict event probabilities. Because the embedding tables are enor-
mous, model parallelism, discussed in Chapter 6, can be used to mitigate memory constraints.
Facebook also proposed using nonvolatile memory (NVM) as the primary storage medium and
DRAM as a cache for commonly used embedding vectors [ENG+18].
Graph Neural Networks (GNNs), introduced in Section 1.5.5, are gaining traction for
large-scaled recommender systems. Industry platforms include Pinterest’s PinSage, Alibaba’s
AliGraph, Microsoft’s NeuGraph, and Amazon’s Deep Graph Library (DGL) [YKC+18,
ZZY+19, MYM+19, WVP+19, FL19, Fey20].
Figure 3.3: A Deep Learning Recommendation Model (DLRM) with a dense feature, multiple
embedding layers for sparse features, and multiple MLP topologies. Based on [NMS+19].
Figure 3.4: Top-5 classification error from 2010–2017 on the ImageNet-1K dataset. Since 2012
all the top results have used DL. Based on [Zis18].
The first layers of CNN models learn similar features to those developed over decades of computer vision research, which are also similar to the features used by the mammalian primary visual cortex. That is, the weights in the first layer usually become
edge detectors after training the model. The second figure in Zeiler and Fergus’ 2013 paper
(not replicated here) shows what some of the feature maps across various layers specialized to
detect [ZF13]. One difference (of many) between CNN models and the mammal visual cortex
is that CNNs rely more on texture features than shape features [GRM+18]. Augmenting the
training dataset by perturbing each image’s texture increases the dependency on shape features
and improves the model’s performance.
Computer vision tasks detailed in this section are classification, object detection, semantic segmentation, verification, and image generation. Additional tasks not discussed include action
recognition, image denoising, super-resolution, and style transfer.
Figure 3.5: All the layers of the AlexNet topology. Based on [Has18].
Figure 3.6: The factorization of a 5 × 5 filter into two consecutive 3 × 3 filters maintains the same receptive field.
These topologies use the building blocks discussed below, such as inception, residual, group convolution, and depthwise separable convolutional layers, and introduce new techniques, such as factorization.
AlexNet, shown in Figure 3.5, is similar to the 1998 LeNet-5 topology used for digit
recognition but with more layers and units. Also, AlexNet uses ReLU rather than the logit
activation functions, and max pooling rather than average pooling [LBB+98].
VGG is a family of topologies similar to AlexNet but with more layers, and it only uses 3 × 3 convolution filters [SZ14]. VGG factorizes a 5 × 5 filter into two consecutive 3 × 3 layers to reduce
the number of parameters, as shown in Figure 3.6. Factorization maintains the same receptive
field coverage. The reduced number of parameters mitigates overfitting, which facilitates using
topologies with more layers.
Inception-v1, also known as GoogLeNet, introduced the inception module, which is
composed of multiple filters of different sizes that process the same input, as shown in Fig-
Figure 3.7: The Inception-v1 module. Different filter sizes are applied to the input tensor and
the outputs are concatenated.
Figure 3.8: The factorization of a 5 × 5 filter into 5 × 1 and 1 × 5 filters maintains the same receptive field.
ure 3.7 [SVI+15, SLJ+14]. These filters extract multilevel features, and their outputs are con-
catenated. The inception module popularized the usage of 1 × 1 filters, which modifies the num-
ber of channels. Inception replaces the fully-connected layers at the end of the topology with a
global average pooling across the 2D feature map, which notably reduces the total number of
parameters.
Inception-v3 introduces the factorization of an n × n convolutional filter into a 1 × n followed by an n × 1 filter, as shown in Figure 3.8. This factorization maintains the same receptive field and reduces the number of weights from n² to 2n. Inception-v3 also adds a regulariza-
tion known as label smoothing to the one-hot label vectors by replacing the zeros with a small
epsilon value. Inception-v3, like Inception-v2 (also known as Batch-Normalization Inception),
uses batch normalization [IS15].
Another technique introduced in VGG and improved in Inception-v3 is doubling the
number of channels and halving the feature maps’ length and width in consecutive layers. This
pattern is implemented in one of three ways. First, convolution followed by pooling at the expense of a
Figure 3.9: Efficient grid size reduction. The number of channels doubles and the height and
width are halved.
Figure 3.10: Residual layers have skip connections that bypass certain layers. (left) A residual
layer with two convolutional layers. (right) A residual module reduces the tensor to 64 channels
(from 256 channels) to reduce the number of 3 × 3 convolutions and then expands the output
back to 256 channels.
convolution with a larger tensor input. Second, pooling followed by convolution at the expense
of a less-expressive layer. Third (recommended), two parallel blocks: (1) a convolution block with
a stride of 2 that maintains the same number of channels; and (2) a pooling layer, as shown in
Figure 3.9.
Figure 3.11: Depthwise separable convolution is a depthwise convolution, where every input
channel is convolved with a different filter, followed by a pointwise convolution.
ResNet is a family of models that popularized layers with skip connections, also known
as residual layers. Skip connections bypass other layers, as shown in Figure 3.10 [HZR+15]. The
motivation is that rather than learning the direct mapping from x to H.x/ it is easier to learn
F .x/, which is the difference or the residual between x and H.x/. Then H.x/ can be computed
by adding this residual. Using residual layers, together with batch normalization, allows the
training of very deep models with over 1000 layers. The gradient can backpropagate via shortcut connections, thus mitigating the vanishing gradient problem introduced in Section 2.1. Deep
ResNets use a bottleneck unit to reduce the number of computations, as shown on the right of
Figure 3.10.
DenseNet connects each layer to every other layer [HLv+16]. Each layer’s inputs are
the concatenation of all feature maps from all the previous layers, which have a large memory
footprint. On the flip side, DenseNet requires fewer weights than other similarly performing
models.
Extreme Inception (Xception) combines design principles from VGG, Inception, and
ResNet, and introduces depthwise separable convolutions, shown in Figure 3.11. In depthwise
separable convolutions, the cross-channel correlations and spatial correlations are mapped sep-
arately [Cho16]. That is, every input channel is convolved with a different filter and the results
are aggregated using a 1 × 1 filter called a pointwise convolution.
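A quick illustration of the savings, counting weights (ignoring biases) for one layer with assumed example sizes:

    # C input channels, K output channels, R x S filters (assumed example sizes).
    C, K, R, S = 128, 256, 3, 3

    standard = K * C * R * S          # one R x S filter per (input, output) channel pair
    depthwise = C * R * S             # one R x S filter per input channel
    pointwise = K * C                 # 1 x 1 filters that mix the channels
    separable = depthwise + pointwise

    print(standard, separable, standard / separable)   # roughly 8.7x fewer weights here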
MobileNet, MobileNet-v2, and MobileNet-v3 target hardware with limited power,
compute, and memory, such as mobile phones. These models use depthwise separable convolu-
tion blocks with no pooling layers in between. MobileNet-v2 uses residual connections and adds
Figure 3.12: The MobileNet modules with arbitrary input tensor sizes using stride 1 for (left)
v1 and (right) v2.
a channel expansion convolution layer prior to the depthwise separable convolutions, as shown
in Figure 3.12 for stride 1 [HZC+17]. MobileNet-v3 uses AutoML, a technique discussed in
Section 10.1. These models are served not just in mobile phones but also in data centers.
ResNeXt reintroduced group convolutions (initially used by AlexNet to distribute the
model into two GPUs) [XGD+17]. In group convolution, the filters separate into groups, and
each group operates on specific channels of the input tensor. The group convolution tensors are
typically represented as a 5D tensor with the group id as the additional dimension. Depthwise
separable convolution is a particular case of group convolution where the number of groups
equals the number of channels of the input tensor.
ResNeXt replaces residual convolution blocks with residual group convolutions, shown
in Figure 3.13, and every path of the group contains the same topology. These convolutions
facilitate training and serving across multiple devices since each convolution in the group can be
done independently of the other ones. ResNeXt is or has been used at Facebook [PNB+18].
NAS is a family of algorithms that learn both the topology and the weights for a particular hardware target; examples include NASNet and EfficientNet [TL19]. EfficientNet was initially
used on TPUs, but can be used with other hardware. Given their long training times and the
diverse hardware fleet in data centers (multiple generations of CPUs and GPUs), the adoption
of NAS-based models in the industry is still limited.
Figure 3.13: ResNeXt module with equivalent representations. ResNeXt uses residual group
convolutions which are easier to parallelize across compute units. Based on [XGD+17].
Older object detectors use a region proposal step and a classification step. The input image is scaled up and down, known as an image pyramid, to detect objects of various sizes. New NNs can do these steps simultaneously, start-
ing with the widely adopted Single-Shot Detector (SSD) and You Only Look Once (YOLO)
models. Despite being relatively old, these models are still used in production because of their
plug-and-play nature, where the object detector can use the latest CNN classifier as the base
network.
Object detection models use a unified weighted cost function that accounts for the local-
ization and the classification tasks. Also, object detectors generate several bounding boxes for a
given object, and remove most of them using non-maximum suppression (NMS).
The most common metric to measure the detection accuracy is the mean Average Precision
(mAP). The average precision (AP) is the area under the precision-recall curve and ranges from
0 to 1, with 1 being perfect detection for one class. The mAP is the mean AP across all the
classes.
Key neural object detectors include Faster-RCNN, YOLO, SSD, RetinaNet, and Effi-
cientDet.
Faster-RCNN uses a two-step approach with a region proposal network (RPN) and a
classification network [RHG+15]. In Faster-RCNN, these two networks share a base CNN
network or backbone architecture, which reduces the number of redundant computations. The
base CNN model extracts feature maps from the image, which are passed to the RPN to gen-
erate and refine candidate bounding boxes, as shown in Figure 3.14. All the bounding boxes
are then reshaped to be the same size and passed to the classifier. The Feature Pyramid Net-
work (FPN) improved this topology; the predictions happen on high- and low-resolution feature
maps [LGH+16].
YOLO divides the image into a 7 × 7 grid [RDG+16]. Each grid cell is responsible for 2
bounding boxes. Each bounding box is composed of the .x; y/ center coordinate of an object, and
Figure 3.14: (left) The Faster-RCNN topology generates regions of interest in a feature map
and jointly processes them. (right) The Feature Pyramid Network (FPN) topology can be used
with the Faster-RCNN topology or other models to better detect objects at different scales.
the width, height, and confidence. That is, each bounding box has 5 values. The output of each
grid cell is 5 values times 2 bounding boxes plus the probability of each class label given the input.
If there are 20 classes, then each grid cell has an output vector of 5 × 2 + 20 = 30 values, and given the 7 × 7 cells, the total number of output values for an input image is 7 × 7 × 30 = 1470, as
shown in Figure 3.15. In practice, the number of grid cells and bounding boxes are hyperparam-
eters. The input image maps to the output via a CNN pretrained on an image classification task,
such as the ImageNet dataset. YOLOv2 and YOLOv3 improve by detecting at three scales,
using a deeper CNN topology, and having a class score for each bounding box [RF18].
Single-shot detector (SSD) uses an image classification model, such as VGG or Mo-
bileNet, as the base network and appends additional layers to the model [LAE+15]. Bounding
boxes start from predefined anchor boxes. In each of the appended layers, the model refines or predicts the bounding box coordinates, each with a respective score. Most of the computations
are in the base network.
RetinaNet is the first one-stage detector model that outperforms the two-stage detection
approach. The primary reason for previous one-stage detectors trailing in accuracy is the extreme
class imbalance (many more background class samples). RetinaNet uses the focal loss function
to mitigate this class imbalance [LGG+17]. The focal loss reduces the loss contribution from well-classified examples.
EfficientDet is a scalable family of detectors based on EfficientNet. It uses a pyramid
network for multiscale detection [TPL19].
Figure 3.15: A YOLO model can map an input image to a 7 × 7 × 30 grid output. Based on [Tsa18].
3.2.3 SEGMENTATION
Segmentation is a generalized and more challenging form of object detection, where every pixel
in an image has a corresponding class label. Widely adopted models in the industry include
Mask-RCNN and DeepLabv3 and in biomedical applications: U-Net, 3D-UNet, and V-Net.
Mask R-CNN extends Faster-RCNN by adding a separate output branch to predict the
masks for all the classes [HGD+17]. This branch is in parallel to the bounding box predictor
branch. Similar to Faster-RCNN, the choice for the base network is flexible.
DeepLabv3 uses atrous convolution, also known as dilated convolution, hole algorithm, or
up-conv to increase the size of the feature maps by upsampling the weight filter, that is, inserting
one or more zeros between each weight in the filters [CPS+17]. Atrous convolution, combined
with Spatial Pyramid Pooling (SPP), is known as Atrous SPP (ASPP) and shown in Figure 3.16.
ASPP can account for different object scales in the image.
U-Net is an encoder-decoder CNN [RFB15]. It uses convolutions to reduce the size of
the receptive field, followed by transposed convolutions (or upsampling) to increase the size.
U-Net also has skip connections between mirrored layers in the encoder and decoder stacks.
This type of model is known as a fully-convolutional network (FCN) [LSD14]. U-Net can be
trained with few images using data augmentation with multiple shifts and rotations.
3D U-Net and V-Net are 3D convolutional networks designed for voxel (3D pixels) seg-
mentation from volumetric data [CAL+16, MNA16]. These models generally require the large memory capacity only available on server CPUs for training due to the large activations. Model
Figure 3.16: Atrous or dilated convolutions can maintain or increase the size of the feature maps.
Based on [CPS+17].
parallelism techniques (discussed in Section 5.2) can be applied to train on GPUs and acceler-
ators.
Detectron is a popular open-source platform developed by Facebook [WKM+19]. It is implemented in PyTorch and contains implementations of various object detection and segmentation algorithms, which facilitates community adoption.
3.2.4 VERIFICATION
The task of verification is to determine whether a sample belongs to a particular set. The set size
may be one, for instance, in the set of people with access to a personal bank account, or many,
for instance, in the set of people with access to a building. Siamese networks are designed for
this task.
Siamese networks learn a similarity function between two input images [BGL+93]. They
can be trained by comparing an anchor image to a positive and negative image or, more precisely,
comparing a metric of the distance between the feature vectors extracted from the images. The
objective is to simultaneously minimize the distance between the anchor and positive image
features and maximize the distance between the anchor and negative image features.
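This objective is commonly implemented as a triplet loss; a minimal sketch with an assumed margin hyperparameter:

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """anchor, positive, negative: feature vectors extracted from the three images."""
        d_pos = np.sum((anchor - positive) ** 2)    # distance to the positive example
        d_neg = np.sum((anchor - negative) ** 2)    # distance to the negative example
        return max(d_pos - d_neg + margin, 0.0)     # push d_neg beyond d_pos by the margin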
While Siamese networks are decades old, they can use modern techniques to improve
their performance. For instance, CNN models can be a component of a Siamese network. The
CNN models are trained across a variety of image appearances and used to extract features from
the images [ZK15].
Figure 3.17: A 3D-GAN generator takes a random vector z and generates a 3D image. Based
on [WZX+16].
PixelRNN and PixelCNN are auto-regressive models that predict the pixels along both
axes using recurrent and convolutional layers, respectively. These models generate a condi-
tional distribution over the 256 possible values for each RGB image channel at each pixel loca-
tion [vKK+16].
DCGAN and 3D GAN combine CNNs and GANs to generate 3D objects, as shown
in Figure 3.17 [RMC15, WZX+16]. These GANs learn to generate high-quality objects by
sampling from a low-dimensional space and passing those samples to the generator. Stacked
GANs train across multiple stacks of GANs, which results in higher quality image genera-
tion [HLP+17].
StarGAN and StyleGAN generate photorealistic images. For instance, they can generate
human faces adjusting latent factors, such as freckles, hair color, gender, eyeglasses, and facial
shape, when trained on a face dataset [CCK+17, KLA19].
Pix2pix is an adversarial network that learns a mapping from an input image to an output
image and also learns a cost function to train this mapping. It can generate realistic images from
labeled maps, colorize gray images, fill gaps in images, remove image backgrounds, and generate
images from sketches [IZZ+16].
Other computer vision topologies that have been influential in the field, but do not cur-
rently have extensive commercial adoption are FaceNet for face recognition and verification,
SqueezeNet and ShuffleNet for image classification on edge devices, SENet for high accuracy
image classification, SRGAN for image super-resolution, and SqueezeSegV2 for road-object
segmentation from LiDAR point cloud [SKP15, HSW+18, IHM+16, ZZL+17, HSA+19,
LTH+16, WZZ+19]. OpenPose is used for pose estimation and has some adoption; Wrnch.AI
uses a modified proprietary model to detect kinematics from 2D video.
3.3 NATURAL LANGUAGE PROCESSING TOPOLOGIES
NLP has been considered an AI-complete problem (requiring human-level intelligence to solve)
given the complexity required to understand language. NLP is a required step toward automatic
reasoning, that is, using stored knowledge to generate additional knowledge [Yam12]. Academia
and industry have made tremendous progress in recent years.
Traditional NLP systems often use a hidden Markov model (HMM) (do not worry if you
are not familiar with HMM). An HMM requires language experts to encode grammatical and
semantic rules, provide a detailed ontology, parse the data, tag words with the appropriate part-
of-speech, and iteratively align inputs and outputs. Neural NLP models can learn a particular
task using a lexicon or language vocabulary and a massive data corpus and without explicitly pro-
gramming the language rules. A popular benchmark to assess the performance of NLP models
is the General Language Understanding Evaluation (GLUE) benchmark [SMH+18].
Hyperscalers use NLP algorithms for NLU, speech recognition, speech generation, and
speech-to-speech translation tasks. NLU tasks include language translation, sentiment analysis,
automatic document summarization, image captioning, document clustering, and question &
answering. Speech recognition and speech synthesis are used as part of an NLP system by AI
assistants, such as Apple Siri, Amazon Alexa, Google Assistant, Microsoft Cortana, and Baidu
DuerOS. Speech-to-speech translation is used to interpret speech between different languages
either as three separate stages (speech-to-text, text translation, and text-to-speech) or as a com-
bined model. NLP algorithms facilitate human-machine interactions, enhancing a machine’s
ability to understand human language, and improve human-human communication, enabling
communication between people without a common language.
Figure 3.18: Beam search using a beam size of 2. Except for the initial decoder input at decoding
timestep 0, every timestep uses the 2 most probable outputs (underlined) from the previous
timesteps as inputs. At time t = 4, beam search results in the sentences: “I am playing the piano
with ...” and “I am playing the piano <eos> ...,” where <eos> is the end of sentence token.
from the nM choices. A common choice is n = 10 to provide a compromise between speed and accuracy. Figure 3.18 depicts a beam search with n = 2.
The quality of the target sentence in machine translation is typically reported using the
BLEU score, a measure of similarity between the machine’s translation and a professional
human translation normalized by the sequence length. Other quality metrics have been pro-
posed [NDC+17].
One implementation challenge during training is the variable sequence length. A common mitigation is to batch the sequences using a constant sequence length, padding short sequences and truncating long sequences to the predetermined length so that every sample in a batch has the same length.
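A sketch of this batching scheme (with a hypothetical pad token id of 0):

    def batch_sequences(sequences, max_len, pad_id=0):
        """Pad short token-id sequences and truncate long ones to max_len."""
        batch = []
        for seq in sequences:
            seq = seq[:max_len]                                  # truncate long sequences
            batch.append(seq + [pad_id] * (max_len - len(seq)))  # pad short sequences
        return batch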
The inputs to the NN are known as tokens. While earlier NLU topologies used words as
tokens, most newer topologies use learned subwords [SHB15]. An algorithm segments words
constrained to a fixed vocabulary size (the maximum number of subwords). These subwords are
often interpretable, and the model can generalize to new words not seen during training using
these subwords. Subwords are crucial for low-resource languages, that is, languages where the
data corpus is small. The downside of using subwords rather than words is that a sequence has
more tokens, which requires more computations.
Multi-language NMT involves learning one model used across multiple language pairs. These models are particularly helpful for low-resource languages. Some care is needed to use them over simpler
pairwise language models without sacrificing the performance of the translations from the high-
resource language pairs [ABF+19]. Jointly learning the subwords across the combined languages
Figure 3.19: Encoder and decoder LSTM units for a question-and-answer system. The input
sentence is represented by the thought vector.
has been shown to be beneficial [SHB15]. Google uses a multi-language NMT transformer-
based model to support translation across 100 languages [ABF+19].
RNN-Based
During serving, RNN-based models are challenging to parallelize due to their sequential na-
ture. A server CPU with fewer but more powerful cores than a GPU works well for RNN-based
inference [ZRW+18]. These models are typically memory bandwidth bound, leaving much com-
putational capacity unused. Some work demonstrates that their implementation can be modified
to be more compute bound [Vil18]. ShaRNN provides an example of an RNN model with a
small memory footprint, which is useful for edge deployments [DAM+19].
Despite the adoption of transformer-based models in commercial applications, RNN-
based models continue to be used commercially due to their adequate statistical performance
and low latency, and due to the larger memory and computational requirements of transformer-
based models [Mer19].
Sequence-to-sequence (S2S) was the first widely adopted NMT model, and provides
the foundation for similar models still used in production [SVL14]. The encoder LSTM units
take as input (1) the state of the previous LSTM cell, (2) the output of the previous LSTM cell,
and (3) the current token, as shown in Figure 3.19. The thought vector is the concatenated state
vector and output vector of the last LSTM encoder unit. This thought vector is an encoding of
the source sentence. The decoder takes the thought vector as input to the first decoder LSTM
unit and produces a target word. Each subsequent unit takes the output from the previous unit
as its input. This cycle continues until an LSTM outputs an end-of-sentence token. In practice, reversing the order of the source sentence typically results in better quality translations.
Variants of the original S2S topology include models with multiple stacked bidirectional
LSTM layers and bidirectional attention [SKF+16]. The term NMT is sometimes incorrectly
used as a synonym for S2S or for GNMT.
Google’s Neural Machine Translation (GNMT) is the most popular RNN-based
model [WSC+16]. GNMT learns a better thought vector by simultaneously training across
multiple languages and incorporates an attention module to cope with long sentences [LPM15].
The main idea of GNMT is that the thought vector should be the same regardless of the source
and target language since it captures a meaning, which should be independent of the language.
CNN-Based
Using CNNs may have a computational advantage over RNNs, given they are more easily par-
allelizable and have a higher operational intensity (discussed in Section 7.3). Another advantage
is they extract features hierarchically and may better capture complex relationships in the data.
Bai et al. demonstrated that CNN-based models outperform RNN-based models on var-
ious NLP long-sequence tasks [BKK18]. Similarly, Facebook demonstrated that CNN-based
models had a computational advantage over GNMT at a similar statistical performance (both
on CPUs and GPUs) [GAG+17]. When the models are of the same size, the CNN-based models outperform GNMT.
CNN models have also been used as a preprocessing step to image captioning by extracting
relevant features [VTB+14]. In particular, the second-to-last activation output in a CNN model
is often used as the feature vector. This vector is an encoding of the image and passed to an NLU
decoder to generate a caption. Attention can improve the captions by focusing the decoder on
certain parts of the input image [YHG+15].
Transformer-Based
Transformer-based models use attention modules without any RNN units. The first
transformer-based model, Transformer-LT, was introduced by Google in the 2017 paper At-
tention is All You Need and has been shown to statistically outperform RNN-based methods on
various NLP tasks [VSP+17, KCH+19]. These models are more easily parallelizable than RNNs,
can learn longer-term dependencies, and have higher arithmetic intensity.
A transformer primarily consists of a set of encoder and decoder blocks with the same
structure but different weight values and with skip connections, as shown in Figure 3.20. Each
encoder block consists of two main layers: a self-attention and a feedforward layer, where the
self-attention block helps account for context in the input sentence. Each decoder block consists
of three main layers: a self-attention, an encoder-decoder attention, and a feedforward layer. In
the decoder, the encoder-decoder attention allows the decoder to focus on the crucial parts of the
encoder representation. Words (or subwords, in practice) get embedded into vectors. A stack of
encoders processes these vectors, and a stack of decoders processes their output. The architecture
has skip-connections added and normalized after each layer. The target output word is chosen
from the softmax output using a beam search approach.
Bidirectional Encoder Representations from Transformers (BERT) is a bidirectional
transformer model developed by Google, and widely adopted across hyperscalers [DCL+18].
BERT achieved state-of-the-art results on multiple NLP tasks using a massive corpus of unan-
notated text crawled from the web, rather than a corpus labeled for a specific task. The standard
Figure 3.20: (a) A transformer is composed of several encoder and decoder blocks; (b) each
block has an attention layer (the decoder has two) and a feedforward layer; and (c) the entire
transformer model with N blocks is depicted. Based on [Ala18, VSP+17].
embedding models before BERT, such as word2vec or GloVe (discussed in Section 2.7), learned
context-free word embeddings, whereas BERT uses context to learn better embeddings. BERT
is used by Google Search to better understand long search queries to improve the quality of the
results [Nay19].
BERT is trained using two self-supervised learning tasks. In one task, the model predicts
a randomly masked-out word based on the context of the words before and after it. In the second
task, the model predicts whether the second sentence follows the first sentence in the original
paragraph.
BERT and other transformer-based models are shown in Figure 3.21 and the most promi-
nent are highlighted in Table 3.1. Typically, newer models better capture the dependencies be-
tween tokens [YDY+19].
Large transformer-based models require considerable power and compute to train and
deploy. While hyperscalers widely use them, they are less common at companies without WSCs
due to the training costs. Also, larger transformer-based models may not meet the stringent low
latency inference requirements in some applications.
Table 3.1: Prominent transformer-based models
Model                 Institution     Released    Parameters (millions)    Dataset (GB)
BERT [DCL+18]         Google          Oct. 2018   340                      16
GPT-2 [RWC+19]        OpenAI          Feb. 2019   1,500                    40
XLNet [YDY+19]        CMU             Jun. 2019   340                      158
RoBERTa [LOG+19]      Facebook        Jul. 2019   355                      160
ERNIE 2.0 [SWL+19]    Baidu           Jul. 2019   340                      –
ALBERT [LCG+19]       Google          Sep. 2019   235                      16
DistilBERT [SDC+19]   Hugging Face    Oct. 2019   66                       16
T5 [RSR+19]           Google          Oct. 2019   11,000                   750
Turing-NLG [Ros20]    Microsoft       Feb. 2020   17,000                   –
GPT-3 [BMR+20]        OpenAI          May 2020    175,000                  570
GShard [LLX+20]       Google          Jun. 2020   600,000                  –
The Hugging Face Transformers, Facebook Fairseq, and AWS Sockeye 2 libraries con-
tain several transformer-based models to facilitate wider adoption [DDV+20]. Future models are likely to strike a compromise between prodigious, costly models and smaller, efficient models with lower serving latencies that medium-size companies and universities can train and adopt. These
include smaller BERT-like models, such as ALBERT by Google, DistilBERT by Hugging
Face, and Q-BERT by UC Berkeley. Other solutions are replacing computationally expensive
layers with light convolutions, adapting the number of attention layers, or removing most atten-
tion layers during inference to reduce serving latency [LCG+19, SDC+19, SDY+19, WFB+19,
SGB+19, MLN19].
Figure 3.21: Multiple transformer-based models and their respective number of parameters
across time. Based on [San19].
ASR systems and other speech-related systems often transform the acoustic sound waves
into a spectrogram or Mel-spectrogram representation. A spectrogram is a 2D frequency-time representation of the acoustic signal that captures the frequency content over short time intervals, as shown
in Figure 3.22. In the figure, the color represents the amplitude of a particular frequency at a
specific time interval. The Mel-spectrogram is a spectrogram where the frequencies are scaled
using the mel-scale to better match the frequency resolution of the human auditory system.
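As a rough illustration, a Mel-spectrogram can be computed with the librosa library; the file name and parameter values in this sketch are illustrative assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)            # waveform and sample rate
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                   hop_length=160, n_mels=80)
S_db = librosa.power_to_db(S, ref=np.max)                   # log-amplitude (dB) scale
print(S_db.shape)                                           # (n_mels, number of frames)
```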
Deep Speech 2 (DS2), developed by Baidu, was the first major neural ASR model and provides a baseline for other models. DS2 uses a spectrogram as the input to a series of CNN and
RNN layers [AAB+15]. The CNN layers treat the spectrogram input as an image.
Listen, Attend, and Spell (LAS) was developed by Google. This model uses SpecAug-
ment for data augmentation. SpecAugment uses image augmentation techniques on the spec-
trogram [CJL+16, PCZ+19]. The LAS system has an encoder and decoder. The encoder is a
pyramid RNN. The decoder is an attention-based RNN that emits each character conditioned
on all previous characters and the entire acoustic sequence.
RNN-Transducer (RNN-T) processes the input samples and streams alphabetical char-
acter outputs. It does not use attention. For mobile devices, Google developed a quantized
RNN-T model that runs in real-time on a Google Pixel device and is deployed with the Gboard
app with 80 MB memory footprint [HSP+19, Sch19].
Wav2letter++ is an open-source neural ASR framework developed by Facebook; it uses
the fully convolutional model ConvLM [PHX+18, ZXL+18]. Facebook also demonstrated the
use of transformers for ASR [WML+19].
3.3.3 TEXT-TO-SPEECH
Text-to-speech (TTS) is the task of synthesizing speech from text. The most well-known TTS
system is probably the one used by the late Prof. Stephen Hawking. A TTS system is typically
composed of three stages: (1) a text-analysis model, (2) an acoustic model, and (3) an audio
synthesis module known as a vocoder. Traditionally, audio synthesis modules combined short-
speech fragments collected from a user to form complete utterances. Using these fragments
makes it difficult to modify the tone or voice characteristics and results in a robotic-like synthesis.
Neural TTS systems are now able to generate human-like speech as measured by the
MOS (Mean Opinion Score), a human evaluation of the quality of voice. A neural TTS model
can learn to generate voices with different characteristics. Such models can also be adapted to generate music or to produce speech from an image. Facebook uses automatic captioning to help visually impaired
users browse their News Feed and hear a machine-generated caption of each image [WWF+17].
Google Duplex uses neural TTS models on Pixel phones, for example, to contact restaurants to
make reservations [LM18].
The primary neural speech synthesis systems deployed in production are WaveNet, Parallel
WaveNet, and WaveRNN and require a text-to-linguistic features preprocessing step. Tacotron
2 provides a full end-to-end text-to-speech generator. Deep Voice 3 and ClariNet are speech
synthesizers (not end-to-end TTS) developed by Baidu that have been influential and may be
used in production. GAN-based TTS is starting to gain traction in academia despite the earlier
unknowns of how to use GANs with discrete values [BDD+20].
WaveNet by Google is an autoregressive vocoder model based on the PixelCNN model [vDZ+16]. It predicts a distribution for each audio sample conditioned on all previous
audio samples and the input linguistic features. These features are derived from the input text
and contain phoneme, syllable, and word information. To deal with long-range temporal depen-
dencies needed for raw audio generation, WaveNet uses a stack of dilated causal convolutions
to allow their receptive fields to grow exponentially with depth.
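A stack of dilated causal convolutions is simple to sketch. The PyTorch snippet below is an illustrative sketch, not WaveNet's implementation; it shows how doubling the dilation at each layer makes the receptive field grow exponentially with depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only looks at past samples (pads on the left)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        return self.conv(F.pad(x, (self.left_pad, 0)))   # (left, right) padding

# Dilations 1, 2, 4, ..., 128: the stack's receptive field is 1 + (2^8 - 1) = 256 samples.
net = nn.Sequential(*[CausalConv1d(16, kernel_size=2, dilation=2**i) for i in range(8)])
x = torch.randn(1, 16, 1000)        # (batch, channels, time)
print(net(x).shape)                 # output keeps the input length: (1, 16, 1000)
```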
WaveNet suffers from high serving latency due to the sequential generation of audio sam-
ples. WaveNet quantizes each audio sample to an 8-bit integer value (rather than the 16 bits typical in audio) to reduce the latency and make the softmax output more tractable.
Parallel WaveNet by Google uses knowledge distillation to train a feedforward net-
work with WaveNet [vLB+17, HVD15]. Knowledge distillation (detailed in Section 6.4) uses a
teacher model to train a smaller, more efficient student model. The FFNN is easily parallelizable
and generates speech samples in real-time with minimal accuracy loss compared to WaveNet.
Google Assistant uses Parallel WaveNet.
Tacotron 2 by Google is a generative end-to-end model trained with audio and text
pairs that synthesizes speech directly from characters and combines the methodologies of the
popular WaveNet and Tacotron to generate human-like speech [SPW+17, WSS+17]. Specif-
ically, Tacotron 2 uses CNN and LSTM layers to encode character embeddings into Mel-
spectrograms, capturing audio with various intonations. This Mel-spectrogram is then converted
to waveforms using a WaveNet model as a vocoder. This system can be adapted to generate
speech audio in the voice of different speakers [JZW+18]. A speaker encoder network can gen-
erate a vector representation for a given speaker using seconds of reference speech from a target
speaker. The Tacotron 2 network is adapted to generate speech conditioned on this vector rep-
resentation.
WaveRNN by Google uses a dual softmax layer to predict 16-bit audio samples efficiently;
each softmax layer predicts 8 bits. For real-time inference in mobile CPUs, the small model
weights are pruned (removed or forced to zero) [KES+18]. LPCNet is a WaveRNN variant that
achieves higher quality by combining linear prediction with the RNN [VS19].
Deep Voice 3 (DV3) by Baidu is a generative end-to-end model synthesizer, similar
to Tacotron 2 [PPG+17]. The primary difference is that Tacotron 2 uses a fully convolutional
Figure 3.23: Reinforcement learning can be used to learn to balance the pole. Source: [Wik12]
(CC BY-SA 1.0).
showing superior performance to previous RL methods across various Atari games (soon after, Google acquired DeepMind, which is now a sibling company to Google under Alphabet) [KSe+13]. Using a variety of Q-learning models achieves better performance than any single Q-learning model [HMv+17].
Policy optimization, also known as on-policy, learns the policy function and selects the
output action stochastically. A policy is the agent’s strategy. A policy function maps the input
state to a distribution of actions, and a DL model can represent this function.
Policy optimization was popularized by the Policy Gradient (PG) algorithm that showed
superior performance over DQN [MKS+15]. The space is explored initially through random
actions. Actions that lead to a positive reward are more likely to be retaken.
A primary challenge is the sparse delayed rewards, formally known as the credit assign-
ment problem. The agent receives a reward after taking several actions. The reward can be pos-
itive or negative. Depending on the reward, all the actions taken are considered good or bad,
even if only some of them were critical to receiving the reward. Given the sparse rewards, policy
optimization requires lots of training samples. Alternatively, manually shaping the rewards for a particular task can guide the learning behavior. Trust Region Policy Optimization (TRPO) is typically used over vanilla PG as it guarantees monotonic policy improvements [SLM+17].
A comparison of TRPO to DDPG and other PG-based algorithms, such as Proximal Policy
Optimization (PPO) and Actor-Critic using Kronecker-Factored Trust Region (ACKTR) can
be found elsewhere [HIB+19].
Various algorithms combine Q-learning and policy optimization methodologies. The
most popular ones are A3C and DDPG [MBM+16, LHP+19]. Asynchronous Actor-Critic Agents (A3C) uses a policy-based actor and a value-based critic to measure how good the chosen action is. Deep Deterministic Policy Gradients (DDPG) uses continuous (rather than discrete)
actions. While TRPO, DDPG, and A3C are typically good algorithms to use, experimentation
is required to determine the most suitable for a particular task.
Model-based algorithms use a model with the rules of their environment. The agent uses
the model to infer the outcomes of various sets of actions and chooses the set with the max-
imum reward. Model-based algorithms are used in games like chess and Go, where the rules
of the game are known. DeepMind’s AlphaGo, AlphaGo Zero, AlphaZero, and MuZero use
model-based algorithms [SHM+16, SSS+17, SSS+18, SAH+20]. Learning a model through
trial and error introduces biases, and errors in the inferred outcome compound over the predic-
tion horizon. Model-based policy optimization (MBPO) uses a model with policy optimization
to mitigate the compounding errors [JFZ+19].
In this chapter, we detailed the types of workloads that typically use DL models at hyperscalers:
recommenders, computer vision, and NLP. We discussed the common topologies used in each of
these workloads. Despite having the smallest adoption in academia, top hyperscalers widely use
recommender models. We highlighted popular academic trends in RL that may soon transition
to commercial applications. In the next chapter, we review how to train a topology, including
how a data scientist may use an existing topology to guide the topology design for a related
application.
CHAPTER 4
Training a Model
Training a model to achieve high statistical performance within a computational and power
budget requires several design considerations. These include defining a topology, preparing the
dataset, properly initializing the model weights, selecting an optimization algorithm and objec-
tive function, reducing the model size, and evaluating the trained model. The training process
can be computational and memory intensive, and there are techniques discussed in this and the
next two chapters to reduce the training time and mitigate memory bottlenecks.
In Section 1.6, we introduced the training steps. The training stops when the validation
error is either less than some threshold or does not continue to decrease after several iterations.
The validation error is computed every n training iterations, where n is chosen by the data sci-
entist. It is used as a metric of how the model will perform when it is deployed.
During the backpropagation step, the computed gradients provide a measurement of the
contribution of each weight to the cost. The terms cost, loss, penalty, error, and objective func-
tion, are sometimes used interchangeably. In this book, loss represents a metric of difference
between the expected output and actual output for one data sample, and cost, error, and objec-
tive function synonymously represent the sum of the losses for a batch of samples. Examples of
common objective functions are the cross-entropy error (discussed in Section 4.4) and the mean
square error (MSE) for classification and regression tasks, respectively.
In the remainder of this chapter, we detail how to train a model to achieve low training
and low test error. We review techniques to improve the performance on each of the training
steps outlined in Section 1.6. We provide the methodologies that experienced data scientists use
in industry to deal with unbalanced datasets, design new topologies, resolve training bugs, and
leverage existing pre-trained models. We also discuss methods to reduce memory bottlenecks.
Distributed training algorithms to reduce the training time are discussed in Chapter 6. A review
of the notation introduced in Section 1.10 can help understand the equations presented in this
chapter.
Figure 4.1: (a) Four training samples (blue dots) and one validation sample (green dot). (b) A
fourth-order polynomial function has zero training error but high validation error. (c) A sim-
pler first-order polynomial function has low validation error. The red dot represents the model’s
prediction.
This section discusses the sources of high test error rates, specifically, underfitting, overfitting, and sharp minima, and how to reduce this error.
Underfitting occurs when the model is too small because it has too little learning capacity
and cannot properly learn the general characteristics of the data. The symptoms of underfitting
are high training error and high test error. The best technique to mitigate underfitting is to use
a more complex model. In DL, this means increasing the topology’s representative capacity by
adding more layers and more weights.
Overfitting occurs when a model has too much learning capacity and learns to fit the
noise in the training data samples or other characteristics unique to the training set. Overfitting
happens when using a prodigious model with insufficient training samples. The symptoms of
overfitting are low training error and high test error. Figure 4.1 illustrates overfitting with a toy
1D example using linear regression, a simple ML algorithm. Figure 4.1a shows four training
samples (the blue dots) and one validation sample (the green dot) not used during training.
The x -axis is one feature, such as house size, and the y -axis is the label, such as house price.
A polynomial function of third or higher-order can perfectly pass through the four training
data points. The illustration uses a fourth-order polynomial for simple visualization. Figure 4.1b
shows the model has no training error but has a higher validation error (the squared distance
between the red and green dots). A simpler first-order (affine) function does not perfectly pass
through all the training data points but has low validation error, as shown in Figure 4.1c. The red
dot shows what each model predicts on the validation sample, and the green dot is the ground
truth for that sample. The complex model overfits the training samples; it has zero training error
but high validation error compared to the simpler model. Therefore, in this example, the simpler
model is preferred.
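This toy example is easy to reproduce. The sketch below uses made-up data points and fits a first-order and a third-order polynomial with NumPy; degree three is the lowest degree that passes exactly through the four training points.

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 1.9, 3.2, 3.9])      # roughly linear with noise
x_val, y_val = 2.5, 2.5                       # held-out validation sample

for degree in (1, 3):
    coeffs = np.polyfit(x_train, y_train, degree)                     # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = (np.polyval(coeffs, x_val) - y_val) ** 2
    print(f"degree {degree}: train MSE {train_err:.4f}, validation error {val_err:.4f}")
```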
Figure 4.2 illustrates what happens to the training and validation error as the model grows
in complexity. While the training error decreases with more complexity, the validation error first
Figure 4.2: The ideal level of model complexity is where the validation error is the lowest.
decreases and then increases. A model with complexity left of the dashed line is underfitting,
and a model with complexity right of the dashed line is overfitting. The sweet spot is right at the
dashed line, where the model has the lowest validation error. The model is complex enough to
learn the characteristics of the data to avoid underfitting but simple enough to avoid overfitting.
The validation error is much more important than the training error because it represents
the expected error when the model deploys in production. In ML theory, minimizing these
errors is known as the bias-variance tradeoff. A high training error indicates high bias or underfitting. A high validation error combined with a low training error indicates high variance or overfitting. It is
always critical to determine the source of poor performance (overfitting or underfitting) before
prescribing a solution.
An interesting and counterintuitive phenomenon unique to various DL topologies is the
deep double descent, illustrated in Figure 4.3 [NKB+20]. As the topology complexity increases
(that is, as the model grows in depth), the validation error first follows the expected trajectory of
decreasing and then increasing, but then it begins to decrease again. That is, increasing the size of
the topology can lower the test error in some scenarios. The exact reason is not well understood
as complex models should result in overfitting. A tentative (hand-wavy) reason is that very large
topologies can explore a larger solution space leading to superior solutions. Understanding this
phenomenon and the impact on the recommended training techniques is ongoing research. Most
practitioners safely ignore this phenomenon or are not aware of it.
Another source of poor generalization may be sharp minima [HS97]. This hypothesis is
based on empirical evidence. Figure 4.4 illustrates the intuition with a toy 1D example using
only one weight or feature (the x -axis). Training involves iteratively updating the model and
moving to an area in the solution space with lower training error. The training cost function
(solid blue line) is similar but slightly different than the testing cost function (dotted green line).
This difference is because the test samples are similar but not identical to the training samples.
In this example, the flat minimum solution and the sharp minimum solution have the same
Figure 4.3: An illustration of the deep double descent observed in some DL topologies; as the
complexity increases, the validation error decreases and then increases as expected, but then it
begins to decrease again. Based on [NKB+20].
Figure 4.4: In this toy example, the cost function with respect to the test dataset is slightly shifted
from the cost function with respect to the training dataset. The sharp minimum solution has a
high test error. The flat minimum has a small test error. Based on [KMN+17].
training error but different test errors. These errors are represented by J(w) along the y-axis. The flat minimum solution has a low test error, while the sharp minimum solution has a high test error (the green dot). A measurement of flatness is the trace of the Hessian; a small trace indicates a flat minimum [DYC+19].
While a flat minimum generalizes better to unseen data, a sharp minimum does not nec-
essarily indicate overfitting, and a flat minimum does not necessarily indicate low validation
error [ML18]. Also, the functions resulting in a flat minimum can be altered to result in a sharp
minimum without affecting the validation error, demonstrating the hypothesis above does not
always hold [DPB+17].
There are various techniques to improve generalization, often by simplifying (regularizing)
the model. The most common ones are as follows:
Larger datasets are the best technique to avoid overfitting. The toy example above only used four samples to train the fourth-order polynomial. Adding more samples while keeping the same model complexity (fourth-order polynomial) results in a more affine-like function that better generalizes to data not in the training set. For NLP models, OpenAI recommends increasing the number of parameters by 2.55× whenever the dataset doubles to improve learning capacity and avoid over/underfitting [KMH+20].
Weight decay (also known as L2-regularization) penalizes the magnitude of the weights and reduces overfitting. In the fourth-order polynomial example above, this would penalize the magnitude of the coefficients and result in a more affine-like function. The objective function incorporates the weight decay by adding a penalty term:

$$\text{new cost} = \text{cost} + \lambda \|\mathbf{w}\|_2^2,$$

where λ ≥ 0 is the regularization factor and w is the vector of model weights (the polynomial coefficients in the regression example above). The bias weight does not have a multiplicative interaction with the activations; therefore, it is not regularized. Note that L1 regularization (rather than L2, as shown above) is less common.
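A minimal NumPy sketch of this penalty term, assuming hypothetical `base_cost` and `weights` values:

```python
import numpy as np

def cost_with_weight_decay(base_cost, weights, weight_decay=1e-4):
    """new cost = cost + lambda * ||w||_2^2 (bias terms are not regularized)."""
    return base_cost + weight_decay * np.sum(weights ** 2)

print(cost_with_weight_decay(1.25, np.array([0.5, -2.0, 3.0])))
```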
Smaller batches improve generalization [ML18]. A training iteration involves processing a batch of data. Larger batches can have computational advantages (they have higher data reuse), but often large batches result in sharp minima. The ideal is a medium-size batch where
the model converges to a flat minimum and has high compute utilization. Finding an adequate
batch size requires experimentation.
A better optimizer can find a solution with a lower validation error. In Section 4.3, we
discuss the gradient descent optimizer and others less prone to sharp minima solutions, such as
LARS, LAMB, and RangerLARS.
Topology pruning means forcing some of the smaller weights to zero or removing parts
of the model. In Section 6.3, we discuss pruning in more detail.
Label-smoothing regularization (LSR) modifies the ground-truth one-hot vector by adding a small ε/M value to all the zero entries, where M is the number of classes and ε is a small value, such as ε = 0.1 [SVI+15]. The "1" entry in the one-hot vector is changed to 1 − ε to maintain a valid probability distribution. Reducing the difference between the largest logit and all others reduces the confidence of a model and results in better adaptation to non-training samples.
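The sketch below implements the standard formulation (1 − ε) · one-hot + ε/M, which distributes ε over all M entries so the vector sums exactly to 1; it is a minimal illustration rather than any particular library's API.

```python
import numpy as np

def smooth_label(class_index, num_classes, epsilon=0.1):
    one_hot = np.zeros(num_classes)
    one_hot[class_index] = 1.0
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

print(smooth_label(class_index=2, num_classes=5))   # [0.02 0.02 0.92 0.02 0.02]
```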
Early stopping means the training stops when the validation error begins to increase.
Similarly, the model is evaluated on the validation dataset and saved every n training iterations,
and the model with the lowest validation error is selected. There are mixed opinions on using
early stopping. Regularization via weight decay without using early stopping can lead to better results when the computational resources are available to experiment with multiple weight
penalties. In practice, early stopping is a simple and effective technique to reduce overfitting and is commonly used. Note, somewhat related, that Hoffer et al. demonstrated better generalization
with additional training cycles when the validation error has plateaued, but the training error
continues to decrease [HHS17].
Model ensemble is where an ensemble (group) of models is trained for a particular task.
During inference, a combination of the models’ predictions is used, such as the average. Com-
bining the predictions reduces the impact of each model overfitting. More formally, model en-
semble reduces the variance of the validation error.
In addition, normalization and dropout (discussed in Sections 2.6 and 2.9) are other forms of
regularization which reduce overfitting.
For ReLU layers, Kaiming initialization samples the weights at Layer l from a zero-mean normal distribution with variance 2/D^(l), where D^(l) is the number of units in Layer l [HZR+15]. A truncated normal (the sides of the distribution are truncated) is recommended to prevent initializing the weights with large magnitudes. Kaiming initialization allows the training of much deeper networks. Before this technique was developed,
the authors of the well-known VGG paper meticulously initialized the layers of the larger VGG
networks in various steps. With Kaiming’s initialization, this is no longer needed.
For sigmoid or hyperbolic tangent layers, the Xavier initialization is preferred [GB10].
The weights at Layer l are sampled from a uniform distribution U(−k, k), where

$$k = \sqrt{\frac{6}{D^{(l)} + D^{(l+1)}}}.$$
These initialization techniques can be adapted to train hypernetworks, meta-NNs that generate
weights for a primary NN [CFL20].
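Both schemes reduce to sampling from a simple distribution. The minimal NumPy sketch below uses an untruncated normal for brevity, although the text recommends a truncated normal for Kaiming initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(fan_in, fan_out):
    """ReLU layers: zero-mean normal with variance 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_uniform(fan_in, fan_out):
    """Sigmoid/tanh layers: U(-k, k) with k = sqrt(6 / (fan_in + fan_out))."""
    k = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-k, k, size=(fan_out, fan_in))

W1 = kaiming_normal(fan_in=512, fan_out=256)
W2 = xavier_uniform(fan_in=256, fan_out=128)
```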
Bias Initialization
It is common to initialize the bias weights to zero. Exceptions are as follows:
• The bias of the last layer in a model for binary classification trained with imbalanced datasets (far more negative than positive samples) should be initialized to [Kar19]

$$\log_e\!\left(\frac{\text{number of positive samples}}{\text{number of negative samples}}\right).$$
• The bias of the last layer in a regression model trained with imbalanced datasets should
be initialized to the expected mean output value. Alternatively, the data targets should
be normalized, and the bias initialized to 0.
• The bias of the LSTM forget gate should be initialized to 1 to prevent the LSTM unit
from forgetting at the start of training. The model needs some training cycles to learn
to forget [GSC99, JZS15].
• The bias of the LSTM input and output gates should be initialized to −1 to push the initial memory cell activations toward zero [HS97].
• The bias in a ReLU layer may be initialized to a positive value to reduce the number of
zero activations that may cause the dying ReLU phenomenon [Ste19]. However, the
benefits have not been extensively explored.
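As a small worked example of the first rule above (the sample counts are made up):

```python
import numpy as np

num_positive, num_negative = 1_000, 99_000
initial_bias = np.log(num_positive / num_negative)   # log_e(pos / neg) ~ -4.6

# With a sigmoid output, this bias makes the initial predicted positive probability
# roughly equal to the positive-class frequency (about 1%), which speeds up early training.
print(initial_bias, 1.0 / (1.0 + np.exp(-initial_bias)))
```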
Gradient descent (GD) computes the cost over the training samples and updates the weights in the direction of the negative gradient:

$$J(\mathbf{w}) = \sum_{n=0}^{N-1} \operatorname{loss}\!\left(f_{\mathbf{w}}\!\left(\mathbf{x}^{[n]}\right), \mathbf{y}^{[n]}\right)$$
$$\mathbf{g} = \frac{dJ(\mathbf{w})}{d\mathbf{w}} = \nabla_{\mathbf{w}} J(\mathbf{w})$$
$$\mathbf{w} := \mathbf{w} - \alpha\,\mathbf{g},$$
where w represents all the weights in the model and α is the learning rate (LR). Note that a weight decay term (see Section 4.1) is used in practice; it is excluded from all the equations in this section to simplify notation.
The LR controls the change of the model in response to the gradient and is the most
critical hyperparameter to tune for numerical stability [Ben12]. In Section 4.5.4, we provide
recommendations on tuning this and other hyperparameters. Figure 4.5 shows a GD update toy
example in a 1D space using different LRs. A high LR can cause the model to diverge, where
the cost increases rather than decreases. A small LR can result in a larger-than-needed number of convergence steps and a longer training time. A good LR results in proper progress toward the minimum
(the green arrow in the figure).
In SGD or, more precisely, mini-batch gradient descent (MBGD), the dataset is divided
into several batches. In statistics literature, SGD means MBGD with a batch size of 1, but
in most DL literature and in this book, SGD refers to MBGD with any arbitrary batch size
less than the training dataset. When the batch size equals the full-batch, SGD becomes GD,
and one epoch equals one training iteration. In SGD, the gradient used to update the model is
computed with respect to a mini-batch (as opposed to the entire dataset), as shown in Figure 4.6,
and otherwise, the implementation of SGD and GD are equivalent.
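A minimal NumPy sketch of a mini-batch SGD loop, using a toy linear least-squares model as the cost (all names and values are illustrative):

```python
import numpy as np

def compute_cost_and_gradient(w, x, y):
    err = x @ w - y
    return 0.5 * np.mean(err ** 2), x.T @ err / len(y)

def sgd_step(w, x_batch, y_batch, lr):
    cost, grad = compute_cost_and_gradient(w, x_batch, y_batch)
    return w - lr * grad, cost          # w := w - alpha * g

rng = np.random.default_rng(0)
x, y = rng.normal(size=(256, 3)), rng.normal(size=256)
w = np.zeros(3)
for i in range(0, len(y), 32):          # one iteration per mini-batch of 32 samples
    w, cost = sgd_step(w, x[i:i + 32], y[i:i + 32], lr=0.1)
```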
There are two main challenges with GD and large-batch SGD. First, each step or iteration
is computationally expensive as it requires computing the cost over a large number of samples.
Second, the optimizer may converge to a sharp minimum solution (rather than getting stuck at a saddle point, as previously thought) that often does not generalize, as shown in Figure 4.4 [ML18,
YGL+18, DPG+14].
Figure 4.5: Gradient descent update using LRs that are (red arrows) too large or too small, and
(green arrow) good enough.
The Hessian (this is the second derivative in 1D) can be used to analyze the curvature
of the objective function along the various dimensions to determine if a solution is in a flat or
sharp minimum. Smaller absolute eigenvalues indicate a flatter curvature in the corresponding
dimension, and the average Hessian trace provides a metric for the average curvature across all
dimensions; a higher trace value indicates a sharp minimum [DYC+19].
The algorithmic reasons for the convergence to a sharp minimum are not well understood.
One hypothesis is that the objective function has many sharp minima and gradient descent does
not explore the optimization space but rather moves toward the local minimum directly under-
neath its starting position, which is typically a sharp minimum [KMN+17]. This hypothesis conflicts with the hypothesis that the objective function is roughly convex [XAT+18]. Additional
research is required to understand the reasons better.
The batch size is an important hyperparameter to tune. A larger batch size has higher
compute utilization because there is more data reuse; that is, the compute-to-data-read ratio is
higher for larger batches. However, using very large batches suffers from the same challenges
as GD and requires meticulous tuning to avoid converging to a sharp minimum. Still, using a
micro-batch is not ideal because the computational resources are typically underutilized. Further-
more, micro-batches do not have sufficient statistics to properly use batch normalization [Iof17].
There is a sweet spot of a batch size where it is large enough to use the hardware compute units
efficiently and small enough for the model to properly converge to a flat minimum without too
much hyperparameter tuning.
Shallue et al. demonstrated empirically, across several models and datasets, that for a given optimizer and model there are three batch size regions. There is a perfect scaling region, where
Figure 4.6: The dataset is broken into M batches, and the weight vector (two dimensions in this
toy example) is updated using the gradient computed with respect to the cost associated with
a batch. The progress toward the minimum (the inner oval) is not smooth (unlike in GD) but
faster than GD: for every 1 GD step, SGD takes M steps.
Table 4.1: Batch size scaling regions across the three models observed in Figure 4.7
the batch size and LR proportionally increase and the number of training iterations proportion-
ally decreases. There is a diminishing-returns region, where increasing the batch size decreases
the number of iterations but not proportionally. And there is a stagnation region, where increas-
ing the batch size provides minimal to no benefits. The stagnation occurs because the gradients
computed with a large-batch have low variance. They already closely approximate the GD gra-
dient, and increasing the batch size further does not result in significantly different gradients.
Furthermore, as already discussed, very large batches may converge to sharp minima. Figure 4.7
captures some of their results on three popular models and datasets and Table 4.1 summarizes
the results in the figure [SLA+19]. In Section 4.5.4, we discuss hyperparameter tuning, which
includes choosing a batch size.
Training iterations should (on average) decrease the training error. A plateaued training error indicates that the solution is bouncing along the edges of the objective function and no
Figure 4.7: The number of training steps required to meet the expected training and valida-
tion error as a function of batch size for three models. Dotted line denotes perfect scaling. See
Table 4.1 for the high-level summary. Source: [SLA+19] (CC BY-SA 4.0).
Figure 4.8: Toy example of a 2D space with a ravine. (a) SGD makes slow progress. (b) SGDM
makes faster progress toward the minimum. Based on [Orr99].
longer converging. Decreasing the LR can help the error continue to decrease and converge to a
solution closer to the local minimum. A better approach may be to use a cyclical LR between a
user-set high and low LR to better explore the solution space, in particular toward the later part
of training [LH17, Smi17, IPG+19]. Each learning cycle starts at the high LR, which decreases
with each iteration. After reaching the low LR, another learning cycle starts (at the high LR).
This technique can be applied with all the optimizers.
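A cyclical schedule is only a few lines. The triangular-style decay below is an illustrative sketch; the high LR, low LR, and cycle length are hypothetical values the data scientist would tune.

```python
def cyclical_lr(iteration, high_lr=0.1, low_lr=0.001, cycle_length=1000):
    position = (iteration % cycle_length) / cycle_length    # 0 at cycle start, near 1 at end
    return high_lr - (high_lr - low_lr) * position          # decay high -> low, then restart

for it in (0, 500, 999, 1000):
    print(it, cyclical_lr(it))
```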
SGDM improves the speed of convergence over SGD alone [Qia99]. Most training in
the literature that claims SGD actually used SGDM. That is, the term SGD is often an alias
for SGDM in published literature but not in this chapter to avoid confusion. SGD alone makes
slow progress in ravines (areas where the partial derivative in one dimension is much higher than
other dimensions), as shown in Figure 4.8. Ravines are prevalent when optimizing over millions
of dimensions, which is common in DL models.
SGDM accelerates SGD in the direction of the exponential decaying average of past
gradients, also known as the first moment or just moment, and dampens oscillations. Rather than
directly modifying the weights, the gradients modify this moment, and the moment is then used
to update the weights as follows:
$$\mathbf{g} = \nabla_{\mathbf{w}} J(\mathbf{w})$$
$$\mathbf{m} := \beta\,\mathbf{m} + (1-\beta)\,\mathbf{g}$$
$$\mathbf{w} := \mathbf{w} - \alpha\,\mathbf{m},$$
where m is the (exponentially decaying) average gradient or first moment that gets decayed by the momentum term β, usually set to β = 0.9, m is initialized to 0, and α is the LR, which requires tuning. SGDM is widely adopted in the industry, in particular for computer vision models, and works well across multiple tasks when the learning rate is properly tuned.
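A minimal NumPy sketch of the SGDM update above:

```python
import numpy as np

def sgdm_step(w, grad, m, lr=0.01, beta=0.9):
    m = beta * m + (1.0 - beta) * grad     # exponentially decaying average (first moment)
    w = w - lr * m                         # the moment, not the raw gradient, updates w
    return w, m

w, m = np.array([1.0, -2.0]), np.zeros(2)
w, m = sgdm_step(w, grad=np.array([0.3, -0.1]), m=m)
```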
Adaptive Moment Estimation (Adam) is more robust than momentum to different LRs,
and therefore requires less LR tuning [KB17]. Adam computes an adaptive LR for each pa-
rameter. Specifically, Adam uses an average gradient (as in SGDM) normalized by an average
gradient squared called the second moment or variance. Thus, every weight is updated with a
different LR as follows:
$$\mathbf{g} = \nabla_{\mathbf{w}} J(\mathbf{w})$$
$$\mathbf{m} := \beta_1\,\mathbf{m} + (1-\beta_1)\,\mathbf{g}$$
$$\mathbf{v} := \beta_2\,\mathbf{v} + (1-\beta_2)\,\mathbf{g}^2$$
$$\hat{\mathbf{m}} = \mathbf{m}/(1-\beta_1^t)$$
$$\hat{\mathbf{v}} = \mathbf{v}/(1-\beta_2^t)$$
$$\mathbf{r} = \hat{\mathbf{m}}/(\sqrt{\hat{\mathbf{v}}} + \epsilon)$$
$$\mathbf{w} := \mathbf{w} - \alpha\,\mathbf{r},$$
where m and v are the first and second moment estimates; m̂ and v̂ are the bias-corrected first and second moment estimates, respectively; g² is the element-wise square of g; vector division is element-wise; m and v are both initialized to 0; β₁ ∈ [0, 1), β₂ ∈ [0, 1), and ε > 0 are usually set to β₁ = 0.9, β₂ = 0.999, and ε = 0.001; the exponent term t is the training iteration; and α is the LR, which requires some tuning.
Intuitively, a small variance in the gradients means the gradients are pointing in similar
directions, which increases the confidence that the direction is right. Therefore, a larger step in
that direction is taken using a larger LR. The opposite happens with a large variance: a small
step is taken.
When switching from SGD to Adam, the regularization hyperparameter needs to be adjusted since Adam requires more regularization [LH19]. While the original paper used ε = 10⁻⁸, we recommend ε = 10⁻³ to prevent a huge step size when v̂ is minuscule, which often happens toward the end of training [KB17].
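A minimal NumPy sketch of the Adam update above, using the ε = 10⁻³ value recommended in the text:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-3):
    m = beta1 * m + (1.0 - beta1) * grad            # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second moment (element-wise square)
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)     # per-weight adaptive step
    return w, m, v

w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 4):                               # t starts at 1 for the bias correction
    w, m, v = adam_step(w, grad=np.array([0.3, -0.1]), m=m, v=v, t=t)
```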
Adam is widely adopted in the industry, in particular, for NLP models, and empirically
works well across multiple tasks despite not converging to the optimal solution in simpler convex
optimization tasks [RKK19]. When the LR is well tuned, SGDM continues to perform as well as or better than newer techniques across various tasks. SGDM often converges and generalizes better, albeit with longer training time, than Adam [WRS+18, KS17]. Some practitioners
begin training with Adam due to the convergence speed and finish with SGDM due to the
convergence quality.
Rectified Adam (RAdam) is a simple adaptation to Adam that switches between Adam
and SGDM [LJH+19]. RAdam dynamically turns on or off the adaptive LR depending on the
variance confidence. Thus, Adam’s possible initial training instability due to the limited data
points used to compute the variance is mitigated with this on/off adaptive LR. RAdam uses a
rectified adaptive LR as it gains confidence about the variance; otherwise, it falls back to SGDM.
All the above optimizers share a common challenge that LARS and LAMB address. To maintain stability, weights with a small magnitude should have a small weight update magnitude, and vice versa. However, every layer in a model often has a vastly different ‖w^(l)‖/‖g^(l)‖ ratio. A
small ratio can lead to training instability (divergence), and a large ratio can lead to slow learn-
ing. LARS and LAMB improve training stability by normalizing the step size in each layer.
This additional stability allows training with large-batches (up to some size determined experi-
mentally).
Layer-wise Adaptive Rate Scaling (LARS) uses a local LR α^(l) proportional to the ratio of the magnitude of the weights to the magnitude of the gradients [YGG17]. LARS is applied to SGD as follows:

$$\alpha^{(l)} = \frac{\|\mathbf{w}^{(l)}\|}{\|\mathbf{g}^{(l)}\|}$$
$$\mathbf{w}^{(l)} := \mathbf{w}^{(l)} - \alpha_0\,\alpha^{(l)}\,\mathbf{g}^{(l)}.$$

LAMB applies the same layer-wise scaling to the Adam update direction r^(l) (defined above):

$$\alpha^{(l)} = \frac{\|\mathbf{w}^{(l)}\|}{\|\mathbf{r}^{(l)}\|}$$
$$\mathbf{w}^{(l)} := \mathbf{w}^{(l)} - \alpha_0\,\alpha^{(l)}\,\mathbf{r}^{(l)}.$$
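A minimal NumPy sketch of a LARS-style update for a single layer (the small constant added to the denominator is an assumption for numerical safety, not part of the equations above):

```python
import numpy as np

def lars_step(w_layer, g_layer, alpha0=0.01, eps=1e-9):
    local_lr = np.linalg.norm(w_layer) / (np.linalg.norm(g_layer) + eps)
    return w_layer - alpha0 * local_lr * g_layer    # step size normalized per layer

w = np.array([0.5, -1.5, 2.0])
w = lars_step(w, g_layer=np.array([0.01, 0.02, -0.01]))
```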
Other influential optimizers are AdaGrad (in particular, for sparse data), RMSProp,
AdaDelta, Nadam, Nesterov accelerated gradient (NAG), AdamW, AMSGrad, and Novo-
Grad [DHS11, HSS12, Zei12, Doz16, BLB17, LH19, RKK19, GCH+20]. Figure 4.9 shows
an estimated pedigree of optimizers. These are first-order optimizers. AdaHessian is a second-order optimizer that converges to a better minimum than first-order optimizers without the prohibitive computational cost of other second-order optimizers [YGS+20]. Given the promis-
ing results, AdaHessian adoption may grow.
Stochastic weight averaging (SWA) and LookAhead (LA) are complementary techniques
that improve generalization by converging to a better (flatter) minimum [IPG+19, ZLH+19].
The motivation for SWA is that during the later training iterations, SGD bounces between the
borders of a wider minimum. The average of the bounces is a better solution. SWA maintains
a separate set of averaged weights wSWA in addition to the regular set of weights w used by the
optimizer. wSWA is initialized with w after completing at least 75% of the training iterations.
Then, after completing several iterations, wSWA is updated as follows:
$$\mathbf{w}_{\text{SWA}} := \frac{\mathbf{w}_{\text{SWA}} \cdot n_{\text{cycle}} + \mathbf{w}}{n_{\text{cycle}} + 1},$$
where n_cycle is the number of completed cycles after initializing w_SWA, and w is the model learned
by the optimizer. One cycle consists of multiple iterations, typically one epoch, but this can vary
depending on the dataset’s size.
For training, SWA requires sizeof(w_SWA) additional memory, which is relatively small compared to the activations, and requires negligible additional computation to update. No additional memory or computation is required for serving.
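A minimal NumPy sketch of the SWA running average (the weights and cycle boundaries are illustrative):

```python
import numpy as np

def swa_update(w_swa, w, n_cycle):
    return (w_swa * n_cycle + w) / (n_cycle + 1)     # running average over cycles

w_swa = np.array([1.0, 2.0])                         # initialized with w late in training
for n_cycle, w in enumerate([np.array([1.2, 1.8]), np.array([0.9, 2.1])], start=1):
    w_swa = swa_update(w_swa, w, n_cycle)
```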
LookAhead (LA) follows a similar approach to SWA [ZLH+19]. The primary difference
is that the optimizer updates its weights to wLA after some iterations: w WD wLA . That is, the
moving average wLA changes the optimization trajectory.
Ranger is a combination of RAdam and LA, and RangerLARS applies LARS techniques
to Ranger [Wri19]. We recommend using Ranger as the go-to optimizer and RangerLARS
when using large batches.
4.4 BACKPROPAGATION
The rediscovery of the backpropagation algorithm in the 1980s facilitated multilayer NN train-
ing. Backpropagation provides an efficient way to compute the gradients, which are then used
by the optimization algorithm. This section introduces some of the mathematics behind back-
propagation to demystify the learning process; for a reader who may not be interested in all these
details, the main takeaway is that backpropagation boils down to multiplications and additions.
The cross-entropy cost function, also known as the log-cost or logistic cost, is as follows:
$$J(\mathbf{w}) = -\sum_{n=0}^{N-1}\sum_{k=0}^{K-1} y_k^{[n]} \log \hat{y}_k^{[n]},$$
where N is the number of samples in a training batch, y_k^[n] ∈ {0, 1} is 1 if sample n belongs to class k and 0 otherwise, and ŷ_k^[n] is the model's prediction (as a probability) that sample n belongs to class k. The intuition is that when the model predicts a low probability for the correct class, the cost for that sample is high, and vice versa. When y_k^[n] = 1, the loss approaches infinity as ŷ_k^[n] approaches zero. Note that in practice, the cost function includes a weight decay penalty
(shown here but often omitted to simplify the notation):
$$J(\mathbf{w}) = -\sum_{n=0}^{N-1}\sum_{k=0}^{K-1} y_k^{[n]} \log \hat{y}_k^{[n]} + \frac{\lambda}{2}\sum_{l=0}^{L-2}\sum_{j=1}^{D^{(l+1)}}\sum_{i=1}^{D^{(l)}} \left(w_{ji}^{(l)}\right)^2,$$
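A minimal NumPy sketch of this cost for a batch of one-hot labels and predicted probabilities; the small constant inside the logarithm is an assumption to avoid log(0):

```python
import numpy as np

def cross_entropy_cost(y_true, y_pred, weights=(), weight_decay=0.0):
    """y_true, y_pred: (N, K) one-hot labels and predicted probabilities."""
    ce = -np.sum(y_true * np.log(y_pred + 1e-12))
    l2 = 0.5 * weight_decay * sum(np.sum(W ** 2) for W in weights)   # optional penalty
    return ce + l2

y_true = np.array([[0, 1, 0], [1, 0, 0]])
y_pred = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
print(cross_entropy_cost(y_true, y_pred))
```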
Figure 4.10: Using the chain rule to compute the partial derivative of the cost with respect to a
weight in the model. For simplicity, the bias is omitted from the figure.
• Both the source and destination models should share the lower and middle layers; only
the upper layers are replaced or reinitialized.
• The number of upper layers to replace or reinitialize depends on two factors:
1. the similarities between the source task and the destination task; the more similar the tasks, the fewer layers should be reinitialized; and
2. the difference between the size of the source and destination dataset; the smaller
the difference, the more layers should be replaced or reinitialized.
• Fine-tuning works best when the source dataset is much larger than the destination
dataset; if the destination dataset is the same size or bigger, training a new model for
the destination task is a better approach.
• The initial LR used to fine-tune the pretrained layers should be 10–100× smaller than the initial LR used to train the original model. A regular LR should be used for the replaced or reinitialized layers.
• The same data preprocessing techniques on the original larger dataset should be applied
to the datasets used for fine-tuning and validation.
Figure 4.11: High-level guidance on when and what to fine-tune. When the new task’s dataset is
similar to the original dataset, only the last upper layers should be retrained. When the datasets
are different, then training more layers is required. If the new task’s dataset is sufficiently large,
then it is best to retrain the entire model.
As a simple example, the following steps can be used to design and train a cats vs. dogs
classifier (in practice, more recent models have better statistical performance):
1. Replace the last layer of a pretrained VGG16 model, changing it from 4096 × 1000 to 4096 × 2, as shown in Figure 4.12, since the source dataset has 1000 classes but this task only has 2.
2. Initialize the last layer and use the pretrained weights for the remaining layers.
3. Either freeze all the layers except the last one or reduce their LR by 100×.
4. Train the topology with the target dataset (note that a modern laptop has sufficient com-
putational capacity for this task).
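A minimal PyTorch sketch of these steps, assuming torchvision's pretrained VGG16 (the optimizer settings are illustrative; depending on the torchvision version, the pretrained weights are requested with either `pretrained=True` or the newer `weights` argument):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)        # source model trained on ImageNet-1K

# Steps 1-2: replace the last 4096x1000 layer with a freshly initialized 4096x2 layer.
model.classifier[6] = nn.Linear(4096, 2)

# Step 3: freeze every pretrained layer; only the new layer is trained.
for name, param in model.named_parameters():
    param.requires_grad = "classifier.6" in name

# Step 4: train the new layer on the cats-vs-dogs dataset.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
```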
Fine-tuning is also commonly used after making some modifications to the model, such
as after pruning or quantizing the weights (discussed in Chapter 6). There are other types of
transfer learning techniques, such as domain adaptation, {zero, one, few}-shot learning, and
multitask learning [PY10, KL19, WYK+19, Rud17]. These techniques have limited industry
adoption.
Figure 4.12: Fine-tuning the VGG-16 model for the task of dogs vs. cats classification initially
trained on the ImageNet-1K dataset.
models with batch normalization layers. A solution is to replace batch normalization with the group normalization technique and use a micro-batch.
The next best technique is gradient checkpointing, introduced in 2000 and recently gaining traction in academia and some adoption in the industry after the technique resurfaced in 2016 [GW00, CXZ+16]. Gradient checkpointing reduces memory requirements at the expense of
additional computations. Rather than storing the activations across all the layers, only the acti-
vations of some layers are stored. For instance, a model with 100 layers can have the activations
saved every 10 layers. These layers are known as checkpoints, and the group of layers between
checkpoints is a segment. During the backpropagation, the activations are recomputed for a
particular segment. The process of recomputing them is called rematerialization. The activations
in memory at a given time are (1) the checkpoint activations and (2) the activations for one
segment. In the example with 100 layers and 10 checkpoints, only 20% of all the activations
are stored at any one time. The computation cost is an extra forward propagation. In a GPU
or accelerator with high compute capacity and limited memory, this additional compute may
require less time and power than storing and fetching the activations from the host.
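PyTorch exposes this technique through torch.utils.checkpoint. The sketch below uses a toy 100-layer stack with illustrative sizes; it stores only the segment-boundary activations and rematerializes the rest during backpropagation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                         for _ in range(100)])          # a deep 100-layer stack
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(layers, 10, x)   # 10 segments of 10 layers each
out.sum().backward()                         # activations inside each segment are
                                             # recomputed (rematerialized) here
```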
In practice, uniformly dividing the checkpoints is not a good practice. The total size of the
activations and the computational cost of the forward propagation in each segment can signifi-
cantly vary. Furthermore, checkpoints within skip connections should be avoided. Selecting an
optimal number of checkpoint layers that evenly divides the total size of the activations across
segments is an NP-complete problem. Jain et al. introduced Checkmate, a system that finds
checkpoints for particular hardware targets. Checkmate uses an off-the-shelf mixed-integer lin-
ear program solver coupled with a hardware cost model to find suitable checkpoints [JJN+19].
Another technique is to store the activations as 16 bits (as opposed to 32 bits). This reduces
the memory and bandwidth usage by up to a factor of 2. NNs are robust to noise, and comput-
ing the gradients using activations with half the bits typically does not impact the statistical
performance. A related technique is to store compressed activations [JJN+19].
A final technique is deep equilibrium (DEQ), where the depth of the model can vary
while keeping the required memory constant. The memory is equivalent to a single layer’s acti-
vation [BKK19]. DEQ reduces the memory requirements at the expense of additional compu-
tations. This technique does not yet have adoption in industry.
In this chapter, we described how to train a model that generalizes and avoids underfitting and
overfitting. We explained how to initialize the weights in different layers. We detailed SGD and reviewed various variants. We recommend using Ranger for small to medium batches and
RangerLARS for large batches or, for someone new to training, Adam is well documented
and simple to get started. We noted that while operating on large batches can result in higher
hardware utilization, small batches may generalize better, and we provided guidance on selecting
a batch size. We decomposed the backpropagation algorithm into a series of multiplications and additions, which motivates the need for specialized matrix multipliers in hardware. We provided
guidelines to topology design and recommended hyperparameters that data scientists should use
in the design and debug stage. We explained how to mitigate memory capacity bottlenecks in
the training phase at the expense of added compute. For companies with smaller datasets, we
recommended modifying an existing model and fine-tuning it for a particular task. In the next
chapter, we explore how to accelerate the training by distributing the computations and memory
requirements across various compute nodes.
CHAPTER 5
Distributed Training
The number of computations required to train state-of-the-art models is growing exponentially, doubling every 3.4 months (a far shorter doubling period than the 1.5–2 years of Moore's Law in its glory days) [DH18].
Training a large model can have two primary challenges: (1) the memory required exceeds avail-
ability and (2) the time-to-train on a single node can be prohibitively long. To illustrate, train-
ing production models commonly used at Google would require 2–16 months on one dedicated
DL processor (TPU v2) [JYK+20]. Distributing the computations or the memory requirements
among multiple nodes alleviates these challenges and is becoming the norm to train large-scale
production models. Hardware designers at Intel, Nvidia, AMD, Google, Graphcore, Cerebras
Systems, and others, detailed in Section 7.7, have or are developing dedicated, scalable, multin-
ode training platforms.
Training the popular ResNet-50 model commonly used for image classification requires about 10¹⁸ (1 exa) operations, which is considered small by today's standards; it can be trained in under 2 hours with 8 V100 GPUs and in 75 seconds with 2048 V100 GPUs [YZH+18, Nvi20c, YKT+18]. Training the larger 8.3 billion parameter Megatron-LM model requires 12 × 10²¹ (12 zetta) operations and can take several days on hundreds of compute nodes [SPP+19]. Training the prodigious 600 billion parameter GShard model takes 4 days on 2048 TPU v3 accelerators [LLX+20].
The main techniques to distribute a training workload across multiple nodes are data par-
allelism and model parallelism (including pipeline parallelism), illustrated in Figure 5.1, and a
hybrid of these. Also, federated learning is a form of data parallelism distributed training in
edge (client/IoT) devices. Data and model parallelism benefit from high bandwidth intercon-
nects between the nodes. In data parallelism, a batch (called the global-batch in this chapter)
is split among the worker nodes and called the node-batch, with each node working on the
same model. The nodes communicate the weight updates. In model parallelism, the model is
split among the worker nodes, and the nodes communicate the activations. Model parallelism
is typically used when the memory requirement exceeds the node’s memory. In hybrid paral-
lelism, data parallelism is used across groups of nodes (super-nodes), and model parallelism is
used within each super-node.
Data parallelism is more commonly used in industry, but as the sizes of the models are
growing, hybrid parallelism is becoming the norm for state-of-the-art models. In the remain-
der of this chapter, we describe data and model parallelism, their typical usages in data center
training, and their limitations. We also discuss federated learning, and we review various com-
munication primitives.
Figure 5.1: (a) In model parallelism the model is distributed among multiple compute nodes. (b)
In data parallelism, the training dataset is split among multiple compute nodes and each node
has the entire model.
You et al. achieved extensive scaling using some of these techniques. They partitioned
a 32K batch size using 1K CPU nodes and achieved the fastest ResNet-50 TTT at the
time [YZH+18]. Similarly, You et al. achieved record scaling and TTT on the BERT model on
a TPUv3 Pod (1024 chips) [YLR+20]. Today, ResNet-50 can be trained in several days on one V100 GPU node, or in about 2 minutes (roughly 1 epoch per second on ImageNet-1K) using 3456 V100 GPU nodes or a TPUv3 Pod, with no accuracy drop [MSU+19, YKC+18].
Figure 5.2: Federated learning. (a) An untrained global model (represented by the green dot) is
broadcasted to the client devices. (b) The training happens in each client device, and only the
updated client model (represented by the various geometric figures) is transmitted to the cloud
to update the global model. (c) The updated global model (represented by the gray pentagon) is
broadcasted to all the client devices, and the process repeats. Based on [Sas19].
care patient data, companies’ emails, and manufacturing equipment data. Some organizations
(e.g., a hospital) can be thought of as a device client among a group (e.g., a group of hospitals)
in federated learning.
Federated learning is a generalized form of Sync SGD but, rather than synchronizing af-
ter every iteration, the weights are synchronized after some local epochs. The more infrequent
the synchronizations, the more likely the model has convergence challenges. However, frequent synchronizations consume significant network bandwidth, which is prohibitive for some devices.
The primary challenge of federated learning is to reduce the synchronization frequency (by in-
creasing the number of local epochs) and maintain the expected training convergence.
Two additional challenges can affect convergence. First, the data in each device is typically
not independent and identically distributed (IID); data within a client is more similar than data
across clients, and the number of samples between clients varies. This non-IID violates the
guidance to randomize the order of the samples in the training dataset so each batch has IID
samples.
Second, the local devices have heterogeneity both in computational capacity and network
reliability across devices. In particular, mobile phones vary significantly in memory, compute, and
network connectivity with approximately two-thirds of operating mobile phones in the world
being over six years old [Haz18].
The server uses an average of the local models weighted by the number of training samples
in each device to compute a global model update. Alternatively, a more stable approach is to
randomly choose the clients (assuming a large pool of candidates) with probability proportional
to the number of training samples in each device, and use an unweighted average to compute
the global model update [LSZ+19].
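A minimal NumPy sketch of the weighted averaging step (the client weight vectors and sample counts are made up):

```python
import numpy as np

def federated_average(client_models, client_num_samples):
    total = sum(client_num_samples)
    return sum(n / total * w for w, n in zip(client_models, client_num_samples))

clients = [np.array([1.0, 2.0]), np.array([0.8, 2.4]), np.array([1.2, 1.6])]
samples = [100, 400, 50]
global_model = federated_average(clients, samples)
```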
A federated learning system uses more clients than needed to train local models to mitigate
device and network unreliability. A system may assign 300 devices to train local models but only
needs to collect local models from 256 devices. Assuming each device uses a local batch size of 16, the global batch size is 256 × 16 = 4096, which may be the limit (the largest batch size that converges to an adequate minimum) for some topologies.
A simple technique to improve robustness to both non-IID batches and local models that
are unable to complete the local number of epochs is to use a proximal term. This term is a small
adaptable penalty in the objective function for significant deviations from the global model.
Note that it is better to communicate a local model that has not completed the requested epochs
than to ignore it [LSZ+19].
Communication overhead can be reduced by quantizing with rotations and communicat-
ing the weight changes [KMY+17]. A randomly applied mask can further reduce the number
of communicated parameters. Traditional data compression techniques can also be used. These
techniques also apply to conventional Sync SGD data parallelism to decrease network traffic
but are more critical in federated learning due to the higher communication cost. Optimization
techniques, such as LAMB and RangerLARS, used in data centers, can be applied to feder-
ated learning to increase the number of client devices and accelerate training. Also, TensorFlow
provides an API to simulate federated learning with a couple of additional lines of code.
Areas of Caution
Three areas of caution are as follows:
1. Training and communicating a model can be expensive (in terms of battery and data con-
sumption). These expenses are mitigated by limiting training to periods when the device is plugged in and idle, and by communicating the local model when the device is on a free wireless connection.
2. Despite not transmitting the training data, some information about the local training data
can be extracted from local models [HAP17]. To preserve privacy, for instance, Google
uses secure aggregation where the local models are only unencrypted and averaged when
multiple models become available to the server [BIK+17]. OpenMined developed PySyft
on top of PyTorch to improve privacy. Section 10.3 discusses other ongoing work to main-
tain privacy.
3. Older devices with limited computational and memory capacities, and devices in remote
areas may not proportionally contribute to the overall training. This imbalance results in
a model that learns characteristics biased toward more affluent populations. Further work
is required to mitigate this.
5.4 COLLECTIVE COMMUNICATION PRIMITIVES
There are various communication functions, known as collective communication primitives, and
library implementations. These primitives are used in data parallelism to communicate and then
aggregate the local gradients, in model parallelism to communicate the activations and their
respective gradients, and in transitioning between model and data parallelism to rearrange the
data properly. Some common collective communication primitives are as follows:
• Broadcast: M elements in the root node are copied to the other P − 1 processor nodes, as shown in Figure 5.3a.
• Scatter: M elements in the root node are partitioned, and each partition with M/(P − 1) elements is copied to a different processor node, as shown in Figure 5.3b.
• Reduce: the root node receives M elements from each of the other P − 1 processor nodes and performs a reduction operation, such as sum, maximum, minimum, mean, or product, across each of the P − 1 elements.
• Gather: the root node receives M/(P − 1) elements from each of the other P − 1 processor nodes and concatenates them (equivalent to Figure 5.3b with the arrows reversed).
• AllToAll: M elements in each node are partitioned, and each partition with M/(P − 1) elements is copied to a different processor node, where the received partitions are concatenated. The end result is equivalent to a Scatter and Gather for all nodes, as shown in Figure 5.3c.
The AllReduce, AllToAll, and AllGather primitives do not require a dedicated root node. While
their end-result is equivalent to sequentially using two simpler primitives, they typically use more
efficient implementations. Later in this section, we analyze various AllReduce implementations.
The MPICH, OpenMPI, Intel MPI, and MVAPICH libraries implement primitives us-
ing the Message Passing Interface (MPI) standard specifications. The MPI is a library specifica-
tion that operates at the transport layer implemented by MPICH and other libraries in C/C++
and Fortran with message-passing standards and APIs. In the MPI specification, each processor
node has a unique address space. The literature on collective communication primitives is ex-
tensive, including their optimizations for clusters connected by switched networks and a study
of MPI usages [TRG05, LMM+19].
Figure 5.3: (a) The broadcast primitive copies a set of elements in the root node to the other
nodes. (b) The scatter primitive copies a separate partition of a set of elements in the root node
to the other nodes. Note that reversing the arrows results in the gather primitive. (c) The all-to-
all primitive (also known as transpose) copies a separate partition of a set of elements in each
node to the other nodes, where the received partitions are concatenated.
Libraries that offer higher-level communication functions, either built on existing primitive libraries or reimplementing the primitives, include Horovod, Nvidia's NCCL, Facebook's Gloo, Intel's oneCCL, and SparCML and Blink from academia [SDB18, RAA+19, WVP+19]. Horovod
has broad industry adoption for GPU and CPU distributed training. It is supported by various
DL libraries, including TensorFlow, PyTorch, and MXNet. Horovod uses NCCL for GPUs
and oneCCL, MPI, and Gloo for CPUs. Uber developed and contributed Horovod to the LF
AI foundation.
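A minimal sketch of the common Horovod pattern for data-parallel training in PyTorch follows; the build_model function, learning rate, and optimizer choice are placeholders, not part of Horovod itself.

import horovod.torch as hvd
import torch

hvd.init()                                  # one process (rank) per GPU/node
torch.cuda.set_device(hvd.local_rank())     # pin each rank to one local GPU

model = build_model().cuda()                # placeholder model constructor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are AllReduce-averaged across ranks.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all ranks from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)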
The most common primitives used in distributed training are (in this order) AllReduce,
AllToAll, and AllGather. AllReduce is used to aggregate the local gradients in data parallelism.
AllToAll is used to exchange the activations and activation gradients in model parallelism and
to transition from model to data parallelism. AllGather is used to concatenate activations or
gradients in a specified order, for instance, in GShard to change a sharded (broken) tensor to a
replicated tensor [LLX+20].
In Sync SGD data parallelism, the end result of AllReduce is for all the nodes to receive
the aggregated sum of all the local weight gradients; that is, the reduction happens across the
nodes. For instance, during the backpropagation of a typical convolution layer with a 4D weight
gradient tensor (number of kernels, number of channels, kernel height, and kernel width), the
AllReduce primitive aggregates the 4D tensors across all the nodes and broadcasts the sum. In
Sync SGD, AllReduce is necessary to ensure the weights across all the nodes are the same at the
end of each training iteration. AllReduce algorithms differ in the specific mechanism to achieve
this Reduce+Broadcast, but the results are the same.
In the following analysis, we examine four AllReduce algorithms based on the number of
nodes, latency, and bandwidth: parameter server (PS), AllReduce-Ring, AllReduce-Butterfly,
and AllReduce-Tree, shown in Figure 5.4. We assume there are P nodes connected in a 1-hop
all-to-all (fully connected) physical network: each node-to-node link has the same latency L
independent of how many nodes are communicating. We also assume the links are bidirectional
with a per directional bandwidth of B between any two nodes, and the nodes can simultaneously
send and receive messages without affecting the unidirectional performance. The terms node and processor are used interchangeably, and rank refers to the node ID from 0 to P − 1. Note that
the physical network topology impacts which algorithm is optimal. For instance, running an
AllReduce-Ring algorithm on a system with a ring physical network topology is much better
than an AllReduce-Butterfly algorithm on the same ring physical topology since the load would
not be balanced between links. Section 7.5 discusses the physical interconnects and physical
network topologies.
Note the difference between network latency and bandwidth. The latency L is the time to
communicate one byte from one node to another. The bandwidth B is the number of bytes that
can move through the network per second (the width of the network pipeline) per direction.
The total execution time T to transfer a message of M bytes from one node to another node is:
T = L + M/B = L + T′,
where T′ = M/B is the time it takes to move the data without accounting for the latency. The
above equation ignores the software overhead and the time to aggregate (sum) the M elements
by the receiver node.
PS performs a reduce-sum and then a broadcast operation, which requires two steps. The total execution time is:
T_PS = 2 × (L + T′).
The data moves in one direction, and most of the links in the fully connected physical network are unused.
The AllReduce-Ring requires two steps in a 1-hop all-to-all physical network. In step 1, each node breaks down the message into P smaller packages and sends a message of size M/P to each of the other P − 1 nodes, and the receiver nodes aggregate the messages. In step 2, each node broadcasts the aggregated message of size M/P to each of the other P − 1 nodes. The total execution time is:
T_ring = 2 × (L + T′/P).
The data moves bidirectionally using all the links in the fully connected physical network.
The AllReduce-Tree performs a reduction and a broadcast operation, both in a tree pattern, which requires 2⌊log(P)⌋ steps (log is base 2). The total execution time is:
T_tree = 2⌊log(P)⌋ × (L + T′).
Using two trees simultaneously in each link reduces the time, with each tree working on half the data; each package is of size M/2. A similar approach is the Two-Tree algorithm, also known as Double Binary Tree [SST09]. The total execution time using bidirectional links is:
T_two-tree = 2⌊log(P)⌋ × (L + T′/2).
Most of the links in the fully connected physical network are unused.
The AllReduce-Butterfly requires log(P) steps. For simplicity, we assume P is a power of 2. During each step, a package is exchanged with a neighbor in a butterfly pattern. More precisely, at step s ∈ [0, log(P) − 1], node p ∈ [0, P − 1] sends and receives a package of size M/P to node p + 2^(s+1) < P. The total execution time using bidirectional links is:
T_bf = log(P) × (L + T′).
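The four cost models can be compared numerically. The following sketch evaluates them for illustrative (not measured) values of P, M, L, and B:

import math

def allreduce_times(P, M, L, B):
    """Compare the four AllReduce cost models above.
    P: nodes, M: message bytes, L: link latency (s), B: per-direction bandwidth (B/s)."""
    t_prime = M / B
    return {
        "parameter_server": 2 * (L + t_prime),
        "ring":             2 * (L + t_prime / P),
        "tree":             2 * math.floor(math.log2(P)) * (L + t_prime),
        "butterfly":        math.log2(P) * (L + t_prime),
    }

# Illustrative values: 16 nodes, 100 MB of gradients, 5 us latency, 25 GB/s links.
print(allreduce_times(P=16, M=100e6, L=5e-6, B=25e9))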
The analysis shows that for a homogeneous all-to-all physical topology, the AllReduce-Ring has the lowest execution time when P > 2. This homogeneity is typical for 1-hop connections,
where two nodes only go through one network switch to communicate, such as a rack of CPUs or
a DGX-2 system. Most CPU rack designs rely on the top-of-rack (ToR) switch even for intra-
chassis CPU message passing. For chassis with internal switches, the analysis above only applies
to CPUs within the chassis. In a DGX-2 system, GPU nodes have 300 GB/s bidirectional
NVLink links (150 GB/s in each direction). Note that the GPU nodes in a DGX-2 system also have an additional, albeit smaller, 32 GB/s bidirectional link through PCIe.
Large-scale distributed training across nodes that require multiple hops usually involves
multiple communication primitives. Otherwise, the largest latency and smallest bandwidth link
would determine the primitive’s latency and bandwidth. A common approach to scale, for in-
stance, across multiple DGX-2 systems is to use AllReduce-Ring within each DGX-2, then
AllReduce-Ring across the DGX-2 systems, and then broadcast within each DGX-2. A similar
approach can be employed with racks of CPU servers.
Wang et al. developed a collective communication library known as Blink that efficiently uses heterogeneous links [WVP+19]. Blink uses a collection of spanning trees to find various paths to pass messages in parallel and has been shown to outperform other libraries in the presence of heterogeneous network links.
In this chapter, we addressed three challenges to training some models: the required memory ex-
ceeds availability, the time-to-train is prohibitively long, and the training data is scattered across
multiple edge devices. We detailed data and model parallelism. Data parallelism is more com-
monly used in industry and is supported by the major frameworks. However, some impediments
include memory constraints for prodigious models, high communication latency for large models, the large global batch required to scale, and small node-batch inefficiencies. Model parallelism can be used for large models, but the scaling is usually limited to eight nodes, and the optimal way to split the model is an NP-complete problem. There is limited support in the major frameworks for efficient model parallelism. Pipeline parallelism suffers from stale weights, and we discussed
some work to partially mitigate this. Hybrid parallelism is becoming the norm for state-of-the-
art models. Data parallelism is used across groups of super-nodes, and model parallelism is used
within each super-node with 4–8 nodes per super-node. In the next chapter, we explore the var-
ious formats to represent numerical values used in production and those in academic exploration
as well as compression techniques to reduce the memory footprint of models.
CHAPTER 6
Reducing the Model Size
Figure 6.1: Distributions of the ResNet-110 weights, activations, and weight updates at two
separate training epochs using the CIFAR dataset. Adapted from [KWW+17] with the authors’
permission.
Figure 6.1 shows the distributions of the log-base-2 absolute values from ResNet-110 tensors across two separate training epochs and illustrates the larger range of the weight update values.
An active research area is to develop numerical representations that better represent the
values with 8 bits and 4 bits and are simple to implement in silicon. Using a smaller numerical
representation can improve training and inference even if the hardware does not support higher
peak operations per cycle at the smaller representation because the memory bandwidth savings
accelerate memory bandwidth bound layers, which are common.
Models are typically overparameterized, which facilitates training and provides opportunities to reduce the model size post-training. Trained models typically have many small weights. Forcing them to zero can have computational advantages with minimal to no statistical impact. This process is called pruning and results in a sparse model. There are two types of model sparsity, discussed in Section 6.3: structured and unstructured.
A key benefit of sparse models is improved compression. Compression reduces the mem-
ory footprint and memory bandwidth consumption at the expense of some additional compu-
tations for decompression. The time for this additional decompression is usually less than the
additional time to transmit the uncompressed data; therefore, compression is advantageous.
A small model can be trained to produce the output of a large trained model. The knowl-
edge of the larger trained model (the teacher model) is distilled to the smaller model (the student
model). This method is known as knowledge distillation.
In Section 6.1, we review the various 16-bit and 8-bit numerical formats adopted in pro-
duction, as well as other promising formats. In Section 6.2, we discuss techniques to quantize
a model from fp32 to int8. In Section 6.3, we review pruning and compression techniques. In
Section 6.4, we explain knowledge distillation in more detail.
• training production models in data centers: fp32, bf16, fp16, and fp19; limited fp8;
• serving production models in data centers: fp16, bf16, and fp8; some int8; extremely limited int4; and
Table 6.1: A comparison of different numerical formats. The maximum numerical error of a given floating-point representation is the floating-point number multiplied by the Maximum Error, 1/2^(n_M+1), where n_M is the number of mantissa bits.
• MAC operators with 16-bit operands accumulated to fp32, and the accumulation is converted to 16-bit after totaling the running sum (note that the hardware logic may accumulate to registers with fewer bits, such as (1, 8, 21), to reduce cost);
• a copy of fp32 weights used for the weight update (the updates use 16-bit gradients); and
• a copy of the updated weights converted to 16-bit for the next iteration.
The first three bullets also apply to inference with a 16-bit or 8-bit format. In both cases, accumulation to a larger numerical format is recommended to avoid numerical overflow (notation: MAC source → MAC destination): {fp16, bf16} → fp32, and int8 → s32 (signed int32).
Floating-point 16-bit bfloat (bf16) was introduced by Google as the brain floating-point format. Models are robust to additive noise and, in fact, it is common practice to add noise when training a model in the form of weight decay regularization, as discussed in Section 4.1. Reducing the mantissa bits from 23 in fp32 to 7 in bf16 can be interpreted as injecting noise into the model. bf16 maintains the same range as fp32 and is particularly useful to support the range in the gradients. Experiments demonstrate that models trained with bf16 have virtually the same accuracy as those trained with fp32 with the same number of iterations, without changing any hyperparameter, and without scaling the objective function cost [KMM+19]. However, there may be outlier models where these observations are not valid. Also, when the number of classes is greater than 2^(n_M) or 127, fp32 should be used for the cost function. Moreover, while softmax alone can use bf16, various implementations combine the softmax function and the cost function; those implementations should use fp32.
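As an illustration, the following sketch uses PyTorch's autocast to run the matrix multiplications of a toy layer in bf16 while the optimizer keeps fp32 weights; the layer, batch size, and learning rate are arbitrary, and the exact operators covered by autocast depend on the hardware and library version.

import torch

model = torch.nn.Linear(1024, 1024)          # toy layer standing in for a model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 1024)
target = torch.randn(32, 1024)

# Matmuls inside the autocast region run in bfloat16; no loss scaling is needed.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()      # gradients are computed with respect to the fp32 parameters
optimizer.step()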
While bf16 was primarily designed for training (the large exponent range helps represent the gradients), it is also used for inference with similar computational gains over fp32. Google TPU v2–4, the Habana Gaudi AI processor, the 3rd-generation Intel Xeon Scalable processor (codename Cooper Lake), the Arm-based Neoverse N2 "Zeus" CPU, and the Nvidia A100 GPU have bf16 multipliers.
Floating-point 16-bit half-precision (fp16) is used for inference and training, the latter often requiring a technique known as loss scaling. During training, particularly during the early stages, the magnitude of many activation gradients often falls below the supported range of fp16 and gets truncated to zero, while the upper range of fp16 is unutilized. Scaling the loss (more precisely, the cost or objective function) mitigates this inability to represent very small values and enables the use of the higher range. Specifically, the cost is scaled by a value greater than 1 without overflowing the activation gradients past the upper fp16 range, and the weight gradients are then unscaled by the same factor before the weight update. In addition, normalizing the 0–255 RGB input image values to 0–1 and adding batch normalization to the activations reduces overflow risks [Wu19]. Nvidia GPUs, AMD Radeon GPUs, Huawei Atlas and Ascend processors, and Graphcore Colossus have fp16 multipliers.
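The following sketch shows static loss scaling in PyTorch with an illustrative scale factor of 1024; it assumes a GPU with fp16 support, and production libraries typically automate a dynamic version of this with a gradient scaler.

import torch

scale = 1024.0                                    # illustrative static loss scale
model = torch.nn.Linear(512, 512).cuda().half()   # fp16 weights for the sketch only;
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # production keeps fp32 master weights

x = torch.randn(32, 512, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()             # toy objective in fp32

(loss * scale).backward()                         # scaled backward pass keeps small gradients representable
for p in model.parameters():
    p.grad /= scale                               # unscale before the weight update
optimizer.step()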
The primary advantage of bf16 over fp16 is avoiding the need to implement loss-scaling,
which requires empirical tuning. This advantage is particularly significant for models requiring
dynamic loss scaling (and dynamic tuning) such as GNMT and Transformer, given the large
variations in gradient distribution throughout training, which increases the software complex-
ity [MSD+19]. Some tools, such as OpenSeq2Seq, can automate dynamic loss scaling for some
models [KGG+18].
A disadvantage of bf16 over fp16 is the 3 fewer mantissa bits; there may be some precision-sensitive workloads that benefit from those bits. The upper range values of bf16 are not used, calling into question the need for 8 exponent bits for most training workloads. Facebook, for instance, uses fp16 (rather than bf16) to store the embedding layers (not for MAC operators) in DLRM training (the MAC operators of the embedding layers happen in fp32) [ZYY18]. In
designing a training processor, it is recommended to support both fp16 and bf16 (using a 19-bit (1, 8, 10) floating-point unit) to facilitate transitioning from existing hardware that only supports one format (fp16 or bf16).
TensorFloat-32 with 19-bit floats (tf32) was introduced by Nvidia starting with the Ampere architecture. TensorFloat-32 uses fp19 MACs with fp32 accumulation. All the operations and storage happen in fp32 except for the MAC operations used in matrix multiplications. Those fp32 MACs are replaced with fp19 MACs and accelerated with specialized tensor cores. This replacement can be hidden from the framework end-user, where everything seems to run in fp32. The fp32 to fp19 conversions (truncating the last 13 mantissa bits) and the fp19 MACs are managed by the CUDA compiler and hidden by low-level libraries, such as cuDNN and cuBLAS. The accuracy of fp19 MACs is not guaranteed to be the same as fp32 MACs. However, empirical evidence using bf16 (which carries over to fp19) suggests that for DL workloads the accuracy difference is insignificant, although unknown outliers may exist [KMM+19].
The primary advantage of tf32 is the ease of adoption. It requires no changes in the DL libraries (except for an enablement flag) and works out-of-the-box. The disadvantage is the lack of memory or bandwidth savings compared to 16-bit formats, which is often the bigger bottleneck.
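In PyTorch, for example, tf32 is exposed through global flags on Ampere-class GPUs; the following sketch shows the opt-in (the tensor shapes are illustrative).

import torch

torch.backends.cuda.matmul.allow_tf32 = True   # fp32 matmuls may use tf32 tensor cores
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions may use tf32

a = torch.randn(1024, 1024, device="cuda")     # stored as ordinary fp32 tensors
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                      # executed with fp19 MACs, fp32 accumulation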
Integer-16 (int16) training has been demonstrated on some models with no hyperparameter tuning [KWW+17, DMM+18]. The distribution of the weights, activations, weight gradients, and activation gradients in a tensor can be represented using int16 and one shared scalar for the entire tensor. This scalar is dynamically adjusted to maximize range and minimize overflow. The weight and activation distributions do not change rapidly in consecutive training iterations; the gradient distribution changes more rapidly. A program can monitor the distributions and adjust the exponents for each tensor as needed.
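The following NumPy sketch illustrates the shared-scalar idea: one scale per tensor, chosen from the tensor's current maximum magnitude so it can be re-adjusted as the distribution drifts (the clipping and scale-selection policy are illustrative).

import numpy as np

def quantize_int16(x):
    # One shared scale per tensor: map the largest magnitude to the int16 maximum.
    scale = max(float(np.abs(x).max()), 1e-12) / 32767.0
    q = np.clip(np.round(x / scale), -32768, 32767).astype(np.int16)
    return q, scale

def dequantize_int16(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize_int16(x)
x_hat = dequantize_int16(q, s)                 # approximate reconstruction of x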
For training, int16 is not used in production; bf16 and fp16 are preferred over int16 given the added complexity to manage the shared exponent with int16, particularly for the gradient tensors. For inference, int16 has some adoption. Habana Goya uses int16 for workloads that require more precision than int8 (Habana Goya also supports other formats) [Hab19].
Integer-8 (int8) is rapidly gaining adoption for some inference workloads. Using int8 often
reduces the statistical performance due to the information loss quantizing from 32-bit to 8-bit.
For some applications, a small drop in statistical performance is unacceptable, as it can have a negative monetary impact; in particular, less relevant product recommendations result in reduced purchases. There are techniques to reduce the statistical loss discussed in Section 6.2. Note that
training with int8 is limited to academic research on a few simple models not relevant in industry.
There are two main challenges with most int8 quantization techniques. First, the uniform distribution of int8 does not allow finer granularity to better represent values in high-density regions where most of the information exists. A better approach is to use a nonuniform numerical format with high granularity in high-density regions and low granularity in low-density regions.
This reduces the 32- to 8-bit information loss. Some proposals, such as fp8, are discussed below.
Second, precomputing the activations’ quantization factors is needed to maximize the
computational benefits of int8 but requires additional effort for the developer. The distribution
of the activation values with production data can be estimated using data samples with similar
characteristics as the production data. This requires that a developer quantizing a model has
access to production-like data samples.
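The following sketch illustrates one simple way to precompute a per-tensor activation scale from calibration samples (a symmetric max-calibration variant; percentile or KL-divergence calibration are common refinements not shown here).

import numpy as np

def calibrate_activation_scale(activation_batches):
    # Estimate the activation range from a few production-like calibration samples.
    max_abs = 0.0
    for act in activation_batches:
        max_abs = max(max_abs, float(np.abs(act).max()))
    return max_abs / 127.0                     # symmetric int8 range [-127, 127]

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)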
Despite these challenges, int8 is supported by all prevalent hardware marketed for infer-
ence. Google uses int8 in production on TPUs for some MLP-, CNN-, and LSTM-based mod-
els, and on the Google Pixel phone for speech recognition with RNN models. Facebook (as well
as many other companies) also uses int8 across various workloads [JYP+17, HSP+19, PNB+18].
Facebook also demonstrated quantization to 4 bits on the embedding layers for serving recom-
mendations without affecting statistical performance.
In particular, int8 inference has been shown to work across various CNN models [GMY+19]. However, even some CNN models, like MobileNet and ResNeXt, and various non-CNNs, such as BERT, are more susceptible to information loss from quantization and require additional effort to achieve acceptable statistical performance [SDY+19]. While the acceptable degradation varies, for most companies degradation over 1% is unacceptable, under 0.5% is acceptable, and in between depends on the application. Recommenders have a stricter threshold, on the order of 0.01%, due to the monetization impact.
Floating-point 8-bit (fp8) is used by Microsoft in FPGAs (Microsoft also uses fp9)
using either 2 or 3 mantissa bits. fp8 is implemented by researchers in some ASICs, such
as the deep-learning neural processing unit (LNPU) to demonstrate training models on mo-
bile devices (LNPU uses fp8 and fp16 mixed precision training) [CFO+18, LLH+19]. Intel
and IBM demonstrate that fp8 multiplies (accumulated to fp32 and fp16, respectively) can
be used for training and inference with insignificant loss in performance for various work-
loads [CBG+20, MSD+19, SCC+19].
There is no standardized fp8 format. The most common formats are (1, 5, 2) and (1, 4, 3). The (1, 5, 2) format better represents the dynamic range of the gradients. A particular challenge in training with an 8-bit format is in RNNs and models without normalization layers, as they are more susceptible to errors. The gradient errors can quickly increase in RNNs, and the typical lack of normalization can result in irregular tensor value distributions.
IBM proposed a hybrid (1, 4, 3) and (1, 5, 2) approach for the forward and backpropagation, respectively, using loss-scaling and stochastic rounding, and keeping the input and last layers at fp16 [SCC+19]. The (1, 4, 3) format is modified using a fixed exponent bias of 4 to shift the coverage range by 2^(-4) to better align with the distribution of the weights and activations. This format is referred to as fp8-ibm in Table 6.1. There are two primary challenges to this format. First, some models, such as GNMT and Transformer, require dynamic loss scaling to converge properly, which increases the software complexity. Second, the more limited representation of small values compared to fp16 (the smallest positive values are 1.5 × 10^(-5) in (1, 5, 2) vs. 6.0 × 10^(-8) in (1, 5, 10)) often results in underflow.
Intel has proposed two methods, both using the (1, 5, 2) format. One method uses a shift and scale (shifted and squeezed FP8 (S2FP8)) parameter per tensor to represent a broad set of values. S2FP8 alleviates the need for loss-scaling, stochastic rounding, and fp32 for the first and last layers. The main weights and accumulations are in fp32 [CBG+20]. However, S2FP8 requires tracking the statistics of the tensor distribution (similar to int16 training) and updating the shift and scale parameters, which increases the software complexity.
The other method uses enhanced loss scaling to improve the range of values and reduce
the common underflow observed with fp8 training. This method uses loss scaling with a dynam-
ically increasing minimum threshold for the scaling factor. Using a minimum threshold ignores
spurious overflows in order to maintain a higher loss scale value. However, this method requires
observing the training cost to determine when to adjust this threshold value.
A significant advantage of fp8 over int8 inference is circumventing the complexities of
quantization. The current disadvantage is the limited hardware and software supporting fp8
formats. A minor disadvantage is that NaNs are overrepresented and consume 6 out of 256
(2%) and 14 out of 256 (6%) values in the (1, 5, 2) and (1, 4, 3) formats, respectively.
The published fp8 empirical results suggest that for the backpropagation (1, 5, 2) is preferred over (1, 4, 3). For inference (forward propagation), IBM demonstrated superior statistical performance using (1, 4, 3) with the exponent shift, albeit the results are primarily targeting convolutional models. Intel demonstrated (1, 5, 2) for both forward and backpropagation across ResNet, GNMT, Transformer, and NCF. The published results suggest that CNN models can benefit more from the additional mantissa bit in (1, 4, 3), and non-CNN models can benefit more from the additional exponent bit in (1, 5, 2). Nevertheless, the number of models in these studies is relatively small, and making solid conclusions requires further work.
Integer-4 (int4) support is available in recent Nvidia GPUs. int4 inference adoption on some CNN models may slowly grow on edge devices, such as mobile phones, where power and memory are limited. Adoption in data centers will likely be nonexistent to very limited, restricted to workloads tolerant of extremely low range and precision and to representing activations from ReLU functions with unsigned int4 (with the weights kept at int8). There is ongoing research
toward improving int4 quantization [CWV+18, Don19, GMY+19].
Floating-point 24-bit (fp24) (1, 8, 15) is used by Alibaba Neural Processing Unit (NPU) for CNN models for the element-wise and reduction operators (the matrix-wise operators use int8 → int16) [JHJ+20].
Posit is a relatively new format different from the IEEE floating-point standard. This format requires less power and die area than the IEEE floating-point counterpart [Gus17, Joh18]. It
does not overrepresent NaNs and provides other benefits and drawbacks [dDF+19]. However,
this format has minimal adoption in academia and none in industry.
Log-domain is another form of nonlinear quantization that has been shown to main-
tain statistical performance with smaller numerical formats [LMC+17]. This format has limited
adoption in academia and none in industry.
Binary (1 bit) and ternary (2 bits to represent −1, 0, and 1) have been used in research, in particular, to represent the weights in forward propagation passes [ROR+16, HS14].
Die Cost
The die cost to build a multiplier and the power cost to use the multiplier both exhibit quadratic growth with the number of mantissa bits and increase linearly with the number of exponent bits. Therefore, a bf16 multiplier is less expensive than an fp16 multiplier. However, area costs continue to decrease rapidly, and therefore this difference should not be a major factor in DL hardware design decisions. Usability and software development costs are much more critical factors.
To facilitate transitioning from hardware that only supports one format (fp16 or bf16), we recommend designing hardware that supports both bf16 and fp16 formats using a 19-bit (1, 8, 10) floating-point unit (FPU). Similarly, we recommend supporting both (1, 5, 2) and (1, 4, 3) fp8 formats using a 9-bit (1, 5, 3) FPU. According to IBM, supporting both formats only requires a 5% larger unit than supporting one format [SCC+19].
This algorithm determines the layers that can be quantized. Note that one challenge is that
interleaving layers with large and small numerical formats may result in higher computational
cost from the overhead of the many conversions.
Cross-layer range equalization is a data-free quantization technique (it requires no data and no backpropagation). The range of the weights across the layers is equalized, and the ranges of the activations are constrained under the assumption that a piecewise linear activation function (such as ReLU) is used between the layers [NvB+19]. This constraint is satisfied by many CNN models but not by non-CNN models. This technique is used in the Qualcomm Neural Processing SDK.
Channel-wise quantization uses a quantization factor for each channel rather than one
factor for the entire tensor.
Stochastic rounding (rather than nearest-value rounding) after multiplying by the quantization factor can improve performance [WCB+18]. To illustrate, rather than rounding the number 1.2 to the number 1, it is rounded to 1 with 80% probability and to 2 with 20% probability.
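A NumPy sketch of this rounding rule:

import numpy as np

def stochastic_round(x, rng=None):
    # Round down or up with probability proportional to the distance to each
    # neighbor, so 1.2 becomes 1 with 80% probability and 2 with 20% probability.
    if rng is None:
        rng = np.random.default_rng()
    floor = np.floor(x)
    frac = x - floor                          # e.g., 0.2 for x = 1.2
    return floor + (rng.random(x.shape) < frac)

x = np.full(100000, 1.2)
print(stochastic_round(x).mean())             # close to 1.2 in expectation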
Unsigned int8 ReLU activations uses the unsigned int8 representation, rather than signed
int8, for the activations of the ReLU functions. Using signed int8 wastes half of the values since
all the activations are nonnegative.
The techniques QAT, selective quantization, channel-wise quantization, and stochastic
rounding also benefit fp8 [CBG+20].
Figure 6.2: Pruning a model by removing the weights (links) close to zero.
Figure 6.3: Knowledge distillation. A large teacher model distills the knowledge to a smaller
student model. The student model learns using both the regular softmax and a softened softmax
from the teacher model. Based on [Int18].
have the highest value for 7 and also a relatively high value for digits that look like 7, such as the handwritten digits 1 and 9. The student model is trained to learn (1) the softened output using a softmax temperature and (2) the one-hot ground truth vector using the regular softmax. The softmax temperature also provides regularization to the model [YTL+19].
The intuition behind KD is that the teacher model requires a more complex model to learn
the relationships between the various classes. The ground truth one-hot vector does not encode
class similarities and treats each class as entirely independent. The teacher model provides the
class relations to the student model. Thus, the student model does not need to learn them from
scratch and can use a simpler topology.
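The following sketch shows one common form of the distillation loss with a softmax temperature; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from a specific paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student matches the teacher's temperature-softened softmax.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: the one-hot ground truth with the regular softmax.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce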
Extensions to this work include deep mutual learning (DML), where an ensemble of students collaboratively learn and teach each other by sharing their softmax outputs, and the teacher assistant (TA), which distills the knowledge from the larger-size teacher model to an intermediate-size TA model and then to a smaller-size student model [ZXH+17, MFL+19].
In this chapter, we detailed the various numerical formats used in production and those in ex-
ploration by researchers as well as compression techniques to reduce the memory footprint of
models. Using a smaller numerical representation can increase the number of operations per cy-
cle, and reduce the memory, memory bandwidth, network bandwidth, and power consumption.
However, it may also result in lower statistical performance, particularly for some int8 mod-
els. We discussed advances in quantization techniques to mitigate this accuracy loss and find Hessian-based analysis to be a promising path to determine which layers are quantizable. Hardware support across numerical formats is one of the vital hardware design decisions. We rec-
ommend that training processors primarily support both bf16 and fp16, given the small die cost over supporting just one, along with some fp32, and that inference processors primarily support fp16 and bf16 (for compatibility with the training formats), int8 and fp8, and some fp32. In the next chapter, we
review the basics of computer architecture, and discuss the various DL hardware designs.
CHAPTER 7
Hardware
The primary components in a DL platform are multitudinous multiplication and addition units,
sufficient memory capacity, high memory bandwidth to feed the compute units, high inter-node
and inter-server bandwidth for distributed computing, and power to operate. The tradeoffs of
architecting DL hardware depend on the targeted workloads and operating environment. The
enormous design space includes numerical formats, memory hierarchies, power constraints, area
constraints, software- or hardware-managed caches/scratchpads, support for dense and sparse
computations, domain-specific to general-purpose compute ratios, compute-to-bandwidth ra-
tios, inter-chip and inter-server interconnects, and ease of programmability.
The cost of arithmetic logic units (ALUs) is decreasing, and computational capacity is
growing faster than memory bandwidth, as shown in Figure 7.1 for the top supercomputer. The
primary hardware bottlenecks executing DL workloads are:
Moore’s Law continues to deliver exponential growth in the number of transistors that can
be packed into a given area, albeit at a slower rate than before. Computer architects are finding
new ways to extract performance from this exponential growth. However, as a consequence of
this exponential growth, compute and memory capacity are increasing much faster than memory
bandwidth, which is the bottleneck in many DL workloads. The slow growth in bandwidth
relative to compute is known as the memory wall or bandwidth wall, where compute units are
idled waiting for data [WM95, RKB+09].
As transistors shrink, their power density no longer stays constant but rather increases,
which is known as the end of Dennard’s scaling (discussed in Section 7.1) [DGY+74]. The
amount of dark silicon, where transistors cannot operate at the nominal voltage, is increasing.
This dark silicon motivates the exploitation of transistors for multicore processors and domain-
specific circuitry. Some of the existing techniques to increase performance are (detailed in Sec-
tion 7.4):
Figure 7.1: Computational capacity is growing faster than memory bandwidth as measured by
the capacity of the top supercomputer. Based on [LHL+18].
• placing the memory close to the compute units to reduce access time and energy;
serving production hardware. Training requires storing and retrieving the activations across all
the layers, which typically involves reading and writing several GB of data (the activations) from
and to DRAM. In training CNNs, the size of the activations typically has a more significant
impact on the total memory requirements than the size of the model. To illustrate, U-Net (used
for medical 3D image classification) has 20 million weights but requires 256 GB of memory. Conversely, Megatron-LM-1.2B has 1.2 billion weights but requires 32 GB of memory. Given
the amount of data transfer, using a high bandwidth DRAM, such as HBM2E, for training
tasks is beneficial. An advantageous design choice is to put enough SRAM to store the model
and the activations associated with two consecutive layers in training and inference. Note that
the size of the activations is proportional to the batch size, which is usually small for inference.
As much as possible, data center managers want a homogeneous and manageable data
center leveraging specialized accelerators only when absolutely needed. However, given the ex-
ponential demand for compute and the end of Dennard’s scaling, the demand for dedicated DL
processors is increasing. Hardware designers should be aware of what hyperscalers value:
Figure 7.2: Total power requirement (red curve) across various voltages. Low voltage results in
high static power due to current leakage. High voltage results in high dynamic power. There is
an optimal voltage V where the total power usage is minimized.
Figure 7.3: Increasing the frequency past fmin linearly increases the required voltage, and (not
shown) cubically increases the dynamic power.
the circuitry. The time it takes to travel this distance must be less than one clock tick, which can
be an issue for large circuits when operating at high frequencies.
The primary contributors to the increased dark silicon are the exponential growth in transistors per area, current leakage, and power constraints. Multicore processors and specialized computing are two methods to mitigate dark silicon. These methods have enabled the continued growth in computational capacity at the expense of two new challenges: Amdahl's law and the memory wall.
Gene Amdahl formalized the speedup when only a fraction of a program is improved,
known as Amdahl’s law. It is used to determine the limitations of parallel computing [Amd67].
Using N > 1 cores for a particular workload results in a maximum speedup of
1 / ((1 − P) + (P/N)),
where P is the fraction of the workload that is parallelizable. Approaching this maximum speedup requires nontrivial parallel programming, and there is a computer science field dedicated to this. Even assuming P = 1, perfect linear scaling across general-purpose multicores is not possible. There are core-to-core bandwidth limitations and cache coherence overhead, which grow with more cores.
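A small worked example with illustrative numbers:

def amdahl_speedup(P, N):
    # Maximum speedup with N cores when a fraction P of the work is parallelizable.
    return 1.0 / ((1.0 - P) + P / N)

print(amdahl_speedup(P=0.95, N=64))   # ~15.4x, far below the 64x ideal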
These limitations and overheads are motivations to reduce the scope of hardware-based
cache coherence and to use domain-specific DL processors for embarrassingly parallel (minimal
communication/synchronization between parallel elements) workloads with predictable opera-
tions. Solutions still require a way to operate on the right data, and this drives a combination of
application-specific hardware and software-based “coherence” [AVG+15, TKT+16, ADC11].
Figure 7.4 provides a high-level view of the trends in microprocessors. The number of
transistors per area continues to grow exponentially, and the number of logical cores is following
that same growth path; new transistors are primarily used for additional cores. In the future, the
growth in the number of cores may slow down, with more transistors utilized for domain-specific acceleration. While frequency has already plateaued, single-thread performance continues to increase due to better instruction pipelining, improved branch prediction, out-of-order execution, larger instruction vectors, and specialized execution units, resulting in higher IPC (instructions per cycle).
Figure 7.4: Trends in microprocessors. Source: [Rup20] (CC BY-SA 4.0 license).
Memory can be described by its capacity (bytes) and data transfer rate or bandwidth (bytes per second). The bandwidth (BW) can be computed as follows:
BW = f_mem × (number of interfaces) × (transfers per clock) × (bytes per transfer),
where f_mem is the memory frequency, the interfaces are typically 2 (dual-channel configuration) in modern processors, and the transfers per clock are 2 for memories that transfer on both the rising and falling clock edge (such as DDR) and 1 otherwise. In practice, the effective transfers per clock may be slightly lower and workload-dependent; in DRAM, it depends on the distribution of read and write transactions.
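As a worked example, consider DDR4-3200 in a dual-channel configuration with a standard 64-bit (8-byte) bus per channel:

# Peak DRAM bandwidth for DDR4-3200, dual channel.
f_mem = 1600e6            # memory clock in Hz (3200 MT/s = 1600 MHz x 2 edges)
transfers_per_clock = 2   # DDR transfers on both clock edges
interfaces = 2            # dual-channel configuration
bytes_per_transfer = 8    # 64-bit bus per channel
bw = f_mem * transfers_per_clock * interfaces * bytes_per_transfer
print(bw / 1e9, "GB/s")   # 51.2 GB/s peak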
The memory types used in production, in increasing order of access time and, equivalently, in increasing order of memory density (bytes per silicon area) and decreasing monetary cost per byte, are as follows:
1. processor registers;
2. SRAM: scratchpad, cache (typically with multiple levels); and
3. DRAM: HBM2/E, GDDR6, DDR4/5, LPDDR4/5.
There are two types of random-access memory: dynamic RAM (DRAM) and static RAM
(SRAM). SRAM uses a bistable circuit design that is faster but more expensive and requires four
to six transistors per bit. DRAM is slower but less expensive and requires only one transistor (and
a capacitor) per bit, and hence it has higher memory density. The capacitor stores the charge (the
bit). Reading the stored bit consumes this charge, requiring a write after the read cycle to save the value. Even in the absence of read/write activity, DRAM memory must be frequently refreshed to avoid losing information as the charge leaks (at a temperature- and device-dependent rate).
This refresh involves reading the data and immediately writing it to the same area (as DRAM
reads are destructive). SRAM does not require frequent reads and writes. Both DRAM and
SRAM are volatile memories; that is, they lose the stored bits when the power is off.
There are two main types of SRAM configurations: caches and scratchpads [LAS+07]. A
cache is implicitly addressed (not directly addressed by the software), hardware-managed mem-
ory. A scratchpad (also called streaming memory) is explicitly addressed, software-managed
memory. Caches are common in CPUs and GPUs to support general-purpose workloads.
Scratchpads are common in embedded and dedicated hardware, such as ASICs and DSPs, for
static graph-based workloads to reduce power consumption.
A cache has additional logic circuitry to ensure cache coherence and improve locality to
determine what data to keep (this data is known as hot entries or working set) and what data to
replace. This logic alleviates the software (the programmer or compiler) from directly managing
the cache memory access. However, it comes at the expense of higher energy cost per data access
and lower memory density. This additional logic is beneficial for irregular access patterns, such
as in GNNs, embedding layers, and DL dynamic graph-based models.
There can be different levels of caches. Modern CPUs have three levels of caches: L1, L2 (mid-level cache (MLC)), and L3 (last-level cache (LLC)). L1 is the smallest and closest memory to the compute unit, and therefore has the fastest access time. CPU processors have two different L1 caches: a data cache unit (DCU or L1d) and an instruction cache unit (ICU or L1i). Data and instructions share the cache in L2 and L3. Modern GPUs have two levels of cache.
The canonical chunk (block) of memory loaded from the main memory to the cache hierarchy
is called a cache line. Note that loading an entire cache line can waste bandwidth and storage
on sparsely strided memory accesses.
Different architectures use different cache replacement policy algorithms, and even differ-
ent cache levels within an architecture may use different policies. While the specific policy used
by a microarchitecture is not always made public, variants of the Least Recently Used (LRU)
eviction policy are common, such as Adaptive Replacement Cache (ARC). LRU means the
cache tracks and evicts the least recently accessed page when adding a new page. ARC tracks
frequently used, recently used, and recently evicted pages.
While caches are hardware-managed, there is some work to enhance cache control with software hints. One example is the CLDEMOTE instruction, which hints to the hardware to demote a given cache line to a more distant cache from the processor to speed up access to the cache line by other cores (L1 caches are unique to a specific core).
A scratchpad has a simple memory structure that provides better efficiency at the expense of more sophisticated software, which must manage all the memory accesses and the replacement policy. A scratchpad is typically more efficient than a cache, usually 1–2 clock cycles per memory access.
A scratchpad has addressable storage and requires explicit software-controlled direct memory
access (DMA) transfers to orchestrate all data movement in the proper order. However, any
mismatch of memory accesses to the ALU or FPU logic inputs or outputs may lead to orders of
magnitude of performance degradation. Thus, scratchpads are typically limited to DL workloads
with static graphs, where all data accesses are predictable and determined at compile-time. In
high-volume production, saving some power and execution time has multiplicative benefits over
the lifetime of the model, which may outweigh the software complexity costs.
A hybrid memory system uses both cache and scratchpad configurations. Nvidia architec-
tures (excluding Pascal) configure some cache memory as a scratchpad for application-specific
locality and communication optimizations. Note that Nvidia refers to scratchpad and cache as
shared and automatic memory, respectively. There is research toward a unified configuration to
get the best of both, such as Stash and Buffets [KSA+15, PSC+19].
There are three types of caches with different speeds and conflicts tradeoffs. Cache con-
flicts occur when a different cache line from memory maps to the same cache entry, thus evicting
and replacing the existing cache entry. The placement depends on the memory address.
• Fully Associative places a cache line from memory in any entry in the cache; this has
the slowest-access time but minimizes conflicts.
• Direct Mapped places a cache line from memory in a specific entry in the cache; this
has the fastest-access time but maximizes conflicts.
• N -way Set-Associative places a cache line from memory in any of N entries in the
cache; this provides a compromise between access time and conflicts.
In practice, most CPU caches in production are N-way set-associative caches. Understanding cache associativity can guide the design of the DL topology. To illustrate, an fp32 GEMM with a leading dimension of 1024 (used in an RNN layer with 1024 units) results in high cache conflicts in CPUs; a better leading dimension is 1040 in modern CPUs, as explained in Section 7.2.1.
DRAM or, more precisely today, Synchronous DRAM, is less expensive in price and sili-
con area but is significantly more expensive in energy and access time compared to SRAM. There
are various types of DRAM used in production: Double Data Rate (DDR), High-Bandwidth
Memory (HBM), Graphics DDR (GDDR), and Low-power DDR (LPDDR), and various
generations within each type [GLH+19]. DDR memories fetch the data on both the leading
and falling edge of the clock signal. Other types of DRAM with minimal market adoption are
Hybrid Memory Cube (HMC) and Wide I/O (WIO).
DDR DDR4 is the most widely used DRAM. It is available in servers, workstations, laptops,
and some inference accelerators, such as Habana Goya. Increasing the number of main mem-
Figure 7.5: HBM memory connected to the processor via an interposer. (a) Top view. (b) Side
view. Based on [Sam16].
ory channels improves bandwidth and partially mitigates the memory wall [Hor14, PRH+17].
However, the maximum number of balls or pins possible on a package limits the number of
channels. DDR5 is the latest generation of DDR, providing higher bandwidth and density. Intel processors codenamed Sapphire Rapids and (likely) AMD processors codenamed Genoa should support DDR5.
HBM HBM2 is the defacto DRAM memory for GPUs and accelerators targeting training,
HPC, and cryptomining. It is available in the Nvidia {P, V, A}100 GPUs and Habana Gaudi.
Google TPU v2 and v3 (and likely v4) use HBM but have not made public the specific HBM
generation.
HBM2 has a 1024-bit wide interface across 8 channels per stack and (in the latest specification) 2.4 GT/s transfer rates (each bus lane transfers 2.4 Gbps), for a total of 307 GB/s per DRAM stack or package. It provides higher bandwidth and uses less power relative to other
DRAM memories. HBM memory connects to the processor via a purpose-built silicon chip
called an interposer and mounts in the package substrate, as illustrated in Figure 7.5. The shorter
wires allow for higher bandwidth at lower power. Given that HBM uses a stack of memory
chips, it is referred to as 2.5D memory. An issue with HBM is the high price to manufac-
ture the interposer, in part, because 2.5D is a relatively new memory technology. The cost may
decrease as the technology gains broad adoption.
GDDR GDDR6 is used in the latest gaming graphics cards and data center inference GPUs,
such as the Nvidia T4, and may expand to other inference accelerators. Compared to HBM,
GDDR is less expensive and has lower latency, but it also has lower bandwidth and lower mem-
ory density.
LPDDR LP-DDR4 and LP-DDR4X are widely used in low power devices, such as mobile
phones. LPDDR has short wires and, therefore, low latency response. The newest generation
LP-DDR5 is available in the latest mobile phones and expanding to other devices, such as
tablets, ultra-thin notebooks, automotive, and tentatively, DL inference processors.
Figure 7.7: A roofline model models the maximum attainable OPS for a particular kernel on a
particular hardware. Kernel 1 is well optimized and operating near the roofline. Kernels 2 and 3
are well below the roofline and require better software optimizations to more efficiently use the
computational resources.
In the literature, OI analyses sometimes assume this best-case scenario, making OI independent of hardware. In practice, however, the OI depends on the memory hierarchy.
Figure 7.7 shows a roofline plot. The maximum attainable OPS for a kernel is min(bandwidth × OI, peak OPS). Kernels where the attainable OPS are constrained by bandwidth × OI are bandwidth bound, and those constrained by the peak OPS are compute bound. Increasing the computational capacity does not increase performance for bandwidth bound kernels. The relation between roofline and computation time is as follows: the time T it takes to
execute a kernel, assuming perfect overlap of communication and computation, is:
T = max( (number of ops to compute the kernel) / (peak processor OPS), (bytes to read from memory) / (peak memory bandwidth) ).
Data reuse is key to achieving high OI. Data reuse means reusing the operands or the
result for multiple cycles. The OI for a kernel function can vary considerably depending on how
much data is reused. A traditional CNN kernel has high OI (1000 ops/B), whereas a GEMM
kernel used in an MLP, RNN, or other fully-connected layers typically has low OI (10 ops/B)
(see Figure 1.16).
The OI of a C = A × B GEMM operation, assuming the data fits in SRAM, where A ∈ ℝ^(M×K), B ∈ ℝ^(K×N), and C ∈ ℝ^(M×N), is:
OI = 2MKN / (sizeof(datatype) × (2MN + MK + KN)),
where the 2 in the numerator is to account for multiplies and adds, and the 2 in 2MN in the denominator is to account for reading and writing matrix C from and to main memory. A practical example is a fully-connected layer going from a layer with M units to a layer with K units, using a batch size of N, and where matrix A is the weight matrix. Similarly, the OI of a Z = X ⊗ Y convolution operation, assuming the operands fit in SRAM, where X ∈ ℝ^(N×C×H×W), Y ∈ ℝ^(K×C×R×S), and Z ∈ ℝ^(N×K×H̃×W̃), is:
OI = 2NKCRSH̃W̃ / (sizeof(datatype) × (2NH̃W̃K + KCRS + NHWC)).
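The following sketch evaluates the GEMM OI and the corresponding roofline-attainable OPS; the matrix sizes, peak OPS, and bandwidth are illustrative, not a specific device.

def gemm_oi(M, K, N, bytes_per_element=4):
    # Arithmetic intensity of C = A*B: ops over bytes moved to/from memory.
    ops = 2 * M * K * N                                        # multiplies and adds
    traffic = bytes_per_element * (2 * M * N + M * K + K * N)  # read A, B; read+write C
    return ops / traffic

def attainable_ops(oi, peak_ops, bandwidth):
    return min(bandwidth * oi, peak_ops)                       # the roofline

# Illustrative: a fully-connected layer with batch size N=32 in fp32, on a
# hypothetical device with 100 TOPS peak and 1 TB/s memory bandwidth.
oi = gemm_oi(M=4096, K=4096, N=32)
print(oi, attainable_ops(oi, peak_ops=100e12, bandwidth=1e12))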
Element-wise operators have no data reuse and a very low OI. The OI can increase
by fusing (merging) element-wise operators with computationally intensive operators, such as
GEMM and convolution. For instance, the ReLU operator can be applied to the output of a
convolution operation while the output data is still in the registers before writing it back to the
cache or main memory.
Even when the operands do not fully fit in SRAM, GEMM and convolution operators can take advantage of data reuse. In the C = A × B GEMM operation above, every value in matrix B is reused M times: every value in row k ∈ [0, K − 1] in matrix B is multiplied by all the M values in the corresponding column k in matrix A. Every value in C is reused K times as it accumulates the K products. Weight reuse (the data in matrix A) is proportional to the batch size N; a batch size of N = 1 has no weight reuse in a GEMM operation.
In the convolution operator, there is more data reuse. The weights of one filter Y_k ∈ ℝ^(C×R×S) can be reused across the N dimension in the input tensor X. Alternatively, the activations across one sample, X[n] ∈ ℝ^(H×W×C), can be reused across all weights Y.
• RISC open-sourced RISC-V ISA in academia with some small traction in production
at Alibaba and SiFive; and
There are different ways to parallelize a kernel in hardware, such as with SIMD/SIMT
instructions, multicores, or systolic architectures. Also, model parallelism techniques, discussed
in Section 6.3, can distribute the kernel’s computations among multiple nodes.
Single instruction, multiple data (SIMD), and single instruction, multiple threads
(SIMT) (coined by Nvidia), are used by CPU and GPU vector processors, respectively. In CPUs,
a SIMD instruction is concurrently applied to all the values in the respective registers within an
execution unit (EU) in a core. To illustrate, an AVX-512 execution unit may take two 512-bit input registers, each with 16 fp32 values, compute the element-wise product across the registers, and store the resulting 16 fp32 values in another 512-bit register. GPUs
generalize SIMD with SIMT; rather than apply an instruction to data in registers, GPUs apply
an instruction across multiple threads (a warp or 32 threads in Nvidia GPUs and a wavefront or
64 threads in AMD GPUs). Specifically, GPUs use coalesced loads, where different values in
the same cache line are concurrently accessed and used by the threads in a warp or wavefront.
SSE, MMX, AVX, AVX-2, and AVX-512 (sometimes called AVX-3) are SIMD instruc-
tion extensions to the x86 ISA, and NEON and the Scalable Vector Extensions (SVE) are
SIMD instruction extensions to the Arm ISA (do not worry if you are unfamiliar with these
ISAs). The primary differences between these ISA extensions are the number of supported instructions and the data size to which an instruction can be concurrently applied. For example, AVX-512 has more instructions than AVX-2 and concurrently operates on 512 bits, whereas AVX-2 operates on 256 bits.
Nvidia provides a pseudo-assembly language virtual ISA called the Parallel Thread Execu-
tion (PTX). Compilers, such as GCC (detailed in Section 8.2), generate PTX code. PTX code
requires using Nvidia’s NVCC compiler to access the physical ISA known as SASS to generate
an executable binary [Nvi15]. Recent AMD GPUs use the Vega ISA, RDNA ISA, or CDNA
ISA.
Simultaneous multithreading (SMT), called hyper-threading on Intel processors, is used in CPUs to run two (and potentially four or more) threads on one core to better utilize the EUs that may otherwise sit idle. For well-optimized kernels, however, an EU may not sit idle, and using two threads may not provide a significant gain in performance. In high-OI kernels, enabling SMT could reduce the performance due to the thread-switching overhead. Experimentation is required to assess the gains or losses of SMT on a particular workload.
Another set of instructions designed to exploit instruction-level parallelism is the very long
instruction word (VLIW) instructions, where multiple instructions execute in parallel. VLIW
processors work best with regular, predictable code for the compiler to extract the required level
of parallelism. The retired Itanium, and today’s Habana AI processors as well as Google’s TPU
v2 (and perhaps v3 and v4) use VLIW SIMD vector processors.
Dataflow parallelism uses systolic architectures (also called dataflow architectures or
dataflow processors) with multiple simple processing engines (PEs). A PE performs a simple
computation, such as a MAC, and passes the result to its neighbor PE. The collected work across
all PEs results in high throughput. Given the simple circuitry design, dataflow architectures can
be power-efficient. In a systolic array, the PEs connect in a mesh pattern; the shorter wires con-
nect nearby PEs and provide high bandwidth at much lower power than longer wires. Dataflow
parallelism is adopted in specialized hardware discussed below, including in domain-specific
circuitry added to CPUs and GPUs, such as Intel’s AMX and Nvidia’s tensor cores. Dataflow
processors work best with regular, predictable code. Using systolic architectures (and SIMD and SIMT) near peak performance requires a mature compiler or program that considers the memory hierarchy. A minor mismatch between memory accesses and the systolic dataflow processor can lead to orders of magnitude of slower performance.
A central processing unit (CPU) consists of RAM, registers, and execution units. RAM
holds both the program instructions and the data. A server CPU typically has faster but fewer
cores compared to a GPU or a dedicated DL accelerator. It may better balance complex work-
loads: the parallelizable code can benefit from the many CPU cores, and the serial code can
benefit from the single-core high-frequency performance. Note that the execution time does
not decrease linearly with increased core count, per Amdahl’s law. A CPU provides maximum
flexibility and is typically simpler to program than other hardware. It has built-in logic to ex-
ploit control-flow, including branch prediction. This flexibility comes at the expense of higher
power consumption to decode and execute the instructions in each core. Embarrassingly parallel
workloads with static graphs do not require many of the capabilities of the CPU, and a dedicated
processor should provide higher performance per watt.
A graphical processing unit (GPU) consists of RAM, registers, and compute units.
GPUs are designed for embarrassingly parallel tasks, initially targeting image manipulation by
simultaneously applying an operator to each pixel or group of pixels, and later targeting DL ma-
trix multiplications and convolutions. A difference between a CPU and a GPU core is that a CPU core can decode and execute an instruction independently of the other cores. A GPU core executes the same instruction as the other cores in its group, known as a warp and a wavefront
by Nvidia and AMD, respectively. The CPU cores provide more flexibility than GPU cores, and
the GPU cores provide higher energy efficiency than CPU cores.
A CPU core is an independent processor with dedicated ALUs, control logic, local SRAM
with a dedicated L1 cache, and multiple registers shared only between the SMT threads (when
SMT is enabled). A GPU core cannot operate independently of other cores; it has dedicated
Figure 7.9: The memory designs under the same power consumption range from using (left)
HBM and small local SRAM to using (right) multiple SRAM units that take most of the silicon
area and no DRAM. Blue rectangles represent the memory and yellow rectangles the compute
units.
registers but not dedicated SRAM; instead, it shares the memory with all the cores in the warp
or wavefront. Given the limitations of a GPU core, some literature refers to them as threads.
The warp or wavefront can be thought of as a core with massive SMT capabilities. Compared
to CPUs, GPUs use much larger register files similar in sizes to a CPU’s LLC to support the
massive SMTs at higher throughput at the expense of higher latency.
A typical bottleneck is the limited memory bandwidth. Increasing the SRAM associated
with every compute unit or PE can mitigate this. Design choices range from Nvidia’s V100
with large HBM2 and small local SRAM to Graphcore’s Colossus with no DRAM and large
SRAM units that take most of the silicon area, as illustrated in Figure 7.9, and to emerging
in-memory processing technology. These design choices affect the batch size required to achieve
high efficiency. Hardware with more local SRAM can have higher compute utilization with
small batch sizes, which can benefit both training and inference. Training with small batch sizes
requires less hyperparameter tuning to converge. Inference with small batch sizes (often a batch
size of one) is typical to meet latency constraints.
A field-programmable gate array (FPGA) is a type of hardware with small compute
elements (logic blocks), such as memory, registers, lookup tables, and macro functions, whose
connectivity is reconfigurable and can be programmed. This programmability is beneficial
to adapt to new workloads that require different hardware characteristics. Also, FPGAs are
used to simulate ASICs and other processor designs before building them. Two challenges with
FPGAs are the long compilation time (several minutes to hours) to reprogram the logic gates
and the limited DL software tools.
A coarse-grained reconfigurable array (CGRA) is also a type of programmable hard-
ware [LZL+19]. A CGRA can be thought of as an FPGA with coarser reconfigurability. Thus,
in theory, a CGRA provides easier programmability but less flexibility compared to an FPGA.
In practice, CGRAs have limited adoption due to the limited software tools.
A digital signal processor (DSP) is a specialized, low-latency microprocessor with a spe-
cialized ISA optimized for frequently used functions in signal processing, like convolution.
Modern DSPs are modular in that they may have a consistent base ISA and an extension ISA
specific to the type of signal being processed (for instance, image, audio, or network signals).
Unlike a CGRA, a DSP is not reconfigurable. DSPs are programmable but require a good
compiler for high performance. DSPs are typically used in combination with other hardware in
a heterogeneous design.
An application-specific integrated circuit (ASIC) provides the best performance for a
specific application but is the least flexible. ASICs have limited control logic and depend on
the programmer or compiler to manage data movement. Achieving high utilization requires
experienced low-level programmers or a mature DL compiler. Current DL compilers are still
immature and require significant time to map the kernels to execute efficiently in hardware.
Given the software complexity, ASICs work best with regular, predictable code. Some newer
models have dynamic graphs with complex datapaths that are difficult to compile efficiently.
ASICs are often used as part of a DL design with other architectures to handle the computa-
tionally intensive operators.
Most ASICs use dataflow architectures for MAC computations. A recommended high-
level design for an ASIC is to pack as many transistors as possible into the die for MAC units
(with the die size and power constrained by the deployment environment) to support matrix-wise
operators. Then, use some of the silicon for element-wise operators, matrix transposes, and I/O,
and use most of the rest for SRAM. The processor should operate at or slightly above the Vmin
voltage to ensure the highest ops/s per watt. Increasing the frequency past fmin increases the
power with the cube of the frequency increase (see Section 7.1).
There are various ways to implement MACs with dataflow parallelism. Chen et al. and
Sze et al. provide a detailed review of various dataflow architectures [CES17, SCY+20]. These
architectures have an array of PEs connected via a network-on-chip (NoC) with global SRAM
memory, as illustrated in Figure 7.10 for a 3×3 array (in practice, the arrays are larger). The PE
array gets the activations (Act), weights, and accumulated sums from the global SRAM. Each
PE contains the ALU or FPU logic to perform MAC operations, a local control unit, and may
have a local SRAM scratchpad (Spad). The MAC unit multiplies a set of weights and activations
and adds the result to the accumulated partial sum.
There are four types of dataflow architectures: no local reuse, weight-stationary, output-
stationary, and row-stationary [CES17].
No Local Reuse maximizes the size of the global SRAM by not having local PE memory.
The weights and activations pass from the global SRAM to each PE, which passes the accumu-
lated sum to its neighbor along a row of PEs, as illustrated in Figure 7.11.
Weight-Stationary maximizes weight reuse by storing the weights in the PE’s local mem-
ory. An activation is broadcasted to the relevant PEs, and the accumulated sum flows from each
PE to its neighbor along a row of PEs, as illustrated in Figure 7.12. This data flow works well
for traditional convolutional layers that reuse the weights. It is not efficient for fully-connected
layers or for convolutional layers with limited weight reuse, such as 1×1 convolutions or depthwise
separable convolutions.
Figure 7.10: An accelerator chip with a 3×3 array of PEs. Each PE has a MAC unit that
multiplies a set of weights and activations and adds the result to the accumulated sum. Based
on [SCY+17].
Output-Stationary maximizes reuse of the accumulated sums by storing them in the PE’s
local memory. A weight is broadcasted to all the relevant PEs. The activations flow from each
PE to its neighbor along a row of PEs, as illustrated in Figure 7.13.
Row-Stationary maximizes reuse across weights and activations. The accumulated sums
flow from the bottom to the top of each column, as illustrated in Figure 7.14. Row-Stationary, proposed
by Chen et al., provides the best performance per watt for convolutions and fully-connected
layers [CES17, CES16].
An operation may not distribute evenly across all the PEs in the array. dMazeRun-
ner efficiently explores the various ways to split computational kernels in a dataflow acceler-
ator [DKA+19].
ASICs can also be customized to better support sparse matrix multiplications. Nvidia
researchers demonstrated the benefits of sparse multiplications with the ExTensor accelerator
that rapidly finds intersections of nonzero operands and avoids multiplies by zero [HAP+19].
Compute-in-memory and neuromorphic processors are two different designs; both have
challenges and no adoption in production. A compute-in-memory processor uses analog computation.
Figure 7.15: Examples of topology designs using 8 nodes. High-radix topologies provide lower
communication latency across nodes. Based on [NKM+20].
CHAPTER 8
Compiler Optimizations
At the core of the software stack are compilers to transform the programmer’s high-level code
into executable code that runs efficiently on a target device. Programmers use a variety of lan-
guages to code at various levels of abstraction. A programming language is a formal language
used to write code, such as for functions and algorithms. High-level languages are indepen-
dent of a hardware target and include C, C++, Python, Java, JavaScript, CUDA C/C++, Swift,
and Julia. Assembly (asm) is a low-level language that targets a specific instruction set architec-
ture (ISA). In between are intermediate languages that are assembly-like in format but general
enough for execution on different ISAs, such as LLVM IR, various Multi-Level IR (MLIR)
dialects, and PTX for Nvidia GPUs.
Programming languages have a set of specifications or rules that dictate what the outputs
should be for a given input. The output also depends on the dynamic conditions of the running
program. The approaches to implement a programming language are interpretation, compila-
tion, or a mixture of both. The terms interpreted language and compiled language denote that the
default or canonical implementation of that language uses an interpreter or a compiler, respec-
tively. For some languages, the canonical implementation is the only implementation, while
others like Python have multiple implementations (more on this below).
An interpreter is a computer program that directly executes the code for a particular lan-
guage. That is, the code does not map to machine code. The processor executes (runs) the inter-
preter, and the interpreter reads the code and produces its output according to the interpreted
language's specifications and rules. The interpreter's source code (the program that is executed)
can be in a different language than the interpreted language.
A compiler is a computer program that transforms code between two languages or within
a language. The compiler runs various optimization passes to improve the execution time and
simplify the code. Alternatively, the compiler may only focus on code canonicalization, which
transforms the code into more rigid patterns removing unnecessary variations. The compiled
code is passed to an interpreter or directly to the processor when it is machine code (in this case,
the processor can be thought of as the interpreter of the machine code).
Often, before an interpreter executes a high-level code, the code is first dynamically (just-
in-time) compiled into bytecode, which is a compact language or efficient intermediate represen-
tation. This compilation is usually a minor transformation to make it easier for the interpreter
to parse the code. Typically, more compilation (optimization passes) leads to faster execution;
however, this comes at the expense of longer build time.
Let us look at the Python language as an example of a language with various implemen-
tations, and focus on two: CPython and PyPy. CPython is an interpreter implementation and
the canonical (default) Python implementation. Python programmers who have never heard
of CPython likely use the CPython interpreter. Like other interpreted languages, the Python
source code, or a Python command when used interactively by the programmer, is transformed
into bytecode. Then, this bytecode is interpreted by CPython one command at a time. PyPy is
a Python implementation with an interpreter and a JIT compiler (more on JIT below).
Compilers lower (this is compiler parlance for transform) code from a higher-level lan-
guage to a lower-level language, for instance, from C++ to x86 machine code. Compilation to
machine code that happens before runtime (execution) is known as Ahead-of-Time (AOT)
compilation. Compilation to machine code that happens during runtime is known as Just-in-
Time ( JIT) compilation. AOT improves the performance for static graphs at the expense of
longer compile times.
A JIT compiler is a computer program that compiles to machine code at runtime. Using
a JIT compiler can significantly increase startup time. To mitigate this, JIT compilers are typically
used alongside an interpreter for runtime profile-guided optimizations, also known as adaptive
optimizations. As the interpreter executes the source code (or, more precisely, the bytecode), the
interpreter tracks repetitively used sections and triggers the JIT compilation for these sections
into higher-performing machine code. The compiled code is cached, and the interpreter can
then alternate between the usual execution of bytecode and the execution of the JIT code.
An intermediate representation (IR) is a data structure or graph representing the required
operations for a particular program. Compilation may use several levels of IR, progressively
lowering on each pass. A high-level, hardware-independent IR may contain control-flow tokens,
such as for, if, and while. A low-level, hardware-independent IR may look similar to assembly
language while still being generic enough not to be tied to a specific hardware implementation
to simplify the next stage of compilation. Bytecode is an example of an IR.
Two common properties of some IRs are static single-assignment (SSA) form and three-
address code (TAC). SSA requires that each variable (called a typed register) is assigned precisely
once (it is not mutable), and every variable is defined before it is used, which facilitates various
optimizations. TAC requires that statements have at most three operands.
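To illustrate both properties (this snippet is illustrative and not taken from a particular compiler), the expression d = (a + b) * (a + b) - c can be lowered to the following three-address SSA form, where each temporary is assigned exactly once and each statement has at most three operands:

// three-address code in SSA form
t0 = a + b
t1 = t0 * t0   // the repeated subexpression a + b is computed once
t2 = t1 - c
d  = t2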
Compilers often take multiple optimization passes over each IR, and each pass may affect
subsequent passes. Both hardware-independent and hardware-dependent optimization passes are
common in the compilation of DL models. Operator fusion and loop tiling are the most important
optimizations for DL models. Some operator fusions may be hardware-dependent; those
that are ISA-dependent are encompassed under operator folding. All these optimization passes
are discussed in Sections 8.4 and 8.5.
In the remainder of this chapter, we review programming language types. We explain the
compilation process from high-level language to machine code and, as an example, explain how
this process works with the popular LLVM compiler. Moreover, we describe standard com-
piler optimization passes to accelerate the execution of DL models. Specific DL compilers are
discussed in Chapter 9.
8.2 FRONT-END, MIDDLE-END, AND BACK-END COMPILATION PHASES
The compilation process consists of a front-end, a middle-end, and a back-end phase, as shown
in Figure 8.1. Each phase has one or more IRs depending on the optimization passes. One or
multiple compilation infrastructures may be used for these phases.
Figure 8.1: The compilation process consists of a front-end, middle-end, and back-end phase.
Figure 8.2: (green) The programmer's source code. (orange) The abstract syntax tree (AST) rep-
resentation. The parser constructs an AST that captures the lexical structure of the source code.
Front-end The front-end compiler parses the code, converts it into tokens, checks for errors
(syntactic and semantic analysis), and generates a domain-specific IR. Two common types of
IR used by front-end compilers are the abstract syntax tree (AST) and the control-flow graph
(CFG) data structures. The AST is language-dependent. It captures the lexical structure (lay-
out) of the source code, using the internal nodes for the statements and operators, and the leaf
nodes for the operands representing values or variables. The parser returns an error message if a
rule in the language specification is violated. Front-end compiler algorithms are fairly mature.
Figure 8.2 illustrates an AST generated from a for loop.
A CFG is language-independent and expresses the control-flow and data paths through
a program. A control-flow statement, such as for, while, and if, determines which of two or
more paths to take. The nodes are basic blocks, and the edges represent possible execution paths
between basic blocks. Basic blocks are a set of sequential operations with no branch statements
until the end of the block. Figure 8.3 illustrates a CFG used to compute the factorial of N. The
top block is for the code that runs before the while loop. The next block is the comparison to
decide which branch to take. The next block is the loop body, which returns to the comparison.
The last block is the code that runs after the while loop. A CFG is typically compiled from an
AST IR.
Figure 8.3: (green) The programmer's source code. (orange) The control-flow graph (CFG) rep-
resentation. The CFG expresses the possible decisions at each graph node.
Figure 8.4: The optimizer reduces the number of operators that need to be executed: (left) the
unoptimized code and (right) the equivalent optimized code assuming a is an unsigned integer.
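To make the blocks concrete, a minimal C sketch of such a factorial computation (the variable names are illustrative, not taken from the figure), with comments marking the basic blocks of its CFG:

// CFG example: factorial of N
int factorial(int N) {
    int result = 1;          // block 1: code before the while loop
    int i = 2;
    while (i <= N) {         // block 2: comparison that decides the branch
        result = result * i; // block 3: loop body, returns to the comparison
        i = i + 1;
    }
    return result;           // block 4: code after the while loop
}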
Middle-end The middle-end compiler has two main tasks: (1) canonicalize the various ways of
representing the code into predictable patterns removing unnecessary variations and (2) improve
the performance via a series of optimizations. Some middle-end optimizations are completely
hardware-agnostic, and others need information about the back-end hardware, such as multi-
threaded parallelization and SIMD vectorization. Figure 8.4 illustrates an example optimizing
the expression c = a + a as c = a << 1, where the operator << left-shifts a by one bit, which is
equivalent to multiplication by 2.
The optimizer typically performs a series of distinct optimization passes on the IR. LLVM
does around 150 passes. GCC and LLVM use different algorithms to traverse the IR iteratively.
While the order of optimizations affects the end result, strict rules to determine the optimal
order do not exist.
In general, there are three common compiler optimization parts: legality analysis, prof-
itability analysis, and transformation. Legality analysis makes sure the transformation does not
break the program. Profitability analysis uses a cost model to determine if the optimization is
beneficial and searches for parameters to perform the optimization. Finally, the transformation
performs the actual modification of the code.
Back-end The back-end compiler lowers the IR onto the target ISA and performs hardware-
dependent optimizations. These include instruction selection, instruction scheduling, and mem-
ory and register allocation.
The output from the back-end compiler is machine code in an assembly file or object file.
The linker takes the object file(s) and dependent libraries to generate an executable file.
Intrinsic functions There are some constructs, such as vectorization with SIMD instructions,
that a high-level language may not address. In these cases, intrinsic functions provide a way
for the programmer to use such constructs. An intrinsic function is a function available in a given
language whose implementation is handled specially by the compiler, which maps and optimizes
the intrinsic function for a back-end target. Typically, the compiler substitutes a sequence of
instructions for the intrinsic function call. Some intrinsic functions are portable, and others are
target specific.
An intrinsic function provides a compromise between transparent integration inside a
C/C++ function and writing full inline assembly: most intrinsics map directly to an ISA
instruction, while the compiler takes care of register allocation. GCC, for instance, implements
intrinsics for C/C++ that map directly to the x86 SIMD instructions.
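For example, a minimal sketch using the x86 AVX intrinsics that GCC and Clang expose through immintrin.h (the function and array names are illustrative):

#include <immintrin.h>

// Adds eight fp32 values per iteration using 256-bit AVX registers. Each
// intrinsic maps to an x86 SIMD instruction; the compiler handles register
// allocation.
void add_fp32(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)   // scalar remainder loop
        c[i] = a[i] + b[i];
}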
8.3 LLVM
LLVM originally stood for low-level virtual machine (albeit with no relationship to what most
current developers today think of as virtual machines) since the low-level LLVM IR code targets
a universal theoretical machine (hence the original term virtual) and compiles for a variety of
architectures [LA04]. While the concept is still accurate, LLVM is now the full name and no
longer an acronym. LLVM is a brand for an umbrella project applied to the following:
• LLVM IR
• LLVM Core
• LLVM debugger
• LLVM implementation of the C++ standard library
• LLVM foundation
In this section, LLVM refers to the LLVM Core, a middle-end and back-end compiler program
written in C++.
Figure 8.5: LLVM is designed as a set of modular compiler components supporting various
front-end languages and back-end hardware targets.
Figure 8.6: Many languages use a higher-level domain-specific IR for domain-specific optimiza-
tions before lowering to the LLVM IR. Based on [LP19].
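The listing below is a minimal LLVM IR sketch matching the description that follows (the i32 types are an assumption); the function @f is declared on line 1, and the function @p is defined starting on line 3:

declare i32 @f(i32 %z)

define i32 @p(i32 %a, i32 %b) {
entry:
  %0 = mul i32 %a, %b
  %1 = call i32 @f(i32 %0)
  %2 = mul i32 %0, %1
  ret i32 %2
}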
In line 1, the function @f with value %z is declared. In line 3, the function @p with integer
arguments %a and %b is defined. %0 equals the product of %a and %b; %1 equals the returned
value of function @f with argument %0; %2 equals the product of %0 and %1; and %2 is the
returned value.
Figure 8.7: GCC can be used for the front-end, middle-end, and back-end compilation.
A phi node is an instruction used to merge multiple control-flow paths and multiple
definitions of a variable, selecting which definition to use. In the CFG, the phi instruction,
when used, is always at the start of a basic block. The phi node has multiple pairs of operands;
each pair consists of a value and a reference to a basic block. The referenced basic blocks are the
immediate predecessors of the basic block in which the phi instruction is located.
GCC and LLVM differ in several respects, including:
• Adoption: GCC has larger adoption; both have a large community of developers.
• License: GCC's GPL license requires developers who distribute extensions or modi-
fied versions of GCC to make their source code available, unlike LLVM's Apache 2.0
license.
8.4 HARDWARE-INDEPENDENT OPTIMIZATIONS
The overarching goal of hardware-independent optimizations is to reduce memory accesses and
reduce the number of operations. To that end, the following optimization passes are common.
In DL, some of these optimizations are referred to as graph compilations, and the most
important is operator fusion.
Operator fusion merges operators (also known as graph nodes) to reduce memory ac-
cesses by not having to save the intermediate results in memory. It is applicable when the opera-
tors have compatible loop patterns with continuous (called coalesced in GPU parlance) memory
access. To illustrate, a fused sigmoid operator (see Figure 2.1) computes the exponentiation, ad-
dition, and division components, keeping the intermediate results in local caches or registers and
only saving the final result to memory.
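A minimal C sketch of the difference (the function names are illustrative):

#include <math.h>

// Unfused: every element-wise operator writes its intermediate result to memory.
void sigmoid_unfused(const float *x, float *tmp, float *y, int n) {
    for (int i = 0; i < n; i++) tmp[i] = expf(-x[i]);
    for (int i = 0; i < n; i++) tmp[i] = 1.0f + tmp[i];
    for (int i = 0; i < n; i++) y[i] = 1.0f / tmp[i];
}

// Fused: the exponentiation, addition, and division stay in registers;
// only the final result is written to memory.
void sigmoid_fused(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = 1.0f / (1.0f + expf(-x[i]));
}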
Fused operators require that either a primitive library, such as oneDNN, MIOpen,
or cuDNN, or a back-end compiler provides or generates an optimized fused primitive
to get the performance benefit. Thus, fusion is not entirely device-independent. Note that operator
folding is a hardware-dependent operator fusion pass discussed in Section 8.5.
The types of operator fusions are:
• element-wise operator with another element-wise operator, for instance, the multiple
element-wise operators in a sigmoid function;
• element-wise operator with a reduction operator, for instance, in the softmax function;
and
• matrix-wise operator with an element-wise operator.
An example of the last bullet is a convolution or a GEMM operator fused with an activa-
tion function that operates on each element of the tensor, such as convolution followed by ReLU.
The activation function is applied immediately after the output tensor value from the convolu-
tion is computed, and while this value is still in a register or scratchpad. Some of the fusion
operators supported by TensorFlow’s built-in compiler, Grappler (introduced in Section 9.2.7),
are:
• Conv2D + BiasAdd + <Activation function>
• Conv2D + FusedBatchNorm + <Activation function>
• MatMul + BiasAdd + <Activation function>
• FusedBatchNorm + <Activation function>
As an example of the fusion benefits, Intel reported around an 80× performance gain for batch
size 1 when fusing group convolutions in the MobileNet v1 model [SPE19]. In group convolution
(introduced in Section 3.2.1), the different feature channels across a data batch are divided up
into groups processed independently. The fused group convolution is jointly processed as a single
DL operator, as shown in Figure 8.8.
Figure 8.8: (a) A group of convolutions used in MobileNet v1. (b) A fused operator can be jointly
optimized for the entire group. Based on [SPE19].
Loop permutations modify loop indices to improve memory access. Some permutations,
such as loop tiling, are target-dependent and are discussed in the next section. An example
of permutation is interchanging for loops, as shown in the following code. The indices are
interchanged to have coalesced memory accesses, which are faster than strided memory access.
// before loop permutations
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        x[j][i] = y[j][i]; // strided memory access
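The interchanged version (a sketch of the corresponding optimized loop):

// after loop permutations
for (j = 0; j < M; j++)
    for (i = 0; i < N; i++)
        x[j][i] = y[j][i]; // coalesced memory access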
Arithmetic simplifications reduce the number of expressions and simplify the code.
Examples include these replacements:
• a·x + b·x + c·x ⇒ (a + b + c)·x
• !(x < y) ⇒ x ≥ y
• 2·x ⇒ x << 1 (for unsigned integers)
• x − x ⇒ 0
• x − 0 ⇒ x
• (x · 2) − x ⇒ x
• AᵀBᵀ ⇒ (BA)ᵀ
• (AᵀBᵀ)ᵀ ⇒ BA
The last two items are known as transpose eliminations, which are a subset of arithmetic sim-
plifications. Some of the simplifications can lead to numeric differences compared to the original
expression. Still, these differences are generally small and can be safely ignored in DL.
During inference, the batch normalization expression can be incorporated into the convo-
lution expression by scaling the weight values, as detailed in Section 2.6. While this is sometimes
referred to as a fused operator, this optimization is an arithmetic simplification.
Constant propagation and constant folding substitute (propagate) known constant val-
ues in the expressions, and precompute (fold) constant expressions. Examples include these re-
placements:
• 3 · 4 ⇒ 12
• x = 2; y = 3 · x ⇒ y = 6
Dead code elimination (DCE) eliminates unused code. In the following code samples,
the if expression is eliminated. Note that a has to be an integer (not a float).
// before constant propagation and DCE
int a = 0;
if (a)
    mycode();
// after DCE
int a = 0;
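Common subexpression elimination (CSE) computes a repeated (common) subexpression once and reuses the result. A sketch of the corresponding code before CSE (the variable names follow the snippet below):

// before CSE
c = a + b
d = a + b
e = c + d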
// after CSE
c = a + b
d = c
e = c + d
Inlining, also known as inline expansion (not to be confused with the unrelated C++
inline keyword), moves the code of the called function into the calling function. It saves the over-
head of procedure calls and allows further optimizations at the calling function at the expense of
a larger executable file and, therefore, longer load time and increased pressure on the instruction
cache. A toy example follows:
// before inlining
void myFunction(int x) {
    printf("%d\n", x);
    printf("%d\n", x * x);
}
myFunction(a);
myFunction(b);
// after inlining
printf("%d\n", a);
printf("%d\n", a * a);
printf("%d\n", b);
printf("%d\n", b * b);
Note that inlining wrapper functions does not affect the size of the executable.
Loop-invariant code motion (LICM), also called hoisting or scalar promotion, moves
out expressions that are not required to be in the loop.
Memory to register promotion tries to promote memory references to be register refer-
ences in order to reduce the number of memory loads and stores. The front-end and middle-end
compilers assume an unlimited number of registers. Register assignment happens in the back-
end compiler and is hardware-dependent.
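8.5 HARDWARE-DEPENDENT OPTIMIZATIONS
Hardware-dependent optimization passes use knowledge of the target, such as its memory hierarchy and vector instructions. Loop tiling (also called loop blocking) partitions a loop's iteration space into smaller tiles so that the data accessed within a tile is reused from the local caches or registers; it is typically combined with a loop permutation, as in the following two-step example: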
// Step 1: Tiling
for (i = 0; i < N; i++)
    for (jj = 0; jj < M; jj += TILE)
        for (j = jj; j < jj + TILE; j++)
            operation(x[i], y[j]);
// Step 2: Permuting
for (jj = 0; jj < M; jj += TILE)
    for (i = 0; i < N; i++)
        for (j = jj; j < jj + TILE; j++)
            operation(x[i], y[j]);
The optimal stencil (tile size) is unique to each microarchitecture and is a parameter the
compiler has to select, adding complexity to the solution space. One algorithm to facilitate the
selection is the Cache-Oblivious Recursion algorithm [FLP+99].
Polyhedral compilation is a technique that produces a set of loop transformations used
for efficient code generation. Note that some of the polyhedral transformations are hardware-
independent. A polyhedral representation specifies the boundary of a polyhedron (the index
space of a tensor expression). The polyhedral-based compilations provide a set of (usually affine)
loop transformations, such as loop tiling, to facilitate efficient code generation on a hardware
target.
Polyhedral compilation techniques are conventional in HPC and image processing. The
challenge is that they require NP-complete algorithms, such as integer linear programming (ILP)
solvers, or other exponential algorithms, which limits scalability.
An affine representation is a simplified polyhedral representation with for loops and if
control structure ops. An affine transformation applies a unique affine function to each element
of a tensor and preserves the dimensions of the tensor. An affine compilation does not require
the use of ILP or any other NP-complete algorithms. The DL compilers PlaidML, TVM, and
MLIR dialects, such as LinAlg and Affine, use polyhedral-based (typically, affine-based) loop
transformations. Chapter 9 covers these compilers.
Data layout, also known as memory format, memory layout, or tensor layout transforma-
tions, modifies the data layout so it is efficiently accessed. As reviewed in Section 2.3, standard
data layouts used by the main frameworks are NCHW or NHWC, and RSCK or KCRS for the
weight tensors. These data layouts are referred to as plain formats or native formats (native or
default to the DL framework).
Data in memory is arranged as a 1D vector. The NCHW format means the width values
are the innermost dimension and are adjacent in memory. The memory index offset for a given
index (n, c, h, w) ∈ N × C × H × W is
offset(n, c, h, w) = n·CHW + c·HW + h·W + w.
TensorFlow and PyTorch natively support both NCHW and NHWC; PyTorch defaults to NCHW,
and TensorFlow defaults to NHWC. ONNX only supports NCHW. FBGEMM and the Quantized
Neural Networks PACKage (QNNPACK) support NHWC but not NCHW. LIBrary for eXtra
Small Matrix Multiplies (LIBXSMM) supports both but is optimized for NHWC.
The data layout can be modified to achieve better reuse from cache (also known as local
memory in AMD GPUs or shared memory in Nvidia GPUs), scratchpad, and registers to use
SIMD, SIMT, or dataflow instructions more effectively. To illustrate, one of the layouts used by
oneDNN for CPU architectures with 512-bit registers and fp32 values is the 5D blocked tensor
layout NĈHW16c, where Ĉ = ⌈C/16⌉. This format blocks (tiles) the channel dimension in
blocks of 16 to fit into a 512-bit (16 fp32 values) register. The memory index offset, using the
NĈHW16c layout, is
offset(n, c, h, w) = n·CHW + ⌊c/16⌋·16HW + h·16W + w·16 + (c mod 16),
where ⌊·⌋ is the floor operator. Using this layout format, the data is fed as 16 consecutive fp32 val-
ues into a register from the same n, h, w indices but different channels and processed in parallel
using SIMD instructions. A channel size that is a multiple of 16 is beneficial for this blocked format.
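The following C sketch computes both offsets (function names are illustrative; the blocked version assumes C is a multiple of 16):

#include <stddef.h>

// Offset into a plain NCHW tensor stored as a flat 1D array.
size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W) {
    return n * C * H * W + c * H * W + h * W + w;
}

// Offset into the blocked layout that tiles the channels in blocks of 16,
// so 16 consecutive fp32 values from different channels fill one 512-bit register.
size_t offset_blocked16(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
    return n * C * H * W + (c / 16) * 16 * H * W
         + h * 16 * W + w * 16 + (c % 16);
}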
The cuDNN primitive library typically uses the NCHW layout. However, newer GPUs,
such as the V100, prefer the NHWC layout for fp16 computations with C being a multiple of 8 to
use the available tensor cores efficiently. Padding the channels with zeros to the desired size can
improve the computational efficiency. Note that TensorRT supports blocked formats to achieve
the highest performance on some workloads.
Depending on the operands, different layout strategies result in better performance. For
instance, the convolution function potentially uses three different tensor layout strategies de-
pending on the operand sizes:
2. one layout for operands with a large number of input activations; and
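Loop unrolling replicates the body of a loop a fixed number of times (the unrolling factor) to reduce branch overhead and expose more instruction-level parallelism.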
When the number of loop iterations is not known until runtime, an AOT compiler can generate
several versions of the loop with different unrolling factors, or alternatively, a JIT compiler can
be used.
Loop splitting splits the loop iterations into multiple loops if the iterations are not de-
pendent on each other and can execute in parallel. A toy example follows:
// Before loop splitting
for (i = 0; i < 100; i++)
    printf("Iteration %d\n", i);
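A sketch of the corresponding split loops (the 50/50 split point is an arbitrary choice):

// After loop splitting: the two halves can execute in parallel
for (i = 0; i < 50; i++)
    printf("Iteration %d\n", i);
for (i = 50; i < 100; i++)
    printf("Iteration %d\n", i);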
Loop fission, also known as loop distribution, splits the body of a loop if the components
are not dependent on each other and can execute in parallel. Note that the reverse is called loop
fusion, which unites multiple loops into a single loop. To illustrate:
// Before loop fission
for (i = 0; i < 100; i++) {
    a[i] = 3 * i;
    b[i] = 4 * i;
}
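The distributed version (a sketch of the omitted counterpart):

// After loop fission: the two loops are independent and can execute in parallel
for (i = 0; i < 100; i++)
    a[i] = 3 * i;
for (i = 0; i < 100; i++)
    b[i] = 4 * i;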
In this chapter, we reviewed the basics of programming languages and compilers that map high-
level languages to machine code. We highlighted standard compiler optimization passes to accel-
erate the execution of DL models, particularly fusing element-wise operations into dense linear
operations. Compilers are imperative for the success of dedicated DL processors; manually opti-
mizing a model to perform well on a back-end target is extremely costly and not scalable across
several targets. In the next chapter, we discuss prominent DL compilers used by hardware
vendors and hyperscalers.
CHAPTER 9
Frameworks and Compilers
Desirable capabilities of a DL compiler include:
• High- and low-level optimizations that are reusable across front-end frameworks and
back-end targets.
• Strongly-typed tensors; that is, tensors with a known static shape and element type,
such as fp32, fp16, bf 16, s8, u8, or bool.
• Dynamic shapes.
The most prominent DL compilers (outside of the frameworks’ built-in graph optimizers)
are TVM, XLA, Glow, PlaidML, and various MLIR dialects (MLIR is a compiler infrastructure
that supports various IRs or dialects and compiler passes). These compilers are written in C/C++
for speed and portability. While TVM is the most mature compiler today, all compilers are still
in their infancy and have limited adoption in industry. This is likely to change in the next few
years with the wave of DL hardware starting to hit the market, which increases the market
demand for robust compilers. Table 9.1 provides a summary of key features from each of the
main DL compilers outside the default framework compilers. Other less prevalent compilers
are taco, Tensor Comprehension, DLVM, Weld, and Diesel. Sections 9.4–9.9 discuss these
compilers and their adoption in industry.
While DL compilers aim to support multiple front-end frameworks, they are often de-
veloped by a team related to an existing framework and focus first on that framework: in
particular, XLA and MLIR dialects with TensorFlow, Glow with PyTorch, and TVM with
MXNet. Nevertheless, compilers are expanding their front-end support.
Grappler (TensorFlow's built-in graph optimizer), PyTorch JIT, XLA HLO, and Glow
compilers strive to optimize away the inefficiencies introduced by the user's program via target-independent
optimizations. They rely on a primitive library (such as cuDNN, MIOpen, oneDNN, or Eigen)
or another compiler for target-dependent optimizations. PlaidML, various MLIR dialects, and
TVM support target-independent and dependent optimizations and back-end code-generation.
In the remainder of this chapter, we review the DL frameworks with a particular focus on
TensorFlow and PyTorch, which have built-in graph optimizers and schedulers to execute the
computation graphs. We also describe in more detail the prevalent DL compilers.
9.1 FRAMEWORKS
DL libraries or frameworks provide the programmer tools to define, train, and deploy models.
Frameworks abstract many of the mathematical and implementation details. For instance, they
contain functions or modules to differentiate a model with respect to a cost function (compute its
gradients), so the programmer does not have to code the gradient computations. While the com-
putational performance across the frameworks varies depending on the optimization techniques
exploited, the statistical performance of the models trained across frameworks is essentially the
same; they implement essentially the same mathematical algorithms.
Frameworks compile the program to a graph and optimize the graph. The nodes are im-
plemented using C++, CUDA, or using a precompiled target-specific implementation available
in a primitive library. Frameworks may also use a DL compiler to improve execution efficiency.
The most popular frameworks are TensorFlow developed by Google and PyTorch de-
veloped by Facebook, both written in C++ and have a Python wrapper. TensorFlow is the
most popular framework in the industry and the second most popular in academia. PyTorch
is the most popular framework in academia, the second most popular in the industry, and the
fastest-growing framework [Lor19]. Other frameworks used in industry but (based on Google
Trends) with limited adoption outside the companies that developed them are Apache MXNet,
PaddlePaddle, and Flax/JAX. Amazon (in collaboration with the University of Washington and
Carnegie Mellon University) developed MXNet, Baidu developed PaddlePaddle, and Google
developed Flax/JAX (primarily for research). Flax provides high-level functions on top of JAX,
a JIT compiler that uses Autograd and XLA for differentiation and executes NumPy code on
CPUs, TPUs, and GPUs [Jax20]. NumPy is a library for Python for multidimensional tensor
operations.
TensorFlow and PyTorch offer two programming paradigms: imperative programming
and declarative (symbolic) programming. Imperative programming performs the computations
as they run, and declarative programs separate the definition of the various expressions in the
program from the execution. Gluon and the standard MXNet front-end, respectively, also adopt
these paradigms.
In the remainder of this section, we provide a brief history and adoption of various frame-
works. We discuss imperative and declarative programming styles and their tradeoffs as well as
dynamic and static programming.
9.2 TENSORFLOW
TensorFlow is an open-source library, written in C++, developed by Google with several con-
tributors outside of Google. It was released in November 2015 and has become the most popular
framework in the industry. It supports over a thousand different operators [SL19]. In addition
to Python, TensorFlow supports other language APIs (some maintained by the broader com-
munity at various degrees of support), including Swift, Julia, C++, Scala, Java, JavaScript, Rust,
and Go. Models trained by TensorFlow can deploy across various inference engines.
TensorFlow v1 is designed as a declarative programming style library [ABC+16]. Pro-
grammers construct an AST (the graph), usually in Python using a low-level API, and then
compile and interact with the graph using a TensorFlow session. However, this low-level API
has a steep learning curve and does not let the programmer use native Python control-flow or
debuggers. TensorFlow v1 uses control-flow nodes, such as loop condition, switch, and merge
nodes to represent data flow, which increases the complexity of pattern matching required for
optimizations [YAB+18]. To facilitate v1 usage, higher-level libraries and APIs were developed,
such as TFLearn, Slim, SKflow, and Keras. TensorFlow v1 is under maintenance mode, and all
new work is going into TensorFlow v2.
The most notable changes from TensorFlow v1 to v2 are: (1) the Keras APIs are default,
(2) eager execution is default, and (3) improved organization for APIs, functions, and names-
paces. TensorFlow provides a conversion tool to port the code from v1 to v2. To help determine
whether an online document or code sample refers to v1 or v2, note that v1 uses the following
objects not present in v2: tf.enable_eager_execution, session.run, tf.placeholder, and
feed_dict.
The remainder of this section is as follows: We introduce the Keras APIs infrastructure,
the Estimator API, and the tools to convert a dynamic graph constructed in Eager-style code
to a static graph using @tf.function and AutoGraph. We highlight the tools for distributed
training, the TensorBoard visualization tool, the Profiler tool, and the compilation TensorFlow
infrastructure. Other TensorFlow libraries and tools with some adoption in industry are Tensor-
Flow Hub, TensorFlow Extended (TFX), TensorFlow Lite (TFLite), and TensorFlow Proba-
bility (TFP). TensorFlow Hub provides an extensive service of prebuilt models; end-users can
fine-tune them or use them as preprocessing layers (such as some of the embeddings available).
TFX is an end-to-end series of connected libraries used to deploy DL pipelines; specifically, TFX
provides the critical parts of the DL pipeline except for the model building and training (which
is core TensorFlow). TFLite is a lite framework for on-device inference. TFP is a library for
probabilistic reasoning and statistical analysis.
graphs (DAGs), where each layer may have multiple tensor inputs or outputs, shared layers, or
nonsequential data flow, such as in residual connections. The Keras Subclassing API is used
for imperative programming; the programmer defines a new class that inherits and extends the
Keras Model class defined by the framework. This class imperatively defines a function with the
model and a function with the forward pass (the backward pass is generated automatically). The
low-level API from TensorFlow v1 is still available to use in TensorFlow v2.
We recommend using the Keras Subclassing API as it provides flexibility to develop and
experiment with any type of model, including dynamic models. Also, it has a similar program-
ming style to PyTorch, which can facilitate using both frameworks (it is not uncommon for
different engineers in the same company to use one or the other).
9.2.4 ESTIMATOR
TensorFlow v2 keeps the Estimator API (including premade Estimators), another high-level
TensorFlow API introduced in v1. Premade Estimators provide preimplemented, ready-to-use
model functions for training and inference, such as Linear Classifier, DNN Classifier, Combined
DNN Linear Classifier (Wide & Deep models), and Gradient Boosted Trees. Note, however,
that using the Keras API is recommended over Estimators.
With tf.distribute.Strategy in TensorFlow v2, the distribution toolkit was rewritten to
build on the low-level parts of the library. Likewise, tf.data's distributed-by-default approach
in v2 makes much of the metaprogramming in Estimators unnecessary.
9.2.5 TENSORBOARD
TensorBoard displays the graph, embeddings, and tensor distributions. It plots cost values during
a run, which helps determine convergence and facilitates debugging. TensorBoard also compares
various models and costs across training runs. In addition, TensorFlow enables the programmer
to visualize the graph using keras.utils.plot_model, and model.summary() to get the de-
scription of the layers, weights, and shapes.
Figure 9.1: The TensorFlow IR GraphDef is optimized by Grappler and passed to other com-
pilers for additional optimizations. Based on [Goo19].
9.2.6 PROFILER
Profiler tracks the performance of models and hardware consumption (time and memory) for
the various operators. It can be used during training and inference to resolve performance bot-
tlenecks and improve a model’s performance on a CPU or GPU.
9.3 PYTORCH
PyTorch is an open-source Python library for tensor computations similar to NumPy but with
GPU support. It has built-in automatic differentiation and APIs for training and inference ap-
plications. PyTorch is maintained by Facebook with multiple contributors outside of Facebook.
It was released in October 2016. It is the most popular framework in academia, the second most
popular framework in the industry, and the fastest-growing framework [Lor19].
PyTorch v0 was designed as an imperative programming style library to facilitate research
and development. For production-scale where performance is critical, Facebook developed the
open-source Caffe2 graph-based execution library in April 2017. Facebook’s servers and mo-
bile app used Caffe2. To better interface between PyTorch v0, Caffe2, and other frameworks,
Facebook partnered with Microsoft and later with other companies to develop the Open Neural
Network Exchange (ONNX) format released in Sep. 2017. ONNX provides a standard format
for various frameworks to exchange (export and import) extensible computation graph models
for inference and, thus, streamline the path from research and development to production. A
model would be developed and trained in PyTorch v0, exported to ONNX, and then imported
into Caffe2 for production at scale.
PyTorch v1 (released in December 2018), hereafter referred to as just PyTorch, merges
PyTorch v0 and Caffe2. PyTorch enables switching models from eager (imperative) mode to
graph execution (declarative) mode, which further streamlines the path from research and de-
velopment to production. Programmers develop, debug, and test their models in eager mode.
They then migrate the models to graph mode for graph optimizations and may export a non-
Python representation for scaled production in servers, mobile, or other platforms. Other key
additions to PyTorch are a C++ API, JIT compilation, and a distributed library across Python
and C++ environments.
PyTorch computation graphs are dynamic. PyTorch keeps track of the operators per-
formed and builds a computation graph behind the scenes. Every time the programmer adds
a layer, PyTorch rebuilds the computation graph. Automatic differentiation uses this computa-
tion graph.
PyTorch GPU expressions execute asynchronously, meaning the expressions can run in
the GPU and synchronize with the CPU host when necessary, such as when copying data be-
tween host and device, or between devices. This synchronization is invisible to the programmer.
For debugging, it may be useful to force synchronous execution to trace an error.
PyTorch supports x86/64, Arm, and POWER CPUs and Nvidia GPU back-end targets.
Support for other platforms is available via Glow. Google and Facebook added a PyTorch front-
end to XLA to enable PyTorch programs to run on TPUs [She18].
Figure 9.2: PyTorch can be executed in Eager mode via the Python runtime, or in JIT mode
via TorchScript, tracing, or both to generate a complete graph representation. This graph is
optimized and then executed. Each expression is executed with the ATen library.
torch.distributed supports distributed training across multiple nodes using NCCL for
GPUs and Gloo or MPI for CPUs.
torch.utils supports data loading and TensorBoard visualization (discussed in Sec-
tion 9.2.5).
9.4 TVM
TVM is an Apache incubator project and an end-to-end DL compiler stack for automatic code-
generation across various hardware targets [CMJ+18]. TVM was developed by Tianqi Chen et
al. at the University of Washington (UW). The project has several contributors from UW, Ama-
zon Web Services (AWS), Qualcomm, Facebook, Google, Huawei, AMD, Microsoft, Cornell
University, and University of California, Berkeley [Tvm19].
The TVM stack has two main levels of abstraction: a graph compiler and an operator-
level compiler. TVM takes as input a model from MXNet, PyTorch/TorchScript, TensorFlow,
Keras, CoreML, ONNX, and DarkNet and compiles it to the Relay IR (also known as NNVM
v2) [Tvm19]. TVM is tightly integrated with MXNet with modules shared between the projects;
both projects started at UW as part of the Deep Machine Learning Community (DMLC). The
Relay IR is a statically-typed, complete (purely functional), modular, and extensible program-
ming language. Relay provides common DL primitives, auto-differentiation, and mathematical
optimizers.
TVM performs high-level graph optimization, on the Relay IR and then compiles into a
low-level specification language called a tensor expression (TE). This language declaratively speci-
fies the tensor operands, their shapes, and the operators, but the execution details are unspecified;
thus, TVM decouples the definition of the expression from its execution. TVM borrows this
decoupling idea from the Halide programming language [CMJ+18].
TVM defines a space of functionally-equivalent schedules for a TE and a given target.
The space of schedules includes various loop transformations, cache localities, and vectorization
strategies; a TE potentially has billions of schedules from all the possible combinations. A matrix
multiplication TE can result in schedules with vanilla loops (see Algorithm 2.1), tiled loops,
and accelerator intrinsics. Improving the constraints on the space of schedules is an important
research area.
TVM borrows scheduling algorithms from Halide for CPUs and incorporates new algo-
rithms for GPUs and accelerators. For a GPU and TPU-like accelerator, the space of schedules
includes various strategies for thread cooperation and shared memory across the compute units.
The space of schedules is usually the largest for a TPU-like accelerator. It includes hardware in-
trinsics for high-dimension tensor expressions and a hierarchical memory system with memory
buffers and instructions for memory access. TVM uses a description of the hardware interface
to narrow the scheduling space.
A goal of TVM is to automatically search over this space to obtain an efficient program
configuration for a TE for a particular hardware target. One naive approach is to randomly
sample the scheduling space, test each schedule on the target hardware, and return the sampled
program configuration with the minimum runtime. Instead, TVM uses a simulated annealing
algorithm to search the space of schedules, and AutoTVM, an ML-based performance predictor,
to predict the runtime of a schedule without executing the schedule on the actual hardware.
AutoTVM learns a model that predicts the runtime of a schedule using an XGBoost al-
gorithm, which is a computationally inexpensive ML algorithm. AutoTVM can be orders of
magnitude faster than actual hardware runtime measurements [CG16]. Thus, this allows evalu-
ating orders of magnitude more schedules and discovering a better one. Learning this model re-
quires collecting training data using a dataset of schedules and measured runtime pairs. Transfer
learning techniques can be used with new hardware or new TEs to reduce the required amount
of training data.
The selected schedules are compiled using LLVM for CPUs, CUDA, OpenCL, or Metal
for GPUs, or another back-end compiler for an accelerator. The compiled code is placed in a
library with function pointers, and a higher-level program allocates input and output buffers and
calls these functions during execution. TVM supports various deployment languages, including
C++, Python, and Java.
The versatile tensor accelerator (VTA) is an open-source accelerator with an open-source
microarchitecture and a software stack tightly integrated with TVM that can be prototyped on
an FPGA or simulated on a laptop. Thus, VTA can facilitate the experimentation of custom
optimizations across various back-end targets.
9.5 PLAIDML
PlaidML is an open-source (as of Aug. 2017) compiler stack originally developed by Vertex.AI
and, as of Aug. 2018, maintained by Intel. PlaidML consumes a high-level static graph, such as
ONNX, or others, and generates optimized code for various back-end targets. The most matured
targets are GPUs and Movidius VPUs.
The PlaidML framework automatically generates efficient primitives from polyhedral
tensor expressions, transforming graph-level operations requested by the graph compiler into
optimized device-specific implementations. PlaidML compiles a high-level IR into target-
dependent code: The high-level IR is mapped to the Tile IR using the Tile language capable
of describing DL expressions. Like TVM’s tensor expression, the Tile language is a differen-
tiable DSL that represents mathematical formulas for the tensor expressions, and it is hardware
agnostic.
A general polyhedral model allows for complex data dependencies. However, in a Tile
contraction (a reduction operator that merges values across one or more indices), the only data
dependency is in the aggregation. Tile only uses commutative and associative aggregation oper-
ations, so this dependency is only mildly restrictive. This narrow focus allows Tile’s optimization
to be more useful than general-purpose polyhedral optimizers.
The Tile IR lowers to a hardware-agnostic Stripe IR [ZB19]. The Stripe IR is then com-
piled via a series of hardware targeted optimizations and lowered to a hardware abstraction layer,
accelerator runtime, or other hardware-appropriate code.
The Stripe IR uses hardware descriptions to constrain the optimization space using an
affine tensor space. Stripe determines the optimal loop tiling and other loop permutations to
reuse data across the memory hierarchy for a specific back-end target. The loop tiling param-
eters are selected based on hardware descriptors and adjusted via profile-guided optimizations.
Stripe then produces an execution schedule for each primitive and inter-primitive data depen-
dencies, including data movement instructions. PlaidML optimizations are also incorporated as
an MLIR dialect.
9.6 GLOW
Glow (an abbreviation for Graph-lowering) is a DL compiler stack used for inference and train-
ing (the inference stack is more mature). The Glow compiler project is maintained by Facebook
with committed support from Intel, Cadence, Esperanto, Marvell, Qualcomm, Bitmain, STMi-
croelectronics, Synopsys, and Ceva [Fac20].
Glow is designed to compile a high-level graph supporting many operators to a low-level
graph supporting a small number of linear algebra operators [Fac18]. The compiler passes can
be shared across the various hardware targets. A separate hardware back-end compiler then
consumes the low-level IR and generates executable code.
Glow takes as input a model from PyTorch’s TorchScript or constructed via the C++
interface and compiles it to a high-level IR graph. Target-independent optimizations, such as
automatic-differentiation and quantization to 8-bit integer if required, are applied to this high-
level graph. Note that Glow does not use a polyhedral model as this has a long compilation time,
which is not acceptable for JIT.
Glow compiles the high-level IR to a low-level instruction-based address-only (operands
are typed pointers to buffers) IR via two lowerings. The first lowering decomposes the graph op-
erators into convolution nodes and linear algebra operator nodes. For instance, a fully connected
layer is transformed into a matrix multiplication node followed by a broadcasted add node (for
the bias). Additional optimization passes occur on this mid-level IR. This graph is not SSA and
is organized as a sequence of nodes with no control-flow.
The second lowering transforms the linear algebra nodes into a low-level instruction-
based, address-only strongly-typed IR, known as IRGen. These instructions operate on tensors
and are referenced by a hardware-independent address. The IRGen compiler passes determine
the required memory allocation for these tensors and the possible in-place computations. The
goal of this low-level IR is to facilitate optimizations by the back-end compiler.
The back-end compiler can consume either the mid-level or low-level IR (IRGen). It per-
forms tensorization and code-generation for the specific hardware target. The back-end compiler
may implement additional IRs with control-flow for low-level IR instructions, such as convo-
lution.
Glow provides a CPU reference implementation to verify an accelerator’s correct function-
ality. For CPU, Glow uses the LLVM compiler to optimize and generate code. The low-level
IR can be AOT compiled (since the shapes and types of all the tensors are known) into machine
code object files. These files are linked to some application with no further dependence on Glow
(this is important for environments with limited memory, such as mobile devices). Alternatively,
the low-level IR can execute code in JIT mode using a library of precompiled LLVM bitcode
linear algebra micro-kernels written in C called libjit.
9.7 XLA
The Accelerated Linear Algebra (XLA) is a graph compiler developed and maintained by
Google. XLA is used with TPUs, CPUs, and GPUs, and can be extended to other back-end
targets. XLA is tightly integrated with TensorFlow and also supports PyTorch/Trace and Julia.
The TensorFlow APIs let the programmer explicitly invoke the XLA compiler on a subset
of the TF graph (or the entire graph, if possible). The tf2xla compiler maps the TensorFlow
subgraphs to the XLA High-Level Optimizer (HLO) IR. XLA decomposes the XLA HLO
ops into basic functions, including element-wise ops, specialized NN ops (such as convolution),
data layout reshape ops, control-flow ops, and data transfer ops [Goo20g]. Then, XLA fuses
ops to reduce memory access overhead [Goo20c]. This optimized HLO IR maps to a back-
end compiler for target-dependent optimizations and code-generation. XLA uses the LLVM
compiler for code-generation on CPUs and GPUs, and a TPU compiler for TPUs. While XLA
is a JIT compiler, it also provides AOT executable codegen compilation for some back-end
targets, such as CPUs.
In practice, XLA works well for a defined set of primitives, but supporting custom prim-
itives can be a challenge [SL19]. This limits the adoption of XLA in the research community,
where experimentation with new operators is common. In addition, XLA cannot compile ten-
sors with dynamic shapes [BCD+18].
9.8 MLIR
One effort to improve the TensorFlow infrastructure and reduce the duplication of optimiza-
tions is the Multi-Level IR (MLIR). It was released in April 2019 by Google as a TensorFlow
project, and later adopted as an LLVM project. While the initial front-end framework is Ten-
sorFlow, other frameworks can use it.
MLIR is a flexible ML SSA-based, typed-language, multilevel IR compiler infrastructure.
MLIR is not a compiler but a compiler infrastructure; standard optimizations can be shared
across the various levels of abstractions. It borrows many ideas from LLVM IR, both designed
by Chris Lattner and other contributors, and has a library of optimization and compiler utilities.
It has a flexible type system and supports dynamic tensor shapes and ranks. MLIR enables
optimizations across various levels of abstractions from high-level optimizations with better
control-flow representation to low-level compilers and executors that generate target machine
code. The MLIR structure resembles the LLVM structure with modules, functions, blocks, and
operations (note that in LLVM parlance, these are called instructions rather than operations,
and in TVM parlance, expressions). MLIR operations are the basic unit of MLIR code.
Unlike LLVM, in MLIR the optimization passes are implicitly multithreaded.
MLIR IRs are called dialects. A dialect has a defined set of operations with input and
output types and can express different levels of abstraction. Examples of dialects are the Ten-
sorFlow IR, XLA HLO, TFLite, Affine, and LLVM IR, and exclusively for GPUs: NVVM,
SPIR-V, and ROCm. The Affine dialect is a simplified polyhedral model with for-loop and
if control-structure ops [Llv20]. A dialect provides invariants on its operations and a canonical
representation. This canonicalization simplifies pattern-matching, verification, rewriting, and
conversion to other dialects. Optimizations can be shared across dialects. Also, MLIR allows
custom operators for a particular dialect.
Expressions can be written at multiple levels of abstraction. The high-level graph opti-
mizations can use the TF dialect. The tensor optimizations (such as matrix multiplications and
fusion) can use the XLA dialect, and the LLVM code-generation can use the LLVM dialect on
supported hardware, all with the same infrastructure.
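To make progressive lowering concrete, the toy sketch below rewrites a single matrix-multiply node through two abstraction levels (graph op, loop nest, scalar ops); the string-based "IR" and all names are invented for illustration and are not MLIR dialects or APIs:

# Toy progressive lowering: graph-level op -> affine-style loop nest -> scalar ops.
def lower_graph_to_loops(op):
    if op == "matmul(A[MxK], B[KxN])":
        return ["for i in M:", "  for j in N:", "    for k in K:",
                "      C[i,j] += A[i,k] * B[k,j]"]
    return [op]

def lower_loops_to_scalar(lines):
    # A real pass would map the innermost statement to target instructions
    # (e.g., a fused multiply-add); here we only rewrite the string.
    return [l.replace("+= A[i,k] * B[k,j]", "= fma(A[i,k], B[k,j], C[i,j])")
            for l in lines]

loops = lower_graph_to_loops("matmul(A[MxK], B[KxN])")
print("\n".join(lower_loops_to_scalar(loops)))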
TensorFlow is gradually porting graph transformations to MLIR and unifying the inter-
faces to the back-end code generators [LS19]. Other hardware libraries or hardware vendor IRs
can consume the MLIR and generate code for their respective back-end targets.
9.9 OTHERS
Other notable compilers include the following:
Halide was developed as a DSL for image processing [RBA+13]. Key Halide concepts
can extend to DL compilers. TVM borrows many ideas from Halide, including decoupling the
tensor expression from the schedule and defining the scheduling space.
Diesel was developed by Nvidia to generate efficient code for GPUs [ERR+18]. Diesel
maps a DSL to a high-level graph and then lowers the graph to a Polyhedral IR. Optimization
passes are applied to tile a loop for efficient parallelism between threads, warps, blocks, and SMs.
Diesel then generates CUDA code for various Nvidia GPU back-end architectures.
nGraph is an open-source C++ library for high-level compilation designed by Intel but
no longer actively maintained. nGraph consumes a TensorFlow or ONNX computation graph,
maps the subgraphs supported by nGraph to an nGraph IR (for TF models, the TF runtime
handles nonsupported nodes), and performs high-level optimization passes, as shown in Fig-
ure 9.3 [SPE19].
Tensor Comprehension (TC) was developed by Facebook AI Lab and released in early
2018 [VZT+18]. Facebook appears to be prioritizing the Glow graph compiler. TC defines a
scheduling space for GPUs using polyhedral methods and uses a JIT compiler to search for an
efficient schedule. TC does not use ML to facilitate the selection of a schedule.
Tensor Algebra Compiler (taco) generates sparse tensor operators on a CPU [KKC+17].
DLVM has full control-flow and can be used for graph-level optimization [WSA18].
WELD is a DSL for data processing.
Figure 9.3: Graph-level optimizations used by nGraph (and typical in DL compilers). Various
nodes are fused to reduce memory access overhead. Based on [SPE19].
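As a rough illustration of this kind of pass, the toy sketch below merges runs of adjacent element-wise nodes in a linear node list so their intermediate tensors never round-trip through memory; real compilers operate on dataflow graphs, and the Node class and op names are invented:

from dataclasses import dataclass

ELEMENTWISE = {"add", "mul", "relu"}

@dataclass
class Node:
    op: str

def fuse_elementwise(nodes):
    # Greedily merge each run of element-wise nodes into one fused node.
    fused, i = [], 0
    while i < len(nodes):
        if nodes[i].op in ELEMENTWISE:
            chain = [nodes[i].op]
            while i + 1 < len(nodes) and nodes[i + 1].op in ELEMENTWISE:
                i += 1
                chain.append(nodes[i].op)
            fused.append(Node("fused(" + "+".join(chain) + ")"))
        else:
            fused.append(nodes[i])
        i += 1
    return fused

# conv -> add -> relu -> matmul becomes conv -> fused(add+relu) -> matmul
graph = [Node("conv"), Node("add"), Node("relu"), Node("matmul")]
print([n.op for n in fuse_elementwise(graph)])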
In this chapter, we reviewed the importance of DL compilers to support the execution of models
across diverse hardware targets. We detailed the DL compilers and software libraries used by
hyperscalers and hardware vendors. The most popular frameworks (with built-in compilers) are
TensorFlow and PyTorch, and the most popular compilers are TVM and XLA, with MLIR
providing a compiler infrastructure. In the next chapter, we provide concluding remarks and
discuss some of the future challenges and opportunities to advance DL.
CHAPTER 10
Opportunities and Challenges
• Integrated circuit (IC) design, which currently relies heavily on a human expert’s ex-
perience and intuition.
• Weight initialization.
• Model compression.
• Index data structures (faster and with less memory than B-Trees and Bloom filters).
Figure 10.1: The data science and infrastructure teams have different priorities. Based
on [BCC+19].
experiment tracking, model packaging, and model deployment at scale. At-scale deployment
often uses Kubernetes (k8s) clusters or Spark clusters. These platforms provide a collaborative
and secure environment and access to the latest ML libraries. These platforms are designed to
meet the needs of the data scientists and the infrastructure teams, which typically have different
priorities, as illustrated in Figure 10.1. Some of these platforms are open source. In the remainder
of this section, we mention existing platforms that companies can adopt or emulate.
Platforms used for first-party users (that is, internal company users as opposed to third-
party users, such as the external customers of cloud service providers) are as follows [HBB+18,
Goo20e, Mic20, AAB+19, KR19, HDB17, HM19, Eid18, Met19, Met19b]:
• Facebook FBLearner
• Google TF Extended (TFX)
• Microsoft ML.NET
• eBay Krylov
• Uber Michelangelo
• AWS Eider
• Netflix Metaflow (integrated into AWS)
Platforms provided by cloud service providers for third-party users are as follows [Ama20,
Goo20d, Mic20b, Ali20]:
• Amazon Sagemaker
• Google Cloud AI Platform
• Microsoft Azure cognitive services
• Alibaba PAI
Some of the above platforms can be deployed on-premise to facilitate switching between on-
premise and on-cloud. Platforms targeting enterprises are as follows [Mlf20, Cor20, Nvi20,
Int20c, Gui20, Ber19]:
• Nvidia RAPIDS
• Guild AI
Some platforms facilitate the development and training of new models or the consump-
tion of industry pre-trained models. As DL becomes widely adopted across industries, these
platforms may become more critical.
10.3 SECURITY
Security spans all parts of the DL system stack, from hardware to model robustness to data
privacy. Attacks are increasing in scale and sophistication. In this section, we discuss two areas
of active research: (1) adversarial ML and (2) data and model privacy. Although not discussed in
further detail, DL is also used to improve security in domains such as fraud detection, malware
detection, vulnerability detection, and software verification [XLF+18, HDS+19].
Adversarial machine learning studies attacks on learning systems and how to prevent them. Adver-
sarial attacks use carefully tuned signals designed to deceive the model into producing an output
different from the expected one. To illustrate, a correctly classified bus image can be imperceptibly perturbed to deceive a
model to label it as an ostrich [SZS+14]. Adversarial attacks put in jeopardy applications where
safety or security is critical, such as autonomous driving and biometric authentication.
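As a concrete example of a white-box attack, the sketch below implements the fast gradient sign method (FGSM), one of the simplest adversarial attacks; it assumes a TensorFlow/Keras classifier with logit outputs and inputs scaled to [0, 1], neither of which is specified in the text:

import tensorflow as tf

def fgsm_perturb(model, x, y_true, eps=0.01):
    # Fast gradient sign method: step each input element in the direction that
    # increases the loss, bounded by eps, then clip back to the valid range.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_true, model(x))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)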
Compressing a model makes it more vulnerable to these attacks by enlarging the magni-
tude of the adversarial noise [GWY+19, LGH19]. Training models robust to adversarial attacks
can require larger models to converge to flatter minima (see Section 4.1), which in turn may re-
quire more computational resources [TSE+19].
There are two types of adversarial attacks: white-box and black-box attacks. In white-
box attacks, the attacker knows the details of the target models, and in black-box attacks, the
attacker does not have these details. Several techniques have been developed (none of them
bulletproof) to increase robustness to adversarial attacks, including the following [ACW18,
PMW+16, XEQ17, MC17, TKP+18, DAL+18, MMS+19, Nak19, LGH19, BV20, XZZ20]:
• defensive distillation to reduce the amplitude of the gradients, known as gradient mask-
ing, and smooth the model;
• reducing the bits per pixels in the input image and using spatial smoothing;
• training a model to modify adversarial examples, so they are correctly classified;
• augmenting the training dataset with adversarial examples (see the sketch after this list);
• using models with larger capacity (more weights) than needed;
• optimizing robustness at smaller numerical representations;
• iteratively training a model with an adversary; and
• using the k-winners-take-all activation function.
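The sketch below illustrates the adversarial-training defense from the list above: each step trains on a mix of clean and FGSM-perturbed inputs, reusing the hypothetical fgsm_perturb function from the earlier sketch; the Keras model, optimizer, and batch layout are assumptions:

import tensorflow as tf

def adversarial_training_step(model, optimizer, x, y, eps=0.01):
    # Craft adversarial examples for the current batch and train on a
    # clean + adversarial mix (a minimal adversarial-training sketch).
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    x_adv = fgsm_perturb(model, x, y, eps)
    x_mix = tf.concat([x, x_adv], axis=0)
    y_mix = tf.concat([y, y], axis=0)
    with tf.GradientTape() as tape:
        loss = loss_fn(y_mix, model(x_mix))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss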
Generative attacks use generative models to generate realistic data samples. These samples
can deceive an authentication system or a human into believing the data is real. Mor et al. derive
optimal strategies for the attacker and authenticator and provide insights for designing
models robust to such attacks [MPG+20].
Privacy is an area of active research. Key areas focused on preserving privacy are federated
learning, GAN cryptography, homomorphic encryption, secure multiparty computation, and
differential privacy.
Federated learning, discussed in Section 5.3, ensures that data stays local and is not trans-
mitted to a centralized location. Training happens locally, and only the model updates are trans-
mitted. However, some information about the local training data can be extracted from local
updates [HAP17]. To preserve privacy, the updates should be encrypted before transmission and
decrypted only after the centralized location receives multiple models [BIK+17].
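A minimal sketch of the aggregation step (federated averaging, with no encryption) shows what the centralized location computes from the transmitted updates; client_weights and client_sizes are illustrative names, not from the text:

import numpy as np

def federated_average(client_weights, client_sizes):
    # Weighted average of locally trained weights: each client contributes in
    # proportion to the number of local training samples.
    total = float(sum(client_sizes))
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            avg[i] += (n / total) * w
    return avg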
GAN cryptography can facilitate training models that perform encryption and decryp-
tion [ACG+16]. Intel is developing homomorphic encryption tools to facilitate building models
that operate on encrypted data. Homomorphic encryption methods, in theory, enable training
and serving models using encrypted data; in practice, they require enormously more computation
[Gen09]. Another, more computationally feasible, method is secure multiparty computation
(SMPC), where parties jointly compute functions without revealing their inputs and
outputs [ZZZ+19].
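The sketch below shows additive secret sharing, a basic building block of many SMPC protocols: a value is split into random shares so that no single party learns it, yet the shares sum back to the secret modulo a large prime; the parameter names are illustrative:

import random

MODULUS = 2**61 - 1  # a large prime field

def share_secret(x, num_parties=3):
    # Each party holds one random-looking share; only the sum reveals x.
    shares = [random.randrange(MODULUS) for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS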
Differential privacy is an area of active research to train models without compromising
the privacy of the training dataset [AA16, JYv19, LAG+19, DJS20, Goo20b, WZL+19]. Large
models can memorize training data, and attackers may be able to extract information from a
trained model. To illustrate, using a sentence completion tool an attacker types “The bank ac-
count of Amy Jones is”, and the tool may regurgitate the actual account number if it is in the
training dataset. To mitigate this vulnerability, Apple uses differential privacy techniques that add
noise to the data on a user’s device before the data is transmitted to Apple.
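As a sketch of the idea behind adding noise on-device, the function below applies the textbook Laplace mechanism to a scalar statistic before it leaves the device; this is a generic illustration, not Apple’s actual mechanism, and the names are assumptions:

import numpy as np

def local_dp_release(value, epsilon=1.0, sensitivity=1.0):
    # Laplace mechanism: noise scaled to sensitivity/epsilon gives
    # epsilon-differential privacy for this single release.
    scale = sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=scale)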
Figure 10.2: Algorithms that are more interpretable typically have lower accuracy. Note this
is not shown to scale, but rather is a generalization of the algorithms’ interpretability. Based
on [Gun17].
10.4 INTERPRETABILITY
Interpretability is an area of active research to explain the reasons for the decisions, biases, and
limitations of a given model. Limited interpretability is a barrier for some industries adopting
DL algorithms despite their higher statistical performance. For instance, online credit appli-
cations should provide the reasons that a given loan was accepted or rejected. This right-to-
explanation is required in some legal systems.
Interpretability methods can be applied to a topology using attention. Attention-based
models learn to focus on the relevant inputs to produce a given output, which results in superior
statistical performance while simultaneously providing interpretable insights [AP19, KZK+19,
SLA+19].
BNNs combine the strengths of NNs and Bayesian models to estimate the uncertainty of
a NN prediction [Nea95] and can provide performance guarantees.
However, they are computationally expensive and require a good prior approximation to make
them useful. BNNs are an active field of research.
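Because full BNNs are expensive, a common cheap approximation (not covered in the text) is Monte Carlo dropout: keep dropout active at inference time and use the spread of repeated stochastic predictions as an uncertainty estimate. A minimal sketch, assuming a Keras model that contains dropout layers:

import tensorflow as tf

def mc_dropout_predict(model, x, num_samples=32):
    # Run the model num_samples times with dropout enabled (training=True)
    # and report the mean prediction and its variance as an uncertainty proxy.
    preds = tf.stack([model(x, training=True) for _ in range(num_samples)])
    return tf.reduce_mean(preds, axis=0), tf.math.reduce_variance(preds, axis=0)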
An already trained model may be interpreted using activations, a saliency map, and test-
ing concept vectors as follows: visualizing the activation features can provide insights into
what a neuron or group of neurons learned but provides no insights into why a decision was
made [ZF13, OSJ+18].
Another approach is using saliency maps to measure the impact of each input x_i on the
output p(z), that is, ∂p(z)/∂x_i. Saliency maps are used in various domains, including in RL to gain
insights on the behavior of learned agents [GPV+20]. However, saliency map methods may lack
reliability [AGM+18, HEK+19].
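A minimal saliency-map sketch, computing |∂p(z)/∂x_i| for one output class of a Keras image classifier (the model and class_index are assumptions, not from the text):

import tensorflow as tf

def saliency_map(model, x, class_index):
    # Gradient of the chosen class score with respect to the input pixels.
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x)[:, class_index]
    return tf.abs(tape.gradient(score, x))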
Google developed testing concept activation vectors (TCAV) to quantify the importance
of user-defined concepts in a model’s output [KWG+18, Goo20f]. TCAV learns concepts from
examples. For instance, to determine the importance of stripes in classifying an image as a zebra,
a concept is learned using images of stripes, and then TCAV can test the model using this learned
concept. A current limitation is that the user needs to determine which concepts to test and needs
training samples to learn the concept.
Another aspect of interpretability is giving users information about the training of the
model. This information includes the objective function (what the model is mathematically de-
signed to do) and the type of training data [MWZ+19, GMV+20]. Model developers should
explain where the model works, where it fails, and possible biases in the model. Google calls
this the model card. This level of transparency is vital to accelerate DL adoption and mitigate
misuse or unintended consequences. The Partnership on AI is one effort in this direction. Uni-
versity of Washington’s LIME and Google’s What-If Tool provide tools to analyze a model to
assist in this effort.
Bibliography
[ABC+16] M. Abadi, P. Barham, J. Chen, et al. TensorFlow: a system for large-scale machine
learning. OSDI, 2016. 184
[ACG+16] M. Abadi, A. Chu, I. Goodfellow, H. McMahan, I. Mironov, K. Talwar, and L.
Zhang. Deep learning with differential privacy. CCS, Oct. 2016. 203
[AA16] M. Abadi and D. Andersen. Learning to protect communications with adversarial neu-
ral cryptography. Oct. 2016. 203
[AGM+18] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity
checks for saliency maps. NeurIPS, Dec. 2018. 205
[ARS+20] D. Abts, J. Ross, J. Sparling, et al. Think fast: a tensor streaming processor (TSP)
for accelerating deep learning workloads. ISCA, Jun. 2020. 154
[AMP+19] A. Agrawal, A. Modi, A. Passos, et al. TensorFlow Eager: a multi-stage, Python-
embedded DSL for machine learning (slides). MLSys, Feb. 2019. 186, 187
[AAB+19] Z. Ahmed, S. Amizadeh, M. Bilenko, et al. Machine learning at Microsoft with
ML .NET. SIGKDD, Jul. 2019. 201
[ALV08] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center net-
work architecture. SIGCOMM, Oct. 2008. 150
[Ali20] Alibaba. Machine Learning Platform for AI. 2020. 201
[Ala18] J. Alammar. The illustrated transformer. June 2018. 64
[AHJ+18] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli.
The convergence of sparsified gradient methods. NeurIPS, Dec. 2018. 102
[AVG+15] L. Alvarez, L. Vilanova, M. Gonzalez, X. Martorell, N. Navarro, and E. Ayguade.
Hardware-software coherence protocol for the coexistence of caches and local memories.
TC, Jan. 2015. 134
[Ama19] Amazon. EC2 Inf1 Instances. 2019. 153
[Ama19b] Amazon. AWS re:Invent 2019: deliver high performance ML inference with AWS
Inferentia. Dec. 2019. 153
[Ama20] Amazon. SageMaker. 2020. 201
[Amd67] G. Amdahl. Validity of the single processor approach to achieving large scale com-
puting capabilities. AFIPS, Apr. 1967. 134
[Amd19] AMD. EPYC 7742. 2019. 152
[AAB+15] D. Amodei, R. Anubhai, E. Battenberg, et al. Deep Speech 2: end–to-end speech
recognition in English and Mandarin. ICML, Dec. 2015. 66
[AC16] D. Amodei and J. Clark. Faulty reward functions in the wild. OpenAI, Dec. 2016. 70
[DH18] D. Amodei and D. Hernandez. AI and compute. OpenAI, May 2018. 1, 99
[AES19] A. Antoniou, H. Edwards, and A. Storkey. How to train your MAML. ICLR, Mar.
2019. 200
[AP19] S. Arik and T. Pfister. ProtoAttend: attention-based prototypical learning. Sep. 2019.
204
[ABF+19] N. Arivazhagan, A. Bapna, O. Firat, et al. Massively multilingual neural machine
translation in the wild: findings and challenges. July 2019. 61, 62
[ACB17] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Jan. 2017. 14
[ADC11] T. Ashby, P. Diaz, and M. Cintra. Software-based cache coherence with hardware-
assisted selective self-invalidations using Bloom filters. TC, Apr. 2011. 134
[AFO18] S. Ashkiani, M. Farach-Colton, and J. Owens. A dynamic hash table for the GPU.
IPDPS, May 2018. 41
[ACW18] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of
security: circumventing defenses to adversarial examples. ICML, Jul. 2018. 202
[BKH16] J. Ba, J. Kiros, and G. Hinton. Layer normalization. July 2016. 38
[BGJ+18] V. Bacoyannis, V. Glukhov, T. Jin, J. Kochems, and D. Song. Idiosyncrasies and
challenges of data driven learning in electronic trading. NeurIPS, Dec. 2018. 70
[Bai20] Baidu. Kunlun. 2020. 153
[BKK18] S. Bai, J. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. Mar. 2018. 63
[BKK19] S. Bai, J. Kolter, and V. Koltun. Deep equilibrium models. NeurIPS, Dec. 2019. 97
[BTV06] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: speeded up robust features. ECCV,
2006. 48
[BES+19] P. Balaprakash, R. Egele, M. Salim, V. Vishwanath, F. Xia, T. Brettin, and R.
Stevens. Scalable reinforcement learning based neural architecture search for cancer deep
learning research. SC, Nov. 2019.
[BV20] M. Balunovic and M. Vechev. Adversarial training and provable defenses: bridging the
gap. ICLR, Feb. 2020. 202
[BBC+19] C. Berner, G. Brockman, B. Chan, et al. Dota 2 with large scale deep reinforcement
learning. Dec. 2019. 70
[BDD+20] M. Binkowski, J. Donahue, S. Dieleman, et al. High fidelity speech synthesis with
adversarial networks. ICLR, Apr. 2020. 68
[BIK+17] K. Bonawitz, V. Ivanov, B. Kreuter, et al. Practical secure aggregation for privacy-
preserving machine learning. CCS, Oct. 2017. 105, 203
[BLB17] A. Botev, G. Lever, and D. Barber. Nesterov’s accelerated gradient and momentum
as approximations to regularised update descent. IJCNN, Jul. 2017. 85
[BCD+18] T. Boyd, Y. Cao, S. Das, T. Joerg, and J. Lebar. Pushing the limits of GPU perfor-
mance with XLA. Nov. 2018. 195
[BMR+20] T. Brown, B. Mann, N. Ryder, M. Subbiah, et al. Language models are few-shot
learners. May 2020.
[CZH19] H. Cai, L. Zhu, and S. Han. ProxylessNAS: direct neural architecture search on
target task and hardware. ICLR, Feb. 2019. 200
[HSW+18] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. OpenPose: realtime multi-
person 2D pose estimation using part affinity fields. CVPR, Dec. 2018. 59
[CJL+16] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: a neural network
for large vocabulary conversational speech recognition. ICASSP, 2016. 67
[CFL20] O. Chang, L. Flokas, and H. Lipson. Principled weight initialization for hypernet-
works. ICLR, Feb. 2020. 78
[CCS+17] P. Chaudhari, A. Choromanska, S. Soatto, et al. Entropy-SGD: biasing gradient
descent into wide valleys. ICLR, Mar. 2017.
[CXZ+16] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear
memory cost. Apr. 2016. 96
[CES16] Y. Chen, J. Emer, and V. Sze. Eyeriss: a spatial architecture for energy-efficient
dataflow for convolutional neural networks. ISCA, June 2016. 148
[CG16] T. Chen and C. Guestrin. XGBoost: a scalable tree boosting system. SIGKDD, Aug.
2016. 193
[CES17] Y. Chen, J. Emer, and V. Sze. Using dataflow to optimize energy efficiency of deep
neural network accelerators. MICRO, June 2017. 147, 148
[CYC19] C. Chen, C. Yang, and H. Cheng. Efficient and robust parallel DNN training
through model parallelism on multi-GPU platform. Oct. 2019. 103
[CZZ+19] C. Chen, M. Zhang, M. Zhang, Y. Liu, Y. Li, and S. Ma. Social attentional memory
network: modeling aspect- and friend-level differences in recommendation. WSDM, Jan.
2019. 39
[CZL+19] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou. Behavior sequence transformer
for e-commerce recommendation in Alibaba. DLP-KDD, Aug. 2019. 47
[CYE+19] Y. Chen, T. Yang, J. Emer, and V. Sze. Eyeriss v2: a flexible accelerator for emerging
deep neural networks on mobile devices. JETCAS, June 2019. 157
[CKH+16] H. Cheng, L. Koc, J. Harmsen, et al. Wide and deep learning for recommender
systems. DLRS, Sep. 2016. 44, 46
[CCK+17] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. StarGAN: unified gen-
erative adversarial networks for multi-domain image-to-image translation. CVPR, Nov.
2017. 59
[Cho16] F. Chollet. Xception: deep learning with depthwise separable convolutions. CVPR,
Oct. 2016. 53
[CB18] N. Choma and J. Bruna. Graph neural networks for neutrino classification. Big Data
Summit, Feb. 2018.
[CGC+14] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated re-
current neural networks on sequence modeling. Dec. 2014. 35
[CFO+18] E. Chung, J. Fowers, K. Ovtcharov, et al. Serving DNNs in real time at datacenter
scale with project Brainwave. MICRO, Mar. 2018. 117
[CAS16] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recom-
mendations. RecSys, Sep. 2016. 46
[DB19] W. Dai and D. Berleant. Benchmarking contemporary deep learning hardware and
frameworks: a survey of qualitative metrics. CogMI, Dec. 2019. 157
[DAM+16] D. Das, S. Avancha, D. Mudigere, et al. Distributed deep learning using syn-
chronous stochastic gradient descent. Feb. 2016. 33
[Dal16] B. Dally. High-performance hardware for machine learning. ENN, Feb. 2017. 129
[HDB17] J. Hermann and M. Del Balso. Meet Michelangelo: Uber’s machine learning plat-
form. Sep. 2017. 201
[HR15] T. Highlander and A. Rodriguez. Very efficient training of convolutional neural net-
works using fast Fourier transform and overlap-and-add. BMVA, Sep. 2015. 33
[HSS12] G. Hinton, N. Srivastava, and K. Swersky. RMSProp: divide the gradient by a running
average of its recent magnitude. Coursera, 2012. 85
[HVD15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network.
Mar. 2015. 68, 123
[HAP17] B. Hitaj, G. Ateniese, and F. Perez-Cruz. Deep models under the GAN: information
leakage from collaborative deep learning. SIGSAC CCS, Sep. 2017. 105, 203
[HS97] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Comp., Jan. 1997. 75
[HS97] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comp., Nov.
1997. 79
[HHS17] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the
generalization gap in large batch training of neural networks. NeurIPS, Dec. 2017. 78, 93
[HM19] A. Holler and M. Mui. Evolving Michelangelo model representation for flexibility at
scale. Oct. 2019. 201
[Hor14] M. Horowitz. 1.1 Computing’s energy problem (and what we can do about it). ISSCC,
Feb. 2014. 129, 138
[Hou19] J. Hou. New research on quantization could revolutionize power-efficient AI. July
2019. 153
[JYv19] J. Jordon, J. Yoon, and M. van der Schaar. PATE-GAN: generating synthetic data with
differential privacy guarantees. ICLR, Feb. 2019. 203
[JYK+20] N. Jouppi, D. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Pat-
terson. A domain-specific supercomputer for training deep neural networks. CACM, July
2020. 99, 153
[KES+18] N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. June
2018. 68
[Kar19] A. Karpathy. A recipe for training neural networks. Apr. 2019. 79, 91
[KLA19] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative
adversarial networks. CVPR, Mar. 2019. 59
[KR19] S. Katariya and A. Ramani. eBay’s transformation to a modern AI platform. Dec. 2019.
201
[KWG+18] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. In-
terpretability beyond feature attribution: quantitative testing with concept activation vec-
tors (TCAV). ICML, June 2018. 205
[KKS+19] C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, and H. Yoo. A 2.1TFLOPS/W mobile
deep RL accelerator with transposable PE array and experience compression. ISSCC, Feb.
2019. 70
[KB17] D. Kingma and J. Ba. Adam: a method for stochastic optimization. ICLR, Jan. 2017.
84
[KKC+17] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe. The tensor algebra
compiler. OOPSLA, Oct. 2017. 196
[Kod19] R. Koduri. Intel unveils new GPU architecture with high-performance computing
and AI acceleration, and oneAPI software stack with unified and scalable abstraction for
heterogeneous architectures. Intel HPC Dev. Conf., Nov. 2019. 153
[KSA+15] R. Komuravelli, M. Sinclair, J. Alsop, et al. Stash: have your scratchpad and cache
it too. ISCA, Oct. 2015. 137
[KWW+17] U. Koster, T. Webb, X. Wang, et al. Flexpoint: an adaptive numerical format for
efficient training of deep neural networks. NeurIPS, Dec. 2017. 112, 116
[KL19] W. Kouw and M. Loog. An introduction to domain adaptation and transfer learning.
Jan. 2019. 95
[KBC+18] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned
index structures. Apr. 2018. 199
[KSH12] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep con-
volutional neural networks. NeurIPS, Dec. 2012. 5, 28, 38, 48, 102
[Kri14] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. Apr.
2014.
[KGG+18] O. Kuchaiev, B. Ginsburg, I. Gitman, et al. Mixed-precision training for NLP and
speech recognition with OpenSeq2Seq. Nov. 2018. 115
[LA04] C. Lattner and V. Adve. LLVM: a compilation framework for lifelong program analysis
& transformation. CGO, Mar. 2004. 164
[LP19] C. Lattner and J. Pienaar. MLIR primer: a compiler infrastructure for the end of
Moore’s Law. CGO, Feb. 2019. 166
[LG16] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. CVPR, Sep.
2015. 33
[Lec16] Y. Lecun. RI seminar: Yann LeCun : the next frontier in AI: unsupervised learning.
Nov. 2016. 14
[LDS89] Y. Lecun, J. Denker, and S. Solla. Optimal brain damage. NeurIPS, 1989. 121
[LLX+20] D. Lepikhin, H. Lee, Y. Xu, et al. GShard: scaling giant models with conditional
computation and automatic sharding. June 2020. 99, 102, 107
[LM18] Y. Leviathan and Y. Matias. Google Duplex: an AI system for accomplishing real-
world tasks over the phone. May 2018. 68
[LSZ+19] T. Li, A. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated op-
timization in heterogeneous networks. Sep. 2019. 104, 105
[LCH+19] X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between
dropout and batch normalization by variance shift. CVPR, Jan. 2019. 41
[LGH19] J. Lin, C. Gan, and S. Han. Defensive quantization: when efficiency meets robust-
ness. ICLR, Apr. 2019. 123, 202
[LGH+16] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyra-
mid networks for object detection. CVPR, Dec. 2016. 55
[LGG+17] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object
detection. ICCV, Aug. 2017. 56
[LSP+19] T. Lin, S. Stich, K. Patel, and M. Jaggi. Don’t use large mini-batches, use local SGD.
June 2019.
[LHM+18] Y. Lin, S. Han, H. Mao, Y. Wang, and W. Dally. Deep gradient compression:
reducing the communication bandwidth for distributed training. ICLR, Feb. 2018. 101
[MY17] T. Munkhdalai and H. Yu. Meta networks. ICML, June 2017. 200
[Nak19] P. Nakkiran. Adversarial robustness may be at odds with simplicity. Jan. 2019. 202
[NKB+20] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double
descent: where bigger models and more data hurt. ICLR, Apr. 2020. 75, 76
[NSA+19] A. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan. Speech recognition using
deep neural networks: a systematic review. Access, 2019. 65
[NMS+19] M. Naumov, D. Mudigere, H. Shi, et al. Deep learning recommendation model for
personalization and recommendation systems. May 2019. 47, 48
[NKM+20] M. Naumov, J. Kim, D. Mudigere, et al. Deep learning training in Facebook data
centers: design of scale-up and scale-out systems. Mar. 2020. 3, 102, 152, 156
[Nay19] P. Nayak. Understanding searches better than ever before. Oct. 2019. 64
[NMZ19] E. Neftci, H. Mostafa, and F. Zenke. Surrogate gradient learning in spiking neural
networks: bringing the power of gradient-based optimization to spiking neural networks.
SPM, Nov. 2019. 15
[Nea95] R. Neal. Bayesian learning for neural networks. Ph.D. Thesis, University of Toronto,
1995. 15, 204
[NDC+17] J. Novikova, O. Dusek, A. Curry, and V. Rieser. Why we need new evaluation
metrics for NLG. July 2017. 61
[NKJ+19] E. Nurvitadhi, D. Kwon, A. Jafari, et al. Why compete when you can work together:
FPGA-ASIC integration for persistent RNNs. FCCM, May 2019. 153
[Nvi15] Nvidia. PTX and SASS assembly debugging. 2015. 144
[Nvi20] Nvidia. RAPIDS. 2020. 202
[Nvi20b] Nvidia. T4. 2020. 134
[Nvi20c] Nvidia. Data center deep learning product performance. July 2020. 99
[OSJ+18] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mord-
vintsev. The building blocks of interpretability. 2018. 204
[OPM02] T. Ojala, M. Pietikäinen, and T. Maenpaa. Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI, July 2002. 48
[Ope18] OpenAI. Kinds of RL algorithms. 2018. 71
[Orr99] G. Orr. Momentum and Learning Rate Adaptation. Willamette University, 1999. 83
[Pad19] S. Padmanabhan. Building a product catalog: eBay’s university machine learning com-
petition. Oct. 2019. 44
[PdO+18] M. Paganini, L. de Oliveira, and B. Nachman. Accelerating science with generative
adversarial networks: an application to 3D particle showers in multi-layer calorimeters.
PRL, Jan. 2018. 14
[PY10] S. Pan and Q. Yang. A survey on transfer learning. TKDE, Oct. 2010. 95
[PMW+16] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense
to adversarial perturbations against deep neural networks. S&P, Mar. 2016. 202
[PCZ+19] D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. Cubuk, and Q. Le. SpecAug-
ment: a simple data augmentation method for automatic speech recognition. Apr. 2019.
67, 89
[PNB+18] J. Park, M. Naumov, P. Basu, S. Deng, et al. Deep learning inference in Facebook
data centers: characterization, performance optimizations and hardware implications. Nov.
2018. 54, 117
[PRH+17] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky. Dark memory
and accelerator-rich system optimization in the dark silicon era. D&T, May 2016. 138
[PSC+19] M. Pellauer, Y. Shao, J. Clemons, et al. Buffets: an efficient and composable storage
idiom for explicit decoupled data orchestration. ASPLOS, Apr. 2019. 137
[PSM14] J. Pennington, R. Socher, and C. Manning. GloVe: global vectors for word represen-
tation. EMNLP, 2014. 38
[PGZ+18] H. Pham, M. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture
search via parameter sharing. Feb. 2018. 200
[Phi18] M. Phi. Illustrated guide to LSTM’s and GRU’s: a step by step explanation. TDS. Sep.
2018. 36
[PPG+17] W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan, S. Narang, J. Raiman, and
J. Miller. Deep Voice 3: scaling text-to-speech with convolutional sequence learning. Oct.
2017. 68
[PPC18] W. Ping, K. Peng, and J. Chen. ClariNet: parallel wave generation in end-to-end
text-to-speech. July 2018. 69
[Pol99] F. Pollack. New microarchitecture challenges in the coming generations of CMOS pro-
cess technologies. MICRO, Nov. 1999.
[PZK+17] R. Prabhakar, Y. Zhang, D. Koeplinger, et al. Plasticine: a reconfigurable architec-
ture for parallel patterns. SIGARCH, June 2017. 154
[PHX+18] V. Pratap, A. Hannun, Q. Xu, et al. wav2letter++: the fastest open-source speech
recognition system. Dec. 2018. 67
[Qia99] N. Qian. On the momentum term in gradient descent learning algorithms. Jan. 1999.
83
[RMC15] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. ICIGP, Nov. 2015. 59
[RWC+19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language
models are unsupervised multitask learners. 2019.
[RBA+13] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe.
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in
image processing pipelines. PLDI, June 2013. 196
[RSR+19] C. Raffel, N. Shazeer, A. Roberts, et al. Exploring the limits of transfer learning
with a unified text-to-text transformer. Oct. 2019. 102
[RZQ+19] K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine. Efficient off-policy meta-
reinforcement learning via probabilistic context variables. Mar. 2019. 199
[ROR+16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet
classification using binary convolutional neural networks. ECCV, Sep. 2016. 119
[RD19] S. Raza and C. Ding. Progress in context-aware recommender systems–an overview.
Jan. 2019. 44
[RAH+19] E. Real, A. Aggarwal, Y. Huang, and Q. Le. Regularized evolution for image clas-
sifier architecture search. AAAI, Feb. 2019. 200
[RKK19] S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. ICLR,
Apr. 2019. 84, 85
[RDG+16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: unified,
real-time object detection. CVPR, 2016. 55
[RF18] J. Redmon and A. Farhadi. YOLOv3: an incremental improvement. Apr. 2018. 56
[RHG+15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object
detection with region proposal networks. NeurIPS, Dec. 2015. 55
[RAA+19] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler. Spar-
CML: high-performance sparse communication for machine learning. SC, Aug. 2019.
107
[RKL+18] A. Rodriguez, T. Kacprzak, A. Lucchi, et al. Fast cosmic web simulations with gen-
erative adversarial networks. CompAC, Nov. 2018. 14
[RKB+09] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the band-
width wall: challenges in and avenues for CMP scaling. SIGARCH, Jun. 2009. 127
[RDK+19] D. Rolnick, P. Donti, L. Kaack, et al. Tackling climate change with machine learn-
ing. Nov. 2019.
[RDK+19] D. Rolnick, P. Donti, L. Kaack, et al. Tackling climate change with machine learning
workshop. NeurIPS, Dec. 2019. 206
[RFB15] O. Ronneberger, P. Fischer, and T. Brox. U-Net convolutional networks for biomed-
ical image segmentation. May 2015. 57
[Ros20] C. Rosset. Turing-NLG: a 17-billion-parameter language model by Microsoft. Feb.
2020.
[RXT19] B. Roune and XLA Team. Compiling ML with XLA. Feb. 2019.
[RJP19] K. Roy, A. Jaiswal, and P. Panda. Towards spike-based machine intelligence with neu-
romorphic computing. Nature, 2019. 15, 150
[Rud17] S. Ruder. An overview of multi-task learning in deep neural networks. June 2017. 95
[Rup20] K. Rupp. Microprocessor trend data. 2020. 135
[RDS+15] O. Russakovsky, J. Deng, H. Su, et al. Large scale visual recognition challenge. IJCV,
2015. 48
[RRS+19] A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell.
Meta-learning with latent embedding optimization. ICLR, Mar. 2019. 200
[Sam16] Samsung. Samsung begins mass producing world’s fastest DRAM–based on newest
high bandwidth memory (HBM) interface. 2016. 138
[SST09] P. Sanders, J. Speck, and J. Traff. Two-tree algorithms for full bandwidth broadcast,
reduction and scan. Sep. 2009. 109
[SDC+19] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of
BERT: smaller, faster, cheaper and lighter. Oct. 2019. 65
[San19] V. Sanh. Smaller, faster, cheaper, lighter: introducing DistilBERT, a distilled version
of BERT. Medium, Aug. 2019. 66
[Sas19] K. Sasaki. Federated Learning with TensorFlow. 2019. 104
[SYP17] K. Sato, C. Young, and D. Patterson. An in-depth look at Google’s first Tensor Pro-
cessing Unit (TPU). May 2017. 174
[SGT+09] F. Scarselli, M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph
neural network model. TNNLS, Jan. 2009. 13
[Sch19] J. Schalkwyk. An all-neural on-device speech recognizer. Mar. 2019. 67
[SAH+20] J. Schrittwieser, I. Antonoglou, T. Hubert, et al. Mastering Atari, Go, Chess and
Shogi by planning with a learned model. Feb. 2020. 72, 199
[SKP15] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: a unified embedding for face
recognition and clustering. CVPR, Mar. 2015. 59
[SLM+17] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy
optimization. Apr. 2017. 71
[SFD+14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent
and application to data-parallel distributed training of speech DNNs. Int’l Speech Comm.
Association, Sep. 2014. 101
[SDB18] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in
TensorFlow. Feb. 2018. 107
[SHB15] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words
with subword units. Aug. 2015. 61, 62
[SKF+16] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow
for machine comprehension. Nov. 2016. 62
[SLA+19] C. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. Dahl. Mea-
suring the effects of data parallelism on neural network training. JMLR, July 2019. 82,
83
[SWR18] Y. Sharan, H. Wang, and S. Rath. GUI testing powered by deep learning. eBay Tech
Blog. June 2018.
[SCP+18] N. Shazeer, Y. Cheng, N. Parmar, et al. Mesh-TensorFlow: deep learning for su-
percomputers. NeurIPS, Dec. 2018. 102
[SPW+17] J. Shen, R. Pang, R. Weiss, et al. Natural TTS synthesis by conditioning WaveNet
on Mel Spectrogram predictions. ICASSP, Dec. 2017. 68
[SDY+19] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer.
Q-BERT: Hessian based ultra low precision quantization of BERT. Sep. 2019. 65, 117
[She18] R. Sheth. Introducing PyTorch across Google Cloud. Oct. 2018. 190
[SLA+19] B. Shickel, T. Loftus, L. Adhikari, T. Ozrazgat-Baslanti, A. Bihorac, and P. Rashidi.
DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable
deep learning. Feb. 2019. 204
[SPP+19] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-
LM: training multi-billion parameter language models using model parallelism. Oct.
2019. 3, 99, 102
[SL19] T. Shpeisman and C. Lattner. MLIR: multi-level intermediate representation for com-
piler infrastructure. Apr. 2019. 184, 195
[SHM+16] D. Silver, A. Huang, C. Maddison, et al. Mastering the game of Go with deep
neural networks and tree search. Nature, Jan. 2016. 44, 72
[SSS+17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, et al. Mastering the game of
Go without human knowledge. Nature, Oct. 2017. 72
[SSS+18] D. Silver, J. Schrittwieser, K. Simonyan, et al. A general reinforcement learning al-
gorithm that masters chess, shogi, and Go through self-play. Science, Dec. 2018. 72
[SZ14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. Sep. 2014. 50
[Smi17] L. Smith. Cyclical learning rates for training neural networks. WACV, Apr. 2017. 83,
93
[SSZ17] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning.
NeurIPS, Dec. 2017. 200
[Ste19] I. Steinwart. A sober look at neural network initializations. Sep. 2019. 79
[Ste19b] N. Stephens. BFloat16 processing for neural networks on Armv8-A. Aug. 2019. 152
[SA19] A. Stooke and P. Abbeel. Accelerated methods for deep reinforcement learning. Jan.
2019. 70
[SPE19] A. Straw, A. Procter, and R. Earhart. nGraph: unlocking next-generation perfor-
mance with deep learning compilers. 2019. 168, 169, 196, 197
[SGB+19] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in
transformers. May 2019. 65
[SCC+19] X. Sun, J. Choi, C. Chen, et al. Hybrid 8-bit floating point (HFP8) training and
inference for deep neural networks. NeurIPS, Dec. 2019. 117, 119
[SWL+19] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. ERNIE 2.0: a
continual pre-training framework for language understanding. 2019.
[SAD+20] Y. Sun, N. Agostini, S. Dong, and D. Kaeli. Summarizing CPU and GPU design
trends with product data. 2020. 6
[SVL14] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural net-
works. NeurIPS, Dec. 2014. 62
[SCY+17] V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks:
a tutorial and survey. Proc. IEEE, Dec. 2017. 148, 149
[SCY+20] V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks.
M&C, June 2020. 147
[SLJ+14] C. Szegedy, W. Liu, Y. Jia, et al. Going deeper with convolutions. CVPR, Sep. 2014.
51
[SVI+15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Incep-
tion architecture for computer vision. CVPR, Dec. 2015. 51, 77
[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R.
Fergus. Intriguing properties of neural networks. Feb. 2014. 202
[Syn17] Synced. A brief overview of attention mechanism. Medium, Sep. 2017. 40
[TPL19] M. Tan, R. Pang, and Q. Le. EfficientDet: scalable and efficient object detection.
Nov. 2019. 56, 200
[TL19] M. Tan and Q. Le. EfficientNet: rethinking model scaling for convolutional neural
networks. May 2019. 54, 200
[TYD+18] Y. Tassa, Y. Doron, A. Muldal, et al. DeepMind control suite. Jan. 2018. 70
[TKT+16] S. Tavarageri, W. Kim, J. Torrellas, and P. Sadayappan. Compiler support for soft-
ware cache coherence. HiPC, Dec. 2016. 134
[Tsa18] S. Tsang. Review: YOLOv1— you only look once (object detection). TDS, Oct. 2018.
57
[Tvm19] TVM. TVM deep learning compiler joins Apache Software Foundation. Mar. 2019.
192
[vKK+16] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural net-
works. Jan. 2016. 59
[vDZ+16] A. van den Oord, S. Dieleman, H. Zen, et al. WaveNet: a generative model for raw
audio. Sep. 2016. 68
[vLB+17] A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: fast high-fidelity
speech synthesis. Nov. 2017. 68
[VS19] J. Valin and J. Skoglund. LPCNet: improving neural speech synthesis through linear
prediction. ICASSP, May 2019. 68
[VZT+18] N. Vasilache, O. Zinenko, T. Theodoridis, et al. Tensor Comprehensions:
framework-agnostic high-performance machine learning abstractions. ICASSP, May
2019. 196
[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. NeurIPS, Dec.
2017. 12, 39, 63, 64
[VSZ+19] R. Venkatesan, Y. Shao, B. Zimmer, et al. A 0.11 PJ/OP, 0.32-128 TOPS, scal-
able multi-chip-module-based deep neural network accelerator designed with a high-
productivity VLSI methodology. HCS, Aug. 2019. 152
[VTB+14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: a neural image
caption generator. CVPR, Nov. 2014. 63
[VBC+19] O. Vinyals, I. Babuschkin, J. Chung, et al. AlphaStar: mastering the real-time strat-
egy game StarCraft II. Dec. 2019. 70
[SMH+18] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: a multi-
task benchmark and analysis platform for natural language understanding. Apr. 2018. 60
[WYL+20] H. Wang, J. Yang, H. Lee, and S. Han. Learning to design circuits. Jan. 2020. 199
[WYZ+17] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang.
IRGAN: a minimax game for unifying generative and discriminative information retrieval
models. SIGIR, May 2017. 46
[Wri19] L. Wright. New deep learning optimizer, Ranger synergistic combination of RAdam
+ LookAhead for the best of both. Aug. 2019. 86, 93
[WSC+16] Y. Wu, M. Schuster, Z. Chen, et al. Google’s neural machine translation system:
bridging the gap between human and machine translation. Sep. 2016. 62
[WFB+19] F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli. Pay less attention with
lightweight and dynamic convolutions. Jan. 2019. 65
[Wu19] H. Wu. Low precision inference on GPU. GTC, Mar. 2019. 115, 120
[WDZ+19] B. Wu, X. Dai, P. Zhang, et al. FBNet: hardware-aware efficient ConvNet design
via differentiable neural architecture search. CVPR, May 2019. 200
[WM95] W. Wulf and S. McKee. Hitting the memory wall: implications of the obvious.
SIGARCH, Mar. 1995. 127
[XGD+17] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transforma-
tions for deep neural networks. CVPR, July 2017. 54, 55
[Xil19] Xilinx. Versal: the first adaptive compute acceleration platform (ACAP). 2019. 153
[XAT+18] C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A walk with SGD. May 2018. 79,
81
[XEQ17] W. Xu, D. Evans, and Y. Qi. Feature squeezing: detecting adversarial examples in
deep neural networks. Dec. 2017. 202
[XLF+18] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song. Neural network-based graph
embedding for cross-platform binary code similarity detection. CCS, July 2018. 202
[YKT+18] M. Yamazaki, A. Kasagi, A. Tabuchi, et al. Yet another accelerated SGD: ResNet-
50 training on ImageNet in 74.7 seconds. Mar. 2019. 99
[YCS17] T. Yang, Y. Chen, and V. Sze. Designing energy-efficient convolutional neural net-
works using energy-aware pruning. CVPR, Apr. 2017. 123
[YDY+19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le. XLNet: gen-
eralized autoregressive pretraining for language understanding. NeurIPS, Dec. 2019. 64
[YHG+15] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for
image question answering. CVPR, Nov. 2015. 39, 63
[YSE+20] J. Yin, S. Sethumurugan, Y. Eckert, N. Enright Jerger, et al. Experiences with ML-
driven design: a NoC case study. HPCA, Feb. 2020. 199
[YKC+18] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng. Image classification at su-
percomputer scale. NeurIPS, Dec. 2018. 13, 47, 102
[YGG17] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks.
Sep. 2017. 85
[YLR+20] Y. You, J. Li, S. Reddi, et al. Large batch optimization for deep learning: training
BERT in 76 minutes. ICLR, Jan. 2020. 85, 102
[YZH+18] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. ImageNet training in
minutes. Jan. 2018. 3, 99, 102
[YAB+18] Y. Yu, M. Abadi, P. Barham, et al. Dynamic control flow in large-scale machine
learning. EUROSYS, May 2018. 180, 184
[YTL+19] L. Yuan, F. Tay, G. Li, T. Wang, and J. Feng. Revisit knowledge distillation: a
teacher-free framework. Sep. 2019. 124
[ZK15] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional
neural networks. CVPR, June 2015. 58
[ZXL+18] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert.
Fully convolutional speech recognition. Dec. 2018. 67
[Zei12] M. Zeiler. ADADELTA: an adaptive learning rate method. Dec. 2012. 85
[ZF13] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks.
ECCV, Nov. 2013. 49, 204
[ZF13] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional
neural networks. ICLR, May 2013. 34
[ZES+20] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter. Understanding
and robustifying differentiable architecture search. ICLR, Jan. 2020. 200
[ZB19] T. Zerrell and J. Bruestle. Stripe: tensor compilation via the nested polyhedral model.
Mar. 2019. 193
[ZDH19] B. Zhang, A. Davoodi, and Y. Hu. Efficient inference of CNNs via channel pruning.
Aug. 2019. 122
[ZYY18] J. Zhang, J. Yang, and H. Yuen. Training with low-precision embedding tables.
NeurIPS, Dec. 2018. 115
[ZRW+18] M. Zhang, S. Rajbhandari, W. Wang, and Y. He. DeepCPU: serving RNN-based
deep learning models 10x faster. ATC, 2018. 62, 134
[ZLH+19] M. Zhang, J. Lucas, G. Hinton, and J. Ba. Lookahead optimizer: k steps forward,
1 step back. NeurIPS, Dec. 2019. 85, 86, 153
[ZL19] W. Zhang and P. Li. Spike-train level backpropagation for training deep recurrent
spiking neural networks. NeurIPS, Dec. 2019. 150
[ZZL+17] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: an extremely efficient convo-
lutional neural network for mobile devices. CVPR, July 2017. 59
[ZXH+17] Y. Zhang, T. Xiang, T. Hospedales, and H. Lu. Deep mutual learning. CVPR, Jan.
2018. 124
[ZZZ+19] C. Zhao, S. Zhao, M. Zhao, Z. Chen, C. Gao, H. Li, and Y. Tan. Secure multi-
party computation: theory, practice and applications. Inf. Sciences, Feb. 2019. 203
[ZZX+19] W. Zhao, J. Zhang, D. Xie, Y. Qian, R. Jia, and P. Li. AIBox: CTR prediction
model training on a single node. CIKM, Nov. 2019. 45
[ZHW+19] Z. Zhao, L. Hong, L. Wei, et al. Recommending what video to watch next: a
multitask ranking system. RecSys, Sep. 2019. 46
[ZZZ+18] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. Yuan, X. Xie, and Z. Li. DRN: a deep
reinforcement learning framework for news recommendation. IW3C2, Apr. 2018. 46, 153
[ZMF+18] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai. Deep
interest evolution network for click-through rate prediction. AAAI, Nov. 2018. 47
[ZTZ+18] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu.
Discrimination aware channel pruning for deep neural networks. NeurIPS, Dec. 2018.
122
[ZZY+19] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph:
a comprehensive graph neural network platform. PVLDB, Aug. 2019. 13, 47
[Zis18] A. Zisserman. Self-supervised learning. July 2018. 49
[ZL17] B. Zoph and Q. Le. Neural architecture search with reinforcement learning. Feb. 2017.
199, 200
Author’s Biography
ANDRES RODRIGUEZ
Andres Rodriguez is a Sr. Principal Engineer and AI Architect in the Data Platform Group
at Intel Corporation where he designs deep learning solutions for Intel’s customers and pro-
vides technical leadership across Intel for deep learning hardware and software products. He
has 15 years of experience working in AI. Andres received a Ph.D. from Carnegie Mellon Uni-
versity for his research in machine learning. He was the lead instructor of the Coursera course
An Introduction to Practical Deep Learning, which reached over 20 thousand students. He has been an in-
vited speaker at several AI events, including AI with the Best, ICML, CVPR, AI Frontiers
Conference, Re-Work Deep Learning Summit, TWIML, Startup MLConf, Open Compute
Platform Global Summit, AWS re:Invent, Baidu World, Baidu Cloud ABC Inspire Summit,
Google Cloud OnAir Webinar, and several Intel events, as well as an invited lecturer at Carnegie
Mellon University, Stanford University, UC Berkeley, and Singularity University.