Deep Learning
Roy Keyes
This book is for sale at http://leanpub.com/zefsguide2dl
Contents

Acknowledgments

1. Introduction
   Why deep learning?
   Why this book?
   What does this book cover and not cover?
   How to use this book

2. Machine Learning
   What is machine learning?
   Types of machine learning tasks and solutions
      Regression
      Classification
      Supervised learning
      Unsupervised learning
      Self-supervised learning
      Reinforcement learning
   An example task
      Predicting real estate sales prices
   Formulating machine learning problems
   Data sets and features
   Measuring performance
      Performance baselines and success thresholds
   Model selection
   Model training
      Supervised learning
      Unsupervised learning
      Loss functions
      Parameter optimization
      Generalization and overfitting
      Avoiding overfitting
      Hyperparameters
   Productionization
   Common issues
   Common machine learning models
   From “traditional” ML to deep learning
   References

3. Neural Networks
   What is a neural network?
   What are some tasks that neural networks can accomplish?
   The building blocks of neural networks
      Activation functions
      Neural network layers
      Connections, weights, and biases
      Learning via gradient descent
      Output layers
   What does a neural network do?
   From basic neural networks to deep learning
   Resources

4. The rise of deep learning

5. Computer vision and convolutional neural networks
   Convolutions
   Filter size, strides, padding, and pooling
   A basic CNN architecture
   Some important CNN model architectures for computer vision tasks
      AlexNet
      ResNet
      U-Net for semantic segmentation
      YOLO for object detection
      Image generation with GANs
   Common CNN techniques
      Regularization
      Data augmentation
      Batch normalization
      Gradient descent algorithms
      Transfer learning
   Summary and resources
1. Introduction

The creation of artificial general intelligence (AGI) has been a dream for generations of computer scientists and others going back decades. Where we are on the road to the development of AGI is highly debated, but the practical applications of deep learning are here today and providing value via many real-world use cases.
If you want to understand what deep learning is and how your team and organization can use deep learning, this book is for you. It is short
on purpose to help you get up to speed quickly. Most concepts are presented
with accompanying illustrations to better help you understand how things
fit together.
This book is structured to build up concepts, starting with machine learning fundamentals and basic neural networks and moving to use-case-specific model architectures that are in wide usage right now (2022). What that means is that the first half of
the book forms the basis for the rest and should be read through in order
to best understand the topics. Chapter 2 is about general machine learning
principles. If you’re already familiar with “traditional” machine learning
you can safely skip that chapter. The second half gets more into specific
topics, such as computer vision, natural language processing, and generative
models. If you are not interested in one or more of those topics, feel free to
skip them.
Available at zefsguides.com¹, but not included with this book, are flash cards
that cover the same topics. These flash cards can help you review and recall
the concepts in this book, especially when used with a spaced repetition²
method, such as that employed by the app Anki³.
If you are interested in going deeper into computer vision, natural language
processing, or better understanding transformers, I recommend that after
finishing the relevant sections of this book that you continue with one of
the other books in the Zefs Guides series on those topics.
I hope you find this book enjoyable and rewarding to read and wish you
the best in learning about these incredible technologies.
¹https://zefsguides.com
²https://en.wikipedia.org/wiki/Spaced_repetition
³https://apps.ankiweb.net/
2. Machine Learning
Deep learning is a family of techniques for building predictive models that
are at the center of the current boom in “artificial intelligence”. In the
past decade deep learning models have been able to surpass “traditional”
techniques in many domains and applications, sometimes even exceeding
human performance. But to understand deep learning, we need to put it
in the larger context of machine learning, as deep learning is itself one
type of machine learning. We will also look at shallow neural networks,
the ancestors of deep learning.
In this chapter and the next we’ll go over some of the core concepts of
machine learning and the basics of neural networks, setting us up to learn
about modern, deep neural networks.
Machine learning practitioners tend to focus on producing the most highly predictive models, even if they are effectively “black boxes”¹ to the users. Statistical modeling has more emphasis
on creating models that have more explanatory power (e.g. creating a model
that mimics the underlying process that generates the data). Statistics has
also built out robust techniques to deal with small amounts of data, which
many practitioners of ML tend to avoid. That said, there are many concepts
from statistics that underlie ML and deep learning.
As more data has become available in many areas of science, industry, and
government, machine learning has been increasingly adopted as a viable
solution to solve real-world problems. The abundance of data is however not
the only factor. More data plus cheaper storage and computing power have led to the development of more software (especially open source software)
and the refinement of techniques to make the best use of that data. This has been especially true of deep learning, as we will see in later chapters.

¹A black box in this context is a process, model, or algorithm where the user is unable to examine the internal logic and can only see the input and output.
Artificial Intelligence
Artificial intelligence, or AI, is often discussed in relation to deep
learning. In fact, many people are actually referring to deep learning
these days when they say “AI”. Worse, some people even use “AI” to refer
to anything related to data. In this book I will try to avoid the term AI,
as it is currently used in a very broad and often vague sense, in part due
to the hype surrounding it.
AI has a very interesting history going back decades. It’s broadly defined
as enabling computers to do tasks that we typically think of as needing
some level of intelligence to achieve, whether that be playing chess or
holding a conversation. I will not go into the details of this history,
except to mention that the two main approaches to achieving this
have been so-called “knowledge-based” approaches, with hard-coded
logic rules, and “connectionist” or learning-based approaches, which
ultimately include the current deep learning techniques.
I am of the opinion that for most practitioners of deep learning, the
near-term task-oriented applications of deep learning are where they
will get their biggest returns on learning. That said, there is a lot of
interesting research on how these current techniques might achieve
broader “artificial general intelligence” type systems.
For more on AI and AGI, a good place to start is Wikipedia’s article on
artificial intelligence.
https://en.wikipedia.org/wiki/Artificial_intelligence
Classification vs Regression
Regression
Machine learning tasks can be broken down into a handful of categories. A
common way to categorize tasks is by the type of output that a machine
learning model needs to produce. One of the most common types of task
is regression². Regression is the task of predicting a (continuous) numerical
value, such as the temperature in a weather forecast, the sales volume of a
product, or estimating time of arrival of a vehicle.
The most basic version of regression that many people are familiar with is finding the slope and intercept of a “best fit line” that fits some data points. The line equation can then be used to estimate new, unseen values. This is the manual version of finding the best fit slope and intercept values (a.k.a. parameters). In
ML, as we’ll discuss in more detail, those slope and intercept parameters
are determined by the training process, where different values are tried
iteratively and the fit of the line is compared with the known data to see
how good the fit is. This is the core of machine learning: “learning” the best
parameter values of a model from the known, existing data.
Regression is one of the most common applications of machine learning
and typically falls under the category of supervised learning, which we will
discuss below.
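To make this concrete, here is a minimal sketch of the “best fit line” idea using scikit-learn (the library this book points Python users toward); the house sizes and prices are made-up values for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: house sizes (square meters) and sales prices (made-up values)
X = np.array([[50], [70], [90], [110], [130]])  # feature: size
y = np.array([150, 200, 240, 290, 330])         # target: price (thousands)

model = LinearRegression()
model.fit(X, y)  # "learn" the slope and intercept from the data

print(model.coef_[0], model.intercept_)  # the learned slope and intercept
print(model.predict([[100]]))            # estimate the price of an unseen 100 m^2 house
```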
Classification
Supervised learning
Another way to organize machine learning tasks is by the method used to
solve the problem, rather than the type of predictive task. One of the most
common classes of methods used to build predictive models is “supervised
learning”. Supervised learning is when you have data to train a model that
includes “ground truth” answers. For example, if you were building a model
to predict how tall children would be as adults, you could use a supervised
learning approach if you had data on full-grown adults when they were
children and their final, adult heights. The inputs to the model would be
the earlier data, such as age, height, height of parents, etc., and the matching
outputs would be the known, final heights of those people.
Essentially supervised learning is when you train a model by taking example
inputs and comparing the predictions of the model to the known, desired
outputs. By altering the parameters of the model (think the slope and
intercept of your fit line), you can then check if the new parameters provide
an improvement in the overall predictive performance of the model. In
practice this is done by training algorithms, which automatically try many
possible model parameter values to land on the best model, given the current
data.
Unsupervised learning
Self-supervised learning
There are some approaches to solving problems that use “unlabeled” data,
but in a different way than the unsupervised approaches mentioned above.
Instead of searching for natural clusters or patterns within the data, you
can pose tasks where the answer is already contained in the data. Examples
include training models to predict the next word in a sentence, using the first
part of the sentence as the input and the following word(s) as the “ground
truth” output, effectively formulating the task as a supervised learning
problem. This reformulation from “unlabeled” data to a supervised learning
problem is why the moniker “self-supervised” is used, as the data is its own
label.
The distinction between supervised, unsupervised, and self-supervised learning is not always clear-cut, but it is often a useful one³.
Reinforcement learning
An example task
Before we look at the details of how machine learning problems are
formulated and solutions are built and tested, let’s consider an example
problem. This will give us even more context when we discuss the specifics
of how machine learning solutions are created.
Let’s imagine that you decide that you want to buy some real estate (e.g.
a house). You want to use your computing skills to give you the best
understanding possible of how much to pay for a house by building a
program that could predict how much a given house would ultimately sell for. How would you do this?

³There is another category termed “semi-supervised” learning that we will not cover in this book.
⁴https://en.wikipedia.org/wiki/MuZero
As we have seen above, this is a regression problem. We want to know
a specific numerical estimate: the price in some currency. If you were a
physicist, cough, you might try to build a model from first principles, one based on assumptions about how humans act and information about the fundamental attributes of the house. This probably won’t work, as the
underlying mechanisms are far too complicated. The machine learning
approach would be to try to use data to adjust a mathematical model, such
that the model can effectively use the different characteristics of the house
to predict the sales price. By collecting data from recent sales of homes in
the region, you could use the known characteristics, or features, of these
homes as training inputs to an ML model and compare the predicted sales
prices to the known, actual sales prices.
Assuming you had selected a model and trained it using the data you had,
how would you know if the model was good? A good model should produce
price predictions that are close to the actual sales prices of homes in the
area. While training, you would compare the outputs of your model to the
known values, adjust the internal parameters of the model, check the overall
goodness of the predictions, and iterate this process until the model is no
longer improving or you are satisfied. This training process is done via a
training algorithm, which is usually specific to the type of machine learning
model you are training.
One immediate issue with this is that your model may be able to learn to
predict the sales prices of the examples you have by essentially memorizing
them, but, when faced with new data, not be able to make reasonable
predictions. Your model has focused on memorizing the training data, rather
than on learning the more general patterns that are useful for predicting
sales prices of houses that it has not yet seen. The way to understand and
quantify the goodness of your model is to test it on house sales data that it
did not see during the training process. This will help estimate how it will
do with “real world” data.
The basic steps in this example are:

1. Formulate the problem (predicting a house’s sales price, a regression task).
2. Collect data on recent home sales, including the characteristics, or features, of the homes and their actual sales prices.
3. Select a model and train it, comparing predicted prices to the known prices and adjusting the model’s parameters.
4. Test the trained model on sales data it did not see during training to estimate its real-world performance.

There are a lot of details that I have glossed over in this example of a supervised regression problem, but we have now set the stage to dive into those details more deeply.
Data sets and features

Sometimes a raw feature, such as the size of a house, is highly predictive, but other times features must be “engineered” for the most predictive ability. If you were trying to predict the risk of some disease, height and weight might be less predictive than a composite quantity, or
feature, such as body mass index. There are many techniques to create these
more abstract features and ML practitioners commonly try many of them
to see if they will improve the predictive performance of their models.
Finally, it’s crucial that the data you are using is truly representative of
the “real world” data that your system is intended to work with. This
can be difficult to assess, as there are many ways that data can be non-
representative, such as biased sampling, snapshotting data from a changing
system, or even collecting data from the wrong sources.
Measuring performance
Probably the most important question about a machine learning model is
how good the model is. This question sounds straightforward, but how you
measure the performance of a model is something that ML practitioners
often must spend a lot of effort on.
For regression tasks, common measures of performance are the mean abso-
lute error, MAE, and the root-mean-square error, RMSE. These quantify the
typical error of predictions made by the model. Both look at the difference
between a prediction and the “ground truth”, while treating an overestimate the same as an underestimate, but RMSE effectively “focuses” on large errors in a disproportionate way. This is often preferred if large prediction errors are especially costly. In contrast to this, you might have a task where overestimation is fine, but underestimation is very costly, so using a symmetric metric, such as MAE or RMSE, would not be appropriate and a “weighted” approach would be better suited.
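As an illustration, here is a sketch of computing MAE and RMSE directly with NumPy; the prediction and ground-truth values are invented for the example:

```python
import numpy as np

y_true = np.array([250, 300, 410, 520])  # actual sales prices (made up)
y_pred = np.array([240, 330, 400, 460])  # model predictions (made up)

errors = y_pred - y_true
mae = np.mean(np.abs(errors))         # treats over- and underestimates equally
rmse = np.sqrt(np.mean(errors ** 2))  # squaring "focuses" on large errors

print(f"MAE:  {mae:.1f}")   # 27.5
print(f"RMSE: {rmse:.1f}")  # 34.3: pulled up by the single large error
```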
Classification tasks have their own set of performance metrics and even
more explicit tradeoffs, depending on the issues related to incorrect pre-
dictions. Common classification metrics include accuracy, precision, recall,
specificity, and F1 score. We will discuss some of these in more detail when
we talk about classification tasks for deep learning.
Regardless of the task, the most appropriate performance metric should be chosen, taking into account the pros and cons as they relate to the specific task.
Model selection
Once you have formulated the problem, you need to select a machine
learning model and train it. In practice, many ML practitioners don’t simply select a model and run with it, but rather select a set of models and see which ones perform best for the task at hand.

⁶Occasionally you’ll create a very simple model as a baseline and find that it is actually good enough for what you want to do!
Generally there are sets of models that are suitable for different types of
tasks, which allows ML practitioners to narrow down their choices. The
type of problem and amount of data are key factors in choosing a model.
Practical considerations related to training resources needed, prediction
speed (a.k.a. inference speed), explainability, simplicity, deployability, and
current infrastructure come into play when deciding on which models to
consider, as well.
A popular joke about how to win ML competitions captures this pragmatism:

1. Show up.
2. Use XGBoost.
Model training
Training a model is how you go from a set of data and a raw model to
something useful. Models have internal parameters that need to be tuned to
provide the best predictions related to a task. “Learning” the best parameters
is achieved by using the (hopefully high quality) available data.
Supervised learning
For supervised learning problems you have feature data, which describes
the example (e.g. a house, picture, website interaction, patient symptoms,
etc), often denoted with the variable x, and the so-called target data, which
is the outcome, quantity, or label (e.g. sales price, content of the picture,
what someone clicked on, diagnosis, etc). The target is often denoted with
the variable y.
To train your model you will adjust the model’s parameters, such as the
slope and intercept terms of a simple linear model, provide the model with
example input features, and compare the model output, often labeled ŷ, with the known “ground truth” target, y. Based on the agreement or error
between the predicted target value and the actual target value, you will then
adjust the parameters and once again pass in the input features and see how
good the predicted outcome values are. This iteration is carried out until the
predictions are “good enough” or have stopped improving.
For training supervised ML models you need a (high quality) labelled data
set, but you also need to use the data in a specific way. A supervised model
learns by taking in data, x, producing predictions, ŷ, and comparing those predictions with the “ground truth”, y. But to understand how well the
model will perform on as-of-yet unseen, “real world” data, another data
set that the model has not yet seen is needed.
In order to make this possible, the original data set is first split randomly
into a training set and a “test” set, which is only used to estimate the
performance of the model once the training is complete. Typically a small fraction, between 5% and 25%, is held out for the test set. It’s very important
that these sets are kept separate so that there is no “leakage” between them
that would let the model learn the specific examples that are in the test set.
In practice a third set is also split out of the training set after the initial
split. This is called the validation set and is used to optimize so-called
“hyperparameters”. In contrast to model parameters, which are learned
directly by training the model, hyperparameters are model settings that are
set by hand. Examples of hyperparameters are the number of trees in a
random forest, the degree of the polynomial in polynomial regression, and
the number of layers in a neural network.
The validation set is used after a model has been trained on the training
set to do an initial estimate of the goodness of the model. After that, the
hyperparameters are varied and the training process is carried out again to
evaluate another set of hyperparameters⁷. Finally, the model with the best
hyperparameters is tested against the test set, producing the final model
performance estimate.
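Here is a sketch of this two-stage splitting using scikit-learn’s train_test_split; the 60/20/20 proportions are just one common choice, and the data is random placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.rand(1000)  # placeholder data

# First hold out a test set, used only for the final performance estimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split a validation set out of the remaining training data,
# used for comparing hyperparameter choices
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```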
Unsupervised learning
In unsupervised learning, you have no target or outcome data (a.k.a. data
labels). Instead, the model is using input features as the basis for grouping or
clustering the data. This can be used to find natural groupings of customers,
such as “big spenders” or “browsers”, by finding behavioral similarities. This
can also be used for finding anomalous and fraudulent behavior that falls
outside of the main clusters.
Typical clustering algorithms iterate through the data, trying to place each
data example in a cluster that best fits it. Other clustering algorithms try
clusters with slightly different parameters until most data examples seem
to fit well within a cluster. Most clustering algorithms require that the
user set the number of expected clusters that should exist in the data. The
optimal number of clusters is not necessarily known beforehand, but there
are procedures and metrics, such as the elbow method, the gap score, and the
silhouette score, that can help the modeler find the best number of clusters.
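For illustration, here is a minimal clustering sketch with scikit-learn’s k-means; the toy customer data and the choice of two clusters are assumptions made for this example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder customer data: visits per month and average spend
X = np.array([[2, 10], [3, 12], [25, 300], [30, 280], [1, 8], [28, 310]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # assign each example to its best-fitting cluster

print(labels)                   # e.g. [0 0 1 1 0 1]: "browsers" vs "big spenders"
print(kmeans.cluster_centers_)  # the center of each discovered cluster
```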
The clusters resulting from unsupervised methods do not always correspond
to obvious groupings and typically require human inspection to assign
meaningful labels (if that’s important). For example, if you clustered a large set of songs, you might find that the most important feature was the tempo of the music, which might put otherwise disparate genres together.

⁷There are several common approaches to hyperparameter search, such as grid search, random search, and various Bayesian methods.
Dimensionality reduction
Loss functions
For regression tasks, mean squared error (MSE) is a common loss function:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left[y(x_i) - \hat{y}(x_i, \theta)\right]^2$$

for some given parameters of the model, denoted by θ, and n training data examples.
This has the advantage of treating overestimates and underestimates equally, while having some other convenient mathematical properties⁹. It also places more weight on larger errors than a simpler, (piecewise) linear measure, such as mean absolute error: $L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left|y(x_i) - \hat{y}(x_i, \theta)\right|$.
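To make the dependence of the loss on the parameters explicit, here is a sketch evaluating the MSE loss of a one-parameter model, ŷ = θx, at two candidate values of θ; the data values are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])  # roughly y = 2x, with some noise

def mse_loss(theta):
    """MSE of the simple model y_hat = theta * x for the data above."""
    y_hat = theta * x
    return np.mean((y - y_hat) ** 2)

print(mse_loss(1.0))  # a poor slope gives a high loss (about 8.0)
print(mse_loss(2.0))  # a slope near the true pattern gives a low loss (about 0.02)
```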
Classification tasks predict the class or label of the input and thus need a different type of loss function. The loss function needs to compare the label assigned by the model and the “ground truth” label. There are several loss functions designed for this comparison, the most common being cross-entropy loss.

⁹Specifically, the derivative with respect to the model’s parameters has a very simple form.
Parameter optimization
Gradient descent
The method most commonly used for finding model parameters that minimize the loss function is called gradient descent. Gradient descent is
the process of moving “downwards” along the slope, or gradient, of the loss
function in the parameter space. The process is to adjust the parameters
in the direction of the (negative) gradient step by step, moving “down
hill” and re-calculating the gradient after each step. The size of the step
is determined by the steepness of the gradient and a step size multiplier, or
learning rate, set by the user. Eventually the steps should lead to a local or
global minimum, usually after taking smaller and smaller steps. Stopping is
determined by some criteria set by the user.
From a mathematical perspective, the gradient of the loss function is found
by estimating the derivative of the surface with respect to the parameters,
which is the same as the slope, but often on a surface that exists in a very
high dimensional space.
There are several different algorithms for performing gradient descent and
we will look at a few in more depth in the sections on deep learning.
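A minimal sketch of the basic loop described above, fitting the one-parameter model ŷ = θx by repeatedly stepping down the gradient of the MSE loss; the learning rate and number of steps are arbitrary choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

theta = 0.0           # arbitrary starting parameter value
learning_rate = 0.01  # the step size multiplier

for step in range(200):
    y_hat = theta * x
    # Gradient of the MSE loss with respect to theta
    grad = -2 * np.mean((y - y_hat) * x)
    theta -= learning_rate * grad  # step "down hill"

print(theta)  # converges to roughly 2.03, the best-fit slope
```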
Generalization and overfitting

The goal of learning from a data set is not for the model to simply memorize
the training data, but rather for the model to recognize the patterns in the
data, such that it can make good predictions about data from the same
distribution that it has not yet seen. In other words, we want our cat image detector to work for all images of cats, not just the ones it’s already seen. This process of going from training data to being able to make predictions about as-of-yet unseen data is called “generalization”.
The opposite of generalization is called “overfitting”. Overfitting is when the
model has focused too much on the specific quirks or noise of the training
data and has not been able to learn the more general patterns within the
data. Overfitting is a common issue in training ML models and is one of the
things that practitioners worry about the most.
As a model increasingly learns the specific quirks of the training set, instead of learning the more general patterns in the data, it will do worse and worse on the test data¹¹. The best parameters for the model are the ones
from the “sweet spot”, where the test error is lowest.
Two of the important ways to quantify how well a model behaves are bias
and variance. In general you can think of bias as how consistently off an
estimate is from the true value. More specific to ML, bias is how far off a
model trained with a given set of hyperparameters is from the true target
value when being used to predict the target value from a test set. In other
contexts this is called systematic error, rather than random error or noise.
To characterize the bias, multiple estimates of a value need to be made. The
mean error of those estimates is the bias.
For ML models, the bias is the mean error from multiple models all with
the same hyperparameters, but trained on different samples of the training
data. This estimates how far off (and in what direction) a specific model will
be from the “ground truth”.
¹¹There is a related phenomenon called “double descent”, where the test error may go up and then back
down again, but we will not cover that here. See https://en.wikipedia.org/wiki/Double_descent.
Variance is a measure of how “spread out” a set of values are. For ML models
this is the spread of predictions for a single point in the test set, as made by
several instances of the same model (i.e. same hyperparameters) trained on
different samples of the training data. A model that overfits will result in
several models that give widely varying predictions of the same input, as
each model has learned the very specific patterns in the samples, rather than
the more general pattern common to all samples.
The ideal model has low bias and low variance. In practice there is typically a trade-off between the two, and the best choice is often a model with moderately low bias and moderately low variance.
Avoiding overfitting
Regularization
Overfitting, high variance models are typically the result of too much
flexibility in a model. If a model has enough internal parameters to match
every quirk of the data, it’s less likely to settle on the “smoother” general
patterns in the data and more likely to try to match noise. One way to use a
flexible model while avoiding overfitting is by constraining the parameters
in the model. This approach is a form of so-called regularization.
Instead of allowing the parameters to take any values they want, regulariza-
tion of this type constrains the parameters in a way that limits how much
weight can be used in the model. If the model parameters are constrained
to some total magnitude, then the model is more likely to distribute the
parameter values to emphasize the more fundamental patterns in the data,
rather than noise or edge cases.
This type of constrained regularization is typically achieved by adding a term to the loss function, which increases the loss when the sum of the parameter values is high. An example is L2 (or ridge) regularization:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left[y(x_i) - \hat{y}(x_i, \theta)\right]^2 + \lambda \sum_{j=1}^{k} \theta_j^2$$
The higher order (i.e. higher index) terms in the Legendre polynomial model
are “wigglier”, which allows them to capture more small-scale change in
the data, which is often due to noise. By constraining the sum of the
(squared) parameter values, the lower degree terms are more likely to be
emphasized. Conversely, too much weight on the lowest order terms can
cause underfitting. A good fit will balance the terms and requires just the
right amount of regularization.
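In practice this is often a small change to the model. Here is a sketch with scikit-learn, where Ridge’s alpha parameter plays the role of λ above and a degree-10 polynomial stands in for a flexible model; the data and values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 20)  # noisy toy data

# A flexible degree-10 polynomial model, with and without an L2 penalty
unregularized = make_pipeline(PolynomialFeatures(10), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(10), Ridge(alpha=0.001))

unregularized.fit(x, y)  # free to chase the noise: likely to overfit
regularized.fit(x, y)    # the penalty keeps the parameter values small
```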
There are several other forms of regularization, some of which we will
discuss in later chapters. All are designed to create models that are more
likely to model the general, underlying pattern of the data and less likely to
focus on the noise in the data.
Hyperparameters
• Grid search: a grid of values (e.g. 0.1, 1, 10, 100, etc) are tried for each
hyperparameter and combination of hyperparameters. Grid search
has the advantage of being parallelizable, but has the drawback of
potentially taking a very long time when there are many hyperpa-
rameters (and thus many, many combinations).
• Random search: combinations of hyperparameters are randomly chosen from possible hyperparameter values. This has the advantage of being able to evaluate combinations in a very large space of possible values; a sketch of both approaches follows this list.
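The sketch, using scikit-learn; the model, hyperparameter ranges, and placeholder data are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X = np.random.rand(100, 4)  # placeholder feature data
y = np.random.rand(100)     # placeholder target data

model = RandomForestRegressor(random_state=0)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 10, None]}

# Grid search: try every combination (9 here), with 5-fold cross-validation
grid = GridSearchCV(model, param_grid, cv=5).fit(X, y)

# Random search: sample a fixed number of combinations from the same space
rand = RandomizedSearchCV(model, param_grid, n_iter=5, cv=5,
                          random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)  # the winning combinations
```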
Productionization
Finding the best model through the iterative training process is not usually
the end goal of an ML project. The end goal is to actually use that model in
some real-world capacity. The model needs to be put “into production” or
deployed for real-world use.
Deployment can look very different in different settings and contexts. E.g.
creating a model that is run manually once a month is very different from
a model that runs continuously on a mobile phone. The most common
scenario of deployment for most models is as an API as part of a larger,
server-based system, such as a product recommendation system.
While the different productionization scenarios have their own sets of tools,
common themes are deployment, monitoring, and maintaining models. It’s
often said that development of an ML model is never finished, because while
a model is usually static, the world is constantly changing. This means that
once it’s deployed, whether as a web API or on an edge device, the owner of
the model needs to keep track of its performance. Some scenarios, such as predicting the weather, lend themselves easily to performance monitoring,
as the “ground truth” is easily available. Other scenarios, such as detecting
infrequent anomalies in long cycle systems, may take time and effort to
understand the model performance. Tools and procedures for these may
exist or may need to be developed by the engineers.
Maintenance of models is needed when the current model is no longer good
enough. That may be because the “world” has changed and the previous
predictions are no longer accurate¹², there are new requirements, or a better
version of the model has been developed. A key feature of a robust ML
deployment system is the ability to track versions of models and to easily roll back from a model that’s not working to one that is. This is where
the research and development aspects of machine learning really meet the
software engineering aspects of machine learning.
Common issues
Machine learning is a very powerful approach to solve certain kinds of
problems, but it’s also incredibly easy to get wrong¹³. There are a number
of common mistakes that practitioners make and anyone working on ML
models needs to be vigilant to avoid these.
Overfitting, described above, is a common problem. Without monitoring
training closely, most ML model types can overfit easily. Strategies to
mitigate and avoid overfitting include training on more data, regulariza-
tion, early stopping, and cross-validation (to better estimate performance),
among others.
Not having enough data can also lead to low performance. This can be
diagnosed by looking at the learning curve(s) while training. If the learning
curve is still showing improved performance (i.e. decreasing error), despite
having trained on all of the data, you are likely to get better performance
by training with even more data.
¹²You can imagine the impact that the CoViD-19 pandemic had on many models that were previously
running well.
¹³A very good talk about this was given by Ben Hamner from Kaggle in 2014 called Machine Learning Gremlins, laying out many of the ways that ML can go wrong. https://www.youtube.com/watch?v=tleeC-KlsKA
ML models only know what they’ve learned from their training data. Even if
they are correctly trained, if the underlying data is wrong in some way, your
model will have problems when being applied to the real-world problem.
This can be due to non-representative sampling, which can cause bias in
the model’s predictions or cause the model to learn non-essential context,
rather than the main “subject” itself. For example, if you are trying to
classify images as either cows or horses, but all of your images of cows are
in pastures and all of your pictures of horses are inside barns, your classifier
may learn how to distinguish fields from barns (i.e. the context) rather than
the difference between cows and horses (i.e. the subjects).
Having well sampled data is a basic requirement, but it can still have other
issues. If the data is “dirty” in some way, it may not be useable for actually
solving the problem at hand, because it cannot be cleanly processed or there
are features that cannot be cleanly tied to the associated events or entities.
Another common issue is when the training data contains information that
real-world input data would not have, but is highly correlated with the
prediction target. For example, if you were trying to build a model that
predicted whether students were going to graduate on time and the data
contained information on whether they had completed all requirements,
it’s very likely that your model will find the correlation between that and a student’s eventual graduation. In the real world that information would not yet be known, so including it in training allows the model to “cheat”. This phenomenon is called data leakage and it can take many forms.
While your training data may be strongly representative of the real world
data, it is effectively a snapshot in time of the data distribution. If the events
or the environment changes, your model, trained on a past snapshot of the
data, will begin to have performance issues. This is a form of data drift. To
avoid issues, the performance of the model must be regularly monitored.
Finally, coming back to the first step in the process of building a machine
learning model, formulating the problem, it’s easy to build a model that
solves the wrong problem. This sounds like something that’s trivial to avoid,
but the reality is that the process of building a model is often complicated
and time consuming, and the basic problem statement is lost in the confusion.

Common machine learning models

The common “traditional” ML models are all made available in popular open source ML software libraries, such as scikit-learn for Python users and others for R users.
References
For your reference, here are a number of resources for learning more about
machine learning basics and traditional ML models.
Courses:

• Machine Learning¹⁴ by Andrew Ng, on Coursera.
• Learning from Data¹⁵ by Yaser Abu-Mostafa (Caltech).
• Machine Learning¹⁶ from Georgia Tech, on Udacity.
• Data School’s 15 hours of expert machine learning videos¹⁷.

Books:

• An Introduction to Statistical Learning¹⁸ by James, Witten, Hastie, and Tibshirani.
¹⁴https://www.coursera.org/learn/machine-learning
¹⁵https://work.caltech.edu/telecourse.html
¹⁶https://www.udacity.com/course/machine-learning--ud262
¹⁷https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
¹⁸https://www.statlearning.com/
3. Neural Networks
This book is about “deep learning”, but deep learning is really just a name
for the modern use of (deep) neural networks. Before we get to deep
learning, we need to first look at traditional (shallow) neural networks and
understand the basic concepts and issues with them. In this chapter we will
cover those basics, preparing us to understand the more recent innovations
that fall into the category of deep learning.
As we will discuss in the rest of this book, neural networks are not one
single model, but due to the flexibility of how the components of neural
networks can be combined, they form a very large class of machine learning
models. We will first look at the simplest form of neural network and in later
chapters learn about many of the variations that are used for specific tasks.
Some of these tasks have only become tractable in the past decade due to the development of (usable) deep neural networks. On some tasks neural
networks have even been able to match or surpass human performance.
Other tasks have been addressed with neural networks for decades. The
success and promise of neural networks across so many tasks is one of the
main reasons why they are of so much interest to researchers and engineers.
They are not one-trick ponies. They can be adapted to almost any machine
learning task.
The building blocks of neural networks

The earliest neural network building block, the perceptron, takes a weighted sum of its inputs and outputs 1 if that sum exceeds a threshold, and 0 otherwise:

$$\text{Perceptron output} = \begin{cases} 0 & \text{if } \sum_i x_i w_i \leq \text{threshold} \\ 1 & \text{if } \sum_i x_i w_i > \text{threshold} \end{cases}$$
The perceptron output criterion equation can be rewritten, such that it’s
the weighted sum plus a “bias” term, equal to the negative value of the
threshold. If this total sum is greater than zero, the output is one, otherwise
it’s zero. This simplified formulation is in line with the notation used for
modern neural networks.
$$\text{Perceptron output} = \begin{cases} 0 & \text{if } x \cdot w + b \leq 0 \\ 1 & \text{if } x \cdot w + b > 0 \end{cases}$$
Here $x \cdot w = \sum_i x_i w_i$ is the inner or “dot” product of the inputs and the weights.
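A direct translation of this formulation into NumPy; the input, weight, and bias values are arbitrary:

```python
import numpy as np

def perceptron(x, w, b):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    return 1 if np.dot(x, w) + b > 0 else 0

x = np.array([1.0, 0.5])   # inputs
w = np.array([0.6, -0.4])  # weights (arbitrary values)
b = -0.3                   # bias, i.e. the negative of the threshold

print(perceptron(x, w, b))  # 0.6*1.0 - 0.4*0.5 - 0.3 = 0.1 > 0, so outputs 1
```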
Activation functions
The summing and threshold process in the perceptron is called its activation
function, which is once again inspired by how real neurons work. In the
perceptron this activation function is a step function, taking only two values:
0 and 1. For binary classification we only need two values, but even so it
turns out that a step function is not the best activation function. A step
function has such a sudden change from 0 to 1 that a very small change in inputs or weights can flip the output. In practice smoother functions are
more useful for several reasons. One is that relatively smooth² functions
that are not simply linear can enable the modeling of non-linear patterns.
The other reason is related to gradient descent optimization: some functions
have derivatives (i.e. slopes) that are more amenable to effective gradient
descent optimization.
²Here I’m using “relatively smooth” to mean functions that are at least piece-wise connected.
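For illustration, here is the step function alongside two of the smoother activation functions commonly used in its place, sigmoid and ReLU:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)  # sudden jump: tiny changes can flip the output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # smooth 0-to-1 transition, convenient derivative

def relu(z):
    return np.maximum(0.0, z)         # piece-wise linear, still enables non-linear models

z = np.array([-2.0, -0.1, 0.1, 2.0])
print(step(z))     # [0. 0. 1. 1.]
print(sigmoid(z))  # [0.12 0.48 0.52 0.88]
print(relu(z))     # [0.  0.  0.1 2. ]
```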
Neural network layers

A basic network has a layer of input nodes at the start, one or more “hidden” layers of nodes in the middle, and a layer of output nodes at the end. The number of input nodes
depends on the input data and how it is encoded. For example, if you had
seven numerical features about houses for sale, you could use seven input
nodes in the input layer. The number and size of the middle, hidden layers
is up to the user to define. And finally, the size of output layer depends on
the number of outputs needed. For a classification task with N classes, you
would use N output nodes, one corresponding to each class. For a standard
regression problem you would only need a single output node to provide a
single numerical value.
Having many parameters typically means you need a lot of training data and a lot of computing resources.

Learning via gradient descent

Neural networks learn the best parameters by training on data. Like several
other types of machine learning models, they can learn in a supervised
manner via gradient descent: iteratively adjusting the model’s parameters
to decrease the overall prediction error, as measured by the loss function.
A note on math
This section contains some equations, but the important part is really
how the equations go together, rather than all of the details of each equa-
tion. You should be able to understand the gist without understanding
every single part.
Gradient descent
The parameter update step is the gradient of the loss, scaled by a learning rate, α, and pointed “down hill”⁴:

$$\text{step} = -\alpha \vec{\nabla}_\theta$$

⁴Technically you’re stepping in the opposite direction of the gradient, as the gradient is a calculation of the vector in the direction of the steepest increase of the loss surface.
Determining the gradient is more difficult. The gradient is telling you what
would happen to the loss function value if you made a very small change to
the parameters in the direction the gradient is pointing in. Mathematically,
this is the derivative along each parameter dimension. The gradient is the
composite vector, whose components show how the loss function changes
when any single parameter is changed.
$$\vec{\nabla}_\theta = \left[\frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots, \frac{\partial L}{\partial \theta_i}\right]^T$$
The tricky part about determining the gradient is that the loss function
depends on parameters throughout the network. A change of a weight in
the beginning of the network affects the ultimate prediction output by the
network. Fortunately there is a technique that makes the calculation of the
gradient relatively easy: backpropagation.
Real world neural networks tend to have many, many parameters, all of
which can affect the output. You can think of the output as the result of
many nested mathematical operations, with the input to the network starting at the innermost level of the nest and the final output coming from the outermost level.
We are interested in the loss function, e.g. $L(\theta) = \frac{1}{2}(y - \hat{y})^2$, which is
dependent on the set of parameters, θ, indirectly through the prediction
produced by the network, ŷ . To understand how a change in the parameters
would affect the loss function, we can therefore look at how ŷ is affected by
the change in the parameters.
Let’s consider the simplest neural network, a network with only one node
per layer. This will give us a simpler scenario, while still giving us the core
conceptual parts of backpropagation. To determine how a small change of
a parameter somewhere in the network would affect the output, we first
look at the derivative of the final activation function with respect to its
immediate parameters. The final output can be written as

$$f_f(z) = f_f(b_f + w_f \times a_{f-1})$$
where z is just a placeholder for the input to the function, $b_f$ is the bias term of that layer, $w_f$ is the weight of that layer, and $a_{f-1}$ is the output of the activation function of the preceding layer feeding into the final layer. We can keep fleshing this out to see the nested operations and end up with a really long, hard to read equation:
$$f_f\big(b_f + w_f \times f_{f-1}(b_{f-1} + w_{f-1} \times f_{f-2}(\cdots f_1(b_1 + w_1 \times x)\cdots))\big)$$
where x is the initial input to the network and the index indicates which
layer the term is from, relative to the final layer. Fortunately, we can
understand the concept without having to work through every layer.
The derivative of the final activation function with respect to its own weight, for example, would be

$$\frac{\partial f_f}{\partial w_f} = \frac{\partial f_f(z)}{\partial z} \times \frac{\partial z}{\partial w_f} = f_f' \times a_{f-1}$$

where z is a placeholder for the input to $f_f$, i.e. $z = b_f + w_f \times a_{f-1}$.
The important thing this equation is telling us is that we need to know the derivative of the final activation function with respect to its input, and we need to know the derivative of the input with respect to $w_f$. This works,
because the chain rule of calculus tells us that the derivative of a nested or
compound function is the product of its component derivatives.
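Here is a numerical sketch of that chain rule for the one-node-per-layer network above, using sigmoid activations and the loss $L(\theta) = \frac{1}{2}(y - \hat{y})^2$; all parameter values and the input are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                       # input and "ground truth" target
w1, b1, wf, bf = 0.8, 0.1, -0.4, 0.2  # arbitrary parameter values

# Forward pass: nested operations from input to output
a1 = sigmoid(b1 + w1 * x)      # first layer activation, a_{f-1}
y_hat = sigmoid(bf + wf * a1)  # final output

# Backward pass: chain rule, working from the output inward
loss = 0.5 * (y - y_hat) ** 2
dL_dyhat = -(y - y_hat)           # derivative of the loss w.r.t. the output
dyhat_dz = y_hat * (1 - y_hat)    # sigmoid's derivative at the final layer
dL_dwf = dL_dyhat * dyhat_dz * a1 # dL/dw_f = L' * f_f' * a_{f-1}

print(loss, dL_dwf)  # the loss and the gradient component for w_f
```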
(Too-large initial parameter values can instead produce exploding gradients, which is an equally bad problem, but this is easier to deal with⁵.) Therefore, there are many strategies for how one should initialize the parameters of
a network. While there are no perfect strategies, the general idea is to
set values randomly with a mean of zero and a variance that’s smaller as
the number of parameters increases, such as in Xavier initialization or He
initialization.
Output layers
The final layer of a network is usually different from the middle, hidden
layers, as the final layer needs to produce values that are interpretable as
solutions to the problem the network is designed to solve. For example, the network may be predicting the class or label of the object in an image (a
classification problem) or it may be predicting the remaining useful life
of a tool (a regression problem). For different use cases, different types of
activation functions (a.k.a. layers) are used for outputs.
For regression problems a linear output function is appropriate, as it can take
on a wide range of values. The output layer would then have as many output
nodes as were needed for the prediction, e.g. one node for predicting the
sales price of a house versus two nodes for predicting the size of a rectangle
that could frame an object in an image.
While it may seem that a step function would be appropriate for a (binary)
classification output, a step function does not allow gradient-based learning.
Additionally, most classification methods actually produce a value that is
more like a probability. So instead of just 0 or 1 in the case of binary
classification, a value such as 0.78 is produced and then used either directly
or with a threshold for classification. By adjusting the threshold, the user
also has the ability to choose how comfortable they are with different
amounts of misclassification.
For multi-class classification, such as labeling many different species in
images, a function such as softmax is commonly used.
$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

⁵Gradients that are too large can be dealt with via gradient clipping: any gradient magnitude greater than some threshold is set to the threshold value.
Softmax, or soft argmax, gives the probability of the answer being of class i.
The sum in the denominator makes the total of all output values for classes
1 through n sum to one, as is required for probabilities. For a multi-class
classification network, the final layer would then have as many nodes as
there were classes, e.g. 10 nodes for classifying images of single digits 0
through 9.
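A sketch of softmax in NumPy, applied to some arbitrary final-layer node values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])  # arbitrary raw output-node values
probs = softmax(logits)

print(probs)        # [0.66 0.24 0.10]: a probability for each class
print(probs.sum())  # 1.0, as required for probabilities
```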
Neural networks transform feature space to make the problem easier to solve
In the remaining chapters of this book we will look at some of the techniques
and applications that have led to and come out of the deep learning
revolution.
Resources
Some further resources for learning the basics of neural networks and deep
learning:
Courses
• The NYU Deep Learning⁶ course (Theme 1) by Yann LeCun & Alfredo
Canziani.
• Fast.ai’s Practical Deep Learning for Coders⁷, with Sylvain Gugger
and Jeremy Howard.
• Andrew Ng’s Neural Networks and Deep Learning⁸ on Coursera.
Books and other resources

• Michael Nielsen’s Neural Networks and Deep Learning⁹ (free online book).
• 3Blue1Brown’s neural networks video series¹⁰.
• Brandon Rohrer’s End-to-End Machine Learning blog¹¹.
⁶https://atcold.github.io/NYU-DLSP21/
⁷https://course.fast.ai/
⁸https://www.coursera.org/learn/neural-networks-deep-learning
⁹http://neuralnetworksanddeeplearning.com/
¹⁰https://www.3blue1brown.com/topics/neural-networks
¹¹https://e2eml.school/blog.html#193
4. The rise of deep learning
In the previous chapters we have looked at “traditional” machine learning
techniques, including neural networks. These methods have proven useful
for addressing many predictive problems, but there is a reason this book is
about deep learning: deep learning has proven extremely good for certain
types of problems that traditional techniques have not excelled at.
Compared to traditional ML models, deep neural networks have proven to
be good at a wide range of problems, but have turned out to be particularly
good at problems involving computer vision (i.e. image based problems)
and natural language processing. In this chapter we will discuss how and
why deep learning came to the forefront of machine learning, with a very
brief overview of the history of deep learning¹. In the following chapters we
will go into more detail about techniques specifically designed to address
computer vision and natural language problems, as well as some generic
and advanced deep learning techniques and practical considerations.
From a theoretical point of view, it can be shown that networks with many
layers can represent certain mathematical functions² with exponentially
fewer nodes than would be needed to represent the function with a single
hidden layer. From a more empirical point of view, deeper networks seem
to have the ability to represent the data in a hierarchical way that fits
naturally to many types of data. Additionally with deeper networks, you
can create more problem specific network architectures, often combining
different layers in modular, purposeful ways, as we will see later.
Researchers came up with many of the key ideas for deep neural networks
decades ago, but they were not practical to train or use at that time. Several
of the key innovations, such as backpropagation and network architectures
well suited for solving computer vision and natural language problems,
were created in the 1980s and 1990s. Their potential was not fully realized,
however, due to the limited computing resources available at the time
(relative to modern computing hardware) and high cost of producing
enough training data.
While many of the core ideas for deep neural networks had been around for
a while, it was only within the last decade that they really took off. In many
ways, deep neural networks were “ahead of their time”, as the computing
technology of the day was simply too underpowered to demonstrate the
potential of neural networks.
In some ways, “deep learning” is a convenient marketing term. It has allowed for a break from the “old” neural network days, when neural networks often suffered from more hype than they could deliver on, signifying that we’ve entered a new era.
Regardless, “deep learning” is the most common name for this set of
technologies.
The trends that computing resources have followed, such as Moore’s law for
transistor density, have been the exponential increase of processor speed,
the exponential decrease in the cost of data storage, and the exponential
increase in network bandwidth. These trends (and other closely related
technology trends) have also led to an exponential increase in the amount of
data produced and stored. These trends led to the era of “Big Data”, starting
in the 2000s, and ultimately enabled the emergence of deep learning as a
leading ML technique.
One specific development that helped deep neural networks emerge was
the evolution of GPUs³ from being aimed solely at processing graphics to
enabling more general workflows. Because neural networks can be formu-
lated mathematically primarily as matrix operations, researchers were able
to port neural network operations to GPUs, taking advantage of the highly
parallel nature of the GPUs. Today GPUs are the workhorse hardware for
most deep learning training, while other new processor architectures have
been designed specifically to handle neural network processing.
The effect of these trends became apparent to the wider ML and computing
communities in 2012, when a neural network won the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) competition by a large margin. Up
until 2012, ILSVRC and other similar competitions had been dominated
by traditional computer vision feature extraction techniques paired with
traditional ML models, such as support vector machines. In 2012 a deep
convolutional neural network named AlexNet won the flagship ILSVRC
³Graphics processing units
challenge with an error rate more than 10% lower than the second place
finishers, a huge performance improvement over the rest of the field and
previous competitions.
AlexNet, designed by Alex Krizhevsky (namesake of the network), Ilya
Sutskever, and Geoffrey Hinton of the University of Toronto, mostly used
components and techniques that had existed previously, but was able to put
them together in a way that led to a huge performance increase. AlexNet consisted of convolutional layers and ReLU activations, used dropout for regularization, and was implemented in a way that allowed it to train on
multiple GPUs⁴. By combining these established ideas, some newer tricks,
and porting the code to run on the latest hardware, they were able to
train on a very large dataset and achieve a breakthrough in predictive
performance. The dataset itself, ImageNet, was much larger than previously
available datasets. The images were sourced from the internet and labeling
was performed by crowdsourcing, something previously not available⁵.
News of AlexNet’s win at ILSVRC in 2012 spread very quickly and by 2014
the challenge was dominated by competitors using deep neural networks.
More importantly, not only was the ML community taking notice, but the
wider tech world was paying attention and resources started being directed
towards further developing and using deep learning methods.
This new interest in neural networks led to many further, rapid develop-
ments, quickly improving the state-of-the-art performance on many tasks.
Important in this progress was the development of several open source
software projects, the open publishing of results (and often the models
themselves), and overall the new momentum of progress in the field, which
has seemingly grown consistently over the past ten years. Deep learning is
now used across a very wide range of academic and industry domains.
⁴We will cover the same components and techniques used in AlexNet in more detail in later chapters.
⁵https://image-net.org, Li Fei-Fei, et al.
In this book, I use the term “computer vision” to mean any image-related task that neural networks can be applied to, from image classification to image generation and beyond.
A related task is depth estimation, where instead of marking each pixel with a label, each pixel is given an estimate of distance from the camera, making it a pixel-wise regression problem.
For semantic segmentation, the most common metrics are intersection over
union (IoU) and the Dice coefficient for binary segmentation tasks and mean
IoU and mean Dice coefficient for multi-class tasks.
Image transformation is the task of changing an image in some way. This is
a broader category that has overlap with several other categories. Examples
include colorizing a black and white image, “super resolution”, where an
image’s resolution is increased by “intelligently” filling in new pixels, fixing problems in images, such as removing an object from an image or filling in an area of an image, changing an image from one “style” to another, and adding
elements to images, such as whimsical effects and filters to selfies. Tasks
such as colorization are very similar to segmentation, mentioned above, as
the task is to provide a label (i.e. color) or numerical value (i.e. pixel-wise
regression) for each pixel. Many image transformation tasks involve several
steps, such as detecting an object and then modifying the image in some
desired way, such as automatically blurring faces in images. Evaluation of
image transformations is very task specific.
Image generation is the creation of entirely new images or filling in regions
of images in some desired way. Examples include generating landscape
scenes, with or without input from the user, altering the pose or facial
expression of a person in an image, or generating models wearing clothing
products. Image generation and related tasks, such as image transformation,
require metrics that measure how similar the generated image is to being a
real image. We will look at some of these later when we look at generative
adversarial networks.
There are some more fundamental difficulties in computer vision tasks. For
example, you may be able to hard code a detector that finds triangles in an
image, but the logic may fail if the triangle is shifted, rotated, or skewed
somehow. As humans we know that the triangle is still a triangle despite
being moved to another part of the image, but capturing the essence of
“triangleness” may be very difficult in code. This is one reason for the desire
to create models that can learn how to determine “triangleness” on their
own by looking at lots of examples of triangles (or really objects that are
much more complicated than simple geometric shapes). If the fundamental
patterns of these objects and their common variations can be learned, a
much more robust model can be built.
Convolutions
A convolution in 1D
³This formula is technically a cross-correlation, rather than a convolution, but it produces an equivalent
output and is the form used in practice most of the time.
$$y_i = \sum_j w_j x_{i+j}$$

where $w_j$ is the filter weight for position $j$ in the filter array, $x_{i+j}$ is the (image) array value at $i+j$, and $y_i$ is the value of the convolution operation at position $i$ of the (image) array. As $i$ is increased, it's as if the filter is sliding along the (image) array and you get the result of applying the filter to the entire array.
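To make the sliding-filter picture concrete, here is a minimal NumPy sketch of this 1D operation (the function name and the toy filter values are illustrative, not from any library):

```python
import numpy as np

def conv1d(x, w):
    """Slide filter w along signal x (cross-correlation form, stride 1, no padding)."""
    n_out = len(x) - len(w) + 1
    y = np.zeros(n_out)
    for i in range(n_out):
        # y_i is the sum over j of w_j * x_{i+j}
        y[i] = np.sum(w * x[i:i + len(w)])
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])  # a simple edge-detecting filter
print(conv1d(x, w))             # [-2. -2. -2.]
```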
A convolution in 2D
The most important part to understand is that the same filter array of
There are several knobs that can be turned to tune convolutional filters in a
CNN. These are typically treated as hyperparameters of the network.
Filter size is the dimension of the filter array – how large the window is
that you’re sliding over the image. A smaller filter will focus on smaller
features, whereas a larger filter will focus on larger scale features.
A number of filters can be applied to an input array simultaneously in
parallel. This is somewhat analogous to the number of nodes in a standard,
fully-connected neural network layer. Each filter is applied independently,
learning its own weights, and producing its own feature map.
Stride is how far the filter is moved each time you slide it along. A “vanilla”
filter would be moved one pixel over, having a stride length of one. A stride
length of three would move the filter over by three pixels each time. Since
the number of outputs from performing a convolution depends on how
many positions the filter is applied to, a larger stride will produce fewer
outputs, and thus reduce the resolution of the resulting array from the
convolution. Selecting stride length depends on the desired effect and/or
computing resource constraints or efficiency goals and can be treated as a
hyperparameter.
Like large fully-connected networks, CNNs can end up with huge numbers of parameters. The more parameters, the more computing resources, especially memory, are needed to train and run the network.
One of the advantages of task specific networks, such as CNNs, is that
they can be designed to reduce the number of parameters in the network
in clever ways.
CNNs reuse the same filter weights across the entire image, rather than
learning separate weights for each part of the image. This is largely to
be able to detect specific features (a.k.a. motifs) that can be found in any
part of the image, but the side effect is to reduce the overall size of the
network.
Of course the general trend in deep neural networks is that bigger is
better: more layers allow you to find even more robust patterns. But,
you want those layers and the deep network to be as resource efficient
as possible.
Padding means adding extra data (pixels) to the edges of your image.
Unless your filter is of size 1x1 and stride 1, the result of the convolution
will be a matrix of smaller size (i.e. lower resolution) than the original
image. Additionally, the pixels at the edges of an image are covered by the sliding filter fewer times than pixels farther from the edge. Padding can
alleviate these issues. The amount of padding depends on the goal, e.g.
maintaining the same resolution, which is termed “same padding”. The most
common data value to use for padding is zero.
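Putting filter size, stride, and padding together: for an input of length $n$, filter size $f$, padding $p$, and stride $s$, the output length along one dimension is $\lfloor (n + 2p - f)/s \rfloor + 1$. A small sketch (a hypothetical helper, not from any library):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length of a convolution along one dimension."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, 3))            # 30: no padding shrinks the map
print(conv_output_size(32, 3, p=1))       # 32: "same" padding keeps the resolution
print(conv_output_size(32, 3, p=1, s=2))  # 16: stride 2 halves the resolution
```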
Pooling is the reduction of the size of the output array (a.k.a downsampling)
by performing an operation, such as taking the mean or the max, on the data
values in subarrays of the output. The values are “pooled” together. This in
turn reduces the resolution of the feature map as well as the number of
weights needed downstream. Conceptually, pooling is a way to compress
the information contained in a feature map, maintaining the larger scale
features while discarding some of the “noise”. Pooling also adds some
robustness to where in an image specific features are found, i.e. an object not in the center will be more likely to be detected as the same object as if it were located in the center of the image. A pooling filter of 2x2 pixels with
a stride of 2 is a common size.
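As a sketch, 2x2 max pooling with stride 2 can be written directly in NumPy (assuming the feature map's dimensions are even):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Downsample a 2D feature map by taking the max of each 2x2 block."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap))  # [[ 5.  7.]
                           #  [13. 15.]]
```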
AlexNet
In the “modern era” of deep neural networks, AlexNet⁴ is considered the most important from a historical point of view. It is not a network that is still used, but its win in the 2012 ILSVRC classification challenge (judged on top-5 error) was the event that kicked off much of the current deep learning era.
AlexNet is similar to the basic CNN described earlier, with five convolutional layers, using ReLU and max pooling, and two hidden fully connected layers at the end, followed by a softmax layer for making multi-class classifications.
⁴See the discussion of AlexNet in Chapter 4 for more historical context.
ResNet
In principle, a deeper network should perform at least as well as a shallower one, because its extra layers could simply learn the identity function and pass data through unchanged. In practice, though, this was not happening. He, et al, hypothesized that instead
of learning the identity function, it would be easier for the network to learn
the difference between the input data (for a layer) and the optimal output
data: the residual. To achieve this they added so-called “skip connections”,
allowing data to bypass some number of convolutional layers. That input
data is then summed with the output data of the convolutional layers.
While He, et al, didn’t invent skip connections, ResNet went on to win
the ILSVRC 2015 competition and popularized the idea of skip connections
and residual blocks. Architectures based on these ideas are still the most
commonly used CNNs, used for image classification and as the “backbone”
of networks designed for other vision-related tasks. Some of the successor
architectures to ResNet include DenseNet, ResNeXt, and ResNeSt.
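A minimal PyTorch-style sketch of a residual block (real ResNet blocks also include batch normalization and channel-changing variants; this sketch assumes equal input and output channels so the shapes match for the sum):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # Skip connection: the convolutional layers only need to learn the residual
        return self.relu(out + x)
```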
The autoencoder
Autoencoders
An autoencoder is a network designed to learn how to encode or com-
press an image (or other input). For images, it’s essentially two CNNs,
with the second flipped around. The first “encodes” the input image by
learning features and squeezing the resolution down to some smaller
size. The second part (a.k.a. the “decoder”) rebuilds the image back to
its original resolution and contents using an upsampling method. The
difficulty in doing this is that you’re trying to represent the full image
with a smaller amount of data (i.e. compression).
An autoencoder can be trained with unlabeled data, where the input
and output are compared (an example of “self-supervised” learning). The
network attempts to reconstruct the input with the lowest error possible.
Historically this was a method to “pre-train” layers within CNNs on
unlabeled datasets, which were typically larger than the labeled dataset.
The convolutional layer from the encoder could then be re-used in
the CNN, where further supervised training would take place. Larger
labeled datasets, more powerful computing resources, and more effi-
cient CNN architectures made this approach less common, but similar
approaches have become very important in other areas, such as natural
language processing.
See the section on “transfer learning” later in the chapter.
See “transposed convolutions” below.
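As a toy illustration, a convolutional autoencoder in PyTorch might look like the following sketch (the layer sizes assume 28x28 single-channel inputs and are arbitrary choices):

```python
import torch.nn as nn

autoencoder = nn.Sequential(
    # Encoder: squeeze the resolution down while learning features
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(),
    # Decoder: upsample back to the original resolution
    nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2),    # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2),     # 14x14 -> 28x28
    nn.Sigmoid(),
)
# Self-supervised training: the loss is the reconstruction error,
# e.g. nn.MSELoss()(autoencoder(x), x)
```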
The U-Net architecture was introduced in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox for use in biomedical image segmentation, but it has since been adopted for segmentation tasks across many domains.
U-Net
Transposed Convolution
Transposed convolutions
Unless specifically designed not to, most convolution operations result
in a feature map that is smaller (i.e. lower resolution) than the input
(image) data. Several tasks require the opposite result, i.e. increasing
the resolution. One way to increase the array size of the data is with
a “transposed convolution”.
A transposed convolution applies a filter to the input data in such a way
as to produce a larger output array. It has similarities to the convolution
process, but also differences. The filter array is applied to only a single
cell of the input array at a time, such that the output is the value of
the input cell multiplied by the value of each cell of the filter. These
values are then added to the output array in a specific position. The
filter array is then moved to the next cell of the input array and the
process is repeated, with these values added to the output array in the
corresponding position of the input cell. The values put into the output
array that overlap are simply summed.
Like a normal convolution the values of the filter array are learned in
the training process, such that they produce the best values to achieve
the overall task of the network. As with convolutions, there are a few
key hyperparameters that can be chosen to achieve your desired goal,
including filter array size, padding, and stride length.
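A hypothetical 1D sketch of the process just described, showing how overlapping contributions are summed and how the output grows:

```python
import numpy as np

def transposed_conv1d(x, w, stride=1):
    """Upsample x: each input cell scales the whole filter into the output; overlaps sum."""
    n_out = (len(x) - 1) * stride + len(w)
    y = np.zeros(n_out)
    for i, value in enumerate(x):
        # Add the scaled filter into the output at this input cell's position
        y[i * stride : i * stride + len(w)] += value * w
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5])
print(transposed_conv1d(x, w, stride=1))  # [1.  2.5 4.  1.5] -- overlaps summed
print(transposed_conv1d(x, w, stride=2))  # [1.  0.5 2.  1.  3.  1.5] -- larger output
```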
A naive approach to object detection is to examine many subregions of an image, for example by sliding a window over the image, and run a
classifier on each subregion to predict the presence of an object of the class
of interest. In principle this works, but it comes at a high computational
cost.
YOLO (You Only Look Once) is an efficient object detection algorithm that
uses several tricks to make it both accurate and light on resources. It was
introduced by Joseph Redmon, et al in 2015. YOLO breaks up the image
into a grid and uses convolutional blocks to detect objects everywhere in
the image in a single pass (hence, “you only look once”). Training data must
be labelled in a YOLO-specific way to reflect which grid cell it belongs to,
the presence or absence of an object, the object class, and the boundary box
coordinates. Grid cells without objects still need labelled data, though the
details of the class and bounding box don’t matter.
An object is assigned to the grid cell that contains its center, even though the object may extend into multiple grid cells. When setting up the YOLO model, the user can choose the number of cells and the number of object centers to look for in each cell, as there could be more than one object centered in a cell. This is a limitation and trade-off of the model, but is largely dictated by
the type of data. Very busy images with a lot of objects close to one another
require more grid cells and / or more object centers per cell.
An object detection model, such as YOLO, may end up predicting several
bounding boxes for a single object. To decide which bounding box to choose,
a complementary algorithm called non-maximum suppression, or NMS, is
used. NMS looks at both the confidence of the model’s prediction for a
certain class and the overlap of candidate bounding boxes, as measured by
intersection over union (IoU).
Non-maximum suppression
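A simplified sketch of the greedy NMS procedure (boxes given as (x1, y1, x2, y2) corner coordinates; the helper names are illustrative):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-confidence box, drop candidates that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes whose overlap with the chosen box is too high
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```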
YOLO is one of the most important object detection architectures. There are
multiple versions of YOLO (some not directly related to the first version) and
several other networks that descend from YOLO or incorporate its ideas.
Image generation is the task of creating new images from scratch, modifying
existing images, or filling in areas (a.k.a. inpainting) of existing images. In
2014 Ian Goodfellow et al introduced the generative adversarial network
(GAN), which became the basis for a large family of neural network archi-
tectures that could generate images extremely well. GANs have primarily
been applied to images, but have also proven useful for generating some
other types of data, such as audio.
GANs consist of two main parts: the “generator” and the “discriminator”.
Each is a separate neural network. The generator is a network which
produces an output image, while the discriminator is a network that tries
to classify images as either coming from the training set or having been
produced by the generator. Training the two networks becomes a game,
with the generator trying to learn how to produce images that can fool the
discriminator into thinking that its images are actually from the training
set, while the discriminator is trying to learn how to detect the false images.
This is the “adversarial” part of GANs. The game between the generator and
the discriminator is what makes them able to learn so well, but it can also
lead to instabilities in the training process.
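One round of this game, in rough PyTorch-style code (a sketch assuming `generator`, `discriminator`, their optimizers, and a batch of `real_images` already exist, and that the discriminator ends in a sigmoid):

```python
import torch
import torch.nn.functional as F

batch_size, latent_dim = 64, 100  # illustrative sizes
real_labels = torch.ones(batch_size, 1)
fake_labels = torch.zeros(batch_size, 1)

# Discriminator step: push real images toward 1, generated images toward 0
fake_images = generator(torch.randn(batch_size, latent_dim)).detach()  # no generator grads here
d_loss = (F.binary_cross_entropy(discriminator(real_images), real_labels)
          + F.binary_cross_entropy(discriminator(fake_images), fake_labels))
d_optimizer.zero_grad(); d_loss.backward(); d_optimizer.step()

# Generator step: try to fool the discriminator into outputting 1
fake_images = generator(torch.randn(batch_size, latent_dim))
g_loss = F.binary_cross_entropy(discriminator(fake_images), real_labels)
g_optimizer.zero_grad(); g_loss.backward(); g_optimizer.step()
```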
Improvements to the basic GAN framework have yielded very high quality
images that are often difficult for humans to recognize as being synthetic. As
we will discuss in Chapter 7, GANs and other image generation models can
be designed to use user input to guide image generation toward a desired
output. Other notable image generation architectures include variational
autoencoders (VAEs), flow-based models, and diffusion models, the latter getting a lot of attention in 2022. We will look more at diffusion models
in Chapter 7.
There are a number of techniques that many deep learning models use that are worth being familiar with, because they pop up so often.
Here I will touch on some of the most important ones.
Regularization
The goal of all machine learning models is to learn the general patterns of
the real-world distribution of data from the (limited) sample of training data.
The opposite of that is overfitting. As discussed in Chapter 2, regularization
is one of the approaches for reducing the chances of overfitting to the
training data.
There are several flavors of regularization. Two of the most commonly
used for neural networks are dropout and adding constraints to parameters
within the loss function (e.g. L1 and L2 regularization).
Dropout
Dropout randomly “drops” (temporarily sets to zero) a fraction of a layer’s nodes during each training step. Because the network cannot rely on any single node, it is pushed toward more redundant, robust representations, which reduces overfitting. The fraction of nodes dropped is a hyperparameter.
L1 and L2 Regularization
L1 and L2 regularization add a penalty term to the loss function based on the magnitude of the network’s weights, pushing the network to put weight on only the most important nodes and connections. This means that it cannot as
easily fit noise within the training data, as it doesn’t have an unlimited
weight budget. See Chapter 2 for more detail and an illustration of how this
type of regularization affects overfitting and generalization.
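As a sketch, adding an L2 penalty by hand in PyTorch might look like the following (`model`, `task_loss`, and `lambda_l2` are assumed to exist; in practice this is usually handled via the optimizer's `weight_decay` argument):

```python
# L2 penalty: sum of squared weights, scaled by the regularization strength
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = task_loss + lambda_l2 * l2_penalty  # lambda_l2 is a hyperparameter
```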
Data augmentation
Models are typically able to better learn the general patterns and avoid
overfitting if they are trained on more data. More data is almost always
better. Since data can often be “expensive” to obtain, however, one approach
to increasing the size of the training set is by synthetically creating new,
labeled data. For images, this can be as easy as flipping the image right to
left. For most scenarios, that type of change would result in the same label
on the data.
Data Augmentation
There are many ways in which image data can be augmented: moving,
rotating, stretching, scaling, cropping, blurring, or flipping the image,
adding noise, changing the colors, etc. The important part of augmenting the
training set with data transformations is that the new image is not changed
so much that its label is no longer appropriate.
By automating this process, a training data set may be increased many-fold.
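With torchvision, for example, a modest augmentation pipeline might look like this sketch (the specific transforms and their parameters are illustrative choices):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),        # safe for most natural images
    transforms.RandomRotation(degrees=10),    # small rotations keep the label valid
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# Applied on the fly, each epoch sees a slightly different version of every image.
```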
Batch normalization
Batch normalization normalizes a layer’s outputs (rescaling and recentering them) across each training mini-batch. In practice this tends to stabilize and speed up training, and it appears in many modern architectures, including ResNet.
Gradient descent variants
As we saw in Chapter 3, training a neural network means taking steps “down hill” on the multi-dimensional loss function surface. The down hill direction is found by calculating the gradient for each parameter and the parameter values are updated via the process of backpropagation.
It turns out that “vanilla” gradient descent has some limitations in practice,
primarily the speed of the process and the amount of resources needed to
handle large data sets (and large data examples, like images). In order to
overcome some of these limitations, several variants of gradient descent
have been developed. Here we will look at a couple of the most important
ones.
Stochastic gradient descent
Instead of calculating the value of the loss function over all of the examples in the training set, SGD instead calculates the value of the loss
function on a single example. Inherently, this means that the estimate of
the loss function is worse than averaging the loss value from all available
data, but it means that it can be performed very quickly. This “quick and
dirty” estimate of the loss function means that the subsequent update (or
step) in the parameter values will likely be sub-optimal – not necessarily
the best direction or step size. This is the “stochastic” part of SGD.
It turns out, that by taking many steps based on the single data example
estimate of the loss function and thus the gradient, the model’s parameters
will still end up moving in the right direction. It’s like a random walk, but
with the wind blowing the walker in the right direction. Training may need
more steps, but the overall training time can be greatly reduced. This has
proven to be a very useful approach in practice.
While SGD was formulated using one example at a time, it’s more com-
monly done with a small number of examples or “batch” of data, sampled
from the training data set.
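In sketch form, a single mini-batch SGD update is just a small step against the batch gradient estimate (`params` and `grads` assumed to be matching lists of NumPy arrays):

```python
def sgd_step(params, grads, lr=0.01):
    """One mini-batch SGD update: step each parameter against its batch gradient."""
    for p, g in zip(params, grads):
        p -= lr * g  # in-place update; g is the "quick and dirty" batch gradient
```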
Adam
SGD is not without its own limitations. The direction of the next step in SGD
is based purely on the gradient of the current mini batch. Due to the small
size of the batch, that can mean the next step is in a very different direction
than the previous step. While this will work out over time, researchers have
realized that if the direction of the next step is less abrupt, the training is
likely to converge to the best parameter values more quickly.
One mechanism for taking smoother steps is called “momentum”. With
momentum the process takes steps as if the gradient was more like a force
acting on a moving mass, which wants to maintain its current trajectory.
This is accomplished by essentially taking a moving average of the previous
steps along with the current gradient estimate. The magnitude of the
momentum is a hyperparameter.
Another issue that SGD has is related to learning rates. If the same learning
rate is used the entire time, SGD is prone to overshooting good minima.
Intuitively, gradient descent is a bit like golf. In the beginning you (typically)
need to take long shots and then as you approach the hole, you take short
shots. This can be better achieved for SGD by dividing the learning rate
by a term related to the size of recent steps. If the most recent steps have
been large, the learning rate is scaled down. Techniques called AdaGrad⁸
and RMSProp⁹ both used this kind of scaling on a per parameter basis.
Adam is a popular gradient descent technique which incorporates both
momentum to smooth out the steps and per parameter learning rate
scaling to speed up optimization. Besides learning rate, Adam has two
hyperparameters related to these modifications to SGD (as well as batch
size).
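A simplified sketch of the Adam update for one parameter array, following the published algorithm (β₁ and β₂ are the two extra hyperparameters mentioned above):

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style average m, squared-gradient scale v."""
    m = beta1 * m + (1 - beta1) * g           # moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return p, m, v
```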
Transfer learning
Transfer Learning
Transfer learning is the re-use of a model trained on one task as the starting point for a different task. On a practical level, transfer learning has been very important because it has
allowed users with lower levels of resources to create models that meet their
needs. Many researchers and others who have built models requiring large
amounts of data and computing resources have made these trained models
available for free to the community. Many of the deep learning toolkits and
libraries now include pre-trained models out of the box.
Feature transfer
There are a few ways to “re-use” models trained for one task on another task.
The most straightforward method is called “feature transfer”. In this case the
majority of the network is reused as is, but the final layer is modified to fit
the new task (e.g. new output classes or moving from a classification task
to a regression task). This new final layer is then trained on a small(er) set
of data, utilizing the features produced by the rest of the network, which it
had previously learned for its original task. In this scenario, the parameters
of the rest of the network are said to be “frozen” while the final layer is
trained.
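In PyTorch, feature transfer with a pre-trained backbone might look like this sketch (the 10-class head is an arbitrary example, and the exact `weights` argument varies across torchvision versions):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # a pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                   # "freeze" the learned features

# Replace the final layer to fit the new task
model.fc = nn.Linear(model.fc.in_features, 10)
# Only the new model.fc parameters will be updated during training.
```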
Fine tuning
With fine tuning, rather than freezing the whole pre-trained network, some or all of its layers continue to be trained on the new task’s data, typically with a small learning rate so that the previously learned features are only gently adjusted.
Courses
Books
¹⁰https://course.fast.ai/
¹¹https://www.coursera.org/learn/convolutional-neural-networks
¹²https://atcold.github.io/NYU-DLSP21/
¹³https://zefsguides.com
¹⁴http://neuralnetworksanddeeplearning.com/
¹⁵https://e2eml.school/blog.html#193
6. Natural language
processing and sequential
data techniques
Natural language processing (NLP) is another area where deep learning has
emerged as the state of the art approach in the past decade. In this chapter
we will look at some of the approaches to handling NLP tasks and other
sequential data tasks. Specifically, we will look at recurrent neural network
architectures (RNNs) as well as attention-based architectures (transform-
ers).
Some of these are relatively simple tasks for computers, such as spelling
correction, while others, such as text generation and translation are harder
problems to solve, as they require much stronger understanding of the
patterns of language.
More general than natural language tasks are sequential data tasks. Sequen-
tial data is any data that has an order, where the order provides part of the
information contained in the data. Besides (most) language, this includes
things like:
• Audio signals
• Sensor data, such as from medical, vehicle, and equipment monitors.
• Video
• Computer logs, such as website traffic.
All of the above have the commonality that the data and its order provide
important contextual information for the task. This could be granular, such
as the letters in a word, or larger scale, such as themes in an essay. All of
this provides important context for accomplishing the task of interest.
The above tasks can be described in very generic terms by looking at the
size and role of the sequence in the task.
You can break down these tasks into categories such as one to one, one to many, many to one, and many to many, based on the lengths of the input and output sequences.
Traditional approaches
One difficulty is that words take on many forms, sometimes modifying the meaning. For example, “dog” and “dogs” are simple singular and plural variations of the same word, but “dog” can also be a
verb or used as an adjective, “dogged”, which are related, but distinct from
the canine animal noun sense of the word. You can easily imagine how
complicated things can get.
While languages generally follow grammatical rules, there are numerous
exceptions to those rules, making it extremely difficult to capture all of the
edge cases that exist in real world usage. These kinds of flexibility make
many natural language tasks very difficult for computers.
Traditionally, NLP and sequence-based tasks have relied on methods using
some combination of hand-coded logic and statistical understandings of
language data. Raw language and sequence data typically needs to have a
lot of preprocessing performed to make it usable by computers and to create
features that can be used for making predictions or decisions.
Preprocessing and explicit feature creation can be as basic as splitting text
into words based on spaces or as complex as calculating word frequencies
and distributions. Much of the work related to using traditional NLP meth-
ods is around creating these usable features and removing low information
words, such as very common words like “the” or pause words like “uh” in
speech.
Some commonly seen traditional techniques for text related tasks include
tokenization, stemming, lemmatization, identifying stop words, creating n-
grams, calculating term frequency inverse document frequency (TF-IDF),
edit distances, and bag of words and word count. Some of these are also used
with deep learning techniques and we will touch on those later. For other
types of sequence data, auto-regressive features are commonly created, such
as sliding means.
With hand-engineered features, traditional ML methods, such as support
vector machines and tree-based models, as well as non-ML models, such
as hidden Markov models and Bayesian techniques, can be applied to
NLP problems. Those same techniques can be applied to other sequential
data problems. Time series related tasks are often approached with so-
called “time series methods”, such as ARIMA, which typically do not have
learnable parameters.
The raw output of the network that has processed the previous item in the
sequence plays the role of memory and is typically referred to as the hidden
state of the network. This would allow this chain of networks to process
the sequence in order, extract features from the inputs, and maintain some
memory to establish context.
• The length of the inputs and outputs are not always fixed for the
same task.
• A lot of parameters would be needed, so you lose out on pa-
rameter sharing that can take place in a specialized sequence
architecture like an RNN.
RNNs and other sequence architectures are designed to deal with these
issues efficiently.
While this might work for some sequence tasks, where the inputs and
outputs have fixed lengths, there is a much better way to do this with only
a small change to this “chain of networks” architecture: adding recurrence.
RNNs are essentially chains of networks, but instead of having several sub-
networks, they use the same network for each network in the chain, feeding
the raw output of the network back to itself in order to process the next
step in the sequence. This is the recurrence. This means that each step in
the sequence shares the same parameters and that the network can handle
sequences of variable length.
The mechanics of a generic RNN are thus: For a sequence of N items, such
as words, the network first looks at the first word, x0 , and produces two
outputs, the first hidden state, h0 , and the first output, ŷ0 . The first output
may or may not be used, depending on the task. The hidden state is passed
back into the network for use with the second input, x1 . This process then
repeats, until the entire sequence has been processed.
The hidden state captures information that can be passed through all steps
of the sequence, providing some context for the task. Note that the network
starts with no hidden state input, so an input of zero value is typically used.
More generally, the hidden state from step $t$ can be represented as $h_t$ and the output for step $t$ as $\hat{y}_t$, with

$$h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t)$$

$$\hat{y}_t = W_{hy} h_t$$

where the $W$'s are learned weight matrices representing the connections of a single layer³ and $\sigma$ is the layer's activation function. Each weight matrix takes a specific input to produce a specific output, as designated by the indices.
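These update rules translate directly into a loop, as in this NumPy sketch of a single-layer RNN forward pass (random weights and a toy input sequence, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_xh = rng.normal(size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(size=(n_out, n_hidden))     # hidden -> output

sequence = [rng.normal(size=n_in) for _ in range(5)]  # a toy input sequence

h = np.zeros(n_hidden)                 # no hidden state yet, so start from zeros
for x in sequence:
    h = np.tanh(W_hh @ h + W_xh @ x)   # new hidden state from old state + input
    y_hat = W_hy @ h                   # this step's output (may or may not be used)
```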
As with other neural networks that we’ve seen, RNNs are trained via
(versions of) gradient descent and backpropagation. The recurrent nature of
the RNN requires backpropagation to effectively be performed not just once,
but back through each step of a sequence. This is termed “backpropagation
through time” or BPTT.
³We’re using the notation of a single layer network here for convenience. Real RNNs would likely have multiple hidden layers, which would require more complicated notation to represent the iteration over the hidden layers.
While the network’s weights are “reused”, each step in processing a sequence provides the network with additional information, so each step must be taken into account. You can think of backpropagation as taking place over the
entire unfurled network. The loss function is a sum of the losses for each of
the sequence step outputs (if there are more than one).
Embeddings
Embeddings are ways to represent data that are richer in information than
simple numerical encodings, such as one-hot-encodings. Though definitions
vary somewhat, I will use the definition that an embedding is a “learned, low
dimensional representation of data that increases the information contained
[for the task of interest], versus a simple enumeration”.
What does that mean? It means that we want to be able to use the data itself
to learn relationships and other information that is implicitly captured in
the data. In Chapter 3 we saw how neural networks learn transformations
to make the task of the network easier to perform. Embeddings are a version
of this. The network learns how to transform the data in such a way that it
is more useful than the raw data.
Embeddings
What does “low dimensional” mean? Low dimensional means using fewer
dimensions (i.e. the length of the vector) to represent the data than you
would need for one-hot-encoding. If your vocabulary had 1,000 words,
instead of 1,000 dimension one-hot-encoded vectors, you might use an
embedding vector of size 40. The size of the embedding vector (dimension)
is a hyperparameter. Instead of a sparse, high dimensional encoding, the
data is now in a dense, lower dimensional space.
If the embedding is created via a process that needs to extract relationship
information from the words, it will necessarily create embedding vectors
that exhibit these relationships.
Embeddings are very useful in several contexts. Because you can compute
the (dis)similarity of two embedding vectors with each other, you can use
embeddings for things like search and ranking (especially when embedding
things like images, websites, or products). One of the most common ways to
compare two embedding vectors is with cosine similarity. Cosine similarity
is simply the cosine of the angle between two vectors and ranges from -1
to +1. Vectors that have a smaller angle between them have a higher cosine
similarity.
Cosine similarity
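In code, cosine similarity is one line (a NumPy sketch):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: +1 same direction, -1 opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```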
It’s easy to think about embedding vectors for similar data being close to
each other and thus having a small angle between them, but of course most embeddings have enough dimensions that you cannot visualize what this looks like.
The most straightforward way that embeddings are created is by creating
a “transformation matrix”⁵, which, when multiplied with the one-hot-
encoded vector, encodes (or projects) the one-hot-encoded vector into the
embedding space, i.e. you go from your 1,000 dimension one-hot-encoded
vector to the 40 dimensional embedding vector. In a typical neural network,
this is the equivalent of adding an additional layer to the network.
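Because a one-hot vector has a single 1, multiplying it by the transformation matrix just selects one of the matrix's rows, which is why embedding layers are implemented as simple lookups. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 40
E = rng.normal(size=(vocab_size, embed_dim))  # the transformation matrix

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# The matrix product and the direct row lookup give the same embedding vector
assert np.allclose(one_hot @ E, E[word_index])
```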
For word related tasks, there are a few historically important methods for
creating word embeddings, including word2vec and GloVe⁶. We will look
in more detail at word2vec.
Word2vec
Famously, the inventors showed that you could perform meaningful “em-
bedding math”, whereby basic math on the embedding vectors can produce
results such as “brother” - “man” + “woman” = “sister” (i.e. if you subtract
the “man” vector from the “brother” vector and then add the “woman”
vector to that, the closest embedding vector to this result is the vector
for “sister”). The conclusion being that word2vec (and similar) embeddings
are able to represent words in ways that capture semantic meaning inter-
pretable by humans.
Word2vec works by training a shallow neural network⁸ to solve a self-
supervised learning task. To do this, a large set of text data is used as
a training set and an appropriate task is chosen. There are a number of
training tasks that are used, but we will discuss building a model to solve
the task called “skipgrams with negative samples”.
⁸A “shallow” neural network is simply a network with only one hidden layer.
Skipgrams are sequences of text where one word is the context word and
the words surrounding it are blanked out. The goal of the task is to guess the
words most likely found around the context word – sort of the opposite of
“fill in the blank”. The number of surrounding words is called the “window
size” and is set by the user. It’s essentially a multi-label classification
problem. For a given context word input, the network needs to predict
all of the correct labels, i.e. the surrounding words. By moving through
the training text and choosing each word as the context word and the
surrounding words as the target labels, you provide the model with a large
set of self-supervised training data.
Word2Vec training
This can work, but one of the main problems with this task formulation is
that the possible number of labels is essentially the size of the vocabulary,
which can be very large. The creators of word2vec realized that they could
simplify this task by turning it into a binary task instead. The new task
is to simply classify whether two words, the context word and another
word, would be found in the same sequence window. The window itself
only provides positive samples for this task (i.e. words that are found in the
same window as the context word), so you also need to generate negative
sample words (i.e. words that would not be found in the same window). To
do this, you can randomly select words from the full vocabulary of the text.
With negative samples, the task then becomes to train the model to classify
pairs of words as coming from the window or not. For the model to perform
well on this task, the network needs to transform the input one-hot-encoded
words into a lower dimension space, such that words that would be found
together in windows would be also near each other in this embedding
space. Words not found together within the same window should not be
near each other in the embedding space. This “nearness” (or similarity) is
calculated within the network by performing an inner (or dot) product of
the embedding vectors, which is closely related to cosine similarity.
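A sketch of the scoring at the heart of this binary task (two randomly initialized embedding tables that would be learned during training; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 40
E_context = rng.normal(size=(vocab_size, embed_dim))  # embeddings for context words
E_word = rng.normal(size=(vocab_size, embed_dim))     # embeddings for candidate words

def same_window_probability(context_idx, word_idx):
    """Model's probability that the two words appear in the same window."""
    score = E_context[context_idx] @ E_word[word_idx]  # inner product similarity
    return 1.0 / (1.0 + np.exp(-score))                # sigmoid squashes to (0, 1)

# Training pushes this probability up for words from the same window (positive
# pairs) and down for randomly sampled negative pairs.
```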
A GRU (gated recurrent unit) modifies the basic RNN by adding two learned “gates” that control how the hidden state is updated:
• The “reset gate” determines how much of the previous hidden state should be used to create the new candidate hidden state.
• The “update gate” determines how much of the new candidate state it
should use and how much of the previous hidden state it should use
as the new hidden state.
LSTMs (long short-term memory networks) follow a similar gating idea. The main differences are that LSTMs use an additional “output gate”
and pass two separate hidden states between each recurrence, one is the
activation function output (used in RNNs and GRUs as the hidden state)
and the other is a gated version of the activation function output, i.e. the
same value, but multiplied by a factor between 0 and 1.
Similar to GRUs these gates and the resulting learned hidden state manage-
ment allow LSTMs to be more robust and better at predicting the solutions
to sequential data tasks. These differences make the LSTM generally more
powerful than GRUs, but also somewhat more complicated.
Given the success that LSTMs have had and their (relatively) long history,
the LSTM has been one of the most influential deep learning architectures.
Attention
The modifications to generic RNNs introduced by LSTMs and GRUs allow
for much higher performance, but they are not without limitations or issues.
Due to the sequential nature of recurrence, RNNs cannot be parallelized
for training, making them poorly suited for very large scale training sets.
Additionally, for the same reason, they are limited to sequences of only tens
or hundreds of tokens.
“Attention” is an alternative method of considering context in sequential
tasks. It was introduced in 2014 by Dzmitry Bahdanau et al and has been a
highly influential technique. Instead of focusing on the hidden state passed
from the most recent recurrence, as in an RNN, attention determines how
much weight should be given to each previous step¹⁰. Attention is especially
relevant in sequence to sequence tasks, such as language translation or
document summarization.
For producing translations, summaries, or similar tasks, the higher level
architecture that is typically used is called an encoder-decoder architecture.
One RNN encodes the input sequence into a hidden state. Another RNN
decodes that hidden state into an output sequence.
¹⁰RNNs can run forward, backward, or in both directions. For tasks such as language translation,
bidirectional RNNs are advantageous, because the order of words varies in different languages and looking
ahead can be important.
“The dog, that lived two houses away and liked to patrol the
neighborhood, started barking.”
“Der Hund, der zwei Häuser weiter wohnte und gerne in der
Nachbarschaft patrouillierte, fing an zu bellen.”
“El perro, que vivía a dos casas de distancia y le gustaba patrullar el vecindario, empezó a ladrar.”
All have phrases between the subject and verb clauses. Only looking at the
words immediately preceding the verb phrase might cause the subject to be
lost.
Attention is a quantification of how much the current output sequence step should consider different input sequence steps. It's calculated by looking at the hidden state of the previous output sequence step, which we'll denote as $S_{t-1}$, and the hidden states of the input sequence steps, e.g. $h_{t-3}, h_{t-2}, \dots, h_{t+3}$.
The attention value is calculated by learning the weights for a small neural network that takes the previous state, $S_{t-1}$, and an input hidden state, e.g. $h_{t-2}$, and outputs an attention value, $\alpha_{t,t-2}$. This is calculated for each input hidden state, and the attention values are normalized so that their sum, $\sum_i \alpha_{t,i}$, equals one. The attention values are then used to create a weighted sum of the input sequence hidden states used to calculate the current sequence output, $S_t$.
Attention weights
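In sketch form, the weighting step looks like this (the bilinear score function here is a stand-in for the small learned network, and the softmax is one standard choice for the normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
input_states = [rng.normal(size=8) for _ in range(5)]  # encoder hidden states
S_prev = rng.normal(size=8)                            # previous decoder state

W_score = rng.normal(size=(8, 8))  # stand-in for the small learned network
def score(s, h):
    return s @ W_score @ h

raw = np.array([score(S_prev, h) for h in input_states])
alphas = np.exp(raw) / np.exp(raw).sum()  # normalized attention weights, sum to 1
context = sum(a * h for a, h in zip(alphas, input_states))  # weighted sum for the decoder
```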
With attention, models are able to better focus on what parts of the input
are important to the output. While attention was very useful in the context
it was developed, it went on to be even more important with the further
development of self-attention and transformers.
Transformers
In 2017, Vaswani et al published a groundbreaking paper titled “Attention
Is All You Need”¹¹, which introduced the Transformer architecture. The
Transformer is a novel network architecture that doesn’t use recurrence as
its main mechanism, allowing for parallelization of training and the capture of longer range relationships within sequences.
¹¹Vaswani et al, “Attention Is All You Need”, 2017, https://arxiv.org/abs/1706.03762
Consider a word like “kind”, which can be a noun (a type of thing) or an adjective (caring). Having only one embedding for “kind” would either end up capturing only
one meaning of “kind” or ending up as some average of meanings, which
likely wouldn’t be very useful. We’ll see in a little bit how the Transformer
architecture enables the more useful contextual embedding.
The Transformer is a relatively complex architecture with several parts. We
will cover the most important parts and innovations, while leaving some
details out. Please see the resources section at the end of the chapter if you’d
like to dive deeper.
Both the encoder and decoder parts make use of two important innovations:
so-called “self-attention” and “multi-head attention”. Self-attention is a way
for the network to determine which parts of the sequence are important
in a way that is somewhat analogous to how a convolutional layer learns
to extract or filter parts of interest of an image. Multi-head attention is
essentially doing this in parallel, analogous to having multiple channels in a
convolutional block in a CNN. Like a channel in a convolutional block, each
attention head learns different features in the sequence appropriate for the
task at hand.
Self-attention
With the attention mechanism described earlier, the decoder learns which of the encoder hidden states to pay more attention to for a specific output
step. With self-attention, the network is trying to figure out which parts of
the input to pay more attention to for all output steps as a single process. In
this sense, the network is comparing the current input to itself to understand
which parts should get more emphasis. It can be used in an encoder-decoder
architecture or just an encoder or decoder.
Self-attention is accomplished by having the network “ask” itself about what
is important at each step in the network where self-attention is applied. This
is formulated as having the network create “queries” from the layer’s input
tokens. The network then compares each query to a set of “keys” generated
for the same inputs of the layer. Finally, there are “values” created from the
inputs that correspond to each key. If a query and a key are similar, then
the value for that key is given more weight (i.e. attention).
You can think of the “query”, “key”, and “value” arrays as learned embed-
dings. The query-key comparison is a search in the embedding space and
more weight is given to values that are better matches.
Mathematically these are matrix operations, where the input array (or
tensor) for a given word in the input sequence is multiplied by the query
matrix, Q, the key matrix, K , and the value matrix, V , to produce the query,
key, and value arrays, respectively. The values in these matrices are model
parameters that are learned during training. The similarity of the resulting query and key arrays is calculated by taking the inner product, producing a single weight. Each query is compared with all keys, including the query’s
own key.
The weight values for each word in the input sequence are calculated this
way and then they are divided by a scaling factor and squished with a
softmax¹³ function. For each query array, all the value arrays are multiplied
by their corresponding weights. These scaled arrays are then summed,
producing the attention weighted outputs for each input of the self-attention
¹³Refer back to Chapter 4 for details on the softmax function.
layer (i.e. if there are four input tokens, there will be four outputs).
Self-attention weights
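Putting the pieces together, single-head self-attention boils down to a few matrix operations, as in this NumPy sketch (X holds one embedding per input token, row-wise; no masking):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head (no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values for every token
    weights = Q @ K.T / np.sqrt(K.shape[-1])  # every query compared with every key
    weights = np.exp(weights)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each query's row
    return weights @ V                        # n inputs in, n weighted sums out

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))                  # four tokens, embedding size 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 16): four outputs
```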
Although there is clear structure and intent to this design of queries, keys,
and values, you can also think of this as a blackbox feature extractor that the
network learns, similar to a convolutional block. Ultimately, self-attention
is trying to take n inputs and return n outputs that are the best weighted
sum of some embedded version of those inputs. The “best” weighted sum
being whatever will best help the network solve its task.
Multi-head attention lets several attention heads process the same input sequence and learn different ways to represent what’s in the sequence.
This is similar to multiple channels in a convolutional layer in a CNN –
each channel or attention head can perform its own feature extraction. The
outputs of the multiple heads are combined.
Also analogous to CNNs having multiple convolutional blocks, the Trans-
former has several encoder blocks in series, each having a chance to extract more useful features.
The input to the initial layer of the first block is different, in that an
embedding layer is applied and a special position encoding operation is
performed. The embedding layer puts the input sequence elements (i.e.
words) into the right dimension array. The positional encoding allows the
network to have a sense of where in the sequence each input is.
The decoder is very similar to the encoder side, but with a few differences to
allow it to produce sequences as output. The input sequence to the decoder
is actually from itself – it’s a recurrent network that feeds its own output
back in to build up a sequence. Because the decoder is producing the output
sequentially, it uses a mask to ignore the “future” items in the sequence (during training this prevents the decoder from cheating by looking ahead at the target sequence). This “masked
attention” block comes before the (multi-head) self attention layer. The self-
attention layer and feed-forward layers are the same architecture as in the
encoder, except that the inputs are coming from two places: the masked
attention block and the output of the encoder block. The output of the encoder is transformed into the keys and values for this attention layer, while the queries come from the decoder’s masked attention block. This special mixing of inputs is called “cross-attention”.
One of the first transformer based models to get wide use was BERT, a model
created in 2018 by researchers at Google AI¹⁴. It was able to set many records
on language related benchmarks. BERT stands for “Bidirectional Encoder
Representations from Transformers”. It only used the encoder section of the
Transformer architecture, because that’s all it needed for the tasks it was
designed to perform.
¹⁴Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding, 2018. https://arxiv.org/abs/1810.04805
Courses
Books
²⁶https://jalammar.github.io/illustrated-transformer/
²⁷https://e2eml.school/transformers.html
7. Advanced techniques and
practical considerations
Beyond the task specific architectures introduced in the last two chapters,
there are many techniques and practical considerations important for
people working with deep learning to be familiar with. In this chapter we
will discuss some advanced architectures, common techniques, and some of
the problems and tasks that need to be solved in order to make deep learning
useful in the real world.
Because this chapter will be wide-ranging, I will include pointers to re-
sources in the sections that they are associated with, rather than at the end
of the chapter.
Image captioning
One of the tasks that was mentioned in the last chapter was that of image
captioning, which is simply adding a descriptive sentence or phrase to go
along with an image. While we have considered both image and text-based
tasks, how we could combine these may not be obvious. As it turns out, the key is embeddings.
In Chapter 6 we discussed how embeddings are a way to take data and create
new richer, typically lower dimensional, representations of it by learning a
transformation (i.e. projecting the data into a lower dimensional space). We
can then use the embeddings as inputs to other systems or networks.
For image captioning, a common way to achieve this is to use a (pre-trained)
CNN to produce embeddings as the initial hidden state input to an RNN or
Transformer decoder model – an example of feature transfer.
Joint embeddings
Embeddings are “abstract” representations of data as vectors. Unlike images,
sentences, or other human-interpretable types of data, they are just arrays
of numbers optimized to contain as much useful information as possible.
While that may seem intangible, it also presents us with the opportunity to
create relationships between different data types for solving tasks such as
search and recommendations.
Diffusion models
Diffusion is an image generation technique that has taken the deep learning
world by storm in 2022². Similar to GANs discussed in Chapter 5, diffusion
learns to construct images from scratch. Unlike GANs, diffusion uses a
multi-step generation process that seems to be more robust during training
and ultimately seems to produce better results.
¹https://openai.com/blog/clip/
²I’m very curious to see how this innovation ages.
Stable Diffusion
While you can take an image of pure noise and run it through a diffusion
network to get an image, it turns out that you can both direct the image
generation process and produce a clearer image by using a “prompt” to guide
to the process.
A prompt is simply another piece of data, such as a text description or
another image, such as a sketch. Unsurprisingly, to input that prompt into
the network, you’re gonna need an embedding.
In 2022 several diffusion models with text and/or image-based prompts were announced, including DALL-E 2⁴ from OpenAI, Imagen⁵ from
Google Research, and Stable Diffusion⁶ from Stability AI. These models
work by training a diffusion model to incorporate a prompt (embedding)
to “condition” the output of the model. The U-net in the diffusion model
learns to change its output based on the input prompt.
Arguably, Stable Diffusion has made the biggest impact of these models,
as its code was released publicly for all to use. Stable Diffusion works in
a slightly different way than the other models mentioned, as it performs
diffusion in the “latent space” (essentially the same as the embedding space)
rather than in “pixel space”. We will look at Stable Diffusion as an example,
but overall it is very similar to the other popular diffusion-based image
generation models.
⁴https://openai.com/dall-e-2/
⁵https://imagen.research.google/
⁶https://stability.ai/blog/stable-diffusion-announcement
Stable diffusion
The prompts, whether text or image, are embedded with a CLIP-style model.
During inference⁸, the U-net is run with both the prompt as an input and
with a null prompt as an input (i.e. an array of zeros). The output of these
are compared, with the idea that the embedding space vector corresponding
to the difference between the prompted output and the prompt-less output
points in the direction of the best de-noising. The prompted output is then
moved in that direction (by some amount). This is essentially a trick and is
termed “classifier-free guidance”.
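In sketch form, the guidance step is a simple extrapolation (all names here — unet, latents, prompt_embedding, null_embedding, guidance_scale — are illustrative, not any library's API):

```python
# Predicted noise with and without the prompt, from the same U-net at step t
noise_cond = unet(latents, t, prompt_embedding)
noise_uncond = unet(latents, t, null_embedding)  # null prompt: an all-zeros embedding

# Move the prediction further in the direction the prompt points
noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```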
To enable the prompt to guide the de-noising, cross-attention is included
in the U-net to allow it to better incorporate the prompt embedding
information.
Taken all together, these steps have proven to be very powerful in creating
highly realistic and novel images when trained on large, high quality
datasets.
Self-supervised learning
Most of this book has been concerned with supervised learning, where a
model is trained on a dataset where the answers (i.e. the labels) are known.
We also touched on unsupervised training, where a model learns to look for
groupings of similar data within a dataset without the use of labels to guide
it.
A third paradigm for training models is self-supervised learning, which is in some ways a mix of supervised and unsupervised learning⁹. Acquiring unlabeled data is often much easier and/or cheaper than acquiring or curating
labeled data. Self-supervised learning is a solution to this problem¹⁰.
In self-supervised learning, a supervised learning task is chosen, such that
the data labels can be deterministically derived from the data itself. This
often means “scrambling” the original data in some way, with the original
⁸In ML, “inference” just means making predictions with the model after it’s trained.
⁹Historically this was most often grouped with unsupervised learning before the term “self-supervised
learning” gained popularity.
¹⁰It can be argued that it is also more similar to much learning done by naturally intelligent systems.
state of the data serving as the label. Examples include reordering the letters of a word, where the correct order is the label, and learning to colorize a photo, where the input is a grayscale version of an image and the target is the original, color image.
Once a model is trained on this self-supervised task, the model can be used
as the basis for transfer learning – either as feature transfer (e.g. using the
outputs as embeddings) or using a labeled dataset to finetune the model for
a related task.
In previous chapters we have seen several examples of self-supervised
learning already, including autoencoders and GANs (Chapter 5), word and
text embedding techniques, including word2vec and Transformers (Chapter
6), and diffusion techniques in this chapter.
Image-based techniques
Contrastive learning
Contrastive learning is a self-supervised approach, where the goal is to
train the model to learn to produce the “correct” output well and to be
bad at producing the “incorrect” output. That means that the model needs
to learn from both “positive” and “negative” labels. In typical supervised
learning this is inherent in the data, as the training set will (hopefully!)
have examples from other classes, so the model needs to be able to predict
all different classes.
We have seen examples of this style of self-supervised learning already:
word2vec skipgrams with negative sampling and CLIP. Both present the
scenario where you could just train on the “positive” examples, but instead
are greatly enhanced by also trying to predict against non-positive examples
as well. In word2vec the task boils down to whether two words are from the
same window of words. By also asking the model to compare against words
from outside the given window, the “contrastive” aspect is added. In CLIP
the model needs to force the paired image and caption embeddings to be
similar, but also the image embeddings should be dissimilar to the caption
embeddings of other images.
Contrastive learning
In practice, the negative examples used for contrastive training are just randomly sampled from the larger dataset. Because this is
not human labeled, there’s no guarantee that every contrastive sample will
actually be contrastive (i.e. you might randomly select a word that happened
to also be in the same window or a caption from an image that was similar).
The goal is that statistically there will be enough contrastive examples to
produce a strong model.
As with most topics in this book, we’ve only scratched the surface of con-
trastive learning, but the key idea is there: learning better representations of
the data to improve your ability to perform subsequent, related tasks (a.k.a.
the “downstream” task).
Some survey articles on self-supervised and contrastive learning techniques:
Linear algebra
Linear algebra is the mathematics of vectors, matrices, and higher order tensors, which is how the data, parameters, and operations of neural networks are represented.
Statistics and probability
Statistics and probability are at the heart of all machine learning. Machine
learning is about recognizing and approximating the general pattern in the
data. Instead of memorizing the training data, a well-trained model will
produce good predictions, because it has learned the general patterns in the
data.
Understanding concepts of distributions, summary statistics, probability,
etc, are key to really understanding and working with machine learning.
Not only does statistics underlie how neural networks work, but also how
you evaluate the performance of ML models. Performance metrics are
statistical measures of how good a model’s predictions are. Additionally, to
compare two models, you may want to employ statistical hypothesis testing.
Some resources for beginners in statistics and probability:
¹³https://www.khanacademy.org/math/linear-algebra
¹⁴https://ocw.mit.edu/courses/18-06-linear-algebra-spring-2010/
¹⁵https://math.mit.edu/~gs/everyone/
Differential calculus
The “learning” part of deep learning is finding the right parameter values
for the network to best perform at its task. As we saw in Chapter 3, gradient
descent is the key method used to optimize a network’s parameters. At the
heart of gradient descent is calculus.
While most deep learning practitioners are not solving many integrals
or calculating derivatives by hand, calculus is the mathematics of ML
parameter optimization. Understanding derivatives (in particular partial
derivatives) is important for understanding how and why much of the
optimization process works in ML and DL, in particular. It’s often said that
backpropagation is really “just the chain rule” for derivatives.
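As a one-line illustration (a generic example, not from the text): for a loss $L$ applied to a single neuron output $a = \sigma(wx + b)$, the chain rule gives

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \sigma'(wx + b) \cdot x$$

and backpropagation is this composition applied repeatedly, layer by layer, from the output back to the earliest weights.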
Some resources for beginners in calculus:
Rather than writing model code from scratch, deep learning practitioners
use libraries designed specifically for creating deep neural networks. Li-
braries such as PyTorch²¹ and TensorFlow²² center on optimized tensor
operations, but also have many built-in conveniences for creating neural
networks.
Models are usually just a single, small part of real-world machine learning
systems. A typical ML system that runs “in production” will include:
Wrapping up
The history of neural networks is long and filled with booms and busts. By all indications, though, “this time is different”. The deep learning era kicked
off in the early 2010s has arguably brought more innovation and, crucially,
more value than any of the previous eras of neural networks or artificial
intelligence.
This book’s goal is to give the reader a conceptual overview of deep learning, touching on the most important concepts and topics, as well as diving deeper into some of the key technical aspects. Hopefully the book
succeeded in helping you gain a more complete understanding of how
everything fits together and provided you with some good jumping off
points for further learning.
The breakneck pace of innovation in deep learning means that it’s more
difficult than ever to keep up with the latest techniques and technologies²⁵.
Hopefully this book has made it easier for you to at least understand the big
picture of deep learning and how the latest techniques fit in, if not actually
understand the technical aspects of those techniques.
I am sure that we are in for many more interesting developments. Please be
on the lookout for additional books in the Zefs Guides series!
²⁵During the writing of this book in the first half of 2022 I had only intended to include diffusion as a bullet point in a list, but due to the sudden explosion of interest in diffusion-based models in the second half of 2022 I decided it warranted inclusion. Things move quickly!