Introduction to Machine Learning and Neural Networks
CONTENTS
Introduction and Applications of Machine Learning / 309
K-fold Cross-Validation for Evaluating Prediction/Test Accuracy / 310
Other Applications / 311
Avoiding Under/Overfitting in a Neural Network for Regression / 312
Comparing Neural Networks for Image Classification / 314
Cross-Validation for Evaluating Predictions of Earth System Model Parameters / 315
Suggested Reading / 317
Quizzes / 317
In this chapter we introduce basic concepts and algorithms from machine learning. We explain how neural networks can be used for regression and classification problems, and how cross-validation can be used for training and testing machine learning algorithms.

INTRODUCTION AND APPLICATIONS OF MACHINE LEARNING

Machine learning is the domain of computer science which is concerned with efficient algorithms for making predictions in all kinds of big data sets. A defining characteristic of supervised machine learning algorithms is that they require a data set for training. The machine learning algorithm then memorizes the patterns present in those training data, with the goal of accurately predicting similar patterns in new test data. Many machine learning algorithms are domain-agnostic, which means they can be applied to data from many different subject areas. One classic application is image classification, a problem from machine learning and computer vision. In this problem, we would like a function that can input an image, and output an integer which indicates class membership. More precisely, let us consider the MNIST and Fashion-MNIST data sets (Figure 36.1), in which each input is a grayscale image with height and width of 28 pixels, represented as a matrix of real numbers x ∈ R^{28×28} (LeCun et al., 1998; Xiao et al., 2017). In both the MNIST and Fashion-MNIST data sets each image has a corresponding label which is an integer y ∈ {0, 1, …, 9}. In the MNIST data set each image/label represents a digit, whereas in Fashion-MNIST each image/label represents an article of clothing.

[Figure: a learning algorithm inputs train data, Learn(train data) = g, and the learned prediction function g is then used to compute predictions on test data, e.g., g(image) = 0 or g(image) = 1.]
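To make these data concrete, here is a minimal sketch for loading and inspecting them; it assumes TensorFlow is installed (its keras.datasets module ships both MNIST and Fashion-MNIST).

import numpy as np
# Minimal sketch (assumes TensorFlow is installed): load MNIST;
# Fashion-MNIST can be loaded the same way via fashion_mnist.load_data().
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)       # (60000, 28, 28): grayscale images, 28 x 28 pixels
print(np.unique(y_train))  # [0 1 2 3 4 5 6 7 8 9]: integer class labels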
Figure 36.2. K = 3 fold cross-validation. Left: the first step is to randomly assign a fold ID from 1 to K to each of the observations/rows. Right: in each of the k ∈ {1, …, K} splits, the observations with fold ID k are set aside as a test set, and the other observations are used as a train set to learn a prediction function (f1–f3), which is used to predict for the test set, and to compute accuracy metrics (A1–A3).
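The following sketch illustrates this K = 3 fold procedure using scikit-learn; the random data and the nearest-neighbors learner are placeholders, not choices made in this chapter.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(90, 5)              # placeholder inputs/features
y = np.random.randint(0, 2, 90)        # placeholder output labels
kf = KFold(n_splits=3, shuffle=True, random_state=1)  # random fold IDs
for k, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    f_k = KNeighborsClassifier().fit(X[train_idx], y[train_idx])  # learn on train set
    pred = f_k.predict(X[test_idx])    # predict for the held-out test set
    print(f"A{k} =", accuracy_score(y[test_idx], pred))  # accuracy metric A_k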
For example, in image segmentation the output is a mask (one element for every pixel in the image) indicating whether or not that pixel contains an object of interest.
Machine learning can be used for automatic translation between languages. In this context the input is a text in one language (e.g., French) and the output is the text translated to another language (e.g., English). The desired prediction function f inputs a French text and outputs the English translation.

Machine learning can be used for medical diagnosis. For example, Poplin et al. (2017) showed that retinal photographs can be used to predict blood pressure or risk of heart attack. Since the output y is a real number (e.g., blood pressure of 120 mm of mercury), we refer to this as a regression problem.
AVOIDING UNDER/OVERFITTING IN A NEURAL NETWORK FOR REGRESSION
In this section we begin by explaining the prediction function and learning algorithm for a simple neural network. We then demonstrate how the number of iterations of the learning algorithm can be selected using a validation set, in order to avoid underfitting and overfitting.

We consider a simple regression problem for which the input x ∈ R is a single real number (D = 1 feature/column in the design matrix), and the output y ∈ R is as well. Using a neural network with a single hidden layer of U units, there are two unknown parameter vectors which need to be learned using the training data, w ∈ R^U and v ∈ R^U. The prediction function f is then defined as:
f(x) = w^T σ(xv) = w^T z,    (36.1)
where σ : R^U → R^U is a non-linear activation function, and z ∈ R^U is the vector of hidden units. Typical activation functions include the logistic sigmoid σ(t) = 1/(1 + exp(−t)) and the rectifier (or rectified linear units, ReLU) σ(t) = max(0, t). The prediction function is learned using gradient descent, which is an algorithm that iteratively minimizes the mean squared error of the predictions with respect to the N samples in the train set,

L(w, v) = (1/N) Σ_{i=1}^{N} [f(x_i) − y_i]^2.    (36.2)
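As a concrete illustration of equations (36.1) and (36.2), here is a minimal NumPy sketch, assuming the logistic sigmoid as activation; the function names are our own.

import numpy as np

def sigma(t):                        # logistic sigmoid activation
    return 1.0 / (1.0 + np.exp(-t))

def f(x, w, v):                      # prediction function, eq. (36.1)
    z = sigma(x * v)                 # hidden units z in R^U (x is a scalar)
    return w @ z                     # inner product w^T z

def L(w, v, X, Y):                   # mean squared error, eq. (36.2)
    return np.mean([(f(x, w, v) - y) ** 2 for x, y in zip(X, Y)])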
Gradient descent begins using uninformative parameters w^0, v^0 (typically random numbers close to zero), then at each iteration t ∈ {1, …, T} the parameters are improved by taking a step of size α > 0 in the negative gradient direction,

w^t = w^{t−1} − α ∇_w L(w^{t−1}, v^{t−1}),    (36.3)

v^t = v^{t−1} − α ∇_v L(w^{t−1}, v^{t−1}).    (36.4)
The algorithm described above is referred to as "full gradient" because the gradient descent direction is defined using the full set of N samples in the train set. Other common variants include "stochastic gradient" (gradient uses one sample) and "minibatch" (gradient uses several samples). When doing gradient descent on a neural network model, one "epoch" includes computing gradients once for each sample (e.g., 1 epoch = 1 iteration of full gradient, 1 epoch = N iterations of stochastic gradient).
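The following sketch implements the full gradient variant for this one-hidden-layer network, with the gradients of equations (36.3) and (36.4) written out by hand; the sine-wave data and the hyper-parameter values are our own assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 100)                # N = 100 train inputs, x in R
Y = np.sin(X) + rng.normal(0, 0.1, 100)    # noisy nonlinear outputs
U, T, alpha = 50, 400, 0.1                 # hidden units, iterations, step size
w = rng.normal(0, 0.01, U)                 # w^0: random numbers close to zero
v = rng.normal(0, 0.01, U)                 # v^0

for t in range(T):                         # 1 iteration of full gradient = 1 epoch
    Z = 1 / (1 + np.exp(-np.outer(X, v)))  # hidden units for all N samples (N x U)
    resid = Z @ w - Y                      # f(x_i) - y_i for every sample
    grad_w = 2 / len(X) * (Z.T @ resid)    # gradient of L with respect to w
    dZ = np.outer(resid, w) * Z * (1 - Z)  # backpropagate through the sigmoid
    grad_v = 2 / len(X) * (X @ dZ)         # gradient of L with respect to v
    w -= alpha * grad_w                    # eq. (36.3)
    v -= alpha * grad_v                    # eq. (36.4)

A stochastic or minibatch variant would compute the same gradients on one or a few samples per iteration, instead of all N.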
In the algorithm above, the number of hidden units U, the number of iterations T, and the step size α must be fixed before running the learning algorithm. These hyper-parameters affect the learning capacity of the neural network. An important consideration when using any machine learning algorithm is that you most likely need to tune the hyper-parameters of the algorithm in order to avoid underfitting and overfitting. Underfitting occurs when the learned function f provides accurate predictions neither for the train data nor for the test data. Overfitting occurs when the learned function f only provides accurate predictions for the train data (and not for the test data). Both underfitting and overfitting are bad, and need to be avoided, because the goal of any learning algorithm is to find a prediction function f which provides accurate predictions in test data. How can we select hyper-parameters which avoid overfitting? Note that the choice of hyper-parameters should be based on a validation set which is held out from the data used for gradient descent (Figure 36.3).
Figure 36.3. Illustration of underfitting and overfitting in a neural network regression model (single hidden layer, 50 hidden units). Left: noisy data with a nonlinear sine wave pattern (grey circles), learned functions (colored curves), and residuals/errors (black line segments) are shown for three values of epochs (panels from left to right) and two data subsets (panels from top to bottom). Right: in each epoch the model parameters are updated using gradient descent with respect to the subtrain loss, which decreases with more epochs. The optimal/minimum loss with respect to the validation set occurs at 64 epochs, indicating underfitting for smaller epochs (green function, too regular/linear for both subtrain/validation sets) and overfitting for larger epochs (purple function, very irregular/nonlinear, so a good fit for the subtrain but not the validation set).
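A minimal sketch of this subtrain/validation procedure, using the same toy setup as the sketch above; the 80/20 split and the maximum of 500 epochs are our own choices, so the selected epoch will differ from the figure's 64.

import numpy as np

rng = np.random.default_rng(2)
N, U, alpha = 200, 50, 0.1
X = rng.uniform(-2, 2, N)
Y = np.sin(X) + rng.normal(0, 0.1, N)           # noisy sine wave pattern
is_sub = rng.random(N) < 0.8                    # split train set into subtrain
Xs, Ys = X[is_sub], Y[is_sub]                   # (used for gradient descent)
Xv, Yv = X[~is_sub], Y[~is_sub]                 # and validation (for selection)

w, v = rng.normal(0, 0.01, U), rng.normal(0, 0.01, U)
val_loss = []
for epoch in range(500):
    Z = 1 / (1 + np.exp(-np.outer(Xs, v)))      # gradient step on subtrain only
    resid = Z @ w - Ys
    grad_w = 2 / len(Xs) * (Z.T @ resid)
    grad_v = 2 / len(Xs) * (Xs @ (np.outer(resid, w) * Z * (1 - Z)))
    w -= alpha * grad_w
    v -= alpha * grad_v
    Zv = 1 / (1 + np.exp(-np.outer(Xv, v)))     # loss on held-out validation set
    val_loss.append(np.mean((Zv @ w - Yv) ** 2))

best_epochs = int(np.argmin(val_loss)) + 1      # epoch with minimum validation loss
print("selected number of epochs:", best_epochs)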
Figure 36.4. Prediction accuracy of functions learned for image classification of handwritten digits. The baseline function always predicts the most frequent class in the train set; the other three learned functions are neural networks with different numbers of hidden layers (linear=0, conv=2, dense=8).
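The baseline in Figure 36.4 is simple to compute; in this sketch the labels are placeholders.

import numpy as np

y_train = np.array([1, 1, 0, 1, 2, 1])                 # placeholder train labels
y_test = np.array([1, 0, 1, 1])                        # placeholder test labels
classes, counts = np.unique(y_train, return_counts=True)
most_frequent = classes[np.argmax(counts)]             # baseline always predicts this
print(np.mean(y_test == most_frequent))                # baseline accuracy: 0.75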
Figure 36.5. Cross-validation for estimating error rates of machine learning algorithms that predict earth system model parameters. Top: fold IDs were assigned to each observation using longitude (left) or randomly (right). Bottom: prediction error for four of the 25 outputs. Please see Tao et al. (2020) for the meanings of the abbreviations (cryo, maxpsi, tau4s3, fs2s3).
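A sketch of the two fold-assignment schemes from the top panels; the longitude values and K = 3 are hypothetical. Assigning folds by contiguous longitude bands presumably estimates error when predicting for geographic regions absent from the train set, whereas random folds estimate error for new observations from the same regions.

import numpy as np

rng = np.random.default_rng(0)
K = 3
longitude = rng.uniform(-180, 180, 300)                 # hypothetical coordinates
random_fold = rng.integers(1, K + 1, longitude.size)    # random fold IDs in 1..K
edges = np.quantile(longitude, np.linspace(0, 1, K + 1))  # contiguous bands
lon_fold = np.digitize(longitude, edges[1:-1]) + 1      # fold ID = longitude band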