
CHAPTER THIRTY-SIX

Introduction to Machine Learning and Neural Networks

Toby Dylan Hocking


Northern Arizona University, Flagstaff, USA

DOI: 10.1201/9780429155659-46

CONTENTS
Introduction and Applications of Machine Learning
K-fold Cross-Validation for Evaluating Prediction/Test Accuracy
Other Applications
Avoiding Under/Overfitting in a Neural Network for Regression
Comparing Neural Networks for Image Classification
Cross-Validation for Evaluating Predictions of Earth System Model Parameters
Suggested Reading
Quizzes

In this chapter we introduce basic concepts and algorithms from machine learning. We explain how neural networks can be used for regression and classification problems, and how cross-validation can be used for training and testing machine learning algorithms.

INTRODUCTION AND APPLICATIONS OF MACHINE LEARNING

Machine learning is the domain of computer science which is concerned with efficient algorithms for making predictions in all kinds of big data sets. A defining characteristic of supervised machine learning algorithms is that they require a data set for training. The machine learning algorithm then memorizes the patterns present in those training data, with the goal of accurately predicting similar patterns in new test data. Many machine learning algorithms are domain-agnostic, which means they have been shown to provide highly accurate predictions in a wide variety of application domains (computer vision, speech recognition, automatic translation, biology, medicine, climate science, chemistry, geology, etc.).

For example, consider the problem of image classification from the application domain of computer vision. In this problem, we would like a function that can input an image, and output an integer which indicates class membership. More precisely, let us consider the MNIST and Fashion-MNIST data sets (Figure 36.1), in which each input is a grayscale image with height and width of 28 pixels, represented as a matrix of real numbers x ∈ R^{28×28} (LeCun et al., 1998; Xiao et al., 2017). In both the MNIST and Fashion-MNIST data sets each image has a corresponding label which is an integer y ∈ {0, 1, …, 9}. In the MNIST data set each image/label represents a digit, whereas in Fashion-MNIST each image/label represents a category of clothing (0 for T-shirt/top, 1 for Trouser, 2 for Pullover, etc.). In both data sets the goal is to learn a function f: R^{28×28} → {0, 1, …, 9} which inputs an image x and outputs a predicted class f(x), which should ideally be the same as the corresponding label y.

Figure 36.1. A learning algorithm inputs a train data set, and outputs a prediction function, g or h. Both g and h input a grayscale image and output a class (integer from 0 to 9), but g is for digits and h is for fashion.
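To make these input/output types concrete, the following R sketch (assuming the keras package and its data set download helpers are installed) loads both data sets and inspects their dimensions and labels; the object names are arbitrary.

    library(keras)
    # Each helper returns a list with train and test components (x = images, y = labels).
    mnist <- dataset_mnist()            # digit images
    fashion <- dataset_fashion_mnist()  # clothing images, same sizes and label coding
    dim(mnist$train$x)      # 60000 x 28 x 28: each input is a 28 x 28 grayscale matrix
    table(mnist$train$y)    # labels are integers 0-9, one per image
    table(fashion$train$y)  # Fashion-MNIST uses the same ten integer codes for clothing classes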
As mentioned above, a big advantage of supervised learning algorithms is that they are typically domain-agnostic, meaning that they can learn accurate prediction functions f using data sets with different kinds of patterns. That means we can use a single learning algorithm LEARN on either the MNIST or Fashion-MNIST data sets (Figure 36.1, left). For the MNIST data set the learning algorithm will output a function for predicting the class of digit images, and for Fashion-MNIST the learning algorithm will output a function for predicting the class of a clothing image (Figure 36.1, right). The advantage of this supervised machine learning approach to image classification is that the programmer does not need any domain-specific knowledge about the expected pattern (e.g., shape of each digit, appearance of each clothing type). Instead, we assume there is a data set with enough labels for the learning algorithm to accurately infer the domain-specific pattern and prediction function. This means that the machine learning approach is only appropriate when it is possible/inexpensive to create a large, labeled data set that accurately represents the pattern/function to be learned.

How do we know if the learning algorithm is working properly? The goal of supervised learning is generalization, which means the learned prediction function f should accurately predict f(x) = y for any inputs/outputs (x, y) that will be seen in a desired application (including new data that were not seen during learning). To formalize this idea, and to compute quantitative evaluation metrics (accuracy/error rates), we need a test data set, as explained in the next section.

K-FOLD CROSS-VALIDATION FOR EVALUATING PREDICTION/TEST ACCURACY

Each input x in a data set is typically represented as one of N rows in a "design matrix" with D columns (one for each dimension or feature). Each output y is represented as an element of a label vector of size N, which can be visualized as another column alongside the design matrix (Figure 36.2, left). For example, in the image data sets discussed above we have N = 60,000 labeled images/rows, each with D = 784 dimensions/features (one for each of the 28 × 28 pixels in the image).

Figure 36.2. K = 3 fold cross-validation. Left: the first step is to randomly assign a fold ID from 1 to K to each of the observations/rows. Right: in each of the k ∈ {1, …, K} splits, the observations with fold ID k are set aside as a test set, and the other observations are used as a train set to learn a prediction function (f1–f3), which is used to predict for the test set, and to compute accuracy metrics (A1–A3).

The goal of supervised learning is to find a prediction function f such that f(x) = y for all inputs/outputs (x, y) in a test data set (which is not available for learning f). So how do we learn f for accurate prediction on a test data set, if that test set is not available? We must assume that we have access to a train data set with the same statistical distribution as the test data. The train data set is used to learn f, and the test data can only be used for evaluating the prediction accuracy/error of f.

Some benchmark data sets which are used for machine learning research, like MNIST and Fashion-MNIST, have designated train/test sets. However, in most applications of machine learning to real data sets, train/test sets must be created. One approach is to create a single train/test split by randomly assigning a set to each of the N rows/observations, say 50% train rows and 50% test rows. The advantage of that approach is simplicity, but the drawback is that we can only report accuracy/error metrics with respect to one test set (e.g., the algorithm learned a function which accurately predicted 91.3% of observations/labels in the test set, meaning 8.7% error rate).
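As a minimal illustration of a single random train/test split, the following base R sketch uses simulated labels and, as a stand-in for a learned function, a constant prediction (the most frequent label in the train set); a real application would replace that line with an actual learning algorithm.

    set.seed(1)
    N <- 1000
    y <- sample(0:9, N, replace = TRUE)                   # simulated labels
    is.train <- sample(rep(c(TRUE, FALSE), each = N / 2)) # random 50% train / 50% test split
    # Stand-in learned function: always predict the most frequent train label.
    pred.label <- as.integer(names(which.max(table(y[is.train]))))
    test.accuracy <- mean(pred.label == y[!is.train])     # proportion correct on the test set
    test.accuracy                                         # test error rate is 1 - test.accuracy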
In addition to estimating the accuracy/error rate, it is important to have some estimate of variance in order to make statements about whether the prediction accuracy/error of the learned function f is significantly larger/smaller than other prediction functions. The other functions to compare against may be from other supervised learning algorithms, or some other method that does not use machine learning (e.g., a domain-specific physical/mechanistic model). A common baseline is the constant function f(x) = y_0, where y_0 is the average or most frequent label in the train data. This baseline ignores all of the inputs/features x, and can be used to show that the algorithm is learning some non-trivial predictive relationship between inputs and outputs (for an example see Figure 36.4).

The K-fold cross-validation procedure generates K splits, and can therefore be used to estimate both mean and variance of prediction accuracy/error. The number of folds/splits K is a user-defined integer parameter which must be at least 2, and at most N. Typical choices range from K = 3 to 10, and usually the value of K does not have a large effect on the final estimated mean/variance of prediction accuracy/error. The algorithm begins by randomly assigning a fold ID number (integer from 1 to K) to each observation (Figure 36.2, left). Then for each unique fold value from 1 to K, we hold out the corresponding observations/rows as a test set, and use data from all other folds as a train set (Figure 36.2, right). Each train set is used to learn a corresponding prediction function, which is then used to predict on the held-out test data. Finally, accuracy/error metrics are computed in order to quantify how well the predictions fit the labels for the test data. Overall, for each data set and learning algorithm the K-fold cross-validation procedure results in K splits, K learned functions, and K test accuracy/error metrics, which are typically combined by taking the mean and standard deviation (or median and quartiles). Other algorithms may be used with the same fold assignments, in order to compare algorithms in terms of accuracy/error rates in particular data sets.

For example, Figure 36.4 uses K = 4-fold cross-validation to compare four learned functions on an image classification problem. The accuracy rates of the "dense" and "linear" functions, 97.4 ± 1.6% and 96.3 ± 1.9% (mean ± standard deviation), are not significantly different. Both rates are significantly larger than the accuracy of the "baseline" constant function, 16.4 ± 1.4%, and smaller than the accuracy of the "conv" function, 99.3 ± 1.1%. We can therefore conclude that the most accurate learning algorithm for this problem, among these four candidates, is the "conv" method (which uses a convolutional neural network, explained later). It is important to note that statements about which algorithm is most accurate can only be made for a particular data set, after having performed K-fold cross-validation to estimate prediction accuracy/error rates.

OTHER APPLICATIONS

So far we have only discussed machine learning algorithms in the context of a single prediction problem, image classification. In this section we briefly discuss other applications of machine learning. In each application the set of possible inputs x and outputs y are different, but machine learning algorithms can always be used to learn a prediction function f(x) ≈ y. Jones et al. (2009) proposed to use interactive machine learning for cell image classification in the CellProfiler Analyst system. This application is similar to the previously discussed digit/fashion classification problem, but with only two classes (binary classification). In this context the input is a multi-color image of a cell, x ∈ R^{h×w×c}, where h, w are the height and width of the image in pixels, and c = 3 is the number of channels used to represent a color image (red, green, blue). The output y ∈ {0, 1} is a binary label which indicates whether or not the image contains the cell phenotype of interest.

Some email programs use machine learning for spam filtering, which is another example of a binary classification problem. When you click the "spam" button in the email program you are labeling that email as spam (y = 1), and when you respond to an email you are labeling that email as not spam (y = 0). The input x is an email message, which can be represented using a "bag-of-words" vector (each element is the number of times a specific word occurs in that email message).
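To make the bag-of-words representation concrete, here is a small base R sketch with made-up messages and a made-up vocabulary; each row of the resulting count matrix could serve as one input x for a spam classifier.

    messages <- c("win money now", "meeting at noon", "win a free prize now")
    vocab <- c("win", "money", "free", "meeting", "now")
    # One row per message, one column per vocabulary word; entries count word occurrences.
    bag.of.words <- t(sapply(strsplit(messages, " "), function(words) {
      sapply(vocab, function(w) sum(words == w))
    }))
    rownames(bag.of.words) <- messages
    bag.of.words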
Russell et al. (2008) proposed the LabelMe tool for creating data sets for image segmentation, which is more complex than the previously discussed image classification problems. In this context the input x ∈ R^{h×w×c} is typically a multi-color image, and the output y ∈ {0, 1}^{h×w} is a binary mask (one element for every pixel in the image) indicating whether or not that pixel contains an object of interest.

Machine learning can be used for automatic translation between languages. In this context the input is a text in one language (e.g., French) and the output is the text translated to another language (e.g., English). The desired prediction function f inputs a French text and outputs the English translation.

Machine learning can be used for medical diagnosis. For example, Poplin et al. (2017) showed that retinal photographs can be used to predict blood pressure or risk of heart attack. Since the output y is a real number (e.g., blood pressure of 120 mm mercury), we refer to this as a regression problem.

AVOIDING UNDER/OVERFITTING IN A NEURAL NETWORK FOR REGRESSION

In this section we begin by explaining the prediction function and learning algorithm for a simple neural network. We then demonstrate how the number of iterations of the learning algorithm can be selected using a validation set, in order to avoid underfitting and overfitting.

We consider a simple regression problem for which the input x ∈ R is a single real number (D = 1 feature/column in the design matrix), and the output y ∈ R is as well. Using a neural network with a single hidden layer of U units, there are two unknown parameter vectors which need to be learned using the training data, w ∈ R^U and v ∈ R^U. The prediction function f is then defined as:

f(x) = w^T σ(x v) = w^T z,  (36.1)

where σ : R^U → R^U is a non-linear activation function, and z ∈ R^U is the vector of hidden units. Typical activation functions include the logistic sigmoid σ(t) = 1/(1 + exp(−t)) and the rectifier (or rectified linear units, ReLU) σ(t) = max(0, t). The prediction function is learned using gradient descent, which is an algorithm that attempts to find parameters w, v which minimize the mean squared error between the predictions and the corresponding labels in the N train data:

L(w, v) = (1/N) Σ_{i=1}^{N} [w^T σ(x_i v) − y_i]^2.  (36.2)

Gradient descent begins using uninformative parameters w^0, v^0 (typically random numbers close to zero), then at each iteration t ∈ {1, …, T} the parameters are improved by taking a step of size α > 0 in the negative gradient direction,

w^t = w^{t−1} − α ∇_w L(w^{t−1}, v^{t−1}),  (36.3)

v^t = v^{t−1} − α ∇_v L(w^{t−1}, v^{t−1}).  (36.4)

The algorithm described above is referred to as "full gradient" because the gradient descent direction is defined using the full set of N samples in the train set. Other common variants include "stochastic gradient" (gradient uses one sample) and "minibatch" (gradient uses several samples). When doing gradient descent on a neural network model, one "epoch" includes computing gradients once for each sample (e.g., 1 epoch = 1 iteration of full gradient, 1 epoch = N iterations of stochastic gradient).

In the algorithm above, the number of hidden units U, the number of iterations T, and the step size α must be fixed before running the learning algorithm. These hyper-parameters affect the learning capacity of the neural network. An important consideration when using any machine learning algorithm is that you most likely need to tune the hyper-parameters of the algorithm in order to avoid underfitting and overfitting. Underfitting occurs when the learned function f provides accurate predictions for neither the train data nor the test data. Overfitting occurs when the learned function f only provides accurate predictions for the train data (and not for the test data). Both underfitting and overfitting are bad, and need to be avoided, because the goal of any learning algorithm is to find a prediction function f which provides accurate predictions in test data.
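To make equations (36.1)-(36.4) concrete, here is a base R sketch of the full gradient algorithm for the single-hidden-layer regression network, using the logistic sigmoid activation and simulated data; the number of units, step size, and number of iterations are arbitrary illustrative choices.

    set.seed(1)
    N <- 100; U <- 50; alpha <- 0.1; max.iterations <- 500
    x <- runif(N, -3, 3)
    y <- sin(x) + rnorm(N, sd = 0.2)                  # simulated regression data
    sigma <- function(t) 1 / (1 + exp(-t))            # logistic sigmoid activation
    w <- rnorm(U, sd = 0.1); v <- rnorm(U, sd = 0.1)  # random parameters close to zero
    for (iteration in 1:max.iterations) {
      Z <- sigma(outer(x, v))               # N x U matrix of hidden units, Z[i,] = sigma(x_i * v)
      residual <- as.numeric(Z %*% w) - y   # f(x_i) - y_i for every train example
      grad.w <- (2 / N) * as.numeric(t(Z) %*% residual)
      grad.v <- (2 / N) * w * as.numeric(t(Z * (1 - Z) * residual) %*% x)
      w <- w - alpha * grad.w               # update as in equation (36.3)
      v <- v - alpha * grad.v               # update as in equation (36.4)
    }
    mean((as.numeric(sigma(outer(x, v)) %*% w) - y)^2)  # train mean squared error, equation (36.2)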
How can we select hyper-parameters which avoid overfitting? Note that the choice of hyper-parameters such as the number of hidden units U and iterations T affects the learned function f, so we cannot use the test data to learn these hyper-parameters (by assumption that the test data are not available at train time). Then, how do we know which hyper-parameters will result in learned functions which best generalize to new data?

A general method which can be used with any learning algorithm is splitting the train set into subtrain and validation sets, then using grid search over hyper-parameter values. The subtrain set is used for parameter learning, and the validation set is used for hyper-parameter selection. In detail, we first fix a set of hyper-parameters, say U = 50 hidden units and T = 100 iterations. Then the subtrain set is used with these hyper-parameters as input to the learning algorithm, which outputs the learned parameter vectors w, v. Finally, the learned parameters are used to compute predictions f(x) for all inputs x in the validation set, and the corresponding labels y are used to evaluate the accuracy/error of those predictions. The procedure is then repeated for another hyper-parameter set, say U = 10 hidden units with T = 500 iterations. In the end we select the hyper-parameter set with minimal validation error, and then retrain using the learning algorithm on the full train set with those hyper-parameters. A variant of this method is to use K-fold cross-validation to generate K subtrain/validation splits, then compute mean validation error over the K splits, which typically yields hyper-parameters that result in more accurate/generalizable predictions (when compared to hyper-parameters selected using a single subtrain/validation split). Note that this K-fold cross-validation for hyper-parameter learning is essentially the same procedure as shown in Figure 36.2, but we split the train set into subtrain/validation sets (instead of splitting all data into train/test sets as shown in the figure).

For example, we simulated some data with a sine wave pattern (Figure 36.3), and used the R package nnet to fit a neural network with one hidden layer of U = 50 units (Venables and Ripley, 2013). We demonstrate the effects of under/overfitting by varying the number of iterations/epochs from T = 1 to 1000. In this example K = 4-fold cross-validation was used, so each data point was randomly assigned a fold ID integer from 1 to 4. The result for only the first split is shown, so observations assigned fold ID = 1 are considered the validation set, and the other observations (folds 2-4) are considered the subtrain set (which is used as input to the nnet R function, which implements the gradient descent learning algorithm). We then used the predict function in R to compute predictions for subtrain and validation data, and analyzed how the prediction error changes as a function of the number of iterations/epochs T of gradient descent.

Figure 36.3. Illustration of underfitting and overfitting in a neural network regression model (single hidden layer, 50 hidden units). Left: noisy data with a nonlinear sine wave pattern (grey circles), learned functions (colored curves), and residuals/errors (black line segments) are shown for three values of epochs (panels from left to right) and two data subsets (panels from top to bottom). Right: in each epoch the model parameters are updated using gradient descent with respect to the subtrain loss, which decreases with more epochs. The optimal/minimum loss with respect to the validation set occurs at 64 epochs, indicating underfitting for smaller epochs (green function, too regular/linear for both subtrain/validation sets) and overfitting for larger epochs (purple function, very irregular/nonlinear so good fit for subtrain but not validation set).
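A minimal sketch of this experiment (one subtrain/validation split, three values of the epochs hyper-parameter, and simulated sine wave data) might look like the following; the chapter's complete code is available from the repository listed under Suggested Reading.

    library(nnet)
    set.seed(1)
    N <- 200
    x <- runif(N, -3, 3)
    y <- sin(2 * x) + rnorm(N, sd = 0.2)         # simulated noisy sine wave data
    fold.id <- sample(rep(1:4, length.out = N))  # K = 4 random fold IDs
    is.subtrain <- fold.id != 1                  # fold 1 = validation, folds 2-4 = subtrain
    mse <- function(fit, keep) mean((predict(fit, data.frame(x = x[keep])) - y[keep])^2)
    for (epochs in c(4, 64, 512)) {              # epoch values like those shown in Figure 36.3
      fit <- nnet(y ~ x, data = data.frame(x, y)[is.subtrain, ], size = 50,
                  linout = TRUE, maxit = epochs, trace = FALSE)
      cat(epochs, "epochs: subtrain MSE =", mse(fit, is.subtrain),
          " validation MSE =", mse(fit, !is.subtrain), "\n")
    }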
The data exhibit a nonlinear sine wave pattern, but the learned function for T = 4 iterations/epochs is mostly linear (underfitting, large error on both subtrain/validation sets). For T = 512 iterations/epochs the learned function is highly non-linear (overfitting, small error for the subtrain set but large error for the validation set). When the error rates are plotted as a function of a model complexity hyper-parameter such as T (Figure 36.3, right), we see the characteristic U shape for the validation error, and the monotonically decreasing train error. The hyper-parameter with minimal validation error is T = 64 iterations/epochs; smaller T values underfit or are overly regularized, and larger T values overfit or are under-regularized.

Overall, in this section we have seen how a neural network for regression can be trained using gradient descent (for learning parameter vectors, given fixed hyper-parameters) and subtrain/validation splits (for learning hyper-parameter values to avoid under/overfitting).

COMPARING NEURAL NETWORKS FOR IMAGE CLASSIFICATION

In this section we provide a comparison of several other neural networks for image classification. In general, in a neural network with L−1 hidden layers we can represent the prediction function as the composition of L intermediate functions f_l, for all layers l ∈ {1, …, L}:

f(x) = f_L(… f_1(x) …).  (36.5)

Each of the intermediate functions has the same form:

f_l(t) = A_l(W_l t),  (36.6)

where A_l is an activation function and W_l ∈ R^{u_l × u_{l−1}} is a weight matrix with elements that must be learned based on the data. This model includes several hyper-parameters which must be fixed prior to learning the neural network weights:

• The number of layers L.
• The activation functions A_l.
• The number of units per layer u_l.
• The sparsity pattern in the weight matrices W_l.

The number of units in the input layer is fixed, u_0 = D, based on the dimension of the inputs x ∈ R^D. The number of units in the output layer u_L is also fixed based on the outputs/labels y. The numbers of units in the hidden layers (u_1, …, u_{L−1}) are hyper-parameters which control under/overfitting. Increasing the numbers of hidden units u_l results in larger weight matrices W_l, which in general means more parameters to learn, and larger capacity for fitting complex patterns in the data. The sparsity pattern of W_l means which entries are forced to be zero; this technique is used in "convolutional" neural networks for avoiding overfitting and reducing training/prediction time. When the matrix is not sparse (all entries non-zero), we refer to the layer as dense or fully connected.

For example, in the previous section we used a neural network for regression with one hidden layer, which in this more general notation means using L = 2 intermediate functions; the input dimension is u_0 = D = 1, the number of hidden units is u_1 = U = 50, and there is a single output u_2 = 1 to predict. The weight matrices are dense/fully connected (no convolution/sparsity), of dimension W_1 ∈ R^{50×1}, W_2 ∈ R^{1×50}. The hidden layer activation function A_1 used by the R nnet package is the logistic sigmoid, σ(t) = 1/(1 + exp(−t)), and the output activation for regression (real-valued outputs) is the identity, A_2(t) = t.

In this section we implement three other neural networks for image classification. Using the "zip.train" data set of N = 7291 handwritten digits (Hastie and Tibshirani, 2009), each input is a greyscale image of 16 × 16 pixels, which means that the number of input units is u_0 = 256. As in Figure 36.1 (top) there are ten output classes, one for each digit. For the activation function A_L in the output layer we use the "softmax" function, which results in a score/probability for each of the ten possible output classes, so the number of output units is u_L = 10.

The three neural networks that we consider are:

linear: L = 1 intermediate function with 2,570 parameters to learn (linear model, inputs fully connected to outputs, no hidden units/layers).

dense: L = 9 intermediate functions with 97,410 parameters to learn (nonlinear model, each hidden layer dense/fully connected with 100 units).

sparse: L = 3 intermediate functions with 99,310 parameters to learn (nonlinear model, one convolutional/sparse layer followed by two dense/fully connected layers).
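One way to define three models with these shapes, using the keras R package that is applied in the following paragraphs, is sketched below; the hidden activations and the convolutional layer's filter count and kernel size are illustrative assumptions (so the sparse model's exact parameter count differs from the one quoted above), and the chapter's actual code is in its accompanying repository.

    library(keras)
    n.inputs <- 16 * 16  # zip.train images are 16 x 16 pixels; there are 10 output classes
    linear.model <- keras_model_sequential() %>%     # L = 1: inputs fully connected to outputs
      layer_dense(units = 10, activation = "softmax", input_shape = n.inputs)
    dense.model <- keras_model_sequential()          # L = 9: eight dense hidden layers of 100 units
    dense.model %>% layer_dense(units = 100, activation = "relu", input_shape = n.inputs)
    for (hidden.layer in 2:8) dense.model %>% layer_dense(units = 100, activation = "relu")
    dense.model %>% layer_dense(units = 10, activation = "softmax")
    conv.model <- keras_model_sequential() %>%       # L = 3: one convolutional/sparse layer,
      layer_conv_2d(filters = 10, kernel_size = c(3, 3), activation = "relu",
                    input_shape = c(16, 16, 1)) %>%  # then two dense/fully connected layers
      layer_flatten() %>%
      layer_dense(units = 50, activation = "relu") %>%
      layer_dense(units = 10, activation = "softmax")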
We defined and trained each neural network using the keras R package (Allaire and Chollet, 2020). We used the fit function with argument validation_split=0.2, which creates a single split (80% subtrain, 20% validation). We selected the number of epochs hyper-parameter by minimizing the validation loss, and we used the selected number of epochs to re-train the neural network on the entire train set (no subtrain/validation split).

We did this entire procedure K = 4 times, once for each fold/split in K-fold cross-validation. Note that even though these data have a pre-defined split into "zip.train" and "zip.test" files, we used K-fold cross-validation on the "zip.train" file, yielding K train/test splits that we used to estimate mean and variance of prediction accuracy for these models (the "zip.test" file was ignored). In each split we used the test set to quantify the prediction accuracy of the learned models. It is clear that the test accuracy of all three neural networks is significantly larger than the baseline model which always predicts the most frequent class in the train set (Figure 36.4, left); they are clearly learning some non-trivial predictive relationship between inputs and outputs. Furthermore, it is clear from Figure 36.4 (right) that the dense neural network is slightly more accurate than the linear model (p = 0.032 in a paired one-sided t3-test), and the sparse/convolutional neural network is significantly more accurate than the dense model (p = 0.009).

Figure 36.4. Prediction accuracy of functions learned for image classification of handwritten digits. The baseline function always predicts the most frequent class in the train set; the other three learned functions are neural networks with different numbers of hidden layers (linear=0, conv=2, dense=8).

In summary, from this comparison it is clear that among these three neural networks, the sparse model should be preferred for most accurate predictions in this particular "zip" data set. However, we must be careful not to generalize these conclusions to other data sets: even for some other image classification data sets such as MNIST (Figure 36.1), the most accurate algorithm may be different. For very difficult data sets, it may even be the case that these three neural networks are no more accurate than the baseline model which always predicts the most frequent class in the train set. In general, we always need to use computational cross-validation experiments to determine which machine learning algorithm is most accurate in any given data set. To learn a predictive model with maximum prediction accuracy, machine learning algorithms other than neural networks should be additionally considered (e.g., regularized linear models, decision trees, random forests, boosting, support vector machines).

CROSS-VALIDATION FOR EVALUATING PREDICTIONS OF EARTH SYSTEM MODEL PARAMETERS

As a final example application, we consider using cross-validation to evaluate a neural network that predicts carbon cycle model parameters (Tao et al., 2020). In this context there is a data set with N = 26,158 observations, each one a soil sample with D = 60 input features. There are 25 real-valued output variables to predict; each is the value of an earth system model parameter at the location of the soil sample. We want a neural network that will be able to predict the values of these earth system parameters at new locations. Tao et al. (2020) proposed using a neural network with L = 4 fully connected layers and dropout regularization for this task (see paper for details). In this section the "multi-task" model uses the same number of layers/units as described in that paper; the term multi-task means that the neural network outputs
a prediction for all 25 outputs/tasks. For comparison, we additionally consider "single-task" models with the same number of hidden layers/units, but only one output unit. We expect the multi-task model to sometimes be more accurate, because of the expected correlation between outputs (earth system model parameters). To see whether or not these neural networks learn any nontrivial predictive relationship between inputs and output, we consider a baseline model which always predicts the mean of the train set label/output values (and does not use the inputs at all).

Here we show how K = 5 fold cross-validation can be used to evaluate how well these neural networks predict each of the outputs at new locations. We first assign a fold ID from 1 to 5 to each observation/row, either systematically using the longitude coordinate, or randomly (Figure 36.5, top). We can define a cross-validation procedure using both sets of fold IDs, in order to answer the question, "is it more difficult to predict at new longitudes, or new random locations?" We expect that predicting at new longitudes should be more difficult, because that involves more extrapolation (predicting outside the range of observed data values). In detail, for each fold ID from 1 to 5, we define the test set as the data points which have been assigned that fold ID using both methods (longitude and random). For these data with N = 26,158 observations total, each fold has approximately 5000 observations, so each resulting test set has approximately 1000 observations. As described in the last section on image classification, we used the R keras package to compute the neural network parameters and predictions (using a maximum of 100 epochs, and a single 80% subtrain / 20% validation split to choose the optimal number of epochs for re-training on the entire train set). For each fold/model/output we computed mean squared error with respect to the test set, and we plot these values for four of the 25 outputs (Figure 36.5, bottom). It is clear that some outputs are more difficult to predict than others; for cryo and maxpsi outputs the neural networks show little or no improvement over baselines, whereas for tau4s3 and fs2s3 outputs we observed substantial improvements over baselines. As expected, there is a difference in test error between fold assignment methods (random has lower error rates than Lon for several outputs), indicating that it is indeed easier to predict at new random locations, and harder to predict at new longitudes. Finally, the multi-task models are slightly more accurate than the single-task models, indicating that the neural network is learning to exploit the correlations between outputs.

Overall, this comparison has shown how cross-validation can be used to quantitatively evaluate and compare machine learning algorithms for predicting earth system model parameters.

Figure 36.5. Cross-validation for estimating error rates of machine learning algorithms that predict earth system model parameters. Top: fold IDs were assigned to each observation using longitude (left) or randomly (right). Bottom: prediction error for four of the 25 outputs. Please see (Tao et al., 2020) for meanings of abbreviations (cryo, maxpsi, tau4s3, fs2s3).
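A minimal sketch of the fold-assignment step described above, with simulated longitudes standing in for the real soil sample coordinates, is shown below; the variable names are arbitrary.

    set.seed(1)
    K <- 5
    N <- 26158
    lon <- runif(N, -180, 180)   # simulated longitudes (stand-in for the real coordinates)
    # Systematic assignment: contiguous longitude bands get fold IDs 1 to K.
    fold.lon <- cut(rank(lon), breaks = K, labels = FALSE)
    # Random assignment: fold IDs 1 to K shuffled uniformly over observations.
    fold.rand <- sample(rep(1:K, length.out = N))
    # For split k, the test set is the observations given fold ID k by BOTH methods
    # (roughly N/K^2, about 1000 rows here); the remaining rows are available for training.
    k <- 1
    is.test <- fold.lon == k & fold.rand == k
    sum(is.test)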
In comparison to the neural network practice in unit 10, the main difference is that here we discussed how held-out test sets can be used to estimate prediction accuracy/error rates of learning algorithms. Chapter 38 discusses how a validation set can be used to avoid overfitting, as we have done in this chapter as well. We have additionally discussed how K = 5 fold cross-validation can be used to generate several train/test splits, which can be used to estimate prediction error rates for each fold/data/algorithm combination (e.g., Figure 36.5, bottom). This technique is useful since it allows us to see which algorithms are significantly more/less accurate than others on given data sets.

SUGGESTED READING

Machine learning is a large field of research with many algorithms, and there are several useful textbooks that provide overviews from various perspectives:

C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York.
I. J. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press, Cambridge, MA, USA.
T. Hastie, R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning. Springer Series in Statistics. Springer Science+Business Media, New York, NY, second edition.
K. P. Murphy. 2013. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA.
L. Wasserman. 2010. All of Statistics: A Concise Course in Statistical Inference. Springer, New York.

Reproducibility statement. Code for figures in this chapter can be freely downloaded from https://github.com/tdhock/2020-yiqi-summer-school

QUIZZES

1. When using a design matrix to represent machine learning inputs, what does each row and column represent? What other data/options does a supervised learning algorithm such as gradient descent need as input, and what does it yield as output?

2. When splitting data into train/test sets, what is the purpose of each set? When splitting a train set into subtrain/validation sets, what is the purpose of each set? What is the advantage of using K-fold cross-validation, relative to a single split?

3. In order to determine if any non-trivial predictive relationship between inputs and output has been learned, a comparison with a baseline that ignores the inputs must be used. How do you compute the baseline predictions, for regression and classification problems?

4. How can you tell if machine learning model predictions are underfitting or overfitting?

5. When using the nnet function in R to learn a neural network with a single hidden layer, do large or small values of the number of iterations hyper-parameter result in overfitting? Why?

6. When using the nnet function in R to learn a neural network with a single hidden layer, and you do not yet know how many iterations to use, what data set should you use as input to nnet? How should you learn the number of iterations to avoid underfitting and overfitting? After having computed the number of iterations to use, what data set should you then use as input to nnet to learn your final model? Hint: possible choices for the set to use are all, train, test, subtrain, validation.
