Unit 3 Slides - Getting Started With Neural Networks
Anatomy of a neural network
Training a neural network revolves around
the following objects:
Layers, which are combined into a network
(or model)
The input data and corresponding targets
The loss function, which defines the
feedback signal used for learning
The optimizer, which determines how
learning proceeds
Anatomy of a neural network
(figure: relationship between the network, layers, loss function, and optimizer)
Layers: the building blocks of DL
A layer is a data-processing module that takes as
input tensors and that outputs tensors.
Different layers are appropriate for different types
of data processing.
Dense layers for 2D tensors of shape (samples, features)
Recurrent layers (e.g., LSTM) for 3D sequence tensors of shape (samples, timesteps, features)
Convolution layers (e.g., Conv2D) for 4D image tensors
Layers: the building blocks of DL
We can think of layers as the LEGO bricks
of deep learning.
Building deep-learning models in Keras is
done by clipping together compatible layers
to form useful data-transformation
pipelines.
In Keras, the layers we add to our models
are dynamically built to match the shape of
the incoming layer.
Layers: the building blocks of DL
from keras import models
from keras import layers

model = models.Sequential()
# The first layer must be told its expected input shape
model.add(layers.Dense(32, input_shape=(784,)))
# The second layer's input shape is inferred from the previous layer's output
model.add(layers.Dense(32))
Loss functions and optimizers
Once the network architecture is defined, we still
have to choose two more things:
Loss function (objective function) - The quantity
that will be minimized during training.
It represents a measure of success for the task
at hand.
Optimizer - Determines how the network will be
updated based on the loss value.
It implements a specific variant of stochastic gradient descent (SGD).
Loss functions and optimizers
Choosing the right objective function for the right
problem is extremely important.
There are simple guidelines we can follow to
choose the correct loss function for common
problems such as classification, regression, and
sequence prediction.
Binary cross-entropy for a two-class classification problem
Categorical cross-entropy for a many-class classification problem
Mean-squared error for a regression problem
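These guidelines map directly onto the compile() step. A minimal sketch, assuming a model has already been defined (only one of these would be used for a given model):

# Two-class classification:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# Many-class classification:
# model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# Regression:
# model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])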
Introduction to Keras
Keras is a deep-learning framework for Python
that provides a convenient way to define and
train almost any kind of deep-learning model.
It allows the same code to run seamlessly on
CPU or GPU.
It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
Introduction to Keras
Keras is a model-level library, providing high-
level building blocks for developing DL models.
Several different backend engines can be
plugged seamlessly into Keras.
TensorFlow, Theano, CNTK
Developing with Keras: an overview
The typical Keras workflow:
1. Define your training data: input tensors and
target tensors.
2. Define a network of layers (or model) that maps
your inputs to your targets.
3. Configure the learning process by choosing a
loss function, an optimizer, and some metrics to
monitor.
4. Iterate on your training data by calling the fit()
method of your model.
Developing with Keras: an overview
There are two ways to define a model:
1. Using the Sequential class (only for linear
stacks of layers, which is the most common
network architecture by far) or
2. Using the functional API (for directed acyclic
graphs of layers, which lets you build completely
arbitrary architectures).
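As a sketch, here is the same small model defined both ways (the layer sizes are illustrative, following the book's example):

from keras import models
from keras import layers

# 1. The Sequential class: a linear stack of layers
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))

# 2. The functional API: layers applied to explicit tensors
input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = models.Model(inputs=input_tensor, outputs=output_tensor)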
The learning process is configured in the
compilation step, where we specify a loss
function, an optimizer, and some metrics to
monitor.
Developing with Keras: an overview
The compilation step
from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='mse',
              metrics=['accuracy'])
The learning process consists of passing
Numpy arrays of input data to the model
via the fit() method.
model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)
3.3 Setting up a DL workstation
It’s highly recommended that you run deep-
learning code on a modern NVIDIA GPU.
To do deep learning on a GPU, you have
three options:
Buy and install a physical NVIDIA GPU on
your workstation.
Use GPU instances on Google Cloud or AWS
EC2.
Use the free GPU runtime from Colaboratory.
Recommended in Chollet 2nd ed.
Setting up a DL workstation
3.3.1. Jupyter notebooks: the preferred way
to run deep-learning experiments
Jupyter notebooks are a great way to run
deep-learning experiments - in particular,
the many code examples in this book.
They’re widely used in the data-science
and machine-learning communities.
We recommend using Jupyter notebooks
to get started with Keras.
Setting up a DL workstation
Reading assignment
3.3.2. Getting Keras running: two options
3.3.3. Running deep-learning jobs in the
cloud: pros and cons
3.3.4. What is the best GPU for deep
learning?
Setting up a DL workstation
Chollet 2nd edition
Colaboratory (or Colab for short) is a free
Jupyter notebook service that requires no
installation and runs entirely in the cloud.
It gives you access to a free (but limited)
GPU runtime and even a TPU runtime, so
you don’t have to buy your own GPU.
Colaboratory is what we recommend for
running the code examples in this book.
Summary on TF & Keras
TensorFlow is an industry-strength numerical
computing framework that can run on CPU, GPU,
or TPU. It can automatically compute the gradient
of any differentiable expression.
Keras is the standard API for doing DL with TF.
The central class of Keras is the Layer. Layers
are assembled into models.
Before you start training a model, you need to
pick an optimizer, a loss, and some metrics,
which you specify via the model.compile() method.
Summary on TF & Keras …
To train a model, you can use the fit()
method, which runs mini-batch gradient
descent for you. You can also use it to
monitor your loss and metrics on validation
data, a set of inputs that the model doesn’t
see during training.
Once your model is trained, you use the
model.predict() method to generate
predictions on new inputs.
3.4. Classifying movie reviews
The IMDB dataset
A set of 50,000 highly polarized reviews from the
Internet Movie Database.
25,000 reviews for training and 25,000 reviews
for testing, each set consisting of 50% negative
and 50% positive reviews.
The IMDB dataset comes packaged with Keras. It
has already been preprocessed: the reviews
(sequences of words) have been turned into
sequences of integers, where each integer
stands for a specific word in a dictionary.
Classifying movie reviews
The following code loads the dataset:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The argument num_words=10000 means we keep only the 10,000 most frequently occurring words in the training data (rather than all 88,585 unique words in it), so we work with vector data of manageable size.
>>> train_data[0]
[1, 14, 22, 16, ... 178, 32]
>>> train_labels[0]
1
Classifying movie reviews
If we want, we can quickly decode one of
these reviews back to English words.
word_index = imdb.get_word_index()
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1, and 2 are reserved
# for "padding", "start of sequence", and "unknown"
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])
Classifying movie reviews
Preparing the data
We can’t feed lists of integers into a neural
network - they all have different lengths, but a
neural network expects to process contiguous
batches of data.
We have to turn our lists into tensors.
There are two ways to do that:
Pad our lists so that they all have the same
length.
One-hot encode our lists to turn them into vectors of 0s and 1s; e.g., the sequence [3, 5] becomes a 10,000-dimensional vector that is all 0s except for indices 3 and 5, which are 1s. We go with one-hot encoding, using the helper sketched below.
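The vectorize_sequences helper used below is the same function defined later in the Reuters section; it is sketched here for completeness:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # All-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the word indices of sample i to 1
    return results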
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
>>> x_train[0]
array([ 0., 1., 1., ..., 0., 0., 0.])
Classifying movie reviews
We should also vectorize our labels, which is straightforward:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
Classifying movie reviews
Building our network
The input data is vectors, and the labels are scalars (1s and 0s): this is the easiest setup we'll ever encounter - a simple stack of fully connected layers with relu activations suffices.
Two key architecture decisions to make are:
How many layers to use
How many units to choose for each layer
The next chapter gives formal principles to guide us in making these choices.
Classifying movie reviews
Building our network
We go with the following
architecture:
Two intermediate layers with 16
hidden units each
A third layer that will output the
scalar prediction regarding the
sentiment of the current review
Relu for the intermediate layers
and sigmoid for the output layer
Classifying movie reviews
Building our network
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Having more hidden units allows our network to learn
more-complex representations, but it makes the network
more computationally expensive.
Classifying movie reviews
Building our network
Without an activation function like relu, the Dense layer would consist of two linear operations - a dot product and an addition:
output = dot(W, input) + b
The layer could then learn only linear transformations of the input data, and a stack of such layers would still implement a linear transformation. The activation makes the transformation non-linear:
output = relu(dot(W, input) + b)
Classifying movie reviews
Building our network
(figure: the three-layer network)
Classifying movie reviews
Validating our approach
To monitor, during training, the accuracy of the model on data it has never seen before, we'll create a validation set by setting apart 10,000 samples from the original training data.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
We’ll now train the model for 20 epochs, in mini-
batches of 512 samples.
Classifying movie reviews
Validating our approach
# The model must be compiled before it can be fit
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
model.fit() returns a History object.
>>> history_dict = history.history
>>> history_dict.keys()
['acc', 'loss', 'val_acc', 'val_loss']
Classifying movie reviews
Plotting the training and validation loss
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.legend()
plt.show()
Classifying movie reviews
(figures: training and validation loss and accuracy curves)
Classifying movie reviews
The training loss decreases with every
epoch, and the training accuracy increases
with every epoch.
But that isn’t the case for the validation loss
and accuracy: they seem to peak at the
fourth epoch.
This is an example of overfitting.
To prevent it, we could stop training after about four epochs.
We'll cover a range of techniques for mitigating overfitting in chapter 4.
Classifying movie reviews
Let’s train a new network from scratch for four
epochs and then evaluate it on the test data.
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
Classifying movie reviews
Evaluate on the test set.
results = model.evaluate(x_test, y_test)
>>> results
[0.2929924130630493, 0.88327999999999995]
Classifying movie reviews
Further experiments
Try Train/Test/Validation split 30k/10k/10k.
Try using one or three hidden layers, and see how
doing so affects validation and test accuracy.
Try using layers with more hidden units or fewer
hidden units: 32 units, 64 units, and so on.
Try using the mse loss function instead of
binary_crossentropy.
Try using the tanh activation (an activation that was
popular in the early days of neural networks)
instead of relu.
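As one concrete sketch of these experiments (an illustrative variation, not the reference model), here is the tanh/mse combination:

model = models.Sequential()
model.add(layers.Dense(16, activation='tanh', input_shape=(10000,)))
model.add(layers.Dense(16, activation='tanh'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='mse', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)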
Classifying movie reviews
Summary
Stacks of Dense layers with relu activations can solve a wide
range of problems.
In a binary classification problem, our network should end
with a Dense layer with one unit and a sigmoid activation.
On a binary classification problem, the loss function we
should use is binary_crossentropy.
The rmsprop optimizer is generally a good enough choice, whatever the problem.
As they get better on their training data, neural networks
eventually start overfitting and end up obtaining worse
results on data they’ve never seen before.
Classifying newswires - multiclass
The Reuters dataset
A set of short newswires and their topics,
published by Reuters in 1986.
It’s a simple, widely used toy dataset for
text classification.
46 different topics
Each newswire belongs to exactly one topic: single-label, multiclass classification
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    num_words=10000)
Classifying newswires - multiclass
>>> len(train_data)
8982
>>> len(test_data)
2246
>>> train_data[10]
[1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979,
3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
>>> train_labels[10]
3
The label associated with an example is an
integer between 0 and 45 -- a topic index.
Classifying newswires - multiclass
Encoding the data
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # All-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the word indices of sample i to 1
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
Classifying newswires - multiclass
One-hot encoding of the labels
def to_one_hot(labels, dimension=46):
    # All-zero matrix of shape (len(labels), dimension)
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.  # set index `label` of row i to 1
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)
Or we can use Keras’s built-in to_categorical
from keras.utils.np_utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
Classifying newswires - multiclass
Building our network:
46 output classes
Last layer has 46 neurons
Last layer has softmax activation
To avoid an information bottleneck, we go with 64 units in the lower layers
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
Classifying newswires - multiclass
Building our network:
The best loss function to use in this case is
categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Setting aside a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
Classifying newswires - multiclass
Training the model:
Let’s train the network for 20 epochs.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
And finally, display its loss and accuracy
curves.
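A plotting sketch analogous to the IMDB one (this Keras version reports the history keys 'acc'/'val_acc'; newer versions use 'accuracy'/'val_accuracy'):

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.legend()
plt.show()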
Classifying newswires - multiclass
(figures: training and validation loss and accuracy curves)
Classifying newswires - multiclass
The network begins to overfit after nine epochs.
Let’s train a new network from scratch for nine
epochs and then evaluate it on the test set.
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Classifying newswires - multiclass
model.fit(x_train, one_hot_train_labels, epochs=9, batch_size=512)
results = model.evaluate(x_test, one_hot_test_labels)
>>> results
[0.9565213431445807, 0.79697239536954589]
Classifying newswires - multiclass
Code to find the random baseline
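A minimal sketch in the spirit of Chollet's baseline check: shuffle the test labels and measure how often a random pairing agrees with the true labels (roughly 19% for Reuters).

import copy
import numpy as np

test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits_array = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits_array)) / len(test_labels))  # ~0.18-0.19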
Classifying newswires - multiclass
A different way to handle the labels
Another way to encode the labels would be to
cast them as an integer tensor, like this:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
The only thing this approach would change is the
choice of the loss function. With integer labels,
we should use sparse_categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
Classifying newswires - multiclass
The importance of having sufficiently large
intermediate layers:
To avoid an information bottleneck, intermediate layers should not have many fewer units than the 46 output classes. The following network deliberately introduces such a bottleneck:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
The above network peaks at 71% validation
accuracy, an 8% absolute drop.
Classifying newswires - multiclass
Further experiments:
Try using larger or smaller layers: 32 units,
128 units, and so on.
We used two hidden layers. Now try using
a single hidden layer, or three hidden
layers.
Classifying newswires - multiclass
Summary
To classify data points among N classes, your network
should end with a Dense layer of size N.
For a single-label, multiclass classification problem, our network should end with a softmax activation.
Categorical crossentropy is almost always the loss
function you should use for such problems.
Two ways to handle labels in multiclass classification:
one-hot encoding with categorical_crossentropy
integer tensor with sparse_categorical_crossentropy
We should avoid creating information bottlenecks in our
network with intermediate layers that are too small.
Predicting house prices - regression
The Boston Housing Price dataset
506 data points: 404 training samples and 102
test samples, each with 13 numerical features.
Different features have different scales.
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
>>> train_data.shape
(404, 13)
>>> test_data.shape
(102, 13)
Predicting house prices - regression
Preparing the data
Different features have different scales.
A widespread best practice for dealing with such data is feature-wise normalization: subtract the mean and divide by the standard deviation.

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

Note that the quantities used for normalizing the test data are computed using the training data - a standard practice!
Predicting house prices - regression
Building our network
Because we have few samples, we go for a small network with two hidden layers, each with 64 units.
In general, the less training data we have, the worse overfitting will be - a small network mitigates overfitting.
from keras import models
from keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    # Single unit, no activation: a linear layer, typical for scalar regression
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
Predicting house prices - regression
Validating our network using K-fold validation
With little data, the validation scores might change a lot depending on which data points we choose for validation and which for training.
The best practice in such situations is to use K-fold cross-validation. Let us choose k = 4.
Let’s train for 500 epochs and keep a record of how well the model does at each epoch, as in the sketch below.
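A sketch of the K-fold loop, following Chollet (in recent Keras the per-epoch MAE key is 'val_mae'; older versions report 'val_mean_absolute_error'):

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []
for i in range(k):
    # Data from partition i becomes the validation fold
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # The remaining partitions form the training data
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=16, verbose=0)
    all_mae_histories.append(history.history['val_mae'])

# Average the per-epoch MAE across the k folds
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]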
Predicting house prices - regression
(figure: validation MAE by epoch)
Predicting house prices - regression
To visualize this better, let's omit the first 10 data points and replace each point with an exponential moving average of the previous points, to obtain a smooth curve (helper sketched below).
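A sketch of the smoothing helper, as in Chollet (factor=0.9 gives heavy smoothing):

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            # Exponential moving average of the previous points
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])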
Predicting house prices - regression
Validation MAE stops improving significantly after 80 epochs. Past that point, the model starts overfitting.
Once we’re finished tuning other parameters of the
model, we can train a final production model on all of
the training data, and look at its performance on the
test data.
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
>>> test_mae_score
2.5532484335057877
Predicting house prices - regression
With this regression model, predict() returns
the model’s guess for the price.
>>> predictions = model.predict(test_data)
>>> predictions[0]
array([9.990133], dtype=float32)
The first house in the test set is predicted
to have a price of about $10,000.
Predicting house prices - regression
Wrapping up
MSE is a loss function commonly used for
regression.
A common regression metric is MAE.
When features in the input data have values in
different ranges, we apply normalization.
When there is little data available,
use K-fold validation,
use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting.
Chapter summary
Preprocess raw data to feed into a NN.
Use normalization if different ranges.
Avoid information bottlenecks.
Regression uses different loss functions
and evaluation metrics than classification.
When working with little data,
Use a small network with only one or two
hidden layers, to avoid severe overfitting,
Use k-fold validation.