Unit 3 Slides - Getting Started With Neural Networks


Getting started with

neural networks

Chapter 3 from Chollet book

1
Anatomy of a neural network
Training a neural network revolves around
the following objects:
 Layers, which are combined into a network
(or model)
 The input data and corresponding targets
 The loss function, which defines the
feedback signal used for learning
 The optimizer, which determines how
learning proceeds
2
Anatomy of a neural network

The network (chain of layers) maps the input data to predictions.
The loss function then compares these predictions to the targets.
The optimizer uses this loss value to update the network’s weights (spelled out in the sketch below).

3
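 The interplay of these objects can be written out explicitly. A minimal sketch of one manual training step, assuming TensorFlow 2.x (tf.keras rather than the standalone keras imports used elsewhere in these slides):

import tensorflow as tf

# Assumed toy setup: a two-layer model, an MSE loss, and an SGD optimizer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(inputs, targets):
    with tf.GradientTape() as tape:
        predictions = model(inputs)            # the network maps inputs to predictions
        loss = loss_fn(targets, predictions)   # the loss compares predictions to targets
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))  # the optimizer updates the weights
    return loss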
Layers: the building blocks of DL
 A layer is a data-processing module that takes as
input tensors and that outputs tensors.
 Different layers are appropriate for different types
of data processing.
 Dense layers for 2D tensors (samples, features) - simple vector data
 RNNs (or LSTMs) for 3D tensors (samples, time-steps, features) - sequence data
 CNNs for 4D tensors (samples, height, width, colour_depth) - image data (see the sketch below)

4
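 As an illustration only (the unit counts and input shapes below are placeholder values, not from the slides), the three cases map to Keras layers roughly as follows:

from keras import models, layers

# Vector data (samples, features): densely connected layers
vector_model = models.Sequential([layers.Dense(32, input_shape=(784,))])

# Sequence data (samples, time-steps, features): a recurrent layer
sequence_model = models.Sequential([layers.LSTM(32, input_shape=(100, 64))])

# Image data (samples, height, width, colour_depth): a convolution layer
image_model = models.Sequential([layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1))])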
Layers: the building blocks of DL
 We can think of layers as the LEGO bricks
of deep learning.
 Building deep-learning models in Keras is
done by clipping together compatible layers
to form useful data-transformation
pipelines.
 In Keras, the layers we add to our models
are dynamically built to match the shape of
the incoming layer.
5
Layers: the building blocks of DL
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(32, input_shape=(784,)))
model.add(layers.Dense(32))

 The second layer didn’t receive an input shape argument - instead, it automatically inferred its input shape as being the output shape of the layer that came before.
6
Models: networks of layers
 A deep-learning model is a directed, acyclic
graph of layers. The most common instance is a
linear stack of layers, mapping a single input to a
single output.
 By choosing a network topology, we constrain our space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.
 What we will then be searching for is a good set
of values for the weight tensors involved in these
tensor operations.
7
Models: networks of layers
 Picking the right network architecture is
more of an art than a science.
 Although there are some best practices
and principles we can rely on, only practice
can help us become a proper neural-
network architect, developing intuition as to
what works or doesn’t work for specific
problems.

8
Loss functions and optimizers
Once the network architecture is defined, we still
have to choose two more things:
 Loss function (objective function) - The quantity that will be minimized during training.
 It represents a measure of success for the task at hand.
 Optimizer - Determines how the network will be updated based on the loss value.
 It implements a specific variant of stochastic gradient descent (SGD).

9
Loss functions and optimizers
 Choosing the right objective function for the right
problem is extremely important.
 There are simple guidelines we can follow to choose the correct loss function for common problems such as classification, regression, and sequence prediction (a compile() sketch follows below):
 Binary cross-entropy for a two-class (binary) classification problem
 Categorical cross-entropy for a multiclass classification problem
 Mean-squared error for a regression problem
10
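 A sketch of how these guidelines translate into compile() calls (the helper function and the rmsprop optimizer choice are illustrative, not from the slides):

def compile_for_task(model, task):
    """Hypothetical helper: pick a loss and metrics by problem type."""
    if task == 'binary_classification':        # one sigmoid output unit
        model.compile(optimizer='rmsprop', loss='binary_crossentropy',
                      metrics=['accuracy'])
    elif task == 'multiclass_classification':  # N softmax output units, one-hot targets
        model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                      metrics=['accuracy'])
    elif task == 'regression':                 # one linear output unit
        model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model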
Introduction to Keras
 Keras is a deep-learning framework for Python
that provides a convenient way to define and
train almost any kind of deep-learning model.
 It allows the same code to run seamlessly on CPU or GPU.
 It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
 It has built-in support for CNNs and RNNs.
 It supports arbitrary network architectures.

11
Introduction to Keras
 Keras is a model-level library, providing high-
level building blocks for developing DL models.
 Several different backend engines can be
plugged seamlessly into Keras.
 TensorFlow, Theano, CNTK

[Figure: the deep-learning software and hardware stack]
12
Introduction to Keras
 Nowadays, both Theano and CNTK are out of
development (Chollet 2021).
 The extensive redesign of TensorFlow and Keras
has taken into account over four years of user
feedback and technical progress.

13
Developing with Keras: an overview
The typical Keras workflow:
1. Define your training data: input tensors and
target tensors.
2. Define a network of layers (or model) that maps
your inputs to your targets.
3. Configure the learning process by choosing a
loss function, an optimizer, and some metrics to
monitor.
4. Iterate on your training data by calling the fit()
method of your model.
14
Developing with Keras: an overview
There are two ways to define a model:
1. Using the Sequential class (only for linear
stacks of layers, which is the most common
network architecture by far) or
2. Using the functional API (for directed acyclic
graphs of layers, which lets you build completely
arbitrary architectures).
 The learning process is configured in the
compilation step, where we specify a loss
function, an optimizer, and some metrics to
monitor.
15
Developing with Keras: an overview
 The compilation step
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='mse',
              metrics=['accuracy'])
 The learning process consists of passing
Numpy arrays of input data to the model
via the fit() method.
model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)

16
Developing with Keras: an overview

17
3.3 Setting up a DL workstation
 It’s highly recommended that you run deep-
learning code on a modern NVIDIA GPU.
 To do deep learning on a GPU, you have
three options:
 Buy and install a physical NVIDIA GPU on
your workstation.
 Use GPU instances on Google Cloud or AWS
EC2.
 Use the free GPU runtime from Colaboratory.
 Recommended in Chollet 2nd ed.
18
Setting up a DL workstation
3.3.1. Jupyter notebooks: the preferred way
to run deep-learning experiments
 Jupyter notebooks are a great way to run
deep-learning experiments - in particular,
the many code examples in this book.
 They’re widely used in the data-science
and machine-learning communities.
 We recommend using Jupyter notebooks
to get started with Keras.
19
Setting up a DL workstation
Reading assignment
 3.3.2. Getting Keras running: two options
 3.3.3. Running deep-learning jobs in the
cloud: pros and cons
 3.3.4. What is the best GPU for deep
learning?

20
Setting up a DL workstation
Chollet 2nd edition
 Colaboratory (or Colab for short) is a free
Jupyter notebook service that requires no
installation and runs entirely in the cloud.
 It gives you access to a free (but limited)
GPU runtime and even a TPU runtime, so
you don’t have to buy your own GPU.
 Colaboratory is what we recommend for
running the code examples in this book.
21
Summary on TF & Keras
 TensorFlow is an industry-strength numerical
computing framework that can run on CPU, GPU,
or TPU. It can automatically compute the gradient
of any differentiable expression.
 Keras is the standard API for doing DL with TF.
 The central class of Keras is the Layer. Layers
are assembled into models.
 Before you start training a model, you need to
pick an optimizer, a loss, and some metrics,
which you specify via the model.compile() method.

22
Summary on TF & Keras …
 To train a model, you can use the fit()
method, which runs mini-batch gradient
descent for you. You can also use it to
monitor your loss and metrics on validation
data, a set of inputs that the model doesn’t
see during training.
 Once your model is trained, you use the
model.predict() method to generate
predictions on new inputs.

23
3.4. Classifying movie reviews
The IMDB dataset
 A set of 50,000 highly polarized reviews from the
Internet Movie Database.
 25,000 reviews for training and 25,000 reviews
for testing, each set consisting of 50% negative
and 50% positive reviews.
 The IMDB dataset comes packaged with Keras. It
has already been preprocessed: the reviews
(sequences of words) have been turned into
sequences of integers, where each integer
stands for a specific word in a dictionary.
24
Classifying movie reviews
 The following code will load the dataset
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
 The argument num_words=10000 means you’ll
only keep the top 10,000 most frequently
occurring words in the training data (rather than
88,585 unique words in it) – we work with vector
data of manageable size.
>>> train_data[0]
[1, 14, 22, 16, ... 178, 32]
>>> train_labels[0]
1
25
Classifying movie reviews
 If we want, we can quickly decode one of
these reviews back to English words.

word_index = imdb.get_word_index()
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])

26
Classifying movie reviews
Preparing the data
 We can’t feed lists of integers into a neural
network - they all have different lengths, but a
neural network expects to process contiguous
batches of data.
 We have to turn our lists into tensors.
 There are two ways to do that:
 Pad our lists so that they all have the same length (sketched below).
 One-hot encode our lists to turn them into vectors of 0s and 1s.
27
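 The slides only develop the one-hot route; a minimal sketch of the padding route (assuming the train_data/test_data loaded earlier; maxlen=256 is an arbitrary choice):

from keras.preprocessing.sequence import pad_sequences

# Pad or truncate every review to the same length so they stack into a 2D integer tensor
x_train_padded = pad_sequences(train_data, maxlen=256, padding='post', truncating='post')
x_test_padded = pad_sequences(test_data, maxlen=256, padding='post', truncating='post')
# Resulting shape: (25000, 256); this route is usually followed by an Embedding layer.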
Classifying movie reviews
 One-hot encoding
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

>>> x_train[0]
array([ 0., 1., 1., ..., 0., 0., 0.])
28
Classifying movie reviews
 We should also vectorize our labels, which is straightforward:

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

 Now the data is ready to be fed into a neural network.
29
Classifying movie reviews
Building our network
 The input data is vectors, and the labels are scalars (1s and 0s): this is the easiest setup we’ll ever encounter - a simple stack of fully connected layers with relu activations suffices.
 Two key architecture decisions to make are:
 How many layers to use
 How many units to choose for each layer
 Next chapter gives formal principles to guide us
in making these choices.

30
Classifying movie reviews
Building our network
 We go with the following
architecture:
 Two intermediate layers with 16
hidden units each
 A third layer that will output the
scalar prediction regarding the
sentiment of the current review
 Relu for the intermediate layers
and sigmoid for the output layer
31
Classifying movie reviews
Building our network
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
 Having more hidden units allows our network to learn
more-complex representations, but it makes the network
more computationally expensive.

32
Classifying movie reviews
Building our network
 Without an activation function like relu, the Dense layer would consist of two linear operations - a dot product and an addition.
 output = dot(W, input) + b
 Such a hypothesis space is too restricted and wouldn’t benefit from multiple layers of representations - a stack of such layers is still only a linear operation (see the numpy sketch below).
 In order to get access to a much richer
hypothesis space that would benefit from deep
representations, we need activation functions.
33
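 A tiny numpy illustration of that point (the weight names W1, b1, W2, b2 are hypothetical):

import numpy as np

# Composing two linear layers without an activation is itself just one linear layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                      # a single input vector
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=(8,))
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=(3,))

two_layers = W2 @ (W1 @ x + b1) + b2           # Dense -> Dense, no activation
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)     # a single equivalent Dense layer

print(np.allclose(two_layers, one_layer))      # True: no extra expressive power gained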
Classifying movie reviews
Building our network
 Relu is the most popular activation function in deep learning.
 Finally, we need to choose a loss function and an optimizer. Because we’re facing a binary classification problem and the output is a probability, we use the binary_crossentropy loss.
 As for the choice of the optimizer, we’ll go with rmsprop, which is usually a good default choice for virtually any problem.

34
Classifying movie reviews
Building our network

from keras import losses
from keras import metrics
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

35
Classifying movie reviews
Validating our approach
 In order to monitor during training the accuracy of
the model on data it has never seen before, we’ll
create a validation set by setting apart 10,000
samples from the original training data.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
 We’ll now train the model for 20 epochs, in mini-
batches of 512 samples.
36
Classifying movie reviews
Validating our approach
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
 model.fit() returns a History object.
>>> history_dict = history.history
>>> history_dict.keys()
['acc', 'loss', 'val_acc', 'val_loss']

37
Classifying movie reviews
 Plotting the training and validation loss
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
38
Classifying movie reviews

39
Classifying movie reviews
 The training loss decreases with every
epoch, and the training accuracy increases
with every epoch.
 But that isn’t the case for the validation loss
and accuracy: they seem to peak at the
fourth epoch.
 This is an example of overfitting.
 We could stop training after three epochs (or automate this with a callback, as sketched below).
 We’ll cover a range of techniques in chapter 4.
40
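 Rather than hand-picking the number of epochs, we could let Keras stop automatically. A minimal sketch using the EarlyStopping callback (the patience value is an arbitrary choice; restore_best_weights requires a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

# Stop once the validation loss hasn't improved for 2 consecutive epochs,
# and roll the model back to the weights from its best epoch.
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])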
Classifying movie reviews
 Let’s train a new network from scratch for four
epochs and then evaluate it on the test data.
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)

41
Classifying movie reviews
 Evaluate on the test set.
results = model.evaluate(x_test, y_test)
>>> results
[0.2929924130630493, 0.88327999999999995]

 This fairly naive approach achieves an accuracy of 88%. With state-of-the-art approaches, you should be able to get close to 95%.
 The first number is the loss, the second is the accuracy!
42
Classifying movie reviews
 We can generate the likelihood of reviews being
positive by using the predict method.
>>> model.predict(x_test)
array([[ 0.98006207]
[ 0.99758697]
[ 0.99975556]
...,
[ 0.82167041]
[ 0.02885115]
[ 0.65371346]], dtype=float32)
 Our network is confident for some samples (0.99 or more,
or 0.01 or less) but less confident for others (0.65, 0.35).

43
Classifying movie reviews
Further experiments
 Try Train/Test/Validation split 30k/10k/10k.
 Try using one or three hidden layers, and see how
doing so affects validation and test accuracy.
 Try using layers with more hidden units or fewer
hidden units: 32 units, 64 units, and so on.
 Try using the mse loss function instead of
binary_crossentropy.
 Try using the tanh activation (an activation that was
popular in the early days of neural networks)
instead of relu.
44
Classifying movie reviews
Summary
 Stacks of Dense layers with relu activations can solve a wide
range of problems.
 In a binary classification problem, our network should end
with a Dense layer with one unit and a sigmoid activation.
 On a binary classification problem, the loss function we
should use is binary_crossentropy.
 The rmsprop optimizer is generally a good enough choice, whatever the problem.
 As they get better on their training data, neural networks
eventually start overfitting and end up obtaining worse
results on data they’ve never seen before.
45
Classifying newswires - multiclass
The Reuters dataset
 A set of short newswires and their topics,
published by Reuters in 1986.
 It’s a simple, widely used toy dataset for
text classification.
 46 different topics
 single-label, multiclass classification
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
num_words=10000)

46
Classifying newswires - multiclass
>>> len(train_data)
8982
>>> len(test_data)
2246
>>> train_data[10]
[1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979,
3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
>>> train_labels[10]
3
 The label associated with an example is an
integer between 0 and 45 -- a topic index.
47
Classifying newswires - multiclass
 Encoding the data
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

48
Classifying newswires - multiclass
 One-hot encoding of the labels
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

 Or we can use Keras’s built-in to_categorical

from keras.utils.np_utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
49
Classifying newswires - multiclass
Building our network:
 46 output classes
 Last layer has 46 neurons
 Last layer has softmax activation
 To avoid information bottleneck, we go with 64 units in
the lower layers
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
50
Classifying newswires - multiclass
Building our network:
 The best loss function to use in this case is
categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
 Setting aside a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

51
Classifying newswires - multiclass
Training the model:
 Let’s train the network for 20 epochs.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
 And finally, display its loss and accuracy
curves.

52
Classifying newswires - multiclass

53
Classifying newswires - multiclass
 The network begins to overfit after nine epochs.
Let’s train a new network from scratch for nine
epochs and then evaluate it on the test set.

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

54
Classifying newswires - multiclass
 The network begins to overfit after nine epochs.
Let’s train a new network from scratch for nine
epochs and then evaluate it on the test set.
model.fit(x_train, one_hot_train_labels, epochs=9, batch_size=512)
results = model.evaluate(x_test, one_hot_test_labels)
>>> results
[0.9565213431445807, 0.79697239536954589]

 This approach reaches an accuracy of 80%.
 Pretty good compared to a random baseline of 19%.
55
Classifying newswires - multiclass
 Code to find the random baseline

>>> import copy
>>> test_labels_copy = copy.copy(test_labels)
>>> np.random.shuffle(test_labels_copy)
>>> hits_array = np.array(test_labels) == np.array(test_labels_copy)
>>> float(np.sum(hits_array)) / len(test_labels)
0.18655387355298308

56
Classifying newswires - multiclass
A different way to handle the labels
 Another way to encode the labels would be to
cast them as an integer tensor, like this:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
 The only thing this approach would change is the
choice of the loss function. With integer labels,
we should use sparse_categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
57
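 Not shown on these slides: to turn the 46-way softmax output into a single predicted topic, we can take the argmax. A brief sketch, assuming the trained model and x_test from the slides above:

predictions = model.predict(x_test)   # shape (2246, 46); each row is a probability distribution
print(predictions[0].shape)           # (46,)
print(np.sum(predictions[0]))         # ~1.0, the probabilities over the 46 topics
print(np.argmax(predictions[0]))      # index of the most likely topic for the first newswire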
Classifying newswires - multiclass
The importance of having sufficiently large
intermediate layers:
 To avoid information bottleneck, it’s better to
have more than 46 units in each layer.
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
 The above network peaks at 71% validation
accuracy, an 8% absolute drop.

58
Classifying newswires - multiclass
Further experiments:
 Try using larger or smaller layers: 32 units,
128 units, and so on.
 We used two hidden layers. Now try using
a single hidden layer, or three hidden
layers.

59
Classifying newswires - multiclass
Summary
 To classify data points among N classes, your network
should end with a Dense layer of size N.
 For a multiclass classification problem, our network should end with a softmax activation.
 Categorical crossentropy is almost always the loss
function you should use for such problems.
 Two ways to handle labels in multiclass classification:
 one-hot encoding with categorical_crossentropy
 integer tensor with sparse_categorical_crossentropy
 We should avoid creating information bottlenecks in our
network with intermediate layers that are too small.
60
Predicting house prices - regression
The Boston Housing Price dataset
 506 data points: 404 training samples and 102
test samples, each with 13 numerical features.
 Different features have different scales.
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

>>> train_data.shape
(404, 13)
>>> test_data.shape
(102, 13)

61
Predicting house prices - regression
Preparing the data
 Different features have different scales.
 A widespread best practice for dealing with such data is to do feature-wise normalization:
 Subtract the mean and divide by the standard deviation.

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

 Note that the quantities used for normalizing the test data are computed using the training data. A standard practice!

62
Predicting house prices - regression
Building our network
 Because we have few samples, we go for a small network with two hidden layers, each with 64 units.
 In general, the less training data we have, the worse overfitting will be - a small network mitigates overfitting.

from keras import models
from keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
63
Predicting house prices - regression
Validating network using K-fold validation
 With little data, the validation scores might change a lot depending on which data points we choose to use for validation and which for training.
 The best practice is to use K-fold cross-validation.
 Let us choose k = 4.
 Let’s train for 500 epochs and keep a record of how well the model does at each epoch (a sketch of the K-fold loop follows below).

64
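 The K-fold loop itself isn’t reproduced on these slides; a sketch along the lines of the book’s approach (variable names, the batch size, and the history key 'val_mae' are assumptions that may differ between Keras versions):

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    # Validation data: the i-th slice of the training set
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Training data: everything else
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)

    model = build_model()  # defined on the previous slide
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=16, verbose=0)
    all_mae_histories.append(history.history['val_mae'])  # 'val_mean_absolute_error' in older Keras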
Predicting house prices - regression

65
Predicting house prices - regression
 To visualise better, let’s omit the first 10 data points and replace each point with an exponential moving average of the previous points, to obtain a smooth curve (sketched below).

66
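 Continuing from the K-fold sketch above (all_mae_histories, num_epochs), the averaging and smoothing steps might look like this; the factor of 0.9 is an assumed smoothing strength, along the lines of the book’s smooth_curve helper:

# Average the per-epoch validation MAE across the k folds
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

def smooth_curve(points, factor=0.9):
    # Exponential moving average over the previous points
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])  # drop the first 10 points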
Predicting house prices - regression
 Validation MAE stops improving significantly after 80 epochs. Past that point, the model starts overfitting.
 Once we’re finished tuning other parameters of the
model, we can train a final production model on all of
the training data, and look at its performance on the
test data.
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
>>> test_mae_score
2.5532484335057877

67
Predicting house prices - regression
 With this regression model, predict() returns
the model’s guess for the price.
>>> predictions = model.predict(test_data)
>>> predictions[0]
array([9.990133], dtype=float32)
 The first house in the test set is predicted
to have a price of about $10,000.

68
Predicting house prices - regression
Wrapping up
 MSE is a loss function commonly used for
regression.
 A common regression metric is MAE.
 When features in the input data have values in
different ranges, we apply normalization.
 When there is little data available,
 use K-fold validation,
 use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting.

69
Chapter summary
 Preprocess raw data to feed into a NN.
 Use normalization if different ranges.
 Avoid information bottlenecks.
 Regression uses different loss functions
and evaluation metrics than classification.
 When working with little data,
 Use a small network with only one or two
hidden layers, to avoid severe overfitting,
 Use k-fold validation.
70
