Unit 3 Slides - Getting Started With Neural Networks
Anatomy of a neural network
Training a neural network revolves around
the following objects:
Layers, which are combined into a network
(or model)
The input data and corresponding targets
The loss function, which defines the
feedback signal used for learning
The optimizer, which determines how
learning proceeds
Anatomy of a neural network
(figure: relationship between the network, layers, loss function, and optimizer)
Layers: the building blocks of DL
A layer is a data-processing module that takes as
input tensors and that outputs tensors.
Different layers are appropriate for different types
of data processing.
Dense layers for 2D tensors of shape (samples, features)
Recurrent layers (e.g., LSTM) for 3D sequence tensors of shape (samples, timesteps, features)
Convolution layers (e.g., Conv2D) for 4D image tensors
Layers: the building blocks of DL
We can think of layers as the LEGO bricks
of deep learning.
Building deep-learning models in Keras is
done by clipping together compatible layers
to form useful data-transformation
pipelines.
In Keras, the layers we add to our models
are dynamically built to match the shape of
the incoming layer.
Layers: the building blocks of DL
from keras import models
from keras import layers

model = models.Sequential()
# The first layer must be told its expected input shape
model.add(layers.Dense(32, input_shape=(784,)))
# The second layer's input shape is inferred from the previous layer's output
model.add(layers.Dense(32))
Loss functions and optimizers
Once the network architecture is defined, we still
have to choose two more things:
Loss function (objective function) - The quantity
that will be minimized during training.
It represents a measure of success for the task
at hand.
Optimizer - Determines how the network will be
updated based on the loss value.
It implements a specific variant of stochastic gradient descent (SGD).
Loss functions and optimizers
Choosing the right objective function for the right
problem is extremely important.
There are simple guidelines we can follow to
choose the correct loss function for common
problems such as classification, regression, and
sequence prediction.
Binary cross-entropy for a two-class classification problem
Categorical cross-entropy for a many-class classification problem
Mean-squared error for a regression problem
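These guidelines map directly onto the compile() step. A minimal sketch, assuming a model has already been defined (only one of these would be used for a given model):

# Two-class classification:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# Many-class classification:
# model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# Regression:
# model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])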
Introduction to Keras
Keras is a deep-learning framework for Python
that provides a convenient way to define and
train almost any kind of deep-learning model.
It allows the same code to run seamlessly on
CPU or GPU.
It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
Introduction to Keras
Keras is a model-level library, providing high-
level building blocks for developing DL models.
Several different backend engines can be
plugged seamlessly into Keras.
TensorFlow, Theano, CNTK
Developing with Keras: an overview
The typical Keras workflow:
1. Define your training data: input tensors and
target tensors.
2. Define a network of layers (or model) that maps
your inputs to your targets.
3. Configure the learning process by choosing a
loss function, an optimizer, and some metrics to
monitor.
4. Iterate on your training data by calling the fit()
method of your model.
Developing with Keras: an overview
There are two ways to define a model:
1. Using the Sequential class (only for linear
stacks of layers, which is the most common
network architecture by far) or
2. Using the functional API (for directed acyclic
graphs of layers, which lets you build completely
arbitrary architectures).
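As a sketch, here is the same small model defined both ways (the layer sizes are illustrative, following the book's example):

from keras import models
from keras import layers

# 1. The Sequential class: a linear stack of layers
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))

# 2. The functional API: layers applied to explicit tensors
input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = models.Model(inputs=input_tensor, outputs=output_tensor)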
The learning process is configured in the
compilation step, where we specify a loss
function, an optimizer, and some metrics to
monitor.
Developing with Keras: an overview
The compilation step
from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='mse',
              metrics=['accuracy'])
The learning process consists of passing
Numpy arrays of input data to the model
via the fit() method.
model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)
3.3 Setting up a DL workstation
It’s highly recommended that you run deep-
learning code on a modern NVIDIA GPU.
To do deep learning on a GPU, you have
three options:
Buy and install a physical NVIDIA GPU on
your workstation.
Use GPU instances on Google Cloud or AWS
EC2.
Use the free GPU runtime from Colaboratory.
Recommended in Chollet 2nd ed.
Setting up a DL workstation
3.3.1. Jupyter notebooks: the preferred way
to run deep-learning experiments
Jupyter notebooks are a great way to run
deep-learning experiments - in particular,
the many code examples in this book.
They’re widely used in the data-science
and machine-learning communities.
We recommend using Jupyter notebooks
to get started with Keras.
Setting up a DL workstation
Reading assignment
3.3.2. Getting Keras running: two options
3.3.3. Running deep-learning jobs in the
cloud: pros and cons
3.3.4. What is the best GPU for deep
learning?
Setting up a DL workstation
Chollet 2nd edition
Colaboratory (or Colab for short) is a free
Jupyter notebook service that requires no
installation and runs entirely in the cloud.
It gives you access to a free (but limited)
GPU runtime and even a TPU runtime, so
you don’t have to buy your own GPU.
Colaboratory is what we recommend for
running the code examples in this book.
Summary on TF & Keras
TensorFlow is an industry-strength numerical
computing framework that can run on CPU, GPU,
or TPU. It can automatically compute the gradient
of any differentiable expression.
Keras is the standard API for doing DL with TF.
The central class of Keras is the Layer. Layers
are assembled into models.
Before you start training a model, you need to
pick an optimizer, a loss, and some metrics,
which you specify via the model.compile() method.
Summary on TF & Keras …
To train a model, you can use the fit()
method, which runs mini-batch gradient
descent for you. You can also use it to
monitor your loss and metrics on validation
data, a set of inputs that the model doesn’t
see during training.
Once your model is trained, you use the
model.predict() method to generate
predictions on new inputs.
3.4. Classifying movie reviews
The IMDB dataset
A set of 50,000 highly polarized reviews from the
Internet Movie Database.
25,000 reviews for training and 25,000 reviews
for testing, each set consisting of 50% negative
and 50% positive reviews.
The IMDB dataset comes packaged with Keras. It
has already been preprocessed: the reviews
(sequences of words) have been turned into
sequences of integers, where each integer
stands for a specific word in a dictionary.
Classifying movie reviews
The following code loads the dataset:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The argument num_words=10000 means we keep only the 10,000 most frequently occurring words in the training data (rather than all 88,585 unique words in it), so we work with vector data of manageable size.
>>> train_data[0]
[1, 14, 22, 16, ... 178, 32]
>>> train_labels[0]
1
Classifying movie reviews
If we want, we can quickly decode one of
these reviews back to English words.
word_index = imdb.get_word_index()
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1, and 2 are reserved
# for "padding", "start of sequence", and "unknown"
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])
Classifying movie reviews
Preparing the data
We can’t feed lists of integers into a neural
network - they all have different lengths, but a
neural network expects to process contiguous
batches of data.
We have to turn our lists into tensors.
There are two ways to do that:
Pad our lists so that they all have the same
length.
One-hot encode our lists to turn them into vectors of 0s and 1s; e.g., the sequence [3, 5] becomes a 10,000-dimensional vector that is all 0s except for indices 3 and 5, which are 1s. We go with one-hot encoding, using the helper sketched below.
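The vectorize_sequences helper used below is the same function defined later in the Reuters section; it is sketched here for completeness:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # All-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the word indices of sample i to 1
    return results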
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
>>> x_train[0]
array([ 0., 1., 1., ..., 0., 0., 0.])
Classifying movie reviews
We should also vectorize our labels, which is straightforward:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
Classifying movie reviews
Building our network
The input data is vectors, and the labels are scalars (1s and 0s): this is the easiest setup we'll ever encounter - a simple stack of fully connected layers with relu activations suffices.
Two key architecture decisions to make are:
How many layers to use
How many units to choose for each layer
The next chapter gives formal principles to guide us in making these choices.
Classifying movie reviews
Building our network
We go with the following
architecture:
Two intermediate layers with 16
hidden units each
A third layer that will output the
scalar prediction regarding the
sentiment of the current review
Relu for the intermediate layers
and sigmoid for the output layer
Classifying movie reviews
Building our network
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Having more hidden units allows our network to learn
more-complex representations, but it makes the network
more computationally expensive.
Classifying movie reviews
Building our network
Without an activation function like relu, the Dense layer would consist of two linear operations - a dot product and an addition:
output = dot(W, input) + b
The layer could then learn only linear transformations of the input data, and a stack of such layers would still implement a linear transformation. The activation makes the transformation non-linear:
output = relu(dot(W, input) + b)
Classifying movie reviews
Building our network
(figure: the three-layer network)
Classifying movie reviews
Validating our approach
To monitor, during training, the accuracy of the model on data it has never seen before, we'll create a validation set by setting apart 10,000 samples from the original training data.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
We’ll now train the model for 20 epochs, in mini-
batches of 512 samples.
Classifying movie reviews
Validating our approach
# The model must be compiled before it can be fit
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
model.fit() returns a History object.
>>> history_dict = history.history
>>> history_dict.keys()
['acc', 'loss', 'val_acc', 'val_loss']
Classifying movie reviews
Plotting the training and validation loss
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.legend()
plt.show()
Classifying movie reviews
(figures: training and validation loss and accuracy curves)
Classifying movie reviews
The training loss decreases with every
epoch, and the training accuracy increases
with every epoch.
But that isn’t the case for the validation loss
and accuracy: they seem to peak at the
fourth epoch.
This is an example of overfitting.
To prevent it, we could stop training after about four epochs.
We'll cover a range of techniques for mitigating overfitting in chapter 4.
Classifying movie reviews
Let’s train a new network from scratch for four
epochs and then evaluate it on the test data.
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
Classifying movie reviews
Evaluate on the test set.
results = model.evaluate(x_test, y_test)
>>> results
[0.2929924130630493, 0.88327999999999995]
Classifying movie reviews
Further experiments
Try Train/Test/Validation split 30k/10k/10k.
Try using one or three hidden layers, and see how
doing so affects validation and test accuracy.
Try using layers with more hidden units or fewer
hidden units: 32 units, 64 units, and so on.
Try using the mse loss function instead of
binary_crossentropy.
Try using the tanh activation (an activation that was
popular in the early days of neural networks)
instead of relu.
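As one concrete sketch of these experiments (an illustrative variation, not the reference model), here is the tanh/mse combination:

model = models.Sequential()
model.add(layers.Dense(16, activation='tanh', input_shape=(10000,)))
model.add(layers.Dense(16, activation='tanh'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='mse', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)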
Classifying movie reviews
Summary
Stacks of Dense layers with relu activations can solve a wide
range of problems.
In a binary classification problem, our network should end
with a Dense layer with one unit and a sigmoid activation.
On a binary classification problem, the loss function we
should use is binary_crossentropy.
The rmsprop optimizer is generally a good enough choice, whatever the problem.
As they get better on their training data, neural networks
eventually start overfitting and end up obtaining worse
results on data they’ve never seen before.
Classifying newswires - multiclass
The Reuters dataset
A set of short newswires and their topics,
published by Reuters in 1986.
It’s a simple, widely used toy dataset for
text classification.
46 different topics
Each newswire belongs to exactly one topic: single-label, multiclass classification
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    num_words=10000)
Classifying newswires - multiclass
>>> len(train_data)
8982
>>> len(test_data)
2246
>>> train_data[10]
[1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979,
3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
>>> train_labels[10]
3
The label associated with an example is an
integer between 0 and 45 -- a topic index.
Classifying newswires - multiclass
Encoding the data
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # All-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the word indices of sample i to 1
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
Classifying newswires - multiclass
One-hot encoding of the labels
def to_one_hot(labels, dimension=46):
    # All-zero matrix of shape (len(labels), dimension)
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.  # set index `label` of row i to 1
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)
Or we can use Keras’s built-in to_categorical
from keras.utils.np_utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
Classifying newswires - multiclass
Building our network:
46 output classes
Last layer has 46 neurons
Last layer has softmax activation
To avoid an information bottleneck, we go with 64 units in the lower layers
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
Classifying newswires - multiclass
Building our network:
The best loss function to use in this case is
categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Setting aside a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
Classifying newswires - multiclass
Training the model:
Let’s train the network for 20 epochs.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
And finally, display its loss and accuracy
curves.
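A plotting sketch analogous to the IMDB one (this Keras version reports the history keys 'acc'/'val_acc'; newer versions use 'accuracy'/'val_accuracy'):

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.legend()
plt.show()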
Classifying newswires - multiclass
(figures: training and validation loss and accuracy curves)
Classifying newswires - multiclass
The network begins to overfit after nine epochs.
Let’s train a new network from scratch for nine
epochs and then evaluate it on the test set.
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Classifying newswires - multiclass
model.fit(x_train, one_hot_train_labels, epochs=9, batch_size=512)
results = model.evaluate(x_test, one_hot_test_labels)
>>> results
[0.9565213431445807, 0.79697239536954589]
Classifying newswires - multiclass
Code to find the random baseline
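A minimal sketch in the spirit of Chollet's baseline check: shuffle the test labels and measure how often a random pairing agrees with the true labels (roughly 19% for Reuters).

import copy
import numpy as np

test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits_array = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits_array)) / len(test_labels))  # ~0.18-0.19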
Classifying newswires - multiclass
A different way to handle the labels
Another way to encode the labels would be to
cast them as an integer tensor, like this:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
The only thing this approach would change is the
choice of the loss function. With integer labels,
we should use sparse_categorical_crossentropy.
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
Classifying newswires - multiclass
The importance of having sufficiently large
intermediate layers:
To avoid an information bottleneck, intermediate layers should not have many fewer units than the 46 output classes. The following network deliberately introduces such a bottleneck:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
The above network peaks at 71% validation
accuracy, an 8% absolute drop.
Classifying newswires - multiclass
Further experiments:
Try using larger or smaller layers: 32 units,
128 units, and so on.
We used two hidden layers. Now try using
a single hidden layer, or three hidden
layers.
Classifying newswires - multiclass
Summary
To classify data points among N classes, your network
should end with a Dense layer of size N.
For a single-label, multiclass classification problem, our network should end with a softmax activation.
Categorical crossentropy is almost always the loss
function you should use for such problems.
Two ways to handle labels in multiclass classification:
one-hot encoding with categorical_crossentropy
integer tensor with sparse_categorical_crossentropy
We should avoid creating information bottlenecks in our
network with intermediate layers that are too small.
Predicting house prices - regression
The Boston Housing Price dataset
506 data points: 404 training samples and 102
test samples, each with 13 numerical features.
Different features have different scales.
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
>>> train_data.shape
(404, 13)
>>> test_data.shape
(102, 13)
Predicting house prices - regression
Preparing the data
Different features have different scales.
A widespread best practice for dealing with such data is feature-wise normalization: subtract the mean and divide by the standard deviation.

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

Note that the quantities used for normalizing the test data are computed using the training data - a standard practice!
Predicting house prices - regression
Building our network
Because we have few samples, we go for a small network with two hidden layers, each with 64 units.
In general, the less training data we have, the worse overfitting will be - a small network mitigates overfitting.
from keras import models
from keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    # Single unit, no activation: a linear layer, typical for scalar regression
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
Predicting house prices - regression
Validating our network using K-fold validation
With little data, the validation scores might change a lot depending on which data points we choose for validation and which for training.
The best practice in such situations is to use K-fold cross-validation. Let us choose k = 4.
Let’s train for 500 epochs and keep a record of how well the model does at each epoch, as in the sketch below.
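A sketch of the K-fold loop, following Chollet (in recent Keras the per-epoch MAE key is 'val_mae'; older versions report 'val_mean_absolute_error'):

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []
for i in range(k):
    # Data from partition i becomes the validation fold
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # The remaining partitions form the training data
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=16, verbose=0)
    all_mae_histories.append(history.history['val_mae'])

# Average the per-epoch MAE across the k folds
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]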
Predicting house prices - regression
(figure: validation MAE by epoch)
Predicting house prices - regression
To visualize this better, let's omit the first 10 data points and replace each point with an exponential moving average of the previous points, to obtain a smooth curve (helper sketched below).
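A sketch of the smoothing helper, as in Chollet (factor=0.9 gives heavy smoothing):

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            # Exponential moving average of the previous points
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])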
Predicting house prices - regression
Validation MAE stops improving significantly after 80 epochs. Past that point, the model starts overfitting.
Once we’re finished tuning other parameters of the
model, we can train a final production model on all of
the training data, and look at its performance on the
test data.
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
>>> test_mae_score
2.5532484335057877
Predicting house prices - regression
With this regression model, predict() returns
the model’s guess for the price.
>>> predictions = model.predict(test_data)
>>> predictions[0]
array([9.990133], dtype=float32)
The first house in the test set is predicted
to have a price of about $10,000.
Predicting house prices - regression
Wrapping up
MSE is a loss function commonly used for
regression.
A common regression metric is MAE.
When features in the input data have values in
different ranges, we apply normalization.
When there is little data available,
use K-fold validation,
use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting.
Chapter summary
Preprocess raw data to feed into a NN.
Use normalization if different ranges.
Avoid information bottlenecks.
Regression uses different loss functions
and evaluation metrics than classification.
When working with little data,
Use a small network with only one or two
hidden layers, to avoid severe overfitting,
Use k-fold validation.