Unit - 3 DL
UNIT-III
UNIT III: Neural Networks: Anatomy of Neural Network, Introduction to Keras: Keras, TensorFlow,
Theano and CNTK, Setting up Deep Learning Workstation, Classifying Movie Reviews: Binary
Classification, Classifying newswires: Multiclass Classification
Neural Networks:
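Training a neural network revolves around the following objects:
Layers, which are combined into a network (or model)
The input data and corresponding targets
The loss function, which defines the feedback signal used for learning
The optimizer, which determines how learning proceeds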
You can visualize their interaction as illustrated in figure 3.1: the network, composed of layers that
are chained together, maps the input data to predictions. The loss function then compares these
predictions to the targets, producing a loss value: a measure of how well the network’s predictions
match what was expected. The optimizer uses this loss value to update the network’s weights.
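The interaction described above can be sketched with a deliberately tiny example. This is only an illustration under stated assumptions: the toy data, the one-layer "network", the learning rate, and the use of full-batch gradient descent with a mean-squared-error loss are all choices made here, not something taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input data and targets following y = 3x + 1 (hypothetical data).
inputs = rng.uniform(-1.0, 1.0, size=(64, 1))
targets = 3.0 * inputs + 1.0

# The "network": a single layer holding one weight and one bias.
weight = np.zeros((1, 1))
bias = np.zeros(1)

def network(x):
    """Map input data to predictions."""
    return x @ weight + bias

def loss_fn(predictions, y):
    """Compare predictions to targets and return a scalar loss value."""
    return float(np.mean((predictions - y) ** 2))

learning_rate = 0.5
initial_loss = loss_fn(network(inputs), targets)

for step in range(200):
    error = network(inputs) - targets
    # Gradients of the mean squared error with respect to weight and bias.
    grad_weight = 2.0 * inputs.T @ error / len(inputs)
    grad_bias = 2.0 * error.mean(axis=0)
    # The "optimizer": plain gradient descent on the loss value.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

final_loss = loss_fn(network(inputs), targets)
print(initial_loss, final_loss)
```

After the loop the weight and bias should sit close to 3 and 1, which is the update cycle of predictions, loss, and weight update in miniature.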
A layer is a data-processing module that takes as input one or more tensors and that outputs one or
more tensors. Some layers are stateless, but more frequently layers have a state: the layer’s weights,
one or several tensors learned with stochastic gradient descent, which together contain the
network’s knowledge. Different layers are appropriate for different tensor formats and different
types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples,
features), is often processed by densely connected layers, also called fully connected or dense layers
(the Dense class in Keras). Sequence data, stored in 3D tensors of shape (samples, timesteps,
features), is typically processed by recurrent layers such as an LSTM layer. Image data, stored in 4D
tensors, is usually processed by 2D convolution layers (Conv2D).
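As a rough illustration of these tensor formats, the NumPy arrays below mimic the three shapes. The concrete sizes and the channels-last image layout are assumptions chosen for the example.

```python
import numpy as np

# Vector data: 2D tensor of shape (samples, features).
vector_data = np.zeros((32, 20))

# Sequence data: 3D tensor of shape (samples, timesteps, features).
sequence_data = np.zeros((32, 100, 8))

# Image data: 4D tensor, laid out here as (samples, height, width, channels).
image_data = np.zeros((32, 28, 28, 3))

for name, tensor in [("vector", vector_data),
                     ("sequence", sequence_data),
                     ("image", image_data)]:
    print(name, "rank:", tensor.ndim, "shape:", tensor.shape)
```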
You can think of layers as the LEGO bricks of deep learning, a metaphor that is made explicit by
frameworks like Keras. Building deep-learning models in Keras is done by clipping together
compatible layers to form useful data-transformation pipelines. The notion of layer compatibility
here refers specifically to the fact that every layer will only accept input tensors of a certain shape
and will return output tensors of a certain shape.
A deep-learning model is a directed, acyclic graph of layers. The most common instance is a linear
stack of layers, mapping a single input to a single output. But as you move forward, you’ll be
exposed to a much broader variety of network topologies. Some common ones include the
following:
Two-branch networks
Multihead networks
Inception blocks
The topology of a network defines a hypothesis space. Earlier, we defined machine learning as “searching for
useful representations of some input data, within a predefined space of possibilities, using guidance
from a feedback signal.” By choosing a network topology, you constrain your space of possibilities
(hypothesis space) to a specific series of tensor operations, mapping input data to output data. What
you’ll then be searching for is a good set of values for the weight tensors involved in these tensor
operations.
Once the network architecture is defined, you still have to choose two more things:
Loss function (objective function)—The quantity that will be minimized during training. It
represents a measure of success for the task at hand.
Optimizer—Determines how the network will be updated based on the loss function. It
implements a specific variant of stochastic gradient descent (SGD).
A neural network that has multiple outputs may have multiple loss functions (one per output). But
the gradient-descent process must be based on a single scalar loss value; so, for multiloss networks,
all losses are combined (via averaging) into a single scalar quantity.
Choosing the right objective function for the right problem is extremely important: your network will
take any shortcut it can, to minimize the loss; so if the objective doesn’t fully correlate with success
for the task at hand, your network will end up doing things you may not have wanted. Imagine a
stupid, omnipotent AI trained via SGD, with this poorly chosen objective function: “maximizing the
average well-being of all humans alive.” To make its job easier, this AI might choose to kill all
humans except a few and focus on the well-being of the remaining ones—because average well-
being isn’t affected by how many humans are left. That might not be what you intended! Just
remember that all neural networks you build will be just as ruthless in lowering their loss function—
so choose the objective wisely, or you’ll have to face unintended side effects.
Introduction to Keras
Throughout this book, the code examples use Keras (https://keras.io). Keras is a deep-learning
framework for Python that provides a convenient way to define and train almost any kind of deep-
learning model. Keras was initially developed for researchers, with the aim of enabling fast
experimentation. Keras has the following key features:
It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
It has built-in support for convolutional networks (for computer vision), recurrent networks (for
sequence processing), and any combination of both.
This means Keras is appropriate for building essentially any deep-learning model, from a generative
adversarial network to a neural Turing machine. Keras is distributed under the permissive MIT
license, which means it can be freely used in commercial projects. It’s compatible with any version of
TensorFlow, CNTK, and Theano are some of the primary platforms for deep learning today. Theano
(http://deeplearning.net/software/theano) is developed by the MILA lab at Université de Montréal,
TensorFlow (www.tensorflow.org) is developed by Google, and CNTK
(https://github.com/Microsoft/CNTK) is developed by Microsoft. Any piece of code that you write
with Keras can be run with any of these backends without having to change anything in the code:
you can seamlessly switch between them during development, which often proves useful, for
instance, if one of these backends proves to be faster for a specific task. We recommend using the
TensorFlow backend as the default for most of your deep-learning needs, because it’s the most
widely adopted, scalable, and production ready. Via TensorFlow (or Theano, or CNTK), Keras is able
to run seamlessly on both CPUs and GPUs. When running on CPU, TensorFlow is itself wrapping a
low-level library for tensor operations called Eigen (http://eigen.tuxfamily.org). On GPU, TensorFlow
wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural
Network library (cuDNN).
Whether you’re running locally or in the cloud, it’s better to be using a Unix workstation. Although
it’s technically possible to use Keras on Windows (all three Keras backends support Windows), we
don’t recommend it. In the installation instructions in appendix A, we’ll consider an Ubuntu
machine. If you’re a Windows user, the simplest solution to get everything running is to set up an
Ubuntu dual boot on your machine. It may seem like a hassle, but using Ubuntu will save you a lot of
time and trouble in the long run.
Jupyter notebooks are a great way to run deep-learning experiments—in particular, the many code
examples in this book. They’re widely used in the data-science and machine-learning communities. A
notebook is a file generated by the Jupyter Notebook app (https://jupyter.org), which you can edit in
your browser. It mixes the ability to execute Python code with rich text-editing capabilities for
annotating what you’re doing. A notebook also allows you to break up long experiments into smaller
pieces that can be executed independently, which makes development interactive and means you
don’t have to rerun all of your previous code if something goes wrong late in an experiment.
Use the official EC2 Deep Learning AMI (https://aws.amazon.com/amazonai/amis), and run Keras
experiments as Jupyter notebooks on EC2. Do this if you don’t already have a GPU on your local
machine. Appendix B provides a step-by-step guide.
Install everything from scratch on a local Unix workstation. You can then run either local Jupyter
notebooks or a regular Python codebase. Do this if you already have a high-end NVIDIA GPU.
Appendix A provides an Ubuntu-specific, step-by-step guide.
If you don’t already have a GPU that you can use for deep learning (a recent, high-end NVIDIA GPU),
then running deep-learning experiments in the cloud is a simple, low-cost way for you to get started
without having to buy any additional hardware. If you’re using Jupyter notebooks, the experience of
running in the cloud is no different from running locally. As of mid-2017, the cloud offering that
makes it easiest to get started with deep learning is definitely AWS EC2. Appendix B provides a step-
by-step guide to running Jupyter notebooks on an EC2 GPU instance. But if you’re a heavy user of
deep learning, this setup isn’t sustainable in the long term—or even for more than a few weeks. EC2
instances are expensive: the instance type recommended in appendix B (the p2.xlarge instance,
which won’t provide you with much power) costs $0.90 per hour as of mid-2017. Meanwhile, a solid
consumer-class GPU will cost you somewhere between $1,000 and $1,500, a price that has been
fairly stable over time, even as the specs of these GPUs keep improving. If you’re serious about deep
learning, you should set up a local workstation with one or more GPUs. In short, EC2 is a great way
to get started. You could follow the code examples in this book entirely on an EC2 GPU instance. But
if you’re going to be a power user of deep learning, get your own GPUs.
The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie
Database. They are split into 25,000 reviews each for training and testing. Each set contains an
equal number (50%) of positive and negative reviews.
The IMDB dataset comes packaged with Keras. It consists of reviews and their corresponding
labels (0 for negative and 1 for positive review). The reviews are a sequence of words. They
come preprocessed as a sequence of integers, where each integer stands for a specific word in
the dictionary.
The IMDB dataset can be loaded directly from Keras and will usually download about 80 MB
on your machine.
Let’s load the prepackaged data from Keras. We will only include 10,000 of the most
frequently occurring words.
Loading and Analyzing input data
We cannot feed a list of integers into our deep neural network. We will need to convert them
into tensors.
To prepare our data, we will one-hot encode our lists and turn them into vectors of 0s and
1s. This turns each sequence into a 10,000-dimensional vector containing 1 at
all indices corresponding to integers present in that sequence and 0 at
every index that is not present in the sequence.
Simply put, in the 10,000-dimensional vector corresponding to each review:
• Every index with value 1 is a word that is present in the review, denoted by its
integer counterpart.
• Every index with value 0 is a word that is not present in the review.
We will vectorize our data manually for maximum clarity. This will result in a tensor of shape
(25000, 10000).
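A minimal sketch of this manual vectorization in NumPy follows. The function name vectorize_sequences and the two sample reviews are illustrative assumptions. Strictly speaking the result is a multi-hot vector, since a review sets many indices to 1 and repeated words are recorded only once.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    """Encode lists of word indices as rows of 0s and 1s."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        # Set every index that appears in this review to 1.
        results[i, sequence] = 1.0
    return results

# Hypothetical stand-ins for two integer-encoded reviews.
sample_reviews = [[1, 14, 22, 16], [1, 194, 1153, 194]]
x_sample = vectorize_sequences(sample_reviews)
print(x_sample.shape)
```

Applied to the full list of 25,000 training reviews, the same function would yield the (25000, 10000) tensor mentioned above.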
Pre Processing Input data
Our input data consists of vectors that need to be mapped to scalar labels (0s and 1s). This is one of the
easiest setups, and a simple stack of fully connected Dense layers with relu activation performs
quite well.
Hidden layers
In this network, we will leverage hidden layers. We will define our layers as such.
Dense(16, activation='relu')
The argument passed to each Dense layer (16) is the number of hidden units in the layer.
The output from a Dense layer with relu activation is generated after a chain
of tensor operations. This chain of operations is implemented as
output = relu(dot(input, W) + b)
Having 16 hidden units means that the matrix W will be of the shape (input_dimension, 16).
In this case, where the dimension of the input vector is 10,000, the shape of the weight matrix
will be (10000, 16). If you were to represent this network as a graph, you would see 16
neurons in this hidden layer.
Each of these neurons, or hidden units, is a dimension in the representation space of the layer.
The representation space is the set of all viable representations for the data. Every hidden
layer, composed of its hidden units, aims to learn one specific transformation of the data or one
feature/pattern from the data.
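The shapes involved can be checked with a small NumPy sketch of one Dense layer. The random weights, the batch of four inputs, and the choice to put the input batch on the left of the product (so that a (batch, 10000) array lines up with a (10000, 16) weight matrix) are assumptions made for this illustration. A trained Keras layer would hold learned weights instead.

```python
import numpy as np

def relu(x):
    """Zero out negative values element-wise."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
input_dimension, hidden_units = 10000, 16

# Randomly initialised parameters; a trained layer would have learned these.
W = rng.normal(scale=0.01, size=(input_dimension, hidden_units))
b = np.zeros(hidden_units)

# A batch of 4 hypothetical input vectors of 0s and 1s.
batch = rng.integers(0, 2, size=(4, input_dimension)).astype("float32")

# One Dense layer with relu activation.
output = relu(np.dot(batch, W) + b)
print(W.shape, output.shape)
```

Each input vector is projected onto 16 dimensions, one per hidden unit.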
Hidden layers, simply put, are layers of mathematical functions each designed to produce an
output specific to an intended result.
Hidden layers allow the function of a neural network to be broken down into specific
transformations of the data. Each hidden-layer function is specialized to produce a defined
output. For example, hidden-layer functions that identify human eyes and ears
may be used in conjunction by subsequent layers to identify faces in images. While the
functions that identify eyes are not enough on their own to recognize objects, they can
function jointly within a neural network.
ReLU activation function. This is one of the most commonly used activation functions.
Model Architecture
3. Intermediate layers will use the relu activation function. relu, or rectified linear unit,
zeroes out negative values.
4. Sigmoid activation for the final layer or output layer. A sigmoid function maps any real
value into the range (0, 1), so its output can be read as a probability.
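The two activation functions named in this list are simple enough to write out directly. The NumPy versions below are a sketch for intuition, not the Keras implementations.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu_out = relu(values)
sigmoid_out = sigmoid(values)
print(relu_out)
print(sigmoid_out)
```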
There are formal principles that guide our approach in selecting the architectural attributes of a
model. These are not covered in this case study.
Defining the model architecture
In this step, we will choose an optimizer, a loss function, and metrics to observe. We will go
forward with the rmsprop optimizer, the binary_crossentropy loss function, and accuracy as the metric.
We can pass our choices for optimizer, loss function and metrics as strings to
the compile function because rmsprop, binary_crossentropy and accuracy come packaged with
Keras.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
One could use a customized loss function or optimizer by passing a custom class instance as
an argument to the loss, optimizer, or metrics fields.
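To make the loss concrete, binary cross-entropy can be written out in a few lines of NumPy. This is a sketch of the standard formula, not the Keras source. The clipping constant eps and the sample predictions are assumptions chosen for the example.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

labels = [1, 0, 1, 0]
confident_and_right = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_and_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2, 0.8])
print(confident_and_right, confident_and_wrong)
```

The loss is small when confident predictions agree with the labels and grows quickly when confident predictions are wrong.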
In this example, we will implement our default choices, but we will do so by passing class
instances. This is precisely how we would do it if we had customized parameters.
Compiling the model
Setting up Validation
We will set aside a part of our training data for validation of the accuracy of the model as it
trains. A validation set enables us to monitor the progress of our model on previously unseen
data as it goes through epochs during training.
Validation helps us tune the training parameters passed to the model.fit function so as to avoid
overfitting and underfitting.
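Setting aside a validation set amounts to slicing the training arrays. In the sketch below the arrays are random stand-ins for the vectorized reviews and labels, and the validation size of 10,000 is an assumption; the text does not state how many samples are held out.

```python
import numpy as np

# Random stand-ins for the vectorized training data and labels (hypothetical).
rng = np.random.default_rng(0)
x_train = rng.integers(0, 2, size=(25000, 20)).astype("float32")
y_train = rng.integers(0, 2, size=25000).astype("float32")

validation_size = 10000  # assumed size; not stated in the text

# The first samples become the validation set, the rest stay for training.
x_val, partial_x_train = x_train[:validation_size], x_train[validation_size:]
y_val, partial_y_train = y_train[:validation_size], y_train[validation_size:]
print(x_val.shape, partial_x_train.shape)
```

The held-out pair (x_val, y_val) is what would then be handed to the fit method as validation data.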
Setting up the Validation set for training the model
Initially, we will train our models for 20 epochs in mini-batches of 512 samples. We will also
pass our validation set to the fit method.
Calling the fit method returns a History object. This object contains a member history which
stores all data about the training process, including the values of observable or monitored
quantities as the epochs proceed. We will save this object so that we can decide how best to
fine-tune the training step.
Training the model. Times corresponding to Google Colab GPU. Usually takes about 20
seconds on CPU, i7
At the end of the training, we have attained a training accuracy of 99.85% and validation
accuracy of 86.57%.
Now that we have trained our network, we will observe its performance metrics stored in
the History object.
The history attribute of this object is a dictionary containing four entries, one per
monitored metric:
• Training loss
• Training Accuracy
• Validation Loss
• Validation Accuracy
Let’s use Matplotlib to plot Training and validation losses and Training and Validation
Accuracy side by side.
Analysis of Loss and Accuracy data derived from Training history. This data tells us about
the performance of our training strategy.
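We observe that the minimum validation loss and maximum validation accuracy are achieved at
around 3–5 epochs. After that, we observe two trends:
• The training loss keeps decreasing and the training accuracy keeps increasing.
• The validation loss increases and the validation accuracy decreases.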
This implies that the model is getting better at classifying the sentiment of the training data,
but making consistently worse predictions when it encounters new, previously unseen data.
This is the hallmark of Overfitting. After the 5th epoch, the model begins to fit too closely to
the training data.
To address overfitting, we will reduce the number of epochs to somewhere between 3 and 5.
These results may vary depending on your machine and because the weights are initialized
randomly, so they differ from one training run to the next.
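Picking the epoch count from the recorded history can also be done programmatically. The numbers below are made up for illustration and are not the results reported here. The key names follow the pattern of a Keras history dictionary, though the accuracy key is spelled differently across Keras versions.

```python
# Hypothetical per-epoch values shaped like history.history.
history_dict = {
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.32, 0.36, 0.41],
    "val_accuracy": [0.85, 0.88, 0.89, 0.885, 0.88, 0.875, 0.87],
}

val_loss = history_dict["val_loss"]

# Index of the smallest validation loss; epochs are numbered from 1.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print("Lowest validation loss at epoch", best_epoch)
```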
We retrain our neural network based on our findings from studying the history of loss and
accuracy variation. This time we run it for 3 epochs so as to avoid Overfitting on training data.
Retraining from scratch
In the end, we achieve a training accuracy of 99% and a validation accuracy of 86%. This is
pretty good, considering we are using a very naive approach. A higher degree of accuracy can
be achieved by using a better training algorithm.
We will use our trained model to make predictions for the test data. The output is an array of
floating-point numbers that denote the probability of a review being positive. As you can see, in
some cases, the network is absolutely sure the review is positive. In other cases, not so
much!
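To turn these probabilities into class labels, a common choice is to threshold them at 0.5. The predictions and true labels below are invented for the example, and the 0.5 cut-off is an assumption rather than something the text prescribes.

```python
import numpy as np

# Hypothetical model outputs: probability that each review is positive.
predictions = np.array([0.98, 0.03, 0.61, 0.45, 0.999])
true_labels = np.array([1, 0, 0, 0, 1])

# Anything above the threshold counts as a positive review.
predicted_labels = (predictions > 0.5).astype(int)
accuracy = float(np.mean(predicted_labels == true_labels))
print(predicted_labels, accuracy)
```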
Making Predictions
You could try to compute an error metric for the wrongly classified sentiments using
something like mean squared error, as I did here. But that would be misguided. The analysis
of the result is not something we will cover here. However, I will shed some
light on why using MSE is futile in this case.
The result from our model is the measure of how much the model perceives a review to be
positive. Rather than telling us the absolute class of the sample, the model tells us by how
much it perceives the sentiment to be skewed on one side or the other. MSE is too simple a
metric and fails to capture the complexity of the solution.
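One way to see the mismatch is to compare two sets of predictions that classify every sample correctly but with different confidence. The values and the 0.5 threshold below are assumptions made for illustration.

```python
import numpy as np

true_labels = np.array([1, 0, 1, 0])
confident = np.array([0.95, 0.05, 0.90, 0.10])  # hypothetical predictions
hesitant = np.array([0.55, 0.45, 0.60, 0.40])   # hypothetical predictions

def mse(y_true, y_pred):
    """Mean squared distance between labels and predicted probabilities."""
    return float(np.mean((y_true - y_pred) ** 2))

def error_rate(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction is wrong."""
    return float(np.mean((y_pred > threshold).astype(int) != y_true))

for name, preds in [("confident", confident), ("hesitant", hesitant)]:
    print(name, "MSE:", mse(true_labels, preds),
          "error rate:", error_rate(true_labels, preds))
```

Both sets misclassify nothing, yet their MSE differs widely, so MSE here reflects how far the probabilities sit from 0 or 1 rather than how many reviews were classified wrongly.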
Multiclass Classification is the classification of samples into more than two classes. Classifying
samples into precisely two categories is colloquially referred to as Binary Classification.
Information Bottleneck
Neural networks are comprised of many layers. Each layer performs some transformation on
the data mapping input to the output of the network. However, it is crucial to note that these
layers do not generate any additional data and work solely on the data that they receive from
the preceding layers.
If, say, a layer drops some relevant data, that information becomes inaccessible to all
subsequent layers. This information is permanently lost and cannot be retrieved. The layer that
drops this information now acts as a bottleneck, stifling the increase of the model’s accuracy
and performance, thus acting as an information bottleneck.
The Reuters dataset is a set of short newswires sorted into 46 mutually exclusive topics.
Reuters published it in 1986. This dataset is used widely for text classification. There are 46
topics, where some topics are represented more than others. However, each topic contains at
least ten examples in the training set.
The Reuters dataset comes preloaded with Keras and contains 8,982 training examples and
2,246 testing examples.
Load data from the pre-packaged module in Keras. We will limit the data to 10,000 of the
most frequently occurring words. To do this, we pass the num_words=10000 argument to
the load_data function.
We’ll perform some good-old-fashioned EDA on our dataset. Doing so will give us a general
idea of the breadth and scope of our data.
Decode a story
Let’s go ahead and decode a story. Decoding helps us get the gist of the organization and
encoding of the data.
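Decoding amounts to inverting the word index. The sketch below uses a tiny made-up word index so that it runs without downloading anything; with the real data the index would come from the dataset module. It assumes the usual Keras convention that stored indices are offset by 3, because 0, 1, and 2 are reserved for padding, start of sequence, and unknown words.

```python
# Hypothetical miniature word index; the real one maps thousands of words.
word_index = {"the": 1, "of": 2, "to": 3, "said": 4, "shares": 5}
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_story(encoded, offset=3):
    """Map integers back to words; reserved or unknown indices become '?'."""
    return " ".join(reverse_word_index.get(i - offset, "?") for i in encoded)

# The leading 1 falls in the reserved range, so it decodes to '?'.
decoded = decode_story([1, 4, 8, 5, 7])
print(decoded)
```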
We cannot feed integer sequences to the neural network; therefore, we will vectorize each
sequence and convert it into tensors. We do this by One-Hot Encoding each sequence.
We have 10,000 unique elements in our training and testing data. Vectorizing our input data
will result in two 2D tensors; a training Input tensor of shape (8982, 10000) and a test input
tensor of shape (2246, 10000).
The labels for this problem include 46 different classes. The labels are represented as integers
in the range 0 to 45. To vectorize the labels, we could either cast the label list as an integer
tensor or one-hot encode it.
We will go ahead with One-Hot Encoding of the label data. This will give us tensors, whose
second axis has 46 dimensions. This can be done easily using the to_categorical function in
Keras.
The other way to perform one-hot categorical encoding of the labels is to use the built-in
function. For clarity, here it is:
# Older Keras releases exposed this function under keras.utils.np_utils.
from keras.utils import to_categorical
Y_train = to_categorical(train_labels)
Y_test = to_categorical(test_labels)
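For comparison, the same encoding can be done by hand in NumPy. The helper name to_one_hot and the sample labels are illustrative assumptions, and the labels are assumed to be integers from 0 to 45.

```python
import numpy as np

def to_one_hot(labels, dimension=46):
    """One row per label, with a single 1 at the label's index."""
    results = np.zeros((len(labels), dimension), dtype="float32")
    for i, label in enumerate(labels):
        results[i, label] = 1.0
    return results

sample_labels = [3, 4, 45, 0]  # hypothetical topic labels
one_hot_sample = to_one_hot(sample_labels)
print(one_hot_sample.shape)
```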
Information Bottleneck
Given a multiclass classification problem, the amount of processing that we need to perform
on the data dramatically increases as compared to a binary classification problem.
Therefore, each layer will have to deal with more data, which presents the genuine possibility
of an information bottleneck.
In a stack of Dense layers, like the ones we use, each layer can only access information
present in the output of the previous layer. If a layer drops relevant information, that
information is permanently inaccessible for all subsequent layers. Information, once lost, can
never be recovered by later layers. In cases like multiclass classification, where data is
limited and crucial, every layer could become an information bottleneck.
Such layers could easily bottleneck the performance of our network. To ensure that crucial
data is not discarded, we will use layers with a greater number of hidden units, i.e., larger
layers. For comparison, we used layers with 16 hidden units, Dense(16), in the two-class
classification example of sentiment analysis of IMDB reviews. In this case, where the
output dimension is 46, we will use layers with 64 hidden units, Dense(64).
Model Definition
We will define our network with two fully connected, ReLU-activated layers with 64 hidden
units each. The third and last layer will be a Dense layer of size 46. This layer will use
a softmax activation and will output a 46-dimensional vector. Every dimension will be the
probability of the input belonging to that class.
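What the softmax activation does to the 46 raw outputs can be sketched in NumPy. The random inputs stand in for the last layer's pre-activation values and are purely hypothetical. Subtracting the maximum is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 46))  # hypothetical pre-activation outputs
probabilities = softmax(logits)

# The predicted topic is the class with the highest probability.
predicted_topics = np.argmax(probabilities, axis=-1)
print(probabilities.sum(axis=-1), predicted_topics)
```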
Calling the fit method returns a History object. This object contains a member history which
stores all data about the training process, including the values of observable or monitored
quantities as the epochs proceed. We will save this object since the information it holds will
help us decide how best to fine-tune the training step.
At the end of the training, we have attained a training accuracy of 95% and validation
accuracy of 80.9%.
Now that we have trained our network, we will observe its performance metrics stored in
the History object.
The history attribute of this object is a dictionary containing four entries, one per
monitored metric:
• Training loss
• Training Accuracy
• Validation Loss
• Validation Accuracy
Let’s use Matplotlib to plot Training and validation losses and Training and Validation
Accuracy side by side.
We observe that the minimum validation loss and maximum validation accuracy are achieved at
around 9–10 epochs. After that, we observe two trends:
• The training loss keeps decreasing and the training accuracy keeps increasing.
• The validation loss increases and the validation accuracy decreases.
This implies that the model is getting better at classifying the topics of the training data,
but making consistently worse predictions when it encounters new, previously unseen data,
which is the hallmark of overfitting. After the 10th epoch, the model begins to fit too closely
to the training data.
To address overfitting, we will reduce the number of epochs to 9. These results may vary
depending on your machine and because the weights are initialized randomly, so they differ
from one training run to the next.
Now that we know that excessive epochs are causing our model to overfit, we will limit the
number of epochs and retrain our model from scratch.
If this were a balanced binary classification problem, a purely random assignment of labels
would, by simple probability, result in 50% accuracy. But this dataset has 46 classes and is
not balanced, so the accuracy of a random classifier is considerably lower.
A random classifier doles out labels to samples randomly. To put it in perspective, a random
classifier is how the chimp at your local zoo would classify these newswires.
Considering that the random baseline is about 19%, our model with an accuracy of ~80%
performs quite well.
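One common way to estimate such a random baseline is to shuffle a copy of the labels and measure how often it agrees with the original. The sketch below uses a small made-up, imbalanced label set rather than the Reuters labels, so the number it prints is not the 19% quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: three classes with 60%, 30% and 10% of samples.
test_labels = np.array([0] * 600 + [1] * 300 + [2] * 100)

# Shuffle a copy of the labels and count how often it agrees with the original.
shuffled = rng.permutation(test_labels)
empirical_baseline = float(np.mean(shuffled == test_labels))

# Expected agreement for this scheme: the sum of squared class frequencies.
frequencies = np.bincount(test_labels) / len(test_labels)
expected_baseline = float(np.sum(frequencies ** 2))
print(empirical_baseline, expected_baseline)
```

The more uneven the class sizes, the higher this baseline, which is why an imbalanced 46-class problem can have a random baseline well above 1/46.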
This time we introduce an information bottleneck in our model. One of our layers will drop
data, and we will see its effect on the model performance as a drop in its accuracy.
Conclusion
At this point, you will have successfully classified newswires from the Reuters dataset under
their respective topics. You will also have seen how layers with an inadequate number of
hidden units hurt your model’s performance by discarding precious data.