Unit - 3 DL

UNIT-III

UNIT III: Neural Networks: Anatomy of Neural Network, Introduction to Keras: Keras, TensorFlow,
Theano and CNTK, Setting up Deep Learning Workstation, Classifying Movie Reviews: Binary
Classification, Classifying newswires: Multiclass Classification

Neural Networks:
Training a neural network revolves around the following objects:

 Layers, which are combined into a network (or model)

 The input data and corresponding targets

 The loss function, which defines the feedback signal used for learning

 The optimizer, which determines how learning proceeds

You can visualize their interaction as illustrated in figure 3.1: the network, composed of layers that
are chained together, maps the input data to predictions. The loss function then compares these
predictions to the targets, producing a loss value: a measure of how well the network’s predictions
match what was expected. The optimizer uses this loss value to update the network’s weights.

Layers: the building blocks of deep learning

A layer is a data-processing module that takes as input one or more tensors and that outputs one or
more tensors. Some layers are stateless, but more frequently layers have a state: the layer’s weights,
one or several tensors learned with stochastic gradient descent, which together contain the
network’s knowledge. Different layers are appropriate for different tensor formats and different
types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples,
features), is often processed by densely connected layers, also called fully connected or dense layers
(the Dense class in Keras). Sequence data, stored in 3D tensors of shape (samples, timesteps,
features), is typically processed by recurrent layers such as an LSTM layer. Image data, stored in 4D
tensors, is usually processed by 2D convolution layers (Conv2D).
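For instance (a minimal sketch using the Keras layer classes just mentioned; the layer sizes and window sizes here are arbitrary choices for illustration):

from keras import layers

layers.Dense(32, activation='relu')           # for 2D vector data: (samples, features)
layers.LSTM(32)                               # for 3D sequence data: (samples, timesteps, features)
layers.Conv2D(32, (3, 3), activation='relu')  # for 4D image data: (samples, height, width, channels)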

You can think of layers as the LEGO bricks of deep learning, a metaphor that is made explicit by
frameworks like Keras. Building deep-learning models in Keras is done by clipping together
compatible layers to form useful data-transformation pipelines. The notion of layer compatibility
here refers specifically to the fact that every layer will only accept input tensors of a certain shape
and will return output tensors of a certain shape.
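For instance (a minimal sketch; the sizes are arbitrary), the first layer below only accepts 2D input tensors with 784 features and outputs tensors of dimension 32, while the second layer declares no input shape and automatically adapts to the output shape of the layer before it:

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(32, input_shape=(784,)))  # accepts (samples, 784), returns (samples, 32)
model.add(layers.Dense(32))                      # input shape inferred from the previous layer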

Models: networks of layers

A deep-learning model is a directed, acyclic graph of layers. The most common instance is a linear
stack of layers, mapping a single input to a single output. But as you move forward, you’ll be
exposed to a much broader variety of network topologies. Some common ones include the
following:

 Two-branch networks

 Multihead networks

 Inception blocks

The topology of a network defines a hypothesis space. We defined machine learning as “searching for
useful representations of some input data, within a predefined space of possibilities, using guidance
from a feedback signal.” By choosing a network topology, you constrain your space of possibilities
(hypothesis space) to a specific series of tensor operations, mapping input data to output data. What
you’ll then be searching for is a good set of values for the weight tensors involved in these tensor
operations.

Loss functions and optimizers: keys to configuring the learning process

Once the network architecture is defined, you still have to choose two more things:

 Loss function (objective function)—The quantity that will be minimized during training. It
represents a measure of success for the task at hand.

 Optimizer—Determines how the network will be updated based on the loss function. It
implements a specific variant of stochastic gradient descent (SGD).

A neural network that has multiple outputs may have multiple loss functions (one per output). But
the gradient-descent process must be based on a single scalar loss value; so, for multiloss networks,
all losses are combined (via averaging) into a single scalar quantity.
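As a hedged sketch of what this looks like in Keras (the two-output model, the layer names 'a' and 'b', and the loss weights below are hypothetical, chosen only for illustration):

from keras import layers
from keras.layers import Input
from keras.models import Model

inputs = Input(shape=(128,))
x = layers.Dense(64, activation='relu')(inputs)
out_a = layers.Dense(1, activation='sigmoid', name='a')(x)   # binary output
out_b = layers.Dense(10, activation='softmax', name='b')(x)  # 10-class output
model = Model(inputs, [out_a, out_b])

# One loss per output; loss_weights determine how the losses are
# combined into the single scalar that gradient descent minimizes.
model.compile(optimizer='rmsprop',
              loss={'a': 'binary_crossentropy', 'b': 'categorical_crossentropy'},
              loss_weights={'a': 1.0, 'b': 0.5})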

Choosing the right objective function for the right problem is extremely important: your network will
take any shortcut it can, to minimize the loss; so if the objective doesn’t fully correlate with success
for the task at hand, your network will end up doing things you may not have wanted. Imagine a
stupid, omnipotent AI trained via SGD, with this poorly chosen objective function: “maximizing the
average well-being of all humans alive.” To make its job easier, this AI might choose to kill all
humans except a few and focus on the well-being of the remaining ones—because average well-
being isn’t affected by how many humans are left. That might not be what you intended! Just
remember that all neural networks you build will be just as ruthless in lowering their loss function—
so choose the objective wisely, or you’ll have to face unintended side effects.

Introduction to Keras
Throughout this book, the code examples use Keras (https://keras.io). Keras is a deep-learning
framework for Python that provides a convenient way to define and train almost any kind of
deep-learning model. Keras was initially developed for researchers, with the aim of enabling fast
experimentation. Keras has the following key features:

 It allows the same code to run seamlessly on CPU or GPU.

 It has a user-friendly API that makes it easy to quickly prototype deep-learning models.

 It has built-in support for convolutional networks (for computer vision), recurrent networks (for
sequence processing), and any combination of both.

 It supports arbitrary network architectures: multi-input or multi-output models, layer sharing,
model sharing, and so on.

This means Keras is appropriate for building essentially any deep-learning model, from a generative
adversarial network to a neural Turing machine. Keras is distributed under the permissive MIT
license, which means it can be freely used in commercial projects. It’s compatible with any version of
Python from 2.7 to 3.6 (as of mid-2017).

Keras, TensorFlow, Theano, and CNTK


Keras is a model-level library, providing high-level building blocks for developing deep-learning
models. It doesn’t handle low-level operations such as tensor manipulation and differentiation.
Instead, it relies on a specialized, well-optimized tensor library to do so, serving as the backend
engine of Keras. Rather than choosing a single tensor library and tying the implementation of Keras
to that library, Keras handles the problem in a modular way (see figure 3.3); thus several different
backend engines can be plugged seamlessly into Keras. Currently, the three existing backend
implementations are the TensorFlow backend, the Theano backend, and the Microsoft Cognitive
Toolkit (CNTK) backend. In the future, it’s likely that Keras will be extended to work with even more
deep-learning execution engines.


TensorFlow, CNTK, and Theano are some of the primary platforms for deep learning today. Theano
(http://deeplearning.net/software/theano) is developed by the MILA lab at Université de Montréal,
TensorFlow (www.tensorflow.org) is developed by Google, and CNTK
(https://github.com/Microsoft/CNTK) is developed by Microsoft. Any piece of code that you write
with Keras can be run with any of these backends without having to change anything in the code:
you can seamlessly switch between them during development, which often proves useful—for
instance, if one of these backends proves to be faster for a specific task. We recommend using the
TensorFlow backend as the default for most of your deep-learning needs, because it’s the most
widely adopted, scalable, and production ready. Via TensorFlow (or Theano, or CNTK), Keras is able
to run seamlessly on both CPUs and GPUs. When running on CPU, TensorFlow is itself wrapping a
low-level library for tensor operations called Eigen (http://eigen.tuxfamily.org). On GPU, TensorFlow
wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural
Network library (cuDNN).

Setting up a deep-learning workstation


Before you can get started developing deep-learning applications, you need to set up your
workstation. It’s highly recommended, although not strictly necessary, that you run deep-learning
code on a modern NVIDIA GPU. Some applications—in particular, image processing with
convolutional networks and sequence processing with recurrent neural networks—will be
excruciatingly slow on CPU, even a fast multicore CPU. And even for applications that can realistically
be run on CPU, you’ll generally see a speed increase by a factor of 5 or 10 by using a modern GPU. If
you don’t want to install a GPU on your machine, you can alternatively consider running your
experiments on an AWS EC2 GPU instance or on Google Cloud Platform. But note that cloud GPU
instances can become expensive over time

Whether you’re running locally or in the cloud, it’s better to be using a Unix workstation. Although
it’s technically possible to use Keras on Windows (all three Keras backends support Windows), we
don’t recommend it. In the installation instructions in appendix A, we’ll consider an Ubuntu
machine. If you’re a Windows user, the simplest solution to get everything running is to set up an
Ubuntu dual boot on your machine. It may seem like a hassle, but using Ubuntu will save you a lot of
time and trouble in the long run.

Jupyter notebooks: the preferred way to run deep-learning experiments

Jupyter notebooks are a great way to run deep-learning experiments—in particular, the many code
examples in this book. They’re widely used in the data-science and machine-learning communities. A
notebook is a file generated by the Jupyter Notebook app (https://jupyter.org), which you can edit in
your browser. It mixes the ability to execute Python code with rich text-editing capabilities for
annotating what you’re doing. A notebook also allows you to break up long experiments into smaller
pieces that can be executed independently, which makes development interactive and means you
don’t have to rerun all of your previous code if something goes wrong late in an experiment.

Getting Keras running: two options

To get started in practice, we recommend one of the following two options:


 Use the official EC2 Deep Learning AMI (https://aws.amazon.com/amazonai/amis), and run Keras
experiments as Jupyter notebooks on EC2. Do this if you don’t already have a GPU on your local
machine. Appendix B provides a step-by-step guide.

 Install everything from scratch on a local Unix workstation. You can then run either local Jupyter
notebooks or a regular Python codebase. Do this if you already have a high-end NVIDIA GPU.
Appendix A provides an Ubuntu-specific, step-by-step guide.

Running deep-learning jobs in the cloud: pros and cons

If you don’t already have a GPU that you can use for deep learning (a recent, high-end NVIDIA GPU),
then running deep-learning experiments in the cloud is a simple, low-cost way for you to get started
without having to buy any additional hardware. If you’re using Jupyter notebooks, the experience of
running in the cloud is no different from running locally. As of mid-2017, the cloud offering that
makes it easiest to get started with deep learning is definitely AWS EC2. Appendix B provides a step-
by-step guide to running Jupyter notebooks on an EC2 GPU instance. But if you’re a heavy user of
deep learning, this setup isn’t sustainable in the long term—or even for more than a few weeks. EC2
instances are expensive: the instance type recommended in appendix B (the p2.xlarge instance,
which won’t provide you with much power) costs $0.90 per hour as of mid-2017. Meanwhile, a solid
consumer-class GPU will cost you somewhere between $1,000 and $1,500—a price that has been
fairly stable over time, even as the specs of these GPUs keep improving. If you’re serious about deep
learning, you should set up a local workstation with one or more GPUs. In short, EC2 is a great way
to get started. You could follow the code examples in this book entirely on an EC2 GPU instance. But
if you’re going to be a power user of deep learning, get your own GPUs.

Classifying movie reviews: a binary classification example


The IMDB Dataset

The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie
Database. They are split into 25,000 reviews each for training and testing. Each set contains an
equal number (50%) of positive and negative reviews.

The IMDB dataset comes packaged with Keras. It consists of reviews and their corresponding
labels (0 for negative and 1 for positive review). The reviews are a sequence of words. They
come preprocessed as a sequence of integers, where each integer stands for a specific word in
the dictionary.

The IMDB dataset can be loaded directly from Keras and will usually download about 80 MB
on your machine.

Loading the Data

Let’s load the prepackaged data from Keras. We will only include the 10,000 most
frequently occurring words.
Loading and Analyzing input data
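A minimal sketch of the loading step, using the standard Keras loader:

from keras.datasets import imdb

# num_words=10000 keeps only the 10,000 most frequent words in the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)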


For kicks, let’s decode the first review.


Decoding a review
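A sketch of the decoding step (the Keras IMDB loader reserves indices 0, 1, and 2 for padding, start-of-sequence, and unknown tokens, hence the offset of 3):

word_index = imdb.get_word_index()  # dictionary mapping words to integer indices
reverse_word_index = {value: key for (key, value) in word_index.items()}
decoded_review = ' '.join(reverse_word_index.get(i - 3, '?') for i in train_data[0])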

Preparing the Data

We cannot feed a list of integers into our deep neural network. We will need to convert them
into tensors.

To prepare our data, we will one-hot encode our lists and turn them into vectors of 0s and
1s. This blows up each of our sequences into a 10,000-dimensional vector containing 1 at
all indices corresponding to integers present in that sequence, and 0 at every index whose
integer is not present in the sequence.

Simply put, in the 10,000-dimensional vector corresponding to each review:

• Every index corresponds to a word.

• Every index with value 1 marks a word that is present in the review, denoted by its
integer counterpart.

• Every index containing 0 marks a word not present in the review.

We will vectorize our data manually for maximum clarity. This will result in a tensor of shape
(25000, 10000).
Pre Processing Input data
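A sketch of the manual vectorization just described:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))  # all-zero matrix, one row per review
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.                    # set indices of words present in the review to 1
    return results

x_train = vectorize_sequences(train_data)  # shape (25000, 10000)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype('float32')  # vectorize the labels as well
y_test = np.asarray(test_labels).astype('float32')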

Building the Neural Network

Our input data consists of vectors that need to be mapped to scalar labels (0s and 1s). This is one of the
easiest setups, and a simple stack of fully connected Dense layers with relu activation performs
quite well.

Hidden layers

In this network, we will leverage hidden layers. We will define our layers as follows:
Dense(16, activation='relu')

The argument passed to each Dense layer (16) is the number of hidden units in the layer.

The output from a Dense layer with relu activation is generated after a chain
of tensor operations. This chain of operations is implemented as
output = relu(dot(W, input) + b)


where W is the weight matrix and b is the bias tensor.

Having 16 hidden units means that the matrix W will be of shape (input_dimension, 16).
In this case, where the dimension of the input vector is 10,000, the shape of the weight matrix
will be (10000, 16). If you were to represent this network as a graph, you would see 16
neurons in this hidden layer.

To put it in layman’s terms, there will be 16 balls in this layer.

Each of these balls or hidden units is a dimension in the representation space of the layer.
Representation space is the set of all viable representations for the data. Every hidden
layer, through its hidden units, aims to learn one specific transformation of the data, or one
feature/pattern from the data.

DeepAI.org has a very informative write-up on hidden layers.

Hidden layers, simply put, are layers of mathematical functions each designed to produce an
output specific to an intended result.

Hidden layers allow for the function of a neural network to be broken down into specific
transformations of the data. Each hidden layer function is specialized to produce a defined
output. For example, hidden layer functions that are used to identify human eyes and ears
may be used in conjunction by subsequent layers to identify faces in images. While the
functions to identify eyes alone are not enough to independently recognize objects, they can
function jointly within a neural network.

The ReLU activation function, one of the most commonly used activation functions.


Model Architecture

For our model, we will use

1. Two intermediate layers with 16 hidden units each

2. A third layer that will output the scalar sentiment prediction

3. The relu activation function for the intermediate layers; relu (rectified linear unit)
zeroes out negative values

4. Sigmoid activation for the final, or output, layer. A sigmoid function
“squashes” arbitrary values into the [0,1] range.

The sigmoid activation function. (Source: Wikipedia, by Qef)

There are formal principles that guide our approach in selecting the architectural attributes of a
model. These are not covered in this case study.
Defining the model architecture
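A sketch of the architecture just described:

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))  # intermediate layer 1
model.add(layers.Dense(16, activation='relu'))                        # intermediate layer 2
model.add(layers.Dense(1, activation='sigmoid'))                      # scalar sentiment output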

Compiling the model

In this step, we will choose an optimizer, a loss function, and metrics to observe. We will go
forward with

• binary_crossentropy loss function, commonly used for Binary Classification

• rmsprop optimizer and

• accuracy as a measure of performance


We can pass our choices for optimizer, loss function and metrics as strings to
the compile function because rmsprop, binary_crossentropy and accuracy come packaged with
Keras.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

One could use a customized loss function or optimizer by passing a custom class instance as
an argument to the loss, optimizer or metrics fields.

In this example, we will implement our default choices, but we will do so by passing class
instances. This is precisely how we would do it if we had customized parameters.
Compiling the model
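A sketch of compiling with class instances rather than strings (the learning rate shown is a hypothetical choice, standing in for any customized parameter):

from keras import optimizers, losses, metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])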

Setting up Validation

We will set aside a part of our training data for validation of the accuracy of the model as it
trains. A validation set enables us to monitor the progress of our model on previously unseen
data as it goes through epochs during training.

Validation steps help us fine-tune the training parameters of the model.fit function to avoid
overfitting and underfitting of the data.
Setting up the Validation set for training the model
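A sketch of carving out the validation set (setting aside the first 10,000 training samples is a conventional choice for this dataset):

x_val = x_train[:10000]            # 10,000 reviews for validation
partial_x_train = x_train[10000:]  # remaining 15,000 reviews for training
y_val = y_train[:10000]
partial_y_train = y_train[10000:]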

Training our model

Initially, we will train our model for 20 epochs in mini-batches of 512 samples. We will also
pass our validation set to the fit method.

Calling the fit method returns a History object. This object contains a member history which
stores all data about the training process, including the values of observable or monitored
quantities as the epochs proceed. We will save this object to better determine the fine-tuning
to apply to the training step.
Training the model. (Times correspond to a Google Colab GPU; training usually takes about 20
seconds on an i7 CPU.)
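A sketch of the training call just described:

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))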

At the end of the training, we have attained a training accuracy of 99.85% and a validation
accuracy of 86.57%.

Now that we have trained our network, we will observe its performance metrics stored in
the History object.

Calling the fit method returns a History object. This object has an attribute history which is a
dictionary containing four entries: one per monitored metric.


History of the training process.

history_dict contains values of

• Training loss

• Training Accuracy

• Validation Loss

• Validation Accuracy

at the end of each epoch.

Let’s use Matplotlib to plot the training and validation losses and the training and validation
accuracies side by side.
Analysis of the loss and accuracy data derived from the training history. This data tells us about
the performance of our training strategy.
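A sketch of the loss plot (the accuracy plot is analogous; the exact metric key names in history.history, such as 'acc' versus 'accuracy', depend on your Keras version):

import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')       # 'bo' plots blue dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')  # 'b' plots a solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()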

We observe that the minimum validation loss and maximum validation accuracy are achieved at
around 3–5 epochs. After that, we observe two trends:

• increase in validation loss and a decrease in training loss

• decrease in validation accuracy and an increase in training accuracy

This implies that the model is getting better at classifying the sentiment of the training data,
but making consistently worse predictions when it encounters new, previously unseen data.
This is the hallmark of Overfitting. After the 5th epoch, the model begins to fit too closely to
the training data.

To address overfitting, we will reduce the number of epochs to somewhere between 3 and 5.
These results may vary depending on your machine and due to the random initialization of
weights, which may vary from model to model.

In our case, we will stop training after 3 epochs.

Retraining our Neural Network

We retrain our neural network based on our findings from studying the history of loss and
accuracy variation. This time, we run it for 3 epochs so as to avoid overfitting on the training data.
Retraining from scratch
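A sketch of the retraining step:

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=512)  # stop at 3 epochs to avoid overfitting
results = model.evaluate(x_test, y_test)               # [test loss, test accuracy]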


In the end, we achieve a training accuracy of 99% and a validation accuracy of 86%. This is
pretty good, considering we are using a very naive approach. A higher degree of accuracy can
be achieved by using a better training algorithm.

Evaluating the model performance

We will use our trained model to make predictions for the test data. The output is an array of
floating-point values that denote the probability of a review being positive. As you can see, in
some cases, the network is absolutely sure the review is positive. In other cases — not so
much!
Making Predictions
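A sketch of the prediction step:

predictions = model.predict(x_test)
# Each entry is the predicted probability that the corresponding review is positive:
# values near 1.0 are confidently positive, values near 0.0 confidently negative.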

You could try to find some error metric for the number of sentiments that were wrongly
classified by using a metric like mean squared error, as I did here. But it would be stupid to do
so! The analysis of the result is not something we will cover here. However, I will shed some
light on why using MSE is futile in this case.

The result from our model is the measure of how much the model perceives a review to be
positive. Rather than telling us the absolute class of the sample, the model tells us by how
much it perceives the sentiment to be skewed on one side or the other. MSE is too simple a
metric and fails to capture the complexity of the solution.

Multiclass classification is the classification of samples into more than two classes. Classifying
samples into precisely two categories is colloquially referred to as binary classification.

Information Bottleneck

Neural networks are composed of many layers. Each layer performs some transformation on
the data, mapping the input to the output of the network. However, it is crucial to note that these
layers do not generate any additional data and work solely on the data that they receive from
the preceding layers.

If, say, a layer drops some relevant data, that information becomes inaccessible to all
subsequent layers; it is permanently lost and cannot be retrieved. The layer that
drops this information then stifles any further increase in the model’s accuracy
and performance, acting as an information bottleneck.

We shall see this in action later on.

The Reuters Dataset

The Reuters dataset is a set of short newswires sorted into 46 mutually exclusive topics.
Reuters published it in 1986. This dataset is used widely for text classification. There are 46
topics, where some topics are represented more than others. However, each topic contains at
least ten examples in the training set.

The Reuters dataset comes preloaded with Keras and contains 8,982 training examples and
2,246 testing examples.

Loading the data

Load the data from the pre-packaged module in Keras. We will limit the data to the 10,000
most frequently occurring words. To do this, we pass the num_words=10000 argument to
the load_data function.
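A minimal sketch of the loading step, using the standard Keras loader:

from keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)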

Some Exploratory Data Analysis

We’ll perform some good-old-fashioned EDA on our dataset. Doing so will give us a general
idea of the breadth and scope of our data.

Decode a story

Let’s go ahead and decode a story. Decoding helps us get the gist of the organization and
encoding of the data.

Preparing the data

We cannot feed integer sequences to the neural network; therefore, we will vectorize each
sequence and convert it into tensors. We do this by One-Hot Encoding each sequence.

We have 10,000 unique elements in our training and testing data. Vectorizing our input data
will result in two 2D tensors: a training input tensor of shape (8982, 10000) and a test input
tensor of shape (2246, 10000).

The labels for this problem include 46 different classes. The labels are represented as integers
in the range 0 to 45. To vectorize the labels, we could either:

• Cast the labels as integer tensors

• One-Hot encode the label data

We will go ahead with one-hot encoding of the label data. This will give us tensors whose
second axis has 46 dimensions. This can be done easily using the to_categorical function in
Keras.

For greater clarity, we will do it manually.
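A sketch of the manual one-hot encoding of the labels:

import numpy as np

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.  # set the index of the label's class to 1
    return results

one_hot_train_labels = to_one_hot(train_labels)  # shape (8982, 46)
one_hot_test_labels = to_one_hot(test_labels)    # shape (2246, 46)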


The other way to perform one-hot categorical encoding of the labels is to use the built-in
to_categorical function:
from keras.utils.np_utils import to_categorical
Y_train = to_categorical(train_labels)
Y_test = to_categorical(test_labels)

Building the Neural Network

The problem of topic classification (single-label multiclass classification) is similar to
binary classification of text fields. Both problems follow a similar approach towards
handling and pre-processing of data, and the approach to implementing the neural network remains
unchanged. However, there is a new constraint: the number of classes just went from 2 to 46.
The dimensionality of the output space is much higher.

Therefore, each layer will have to deal with more data, which presents the genuine possibility
of an information bottleneck.

Information Bottleneck

Given a multiclass classification problem, the amount of processing that we need to perform
on the data dramatically increases as compared to a Binary Classification problem.

In a stack of Dense layers, like the ones we use, each layer can only access information
present in the output of the previous layer. If a layer drops relevant information, that
information is permanently inaccessible to all subsequent layers. Information, once lost, can
never be recovered by later layers. In cases like multiclass classification, where data is
limited and crucial, every layer could become an information bottleneck.

If a layer drops relevant information, it is potentially an information bottleneck.

Such layers could easily bottleneck the performance of our network. To ensure that crucial
data is not discarded, we will use layers with a greater number of hidden units, i.e., larger
layers. For comparison, we used layers with 16 hidden units, Dense(16), in the two-class
classification example of sentiment analysis of IMDB reviews. In this case, where the
output dimension is 46, we will use layers with 64 hidden units, Dense(64).

Model Definition

We will define our network with two fully connected, relu-activated layers with 64 hidden
units each. The third and last layer will be a Dense layer of size 46. This layer will use
a softmax activation and will output a 46-dimensional vector, where every dimension is the
probability of the input belonging to that class.
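A sketch of this model and its compilation (categorical_crossentropy is the standard loss for one-hot encoded multiclass labels):

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))  # one probability per topic
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])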

Calling the fit method returns a History object. This object contains a member history which
stores all data about the training process, including the values of observable or monitored
quantities as the epochs proceed. We will save this object, since the information it holds will
help us better determine the fine-tuning to apply to the training step.

At the end of the training, we have attained a training accuracy of 95% and a validation
accuracy of 80.9%.

Now that we have trained our network, we will observe its performance metrics stored in
the History object.

Calling the fit method returns a History object. This object has an attribute history which is a
dictionary containing four entries: one per monitored metric.

history_dict contains values of

• Training loss

• Training Accuracy

• Validation Loss

• Validation Accuracy

at the end of each epoch.

Let’s use Matplotlib to plot the training and validation losses and the training and validation
accuracies side by side.

Training and Validation Loss


Loss versus Epochs



Training and Validation Accuracy

Overfitting: Trends in Loss and Accuracy Data

We observe that the minimum validation loss and maximum validation accuracy are achieved at
around 9–10 epochs. After that, we observe two trends:

• increase in validation loss and a decrease in training loss

• decrease in validation accuracy and an increase in training accuracy

This implies that the model is getting better at classifying the training data,
but making consistently worse predictions when it encounters new, previously unseen data,
which is the hallmark of overfitting. After the 10th epoch, the model begins to fit too closely
to the training data.

To address overfitting, we will reduce the number of epochs to 9. These results may vary
depending on your machine and due to the random initialization of weights, which
may differ from model to model.

In our case, we will stop training after nine epochs.

Retraining our model from scratch

Now that we know that excessive epochs are causing our model to overfit, we will limit the
number of epochs and retrain our model from scratch.

Predicting results and evaluation


Our approach results in an accuracy of ~80%.

If this were a balanced dataset, a random attribution of labels would, using simple probability,
result in an accuracy of about 1/46, or roughly 2%. But since this dataset is not balanced, a
random classifier will score somewhat higher than that.

A random classifier doles out labels to samples randomly. To put it in perspective, a random
classifier is how the chimp at your local zoo would classify these newswires.

Let’s determine this random baseline:
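A sketch of computing the baseline by shuffling the test labels and measuring how often the random assignment agrees with the truth:

import copy
import numpy as np

test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)                        # random attribution of labels
hits = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits)) / len(test_labels))              # roughly 0.19 for this dataset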

Considering that the random baseline is about 19%, our model with an accuracy of ~80%
performs quite well.

A model with Information Bottleneck

This time we introduce an information bottleneck in our model. One of our layers will drop
data, and we will see its effect on the model performance as a drop in its accuracy.
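A sketch of such a deliberately bottlenecked model (the 4-unit middle layer, far smaller than the 46-dimensional output, is the illustrative choice here):

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))      # bottleneck: only 4 hidden units
model.add(layers.Dense(46, activation='softmax'))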

In the presence of the bottleneck, the testing accuracy drops by about 10%.

Conclusion

At this point, you will have successfully classified newswires from the Reuters dataset under
their respective topics. You will also have seen how layers with an inadequate number of
hidden units can wreak havoc on your model’s performance by discarding precious data.

The effect of the information bottleneck is clearly visible as a sizeable reduction in prediction
accuracy.
