Convolutional Neural Network

Convolutional neural networks are a type of neural network that can process grid-like image data without extensive preprocessing. They use convolutional layers that apply filters to extract features from images through calculations like convolutions. As the image data passes through successive convolutional and pooling layers, the network identifies increasingly complex features, from simple edges to full objects. Training a CNN involves adjusting weights based on accuracy to classify images while avoiding overfitting to the training data.

Uploaded by

Ankit Mahapatra
Copyright
© All Rights Reserved

CONVOLUTIONAL NEURAL NETWORK

Introduction
A convolutional neural network (CNN) is a specific kind of neural
network with multiple layers. It processes data that has a grid-like
arrangement and extracts important features from it. One big
advantage of using CNNs is that you don't need to do a lot of
pre-processing on images.
A big difference between a CNN and a regular neural network
is that CNNs use convolutions to handle the math behind the
scenes. A convolution is used instead of general matrix
multiplication in at least one layer of the CNN. A convolution
takes two functions and returns a third function.
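
To make "takes two functions and returns a function" concrete, here is a minimal sketch of a discrete one-dimensional convolution in plain Python. The helper name `convolve_1d` and the sample values are illustrative assumptions, not part of the original notes.

```python
def convolve_1d(signal, kernel):
    """Full discrete convolution of two finite sequences (hypothetical helper)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            # every pair of samples contributes to output position i + j
            out[i + j] += s * k
    return out

result = convolve_1d([1, 2, 3], [0, 1, 0.5])
print(result)  # [0.0, 1.0, 2.5, 4.0, 1.5]
```

The two input sequences play the role of the two functions, and the output sequence is the function that the convolution returns.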

How Convolutional Neural Networks Work

Convolutional neural networks are based on neuroscience
findings. They are made of layers of artificial neurons called
nodes. These nodes are functions that calculate the weighted
sum of the inputs and return an activation map. This is the
convolution part of the neural network.

Each node in a layer is defined by its weight values. When you
give a layer some data, like an image, it takes the pixel values
and picks out some of the visual features.

When you're working with data in a CNN, each layer returns
activation maps. These maps point out important features in
the data set. If you gave the CNN an image, it'll point out
features based on pixel values, like colors, and give you an
activation map.

Usually with images, a CNN will initially find the edges of the
picture. Then this slight definition of the image will get passed
to the next layer. Then that layer will start detecting things
like corners and color groups. Then that image definition will
get passed to the next layer and the cycle continues until a
prediction is made.

Between layers, the activation maps are often condensed by a
step called max pooling. It only returns the most relevant
features from the layer in the activation map. This is what
gets passed to each successive layer until you get the final
layer.
The last layer of a CNN is the classification layer which
determines the predicted value based on the activation map.
If you pass a handwriting sample to a CNN, the classification
layer will tell you what letter is in the image. This is what
autonomous vehicles use to determine whether an object is
another car, a person, or some other obstacle.

Training a CNN is similar to training many other machine
learning algorithms. You'll start with some training data that is
separate from your test data and you'll tune your weights
based on how far the predicted values are from the true labels.
Just be careful that you don't overfit your model.

Convolutional neural networks are distinguished from
other neural networks by their superior performance with
image, speech, or audio signal inputs. They have three
main types of layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional
network. While convolutional layers can be followed by
additional convolutional layers or pooling layers, the
fully-connected layer is the final layer. With each layer,
the CNN increases in its complexity, identifying greater
portions of the image. Earlier layers focus on simple
features, such as colors and edges. As the image data
progresses through the layers of the CNN, it starts to
recognize larger elements or shapes of the object until it
finally identifies the intended object.
Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority
of computation occurs. It requires a few components, which are input data, a filter, and a
feature map. Let’s assume that the input will be a color image, which is made up of a
matrix of pixels in 3D. This means that the input will have three dimensions—a height,
width, and depth—which correspond to RGB in an image. We also have a feature
detector, also known as a kernel or a filter, which will move across the receptive fields
of the image, checking if the feature is present. This process is known as a convolution.

The feature detector is a two-dimensional (2-D) array of weights, which represents part
of the image. While they can vary in size, the filter size is typically a 3x3 matrix; this also
determines the size of the receptive field. The filter is then applied to an area of the
image, and a dot product is calculated between the input pixels and the filter. This dot
product is then fed into an output array. Afterwards, the filter shifts by a stride,
repeating the process until the kernel has swept across the entire image. The final
output from the series of dot products from the input and the filter is known as a
feature map, activation map, or a convolved feature.
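
The sliding dot product described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: a single-channel image stored as nested lists, no padding, and a hypothetical helper named `feature_map`. As in most deep learning libraries, the kernel is applied without flipping.

```python
def feature_map(image, kernel, stride=1):
    """Slide `kernel` over `image` and collect one dot product per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # dot product between the receptive field and the filter weights
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# illustrative 4x5 image: dark on the left, bright on the right
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# illustrative 3x3 filter that responds to vertical edges
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
fmap = feature_map(image, vertical_edge)
print(fmap)  # [[0, 3, 3], [0, 3, 3]]
```

The output is zero where the image is flat and large where the receptive field covers the dark-to-bright transition, which is the sense in which a filter "checks if the feature is present".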

Note that the weights in the feature detector remain fixed as it moves across the image,
which is also known as parameter sharing. Some parameters, like the weight values,
adjust during training through the process of backpropagation and gradient descent.
However, there are three hyperparameters which affect the volume size of the output
that need to be set before the training of the neural network begins. These include:

1. The number of filters affects the depth of the output. For example, three distinct
filters would yield three different feature maps, creating a depth of three.

2. Stride is the distance, or number of pixels, that the kernel moves over the input
matrix. While stride values of two or greater are rare, a larger stride yields a smaller
output.

3. Zero-padding is usually used when the filters do not fit the input image. This sets all
elements that fall outside of the input matrix to zero, producing a larger or equally sized
output. There are three types of padding:

• Valid padding: This is also known as no padding. In this case, the last
convolution is dropped if dimensions do not align.
• Same padding: This padding ensures that the output layer has the same size as
the input layer.
• Full padding: This type of padding increases the size of the output by adding
zeros to the border of the input.
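
How stride and padding affect the output size can be checked with the standard output-size relation, floor((W - F + 2P) / S) + 1, for input width W, filter width F, padding P, and stride S. The sketch below assumes a square 32-pixel input and a 5-pixel filter purely for illustration; `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width of a convolution along one spatial dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

valid = conv_output_size(32, 5)               # no padding: 28
same = conv_output_size(32, 5, padding=2)     # padding (F - 1) / 2: 32
full = conv_output_size(32, 5, padding=4)     # padding F - 1: 36
strided = conv_output_size(32, 5, stride=2)   # larger stride, smaller output: 14
print(valid, same, full, strided)
```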
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU)
transformation to the feature map, introducing nonlinearity to the model.
Additional convolutional layer

As we mentioned earlier, another convolution layer can follow the initial
convolution layer. When this happens, the structure of the CNN can become
hierarchical as the later layers can see the pixels within the receptive fields
of prior layers. As an example, let’s assume that we’re trying to determine if
an image contains a bicycle. You can think of the bicycle as a sum of parts. It
is comprised of a frame, handlebars, wheels, pedals, et cetera. Each
individual part of the bicycle makes up a lower-level pattern in the neural
net, and the combination of its parts represents a higher-level pattern,
creating a feature hierarchy within the CNN. Ultimately, the convolutional
layer converts the image into numerical values, allowing the neural network
to interpret and extract relevant patterns.
Pooling layer
Pooling layers, also known as downsampling layers, conduct dimensionality reduction,
reducing the number of parameters in the input. Similar to the convolutional layer, the
pooling operation sweeps a filter across the entire input, but the difference is that this
filter does not have any weights. Instead, the kernel applies an aggregation function to
the values within the receptive field, populating the output array. There are two main
types of pooling:

• Max pooling: As the filter moves across the input, it selects the pixel with the
maximum value to send to the output array. As an aside, this approach tends to
be used more often compared to average pooling.
• Average pooling: As the filter moves across the input, it calculates the average
value within the receptive field to send to the output array.
While a lot of information is lost in the pooling layer, pooling also has a number of
benefits for the CNN. It helps to reduce complexity, improve efficiency, and limit the
risk of overfitting.
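
The two pooling types can be compared with a short plain-Python sketch. It assumes a single feature map stored as nested lists and a 2x2 window with stride 2; the helper `pool_2d` and the sample values are illustrative only.

```python
def pool_2d(fmap, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D feature map (hypothetical helper)."""
    aggregate = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for top in range(0, len(fmap) - size + 1, stride):
        row = []
        for left in range(0, len(fmap[0]) - size + 1, stride):
            # values inside the current receptive field
            window = [fmap[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(aggregate(window))
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]
max_pooled = pool_2d(fmap, mode="max")
avg_pooled = pool_2d(fmap, mode="avg")
print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```

Note that the pooling window has no weights of its own: the 4x4 input shrinks to 2x2 using only the aggregation function.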

Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the
pixel values of the input image are not directly connected to the output layer in partially
connected layers. However, in the fully-connected layer, each node in the output layer
connects directly to every node in the previous layer.

This layer performs the task of classification based on the features extracted through
the previous layers and their different filters. While convolutional and pooling layers
tend to use ReLU functions, FC layers usually leverage a softmax activation function to
classify inputs appropriately, producing a probability from 0 to 1.
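
As a rough illustration of how softmax turns the raw scores of the final layer into probabilities, here is a minimal plain-Python sketch. The three scores are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability step rather than something stated in the text.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])          # hypothetical scores for three classes
predicted_class = probs.index(max(probs))  # index of the most likely class
print(probs, predicted_class)
```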

Advantages of Convolutional Neural Networks (CNNs):

Good at detecting patterns and features in images, videos, and audio signals.
Robust to small translations of the input; robustness to rotation and scaling is weaker and typically relies on data augmentation.
End-to-end training, no need for manual feature extraction.
Can handle large amounts of data and achieve high accuracy.

Disadvantages of Convolutional Neural Networks (CNNs):

Computationally expensive to train and require a lot of memory.
Can be prone to overfitting if not enough data or proper regularization is used.
Requires large amounts of labelled data.
Interpretability is limited; it’s hard to understand what the network has learned.

Pooling in Convolutional Neural Networks (CNNs)


Introduction:

Pooling, an essential operation within Convolutional Neural Networks (CNNs), plays a
pivotal role in processing and transforming feature maps. It is a crucial step in the
network architecture, serving multiple functions such as spatial dimension reduction,
achieving translation invariance, and feature selection. This detailed note provides an
in-depth understanding of pooling in CNNs, exploring its types, parameters, placement
within networks, and its overall significance in deep learning.

Purpose of Pooling

Pooling is a critical operation in Convolutional Neural Networks (CNNs) that
serves several important purposes, enhancing the network's performance and
capabilities in various ways:

Spatial Dimension Reduction:

One of the primary purposes of pooling is to reduce the spatial dimensions of
feature maps. As data passes through convolutional layers, the number of
feature maps tends to grow, which can increase the computational demands of
subsequent layers. Pooling mitigates this issue by down-sampling the feature
maps, making them smaller and more manageable.
Translation Invariance:

Pooling contributes to achieving translation invariance, a crucial property in
image recognition tasks. Translation invariance means that the network should
recognize the same pattern or object, regardless of its exact position in the
input data. Pooling achieves this by considering local regions of the feature
map, effectively capturing the most important features in those regions and
making the network less sensitive to small shifts or translations in the input
data.
Feature Selection:
Pooling also serves as a mechanism for feature selection. It retains the most
essential information from a local region of the feature map while discarding
less informative or redundant details. This feature selection aids in focusing the
network on key patterns and reducing the influence of noise or less relevant
features.
Regularization and Overfitting Control:

Pooling can act as a form of regularization in the network. By summarizing
information and reducing the spatial dimensions, pooling helps prevent
overfitting, where the model might become too specialized to the training data
and perform poorly on unseen data. It encourages the network to generalize
better to new data by simplifying its internal representations.
Computational Efficiency:

Smaller feature maps generated by pooling layers lead to reduced
computational requirements in subsequent layers. This not only speeds up
training and inference but also makes it feasible to build deeper networks
without overwhelming computational resources.

Types of Pooling:

There are two common types of pooling used in CNNs:

Max Pooling:

Max pooling is a critical operation in Convolutional Neural Networks (CNNs)
used for down-sampling and feature selection. It plays a crucial role in
preserving the most important information within local regions of the feature
map. Here, we'll delve into the details of max pooling:
Operation Description:

Max pooling involves a sliding window or filter, typically with a size of 2x2 or
3x3, moving over the input feature map. The window progresses through the
input in a systematic manner.
At each position of the window, the operation selects the maximum value
present within that region.
Feature Selection:

Max pooling's primary purpose is to select the most prominent features in a
local region. It identifies the most significant value within the window and
discards the rest.
By retaining only the maximum value, max pooling effectively highlights the
most salient information, making it suitable for tasks like object detection or
pattern recognition. It helps the network focus on the most critical features
while disregarding less important details.

Translation Invariance:

Max pooling contributes to achieving translation invariance in CNNs. It ensures
that the network recognizes patterns or features regardless of their exact
position in the input data.
By considering local regions and selecting the maximum value within each
region, max pooling makes the network less sensitive to small shifts or
translations in the input data. This property is essential for tasks like image
recognition, where objects or features may appear at different locations within
an image.
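
The "small shifts" qualifier can be seen in a toy sketch. It assumes a one-dimensional feature map and a hypothetical helper `max_pool_1d` with a window of 2 and a stride of 2; neither comes from the original text.

```python
def max_pool_1d(values, size=2):
    """Non-overlapping max pooling over a 1-D sequence (stride equals window size)."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

original = [0, 9, 0, 0, 0, 0]
shifted_by_one = [9, 0, 0, 0, 0, 0]  # feature moved one step, still in the first window
shifted_by_two = [0, 0, 0, 9, 0, 0]  # feature moved two steps, now in the second window

print(max_pool_1d(original))        # [9, 0, 0]
print(max_pool_1d(shifted_by_one))  # [9, 0, 0]
print(max_pool_1d(shifted_by_two))  # [0, 9, 0]
```

A shift that keeps the feature inside the same window leaves the pooled output unchanged, while a larger shift moves the response to a neighbouring output position. Pooling therefore reduces sensitivity to position rather than removing it.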
Size and Stride:

The size of the max pooling window (e.g., 2x2 or 3x3) and the stride (the step
size by which the window moves) are adjustable parameters.
A larger window (with a correspondingly larger stride) down-samples more
aggressively: it discards more spatial detail but tolerates larger shifts in
the input. A smaller window preserves more detail and tolerates only small
shifts, at the cost of larger feature maps in subsequent layers.
The choice of window size and stride depends on the specific problem, the
network architecture, and the trade-off between feature preservation and
computational efficiency.
Use Cases:

Max pooling is commonly employed in CNN architectures for image
classification, object detection, and recognition tasks. It is particularly useful
when the precise location of features within the input data is not critical, and
the network needs to focus on identifying the presence of specific patterns or
objects.
Regularization Effect:

In addition to its primary role in feature selection and dimensionality reduction,
max pooling can act as a form of regularization. By summarizing the most
essential features within each region, it helps prevent overfitting and
encourages the network to generalize better to new data.
In summary, max pooling is a critical operation in CNNs that selects the
maximum value within local regions of the feature map. It promotes translation
invariance, feature selection, and computational efficiency, making it an
integral part of many successful CNN architectures for various computer vision
tasks.

Average Pooling:
Average pooling is a key operation in Convolutional Neural Networks (CNNs)
that, like max pooling, serves as a method for down-sampling and feature
selection. However, it differs in its approach to summarizing information. Here's
a detailed explanation of average pooling:
1. Operation Description:

Average pooling involves a sliding window or filter, typically with a size of 2x2
or 3x3, moving over the input feature map. This window traverses the input
data in a systematic manner.
At each position of the window, the operation computes the average (mean)
value of the data within that region.
2. Smoothing and Information Averaging:

The primary purpose of average pooling is to smooth the representation of the
feature map within local regions. Unlike max pooling, which selects the
maximum value, average pooling computes the average, which results in a
more generalized view of the information in the window.
By averaging the values within a local region, average pooling reduces the
influence of outliers and noise, leading to a smoother representation of the
data.
3. Feature Selection and Reduction:

While average pooling retains some information about the local region, it
inherently reduces the dimensionality of the feature map. This down-sampling
helps in reducing computational demands in subsequent layers of the network.
4. Translation Invariance:

Similar to max pooling, average pooling also contributes to achieving
translation invariance in CNNs. It allows the network to recognize patterns or
features regardless of their exact position in the input data.
By considering local regions and computing the average value within each
region, average pooling makes the network less sensitive to small shifts or
translations in the input data.
5. Size and Stride:
As with max pooling, the size of the average pooling window (e.g., 2x2 or 3x3)
and the stride (the step size by which the window moves) are adjustable
parameters.
A larger window (with a correspondingly larger stride) down-samples more
aggressively, averaging over more of the input at the cost of spatial detail.
Smaller window sizes lead to gentler down-sampling and more information retention.
6. Use Cases:

Average pooling is commonly used in scenarios where a smoother
representation of the data is preferred, or when the precise location of features
within the input data is not critical.
It can be useful in tasks like image segmentation, where the goal is to assign a
label to each region of an image, and a more generalized view of the data can
be beneficial.
7. Regularization Effect:

Similar to max pooling, average pooling can act as a form of regularization. By
averaging information within local regions, it helps prevent overfitting and
encourages the network to generalize better to new data.
In summary, average pooling is an important operation in CNNs that computes
the average value within local regions of the feature map. It promotes
translation invariance, feature selection, dimensionality reduction, and noise
reduction, making it suitable for various computer vision tasks where a
smoother and more generalized representation is desired.

Normalization in Convolutional Neural Networks (CNNs): A Detailed Overview

Introduction:
Normalization is a critical component in Convolutional Neural Networks (CNNs)
that plays a crucial role in improving training stability, accelerating
convergence, and enhancing the model's generalization performance. This
detailed overview provides specific information about normalization
techniques in CNNs, their significance, and the types of normalization
commonly used.

Purpose of Normalization:

Normalization techniques in CNNs primarily serve the following purposes:

Stabilizing Activation Distributions:

Normalization techniques ensure that the activations at different layers of the
network are centered around zero and have a consistent scale. This helps in
stabilizing the gradient flow during training, preventing vanishing or exploding
gradients, and allowing for faster convergence.
Accelerating Training:

By promoting a more standardized distribution of activations, normalization
techniques enable networks to train faster. This is because neurons can
collectively learn more efficiently when their inputs are within a consistent
range.
Enhancing Generalization:

Normalization mitigates the risk of overfitting by preventing the network from
adapting too much to the training data. This results in models that generalize
better to unseen data.
Types of Normalization:
Several normalization techniques are commonly used in CNNs. Here are some
of the most prevalent ones:
Batch Normalization (BatchNorm):

Batch normalization operates on a mini-batch of data during training. It
normalizes the activations by calculating the mean and variance of the mini-
batch, and then scales and shifts the activations using learned parameters.
BatchNorm has become a standard in CNN architectures due to its
effectiveness in stabilizing training and accelerating convergence.
Layer Normalization (LayerNorm):

Unlike BatchNorm, LayerNorm does not use mini-batch statistics. It calculates
the mean and variance across the features of each individual example, so the
result does not depend on the other examples in the batch.
LayerNorm is particularly useful in recurrent neural networks (RNNs) and
Transformers, where batch sizes may vary significantly.
Instance Normalization (InstanceNorm):

InstanceNorm, like LayerNorm, does not depend on the other examples in the
mini-batch. It normalizes each channel of each instance (example) independently,
using statistics computed over that channel's spatial positions.
InstanceNorm is often used in style transfer and image-to-image translation
tasks.
Group Normalization (GroupNorm):

GroupNorm is a compromise between BatchNorm and LayerNorm. It divides
the channels into groups and normalizes each group separately.
GroupNorm is useful in situations where batch sizes are small, and BatchNorm
may not be as effective.

Instance-Conditional Normalization (ICN):

ICN is a conditional normalization technique that adapts the normalization
parameters based on additional information, such as class labels or other
context-specific data.
It is used in tasks like conditional image generation and style transfer.
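
The per-example variants above differ mainly in which activations share a mean and variance. The sketch below is a simplified illustration in plain Python: it handles a single example stored as a list of channels, omits the learned scale and shift, and uses hypothetical helpers `normalize` and `group_norm`. BatchNorm is not shown because its statistics are computed across the examples of a mini-batch.

```python
import math

def normalize(values, eps=1e-5):
    """Z-score a flat list of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def group_norm(sample, num_groups):
    """sample: list of channels, each a flat list of equal length.
    Assumes the channel count is divisible by num_groups."""
    per_group = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        # all channels in a group share one mean and variance
        normed = normalize([v for channel in group for v in channel])
        width = len(group[0])
        for c in range(len(group)):
            out.append(normed[c * width:(c + 1) * width])
    return out

# one example with four channels of two activations each (made-up values)
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
layer_like = group_norm(sample, num_groups=1)     # one group: LayerNorm-style statistics
grouped = group_norm(sample, num_groups=2)        # two groups of two channels
instance_like = group_norm(sample, num_groups=4)  # one channel per group: InstanceNorm-style
```

With one group the whole example shares its statistics, and with as many groups as channels each channel is normalized on its own, which is why GroupNorm is often described as sitting between the two.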
Significance of Normalization:

Normalization in CNNs is significant for several reasons:

Improved Training Dynamics: Normalization techniques ensure that gradients
during backpropagation remain in a reasonable range, reducing the risk of
vanishing or exploding gradients. This is critical for training deep networks
effectively.

Faster Convergence: By maintaining stable activation distributions, networks
trained with normalization techniques typically converge faster, reducing
training time and resource requirements.

Regularization: Normalization acts as a form of regularization, mitigating
overfitting and improving a model's generalization to unseen data.

Robustness to Hyperparameters: Normalization makes networks less sensitive
to hyperparameter choices, making it easier to train models that perform well
across various datasets and architectures.

Compatibility with Various Architectures: Different normalization techniques
are adaptable to various neural network architectures, from CNNs to RNNs and
Transformers, enhancing their overall utility in deep learning.
In summary, normalization techniques are fundamental in CNNs for stabilizing
training, accelerating convergence, and improving the generalization
performance of models. The choice of normalization method depends on the
specific task, network architecture, and dataset, and understanding these
techniques is crucial for designing effective deep learning models.

Why do we need Normalization in Convolution Layers


The data points in an image are its pixels. Pixel values typically range from 0 to
255; which is why, before feeding images into a convolutional neural network,
it is a good idea to normalize them in some way so as to put all pixels in a
manageable range.
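
As a small sketch of this preprocessing step, the snippet below rescales 8-bit pixel values to the range 0 to 1 and then standardizes them. The mean and standard deviation of 0.5 are placeholder values chosen for illustration, not statistics of any particular dataset, and both helper names are hypothetical.

```python
def scale_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]

def standardize(values, mean, std):
    """Shift and rescale values with a given mean and standard deviation."""
    return [(v - mean) / std for v in values]

row = [0, 51, 102, 255]                    # one row of made-up pixel values
scaled = scale_pixels(row)                 # [0.0, 0.2, 0.4, 1.0]
centered = standardize(scaled, 0.5, 0.5)   # roughly [-1.0, -0.6, -0.2, 1.0]
print(scaled, centered)
```

In torchvision, `transforms.ToTensor()` performs the first of these two steps when it converts an 8-bit image to a float tensor.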

Even when this is done, when training a convnet, weights (elements in its
filters) might become too large, and thereby produce feature maps with pixels
spread across a wide range. This essentially renders the normalization done
during the preprocessing step somewhat futile. Furthermore, this could
hamper the optimization process, making it slow, or in extreme cases it could
lead to a problem called unstable gradients, which could essentially prevent
the convnet from further optimizing its weights entirely.

To deal with the above difficulties, let's discuss a solution using batch
normalization, along with a practical implementation.
The Process of Batch Normalization
Batch normalization essentially sets the pixels in all feature maps in a
convolution layer to a new mean and a new standard deviation. Typically, it
starts off by z-score normalizing all pixels, and then goes on to multiply the
normalized values by an arbitrary parameter alpha (scale) before adding
another arbitrary parameter beta (offset).
These two parameters alpha and beta are learnable parameters which the
convnet will then use to ensure that pixel values in the feature maps are within
a manageable range - thereby ameliorating the problem of unstable gradients.
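
The arithmetic described above can be sketched in plain Python for the values of a single feature-map channel. This is a simplified illustration: the function name `batch_norm`, the sample values, and the small constant `eps` (added for numerical stability) are assumptions, and alpha and beta are passed in by hand here rather than learned.

```python
import math

def batch_norm(values, alpha=1.0, beta=0.0, eps=1e-5):
    """Z-score the values, then scale by alpha and offset by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [alpha * (v - mean) / math.sqrt(var + eps) + beta for v in values]

channel_values = [2.0, 4.0, 6.0, 8.0]                        # made-up activations
normalized = batch_norm(channel_values)                      # mean near 0, variance near 1
rescaled = batch_norm(channel_values, alpha=2.0, beta=1.0)   # mean near 1, wider spread
print(normalized)
print(rescaled)
```

In PyTorch, `nn.BatchNorm2d` computes these statistics per channel over the mini-batch and spatial dimensions during training, and uses running estimates of them in evaluation mode.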
In order to really assess the effects of batch normalization in convolution layers,
we need to benchmark two convnets, one without batch normalization and the
other with batch normalization. For this we will be using the LeNet-5
architecture and the MNIST dataset.
Dataset & Convolutional Neural Network Class
In this article, the MNIST dataset will be used for benchmarking purposes as
mentioned previously. This dataset consists of 28 x 28 pixel images of
handwritten digits ranging from digit 0 to 9 labelled accordingly.

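import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.datasets as Datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from tqdm import tqdm

# assumption: `device` is used below but was not defined in the listing
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# loading training data
training_set = Datasets.MNIST(root='./', download=True,
                              transform=transforms.Compose([transforms.ToTensor(),
                                                            transforms.Resize((32, 32))]))

# loading validation data
validation_set = Datasets.MNIST(root='./', download=True, train=False,
                                transform=transforms.Compose([transforms.ToTensor(),
                                                              transforms.Resize((32, 32))]))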
For training and utilization of our convnets, we shall be using the class below
aptly named 'ConvolutionalNeuralNet()'. This class contains methods which will
help to train and classify instances using the trained convnet. The train()
method also contains inner helper functions such as init_weights() and
accuracy().
class ConvolutionalNeuralNet():
    def __init__(self, network):
        self.network = network.to(device)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3)

    def train(self, loss_function, epochs, batch_size,
              training_set, validation_set):

        # creating log
        log_dict = {
            'training_loss_per_batch': [],
            'validation_loss_per_batch': [],
            'training_accuracy_per_epoch': [],
            'validation_accuracy_per_epoch': []
        }

        # defining weight initialization function
        def init_weights(module):
            if isinstance(module, nn.Conv2d):
                torch.nn.init.xavier_uniform_(module.weight)
                module.bias.data.fill_(0.01)
            elif isinstance(module, nn.Linear):
                torch.nn.init.xavier_uniform_(module.weight)
                module.bias.data.fill_(0.01)

        # defining accuracy function
        def accuracy(network, dataloader):
            network.eval()
            total_correct = 0
            total_instances = 0
            for images, labels in tqdm(dataloader):
                images, labels = images.to(device), labels.to(device)
                predictions = torch.argmax(network(images), dim=1)
                correct_predictions = (predictions == labels).sum().item()
                total_correct += correct_predictions
                total_instances += len(images)
            return round(total_correct/total_instances, 3)

        # initializing network weights
        self.network.apply(init_weights)

        # creating dataloaders
        train_loader = DataLoader(training_set, batch_size)
        val_loader = DataLoader(validation_set, batch_size)

        for epoch in range(epochs):
            print(f'Epoch {epoch+1}/{epochs}')
            train_losses = []

            # setting convnet to training mode at the start of every epoch,
            # since accuracy() and the validation step switch it to eval mode
            self.network.train()

            # training
            print('training...')
            for images, labels in tqdm(train_loader):
                # sending data to device
                images, labels = images.to(device), labels.to(device)
                # resetting gradients
                self.optimizer.zero_grad()
                # making predictions
                predictions = self.network(images)
                # computing loss
                loss = loss_function(predictions, labels)
                log_dict['training_loss_per_batch'].append(loss.item())
                train_losses.append(loss.item())
                # computing gradients
                loss.backward()
                # updating weights
                self.optimizer.step()
            with torch.no_grad():
                print('deriving training accuracy...')
                # computing training accuracy
                train_accuracy = accuracy(self.network, train_loader)
                log_dict['training_accuracy_per_epoch'].append(train_accuracy)

            # validation
            print('validating...')
            val_losses = []

            # setting convnet to evaluation mode
            self.network.eval()

            with torch.no_grad():
                for images, labels in tqdm(val_loader):
                    # sending data to device
                    images, labels = images.to(device), labels.to(device)
                    # making predictions
                    predictions = self.network(images)
                    # computing loss
                    val_loss = loss_function(predictions, labels)
                    log_dict['validation_loss_per_batch'].append(val_loss.item())
                    val_losses.append(val_loss.item())
                # computing accuracy
                print('deriving validation accuracy...')
                val_accuracy = accuracy(self.network, val_loader)
                log_dict['validation_accuracy_per_epoch'].append(val_accuracy)

            train_losses = np.array(train_losses).mean()
            val_losses = np.array(val_losses).mean()
            print(f'training_loss: {round(train_losses, 4)} training_accuracy: ' +
                  f'{train_accuracy} validation_loss: {round(val_losses, 4)} ' +
                  f'validation_accuracy: {val_accuracy}\n')

        return log_dict

    def predict(self, x):
        return self.network(x)
LeNet-5

LeNet-5 is one of the earliest convolutional neural networks specifically
designed to recognize/classify images of handwritten digits. Its architecture is
depicted in the image above and its implementation in PyTorch is provided in
the following code block.
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool1 = nn.AvgPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool2 = nn.AvgPool2d(2)
        self.linear1 = nn.Linear(5*5*16, 120)
        self.linear2 = nn.Linear(120, 84)
        self.linear3 = nn.Linear(84, 10)

    def forward(self, x):
        x = x.view(-1, 1, 32, 32)

        #----------
        # LAYER 1
        #----------
        output_1 = self.conv1(x)
        output_1 = torch.tanh(output_1)
        output_1 = self.pool1(output_1)

        #----------
        # LAYER 2
        #----------
        output_2 = self.conv2(output_1)
        output_2 = torch.tanh(output_2)
        output_2 = self.pool2(output_2)

        #----------
        # FLATTEN
        #----------
        output_2 = output_2.view(-1, 5*5*16)

        #----------
        # LAYER 3
        #----------
        output_3 = self.linear1(output_2)
        output_3 = torch.tanh(output_3)

        #----------
        # LAYER 4
        #----------
        output_4 = self.linear2(output_3)
        output_4 = torch.tanh(output_4)

        #-------------
        # OUTPUT LAYER
        #-------------
        output_5 = self.linear3(output_4)
        # note: nn.CrossEntropyLoss applies log-softmax internally, so returning
        # the raw scores (output_5) is the more common choice; softmax is kept
        # here to match the experiment reported below
        return F.softmax(output_5, dim=1)
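
The `5*5*16` used for the first linear layer follows from how each layer in the listing above changes the spatial size of a 32 x 32 input. A quick sketch of that arithmetic, using two hypothetical helpers and assuming no padding, a convolution stride of 1, and non-overlapping 2 x 2 pooling:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, window):
    """Spatial size after non-overlapping pooling."""
    return size // window

size = 32                      # input images are resized to 32 x 32
size = conv_out(size, 5)       # conv1 with a 5 x 5 kernel: 28
size = pool_out(size, 2)       # pool1: 14
size = conv_out(size, 5)       # conv2 with a 5 x 5 kernel: 10
size = pool_out(size, 2)       # pool2: 5
flattened = size * size * 16   # 16 output channels from conv2: 400
print(size, flattened)
```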
Using the LeNet-5 architecture defined above, we instantiate model_1, an
instance of the ConvolutionalNeuralNet class, with the parameters shown in the
code block. This model will serve as our baseline for benchmarking purposes.
# training model 1
model_1 = ConvolutionalNeuralNet(LeNet5())

log_dict_1 = model_1.train(nn.CrossEntropyLoss(), epochs=10, batch_size=64,
                           training_set=training_set, validation_set=validation_set)
After training for 10 epochs and visualizing the accuracies from the returned
metric log, we can see that both training and validation accuracy
increased over the course of training. In our experiment, validation accuracy
started at approximately 93% after the first epoch and increased steadily over
the next 9 epochs, ending at just over 98% by epoch 10.
sns.lineplot(y=log_dict_1['training_accuracy_per_epoch'],
             x=range(len(log_dict_1['training_accuracy_per_epoch'])), label='training')

sns.lineplot(y=log_dict_1['validation_accuracy_per_epoch'],
             x=range(len(log_dict_1['validation_accuracy_per_epoch'])), label='validation')

plt.xlabel('epoch')
plt.ylabel('accuracy')

Batch Normalized LeNet-5

Since the theme of this article is centered around batch normalization in
convolution layers, batch norm is only applied on the two convolution layers
present in this architecture as illustrated in the image above.
class LeNet5_BatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.batchnorm1 = nn.BatchNorm2d(6)
        self.pool1 = nn.AvgPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.batchnorm2 = nn.BatchNorm2d(16)
        self.pool2 = nn.AvgPool2d(2)
        self.linear1 = nn.Linear(5*5*16, 120)
        self.linear2 = nn.Linear(120, 84)
        self.linear3 = nn.Linear(84, 10)

    def forward(self, x):
        x = x.view(-1, 1, 32, 32)

        #----------
        # LAYER 1
        #----------
        output_1 = self.conv1(x)
        output_1 = torch.tanh(output_1)
        output_1 = self.batchnorm1(output_1)
        output_1 = self.pool1(output_1)

        #----------
        # LAYER 2
        #----------
        output_2 = self.conv2(output_1)
        output_2 = torch.tanh(output_2)
        output_2 = self.batchnorm2(output_2)
        output_2 = self.pool2(output_2)

        #----------
        # FLATTEN
        #----------
        output_2 = output_2.view(-1, 5*5*16)

        #----------
        # LAYER 3
        #----------
        output_3 = self.linear1(output_2)
        output_3 = torch.tanh(output_3)

        #----------
        # LAYER 4
        #----------
        output_4 = self.linear2(output_3)
        output_4 = torch.tanh(output_4)

        #-------------
        # OUTPUT LAYER
        #-------------
        output_5 = self.linear3(output_4)
        return F.softmax(output_5, dim=1)
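For intuition about what a layer such as nn.BatchNorm2d(6) does to each channel during training, here is a minimal pure-Python sketch. It treats all activations of one channel (across the batch and both spatial dimensions) as a flat list, normalizes them with the batch mean and biased variance, then applies a scale and shift. The function name batch_norm_channel and the toy values are assumptions made for illustration; the running statistics used at inference time are left out.

```python
import math

def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the activations of ONE channel, flattened over batch, height and width."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    # gamma (scale) and beta (shift) are the learnable parameters of the layer
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Toy activations for a single channel (illustrative values only)
channel = [0.2, -0.5, 0.9, 0.4, -0.1, 0.7]
normalized = batch_norm_channel(channel)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(new_mean, new_var)
```

After normalization the channel has a mean close to 0 and a variance close to 1, which is the property the batch normalized layers above rely on.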
Using the code segment below, we can instantiate model_2 with batch
normalization included and train it with the same parameters as
model_1. We then plot the resulting accuracy scores.
# training model 2
model_2 = ConvolutionalNeuralNet(LeNet5_BatchNorm())

log_dict_2 = model_2.train(nn.CrossEntropyLoss(), epochs=10, batch_size=64,
                           training_set=training_set, validation_set=validation_set)
Looking at the plot, both training and validation accuracies
increased over the course of training, as they did for the model without batch
normalization. Validation accuracy after the first epoch stood at just above
95%, roughly 2 percentage points higher than model_1 at the same point (about
93%), before increasing gradually and ending at approximately 98.5%, about 0.5
percentage points higher than model_1.
sns.lineplot(y=log_dict_2['training_accuracy_per_epoch'],
             x=range(len(log_dict_2['training_accuracy_per_epoch'])), label='training')

sns.lineplot(y=log_dict_2['validation_accuracy_per_epoch'],
             x=range(len(log_dict_2['validation_accuracy_per_epoch'])), label='validation')

plt.xlabel('epoch')
plt.ylabel('accuracy')

Comparing Models
Comparing both models, the LeNet-5 model with batch normalized convolution
layers outperformed the regular model in this experiment, ending with a
validation accuracy about 0.5 percentage points higher. It is therefore
reasonable to say that batch normalization helped performance in this instance.
Comparing training and validation losses between the regular and batch
normalized LeNet-5 models also shows that the batch normalized model attains
lower loss values faster than the regular model. This suggests that batch
normalization increases the rate at which the model optimizes its weights in
the correct direction or, in other words, the rate at which the convnet
learns.
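One simple way to make "attains lower loss values faster" concrete is to find the first epoch at which each model's loss drops below a chosen threshold. The sketch below assumes per-epoch loss lists are available; the helper name first_epoch_below, the threshold and the numbers are all made up for illustration and are not measurements from the experiment above.

```python
def first_epoch_below(losses_per_epoch, threshold):
    """Return the 1-based epoch at which the loss first drops below threshold, or None."""
    for epoch, loss in enumerate(losses_per_epoch, start=1):
        if loss < threshold:
            return epoch
    return None

# Illustrative numbers only, NOT results from the models trained above
regular_losses = [1.60, 1.54, 1.51, 1.50, 1.49]
batchnorm_losses = [1.55, 1.50, 1.49, 1.48, 1.48]

regular_epoch = first_epoch_below(regular_losses, threshold=1.505)
batchnorm_epoch = first_epoch_below(batchnorm_losses, threshold=1.505)
print(regular_epoch, batchnorm_epoch)
```

With these made-up curves the batch normalized model crosses the threshold at epoch 2 and the regular model at epoch 4; the same comparison can be run on real per-epoch losses.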

Application of CNN in Computer Vision


Convolutional Neural Networks (CNNs) have found numerous applications in
computer vision, and one of the most prominent benchmarks and datasets for
evaluating their performance is ImageNet. ImageNet is a large-scale dataset
containing millions of images across thousands of categories. Here are some of
the applications of CNNs in computer vision, particularly with reference to
ImageNet:

Image Classification: CNNs have excelled in the task of image classification,
where the goal is to assign a label or category to an image. On ImageNet,
CNNs have achieved state-of-the-art results in classifying objects, animals, and
scenes.

Object Detection: CNNs are widely used for object detection in images. By
sliding a CNN over an image, a detector can identify and localize objects of
interest. ImageNet-based pre-trained models are often used for feature
extraction in object detection pipelines.

Scene Understanding: CNNs trained on ImageNet data have been used for
scene understanding, allowing systems to recognize and interpret the content
of images, such as identifying landmarks, types of environments, and scenes
(e.g., urban, rural, indoor, outdoor).

Fine-Grained Classification: CNNs have demonstrated excellent performance in
fine-grained classification tasks, where the goal is to classify objects within a
specific category into subcategories. This is particularly useful in distinguishing
between closely related objects, such as different bird species or dog breeds.

Image Retrieval: CNN-based models can be used for content-based image
retrieval. Given a query image, these models can retrieve similar images from a
database based on visual similarity, making them valuable in search engines,
art collections, and image databases.

Art and Style Transfer: CNNs have been used for artistic style transfer, where
the style of one image or painting is applied to another image, resulting in
visually striking and artistic compositions. This application is popular in the
creative and artistic domain.

Semantic Segmentation: Semantic segmentation involves labeling each pixel in
an image with the corresponding object or category. CNNs trained on ImageNet
are often used as the backbone for more complex models that perform
semantic segmentation tasks.

Object Tracking: CNNs are applied to object tracking tasks where they can learn
to track and follow objects within a video stream, useful in surveillance,
autonomous vehicles, and robotics.

Image Captioning: Combining CNNs with recurrent neural networks (RNNs),
image captioning models generate descriptive captions for images. They are
used in applications where textual descriptions of images are needed, such as
in accessibility tools for the visually impaired.

Medical Image Analysis: CNNs have made significant contributions to medical
image analysis tasks, including disease diagnosis, tumor detection, and organ
segmentation in radiological images.

Augmented Reality (AR): CNNs are used in AR applications to recognize and
track objects or features in real-time, enabling the overlay of virtual
information onto the real world.

Robotics: CNNs are used for visual perception and object manipulation in
robotic systems. They enable robots to identify and interact with objects in
their environment.

These are just a few of the many applications of CNNs in computer vision, and
ImageNet has played a pivotal role in advancing the field by providing a
challenging dataset for training and evaluating deep learning models.
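The object detection item above mentions sliding a CNN over an image. A minimal sketch of that idea is to enumerate fixed-size windows at a regular stride; each window would then be passed to a classifier. The function name sliding_windows and the sizes used are assumptions for illustration, and the classifier call is left out.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) boxes of size window x window over the image."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield (top, left, top + window, left + window)

# A 64x64 image scanned with 32x32 windows and a stride of 16 pixels
boxes = list(sliding_windows(image_h=64, image_w=64, window=32, stride=16))
print(len(boxes), boxes[0], boxes[-1])
```

Each box identifies a crop that a CNN classifier could score; windows with a high score for a class would be reported as detections of that class at that location.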

ImageNet: A Detailed Overview

ImageNet is a large-scale image dataset and benchmark widely recognized for
its contribution to the advancement of computer vision and deep learning.
Here, we provide a comprehensive overview of ImageNet, its significance, and
its role in the development of artificial intelligence.

1. Dataset Size and Content:

• ImageNet was created by Fei-Fei Li and her colleagues at Stanford University
and Princeton University.
• The full dataset contains over 14 million images across more than 20,000
categories, making it one of the largest publicly available image datasets.
• The dataset encompasses a wide variety of objects, animals, scenes, and more,
organized into a hierarchical structure of categories.

2. The ImageNet Challenge:

• ImageNet gained significant prominence through the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC), an annual competition held
from 2010 to 2017.
• The primary task of ILSVRC was image classification, where participants were
required to develop models capable of accurately categorizing objects in
images from ImageNet into one of 1,000 predefined categories.
• The challenge provided a standardized platform for researchers to benchmark
their image classification algorithms and spurred remarkable advancements in
deep learning, particularly convolutional neural networks (CNNs).

3. Role in Deep Learning Revolution:

• ImageNet and ILSVRC played a pivotal role in the resurgence of neural
networks, especially CNNs, in the field of artificial intelligence.
• In the 2012 edition of ILSVRC, a CNN architecture called AlexNet, developed by
Krizhevsky, Sutskever, and Hinton, achieved a top-5 error of about 15.3%
against about 26.2% for the runner-up, significantly outperforming other
techniques. This achievement is considered a landmark moment in the deep
learning revolution.

4. Pre-trained Models:

• The availability of the ImageNet dataset has enabled the training of deep
learning models for various computer vision tasks.
• Researchers and practitioners use CNN models pre-trained on ImageNet data
as a starting point for a wide range of image-related tasks, such as object
detection, segmentation, style transfer, and more.
• Pre-trained models serve as powerful feature extractors, providing a
foundation for transfer learning in computer vision applications.

5. Impact on Research and Applications:

• ImageNet has had a profound impact on the development of computer vision
and AI applications. It has led to significant advancements in image
understanding, object recognition, and scene analysis.
• ImageNet's influence extends to various domains, including healthcare
(medical image analysis), autonomous vehicles (object detection and tracking),
augmented reality (scene recognition), and many other areas.

6. Ethical and Societal Considerations:

• The scale and diversity of ImageNet data have raised ethical and societal
considerations regarding privacy, data bias, and consent. These concerns have
prompted discussions about responsible data collection and usage in AI
research.

7. Future Directions:

• ImageNet has evolved over time, and efforts have been made to address the
dataset's limitations, including issues related to data quality, diversity, and bias.
• Future directions for ImageNet and similar datasets involve expanding the
scope of labeled data to include more fine-grained categories, focusing on
diverse and underrepresented subjects, and ensuring ethical data practices.

In summary, ImageNet is a monumental dataset in the field of computer vision,
and the ImageNet Challenge has been instrumental in advancing the
capabilities of deep learning models. Its role in the resurgence of neural
networks, the development of pre-trained models, and its impact on various
applications underscore its significance in the development of artificial
intelligence. However, ongoing discussions about data ethics and dataset
diversity are important considerations as the field of AI continues to progress.

Sequence Modeling

Sequence modeling is a branch of machine learning and deep learning that
deals with the analysis and modeling of sequences of data. A sequence is an
ordered collection of data points, typically indexed by time or position, where
each data point can be of any data type, such as text, numbers, or categorical
information. Sequence modeling is a fundamental concept in various domains,
including natural language processing, speech recognition, time series analysis,
and more. Here is an overview of sequence modeling:

Key Concepts in Sequence Modeling:

Sequential Data: Sequence modeling deals with data that occurs in a specific
order or sequence. Examples of sequential data include sentences in natural
language, time series data, DNA sequences, and more.

Recurrent Neural Networks (RNNs): RNNs are a class of neural networks
specifically designed for sequence modeling. They have recurrent connections
that allow them to maintain a hidden state that captures information from
previous time steps. This makes them suitable for tasks like language modeling,
text generation, and speech recognition.

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU): These are
specialized variants of RNNs designed to address the vanishing gradient
problem and improve the modeling of long-range dependencies in sequences.
LSTMs and GRUs have become popular choices for various sequence modeling
tasks.

Temporal Convolutional Networks (TCNs): TCNs are an alternative to RNNs for
sequence modeling. They use convolutional layers to capture patterns in
sequences and have been successful in tasks like language modeling, time
series forecasting, and speech recognition.

Sequence-to-Sequence (Seq2Seq) Models: Seq2Seq models, consisting of an
encoder and a decoder, are used for tasks like machine translation, text
summarization, and chatbot development. They take a sequence of data as
input and produce another sequence as output.

Applications of Sequence Modeling:

Natural Language Processing (NLP): Sequence modeling plays a crucial role in
NLP tasks, including machine translation, sentiment analysis, text generation,
and named entity recognition.

Speech Recognition: Modeling sequences of audio data is essential for speech
recognition systems, which convert spoken language into text. RNNs and
LSTM-based models are commonly used in this domain.

Time Series Analysis: Sequence modeling is central to time series forecasting,
anomaly detection, and financial prediction. It's applied in fields like finance,
meteorology, and industrial maintenance.

Genomics and Bioinformatics: In genomics, DNA and RNA sequences are
analyzed for tasks like gene prediction, sequence alignment, and disease
prediction.

Recommendation Systems: Modeling user interactions over time to make
personalized recommendations in e-commerce, content delivery, and social
media platforms.

Gesture Recognition and Video Analysis: Analyzing sequences of images or
video frames to recognize gestures, actions, or objects.

Music Generation: Modeling sequences of musical notes to generate original
compositions or harmonize existing ones.

Autonomous Systems: In self-driving cars, robotics, and drones, sequence
modeling helps in understanding and predicting the environment based on
sensor data.

Challenges in Sequence Modeling:

Vanishing Gradient: For long sequences, RNNs can suffer from the vanishing
gradient problem, making it challenging to capture long-range dependencies.

Data Length Variability: Sequences can vary in length, and handling
variable-length data efficiently is a challenge.

Model Overfitting: Complex models may overfit to the training data, leading to
poor generalization.

Memory and Computational Complexity: For large datasets and long
sequences, memory and computational requirements can become prohibitive.
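A common way to handle the variable-length challenge listed above is to pad every sequence in a batch to the length of the longest one and keep a mask that marks which positions hold real data. The sketch below shows this with plain lists; the function name pad_batch and the pad value 0 are assumptions made for illustration.

```python
def pad_batch(sequences, pad_value=0):
    """Pad sequences to a common length and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # mask is 1 where the position holds real data and 0 where it is padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[5, 3, 8], [7], [2, 9]]
padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

A model can then process the padded batch as one rectangular array and use the mask to ignore padded positions when computing the loss.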

Recent Advancements:

Recent advances in sequence modeling include attention mechanisms, in
particular the self-attention used in transformers, which have
significantly improved the performance of models in tasks like machine
translation, language understanding, and large-scale document analysis.
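To illustrate the attention idea mentioned above, here is a minimal sketch of scaled dot-product attention for a single query, written with plain Python lists. The query is compared with every key, the scores are turned into weights with a softmax, and the output is the weighted average of the values. The function names and the toy vectors are assumptions for illustration; real implementations operate on batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([1.0, 0.0], keys, values)
print(weights, output)
```

The first and third keys have the same dot product with the query, so they receive equal weight, and both receive more weight than the second key, which is orthogonal to the query.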

Sequence modeling is a diverse and rapidly evolving field with a wide range of
applications. Researchers and practitioners continue to develop innovative
techniques and architectures to address the challenges associated with
sequential data, making it a fundamental area of study in the realm of machine
learning and artificial intelligence.

1. VGGNet (Visual Geometry Group Network):

VGGNet, developed by the Visual Geometry Group at the University of Oxford,
is a deep convolutional neural network architecture known for its simplicity
and effectiveness. It was among the top entries in the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) in 2014, placing first in the
localization task and second in classification. Here's a detailed
description of VGGNet:

Architecture: VGGNet consists of a series of convolutional layers followed by
max-pooling layers. The key innovation of VGGNet is its use of very small 3x3
convolutional filters throughout the network. It uses multiple stacked
convolutional layers (up to 16 convolutional layers in the deepest version,
VGG19), which allows it to learn complex hierarchical features.

Layer Configuration: VGGNet comes in several variants with different depths.
The most common configurations are VGG16 and VGG19, which have 16 and
19 weight layers, respectively. The convolutional layers have a fixed filter size of
3x3, and max-pooling is performed with 2x2 windows and a stride of 2.

Fully Connected Layers: After the convolutional and pooling layers, VGGNet
employs three fully connected layers: two with 4096 neurons each, followed by a
final 1000-way output layer for the ImageNet classes. These fully connected
layers provide the high-level reasoning in the network.

Activation Function: VGGNet primarily uses the rectified linear unit (ReLU)
activation function, which introduces non-linearity in the network.

Advantages: VGGNet is known for its simplicity, and its modular architecture
makes it easy to understand and implement. It achieved competitive results in
image classification tasks and served as a foundation for deeper networks like
ResNet.
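One commonly cited reason for stacking small 3x3 filters is that several of them cover the same receptive field as one larger filter while using fewer weights. The sketch below checks this arithmetic; the helper names and the channel count of 64 are assumptions chosen for illustration, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) for layers with `channels` inputs and outputs."""
    return num_layers * kernel * kernel * channels * channels

channels = 64  # illustrative channel count
rf_three_3x3 = stacked_receptive_field(3)                    # same field as one 7x7 filter
weights_three_3x3 = conv_weights(3, channels, num_layers=3)  # 3 * 9 * C^2
weights_one_7x7 = conv_weights(7, channels)                  # 49 * C^2
print(rf_three_3x3, weights_three_3x3, weights_one_7x7)
```

Three stacked 3x3 layers see a 7x7 region with 27 * C^2 weights, compared with 49 * C^2 for a single 7x7 layer, and they also apply a non-linearity after each layer.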

2. LeNet:

LeNet-5, often referred to simply as LeNet, is one of the earliest
convolutional neural networks, developed by Yann LeCun, Léon Bottou, Yoshua
Bengio, and Patrick Haffner in the late 1990s. It
was designed for handwritten digit recognition and is considered a pioneering
architecture in the field of deep learning. Here's a detailed description of
LeNet:

Architecture: LeNet is a relatively shallow network compared to modern CNNs.
It consists of seven layers: two convolutional layers, two subsampling (pooling)
layers, and three fully connected layers.

Layer Configuration: The first convolutional layer uses a 5x5 kernel, followed by
a subsampling layer with 2x2 average pooling. The second convolutional layer
also uses a 5x5 kernel, followed by another 2x2 average pooling layer. The
convolutional and pooling layers are used to extract low-level features.

Fully Connected Layers: After the convolutional and pooling layers, LeNet uses
three fully connected layers. The first fully connected layer has 120 neurons,
the second has 84 neurons, and the final output layer has 10 neurons for digit
classification.

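Activation Function: LeNet uses sigmoidal activation functions in the
convolutional and fully connected layers. The original network used a scaled
hyperbolic tangent, and the PyTorch implementation earlier in this document
uses tanh.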

Advantages: LeNet was groundbreaking at the time and demonstrated the
potential of CNNs for image recognition tasks. It laid the foundation for
subsequent developments in deep learning and computer vision.
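To give a sense of how small LeNet is, the sketch below counts the trainable parameters of the layer configuration described above. The channel counts (6 and 16) and the 5*5*16 flattened size are taken from the PyTorch implementation earlier in this document; the helper names are assumptions, and the average pooling layers are counted as having no parameters, which differs slightly from the original 1998 design.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels

def linear_params(in_features, out_features):
    """Weights plus one bias per output neuron."""
    return in_features * out_features + out_features

layers = {
    "conv1": conv_params(1, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(5 * 5 * 16, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total = sum(layers.values())
print(layers, total)
```

The total comes to roughly 62 thousand parameters, most of them in the first fully connected layer.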

While VGGNet and LeNet have played significant roles in the history of deep
learning, they are considered relatively simple by today's standards. Modern
architectures like ResNet, Inception, and EfficientNet have surpassed them in
terms of performance and efficiency. However, understanding these classic
architectures is crucial for grasping the evolution of convolutional neural
networks.

RNN
Recurrent Neural Networks (RNNs) are a type of artificial neural network
designed for processing sequences of data. They are particularly well-suited for
tasks where the order and context of the data matter, such as time series
prediction, natural language processing, speech recognition, and more. Here's
a comprehensive overview of RNNs, including their architecture, training, and
applications:

Architecture:

Recurrent Neurons: The key feature of RNNs is the presence of recurrent
neurons, which maintain a hidden state or memory. This memory allows RNNs
to process sequences of data, one element at a time, while retaining
information about previous elements.

Time Unrolling: Conceptually, you can think of an RNN as a network that is
"unrolled" through time. For each time step, the network takes an input and
produces an output while updating its internal state.

Hidden State: The hidden state at each time step is a function of the input at
that step and the hidden state from the previous step. Mathematically, the
hidden state can be represented as: h(t) = f(W * h(t-1) + U * x(t)), where h(t) is
the hidden state at time t, x(t) is the input at time t, W and U are weight
matrices, and f is an activation function (commonly the hyperbolic tangent or
sigmoid function).
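The hidden-state update h(t) = f(W * h(t-1) + U * x(t)) can be sketched in a few lines of plain Python, using tanh as the activation f. The weight values, the input sequence and the helper names matvec and rnn_step are assumptions chosen for illustration; bias terms are omitted, as in the formula above.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(h_prev, x_t, W, U):
    """One recurrent update: h(t) = tanh(W * h(t-1) + U * x(t))."""
    wh = matvec(W, h_prev)
    ux = matvec(U, x_t)
    return [math.tanh(a + b) for a, b in zip(wh, ux)]

W = [[0.5, -0.2], [0.1, 0.3]]   # hidden-to-hidden weights (2 x 2)
U = [[1.0], [-1.0]]             # input-to-hidden weights (2 x 1)

h = [0.0, 0.0]                  # initial hidden state
history = []
for x_t in [[1.0], [0.5], [-0.3]]:
    h = rnn_step(h, x_t, W, U)  # the same W and U are reused at every time step
    history.append(h)
print(history)
```

The loop is the "unrolled" view of the network: the same weights are applied at every time step, and the hidden state carries information from earlier inputs forward.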

Training:

Backpropagation Through Time (BPTT): RNNs are trained using a variant of the
backpropagation algorithm called Backpropagation Through Time. This
algorithm calculates gradients with respect to the network's parameters and
updates them to minimize a defined loss function.

Vanishing and Exploding Gradients: RNNs are known to suffer from vanishing
and exploding gradient problems. Vanishing gradients occur when gradients
become very small, making it challenging to train long sequences. Exploding
gradients occur when gradients become very large, leading to numerical
instability. Techniques like gradient clipping and using specialized RNN variants
(e.g., Long Short-Term Memory or LSTM, and Gated Recurrent Unit or GRU)
help mitigate these issues.
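Gradient clipping, mentioned above as a remedy for exploding gradients, can be sketched as rescaling the gradient vector whenever its norm exceeds a chosen limit. The function name clip_by_norm and the values are assumptions for illustration; in PyTorch, torch.nn.utils.clip_grad_norm_ serves a similar purpose on a model's parameters.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)          # small gradients are left unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is scaled down to norm 1
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is already within the limit
print(large, small)
```

Clipping keeps the direction of the gradient and only limits its length, so a single very large gradient cannot destabilize the weight update.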

Variants of RNNs:

LSTM (Long Short-Term Memory): LSTMs are a type of RNN that addresses the
vanishing gradient problem by introducing specialized gating mechanisms. They
can capture long-term dependencies in data and are widely used in various
sequence-related tasks.

GRU (Gated Recurrent Unit): GRUs are another variant of RNNs with gating
mechanisms similar to LSTMs but with a simpler architecture. They are
computationally less intensive and perform well in many sequence-based tasks.

Applications:

Natural Language Processing (NLP): RNNs are extensively used in NLP tasks,
such as language modeling, text generation, machine translation, and
sentiment analysis.

Speech Recognition: RNNs are employed in speech recognition systems to
transcribe spoken language into text.

Time Series Prediction: RNNs are valuable for predicting and modeling time
series data in finance, weather forecasting, and stock market analysis.

Image Captioning: RNNs are used in image captioning systems to generate
textual descriptions for images.

Music Generation: RNNs can be used to generate music and create new
compositions based on existing musical data.

Video Analysis: RNNs are applied in video analysis tasks, such as action
recognition, object tracking, and gesture recognition.

RNNs have laid the foundation for many advanced sequence modeling
architectures and have significantly advanced the fields of machine learning
and artificial intelligence. Despite their effectiveness, RNNs still have
limitations, including difficulty in capturing very long-term dependencies and
slow training for deep networks. Researchers have since developed more
advanced architectures, such as Transformers, which have gained popularity in
various applications.

Long Short-Term Memory (LSTM)

Introduction

Long Short-Term Memory (LSTM) networks are sequential deep learning models
that allow information to persist. An LSTM is a special type of Recurrent Neural
Network designed to handle the vanishing gradient problem faced by plain RNNs.
It was introduced by Hochreiter and Schmidhuber in 1997 to overcome this
limitation of traditional RNNs. LSTMs can be implemented in Python using
libraries such as Keras.

Say that while watching a video you remember the previous scene, or while reading
a book you know what happened in the earlier chapter. RNNs work similarly: they
remember previous information and use it to process the current input. The
shortcoming of RNNs is that they cannot learn long-term dependencies well
because of the vanishing gradient. LSTMs are explicitly designed to avoid this
long-term dependency problem.

This article will cover all the basics about LSTM, including its meaning,
architecture, applications, and gates.

Learning Objectives

• Understand what LSTM is.
• Understand the architecture and working of an LSTM network.
• Learn about the different parts/gates in an LSTM unit.


What is LSTM?

LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture
widely used in Deep Learning. It excels at capturing long-term dependencies,
making it ideal for sequence prediction tasks.

Unlike traditional neural networks, LSTM incorporates feedback connections,
allowing it to process entire sequences of data, not just individual data points. This
makes it highly effective in understanding and predicting patterns in sequential data
like time series, text, and speech.

LSTM has become a powerful tool in artificial intelligence and deep learning,
enabling breakthroughs in various fields by uncovering valuable insights from
sequential data.

LSTM Architecture

In the introduction to long short-term memory, we learned that it resolves the
vanishing gradient problem faced by RNNs. In this section, we will see how
it does so by studying the architecture of the LSTM. At a high level,
an LSTM cell works very much like an RNN cell. Here is the internal functioning
of the LSTM network. The LSTM network architecture consists of three parts, as
shown in the image below, and each part performs an individual function.

The Logic Behind LSTM

The first part chooses whether the information coming from the previous time step
is to be remembered or is irrelevant and can be forgotten. In the second part, the cell
tries to learn new information from the input to this cell. Finally, in the third part, the
cell passes the updated information from the current time step to the next
time step. One such cycle of the LSTM is considered a single time step.

These three parts of an LSTM unit are known as gates. They control the flow of
information in and out of the memory cell (the LSTM cell). The first gate is
called the forget gate, the second is the input gate, and the last is the output
gate. An LSTM unit consisting of these three gates and a memory cell
can be thought of as a layer of neurons in a traditional feedforward neural network,
with each neuron having a hidden layer and a current state.

Just like a simple RNN, an LSTM has a hidden state, where H(t-1) represents
the hidden state of the previous time step and H(t) is the hidden state of the
current time step. In addition, an LSTM has a cell state, represented by C(t-1)
and C(t) for the previous and current time steps, respectively.

Here the hidden state is known as short-term memory, and the cell state is known
as long-term memory. Refer to the following image.

It is interesting to note that the cell state carries information across all
time steps.

Example of LSTM Working

Let’s take an example to understand how LSTM works. Here we have two sentences
separated by a full stop. The first sentence is “Bob is a nice person,” and the second
sentence is “Dan, on the other hand, is evil”. Clearly, in the first sentence
we are talking about Bob, and as soon as we encounter the full stop (.), we start
talking about Dan.

As we move from the first sentence to the second sentence, our network should
realize that we are no longer talking about Bob. Now our subject is Dan. Here, the
forget gate of the network allows it to forget about Bob. Let’s understand the roles
played by these gates in the LSTM architecture.

Forget Gate

In a cell of the LSTM neural network, the first step is to decide whether we should
keep the information from the previous time step or forget it. Here is the equation
for the forget gate:

$$ f_t = \sigma(X_t U_f + H_{t-1} W_f) $$

Let’s try to understand the equation. Here,

• Xt: Input to the current timestamp
• Uf: Weight matrix associated with the input
• Ht-1: Hidden state of the previous timestamp
• Wf: Weight matrix associated with the hidden state

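A sigmoid function is then applied to this sum, which makes ft a number between 0 and
1. This ft is later multiplied with the cell state of the previous timestamp, as shown
below.

$$ C_{t-1} \cdot f_t = 0 \quad \text{if } f_t = 0 \text{ (forget everything)} $$
$$ C_{t-1} \cdot f_t = C_{t-1} \quad \text{if } f_t = 1 \text{ (forget nothing)} $$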
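
As a rough illustration of the forget gate, the sketch below computes ft as a sigmoid of the weighted input plus the weighted previous hidden state, then scales the previous cell state by it. The helper names (sigmoid, matvec, forget_gate), the toy sizes, and the hand-picked weights are assumptions made for this example, and the weights are applied as matrix-times-vector, which is one common convention.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def forget_gate(x_t, h_prev, U_f, W_f):
    """f_t = sigmoid(U_f x_t + W_f h_prev), one value per cell-state entry."""
    from_input = matvec(U_f, x_t)
    from_hidden = matvec(W_f, h_prev)
    return [sigmoid(a + b) for a, b in zip(from_input, from_hidden)]

# Toy setup (assumed): 2 input features, 2 hidden units.
x_t = [1.0, 0.5]
h_prev = [0.0, 0.0]
c_prev = [2.0, -1.0]
# Hand-picked weights that push the first gate towards 1 and the second towards 0.
U_f = [[10.0, 0.0], [-10.0, 0.0]]
W_f = [[0.0, 0.0], [0.0, 0.0]]

f_t = forget_gate(x_t, h_prev, U_f, W_f)
kept = [f * c for f, c in zip(f_t, c_prev)]  # element-wise f_t * C_{t-1}

print("f_t  =", f_t)
print("kept =", kept)
```

With these weights the first entry of the cell state is kept almost unchanged, while the second is almost entirely forgotten.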

Input Gate

Let’s take another example.

“Bob knows swimming. He told me over the phone that he had served the navy for
four long years.”

So, in both these sentences, we are talking about Bob. However, they give different
kinds of information about him. In the first sentence, we learn that he
knows swimming, whereas the second sentence tells us that he used the phone and served
in the navy for four years.

Now think about it: based on the context given in the first sentence, which
information in the second sentence is critical, that he used the phone to tell us or that he
served in the navy? In this context, it doesn’t matter whether he used the phone or
any other medium of communication to pass on the information. The fact that he
was in the navy is the important information, and this is something we want our model
to remember for future computation. This is the task of the Input gate.

The input gate is used to quantify the importance of the new information carried by
the input. Here is the equation of the input gate:

$$ i_t = \sigma(X_t U_i + H_{t-1} W_i) $$

Here,

• Xt: Input at the current timestamp t
• Ui: Weight matrix associated with the input
• Ht-1: Hidden state at the previous timestamp
• Wi: Weight matrix associated with the hidden state

Again, a sigmoid function is applied, so the value of the input gate i_t
at timestamp t will be between 0 and 1.

Now, the new information that needs to be passed to the cell state is a
function of the hidden state at the previous timestamp t-1 and the input x at
timestamp t. The activation function here is tanh:

$$ N_t = \tanh(X_t U_c + H_{t-1} W_c) $$

Here Uc and Wc denote the weight matrices applied to the input and the hidden
state for the new information. Due to the tanh function,
the value of the new information will be between -1 and 1. If the value of Nt is
negative, the information is subtracted from the cell state, and if the value is
positive, the information is added to the cell state at the current timestamp.

However, Nt won’t be added directly to the cell state. Here is the
updated equation:

$$ C_t = f_t \cdot C_{t-1} + i_t \cdot N_t $$

Here, Ct-1 is the cell state at the previous timestamp, Ct is the cell state at
the current timestamp, and the others are the values we have calculated previously.
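
To make the cell-state update concrete, here is a minimal sketch of a one-unit cell in which every quantity is a single number. It assumes the usual form of the update, new cell state = ft * previous cell state + it * Nt. The function name update_cell_state, the parameter names in the dictionary (including U_c and W_c for the new-information weights), and the numeric values are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def update_cell_state(x_t, h_prev, c_prev, p):
    """Return (f_t, i_t, n_t, c_t) for a one-unit LSTM cell."""
    f_t = sigmoid(x_t * p["U_f"] + h_prev * p["W_f"])    # forget gate, in (0, 1)
    i_t = sigmoid(x_t * p["U_i"] + h_prev * p["W_i"])    # input gate, in (0, 1)
    n_t = math.tanh(x_t * p["U_c"] + h_prev * p["W_c"])  # new information, in (-1, 1)
    c_t = f_t * c_prev + i_t * n_t                       # updated cell state
    return f_t, i_t, n_t, c_t

# Hypothetical weights chosen so that the new information is negative.
params = {"U_f": 0.5, "W_f": 0.1, "U_i": 0.8, "W_i": -0.2, "U_c": -1.5, "W_c": 0.3}
f_t, i_t, n_t, c_t = update_cell_state(x_t=1.0, h_prev=0.5, c_prev=1.0, p=params)

print("forget gate     :", f_t)
print("input gate      :", i_t)
print("new information :", n_t)
print("new cell state  :", c_t)
```

Because Nt is negative here, the gated new information lowers the cell state below what the forget gate alone would have kept.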

Output Gate

Now consider these sentences.

“Bob single-handedly fought the enemy and died for his country. For his
contributions, brave ______.”

In this task, we have to complete the second sentence. The minute
we see the word brave, we know that we are talking about a person. In the
sentence, only Bob is brave; we cannot say the enemy is brave or the country
is brave. So, based on the current expectation, we have to give a relevant
word to fill in the blank. That word is our output, and this is the function of
the Output gate.

Here is the equation of the Output gate, which is very similar to those of the two
previous gates:

$$ o_t = \sigma(X_t U_o + H_{t-1} W_o) $$

Its value will also lie between 0 and 1 because of the sigmoid function. Now,
to calculate the current hidden state, we use ot and tanh of the updated
cell state, as shown below.

$$ H_t = o_t \cdot \tanh(C_t) $$

It turns out that the hidden state is a function of long-term memory (Ct) and
the current output. If you need the output of the current timestamp,
just apply the softmax activation to the hidden state Ht:

$$ \text{Output} = \text{Softmax}(H_t) $$

Here the token with the maximum score in the output is the prediction.
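
The last two steps can be sketched in a few lines: the hidden state is the output gate multiplied element-wise by tanh of the cell state, and a softmax over the hidden state gives scores from which the highest-scoring entry is picked. The gate pre-activations and cell-state values below are made-up toy numbers, and treating each hidden unit as one candidate token is a simplification for illustration only.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hidden_state(o_t, c_t):
    """H_t = o_t * tanh(C_t), computed element-wise."""
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]

# Assumed toy values: output-gate pre-activations and an already updated cell state.
gate_preactivations = [2.0, -1.0, 0.5]
c_t = [1.5, 0.3, -2.0]

o_t = [sigmoid(z) for z in gate_preactivations]  # each value lies in (0, 1)
h_t = hidden_state(o_t, c_t)                     # each value lies in (-1, 1)
probs = softmax(h_t)                             # non-negative, sums to 1
predicted = max(range(len(probs)), key=probs.__getitem__)

print("hidden state:", h_t)
print("softmax     :", probs)
print("prediction  : index", predicted)
```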

This is a more intuitive diagram of the LSTM network.


LSTM vs RNN

Aspect | LSTM (Long Short-Term Memory) | RNN (Recurrent Neural Network)
Architecture | A type of RNN with additional memory cells | A basic type of RNN
Memory Retention | Handles long-term dependencies and prevents vanishing gradient problem | Struggles with long-term dependencies and vanishing gradient problem
Cell Structure | Complex cell structure with input, output, and forget gates | Simple cell structure with only hidden state
Handling Sequences | Suitable for processing sequential data | Also designed for sequential data, but limited memory
Training Efficiency | Slower training process due to increased complexity | Faster training process due to simpler architecture
Performance on Long Sequences | Performs better on long sequences | Struggles to retain information on long sequences
Usage | Best suited for tasks requiring long-term memory, such as language translation and sentiment analysis | Appropriate for simple sequential tasks, such as time series forecasting
Vanishing Gradient Problem | Addresses the vanishing gradient problem | Prone to the vanishing gradient problem
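
The vanishing gradient entry in the comparison can be illustrated with a small numerical sketch. It assumes a scalar simple RNN of the form h_t = tanh(w_x * x_t + w_h * h_(t-1)) and uses the chain rule to track how strongly the latest hidden state still depends on the initial one. The function name gradient_through_time and the weight values are assumptions chosen for this example, not taken from the text.

```python
import math

def gradient_through_time(inputs, w_x, w_h):
    """Track d h_t / d h_0 for a scalar RNN with h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
        history.append(grad)
    return history

# 50 identical inputs; with |w_h| < 1 each factor has magnitude below 1.
grads = gradient_through_time([0.5] * 50, w_x=1.0, w_h=0.9)

print("after  1 step :", grads[0])
print("after 10 steps:", grads[9])
print("after 50 steps:", grads[-1])
```

Each extra time step multiplies the gradient by another factor smaller than 1, so the influence of early inputs shrinks rapidly. In an LSTM, the additive cell-state update gated by ft gives the gradient a path that is not repeatedly squashed in this way.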

What are Bidirectional LSTMs?

Bidirectional LSTMs (Long Short-Term Memory) are a type of recurrent
neural network (RNN) architecture that processes input data in both forward
and backward directions. In a traditional LSTM, the information flows only
from past to future, making predictions based on the preceding context.
However, in bidirectional LSTMs, the network also considers future context,
enabling it to capture dependencies in both directions.

The bidirectional LSTM comprises two LSTM layers, one processing the input
sequence in the forward direction and the other in the backward direction.
This allows the network to access information from past and future time
steps simultaneously. As a result, bidirectional LSTMs are particularly useful
for tasks that require a comprehensive understanding of the input sequence,
such as natural language processing tasks like sentiment analysis, machine
translation, and named entity recognition.

By incorporating information from both directions, bidirectional LSTMs
enhance the model’s ability to capture long-term dependencies and make
more accurate predictions in complex sequential data.

Bidirectional RNN
Introduction
A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural
network (RNN) that processes input data in both forward and backward
directions. The goal of a Bi-RNN is to capture the contextual dependencies in
the input data by processing it in both directions, which can be useful in
various natural language processing (NLP) tasks.

In a Bi-RNN, the input data is passed through two separate RNNs: one
processes the data in the forward direction, while the other processes it in the
reverse direction. The outputs of these two RNNs are then combined in some
way to produce the final output.

One common way to combine the outputs of the forward and reverse RNNs is
to concatenate them, but other methods, such as element-wise addition or
multiplication, can also be used. The choice of combination method can
depend on the specific task and the desired properties of the final output.
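
The two-pass structure and the combination step can be sketched as follows. This is a minimal illustration, assuming a scalar simple RNN cell of the form h_t = tanh(w_x * x_t + w_h * h_(t-1)) for each direction. The function names run_rnn and bidirectional_rnn, the merge option names, and the weights are hypothetical.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar simple RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns every h_t."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_rnn(inputs, fwd_weights, bwd_weights, merge="concat"):
    """Run a forward and a backward RNN and combine their states per time step."""
    forward = run_rnn(inputs, *fwd_weights)
    # Process the reversed sequence, then flip the states back so that
    # backward[t] lines up with forward[t].
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    if merge == "concat":
        return [[f, b] for f, b in zip(forward, backward)]
    if merge == "sum":
        return [f + b for f, b in zip(forward, backward)]
    if merge == "mul":
        return [f * b for f, b in zip(forward, backward)]
    raise ValueError("merge must be 'concat', 'sum', or 'mul'")

sequence = [1.0, 0.0, 0.0, -1.0]
outputs = bidirectional_rnn(sequence, (1.0, 0.5), (1.0, 0.5))
summed = bidirectional_rnn(sequence, (1.0, 0.5), (1.0, 0.5), merge="sum")

for t, (f, b) in enumerate(outputs):
    print("t =", t, "forward =", round(f, 4), "backward =", round(b, 4))
```

At the first time step the forward state has seen only the first input, while the backward state already summarises the whole rest of the sequence. Concatenation keeps both views separate, whereas addition and multiplication merge them into a single value per step.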

Need for Bi-directional RNNs
A uni-directional recurrent neural network (RNN) processes input sequences in
a single direction, either from left to right or right to left.
This means the network can only use information from earlier time steps when
making predictions at later time steps.
This can be limiting, as the network may miss contextual
information that is relevant to the output prediction.
For example, in natural language processing tasks, a uni-directional RNN may
not accurately predict a word in a sentence if the words that follow it
provide important context.
Consider an example where we use a recurrent network to predict the
masked word in a sentence.

Apple is my favorite _____.
Apple is my favorite _____, and I work there.
Apple is my favorite _____, and I am going to buy one.

In the first sentence, the answer could be fruit, company, or phone. But it
cannot be fruit in the second and third sentences.

A recurrent neural network that can only process the inputs from left to right
may not accurately predict the right answer for the sentences above.

To perform well on natural language tasks, the model must be able to process
the sequence in both directions.

Bi-directional RNNs
A bidirectional recurrent neural network (Bi-RNN) is a type of recurrent neural
network (RNN) that processes input sequences in both forward and backward
directions.
This allows it to capture information from the input sequence that is
relevant to the output prediction but would be lost in a
traditional RNN that only processes the input sequence in one direction.
The network can consider information from both the past and the future when
making predictions, rather than only the information seen up to the current
time step.
This can be useful for tasks such as language processing, where understanding
the context of a word or phrase can be important for making accurate
predictions.
In general, bidirectional RNNs can help improve a model's performance on
various sequence-based tasks.
To do this, the network has two separate RNNs:

• One that processes the input sequence from left to right
• Another that processes the input sequence from right to left

These two RNNs are typically called the forward and backward RNNs, respectively.

During the forward pass of the RNN, the forward RNN processes the input
sequence in the usual way by taking the input at each time step and using it to
update the hidden state. The updated hidden state is then used to predict the
output.

Backpropagation through time (BPTT) is a widely used algorithm for training
recurrent neural networks (RNNs). It is a variant of the backpropagation
algorithm specifically designed to handle the temporal nature of RNNs, where
the output at each time step depends on the inputs and outputs at previous
time steps.

In the case of a bidirectional RNN, BPTT involves two separate backpropagation
passes: one for the forward RNN and one for the backward RNN. During the
forward pass, the forward RNN processes the input sequence in the usual way
and makes predictions for the output sequence. These predictions are then
compared to the target output sequence, and the error is backpropagated
through the network to update the weights of the forward RNN.

The backward RNN processes the input sequence in reverse order during the
backward pass and predicts the output sequence. These predictions are then
compared to the target output sequence in reverse order, and the error is
backpropagated through the network to update the weights of the backward
RNN.

Once both passes are complete, the weights of the forward and backward
RNNs are updated based on the errors computed during the forward and
backward passes, respectively. This process is repeated for multiple iterations
until the model converges and the predictions of the bidirectional RNN are
accurate.

This allows the bidirectional RNN to consider information from past and future
time steps when making predictions, which can significantly improve the
model's accuracy.

What’s the Difference Between BRNN and Recurrent Neural Network?
Unlike standard recurrent neural networks, BRNNs are trained on the input sequence in
both the positive and negative time directions simultaneously.

Characteristic | BRNN | RNN
Definition | Bidirectional Recurrent Neural Networks. | Recurrent Neural Networks.
Purpose | Process input sequences in both forward and backward directions. | Process input sequences in a single direction.
Output | Output at each time step depends on the past and future inputs. | Output at each time step depends only on the past inputs.
Training | Trained on both forward and backward sequences. | Trained on a single sequence.
Examples | Natural language processing tasks, speech recognition. | Time series prediction, language translation.
