
UNIT-5

Deep learning architecture for Computer Vision:

Applications of Deep Learning Architectures in Computer Vision


As the demands on Computer Vision increase, statistical techniques are being replaced with Deep Learning architectures and neural network models. Computer Vision still faces various challenges that need addressing, and deep learning architectures enable it to achieve strong results on complex problems such as the following.
 Image Classification: Image Classification refers to assigning a label to the complete image or photo. It is also referred to as object classification and, more commonly, as image recognition. Deep Learning architectures enable Computer Vision systems to classify and categorize large sets of images. For instance, labeling x-rays as showing a disease or not is a binary classification, while classifying handwritten characters is a multiclass classification.
 Object Detection: Object Detection is image classification combined with localization. An image may contain multiple objects, each requiring localization and classification, which makes it a more challenging task than image classification. Deep Learning architectures simplify the analysis and can handle multiple objects in the same image.
 Object Segmentation: Object segmentation, also known as semantic segmentation, detects objects by outlining each object in the image. Image segmentation is the more general challenge of dividing an image into segments. Deep Learning architectures train models to identify objects and assign segments to them, marking the particular pixels in the image that belong to each object.

Top Deep Learning Architectures for Computer Vision:


AlexNet (2012): AlexNet is a Deep Learning architecture for Computer Vision based on the LeNet architecture. It incorporates five convolutional layers and three fully connected layers arranged in a dual-pipeline structure, which allows training to be split across two GPUs. Further, it uses rectified linear units (ReLU) instead of sigmoid or tanh activation functions, which enables AlexNet to train models with simpler and faster computations.
GoogleNet (2014): GoogleNet, or Inception V1, is also based on the LeNet architecture. It includes twenty-two layers built from small groups of convolutions known as inception modules. GoogleNet uses the inception modules to keep very large networks manageable and RMSprop to decrease the computational cost; RMSprop is an optimizer based on adaptive learning-rate techniques.
VGGNet (2014): VGGNet, or VGG-16, is a 16-layer architecture, extended to 19 layers in some models. It consists mainly of convolutional layers with a few pooling layers. VGG originates from the notion of deeper networks built with smaller filters.

AlexNet:
AlexNet contains 8 layers with weights:
5 convolutional layers
3 fully connected layers.
ReLU activation is applied at the end of each layer except the last one, which outputs a softmax distribution over the 1000 class labels. Dropout is applied in the first two fully connected layers, and max-pooling is applied after the first, second, and fifth convolutional layers. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer that reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer.
An important feature of the AlexNet is the use of ReLU(Rectified Linear Unit)
Nonlinearity.
Tanh or sigmoid activation functions used to be the usual way to train a neural
network model.
AlexNet showed that using ReLU nonlinearity, deep CNNs could be trained much
faster than using the saturating activation functions like tanh or sigmoid.
Although ReLU helps with the vanishing gradient problem, due to its unbounded
nature, the learned variables can become unnecessarily high. To prevent this,
AlexNet introduced Local Response Normalization (LRN). The idea behind LRN
is to carry out a normalization in a neighborhood of pixels amplifying the excited
neuron while dampening the surrounding neurons at the same time.
The architecture consists of eight layers: five convolutional layers and three fully-
connected layers. But this isn’t what makes AlexNet special; these are some of the
features used that are new approaches to convolutional neural networks:
 ReLU Nonlinearity: AlexNet uses Rectified Linear Units (ReLU) instead of
the tanh function, which was standard at the time. ReLU’s advantage is in
training time; a CNN using ReLU was able to reach a 25% error on the
CIFAR-10 dataset six times faster than a CNN using tanh.
 Multiple GPUs: Back in the day, GPUs were still rolling around with 3
gigabytes of memory (nowadays those kinds of memory would be rookie
numbers). This was especially bad because the training set had 1.2 million
images. AlexNet allows for multi-GPU training by putting half of the
model’s neurons on one GPU and the other half on another GPU. Not only
does this mean that a bigger model can be trained, but it also cuts down on
the training time.
 Overlapping Pooling: CNNs traditionally “pool” outputs of neighboring
groups of neurons with no overlapping. However, when the authors
introduced overlap, they saw a reduction in error by about 0.5% and found
that models with overlapping pooling generally find it harder to overfit.
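
To make this layout concrete, below is a minimal Keras sketch of an AlexNet-style network. It is a single-branch approximation: the two-GPU split and Local Response Normalization are omitted, and the 227x227 input size and exact strides are standard choices rather than details taken from this document.

from tensorflow.keras import layers, models

# AlexNet-style layout: 5 convolutional layers + 3 fully connected layers,
# ReLU activations, overlapping 3x3 max pooling with stride 2.
model = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation='relu'),
    layers.MaxPooling2D(pool_size=3, strides=2),          # overlapping pooling
    layers.Conv2D(256, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(256, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),                                   # dropout in the first two FC layers
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax'),              # 1000 ImageNet classes
])
model.summary()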

Pros of AlexNet:
1. AlexNet is considered a milestone of CNNs for image classification.
2. Many of its methods, such as the conv + pooling design, dropout, GPU-based parallel computing, and ReLU, are still the industrial standard for computer vision.
3. A unique advantage of AlexNet is the direct image input to the classification model.
4. The convolutional layers automatically extract the edges and other features of the images, and the fully connected layers learn from these features.
5. Theoretically, increasingly complex visual patterns can be effectively extracted by adding more convolutional layers.

Cons of AlexNet:
1. AlexNet is not deep compared to later models such as VGGNet, GoogLeNet, and ResNet.
2. The use of large convolution filters (5*5) was discouraged shortly afterwards.
3. Initializing the weights from a normal distribution cannot effectively solve the problem of vanishing gradients; this was later replaced by the Xavier initialization method.
4. Its performance has been surpassed by more complex models such as GoogLeNet (6.7% top-5 error) and ResNet (3.6% top-5 error).

ResNet: The Art of Skipping:


The Problem: As convolutional neural networks become deeper, the derivatives, when back-propagated to the initial layers, become almost negligible (the vanishing gradient problem).
The Solution: ResNet offers a clever remedy by introducing two types of 'shortcut connections': the Identity shortcut and the Projection shortcut.
The Architecture: ResNet comes in several variants with different layer counts, such as ResNet50 and ResNet101. Because the shortcut connections address the vanishing gradient problem, ResNet models can be made significantly deeper.
The Concept:
At the heart of ResNet lies the residual block. Instead of learning a direct mapping x → F(x), the network learns a residual function F(x) and produces the output F(x) + G(x), where the shortcut G(x) is an identity connection when the input and output dimensions are identical and a projection connection when these dimensions differ.

Projection and Identity: When the dimensions of F(x) and x differ, the Projection
shortcut is implemented. It adjusts the dimensions of input x to match that of
output F(x). On the other hand, when the dimensions of F(x) and x are the same,
the Identity shortcut is used, which makes it easier for the network to learn.
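As an illustration, here is a minimal Keras sketch of a residual block with both shortcut types; the two 3x3 convolutions, the filter counts, and the batch normalization layers are assumptions for the sketch rather than a specific published configuration.

from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Residual block: output = F(x) + shortcut(x)."""
    shortcut = x

    # F(x): two 3x3 convolutions
    y = layers.Conv2D(filters, 3, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Projection shortcut: 1x1 convolution when the dimensions of F(x) and x differ
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding='same')(x)
        shortcut = layers.BatchNormalization()(shortcut)
    # otherwise the identity shortcut passes x through unchanged

    out = layers.Add()([y, shortcut])   # F(x) + G(x)
    return layers.Activation('relu')(out)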
Imagine a network A which produces x amount of training error. Construct a network B by adding a few layers on top of A and set the parameter values in those layers so that they do nothing to the outputs from A. Call the additional layers C. This would mean the same x amount of training error for the new network, so while training network B, the training error should not be above the training error of A. In practice, however, deeper plain networks do show higher training error; the only explanation is that learning the identity mapping (doing nothing to the inputs and just copying them as they are) with the added layers C is not a trivial problem, and the solver does not achieve it. To solve this, the residual module creates a direct path between the input and the output of the module, implying an identity mapping, and the added layers C only need to learn features on top of the already available input. Since C learns only the residual, the whole module is called a residual module.
Also, similar to GoogLeNet, ResNet uses global average pooling followed by the classification layer. Through the changes described above, ResNets were trained with network depths as large as 152 layers. ResNet achieves better accuracy than VGGNet and GoogLeNet while being computationally more efficient than VGGNet; ResNet-152 achieves a 95.51% top-5 accuracy.
The architecture is similar to VGGNet, consisting mostly of 3x3 filters. Starting from a VGG-style plain network, shortcut connections as described above are inserted to form the residual network.
The power of residual networks can be judged from one of the experiments in the ResNet paper. A plain 34-layer network had higher validation error than an 18-layer plain network; this is where the degradation problem becomes visible. The same 34-layer network, when converted into a residual network, has much lower training error than the 18-layer residual network.

Transfer learning:
Transfer learning, used in machine learning, is the reuse of a pre-trained model on
a new problem. In transfer learning, a machine exploits the knowledge gained from
a previous task to improve generalization about another. For example, in training a
classifier to predict whether an image contains food, you could use the knowledge
it gained during training to recognize drinks.
With transfer learning, we basically try to exploit what has been learned in one task
to improve generalization in another. We transfer the weights that a network has
learned at “task A” to a new “task B.”
The general idea is to use the knowledge a model has learned from a task with a lot
of available labeled training data in a new task that doesn't have much data. Instead
of starting the learning process from scratch, we start with patterns learned from
solving a related task.
Transfer learning is mostly used in computer vision and in natural language processing tasks such as sentiment analysis, because of the huge amounts of data and computational power these tasks otherwise require.
Transfer learning isn't really a machine learning technique in itself, but can be seen as a "design methodology" within the field, much like active learning. It is also not an exclusive part or study area of machine learning. Nevertheless, it has become quite popular in combination with neural networks, which require huge amounts of data and computational power.

How Transfer Learning Works


In computer vision, for example, neural networks usually try to detect edges in the
earlier layers, shapes in the middle layer and some task-specific features in the
later layers. In transfer learning, the early and middle layers are used and we only
retrain the latter layers. It helps leverage the labeled data of the task it was initially
trained on.
Let’s go back to the example of a model trained for recognizing a backpack on an
image, which will be used to identify sunglasses. In the earlier layers, the model
has learned to recognize objects, because of that we will only retrain the latter
layers so it will learn what separates sunglasses from other objects.
In transfer learning, we try to transfer as much knowledge as possible from the
previous task the model was trained on to the new task at hand. This knowledge
can be in various forms depending on the problem and the data. For example, it
could be how models are composed, which allows us to more easily identify novel
objects.

Why Use Transfer Learning


Transfer learning has several benefits, but the main advantages are saving training
time, better performance of neural networks (in most cases), and not needing a lot
of data.
Usually, a lot of data is needed to train a neural network from scratch but access to
that data isn't always available — this is where transfer learning comes in handy.
With transfer learning a solid machine learning model can be built with
comparatively little training data because the model is already pre-trained. This is
especially valuable in natural language processing because mostly expert
knowledge is required to create large labeled data sets. Additionally, training time
is reduced because it can sometimes take days or even weeks to train a deep neural
network from scratch on a complex task.

When to Use Transfer Learning


As is always the case in machine learning, it is hard to form rules that are generally
applicable, but here are some guidelines on when transfer learning might be used:
 There isn’t enough labeled training data to train your network from scratch.
 There already exists a network that is pre-trained on a similar task, which is
usually trained on massive amounts of data.
 When task 1 and task 2 have the same input.
If the original model was trained using an open-source library like TensorFlow,
you can simply restore it and retrain some layers for your task. Keep in mind,
however, that transfer learning only works if the features learned from the first task
are general, meaning they can be useful for another related task as well. Also, the
input of the model needs to have the same size as it was initially trained with. If
you don’t have that, add a pre-processing step to resize your input to the needed
size.
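A minimal Keras sketch of this workflow, assuming an ImageNet-pretrained MobileNetV2 base and a hypothetical new task with 5 classes and a small labeled dataset:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Load a network pre-trained on a large dataset (ImageNet), without its classifier head
base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False          # freeze the early and middle layers

# Add and retrain only the task-specific layers for the new problem (5 classes here)
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(new_task_images, new_task_labels, epochs=5)   # placeholder names for the small new dataset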
Approaches to Transfer Learning
1. TRAINING A MODEL TO REUSE IT
Imagine you want to solve task A but don’t have enough data to train a deep neural
network. One way around this is to find a related task B with an abundance of data.
Train the deep neural network on task B and use the model as a starting point for
solving task A. Whether you'll need to use the whole model or only a few layers
depends heavily on the problem you're trying to solve.
If you have the same input in both tasks, possibly reusing the model and making
predictions for your new input is an option. Alternatively, changing and retraining
different task-specific layers and the output layer is a method to explore.
2. USING A PRE-TRAINED MODEL
The second approach is to use an already pre-trained model. There are a lot of
these models out there, so make sure to do a little research. How many layers to
reuse and how many to retrain depends on the problem.
Keras, for example, provides numerous pre-trained models in its applications module that can be used for transfer learning, prediction, feature extraction and fine-tuning, along with brief documentation on how to use them. There are also many research institutions that release trained models.
This type of transfer learning is most commonly used throughout deep learning.
3. FEATURE EXTRACTION
Another approach is to use deep learning to discover the best representation of
your problem, which means finding the most important features. This approach is
also known as representation learning, and can often result in a much better
performance than can be obtained with hand-designed representation.

In machine learning, features are usually manually hand-crafted by researchers and domain experts. Fortunately, deep learning can extract features automatically. Of course, this doesn't mean feature engineering and domain knowledge aren't
important anymore — you still have to decide which features you put into your
network. That said, neural networks have the ability to learn which features are
really important and which ones aren’t. A representation learning algorithm can
discover a good combination of features within a very short timeframe, even for
complex tasks which would otherwise require a lot of human effort.
The learned representation can then be used for other problems as well. Simply use
the first layers to spot the right representation of features, but don’t use the output
of the network because it is too task-specific. Instead, feed data into your network
and use one of the intermediate layers as the output layer. This layer can then be
interpreted as a representation of the raw data.
This approach is mostly used in computer vision because it can reduce the size of
your dataset, which decreases computation time and makes it more suitable for
traditional algorithms, as well.
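As a short sketch of this idea, assuming a VGG16 backbone, one of its intermediate layers (here 'block5_pool', VGG16's last pooling layer) can be used as the output to obtain such a representation:

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

# Use an intermediate layer as the output instead of the task-specific head
base = VGG16(weights='imagenet')
feature_extractor = Model(inputs=base.input, outputs=base.get_layer('block5_pool').output)

# Feed data into the network and treat the intermediate activations
# as a learned representation of the raw images
images = np.random.rand(8, 224, 224, 3).astype('float32')   # placeholder batch
features = feature_extractor.predict(preprocess_input(images * 255.0))
print(features.shape)   # (8, 7, 7, 512)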

Siamese Network:
A siamese neural network (SNN) is a class of neural network architectures that
contain two or more identical sub-networks. “Identical” here means they have the
same configuration with the same parameters and weights. Parameter updating is
mirrored across both sub-networks and it’s used to find similarities between inputs
by comparing feature vectors.
These networks are used in many applications.
Traditionally, a neural network learns to predict multiple classes. This poses a problem when we need to add or remove classes from the data: in that case, we have to update the neural network and retrain it on the whole data set. Deep neural networks also need a large volume of data to train on. SNNs, on the other hand, learn a similarity function, so they can be used to compare inputs without being retrained whenever the set of classes changes.

SIAMESE NETWORK PROS


 More Robust to Class Imbalance: Giving a few images per class is sufficient
for siamese networks to recognize those images in the future with the aid of
one-shot learning.
 Nice to Pair With the Best Classifier: Given that an SNN’s learning
mechanism is somewhat different from classification models, simply
averaging it with a classifier can do much better than averaging two
correlated supervised models (e.g. GBM & RF classifiers).
 Learning from Semantic Similarity: An SNN focuses on learning embeddings (in its deeper layers) that place the same classes/concepts close together. Hence, it can learn semantic similarity.

SIAMESE NETWORK CONS


 Needs More Training Time Than Normal Networks: Since SNNs involve learning from pairs, whose number grows quadratically with the data (in order to see all the information available), they are slower to train than the normal (pointwise) classification type of learning.
 Don't Output Probabilities: Since training involves pairwise learning, SNNs don't output prediction probabilities, only a distance from each class.

Loss Functions Used in Siamese Networks:


Since training SNNs involves pairwise learning, cross-entropy loss cannot be used. There are two loss functions we typically use to train siamese networks.
TRIPLET LOSS: Triplet loss is a loss function in which we compare a baseline (anchor) input to a positive (truthy) input and a negative (falsy) input. The distance from the anchor input to the positive input is minimized, and the distance from the anchor input to the negative input is maximized. The loss can be written as

L = max( ||Fa - Fp||^2 - ||Fa - Fn||^2 + alpha, 0 )

In the above equation, alpha is a margin term used to stretch the distance between similar and dissimilar pairs in the triplet, and Fa, Fp, Fn are the feature embeddings of the anchor, positive, and negative images.
During the training process, we feed an image triplet (anchor image, negative image, positive image) into the model as a single sample. The distance between the anchor and positive images should be smaller than the distance between the anchor and negative images.
CONTRASTIVE LOSS: Contrastive loss is an increasingly popular loss function.
It’s a distance-based loss as opposed to more conventional error-prediction loss.
This loss function is used to learn embeddings in which two similar points have a
low Euclidean distance and two dissimilar points have a large Euclidean distance.

We define Dw, the Euclidean distance between the two embeddings, as

Dw = || Gw(X1) - Gw(X2) ||

where Gw(X) is the output of our network for one input image X. For a similar pair the loss penalizes a large Dw; for a dissimilar pair it penalizes a Dw smaller than a margin m.
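
A minimal TensorFlow sketch of the distance and the contrastive loss, assuming the convention that a label of 1 marks a similar pair and 0 a dissimilar pair:

import tensorflow as tf

def euclidean_distance(emb1, emb2):
    # Dw: Euclidean distance between the two embeddings Gw(X1) and Gw(X2)
    return tf.sqrt(tf.reduce_sum(tf.square(emb1 - emb2), axis=1) + 1e-9)

def contrastive_loss(y_true, distance, margin=1.0):
    # y_true = 1 for similar pairs, 0 for dissimilar pairs (convention assumed here)
    y_true = tf.cast(y_true, distance.dtype)
    similar_term = y_true * tf.square(distance)                                       # pull similar pairs together
    dissimilar_term = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))  # push dissimilar pairs apart
    return tf.reduce_mean(similar_term + dissimilar_term)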

Metric learning:
Many approaches in machine learning require a measure of distance between data
points. Traditionally, practitioners would choose a standard distance metric
(Euclidean, City-Block, Cosine, etc.) using a priori knowledge of the domain.
However, it is often difficult to design metrics that are well-suited to the particular
data and task of interest.
Distance metric learning (or simply, metric learning) aims at automatically
constructing task-specific distance metrics from (weakly) supervised data, in a
machine learning manner. The learned distance metric can then be used to perform
various tasks (e.g., k-NN classification, clustering, information retrieval).
The goal of Metric Learning is to learn a representation function that maps objects
into an embedded space. The distance in the embedded space should preserve the
objects’ similarity — similar objects get close and dissimilar objects get far away.
Various loss functions have been developed for Metric Learning. For example, the
contrastive loss guides the objects from the same class to be mapped to the same
point and those from different classes to be mapped to different points whose
distances are larger than a margin. Triplet loss is also popular, which requires the
distance between the anchor sample and the positive sample to be smaller than the
distance between the anchor sample and the negative sample.
Metric learning problems fall into two main categories depending on the type of
supervision available about the training data:
 Supervised learning: the algorithm has access to a set of data points, each of
them belonging to a class (label) as in a standard classification problem.
Broadly speaking, the goal in this setting is to learn a distance metric that
puts points with the same label close together while pushing away points
with different labels.
 Weakly supervised learning: the algorithm has access to a set of data points
with supervision only at the tuple level (typically pairs, triplets, or
quadruplets of data points). A classic example of such weaker supervision is
a set of positive and negative pairs: in this case, the goal is to learn a
distance metric that puts positive pairs close together and negative pairs far
away.
Based on the above (weakly) supervised data, the metric learning problem is
generally formulated as an optimization problem where one seeks to find the
parameters of a distance function that optimize some objective function measuring
the agreement with the training data.

Use-cases:
There are many use-cases for metric learning. We list here a few popular examples
(for code illustrating some of these use-cases, see the examples section of the
documentation):
1. Nearest neighbors models: the learned metric can be used to improve nearest
neighbors learning models for classification, regression, anomaly
detection…
2. Clustering: metric learning provides a way to bias the clusters found by
algorithms like K-Means towards the intended semantics.
3. Information retrieval: the learned metric can be used to retrieve the elements
of a database that are semantically closest to a query element.
4. Dimensionality reduction: metric learning may be seen as a way to reduce
the data dimension in a (weakly) supervised setting.
More generally, the learned transformation can be used to project the data into a
new embedding space before feeding it into another machine learning algorithm.
Approaches to metric learning:
In practice, there are two main approaches to metric learning and two corresponding types of neural network architectures. The first is the interaction-based approach, which first builds local interactions (i.e., local matching signals) between two objects; deep neural networks then learn hierarchical interaction patterns for matching. Examples of such architectures include MV-LSTM, ARC-II, and MatchPyramid.
The second is the representation-based approach. In this case the distance function is composed of two components: the Encoder, which transforms an object into an embedded representation (usually a large floating-point vector), and the Comparator, which takes the embeddings of a pair of objects from the Encoder and calculates their similarity. The most well-known example of this kind of embedding representation is Word2Vec. A sketch of this setup is given below.
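This sketch of the Encoder/Comparator split in Keras assumes small grayscale inputs, an arbitrary 64-dimensional embedding, and cosine similarity as the Comparator; all of these choices are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

# Encoder: maps an object (here a 28x28 grayscale image) to an embedding vector
encoder = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64),                      # embedding dimension is an arbitrary choice
])

# Comparator: takes two embeddings and returns their cosine similarity
def comparator(emb_a, emb_b):
    emb_a = tf.math.l2_normalize(emb_a, axis=1)
    emb_b = tf.math.l2_normalize(emb_b, axis=1)
    return tf.reduce_sum(emb_a * emb_b, axis=1)

# Usage: similarity = comparator(encoder(batch_a), encoder(batch_b))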

Ranking/triplet loss:
Triplet Loss: Triplet loss is a distance-based loss function that operates on three inputs:
1. anchor (a): any datapoint chosen by the user.
2. positive (p): any datapoint belonging to the same class as the anchor.
3. negative (n): any datapoint not belonging to the same class as the anchor.
The basic idea behind triplet loss is to learn an embedding space in which the positive embedding is nearer to the anchor embedding while the negative embedding is at a distance greater than a margin m from the positive embedding. We can formulate this understanding of triplet loss as the following equation:

L = max( d(ra, rp) - d(ra, rn) + m, 0 )

d(ra, rp) - distance between the anchor image and the positive image

d(ra, rn) - distance between the anchor image and the negative image

Negative Selection
From the above loss equation we have three possible categories of triplets:
1. easy triplets: triplets with a loss of 0, i.e. d(ra,rn) > d(ra,rp) + m.
2. semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss, i.e. d(ra,rp) < d(ra,rn) < d(ra,rp) + m.
3. hard triplets: triplets where the negative is closer to the anchor than the positive, i.e. d(ra,rn) < d(ra,rp).
Each of these definitions depends on where the negative is relative to the anchor and positive. We can therefore extend these three categories to the negatives themselves: hard negatives, semi-hard negatives, or easy negatives.
Choosing which kind of triplet to train on greatly impacts the error. For example, if we train on easy triplets then the overall loss is already zero, so there will be no parameter update and our model will not learn anything. It is therefore advisable to train the model on semi-hard or hard triplets for better accuracy and lower error rates.
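The loss and the three triplet categories can be written down in a few lines; the sketch below assumes that embeddings r_a, r_p, r_n have already been computed by the network and uses squared Euclidean distances with an arbitrary margin of 0.2.

import tensorflow as tf

def triplet_loss(r_a, r_p, r_n, margin=0.2):
    # d(ra, rp) and d(ra, rn): squared Euclidean distances between embeddings
    d_ap = tf.reduce_sum(tf.square(r_a - r_p), axis=1)
    d_an = tf.reduce_sum(tf.square(r_a - r_n), axis=1)
    # Loss is zero for easy triplets (d_an > d_ap + margin), positive otherwise
    return tf.reduce_mean(tf.maximum(d_ap - d_an + margin, 0.0))

def triplet_category(d_ap, d_an, margin=0.2):
    if d_an > d_ap + margin:
        return 'easy'       # loss already zero, no learning signal
    elif d_an > d_ap:
        return 'semi-hard'  # negative farther than positive, but within the margin
    else:
        return 'hard'       # negative closer to the anchor than the positive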

Ranking Loss Functions: Metric Learning:

Unlike other loss functions, such as Cross-Entropy Loss or Mean Square Error Loss, whose objective is to learn to predict a label, a value, or a set of values directly from an input, the objective of Ranking Losses is to predict relative distances between inputs. This task is often called metric learning.

Ranking Loss functions are very flexible in terms of training data: we just need a similarity score between data points to use them. That score can be binary (similar / dissimilar). As an example, imagine a face verification dataset, where we know which face images belong to the same person (similar) and which do not (dissimilar). Using a Ranking Loss function, we can train a CNN to infer whether two face images belong to the same person or not.

To use a Ranking Loss function we first extract features from two (or three) input data points and get an embedded representation for each of them. Then, we define a metric function to measure the similarity between those representations, for instance the Euclidean distance. Finally, we train the feature extractors to produce similar representations for both inputs when the inputs are similar, or distant representations when they are dissimilar.

We don’t even care about the values of the representations, only about the distances
between them. However, this training methodology has demonstrated to produce
powerful representations for different tasks.

Other names used for Ranking Losses: Ranking Losses are essentially the ones explained above, and are used in many different applications with the same formulation or minor variations. However, different names are used for them, which can be confusing. Here is why those names are used.

 Ranking loss: This name comes from the information retrieval field, where we want to train models to rank items in a specific order.
 Margin Loss: This name comes from the fact that these losses use a margin to compare the distances between sample representations.
 Contrastive Loss: Contrastive refers to the fact that these losses are computed by contrasting the representations of two or more data points. This name is often used for Pairwise Ranking Loss, but I have never seen it used in a setup with triplets.
 Triplet Loss: Often used as the loss name when triplets are employed for training.
 Triplet Loss: Often used as loss name when triplet training pairs are
employed.
 Hinge loss: Also known as max-margin objective. It’s used for training
SVMs for classification. It has a similar formulation in the sense that it
optimizes until a margin. That’s why this name is sometimes used for
Ranking Losses.

RCNN:

Region-based Convolutional Neural Network (R-CNN) is a type of deep learning architecture used for object detection in computer vision tasks. R-CNN was one of the pioneering models that helped advance the object detection field by combining the power of convolutional neural networks with region-based approaches.
R-CNN, or Regions with CNN Features, is an object detection model that applies high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses selective search to identify a number of bounding-box object region candidates ("regions of interest") and then extracts features from each region independently for classification.
R-CNN is a two-stage object detection pipeline that first generates a set of ROIs using a method such as selective search or edge boxes and then classifies the objects within these ROIs using a CNN.

The R-CNN pipeline can be divided into three main steps:


1. Region proposal: A method, such as selective search or edge boxes,
generates a set of ROIs within the image. The bounding boxes around the
objects of interest typically define these ROIs.
2. Feature extraction: A CNN is used to extract features from each ROI. These
features are then used to represent the ROI in a compact and informative
manner.
3. Classification: The extracted features are fed into a classifier, such as a
support vector machine (SVM), to predict the object’s class within the ROI.
One of the main advantages of R-CNN is that it can handle many object classes, as
the classifier is trained separately for each class. However, a significant drawback
of R-CNN is that it is computationally expensive, requiring the CNN to be run on
each ROI individually.

Working of RCNN:
Region Proposal: R-CNN starts by dividing the input image into multiple regions
or subregions. These regions are referred to as "region proposals" or "region
candidates." The region proposal step is responsible for generating a set of
potential regions in the image that are likely to contain objects. R-CNN does not
generate these proposals itself; instead, it relies on external methods like Selective
Search or EdgeBoxes to generate region proposals.
Selective Search, for example, operates by merging or splitting segments of the
image based on various image cues like color, texture, and shape to create a
diverse set of region proposals.
Feature Extraction: Once the region proposals are generated, approximately 2,000 regions are extracted, anisotropically warped to the consistent input size that the CNN expects (e.g., 224x224 pixels), and then passed through the CNN to extract features.
Before warping, each region is expanded so that there are 16 pixels of context around it in the warped frame. The CNN used is AlexNet, typically pre-trained on a large dataset like ImageNet for a generic feature representation.
The output of the CNN is a high-dimensional feature vector representing the content of the region proposal.
Object Classification: The extracted feature vectors from the region proposals are
fed into a separate machine learning classifier for each object class of interest. R-
CNN typically uses Support Vector Machines (SVMs) for classification. For each
class, a unique SVM is trained to determine whether or not the region proposal
contains an instance of that class.
During training, positive samples are regions that contain an instance of the class.
Negative samples are regions that do not.
Bounding Box Regression: In addition to classifying objects, R-CNN also
performs bounding box regression. For each class, a separate regression model is
trained to refine the location and size of the bounding box around the detected
object. The bounding box regression helps improve the accuracy of object
localization by adjusting the initially proposed bounding box to better fit the
object's actual boundaries.
Non-Maximum Suppression (NMS): After classifying and regressing bounding
boxes for each region proposal, R-CNN applies non-maximum suppression to
eliminate duplicate or highly overlapping bounding boxes. NMS ensures that only
the most confident and non-overlapping bounding boxes are retained as final object
detections.
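Non-maximum suppression itself is a short, generic routine. The sketch below is a standard NumPy implementation (not code from the R-CNN release) that keeps the highest-scoring box and discards boxes whose IoU with it exceeds a threshold.

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                      # keep the most confident box
        # Intersection of box i with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes that overlap box i too much; keep the rest for the next round
        order = order[1:][iou <= iou_threshold]
    return keep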

Strengths of R-CNN: Below are a few of the key strengths of the R-CNN architecture.

1. Accurate Object Detection: R-CNN provides accurate object detection by leveraging region-based convolutional features. It excels in scenarios where precise object localization and recognition are crucial.
2. Robustness to Object Variations: R-CNN models can handle objects with
different sizes, orientations, and scales, making them suitable for real-world
scenarios with diverse objects and complex backgrounds.
3. Flexibility: R-CNN is a versatile framework that can be adapted to various
object detection tasks, including instance segmentation and object tracking.
By modifying the final layers of the network, you can tailor R-CNN to suit
your specific needs.

Disadvantages of R-CNN: Below are a few disadvantages of the R-CNN architecture.

1. Computational Complexity: R-CNN is computationally intensive. It involves extracting region proposals, applying a CNN to each proposal, and then running the extracted features through a classifier. This multi-stage process can be slow and resource-demanding.
2. Slow Inference: Due to its sequential processing of region proposals, R-
CNN is relatively slow during inference. Real-time applications may find
this latency unacceptable.
3. Overlapping Region Proposals: R-CNN may generate multiple region
proposals that overlap significantly, leading to redundant computation and
potentially affecting detection performance.
4. R-CNN is Not End-to-End: Unlike more modern object detection
architectures like Faster R-CNN, R-CNN is not an end-to-end model. It
involves separate modules for region proposal and classification, which can
lead to suboptimal performance compared to models that optimize both tasks
jointly.

RCNN WITH KERAS:


Implementing an R-CNN (Region-based Convolutional Neural Network) in Keras
involves several steps, as R-CNN is a multi-stage object detection framework.
Here's a simplified overview of how you can implement it:
1. Preprocessing: Preprocess the input images and annotations. This may
involve resizing images, extracting region proposals, and generating ground-
truth bounding box annotations.
2. Region Proposal Generation: Use a selective search algorithm or another
region proposal method to generate region proposals within the image.
These proposals represent candidate object regions.
3. Feature Extraction: Extract features from each region proposal using a pre-
trained CNN. Common choices include networks like VGG, ResNet, or
MobileNet. This step involves passing each region proposal through the
CNN and extracting features from a specific layer.
4. Region-based CNN: Implement a region-based CNN (e.g., Fast R-CNN or
Faster R-CNN) to perform classification and bounding box regression. This
network takes the extracted features from the previous step and predicts the
class label and bounding box coordinates for each region proposal.
5. Training: Train the entire network end-to-end using a combination of
classification and bounding box regression losses. This involves optimizing
the parameters of both the feature extraction CNN and the region-based
CNN.
Here's a basic example of how you can implement a simple version of R-CNN
using Keras:
from keras.models import Model
from keras.layers import Dense, Flatten
from keras.applications import VGG16
from keras.optimizers import Adam

num_classes = 20  # number of object classes (placeholder)

# Step 1: Preprocess the input images and annotations

# Step 2: Generate region proposals (e.g., with selective search) and warp each
# proposal to the 224x224 input size expected by the CNN

# Step 3: Extract features using a pre-trained CNN (VGG16 in this case)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
region_features = Flatten()(base_model.output)

# Step 4: Region-based heads for classification and bounding box regression
fc1 = Dense(1024, activation='relu')(region_features)
fc2 = Dense(1024, activation='relu')(fc1)
output_cls = Dense(num_classes, activation='softmax', name='cls')(fc2)
output_reg = Dense(4, activation='linear', name='reg')(fc2)  # 4 bounding box coordinates (x, y, w, h)

# Combine into a single model with two outputs
model = Model(inputs=base_model.input, outputs=[output_cls, output_reg])

# Step 5: Compile and train the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=['categorical_crossentropy', 'mse'],  # classification loss and regression loss
              metrics=['accuracy'])

# Train the model (X_train holds the warped region crops, y_cls_train the one-hot
# class labels, and y_reg_train the bounding-box targets)
model.fit(X_train, [y_cls_train, y_reg_train], epochs=10, batch_size=32)
This is a simplified example, and a full implementation of R-CNN involves
additional complexities such as non-maximum suppression, anchor generation, and
handling data imbalance.

CNN-RNN:
CNN-RNN refers to a neural network architecture that combines Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
CNNs are particularly effective in processing spatial data, such as images, due to
their ability to learn hierarchical representations. They use convolutional layers to
extract features from the input data.
RNNs, on the other hand, are well-suited for sequential data, like time series or
text, because they have memory to capture temporal dependencies. They work by
recursively applying the same operation to each element of a sequence, updating
their internal state with each step.
In a CNN-RNN architecture, the CNN is typically used for feature extraction from
the input data (such as images), and the RNN is then employed to process these
features sequentially. This combination allows the model to leverage both spatial
and temporal information, making it useful for tasks such as video analysis, image
captioning, and sequence generation.
For instance, in image captioning, the CNN can extract visual features from the
input image, and the RNN can generate a corresponding caption word by word,
taking into account the contextual information provided by the image features.
Overall, CNN-RNN architectures have proven to be effective in various machine
learning tasks where both spatial and temporal information are crucial for
understanding the data.

CNN-RNN ARCHITECTURE: The CNN-RNN architecture typically involves several key components, each serving a specific purpose in processing the input data and generating output. Here's a high-level overview of the architecture:
1. Convolutional Neural Network (CNN) Layers:
 The CNN component is responsible for extracting features from the input
data, which is often spatial data like images.
 Convolutional layers with filters (kernels) are used to detect patterns and
features within the input data.
 Pooling layers (e.g., max pooling) are employed to downsample the feature
maps and reduce the spatial dimensions, while retaining the most important
information.
2. Recurrent Neural Network (RNN) Layers:
 The RNN component processes the features extracted by the CNN in a
sequential manner, capturing temporal dependencies.
 Recurrent layers, such as Long Short-Term Memory (LSTM) or Gated
Recurrent Unit (GRU), are commonly used due to their ability to maintain
memory across time steps and handle long-range dependencies.
3. Connection Between CNN and RNN:
 The output of the last convolutional layer in the CNN is typically flattened
or reshaped into a suitable format for input to the RNN.
 This transformed feature representation from the CNN serves as the initial
input to the RNN, which then operates sequentially on this representation.
4. Sequence Modeling:
 Within the RNN component, the sequence of feature representations from
the CNN is processed step by step.
 At each time step, the RNN receives the current feature representation as
input, updates its hidden state based on both the current input and the
previous hidden state, and generates an output.
 This process is repeated for each time step until the entire sequence has been
processed.
5. Output Layer:
 The final output layer of the model may vary depending on the specific task.
 For tasks like classification, the output layer may consist of one or more
fully connected layers followed by a softmax activation function for
predicting class probabilities.
 For sequence generation tasks, the output layer may involve generating a
sequence of tokens (e.g., words) one step at a time using techniques like
softmax sampling or beam search.
Overall, the CNN-RNN architecture leverages the strengths of both CNNs and RNNs to process spatial and temporal information, respectively, making it well-suited for tasks that require understanding both aspects of the input data. A minimal sketch of this pattern is shown below.
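This Keras sketch illustrates the pattern for video classification, assuming clips of 16 frames at 112x112 resolution and a placeholder count of 10 activity classes; a small CNN is applied to every frame via TimeDistributed and an LSTM models the temporal sequence.

from tensorflow.keras import layers, models

# CNN applied independently to each frame (spatial feature extraction)
frame_cnn = models.Sequential([
    layers.Input(shape=(112, 112, 3)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
])

# RNN over the sequence of per-frame features (temporal modeling)
model = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),       # 16 frames per clip
    layers.TimeDistributed(frame_cnn),           # CNN features for each frame
    layers.LSTM(128),                            # captures temporal dependencies
    layers.Dense(10, activation='softmax'),      # 10 activity classes (placeholder)
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])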
APPLICATION IN CAPTIONING AND VIDEO TASKS:
CNN-RNN architectures find extensive application in captioning and video tasks
due to their ability to process both spatial and temporal information. Here are some
specific applications in these domains:
1. Video Captioning:
 CNN-RNN architectures can generate textual descriptions (captions) for
videos by processing the spatial features extracted from individual frames
using CNNs and modeling the temporal dependencies using RNNs.
 The CNN extracts visual features from each frame, which are then fed into
the RNN sequentially to generate captions.
 By considering both visual information from the frames and temporal
context from the sequence, CNN-RNN models can produce coherent and
descriptive captions for videos.
2. Activity Recognition:
 In video activity recognition, CNN-RNN architectures are used to recognize
and classify actions or activities performed in videos.
 CNNs are employed to extract spatial features from each frame, capturing
visual information about the actions.
 The extracted features are then passed through RNNs to model temporal
dependencies across frames, enabling the network to recognize complex
activities that unfold over time.
3. Video Question Answering:
 CNN-RNN architectures can be used in video question answering tasks
where the model is required to answer questions about the content of a
video.
 The CNN extracts visual features from video frames, while the RNN
processes the question embeddings.
 The combined features from both modalities are used to predict the answer
to the question, with the RNN incorporating temporal context to understand
the progression of events in the video.
4. Temporal Localization of Actions:
 CNN-RNN architectures are employed for localizing actions within a video,
i.e., determining the start and end times of actions.
 The CNN extracts spatial features from each frame, and the RNN processes
these features temporally to predict action boundaries.
 By jointly modeling spatial and temporal features, CNN-RNN models can
accurately localize actions within videos.
5. Video Summarization:
 CNN-RNN architectures can generate video summaries by selecting key
frames or segments that represent the essence of the video content.
 The CNN extracts features from video frames, and the RNN processes these
features to identify important segments or frames.
 By considering both spatial features and temporal context, CNN-RNN
models can generate concise and informative video summaries.

3D-CNN:
A 3D Convolutional Neural Network (3D CNN) is an extension of the traditional
Convolutional Neural Network (CNN) architecture that operates on three-
dimensional data. While traditional CNNs are primarily designed for processing
two-dimensional data, such as images (which have height and width), 3D CNNs
are capable of handling volumetric data, such as video frames or medical scans,
which also have depth.
Here's an overview of how 3D CNNs differ from their 2D counterparts:
 Convolutional Filters: In a 3D CNN, the convolutional filters slide across
three dimensions: height, width, and depth. This allows the network to
capture spatial as well as temporal information, making it suitable for tasks
involving video analysis or 3D image data.
 Input Data: While 2D CNNs typically take 2D images as input (with
dimensions like height x width x channels), 3D CNNs take volumetric data
as input (with dimensions like depth x height x width x channels). This
means that each input to the network represents a sequence of 3D volumes,
such as a sequence of video frames or slices of a 3D medical scan.
 Feature Extraction: Similar to 2D CNNs, 3D CNNs use convolutional layers
to extract hierarchical features from the input data. These features are
learned through the application of convolutional filters across the spatial and
temporal dimensions of the input volumes.
 Pooling and Downsampling: Pooling layers in 3D CNNs perform
downsampling not only across the spatial dimensions (height and width) but
also across the temporal dimension (depth). This helps reduce the spatial and
temporal resolution of the feature maps, making the network more robust to
variations in the input data.
 3D Convolutions and Memory: One challenge with 3D CNNs is their
computational and memory requirements, especially when dealing with
high-resolution volumetric data. Training and inference on 3D CNNs often
require significant computational resources compared to their 2D
counterparts.
Applications of 3D CNNs include action recognition in videos, video
classification, 3D object recognition, medical image analysis (such as MRI or CT
scans), and various other tasks involving volumetric data. These networks have
demonstrated promising results in tasks where capturing spatiotemporal
dependencies is essential for accurate predictions.
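As a minimal Keras sketch, assuming 16-frame 112x112 RGB clips and a placeholder count of 10 classes, a 3D CNN simply stacks Conv3D and MaxPooling3D layers that slide over depth, height, and width.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),                  # depth x height x width x channels
    layers.Conv3D(32, kernel_size=3, activation='relu'),    # 3x3x3 spatiotemporal filters
    layers.MaxPooling3D(pool_size=2),                       # downsample time and space together
    layers.Conv3D(64, kernel_size=3, activation='relu'),
    layers.MaxPooling3D(pool_size=2),
    layers.GlobalAveragePooling3D(),
    layers.Dense(10, activation='softmax'),                 # placeholder class count
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()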
