Unit 5
AlexNet:
AlexNet contains 8 layers with weights:
5 convolutional layers
3 fully connected layers.
A ReLU activation is applied at the end of each layer except the last one,
which outputs a softmax distribution over the 1000 class labels.
Dropout is applied in the first two fully connected layers. The network also
applies max-pooling after the first, second, and fifth convolutional layers.
The kernels of the second, fourth, and fifth convolutional layers are
connected only to those kernel maps in the previous layer that reside on the
same GPU. The kernels of the third convolutional layer are connected to all kernel
maps in the second layer. The neurons in the fully connected layers are connected
to all neurons in the previous layer.
An important feature of AlexNet is the use of the ReLU (Rectified Linear Unit)
nonlinearity.
Tanh and sigmoid activation functions used to be the usual choice for training
neural network models.
AlexNet showed that using ReLU nonlinearity, deep CNNs could be trained much
faster than using the saturating activation functions like tanh or sigmoid.
Although ReLU helps with the vanishing gradient problem, due to its unbounded
nature the resulting activations can become unnecessarily large. To prevent this,
AlexNet introduced Local Response Normalization (LRN). The idea behind LRN
is to carry out a normalization in a neighborhood of pixels amplifying the excited
neuron while dampening the surrounding neurons at the same time.
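As an illustration, here is a minimal PyTorch sketch (PyTorch is an assumed choice of framework) of AlexNet's first two convolutional blocks, each followed by ReLU and LRN. torch.nn.LocalResponseNorm implements the normalization described above; the layer sizes follow the original paper but the snippet is only a sketch, not a full AlexNet.

```python
import torch
import torch.nn as nn

# First two AlexNet-style convolutional blocks:
# convolution -> ReLU -> Local Response Normalization -> overlapping max-pooling.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    # conv1
    nn.ReLU(inplace=True),                         # non-saturating activation
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),         # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  # conv2
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 227, 227)   # one RGB image
print(features(x).shape)          # feature map after the first two blocks
```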
The architecture consists of eight layers: five convolutional layers and three fully-
connected layers. But the layer count alone isn’t what makes AlexNet special; the
following features were new approaches to convolutional neural networks at the time:
ReLU Nonlinearity: AlexNet uses Rectified Linear Units (ReLU) instead of
the tanh function, which was standard at the time. ReLU’s advantage is in
training time; a CNN using ReLU was able to reach a 25% error on the
CIFAR-10 dataset six times faster than a CNN using tanh.
Multiple GPUs: Back in the day, GPUs were still rolling around with 3
gigabytes of memory (nowadays those kinds of memory would be rookie
numbers). This was especially bad because the training set had 1.2 million
images. AlexNet allows for multi-GPU training by putting half of the
model’s neurons on one GPU and the other half on another GPU. Not only
does this mean that a bigger model can be trained, but it also cuts down on
the training time.
Overlapping Pooling: CNNs traditionally “pool” outputs of neighboring
groups of neurons with no overlapping. However, when the authors
introduced overlap, they saw a reduction in error by about 0.5% and found
that models with overlapping pooling generally find it harder to overfit.
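For comparison, the difference between conventional non-overlapping pooling and AlexNet's overlapping pooling can be expressed in two lines of PyTorch (the feature-map size below is just an example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)

# Traditional, non-overlapping pooling: the stride equals the window size.
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)

# AlexNet-style overlapping pooling: 3x3 windows moved with stride 2,
# so neighbouring windows share a row/column of activations.
overlap = nn.MaxPool2d(kernel_size=3, stride=2)

print(non_overlap(x).shape)  # torch.Size([1, 96, 27, 27])
print(overlap(x).shape)      # torch.Size([1, 96, 27, 27])
```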
Pros of AlexNet:
1. AlexNet is considered a milestone of CNNs for image classification.
2. Many methods, such as the conv+pooling design, dropout, GPU parallel
computing, and ReLU, are still the industrial standard for computer vision.
3. The unique advantage of AlexNet is the direct image input to the
classification model.
4. The convolutional layers can automatically extract features such as edges from
the images, and the fully connected layers learn to combine these features.
5. Theoretically, the complexity of visual patterns can be effectively extracted by
adding more convolutional layers.
Cons of AlexNet:
1. AlexNet is not deep enough compared to later models such as
VGGNet, GoogLeNet, and ResNet.
2. The use of large convolution filters (5x5) was discouraged shortly
afterwards.
3. Using a normal distribution to initialize the weights cannot effectively
solve the problem of vanishing gradients; it was later replaced by the
Xavier initialization method.
4. Its performance has been surpassed by more complex models such as GoogLeNet
(6.7% top-5 error) and ResNet (3.6% top-5 error).
ResNet:
Projection and Identity: When the dimensions of F(x) and x differ, the Projection
shortcut is used; it adjusts the dimensions of the input x to match those of the
output F(x). On the other hand, when the dimensions of F(x) and x are the same,
the Identity shortcut is used, which makes it easier for the network to learn.
Imagine a network A which produces x amount of training error. Construct a
network B by adding a few layers on top of A and set the parameter values in those
layers in such a way that they do nothing to the outputs from A. Let's call the
additional layers C. This would mean the same x amount of training error for the
new network. So while training network B, the training error should not be above
the training error of A. But in practice it does end up higher, and the only reason is
that learning the identity mapping (doing nothing to the inputs and just copying
them as they are) with the added layers C is not a trivial problem, which the solver
fails to achieve. To solve this, the residual module creates a direct path between the
input and the output of the module, implying an identity mapping, and the added
layers C just need to learn features on top of the already available input. Since C is
learning only the residual, the whole module is called a residual module.
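A minimal PyTorch sketch of such a residual module (a simplified basic block; the channel counts are illustrative): the stacked layers only learn the residual F(x), while the shortcut passes x through unchanged, or through a 1x1 projection when the dimensions differ.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions plus a shortcut; output = F(x) + shortcut(x)."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: adapt the dimensions of x to match F(x).
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            # Identity shortcut: pass x through unchanged.
            self.shortcut = nn.Identity()

    def forward(self, x):
        return torch.relu(self.f(x) + self.shortcut(x))

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```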
Also, similar to GoogLeNet, ResNet uses global average pooling followed by the
classification layer. Through the changes mentioned, ResNets were trained with a
network depth as large as 152 layers. ResNet achieves better accuracy than VGGNet
and GoogLeNet while being computationally more efficient than VGGNet. ResNet-152
achieves 95.51% top-5 accuracy.
The architecture is similar to VGGNet, consisting mostly of 3x3 filters. Starting
from the VGGNet layout, shortcut connections as described above are inserted to
form a residual network, in the same way the earlier layers of VGG-19 would be
converted.
The power of residual networks can be judged from one of the experiments in the
ResNet paper. The plain 34-layer network had a higher validation error than the
plain 18-layer network; this is where the degradation problem becomes visible. The
same 34-layer network, when converted into a residual network, has a much lower
training error than the 18-layer residual network.
Transfer learning:
Transfer learning, used in machine learning, is the reuse of a pre-trained model on
a new problem. In transfer learning, a machine exploits the knowledge gained from
a previous task to improve generalization about another. For example, in training a
classifier to predict whether an image contains food, you could use the knowledge
it gained during training to recognize drinks.
With transfer learning, we basically try to exploit what has been learned in one task
to improve generalization in another. We transfer the weights that a network has
learned at “task A” to a new “task B.”
The general idea is to use the knowledge a model has learned from a task with a lot
of available labeled training data in a new task that doesn't have much data. Instead
of starting the learning process from scratch, we start with patterns learned from
solving a related task.
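A minimal transfer-learning sketch in PyTorch/torchvision, assuming a ResNet-18 backbone pre-trained on ImageNet as "task A" and a two-class problem (e.g., food vs. no food) as "task B"; only the newly added head is trained. The backbone choice and the weights-enum API (recent torchvision versions) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse a backbone pre-trained on ImageNet ("task A").
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained weights so only the new layers will be updated.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for the new task ("task B").
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# Only the parameters of the new head are given to the optimizer.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-3, momentum=0.9)
```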
Transfer learning is mostly used in computer vision and in natural language
processing tasks like sentiment analysis, because training such models from scratch
requires a huge amount of computational power.
Transfer learning isn’t really a machine learning technique in itself, but can be seen
as a “design methodology” within the field, much like active learning. It is also not
an exclusive part or study area of machine learning. Nevertheless, it has become
quite popular in combination with neural networks, which require huge amounts of
data and computational power.
Siamese Network:
A siamese neural network (SNN) is a class of neural network architectures that
contain two or more identical sub-networks. “Identical” here means they have the
same configuration with the same parameters and weights. Parameter updating is
mirrored across both sub-networks and it’s used to find similarities between inputs
by comparing feature vectors.
These networks are used in many applications.
Traditionally, a neural network learns to predict multiple classes. This poses a
problem when we need to add or remove classes from the data: in that case, we
have to update the neural network and retrain it on the whole data set. Also, deep
neural networks need a large volume of data on which to train. SNNs, on the other
hand, learn a similarity function, so they can compare new inputs without being
retrained for every new class.
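A minimal PyTorch sketch of an SNN (the encoder layout and the 28x28 grayscale inputs are illustrative assumptions): both inputs pass through one shared-weight encoder, and their embeddings are compared with a distance function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNetwork(nn.Module):
    """Both inputs pass through the SAME sub-network (shared weights),
    so parameter updates are automatically mirrored across the branches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 128),               # 128-dimensional embedding
        )

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)   # same weights for both
        return F.pairwise_distance(e1, e2)            # distance between embeddings

net = SiameseNetwork()
a, b = torch.randn(4, 1, 28, 28), torch.randn(4, 1, 28, 28)
print(net(a, b).shape)  # one distance (similarity) score per pair
```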
TRIPLET LOSS: A common loss for training an SNN is the triplet loss:
L = max(d(Fa, Fp) - d(Fa, Fn) + alpha, 0)
In this equation, alpha is a margin term used to stretch the distance between
similar and dissimilar pairs in the triplet, and Fa, Fp, Fn are the feature embeddings
of the anchor, positive, and negative images.
During the training process, we feed an image triplet (anchor image, positive
image, negative image) into the model as a single sample. The distance between
the anchor and positive images should be smaller than that between the anchor and
negative images.
CONTRASTIVE LOSS: Contrastive loss is an increasingly popular loss function.
It’s a distance-based loss as opposed to more conventional error-prediction loss.
This loss function is used to learn embeddings in which two similar points have a
low Euclidean distance and two dissimilar points have a large Euclidean distance.
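A small sketch of the contrastive loss in PyTorch (the margin value and embedding size are arbitrary choices for illustration); label 1 marks a similar pair and label 0 a dissimilar pair.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e1, e2, label, margin=1.0):
    """Contrastive loss on a batch of embedding pairs.

    Similar pairs (label=1) are pulled to a small Euclidean distance;
    dissimilar pairs (label=0) are pushed apart until at least `margin` away.
    """
    d = F.pairwise_distance(e1, e2)
    loss_similar = label * d.pow(2)
    loss_dissimilar = (1 - label) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (loss_similar + loss_dissimilar).mean()

e1, e2 = torch.randn(8, 128), torch.randn(8, 128)
label = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(e1, e2, label))
```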
Metric learning:
Many approaches in machine learning require a measure of distance between data
points. Traditionally, practitioners would choose a standard distance metric
(Euclidean, City-Block, Cosine, etc.) using a priori knowledge of the domain.
However, it is often difficult to design metrics that are well-suited to the particular
data and task of interest.
Distance metric learning (or simply, metric learning) aims at automatically
constructing task-specific distance metrics from (weakly) supervised data, in a
machine learning manner. The learned distance metric can then be used to perform
various tasks (e.g., k-NN classification, clustering, information retrieval).
The goal of Metric Learning is to learn a representation function that maps objects
into an embedded space. The distance in the embedded space should preserve the
objects’ similarity — similar objects get close and dissimilar objects get far away.
Various loss functions have been developed for Metric Learning. For example, the
contrastive loss guides the objects from the same class to be mapped to the same
point and those from different classes to be mapped to different points whose
distances are larger than a margin. Triplet loss is also popular, which requires the
distance between the anchor sample and the positive sample to be smaller than the
distance between the anchor sample and the negative sample.
Metric learning problems fall into two main categories depending on the type of
supervision available about the training data:
Supervised learning: the algorithm has access to a set of data points, each of
them belonging to a class (label) as in a standard classification problem.
Broadly speaking, the goal in this setting is to learn a distance metric that
puts points with the same label close together while pushing away points
with different labels.
Weakly supervised learning: the algorithm has access to a set of data points
with supervision only at the tuple level (typically pairs, triplets, or
quadruplets of data points). A classic example of such weaker supervision is
a set of positive and negative pairs: in this case, the goal is to learn a
distance metric that puts positive pairs close together and negative pairs far
away.
Based on the above (weakly) supervised data, the metric learning problem is
generally formulated as an optimization problem where one seeks to find the
parameters of a distance function that optimize some objective function measuring
the agreement with the training data.
Use-cases:
There are many use-cases for metric learning. We list here a few popular examples:
1. Nearest neighbors models: the learned metric can be used to improve nearest
neighbors learning models for classification, regression, anomaly
detection…
2. Clustering: metric learning provides a way to bias the clusters found by
algorithms like K-Means towards the intended semantics.
3. Information retrieval: the learned metric can be used to retrieve the elements
of a database that are semantically closest to a query element.
4. Dimensionality reduction: metric learning may be seen as a way to reduce
the data dimension in a (weakly) supervised setting.
More generally, the learned transformation can be used to project the data into a
new embedding space before feeding it into another machine learning algorithm.
Approaches to metric learning:
In practice, there are two main approaches to metric learning and two
corresponding types of NN architectures. The first is the interaction-based
approach, which first builds local interactions (i.e., local matching signals)
between two objects. Deep neural networks learn hierarchical interaction patterns
for matching. Examples of neural network architectures include MV-LSTM, ARC-
II, and MatchPyramid.
The second is the representation-based approach. In this case the distance function is
composed of two components: the Encoder transforms an object into an embedded
representation (usually a large floating-point vector), and the Comparator takes the
embeddings of a pair of objects from the Encoder and calculates their similarity.
The most well-known example of this embedding representation is Word2Vec.
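A tiny sketch of this representation-based split (the encoder layout and the cosine comparator are illustrative assumptions, not a specific published model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder: maps an object (here, a 300-dim input vector) to an embedding.
encoder = nn.Sequential(
    nn.Linear(300, 256), nn.ReLU(),
    nn.Linear(256, 128),
)

def comparator(emb_a, emb_b):
    """Comparator: takes two embeddings and returns a similarity score."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1)

a, b = torch.randn(4, 300), torch.randn(4, 300)
score = comparator(encoder(a), encoder(b))
print(score)  # one similarity score per pair, in [-1, 1]
```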
Ranking/triplet loss:
Triplet Loss: Triplet loss is a distance-based loss function that operates on three
inputs:
1. anchor (a): any datapoint chosen by the user.
2. positive (p): any datapoint belonging to the same class as the anchor.
3. negative (n): any datapoint not belonging to the same class as the anchor.
The basic idea behind triplet loss is to learn an embedding space in which the
positive embedding is nearer to the anchor embedding and, at the same time, the
negative embedding is farther from the anchor than the positive embedding by at
least the margin m. We can formulate this understanding of triplet loss as the
equation
L = max(d(ra, rp) - d(ra, rn) + m, 0)
where ra, rp, rn are the embeddings of the anchor, positive, and negative.
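A direct translation of this equation into PyTorch (using PyTorch is an assumption; the margin and embedding size are arbitrary). PyTorch also ships nn.TripletMarginLoss, which implements the same formula.

```python
import torch
import torch.nn.functional as F

def triplet_loss(ra, rp, rn, m=0.2):
    """L = max(d(ra, rp) - d(ra, rn) + m, 0), averaged over the batch."""
    d_ap = F.pairwise_distance(ra, rp)   # anchor-positive distance
    d_an = F.pairwise_distance(ra, rn)   # anchor-negative distance
    return torch.clamp(d_ap - d_an + m, min=0).mean()

ra, rp, rn = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
print(triplet_loss(ra, rp, rn))

# Equivalent built-in:
print(torch.nn.TripletMarginLoss(margin=0.2)(ra, rp, rn))
```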
Negative Selection
From the above loss equation we can have three possibilities, or three categories of
triplets:
1. easy triplets: triplets which have a loss of 0, i.e. d(ra, rn) > d(ra, rp) + m.
2. semi-hard triplets: triplets where the negative is not closer to the anchor than the
positive, but which still have positive loss, i.e. d(ra, rp) < d(ra, rn) < d(ra, rp) + m.
3. hard triplets: triplets where the negative is closer to the anchor than the positive,
i.e. d(ra, rn) < d(ra, rp).
Each of these definitions depends on where the negative is relative to the anchor
and positive. We can therefore extend these three categories to the negatives: hard
negatives, semi-hard negatives, or easy negatives.
Choosing what kind of triplets we train on greatly impacts our error. For example,
if we train on easy triplets, the overall loss is already zero, so there will be no
parameter update and our model will not learn anything. It is therefore advisable to
train the model on semi-hard or hard triplets for better accuracy and lower error
rates.
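As a small illustration (a hypothetical helper, not from any library), the following function labels a single triplet according to the three categories above; a simple mining scheme would keep only the semi-hard and hard triplets for training.

```python
import torch
import torch.nn.functional as F

def categorize_triplet(ra, rp, rn, m=0.2):
    """Classify one triplet as 'easy', 'semi-hard' or 'hard' using the
    anchor-positive and anchor-negative distances defined above."""
    d_ap = F.pairwise_distance(ra.unsqueeze(0), rp.unsqueeze(0)).item()
    d_an = F.pairwise_distance(ra.unsqueeze(0), rn.unsqueeze(0)).item()
    if d_an > d_ap + m:
        return "easy"        # loss is already zero, no gradient signal
    if d_an < d_ap:
        return "hard"        # negative is closer to the anchor than the positive
    return "semi-hard"       # positive loss, but the ordering is still correct

ra, rp, rn = torch.randn(128), torch.randn(128), torch.randn(128)
print(categorize_triplet(ra, rp, rn))
```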
Unlike other loss functions, such as Cross-Entropy Loss or Mean Square Error
Loss, whose objective is to learn to predict a label, a value, or a set of values
directly from an input, the objective of Ranking Losses is to predict relative
distances between inputs. This task is often called metric learning.
Ranking Loss functions are very flexible in terms of training data: we just need a
similarity score between data points to use them. That score can be binary (similar /
dissimilar). As an example, imagine a face verification dataset, where we know
which face images belong to the same person (similar), and which not (dissimilar).
Using a Ranking Loss function, we can train a CNN to infer if two face images
belong to the same person or not.
To use a Ranking Loss function we first extract features from two (or three) input
data points and get an embedded representation for each of them. Then, we define a
metric function to measure the similarity between those representations, for
instance the Euclidean distance. Finally, we train the feature extractors to produce
similar representations for both inputs, in case the inputs are similar, or distant
representations for the two inputs, in case they are dissimilar.
We don’t even care about the values of the representations, only about the distances
between them. However, this training methodology has been demonstrated to
produce powerful representations for different tasks.
Other names used for Ranking Losses: Ranking Losses are essentially the ones
explained above, and are used in many different applications with the same
formulation or minor variations. However, different names are used for them, which
can be confusing. Here I explain why those names are used.
Ranking loss: This name comes from the information retrieval field, where
we want to train models to rank items in a specific order.
Margin Loss: This name comes from the fact that these losses use a margin to
compare the distances between sample representations.
Contrastive Loss: Contrastive refers to the fact that these losses are computed
by contrasting two or more data points’ representations. This name is often used
for Pairwise Ranking Loss, but I’ve never seen it used in a setup with
triplets.
Triplet Loss: Often used as the loss name when triplets of training samples are
employed.
Hinge loss: Also known as the max-margin objective. It’s used for training
SVMs for classification. It has a similar formulation in the sense that it
optimizes up to a margin. That’s why this name is sometimes used for
Ranking Losses.
RCNN:
Working of RCNN:
Region Proposal: R-CNN starts by dividing the input image into multiple regions
or subregions. These regions are referred to as "region proposals" or "region
candidates." The region proposal step is responsible for generating a set of
potential regions in the image that are likely to contain objects. R-CNN does not
generate these proposals itself; instead, it relies on external methods like Selective
Search or EdgeBoxes to generate region proposals.
Selective Search, for example, operates by merging or splitting segments of the
image based on various image cues like color, texture, and shape to create a
diverse set of region proposals.
Feature Extraction: Once the region proposals are generated, approximately 2,000
regions are extracted and anisotropically warped to the consistent input size that the
CNN expects (e.g., 224x224 pixels), and each warped region is then passed through
the CNN to extract features.
Before warping, the region is expanded so that the warped frame contains 16
pixels of context around the original proposal. The CNN used is AlexNet; it is
typically pre-trained on a large dataset like ImageNet for a generic feature
representation and then fine-tuned on the detection data.
The output of the CNN is a high-dimensional feature vector representing the
content of the region proposal.
Object Classification: The extracted feature vectors from the region proposals are
fed into a separate machine learning classifier for each object class of interest. R-
CNN typically uses Support Vector Machines (SVMs) for classification. For each
class, a unique SVM is trained to determine whether or not the region proposal
contains an instance of that class.
During training, positive samples are regions that contain an instance of the class.
Negative samples are regions that do not.
Bounding Box Regression: In addition to classifying objects, R-CNN also
performs bounding box regression. For each class, a separate regression model is
trained to refine the location and size of the bounding box around the detected
object. The bounding box regression helps improve the accuracy of object
localization by adjusting the initially proposed bounding box to better fit the
object's actual boundaries.
Non-Maximum Suppression (NMS): After classifying and regressing bounding
boxes for each region proposal, R-CNN applies non-maximum suppression to
eliminate duplicate or highly overlapping bounding boxes. NMS ensures that only
the most confident and non-overlapping bounding boxes are retained as final object
detections.
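A plain NumPy sketch of greedy NMS (the IoU threshold and the example boxes are made-up values for illustration):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores from the per-class classifier
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # keeps boxes 0 and 2; the near-duplicate box 1 is suppressed
```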
Strengths of R-CNN: Below are a few of the key strengths of the R-CNN
architecture.
1. It showed that CNN features substantially improve detection accuracy compared
to earlier hand-crafted features.
2. It can reuse a CNN pre-trained on a large classification dataset such as
ImageNet, so relatively little detection-specific data is needed.
3. The pipeline is modular: the region proposal method, the feature extractor, the
classifiers, and the bounding box regressors can each be improved or replaced
independently.
CNN-RNN:
CNN-RNN refers to a neural network architecture that combines Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
CNNs are particularly effective in processing spatial data, such as images, due to
their ability to learn hierarchical representations. They use convolutional layers to
extract features from the input data.
RNNs, on the other hand, are well-suited for sequential data, like time series or
text, because they have memory to capture temporal dependencies. They work by
recursively applying the same operation to each element of a sequence, updating
their internal state with each step.
In a CNN-RNN architecture, the CNN is typically used for feature extraction from
the input data (such as images), and the RNN is then employed to process these
features sequentially. This combination allows the model to leverage both spatial
and temporal information, making it useful for tasks such as video analysis, image
captioning, and sequence generation.
For instance, in image captioning, the CNN can extract visual features from the
input image, and the RNN can generate a corresponding caption word by word,
taking into account the contextual information provided by the image features.
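A bare-bones sketch of such a CNN-RNN captioning model in PyTorch (the vocabulary size, dimensions, and the tiny CNN are illustrative assumptions, not the architecture of any specific paper): the CNN encodes the image into a feature vector, which initialises an LSTM that then scores the next word at every step of the caption.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """CNN encoder + LSTM decoder for image captioning (toy-sized sketch)."""
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                     # spatial feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images)                      # (B, hidden_dim)
        h0 = feats.unsqueeze(0)                       # image features seed the LSTM state
        c0 = torch.zeros_like(h0)
        tokens = self.embed(captions)                 # (B, T, embed_dim)
        hidden, _ = self.rnn(tokens, (h0, c0))
        return self.out(hidden)                       # per-step vocabulary scores

model = CaptionModel()
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 1000, (2, 12))
print(model(images, captions).shape)  # torch.Size([2, 12, 1000])
```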
Overall, CNN-RNN architectures have proven to be effective in various machine
learning tasks where both spatial and temporal information are crucial for
understanding the data.
3D-CNN:
A 3D Convolutional Neural Network (3D CNN) is an extension of the traditional
Convolutional Neural Network (CNN) architecture that operates on three-
dimensional data. While traditional CNNs are primarily designed for processing
two-dimensional data, such as images (which have height and width), 3D CNNs
are capable of handling volumetric data, such as video frames or medical scans,
which also have depth.
Here's an overview of how 3D CNNs differ from their 2D counterparts:
Convolutional Filters: In a 3D CNN, the convolutional filters slide across
three dimensions: height, width, and depth. This allows the network to
capture spatial as well as temporal information, making it suitable for tasks
involving video analysis or 3D image data.
Input Data: While 2D CNNs typically take 2D images as input (with
dimensions like height x width x channels), 3D CNNs take volumetric data
as input (with dimensions like depth x height x width x channels). This
means that each input to the network represents a sequence of 3D volumes,
such as a sequence of video frames or slices of a 3D medical scan.
Feature Extraction: Similar to 2D CNNs, 3D CNNs use convolutional layers
to extract hierarchical features from the input data. These features are
learned through the application of convolutional filters across the spatial and
temporal dimensions of the input volumes.
Pooling and Downsampling: Pooling layers in 3D CNNs perform
downsampling not only across the spatial dimensions (height and width) but
also across the temporal dimension (depth). This helps reduce the spatial and
temporal resolution of the feature maps, making the network more robust to
variations in the input data.
3D Convolutions and Memory: One challenge with 3D CNNs is their
computational and memory requirements, especially when dealing with
high-resolution volumetric data. Training and inference on 3D CNNs often
require significant computational resources compared to their 2D
counterparts.
Applications of 3D CNNs include action recognition in videos, video
classification, 3D object recognition, medical image analysis (such as MRI or CT
scans), and various other tasks involving volumetric data. These networks have
demonstrated promising results in tasks where capturing spatiotemporal
dependencies is essential for accurate predictions.
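A minimal PyTorch sketch of a 3D CNN for clip-level video classification (the layer sizes, clip length, and number of classes are illustrative assumptions): the convolutional filters and the pooling operate over depth, height, and width at once.

```python
import torch
import torch.nn as nn

# Input shape: (batch, channels, depth, height, width), e.g. 16 RGB frames per clip.
model = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1),   # filters slide over D, H and W
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                  # downsamples time AND space
    nn.Conv3d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, 10),                            # e.g. 10 action classes
)

clip = torch.randn(2, 3, 16, 112, 112)            # 2 clips of 16 frames each
print(model(clip).shape)                          # torch.Size([2, 10])
```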