• Decision Making: Use the accumulated features in the fully connected layers to
classify the object as a car or another category.
This process enables a CNN to mimic human vision by starting with basic patterns
and building up to recognize complex objects effectively.
The image in Figure 2 shows how a Convolutional Neural Network (CNN) processes
an image step-by-step to classify it as a specific object, like a car. Here’s what happens:
Figure 2: The activations of an example ConvNet architecture. The initial volume stores
the raw image pixels (left) and the last volume stores the class scores (right). Each
volume of activations along the processing path is shown as a column. Since it’s difficult
to visualize 3D volumes, we lay out each volume’s slices in rows. The last layer volume
holds the scores for each class, but here we only visualize the sorted top 5 scores, and
print the labels of each one.
1. At the start, we have a picture of a car (on the left). This image is processed by a
series of operations called convolution, activation (ReLU), and pooling. These steps
help the network focus on specific details in the image.
2. In the first convolutional (Conv) layer, the network uses 10 filters to look for simple
patterns like edges or textures. For example, it may detect the round shape of the
wheels or the straight lines of the car’s body.
3. As we go deeper into the layers, these features are combined to form more complex
patterns. For example, in the later layers, the network might combine the wheel
and the car’s body to identify it as a complete car.
4. The feature maps grow richer as we move deeper, meaning the network gathers a combination of patterns that together represent the object.
5. Finally, the network accumulates all the features from the earlier layers and makes
a decision (in this case, classifying the object as a car).
2. Feature Extraction: As you move through the layers, the network keeps extracting features and combining them to understand the object better.
Figure 3: For instance, if the image was of the number 7, the initial layers might focus
on the curve or its straight line. As we go deeper, these individual features get combined
to form the entire number 7.
3. Combination of Features: Later layers combine simpler features (like wheels and
edges) into more complex representations (like the full shape of a car).
4. Decision Making: The accumulated features are used to classify the object (e.g.,
as a car or a truck) in the final layer.
This layered approach in CNNs mimics how humans recognize objects, starting with
simple patterns and combining them into a complete picture. It is highly effective for tasks
like object detection and classification, making it a vital tool in fields like autonomous
vehicles, medical imaging, and more.
2 Activation Functions
2.1 Why Are Activation Functions Necessary?
After extracting features from raw image data using convolutional layers, the network
combines these features into a linear representation. However, many real-world problems,
including object recognition, are inherently non-linear. To enable the network to capture
these complex patterns and make meaningful decisions, activation functions introduce
non-linearity into the network. Without non-linearity, the network would be limited to
learning only linear mappings, regardless of its depth.
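To see this concretely, consider the minimal NumPy sketch below (the weight shapes are arbitrary illustrations): two stacked linear layers are exactly equivalent to a single linear layer with a combined weight matrix, while inserting a ReLU between them breaks that equivalence.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # weights of layer 1 (illustrative shapes)
W2 = rng.standard_normal((2, 4))   # weights of layer 2
x = rng.standard_normal(3)         # an input vector

deep = W2 @ (W1 @ x)               # two linear layers applied in sequence
shallow = (W2 @ W1) @ x            # one linear layer with combined weights
print(np.allclose(deep, shallow))  # True: composing linear maps stays linear

relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)      # a ReLU in between breaks the collapse
print(np.allclose(nonlinear, shallow))  # generally False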
4. Leaky ReLU
The Leaky ReLU addresses the dying ReLU problem by allowing a small gradient when
the input is negative:
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}
where α is a small positive constant (e.g., 0.01).
• Range: (-∞, ∞)
• Pros: Prevents neurons from becoming inactive.
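A minimal NumPy sketch of Leaky ReLU, using the α = 0.01 mentioned above:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Keep positive inputs; scale negative ones by the small slope alpha.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]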
5. Softmax
The softmax function is primarily used in the output layer for multi-class classification.
It converts raw scores (logits) into probabilities:
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
• Range: (0, 1); the outputs are positive and sum to 1, so they can be read as class probabilities.
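A minimal NumPy sketch of the softmax formula above; subtracting the maximum logit before exponentiating is a standard trick to avoid overflow and does not change the result:

import numpy as np

def softmax(x):
    # Shift by the max logit for numerical stability (result is unchanged).
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # values in (0, 1) that sum to 1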
6. Swish
The swish function, proposed by Google, is defined as:
f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
• Range: approximately [−0.278, ∞); Swish dips slightly below zero for negative inputs but is unbounded above.
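A short NumPy sketch of Swish, following the definition above:

import numpy as np

def swish(x):
    # x times the sigmoid of x.
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# small dip below zero for negative inputs, approximately x for large positive x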
• ReLU and its variants (like Leaky ReLU) are the most commonly used due to their
simplicity and efficiency in avoiding the vanishing gradient problem.
• Sigmoid and Tanh are useful in specific scenarios but are prone to the vanishing
gradient problem in deeper networks.
• Softmax is primarily used in the output layer for multi-class classification tasks,
providing a probabilistic interpretation.
• Swish and other newer functions may offer better performance for certain tasks due
to their smooth, non-monotonic nature.
• The choice of activation function depends on the problem type, network architecture,
and the nature of the data.
Figure 6: Summary
3 Pooling
3.1 Why Pooling?
Pooling is an essential operation in Convolutional Neural Networks (CNNs) that reduces
the spatial dimensions of feature maps. This serves two main purposes:
• Reduced Computation: Smaller feature maps mean fewer activations to process, lowering the computational cost of the layers that follow.
• Focus on Key Features: Pooling helps retain the dominant information, making it easier for the network to recognize patterns crucial for classification.
Figure 7: Pooling layer downsamples the volume spatially, independently in each depth
slice of the input volume. Left: In this example, the input volume of size [224x224x64]
is pooled with filter size 2, stride 2 into an output volume of size [112x112x64]. Notice that
the volume depth is preserved. Right: The most common downsampling operation is
max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken
over 4 numbers (little 2x2 square).
• Max Pooling: Selects the maximum value from the pooling window. It captures
the most prominent features, ensuring the dominant information is retained.
• Average Pooling: Computes the average value of the pooling window. It is used when smoother or more generalized feature extraction is required (both operations are sketched in code after this list).
• Global Pooling: Reduces the feature map to a single value by applying pooling
over the entire map. It is commonly used in tasks like object detection.
• Stride: The step size by which the kernel moves across the feature map. A stride
greater than 1 helps in reducing the dimensions of the output.
• It ensures that only the most important information in each region is retained.
• This is sufficient for identifying an image with a label, as the main features are
enough to differentiate between classes.
• The final layers of the network, often a Multi-Layer Perceptron (MLP), act as
the classification head, mapping the extracted features to specific labels.
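As a concrete illustration, here is a minimal NumPy sketch of 2×2 max and average pooling with stride 2, matching the downsampling described in Figure 7 (the input values are made up for the example, and the height and width are assumed even):

import numpy as np

def pool2x2(fmap, mode="max"):
    # Group the map into non-overlapping 2x2 windows, then reduce
    # each window to a single value.
    h, w = fmap.shape
    windows = fmap.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return windows.max(axis=(1, 3))   # keep the dominant value
    return windows.mean(axis=(1, 3))      # smooth over the window

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)
print(pool2x2(fmap, "max"))   # [[4. 2.] [2. 8.]]
print(pool2x2(fmap, "mean"))  # [[2.5  1.  ] [1.25 6.5 ]]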
3.6 Drawbacks
Max Pooling:
• It may lose fine details or subtler patterns in the input feature map.
• In the example shown, max pooling retains only the brightest pixel values but loses the gradient or intensity variation of the diagonal line in the feature map.
Average Pooling:
• It tends to blur the features by averaging out high and low values.
• In the example, average pooling smooths out the intensity values, causing a loss of sharpness in the diagonal line.
3. Task-Specific Choices: The choice between max pooling and average pooling
depends on the task; max pooling is generally preferred for tasks requiring feature
emphasis, while average pooling is better for smoother, more general representations.
4 CNN Architecture
The architecture of a Convolutional Neural Network (CNN) consists of several key components:
• Convolutional Layers: These layers apply learnable filters to the input, producing feature maps that capture local patterns such as edges and textures.
• Pooling Layers: Following the convolutional layers, pooling layers help reduce the
dimensionality of the feature maps. The decision to include pooling depends on the
design of the model, as indicated in the diagram.
• Multilayer Perceptron (MLP): After convolution and pooling layers, MLP layers
are added for further feature processing, ultimately performing classification tasks.
Figure 9: Workflow
It is important to note that including too many design choices or parameters in the model
may lead to overfitting and unnecessary complexity.
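As a sketch of this workflow, the PyTorch snippet below stacks convolution, ReLU, and pooling layers and finishes with an MLP classification head. All sizes here (channel counts, the 32×32 RGB input, 10 classes) are illustrative assumptions, not a prescribed architecture.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # feature extraction
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(                 # the MLP head
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(1, 3, 32, 32))  # one fake RGB image
print(logits.shape)                        # torch.Size([1, 10])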
Story Time!
Imagine you’re at a grocery store, overwhelmed by endless cookie options: chocolate chip,
oatmeal raisin, gluten-free, and more. The sheer variety makes it frustrating to decide,
leaving you second-guessing your choice. Now, imagine the store offered a few curated options, like “classic” or “healthy.” With fewer, thoughtfully selected choices, you could
decide quickly and confidently.
This mirrors a common challenge in deep learning. With countless design choices for
layers, filters, kernel sizes, and activations, deciding on the best configuration can feel
overwhelming. Too many options can lead to unnecessary complexity, overfitting, or poor
performance. Structured benchmarks act like the curated options in the store, helping
simplify these decisions and guide the development of efficient, high-performing models.
ImageNet
One of the most significant breakthroughs in deep learning came with the introduction of
the ImageNet dataset. ImageNet is a vast visual database containing millions of labeled
images across various categories, specifically designed for visual object recognition. It has
played a critical role in the development of deep learning models by enabling the training
of complex CNNs with large amounts of diverse data.
The success of CNN architectures, especially after the introduction of ImageNet, revolutionized the field of computer vision. ImageNet provided a standardized benchmark for
evaluating and comparing deep learning models, enabling rapid advancements in visual
recognition tasks, from image classification to object detection.
By leveraging the ImageNet dataset, researchers and developers were able to train
models capable of achieving high accuracy on real-world image recognition tasks. The
dataset’s scale and complexity have been instrumental in pushing the boundaries of deep
learning in the field of computer vision.
4.1 ImageNet 1K
ImageNet-1K is a widely used dataset in computer vision and deep learning, playing
a pivotal role in advancing image classification and object recognition tasks. It is a
subset of the larger ImageNet dataset and contains 1,000 categories (or classes) with
approximately 1.28 million training images, 50,000 validation images, and 100,000
test images. Each image is labeled with one of the 1,000 categories, which include diverse
objects, animals, and scenes.
• High-Quality Labels: The labels are derived from the WordNet hierarchy, ensuring semantic relationships among categories.
• Large-Scale: The dataset’s size and diversity make it ideal for training and evaluating large-scale deep learning models.
• Catalyst for Research: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), based on ImageNet-1K, spurred innovation in CNN architectures, optimization techniques, and hardware development.
4.1.3 Limitations
• Bias and Imbalance: Despite its diversity, ImageNet-1K reflects cultural and
geographical biases present in its source data.
• Data Accessibility: While widely used in research, its licensing restricts direct
commercial use.
4.1.4 Legacy
ImageNet-1K has been a cornerstone in computer vision research, shaping how neural networks are designed and evaluated. Although newer datasets and challenges have emerged, it remains a foundational tool for developing and benchmarking image classification models.
5 MNIST Dataset
The MNIST dataset (Modified National Institute of Standards and Technology) is one
of the most popular datasets in machine learning and computer vision. It is primarily
used for training and testing image classification models, especially for handwritten digit
recognition.
• Ease of Use: Its simplicity allows beginners to experiment with neural networks
without complex preprocessing.
• Feature Extraction: Manually crafting features for images was tedious and prone
to errors.
• High Dimensionality: Images have a large number of pixels, making them difficult
to process without efficient algorithms.
6 LeNet
3. Subsampling Layer 1 (S2): Averages the values in 2×2 regions, reducing feature maps to size 14 × 14.
6. Fully Connected Layer (F5): Connects the flattened feature maps to a 120-neuron layer (a full LeNet-style sketch in code follows this list).
• Translation Invariance: Pooling layers ensure that the model is robust to small
shifts in the input image.
• Efficiency: LeNet was computationally efficient for its time, enabling practical use
in digit recognition systems.
• Commercial Applications: It was used in bank check processing and other real-world systems.
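Putting the layers described above together, here is a LeNet-5-style sketch in PyTorch. Where the notes do not spell out a detail (e.g., the 84-unit layer, the tanh activations), the code follows the classic LeNet-5 design; treat it as an assumption-laden sketch rather than the exact original.

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2),  # C1: 28x28 -> 6x28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                            # S2: average pool -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),            # C3: -> 16x10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                            # S4: -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),                 # F5: the 120-neuron layer
    nn.Tanh(),
    nn.Linear(120, 84),                         # classic LeNet-5 detail
    nn.Tanh(),
    nn.Linear(84, 10),                          # one score per digit
)

digits = torch.randn(1, 1, 28, 28)  # one fake MNIST-sized image
print(lenet(digits).shape)          # torch.Size([1, 10])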
7 AlexNet
In 2012, a groundbreaking moment in the field of computer vision occurred, largely attributed to a deep convolutional neural network (CNN) called AlexNet. Developed
by Alex Krizhevsky, along with his advisor Geoffrey Hinton and colleague Ilya
Sutskever, AlexNet played a pivotal role in revolutionizing the way machines interpret
images and kick-started the deep learning revolution in artificial intelligence.
Before AlexNet, computer vision tasks such as image classification were traditionally
handled by shallow machine learning algorithms or manually designed feature extraction
techniques. These methods had some success, but their performance was limited, particularly when dealing with complex tasks like recognizing objects in large, high-resolution
images.
7.3 Legacy
AlexNet’s success also paved the way for even deeper and more sophisticated architectures,
including VGGNet, GoogLeNet, ResNet, and other models, each building upon the
principles established by AlexNet. Today, convolutional neural networks (CNNs) are the
standard for image classification, object detection, and various other computer vision
tasks.
AlexNet’s legacy extends beyond computer vision — it was a key factor in the
widespread adoption of deep learning for a variety of AI applications, including natural
language processing, speech recognition, and reinforcement learning.
5. Impact on Deep Learning: AlexNet’s success marked the beginning of the deep
learning revolution, influencing a wide range of applications beyond computer vision,
including natural language processing and reinforcement learning.
8 LeNet vs AlexNet
LeNet was originally developed for handwritten digit recognition, specifically designed
to classify images in the MNIST dataset. Its simpler architecture, using small input
images (28x28x1), was effective for this task. On the other hand, AlexNet, with its much
deeper and more complex architecture, was designed to tackle the much larger and more
varied ImageNet dataset, consisting of high-resolution images (224x224x3) from 1,000
different categories. AlexNet’s success in classifying complex images marked a significant
breakthrough in deep learning, particularly in computer vision.
9 VGGNet
VGGNet, developed by the Visual Geometry Group at the University of Oxford, is a
deep convolutional neural network architecture. It became famous for its simplicity and
effectiveness in handling large-scale image classification tasks. VGGNet is known for its
deep architecture with very small 3x3 convolutional filters, which allowed it to capture
detailed hierarchical features in images. The model made significant contributions to the
field of computer vision, especially for image classification challenges like ImageNet.
• ReLU Activation: ReLU activation functions are used after each convolutional
layer, helping in faster training and preventing vanishing gradients.
• Fully Connected Layers: At the end of the convolutional layers, VGGNet uses
three fully connected layers, which help in making the final classification decision.
4. Transfer Learning: VGGNet’s pre-trained weights have been widely used in transfer learning applications, where the model is adapted for various tasks in computer vision.
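As an illustration of the stacked 3×3 design, here is a PyTorch sketch of a single VGG-style block; the channel counts are illustrative assumptions. VGGNet stacks several such blocks with increasing width before its three fully connected layers.

import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # Two stacked 3x3 convolutions, each followed by ReLU (as noted
    # above), then 2x2 max pooling to halve the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

block = vgg_block(3, 64)
x = torch.randn(1, 3, 224, 224)  # an ImageNet-sized input
print(block(x).shape)            # torch.Size([1, 64, 112, 112])

Two stacked 3×3 convolutions cover the same receptive field as one 5×5 convolution while using fewer parameters and adding an extra non-linearity, which is the key idea behind VGGNet’s depth.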