
Minor in AI

Components & Variants of CNN

January 07, 2025



1 Driving Vision: How CNNs Recognize Vehicles


Imagine you are designing an autonomous vehicle system. The car must accurately identify objects like pedestrians, traffic signs, and other vehicles in real time from a camera feed. Each object has distinctive features, such as shapes, edges, and textures, which can help the system recognize and classify it. The challenge is to extract these features and use them to make decisions like braking, accelerating, or turning, ensuring the car navigates safely.

Figure 1: Classification of Vehicles

1.1 Problem Statement


In autonomous driving, accurately identifying objects such as cars, pedestrians, and traffic
signs is critical for ensuring safety. This requires extracting and analyzing features from
images captured by cameras to classify objects in real-time.

1.2 What needs to be done?


• Feature Extraction: Use convolutional layers to extract basic features such as
edges, lines, and curves from the image of a car.

• Feature Combination: Gradually combine simpler features (e.g., wheels, windows) into more complex patterns to identify the object as a car.

• Layer-wise Processing: Understand how each layer in a Convolutional Neural Network (CNN) contributes to detecting finer details and accumulates these features to form a comprehensive representation of the car.

• Decision Making: Use the accumulated features in the fully connected layers to
classify the object as a car or another category.

This process enables a CNN to mimic human vision by starting with basic patterns
and building up to recognize complex objects effectively.

The image in Figure 2 shows how a Convolutional Neural Network (CNN) processes
an image step-by-step to classify it as a specific object, like a car. Here’s what happens:


Figure 2: The activations of an example ConvNet architecture. The initial volume stores
the raw image pixels (left) and the last volume stores the class scores (right). Each
volume of activations along the processing path is shown as a column. Since it’s difficult
to visualize 3D volumes, we lay out each volume’s slices in rows. The last layer volume
holds the scores for each class, but here we only visualize the sorted top 5 scores, and
print the labels of each one.

1. At the start, we have a picture of a car (on the left). This image is processed by a
series of operations called convolution, activation (ReLU), and pooling. These steps
help the network focus on specific details in the image.

2. In the first convolutional (Conv) layer, the network uses 10 filters to look for simple
patterns like edges or textures. For example, it may detect the round shape of the
wheels or the straight lines of the car’s body.

3. As we go deeper into the layers, these features are combined to form more complex
patterns. For example, in the later layers, the network might combine the wheel
and the car’s body to identify it as a complete car.

4. The feature maps become denser as we move deeper, meaning the network gathers a combination of patterns that represent the object.

5. Finally, the network accumulates all the features from the earlier layers and makes
a decision (in this case, classifying the object as a car).

1.3 Key Takeaways


1. Feature Inside a Feature: After applying the first Conv layer, the network captures simple features, like the edges of a car, which are parts of larger features.

2. Feature Extraction: As you move through the layers, the network keeps extracting features and combining them to understand the object better.


Figure 3: For instance, if the image was of the number 7, the initial layers might focus on the curve or its straight line. As we go deeper, these individual features get combined to form the entire number 7.

3. Combination of Features: Later layers combine simpler features (like wheels and
edges) into more complex representations (like the full shape of a car).
4. Decision Making: The accumulated features are used to classify the object (e.g.,
as a car or a truck) in the final layer.

This layered approach in CNNs mimics how humans recognize objects, starting with
simple patterns and combining them into a complete picture. It is highly effective for tasks
like object detection and classification, making it a vital tool in fields like autonomous
vehicles, medical imaging, and more.

2 Activation Functions
2.1 Why are activation functions necessary?
After extracting features from raw image data using convolutional layers, the network
combines these features into a linear representation. However, many real-world problems,
including object recognition, are inherently non-linear. To enable the network to capture
these complex patterns and make meaningful decisions, activation functions introduce
non-linearity into the network. Without non-linearity, the network would be limited to
learning only linear mappings, regardless of its depth.
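
To see the point concretely, here is a quick numeric check (a minimal NumPy sketch, not from the original text): two stacked linear layers with no activation in between collapse into a single linear layer.

import numpy as np

# Two linear "layers" with no activation function between them.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=3)         # input vector

deep = W2 @ (W1 @ x)           # applying the layers one after another...
shallow = (W2 @ W1) @ x        # ...equals a single linear layer W2 @ W1

print(np.allclose(deep, shallow))  # True: the extra depth added no expressive power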

2.2 Common Activation Functions


1. Sigmoid Function
The sigmoid activation function is defined as:

σ(x) = 1 / (1 + e^(−x))
• Range: (0, 1)
• Pros: Useful for binary classification as it maps outputs to probabilities.
• Cons: Prone to the vanishing gradient problem, making it less suitable for deep
networks.


Figure 4: Flow of NN with activation function

2. Hyperbolic Tangent (Tanh) Function


The tanh function is given by:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Range: (-1, 1)
• Pros: Zero-centered output, which helps in faster convergence compared to sigmoid.
• Cons: Suffers from the vanishing gradient problem in deep networks.

3. Rectified Linear Unit (ReLU)


The ReLU function is defined as:

f(x) = max(0, x)
• Range: [0, ∞)
• Pros: Efficient, avoids the vanishing gradient problem, and accelerates convergence.
• Cons: Can suffer from the dying ReLU problem where neurons get stuck during
training.

4. Leaky ReLU
The Leaky ReLU addresses the dying ReLU problem by allowing a small gradient when the input is negative:

f(x) = x if x > 0; αx if x ≤ 0

where α is a small positive constant (e.g., 0.01).
• Range: (-∞, ∞)
• Pros: Prevents neurons from becoming inactive.


Figure 5: Softmax Calculation for three classes

5. Softmax
The softmax function is primarily used in the output layer for multi-class classification.
It converts raw scores (logits) into probabilities:

Softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

• Range: (0, 1)

• Pros: Provides a probabilistic interpretation for multi-class problems.

6. Swish
The swish function, proposed by Google, is defined as:

f(x) = x · σ(x) = x / (1 + e^(−x))
• Range: (-∞, ∞)

• Pros: Smooth and non-monotonic, often outperforms ReLU in certain tasks.
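
All six functions are easy to implement directly. The following NumPy sketch is one way to write them (the three-class logits at the end are an illustrative assumption, in the spirit of Figure 5):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max improves numerical stability
    return e / e.sum()

def swish(x):
    return x * sigmoid(x)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for three classes
print(softmax(logits))                # approx. [0.659, 0.242, 0.099], summing to 1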

2.3 Key Takeaways


• Activation functions introduce non-linearity, allowing neural networks to learn complex, non-linear relationships in data.

• ReLU and its variants (like Leaky ReLU) are the most commonly used due to their
simplicity and efficiency in avoiding the vanishing gradient problem.

• Sigmoid and Tanh are useful in specific scenarios but are prone to the vanishing
gradient problem in deeper networks.

• Softmax is primarily used in the output layer for multi-class classification tasks,
providing a probabilistic interpretation.


• Swish and other newer functions may offer better performance for certain tasks due
to their smooth, non-monotonic nature.

• The choice of activation function depends on the problem type, network architecture,
and the nature of the data.

Figure 6: Summary

3 Pooling
3.1 Why Pooling?
Pooling is an essential operation in Convolutional Neural Networks (CNNs) that reduces
the spatial dimensions of feature maps. This serves two main purposes:


• Dimensionality Reduction: By down-sampling the feature maps, pooling reduces the computational complexity of the network.

• Focus on Key Features: Pooling helps retain the dominant information, making
it easier for the network to recognize patterns crucial for classification.

Figure 7: Pooling layer downsamples the volume spatially, independently in each depth
slice of the input volume. Left: In this example, the input volume of size [224x224x64]
is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that
the volume depth is preserved. Right: The most common downsampling operation is
max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken
over 4 numbers (little 2x2 square).

3.2 Types of Pooling


There are several types of pooling, each with a specific use case:

• Max Pooling: Selects the maximum value from the pooling window. It captures
the most prominent features, ensuring the dominant information is retained.

• Average Pooling: Computes the average value of the pooling window. It is used
when smoother or more generalized feature extraction is required.

• Global Pooling: Reduces each feature map to a single value by applying pooling over the entire map. Global average pooling is commonly used just before the classification head in modern architectures.

3.3 Kernel and Stride


• Kernel: The size of the window (e.g., 2 × 2, 3 × 3) used to perform the pooling
operation.

• Stride: The step size by which the kernel moves across the feature map. A stride greater than 1 reduces the dimensions of the output (see the example below).
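
A minimal PyTorch sketch of a 2 × 2 kernel with stride 2 (the 4x4 input values are an illustrative assumption):

import torch
import torch.nn as nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)   # one 4x4 feature map

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x).shape)   # torch.Size([1, 1, 2, 2]): spatial size halved
print(max_pool(x))         # each output is the max of its 2x2 window
print(avg_pool(x))         # each output is the mean of its 2x2 window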

3.4 Why Max Pooling?


The primary goal of CNNs is to extract and identify the dominant features that are most
significant for classification. Max pooling is effective because:


• It ensures that only the most important information in each region is retained.

• This is sufficient for identifying an image with a label, as the main features are
enough to differentiate between classes.

3.5 Feature Extraction vs Classification


• CNNs are often referred to as feature extractors, as they are responsible for
identifying patterns and extracting essential information from images.

• The final layers of the network, often a Multi-Layer Perceptron (MLP), act as
the classification head, mapping the extracted features to specific labels.

3.6 Drawbacks

Figure 8: Drawbacks of Pooling

Pooling is a crucial operation in Convolutional Neural Networks (CNNs) to reduce the dimensions of feature maps. However, it comes with certain drawbacks, as illustrated below:

3.6.1 Max Pooling


Max pooling selects the maximum value in each pooling window, emphasizing the most
dominant feature. While this helps retain sharp and prominent features:

• It may lose fine details or subtler patterns in the input feature map.


• In the example shown, max pooling retains only the brightest pixel values but loses
the gradient or intensity variation of the diagonal line in the feature map.

3.6.2 Average Pooling


Average pooling computes the average of all pixel values in each pooling window, resulting
in a more generalized representation. However:

• It tends to blur the features by averaging out high and low values.

• In the example, average pooling smooths out the intensity values, causing a loss of
sharpness in the diagonal line.
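
The contrast can be reproduced with a small NumPy sketch (the 4x4 map with a fading diagonal line is an illustrative stand-in for the example in Figure 8, not the original values):

import numpy as np

# A feature map containing a diagonal line of decreasing intensity.
fmap = np.array([[9., 0., 0., 0.],
                 [0., 6., 0., 0.],
                 [0., 0., 4., 0.],
                 [0., 0., 0., 2.]])

def pool(fmap, op):
    # 2x2 windows with stride 2
    return np.array([[op(fmap[i:i+2, j:j+2]) for j in (0, 2)] for i in (0, 2)])

print(pool(fmap, np.max))    # [[9. 0.] [0. 4.]]: keeps the peaks, drops the fading gradient
print(pool(fmap, np.mean))   # [[3.75 0.] [0. 1.5]]: blurs the line's intensity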

3.7 Key Takeaways


1. Max Pooling vs. Average Pooling: Max pooling is often preferred in classification tasks as it highlights the most dominant features, which are important for pattern recognition, while average pooling is better suited for tasks needing smoother feature representations.

2. Dimensionality Reduction: Pooling operations in CNNs help in reducing the dimensionality of feature maps, making models more computationally efficient without losing significant information.

3. Task-Specific Choices: The choice between max pooling and average pooling
depends on the task; max pooling is generally preferred for tasks requiring feature
emphasis, while average pooling is better for smoother, more general representations.

4. Hybrid or Learnable Pooling: Combining max and average pooling or using learnable pooling methods could provide more flexibility and better performance for specific tasks in CNNs.

4 CNN Architecture
The architecture of a Convolutional Neural Network (CNN) consists of several key components:

• Convolutional Layers (Conv Layers): The architecture begins by defining the number of convolutional layers, which are responsible for feature extraction. These layers utilize parameters such as kernel size (k), number of filters (F), padding (P), and stride (S); for an input of width W, the output width is (W − k + 2P)/S + 1.

• Pooling Layers: Following the convolutional layers, pooling layers help reduce the
dimensionality of the feature maps. The decision to include pooling depends on the
design of the model, as indicated in the diagram.

• Multilayer Perceptron (MLP): After convolution and pooling layers, MLP layers
are added for further feature processing, ultimately performing classification tasks.

• Activation Functions: Activation functions introduce non-linearity into the model, enabling CNNs to learn complex patterns from the input data (a minimal sketch combining these components follows below).
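
Here is one such sketch in PyTorch (the filter counts, kernel sizes, and the 32x32 RGB input are illustrative assumptions, not values from the text):

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # Conv -> ReLU -> Pool feature extractor followed by an MLP head.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # k=3, F=16, P=1, S=1, so spatial size is preserved: (W - k + 2P)/S + 1 = W
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(                    # the MLP classification head
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])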


Figure 9: Workflow

Too Many Options!!!

It is important to note that including too many design choices or parameters in the model
may lead to overfitting and unnecessary complexity.

Story Time!
Imagine you’re at a grocery store, overwhelmed by endless cookie options: chocolate chip,
oatmeal raisin, gluten-free, and more. The sheer variety makes it frustrating to decide,
leaving you second-guessing your choice. Now, imagine the store offered a few curated
options, like "classic" or "healthy." With fewer, thoughtfully selected choices, you could
decide quickly and confidently.

Figure 10: Want a Cookie?

This mirrors a common challenge in deep learning. With countless design choices for
layers, filters, kernel sizes, and activations, deciding on the best configuration can feel
overwhelming. Too many options can lead to unnecessary complexity, overfitting, or poor
performance. Structured benchmarks act like the curated options in the store, helping
simplify these decisions and guide the development of efficient, high-performing models.


ImageNet
One of the most significant breakthroughs in deep learning came with the introduction of
the ImageNet dataset. ImageNet is a vast visual database containing millions of labeled
images across various categories, specifically designed for visual object recognition. It has
played a critical role in the development of deep learning models by enabling the training
of complex CNNs with large amounts of diverse data.


The success of CNN architectures, especially after the introduction of ImageNet, revolutionized the field of computer vision. ImageNet provided a standardized benchmark for
evaluating and comparing deep learning models, enabling rapid advancements in visual
recognition tasks, from image classification to object detection.
By leveraging the ImageNet dataset, researchers and developers were able to train
models capable of achieving high accuracy on real-world image recognition tasks. The
dataset’s scale and complexity have been instrumental in pushing the boundaries of deep
learning in the field of computer vision.

4.1 ImageNet 1K
ImageNet-1K is a widely used dataset in computer vision and deep learning, playing
a pivotal role in advancing image classification and object recognition tasks. It is a
subset of the larger ImageNet dataset and contains 1,000 categories (or classes) with
approximately 1.28 million training images, 50,000 validation images, and 100,000
test images. Each image is labeled with one of the 1,000 categories, which include diverse
objects, animals, and scenes.

4.1.1 Key Features


• Rich Diversity: ImageNet-1K spans a wide variety of categories, from common
objects like “apple” and “chair” to specific breeds of dogs and species of birds.

• High-Quality Labels: The labels are derived from the WordNet hierarchy, ensuring semantic relationships among categories.

• Large-Scale: The dataset’s size and diversity make it ideal for training and evaluating large-scale deep learning models.

4.1.2 Importance in Deep Learning


• Benchmark for Architectures: ImageNet-1K became a standard benchmark
for evaluating the performance of convolutional neural networks (CNNs) and other
models. Breakthroughs such as AlexNet (2012), VGG, ResNet, and EfficientNet
were first tested on this dataset.


• Transfer Learning: Models pre-trained on ImageNet-1K are often fine-tuned for downstream tasks like object detection and semantic segmentation, significantly improving performance on smaller, domain-specific datasets (see the sketch after this list).

• Catalyst for Research: The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), based on ImageNet-1K, spurred innovation in CNN architectures, optimization techniques, and hardware development.
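
A minimal sketch of that transfer-learning recipe (assuming a recent torchvision; ResNet-18 and the 5-class head are illustrative choices, not prescribed by the text):

import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet-1K.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained feature extractor...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the 1000-class head with one sized for the downstream task.
num_classes = 5   # hypothetical smaller, domain-specific dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new head's parameters are now trainable.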

Figure 12: ImageNet-1K

4.1.3 Limitations
• Bias and Imbalance: Despite its diversity, ImageNet-1K reflects cultural and
geographical biases present in its source data.

• Focus on Object Recognition: It emphasizes object categories, which may not be ideal for tasks requiring fine-grained or contextual understanding.

• Data Accessibility: While widely used in research, its licensing restricts direct
commercial use.

4.1.4 Legacy
ImageNet-1K has been a cornerstone in computer vision research, shaping how neural networks are designed and evaluated. Although newer datasets and challenges have emerged, it remains a foundational tool for developing and benchmarking image classification models.

5 MNIST Dataset
The MNIST dataset (Modified National Institute of Standards and Technology) is one
of the most popular datasets in machine learning and computer vision. It is primarily
used for training and testing image classification models, especially for handwritten digit
recognition.


5.1 The Postcard Story


In the 1980s, researchers in the field of computer vision were trying to solve the challenge
of recognizing handwritten text. One major use case that highlighted the need for this
technology was reading and interpreting postal addresses on envelopes and postcards. The
problem was that traditional OCR (Optical Character Recognition) systems struggled
with the wide variety of handwriting styles, making it difficult to process incoming mail
efficiently.
The U.S. Postal Service (USPS) and other postal organizations faced an increasing volume of mail, and they needed a system that could automatically read and sort envelopes based on the handwritten addresses. The variation in handwriting was a significant challenge: different people write in very different ways, making it difficult for computers to recognize the characters.

Figure 13: The Postcard Story of MNIST

5.2 What was done?


To address this, researchers at the National Institute of Standards and Technology (NIST)
began creating datasets of handwritten digits, which would help train algorithms to recognize these digits in real-world scenarios like postal address recognition. However, this
required a large and diverse set of examples, capturing the variation in handwriting that
would appear on real mail.
Yann LeCun, Corinna Cortes, and Christopher J.C. Burges introduced MNIST in
1998. They modified and standardized NIST data by normalizing the images, resizing
them to 28 × 28 pixels, and ensuring a consistent format. Thus, the MNIST dataset
was born, made up of a collection of handwritten digits originally gathered by NIST from Census Bureau employees and high school students. These digits were carefully processed and digitized, providing a standardized set
that researchers could use to train and test their machine learning algorithms. MNIST was
intended to serve as a benchmark for evaluating the performance of algorithms designed
to recognize handwritten digits, specifically to tackle challenges like reading and sorting
postal mail.


Figure 14: MNIST Data

5.3 Features of MNIST


• Size: Contains 60,000 training images and 10,000 test images of handwritten digits
(0 to 9).

• Image Format: Grayscale images of size 28 × 28 pixels.

• Labels: Each image corresponds to a digit (0 to 9), making it a 10-class classification problem.

• Preprocessing: Images are centered and normalized for consistency (a minimal loading sketch follows below).
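
A loading sketch, assuming torchvision is available (the normalization constants are the commonly quoted MNIST mean and standard deviation):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                        # 28x28 grayscale -> [1, 28, 28] tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),   # commonly quoted MNIST mean/std
])

train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))   # 60000 10000
image, label = train_set[0]
print(image.shape, label)              # torch.Size([1, 28, 28]) 5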

5.4 Why MNIST is Important


• Benchmark Dataset: It became a standard benchmark for evaluating image classification algorithms.

• Ease of Use: Its simplicity allows beginners to experiment with neural networks
without complex preprocessing.

• Historical Significance: It played a crucial role in the development of convolutional neural networks (CNNs).

6 LeNet: The Pioneer CNN Architecture


LeNet-5, introduced by Yann LeCun in 1998, is one of the first convolutional neural
networks (CNNs) designed for image recognition tasks, specifically for handwritten digit
recognition using the MNIST dataset.

6.1 Why was LeNet needed?


Traditional machine learning algorithms struggled with image data due to:


• Feature Extraction: Manually crafting features for images was tedious and prone
to errors.

• High Dimensionality: Images have a large number of pixels, making them difficult
to process without efficient algorithms.

LeNet addressed these challenges by learning features automatically through convolutional and pooling layers, drastically improving performance.

6.2 Architecture of LeNet-5


LeNet-5 consists of the following layers:

1. Input Layer: Accepts 32 × 32 grayscale images.

2. Convolutional Layer 1 (C1): Applies 6 filters of size 5 × 5, resulting in 6 feature maps of size 28 × 28.

3. Subsampling Layer 1 (S2): Averages the values in 2×2 regions, reducing feature
maps to size 14 × 14.

4. Convolutional Layer 2 (C3): Applies 16 filters of size 5 × 5, resulting in 16 feature maps of size 10 × 10.

5. Subsampling Layer 2 (S4): Further reduces size to 5 × 5.

6. Fully Connected Layer (F5): Connects the flattened feature maps to a 120-neuron layer.

7. Output Layer: Produces 10 outputs, corresponding to the 10 digit classes (see the sketch below).
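
A minimal PyTorch sketch of the layer list above (note: the original LeNet-5 also includes an 84-unit F6 layer between F5 and the output, omitted here to match the list; tanh activations stand in for the original units):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)    # 32x32 -> 6 maps of 28x28
        self.s2 = nn.AvgPool2d(2, 2)                # -> 6 maps of 14x14
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)   # -> 16 maps of 10x10
        self.s4 = nn.AvgPool2d(2, 2)                # -> 16 maps of 5x5
        self.f5 = nn.Linear(16 * 5 * 5, 120)        # flattened maps -> 120 neurons
        self.out = nn.Linear(120, 10)               # 10 digit classes
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.s2(self.act(self.c1(x)))
        x = self.s4(self.act(self.c3(x)))
        x = torch.flatten(x, 1)
        return self.out(self.act(self.f5(x)))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])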

Figure 15: Architecture


6.3 Key Features


• Automatic Feature Extraction: Learns relevant features directly from data using convolutional layers.

• Translation Invariance: Pooling layers ensure that the model is robust to small
shifts in the input image.

• Efficiency: LeNet was computationally efficient for its time, enabling practical use
in digit recognition systems.

6.4 Impact of LeNet


• Foundation for CNNs: LeNet laid the groundwork for modern CNN architectures
like AlexNet, VGG, and ResNet.

• Commercial Applications: It was used in bank check processing and other real-
world systems.

• Demonstration of Deep Learning: Showed that neural networks could outperform traditional methods for visual tasks.

7 AlexNet
In 2012, a groundbreaking moment in the field of computer vision occurred, largely attributed to a deep convolutional neural network (CNN) called AlexNet. Developed
by Alex Krizhevsky, along with his advisor Geoffrey Hinton and colleague Ilya
Sutskever, AlexNet played a pivotal role in revolutionizing the way machines interpret
images and kick-started the deep learning revolution in artificial intelligence.
Before AlexNet, computer vision tasks such as image classification were traditionally
handled by shallow machine learning algorithms or manually designed feature extraction
techniques. These methods had some success, but their performance was limited, particularly when dealing with complex tasks like recognizing objects in large, high-resolution
images.

The ImageNet Challenge


The breakthrough came when Krizhevsky and his team decided to apply deep learning, specifically convolutional neural networks, to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. This annual competition had been ongoing for several years, with the task being to classify and detect objects in large-scale image datasets.
Prior to AlexNet’s involvement, the best-performing algorithms used handcrafted features like SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of Oriented Gradients), which required manual tuning and were computationally expensive. These methods performed well in limited contexts but struggled to handle the scale and variability of the ImageNet challenge, which featured 1,000 classes of objects, each with thousands of images.


Figure 16: AlexNet Architecture

7.1 Key Features


AlexNet made several key innovations that allowed it to outperform existing methods:

7.1.1 Deep Architecture


AlexNet had a deep architecture, consisting of eight layers — five convolutional layers
followed by three fully connected layers. This deep structure enabled it to learn complex
features and patterns in images, which shallow models couldn’t capture.

7.1.2 ReLU Activation


Instead of using the traditionally slower sigmoid or tanh activation functions, AlexNet
utilized the ReLU (Rectified Linear Unit) activation function. ReLU is computationally efficient, helps avoid the vanishing gradient problem, and speeds up training. This
was a significant factor in AlexNet’s success.

7.1.3 GPUs for Training


One of the most crucial innovations was the use of Graphics Processing Units (GPUs)
for training the network. AlexNet was trained on two GPUs, which significantly accelerated the training process compared to traditional CPU-based methods. This allowed the
model to learn from a massive amount of data in a reasonable amount of time, something
previously unfeasible.

7.1.4 Data Augmentation and Dropout


To prevent overfitting, AlexNet utilized data augmentation techniques like random
cropping, flipping, and color variation. This artificially expanded the training dataset.
Additionally, dropout was employed in the fully connected layers to reduce overfitting
by randomly setting some of the neurons to zero during training.
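
A sketch of both ideas with torchvision (the augmentation parameters are illustrative; AlexNet’s exact recipe differed in detail):

import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random cropping
    transforms.RandomHorizontalFlip(),           # flipping
    transforms.ColorJitter(brightness=0.4,       # color variation
                           contrast=0.4,
                           saturation=0.4),
    transforms.ToTensor(),
])

# Dropout in the fully connected layers: each activation is zeroed
# with probability p during training (p=0.5 as in AlexNet).
fc_head = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)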


7.1.5 Local Response Normalization (LRN)


AlexNet introduced Local Response Normalization, which helped in improving generalization by normalizing the activations of neurons within local regions of the network,
aiding the model in learning more robust features.

7.2 Results and Impact


The impact of AlexNet was immediate and dramatic. At the ILSVRC 2012, AlexNet
achieved an impressive top-5 error rate of 16.4%, a substantial improvement over the
second-place submission with a top-5 error rate of 26.2%. This victory marked the dawn
of deep learning as the dominant approach to solving many computer vision tasks.
The success of AlexNet helped reignite interest in neural networks and deep learning,
which had seen a decline in popularity due to limitations in computational power and lack
of large datasets. By demonstrating the power of deep convolutional networks, AlexNet
contributed to a surge in research and development in deep learning, leading to the rapid
advancement of the field.

7.3 Legacy
AlexNet’s success also paved the way for even deeper and more sophisticated architectures,
including VGGNet, GoogLeNet, ResNet, and other models, each building upon the
principles established by AlexNet. Today, convolutional neural networks (CNNs) are the
standard for image classification, object detection, and various other computer vision
tasks.
AlexNet’s legacy extends beyond computer vision — it was a key factor in the
widespread adoption of deep learning for a variety of AI applications, including natural
language processing, speech recognition, and reinforcement learning.

7.4 Key Takeaways


1. Deep Network Architecture: AlexNet showed that deep architectures with multiple convolutional layers could significantly improve performance in computer vision
tasks.

2. ReLU Activation: The use of ReLU as an activation function helped speed up training and improve performance compared to traditional activation functions like sigmoid and tanh.

3. GPU Utilization: Training on GPUs enabled AlexNet to process large datasets and learn complex patterns more efficiently than ever before.

4. Data Augmentation and Dropout: Techniques like data augmentation and dropout were crucial in reducing overfitting and improving generalization.

5. Impact on Deep Learning: AlexNet’s success marked the beginning of the deep
learning revolution, influencing a wide range of applications beyond computer vision,
including natural language processing and reinforcement learning.


8 LeNet vs AlexNet

Figure 17: LeNet vs AlexNet

LeNet was originally developed for handwritten digit recognition, specifically designed
to classify images in the MNIST dataset. Its simpler architecture, using small grayscale inputs (28x28x1 MNIST digits, padded to 32x32 at the input layer), was effective for this task. On the other hand, AlexNet, with its much
deeper and more complex architecture, was designed to tackle the much larger and more
varied ImageNet dataset, consisting of high-resolution images (224x224x3) from 1,000
different categories. AlexNet’s success in classifying complex images marked a significant
breakthrough in deep learning, particularly in computer vision.

9 VGGNet
VGGNet, developed by the Visual Geometry Group at the University of Oxford, is a
deep convolutional neural network architecture. It became famous for its simplicity and
effectiveness in handling large-scale image classification tasks. VGGNet is known for its
deep architecture with very small 3x3 convolutional filters, which allowed it to capture
detailed hierarchical features in images. The model made significant contributions to the
field of computer vision, especially for image classification challenges like ImageNet.

9.1 Why was it developed?


Before VGGNet, architectures like LeNet and AlexNet paved the way for deep learning in image recognition. However, there was a need for a more uniform and deeper
architecture that could handle high-resolution images and learn more complex features.
VGGNet emerged to address these limitations by leveraging deep layers of small convo-
lutional filters, offering a balanced approach that achieved state-of-the-art performance
while maintaining simplicity in design.


Figure 18: Architecture of VGGNet-CNN

9.2 Key Features


• Deep Architecture: VGGNet consists of 16 to 19 layers, making it significantly
deeper than its predecessors like AlexNet.

• Small Convolutional Filters: It uses small 3x3 convolutional filters, stacked in multiple layers, which helps in capturing finer details of the image.

• Uniform Architecture: The architecture is uniform, with only 3x3 convolutions and 2x2 max-pooling layers used throughout, simplifying the network design.

• ReLU Activation: ReLU activation functions are used after each convolutional
layer, helping in faster training and preventing vanishing gradients.

• Fully Connected Layers: At the end of the convolutional layers, VGGNet uses three fully connected layers, which make the final classification decision (see the block sketch below).
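
A minimal sketch of this uniform design (a VGG-style block builder; the channel progression loosely follows VGG-16’s first stages and is illustrative):

import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    # Stacked 3x3 convolutions (padding 1), each followed by ReLU,
    # then a 2x2 max pool that halves the spatial size.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

features = nn.Sequential(
    vgg_block(3, 64, 2),      # 224 -> 112
    vgg_block(64, 128, 2),    # 112 -> 56
    vgg_block(128, 256, 3),   # 56 -> 28
)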

9.3 Key Takeaways


1. Simplicity in Design: VGGNet’s uniform use of small convolutional filters made
it a simple yet powerful model for image classification.

2. Deep Learning Success: VGGNet demonstrated that deeper networks could achieve impressive performance on challenging image classification tasks.

3. Influential in Computer Vision: The architecture set a benchmark for subsequent deep learning models, influencing the design of later networks like ResNet.

4. Transfer Learning: VGGNet’s pre-trained weights have been widely used in transfer learning applications, where the model is adapted for various tasks in computer vision.

