Module 05
2. Architecture of CNNs
Typical Layers in a CNN:
1. Input Layer – Raw data (e.g., image of size 28x28x3)
2. Convolutional Layer – Applies filters (kernels) to extract features.
3. Activation Function (ReLU) – Adds non-linearity.
4. Pooling Layer – Reduces dimensionality (e.g., Max Pooling).
5. Fully Connected Layer – Final decision-making layer.
6. Output Layer – Gives final prediction (e.g., softmax for classification).
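A minimal PyTorch sketch of this layer stack (the channel counts, the 28x28x3 input, and the 10 output classes are illustrative assumptions, not prescribed by these notes):

import torch
import torch.nn as nn

# Typical CNN layer sequence: conv -> ReLU -> pool -> fully connected -> softmax.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer (filters)
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # pooling: 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # fully connected layer
    nn.Softmax(dim=1),                           # output class probabilities
)

x = torch.randn(1, 3, 28, 28)  # one dummy RGB image
print(model(x).shape)          # torch.Size([1, 10])

In practice the final Softmax is usually folded into the loss function during training rather than kept inside the model.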
3. Convolutional Layer
How It Works:
• Applies a filter/kernel over the input image to compute a feature map.
• Each filter detects specific features (edges, textures, etc.).
Numerical Example:
• Input: 5x5 image
• Filter: 3x3
• Stride: 1
• Output: (5 - 3)/1 + 1 = 3, i.e., a 3x3 feature map (using output size = (N - F)/S + 1, with no padding)
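A worked sketch of this arithmetic in plain NumPy (the input values and the averaging filter are arbitrary assumptions, chosen only to show the shapes):

import numpy as np

# Slide a 3x3 filter over a 5x5 input with stride 1 and no padding.
# Output side length = (N - F)/S + 1 = (5 - 3)/1 + 1 = 3.
image = np.arange(25, dtype=float).reshape(5, 5)  # dummy 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # simple 3x3 averaging filter

out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out.shape)  # (3, 3), matching the formula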
4. Training a Convolutional Neural Network
Forward Propagation
• The input image passes through the convolution, activation (ReLU), pooling, and fully connected layers.
• Output is a prediction (e.g., class probabilities).
Loss Computation
• Compare the prediction with the ground truth using a loss function.
• Common loss: Cross-Entropy Loss for classification.
Backpropagation
• Compute the gradient of the loss w.r.t. all trainable parameters using the chain rule.
Weight Update (Optimization)
• Update weights using Gradient Descent or variants like Adam, RMSProp, etc.
w ← w − η ∂L/∂w, where η is the learning rate.
Repeat for all epochs (multiple passes through the dataset).
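Taken together, these steps map onto a few lines of PyTorch. A minimal sketch of the training loop; the tiny linear model, the dummy data, and the learning rate of 1e-3 are placeholders, not values from these notes:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # eta = 1e-3
loss_fn = nn.CrossEntropyLoss()                              # common classification loss

images = torch.randn(8, 1, 28, 28)   # dummy batch
labels = torch.randint(0, 10, (8,))  # dummy ground-truth classes

for epoch in range(2):               # repeat for all epochs
    logits = model(images)           # forward propagation
    loss = loss_fn(logits, labels)   # loss computation
    optimizer.zero_grad()
    loss.backward()                  # backpropagation via the chain rule
    optimizer.step()                 # weight update: w <- w - eta * grad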
Key Points:
• CNNs are used for image, speech, and video recognition.
• They reduce the number of parameters significantly compared to fully connected
networks.
• CNNs learn spatial hierarchies of features from input data.
Example:
Classifying handwritten digits (0-9) using the MNIST dataset.
Convolutional Neural Networks (CNNs) have evolved over the years with many architectures
designed to solve increasingly complex image recognition and classification tasks. Below are
the most important and widely used CNN architectures:
1. LeNet – The Earliest Widely Used CNN
LeNet is a simple yet powerful model that has been used for tasks such as handwritten digit recognition, traffic sign recognition, and face detection. Although LeNet was developed more than two decades ago (LeNet-5 dates to 1998), its architecture is still relevant today and continues to be used.
2. AlexNet – The Deep Learning Architecture That Popularized CNNs
AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. The network had an architecture very similar to LeNet's, but it was deeper and bigger, with convolutional layers stacked directly on top of each other. AlexNet was the first large-scale CNN and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. The architecture was designed for large-scale image datasets and achieved state-of-the-art results at the time of its publication. AlexNet is composed of 5 convolutional layers interleaved with max-pooling layers, followed by 3 fully connected layers and 2 dropout layers. The activation function used in the hidden layers is ReLU, and the output layer uses Softmax. The total number of parameters in this architecture is around 60 million.
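The 60-million figure can be checked against torchvision's reference AlexNet implementation (assuming a reasonably recent torchvision; the model is built with random weights, so nothing is downloaded):

import torchvision.models as models

alexnet = models.alexnet()  # reference implementation, random weights
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"{n_params:,}")      # about 61 million parameters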
3. ZF Net – ILSVRC 2013 Winner
ZF Net was developed by Matthew Zeiler and Rob Fergus and won ILSVRC 2013. The network has relatively fewer parameters than AlexNet, yet it outperforms it on the ILSVRC 2012 classification task. ZF Net is essentially AlexNet with tweaked architecture hyperparameters: the middle convolutional layers were expanded, and the filter size and stride of the first layer were reduced, from 11x11 with stride 4 to 7x7 with stride 2, so that finer spatial detail is preserved early in the network (see the sketch below). Zeiler and Fergus also introduced deconvolutional networks ("deconvnets") to visualize the intermediate feature activations of their ImageNet-trained model; these visualizations are what guided the architectural changes over AlexNet.
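A sketch of the first-layer change, assuming 224x224 RGB inputs (the padding values are illustrative choices):

import torch
import torch.nn as nn

alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)  # AlexNet's first layer
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)     # ZF Net's smaller filter and stride

x = torch.randn(1, 3, 224, 224)
print(alexnet_conv1(x).shape)  # torch.Size([1, 96, 55, 55])
print(zfnet_conv1(x).shape)    # torch.Size([1, 96, 110, 110]): finer spatial detail retained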
4. GoogLeNet – CNN Architecture Used by Google
GoogLeNet is the CNN architecture Google used to win the ILSVRC 2014 classification task. It was developed by Christian Szegedy and colleagues at Google. It achieved a notably lower error rate than the previous winners AlexNet (ILSVRC 2012) and ZF Net (ILSVRC 2013), and its error rate was also lower than that of VGG (the 2014 runner-up). It achieves a deeper architecture while keeping computation manageable by employing a number of distinct techniques, including 1x1 convolutions and global average pooling. Its core building block is the Inception module, which runs 1x1, 3x3, and 5x5 convolutions and max pooling in parallel and concatenates the results; the 1x1 convolutions act as dimensionality-reduction bottlenecks that keep the number of learned parameters small, and global average pooling replaces most of the fully connected layers at the top of the network. Real-world applications of the GoogLeNet architecture include the Street View House Numbers (SVHN) digit recognition task, which is often used as a proxy for roadside object detection. A simplified sketch of the Inception building block follows.
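The sketch below builds one Inception block in PyTorch (the branch widths are illustrative assumptions; GoogLeNet's published modules use different channel counts at each stage):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1, 3x3, 5x5 convolutions and max pooling, concatenated
    # along the channel axis; the 1x1 convolutions act as cheap
    # dimensionality-reduction bottlenecks.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),            # bottleneck
            nn.Conv2d(8, 16, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),            # bottleneck
            nn.Conv2d(8, 16, kernel_size=5, padding=2))
        self.bp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 32, 28, 28)
print(InceptionBlock(32)(x).shape)  # torch.Size([1, 64, 28, 28])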
5. ResNet – CNN Architecture Also Used for NLP Tasks Apart from Image Classification
ResNet is the CNN architecture developed by Kaiming He et al.; it won the ILSVRC 2015 classification task with a top-five error of only 3.57%. The winning network has 152 layers and roughly 60 million parameters, which is considered very deep even for CNNs, and training a network of this depth is computationally demanding. Its key innovation is the residual (skip) connection: each block adds its input back to its output, which lets gradients flow through very deep networks and mitigates the vanishing-gradient problem. CNNs are mostly used for image classification tasks, but ResNet shows that the same ideas can be applied successfully to natural language processing problems such as sentence completion and machine comprehension, where residual connections were used by the Microsoft Research Asia team in 2016 and 2017. Real-life applications of the ResNet architecture include Microsoft's machine comprehension systems, which applied such networks to large-scale question answering. ResNet is computationally efficient and can be scaled up or down to match the computational power of GPUs.
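A minimal sketch of the residual idea (the channel count is an illustrative assumption; real ResNet blocks also include batch normalization):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two convolutions whose output is added back to the input,
    # so gradients can flow through the identity skip connection.
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # F(x) + x: the skip connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # shape preserved: [1, 64, 56, 56]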
6. MobileNets – CNN Architecture for Mobile Devices
MobileNets are CNNs designed to fit on a mobile device and classify images or detect objects with low latency. They were developed by Andrew G. Howard et al. at Google. MobileNets are very small CNN architectures, which makes them easy to run in real time on embedded devices such as smartphones and drones. Their key idea is the depthwise separable convolution, which factors a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, drastically reducing computation and parameters (see the sketch below). The architecture is also flexible: width and resolution multipliers let it be scaled to the available compute while remaining competitive with much larger architectures such as VGGNet. Real-life examples of MobileNets include the CNNs built into Android phones to run Google's Mobile Vision API, which can automatically identify labels of popular objects in images.
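A sketch of the depthwise separable factorization and its parameter savings (the 32 and 64 channel counts are arbitrary assumptions):

import torch.nn as nn

in_ch, out_ch = 32, 64
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise: one filter per channel
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise: 1x1 channel mixing
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 18496 vs. 2432 parameters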
Purpose of LRN (Local Response Normalization)
LRN, used notably in AlexNet, normalizes a neuron's activation by a term that grows with the squared activations of neighboring channels at the same spatial position.
• To encourage competition between neurons (like contrast enhancement).
• To normalize the outputs of neurons in the same region.
• To help the model learn more diverse and informative features.
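PyTorch still ships LRN as a module; a minimal usage sketch with the hyperparameters from the AlexNet paper (size=5, alpha=1e-4, beta=0.75, k=2):

import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
x = torch.randn(1, 96, 55, 55)  # e.g., activations shaped like AlexNet's first conv layer
print(lrn(x).shape)             # shape unchanged; each activation is divided by a term
                                # that grows with its neighbors' squared activations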
Advantages of LRN:
• Regularization – Acts like dropout by reducing overfitting.
• Improved Feature Diversity – Boosts discriminative features by encouraging competition.
• Smooth Training – Helps stabilize gradients early in training.
Disadvantages:
• Computationally Costly – Adds overhead due to normalization calculations.
• Rarely Used Now – Replaced by Batch Normalization in modern CNNs.
• Limited Improvement – Doesn't significantly boost performance in deeper networks.
Applications:
• Used in AlexNet for image classification on ImageNet.
• Helpful in shallow CNNs for visual pattern extraction.
• Less common today due to the superiority of BatchNorm.