
Module 05:

Convolutional Neural Networks: Supervised Learning

Contents: Convolutional Neural Networks, Types of CNN, Learning Vectorial Representations of Words, Local Response Normalization, Training a Convolutional Network

1. Introduction to CNNs (Convolutional Neural Networks)


CNNs are a class of deep, feed-forward artificial neural networks, most commonly applied to
analyzing visual imagery. They are inspired by the organization of the animal visual cortex.
Yann LeCun, founding director of Facebook’s AI Research group, pioneered convolutional neural
networks. In 1989, he built one of the first, LeNet, which was used for character recognition tasks
like reading zip codes and digits.
Have you ever wondered how facial recognition works on social media, or how object detection
helps in building self-driving cars, or how disease detection is done using visual imagery in
healthcare? It’s all possible thanks to convolutional neural networks (CNN). Here’s an example
of convolutional neural networks that illustrates how they work:
Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some
other object. The first thing you do is feed the pixels of the image in the form of arrays to the
input layer of the neural network (multi-layer networks used to classify things). The hidden
layers carry out feature extraction by performing different calculations and manipulations.
There are multiple hidden layers, such as the convolution layer, the ReLU layer, and the pooling layer,
that perform feature extraction from the image. Finally, there’s a fully connected layer that
identifies the object in the image.

2. Architecture of CNNs
Typical Layers in a CNN:
1. Input Layer – Raw data (e.g., image of size 28x28x3)
2. Convolutional Layer – Applies filters (kernels) to extract features.
3. Activation Function (ReLU) – Adds non-linearity.
4. Pooling Layer – Reduces dimensionality (e.g., Max Pooling).
5. Fully Connected Layer – Final decision-making layer.
6. Output Layer – Gives final prediction (e.g., softmax for classification).
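
To make the layer ordering concrete, here is a minimal sketch in PyTorch. This is a sketch only: the channel counts, layer sizes, and class count are illustrative assumptions, not from the original text.

```python
# Minimal CNN following the layer order above, for a 28x28x3 input.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 2. convolutional layer
            nn.ReLU(),                                   # 3. non-linearity
            nn.MaxPool2d(2),                             # 4. pooling: 28 -> 14
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 14 * 14, num_classes),        # 5. fully connected layer
        )

    def forward(self, x):
        # 6. output: raw logits; apply softmax for class probabilities
        return self.classifier(self.features(x))

model = SimpleCNN()
print(model(torch.randn(1, 3, 28, 28)).shape)  # torch.Size([1, 10])
```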

3. Convolutional Layer
How It Works:
• Applies a filter/kernel over the input image to compute a feature map.
• Each filter detects specific features (edges, textures, etc.).
Numerical Example:
• Input: 5x5 image
• Filter: 3x3
• Stride: 1
• Output: (N − F)/S + 1 = (5 − 3)/1 + 1 = 3, i.e. a 3x3 feature map (checked in the sketch below)
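
The same example can be verified numerically. The NumPy sketch below (with illustrative values) slides a 3x3 filter over a 5x5 input with stride 1 and confirms the 3x3 output:

```python
# Sliding a 3x3 filter over a 5x5 input with stride 1 -> 3x3 feature map.
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input (toy values)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)       # 3x3 edge-like filter
stride = 1

out_size = (image.shape[0] - kernel.shape[0]) // stride + 1   # (5-3)/1 + 1 = 3
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(feature_map.shape)  # (3, 3)
```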
4. Training a Convolutional Neural Network
Forward Propagation
• Input image passes through convolution, activation (ReLU), pooling, fully connected
layers.
• Output is a prediction (e.g., class probabilities).
Loss Computation
• Compare the prediction with the ground truth using a loss function.
• Common loss: Cross-Entropy Loss for classification.
Backpropagation
• Compute the gradient of the loss w.r.t. all trainable parameters using the chain rule.
Weight Update (Optimization)
• Update weights using Gradient Descent or variants like Adam, RMSProp, etc.
• The basic update rule is $w \leftarrow w - \eta \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate.
Repeat for all epochs (multiple passes through the dataset).
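
Putting the four steps together, here is a hedged sketch of one training epoch in PyTorch. The names are illustrative: `SimpleCNN` is the model sketched earlier, and the data loader is assumed to yield (image, label) batches.

```python
# One training epoch: forward pass, loss, backpropagation, weight update.
import torch
import torch.nn as nn

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()                          # loss computation
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # eta = 1e-3

def train_one_epoch(loader):
    for images, labels in loader:
        optimizer.zero_grad()
        outputs = model(images)           # forward propagation
        loss = criterion(outputs, labels)  # compare prediction with ground truth
        loss.backward()                    # backpropagation via the chain rule
        optimizer.step()                   # weight update: w <- w - eta * grad

# Repeat train_one_epoch(...) once per epoch over the dataset.
```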

Fig: Convolutional Neural Network to identify the image of a bird

Key Points:
• CNNs are used for image, speech, and video recognition.
• They reduce the number of parameters significantly compared to fully connected
networks.
• CNNs learn spatial hierarchies of features from input data.
Example:
Classifying handwritten digits (0-9) using the MNIST dataset.

A convolutional neural network is a feed-forward neural network that is generally used to
analyze visual images by processing data with a grid-like topology. It’s also known as
a ConvNet. A convolutional neural network is used to detect and classify objects in an image.
Below is a neural network that identifies two types of flowers: Orchid and Rose.
Types of CNN

Convolutional Neural Networks (CNNs) have evolved over the years with many architectures
designed to solve increasingly complex image recognition and classification tasks. Below are
the most important and widely used CNN architectures:

1. LeNet – First CNN Architecture


LeNet was developed in 1998 by Yann LeCun and his collaborators (Léon Bottou, Yoshua
Bengio, and Patrick Haffner) for handwritten digit recognition problems. LeNet was one of the
first successful CNNs and is often considered the “Hello World” of deep learning. It is one of
the earliest and most widely used CNN architectures and has been successfully applied to tasks
such as handwritten digit recognition. The LeNet-5 architecture consists of alternating
convolutional and pooling (subsampling) layers followed by fully connected layers: two
convolutional layers, each followed by a pooling layer, and then three fully connected layers
(counting the output layer). LeNet was the beginning of CNNs in deep learning for computer
vision problems. Very deep networks of that era were hard to train because of the vanishing
gradient problem, so LeNet kept its depth modest; the pooling layers between convolutions
reduce the spatial size of the feature maps, which cuts the number of parameters, helps prevent
overfitting, and lets the network train more effectively. The diagram below represents the
LeNet-5 architecture.

The LeNet CNN is a simple yet powerful model that has been used for various tasks such as
handwritten digit recognition, traffic sign recognition, and face detection. Although LeNet was
developed more than 20 years ago, its architecture is still relevant today and continues to be
used.
2. AlexNet – Deep Learning Architecture that popularized CNN

AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. AlexNet
had a very similar architecture to LeNet, but was deeper, bigger, and featured
convolutional layers stacked directly on top of each other. AlexNet was the first large-scale CNN and
was used to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
The AlexNet architecture was designed to be used with large-scale image datasets and it
achieved state-of-the-art results at the time of its publication. AlexNet is composed of 5
convolutional layers with a combination of max-pooling layers, 3 fully connected layers, and
2 dropout layers. The activation function used in all hidden layers is ReLU. The activation function used
in the output layer is Softmax. The total number of parameters in this architecture is around 60
million.

3. ZF Net:
ZFNet is a CNN architecture that combines convolutional layers with fully connected layers.
ZF Net was developed by Matthew Zeiler and Rob Fergus and was the ILSVRC 2013 winner.
It improved on AlexNet by tweaking the architecture hyperparameters: in particular, it
expanded the size of the middle convolutional layers and made the stride and filter size of the
first layer smaller (a 7x7 filter with stride 2 in place of AlexNet’s 11x11 filter with stride 4),
which preserves more information in the early layers. Trained on the ImageNet dataset like
AlexNet, it outperformed AlexNet on the ILSVRC 2012 classification task. The distinctive
contribution of the Zeiler and Fergus work is the use of deconvolutional networks
(“deconvnets”) to visualize which input patterns activate the feature maps in the middle of the
CNN; these visualizations are what guided the hyperparameter changes that improved on
AlexNet.
4. GoogLeNet – CNN Architecture used by Google
GoogLeNet is the CNN architecture used by Google to win the ILSVRC 2014 classification task.
It was developed by Christian Szegedy and his colleagues at Google. It achieved a notably
reduced error rate in comparison with previous winners AlexNet (ILSVRC 2012 winner) and
ZFNet (ILSVRC 2013 winner), and its error was also somewhat lower than that of VGG (the
2014 runner-up). It achieves a deeper architecture (22 weight layers) by employing a number of
distinct techniques, including the Inception module, 1×1 convolutions used as dimensionality
reductions to cut the number of parameters that must be learned, and global average pooling in
place of large fully connected layers at the end of the network; auxiliary classifiers attached to
intermediate layers help gradients flow during training. As a result, GoogLeNet is much deeper
than AlexNet yet has far fewer parameters. Real-world applications/examples of the GoogLeNet
CNN architecture include the Street View House Number (SVHN) digit recognition task, which
is often used as a proxy for roadside object detection. Below is the simplified block diagram
representing the GoogLeNet CNN architecture:

5. VGGNet – CNN Architecture with Large Filters


VGGNet is the CNN architecture that was developed by Karen Simonyan and Andrew
Zisserman at Oxford University. The name “VGG16” stands for “Visual Geometry Group
16”: it comes from the Visual Geometry Group at the University of Oxford, where this neural
network architecture was developed, and the “16” indicates that the model contains 16 layers
that have weights, counting both convolutional and fully connected layers.
VGG16 has roughly 138 million parameters and was trained on the ImageNet dataset (about
1.2 million training images over 1000 classes). It takes input images of 224 x 224 pixels, stacks
small 3x3 convolutional filters throughout, and ends with two fully connected layers of 4096
units each. Networks this large are expensive to train and need a lot of data, which is one reason
more parameter-efficient architectures such as GoogLeNet can be preferable for many image
classification tasks. VGGNet was the runner-up in the ILSVRC 2014 classification task, which
was won by GoogLeNet. Nevertheless, the VGG model serves as a strong baseline for many
applications in computer vision thanks to its simple, uniform design and its applicability to
numerous tasks including object detection. Its deep feature representations are used across
multiple neural network architectures like YOLO, SSD, etc. The diagram below represents the
standard VGG16 network architecture:

6. ResNet – CNN architecture that has also been used for NLP tasks apart from image
classification
ResNet is the CNN architecture that was developed by Kaiming He et al. to win the ILSVRC
2015 classification task with a top-five error of only 3.57%. The deepest variant entered in that
competition has 152 layers, which is considered very deep even for CNNs; training at such
depth is practical only because of the skip (shortcut) connections, which let gradients flow
through the network and counter the vanishing gradient problem. CNNs are mostly used for
image classification tasks, but residual connections have also been adopted successfully in
models for natural language processing problems such as machine comprehension, for example
in systems built by Microsoft Research. The ResNet architecture is computationally efficient
for its depth and can be scaled up or down (ResNet-50, ResNet-101, ResNet-152) to match the
computational power of the available GPUs.
7. MobileNets – CNN Architecture for Mobile Devices
MobileNets are CNNs that can fit on a mobile device to classify images or detect objects with
low latency. MobileNets were developed by Andrew G. Howard et al. at Google. They are very
small CNN architectures built from depthwise separable convolutions, which sharply reduce
computation and parameter count and make them easy to run in real time on embedded devices
like smartphones and drones. The architecture is also flexible: width and resolution multipliers
let it be scaled to the available hardware while remaining far cheaper to run than larger
architectures like VGGNet at comparable accuracy. Real-life examples of the MobileNets CNN
architecture include the CNNs built into Android phones to run Google’s Mobile Vision API,
which can automatically identify labels of popular objects in images.

8. GoogLeNet_DeepDream – Generate images based on CNN features

GoogLeNet_DeepDream refers to the DeepDream technique developed by Alexander
Mordvintsev, Christopher Olah, et al. at Google. It uses the Inception (GoogLeNet) network to
generate images based on CNN features: gradient ascent on the input image amplifies whatever
patterns the chosen layers have learned to detect. The technique is typically used with
ImageNet-trained networks to generate psychedelic images or create abstract artworks.
To summarize the different types of CNN architectures described above in an easy-to-remember
form, you can use the following table:

| Architecture | Year | Key Features | Use Case |
|---|---|---|---|
| LeNet | 1998 | First successful application of CNNs; 5 layers (alternating convolutional and pooling); used tanh/sigmoid activation functions | Recognizing handwritten and machine-printed characters |
| AlexNet | 2012 | Deeper and wider than LeNet; used ReLU activation function; implemented dropout layers; used GPUs for training | Large-scale image recognition tasks |
| ZFNet | 2013 | Similar architecture to AlexNet, but with different filter sizes and numbers of filters; visualization techniques for understanding the network | ImageNet classification |
| VGGNet | 2014 | Deeper networks with smaller filters (3×3); all convolutional layers have the same depth; multiple configurations (VGG16, VGG19) | Large-scale image recognition |
| ResNet | 2015 | Introduced “skip connections” or “shortcuts” to enable training of deeper networks; multiple configurations (ResNet-50, ResNet-101, ResNet-152) | Large-scale image recognition; won 1st place in ILSVRC 2015 |
| GoogLeNet | 2014 | Introduced the Inception module, which allows for more efficient computation and deeper networks; multiple versions (Inception v1, v2, v3, v4) | Large-scale image recognition; won 1st place in ILSVRC 2014 |
| MobileNets | 2017 | Designed for mobile and embedded vision applications; uses depthwise separable convolutions to reduce model size and complexity | Mobile and embedded vision applications; real-time object detection |

Learning Vectorial Representations of Words


In Machine Learning and Deep Learning, especially in Natural Language Processing (NLP),
it is crucial to convert words into numerical representations that models like CNNs can process.
These numerical representations are known as word vectors or word embeddings.
The goal is to capture the semantic meaning and contextual similarity of words in vector
form so that similar words have closer vector representations.

Why Not Just Use One-Hot Encoding?


One-Hot Encoding Example (Vocabulary = [cat, dog, mouse]):
• "cat" → [1, 0, 0]
• "dog" → [0, 1, 0]
• "mouse" → [0, 0, 1]
Problems:
• High dimensionality for large vocabularies.
• No information about word meaning or context.
• All words are equally distant in vector space (the Euclidean distance is the same for any pair), as the quick check below shows.
Solution → Use vector embeddings instead!
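
A quick NumPy check of the distance claim above, using the toy vocabulary from the example:

```python
# With one-hot vectors, every pair of distinct words is exactly sqrt(2) apart,
# so the encoding carries no notion of similarity.
import numpy as np

vocab = ["cat", "dog", "mouse"]
one_hot = np.eye(len(vocab))   # rows are [1,0,0], [0,1,0], [0,0,1]

for i in range(len(vocab)):
    for j in range(i + 1, len(vocab)):
        d = np.linalg.norm(one_hot[i] - one_hot[j])
        print(vocab[i], vocab[j], round(d, 4))   # always 1.4142
```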

Popular Methods for Learning Word Representations


1. Word2Vec (by Google, 2013)
A neural network model to learn distributed representations of words.
➤ Two Architectures:
• CBOW (Continuous Bag of Words): Predicts a target word from surrounding context
words.
• Skip-Gram: Predicts surrounding context words given the target word.
Example (Sentence: “The cat sat on the mat”):
• Skip-Gram: Input = "cat", Outputs = ["The", "sat", "on"]
Key Points:
• Trains a shallow neural network.
• Produces vectors where semantically similar words are close.

Advantages:
• Efficient and scalable to large corpora.
• Captures both semantic and syntactic relationships.
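
As a usage illustration, here is a minimal Skip-Gram example with the gensim library (assumed installed). The toy corpus is far too small to learn meaningful vectors and is for illustration only.

```python
# Training a tiny Skip-Gram Word2Vec model with gensim.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1)      # sg=1 selects Skip-Gram (sg=0 = CBOW)

vec = model.wv["cat"]                    # 50-dimensional word vector
print(model.wv.most_similar("cat"))      # nearest neighbours in embedding space
```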

2. GloVe (Global Vectors for Word Representation)


Developed by: Stanford NLP Group
Idea: Combine global matrix factorization (like LSA) with local context windows (like
Word2Vec): GloVe builds a word-word co-occurrence matrix over the whole corpus and learns
vectors whose dot products approximate the logarithm of the co-occurrence counts.
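
For reference, the weighted least-squares objective that GloVe minimizes can be written as:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $V$ is the vocabulary size, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, and $f$ is a weighting function that down-weights rare and very frequent co-occurrences.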
3. FastText (by Facebook AI)

Improves Word2Vec by representing each word as a bag of character n-grams; a word’s vector
is the sum of the vectors of its n-grams.

Example (character 3-grams):
“playing” → [“pla”, “lay”, “ayi”, “yin”, “ing”]
(FastText additionally marks word boundaries, so n-grams such as “<pl” and “ng>” are included too.)
Benefits:
• Handles out-of-vocabulary (OOV) words.
• Better for morphologically rich languages (e.g., German, Hindi).
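
A small pure-Python sketch of the n-gram idea above. Real FastText also adds the word-boundary markers “<” and “>” and sums the n-gram vectors to build the word vector.

```python
# Break a word into character 3-grams, as in the "playing" example above.
def char_ngrams(word: str, n: int = 3) -> list[str]:
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("playing"))    # ['pla', 'lay', 'ayi', 'yin', 'ing']
print(char_ngrams("<playing>"))  # with boundary markers, as FastText does
```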

4. ELMo (Embeddings from Language Models)


• Generates contextual word embeddings.
• Word meaning changes based on the sentence context.
Example:
• “He sat on the bank of the river.”
• “He went to the bank to deposit money.”
ELMo produces different vectors for "bank" in each sentence.

5. BERT (Bidirectional Encoder Representations from Transformers)


• State-of-the-art contextual embedding.
• Considers both left and right context simultaneously.
• More accurate for downstream NLP tasks like QA, translation, etc.
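
As an illustration (a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available), the snippet below shows that BERT assigns “bank” a different vector in each of the two sentences from the ELMo example:

```python
# Contextual embeddings: the vector for "bank" depends on the sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("He sat on the bank of the river.")
v2 = bank_vector("He went to the bank to deposit money.")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: context-dependent vectors
```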

Local Response Normalization (LRN)


Local Response Normalization (LRN) is a regularization technique used in Convolutional
Neural Networks (CNNs) to improve generalization and stabilize learning.
It was introduced in AlexNet (2012) and is inspired by biological lateral inhibition in neurons
— a phenomenon where excited neurons inhibit the activity of their neighbors.
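
For reference, the normalization as defined in the AlexNet paper: the activity $a^i_{x,y}$ of kernel $i$ at position $(x, y)$ is normalized across $n$ neighbouring channels,

$$b^i_{x,y} = \frac{a^i_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}}$$

where $N$ is the total number of kernels in the layer; AlexNet used $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$.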

Purpose of LRN
• To encourage competition between neurons (like contrast enhancement).
• To normalize the outputs of neurons in the same region.
• To help the model learn more diverse and informative features.
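
A minimal NumPy sketch of across-channel LRN following the formula above (the constants are the AlexNet defaults; the input shape and names are illustrative):

```python
# Local Response Normalization across channels for a (C, H, W) feature map.
import numpy as np

def lrn(a: np.ndarray, n: int = 5, k: float = 2.0,
        alpha: float = 1e-4, beta: float = 0.75) -> np.ndarray:
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        # Sum of squares over the window of n neighbouring channels.
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

out = lrn(np.random.randn(8, 4, 4))
print(out.shape)  # (8, 4, 4)
```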

Advantages of LRN:

| Advantage | Description |
|---|---|
| Regularization | Acts like dropout by reducing overfitting |
| Improved Feature Diversity | Boosts discriminative features by encouraging competition |
| Smooth Training | Helps stabilize gradients early in training |

Disadvantages:

| Disadvantage | Description |
|---|---|
| Computationally Costly | Adds overhead due to normalization calculations |
| Rarely Used Now | Replaced by Batch Normalization in modern CNNs |
| Limited Improvement | Doesn’t significantly boost performance in deeper networks |
Applications:
• Used in AlexNet for image classification on ImageNet.
• Helpful in shallow CNNs for visual pattern extraction.
• Less common today due to the superiority of BatchNorm.
