Deep Learning
Imagine you have a book to read (your dataset), and you want to understand it well (train a model). Each time
you read through the entire book from start to finish, that is one epoch. But just reading the book once might
not be enough to fully understand everything, so you read it multiple times. Each full read-through is another
epoch.
In machine learning, during each epoch, the model learns by seeing all the data and adjusting itself based on
the errors it made. More epochs generally help the model learn better, but too many can lead to overfitting,
where the model learns too much from the training data and struggles with new, unseen data.
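As a hedged illustration of this loop (the toy dataset, model, and learning rate below are made-up assumptions, not from the text), each pass of the outer loop is one epoch:

```python
import numpy as np

# Toy dataset and "model": fit y = w*x with squared-error loss (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)

w = 0.0            # single model parameter
lr = 0.05          # learning rate
num_epochs = 5     # each epoch = one full pass over the dataset

for epoch in range(num_epochs):
    for x_i, y_i in zip(X, y):            # every sample is seen once per epoch
        grad = 2 * (w * x_i - y_i) * x_i  # d/dw of (w*x - y)^2
        w -= lr * grad
    loss = np.mean((w * X - y) ** 2)
    print(f"epoch {epoch + 1}: loss = {loss:.4f}, w = {w:.3f}")
```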
Optimizers
1. Gradient Descent (GD)
Concept:
Gradient Descent is the most basic optimization algorithm used to minimize the loss function of a machine
learning model. The idea is to iteratively adjust the model’s parameters (weights and biases) in the direction of
the negative gradient (down the slope) of the loss function with respect to those parameters, to minimise the
error.
Features:
● Global batch-based update: It uses the entire training dataset to compute gradients.
● Deterministic: For a given dataset and initialization, every run produces exactly the same sequence of
updates, and with a well-chosen learning rate it converges smoothly.
Benefits:
● Smooth Convergence: Since the gradient is calculated over the entire dataset, the updates are more
stable.
Drawbacks:
● Slow for Large Datasets: Since it computes gradients using the entire dataset in each iteration, it can
be computationally expensive for large datasets.
● Memory Intensive: Storing and processing the whole dataset at once can be memory-intensive.
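A minimal NumPy sketch of full-batch gradient descent on a small least-squares problem (the data, learning rate, and step count are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # 200 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr = 0.1

for step in range(500):
    # Gradient of the mean squared error computed over the ENTIRE dataset.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad                            # step down the negative gradient

print("estimated weights:", np.round(w, 3))
```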
2. Stochastic Gradient Descent (SGD)
Concept:
Stochastic Gradient Descent improves upon regular Gradient Descent by updating the parameters more
frequently. Instead of using the whole dataset to compute the gradient, SGD updates the parameters using
one data point (or sample) at a time.
Features:
● Frequent Updates: Updates occur after every single training example, resulting in faster learning.
● Noisy Updates: Since updates happen with one sample, the updates can be noisy, helping the model
jump out of local minima.
Benefits:
● Faster Convergence for Large Datasets: It can quickly start learning even with very large datasets.
● Works Well with Online Learning: Ideal for scenarios where data arrives in a stream.
Drawbacks:
● Noisy Convergence: The frequent updates can lead to the loss function fluctuating rather than
smoothly decreasing.
● Requires Learning Rate Tuning: The learning rate is crucial for avoiding overshooting or slow
convergence.
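A matching sketch of SGD on the same kind of toy problem, updating after every individual sample (again, all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr = 0.01

for epoch in range(20):
    for i in rng.permutation(len(y)):         # shuffle, then one sample at a time
        grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient from a single example
        w -= lr * grad                        # frequent, noisy update

print("estimated weights:", np.round(w, 3))
```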
3. Mini-batch SGD
Concept:
Mini-batch SGD combines the benefits of both regular GD and SGD. Instead of using the entire dataset (as in
GD) or a single data point (as in SGD), it updates the parameters using a small subset (mini-batch) of the data.
Features:
● Balanced Update: Reduces the variance of parameter updates by using a small but significant portion
of the data.
● Faster Training: Compared to regular SGD, it improves computational efficiency by parallelizing
gradient computation.
Benefits:
● Faster than GD: Less computationally intensive than computing gradients on the entire dataset.
● Less Noisy than SGD: Mini-batch updates are smoother than updating with a single data point.
Drawbacks:
● Batch Size Tuning: The mini-batch size is an additional hyperparameter that affects both speed and stability.
● Still Noisy: Updates remain noisier than full-batch gradient descent, so the loss can still fluctuate between
iterations.
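A sketch of the mini-batch variant under the same toy assumptions, where each update uses a small slice of the shuffled data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr = 0.05
batch_size = 32

for epoch in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]               # one mini-batch of indices
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient over the mini-batch
        w -= lr * grad

print("estimated weights:", np.round(w, 3))
```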
4. SGD with Momentum
Concept:
SGD with Momentum improves standard SGD by adding a momentum term that helps accelerate the
optimization process. It helps the algorithm overcome small local minima and smooths out the noisy updates
typical in SGD.
Features:
● Faster Convergence: By including momentum, the optimizer "builds up speed" in directions that
consistently reduce the loss.
● Reduces Oscillations: Momentum helps reduce the zig-zagging behaviour of SGD.
Benefits:
● Faster Convergence: Accumulated velocity speeds up progress along directions where the gradient is
consistent.
● Smoother Trajectory: Averaging over past gradients damps the noise of individual updates.
Drawbacks:
● Extra Hyperparameter: The momentum coefficient (commonly around 0.9) must be tuned alongside the
learning rate.
● Possible Overshooting: Built-up velocity can carry the parameters past a minimum if momentum is set too
high.
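A minimal sketch of the momentum update rule on an ill-conditioned toy quadratic (the loss, learning rate, and momentum coefficient are illustrative assumptions):

```python
import numpy as np

def grad(w):
    # Gradient of a simple quadratic loss f(w) = 0.5 * w^T A w (illustrative).
    A = np.diag([1.0, 25.0])      # ill-conditioned: plain SGD zig-zags here
    return A @ w

w = np.array([5.0, 5.0])
v = np.zeros_like(w)              # velocity: running accumulation of past gradients
lr, momentum = 0.03, 0.9

for step in range(100):
    v = momentum * v + grad(w)    # build up speed along consistent directions
    w -= lr * v

print("final parameters:", np.round(w, 4))
```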
5. AdaGrad Optimizer
Concept:
AdaGrad (Adaptive Gradient Algorithm) adapts the learning rate for each parameter individually based on how
frequently that parameter is updated. Parameters with larger gradients receive smaller updates, and
parameters with smaller gradients receive larger updates.
Features:
● Adaptive Learning Rate: Automatically scales the learning rate for each parameter.
Benefits:
● Efficient for Sparse Data: Works well with sparse data (e.g., NLP tasks), where some features occur
very infrequently.
Drawbacks:
● Learning Rate Decay: The learning rate tends to decay too much over time, slowing down
convergence.
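A minimal sketch of the AdaGrad update on a toy quadratic (illustrative values only); note how the accumulated cache keeps shrinking the effective step size:

```python
import numpy as np

def grad(w):
    return w                      # gradient of f(w) = 0.5 * ||w||^2 (illustrative)

w = np.array([5.0, 5.0])
cache = np.zeros_like(w)          # running SUM of squared gradients per parameter
lr, eps = 1.0, 1e-8

for step in range(100):
    g = grad(w)
    cache += g ** 2                           # accumulates over the whole history
    w -= lr * g / (np.sqrt(cache) + eps)      # per-parameter effective learning rate

print("final parameters:", np.round(w, 4))
```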
6. Adadelta and RMSProp
Concept:
Both Adadelta and RMSProp are designed to overcome AdaGrad’s aggressive learning rate decay by using a
moving average of squared gradients instead of a cumulative sum.
Features:
● Adapts Learning Rate: Like AdaGrad, these methods adjust the learning rate based on recent
gradients but prevent the learning rate from decaying too fast.
● Uses Window of Gradients: They use a moving window of recent gradient updates instead of
considering the entire history.
Benefits:
● No Aggressive Decay: The effective learning rate does not shrink towards zero as training progresses.
● Suits Non-Stationary Problems: Adapting to recent gradients works well for recurrent networks and online
settings.
Drawbacks:
● Extra Hyperparameters: The decay rate of the moving average must be chosen, and RMSProp still requires
a base learning rate.
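A minimal RMSProp-style sketch under the same toy assumptions; the decaying average replaces AdaGrad's ever-growing sum:

```python
import numpy as np

def grad(w):
    return w                      # gradient of f(w) = 0.5 * ||w||^2 (illustrative)

w = np.array([5.0, 5.0])
avg_sq = np.zeros_like(w)         # MOVING AVERAGE of squared gradients
lr, decay, eps = 0.1, 0.9, 1e-8

for step in range(200):
    g = grad(w)
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2   # window of recent gradients
    w -= lr * g / (np.sqrt(avg_sq) + eps)            # effective rate does not decay to zero

print("final parameters:", np.round(w, 6))
```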
7. Adam Optimizer
Concept:
Adam combines the ideas of both Momentum and RMSProp. It computes adaptive learning rates for each
parameter and includes momentum by maintaining moving averages of both the gradients and the squared
gradients.
Features:
● Combines Momentum and Adaptive Learning Rate: Utilizes both the concept of momentum and
per-parameter learning rates.
● Bias Correction: Corrects the bias in early iterations when the moving averages are initialized to 0.
Benefits:
● Works Well Out of the Box: The commonly used defaults (learning rate 0.001, β1 = 0.9, β2 = 0.999) perform
well across many problems.
● Fast, Stable Convergence: Combining momentum with per-parameter scaling usually converges faster than
plain SGD.
Drawbacks:
● Higher Memory Use: Two moving averages are stored for every parameter.
● Generalisation: In some settings, models trained with Adam generalise slightly worse than well-tuned SGD
with momentum.
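A minimal sketch of the Adam update with bias correction (the hyperparameters are the commonly used defaults, chosen here as assumptions):

```python
import numpy as np

def grad(w):
    return w                      # gradient of f(w) = 0.5 * ||w||^2 (illustrative)

w = np.array([5.0, 5.0])
m = np.zeros_like(w)              # moving average of gradients (momentum term)
v = np.zeros_like(w)              # moving average of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction: the averages start at 0
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("final parameters:", np.round(w, 6))
```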
CNN Architectures
1. LeNet-5 (1998)
Concept:
LeNet-5, designed by Yann LeCun, is one of the first convolutional neural networks (CNNs). It was primarily
developed for digit classification, particularly for recognizing handwritten digits in the MNIST dataset.
Architecture:
● Input: 32×32 grayscale image.
● Two convolutional layers (C1, C3) interleaved with two subsampling (pooling) layers (S2, S4), followed by a
further convolutional layer (C5), a fully connected layer (F6), and a 10-way output layer, with tanh/sigmoid
activations throughout.
Features:
● Simple Structure: LeNet-5 has a straightforward architecture with fewer layers compared to modern
CNNs.
● Uses Subsampling (Pooling): Introduces pooling layers for down-sampling, reducing the spatial
dimensions of the input.
Benefits:
● Efficient for Small Datasets: Works well with smaller datasets, such as MNIST.
● First Use of CNN: Pioneered CNNs and introduced the concept of convolution and subsampling
layers.
Drawbacks:
● Limited Depth: Not suitable for complex and large-scale image datasets.
● Outdated Activation Functions: Modern architectures use ReLU over tanh or sigmoid for faster
convergence.
Real-Life Application:
● Handwritten digit recognition, e.g., reading postal codes and processing bank cheques.
2. AlexNet (2012)
Concept:
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, was a breakthrough CNN model
that won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It demonstrated the power
of deep convolutional networks on large-scale datasets.
Architecture:
● Five convolutional layers followed by three fully connected layers (about 60 million parameters in total), with
max pooling and local response normalisation applied after some of the convolutional layers.
Features:
● ReLU Activation: Replaced tanh and sigmoid with ReLU, accelerating training.
● Dropout: Introduced dropout to prevent overfitting by randomly dropping neurons during training.
● Large Network: With about 60 million parameters, the model was large for its time and was trained on GPUs
to make training feasible and fast.
Benefits:
● Great Performance on ImageNet: Achieved remarkable results on ImageNet, reducing top-5 error
from 26% to 16%.
● GPU Utilisation: Showcased the importance of GPUs for deep learning.
Drawbacks:
● Parameter Heavy: Roughly 60 million parameters make it prone to overfitting without dropout and data
augmentation.
● Superseded: Later architectures (VGG, Inception, ResNet) achieve better accuracy with deeper, more
principled designs.
Real-Life Application:
● Used in image search engines and automated tagging systems for photo libraries.
AlexNet uses ReLU instead of the tanh activation function to add non-linearity, which speeds up training by
roughly six times.
3. VGG-16 (2014)
Concept:
VGG-16 was developed by the Visual Geometry Group (VGG) at Oxford University and is known for its simple
and uniform architecture, using smaller convolutional filters but stacking many layers.
Architecture:
● 13 convolutional layers (all 3×3 filters, stride 1, same padding) and 3 fully connected layers, organised into
5 blocks, each block followed by 2×2 max pooling with stride 2.
Features:
● Uniform Design: The same small-filter convolution and pooling pattern is repeated throughout the network.
● Depth over Filter Size: Stacking many 3×3 convolutions gives a large effective receptive field while keeping
each layer simple.
Benefits:
● Strong Feature Extraction: The learned features transfer well, which is why VGG-16 is widely used for
transfer learning.
● Simple to Understand and Implement: The uniform structure makes it a popular baseline and teaching model.
Drawbacks:
● Large Model: Over 138 million parameters, which makes it slow to train.
● Memory and Computational Intensive: Requires significant resources for training.
Real-Life Application:
● Image classification and object detection.
● Often used for transfer learning due to its strong feature extraction capabilities.
● Medical imaging: diagnosing diseases from scans such as X-rays, CT, and MRI.
● Facial recognition systems and video classification.
● Analyzing food items in restaurant ordering systems.
The “deep” refers to the number of weight layers, with VGG-16 and VGG-19 consisting of 16 and 19 weight
layers, respectively.
The most distinctive thing about VGG-16 is that, instead of a large number of hyper-parameters, it sticks to
convolution layers with 3×3 filters and stride 1, always uses the same padding, and uses max-pool layers with
2×2 filters and stride 2. It follows this arrangement of convolution and max-pool layers consistently throughout
the whole architecture. At the end it has three fully connected layers (two of size 4096 and a final 1000-way
layer) followed by a softmax for the output. The 16 in VGG-16 refers to it having 16 layers that have weights.
It is a fairly large network, with about 138 million parameters.
In VGG-16, the softmax function is used in the final layer of the network to perform multi-class classification.
Specifically, after the convolutional layers and fully connected layers, the output is passed through a softmax
activation function to convert the raw scores (logits) for each class into probabilities. The softmax function
ensures that the output values for the different classes sum up to 1, which allows the network to interpret the
output as probabilities of belonging to each class.
In VGG-16, softmax is used to classify the input into one of 1,000 categories when trained on the ImageNet
dataset. The class with the highest probability is chosen as the final prediction.
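As a small illustration of this final step (the logits below are made-up numbers, not VGG-16 outputs):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])     # raw scores for three hypothetical classes
probs = softmax(logits)

print(np.round(probs, 3))              # probabilities that sum to 1
print("predicted class:", int(np.argmax(probs)))
```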
ReLU Activation:
● Both VGG-16 and VGG-19 use the ReLU (Rectified Linear Unit) activation function after each
convolutional and fully connected layer, which helps the network avoid the vanishing gradient problem
and speeds up training.
Softmax Output:
● The final output layer in both models uses a softmax function to convert the raw logits into class
probabilities for multi-class classification.
Parameter Heavy:
● Both models are parameter-heavy due to the use of fully connected layers and the depth of the
network. VGG-16 has ~138 million parameters, while VGG-19 has ~143 million parameters, making
them computationally expensive to train.
16 Layers:
● VGG-16 consists of 13 convolutional layers and 3 fully connected layers. These layers are
organized into 5 blocks, each followed by a max-pooling layer.
● VGG-16 has fewer convolutional layers per block compared to VGG-19, which makes it slightly
more efficient in terms of memory and computation.
Model Size:
● VGG-16 is slightly smaller in terms of model size compared to VGG-19 due to fewer layers, making it
marginally faster to train and deploy.
The vanishing gradient problem arises when the activation functions used in the network, such as sigmoid or
tanh, squash the input values into very small ranges: (0, 1) for sigmoid and (-1, 1) for tanh. As the gradients are
computed and backpropagated through these functions, they tend to shrink exponentially for earlier layers.
This leads to the gradients becoming so small that the weights of the earlier layers don’t get updated
effectively.
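A small numerical illustration of why this happens (the depth and input values are illustrative assumptions): the backpropagated gradient includes one sigmoid-derivative factor per layer, and that derivative is at most 0.25, so the product shrinks exponentially with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)            # maximum value is 0.25, reached at x = 0

# Backpropagating through a chain of layers multiplies one such factor per layer.
pre_activations = np.zeros(20)    # 20 layers, each at the *best case* (derivative = 0.25)
gradient_scale = np.prod(sigmoid_derivative(pre_activations))

print(f"gradient shrink factor after 20 sigmoid layers: {gradient_scale:.2e}")
# ~9.1e-13: updates to the earliest layers become vanishingly small.
```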
● RNNs are particularly vulnerable to the vanishing gradient problem because they repeatedly apply the
same weights across time steps. When RNNs are trained on long sequences (many time steps),
gradients can either vanish (become very small) or explode (become too large), making it difficult to
capture long-term dependencies.
● Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) were developed to mitigate
this issue by introducing mechanisms like gates that control how much information passes through,
addressing the vanishing gradient problem in RNNs.
4. VGG-19 (2014)
Concept:
VGG-19 is an extended version of VGG-16, adding three additional convolutional layers for more depth.
Architecture:
● Similar to VGG-16, but with 19 layers (16 convolutional and 3 fully connected).
Features:
● More Depth: The same architecture as VGG-16 but with additional layers for potentially better feature
extraction.
Benefits:
● Improved Feature Extraction: The additional layers can help in capturing more complex patterns.
Drawbacks:
● Even Larger Model: Has more parameters than VGG-16, making it even more computationally
expensive.
Real-Life Application:
● Similar to VGG-16: image classification, transfer learning, and feature extraction; VGG-19's intermediate
features are also a popular choice for neural style transfer.
19 Layers:
● VGG-19 consists of 16 convolutional layers and 3 fully connected layers. These are organized into
5 blocks, similar to VGG-16, but with more convolutional layers in some blocks.
Deeper Architecture:
● VGG-19 has more convolutional layers per block compared to VGG-16, leading to a deeper
architecture. This makes it slightly more powerful in terms of feature extraction but at the cost of higher
computational complexity.
● Due to the additional convolutional layers, VGG-19 has a higher number of parameters (~143
million) than VGG-16, making it slightly heavier to train and run, but potentially providing better
accuracy on some tasks.
5. ResNet (2015)
ResNet, or Residual Network, is a deep neural network architecture designed to address the vanishing
gradient problem, which commonly occurs when training very deep networks. It introduces the concept of
residual learning through skip (shortcut) connections, where the output of a layer is added directly to the input
of a subsequent layer, effectively bypassing one or more layers. This allows gradients to flow more easily
through the network during backpropagation, enabling the training of much deeper networks (e.g., ResNet-50,
ResNet-101, ResNet-152) without performance degradation. The residual block is defined as F(x) + x, where
F(x) represents the learned residual function and x is the input that is added directly to the block's output. This
architecture allows networks to focus on learning residuals (the difference from the
identity) rather than the complete mapping, making deep models both easier to train and more accurate in
practice, as seen in their state-of-the-art performance on tasks like image recognition.
Concept:
ResNet (Residual Network) introduced residual learning, solving the problem of vanishing gradients when
training very deep networks. It allows layers to learn residuals (or modifications) to the input, making training
more efficient for deep networks.
Architecture:
● Stacked residual blocks, each containing two or three convolutional layers plus an identity (skip) connection;
common variants are ResNet-50, ResNet-101, and ResNet-152, ending in global average pooling and a fully
connected classifier.
Mathematical Equation:
y = F(x) + x
Where F(x) is the residual function learned by the block, and x is the input that is added back through the skip
connection.
Features:
● Skip Connections: Allow gradients to bypass layers, preventing the vanishing gradient problem.
● Deep Architecture: Enables networks to be very deep (e.g., ResNet-152).
Benefits:
● Allows Very Deep Networks: ResNet-152 can be trained effectively without degradation in
performance.
● State-of-the-Art Performance: Dominated image recognition tasks for years.
Drawbacks:
● Increased Complexity: Skip connections and very deep stacks add architectural and computational overhead
compared to plain CNNs.
Real-Life Application:
● Backbone network for image classification, object detection, and segmentation; widely used in medical
imaging and video analysis (see the applications listed below).
A residual block is a key component of ResNet (Residual Network), a deep learning architecture that helps
to train very deep neural networks more effectively. The residual block introduces a shortcut connection that
allows the input of the block to bypass (or "skip") the block's layers and be added directly to the block's output.
This concept helps to solve the vanishing gradient problem, which makes training deep networks
challenging.
ResNet (Residual Network) is a deep learning architecture introduced in 2015 by Kaiming He, Xiangyu
Zhang, Shaoqing Ren, and Jian Sun in their paper Deep Residual Learning for Image Recognition. ResNet
revolutionised the way deep neural networks are designed and trained by addressing the problem of
vanishing/exploding gradients and enabling the construction of extremely deep networks, sometimes with
hundreds or even thousands of layers.
1. Residual Learning: The central idea of ResNet is the use of skip connections (or shortcut connections) to
introduce an identity mapping across layers. This identity mapping allows the network to bypass one or more
layers and send the input directly to a deeper layer. This "residual connection" helps prevent the degradation
problem (where accuracy saturates and then degrades as the network depth increases).
2. Deep Network: ResNet architectures can be very deep, with models like ResNet-50 (50 layers), ResNet-101
(101 layers), and even ResNet-152 (152 layers). Despite their depth, they can still be efficiently trained
because of residual learning.
3. Simplification of Network Training: The skip connections in ResNet make it easier to optimise deep
networks by ensuring that the gradients flow back easily during backpropagation, mitigating the issue of
vanishing gradients that traditionally made it difficult to train very deep networks.
As neural networks get deeper, they can perform better in theory, but in practice, deeper networks often suffer
from issues such as vanishing or exploding gradients. These issues prevent proper training, causing
performance to plateau or degrade as more layers are added. Residual blocks address this by allowing layers
to learn residuals (small updates) rather than forcing them to learn entire mappings from scratch.
Concept
A residual block computes Output = F(x) + x: the stacked layers learn a residual function F(x), and the shortcut
connection adds the original input x back onto their output. This addition is the key to the residual block
because it allows the block to focus on learning the difference (residual) between the input and output rather
than the full mapping.
Diagrammatically, the shortcut carries x around the stacked layers and rejoins their output at an element-wise
addition.
Types of Residual Blocks
1. Identity Block: The input size and the output size are the same. This is the most basic form of a
residual block.
2. Convolutional Block: If the dimensions of the input and output differ, a convolutional layer in the
shortcut connection is used to match the dimensions. (Both block types are sketched in code below.)
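A minimal PyTorch-style sketch of the two block types above (assuming PyTorch is available; the channel counts and layer choices are illustrative, not the exact ResNet configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityBlock(nn.Module):
    """Residual block where input and output shapes match: out = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                 # shortcut: add the input back

class ConvBlock(nn.Module):
    """Residual block whose shortcut uses a 1x1 convolution to match dimensions."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut so that x can be added to the reshaped output.
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)                 # dummy feature map
print(IdentityBlock(64)(x).shape)              # torch.Size([1, 64, 56, 56])
print(ConvBlock(64, 128)(x).shape)             # torch.Size([1, 128, 28, 28])
```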
Benefits
1. Eases the Training of Deep Networks: By learning residuals, layers only need to make small
adjustments to the output rather than learning completely new representations.
2. Prevents Vanishing Gradients: The skip connection ensures gradients can flow through the network
more easily during backpropagation, even in very deep networks.
3. Improves Accuracy: Residual blocks allow networks to be deeper without the degradation of
accuracy.
Drawbacks
● Increased Complexity: The use of shortcut connections and additional operations may increase the
computational complexity.
● Residual Learning Limitation: In some cases, residual connections might not be beneficial if the
network is shallow, as the residual learning technique is most effective in very deep architectures.
Applications
Residual blocks are a core component of deep architectures like ResNet (ResNet-50, ResNet-101, etc.),
which are widely used for:
● Image classification
● Object detection
● Segmentation tasks
● Speech recognition
The Inception/GoogLeNet family, covered next, is by contrast:
● Known for handling complex object detection and classification tasks with fewer parameters.
● Used for image recognition, video analysis, and medical imaging.
● Effective in detecting multi-scale objects within images due to its inception modules.
The major issues faced by deeper CNN models such as VGGNet were:
● Although previous networks such as VGG achieved remarkable accuracy on the ImageNet dataset, deploying
these kinds of models is highly computationally expensive because of the deep architecture.
● Very deep networks are susceptible to overfitting. It is also hard to pass gradient updates through the entire
network.
1×1 convolution: A 1×1 convolution simply maps an input pixel, with all of its channels, to an output pixel. It is
used as a dimensionality-reduction module to cut computation: reducing the channel depth with a 1×1
convolution before an expensive 5×5 convolution is far cheaper than applying the 5×5 convolution directly to
the full-depth feature map.
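A back-of-the-envelope sketch of the saving (the tensor sizes below follow the commonly cited GoogLeNet-style illustration and are assumptions, not figures from the text):

```python
# Multiplications for a convolution ~= out_H * out_W * out_channels * (kH * kW * in_channels)
H, W = 28, 28
in_ch, out_ch, reduced_ch = 192, 32, 16

direct_5x5 = H * W * out_ch * (5 * 5 * in_ch)                 # 5x5 conv on the full depth
bottleneck = (H * W * reduced_ch * (1 * 1 * in_ch)            # 1x1 reduction first...
              + H * W * out_ch * (5 * 5 * reduced_ch))        # ...then 5x5 on fewer channels

print(f"direct 5x5:       {direct_5x5:,} multiplications")
print(f"1x1 then 5x5:     {bottleneck:,} multiplications")
print(f"reduction factor: {direct_5x5 / bottleneck:.1f}x")
```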
6. Inception (2014)
Concept:
The Inception network (GoogLeNet) introduced the idea of parallel convolutions with different filter sizes in the
same layer, allowing the network to capture features at different scales.
Architecture:
● Inception Module: Combines 1x1, 3x3, and 5x5 convolutions in parallel, along with max pooling,
followed by concatenation (sketched in code after this list).
● Layers: Stacked inception modules.
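A minimal PyTorch-style sketch of such a module (the branch channel counts are illustrative assumptions, not GoogLeNet's exact values):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)        # 1x1 branch
        self.branch3 = nn.Sequential(                             # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                             # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                         # pool, then 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Run all branches on the same input and concatenate along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28]): 64+128+32+32 channels
```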
Features:
● Multi-scale Feature Extraction: Each inception module captures features at multiple scales.
● Efficient Architecture: By reducing the number of parameters with 1x1 convolutions, it saves
computational power.
Benefits:
● Good Trade-off Between Depth and Efficiency: Performs well without being overly computationally
expensive.
Drawbacks:
● Complex Architecture: More difficult to implement and tune compared to simpler architectures like
VGG.
Real-Life Application:
● Surveillance Systems: Used in video analysis for detecting objects at different scales.
7. GoogLeNet (2014)
Concept:
GoogLeNet is the original Inception model and was the winner of the ILSVRC 2014. It uses inception modules
with multiple convolutions and pooling operations in parallel, followed by concatenation.
Architecture:
● 22 layers deep, built from stacked inception modules, with auxiliary classifiers used during training and
global average pooling in place of large fully connected layers.
Features:
● Efficient: Uses fewer parameters than AlexNet and VGG despite being deeper.
● Inception Modules: Capture multi-scale features with parallel convolutions and pooling.
Benefits:
● Good Performance with Fewer Parameters: Performs well on large-scale image datasets with fewer
parameters than VGG.
Drawbacks:
● Complex Design: Hard to implement and tune due to the inception modules and their parallel
operations.
Real-Life Application:
● Healthcare: Used in medical imaging for disease detection and diagnosis, such as identifying tumors in
MRI scans.
1. Residual Blocks (ResNet)
● Functionality: Residual blocks allow gradients to flow through the network without vanishing, enabling
the training of very deep networks. This is achieved by adding skip connections, which bypass one or
more layers.
● Application in E-commerce:
○ Business Problem: High dimensionality of product images (e.g., clothing, electronics) can lead
to poor classification performance when traditional models are used.
○ Solution: Implement a ResNet architecture to classify product images, allowing the model to
learn intricate features of various products. The residual connections can improve training
efficiency and accuracy by alleviating vanishing gradients, leading to better feature extraction
from complex images.
○ Example: Classifying apparel images into categories (e.g., shirts, pants, dresses) with high
accuracy, reducing misclassification and enhancing customer experience.
2. 1x1 Convolutions (Inception)
● Functionality: 1x1 convolutions help to reduce dimensionality and increase the depth of the network
without losing spatial information. They allow for effective feature combinations and improve
computational efficiency.
● Application in E-commerce:
○ Business Problem: Managing and processing large product image datasets can be
computationally intensive and slow down real-time recommendation systems.
○ Solution: Use 1x1 convolutions within an Inception-like architecture to create more efficient
models. This enables the model to capture essential features with fewer parameters, speeding
up inference time and reducing the computational burden.
○ Example: Using the model to quickly identify and categorize a vast array of electronics (e.g.,
smartphones, laptops) in real-time, thereby improving the efficiency of search and
recommendation systems.
By integrating the advantages of both Residual Blocks and 1x1 Convolutions, an e-commerce platform can
build a robust image classification model that extracts rich features from complex product images while
remaining efficient enough for real-time search and recommendation.
Evidence of Impact:
● Improved Accuracy: ResNet architectures won the ILSVRC 2015 classification task and set state-of-the-art
results on image classification benchmarks like ImageNet, demonstrating their effectiveness in extracting
features from complex images.
● Efficiency Gains: Implementing Inception modules has been shown to reduce the number of
computations needed while maintaining or improving accuracy, which is crucial for handling large
product catalogs in e-commerce.
Generative Adversarial Networks (GANs) are a class of deep learning models used for generative tasks,
meaning they can generate new data samples that resemble a given dataset. GANs were introduced by Ian
Goodfellow in 2014. The core idea behind GANs is to pit two neural networks—a generator and a
discriminator—against each other in a game-theoretic setup. The generator tries to produce fake data that
mimics the real data, while the discriminator tries to distinguish between real and generated data. Through
this adversarial process, the generator learns to produce increasingly realistic data.
Key Concept:
● GANs are trained in an adversarial setting, where the goal of the generator is to "fool" the discriminator,
while the discriminator's goal is to accurately distinguish between real and fake data.
● Applications of GANs include generating realistic images, videos, text, or even music. They have
been used in image synthesis, style transfer, super-resolution tasks, and many creative AI
applications.
1. Training Dynamics:
○ Both the generator and discriminator are trained in turns. First, the discriminator is trained on
both real and fake data, and then the generator is trained using the feedback from the
discriminator.
○ The generator learns from the discriminator's decisions, and the discriminator learns to
improve its classification accuracy by distinguishing between real and generated samples.
2. Minimax Game: The training of GANs is set up as a minimax game (a code sketch follows this list), where:
○ The generator minimizes the loss by producing data that "fools" the discriminator.
○ The discriminator maximizes its accuracy in distinguishing real from fake samples.
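A compact PyTorch sketch of this alternating scheme on toy one-dimensional data (the network sizes, data distribution, and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Toy "real" data: samples from N(4, 1). The generator learns to mimic this distribution.
def real_batch(n):
    return torch.randn(n, 1) + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
batch = 64

for step in range(2000):
    # --- Train the discriminator on real vs. generated samples ---
    real = real_batch(batch)
    fake = G(torch.randn(batch, 8)).detach()          # detach: don't update G here
    d_loss = (bce(D(real), torch.ones(batch, 1)) +
              bce(D(fake), torch.zeros(batch, 1)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Train the generator to fool the discriminator ---
    fake = G(torch.randn(batch, 8))
    g_loss = bce(D(fake), torch.ones(batch, 1))       # generator wants D to say "real"
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())
```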
To improve product recommendations using GANs, we can design a Recommendation GAN (RecGAN) that
uses customer behavior data (e.g., clicks, views, purchases) as input to generate product recommendations.
Here’s a high-level design:
1. Generator Network:
● Objective: Generate a list of product recommendations for a user based on their interaction history
(e.g., viewed products, past purchases).
● Input: Customer behavior embeddings (user interaction history, product features, user demographics)
+ noise (to introduce variability).
● Output: A set of recommended products (product IDs or feature vectors).
● Justification: By introducing noise, the generator can create diverse, personalized product
recommendations based on implicit customer preferences, generating recommendations that are not
just based on past behavior but also explore similar or related products.
2. Discriminator Network:
● Objective: Distinguish between real product purchases (actual user interactions) and fake
recommendations generated by the generator.
● Input: A combination of real purchase data and generated recommendations.
● Output: Probability that the recommended products are "real" (likely to be purchased).
● Justification: This ensures that the generated recommendations are realistic and closely match the
user's true preferences. The discriminator is trained to validate that the generated recommendations
are products that a user would likely engage with.
3. Loss Function:
● Generator Loss: Minimize the loss that reflects how well the generated recommendations are able to
fool the discriminator (cross-entropy loss).
● Discriminator Loss: Minimize the binary classification loss between real user interactions and
generated recommendations.
4. Data Points:
● Customer Interaction Data: Past user interactions like clicks, views, and purchases.
● Product Features: Attributes like product categories, price, brand, and ratings.
● User Features: Demographics, location, and browsing history.
Justification: These data points allow the model to generate personalised and contextually relevant
recommendations by understanding the user's behaviour and product similarities.
5. Analysis:
● Evaluate the model by tracking click-through rates (CTR), conversion rates, and user engagement
with the generated recommendations.
● Use A/B testing to compare GAN-based recommendations with traditional recommendation models
(collaborative filtering or matrix factorization).
Conclusion:
By using a GAN architecture for product recommendations, we can generate diverse, personalised
recommendations that adapt to user preferences and introduce product exploration, improving engagement
and conversions in e-commerce.
In the context of sequential data, such as time series forecasting, customer behavior prediction, or natural
language processing in business applications, the capabilities of Recurrent Neural Networks (RNNs), Long
Short-Term Memory (LSTM) networks, and Transformer models differ significantly. Here’s a breakdown of
each model's capabilities and their relevance to business problems:
3. Transformer Models
● Replace recurrence with self-attention, so every position in a sequence can attend directly to every other
position.
● Process entire sequences in parallel, which makes training far more scalable than RNNs or LSTMs and
captures long-range dependencies effectively.
Summary of Differences:
● RNNs: process tokens one step at a time and struggle with long-term dependencies because of vanishing
gradients.
● LSTMs: add gating mechanisms (input, forget, and output gates) that preserve information over longer spans,
at a higher computational cost per step.
● Transformers: rely on self-attention and parallel processing, giving the strongest performance on long,
complex sequences at the cost of more data and memory.
Conclusion:
In sequential data applications, LSTMs outperform traditional RNNs due to their ability to capture long-term
dependencies, while Transformers offer significant advantages in scalability and efficiency for handling
complex sequences. Depending on the business problem, choosing the right architecture is crucial for
achieving optimal results. For simpler tasks, RNNs or LSTMs may suffice, while Transformers are the best
choice for complex tasks requiring deep contextual understanding.