Chapter-2 (Deep Learning)
The history of deep learning is tied closely to the broader field of artificial intelligence (AI)
and machine learning (ML), which seeks to create machines that can learn from data. Deep
learning, specifically, refers to a subset of machine learning that uses neural networks with
many layers to model complex patterns in data. Here’s a breakdown of the key milestones in
the development of deep learning:
1. Support Vector Machines and Boosting: In the 1990s, simpler machine learning
methods such as support vector machines (SVMs) and boosted decision trees became
more popular due to their success and computational efficiency compared to neural
networks, which were hard to train at the time.
2. Vanishing Gradient Problem: It was recognized that training deep networks was
difficult because of the vanishing gradient problem—where gradients used in
backpropagation became too small to update weights in earlier layers effectively.
3. Recurrent Neural Networks (RNNs): In the 1990s, RNNs, particularly Long Short-
Term Memory (LSTM) networks introduced by Sepp Hochreiter and Jürgen
Schmidhuber (1997), addressed the difficulty of learning long-term dependencies in
sequential data.
4. Deep Belief Networks (2006): Geoffrey Hinton and his collaborators introduced deep
belief networks (DBNs), which sparked renewed interest in deep learning. DBNs used
a layer-by-layer pretraining method to address some of the issues with training deep
neural networks, particularly the vanishing gradient problem.
5. Better Hardware – GPUs: Around this time, graphics processing units (GPUs),
initially designed for fast image rendering, were repurposed for deep learning because
they could efficiently perform the large-scale matrix operations required by neural
networks.
6. Rectified Linear Unit (ReLU): The ReLU activation function, introduced in 2009,
helped mitigate the vanishing gradient problem and allowed deep networks to train
faster by preventing the gradients from shrinking as they propagated back through the
network.
7. ImageNet (2012): Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used a deep
convolutional neural network (AlexNet) to win the ImageNet competition,
significantly outperforming previous methods in image classification. This moment
marked a major public recognition of the power of deep learning.
8. Speech Recognition: Around the same time, deep learning methods began
outperforming traditional approaches in speech recognition, leading companies like
Google, Microsoft, and Baidu to adopt deep learning for their products.
9. Generative Models – GANs (2014): Ian Goodfellow introduced Generative
Adversarial Networks (GANs), a new class of generative models that could create
realistic data, such as images, by having two neural networks (a generator and a
discriminator) compete against each other.
As deep learning has matured, several ongoing challenges and concerns have come into focus:
Ethics and Bias: As deep learning becomes more pervasive, there are growing
concerns about the ethical implications, such as the biases embedded in the data these
models are trained on.
Explainability: Deep learning models are often seen as "black boxes," and there's a
push towards creating more interpretable AI systems.
Efficiency and Sustainability: Large deep learning models require massive amounts
of computational power and energy, raising concerns about their environmental
impact.
Deep learning continues to evolve, with innovations in areas such as self-supervised learning,
reinforcement learning, and neuromorphic computing pushing the boundaries of what these
models can achieve.
Backpropagation
Backpropagation (short for "backward propagation of errors") is the algorithm used to train
deep neural networks. It adjusts the weights of the connections between neurons in the
network to minimize the error in the network's predictions, improving its performance over
time. Here's a step-by-step breakdown of how backpropagation works:
BPN Algorithm
The BPN algorithm is classified into four major steps as follows:
1. Initialization of Bias, Weights
2. Feedforward process
3. Back Propagation of Errors
4. Updating of weights & biases
Algorithm:
I. Initialization of weights:
Step 1: Initialize the weights to small random values near zero
Step 2: While the stopping condition is false, do Steps 3 to 10
Step 3: For each training pair, do Steps 4 to 9
II. Feedforward of inputs
Step 4: Each input unit receives an input signal x_i and forwards it to the hidden units in the next layer
Step 5: Each hidden unit j sums its weighted inputs: z_in,j = w_0j + Σ_i x_i w_ij
Applying the activation function: z_j = f(z_in,j)
This value is passed to the output layer
Step 6: Each output unit k sums its weighted inputs: y_in,k = v_0k + Σ_j z_j v_jk
Applying the activation function: y_k = f(y_in,k)
III. Backpropagation of Errors
Step 7: Each output unit computes its error term
δ_k = (t_k − y_k) f′(y_in,k)
Each hidden unit sums the errors propagated back from the output layer and computes its own error term
δ_in,j = Σ_k δ_k v_jk,  δ_j = δ_in,j f′(z_in,j)
IV. Updating of Weights & Biases
Step 8: The weight corrections are
Δv_jk = α δ_k z_j (hidden → output) and Δw_ij = α δ_j x_i (input → hidden)
The bias corrections are
Δv_0k = α δ_k and Δw_0j = α δ_j
Step 9: The new weights are
w_ij(new) = w_ij(old) + Δw_ij,  v_jk(new) = v_jk(old) + Δv_jk
The new biases are
w_0j(new) = w_0j(old) + Δw_0j,  v_0k(new) = v_0k(old) + Δv_0k
Step 10: Test the stopping condition (for example, a target number of epochs or the error falling below a tolerance)
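The following is a minimal NumPy sketch of one training step of the algorithm above for a single-hidden-layer network with sigmoid activations; the layer sizes, learning rate, and training pair are illustrative assumptions.

import numpy as np

# One BPN training step for a 1-hidden-layer network with sigmoid activations.
def f(x):                      # activation function
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):                # derivative: f'(x) = f(x) * (1 - f(x))
    s = f(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, alpha = 2, 3, 1, 0.5

# Step 1: small random weights near zero (w: input -> hidden, v: hidden -> output)
w, w0 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
v, v0 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

x, t = np.array([1.0, 0.0]), np.array([1.0])    # one training pair (input, target)

# Steps 4-6: feedforward
z_in = w0 + x @ w              # z_in,j = w_0j + sum_i x_i w_ij
z = f(z_in)
y_in = v0 + z @ v              # y_in,k = v_0k + sum_j z_j v_jk
y = f(y_in)

# Step 7: backpropagation of errors
delta_k = (t - y) * f_prime(y_in)             # output-layer error terms
delta_j = (delta_k @ v.T) * f_prime(z_in)     # hidden-layer error terms

# Steps 8-9: weight and bias corrections, then update
v += alpha * np.outer(z, delta_k);  v0 += alpha * delta_k
w += alpha * np.outer(x, delta_j);  w0 += alpha * delta_j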
Gradient Descent: The algorithm for minimizing the loss function by iteratively
updating the weights based on the gradient of the loss.
Learning Rate: A hyperparameter that controls the size of the steps taken during
weight updates. If it’s too large, the model may fail to converge; if too small, training
can be too slow.
Activation Functions: Functions like sigmoid, ReLU, or tanh, which add non-
linearity to the model and determine the output of each neuron.
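To make the learning-rate trade-off concrete, the short sketch below runs plain gradient descent on a toy one-parameter loss; the specific loss, step count, and learning rates are assumptions chosen only for illustration.

# Gradient descent on the toy loss L(w) = (w - 3)^2, whose gradient is 2(w - 3).
def gradient_descent(lr, steps=25, w=0.0):
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - lr * grad          # weight update: step size controlled by lr
    return w

print(gradient_descent(lr=0.1))    # converges close to the minimum at w = 3
print(gradient_descent(lr=0.01))   # too small: still far from 3 after 25 steps
print(gradient_descent(lr=1.1))    # too large: the iterates oscillate and diverge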
Regularization
1. L1 and L2 Regularization:
These are the most common forms of regularization, which work by adding a penalty term to
the loss function that discourages large weight values.
L2 Regularization (Ridge): Adds the sum of the squared weights to the loss function:
Loss = Loss_original + λ Σ_i w_i²
L1 Regularization (Lasso): Adds the sum of the absolute values of the weights to the loss function:
Loss = Loss_original + λ Σ_i |w_i|
This can lead to sparsity, meaning some weights become exactly zero, effectively
reducing the number of features used by the model.
λ is the regularization strength. A larger value increases the penalty on the weights.
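A minimal sketch of adding these penalty terms to a base loss is shown below; the value of λ and the helper names are illustrative assumptions.

import numpy as np

lam = 0.01                                   # regularization strength (lambda)

def l2_penalty(w):
    return lam * np.sum(w ** 2)              # lambda * sum of squared weights

def l1_penalty(w):
    return lam * np.sum(np.abs(w))           # lambda * sum of absolute weights

def regularized_loss(base_loss, w, kind="l2"):
    return base_loss + (l2_penalty(w) if kind == "l2" else l1_penalty(w))

# The corresponding gradient terms added during backpropagation:
#   L2: 2 * lambda * w        (shrinks weights smoothly toward zero)
#   L1: lambda * sign(w)      (can drive some weights to exactly zero -> sparsity)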
2. Dropout:
Dropout is a technique that randomly “drops” or ignores a subset of neurons during each
forward and backward pass during training. This prevents the network from becoming too
reliant on any one particular neuron, promoting robustness and reducing overfitting.
For each neuron in a given layer, a probability p (e.g., 0.5) is used to determine
whether it will be ignored (dropped) during training.
During inference (when the model is being used for predictions), all neurons are used,
but their outputs are scaled by 1 − p to balance their contribution.
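The NumPy sketch below mirrors this description (drop with probability p during training, scale by 1 − p at inference); the function names are assumptions for illustration.

import numpy as np

def dropout_train(activations, p=0.5, rng=None):
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep each neuron with probability 1 - p
    return activations * mask

def dropout_inference(activations, p=0.5):
    return activations * (1.0 - p)              # scale outputs to match the training-time expectation

(The common "inverted dropout" variant instead scales by 1/(1 − p) during training and leaves inference unchanged.)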
3. Early Stopping:
Training is halted when performance on a held-out validation set stops improving, which prevents the model from continuing to fit noise in the training data.
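A minimal patience-based early-stopping rule might look like the sketch below, driven here by a pre-computed list of validation losses purely for illustration; in practice each loss would come from evaluating the model after a training epoch.

def early_stop_epoch(val_losses, patience=3):
    best, since_best = float("inf"), 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best, since_best = val_loss, 0       # validation loss improved
        else:
            since_best += 1
            if since_best >= patience:
                return epoch                     # stop: no improvement for `patience` epochs
    return len(val_losses) - 1                   # early stopping never triggered

# Best loss at epoch 2, then no improvement: training stops at epoch 5.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]))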
4. Batch Normalization:
Batch normalization normalizes the inputs of each layer (specifically, the mini-batches)
during training. This can improve training speed and stability while also having a slight
regularization effect by introducing a small amount of noise in the mini-batch estimates.
5. Data Augmentation:
Instead of modifying the model, data augmentation involves modifying the data to introduce
variability, which reduces overfitting. For example, in image classification tasks,
transformations like rotation, flipping, or cropping can artificially increase the size of the
training set, making the model more robust.
6. Weight Constraints:
Constraints can be applied on the values of weights to keep them within a specific range
during training, limiting the capacity of the model and making it less prone to overfitting.
Backpropagation calculates gradients that guide how to update weights to minimize error,
while regularization modifies the loss function to penalize overly complex models (with
large weights), improving generalization.
For example, in L2 regularization, backpropagation will not only minimize the prediction
error but also attempt to minimize the sum of squared weights. This keeps the model simpler
and less prone to overfitting.
Together, backpropagation and regularization form a powerful framework for training neural
networks that generalize well to unseen data.
Batch normalization is a technique used in training deep neural networks to improve both
speed and performance. It works by normalizing the inputs of each layer so that they have a
stable distribution throughout training, which addresses the problem of internal covariate
shift (where the distribution of layer inputs changes as the model learns).
How It Works:
1. Normalization: For each mini-batch during training, the mean and variance of the
layer’s inputs are computed. The inputs are then normalized to have a mean of 0 and a
variance of 1.
2. Scale and Shift: After normalization, the outputs are scaled and shifted by two learnable
parameters γ (scale) and β (shift). These parameters allow the model to adapt to different
input distributions and recover any necessary information that may have been lost during
normalization.
Benefits:
Faster Training: It allows for higher learning rates, accelerating the training process.
Improved Performance: Models often generalize better and overfit less, partly due to a
regularization effect from batch normalization.
Reduced Internal Covariate Shift: It stabilizes the learning process by keeping the input
distributions consistent throughout the network layers.
In practice, batch normalization is typically inserted after the linear transformation of each
layer but before the activation function.
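A minimal NumPy sketch of the training-time batch-normalization computation is given below; eps and the helper names are assumptions, and the final comment shows the typical placement relative to the linear transformation and activation.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)

# Typical placement:  h = activation(batch_norm(x @ W + b, gamma, beta))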
The VC dimension is a concept from statistical learning theory that measures the capacity or
complexity of a model. Specifically, it quantifies the ability of a model (like a classifier) to
shatter a dataset, meaning the ability of the model to correctly classify any arbitrary labeling
of data points.
The VC dimension is important because it helps to assess the capacity of a learning model
to fit data, which impacts the model's ability to generalize.
Definition of VC Dimension:
For a model (or hypothesis class H), the VC dimension is defined as the largest number d
such that there exists a set of d points that the model can shatter.
"Shattering" means that for every possible binary labeling of those d points, the model can
classify them perfectly.
A higher VC dimension means the model is more complex and has a greater capacity to fit
diverse patterns in the data.
Example:
Consider a simple model like a linear classifier in 2D space (i.e., a line that separates data
points). The VC dimension of this model is 3, because a line can realize every possible
labeling of 3 points in general position (not all on one line). However, no set of 4 points can
be shattered: for instance, the XOR-style labeling of 4 points at the corners of a square
cannot be separated by any single line.
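This example can be checked by brute force: the sketch below tests every binary labeling of a point set for linear separability using a small feasibility linear program in SciPy; the specific point sets are illustrative assumptions.

import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find (w, b) with y_i * (w . x_i + b) >= 1 for every point.
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    # The class of linear classifiers shatters X if every +/-1 labeling is realizable.
    return all(linearly_separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

three_pts = np.array([[0, 0], [1, 0], [0, 1]])             # 3 points in general position
four_pts = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])      # corners of a square
print(shattered(three_pts))   # True  -> a line can realize all 8 labelings
print(shattered(four_pts))    # False -> the XOR labeling cannot be separated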
In the context of neural networks, the VC dimension becomes more complex due to the
flexibility and high expressiveness of deep neural architectures. Neural networks are
powerful function approximators and can theoretically shatter very large sets of data points,
leading to a high VC dimension.
While calculating the exact VC dimension of a neural network is challenging, it is known to
grow with the number of parameters: for a network with W weights, the VC dimension is
typically on the order of O(W) (up to factors that depend on the architecture and activation
functions), meaning the network can shatter roughly as many data points as it has weights.
This suggests that large neural networks (e.g., deep networks) have a high VC
dimension and, thus, a high capacity to fit complex datasets.
A high VC dimension means that a neural network has high capacity and can fit very
complex datasets. However, this also increases the risk of overfitting, where the
network fits noise in the training data rather than capturing true underlying patterns.
Regularization techniques (like L2 regularization, dropout, early stopping) are often
used to reduce the effective capacity of the network and improve generalization.
Trade-off (Bias-Variance): A model with a higher VC dimension can achieve lower bias but tends to have higher variance, so its capacity must be matched to the amount of available training data.
Shallow Network:
Hidden Layer(s): One or two layers where computations occur. Each hidden layer
consists of neurons that apply activation functions to the weighted sums of their
inputs.
Output Layer: Produces the final output, which can be a classification, regression
value, or another type of result.
Overfitting Risk: Less prone to overfitting with small datasets, but can underfit if the
problem is complex.
Applications
Feature extraction: When the features are well-defined and the relationships are not
too complex.
Limitations
Performance: Struggles with complex tasks where deep learning excels, like image
and speech recognition.
Capacity: May not capture intricate patterns in data due to limited layers.
Deep Networks
Deep networks, often referred to as deep neural networks (DNNs), consist of multiple layers
of neurons, allowing them to learn complex patterns in data. Here’s an overview of their key
features, structure, and applications:
Input Layer: Accepts the input data, which could be images, text, audio, etc.
Hidden Layers: Composed of many layers (often dozens or hundreds). Each layer
transforms the input through learned weights and activation functions.
Output Layer: Produces the final output, such as class probabilities in classification
tasks or continuous values in regression.
Depth: The primary feature is the number of hidden layers, enabling the network to
learn hierarchical representations.
Complexity: Can model intricate relationships and patterns in data due to the high
number of parameters.
Advantages
Transfer Learning: Pre-trained deep networks can be fine-tuned for specific tasks,
saving time and resources.
Applications
Challenges
Interpretability: More complex and harder to interpret than shallow networks, which
can be a drawback in critical applications.
Shallow Networks vs. Deep Networks:
1. A shallow network has one hidden layer (or very few hidden layers); a deep network has many hidden layers, with more neurons in each layer.
2. A shallow network takes input only as vectors; a deep network can take raw data such as images or text as input.
3. A shallow network needs more parameters to achieve a good fit; a deep network can fit functions better with fewer parameters than a shallow network.
Convolution Networks
Convolutional Neural Networks (CNNs) are a specialized type of deep neural
network primarily designed for processing structured grid data, such as images. They
excel in tasks involving image recognition, object detection, and other visual tasks
due to their unique architecture. Here’s an overview of CNNs, including their
structure, characteristics, and applications:
Convolutional Layers: Apply learnable filters (kernels) that slide across the input and
produce feature maps capturing local patterns such as edges and textures.
Pooling Layers: Reduce the spatial dimensions of the feature maps (downsampling),
retaining important information while lowering the computational load. Common types
include max pooling and average pooling.
Fully Connected Layers: After several convolutional and pooling layers, the high-
level reasoning in the neural network is performed by fully connected layers, where
every neuron is connected to every neuron in the previous layer.
Output Layer: Produces the final output, such as class probabilities for classification
tasks.
Key Characteristics
Parameter Sharing: Filters are applied across the entire input, reducing the number
of parameters and allowing the model to learn features that are spatially invariant.
Advantages
Applications
Image Classification: Identifying the main object in an image (e.g., cats vs. dogs).
Object Detection: Locating and identifying multiple objects within an image (e.g.,
YOLO, Faster R-CNN).
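As a concrete illustration of the convolution → pooling → fully connected structure described above, here is a minimal PyTorch sketch of a small image classifier; the 28×28 grayscale input size, channel counts, and 10 output classes are assumptions.

import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolution: learn local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max pooling: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # fully connected output layer (class scores)
)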
Structure of GANs
1. Generator (G):
o The generator's role is to create fake data samples from random noise. It learns
to produce data that mimics the real data distribution.
o The generator starts with random input (often sampled from a Gaussian
distribution) and transforms it through several layers to produce an output
(e.g., an image).
2. Discriminator (D):
o The discriminator's role is to distinguish between real data samples (from the
training set) and fake samples generated by the generator.
o It outputs a probability score indicating whether a given sample is real or fake.
Training Process
1. Adversarial Setup: The generator and the discriminator are trained together in a two-player minimax game, each trying to outperform the other.
2. Loss Functions:
o The generator's loss is based on how well it can fool the discriminator.
o The discriminator's loss is based on how accurately it can distinguish between
real and fake data.
o These losses are typically computed using binary cross-entropy.
3. Iterative Updates:
o During training, the generator and discriminator are updated alternately. The
generator improves its ability to create realistic samples, while the
discriminator enhances its ability to identify fakes.
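A minimal PyTorch sketch of one alternating GAN update with binary cross-entropy losses is given below; the network sizes, noise dimension, and learning rates are illustrative assumptions.

import torch
import torch.nn as nn

noise_dim, data_dim = 64, 784    # assumed sizes: 64-d noise -> 784-d flattened "images"

G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())       # generator
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())           # discriminator

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator update: real samples -> 1, generated samples -> 0.
    fake = G(torch.randn(n, noise_dim)).detach()
    loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: try to make the discriminator label fakes as real.
    fake = G(torch.randn(n, noise_dim))
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()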
Key Characteristics
Applications
Challenges
Training Complexity: Achieving a balance between the generator and discriminator can
be difficult, often requiring careful tuning of hyperparameters.
Evaluation: Assessing the quality of generated samples can be subjective; metrics like
Inception Score and Fréchet Inception Distance (FID) are used but have limitations.
Ethical Concerns: The ability to generate realistic fake images can lead to misuse, such
as deepfakes and misinformation.
Semi-supervised learning
Semi-supervised learning is a machine learning approach that combines both labeled and
unlabeled data to improve the learning process. This method is particularly useful when
obtaining labeled data is expensive or time-consuming, while unlabeled data is more
abundant. Here’s a detailed overview:
Key Concepts
Labeled Data: Data that comes with corresponding output labels (e.g., images of cats labeled
as "cat").
Unlabeled Data: Data without associated labels (e.g., images of animals without any
annotations).
Learning Objective: The goal is to leverage the unlabeled data to enhance the model’s
performance on the task, often achieving better results than using labeled data alone.
How It Works
Training Process:
Initial Training: The model is initially trained on the available labeled data.
Utilization of Unlabeled Data: The model is then used to make predictions on the unlabeled
data, generating pseudo-labels based on its confidence.
Iterative Refinement: The model is retrained using a combination of labeled and pseudo-
labeled data, refining its ability to generalize.
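One round of this pseudo-labeling procedure might look like the following sketch (using scikit-learn); the classifier choice and the 0.95 confidence threshold are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    # Initial training on the labeled data only.
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    # Predict on the unlabeled data and keep only confident predictions as pseudo-labels.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = model.classes_[proba.argmax(axis=1)[confident]]

    # Retrain on the labeled data plus the confidently pseudo-labeled data.
    X_new = np.vstack([X_labeled, X_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, pseudo_y])
    return LogisticRegression(max_iter=1000).fit(X_new, y_new)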
Techniques:
Consistency Regularization: Encourages the model to produce consistent outputs for the same
input under different perturbations (e.g., noise or transformations).
Generative Models: Some approaches involve using generative models (like GANs) to create
additional labeled data or to better understand the data distribution.
Graph-Based Methods: Use graph structures to model relationships between labeled and
unlabeled data points, leveraging their connectivity to propagate labels.
Advantages
Reduced Labeling Cost: Fewer labeled samples are needed, saving time and resources.
Better Generalization: Leveraging large amounts of unlabeled data can lead to improved
model generalization and performance.
Flexibility: Can be applied to various types of tasks, including classification, regression, and
clustering.
Applications
Image Classification: Using large sets of unlabeled images alongside a smaller set of labeled
images to improve classification accuracy.
Natural Language Processing: Tasks like sentiment analysis, where unlabeled text data can
enhance understanding.
Speech Recognition: Leveraging large amounts of unlabeled audio data to improve models
trained on smaller labeled datasets.
Medical Imaging: Where labeling can be costly and requires expert knowledge; semi-
supervised methods can help utilize vast amounts of unlabeled scans.