Gen Aiml Notes by Piyush

Optimizers in deep learning are algorithms that adjust a neural network's weights to minimize the loss function during training, with common types including SGD, Adam, and RMSprop. CNNs are specialized neural networks designed for visual data analysis, featuring layers that extract and classify features, while techniques like dropout and regularization help mitigate overfitting. GANs utilize a competitive framework of generator and discriminator networks to produce realistic data, facing challenges such as mode collapse and training instability.

Optimizers in Deep Learning

In deep learning, optimizers are algorithms that adjust a neural network's


weights to minimize the loss function, a crucial step in model training. They
work by iteratively refining the weights and biases based on feedback from the
data, guiding the model towards a minimum loss. Common optimizers include
Stochastic Gradient Descent (SGD), Adam, and RMSprop.

How Optimizers Work:


• Iterative Refinement:
Optimizers repeatedly update the network's parameters (weights and biases) in
small, incremental steps.
• Loss Function Minimization:
The goal is to find the parameter values that minimize the loss function, which
measures the difference between the model's predictions and the actual
values.
• Learning Rate:
Optimizers also manage the learning rate, which controls how much the
parameters are adjusted during each update.
• Momentum and Adaptive Learning Rates:
Some optimizers incorporate momentum (like SGD with momentum) to help
the algorithm move more efficiently towards the minimum, while others adapt
the learning rate for each parameter (like AdaGrad and Adam).
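To make the update steps above concrete, here is a minimal sketch of one SGD-with-momentum update in Python/NumPy. The function name sgd_momentum_step and the toy loss f(w) = w^2 are illustrative, not from any library.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One parameter update of SGD with momentum."""
    # Accumulate a decaying sum of past gradient steps (the "momentum" term).
    velocity = momentum * velocity - lr * grads
    # Move the parameters along the accumulated direction.
    params = params + velocity
    return params, velocity

# Toy example: minimise f(w) = w^2, whose gradient is 2w.
w = np.array([5.0])
v = np.zeros_like(w)
for _ in range(100):
    grad = 2 * w
    w, v = sgd_momentum_step(w, grad, v)
print(w)  # close to 0, the minimiser of f
```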

Types of Optimizers
Key Optimizers:
• Stochastic Gradient Descent (SGD):
A fundamental optimizer that updates parameters based on the gradient
calculated from a randomly selected subset of the training data (a batch).
• Adam (Adaptive Moment Estimation):
A popular optimizer that combines the ideas of momentum and RMSprop,
using adaptive learning rates for each parameter.
• RMSprop (Root Mean Squared Propagation):
Another adaptive learning rate optimizer that uses a moving average of
squared gradients to adapt the learning rate for each parameter.
• Other Optimizers:
AdaGrad, Adadelta, and Nesterov Accelerated Gradient (NAG) are also used,
each with its own strengths and weaknesses.

Adam vs. RMSprop vs. SGD

Adam and RMSprop are adaptive learning rate optimizers, meaning they adjust
the learning rate for each parameter individually based on past gradients,
potentially leading to faster convergence and better performance compared to
SGD (Stochastic Gradient Descent). SGD, on the other hand, uses a global
learning rate for all parameters.
Key Differences and Comparisons:
• Adaptive Learning Rates:
Adam and RMSprop adapt learning rates, while SGD uses a fixed learning rate.
• First and Second Moments:
Adam uses both the first (mean) and second (uncentered variance) moments of
the gradients, while RMSprop focuses on the second moment.
• Convergence Speed:
Adam and RMSprop generally converge faster than SGD, especially in complex
landscapes with noisy gradients.
• Hyperparameter Tuning:
Adam typically requires less hyperparameter tuning compared to SGD and
RMSprop.
• Memory Usage:
All three optimizers have relatively low memory requirements.
When to use each optimizer:
• SGD: Suitable for problems with simple loss landscapes and where
computational resources are limited.
• RMSprop: Effective when dealing with noisy gradients or non-stationary
objectives.
• Adam: A good general-purpose choice, often preferred for its adaptive
learning rates and relatively good performance across various tasks.
In summary: Adam is a popular choice due to its adaptive learning rates and
ability to handle noisy gradients, while RMSprop focuses on the second
moment of the gradients, and SGD is a simpler, less computationally intensive
option.
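As a concrete comparison, the sketch below shows how the three optimizers might be set up, assuming PyTorch; the model, learning rates, and data are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any model works here

sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)          # global learning rate
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)   # second-moment average
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999)) # first + second moments

# The training step looks the same regardless of which optimizer is chosen:
x, y = torch.randn(32, 10), torch.randn(32, 1)
criterion = nn.MSELoss()
for optimizer in (sgd, rmsprop, adam):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```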
CNN Architecture

A Convolutional Neural Network (CNN) is a type of deep neural network (DNN)


specifically designed to analyze visual data like images. Its architecture is
characterized by a hierarchy of convolutional and pooling layers, followed by
fully connected layers for classification. This structure enables CNNs to
automatically extract features from data, making them highly effective in tasks
like image recognition and object detection.
CNNs consist of multiple layers: the input layer, convolutional layers, pooling
layers, and fully connected layers. Let's learn more about CNNs in detail.

Layers Used to Build ConvNets


A complete Convolutional Neural Network architecture is also known as a ConvNet.
A ConvNet is a sequence of layers, and every layer transforms one volume of
activations to another through a differentiable function.
• Input Layer:
This layer receives the input data, such as an image, and passes it to the hidden
layers.
• Hidden Layers:
These layers are the core of the CNN and perform the following tasks:
• Convolutional Layers: These layers use filters to extract features
from the input data.
• Pooling Layers: These layers reduce the spatial dimensions of the
feature maps, making the model more robust to variations in the
input.
• Activation Functions: These functions introduce non-linearity into
the model, enabling it to learn complex patterns.
• Fully Connected Layers:
These layers take the output of the convolutional and pooling layers and use it
to make predictions.
• Output Layer:
This layer provides the final prediction, which could be a class label or
probability scores.

• Flattening: The resulting feature maps are flattened into a one-dimensional
vector after the convolution and pooling layers so they can be passed into a
fully connected layer for classification or regression.
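A minimal sketch of this layer ordering (convolution, activation, pooling, flattening, fully connected), assuming PyTorch; the channel counts, kernel sizes, and input resolution are illustrative.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: extracts features
    nn.ReLU(),                                   # activation: adds non-linearity
    nn.MaxPool2d(2),                             # pooling: halves the spatial dimensions
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # flattening: 3-D feature maps -> 1-D vector
    nn.Linear(32 * 8 * 8, 10),                   # fully connected layer: class scores
)

scores = cnn(torch.randn(1, 3, 32, 32))  # e.g. one 32x32 RGB image
print(scores.shape)  # torch.Size([1, 10])
```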

Advantages of CNNs
1. Good at detecting patterns and features in images, videos, and audio
signals.
2. Robust, to a degree, to translation, rotation, and scaling of the input.
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of CNNs
1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper regularization is
used.
3. Requires large amounts of labeled data.
4. Interpretability is limited; it is hard to understand what the network has
learned.

Mathematical Overview of Convolution


Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
• Convolution layers consist of a set of learnable filters (or kernels) having
small widths and heights and the same depth as the input volume (e.g., 3 for an
RGB image input).
• For example, suppose we run convolution on an image with dimensions
34x34x3. The possible filter size is a x a x 3, where 'a' can be 3, 5, or 7, but
smaller than the image dimension.
• During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can have a
value of 2, 3, or even 4 for high-dimensional images) and compute the
dot product between the kernel weights and patch from input volume.
• As we slide our filters we get a 2-D output (feature map) for each filter;
stacking them together gives an output volume with a depth equal to the number
of filters. The network learns all of these filters during training (see the
output-size sketch below).
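The output-volume arithmetic described above can be sketched as follows; the helper name conv_output_size and the example numbers are illustrative.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size of the output feature map along one dimension:
    (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 34x34x3 image, 10 filters of size 5x5x3, stride 1, no padding:
out = conv_output_size(34, 5)   # 30
print(out, out, 10)             # output volume is 30x30x10 (depth = number of filters)
```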
Difference between underfitting and overfitting; explain techniques
to avoid them.
Underfitting and overfitting are two common challenges in machine learning
where a model's performance deviates from its intended purpose. Underfitting
occurs when a model is too simple and fails to capture the underlying patterns
in the data, leading to poor performance on both training and testing
data. Overfitting, on the other hand, happens when a model is too complex
and learns the training data too well, including noise, resulting in excellent
performance on the training data but poor performance on unseen data.

Underfitting:
• Cause: The model is too simplistic and cannot learn the complex
relationships in the data.
• Symptoms: Poor performance on both training and testing data.
• Mitigation (techniques to avoid underfitting):
o Increase model complexity: Use more complex models,
algorithms, or features.
o Train longer: Allow the model to train for a longer period.
o Add more features: Provide the model with more relevant input features.
o Reduce noise: Clean the training data to remove irrelevant
information.
o Reduce regularization: Decrease the penalty on model
parameters.

Overfitting:
• Cause:
The model learns the training data too well, including noise and irrelevant
details.
• Symptoms:
Excellent performance on training data, but poor performance on unseen data.

• Techniques to Avoid Overfitting

• Use more training data: Provide the model with a larger and more
diverse training set.
• Data augmentation: Generate synthetic training data to increase
the dataset size.
• Add noise to the input data: Introduce random variations in the
training data to make the model more robust.
• Feature selection: Select only the most relevant features and
remove irrelevant ones.
• Regularization: Add penalties to the model's parameters to
prevent them from becoming too large.
• Early stopping: Monitor the model's performance on a validation
set and stop training when performance starts to degrade.
• Simplify the model: Use a less complex model or algorithm.
• Use cross-validation: Evaluate the model's performance on
different subsets of the data to get a more reliable estimate of its
performance.
• Use ensembling: Combine the predictions of multiple models to
reduce overfitting.

1. Dropout

Definition:
Dropout is a regularization technique where, during training, a random subset
of neurons is “dropped out” (i.e., temporarily deactivated) in each forward
pass.
• This prevents the network from becoming too reliant on specific
neurons.
• At each iteration, different neurons are dropped, forcing the network to
learn redundant, generalized patterns.
Benefits:
• Reduces overfitting.
• Encourages independence and redundancy in feature learning.
• Improves generalization to unseen data.

Typical Dropout Rate:


• Usually between 0.2 and 0.5 (i.e., 20% to 50% of neurons are dropped
during training).
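A minimal sketch of dropout in practice, assuming PyTorch; the layer sizes are placeholders. Note that dropout is only active in training mode.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% of activations are zeroed at random during training
    nn.Linear(64, 10),
)

net.train()              # dropout is active
out_train = net(torch.randn(4, 128))

net.eval()               # dropout is disabled at inference time
out_eval = net(torch.randn(4, 128))
```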

2. Regularization

Definition:
Regularization refers to adding a penalty to the loss function to prevent the
model from learning overly complex or extreme weights.
There are two main types:

L1 Regularization (Lasso)
• Adds the absolute value of weights to the loss function.
• Encourages sparsity — some weights become exactly zero.
• Good for feature selection.
Loss function becomes:
L = L_{\text{original}} + \lambda \sum_i |w_i|

L2 Regularization (Ridge)
• Adds the squared value of weights to the loss function.
• Encourages small weights, helping the model generalize better.
Loss function becomes:
L = L_{\text{original}} + \lambda \sum_i w_i^2
• λ (lambda) is a regularization hyperparameter that controls the strength
of the penalty.
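A minimal sketch of both penalties in a training step, assuming PyTorch. The built-in weight_decay argument applies an L2-style penalty, while the L1 term is added to the loss by hand here; the model, data, and λ values are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()

# L2 (Ridge): weight_decay applies a lambda * sum(w^2)-style penalty in the update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 20), torch.randn(32, 1)
lambda_l1 = 1e-4  # regularization strength (the lambda in the formulas above)

optimizer.zero_grad()
loss = criterion(model(x), y)
# L1 (Lasso): add lambda * sum(|w_i|) over the parameters explicitly.
loss = loss + lambda_l1 * sum(p.abs().sum() for p in model.parameters())
loss.backward()
optimizer.step()
```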
Module III: GANs & Autoencoders

Generative Adversarial Networks (GANs):


What is a GAN?
A Generative Adversarial Network (GAN) is a generative model that learns to
generate new, realistic data (e.g., images, audio, or text) by using two neural
networks in competition with each other:
• Generator (G): Learns to generate fake data.
• Discriminator (D): Learns to distinguish real data from fake.
This creates an adversarial game, where the Generator tries to fool the
Discriminator, and the Discriminator tries to catch the Generator.
Generative Adversarial Networks (GANs) are a type of deep learning model
that generate new data instances, like images or text, by training two
competing neural networks: a generator and a discriminator. The generator
creates fake data, while the discriminator tries to distinguish between the fake
data and real data. This competitive process allows the generator to learn to
produce data that is increasingly difficult to distinguish from real data.

Core Components: How It Works


• Generator:
This network takes random noise as input and attempts to generate new data
samples (e.g., images) that resemble real data from the training dataset.
• Discriminator:
This network acts as an "adversary" and tries to determine whether a given
data sample is real (from the training dataset) or fake (generated by the
generator).
The two networks engage in a continuous game of cat and mouse: the
Generator improves its ability to create realistic data, while the Discriminator
becomes better at detecting fakes. Over time, this adversarial process leads to
the generation of highly realistic and high-quality data.

Training Progression
• As training continues, the generator becomes highly proficient at
producing realistic data.
• Eventually, the discriminator struggles to distinguish real from fake,
indicating that the GAN has reached a well-trained state.
• At this point, the generator can be used to generate high-quality
synthetic data for various applications.
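A minimal sketch of this adversarial training loop on toy two-dimensional data, assuming PyTorch; the network sizes, learning rates, and the synthetic "real" data are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, data_dim) + 3.0   # stand-in for samples from the real dataset
    z = torch.randn(64, latent_dim)
    fake = G(z)

    # Discriminator: push real samples towards 1 and fake samples towards 0
    opt_D.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_D.step()

    # Generator: try to make the discriminator output 1 on fakes
    opt_G.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_G.step()
```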

Application Of Generative Adversarial Networks (GANs)


1. Image Synthesis & Generation: GANs generate realistic images, avatars,
and high-resolution visuals by learning patterns from training data. They
are widely used in art, gaming, and AI-driven design.
2. Image-to-Image Translation: GANs can transform images between
domains while preserving key features. Examples include converting day
images to night, sketches to realistic images, or changing artistic styles.
3. Text-to-Image Synthesis: GANs create visuals from textual descriptions,
enabling applications in AI-generated art, automated design, and content
creation.
4. Data Augmentation: GANs generate synthetic data to improve machine
learning models, making them more robust and generalizable, especially
in fields with limited labeled data.
5. High-Resolution Image Enhancement: GANs upscale low-resolution
images, improving clarity for applications like medical imaging, satellite
imagery, and video enhancement.
Advantages of GANs
The advantages of GANs are as follows:
1. Synthetic data generation: GANs can generate new, synthetic data that
resembles some known data distribution, which can be useful for data
augmentation, anomaly detection, or creative applications.
2. High-quality results: GANs can produce high-quality, photorealistic
results in image synthesis, video synthesis, music synthesis, and other
tasks.
3. Unsupervised learning: GANs can be trained without labeled data,
making them suitable for unsupervised learning tasks, where labeled
data is scarce or difficult to obtain.
4. Versatility: GANs can be applied to a wide range of tasks, including
image synthesis, text-to-image synthesis, image-to-image translation,
anomaly detection, data augmentation, and others.

Q. Discuss some challenges encountered while training a GAN model and the
probability of each occurring.
ANS:
Training GANs presents several challenges, primarily due to the adversarial
nature of the training process and the inherent instability that can arise. These
challenges include mode collapse, non-convergence, vanishing gradients, and
difficulty in balancing the generator and discriminator.
1.Mode Collapse:
• Problem:
The generator produces a limited variety of outputs, focusing on only a few
"modes" of the data distribution.
• Probability:
High, as the generator can easily fall into the trap of optimizing for the
discriminator's current state, ignoring other possible outputs.
• Example:
A GAN trained on images of faces might only generate one type of face or a
limited set of variations.

2.Non-Convergence:
• Problem:
The training process stalls, with the generator and discriminator failing to reach
a stable equilibrium.
• Probability:
Moderate, as the adversarial nature can lead to oscillations and instability in
the training process.
• Example:
The model parameters might keep oscillating and not converge to a stable
solution.

3.Vanishing Gradients:
• Problem:
The discriminator becomes too good at distinguishing real and fake data,
leading to small or zero gradients for the generator, slowing down learning.
• Probability:
Moderate, as the discriminator's success can hinder the generator's progress,
especially in early stages of training.
• Example:
The generator's parameters might not update effectively, leading to slow or no
learning.

4.Difficulty in Balancing Generator and Discriminator:


• Problem:
One network (usually the discriminator) becomes dominant, making the other
network unable to learn effectively.
• Probability:
High, as the adversarial nature of GANs can lead to an imbalance between the
two networks.
• Example:
The discriminator might be so good at distinguishing real and fake images that
the generator's gradients are negligible.

GANs face challenges like mode collapse, where the generator produces limited
diverse outputs, and training instability, where the generator and discriminator
oscillate or fail to converge. Mode collapse happens when the generator
focuses on a few modes of the data, while training instability can result from
the discriminator becoming too powerful, hindering generator updates.
Elaboration:
• Mode Collapse:
This occurs when the generator learns to produce a limited variety of samples,
often focusing on a small subset of the data distribution. It happens when the
generator finds a way to fool the discriminator by producing a few specific
types of samples, neglecting the broader diversity of the data.
• Training Instability:
GANs can experience instability due to the adversarial nature of their training,
where the generator and discriminator are competing against each other. The
discriminator might become too successful, leading to vanishing gradients for
the generator, or the models might oscillate and fail to converge to a stable
equilibrium.
Causes and Solutions:
• Mode Collapse:
Several factors can contribute to mode collapse, including an imbalance in the
training dynamics between the generator and discriminator, or the generator's
ability to find a "winning" strategy to fool the discriminator without exploring
the full data distribution. Techniques like Wasserstein Loss, unrolling, and
progressive growing have been proposed to address mode collapse.
• Training Instability:
Instability can arise from issues like the discriminator becoming too strong,
resulting in vanishing gradients for the generator. Techniques like batch
normalization, spectral normalization, and using WGAN-GP loss can help
stabilize training.
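As an illustration of two of the stabilization techniques mentioned above, the sketch below applies spectral normalization to the discriminator layers and batch normalization inside the generator, assuming PyTorch; the layer sizes are placeholders.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Discriminator with spectral normalization (constrains how sharply D can change)
D = nn.Sequential(
    spectral_norm(nn.Linear(2, 64)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(64, 1)),
)

# Generator with batch normalization between layers
G = nn.Sequential(
    nn.Linear(16, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
```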

6. Use Cases of GANs

Image Synthesis
• Generate new faces, fashion items, anime characters (StyleGAN,
DCGAN).
• Example: thispersondoesnotexist.com

Data Augmentation
• Synthetic data for imbalanced classes or privacy preservation.
• Used in medical imaging and autonomous driving.

Image-to-Image Translation
• Convert sketch ↔ photo (Pix2Pix)
• Horse ↔ Zebra (CycleGAN)

Super-Resolution
• Enhance low-quality images (SRGAN).
Generative AI models

Generative AI models are a type of machine learning that can create new data,
like text, images, audio, or video, that resembles the data they were trained
on. These models learn patterns and structures from existing data to generate
novel content. Some key types include Generative Adversarial Networks
(GANs), Variational Autoencoders (VAEs), and diffusion models, each with its
own strengths and applications.

TYPES
Generative Adversarial Networks (GANs)
Autoencoders (AE)
Variational Autoencoders (VAEs)
Autoregressive Models
Conditional Generative Models

Autoencoders and Variational Autoencoders (VAEs) are both neural network


architectures used for unsupervised learning, primarily for dimensionality
reduction, data compression, and generative modeling. However, VAEs differ
from standard autoencoders by encoding the latent representation as a
probability distribution, rather than a fixed vector, enabling them to generate
new data samples.
Differences:
• Latent Space:
Autoencoders map inputs to a single, fixed point in the latent space, while VAEs
map inputs to a probability distribution over the latent space, typically a
Gaussian distribution.
• Generative Capabilities:
Autoencoders primarily focus on reconstruction, compressing and
reconstructing input data. VAEs can generate new data samples by sampling
from the latent space, allowing them to create new data resembling the
training data.
• Loss Function:
Autoencoders minimize the reconstruction loss (difference between input and
output). VAEs minimize a combination of reconstruction loss and a loss term
(Kullback-Leibler divergence) that encourages the latent space to follow a
standard distribution.
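A minimal sketch of the VAE objective described above, assuming PyTorch: a reconstruction term plus the closed-form KL divergence to a standard Gaussian, together with the reparameterization trick used to sample the latent vector. Function names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction loss (here: binary cross-entropy over pixel values in [0, 1])
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps so gradients can flow through mu and log_var
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std
```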

Applications:
• Autoencoders:
• Feature extraction and representation learning: Identifying
important features within data.
• Data compression: Reducing the dimensionality of data while
retaining important information.
• Image denoising and reconstruction: Improving the quality of
images by reducing noise.
• VAEs:
• Generative modeling: Creating new data samples similar to the
training data, useful for tasks like image and text generation.
• Data augmentation: Generating synthetic data to expand the
training dataset.
• Anomaly detection: Identifying deviations from the normal
patterns in data.
• Industrial quality control: Detecting defects in products by
analyzing their generated images.
• Natural Language Processing (NLP): Generating text, translating
languages, and more.
In essence:
• Autoencoders are good for tasks requiring reconstruction, feature
learning, and data compression.
• VAEs excel in generative tasks, allowing for the creation of new, diverse
data samples.

4. Conditional Generative Models


• Conditional GAN (cGAN): Adds labels or conditions to both Generator
and Discriminator.
G(z | y), D(x | y)
• Example: Generate images of a specific digit/class.

Other examples:
• Conditional VAEs (CVAE) – Add condition to encoder/decoder.
• Text-to-image GANs – Generate images based on text input (e.g.,
DALL·E).
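A minimal sketch of the conditioning idea G(z | y), D(x | y), assuming PyTorch: the label y is embedded and concatenated with the noise for the Generator and with the data sample for the Discriminator. All sizes and names are illustrative.

```python
import torch
import torch.nn as nn

num_classes, latent_dim, data_dim = 10, 16, 2
label_emb = nn.Embedding(num_classes, num_classes)

G = nn.Sequential(nn.Linear(latent_dim + num_classes, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim + num_classes, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

y = torch.randint(0, num_classes, (32,))        # conditions, e.g. digit classes
z = torch.randn(32, latent_dim)

fake = G(torch.cat([z, label_emb(y)], dim=1))        # G(z | y)
score = D(torch.cat([fake, label_emb(y)], dim=1))    # D(x | y)
```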
Module IV: Use of Recurrent Neural Networks (RNNs) in Generative
AI

Recurrent Neural Networks (RNNs) are a type of artificial neural network


designed to process sequential data, such as text or time series data, by
maintaining a "memory" of past inputs. This memory allows RNNs to learn
patterns and relationships within sequences, making them well-suited for tasks
like language translation, speech recognition, and time series prediction.
Recurrent Neural Networks (RNNs) work a bit differently from regular neural
networks. In a standard neural network, information flows in one direction, from
input to output. In an RNN, however, information is fed back into the system after
each step. Think of it like reading a sentence: when you're trying to predict the
next word, you don't just look at the current word but also need to remember the
words that came before to make an accurate guess.
RNNs allow the network to "remember" past information by feeding the output
from one step into the next step. This helps the network understand the context
of what has already happened and make better predictions based on that. For
example, when predicting the next word in a sentence, the RNN uses the previous
words to help decide what word is most likely to come next.
How Do RNNs Differ from Feedforward Neural Networks?
Feedforward Neural Networks (FNNs) process data in one direction, from input
to output, without retaining information from previous inputs. This makes them
suitable for tasks with independent inputs, like image classification. However,
FNNs struggle with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow
information from previous steps to be fed back into the network. This feedback
enables RNNs to remember prior inputs, making them ideal for tasks where
context is important.
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the
network:

1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single
input and a single output. It is used for straightforward classification tasks such
as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce
multiple outputs over time. This is useful in tasks where one input triggers a
sequence of predictions (outputs). For example, in image captioning a single
image can be used as input to generate a sequence of words as a caption.

3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is
needed to make one prediction. In sentiment analysis, the model receives a
sequence of words (like a sentence) and produces a single output such as
positive, negative, or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a
sequence of outputs. In a language translation task, a sequence of words in one
language is given as input, and a corresponding sequence in another language
is generated as output.
Variants of Recurrent Neural Networks (RNNs)
There are several variations of RNNs, each designed to address specific
challenges or optimize for certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer where weights are
shared across time steps. Vanilla RNNs are suitable for learning short-term
dependencies but are limited by the vanishing gradient problem, which
hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions,
capturing both past and future context for each time step. This architecture is
ideal for tasks where the entire sequence is available, such as named entity
recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism
to overcome the vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the
cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the
current step. This selective memory enables LSTMs to handle long-term
dependencies, making them ideal for tasks where earlier context is
critical.
• A type of RNN designed to overcome the vanishing gradient problem,
LSTMs can learn long-range dependencies in sequential data. They use
"memory cells" to store information over long time steps.
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and
forget gates into a single update gate and streamlining the output mechanism.
This design is computationally efficient, often performing similarly to LSTMs,
and is useful in tasks where simplicity and faster training are beneficial.
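A minimal sketch comparing the three recurrent layers discussed above, assuming PyTorch; the tensor sizes are placeholders. All three share a similar interface, but the LSTM additionally returns a cell state.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 32)   # (batch, sequence length, input features)

rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)    # vanilla RNN
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # gated, with a cell state
gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)    # gated, no separate cell state

out_rnn, h_rnn = rnn(x)               # outputs + final hidden state
out_lstm, (h_lstm, c_lstm) = lstm(x)  # outputs + final hidden state and cell state
out_gru, h_gru = gru(x)
```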
Advantages of Recurrent Neural Networks
• Sequential Memory: RNNs retain information from previous inputs,
making them ideal for time-series predictions where past data is crucial.
Gated variants such as Long Short-Term Memory (LSTM) networks
strengthen this capability.
• Enhanced Pixel Neighborhoods: RNNs can be combined with
convolutional layers to capture extended pixel neighborhoods, improving
performance in image and video data processing.
Limitations of Recurrent Neural Networks (RNNs)
While RNNs excel at handling sequential data, they face two main training
challenges i.e., vanishing gradient and exploding gradient problem:
1. Vanishing Gradient: During backpropagation, gradients diminish as they
pass through each time step, leading to minimal weight updates. This
limits the RNN’s ability to learn long-term dependencies, which is crucial
for tasks like language translation.
2. Exploding Gradient: Sometimes, gradients grow uncontrollably, causing
excessively large weight updates that destabilize training. Gradient
clipping is a common technique to manage this issue.
These challenges can hinder the performance of standard RNNs on complex,
long-sequence tasks.

Applications of RNNs in Generative AI


RNNs are widely used in generative tasks because of their ability to handle
sequential and time-dependent data. By learning patterns in sequences, they
can generate new data that mimics the structure of the original input.

1. Text Generation with RNNs


RNNs can be trained on a large corpus of text to learn grammar, word usage,
and sentence structure. Once trained, they can generate new text by
predicting one character or word at a time, using the previous output as the
next input.
Example:
Training an RNN on Shakespeare’s plays allows it to generate text in a similar
style, word by word.
How it works:
• Input: A sequence of words or characters
• Output: The predicted next word or character
• This prediction is fed back into the RNN to generate the next one, and
so on.
Used in:
• Chatbots
• Story or poetry generation
• Code auto-completion
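A minimal sketch of the generation loop described above (predict a token, feed it back in, repeat), assuming PyTorch. The vocabulary size, model dimensions, start-token id, and the untrained weights are purely illustrative.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 50, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

token = torch.tensor([[0]])      # start-token id (illustrative)
state = None
generated = []
for _ in range(20):                                    # generate 20 tokens
    out, state = lstm(embed(token), state)             # carry the hidden state forward
    probs = torch.softmax(head(out[:, -1]), dim=-1)
    token = torch.multinomial(probs, num_samples=1)    # sample the next token
    generated.append(token.item())                     # feed it back in on the next step
print(generated)
```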

2. Music Generation Using RNNs


Music is also sequential — notes follow a certain rhythm and pattern. RNNs
can learn musical sequences and generate original compositions that follow
the style and tempo of the training data.
How it works:
• Input: Sequences of notes or MIDI data
• Output: Predicted next note or chord
• The RNN continues generating until the sequence ends
Used in:
• Automatic music composition
• Accompaniment generation
• Style transfer in music (e.g., converting pop music into classical)
RNNs, LSTMs, and GRUs are all used here, with LSTM being more common
due to better memory handling.
3. Speech Synthesis and Recognition

Speech Synthesis (Text-to-Speech)


RNNs are used to convert written text into human-like speech. They model
the relationship between text and audio waveforms to generate natural
sounding speech.
Example:
Google’s WaveNet uses deep RNN-like structures to produce highly realistic
speech by modeling audio signals.

Speech Recognition
RNNs also help in converting spoken language into text. Since speech is time-
sequential, RNNs are good at capturing phonetic and linguistic patterns.
How it works:
• Input: Audio waveform
• Output: Predicted sequence of words
• Often combined with CNNs or attention mechanisms for improved
accuracy
Used in:
• Virtual assistants (e.g., Siri, Alexa)
• Voice commands and transcription
• Automated customer support

Summary
RNNs are essential in generative AI because they:
• Understand sequence and context
• Generate outputs step-by-step, making them ideal for language, music,
and speech
• Are enhanced by LSTM/GRU for longer memory
LSTM vs. RNN (End-Sem Q5, Q7b)
RNNs and LSTMs both process sequential data, but they differ in how they
handle memory. RNNs have a single hidden state that is updated at every time
step. Because they rely only on this state and use simple activation functions
like tanh, they tend to forget long-term dependencies due to the vanishing
gradient problem.
LSTMs improve on this by adding a memory cell and three gates:
• The forget gate decides what information to discard from the memory.
• The input gate decides what new information to store.
• The output gate decides what to pass to the next hidden state.
These gates help LSTM models maintain information over longer sequences. As
a result, LSTMs are more powerful and widely used in tasks such as machine
translation, speech recognition, and text generation, where remembering long-
term dependencies is crucial. However, LSTMs are computationally heavier
than basic RNNs.

Vanishing Gradients (End-Sem Q9b)


The vanishing gradient problem occurs during training of deep neural networks
and RNNs when gradients — which are used to update weights — become
extremely small. In RNNs, this problem is particularly severe because
backpropagation must be applied through many time steps (called
backpropagation through time or BPTT).
When the gradients become too small, the weight updates are negligible, and
the network fails to learn long-range dependencies. This means earlier layers
(or earlier time steps) receive almost no learning signal. For example, if you're
trying to predict the next word in a long sentence, the beginning of the
sentence might have important context, but the network fails to retain it.
This issue arises because functions like tanh and sigmoid squash their inputs
into small ranges, and when you multiply many small values together (as
happens during backpropagation), the product becomes near zero.
To address this, the following techniques are commonly used (see the sketch
after this list):
• LSTMs and GRUs (which are designed to maintain more stable
gradients),
• Gradient clipping (which caps gradients that grow too large, addressing
the related exploding gradient problem),
• ReLU activations (which avoid squashing inputs into a small range).
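A minimal sketch of gradient clipping inside a training step, assuming PyTorch; the model, data, loss, and clipping threshold are placeholders.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 50, 10)
out, _ = model(x)
loss = out.pow(2).mean()          # any differentiable loss works for the sketch

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 (guards against explosion)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```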

Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM)


and Gated Recurrent Unit (GRU) variations, are designed to handle sequential
data and overcome the long-term dependency issues faced by traditional
feedforward neural networks. These models leverage the concept of a "hidden
state" to remember information from previous time steps, allowing them to
capture relationships between distant elements in a sequence.
Key Concepts:
• Sequential Data:
RNNs excel at processing data that has a natural order or sequence, such as
text, speech, time series, and video.
• Hidden State:
RNNs maintain a hidden state at each time step, which captures information
from previous inputs and allows them to remember context.
• Long-Term Dependencies:
RNNs, especially LSTMs and GRUs, are particularly good at learning long-term
dependencies, where the relationship between elements in a sequence is
significant even if they are separated by many time steps.
• Vanishing Gradient Problem:
Traditional RNNs can struggle with the vanishing gradient problem, where the
gradients (used for updating weights during training) become very small as
they propagate back through time, making it difficult to learn long-term
dependencies.
• Gated Mechanisms (LSTMs and GRUs):
LSTMs and GRUs address the vanishing gradient problem by incorporating
gating mechanisms (input gate, output gate, and forget gate in LSTMs) that
control the flow of information within the network.
• Backpropagation Through Time (BPTT):
RNNs use BPTT, a learning process where errors are propagated across time
steps to adjust the network's weights, enabling them to learn dependencies
within sequential data.
How RNNs Solve Long-Term Dependency Issues:
1. Hidden State:
The hidden state acts as a memory, allowing the network to remember
information from previous time steps.
2. Gated Mechanisms (LSTMs and GRUs):
LSTMs and GRUs use gates to selectively control the flow of information,
allowing them to retain important information from the past while forgetting
less relevant information.
3. BPTT:
BPTT enables the network to learn long-term dependencies by propagating
errors across time steps and adjusting weights accordingly.
Applications:
RNNs are used in various applications where sequential data is crucial,
including:
• Natural Language Processing (NLP): Language translation, sentiment
analysis, text generation.
• Speech Recognition: Transcribing spoken language into text.
• Time Series Forecasting: Predicting future values based on past data.
• Image Captioning: Generating descriptions for images.
Module V: Attention Mechanism and Transformers

Introduction to Attention Mechanism


The attention mechanism allows a model to focus on the most relevant parts
of the input when performing a task, instead of treating all input information
equally. Originally introduced in NLP (for machine translation), attention has
now become a key component in both language and vision models.
Why use attention?
• Traditional models (like RNNs or CNNs) struggle with long-range
dependencies or capturing global context.
• Attention dynamically weighs the importance of each input element
based on its relevance to the task at hand.

Use of Attention in Convolutional Neural Networks (CNNs)


In CNNs, attention can enhance feature maps by helping the network focus on
important spatial regions or channels, improving performance on tasks like
classification or segmentation.
Attention in CNNs is usually added as modules on top of standard
convolutional layers to highlight informative features and suppress irrelevant
ones.

Types of Attention in Vision: Self-Attention, Spatial, and Channel Attention
1. Self-Attention
• Each element (e.g., pixel or token) attends to all others in the input.
• Helps capture global relationships.
• Used in Transformers and Vision Transformers (ViT).
• Learns context-dependent weights.
2. Spatial Attention
• Focuses on "where" in the image the model should look.
• Assigns weights to different locations (spatial positions) in a feature map.
• Helps in locating objects or features in the image.
3. Channel Attention
• Focuses on "what" feature maps are important.
• Assigns weights to different channels in the convolutional output.
• Enhances useful filters and suppresses less relevant ones.
Transformer

In machine learning, a transformer is a neural network architecture that uses


self-attention mechanisms to process and generate sequences of data
efficiently. Unlike traditional sequential models like RNNs, transformers can
handle long-range dependencies and contextual relationships, making them
highly effective for tasks like natural language processing (NLP), machine
translation, and text generation.

For example, in the sentence: "XYZ went to France in 2019 when there were no
cases of COVID and there he met the president of that country", the phrase
"that country" refers to "France".
However, an RNN would struggle to link "that country" to "France", since it
processes each word in sequence and loses context over long sentences. This
limitation prevents RNNs from understanding the full meaning of the sentence.
While adding more memory cells in LSTMs (Long Short-Term Memory
networks) helped address the vanishing gradient issue, they still process words
one by one. This sequential processing means LSTMs can't analyze an entire
sentence at once.
For instance, the word "point" has different meanings in these two sentences:
• “The needle has a sharp point.” (Point = Tip)
• “It is not polite to point at people.” (Point = Gesture)
Traditional models struggle with this context dependence, whereas the
Transformer, through its self-attention mechanism, processes the entire sentence
in parallel, addressing these issues and making it significantly more effective at
understanding context.
Transformer Architecture
Transformers consist of encoder and decoder blocks, primarily based on self-
attention and feedforward layers.
Encoder:
• Takes input tokens and uses multi-head self-attention to encode
contextual information.
Decoder:
• Generates output step-by-step using masked self-attention and attention
to encoder outputs.
Each layer includes:
• Multi-head Self-Attention (captures relationships)
• Feedforward Neural Network (applies non-linearity)
• Layer Norm and Residual Connections (for stability)
No recurrence or convolution — fully based on attention.

4. Encoder-Decoder Architecture
The encoder-decoder structure is key to transformer models. The encoder
processes the input sequence into a vector, while the decoder converts this
vector back into a sequence. Each encoder and decoder layer includes self-
attention and feed-forward layers. In the decoder, an encoder-decoder
attention layer is added to focus on relevant parts of the input.
For example, a French sentence “Je suis étudiant” is translated into “I am a
student” in English.
The encoder consists of multiple layers (typically 6 layers). Each layer has two
main components:
• Self-Attention Mechanism – Helps the model understand word
relationships.
• Feed-Forward Neural Network – Further transforms the representation.
The decoder also consists of 6 layers, but with an additional encoder-decoder
attention mechanism. This allows the decoder to focus on relevant parts of the
input sentence while generating output.

For instance, in the sentence "The cat didn't chase the mouse, because it was
not hungry", the word 'it' refers to 'cat'. The self-attention mechanism helps
the model correctly associate 'it' with 'cat', ensuring an accurate understanding
of sentence structure.
Attention Mechanisms in Computer Vision
In deep learning (especially CNNs), attention mechanisms help models focus on
the most relevant features in an image. Two commonly used types of attention
are:

Channel Attention

What it focuses on:


• Determines which feature maps (channels) are most important for a
given task.

How it works:
• Each channel represents a different learned pattern (e.g., edges, colors,
textures).
• The model learns to assign higher weights to important channels and
lower weights to less useful ones.

Mechanism:
1. A global descriptor is created using global average pooling or max
pooling across spatial dimensions (height and width).
2. This descriptor goes through a small network (often with a sigmoid
activation).
3. It outputs a weight for each channel.
4. Each channel is scaled by its corresponding weight.

Purpose:
• Helps the model prioritize useful feature maps, improving accuracy.

Example:
• SE Block (Squeeze-and-Excitation) is a popular channel attention
module.
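A minimal sketch of a Squeeze-and-Excitation style channel-attention block following the four steps listed above, assuming PyTorch; the reduction ratio of 16 is a common but illustrative choice.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # 1. global average pooling -> descriptor
        self.fc = nn.Sequential(                     # 2. small network ending in a sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # 3. one weight per channel, in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # 4. rescale each channel by its weight

feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```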

Spatial Attention

What it focuses on:


• Determines which spatial regions (locations) in the feature map are
important.

How it works:
• Instead of selecting channels, spatial attention finds "where" the model
should look in an image (e.g., object locations).

Mechanism:
1. The feature map is pooled along the channel axis (e.g., using average
and max pooling).
2. This pooled output is passed through a convolutional layer with a
sigmoid.
3. It outputs a spatial attention map — a 2D mask highlighting important
regions.
4. This map is multiplied with the original feature map spatially.

Purpose:
• Enhances regions with high semantic content (like object edges,
centers).

Example:
• The spatial attention module in CBAM (Convolutional Block Attention
Module) follows this design.
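A minimal sketch of a CBAM-style spatial-attention module following the steps above, assuming PyTorch; the 7x7 kernel size is a common but illustrative choice.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2. a convolution over the pooled maps, followed by a sigmoid
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # 1. pool along the channel axis (average and max)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        # 3. a 2-D attention map with values in (0, 1)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        # 4. multiply the original feature map by the attention map
        return x * attn

feat = torch.randn(2, 64, 32, 32)
print(SpatialAttention()(feat).shape)  # torch.Size([2, 64, 32, 32])
```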
Vision Transformer (ViT)

Vision Transformers (ViTs) are deep learning models that apply the Transformer
architecture, originally developed for natural language processing (NLP), to
visual tasks like image classification and object detection. Unlike traditional
Convolutional Neural Networks (CNNs), ViTs leverage self-attention
mechanisms to capture global relationships within images, allowing them to
model long-range dependencies and achieve state-of-the-art performance.

Key Concepts:
• Image Patching:
ViTs start by dividing an image into smaller, fixed-size patches.
• Self-Attention:
Each patch interacts with every other patch, allowing the model to learn how
much attention to pay to different parts of the image.
• Transformer Encoder:
The processed patches are then fed into a standard Transformer encoder,
which uses layers of self-attention and feedforward networks.
• Global Context:
Unlike CNNs, which primarily focus on local features, ViTs capture global
context by considering all patches during each attention step.

How they Work:


1. Image Input: An image is fed into the ViT model.
2. Patching: The image is divided into smaller, fixed-size patches.
3. Patch Embedding: Each patch is linearly mapped to a vector embedding.
4. Positional Encoding: Positional information is added to the patch
embeddings to maintain spatial relationships.
5. Transformer Encoder: The embedded patches are processed by a stack
of Transformer encoders, each using self-attention to capture
relationships between patches.
6. Output: The processed output can be used for tasks like image
classification, object detection, or image segmentation.
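A minimal sketch of steps 2 to 5 above (patching, patch embedding, positional encoding, and a Transformer encoder), assuming PyTorch; the 16x16 patch size and 768-dimensional embedding follow the usual ViT-Base choices but are illustrative here, and the encoder is truncated to two layers.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch_size, embed_dim = 16, 768

# Patching + linear patch embedding in one step: a conv with stride = patch size
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = to_patches(img).flatten(2).transpose(1, 2)     # (1, 196, 768)

# Prepend a learnable [CLS] token and add positional embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, patches.shape[1] + 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1) + pos_embed

# Feed the token sequence to a standard Transformer encoder
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)
print(out.shape)  # torch.Size([1, 197, 768]); out[:, 0] is the [CLS] representation
```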
Advantages of ViTs:
• State-of-the-art performance: ViTs have achieved excellent results on
various computer vision tasks, often surpassing CNNs.
• Global context modeling: They can effectively capture global
relationships and long-range dependencies within images.
• Scalability: ViTs can be scaled up to handle large and complex images.
Challenges of ViTs:
• Computational cost:
ViTs can be computationally expensive to train and use, especially for large
models.
• Data requirement:
ViTs typically require large amounts of training data to learn effective
representations.
• Inductive bias:
ViTs have a weaker inductive bias compared to CNNs, meaning they rely more
on data and regularization techniques.

Imagine you have a big puzzle (your image). Instead of looking at the whole
picture at once (like humans do), the Vision Transformer (ViT) breaks the
puzzle into small pieces (like 16x16 pixel patches).
1. Step 1: Cut the Image into Patches
o The image is divided into small squares (like puzzle pieces).
o Each piece is turned into numbers (called "embeddings").
2. Step 2: Add Position Info
o Since transformers don’t understand "where" each patch is, we
add position info (like numbering puzzle pieces).
3. Step 3: Let the Transformer Work
o The transformer looks at all pieces at once and finds relationships
(like noticing that a "dog’s ear" patch connects to a "dog’s head"
patch).
o It uses self-attention (focusing on important patches).
4. Step 4: Predict the Output
o A special [CLS] token (like a summary token) is used to decide the
final answer (e.g., "this is a cat").

Transfer Learning with ViT (Simple Explanation)


• Pre-training: First, ViT learns from millions of images (like reading lots of
books).
• Fine-tuning: Then, we tweak it for a specific task (like teaching it to
recognize dog breeds).
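A minimal sketch of the pre-train / fine-tune idea, assuming a recent torchvision that ships a pretrained vit_b_16; the 120 dog-breed classes are an illustrative target task.

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Pre-training: load weights learned on ImageNet (the "millions of images")
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Fine-tuning: swap the classification head for the new task (e.g., 120 dog breeds)
model.heads = nn.Linear(model.hidden_dim, 120)

# Optionally freeze the backbone and train only the new head on the small dataset
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```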

Why it’s useful?


• If you don’t have much data, ViT can still work well because it already
"knows" a lot.
• It’s flexible—can be used for classification, detection, etc.

Applications of Vision Transformers


1. Image Classification (e.g., Is this a cat or a dog?)
o Works as well as (or better than) CNNs if trained on big datasets.
2. Object Detection (e.g., Find all cars in this image.)
o Models like DETR (Detection Transformer) use ViT to detect
objects without needing complex pipelines.
3. Other Uses
o Medical Imaging (finding tumors in X-rays).
o Self-Driving Cars (detecting pedestrians, traffic signs).

Final Thought
• ViT is like a smart student who reads all puzzle pieces at once and
understands the big picture.
• CNNs are like students who scan the puzzle slowly but are good with
less data.
