Gen Aiml Notes by Piyush
Optimizers and Their Types
Key Optimizers:
• Stochastic Gradient Descent (SGD):
A fundamental optimizer that updates parameters based on the gradient
calculated from a randomly selected subset of the training data (a batch).
• Adam (Adaptive Moment Estimation):
A popular optimizer that combines the ideas of momentum and RMSprop,
using adaptive learning rates for each parameter.
• RMSprop (Root Mean Squared Propagation):
Another adaptive learning rate optimizer that uses a moving average of
squared gradients to adapt the learning rate for each parameter.
• Other Optimizers:
AdaGrad, Adadelta, and Nesterov Accelerated Gradient (NAG) are also used,
each with its own strengths and weaknesses.
Adam and RMSprop are adaptive learning rate optimizers, meaning they adjust
the learning rate for each parameter individually based on past gradients,
potentially leading to faster convergence and better performance compared to
SGD (Stochastic Gradient Descent). SGD, on the other hand, uses a global
learning rate for all parameters.
Key Differences and Comparisons:
• Adaptive Learning Rates:
Adam and RMSprop adapt learning rates, while SGD uses a fixed learning rate.
• First and Second Moments:
Adam uses both the first (mean) and second (uncentered variance) moments of
the gradients, while RMSprop focuses on the second moment.
• Convergence Speed:
Adam and RMSprop generally converge faster than SGD, especially in complex
landscapes with noisy gradients.
• Hyperparameter Tuning:
Adam typically requires less hyperparameter tuning compared to SGD and
RMSprop.
• Memory Usage:
SGD uses the least memory; RMSprop and Adam keep one and two extra running
statistics per parameter, respectively, but the overhead remains modest.
When to use each optimizer:
• SGD: Suitable for problems with simple loss landscapes and where
computational resources are limited.
• RMSprop: Effective when dealing with noisy gradients or non-stationary
objectives.
• Adam: A good general-purpose choice, often preferred for its adaptive
learning rates and relatively good performance across various tasks.
In summary: Adam is a popular choice due to its adaptive learning rates and
ability to handle noisy gradients, while RMSprop focuses on the second
moment of the gradients, and SGD is a simpler, less computationally intensive
option.
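A minimal sketch of how the three optimizers are instantiated and used, assuming PyTorch; the placeholder model, data, and learning rates are illustrative, not recommended settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# SGD: one global learning rate for every parameter
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# RMSprop: scales each step by a moving average of squared gradients
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

# Adam: combines momentum (first moment) with RMSprop-style scaling (second moment)
opt_adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# The training step is identical regardless of which optimizer is chosen:
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(x), y)
opt_adam.zero_grad()
loss.backward()
opt_adam.step()
```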
CNN Architecture
Advantages of CNNs
1. Good at detecting patterns and features in images, videos, and audio
signals.
2. Robust to translation, and to some extent rotation and scaling, of the input.
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of CNNs
1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper regularization is
used.
3. Requires large amounts of labeled data.
4. Interpretability is limited; it is hard to understand what the network has
learned.
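To make the "CNN Architecture" heading above concrete, here is a minimal sketch of a small CNN, assuming PyTorch; the layer sizes, 32x32 RGB input, and 10-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch: conv -> ReLU -> pool blocks followed by a classifier head.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(4, 3, 32, 32))   # batch of 4 RGB 32x32 images
```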
Underfitting:
• Cause: The model is too simplistic and cannot learn the complex
relationships in the data.
• Symptoms: Poor performance on both training and testing data.
• Mitigation (how to avoid):
o Increase model complexity: Use more complex models,
algorithms, or features.
o Train longer: Allow the model to train for a longer period.
o Add more data: Provide the model with more relevant features.
o Reduce noise: Clean the training data to remove irrelevant
information.
o Reduce regularization: Decrease the penalty on model
parameters.
Overfitting:
• Cause:
The model learns the training data too well, including noise and irrelevant
details.
• Symptoms:
Excellent performance on training data, but poor performance on unseen data.
Techniques to Avoid Overfitting:
• Use more training data: Provide the model with a larger and more
diverse training set.
• Data augmentation: Generate synthetic training data to increase
the dataset size.
• Add noise to the input data: Introduce random variations in the
training data to make the model more robust.
• Feature selection: Select only the most relevant features and
remove irrelevant ones.
• Regularization: Add penalties to the model's parameters to
prevent them from becoming too large.
• Early stopping: Monitor the model's performance on a validation
set and stop training when performance starts to degrade.
• Simplify the model: Use a less complex model or algorithm.
• Use cross-validation: Evaluate the model's performance on
different subsets of the data to get a more reliable estimate of its
performance.
• Use ensembling: Combine the predictions of multiple models to
reduce overfitting.
1. Dropout
Definition:
Dropout is a regularization technique where, during training, a random subset
of neurons is “dropped out” (i.e., temporarily deactivated) in each forward
pass.
• This prevents the network from becoming too reliant on specific
neurons.
• At each iteration, different neurons are dropped, forcing the network to
learn redundant, generalized patterns.
Benefits:
• Reduces overfitting.
• Encourages independence and redundancy in feature learning.
• Improves generalization to unseen data.
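A minimal sketch of dropout in a small network, assuming PyTorch; the layer sizes and the drop probability p=0.5 are illustrative.

```python
import torch
import torch.nn as nn

# p=0.5 means each hidden unit is zeroed with probability 0.5 during training;
# at evaluation time dropout is a no-op and all units are kept.
net = nn.Sequential(nn.Linear(100, 50), nn.ReLU(),
                    nn.Dropout(p=0.5),
                    nn.Linear(50, 10))

net.train()                          # dropout active: a different subset is dropped each pass
out_train = net(torch.randn(8, 100))

net.eval()                           # dropout disabled for inference
out_eval = net(torch.randn(8, 100))
```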
2. Regularization
Definition:
Regularization refers to adding a penalty to the loss function to prevent the
model from learning overly complex or extreme weights.
There are two main types:
L1 Regularization (Lasso)
• Adds the absolute value of weights to the loss function.
• Encourages sparsity — some weights become exactly zero.
• Good for feature selection.
Loss function becomes:
L = L_{\text{original}} + \lambda \sum_i |w_i|
L2 Regularization (Ridge)
• Adds the squared value of weights to the loss function.
• Encourages small weights, helping the model generalize better.
Loss function becomes:
L = L_{\text{original}} + \lambda \sum_i w_i^2
• λ (lambda) is a regularization hyperparameter that controls the strength
of the penalty.
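A minimal sketch of both penalties, assuming PyTorch; L2 is commonly applied through the optimizer's weight_decay argument, while L1 is added to the loss by hand. The model, data, and λ value are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
x, y = torch.randn(64, 20), torch.randn(64, 1)
lam = 1e-4                                   # the λ hyperparameter

# L2 (Ridge): weight_decay adds λ * Σ w_i^2 to the objective
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)

# L1 (Lasso): add λ * Σ |w_i| to the loss manually
mse = nn.MSELoss()(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = mse + lam * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```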
Module III: GANs & Autoencoders
Training Progression
• As training continues, the generator becomes highly proficient at
producing realistic data.
• Eventually, the discriminator struggles to distinguish real from fake,
indicating that the GAN has reached a well-trained state.
• At this point, the generator can be used to generate high-quality
synthetic data for various applications.
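A minimal sketch of the adversarial training loop described above, assuming PyTorch; the tiny MLP generator and discriminator, the synthetic "real" data, and all sizes are illustrative assumptions, not a production GAN.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, data_dim) + 3.0          # stand-in for real samples
    fake = G(torch.randn(64, latent_dim))

    # 1) Update the discriminator: push real -> 1, fake -> 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Update the generator: try to make the discriminator output 1 on fakes
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```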
Q. Discuss some challenges encountered while training a GAN model and how
likely each is to occur.
ANS:
Training GANs presents several challenges, primarily due to the adversarial
nature of the training process and the inherent instability that can arise. These
challenges include mode collapse, non-convergence, vanishing gradients, and
difficulty in balancing the generator and discriminator.
1. Mode Collapse:
• Problem:
The generator produces a limited variety of outputs, focusing on only a few
"modes" of the data distribution.
• Probability:
High, as the generator can easily fall into the trap of optimizing for the
discriminator's current state, ignoring other possible outputs.
• Example:
A GAN trained on images of faces might only generate one type of face or a
limited set of variations.
2. Non-Convergence:
• Problem:
The training process stalls, with the generator and discriminator failing to reach
a stable equilibrium.
• Probability:
Moderate, as the adversarial nature can lead to oscillations and instability in
the training process.
• Example:
The model parameters might keep oscillating and not converge to a stable
solution.
3. Vanishing Gradients:
• Problem:
The discriminator becomes too good at distinguishing real and fake data,
leading to small or zero gradients for the generator, slowing down learning.
• Probability:
Moderate, as the discriminator's success can hinder the generator's progress,
especially in early stages of training.
• Example:
The generator's parameters might not update effectively, leading to slow or no
learning.
GANs face challenges like mode collapse, where the generator produces limited
diverse outputs, and training instability, where the generator and discriminator
oscillate or fail to converge. Mode collapse happens when the generator
focuses on a few modes of the data, while training instability can result from
the discriminator becoming too powerful, hindering generator updates.
Elaboration:
• Mode Collapse:
This occurs when the generator learns to produce a limited variety of samples,
often focusing on a small subset of the data distribution. It happens when the
generator finds a way to fool the discriminator by producing a few specific
types of samples, neglecting the broader diversity of the data.
• Training Instability:
GANs can experience instability due to the adversarial nature of their training,
where the generator and discriminator are competing against each other. The
discriminator might become too successful, leading to vanishing gradients for
the generator, or the models might oscillate and fail to converge to a stable
equilibrium.
Causes and Solutions:
• Mode Collapse:
Several factors can contribute to mode collapse, including an imbalance in the
training dynamics between the generator and discriminator, or the generator's
ability to find a "winning" strategy to fool the discriminator without exploring
the full data distribution. Techniques like Wasserstein Loss, unrolling, and
progressive growing have been proposed to address mode collapse.
• Training Instability:
Instability can arise from issues like the discriminator becoming too strong,
resulting in vanishing gradients for the generator. Techniques like batch
normalization, spectral normalization, and using WGAN-GP loss can help
stabilize training.
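Of the stabilization techniques named above, spectral normalization is available as a PyTorch utility; a minimal sketch applying it to a small discriminator (layer shapes are illustrative assumptions):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping discriminator layers in spectral_norm constrains their largest
# singular value, which limits how "strong" the discriminator's gradients get.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
)
```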
Image Synthesis
• Generate new faces, fashion items, anime characters (StyleGAN,
DCGAN).
• Example: thispersondoesnotexist.com
Data Augmentation
• Synthetic data for imbalanced classes or privacy preservation.
• Used in medical imaging and autonomous driving.
Image-to-Image Translation
• Convert sketch ↔ photo (Pix2Pix)
• Horse ↔ Zebra (CycleGAN)
Super-Resolution
• Enhance low-quality images (SRGAN).
Generative AI models
Generative AI models are a type of machine learning that can create new data,
like text, images, audio, or video, that resembles the data they were trained
on. These models learn patterns and structures from existing data to generate
novel content. Some key types include Generative Adversarial Networks
(GANs), Variational Autoencoders (VAEs), and diffusion models, each with its
own strengths and applications.
TYPES
• Generative Adversarial Networks (GANs)
• Autoencoders (AE)
• Variational Autoencoders (VAEs)
• Autoregressive Models
• Conditional Generative Models
Applications:
• Autoencoders:
• Feature extraction and representation learning: Identifying
important features within data.
• Data compression: Reducing the dimensionality of data while
retaining important information.
• Image denoising and reconstruction: Improving the quality of
images by reducing noise.
• VAEs:
• Generative modeling: Creating new data samples similar to the
training data, useful for tasks like image and text generation.
• Data augmentation: Generating synthetic data to expand the
training dataset.
• Anomaly detection: Identifying deviations from the normal
patterns in data.
• Industrial quality control: Detecting defects in products by
analyzing their generated images.
• Natural Language Processing (NLP): Generating text, translating
languages, and more.
In essence:
• Autoencoders are good for tasks requiring reconstruction, feature
learning, and data compression.
• VAEs excel in generative tasks, allowing for the creation of new, diverse
data samples.
Other examples:
• Conditional VAEs (CVAE) – Add condition to encoder/decoder.
• Text-to-image GANs – Generate images based on text input (e.g.,
DALL·E).
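A minimal sketch of a plain autoencoder of the kind described above, assuming PyTorch; the 784-dim input (e.g., a flattened 28x28 image) and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Compress 784-dim inputs into a 32-dim code, then reconstruct them.
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        code = self.encoder(x)           # compressed representation (feature learning)
        return self.decoder(code)        # reconstruction

model = AutoEncoder()
x = torch.rand(16, 784)
loss = nn.MSELoss()(model(x), x)         # reconstruction loss
```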
Module IV: Use of Recurrent Neural Networks (RNNs) in
Generative AI
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single
input and a single output. It is used for straightforward classification tasks such
as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce
multiple outputs over time. This is useful in tasks where one input triggers a
sequence of predictions (outputs). For example, in image captioning, a single
image is used as input to generate a sequence of words as a caption.
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is
needed to make one prediction. In sentiment analysis, the model receives a
sequence of words (like a sentence) and produces a single output such as
positive, negative, or neutral (see the sketch after this list).
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a
sequence of outputs. In a language translation task, a sequence of words in one
language is given as input, and a corresponding sequence in another language
is generated as output.
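A minimal sketch of the many-to-one pattern referenced above, assuming PyTorch; the embedding size, hidden size, and the three sentiment classes are illustrative.

```python
import torch
import torch.nn as nn

# A sequence of word embeddings goes in; a single class prediction comes out,
# computed from the final hidden state only.
embed_dim, hidden_dim, num_classes = 50, 64, 3     # 3 = positive / negative / neutral
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, num_classes)

x = torch.randn(8, 12, embed_dim)       # batch of 8 sentences, 12 tokens each
outputs, h_n = rnn(x)                   # h_n: final hidden state, shape (1, 8, hidden_dim)
logits = head(h_n[-1])                  # one prediction per sequence
```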
Variants of Recurrent Neural Networks (RNNs)
There are several variations of RNNs, each designed to address specific
challenges or optimize for certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer where weights are
shared across time steps. Vanilla RNNs are suitable for learning short-term
dependencies but are limited by the vanishing gradient problem, which
hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions,
capturing both past and future context for each time step. This architecture is
ideal for tasks where the entire sequence is available, such as named entity
recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism
to overcome the vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the
cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the
current step.
This selective memory enables LSTMs to handle long-term dependencies,
making them ideal for tasks where earlier context is critical. Designed to
overcome the vanishing gradient problem, LSTMs learn long-range
dependencies in sequential data by using "memory cells" to store information
over long time steps.
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and
forget gates into a single update gate and streamlining the output mechanism.
This design is computationally efficient, often performing similarly to LSTMs,
and is useful in tasks where simplicity and faster training are beneficial.
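A minimal sketch comparing the three variants, assuming PyTorch; the input shape and hidden size are illustrative. The LSTM carries a separate cell state, and the parameter counts reflect the extra gates.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 20, 32)                                  # (batch, time, features)

rnn = nn.RNN(32, 64, batch_first=True)
lstm = nn.LSTM(32, 64, batch_first=True)
gru = nn.GRU(32, 64, batch_first=True)

out_rnn, h = rnn(x)                     # hidden state only
out_lstm, (h, c) = lstm(x)              # hidden state + cell state (the "memory")
out_gru, h = gru(x)                     # gated, but no separate cell state

for name, m in [("RNN", rnn), ("LSTM", lstm), ("GRU", gru)]:
    print(name, sum(p.numel() for p in m.parameters()))     # LSTM > GRU > RNN
```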
Advantages of Recurrent Neural Networks
• Sequential Memory: RNNs retain information from previous inputs,
making them ideal for time-series prediction where past data is crucial.
LSTMs extend this memory over much longer ranges.
• Enhanced Pixel Neighborhoods: RNNs can be combined with
convolutional layers to capture extended pixel neighborhoods, improving
performance in image and video data processing.
Limitations of Recurrent Neural Networks (RNNs)
While RNNs excel at handling sequential data, they face two main training
challenges i.e., vanishing gradient and exploding gradient problem:
1. Vanishing Gradient: During backpropagation, gradients diminish as they
pass through each time step, leading to minimal weight updates. This
limits the RNN’s ability to learn long-term dependencies, which is crucial
for tasks like language translation.
2. Exploding Gradient: Sometimes, gradients grow uncontrollably, causing
excessively large weight updates that destabilize training. Gradient
clipping is a common technique to manage this issue.
These challenges can hinder the performance of standard RNNs on complex,
long-sequence tasks.
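A minimal sketch of the gradient clipping mentioned above, assuming PyTorch; the LSTM model and placeholder loss are illustrative.

```python
import torch
import torch.nn as nn

model = nn.LSTM(32, 64, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

out, _ = model(torch.randn(4, 100, 32))   # a long sequence (100 time steps)
loss = out.pow(2).mean()                  # placeholder loss
optimizer.zero_grad()
loss.backward()

# Cap the total gradient norm before the update to prevent exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```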
Speech Recognition
RNNs also help in converting spoken language into text. Since speech is time-
sequential, RNNs are good at capturing phonetic and linguistic patterns.
How it works:
• Input: Audio waveform
• Output: Predicted sequence of words
• Often combined with CNNs or attention mechanisms for improved
accuracy
Used in:
• Virtual assistants (e.g., Siri, Alexa)
• Voice commands and transcription
• Automated customer support
Summary
RNNs are essential in generative AI because they:
• Understand sequence and context
• Generate outputs step-by-step, making them ideal for language, music,
and speech
• Are enhanced by LSTM/GRU for longer memory
LSTM vs. RNN (End-Sem Q5, Q7b)
RNNs and LSTMs both process sequential data, but they differ in how they
handle memory. RNNs have a single hidden state that is updated at every time
step. Because they rely only on this state and use simple activation functions
like tanh, they tend to forget long-term dependencies due to the vanishing
gradient problem.
LSTMs improve on this by adding a memory cell and three gates:
• The forget gate decides what information to discard from the memory.
• The input gate decides what new information to store.
• The output gate decides what to pass to the next hidden state.
These gates help LSTM models maintain information over longer sequences. As
a result, LSTMs are more powerful and widely used in tasks such as machine
translation, speech recognition, and text generation, where remembering long-
term dependencies is crucial. However, LSTMs are computationally heavier
than basic RNNs.
For example, in the sentence: “XYZ went to France in 2019 when there were no
cases of COVID and there he met the president of that country” the word
“that country” refers to “France”.
However, an RNN would struggle to link "that country" to "France", since it
processes each word in sequence and loses context over long sentences.
This limitation prevents RNNs from understanding the full meaning of the
sentence.
While the memory cells added in LSTMs (Long Short-Term Memory networks)
helped address the vanishing gradient issue, they still process words one by
one. This sequential processing means LSTMs cannot analyze an entire
sentence at once.
For instance, the word "point" has different meanings in these two sentences:
• “The needle has a sharp point.” (Point = Tip)
• “It is not polite to point at people.” (Point = Gesture)
Traditional models struggle with this context dependence, whereas the
Transformer, through its self-attention mechanism, processes the entire
sentence in parallel, addressing these issues and making it significantly more
effective at understanding context.
Transformer Architecture
Transformers consist of encoder and decoder blocks, primarily based on self-
attention and feedforward layers.
Encoder:
• Takes input tokens and uses multi-head self-attention to encode
contextual information.
Decoder:
• Generates output step-by-step using masked self-attention and attention
to encoder outputs.
Each layer includes:
• Multi-head Self-Attention (captures relationships)
• Feedforward Neural Network (applies non-linearity)
• Layer Norm and Residual Connections (for stability)
No recurrence or convolution — fully based on attention.
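A minimal sketch of the self-attention operation these layers are built on (single head, no learned module wrapper), assuming PyTorch; the sequence length and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    Q, K, V = x @ w_q, x @ w_k, x @ w_v                  # project tokens to queries, keys, values
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # token-to-token similarities
    weights = F.softmax(scores, dim=-1)                  # attention weights per token
    return weights @ V                                   # each token becomes a weighted mix of all tokens

seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model)                        # 6 token embeddings
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
context = self_attention(x, w_q, w_k, w_v)               # shape: (6, 16)
```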
4. Encoder-Decoder Architecture
The encoder-decoder structure is key to transformer models. The encoder
processes the input sequence into a vector, while the decoder converts this
vector back into a sequence. Each encoder and decoder layer includes self-
attention and feed-forward layers. In the decoder, an encoder-decoder
attention layer is added to focus on relevant parts of the input.
For example, a French sentence “Je suis étudiant” is translated into “I am a
student” in English.
The encoder consists of multiple layers (typically 6 layers). Each layer has two
main components:
• Self-Attention Mechanism – Helps the model understand word
relationships.
• Feed-Forward Neural Network – Further transforms the representation.
The decoder also consists of 6 layers, but with an additional encoder-decoder
attention mechanism. This allows the decoder to focus on relevant parts of the
input sentence while generating output.
For instance in the sentence “The cat didn’t chase the mouse, because it was
not hungry”, the word ‘it’ refers to ‘cat’. The self-attention mechanism helps
the model correctly associate ‘it’ with ‘cat’ ensuring an accurate understanding
of sentence structure.
Attention Mechanisms in Computer Vision
In deep learning (especially CNNs), attention mechanisms help models focus on
the most relevant features in an image. Two commonly used types of attention
are:
Channel Attention
How it works:
• Each channel represents a different learned pattern (e.g., edges, colors,
textures).
• The model learns to assign higher weights to important channels and
lower weights to less useful ones.
Mechanism:
1. A global descriptor is created using global average pooling or max
pooling across spatial dimensions (height and width).
2. This descriptor goes through a small network (often with a sigmoid
activation).
3. It outputs a weight for each channel.
4. Each channel is scaled by its corresponding weight.
Purpose:
• Helps the model prioritize useful feature maps, improving accuracy.
Example:
• SE Block (Squeeze-and-Excitation) is a popular channel attention
module.
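A sketch of an SE-style channel attention block, assuming PyTorch; the channel count and reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

# Squeeze with global average pooling, excite with a small bottleneck MLP,
# then rescale each channel by its learned weight.
class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, H, W)
        squeeze = x.mean(dim=(2, 3))            # global average pooling -> (batch, channels)
        weights = self.fc(squeeze)              # one weight per channel, in [0, 1]
        return x * weights[:, :, None, None]    # rescale each channel

out = SEBlock(64)(torch.randn(2, 64, 32, 32))
```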
Spatial Attention
How it works:
• Instead of selecting channels, spatial attention finds "where" the model
should look in an image (e.g., object locations).
Mechanism:
1. The feature map is pooled along the channel axis (e.g., using average
and max pooling).
2. This pooled output is passed through a convolutional layer with a
sigmoid.
3. It outputs a spatial attention map — a 2D mask highlighting important
regions.
4. This map is multiplied with the original feature map spatially.
Purpose:
• Enhances regions with high semantic content (like object edges,
centers).
Example:
• CBAM (Convolutional Block Attention Module) is a widely used module
that applies spatial attention.
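A sketch of a CBAM-style spatial attention module, assuming PyTorch; the kernel size and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# Pool along the channel axis, convolve, and produce a 2D mask over locations.
class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                 # x: (batch, channels, H, W)
        avg = x.mean(dim=1, keepdim=True)                 # average over channels -> (batch, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)                # max over channels     -> (batch, 1, H, W)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # 2D attention map
        return x * mask                                   # highlight important locations

out = SpatialAttention()(torch.randn(2, 64, 32, 32))
```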
Vision Transformer (ViT)
Vision Transformers (ViTs) are deep learning models that apply the Transformer
architecture, originally developed for natural language processing (NLP), to
visual tasks like image classification and object detection. Unlike traditional
Convolutional Neural Networks (CNNs), ViTs leverage self-attention
mechanisms to capture global relationships within images, allowing them to
model long-range dependencies and achieve state-of-the-art performance.
Key Concepts:
• Image Patching:
ViTs start by dividing an image into smaller, fixed-size patches.
• Self-Attention:
Each patch interacts with every other patch, allowing the model to learn how
much attention to pay to different parts of the image.
• Transformer Encoder:
The processed patches are then fed into a standard Transformer encoder,
which uses layers of self-attention and feedforward networks.
• Global Context:
Unlike CNNs, which primarily focus on local features, ViTs capture global
context by considering all patches during each attention step.
Imagine you have a big puzzle (your image). Instead of looking at the whole
picture at once (like humans do), the Vision Transformer (ViT) breaks the
puzzle into small pieces (like 16x16 pixel patches).
1. Step 1: Cut the Image into Patches
o The image is divided into small squares (like puzzle pieces).
o Each piece is turned into numbers (called "embeddings").
2. Step 2: Add Position Info
o Since transformers don’t understand "where" each patch is, we
add position info (like numbering puzzle pieces).
3. Step 3: Let the Transformer Work
o The transformer looks at all pieces at once and finds relationships
(like noticing that a "dog’s ear" patch connects to a "dog’s head"
patch).
o It uses self-attention (focusing on important patches).
4. Step 4: Predict the Output
o A special [CLS] token (like a summary token) is used to decide the
final answer (e.g., "this is a cat").
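The steps above can be sketched in code, assuming PyTorch; the patch size 16, 224x224 input, 768-dim embeddings, and a 2-layer encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # one RGB image
patch, dim = 16, 768
num_patches = (224 // patch) ** 2            # 14 * 14 = 196 puzzle pieces

# Step 1: cut into patches and turn each into an embedding (a strided conv does both)
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(img).flatten(2).transpose(1, 2)                # (1, 196, 768)

# Steps 2 and 4: add position info and prepend the [CLS] summary token
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed         # (1, 197, 768)

# Step 3: let self-attention relate all patches at once, then read the [CLS] token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)
cls_out = encoder(tokens)[:, 0]              # fed to a classifier head for the final answer
```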
Final Thought
• ViT is like a smart student who reads all puzzle pieces at once and
understands the big picture.
• CNNs are like students who scan the puzzle slowly but are good with
less data.