BTech Advanced AI Unit03
Dr.Vaishnaw G.Kale
Associate Professor
Department of Computer Science
Engineering & Applications
D.Y.Patil International University, Pune
About the Course
Module-III: Advanced AI Models
Introduction to Generative Models: GAN, HMM, Autoregressive model, Applications, Evaluation and
challenges of generative models
Introduction to Deep Generative Models: Deep Learning, Advanced DNN, Recurrent Neural Networks,
GAN, Deep Boltzmann machines, Deep Belief networks
Ex: Consider the word ‘Point’, which we use in two different contexts given below
● In contrast, in feedback networks the information can pass in both directions. The network
contains a feedback path, i.e. we can reuse the memory for new predictions.
● The architecture of the Transformer is built from two kinds of blocks: encoders and decoders.
● The encoder block turns the sequence of input words into a vector representation, and the decoder
converts that representation into an output sequence.
Transformer based Language Models
● The encoder architecture has two layers: Self Attention and Feed
Forward.
● The encoder’s inputs first pass through a self-attention layer, and then the
outputs of the self-attention layer are fed to a feed-forward neural
network.
● Sequential data has temporal characteristics.
● This means that each word holds a position relative to the other words.
● For example, let’s take a sentence- ‘The cat didn’t chase the mouse, because it was not hungry’. Here,
we can easily tell that ‘it’ is referring to the cat, but it is not as simple for an algorithm.
● When the model is processing the word ‘it’, self-attention allows it to associate ‘it’ with ‘cat’.
● Self-attention is the method to reformulate the representation based on all other words of the sentence.
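The self-attention idea above can be sketched in a few lines. This is a minimal illustration, not the full Transformer layer: it assumes scaled dot-product attention and, for brevity, uses the token embeddings directly as queries, keys, and values (a real model applies learned projections first). The 2-d embeddings are hypothetical values chosen so that ‘it’ lies close to ‘cat’.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token to every token in the sentence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: weighted mix of all value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings (assumption for illustration only)
tokens = ["cat", "mouse", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)
```

With these made-up embeddings, the attention weights for ‘it’ put more mass on ‘cat’ than on ‘mouse’, which is the association described in the example sentence.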
http://jalammar.github.io/illustrated-transformer/
Advantages of Transformer
1. They hold the potential to understand the relationship between sequential elements that are far from each
other.
● Google's Bidirectional Encoder Representations from Transformers (BERT) was one of the first LLMs
based on transformers.
● OpenAI's GPT followed suit and underwent several iterations, including GPT-2, GPT-3, GPT-
● Meta's Llama achieves comparable performance with models 10 times its size.
● Google's Pathways Language Model generalizes and performs tasks across multiple domains,
3) Self-Attention Mechanism:
4) Positional Encoding:
● Since Transformers don't have an inherent sense of word order, positional encodings are added to
the input embeddings.
● These encodings provide information about the position of each word in the sequence.
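As one possible illustration of positional encodings, the sketch below uses the fixed sinusoidal scheme from the original Transformer. This is an assumption: the slides do not say which scheme is meant, and GPT-style models typically learn their position embeddings instead. The function name and the sizes are hypothetical.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)

# The encodings are added element-wise to the input embeddings
# (all-zero embeddings here, purely as a placeholder).
embeddings = [[0.0] * 8 for _ in range(4)]
inputs = [[e + p for e, p in zip(e_row, p_row)]
          for e_row, p_row in zip(embeddings, pe)]
```

Each position gets a distinct vector, so two occurrences of the same word at different positions produce different inputs to the model.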
How GPT works
5) Layer Stacking:
● The decoder consists of multiple layers, each containing a self-attention sub-layer and
feedforward neural network sub-layers.
● The model uses multiple stacked layers to capture increasingly complex patterns in the data.
6) Training:
7) Fine-Tuning:
8) Text Generation:
● When generating text, you provide a prompt to the trained GPT model.
● The model takes the prompt and generates a continuation of the text one word at a time.
● It samples from its learned probability distribution of words to predict the next word based on the
context of the prompt and the generated text so far.
9) Output Sampling:
● GPT often uses techniques like temperature control and nucleus sampling to adjust the randomness of
generated text.
● Temperature control influences the randomness of word selection, and nucleus sampling focuses on
selecting from a subset of the most likely words based on their cumulative probabilities.
● While GPT is excellent at generating coherent and contextually relevant text, it can sometimes produce
repetitive or nonsensical outputs.
● Post-processing techniques and careful prompting can be used to improve the quality of generated text.
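Temperature control and nucleus sampling, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary and logits are made up, and the function names are hypothetical rather than part of any GPT API.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Low temperature sharpens, high flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized subset

def sample_next(logits, temperature=1.0, top_p=0.9, rng=random):
    probs = apply_temperature(logits, temperature)
    nucleus = nucleus_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

# Hypothetical vocabulary and next-token logits
vocab = ["the", "cat", "sat", "zebra"]
logits = [2.0, 1.0, 0.5, -3.0]
next_word = vocab[sample_next(logits, temperature=1.0, top_p=0.9)]
```

With these numbers the unlikely token ‘zebra’ falls outside the nucleus for `top_p=0.9`, so it is never sampled, while lowering the temperature concentrates more probability on ‘the’.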
How GPT works
● In the realm of Natural Language Processing (NLP), transformer-based language models have
revolutionized the way machines understand and generate human language.
● The traditional transformer model consists of two main components: an encoder and a decoder.
● However, OpenAI’s GPT (Generative Pretrained Transformer) models have deviated from this norm by
adopting a decoder-only architecture.
● In a traditional transformer model, the encoder and decoder work together to process language.
● The encoder’s role is to take in the input data (such as a sentence in English) and transform it into a
higher, more abstract representation.
● This process is known as encoding.
● The encoded data is a complex representation that captures the semantic and syntactic properties of the
input data
● The decoder, on the other hand, takes this encoded data and generates the output.
● The decoder uses a mechanism called attention, which allows it to focus on different parts of the input
when generating each part of the output.
● GPT models, however, do not use an encoder. Instead, they are built with a decoder-only architecture.
● This means that the input data is fed directly into the decoder without being transformed into a higher,
more abstract representation by an encoder.
● The decoder in a GPT model uses a specific type of attention mechanism known as masked self-
attention.
● In a traditional transformer, the attention mechanism allows the model to focus on all parts of the input
when generating each part of the output.
● However, in a decoder-only transformer like GPT, the attention mechanism is “masked” to prevent it
from looking at future parts of the input when generating each part of the output.
● This is necessary because GPT models are trained to predict the next word in a sentence, so they should
not have access to future words.
● The decoder-only architecture simplifies the model and makes it more efficient for certain tasks, like
language modeling.
● By removing the encoder, GPT models can process input data more directly and generate output more
quickly.
● This architecture also allows GPT models to be trained on a large amount of unlabeled data, which is a
significant advantage in the field of NLP where labeled data is often scarce.
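The masking in masked self-attention can be sketched as below. This is an illustrative assumption about the mechanics (setting future scores to negative infinity before the softmax), with made-up raw scores; it is not taken from the slides.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of position i attending to position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i lie in the future, so they are masked out
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Four positions with identical (hypothetical) raw scores
scores = [[0.0] * 4 for _ in range(4)]
w = causal_attention_weights(scores)
```

The first position can attend only to itself, the second to the first two positions, and so on, so no position ever receives information from words that come after it.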
● GPT’s decoder-only architecture is a powerful and efficient alternative to the traditional encoder-
decoder model.
● It simplifies the model, makes it more efficient, and allows it to be trained on a large amount of
unlabeled data.
● Despite not having an encoder, GPT models are still capable of performing tasks typically associated
with encoder-decoder models, due to the power of the transformer’s decoder and the training method
used.
Language Models
● Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the
probability of a given sequence of words occurring in a sentence.
● Language models analyze bodies of text data to provide a basis for their word predictions.
● So simply put, a Language Model predicts the next word(s) in a sequence.
● Language models have many applications like:
○ Part of Speech (PoS) Tagging
○ Machine Translation
○ Text Classification
○ Speech Recognition
○ Information Retrieval
○ News Article Generation
○ Question Answering, etc.
How Language Models Work
● For training a language model, a number of probabilistic approaches are used.
● These approaches vary on the basis of the purpose for which a language model is created.
● The amount of text data to be analyzed and the math applied for analysis make a difference in the
approach followed for creating and training a language model.
● Consider an arbitrary language L.
● For simplicity, we take L to be English.
● A language model assigns probabilities to sequences of arbitrary symbols such that the more likely a
sequence (w1,w2,...,wn) is to exist in that language, the higher the probability.
● A symbol can be a character, a word, or a sub-word (e.g. the word ‘going’ can be divided into two sub-
words: ‘go’ and ‘ing’).
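To make "assigning probabilities to sequences" concrete, here is a minimal sketch of a bigram language model estimated by counting. The bigram assumption (each word depends only on the previous one), the tiny corpus, and the function names are all illustrative choices, not something the slides prescribe.

```python
from collections import Counter

# Tiny hypothetical corpus; "<s>" marks the start of a sentence
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

def train_bigram(sentences):
    """Count how often each word follows each context word."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def sequence_probability(seq, context_counts, bigram_counts):
    """P(w1..wn) as a product of P(w_i | w_{i-1}) estimated from counts."""
    prob = 1.0
    tokens = ["<s>"] + seq
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob

context_counts, bigram_counts = train_bigram(corpus)
p_likely = sequence_probability(["the", "cat", "sat"], context_counts, bigram_counts)
p_unlikely = sequence_probability(["sat", "the", "cat"], context_counts, bigram_counts)
```

A word order seen in the corpus receives a higher probability than a scrambled one, which is exactly the property required of a language model.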
GPT3
Limitations
● GPT-3 lacks long-term memory: unlike humans, the model does not learn anything from long-term
interactions.
● Lack of interpretability: this is a problem that affects extremely large and complex models in
general. GPT-3 is so large that it is difficult to interpret or explain the output that it produces.
● Limited input size: transformers have a fixed maximum input size, which means that prompts
that GPT-3 can deal with cannot exceed its context window (2,048 tokens, roughly a few pages of text).
● Slow inference time: because GPT-3 is so large, it takes more time for the model to produce
predictions.
● GPT-3 suffers from bias.
Generative Models
● Generative models are considered a class of statistical models that can generate new data
instances.
● A generative model could generate new photos of animals that look like real animals
● GANs (Generative Adversarial Networks) are just one kind of generative model.
● Generative models capture the joint probability p(X, Y), or just p(X) if there are no labels.
● A generative model includes the distribution of the data itself, and tells you how likely a given
example is.
● For example, models that predict the next word in a sequence are typically generative models
(usually much simpler than GANs) because they can assign a probability to a sequence of words.
● These models use the concept of joint probability and create instances
where a given feature (x) or input and the desired output or label (y) exist
simultaneously.
● These models use probability estimates and likelihood to model data
Generative Models
● A generative adversarial network (GAN) has two parts: a generator and a discriminator.
● The generator learns to generate plausible data. The generated instances become negative training
examples for the discriminator.
● The discriminator learns to distinguish the generator's fake data from real data. The discriminator
penalizes the generator for producing implausible results.
● When training begins, the generator produces obviously fake data, and the discriminator quickly
learns to tell that it's fake.
Generative Adversarial Models(GAN’s)
1. The discriminator classifies both real data and fake data from the generator.
2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake
instance as real.
3. The discriminator updates its weights through backpropagation from the discriminator loss through
the discriminator network.
Generator
● The generator part of a GAN learns to create fake data by incorporating feedback from the discriminator.
● It learns to make the discriminator classify its output as real.
● Generator training requires tighter integration between the generator and the discriminator than
discriminator training requires.
● The portion of the GAN that trains the generator includes:
❖ random input
❖ generator network, which transforms the random input into a data instance
❖ discriminator network, which classifies the generated data
❖ discriminator output
❖ generator loss, which penalizes the generator for failing to fool the discriminator
● In its most basic form, a GAN takes random noise as its input.
● The generator then transforms this noise into a meaningful output.
● By introducing noise, we can get the GAN to produce a wide variety of data, sampling from different
places in the target distribution.
● For convenience the space from which the noise is sampled is usually of smaller dimension than the
dimensionality of the output space.
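So we train the generator with the following procedure:
1. Sample random noise.
2. Produce generator output from the sampled random noise.
3. Get the discriminator's "real" or "fake" classification for the generator output.
4. Calculate the generator loss from the discriminator classification.
5. Backpropagate through both the discriminator and the generator to obtain gradients.
6. Use the gradients to change only the generator weights.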
Because a GAN contains two separately trained networks, its training algorithm must address two
complications:
● GANs must juggle two different kinds of training (generator and discriminator).
● GAN convergence is hard to identify.
● The generator and the discriminator have different training processes. So how do we train the GAN as a
whole?
● GAN training proceeds in alternating periods:
1. The discriminator trains for one or more epochs.
2. The generator trains for one or more epochs.
3. Repeat steps 1 and 2 to continue to train the generator and discriminator networks.
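The alternating scheme above can be sketched with a deliberately tiny GAN. Everything here is an assumption made for illustration: the generator is the line G(z) = a*z + b, the discriminator is a logistic classifier D(x) = sigmoid(w*x + c), the real data come from a Gaussian with mean 4, gradients are derived by hand, and each "period" is a single update instead of one or more epochs. The sketch shows the control flow only and makes no claim about convergence.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # 1) Discriminator step: generator is held fixed.
        x_real = rng.gauss(4.0, 1.0)
        x_fake = a * rng.gauss(0.0, 1.0) + b
        d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
        # Gradients of -[log D(x_real) + log(1 - D(x_fake))]
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w, c = w - lr * grad_w, c - lr * grad_c

        # 2) Generator step: discriminator is held fixed.
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        # Gradients of -log D(G(z)), backpropagated through D into G
        grad_a = (d_fake - 1.0) * w * z
        grad_b = (d_fake - 1.0) * w
        a, b = a - lr * grad_a, b - lr * grad_b
    return (a, b), (w, c)

generator_params, discriminator_params = train_toy_gan()
```

Note that the generator update uses the discriminator's output but changes only `a` and `b`; the discriminator's weights are touched only in its own step.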
Two losses
1) Minimax loss
2) Wasserstein loss
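For reference, the two losses are commonly written as follows. The notation is an assumption introduced here: $D$ is the discriminator (or critic), $G$ the generator, $x$ a real sample, and $z$ the random noise input.

```latex
% Minimax loss: the discriminator maximizes it, the generator minimizes it
\min_G \max_D \; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]

% Wasserstein loss: D is a critic that outputs an unbounded score, not a probability
\text{critic: } \max_D \; \mathbb{E}_{x}\big[D(x)\big] - \mathbb{E}_{z}\big[D(G(z))\big]
\qquad
\text{generator: } \max_G \; \mathbb{E}_{z}\big[D(G(z))\big]
```

In the Wasserstein formulation the critic is additionally constrained to be 1-Lipschitz, for example by clipping its weights.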
Hidden Markov Model(HMM)
A Markov chain contains all the possible states of a system and the probability of transiting from one
state to another.
● A first-order Markov chain assumes that the next state depends on the current state only.
An HMM is a statistical model that deals with sequences. It consists of two main components:
● Hidden States (S): These are the underlying, unobservable states that the model transitions between. In
many applications, these states represent some underlying phenomenon or process.
● Observations (O): These are the observable outcomes that are associated with each hidden state.
Observations are generated based on the current hidden state.
● A Markov chain whose states are all observable is much easier to handle. However, in many ML
systems, not all states are observable, and we call these states hidden states or internal states.
● For example, it may not be easy to know whether I am happy or sad. My internal state will be {H or
S}. But we can get some hints from what we observe.
● For example, when I am happy I have a 0.2 chance that I watch a movie, but when I am sad, that
chance goes up to 0.4.
● The probability of observing an observable given an internal state is called the emission probability.
● The probability of transiting from one internal state to another is called the transition probability.
Given the above components, you can use the following steps to work with an HMM:
1) Initialize the HMM with the state transition matrix A, observation matrix B, and initial state
probabilities π.
2) Given a sequence of observations, you can use the Forward-Backward algorithm to compute
the probability of the observations given the model, and the Viterbi algorithm to find the
most likely sequence of hidden states.
3) You can also train an HMM using methods like the Baum-Welch algorithm (a type of
Expectation-Maximization algorithm) to adjust the model parameters based on observed
sequences.
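The steps above can be sketched for the happy/sad example. Only the two emission probabilities for watching a movie (0.2 when happy, 0.4 when sad) come from the text; the initial probabilities, the transition matrix, and the single alternative observation "other" are assumptions added to make the example runnable.

```python
states = ["H", "S"]                      # happy, sad (hidden)
pi = {"H": 0.6, "S": 0.4}                # initial probabilities (assumed)
A = {"H": {"H": 0.8, "S": 0.2},          # transition probabilities (assumed)
     "S": {"H": 0.4, "S": 0.6}}
B = {"H": {"movie": 0.2, "other": 0.8},  # emission probabilities
     "S": {"movie": 0.4, "other": 0.6}}

def forward(obs):
    """Probability of the observation sequence given the model."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: B[s][o] * sum(alpha[p] * A[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely sequence of hidden states for the observations."""
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        step, new_delta = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] * A[p][s])
            step[s] = best_prev
            new_delta[s] = delta[best_prev] * A[best_prev][s] * B[s][o]
        backpointers.append(step)
        delta = new_delta
    path = [max(states, key=lambda s: delta[s])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return list(reversed(path))

likelihood = forward(["movie", "movie"])
best_path = viterbi(["movie", "movie"])
```

Under these assumed parameters, two movie nights in a row are best explained by two sad days, since watching a movie is twice as likely in the sad state.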
Auto Regressive Model
● An autoregressive (AR) model forecasts future behavior based on past behavior data.
● This type of analysis is used when there is a correlation between the time series values and their
preceding and succeeding values.
● Autoregressive modeling uses only past data to predict future behavior.
● Linear regression is carried out on the data from the current series based on one or more past values
of the same series.
● In simple linear regression, the outcome variable (Y) at some point in time is directly related to
a predictor variable (X).
● In AR models, the predictors are previous values of Y itself, which is what distinguishes them
from simple linear regression.
● A simple autoregressive model, also known as an AR(1), would look like this if X is a
time-series variable:
● $X_t = C + \phi_1 X_{t-1} + \epsilon_t$
● $X_{t-1}$ represents the previous period's value of X.
● If “t” represents the current week, then “t-1” represents last week, so $X_{t-1}$ is last week's value.
● The coefficient $\phi_1$ is the numeric constant multiplied by the lagged variable $X_{t-1}$. In other
words, it represents the portion of the previous value that carries over into the current value.
● These coefficients should stay between -1 and 1. When the absolute value of the coefficient
exceeds 1, the series grows exponentially over time.
● $\epsilon_t$ is known as the residual, representing the difference between our period-t prediction
and the correct value ($\epsilon_t = X_t - \hat{X}_t$).
● Autoregressive models rest on the assumption that past values influence current values. This is
why the technique is widely used to analyze natural phenomena, economic processes, and other
processes that change over time.
● Many regression models use linear combinations of predictors to forecast a variable.
● In contrast, autoregressive models use the variable's past values to determine the future value.
● AR(1) autoregressive processes depend on the value immediately preceding the current value.
AR(2) uses the previous two values to calculate the current value, while an AR(0) process is
white noise, which does not depend on any previous terms.
● The least squares method is used to estimate the coefficients in each of these variants.
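As a sketch of how least squares recovers AR(1) coefficients, the code below simulates a series and then regresses each value on its predecessor. The true values (C = 2.0, coefficient 0.6), the Gaussian noise, and the function names are assumptions chosen for the demonstration.

```python
import random

def simulate_ar1(c, phi, n, sigma=1.0, seed=0):
    """Generate X_t = c + phi * X_{t-1} + noise, starting at the long-run mean."""
    rng = random.Random(seed)
    x = [c / (1.0 - phi)]
    for _ in range(n - 1):
        x.append(c + phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def fit_ar1(series):
    """Ordinary least squares of X_t on X_{t-1}; returns (c_hat, phi_hat)."""
    prev, cur = series[:-1], series[1:]
    n = len(prev)
    mean_prev, mean_cur = sum(prev) / n, sum(cur) / n
    cov = sum((p - mean_prev) * (q - mean_cur) for p, q in zip(prev, cur))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi_hat = cov / var
    c_hat = mean_cur - phi_hat * mean_prev
    return c_hat, phi_hat

series = simulate_ar1(c=2.0, phi=0.6, n=5000)
c_hat, phi_hat = fit_ar1(series)

# One-step-ahead forecast from the last observed value
forecast = c_hat + phi_hat * series[-1]
```

With a few thousand observations the estimates land close to the values used in the simulation. An AR(2) fit would add a second lagged column to the same regression.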
Advantages
● One advantage of this model is that the autocorrelation function reveals whether the data
lack randomness.
● Additionally, it is capable of forecasting recurring patterns in data.
● It can also make predictions from limited information, since it uses only the series' own past values.
Limitations
● As a rule of thumb, the autocorrelation coefficient must be at least 0.5 for the model to be appropriate.
● It is usually used to predict economic quantities from historical data, and such quantities are
significantly affected by social factors.
● When several series must be predicted, the vector autoregressive model is recommended instead,
because a single model can predict multiple time series variables at the same time.
Evaluation of Generative Models
1) Inception Score (IS)
● The Inception Score (IS) is a popular evaluation metric for generative models, particularly for
image generation tasks.
● It is based on the idea that a good generative model should produce diverse and realistic samples.
● The IS is calculated by using a pre-trained classifier (typically the Inception network) to classify
the generated samples and compute the entropy of the predicted class probabilities.
● A high IS indicates that the generated samples are both diverse (high entropy of the class
distribution over all samples) and realistic (low entropy of each sample's conditional class distribution).
● A large number of generated images are classified using the model.
● Specifically, the probability of the image belonging to each class is predicted.
● These predictions are then summarized into the inception score.
● The entropy is calculated as the negative sum of each observed probability multiplied by the log of the
probability: $H(p) = -\sum_i p_i \log p_i$
● The intuition here is that large probabilities carry less information than small probabilities.
● The conditional probability $p(y|x)$ of each class $y$ given a generated image $x$ captures our interest
in image quality: a realistic image should yield a confident, low-entropy prediction.
● To capture our interest in a variety of images, we use the marginal probability $p(y)$, the class
distribution averaged over all generated images. We therefore prefer the marginal probability
distribution to have a high entropy.
● These elements are combined by calculating the Kullback-Leibler divergence, or KL divergence (relative
entropy), between the conditional and marginal probability distributions, KL(C || M).
● For each image, the KL divergence is the sum over classes of the conditional probability multiplied by
the log of the conditional probability minus the log of the marginal probability:
$\mathrm{KL}(C \,\|\, M) = \sum_y p(y|x)\,\big(\log p(y|x) - \log p(y)\big)$
● The inception score is the exponential of this KL divergence averaged over all generated images.
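The calculation above can be sketched directly from a table of predicted class probabilities. The probability tables below are made up for illustration; in practice each row would be the output of the pre-trained classifier for one generated image.

```python
import math

def inception_score(cond_probs):
    """cond_probs[i][j] = p(class j | image i), as predicted by a classifier."""
    n, k = len(cond_probs), len(cond_probs[0])
    # Marginal p(y): class distribution averaged over all generated images
    marginal = [sum(row[j] for row in cond_probs) / n for j in range(k)]
    kl_total = 0.0
    for row in cond_probs:
        # KL(p(y|x) || p(y)) for one image; zero-probability terms contribute 0
        kl_total += sum(p * (math.log(p) - math.log(marginal[j]))
                        for j, p in enumerate(row) if p > 0)
    return math.exp(kl_total / n)

# Hypothetical predictions for three generated images and three classes
confident_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 3

best_case = inception_score(confident_and_diverse)
worst_case = inception_score(uninformative)
```

With three classes the score ranges from 1 (every prediction is uniform) up to 3 (each image is classified with certainty and the classes are covered evenly).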
2) Frechet Inception Distance (FID)
● The Frechet Inception Distance (FID) is another evaluation metric for generative models that
addresses some of the limitations of the IS.
● The FID measures the similarity between the distributions of the generated samples and the real
data in the feature space of a pre-trained classifier (again, typically the Inception network).
● The FID is calculated by computing the Frechet distance between the two distributions, which takes
into account both the mean and covariance of the feature vectors.
● A lower FID indicates that the generated samples are more similar to the real data.
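Under the usual assumption that the feature vectors of each set are summarized by a Gaussian, the distance has a closed form. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of the real-data features and $\mu_g, \Sigma_g$ those of the generated samples.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term compares the means and the second compares the covariances, so the score is zero only when both statistics match.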
5) Log Likelihood
● Log-likelihood is a fundamental evaluation metric for generative models that measures the
probability of the real data given the model.
● A higher log-likelihood indicates that the model assigns a higher probability to the real data,
suggesting a better fit.
● However, log-likelihood can be difficult to compute for some generative models, such as
Generative Adversarial Networks (GANs), due to the lack of an explicit likelihood function.
Likelihood
To define a likelihood we need two things:
● Some observed data
● A set of probability distributions that could have generated the data; each distribution is
identified by a parameter.
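A small sketch of these two ingredients: the observed data are coin flips, and the family of distributions is the Bernoulli family indexed by the parameter theta. The data and the candidate parameter values are made up for the example.

```python
import math

def bernoulli_log_likelihood(data, theta):
    """Sum of log-probabilities of each observation under Bernoulli(theta)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

# Hypothetical observed data: six ones and two zeros
data = [1, 1, 1, 0, 1, 1, 0, 1]

# Each candidate theta identifies one distribution that could have generated the data
candidates = [0.25, 0.5, 0.75]
scores = {theta: bernoulli_log_likelihood(data, theta) for theta in candidates}
best = max(candidates, key=lambda theta: scores[theta])
```

The candidate with the highest log-likelihood is the one that fits the observed data best, here the value matching the observed fraction of ones.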
Challenges of Generative Models
● Generative models are often complex and computationally intensive to train and serve.
● They may require specialized hardware and software infrastructure, such as GPUs, TPUs, cloud
computing, and distributed systems.
● Moreover, generative models are difficult to evaluate and compare, as there is no clear and
objective metric to measure their quality and diversity.
● Therefore, generative models need to optimize their resource utilization, scalability, and efficiency
and develop appropriate evaluation methods and benchmarks that reflect the desired outcomes and
user feedback.
Challenges of Generative Models
● Generative models are often opaque and unpredictable in their behavior and outputs.
● They may generate inaccurate, inappropriate, or harmful content to users or society.
● They may also infringe on the original content creators’ or consumers’ intellectual property rights or
moral values.
● Therefore, generative models need to ensure that they are transparent and accountable for their
actions and decisions and respect the rights and preferences of the stakeholders involved.
● They also need to provide explanations and justifications for their outputs and mechanisms for
correction and control.
● One of the main challenges of generative models is balancing between the quality of the generated
data and its diversity.
● Some models may prioritize the quality of the data at the cost of diversity and vice versa.
● Finding a balance between these two factors is crucial for the success of the model.
Applications of Generative Models
1) Data Augmentation
● In cases when it is difficult or expensive to annotate a large amount of training data, we can use
GANs to generate synthetic data and increase the size of the dataset.
● For example, StyleGAN is a generative model proposed by Nvidia that is able to generate very
realistic images of human faces that don’t exist.
2) Super Resolution
● We take as input a low-resolution image and we want to increase its resolution while keeping
its quality as high as possible.
● SRGAN is a generative model that can successfully recover photo-realistic high-resolution images
from low-resolution images.
● The model comprises a deep network in combination with an adversary network like in most GAN
architectures.
3) Inpainting
● In image inpainting, our task is to reconstruct the missing regions in an image.
● In particular, we want to fill the missing pixels of the input image in a way that the new image is
still realistic and the context is consistent.
● The applications of this task are numerous like image rendering, editing, or unwanted object
removal.
● DeepFill is an open-source framework for the image inpainting task that uses a generative
model-based approach.
● Its novelty lies in the Contextual Attention layer which allows the generator to make use of the
information given by distant spatial locations for the reconstruction of local missing pixels.
4) Denoising
● Nowadays, thanks to modern digital cameras we are able to take high-quality photos.
● However, there are still cases where an image contains a lot of noise and its quality is low.
● Removing the noise from an image without losing image features is a very crucial task and researchers
have been working on denoising methods for many years.
● A very popular generative model for image denoising is the autoencoder, which is trained to
reconstruct its input image after removing the noise.
● During training, the network is given the original image and its noisy version.
● Then, the network tries to reconstruct its output to be as close as possible to the original image.
5) Image Colorization
● CycleGAN can be used for automatic image colorization, which is very useful in areas like
restoration of aged or degraded images. For example, CycleGAN converts a grayscale image of a
flower to its colorful RGB form.
6) Translation
● Generative models are also used in image translation where our goal is to learn the mapping between
two image domains.
● Then, the model is able to generate a synthetic version of the input image with a specific modification
like translating a winter landscape to summer.
● CycleGAN is a very famous GAN-based model for image translation.
● The model is trained in an unsupervised way using a dataset of images from the source and the target
domain.
7) Object Transfiguration
● Another exciting application of CycleGAN is object transfiguration, where the model translates one
object class to another, like translating a horse to a zebra, a winter landscape to a summer one, and
apples to oranges.
Deep Generative Models
● A subset of generative modeling, deep generative modeling uses deep neural networks to learn the
underlying distribution of data.
● These models can develop novel samples that have never been seen before by producing new samples
that are similar to the input data but not exactly the same.
● Deep generative models are multi-layer nonlinear neural networks that are used to simulate data
dependency.
● Deep generative models have sparked a considerable interest because of their ability to provide a very
efficient approach to evaluate and understand unlabeled data.
● They do not suffer from the capacity limitation and can learn to generate high-level representations
solely from data.
● More crucially, when back-propagation is allowed, deep generative model training becomes extremely
efficient, resulting in much superior performance than shallow models.
Deep Learning/DNN
● Deep learning is a branch of machine learning which is based on artificial neural networks.
● It is capable of learning complex patterns and relationships within data.
● In deep learning, we don’t need to explicitly program everything.
● It is based on artificial neural networks (ANNs) also known as deep neural networks (DNNs)
● These neural networks are inspired by the structure and function of the human brain’s biological
neurons, and they are designed to learn from large amounts of data.
● In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other.
● Each neuron receives input from the previous layer neurons or the input layer.
● The layers of the neural network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
Advanced DNN
● Advanced Deep Neural Networks (DNNs) refer to sophisticated and complex architectures that have
been developed to tackle more challenging tasks and learn intricate patterns from data.
● These architectures often build upon the foundation of basic neural networks (feedforward networks)
by incorporating advanced techniques and structures to enhance their performance.
● Imagine you have an image. It can be represented as a cuboid having its length, width
(dimension of the image), and height (i.e the channel as images generally have red, green, and
blue channels).
● Now imagine taking a small patch of this image and running a small neural network, called a
filter or kernel on it, with say, K outputs and representing them vertically.
● Now slide that neural network across the whole image. As a result, we will get another image
with a different width, height, and depth. Instead of just R, G, and B channels, we now have
more channels but a smaller width and height.
● This operation is called Convolution.
● If the patch size is the same as that of the image it will be a regular neural network.
CNN
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
● Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights
and the same depth as that of input volume (3 if the input layer is image input).
● For example, suppose we run convolution on an image with dimensions 34x34x3. The possible
size of filters can be a x a x 3, where ‘a’ can be anything like 3, 5, or 7 but small compared to the
image dimension.
● During the forward pass, we slide each filter across the whole input volume step by step, where the
size of each step is called the stride (which can have a value of 2, 3, or even 4 for high-dimensional
images), and compute the dot product between the kernel weights and the patch from the input volume.
● As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a result,
we’ll get output volume having a depth equal to the number of filters. The network will learn all the
filters.
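The sliding dot product can be sketched as follows. To keep it short, the sketch assumes a single-channel square image, a square kernel, and no padding; a real convolution layer repeats this over all input channels and all filters. The image values and the kernel are made up.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a dot product at every step."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Dot product between kernel weights and the current patch
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 image with pixel values 0..15 and a 2x2 kernel
image = [[4 * r + c for c in range(4)] for r in range(4)]
kernel = [[1, 0], [0, 1]]

feature_map = conv2d(image, kernel, stride=1)   # 3x3 output
strided_map = conv2d(image, kernel, stride=2)   # 2x2 output
```

A larger stride visits fewer patches and therefore yields a smaller feature map.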
Layers used to build ConvNets(CNN Architecture)
A ConvNet is a sequence of layers, and every layer transforms one volume to another through a
differentiable function.
Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
1) Input Layers: In CNN, Generally, the input will be an image or a sequence of images. This layer holds
the raw input of the image with width 32, height 32, and depth 3.
2) Convolutional Layers: This is the layer, which is used to extract the feature from the input dataset.
● It applies a set of learnable filters known as the kernels to the input images.
● The filters/kernels are smaller matrices usually 2×2, 3×3, or 5×5 shape.
● It slides over the input image data and computes the dot product between kernel weight and the
corresponding input image patch.
● The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this
layer; with padding that preserves the spatial size, we’ll get an output volume of dimension 32 x 32 x 12.
3) Activation Layer:
● By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to
the network.
● It will apply an element-wise activation function to the output of the convolution layer. Some common
activation functions are RELU: max(0, x), Tanh, Leaky RELU, etc.
● The volume remains unchanged hence output volume will have dimensions 32 x 32 x 12.
4) Pooling layer:
● This layer is periodically inserted in the ConvNet. Its main function is to reduce the size of the
volume, which speeds up computation, reduces memory use, and helps prevent overfitting.
● Two common types of pooling layers are max pooling and average pooling.
● If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.
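The volume sizes quoted in this walkthrough follow from the standard output-size formula, sketched below. The 3 x 3 kernel and the padding of 1 for the convolution layer are assumptions; the text only states the resulting 32 x 32 x 12 volume.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1

width = 32
# Convolution with 12 filters; 3x3 kernel and padding 1 are assumed here
after_conv = conv_output_size(width, kernel=3, stride=1, padding=1)
# Max pooling with 2x2 filters and stride 2
after_pool = conv_output_size(after_conv, kernel=2, stride=2)

shapes = [
    (width, width, 3),             # input layer
    (after_conv, after_conv, 12),  # convolution + activation
    (after_pool, after_pool, 12),  # pooling
]
```

Activation layers are element-wise, so they leave the shape unchanged, while the depth after convolution equals the number of filters.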
5) Flattening:
● The resulting feature maps are flattened into a one-dimensional vector after the convolution
and pooling layers so they can be passed into a fully connected layer for classification or
regression.
6) Fully Connected Layers:
● It takes the input from the previous layer and computes the final classification or regression
task.
7) Output Layer:
● For classification tasks, the output from the fully connected layers is then fed into a function
such as sigmoid or softmax, which converts the output for each class into a probability
score.
RNN
● The nodes in different layers of the neural network are compressed to form a single layer of
recurrent neural networks. A, B, and C are the parameters of the network.
RNN
● The output at any given time is fed back into the network to improve the output.
● The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the
middle layer.
● The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions and
weights and biases.
● If you have a neural network where the various parameters of different hidden layers are not affected
by the previous layer, ie: the neural network does not have memory, then you can use a recurrent
neural network.
● The Recurrent Neural Network will standardize the different activation functions and weights and
biases so that each hidden layer has the same parameters.
● Then, instead of creating multiple hidden layers, it will create one and loop over it as many times as
required.
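The "one layer, looped" idea can be sketched with a scalar hidden state. The tanh activation, the scalar weights, and the parameter names are assumptions made to keep the example small; real RNNs use weight matrices and vector states.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h, h0=0.0):
    """Apply the same hidden layer at every time step, carrying the state along."""
    h = h0
    states = []
    for x in inputs:
        # Same weights (w_xh, w_hh, b_h) are reused at each step
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        states.append(h)
    return states

# Hypothetical parameters and a two-step input sequence
states = rnn_forward([1.0, 0.0], w_xh=1.0, w_hh=1.0, b_h=0.0)
```

Even though the second input is zero, the second state is non-zero because it still carries information from the first step, which is the memory a plain feed-forward layer lacks.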
● The common activation functions used in RNN modules are sigmoid, tanh, and ReLU.
Vanishing/exploding gradient
● The vanishing and exploding gradient phenomena are often encountered in the context of RNNs.
● They make it difficult to capture long-term dependencies, because the gradient is a product of
many factors and can therefore decrease or increase exponentially with respect to the number of
layers.
Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient
problem encountered by traditional RNNs, with LSTM being a generalization of GRU.
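The exponential behaviour can be seen in a simplified sketch. It assumes a scalar recurrent weight and a constant local derivative at every step, so the gradient through time reduces to a repeated product; actual gradients involve matrices and activation derivatives that vary per step.

```python
def gradient_through_time(w_hh, steps, local_derivative=1.0):
    """Product of the per-step factors a gradient picks up when flowing back in time."""
    grad = 1.0
    for _ in range(steps):
        grad *= w_hh * local_derivative
    return grad

vanishing = gradient_through_time(w_hh=0.5, steps=50)   # shrinks toward zero
exploding = gradient_through_time(w_hh=1.5, steps=50)   # grows very large
```

A factor slightly below 1 makes the signal from distant steps negligible, while a factor slightly above 1 makes it blow up, which is why gating (LSTM, GRU) or gradient clipping is used.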
Deep Boltzmann Machine
● Similar to a standard Boltzmann Machine, a DBM defines an energy function that measures the
compatibility of a given configuration of neurons. The energy of a configuration is determined by the
weights of connections and biases.
● The energy function of a DBM can be expressed as a sum of terms, one for each neuron, that captures
interactions between neurons and their biases.
● The energy of a configuration corresponds to an unnormalized probability distribution over the possible
states of the neurons.
● The Boltzmann distribution is used to associate probabilities with different states of the neurons. Lower
energy states are assigned higher probabilities.
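The link between energy and probability can be sketched on a very small network. This is a general Boltzmann machine with three binary units, not a layered DBM, and the weights and biases are made-up values; the point is only to show the energy function and the Boltzmann distribution at work.

```python
import itertools
import math

def energy(state, weights, biases):
    """E(s) = -sum_i b_i s_i - sum_(i,j) w_ij s_i s_j for binary units."""
    e = -sum(b * s for b, s in zip(biases, state))
    for (i, j), w in weights.items():
        e -= w * state[i] * state[j]
    return e

def boltzmann_distribution(weights, biases):
    """Enumerate all states and normalize exp(-E) into probabilities."""
    states = list(itertools.product([0, 1], repeat=len(biases)))
    unnormalized = [math.exp(-energy(s, weights, biases)) for s in states]
    z = sum(unnormalized)  # partition function
    return {s: u / z for s, u in zip(states, unnormalized)}

# Hypothetical connections: units 0 and 1 like to be on together, 1 and 2 do not
weights = {(0, 1): 2.0, (1, 2): -1.0}
biases = [0.0, 0.0, 0.0]
probs = boltzmann_distribution(weights, biases)
most_likely = max(probs, key=probs.get)
```

The configuration with the lowest energy receives the highest probability. Enumerating all states is only feasible for toy sizes, which is why DBMs rely on sampling-based learning.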
Applications:
● Data Generation: Samples can be generated from the learned distribution, enabling the model
to generate new data similar to the training data.
● Feature Learning: DBMs can learn meaningful features from the data, capturing hierarchical
representations in the hidden layers.
● Pretraining: DBMs have been used as pretrained layers for other models to provide useful
initializations for supervised learning tasks.
Limitations:
● Training DBMs can be slow and computationally intensive due to the sampling-based learning
techniques.
● As generative models, DBMs can sometimes struggle to capture complex dependencies in the data,
leading to suboptimal performance in certain scenarios.
Deep Belief Network(DBN)
● DBN is a hybrid generative graphical model. The connections between the top two layers are undirected.
LSTM & GRU
1) LSTM
● The tanh activation is used to help regulate the values flowing through the network.
● To review, the forget gate decides what is relevant to keep from prior steps.
● The input gate decides what information is relevant to add from the current step.
● The output gate determines what the next hidden state should be.
2) GRU
● The GRU is the newer generation of Recurrent Neural networks and is pretty similar to an LSTM.
● GRUs get rid of the cell state and use the hidden state to transfer information.
● It also only has two gates, a reset gate and update gate.
a) Update Gate
● The update gate acts similar to the forget and input gate of an LSTM.
● It decides what information to throw away and what new information to add.
b) Reset Gate
● The reset gate is another gate that is used to decide how much past information to forget.
● GRUs have fewer tensor operations; therefore, they are a little faster to train than LSTMs.
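A single GRU step with its two gates can be sketched as below. The sketch uses a scalar hidden state and a parameter dictionary with hypothetical names; it also fixes one common convention for the update gate (z close to 1 means "take the new candidate"), while some references write the blend the other way round.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gru_step(x, h_prev, p):
    """One GRU update for scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])   # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much of the past is used
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Update gate blends the old state with the candidate
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical parameters: a strongly negative b_z keeps the update gate closed
keep = {"w_z": 0.0, "u_z": 0.0, "b_z": -50.0,
        "w_r": 0.0, "u_r": 0.0, "b_r": 0.0,
        "w_h": 1.0, "u_h": 0.0, "b_h": 0.0}
overwrite = dict(keep, b_z=50.0)

h_kept = gru_step(1.0, 0.3, keep)        # stays close to the old state 0.3
h_new = gru_step(1.0, 0.3, overwrite)    # close to the candidate tanh(1.0)
```

There is no separate cell state: the hidden state alone carries information forward, and the two gates decide how much of it is reused and how much is replaced.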
Thank you!