
DEEP LEARNING

UNIT – V
Interactive Applications of Deep Learning: Machine Vision, Natural Language Processing, Generative Adversarial Networks, Deep Reinforcement Learning.

Machine Vision:
Deep learning has revolutionized machine vision, empowering computers to "see" and
interpret images with remarkable accuracy. Here's a more detailed look at how it works and
its diverse applications:

How Deep Learning Powers Machine Vision

 Traditional Machine Vision: Relied on handcrafted features and algorithms to analyze images. This approach was often limited by its inability to adapt to variations in lighting, angles, and object appearances.
 Deep Learning Revolution: Deep learning, particularly Convolutional Neural
Networks (CNNs), excels at automatically learning complex features from raw image
data. This eliminates the need for manual feature extraction, making machine vision
systems more robust and accurate.

Advantages of Deep Learning in Machine Vision

 Higher Accuracy: Deep learning models can achieve state-of-the-art accuracy in various vision tasks.
 Faster Processing: Deep learning algorithms can process images and videos in real-
time, enabling applications like autonomous driving.
 Improved Robustness: Deep learning models can handle variations in lighting,
angles, and object appearances, making them more reliable in real-world scenarios.

Challenges and Future Trends

 Data Requirements: Deep learning models require large amounts of labeled data for
training.
 Computational Resources: Training deep learning models can be computationally
intensive.
 Explainability: Understanding how deep learning models make decisions can be
challenging.

Future trends in deep learning for machine vision include:

 Automated Machine Learning (AutoML): Making deep learning more accessible by automating model building and hyperparameter tuning.
 Edge Computing: Deploying deep learning models on edge devices (like
smartphones or cameras) for faster processing and reduced latency.
 Generative Adversarial Networks (GANs): Using GANs for image synthesis,
editing, and enhancement.

Deep learning has transformed machine vision, enabling computers to understand and
interpret visual data with unprecedented accuracy. As deep learning technology continues to
advance, we can expect even more innovative applications in the years to come.

Let's delve into even greater detail about the applications of deep learning in machine vision:

1. Object Recognition and Detection:

 What it does: This involves not just identifying what objects are present in an image
(recognition), but also where they are located (detection). Think of it like drawing
bounding boxes around each identified object.
 Deep Learning Techniques: Convolutional Neural Networks (CNNs) are the
workhorses here. Architectures like Faster R-CNN, YOLO (You Only Look Once),
and SSD (Single Shot MultiBox Detector) are designed specifically for object
detection. They learn to identify features that are characteristic of different objects,
and then use these features to both classify and localize the objects. (A short illustrative sketch follows the examples below.)
 Detailed Examples:
o Autonomous Vehicles: Crucial for detecting pedestrians (even in varying
lighting or clothing), traffic lights (and their current state), other vehicles (cars,
trucks, bikes), and road signs. The system needs to understand the context of
these objects to make safe driving decisions.
o Surveillance: Identifying suspicious activities could involve detecting people
trespassing in restricted areas, recognizing abandoned objects, or even
analyzing crowd behavior to predict potential issues. Facial recognition often
plays a role here too.
o Retail: Imagine a smart shelf that automatically tracks which products are
running low, or a checkout system that can identify items without needing to
scan barcodes. Object detection is key to these applications.
o Robotics: Robots can use object detection to navigate complex environments,
grasp and manipulate objects, and perform tasks that require visual
understanding.
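
As an illustration of the detection pipeline described above, the sketch below runs a pretrained Faster R-CNN from torchvision on a single image. The image path and the confidence threshold are placeholder assumptions, not details from the original text.

```python
# Minimal object-detection sketch (assumes torchvision with a pretrained
# Faster R-CNN; image path and score threshold are placeholders).
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode: no dropout, fixed batch-norm statistics

img = convert_image_dtype(read_image("street_scene.jpg"), torch.float)  # CxHxW in [0, 1]

with torch.no_grad():
    prediction = model([img])[0]  # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.8:  # keep only confident detections
        print(f"class id {label.item()} at {box.tolist()} (score {score:.2f})")
```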

2. Image Classification:

 What it does: This task involves assigning a single label to an entire image,
categorizing it into a predefined class. For example, classifying an image as "cat,"
"dog," or "bird."
 Deep Learning Techniques: CNNs are also central to image classification. Models
like ResNet, Inception, and EfficientNet have achieved high accuracy on large image
datasets. The network learns hierarchical features, from simple edges and textures in
the early layers to complex object parts and whole objects in the later layers. (A short classification sketch follows the examples below.)
 Detailed Examples:
o Medical Diagnosis: Classifying medical images (X-rays, MRIs, CT scans) to
detect diseases like cancer, pneumonia, or Alzheimer's. This can assist doctors
in making faster and more accurate diagnoses.
o Agriculture: Classifying images of crops to identify diseases, nutrient
deficiencies, or pest infestations. This allows for targeted interventions and
improved yields.
o Environmental Monitoring: Classifying satellite or aerial images to monitor
deforestation, track pollution, or assess the impact of natural disasters.
o Product Categorization: E-commerce platforms use image classification to
automatically categorize products based on their visual appearance, improving
search and recommendation systems.
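
As a concrete illustration of the classification workflow above, the sketch below runs a pretrained ResNet-50 from torchvision on one image. The image path and the ImageNet normalization values are assumptions made for the example.

```python
# Minimal image-classification sketch (pretrained ResNet-50 from torchvision;
# the image path is a placeholder).
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.resnet50(pretrained=True)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # 1x3x224x224 batch

with torch.no_grad():
    logits = model(img)                      # 1x1000 class scores
    probs = torch.softmax(logits, dim=1)
    top_prob, top_class = probs.max(dim=1)
    print(f"predicted class index {top_class.item()} with probability {top_prob.item():.2f}")
```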

3. Image Segmentation:

 What it does: This is a more granular task than object detection. Instead of just
drawing boxes, segmentation aims to classify each pixel in the image, assigning it to a
specific object or region. This creates a pixel-level mask that separates different
objects or parts of an object.
 Deep Learning Techniques: Fully Convolutional Networks (FCNs), U-Net, and
Mask R-CNN are popular architectures for image segmentation. They often use
encoder-decoder structures to learn both high-level and low-level features. (A short segmentation sketch follows the examples below.)
 Detailed Examples:
o Medical Imaging: Segmenting organs or tissues in MRI or CT scans allows
for precise measurements, 3D reconstructions, and more accurate diagnosis
and treatment planning. For example, segmenting a tumor to determine its size
and shape.
o Satellite Imagery: Analyzing land use by segmenting different types of
terrain (urban areas, forests, water bodies). This is crucial for urban planning,
environmental monitoring, and disaster response.
o Autonomous Driving: Segmenting the road, pedestrians, and other vehicles
provides a much richer understanding of the environment than just object
detection.
o Image Editing: Segmentation can be used to easily remove backgrounds,
replace objects, or apply special effects to specific parts of an image.
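
The following sketch shows per-pixel prediction with a pretrained fully convolutional segmentation model from torchvision; the specific model (FCN-ResNet50), image path, and normalization values are illustrative choices rather than details from the text.

```python
# Minimal semantic-segmentation sketch (pretrained FCN-ResNet50 from torchvision;
# image path and preprocessing values are placeholders).
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.segmentation.fcn_resnet50(pretrained=True)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("road_scene.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    out = model(img)["out"]          # 1 x num_classes x H x W logits
    mask = out.argmax(dim=1)[0]      # H x W map of per-pixel class ids

print("pixels labeled per class:", torch.bincount(mask.flatten()))
```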

4. Facial Recognition:

 What it does: Identifying or verifying individuals based on their facial features. This
can involve comparing a face to a database of known faces or simply confirming that
two faces belong to the same person.
 Deep Learning Techniques: CNNs are used to extract features from faces, such as
the distance between eyes, the shape of the nose, and the texture of the skin. These
features are then used to create a "facial fingerprint" that can be compared to other
faces. (A short face-comparison sketch follows the examples below.)
 Detailed Examples:
o Security: Access control systems that use facial recognition to grant entry to
authorized personnel. Surveillance systems that can identify individuals of
interest.
o Personalization: Smartphones that use facial recognition to unlock the device
or personalize user experiences. Social media platforms that use facial
recognition to tag people in photos.
o Law Enforcement: Using facial recognition to identify suspects in criminal
investigations.
o Marketing: Analyzing facial expressions to understand customer emotions
and preferences.
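
Deep face recognition typically compares embedding vectors rather than raw pixels. The sketch below is a minimal illustration of that comparison step; the embedding function is hypothetical (any CNN that maps a face image to a fixed-length vector could be used), and the similarity threshold is an assumption.

```python
# Minimal face-verification sketch: compare two face embeddings by cosine
# similarity. The embeddings stand in for the output of a hypothetical CNN.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.7) -> bool:
    # The threshold is an assumption; in practice it is tuned on a validation set.
    return cosine_similarity(emb_a, emb_b) >= threshold

# Example with random vectors standing in for real face embeddings:
emb1, emb2 = np.random.randn(128), np.random.randn(128)
print(same_person(emb1, emb2))
```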

5. Pose Estimation:

 What it does: Determining the position and orientation of objects or people in an image. This involves identifying key points, such as joints in the human body or corners of an object, and then using these points to estimate the overall pose.
 Deep Learning Techniques: OpenPose, PoseNet, and other specialized architectures
are used for pose estimation. They often combine CNNs with recurrent neural
networks (RNNs) to capture temporal information in videos. (A short keypoint-detection sketch follows the examples below.)
 Detailed Examples:
o Gaming: Creating realistic character animations by tracking the movements
of actors using pose estimation.
o Sports Analysis: Tracking the movements of athletes to analyze their
performance and identify areas for improvement.
o Human-Computer Interaction: Developing more natural and intuitive ways
for humans to interact with computers using gestures and body language.
o Healthcare: Analyzing patient movements to assess rehabilitation progress or
detect movement disorders.
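
As an illustration of keypoint-based pose estimation, the sketch below uses torchvision's pretrained Keypoint R-CNN to locate human body joints. It is not the OpenPose or PoseNet implementation mentioned above, and the image path and confidence threshold are assumptions.

```python
# Minimal human pose-estimation sketch with torchvision's Keypoint R-CNN
# (image path and threshold are placeholders).
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = convert_image_dtype(read_image("athlete.jpg"), torch.float)

with torch.no_grad():
    pred = model([img])[0]  # 'boxes', 'scores', 'keypoints' (N x 17 x 3: x, y, visibility)

for score, keypoints in zip(pred["scores"], pred["keypoints"]):
    if score > 0.9:
        print("detected person with joint locations:")
        print(keypoints[:, :2])  # x, y coordinates of the 17 COCO keypoints
```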

12.4 Natural Language Processing


Natural language processing (NLP) is the use of human languages, such as
English or French, by a computer. Computer programs typically read and emit
specialized languages designed to allow efficient and unambiguous parsing by simple
programs. More naturally occurring languages are often ambiguous and defy formal
description. Natural language processing includes applications such as machine
translation, in which the learner must read a sentence in one human language and
emit an equivalent sentence in another human language. Many NLP applications
are based on language models that define a probability distribution over sequences
of words, characters or bytes in a natural language.
As with the other applications discussed in this chapter, very generic neural
network techniques can be successfully applied to natural language processing.
However, to achieve excellent performance and to scale well to large applications,
some domain-specific strategies become important. To build an efficient model of
natural language, we must usually use techniques that are specialized for processing
sequential data. In many cases, we choose to regard natural language as a sequence
of words, rather than a sequence of individual characters or bytes. Because the total
number of possible words is so large, word-based language models must operate on
an extremely high-dimensional and sparse discrete space. Several strategies have
been developed to make models of such a space efficient, both in a computational
and in a statistical sense.

12.4.1 n-grams

A language model defines a probability distribution over sequences of tokens in a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The
earliest successful language models were based on models of fixed-length sequences
of tokens called n-grams. An n-gram is a sequence of n tokens.
Models based on n-grams define the conditional probability of the n-th token
given the preceding n − 1 tokens. The model uses products of these conditional
distributions to define the probability distribution over longer sequences:
P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}).    (12.5)

This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, \ldots, x_{n-1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is straightforward because the maximum likelihood
estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core
building block of statistical language modeling for many decades (Jelinek and
Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of n, models have particular names: unigram for n=1, bigram
for n=2, and trigram for n=3. These names derive from the Latin prefixes for
the corresponding numbers and the Greek suffix “-gram” denoting something that
is written.
Usually we train both an n-gram model and an n−1 gram model simultaneously.
This makes it easy to compute

P(x_t \mid x_{t-n+1}, \ldots, x_t-1}) = \frac{P_n(x_{t-n+1}, \ldots, x_t)}{P_{n-1}(x_{t-n+1}, \ldots, x_{t-1})}    (12.6)

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.
As an example, we demonstrate how a trigram model computes the probability of the sentence “THE DOG RAN AWAY.” The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

P(\text{THE DOG RAN AWAY}) = \frac{P_3(\text{THE DOG RAN}) \, P_3(\text{DOG RAN AWAY})}{P_2(\text{DOG RAN})}.    (12.7)
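
The count-based estimates behind equations 12.6 and 12.7 can be made concrete in a few lines of code. The sketch below is a minimal illustration on a toy corpus invented for the example; it estimates the conditional probability directly from counts (the maximum likelihood estimate that equation 12.6 expresses as a ratio of stored probabilities) and ignores smoothing, so unseen n-grams simply get probability zero.

```python
# Minimal maximum-likelihood trigram model: estimate probabilities by counting,
# then score "THE DOG RAN AWAY" as in equation 12.7. Toy corpus, no smoothing.
from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_trigrams = sum(trigram_counts.values())

def p3(w1, w2, w3):
    """Marginal probability of the trigram (w1, w2, w3)."""
    return trigram_counts[(w1, w2, w3)] / total_trigrams

def p_cond(w3, w1, w2):
    """Conditional probability P(w3 | w1, w2), the count ratio of equation 12.6."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

# Equation 12.7: P(THE DOG RAN AWAY) = P_3(THE DOG RAN) * P(AWAY | DOG RAN)
prob = p3("THE", "DOG", "RAN") * p_cond("AWAY", "DOG", "RAN")
print(prob)
```
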
A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t-n+1}, \ldots, x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n-1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n-1} is non-zero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes,
most n-gram models employ some form of smoothing. Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar.
See Chen and Goodman (1999) for a review and empirical comparisons. One basic
technique consists of adding non-zero probability mass to all of the possible next
symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form
a mixture model containing higher-order and lower-order n-gram models, with the
higher-order models providing more capacity and the lower-order models being
more likely to avoid counts of zero. Back-off methods look up the lower-order n-grams if the frequency of the context x_{t-1}, \ldots, x_{t-n+1} is too small to use the higher-order model. More formally, they estimate the distribution over x_t by using contexts x_{t-n+k}, \ldots, x_{t-1}, for increasing k, until a sufficiently reliable estimate is found.
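
To make the smoothing idea concrete, the sketch below applies additive (Laplace-style) smoothing to the conditional trigram estimate, with a crude fallback to a unigram estimate when the context has never been seen. The toy corpus, vocabulary, and pseudo-count value are assumptions for illustration, not a prescription from the text.

```python
# Additive smoothing for a trigram model, with a simple back-off to the unigram
# distribution when the bigram context is unseen. Toy corpus and delta value.
from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()
vocab = set(corpus)
V = len(vocab)

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_smoothed(w3, w1, w2, delta=1.0):
    """P(w3 | w1, w2) with pseudo-count delta added for every possible next token."""
    context = bigrams[(w1, w2)]
    if context == 0:
        # Back off: the context was never observed, use a smoothed unigram estimate.
        return (unigrams[w3] + delta) / (len(corpus) + delta * V)
    return (trigrams[(w1, w2, w3)] + delta) / (context + delta * V)

print(p_smoothed("AWAY", "DOG", "RAN"))   # seen context
print(p_smoothed("HOME", "CAT", "DOG"))   # unseen context, falls back to unigrams
```
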
Classical n-gram models are particularly vulnerable to the curse of dimensionality. There are |V|^n possible n-grams and |V| is often very large. Even with a massive training set and modest n, most n-grams will not occur in the training set.
One way to view a classical n-gram model is that it is performing nearest-neighbor
lookup. In other words, it can be viewed as a local non-parametric predictor,
similar to k-nearest neighbors. The statistical problems facing these extremely
local predictors are described in section 5.11.2. The problem for a language model
is even more severe than usual, because any two different words have the same distance from each other in one-hot vector space. It is thus difficult to leverage much
information from any “neighbors”—only training examples that repeat literally the
same context are useful for local generalization. To overcome these problems, a
language model must be able to share knowledge between one word and other
semantically similar words.
To improve the statistical efficiency of n-gram models, class-based language
models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that
are in the same category. The idea is to use a clustering algorithm to partition the
set of words into clusters or classes, based on their co-occurrence frequencies with
other words. The model can then use word class IDs rather than individual word
IDs to represent the context on the right side of the conditioning bar. Composite
models combining word-based and class-based models via mixing or back-off are
also possible. Although word classes provide a way to generalize between sequences
in which some word is replaced by another of the same class, much information is
lost in this representation.


The variational
autoencoder also has the advantage that it increases a bound on the log-likelihood
of the model, while the criteria for the MP-DBM and related models are more
heuristic and have little probabilistic interpretation beyond making the results of
approximate inference accurate. One disadvantage of the variational autoencoder
is that it learns an inference network for only one problem, inferring z given x.
The older methods are able to perform approximate inference over any subset of
variables given any other subset of variables, because the mean field fixed point
equations specify how to share parameters between the computational graphs for
all of these different problems.
One very nice property of the variational autoencoder is that simultaneously
training a parametric encoder in combination with the generator network forces the
model to learn a predictable coordinate system that the encoder can capture. This
makes it an excellent manifold learning algorithm. See figure 20.6 for examples of
low-dimensional manifolds learned by the variational autoencoder. In one of the
cases demonstrated in the figure, the algorithm discovered two independent factors
of variation present in images of faces: angle of rotation and emotional expression.

20.10.4 Generative Adversarial Networks

Generative adversarial networks or GANs (Goodfellow et al., 2014c) are another generative modeling approach based on differentiable generator networks.
Generative adversarial networks are based on a game theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples x = g(z; \theta^{(g)}). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; \theta^{(d)}), indicating the probability that x is a real training example rather than a fake sample drawn from the model.
The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function v(\theta^{(g)}, \theta^{(d)}) determines the payoff of the discriminator. The generator receives -v(\theta^{(g)}, \theta^{(d)}) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence

g^* = \arg\min_g \max_d v(g, d).    (20.80)

The default choice for v is

v(\theta^{(g)}, \theta^{(d)}) = \mathbb{E}_{x \sim p_{\text{data}}} \log d(x) + \mathbb{E}_{x \sim p_{\text{model}}} \log(1 - d(x)).    (20.81)
Figure 20.6 (referenced earlier): two-dimensional latent coordinate systems learned by a variational autoencoder (Kingma and Welling, 2014a). The images shown are not training examples but images x generated by the model p(x | z) as the 2-D code z is varied over a uniform grid. (Left) The Frey faces manifold, where the horizontal dimension mostly corresponds to rotation of the face and the vertical dimension to emotional expression. (Right) The MNIST manifold.

This drives the discriminator to attempt to learn to correctly classify samples as real
or fake. Simultaneously, the generator attempts to fool the classifier into believing
its samples are real. At convergence, the generator’s samples are indistinguishable
from real data, and the discriminator outputs 1/2 everywhere. The discriminator
may then be discarded.
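
The value function in equation 20.81 translates directly into a training loop: the discriminator ascends log d(x) + log(1 − d(g(z))) while the generator pushes in the opposite direction. The sketch below is a minimal PyTorch version of that game on 1-D toy data; the network sizes, learning rates, and data distribution are assumptions, and it uses the original minimax generator loss rather than the heuristic variant discussed later.

```python
# Minimal GAN sketch on 1-D Gaussian toy data (architecture and hyperparameters
# are illustrative assumptions, not a recipe from the text).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # generator g(z)
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator d(x)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0        # samples from p_data
    z = torch.randn(64, 8)
    fake = G(z)

    # Discriminator ascends v: maximize log d(real) + log(1 - d(fake)).
    loss_d = -(torch.log(D(real) + 1e-8).mean() + torch.log(1 - D(fake.detach()) + 1e-8).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator descends v: minimize log(1 - d(g(z))) (the zero-sum formulation).
    loss_g = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())  # should approach 3.0
```
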
The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient. In the case where \max_d v(g, d) is convex in \theta^{(g)} (such as the case where optimization is performed directly in the space of probability density functions) the procedure is guaranteed to converge and is asymptotically consistent.
Unfortunately, learning in GANs can be difficult in practice when g and d are represented by neural networks and \max_d v(g, d) is not convex. Goodfellow (2014) identified non-convergence as an issue that may cause GANs to underfit.


In general, simultaneous gradient descent on two players’ costs is not guaranteed
to reach an equilibrium. Consider for example the value function v(a, b) = ab,
where one player controls a and incurs cost ab, while the other player controls b
and receives a cost −ab. If we model each player as making infinitesimally small
gradient steps, each player reducing their own cost at the expense of the other
player, then a and b go into a stable, circular orbit, rather than arriving at the
equilibrium point at the origin. Note that the equilibria for a minimax game are
not local minima of v. Instead, they are points that are simultaneously minima
for both players’ costs. This means that they are saddle points of v that are local
minima with respect to the first player’s parameters and local maxima with respect
to the second player’s parameters. It is possible for the two players to take turns
increasing then decreasing v forever, rather than landing exactly on the saddle
point where neither player is capable of reducing its cost. It is not known to what
extent this non-convergence problem affects GANs.
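
The v(a, b) = ab example above is easy to verify numerically. In the sketch below each player takes small gradient steps on its own cost (the step size and starting point are arbitrary choices); instead of converging to the equilibrium at the origin, the pair (a, b) circles around it.

```python
# Simultaneous gradient steps on v(a, b) = a*b: player 1 minimizes a*b by moving a,
# player 2 minimizes -a*b by moving b. The iterates orbit the origin instead of
# converging to it (with finite step sizes they actually spiral slowly outward).
a, b = 1.0, 0.0          # arbitrary starting point
lr = 0.01                # small step size approximating infinitesimal updates

for step in range(5000):
    grad_a = b           # d(a*b)/da, cost of the player controlling a
    grad_b = -a          # d(-a*b)/db, cost of the player controlling b
    a, b = a - lr * grad_a, b - lr * grad_b
    if step % 1000 == 0:
        print(f"step {step}: a = {a:+.3f}, b = {b:+.3f}, radius = {(a*a + b*b) ** 0.5:.3f}")
```
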
Goodfellow (2014) identified an alternative formulation of the payoffs, in which
the game is no longer zero-sum, that has the same expected gradient as maximum
likelihood learning whenever the discriminator is optimal. Because maximum
likelihood training converges, this reformulation of the GAN game should also
converge, given enough samples. Unfortunately, this alternative formulation does
not seem to improve convergence in practice, possibly due to suboptimality of the
discriminator, or possibly due to high variance around the expected gradient.
In realistic experiments, the best-performing formulation of the GAN game
is a different formulation that is neither zero-sum nor equivalent to maximum
likelihood, introduced by Goodfellow et al. (2014c) with a heuristic motivation. In
this best-performing formulation, the generator aims to increase the log probability
that the discriminator makes a mistake, rather than aiming to decrease the log
probability that the discriminator makes the correct prediction. This reformulation
is motivated solely by the observation that it causes the derivative of the generator’s
cost function with respect to the discriminator’s logits to remain large even in the
situation where the discriminator confidently rejects all generator samples.
Stabilization of GAN learning remains an open problem. Fortunately, GAN learning performs well when the model architecture and hyperparameters are carefully selected. Radford et al. (2015) crafted a deep convolutional GAN (DCGAN) that performs very well for image synthesis tasks, and showed that its latent representation space captures important factors of variation, as shown in figure 15.9.
See figure 20.7 for examples of images generated by a DCGAN generator.
Figure 20.7: images generated by GANs trained on the LSUN dataset. (Left) Bedrooms generated by a DCGAN model, reproduced with permission from Radford et al. (2015). (Right) Churches generated by a LAPGAN model, reproduced with permission from Denton et al. (2015).

The GAN learning problem can also be simplified by breaking the generation process into many levels of detail. It is possible to train conditional GANs (Mirza
and Osindero, 2014) that learn to sample from a distribution p(x | y ) rather
than simply sampling from a marginal distribution p(x). Denton et al. (2015)
showed that a series of conditional GANs can be trained to first generate a very
low-resolution version of an image, then incrementally add details to the image.
This technique is called the LAPGAN model, due to the use of a Laplacian pyramid
to generate the images containing varying levels of detail. LAPGAN generators
are able to fool not only discriminator networks but also human observers, with
experimental subjects identifying up to 40% of the outputs of the network as
being real data. See figure 20.7 for examples of images generated by a LAPGAN
generator.
One unusual capability of the GAN training procedure is that it can fit probability distributions that assign zero probability to the training points. Rather than maximizing the log probability of specific points, the generator net learns to trace out a manifold whose points resemble training points in some way. Somewhat paradoxically, this means that the model may assign a log-likelihood of negative infinity
to the test set, while still representing a manifold that a human observer judges
to capture the essence of the generation task. This is not clearly an advantage or
a disadvantage, and one may also guarantee that the generator network assigns
non-zero probability to all points simply by making the last layer of the generator
network add Gaussian noise to all of the generated values. Generator networks
that add Gaussian noise in this manner sample from the same distribution that one
obtains by using the generator network to parametrize the mean of a conditional Gaussian distribution.
Dropout seems to be important in the discriminator network. In particular,
units should be stochastically dropped while computing the gradient for the
generator network to follow. Following the gradient of the deterministic version of
the discriminator with its weights divided by two does not seem to be as effective.
Likewise, never using dropout seems to yield poor results.
While the GAN framework is designed for differentiable generator networks,
similar principles can be used to train other kinds of models. For example, self-
supervised boosting can be used to train an RBM generator to fool a logistic
regression discriminator (Welling et al., 2002).

20.10.5 Generative Moment Matching Networks

Generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015) are another form of generative model based on differentiable generator
networks. Unlike VAEs and GANs, they do not need to pair the generator network
with any other network—neither an inference network as used with VAEs nor a
discriminator network as used with GANs.
These networks are trained with a technique called moment matching. The
basic idea behind moment matching is to train the generator in such a way that
many of the statistics of samples generated by the model are as similar as possible
to those of the statistics of the examples in the training set. In this context, a
moment is an expectation of different powers of a random variable. For example,
the first moment is the mean, the second moment is the mean of the squared
values, and so on. In multiple dimensions, each element of the random vector may
be raised to different powers, so that a moment may be any quantity of the form

\mathbb{E}_{x} \left[ \prod_i x_i^{n_i} \right]    (20.82)

where n = [n_1, n_2, \ldots, n_d] is a vector of non-negative integers.
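
To make equation 20.82 concrete, the sketch below compares the first and (uncentered) second moments of a batch of real data against those of generated samples and forms a squared-difference loss. This is only an illustration of the idea; published generative moment matching networks use a kernel-based statistic (maximum mean discrepancy) rather than an explicit list of moments.

```python
# Illustrative moment-matching loss: match the empirical mean (first moments)
# and uncentered second moments E[x_i * x_j] of real and generated batches.
import torch

def moment_matching_loss(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """real, fake: (batch, d) tensors of samples."""
    mean_diff = real.mean(dim=0) - fake.mean(dim=0)          # first moments
    second_real = real.T @ real / real.shape[0]               # E[x x^T]
    second_fake = fake.T @ fake / fake.shape[0]
    return (mean_diff ** 2).sum() + ((second_real - second_fake) ** 2).sum()

real = torch.randn(256, 4) + 2.0
fake = torch.randn(256, 4)
print(moment_matching_loss(real, fake).item())
```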


Upon first examination, this approach seems to be computationally infeasible.
For example, if we want to match all the moments of the form x_i x_j, then we need
to minimize the difference between a number of values that is quadratic in the
dimension of x. Moreover, even matching all of the first and second moments
would only be sufficient to fit a multivariate Gaussian distribution, which captures
only linear relationships between values. Our ambitions for neural networks are to
capture complex nonlinear relationships, which would require far more moments.
Deep Reinforcement Learning:
Deep reinforcement learning (DRL) is a powerful subfield of machine learning that combines
the strengths of reinforcement learning (RL) and deep learning. It enables agents to learn
optimal decision-making strategies in complex environments by interacting with them and
receiving feedback in the form of rewards or penalties.

Core Concepts

 Reinforcement Learning (RL): RL is a type of machine learning where an agent learns to interact with an environment to maximize a cumulative reward. The agent takes actions in the environment, observes the consequences, and adjusts its behavior to achieve its goals.
 Deep Learning: Deep learning uses artificial neural networks with multiple layers to
learn complex patterns from data. These networks can automatically extract features
from raw input, making them well-suited for handling high-dimensional data like
images or sensor readings.
 Agent: The agent is the learner and decision-maker. It interacts with the environment
by taking actions.
 Environment: The environment is the world in which the agent operates. It provides
states and rewards to the agent.
 State: A state is a representation of the current situation in the environment.
 Action: An action is a choice that the agent can make in the environment.
 Reward: A reward is a feedback signal from the environment, indicating the
desirability of an action.
 Policy: A policy is a strategy that the agent uses to choose actions based on the
current state.
 Value Function: A value function estimates the expected cumulative reward that the
agent can achieve by following a particular policy from a given state.

How DRL Works

1. Interaction: The agent interacts with the environment by taking actions.
2. Observation: The agent observes the new state and receives a reward from the environment.
3. Learning: The agent uses the experience (state, action, reward, next state) to update its policy or value function.
4. Iteration: The agent repeats these steps, gradually improving its decision-making strategy over time. (A minimal interaction-loop sketch follows this list.)
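
A minimal version of this loop is sketched below. The environment interface (reset/step methods) and the agent's choose_action/update methods are placeholder assumptions standing in for any concrete DRL algorithm.

```python
# Generic agent-environment interaction loop. The env and agent objects are
# assumed to expose reset/step and choose_action/update methods respectively.
def run_episodes(env, agent, num_episodes=100):
    for episode in range(num_episodes):
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.choose_action(state)                     # 1. interaction
            next_state, reward, done = env.step(action)             # 2. observation
            agent.update(state, action, reward, next_state, done)   # 3. learning
            state = next_state
            total_reward += reward
        print(f"episode {episode}: return = {total_reward}")        # 4. iterate and improve
```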

Deep Learning's Role

Deep learning models, such as convolutional neural networks (CNNs) or recurrent neural
networks (RNNs), are used to approximate the policy or value function. This allows DRL
agents to handle complex environments with high-dimensional state spaces, such as those
encountered in robotics, games, or autonomous driving.
Types of DRL Algorithms
(I) Value-Based Methods: These methods learn a value function that estimates the expected
cumulative reward for each state-action pair. Q-learning and SARSA are popular examples.

What are Value-Based Methods?

In reinforcement learning, the goal of an agent is to learn an optimal policy, which is a strategy for choosing actions in different situations to maximize its cumulative reward. Value-based methods achieve this by learning a value function.

 Value Function: This function estimates how good it is for an agent to be in a particular state (or take a specific action in a state). It essentially predicts the expected cumulative reward the agent can achieve by following a certain policy from that state.

Key Idea:

Value-based methods learn a value function and then indirectly derive a policy by selecting
actions that maximize this value. The agent doesn't explicitly learn a policy; instead, it acts
greedily according to the learned value function.

Types of Value Functions:

 State-Value Function (V(s)): This function estimates the expected cumulative reward if the agent starts in state 's' and follows a particular policy thereafter.
 Action-Value Function (Q(s, a)): This function estimates the expected cumulative reward if the agent starts in state 's', takes action 'a', and follows a particular policy thereafter.

How Value-Based Methods Work:

1. Estimate Value: The agent interacts with the environment and uses its experiences
(states, actions, rewards) to estimate the value function.
2. Improve Estimate: The agent updates its estimate of the value function based on the
rewards it receives and the transitions it observes.
3. Derive Policy: Once the value function is learned, the agent can derive a policy by
selecting actions that maximize the value in each state.

Popular Value-Based Algorithms:

 Q-learning: An off-policy algorithm that learns the optimal action-value function Q(s, a) directly. It updates the Q-values based on the maximum possible reward in the next state, regardless of the action actually taken by the agent. (A minimal tabular update sketch follows this list.)
 SARSA (State-Action-Reward-State-Action): An on-policy algorithm that learns the action-value function Q(s, a) by taking into account the action actually taken by the agent in the next state.
 Deep Q-Networks (DQN): A deep learning-based version of Q-learning that uses a neural network to approximate the Q-function. This enables Q-learning to be applied to problems with high-dimensional state spaces, such as those encountered in games or robotics.
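
The tabular Q-learning update described above fits in a few lines. The sketch below assumes small discrete state and action spaces and uses illustrative values for the learning rate, discount factor, and exploration rate.

```python
# Tabular Q-learning update (off-policy): Q(s, a) moves toward
# r + gamma * max_a' Q(s', a'), regardless of the action taken next.
import random
import numpy as np

n_states, n_actions = 10, 4               # assumed small discrete spaces
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1    # learning rate, discount, exploration

def choose_action(state):
    """Epsilon-greedy action selection from the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state, done):
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# Usage inside the interaction loop sketched earlier:
#   a = choose_action(s); s2, r, done = env.step(a); q_update(s, a, r, s2, done)
```
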
Advantages of Value-Based Methods:

 Simplicity: They are often easier to understand and implement compared to policy-
based methods.
 Efficiency: They can be more sample-efficient than policy-based methods in certain
scenarios.

Disadvantages of Value-Based Methods:

 Limited to Discrete Action Spaces: Traditional value-based methods are typically limited to problems with discrete action spaces.
 Potential for Overestimation: Q-learning can sometimes overestimate the values of
actions, leading to suboptimal policies.

Deep Learning's Role:

Deep learning has significantly enhanced value-based methods by enabling them to handle
complex environments with high-dimensional state spaces. Deep neural networks are used to
approximate the value function, allowing the agent to learn from raw sensory input, such as
images or sensor readings.

Applications:

Value-based methods, particularly DQN, have been successfully applied to various domains,
including:

 Playing Atari games: DQN has achieved superhuman performance in many Atari
games.
 Robotics: Value-based methods can be used to train robots to perform tasks like
grasping objects or navigating environments.

(II) Policy-Based Methods: These methods directly learn a policy that maps states to
actions. Policy gradient methods, such as REINFORCE, are commonly used.

What are Policy-Based Methods?

Instead of learning a value function and then deriving a policy, policy-based methods directly
learn the policy, which is a mapping from states to probabilities of taking actions. The policy,
often denoted as π(a|s), represents the probability of taking action 'a' in state 's'.

Key Idea:

Policy-based methods aim to find the policy that maximizes the expected cumulative reward.
They do this by directly adjusting the policy parameters based on the observed rewards.

How Policy-Based Methods Work:


1. Parameterize Policy: The policy is represented by a function with parameters (e.g., a
neural network).
2. Evaluate Policy: The agent interacts with the environment and collects experiences
(states, actions, rewards). These experiences are used to evaluate the current policy's
performance.
3. Improve Policy: The policy parameters are updated to increase the probabilities of
actions that led to high rewards and decrease the probabilities of actions that led to
low rewards.

Types of Policy-Based Methods:

 Policy Gradient Methods: These methods use gradient ascent to directly optimize
the policy parameters. They estimate the gradient of the expected cumulative reward
with respect to the policy parameters and then update the parameters in the direction
of the gradient. REINFORCE is a classic example.
 Actor-Critic Methods: These methods combine policy-based and value-based
approaches. They use an "actor" network to learn the policy and a "critic" network to
learn the value function. The critic helps the actor learn more efficiently by providing
a baseline for evaluating the policy's performance. A2C, A3C, and PPO are popular
examples.

Popular Policy Gradient Algorithms:

 REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility): A Monte Carlo policy gradient algorithm that updates the policy parameters based on the total reward received during an episode. (A minimal sketch follows this list.)
 REINFORCE with Baseline: An improvement over REINFORCE that uses a baseline (e.g., the average reward) to reduce variance in the gradient estimates.
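
The REINFORCE update can be sketched as follows: after an episode, every action's log-probability is weighted by the discounted return that followed it, and gradient ascent is performed on that weighted sum. The policy network size, discount factor, and the assumption that episode data has already been collected are all illustrative.

```python
# REINFORCE policy-gradient step in PyTorch: maximize sum_t log pi(a_t | s_t) * G_t,
# where G_t is the discounted return from step t. Episode data is assumed given.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_step(states, actions, rewards):
    """states: (T, 4) float tensor, actions: (T,) long tensor, rewards: list of T floats."""
    # Discounted return G_t for every time step, computed backwards through the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    loss = -(chosen * returns).sum()     # gradient ascent on the expected return

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```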

Advantages of Policy-Based Methods:

 Can Handle Continuous Action Spaces: Policy-based methods can be easily extended to problems with continuous action spaces, which are common in robotics and control tasks.
 Can Learn Stochastic Policies: They can learn stochastic policies, which are often
more robust than deterministic policies in complex environments.
 Convergence Properties: Under certain conditions, policy gradient methods are
guaranteed to converge to a local optimum.

Disadvantages of Policy-Based Methods:

 High Variance: Policy gradient methods can have high variance in the gradient
estimates, which can slow down learning.
 Local Optima: They can get stuck in local optima, which may not be the global
optimum.

Deep Learning's Role:


Deep learning is essential for policy-based methods in complex environments. Deep neural
networks are used to represent the policy function, enabling the agent to learn from high-
dimensional state spaces, such as images or sensor readings.

Applications:

Policy-based methods have been successfully applied to a wide range of problems, including:

 Robotics: Training robots to perform complex manipulation tasks.
 Game Playing: Achieving superhuman performance in games with continuous action spaces.
 Control Tasks: Controlling complex systems, such as aircraft or power grids.

Key Differences Between Value-Based and Policy-Based Methods:

Feature | Value-Based Methods | Policy-Based Methods
What is learned | Value function (V(s) or Q(s, a)) | Policy (π(a|s))
Action Selection | Greedy based on value function | Directly from the learned policy
Action Space | Typically discrete | Can be discrete or continuous
Convergence | Not always guaranteed | Guaranteed to converge to a local optimum
Variance | Lower | Higher
