DL Unit-V
UNIT – V
Interactive Applications of Deep Learning: Machine Vision, Natural
Language Processing, Generative Adversarial Networks, Deep Reinforcement
Learning.
Machine Vision:
Deep learning has revolutionized machine vision, empowering computers to "see" and
interpret images with remarkable accuracy. Here's a more detailed look at how it works and
its diverse applications:
Challenges:
Data Requirements: Deep learning models require large amounts of labeled data for
training.
Computational Resources: Training deep learning models can be computationally
intensive.
Explainability: Understanding how deep learning models make decisions can be
challenging.
Deep learning has transformed machine vision, enabling computers to understand and
interpret visual data with unprecedented accuracy. As deep learning technology continues to
advance, we can expect even more innovative applications in the years to come.
Let's delve into even greater detail about the applications of deep learning in machine vision:
1. Object Detection:
What it does: This involves not just identifying what objects are present in an image
(recognition), but also where they are located (detection). Think of it like drawing
bounding boxes around each identified object.
Deep Learning Techniques: Convolutional Neural Networks (CNNs) are the
workhorses here. Architectures like Faster R-CNN, YOLO (You Only Look Once),
and SSD (Single Shot MultiBox Detector) are designed specifically for object
detection. They learn to identify features that are characteristic of different objects,
and then use these features to both classify and localize the objects.
Detailed Examples:
o Autonomous Vehicles: Crucial for detecting pedestrians (even in varying
lighting or clothing), traffic lights (and their current state), other vehicles (cars,
trucks, bikes), and road signs. The system needs to understand the context of
these objects to make safe driving decisions.
o Surveillance: Identifying suspicious activities could involve detecting people
trespassing in restricted areas, recognizing abandoned objects, or even
analyzing crowd behavior to predict potential issues. Facial recognition often
plays a role here too.
o Retail: Imagine a smart shelf that automatically tracks which products are
running low, or a checkout system that can identify items without needing to
scan barcodes. Object detection is key to these applications.
o Robotics: Robots can use object detection to navigate complex environments,
grasp and manipulate objects, and perform tasks that require visual
understanding.
2. Image Classification:
What it does: This task involves assigning a single label to an entire image,
categorizing it into a predefined class. For example, classifying an image as "cat,"
"dog," or "bird."
Deep Learning Techniques: CNNs are also central to image classification. Models
like ResNet, Inception, and EfficientNet have achieved high accuracy on large image
datasets. The network learns hierarchical features, from simple edges and textures in
the early layers to complex object parts and whole objects in the later layers.
Detailed Examples:
o Medical Diagnosis: Classifying medical images (X-rays, MRIs, CT scans) to
detect diseases like cancer, pneumonia, or Alzheimer's. This can assist doctors
in making faster and more accurate diagnoses.
o Agriculture: Classifying images of crops to identify diseases, nutrient
deficiencies, or pest infestations. This allows for targeted interventions and
improved yields.
o Environmental Monitoring: Classifying satellite or aerial images to monitor
deforestation, track pollution, or assess the impact of natural disasters.
o Product Categorization: E-commerce platforms use image classification to
automatically categorize products based on their visual appearance, improving
search and recommendation systems.
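The "simple edges" that early CNN layers respond to can be illustrated with a single hand-written filter. The sketch below is not a trained network: the kernel values and the toy image are assumptions chosen for illustration, and the function simply slides one filter over the image the way a convolutional layer would.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the patch under the kernel (cross-correlation,
            # which deep learning libraries conventionally call "convolution").
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked vertical-edge filter (illustrative, not learned).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = conv2d_valid(image, vertical_edge)
for row in response:
    print(row)
```

The response is non-zero only where the window straddles the dark-to-bright boundary. A trained CNN learns many such filters from data and stacks them, so that later layers combine edge responses into textures, parts, and whole objects.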
3. Image Segmentation:
What it does: This is a more granular task than object detection. Instead of just
drawing boxes, segmentation aims to classify each pixel in the image, assigning it to a
specific object or region. This creates a pixel-level mask that separates different
objects or parts of an object.
Deep Learning Techniques: Fully Convolutional Networks (FCNs), U-Net, and
Mask R-CNN are popular architectures for image segmentation. They often use
encoder-decoder structures to learn both high-level and low-level features.
Detailed Examples:
o Medical Imaging: Segmenting organs or tissues in MRI or CT scans allows
for precise measurements, 3D reconstructions, and more accurate diagnosis
and treatment planning. For example, segmenting a tumor to determine its size
and shape.
o Satellite Imagery: Analyzing land use by segmenting different types of
terrain (urban areas, forests, water bodies). This is crucial for urban planning,
environmental monitoring, and disaster response.
o Autonomous Driving: Segmenting the road, pedestrians, and other vehicles
provides a much richer understanding of the environment than just object
detection.
o Image Editing: Segmentation can be used to easily remove backgrounds,
replace objects, or apply special effects to specific parts of an image.
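The idea of a pixel-level mask can be sketched in a few lines. The example assumes a segmentation network has already produced one score per class for every pixel; the scores below are made up, and `segment` is a hypothetical helper that only performs the final step of picking the best class per pixel.

```python
def segment(scores):
    """Return a mask where each pixel holds the index of its highest-scoring class.

    `scores[c][r][col]` is the (hypothetical) network score of class c at pixel (r, col).
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    mask = []
    for r in range(height):
        mask_row = []
        for col in range(width):
            best = max(range(num_classes), key=lambda c: scores[c][r][col])
            mask_row.append(best)
        mask.append(mask_row)
    return mask


# Made-up scores for a 2x3 image and two classes: 0 = background, 1 = object.
scores = [
    [[0.9, 0.8, 0.2], [0.7, 0.1, 0.3]],  # background scores
    [[0.1, 0.2, 0.8], [0.3, 0.9, 0.7]],  # object scores
]

mask = segment(scores)
object_area = sum(value == 1 for row in mask for value in row)  # pixels labeled "object"
print(mask, object_area)
```

Counting the pixels assigned to a class is the kind of measurement the tumor-size example relies on: once every pixel carries a label, the area follows directly from the mask.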
4. Facial Recognition:
What it does: Identifying or verifying individuals based on their facial features. This
can involve comparing a face to a database of known faces or simply confirming that
two faces belong to the same person.
Deep Learning Techniques: CNNs are used to extract features from faces. The learned
features capture cues such as the distance between the eyes, the shape of the nose, and
the texture of the skin. These features are combined into a "facial fingerprint" (a
numeric feature vector) that can be compared to the fingerprints of other faces.
Detailed Examples:
o Security: Access control systems that use facial recognition to grant entry to
authorized personnel. Surveillance systems that can identify individuals of
interest.
o Personalization: Smartphones that use facial recognition to unlock the device
or personalize user experiences. Social media platforms that use facial
recognition to tag people in photos.
o Law Enforcement: Using facial recognition to identify suspects in criminal
investigations.
o Marketing: Analyzing facial expressions to understand customer emotions
and preferences.
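Comparing "facial fingerprints" can be sketched as comparing feature vectors. The example below assumes a CNN has already turned each face into a short vector; the vectors, the names, the cosine-similarity measure, and the 0.8 threshold are all illustrative assumptions rather than values from any particular system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_person(emb_a, emb_b, threshold=0.8):
    """Verification: do two faces belong to the same person?"""
    return cosine_similarity(emb_a, emb_b) >= threshold


def identify(query, database, threshold=0.8):
    """Identification: return the best-matching known name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, emb in database.items():
        score = cosine_similarity(query, emb)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up 3-dimensional "fingerprints"; real systems use far longer vectors.
database = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
query = [0.85, 0.15, 0.35]

print(identify(query, database))
print(same_person(database["alice"], database["bob"]))
```

The two functions mirror the two tasks described above: `identify` compares a face against a database of known faces, while `same_person` only confirms whether two faces match.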
5. Pose Estimation:
12.4.1 n-grams
$$P(x_t \mid x_{t-n+1}, \dots, x_{t-1}) = \frac{P_n(x_{t-n+1}, \dots, x_t)}{P_{n-1}(x_{t-n+1}, \dots, x_{t-1})} \tag{12.6}$$
$$P(\text{THE DOG RAN AWAY}) = \frac{P_3(\text{THE DOG RAN}) \, P_3(\text{DOG RAN AWAY})}{P_2(\text{DOG RAN})}. \tag{12.7}$$
A fundamental limitation of maximum likelihood for n-gram models is that $P_n$
as estimated from training set counts is very likely to be zero in many cases, even
though the tuple $(x_{t-n+1}, \dots, x_t)$ may appear in the test set. This can cause two
different kinds of catastrophic outcomes. When $P_{n-1}$ is zero, the ratio is undefined,
so the model does not even produce a sensible output. When $P_{n-1}$ is non-zero but
$P_n$ is zero, the test log-likelihood is $-\infty$. To avoid such catastrophic outcomes,
most n-gram models employ some form of smoothing. Smoothing techniques
shift probability mass from the observed tuples to unobserved ones that are similar.
See Chen and Goodman (1999) for a review and empirical comparisons. One basic
technique consists of adding non-zero probability mass to all of the possible next
symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form
a mixture model containing higher-order and lower-order n-gram models, with the
higher-order models providing more capacity and the lower-order models being
more likely to avoid counts of zero. Back-off methods look up the lower-order
n-grams if the frequency of the context $x_{t-1}, \dots, x_{t-n+1}$ is too small to use the
higher-order model. More formally, they estimate the distribution over $x_t$ by using
contexts $x_{t-n+k}, \dots, x_{t-1}$, for increasing k, until a sufficiently reliable estimate is
found.
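The zero-count problem and the simplest smoothing fix can be made concrete with a toy trigram model. This is a minimal sketch under stated assumptions: the corpus is invented, the helper names are hypothetical, and add-one smoothing stands in for the "add non-zero probability mass to all possible next symbols" technique (practical systems usually prefer the more refined methods reviewed by Chen and Goodman).

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that followed it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        followers[context][tokens[i + n - 1]] += 1
    return followers


def mle_prob(followers, context, word):
    """Maximum likelihood estimate: count(context, word) / count(context).

    Returns None for an unseen context, where the ratio is 0/0 (undefined).
    """
    total = sum(followers[context].values()) if context in followers else 0
    if total == 0:
        return None
    return followers[context][word] / total


def add_one_prob(followers, context, word, vocab_size):
    """Add-one smoothing: every possible next word receives one pseudo-count."""
    seen = followers[context] if context in followers else Counter()
    return (seen[word] + 1) / (sum(seen.values()) + vocab_size)


corpus = "the dog ran away the dog sat the cat ran away".split()
vocab = sorted(set(corpus))
model = train_ngram(corpus, n=3)

print(mle_prob(model, ("the", "dog"), "ran"))    # seen continuation
print(mle_prob(model, ("the", "dog"), "away"))   # unseen continuation: probability zero
print(mle_prob(model, ("cat", "sat"), "ran"))    # unseen context: undefined
print(add_one_prob(model, ("the", "dog"), "away", len(vocab)))
print(add_one_prob(model, ("cat", "sat"), "ran", len(vocab)))
```

The unsmoothed model returns zero or an undefined value for anything it has not seen, which are exactly the two catastrophic cases described above. With smoothing, every continuation gets a small positive probability, and an unseen context falls back to a uniform distribution over the vocabulary.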
Classical n-gram models are particularly vulnerable to the curse of dimensionality.
There are $|V|^n$ possible n-grams and $|V|$ is often very large. Even with a
massive training set and modest n, most n-grams will not occur in the training set.
One way to view a classical n-gram model is that it is performing nearest-neighbor
lookup. In other words, it can be viewed as a local non-parametric predictor,
similar to k-nearest neighbors. The statistical problems facing these extremely
local predictors are described in section 5.11.2. The problem for a language model
is even more severe than usual, because any two different words have the same
distance from each other in one-hot vector space. It is thus difficult to leverage much
information from any “neighbors”—only training examples that repeat literally the
same context are useful for local generalization. To overcome these problems, a
language model must be able to share knowledge between one word and other
semantically similar words.
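The claim that all words are equally far apart in one-hot space is easy to check numerically. The short sketch below uses an invented four-word vocabulary and Euclidean distance; both are assumptions made for illustration.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vocab = ["cat", "dog", "car", "truck"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Distance between every pair of distinct words.
distances = {
    (a, b): euclidean(vectors[a], vectors[b])
    for a in vocab for b in vocab if a < b
}
for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 4))
```

Every pair comes out at the same distance, the square root of 2, so "cat" is no closer to "dog" than it is to "truck". Nothing in the representation tells the model which words are semantically similar.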
To improve the statistical efficiency of n-gram models, class-based language
models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that
are in the same category. The idea is to use a clustering algorithm to partition the
set of words into clusters or classes, based on their co-occurrence frequencies with
other words. The model can then use word class IDs rather than individual word
IDs to represent the context on the right side of the conditioning bar. Composite
models combining word-based and class-based models via mixing or back-off are
also possible. Although word classes provide a way to generalize between sequences
in which some word is replaced by another of the same class, much information is
lost in this representation.
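Sharing statistical strength through word classes can be sketched with a bigram model whose context is a class ID instead of a word ID. In this illustration the classes are assigned by hand and the corpus is invented; an actual class-based model would obtain the classes from a clustering algorithm, as described above.

```python
from collections import Counter, defaultdict

# Hand-assigned classes (an assumption; normally produced by clustering).
word_class = {"cat": "ANIMAL", "dog": "ANIMAL",
              "ran": "VERB", "sat": "VERB", "the": "DET"}


def train_class_bigram(tokens, classes):
    """Count next words given the CLASS of the previous word."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[classes[prev]][nxt] += 1
    return followers


def class_bigram_prob(followers, classes, prev, word):
    """Estimate P(word | class(prev)); None if the class was never seen as context."""
    seen = followers[classes[prev]]
    total = sum(seen.values())
    return seen[word] / total if total else None


corpus = "the dog ran the cat sat the dog sat".split()

# Class-based model: "cat" and "dog" share one context.
class_model = train_class_bigram(corpus, word_class)

# Word-based model for comparison: every word is its own class.
identity = {w: w for w in word_class}
word_model = train_class_bigram(corpus, identity)

print(class_bigram_prob(word_model, identity, "cat", "ran"))     # "cat ran" never occurs
print(class_bigram_prob(class_model, word_class, "cat", "ran"))  # borrows from "dog ran"
```

The word-based model assigns zero probability to "ran" after "cat" because that pair never appears, while the class-based model gives it positive probability by pooling the counts for "cat" and "dog". The cost is the lost information mentioned above: after pooling, the model can no longer distinguish the two animals.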
CHAPTER 20. DEEP GENERATIVE MODELS
of models to those with tractable mean field fixed point equations. The variational
autoencoder also has the advantage that it increases a bound on the log-likelihood
of the model, while the criteria for the MP-DBM and related models are more
heuristic and have little probabilistic interpretation beyond making the results of
approximate inference accurate. One disadvantage of the variational autoencoder
is that it learns an inference network for only one problem, inferring z given x.
The older methods are able to perform approximate inference over any subset of
variables given any other subset of variables, because the mean field fixed point
equations specify how to share parameters between the computational graphs for
all of these different problems.
One very nice property of the variational autoencoder is that simultaneously
training a parametric encoder in combination with the generator network forces the
model to learn a predictable coordinate system that the encoder can capture. This
makes it an excellent manifold learning algorithm. See figure 20.6 for examples of
low-dimensional manifolds learned by the variational autoencoder. In one of the
cases demonstrated in the figure, the algorithm discovered two independent factors
of variation present in images of faces: angle of rotation and emotional expression.
$$v(\theta^{(g)}, \theta^{(d)}) = \mathbb{E}_{x \sim p_{\text{data}}} \log d(x) + \mathbb{E}_{x \sim p_{\text{model}}} \log (1 - d(x)). \tag{20.81}$$
This drives the discriminator to attempt to learn to correctly classify samples as real
or fake. Simultaneously, the generator attempts to fool the classifier into believing
its samples are real. At convergence, the generator’s samples are indistinguishable
from real data, and the discriminator outputs 1/2 everywhere. The discriminator
may then be discarded.
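The behaviour of the value function at convergence can be checked with a small numerical sketch. The function below estimates v by averaging over samples; the discriminator outputs passed to it are made-up numbers rather than the output of a trained network, and `gan_value` is a hypothetical helper name.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of v = E_data[log d(x)] + E_model[log(1 - d(x))].

    d_real: discriminator outputs d(x) on real samples.
    d_fake: discriminator outputs d(x) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well: v is close to 0.
confident = gan_value(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# A discriminator that outputs 1/2 everywhere, as at convergence: v = -log 4.
at_convergence = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident, at_convergence, -math.log(4))
```

The discriminator tries to push v up toward 0, while the generator tries to push it down. When the discriminator can do no better than output 1/2 for every input, both terms equal log(1/2) and v settles at -log 4.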
The main motivation for the design of GANs is that the learning process
requires neither approximate inference nor approximation of a partition function
gradient. In the case where $\max_d v(g, d)$ is convex in $\theta^{(g)}$ (such as the case where
optimization is performed directly in the space of probability density functions)
the procedure is guaranteed to converge and is asymptotically consistent.
Unfortunately, learning in GANs can be difficult in practice when g and d
are represented by neural networks and $\max_d v(g, d)$ is not convex. Goodfellow
Figure 20.7: Images generated by GANs trained on the LSUN dataset. (Left) Images
of bedrooms generated by a DCGAN model, reproduced with permission from Radford
et al. (2015). (Right) Images of churches generated by a LAPGAN model, reproduced with
permission from Denton et al. (2015).
process into many levels of detail. It is possible to train conditional GANs (Mirza
and Osindero, 2014) that learn to sample from a distribution p(x | y ) rather
than simply sampling from a marginal distribution p(x). Denton et al. (2015)
showed that a series of conditional GANs can be trained to first generate a very
low-resolution version of an image, then incrementally add details to the image.
This technique is called the LAPGAN model, due to the use of a Laplacian pyramid
to generate the images containing varying levels of detail. LAPGAN generators
are able to fool not only discriminator networks but also human observers, with
experimental subjects identifying up to 40% of the outputs of the network as
being real data. See figure 20.7 for examples of images generated by a LAPGAN
generator.
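The Laplacian pyramid behind LAPGAN can be sketched without any neural network. The example below splits a toy image into a coarse version plus per-level detail images, then rebuilds it coarse-to-fine. It is a simplified illustration: it uses 2x2 averaging and pixel repetition instead of the usual Gaussian filtering, and the detail images are computed from a real image, whereas in LAPGAN a conditional GAN generates the detail at each level.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block (a box filter)."""
    return [[(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, len(image[0]), 2)]
            for r in range(0, len(image), 2)]


def upsample(image):
    """Double the resolution by repeating every pixel in a 2x2 block."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_pyramid(image, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarsest_image]."""
    pyramid = []
    current = image
    for _ in range(levels):
        smaller = downsample(current)
        blurred = upsample(smaller)
        # Detail = what is lost when the image is shrunk and blown up again.
        detail = [[a - b for a, b in zip(row_a, row_b)]
                  for row_a, row_b in zip(current, blurred)]
        pyramid.append(detail)
        current = smaller
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Start from the coarsest image and add detail back, level by level."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        upsampled = upsample(current)
        current = [[a + b for a, b in zip(row_a, row_b)]
                   for row_a, row_b in zip(upsampled, detail)]
    return current


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

pyramid = build_pyramid(image, levels=2)
restored = reconstruct(pyramid)
print(pyramid[-1])   # coarsest level: a single averaged pixel
print(restored)      # matches the original image
```

The `reconstruct` loop has the same shape as LAPGAN sampling: begin with a very low-resolution image and repeatedly upsample and add detail. The difference is only in where the detail comes from.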
One unusual capability of the GAN training procedure is that it can fit probability
distributions that assign zero probability to the training points. Rather than
maximizing the log probability of specific points, the generator net learns to trace
out a manifold whose points resemble training points in some way. Somewhat
paradoxically, this means that the model may assign a log-likelihood of negative infinity
to the test set, while still representing a manifold that a human observer judges
to capture the essence of the generation task. This is not clearly an advantage or
a disadvantage, and one may also guarantee that the generator network assigns
non-zero probability to all points simply by making the last layer of the generator
network add Gaussian noise to all of the generated values. Generator networks
that add Gaussian noise in this manner sample from the same distribution that one
obtains by using the generator network to parametrize the mean of a conditional
Gaussian distribution.
Dropout seems to be important in the discriminator network. In particular,
units should be stochastically dropped while computing the gradient for the
generator network to follow. Following the gradient of the deterministic version of
the discriminator with its weights divided by two does not seem to be as effective.
Likewise, never using dropout seems to yield poor results.
While the GAN framework is designed for differentiable generator networks,
similar principles can be used to train other kinds of models. For example, self-
supervised boosting can be used to train an RBM generator to fool a logistic
regression discriminator (Welling et al., 2002).
$$\mathbb{E}_x \prod_i x_i^{n_i} \tag{20.82}$$
Deep Reinforcement Learning:
Deep reinforcement learning (DRL) is a powerful subfield of machine learning that combines
the strengths of reinforcement learning (RL) and deep learning. It enables agents to learn
optimal decision-making strategies in complex environments by interacting with them and
receiving feedback in the form of rewards or penalties.
Core Concepts
Deep learning models, such as convolutional neural networks (CNNs) or recurrent neural
networks (RNNs), are used to approximate the policy or value function. This allows DRL
agents to handle complex environments with high-dimensional state spaces, such as those
encountered in robotics, games, or autonomous driving.
Types of DRL Algorithms
(I) Value-Based Methods: These methods learn a value function that estimates the expected
cumulative reward for each state-action pair. Q-learning and SARSA are popular examples.
Key Idea:
Value-based methods learn a value function and then indirectly derive a policy by selecting
actions that maximize this value. The agent doesn't explicitly learn a policy; instead, it acts
greedily according to the learned value function.
1. Estimate Value: The agent interacts with the environment and uses its experiences
(states, actions, rewards) to estimate the value function.
2. Improve Estimate: The agent updates its estimate of the value function based on the
rewards it receives and the transitions it observes.
3. Derive Policy: Once the value function is learned, the agent can derive a policy by
selecting actions that maximize the value in each state.
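The three steps above can be sketched with tabular Q-learning on a tiny problem. Everything specific here is an assumption made for illustration: the five-state chain environment, the reward of 1 at the goal, the hyperparameters, and the choice to explore with uniformly random actions (Q-learning is off-policy, so it can learn the greedy policy's values while behaving randomly).

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal and ends the episode
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical chain environment: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 1: interact with the environment (random exploration).
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Step 2: move the estimate toward reward + discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    # Step 3: derive the policy by picking the highest-valued action in each state.
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q = q_learning()
policy = greedy_policy(q)
print(policy)                       # action chosen in states 0..3
print([round(max(row), 3) for row in q])
```

Note that the policy is never stored or trained directly: it is read off the value table at the end. A DQN follows the same update rule but replaces the table `q` with a neural network, which is what makes high-dimensional inputs such as images feasible.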
Advantages:
Simplicity: They are often easier to understand and implement compared to policy-
based methods.
Efficiency: They can be more sample-efficient than policy-based methods in certain
scenarios.
Deep learning has significantly enhanced value-based methods by enabling them to handle
complex environments with high-dimensional state spaces. Deep neural networks are used to
approximate the value function, allowing the agent to learn from raw sensory input, such as
images or sensor readings.
Applications:
Value-based methods, particularly Deep Q-Networks (DQN), have been successfully
applied to various domains, including:
Playing Atari games: DQN has achieved superhuman performance in many Atari
games.
Robotics: Value-based methods can be used to train robots to perform tasks like
grasping objects or navigating environments.
(II) Policy-Based Methods: These methods directly learn a policy that maps states to
actions. Policy gradient methods, such as REINFORCE, are commonly used.
Instead of learning a value function and then deriving a policy, policy-based methods directly
learn the policy, which is a mapping from states to probabilities of taking actions. The policy,
often denoted as π(a|s), represents the probability of taking action 'a' in state 's'.
Key Idea:
Policy-based methods aim to find the policy that maximizes the expected cumulative reward.
They do this by directly adjusting the policy parameters based on the observed rewards.
Policy Gradient Methods: These methods use gradient ascent to directly optimize
the policy parameters. They estimate the gradient of the expected cumulative reward
with respect to the policy parameters and then update the parameters in the direction
of the gradient. REINFORCE is a classic example.
Actor-Critic Methods: These methods combine policy-based and value-based
approaches. They use an "actor" network to learn the policy and a "critic" network to
learn the value function. The critic helps the actor learn more efficiently by providing
a baseline for evaluating the policy's performance. A2C, A3C, and PPO are popular
examples.
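The policy gradient update can be sketched with REINFORCE on the simplest possible problem. The setup is an assumption chosen to keep the example short: a two-armed bandit with one-step episodes, fixed rewards of 0 and 1, a softmax policy with one preference parameter per action, and no baseline or critic.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities pi(a)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)           # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]              # return of this one-step episode
        for k in range(len(theta)):
            # Gradient of log pi(action) for a softmax policy:
            # 1 - pi(k) for the chosen action, -pi(k) for the others.
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            # Gradient ascent on expected reward.
            theta[k] += lr * ret * grad_log_pi
    return softmax(theta)


final_probs = reinforce_bandit()
print([round(p, 3) for p in final_probs])
```

The policy parameters are adjusted directly, with no value table in between: actions followed by higher returns become more probable. Because each update depends on a single sampled action and its return, the estimates are noisy, which is the variance problem that a critic's baseline in actor-critic methods is meant to reduce.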
Disadvantages:
High Variance: Policy gradient methods can have high variance in the gradient
estimates, which can slow down learning.
Local Optima: They can get stuck in local optima, which may not be the global
optimum.
Applications:
Policy-based methods have been successfully applied to a wide range of problems, including:
.