Deep Learning UNIT-5
UNIT V
Interactive Applications of Deep Learning: Machine Vision, Natural Language Processing,
Generative Adversarial Networks, Deep Reinforcement Learning. [Text Book 1]
Deep Learning Research: Autoencoders, Deep Generative Models: Boltzmann Machines,
Restricted Boltzmann Machines, Deep Belief Networks. [Text Book 1]
Machine Vision
Machine Vision" refers to the utilization of deep learning techniques, particularly within the
domain of computer vision, to develop interactive systems that can perceive and understand
visual information from the environment and respond to user input or environmental changes in
real-time.
Convolutional Neural Networks (CNNs):
The Two-Dimensional Structure of Visual Imagery:
• Convolutional neural networks (CNNs) are commonly used in image recognition tasks.
CNNs are specifically designed to work with two-dimensional data and are capable of
preserving spatial information through layers such as convolutional and pooling layers.
• In the context of handwritten digit recognition using MNIST, CNNs can learn
hierarchical representations of features directly from the two-dimensional pixel grid,
without the need for flattening the images into one-dimensional arrays.
Computational Complexity:
• The computational complexity of processing images in a dense neural network increases
rapidly with the size of the input image.
• For example, a 28x28-pixel MNIST image with one color channel must be flattened into a
784-element array before it can be passed into a dense layer, and each neuron in that layer
then requires 784 weights plus a bias, for 785 parameters per neuron.
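As a quick check, here is a minimal Keras sketch (Keras is an assumption here, though it is the library used throughout Text Book 1) confirming that count:

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # a single dense neuron on a flattened 28x28 MNIST image:
    # 784 weights + 1 bias = 785 parameters
    model = Sequential([Input(shape=(784,)), Dense(1)])
    model.summary()  # Total params: 785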
Convolutional Layers:
• Convolutional layers consist of sets of kernels, which are also known as filters. Each of
these kernels is a small window (called a patch) that scans across the image (in more
technical terms, the filter convolves), from top left to bottom right.
• Kernels are made up of weights, which—as in dense layers—are learned through
backpropagation. Kernels can range in size, but a typical size is 3x3. For the
monochromatic MNIST digits, this 3x3-pixel window would consist of 3 x 3 x 1
weights—nine weights, for a total of 10 parameters (like an artificial neuron in a dense
layer, every convolutional filter has a bias term b).
When reading a page of a book written in English, we begin in the top-left corner and read to the
right. Every time we reach the end of a row of text, we progress to the next row. In this way, we
eventually reach the bottom-right corner, thereby reading all of the words on the page.
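A corresponding sketch (again assuming Keras, with the MNIST input shape) confirms the parameter count for a single 3x3 filter:

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D

    # one 3x3 filter over a 28x28x1 image: 3*3*1 weights + 1 bias = 10 parameters
    model = Sequential([Input(shape=(28, 28, 1)),
                        Conv2D(filters=1, kernel_size=(3, 3))])
    model.summary()  # Total params: 10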
Multiple Filters:
• Multiple filters (also known as kernels) are used to extract different features from the
input images. Each filter performs convolution operations across the input image,
resulting in feature maps that highlight specific patterns or structures within the image.
• The filters in successive layers react to increasingly complex combinations of simple
features such as edges and colors, learning to represent increasingly abstract spatial
patterns and eventually building a hierarchy from simple lines and colors up to complex
textures and shapes.
• The number of filters in the layer, like the number of neurons in a dense layer, is a
hyperparameter that we configure ourselves.
Activation map:
• An activation map, also known as a feature map, is a two-dimensional array that
represents the output of applying a set of filters (kernels) to an input image in a
convolutional neural network (CNN).
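For instance, in this hedged Keras sketch a layer of 32 filters of size 3x3 applied to a 28x28x1 input yields 32 activation maps, each 26x26 under "valid" convolution:

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D

    model = Sequential([Input(shape=(28, 28, 1)),
                        Conv2D(32, (3, 3), activation='relu')])
    model.summary()  # output shape (26, 26, 32); params: 32 * (9 + 1) = 320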
Pooling Layers:
▪ Pooling layers help in reducing computational complexity, controlling overfitting, and
increasing translation invariance.
▪ These layers help reduce the spatial dimensions (width and height) of the input volume,
which in turn reduces the computational complexity of the network and helps in
extracting dominant features.
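Extending the previous sketch with a max-pooling layer halves the spatial dimensions while adding no parameters at all:

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D

    model = Sequential([Input(shape=(28, 28, 1)),
                        Conv2D(32, (3, 3), activation='relu'),  # -> (26, 26, 32)
                        MaxPooling2D(pool_size=(2, 2))])        # -> (13, 13, 32), 0 params
    model.summary()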
Machine vision encompasses several applications. We have encountered classification
previously in this chapter; now we cover object detection, semantic segmentation, and
instance segmentation.
Object detection:
Object detection has broad applications, such as detecting pedestrians in the field of view for
autonomous driving, or for identifying anomalies in medical images.
Generally speaking, object detection is divided into two tasks: detection (identifying where the
objects in the image are) and, subsequently, classification (identifying what the detected
objects are).
Seminal models—ones that have defined progress in this area—include R-CNN, Fast R-CNN,
Faster R-CNN, and YOLO.
R-CNN:
R-CNN (Region-based Convolutional Neural Network) is a seminal object detection framework
that popularized the use of deep learning for object detection tasks.
To emulate this attention, Girshick and his coworkers developed R-CNN to:
1. Perform a selective search for regions of interest (ROIs) within the image.
2. Extract features from these ROIs by using a CNN.
3. Combine two "traditional" (as in Figure 1.12) machine learning approaches—called linear
regression and support vector machines—to, respectively, refine the locations of bounding
boxes and classify objects within each of those boxes.
R-CNNs redefined the state of the art in object detection, achieving a massive gain in
performance over the previous best model in the Pattern Analysis, Statistical Modeling and
Computational Learning (PASCAL) Visual Object Classes (VOC) competition.
YOLO:
• YOLO, which stands for "You Only Look Once," is a state-of-the-art real-time object
detection system introduced by Joseph Redmon et al. in 2016.
• Unlike traditional object detection methods that use region proposal algorithms and
multi-stage pipelines, YOLO approaches the task in a different way, aiming for both high
accuracy and fast inference speed.
NLTK (Natural Language Toolkit) is a popular library in Python for natural language processing
tasks. The "corpus" submodule within NLTK contains a collection of text corpora for various
languages and domains.
The first book in the Project Gutenberg corpus is Emma. This first element contains the
book's title page, chapter markers, and first sentence, all (erroneously) blended together with
newline characters (\n):
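A short sketch of how this inspection might look (the variable names are illustrative, and the corpus downloads are assumed not to have been run yet):

    import nltk
    nltk.download('gutenberg')  # the Project Gutenberg corpus
    nltk.download('punkt')      # models for the sentence tokenizer
    from nltk.corpus import gutenberg
    from nltk import sent_tokenize

    print(gutenberg.fileids()[0])  # 'austen-emma.txt' -- Emma comes first
    sent_tokens = sent_tokenize(gutenberg.raw())
    print(sent_tokens[0])          # title page, chapter marker, and first
                                   # sentence, run together with '\n'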
Skip-Gram:
• The Skip-Gram model learns distributed representations of words in a continuous vector
space. The main objective of Skip-Gram is to predict context words (words surrounding a
target word) given a target word.
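As a concrete illustration, here is a minimal Skip-Gram sketch using gensim's Word2Vec; the toy corpus and hyperparameter values are arbitrary, and sg=1 is what selects Skip-Gram rather than CBOW:

    from gensim.models import Word2Vec

    sentences = [['the', 'quick', 'brown', 'fox'],
                 ['the', 'lazy', 'dog']]              # toy corpus
    model = Word2Vec(sentences, sg=1, vector_size=64,
                     window=5, min_count=1)           # sg=1 => Skip-Gram
    print(model.wv['fox'])                            # learned 64-d word vector
    print(model.wv.most_similar('fox'))               # nearest words in the space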
Figure: Discriminator training loop. Forward propagation through the generator produces fake
images. These are mixed into batches with real images from the dataset and, together with their
labels, are used to train the discriminator. In the figure, learning paths are shown in green,
non-learning paths in black, and a blue arrow calls attention to the image labels, y.

Figure: Generator training loop. Forward propagation through the generator produces fake
images, and inference with the discriminator scores these images. The generator is improved
through backpropagation. Learning paths are shown in green, non-learning paths in black, and a
blue arrow calls attention to the relationship between the image and its label y, which in the
case of generator training is always equal to 1.
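These two loops can be sketched in a few lines of Keras-style training code. The models generator, discriminator, and gan (the generator feeding into a frozen discriminator), along with batch_size, z_dim, and real_images, are assumed placeholders here, not definitions from the text:

    import numpy as np

    # --- discriminator training step ---
    z = np.random.normal(size=(batch_size, z_dim))
    fake_images = generator.predict(z)                   # forward propagation only
    x = np.concatenate([real_images, fake_images])
    y = np.concatenate([np.ones(batch_size),             # real images labeled 1
                        np.zeros(batch_size)])           # fake images labeled 0
    d_loss = discriminator.train_on_batch(x, y)          # discriminator learns

    # --- generator training step ---
    z = np.random.normal(size=(batch_size, z_dim))
    g_loss = gan.train_on_batch(z, np.ones(batch_size))  # labels always 1; only the
                                                         # generator's weights update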
Loading the data: Assuming you set up your directory structure the same as ours and downloaded
the apple.npy file, you can load these data using the command shown below.
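The command itself would look something like this with NumPy (the path is illustrative, following the directory layout just described):

    import numpy as np
    data = np.load('../quickdraw_data/apple.npy')  # one row of 784 pixels per sketch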
We divide by 255 to scale our pixels to be in the range of 0 to 1, just as we did for the MNIST
digits.
We can view an example—a bitmap of the 4,243rd sketch from the apple category—by running this code:
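A plausible reconstruction of that code, assuming Matplotlib and the scaling step above (the reshape assumes 28x28 bitmaps):

    import matplotlib.pyplot as plt
    data = data / 255.                                    # scale pixels into [0, 1]
    plt.imshow(data[4242].reshape(28, 28), cmap='Greys')  # index 4242 = 4,243rd sketch
    plt.show()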
The policy function π enables an agent to map any state s (from the set of all possible states S) to
an action a from the set of all possible actions A.
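As a toy illustration (the value table and its dimensions are entirely hypothetical), a tabular policy can be as simple as picking the action with the highest estimated value in each state:

    import numpy as np

    n_states, n_actions = 5, 3
    Q = np.random.rand(n_states, n_actions)  # hypothetical state-action value table

    def pi(s):
        """Greedy policy: map state s (from S) to an action a (from A)."""
        return int(np.argmax(Q[s]))

    print(pi(0))  # the action chosen in state 0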
1. Undercomplete Autoencoders:
An undercomplete autoencoder is a type of autoencoder neural network architecture where the
dimensionality of the latent space (also known as the bottleneck layer or encoding layer) is lower
than the dimensionality of the input data. Learning an undercomplete representation forces the
autoencoder to capture the most salient features of the training data.
The learning process is described simply as minimizing a loss function

L(x, g(f(x))),

where f is the encoder, g is the decoder, and L is a loss function penalizing g(f(x)) for being
dissimilar from x, such as the mean squared error.
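A minimal Keras sketch of an undercomplete autoencoder on flattened 28x28 inputs (the 32-unit bottleneck is an arbitrary choice, not a value from the text):

    from tensorflow.keras import Input
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense

    inputs = Input(shape=(784,))                       # flattened 28x28 image
    code = Dense(32, activation='relu')(inputs)        # bottleneck h = f(x), 32 << 784
    outputs = Dense(784, activation='sigmoid')(code)   # reconstruction g(h)
    autoencoder = Model(inputs, outputs)
    autoencoder.compile(optimizer='adam', loss='mse')  # L(x, g(f(x))) = MSE
    # autoencoder.fit(x_train, x_train, ...)           # the input is also the target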
4. Denoising autoencoders:
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input
and is trained to predict the original, uncorrupted data point as its output.
Denoising autoencoders are a type of autoencoder neural network architecture designed to learn
robust representations of input data by removing noise from the input during training. They
achieve this by training the autoencoder to reconstruct clean versions of noisy input data,
effectively learning to denoise the input.
1. Noise Injection: During training, noise is intentionally added to the input data to create
corrupted versions of the original data. Common types of noise include Gaussian noise,
dropout noise, or masking noise, where random elements of the input are set to zero.
2. Reconstruction Objective: The denoising autoencoder is trained to minimize the
reconstruction error between the clean input data and the reconstructed output. The objective
is to teach the autoencoder to recover the original, clean data from the noisy input.
3. Regularization: In addition to the reconstruction objective, denoising autoencoders often
incorporate regularization techniques to encourage the learned representations to be robust to
small perturbations or corruptions of the input (steps 1 and 2 are sketched below).
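A hedged sketch of steps 1 and 2, reusing the autoencoder from the earlier sketch; x_train here is a random stand-in for real training data in [0, 1], and the noise level 0.3 is arbitrary:

    import numpy as np

    x_train = np.random.rand(1000, 784)                     # stand-in for real data
    noise = np.random.normal(0.0, 0.3, size=x_train.shape)  # Gaussian corruption
    x_noisy = np.clip(x_train + noise, 0., 1.)              # corrupted inputs

    # train to reconstruct the *clean* data from the *noisy* input
    # (assumes `autoencoder` is built and compiled as in the previous sketch)
    # autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=128)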
Boltzmann Machines:
A Boltzmann machine is an energy-based model defined over a binary vector x: the joint
probability distribution is

P(x) = exp(−E(x)) / Z,

where Z is the partition function and the energy function is

E(x) = −x⊤Ux − b⊤x,

where U is the "weight" matrix of model parameters and b is the vector of bias parameters.
(a) A restricted Boltzmann machine is an undirected graphical model based on a bipartite
graph, with visible units in one layer and hidden units in the other; no connections are
permitted between units within a layer.
(b) A deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intralayer connections. However, a DBN has multiple
hidden layers: the connections between the top two layers are undirected, while the connections
between all other pairs of consecutive layers are directed, pointing toward the data.
(c) A deep Boltzmann machine is an undirected graphical model with several layers of latent
variables. Like RBMs and DBNs, DBMs lack intralayer connections. DBMs are less closely tied
to RBMs than DBNs are. When initializing a DBM from a stack of RBMs, it is necessary to
modify the RBM parameters slightly. Some kinds of DBMs may be trained without first training
a set of RBMs.
Training Restricted Boltzmann Machines:
Training Restricted Boltzmann Machines (RBMs) involves adjusting the weights and biases of
the model to better capture the underlying distribution of the training data. RBMs are typically
trained using contrastive divergence (CD), a variant of stochastic gradient descent (SGD) that
approximates the gradient of the log-likelihood of the data.
The contrastive divergence algorithm is a computationally efficient approximation of the
maximum likelihood learning objective for RBMs.
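A compact NumPy sketch of one CD-1 update for a binary RBM (the function name, variable names, and learning rate are illustrative, not from the text):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b, c, lr=0.01):
        """One contrastive-divergence (CD-1) step for a binary RBM.
        v0: batch of visible vectors; W: weight matrix;
        b: visible biases; c: hidden biases."""
        # positive phase: hidden probabilities and samples given the data
        ph0 = sigmoid(v0 @ W + c)
        h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
        # negative phase: one Gibbs step back to a reconstruction
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + c)
        # gradient approximation: data statistics minus reconstruction statistics
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        b += lr * (v0 - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
        return W, b, c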
P(x, h^(1), ..., h^(l)) = ( ∏_{k=0}^{l−2} P(h^(k) | h^(k+1)) ) · P(h^(l−1), h^(l)),

where x = h^(0), P(h^(k) | h^(k+1)) is the conditional distribution for the visible units
conditioned on the hidden units of the RBM at level k, and P(h^(l−1), h^(l)) is the
visible-hidden joint distribution in the top-level RBM.
The architecture of a deep belief network with two RBMs is as shown in the figure.
Like RBMs and DBNs, DBMs typically contain only binary units—as we assume for simplicity of our
presentation of the model—but it is straightforward to include real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution over the
model variables is parametrized by an energy function E. In the case of a deep Boltzmann
machine with one visible layer, v, and three hidden layers, h^(1), h^(2), and h^(3), the joint
probability is given by

P(v, h^(1), h^(2), h^(3)) = (1/Z(θ)) exp( −E(v, h^(1), h^(2), h^(3); θ) ),

where Z(θ) is the partition function and, with bias parameters omitted for simplicity, the
energy function is

E(v, h^(1), h^(2), h^(3); θ) = −v⊤W^(1)h^(1) − h^(1)⊤W^(2)h^(2) − h^(2)⊤W^(3)h^(3).