DL1 5
Image formation is the analog-to-digital conversion of an image, using 2D sampling and quantization, performed by capture devices such as cameras. In general, we see a 2D view of the 3D world.
Generally, a frame grabber or a digitizer is used for sampling and quantizing the analog signals.
Imaging
The mapping of a 3D world object into a 2D digital image plane is called imaging.
Light reflects from every object that we see, and an imaging system captures these light-reflecting points on its image plane.
Optical Systems
Lenses and mirrors are crucial in focusing the light coming from the 3D scene to form the image on the image plane. These systems define how light is collected and where it is directed, and consequently affect the sharpness and quality of the image produced.
Image Sensors
Image sensors, such as CCD or CMOS sensors, transform the optical image into an electronic signal. These sensors differ in sensitivity and in the resolution they deliver, which affects the image as a whole.
Resolution and Sampling
Resolution refers to the level of detail an image holds and is expressed technically as the number of pixels the image contains. Sampling is the act of discretizing a continuous analog signal, representing it as a set of discrete values. Higher resolution and appropriate sampling rates are required to produce detailed and accurate images.
Image Processing
Image processing is the act of modifying and enhancing digital images using algorithms. Pre-processing includes activities such as filtering, noise reduction and color correction that improve image quality and make information extraction easier.
Advantages
● 1) Improved Accuracy: Digital imaging is less susceptible to human error and captures objects accurately and in high detail.
● 2) Enhanced Flexibility: Digital images are easy to manipulate, edit or analyse with different software, providing great flexibility in post-processing.
● 3) High Storage Capacity: Digital images can be stored in large quantities at very high resolution and quality, and do not suffer physical wear and tear.
● 4) Easy Sharing and Distribution: The use of digital images allows them to be quickly
duplicated and transmitted across various channels and to various gadgets, helping to speed up
the work.
● 5) Advanced Analysis Capabilities: Digital imaging enables the application of analytical tools,
including image recognition and machine learning, which can provide better insights and
increase productivity.
Disadvantages
● 1) Data Size: Large digital images can require substantial storage space and computational power, which may be expensive.
● 2) Image Noise: Digital images may be degraded by noise and artifacts, especially when captured at night or with low-quality image sensors.
● 3) Dependency on Technology: Digital imaging entails the use of sophisticated technology and
equipment that may be costly and there may be constant need to service or replace the
equipment.
● 4) Privacy Concerns: The ability to take and circulate photographs digitally also poses concern
because personal information can be photographed without the subject’s permission.
● 5) Data Loss Risks: Digital image repositories, however, are prone to data loss caused by
hardware failures, corrupting software, or unintentional erasure.
Applications
● 1) Medical Imaging: Digital imaging is employed in medicine for diagnostics, such as X-ray images, MRI scans and CT scans, to visualize internal body structures.
● 2) Surveillance and Security: Digital cameras and imaging systems are greatly needed for
various security or surveillance purposes as they offer live feed and are also useful in acquiring
data for investigations.
● 3) Remote Sensing: Digital imaging plays an important role in remote sensing applications, such as monitoring and mapping of the environment and disasters, using data captured from satellite and aerial systems.
● 4) Entertainment and Media: The entertainment industry involves the use of digital imaging in
films, video games, and virtual reality to deliver improved visual impact.
● 5) Scientific Research: Digital imaging supports scientific studies by providing high-quality images in research fields such as astronomy, biology and materials science.
Linear Filtering
Linear filtering is a computer vision technique that uses a filter, or kernel, to modify an image. It is a powerful image enhancement method that can reduce noise, and it is often used in applications that require fast processing.
A filter is linear if its response to a weighted sum of inputs equals the weighted sum of its responses to each individual input. Mathematically, if x(t) is the input signal and h(t) is the filter's impulse response, the output is given by the convolution y(t) = x(t) ∗ h(t). Because linear filters satisfy superposition and homogeneity, their behaviour is easy to predict and to analyse mathematically.
Features of Linear Filters:
● Superposition Principle: The response to a sum of inputs is the sum of the responses to each input taken separately.
● Homogeneity: The response to a scaled input is the correspondingly scaled response.
● Convolution-Based: The output is obtained by convolving the input signal with the filter's impulse response.
● Frequency Domain Analysis: Because the operation is linear and shift-invariant, linear filters can be analyzed and designed using frequency-domain techniques such as the Fourier transform.
● Predictable Behavior: Their well-defined mathematical structure makes them easy to anticipate and use in different fields.
What are non-linear filters?
Non-linear filters are signal or image processing filters that do not satisfy superposition and homogeneity. Their output is not simply proportional to the input values; instead, they apply operations that depend on the values and arrangement of the inputs, or on more complex mathematical operations and algorithms.
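A classic example of a non-linear filter is the median filter, which replaces each pixel by the median of its neighbourhood. A minimal sketch with OpenCV follows; the file names are placeholders.
Python
import cv2

img = cv2.imread('noisy.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Median filtering is non-linear: the median of a sum of images is not the
# sum of the medians, so superposition does not hold.
denoised = cv2.medianBlur(img, 5)  # 5x5 neighbourhood

cv2.imwrite('denoised.jpg', denoised)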
Comparison of linear and non-linear filters:
● Frequency Domain Analysis: linear filters can be analyzed using the Fourier transform; non-linear filters are not easily analyzed using the Fourier transform.
● Adaptive Behavior: linear filters are static and do not adapt to the input; non-linear filters can adapt to local input characteristics.
Gaussian Filter
A Gaussian filter is a linear smoothing filter used in image processing to reduce noise and blur
images. It's based on the Gaussian distribution, also known as the normal distribution, which is a
bell-shaped curve that describes the probability distribution of a continuous random variable.
● Weighted averaging: The filter assigns higher weights to pixels closer to the center and lower
weights to those farther away.
● Non-causal: The filter window is symmetric about the origin in the time domain.
● Separable equation: The equation for the 2-D isotropic Gaussian can be separated into x and y
components, which allows for fairly quick convolution.
The Gaussian filter is used to remove Gaussian noise, blur images, suppress fine detail and noise, and reduce salt-and-pepper noise.
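A minimal sketch of Gaussian smoothing with OpenCV; the file names, kernel size and sigma values are illustrative.
Python
import cv2

img = cv2.imread('input.jpg')  # placeholder file name

# 5x5 Gaussian kernel with standard deviation 1.5 in both directions.
# Because the 2-D Gaussian is separable, the filter can be applied as two 1-D passes.
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5, sigmaY=1.5)

cv2.imwrite('blurred.jpg', blurred)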
Convolution
Convolution is a mathematical tool used to combine two functions to produce a result. In image processing, convolution transforms an input image by applying a kernel over it in a pixel-wise fashion.
When the convolution mask operates on a particular pixel, it performs the operation by considering that pixel and its neighbouring pixels, and the result is written back to that particular pixel. Thus, convolution in image processing is a mask operator.
● Point operator: While operating on a particular pixel, it takes only that one pixel as input. For example, a brightness-increase operation: we increase each pixel's intensity by the same value to increase the brightness of the image.
● Mask operator: While performing an operation on a particular pixel, it takes that pixel and its neighbouring pixels as input. Convolution is a mask operation.
Illustration:
Image, I = [100, 120, 100, 150, 160]
We are using the mask itself, not the flipped one, hence the indices must be applied carefully. Indices are shown in parentheses:
J(2) = I(0)·H(1) + I(1)·H(0) + I(2)·H(−1)
J=I*H
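A short NumPy sketch of this 1-D operation; the mask H is a hypothetical example, since its values are not given in the text.
Python
import numpy as np

I = np.array([100, 120, 100, 150, 160])
H = np.array([0.2, 0.5, 0.3])   # hypothetical 3-tap mask (not from the text)

# np.convolve flips the mask (true convolution);
# for correlation with the unflipped mask, flip H back first.
J_conv = np.convolve(I, H, mode='same')
J_corr = np.convolve(I, H[::-1], mode='same')

print(J_conv)
print(J_corr)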
Convolution is used in tasks such as:
● pattern recognition
● image morphology
● feature extraction
Edge Detection
Edge detection allows users to observe the features of an image where there is a significant change in gray level. Such a change indicates the end of one region in the image and the beginning of another. Edge detection reduces the amount of data in an image while preserving its structural properties.
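A minimal sketch of gradient-based edge detection using Sobel filters and the Canny detector in OpenCV; file names and thresholds are illustrative.
Python
import cv2
import numpy as np

img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Horizontal and vertical intensity gradients (Sobel operators)
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.sqrt(gx ** 2 + gy ** 2)

# Canny combines smoothing, gradients, non-maximum suppression and hysteresis
edges = cv2.Canny(img, 100, 200)

cv2.imwrite('edges.jpg', edges)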
Corner Detection
Corner detection is used in applications such as:
● Machine vision: Corner detection helps locate objects and measure their dimensions
● Motion detection: Corner detection is often one of the first steps in motion detection applications
Moravec detector
The principle of this detector is to observe if a sub-image, moved around one pixel in all directions,
changes significantly. If this is the case, then the considered pixel is a corner.
Principle of Moravec detector. From left to right : on a flat area, small shifts in the sub-image (in
red) do not cause any change; on a contour, we observe changes in only one direction; around a
corner there are significant changes in all directions.
Mathematically, the change at each pixel (m,n) of the image is characterized by E_{m,n}(x,y), which represents the difference between the sub-images for an offset (x,y):
E_{m,n}(x,y) = Σ_{(u,v)∈W} [ I(u+x, v+y) − I(u,v) ]²
where W is the sub-image (window) centred on pixel (m,n).
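A simplified NumPy sketch of the Moravec corner measure: for each pixel, the sum of squared differences E is computed for small shifts of the window, and the minimum over the shift directions is taken as the corner response. The window size and threshold are illustrative.
Python
import numpy as np

def moravec_response(img, window=3):
    """Return a corner-response map: min over 8 shifts of the SSD E_{m,n}(x,y)."""
    h, w = img.shape
    r = window // 2
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    response = np.zeros_like(img, dtype=float)
    for m in range(r + 1, h - r - 1):
        for n in range(r + 1, w - r - 1):
            patch = img[m - r:m + r + 1, n - r:n + r + 1].astype(float)
            ssd = []
            for dx, dy in shifts:
                shifted = img[m - r + dx:m + r + 1 + dx,
                              n - r + dy:n + r + 1 + dy].astype(float)
                ssd.append(np.sum((shifted - patch) ** 2))
            response[m, n] = min(ssd)  # large only if all shifts change significantly
    return response

# corners = moravec_response(gray_image) > some_threshold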
Bag of Words (BoW)
In bag of words (BoW), we count the number of times each word appears in a document, use the frequency of each word to identify the keywords of the document, and build a frequency histogram from it.
The following models a text document using bag-of-words. Here are two simple text documents:
(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.
Based on these two text documents, a list is constructed as follows for each document:
"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};
Each key is the word, and each value is the number of occurrences of that word in the given text
document.
For the document obtained by combining the two documents above, the representation is:
BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};
So, as we see in the bag algebra, the "union" of two documents in the bags-of-words representation
is, formally, the disjoint union, summing the multiplicities of each element.
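A short Python sketch reproducing the counts above with collections.Counter:
Python
from collections import Counter

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "Mary also likes to watch football games."

def bag_of_words(text):
    # Very simple tokenization: strip the full stops and split on whitespace
    tokens = text.replace(".", "").split()
    return Counter(tokens)

bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)
bow3 = bow1 + bow2          # union of the two bags: multiplicities are summed

print(bow1)   # Counter({'likes': 2, 'movies': 2, 'John': 1, ...})
print(bow3)   # 'likes': 3, 'to': 2, 'watch': 2, ...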
Word order
The BoW representation of a text removes all word ordering. For example, the BoW representation
of "man bites dog" and "dog bites man" are the same, so any algorithm that operates with a BoW
representation of text must treat them in the same way. Despite this lack of syntax or grammar,
BoW representation is fast and may be sufficient for simple tasks that do not require word order.
For instance, for document classification, if the words "stocks", "trade" and "investors" appear multiple times, then the text is likely a financial report. However, because word order is discarded, sentences built from the same words cannot be told apart, and so the BoW representation would be insufficient to determine the detailed meaning of the document.
VLAD
VLAD (Vector of Locally Aggregated Descriptors) is a feature encoding and pooling algorithm
used in computer vision to represent images. It's often used for image classification and instance
retrieval. Here are some things to know about VLAD:
● How it works
VLAD is based on feature descriptors extracted from an image using a dictionary built from a
clustering method. It matches each descriptor to its closest cluster, and then stores the sum of the
differences between the descriptors and the cluster centroid.
● Advantages
VLAD strikes a good balance between computational efficiency and representation ability.
● Extensions
VLAD can be combined with Deep Convolutional Neural Network (DCNN) features to improve
face verification.
● History
VLAD was introduced by Jégou et al. in a paper presented at the 2010 IEEE Conference on Computer Vision and Pattern Recognition. Later work ("All About VLAD", CVPR 2013) proposed improved normalization methods and vocabulary adaptation to improve retrieval performance.
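A hedged NumPy sketch of the core VLAD encoding step, assuming the local descriptors and the cluster centroids (the visual vocabulary) are already available as arrays:
Python
import numpy as np

def vlad_encode(descriptors, centroids):
    """descriptors: (N, D) local features; centroids: (K, D) visual words."""
    K, D = centroids.shape
    # Assign each descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    # Accumulate residuals (descriptor - centroid) per visual word
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            vlad[k] = np.sum(members - centroids[k], axis=0)

    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power (signed square-root) normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)   # L2 normalization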
RANSAC (Random Sample Consensus)
Mathematical Formulation:
Let us assume that we have a set of data points, D = {d1, d2, …, dn}, and we want to estimate a
model, M, that best fits this data. The model can be represented by a set of parameters, θ = {θ1,
θ2, …, θm}. For example, in the case of a linear regression model, θ1 and θ2 would be the slope
and intercept, respectively.
● n: the minimum number of data points required to estimate the model parameters
● k: the maximum number of iterations of the algorithm
● t: the threshold that determines which data points are considered inliers
● d: the number of inliers required to assert that the model fits the data well
1. Randomly select n data points from D and use them to estimate the model parameters θ.
2. Classify the remaining data points as inliers or outliers based on whether their distance to the
model is less than the threshold t.
3. If the number of inliers is greater than or equal to d, re-estimate the model parameters using
all the inliers and terminate the algorithm.
4. Repeat steps 1–3 k times and select the model with the largest number of inliers.
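A compact NumPy sketch of the loop above for 2-D line fitting; the values of n, k, t and d are illustrative choices made by the user.
Python
import numpy as np

def ransac_line(points, n=2, k=100, t=1.0, d=20, rng=np.random.default_rng(0)):
    """points: (N, 2) array. Returns (slope, intercept) of the best line found."""
    best_model, best_inliers = None, 0
    for _ in range(k):
        sample = points[rng.choice(len(points), size=n, replace=False)]
        (x1, y1), (x2, y2) = sample
        if np.isclose(x2, x1):
            continue                       # skip a vertical sample to keep the sketch simple
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Vertical residuals of all points to the candidate line
        residuals = np.abs(points[:, 1] - (slope * points[:, 0] + intercept))
        inliers = points[residuals < t]
        if len(inliers) >= d and len(inliers) > best_inliers:
            # Re-estimate with all inliers (least squares)
            slope, intercept = np.polyfit(inliers[:, 0], inliers[:, 1], 1)
            best_model, best_inliers = (slope, intercept), len(inliers)
    return best_model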
Pros and Cons
Pros:
● RANSAC is a robust algorithm that can handle a large amount of noise and outliers in the
data.
● It can be used with any model that can be estimated from a subset of the data.
● RANSAC can provide a good approximation of the true model even when there are a large
number of outliers in the data.
Cons:
● RANSAC is a heuristic algorithm, which means that it does not guarantee the optimal
solution.
● The choice of parameters (n, k, t, d) can have a significant impact on the performance of the
algorithm. Finding the optimal values for these parameters can be challenging.
● The algorithm can be sensitive to the initial random sample, which can lead to different results
for different runs of the algorithm.
Applications
1. Line fitting: RANSAC can be used to fit a line to a set of 2D or 3D points in the presence of
outliers. This is useful in computer vision tasks such as lane detection in autonomous vehicles.
2. Fundamental matrix estimation: RANSAC can be used to estimate the fundamental matrix
that relates corresponding points in two images. This is useful in stereo vision applications
such as 3D reconstruction and object tracking.
3. Object recognition: RANSAC can be used to match features between images and estimate
the pose of objects in the scene. This is useful in robotics applications such as pick-and-place
tasks.
4. Plane fitting: RANSAC can be used to fit a plane to a set of 3D points in the presence of
outliers. This is useful in computer graphics applications such as rendering and 3D modeling.
Overall, RANSAC is a powerful algorithm for robust model estimation in the presence of
outliers. While it has its limitations, it can be a valuable tool in a wide range of applications in
machine learning and computer vision.
Hough transform in computer vision.
The Hough Transform is a popular technique in computer vision and image processing, used for
detecting geometric shapes like lines, circles, and other parametric curves. Named after Paul
Hough, who introduced the concept in 1962, the transform has evolved and found numerous
applications in various domains such as medical imaging, robotics, and autonomous driving. In
this article, we will discuss how Hough transformation is utilized in computer vision.
What is Hough Transform?
A feature extraction method called the Hough Transform is used to find basic shapes in a picture,
like circles, lines, and ellipses. Fundamentally, it transfers these shapes’ representation from the
spatial domain to the parameter space, allowing for effective detection even in the face of
distortions like noise or occlusion.
How Does the Hough Transform Work?
The accumulator array, sometimes referred to as the parameter space or Hough space, is the first
thing that the Hough Transform creates. The available parameter values for the shapes that are
being detected are represented by this space. The slope (m) and y-intercept (b) of a line, for
instance, could be the parameters in the line detection scenario.
The Hough Transform calculates the matching curves in the parameter space for each edge point
in the image. This is accomplished by finding the curve that intersects the parameter values at the
spot by iterating over all possible values of the parameters. The “votes” or intersections for every
combination of parameters are recorded by the accumulator array.
In the end, the programme finds peaks in the accumulator array that match the parameters of the
shapes it has identified. These peaks show whether the image contains lines, circles, or other
shapes.
Variants and Techniques of Hough transform
The performance and adaptability of the Hough Transform have been improved throughout time
by a number of variations and techniques:
● Paul Hough’s initial formulation for line identification is known as the Standard Hough
Transform (SHT). It entails voting for every possible combination of parameters and
discretizing the parameter space.
● Probabilistic Hough Transform (PHT): The PHT randomly chooses a subset of edge points
and only applies line detection to those locations in order to increase efficiency. For real-time
applications, this minimizes processing complexity while maintaining accuracy in the output.
● Generalized Hough Transform (GHT): By recording the spatial relationships of every shape
using a template, the GHT can detect any shape, in contrast to the SHT’s limited ability to
detect just specified shapes. After that, a voting system akin to the SHT is used to match this
template with the image.
● Accumulator Space Dimensionality: The classic Hough Transform identifies lines in two dimensions, but it can also detect more complicated shapes, such as ellipses or circles, in higher-dimensional parameter spaces. Every extra dimension corresponds to an extra parameter of the identified shape.
The Python code implementation for line detection utilizing the Hough Transform and OpenCV is described in detail below.
1) Import necessary libraries
This code imports OpenCV for image processing and the NumPy library for numerical
computations.
Python
import numpy as np
import cv2

# Read the input image in color mode
img = cv2.imread('lane_hough.jpg', cv2.IMREAD_COLOR)
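A possible continuation of the script above, using the standard OpenCV pipeline (grayscale conversion, Canny edge detection, probabilistic Hough transform); the parameter values are illustrative, and `img`, `cv2` and `np` come from the code above.
Python
# Convert to grayscale and detect edges
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150, apertureSize=3)

# Probabilistic Hough transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)

# Draw the detected lines on the original image
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(img, (x1, y1), (x2, y2), (0, 0, 255), 2)

cv2.imwrite('lines_detected.jpg', img)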
Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn and model
complex data patterns. Common activation functions include:
● Sigmoid: σ(x) = 1 / (1 + e^(−x))
● Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
● ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
● Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
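A small NumPy sketch of these activation functions:
Python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")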
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared
Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the network to
update the weights. The gradient of the loss function with respect to each weight is calculated,
and the weights are adjusted using gradient descent.
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient descent
include:
● Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
● Stochastic Gradient Descent (SGD): Updates weights for each training example individually.
● Mini-batch Gradient Descent: Updates weights after computing the gradient over a small
batch of training examples.
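A minimal sketch of the three variants for a linear model with squared loss; the only difference between them is the batch size used for each weight update. The data and learning rate are illustrative.
Python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def train(batch_size, lr=0.05, epochs=20):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad            # step in the negative gradient direction
    return w

w_batch = train(batch_size=len(X))   # batch gradient descent
w_sgd = train(batch_size=1)          # stochastic gradient descent
w_mini = train(batch_size=32)        # mini-batch gradient descent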
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
● Accuracy: The proportion of correctly classified instances out of the total instances.
● Precision: The ratio of true positive predictions to the total predicted positives.
● Recall: The ratio of true positive predictions to the actual positives.
● F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
● Confusion Matrix: A table used to describe the performance of a classification model,
showing the true positives, true negatives, false positives, and false negatives.
Code Implementation of Feedforward neural network
This code demonstrates the process of building, training, and evaluating a neural network model
using TensorFlow and Keras to classify handwritten digits from the MNIST dataset. Initially, the
MNIST dataset is loaded and normalized by scaling the pixel values to the range [0, 1]. The model
architecture is defined using the Sequential API, consisting of a Flatten layer to convert the 2D
image input into a 1D array, followed by a Dense layer with 128 neurons and ReLU activation,
and a final Dense layer with 10 neurons and softmax activation to output probabilities for each
digit class. The model is compiled with the Adam optimizer, SparseCategoricalCrossentropy loss
function, and SparseCategoricalAccuracy metric. The model is then trained for 5 epochs on the
training data. Finally, the model’s performance is evaluated on the test set, and the test accuracy
is printed.
Python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy
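A sketch of the remaining steps described above (loading and normalizing the data, defining, compiling, training and evaluating the model), continuing from the imports:
Python
# Load and normalize MNIST (pixel values scaled to [0, 1])
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Flatten -> Dense(128, relu) -> Dense(10, softmax)
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=[SparseCategoricalAccuracy()])

model.fit(x_train, y_train, epochs=5)

test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', test_acc)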
What is backpropagation?
● In machine learning, backpropagation is an effective algorithm used to train artificial neural
networks, especially in feed-forward neural networks.
● Backpropagation is an iterative algorithm that helps minimize the cost function by determining how the weights and biases should be adjusted. During every epoch, the model learns by adapting the weights and biases to reduce the loss, moving down along the gradient of the error. It is therefore used together with optimization algorithms such as gradient descent or stochastic gradient descent.
● Computing the gradient in the backpropagation algorithm helps to minimize the cost
function and it can be implemented by using the mathematical rule called chain rule from
calculus to navigate through complex layers of the neural network.
Fig(a) A simple illustration of how the backpropagation works by adjustments of weights
Note that, our actual output is 0.5 but we obtained 0.67. To calculate the error, we can use the
below formula:
Errorj = ytarget − y5
Error = 0.5 – 0.67
= -0.17
Using this error value, we will be backpropagating.
Implementing Backward Propagation
Each weight in the network is changed by
Δwi,j = η δj Oi
δj = Oj (1 − Oj)(tj − Oj)            (if j is an output unit)
δj = Oj (1 − Oj) Σk δk wk,j          (if j is a hidden unit)
where
η is a constant called the learning rate,
tj is the target output for unit j,
Oi is the output of unit i (the input carried by the weight wi,j), and
δj is the error measure for unit j.
Step 3: To calculate the backpropagation, we need to start from the output unit.
To compute δ5, we use the output of the forward pass:
δ5 = y5(1 − y5)(ytarget − y5)
   = 0.67(1 − 0.67)(−0.17)
   = −0.0376
For the hidden units,
to compute the hidden-unit errors, we take the value of δ5:
δ3 = y3(1 − y3)(w1,3 · δ5)
   = 0.56(1 − 0.56)(0.3 × −0.0376)
   = −0.0027
δ4 = y4(1 − y4)(w2,3 · δ5)
   = 0.59(1 − 0.59)(0.9 × −0.0376)
   = −0.0082
Step 4: We need to update the weights, from the output unit to the hidden units,
Δwi,j = η δj Oi
Once, the above process is done, we again perform the forward pass to find if we obtain the actual
output as 0.5.
While performing the forward pass again, we obtain the following values:
y3 = 0.57
y4 = 0.56
y5 = 0.61
We can clearly see that our y5 value is 0.61 which is not an expected actual output, So again we
need to find the error and backpropagate through the network by updating the weights until the
actual output is obtained.
Error = ytarget − y5
= 0.5 – 0.61
= -0.11
This is how backpropagation works: it first performs a forward pass to check whether the desired output is obtained; if not, it computes the error and propagates it backwards through the layers of the network, adjusting the weights according to the error. This process continues until the network produces the desired output.
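A compact NumPy sketch of this forward/backward loop for a small 2-2-1 sigmoid network; the inputs and initial weights are illustrative assumptions, not the exact values from the figure.
Python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (the figure's exact inputs/weights are not given in the text)
x = np.array([0.35, 0.7])            # inputs
W1 = np.array([[0.2, 0.2],           # input -> hidden weights
               [0.3, 0.3]])
W2 = np.array([0.3, 0.9])            # hidden -> output weights
y_target, eta = 0.5, 1.0

for epoch in range(100):
    # Forward pass
    h = sigmoid(W1 @ x)              # hidden outputs (y3, y4)
    y5 = sigmoid(W2 @ h)             # network output

    # Backward pass: delta terms as in the equations above
    delta5 = y5 * (1 - y5) * (y_target - y5)
    delta_h = h * (1 - h) * (W2 * delta5)

    # Weight updates: delta_w = eta * delta_j * O_i
    W2 += eta * delta5 * h
    W1 += eta * np.outer(delta_h, x)

print(round(y5, 3))                  # approaches the target 0.5 as training continues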
ReLU
● ReLU stands for rectified linear activation unit and is considered one of the few milestones
in the deep learning revolution. It is simple yet really better than its predecessor activation
functions such as sigmoid or tanh.
● Now how does ReLU transform its input? It uses this simple formula:
● f(x)=max(0,x)
● Both the ReLU function and its derivative are monotonic. The function returns 0 if it receives any negative input, but for any positive value x, it returns that value back. Thus it gives an output that has a range from 0 to infinity.
● Now let us give some inputs to the ReLU activation function and see how it transforms
them and then we will plot them also.
● First, let us define a ReLU function and apply it to a few sample inputs:
Python
def ReLU(x):
    if x > 0:
        return x
    else:
        return 0

print(ReLU(-3))  # 0
print(ReLU(5))   # 5
● The Rectified Linear Unit is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for any positive value x it returns that value back. So it can be written as f(x) = max(0, x).
● Graphically, the function is flat at 0 for negative inputs and increases linearly with slope 1 for positive inputs.
● It's surprising that such a simple function (and one composed of two linear pieces) can
allow your model to account for non-linearities and interactions so well. But the ReLU
function works great in most applications, and it is very widely used as a result.
● Why It Works
● Introducing Interactions and Non-linearities
● Activation functions serve two primary purposes: 1) Help a model account for interaction
effects.
What is an interaction effect? It is when one variable A affects a prediction differently
depending on the value of B. For example, if my model wanted to know whether a certain
body weight indicated an increased risk of diabetes, it would have to know an individual's
height. Some bodyweights indicate elevated risks for short people, while indicating good
health for tall people. So, the effect of body weight on diabetes risk depends on height,
and we would say that weight and height have an interaction effect.
● 2) Help a model account for non-linear effects. This just means that if I graph a variable
on the horizontal axis, and my predictions on the vertical axis, it isn't a straight line. Or
said another way, the effect of increasing the predictor by one is different at different values
of that predictor.
● How ReLU captures Interactions and Non-Linearities
● Interactions: Imagine a single node in a neural network model. For simplicity, assume it
has two inputs, called A and B. The weights from A and B into our node are 2 and 3
respectively. So the node output is f(2A + 3B). We'll use the ReLU function for our f. So, if 2A + 3B is positive, the output value of our node is also 2A + 3B. If 2A + 3B is negative, the output value of our node is 0.
● For concreteness, consider a case where A=1 and B=1. The output is 2A + 3B, and if A increases, then the output increases too. On the other hand, if B=-100 then the output is 0, and if A increases moderately, the output remains 0. So A might increase our output, or it might not. It just depends what the value of B is.
● This is a simple case where the node captured an interaction. As you add more nodes and
more layers, the potential complexity of interactions only increases. But you should now
see how the activation function helped capture an interaction.
● Non-linearities: A function is non-linear if the slope isn't constant. So, the ReLU function
is non-linear around 0, but the slope is always either 0 (for negative values) or 1 (for
positive values). That's a very limited type of non-linearity.
● But two facts about deep learning models allow us to create many different types of non-
linearities from how we combine ReLU nodes.
● First, most models include a bias term for each node. The bias term is just a constant
number that is determined during model training. For simplicity, consider a node with a
single input called A, and a bias. If the bias term takes a value of 7, then the node output is
f(7+A). In this case, if A is less than -7, the output is 0 and the slope is 0. If A is greater
than -7, then the node's output is 7+A, and the slope is 1.
● So the bias term allows us to move where the slope changes. So far, it still appears we can
have only two different slopes.
● However, real models have many nodes. Each node (even within a single layer) can have
a different value for its bias, so each node can change slope at different values for our
input.
● When we add the resulting functions back up, we get a combined function that changes
slopes in many places.
● These models have the flexibility to produce non-linear functions and account for
interactions well (if that will give better predictions). As we add more nodes in each layer
(or more convolutions if we are using a convolutional model) the model gets even greater
ability to represent these interactions and non-linearities.
Regularization Techniques
Commonly used regularization techniques include:
1. L2 regularization
2. L1 regularization
3. Dropout regularization
L2 regularization
In regression analysis, L2 regularization is also called ridge regression. In this type of regularization, the squared magnitude of the coefficients or weights, multiplied by a regularizer term, is added to the loss or cost function. L2 regularization can be represented by the following equation.
Loss = original loss + λ Σi wi²
● Lambda is the hyperparameter that is tuned to prevent overfitting i.e. penalize the
insignificant weights by forcing them to be small but not zero.
● L2 regularization works best when all the weights are roughly of the same size, i.e., input
features are of the same range.
● This technique also helps the model to learn more complex patterns from data without
overfitting easily.
L1 regularization
L1 regularization is also referred to as lasso regression. In this type of regularization, the absolute value of the magnitude of the coefficients or weights, multiplied by a regularizer term, is added to the loss or cost function. It can be represented with the following equation.
Loss = original loss + λ Σi |wi|
A fraction of the sum of absolute values of weights to the loss function is added in the L1
regularization. In this way, you will be able to eliminate some coefficients with lesser values by
pushing those values towards 0. You can observe the following by using L1 regularization:
● Since the L1 regularization adds an absolute value as a penalty to the cost function, the
feature selection will be done by retaining only some important features and eliminating
the lower or unimportant features.
● This technique is also robust to outliers, i.e., the model will be able to easily learn about
outliers in the dataset.
● This technique will not be able to learn complex patterns from the input data.
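A minimal Keras sketch showing how the L1 and L2 penalties described above can be attached to layer weights; the penalty strengths, layer sizes and input shape are illustrative.
Python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # L2 (ridge) penalty: lambda * sum(w^2) is added to the loss
    layers.Dense(64, activation='relu', input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),
    # L1 (lasso) penalty: lambda * sum(|w|) pushes some weights towards zero
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')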
Dropout regularization
Dropout regularization is a technique in which some of the neurons are randomly disabled during training so that the model extracts more useful, robust features from the data. This prevents overfitting. You can see dropout regularization in the following diagram:
● In figure (a), the neural network is fully connected. If all the neurons are trained with the
entire training dataset, some neurons might memorize the patterns occurring in training
data. This leads to overfitting since the model is not generalizing well.
● In figure (b), the neural network is sparsely connected, i.e., only some neurons are active
during the model training. This forces the neurons to extract robust features/patterns from
training data to prevent overfitting.
● Dropout randomly disables some percent of neurons in each layer. So for every epoch,
different neurons will be dropped leading to effective learning.
● Dropout is applied by specifying the ‘p’ values, which is the fraction of neurons to be
dropped.
● Dropout reduces the dependencies of neurons on other neurons, resulting in more robust
model behavior.
● Dropout is applied only during the model training phase and is not applied during the
inference phase.
● When the model receives complete data during the inference time, you need to scale the
layer outputs ‘x’ by ‘p’ such that only some parts of data will be sent to the next layer. This
is because the layers have seen less amount of data as specified by dropout.
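A short Keras sketch of dropout between dense layers; the dropout fractions and layer sizes are illustrative, and the layers are only dropped during training.
Python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.5),          # randomly disables 50% of the activations each step
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])
# Dropout is active only in training; calling model(x, training=False) disables it at inference.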
These are some of the most popular regularization techniques that are used to reduce overfitting
during model training. They can be applied according to the use case or dataset being considered
for more accurate model performance on the testing data.
Adversarial Training
Adversarial Training is a technique that has been developed to protect Machine Learning
models from Adversarial Examples. Let’s briefly recall what Adversarial Examples are. These are
inputs that are very slightly and cleverly perturbed (such as an image, text, or sound) in a way that
is imperceptible to humans but will be misclassified by a machine learning model.
What is astonishing about these attacks is the model's confidence in its incorrect prediction. The classic example illustrates this well: a model with a confidence of only 57.7% for the correct prediction on the clean image can exhibit a very high confidence of 99.3% for the incorrect prediction on the perturbed image.
These attacks are very problematic. For example, an article published in Science in 2019 by
researchers from Harvard and MIT demonstrates how medical AI systems could be vulnerable to
adversarial attacks. That’s why it’s necessary to defend against them. This is where Adversarial
Training comes in. It, along with ‘Defensive Distillation,’ is the primary technique to protect
against these attacks.
How does this technique work? It involves retraining the Machine Learning model with numerous
Adversarial Examples. Indeed, during the training phase of a predictive model, if the input is
misclassified by the Machine Learning model, the algorithm learns from its mistakes and adjusts
its parameters to avoid making them again.
Thus, after initially training the model, the model’s creators generate numerous Adversarial
Examples. They expose their own model to these contradictory examples to prevent it from
making these mistakes again.
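A hedged TensorFlow sketch of one common way to generate adversarial examples, the fast gradient sign method (FGSM), and mix them into training; `model`, `x_batch`, `y_batch` and `epsilon` are assumed to be defined elsewhere and are not prescribed by the text.
Python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_examples(model, x_batch, y_batch, epsilon=0.01):
    """Perturb inputs in the direction that most increases the loss."""
    x = tf.convert_to_tensor(x_batch)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_batch, model(x, training=False))
    grad = tape.gradient(loss, x)
    x_adv = x + epsilon * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)   # keep pixels in a valid range

# Adversarial training step: train on clean and perturbed inputs together
# x_adv = fgsm_examples(model, x_batch, y_batch)
# model.train_on_batch(tf.concat([x_batch, x_adv], 0), tf.concat([y_batch, y_batch], 0))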
While this method defends Machine Learning models against some Adversarial Examples, does it
generalize the model’s robustness to all Adversarial Examples? The answer is no. This approach
is generally insufficient to stop all attacks because the range of possible attacks is too wide and
cannot be generated in advance. Thus, it often becomes a race between hackers generating new
adversarial examples and designers protecting against them as quickly as possible.
In a more general sense, it is very difficult to protect models against adversarial examples because
it is nearly impossible to construct a theoretical model of how these examples are created. It would
involve solving particularly complex optimization problems, and we do not have the necessary
theoretical tools.
All strategies tested so far fail because they are not adaptive: they may block one type of attack
but leave another vulnerability open to an attacker who knows the defense used. Designing a
defense capable of protecting against a powerful and adaptive attacker is an important research
area.
Optimizers and loss functions are two components that help improve the performance of the model.
By calculating the difference between the expected and actual outputs of a model, a loss function
evaluates the effectiveness of a model.
A convolutional neural network (CNN) is a type of artificial neural network that uses
convolutional layers to process and analyze data, such as images, text, and audio:
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that
enables a computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural
Networks are used in various datasets like images, audio, and text. Different types of Neural
Networks are used for different purposes, for example for predicting the sequence of words we
use Recurrent Neural Networks more precisely an LSTM, similarly for image classification we
use Convolution Neural networks. In this blog, we are going to build a basic building block for
CNN.
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
How Do Convolutional Layers Work?
Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a depth (i.e. the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we will get another image with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
(Figure source: cs231n.stanford.edu)
● Output Layer: The output from the fully connected layers is fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the output for each class into a probability score.
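A minimal Keras sketch of the Convolution, Pooling and Fully-connected pattern described above; the layer sizes and the input shape are illustrative.
Python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # feature extraction
    layers.MaxPooling2D((2, 2)),                                            # downsampling
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),                                    # fully connected
    layers.Dense(10, activation='softmax'),                                 # class probabilities
])
model.summary()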
Advantages and Disadvantages of Convolutional Neural Networks (CNNs)
Advantages of CNNs:
1. Good at detecting patterns and features in images, videos, and audio signals.
2. Robust to translation, rotation, and scaling invariance.
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of CNNs:
1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper regularization is used.
3. Requires large amounts of labeled data.
4. Interpretability is limited; it's hard to understand what the network has learned.
AlexNet:
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark
model that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It
introduced several innovative ideas that shaped the future of CNNs.
AlexNet Architecture:
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It uses
traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.
● The architecture employs overlapping pooling layers to reduce spatial dimensions while
retaining the spatial relationships among neighbouring features.
● Activation function: AlexNet uses the ReLU activation function and dropout regularization,
which enhance the model’s ability to capture non-linear relationships within the data.
The key features of AlexNet are as follows:-
● AlexNet was created to be more computationally efficient than earlier CNN topologies. It
introduced parallel computing by utilising two GPUs during training.
● AlexNet is a relatively shallow network compared to GoogleNet. It has eight layers, which
makes it simpler to train and less prone to overfitting on smaller datasets.
● In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It outperformed prior CNN architectures greatly and set the
path for the rebirth of deep learning in computer vision.
● Several architectural improvements were introduced by AlexNet, including the use of rectified
linear units (ReLU) as activation functions, overlapping pooling, and dropout regularisation.
These strategies aided in the improvement of performance and generalisation
Let’s consider an image classification task of various dog breeds. AlexNet’s convolutional layers
learn features such as edges, textures, and shapes to distinguish between different dog breeds. The
fully connected layers then analyze these learned features and make predictions.
ZFNet
ZFNet was introduced by Matthew D. Zeiler and Rob Fergus and is named after their surnames, Zeiler and Fergus. ZFNet was a slight improvement over AlexNet and won the 2013 ILSVRC. The work visualized how each layer of AlexNet performs and which parameters can be tuned to achieve greater accuracy.
Image from the original paper: https://arxiv.org/pdf/1311.2901.pdf
· Convolutional layers:
In these layers, convolutional filters are applied to extract important features; ZFNet consists of multiple convolutional layers for this purpose.
· MaxPooling layers:
MaxPooling layers are used to downsample the spatial dimensions of the feature maps. They use the maximum as the aggregation function.
· ReLU Activation:
ReLU is used after each convolution layer to introduce non-linearity into the model, which is crucial for learning complex patterns. It rectifies the feature map, ensuring the feature maps are always positive.
· SoftMax Activation:
SoftMax activation is used in the last layer to obtain the probabilities of the image belonging to
the 1000 classes.
Architecture:
First Layer
· In the first layer, 96 filters of size 7x7 with a stride of 2 are used for convolution, followed by ReLU activation.
· The output feature map is then passed through a MaxPooling layer with a pooling kernel of 3x3 and a stride of 2. Then the features are contrast normalized.
Second Layer
· In the second layer, 256 filters of size 3x3 are applied with a stride of 2. The obtained feature map is again passed through a MaxPooling layer with a pooling kernel of 3x3 and a stride of 2. After that the features are contrast normalized.
Third and Fourth Layers
· The third and fourth layers are identical, with 384 kernels of size 3x3, padding kept as 'same' and stride set to 1.
Fifth Layer
· In the fifth layer, 256 filters of size 3x3 are applied with a stride of 1. Then a MaxPooling kernel of size 3x3 is applied with a stride of 2, and the features are contrast normalized.
Sixth and Seventh Layers
· The sixth and seventh layers are fully connected dense layers with 4096 neurons each.
Eighth Layer
· The eighth layer is the output layer, which uses a SoftMax activation to produce the probabilities over the 1000 classes.
VGG-16 architecture
This model achieves 92.7% top-5 test accuracy on the ImageNet dataset which contains 14 million
images belonging to 1000 classes.
VGG-16 Model Objective:
The ImageNet dataset contains images of fixed size of 224*224 and have RGB channels. So, we
have a tensor of (224, 224, 3) as our input. This model process the input image and outputs the a
vector of 1000 values:
y^=[y0^y1^y2^...y^999] y^=y0^y1^y2^...y^999
This vector represents the classification probability for the corresponding class. Suppose we have a model that predicts that the image belongs to class 0 with probability 0.1, class 1 with probability 0.05, class 2 with probability 0.05, class 3 with probability 0.03, class 780 with probability 0.72, class 999 with probability 0.05, and all other classes with 0.
So, the classification vector for this example will be:
y^ = [y^0 = 0.1, 0.05, 0.05, 0.03, …, y^780 = 0.72, …, y^999 = 0.05]
To make sure these probabilities add to 1, we use softmax function.
This softmax function is defined as follows:
y^i = e^(zi) / Σj=1..n e^(zj)
After this we take the 5 most probable candidates into the vector.
C = [780, 0, 1, 2, 999]
and our ground truth vector is defined as follows:
G = [G1, G2, G3] = [780, 2, 999]
Then we define our Error function as follows:
E = (1/n) Σk mini d(ci, Gk)
It calculates the minimum distance between each ground truth class and the predicted candidates,
where the distance function d is defined as:
● d = 0 if ci = Gk
● d = 1 otherwise
So, the loss function for this example is :
E = (1/3) (mini d(ci, G1) + mini d(ci, G2) + mini d(ci, G3)) = (1/3)(0 + 0 + 0) = 0
Since, all the categories in ground truth are in the Predicted top-5 matrix, so the loss becomes 0.
VGG Architecture:
The VGG-16 architecture is a deep convolutional neural network (CNN) designed for image classification tasks. It was introduced by the Visual Geometry Group at the University of Oxford. VGG-16 is characterized by its simplicity and uniform architecture, making it easy to understand and implement.
The VGG-16 configuration typically consists of 16 layers, including 13 convolutional layers and
3 fully connected layers. These layers are organized into blocks, with each block containing
multiple convolutional layers followed by a max-pooling layer for downsampling.
VGG-16 architecture Map
Deconvolution
Usually, images acquired by a vision system suffer from degradation that can be modelled as a
convolution. For example, some images present a camera shake effect (Fig. 100) or a blur due to
poor focus (Fig. 101). The goal of deconvolution is to cancel the effect of a convolution.
y=h∗x+b
The deconvolution computes a deconvolved image x^ from the observation y. We will consider
only linear methods, thus deconvolution comes to filtering by g:
x^=g∗y
Deconvolution model.
Deconvolution needs a degradation model, thus having knowledge about both h and b.
● The PSF h can be estimated by observation, i.e. by finding in the image some factors to
estimate h. For example, a single point object in the image is h. The PSF can also be
estimated by experimentation by reproducing the observation conditions in a laboratory.
So, the image of a pulse gives an estimate of h. Finally, it is also possible to estimate the
PSF from a mathematical model of the physics of the observation. Note also that some
deconvolution methods estimate the PSF h at the same time as x: these are called blind
deconvolution methods (French: déconvolution myope).
● Models for the noise have already been presented in chapter denoising.
Inverse filter
The inverse filter is the simplest deconvolution method. Since the degradation is
modelled y=h∗x+b, then this equation becomes in the Fourier domain:
Y=HX+B
so we can write:
X = (Y − B) / H
x = F−1[(Y − B) / H].
As the noise (and therefore its spectrum B) is unknown, we can approximate the expression of x by
cancelling B in the previous expression, and thus get the deconvolved image:
x^ = F−1[Y / H]
The result of the inverse filter applied on an image is given Fig. 103. The result is not usable, and
yet the observed image is very little blurred with very little noise!
Thus, the deconvolved image x^ corresponds to x with an additional term which is the inverse
Fourier transform of B/H. The PSF H is generally a low-pass filter, so the values of H(m,n) tend
towards 0 for high frequencies (m,n). Because H is in the denominator, this tends to drastically
amplify the high frequencies of the noise, and then the term B/H quickly dominates X. This
explains the result of Fig. 103.
One solution consists in considering only the low frequencies of Y/H. This is equivalent to
truncating the result given by the inverse filter by cancelling the high frequencies before
calculating the inverse Fourier transform. The result of the deconvolution is much more acceptable,
as shown by Fig. 104, although the result is still far from perfect (there are many variations in
intensity around objects, such as tree trunks)…
Fig. 104 Result of the truncated inverse filter with very small noise.
Wiener Filter
Wiener filter, denoted by g (with Fourier transform G), applies to the observation y such that:
x^=g∗y⇔X^=GY.
This filter is established in the statistical framework: the image x and the noise b are considered to
be random variables. They are also assumed to be statistically independent. As a result, the
observation y and the estimate x^ are also random variables.
The calculations are done in the Fourier domain for simplicity (since convolutions become
multiplications). The goal of Wiener filter is to find the image X^=F[x^] closest to X=F[x], in the
sense of the mean squared error MSE=E[(X^−X)2]. Thereby :
MSE = E[(X^ − X)²] = E[(GY − X)²] = E[(G(HX + B) − X)²] = E[((GH − I)X + GB)²]
MSE = E[(GH − I)∗(GH − I) X∗X + (GH − I)∗G X∗B + G∗(GH − I) B∗X + G∗G B∗B]
where ⋅∗ denotes the conjugate of the variables. Since the expectation E is linear and only X and B are random variables, we can decompose the previous expression into four terms:
MSE = (GH − I)∗(GH − I) E[X∗X] + (GH − I)∗G E[X∗B] + G∗(GH − I) E[B∗X] + G∗G E[B∗B].
Since X and B are independent, then the covariances E[X∗B] and E[B∗X] are zeros.
Moreover, E[X∗X] and E[B∗B] are the power spectral densities denoted as Sx and Sb (the power
spectral density is the expectation of the square of the modulus of the Fourier transform). So the
mean squared error simplifies into:
MSE = (GH − 1)∗(GH − 1) Sx + G∗G Sb
We look for the filter G that minimizes the MSE, or equivalently, that cancels the derivative of
MSE:
∂MSE/∂G = (GH − 1)∗ H Sx + G∗ Sb = 0
⇔ G∗ H∗ H Sx − H Sx + G∗ Sb = 0
⇔ G∗ (H∗ H Sx + Sb) = H Sx
⇔ G∗ = H Sx / (H∗ H Sx + Sb)
⇔ G = H∗ Sx / (H∗ H Sx + Sb)
⇔ G = H∗ Sx / (|H|² Sx + Sb)
Here we are, we get the expression of the Wiener filter G! 🥳 Finally, the deconvolved image is
the inverse Fourier transform of GY:
x^ = F−1[ H∗ Sx / (|H|² Sx + Sb) · Y ]
We can consider that the power spectral densities Sx and Sb are constant (for Sb, it is necessary to
assume white noise). Thus, the expression of the Wiener filter can be written
x^ = F−1[ H∗ / (|H|² + Sb/Sx) · Y ]
and the term Sb/Sx is replaced by a constant K, which becomes the parameter of the method, to be
set by the user.
Two remarks:
● where H vanishes (typically in high frequencies), the problem of noise increase is no longer
observed as with the inverse filter, since the inverse filter tends towards 0,
● moreover, if the noise in the image is zero, then Sb=0 and Wiener filter comes back to the
inverse filter:
x^ = F−1[ H∗ / |H|² · Y ] = F−1[Y / H]
The result of Wiener filter is presented Fig. 105: it is clearly much better than the inverse filter,
even its truncated version!
DeepDream
DeepDream is a deep learning technique that uses neural networks to create images that activate
specific layers in a network. This technique is also known as Inceptionism.
DeepDream algorithm initiates the process by forwarding a particular picture or image through the
network and then it starts measuring the gradient of the image with respect to a specific activation
layer. In the next step, the picture is adjusted in order to improve these activations and amplify the
patterns which result in a dream-like picture. This entire process is also known as Inceptionism.
The entire process of enhancing the pattern of images is very much dependent on how the
algorithm has been trained. Therefore, if an algorithm has been instructed to recognize the faces
in any image then that particular algorithm will also try to deduce the faces from any given image
using the algorithmic pareidolia.
Now that we have properly understood what DeepDream is, it is time for us to understand the
functions of this algorithm in more detail. Before that let us have a look at how the convolutional
neural networks work:
● First, we provide an image to the convolutional neural networks and the first layer of the
network distinguishes the low-level features such as edges.
● In the next step, the second layer of the network will try to expose the higher-level features
of the picture such as trees, cars, faces, etc.
● Lastly, the remaining layers will try to collect all of these features and complete the
interpretations so that the pictures can be categorized accordingly.
In convolutional neural networks, there are different layers available to perform different tasks.
On the other hand, in the DeepDream algorithm, we can take any particular feature (be it high level
or low level) and increase its activation so that it can have a huge impact on the image.
● Whenever you try to give a picture (as an input) to a trained artificial neural network, the
neurons kickstart and initiate activation.
● The DeepDream algorithm tries to modify the input image and in the process, it boosts
some of the neurons more than others. We can specify the type of layer and neuron we
want to strengthen precisely.
● The process will continue until all the elements of the input image have been disclosed
appropriately.
For example, if we have used a specific layer to discover the cat faces while we have provided the
image of a cloud (as input) then, the DeepDream algorithm will meticulously convert the image
and will begin to produce cat faces on the blue sky.
In practice, the DeepDream effect is obtained by repeatedly performing gradient ascent on the activations of a chosen layer of a pretrained network, as sketched below.
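A hedged TensorFlow sketch of this loop; the choice of pretrained network (InceptionV3), the layer name, the step size and the iteration count are illustrative, not prescribed by the text.
Python
import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
layer_out = base.get_layer('mixed3').output          # illustrative layer choice
dream_model = tf.keras.Model(inputs=base.input, outputs=layer_out)

def deepdream(image, steps=50, step_size=0.01):
    img = tf.Variable(image)                          # image batch in [-1, 1], shape (1, H, W, 3)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activations = dream_model(img)
            loss = tf.reduce_mean(activations)        # amplify this layer's activations
        grads = tape.gradient(loss, img)
        grads /= tf.math.reduce_std(grads) + 1e-8     # normalize the gradient
        img.assign_add(step_size * grads)             # gradient ascent on the input image
        img.assign(tf.clip_by_value(img, -1.0, 1.0))
    return img.numpy()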
Hallucinations
Hallucinations in deep learning, also known as AI hallucinations, occur when an AI model
generates incorrect or misleading results. This can happen when the model is trained with
insufficient data, or when it makes incorrect assumptions or learns incorrect patterns.
1. Medical Misdiagnosis
● Missed or Wrong Diagnosis: AI-powered medical tools used for analysis (e.g., X-rays, blood
tests) could misinterpret results due to limitations in training data or unexpected variations.
This could lead to missed diagnoses of critical illnesses or unnecessary procedures based on
false positives.
● Ineffective Treatment Plans: AI-driven treatment recommendations might be based on faulty
data or fail to consider a patient’s unique medical history, potentially leading to ineffective or
even harmful treatment plans.
2. Faulty Financial Predictions
● Market Crashes: AI algorithms used for stock market analysis and trading could be swayed
by hallucinations, leading to inaccurate predictions and potentially triggering market crashes.
● Loan Denials and High-Interest Rates: AI-powered credit scoring systems could rely on
biased data, leading to unfair denials of loans or higher interest rates for qualified individuals.
3. Algorithmic Bias and Discrimination
● Unequal Opportunities: AI-driven hiring tools that rely on biased historical data could
overlook qualified candidates from underrepresented groups, perpetuating discrimination in
the workplace.
● Unfair Law Enforcement: Facial recognition software with AI hallucinations might
misidentify individuals, leading to wrongful arrests or profiling based on race or ethnicity.
How to Prevent Artificial Intelligence (AI) Hallucinations?
1. When feeding the input to the model, restrict the possible outcomes by specifying the type of response you desire. For example, instead of asking a trained LLM for 'facts about the existence of the Mahabharata', the user can ask 'Was the Mahabharata real, yes or no?'.
2. Specify what kind of information you are looking for.
3. In addition to specifying what information you require, also list what information you don't want.
4. Last but not least, verify the output given by an AI model.
So there is an immediate need to develop algorithms or methods to detect and remove
Hallucination from AI models or at least decrease its impact.
CAM, Grad-CAM
Class Activation Mapping (CAMs)
For a particular class (or category), class activation mapping indicates the discriminative
region of the image that influenced the deep learning model to make its decision. The
architecture is very similar to a convolutional neural network: it comprises several convolution
layers, with the layer just before the final output performing Global Average Pooling. The
resulting features are fed into a fully connected layer with a softmax activation function, which
outputs the required class probabilities. The importance of the weights with respect to a category
can be found by projecting the weights back onto the last convolution layer’s feature map.
The Global Average Pooling (GAP) is preferred over Global Max Pooling (GMP) because GAP
helps us in identifying the full extent of the object as compared to the GMP layer, which identifies
one discriminative part. In Global Average Pooling, an average is taken across each activation
map, which helps us find all the possible discriminative regions present in it. Contrary to this,
the Global Max Pooling method considers only the most discriminative region. Hence, Global
Average Pooling shows better localization results than Global Max Pooling.
Mathematical equations governing CAMs
Let f_k(x, y) be the activation of unit k in the last convolutional layer at spatial location (x, y),
and let w_k^c be the softmax weight of unit k for class c. The class activation map for class c is
then:
M_c(x, y) = Σ_k w_k^c · f_k(x, y)
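Given these quantities, computing the map is just the weighted sum above. A minimal NumPy sketch, with random arrays standing in for real network outputs:

```python
# Minimal CAM computation: f has shape (K, H, W) = activation maps of the last conv
# layer, w_c has shape (K,) = softmax weights of class c. Random arrays stand in for
# real network outputs.
import numpy as np

def class_activation_map(f: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """M_c(x, y) = sum_k w_c[k] * f[k, x, y]."""
    return np.tensordot(w_c, f, axes=([0], [0]))          # shape (H, W)

f = np.random.rand(512, 14, 14)
w_c = np.random.rand(512)
cam = class_activation_map(f, w_c)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```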
The localization ability of the CAM method was put to the test when they were trained on the
ILSVRC 2014 benchmark dataset. The CAM technique was used on popular CNN models like
AlexNet, VGGNet and GoogLeNet by tweaking their models and fitting a GAP layer (similar to
the CAM architecture) towards the end. These modified models, with the GAP layer, gave
strong results in terms of discriminative localization compared to their original architectures.
After applying the CAM architecture to fine-grained recognition and pattern discovery (such as
discovering informative objects in scenes, concept localization in weakly labelled images,
weakly supervised text detection and interpreting visual question answering), we can infer that
feature capturing and localization are far more accurate in the CAM-based GAP architecture, as
the complete extent of the features is captured.
Visualizing Class-specific Units: When we use the GAP layer and the ranked softmax weights,
we can directly visualize the units that are most discriminative for a particular class. Thus, the
CNN effectively learns a bag of words, where each word is a discriminative class-specific unit.
A combination of these class-specific units guides the CNN in classifying each image.
Grad-CAM interprets CNNs, revealing insights into their predictions, aiding debugging, and
helping to improve models. It is class-discriminative and localizes the relevant regions, but on
its own it does not highlight fine-grained pixel-space detail.
Learning Objectives
Gain insights into the implementation steps of Grad-CAM, enabling the generation of class
activation maps to highlight important regions in images for model predictions.
Explore real-world applications and use cases where Grad-CAM enhances understanding and trust
in CNN predictions.
Why Grad-CAM is Required in Deep Learning?
Grad-CAM is required because it addresses the critical need for interpretability in deep learning
models, providing a way to visualize and comprehend how these models arrive at their predictions
without sacrificing the accuracy they offer in various computer vision tasks.
● Complexity of CNNs: While CNNs achieve high accuracy in various tasks, their inner
workings are often complex and hard to interpret.
● Grad-CAM’s Role: Grad-CAM serves as a solution by offering visual explanations, aiding
in understanding how CNNs arrive at their predictions.
Grad-CAM generates heatmaps known as Class Activation Maps. These maps highlight crucial
regions in an image responsible for specific predictions made by CNN.
Gradient Analysis
It does so by analyzing gradients flowing into the final convolutional layer of the CNN, focusing
on how these gradients impact class predictions.
Grad-CAM stands out among visualization techniques due to its class-discriminative nature.
Unlike other methods, it provides visualizations specific to particular predicted classes, enhancing
interpretability.
Grad-CAM computes gradients of predicted class scores concerning the activations in the last
convolutional layer. These gradients signify the importance of each activation map for predicting
specific classes.
It precisely identifies and highlights regions in input images that significantly contribute to
predictions for specific classes, enabling a deeper understanding of model decisions.
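A minimal sketch of this gradient computation using forward/backward hooks in PyTorch; the ResNet-18 backbone, the choice of layer4 as the target layer, and the random input tensor are illustrative assumptions rather than a fixed recipe.

```python
# Minimal Grad-CAM sketch (assumptions: torchvision ResNet-18, its layer4 block as
# the target layer, and a random tensor standing in for a preprocessed image).
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
activations, gradients = {}, {}

def fwd_hook(_, __, output):                  # store target-layer activations
    activations["value"] = output

def bwd_hook(_, __, grad_output):             # store gradients flowing into the layer
    gradients["value"] = grad_output[0]

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.rand(1, 3, 224, 224)                # stand-in for a real image
scores = model(x)
class_idx = scores.argmax(dim=1).item()
scores[0, class_idx].backward()               # gradient of the predicted class score

A = activations["value"]                      # (1, C, h, w) activation maps
G = gradients["value"]                        # (1, C, h, w) gradients
weights = G.mean(dim=(2, 3), keepdim=True)    # global-average-pool the gradients
cam = F.relu((weights * A).sum(dim=1, keepdim=True))       # weighted sum + ReLU
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # heatmap in [0, 1]
```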
Versatility
Grad-CAM allows for understanding the decision-making processes of complex models without
sacrificing their accuracy, striking a balance between model interpretability and high performance.
● The CNN processes the input image through its layers, culminating in the last
convolutional layer.
● Grad CAM visualization utilizes the activations from this last convolutional layer to
generate the Class Activation Map (CAM).
● Techniques like Guided Backpropagation are applied to refine the visualization, resulting
in class-discriminative localization and high-resolution detailed visualizations, aiding in
interpreting CNN decisions.
UNIT IV CNN and RNN FOR IMAGE AND VIDEO PROCESSING
Siamese Networks
A siamese neural network (SNN) is a class of neural network architectures that contain two or
more identical sub-networks. “Identical” here means they have the same configuration with the
same parameters and weights. Parameter updating is mirrored across both sub-networks, and
the network is used to find similarities between inputs by comparing their feature vectors. These
networks are used in many applications.
Traditionally, a neural network learns to predict multiple classes. This poses a problem when we
need to add classes to, or remove classes from, the data: in that case, we have to update the
neural network and retrain it on the whole data set. Deep neural networks also need a large
volume of data on which to train. SNNs, on the other hand, learn a similarity function. Thus, we
can train an SNN to see whether two images are the same. This process enables us to classify
new classes of data without retraining the network.
Given that an SNN’s learning mechanism is somewhat different from classification models, simply
averaging it with a classifier can do much better than averaging two correlated supervised models
(e.g. GBM & RF classifiers).
SNN focuses on learning embeddings (in the deeper layer) that place the same classes/concepts
close together. Hence, we can learn semantic similarity.
Since training involves pairwise learning, SNNs won’t output the probabilities of the prediction,
only distance from each class.
Since training SNNs involves pairwise learning, cross-entropy loss cannot be used.
There are two loss functions we typically use to train siamese networks.
Triplet Loss
Triplet loss is a loss function wherein we compare a baseline (anchor) input to a positive
(truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is
minimized, while the distance from the anchor to the negative input is maximized:
L(a, p, n) = max(‖F_a − F_p‖² − ‖F_a − F_n‖² + α, 0)
In the above equation, alpha (α) is a margin term used to stretch the distance between similar and
dissimilar pairs in the triplet. F_a, F_p, F_n are the feature embeddings for the anchor, positive
and negative images.
During the training process, we feed an image triplet (anchor image, positive image, negative
image) into the model as a single sample. The distance between the anchor and positive images
should be smaller than that between the anchor and negative images.
Contrastive Loss
Contrastive loss is an increasingly popular loss function. It’s a distance-based loss as opposed to
more conventional error-prediction loss. This loss function is used to learn embeddings in which
two similar points have a low Euclidean distance and two dissimilar points have a large Euclidean
distance.
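As a hedged sketch (the margins and the contrastive label convention, y = 1 for similar pairs, are illustrative choices; PyTorch also provides a built-in nn.TripletMarginLoss), both losses can be written directly on batches of embeddings:

```python
# Minimal triplet and contrastive losses on batches of embeddings. Margins are
# illustrative; for the contrastive loss we assume y = 1 marks a similar pair.
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin: float = 0.2):
    d_ap = (f_a - f_p).pow(2).sum(dim=1)              # anchor-positive distance
    d_an = (f_a - f_n).pow(2).sum(dim=1)              # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

def contrastive_loss(f_1, f_2, y, margin: float = 1.0):
    d = F.pairwise_distance(f_1, f_2)                 # Euclidean distance per pair
    similar = y * d.pow(2)                            # pull similar pairs together
    dissimilar = (1 - y) * F.relu(margin - d).pow(2)  # push dissimilar pairs apart
    return 0.5 * (similar + dissimilar).mean()

# Toy usage with random embeddings.
f_a, f_p, f_n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_loss(f_a, f_p, f_n))
print(contrastive_loss(f_a, f_p, y=torch.ones(8)))
```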
Object detection in deep learning is a machine learning technique that uses deep learning models
to accurately and quickly locate objects in images. Deep learning models can learn from large
amounts of labeled data to extract complex patterns, which allows for more precise object
localization and classification.
In R-CNN, each region proposal is passed through a CNN whose fully connected layers expect
inputs of fixed dimensions. Consequently, whether the region proposals are small or large, they
need to be warped (resized) accordingly to fit the specified input size.
From the above architecture, we remove the final softmax layer to obtain a (1, 4096) feature vector.
This feature vector is then fed into both the Support Vector Machine (SVM) for classification and
the bounding box regressor for improved localization.
Fast R-CNN
RoI pooling is a novel thing that was introduced in the Fast R-CNN paper. Its purpose is to produce
uniform, fixed-size feature maps from non-uniform inputs (RoIs). It takes two values as inputs:
● A feature map obtained from the previous CNN layer (14 × 14 × 512 in VGG-16).
● An N × 4 matrix representing regions of interest, where N is the number of RoIs; the first two
values give the coordinates of the upper-left corner of the RoI and the other two give its height
and width, denoted (r, c, h, w).
Let’s consider an 8 × 8 feature map from which we need to extract an output of size 2 × 2. We
will follow the steps below.
Suppose we are given the RoI’s upper-left corner coordinates as (0, 3) and its height and width
as (5, 7).
Now we need to convert this region proposal into a 2 × 2 output block, and the dimensions of the
pooled region are not perfectly divisible by the output dimensions. We therefore split the region
into roughly equal sections so that the pooling grid is fixed at 2 × 2 dimensions.
Now we apply the max pooling operator to select the maximum value from each of the regions
that we divided into.
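A minimal NumPy sketch of this 2 × 2 RoI max pooling (the way the region is split into roughly equal sections is one reasonable choice; slices that overrun the 8 × 8 map are simply clamped at its boundary):

```python
# Minimal RoI max-pooling sketch: pool one RoI of an 8x8 feature map into a 2x2
# output. The RoI is (r, c, h, w) = upper-left row/column, height, width; slices
# that overrun the map are simply clamped at its boundary.
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    r, c, h, w = roi
    region = feature_map[r:r + h, c:c + w]
    out_h, out_w = out_size
    # Split rows/columns into roughly equal sections (they need not divide evenly).
    row_edges = np.linspace(0, region.shape[0], out_h + 1, dtype=int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1, dtype=int)
    out = np.empty(out_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[row_edges[i]:row_edges[i + 1],
                               col_edges[j]:col_edges[j + 1]].max()
    return out

fm = np.arange(64).reshape(8, 8)            # toy 8x8 feature map
print(roi_max_pool(fm, roi=(0, 3, 5, 7)))   # RoI with corner (0, 3), height 5, width 7
```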
Multi-task Loss
Fast R-CNN is trained with a multi-task loss of the form
L = L_cls + λ · u · L_loc
where L_cls is the classification loss, L_loc is the localization loss, lambda (λ) is a balancing
parameter, and u is an indicator (the value of u = 0 for background, otherwise u = 1) to make sure
that the localization loss is only calculated when we need to predict the bounding box. Here,
L_cls is the log loss over the predicted class, and L_loc is a smooth L1 loss summed over the
four bounding-box offsets (x, y, w, h) between the predicted and ground-truth boxes.
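Under those definitions, a minimal sketch of the combined loss for a single RoI (λ, the smooth-L1 threshold and the 21-class toy setup are illustrative defaults):

```python
# Minimal multi-task loss for a single RoI: log loss (cross-entropy) for the class
# plus smooth L1 over the 4 box offsets, gated by u (0 for background, 1 otherwise).
import torch
import torch.nn.functional as F

def smooth_l1(x, beta: float = 1.0):
    absx = x.abs()
    return torch.where(absx < beta, 0.5 * absx.pow(2) / beta, absx - 0.5 * beta)

def multi_task_loss(class_logits, true_class, pred_box, true_box, u, lam: float = 1.0):
    l_cls = F.cross_entropy(class_logits.unsqueeze(0), torch.tensor([true_class]))
    l_loc = smooth_l1(pred_box - true_box).sum()       # over (x, y, w, h) offsets
    return l_cls + lam * u * l_loc

# Toy usage: 21 classes (20 object classes + background), one foreground RoI (u = 1).
logits = torch.randn(21)
print(multi_task_loss(logits, true_class=5,
                      pred_box=torch.randn(4), true_box=torch.randn(4), u=1))
```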
FCN
● Image Classification: Classify the object (Recognize the object class) within an image.
● Object Detection: Classify and detect the object(s) within an image with bounding box(es)
bounded the object(s). That means we also need to know the class, position and size of each
object.
● Semantic Segmentation: Classify the object class for each pixel within an image. That
means there is a label for each pixel.
In classification, conventionally, an input image is downsized and goes through the convolution
layers and fully connected (FC) layers, and the network outputs one predicted label for the input
image, as follows:
Classification
And if the image is not downsized, the output will not be a single label. Instead, the output has a
size smaller than the input image (due to the max pooling):
All layers are convolutional layers
If we upsample the output above, then we can calculate the pixelwise output (label map) as below:
Convolution is a process that makes the output smaller. The name deconvolution comes from
the case where we want to upsample, i.e., to make the output larger. (The name deconvolution is
sometimes misinterpreted as the reverse process of convolution, but it is not.) It is also called
up-convolution, transposed convolution, or fractionally strided convolution.
After going through conv7, the output is small, so 32× upsampling is done to make the output
the same size as the input image. But this also makes the output label map rough. This variant is
called FCN-32s:
FCN-32s
This is because, while deeper layers capture richer semantic features, spatial location
information is lost when going deeper. That means outputs from shallower layers retain more
location information. If we combine both, we can enhance the result.
FCN-16s: The output from pool5 is 2× upsampled, fused with pool4, and then 16× upsampled.
Similar operations are used for FCN-8s, as in the figure above.
The FCN-32s result is very rough due to the loss of location information, while FCN-8s gives
the best result.
This fusing operation is actually just like the boosting / ensemble technique used with AlexNet,
VGGNet, and GoogLeNet, where the results of multiple models are added to make the prediction
more accurate. But in this case, it is done for each pixel, and the results are added from different
layers within a single model.
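A minimal sketch of this layer-wise fusion (channel counts follow a VGG-like backbone, and bilinear upsampling stands in for the learned transposed convolutions used in the original FCN):

```python
# Minimal FCN-8s-style fusion sketch: 1x1 "score" layers on pool3, pool4 and conv7,
# added pixel-wise after upsampling, then upsampled to the input size. Bilinear
# upsampling stands in for the learned transposed convolutions of the original FCN.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 21
score_pool3 = nn.Conv2d(256, n_classes, kernel_size=1)
score_pool4 = nn.Conv2d(512, n_classes, kernel_size=1)
score_conv7 = nn.Conv2d(4096, n_classes, kernel_size=1)

# Toy features with the spatial sizes they would have for a 224x224 input.
pool3 = torch.randn(1, 256, 28, 28)
pool4 = torch.randn(1, 512, 14, 14)
conv7 = torch.randn(1, 4096, 7, 7)

def up(t, size):
    return F.interpolate(t, size=size, mode="bilinear", align_corners=False)

fuse16 = score_pool4(pool4) + up(score_conv7(conv7), (14, 14))   # FCN-16s fusion
fuse8 = score_pool3(pool3) + up(fuse16, (28, 28))                # FCN-8s fusion
out = up(fuse8, (224, 224))                                      # final upsampling
print(out.shape)   # torch.Size([1, 21, 224, 224])
```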
SegNet is a deep learning architecture designed for semantic pixel-wise image segmentation. The
architecture includes an encoder network and a corresponding decoder network, followed by a
final pixel-wise classification layer. This detailed explanation covers each component of SegNet,
comparisons with other architectures, and various decoder variants.
Encoder Network
The encoder network in SegNet is composed of 13 convolutional layers, mirroring the first 13
convolutional layers of the VGG16 network, which was originally designed for object
classification. Key points about the encoder network are:
1. Pre-trained Weights: The use of VGG16's pre-trained weights allows for efficient
initialization and faster convergence during training.
2. Convolutional Layers: These layers perform convolution operations to extract features from
the input image.
3. Batch Normalization: Each convolutional layer is followed by batch normalization to
stabilize and accelerate the training process.
4. ReLU Activation: Rectified Linear Unit (ReLU) activation function is applied element-wise
to introduce non-linearity.
5. Max-Pooling: Max-pooling with a 2×2 window and a stride of 2 is used to downsample the
feature maps, reducing their spatial resolution by half. This step helps in achieving translation
invariance over small spatial shifts.
The max-pooling operation results in a lossy representation of the image, especially in terms of
boundary details, which are crucial for segmentation tasks. To mitigate this loss, the locations of
the maximum feature values in each pooling window (max-pooling indices) are stored. This
information is later used in the decoder network for accurate upsampling.
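This store-and-reuse mechanism can be sketched with PyTorch's MaxPool2d(return_indices=True) and MaxUnpool2d; the single conv stage below is a simplified stand-in for the full 13-layer encoder and decoder:

```python
# Minimal sketch of SegNet-style pooling with stored indices and sparse upsampling
# in the decoder (a single encoder/decoder stage shown for brevity).
import torch
import torch.nn as nn

encoder_conv = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)   # keep max indices
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
decoder_conv = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())

x = torch.randn(1, 3, 64, 64)
feat = encoder_conv(x)
pooled, indices = pool(feat)          # (1, 64, 32, 32) plus the max locations
sparse = unpool(pooled, indices)      # (1, 64, 64, 64); non-max positions are zero
dense = decoder_conv(sparse)          # trainable filters densify the sparse maps
print(pooled.shape, dense.shape)
```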
Decoder Network
The decoder network consists of 13 layers, each corresponding to an encoder layer. The decoding
process is designed to upsample the feature maps back to the original image resolution. Key steps
in the decoder network are:
1. Upsampling Using Max-Pooling Indices: The stored max-pooling indices are used to
upsample the feature maps, creating sparse feature maps. This technique ensures that the spatial
locations of features are preserved.
2. Convolution with Trainable Filters: The sparse feature maps are convolved with trainable
decoder filters to produce dense feature maps. This step helps in refining the feature maps and
improving segmentation accuracy.
3. Batch Normalization: Similar to the encoder, batch normalization is applied to each layer in
the decoder network.
4. Soft-Max Classifier: The final output of the decoder network is passed through a multi-class
soft-max classifier, which assigns class probabilities to each pixel. The predicted segmentation
is obtained by taking the class with the highest probability for each pixel.
Spatio-temporal Models
Spatiotemporal models arise when data are collected across time as well as space and have at
least one spatial and one temporal property. An event in a spatiotemporal dataset describes a
spatial and temporal phenomenon that exists at a certain time t and location x.
Spatio-temporal modeling describes studies which record and analyse both the locations and
associated times of the observations. In spatio-temporal analysis, the focus is on variation in the
average number of incident or prevalent cases in combinations of place and time units over the
geographical region and time-period of interest – that is the spatio-temporal intensity of incident
or prevalent cases.
Real-time spatio-temporal surveillance can inform a rapid response team about where and when
to target prevention and control activities as well as to make longer term plans.
For example, the New York City Department of Health developed a system that uses daily
reports of the location and timing of 35 notifiable diseases to automatically detect epidemics. In
2015, the system identified a cluster of community-acquired legionellosis in a specific location
three days before health professionals noticed an increase in cases; the cluster of observations
expanded and became the largest outbreak in the US.
Longitudinal design
In a longitudinal design, data are collected repeatedly over time from the same set of sampled
locations. This is appropriate when temporal variation in the health outcome dominates spatial
variation. A longitudinal design can be cost-effective when setting up a sampling location is
expensive but subsequent data-collection is cheap. The sampled locations in a longitudinal
design can act as sentinel locations, chosen subjectively either to be representative of the
population at large or, in the case of pollution monitoring for example, to capture extreme cases
in order to monitor compliance with environmental legislation.
In a repeated cross-sectional design, the researcher chooses different sets of locations on each
sampling occasion. This sacrifices direct information on changes in the underlying process over
time in favour of more complete spatial coverage. For example, to predict stunting in children in
Ghana, researchers drew data from four quinquennial national Demographic and Health Surveys
each of which used a similar two-stage cluster sampling strategy.
Repeated cross-sectional designs can also be adaptive, meaning that on any sampling occasion, the
choice of sampling locations is informed by an analysis of the data collected on earlier
occasions. Adaptive repeated cross-sectional designs are particularly suitable for applications in
which temporal variation either is dominated by spatial variation or is strongly related to risk
factors of interest.
Types of data
Geo-statistical data-set
The unit of observation is a location in the region, but the researcher obtains data only from a
sample of the susceptible population. Typically, each location identifies a village community,
but resource limitations dictate the use of only a sample of villages rather than a complete
census. The data-set consists of the number of cases in each sampled village.
Small-area data-set
The researcher partitions the region into a set of sub-regions. The data-set consists of all cases
of the disease (e.g. cholera) in each sub-region. Typically the researcher uses this approach when
the health system maintains a register of all cases in the region.
All these formats can be extended in time. For example when an investigator records both the
location and time of occurrence of a case during real-time surveillance, they obtain a spatio-
temporal point pattern data-set of all cases. When the investigator records cases longitudinally at
sampled locations, they obtain a spatio-temporal geostatistical data-set, and similarly with small
area data-sets.
Geostatistical data-sets are most commonly obtained for disease mapping and surveillance in low-
resource settings where collecting point pattern data is expensive and health registries may not
exist to provide small area data.
Sampling and geostatistical data-sets
Without a properly designed sampling scheme, there is a risk that the investigator will sample
more accessible communities that do not represent the health experiences of the study-population,
that is the study will be biased.
To obtain valid predictions, the sample must be as unbiased as possible, both spatially and
temporally. The sampling schemes below are commonly used to eliminate as much bias as
possible.
Probability sampling
To avoid spatial bias the investigator can either select locations from a gridded map of the
geographic area of interest or use a probability sampling scheme.
Counter-intuitively, simple random sampling is not recommended. The reason is that it leads to
an irregular pattern of sampled locations; for constructing an accurate map, it is preferable to
space sampling locations evenly throughout the region of interest. Chipeta et al. explain how this
can be achieved without losing the guarantee of unbiasedness by choosing sampling locations at
random subject to the constraint that no two sampled locations can be separated by less than a
specified minimum distance.
Stratified random sampling is a set of simple random samples, one in each of a pre-defined set of
sub-regions that form a partition of the region of interest. Chipeta et al.’s method can secure an
even coverage of each sub-region without introducing bias. Stratification generally leads to gains
in efficiency when contextual knowledge can be used to define the strata so that between-strata
variation in the outcome of interest dominates within-stratum variation.
The investigator divides the region of interest into administrative divisions and randomly selects a
number of clusters of households or villages in each division. Cluster sampling designs are
typically less efficient statistically than simple or stratified designs with the same total sample
size. But this is counterbalanced by their practical convenience.
Opportunistic sampling
To reduce the length and cost of the study, researchers often use opportunistic sampling, in which
they collect data at whatever locations are available, for example from presentations at health
clinics. The limitations are obvious: the onus is on the investigators to convince themselves and
their audience that such a design does not bias their results.
Action/Activity Recognition
Action or activity recognition is a computer vision task that involves identifying and classifying
human actions in videos or images. It's a complex task that involves analyzing the spatiotemporal
dynamics of actions and mapping them to a predefined set of action classes.
Key challenges in action recognition include:
Densely packed actions: Videos can have multiple actions happening at once or in quick
succession.
Long-range processing: Actions can extend over long periods of time, requiring long-range
processing to capture the nuances and transitions.
Irrelevant frames: Not every frame contributes to the action recognition process.
Training: Video models are more compute intensive than image models and can be expensive and
time consuming to train.
Generalization: It can be difficult to generalize due to the amount of variation possible in the
video space.
Typical application areas include:
Human-computer interfaces
Health care
Security
Military applications
Generative Adversarial Networks (GANs) are a powerful class of neural networks used for
unsupervised learning. GANs are made up of two neural networks, a discriminator and a
generator. They use adversarial training to produce artificial data that closely resembles actual
data. Starting from random noise samples, the Generator attempts to fool the Discriminator,
which is tasked with accurately distinguishing between produced and genuine data.
Realistic, high-quality samples are produced as a result of this competitive interaction, which
drives both networks toward advancement.
GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive
use in image synthesis, style transfer, and text-to-image synthesis.
Through adversarial training, these models engage in a competitive interplay until the generator
becomes adept at creating realistic samples, fooling the discriminator approximately half the time.
Generative Adversarial Networks (GANs) can be broken down into three parts:
Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
Adversarial: The word adversarial refers to setting one thing up against another. This means that,
in the context of GANs, the generative result is compared with the actual images in the data set. A
mechanism known as a discriminator is used to apply a model that attempts to distinguish between
real and fake images.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.
Types of GANs
Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are
simple multi-layer perceptrons. In a vanilla GAN, the algorithm is straightforward: it tries to
optimize the minimax objective (given below) using stochastic gradient descent.
Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some
conditional parameters are put into place.
In CGAN, an additional parameter ‘y’ is added to the Generator for generating the corresponding
data.
Labels are also put into the input to the Discriminator in order for the Discriminator to help
distinguish the real data from the fake generated data.
Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most successful
implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons.
The ConvNets are implemented without max pooling, which is replaced by strided
convolutions.
Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation
consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual.
This approach uses multiple numbers of Generator and Discriminator networks and different levels
of the Laplacian Pyramid.
This approach is mainly used because it produces very high-quality images. The image is down-
sampled at first at each layer of the pyramid and then it is again up-scaled at each layer in a
backward pass where the image acquires some noise from the Conditional GAN at these layers
until it reaches its original size.
Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a GAN in which
a deep neural network is used along with an adversarial network in order to produce higher-
resolution images. This type of GAN is particularly useful in optimally up-scaling native low-
resolution images to enhance their details minimizing errors while doing so.
Architecture of GANs
A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
Generator Model
A key element responsible for creating fresh, accurate data in a Generative Adversarial Network
(GAN) is the generator model. The generator takes random noise as input and converts it into
complex data samples, such as text or images. It is commonly depicted as a deep neural network.
The training data’s underlying distribution is captured by layers of learnable parameters in its
design through training. The generator adjusts its output to produce samples that closely mimic
real data as it is being trained by using backpropagation to fine-tune its parameters.
The generator’s ability to generate high-quality, varied samples that can fool the discriminator is
what makes it successful.
Generator Loss
The objective of the generator in a GAN is to produce synthetic samples that are realistic enough
to fool the discriminator. The generator achieves this by minimizing its loss function J_G. The
loss is minimized when the log probability is maximized, i.e., when the discriminator is highly
likely to classify the generated samples as real. The generator loss is:
J_G = −(1/m) Σ_{i=1}^{m} log D(G(z_i))
Where,
log D(G(z_i)) represents the log probability of the discriminator being correct for generated
samples.
The generator aims to minimize this loss, encouraging the production of samples that the
discriminator classifies as real (D(G(z_i)) close to 1).
Discriminator Model
Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and
increase its level of proficiency.
Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data. Maximizing the discriminator’s capacity to accurately identify
generated samples as fraudulent and real samples as authentic is the aim of the adversarial training
procedure. The discriminator grows increasingly discriminating as a result of the generator and
discriminator’s interaction, which helps the GAN produce extremely realistic-looking synthetic
data overall.
Discriminator Loss
The discriminator reduces the negative log likelihood of correctly classifying both produced and
real samples. This loss incentivizes the discriminator to accurately categorize generated samples
as fake and real samples as real, with the following equation:
J_D = −(1/m) Σ_{i=1}^{m} log D(x_i) − (1/m) Σ_{i=1}^{m} log(1 − D(G(z_i)))
J_D assesses the discriminator’s ability to discern between produced and actual samples.
The log likelihood that the discriminator will accurately categorize real data is represented by
log D(x_i).
The log likelihood that the discriminator will correctly categorize generated samples as fake is
represented by log(1 − D(G(z_i))).
The discriminator aims to reduce this loss by accurately identifying artificial and real samples.
MinMax Loss
In a Generative Adversarial Network (GAN), the minimax loss is given by:
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
Where,
x denotes actual data samples drawn from the true data distribution p_data(x).
D(x) represents the discriminator’s likelihood of correctly identifying actual data as real.
D(G(z)) is the likelihood that the discriminator will identify generated data coming from the
generator as authentic.
How does a GAN work?
Initialization: Two neural networks are created: a Generator (G) and a Discriminator (D).
G is tasked with creating new data, like images or text, that closely resembles real data.
D acts as a critic, trying to distinguish between real data (from a training dataset) and the data
generated by G.
Generator’s First Move: G takes a random noise vector as input. This noise vector contains random
values and acts as the starting point for G’s creation process. Using its internal layers and learned
patterns, G transforms the noise vector into a new data sample, like a generated image.
Discriminator’s Turn: D receives two kinds of input: real data samples from the training dataset
and the data samples generated by G in the previous step. D’s job is to analyze each input and
determine whether it’s real data or something G cooked up. It outputs a probability score between
0 and 1. A score of 1 indicates the data is likely real, and 0 suggests it’s fake.
If D correctly identifies real data as real (score close to 1) and generated data as fake (score close
to 0), both G and D are rewarded to a small degree. This is because they’re both doing their jobs
well.
However, the key is to continuously improve. If D consistently identifies everything correctly, it
won’t learn much. So, the goal is for G to eventually trick D.
Generator’s Improvement:
When D mistakenly labels G’s creation as real (score close to 1), it’s a sign that G is on the right
track. In this case, G receives a significant positive update, while D receives a penalty for being
fooled.
This feedback helps G improve its generation process to create more realistic data.
Discriminator’s Adaptation:
Conversely, if D correctly identifies G’s fake data (score close to 0), G receives no reward and D
is further strengthened in its discrimination abilities.
This ongoing duel between G and D refines both networks over time.
As training progresses, G gets better at generating realistic data, making it harder for D to tell the
difference. Ideally, G becomes so adept that D can’t reliably distinguish real from fake data. At
this point, G is considered well-trained and can be used to generate new, realistic data samples.
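A minimal vanilla-GAN training loop that follows this back-and-forth on toy 2-D Gaussian data (the network sizes, learning rates, batch size and data distribution are illustrative assumptions, not a tuned recipe):

```python
# Minimal vanilla-GAN training loop on toy 2-D Gaussian data (network sizes,
# learning rates, batch size and the data distribution are illustrative).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(64, 2) * 0.5 + 2.0        # "real" samples from a toy Gaussian
    fake = G(torch.randn(64, 8))                 # generator output from random noise

    # Discriminator update: push real toward 1, fake toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator update: try to make D label the fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```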
GANs, or Generative Adversarial Networks, have many uses in many different fields. Here are
some of the widely recognized uses of GANs:
Image Synthesis and Generation : GANs are often used for picture synthesis and generation
tasks. They may create fresh, lifelike pictures that mimic training data by learning the distribution
that explains the dataset. The development of lifelike avatars, high-resolution photographs, and
fresh artwork has all been facilitated by these types of generative networks.
Image-to-Image Translation : GANs may be used for problems involving image-to-image translation,
where the objective is to convert an input picture from one domain to another while maintaining
its key features. GANs may be used, for instance, to change pictures from day to night, transform
drawings into realistic images, or change the creative style of an image.
Text-to-Image Synthesis : GANs have been used to create visuals from descriptions in text. GANs
may produce pictures that translate to a description given a text input, such as a phrase or a caption.
This application might have an impact on how realistic visual material is produced using text-
based instructions.
Data Augmentation : GANs can augment present data and increase the robustness and
generalizability of machine-learning models by creating synthetic data samples.
Super-Resolution : GANs can enhance the resolution and quality of low-resolution
images. By training on pairs of low-resolution and high-resolution images, GANs can generate
high-resolution images from low-resolution inputs, enabling improved image quality in various
applications such as medical imaging, satellite imaging, and video enhancement.
Advantages of GAN
Synthetic data generation: GANs can generate new, synthetic data that resembles some known data
distribution, which can be useful for data augmentation, anomaly detection, or creative
applications.
High-quality results: GANs can produce high-quality, photorealistic results in image synthesis,
video synthesis, music synthesis, and other tasks.
Unsupervised learning: GANs can be trained without labeled data, making them suitable for
unsupervised learning tasks, where labeled data is scarce or difficult to obtain.
Versatility: GANs can be applied to a wide range of tasks, including image synthesis, text-to-image
synthesis, image-to-image translation, anomaly detection, data augmentation, and others.
Disadvantages of GAN
Training Instability: GANs can be difficult to train, with the risk of instability, mode collapse, or
failure to converge.
Computational Cost: GANs can require a lot of computational resources and can be slow to train,
especially for high-resolution images or large datasets.
Overfitting: GANs can overfit the training data, producing synthetic data that is too similar to the
training data and lacking diversity.
Bias and Fairness: GANs can reflect the biases and unfairness present in the training data, leading
to discriminatory or biased synthetic data.
Interpretability and Accountability : GANs can be opaque and difficult to interpret or explain,
making it challenging to ensure accountability, transparency, or fairness in their applications.
Variational AutoEncoders
Variational autoencoder was proposed in 2013 by Diederik P. Kingma and Max Welling at Google
and Qualcomm. A variational autoencoder (VAE) provides a probabilistic manner for describing
an observation in latent space. Thus, rather than building an encoder that outputs a single value to
describe each latent state attribute, we’ll formulate our encoder to describe a probability
distribution for each latent attribute. It has many applications, such as data compression, synthetic
data creation, etc.
The latent code generated by the encoder is a probabilistic encoding, allowing the VAE to express
not just a single point in the latent space but a distribution of potential representations.
The decoder network, in turn, takes a sampled point from the latent distribution and reconstructs
it back into data space. During training, the model refines both the encoder and decoder parameters
to minimize the reconstruction loss – the disparity between the input data and the decoded output.
The goal is not just to achieve accurate reconstruction but also to regularize the latent space,
ensuring that it conforms to a specified distribution.
The process involves a delicate balance between two essential components: the reconstruction loss
and the regularization term, often represented by the Kullback-Leibler divergence. The
reconstruction loss compels the model to accurately reconstruct the input, while the regularization
term encourages the latent space to adhere to the chosen distribution, preventing overfitting and
promoting generalization.
By iteratively adjusting these parameters during training, the VAE learns to encode input data into
a meaningful latent space representation. This optimized latent code encapsulates the underlying
features and structures of the data, facilitating precise reconstruction. The probabilistic nature of
the latent space also enables the generation of novel samples by drawing random points from the
learned distribution.
Mathematics behind Variational Autoencoder
Variational autoencoder uses KL-divergence as its loss function, the goal of this is to minimize the
difference between a supposed distribution and original distribution of dataset.
Suppose we have a latent variable z and we want to generate the observation x from it. In other
words, we want to calculate the posterior
p(z|x) = p(x|z) p(z) / p(x)
But the calculation of p(x) = ∫ p(x|z) p(z) dz can be quite difficult, as it requires integrating over
all possible values of z. This usually makes it an intractable distribution. Hence, we need to
approximate p(z|x) by q(z|x) to make it a tractable distribution. To make q(z|x) a good
approximation of p(z|x), we minimize the KL-divergence loss, which calculates how similar two
distributions are:
min KL( q(z|x) ‖ p(z|x) )
By simplifying, this is equivalent to maximizing
E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )
The first term represents the reconstruction likelihood and the other term ensures that our
learned distribution q is similar to the true prior distribution p.
Thus our total loss consists of two terms, one is the reconstruction error and the other is the
KL-divergence loss:
Loss = L(x, x̂) + KL( q(z|x) ‖ p(z) )
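A minimal sketch of this two-term loss for a Gaussian q(z|x) with a standard-normal prior (binary cross-entropy is one common choice of reconstruction term for inputs scaled to [0, 1]; the tensor shapes are illustrative stand-ins for a real encoder and decoder):

```python
# Minimal VAE loss sketch: reconstruction term plus the closed-form KL divergence
# between q(z|x) = N(mu, sigma^2) and the standard-normal prior. Shapes are toy
# stand-ins for a real encoder/decoder.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)                 # sigma
    return mu + std * torch.randn_like(std)       # z = mu + sigma * eps

def vae_loss(x, x_recon, mu, logvar):
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")       # reconstruction
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())      # KL(q || N(0, I))
    return recon + kl

x = torch.rand(16, 784)                           # batch of flattened 28x28 inputs
mu, logvar = torch.zeros(16, 20), torch.zeros(16, 20)
z = reparameterize(mu, logvar)                    # 20-d latent samples
x_recon = torch.sigmoid(torch.randn(16, 784))     # stand-in for a decoder output
print(vae_loss(x, x_recon, mu, logvar))
```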
Applications
Deep Learning for Photo Editing (Image Editing)
Inpainting
Inpaint focuses on photo editing via simplified semi-automatic tools and mechanisms. The
program includes a tool similar to the Healing Brush tool in Adobe Photoshop CS5 with the
Content-Aware mode on. Similar to Healing Brush, the tool tries to replace bad or damaged texture
with good texture from another area to create a seamless repair of an image.
Facial retouching
Object cloning
Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image
restoration like removing defects and artifacts, or even replacing an image area with something
entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area
to inpaint is represented by white pixels and the area to keep is represented by black pixels. The
white pixels are filled in by the prompt.
Super resolution
With the advancement in deep learning techniques in recent years, deep learning-based SR models
have been actively explored and often achieve state-of-the-art performance on various benchmarks
of SR. A variety of deep learning methods have been applied to solve SR tasks, ranging from the
early Convolutional Neural Networks (CNN) based method to recent promising Generative
Adversarial Nets based SR approaches.
Problem
The image super-resolution (SR) problem, particularly single image super-resolution (SISR), has
gained a lot of attention in the research community. SISR aims to reconstruct a high-resolution
image I_SR from a single low-resolution image I_LR. Generally, the relationship between I_LR
and the original high-resolution image I_HR can vary depending on the situation. Many studies
assume that I_LR is a bicubically downsampled version of I_HR, but other degrading factors such
as blur, decimation, or noise can also be considered for practical applications.
Here, we focus on supervised learning methods for super-resolution tasks. By using HR images
as targets and LR images as inputs, we can treat this as a supervised learning problem.
Exhaustive table of topics in supervised image super-resolution.
Upsampling Methods
Before understanding the rest of the theory behind the super-resolution, we need to
understand upsampling (Increasing the spatial resolution of images or simply increasing the
number of pixel rows/columns or both in the image) and its various methods.
Sub-pixel layer – The blue boxes denote the input and the boxes with other colors indicate different
convolution operations and different output feature maps.
● Sub-pixel Layer: The sub-pixel layer, another end-to-end learnable upsampling layer,
performs upsampling by generating a plurality of channels by convolution and then reshaping
them. Within this layer, a convolution is first applied to produce outputs with s² times the
number of channels, where s is the scaling factor. Assuming the input size is h × w × c, the
convolution output size will be h × w × s²c. After that, a reshaping (pixel-shuffle) operation is
performed to produce outputs of size sh × sw × c (a minimal sketch is given after this item).
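The sketch below uses PyTorch's built-in PixelShuffle to implement this convolution-plus-reshape; the input channel count (64), output channels (3) and scale factor (2) are illustrative choices:

```python
# Minimal sub-pixel upsampling sketch: a convolution produces s^2 * c channels and
# PixelShuffle rearranges h x w x s^2*c into sh x sw x c.
import torch
import torch.nn as nn

s, c = 2, 3                                              # scale factor, output channels
subpixel = nn.Sequential(
    nn.Conv2d(64, c * s * s, kernel_size=3, padding=1),  # produce s^2 * c channels
    nn.PixelShuffle(upscale_factor=s),                   # reshape to higher resolution
)

x = torch.randn(1, 64, 32, 32)       # toy low-resolution feature map
print(subpixel(x).shape)             # torch.Size([1, 3, 64, 64])
```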
Super-resolution Frameworks
Since image super-resolution is an ill-posed problem, how to perform upsampling (i.e., generating
HR output from LR input) is the key problem. There are mainly four model frameworks based on
the employed upsampling operations and their locations in the model (refer to the table above).
1. Pre-upsampling Super-resolution – the LR image is first upsampled to the target resolution
with a traditional method (e.g. bicubic interpolation) and a CNN then refines this coarse HR
image.
2. Post-upsampling Super-resolution – the CNN operates on the LR input directly and learnable
upsampling layers (such as the sub-pixel layer above) are placed towards the end of the network,
which reduces computation.
Learning Strategies
● Pixelwise L1 loss – Absolute difference between pixels of ground truth HR image and the
generated one.
● Pixelwise L2 loss – Mean squared difference between pixels of ground truth HR image
and the generated one.
● Content loss – the content loss is indicated as the Euclidean distance between high-level
representations of the output image and the target image. High-level features are obtained
by passing through pre-trained CNNs like VGG and ResNet.
● Adversarial loss – Based on GAN where we treat the SR model as a generator, and define
an extra discriminator to judge whether the input image is generated or not.
● PSNR – Peak Signal-to-Noise Ratio (PSNR) is a commonly used objective metric to
measure the reconstruction quality of a lossy transformation. PSNR decreases as the logarithm
of the Mean Squared Error (MSE) between the ground truth image and the generated image
increases.
MSE = (1/(m·n)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [ I(i, j) − K(i, j) ]²
PSNR = 10 · log₁₀( MAX_I² / MSE )
In MSE, I is a noise-free m×n monochrome image (ground truth) and K is the generated image
(noisy approximation). In PSNR, MAX_I represents the maximum possible pixel value of the
image (e.g. 255 for 8-bit images).
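These metrics are straightforward to compute; a minimal NumPy sketch for 8-bit images (the toy image and noise level are stand-ins):

```python
# Minimal MSE / PSNR computation for 8-bit images (MAX_I = 255).
import numpy as np

def psnr(ground_truth: np.ndarray, generated: np.ndarray, max_i: float = 255.0) -> float:
    mse = np.mean((ground_truth.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:                       # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(max_i ** 2 / mse)

# Toy usage: a random image and a noisy copy of it.
hr = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(hr + np.random.normal(0, 5, hr.shape), 0, 255).astype(np.uint8)
print(psnr(hr, noisy))
```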
Network Design
Various network designs in super-resolution architectures.
Enough of the basics! Let’s discuss some of the state-of-the-art super-resolution methods –
Super-Resolution methods
Super-Resolution Generative Adversarial Network (SRGAN) – Uses the idea of GANs for the
super-resolution task: the generator tries to produce a high-resolution image from the low-
resolution input, which is judged by the discriminator against real high-resolution images. Both
keep training so that the generator can generate images that match the true training data.
Self-supervised learning (SSL) and reinforcement learning (RL) are both machine learning
techniques, but they differ in how they learn:
● Self-supervised learning: In SSL, models learn from unlabeled data by generating their own
labels. This is a more practical approach than supervised learning, which requires labeled data that
is often expensive and time-consuming to obtain. SSL can be used in computer vision tasks like
image classification, object detection, and semantic segmentation.
● Reinforcement learning: In RL, models learn from feedback from actions taken in an
environment.
Reinforcement learning
RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to maximize
the cumulative reward. This learning process is characterized by the following elements:
● Policy: A strategy used by the agent to determine the next action based on the current state.
● Reward Function: A function that provides a scalar feedback signal based on the state and
action.
● Value Function: A function that estimates the expected cumulative reward from a given state.
● Model of the Environment: A representation of the environment that helps in planning by
predicting future states and rewards.
Example: Navigating a Maze
The problem is as follows: we have an agent and a reward, with many hurdles in between. The
agent is supposed to find the best possible path to reach the reward. The following example
explains the problem more easily.
Consider a grid containing a robot, a diamond, and fire. The goal of the robot is to get the reward,
that is the diamond, and avoid the hurdles, that is the fire. The robot learns by trying all the
possible paths and then choosing the path which gives it the reward with the fewest hurdles. Each
right step gives the robot a reward and each wrong step subtracts from the robot’s reward. The
total reward is calculated when it reaches the final reward, that is the diamond.
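This trial-and-error process can be sketched with tabular Q-learning on a tiny 4 × 4 grid world (the grid layout, the reward values and the hyperparameters below are illustrative assumptions, not part of the original example):

```python
# Minimal tabular Q-learning sketch on a 4x4 grid: the agent starts at the top-left
# (state 0), the diamond (+10) is at state 15 and one cell is fire (-10). The grid
# layout, rewards and hyperparameters are illustrative assumptions.
import numpy as np

goal, fire = 15, 9
actions = [-4, +4, -1, +1]               # up, down, left, right as state offsets
Q = np.zeros((16, 4))
alpha, gamma, eps = 0.1, 0.9, 0.2        # learning rate, discount, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != goal:
        a = rng.integers(4) if rng.random() < eps else int(Q[s].argmax())
        s2 = s + actions[a]
        # Disallow moves that would leave the grid.
        if s2 < 0 or s2 > 15 or (a == 2 and s % 4 == 0) or (a == 3 and s % 4 == 3):
            s2 = s
        r = 10 if s2 == goal else (-10 if s2 == fire else -1)    # -1 step cost
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # Q-learning update
        s = s2

print(np.argmax(Q, axis=1).reshape(4, 4))   # greedy action per cell after training
```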
1. Positive: Positive reinforcement is defined as the strengthening of behavior because a
desirable event or stimulus follows it.
● Maximizes performance
● Sustains change for a long period of time
● Too much reinforcement can lead to an overload of states, which can diminish the results
2. Negative: Negative reinforcement is defined as the strengthening of behavior because a
negative condition is stopped or avoided.
● Increases behavior
● Provides defiance to a minimum standard of performance
● It only provides enough to meet the minimum behavior