
UNIT - I

Fundamentals of Image Formation

Image formation is the analog-to-digital conversion of an image, carried out with the help of 2D sampling and quantization techniques by capturing devices such as cameras. In general, we see a 2D view of the 3D world.

Generally, a frame grabber or a digitizer is used for sampling and quantizing the analog signals.

Imaging
The mapping of a 3D world object onto a 2D digital image plane is called imaging.

Light reflects from every object we see, which is what allows an imaging system to capture all those light-reflecting points on its image plane.
Optical Systems
Lenses and mirrors are crucial in focusing the light coming from the 3D scene to produce the image on the image plane. These systems define how light is collected and where it is directed, and consequently they affect the sharpness and quality of the image produced.
Image Sensors
Image sensors such as CCD or CMOS sensors transform the optical image into an electronic signal. These sensors differ in sensitivity and in the resolution they deliver, which affects the image as a whole.
Resolution and Sampling
Resolution describes the sharpness of an image and is measured technically as the number of pixels an image holds. Sampling is the act of discretizing a continuous analog signal, representing it as a collection of discrete values. Higher resolution and appropriate sampling rates are required to produce detailed and accurate images.
Image Processing
Image processing is the act of modifying and enhancing digital images using algorithms. Pre-processing includes activities such as filtering, noise reduction, and color correction that improve image quality and information extraction.

Color and Pixelation


In digital imaging, a frame grabber, which acts like a sensor, is placed at the image plane. Light reflected by the 3D object is focused onto it, and the continuous image is pixelated. The light that is focused on the sensor generates an electronic signal.
Each pixel that is formed may be colored or grey depending on the sampling and quantization of the reflected light and the intensity of the electronic signal it generates.
All these pixels together form a digital image. The density of these pixels determines the image quality: the higher the density, the clearer and higher-resolution the image.
Forming a Digital Image
To form a digital image, we need to convert continuous image data into a digital form. This requires two main steps:
● Sampling (2D): Sampling sets the spatial resolution of the digital image, and the sampling rate determines the quality of the digitized image. The magnitude of the sampled image is determined as a value in image processing. It is related to the coordinate values of the image.
● Quantization: Quantization sets the number of grey levels in the digital image. The transition of the continuous values of the image function to their digital equivalents is called quantization. It is related to the intensity values of the image.
● A human observer needs a high number of quantization levels to perceive the fine shading details of an image: more quantization levels result in a clearer image. A short sketch of this step follows below.
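A minimal quantization sketch (assuming only NumPy, and using a synthetic gradient image in place of a real photograph) that reduces an 8-bit image to a chosen number of grey levels:

Python

import numpy as np

# Synthetic 8-bit "image": a smooth horizontal gradient
img = np.tile(np.arange(256, dtype=np.uint8), (64, 1))

def quantize(image, levels):
    # Map 256 grey values down to `levels` representative grey values
    step = 256 // levels
    return (image // step) * step + step // 2

coarse = quantize(img, 4)    # 4 grey levels: visible banding
fine = quantize(img, 64)     # 64 grey levels: close to the original
print(np.unique(coarse))     # [ 32  96 160 224]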

Advantages
● 1) Improved Accuracy: Digital imaging is less susceptible to human error and captures objects accurately and in high detail.
● 2) Enhanced Flexibility: Digital images are easy to manipulate, edit, or analyse as required using different software, providing flexibility in post-processing.
● 3) High Storage Capacity: Digital images can be stored in large quantities at very high resolution and quality, and they do not suffer physical wear and tear.
● 4) Easy Sharing and Distribution: The use of digital images allows them to be quickly
duplicated and transmitted across various channels and to various gadgets, helping to speed up
the work.
● 5) Advanced Analysis Capabilities: Digital imaging enables the application of analytical tools,
including image recognition and machine learning, which can provide better insights and
increase productivity.
Disadvantages
● 1) Data Size: High-resolution digital images can occupy large amounts of storage space and demand substantial computational power, which can be expensive.
● 2) Image Noise: Digital images may be compromised by noise and artifacts, which degrade image quality, especially when photographed at night or with low-quality image sensors.
● 3) Dependency on Technology: Digital imaging relies on sophisticated technology and equipment that may be costly, and the equipment may need constant servicing or replacement.
● 4) Privacy Concerns: The ability to take and circulate photographs digitally also poses concern
because personal information can be photographed without the subject’s permission.
● 5) Data Loss Risks: Digital image repositories, however, are prone to data loss caused by
hardware failures, corrupting software, or unintentional erasure.
Applications
● 1) Medical Imaging: Digital imaging is employed in medicine for diagnostics such as X-ray pictures, MRI scans, and CT scans, which provide views of the inside of the body.
● 2) Surveillance and Security: Digital cameras and imaging systems are greatly needed for
various security or surveillance purposes as they offer live feed and are also useful in acquiring
data for investigations.
● 3) Remote Sensing: Digital imaging plays an important role in remote sensing applications such as monitoring and mapping of the environment and of disasters, using data captured from satellite and aerial systems.
● 4) Entertainment and Media: The entertainment industry involves the use of digital imaging in
films, video games, and virtual reality to deliver improved visual impact.
● 5) Scientific Research: Digital imaging supports scientific studies by providing high-quality images in research fields such as astronomy, biology, and materials science.
Linear Filtering

Linear filtering is a computer vision technique that uses a filter, or kernel, to modify an image. It is a powerful image enhancement method that can reduce noise, and it is often used in applications that require fast processing.
A linear filter's response to a weighted sum of inputs equals the weighted sum of its responses to the individual inputs. Mathematically, if x(t) is the input signal and h(t) is the filter's impulse response, the output is the convolution y(t) = x(t) ∗ h(t). This property gives linear filters superposition and homogeneity, making them easy to predict and analyze mathematically.
Features of Linear Filters:
● Superposition Principle: The response to a sum of inputs is the sum of the responses to each input separately.
● Homogeneity: The response to a scaled input is the correspondingly scaled response.
● Convolution-Based: The output is obtained by convolving the input signal with the filter's impulse response.
● Frequency Domain Analysis: Because a linear filter is fully characterized by its impulse response, it can be analyzed and designed using frequency-domain techniques such as the Fourier transform.
● Predictable Behavior: Their well-defined mathematical structure makes them easy to anticipate and apply in different fields. The sketch below checks the first two properties numerically.
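A minimal numerical check of superposition and homogeneity (assuming only NumPy, with an arbitrary averaging kernel):

Python

import numpy as np

h = np.array([1/3, 1/3, 1/3])           # impulse response of an averaging filter
x1 = np.random.randn(50)
x2 = np.random.randn(50)
a, b = 2.0, -0.5

# filter(a*x1 + b*x2) should equal a*filter(x1) + b*filter(x2)
lhs = np.convolve(a * x1 + b * x2, h, mode='same')
rhs = a * np.convolve(x1, h, mode='same') + b * np.convolve(x2, h, mode='same')
print(np.allclose(lhs, rhs))            # True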
What are non-linear filters?
Non-linear filters are signal- or image-processing filters that do not satisfy superposition and homogeneity. Their output is not simply proportional to the input values. These filters apply operations that depend on the values and arrangement of the inputs, or on other more complex mathematical operations and algorithms.

Features of Non-linear Filters:


● Non-Superposition: The response to a sum of inputs is not simply the sum of the responses to each input separately.
● Complex Operations: These include operations such as median filtering, morphological transformations, and adaptive filtering.
● Effective Noise Reduction: Superior at removing specific types of noise, such as salt-and-pepper noise, without distorting edges.
● Edge Preservation: Able to maintain or even sharpen edges and small features in the picture.
● Adaptive Behavior: Can adjust their processing according to the input characteristics of the
local environment and therefore ideal for complex and diverse data.
Difference between Linear and non-linear filters

| Parameter | Linear Filters | Non-linear Filters |
|---|---|---|
| Superposition Principle | Obeys the superposition principle | Does not obey the superposition principle |
| Homogeneity | Response is proportional to the input | Response is not necessarily proportional |
| Mathematical Basis | Based on linear algebra and convolution | Based on complex mathematical functions |
| Frequency Domain Analysis | Can be analyzed using the Fourier transform | Not easily analyzed using the Fourier transform |
| Output Predictability | Predictable and straightforward to analyze | Less predictable; complex analysis required |
| Noise Reduction | Moderate noise reduction; can blur edges | Effective noise reduction; preserves edges |
| Edge Preservation | Can blur edges | Excels at preserving or enhancing edges |
| Computational Complexity | Generally lower complexity | Higher computational complexity |
| Adaptive Behavior | Static; does not adapt to input characteristics | Can adapt to local input characteristics |
| Impulse Response | Defined impulse response h(t) | No defined impulse response |
| Implementation | Simpler to implement | More complex to implement |
| Examples | Mean filter, Gaussian filter | Median filter, morphological filters |
Applications of Linear filters


● Smoothing and Blurring: Applied in image processing to reduce noise and fine detail in an image, using common filters such as Gaussian and averaging filters.
● Signal Filtering in Communication Systems: Used to remove unwanted components from signals transmitted between terminals in communication channels.
● Edge Detection (Basic): Linear edge detectors such as the Sobel filter enhance edges by finding changes in intensity gradients.
● Data Smoothing: Used in time-series analysis to remove variability and reveal underlying trends.
● Audio Signal Processing: Applied to balance sound, filter out noise, or boost certain frequency bands.
Gaussian filter

A Gaussian filter is a linear smoothing filter used in image processing to reduce noise and blur
images. It's based on the Gaussian distribution, also known as the normal distribution, which is a
bell-shaped curve that describes the probability distribution of a continuous random variable.

Properties of the Gaussian filter:

● Weighted averaging: The filter assigns higher weights to pixels closer to the center and lower
weights to those farther away.

● Non-causal: The filter window is symmetric about the origin in the time domain.

● Separable equation: The equation for the 2-D isotropic Gaussian can be separated into x and y
components, which allows for fairly quick convolution.
The Gaussian filter is used to remove Gaussian noise, blur images, and suppress fine detail; median filters are generally better suited to salt-and-pepper noise. A short OpenCV example follows.
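In OpenCV, Gaussian smoothing is a single call; the sketch below assumes an input file with the hypothetical name noisy.jpg:

Python

import cv2

img = cv2.imread('noisy.jpg')                  # hypothetical input image
# 5x5 Gaussian kernel; sigma = 1.4 controls the spread of the bell curve
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)
cv2.imwrite('smoothed.jpg', blurred)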

Convolution and Correlation

Convolution is a mathematical tool used to combine two signals to produce a result. In image processing, convolution transforms an input image by applying a kernel over it in a pixel-wise fashion.

When the convolution mask operates on a particular pixel, it performs its action by considering that pixel and its neighboring pixels, and the result is written back to that particular pixel. Convolution in image processing is therefore a mask operation.

How to perform convolution

1. Flip the mask and do correlation.


2. The 1D mask is flipped horizontally, as there is a single row.
3. The 2D mask is flipped vertically and horizontally.
4. Mask is slid over the image matrix from the left to the right direction.
5. When the mask hovers on the image, corresponding elements of mask and image are multiplied
and the products are added.
6. This process repeats for all the pixels of the image.

There are two types of operators in image processing.

● Point operator: While operating on a particular pixel, it takes only that pixel as input. For example, a brightness increase operation: we increase each pixel's intensity by the same value to brighten the image.
● Mask operator: While performing an action on a particular pixel, it takes that pixel and its neighbouring pixels as input, as in the convolution operation.
Illustration:
Image, I = [100, 120, 100, 150, 160]

Indexes of the image are 0, 1, 2, 3 and 4.

Mask used for correlation, H = [1/3, 1/3, 1/3]

Indexes of the mask are -1, 0 and 1.

We are using the same (symmetric) mask rather than an explicitly flipped one, so we must apply the indexes carefully.

Apply convolution between the image and the mask at index 1 of the image:

J(1) = I(0) · H(1) + I(1) · H(0) + I(2) · H(−1), where indexes are shown in parentheses.

J = I ∗ H

Convolution is denoted by (∗).

The size of the resultant image follows the same rule as in correlation. The sketch below reproduces this computation.
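This illustration can be reproduced with a few lines of NumPy; np.convolve flips the mask internally, matching the flip-then-correlate procedure described above:

Python

import numpy as np

I = np.array([100, 120, 100, 150, 160], dtype=float)
H = np.array([1/3, 1/3, 1/3])   # averaging mask

# 'same' keeps the output the same size as the input, as stated above
J = np.convolve(I, H, mode='same')
print(J)   # J[1] = (100 + 120 + 100) / 3 ≈ 106.67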

Image Edge Detection


Edges are significant local changes of intensity in a digital image. An edge can be defined as a set
of connected pixels that forms a boundary between two disjoint regions. There are three types of
edges:
● Horizontal edges
● Vertical edges
● Diagonal edges

Edge detection is a method of segmenting an image into regions of discontinuity. It is a widely used technique in digital image processing tasks such as

● pattern recognition
● image morphology
● feature extraction

Edge detection allows users to observe the features of an image by locating significant changes in grey level. Such a discontinuity indicates the end of one region in the image and the beginning of another. Edge detection reduces the amount of data in an image while preserving its structural properties.

Edge Detection Operators are of two types:


● Gradient-based operators, which compute first-order derivatives of a digital image: Sobel operator, Prewitt operator, Roberts operator
● Gaussian-based operators, which compute second-order derivatives of a digital image: Canny edge detector, Laplacian of Gaussian
Sobel Operator: This is a discrete differentiation operator. It computes an approximation of the gradient of the image intensity function for edge detection. At each pixel, the Sobel operator produces either the gradient vector or its norm. It uses two 3 x 3 kernels or masks, which are convolved with the input image to calculate the horizontal and vertical derivative approximations respectively; the kernels and a short OpenCV sketch are shown below.
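The two standard Sobel kernels, together with a short OpenCV sketch (assuming a grayscale image file, here given the hypothetical name input.jpg):

Python

import cv2
import numpy as np

# Standard Sobel kernels for the horizontal (Gx) and vertical (Gy) derivatives
Gx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
Gy = np.array([[-1, -2, -1],
               [ 0,  0,  0],
               [ 1,  2,  1]])

gray = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical filename
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # d/dx
sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # d/dy
magnitude = np.sqrt(sobel_x**2 + sobel_y**2)           # gradient magnitude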

Advantages:

1. Simple and time-efficient computation
2. Works well for detecting smooth edges

Limitations:

1. Diagonal direction points are not always preserved
2. Highly sensitive to noise
3. Not very accurate in edge detection
4. Produces thick, rough edges, which does not give appropriate results
Prewitt Operator: This operator is similar to the Sobel operator. It also detects the vertical and horizontal edges of an image, and it is one of the better ways to estimate the orientation and magnitude of edges in an image. It uses a pair of 3 x 3 kernels or masks. Advantages:

1. Good performance on detecting vertical and horizontal edges
2. Good at detecting the orientation of edges

Limitations:

1. The magnitude of the coefficients is fixed and cannot be changed
2. Diagonal direction points are not always preserved
Roberts Operator: This gradient-based operator computes the sum of squares of the differences between diagonally adjacent pixels in an image through discrete differentiation, from which the gradient approximation is made. It uses a pair of 2 x 2 kernels or masks. Advantages:

1. Detection of edges and orientation are very easy


2. Diagonal direction points are preserved

Limitations:

1. Very sensitive to noise


2. Not very accurate in edge detection

Marr-Hildreth Operator or Laplacian of Gaussian (LoG): This is a Gaussian-based operator that uses the Laplacian to take the second derivative of an image. It works well when the grey-level transition is abrupt. It relies on the zero-crossing method: where the second-order derivative crosses zero, that location corresponds to an extremum of the first derivative and is marked as an edge location. Here the Gaussian operator reduces the noise and the Laplacian operator detects the sharp edges.
The Gaussian function is defined by the formula:

G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))

where σ is the standard deviation controlling the amount of smoothing.
Canny Operator: This is a Gaussian-based edge detection operator. It is not very susceptible to noise, and it extracts image features without affecting or altering them. The Canny edge detector uses an advanced algorithm derived from earlier work on the Laplacian of Gaussian operator, and it is widely used as an optimal edge detection technique. It detects edges based on three criteria:
1. Low error rate
2. Edge points must be accurately localized
3. There should be just one single edge response

Advantages:

1. It has good localization
2. It extracts image features without altering them
3. Less sensitive to noise

Limitations:

1. False zero crossings can occur
2. Complex and time-consuming computation
Corner Detection

Corner detection is a computer vision technique that finds corners in an image by looking for points where lines bend sharply. It is used in many applications, including:
● Text spotting: Text often has many distinct corners

● Machine vision: Corner detection helps locate objects and measure their dimensions

● Motion detection: Corner detection is often one of the first steps in motion detection applications

● Image registration: Corner detection is used in image registration

● Video tracking: Corner detection is used in video tracking

● Image mosaicing: Corner detection is used in image mosaicing

● Panorama stitching: Corner detection is used in panorama stitching

● 3D reconstruction: Corner detection is used in 3D reconstruction

● Object recognition: Corner detection is used in object recognition


To detect the corners of objects in an image, one can start by detecting edges and then determine where two edges meet. There are, however, other methods, among them:

● the Moravec detector [Moravec 1980],


● the Harris detector [Harris & Stephens 1988].

Moravec detector
The principle of this detector is to observe if a sub-image, moved around one pixel in all directions,
changes significantly. If this is the case, then the considered pixel is a corner.

Principle of the Moravec detector, from left to right: on a flat area, small shifts of the sub-image (in red) do not cause any change; on a contour, we observe changes in only one direction; around a corner there are significant changes in all directions.
Mathematically, the change is characterized in each pixel (m,n) of the image by Em,n(x,y) which
represents the difference between the sub-images for an offset (x,y):
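In the usual formulation, with W a window (the sub-image) centered at pixel (m, n), w(u, v) its weighting (often all ones), and I the image intensity:

E_{m,n}(x, y) = Σ_{(u,v) ∈ W} w(u, v) · [ I(u + x, v + y) − I(u, v) ]²

A pixel is then declared a corner when the minimum of E_{m,n}(x, y) over the tested shifts (x, y) is large, since only at a corner does every shift produce a significant change.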

Bag of Visual Words

In bag of words (BoW), we count the number of times each word appears in a document, use the frequency of each word to identify the keywords of the document, and make a frequency histogram from it.

The following models a text document using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.


(2) Mary also likes to watch football games.

Based on these two text documents, a list is constructed as follows for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"

"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object, and attributing to the


respective JavaScript variable:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text
document.

The order of elements is free, so, for


example {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is also equivalent
to BoW1. It is also what we expect from a strict JSON object representation.

Note: if another document is like a union of these two,


(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.

its JavaScript representation will be:

BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};
So, as we see in the bag algebra, the "union" of two documents in the bags-of-words representation
is, formally, the disjoint union, summing the multiplicities of each element.
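The same construction can be sketched in Python with collections.Counter (a minimal version with naive tokenization):

Python

from collections import Counter
import re

def bow(text):
    # Tokenize on letter runs; a real pipeline would handle casing and punctuation better
    return Counter(re.findall(r"[A-Za-z]+", text))

d1 = "John likes to watch movies. Mary likes movies too."
d2 = "Mary also likes to watch football games."

BoW1, BoW2 = bow(d1), bow(d2)
BoW3 = BoW1 + BoW2        # "union" of the bags: multiplicities are summed
print(BoW3["likes"])      # 3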
Word order

The BoW representation of a text removes all word ordering. For example, the BoW representation
of "man bites dog" and "dog bites man" are the same, so any algorithm that operates with a BoW
representation of text must treat them in the same way. Despite this lack of syntax or grammar,
BoW representation is fast and may be sufficient for simple tasks that do not require word order.
For instance, for document classification, if the words "stocks", "trade", and "investors" appear multiple times, then the text is likely a financial report, even though that would be insufficient to distinguish between

Yesterday, investors were rallying, but today, they are retreating.

and

Yesterday, investors were retreating, but today, they are rallying.

and so the BoW representation would be insufficient to determine the detailed meaning of the
document.

Implementations

Implementations of the bag-of-words model might involve using frequencies of words in a


document to represent its contents. The frequencies can be "normalized" by the inverse of
document frequency, or tf–idf. Additionally, for the specific purpose of
classification, supervised alternatives have been developed to account for the class label of a
document.[4] Lastly, binary (presence/absence or 1/0) weighting is used in place of frequencies for
some problems (e.g., this option is implemented in the WEKA machine learning software system).

VLAD

VLAD (Vector of Locally Aggregated Descriptors) is a feature encoding and pooling algorithm
used in computer vision to represent images. It's often used for image classification and instance
retrieval. Here are some things to know about VLAD:

● How it works

VLAD is built from local feature descriptors extracted from an image and a dictionary (visual vocabulary) built with a clustering method. It matches each descriptor to its closest cluster and then accumulates, for each cluster, the sum of the differences between the assigned descriptors and the cluster centroid.

● Advantages

VLAD strikes a good balance between computational efficiency and representation ability.

● Extensions

VLAD can be combined with Deep Convolutional Neural Network (DCNN) features to improve
face verification.

● History
VLAD was introduced by Jégou et al. in a 2010 paper published at the IEEE Conference on Computer Vision and Pattern Recognition; later work proposed normalization methods and vocabulary adaptation to improve retrieval performance.
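A bare-bones VLAD encoder in NumPy; this sketch assumes the local descriptors and the cluster centroids (the visual vocabulary) have already been computed, and uses random data in place of real descriptors:

Python

import numpy as np

def vlad_encode(descriptors, centroids):
    # descriptors: (N, D) local features; centroids: (K, D) visual vocabulary
    K, D = centroids.shape
    # Assign each descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)
    # Accumulate residuals (descriptor minus centroid) per cluster
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignment == k]
        if len(members) > 0:
            v[k] = (members - centroids[k]).sum(axis=0)
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization

# Toy usage with random data standing in for real SIFT descriptors
desc = np.random.randn(100, 8)
cent = np.random.randn(4, 8)
print(vlad_encode(desc, cent).shape)   # (32,)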

Random Sample Consensus (RANSAC):


RANSAC (Random Sample Consensus) is a robust algorithm used in machine learning and
computer vision to estimate model parameters in the presence of outliers. It is particularly useful
when there is a large amount of noisy data, and the goal is to find a model that fits the inliers well.
RANSAC is an iterative algorithm that randomly samples a subset of the data and fits a model to
that subset. The model is then used to classify the remaining data as either inliers or outliers. The
algorithm continues to iterate, selecting new random subsets of the data, until a satisfactory model
is found.

Mathematical Formulation:

Let us assume that we have a set of data points, D = {d1, d2, …, dn}, and we want to estimate a
model, M, that best fits this data. The model can be represented by a set of parameters, θ = {θ1,
θ2, …, θm}. For example, in the case of a linear regression model, θ1 and θ2 would be the slope
and intercept, respectively.

To apply RANSAC, we need to define the following parameters:

● n: the minimum number of data points required to estimate the model parameters

● k: the number of iterations the algorithm should run

● t: the threshold that determines which data points are considered inliers

● d: the minimum number of inliers required to accept a model as valid

The algorithm works as follows:

1. Randomly select n data points from D and use them to estimate the model parameters θ.

2. Classify the remaining data points as inliers or outliers based on whether their distance to the
model is less than the threshold t.
3. If the number of inliers is greater than or equal to d, re-estimate the model parameters using
all the inliers and terminate the algorithm.

4. Repeat steps 1–3 k times and select the model with the largest number of inliers. A minimal line-fitting sketch of this loop follows.
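A minimal RANSAC sketch for 2D line fitting (assuming only NumPy; the parameter values are illustrative, and the final least-squares refit on the inliers is omitted for brevity):

Python

import numpy as np

def ransac_line(points, k=100, t=0.1, d=30):
    # Fit y = m*x + c to an (N, 2) array of points with RANSAC (n = 2 points per sample)
    rng = np.random.default_rng(0)
    best_model, best_inliers = None, 0
    for _ in range(k):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue                                   # skip degenerate (vertical) samples
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        residuals = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = np.sum(residuals < t)                # threshold test
        if inliers >= d and inliers > best_inliers:    # keep the best model so far
            best_model, best_inliers = (m, c), inliers
    return best_model, best_inliers

# Toy data: noisy line y = 2x + 1 plus 20 gross outliers
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
pts = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.03, 100)])
pts[:20] = rng.uniform(0, 3, (20, 2))
print(ransac_line(pts))   # (m, c) close to (2, 1), with a large inlier count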
Pros and Cons

Pros:

● RANSAC is a robust algorithm that can handle a large amount of noise and outliers in the
data.

● It can be used with any model that can be estimated from a subset of the data.

● It is relatively simple to implement and computationally efficient.

● RANSAC can provide a good approximation of the true model even when there are a large
number of outliers in the data.

Cons:
● RANSAC is a heuristic algorithm, which means that it does not guarantee the optimal
solution.

● The choice of parameters (n, k, t, d) can have a significant impact on the performance of the
algorithm. Finding the optimal values for these parameters can be challenging.

● The algorithm can be sensitive to the initial random sample, which can lead to different results
for different runs of the algorithm.

Some specific use cases where RANSAC can be applied include:

1. Line fitting: RANSAC can be used to fit a line to a set of 2D or 3D points in the presence of
outliers. This is useful in computer vision tasks such as lane detection in autonomous vehicles.

2. Fundamental matrix estimation: RANSAC can be used to estimate the fundamental matrix
that relates corresponding points in two images. This is useful in stereo vision applications
such as 3D reconstruction and object tracking.

3. Object recognition: RANSAC can be used to match features between images and estimate
the pose of objects in the scene. This is useful in robotics applications such as pick-and-place
tasks.

4. Plane fitting: RANSAC can be used to fit a plane to a set of 3D points in the presence of
outliers. This is useful in computer graphics applications such as rendering and 3D modeling.

Overall, RANSAC is a powerful algorithm for robust model estimation in the presence of
outliers. While it has its limitations, it can be a valuable tool in a wide range of applications in
machine learning and computer vision.
Hough Transform in Computer Vision

The Hough Transform is a popular technique in computer vision and image processing, used for
detecting geometric shapes like lines, circles, and other parametric curves. Named after Paul
Hough, who introduced the concept in 1962, the transform has evolved and found numerous
applications in various domains such as medical imaging, robotics, and autonomous driving. In
this article, we will discuss how Hough transformation is utilized in computer vision.
What is Hough Transform?
The Hough Transform is a feature extraction method used to find basic shapes in an image, such as lines, circles, and ellipses. Fundamentally, it transfers the representation of these shapes from the spatial domain to a parameter space, allowing effective detection even in the presence of distortions such as noise or occlusion.
How Does the Hough Transform Work?
The Hough Transform first creates an accumulator array, sometimes referred to as the parameter space or Hough space. This space represents the possible parameter values for the shapes being detected. For line detection, for instance, the parameters could be the slope (m) and y-intercept (b) of a line; in practice the polar form ρ = x·cos θ + y·sin θ is usually preferred, since it avoids unbounded slopes for vertical lines.
For each edge point in the image, the Hough Transform computes the matching curves in the parameter space by iterating over the possible parameter values and finding those consistent with that point. The accumulator array records the "votes", or intersections, for every combination of parameters.
Finally, the program finds peaks in the accumulator array; these peaks correspond to the parameters of the detected shapes, indicating whether the image contains lines, circles, or other shapes.
Variants and Techniques of Hough transform
The performance and adaptability of the Hough Transform have been improved throughout time
by a number of variations and techniques:
● Standard Hough Transform (SHT): Paul Hough's initial formulation for line identification. It entails discretizing the parameter space and voting over the possible combinations of parameters.
● Probabilistic Hough Transform (PHT): The PHT randomly chooses a subset of edge points
and only applies line detection to those locations in order to increase efficiency. For real-time
applications, this minimizes processing complexity while maintaining accuracy in the output.
● Generalized Hough Transform (GHT): By recording the spatial relationships of every shape
using a template, the GHT can detect any shape, in contrast to the SHT’s limited ability to
detect just specified shapes. After that, a voting system akin to the SHT is used to match this
template with the image.
● Accumulator Space Dimensionality: The classic Hough Transform identifies lines using a two-dimensional parameter space, but more complicated shapes, such as circles or ellipses, can be detected in higher dimensions. Every extra dimension corresponds to an extra parameter of the detected shape.

Implementation of Hough transform in computer vision

The Python code implementation for line detection utilizing the Hough Transform on
this image and OpenCV is described in detail below.
1) Import necessary libraries
This code imports OpenCV for image processing and the NumPy library for numerical
computations.
Python

import numpy as np
import cv2

2) Read the image


Python

img = cv2.imread('lane_hough.jpg', cv2.IMREAD_COLOR) # lane_hough.jpg is the filename


3) Convert the Image to Grayscale
Convert the loaded image to grayscale for edge detection.
Python

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)


4) Apply Canny Edge Detector:
Detect edges in the grayscale image using the Canny edge detection method.
Python

edges = cv2.Canny(gray, 50, 200)


5) Detect Lines using Probabilistic Hough Transform:
In order to find lines in the edge-detected image, use the ‘cv2.HoughLinesP’ function. An array of
lines is returned by this method, with each line’s end points (x1, y1, x2, y2) serving as its
representation.
Python

lines = cv2.HoughLinesP(edges, 1, np.pi/180, 68, minLineLength=15, maxLineGap=250)


6) Draw Detected Lines on the Original Image:
Draw each identified line using ‘cv2.line’ on the original image after iterating over them. The
lines’ thickness is set to 3 pixels, and their color is set to blue (255, 0, 0).
Python

for line in lines:
    x1, y1, x2, y2 = line[0]
    cv2.line(img, (x1, y1), (x2, y2), (255, 0, 0), 3)
7) Display the Result:
A statement explaining that line detection is being done should be printed. Next, use ‘cv2.imshow’
to display the image with the lines that have been detected. Finally, use ‘cv2.waitKey(0)’ and
‘cv2.destroyAllWindows()’ to wait for a key press to close the window.
Python

print("Line Detection using Hough Transform")


cv2.imshow('lanes', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Taking complete code at once, we get


Python

import numpy as np
import cv2

# Read the image
img = cv2.imread('lane_hough.jpg', cv2.IMREAD_COLOR)  # lane_hough.jpg is the filename

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Find the edges in the image using the Canny detector
edges = cv2.Canny(gray, 50, 200)

# Detect points that form a line
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 68, minLineLength=15, maxLineGap=250)

# Draw lines on the image
for line in lines:
    x1, y1, x2, y2 = line[0]
    cv2.line(img, (x1, y1), (x2, y2), (255, 0, 0), 3)

# Show result
print("Line Detection using Hough Transform")
cv2.imshow('lanes', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Applications in Computer Vision


In computer vision, the Hough Transform has several uses, some of which are as follows:
● Line and Curve Detection: Working on top of edge detection output, the Hough Transform identifies lines or curves in the image, facilitating the extraction of significant information from images.
● Object Recognition: To aid in the identification and categorization of items, the Hough
Transform can be utilized in object recognition tasks to pinpoint particular forms within an
image.
● Lane detection: To help autonomous cars stay in their assigned lanes, lane markers on the
road are commonly detected using the Hough Transform.
● Medical Imaging: The Hough Transform can be used to identify and evaluate different
anatomical features in medical imaging applications, such as MRI or CT scans, which can help
with diagnosis and therapy planning.
● Manufacturing: In the manufacturing sector, the Hough Transform can be applied to quality control tasks such as measuring component dimensions or checking for flaws.
UNIT – II INTRODUCTION TO DEEP LEARNING

What is a Feedforward Neural Network?


A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs).
The network consists of an input layer, one or more hidden layers, and an output layer.
Information flows in one direction—from input to output—hence the name “feedforward.”

Structure of a Feedforward Neural Network
1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron in
the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output layers.
These layers are responsible for learning the complex patterns in the data. Each neuron in a
hidden layer applies a weighted sum of inputs followed by a non-linear activation function.
3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or the
number of outputs in a regression problem.
Each connection between neurons in these layers has an associated weight that is adjusted during
the training process to minimize the error in predictions.
Feed Forward Neural Network

Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn and model complex data patterns. Common activation functions include (minimal implementations follow below):
● Sigmoid: σ(x) = 1 / (1 + e^(−x))
● Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
● ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
● Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
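These four functions are straightforward to implement directly; a minimal NumPy version (vectorized, so each applies elementwise to arrays):

Python

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), leaky_relu(x))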
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared
Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the network to
update the weights. The gradient of the loss function with respect to each weight is calculated,
and the weights are adjusted using gradient descent.
Forward Propagation

Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient descent
include:
● Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
● Stochastic Gradient Descent (SGD): Updates weights for each training example individually.
● Mini-batch Gradient Descent: Updates weights after computing the gradient over a small batch of training examples; a sketch contrasting these variants follows.
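A toy sketch on a linear regression problem (assuming only NumPy; the learning rate, batch size, and data are illustrative). With the batch size equal to the dataset size this is batch gradient descent, with batch = 1 it is SGD, and anything in between is mini-batch:

Python

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, 200)   # true slope 3, intercept 0.5

w, b, lr, batch = 0.0, 0.0, 0.1, 32                 # batch=200 -> batch GD, batch=1 -> SGD
for epoch in range(50):
    idx = rng.permutation(len(X))                   # shuffle each epoch
    for start in range(0, len(X), batch):
        sel = idx[start:start + batch]
        err = (w * X[sel, 0] + b) - y[sel]
        # Gradients of the mean squared error with respect to w and b
        w -= lr * 2 * np.mean(err * X[sel, 0])
        b -= lr * 2 * np.mean(err)
print(w, b)   # should approach 3.0 and 0.5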
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
● Accuracy: The proportion of correctly classified instances out of the total instances.
● Precision: The ratio of true positive predictions to the total predicted positives.
● Recall: The ratio of true positive predictions to the actual positives.
● F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
● Confusion Matrix: A table used to describe the performance of a classification model,
showing the true positives, true negatives, false positives, and false negatives.
Code Implementation of Feedforward neural network
This code demonstrates the process of building, training, and evaluating a neural network model
using TensorFlow and Keras to classify handwritten digits from the MNIST dataset. Initially, the
MNIST dataset is loaded and normalized by scaling the pixel values to the range [0, 1]. The model
architecture is defined using the Sequential API, consisting of a Flatten layer to convert the 2D
image input into a 1D array, followed by a Dense layer with 128 neurons and ReLU activation,
and a final Dense layer with 10 neurons and softmax activation to output probabilities for each
digit class. The model is compiled with the Adam optimizer, SparseCategoricalCrossentropy loss
function, and SparseCategoricalAccuracy metric. The model is then trained for 5 epochs on the
training data. Finally, the model’s performance is evaluated on the test set, and the test accuracy
is printed.
Python

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Load and prepare the MNIST dataset


mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the model


model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=[SparseCategoricalAccuracy()])

# Train the model


model.fit(x_train, y_train, epochs=5)

# Evaluate the model


test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'\nTest accuracy: {test_acc}')
Output:
Test accuracy: 0.9767000079154968

Backpropagation in Neural Networks

What is backpropagation?
● In machine learning, backpropagation is an effective algorithm used to train artificial neural networks, especially feed-forward neural networks.
● Backpropagation is an iterative algorithm that helps minimize the cost function by determining which weights and biases should be adjusted. During every epoch, the model learns by adapting the weights and biases to minimize the loss, moving down along the gradient of the error. It is therefore used together with optimization algorithms such as gradient descent or stochastic gradient descent.
● Computing the gradient in the backpropagation algorithm helps to minimize the cost function, and it is implemented using the chain rule from calculus to navigate through the layers of the neural network.
Fig(a) A simple illustration of how the backpropagation works by adjustments of weights

Advantages of Using the Backpropagation Algorithm in Neural Networks


Backpropagation, a fundamental algorithm in training neural networks, offers several advantages
that make it a preferred choice for many machine learning tasks. Here, we discuss some key
advantages of using the backpropagation algorithm:
1. Ease of Implementation: Backpropagation does not require prior knowledge of neural
networks, making it accessible to beginners. Its straightforward nature simplifies the
programming process, as it primarily involves adjusting weights based on error derivatives.
2. Simplicity and Flexibility: The algorithm’s simplicity allows it to be applied to a wide range
of problems and network architectures. Its flexibility makes it suitable for various scenarios,
from simple feedforward networks to complex recurrent or convolutional neural networks.
3. Efficiency: Backpropagation accelerates the learning process by directly updating weights based on the calculated error derivatives. This efficiency is particularly advantageous in training deep neural networks, where computing these derivatives by other means would be time-consuming.
4. Generalization: Backpropagation enables neural networks to generalize well to unseen data
by iteratively adjusting weights during training. This generalization ability is crucial for
developing models that can make accurate predictions on new, unseen examples.
5. Scalability: Backpropagation scales well with the size of the dataset and the complexity of the
network. This scalability makes it suitable for large-scale machine learning tasks, where
training data and network size are significant factors.
Working of Backpropagation Algorithm
The Backpropagation algorithm works by two different passes, they are:
● Forward pass
● Backward pass
How does Forward pass work?
● In the forward pass, the input is first fed into the input layer. Since the inputs are raw data, they can be used for training the neural network.
● The inputs and their corresponding weights are passed to the hidden layer. The hidden layer performs computation on the data it receives. If there are two hidden layers in the neural network, as in the illustration fig(a) with hidden layers h1 and h2, the output of h1 can be used as the input of h2. Before the activation function is applied, the bias is added.
● In each hidden-layer neuron, the activation function is applied to the weighted sum of inputs. One commonly used activation function is ReLU, which returns the input if it is positive and zero otherwise. This introduces non-linearity into the model, enabling the network to learn complex relationships in the data. Finally, the weighted outputs from the last hidden layer are fed into the output layer to compute the final prediction; this layer can use the softmax activation function, which converts the weighted outputs into probabilities for each class.
The forward pass using weights and biases

How does backward pass work?


● In the backward pass, the error is transmitted back through the network, which helps the network improve its performance by learning and adjusting its internal weights.
● To find the error produced by the forward pass, we can use one of the most common methods, the mean squared error, which measures the difference between the predicted output and the desired output:

Mean squared error = (predicted output − actual output)²

● Once the error is calculated at the output layer, we propagate it backward through the network, layer by layer.
● The key calculation during the backward pass is determining the gradient for each weight and bias in the network. This gradient tells us how much each weight or bias should be adjusted to minimize the error in the next forward pass. The chain rule is applied iteratively to calculate these gradients efficiently.
● In addition to gradient calculation, the activation function also plays a crucial role in backpropagation: the gradients are computed with the help of the derivative of the activation function.
Example of Backpropagation in Machine Learning
Let us now take an example to explain backpropagation in Machine Learning,
Assume that the neurons use the sigmoid activation function for the forward and backward passes, that the target output y is 0.5, and that the learning rate is 1. Now perform backpropagation using the backpropagation algorithm.

Example (1) of backpropagation sum

Implementing forward propagation:


Step 1: Before computing the forward propagation, we need two formulae:

a_j = Σ_i (w_{i,j} · x_i)

where a_j is the weighted sum of all inputs and weights at node j, w_{i,j} is the weight associated with the connection from input i to neuron j, and x_i is the value of input i.

y_j = F(a_j) = 1 / (1 + e^(−a_j))

where y_j is the output value and F denotes the activation function (the sigmoid activation function is used here), which transforms the weighted sum into the output value.
Step 2: To compute the forward pass, we need to compute the output for y3 , y4 , and y5.

To find the outputs of y3, y4 and y5

We start by calculating each node's weighted input using the formula a_j = Σ_i (w_{i,j} · x_i). To find y3, we consider the incoming edges of h1 along with their weights and inputs; here the incoming edges come from x1 and x2.
At node h1:
a1 = (w1,1 · x1) + (w2,1 · x2) = (0.2 × 0.35) + (0.2 × 0.7) = 0.21
Once we have the value of a1, we can find y3:
y3 = F(a1) = 1 / (1 + e^(−0.21)) = 0.55
Similarly, we find y4 at h2 and y5 at O3:
a2 = (w1,2 · x1) + (w2,2 · x2) = (0.3 × 0.35) + (0.3 × 0.7) = 0.315
y4 = F(0.315) = 1 / (1 + e^(−0.315)) = 0.58
a3 = (w1,3 · y3) + (w2,3 · y4) = (0.3 × 0.55) + (0.9 × 0.58) = 0.687
y5 = F(0.687) = 1 / (1 + e^(−0.687)) = 0.67

Values of y3, y4 and y5

Note that our target output is 0.5 but we obtained 0.67. To calculate the error, we use:

Error = y_target − y5 = 0.5 − 0.67 = −0.17

Using this error value, we will backpropagate.
Implementing Backward Propagation
Each weight in the network is changed by

Δw_{i,j} = η · δ_j · O_i

δ_j = O_j (1 − O_j)(t_j − O_j)   (if j is an output unit)
δ_j = O_j (1 − O_j) Σ_k δ_k w_{j→k}   (if j is a hidden unit, summing over the units k downstream of j)

where η is the learning rate (a constant), t_j is the correct output for unit j, δ_j is the error measure for unit j, and O_i is the output feeding into weight w_{i,j}.
Step 3: To calculate the backpropagation, we start from the output unit.
To compute δ5, we use the output of the forward pass:
δ5 = y5 (1 − y5)(y_target − y5)
   = 0.67 × (1 − 0.67) × (−0.17)
   = −0.0376
For the hidden units, we use the value of δ5:
δ3 = y3 (1 − y3)(w1,3 · δ5)
   = 0.55 × (1 − 0.55) × (0.3 × −0.0376)
   = −0.0028
δ4 = y4 (1 − y4)(w2,3 · δ5)
   = 0.58 × (1 − 0.58) × (0.9 × −0.0376)
   = −0.0082
Step 4: We update the weights, starting from the output unit back towards the hidden units:
Δw_{i,j} = η · δ_j · O_i

Note: here our learning rate is 1.

Δw2,3 = η · δ5 · y4
      = 1 × (−0.0376) × 0.58
      = −0.0218
We update each weight from its old value:
w2,3(new) = Δw2,3 + w2,3(old)
          = −0.0218 + 0.9
          = 0.8782
From the hidden units to the input units, the calculation is analogous:
Δw1,1 = η · δ3 · x1
      = 1 × (−0.0028) × 0.35
      = −0.00098
Similarly, we calculate the new weight value from the old one:
w1,1(new) = Δw1,1 + w1,1(old)
          = −0.00098 + 0.2
          = 0.19902
Similarly, we update the weights of the other connections. The new weights are:
w1,2(new) = 0.3 + (−0.0082 × 0.35) = 0.29713
w1,3(new) = 0.3 + (−0.0376 × 0.55) = 0.27932
w2,1(new) = 0.2 + (−0.0028 × 0.7) = 0.19804
w2,2(new) = 0.3 + (−0.0082 × 0.7) = 0.29426
The updated weights are illustrated below.

Through backward pass the weights are updated

Once the above process is done, we again perform the forward pass to check whether we now obtain the target output of 0.5.
Performing the forward pass again with the updated weights, we obtain approximately:
y3 = 0.55
y4 = 0.58
y5 = 0.66
Our y5 value is 0.66, which is still not the expected target output, so we again compute the error and backpropagate through the network, updating the weights, until the output is close enough to the target.

Error = y_target − y5 = 0.5 − 0.66 = −0.16

This is how backpropagation works: it performs the forward pass first to see whether we obtain the target output; if not, it computes the error and then propagates it backwards through the layers of the network, adjusting the weights according to the error. This process continues until the network produces the desired output.
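The hand calculation above can be checked with a short script; this sketch hard-codes the example's weights and reproduces the forward pass, the deltas, and one round of weight updates (values match the calculation above up to rounding of the intermediate results):

Python

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x1, x2, target, eta = 0.35, 0.7, 0.5, 1.0
w11, w21 = 0.2, 0.2          # inputs -> h1
w12, w22 = 0.3, 0.3          # inputs -> h2
w13, w23 = 0.3, 0.9          # h1, h2 -> output

# Forward pass
y3 = sigmoid(w11 * x1 + w21 * x2)        # ≈ 0.55
y4 = sigmoid(w12 * x1 + w22 * x2)        # ≈ 0.58
y5 = sigmoid(w13 * y3 + w23 * y4)        # ≈ 0.67

# Deltas (backward pass)
d5 = y5 * (1 - y5) * (target - y5)       # ≈ -0.037
d3 = y3 * (1 - y3) * (w13 * d5)          # ≈ -0.0027
d4 = y4 * (1 - y4) * (w23 * d5)          # ≈ -0.0081

# Weight updates: w_new = w_old + eta * delta_j * (input feeding the weight)
w13 += eta * d5 * y3
w23 += eta * d5 * y4                     # ≈ 0.879
w11 += eta * d3 * x1
w21 += eta * d3 * x2
w12 += eta * d4 * x1
w22 += eta * d4 * x2
print(y3, y4, y5, w23, w11)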

Vanishing Gradient Problem


The vanishing gradient problem is a challenge that emerges during backpropagation when the
derivatives or slopes of the activation functions become progressively smaller as we move
backward through the layers of a neural network.
When the weight updates become extremely tiny, or even exponentially small, training time can be significantly prolonged, and in the worst-case scenario training can halt altogether.
Why the Problem Occurs?
During backpropagation, as the gradients propagate back through the layers of the network, they can decrease significantly. As they travel from the output layer back towards the input layer, the gradients become progressively smaller. As a result, the weights of the initial layers, which receive these small gradients, are updated little or not at all in each iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and hyperbolic tangent (tanh) activation functions, because their derivatives fall within the ranges 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small, causing the updated weights to closely resemble the original ones.
The sigmoid and tanh functions squash their inputs into the ranges [0, 1] and [−1, 1], saturating at 0 or 1 for sigmoid and at −1 or 1 for tanh. In the saturated regions, especially when inputs are very small or very large, the derivatives are very close to zero. While this may not be a major concern in shallow networks with a few layers, it becomes a pronounced issue in deep networks: the small gradients multiply across layers and decay significantly, so the first layers learn very slowly, which hinders overall model performance and can lead to convergence failure.
How can we solve the issue?
● Batch Normalization : Batch normalization normalizes the inputs of each layer, reducing
internal covariate shift. This can help stabilize and accelerate the training process, allowing for
more consistent gradient flow.
● Activation function: An activation function like the Rectified Linear Unit (ReLU) can be used. With ReLU, the gradient is 0 for negative (and zero) input and 1 for positive input, which helps alleviate the vanishing gradient issue: ReLU replaces negative input values with 0 and passes positive input values through unchanged.
● Skip Connections and Residual Networks (ResNets): Skip connections, as seen in ResNets,
allow the gradient to bypass certain layers during backpropagation. This facilitates the flow of
information through the network, preventing gradients from vanishing.
● Long Short-Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs): In
the context of recurrent neural networks (RNNs), architectures like LSTMs and GRUs are
designed to address the vanishing gradient problem in sequences by incorporating gating
mechanisms .
● Gradient Clipping: Gradient clipping imposes a threshold on the gradients during backpropagation, limiting their magnitude. This can prevent gradients from exploding, which can also hinder learning. A small demonstration of the vanishing effect follows.
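The effect is easy to demonstrate numerically: multiplying sigmoid derivatives layer by layer shrinks the gradient exponentially. A minimal sketch (a deep chain of sigmoid units with unit weights and no biases, purely illustrative):

Python

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

activation = 0.5   # input value
grad = 1.0         # accumulated gradient from the chain rule
for layer in range(20):
    z = activation                           # unit weight, no bias
    activation = sigmoid(z)
    grad *= activation * (1 - activation)    # sigmoid'(z) is at most 0.25
print(grad)   # on the order of 1e-13: the gradient has effectively vanished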

ReLU

● ReLU stands for rectified linear activation unit and is considered one of the few milestones
in the deep learning revolution. It is simple yet really better than its predecessor activation
functions such as sigmoid or tanh.

● ReLU activation function formula

● Now how does ReLU transform its input? It uses this simple formula:

● f(x)=max(0,x)

● Both the ReLU function and its derivative are monotonic. The function returns 0 if it receives any negative input, but for any positive value x, it returns that value back. Thus it gives an output with a range from 0 to infinity.

● Now let us give some inputs to the ReLU activation function, see how it transforms them, and then plot them.
● First, let us define a ReLU function:

def ReLU(x):
    if x > 0:
        return x
    else:
        return 0

● The Rectified Linear Unit is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for any positive value x it returns that value back. So it can be written as f(x) = max(0, x).
● Graphically it looks like this

● It's surprising that such a simple function (and one composed of two linear pieces) can
allow your model to account for non-linearities and interactions so well. But the ReLU
function works great in most applications, and it is very widely used as a result.

● Why It Works
● Introducing Interactions and Non-linearities
● Activation functions serve two primary purposes: 1) Help a model account for interaction
effects.
What is an interaction effect? It is when one variable A affects a prediction differently
depending on the value of B. For example, if my model wanted to know whether a certain
body weight indicated an increased risk of diabetes, it would have to know an individual's
height. Some bodyweights indicate elevated risks for short people, while indicating good
health for tall people. So, the effect of body weight on diabetes risk depends on height,
and we would say that weight and height have an interaction effect.
● 2) Help a model account for non-linear effects. This just means that if I graph a variable
on the horizontal axis, and my predictions on the vertical axis, it isn't a straight line. Or
said another way, the effect of increasing the predictor by one is different at different values
of that predictor.
● How ReLU captures Interactions and Non-Linearities
● Interactions: Imagine a single node in a neural network model. For simplicity, assume it
has two inputs, called A and B. The weights from A and B into our node are 2 and 3
respectively. So the node output is f(2A + 3B). We'll use the ReLU function for
our f. So, if 2A + 3B is positive, the output value of our node is also 2A + 3B.
If 2A + 3B is negative, the output value of our node is 0.
● For concreteness, consider a case where A=1 and B=1. The output is 2A + 3B = 5, and
if A increases, then the output increases too. On the other hand, if B=-100 then the output
is 0, and if A increases moderately, the output remains 0. So A might increase our output,
or it might not. It just depends what the value of B is.
● This is a simple case where the node captured an interaction. As you add more nodes and
more layers, the potential complexity of interactions only increases. But you should now
see how the activation function helped capture an interaction.

● Non-linearities: A function is non-linear if the slope isn't constant. So, the ReLU function
is non-linear around 0, but the slope is always either 0 (for negative values) or 1 (for
positive values). That's a very limited type of non-linearity.
● But two facts about deep learning models allow us to create many different types of non-
linearities from how we combine ReLU nodes.

● First, most models include a bias term for each node. The bias term is just a constant
number that is determined during model training. For simplicity, consider a node with a
single input called A, and a bias. If the bias term takes a value of 7, then the node output is
f(7+A). In this case, if A is less than -7, the output is 0 and the slope is 0. If A is greater
than -7, then the node's output is 7+A, and the slope is 1.
● So the bias term allows us to move where the slope changes. So far, it still appears we can
have only two different slopes.

● However, real models have many nodes. Each node (even within a single layer) can have
a different value for its bias, so each node can change slope at different values of our
input.

● When we add the resulting functions back up, we get a combined function that changes
slopes in many places.

● These models have the flexibility to produce non-linear functions and account for
interactions well (if that will give better predictions). As we add more nodes in each layer
(or more convolutions if we are using a convolutional model) the model gets even greater
ability to represent these interactions and non-linearities.
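A minimal sketch makes this concrete: summing a few ReLU nodes, each with its own
(arbitrarily chosen) bias and output weight, yields a combined function whose slope changes at
several places:

import numpy as np
import matplotlib.pyplot as plt

relu = lambda z: np.maximum(z, 0.0)
a = np.linspace(-10, 10, 200)

# three nodes with different biases, combined with different output weights
out = 1.0 * relu(a + 7) - 1.5 * relu(a - 1) + 0.5 * relu(a - 4)

plt.plot(a, out)   # piecewise-linear: the slope changes at a = -7, 1 and 4
plt.show()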

Heuristics for Avoiding Bad Local Minima


A local minimum is a suboptimal equilibrium point at which the system error is non-zero and the
hidden output matrix is singular. A complex problem with a large number of patterns needs as
many hidden nodes as patterns in order to avoid a singular hidden output matrix.

Regularization of Deep Learning

Regularization is a technique used to address overfitting, either by modifying the model's
training process (for example, adding a penalty to the loss) or by changing the effective
architecture of the model during training. The following are the commonly used
regularization techniques:

1. L2 regularization
2. L1 regularization
3. Dropout regularization

Here’s a look at each in detail.

L2 regularization

In regression analysis, L2 regularization is also called ridge regression. In this type of
regularization, the squared magnitude of the coefficients or weights, multiplied by a regularizer
term, is added to the loss or cost function. L2 regularization can be represented with the following
equation:

Loss = Error(y, ŷ) + λ Σ_i (w_i)²

In the above equation, Error(y, ŷ) is the original (unregularized) loss, w_i are the model weights,
and λ (lambda) is the regularizer term that controls the strength of the penalty.

You can see that a fraction of the sum of squared values of weights is added to the loss function.
Thus, when gradient descent is applied to the loss, the weight updates remain consistent, giving
almost equal emphasis to all features. You can observe the following:

● Lambda is the hyperparameter that is tuned to prevent overfitting i.e. penalize the
insignificant weights by forcing them to be small but not zero.
● L2 regularization works best when all the weights are roughly of the same size, i.e., input
features are of the same range.
● This technique also helps the model to learn more complex patterns from data without
overfitting easily.
L1 regularization

L1 regularization is also referred to as lasso regression. In this type of regularization, the absolute
value of the magnitude of the coefficients or weights, multiplied by a regularizer term, is added to
the loss or cost function. It can be represented with the following equation:

Loss = Error(y, ŷ) + λ Σ_i |w_i|

In the above equation, Error(y, ŷ) is the original loss, w_i are the model weights, and λ is the
regularizer term.

A fraction of the sum of absolute values of weights is added to the loss function in L1
regularization. In this way, you will be able to eliminate some coefficients with lesser values by
pushing those values towards 0. You can observe the following by using L1 regularization:

● Since the L1 regularization adds an absolute value as a penalty to the cost function, the
feature selection will be done by retaining only some important features and eliminating
the lower or unimportant features.
● This technique is considered more robust to outliers, i.e., the model is less strongly
influenced by extreme values in the dataset.
● This technique will not be able to learn complex patterns from the input data.
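A minimal sketch of both penalties, using a hypothetical helper that adds the chosen penalty to an
already-computed base loss (the names and lambda value are illustrative):

import numpy as np

def regularized_loss(base_loss, weights, lam=0.01, kind="l2"):
    # flatten all weight arrays into one vector
    w = np.concatenate([p.ravel() for p in weights])
    penalty = np.sum(np.abs(w)) if kind == "l1" else np.sum(w ** 2)
    return base_loss + lam * penalty

# usage with toy weights:
# regularized_loss(0.35, [np.array([0.5, -1.2]), np.array([2.0])], kind="l1")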

Dropout regularization

Dropout regularization is a technique in which some of the neurons are randomly disabled during
training so that the model extracts more useful, robust features from the data. This
prevents overfitting. You can see the dropout regularization in the following diagram:

● In figure (a), the neural network is fully connected. If all the neurons are trained with the
entire training dataset, some neurons might memorize the patterns occurring in training
data. This leads to overfitting since the model is not generalizing well.
● In figure (b), the neural network is sparsely connected, i.e., only some neurons are active
during the model training. This forces the neurons to extract robust features/patterns from
training data to prevent overfitting.

The following are the characteristics of dropout regularization:

● Dropout randomly disables some percent of neurons in each layer. So for every epoch,
different neurons will be dropped leading to effective learning.
● Dropout is applied by specifying the ‘p’ values, which is the fraction of neurons to be
dropped.
● Dropout reduces the dependencies of neurons on other neurons, resulting in more robust
model behavior.
● Dropout is applied only during the model training phase and is not applied during the
inference phase.
● When the model receives complete data at inference time, the layer outputs 'x' need to be
scaled by (1 − p), since during training only a (1 − p) fraction of the neurons contributed
to each layer's output; this keeps the expected magnitude of the activations consistent
between training and inference.
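As a sketch, here is the widely used "inverted dropout" variant, which scales activations by
1/(1 − p) during training so that no rescaling is needed at inference (a design alternative to the
inference-time scaling described above):

import numpy as np

def dropout_forward(x, p=0.5, training=True):
    if not training:
        return x                                   # dropout is not applied at inference
    keep = np.random.rand(*x.shape) >= p           # drop a fraction p of the neurons
    return x * keep / (1.0 - p)                    # scale up so expected activations match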
These are some of the most popular regularization techniques that are used to reduce overfitting
during model training. They can be applied according to the use case or dataset being considered
for more accurate model performance on the testing data.
Adversarial Training

What is an Adversarial Example?

Adversarial Training is a technique that has been developed to protect Machine Learning
models from Adversarial Examples. Let’s briefly recall what Adversarial Examples are. These are
inputs that are very slightly and cleverly perturbed (such as an image, text, or sound) in a way that
is imperceptible to humans but will be misclassified by a machine learning model.

What is astonishing about these attacks is the model's confidence in its incorrect prediction. The
classic "panda" example illustrates this well: while the model only has a confidence of 57.7% for
the correct prediction on the clean image, it exhibits a very high confidence of 99.3% for the
incorrect prediction on the imperceptibly perturbed one.

These attacks are very problematic. For example, an article published in Science in 2019 by
researchers from Harvard and MIT demonstrates how medical AI systems could be vulnerable to
adversarial attacks. That’s why it’s necessary to defend against them. This is where Adversarial
Training comes in. It, along with ‘Defensive Distillation,’ is the primary technique to protect
against these attacks.

How does Adversarial Training work?

How does this technique work? It involves retraining the Machine Learning model with numerous
Adversarial Examples. Indeed, during the training phase of a predictive model, if the input is
misclassified by the Machine Learning model, the algorithm learns from its mistakes and adjusts
its parameters to avoid making them again.
Thus, after initially training the model, the model’s creators generate numerous Adversarial
Examples. They expose their own model to these contradictory examples to prevent it from
making these mistakes again.
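One standard way such adversarial examples are generated (an illustration, not the only method)
is the Fast Gradient Sign Method (FGSM); a minimal PyTorch sketch, assuming the model, loss
function, inputs x and labels y are supplied by the caller:

import torch

def fgsm_example(model, loss_fn, x, y, eps=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # perturb each pixel by eps in the direction that increases the loss
    return (x + eps * x.grad.sign()).detach()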

While this method defends Machine Learning models against some Adversarial Examples, does it
generalize the model’s robustness to all Adversarial Examples? The answer is no. This approach
is generally insufficient to stop all attacks because the range of possible attacks is too wide and
cannot be generated in advance. Thus, it often becomes a race between hackers generating new
adversarial examples and designers protecting against them as quickly as possible.

In a more general sense, it is very difficult to protect models against adversarial examples because
it is nearly impossible to construct a theoretical model of how these examples are created. It would
involve solving particularly complex optimization problems, and we do not have the necessary
theoretical tools.

All strategies tested so far fail because they are not adaptive: they may block one type of attack
but leave another vulnerability open to an attacker who knows the defense used. Designing a
defense capable of protecting against a powerful and adaptive attacker is an important research
area.

In conclusion, Adversarial Training alone generally fails to fully protect Machine Learning models
against Adversarial Attacks. If we were to highlight one reason, it's because this technique
provides defense against a specific set of attacks without achieving a generalized method.
Optimization for deep learning

Optimizers and loss functions are two components that help improve the performance of the model.
By calculating the difference between the expected and actual outputs of a model, a loss function
evaluates the effectiveness of a model.

Optimization Rule in Deep Neural Networks


There are various optimization techniques to change model weights and learning rates, like
Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient descent with momentum,
Mini-Batch Gradient Descent, AdaGrad, RMSProp, AdaDelta, and Adam. These optimization
techniques play a critical role in the training of neural networks, as they help improve the model
by adjusting its parameters to minimize the loss of function value. Choosing the best optimizer
depends on the application.
Before we proceed, it's essential to acquaint yourself with a few terms:
1. An epoch is one complete pass of the algorithm over the entire training dataset.
2. Batch size refers to the number of samples used for each update of the model parameters.
3. A sample is a single record of data in a dataset.
4. Learning rate is a parameter determining the scale of the model weight updates.
5. Weights and biases are learnable parameters in a model that regulate the signal between two
neurons.
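To make these terms concrete, here is a minimal sketch of a single vanilla gradient-descent
update, assuming the gradients have already been computed by backpropagation (names are
illustrative):

import numpy as np

def sgd_step(weights, grads, learning_rate=0.01):
    # move each parameter a small step against its gradient
    return [w - learning_rate * g for w, g in zip(weights, grads)]

# usage on a toy parameter:
# sgd_step([np.array([1.0, 2.0])], [np.array([0.5, -0.5])])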
UNIT – III

Convolutional Neural Network

A convolutional neural network (CNN) is a type of artificial neural network that uses
convolutional layers to process and analyze data such as images, text, and audio.
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that
enables a computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural
networks are used on various datasets like images, audio, and text. Different types of neural
networks are used for different purposes: for example, for predicting a sequence of words we
use Recurrent Neural Networks, more precisely an LSTM; similarly, for image classification we
use Convolutional Neural Networks. In this section, we are going to build the basic building
blocks of a CNN.

Neural Networks: Layers and Functionality


In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The number of neurons in
this layer is equal to the total number of features in our data (number of pixels in the case of
an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden layer. There can be
many hidden layers depending on our model and data size. Each hidden layer can have different
numbers of neurons which are generally greater than the number of features. The output from
each layer is computed by matrix multiplication of the output of the previous layer with
learnable weights of that layer and then by the addition of learnable biases followed by
activation function which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like
sigmoid or softmax which converts the output of each class into the probability score of each
class.
The process in which data is fed into the model and the output of each layer is obtained as in the
steps above is called feedforward. We then calculate the error using an error function; some
common error functions are cross-entropy, squared loss error, etc. The error function measures
how well the network is performing. After that, we backpropagate through the model by
calculating the derivatives. This step is called backpropagation, and it is used to minimize the loss.
Convolution Neural Network
Convolutional Neural Networks (CNNs) are an extended version of artificial neural networks
(ANNs), predominantly used to extract features from grid-like matrix datasets, for example
visual datasets like images or videos, where spatial data patterns play an extensive role.
CNN Architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Simple CNN architecture

The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
How Convolutional Layers Works?
Convolutional Neural Networks, or convnets, are neural networks that share their parameters.
Imagine you have an image. It can be represented as a cuboid having a length and width (the
dimensions of the image) and a height (the channels: images generally have red, green, and blue
channels).

Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel on it, with say, K outputs and representing them vertically. Now slide that neural network
across the whole image, as a result, we will get another image with different widths, heights, and
depths. Instead of just R, G, and B channels now we have more channels but lesser width and
height. This operation is called Convolution. If the patch size is the same as that of the image it
will be a regular neural network. Because of this small patch, we have fewer weights.



Mathematical Overview of Convolution
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
● Convolution layers consist of a set of learnable filters (or kernels) having small widths and
heights and the same depth as that of input volume (3 if the input layer is image input).
● For example, if we have to run a convolution on an image with dimensions 34x34x3, the
possible size of the filters can be a x a x 3, where 'a' can be anything like 3, 5, or 7, but smaller
than the image dimensions.
● During the forward pass, we slide each filter across the whole input volume step by step where
each step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional
images) and compute the dot product between the kernel weights and patch from input volume.
● As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
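A minimal NumPy sketch of this sliding dot product for a single filter, assuming an (H, W, C)
image and a (kh, kw, C) kernel (names are illustrative):

import numpy as np

def conv2d_single(image, kernel, stride=1):
    kh, kw = kernel.shape[:2]
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of kernel weights and patch
    return out

# stacking the 2-D outputs of several such filters gives the output volume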
Layers Used to Build ConvNets
A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is
a sequence of layers, and every layer transforms one volume into another through a differentiable
function.

Types of layers:


Let’s take an example by running a convnet on an image of dimension 32 x 32 x 3.
● Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input
will be an image or a sequence of images. This layer holds the raw input of the image with
width 32, height 32, and depth 3.
● Convolutional Layers: This is the layer which is used to extract features from the input
dataset. It applies a set of learnable filters, known as kernels, to the input images. The
filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over
the input image data and computes the dot product between the kernel weights and the
corresponding input image patch. The output of this layer is referred to as feature maps.
Suppose we use a total of 12 filters for this layer; we’ll get an output volume of dimension
32 x 32 x 12.
● Activation Layer: By adding an activation function to the output of the preceding layer,
activation layers add nonlinearity to the network. This layer applies an element-wise activation
function to the output of the convolution layer. Some common activation functions are ReLU:
max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume
will have dimensions 32 x 32 x 12.
● Pooling Layer: This layer is periodically inserted in the convnet; its main function is to
reduce the size of the volume, which makes the computation faster, reduces memory use, and
also helps prevent overfitting. Two common types of pooling layers are max pooling and
average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will
be of dimension 16x16x12; a minimal sketch of this operation follows.
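A minimal NumPy sketch of 2 x 2 max pooling with stride 2 on an (H, W, C) volume:

import numpy as np

def max_pool2d(x, size=2, stride=2):
    h, w, c = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow, c))
    for i in range(oh):
        for j in range(ow):
            # keep only the maximum value in each window, per channel
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max(axis=(0, 1))
    return out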
● Flattening: The resulting feature maps are flattened into a one-dimensional vector after
the convolution and pooling layers so they can be passed into a fully connected layer
for categorization or regression.
● Fully Connected Layers: This layer takes the input from the previous layer and computes
the final classification or regression task.

● Output Layer: The output from the fully connected layers is then fed into a logistic function
for classification tasks like sigmoid or softmax which converts the output of each class into the
probability score of each class.
Advantages and Disadvantages of Convolutional Neural Networks (CNNs)
Advantages of CNNs:
1. Good at detecting patterns and features in images, videos, and audio signals.
2. Robust to translation, rotation, and scaling of the input.
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of CNNs:
1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper regularization is used.
3. Requires large amounts of labeled data.
4. Interpretability is limited, it’s hard to understand what the network has learned.

AlexNet:
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark
model that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It
introduced several innovative ideas that shaped the future of CNNs.
AlexNet Architecture:
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It uses
traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.
● The architecture employs overlapping pooling layers to reduce spatial dimensions while
retaining the spatial relationships among neighbouring features.
● Activation function: AlexNet uses the ReLU activation function and dropout regularization,
which enhance the model’s ability to capture non-linear relationships within the data.
The key features of AlexNet are as follows:-
● AlexNet was created to be more computationally efficient than earlier CNN topologies. It
introduced parallel computing by utilising two GPUs during training.
● AlexNet is a relatively shallow network compared to GoogleNet. It has eight layers, which
makes it simpler to train and less prone to overfitting on smaller datasets.
● In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It outperformed prior CNN architectures greatly and set the
path for the rebirth of deep learning in computer vision.
● Several architectural improvements were introduced by AlexNet, including the use of rectified
linear units (ReLU) as activation functions, overlapping pooling, and dropout regularisation.
These strategies aided in the improvement of performance and generalisation.

Let’s consider an image classification task of various dog breeds. AlexNet’s convolutional layers
learn features such as edges, textures, and shapes to distinguish between different dog breeds. The
fully connected layers then analyze these learned features and make predictions.

ZFNet

Rob Fergus and Matthew D. Zeiler introduced ZFNet, which is named after their surnames
(Zeiler and Fergus). ZFNet was a slight improvement over AlexNet, and it won the 2013 ILSVRC.
Its key contribution was visualizing how each layer of AlexNet performs and which parameters
can be tuned to achieve greater accuracy.
(Image from the original paper: https://arxiv.org/pdf/1311.2901.pdf)

Some Key Features of ZFNet architecture

· Convolutional Layers:
In these layers, convolutional filters are applied to extract important features; ZFNet consists of
multiple convolutional layers.

· MaxPooling Layers:

MaxPooling layers are used to downsample the spatial dimensions of the feature maps. They use
the maximum as the aggregation function.

· Rectified Linear Unit:

ReLU is used after each convolution layer to introduce non-linearity into the model, which is
crucial for learning complex patterns. It rectifies the feature maps, ensuring they are always
non-negative.

· Fully Connected Layers:

In the latter part of the ZFNet architecture, fully connected dense layers are used to extract
patterns from the features. The activation function used in these neurons is ReLU.

· SoftMax Activation:

SoftMax activation is used in the last layer to obtain the probabilities of the image belonging to
the 1000 classes.

· Deconvolution Layers:

ZFNet introduced a visualization technique involving deconvolutional layers (transposed
convolution layers). These layers provide insight into what the network has learned by projecting
feature activations back into the input pixel space.

Architecture:

ZFNet Architecture


Input

· The input image is of size 224x224x3.

First Layer

· In the first layer, 96 filters of size 7x7 with a stride of 2 are used to convolve the input,
followed by ReLU activation.
The output feature map is then passed through a max pooling layer with a pool kernel of 3x3 and
a stride of 2. Then the features are contrast normalized.

Second layer

· In the second layer, 256 filters of size 5x5 are applied with a stride of 2. Again, the obtained
feature map is passed through a max pooling layer with a pooling kernel of 3x3 and a stride of 2.
After that, the features are contrast normalized.

Third layer and Fourth Layer

· The third and fourth layers are identical, with 384 kernels of size 3x3, padding set to 'same',
and stride set to 1.

Fifth Layer

· In the fifth layer, 256 filters of size 3x3 are applied with stride 1. Then a max pooling kernel
of size 3x3 is applied with a stride of 2, and the features are contrast normalized.

Sixth Layer and Seventh Layer

· The sixth and seventh layers are fully connected dense layers with 4096 neurons each.

Eighth Layer

· The last layer is a dense layer with 1000 neurons (the number of classes).


VGG-16
The VGG-16 model is a convolutional neural network (CNN) architecture that was proposed by
the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its depth,
consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers. VGG-16
is renowned for its simplicity and effectiveness, as well as its ability to achieve strong performance
on various computer vision tasks, including image classification and object recognition. The
model’s architecture features a stack of convolutional layers followed by max-pooling layers, with
progressively increasing depth. This design enables the model to learn intricate hierarchical
representations of visual features, leading to robust and accurate predictions. Despite its simplicity
compared to more recent architectures, VGG-16 remains a popular choice for many deep learning
applications due to its versatility and excellent performance.
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition in
computer vision where teams tackle tasks including object localization and image classification.
VGG16, proposed by Karen Simonyan and Andrew Zisserman in 2014, achieved top ranks in both
tasks, detecting objects from 200 classes and classifying images into 1000 categories.

VGG-16 architecture

This model achieves 92.7% top-5 test accuracy on the ImageNet dataset which contains 14 million
images belonging to 1000 classes.
VGG-16 Model Objective:

The ImageNet dataset contains images of a fixed size of 224x224 with RGB channels. So, we
have a tensor of (224, 224, 3) as our input. This model processes the input image and outputs a
vector of 1000 values:

ŷ = [ŷ₀, ŷ₁, ŷ₂, ..., ŷ₉₉₉]

This vector represents the classification probability for the corresponding class. Suppose we have
a model that predicts that the image belongs to class 0 with probability 0.1, class 1 with
probability 0.05, class 2 with probability 0.05, class 3 with probability 0.03, class 780 with
probability 0.72, class 999 with probability 0.05, and all other classes with probability 0.
The classification vector for this will be:

ŷ = [ŷ₀ = 0.1, 0.05, 0.05, 0.03, ..., ŷ₇₈₀ = 0.72, ..., ŷ₉₉₉ = 0.05]

To make sure these probabilities add to 1, we use the softmax function, defined as follows:

ŷ_i = e^(z_i) / Σ_j e^(z_j)

After this we take the 5 most probable candidates into the vector:

C = [780, 0, 1, 2, 999]

and our ground truth vector is defined as follows:

G = [G₁, G₂, G₃] = [780, 2, 999]

Then we define our error function as follows:

E = (1/n) Σ_k min_i d(c_i, G_k)

It calculates the minimum distance between each ground truth class and the predicted candidates,
where the distance function d is defined as:
● d = 0 if c_i = G_k
● d = 1 otherwise
So, the loss for this example is:

E = (1/3) (min_i d(c_i, G₁) + min_i d(c_i, G₂) + min_i d(c_i, G₃)) = (1/3)(0 + 0 + 0) = 0

Since all the categories in the ground truth are in the predicted top-5 vector, the loss becomes 0.

VGG Architecture:
The VGG-16 architecture is a deep convolutional neural network (CNN) designed for image
classification tasks. It was introduced by the Visual Geometry Group at the University of Oxford.
VGG-16 is characterized by its simplicity and uniform architecture, making it easy to understand
and implement.
The VGG-16 configuration typically consists of 16 layers, including 13 convolutional layers and
3 fully connected layers. These layers are organized into blocks, with each block containing
multiple convolutional layers followed by a max-pooling layer for downsampling.
VGG-16 architecture Map

Here’s a breakdown of the VGG-16 architecture based on the provided details:


1. Input Layer:
1. Input dimensions: (224, 224, 3)
2. Convolutional Layers (64 filters, 3×3 filters, same padding):
1. Two consecutive convolutional layers with 64 filters each and a filter size of 3×3.
2. Same padding is applied to maintain spatial dimensions.
3. Max Pooling Layer (2×2, stride 2):
1. Max-pooling layer with a pool size of 2×2 and a stride of 2.
4. Convolutional Layers (128 filters, 3×3 filters, same padding):
1. Two consecutive convolutional layers with 128 filters each and a filter size of 3×3.
5. Max Pooling Layer (2×2, stride 2):
1. Max-pooling layer with a pool size of 2×2 and a stride of 2.
6. Convolutional Layers (256 filters, 3×3 filters, same padding):
1. Three consecutive convolutional layers with 256 filters each and a filter size of 3×3.
7. Max Pooling Layer (2×2, stride 2):
1. Max-pooling layer with a pool size of 2×2 and a stride of 2.
8. Convolutional Layers (512 filters, 3×3 filters, same padding):
1. Three consecutive convolutional layers with 512 filters each and a filter size of 3×3.
2. Followed by a max-pooling layer with a pool size of 2×2 and a stride of 2.
9. Convolutional Layers (512 filters, 3×3 filters, same padding):
1. Three more consecutive convolutional layers with 512 filters each and a filter size of 3×3.
2. Followed by a max-pooling layer with a pool size of 2×2 and a stride of 2.
10. Flattening:
1. Flatten the output feature map (7x7x512) into a vector of size 25088.
11. Fully Connected Layers:
1. Three fully connected layers with ReLU activation.
2. First layer with input size 25088 and output size 4096.
3. Second layer with input size 4096 and output size 4096.
4. Third layer with input size 4096 and output size 1000, corresponding to the 1000 classes
in the ILSVRC challenge.
5. Softmax activation is applied to the output of the third fully connected layer for
classification.
This architecture follows the specifications provided, including the use of ReLU activation
function and the final fully connected layer outputting probabilities for 1000 classes using softmax
activation.
VGG-16 Configuration:
The main difference between VGG-16 configurations C and D lies in the filter sizes used in some
of the convolutional layers. While both versions predominantly use 3×3 filters, in configuration C
there are instances where 1×1 filters are used; configuration D replaces these with 3×3 filters.
This slight variation results in a difference in the number of parameters, with version D having a
slightly higher number of parameters than version C. However, both versions maintain the overall
architecture and principles of the VGG-16 model.
Different VGG Configuration

Object Localization In Image:


To perform localization, we need to replace the class scores by bounding box location coordinates.
A bounding box location is represented by a 4-D vector (center coordinates (x, y), height, width).
There are two versions of the localization architecture: in one, the bounding box is shared among
the different candidates (the output is a 4-parameter vector); in the other, the bounding box is
class-specific (the output is a 4000-parameter vector, 4 per class). The paper experimented with
both approaches on the VGG-16 (D) architecture. Here we also need to change the loss from a
classification loss to a regression loss function (such as MSE) that penalizes the deviation of the
predicted bounding box from the ground truth.
Results: VGG-16 was one of the best performing architectures in the ILSVRC 2014 challenge.
It was the runner-up in the classification task with a top-5 classification error of 7.32% (only
behind GoogLeNet, with a classification error of 6.66%). It was also the winner of the
localization task with a 25.32% localization error.
Limitations Of VGG 16:
● It is very slow to train (the original VGG model was trained on Nvidia Titan GPU for 2-3
weeks).
● The size of the VGG-16 trained ImageNet weights is 528 MB, so it takes quite a lot of disk
space and bandwidth, which makes it inefficient.
● With 138 million parameters and considerable depth, training is slow and can suffer from
unstable gradients.
Further advancements: ResNets were later introduced; their skip connections address the
gradient problems that afflict very deep networks such as VGG-16.
Deconvolution

Usually, images acquired by a vision system suffer from degradation that can be modelled as a
convolution. For example, some images present a camera shake effect (Fig. 100) or a blur due to
poor focus (Fig. 101). The goal of deconvolution is to cancel the effect of a convolution.

Fig. 100 An example of motion blur (the parliament of Budapest shot by a camera).

Fig. 101 Hubble’s view of Ganymede in 1996.


The degradation phenomenon is modelled as in Fig. 102: The observed image y is degraded by the
convolution with a PSF h and, possibly, by a noise b (considered to be additive).

y=h∗x+b
The deconvolution computes a deconvolved image x^ from the observation y. We will consider
only linear methods, thus deconvolution comes to filtering by g:

x^=g∗y
Deconvolution model.
Deconvolution needs a degradation model, thus having knowledge about both h and b.

● The PSF h can be estimated by observation, i.e. by finding in the image some factors to
estimate h. For example, a single point object in the image is h. The PSF can also be
estimated by experimentation by reproducing the observation conditions in a laboratory.
So, the image of a pulse gives an estimate of h. Finally, it is also possible to estimate the
PSF from a mathematical model of the physics of the observation. Note also that some
deconvolution methods estimate the PSF h at the same time as x: these are called blind
deconvolution methods (French: déconvolution myope).
● Models for the noise have already been presented in chapter denoising.

Inverse filter
The inverse filter is the simplest deconvolution method. Since the degradation is
modelled as y = h∗x + b, this equation becomes, in the Fourier domain:

Y = HX + B

so we can write:

X = (Y − B) / H.

We obtain x by calculating the inverse Fourier transform of the previous expression:

x = F⁻¹[(Y − B) / H].

As the noise (and therefore its spectrum B) is unknown, we can approximate the expression of x
by cancelling B in the previous expression, and thus get the deconvolved image:

x̂ = F⁻¹[Y / H]

The result of the inverse filter applied to an image is given in Fig. 103. The result is not usable,
and yet the observed image is only slightly blurred, with very little noise!

Fig. 103 Result of the inverse filter.


The catastrophic result of the inverse filter is due to having considered the noise to be
zero. Indeed, according to the definition of x̂ and considering Y = HX + B, then:

x̂ = F⁻¹[Y / H] = F⁻¹[X + B / H] = x + F⁻¹[B / H]

Thus, the deconvolved image x̂ corresponds to x with an additional term, which is the inverse
Fourier transform of B/H. The transfer function H is generally low-pass, so the values of H(m,n)
tend towards 0 at high frequencies (m,n). Because H is in the denominator, this drastically
amplifies the high frequencies of the noise, and the term B/H quickly dominates X. This
explains the result of Fig. 103.

One solution consists in considering only the low frequencies of Y/H. This is equivalent to
truncating the result given by the inverse filter by cancelling the high frequencies before
calculating the inverse Fourier transform. The result of the deconvolution is much more acceptable,
as shown by Fig. 104, although the result is still far from perfect (there are many variations in
intensity around objects, such as tree trunks)…

Fig. 104 Result of the truncated inverse filter with very small noise.
Wiener Filter
Wiener filter, denoted by g (with Fourier transform G), applies to the observation y such that:

x^=g∗y⇔X^=GY.

This filter is established in the statistical framework: the image x and the noise b are considered to
be random variables. They are also assumed to be statistically independent. As a result, the
observation y and the estimate x^ are also random variables.

The calculations are done in the Fourier domain for simplicity (since convolutions become
multiplications). The goal of Wiener filter is to find the image X^=F[x^] closest to X=F[x], in the
sense of the mean squared error MSE=E[(X^−X)2]. Thereby :

MSE=E[(X^−X)2]=E[(GY−X)2]=E[(G(HX+B)−X)2]=E[((GH−I)X+GB)2]

where I is an image filled with 1. So:

MSE=E[(GH−I)∗(GH−I)X∗X+(GH−I)∗GX∗B+G∗(GH−I)B∗X+G∗GB∗B]

where ⋅∗ denotes the conjugate of the variables. Since the expectation E is linear and
only X and B are random variables, we can decompose the previous expression into four terms:

MSE=(GH−I)∗(GH−I)E[X∗X]+(GH−I)∗GE[X∗B]+G∗(GH−I)E[B∗X]+G∗GE[B∗B].

Since X and B are independent, then the covariances E[X∗B] and E[B∗X] are zeros.
Moreover, E[X∗X] and E[B∗B] are the power spectral densities denoted as Sx and Sb (the power
spectral density is the expectation of the square of the modulus of the Fourier transform). So the
mean squared error simplifies into:
MSE=(GH−1)∗(GH−1)Sx+G∗GSb

We look for the filter G that minimizes the MSE, or equivalently, that cancels the derivative of
MSE:

∂MSE/∂G = (GH − 1)* H Sx + G* Sb = 0
⇔ G* H* H Sx − H Sx + G* Sb = 0
⇔ G* (H* H Sx + Sb) = H Sx
⇔ G* = H Sx / (H* H Sx + Sb)
⇔ G = H* Sx / (H* H Sx + Sb)
⇔ G = H* Sx / (|H|² Sx + Sb)

Here we are, we get the expression of the Wiener filter G! 🥳 Finally, the deconvolved image is
the inverse Fourier transform of GY:

x̂ = F⁻¹[ H* Sx / (|H|² Sx + Sb) · Y ]

We can consider that the power spectral densities Sx and Sb are constant (for Sb, it is necessary to
assume white noise). Thus, the expression of the Wiener filter can be written

x̂ = F⁻¹[ H* / (|H|² + Sb/Sx) · Y ]

and the term Sb/Sx is replaced by a constant K, which becomes the parameter of the method, to be
set by the user.

Two remarks:

● where H vanishes (typically at high frequencies), the noise amplification problem of the
inverse filter is no longer observed, since the Wiener filter itself tends towards 0;
● moreover, if the noise in the image is zero, then Sb = 0 and the Wiener filter reduces to the
inverse filter:

x̂ = F⁻¹[ H* / |H|² · Y ] = F⁻¹[ Y / H ]

The result of the Wiener filter is presented in Fig. 105: it is clearly much better than the inverse
filter, even its truncated version!
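A minimal NumPy sketch of this constant-K Wiener deconvolution, assuming a grayscale
observation y and a PSF h given as arrays:

import numpy as np

def wiener_deconvolve(y, h, K=0.01):
    H = np.fft.fft2(h, s=y.shape)            # transfer function of the PSF
    Y = np.fft.fft2(y)
    G = np.conj(H) / (np.abs(H) ** 2 + K)    # G = H* / (|H|^2 + K), with K = Sb/Sx
    return np.real(np.fft.ifft2(G * Y))      # deconvolved image x_hat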
DeepDream

DeepDream is a deep learning technique that uses neural networks to create images that activate
specific layers in a network. This technique is also known as Inceptionism.

Here's how DeepDream works:


1. Load an image
2. Define a number of processing scales, or "octaves"
3. Resize the image to the smallest scale
4. Run gradient ascent for each scale, starting with the smallest
5. Upscale the image to the next scale
6. Reinject the detail that was lost during upscaling
7. Repeat until the image is back to its original size
DeepDream produces images that have a dreamlike appearance, similar to a psychedelic
experience. The images can be used to understand and diagnose network behavior, and to highlight
the image features that a network has learned.

DeepDream algorithm initiates the process by forwarding a particular picture or image through the
network and then it starts measuring the gradient of the image with respect to a specific activation
layer. In the next step, the picture is adjusted in order to improve these activations and amplify the
patterns which result in a dream-like picture. This entire process is also known as Inceptionism.

The way the algorithm enhances image patterns depends heavily on how the network was trained.
Therefore, if a network has been trained to recognize faces in images, that network will also tend
to find faces in any given image, through algorithmic pareidolia.

How the DeepDream algorithm functions

Now that we have properly understood what DeepDream is, it is time for us to understand the
functions of this algorithm in more detail. Before that let us have a look at how the convolutional
neural networks work:

● First, we provide an image to the convolutional neural networks and the first layer of the
network distinguishes the low-level features such as edges.
● In the next step, the second layer of the network will try to expose the higher-level features
of the picture such as trees, cars, faces, etc.
● Lastly, the remaining layers will try to collect all of these features and complete the
interpretations so that the pictures can be categorized accordingly.

In convolutional neural networks, there are different layers available to perform different tasks.
On the other hand, in the DeepDream algorithm, we can take any particular feature (be it high level
or low level) and increase its activation so that it can have a huge impact on the image.

Let us have a look at the function of the DeepDream algorithm in detail:

● Whenever you try to give a picture (as an input) to a trained artificial neural network, the
neurons kickstart and initiate activation.
● The DeepDream algorithm tries to modify the input image and in the process, it boosts
some of the neurons more than others. We can specify the type of layer and neuron we
want to strengthen precisely.
● The process will continue until all the elements of the input image have been disclosed
appropriately.

For example, if we have used a specific layer to discover the cat faces while we have provided the
image of a cloud (as input) then, the DeepDream algorithm will meticulously convert the image
and will begin to produce cat faces on the blue sky.

Processing an image with Deep Dream

Here is a step-by-step process through which you can apply the DeepDream algorithm to any
image:

● Use an already trained ResNet, ANN, CNN, etc. to forward an image.


● Now choose a specific layer while remembering that the first layer analyses the edges
whereas the deeper layers analyze different shapes and figures.
● It is time to measure the output from the layer of interest.
● Measure the gradient of the image in regard to the already chosen activation layer.
● Adjust the image in order to amplify the activations and the image will turn out like a
dream-like hallucinated image.
● Continue to repeat this operation on multiple images.
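A minimal PyTorch-style sketch of one such gradient-ascent step, assuming a torchvision VGG-16
backbone, an arbitrarily chosen intermediate layer index, and a normalized (1, 3, H, W) input
tensor (all of these are illustrative choices; the weights argument requires a recent torchvision):

import torch
import torchvision.models as models

features = models.vgg16(weights="IMAGENET1K_V1").features.eval()
LAYER = 20          # assumption: index of an intermediate conv layer

def deepdream_step(img, lr=0.01):
    img = img.clone().requires_grad_(True)
    x = img
    for i, layer in enumerate(features):
        x = layer(x)
        if i == LAYER:
            break
    x.norm().backward()                      # amplify the chosen layer's activations
    with torch.no_grad():
        img += lr * img.grad / (img.grad.abs().mean() + 1e-8)  # normalized gradient ascent
    return img.detach()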

Hallucinations
Hallucinations in deep learning, also known as AI hallucinations, occur when an AI model
generates incorrect or misleading results. This can happen when the model is trained with
insufficient data, or when it makes incorrect assumptions or learns incorrect patterns.

Here are some examples of AI hallucinations:


● Incorrect predictions: An AI model might predict rain when there's no forecast for it.
● Image recognition: An image recognition system might see objects that aren't there.
● Nonsensical text: A language model might generate text that seems coherent but is actually
nonsensical.
● Chatbot responses: A chatbot powered by a large language model (LLM) might generate
plausible-sounding falsehoods.
AI hallucinations can be difficult to predict and can have serious consequences, especially in
critical applications like healthcare or transportation. For example, an AI-powered self-driving car
that hallucinates could cause an accident.
The term “hallucination” takes on a new meaning in artificial intelligence (AI). Unlike
its meaning in human psychology, where it relates to misleading sensory perceptions, AI
hallucination refers to AI systems generating imaginative, novel, or unexpected outputs. These
outputs frequently exceed the scope of the training data.
In this post, we will look into the concept of AI hallucination problems, causes, detections, and
prevention in the field of AI.

Causes of Artificial Intelligence (AI) Hallucinations


Some of the reasons (or causes) why Artificial Intelligence (AI) models do so are:
1. Poor-quality datasets: AI models rely on their training data. Incorrectly labelled training data
(adversarial examples), noise, bias, or errors will result in the model generating hallucinations.
2. Outdated data: The world is constantly changing. AI models trained on outdated data might
miss crucial information or trends, leading to hallucinations when encountering new situations.
3. Missing context in training (or test) data: Wrong or contradictory input may result in
hallucinations. It is in the user's control to provide the right context in the input.
More often than not, we rely on the results generated by an AI model, assuming they are accurate.
But AI models can generate convincing information that is false, and the consequences can be
serious:

1. Medical Misdiagnosis
● Missed or Wrong Diagnosis: AI-powered medical tools used for analysis (e.g., X-rays, blood
tests) could misinterpret results due to limitations in training data or unexpected variations.
This could lead to missed diagnoses of critical illnesses or unnecessary procedures based on
false positives.
● Ineffective Treatment Plans: AI-driven treatment recommendations might be based on faulty
data or fail to consider a patient’s unique medical history, potentially leading to ineffective or
even harmful treatment plans.
2. Faulty Financial Predictions
● Market Crashes: AI algorithms used for stock market analysis and trading could be swayed
by hallucinations, leading to inaccurate predictions and potentially triggering market crashes.
● Loan Denials and High-Interest Rates: AI-powered credit scoring systems could rely on
biased data, leading to unfair denials of loans or higher interest rates for qualified individuals.
3. Algorithmic Bias and Discrimination
● Unequal Opportunities: AI-driven hiring tools that rely on biased historical data could
overlook qualified candidates from underrepresented groups, perpetuating discrimination in
the workplace.
● Unfair Law Enforcement: Facial recognition software with AI hallucinations might
misidentify individuals, leading to wrongful arrests or profiling based on race or ethnicity.
How to Prevent Artificial Intelligence (AI) Hallucinations?
1. When feeding input to the model, restrict the possible outcomes by specifying the type of
response you desire. For example, instead of asking a trained LLM for 'facts about the
existence of the Mahabharata', the user can ask 'Was the Mahabharata real, yes or no?'.
2. Specify what kind of information you are looking for.
3. Rather than only specifying what information you require, also list what information you
don't want.
4. Last but not least, verify the output given by an AI model.
So there is an immediate need to develop algorithms or methods to detect and remove
Hallucination from AI models or at least decrease its impact.

CAM, Grad-CAM
Class Activation Mapping (CAMs)

For a particular class (or category), class activation mapping indicates the discriminative
region of the image which influenced the deep learning model to make its decision. The
architecture is very similar to a convolutional neural network: it comprises several convolution
layers, with the layer just before the final output performing Global Average Pooling. The
resulting features are fed into a fully connected layer governed by the softmax activation
function, which outputs the required probabilities. The importance of the weights with respect to
a category can be found by projecting the weights back onto the last convolution layer's feature
maps.

Global Average Pooling (GAP) vs Global Max Pooling (GMP)

The Global Average Pooling (GAP) is preferred over Global Max Pooling (GMP) because GAP
helps us in identifying the full extent of the object as compared to the GMP layer, which identifies
one discriminative part. In Global Average Pooling, an average is taken across all the activation
maps which help us to find all the possible discriminative regions present in them. Contrary to this,
the Global Max Pooling method just considers the most discriminative region. Hence, Global
Average Pooling showed better results than Global Max Pooling.
Mathematical equations governing CAMs

Let f_k(x, y) be the activation of unit k in the last convolutional layer at spatial location (x, y).

The result of GAP for unit k is represented as:

F_k = Σ_{x,y} f_k(x, y)

For a class c, the input to the softmax will be:

S_c = Σ_k w_k^c F_k

where w_k^c is the weight connecting unit k to class c. The output of the softmax layer is:

P_c = exp(S_c) / Σ_{c'} exp(S_{c'})

Thus, the final equation for the class activation map of class c is:

M_c(x, y) = Σ_k w_k^c f_k(x, y)
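A minimal NumPy sketch of computing M_c, assuming feature_maps with shape (K, H, W) from
the last conv layer and class_weights holding the K weights w_k^c for the chosen class (variable
names are illustrative):

import numpy as np

def class_activation_map(feature_maps, class_weights):
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # sum_k w_k^c * f_k(x, y)
    cam = np.maximum(cam, 0)                                 # keep positive evidence (optional)
    return cam / (cam.max() + 1e-8)                          # normalize for visualization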

Weakly-supervised Object Localization

The localization ability of the CAM method was put to the test by training on the
ILSVRC 2014 benchmark dataset. The CAM technique was applied to popular CNN models like
AlexNet, VGGNet and GoogLeNet by tweaking their architectures and fitting a GAP layer (as in
the CAM architecture) towards the end. The modified models gave impressive discriminative
localization results compared with their traditional architectures.

Deep Features for Generic Localization

After applying the CAM architecture to fine-grained recognition and pattern discovery (such as
discovering informative objects in scenes, concept localization in weakly labelled images, weakly
supervised text detection, and interpreting visual question answering), we can infer that feature
capture and localization were far more accurate in the CAM-based GAP architecture, as the
complete extent of the features was captured.

Visualizing Class-specific Units: When we use the GAP layer and the ranked softmax weights,
we can directly visualize the units which are the most discriminative for a particular class. Thus,
the CNN actually learns a bag of words, where each word is a discriminative class-specific unit.
A combination of these class-specific units guides the CNN in classifying each image.

Grad-CAM in Deep Learning


Gradient-weighted Class Activation Mapping is a technique used in deep learning to visualize
and understand the decisions made by a CNN. This technique unveils the hidden decisions made
by CNNs, transforming them from opaque models into transparent storytellers. Picture it as a
magic lens that paints a vivid heatmap, spotlighting the parts of an image that captivate the neural
network’s attention. How does it work? Grad-CAM decodes the importance of each feature map
for a specific class by analyzing gradients flowing into the last convolutional layer.

Grad-CAM interprets CNNs, revealing insights into predictions, aiding debugging, and enhancing
performance. It is class-discriminative and localizes relevant regions, but it lacks fine-grained
pixel-space detail.

Learning Objectives

● Understand the significance of interpretability in convolutional neural network (CNN) based
models, making them more transparent and explainable.
● Learn the fundamentals of Grad-CAM visualization (Gradient-weighted Class Activation
Mapping) as a technique for visualizing and interpreting CNN decisions.
● Gain insights into the implementation steps of Grad-CAM, enabling the generation of class
activation maps to highlight important regions in images for model predictions.
● Explore real-world applications and use cases where Grad-CAM enhances understanding and
trust in CNN predictions.
Why Grad-CAM is Required in Deep Learning?

Grad-CAM is required because it addresses the critical need for interpretability in deep learning
models, providing a way to visualize and comprehend how these models arrive at their predictions
without sacrificing the accuracy they offer in various computer vision tasks.

● Interpretability in Deep Learning: Deep neural networks, especially Convolutional Neural
Networks (CNNs), are powerful but often treated as “black boxes.” Grad-CAM visualization
helps open this black box by providing insights into why the network makes certain
predictions. Understanding model decisions is crucial for debugging, improving
performance, and building trust in AI systems.
● Balancing Interpretability and Performance: Grad-CAM helps bridge the gap between
accuracy and interpretability. It allows for understanding complex, high-performing CNN
models without compromising their accuracy or altering their architecture, thus addressing
the trade-off between model complexity and interpretability.
● Enhancing Model Transparency: By producing visual explanations, Grad-CAM enables
researchers, practitioners, and end-users to interpret and comprehend the reasoning behind
a model’s decisions. This transparency is crucial, especially in applications where AI
systems impact critical decisions, such as medical diagnoses or autonomous vehicles.
● Localization of Model Decisions: Grad-CAM generates class activation maps that
highlight which regions of an input image contribute the most to the model’s prediction of
a particular class. This localization helps visualize and understand the specific features or
areas in an image that the model focuses on when making predictions.

Grad-CAM’s Role in CNN Interpretability

Grad-CAM (Gradient-weighted Class Activation Mapping) is a technique used in the field of
computer vision, specifically in deep learning models based on Convolutional Neural Networks
(CNNs). It addresses the challenge of interpretability in these complex models by highlighting
the important regions in an input image that contribute to the network’s predictions.
Interpretability in Deep Learning

● Complexity of CNNs: While CNNs achieve high accuracy in various tasks, their inner
workings are often complex and hard to interpret.
● Grad-CAM’s Role: Grad-CAM serves as a solution by offering visual explanations, aiding
in understanding how CNNs arrive at their predictions.

Class Activation Maps (Heatmaps Generation)

Grad-CAM generates heatmaps known as Class Activation Maps. These maps highlight crucial
regions in an image responsible for specific predictions made by CNN.

Gradient Analysis

It does so by analyzing gradients flowing into the final convolutional layer of the CNN, focusing
on how these gradients impact class predictions.

Visualization Techniques (Comparison of Methods)

Grad-CAM stands out among visualization techniques due to its class-discriminative nature.
Unlike other methods, it provides visualizations specific to particular predicted classes, enhancing
interpretability.

Trust Assessment and Importance Alignment

● User Trust Validation: Studies involving human evaluations showcase Grad-CAM’s
importance in fostering user trust in automated systems by providing transparent insights
into model decisions.
● Alignment with Domain Knowledge: Grad-CAM aligns gradient-based neuron
importance with human domain knowledge, facilitating the learning of classifiers for novel
classes and grounding vision and language models.

Weakly-supervised Localization and Comparison

● Overcoming Architecture Limitations: Grad-CAM addresses limitations in certain CNN
architectures for localization tasks, offering a more versatile approach that doesn’t require
architectural modifications.
● Enhanced Efficiency: Compared to some localization techniques, gradcam visualization
proves more efficient, providing accurate localizations in a single forward and partial
backward pass per image.
Working Principle

Grad-CAM computes gradients of predicted class scores concerning the activations in the last
convolutional layer. These gradients signify the importance of each activation map for predicting
specific classes.

Class-Discriminative Localization (Precise Identification)

It precisely identifies and highlights regions in input images that significantly contribute to
predictions for specific classes, enabling a deeper understanding of model decisions.

Versatility

Grad-CAM’s adaptability spans various CNN architectures without requiring architectural
changes or retraining. It applies to models handling diverse inputs and outputs, ensuring broad
usability across different tasks.

Balancing Accuracy and Interpretability

Grad-CAM allows for understanding the decision-making processes of complex models without
sacrificing their accuracy, striking a balance between model interpretability and high performance.

● The CNN processes the input image through its layers, culminating in the last
convolutional layer.
● Grad CAM visualization utilizes the activations from this last convolutional layer to
generate the Class Activation Map (CAM).
● Techniques like Guided Backpropagation are applied to refine the visualization, resulting
in class-discriminative localization and high-resolution detailed visualizations, aiding in
interpreting CNN decisions.
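A minimal NumPy sketch of the core Grad-CAM computation, assuming the activations (K, H, W)
of the last conv layer and the gradients (K, H, W) of the class score with respect to them have
already been extracted (e.g., via framework hooks; names are illustrative):

import numpy as np

def grad_cam(activations, gradients):
    weights = gradients.mean(axis=(1, 2))             # alpha_k: global-average-pooled gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted combination of feature maps
    cam = np.maximum(cam, 0)                          # ReLU keeps positively contributing regions
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1] for the heatmap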
UNIT IV CNN and RNN FOR IMAGE AND VIDEO PROCESSING

CNNs for Recognition and Verification

Siamese Networks

A siamese neural network (SNN) is a class of neural network architectures that contain two or
more identical sub-networks. “Identical” here means they have the same configuration with the
same parameters and weights. Parameter updating is mirrored across both sub-networks and it’s
used to find similarities between inputs by comparing its feature vectors. These networks are used
in many applications.

Traditionally, a neural network learns to predict multiple classes. This poses a problem when we
need to add or remove new classes to the data. In this case, we have to update the neural network
and retrain it on the whole data set. Also, deep neural networks need a large volume of data on
which to train. SNNs, on the other hand, learn a similarity function. Thus, we can train the SNN to
see if two images are the same (which I’ll demonstrate below). This process enables us to classify
new classes of data without retraining the network.

How to Train a Siamese Network

● Initialize the network, loss function and optimizer.
● Pass the first image of the pair through the network.
● Pass the second image of the pair through the network.
● Calculate the loss using the outputs from the first and second images.
● Backpropagate the loss to calculate the gradients of our model.
● Update the weights using an optimizer.
● Save the model.
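As a concrete sketch of these steps in PyTorch (the tiny embedding network, the dummy data batch, and CosineEmbeddingLoss as a stand-in pairwise loss are all assumptions for illustration; the contrastive and triplet losses described below are the usual choices in practice):

```python
import torch
import torch.nn as nn

# Tiny embedding network; any CNN backbone could stand in here.
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(28 * 28, 128), nn.ReLU(),
                    nn.Linear(128, 64))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)      # initialize optimizer
criterion = nn.CosineEmbeddingLoss()                         # stand-in pairwise loss

img1 = torch.randn(32, 1, 28, 28)            # first images of the pairs
img2 = torch.randn(32, 1, 28, 28)            # second images of the pairs
label = torch.randint(0, 2, (32,)) * 2 - 1   # +1 similar, -1 dissimilar (this loss)

optimizer.zero_grad()
out1 = net(img1)                  # pass the first image through the network
out2 = net(img2)                  # pass the second image through the same weights
loss = criterion(out1, out2, label.float())  # loss from both outputs
loss.backward()                   # backpropagate to compute gradients
optimizer.step()                  # update the mirrored weights
torch.save(net.state_dict(), "siamese.pt")   # save the model
```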
Pros and Cons of Siamese Networks
Siamese Network Pros

More Robust to Class Imbalance


Giving a few images per class is sufficient for siamese networks to recognize those images in the
future with the aid of one-shot learning.

Nice to Pair With the Best Classifier

Given that an SNN’s learning mechanism is somewhat different from classification models, simply
averaging it with a classifier can do much better than averaging two correlated supervised models
(e.g. GBM & RF classifiers).

Learning from Semantic Similarity

SNN focuses on learning embeddings (in the deeper layer) that place the same classes/concepts
close together. Hence, we can learn semantic similarity.

Siamese Network Cons

Needs More Training Time Than Normal Networks


Since SNNs involve learning from quadratic pairs (to see all the information available), they're slower than the normal classification type of learning (pointwise learning).

Don’t Output Probabilities

Since training involves pairwise learning, SNNs won't output the probabilities of the prediction, only a distance from each class.

Loss Functions Used in Siamese Networks

Since training SNNs involves pairwise learning, cross-entropy loss cannot be used. There are two loss functions we typically use to train siamese networks.
Triplet Loss
Triplet loss is a loss function wherein we compare a baseline (anchor) input to a positive (truthy) input and a negative (falsy) input. The distance from the baseline (anchor) input to the positive (truthy) input is minimized, and the distance from the baseline (anchor) input to the negative (falsy) input is maximized:

L = \max\big(0, \; \|F_a - F_p\|^2 - \|F_a - F_n\|^2 + \alpha\big)

In the above equation, α (alpha) is a margin term used to stretch the distance between similar and dissimilar pairs in the triplet. F_a, F_p, F_n are the feature embeddings for the anchor, positive and negative images.

During the training process, we feed an image triplet (anchor image, negative image, positive image) into the model as a single sample. The distance between the anchor and positive images should be smaller than that between the anchor and negative images.
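For illustration, PyTorch provides this loss out of the box; a minimal usage sketch (the batch size, embedding dimension, and margin are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Triplet loss with margin alpha = 1.0 over batches of embeddings.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
f_a = torch.randn(16, 64, requires_grad=True)   # anchor embeddings F_a
f_p = torch.randn(16, 64)                       # positive embeddings F_p
f_n = torch.randn(16, 64)                       # negative embeddings F_n
loss = triplet_loss(f_a, f_p, f_n)              # small when d(a,p) + margin < d(a,n)
loss.backward()
```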

Contrastive Loss
Contrastive loss is an increasingly popular loss function. It’s a distance-based loss as opposed to
more conventional error-prediction loss. This loss function is used to learn embeddings in which
two similar points have a low Euclidean distance and two dissimilar points have a large Euclidean
distance.

We define D_w (the Euclidean distance between the two network outputs) as:

D_w = \|G_w(X_1) - G_w(X_2)\|_2

where G_w(X) is the output of our network for one image X. The contrastive loss is then:

L = (1 - Y) \cdot \tfrac{1}{2} D_w^2 + Y \cdot \tfrac{1}{2} \{\max(0, m - D_w)\}^2

where Y = 0 for similar pairs, Y = 1 for dissimilar pairs, and m > 0 is a margin.
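A minimal sketch of this loss in PyTorch, assuming the label convention Y = 0 for similar pairs and Y = 1 for dissimilar pairs, and an arbitrary margin:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(out1, out2, label, margin=2.0):
    """Contrastive loss; label Y = 0 for similar pairs, 1 for dissimilar."""
    d_w = F.pairwise_distance(out1, out2)              # Euclidean distance D_w
    loss_sim = (1 - label) * d_w.pow(2)                # pull similar pairs together
    loss_dis = label * torch.clamp(margin - d_w, min=0).pow(2)  # push apart to margin
    return 0.5 * (loss_sim + loss_dis).mean()

# Example: 16 pairs of 64-d embeddings with random similarity labels.
out1, out2 = torch.randn(16, 64), torch.randn(16, 64)
label = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(out1, out2, label))
```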

Background of Object Detection

Object detection in deep learning is a machine learning technique that uses deep learning models to accurately and quickly locate objects in images. Deep learning models can learn from large amounts of labeled data to extract complex patterns, which allows for more precise object localization and classification.

R-CNN – Region-Based Convolutional Neural Networks


R-CNN (Region-based Convolutional Neural Network) was introduced by Ross Girshick et al. in
2014. R-CNN revolutionized object detection by combining the strengths of region proposal
algorithms and deep learning, leading to remarkable improvements in detection accuracy and
efficiency.

Introduction of Region-Based Convolutional Neural Networks (R-CNN)


To tackle the challenges of object detection, Ross Girshick introduced R-CNN. This approach
utilizes a selective search algorithm to generate approximately 2,000 region proposals, which are
then processed through a Convolutional Neural Network (CNN) to extract features. These
features are classified using a Support Vector Machine (SVM), while a bounding box regressor is
employed to improve localization accuracy.
R-CNN identifies and localizes objects in images by proposing Regions of Interest (RoIs) and classifying them through the CNN. The object detection framework starts with an input image containing potential objects and employs a region proposal method, such as Selective Search, to generate bounding boxes likely to contain objects.
Each proposed region is resized and fed into a pre-trained CNN, such as AlexNet or VGG16, to
extract feature representations. These features are then classified by the SVM into predefined
categories or designated as background. To refine localization further, a bounding box regression
model adjusts the coordinates of each box, aligning them more closely with the actual object
boundaries.
This systematic process effectively combines proposal generation, feature extraction,
classification, and bounding box refinement, enabling accurate object detection.

Key Features of R-CNNs


1. Region Proposals
R-CNNs begin by generating region proposals, which are smaller sections of the image that may
contain the objects we are searching for.
The algorithm employs a method called selective search, a greedy approach that generates
approximately 2,000 region proposals per image. Selective search effectively balances the number
of proposals while maintaining high object recall, ensuring efficient object detection.
By limiting the number of regions for detailed analysis, this method enhances the overall
performance of the R-CNN in detecting objects within images.
2. Selective Search
Selective Search is a greedy algorithm that generates region proposals by combining smaller
segmented regions. It takes an image as input and produces region proposals that are crucial for
object detection. This method offers significant advantages over random proposal generation by
limiting the number of proposals to approximately 2,000 while ensuring high object recall.
Algorithm Steps:
1. Generate Initial Segmentation: The algorithm starts by performing an initial sub-
segmentation of the input image.
2. Combine Similar Regions: It then recursively combines similar bounding boxes into larger
ones. Similarities are evaluated based on factors such as color, texture, and region size.
3. Generate Region Proposals: Finally, these larger bounding boxes are used to create region
proposals for object detection.
The selective search algorithm provides an efficient way to identify potential object regions,
enhancing the overall effectiveness of the detection process.
3. Input Preparation in R-CNN
After generating the region proposals, these regions are warped into a uniform square shape to
match the input dimensions required by the CNN model.
In this case, we use the pre-trained AlexNet model, which was considered the state-of-the-art CNN
for image classification at the time.
The input size for AlexNet is (227, 227, 3), meaning each input image must be resized to these dimensions. Consequently, whether the region proposals are small or large, they need to be adjusted accordingly to fit the specified input size.
From the above architecture, we remove the final softmax layer to obtain a (1, 4096) feature vector.
This feature vector is then fed into both the Support Vector Machine (SVM) for classification and
the bounding box regressor for improved localization.

4. SVM (Support Vector Machine)


The feature vector generated by the CNN is then utilized by a binary Support Vector Machine
(SVM), which is trained independently for each class. This SVM model takes the feature vector
produced by the previous CNN architecture and outputs a confidence score indicating the
likelihood of an object being present in that region.
However, a challenge arises during the training process with the SVM: it requires the AlexNet
feature vectors for each class. As a result, we cannot train AlexNet and the SVM independently
and in parallel.
5. Bounding Box Regressor
To accurately locate the bounding box within the image, we utilize a scale-invariant linear
regression model known as the bounding box regressor.
For training this model, we use pairs of predicted and ground-truth values for four dimensions of localization: (x, y, w, h). Here, x and y represent the pixel coordinates of the center of the bounding box, while w and h indicate the width and height of the bounding box, respectively.
This method enhances the Mean Average Precision (mAP) of the results by 3-4%.

To further optimize detection, R-CNNs apply Non-Maximum Suppression (NMS):


1. Remove proposals with confidence scores below a threshold (e.g., 0.5).
2. Select the highest-probability region among candidates for each object.
3. Discard overlapping regions with an IoU (Intersection over Union) above 0.5 to eliminate duplicate detections, where IoU is defined as:

IoU = Area of Overlap / Area of Union

By combining region proposals, selective search, CNN-based feature extraction, SVM classification, and bounding box refinement, R-CNN achieves high accuracy in object detection, making it suitable for various applications.
After that, we can obtain output by plotting these bounding boxes on the input image and labeling
objects that are present in bounding boxes.
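For illustration, IoU and the greedy NMS procedure above can be sketched as follows (the (x1, y1, x2, y2) box format and the thresholds are assumptions):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # overlap / union

def nms(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    keep = []
    while order:
        best = order.pop(0)                            # highest-scoring region
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, np.array([0.9, 0.8, 0.7])))           # -> [0, 2]
```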
Results of R-CNN Model
The R-CNN gives a Mean Average Precision (mAP) of 53.7% on the VOC 2010 dataset. On the 200-class ILSVRC 2013 object detection dataset it gives an mAP of 31.4%, a large improvement over the previous best of 24.3%. However, this architecture is very slow to train and takes ~49 sec to generate test results on a single image of the VOC 2007 dataset.
Challenges of R-CNN
R-CNN faces several challenges in its implementation:
1. Rigid Selective Search Algorithm: The selective search algorithm is inflexible and does not
involve any learning. This rigidity can result in poor region proposal generation for object
detection.
2. Time-Consuming Training: With approximately 2,000 candidate proposals, training the
network becomes time-intensive. Additionally, multiple components need to be trained
separately, including the CNN architecture, SVM model, and bounding box regressor. This
multi-step training process slows down implementation.
3. Inefficiency for Real-Time Applications: R-CNN is not suitable for real-time applications,
as it takes around 50 seconds to process a single image with the bounding box regressor.
4. Increased Memory Requirements: Storing feature maps for all region proposals significantly
increases the disk memory needed during the training phase.

Fast R-CNN

CNN Network of Fast R-CNN


Fast R-CNN was experimented with three pre-trained ImageNet networks, each with 5 max-pooling layers and 5-13 convolution layers (such as VGG-16). Some changes are proposed to this pre-trained network:
● The network is modified so that it takes two inputs: the image and a list of region proposals generated on that image.
● Second, the last pooling layer (here 7×7×512) before the fully connected layers is replaced by a region of interest (RoI) pooling layer.
● Third, the last fully connected layer and softmax layer are replaced by twin output layers: a softmax classifier over K+1 categories and a category-specific bounding box regressor, each with a fully connected layer.
This CNN architecture takes the image (size = 224 × 224 × 3 for VGG-16) and its region proposals and outputs the convolution feature map (size = 14 × 14 × 512 for VGG-16).

Region of Interest (RoI) pooling:


RoI pooling is a novel operation introduced in the Fast R-CNN paper. Its purpose is to produce uniform, fixed-size feature maps from non-uniform inputs (RoIs). It takes two values as inputs:
● The feature map obtained from the previous CNN layer (14 × 14 × 512 in VGG-16).
● An N × 4 matrix representing the regions of interest, where N is the number of RoIs; the first two values give the coordinates of the upper-left corner of the RoI and the other two give its height and width, denoted (r, c, h, w).
Let's consider that we have an 8×8 feature map from which we need to extract an output of size 2×2. We will follow the steps below.

Suppose we are given an RoI with upper-left corner coordinates (0, 3) and height and width (5, 7). Now we need to convert this region proposal into a 2 × 2 output block, and the dimensions of the pooling section are not perfectly divisible by the output dimensions. We choose pooling sections so that the output is fixed to 2 × 2 dimensions.

Now we apply the max pooling operator to select the maximum value from each of the regions into which we divided the RoI.

Max pooling output
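torchvision exposes this operator directly; a minimal usage sketch mirroring the 8×8 example above (the single-channel feature map and the RoI coordinates are assumptions):

```python
import torch
from torchvision.ops import roi_pool

# One 8x8 feature map with a single channel, batch size 1.
feature_map = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)

# One RoI: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 0, 3, 6, 7]], dtype=torch.float32)

# Pool the RoI into a fixed 2x2 output regardless of its original size.
out = roi_pool(feature_map, rois, output_size=(2, 2), spatial_scale=1.0)
print(out.shape)  # torch.Size([1, 1, 2, 2])
```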

Training and Loss Function


First, we take each training region of interest labeled with ground truth class u and ground truth
bounding box v. Then we take the output generated by the softmax classifier and bounding box
regressor and apply the loss function to them. We defined our loss function such that it takes into
account both the classification and bounding box localization. This loss function is called multi-
task loss. This is defined as follows:

L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)

where L_cls is the classification loss and L_loc is the localization loss, λ is a balancing parameter, and [u ≥ 1] is an indicator (its value is 0 for the background class u = 0, and 1 otherwise) to make sure that the localization loss is only calculated when we need to define a bounding box. Here, L_cls is the log loss, and L_loc is the smooth L1 loss over the four bounding-box offsets:

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i^u - v_i), \quad \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}
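A compact sketch of this multi-task loss in PyTorch (tensor shapes, λ = 1, and a single class-agnostic box regressor are simplifying assumptions; the paper uses class-specific regressors):

```python
import torch
import torch.nn.functional as F

def multi_task_loss(class_logits, bbox_pred, labels, bbox_targets, lam=1.0):
    """Fast R-CNN style multi-task loss: log loss + smooth L1 for foreground RoIs."""
    cls_loss = F.cross_entropy(class_logits, labels)         # L_cls (log loss)
    fg = labels > 0                                          # indicator [u >= 1]
    if fg.any():
        loc_loss = F.smooth_l1_loss(bbox_pred[fg], bbox_targets[fg])
    else:
        loc_loss = torch.tensor(0.0)
    return cls_loss + lam * loc_loss

# Example: 8 RoIs, K+1 = 21 classes, 4 box offsets per RoI.
logits = torch.randn(8, 21)
boxes = torch.randn(8, 4)
labels = torch.randint(0, 21, (8,))
targets = torch.randn(8, 4)
print(multi_task_loss(logits, boxes, labels, targets))
```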

Advantages of Fast R-CNN over R-CNN


● The most important reason that Fast R-CNN is faster than R-CNN is that we don’t need to pass
2000 region proposals for every image in the CNN model. Instead, the convNet operation is
done only once per image and a feature map is generated from it.
● Since the whole model is combined and trained in one go, there is no need for feature caching. That also decreases the disk memory requirement during training.
● Fast R-CNN also improves mAP as compared to R-CNN on most of the classes of VOC
2007, 10, and 12 datasets.

FCN

Fully Convolutional Network (FCN) for Semantic Segmentation is briefly reviewed.


Compared with classification and detection tasks, segmentation is a much more difficult task.

● Image Classification: Classify the object (Recognize the object class) within an image.

● Object Detection: Classify and detect the object(s) within an image with bounding box(es)
bounded the object(s). That means we also need to know the class, position and size of each
object.
● Semantic Segmentation: Classify the object class for each pixel within an image. That
means there is a label for each pixel.

1. From Image Classification to Semantic Segmentation

In classification, conventionally, an input image is downsized and goes through the convolution layers and fully connected (FC) layers, and outputs one predicted label for the input image, as follows:

Classification

Imagine we turn the FC layers into 1×1 convolutional layers:

All layers are convolutional layers

And if the image is not downsized, the output will not be a single label. Instead, the output has a
size smaller than the input image (due to the max pooling):
All layers are convolutional layers

If we upsample the output above, then we can calculate the pixelwise output (label map) as below:

Upsampling at the last step


Feature Map / Filter Number Along Layers

2. Upsampling Via Deconvolution

Convolution is a process that makes the output size smaller. Thus the name deconvolution comes from wanting an upsampling that makes the output size larger. (The name deconvolution is often misinterpreted as the reverse process of convolution, which it is not.) It is also called up-convolution and transposed convolution, and fractionally strided convolution when a fractional stride is used.


Upsampling Via Deconvolution (Blue: Input, Green: Output)
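In PyTorch this layer is nn.ConvTranspose2d; a minimal sketch that doubles the spatial resolution (channel counts, kernel size, and stride are assumptions):

```python
import torch
import torch.nn as nn

# Transposed convolution that doubles spatial resolution (stride 2).
upsample = nn.ConvTranspose2d(in_channels=512, out_channels=256,
                              kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 512, 14, 14)   # coarse feature map
y = upsample(x)
print(y.shape)                    # torch.Size([1, 256, 28, 28])
```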

3. Fusing the Output

After going through conv7 as below, the output size is small; 32× upsampling is then done to make the output the same size as the input image. But it also makes the output label map rough. This is called FCN-32s:

FCN-32s
This is because, while deep features are obtained when going deeper, spatial location information is also lost when going deeper. That means outputs from shallower layers have more location information. If we combine both, we can enhance the result.

To combine, we fuse the output (by element-wise addition):

Fusing for FCN-16s and FCN-8s

FCN-16s: The output from pool5 is 2× upsampled, fused with pool4, and then 16× upsampling is performed. Similar operations are used for FCN-8s, as in the figure above.

Comparison with different FCNs

The FCN-32s result is very rough due to the loss of location information, while FCN-8s gives the best result.

This fusing operation is actually just like the boosting / ensemble technique used in AlexNet, VGGNet, and GoogLeNet, where the results of multiple models are added to make the prediction more accurate. But in this case, it is done for each pixel, and the results are added from different layers within a single model.
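A minimal sketch of the FCN-16s-style fusion in PyTorch (layer names, channel counts, feature-map sizes, and the 21-class output are assumptions):

```python
import torch
import torch.nn as nn

num_classes = 21  # e.g., PASCAL VOC

# 1x1 score layers mapping feature maps to per-class score maps.
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
up2 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
up16 = nn.ConvTranspose2d(num_classes, num_classes, 32, stride=16, padding=8)

conv7_feat = torch.randn(1, 4096, 7, 7)    # deep, coarse features
pool4_feat = torch.randn(1, 512, 14, 14)   # shallower features, more location info

# 2x upsample the deep scores, fuse with pool4 scores by element-wise
# addition, then 16x upsample to full resolution (FCN-16s).
fused = up2(score_conv7(conv7_feat)) + score_pool4(pool4_feat)
out = up16(fused)
print(out.shape)  # torch.Size([1, 21, 224, 224])
```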

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.


What is SegNet?
SegNet is a deep learning architecture designed for semantic segmentation, where the goal is to
classify each pixel in an image into a predefined category. It is an encoder-decoder neural network
specifically tailored for pixel-wise image segmentation, making it highly effective for tasks that
require detailed and precise segmentation.
SegNet's primary purpose is to perform semantic segmentation by learning to label each pixel in
an image according to its category. This makes SegNet particularly useful in applications such as
autonomous driving, medical image analysis, and urban scene understanding, where accurate
segmentation is crucial.
SegNet Architecture Detailed Overview

SegNet is a deep learning architecture designed for semantic pixel-wise image segmentation. The
architecture includes an encoder network and a corresponding decoder network, followed by a
final pixel-wise classification layer. This detailed explanation covers each component of SegNet,
comparisons with other architectures, and various decoder variants.
Encoder Network
The encoder network in SegNet is composed of 13 convolutional layers, mirroring the first 13
convolutional layers of the VGG16 network, which was originally designed for object
classification. Key points about the encoder network are:
1. Pre-trained Weights: The use of VGG16's pre-trained weights allows for efficient
initialization and faster convergence during training.
2. Convolutional Layers: These layers perform convolution operations to extract features from
the input image.
3. Batch Normalization: Each convolutional layer is followed by batch normalization to
stabilize and accelerate the training process.
4. ReLU Activation: Rectified Linear Unit (ReLU) activation function is applied element-wise
to introduce non-linearity.
5. Max-Pooling: Max-pooling with a 2×2 window and a stride of 2 is used to downsample the
feature maps, reducing their spatial resolution by half. This step helps in achieving translation
invariance over small spatial shifts.
The max-pooling operation results in a lossy representation of the image, especially in terms of
boundary details, which are crucial for segmentation tasks. To mitigate this loss, the locations of
the maximum feature values in each pooling window (max-pooling indices) are stored. This
information is later used in the decoder network for accurate upsampling.
Decoder Network
The decoder network consists of 13 layers, each corresponding to an encoder layer. The decoding
process is designed to upsample the feature maps back to the original image resolution. Key steps
in the decoder network are:
1. Upsampling Using Max-Pooling Indices: The stored max-pooling indices are used to
upsample the feature maps, creating sparse feature maps. This technique ensures that the spatial
locations of features are preserved.
2. Convolution with Trainable Filters: The sparse feature maps are convolved with trainable
decoder filters to produce dense feature maps. This step helps in refining the feature maps and
improving segmentation accuracy.
3. Batch Normalization: Similar to the encoder, batch normalization is applied to each layer in
the decoder network.
4. Soft-Max Classifier: The final output of the decoder network is passed through a multi-class
soft-max classifier, which assigns class probabilities to each pixel. The predicted segmentation
is obtained by taking the class with the highest probability for each pixel.
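The coupling between encoder max-pooling and decoder unpooling can be sketched in PyTorch as follows (channel counts and spatial sizes are assumptions):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
decoder_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)          # encoder feature map
pooled, indices = pool(x)               # downsample, storing max locations
# ... deeper encoder/decoder layers would go here ...
sparse = unpool(pooled, indices)        # upsample using stored indices (sparse map)
dense = decoder_conv(sparse)            # trainable filters densify the map
print(dense.shape)                      # torch.Size([1, 64, 32, 32])
```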

Spatio-temporal Models

Spatiotemporal models arise when data are collected across time as well as space and have at least one spatial and one temporal property. An event in a spatiotemporal dataset describes a spatial and temporal phenomenon that exists at a certain time t and location x.

Spatio-temporal modeling describes studies which record and analyse both the locations and
associated times of the observations. In spatio-temporal analysis, the focus is on variation in the
average number of incident or prevalent cases in combinations of place and time units over the
geographical region and time-period of interest – that is the spatio-temporal intensity of incident
or prevalent cases.

Real time surveillance

Real-time spatio-temporal surveillance can inform a rapid response team about where and when
to target prevention and control activities as well as to make longer term plans.

For example, the New York City Department of Health developed a system that uses daily
reports of the location and timing of 35 notifiable diseases to automatically detect epidemics. In
2015, the system identified a cluster of community-acquired legionellosis in a specific location
three days before health professionals noticed an increase in cases; the cluster of observations
expanded and became the largest outbreak in the US.

Types of spatio-temporal study design


Study design depends on the objectives of the study and practical constraints.

Longitudinal design

In a longitudinal design, data are collected repeatedly over time from the same set of sampled
locations. This is appropriate when temporal variation in the health outcome dominates spatial
variation. A longitudinal design can be cost-effective when setting up a sampling location is
expensive but subsequent data-collection is cheap. Longitudinal designs can act
as sentinel locations, when the locations may be chosen subjectively, either to be representative of
the population at large or, in the case of pollution monitoring for example, to capture extreme cases
to monitor compliance with environmental legislation.

Repeated cross-sectional design

In a repeated cross-sectional design, the researcher chooses different sets of locations on each
sampling occasion. This sacrifices direct information on changes in the underlying process over
time in favour of more complete spatial coverage. For example, to predict stunting in children in
Ghana, researchers drew data from four quinquennial national Demographic and Health Surveys
each of which used a similar two-stage cluster sampling strategy.

Repeated cross-sectional designs can also be adaptive, meaning that on any sampling occasion, the
choice of sampling locations is informed by an analysis of the data collected on earlier
occasions. Adaptive repeated cross-sectional designs are particularly suitable for applications in
which temporal variation either is dominated by spatial variation or is strongly related to risk
factors of interest.

Types of data

Spatial point pattern data-set


The unit of observation is the individual case, which the researcher geo-references to a single point in the region being described. For example, all persons diagnosed with cholera in the region, each identified by their village address.

Geo-statistical data-set

The unit of observation is a location in the region, but the researcher obtains the data only from a sample of the susceptible population. Typically, each location identifies a village community, but resource limitations dictate the use of only a sample of villages rather than a complete census. The data-set consists of the number of cases in each sampled village.

Small-area data-set

The researcher partitions the region into a set of sub-regions. The dataset consists of all cases of
cholera in each sub-region. Typically the researcher uses this approach when the health system
maintains a register of all cases in the region.

All these formats can be extended in time. For example when an investigator records both the
location and time of occurrence of a case during real-time surveillance, they obtain a spatio-
temporal point pattern data-set of all cases. When the investigator records cases longitudinally at
sampled locations, they obtain a spatio-temporal geostatistical data-set, and similarly with small
area data-sets.

Geostatistical data-sets are most commonly obtained for disease mapping and surveillance in low-
resource settings where collecting point pattern data is expensive and health registries may not
exist to provide small area data.
Sampling and geostatistical data-sets

Without a properly designed sampling scheme, there is a risk that the investigator will sample
more accessible communities that do not represent the health experiences of the study-population,
that is the study will be biased.

To obtain valid predictions, the sample must be as spatially and temporally unbiased as possible. The sampling schemes below are commonly used to eliminate as much bias as possible.

Probability sampling

To avoid spatial bias, the investigator can either select locations from a gridded map of the geographic area of interest or use a probability sampling scheme.

Counter-intuitively, simple random sampling is not recommended. The reason is that it leads to an irregular pattern of sampled locations; for constructing an accurate map, it is preferable to space sampling locations evenly throughout the region of interest. Chipeta et al. explain how this can be achieved without losing the guarantee of unbiasedness by choosing sampling locations at random subject to the constraint that no two sampled locations can be separated by less than a specified minimum distance.

Stratified random sampling

Stratified random sampling is a set of simple random samples, one in each of a pre-defined set of sub-regions that form a partition of the region of interest. Chipeta et al.'s method can secure even coverage of each sub-region without introducing bias. Stratification generally leads to gains in efficiency when contextual knowledge can be used to define the strata so that between-strata variation in the outcome of interest dominates within-stratum variation.

Multi-stage cluster sampling

The investigator divides the region of interest into administrative divisions and randomly selects a
number of clusters of households or villages in each division. Cluster sampling designs are
typically less efficient statistically than simple or stratified designs with the same total sample
size. But this is counterbalanced by their practical convenience.

Opportunistic sampling

To reduce the length and cost of the study, researchers often use opportunistic sampling, in which they collect data at whatever locations are available, for example from presentations at health clinics. The limitations are obvious: the onus is on the investigators to convince themselves and their audience that such a design does not bias their results.
Action/Activity Recognition

Action or activity recognition is a computer vision task that involves identifying and classifying
human actions in videos or images. It's a complex task that involves analyzing the spatiotemporal
dynamics of actions and mapping them to a predefined set of action classes.

Here are some challenges of action recognition:

Densely packed actions: Videos can have multiple actions happening at once or in quick
succession.

Long-range processing: Actions can extend over long periods of time, requiring long-range
processing to capture the nuances and transitions.

Irrelevant frames: Not every frame contributes to the action recognition process.

Training: Video models are more compute intensive than image models and can be expensive and
time consuming to train.

Generalization: It can be difficult to generalize due to the amount of variation possible in the video space.

Action recognition has many applications, including:

Intelligent surveillance systems

Human-computer interfaces

Health care

Security

Military applications

Analyzing worker interactions with machinery and materials


UNIT – V

Deep Generative Models

GAN(Generative Adversarial Network) represents a cutting-edge approach to generative modeling


within deep learning, often leveraging architectures like convolutional neural networks. The goal of
generative modeling is to autonomously identify patterns in input data, enabling the model to
produce new examples that feasibly resemble the original dataset.

What is a Generative Adversarial Network?

Generative Adversarial Networks (GANs) are a powerful class of neural networks used for unsupervised learning. GANs are made up of two neural networks, a discriminator and a generator. They use adversarial training to produce artificial data that closely resembles actual data.

The Generator attempts to fool the Discriminator, which is tasked with accurately distinguishing between produced and genuine data, by transforming random noise samples into realistic-looking data.

Realistic, high-quality samples are produced as a result of this competitive interaction, which
drives both networks toward advancement.

GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive
use in image synthesis, style transfer, and text-to-image synthesis.

They have also revolutionized generative modeling.

Through adversarial training, these models engage in a competitive interplay until the generator
becomes adept at creating realistic samples, fooling the discriminator approximately half the time.

Generative Adversarial Networks (GANs) can be broken down into three parts:

Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.

Adversarial: The word adversarial refers to setting one thing up against another. This means that,
in the context of GANs, the generative result is compared with the actual images in the data set. A
mechanism known as a discriminator is used to apply a model that attempts to distinguish between
real and fake images.

Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.

Types of GANs
Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are simple, basic multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to optimize the mathematical equation using stochastic gradient descent.

Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some
conditional parameters are put into place.

In CGAN, an additional parameter ‘y’ is added to the Generator for generating the corresponding
data.

Labels are also put into the input to the Discriminator in order for the Discriminator to help
distinguish the real data from the fake generated data.

Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most successful
implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons.

The ConvNets are implemented without max pooling, which is in fact replaced by convolutional
stride.

Also, the layers are not fully connected.

Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation
consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual.

This approach uses multiple numbers of Generator and Discriminator networks and different levels
of the Laplacian Pyramid.

This approach is mainly used because it produces very high-quality images. The image is down-
sampled at first at each layer of the pyramid and then it is again up-scaled at each layer in a
backward pass where the image acquires some noise from the Conditional GAN at these layers
until it reaches its original size.

Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of designing a GAN in which a deep neural network is used along with an adversarial network in order to produce higher-resolution images. This type of GAN is particularly useful in optimally up-scaling native low-resolution images to enhance their details while minimizing errors in the process.
Architecture of GANs
A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.

Generator Model

A key element responsible for creating fresh, accurate data in a Generative Adversarial Network
(GAN) is the generator model. The generator takes random noise as input and converts it into
complex data samples, such text or images. It is commonly depicted as a deep neural network.

The training data’s underlying distribution is captured by layers of learnable parameters in its
design through training. The generator adjusts its output to produce samples that closely mimic
real data as it is being trained by using backpropagation to fine-tune its parameters.

The generator’s ability to generate high-quality, varied samples that can fool the discriminator is
what makes it successful.

Generator Loss

The objective of the generator in a GAN is to produce synthetic samples that are realistic enough
to fool the discriminator. The generator achieves this by minimizing its loss function J_G. The loss is minimized when the log probability is maximized, i.e., when the discriminator is highly likely to classify the generated samples as real. The equation is given below:

J_G = -\frac{1}{m} \sum_{i=1}^{m} \log D(G(z_i))

Where,

J_G measures how well the generator is fooling the discriminator.

\log D(G(z_i)) represents the log probability of the discriminator being correct for generated samples.

The generator aims to minimize this loss, encouraging the production of samples that the discriminator classifies as real (D(G(z_i)) close to 1).

Discriminator Model

An artificial neural network called a discriminator model is used in Generative Adversarial


Networks (GANs) to differentiate between generated and actual input. By evaluating input samples
and allocating probability of authenticity, the discriminator functions as a binary classifier.

Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and
increase its level of proficiency.
Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data. Maximizing the discriminator’s capacity to accurately identify
generated samples as fraudulent and real samples as authentic is the aim of the adversarial training
procedure. The discriminator grows increasingly discriminating as a result of the generator and
discriminator’s interaction, which helps the GAN produce extremely realistic-looking synthetic
data overall.

Discriminator Loss

The discriminator reduces the negative log likelihood of correctly classifying both produced and real samples. This loss incentivizes the discriminator to accurately categorize generated samples as fake and real samples as real, with the following equation:

J_D = -\frac{1}{m} \sum_{i=1}^{m} \log D(x_i) - \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z_i)))

J_D assesses the discriminator's ability to discern between produced and actual samples.

\log D(x_i) represents the log likelihood that the discriminator will accurately categorize real data.

\log(1 - D(G(z_i))) represents the log likelihood that the discriminator will correctly categorize generated samples as fake.

The discriminator aims to reduce this loss by accurately identifying artificial and real samples.

MinMax Loss

In a Generative Adversarial Network (GAN), the minimax loss formula is provided by:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Where,

G is the generator network and D is the discriminator network.

Actual data samples obtained from the true data distribution p_{data}(x) are represented by x.

Random noise sampled from a prior distribution p_z(z) (usually a normal or uniform distribution) is represented by z.

D(x) represents the discriminator’s likelihood of correctly identifying actual data as real.

D(G(z)) is the likelihood that the discriminator will identify generated data coming from the
generator as authentic.
How does a GAN work?

The steps involved in how a GAN works:

Initialization: Two neural networks are created: a Generator (G) and a Discriminator (D).

G is tasked with creating new data, like images or text, that closely resembles real data.

D acts as a critic, trying to distinguish between real data (from a training dataset) and the data
generated by G.

Generator’s First Move: G takes a random noise vector as input. This noise vector contains random
values and acts as the starting point for G’s creation process. Using its internal layers and learned
patterns, G transforms the noise vector into a new data sample, like a generated image.

Discriminator’s Turn: D receives two kinds of inputs:

Real data samples from the training dataset.

The data samples generated by G in the previous step. D’s job is to analyze each input and
determine whether it’s real data or something G cooked up. It outputs a probability score between
0 and 1. A score of 1 indicates the data is likely real, and 0 suggests it’s fake.

The Learning Process: Now, the adversarial part comes in:

If D correctly identifies real data as real (score close to 1) and generated data as fake (score close
to 0), both G and D are rewarded to a small degree. This is because they’re both doing their jobs
well.
However, the key is to continuously improve. If D consistently identifies everything correctly, it
won’t learn much. So, the goal is for G to eventually trick D.

Generator’s Improvement:

When D mistakenly labels G’s creation as real (score close to 1), it’s a sign that G is on the right
track. In this case, G receives a significant positive update, while D receives a penalty for being
fooled.

This feedback helps G improve its generation process to create more realistic data.

Discriminator’s Adaptation:

Conversely, if D correctly identifies G's fake data (score close to 0), G receives no reward and D is further strengthened in its discrimination abilities.

This ongoing duel between G and D refines both networks over time.

As training progresses, G gets better at generating realistic data, making it harder for D to tell the
difference. Ideally, G becomes so adept that D can’t reliably distinguish real from fake data. At
this point, G is considered well-trained and can be used to generate new, realistic data samples.
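The steps above can be condensed into a short PyTorch sketch of one training iteration (the network sizes, learning rates, and the random stand-in for real data are assumptions):

```python
import torch
import torch.nn as nn

latent_dim = 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 784)          # stand-in for a batch of real images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: real -> 1, fake -> 0.
z = torch.randn(32, latent_dim)      # random noise vector input to G
fake = G(z)
d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make D label the fakes as real.
g_loss = bce(D(fake), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```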

Application Of Generative Adversarial Networks (GANs)

GANs, or Generative Adversarial Networks, have many uses in many different fields. Here are
some of the widely recognized uses of GANs:

Image Synthesis and Generation : GANs are often used for picture synthesis and generation
tasks, They may create fresh, lifelike pictures that mimic training data by learning the distribution
that explains the dataset. The development of lifelike avatars, high-resolution photographs, and
fresh artwork have all been facilitated by these types of generative networks.

Image-to-Image Translation : GANs may be used for problems involving image-to-image translation,
where the objective is to convert an input picture from one domain to another while maintaining
its key features. GANs may be used, for instance, to change pictures from day to night, transform
drawings into realistic images, or change the creative style of an image.

Text-to-Image Synthesis : GANs have been used to create visuals from descriptions in text. GANs
may produce pictures that translate to a description given a text input, such as a phrase or a caption.
This application might have an impact on how realistic visual material is produced using text-
based instructions.

Data Augmentation : GANs can augment present data and increase the robustness and
generalizability of machine-learning models by creating synthetic data samples.
Image Super-Resolution : GANs can enhance the resolution and quality of low-resolution images. By training on pairs of low-resolution and high-resolution images, GANs can generate high-resolution images from low-resolution inputs, enabling improved image quality in various applications such as medical imaging, satellite imaging, and video enhancement.

Advantages of GAN

The advantages of the GANs are as follows:

Synthetic data generation: GANs can generate new, synthetic data that resembles some known data
distribution, which can be useful for data augmentation, anomaly detection, or creative
applications.

High-quality results: GANs can produce high-quality, photorealistic results in image synthesis,
video synthesis, music synthesis, and other tasks.

Unsupervised learning: GANs can be trained without labeled data, making them suitable for
unsupervised learning tasks, where labeled data is scarce or difficult to obtain.

Versatility: GANs can be applied to a wide range of tasks, including image synthesis, text-to-image
synthesis, image-to-image translation, anomaly detection, data augmentation, and others.

Disadvantages of GAN

The disadvantages of the GANs are as follows:

Training Instability: GANs can be difficult to train, with the risk of instability, mode collapse, or
failure to converge.

Computational Cost: GANs can require a lot of computational resources and can be slow to train,
especially for high-resolution images or large datasets.

Overfitting: GANs can overfit the training data, producing synthetic data that is too similar to the
training data and lacking diversity.

Bias and Fairness: GANs can reflect the biases and unfairness present in the training data, leading
to discriminatory or biased synthetic data.

Interpretability and Accountability : GANs can be opaque and difficult to interpret or explain,
making it challenging to ensure accountability, transparency, or fairness in their applications.

Variational AutoEncoders
Variational autoencoder was proposed in 2013 by Diederik P. Kingma and Max Welling at Google
and Qualcomm. A variational autoencoder (VAE) provides a probabilistic manner for describing
an observation in latent space. Thus, rather than building an encoder that outputs a single value to
describe each latent state attribute, we’ll formulate our encoder to describe a probability
distribution for each latent attribute. It has many applications, such as data compression, synthetic
data creation, etc.

Variational autoencoder is different from an autoencoder in a way that it provides a statistical


manner for describing the samples of the dataset in latent space. Therefore, in the variational
autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a
single output value.

Architecture of Variational Autoencoder

The encoder-decoder architecture lies at the heart of Variational Autoencoders (VAEs),


distinguishing them from traditional autoencoders. The encoder network takes raw input data and
transforms it into a probability distribution within the latent space.

The latent code generated by the encoder is a probabilistic encoding, allowing the VAE to express
not just a single point in the latent space but a distribution of potential representations.

The decoder network, in turn, takes a sampled point from the latent distribution and reconstructs
it back into data space. During training, the model refines both the encoder and decoder parameters
to minimize the reconstruction loss – the disparity between the input data and the decoded output.
The goal is not just to achieve accurate reconstruction but also to regularize the latent space,
ensuring that it conforms to a specified distribution.

The process involves a delicate balance between two essential components: the reconstruction loss
and the regularization term, often represented by the Kullback-Leibler divergence. The
reconstruction loss compels the model to accurately reconstruct the input, while the regularization
term encourages the latent space to adhere to the chosen distribution, preventing overfitting and
promoting generalization.

By iteratively adjusting these parameters during training, the VAE learns to encode input data into
a meaningful latent space representation. This optimized latent code encapsulates the underlying
features and structures of the data, facilitating precise reconstruction. The probabilistic nature of
the latent space also enables the generation of novel samples by drawing random points from the
learned distribution.
Mathematics behind Variational Autoencoder

Variational autoencoder uses KL-divergence as its loss function, the goal of this is to minimize the
difference between a supposed distribution and original distribution of dataset.

Suppose we have a latent distribution z and we want to generate the observation x from it. In other words, we want to calculate the posterior p(z|x). We can do it in the following way, using Bayes' rule:

p(z|x) = \frac{p(x|z) \, p(z)}{p(x)}

But the calculation of p(x) = \int p(x|z) \, p(z) \, dz can be quite difficult, since it requires integrating over all configurations of the latent variables.

This usually makes it an intractable distribution. Hence, we need to approximate p(z|x) by a tractable distribution q(z|x). To make q(z|x) a good approximation of p(z|x), we minimize the KL-divergence loss, which calculates how similar two distributions are:

\min_q \; KL\big(q(z|x) \,\|\, p(z|x)\big)

By simplifying, the above minimization problem is equivalent to the following maximization problem:

\mathbb{E}_{q(z|x)}[\log p(x|z)] - KL\big(q(z|x) \,\|\, p(z)\big)

The first term represents the reconstruction likelihood and the other term ensures that our learned distribution q is similar to the true prior distribution p.

Thus our total loss consists of two terms, one is the reconstruction error and the other is the KL-divergence loss:

Loss = L(x, \hat{x}) + KL\big(q(z|x) \,\|\, p(z)\big)
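A minimal PyTorch sketch of this two-term loss for a Gaussian encoder with a standard-normal prior (the tensor shapes and the use of MSE as the reconstruction error are assumptions):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction error + analytic KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum")               # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
    return recon + kl

# Reparameterization trick: sample z = mu + sigma * eps, eps ~ N(0, I).
mu, logvar = torch.zeros(8, 16), torch.zeros(8, 16)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

x = torch.rand(8, 784)
x_recon = torch.rand(8, 784)        # stand-in for the decoder's output
print(vae_loss(x, x_recon, mu, logvar))
```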

Applications
Deep Learning for Photo Editing (Image Editing)

Inpainting

Inpaint focuses on photo editing via simplified semi-automatic tools and mechanisms. The
program includes a tool similar to the Healing Brush tool in Adobe Photoshop CS5 with the
Content-Aware mode on. Similar to Healing Brush, the tool tries to replace bad or damaged texture
with good texture from another area to create a seamless repair of an image.

Specifically, Inpaint is capable of performing the following functions:

Removal of unwanted objects from an image

Facial retouching

Old photos repair

Arbitrary merging of multiple images into one

Object cloning

Restoring empty areas on panorama photos

Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image
restoration like removing defects and artifacts, or even replacing an image area with something
entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area
to inpaint is represented by white pixels and the area to keep is represented by black pixels. The
white pixels are filled in by the prompt.

Super resolution

Image super-resolution (SR) is the process of recovering high-resolution (HR) images


from low-resolution (LR) images. It is an important class of image processing techniques in
computer vision and image processing and enjoys a wide range of real-world applications, such as
medical imaging, satellite imaging, surveillance and security, astronomical imaging, amongst
others.

With the advancement in deep learning techniques in recent years, deep learning-based SR models
have been actively explored and often achieve state-of-the-art performance on various benchmarks
of SR. A variety of deep learning methods have been applied to solve SR tasks, ranging from the
early Convolutional Neural Networks (CNN) based method to recent promising Generative
Adversarial Nets based SR approaches.

Problem
The image super-resolution (SR) problem, particularly single image super-resolution (SISR), has gained a lot of attention in the research community. SISR aims to reconstruct a high-resolution image I_SR from a single low-resolution image I_LR. Generally, the relationship between I_LR and the original high-resolution image I_HR can vary depending on the situation. Many studies assume that I_LR is a bicubic downsampled version of I_HR, but other degrading factors such as blur, decimation, or noise can also be considered for practical applications.

In this article, we would be focusing on supervised learning methods for super-resolution tasks.
By using HR images as target and LR images as input, we can treat this problem as a supervised
learning problem.

Exhaustive table of topics in Supervised Image Super-Resolution
Upsampling Methods

Before understanding the rest of the theory behind the super-resolution, we need to
understand upsampling (Increasing the spatial resolution of images or simply increasing the
number of pixel rows/columns or both in the image) and its various methods.

1. Interpolation-based methods – Image interpolation (image scaling) refers to resizing digital images and is widely used by image-related applications. The traditional methods include nearest-neighbor interpolation, linear, bilinear, bicubic interpolation, etc.
Nearest-neighbor interpolation with the scale of 2

Nearest-neighbor Interpolation – The nearest-neighbor interpolation is a simple and intuitive algorithm. It selects the value of the nearest pixel for each position to be interpolated, regardless of any other pixels.

● Bilinear Interpolation – The bilinear interpolation (BLI) first performs linear interpolation on one axis of the image and then performs it on the other axis. Since it results in a quadratic interpolation with a receptive field sized 2 × 2, it shows much better performance than nearest-neighbor interpolation while keeping a relatively fast speed.
● Bicubic Interpolation – Similarly, the bicubic interpolation (BCI) performs cubic interpolation on each of the two axes. Compared to BLI, the BCI takes 4 × 4 pixels into account, and results in smoother results with fewer artifacts but much lower speed.

Shortcomings – Interpolation-based methods often introduce some side effects such as computational complexity, noise amplification, blurring results, etc.
2. Learning-based upsampling – To overcome the shortcomings of interpolation-based methods
and learn upsampling in an end-to-end manner, transposed convolution layer and sub-pixel layer
are introduced into the SR field.

Transposed convolution layer – The blue boxes denote the input, and the green boxes indicate the kernel and the convolution output.

Transposed convolution layer, a.k.a. deconvolution layer, tries to perform a transformation opposite to a normal convolution, i.e., predicting the possible input based on feature maps sized like the convolution output. Specifically, it increases the image resolution by expanding the image by inserting zeros and then performing convolution.

Sub-pixel layer – The blue boxes denote the input and the boxes with other colors indicate different
convolution operations and different output feature maps.

● Sub-pixel Layer: The sub-pixel layer, another end-to-end learnable upsampling layer, performs upsampling by generating a plurality of channels by convolution and then reshaping them. Within this layer, a convolution is first applied to produce outputs with s² times the channels, where s is the scaling factor. Assuming the input size is h × w × c, the output size will be h × w × s²c. After that, a reshaping operation is performed to produce outputs of size sh × sw × c.
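PyTorch implements this reshaping operation as nn.PixelShuffle; a minimal sketch with scaling factor s = 2 (the channel counts are assumptions):

```python
import torch
import torch.nn as nn

s = 2  # scaling factor

# Convolution produces s^2 * c channels, then PixelShuffle rearranges
# them into an s-times larger spatial grid: (h, w, s^2 c) -> (sh, sw, c).
conv = nn.Conv2d(64, 64 * s * s, kernel_size=3, padding=1)
shuffle = nn.PixelShuffle(s)

x = torch.randn(1, 64, 32, 32)
y = shuffle(conv(x))
print(y.shape)  # torch.Size([1, 64, 64, 64])
```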

Super-resolution Frameworks

Since image super-resolution is an ill-posed problem, how to perform upsampling (i.e., generating
HR output from LR input) is the key problem. There are mainly four model frameworks based on
the employed upsampling operations and their locations in the model (refer to the table above).

1. Pre-upsampling Super-resolution –

Direct mapping of LR images to HR images is considered a difficult task, so we do not attempt it directly. A straightforward solution is to utilize traditional upsampling algorithms to obtain higher-resolution images and then refine them using deep neural networks. For example, LR images are first upsampled to coarse HR images of the desired size using bicubic interpolation; deep CNNs are then applied to these images to reconstruct high-quality images.

2. Post-upsampling Super-resolution –

To improve computational efficiency and make full use of deep learning technology to increase resolution automatically, researchers propose to perform most of the computation in low-dimensional space by replacing the predefined upsampling with end-to-end learnable layers integrated at the end of the models. In the pioneering works of this framework, namely post-upsampling SR, the LR input images are fed into deep CNNs without increasing resolution, and end-to-end learnable upsampling layers are applied at the end of the network.

Learning Strategies

In the super-resolution field, loss functions are used to measure reconstruction error and guide the model optimization. In early times, researchers usually employed the pixelwise L2 loss (mean squared error), but later discovered that it cannot measure the reconstruction quality very accurately. Therefore, a variety of loss functions (e.g., content loss, adversarial loss) are adopted for better measuring the reconstruction error and producing more realistic and higher-quality results.

● Pixelwise L1 loss – Absolute difference between pixels of ground truth HR image and the
generated one.
● Pixelwise L2 loss – Mean squared difference between pixels of ground truth HR image
and the generated one.
● Content loss – the content loss is indicated as the Euclidean distance between high-level
representations of the output image and the target image. High-level features are obtained
by passing through pre-trained CNNs like VGG and ResNet.
● Adversarial loss – Based on GAN where we treat the SR model as a generator, and define
an extra discriminator to judge whether the input image is generated or not.
● PSNR – Peak Signal-to-Noise Ratio (PSNR) is a commonly used objective metric to measure the reconstruction quality of a lossy transformation. PSNR is inversely proportional to the logarithm of the Mean Squared Error (MSE) between the ground truth image and the generated image:

MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big[ I(i,j) - K(i,j) \big]^2

PSNR = 20 \log_{10}(MAX_I) - 10 \log_{10}(MSE)

In MSE, I is a noise-free m×n monochrome image (ground truth) and K is the generated image (noisy approximation). In PSNR, MAX_I represents the maximum possible pixel value of the image.
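A small NumPy sketch of these two formulas (8-bit images with MAX_I = 255 are assumed):

```python
import numpy as np

def psnr(ground_truth, generated, max_i=255.0):
    """PSNR in dB between a ground-truth image I and a generated image K."""
    diff = ground_truth.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 20 * np.log10(max_i) - 10 * np.log10(mse)

# Example: ground truth vs. a noisy approximation.
I = np.random.randint(0, 256, (64, 64))
K = np.clip(I + np.random.normal(0, 5, I.shape), 0, 255)
print(round(psnr(I, K), 2))
```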
Network Design

Various network designs in super-resolution architecture

Enough of the basics! Let’s discuss some of the state-of-art super-resolution methods –

Super-Resolution methods

Super-Resolution Generative Adversarial Network (SRGAN) – Uses the idea of a GAN for the super-resolution task, i.e., the generator tries to produce an image that will be judged by the discriminator. Both keep training so that the generator can generate images that match the true training data.

Self-supervised learning (SSL) and reinforcement learning (RL) are both machine learning
techniques, but they differ in how they learn:
● Self-supervised learning: In SSL, models learn from unlabeled data by generating their own
labels. This is a more practical approach than supervised learning, which requires labeled data that
is often expensive and time-consuming to obtain. SSL can be used in computer vision tasks like
image classification, object detection, and semantic segmentation.
● Reinforcement learning: In RL, models learn from feedback from actions taken in an
environment.

What is Self-Supervised Learning?

Self-supervised learning is a deep learning methodology in which a model is pre-trained on unlabeled data and the data labels are generated automatically, to be used as ground truths in subsequent iterations. The fundamental idea is to create supervisory signals by making sense of the unlabeled data in an unsupervised fashion on the first iteration. The model then uses the high-confidence labels among those generated to train itself in subsequent iterations, like a supervised learning model, via backpropagation. The only difference is that the data labels used as ground truths change from iteration to iteration (a minimal sketch of this loop follows below).
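A minimal sketch of that pseudo-labeling loop, assuming PyTorch; the toy model, loader, and confidence threshold are all invented for illustration:

import torch

def self_training_round(model, unlabeled_loader, threshold=0.5):
    # Predict on unlabeled data and keep only high-confidence labels
    pseudo_set = []
    model.eval()
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = torch.softmax(model(x), dim=1)
            conf, labels = probs.max(dim=1)
            keep = conf > threshold          # high-confidence predictions only
            if keep.any():
                pseudo_set.append((x[keep], labels[keep]))
    # These pairs serve as "ground truth" in the next training iteration
    return pseudo_set

model = torch.nn.Linear(10, 3)                       # toy classifier stand-in
loader = [torch.randn(32, 10) for _ in range(4)]     # toy unlabeled batches
pseudo = self_training_round(model, loader)
print(sum(x.size(0) for x, _ in pseudo), "pseudo-labeled examples kept")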

How to Train a Self-Supervised Learning Model in ML
1. Select a property of the data to predict: for example, the next word in a sentence, the orientation of an object in an image, or the speaker of an audio clip (a minimal sketch of one such task follows this list).
2. Define a loss function: The loss function measures the model’s performance on the task of
predicting the property of the data. It should be designed to encourage the model to learn useful
features and representations of the data that are relevant to the task.
3. Train the model: The model is trained on a large dataset by minimizing the loss function. This
is typically done using an optimization algorithm, such as stochastic gradient descent (SGD)
or Adam.
4. Fine-tune the model: Once the model has been trained, it can be fine-tuned on a specific task
by adding a few labeled examples and fine-tuning the model’s weights using supervised
learning techniques. This allows the model to learn task-specific features and further improve
its performance on the target task.
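As an example of step 1, the rotation pretext task generates labels from the unlabeled images themselves. This sketch assumes PyTorch, and the batch of random images is a stand-in for real unlabeled data:

import torch

def make_rotation_batch(images):
    # images: (B, C, H, W) unlabeled batch -> rotated copies + labels 0..3
    rotated, labels = [], []
    for k in range(4):   # 0, 90, 180, 270 degree rotations
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.randn(8, 3, 32, 32)            # stand-in for unlabeled images
x_rot, y_rot = make_rotation_batch(x)    # (32, 3, 32, 32) inputs, 32 labels
# A classifier trained on (x_rot, y_rot) learns features with no human labels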

Application of SSL in Computer Vision


Image and video recognition: Self-supervised learning has been used to improve the performance of image and video recognition tasks, such as object recognition, image classification, and video classification. For example, a self-supervised learning model might be trained to predict the location of an object in an image given the surrounding pixels, or to classify a video as depicting a particular action.

Application of SSL in Natural Language Processing


● Language understanding: Self-supervised learning has been used to improve the
performance of natural language processing (NLP) tasks, such as machine translation,
language modeling, and text classification. For example, a self-supervised learning model
might be trained to predict the next word in a sentence given the previous words, or to classify
a sentence as positive or negative.
● Speech recognition: Self-supervised learning has been used to improve the performance of
speech recognition tasks, such as transcribing audio recordings into text. For example, a self-
supervised learning model might be trained to predict the speaker of an audio clip based on the
characteristics of their voice.
Self-Supervised Learning Techniques
● Pretext tasks: Pretext tasks are auxiliary tasks that can be solved using the inherent structure of the data itself and that are related to the main task. For example, the model might be trained on a pretext task of predicting the rotation of an image, with the goal of improving performance on the main task of image classification.
● Contrastive learning: Contrastive learning is a self-supervised technique that trains a model to distinguish between similar and dissimilar pairs of examples, typically by pulling together the representations of two augmented views of the same example (a positive pair) and pushing apart the representations of different examples (negative pairs). The goal is to learn representations that are robust to such perturbations (a minimal sketch follows below).
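A minimal sketch of an InfoNCE/NT-Xent-style contrastive loss, assuming PyTorch; z1 and z2 stand for embeddings of two augmented views of the same batch, with matching rows as positive pairs and all other rows as negatives (a simplification of the full NT-Xent formulation):

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # cosine similarities between views
    targets = torch.arange(z1.size(0))    # row i of z1 matches row i of z2
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)   # stand-in embeddings
print(contrastive_loss(z1, z2))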

Advantages of Self-Supervised Learning


● Reduced Reliance on Labeled Data: One of the main benefits of self-supervised learning is
that it allows a model to learn useful features and representations of the data without the need
for large amounts of labeled data. This can be particularly useful in situations where it is
expensive or time-consuming to obtain labeled data, or where there is a limited amount of
labeled data available.
● Improved Generalization: Self-supervised learning can improve the generalization
performance of a model, meaning that it is able to make more accurate predictions on unseen
data. This is because self-supervised learning allows a model to learn from the inherent
structure of the data, rather than just memorizing specific examples.
● Transfer Learning: Self-supervised learning can be useful for transfer learning, which
involves using a model trained on one task to improve performance on a related task. By
learning useful features and representations of the data through self-supervised learning, a
model can be more easily adapted to new tasks and environments.
● Scalability: Self-supervised learning can be more scalable than supervised learning, as it
allows a model to learn from a larger dataset without the need for human annotation. This can
be particularly useful in situations where the amount of data is too large to be labeled by
humans.
Limitations of Self-Supervised Learning
● Quality of supervision signal: One of the main limitations of self-supervised learning is that
the quality of the supervision signal can be lower than in supervised learning. This is because
the supervision signal is derived from the data itself, rather than being explicitly provided by
a human annotator. As a result, the supervision signal may be noisy or incomplete, which can
lead to lower performance on the task.
● Limited to certain types of tasks: Self-supervised learning may not be as effective for tasks
where the data is more complex or unstructured.
● The complexity of training: Some self-supervised learning techniques can be more complex
to implement and train than supervised learning techniques. For example, contrastive learning
and unsupervised representation learning can be more challenging to implement and tune than
supervised learning methods.

Reinforcement learning

Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to


maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an agent
learns to achieve a goal in an uncertain, potentially complex environment by performing actions
and receiving feedback through rewards or penalties.

Key Concepts of Reinforcement Learning

● Agent: The learner or decision-maker.


● Environment: Everything the agent interacts with.
● State: A specific situation in which the agent finds itself.
● Action: All possible moves the agent can make.
● Reward: Feedback from the environment based on the action taken.

How Reinforcement Learning Works

RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to maximize
the cumulative reward. This learning process is characterized by the following elements:

● Policy: A strategy used by the agent to determine the next action based on the current state.
● Reward Function: A function that provides a scalar feedback signal based on the state and
action.
● Value Function: A function that estimates the expected cumulative reward from a given state.
● Model of the Environment: A representation of the environment that helps in planning by
predicting future states and rewards.
Example: Navigating a Maze

The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. A classic illustration uses a robot, a diamond, and fire: the goal of the robot is to get the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from its reward; the total reward is calculated when it reaches the final reward, the diamond. A minimal Q-learning sketch of this setup follows.
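This sketch uses tabular Q-learning with NumPy; the 4x4 grid layout, reward values, and hyperparameters are all invented for illustration:

import numpy as np

# 4x4 grid: state = row*4 + col; start at 0, diamond at 15, fire at 5 and 11
SIZE, DIAMOND, FIRES = 4, 15, {5, 11}

def step(s, a):
    # Move up/down/left/right, clamped to the grid; episode ends on diamond/fire
    r, c = divmod(s, SIZE)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    r, c = min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1)
    s2 = r * SIZE + c
    if s2 == DIAMOND: return s2, 10.0, True    # reached the reward
    if s2 in FIRES:   return s2, -10.0, True   # stepped into a hurdle
    return s2, -1.0, False                     # small penalty per step

Q = np.zeros((SIZE * SIZE, 4))
alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount, exploration
for episode in range(2000):
    s, done = 0, False
    while not done:
        # epsilon-greedy policy: mostly exploit, sometimes explore
        a = np.random.randint(4) if np.random.rand() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Q-learning update: nudge Q(s,a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2
print(Q.argmax(axis=1).reshape(SIZE, SIZE))   # learned greedy action per cell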

Main points in Reinforcement learning –


● Input: The input should be an initial state from which the model will start
● Output: There are many possible outputs as there are a variety of solutions to a particular
problem
● Training: The training is based upon the input; the model returns a state, and the user decides whether to reward or punish the model based on its output.
● The model continues to learn.
● The best solution is decided based on the maximum reward.
Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, occurring because of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on the behavior.

Advantages of positive reinforcement:

● Maximizes performance
● Sustains change for a long period of time

Disadvantage: too much reinforcement can lead to an overload of states, which can diminish the results.

2. Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

● Increases behavior
● Helps maintain a minimum standard of performance

Disadvantage: it only provides enough to meet the minimum behavior.

Elements of Reinforcement Learning

i) Policy: Defines the agent’s behavior at a given time.


ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.

Applications of Reinforcement Learning


i) Robotics: Automating tasks in structured environments like manufacturing.
ii) Game Playing: Developing strategies in complex games like chess.
iii) Industrial Control: Real-time adjustments in operations like refinery controls.
iv) Personalized Training Systems: Customizing instruction based on individual needs.

Advantages and Disadvantages of Reinforcement Learning


Advantages:
1. Reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the environment.
4. Reinforcement learning can handle environments that are non-deterministic, meaning that the
outcomes of actions are not always predictable. This is useful in real-world applications where the
environment may change over time or is uncertain.
5. Reinforcement learning can be used to solve a wide range of problems, including those that
involve decision making, control, and optimization.
6. Reinforcement learning is a flexible approach that can be combined with other machine learning
techniques, such as deep learning, to improve performance.
Disadvantages:
1. Reinforcement learning is not preferable to use for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation.
3. Reinforcement learning is highly dependent on the quality of the reward function. If the reward
function is poorly designed, the agent may not learn the desired behavior.
4. Reinforcement learning can be difficult to debug and interpret. It is not always clear why the
agent is behaving in a certain way, which can make it difficult to diagnose and fix problems.
