
Computer Vision

Computer Vision (CV) is a branch of AI focused on enabling computers to interpret visual data, with key tasks including image classification, object detection, and segmentation. Applications span various fields such as robotics, healthcare, and autonomous vehicles, utilizing techniques like convolutional neural networks (CNNs) for image processing. Image representation involves storing images as pixel values, with formats like RGB and methods for enhancing image quality through smoothing, sharpening, and histogram equalization.


Overview of Computer Vision

✅ What is Computer Vision?


●​ Computer Vision (CV) is a field of AI that enables computers to understand and interpret visual data
like images and videos.
●​ Goal: To simulate human vision — detect, identify, and understand objects/scenes.

🔧 Basic Tasks in Computer Vision


●​ Image Classification: What is in the image?
●​ Object Detection: Where is the object in the image?
●​ Segmentation: Which pixels belong to which object?
●​ Face Recognition, Tracking, Motion Estimation, etc.

🧠 How Computer Vision Works (Basic Idea)


1.​ Input: Image or video.
2.​ Processing: Extract features (edges, colors, shapes).
3.​ Understanding: Use models (ML/DL) to classify or detect.

📌 Applications of Computer Vision


🤖 1. Robotics
●​ Object Detection: Robots can locate and identify tools or parts.
●​ Navigation: CV helps robots avoid obstacles and move in real-time (e.g., SLAM).
●​ Grasping: Identify the shape/orientation of objects to pick them.
●​ Inspection: Quality control in manufacturing (e.g., finding defects).

🏥 2. Healthcare
●​ Medical Imaging: Detect tumors, fractures in X-rays, MRIs, CT scans.
●​ Retinal Analysis: For diabetic retinopathy, glaucoma, etc.
●​ Surgical Assistance: Robots guided using CV.
●​ Pathology: Automated detection of cells and abnormalities.

🚗 3. Autonomous Vehicles
●​ Lane Detection: Identify road lanes.
●​ Traffic Sign Recognition: Read and respond to signs.
●​ Pedestrian Detection: For safe driving.
●​ Obstacle Avoidance: Detect other vehicles, people, animals, etc.
●​ Surround View: 360° environment understanding.

🧠 Bonus Tip for Exam:


If asked for an open-ended answer, conclude like this:

"Computer vision continues to grow rapidly and is a key enabler of intelligent systems across various domains
by allowing machines to see, understand, and make decisions."

_____________________________________________
📘 Image Formation – Basic Concepts
1. What is Image Formation?

● It is the process by which a 3D real-world scene is captured as a 2D image on a camera sensor or retina (in humans).
● Happens using light rays that reflect off objects and are collected through a lens or hole.

2. Pinhole Camera Model (Ideal Model)

●​ Imagine a dark box with a small hole on one side.


●​ Light enters through the hole and hits the back wall forming an inverted image.

✅ Key Points:
●​ No lens used.
●​ Simple and distortion-free, but very dim image.
●​ Smaller hole → sharper but darker image.
●​ Larger hole → brighter but blurry image.

3. Real Camera (with Lens)

●​ Modern cameras use a lens instead of a hole.


●​ The lens focuses light onto the sensor (image plane) to form a clear image.

✅ Key Terms:
●​ Lens: Focuses light.
●​ Image Plane: Where the image is formed.
●​ Sensor: Converts light into electrical signals (digital image).

4. Inversion

●​ The image formed is inverted (upside-down and left-right reversed).


●​ Software or brain (in humans) reinterprets it.

5. Light & Image Brightness

●​ More light = brighter image.


●​ Focused light = sharper image.
●​ Oblique light (not perpendicular) = can cause blurring or distortion.

_____________________________________________

📘 Image Representation
✅ What is Image Representation?
It refers to how an image is stored, structured, and processed in a computer — using pixel values.

🟦 1. Digital Image Basics


●​ An image is a 2D grid of pixels (picture elements).
●​ Each pixel has a numerical value that represents intensity (grayscale) or color (RGB)
🖤 2. Grayscale Image
●​ Stored as a 2D matrix.
● Each pixel = single intensity value from 0 (black) to 255 (white).

Example (2×3 image):

[ [0, 125, 255],
  [100, 200, 50] ]

🌈 3. RGB Image (Color)


●​ Stored as a 3D matrix: Width × Height × 3 (for Red, Green, Blue channels).
●​ Each channel is a 2D matrix of values (0–255).

Example (2×2 image):

Red:   [[255, 0], [100, 50]]
Green: [[0, 255], [100, 50]]
Blue:  [[0, 0], [255, 200]]
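
A quick NumPy sketch of the same matrices (NumPy is an assumed tool here; the pixel values are the ones from the examples above):

```python
import numpy as np

# 2x3 grayscale image: one 8-bit intensity per pixel (0 = black, 255 = white)
gray = np.array([[0, 125, 255],
                 [100, 200, 50]], dtype=np.uint8)

# 2x2 RGB image: the three channel matrices stacked along a third axis
red   = np.array([[255, 0], [100, 50]], dtype=np.uint8)
green = np.array([[0, 255], [100, 50]], dtype=np.uint8)
blue  = np.array([[0, 0], [255, 200]], dtype=np.uint8)
rgb = np.dstack([red, green, blue])

print(gray.shape)   # (2, 3)
print(rgb.shape)    # (2, 2, 3) -> 2x2 pixels, 3 channels
print(rgb[0, 1])    # [0 255 0] -> the top-right pixel is pure green
```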

🔳 4. Binary Image
●​ Pixel value is either 0 or 1 (black or white).
●​ Used in basic segmentation and thresholding.

🔢 5. Image Resolution
●​ Resolution = Width × Height
●​ Higher resolution = more pixels = more detail, larger file.

🧮 6. Pixel Depth / Bit Depth


●​ Number of bits used per pixel.
●​ Example:
○​ 8-bit grayscale → 256 shades (2⁸)
○​ 24-bit RGB → 8 bits per channel = ~16 million colors

🧠 7. Image Formats
●​ JPEG, PNG, BMP, TIFF are ways to store images.
○​ JPEG: Compressed, lossy
○​ PNG: Lossless, supports transparency

🔍 8. Coordinate System
●​ Top-left pixel is (0, 0).
●​ X-axis → right, Y-axis → down

_____________________________________________
🧴 1. Smoothing (Blurring)
Goal: Reduce noise or small variations in the image.

✅ Common Methods:
●​ Mean Filter (Average Filter):
○​ Replaces each pixel with the average of neighboring pixels.
○​ Removes noise but can blur edges.
●​ Gaussian Filter:
○​ Uses a Gaussian function to give more weight to central pixels.
○​ Smooths noise better and preserves edges more than the mean filter.

📌 Effect: Softens the image, reduces detail.
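
A short OpenCV sketch of both filters (the file name noisy.jpg and the 5×5 kernel size are placeholder assumptions):

```python
import cv2

img = cv2.imread("noisy.jpg")                 # placeholder input image

# Mean filter: each pixel becomes the average of its 5x5 neighbourhood
mean_blur = cv2.blur(img, (5, 5))

# Gaussian filter: neighbours weighted by a Gaussian (sigma derived from kernel size when 0)
gauss_blur = cv2.GaussianBlur(img, (5, 5), 0)

cv2.imwrite("mean_blur.jpg", mean_blur)
cv2.imwrite("gauss_blur.jpg", gauss_blur)
```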

✏️ 2. Sharpening
Goal: Enhance edges and fine details in the image.

✅ Common Methods:
●​ Laplacian Filter:
○​ Second-order derivative operator.
○​ Highlights regions of rapid intensity change (edges).
●​ Unsharp Masking:
○​ Subtracts a blurred (low-pass) version from the original image.
○ Formula: Sharpened = Original + α · (Original − Blurred)

📌 Effect: Image appears crisper and more detailed.
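
A minimal unsharp-masking sketch with OpenCV (input.jpg, the 9×9 blur kernel, and α = 1.5 are illustrative assumptions):

```python
import cv2

img = cv2.imread("input.jpg")                       # placeholder input image
blurred = cv2.GaussianBlur(img, (9, 9), 0)          # low-pass (blurred) version

alpha = 1.5                                         # sharpening strength
# (1 + alpha)*img - alpha*blurred  ==  img + alpha*(img - blurred)
sharpened = cv2.addWeighted(img, 1 + alpha, blurred, -alpha, 0)

cv2.imwrite("sharpened.jpg", sharpened)
```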

📊 3. Histogram Equalization
Goal: Improve contrast by spreading out intensity values.

✅ Process:
●​ Create the histogram of the image.
●​ Compute the cumulative distribution function (CDF).
●​ Map old pixel values to new ones using the CDF.

📌 Result: Intensity values are spread over the full range, so details in both dark and bright regions become more visible — overall better contrast.
⚠️ Used in:
●​ Medical imaging, satellite images, low-light enhancement.
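
A minimal OpenCV sketch (low_contrast.jpg is a placeholder path; equalizeHist expects a single-channel 8-bit image, so colour images are handled via the luminance channel):

```python
import cv2

gray = cv2.imread("low_contrast.jpg", cv2.IMREAD_GRAYSCALE)
equalized = cv2.equalizeHist(gray)                  # builds histogram, CDF, and remaps values

# For colour images: equalize only the luminance (Y) channel of YCrCb
bgr = cv2.imread("low_contrast.jpg")
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
color_equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```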

__________________________________________________________________________________________

✅ Why RGB Color is Preferred


1. Directly Matches Human Vision

●​ Human eyes have three types of cone cells that detect Red, Green, and Blue light.
●​ RGB aligns with our natural perception.

2. Device-Friendly

●​ Monitors, cameras, and screens all use RGB to capture, display, and store color.
●​ It is the native format for most image sensors.

3. Simple and Efficient

●​ Easy to implement and understand — just three channels.


●​ Well-supported by most image processing libraries (OpenCV, PIL, etc.).

4. Rich Color Representation


●​ RGB allows for a wide range of colors by combining different intensities (0–255) of R,
G, and B.
●​ 24-bit RGB = over 16 million colors.

5. Foundation for Other Models

●​ Other color spaces like HSV, YCbCr, Lab are usually converted from RGB for special
processing (e.g., skin detection, lighting adjustments).

_________________________________________________________________________________________
Key Components:

1.​ Input Layer:


○​ Takes the image as input.
○​ For a color image of size 224×224, input size = 224×224×3 (R, G, B channels).
2.​ Convolutional Layers (Conv Layers):
○​ Extract features using learnable filters (kernels).
○​ Each filter slides over the image and produces a feature map.
3.​ Activation Functions:
○​ Usually ReLU (Rectified Linear Unit).
○​ Adds non-linearity to the model.
4.​ Pooling Layers:
○​ Reduce spatial dimensions (downsampling).
○​ Common: Max Pooling (takes max value in a window).
5.​ Fully Connected (Dense) Layers:
○​ Final layers that interpret features and make predictions.
○​ Each neuron is connected to all activations from the previous layer.
6.​ Output Layer:
○​ Produces the final result.
○​ For classification: Softmax function gives class probabilities.
7.​ Loss Function:
○​ Measures prediction error.
○​ Common in CV: Cross-entropy loss for classification.
8.​ Optimizer:
○​ Updates weights using gradients (e.g., SGD, Adam).
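
A compact sketch of these components in PyTorch (an assumed framework choice; the two-conv-layer architecture, 10 classes, and learning rate are illustrative, not a specific published model):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # conv layer: learnable 3x3 filters
            nn.ReLU(),                                   # non-linearity
            nn.MaxPool2d(2),                             # pooling: 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)   # fully connected layer

    def forward(self, x):                                # x: (batch, 3, 224, 224)
        x = self.features(x)
        return self.classifier(x.flatten(1))             # raw scores; softmax is inside the loss

model = SmallCNN()
criterion = nn.CrossEntropyLoss()                        # cross-entropy (applies softmax internally)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)                          # dummy batch
y = torch.randint(0, 10, (4,))
loss = criterion(model(x), y)                            # measure prediction error
loss.backward()                                          # compute gradients
optimizer.step()                                         # update weights
```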

In Computer Vision, DNNs are used for:

●​ Image Classification (e.g., dog vs. cat)


●​ Object Detection (e.g., YOLO, Faster R-CNN)
●​ Segmentation (e.g., U-Net, Mask R-CNN)
●​ Image Generation (e.g., GANs)
●​ Depth Estimation (e.g., MiDaS)

What is a CNN?

A Convolutional Neural Network (CNN) is a type of deep neural network specially designed to
process images by preserving spatial relationships using convolution operations. It
automatically learns features like edges, textures, shapes, etc., without manual feature
engineering.

Key Layers in CNN:

🔹 Convolutional Layer:
○​ Applies filters (kernels) to input image.
○​ Extracts features like edges, corners, patterns.
○​ Output is called a feature map.
○ Equation: Y(i, j) = Σ_m Σ_n X(i+m, j+n) · K(m, n) (a NumPy sketch follows this list)

🔹 ReLU (Activation Function):


○ Applies non-linearity: ReLU(x) = max(0, x)
○​ Makes model capable of learning complex patterns.

🔹 Pooling Layer:
○​ Reduces spatial size (downsampling).
○​ Max Pooling is common:
■​ Selects the max value from a region (e.g., 2×2).
○​ Benefits: Reduces computation, helps generalization.

🔹 Fully Connected (Dense) Layer:


○​ Final decision-making layers.
○​ Takes the flattened feature maps as input.
○​ Outputs class scores or predictions.

🔹 Softmax Layer (Output):


●​ Converts final scores into probabilities.
●​ Useful for multi-class classification.
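
To make the convolution equation above concrete, here is a small NumPy sketch (example values are arbitrary; note that deep-learning "convolution" is computed without flipping the kernel, exactly as in the equation):

```python
import numpy as np

def conv2d_valid(X, K):
    """Y(i, j) = sum_m sum_n X(i+m, j+n) * K(m, n), 'valid' region only."""
    kh, kw = K.shape
    out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)   # slide the kernel over the image
    return Y

X = np.array([[1, 2, 3, 0],
              [4, 5, 6, 1],
              [7, 8, 9, 2]], dtype=float)
K = np.array([[1, 0],
              [0, -1]], dtype=float)      # toy kernel responding to diagonal intensity changes

feature_map = conv2d_valid(X, K)
activated = np.maximum(feature_map, 0)    # ReLU(x) = max(0, x), applied elementwise
print(feature_map)
print(activated)
```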

_____________________________________________
How R-CNN Works

R-CNN performs object detection in three main steps:

1.​ Region Proposal:


○​ Uses Selective Search to generate around 2000 candidate object regions (region
proposals) from the input image.
2.​ Feature Extraction:
○​ Each region proposal is resized to a fixed size (e.g., 224x224) and passed
through a CNN (like AlexNet or VGG) to extract a feature vector.
3.​ Classification + Bounding Box Regression:
○​ A separate SVM is trained for each object class to classify the feature vectors.
○​ A linear regressor is trained to refine the coordinates of the bounding boxes.

Image segmentation using an image-to-image neural network refers to the task of assigning
a class label to each pixel in an image, using an architecture that takes an image as input and
outputs a mask of the same spatial dimensions.

The most common approach for this is using fully convolutional networks (FCNs) or
advanced variants like U-Net, SegNet, or DeepLab.

✅ Key Idea:
●​ Input: An image (e.g., 256×256×3)
●​ Output: A segmentation mask (e.g., 256×256×C), where C is the number of classes.
●​ The model learns pixel-wise classification

🧠 Architecture: (e.g., U-Net or FCN)


●​ Encoder (Downsampling): Extracts features using convolution + pooling.
●​ Decoder (Upsampling): Reconstructs the spatial resolution using transposed
convolutions or interpolation.
●​ Skip Connections: Merge encoder features with decoder features for precise
localization (U-Net specific).

🔹 1. Semantic Segmentation
📌 Definition:
Assigns a class label to each pixel in the image.

✅ Key Point:
●​ All objects of the same class share the same label.
●​ No distinction between individual object instances.

🔍 Example:
In a street scene:

●​ All pixels belonging to "car" get the same label.


●​ All "roads", "trees", and "sky" are labeled, but not individually identified.

📷 Output:
●​ A pixel-wise map with categories like [car, tree, road, person, sky].
🔹 2. Instance Segmentation
📌 Definition:
Assigns a class label + instance ID to each pixel.

✅ Key Point:
●​ Each individual object is segmented separately, even if they’re the same class.
●​ Combines object detection and semantic segmentation.

🔍 Example:
In the same street scene:

●​ Car1, Car2, Car3 are segmented as different instances, all of class "car".
●​ So are Person1, Person2, etc.

📷 Output:
●​ Pixel-wise masks with object instance separation, e.g., [car#1, car#2, person#1].

🔹 3. Panoptic Segmentation
📌 Definition:
Combines both semantic and instance segmentation in a single output.

✅ Key Point:
●​ Segments all pixels (like semantic).
●​ Differentiates each object instance (like instance).

🔍 Example:
In the street scene:

●​ Every pixel is labeled.


●​ "Sky", "road" (amorphous 'stuff') are given semantic labels.
●​ "Car1", "Car2", "Person1" (countable 'things') are individually segmented.

📷 Output:
●​ A unified segmentation map with both class and instance info.

____________________________________________
Temporal processing refers to handling sequential or time-dependent data in machine
learning or deep learning models. This is essential for tasks where the order and timing of
inputs matter — such as video analysis, time-series forecasting, speech recognition, or human
activity recognition.

🕒 Common Applications:
●​ Video classification or object tracking
●​ Time series forecasting (e.g., stock prediction)
●​ Speech-to-text
●​ Sensor data analysis (e.g., wearable activity monitoring)

🔄 Approaches to Temporal Processing:


1.​ Recurrent Neural Networks (RNNs):
○​ Handle sequences by maintaining a hidden state.
○​ Problem: vanishing gradients.​

2.​ LSTM / GRU:


○​ Improved versions of RNNs that handle long-range dependencies.​

3.​ 1D/3D Convolution:


○​ 1D Conv: good for time series (sliding over time axis).
○​ 3D Conv: for spatiotemporal data (e.g., video: height × width × time).​

4.​ Transformers:
○​ Use attention over sequences, parallelizable.
○​ Dominant in NLP and video understanding.​

5.​ Temporal Convolutional Networks (TCN):


○​ Use dilated convolutions for long-range dependencies.
○​ No recurrence, more efficient.

_____________________________________________
🔁 Recurrent Neural Network (RNN) — Theory
Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential
data, where the order and context of elements matter. Unlike traditional feedforward neural
networks, RNNs have loops that allow information to persist over time steps, making them
ideal for tasks like time series prediction, natural language processing, and speech recognition.

🧠 Key Concept
An RNN processes sequences by maintaining a hidden state that is updated at each time step
based on:

●​ The current input


●​ The previous hidden state

This allows the network to have a sort of memory, capturing dependencies across time steps.

🧮 Mathematical Formulation
At time step t, given:

● Input vector: x_t
● Hidden state from previous step: h_(t−1)
● Output: y_t

The update equations are:

h_t = tanh(W_xh · x_t + W_hh · h_(t−1) + b_h)
y_t = W_hy · h_t + b_y

🔄 Unfolding in Time
RNNs can be "unfolded" across time steps. For example, a sequence of 3 time steps:

x1 → h1 → y1

x2 → h2 → y2

x3 → h3 → y3

Each time step shares the same weights, making it efficient for sequence learning.
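
A tiny NumPy sketch of this unfolded forward pass (layer sizes, random weights, and the 3-step sequence are arbitrary assumptions):

```python
import numpy as np

input_size, hidden_size, output_size = 4, 8, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

def rnn_forward(xs):
    """xs: list of input vectors x_1 ... x_T; returns outputs y_1 ... y_T."""
    h = np.zeros(hidden_size)                       # initial hidden state
    ys = []
    for x in xs:                                    # the same weights are reused at every step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)      # h_t from x_t and h_(t-1)
        ys.append(W_hy @ h + b_y)                   # y_t read out from h_t
    return ys

sequence = [rng.normal(size=input_size) for _ in range(3)]   # x1, x2, x3
y1, y2, y3 = rnn_forward(sequence)
```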

📉 Training: Backpropagation Through Time (BPTT)


●​ The RNN is trained using BPTT, a variant of backpropagation that unfolds the network
over time.
●​ Gradients are calculated through all time steps.
●​ Can suffer from vanishing or exploding gradients, which make learning long-term
dependencies difficult.

🔍 Limitations of Basic RNNs


●​ Struggles with long-range dependencies.
●​ Prone to vanishing/exploding gradients.
●​ Sequential processing makes it slower for long sequences.

🧬 Variants
To overcome limitations, advanced architectures were developed:

●​ LSTM (Long Short-Term Memory): Introduces gates to control information flow.


●​ GRU (Gated Recurrent Unit): A simplified version of LSTM.

🎯 Applications
●​ Language modeling & text generation
●​ Machine translation
●​ Sentiment analysis
●​ Speech recognition
●​ Time series forecasting
●​ Human activity recognition
Anomaly Detection in Images using Autoencoders

Anomaly detection involves identifying patterns in data that do not conform to expected
behavior. In the context of images, anomalies are regions or entire images that are different
from the typical image patterns in a dataset. This can be useful in various applications such as
medical image analysis, industrial defect detection, and security monitoring.

One of the most effective methods for image anomaly detection is using Autoencoders — a
type of neural network that learns to compress and then reconstruct its input.

🧠 How Autoencoders Work for Anomaly Detection


An autoencoder consists of two parts:

1.​ Encoder: Compresses the input into a lower-dimensional representation (latent space).
2.​ Decoder: Reconstructs the original input from the compressed representation.

In anomaly detection, the autoencoder is trained on normal images. Once trained, it should
be able to reconstruct normal images well and fail to reconstruct anomalous images
accurately (i.e., large reconstruction error).

Steps for Anomaly Detection:

1.​ Train an autoencoder on normal (non-anomalous) images.


2.​ Reconstruct the test image (normal or anomalous).
3.​ Measure the reconstruction error (e.g., Mean Squared Error between input and
output).
4.​ If the error exceeds a threshold, classify the image as anomalous.
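
A minimal sketch of this pipeline in PyTorch (assumed framework; the 28×28 grayscale input size, random tensors standing in for real data, and the 0.02 threshold are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(28 * 28, 64), nn.ReLU(),
                                     nn.Linear(64, 16))                      # latent space
        self.decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                     nn.Linear(64, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x)).view(-1, 1, 28, 28)

model, mse = AE(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

normal_batch = torch.rand(32, 1, 28, 28)            # stand-in for real normal images
for _ in range(10):                                 # training on normal images only
    optimizer.zero_grad()
    loss = mse(model(normal_batch), normal_batch)   # reconstruction loss
    loss.backward()
    optimizer.step()

# Test time: per-image reconstruction error vs. a chosen threshold
test = torch.rand(5, 1, 28, 28)
with torch.no_grad():
    errors = ((model(test) - test) ** 2).mean(dim=(1, 2, 3))
is_anomalous = errors > 0.02                        # threshold tuned on validation data in practice
```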

🧑‍💻 Autoencoder Architecture for Anomaly Detection


1.​ Encoder: The encoder reduces the dimensionality of the input image to a latent
representation (typically smaller than the input).
2.​ Decoder: The decoder reconstructs the input image from the latent representation. This
process tries to preserve the key features of the input image.
3.​ Loss Function: Mean Squared Error (MSE) or other loss functions measure how well
the autoencoder reconstructs the image.​
Well-known CNN architectures: AlexNet, VGG, ResNet.

_____________________________________________

Hue: A color attribute that describes a pure color (pure yellow, orange, or red).

Saturation: A measure of how much a pure color is diluted with white light.

Intensity: Brightness is subjective and hard to measure, so intensity is used instead: an achromatic quantity, like the gray levels in a grayscale image.

CMYK components: CYAN - C, MAGENTA - M, YELLOW - Y, BLACK (Key) - K

ADDITIVE mixing (RGB primaries): G + B = CYAN || B + R = MAGENTA || R + G = YELLOW || R + G + B = WHITE

SUBTRACTIVE mixing (CMY primaries): C + M = BLUE || M + Y = RED || C + Y = GREEN || C + M + Y = BLACK

_____________________________________________
TRANSFORMATION
Spatial domain to frequency domain.
Image transformations like Fourier Transform and Discrete Cosine Transform (DCT) convert images from
the spatial domain (pixel-based) to the frequency domain (based on patterns of intensity change).

1. Fourier Transform (DFT / FFT)


📌 Purpose:
Decomposes an image into sine and cosine waves (frequencies). Used in image
compression, denoising, filtering, and pattern recognition.

🧠 Concept:
●​ Low frequencies: represent smooth regions
●​ High frequencies: represent edges and fine details.

FFT (Fast Fourier Transform): an optimized version of the DFT with O(n log n) complexity.

2. Discrete Cosine Transform (DCT)


📌 Purpose:
Like Fourier but uses only cosine functions — more efficient and better energy compaction for
image compression (e.g., JPEG).
🧠 Concept:
●​ Most information (energy) is packed in few DCT coefficients
●​ Remaining (higher-frequency) coefficients are often near zero → can be discarded
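
A small OpenCV sketch of DCT energy compaction (photo.jpg and the 64×64 block of kept coefficients are arbitrary assumptions; OpenCV's DCT expects even image dimensions):

```python
import cv2
import numpy as np

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
h, w = gray.shape
gray = gray[:h - h % 2, :w - w % 2]     # crop to even dimensions for cv2.dct

coeffs = cv2.dct(gray)                  # most energy lands in the top-left (low-frequency) corner

kept = np.zeros_like(coeffs)
k = 64                                  # keep only a small low-frequency block
kept[:k, :k] = coeffs[:k, :k]

approx = cv2.idct(kept)                 # reconstruct a close approximation from few coefficients
cv2.imwrite("dct_approx.jpg", np.clip(approx, 0, 255).astype(np.uint8))
```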

🔎 What Is a "Blob"?
A blob is a group of connected pixels that share similar intensity or texture and represent
a region of interest:

●​ Can be light blobs on dark background or vice versa.


●​ Used in object detection, feature extraction, keypoint matching, etc.

_____________________________________________

Title: Scale-Invariant Feature Transform (SIFT)


🔹 Goal of SIFT:
●​ Detect salient, stable feature points in images.
●​ These are interest points (keypoints) that don’t change even when the image:
○​ Rotates
○​ Scales (zooms in or out)
○​ Changes in brightness

So we describe a small region around each keypoint in a way that is:

●​ Rotation-invariant
●​ Scale-invariant

✅ Steps of SIFT (on this page)


1. Scale-Space Extrema Detection

●​ Purpose: Find candidate keypoints at multiple scales.


●​ Technique: Use Difference of Gaussians (DoG) to detect blobs/spots that are:
○​ Brighter or darker than surroundings.
●​ DoG = Difference between two blurred images (Gaussian-blurred at different
scales)
2. Accurate Keypoint Localization

●​ Purpose: Eliminate unstable keypoints (e.g. edge points or low contrast).


●​ Use Taylor Series Expansion to fit a curve to the DoG function for subpixel
accuracy.
✏️ Step 3: Orientation Assignment
🌟 Purpose:
To make the keypoints rotation-invariant, we assign an orientation to each keypoint.

🔄 What’s happening:
●​ You analyze the gradient directions around the keypoint.
●​ You build a histogram of gradient directions in the local region.
●​ The highest peak in the histogram becomes the main orientation of the keypoint.

👉 If multiple peaks are strong (above 80% of the maximum), then multiple keypoints are
created at the same location but with different orientations.

📝 Example in your notes:​


"If multiple peaks are present → create multiple descriptors for each orientation."

🔍 This helps the algorithm recognize the same object even when it's rotated in different
images.

✏️ Step 4: Descriptor Generation


📦 Purpose:
To create a unique fingerprint for each keypoint — this is used for matching later.

🧠 What happens here:


●​ You take a small region around the keypoint (typically 16x16 pixels).
●​ Divide this region into 4×4 smaller cells.
●​ For each cell, build a histogram of gradient directions (8 bins per histogram).

📊 This gives:
●​ 4×4 = 16 cells
●​ Each with 8-bin histogram​
→ Total of 128 values (16 × 8)​
This is the SIFT descriptor vector.
●​ "Each entry is weighted by gradient magnitude and Gaussian weighting."
●​ So closer pixels and stronger edges get more importance.
●​ The result is a 128-dimensional vector that describes the local patch.
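
A short OpenCV sketch of SIFT in practice (scene.jpg is a placeholder path; SIFT_create is available in recent opencv-python builds):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints))          # number of detected keypoints
print(descriptors.shape)       # (num_keypoints, 128): one 128-D descriptor per keypoint

vis = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)  # shows size + orientation
cv2.imwrite("sift_keypoints.jpg", vis)
```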
🔍 What is SURF?
SURF = Speeded Up Robust Features​
It is used to:

1.​ Detect important points (keypoints) in an image


2.​ Describe the region around those points
3.​ Match these features across images

💡 SURF Pipeline (as per your notes):


1.​ Interest Point Detection
○​ SURF uses the Hessian matrix to detect blobs (areas of sudden change in
intensity).
○​ It's similar to how SIFT detects keypoints, but faster.
2.​ Local Neighbourhood Description
○​ Describes the area around the keypoint using a Haar wavelet (a kind of fast
edge detector).
3.​ Matching
○​ Compare descriptors between two images to find matching points.

🔬 Key Concepts from Your Notes:


🧠 “SURF uses blob detection”:
It looks for "blobs" — spots in the image that are visually distinctive, e.g., corners, or
points with sharp intensity changes.

🧮 Hessian Matrix (used in SURF for keypoint detection):


The Hessian matrix is a mathematical tool that finds where the intensity in an image
changes sharply (i.e., where blobs or corners are):

H = | Lxx  Lxy |
    | Lxy  Lyy |

Where:

●​ Lxx = second-order derivative in x direction


●​ Lyy = second-order derivative in y direction
●​ Lxy = mixed second-order derivative
The determinant of the Hessian matrix helps in identifying keypoints:

Det(H) = Lxx * Lyy - (0.9 * Lxy)^2

🔹 The constant 0.9 is used to balance the scale of the approximated derivatives.


🔄 Orientation Assignment:
This is done using Haar wavelet responses in the x and y directions:

V = [Σdx, Σdy]

●​ dx and dy are responses to Haar wavelets.


●​ The orientation vector V gives the dominant direction around the keypoint.

🔗 Descriptor Vector:
After finding the keypoint and its orientation, SURF builds a descriptor (just like SIFT
does) using wavelet responses, which makes it faster.

The descriptor vector is represented as:

[Σdx, Σdy, |Σdx|, |Σdy|]

This compact descriptor is what is matched across different images.
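
A hedged OpenCV sketch (SURF lives in the contrib module, so it needs opencv-contrib-python and is disabled in some builds for patent reasons; the Hessian threshold of 400 is an arbitrary assumption, and ORB is a free alternative if SURF is unavailable):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder path

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)    # keypoints via the Hessian determinant
keypoints, descriptors = surf.detectAndCompute(gray, None)

print(descriptors.shape)       # (num_keypoints, 64) for the standard SURF descriptor
```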

Image Matching

●​ Find correspondence between different image pairs.


●​ Establish matching & correspondence between feature points across frames or views.
●​ Used in stereo, structure from motion, SLAM (Simultaneous Localization and Mapping).

Applications

●​ Triangulation: Locate the actual 3D point in the scene from its two (or more)
corresponding views.

Types of Matching

1. Area-Based

●​ Directly compares pixel values


●​ Window-based correlation (intensity matching)
●​ Fast but not reliable
2. Feature-Based

●​ Uses descriptors (points, corners, blobs)


●​ More robust
●​ Match feature vectors

Matching Techniques

●​ Feature Vector Matching​


→ Compute distance (e.g., Euclidean)​

●​ Reliability Measures​
→ RANSAC (Random Sample Consensus)​
→ Distance ratio​
→ Non-consistent matches removal​
→ Cross-check (symmetric match)​

●​ Robust Matching​
→ Voting​
→ Histogram​
→ Hough transform​
→ Epipolar geometry​
→ Graph-based​
→ Bag of words​
→ Best match

RANSAC Algorithm Steps :(Random Sample Consensus)


1.​ Select any 4 features at random​
➤ You need 4 point correspondences (keypoint pairs) to compute a homography matrix
between two images.​

2.​ Compute Homography​


➤ Calculate the transformation matrix H that maps points from one image to the other
using the selected 4 points.​

3.​ Compute inliers where SSD (p’, Hp) < ε​


➤ For each matching point, check if the SSD (Sum of Squared Differences) between
the transformed point Hp and actual point p’ is below a small threshold ε.​
➤ If it is, it’s considered an inlier (i.e., a correct match).​

4.​ Keep largest set of inliers​


➤ Repeat the above steps multiple times, and keep the largest set of points that agree
with the best homography.​

5.​ Recompute least squares fit for all inliers​


➤ Finally, use all the inliers to refine the homography matrix using a more accurate
least squares method.
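
A minimal sketch of steps 1–5 via OpenCV's findHomography (the random points are stand-ins for real matched keypoints, and the 5-pixel threshold ε is an assumption):

```python
import cv2
import numpy as np

# Matched keypoint coordinates from two images, shape (N, 1, 2); random stand-ins here
src_pts = np.float32(np.random.rand(50, 1, 2) * 500)
dst_pts = src_pts + 10                    # pretend the second image is shifted by (10, 10)

# RANSAC repeatedly samples 4 correspondences, fits H, and keeps the largest inlier set
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

print(H)                                  # 3x3 homography (close to a pure translation here)
print(int(mask.sum()))                    # number of inliers consistent with H
```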

🧠 1. Object Detection vs. Segmentation


●​ Object Detection: Identifying where objects are in an image (bounding boxes + class
labels).
●​ Segmentation: Identifying which pixels belong to each object.
○​ Semantic Segmentation: Labels each pixel with a class (e.g., “car”, “road”).
○​ Instance Segmentation: Separates different instances of the same class.

✂️ 2. Edge Detection
Detects boundaries of objects using gradients.

●​ Key Techniques:
○​ Sobel Operator
○​ Prewitt Operator
○​ Canny Edge Detector: Most used, multi-stage (smoothing, gradient calc,
non-max suppression, thresholding).
○​ Laplacian of Gaussian (LoG): Finds edges using second-order derivatives.
○​ Difference of Gaussian (DoG): Approximates LoG, used in SIFT.
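
A short OpenCV sketch of Sobel gradients and the Canny detector (scene.jpg and the 100/200 thresholds are placeholder assumptions to tune per image):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Sobel: first-order derivatives in x and y
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# Canny: smoothing, gradient, non-max suppression, hysteresis thresholding in one call
edges = cv2.Canny(gray, 100, 200)
cv2.imwrite("edges.jpg", edges)
```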

🌾 3. Texture Analysis
Describes patterns or variations in image intensity.

●​ Statistical Methods:
○​ Gray-Level Co-occurrence Matrix (GLCM): Measures texture features like
contrast, correlation, homogeneity.
○​ Local Binary Patterns (LBP): Encodes local texture by thresholding
neighborhood.
●​ Transform-Based:
○​ Fourier Transform
○​ Gabor Filters
○​ Wavelets

🧩 4. Region-Based Segmentation
Groups pixels into regions based on similarity.

●​ Region Growing: Start from seed points and grow based on similarity.
●​ Region Splitting & Merging: Divide image, then merge similar regions.
●​ Watershed Algorithm: Treats image as a topographic surface.
●​ Graph-Based Methods (e.g., Normalized Cuts)


🔍 1. Matching
Matching is the process of finding correspondences between features (points, patches, or
regions) in different images.

🧱 A. Types of Matching
●​ Feature-Based Matching: Match keypoints using descriptors.
●​ Template Matching: Slide a template over the image and compare.
●​ Area-Based Matching: Use windows/patches (e.g., SSD, NCC).
●​ Descriptor Matching:
○​ Distance Metrics: Euclidean, SSD (Sum of Squared Differences), Cosine,
Hamming.
○​ Matching Algorithms: Brute Force, k-NN, FLANN.

🔧 B. Steps in Feature Matching


1.​ Feature Detection: e.g., SIFT, SURF, ORB.
2.​ Feature Description: Compute descriptors.
3.​ Feature Matching: Use distance metric to find best matches.
4.​ Filtering Matches: Ratio test (Lowe’s), RANSAC for homography.
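
A compact sketch of these four steps with SIFT + brute-force matching (left.jpg/right.jpg and the 0.75 ratio are assumptions; Lowe's ratio test keeps only distinctive matches):

```python
import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)    # detection + description
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)                  # Euclidean distance between descriptors
matches = bf.knnMatch(des1, des2, k=2)           # two nearest neighbours per descriptor

good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test
print(len(good))
```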

🧠 2. Recognition
Recognition means identifying what object or category is present in the image.

🎯 A. Types
●​ Object Recognition: e.g., recognize a "cat" or "car".
●​ Face Recognition: Identify people by faces.
●​ Scene Recognition: e.g., “indoor” vs. “outdoor”.

🔍 B. Techniques
●​ Template Matching: Match with stored image patterns (rigid).
●​ Feature-Based: Match extracted features to database features.
●​ Bag of Visual Words (BoVW): Treat local features as words and do classification.
●​ Machine Learning Classifiers:​

○​ SVMs
○​ KNN
○​ Random Forests​
●​ Deep Learning:​

○​ CNNs for image classification


○​ Pretrained models (ResNet, VGG, Inception)
○​ Fine-tuning for specific tasks

🔗 1. Fusion
Image Fusion means combining multiple images into a single enhanced image, retaining
complementary information.

📸 A. Types
●​ Multi-focus Fusion: Combine images with different focus areas.
●​ Multi-sensor Fusion: e.g., Thermal + RGB for surveillance.
●​ Multi-exposure Fusion: Combine HDR images.

⚙️ B. Techniques
●​ Pixel-level Fusion:
○​ Average or max pixel intensities.
●​ Feature-level Fusion:
○​ Extract features (edges, textures), then combine.
●​ Decision-level Fusion:
○​ Fuse decisions from multiple models/sources.
●​ Wavelet/Transform Fusion:
○​ Decompose images (e.g., DWT), fuse components, reconstruct.
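
A tiny pixel-level fusion sketch (the two input paths are placeholders; the images are assumed to be registered and the same size):

```python
import cv2
import numpy as np

a = cv2.imread("focus_left.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
b = cv2.imread("focus_right.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

fused_avg = (a + b) / 2          # average pixel intensities
fused_max = np.maximum(a, b)     # per-pixel maximum

cv2.imwrite("fused_avg.jpg", fused_avg.astype(np.uint8))
cv2.imwrite("fused_max.jpg", fused_max.astype(np.uint8))
```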

✅ Use Cases
●​ Medical imaging (e.g., CT + MRI)
●​ Surveillance
●​ Robotics (e.g., vision + LIDAR)

Image Fusion (based on Steerable Transform + SVD algorithm)

Input: Source images X and Y which must be registered

Output: Fused Image (F)

Steps:

1. Decompose source images X and Y using Steerable Transform (ST)


2. Fuse the low-pass sub-bands of the source images, with the best low-pass sub-band estimated by SVD.

3. Fuse the high-pass ST coefficients of the source images using a selection rule.

4. Combine the selected high-pass ST coefficients with the fused low-pass sub-band from Step 2.

5. Reconstruct the image by applying inverse transform.

6. Display fused image (F)

📐 2. Image Alignment
Image Alignment refers to registering two or more images so their contents line up accurately.

🔍 A. Steps
1.​ Detect Features: SIFT, SURF, ORB.
2.​ Describe Features: Compute feature descriptors.
3.​ Match Features: Use SSD, ratio test, etc.
4.​ Estimate Transformation:
○​ Affine: preserves lines/parallelism
○​ Homography: projective transformation (used for stitching)
5.​ Warp Image using transformation matrix.

🧪 Homography Example:
If point (x, y) in Image A maps to (x', y') in Image B:

[x', y', 1]ᵀ = H * [x, y, 1]ᵀ

Solve for H using at least 4 corresponding points.
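
A minimal OpenCV sketch of exactly this: solving H from 4 hand-picked point pairs and warping (imageA.jpg and the coordinates are illustrative assumptions; with more than 4 noisy matches, cv2.findHomography with RANSAC is used instead):

```python
import cv2
import numpy as np

img = cv2.imread("imageA.jpg")

# Four corresponding points in image A and image B
pts_a = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])
pts_b = np.float32([[20, 30], [380, 10], [410, 290], [5, 320]])

H = cv2.getPerspectiveTransform(pts_a, pts_b)     # exact solution from 4 point pairs
warped = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
cv2.imwrite("warped.jpg", warped)
```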

1. Translation

Definition: Moves (shifts) an image in the x and/or y direction without rotating or scaling it.

Transformation Matrix:

1 0 tx
0 1 ty
0 0 1
Use Case: When the image is simply displaced but not deformed.

2. Affine Transformation

Definition: Preserves lines and parallelism (but not necessarily distances and angles). Includes
translation, rotation, scaling, and shearing.

Transformation Matrix:

a_{11} a_{12} t_x
a_{21} a_{22} t_y
0 0 1

Properties:

Straight lines remain straight

Parallel lines remain parallel

Use Case: Mapping between two images when the camera undergoes rotation, scaling, or
shear.

3. Homography (Projective Transformation)

Definition: A more general transformation that can map a plane to another plane under
perspective. It includes all affine transformations and more.

Transformation Matrix:

h_{11} h_{12} h_{13}
h_{21} h_{22} h_{23}
h_{31} h_{32} h_{33}

Parameters: 8 (one parameter is redundant due to scale)

Properties:

Straight lines remain straight

Can handle perspective distortion

Use Case: Used in panorama stitching, object detection, camera calibration, AR, etc.
🧵 3. Image Stitching
Stitching involves aligning and blending multiple overlapping images into a seamless panorama.

⚙️ A. Steps
1.​ Detect & Match Features (like image alignment)
2.​ Estimate Homography between image pairs
3.​ Warp Images to align with a reference frame
4.​ Blend Images:
○​ Linear blending
○​ Multi-band blending (Laplacian pyramids)
○​ Seam finding to remove visible boundaries

🔄 Automatic Tools
●​ OpenCV’s cv2.Stitcher_create()
●​ Python libraries: OpenCV, ImageAI, AutoStitch

✅ Use Cases
●​ Panorama creation
●​ Aerial/mosaic imaging
●​ Document scanning apps
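
A minimal sketch using OpenCV's high-level stitcher mentioned above (the pano_*.jpg paths are placeholders for overlapping photos):

```python
import cv2

images = [cv2.imread(p) for p in ["pano_1.jpg", "pano_2.jpg", "pano_3.jpg"]]

stitcher = cv2.Stitcher_create()          # handles matching, homography, warping, blending
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print("Stitching failed, status:", status)
```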

🔄 Image Stitching Algorithm: Step-by-Step


1. Image Acquisition

●​ Capture or load multiple images with overlapping fields of view.

2. Feature Detection

●​ Detect distinctive keypoints in each image


●​ Algorithms used:
○​ SIFT (Scale-Invariant Feature Transform)
○​ SURF (Speeded Up Robust Features)
○​ ORB (Oriented FAST and Rotated BRIEF)

3. Feature Description

●​ Extract descriptors for each keypoint to characterize its neighborhood


●​ These descriptors help in matching features across images.
4. Feature Matching

●​ Match features between pairs of overlapping images.


●​ Algorithms used:
○​ Brute Force Matcher
○​ FLANN (Fast Library for Approximate Nearest Neighbors)

5. Homography Estimation

●​ Compute the transformation (homography matrix) between matched


images.
●​ Use RANSAC (Random Sample Consensus) to eliminate outliers and find
the best transformation.

6. Image Warping

●​ Apply the homography to warp images into a common coordinate system.

7. Image Blending

●​ Blend the warped images to create a seamless result.


●​ Techniques:
○​ Feathering (simple averaging)
○​ Multi-band blending (Laplacian pyramids)
○​ Seam optimization (Graph cuts)

8. Output Generation

●​ Crop black borders or non-overlapping regions.


●​ Save or display the final stitched panorama.
