
Computer Vision

Computer Vision (CV) is a branch of AI focused on enabling computers to interpret visual data, with key tasks including image classification, object detection, and segmentation. Applications span various fields such as robotics, healthcare, and autonomous vehicles, utilizing techniques like convolutional neural networks (CNNs) for image processing. Image representation involves storing images as pixel values, with formats like RGB and methods for enhancing image quality through smoothing, sharpening, and histogram equalization.


Overview of Computer Vision

✅ What is Computer Vision?


●​ Computer Vision (CV) is a field of AI that enables computers to understand and interpret visual data
like images and videos.
●​ Goal: To simulate human vision — detect, identify, and understand objects/scenes.

🔧 Basic Tasks in Computer Vision


●​ Image Classification: What is in the image?
●​ Object Detection: Where is the object in the image?
●​ Segmentation: Which pixels belong to which object?
●​ Face Recognition, Tracking, Motion Estimation, etc.

🧠 How Computer Vision Works (Basic Idea)


1.​ Input: Image or video.
2.​ Processing: Extract features (edges, colors, shapes).
3.​ Understanding: Use models (ML/DL) to classify or detect.

📌 Applications of Computer Vision


🤖 1. Robotics
●​ Object Detection: Robots can locate and identify tools or parts.
●​ Navigation: CV helps robots avoid obstacles and move in real-time (e.g., SLAM).
●​ Grasping: Identify the shape/orientation of objects to pick them.
●​ Inspection: Quality control in manufacturing (e.g., finding defects).

🏥 2. Healthcare
●​ Medical Imaging: Detect tumors, fractures in X-rays, MRIs, CT scans.
●​ Retinal Analysis: For diabetic retinopathy, glaucoma, etc.
●​ Surgical Assistance: Robots guided using CV.
●​ Pathology: Automated detection of cells and abnormalities.

🚗 3. Autonomous Vehicles
●​ Lane Detection: Identify road lanes.
●​ Traffic Sign Recognition: Read and respond to signs.
●​ Pedestrian Detection: For safe driving.
●​ Obstacle Avoidance: Detect other vehicles, people, animals, etc.
●​ Surround View: 360° environment understanding.

🧠 Bonus Tip for Exam:


If asked for an open-ended answer, conclude like this:

"Computer vision continues to grow rapidly and is a key enabler of intelligent systems across various domains
by allowing machines to see, understand, and make decisions."

_____________________________________________
📘 Image Formation – Basic Concepts
1. What is Image Formation?

● It is the process by which a 3D real-world scene is captured as a 2D image on a camera sensor or retina (in humans).
● Happens using light rays that reflect off objects and are collected through a lens or hole.

2. Pinhole Camera Model (Ideal Model)

●​ Imagine a dark box with a small hole on one side.


●​ Light enters through the hole and hits the back wall forming an inverted image.

✅ Key Points:
●​ No lens used.
●​ Simple and distortion-free, but very dim image.
●​ Smaller hole → sharper but darker image.
●​ Larger hole → brighter but blurry image.

3. Real Camera (with Lens)

●​ Modern cameras use a lens instead of a hole.


●​ The lens focuses light onto the sensor (image plane) to form a clear image.

✅ Key Terms:
●​ Lens: Focuses light.
●​ Image Plane: Where the image is formed.
●​ Sensor: Converts light into electrical signals (digital image).

4. Inversion

●​ The image formed is inverted (upside-down and left-right reversed).


●​ Software or brain (in humans) reinterprets it.

5. Light & Image Brightness

●​ More light = brighter image.


●​ Focused light = sharper image.
●​ Oblique light (not perpendicular) = can cause blurring or distortion.

_____________________________________________

📘 Image Representation
✅ What is Image Representation?
It refers to how an image is stored, structured, and processed in a computer — using pixel values.

🟦 1. Digital Image Basics


●​ An image is a 2D grid of pixels (picture elements).
●​ Each pixel has a numerical value that represents intensity (grayscale) or color (RGB)
🖤 2. Grayscale Image
●​ Stored as a 2D matrix.
● Each pixel = single intensity value from 0 (black) to 255 (white).

Example (2×3 image):

[ [0, 125, 255],
  [100, 200, 50] ]

🌈 3. RGB Image (Color)


●​ Stored as a 3D matrix: Width × Height × 3 (for Red, Green, Blue channels).
●​ Each channel is a 2D matrix of values (0–255).

Example (2×2 image):

Red:   [[255, 0], [100, 50]]
Green: [[0, 255], [100, 50]]
Blue:  [[0, 0], [255, 200]]
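
A quick NumPy sketch of the same matrices (NumPy is an assumed tool here; the pixel values are the ones from the examples above):

```python
import numpy as np

# 2x3 grayscale image: one 8-bit intensity per pixel (0 = black, 255 = white)
gray = np.array([[0, 125, 255],
                 [100, 200, 50]], dtype=np.uint8)

# 2x2 RGB image: the three channel matrices stacked along a third axis
red   = np.array([[255, 0], [100, 50]], dtype=np.uint8)
green = np.array([[0, 255], [100, 50]], dtype=np.uint8)
blue  = np.array([[0, 0], [255, 200]], dtype=np.uint8)
rgb = np.dstack([red, green, blue])

print(gray.shape)   # (2, 3)
print(rgb.shape)    # (2, 2, 3) -> 2x2 pixels, 3 channels
print(rgb[0, 1])    # [0 255 0] -> the top-right pixel is pure green
```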

🔳 4. Binary Image
●​ Pixel value is either 0 or 1 (black or white).
●​ Used in basic segmentation and thresholding.

🔢 5. Image Resolution
●​ Resolution = Width × Height
●​ Higher resolution = more pixels = more detail, larger file.

🧮 6. Pixel Depth / Bit Depth


●​ Number of bits used per pixel.
●​ Example:
○​ 8-bit grayscale → 256 shades (2⁸)
○​ 24-bit RGB → 8 bits per channel = ~16 million colors

🧠 7. Image Formats
●​ JPEG, PNG, BMP, TIFF are ways to store images.
○​ JPEG: Compressed, lossy
○​ PNG: Lossless, supports transparency

🔍 8. Coordinate System
●​ Top-left pixel is (0, 0).
●​ X-axis → right, Y-axis → down

_____________________________________________
🧴 1. Smoothing (Blurring)
Goal: Reduce noise or small variations in the image.

✅ Common Methods:
●​ Mean Filter (Average Filter):
○​ Replaces each pixel with the average of neighboring pixels.
○​ Removes noise but can blur edges.
●​ Gaussian Filter:
○​ Uses a Gaussian function to give more weight to central pixels.
○​ Smooths noise better and preserves edges more than the mean filter.

📌 Effect: Softens the image, reduces detail.
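
A short OpenCV sketch of both filters (the file name noisy.jpg and the 5×5 kernel size are placeholder assumptions):

```python
import cv2

img = cv2.imread("noisy.jpg")                 # placeholder input image

# Mean filter: each pixel becomes the average of its 5x5 neighbourhood
mean_blur = cv2.blur(img, (5, 5))

# Gaussian filter: neighbours weighted by a Gaussian (sigma derived from kernel size when 0)
gauss_blur = cv2.GaussianBlur(img, (5, 5), 0)

cv2.imwrite("mean_blur.jpg", mean_blur)
cv2.imwrite("gauss_blur.jpg", gauss_blur)
```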

✏️ 2. Sharpening
Goal: Enhance edges and fine details in the image.

✅ Common Methods:
●​ Laplacian Filter:
○​ Second-order derivative operator.
○​ Highlights regions of rapid intensity change (edges).
●​ Unsharp Masking:
○​ Subtracts a blurred (low-pass) version from the original image.
○ Formula: Sharpened = Original + α · (Original − Blurred)

📌 Effect: Image appears crisper and more detailed.
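
A minimal unsharp-masking sketch with OpenCV (input.jpg, the 9×9 blur kernel, and α = 1.5 are illustrative assumptions):

```python
import cv2

img = cv2.imread("input.jpg")                       # placeholder input image
blurred = cv2.GaussianBlur(img, (9, 9), 0)          # low-pass (blurred) version

alpha = 1.5                                         # sharpening strength
# (1 + alpha)*img - alpha*blurred  ==  img + alpha*(img - blurred)
sharpened = cv2.addWeighted(img, 1 + alpha, blurred, -alpha, 0)

cv2.imwrite("sharpened.jpg", sharpened)
```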

📊 3. Histogram Equalization
Goal: Improve contrast by spreading out intensity values.

✅ Process:
●​ Create the histogram of the image.
●​ Compute the cumulative distribution function (CDF).
●​ Map old pixel values to new ones using the CDF.

📌 Result: Intensity values are spread over the full range, so details in both dark and bright regions become more visible — overall better contrast.
⚠️ Used in:
●​ Medical imaging, satellite images, low-light enhancement.
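
A minimal OpenCV sketch (low_contrast.jpg is a placeholder path; equalizeHist expects a single-channel 8-bit image, so colour images are handled via the luminance channel):

```python
import cv2

gray = cv2.imread("low_contrast.jpg", cv2.IMREAD_GRAYSCALE)
equalized = cv2.equalizeHist(gray)                  # builds histogram, CDF, and remaps values

# For colour images: equalize only the luminance (Y) channel of YCrCb
bgr = cv2.imread("low_contrast.jpg")
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
color_equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```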

__________________________________________________________________________________________

✅ Why RGB Color is Preferred


1. Directly Matches Human Vision

●​ Human eyes have three types of cone cells that detect Red, Green, and Blue light.
●​ RGB aligns with our natural perception.

2. Device-Friendly

●​ Monitors, cameras, and screens all use RGB to capture, display, and store color.
●​ It is the native format for most image sensors.

3. Simple and Efficient

●​ Easy to implement and understand — just three channels.


●​ Well-supported by most image processing libraries (OpenCV, PIL, etc.).

4. Rich Color Representation


●​ RGB allows for a wide range of colors by combining different intensities (0–255) of R,
G, and B.
●​ 24-bit RGB = over 16 million colors.

5. Foundation for Other Models

●​ Other color spaces like HSV, YCbCr, Lab are usually converted from RGB for special
processing (e.g., skin detection, lighting adjustments).

_________________________________________________________________________________________
Key Components:

1.​ Input Layer:


○​ Takes the image as input.
○​ For a color image of size 224×224, input size = 224×224×3 (R, G, B channels).
2.​ Convolutional Layers (Conv Layers):
○​ Extract features using learnable filters (kernels).
○​ Each filter slides over the image and produces a feature map.
3.​ Activation Functions:
○​ Usually ReLU (Rectified Linear Unit).
○​ Adds non-linearity to the model.
4.​ Pooling Layers:
○​ Reduce spatial dimensions (downsampling).
○​ Common: Max Pooling (takes max value in a window).
5.​ Fully Connected (Dense) Layers:
○​ Final layers that interpret features and make predictions.
○​ Each neuron is connected to all activations from the previous layer.
6.​ Output Layer:
○​ Produces the final result.
○​ For classification: Softmax function gives class probabilities.
7.​ Loss Function:
○​ Measures prediction error.
○​ Common in CV: Cross-entropy loss for classification.
8.​ Optimizer:
○​ Updates weights using gradients (e.g., SGD, Adam).
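
A compact sketch of these components in PyTorch (an assumed framework choice; the two-conv-layer architecture, 10 classes, and learning rate are illustrative, not a specific published model):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # conv layer: learnable 3x3 filters
            nn.ReLU(),                                   # non-linearity
            nn.MaxPool2d(2),                             # pooling: 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)   # fully connected layer

    def forward(self, x):                                # x: (batch, 3, 224, 224)
        x = self.features(x)
        return self.classifier(x.flatten(1))             # raw scores; softmax is inside the loss

model = SmallCNN()
criterion = nn.CrossEntropyLoss()                        # cross-entropy (applies softmax internally)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)                          # dummy batch
y = torch.randint(0, 10, (4,))
loss = criterion(model(x), y)                            # measure prediction error
loss.backward()                                          # compute gradients
optimizer.step()                                         # update weights
```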

In Computer Vision, DNNs are used for:

●​ Image Classification (e.g., dog vs. cat)


●​ Object Detection (e.g., YOLO, Faster R-CNN)
●​ Segmentation (e.g., U-Net, Mask R-CNN)
●​ Image Generation (e.g., GANs)
●​ Depth Estimation (e.g., MiDaS)

What is a CNN?

A Convolutional Neural Network (CNN) is a type of deep neural network specially designed to
process images by preserving spatial relationships using convolution operations. It
automatically learns features like edges, textures, shapes, etc., without manual feature
engineering.

Key Layers in CNN:

🔹 Convolutional Layer:
○​ Applies filters (kernels) to input image.
○​ Extracts features like edges, corners, patterns.
○​ Output is called a feature map.
○ Equation: Y(i, j) = Σ_m Σ_n X(i+m, j+n) · K(m, n) (a NumPy sketch follows this list)

🔹 ReLU (Activation Function):


○ Applies non-linearity: ReLU(x) = max(0, x)
○​ Makes model capable of learning complex patterns.

🔹 Pooling Layer:
○​ Reduces spatial size (downsampling).
○​ Max Pooling is common:
■​ Selects the max value from a region (e.g., 2×2).
○​ Benefits: Reduces computation, helps generalization.

🔹 Fully Connected (Dense) Layer:


○​ Final decision-making layers.
○​ Takes the flattened feature maps as input.
○​ Outputs class scores or predictions.

🔹 Softmax Layer (Output):


●​ Converts final scores into probabilities.
●​ Useful for multi-class classification.
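
To make the convolution equation above concrete, here is a small NumPy sketch (example values are arbitrary; note that deep-learning "convolution" is computed without flipping the kernel, exactly as in the equation):

```python
import numpy as np

def conv2d_valid(X, K):
    """Y(i, j) = sum_m sum_n X(i+m, j+n) * K(m, n), 'valid' region only."""
    kh, kw = K.shape
    out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)   # slide the kernel over the image
    return Y

X = np.array([[1, 2, 3, 0],
              [4, 5, 6, 1],
              [7, 8, 9, 2]], dtype=float)
K = np.array([[1, 0],
              [0, -1]], dtype=float)      # toy kernel responding to diagonal intensity changes

feature_map = conv2d_valid(X, K)
activated = np.maximum(feature_map, 0)    # ReLU(x) = max(0, x), applied elementwise
print(feature_map)
print(activated)
```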

_____________________________________________
How R-CNN Works

R-CNN performs object detection in three main steps:

1.​ Region Proposal:


○​ Uses Selective Search to generate around 2000 candidate object regions (region
proposals) from the input image.
2.​ Feature Extraction:
○​ Each region proposal is resized to a fixed size (e.g., 224x224) and passed
through a CNN (like AlexNet or VGG) to extract a feature vector.
3.​ Classification + Bounding Box Regression:
○​ A separate SVM is trained for each object class to classify the feature vectors.
○​ A linear regressor is trained to refine the coordinates of the bounding boxes.

Image segmentation using an image-to-image neural network refers to the task of assigning
a class label to each pixel in an image, using an architecture that takes an image as input and
outputs a mask of the same spatial dimensions.

The most common approach for this is using fully convolutional networks (FCNs) or
advanced variants like U-Net, SegNet, or DeepLab.

✅ Key Idea:
●​ Input: An image (e.g., 256×256×3)
●​ Output: A segmentation mask (e.g., 256×256×C), where C is the number of classes.
●​ The model learns pixel-wise classification

🧠 Architecture: (e.g., U-Net or FCN)


●​ Encoder (Downsampling): Extracts features using convolution + pooling.
●​ Decoder (Upsampling): Reconstructs the spatial resolution using transposed
convolutions or interpolation.
●​ Skip Connections: Merge encoder features with decoder features for precise
localization (U-Net specific).

🔹 1. Semantic Segmentation
📌 Definition:
Assigns a class label to each pixel in the image.

✅ Key Point:
●​ All objects of the same class share the same label.
●​ No distinction between individual object instances.

🔍 Example:
In a street scene:

●​ All pixels belonging to "car" get the same label.


●​ All "roads", "trees", and "sky" are labeled, but not individually identified.

📷 Output:
●​ A pixel-wise map with categories like [car, tree, road, person, sky].
🔹 2. Instance Segmentation
📌 Definition:
Assigns a class label + instance ID to each pixel.

✅ Key Point:
●​ Each individual object is segmented separately, even if they’re the same class.
●​ Combines object detection and semantic segmentation.

🔍 Example:
In the same street scene:

●​ Car1, Car2, Car3 are segmented as different instances, all of class "car".
●​ So are Person1, Person2, etc.

📷 Output:
●​ Pixel-wise masks with object instance separation, e.g., [car#1, car#2, person#1].

🔹 3. Panoptic Segmentation
📌 Definition:
Combines both semantic and instance segmentation in a single output.

✅ Key Point:
●​ Segments all pixels (like semantic).
●​ Differentiates each object instance (like instance).

🔍 Example:
In the street scene:

●​ Every pixel is labeled.


●​ "Sky", "road" (amorphous 'stuff') are given semantic labels.
●​ "Car1", "Car2", "Person1" (countable 'things') are individually segmented.

📷 Output:
●​ A unified segmentation map with both class and instance info.

____________________________________________
Temporal processing refers to handling sequential or time-dependent data in machine
learning or deep learning models. This is essential for tasks where the order and timing of
inputs matter — such as video analysis, time-series forecasting, speech recognition, or human
activity recognition.

🕒 Common Applications:
●​ Video classification or object tracking
●​ Time series forecasting (e.g., stock prediction)
●​ Speech-to-text
●​ Sensor data analysis (e.g., wearable activity monitoring)

🔄 Approaches to Temporal Processing:


1.​ Recurrent Neural Networks (RNNs):
○​ Handle sequences by maintaining a hidden state.
○​ Problem: vanishing gradients.​

2.​ LSTM / GRU:


○​ Improved versions of RNNs that handle long-range dependencies.​

3.​ 1D/3D Convolution:


○​ 1D Conv: good for time series (sliding over time axis).
○​ 3D Conv: for spatiotemporal data (e.g., video: height × width × time).​

4.​ Transformers:
○​ Use attention over sequences, parallelizable.
○​ Dominant in NLP and video understanding.​

5.​ Temporal Convolutional Networks (TCN):


○​ Use dilated convolutions for long-range dependencies.
○​ No recurrence, more efficient.

_____________________________________________
🔁 Recurrent Neural Network (RNN) — Theory
Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential
data, where the order and context of elements matter. Unlike traditional feedforward neural
networks, RNNs have loops that allow information to persist over time steps, making them
ideal for tasks like time series prediction, natural language processing, and speech recognition.

🧠 Key Concept
An RNN processes sequences by maintaining a hidden state that is updated at each time step
based on:

●​ The current input


●​ The previous hidden state

This allows the network to have a sort of memory, capturing dependencies across time steps.

🧮 Mathematical Formulation
At time step t, given:

● Input vector: x_t
● Hidden state from previous step: h_(t−1)
● Output: y_t

The update equations are:

h_t = tanh(W_xh · x_t + W_hh · h_(t−1) + b_h)
y_t = W_hy · h_t + b_y

🔄 Unfolding in Time
RNNs can be "unfolded" across time steps. For example, a sequence of 3 time steps:

x1 → h1 → y1

x2 → h2 → y2

x3 → h3 → y3

Each time step shares the same weights, making it efficient for sequence learning.
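
A tiny NumPy sketch of this unfolded forward pass (layer sizes, random weights, and the 3-step sequence are arbitrary assumptions):

```python
import numpy as np

input_size, hidden_size, output_size = 4, 8, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

def rnn_forward(xs):
    """xs: list of input vectors x_1 ... x_T; returns outputs y_1 ... y_T."""
    h = np.zeros(hidden_size)                       # initial hidden state
    ys = []
    for x in xs:                                    # the same weights are reused at every step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)      # h_t from x_t and h_(t-1)
        ys.append(W_hy @ h + b_y)                   # y_t read out from h_t
    return ys

sequence = [rng.normal(size=input_size) for _ in range(3)]   # x1, x2, x3
y1, y2, y3 = rnn_forward(sequence)
```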

📉 Training: Backpropagation Through Time (BPTT)


●​ The RNN is trained using BPTT, a variant of backpropagation that unfolds the network
over time.
●​ Gradients are calculated through all time steps.
●​ Can suffer from vanishing or exploding gradients, which make learning long-term
dependencies difficult.

🔍 Limitations of Basic RNNs


●​ Struggles with long-range dependencies.
●​ Prone to vanishing/exploding gradients.
●​ Sequential processing makes it slower for long sequences.

🧬 Variants
To overcome limitations, advanced architectures were developed:

●​ LSTM (Long Short-Term Memory): Introduces gates to control information flow.


●​ GRU (Gated Recurrent Unit): A simplified version of LSTM.

🎯 Applications
●​ Language modeling & text generation
●​ Machine translation
●​ Sentiment analysis
●​ Speech recognition
●​ Time series forecasting
●​ Human activity recognition
Anomaly Detection in Images using Autoencoders

Anomaly detection involves identifying patterns in data that do not conform to expected
behavior. In the context of images, anomalies are regions or entire images that are different
from the typical image patterns in a dataset. This can be useful in various applications such as
medical image analysis, industrial defect detection, and security monitoring.

One of the most effective methods for image anomaly detection is using Autoencoders — a
type of neural network that learns to compress and then reconstruct its input.

🧠 How Autoencoders Work for Anomaly Detection


An autoencoder consists of two parts:

1.​ Encoder: Compresses the input into a lower-dimensional representation (latent space).
2.​ Decoder: Reconstructs the original input from the compressed representation.

In anomaly detection, the autoencoder is trained on normal images. Once trained, it should
be able to reconstruct normal images well and fail to reconstruct anomalous images
accurately (i.e., large reconstruction error).

Steps for Anomaly Detection:

1.​ Train an autoencoder on normal (non-anomalous) images.


2.​ Reconstruct the test image (normal or anomalous).
3.​ Measure the reconstruction error (e.g., Mean Squared Error between input and
output).
4.​ If the error exceeds a threshold, classify the image as anomalous.
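
A minimal sketch of this pipeline in PyTorch (assumed framework; the 28×28 grayscale input size, random tensors standing in for real data, and the 0.02 threshold are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(28 * 28, 64), nn.ReLU(),
                                     nn.Linear(64, 16))                      # latent space
        self.decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                     nn.Linear(64, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x)).view(-1, 1, 28, 28)

model, mse = AE(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

normal_batch = torch.rand(32, 1, 28, 28)            # stand-in for real normal images
for _ in range(10):                                 # training on normal images only
    optimizer.zero_grad()
    loss = mse(model(normal_batch), normal_batch)   # reconstruction loss
    loss.backward()
    optimizer.step()

# Test time: per-image reconstruction error vs. a chosen threshold
test = torch.rand(5, 1, 28, 28)
with torch.no_grad():
    errors = ((model(test) - test) ** 2).mean(dim=(1, 2, 3))
is_anomalous = errors > 0.02                        # threshold tuned on validation data in practice
```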

🧑‍💻 Autoencoder Architecture for Anomaly Detection


1.​ Encoder: The encoder reduces the dimensionality of the input image to a latent
representation (typically smaller than the input).
2.​ Decoder: The decoder reconstructs the input image from the latent representation. This
process tries to preserve the key features of the input image.
3.​ Loss Function: Mean Squared Error (MSE) or other loss functions measure how well
the autoencoder reconstructs the image.​
Well-known CNN architectures: AlexNet, VGG, ResNet.

_____________________________________________

Hue: A color attribute that describes a pure color (pure yellow, orange, or red).

Saturation: A measure of how much a pure color is diluted with white light.

Intensity: Brightness is subjective and hard to measure, so intensity is used instead: an achromatic quantity, like the gray levels in a grayscale image.

CMYK components: CYAN - C, MAGENTA - M, YELLOW - Y, BLACK (Key) - K

ADDITIVE mixing (RGB primaries): G + B = CYAN || B + R = MAGENTA || R + G = YELLOW || R + G + B = WHITE

SUBTRACTIVE mixing (CMY primaries): C + M = BLUE || M + Y = RED || C + Y = GREEN || C + M + Y = BLACK

_____________________________________________
TRANSFORMATION
Spatial domain to frequency domain.
Image transformations like Fourier Transform and Discrete Cosine Transform (DCT) convert images from
the spatial domain (pixel-based) to the frequency domain (based on patterns of intensity change).

1. Fourier Transform (DFT / FFT)


📌 Purpose:
Decomposes an image into sine and cosine waves (frequencies). Used in image
compression, denoising, filtering, and pattern recognition.

🧠 Concept:
●​ Low frequencies: represent smooth regions
●​ High frequencies: represent edges and fine details.

FFT (Fast Fourier Transform): an optimized version of the DFT with O(n log n) complexity.

2. Discrete Cosine Transform (DCT)


📌 Purpose:
Like Fourier but uses only cosine functions — more efficient and better energy compaction for
image compression (e.g., JPEG).
🧠 Concept:
●​ Most information (energy) is packed in few DCT coefficients
●​ Remaining (higher-frequency) coefficients are often near zero → can be discarded
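
A small OpenCV sketch of DCT energy compaction (photo.jpg and the 64×64 block of kept coefficients are arbitrary assumptions; OpenCV's DCT expects even image dimensions):

```python
import cv2
import numpy as np

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
h, w = gray.shape
gray = gray[:h - h % 2, :w - w % 2]     # crop to even dimensions for cv2.dct

coeffs = cv2.dct(gray)                  # most energy lands in the top-left (low-frequency) corner

kept = np.zeros_like(coeffs)
k = 64                                  # keep only a small low-frequency block
kept[:k, :k] = coeffs[:k, :k]

approx = cv2.idct(kept)                 # reconstruct a close approximation from few coefficients
cv2.imwrite("dct_approx.jpg", np.clip(approx, 0, 255).astype(np.uint8))
```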

🔎 What Is a "Blob"?
A blob is a group of connected pixels that share similar intensity or texture and represent
a region of interest:

●​ Can be light blobs on dark background or vice versa.


●​ Used in object detection, feature extraction, keypoint matching, etc.

_____________________________________________

Title: Scale-Invariant Feature Transform (SIFT)


🔹 Goal of SIFT:
●​ Detect salient, stable feature points in images.
●​ These are interest points (keypoints) that don’t change even when the image:
○​ Rotates
○​ Scales (zooms in or out)
○​ Changes in brightness

So we describe a small region around each keypoint in a way that is:

●​ Rotation-invariant
●​ Scale-invariant

✅ Steps of SIFT (on this page)


1. Scale-Space Extrema Detection

●​ Purpose: Find candidate keypoints at multiple scales.


●​ Technique: Use Difference of Gaussians (DoG) to detect blobs/spots that are:
○​ Brighter or darker than surroundings.
●​ DoG = Difference between two blurred images (Gaussian-blurred at different
scales)
2. Accurate Keypoint Localization

●​ Purpose: Eliminate unstable keypoints (e.g. edge points or low contrast).


●​ Use Taylor Series Expansion to fit a curve to the DoG function for subpixel
accuracy.
✏️ Step 3: Orientation Assignment
🌟 Purpose:
To make the keypoints rotation-invariant, we assign an orientation to each keypoint.

🔄 What’s happening:
●​ You analyze the gradient directions around the keypoint.
●​ You build a histogram of gradient directions in the local region.
●​ The highest peak in the histogram becomes the main orientation of the keypoint.

👉 If multiple peaks are strong (above 80% of the maximum), then multiple keypoints are
created at the same location but with different orientations.

📝 Example in your notes:​


"If multiple peaks are present → create multiple descriptors for each orientation."

🔍 This helps the algorithm recognize the same object even when it's rotated in different
images.

✏️ Step 4: Descriptor Generation


📦 Purpose:
To create a unique fingerprint for each keypoint — this is used for matching later.

🧠 What happens here:


●​ You take a small region around the keypoint (typically 16x16 pixels).
●​ Divide this region into 4×4 smaller cells.
●​ For each cell, build a histogram of gradient directions (8 bins per histogram).

📊 This gives:
●​ 4×4 = 16 cells
●​ Each with 8-bin histogram​
→ Total of 128 values (16 × 8)​
This is the SIFT descriptor vector.
●​ "Each entry is weighted by gradient magnitude and Gaussian weighting."
●​ So closer pixels and stronger edges get more importance.
●​ The result is a 128-dimensional vector that describes the local patch.
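
A short OpenCV sketch of SIFT in practice (scene.jpg is a placeholder path; SIFT_create is available in recent opencv-python builds):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints))          # number of detected keypoints
print(descriptors.shape)       # (num_keypoints, 128): one 128-D descriptor per keypoint

vis = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)  # shows size + orientation
cv2.imwrite("sift_keypoints.jpg", vis)
```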
🔍 What is SURF?
SURF = Speeded Up Robust Features​
It is used to:

1.​ Detect important points (keypoints) in an image


2.​ Describe the region around those points
3.​ Match these features across images

💡 SURF Pipeline (as per your notes):


1.​ Interest Point Detection
○​ SURF uses the Hessian matrix to detect blobs (areas of sudden change in
intensity).
○​ It's similar to how SIFT detects keypoints, but faster.
2.​ Local Neighbourhood Description
○​ Describes the area around the keypoint using a Haar wavelet (a kind of fast
edge detector).
3.​ Matching
○​ Compare descriptors between two images to find matching points.

🔬 Key Concepts from Your Notes:


🧠 “SURF uses blob detection”:
It looks for "blobs" — spots in the image that are visually distinctive, e.g., corners, or
points with sharp intensity changes.

🧮 Hessian Matrix (used in SURF for keypoint detection):


The Hessian matrix is a mathematical tool that finds where the intensity in an image
changes sharply (i.e., where blobs or corners are):

H = | Lxx  Lxy |
    | Lxy  Lyy |

Where:

●​ Lxx = second-order derivative in x direction


●​ Lyy = second-order derivative in y direction
●​ Lxy = mixed second-order derivative
The determinant of the Hessian matrix helps in identifying keypoints:

Det(H) = Lxx * Lyy - (0.9 * Lxy)^2

🔹 The constant 0.9 is used to balance the scale of the approximated derivatives.


🔄 Orientation Assignment:
This is done using Haar wavelet responses in the x and y directions:

V = [Σdx, Σdy]

●​ dx and dy are responses to Haar wavelets.


●​ The orientation vector V gives the dominant direction around the keypoint.

🔗 Descriptor Vector:
After finding the keypoint and its orientation, SURF builds a descriptor (just like SIFT
does) using wavelet responses, which makes it faster.

The descriptor vector is represented as:

[Σdx, Σdy, |Σdx|, |Σdy|]

This compact descriptor is what is matched across different images.
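
A hedged OpenCV sketch (SURF lives in the contrib module, so it needs opencv-contrib-python and is disabled in some builds for patent reasons; the Hessian threshold of 400 is an arbitrary assumption, and ORB is a free alternative if SURF is unavailable):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder path

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)    # keypoints via the Hessian determinant
keypoints, descriptors = surf.detectAndCompute(gray, None)

print(descriptors.shape)       # (num_keypoints, 64) for the standard SURF descriptor
```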

Image Matching

●​ Find correspondence between different image pairs.


●​ Establish matching & correspondence between feature points across frames or views.
●​ Used in stereo, structure from motion, SLAM (Simultaneous Localization and Mapping).

Applications

●​ Triangulation: Locate the actual 3D point in the scene from its two (or more)
corresponding views.

Types of Matching

1. Area-Based

●​ Directly compares pixel values


●​ Window-based correlation (intensity matching)
●​ Fast but not reliable
2. Feature-Based

●​ Uses descriptors (points, corners, blobs)


●​ More robust
●​ Match feature vectors

Matching Techniques

●​ Feature Vector Matching​


→ Compute distance (e.g., Euclidean)​

●​ Reliability Measures​
→ RANSAC (Random Sample Consensus)​
→ Distance ratio​
→ Non-consistent matches removal​
→ Cross-check (symmetric match)​

●​ Robust Matching​
→ Voting​
→ Histogram​
→ Hough transform​
→ Epipolar geometry​
→ Graph-based​
→ Bag of words​
→ Best match

RANSAC Algorithm Steps :(Random Sample Consensus)


1.​ Select any 4 features at random​
➤ You need 4 point correspondences (keypoint pairs) to compute a homography matrix
between two images.​

2.​ Compute Homography​


➤ Calculate the transformation matrix H that maps points from one image to the other
using the selected 4 points.​

3.​ Compute inliers where SSD (p’, Hp) < ε​


➤ For each matching point, check if the SSD (Sum of Squared Differences) between
the transformed point Hp and actual point p’ is below a small threshold ε.​
➤ If it is, it’s considered an inlier (i.e., a correct match).​

4.​ Keep largest set of inliers​


➤ Repeat the above steps multiple times, and keep the largest set of points that agree
with the best homography.​

5.​ Recompute least squares fit for all inliers​


➤ Finally, use all the inliers to refine the homography matrix using a more accurate
least squares method.
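
A minimal sketch of steps 1–5 via OpenCV's findHomography (the random points are stand-ins for real matched keypoints, and the 5-pixel threshold ε is an assumption):

```python
import cv2
import numpy as np

# Matched keypoint coordinates from two images, shape (N, 1, 2); random stand-ins here
src_pts = np.float32(np.random.rand(50, 1, 2) * 500)
dst_pts = src_pts + 10                    # pretend the second image is shifted by (10, 10)

# RANSAC repeatedly samples 4 correspondences, fits H, and keeps the largest inlier set
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

print(H)                                  # 3x3 homography (close to a pure translation here)
print(int(mask.sum()))                    # number of inliers consistent with H
```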

🧠 1. Object Detection vs. Segmentation


●​ Object Detection: Identifying where objects are in an image (bounding boxes + class
labels).
●​ Segmentation: Identifying which pixels belong to each object.
○​ Semantic Segmentation: Labels each pixel with a class (e.g., “car”, “road”).
○​ Instance Segmentation: Separates different instances of the same class.

✂️ 2. Edge Detection
Detects boundaries of objects using gradients.

●​ Key Techniques:
○​ Sobel Operator
○​ Prewitt Operator
○​ Canny Edge Detector: Most used, multi-stage (smoothing, gradient calc,
non-max suppression, thresholding).
○​ Laplacian of Gaussian (LoG): Finds edges using second-order derivatives.
○​ Difference of Gaussian (DoG): Approximates LoG, used in SIFT.
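
A short OpenCV sketch of Sobel gradients and the Canny detector (scene.jpg and the 100/200 thresholds are placeholder assumptions to tune per image):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Sobel: first-order derivatives in x and y
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# Canny: smoothing, gradient, non-max suppression, hysteresis thresholding in one call
edges = cv2.Canny(gray, 100, 200)
cv2.imwrite("edges.jpg", edges)
```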

🌾 3. Texture Analysis
Describes patterns or variations in image intensity.

●​ Statistical Methods:
○​ Gray-Level Co-occurrence Matrix (GLCM): Measures texture features like
contrast, correlation, homogeneity.
○​ Local Binary Patterns (LBP): Encodes local texture by thresholding
neighborhood.
●​ Transform-Based:
○​ Fourier Transform
○​ Gabor Filters
○​ Wavelets

🧩 4. Region-Based Segmentation
Groups pixels into regions based on similarity.

●​ Region Growing: Start from seed points and grow based on similarity.
●​ Region Splitting & Merging: Divide image, then merge similar regions.
●​ Watershed Algorithm: Treats image as a topographic surface.
●​ Graph-Based Methods (e.g., Normalized Cuts)


🔍 1. Matching
Matching is the process of finding correspondences between features (points, patches, or
regions) in different images.

🧱 A. Types of Matching
●​ Feature-Based Matching: Match keypoints using descriptors.
●​ Template Matching: Slide a template over the image and compare.
●​ Area-Based Matching: Use windows/patches (e.g., SSD, NCC).
●​ Descriptor Matching:
○​ Distance Metrics: Euclidean, SSD (Sum of Squared Differences), Cosine,
Hamming.
○​ Matching Algorithms: Brute Force, k-NN, FLANN.

🔧 B. Steps in Feature Matching


1.​ Feature Detection: e.g., SIFT, SURF, ORB.
2.​ Feature Description: Compute descriptors.
3.​ Feature Matching: Use distance metric to find best matches.
4.​ Filtering Matches: Ratio test (Lowe’s), RANSAC for homography.
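
A compact sketch of these four steps with SIFT + brute-force matching (left.jpg/right.jpg and the 0.75 ratio are assumptions; Lowe's ratio test keeps only distinctive matches):

```python
import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)    # detection + description
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)                  # Euclidean distance between descriptors
matches = bf.knnMatch(des1, des2, k=2)           # two nearest neighbours per descriptor

good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test
print(len(good))
```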

🧠 2. Recognition
Recognition means identifying what object or category is present in the image.

🎯 A. Types
●​ Object Recognition: e.g., recognize a "cat" or "car".
●​ Face Recognition: Identify people by faces.
●​ Scene Recognition: e.g., “indoor” vs. “outdoor”.

🔍 B. Techniques
●​ Template Matching: Match with stored image patterns (rigid).
●​ Feature-Based: Match extracted features to database features.
●​ Bag of Visual Words (BoVW): Treat local features as words and do classification.
●​ Machine Learning Classifiers:​

○​ SVMs
○​ KNN
○​ Random Forests​
●​ Deep Learning:​

○​ CNNs for image classification


○​ Pretrained models (ResNet, VGG, Inception)
○​ Fine-tuning for specific tasks

🔗 1. Fusion
Image Fusion means combining multiple images into a single enhanced image, retaining
complementary information.

📸 A. Types
●​ Multi-focus Fusion: Combine images with different focus areas.
●​ Multi-sensor Fusion: e.g., Thermal + RGB for surveillance.
●​ Multi-exposure Fusion: Combine HDR images.

⚙️ B. Techniques
●​ Pixel-level Fusion:
○​ Average or max pixel intensities.
●​ Feature-level Fusion:
○​ Extract features (edges, textures), then combine.
●​ Decision-level Fusion:
○​ Fuse decisions from multiple models/sources.
●​ Wavelet/Transform Fusion:
○​ Decompose images (e.g., DWT), fuse components, reconstruct.
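
A tiny pixel-level fusion sketch (the two input paths are placeholders; the images are assumed to be registered and the same size):

```python
import cv2
import numpy as np

a = cv2.imread("focus_left.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
b = cv2.imread("focus_right.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

fused_avg = (a + b) / 2          # average pixel intensities
fused_max = np.maximum(a, b)     # per-pixel maximum

cv2.imwrite("fused_avg.jpg", fused_avg.astype(np.uint8))
cv2.imwrite("fused_max.jpg", fused_max.astype(np.uint8))
```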

✅ Use Cases
●​ Medical imaging (e.g., CT + MRI)
●​ Surveillance
●​ Robotics (e.g., vision + LIDAR)

Image Fusion (based on Steerable Transform + SVD algorithm)

Input: Source images X and Y which must be registered

Output: Fused Image (F)

Steps:

1. Decompose source images X and Y using Steerable Transform (ST)


2. Fuse the low-pass sub-bands of the source images, with the best low-pass sub-band estimated by SVD.

3. Fuse the high-pass ST coefficients of the source images using a selection rule.

4. Combine the selected high-pass ST coefficients with the fused low-pass sub-band from Step 2.

5. Reconstruct the image by applying inverse transform.

6. Display fused image (F)

📐 2. Image Alignment
Image Alignment refers to registering two or more images so their contents line up accurately.

🔍 A. Steps
1.​ Detect Features: SIFT, SURF, ORB.
2.​ Describe Features: Compute feature descriptors.
3.​ Match Features: Use SSD, ratio test, etc.
4.​ Estimate Transformation:
○​ Affine: preserves lines/parallelism
○​ Homography: projective transformation (used for stitching)
5.​ Warp Image using transformation matrix.

🧪 Homography Example:
If point (x, y) in Image A maps to (x', y') in Image B:

[x', y', 1]ᵀ = H * [x, y, 1]ᵀ

Solve for H using at least 4 corresponding points.
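
A minimal OpenCV sketch of exactly this: solving H from 4 hand-picked point pairs and warping (imageA.jpg and the coordinates are illustrative assumptions; with more than 4 noisy matches, cv2.findHomography with RANSAC is used instead):

```python
import cv2
import numpy as np

img = cv2.imread("imageA.jpg")

# Four corresponding points in image A and image B
pts_a = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])
pts_b = np.float32([[20, 30], [380, 10], [410, 290], [5, 320]])

H = cv2.getPerspectiveTransform(pts_a, pts_b)     # exact solution from 4 point pairs
warped = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
cv2.imwrite("warped.jpg", warped)
```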

1. Translation

Definition: Moves (shifts) an image in the x and/or y direction without rotating or scaling it.

Transformation Matrix:

1 0 tx
0 1 ty
0 0 1
Use Case: When the image is simply displaced but not deformed.

2. Affine Transformation

Definition: Preserves lines and parallelism (but not necessarily distances and angles). Includes
translation, rotation, scaling, and shearing.

Transformation Matrix:

a_{11} a_{12} t_x
a_{21} a_{22} t_y
0 0 1

Properties:

Straight lines remain straight

Parallel lines remain parallel

Use Case: Mapping between two images when the camera undergoes rotation, scaling, or
shear.

3. Homography (Projective Transformation)

Definition: A more general transformation that can map a plane to another plane under
perspective. It includes all affine transformations and more.

Transformation Matrix:

h_{11} h_{12} h_{13}
h_{21} h_{22} h_{23}
h_{31} h_{32} h_{33}

Parameters: 8 (one parameter is redundant due to scale)

Properties:

Straight lines remain straight

Can handle perspective distortion

Use Case: Used in panorama stitching, object detection, camera calibration, AR, etc.
🧵 3. Image Stitching
Stitching involves aligning and blending multiple overlapping images into a seamless panorama.

⚙️ A. Steps
1.​ Detect & Match Features (like image alignment)
2.​ Estimate Homography between image pairs
3.​ Warp Images to align with a reference frame
4.​ Blend Images:
○​ Linear blending
○​ Multi-band blending (Laplacian pyramids)
○​ Seam finding to remove visible boundaries

🔄 Automatic Tools
●​ OpenCV’s cv2.Stitcher_create()
●​ Python libraries: OpenCV, ImageAI, AutoStitch

✅ Use Cases
●​ Panorama creation
●​ Aerial/mosaic imaging
●​ Document scanning apps
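
A minimal sketch using OpenCV's high-level stitcher mentioned above (the pano_*.jpg paths are placeholders for overlapping photos):

```python
import cv2

images = [cv2.imread(p) for p in ["pano_1.jpg", "pano_2.jpg", "pano_3.jpg"]]

stitcher = cv2.Stitcher_create()          # handles matching, homography, warping, blending
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print("Stitching failed, status:", status)
```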

🔄 Image Stitching Algorithm: Step-by-Step


1. Image Acquisition

●​ Capture or load multiple images with overlapping fields of view.

2. Feature Detection

●​ Detect distinctive keypoints in each image


●​ Algorithms used:
○​ SIFT (Scale-Invariant Feature Transform)
○​ SURF (Speeded Up Robust Features)
○​ ORB (Oriented FAST and Rotated BRIEF)

3. Feature Description

●​ Extract descriptors for each keypoint to characterize its neighborhood


●​ These descriptors help in matching features across images.
4. Feature Matching

●​ Match features between pairs of overlapping images.


●​ Algorithms used:
○​ Brute Force Matcher
○​ FLANN (Fast Library for Approximate Nearest Neighbors)

5. Homography Estimation

●​ Compute the transformation (homography matrix) between matched


images.
●​ Use RANSAC (Random Sample Consensus) to eliminate outliers and find
the best transformation.

6. Image Warping

●​ Apply the homography to warp images into a common coordinate system.

7. Image Blending

●​ Blend the warped images to create a seamless result.


●​ Techniques:
○​ Feathering (simple averaging)
○​ Multi-band blending (Laplacian pyramids)
○​ Seam optimization (Graph cuts)

8. Output Generation

●​ Crop black borders or non-overlapping regions.


●​ Save or display the final stitched panorama.
