Computer Vision
🏥 2. Healthcare
● Medical Imaging: Detect tumors, fractures in X-rays, MRIs, CT scans.
● Retinal Analysis: For diabetic retinopathy, glaucoma, etc.
● Surgical Assistance: Robots guided using CV.
● Pathology: Automated detection of cells and abnormalities.
🚗 3. Autonomous Vehicles
● Lane Detection: Identify road lanes.
● Traffic Sign Recognition: Read and respond to signs.
● Pedestrian Detection: For safe driving.
● Obstacle Avoidance: Detect other vehicles, people, animals, etc.
● Surround View: 360° environment understanding.
"Computer vision continues to grow rapidly and is a key enabler of intelligent systems across various domains
by allowing machines to see, understand, and make decisions."
_____________________________________________
📘 Image Formation – Basic Concepts
1. What is Image Formation?
Image formation is the process by which light from a 3D scene is projected onto a 2D image plane. The simplest model is the pinhole camera.
✅ Key Points (pinhole camera):
● No lens used.
● Simple and distortion-free, but very dim image.
● Smaller hole → sharper but darker image.
● Larger hole → brighter but blurry image.
✅ Key Terms:
● Lens: Focuses light.
● Image Plane: Where the image is formed.
● Sensor: Converts light into electrical signals (digital image).
4. Inversion
● The image formed on the image plane is inverted (rotated 180°) relative to the scene, in both pinhole and lens cameras.
_____________________________________________
📘 Image Representation
✅ What is Image Representation?
It refers to how an image is stored, structured, and processed in a computer — using pixel values.
🔳 4. Binary Image
● Pixel value is either 0 or 1 (black or white).
● Used in basic segmentation and thresholding.
🔢 5. Image Resolution
● Resolution = Width × Height
● Higher resolution = more pixels = more detail, larger file.
🧠 7. Image Formats
● JPEG, PNG, BMP, TIFF are ways to store images.
○ JPEG: Compressed, lossy
○ PNG: Lossless, supports transparency
🔍 8. Coordinate System
● Top-left pixel is (0, 0).
● X-axis → right, Y-axis → down
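A minimal Python sketch tying these ideas together (OpenCV + NumPy; "photo.jpg" is a placeholder file name): resolution from the array shape, a binary image via thresholding, and the top-left (0, 0) coordinate convention.

```python
import cv2

# "photo.jpg" is a placeholder; any 8-bit image will do.
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Resolution: NumPy stores images as (rows, cols) = (height, width).
height, width = img.shape
print(f"Resolution: {width} x {height}, total pixels: {width * height}")

# Binary image: threshold at 128 -> every pixel becomes 0 (black) or 255 (white).
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

# Coordinate system: index order is img[y, x]; (0, 0) is the top-left pixel.
print("Top-left pixel value:", img[0, 0])
```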
_____________________________________________
🧴 1. Smoothing (Blurring)
Goal: Reduce noise or small variations in the image.
✅ Common Methods:
● Mean Filter (Average Filter):
○ Replaces each pixel with the average of neighboring pixels.
○ Removes noise but can blur edges.
● Gaussian Filter:
○ Uses a Gaussian function to give more weight to central pixels.
○ Smooths noise better and preserves edges more than the mean filter.
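A short sketch of both filters with OpenCV (the 5×5 kernel size and file name are illustrative choices):

```python
import cv2

img = cv2.imread("noisy.jpg")  # placeholder file name

# Mean (average) filter: each pixel becomes the mean of its 5x5 neighborhood.
mean_blur = cv2.blur(img, (5, 5))

# Gaussian filter: weights fall off with distance from the center pixel.
# sigma = 0 lets OpenCV derive sigma from the kernel size.
gauss_blur = cv2.GaussianBlur(img, (5, 5), 0)
```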
✏️ 2. Sharpening
Goal: Enhance edges and fine details in the image.
✅ Common Methods:
● Laplacian Filter:
○ Second-order derivative operator.
○ Highlights regions of rapid intensity change (edges).
● Unsharp Masking:
○ Subtracts a blurred (low-pass) version from the original image.
○ Formula: Sharpened = Original + α · (Original − Blurred)
○ Effect: Image appears crisper and more detailed.
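A minimal sketch of unsharp masking following the formula above (α = 1.5, the blur size, and the file name are illustrative assumptions):

```python
import cv2

img = cv2.imread("photo.jpg")                # placeholder file name
blurred = cv2.GaussianBlur(img, (9, 9), 0)   # low-pass (blurred) version

# Sharpened = Original + alpha * (Original - Blurred)
# addWeighted computes (1 + alpha)*img - alpha*blurred, with saturation clipping.
alpha = 1.5
sharpened = cv2.addWeighted(img, 1 + alpha, blurred, -alpha, 0)
```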
📊 3. Histogram Equalization
Goal: Improve contrast by spreading out intensity values.
✅ Process:
● Create the histogram of the image.
● Compute the cumulative distribution function (CDF).
● Map old pixel values to new ones using the CDF.
📌 Result: Intensity values are spread across the full range, so dark regions become more distinguishable and washed-out regions regain contrast.
⚠️ Used in:
● Medical imaging, satellite images, low-light enhancement.
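A quick OpenCV sketch (grayscale input, placeholder file name). CLAHE is included as a commonly used adaptive variant, an addition beyond the notes above:

```python
import cv2

# Works on single-channel (grayscale) images.
gray = cv2.imread("xray.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name
equalized = cv2.equalizeHist(gray)

# CLAHE: adaptive, tile-based equalization; often better for medical/low-light images.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
adaptive = clahe.apply(gray)
```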
__________________________________________________________________________________________
1. Matches Human Vision
● Human eyes have three types of cone cells that detect Red, Green, and Blue light.
● RGB aligns with our natural perception.
2. Device-Friendly
● Monitors, cameras, and screens all use RGB to capture, display, and store color.
● It is the native format for most image sensors.
● Other color spaces like HSV, YCbCr, Lab are usually converted from RGB for special
processing (e.g., skin detection, lighting adjustments).
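A short conversion sketch with OpenCV (note that cv2.imread returns BGR, so conversions start from BGR; the skin-tone thresholds and file name are illustrative assumptions only):

```python
import cv2
import numpy as np

img = cv2.imread("face.jpg")  # placeholder file name; OpenCV loads images as BGR

# Convert to other color spaces for special processing.
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)      # e.g., color/skin detection
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)  # e.g., luminance/chroma processing
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)      # perceptually more uniform

# Example: rough skin-tone mask in HSV (threshold values are illustrative).
mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 255, 255]))
```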
_________________________________________________________________________________________
What is a CNN?
A Convolutional Neural Network (CNN) is a type of deep neural network specially designed to
process images by preserving spatial relationships using convolution operations. It
automatically learns features like edges, textures, shapes, etc., without manual feature
engineering.
Key Components:
🔹 Convolutional Layer:
○ Applies filters (kernels) to input image.
○ Extracts features like edges, corners, patterns.
○ Output is called a feature map.
○ Equation: Y(i, j) = Σ_m Σ_n X(i + m, j + n) · K(m, n)
🔹 Pooling Layer:
○ Reduces spatial size (downsampling).
○ Max Pooling is common:
■ Selects the max value from a region (e.g., 2×2).
○ Benefits: Reduces computation, helps generalization.
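A tiny NumPy sketch of the two operations above: a "valid" convolution implementing Y(i, j) = Σ_m Σ_n X(i+m, j+n) · K(m, n) (strictly cross-correlation, as deep-learning libraries implement it), followed by 2×2 max pooling. The toy image and kernel are illustrative.

```python
import numpy as np

def conv2d(X, K):
    """Valid cross-correlation: Y(i, j) = sum_m sum_n X(i+m, j+n) * K(m, n)."""
    h, w = K.shape
    H, W = X.shape
    Y = np.zeros((H - h + 1, W - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + h, j:j + w] * K)
    return Y

def max_pool2x2(X):
    """2x2 max pooling with stride 2 (assumes even height/width)."""
    H, W = X.shape
    return X[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# Vertical-edge detector applied to a toy 6x6 image.
image = np.random.rand(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
feature_map = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool2x2(feature_map)     # 2x2 after pooling
```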
_____________________________________________
Image Segmentation Using Image-to-Image Networks
Image segmentation using an image-to-image neural network refers to the task of assigning
a class label to each pixel in an image, using an architecture that takes an image as input and
outputs a mask of the same spatial dimensions.
The most common approach for this is using fully convolutional networks (FCNs) or
advanced variants like U-Net, SegNet, or DeepLab.
✅ Key Idea:
● Input: An image (e.g., 256×256×3)
● Output: A segmentation mask (e.g., 256×256×C), where C is the number of classes.
● The model learns pixel-wise classification
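A minimal PyTorch sketch of this idea (a toy fully convolutional model assumed for illustration, far simpler than U-Net/SegNet/DeepLab): it maps a 256×256×3 image to a 256×256×C score map, and argmax over channels gives the per-pixel labels. Training would typically use a per-pixel cross-entropy loss against the ground-truth mask.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy image-to-image model: per-pixel class scores at the input resolution."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 256 -> 128
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2),  # 128 -> 256
            nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),            # 1x1 conv -> C channels
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyFCN(num_classes=5)
x = torch.randn(1, 3, 256, 256)      # batch of one 256x256 RGB image
logits = model(x)                    # shape: (1, 5, 256, 256)
pred_mask = logits.argmax(dim=1)     # per-pixel class labels, shape (1, 256, 256)
```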
🔹 1. Semantic Segmentation
📌 Definition:
Assigns a class label to each pixel in the image.
✅ Key Point:
● All objects of the same class share the same label.
● No distinction between individual object instances.
🔍 Example:
In a street scene, every car pixel is labeled "car" and every person pixel "person", without separating individual cars or people.
📷 Output:
● A pixel-wise map with categories like [car, tree, road, person, sky].
🔹 2. Instance Segmentation
📌 Definition:
Assigns a class label + instance ID to each pixel.
✅ Key Point:
● Each individual object is segmented separately, even if they’re the same class.
● Combines object detection and semantic segmentation.
🔍 Example:
In the same street scene:
● Car1, Car2, Car3 are segmented as different instances, all of class "car".
● So are Person1, Person2, etc.
📷 Output:
● Pixel-wise masks with object instance separation, e.g., [car#1, car#2, person#1].
🔹 3. Panoptic Segmentation
📌 Definition:
Combines both semantic and instance segmentation in a single output.
✅ Key Point:
● Segments all pixels (like semantic).
● Differentiates each object instance (like instance).
🔍 Example:
In the street scene, background regions like road and sky get class labels, while each car and person also gets its own instance ID.
📷 Output:
● A unified segmentation map with both class and instance info.
____________________________________________
Temporal processing refers to handling sequential or time-dependent data in machine
learning or deep learning models. This is essential for tasks where the order and timing of
inputs matter — such as video analysis, time-series forecasting, speech recognition, or human
activity recognition.
🕒 Common Applications:
● Video classification or object tracking
● Time series forecasting (e.g., stock prediction)
● Speech-to-text
● Sensor data analysis (e.g., wearable activity monitoring)
4. Transformers:
○ Use attention over sequences, parallelizable.
○ Dominant in NLP and video understanding.
_____________________________________________
🔁 Recurrent Neural Network (RNN) — Theory
Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential
data, where the order and context of elements matter. Unlike traditional feedforward neural
networks, RNNs have loops that allow information to persist over time steps, making them
ideal for tasks like time series prediction, natural language processing, and speech recognition.
🧠 Key Concept
An RNN processes sequences by maintaining a hidden state that is updated at each time step based on the current input and the previous hidden state.
This allows the network to have a form of memory, capturing dependencies across time steps.
🧮 Mathematical Formulation
At time step t, given:
● Input vector: x_t
● Hidden state from previous step: h_(t−1)
● Output: y_t
the standard (vanilla) RNN update uses shared weight matrices W_xh, W_hh, W_hy and biases b_h, b_y:
h_t = tanh(W_xh · x_t + W_hh · h_(t−1) + b_h)
y_t = W_hy · h_t + b_y
🔄 Unfolding in Time
RNNs can be "unfolded" across time steps. For example, a sequence of 3 time steps:
x1 → h1 → y1
        ↓
x2 → h2 → y2
        ↓
x3 → h3 → y3
Each time step shares the same weights, making it efficient for sequence learning.
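A minimal NumPy sketch of the update equations above, unrolled over a short sequence (the layer sizes and random data are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# Shared weights, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_forward(xs):
    """xs: list of input vectors x_1..x_T; returns outputs y_1..y_T."""
    h = np.zeros(hidden_size)                    # h_0
    ys = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # h_t
        ys.append(W_hy @ h + b_y)                # y_t
    return ys

sequence = [rng.normal(size=input_size) for _ in range(3)]  # x1, x2, x3
outputs = rnn_forward(sequence)
```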
🧬 Variants
To overcome limitations such as vanishing gradients and difficulty with long-term dependencies, advanced architectures were developed, notably LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), which use gating mechanisms to control what is remembered and forgotten.
🎯 Applications
● Language modeling & text generation
● Machine translation
● Sentiment analysis
● Speech recognition
● Time series forecasting
● Human activity recognition
Anomaly Detection in Images using Autoencoders
Anomaly detection involves identifying patterns in data that do not conform to expected
behavior. In the context of images, anomalies are regions or entire images that are different
from the typical image patterns in a dataset. This can be useful in various applications such as
medical image analysis, industrial defect detection, and security monitoring.
One of the most effective methods for image anomaly detection is using Autoencoders — a
type of neural network that learns to compress and then reconstruct its input.
1. Encoder: Compresses the input into a lower-dimensional representation (latent space).
2. Decoder: Reconstructs the original input from the compressed representation.
In anomaly detection, the autoencoder is trained on normal images. Once trained, it should
be able to reconstruct normal images well and fail to reconstruct anomalous images
accurately (i.e., large reconstruction error).
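A minimal PyTorch sketch of this idea (the architecture, 64×64 grayscale input, and threshold are illustrative assumptions; in practice the threshold is chosen from reconstruction errors on held-out normal images):

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # compress 1x64x64 -> 8x16x16
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # reconstruct back to 1x64x64
            nn.ConvTranspose2d(8, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
# ... train with nn.MSELoss() on normal images only ...

def is_anomalous(image, threshold=0.01):
    """Flag an image if its mean reconstruction error exceeds the threshold."""
    with torch.no_grad():
        error = torch.mean((model(image) - image) ** 2).item()
    return error > threshold

x = torch.rand(1, 1, 64, 64)   # one grayscale 64x64 image
print(is_anomalous(x))
```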
Saturation (in the HSV color model): a measure of how much a pure color is diluted with white light.
_____________________________________________
TRANSFORMATION
Spatial domain to frequency domain.
Image transformations like Fourier Transform and Discrete Cosine Transform (DCT) convert images from
the spatial domain (pixel-based) to the frequency domain (based on patterns of intensity change).
🧠 Concept:
● Low frequencies: represent smooth regions
● High frequencies: represent edges and fine details.
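A short NumPy sketch of the idea (the 20-pixel cutoff and file name are arbitrary illustrations): transform to the frequency domain, inspect the spectrum, and keep only the low frequencies to retain smooth content.

```python
import cv2
import numpy as np

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# 2D Fourier transform; shift so that low frequencies sit at the center.
F = np.fft.fftshift(np.fft.fft2(gray))
magnitude = 20 * np.log(np.abs(F) + 1)   # log scale for visualization

# Simple low-pass filter: keep a small central square (smooth regions only).
H, W = gray.shape
mask = np.zeros((H, W))
mask[H // 2 - 20:H // 2 + 20, W // 2 - 20:W // 2 + 20] = 1
smoothed = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
```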
🔎 What Is a "Blob"?
A blob is a group of connected pixels that share similar intensity or texture and represent a region of interest.
_____________________________________________
🔍 SIFT (Scale-Invariant Feature Transform)
SIFT keypoints and descriptors are:
● Rotation-invariant
● Scale-invariant
🔄 What’s happening:
● You analyze the gradient directions around the keypoint.
● You build a histogram of gradient directions in the local region.
● The highest peak in the histogram becomes the main orientation of the keypoint.
👉 If multiple peaks are strong (above 80% of the maximum), then multiple keypoints are
created at the same location but with different orientations.
🔍 This helps the algorithm recognize the same object even when it's rotated in different
images.
📊 This gives:
● 4×4 = 16 cells
● Each with 8-bin histogram
→ Total of 128 values (16 × 8)
This is the SIFT descriptor vector.
● "Each entry is weighted by gradient magnitude and Gaussian weighting."
● So closer pixels and stronger edges get more importance.
● The result is a 128-dimensional vector that describes the local patch.
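A brief OpenCV sketch (SIFT is included in opencv-python 4.4+; the file name is a placeholder):

```python
import cv2

gray = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each keypoint has a location, scale, and orientation;
# each descriptor is the 128-dimensional vector described above.
print(len(keypoints), descriptors.shape)   # e.g., N keypoints, (N, 128)
```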
🔍 What is SURF?
SURF = Speeded Up Robust Features
It is used to detect and describe local keypoints, like SIFT, but with faster approximations.
Keypoints are found using the determinant of the Hessian matrix:
H = | Lxx  Lxy |
    | Lxy  Lyy |
Where Lxx, Lxy, Lyy are second-order Gaussian derivative responses of the image at the given point and scale (approximated with box filters for speed).
Orientation is estimated from Haar wavelet responses dx, dy summed within a sliding window:
V = [Σ dx, Σ dy]
🔗 Descriptor Vector:
After finding the keypoint and its orientation, SURF builds a descriptor (just like SIFT
does) using wavelet responses, which makes it faster.
Image Matching
Applications
● Triangulation: Locate the actual 3D point in the scene from its two (or more)
corresponding views.
Types of Matching
1. Area-Based: compare pixel windows/patches directly (e.g., SSD, NCC).
2. Feature-Based: match detected keypoints using their descriptors.
Matching Techniques
● Reliability Measures
→ RANSAC (Random Sample Consensus)
→ Distance ratio
→ Non-consistent matches removal
→ Cross-check (symmetric match)
● Robust Matching
→ Voting
→ Histogram
→ Hough transform
→ Epipolar geometry
→ Graph-based
→ Bag of words
→ Best match
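A sketch of two of the reliability measures above, using ORB descriptors and OpenCV's brute-force matcher (an illustrative setup, not the only option): the cross-check (symmetric match) and the distance-ratio test.

```python
import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Cross-check: keep only matches that are mutual best matches.
bf_cross = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
cross_matches = bf_cross.match(des1, des2)

# Distance-ratio test: the best match must be clearly better than the second best.
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
pairs = bf.knnMatch(des1, des2, k=2)
good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
```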
✂️ 2. Edge Detection
Detects boundaries of objects using gradients.
● Key Techniques:
○ Sobel Operator
○ Prewitt Operator
○ Canny Edge Detector: Most used, multi-stage (smoothing, gradient calculation,
non-maximum suppression, hysteresis thresholding).
○ Laplacian of Gaussian (LoG): Finds edges using second-order derivatives.
○ Difference of Gaussian (DoG): Approximates LoG, used in SIFT.
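A short OpenCV sketch of Sobel gradients and the Canny detector (the threshold values and file name are illustrative):

```python
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Sobel gradients (first-order derivatives in x and y).
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# Canny: the two numbers are the hysteresis thresholds.
edges = cv2.Canny(gray, 100, 200)
```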
🌾 3. Texture Analysis
Describes patterns or variations in image intensity.
● Statistical Methods:
○ Gray-Level Co-occurrence Matrix (GLCM): Measures texture features like
contrast, correlation, homogeneity.
○ Local Binary Patterns (LBP): Encodes local texture by thresholding
neighborhood.
● Transform-Based:
○ Fourier Transform
○ Gabor Filters
○ Wavelets
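A minimal texture sketch using scikit-image's local_binary_pattern (assuming scikit-image is available; P = 8 neighbors at radius R = 1 are typical choices):

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

gray = cv2.imread("fabric.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# LBP: threshold each pixel's 8 neighbors at radius 1 and encode the pattern.
P, R = 8, 1
lbp = local_binary_pattern(gray, P, R, method="uniform")

# Texture descriptor: normalized histogram of the LBP codes.
hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
```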
🧩 4. Region-Based Segmentation
Groups pixels into regions based on similarity.
● Region Growing: Start from seed points and grow based on similarity.
● Region Splitting & Merging: Divide image, then merge similar regions.
● Watershed Algorithm: Treats image as a topographic surface.
● Graph-Based Methods (e.g., Normalized Cuts)
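A sketch of the standard OpenCV marker-based watershed pipeline (thresholds, kernel sizes, and the "coins.png" file name are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("coins.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Sure background (dilated objects) and sure foreground (distance-transform peaks).
kernel = np.ones((3, 3), np.uint8)
sure_bg = cv2.dilate(binary, kernel, iterations=3)
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(sure_bg, sure_fg)

# Seed markers for each region, then let watershed flood the "topographic surface".
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0
markers = cv2.watershed(img, markers)
img[markers == -1] = (0, 0, 255)   # watershed boundaries drawn in red
```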
🔍 1. Matching
Matching is the process of finding correspondences between features (points, patches, or
regions) in different images.
🧱 A. Types of Matching
● Feature-Based Matching: Match keypoints using descriptors.
● Template Matching: Slide a template over the image and compare.
● Area-Based Matching: Use windows/patches (e.g., SSD, NCC).
● Descriptor Matching:
○ Distance Metrics: Euclidean, SSD (Sum of Squared Differences), Cosine,
Hamming.
○ Matching Algorithms: Brute Force, k-NN, FLANN.
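A quick sketch of template matching with OpenCV (normalized cross-correlation; file names are placeholders):

```python
import cv2

scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)

# Slide the template over the scene and score each position with normalized correlation.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

h, w = template.shape
top_left = max_loc                               # best-matching position
bottom_right = (max_loc[0] + w, max_loc[1] + h)
print("best score:", max_val, "at", top_left)
```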
🧠 2. Recognition
Recognition means identifying what object or category is present in the image.
🎯 A. Types
● Object Recognition: e.g., recognize a "cat" or "car".
● Face Recognition: Identify people by faces.
● Scene Recognition: e.g., “indoor” vs. “outdoor”.
🔍 B. Techniques
● Template Matching: Match with stored image patterns (rigid).
● Feature-Based: Match extracted features to database features.
● Bag of Visual Words (BoVW): Treat local features as words and do classification.
● Machine Learning Classifiers:
○ SVMs
○ KNN
○ Random Forests
● Deep Learning: CNN-based classifiers trained end-to-end, or pretrained networks used as feature extractors.
🔗 1. Fusion
Image Fusion means combining multiple images into a single enhanced image, retaining
complementary information.
📸 A. Types
● Multi-focus Fusion: Combine images with different focus areas.
● Multi-sensor Fusion: e.g., Thermal + RGB for surveillance.
● Multi-exposure Fusion: Combine HDR images.
⚙️ B. Techniques
● Pixel-level Fusion:
○ Average or max pixel intensities.
● Feature-level Fusion:
○ Extract features (edges, textures), then combine.
● Decision-level Fusion:
○ Fuse decisions from multiple models/sources.
● Wavelet/Transform Fusion:
○ Decompose images (e.g., DWT), fuse components, reconstruct.
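A minimal sketch of pixel-level fusion (the first technique above) with NumPy: average and max rules. The two input files are placeholders and are assumed to be already registered and the same size.

```python
import cv2
import numpy as np

a = cv2.imread("focus_near.jpg").astype(np.float32)   # placeholder file names,
b = cv2.imread("focus_far.jpg").astype(np.float32)    # already aligned images

fused_avg = ((a + b) / 2).astype(np.uint8)     # average rule
fused_max = np.maximum(a, b).astype(np.uint8)  # max rule (keeps brighter detail)
```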
✅ Use Cases
● Medical imaging (e.g., CT + MRI)
● Surveillance
● Robotics (e.g., vision + LIDAR)
Steps:
4. Fuse high-pass ST coefficients of the source images thus selected and tuned with the
best low-pass sub-band as estimated by SVD in Step 2.
📐 2. Image Alignment
Image Alignment refers to registering two or more images so their contents line up accurately.
🔍 A. Steps
1. Detect Features: SIFT, SURF, ORB.
2. Describe Features: Compute feature descriptors.
3. Match Features: Use SSD, ratio test, etc.
4. Estimate Transformation:
○ Affine: preserves lines/parallelism
○ Homography: projective transformation (used for stitching)
5. Warp Image using transformation matrix.
🧪 Homography Example:
If point (x, y) in Image A maps to (x', y') in Image B, then in homogeneous coordinates:
[x']   [h11 h12 h13]   [x]
[y'] ≅ [h21 h22 h23] · [y]
[1 ]   [h31 h32 h33]   [1]
i.e., x' = (h11·x + h12·y + h13) / (h31·x + h32·y + h33) and y' = (h21·x + h22·y + h23) / (h31·x + h32·y + h33).
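Given matched keypoints from images A and B (the point arrays below are hypothetical; in practice they come from feature matching as described above), a sketch of estimating H with RANSAC and warping A into B's frame:

```python
import cv2
import numpy as np

# Matched points (x, y) in image A and their correspondences in image B.
pts_a = np.float32([[10, 20], [200, 30], [50, 220], [210, 200]]).reshape(-1, 1, 2)
pts_b = np.float32([[12, 25], [195, 40], [60, 215], [205, 210]]).reshape(-1, 1, 2)

# Estimate the 3x3 homography; RANSAC rejects outlier correspondences.
H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 5.0)

# Warp image A into image B's coordinate frame.
img_a = cv2.imread("imageA.jpg")                      # placeholder file name
warped = cv2.warpPerspective(img_a, H, (800, 600))    # output size is illustrative
```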
1. Translation
Definition: Moves (shifts) an image in the x and/or y direction without rotating or scaling it.
Transformation Matrix:
[ 1  0  tx ]
[ 0  1  ty ]
[ 0  0  1  ]
Use Case: When the image is simply displaced but not deformed.
2. Affine Transformation
Definition: Preserves lines and parallelism (but not necessarily distances and angles). Includes
translation, rotation, scaling, and shearing.
Transformation Matrix:
[ a  b  tx ]
[ c  d  ty ]
[ 0  0  1  ]
Properties:
● Preserves straight lines and parallelism.
● Does not necessarily preserve lengths or angles.
● 6 degrees of freedom (a, b, c, d, tx, ty).
Use Case: Mapping between two images when the camera undergoes rotation, scaling, or
shear.
3. Projective Transformation (Homography)
Definition: A more general transformation that can map a plane to another plane under
perspective. It includes all affine transformations and more.
Transformation Matrix:
[ h11  h12  h13 ]
[ h21  h22  h23 ]
[ h31  h32  h33 ]
Properties:
● Preserves straight lines, but not parallelism.
● Defined up to scale, so it has 8 degrees of freedom.
Use Case: Used in panorama stitching, object detection, camera calibration, AR, etc.
🧵 3. Image Stitching
Stitching involves aligning and blending multiple overlapping images into a seamless panorama.
⚙️ A. Steps
1. Detect & Match Features (like image alignment)
2. Estimate Homography between image pairs
3. Warp Images to align with a reference frame
4. Blend Images:
○ Linear blending
○ Multi-band blending (Laplacian pyramids)
○ Seam finding to remove visible boundaries
🔄 Automatic Tools
● OpenCV’s cv2.Stitcher_create()
● Python libraries: OpenCV, ImageAI, AutoStitch
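A minimal sketch using OpenCV's high-level stitcher (file names are placeholders):

```python
import cv2

images = [cv2.imread(name) for name in ["pano1.jpg", "pano2.jpg", "pano3.jpg"]]

stitcher = cv2.Stitcher_create()
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print("Stitching failed with status", status)
```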
✅ Use Cases
● Panorama creation
● Aerial/mosaic imaging
● Document scanning apps
Overall stitching pipeline:
1. Image Acquisition (input overlapping images)
2. Feature Detection
3. Feature Description
4. Feature Matching
5. Homography Estimation
6. Image Warping
7. Image Blending
8. Output Generation