Unit 1
Image formation is a fundamental concept in computer vision: it describes how digital images are created and represented so that they can be analyzed and processed by computers. Understanding the image formation process is crucial for building algorithms that interpret visual data accurately.
1. Light Source:
o Key properties of light include intensity, wavelength (color), and direction, all of which affect the resulting image.
2. Camera Model:
o Cameras simulate the human eye to capture images of the 3D world. The pinhole camera model is a simple mathematical model widely used in computer vision:
▪ The world is projected onto a 2D plane (image plane) through a single point
(pinhole).
3. Projection Geometry:
o Perspective Projection: Objects farther from the camera appear smaller, creating
depth perception.
4. Image Representation:
o An image is a 2D matrix of pixels, where each pixel stores intensity or color values:
▪ Color Images: Represent RGB values (three channels: Red, Green, Blue).
5. Camera Components:
o Camera Lens: Collects light and focuses it onto the image sensor.
o Image Sensor: Converts light into electrical signals (e.g., CCD or CMOS sensors).
8. Mathematical Tools:
Image formation is the process of converting a 3D real-world scene into a 2D digital representation
that can be analyzed by computers. The process involves several key steps:
1. Scene Illumination:
o Light from natural or artificial sources illuminates the objects in the scene.
2. Light Projection:
o Light rays pass through the camera lens and project onto the image plane.
o The pinhole camera model or more complex lens models govern this projection.
3. Sensor Capture:
o Light is converted into electrical signals using an image sensor (e.g., CCD or CMOS).
o Each pixel in the sensor captures light intensity (grayscale) or light of different
wavelengths (color).
4. Digital Image Processing:
o Enables core computer vision tasks like object detection, facial recognition, and 3D
scene reconstruction.
o Captures visual data that closely represents the physical world, making it intuitive for
humans and effective for AI models.
3. Automation Potential:
o Automates tasks like quality inspection, surveillance, and navigation that would
otherwise require human intervention.
4. Scalability:
o Once set up, systems leveraging image formation can process vast amounts of data
quickly and consistently.
5. Multi-Sensor Integration:
o Works with other sensors like LiDAR and depth cameras to create more robust
systems for 3D perception.
1. Environmental Dependence:
o Varying lighting conditions, weather, and occlusions can degrade image quality and
affect accuracy.
3. Sensor Limitations:
o Dynamic range, resolution, and noise levels in sensors restrict the quality of captured
images.
4. Lens Distortions:
o Imperfections in the lens (e.g., radial distortion) bend straight lines and warp the captured image.
5. Motion Blur:
o Fast-moving objects or camera motion can cause blurred images, affecting analysis.
7. Data Ambiguity:
o The same 2D image can arise from many different 3D scenes, making interpretation ambiguous.
8. Ethical Concerns:
o Widespread image capture raises privacy and surveillance concerns.
1. 3D Reconstruction:
o Building 3D models of scenes or objects from 2D images.
3. Augmented Reality:
o Overlaying virtual content on views of the real world.
4. Robotics:
o Helping robots perceive and navigate their environment.
5. Medical Imaging:
o Supporting the analysis of medical images.
Image formation involves two main stages:
1) Image capture
2) Image representation
Both stages are crucial for enabling computers to analyze and interpret visual data
effectively.
1. Image Capture:
Image capture involves the transformation of light from a scene into a digital format that can be
processed by a computer. This step includes:
• Light originates from sources such as the sun, artificial lights, or ambient illumination.
• When light interacts with objects in the scene, the following phenomena occur:
o Reflection: Light bounces off object surfaces toward the camera; the surface's properties determine how much light is reflected.
o Transmission and Refraction: Light passes through transparent objects and bends.
• Camera Lens:
o The lens introduces projection effects such as perspective, which impacts the
appearance of objects in the captured image.
• Projection Models:
o Pinhole Camera Model:
▪ Simplest model, where light rays pass through a small aperture (pinhole) to form an inverted image on the image plane.
▪ 3D points in the scene (X, Y, Z) are mapped to 2D points (x, y) on the image plane based on:
x = f · X / Z,  y = f · Y / Z
where f is the focal length (see the sketch after this list).
o Lens-Based Model:
▪ Uses a lens rather than an ideal pinhole, additionally accounting for focus, aperture, and lens distortion effects.
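As a minimal sketch of the pinhole projection above (the function name, focal length value, and example points are illustrative assumptions, not part of any specific library):

# Pinhole projection sketch: map a 3D point (X, Y, Z) in camera coordinates
# to a 2D image-plane point (x, y) using x = f*X/Z, y = f*Y/Z.
def project_pinhole(X, Y, Z, f=0.035):
    if Z <= 0:
        raise ValueError("Point must lie in front of the camera (Z > 0)")
    return f * X / Z, f * Y / Z

# The same 3D offset projects to smaller image coordinates when it is farther
# away, which is the perspective effect described earlier.
print(project_pinhole(1.0, 0.5, 2.0))   # nearer point
print(project_pinhole(1.0, 0.5, 4.0))   # farther point -> smaller (x, y)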
C. Image Sensors
• The sensor (e.g., CCD or CMOS) converts the incoming light into electrical signals.
• Each pixel on the sensor measures light intensity (grayscale) or light intensity for different wavelengths (color).
D. Analog-to-Digital Conversion
• The electrical signals from the sensor are digitized into discrete pixel values:
o Grayscale Images: Represent light intensity using a single value per pixel (e.g., 0–255
for 8-bit images).
o Color Images: Represent light intensity for different wavelengths, commonly stored
as RGB (Red, Green, Blue) triplets.
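To make the digitized representation concrete, the short NumPy sketch below (array sizes and pixel values are made-up examples) shows how 8-bit grayscale and RGB images are commonly stored:

import numpy as np

# 8-bit grayscale image: one intensity value per pixel, 0 (black) to 255 (white).
gray = np.zeros((4, 6), dtype=np.uint8)
gray[1, 2] = 255                 # one bright pixel

# 8-bit RGB color image: three values (Red, Green, Blue) per pixel.
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]          # one pure-red pixel

print(gray.shape, gray.dtype)    # (4, 6) uint8
print(rgb.shape, rgb.dtype)      # (4, 6, 3) uint8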
2. Image Representation:
Once the image is captured, it is represented in a format that can be processed by computer vision
algorithms. Representation involves the organization and encoding of pixel data to extract
meaningful information.
A. Pixel Grid
• Each pixel in the image has a position (x,y) in the image coordinate system.
C. Resolution
• Higher resolution provides finer detail but requires more storage and processing power.
• Grayscale:
o Represents brightness levels on a scale (e.g., 0 for black, 255 for white in 8-bit
images).
• Color:
o RGB format is most common, with each channel typically stored as an 8-bit value (0–
255).
o Other color spaces, like HSV (Hue, Saturation, Value) or YUV, are used for specific
applications.
• Some systems capture depth along with intensity or color using techniques like stereo vision,
LiDAR, or structured light.
• Standard cameras have limited dynamic range, which may cause loss of detail in very bright
or dark areas.
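As an illustration of moving between these representations (the pixel values are arbitrary, and the grayscale weights below are commonly used luminance coefficients, assumed here rather than prescribed by this unit):

import colorsys

# Convert one 8-bit RGB pixel to a single grayscale intensity using the
# common luminance weighting Y = 0.299 R + 0.587 G + 0.114 B.
r, g, b = 200, 120, 40
gray = 0.299 * r + 0.587 * g + 0.114 * b
print(round(gray))                       # one brightness value

# Convert the same pixel to HSV with the standard library (inputs in [0, 1]).
h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
print(h * 360.0, s, v)                   # hue in degrees, saturation, value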
Various methods are used to acquire visual data, depending on the application and the type of
information required:
1. Standard 2D Cameras:
o Capture intensity or color images of the scene from a single viewpoint.
2. Stereo Cameras:
o Capture the scene from two slightly different viewpoints, enabling depth estimation (stereo vision).
3. Depth Cameras:
o Uses techniques like structured light, time-of-flight (ToF), or LiDAR to capture depth.
4. Multispectral Cameras:
o Captures images in multiple spectral bands beyond the visible range (e.g., infrared, ultraviolet).
5. High-Speed Cameras:
o Capture images at very high frame rates to record fast motion.
6. Thermal Cameras:
o Capture infrared radiation emitted by objects, representing temperature rather than visible light.
1. Pixel-Based Representation:
o Images are represented as a grid of pixels, each storing intensity or color values.
2. Feature-Based Representation:
o Represents key features (e.g., edges, corners, textures) instead of raw pixel data.
3. Sparse Representations:
o Focuses only on important areas or features, reducing data size.
4. Graph-Based Representations:
o Models an image as a graph where pixels or regions are nodes, and edges represent
relationships.
5. 3D Representations:
o Uses structures such as point clouds, voxels, or meshes to represent scenes in three dimensions.
1. Environmental Factors:
o Lighting, weather, and occlusions affect the quality of the acquired images.
2. Sensor Noise:
o Electronic noise and limited dynamic range degrade the captured data.
3. Projection Loss:
o Depth information is lost when the 3D scene is projected onto a 2D image.
Applications:
2. 3D Reconstruction
• Examples:
o Archaeological site reconstruction.
3. Autonomous Vehicles
• Use: Navigating and understanding the environment using cameras and sensors.
• Examples:
o Lane detection.
4. Augmented Reality
• Use: Integrating virtual elements with the real world or creating immersive environments.
• Examples:
o AR gaming.
5. Medical Imaging
6. Surveillance and Security
• Examples:
o Intruder detection.
o Crowd analysis.
7. Industrial Automation
8. Agriculture
9. Entertainment
Linear filtering involves modifying the value of a pixel (or a data point in general) by
applying a mathematical function that depends linearly on the values of its neighboring pixels. The
output at each point is a weighted sum of the input values, where the weights are defined by a filter
kernel (or mask).
Steps in Linear Filtering
1. Define the Kernel:
o A small matrix of weights (e.g., 3×3, 5×5) that defines the transformation.
o Examples include box (averaging) and Gaussian kernels.
2. Convolution or Correlation:
o Convolution: Flips the kernel horizontally and vertically before sliding it across the image.
o Correlation: Directly slides the kernel across the image without flipping.
3. Compute the Weighted Sum:
o For each pixel, compute the sum of the product of the kernel values and the corresponding pixel values in the neighborhood.
4. Handle Image Boundaries:
o Options include padding with zeros, mirroring, or extending edge values to deal with regions where the kernel extends beyond the image boundary.
5. Produce the Output:
o The result is an image with the same dimensions (or slightly reduced if no padding is applied), where each pixel value reflects the weighted sum of its neighborhood (see the sketch after these steps).
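The steps above can be summarized in a short NumPy sketch (a plain reference implementation of correlation with zero padding; it is written for clarity rather than speed, and the box kernel at the end is just an example):

import numpy as np

def linear_filter(image, kernel):
    # Correlation: each output pixel is the weighted sum of its neighborhood,
    # with zero padding so the output keeps the input size.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 box (averaging) kernel smooths the image.
img = np.random.randint(0, 256, size=(5, 5)).astype(float)
box = np.ones((3, 3)) / 9.0
print(linear_filter(img, box))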
1. Smoothing Filters:
Box (Mean) Filter:
• Purpose: Replace each pixel with the average of its neighborhood to reduce noise.
Gaussian Filter:
• Purpose: Smooth the image while preserving edges better than the box filter.
2. Sharpening Filters:
• Purpose: Enhance edges and fine detail by emphasizing differences between a pixel and its neighbors.
• Kernel: Typically a 3×3 kernel with a positive center weight and negative neighbor weights (see the sketch below).
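For reference, a Gaussian kernel can be generated from the 2D Gaussian function, and a commonly used 3×3 sharpening kernel is shown alongside it (the kernel size, sigma, and exact sharpening weights are typical textbook choices, not requirements):

import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Sample the 2D Gaussian on a size x size grid and normalize to sum to 1.
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

# A common sharpening kernel: boosts the center pixel, subtracts its neighbors.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

print(gaussian_kernel(3, sigma=1.0))
print(sharpen)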
1. Image Smoothing:
2. Edge Detection:
o Identifies boundaries of objects.
3. Feature Extraction:
4. Image Enhancement:
5. Data Preprocessing:
1. Loss of Detail:
o Smoothing removes noise but can also blur fine details.
3. Edge Artifacts:
o Padding choices at image boundaries can introduce artifacts near the edges.
4. Limited Context:
o Only considers local neighborhoods, which may not capture larger patterns.
Example:
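A small worked example (the pixel values are made up): applying a 3×3 box filter to one neighborhood simply replaces the center pixel with the neighborhood mean.

import numpy as np

# One 3x3 neighborhood of pixel intensities.
patch = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]], dtype=float)

# Box filter: every weight is 1/9, so the weighted sum equals the mean.
box = np.ones((3, 3)) / 9.0
print(np.sum(patch * box))   # 50.0 -> new value of the center pixel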
Correlation:
Correlation is a statistical measure that describes the extent to which two variables are linearly
related. In simpler terms, it quantifies the strength and direction of the relationship between two
data sets. Correlation is widely used in various fields, including statistics, machine learning, and
computer vision, to understand dependencies and interactions between variables.
1. Direction:
o Positive: both variables increase together; Negative: one increases as the other decreases.
2. Magnitude:
o The correlation coefficient ranges from −1 to +1:
▪ +1: Perfect positive correlation.
▪ 0: No correlation.
▪ −1: Perfect negative correlation.
Types of Correlation:
1. Pearson Correlation Coefficient:
• Measures the strength and direction of a linear relationship between two variables.
o Formula (see the sketch after this list):
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
2. Spearman's Rank Correlation:
• Measures the strength and direction of a monotonic relationship between ranked variables.
3. Kendall's Tau:
• Measures the association between two variables based on the ranking of data.
4. Cross-Correlation:
• Measures the similarity between two signals (or image regions) as one is shifted relative to the other; widely used for template matching.
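A brief sketch of computing these coefficients in Python (NumPy for Pearson and SciPy for the rank-based measures; the sample data below are arbitrary):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

pearson_r = np.corrcoef(x, y)[0, 1]       # linear relationship, -1 to +1
spearman_r, _ = stats.spearmanr(x, y)     # monotonic relationship of ranks
kendall_tau, _ = stats.kendalltau(x, y)   # agreement in the ordering of pairs

print(pearson_r, spearman_r, kendall_tau)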
Correlation in Machine Learning:
1. Feature Selection:
o Helps identify redundant or strongly related features in a dataset.
2. Predictive Modeling:
o Indicates dependencies that may improve model performance.
3. Interpretability:
o Makes the relationships between variables easier to explain.
Correlation in Computer Vision:
1. Template Matching:
o Measures the similarity between an image template and regions in a larger image (see the sketch after this list).
2. Feature Matching:
o Compares feature descriptors between images to find corresponding points.
3. Optical Flow:
o Estimates apparent motion between frames by correlating image patches over time.
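As a sketch of correlation-based template matching (a straightforward normalized cross-correlation loop written for clarity; real systems would typically use an optimized library routine):

import numpy as np

def match_template(image, template):
    # Slide the template over the image and score each position with
    # normalized cross-correlation; the peak marks the best match.
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            w = image[i:i + th, j:j + tw]
            w = w - w.mean()
            denom = np.sqrt(np.sum(w ** 2) * np.sum(t ** 2))
            scores[i, j] = np.sum(w * t) / denom if denom > 0 else 0.0
    return scores

# Example: the template is cropped from the image, so the score peaks at (2, 3).
img = np.random.rand(8, 8)
tmpl = img[2:5, 3:6]
scores = match_template(img, tmpl)
print(np.unravel_index(np.argmax(scores), scores.shape))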
Advantages of Correlation:
1. Simplicity:
2. Quantifies Relationships:
3. Feature Analysis:
4. Signal Processing:
5. Predictive Modeling:
Disadvantages of Correlation:
1. Limited to Linear Relationships:
o Correlation measures only linear dependencies and may not detect non-linear relationships.
2. No Causation:
o Correlation does not imply causation; two correlated variables may be influenced by
a third factor.
3. Sensitivity to Outliers:
o Extreme values can distort the correlation coefficient, leading to misleading
interpretations.
Applications of Correlation:
2. Machine Learning
3. Signal Processing
• Pattern Recognition: Identifying patterns in data streams, such as audio or seismic signals.
4. Computer Vision
5. Finance and Economics
• Analyzing relationships between financial indicators (e.g., stock prices and interest rates).
6. Bioinformatics
7. Social Sciences
Convolution:
Convolution is a mathematical operation that combines two functions to produce a third
function. In the context of images, convolution involves a small matrix called a kernel or filter sliding
over an image to perform operations like edge detection, blurring, or sharpening.
Mathematical Representation:
For an image I and a kernel K, the discrete 2D convolution is
(I * K)(x, y) = Σᵢ Σⱼ I(x − i, y − j) · K(i, j)
i.e., the kernel is flipped and slid over the image, and a weighted sum is taken at every position.
1. Kernel/Filter:
o A small matrix (e.g., 3×3, 5×5) with predefined or learned weights.
2. Sliding Window:
o The kernel slides across the image, covering one local region (receptive field) at a time.
3. Aggregation:
o At each position, the element-wise products of the kernel weights and the underlying pixel values are summed into a single output value.
4. Output:
o The output is a feature map (or activation map) highlighting specific patterns or
features.
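A minimal sketch of the operation (note the kernel flip, which distinguishes convolution from the correlation described earlier; the 'valid' output size and the edge kernel are illustrative choices):

import numpy as np

def convolve2d_valid(image, kernel):
    # True convolution: flip the kernel on both axes, then take weighted sums
    # over every fully overlapping neighborhood ('valid' output size).
    flipped = np.flip(kernel)
    kh, kw = flipped.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# Example: a vertical-edge kernel yields a feature map that responds where
# intensity changes from left to right.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)
print(convolve2d_valid(img, edge_kernel))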
Key Components in Convolutional Operations:
1. Stride:
o The step size with which the kernel moves across the image.
o Larger strides reduce the output size, capturing more abstract features (see the sketch after this list).
2. Padding:
o Adds extra pixels around the image to control the output size.
o Types:
▪ Valid: no padding, so the output shrinks.
▪ Same: zero padding chosen so the output keeps the input's spatial size.
3. Channels:
o Color images have multiple channels (e.g., R, G, B).
o Each kernel applies convolution to individual channels, and the results are aggregated.
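The combined effect of stride and padding on the output size follows the standard formula output = floor((n + 2p − k) / s) + 1; a quick check with illustrative numbers:

def conv_output_size(n, k, stride=1, padding=0):
    # Spatial output size of a convolution along one dimension.
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(32, 3, stride=1, padding=0))   # 30 -> no padding shrinks the map
print(conv_output_size(32, 3, stride=1, padding=1))   # 32 -> 'same' padding keeps the size
print(conv_output_size(32, 3, stride=2, padding=1))   # 16 -> stride 2 downsamples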
1. Feature Extraction:
o Kernels respond to patterns such as edges, corners, and textures, turning raw pixels into feature maps.
2. Translation Invariance:
o The same filter is applied across the image, ensuring features are detected regardless of location.
3. Parameter Efficiency:
o A single small kernel is reused across the whole image, so only a few weights are needed regardless of image size.
1. Efficient Representation:
2. Scalable:
3. Universal Applicability:
Challenges of Convolution:
1. Computational Intensity:
o Convolving large images with many filters requires significant computation.
2. Limited Receptive Field:
o Each convolution captures local features, requiring deeper layers for global understanding.
3. Overfitting:
o May occur without proper regularization techniques (e.g., dropout, weight decay).
1. Image Processing:
2. Object Detection:
3. Image Segmentation:
4. Feature Matching:
5. Facial Recognition:
6. Generative Models:
o GANs use convolutions for creating new images or altering existing ones.
7. Image Classification: