
21AI601 - COMPUTER VISION

Unit I & LP3 - DEPTH ESTIMATION AND MULTI CAMERA VIEWS: PERSPECTIVE, BINOCULAR STEREOPSIS: CAMERA AND EPIPOLAR GEOMETRY

1. DEPTH ESTIMATION AND MULTI CAMERA VIEWS


 Depth is a critical part of computer vision, which gives the computer information about the
distance of objects to the camera.
 Image depth estimation is about figuring out how far away objects in an image are.
 Depth estimation involves determining the distance of each pixel in relation to the camera.
 Depth is extracted from either monocular (single) or stereo (multiple views of a scene)
images.
 The task takes an RGB image as input and outputs a depth image.
 The depth image includes information about the distance of the objects in the image from
the viewpoint, which is usually the camera taking the image.
 Applications of depth estimation include smoothing blurred parts of an image, better
rendering of 3D scenes, grasping in robotics, robot-assisted surgery, automatic 2D-to-3D
conversion in film, and shadow mapping in 3D computer graphics.
 Another significant application is self-driving cars, which need to know how far away the
vehicle ahead is in order to avoid collisions.
 Depth estimation is an important problem in computer vision because it supports tasks such
as creating 3D models, augmented reality, and self-driving cars.
 One way of obtaining depth information is through stereo vision, which uses two cameras,
usually side by side. Both cameras take a picture of the same scene.
 Objects that appear in both images will be at slightly different positions; this difference in
position is called the disparity.
 Objects close to the cameras will have a larger horizontal disparity in the images, whereas
faraway objects will have a smaller disparity.
 In the past, depth was estimated with techniques like stereo vision or special sensors. More
recently, deep-learning models such as Dense Prediction Transformers (DPTs) predict
depth directly from images.
 Depth estimation and multi-camera views in computer vision are closely related concepts,
as depth information is often derived from multiple camera views to create a more accurate
representation of the 3D structure of a scene.

1.1 How do we estimate depth?


 Our eyes estimate depth by comparing the images obtained by our left and right eyes.
 The minor displacement between both viewpoints is enough to calculate an approximate
depth map. We call the pair of images obtained by our eyes a stereo pair.
 This, combined with the variable focal length of our lenses and general experience of
“seeing things”, gives us seamless 3D vision.

1.2 Methods for depth estimation


1.2.1 Passive Methods:
Passive methods use information from a single image or a stereo pair of images without actively
projecting any additional light or signals.
Stereo Vision:
Principle: Stereo vision involves using two or more cameras to capture a scene from slightly
different viewpoints. The disparity between corresponding points in the left and right images is
used to calculate depth through triangulation. The baseline (distance between the cameras) affects
the accuracy of depth estimation.
Multi-camera Setup: Multiple cameras are positioned at different locations to capture a scene
from various perspectives, enhancing the accuracy of depth estimation.
Applications: Stereo vision is widely used in robotics, autonomous vehicles, and 3D
reconstruction.
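
As an illustration of the stereo principle, the sketch below computes a disparity map from a rectified left/right image pair with OpenCV's block matcher. The file names and matcher parameters are illustrative, not part of the course material.

    import cv2

    # Load a rectified stereo pair as grayscale (file names are placeholders).
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Block matcher: numDisparities must be a multiple of 16, blockSize must be odd.
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

    # StereoBM returns fixed-point disparities scaled by 16.
    disparity = stereo.compute(left, right).astype("float32") / 16.0
    # Larger disparity means the point is closer to the cameras.
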
Depth from Defocus:
This method uses the blur information in images to estimate depth. Objects at different distances
will have different amounts of blur in the image. Cameras with controllable aperture sizes can
exploit this information to estimate depth.

Structure from Motion (SfM):


SfM involves analyzing the motion of features across multiple frames of a video sequence. By
tracking the movement of these features, the depth information can be inferred.
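
A minimal sketch of the two-frame SfM idea, assuming OpenCV and an intrinsic matrix K from calibration (the K values and file names below are illustrative):

    import cv2
    import numpy as np

    # Illustrative intrinsics; in practice K comes from camera calibration.
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])

    frame1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
    frame2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

    # Detect corners in the first frame and track them into the second frame.
    pts1 = cv2.goodFeaturesToTrack(frame1, maxCorners=500, qualityLevel=0.01, minDistance=7)
    pts2, status, _ = cv2.calcOpticalFlowPyrLK(frame1, frame2, pts1, None)
    pts1, pts2 = pts1[status.ravel() == 1], pts2[status.ravel() == 1]

    # Recover the relative camera motion (rotation R, translation t up to scale).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    # With R and t known, the tracked points can be triangulated into 3D (up to scale).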

Depth from Semantic Segmentation:


By using deep learning techniques, such as convolutional neural networks (CNNs), depth can be
estimated based on semantic segmentation information. The network learns to associate certain
object classes with specific depths.

1.2.2 Active Methods:


Active methods involve actively projecting light or signals into the scene and measuring their
interactions with objects to determine depth.

Time-of-Flight (ToF):
ToF cameras emit light pulses and measure the time it takes for the light to travel to the object and
back. This information is used to calculate the distance between the camera and the object.
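
The distance calculation itself follows directly from the round-trip time; a minimal sketch with illustrative values:

    SPEED_OF_LIGHT = 299_792_458.0  # metres per second

    def tof_distance(round_trip_time_s: float) -> float:
        # The pulse travels to the object and back, so divide the path by 2.
        return SPEED_OF_LIGHT * round_trip_time_s / 2.0

    # A round trip of 20 nanoseconds corresponds to roughly 3 metres.
    print(tof_distance(20e-9))
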
Structured Light:
Principle: Structured light systems project known patterns onto a scene; depth is then calculated
by analyzing the deformation of the pattern on the object surfaces. Using multiple cameras can
enhance the accuracy and coverage of the depth information.
Applications: Multi-camera structured light setups are used in industrial applications, such as
quality control and 3D scanning.

Lidar (Light Detection and Ranging):


Lidar systems use laser beams to measure the distance to objects. By analyzing the time it takes
for the laser beams to travel to the object and back, a 3D point cloud of the scene can be generated.
Lidar can also be combined with multi-camera setups to create more comprehensive 3D
representations.
Sensor Fusion: Integrating lidar data with information from multiple cameras helps overcome
limitations, such as occlusions or difficulties in textureless areas.
Applications: Autonomous vehicles often use a combination of lidar and camera data for robust
perception.
Active Stereo Vision:
Similar to stereo vision, but with the addition of active illumination. This can improve performance
in low-light conditions.
Deep Learning Approaches:
In recent years, deep learning methods, particularly convolutional neural networks (CNNs) and
recurrent neural networks (RNNs), have shown remarkable success in depth estimation. End-to-end
models can be trained to directly predict depth from monocular or stereo images.
Principle: Deep learning models can also be trained to estimate depth directly from multiple
camera views.
End-to-End Models: These models take advantage of the rich features captured by multiple views
and learn complex mappings from images to depth maps.
Applications: Multi-view deep learning models are applied in areas like augmented reality and
3D scene understanding.
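
As a concrete example of an end-to-end model, the sketch below runs a pretrained monocular depth network (MiDaS) through torch.hub. It assumes PyTorch, OpenCV, and the intel-isl/MiDaS hub entry point are available; model and transform names may differ between releases.

    import cv2
    import torch

    # Load a small pretrained MiDaS model and its matching input transform
    # (assumes the intel-isl/MiDaS torch.hub repository is reachable).
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    midas.eval()

    img = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
    batch = transforms.small_transform(img)

    with torch.no_grad():
        prediction = midas(batch)              # relative (inverse-)depth map
    depth = prediction.squeeze().cpu().numpy()
    # 'depth' is a per-pixel relative depth estimate, not a metric distance.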

2. PERSPECTIVE, BINOCULAR STEREOPSIS


2.1 Binocular Stereopsis
Binocular stereopsis in computer vision is a technique that mimics the human visual system's ability
to perceive depth by using information from both eyes. This method involves capturing and
analyzing images from two slightly offset cameras, simulating the way human eyes provide
different viewpoints of the same scene. Binocular stereopsis is a fundamental concept in stereo
vision, and it plays a crucial role in tasks such as depth perception and 3D reconstruction.

2.1.1. Principle of Binocular Stereopsis:


Binocular Disparity: The key idea behind binocular stereopsis is the disparity between the images
captured by the left and right cameras. Disparity refers to the horizontal shift or difference in the
apparent position of an object in the left and right images.
Triangulation: By analyzing the disparity, the depth of objects in the scene can be determined
using triangulation principles.
2.1.2. Stereo Camera Setup:
Camera Configuration: Two cameras are positioned at a slight horizontal offset, simulating the
separation between human eyes. This offset is referred to as the baseline.
Calibration: Precise calibration of the cameras is essential to ensure accurate correspondence
between points in the left and right images.

2.1.3. Depth Estimation using Binocular Stereopsis:


Correspondence Matching: The process begins with identifying corresponding points in the left
and right images. This involves finding features or patterns that can be matched between the two
images.
Disparity Calculation: Once correspondences are established, the disparity between these points
is calculated. The greater the disparity, the closer the object is to the cameras.
Depth Map Generation: By mapping the disparity values across the entire image, a depth map can
be created, providing the depth information for each pixel.
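
The last step follows directly from triangulation: depth Z = f * B / d, where f is the focal length in pixels, B the baseline, and d the disparity. A minimal sketch, assuming a disparity map has already been computed (for example with the block matcher shown in section 1.2.1) and using illustrative calibration values:

    import numpy as np

    focal_px = 700.0    # focal length in pixels, from calibration (illustrative)
    baseline_m = 0.12   # distance between the two cameras in metres (illustrative)

    def disparity_to_depth(disparity: np.ndarray) -> np.ndarray:
        # Triangulation: Z = f * B / d; non-positive disparities are invalid.
        depth = np.zeros_like(disparity, dtype=np.float32)
        valid = disparity > 0
        depth[valid] = focal_px * baseline_m / disparity[valid]
        return depth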

2.1.4. Challenges and Solutions:


Occlusions: Occluded regions pose challenges to stereo vision. Advanced algorithms are employed
to handle occlusions and interpolate depth information in such areas.
Textureless Surfaces: Regions with little or no texture may result in ambiguous correspondences.
Proper handling of such situations is critical for accurate depth estimation.

2.1.5. Applications:
Robotics: Binocular stereopsis is widely used in robotics for tasks like navigation and object
manipulation.
Autonomous Vehicles: Depth perception is crucial for autonomous vehicles to understand the
environment and make informed decisions.
3D Reconstruction: Binocular stereopsis is a fundamental technique for creating detailed 3D
models of scenes and objects.

2.1.6. Improvements with Machine Learning:


Deep Learning: Convolutional Neural Networks (CNNs) are employed for feature extraction and
correspondence matching, improving the robustness of binocular stereopsis.
End-to-End Models: Some approaches use end-to-end learning to directly predict depth maps from
stereo image pairs.

2.1.7. Limitations:
Calibration Sensitivity: Precise calibration is critical, and small errors in camera alignment can
lead to inaccuracies.
Limited Baseline: A smaller baseline may result in less accurate depth estimation, especially for
distant objects.

2.2 Perspective Stereopsis


Perspective stereopsis in computer vision refers to the use of perspective cues to estimate depth
information in a scene, often relying on monocular (single-camera) imagery. Unlike binocular
stereopsis, which uses information from two offset cameras, perspective stereopsis exploits the
inherent depth cues present in a single image captured by a camera.
2.2.1 Key aspects of perspective stereopsis in computer vision:
Perspective Cues:
Size and Position of Objects: In a perspective image, objects that are closer to the camera appear
larger than those farther away. The position of an object in the image also provides cues about its
depth.
Perspective Projection:
Projection Geometry: The projection of 3D points onto a 2D image plane follows the principles of
perspective projection. This results in depth information being encoded in the image through the
size and position of objects.
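
A minimal sketch of perspective projection under a pinhole model (the focal length and principal point below are illustrative). The projected size of an object shrinks in proportion to 1/Z, which is exactly the size cue described above.

    def project_point(X, Y, Z, f=800.0, cx=320.0, cy=240.0):
        # Pinhole projection: image coordinates scale with 1/Z, so more
        # distant points (larger Z) project closer together in the image.
        u = f * X / Z + cx
        v = f * Y / Z + cy
        return u, v

    # The same 1 m wide object spans 400 px at Z = 2 m but only 80 px at Z = 10 m.
    print(project_point(0.5, 0.0, 2.0)[0] - project_point(-0.5, 0.0, 2.0)[0])    # 400.0
    print(project_point(0.5, 0.0, 10.0)[0] - project_point(-0.5, 0.0, 10.0)[0])  # 80.0
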
Depth from Motion:
Motion Parallax: Perspective stereopsis can leverage motion parallax, where objects at different
depths move at different rates when the camera or the objects are in motion. Analyzing this motion
provides depth information.
Texture Gradients:
Perspective-induced Gradients: The rate of change of texture across an image is indicative of
depth. As surfaces recede from the camera, their texture elements project to smaller image regions,
so the texture appears finer and more densely packed.
Single-Camera Setup:
Monocular Vision: Perspective stereopsis relies on a single camera to capture images. It doesn't
require the use of multiple cameras or stereo pairs.
Depth Estimation Techniques:
Depth from Focus: By analyzing the sharpness or blurriness of different regions in the image,
depth information can be estimated. Regions in focus lie near the focal plane of the lens, while
blurrier regions lie farther from it.
Depth from Defocus: Similar to depth from focus, but it involves deliberately defocusing certain
regions and analyzing the resulting blur for depth estimation.
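
A common proxy for local sharpness is the variance of the Laplacian. The sketch below scores image patches this way (the patch size is an arbitrary choice); it is one ingredient of a depth-from-focus pipeline rather than a complete method.

    import cv2
    import numpy as np

    def sharpness_map(gray: np.ndarray, patch: int = 32) -> np.ndarray:
        # Variance of the Laplacian per patch: higher value = sharper (better focused).
        lap = cv2.Laplacian(gray, cv2.CV_64F)
        rows, cols = gray.shape[0] // patch, gray.shape[1] // patch
        scores = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                block = lap[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
                scores[i, j] = block.var()
        return scores

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    print(sharpness_map(img).round(1))
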
Machine Learning Approaches:
Deep Learning: Convolutional Neural Networks (CNNs) and other deep learning architectures can
be trained to directly predict depth from monocular images, leveraging a large dataset with ground
truth depth information.
Applications:
Autonomous Systems: Perspective stereopsis is used in robotics and autonomous systems for tasks
such as obstacle avoidance and navigation.
Augmented Reality: Depth information from a single camera is valuable for overlaying virtual
objects onto the real world with proper depth cues.
Challenges:
Ambiguity: Monocular depth estimation can be inherently ambiguous, especially when objects
have similar appearances but different depths.
Scene Complexity: Highly cluttered or occluded scenes may pose challenges for accurate depth
estimation.
Integration with Other Techniques:
Sensor Fusion: Combining perspective stereopsis with data from other sensors (such as IMUs or
additional cameras) can enhance the robustness of depth estimation.
Perspective stereopsis is a valuable approach in scenarios where only a single camera is available,
and it complements binocular and multi-camera methods in computer vision applications.
Advances in machine learning have led to improvements in the accuracy and robustness of
monocular depth estimation techniques, making them increasingly relevant in various real-world
applications.

3. CAMERA AND EPIPOLAR GEOMETRY


3.1 Epipolar Geometry
 Epipolar geometry is a fundamental concept in computer vision and stereo vision that
describes the geometric relationship between two cameras capturing the same scene from
different viewpoints.
 This concept is particularly useful in tasks such as stereo reconstruction, 3D scene
reconstruction, and structure from motion.
 The epipolar geometry between two views is essentially the geometry of the intersection
of the image planes with the pencil of planes having the baseline as axis (the baseline is
the line joining the camera centres).

3.2 The geometric entities involved in epipolar geometry


Epipolar Line:
Given two cameras, each capturing an image of the same scene from a different perspective, the
epipolar line in one image corresponds to the line along which the corresponding point must lie in
the other image.
Epipole:
The epipole is the point of intersection of the line connecting the camera centres (the baseline)
with the image plane. Equivalently, the epipole in one image is the image of the other camera's
centre.
Epipolar Constraint:
The epipolar constraint states that the match for a point in one image must lie on the
corresponding epipolar line in the other image. This constraint reduces the search for
corresponding points from the whole image to a single line.
Epipolar Geometry Matrix (Fundamental Matrix):
The relationship between corresponding points in two images can be mathematically expressed
using the Fundamental Matrix. The Fundamental Matrix relates the pixel coordinates in one image
to the epipolar lines in the other image.
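
In homogeneous pixel coordinates the Fundamental Matrix satisfies x'^T F x = 0 for corresponding points x and x'. A minimal OpenCV sketch, assuming pts1 and pts2 are (N, 2) arrays of matched pixel coordinates obtained beforehand from any feature detector and matcher:

    import cv2
    import numpy as np

    # pts1, pts2: (N, 2) float arrays of matched points from the two images (assumed given).
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

    # Epipolar line in image 2 for each point in image 1: l' = F x.
    lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)

    # A correct correspondence should (approximately) satisfy x'^T F x = 0.
    x = np.append(pts1[0], 1.0)
    x_prime = np.append(pts2[0], 1.0)
    print(x_prime @ F @ x)   # close to zero for an inlier match
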
Essential Matrix:
The Essential Matrix is a 3x3 matrix that encapsulates the relative rotation and translation between
two calibrated cameras (cameras whose intrinsic parameters are known). It is closely related to the
Fundamental Matrix (E = K'^T F K, where K and K' are the intrinsic matrices) and is used in the
process of recovering the relative pose between the cameras.
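
With known intrinsics (assumed here to be the same matrix K for both cameras, a simplification), the Essential Matrix can be obtained from the Fundamental Matrix and decomposed into the relative rotation and translation. The sketch below reuses F, pts1 and pts2 from the previous sketch; the K values are illustrative.

    import cv2
    import numpy as np

    # Illustrative shared intrinsic matrix; in practice each camera is calibrated.
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])

    # E = K'^T F K (here K' = K).
    E = K.T @ F @ K

    # Decompose E into rotation R and translation t (t is known only up to scale).
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)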

Triangulation:
Once the epipolar geometry is established, triangulation can be used to compute the 3D position
of a point in the scene by finding the intersection of the corresponding rays in the two cameras.
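
A minimal triangulation sketch, continuing from the quantities recovered above (K, R, t, pts1, pts2); camera 1 is taken as the origin and the reconstruction is only defined up to the unknown translation scale.

    import cv2
    import numpy as np

    # Projection matrices: camera 1 at the origin, camera 2 at [R | t].
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # triangulatePoints expects 2xN arrays of pixel coordinates and returns
    # homogeneous 4xN points; divide by the last row to get 3D coordinates.
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T.astype(np.float64), pts2.T.astype(np.float64))
    pts3d = (pts4d[:3] / pts4d[3]).T   # (N, 3) reconstructed points, up to scale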
