Unit - 3 - Object Recognition
AlexNet was not the first fast GPU-implementation of a CNN to win an image
recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times
faster than an equivalent implementation on CPU.[4] A deep CNN of Dan
Cireșan et al. (2011) at IDSIA was already 60 times faster[5] and outperformed
predecessors in August 2011.[6] Between May 15, 2011, and September 10,
2012, their CNN won no fewer than four image competitions.[7][8] They also
significantly improved on the best performance in the literature for multiple
image databases.[9]
According to the AlexNet paper,[2] Cireșan's earlier net is "somewhat similar."
Both were originally written with CUDA to run with GPU support. In fact, both
are actually just variants of the CNN designs introduced by Yann LeCun et al.
(1989)[10][11] who applied the backpropagation algorithm to a variant
of Kunihiko Fukushima's original CNN architecture called "neocognitron."[12]
[13] The architecture was later modified by J. Weng's method called max-
pooling.[14][8]
In 2015, AlexNet was outperformed by a Microsoft Research Asia project
with over 100 layers, which won the ImageNet 2015 contest.[15]
Network design
AlexNet contains eight layers: the first five are convolutional layers, some of
them followed by max-pooling layers, and the last three are fully connected
layers. The network, except the last layer, is split into two copies, each run on
one GPU.[2] The entire structure can be written as
(CNN → RN → MP)² → (CNN³ → MP) → (FC → DO)² → Linear → softmax
where CNN = convolutional layer (with ReLU activation), RN = local response
normalization, MP = max-pooling, FC = fully connected layer (with ReLU
activation), DO = dropout, and Linear = a fully connected layer without
activation.
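To illustrate this layer stack, here is a minimal sketch in PyTorch of the commonly used single-GPU variant of AlexNet (as popularized by torchvision). It is not the original two-GPU implementation: the cross-GPU split and local response normalization are omitted, and the filter counts follow the reference implementation rather than the exact paper configuration.

import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Minimal single-GPU AlexNet-style network: five convolutional layers
    (some followed by max-pooling) and three fully connected layers."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # final linear layer feeds a softmax loss
        )

    def forward(self, x):
        x = self.features(x)       # five convolutional stages
        x = torch.flatten(x, 1)    # 256 x 6 x 6 feature map -> vector
        return self.classifier(x)  # three fully connected layers

logits = AlexNetSketch()(torch.randn(1, 3, 224, 224))  # expects 224x224 RGB input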
Prewitt Operator:
Strengths: Similar to the Sobel operator, using separate masks for horizontal
and vertical edges; provides good edge localization.
Weaknesses: Prone to noise, similar to Sobel.
Laplacian of Gaussian (LoG):
Strengths: Smooths the image with a Gaussian filter before applying the
Laplacian operator, which reduces noise sensitivity; provides accurate edge
localization.
Weaknesses: Computationally expensive due to the Gaussian smoothing; can
produce thick edges; requires careful parameter tuning.
Learning-Based Methods (deep edge detectors):
Strengths: Learn features directly from data; can capture complex patterns and
variations; highly flexible and adaptable.
Weaknesses: Require large amounts of labeled data for training;
computationally intensive; may be prone to overfitting if not properly
regularized.
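As a concrete illustration of the gradient operators above, the following is a minimal sketch assuming NumPy and SciPy are available; the kernels are the standard 3x3 Sobel and Prewitt horizontal-edge masks, and the vertical mask is obtained by transposition.

import numpy as np
from scipy.ndimage import convolve

# Standard 3x3 horizontal-edge masks; the transpose gives the vertical mask.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

def gradient_magnitude(image: np.ndarray, kx: np.ndarray) -> np.ndarray:
    """Convolve with a horizontal/vertical kernel pair and combine."""
    gx = convolve(image, kx)
    gy = convolve(image, kx.T)
    return np.hypot(gx, gy)  # edge strength at each pixel

image = np.random.rand(64, 64)  # placeholder grayscale image
edges_sobel = gradient_magnitude(image, SOBEL_X)
edges_prewitt = gradient_magnitude(image, PREWITT_X)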
Stereo Vision:
One common approach to object recognition using two views is stereo vision.
Stereo vision involves capturing images of a scene from two or more cameras
placed at different positions or angles. By analyzing the disparities or
differences between corresponding points in the images captured by the
cameras, depth information about the scene can be computed using techniques
such as stereo matching or triangulation. This depth information can then be
used to improve object recognition by providing additional spatial cues and
constraints.
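A minimal sketch of this pipeline, assuming OpenCV, a pair of already-rectified left/right images (the file names and calibration values below are placeholders), and the simple block-matching stereo matcher:

import numpy as np
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left view (placeholder path)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right view (placeholder path)

# Block-matching stereo: disparity is returned in fixed-point (1/16 pixel) units.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Triangulation: depth Z = f * B / d for focal length f (pixels) and baseline B (metres).
f, B = 700.0, 0.12            # illustrative calibration values
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * B / disparity[valid]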
Feature Matching:
Another approach is to extract features from images captured by multiple
cameras and then match these features across views. Features such as keypoints,
edges, or descriptors can be extracted from each image, and then corresponding
features between the views can be identified using techniques like feature
matching or correspondence estimation. By matching features across views, the
system can establish correspondences between different parts of the object and
infer its three-dimensional structure or pose.
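As a sketch of feature matching between two views, the snippet below uses ORB keypoints and brute-force Hamming matching in OpenCV; the image file names are placeholders, and the resulting point pairs could feed triangulation or pose estimation.

import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)          # keypoint detector + binary descriptor
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with a ratio test to keep only distinctive correspondences.
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Each match links a keypoint in view 1 to its correspondence in view 2.
points1 = [kp1[m.queryIdx].pt for m in good]
points2 = [kp2[m.trainIdx].pt for m in good]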
Multi-View Fusion:
Object recognition systems can also fuse information from multiple views to
improve recognition performance. This fusion can involve combining features
extracted from each view using techniques like feature concatenation, pooling,
or aggregation. By leveraging information from different viewpoints, the system
can capture complementary information and achieve better discrimination
between objects, especially in challenging scenarios such as occlusion or
viewpoint variations.
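A small sketch of these fusion strategies, assuming NumPy and leaving the per-view feature extractor abstract (any descriptor or CNN embedding of fixed dimension would do):

import numpy as np

def fuse_views(view_features: list, method: str = "concat") -> np.ndarray:
    """Combine per-view feature vectors into a single descriptor.

    view_features: one 1-D feature vector per camera view.
    """
    stacked = np.stack(view_features)        # shape: (num_views, feature_dim)
    if method == "concat":
        return stacked.reshape(-1)           # concatenation keeps all views
    if method == "mean":
        return stacked.mean(axis=0)          # average pooling across views
    if method == "max":
        return stacked.max(axis=0)           # element-wise max pooling
    raise ValueError(f"unknown fusion method: {method}")

# Two hypothetical 128-dimensional embeddings from different viewpoints.
fused = fuse_views([np.random.rand(128), np.random.rand(128)], method="max")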
Applications:
Object recognition using two views finds applications in various domains,
including robotics, autonomous navigation, surveillance, and augmented reality.
By leveraging information from multiple viewpoints, these systems can achieve
more accurate and robust recognition of objects in complex environments.
Shape and Structure Analysis: Depth values provide information about the
shape and structure of objects in the scene. Depth maps can be used to extract
features such as object boundaries, surface normals, and 3D shapes, which offer
valuable cues for distinguishing between different object categories. Techniques
like voxelization or point cloud processing can further refine the representation
of objects in 3D space.
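For instance, a depth map can be back-projected into a 3D point cloud with the pinhole camera model; the sketch below assumes NumPy and known intrinsics (fx, fy, cx, cy are illustrative values).

import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map (metres) into an N x 3 point cloud using
    the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # drop pixels with no depth reading

cloud = depth_to_point_cloud(np.random.rand(480, 640),
                             fx=525.0, fy=525.0, cx=320.0, cy=240.0)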
Time-of-Flight (ToF) Cameras: ToF cameras emit infrared light pulses and
measure the time it takes for the light to bounce back from objects in the scene.
This information is used to calculate the distance to each point in the scene,
providing depth information. ToF cameras are often integrated into devices like
smartphones, tablets, and gaming consoles.
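The underlying relation is simply distance = c * t / 2, since the pulse travels to the surface and back; a tiny illustrative calculation:

C = 299_792_458.0                     # speed of light in m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the surface: the pulse travels out and back, so halve the path."""
    return C * round_trip_seconds / 2.0

print(tof_distance(10e-9))            # a 10 ns round trip is roughly 1.5 m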
Structured Light: Structured light systems project a known pattern onto the
scene and analyze the deformation of the pattern to infer depth information. By
analyzing how the pattern deforms on objects in the scene, structured light
systems can calculate their distance from the camera. Microsoft's Kinect sensor,
for example, used structured light technology for depth sensing.
Stereo Vision: Stereo vision systems use two or more cameras to capture images
of the scene from different viewpoints. By comparing the images captured by
each camera and analyzing the disparities between corresponding points, stereo
vision systems can triangulate the distance to objects in the scene. This
approach mimics human depth perception using binocular vision.
Lidar (Light Detection and Ranging): Lidar systems emit laser pulses and
measure the time it takes for the pulses to reflect off objects in the scene. By
scanning the laser across the scene, lidar systems can generate detailed 3D point
clouds of the environment. Lidar is commonly used in autonomous vehicles,
robotics, and aerial mapping applications.
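As a sketch of how a point cloud is assembled from such measurements, the snippet below converts a single planar lidar sweep (one range per beam angle) into Cartesian points; the ranges and angles are placeholders that a real sensor driver would supply.

import numpy as np

def scan_to_points(ranges: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Convert one planar lidar sweep (range per beam angle, in radians)
    into an N x 2 array of Cartesian points in the sensor frame."""
    x = ranges * np.cos(angles)
    y = ranges * np.sin(angles)
    return np.stack([x, y], axis=-1)

angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)  # one beam per degree
ranges = np.full(360, 5.0)                                # placeholder: 5 m everywhere
points = scan_to_points(ranges, angles)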
Depth from Defocus (DfD): DfD is a technique that infers depth information
from the degree of defocus in images captured by a camera with an adjustable
aperture. By analyzing the blur in the images, DfD systems can estimate the
distance to objects in the scene. This approach is less common than others but
has potential applications in consumer cameras and robotics.
Computational Techniques:
Feature Extraction: Extracting discriminative features from images captured
from multiple views. This could involve techniques such as SIFT (Scale-
Invariant Feature Transform), SURF (Speeded Up Robust Features), or deep
learning-based feature extraction methods.
Stereo Matching: In the case of stereo vision, stereo matching algorithms are
used to find correspondences between points in the left and right views to
compute depth information.
Machine Learning and Deep Learning: Utilizing machine learning and deep
learning algorithms to learn discriminative representations of objects from
multiple views, enabling accurate classification or detection.
Pose Estimation: Estimating the pose or orientation of objects in the scene based
on information from multiple views. This could involve estimating the
transformation between views or directly predicting the pose of objects.
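A minimal sketch of pose estimation from one view, assuming OpenCV's Perspective-n-Point solver, a set of known 3D model points with their observed 2D projections, and illustrative camera intrinsics (all correspondences and values below are hypothetical):

import numpy as np
import cv2

# Hypothetical correspondences: 3-D points on the object model (object frame, metres)
# and their observed 2-D projections in one view (pixels).
object_points = np.array([[0, 0, 0], [0.1, 0, 0], [0.1, 0.1, 0], [0, 0.1, 0],
                          [0.05, 0.05, 0.05], [0, 0, 0.1]], dtype=np.float64)
image_points = np.array([[320, 240], [400, 238], [405, 320], [318, 322],
                         [362, 270], [321, 180]], dtype=np.float64)

# Pinhole intrinsics (illustrative values); zero lens distortion assumed.
K = np.array([[700.0, 0, 320.0], [0, 700.0, 240.0], [0, 0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation of the object with respect to the camera
    print("rotation:\n", R, "\ntranslation:\n", tvec.ravel())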