Object Detection With Deep Learning: A Review Summary

Deep learning-based object detection is an effective and fast way to
recognize objects and predict their exact location in an image. It
includes subtasks such as face detection, pedestrian detection, and
skeleton detection. Object detection provides the information needed for
semantic understanding of images and videos, and it is related to
applications such as image classification, face recognition, human
behavior analysis, and autonomous driving.

Object detection builds on neural networks and related learning systems.
Progress in this field will therefore advance neural network algorithms
and will also strongly influence object detection techniques, which can
themselves be regarded as learning systems.

However, object detection, with its extra object localization task, is
difficult because of large variations in viewpoint, pose, occlusion, and
lighting conditions. Although the field has received much attention in
recent years, both localization and classification remain challenging.
The traditional object detection pipeline can be outlined in three
phases: informative region selection, feature extraction, and
classification.

A. Informative Region Selection

Because objects may appear at any position in an image and with different
sizes and aspect ratios, the whole image is scanned with a multi-scale
sliding window to find the most informative regions.

Although this exhaustive strategy can find all possible object locations,
it is computationally expensive because it generates many candidate
windows, most of them redundant. On the other hand, limiting the number
of sliding window templates may produce unsatisfactory regions.
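
The cost of the sliding window strategy can be seen in a minimal sketch.
The function name, base size, scales, aspect ratios, and stride below are
illustrative assumptions, not values taken from the review.

```python
def sliding_windows(img_w, img_h, base=64, scales=(1.0, 2.0),
                    ratios=(0.5, 1.0, 2.0), stride=32):
    """Yield (x, y, w, h) candidate windows; ratio is width / height.

    All parameter values are hypothetical and chosen only for illustration.
    """
    for scale in scales:
        for ratio in ratios:
            # Keep the window area near (base * scale)^2 while changing its shape.
            w = int(round(base * scale * ratio ** 0.5))
            h = int(round(base * scale / ratio ** 0.5))
            if w > img_w or h > img_h:
                continue  # this template does not fit inside the image
            # Slide the template over every stride-aligned position.
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, w, h)


windows = list(sliding_windows(256, 256))
print(len(windows), "candidate windows for a 256x256 image")
```

Even with only six templates and a coarse stride, a small image produces
many overlapping windows, and each one must be passed to the feature
extractor and classifier.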

B. Feature Extraction

Feature extraction means extracting visual features of the object to
obtain a semantic and robust representation. Representative features are
the scale-invariant feature transform (SIFT), histograms of oriented
gradients (HOG), and Haar-like features. These features can produce
representations that resemble those of cells in the human brain. However,
manually designing a robust feature descriptor for all kinds of objects
is difficult because of the diversity of appearances, backgrounds, and
illumination conditions.

C. Classification

A classifier is needed to distinguish the target object from all other
categories and to make the representation more informative, hierarchical,
and semantic. The support vector machine (SVM), AdaBoost, and the
deformable part-based model (DPM) are commonly used object classifiers.
Among these classifiers, DPM is the most flexible one to apply.

Building on these local feature descriptors and shallow learnable
architectures, state-of-the-art results were obtained on PASCAL visual
object classes (VOC), along with real-time embedded systems that place
little burden on hardware. Some success was obtained in the field during
2010-2012, but these approaches have notable limitations, such as
problems with the candidate bounding boxes (BBs) and a failure to bridge
the semantic gap. Deep neural networks (DNNs) obtained significant gains
with the introduction of regions with convolutional neural network (CNN)
features (R-CNN). DNNs, and CNNs in particular, behave quite differently
from the traditional approaches.

This summary covers the brief history of deep learning, the basic CNN
structure, generic object detection architectures, and reviews of CNN
applications, including salient object detection, face detection, and
pedestrian detection. It closes with some future directions and
concluding remarks.

II. Brief Overview of Deep Learning

Before the deeper discussion of deep learning-based object detection
approaches, this section presents the history of deep learning and
introduces the architecture and advantages of CNN.

A. History: Birth, Decline, and Prosperity

The history of neural networks started in the 1940s, and they became
popular in the 1980s and 1990s with the backpropagation algorithm of
Rumelhart et al. The original intention behind neural networks was to
simulate the human brain in order to solve general learning problems.
Neural networks lost their popularity in the early 2000s because of a
lack of training data and other limitations. Deep learning has regained
popularity since 2006, following a breakthrough in speech recognition.
The revival of deep learning can be attributed to the following factors:

1) The emergence of large-scale annotated training data, such as
ImageNet, which lets deep models exhibit their large learning
capacity.
2) The rapid progress of high-performance parallel computing
systems, such as GPU clusters.
3) Advances in the design of network structures and training
strategies. Data augmentation and dropout reduce overfitting,
and batch normalization (BN) makes DNN training efficient.
Network structures such as AlexNet, OverFeat, GoogLeNet,
Visual Geometry Group (VGG), and Residual Net (ResNet) have
been studied extensively to improve performance.

B. Architecture and Advantages of CNN

A typical CNN architecture is VGG16. Each layer of a CNN is known as a
feature map. The feature map of the input layer is a 3-D matrix of pixel
intensities for the different color channels (e.g., RGB).
Transformations such as filtering and pooling can be applied to feature
maps. Filtering convolves a filter matrix with the values in a receptive
field.

Pooling, such as max pooling, average pooling, L2 pooling, and local
contrast normalization, summarizes the responses of a receptive field to
create a more robust feature description. VGG16 has 13 convolutional
layers, 3 fully connected (FC) layers, 5 max-pooling layers, and a
softmax classification layer.
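
The two transformations named above, filtering and pooling, can be
sketched on a single-channel feature map. This is a minimal illustration
in plain Python. The 2x2 filter and the input values are hypothetical,
and a real CNN learns its filters and works on many channels at once.

```python
def conv2d(feature_map, kernel):
    """Valid 2-D filtering (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [
        [
            sum(feature_map[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size receptive fields."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
diag_filter = [[1, 0], [0, -1]]  # hypothetical filter; real filters are learned

response = conv2d(image, diag_filter)  # 4x4 feature map
pooled = max_pool2d(response, size=2)  # 2x2 summary: [[0, 1], [0, 2]]
print(pooled)
```

Pooling keeps only the strongest response in each field, which is why the
resulting description is less sensitive to small shifts in the input.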

The advantages of CNN over traditional methods are:

1) Hierarchical feature representation.
2) A deeper architecture with exponentially increased expressive
capacity.
3) The opportunity to optimize several related tasks together, such
as the joint classification and bounding box regression in Fast
R-CNN.
4) A larger learning capacity and the ability to treat problems as
high-dimensional data transforms.

Because of these advantages, CNN is widely used in research fields such
as image super-resolution reconstruction, image classification, image
retrieval, face recognition, pedestrian detection, and video analysis.

III. Generic Object Detection:


Generic object detection locates and classifies the objects in an image
and labels them with rectangular BBs, each with a confidence of
existence. Generic object detection frameworks fall into two groups. One
follows the conventional object detection pipeline (e.g., R-CNN, SPP-net,
and Fast R-CNN), and the other treats object detection as a regression or
classification problem (e.g., MultiBox, AttentionNet, G-CNN, YOLO, the
single shot multibox detector (SSD), and YOLOv2). The two pipelines are
bridged by the anchors introduced in Faster R-CNN. Both groups are
discussed at length in the review, so only the key points are given
below.

A. Region proposal-Based Framework

It is a two-step process that resembles the attentional mechanism of the
human brain: it first performs a coarse scan of the whole scene and then
focuses on the regions of interest (RoIs).

1) R-CNN: It improves the quality of the candidate BBs and uses a deep
architecture to extract high-level features. R-CNN was proposed by
Girshick et al. and achieved a mean average precision (mAP) of 53.3% on
PASCAL VOC, an improvement of more than 30% over the previous best
result. The R-CNN pipeline has three stages:

a) Region Proposal Generation
b) CNN-Based Deep Feature Extraction
c) Classification and Localization

Despite its significant improvement over traditional methods, R-CNN has
some disadvantages too.

1) The CNN requires a fixed-size input image (e.g., 227×227).
2) R-CNN training is a multistage pipeline.
3) Training is expensive in space and time.
4) The procedure is redundant and time-consuming.

To solve these problems, many methods have been proposed, such as
geodesic object proposals and multi-scale combinatorial grouping. In
addition, Zhang et al. introduced a Bayesian optimization-based search
algorithm to address inaccurate localization. Ouyang et al. proposed a
deformable deep CNN (DeepID-Net), which introduces a novel
deformation-constrained pooling (def-pooling) layer to impose a geometric
penalty. Proposal quality can also be improved by biasing the sampling to
match the statistics of the ground-truth BBs with K-means clustering.

2) SPP-Net:

FC layers must take a fixed-size input. That is why R-CNN warps or crops
each region proposal to the same size.

Cropping may leave only part of the object in the region, and warping may
cause unwanted geometric distortion. To solve these problems, He et al.
took the theory of spatial pyramid matching (SPM) into consideration and
proposed a novel CNN architecture named SPP-net. SPM partitions the image
into divisions at several scales, from finer to coarser, and aggregates
quantized local features into mid-level representations. Unlike R-CNN,
SPP-net reuses the feature maps of the fifth conv layer (conv5) to
project region proposals of arbitrary sizes to fixed-length feature
vectors. If conv5 has 256 feature maps and a three-level pyramid is used,
the output of the SPP layer has dimension 256 × (1² + 2² + 4²) = 5376.

SPP-net obtains better results because it estimates region proposals
correctly at their corresponding scales, and it also improves detection
efficiency.
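
A minimal sketch of spatial pyramid pooling shows why the output length
does not depend on the region size. It assumes max pooling over 1×1, 2×2,
and 4×4 grids of bins for each of 256 feature maps. The bin boundary
rule, the helper names, and the dummy feature values are illustrative
assumptions rather than details from the review.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def spp_vector(feature_maps, levels=(1, 2, 4)):
    """Max-pool every channel into n x n bins per level and concatenate."""
    vector = []
    for channel in feature_maps:
        h, w = len(channel), len(channel[0])
        for n in levels:
            for i in range(n):
                # Bin edges scale with the input size, so the bin count is fixed.
                r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
                for j in range(n):
                    c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                    vector.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return vector


def dummy_conv5(h, w, channels=256):
    """Hypothetical conv5 output for a region: channels x h x w dummy values."""
    return [[[(ch + r * w + c) % 7 for c in range(w)] for r in range(h)]
            for ch in range(channels)]


len_a = len(spp_vector(dummy_conv5(13, 13)))
len_b = len(spp_vector(dummy_conv5(6, 9)))
print(len_a, len_b)  # both are 256 * (1 + 4 + 16) = 5376
```

Regions of 13×13 and 6×9 cells give vectors of the same length, so they
can be passed to the same FC layers without warping or cropping.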

3) Fast R-CNN:

To reduce the accuracy drop that SPP-net suffers with very deep networks,
Girshick introduced a multitask loss on classification and bounding box
regression and proposed a novel CNN architecture named Fast R-CNN. As in
SPP-net, the whole image is processed with conv layers to produce feature
maps. The RoI pooling layer is a special case of the SPP layer with only
one pyramid level.
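
The multitask loss can be sketched for a single RoI. The summary does not
give the formula, so the form below is an assumption that follows the
usual Fast R-CNN formulation: a log loss on the true class plus a
smooth L1 loss on the box offsets, which is applied only to
non-background RoIs. The function names and the weight `lam` are
illustrative.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so large errors are damped."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus box regression loss for one RoI.

    Class 0 is assumed to be background, which has no box target.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


# Hypothetical RoI: two classes (background, object) and four box offsets.
loss = multitask_loss([0.1, 0.9], 1, [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0])
print(round(loss, 4))
```

Because both terms are summed into one objective, the classifier and the
box regressor are trained together in a single stage.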

4) Faster R-CNN: Faster R-CNN adopts anchors of three scales and three
aspect ratios in its region proposal network (RPN). With Faster R-CNN,
region proposal-based CNN architectures for object detection can be
trained end to end. However, the alternating training algorithm is very
time-consuming, and the RPN produces object-like regions (including
backgrounds) instead of object instances, so it does not deal well with
objects of extreme scales or shapes.
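
The anchor scheme of three scales and three aspect ratios can be sketched
as follows. The scale and ratio values, the ratio convention, and the
function name are assumptions chosen for illustration.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy).

    ratio is taken as height / width, and every anchor of a given scale
    keeps an area of scale * scale. Both choices are assumptions.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 3 scales x 3 ratios = 9 anchors per location
```

The same set of anchors is placed at every position of the conv feature
map, and each anchor is then scored and refined by the network.
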
5) R-FCN: Recent state-of-the-art image classification networks, such as
ResNets and GoogLeNets, are fully convolutional. With R-FCN, more
powerful classification networks can be adopted to accomplish object
detection in a fully convolutional architecture by sharing nearly all the
layers. State-of-the-art results are obtained on both the PASCAL VOC and
Microsoft COCO data sets at a test speed of 170 ms per image.

6) FPN: Feature pyramids built upon image pyramids (featurized image
pyramids) have been widely applied in many object detection systems to
improve scale invariance. The feature pyramid network (FPN) has an
architecture with a bottom-up (BU) pathway, a top-down (TD) pathway, and
several lateral connections that combine low-resolution, semantically
strong features with high-resolution, semantically weak features.
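
One top-down merge step can be sketched on single-channel maps. This is a
simplified illustration: it assumes nearest-neighbour upsampling by a
factor of two and element-wise addition, and it omits the convolutions
that a real FPN applies on the lateral connection and after the merge.
The function names and values are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of two in both directions."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse, fine_lateral):
    """Add the upsampled coarse (semantically strong) map to the finer map."""
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine_lateral)]


coarse = [[10, 20]]                    # 1x2 map from a deeper layer
fine = [[1, 2, 3, 4], [5, 6, 7, 8]]    # 2x4 map from a shallower layer
merged = merge_top_down(coarse, fine)
print(merged)  # [[11, 12, 23, 24], [15, 16, 27, 28]]
```

Repeating this step from the coarsest level downward gives every
resolution access to the semantics of the deeper layers.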

7) Mask R-CNN: To solve the instance segmentation problem, Mask R-CNN
adds a branch that predicts segmentation masks in a pixel-to-pixel
manner. This branch runs in parallel with the existing Faster R-CNN
branches for classification and bounding box regression. Mask R-CNN is
simple to implement and gives good instance segmentation and object
detection results.

8) Multitask Learning, Multiscale Representation, and Contextual
Modeling:

To improve detection further, object detection can be combined with
multitask learning, multiscale representation, and contextual modeling so
that complementary information from multiple sources is used. Multitask
learning learns a useful representation for multiple correlated tasks
from the same input.

9) Thinking in Deep Learning-Based Object Detection:

Although many deep learning methods have been proposed, several factors
still leave room for improvement. In particular, there is a large
imbalance between the number of annotated objects and the number of
background examples.

B. Regression/Classification-Based Framework:

One-step frameworks based on global regression/classification map
directly from image pixels to bounding box coordinates and class
probabilities, which reduces the time expense. The review discusses these
frameworks in three groups: 1) pioneer works, 2) YOLO, and 3) SSD.

C. Experimental Evaluation:

The experimental evaluation compares the prominent architectures in terms
of their proposal method, learning method, loss function, programming
language, and platform.

IV. SALIENT OBJECT DETECTION

Visual saliency detection is one of the most critical and challenging
tasks in computer vision. It aims to highlight the most dominant object
regions in an image. Numerous applications incorporate visual saliency to
improve their performance, such as image cropping and segmentation, image
retrieval, and object detection.

There are two branches of approaches in salient object detection, namely,
bottom-up (BU) and top-down (TD). TD saliency can be viewed as a
focus-of-attention mechanism, which prunes BU salient points that are
unlikely to be parts of the object. The review covers this topic in two
parts:

A. Deep Learning in Salient Object Detection


B. Experimental Evaluation

V. FACE DETECTION:

Face detection and pedestrian detection are closely related to generic
object detection and are mainly accomplished with multi-scale adaptation
and multi-feature boosting forests, respectively. Pedestrian and face
images have a regular structure, whereas general object and scene images
have complex geometric structures and layouts.

Face detection is an important preprocessing step for face recognition,
face synthesis, and facial expression analysis. It must recognize face
regions across a very large range of scales (30-3000 pts versus 10-1000
pts). The most famous face detector, proposed by Viola and Jones, trains
cascaded classifiers with Haar-like features and AdaBoost and achieves
good performance with real-time efficiency.

However, this detector may degrade significantly in real-world
applications because of the larger visual variations of human faces.
Different from this cascade structure, Felzenszwalb et al. proposed a
deformable part model (DPM) for face detection. But these traditional
face detection methods have high computational expenses and require large
quantities of annotations. In addition, their performance is greatly
limited by manually designed features and shallow architectures.

A. Deep Learning in Face Detection

Recently, several CNN-based face detection approaches have been proposed.
For example, Yang et al. proposed a novel deep learning-based face
detection framework that collects the responses from local facial parts
(e.g., eyes, nose, and mouth) to address face detection under severe
occlusions and unconstrained pose variations.

Some authors trained CNNs together with other complementary tasks, such
as 3-D modeling and face landmarks, in a multitask learning manner.

B. Experimental Evaluation

The FDDB data set has 2845 pictures in which 5171 faces are annotated
with an elliptical shape. Here, two types of evaluations are used: the
discrete score and the continuous score.

VI. PEDESTRIAN DETECTION

In recent years, pedestrian detection has been studied extensively. It is
closely related to applications such as pedestrian tracking, person
re-identification, and robot navigation. Before the recent progress in
deep CNN (DCNN)-based techniques, some researchers combined boosted
decision forests with hand-crafted features to obtain pedestrian
detectors.

A. Deep Learning in Pedestrian Detection

Although DCNNs have outstanding performance on generic object detection,
for a long time none of these strategies achieved better results than the
best hand-crafted feature-based method, even when part-based information
and occlusion handling were incorporated. Zhang et al. attempted to adapt
the generic Faster R-CNN to pedestrian detection. Other researchers also
endeavored to combine complementary information from multiple data
sources.

Based on Faster R-CNN, Liu et al. proposed multispectral DNNs for
pedestrian detection to combine complementary information from color and
thermal images.

B. Experimental Evaluation

The evaluation is carried out on the widely used Caltech Pedestrian
dataset, which was compiled from videos of a vehicle driving through an
urban environment. It consists of 250000 frames with about 2300 unique
pedestrians and 350000 annotated BBs. Performance is measured with the
log-average miss rate (L-AMR).
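
The L-AMR can be sketched as follows. The summary does not define the
metric, so this sketch assumes the common Caltech convention: the miss
rate is sampled at nine false-positives-per-image (FPPI) values evenly
spaced in log space between 0.01 and 1, and the geometric mean of the
samples is reported. The lookup rule and the curve values are
illustrative assumptions.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi must be sorted in increasing order, with miss_rate aligned to it.
    """
    refs = [10 ** (-2 + 2 * k / (num_points - 1)) for k in range(num_points)]
    logs = []
    for ref in refs:
        # Take the curve point with the largest FPPI not exceeding the
        # reference; fall back to the first point if the curve starts later.
        eligible = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        mr = eligible[-1] if eligible else miss_rate[0]
        logs.append(math.log(max(mr, 1e-10)))  # guard against log(0)
    return math.exp(sum(logs) / len(logs))


# Hypothetical two-point curve: the miss rate drops from 0.8 to 0.2.
lamr = log_average_miss_rate([0.001, 0.5], [0.8, 0.2])
print(round(lamr, 4))
```

A lower value is better, and averaging in log space keeps the low-FPPI
part of the curve from being swamped by the high-FPPI part.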

VII. PROMISING FUTURE DIRECTIONS AND TASKS

Although object detection has developed rapidly, many directions remain
open, especially small object detection on the COCO data set and in face
detection. To improve the localization accuracy for small objects, the
following aspects need to be researched and developed:

1) Multitask Joint Optimization and Multimodal Information Fusion
2) Scale Adaptation
3) Spatial Correlations and Contextual Modeling

The second direction is to reduce the burden of manual labor and to
achieve real-time object detection. The following measures can be taken
in this regard:

1) Cascade Network
2) Unsupervised and Weakly Supervised Learning
3) Network Optimization

The third direction is to extend 2-D object detection to 3-D object
detection and video object detection.

Video Object Detection:

Detection accuracy in videos suffers from degraded object appearances
(e.g., motion blur and video defocus), and the network is usually not
trained end to end. Dedicated video object detection methods are
therefore needed.

VIII. CONCLUSION

Deep learning-based object detection has become a research hotspot in
recent years because of its powerful learning ability and its advantages
in dealing with occlusion, scale transformation, and background switches.
This article summarizes a review of deep learning-based object detection
frameworks that handle different subproblems, such as occlusion, clutter,
and low resolution, with different degrees of modification to R-CNN.

It also gives a brief description of the developments, analysis, and
research directions for neural networks and the associated learning
systems.
