A Comprehensive Survey of The R-CNN Family For Object Detection
O. HMIDANI, E. M. ISMAILI ALAOUI
Computer Networks and Systems Laboratory
Faculty of Science, Moulay Ismail University
Meknes, Morocco
[email protected], [email protected]
Abstract-Object detection using deep learning, one of the most challenging problems in computer vision, seeks to locate instances of objects from a large number of predefined categories in natural images. Given this period of rapid evolution, the main contribution of this paper is to provide a comprehensive survey of the region-based convolutional neural network (R-CNN) family (R-CNN, Fast R-CNN, and Faster R-CNN). In comparison to R-CNN and Fast R-CNN, simulation results show that Faster R-CNN improves not only accuracy but also detection speed. For robust object detection, Faster R-CNN has been found to be particularly well suited. We conclude with several open issues and challenges that are key to the design of future work.
Index Terms-Object detection, Deep learning techniques, R-CNN, Fast R-CNN, Faster R-CNN
978-1-6654-5054-6/22/$31.00 ©2022
I. INTRODUCTION
Recently, deep learning techniques have emerged as powerful methods for learning feature representations automatically from data. These techniques have provided major improvements in object detection. Deep learning can be considered a part of the broader family of machine learning, which allows computer systems to transform simpler concepts into more abstract and complex concepts [1]. Deep learning models, known as deep neural networks, deploy multiple hidden layers to exploit the unknown structure in the input distribution to discover composite representations. The main goal of this paper is to offer a comprehensive survey of deep learning R-CNN techniques and to present a comparison between those techniques in terms of test time per image, speed, and mean average precision (mAP). Our categorization helps readers gain an accessible understanding of the differences among the algorithms belonging to the R-CNN family, namely R-CNN, Fast R-CNN, and Faster R-CNN. In this paper, we cover the intuition behind some of the main techniques used in object detection and see how they have evolved from one implementation to the next. This article attempts to track recent advances and summarize their achievements in order to gain a clearer picture of the algorithms belonging to the R-CNN family in object detection. This survey gives researchers a starting point to understand current research and identify open challenges for future research. The remainder of this paper is organized as follows: In the next section, we present a brief overview of different R-CNN algorithms for object detection. We first briefly describe the R-CNN approach. Next, we describe the Fast R-CNN approach. This is followed by the details of the Faster R-CNN approach. Section 3 presents a set of experiments to evaluate and compare the performances and challenges of the R-CNN family. Finally, in Section 4, we conclude the paper with an overall discussion of object detection and future research directions.
II. OVERVIEW OF DIFFERENT R-CNN ALGORITHMS FOR OBJECT DETECTION
A. R-CNN approach
The objective of R-CNN is to take an image and identify where the main objects in the picture are located. This technique was first introduced in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation" [1].
• Input: image
• Output: bounding boxes + labels for each object in the image.
Rather than working on a massive number of regions, the R-CNN algorithm proposes a set of boxes within the image and checks whether any of these boxes contains an object [1]. R-CNN uses selective search to extract the boxes from an image (these boxes are called regions) [1].
The general process of R-CNN is shown in figure 1.
[Figure: 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features on each warped region; 4. Classify regions (e.g., person? yes.)]
Fig. 1. Regions with CNN features. [1]
Let us begin by understanding what selective search is and how it identifies the different regions. There are basically four cues that characterize an object: varying scales, colors, textures, and enclosure [2]. Selective search recognizes these patterns within the image and, based on them, proposes different regions [2].
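The four steps of Fig. 1 can be illustrated with a minimal sketch. It assumes the region proposals are already given (selective search itself is not implemented), and the feature extractor and classifier are hypothetical stand-ins (`mean_intensity`, `threshold_classifier`) used only to make the control flow runnable; they are not the CNN or SVM of [1].

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), x1 and y1 exclusive
Image = List[List[float]]         # grayscale image as rows of pixel values


def warp_region(image: Image, box: Box, size: int) -> Image:
    """Crop `box` and resize it to size x size by nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    return [
        [image[y0 + (r * h) // size][x0 + (c * w) // size] for c in range(size)]
        for r in range(size)
    ]


def rcnn_detect(
    image: Image,
    proposals: List[Box],
    extract_features: Callable[[Image], float],
    classify: Callable[[float], Optional[str]],
    size: int = 4,
) -> List[Tuple[Box, str]]:
    """Run the per-region loop: warp, extract features, classify."""
    detections = []
    for box in proposals:                        # step 2: proposals are given
        warped = warp_region(image, box, size)   # fixed-size input per region
        features = extract_features(warped)      # step 3: one pass PER region
        label = classify(features)               # step 4
        if label is not None:
            detections.append((box, label))
    return detections


# Hypothetical stand-ins for the CNN and the classifier (assumptions).
def mean_intensity(patch: Image) -> float:
    return sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))


def threshold_classifier(feature: float) -> Optional[str]:
    return "object" if feature > 0.5 else None


# Toy 6x6 image with a bright 2x2 square at rows 1-2, columns 1-2.
image = [[0.0] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        image[y][x] = 1.0

proposals = [(1, 1, 3, 3), (3, 3, 6, 6), (0, 0, 6, 6)]
detections = rcnn_detect(image, proposals, mean_intensity, threshold_classifier)
print(detections)
```

Note that the feature extractor is called once per proposal; with about 2000 proposals per image this per-region loop is what dominates the running time.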
Authorized licensed use limited to: FhI fur Integrierte Schaltungen Angewandte Elek. Downloaded on March 27,2023 at 17:12:55 UTC from IEEE Xplore. Restrictions apply.
[Figure: input image with regions of interest (RoI) from a proposal method]
• Extracting features with a CNN for each image region. Assuming we have N images, the number of CNN feature computations will be N × 2000 [1].
• The whole process of object detection using R-CNN involves three models:
- CNN for the feature extraction.
- Linear SVM classifier for recognizing objects.
- Regression model for refining the bounding boxes.
Together, these steps make R-CNN very slow. It takes around 40-50 seconds to make predictions for each new image, which makes the model cumbersome and practically infeasible to use when faced with a very large dataset.
As a result, Fast R-CNN was created in order to overcome R-CNN's bottleneck.
B. Fast R-CNN approach
In 2015, a year after the R-CNN paper [1], its main author, Ross Girshick, proposed a better model for object detection. In his paper, Fast R-CNN [3], rather than running a CNN 2000 times for every image, the CNN is run just once per image to obtain all of the regions of interest.
Ross Girshick's idea was to run the CNN just once per image and then share that computation across the 2000 regions [3]. In Fast R-CNN, we feed the input image to the CNN, which in turn produces the convolutional feature maps. The regions of interest are then extracted from these maps. We then use an RoI pooling layer to reshape all the proposed regions into a fixed size so that they can be fed into a fully connected network [3].
[Figure: for each RoI, an RoI feature vector passes through fully connected layers (FCs) to two outputs: softmax and bbox regressor]
The following steps summarize the concept [3]:
• We take an image as an input.
• This image is passed to a ConvNet, which in turn produces the regions of interest.
• An RoI pooling layer is applied to these regions to reshape them to a fixed size. Each region is then passed to a fully connected network.
• A softmax layer is used on top of the fully connected network to output classes. In addition to the softmax layer, a linear regression layer is used to output bounding box coordinates for the predicted classes.
Thus, rather than using three different models (as in R-CNN), Fast R-CNN uses a single model that extracts features from the regions, divides them into various classes, and returns the bounding boxes for the identified classes at the same time [3].
We will visualize each step:
• We take an image as input:
Fig. 9. Fast R-CNN explanation. [3]
• This image is passed to a ConvNet, which returns the regions of interest:
Fig. 10. Fast R-CNN explanation. [3]
• The RoI pooling layer is then applied to the extracted regions of interest to ensure that they are all the same size:
Fig. 11. Fast R-CNN explanation. [3]
• Finally, these regions are passed to a fully connected network that classifies them and returns the bounding boxes using softmax and linear regression layers (Fig. 12).
According to the author, the Fast R-CNN model is nine times faster than the R-CNN model. This is how Fast R-CNN resolves two significant issues with R-CNN. The first is passing one image rather than 2000 regions per image to the ConvNet, and the second is using one model rather than three different models for extracting features, classifying, and generating bounding boxes [3].
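The RoI pooling step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of [3]: the feature map is a single 2-D channel, RoIs are given directly in feature-map coordinates with exclusive end indices, and the function name `roi_max_pool` is hypothetical. Each RoI is divided into a fixed grid of bins, and the maximum of each bin is kept, so RoIs of different sizes yield outputs of the same shape.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
RoI = Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the feature map, ends exclusive


def roi_max_pool(feature_map: FeatureMap, roi: RoI, out_h: int, out_w: int) -> FeatureMap:
    """Max-pool the RoI into an out_h x out_w grid, whatever the RoI size."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [r0, r1): floor for the start, ceil for the end,
        # so every bin contains at least one row.
        r0 = y0 + (i * h) // out_h
        r1 = y0 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = x0 + (j * w) // out_w
            c1 = x0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# One shared 4x6 feature map, computed once per image.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two RoIs of different sizes are pooled to the same 2x2 shape.
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)
pooled_b = roi_max_pool(feature_map, (1, 1, 6, 4), 2, 2)
print(pooled_a)  # [[8, 10], [20, 22]]
print(pooled_b)  # [[16, 18], [22, 24]]
```

Because both RoIs read from the same feature map, the convolutional computation is done once per image and only the cheap pooling step is repeated per region.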
Fig. 12. Fast R-CNN explanation. [3]
1) Problems with Fast R-CNN: Yet even Fast R-CNN has certain pain points. It also employs selective search to locate the regions of interest, which is a slow and time-consuming process. It takes around two seconds per image to detect objects, which is a vast improvement over R-CNN. However, when we consider large real-world datasets, even Fast R-CNN no longer looks so fast compared
Fig. 13. Faster R-CNN flow diagram. [4]
[Figure: 256-d layer feeding a cls layer (2k scores) and a reg layer (4k coordinates), with k anchor boxes]
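The figure labels above (k anchor boxes, 2k scores from the cls layer, 4k coordinates from the reg layer) can be made concrete with a small sketch. The scales and aspect ratios below are assumptions chosen for illustration (three of each, giving k = 9), and the helper names `make_anchors` and `rpn_output_sizes` are hypothetical; the sketch only shows how the output sizes follow from k.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def make_anchors(cx: float, cy: float,
                 scales: Sequence[float], ratios: Sequence[float]) -> List[Box]:
    """Build one anchor per (scale, ratio) pair, centred on (cx, cy).

    `ratio` is height / width; each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def rpn_output_sizes(k: int) -> Dict[str, int]:
    """Per sliding-window position: 2 scores and 4 coordinates per anchor."""
    return {"cls_scores": 2 * k, "reg_coords": 4 * k}


# Assumed settings for illustration: 3 scales x 3 aspect ratios.
scales = (128.0, 256.0, 512.0)
ratios = (0.5, 1.0, 2.0)

anchors = make_anchors(300.0, 200.0, scales, ratios)
k = len(anchors)
sizes = rpn_output_sizes(k)
print(k, sizes)
```

With these assumed settings, each position yields 9 anchors, so the cls layer emits 18 scores (object or not, per anchor) and the reg layer emits 36 box coordinates.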
III. SIMULATION RESULTS AND DISCUSSION
THE TEST TIME PER IMAGE OF R-CNN, FAST R-CNN AND FASTER
R-CNN
Table 1 shows the test time per image; the results further confirm that Faster R-CNN improves the detection speed. It can be seen from this table that Faster R-CNN is 11 times faster per image than Fast R-CNN and 245 times faster than R-CNN. This is mainly because Faster R-CNN replaces the selective search method with a region proposal network (RPN), which makes the algorithm much faster.
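As a rough consistency check, the two ratios quoted above (11 and 245) can be combined with the 40-50 seconds per image reported for R-CNN in Section II. The sketch below derives the per-image times these figures would imply; the results are derived estimates only, not the measured values of Table 1, and the function name `implied_times` is hypothetical.

```python
from typing import Tuple


def implied_times(rcnn_seconds: float,
                  rcnn_over_faster: float = 245.0,
                  fast_over_faster: float = 11.0) -> Tuple[float, float]:
    """Return (Faster R-CNN, Fast R-CNN) seconds per image implied by the ratios."""
    faster = rcnn_seconds / rcnn_over_faster
    fast = faster * fast_over_faster
    return faster, fast


# R-CNN range taken from the text of Section II (40-50 s per image).
for rcnn in (40.0, 50.0):
    faster, fast = implied_times(rcnn)
    print(f"R-CNN {rcnn:.0f} s -> Fast R-CNN {fast:.2f} s, Faster R-CNN {faster:.2f} s")
```

Under these assumptions the implied Fast R-CNN time falls between roughly 1.8 and 2.2 seconds, in line with the "around two seconds" stated in Section II, and the implied Faster R-CNN time is about 0.2 seconds per image.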
Faster R-CNN improves not only the detection speed, but also
the accuracy. This comparison is illustrated in Table 2.
IV. CONCLUSION
Object detection is an important and challenging problem in computer vision and has received considerable attention. This article served as a comprehensive survey on deep learning for object detection, highlighting recent achievements and providing a comparison of a variety of R-CNN algorithms used for any object detection problem. The experimental results demonstrate that the accuracy and detection speed of Faster R-CNN are greatly improved compared to R-CNN and Fast R-CNN because it uses RPN to replace the time-consuming selective search. Object detection algorithms need to not only accurately classify and localize important objects, but they also need to be incredibly fast at prediction time to meet the real-time demands of video processing. The major complications of object detection are speed and accuracy, and to improve detection accuracy and inference time using deep learning, we have to change the architecture of the base network and the classifier.
REFERENCES
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 580-587 (2014)
[2] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, Selective search for object recognition, International Journal of Computer Vision, 154-171 (2013)
[3] R. Girshick, Fast R-CNN, in Proceedings of the IEEE International Conference on Computer Vision, 1440-1448 (2015)
[4] S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2017)
[5] K. Kang, Intelligent Video Analysis with Deep Learning, The Chinese University of Hong Kong (Hong Kong), (2017)
[6] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283 (2016)
[7] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016)
[8] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, Deep learning for visual understanding: A review, Neurocomputing, 27-48 (2016)
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 2278-2324 (1998)
[10] J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440 (2015)
[11] K. O'Shea and R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458, (2015)
[12] K. Fukushima, Neocognitron: A hierarchical neural network capable of visual pattern recognition, Neural Networks 1, no. 2, 119-130 (1988)
[13] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Processing Magazine 29, no. 6, 141-142 (2012)
[14] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, European Conference on Computer Vision, Springer (2014)
[15] Q. Fan, L. Brown, and J. Smith, A closer look at Faster R-CNN for vehicle detection, 2016 IEEE Intelligent Vehicles Symposium (IV), 124-129