
A comprehensive survey of the R-CNN family for object detection

O. HMIDANI
Computer Networks and Systems Laboratory
Faculty of Science, Moulay Ismail University
Meknes, Morocco
[email protected]

E. M. ISMAILI ALAOUI
Computer Networks and Systems Laboratory
Faculty of Science, Moulay Ismail University
Meknes, Morocco
[email protected]
Abstract—Object detection using deep learning, one of the most challenging problems in computer vision, seeks to locate instances of objects from a large number of predefined categories in natural images. Given this period of rapid evolution, the main contribution of this paper is to provide a comprehensive survey of the region-based convolutional neural network (R-CNN) family (R-CNN, Fast R-CNN, and Faster R-CNN). In comparison to R-CNN and Fast R-CNN, simulation results show that Faster R-CNN improves not only accuracy but also detection speed. It has been found that Faster R-CNN is particularly well suited for robust object detection. We conclude with several open issues and challenges that are key to the design of future work.

Index Terms—Object detection, Deep learning techniques, R-CNN, Fast R-CNN, Faster R-CNN

I. INTRODUCTION

Recently, deep learning techniques have emerged as powerful methods for learning feature representations automatically from data. These techniques have provided major improvements in object detection. Deep learning can be considered a part of the broader family of machine learning, which allows computer systems to transform simpler concepts into more abstract and complex concepts [1]. Deep learning models, known as deep neural networks, deploy multiple hidden layers to exploit the unknown structure in the input distribution and discover composite representations. The main goal of this paper is to offer a comprehensive survey of deep learning R-CNN techniques and to present a comparison between those techniques in terms of test time per image, speed, and mean average precision (mAP). Our categorization helps readers gain an accessible understanding of the differences among the algorithms belonging to the R-CNN family: R-CNN, Fast R-CNN, and Faster R-CNN. In this paper, we cover the intuition behind some of the main techniques used in object detection and see how they have evolved from one implementation to the next. This article attempts to track recent advances and summarize their achievements in order to give a clearer picture of the algorithms belonging to the R-CNN family in object detection. This survey gives researchers a starting point to understand current research and identify open challenges for future work.

The remainder of this paper is organized as follows. In the next section, we present a brief overview of the different R-CNN algorithms for object detection: we first briefly describe the R-CNN approach, then the Fast R-CNN approach, followed by the details of the Faster R-CNN approach. Section 3 presents a set of experiments to evaluate and compare the performance and challenges of the R-CNN family. Finally, in Section 4, we conclude the paper with an overall discussion of object detection and future research directions.

978-1-6654-5054-6/22/$31.00 ©2022 IEEE

II. OVERVIEW OF DIFFERENT R-CNN ALGORITHMS FOR OBJECT DETECTION

A. R-CNN approach

The objective of R-CNN is to take an image and recognize where the main objects within the picture are. This technique was first introduced in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation" [1].
• Input: image.
• Output: bounding boxes + labels for each object in the image.
Rather than working on a massive number of regions, the R-CNN algorithm proposes a bunch of boxes within the image and checks whether any of these boxes contains an object [1]. R-CNN uses selective search to extract these boxes from an image (these boxes are called regions) [1]. The general process of R-CNN is shown in Figure 1.

Fig. 1. Regions with CNN features. [1]

Let us begin by understanding what selective search is and how it recognizes the different regions. There are basically four properties that characterize an object: varying scales, colors, textures, and enclosure [2]. Selective search recognizes these patterns within the image and, based on them, proposes different regions [2].
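The grouping idea behind selective search can be illustrated with a toy sketch. In the following example (a simplification we provide for intuition, not the actual algorithm of [2]; the `segment` and `region_proposals` helpers, the 4-connectivity rule, and the intensity threshold are all invented for the demo), adjacent pixels of similar intensity are grouped into segments and one bounding box is proposed per segment:

```python
# Toy sketch of the idea behind selective search: group adjacent pixels
# into segments by similarity, then propose one bounding box per segment.

def segment(image, threshold=30):
    """Group 4-connected pixels whose intensities differ by at most
    `threshold` into segments, via flood fill. Returns a label grid."""
    h, w = len(image), len(image[0])
    labels = [[None] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] is not None:
                continue
            stack = [(sy, sx)]
            labels[sy][sx] = next_label
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] is None
                            and abs(image[ny][nx] - image[y][x]) <= threshold):
                        labels[ny][nx] = next_label
                        stack.append((ny, nx))
            next_label += 1
    return labels

def region_proposals(labels):
    """One bounding box (x0, y0, x1, y1) per segment label."""
    boxes = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            x0, y0, x1, y1 = boxes.get(lab, (x, y, x, y))
            boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return list(boxes.values())

# A 4x6 "image": a dark object on a bright background.
img = [
    [200, 200, 200, 200, 200, 200],
    [200,  10,  12, 200, 200, 200],
    [200,  11,  13, 200, 200, 200],
    [200, 200, 200, 200, 200, 200],
]
print(region_proposals(segment(img)))  # [(0, 0, 5, 3), (1, 1, 2, 2)]
```

The real method additionally merges neighboring segments hierarchically using color, texture, size, and fill similarity, so proposals are produced at all scales.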

Authorized licensed use limited to: FhI fur Integrierte Schaltungen Angewandte Elek. Downloaded on March 27,2023 at 17:12:55 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. An example of selective search. [2]

Here is a brief overview of how selective search works. We discover objects at distinctive scales, like those shown in Fig. 2. In this process, image segmentation is first performed on the image, which separates the picture into distinctive sections by grouping together adjoining pixels on the basis of color and texture. After image segmentation, bounding boxes of diverse sizes are drawn around the segmented objects [2]. These bounding boxes contain the objects.

The steps followed by R-CNN to detect objects are [1]:
• We first take a pre-trained convolutional neural network.
• At that point, this model is retrained: the final layer of the network is trained on the number of classes to be recognized.
• The third step is to get the regions of interest for each image. We then reshape all these regions so that they match the CNN input size.
• After getting the regions, we train SVMs to classify objects and backgrounds. For each class, we train one binary SVM.
• Finally, we train a linear regression model to generate tighter bounding boxes for each identified object in the image.

We will visualize each step:
• First, an image is taken as an input:

Fig. 3. R-CNN explanation. [3]

• We get the regions of interest by utilizing some strategy (for example, selective search as seen previously):

Fig. 4. R-CNN explanation. [3]

• These regions are then reshaped according to the input of the CNN, and every region is passed to the ConvNet:

Fig. 5. R-CNN explanation. [3]

• The CNN extracts features for every region, and SVMs are utilized to partition these regions into the various classes:

Fig. 6. R-CNN explanation. [3]

• Finally, a bounding box regression is used to estimate the bounding boxes for each distinguished area:

Fig. 7. R-CNN explanation. [3]

1) Problems with R-CNN: Up until this point, we have seen how R-CNN can be useful for object detection. However, this procedure has its own set of limitations. Training an R-CNN model is costly and slow:
• Based on a selective search, 2000 regions are extracted for each image [1].
• Features are extracted with the CNN for every region of every image. Assume we have N images; the number of CNN feature sets will be N × 2000 [1].
• The whole pipeline of object detection using R-CNN requires three models:
  - a CNN for the feature extraction;
  - a linear SVM classifier for recognizing objects;
  - a regression model for refining the bounding boxes.
All of these processes combine to make R-CNN exceptionally slow. It takes around 40-50 seconds to make predictions for each new image, which makes the model unwieldy and, for all intents and purposes, impractical when faced with a gigantic dataset.
As a result, Fast R-CNN was created in order to overcome R-CNN's bottleneck.

B. Fast R-CNN approach

In 2015, a year later, the main author of the previous paper [1], Ross Girshick, proposed a better model for object recognition. In his paper, Fast R-CNN [3], rather than running a CNN multiple times for every image, we can run it just once per image and get all of the regions of interest. Ross Girshick proposed running the CNN just once per image and then sharing that computation across the 2000 regions [3]. In Fast R-CNN, we feed the input image to the CNN, which in turn produces the convolutional feature maps. The region proposals are extracted using these maps. We then utilize an RoI pooling layer to reshape all the proposed regions into a fixed size so they can be fed into a fully connected network [3].

Fig. 8. Fast R-CNN architecture. [3]

The following are the steps for simplifying the concept [3]:
• We take an image as an input.
• This image is passed to a ConvNet, which produces the regions of interest.
• An RoI pooling layer is applied to these regions to reshape them according to the input of the ConvNet. Then every region is passed to a fully connected network.
• A SoftMax layer is utilized on top of the fully connected network to output classes. In addition to the SoftMax layer, a linear regression layer is used to generate bounding box estimates for the predicted classes.
Thus, rather than utilizing three separate models (as in R-CNN), Fast R-CNN utilizes a single model that extracts features from the regions, divides them into the various classes, and returns the bounding boxes for the detected classes, all at the same time [3].

We will visualize each step:
• We take an image as input:

Fig. 9. Fast R-CNN explanation. [3]

• This image is passed to a ConvNet, which returns the regions of interest:

Fig. 10. Fast R-CNN explanation. [3]

• The RoI pooling layer is then applied to the extracted regions of interest to ensure that they are all of a similar size:

Fig. 11. Fast R-CNN explanation. [3]

• At long last, these regions are passed to a fully connected network that classifies them and returns the bounding boxes, using SoftMax and linear regression layers:

According to the author, the Fast R-CNN model is nine times faster than the R-CNN model. This is how Fast R-CNN resolves two significant issues with R-CNN: the first is passing the whole image once rather than 2000 regions per image to the ConvNet, and the second is utilizing one rather than three different models for extracting features, classifying, and creating bounding boxes [3].

Fig. 12. Fast R-CNN explanation. [3]
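The bounding box regression used throughout the family predicts offsets in the (tx, ty, tw, th) parameterization introduced in [1]: translations are scaled by the proposal's width and height, and size changes are expressed as log-space ratios. A small sketch (the `encode`/`decode` function names are ours, not from the paper; boxes are (cx, cy, w, h)):

```python
import math

def encode(proposal, gt):
    """Regression targets that map `proposal` onto the ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,      # scale-invariant shift
            math.log(gw / pw), math.log(gh / ph))  # log-space resize

def decode(proposal, t):
    """Apply predicted offsets `t` to `proposal` to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))

p, g = (50.0, 50.0, 20.0, 40.0), (55.0, 48.0, 30.0, 36.0)
t = encode(p, g)
print(decode(p, t))  # ≈ (55.0, 48.0, 30.0, 36.0), recovering g
```

This parameterization keeps the regression targets small and roughly scale-invariant, which makes them easier to learn than raw pixel coordinates.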

1) Problems with Fast R-CNN: Yet even Fast R-CNN has certain pain points. It also employs selective search to locate the regions of interest, which is a slow and tedious process. It takes around two seconds per image to detect objects, which is vastly improved compared with R-CNN. Still, when we consider huge real-world datasets, even Fast R-CNN does not look so quick any longer. Compared to the previous R-CNN and Fast R-CNN, the Faster R-CNN model is more efficient, as it implements a region proposal network (RPN), which uses a neural network to take care of generating the bounding box proposals.

C. Faster R-CNN approach

In 2015, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, part of the Microsoft research team, introduced a region proposal network to eliminate the bottleneck problem in Fast R-CNN [4]. Faster R-CNN is a modified version of Fast R-CNN. The main advantage of this model over the Fast R-CNN model is that it uses a neural network, the RPN, for region proposal in addition to the CNN used for classification. The RPN takes image feature maps as input and produces a set of object proposals, each with an objectness score, as output [4].

The steps of Faster R-CNN are [4]:
• We take an image and pass it as input to the ConvNet, which gives us the feature map for that image.
• We apply the RPN to these feature maps. This generates the object proposals with their objectness scores.
• An RoI pooling layer is applied to these proposals to bring all of them down to a similar size.
• Finally, the proposals are passed to a fully connected layer, which includes a softmax layer and a linear regression layer, to classify and output the bounding boxes for the objects.

Fig. 13. Faster R-CNN flow diagram. [4]

Fig. 14. Left: Region proposal network. Right: Examples of detection using RPN proposals. [4]

We will now clarify how this RPN works. Faster R-CNN takes the feature maps from the CNN and passes them to the region proposal network. The RPN slides a window over these feature maps, and at every window position it produces K anchor boxes of various shapes and sizes. Anchor boxes are fixed-size bounding boxes that are placed throughout the image and have various shapes and sizes. The RPN predicts two things for each anchor:
1) The probability that the anchor is an object.
2) The bounding box regressor for adjusting the anchor to better fit the object.

For each anchor box, we output one bounding box and score per position in the image. Considering these anchor boxes, the inputs and outputs of this region proposal network are:
• Input: CNN feature map.
• Output: a bounding box per anchor, and a score representing how likely the image in that bounding box is to be an object.

At that point, we use Fast R-CNN to generate a classification and a tightened bounding box for each proposal that is likely to be an object [4].

1) Problems with Faster R-CNN: All of the object detection algorithms we have examined so far use regions to recognize the objects. The network does not look at the complete image in one go; rather, it focuses on parts of the image sequentially. This causes two complications:
• The algorithm requires numerous passes through a single image to find all of the objects.
• As the different systems work one after the other, the performance of the systems further down the pipeline depends on how the previous systems performed.

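The anchor generation step of the RPN can be sketched directly: at every sliding-window position, K = len(scales) × len(ratios) boxes of different shapes and sizes are emitted, centered on that position mapped back to image coordinates. The following is an illustrative sketch (the `generate_anchors` helper, the stride of 16, and the scale/ratio defaults are our assumptions in the spirit of [4], not its exact implementation):

```python
# Hedged sketch of RPN anchor generation: K anchors per feature-map cell,
# covering several scales and aspect ratios at a fixed stride.

def generate_anchors(fm_h, fm_w, stride=16,
                     scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors; K = len(scales)*len(ratios) per cell."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx = (x + 0.5) * stride   # cell center in image coordinates
            cy = (y + 0.5) * stride
            for s in scales:
                base = s * stride     # square side length at this scale
                for r in ratios:
                    # vary aspect ratio while keeping the area ~constant
                    w = base * r ** 0.5
                    h = base / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

a = generate_anchors(2, 3)
print(len(a))  # 2 * 3 * 9 = 54 anchors
```

The RPN's two heads then score each of these anchors for objectness and regress per-anchor box offsets, exactly as described above.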
III. SIMULATION RESULTS AND DISCUSSION

Several simulations have been conducted to compare the performance of the different R-CNN algorithms for object detection. The data set used in this experiment is one of the most common public data sets in the field of object detection, COCO. In order to assess the performance of the different algorithms, we used the following criteria for our comparisons: (1) test time per image; (2) detection speed; and (3) mAP.

TABLE I
THE TEST TIME PER IMAGE OF R-CNN, FAST R-CNN AND FASTER R-CNN

                      R-CNN        Fast R-CNN    Faster R-CNN
Test time per image   49 seconds   2.2 seconds   0.2 seconds
Speed up              1x           22.27x        245x

Table I shows the test time per image; the results confirm that Faster R-CNN improves the detection speed. It can be seen from this table that Faster R-CNN is 11 times faster per image than Fast R-CNN and 245 times faster than R-CNN. This is mainly because Faster R-CNN replaces the selective search method with the RPN, which makes the algorithm much faster.
Faster R-CNN improves not only the detection speed but also the accuracy. This comparison is illustrated in Table II.

TABLE II
THE mAP (%) OF R-CNN, FAST R-CNN AND FASTER R-CNN

      R-CNN   Fast R-CNN   Faster R-CNN
mAP   66      66.9         66.9

In this table we observe that Fast R-CNN and Faster R-CNN have the same mAP, which is much better than that of R-CNN.
The detection result is expressed as bounding boxes and the associated detection confidence scores for each image. The figures show the detection results of the three network architectures mentioned above: the first image shows the detection results of R-CNN, the second shows the detection results of Fast R-CNN, and the third shows the detection results of Faster R-CNN. It can be seen from the third image that Faster R-CNN gave the correct detection result, demonstrating its important role in improving the detection accuracy of the object detection network.

Fig. 17. Faster R-CNN

IV. CONCLUSION

Object detection is an important and challenging problem in computer vision and has received considerable attention. This article served as a comprehensive survey on deep learning for object detection, highlighting recent achievements and providing a comparison of a variety of R-CNN algorithms used for object detection problems. The experimental results demonstrate that the accuracy and detection speed of Faster R-CNN are greatly improved compared to R-CNN and Fast R-CNN, because it uses the RPN to replace the time-consuming selective search. Object detection algorithms need not only to accurately classify and localize important objects, but also to be incredibly fast at prediction time to meet the real-time demands of video processing. The major complications of object detection are speed and accuracy, and to improve detection accuracy and inference time using deep learning, we have to change the architecture of the base network and the classifier.

REFERENCES
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 580-587 (2014)
[2] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, Selective search for object recognition, International Journal of Computer Vision, 154-171 (2013)
[3] R. Girshick, Fast R-CNN, in Proceedings of the IEEE International Conference on Computer Vision, 1440-1448 (2015)
[4] S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2017)
[5] K. Kang, Intelligent Video Analysis with Deep Learning, The Chinese University of Hong Kong (Hong Kong), (2017)
[6] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and others, TensorFlow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283 (2016)
[7] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016)
[8] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, Deep learning for visual understanding: A review, Neurocomputing, 27-48 (2016)
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 2278-2324 (1998)
[10] J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440 (2015)
[11] K. O'Shea and R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458, (2015)
[12] K. Fukushima, Neocognitron: A hierarchical neural network capable of visual pattern recognition, Neural Networks 1, no. 2, 119-130 (1988)
[13] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Processing Magazine 29, no. 6, 141-142 (2012)
[14] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, European Conference on Computer Vision, Springer (2014)
[15] Q. Fan, L. Brown, and J. Smith, A closer look at Faster R-CNN for vehicle detection, 2016 IEEE Intelligent Vehicles Symposium (IV), 124-129 (2016)
