International Conference on "Computational Intelligence and Communication Technology" (CICT 2018)

Region-based Object Detection and Classification using Faster R-CNN

Syed Mazhar Abbas
ASET, Amity University
Uttar Pradesh, India
mazhar92.abbas@gmail.com

Dr. Shailendra Narayan Singh
ASET, Amity University
Uttar Pradesh, India
snsingh36@amity.edu

Abstract— With the advent of Deep Learning, machine learning systems are able to recognize and classify objects of interest in an image. Various advancements have been made in the field of object recognition and classification. Our research work focuses on improving on the R-CNN, Fast R-CNN and YOLO architectures. The work uses a Region Proposal Network (RPN) to extract regions of interest in an image. The RPN outputs object proposals based on their objectness score. The output objects are subjected to RoI pooling for classification. Our research work focuses on training Faster R-CNN using a custom data set of images. Our trained network efficiently detects objects in an image containing multiple objects. Our network requires a GPU with a minimum compute capability of 3.0.

Keywords—Deep Learning, Faster R-CNN, Region Proposal Network, Convolution Neural Network.

I. INTRODUCTION
Object detection and classification came into existence with the advent of the Convolution Neural Network. The very first advancement in the application of Deep Learning to object recognition took place at NYU in 2013. The model developed by the researchers, known as OverFeat, uses a sliding window approach in a Convolution Neural Network. However, the approach was not efficient, as it required the window to be placed on different regions of the image.

Soon after, the Region based Convolution Network (R-CNN) architecture was proposed by Ross Girshick [1]. The results improved on the existing architecture. The R-CNN architecture worked in three phases. Firstly, it extracts the candidate objects in the entire image using a region proposal method, the most common being Selective Search. Secondly, it extracts features from each of the possible objects recognized in the scene [2]. Lastly, it classifies each region using a Support Vector Machine (SVM). While the results were far more accurate, training the network using the different data sets available was a challenge. The approach extracts features from the objects of interest in the scene one by one and applies a Support Vector Machine for object classification.

However, the approach was later improved by Ross Girshick, a Microsoft Researcher, who published an approach similar to R-CNN. The approach used a region proposal algorithm, typically Selective Search, to extract objects of interest in the scene. It differs from R-CNN in that it is applied to the entire image instead of one region at a time. For classification and regression it employs a Region of Interest (RoI) pooling layer on the feature map [3]. This approach proved to be faster than R-CNN. However, its biggest drawback was that it still relied on Selective Search for extracting objects from the image.

Some time after the R-CNN architecture came into existence, You Only Look Once (YOLO), a unified object detection and classification technique, was proposed in 2016. The research was published as a paper by Joseph Redmon [4]. The architecture the researchers proposed is based on a convolution neural network. The technique achieved efficient results and took relatively less time. This was the first time that real-time object recognition came into the picture [5].

Following the YOLO architecture came an advancement widely called Faster R-CNN, proposed by Shaoqing Ren and coauthored by Girshick, who is currently working as a researcher at Facebook. In an attempt to construct a model that can be trained efficiently using training data, Faster R-CNN added a Region Proposal Network (RPN) [6]. The basic idea of the RPN is that it outputs objects based on their relative objectness score. The objects extracted by the RPN are subsequently used by the RoI pooling and fully connected layers.

This work uses the trainFasterRCNNObjectDetector function of the Computer Vision System Toolbox. A custom data set of approximately 295 images containing vehicles as objects was collected [7]. A Convolution Neural Network (CNN) is created layer by layer using the functionality provided by the MATLAB Neural Network Toolbox™. During training, image patches are extracted.

The paper is divided into the following sections. Section I gives a general introduction to deep learning techniques. Section II of the paper reviews related work, including the various research performed by researchers in the field of

Authorized licensed use limited to: INSTITUTE OF ENGINEERING & MANAGEMENT TRUST. Downloaded on May 21,2024 at 04:45:02 UTC from IEEE Xplore. Restrictions apply.
Object Detection and classification. Section III explains the architecture of Faster R-CNN and the Region Proposal Network. Section IV shows the results of the training performed on the network using a custom data set. Lastly, Section V concludes the paper.

II. RELATED WORK

A. Region Based Convolution Neural Network
The model developed by the researchers, known as OverFeat, uses a sliding window approach in a Convolution Neural Network. However, the approach was not efficient, as it required the window to be placed on different regions of the image. Soon after, the Region based Convolution Network (R-CNN) architecture was proposed by Ross Girshick [8]. The results improved on the existing architecture. The R-CNN architecture worked in three phases.
Firstly, it extracts the candidate objects in the entire image using a region proposal method, the most common being Selective Search. Secondly, it extracts features from each of the possible objects recognized in the scene. Lastly, it classifies each region using a Support Vector Machine (SVM). While the results were far more accurate, training the network using the different data sets available was a challenge [9]. The approach extracts features from the objects of interest in the scene one by one and applies a Support Vector Machine for object classification.

B. Fast-Region based Convolution Neural Network
However, the approach was later improved by Ross Girshick, a Microsoft Researcher, who published an approach similar to R-CNN known as Fast R-CNN. The approach used a region proposal algorithm, typically Selective Search, to extract objects of interest in the scene. It differs from R-CNN in that it is applied to the entire image instead of one region at a time. For classification and regression it employs a Region of Interest (RoI) pooling layer on the feature map [10]. This approach proved to be faster than R-CNN. However, its biggest drawback was that it still relied on Selective Search for extracting objects from the image.

C. YOLO Architecture
Some time after the R-CNN architecture, You Only Look Once (YOLO), a unified object detection and classification technique, was proposed in 2016. The research was published as a paper by Joseph Redmon.
The architecture the researchers proposed is based on a convolution neural network [11]. The technique achieved efficient results and took relatively less time. This was the first time that real-time object recognition came into the picture.

D. Single Shot Detector (SSD) and Region-based Fully Convolution Networks (R-FCN)
The Single Shot Detector (SSD) architecture followed YOLO. It predicts categories and box offsets, uses small convolutional filters applied to feature maps, and makes predictions using feature maps of different scales. R-FCN follows the architecture of Faster R-CNN.

III. METHODOLOGY

A. Faster R-CNN
Region of interest pooling is an approach that is gaining much attention in the field of object recognition and classification using deep learning. An instance could be the detection of objects in an image of a scene containing multiple objects [12]. The objective is to use max pooling on the entire image to extract feature maps of fixed size. The typical architecture of Faster R-CNN is illustrated in Fig 1.

Fig 1. Typical architecture of Faster R-CNN

The object detection technique of Faster R-CNN is subdivided into the following stages:

a) Region Proposal Network: The very first task is to search the given input image for the spaces where an object is likely to be located. The position of the object in the image can be located [13]. The regions where an object may be present are bounded by a region known as the region of interest (ROI).

b) Classification: This stage classifies the regions of interest identified in the above step into their corresponding classes. The technique deployed here is a Convolution Neural Network (CNN).

In the proposed approach there is a rigorous process of identifying all the spaces in the image where an object may be located. However, if no regions are identified in the first stage of the algorithm, there is no need to proceed to the second step [14].
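The fixed-size max pooling that Region of Interest pooling performs (Section III-A) can be illustrated with a small sketch. This is not the implementation used in the paper, which relies on MATLAB toolbox functions. It is a minimal pure-Python illustration under stated assumptions: a single-channel feature map stored as a list of lists, an ROI given by inclusive corner coordinates, and a hypothetical helper name `roi_max_pool`.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one region of a 2-D feature map down to a fixed-size grid.

    feature_map : list of rows (single channel; an assumption for brevity)
    roi         : (x1, y1, x2, y2), inclusive upper-left and bottom-right corners
    output_size : (rows, cols) of the fixed-size output
    """
    x1, y1, x2, y2 = roi
    # Crop the region of interest out of the feature map.
    region = [row[x1:x2 + 1] for row in feature_map[y1:y2 + 1]]
    if not region or not region[0]:
        raise ValueError("region of interest is empty")
    height, width = len(region), len(region[0])
    out_rows, out_cols = output_size
    if height < out_rows or width < out_cols:
        raise ValueError("region is smaller than the requested output grid")

    # Divide the region into out_rows x out_cols roughly equal sections.
    row_edges = [i * height // out_rows for i in range(out_rows + 1)]
    col_edges = [j * width // out_cols for j in range(out_cols + 1)]

    pooled = []
    for i in range(out_rows):
        pooled_row = []
        for j in range(out_cols):
            # Collect the values of one section and keep only its maximum.
            section = [value
                       for row in region[row_edges[i]:row_edges[i + 1]]
                       for value in row[col_edges[j]:col_edges[j + 1]]]
            pooled_row.append(max(section))
        pooled.append(pooled_row)
    return pooled


# A 6x6 feature map whose entries are 0..35, pooled over its upper-left 4x4 region.
example_map = [[r * 6 + c for c in range(6)] for r in range(6)]
print(roi_max_pool(example_map, (0, 0, 3, 3), (2, 2)))  # [[7, 9], [19, 21]]
```

Whatever the size of the region, the output always has the requested number of cells, which is what allows regions of different sizes to feed the same fully connected layers.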

B. Region Proposal Network
Region of Interest (ROI) pooling was proposed by Ross Girshick in 2015 as part of an object detection approach using deep learning [21]. ROI pooling has the advantage of achieving a speedup and the flexibility to scale for both training and testing data. The inputs to the ROI layer are:

A feature map that is the output of a Convolution Neural Network with multiple convolution layers and max pooling layers.

An N-row matrix made by subdividing the feature map space into Regions of Interest, where N corresponds to the number of Regions of Interest (ROI) [15]. The first column represents the index of the image and the remaining columns represent the coordinates of the Region of Interest (ROI), starting from the upper-left coordinate and ending at the bottom-right coordinate.

The Region of Interest space so determined is known as the Region Proposal. The approach works by dividing the entire space of the Region Proposal into equal-sized partitions [16]. The number of sections into which the Region Proposal is divided must equal the dimension of the output.

The maximum value is determined from each sub-region. The max values are copied to the output buffer.

Fig 2. Region Proposal Network

In the Region Proposal Network, the image is initially fed into the convolution neural network [17]. The input image is passed through a set of convolution layers, the last of which outputs the feature maps.

A sliding window is placed on every section of the feature maps. The sliding window mask is generally of size n*n. Corresponding to each sliding window position, anchors are generated [18]. These anchors have the same center, say (xc, yc). However, the anchors generated have different aspect ratios and scaling factors.

In addition, a value p is calculated for each of these anchors, which measures the probability of overlap of the anchor with the boundary of the region surrounding the objects [26].

The output of the regressor is a region boundary with location coordinates (x, y, h, w) [19]. The classification output indicates a probability of 0 or 1, that is, whether the region contains an object or not. The label $p^*$ is assigned from the Intersection over Union (IoU) as

$$p^* = \begin{cases} 1 & \text{if } IoU > 0.7 \\ -1 & \text{if } IoU < 0.3 \\ 0 & \text{otherwise} \end{cases}$$

$$t = \left( \frac{x - x_a}{w_a},\ \frac{y - y_a}{h_a},\ \log\frac{w}{w_a},\ \log\frac{h}{h_a} \right)$$

$$t^* = \left( \frac{x^* - x_a}{w_a},\ \frac{y^* - y_a}{h_a},\ \log\frac{w^*}{w_a},\ \log\frac{h^*}{h_a} \right)$$

Where $x_a, y_a, w_a, h_a$ are the centre, width and height of the anchor and $x^*, y^*, w^*, h^*$ are the centre, width and height of the ground truth bounding box.

The loss function is defined over the outputs of the classification and regression networks.

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i L_{reg}(t_i, t_i^*)$$

Lastly, the final features of size 3*3 are extracted and input to the networks for the purpose of classification and regression.

Fig 3. Three anchors with different aspect ratios and scaling.

IV. RESULTS AND DISCUSSIONS

We are using Faster R-CNN to classify objects belonging to a particular class. The Faster R-CNN technique for object classification is an enhancement of the techniques used earlier [20]. The Faster R-CNN technique works on the complete image for classification and recognition. It deploys a Region Proposal Network as a pre-processing step before the Convolution Neural Network (CNN). The R-CNN and Fast R-CNN architectures use the Selective Search approach for extracting regions of an image. Faster R-CNN addresses the problem by using a Region Proposal Network in combination with a CNN [28].

The following section uses the trainFasterRCNNObjectDetector function of the Computer Vision System Toolbox. The example that we discuss is divided into the following sections.

A) Loading of data.
B) Convolution Neural Network designing.
C) Configuring training options.
D) Training of the network.
E) Testing of the trained network.

A. Loading of Data
A custom data set of approximately 295 images containing vehicles was collected. Each of the images collected contains some instances of different vehicles. In order to explain the working of Faster R-CNN we are using a small data set [21]. However, to make the system robust, more training images must be used in practice.
The data is captured and stored in a tabular format. The table is divided into two columns, where the first column holds the path of the stored image file and the second column holds the Region of Interest for vehicles.

B. Convolutional Neural Network (CNN) designing
In this data set all the objects are larger than [16 16], so we select an input size of [32 32]. This input size is a balance between processing time and the amount of spatial detail the CNN needs to resolve.
A Convolution Neural Network (CNN) is created layer by layer using the functionality provided by the MATLAB Neural Network Toolbox™.
In the Neural Network Toolbox we started with the imageInputLayer function. This function lets us specify the type and size of the input layer [22]. To detect the various sections of an image, a Convolution Neural Network needs to analyse various segments of the image. Therefore the size of the input layer must be similar to the smallest object in the dataset.
In addition, we have defined the middle layers of the network. The middle layers of a CNN are composed of several convolutional layers, rectified linear units and several pooling layers [23]. These layers are considered to be the base for creating any Convolution Neural Network. To create a deeper network we can use several combinations of these layers. However, in order to avoid discarding useful and relevant information in an image, the number of pooling layers must be kept as low as possible. Downsampling at an early stage results in the network discarding information that is useful for learning from the scene [24].
Further, we have combined the input, hidden and output layers.

C. Configuring Training Options
The inbuilt function trainFasterRCNNObjectDetector of the Computer Vision System Toolbox adopts four steps to train the network. The initial two steps of the training are used to train the region proposal network and the detection network [25]. The last steps create a single network by combining the networks obtained from the initial two steps. The convergence rates can be different for each training step; therefore we have specified training options for each step using the trainingOptions function from the Neural Network Toolbox [35].

D. Training of the Network
Now that the CNN and the training options are defined, the detector can be trained using trainFasterRCNNObjectDetector.
During training, image patches are extracted from the training data.
To save execution time for this example, a pretrained network is loaded from disk. To train the network yourself, set the doTrainingAndEval variable to true. The network is trained using the trainFasterRCNNObjectDetector function from the Computer Vision System Toolbox. Fig 4 shows a vehicle being detected.

Fig 4. Results showing object detection

E. Evaluating Detector using Testing dataset
In our research paper we tested the detector on a single image from the vehicle data set. Fig 4 illustrates the vehicle detected by the network trained with the vehicle data set. Our network distinguished the region-of-interest objects in the image [26]. Testing on a small image containing a relatively small number of vehicles showed the expected results. However, it is desirable to test the detector on large data sets of images in order to make the system robust. The Computer Vision System Toolbox™ provides functions for measuring the precision and recall of the results. evaluateDetectionPrecision and evaluateDetectionMissRate are the two functions provided by the Computer Vision System Toolbox™ [27]. The average precision measures the ability of the network to classify the objects correctly. The recall is the ability of the network to detect all relevant objects in the images.
We have evaluated the network using the testing data. However, the results are loaded from disk to avoid the evaluation time. The curve shows the precision of the trained detector. The example we have shown achieves an average precision of 0.6. Including additional

layers can be used to improve the average precision.

V. CONCLUSION

The paper presents the advancement in the field of deep learning for extracting objects of interest from an image [28]. The paper provides an overview of various techniques such as R-CNN and Fast R-CNN [41]. The research work focuses on training Faster R-CNN to recognize objects of interest using custom data sets. Faster R-CNN is based on the Region Proposal Network. The Region Proposal Network is based on the objectness score [29]. The outputs are subjected to RoI pooling for classification. In our research paper we tested the detector on a single image from the vehicle data set. The Computer Vision System Toolbox™ provides functions for measuring the precision and recall of the results. A CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher is highly recommended for training. The network is trained using the trainFasterRCNNObjectDetector function from the Computer Vision System Toolbox [30].

REFERENCE

[1] S. Ren, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015.

[2] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.

[3] J. R. R. Uijlings, “Selective search for object recognition,” International Journal of Computer Vision, 2013, pp. 154-171.

[4] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.

[5] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.

[6] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.

[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.

[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015.

[9] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015.

[10] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” arXiv:1511.07571, 2015.

[11] J. Hosang, R. Benenson, and B. Schiele, “How good are detection proposals, really?” in British Machine Vision Conference (BMVC), 2014.

[12] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, “Object-Proposal Evaluation Protocol is ’Gameable’,” arXiv:1505.05836, 2015.

[13] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” arXiv:1504.06066, 2015.

[14] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013.

[15] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015.

[16] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[17] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015.

[18] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning (ICML), 2010.

[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, 1989.

[20] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012.

[21] R. Girshick, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[22] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision 2014, Springer International Publishing, 2014, pp. 391-405.

[23] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision (ECCV), 2014.

[24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.

[25] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[27] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.

[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[29] Z. Li, L. Zhang, Y. Fang, J. Wang, H. Xu, B. Yin, and H. Lu, “Deep People Counting with Faster R-CNN and Correlation Tracking,” in Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS'16), 2016, pp. 57-60.

[30] Object Detection using Faster R-CNN, 2017. [Online]. Available: https://in.mathworks.com/help/vision/examples/object-detection-using-faster-r-cnn-deep-learning.html?requestedDomain=www.mathworks.com [Accessed 10-Oct-2017]
