An Investigation of Deep Neural Network Based Techniques For Object Detection
Abstract—Detection of objects and their recognition in visual sequences are two critical tasks in the computer vision field. Various real-time applications such as autonomous vehicles, face recognition, health-care systems and space exploration require highly reliable and precise object detection models. Traditional object detection and recognition algorithms are based on handcrafted features and are considered to be erroneous, time consuming and expensive, leading to a significant reduction of the accuracy rate for object detection in large datasets. Recently, a large number of promising deep neural network models have emerged for facilitating automated and accurate detection of objects of varying scale and their precise recognition across various computer vision applications. Several GPU based neural models incorporating context-aware capabilities have shown effective performance, which overcomes the drawbacks of traditional techniques. This research study provides an investigation of several popular deep learning models that exist for accurate object detection in various forms of visual sequences. Objects of varying scale that occur frequently in the popular MS COCO and PASCAL datasets are detected, and detection performance is evaluated utilizing the one-stage Yolo family and the two-stage Faster RCNN deep learning object detectors. At the end of the study, several future research directions for the object detection task are discussed.
Keywords—Object Detection and Recognition; One-stage detector; Two-stage detector; CNN algorithms; Deep Learning; Computer Vision.
I. INTRODUCTION
An image consists of several interesting features which necessitate careful investigation for target object localization and classification in computer vision. An object in an image can be any real-world recognizable entity belonging to one of several classes such as human, animal, vehicle etc. Object detection refers to the computer vision task of accurately locating an object and predicting its class, and it is considered a crucial building block of several promising computer vision applications [1].
Traditional approaches to object detection were based on handcrafted and geometric feature extraction oriented techniques, namely optical flow techniques, the Hough Transform, the Harris Corner Detector and SIFT (scale invariant feature transform), which have been utilized for the object detection task for the past several decades [2-3]. However, the above mentioned methods are time-consuming, error-prone and less accurate in nature. Several challenges of traditional methods have been addressed by the recent developments in object detection and recognition, especially with the introduction of deep neural networks, which are capable of dealing with large datasets when trained upon powerful GPUs. They show promising performance results and build confidence as an emerging future technology for various computer vision applications such as self-driving vehicles, 3D space investigation and other AI-robotic systems.
Correlation based filtering techniques were considered to be prominent in the past decades. However, for establishing the balance between multiple object tracking (MOT) and its representation along with real-time performance in real-time video sequences, deep learning based multiple feature representation methodologies have emerged as a new advancement in the recent era [4]. Thus, deep learning has nowadays emerged as the most robust and popular methodology compared to traditional machine learning, due to the availability of large-scale public datasets, namely ImageNet [5], MS COCO [6] and PASCAL VOC [7], on which models can be trained with powerful GPU based computers. Most modern deep learning models support automated identification and recognition of complex patterns in a visual scene, owing to the resemblance between the neurons of neural network based models and the processing activities of the actual human brain [8].
TABLE I. Deep learning models for object detection: merits and demerits.

| Authors (year) | Approach | Merits | Demerits |
|---|---|---|---|
| Fangming Bi et al. (2019) | A Siamese neural network known as the SiamFC CNN is utilized for object detection in video sequences. FCNT and MDNet online trackers are other object detectors for varying environmental scenarios. | SiamFC achieves good results upon the VOT 2017 dataset. RNNs, namely LSTM and GRU, yield better results for sequential tasks. | Spatial information is not utilized. Optimizing the update criteria of online trackers and exploiting the label of the first frame need to be well addressed [4]. |
| Shrey Srivastava et al. (2021) | YOLOv3, YOLO-v4 and SSD single-stage object detectors are evaluated upon MS COCO. The FRCNN two-stage detector is also experimented with. | YOLO-v4 is considered the fastest for object detection in a live video feed. FRCNN is highly accurate for small datasets. SSD offers a better trade-off between speed and accuracy than other CNNs. | Still requires optimization of the CNN backbone for boosting the real-time performance [8]. |
| Eric Crawford et al. (2019) | A Spatially Invariant Attend, Infer and Repeat based framework is proposed for the unsupervised object detection task. | Superior to conventional unsupervised methods. Effectively scalable to large datasets. An overall 0.66 mAP is observed for the digit recognition task. | Further performance improvement is necessary. Real-time video object detection is yet to be explored [27]. |
| Sankar K. Pal et al. (2021) | A survey of generic multiple object detection (MOT) is presented. Distance metric based MOT and generative network based MOT are highlighted. | The MOT-2015 and MOT-2016 datasets are experimented upon with an RNN + LSTM network, yielding 71% and 75.9% MOTP respectively. | Efficient integration of one-stage and two-stage detectors is essential for boosting the performance of diversified object detection tasks [28]. |

Table I investigates the evolution of deep learning models for the object detection task, highlighting the merits and demerits of various modern CNNs experimented upon several popular public datasets. The future scope of object detection still exists as an open research question, whereby hybridization of several CNNs can effectively boost the performance of the object detection task.
A. Object Detection
Object detection is a crucial activity in several popular computer vision based real world applications even in the recent era. Objects are identified under various environmental and imaging conditions such as lighting conditions, obstacles, occlusions, blurring effects, size variations, viewpoints and pose variations. These aspects still need to be addressed by robust and powerful deep learning models. Conventional detectors such as the VJ (Viola-Jones) detector [13], the HOG (Histograms of Oriented Gradients) detector [14] and SIFT (scale-invariant feature transform) [15], which detect objects based on feature descriptors, yielded significant results, but huge redundancy and a high false positive rate were witnessed. Consequently, deep learning based one-stage and two-stage object detectors [16] emerged as a solution for addressing the limitations of traditional object detectors. However, because they are trained on huge datasets, a few object detectors require a slightly longer training time, but they still promise state-of-the-art classification performance. Modern deep learning CNN architectures provide highly accurate object detection and classification results [17].
B. Deep Neural Network
In this section, we review some of the prominent deep neural network CNN models employed for object detection and classification tasks in computer vision.
a) Convolutional Neural Network (CNN):
A typical artificial neural network in the deep learning domain, popularly known as the Convolutional Neural Network or CNN, finds applications in object detection and classification. Modern deep convolutional neural networks are capable of recognizing objects present in an image or visual sequence by using an efficient CNN architecture as a backbone network.
b) History of CNNs:
The first CNN architecture, known as LeNet-5, was developed in 1998 for recognizing handwritten numerals [18]. With the introduction of NVIDIA GPUs, CNNs gained parallel processing capabilities and faster training, learning and performance evaluation upon challenging datasets. AlexNet, the most popular CNN architecture with 8 layers, achieved higher accuracy than other architectures in 2012 [19]. The VGG (Visual Geometry Group) network was introduced in 2014. Depending on the network depth, VGG-16 and VGG-19 were introduced with 16 and 19 weight layers respectively [17]. One-stage object detectors, namely You Only Look Once (YOLO) and Single Shot Detector (SSD), and two-stage object detectors such as the region of interest (RoI) based R-CNN, Fast-RCNN, Faster-RCNN, RFCN and Mask RCNN, have been the most popular CNN
based object detection models over the past several years, with varying feature extraction and object detection capabilities [8].
c) Faster RCNN (FRCNN)
It is an expanded version of Fast RCNN which overcomes the selective search algorithm bottleneck by incorporating a region proposal network (RPN). On benchmark datasets such as ImageNet and PASCAL VOC, Faster RCNN yields a mAP of 66% while being 250 times faster than conventional RCNN. Faster RCNN is used for character recognition or text recognition in Google and Facebook real world applications [20].
d) You Only Look Once (Yolo)
Yolo is a one-stage detection model which possesses faster detection capability for high resolution images or visual scenes than the two-stage Faster RCNN. However, Faster RCNN tends to be more accurate and outperforms the Yolo model, especially while detecting small objects in image sequences. Computational costs for Yolo are reasonably lower than for the two-stage detectors. Depending on the type of object detection task, one-stage or two-stage detectors can be utilized, yielding a trade-off between speed and accuracy. Several versions of Yolo exist for the real-time object detection task, namely Yolo v2, v3 and v4 [10, 21].

TABLE II. Summary of Yolo CNNs performance.

| Yolo Model | Speed | Accuracy (mAP) | Dataset |
|---|---|---|---|
| Yolo v1 | 45 fps | 63.4 | VOC |
| Yolo v2 | 67 fps | 76.8 | VOC |
| Yolo v3 | 51 fps | 33.0 | COCO |
| Yolo v4 | 65 fps | 44.0 | COCO |
| Yolo v5 | 55 fps | 55.0 | COCO |
| Yolo v6 | 40 fps | 52.0 | COCO |
| Yolo v7 | 60 fps | 56.0 | COCO |

Table II depicts the overall performance of several models of the Yolo family depending on the dataset, backbone CNN and GPU environment utilized for experimentation. Most of the models are experimented with the challenging MS COCO and PASCAL VOC public datasets and are evaluated upon standard Titan X or Tesla GPUs.
e) Single Shot Detector (SSD)
The Single Shot MultiBox Detector is faster and more lightweight than Yolo when utilizing the MobileNet architecture. However, SSD is reasonably less accurate compared to Faster-RCNN when used with an InceptionV2 backbone CNN. Despite the high accuracy of Faster-RCNN for the object detection task, especially for small target objects, it is considered to be a slower CNN model. SSD utilizes VGG16 as the backbone model and constructs feature maps for the detection of both small and large objects [12, 21].
f) Object Detection Public Datasets
Microsoft COCO: The Microsoft Common Objects in Context dataset is a widely used dataset which consists of more than 330K images with annotations for accurately classifying objects such as humans, animals and vehicles. It has emerged as a benchmark dataset and is heavily utilized for model training, model evaluation and object classification tasks.
PASCAL VOC: The PASCAL Visual Object Classes Challenge [22] is a benchmark dataset in computer vision utilized for object identification, classification and semantic segmentation. The PASCAL VOC 2012 dataset consists of more than 10000 images for training and testing tasks. It also provides annotation support for the objects occurring in the images, categorizing them into person, animal and other indoor tangible classes.
III. MODEL PERFORMANCE EVALUATION METRICS
The standard metric mAP, known as mean average precision, is extensively used for assessing the training and classification performance of CNN models upon benchmark datasets such as MS COCO and PASCAL VOC. It is computed as shown in the following steps:
(1) A precision-recall curve is plotted for every object category by computing the precision at successive recall values. For every recall value, the maximum precision computed is considered.
(2) Average precision (AP) is the area under the curve plotted in the previous step. The mean of the AP values over all detected categories finally results in the mean Average Precision (mAP) [23].
IoU, commonly referred to as Intersection over Union, is mainly utilized for mAP value computation. It measures the overlap between the estimated bounding box and the actual or ground truth box of an object in an image. The confidence score of a predicted bounding box is the product of the probability that the box contains an object and the IoU. The higher the confidence score, the higher will be the precision and the performance of the object detection task by the model.
IV. COMPARATIVE PERFORMANCE ANALYSIS
Faster RCNN on PASCAL VOC 2007 and VOC 2012 yields 84.9% mAP and 84.2% mAP respectively. Faster RCNN also yields 62.9% mAP on the MS COCO dataset. The performance evaluation of all the models is performed on a standard powerful NVIDIA Tesla GPU. Based on the results obtained, it can be concluded that CNN performance depends on several parameters such as the IoU threshold, the backbone CNN utilized, GPU capabilities and the number of training iterations performed [24]. Table III depicts the approximate performance of CNNs depending on the considered benchmark dataset, backbone CNN, target classes of objects for classification and GPU used for experimentation.
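The evaluation metrics of Section III (IoU, average precision and mAP) can be illustrated with a minimal Python sketch. It is not the evaluation code used in this study. The function names, the (x_min, y_min, x_max, y_max) box convention and the all-point interpolation of the precision-recall curve are assumptions made for illustration. The true-positive flags are assumed to have been obtained beforehand by matching each detection to a ground truth box with an IoU threshold (for example 0.5).

```python
from typing import Dict, Sequence, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(scores: Sequence[float],
                      is_true_positive: Sequence[bool],
                      num_ground_truth: int) -> float:
    """Area under the interpolated precision-recall curve of one category."""
    if num_ground_truth == 0:
        return 0.0
    # Visit detections from the highest to the lowest confidence score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Step (1): at each recall value keep the maximum precision that is
    # reached at that recall or at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Step (2): sum the area under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP is the mean of the per-category AP values."""
    if not ap_per_class:
        return 0.0
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` compares two 2x2 boxes that share a 1x1 region, so the intersection is 1 and the union is 7. Benchmark toolkits differ in the interpolation scheme and in the IoU thresholds they average over, so values from this sketch are not directly comparable with published leaderboard numbers.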
Table III. CNNs performance evaluation statistics

| Dataset | CNN | Accuracy (mAP) |
|---|---|---|

capabilities of weak or unsupervised based object detection support.
speed; however, they consume enormous training time and also incur high annotation costs, making them inappropriate for real-time robust multiple object detection tasks, especially for the detection of novel unknown objects that occur in large-scale datasets. In conclusion, as a future research direction, a hybrid CNN that combines several CNNs needs to be developed. It requires careful integration of existing powerful CNNs, whereby the resulting fusion based generic object detection framework promises better state-of-the-art performance than any individual dedicated CNN, along with support for either weakly supervised or unsupervised object detection. A dataset comprising unrecognized novel real-world objects still needs to be constructed and made public.
REFERENCES
[1] Zou, "A Review of Object Detection Techniques," 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), 2019, pp. 251-254, doi: 10.1109/ICSGEA.2019.00065.
[2] C. Harris and M. Stephens, "A combined corner and edge detector", Proceedings of the 4th Alvey Vision Conference, pp. 147-151, 1988.
[3] D.G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[4] Fangming Bi, Xin Ma et al., Review on Video Object Tracking Based on Deep Learning, Journal of New Media, JNM, vol.1, no.2, pp.63-74, 2019. doi:10.32604/jnm.2019.06253.
[5] Russakovsky, O., Deng, J., Su, H. et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 115, 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
[6] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (pp. 740-755). Springer International Publishing.
[7] Everingham, M., Eslami, S.M.A., Van Gool, L. et al. The PASCAL Visual Object Classes Challenge: A Retrospective. Int J Comput Vis 111, 98–136 (2015). https://doi.org/10.1007/s11263-014-0733-5
[8] Srivastava, S., Divekar, A.V., Anilkumar, C. et al. Comparative analysis of deep learning image detection algorithms. J Big Data 8, 66 (2021). https://doi.org/10.1186/s40537-021-00434-w
[9] R. L. Galvez, A. A. Bandala, E. P. Dadios, R. R. P. Vicerra and J. M. Z. Maningo, "Object Detection Using Convolutional Neural Networks," TENCON 2018 - 2018 IEEE Region 10 Conference, Jeju, Korea (South), 2018, pp. 2023-2027, doi: 10.1109/TENCON.2018.8650517.
[10] Diwan T, Anirudh G, Tembhurne JV. Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed Tools Appl. 2022 Aug 8:1-33. doi: 10.1007/s11042-022-13644-y. Epub ahead of print. PMID: 35968414; PMCID: PMC9358372.
[11] Bohush, R., Ablameyko, S.V., Ihnatsyeva, S., & Adamovskiy, Y. (2021). Object Detection Algorithm for High Resolution Images Based on Convolutional Neural Network and Multiscale Processing. International Workshop on Computer Modeling and Intelligent Systems.
[12] Zhi-Hua Zhou, A brief introduction to weakly supervised learning, National Science Review, Volume 5, Issue 1, January 2018, Pages 44–53, https://doi.org/10.1093/nsr/nwx106
[13] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, May 2004.
[14] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
[16] Lohia, Aditya; Kadam, Kalyani Dhananjay; Joshi, Rahul Raghvendra; and Bongale, Dr. Anupkumar M., "Bibliometric Analysis of One-stage and Two-stage Object Detection" (2021). Library Philosophy and Practice (e-journal). 4910. https://digitalcommons.unl.edu/libphilprac/4910
[17] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014.
[18] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
[19] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.
[20] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017, doi: 10.1109/TPAMI.2016.2577031.
[21] Liu, L., Ouyang, W., Wang, X. et al. Deep Learning for Generic Object Detection: A Survey. Int J Comput Vis 128, 261–318 (2020). https://doi.org/10.1007/s11263-019-01247-4
[22] Everingham, M., Van Gool, L., Williams, C.K.I. et al. The PASCAL Visual Object Classes (VOC) Challenge. Int J Comput Vis 88, 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
[23] Zhu H, Wei H, Li B, Yuan X, Kehtarnavaz N. A Review of Video Object Detection: Datasets, Metrics and Methods. Applied Sciences. 2020; 10(21):7834. https://doi.org/10.3390/app10217834
[24] Liu, L., Ouyang, W., Wang, X. et al. Deep Learning for Generic Object Detection: A Survey. Int J Comput Vis 128, 261–318 (2020). https://doi.org/10.1007/s11263-019-01247-4
[25] Borji, A., Cheng, MM., Hou, Q. et al. Salient object detection: A survey. Comp. Visual Media 5, 117–150 (2019). https://doi.org/10.1007/s41095-019-0149-9
[26] Kaur J, Singh W. Tools, techniques, datasets and application areas for object detection in an image: a review. Multimed Tools Appl. 2022;81(27):38297-38351. doi: 10.1007/s11042-022-13153-y. Epub 2022 Apr 23. PMID: 35493415; PMCID: PMC9033309.
[27] Eric Crawford and Joelle Pineau. 2019. Spatially invariant unsupervised object detection with convolutional neural networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI'19/IAAI'19/EAAI'19). AAAI Press, Article 419, 3412–3420. https://doi.org/10.1609/aaai.v33i01.33013412
[28] Pal, S.K., Pramanik, A., Maiti, J. et al. Deep learning in multi-object detection and tracking: state of the art. Appl Intell 51, 6400–6429 (2021). https://doi.org/10.1007/s10489-021-02293-7