From classical techniques to convolution-based models: A review of object detection algorithms
TABLE III: Quantitative Performance Comparison of Object Detection Models on Different Datasets

Model        | Pascal VOC (mAP) | COCO (mAP) | ImageNet (mAP) | Open Images (mAP) | Inference Speed (FPS) | Model Size (MB)
RCNN         | 66%              | 54%        | 60%            | 55%               | ∼5                    | 200
Fast RCNN    | 70%              | 59%        | 63%            | 58%               | ∼7                    | 150
Faster RCNN  | 75%              | 65%        | 68%            | 63%               | ∼10                   | 180
Mask RCNN    | 76%              | 66%        | 69%            | 64%               | ∼8                    | 230
YOLO         | 72.5%            | 58.5%      | 61.5%          | 57.5%             | ∼45–60                | 145
SSD          | 75%              | 63.5%      | 66.5%          | 61.5%             | ∼19–46                | 145
B. Mean Average Precision (mAP)

mAP evaluates model performance by averaging the precision across all classes. The Average Precision (AP) is computed as:

AP = \frac{1}{n} \sum_{k=1}^{n} P(k) \times \text{Precision at Recall}(k)

where P(k) is the change in recall from the previous highest recall, and Precision at Recall(k) is the maximum precision observed at any recall level j where j ≥ k.
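As a minimal illustration (a sketch of our own, not code from any surveyed model), the Python function below computes AP over a precision-recall curve using the interpolation and recall-change weighting just described; the name average_precision and the list-based inputs are illustrative assumptions.

    # Interpolated AP: precision at recall level k is the maximum precision
    # observed at any recall level j >= k, weighted by the change in recall.
    def average_precision(recalls, precisions):
        """recalls, precisions: parallel lists sorted by increasing recall."""
        ap = 0.0
        prev_recall = 0.0
        for k in range(len(recalls)):
            p_interp = max(precisions[k:])   # max precision at recall >= recalls[k]
            ap += (recalls[k] - prev_recall) * p_interp
            prev_recall = recalls[k]
        return ap

    # Example: three operating points on a precision-recall curve.
    print(average_precision([0.2, 0.6, 1.0], [1.0, 0.8, 0.5]))  # ~0.72

mAP is then the mean of this quantity over all object classes.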
C. Precision and Recall

Precision is the ratio of true positives to all positive predictions, while Recall is the ratio of true positives to all ground-truth positives.
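For concreteness, a small sketch of the two ratios with hypothetical counts of true positives (tp), false positives (fp), and false negatives (fn):

    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    def precision_recall(tp, fp, fn):
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall

    # Example: 8 correct detections, 2 spurious boxes, 4 missed objects.
    print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, ~0.667)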
D. Confidence Score (CS)

The Confidence Score reflects the model's certainty that a predicted bounding box contains the correct object. Higher scores indicate greater accuracy and help set thresholds for accepting or rejecting detections.
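Such thresholding is typically a one-line filter; the sketch below (our own illustration, with an assumed 0.5 cutoff and (box, score) tuples) keeps only detections whose confidence clears the threshold.

    # Accept a detection only if its confidence score clears the threshold.
    def filter_by_confidence(detections, threshold=0.5):
        return [(box, score) for box, score in detections if score >= threshold]

    dets = [((10, 10, 50, 50), 0.92), ((12, 8, 52, 48), 0.31)]
    print(filter_by_confidence(dets))  # keeps only the 0.92 detection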
E. Non-Maximum Suppression (NMS)

Non-Maximum Suppression refines bounding box predictions by sorting them by confidence score, selecting the highest-scoring box, and suppressing boxes that overlap it. This process ensures each object is detected only once, improving accuracy and efficiency.
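A minimal greedy NMS sketch follows (our own generic implementation, not drawn from any surveyed model); it assumes (x1, y1, x2, y2) corner boxes and an illustrative IoU threshold of 0.5.

    # Greedy NMS: keep the highest-confidence box, drop any remaining box
    # whose IoU with a kept box exceeds the threshold, and repeat.
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def nms(boxes, scores, iou_threshold=0.5):
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            # Suppress the remaining boxes that overlap the kept box too much.
            order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        return keep

    boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
    print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]; near-duplicate box 1 is suppressed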
VIII. DISCUSSION AND FUTURE DIRECTIONS

This review examined prominent object detection models, classifying them into classical computer vision techniques and CNN-based methods. While recent CNN architectures have significantly improved accuracy, driving error rates below 5% on standard benchmarks, they also increase complexity and resource demands. Traditional models such as Deformable Part Models (DPMs) are shallower and more lightweight, making them better suited for edge deployment than modern deep learning architectures such as AlexNet and VGGNet.

Key future directions for object detection include:
• Speed-Accuracy Trade-off: Enhancing both accuracy and speed for real-time, low-power applications.
• Tiny Object Detection: Improving the detection of small objects in areas such as wildlife monitoring and medical imaging.
• 3D Object Detection: Leveraging 3D sensors for applications in augmented reality and robotics.
• Multi-modal Detection: Integrating visual and textual sources for better accuracy in complex scenarios.
• Few-shot Learning: Developing models that can effectively detect objects from limited examples, particularly in low-resource settings.

This review aims to foster interest in advancing object detection models and to inspire innovation to address current limitations, including minimizing environmental impacts.
TABLE IV: Evaluation Metrics: Limitations and Potential Biases of Object Detection Models

Model: RCNN
Metrics Used: IoU, mAP, Precision, Recall, F1 Score
Limitations: Separate region proposal step slows inference; high memory usage due to multiple stages.
Potential Biases: Favors larger objects due to reliance on selective search; struggles with scale variations and densely packed objects.

Model: Fast RCNN
Metrics Used: IoU, mAP, Precision, Recall, F1 Score
Limitations: Dependent on external region proposals; not optimized for real-time applications.
Potential Biases: Similar biases to RCNN: prefers larger and well-separated objects; performance drops in high-density scenes.

Model: Faster RCNN
Metrics Used: IoU, mAP, Precision, Recall, F1 Score
Limitations: More complex architecture with integrated Region Proposal Network (RPN); requires careful hyperparameter tuning.
Potential Biases: Favors objects with distinct features detectable by the RPN; limited accuracy on small or thin objects compared to single-shot models.

Model: Mask RCNN
Metrics Used: IoU, mAP, Precision, Recall, F1 Score
Limitations: Increased computational overhead from mask prediction; longer training times.
Potential Biases: Bias towards classes with abundant and detailed segmentation data; misses small or occluded objects in segmentation masks.

Model: YOLO
Metrics Used: IoU, mAP, Precision, Recall, Confidence Score
Limitations: Lower detection accuracy on small objects; struggles with overlapping objects and crowded scenes.
Potential Biases: Prioritizes objects at the center of the image; predefined grid may miss objects at image edges.

Model: SSD
Metrics Used: IoU, mAP, Precision, Recall, Confidence Score
Limitations: Performance degrades on very small objects; limited by predefined anchor box scales and aspect ratios.
Potential Biases: Bias towards predefined anchor boxes, affecting generalization to unseen scales; struggles with variable object shapes and sizes not covered by anchor boxes.
ACKNOWLEDGMENT

This study was partly supported by the West Chester University faculty development fund.