
From classical techniques to convolution-based models: A review of object detection algorithms

FNU Neha, Dept. of Computer Science, Kent State University, Kent, OH, USA
Deepshikha Bhati, Dept. of Computer Science, Kent State University, Kent, OH, USA
Deepak Kumar Shukla, Rutgers Business School, Rutgers University, Newark, NJ, USA
Md Amiruzzaman, Dept. of Computer Science, West Chester University, West Chester, PA, USA

arXiv:2412.05252v1 [cs.CV] 6 Dec 2024

Abstract—Object detection is a fundamental task in computer vision and image understanding, with the goal of identifying and localizing objects of interest within an image while assigning them corresponding class labels. Traditional methods, which relied on handcrafted features and shallow models, struggled with complex visual data and showed limited performance. These methods combined low-level features with contextual information and lacked the ability to capture high-level semantics. Deep learning, especially Convolutional Neural Networks (CNNs), addressed these limitations by automatically learning rich, hierarchical features directly from data. These features include both semantic and high-level representations essential for accurate object detection. This paper reviews object detection frameworks, starting with classical computer vision methods. We categorize object detection approaches into two groups: (1) classical computer vision techniques and (2) CNN-based detectors. We compare major CNN models, discussing their strengths and limitations. In conclusion, this review highlights the significant advancements in object detection through deep learning and identifies key areas for further research to improve performance.

Index Terms—Object Detection, CNN, Deep Learning, Image Processing, Computer Vision

Fig. 1: (A) Single-object sunflower: A single bounding box localizes and classifies the central sunflower bloom. (B) Multiple-object sunflower: Multiple bounding boxes highlight and classify overlapping sunflowers and leaves, illustrating multi-scale object detection and localization within a complex scene.

I. INTRODUCTION

Deep learning (DL) has advanced image analysis, especially in object classification, localization, and detection tasks. In classification, the aim is to assign an image, or an object within it, to one of several categories [1]. However, classification does not provide the object's location. Localization improves on this by identifying both the object's category and position, typically with a bounding box [2], though the precision of these boxes can vary. Object detection further extends classification and localization by detecting and classifying multiple objects in an image, providing bounding boxes for each [2]. A bounding box's top-left corner is represented by (Xmin, Ymin) and its bottom-right corner by (Xmax, Ymax), along with a label indicating the object's class, as shown in Fig. 1.

Object detection has applications across fields such as medical imaging, logo detection, facial recognition, pedestrian detection, and industrial automation. However, challenges arise from image transformations such as changes in scale, orientation, and lighting. While classical computer vision techniques provided a foundation, advancements in deep learning, especially CNNs, have significantly improved detection performance. Modern methods use hierarchical representations, enabling object detection in complex environments with occlusions and varying scales.

Although many studies have reviewed specific deep learning models or object detection applications, few provide a comprehensive overview of both classical computer vision techniques and CNN-based approaches. This paper addresses this gap by offering an analysis of both. Key contributions include:
1) A review of classical computer vision techniques for object detection.
2) An analysis of generic region proposal generation techniques.
3) A detailed review of convolution-based models for object detection, including two-stage and one-stage detectors.

The paper is organized as follows: Section II covers classical computer vision techniques for object detection, Section III discusses region proposal generation, Section IV explores CNN-based detection architectures, Section V reviews applications, Section VI lists popular datasets, Section VII covers evaluation metrics, and Section VIII concludes with future directions.

(The first author contributed the most to this paper. Corresponding author: [email protected])
II. CLASSICAL COMPUTER VISION TECHNIQUES FOR OBJECT DETECTION

Earlier computer vision techniques for image processing, particularly image similarity, relied on feature-based methods [3]–[8]. These methods focused on extracting distinctive image features to reduce computational costs while enabling robust image matching despite transformations like scaling or rotation [3]. The Scale-Invariant Feature Transform (SIFT) algorithm overcame the challenge of scaling by extracting features invariant to scale, rotation, brightness, and contrast [4]. Other feature extractors, like the Canny edge detector, contributed to tasks such as image comparison and panoramic stitching by providing resilience to transformations and occlusions [5]. The Histogram of Oriented Gradients (HOG) technique enabled efficient image analysis by measuring gradient magnitudes and directions, creating descriptive feature vectors [6].

Traditional object detection involves three stages (a minimal sketch of this pipeline appears at the end of this section):
1) Proposal Generation: Scanning the image at various positions and scales to generate candidate bounding boxes, often using methods like sliding windows or selective search algorithms.
2) Feature Extraction: Extracting features from the identified regions to capture relevant visual patterns.
3) Classification: Classifying the extracted features using machine learning algorithms, such as a support vector machine (SVM).

In 2001, Viola and Jones introduced a real-time (webcam-based) facial detection classifier [7]. In 2005, Dalal and Triggs introduced an object detector using HOG features and an SVM classifier, effective across scales but limited by pose variations [6]. In 2008, Felzenszwalb et al. improved on this with the Deformable Part Model (DPM), allowing flexible parts to handle pose variation, though it struggled with overlapping parts in multi-person images [8].

Studies from 2008 to 2012 on popular object detection datasets (see Section VI) showed key limitations in traditional methods. For instance, sliding windows require substantial computational resources and can generate redundant detections. Additionally, the performance of the classifier greatly impacts the results, necessitating more robust approaches.
impacts the results, necessitating more robust approaches. spatial pyramid pooling. ing fine-tuning.
Fast R-CNN (2015) Faster than SPPNet; intro- Relies on selective search
III. GENERIC REGION PROPOSAL GENERATION duces ROI pooling to han- for region proposals, not
TECHNIQUES dle varied input sizes. learned during training.
Faster R-CNN (2015) Uses RPN for fast region Limited in detecting small
Object detection models integrate a bounding box regressor proposals; improves effi- objects due to single fea-
ciency. ture map.
within the classification network to accurately locate objects Mask R-CNN (2017) Adds instance segmenta- High computational de-
[9]. Traditionally, this involves feeding cropped images to tion, detecting objects and mand; struggles with mo-
the localization network, resulting in excessive inputs. An masks simultaneously. tion blur at low resolution.
YOLO (2015) Real-time detection at 45 Poor detection of small
OverFeat model enhances efficiency by using a sliding window fps; single forward pass. objects; produces coarse
detector within convolution layers, scanning images with a features.
SSD (2016) Handles various resolu- Default boxes may not
large filter and stride [10]. However, indiscriminate scanning tions; uses multi-scale fea- match all shapes; possible
of background regions necessitates predicting potential object ture maps for detection. overlapping detections.
locations. Methods such as interest point detection, multiscale
saliency, color contrast, edge detection, and super-pixel clus-
A. Region-based Convolutional Neural Network (R-CNN)
tering are employed for this purpose [11]–[14].
For instance, multiscale saliency leverages the Fast Fourier In 2014, Girshick et al. introduced R-CNN, a two-stage net-
Transform to analyze features at multiple scales [11]; color work that combines classical techniques like selective search
contrast relies on color intensity differences [12]; edge de- with CNNs for object detection [17] (see Fig. 2). R-CNN’s
tection identifies edges, followed by density analysis [13]; training involves three steps:
and super-pixel clustering groups similar pixels for detailed • Fine-tune a pre-trained network (e.g., AlexNet) on region
analysis [14]. proposals generated by selective search.
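To make the classification-plus-localization head described above concrete, the following PyTorch sketch pairs a SoftMax class branch with a 4-value box-regression branch on top of backbone features; the feature dimension, layer sizes, and class count are illustrative assumptions, not any specific published architecture.

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Toy head: FC layers producing class probabilities and box coordinates."""
    def __init__(self, feat_dim=4096, num_classes=20):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1024)           # shared FC layer
        self.cls = nn.Linear(1024, num_classes + 1)   # classes + background
        self.box = nn.Linear(1024, 4)                 # (x_min, y_min, x_max, y_max)

    def forward(self, features):
        h = torch.relu(self.fc(features))
        class_probs = torch.softmax(self.cls(h), dim=-1)  # SoftMax over classes
        boxes = self.box(h)                               # box regression output
        return class_probs, boxes

head = DetectionHead()
feats = torch.randn(8, 4096)       # stand-in backbone features for 8 regions
probs, boxes = head(feats)         # (8, 21) class probabilities, (8, 4) boxes

Two-stage detectors apply such a head to each proposed region, while one-stage detectors apply it densely across the feature map.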
A. Region-based Convolutional Neural Network (R-CNN)

In 2014, Girshick et al. introduced R-CNN, a two-stage network that combines classical techniques like selective search with CNNs for object detection [17] (see Fig. 2). R-CNN's training involves three steps:
• Fine-tune a pre-trained network (e.g., AlexNet) on region proposals generated by selective search.
• Train an SVM classifier for object classification.
• Use a bounding box regressor to improve localization accuracy.

Fig. 2: R-CNN Architecture

Selective search generates around 2000 region proposals, each resized to 227×227 pixels for CNN input, reducing the computational cost of exhaustive sliding windows.

Initially, R-CNN achieved 44% accuracy, improving to 54% after fine-tuning on warped images. Adding a bounding box regressor boosted accuracy to 58%, and using VGGNet further increased it to 66%. While nine times slower than OverFeat, R-CNN's focus on region proposals reduces false positives, improving accuracy by 10%.

However, R-CNN has some limitations:
• Feature extraction is performed independently for each proposal, resulting in high computational costs.
• The separate stages of proposal generation, feature extraction, and classification prevent end-to-end optimization.
• Selective search relies on low-level visual features, struggles with complex scenes, and does not benefit from GPU acceleration.
• Despite higher accuracy compared to methods like OverFeat, R-CNN is slower due to these inefficiencies.

B. Spatial Pyramid Pooling Net (SPP-Net)

In 2015, He et al. introduced SPP-Net to improve detection speed and feature learning over R-CNN [23]. Unlike R-CNN, which processes each cropped proposal individually, SPP-Net computes the feature map for the entire image and then applies a Spatial Pyramid Pooling (SPP) layer to extract fixed-length feature vectors (see Fig. 3). The SPP layer divides the feature map into grids of varying sizes (N × N), enabling pooling at multiple scales and concatenation of the resulting feature vectors (a toy sketch of this pooling appears after this subsection).

Fig. 3: SPP-Net Architecture

SPP-Net allows multi-scale and varied-aspect-ratio inputs without resizing, preserving image details and improving both accuracy and inference speed over R-CNN. However, its multi-stage training hinders end-to-end optimization and requires extra memory for feature storage. Additionally, the SPP layer does not back-propagate to earlier layers, keeping parameters fixed before the SPP layer and limiting deeper learning.
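The following numpy sketch, a simplified single-channel version with assumed pyramid levels of 1×1, 2×2, and 4×4, shows how max-pooling a feature map at several grid sizes yields a fixed-length vector regardless of input size.

import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into 1x1, 2x2, 4x4 grids and concatenate."""
    outputs = []
    for n in levels:
        ys = np.linspace(0, feature_map.shape[0], n + 1, dtype=int)
        xs = np.linspace(0, feature_map.shape[1], n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)]
                outputs.append(cell.max())   # one max value per grid cell
    return np.array(outputs)                 # length 1 + 4 + 16 = 21, for any input

print(spatial_pyramid_pool(np.random.rand(13, 9)).shape)   # (21,)
print(spatial_pyramid_pool(np.random.rand(40, 64)).shape)  # (21,)

The ROI pooling layer used by Fast R-CNN (next subsection) can be viewed as the single-level case of this operation, applied to a proposal's sub-window of the shared feature map.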
C. Fast Region-based Convolutional Neural Network (Fast R-CNN)

In 2015, Girshick introduced Fast R-CNN, a two-stage detector designed to improve on SPP-Net's limitations [18]. Fast R-CNN computes a feature map for the entire image and uses a Region of Interest (ROI) pooling layer to extract fixed-length features from each region, dividing proposals into a fixed N × N grid. Unlike SPP, ROI pooling backpropagates error signals, enabling end-to-end optimization.

After feature extraction, features pass through FC layers, outputting (1) SoftMax probabilities for C+1 classes (including background) and (2) four bounding box regression parameters. Fast R-CNN achieved better accuracy than R-CNN and SPP-Net but still relied on traditional proposal methods.

D. Faster Region-based Convolutional Neural Network (Faster R-CNN)

In 2015, Ren et al. introduced Faster R-CNN, which utilizes a Region Proposal Network (RPN) to generate object proposals at each feature map position using a sliding window approach (Fig. 4) [19]. This method shares feature extraction across regions, enhancing efficiency and achieving state-of-the-art results. However, the separate computation for region classification can be inefficient with many proposals, and reliance on a single deep feature map makes detecting objects of varying scales difficult: deep features are semantically strong but spatially weak, while shallow features are spatially strong but semantically weak.

Fig. 4: Faster R-CNN Architecture

E. Mask R-CNN

In 2017, He et al. introduced Mask R-CNN, an extension of Faster R-CNN that performs pixel-level instance segmentation [20]. It adds a new branch for binary mask prediction to the two-stage pipeline, alongside class and box predictions. This branch uses a fully convolutional network (FCN) atop the CNN feature map. Mask R-CNN also replaces RoIPool with RoIAlign to better preserve spatial accuracy, enhancing mask precision. However, it struggles to detect objects with motion blur in low-resolution images.
F. You Only Look Once (YOLO)

To increase speed, one-stage models like YOLO (You Only Look Once) were developed, bypassing region proposals. Introduced in 2015 by Redmon et al., YOLO treats detection as a regression task [21]. Dividing the image into an S × S grid, YOLO predicts class probabilities, bounding boxes, and confidence scores per cell (a schematic decoding of this grid output appears at the end of this subsection). This captures context well, reducing false positives, but the grid structure can cause localization errors and struggles with small objects.

YOLO has undergone several iterations, each enhancing its performance:
• YOLOv2/YOLO9000 (2017): Introduced batch normalization and anchor boxes for improved speed and accuracy [24].
• YOLOv3 (2018): Added multi-scale predictions and residual connections for better detection across various sizes [25].
• YOLOv4 (2020): Enhanced with the CSPDarknet backbone and advanced training techniques, achieving higher precision [26].
• YOLOv5 (2021): Focused on usability, scalability, and deployment flexibility with various model sizes [27].
• YOLOv6 (2022): Optimized for edge devices with an improved backbone and attention mechanisms [28].
• YOLOv7 (2023): Employed AutoML techniques for dynamic model optimization, enhancing adaptability [29].
• YOLOv8 (2023): Incorporated a transformer-based backbone for better detection in dense scenes [30].
• YOLOv9 (2024): Utilized adversarial training to improve robustness against variations [31].
• YOLOv10 (2024): Implemented real-time feedback loops for dynamic adjustments, boosting accuracy [32].

These enhancements have established YOLO as a versatile and powerful option for real-time object detection.
G. Single Shot MultiBox Detector (SSD)

The Single Shot MultiBox Detector (SSD), introduced by Liu et al. in 2016, is a one-stage model that improves on YOLO by using anchors with multiple scales and aspect ratios within each grid cell [22]. Each anchor is refined by regressors and assigned probabilities across categories, with object detection predicted on multiple feature maps for different scales. SSD trains end-to-end with a weighted localization and classification loss, integrating results across maps. Using hard negative mining and extensive data augmentation, SSD matches Faster R-CNN's accuracy while allowing real-time inference.

V. APPLICATIONS

Object detection, powered by CNNs, has diverse applications, spanning from targeted advertising to self-driving cars and beyond. It is utilized for handwritten digit recognition, Optical Character Recognition (OCR), face detection, medical image analysis, sports analytics, and more.
• Optical Character Recognition (OCR): OCR converts images of text into machine-encoded text, facilitating tasks such as document digitization, automated data entry, and cognitive computing.
• Self-Driving Cars: Object detection is essential for autonomous vehicles to detect and classify objects such as cars, pedestrians, traffic lights, and road signs.
• Object Tracking: Used in tracking objects in videos, object detection has applications in surveillance, traffic monitoring, and sports analytics.
• Face Detection and Recognition: Widely employed in computer vision, object detection is used for social media image tagging and biometric security systems.
• Object Extraction from Images or Videos: Facilitates segmentation and meaningful representation of images, potentially enabling applications like video object extraction.
• Digital Watermarking: Embeds markers into digital signals for copyright protection and authentication purposes.
• Medical Imaging: Assists clinicians in diagnosis and therapy planning, particularly in tracking anatomical objects.

Object detection technology continues to evolve, promising further advancements and expanding its applications across various industries.

VI. POPULAR DATASETS

Key datasets in object detection include Pascal VOC [33], COCO [34], ImageNet [35], and Open Images [36]. Pascal VOC (Visual Object Classes) offers a manageable size, balancing complexity and computational efficiency, making it ideal for testing. COCO (Common Objects in Context) provides extensive annotations with multiple objects per image, including segmentation and key points. ImageNet, primarily used for classification, also includes object detection annotations. Open Images, with over 600 labeled categories, stands out for its large scale, offering both bounding box annotations and segmentation masks. Table II summarizes the key attributes of each dataset, emphasizing their unique features and primary usage. Table III compares the performance of R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, YOLO, and SSD on these datasets in terms of mAP, inference speed (measured in frames per second, or FPS), and model size.

VII. EVALUATION METRICS

Object detection models are assessed using several key metrics: Intersection over Union (IoU), Mean Average Precision (mAP), Precision, Recall, Confidence Score (CS), F1 Score, and Non-Maximum Suppression (NMS). Table IV summarizes these metrics, highlighting their limitations and potential biases.

A. Intersection over Union (IoU)

IoU measures the overlap between the predicted and ground truth bounding boxes, calculated as the ratio of the intersection area to the union area:

IoU = Area of Intersection / Area of Union
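This formula translates directly into code for axis-aligned boxes; a minimal sketch, assuming boxes in (x_min, y_min, x_max, y_max) form:

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)            # intersection over union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold (0.5 is a common convention).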
TABLE II: Popular Object Detection Datasets

Dataset | Number of Images | Number of Classes | Usage
Pascal VOC | 0.01 million | 20 | Initial model testing
COCO | 0.33 million | 80 | Object detection
ImageNet | 1.5 million | 1,000 | Object localization and detection
Open Images | 9.2 million | 600 | Object localization

TABLE III: Quantitative Performance Comparison of Object Detection Models on Different Datasets

Model | Pascal VOC (mAP) | COCO (mAP) | ImageNet (mAP) | Open Images (mAP) | Inference Speed (FPS) | Model Size (MB)
R-CNN | 66% | 54% | 60% | 55% | ∼5 | 200
Fast R-CNN | 70% | 59% | 63% | 58% | ∼7 | 150
Faster R-CNN | 75% | 65% | 68% | 63% | ∼10 | 180
Mask R-CNN | 76% | 66% | 69% | 64% | ∼8 | 230
YOLO | 72.5% | 58.5% | 61.5% | 57.5% | ∼45–60 | 145
SSD | 75% | 63.5% | 66.5% | 61.5% | ∼19–46 | 145

B. Mean Average Precision (mAP)

mAP evaluates model performance by averaging the Average Precision (AP) across all classes. For a single class, AP is computed as:

AP = (1/n) Σ_{k=1}^{n} (P(k) × Precision at Recall(k))

where P(k) is the change in recall from the previous highest recall, and the precision at recall k is the maximum precision observed at any recall level j where j ≥ k. A short numerical sketch follows.
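A minimal sketch of this computation, assuming detections for one class have already been matched to ground truth and sorted by descending confidence:

import numpy as np

def average_precision(is_correct, num_gt):
    """AP for one class: is_correct flags detections sorted by confidence."""
    tp = np.cumsum(is_correct)
    precision = tp / np.arange(1, len(is_correct) + 1)
    recall = tp / num_gt
    ap, prev_recall = 0.0, 0.0
    for k in range(len(is_correct)):
        # weight = change in recall; precision = max at any recall >= current
        ap += (recall[k] - prev_recall) * precision[k:].max()
        prev_recall = recall[k]
    return ap

# 5 detections (sorted by confidence) against 3 ground-truth objects
print(average_precision(np.array([1, 0, 1, 1, 0]), num_gt=3))  # ≈ 0.833

mAP is then the mean of this per-class value over all classes; benchmark-specific variants (e.g., averaging over several IoU thresholds) build on this same idea.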
C. Precision and Recall

Precision is the ratio of true positives to all positive predictions, while Recall is the ratio of true positives to all ground truth positives.

D. Confidence Score (CS)

The Confidence Score reflects the model's certainty that a predicted bounding box contains the correct object. Higher scores indicate greater accuracy and help set thresholds for accepting or rejecting detections.

E. Non-Maximum Suppression (NMS)

Non-Maximum Suppression refines bounding box predictions by sorting them by confidence score and keeping the highest-scoring box while suppressing boxes that overlap it heavily. This process ensures each object is detected once, improving accuracy and efficiency (a minimal sketch follows).
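A minimal sketch of this greedy procedure, reusing the iou helper from the Section VII-A sketch and an assumed overlap threshold of 0.5:

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop heavy overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-confidence remaining box
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # [0, 2]: the near-duplicate is dropped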
TABLE IV: Evaluation Metrics: Limitations and Potential Biases of Object Detection Models

Model | Metrics Used | Limitations | Potential Biases
R-CNN | IoU, mAP, Precision, Recall, F1 Score | Separate region proposal step slows inference; high memory usage due to multiple stages. | Favors larger objects due to reliance on selective search; struggles with scale variations and densely packed objects.
Fast R-CNN | IoU, mAP, Precision, Recall, F1 Score | Dependent on external region proposals; not optimized for real-time applications. | Similar biases to R-CNN: prefers larger and well-separated objects; performance drops in high-density scenes.
Faster R-CNN | IoU, mAP, Precision, Recall, F1 Score | More complex architecture with integrated Region Proposal Network (RPN); requires careful hyperparameter tuning. | Favors objects with distinct features detectable by the RPN; limited accuracy on small or thin objects compared to single-shot models.
Mask R-CNN | IoU, mAP, Precision, Recall, F1 Score | Increased computational overhead from mask prediction; longer training times. | Bias towards classes with abundant and detailed segmentation data; misses small or occluded objects in segmentation masks.
YOLO | IoU, mAP, Precision, Recall, Confidence Score | Lower detection accuracy on small objects; struggles with overlapping objects and crowded scenes. | Prioritizes objects at the center of the image; predefined grid may miss objects at image edges.
SSD | IoU, mAP, Precision, Recall, Confidence Score | Performance degrades on very small objects; limited by predefined anchor box scales and aspect ratios. | Bias towards predefined anchor boxes, affecting generalization to unseen scales; struggles with variable object shapes and sizes not covered by anchor boxes.

VIII. DISCUSSION AND FUTURE DIRECTIONS

This review examined prominent object detection models, classifying them into classical computer vision techniques and CNN-based methods. While recent CNN architectures have significantly improved accuracy, reducing error rates to below 5% in some settings, they also increase complexity and resource demands. Traditional models like Deformable Part Models (DPMs) are shallower and more lightweight, making them better suited for edge deployment compared to modern deep learning architectures like AlexNet and VGGNet.

Key future directions for object detection include:
• Speed-Accuracy Trade-off: Enhancing both accuracy and speed for real-time, low-power applications.
• Tiny Object Detection: Improving the detection of small objects in areas such as wildlife monitoring and medical imaging.
• 3D Object Detection: Leveraging 3D sensors for applications in augmented reality and robotics.
• Multi-modal Detection: Integrating visual and textual sources for better accuracy in complex scenarios.
• Few-shot Learning: Developing models that can effectively detect objects from limited examples, particularly in low-resource settings.

This review aims to foster interest in advancing object detection models and to inspire innovation to address current limitations, including minimizing environmental impacts.

ACKNOWLEDGMENT

This study was partly supported by the West Chester University faculty development fund.

REFERENCES

[1] L. Chen, S. Li, Q. Bai, J. Yang, S. Jiang, and Y. Miao, "Review of image classification algorithms based on convolutional neural networks," Remote Sensing, vol. 13, no. 22, p. 4712, 2021.
[2] C. B. Murthy, M. F. Hashmi, N. D. Bokde, and Z. W. Geem, "Investigations of object detection in images/videos using various deep learning techniques and embedded platforms—a comprehensive review," Applied Sciences, vol. 10, no. 9, p. 3280, 2020.
[3] J. Ma, X. Jiang, A. Fan, J. Jiang, and J. Yan, "Image matching from handcrafted to deep features: A survey," International Journal of Computer Vision, vol. 129, no. 1, pp. 23–79, 2021.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[5] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 679–698, 1986.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[7] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I–I.
[8] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.

[9] S. Schulter, C. Leistner, P. Wohlhart, P. M. Roth, and H. Bischof, "Accurate object detection with joint classification-regression random forests," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 923–930.
[10] P. Sermanet, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.
[11] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
[12] K. Fu, C. Gong, J. Yang, and Y. Zhou, "Salient object detection via color contrast and color distribution," in Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11. Springer, 2013, pp. 111–122.
[13] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 391–405.
[14] K. Fu, C. Gong, J. Yang, Y. Zhou, and I. Y.-H. Gu, "Superpixel based color contrast and color distribution driven salient object detection," Signal Processing: Image Communication, vol. 28, no. 10, pp. 1448–1463, 2013.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[18] R. Girshick, "Fast R-CNN," arXiv preprint arXiv:1504.08083, 2015.
[19] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[20] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[21] J. Redmon, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 2016, pp. 21–37.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[25] A. Farhadi and J. Redmon, "YOLOv3: An incremental improvement," in Computer Vision and Pattern Recognition, vol. 1804. Springer Berlin/Heidelberg, Germany, 2018, pp. 1–6.
[26] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[27] G. Jocher, A. Stoken, A. Chaurasia, J. Borovec, Y. Kwon, K. Michael, L. Changyu, J. Fang, P. Skalski, A. Hogan et al., "ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support," Zenodo, 2021.
[28] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie et al., "YOLOv6: A single-stage object detection framework for industrial applications," arXiv preprint arXiv:2209.02976, 2022.
[29] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[30] G. Jocher, A. Chaurasia, and J. Qiu, "Ultralytics YOLOv8," https://github.com/ultralytics/ultralytics, 2023, AGPL-3.0 license.
[31] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, "YOLOv9: Learning what you want to learn using programmable gradient information," arXiv preprint arXiv:2402.13616, 2024.
[32] A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, and G. Ding, "YOLOv10: Real-time end-to-end object detection," arXiv preprint arXiv:2405.14458, 2024.
[33] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[36] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale," International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
