
A Survey on Performance Metrics for

Object-Detection Algorithms
Rafael Padilla, Sergio L. Netto, Eduardo A. B. da Silva
PEE, COPPE, Federal University of Rio de Janeiro, P.O. Box 68504, RJ, 21945-970, Brazil
{rafael.padilla, sergioln, eduardo}@smt.ufrj.br

Abstract—This work explores and compares the plethora of metrics for the performance evaluation of object-detection algorithms. Average precision (AP), for instance, is a popular metric for evaluating the accuracy of object detectors by estimating the area under the curve (AUC) of the precision × recall relationship. Depending on the point interpolation used in the plot, two different AP variants can be defined and, therefore, different results are generated. AP also has six additional variants, further increasing the possibilities for benchmarking. The lack of consensus in different works and AP implementations is a problem faced by the academic and scientific communities. Metric implementations written in different computational languages and platforms are usually distributed with corresponding datasets sharing a given bounding-box description. Such projects indeed help the community with evaluation tools, but demand extra work to be adapted for other datasets and bounding-box formats. This work reviews the most used metrics for object detection, highlighting their differences, applications, and main concepts. It also proposes a standard implementation that can be used as a benchmark among different datasets with minimum adaptation of the annotation files.

Keywords—object-detection metrics, average precision, object-detection challenges, bounding boxes.

I. INTRODUCTION

Object detection is an extensively studied topic in the field of computer vision. Different approaches have been employed to solve the growing need for accurate object-detection models [1]. The Viola-Jones framework [2], for instance, became popular due to its successful application to the face-detection problem [3], and was later applied to different subtasks such as pedestrian [4] and car [5] detection. More recently, with the popularization of convolutional neural networks (CNN) [6]–[9] and GPU-accelerated deep-learning frameworks, object-detection algorithms started being developed from a new perspective [10], [11]. Works such as OverFeat [12], R-CNN [13], Fast R-CNN [14], Faster R-CNN [15], R-FCN [16], SSD [17] and YOLO [18]–[20] greatly increased the performance standards in the field. World-famous competitions such as the PASCAL VOC Challenge [21], COCO [22], the ImageNet Object Detection Challenge [23], and the Google Open Images Challenge [24] have, as their top object-detection algorithms, methods inspired by the aforementioned works. Differently from algorithms such as Viola-Jones, CNN-based detectors are flexible enough to be trained with several (hundreds or even a few thousand) classes.

Fig. 1: Examples of detections performed by YOLO [20] on different datasets: (a) PASCAL VOC; (b) personal dataset; (c) COCO. Besides the bounding-box coordinates of a detected object, the output also includes the confidence level and its class.

A detector outcome is commonly composed of a list of bounding boxes, confidence levels and classes, as seen in Figure 1. However, the standard output-file format varies considerably across detection algorithms. Bounding-box detections are mostly represented by their top-left and bottom-right coordinates (x_ini, y_ini, x_end, y_end), with a notable exception being the YOLO [18]–[20] algorithm, which differs from the others by describing the bounding boxes by their center coordinates, width, and height, normalized by the image dimensions: (x_center/image_width, y_center/image_height, box_width/image_width, box_height/image_height).
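To illustrate the difference between the two bounding-box descriptions, the following minimal Python sketch converts a YOLO-style normalized (x_center, y_center, width, height) box into absolute (x_ini, y_ini, x_end, y_end) corner coordinates; the function name and example values are illustrative only, not part of any detector's API.

def yolo_to_corners(x_c, y_c, w, h, image_width, image_height):
    """Convert a YOLO-style relative box to absolute corner coordinates."""
    box_w = w * image_width
    box_h = h * image_height
    x_ini = x_c * image_width - box_w / 2
    y_ini = y_c * image_height - box_h / 2
    return x_ini, y_ini, x_ini + box_w, y_ini + box_h

# Example: a box centered in a 640x480 image, covering half of each dimension.
print(yolo_to_corners(0.5, 0.5, 0.5, 0.5, 640, 480))  # (160.0, 120.0, 480.0, 360.0)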
Different challenges, competitions, and hackathons [21], [23]–[27] attempt to assess the performance of object detections in specific scenarios by using real-world annotated images [28]–[30]. In these events, participants are given a testing, non-annotated image set in which objects have to be detected by their proposed methods. Some competitions provide their own (or third-party) source code, allowing the participants to evaluate their algorithms on an annotated validation image set before submitting their testing-set detections. In the end, each team sends a list of bounding-box coordinates with their respective classes and (sometimes) their confidence levels to be evaluated.

In most competitions, the average precision (AP) and its derivations are the metrics adopted to assess the detections and thus rank the teams. The PASCAL VOC dataset [31] and challenge [21] provide their own source code to measure the AP and the mean AP (mAP) over all object classes. The City Intelligence Hackathon [27] uses the source code distributed in [32] to rank the participants, also based on AP and mAP. The ImageNet Object Localization challenge [23] does not recommend any code to compute its evaluation metric, but provides a pseudo-code explaining it. The Open Images 2019 [24] and Google AI Open Images [26] challenges use mAP, referencing a tool to evaluate the results [33], [34]. The Lyft 3D Object Detection for Autonomous Vehicles challenge [25] does not reference any external tool, but uses the AP averaged over 10 different thresholds, the so-called AP@50:5:95 metric.

This work reviews the most popular metrics used to evaluate object-detection algorithms, including their main concepts, pointing out their differences, and establishing a comparison between different implementations. In order to introduce its main contributions, this work is divided into the following topics: Section II explains the main performance metrics employed in the field of object detection and how the AP metric can produce ambiguous results; Section III describes some of the best-known object-detection challenges and their employed performance metrics; and Section IV presents a project implementing the AP metric that can be used with any annotation format.

II. MAIN PERFORMANCE METRICS

Among the different annotated datasets used by object-detection challenges and the scientific community, the most common metric used to measure the accuracy of the detections is the AP. Before examining the variations of the AP, we should review some concepts that are shared among them. The most basic ones are defined below:

• True positive (TP): a correct detection of a ground-truth bounding box;
• False positive (FP): an incorrect detection of a nonexistent object or a misplaced detection of an existing object;
• False negative (FN): an undetected ground-truth bounding box.

It is important to note that, in the object-detection context, a true-negative (TN) result does not apply, as there is an infinite number of bounding boxes that should not be detected within any given image.

The above definitions require establishing what a "correct detection" and an "incorrect detection" are. A common way to do so is using the intersection over union (IOU), a measurement based on the Jaccard index, a coefficient of similarity for two sets of data [35]. In the object-detection scope, the IOU measures the overlapping area between the predicted bounding box B_p and the ground-truth bounding box B_gt divided by the area of their union, that is,

J(B_p, B_gt) = IOU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt),    (1)

as illustrated in Figure 2.

Fig. 2: Intersection over union (IOU).

By comparing the IOU with a given threshold t, we can classify a detection as being correct or incorrect: if IOU ≥ t, the detection is considered correct; if IOU < t, it is considered incorrect.

Since, as stated above, true negatives (TN) are not used in object-detection frameworks, one refrains from using any metric that is based on the TN, such as the TPR, FPR and ROC curves [36]. Instead, the assessment of object-detection methods is mostly based on the precision P and recall R, respectively defined as

P = TP / (TP + FP) = TP / (all detections),    (2)
R = TP / (TP + FN) = TP / (all ground truths).    (3)

Precision is the ability of a model to identify only relevant objects; it is the percentage of correct positive predictions. Recall is the ability of a model to find all relevant cases (all ground-truth bounding boxes); it is the percentage of correct positive predictions among all given ground truths.
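To make these definitions concrete, the following minimal Python sketch computes the IOU of Eq. (1) for boxes given as (x_ini, y_ini, x_end, y_end) corner coordinates, and the precision and recall of Eqs. (2)–(3) from accumulated counts. The helper functions are illustrative and are not taken from any particular evaluation library.

def iou(box_p, box_gt):
    """IOU (Eq. 1) for boxes given as (x_ini, y_ini, x_end, y_end)."""
    x1 = max(box_p[0], box_gt[0])
    y1 = max(box_p[1], box_gt[1])
    x2 = min(box_p[2], box_gt[2])
    y2 = min(box_p[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)            # area(B_p ∩ B_gt)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter                          # area(B_p ∪ B_gt)
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision (Eq. 2) and recall (Eq. 3) from accumulated counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# A detection is counted as TP when iou(...) >= t (e.g. t = 0.5).
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # 0.3913...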
The precision × recall curve can be seen as a trade-off between precision and recall for different confidence values associated with the bounding boxes generated by a detector. If the confidence of a detector is such that its FP count is low, the precision will be high. However, in this case, many positives may be missed, yielding a high FN count and thus a low recall. Conversely, if one accepts more positives, the recall will increase, but the FP count may also increase, decreasing the precision. A good object detector should find all ground-truth objects (FN = 0, i.e., high recall) while identifying only relevant objects (FP = 0, i.e., high precision). Therefore, a particular object detector can be considered good if its precision stays high as its recall increases, which means that if the confidence threshold varies, the precision and recall will both remain high. Hence, a high area under the curve (AUC) tends to indicate both high precision and high recall. Unfortunately, in practical cases, the precision × recall plot is often a zigzag-like curve, posing challenges to an accurate measurement of its AUC. This is circumvented by processing the precision × recall curve in order to remove the zigzag behavior prior to AUC estimation. There are basically two approaches to do so: the 11-point interpolation and the all-point interpolation.

In the 11-point interpolation, the shape of the precision × recall curve is summarized by averaging the maximum precision values at a set of 11 equally spaced recall levels [0, 0.1, 0.2, ..., 1], as given by

AP_11 = (1/11) Σ_{R ∈ {0, 0.1, ..., 0.9, 1}} P_interp(R),    (4)

where

P_interp(R) = max_{R̃ : R̃ ≥ R} P(R̃).    (5)

In this definition of AP, instead of using the precision P(R) observed at each recall level R, the AP is obtained by considering the maximum precision P_interp(R) whose recall value is greater than or equal to R.

In the all-point interpolation, instead of interpolating at only 11 equally spaced points, one interpolates through all points, in such a way that

AP_all = Σ_n (R_{n+1} − R_n) P_interp(R_{n+1}),    (6)

where

P_interp(R_{n+1}) = max_{R̃ : R̃ ≥ R_{n+1}} P(R̃).    (7)

In this case, instead of using the precision observed at only a few points, the AP is obtained by interpolating the precision at each recall level, taking the maximum precision whose recall value is greater than or equal to R_{n+1}.

The mean AP (mAP) is a metric used to measure the accuracy of object detectors over all classes in a specific database. The mAP is simply the average AP over all classes [15], [17], that is,

mAP = (1/N) Σ_{i=1}^{N} AP_i,    (8)

with AP_i being the AP of the i-th class and N the total number of classes being evaluated.

A. A Practical Example

As stated previously, the AP is calculated individually for each class. In the example shown in Figure 3, the boxes represent detections (red boxes identified by a letter - A, B, ..., Y) and the ground truth (green boxes) of a given class.

Fig. 3: Example of 24 detections (red boxes) performed by an object detector aiming to detect 15 ground-truth objects (green boxes) belonging to the same class.

The percentage value drawn next to each red box represents the detection confidence for this object class. In order to evaluate the precision and recall of the 24 detections among the 15 ground-truth boxes distributed over seven images, an IOU threshold t needs to be established. In this example, let us consider as a TP a detection box having IOU ≥ 30%. Note that each value of IOU threshold yields a different AP metric, and thus the threshold used must always be indicated.

Table I presents each detection ordered by its confidence level. For each detection, if its area overlaps 30% or more of a ground truth (IOU ≥ 30%), the TP column is set to 1; otherwise it is set to 0 and the detection is considered an FP. Some detectors can output multiple detections overlapping a single ground truth (e.g. detections D and E in Image 2; G, H and I in Image 3). In those cases the detection with the highest confidence is considered a TP and the others are considered FPs, as applied by the PASCAL VOC 2012 challenge. The columns acc TP and acc FP accumulate the total amount of TP and FP along all the detections above the corresponding confidence level. Figure 4 depicts the calculated precision and recall values for this case.

TABLE I: Computation of precision and recall values for IOU threshold = 30%

detection  confidence  TP  FP  acc TP  acc FP  precision  recall
R          95%         1   0   1       0       1          0.0666
Y          95%         0   1   1       1       0.5        0.0666
J          91%         1   0   2       1       0.6666     0.1333
A          88%         0   1   2       2       0.5        0.1333
U          84%         0   1   2       3       0.4        0.1333
C          80%         0   1   2       4       0.3333     0.1333
M          78%         0   1   2       5       0.2857     0.1333
F          74%         0   1   2       6       0.25       0.1333
D          71%         0   1   2       7       0.2222     0.1333
B          70%         1   0   3       7       0.3        0.2
H          67%         0   1   3       8       0.2727     0.2
P          62%         1   0   4       8       0.3333     0.2666
E          54%         1   0   5       8       0.3846     0.3333
X          48%         1   0   6       8       0.4285     0.4
N          45%         0   1   6       9       0.4        0.4
T          45%         0   1   6       10      0.375      0.4
K          44%         0   1   6       11      0.3529     0.4
Q          44%         0   1   6       12      0.3333     0.4
V          43%         0   1   6       13      0.3157     0.4
I          38%         0   1   6       14      0.3        0.4
L          35%         0   1   6       15      0.2857     0.4
S          23%         0   1   6       16      0.2727     0.4
G          18%         1   0   7       16      0.3043     0.4666
O          14%         0   1   7       17      0.2916     0.4666

Fig. 4: Precision × recall curve with values calculated for each detection in Table I.

As mentioned above, each interpolation method yields a different AP result. With the 11-point interpolation (Figure 5),

AP_11 = (1/11)(1 + 0.6666 + 0.4285 + 0.4285 + 0.4285) = 26.84%,

and with the all-point interpolation (Figure 6),

AP_all = 1·(0.0666 − 0) + 0.6666·(0.1333 − 0.0666) + 0.4285·(0.4 − 0.1333) + 0.3043·(0.4666 − 0.4) = 24.56%.

Fig. 5: Precision × recall curve of the points from Table I using the 11-point interpolation approach.

Fig. 6: Precision × recall curve of the points from Table I applying interpolation with all points.
From what we have seen so far, benchmarks are not truly comparable if the method used to calculate the AP is not reported. Works found in the literature [1], [9], [12]–[20], [37] usually neither mention the method used nor reference the adopted tool to evaluate their results. This problem does not occur as often in challenges, as it is a common practice to include a reference software tool so that the participants can evaluate their results. Also, it is not rare for a detector to assign the same confidence level to different detections. Table I, for example, shows that detections R and Y obtained the same confidence level (95%). Depending on the criterion used by a certain implementation, one or the other detection can be sorted as the first detection in the table, directly affecting the final result of an object-detection algorithm. Some implementations may consider the order in which each detection was reported as the tiebreaker (usually one or more evaluation files contain the detections to be evaluated), but in general there is no common consensus among the evaluation tools.

III. OBJECT-DETECTION CHALLENGES AND THEIR AP VARIANTS

New techniques are constantly being developed, and new state-of-the-art object-detection algorithms keep arising. Comparing their results across different works is not an easy task: sometimes the applied metrics vary, or the implementations used by the different authors are not the same, generating dissimilar results. This section covers the main challenges and their most popular AP variants found in the literature.

The PASCAL VOC [31] is an object-detection challenge first released in 2005. From 2005 to 2012, new versions of the PASCAL VOC were released with increasing numbers of images and classes, starting with four classes and reaching 20 classes in its last update. The PASCAL VOC competition still accepts submissions, revealing state-of-the-art algorithms for object detection ever since. The challenge applies the 11-point interpolated precision (see Section II) and uses the mean AP over all of its classes to rank the submission performances, as implemented by the provided development kit.

The Open Images 2019 challenge [24], in its object-detection track, uses the Open Images Dataset [29], containing 12.2 M annotated bounding boxes across 500 object categories on 1.7 M images. Due to its hierarchical annotations, the same object can belong to a main class and multiple sub-classes (e.g. 'helmet' and 'football helmet'). Because of that, the users should report the class and subclasses of a given detection. If only the main class is correctly reported for a detected bounding box, the unreported subclasses negatively affect the score, as each is counted as a false negative. The metric employed by this challenge is the mean AP over all classes, computed using the TensorFlow Object Detection API [33].

The COCO detection challenge (bounding box) [22] is a competition which provides bounding-box annotations for more than 200,000 images comprising 80 object categories.
The submitted works are ranked according to metrics gathered into four main groups:

• AP: the AP is evaluated at different IOU thresholds. It can be calculated for 10 IOUs varying in a range from 50% to 95% with steps of 5%, usually reported as AP@50:5:95. It can also be evaluated at single IOU values, of which the most common are 50% and 75%, reported as AP50 and AP75, respectively;
• AP Across Scales: the AP is determined for objects in three different sizes: small (area < 32² pixels), medium (32² < area < 96² pixels), and large (area > 96² pixels);
• Average Recall (AR): the AR is estimated by the maximum recall values given a fixed number of detections per image (1, 10 or 100), averaged over IOUs and classes;
• AR Across Scales: the AR is determined for objects in the same three sizes as in the AP Across Scales, usually reported as AR-S, AR-M, and AR-L, respectively.

Tables II and III present results obtained by different object detectors for the COCO and PASCAL VOC challenges, as given in [20], [38]. Due to the different bounding-box annotation formats, researchers tend to report only the metrics supported by the source code distributed with each dataset. Besides that, works that use datasets with other annotation formats [39] are forced to convert their annotations to PASCAL VOC's and COCO's formats before using their evaluation codes.

TABLE II: Results using AP variants obtained by different methods on the COCO dataset [40].

methods                                  AP@50:5:95  AP50  AP75  AP-S  AP-M  AP-L
Faster R-CNN with ResNet-101 [9], [15]   34.9        55.7  37.4  15.6  38.7  50.9
Faster R-CNN with FPN [15], [41]         36.2        59.1  39.0  18.2  39.0  48.2
Faster R-CNN by G-RMI [15], [42]         34.7        55.5  36.7  13.5  38.1  52.0
Faster R-CNN with TDM [15], [43]         36.8        57.7  39.2  16.2  39.8  52.1
YOLO v2 [19]                             21.6        44.0  19.2  5.0   22.4  35.5
YOLO v3 [20]                             33.0        57.9  34.4  18.3  35.4  41.9
SSD513 with ResNet-101 [9], [17]         31.2        50.4  33.3  10.2  34.5  49.8
DSSD513 with ResNet-101 [9], [44]        33.2        53.3  35.2  13.0  35.4  51.1
RetinaNet [40]                           39.1        59.1  42.3  21.8  42.7  50.2

TABLE III: Results using the AP variant (mAP) obtained by different methods on the PASCAL VOC 2012 dataset [38].

methods               mAP
Faster R-CNN * [15]   70.4
YOLO v1 [18]          57.9
YOLO v2 ** [19]       78.2
SSD300 ** [17]        79.3
SSD512 ** [17]        82.2

(*) trained with PASCAL VOC dataset images only, while (**) trained with COCO dataset images.

The metric AP50 in Table II is calculated in the same way as the metric mAP in Table III, but as the methods were trained and tested on different datasets, different results are obtained in the two evaluations. Due to the need for conversions between the bounding-box annotations of different datasets, researchers in general do not evaluate all methods with all possible metrics. In practice, it would be more meaningful if methods trained and tested with one dataset (PASCAL VOC, for instance) could also be evaluated by the metrics employed in other datasets (COCO, for instance).

IV. AN OPEN-SOURCE PERFORMANCE METRIC REPOSITORY

In order to help other researchers and the academic community to obtain trustworthy results that are comparable regardless of the detector, the database, or the format of the ground-truth annotations, a library was developed in Python implementing the AP metric and extensible to its variations. Easy-to-use functions implement the same metrics used as benchmarks by the most popular competitions and object-detection research works. The proposed implementation does not require modifications of the detection model to match complicated input formats, avoiding conversions to XML or JSON files. To assure the accuracy of the results, the implementation followed the definitions to the letter, and our results were carefully compared against the official implementations, matching them precisely. The variations of the AP metric such as mAP, AP50, AP75 and AP@50:5:95, using either the 11-point or the all-point interpolation, can be obtained with the proposed library.

The input-data format (ground-truth bounding boxes and detected bounding boxes) was simplified, requiring a single format to compute all AP variants. The required format is straightforward and can support the most popular detectors. For the ground-truth bounding boxes, a single text file should be created for each image, with each line in one of the following formats:

<class> <left> <top> <right> <bottom>
<class> <left> <top> <width> <height>

For the detections, a text file for each image should include a line for each bounding box in one of the following formats:

<class> <confidence> <left> <top> <right> <bottom>
<class> <confidence> <left> <top> <width> <height>

The second option supports YOLO's output bounding-box format. Besides specifying the input formats of the bounding boxes, one can also set the IOU threshold used to consider a TP (useful to calculate the metrics AP@50:5:95, AP50 and AP75) and the interpolation method (11-point interpolation or interpolation with all points). The tool outputs plots such as the ones in Figures 5 and 6, the final mAP, and the AP for each class, giving a better view of the results per class. The tool also provides an option to generate output images with the bounding boxes drawn on them, as shown in Figure 1.

The project distributed with this paper can be accessed at: https://github.com/rafaelpadilla/Object-Detection-Metrics. So far, our framework has helped researchers to obtain AP metrics and their variations in a simple way, supporting the most popular formats used by datasets and avoiding conversions to XML or JSON files. The proposed tool has been used as the official evaluation tool in the competition [27], adopted in third-party libraries such as [45], and used by many other works, such as [46]–[48].
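As an illustration of the annotation format described above, the following sketch shows how such per-image ground-truth and detection files could be read into simple records. The folder names, the choice of the corner-coordinate variant, and the helper names are assumptions made for the example, not the repository's actual API.

from pathlib import Path

def load_annotations(folder, with_confidence):
    """Read one text file per image; each line is either
    '<class> <left> <top> <right> <bottom>' (ground truth) or
    '<class> <confidence> <left> <top> <right> <bottom>' (detections)."""
    records = []
    for txt in Path(folder).glob("*.txt"):
        for line in txt.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            cls = parts[0]
            values = [float(v) for v in parts[1:]]
            conf = values.pop(0) if with_confidence else None
            left, top, right, bottom = values
            records.append({"image": txt.stem, "class": cls,
                            "confidence": conf,
                            "box": (left, top, right, bottom)})
    return records

# Hypothetical folder layout: one annotation file per image in each directory.
ground_truths = load_annotations("groundtruths", with_confidence=False)
detections = load_annotations("detections", with_confidence=True)

From such records, one could then group the detections by class, sort them by decreasing confidence, and apply the AP computation illustrated in Section II-A.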
REFERENCES

[1] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, no. 3, pp. 334–352, Aug. 2004.
[2] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, Dec. 2001, pp. 511–518.
[3] R. Padilla, C. Costa Filho, and M. Costa, "Evaluation of Haar cascade classifiers designed for face detection," World Academy of Science, Engineering and Technology, vol. 64, pp. 362–365, 2012.
[4] E. Ohn-Bar and M. M. Trivedi, "To boost or not to boost? On the limits of boosted trees for object detection," in IEEE International Conference on Pattern Recognition, Dec. 2016, pp. 3350–3355.
[5] Z. Sun, G. Bebis, and R. Miller, "On-road vehicle detection: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 694–711, May 2006.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 1–9.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998, pp. 2278–2324.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 770–778.
[10] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[11] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, July 2006.
[12] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," CoRR, 2013.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
[14] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision, Dec. 2015.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems 28, 2015, pp. 91–99.
[16] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," CoRR, 2016.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, "SSD: Single shot multibox detector," CoRR, 2015.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[19] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[20] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," Technical Report, 2018.
[21] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
[22] COCO detection challenge (bounding box). [Online]. Available: https://competitions.codalab.org/competitions/20794
[23] ImageNet. ImageNet object localization challenge. [Online]. Available: https://www.kaggle.com/c/imagenet-object-localization-challenge/
[24] G. Research. Open Images 2019 - object detection challenge. [Online]. Available: https://www.kaggle.com/c/open-images-2019-object-detection/
[25] Lyft. Lyft 3D object detection for autonomous vehicles. [Online]. Available: https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles/
[26] G. Research. Google AI Open Images - object detection track. [Online]. Available: https://www.kaggle.com/c/google-ai-open-images-object-detection-track/
[27] City Intelligence Hackathon. [Online]. Available: https://belvisionhack.ru
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[29] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, "OpenImages: A public dataset for large-scale multi-label and multi-class image classification," 2017.
[30] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," CoRR, 2014.
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, June 2010.
[32] R. Padilla. Metrics for object detection. [Online]. Available: https://github.com/rafaelpadilla/Object-Detection-Metrics
[33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.
[34] TensorFlow. Detection evaluation protocols. [Online]. Available: https://github.com/tensorflow
[35] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
[36] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[37] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, "AttentionNet: Aggregating weak directions for accurate object detection," in IEEE International Conference on Computer Vision, 2015, pp. 2659–2667.
[38] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[39] "An annotated video database for abandoned-object detection in a cluttered environment," in International Telecommunications Symposium, 2014, pp. 1–5.
[40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[41] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[42] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7310–7311.
[43] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, "Beyond skip connections: Top-down modulation for object detection," arXiv, 2016.
[44] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv, 2017.
[45] Computer Research Institute of Montreal (CRIM). thelper package. [Online]. Available: https://thelper.readthedocs.io/en/latest/thelper.optim.html
[46] C. Adleson and D. C. Conner, "Comparison of classical and CNN-based detection techniques for state estimation in 2D," Journal of Computing Sciences in Colleges, vol. 35, no. 3, pp. 122–133, 2019.
[47] A. Borji and S. M. Iranmanesh, "Empirical upper-bound in object detection and more," arXiv, 2019.
[48] D. Caschili, M. Poncino, and T. Italia, "Optimization of CNN-based object detection algorithms for embedded systems," Master's dissertation, Politecnico di Torino, 2019.
