A Survey on Performance Metrics for Object-Detection Algorithms
Rafael Padilla, Sergio L. Netto, Eduardo A. B. da Silva
PEE, COPPE, Federal University of Rio de Janeiro, P.O. Box 68504, RJ, 21945-970, Brazil
{rafael.padilla, sergioln, eduardo}@smt.ufrj.br
I. INTRODUCTION
Object detection is an extensively studied topic in the field of computer vision. Different approaches have been employed to solve the growing need for accurate object-detection models [1]. The Viola-Jones framework [2], for instance, became popular due to its successful application to the face-detection problem [3], and was later applied to different subtasks such as pedestrian [4] and car [5] detection. More recently, with the popularization of convolutional neural networks (CNN) [6]–[9] and GPU-accelerated deep-learning frameworks, object-detection algorithms started being developed from a new perspective [10], [11]. Works such as Overfeat [12], R-CNN [13], Fast R-CNN [14], Faster R-CNN [15], R-FCN [16], SSD [17] and YOLO [18]–[20] greatly raised the performance standards in the field. World-famous competitions such as the PASCAL VOC Challenge [21], COCO [22], the ImageNet Object Detection Challenge [23], and the Google Open Images Challenge [24] have as their top object-detection algorithms methods inspired by the aforementioned works. Differently from algorithms such as Viola-Jones, CNN-based detectors are flexible enough to be trained with several (hundreds or even a few thousand) classes.

Fig. 1: Examples of detections performed by YOLO [20] in different datasets. (a) PASCAL VOC; (b) personal dataset; (c) COCO. Besides the bounding-box coordinates of a detected object, the output also includes the confidence level and its class.

A detector outcome is commonly composed of a list of bounding boxes, confidence levels and classes, as seen in Figure 1. However, the standard output-file format varies considerably among detection algorithms. Bounding-box detections are mostly represented by their top-left and bottom-right coordinates (x_ini, y_ini, x_end, y_end), with a notable exception being the YOLO [18]–[20] algorithm, which differs from the others by describing each bounding box by its center coordinates, width and height, all normalized by the image dimensions (x_center/image_width, y_center/image_height, box_width/image_width, box_height/image_height).
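As an illustration of the two conventions, the short sketch below converts a box from absolute corner coordinates to the YOLO-style normalized center format; the function name and the example values are ours and do not correspond to any particular detector's API.

```python
def corners_to_yolo(x_ini, y_ini, x_end, y_end, img_w, img_h):
    """Convert absolute (x_ini, y_ini, x_end, y_end) corners into YOLO-style
    (x_center, y_center, width, height), all normalized by the image size."""
    box_w = x_end - x_ini
    box_h = y_end - y_ini
    x_center = x_ini + box_w / 2.0
    y_center = y_ini + box_h / 2.0
    return x_center / img_w, y_center / img_h, box_w / img_w, box_h / img_h

# Example: a 100x200 box with top-left corner at (50, 40) in a 640x480 image.
print(corners_to_yolo(50, 40, 150, 240, 640, 480))
```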
Different challenges, competitions, and hackathons [21], [23]–[27] attempt to assess the performance of object detections in specific scenarios by using real-world annotated images [28]–[30]. In these events, participants are given a non-annotated testing image set in which objects have to be detected by their proposed works. Some competitions provide their own (or third-party) source code, allowing the participants to evaluate their algorithms on an annotated validation image set before submitting their testing-set detections. In the end, each team sends a list of bounding-box coordinates with their respective classes and (sometimes) their confidence levels to be evaluated.
In most competitions, the average precision (AP) and its derivations are the metrics adopted to assess the detections and thus rank the teams. The PASCAL VOC dataset [31] and challenge [21] provide their own source code to measure the AP and the mean AP (mAP) over all object classes. The City Intelligence Hackathon [27] uses the source code distributed in [32] to rank the participants, also based on AP and mAP. The ImageNet Object Localization challenge [23] does not recommend any code to compute its evaluation metric, but provides a pseudo-code explaining it. The Open Images 2019 [24] and Google AI Open Images [26] challenges use mAP, referencing a tool to evaluate the results [33], [34]. The Lyft 3D Object Detection for Autonomous Vehicles challenge [25] does not reference any external tool, but uses the AP averaged over 10 different thresholds, the so-called AP@50:5:95 metric.
This work reviews the most popular metrics used to evaluate object-detection algorithms, including their main concepts, pointing out their differences, and establishing a comparison between different implementations. In order to present its main contributions, this work is divided into the following topics: Section II explains the main performance metrics employed in the field of object detection and how the AP metric can produce ambiguous results; Section III describes some of the best-known object-detection challenges and their employed performance metrics, whereas Section IV presents a project implementing the AP metric to be used with any annotation format.
II. MAIN PERFORMANCE METRICS

Among the different annotated datasets used by object-detection challenges and the scientific community, the most common metric used to measure the accuracy of the detections is the AP. Before examining the variations of the AP, we should review some concepts that are shared among them. The most basic ones are defined below:
• True positive (TP): A correct detection of a ground-truth bounding box;
• False positive (FP): An incorrect detection of a nonexistent object or a misplaced detection of an existing object;
• False negative (FN): An undetected ground-truth bounding box.
It is important to note that, in the object-detection context, a true-negative (TN) result does not apply, as there is an infinite number of bounding boxes that should not be detected within any given image.
The above definitions require establishing what a “correct detection” and an “incorrect detection” are. A common way to do so is using the intersection over union (IOU), a measurement based on the Jaccard index, a coefficient of similarity between two sets of data [35]. In the object-detection scope, the IOU measures the overlapping area between the predicted bounding box Bp and the ground-truth bounding box Bgt divided by the area of their union, that is

J(Bp, Bgt) = IOU = area(Bp ∩ Bgt) / area(Bp ∪ Bgt),   (1)

as illustrated in Figure 2.

Fig. 2: Intersection Over Union (IOU).

By comparing the IOU with a given threshold t, we can classify a detection as being correct or incorrect: if IOU ≥ t, the detection is considered correct; if IOU < t, it is considered incorrect.
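For concreteness, a minimal Python sketch of Equation (1) is given below, assuming boxes represented as (x_ini, y_ini, x_end, y_end) tuples in absolute pixel coordinates; the function names are illustrative and do not come from any of the evaluation tools referenced above.

```python
def iou(box_p, box_gt):
    """Intersection over union (Jaccard index) of two boxes,
    each given as (x_ini, y_ini, x_end, y_end)."""
    # Corners of the intersection rectangle.
    x_left = max(box_p[0], box_gt[0])
    y_top = max(box_p[1], box_gt[1])
    x_right = min(box_p[2], box_gt[2])
    y_bottom = min(box_p[3], box_gt[3])
    if x_right <= x_left or y_bottom <= y_top:
        return 0.0  # the boxes do not overlap
    inter = (x_right - x_left) * (y_bottom - y_top)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)


def is_correct_detection(box_p, box_gt, t=0.5):
    """Classify a detection as correct (True) or incorrect (False)
    by comparing its IOU with a ground-truth box against the threshold t."""
    return iou(box_p, box_gt) >= t
```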
Since, as stated above, true negatives (TN) are not used in object-detection frameworks, one refrains from using any metric that is based on the TN, such as the TPR, FPR and ROC curves [36]. Instead, the assessment of object-detection methods is mostly based on the precision P and recall R concepts, respectively defined as

P = TP / (TP + FP) = TP / (all detections),   (2)

R = TP / (TP + FN) = TP / (all ground truths).   (3)

Precision is the ability of a model to identify only relevant objects: it is the percentage of correct positive predictions. Recall is the ability of a model to find all relevant cases (all ground-truth bounding boxes): it is the percentage of correct positive predictions among all given ground truths.
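Given the TP, FP and FN counts, Equations (2) and (3) reduce to two ratios; the helper below is a small sketch of that computation (the guard against empty denominators is our own convention, not part of the original definitions).

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), i.e. TP over all detections, and
    recall = TP / (TP + FN), i.e. TP over all ground truths."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Example: 6 TPs, 16 FPs and 9 missed ground truths give P ≈ 0.2727 and R = 0.4.
print(precision_recall(6, 16, 9))
```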
The precision × recall curve can be seen as a trade-off between precision and recall for different confidence values associated with the bounding boxes generated by a detector. If the confidence of a detector is such that its FP count is low, the precision will be high. However, in this case, many positives may be missed, yielding a high FN count and thus a low recall. Conversely, if one accepts more positives, the recall will increase, but the FP count may also increase, decreasing the precision. A good object detector should find all ground-truth objects (FN = 0 ≡ high recall) while identifying only relevant objects (FP = 0 ≡ high precision). Therefore, a particular object detector can be considered good if its precision stays high as its recall increases, which means that if the confidence threshold varies, the precision and recall will both remain high. Hence, a high area under the curve (AUC) tends to indicate both high precision and high recall. Unfortunately, in practical cases, the precision × recall plot is often a zigzag-like curve, posing challenges to an accurate measurement of its AUC. This is circumvented by processing the precision × recall curve in order to remove the zigzag behavior prior to AUC estimation. There are basically two approaches to do so: the 11-point interpolation and the all-point interpolation.
In the 11-point interpolation, the shape of the precision × recall curve is summarized by averaging the maximum precision values at a set of 11 equally spaced recall levels [0, 0.1, 0.2, ..., 1], as given by

AP11 = (1/11) Σ_{R ∈ {0, 0.1, ..., 0.9, 1}} Pinterp(R),   (4)

where

Pinterp(R) = max_{R̃ : R̃ ≥ R} P(R̃).   (5)

In this definition of AP, instead of using the precision P(R) observed at each recall level R, the AP is obtained by considering the maximum precision Pinterp(R) over all recall values greater than or equal to R.
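The 11-point interpolation of Equations (4) and (5) can be sketched as follows, assuming paired lists of precision and recall values (one entry per detection, accumulated in decreasing order of confidence); this is an illustrative implementation, not the PASCAL VOC reference code.

```python
def ap_11_point(recalls, precisions):
    """11-point interpolated AP: average, over R in {0, 0.1, ..., 1}, of the
    maximum precision among all points with recall >= R (Eqs. 4 and 5)."""
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        # Precision values of all points whose recall is >= the current level r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0
```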
In the all-point interpolation, instead of interpolating only 11 equally spaced points, one interpolates through all points in such a way that

APall = Σ_n (R_{n+1} − R_n) Pinterp(R_{n+1}),   (6)

where

Pinterp(R_{n+1}) = max_{R̃ : R̃ ≥ R_{n+1}} P(R̃).   (7)

In this case, instead of using the precision observed at only a few points, the AP is obtained by interpolating the precision at each recall level, taking the maximum precision whose recall value is greater than or equal to R_{n+1}.
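A corresponding sketch of the all-point interpolation of Equations (6) and (7) is shown below; the backward pass that builds the precision envelope is a common implementation strategy, and the boundary points added at R = 0 and R = 1 are our assumption.

```python
def ap_all_point(recalls, precisions):
    """All-point interpolated AP (Eqs. 6 and 7): area under the precision x recall
    curve after replacing each precision by the maximum precision achieved at any
    recall greater than or equal to it."""
    # Append boundary points so the first and last recall steps are covered.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Backward pass: p[i] becomes the max precision over all points with recall >= r[i].
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum of (R_{n+1} - R_n) * P_interp(R_{n+1}) over consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```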
The mean AP (mAP) is a metric used to measure the accuracy of object detectors over all classes in a specific database. The mAP is simply the average AP over all classes [15], [17], that is

mAP = (1/N) Σ_{i=1}^{N} APi,   (8)

with APi being the AP of the i-th class and N the total number of classes being evaluated.
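Equation (8) is a plain average of per-class AP values; a trivial sketch is given below, assuming the APs have already been computed with one of the interpolations above (the class names and values are made up for illustration).

```python
def mean_average_precision(ap_per_class):
    """mAP: average of the AP values over all N evaluated classes (Eq. 8)."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class APs; real values depend on detector, dataset and IOU threshold.
print(mean_average_precision({"person": 0.72, "car": 0.65, "dog": 0.58}))
```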
A. A Practical Example

Fig. 3: Example of 24 detections (red boxes) performed by an object detector aiming to detect 15 ground-truth objects (green boxes) belonging to the same class.

As stated previously, the AP is calculated individually for each class. In the example shown in Figure 3, the boxes represent detections (red boxes identified by a letter - A, B, ..., Y) and the ground truth (green boxes) of a given class. The percentage value drawn next to each red box represents the detection confidence for that object class. In order to evaluate the precision and recall of the 24 detections among the 15 ground-truth boxes distributed over seven images, an IOU threshold t needs to be established. In this example, let us consider as a TP any detection box having IOU ≥ 30%. Note that each value of IOU threshold yields a different AP metric, and thus the threshold used must always be indicated.

Table I presents the detections ordered by their confidence levels. For each detection, if its area overlaps a ground truth by 30% or more (IOU ≥ 30%), the TP column is set to 1; otherwise it is set to 0 and the detection is considered an FP. Some detectors can output multiple detections overlapping a single ground truth (e.g. detections D and E in Image 2; G, H and I in Image 3). For those cases the detection with the highest confidence is considered a TP and the others are considered FPs, as applied by the PASCAL VOC 2012 challenge. The columns acc TP and acc FP accumulate the total amount of TPs and FPs along all the detections above the corresponding confidence level. Figure 4 depicts the calculated precision and recall values for this case.

TABLE I: Computation of Precision and Recall Values for IOU threshold = 30%

detection  confidence  TP  FP  acc TP  acc FP  precision  recall
R          95%         1   0   1       0       1          0.0666
Y          95%         0   1   1       1       0.5        0.0666
J          91%         1   0   2       1       0.6666     0.1333
A          88%         0   1   2       2       0.5        0.1333
U          84%         0   1   2       3       0.4        0.1333
C          80%         0   1   2       4       0.3333     0.1333
M          78%         0   1   2       5       0.2857     0.1333
F          74%         0   1   2       6       0.25       0.1333
D          71%         0   1   2       7       0.2222     0.1333
B          70%         1   0   3       7       0.3        0.2
H          67%         0   1   3       8       0.2727     0.2
P          62%         1   0   4       8       0.3333     0.2666
E          54%         1   0   5       8       0.3846     0.3333
X          48%         1   0   6       8       0.4285     0.4
N          45%         0   1   6       9       0.4        0.4
T          45%         0   1   6       10      0.375      0.4
K          44%         0   1   6       11      0.3529     0.4
Q          44%         0   1   6       12      0.3333     0.4
V          43%         0   1   6       13      0.3157     0.4
I          38%         0   1   6       14      0.3        0.4
L          35%         0   1   6       15      0.2857     0.4
S          23%         0   1   6       16      0.2727     0.4
G          18%         1   0   7       16      0.3043     0.4666
O          14%         0   1   7       17      0.2916     0.4666

Fig. 4: Precision × Recall curve with values calculated for each detection in Table I.

As mentioned above, each interpolation method yields a different AP result (Figure 5), as given by

AP11 = (1/11)(1 + 0.6666 + 0.4285 + 0.4285 + 0.4285) = 26.84%.

Fig. 6: Precision × Recall curves of points from Table I applying interpolation with all points.
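The AP11 value above can be reproduced numerically; the sketch below transcribes only the TP/FP outcomes of Table I, rebuilds the accumulated precision and recall columns, and applies the 11-point interpolation of Equations (4) and (5).

```python
# 1 = TP, 0 = FP for detections R, Y, J, A, U, C, M, F, D, B, H, P,
# E, X, N, T, K, Q, V, I, L, S, G, O (Table I, IOU threshold = 30%).
tps = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
n_ground_truths = 15

acc_tp = acc_fp = 0
recalls, precisions = [], []
for tp in tps:
    acc_tp += tp
    acc_fp += 1 - tp
    precisions.append(acc_tp / (acc_tp + acc_fp))
    recalls.append(acc_tp / n_ground_truths)

# 11-point interpolation: maximum precision at each recall level R in {0, 0.1, ..., 1}.
ap11 = sum(
    max([p for p, r in zip(precisions, recalls) if r >= level / 10.0], default=0.0)
    for level in range(11)
) / 11.0
print(f"AP11 = {ap11:.2%}")  # AP11 = 26.84%
```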