DOOD; moreover, the uncertainties get larger as the severity of corruption increases. We also display AP (black line), where it can clearly be seen that as the uncertainty increases, AP decreases, implying that the uncertainty reflects the performance of the detector and thereby suggesting that the image-level uncertainties are reliable and effective. As already pointed out, this conclusion is not necessarily very surprising, since the classifiers of object detectors are generally trained not only on proposals matching the objects but also on a very large number of proposals not matching any object, which can be ∼1000 times more numerous [57]. This composition of training data prevents the classifier from becoming drastically over-confident for unseen data, enabling the detector to yield reliable uncertainties.
Thresholding Image-level Uncertainties For our SAOD baseline, we can obtain an appropriate value for ū through cross-validation. Ideally, this would require a validation set including both ID and OOD images, but unfortunately DVal consists of only ID images. However, given that in this case our image-level uncertainty is obtained by aggregating detection-level uncertainties, images whose detections have high uncertainty will produce high image-level uncertainty and vice versa. Using this fact, if we remove the ground-truth objects from the images in DVal, the resulting image-level uncertainties should be high. We leverage this approach to construct a pseudo-OOD dataset out of DVal by replacing the pixels inside the ground-truth bounding boxes with zeros, thereby removing the objects from the image and enabling us to cross-validate ū.
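As an illustration, the following minimal sketch constructs such a pseudo-OOD image; it assumes images stored as NumPy arrays and ground-truth boxes given as (x1, y1, x2, y2) pixel coordinates, which need not match the exact data pipeline used in our implementation.

```python
import numpy as np

def make_pseudo_ood(image: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """Erase every ground-truth object to turn an ID image into a pseudo-OOD image.

    image:    HxWxC array.
    gt_boxes: Nx4 array of (x1, y1, x2, y2) pixel coordinates.
    """
    pseudo = image.copy()
    h, w = pseudo.shape[:2]
    for x1, y1, x2, y2 in gt_boxes.astype(int):
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        pseudo[y1:y2, x1:x2] = 0  # replace the object's pixels with zeros
    return pseudo
```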
As for the metric to cross-validate ū against, we observe that existing metrics are unsuitable: AUC-based metrics cannot evaluate binary accept/reject predictions; F-score is sensitive to the choice of the positive class [60]; and [email protected]² [13, 24] requires a fixed threshold. As an attractive candidate, Uncertainty Error [46] computes the arithmetic mean of the FP and FN rates. However, the arithmetic mean does not heavily penalise choosing ū at extreme values, potentially leading to the situation where â = 1 or â = 0 for all images. To address this, we instead leverage the harmonic mean, which is sensitive to these extreme values. Particularly, we define the Balanced Accuracy (BA) as the harmonic mean of the TP rate (TPR) and the TN rate (TNR = 1 − FPR), addressing the aforementioned issue and enabling us to use it to obtain a suitable ū.

²[email protected] is the FPR for a fixed threshold set when TPR = 0.95.
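A minimal sketch of selecting ū with this criterion is given below; it assumes that ID validation images should be accepted (uncertainty at or below the threshold) and that the pseudo-OOD images should be rejected, and the sweep over observed uncertainty values is an illustrative search strategy rather than the exact one used in our experiments.

```python
import numpy as np

def balanced_accuracy(u_id, u_ood, u_bar):
    """Harmonic mean of TPR and TNR for a candidate image-level uncertainty threshold.

    u_id:  uncertainties of ID validation images (should be accepted, i.e. <= u_bar).
    u_ood: uncertainties of pseudo-OOD images (should be rejected, i.e. > u_bar).
    """
    tpr = float(np.mean(np.asarray(u_id) <= u_bar))   # ID images correctly accepted
    tnr = float(np.mean(np.asarray(u_ood) > u_bar))   # pseudo-OOD images correctly rejected
    return 0.0 if tpr + tnr == 0 else 2 * tpr * tnr / (tpr + tnr)

def select_u_bar(u_id, u_ood):
    # Sweep the observed uncertainty values and keep the threshold with the best BA.
    candidates = np.unique(np.concatenate([np.asarray(u_id), np.asarray(u_ood)]))
    scores = [balanced_accuracy(u_id, u_ood, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```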
5. Calibration of Object Detectors

Accepting or rejecting an image is only one component of the SAOD task; in situations where the image is accepted, SAOD then requires the detections to be calibrated. Here we define calibration as the alignment of the performance and the confidence of a model, which has already been extensively studied for the classification task [8, 17, 34, 47, 52, 69]. However, existing work which studies the calibration properties of an object detector [35, 36, 48, 51] is limited. For object detection, the goal is to align a detector's confidence with the quality of the joint task of classification and localisation (regression). Arguably, it is not obvious how to extend purely classification-based calibration measures such as Expected Calibration Error (ECE) [17] to object detection. A straightforward extension would be to replace the accuracy in such measures by the precision of the detector, which is computed by validating TPs at a specific IoU threshold. However, this perspective, as employed by [35], does not account for the fact that two object detectors, while having the same precision, might differ significantly in terms of localisation quality.

Hence, as one of the main contributions of this work, we consider the calibration of object detectors from a fundamental perspective and define the Localisation-aware Calibration Error (LaECE), which accounts for the joint nature of the task (classification and localisation). We further analyse how calibration measures should be coupled with accuracy in object detection and adapt common post hoc calibration methods such as histogram binning [74], linear regression, and isotonic regression [75] to improve LaECE.

Figure 3. (a) Calibrated classifier [17]; (b) Calibrated Bayesian regressor [33], where the empirical and predicted CDFs match; (c) Loci of constant IoU boundary, e.g. any predicted box with top-left and bottom-right corners obtained from within the green loci has an IoU > 0.2 with the blue box. The detector is calibrated if its confidence matches the classification and the localisation quality.

5.1. Localisation-aware ECE

To build an intuitive understanding and to appreciate the underlying complexity in developing a metric to quantify the calibration of an object detector, we first revisit its sub-tasks and briefly discuss what a calibrated classifier and a calibrated regressor correspond to. For the former, a classifier is calibrated if its confidence matches its accuracy, as illustrated in Fig. 3(a). For calibrating Bayesian regressors, there are different definitions [33, 37, 38, 64]. One notable definition [33] requires aligning the predicted and the empirical cumulative distribution functions (CDFs), implying that the p% credible interval from the mean of the predictive distribution should include p% of the ground truths for all
p ∈ [0, 1] (Fig. 3(b)). Extending this definition to object detection is nontrivial due to the increased complexity of the problem. For example, a detection is represented by a tuple {ĉi, b̂i, p̂i} with b̂i ∈ R⁴, which is not univariate as in [33]. Also, this definition, which aligns the empirical and predicted CDFs, does not consider the regression accuracy explicitly and is therefore not fit for our purpose. Instead, we take inspiration from an alternative definition that aims to directly align the confidence with the regression accuracy [37, 38].

To this end, without loss of generality, we use IoU as the measure of localisation quality for the detection boxes. Therefore, broadly speaking, if the detection confidence score is p̂i = 0.8, then the localisation task is calibrated (ignoring the classification task for now) if the average localisation performance (IoU in our case) is 80% over the entire dataset. To demonstrate, following [56] we plot the loci for fixed values of IoU in Fig. 3(c). In this example, considering the blue box to be the ground truth, p̂i = 0.2 implies that a detector is calibrated if the detection boxes lie on the 'green' loci corresponding to IoU = 0.2.

Focusing back on the joint nature of object detection, we say that an object detector f : X ↦ {ĉi, b̂i, p̂i}^N is calibrated if the classification and the localisation performances jointly match its confidence p̂i. More formally,

$$\underbrace{\mathbb{P}(\hat{c}_i = c_i \mid \hat{p}_i)}_{\text{Classification perf.}} \, \underbrace{\mathbb{E}_{\hat{b}_i \in \mathcal{B}_i(\hat{p}_i)}\!\left[\mathrm{IoU}(\hat{b}_i, b_{\psi(i)})\right]}_{\text{Localisation perf.}} = \hat{p}_i, \quad \forall \hat{p}_i \in [0, 1], \tag{3}$$

where Bi(p̂i) is the set of TP boxes with the confidence score p̂i, and b_ψ(i) is the ground-truth box that b̂i matches with. Note that in the absence of localisation quality, the above calibration formulation boils down to the standard classification calibration definition.

For a given Bi(p̂i), the first term in Eq. (3), P(ĉi = ci | p̂i), is the ratio of the number of correctly-classified detections to the total number of detections, which is simply the precision, whereas the second term represents the average localisation quality of the boxes in Bi(p̂i).

Following the approximations used to define the well-known ECE, we use Eq. (3) to define LaECE. Precisely, we discretize the confidence score space into J = 25 equally-spaced bins [17, 34], and, to prevent more frequent classes from dominating the calibration error, we compute the average calibration error for each class separately [34, 47]. Thus, the calibration error for the c-th class is obtained as
$$\mathrm{LaECE}_c = \sum_{j=1}^{J} \frac{|\hat{\mathcal{D}}_j^c|}{|\hat{\mathcal{D}}^c|} \left| \bar{p}_j^c - \mathrm{precision}_c(j) \times \overline{\mathrm{IoU}}_c(j) \right|, \tag{4}$$

where D̂^c denotes the set of all detections for class c, D̂_j^c ⊆ D̂^c is the set of detections in bin j, and p̄_j^c is the average of the detection confidence scores in bin j for class c. Furthermore, precision_c(j) denotes the precision of the j-th bin for the c-th class and ĪoU_c(j) the average IoU of the TP boxes in bin j. Then, LaECE is computed as the average of LaECE_c over all the classes. We highlight that, for the sake of better accuracy, recent detectors [2, 23, 28–30, 39, 40, 44, 54, 55, 67, 76] tend to obtain p̂i by combining the classification confidence with the localisation confidence (e.g., obtained from an auxiliary IoU prediction head), which is very well aligned with our LaECE formulation enforcing p̂i to match the joint performance in Eq. (4).
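For concreteness, a per-class sketch of Eq. (4) is given below; it assumes the detections have already been matched to the ground truth, so that each detection comes with a confidence, a TP flag and the IoU with its matched box, and it abstracts away the matching procedure itself. LaECE is then the mean of this value over the classes.

```python
import numpy as np

def la_ece_class(conf: np.ndarray, is_tp: np.ndarray, iou: np.ndarray, J: int = 25) -> float:
    """LaECE of a single class, following Eq. (4).

    conf:  confidence scores of all detections assigned to this class.
    is_tp: boolean flags marking which detections are TPs after matching.
    iou:   IoU with the matched ground-truth box (only used for TPs).
    """
    if len(conf) == 0:
        return 0.0
    bins = np.linspace(0.0, 1.0, J + 1)
    error = 0.0
    for j in range(J):
        lo, hi = bins[j], bins[j + 1]
        in_bin = (conf > lo) & (conf <= hi) if j > 0 else (conf <= hi)
        n_j = in_bin.sum()
        if n_j == 0:
            continue  # empty bins do not contribute
        mean_conf = conf[in_bin].mean()
        precision = is_tp[in_bin].mean()
        tp_in_bin = in_bin & is_tp
        mean_iou = iou[tp_in_bin].mean() if tp_in_bin.any() else 0.0
        error += (n_j / len(conf)) * abs(mean_conf - precision * mean_iou)
    return error
```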
Reliability Diagrams We also produce reliability diagrams to provide insights into the calibration properties of a detector (Fig. 4(a)). To obtain a reliability diagram, we first obtain the performance, measured by the product of precision and IoU (Eq. (4)), for each class over the bins and then average the performance over the classes, ignoring empty bins. Note that if a detector is perfectly calibrated with LaECE = 0, then all the histograms will lie along the diagonal in the reliability diagram since LaECE_c = 0. Similar to classification, if the performance tends to be lower than the diagonal, then the detector is said to be over-confident, as in Fig. 4(a), and vice versa for an under-confident detector. Please see Fig. A.14 for more examples.

Figure 4. Reliability diagrams of F-RCNN on DID for SAOD-Gen before and after calibration. (a) Base model, LaECE = 43.3%; (b) calibrated by LR, LaECE = 17.7%.

5.2. Impact of Top-k Survival on Calibration

Top-k survival, a critical part of the post-processing step, selects the k detections with the highest confidence in an image. The value of k is typically significantly larger than the number of objects; for example, k = 100 for COCO, where an average of only 7.3 ground-truth objects exist per image on the val set. Therefore, the final detections may contain numerous low-scoring noisy detections. In fact, ATSS on the COCO val set, for example, produces 86.4 detections on average per image after post-processing, far more than the average number of objects per image.

Since these extra noisy detections do not impact the widely used AP, most works do not pay much attention to them; however, as we show below, they do have a negative impact on the calibration metric. Thus, this may mislead a practitioner into choosing the wrong model when it comes to calibration quality.

We design a synthetic experiment to show the impact of low-scoring noisy detections on AP and calibration (LaECE). Specifically, if the number of final detections is less than k in an image, we insert "dummy" detections into the remaining space. These dummy detections are randomly assigned a class ĉi, p̂i = 0, and only one pixel to ensure that they do not match with any object. Hence, by design, they are "perfectly calibrated". As shown in Fig. 5(a), though these dummy detections have no impact on the AP (mathematical proof in App. D), they do give the impression that the model becomes more calibrated (lower LaECE) as k increases. Therefore, considering that extra noisy detections are undesirable in practice, we do not advocate top-k survival; instead, we motivate the need to select a detection confidence threshold v̄, where detections are rejected if their confidence is lower than v̄.

Figure 5. Red: ATSS, green: F-RCNN; histograms present det/img using the right axes; results are on the COCO val set with 7.3 objects/img. (a) Dummy detections decrease LaECE (solid line) artificially with no effect on AP (dashed line). LRP (dotted line), on the other hand, penalizes dummy detections. (b) AP is maximized with more detections (threshold 'none') while LRP Error benefits from properly-thresholded detections (refer to App. D).
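The padding step of this synthetic experiment can be sketched as follows; the tuple layout of a detection and the fixed random seed are illustrative assumptions rather than the exact setup of our experiment.

```python
import numpy as np

def pad_with_dummy_detections(dets, k=100, num_classes=80, seed=0):
    """Pad an image's final detections up to k with 'dummy' detections.

    dets: list of (class_id, score, (x1, y1, x2, y2)) tuples after post-processing.
    Each dummy gets a random class, a score of 0 and a one-pixel box, so it cannot
    match any object and is perfectly calibrated by design.
    """
    rng = np.random.default_rng(seed)
    padded = list(dets)
    while len(padded) < k:
        padded.append((int(rng.integers(num_classes)), 0.0, (0.0, 0.0, 1.0, 1.0)))
    return padded
```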
An appropriate choice of v̄ should produce a set of thresholded detections with a good balance of precision, recall and localisation errors³. In Fig. 5(b), we present the effect of v̄ on LRP, where the lowest error is obtained around 0.30 for ATSS and 0.70 for F-RCNN, leading to an average of 6 detections/image for both detectors, far closer to the average number of objects compared to using k = 100. Consequently, to obtain v̄ for our baseline, we use LRP-optimal thresholding [53, 58], which is the threshold achieving the minimum LRP for each class on the val set.

³Using properly-thresholded detections is in fact similar to Panoptic Segmentation, which is a closely-related task to object detection [31, 32].
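A sketch of this per-class cross-validation is given below; the lrp_error callable is a hypothetical helper standing in for an actual LRP implementation (e.g., following [58]).

```python
import numpy as np

def lrp_optimal_threshold(scores, lrp_error):
    """Class-wise selection of the score threshold v_bar minimising LRP on the val set.

    scores:    detection scores of a single class on the validation set.
    lrp_error: callable mapping a candidate threshold to the LRP error obtained when
               detections scoring below it are discarded (a stand-in for an actual
               LRP implementation).
    """
    candidates = np.unique(np.asarray(scores))
    errors = [lrp_error(t) for t in candidates]
    return float(candidates[int(np.argmin(errors))])
```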
5.3. Post hoc Calibration of Object Detectors

For our baseline, given that LaECE provides the calibration error of the model, we can calibrate an object detector using common calibration approaches from the classification and regression literature. Precisely, for each class we train a calibrator ζ^c : [0, 1] → [0, 1] using input–target pairs {(p̂i, t_i^cal)} from DVal, where t_i^cal is the target confidence. As shown in App. D, LaECE for bin j reduces to

$$\sum_{\substack{\hat{b}_i \in \hat{\mathcal{D}}_j^c \\ \psi(i) > 0}} \left| t_i^{\mathrm{cal}} - \mathrm{IoU}(\hat{b}_i, b_{\psi(i)}) \right| \;+\; \sum_{\substack{\hat{b}_i \in \hat{\mathcal{D}}_j^c \\ \psi(i) \le 0}} t_i^{\mathrm{cal}}. \tag{5}$$

Consequently, we seek the t_i^cal which minimises this value, assuming that p̂i resides in the j-th bin. In situations where the prediction is a TP (ψ(i) > 0), Eq. (5) is minimized when p̂i = t_i^cal = IoU(b̂i, b_ψ(i)); conversely, if ψ(i) ≤ 0, it is minimised when p̂i = t_i^cal = 0. We then train linear regression (LR), histogram binning (HB) [74], and isotonic regression (IR) [75] models with such pairs. Tab. 5 shows that these calibration methods improve LaECE in five out of six cases, and in the case where they do not improve (ATSS on SAOD-Gen), the calibration performance of the base model is already good. Overall, we find that IR and LR perform better than HB, and we consequently employ LR for SAODets since LR performs the best on three detectors. Fig. 4(b) shows an example reliability histogram after applying LR, indicating the improvement in calibration.
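A minimal sketch of fitting such a per-class calibrator with scikit-learn is given below; it covers the LR and IR variants and uses the Eq. (5) targets, i.e., the IoU with the matched ground truth for TPs and 0 for FPs. The function name and interface are illustrative rather than those of our implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

def fit_calibrator(conf, is_tp, iou, kind="LR"):
    """Fit a per-class post hoc calibrator on (confidence, target) pairs from the val set.

    Targets follow Eq. (5): the IoU with the matched ground truth for TPs, 0 for FPs.
    Returns a function mapping raw confidences to calibrated confidences in [0, 1].
    """
    conf = np.asarray(conf, dtype=float)
    targets = np.where(np.asarray(is_tp, dtype=bool), np.asarray(iou, dtype=float), 0.0)
    if kind == "IR":
        model = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        model.fit(conf, targets)
        return lambda p: model.predict(np.asarray(p, dtype=float))
    model = LinearRegression().fit(conf.reshape(-1, 1), targets)
    return lambda p: np.clip(model.predict(np.asarray(p, dtype=float).reshape(-1, 1)), 0.0, 1.0)
```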
6. Baseline SAODets and Their Evaluation

Using the necessary features developed in Sec. 4 and Sec. 5, namely, obtaining image-level uncertainties and calibration methods as well as the thresholds ū and v̄, we now show how to convert standard detectors into ones that are self-aware. Then, we benchmark them using the SAOD framework proposed in Sec. 3 whilst leveraging our test datasets and LaECE.

Baseline SAODets To address the requirements of a SAODet, we make the following design choices when converting an object detector into one which is self-aware. The hard requirement of predicting whether or not to accept an image is achieved by obtaining image-level uncertainties through aggregating uncertainty scores. Specifically, we use mean(top-3) and obtain an uncertainty threshold ū through cross-validation using the pseudo-OOD set approach (Sec. 4). We only keep the detections with confidence higher than v̄, which is set using LRP-optimal thresholding (Sec. 5.2). To calibrate the detection scores, we use linear regression as discussed in Sec. 5.3. Thus, we convert all four detectors that we use (Sec. 3) into ones that are self-aware, prefixed by SA in the tables. For further details, please see App. E.

The SAOD Evaluation Protocol The SAOD task is a robust protocol unifying the evaluation of (i) the reliability of uncertainties; (ii) the calibration and accuracy; and (iii) the performance under domain shift. To obtain quantitative values for the above, we leverage the Balanced Accuracy (Sec. 4) for (i). For (ii) we evaluate the calibration and accuracy using LaECE (Sec. 5) and LRP [53] respectively, but combine them through the harmonic mean of 1 − LRP and 1 − LaECE on X ∈ DID, which we define as the In-Distribution Quality (IDQ). Similarly, for (iii) we compute the IDQ for X ∈ T(DID), denoted by IDQT, but with the principal difference that the detector is free to accept or reject severe corruptions (C5), as discussed in Sec. 3. Considering that all of these features are crucial in a safety-critical application, a lack of performance in any one of them needs to be heavily penalized. To do so, we introduce the Detection Awareness Quality (DAQ), a unified performance measure to evaluate SAODets, constructed as the harmonic mean of BA, IDQ and IDQT. The resulting DAQ is a higher-is-better measure with a range of [0, 1].
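These quantities reduce to simple harmonic means; a sketch is given below, assuming all inputs are expressed as ratios in [0, 1] rather than percentages.

```python
def harmonic_mean(values):
    # Zero whenever any component is zero: a failure in one aspect is heavily penalised.
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def idq(lrp, la_ece):
    # In-Distribution Quality: harmonic mean of (1 - LRP) and (1 - LaECE).
    return harmonic_mean([1.0 - lrp, 1.0 - la_ece])

def daq(ba, idq_id, idq_shifted):
    # Detection Awareness Quality: harmonic mean of BA, IDQ and IDQ_T, in [0, 1].
    return harmonic_mean([ba, idq_id, idq_shifted])
```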
Table 5. Effect of post hoc calibration on LaECE and LRP (in %). ✗: Uncalibrated, HB: Histogram binning, IR: Isotonic Regression, LR: Linear Regression. ATSS, combining localisation and classification confidences using multiplication as in our LaECE (Eq. (4)), performs the best on both datasets before/after calibration. Aligned with [47], uncalibrated F-RCNN using cross-entropy loss performs the worst.

Table 6. Evaluating SAODets. With higher BA and IDQs, SA-D-DETR achieves the best DAQ on SAOD-Gen. For SAOD-AV, SA-ATSS outperforms SA-F-RCNN thanks to its higher IDQs. Bold: SAODet achieves the best; values are in %.

Dataset  Self-aware Detector   DAQ↑   BA↑ (DOOD)   IDQ↑  LaECE↓  LRP↓ (DID)   IDQ↑  LaECE↓  LRP↓ (T(DID))   LRP↓  AP↑ (DVal)
Gen      SA-F-RCNN             39.7   87.7         38.5  17.3    74.9         26.2  18.1    84.4            59.5  39.9
Gen      SA-RS-RCNN            41.2   88.9         39.7  17.1    73.9         27.5  17.8    83.5            58.1  42.0
Gen      SA-ATSS               41.4   87.8         39.7  16.6    74.0         27.8  18.2    83.2            58.5  42.8
Gen      SA-D-DETR             43.5   88.9         41.7  16.4    72.3         29.6  17.9    81.9            55.9  44.3
AV       SA-F-RCNN             43.0   91.0         41.5   9.5    73.1         28.8   7.2    83.0            54.3  55.0
AV       SA-ATSS               44.7   85.8         43.5   8.8    71.5         30.8   6.8    81.5            53.2  56.9

Main Results Here we discuss how our SAODets perform in terms of the aforementioned metrics. In terms of our hypotheses, the first evaluation we wish to observe is the effectiveness of our metrics. Specifically, we observe in Tab. 6 that a lower LaECE and LRP lead to a higher IDQ, and that a higher BA, IDQ and IDQT lead to a higher DAQ, indicating that the construction of these metrics is appropriate.
To justify that they are reasonable, we observe that the typically more complex and better performing detectors (D-DETR and ATSS) outperform the simpler F-RCNN, indicating that these metrics reflect the quality of the object detectors.

In terms of observing the performance of these self-aware variants, we can see that while recent state-of-the-art detectors perform very well in terms of LRP and AP on DVal, their performance drops significantly as we expose them to our DID and T(DID), which involve domain shift, corruptions and OOD data. We would also like to note that the best DAQ, corresponding to the best performing model SA-D-DETR, is still only 43.5% on the SAOD-Gen dataset. As this performance does not seem convincing, extra care should be taken before these models are deployed in safety-critical applications. Consequently, our study shows that a significant amount of attention needs to be devoted to building self-aware object detectors, and effort to reduce the performance gap needs to be undertaken.

Table 7. Ablation study obtained by removing LRP-optimal thresholding (Sec. 5.2) in favour of v̄ = 0.5; LR calibration (Sec. 5.3) in favour of the uncalibrated model; and the image-level threshold ū (Sec. 4) in favour of the threshold corresponding to TPR = 0.95.

v̄  LR  ū    DAQ↑   BA↑    LaECE↓  LRP↓   LaECET↓  LRPT↓
            36.0   83.2   42.7    76.2   44.1     84.7
✓           36.5   83.2   41.7    74.8   43.9     84.7
✓   ✓       39.1   83.2   17.2    74.8   18.1     84.7
✓   ✓   ✓   39.7   87.7   17.3    74.9   18.1     84.4

Ablation Analyses To test which components of the SAODet contribute the most to the improvement, we perform a simple experiment using SA-F-RCNN (SAOD-Gen). In this experiment, we systematically remove the LRP-optimal thresholds, the LR calibration, and the pseudo-set approach, and replace these features with a detection-score threshold of 0.5, no calibration, and a threshold corresponding to a TPR of 0.95, respectively. We can see in Tab. 7 that, as hypothesized, LRP-optimal thresholding improves accuracy, LR yields a notable gain in LaECE, and using pseudo-sets results in a gain for OOD detection. In App. E, we further conduct additional experiments to (i) investigate the effect of ū and v̄ on the reported metrics and (ii) show how common improvement strategies for object detectors affect DAQ.

Evaluating Individual Robustness Aspects We finally note that our framework provides the necessary tools to evaluate a detector in terms of reliability of uncertainties, calibration and domain shift, thereby enabling researchers to benchmark either a SAODet using our DAQ measure or one of its individual components. Specifically, (i) uncertainties can be evaluated on DID ∪ DOOD using AUROC or BA (Tab. 2); (ii) calibration can be evaluated on DID ∪ T(DID) using LaECE (Tab. 5); and (iii) DID ∪ T(DID) can be used to test detectors developed for single-domain generalization [68, 72].

7. Conclusive Remarks

In this paper, we developed the SAOD task, which requires detectors to obtain reliable uncertainties, yield calibrated confidences, and be robust to domain shift. We curated large-scale datasets and introduced novel metrics to evaluate detectors on the SAOD task. We also proposed a metric (LaECE) to quantify the calibration of object detectors which respects both classification and localisation quality, addressing a critical shortcoming in the literature. We hope that this work inspires researchers to build more reliable object detectors for safety-critical applications.
References [13] Xuefeng Du, Zhaoning Wang, Mu Cai, and Sharon Li. To-
wards unknown-aware learning with virtual outlier synthe-
[1] Daniel Bolya, Sean Foley, James Hays, and Judy Hoffman. sis. In International Conference on Learning Representa-
Tide: A general toolbox for identifying object detection er- tions, 2022. 1, 3, 5, 17, 19, 20, 23
rors. In The IEEE European Conference on Computer Vision [14] Ayers Edward, Sadeghi Jonathan, Redford John, Mueller Ro-
(ECCV), 2020. 27 main, and Dokania Puneet K. Query-based hard-image re-
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. trieval for object detection at test time. arXiv, 2209.11559,
Yolact++: Better real-time instance segmentation. IEEE 2022. 4
Transactions on Pattern Analysis and Machine Intelligence, [15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
2020. 6 and A. Zisserman. The pascal visual object classes (voc)
[3] François Bourgeois and Jean-Claude Lassalle. An exten- challenge. International Journal of Computer Vision (IJCV),
sion of the munkres algorithm for the assignment problem to 88(2):303–338, 2010. 2, 24, 25
rectangular matrices. Communications of ACM, 14(12):802– [16] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
804, 1971. 16 ready for autonomous driving? the kitti vision benchmark
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, suite. In Conference on Computer Vision and Pattern Recog-
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- nition (CVPR), 2012. 15
ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- [17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger.
modal dataset for autonomous driving. In IEEE/CVF Confer- On calibration of modern neural networks. In Doina Precup
ence on Computer Vision and Pattern Recognition (CVPR), and Yee Whye Teh, editors, Proceedings of the 34th Interna-
2020. 3, 15 tional Conference on Machine Learning, volume 70 of Pro-
[5] Qi Cai, Yingwei Pan, Yu Wang, Jingen Liu, Ting Yao, and ceedings of Machine Learning Research, pages 1321–1330.
Tao Mei. Learning a unified sample weighting network for PMLR, 2017. 5, 6
object detection. In IEEE/CVF Conference on Computer Vi- [18] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A
sion and Pattern Recognition (CVPR), 2020. 2 dataset for large vocabulary instance segmentation. In The
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas IEEE Conference on Computer Vision and Pattern Recogni-
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- tion (CVPR), 2019. 1, 2, 25
end object detection with transformers. In European Confer- [19] David Hall, Feras Dayoub, John Skinner, Haoyang Zhang,
ence on Computer Vision (ECCV), 2020. 2, 31 Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia An-
[7] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu gelova, and Niko Suenderhauf. Probabilistic object de-
Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei tection: Definition and evaluation. In Proceedings of the
Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, IEEE/CVF Winter Conference on Applications of Computer
Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Vision (WACV), 2020. 2
Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli [20] Ali Harakeh, Michael H. W. Smart, and Steven L. Waslan-
Ouyang, Chen Change Loy, and Dahua Lin. MMDetec- der. Bayesod: A bayesian approach for uncertainty estima-
tion: Open mmlab detection toolbox and benchmark. arXiv, tion in deep object detectors. IEEE International Conference
1906.07155, 2019. 19 on Robotics and Automation (ICRA), 2020. 2
[8] Jiacheng Cheng and Nuno Vasconcelos. Calibrating deep [21] Ali Harakeh and Steven L. Waslander. Estimating and evalu-
neural networks by pairwise constraints. In Proceedings of ating regression predictive uncertainty in deep object detec-
the IEEE/CVF Conference on Computer Vision and Pattern tors. In International Conference on Learning Representa-
Recognition (CVPR), 2022. 5 tions (ICLR), 2021. 1, 2, 3, 17, 19, 20, 23
[9] Jiwoong Choi, Ismail Elezi, Hyuk-Jae Lee, Clement Farabet, [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
and Jose M. Alvarez. Active learning for deep object detec- Deep residual learning for image recognition. In IEEE/CVF
tion via probabilistic modeling. In IEEE/CVF International Conference on Computer Vision and Pattern Recognition
Conference on Computer Vision (ICCV), 2021. 2 (CVPR), 2016. 31
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo [23] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides,
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe and Xiangyu Zhang. Bounding box regression with uncer-
Franke, Stefan Roth, and Bernt Schiele. The cityscapes tainty for accurate object detection. In IEEE/CVF Confer-
dataset for semantic urban scene understanding. In IEEE ence on Computer Vision and Pattern Recognition (CVPR),
Conference on Computer Vision and Pattern Recognition 2019. 2, 3, 6, 23
(CVPR), 2016. 1, 15 [24] Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou,
[11] Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kir- Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt,
illov, and Ross B. Girshick. Evaluating large-vocabulary and Dawn Song. Scaling out-of-distribution detection for
object detectors: The devil is in the details. arXiv e- real-world settings. In International Conference on Machine
prints:2102.01066, 2021. 1 Learning (ICML), 2022. 5
[12] Akshay Raj Dhamija, Manuel Günther, Jonathan Ventura, [25] Dan Hendrycks and Thomas Dietterich. Benchmarking neu-
and Terrance E. Boult. The overlooked elephant of object ral network robustness to common corruptions and perturba-
detection: Open set. In IEEE Winter Conference on Applica- tions. In International Conference on Learning Representa-
tions of Computer Vision (WACV), 2020. 1, 3, 20 tions (ICLR), 2019. 3, 17
[26] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun [40] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu,
Dai. Diagnosing error in object detectors. In The IEEE Eu- Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss:
ropean Conference on Computer Vision (ECCV), 2012. 27 Learning qualified and distributed bounding boxes for dense
[27] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, object detection. In Advances in Neural Information Pro-
Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Per- cessing Systems (NeurIPS), 2020. 6
ona, and Serge J. Belongie. The inaturalist species classi- [41] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He,
fication and detection dataset. In CVPR, pages 8769–8778, Bharath Hariharan, and Serge J. Belongie. Feature pyramid
2018. 3 networks for object detection. In IEEE/CVF Conference on
[28] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Computer Vision and Pattern Recognition (CVPR), 2017. 19
Huang, and Xinggang Wang. Mask scoring r-cnn. In [42] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
IEEE/CVF Conference on Computer Vision and Pattern Piotr Dollár. Focal loss for dense object detection. IEEE
Recognition (CVPR), 2019. 6 Transactions on Pattern Analysis and Machine Intelligence
[29] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yun- (TPAMI), 42(2):318–327, 2020. 2
ing Jiang. Acquisition of localization confidence for accurate [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
object detection. In The European Conference on Computer Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Vision (ECCV), 2018. 6 Zitnick. Microsoft COCO: Common Objects in Context.
[30] Kang Kim and Hee Seok Lee. Probabilistic anchor assign- In The European Conference on Computer Vision (ECCV),
ment with iou prediction for object detection. In The Euro- 2014. 1, 2, 3, 14, 24, 25
pean Conference on Computer Vision (ECCV), 2020. 6 [44] Ji Liu, Dong Li, Rongzhang Zheng, Lu Tian, and Yi Shan.
[31] Alexander Kirillov, Ross B. Girshick, Kaiming He, and Piotr Rankdetnet: Delving into ranking constraints for object de-
Dollár. Panoptic feature pyramid networks. In IEEE/CVF tection. In IEEE/CVF Conference on Computer Vision and
Conference on Computer Vision and Pattern Recognition Pattern Recognition (CVPR), pages 264–273, June 2021. 6
(CVPR), 2019. 7 [45] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bring-
mann, A. S. Ecker, M. Bethge, and W. Brendel. Bench-
[32] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten
marking robustness in object detection: Autonomous driving
Rother, and Piotr Dollar. Panoptic segmentation. In The
when winter is coming. In NeurIPS Workshop on Machine
IEEE Conference on Computer Vision and Pattern Recogni-
Learning for Autonomous Driving, 2019. 1
tion (CVPR), June 2019. 7
[46] Dimity Miller, Feras Dayoub, Michael Milford, and Niko
[33] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon.
Sünderhauf. Evaluating merging strategies for sampling-
Accurate uncertainties for deep learning using calibrated re-
based uncertainty techniques in object detection. In Inter-
gression. In International Conference on Machine Learning
national Conference on Robotics and Automation (ICRA),
(ICML), 2018. 5, 6
2019. 5
[34] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified [47] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart
uncertainty calibration. In Advances in Neural Information Golodetz, Philip Torr, and Puneet Dokania. Calibrating deep
Processing Systems (NeurIPS), volume 32, 2019. 5, 6 neural networks using focal loss. In H. Larochelle, M. Ran-
[35] Fabian Kuppers, Jan Kronenberger, Amirhossein Shantia, zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances
and Anselm Haselhoff. Multivariate confidence calibration in Neural Information Processing Systems, volume 33, pages
for object detection. In The IEEE/CVF Conference on Com- 15288–15299. Curran Associates, Inc., 2020. 5, 6, 8, 31
puter Vision and Pattern Recognition (CVPR) Workshops, [48] Muhammad Akhtar Munir, Muhammad Haris Khan,
2020. 1, 2, 5 M. Saquib Sarfraz, and Mohsen Ali. Towards improving cal-
[36] Fabian Kuppers, Jonas Schneider, and Anselm Haselhoff. ibration in object detection under domain shift. In Advances
Parametric and multivariate uncertainty calibration for re- in Neural Information Processing Systems (NeurIPS), 2022.
gression and object detection. In Safe Artificial Intelligence 5
for Automated Driving Workshop in The European Confer- [49] Kevin P. Murphy. Probabilistic Machine Learning: An in-
ence on Computer Vision, 2022. 1, 2, 3, 5 troduction. MIT Press, 2022. 4, 21
[37] Max-Heinrich Laves, Sontje Ihler, Jacob F. Fast, Lüder A. [50] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y.
Kahrs, and Tobias Ortmaier. Well-calibrated regression un- Ng. Reading digits in natural images with unsupervised fea-
certainty in medical imaging with deep learning. In Proceed- ture learning. In NIPS Workshop on Deep Learning and Un-
ings of the Third Conference on Medical Imaging with Deep supervised Feature Learning, 2011. 3
Learning, pages 393–412, 2020. 5, 6 [51] Lukás Neumann, Andrew Zisserman, and Andrea Vedaldi.
[38] Dan Levi, Liran Gispan, Niv Giladi, and Ethan Fetaya. Eval- Relaxed softmax: Efficient confidence auto-calibration for
uating and calibrating uncertainty prediction in regression safe pedestrian detection. In NIPS MLITS Workshop on Ma-
tasks. Sensors (Basel), 22 (15):5540–5550, 2022. 5, 6 chine Learning for Intelligent Transportation System, 2018.
[39] Xiang Li, Wenhai Wang, Xiaolin Hu, Jun Li, Jinhui Tang, 5
and Jian Yang. Generalized focal loss v2: Learning reli- [52] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang,
able localization quality estimation for dense object detec- Ghassen Jerfel, and Dustin Tran. Measuring calibration in
tion. In IEEE/CVF Conference on Computer Vision and Pat- deep learning. In IEEE/CVF Conference on Computer Vision
tern Recognition (CVPR), 2019. 6 and Pattern Recognition (CVPR) Workshops, June 2019. 5
[53] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan International Conference on Machine Learning (ICML),
Kalkan. Localization recall precision (LRP): A new perfor- 2019. 5
mance metric for object detection. In The European Confer- [65] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
ence on Computer Vision (ECCV), 2018. 2, 7, 24 Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
[54] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han,
Kalkan. A ranking-based, balanced loss function unifying Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et-
classification and localisation in object detection. In Ad- tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang,
vances in Neural Information Processing Systems (NeurIPS), Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov.
2020. 6 Scalability in perception for autonomous driving: Waymo
[55] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan open dataset. In IEEE/CVF Conference on Computer Vision
Kalkan. Rank & sort loss for object detection and instance and Pattern Recognition (CVPR), 2020. 1, 15
segmentation. In The International Conference on Computer [66] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng
Vision (ICCV), 2021. 3, 6, 23, 31 Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan,
[56] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Ak- Changhu Wang, and Ping Luo. SparseR-CNN: End-to-end
bas. Generating positive bounding boxes for balanced train- object detection with learnable proposals. In IEEE/CVF
ing of object detectors. In IEEE Winter Applications on Com- Conference on Computer Vision and Pattern Recognition
puter Vision (WACV), 2020. 6 (CVPR), 2018. 2
[57] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Ak- [67] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
bas. Imbalance problems in object detection: A review. IEEE Fully convolutional one-stage object detection. In IEEE/CVF
Transactions on Pattern Analysis and Machine Intelligence International Conference on Computer Vision (ICCV), 2019.
(TPAMI), pages 1–1, 2020. 5 6
[68] Vidit Vidit, Martin Engilberge, and Mathieu Salzmann. Clip
[58] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Ak-
the gap: A single domain generalization approach for object
bas. One metric to measure them all: Localisation recall
detection, 2023. 1, 8
precision (lrp) for evaluating visual detection tasks. IEEE
[69] Deng-Bao Wang, Lei Feng, and Min-Ling Zhang. Rethink-
Transactions on Pattern Analysis and Machine Intelligence,
ing calibration of deep neural networks: Do not be afraid of
pages 1–1, 2021. 2, 7, 24, 27, 33
overconfidence. In Advances in Neural Information Process-
[59] Tai-Yu Pan, Cheng Zhang, Yandong Li, Hexiang Hu, Dong ing Systems (NeurIPS), 2021. 5
Xuan, Soravit Changpinyo, Boqing Gong, and Wei-Lun
[70] Shaoru Wang, Jin Gao, Bing Li, and Weiming Hu. Nar-
Chao. On model calibration for long-tailed object detection
rowing the gap: Improved detector training with noisy loca-
and instance segmentation. In M. Ranzato, A. Beygelzimer,
tion annotations. IEEE Transactions on Image Processing,
Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,
31:6369–6380, 2022. 2
Advances in Neural Information Processing Systems, vol-
[71] Xin Wang, Thomas E Huang, Benlin Liu, Fisher Yu, Xiao-
ume 34, pages 2529–2542. Curran Associates, Inc., 2021.
long Wang, Joseph E Gonzalez, and Trevor Darrell. Robust
1
object detection via instance-level temporal cycle confusion.
[60] Francesco Pinto, Harry Yang, Ser-Nam Lim, Philip H. S. International Conference on Computer Vision (ICCV), 2021.
Torr, and Puneet K. Dokania. Regmixup: Mixup as a regular- 1
izer can surprisingly improve accuracy and out distribution [72] Aming Wu and Cheng Deng. Single-domain generalized
robustness. In Advances in Neural Information Processing object detection in urban scene via cyclic-disentangled self-
Systems (NeurIPS), 2022. 5 distillation. In IEEE/CVF Conference on Computer Vision
[61] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. and Pattern Recognition, 2022. 1, 8
Faster R-CNN: Towards real-time object detection with re- [73] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying
gion proposal networks. IEEE Transactions on Pattern Anal- Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Dar-
ysis and Machine Intelligence (TPAMI), 39(6):1137–1149, rell. Bdd100k: A diverse driving dataset for heterogeneous
2017. 2, 3, 19, 23 multitask learning. In Proceedings of the IEEE/CVF Confer-
[62] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Eviden- ence on Computer Vision and Pattern Recognition (CVPR),
tial deep learning to quantify classification uncertainty. In S. June 2020. 1, 3, 15
Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa- [74] Bianca Zadrozny and Charles Elkan. Obtaining calibrated
Bianchi, and R. Garnett, editors, Advances in Neural Infor- probability estimates from decision trees and naive bayesian
mation Processing Systems, volume 31. Curran Associates, classifiers. In Internation Conference on Machine Learning
Inc., 2018. 4, 20 (ICML), volume 1, pages 609–616, 2001. 5, 7
[63] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang [75] Bianca Zadrozny and Charles Elkan. Transforming classifier
Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: scores into accurate multiclass probability estimates. In Pro-
A large-scale, high-quality dataset for object detection. In ceedings of the eighth ACM SIGKDD international confer-
IEEE/CVF International Conference on Computer Vision ence on Knowledge discovery and data mining, pages 694–
(ICCV), 2019. 3, 14 699, 2002. 5, 7
[64] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distri- [76] Haoyang Zhang, Ying Wang, Feras Dayoub, and Niko
bution calibration for regression. In Proceedings of the 36th Sünderhauf. Varifocalnet: An iou-aware dense object de-
tector. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2021. 6
[77] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and
Stan Z. Li. Bridging the gap between anchor-based and
anchor-free detection via adaptive training sample selection.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020. 3, 19, 23, 31
[78] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. De-
formable convnets v2: More deformable, better results.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019. 31
[79] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang,
and Jifeng Dai. Deformable {detr}: Deformable transform-
ers for end-to-end object detection. In International Confer-
ence on Learning Representations (ICLR), 2021. 3, 19, 23
APPENDICES

Contents
1. Introduction
2. Notations and Preliminaries
3. An Overview to the SAOD Task
4. Obtaining Image-level Uncertainty
5. Calibration of Object Detectors
   5.1. Localisation-aware ECE
   5.2. Impact of Top-k Survival on Calibration
   5.3. Post hoc Calibration of Object Detectors
6. Baseline SAODets and Their Evaluation
7. Conclusive Remarks
D.2. Sensitivity of LaECE to TP validation threshold
D.3. Derivation of Eq. (5)
D.4. More Examples of Reliability Diagrams
D.5. Numerical Values of Fig. 5
E. Further Details on SAOD and SAODets
   E.1. Algorithms to Make an Object Detector Self-Aware
   E.2. Sensitivity of the SAOD Performance Measures to the Image-level Uncertainty Threshold and Detection Confidence Threshold
   E.3. Effect of common improvement strategies on DAQ
   E.4. The Impact of Domain-shift on Detection-level Confidence Score Thresholding
   E.5. Qualitative Results of SAODets in comparison to Conventional Object Detectors
   E.6. Suggestions for Future Work
Figure A.6. Distribution of the objects over classes from our test sets and existing val sets. For both SAOD-Gen and SAOD-AV use-cases,
our DID have more objects nearly for all classes to provide a thorough evaluation. Note that y-axes are in log-scale.
Figure A.7. Aligning the annotations of certain classes in BDD100K and nuImages datasets while curating our BDD45K test set. The
riders and ridables (bicycles or motorcycles) need to be combined properly in (a). In this example, both of the rider objects are properly
assigned to the corresponding bicycle objects by our simple method relying on Hungarian algorithm. In (b), which we use as a test image
in our BDD45K, the bounding boxes are combined by finding the smallest enclosing bounding box and the objects are labelled as bicycles.
should be no rider class, but bicycle and motorcycle objects include their riders in the resulting annotations. To do so, we use a simple matching algorithm on BDD100K images to combine bicycle and motorcycle objects with their riders. In particular, given an image, we first identify objects from the bicycle, motorcycle and rider categories. Then, we group bicycle and motorcycle objects as "rideables" and compute the IoU between each rideable and rider object. Given this matrix representing the proximity between each rideable and rider object in terms of their IoUs, we assign riders to rideables by maximizing the total IoU using the Hungarian assignment algorithm [3]. Furthermore, we include a sanity check to avoid possible matching errors, e.g., in which a rideable object might be combined with a rider in a further location in the image due to possible annotation errors. Specifically, our simple sanity check is to require a minimum IoU overlap of 0.10 between a rider and its assigned rideable in the resulting assignment from the Hungarian algorithm. Otherwise, if any of the riders is assigned to a rideable object with an IoU less than 0.10 in an image, we simply do not include this image in our BDD45K test set. Finally, exploiting the assignment result, we obtain the bounding box annotation using the smallest enclosing bounding box including both the bounding box of the rider and that of the rideable object. As for the category annotation of the object, we simply use the category of the rideable, which is either bicycle or motorcycle. Fig. A.7 presents an example in which we convert BDD100K annotations of these specific classes into the nuImages format. To validate our approach, we manually examine more than 2500 images in the BDD45K test set and observe that it is effective in aligning the annotations of nuImages and BDD100K.
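A sketch of this assignment step is given below; it assumes the pairwise IoU matrix has already been computed and uses SciPy's Hungarian solver, with the 0.10 sanity check applied to the resulting pairs. The function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_riders_to_rideables(iou: np.ndarray, min_iou: float = 0.10):
    """Assign riders to rideables (bicycles/motorcycles) by maximising the total IoU.

    iou: R x B matrix of IoUs between the R riders and B rideables in one image.
    Returns the (rider, rideable) index pairs, or None if any assigned pair falls
    below the sanity threshold, in which case the image is discarded.
    """
    rider_idx, rideable_idx = linear_sum_assignment(-iou)  # negate to maximise IoU
    pairs = list(zip(rider_idx.tolist(), rideable_idx.tolist()))
    if any(iou[r, b] < min_iou for r, b in pairs):
        return None
    return pairs
```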
Figure A.8. The diversity of the BDD45K split in terms of weather, time of day and scene counts.

Overall, using this strategy, we collect 45K images from the training and validation sets of BDD100K and construct our BDD45K split. We would like to highlight that our BDD45K dataset is diverse and extensive, where (i) it is
larger compared to the 16K images of the nuImages val set; and (ii) it includes 543K objects in total, significantly more than the 96K objects from these 3 classes in the nuImages val set. Please refer to Fig. A.6(b) for a quantitative comparison. In terms of diversity, our BDD45K (DID) comes from a different distribution than nuImages (DTrain), thereby introducing natural covariate shift. Fig. A.8 illustrates that our BDD45K is very diverse and that it is collected from different cities using different camera types than nuImages (DTrain). As a result, as we will see in Sec. B, the accuracy of the models drops significantly from DVal to DID even before the corruptions are employed. We note that ImageNet-C corruptions are then applied to this dataset, further increasing the domain shift. We do not penalize a detector as long as it can infer that it is uncertain and rejects such images with high corruption severity.

A.3. SiNObj110K-OOD Split

This split is designed to evaluate the reliability of the uncertainties. Following similar work [13, 21], we ensure that the images in our OOD test set do not include any object from the ID classes. Specifically, in order to use SiNObj110K-OOD within both the SAOD-Gen and SAOD-AV datasets, we select an image for SiNObj110K-OOD if the image does not include an object from either of the ID classes of Obj45K or BDD45K (DID). Then, we collect 110K images from three different detection datasets as detailed below:
• SVHN subset of SiNObj110K-OOD. We include all
A.2. Obj45K-C and BDD45K-C Splits 46470 full numbers (not cropped digits) using both
While constructing Obj45K-C and BDD45K-C as training and test sets of SVHN dataset in our OOD test
T (DID ), we use the following 15 different corruptions from set.
4 main groups [25]: • iNaturalist OOD subset of SiNObj110K-OOD. We use
• Noise. gaussian noise, shot noise, impulse noise, the validation set of iNaturalist 2017 object detection
speckle noise dataset to obtain our iNaturalist dataset. Specifically,
we include 28768 images in our OOD test set with the
• Blur. defocus blur, motion blur, gaussian blur following classes:
(e) Clean Image - Obj45K (f) JPEG compression (g) Elastic transform (h) Frost
Figure A.9. Clean and corrupted images using different transformations at severity 5 from AV-OD (upper row) and Gen-OD (lower row)
use-cases. We do not penalize a detector if it can infer that it is uncertain and rejects such images with high corruption severity.
’ Power o u t l e t ’ , ’ A i r C o n d i t i o n e r ’ , ’ B r u s h ’ , ’ P e n g u i n ’ , ’ Megaphone ’ ,
’ Hockey S t i c k ’ , ’ P a d d l e ’ , ’ B a l l o n ’ , ’ Corn ’ , ’ L e t t u c e ’ , ’ G a r l i c ’ ,
’ T r i p o d ’ , ’ Hanger ’ , ’ Swan ’ , ’ H e l i c o p t e r ’ , ’ Green Onion ’ ,
’ B l a c k b o a r d / W h i t e b o a r d ’ , ’ Napkin ’ , ’ N u t s ’ , ’ I n d u c t i o n Cooker ’ ,
’ O t h e r F i s h ’ , ’ T o i l e t r y ’ , ’ Tomato ’ , ’ Broom ’ , ’ Trombone ’ , ’ Plum ’ ,
’ L a n t e r n ’ , ’ Fan ’ , ’ Pumpkin ’ , ’ G o l d f i s h ’ , ’ Kiwi f r u i t ’ ,
’ Tea p o t ’ , ’ Head Phone ’ , ’ S c o o t e r ’ , ’ R o u t e r / modem ’ , ’ P o k e r Card ’ ,
’ S t r o l l e r ’ , ’ C r a n e ’ , ’ Lemon ’ , ’ Shrimp ’ , ’ S u s h i ’ , ’ C h e e s e ’ ,
’ S u r v e i l l a n c e Camera ’ , ’ J u g ’ , ’ P i a n o ’ , ’ Notepaper ’ , ’ Cherry ’ , ’ P l i e r s ’ ,
’ Gun ’ , ’ S k a t i n g and S k i i n g s h o e s ’ , ’CD ’ , ’ P a s t a ’ , ’ Hammer ’ ,
’ Gas s t o v e ’ , ’ S t r a w b e r r y ’ , ’ Cue ’ , ’ Avocado ’ , ’ Hamimelon ’ ,
’ Other B a l l s ’ , ’ Shovel ’ , ’ Pepper ’ , ’ Mushroon ’ , ’ S c r e w d r i v e r ’ , ’ Soap ’ ,
’ Computer Box ’ , ’ T o i l e t P a p e r ’ , ’ Recorder ’ , ’ Eggplant ’ ,
’ Cleaning Products ’ , ’ Chopsticks ’ , ’ Board E r a s e r ’ , ’ C o c o n u t ’ ,
’ P i g e o n ’ , ’ C u t t i n g / c h o p p i n g Board ’ , ’ Tape Measur / R u l e r ’ , ’ P i g ’ ,
’ Marker ’ , ’ L a d d e r ’ , ’ R a d i a t o r ’ , ’ Showerhead ’ , ’ Globe ’ , ’ C h i p s ’ ,
’ Grape ’ , ’ P o t a t o ’ , ’ S a u s a g e ’ , ’ S t e a k ’ , ’ S t a p l e r ’ , ’ Campel ’ ,
’ V i o l i n ’ , ’ Egg ’ , ’ F i r e E x t i n g u i s h e r ’ , ’ Pomegranate ’ , ’ Dishwasher ’ ,
’ Candy ’ , ’ C o n v e r t e r ’ , ’ B a t h t u b ’ , ’ Crab ’ , ’ Meat b a l l ’ , ’ R i c e Cooker ’ ,
’ G o l f Club ’ , ’ Cucumber ’ , ’ Tuba ’ , ’ C a l c u l a t o r ’ ,
’ Cigar / C i g a r e t t e ’ , ’ P a i n t Brush ’ , ’ Papaya ’ , ’ Antelope ’ , ’ S e a l ’ ,
’ P e a r ’ , ’ Hamburger ’ , ’ B u t t e f l y ’ , ’ Dumbbell ’ ,
’ E x t e n t i o n Cord ’ , ’ Tong ’ , ’ F o l d e r ’ , ’ Donkey ’ , ’ L i o n ’ , ’ D o l p h i n ’ ,
’ e a r p h o n e ’ , ’ Mask ’ , ’ K e t t l e ’ , ’ Electric Drill ’ , ’ Jellyfish ’ ,
’ Swing ’ , ’ C o f f e e Machine ’ , ’ S l i d e ’ , ’ Treadmill ’ , ’ Lighter ’ ,
’ Onion ’ , ’ Green b e a n s ’ , ’ P r o j e c t o r ’ , ’ G r a p e f r u i t ’ , ’Game b o a r d ’ ,
’ Washing Machine / D r y i n g Machine ’ , ’Mop ’ , ’ R a d i s h ’ ,
’ P r i n t e r ’ , ’ Watermelon ’ , ’ Saxophone ’ , ’ Baozi ’ , ’ Target ’ , ’ French ’ ,
’ T i s s u e ’ , ’ I c e cream ’ , ’ H o t a i r b a l l o n ’ , ’ S p r i n g R o l l s ’ , ’ Monkey ’ , ’ R a b b i t ’ ,
’ Cello ’ , ’ French F r i e s ’ , ’ Scale ’ , ’ P e n c i l Case ’ , ’ Yak ’ ,
’ Trophy ’ , ’ Cabbage ’ , ’ B l e n d e r ’ , ’ Red Cabbage ’ , ’ B i n o c u l a r s ’ ,
’ P e a c h ’ , ’ R i c e ’ , ’ Deer ’ , ’ Tape ’ , ’ Asparagus ’ , ’ B a r b e l l ’ ,
’ C o s m e t i c s ’ , ’ Trumpet ’ , ’ P i n e a p p l e ’ , ’ S c a l l o p ’ , ’ Noddles ’ ,
’ Mango ’ , ’ Key ’ , ’ H u r d l e ’ , ’Comb ’ , ’ Dumpling ’ ,
’ F i s h i n g Rod ’ , ’ Medal ’ , ’ F l u t e ’ , ’ O y s t e r ’ , ’ Green V e g e t a b l e s ’ ,
Table A.8. COCO-style AP of the used object detectors on val- of the image within the range of [480, 800] by limiting its
idation set (DVal ) and test set (DID ), along with their corrupted longer size to 1333 and keeping the original aspect ratio; or
versions (T (DVal ) and T (DID )). (ii) a sequence of
T (DVal ) T (DID )
Dataset Detector DVal DID • randomly resizing the shorter side of the image within
C1 C3 C5 C1 C3 C5
F-RCNN 39.9 31.3 20.3 10.8 27.0 20.3 12.8 6.9 the range of [400, 600] by limiting its longer size to
RS-RCNN 42.0 33.7 21.8 11.6 28.6 21.7 13.7 7.3 4200 and keeping the original aspect ratio,
SAOD ATSS 42.8 33.9 22.3 11.9 28.8 22.0 14.0 7.3
Gen D-DETR 44.3 36.2 24.0 12.2 30.5 23.4 15.4 8.0
• random cropping with a size of [384, 600],
NLL-RCNN 40.1 31.0 20.0 11.6 26.9 20.3 12.9 6.8
ES-RCNN 40.3 31.6 20.3 11.7 27.2 20.6 13.0 6.9
SAOD F-RCNN 55.0 44.9 31.1 16.7 23.2 19.8 12.8 7.2 • randomly resizing the shorter side of the cropped im-
AV ATSS 56.9 47.1 34.1 18.9 25.1 21.7 14.8 8.6 age within the range of [480, 800] by limiting its
longer size to 1333 and keeping the original aspect ra-
tio.
’ Cosmetics Brush / E y e l i n e r P e n c i l ’ ,
’ Chainsaw ’ , ’ E r a s e r ’ , ’ L o b s t e r ’ ,
Unless otherwise noted, we train all of the detectors (as
’ D u r i a n ’ , ’ Okra ’ , ’ L i p s t i c k ’ ,
aforementioned, with the exception of D-DETR, which is
’ Trolley ’ , ’ Cosmetics Mirror ’ ,
trained for 50 epochs following its recommended settings
’ Curling ’ , ’ Hoverboard ’ ,
[79]) for 36 epochs using 16 images in a batch on 8 GPUs.
’ P l a t e ’ , ’ Pot ’ ,
Following the previous works, we use the initial learning
’ E x t r a c t o r ’ , ’ Table T e n i i s paddle ’
rates of 0.020 for F-RCNN, NLL-RCNN and ES-RCNN;
Using both training and validation sets of Objects365, 0.010 for ATSS; and 0.012 for RS-RCNN. We decay the
we collect 35190 images that only contains objects learning rate by a factor of 10 after epochs 27 and 33. As
from above classes. a backbone, we use a ResNet-50 with FPN [41] for all the
models, as is common in practice. At test time, we simply
Consequently, our resulting SiNObj110K-OOD is both rescale the images to 800 × 1333 and do not use any test-
diverse and extensive compared to the datasets introduced in time augmentation. For the rest of the design choices, we
previous work [13, 21] which includes around 1-2K images follow the recommended settings of the detectors.
and is collected from a single dataset. As for SAOD-AV, we train F-RCNN [61] and ATSS [77]
on nuImages training set by following the same design
B. Details of the Used Object Detectors choices. We note that these models are trained using the
annotations of the three classes (pedestrian, vehicle
Here we demonstrate the details of the selected object
and bicycle) in nuImages dataset.
detectors and ensure that their performance is inline with
their expected results. We build our SAOD framework We display baseline results in Tab. A.8 on DVal ,
upon the mmdetection framework [7] since it enables us T (DVal ), DID and T (DID ) data splits, which shows the
using different datasets and models also with different de- performance on the COCO val set (DVal of SAOD-Gen in
sign choices. As presented in Sec. 3, we use four conven- the table) is inline or higher with those published in the cor-
tional and two probabilistic object detectors. We exploit responding papers. We would like to note that the perfor-
all of these detectors for our SAOD-Gen setting by training mance on DVal is lower than that on DID due to (i) more
them on the COCO training set as DTrain . We train all the challenging nature of Object365/BDD100K compared to
detectors with the exception of D-DETR. As for D-DETR, COCO/nuImages and (ii) the domain shift between them.
we directly employ the trained D-DETR model released in As an example, AP drops ∼ 30 points from DVal (nuIm-
mmdetection framework. This D-DETR model is trained ages) to DID (BDD45K) even before the corruptions are ap-
for 50 epochs with a batch size of 32 images on 16 GPUs (2 plied. As expected, we also see a decrease in performance
images/GPU) and corresponds to the vanilla D-DETR (i.e., with increasing severity of corruptions.
not its two-stage version and without iterative bounding box
refinement). C. Further Details on Image-level Uncertainty
While training the detectors, we incorporate the multi-
scale training data augmentation used by D-DETR into This section presents further details on image-level un-
them in order to obtain stronger baselines. Specifically, the certainty including the motivation behind; the definitions of
multi-scale training data augmentation is sampled randomly the used uncertainty estimation techniques; and more anal-
from two alternatives: (i) randomly resizing the shorter side yses.
C.2. Definitions
Here, we provide the definitions of the detection-level
uncertainty estimation methods for classification and local-
isation as well as the aggregation techniques we used to ob-
tain image-level uncertainty estimates.
Figure A.11. The distribution of the image-level uncertainties obtained from different detectors on clean ID, corrupted ID with severities
1, 3, 5 and OOD data on SAOD-AV dataset.
C.3.5 The Effectiveness of Using a Pseudo-OOD val set for Image-level Uncertainty Thresholding

In order to compute the image-level uncertainty threshold ū and decide whether or not to accept an image, we presented a way to construct a pseudo-OOD val set in Sec. 4, since DVal only includes ID images. Here, we discuss the effectiveness of this pseudo-set approach. To do so, we prefer to have a baseline to compare our method against and demonstrate its effectiveness. However, to the best of our knowledge, there is no existing method that obtains such a threshold relying only on ID data for the OOD detection task. As a result, inspired by the performance measure [email protected] [13], we simply set the threshold ū to the value that corresponds to [email protected] and use it as a baseline. Note that this approach relies only on the ID val set and hence requires no OOD val set, which is similar to our pseudo-OOD approach. Tab. A.12 compares our pseudo-OOD approach with the [email protected] baseline; on average, our approach yields more than a 4.5 point gain in BA over the baseline, confirming its effectiveness.

Table A.12. Effectiveness of our pseudo-OOD set approach compared to using [email protected].

Task     Detector  Method      BA    TPR   TNR
Gen-OD   F-RCNN    [email protected]      83.2  98.5  72.0
                   pseudo-OOD  87.7  94.7  81.6
         RS-RCNN   [email protected]      84.0  98.3  73.4
                   pseudo-OOD  88.9  92.8  85.3
         ATSS      [email protected]      84.7  96.9  75.2
                   pseudo-OOD  87.8  93.1  83.0
         D-DETR    [email protected]      85.8  97.2  76.8
                   pseudo-OOD  88.9  90.0  87.8
SAOD-AV  F-RCNN    [email protected]      80.9  97.7  69.1
                   pseudo-OOD  91.0  94.1  88.2
         ATSS      [email protected]      83.5  96.7  73.5
                   pseudo-OOD  85.8  95.9  77.6
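For clarity, the [email protected] baseline above can be obtained directly from the image-level uncertainties of the ID val images: ū is set so that 95% of the ID images fall below it. A minimal sketch, assuming only a list of ID val uncertainties is available (the function name and the quantile-style implementation are ours):

```python
# Sketch of the [email protected] baseline: pick the uncertainty threshold ū such that
# roughly 95% of the ID val images are accepted (uncertainty <= ū). Assumes each
# image already has an image-level uncertainty, e.g. from mean(top-3).
def tpr_at_095_threshold(id_val_uncertainties: list[float], tpr: float = 0.95) -> float:
    """Return the threshold ū that accepts a `tpr` fraction of the ID val images."""
    sorted_u = sorted(id_val_uncertainties)
    index = min(int(tpr * len(sorted_u)), len(sorted_u) - 1)
    return sorted_u[index]

# At test time, an image with uncertainty u is accepted as ID if u <= ū.
```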
D. Further Details on Calibration of Object Detectors

This section provides further details and analyses on the calibration of object detectors.

D.1. How does AP benefit from low-scoring detections?

Here we show that enlarging the detection set with low-scoring detections provably does not decrease AP, thereby confirming the practical limitations previously discussed by Oksuz et al. [58].
[Figure A.12 panels, Recall (x-axis) vs. Precision (y-axis): (a) Interpolating PR Curve; (b) Case 1 of the proof; (c) Case 2 of the proof; (d) Case 3 of the proof.]
Figure A.12. Illustrations of (a) non-interpolated and interpolated PR curves; typically, the area under the interpolated PR curve is used as the AP value in object detection; (b), (c), (d) correspond to the three different cases we consider in the proof of Theorem 1. Following Theorem 1, in all three cases the area under the red curve is smaller than or equal to that under the blue curve.
As a result, instead of top-k predictions and AP, we require a thresholded detection set in the SAOD task and employ the LRP Error as a measure of accuracy to enforce this type of output.

Before proceeding, below we provide a formal definition of AP as a basis for our proof.

Definition of AP. AP is defined as the area under the Precision-Recall curve [15, 53, 58]. Here we formalise how to obtain this curve and the resulting AP in object detection given the detections and the ground truths. Considering the common practice, we first focus on the AP of a single class and then discuss the extension after our proof. More precisely, computing AP for class c from an IoU threshold of τ requires two sets:

• A set of detections obtained on the test set: This set is represented by tuples Ŷ = {b̂_i, p̂_i, X_i}_{i=1}^{N_c}, where b̂_i and p̂_i are the bounding box and confidence score of the ith detection respectively, X_i is the id of the image in which the ith detection resides, and N_c is the number of all detections across the dataset from class c. We assume that the number of detections obtained from a single image is less than k, where k is the upper bound within the context of top-k survival (Sec. 2); that is, there can be up to k detections from each image.

• A set of ground truths of the test set: This set is represented by tuples Y = {b_i, X_i}_{i=1}^{M_c}, where b_i is the bounding box of the ground truth and X_i is similarly the image id. M_c is the total number of ground-truth objects from class c across the dataset.

Then, the detections are matched with the ground truths to identify TP and FP detections using a matching algorithm. For the sake of completeness, we describe the matching algorithm used by the commonly-used COCO benchmark [43]:

1. All detections in Ŷ are first sorted with respect to the confidence score in descending order.

2. Then, going over this sorted list of detections, the jth detection is identified as a TP if there exists a ground truth that satisfies the following two conditions:
   • the ground truth is not previously assigned to any other detection with a larger confidence score than that of j, and
   • the IoU between the ground truth and the jth detection is more than τ, the TP validation threshold.
   Note that the second condition also implies that the jth detection and the ground truth that it matches should reside in the same image. If there is a single ground truth satisfying these two conditions, then j is matched with that ground truth; if more than one ground truth satisfies these conditions, then the jth detection is matched with the ground truth it has the largest IoU with.

3. Upon completing this sorted list, the detections that are not matched with any ground truth are identified as FPs.

This matching procedure enables us to determine which detections are TPs or FPs. Now, let L = [L_1, ..., L_{N_c}] be a binary vector that represents whether the ith detection is a TP or an FP, and assume that L is also sorted with respect to the confidence scores of the detections. Specifically, L_i = 1 if the ith detection is a TP; otherwise the ith detection is an FP and L_i = 0. Consequently, we need precision and recall pairs in order to obtain the Precision-Recall curve, the area under which corresponds to AP. Noting that precision is the ratio between the number of TPs and the number of all detections, and recall is the ratio between the number of TPs and the number of ground truths, we can obtain these pairs by leveraging L. Denoting the precision and recall vectors by Pr = [Pr_1, ..., Pr_{N_c}] and Re = [Re_1, ..., Re_{N_c}] respectively, the ith elements of these vectors can be obtained by:

Pr_i = \frac{\sum_{k=1}^{i} L_k}{i}, \quad Re_i = \frac{\sum_{k=1}^{i} L_k}{M_c}.  (A.19)

Since the precision values Pr_i obtained this way may not be a monotonically decreasing function of recall, there can be wiggles in the Precision-Recall curve. Therefore, it is common in object detection [15, 18, 43] to interpolate the precisions Pr to make them monotonically decreasing with respect to the recall Re. Denoting the interpolated precision vector by P̄r = [P̄r_1, ..., P̄r_{N_c}], its ith element P̄r_i is obtained as follows:

P̄r_i = \max_{k : Re_k \geq Re_i} Pr_k.  (A.20)

Finally, Eq. (A.20) also allows us to interpolate the PR curve to the precision and recall axes. Namely, we include the pairs (i) P̄r_1 with recall 0, and (ii) precision 0 with recall Re_{N_c}. This allows us to obtain the final Precision-Recall curve using these two additional points as well as the vectors P̄r_i and Re_i. Then, the area under this curve corresponds to the Average Precision of the detection set Ŷ for the IoU validation threshold τ, which we denote as AP_τ(Ŷ). As an example, Fig. A.12(a) illustrates a PR curve before and after interpolation. Based on this definition, we now prove that low-scoring detections do not harm AP.
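As an illustration of Eqs. (A.19) and (A.20), the sketch below computes single-class AP_τ from the score-sorted TP/FP labels L and the number of ground truths M_c. It is a simplified reading of the definition above (the greedy matching step is assumed to have produced L already) and is not the official COCO evaluation code.

```python
# Sketch of single-class AP following Eqs. (A.19)-(A.20): precision/recall from
# the score-sorted TP/FP labels L, interpolation, then the area under the curve.
# Simplified illustration only; not the official COCO evaluation code.
def average_precision(L: list[int], num_gt: int) -> float:
    """L[i] = 1 if the i-th highest-scoring detection is a TP, else 0."""
    if num_gt == 0 or not L:
        return 0.0
    tp_cum, precisions, recalls = 0, [], []
    for i, label in enumerate(L, start=1):          # Eq. (A.19)
        tp_cum += label
        precisions.append(tp_cum / i)
        recalls.append(tp_cum / num_gt)
    # Eq. (A.20): make precision monotonically decreasing w.r.t. recall.
    interpolated = precisions[:]
    for i in range(len(interpolated) - 2, -1, -1):
        interpolated[i] = max(interpolated[i], interpolated[i + 1])
    # Extend the curve to recall 0 (with precision P̄r_1) and integrate.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(interpolated, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Example: average_precision([1, 1, 0, 1, 0], num_gt=4) ≈ 0.69
```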
Theorem 1. Given two sets of detections Ŷ = {b̂_i, p̂_i, X_i}_{i=1}^{N_c} and Ŷ' = {b̂_j, p̂_j, X_j}_{j=1}^{N_c'}, and denoting p_min = min_{{b̂_i, p̂_i, X_i} ∈ Ŷ} p̂_i and p_max = max_{{b̂_j, p̂_j, X_j} ∈ Ŷ'} p̂_j, if p_max < p_min, then AP_τ(Ŷ) ≤ AP_τ(Ŷ ∪ Ŷ').

Proof. We denote the precision and recall values used to compute AP_τ(Ŷ) by Pr = [Pr_1, ..., Pr_{N_c}] and Re = [Re_1, ..., Re_{N_c}], and similarly the interpolated precision by P̄r = [P̄r_1, ..., P̄r_{N_c}]. We aim to obtain these vectors for AP_τ(Ŷ ∪ Ŷ') in order to compare the resulting AP_τ(Ŷ ∪ Ŷ') with AP_τ(Ŷ). To do so, we introduce Pr', Re' and P̄r' as the precision, recall and interpolated precision vectors of AP_τ(Ŷ ∪ Ŷ') respectively.

By definition, the numbers of elements in Pr', Re' and P̄r' are equal to the number of detections in Ŷ ∪ Ŷ', which is simply N_c + N_c'. More precisely, we need to determine the following three vectors to be able to obtain AP_τ(Ŷ ∪ Ŷ'):

Pr' = {Pr'_1, ..., Pr'_{N_c}, Pr'_{N_c+1}, ..., Pr'_{N_c+N_c'}}  (A.21)
Re' = {Re'_1, ..., Re'_{N_c}, Re'_{N_c+1}, ..., Re'_{N_c+N_c'}}  (A.22)
P̄r' = {P̄r'_1, ..., P̄r'_{N_c}, P̄r'_{N_c+1}, ..., P̄r'_{N_c+N_c'}}  (A.23)

As an additional insight into those three vectors, p_max < p_min implies the following:

• The first N_c elements of Pr', Re' and P̄r' account for the precision, recall and interpolated precision values computed on the detections from Ŷ; and
• their elements between N_c + 1 and the last element (N_c + N_c') correspond to the precision, recall and interpolated precision values computed on the detections from Ŷ'.

Note that, by definition, computing precision and recall on the ith detection only considers the detections with higher scores than that of i (and ignores the ones with lower scores than that of i), since the list of labels denoted by L in Eq. (A.19) is sorted with respect to the confidence scores. As a result, the following holds for the precision and recall values (but not the interpolated precision):

Pr'_i = Pr_i, and Re'_i = Re_i for i ≤ N_c.  (A.24)

Then, the difference between AP_τ(Ŷ) and AP_τ(Ŷ ∪ Ŷ') depends on two aspects:

1. Pr'_i and Re'_i for N_c < i ≤ N_c + N_c'; and
2. the interpolated precision vector P̄r' of Ŷ ∪ Ŷ', to be obtained using Pr' and Re' based on Eq. (A.20).

For the rest of the proof, we enumerate all three possible cases for Ŷ' and identify these aspects.

Case (1): Ŷ' does not include any TP. This case implies that the detections in Ŷ' are all FPs, and neither the number of TPs nor the number of FNs changes for N_c < i ≤ N_c + N_c', implying:

Re'_i = Re_{N_c}, for N_c < i ≤ N_c + N_c'.  (A.25)

As for the precision, it is monotonically decreasing as i increases between N_c < i ≤ N_c + N_c' since the number of FPs increases, that is,

Pr'_{i-1} > Pr'_i, for N_c < i ≤ N_c + N_c'.  (A.26)

Having identified Pr'_i and Re'_i for N_c < i ≤ N_c + N_c', we now obtain the interpolated precision P̄r'. To do so, we consider P̄r' in two parts: up to and including its N_c th element, and its remaining part. Since Pr_{N_c} > Pr'_i for N_c < i ≤ N_c + N_c', the low-scoring detections in Ŷ' do not affect P̄r'_i for i ≤ N_c considering Eq. (A.20), implying:

P̄r'_i = P̄r_i, for i ≤ N_c.  (A.27)

As for N_c < i ≤ N_c + N_c', since Re'_i = Re_{N_c}, P̄r'_i = P̄r_{N_c} holds. As a result, the detections from Ŷ' all have equal recall and interpolated precision, which are also equal to Re_{N_c} and P̄r_{N_c}, implying that they do not introduce new points to the Precision-Recall curve used to obtain AP_τ(Ŷ). Therefore, AP_τ(Ŷ) = AP_τ(Ŷ ∪ Ŷ') in this case.

Fig. A.12(b) illustrates this case to provide more insight. In particular, when there is no TP among the low-scoring detections (Ŷ'), no new points are introduced compared to the PR curve of Ŷ, and the resulting AP after including the low-scoring detections does not change.

Case (2): Ŷ' includes TPs and max_{N_c < i ≤ N_c + N_c'} (Pr'_i) ≤ min_{i ≤ N_c} (P̄r_i). Note that max_{N_c < i ≤ N_c + N_c'} (Pr'_i) ≤ min_{i ≤ N_c} (P̄r_i) implies that the interpolated precisions computed on the detection set Ŷ (P̄r_i for i ≤ N_c) are not affected by the detections in Ŷ'. As a result, Eq. (A.24) can simply be extended to the interpolated precisions:

P̄r'_i = P̄r_i for i ≤ N_c.  (A.28)

Considering the area under the curve of the pairs P̄r'_i and Re'_i = Re_i for i ≤ N_c, it is already guaranteed that AP_τ(Ŷ) ≤ AP_τ(Ŷ ∪ Ŷ'), completing the proof for this case. To provide more insight, we also briefly explore the effect of the remaining detections, that is, the detections with N_c < i ≤ N_c + N_c', which include TPs. Assume that the jth detection is the TP with the highest confidence score among the detections with N_c < i ≤ N_c + N_c'. Then, for the jth detection, 0 < P̄r'_j < P̄r_{N_c}, as max_{N_c < i ≤ N_c + N_c'} (Pr'_i) ≤ min_{i ≤ N_c} (P̄r_i) by definition. Moreover, since the number of TPs increases and the number of ground truths is fixed, Re'_j > Re_{N_c}. This implies that the PR curve now has precision P̄r'_j > 0 for some recall Re'_j. Note that the precision was implicitly 0 at Re'_j for the detection set Ŷ, since this new ground truth could not be retrieved regardless of the number of predictions. Accordingly, the area under the PR curve of Ŷ ∪ Ŷ' increases compared to that of Ŷ, and it is guaranteed that AP_τ(Ŷ) < AP_τ(Ŷ ∪ Ŷ') in this case. As depicted in Fig. A.12(c), the PR curve of Ŷ is extended towards higher recall (compare the blue curve with the red one), resulting in a larger AP_τ(Ŷ ∪ Ŷ') compared to AP_τ(Ŷ).

Case (3): Ŷ' includes TPs and max_{N_c < i ≤ N_c + N_c'} (Pr'_i) > min_{i ≤ N_c} (P̄r_i). Unlike Case (2), this case implies that upon merging Ŷ and Ŷ', some of the P̄r_i of Ŷ with Pr'_j > P̄r_i will be replaced by a larger value due to Eq. (A.20), i.e., P̄r'_i > P̄r_i for some i, while the rest will be equal, similar to Case (2). This simply implies that AP_τ(Ŷ) < AP_τ(Ŷ ∪ Ŷ'). Fig. A.12(d) includes an illustration for this case, demonstrating that the PR curve of Ŷ is extended along both axes: (i) owing to the interpolation, thanks to a TP in Ŷ' with higher precision in Ŷ ∪ Ŷ', it is extended along the precision axis; and (ii) thanks to a new TP in Ŷ', it is extended along the recall axis. Note that in our proof for this case, we only discussed the extension in precision, since each of the extensions alone is sufficient to show AP_τ(Ŷ) < AP_τ(Ŷ ∪ Ŷ').

Discussion. Theorem 1 can also be extended to COCO-style AP. To be more specific and revisit the definition of COCO-style AP, first the class-wise COCO-style APs are obtained by averaging the APs computed over τ ∈ {0.50, 0.55, ..., 0.95} for a single class. Then, the detector COCO-style AP is again the average of the class-wise APs. Considering that the arithmetic mean is a monotonically increasing function, Theorem 1 also applies to the class-wise COCO-style AP and the detector COCO-style AP. More precisely, in the case that Case (1) applies for some (or all) of the classes and the detections for the remaining classes stay the same, then following Case (1), COCO-style AP does not change. That is also the reason why we do not observe a change in COCO-style AP in Fig. 5(a) once we add dummy detections, which are basically FPs with lower scores. If Case (2) or Case (3) applies for at least one class, then COCO-style AP increases, considering the monotonically increasing nature of the arithmetic average. Following from this, we observe some decrease in COCO-style AP when we remove detections in Fig. 5(b), i.e., when we threshold and thereby remove some TPs. As a result, we conclude that AP, including COCO-style AP, encourages detections with lower scores.

D.2. Sensitivity of LaECE to the TP validation threshold

Here we analyse the sensitivity of LaECE to the TP validation threshold τ. Please note that we normally obtain class-wise LRP-optimal thresholds v̄ considering a specific τ on DVal, and then use the resulting detections while measuring the LRP Error and LaECE on the test set using the same IoU validation threshold τ. Namely, we use τ for two purposes: (i) to obtain the thresholds v̄; and (ii) to evaluate the resulting detections in terms of LaECE and LRP Error. As we aim to understand how LaECE, as a performance measure, behaves under different specifications of the TP validation threshold τ, we decouple these two purposes of τ by fixing the detection confidence threshold v̄ to the value obtained from a TP validation threshold of 0.10. This enables us to fix the detection set as the input of LaECE and only focus on how the performance measures behave when only τ changes.

Specifically, we use the F-RCNN detector, validate v̄ on the COCO validation set, and obtain the detections on the Obj45K test set using v̄. Then, given this set of detections, for different values of τ ∈ [0, 1], we compute:
Figure A.13. Sensitivity analysis of LaECE and LRP Error. We use the detections of F-RCNN on our Obj45K split. (a) For both the calibrated and the uncalibrated case, we observe that LaECE is not sensitive for τ ∈ [0.0, 0.5]. When τ gets larger, the misalignment between detection scores and performance increases for the uncalibrated case, while calibration becomes an easier problem since most of the detections are now FPs. In the extreme case that τ = 1 (a perfect localisation is required for a TP), there is no TP and it is sufficient to assign a confidence score of 0.00 to all of the detections to obtain 0 LaECE. (b) Sensitivity analysis of LRP Error. As also previously analysed [58], when τ increases, the number of FPs increases and LRP increases. In the extreme case when τ ≈ 1, LRP approaches 1.
• LaECE for uncalibrated detection confidence scores;
• LaECE for detection confidence scores calibrated using linear regression (LR); and
• LRP Error.

Fig. A.13 presents the results. In the extreme case that τ = 1, a perfect localisation is required for a TP. In this case, there is no TP and it is sufficient for a calibrator to assign a confidence score of 0.00 to all detections and achieve a perfect LaECE of 0. Finally, as also analysed before [58], when τ increases, the detection task becomes more challenging, and therefore the LRP Error, as a lower-better measure of accuracy, also increases. This is because the number of TPs decreases and the number of FPs increases as τ increases.

While choosing the TP validation threshold τ for our SAOD framework, we first consider that a proper τ should decompose the false positive and localisation errors properly. Looking at the literature, the general consensus of object detection analysis tools [1, 26] is to split the false positive and localisation errors by employing an IoU of 0.10. As a result, following these works, we set τ = 0.10 throughout the paper unless otherwise noted. Still, the TP validation threshold τ should be chosen according to the requirements of the specific application.

D.3. Derivation of Eq. (5)

In Sec. 5.3, we claim that the LaECE for a bin reduces to:

\left| \sum_{\hat{b}_i \in \hat{D}_j^c, \psi(i)>0} \left( t_i^{cal} - IoU(\hat{b}_i, b_{\psi(i)}) \right) + \sum_{\hat{b}_i \in \hat{D}_j^c, \psi(i) \leq 0} t_i^{cal} \right|.  (A.29)

To derive this, we start from the definition of LaECE, which can be expressed as

\sum_{j=1}^{J} \frac{|\hat{D}_j^c|}{|\hat{D}^c|} \left| \bar{p}_j - \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} 1}{|\hat{D}_j^c|} \times \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} 1} \right|,  (A.32)

as

precision^c(j) = \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} 1}{|\hat{D}_j^c|},  (A.33)

and

\overline{IoU}^c(j) = \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} 1}.  (A.34)

The expression \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} 1 (in the numerator of precision^c(j) and the denominator of \overline{IoU}^c(j)) corresponds to the number of TPs. Cancelling out these terms yields

\sum_{j=1}^{J} \frac{|\hat{D}_j^c|}{|\hat{D}^c|} \left| \bar{p}_j - \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{|\hat{D}_j^c|} \right|.  (A.35)

\bar{p}_j, the average of the confidence scores in bin j, can similarly be obtained as

\bar{p}_j = \frac{\sum_{\hat{b}_k \in \hat{D}_j^c} \hat{p}_k}{|\hat{D}_j^c|},  (A.36)

and replacing \bar{p}_j in Eq. (A.35) yields

\sum_{j=1}^{J} \frac{|\hat{D}_j^c|}{|\hat{D}^c|} \left| \frac{\sum_{\hat{b}_k \in \hat{D}_j^c} \hat{p}_k}{|\hat{D}_j^c|} - \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{|\hat{D}_j^c|} \right|.  (A.37)

Since a|x| = |ax| if a ≥ 0, we take \frac{|\hat{D}_j^c|}{|\hat{D}^c|} inside the absolute value, where the |\hat{D}_j^c| terms cancel out:

\sum_{j=1}^{J} \left| \frac{\sum_{\hat{b}_k \in \hat{D}_j^c} \hat{p}_k}{|\hat{D}^c|} - \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{|\hat{D}^c|} \right|.  (A.38)

Splitting \sum_{\hat{b}_k \in \hat{D}_j^c} \hat{p}_k into true positives and false positives as \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} \hat{p}_k and \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k) \leq 0} \hat{p}_k respectively, we have

\sum_{j=1}^{J} \left| \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} \hat{p}_k + \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k) \leq 0} \hat{p}_k}{|\hat{D}^c|} - \frac{\sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)})}{|\hat{D}^c|} \right|.  (A.39)

Considering that Eq. (A.39) is minimised when the error for each bin j is minimised to 0, we now focus on a single bin j. Note also that for each bin j, |\hat{D}^c| is a constant. As a result, minimising the following expression minimises the error for each bin, and hence LaECE:

\left| \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} \hat{p}_k + \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k) \leq 0} \hat{p}_k - \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} IoU(\hat{b}_k, b_{\psi(k)}) \right|.  (A.40)

By rearranging the terms, we have

\left| \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k)>0} \left( \hat{p}_k - IoU(\hat{b}_k, b_{\psi(k)}) \right) + \sum_{\hat{b}_k \in \hat{D}_j^c, \psi(k) \leq 0} \hat{p}_k \right|,  (A.41)

which reduces to Eq. (5) once \hat{p}_k is replaced by t_k^{cal}. This concludes the derivation and validates how we construct the targets t_k^{cal} while obtaining the pairs to train the calibrator.

Table A.13. Dummy detections decrease LaECE superficially with no effect on AP due to top-k survival. LRP Error penalises dummy detections and requires the detections to be thresholded properly. COCO val set is used.

Detector  Dummy det.  det/img.  LaECE ↓  AP ↑  LRP ↓
F-RCNN    None        33.9      15.1     39.9  86.5
          up to 100   100       3.9      39.9  96.8
          up to 300   300       1.4      39.9  98.8
          up to 500   500       0.9      39.9  99.2
ATSS      None        86.4      7.7      42.8  95.1
          up to 100   100       6.0      42.8  96.2
          up to 300   300       1.8      42.8  98.9
          up to 500   500       1.1      42.8  99.3

D.4. More Examples of Reliability Diagrams

We provide more examples of reliability diagrams in Fig. A.14 for F-RCNN and ATSS on SAOD-AV. To provide insight into the error on the set that the calibrator is trained with, Fig. A.14 (a-c) show the reliability diagrams on the val set, i.e., the split used to train the calibrator. On the val set, we observe that the isotonic regression method for calibration results in an LaECE of 0.0, thereby overfitting to the training data (Fig. A.14(b)).

On the other hand, the linear regression method ends up with a training LaECE of 5.0 (Fig. A.14(c)). Consequently, we observe that linear regression performs slightly better than isotonic regression on the BDD45K test split (Fig. A.14 (e,f)). Besides, when we compare Fig. A.14(e,f) with Fig. A.14(d), we observe that both isotonic regression and linear regression decrease the over-confidence of the baseline F-RCNN. As a different type of calibration error, ATSS, shown in Fig. A.14(g), is under-confident. Again, linear regression and isotonic regression improve the calibration performance of ATSS. This further validates on SAOD-AV that such post-hoc calibration methods are effective.

D.5. Numerical Values of Fig. 5

Tables A.13 and A.14 present the numerical values used in Fig. 5(a) and Fig. 5(b) respectively. Please refer to Sec. 5.2 for the details of the tables and the discussion.

E. Further Details on SAOD and SAODets

This section provides further details and analyses on the SAOD task and the SAODets.

E.1. Algorithms to Make an Object Detector Self-Aware

In Sec. 6, we summarised how we convert an object detector into a self-aware one. Specifically, to do so, we use mean(top-3) and obtain an uncertainty threshold ū through cross-validation.
Figure A.14. (First row) Reliability diagrams of F-RCNN on SAOD-AV DVal, which is used to obtain the set that we used for training the calibrators. (Second row) Reliability diagrams of F-RCNN on SAOD-AV DID (BDD45K). (Third row) Reliability diagrams of ATSS on SAOD-AV DID (BDD45K). Linear regression and isotonic regression improve the calibration performance of both the over-confident F-RCNN (compare (e) and (f) with (d)) and the under-confident ATSS (compare (h) and (i) with (g)).
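Concretely, the post-hoc calibrators discussed above can be trained on (confidence, target) pairs where, following the derivation in App. D.3, the target t^cal is the IoU with the matched ground truth for a TP and 0 for an FP. Below is a minimal sketch using scikit-learn's IsotonicRegression; the library choice and the function signature are ours, and the paper's exact implementation may differ.

```python
# Sketch of post-hoc calibration in the spirit of App. D.3/D.4: fit a monotone
# mapping from detection confidence to the target t_cal, where
# t_cal = IoU(detection, matched ground truth) for TPs and 0 for FPs.
# Library choice (scikit-learn) is ours; the paper's implementation may differ.
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(val_scores, val_ious, val_is_tp):
    """val_scores: detection confidences on the val set; val_ious: IoU with the
    matched ground truth (ignored for FPs); val_is_tp: booleans marking TPs."""
    targets = [iou if tp else 0.0 for iou, tp in zip(val_ious, val_is_tp)]
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(val_scores, targets)
    return calibrator

# At test time: calibrated_scores = calibrator.predict(raw_scores)
```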
[Figure A.15 panels: (a) image-level uncertainty threshold (x-axis, as TPR of the val set in %) and (b) detection-level confidence threshold (x-axis, p̂_i in %), against DAQ, BA, LRP, LaECE, LRPT and LaECET on the y-axis (performance or error).]

Figure A.15. The effect of image- and detection-level thresholds. DAQ (blue curve) decreases significantly for extreme cases such as when all images are rejected or all detections are accepted, implying its robustness to such cases. Here, for the sake of analysis simplicity, we use a single confidence score threshold (v̄) obtained on the final detection scores p̂_i in (b) instead of the class-wise approach that we used while building SAODets.

Table A.15. Effect of common improvements (epochs (Ep.), multi-scale (MS) training, stronger backbones) on F-RCNN (SAOD-Gen).

Ep.  MS  Backbone   DAQ   BA    mECE  LRP   mECET  LRPT  AP
12   ✗   R50        38.5  88.0  16.4  76.6  16.8   85.0  24.8
36   ✗   R50        38.4  87.4  18.7  75.9  20.5   85.0  25.5
36   ✓   R50        39.7  87.7  17.3  74.9  18.1   84.4  27.0
36   ✓   R101       42.0  88.1  17.5  73.4  19.0   82.8  28.7
36   ✓   R101-DCN   45.9  87.4  17.3  70.8  19.4   79.7  31.8

E.2. Sensitivity of the SAOD Performance Measures to the Image-level Uncertainty Threshold and Detection Confidence Threshold

Here, we explore the sensitivity of the performance measures used in our SAOD framework to the image-level uncertainty threshold ū and the detection confidence threshold v̄. To do so, we measure DAQ, BA, LRP and LaECE of F-RCNN on DTest of SAOD-Gen by systematically varying (i) the image-level uncertainty threshold ū ∈ [0, 1] and (ii) the detection-level confidence score threshold v̄ ∈ [0, 1]. Note that in this analysis, we do not use the LRP-optimal threshold for detection-level thresholding, which obtains a v̄ for each class, but instead employ a single threshold for all classes, enabling us to change this threshold easily. Fig. A.15 shows how the performance measures change for different image-level and detection-level thresholds. First, we observe that it is crucial to set both thresholds properly to achieve a high DAQ. More specifically, rejecting all images or accepting all detections (v̄ = 0) in Fig. A.15 results in a very low DAQ, highlighting the robustness of DAQ in these extreme cases. Second, setting a proper uncertainty threshold is also important for a high BA (Fig. A.15(a)), while BA is not affected by the detection-level threshold (Fig. A.15(b)), since BA indicates the OOD detection performance and is not related to the accuracy or calibration performance of the detector. Setting the image-level threshold properly also matters for better LRP values (green and purple curves in Fig. A.15(a)), as otherwise ID images are rejected with empty detection sets. Similarly, to achieve a high LRP, setting the detection confidence threshold properly is important as well. This is because a small confidence score threshold implies more FPs; conversely, a large threshold can induce more FNs. Finally, while we do not observe a significant effect of the uncertainty threshold on LaECE, a large number of detections due to a smaller detection confidence threshold generally results in a lower LaECE. This is also related to our previous analysis in Sec. 5.2, in which we show that more detections imply a lower LaECE, as depicted in Fig. A.15(b) when the threshold approaches 0. However, in that case, the LRP Error goes to 1 and, as a result, DAQ significantly decreases, thereby preventing the threshold from being set to a lower value to superficially boost the overall performance.

E.3. Effect of common improvement strategies on DAQ

Here, we analyse how common improvement strategies for object detectors affect DAQ in comparison with AP. To do so, we first use a simple but common baseline model: F-RCNN (SAOD-Gen) trained for 12 epochs without multi-scale training. Then, we gradually include the following four improvement strategies commonly used for object detection [6, 55, 77]:

1. increasing the number of training epochs,
2. using multi-scale training as described in Sec. B,
3. using ResNet-101 as a stronger backbone [22], and
4. using deformable convolutions [78].

Table A.15 shows the effect of these improvement strategies, where we see that stronger backbones increase DAQ, but mainly due to an improvement in LRP Error. It is also worth highlighting that more training epochs improve AP (e.g. going from 12 to 36 epochs improves AP from 24.8 to 25.5), but not DAQ, due to a degradation in LaECE. This is somewhat expected, as longer training improves accuracy but drastically makes the models over-confident [47].

E.4. The Impact of Domain Shift on Detection-level Confidence Score Thresholding

For detection-level confidence score thresholding, we employ LRP-optimal thresholds by cross-validating a threshold v̄ for each class using DVal against the LRP Error. While LRP-optimal thresholds are shown to be useful if the test set follows the same distribution as DVal, we note that our DID is collected from a different dataset, introducing domain shift as discussed in App. A.
nuImages val
BDD45K
0.8
LRP-optimal Thresholds
0.8
LRP-optimal Thresholds
0.6 0.6
0.4 0.4
snowboard
skateboard
surfboard
person
bicyclecar
bus
train
bird
bench
oven
airplane
motorcycle
boat
truck
light
hydrant
stopmetersign
dog
cat
sheep
horse
elephantcow
zebra
bear
giraffe
backpack
handbag tie
suitcase
frisbee
toaster
sink
book
clock
vase
bear
umbrella
skis
sports kiteball
baseballglove bat
racket
winebottlecup
glass
knife
fork
spoon
bowl
banana
apple
orange
sandwich
dog
hotcarrot
plant
bed
dining toilet
table
laptop tv
broccoli
donut
pizza
chair
cake
pottedcouch
mouse
keyboard
phone
remote
microwave
refrigerator
scissors
toothbrush
hair drier
pedestrian vehicle bicycle
Classes
traffic
teddy
baseball
parking
tennis
cell
fire
Classes
(a) F-RCNN (SAOD-Gen) (b) F-RCNN (SAOD-AV)
1.0 1.0
COCO val nuImages val
Obj45K BDD45K
0.8
LRP-optimal Thresholds
0.8
LRP-optimal Thresholds
0.6 0.6
0.4 0.4
0.2 0.2
0.0
0.0
snowboard
skateboard
surfboard
person
bicyclecar
cup
airplane
motorcycle
bus
train
boat
truck
light
hydrant
stopmetersign
bird
bench
dog
cat
sheep
horse
elephantcow
bear
zebra
giraffe
backpack
handbag tie
suitcase
frisbee
sports kite
knife
fork
spoon
bowl
banana
apple
orange
sandwich
dog
hotcarrot
laptop tv
umbrella
skis
ball
baseballglove bat
racket
winebottle
glass
broccoli
donut
pizza
chair
cake
pottedcouchplant
bed
dining toilet
table
mouse
keyboard
phone
oven
toaster
remote
sink
book
clock
vase
microwave
refrigerator
bear
scissors
toothbrush
hair drier
pedestrian vehicle bicycle
Classes
traffic
teddy
baseball
parking
tennis
cell
fire
Classes
(c) ATSS (SAOD-Gen) (d) ATSS (SAOD-AV)
Figure A.16. Comparison of (i) LRP-optimal thresholds obtained on DVal as presented and used in the paper (blue lines); and (ii) LRP-
optimal thresholds obtained on DID as oracle thresholds (red lines). Owing to the domain shift between DVal and DID , the optimal
thresholds do not match exactly. The thresholds between DVal and DID are relatively more similar for SAOD-Gen compared to SAOD-AV.
Table A.16. Evaluating self-aware object detectors. In addition to Tab. 6, this table includes the components of the LRP Error for more
insight. Particularly, LRPLoc , LRPFP , LRPFN correspond to the average 1-IoU of TPs, 1-precision and 1-recall respectively.
Gen  SA-F-RCNN   39.7 87.7 94.7 81.6 38.5 17.3 74.9 20.4 48.5 52.3 26.2 18.1 84.4 21.9 52.2 72.4
Gen  SA-RS-RCNN  41.2 88.9 92.8 85.3 39.7 17.1 73.9 19.3 47.8 51.9 27.5 17.8 83.5 20.4 50.8 72.1
Gen  SA-ATSS     41.4 87.8 93.1 83.0 39.7 16.6 74.0 18.5 47.8 52.8 27.8 18.2 83.2 20.2 53.2 71.1
Gen  SA-D-DETR   43.5 88.9 90.0 87.8 41.7 16.4 72.3 18.8 45.1 50.7 29.6 17.9 81.9 20.4 49.6 69.4
AV   SA-F-RCNN   43.0 91.0 94.1 88.2 41.5  9.5 73.1 26.3 13.2 58.1 28.8  7.2 83.0 26.7 12.2 74.7
AV   SA-ATSS     44.7 85.8 95.9 77.6 43.5  8.8 71.5 25.9 14.2 55.7 30.8  6.8 81.5 26.0 14.3 72.5
As a result, here we investigate whether the detection-level confidence score threshold is affected by domain shift. Comparing the thresholds in Fig. A.16, we observe that, for both of the settings, the optimal thresholds computed on the val and test sets rarely match.
Figure A.17. Qualitative results of F-RCNN vs. SA-F-RCNN on DOOD. The images in the first, second and third rows correspond to the SVHN, iNaturalist and Objects365 subsets of DOOD. While F-RCNN performs inference with non-empty detection sets, SA-F-RCNN properly rejects all of these images.
Image/Ground Truth Output of Obj. Det. Output of SAODet
Figure A.18. Qualitative Results of Object detectors and SAODets on DID . (First row) F-RCNN vs. SA-F-RCNN. (Second row) ATSS vs.
SA-ATSS. See text for discussion. The class labels and confidence scores of the detection boxes are visible once zoomed in.
Figure A.19. Qualitative results of F-RCNN vs. SA-F-RCNN on T(DID) using the SAOD-AV dataset. The first to third rows include images from T(DID) at severities 1, 3 and 5, as used in our experiments. The class labels and confidence scores of the detection boxes are visible once zoomed in. For each detector, we sample a transformation using the 'frost' corruption.
Image/Ground Truth Output of Obj. Det. Output of SAODet
Figure A.20. Failure cases of SAODets in comparison to object detector outputs. The first row includes an image from the iNaturalist subset of DOOD with the detections from ATSS and SA-ATSS trained on nuImages following our SAOD-AV dataset. While SA-ATSS removes most of the low-scoring detections, it still classifies the image as ID and performs inference. Similarly, the second row includes an image from the Objects365 subset of DOOD with the detections from F-RCNN and SA-F-RCNN trained on nuImages, again following our SAOD-AV dataset. SA-F-RCNN misclassifies the image as ID and performs inference.