0% found this document useful (0 votes)
16 views36 pages

Towards Building Self-Aware Object Detectors Via Reliable Uncertainty

Uploaded by

badem15188
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views36 pages

Towards Building Self-Aware Object Detectors Via Reliable Uncertainty

Uploaded by

badem15188
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

Towards Building Self-Aware Object Detectors via Reliable Uncertainty

Quantification and Calibration

Kemal Oksuz, Tom Joy, Puneet K. Dokania


Five AI Ltd., United Kingdom
{kemal.oksuz, tom.joy, puneet.dokania}@five.ai
arXiv:2307.00934v1 [cs.CV] 3 Jul 2023

Abstract

The current approach for testing the robustness of object


detectors suffers from serious deficiencies such as improper
methods of performing out-of-distribution detection and us- Detection
ing calibration metrics which do not consider both locali- Awareness
sation and classification quality. In this work, we address Quality
these issues, and introduce the Self Aware Object Detection
(SAOD) task, a unified testing framework which respects Figure 1. (Top) The vanilla object detection task vs. (Bottom)
and adheres to the challenges that object detectors face in the self-aware object detection (SAOD) task. Different from the
safety-critical environments such as autonomous driving. vanilla approach; the SAOD task requires the detector to: predict
Specifically, the SAOD task requires an object detector to â ∈ {0, 1} representing whether the image X is accepted or not
be: robust to domain shift; obtain reliable uncertainty es- for further processing; yield accurate and calibrated detections;
timates for the entire scene; and provide calibrated con- and be robust to domain shift. Accordingly, for SAOD we evaluate
on ID, domain-shift and OOD data using our novel DAQ measure.
fidence scores for the detections. We extensively use our
Here, {ĉi , b̂i , p̂i }N are the predicted set of detections.
framework, which introduces novel metrics and large scale
test datasets, to test numerous object detectors in two dif-
ferent use-cases, allowing us to highlight critical insights Whilst object detectors are able to obtain uncertainty at
into their robustness performance. Finally, we introduce the detection-level, they do not naturally produce uncer-
a simple baseline for the SAOD task, enabling researchers tainty at the image-level. This has lead researchers to of-
to benchmark future proposed methods and move towards ten evaluate uncertainty by performing out-of-distribution
robust object detectors which are fit for purpose. Code is (OOD) detection at the detection-level [13, 21], which can-
available at: https://github.com/fiveai/saod. not be clearly defined. Thereby creating a misunderstand-
ing between OOD and in-distribution (ID) data. This leads
to an improper evaluation, as defining OOD at the detection
level is non-trivial due to the presence of known-unknowns
1. Introduction or background objects [12]. Furthermore, the test sets for
OOD in such evaluations are small, typically containing
The safe and reliable usage of object detectors in safety
around 1-2K images [13, 21].
critical systems such as autonomous driving [10,65,73], de-
pends not only on its accuracy, but also critically on other Moreover, as there is no direct access to the labels of the
robustness aspects which are often only considered in addi- test sets and the evaluation servers only report accuracy [18,
tion or not all. These aspects represent its ability to be ro- 43], researchers have no choice but to use small validation
bust to domain shift, obtain well-calibrated predictions and sets as testing sets to report robustness performance, such
yield reliable uncertainty estimates at the image-level, en- as calibration and performance under domain shift. As a
abling it to flag the scene for human intervention instead of result, either the training set [11, 59]; the validation set [13,
making unreliable predictions. Consequently, the develop- 21]; or a subset of the validation set [36] is employed for
ment of object detectors for safety critical systems requires cross-validation, leading to an unideal usage of the dataset
a thorough evaluation framework which also accounts for splits and a poor choice of the hyper-parameters.
these robustness aspects, a feature lacking in current evalu- Finally, prior work typically focuses on only one of: cal-
ation methodologies. ibration [35, 36]; OOD detection [13]; domain-shift [45,
68, 71, 72]; or leveraging uncertainty to improve accuracy work predicting raw detections with bounding boxes b̂raw i
[5, 9, 20, 23, 70], with no prior work taking a holistic ap- and predicted class distribution p̂raw
i . Given these raw-
proach by evaluating all of them. Specifically for calibra- detections h(·) applies post-processing to obtain the fi-
tion, previous studies either consider classification calibra- nal detections1 . In general, h(·) comprises removing
tion [35], or localisation calibration [36], completely disre- the detections predicted as background; Non-Maximum-
garding the fact that object detection is a joint problem. Suppression (NMS) to discard duplicates; and keeping use-
In this paper, we address the critical need for a ro- ful detections, normally achieved through top-k survival,
bust testing framework which evaluates object detectors where in practice k = 100 for COCO dataset [43].
thoroughly, thus alleviating the aforementioned deficien- Evaluating the Performance of Object Detectors Av-
cies. To do this, we introduce the Self-aware Object De- erage Precision (AP) [15, 18, 43], or the area under the
tection (SAOD) task, which considers not only accuracy, precision-recall (PR) curve, has been the common perfor-
but also calibration using our novel Localisation-aware Ex- mance measure of object detection. Though widely ac-
pected Calibration Error (LaECE) as well as the reliability cepted, AP suffers from the following three main draw-
of image-level uncertainties. Furthermore, the introduction backs [58]. First, it only validates true-positives (TPs) us-
of LaECE addresses a critical gap in the literature as it re- ing a localisation quality threshold, completely disregarding
spects both classification and localisation quality, a feature the continuous nature of localisation. Second, as an area-
ignored in previous methods [35, 36]. Moreover, the SAOD under-curve (AUC) measure, AP is difficult to interpret, as
task requires an object detector to either perform reliably or PR curves with different characteristics can yield the same
reject images outside of its training domain. value. Also, AP rewards a detector that produces a large
We illustrate the SAOD task in Fig. 1, which not only number of low scoring detections than actual objects in the
RetinaNet
evaluates the accuracy, but also the calibration and perfor- image, which becomes a significant issue when relying on
mance under OOD or domain-shifted data. We can also see top-k survival as shown in Fig. 1. App. D includes details.
the functionality to reject an image, and to only produce Alternatively, the recently proposed Localisation-Recall-
detections which have a high confidence; unlike for a stan- Precision Error (LRP) [53, 58] combines the number of TP,
dard detector which has to accept every image and produce false-positive (FP), false-negative (FN), denoted by NTP ,
detections. To summarise, our main contributions are: NFP , NFN , respectively, as well as the Intersection-over-
• We introduce the SAOD task, which evaluates: accu- Union (IoU) of TPs with the objects that they match with:
racy; robustness to domain shift; ability to accept of re-
 
1 X
ject an image; and calibration in a unified manner. We NFP + NFN + (1 − lq(i)) (1)
NFP + NFN + NTP
further construct large datasets totaling 155K images ψ(i)>0

and provide a simple baseline for future researchers to


IoU(b̂ ,b
i ψ(i) )−τ
benchmark against. where lq(i) = 1−τ is the localisation quality
• We explore how to obtain image-level uncertainties with τ being the TP assignment threshold, ψ(i) is the index
from any object detector, enabling it to reject the entire of the object that a TP i matches to; else i is a FP and ψ(i) =
scene for the SAOD task. Through our investigations, −1. LRP can be decomposed into components providing
we discover that object detectors are inherently strong insights on: the localisation quality; the precision; and the
OOD detectors and provide reliable uncertainties. recall error. Besides, low-scoring detections are demoted
• Finally, we define the LaECE as a novel calibration by the term NFP in Eq. (1). Thus, LRP arguably alleviates
measure for object detectors in SAOD, which requires the aforementioned drawbacks of AP.
the confidence of a detector to represent both its clas-
sification as well as its localisation quality. 3. An Overview to the SAOD Task
2. Notations and Preliminaries For object detectors to be deployed in safety critical sys-
tems it is imperative that they perform in a robust manner.
Object Detection Given that the set of M objects in an Specifically, we would expect the detector to be aware of sit-
image X is represented by {bi , ci }M where bi ∈ R4 is a uations when the scene differs substantially from the train-
bounding box and ci ∈ {1, . . . , K} its class; the goal of ing domain and to include the functionality to flag the scene
an object detector is to predict the bounding boxes and the for human intervention. Moreover, we also expect that the
class labels for the objects in X, f (X) = {ĉi , b̂i , p̂i }N , confidence of the detections matches the performance, re-
where ĉi , b̂i , p̂i represent the class, bounding box and con- ferred to as calibration. With these expectations in mind,
fidence score of the ith detection respectively and N is the we characterise the crucial elements needed to evaluate and
number of predictions. Conventionally, the detections are 1 for probabilistic detectors [5, 19–21, 23], b̂raw follows a probabil-
i
obtained in two steps, f (X) = (h ◦ g)(X) [6, 42, 61, 66]: ity distribution mostly of the form g(X) = {N (µi , Σi ), p̂raw }N
raw
,
raw i
where g(X) = {b̂raw i , p̂raw
i }N is a deep neural net- where Σi is either a diagonal [5, 23] or full covariance matrix [20]
perform the SAOD task. Specifically, the SAOD task re- Table 1. Our dataset splits for SAOD. We design test sets for
quires an object detector to: COCO [43] and nuImages [4] as ID data (train & val). We ex-
• Have the functionality to reject a scene based on its ploit Objects365 [63] and BDD100K [73] for DID and T (DID ),
and use Objects365, iNaturalist [27] and SVHN [50] for DOOD .
image-level uncertainties through a binary indicator
variable â ∈ {0, 1}. Dataset DTest
DTrain DVal
• Produce detection-level confidences that are calibrated DID T (DID ) DOOD
(train) (val)
SAOD-Gen COCO COCO Obj45K Obj45K-C SiNObj110K-OOD
in terms of classification and localisation. SAOD-AV nuImages(train) nuImages(val) BDD45K BDD45K-C SiNObj110K-OOD
• Be robust to domain-shift.
For brevity, and to enable future researchers to adopt the
An ‘accept’ should be penalized in this case.
SAOD framework, the explicit practical specification for
Self-aware Object Detectors (SAODets) is Models In terms of evaluating SAOD on common object
detectors, it would prove useful at this point to introduce
fA (X) = {â, {ĉi , b̂i , p̂i }N }, (2) the models used in our investigation. We mainly exploit a
diverse set of four object detectors:
where â ∈ {0, 1} implies if the image should be accepted or 1. Faster R-CNN (F-RCNN) [61] is a two-stage detector
rejected and that the predicted confidences p̂i are calibrated. with a softmax classifier
Evaluation Datasets As the SAOD emulates challeng- 2. Rank & Sort R-CNN (RS-RCNN) [55] is another two-
ing real-life situations, the evaluation needs to be performed stage detector but with a ranking-based loss function
using large-scale test datasets. Unlike previous approaches and sigmoid classifiers
on OOD detection using around 1-2K OOD images [13, 21] 3. Adaptive Training Sample Selection (ATSS) [77] is a
for testing or calibration methods [36] relying on 2.5K ID common one-stage baseline with sigmoid classifiers
test images, our test set totals to 155K individual images for 4. Deformable DETR (D-DETR) [79] is a transformer-
each of our two use-cases when combining ID and OOD based model, again using sigmoid classifiers
data. Specifically, we construct two test datasets, where We also evaluate two probabilistic detectors with a diag-
each DTest in our case is the union of the following datasets: onal covariance matrix minimizing the negative log likeli-
• DID (45K Images): ID dataset with images containing hood [23] (NLL-RCNN) or energy score [21] (ES-RCNN),
the same foreground objects as were present in DTrain . allowing us to obtain uncertainty estimates for localisation.
• T (DID ) (3 × 45K Images): domain-shift dataset ob- Please see App. B for the training details of the methods as
tained by applying transformations to the images from well as their accuracy on DVal , T (DVal ), DID and T (DID ).
DID , which preserve the semantics of the image. As we have now outlined clear requirements for a
• DOOD (110K Images): OOD dataset with images that SAODet, it is natural to ask how well the aforementioned
do not contain any foreground object from DID . These object detectors perform under these requirements. We will
images tend to include objects not present in DTrain . extensively investigate this by first introducing a simple
We present exact splits in Tab. 1 for object detection in Gen- method to extract image-level uncertainty enabling the ac-
eral and Autonomous Vehicles (AV) use-cases (refer App. ceptance or rejection of an image in Sec. 4; evaluate the
A for further details). Collected from a different dataset, calibration and provide methods to calibrate such detectors
our DID differs from DTrain , but is still semantically simi- in Sec. 5; before finally providing a complete analysis of
lar; which is reflective of a challenging real-word scenario, them using the SAOD framework in Sec. 6.
as domains change over time and scenes differ in terms of
appearance. For T (DID ), we apply ImageNet-C style cor- 4. Obtaining Image-level Uncertainty
ruptions [25] to DID , where for each image we randomly
choose one of 15 corruption types (fog, blur, noise, etc.) at As there is no clear distinction between background and
severity levels 1, 3 and 5 as is common in practice [21]. an OOD object unless each pixel in DTrain is labelled
Then, we expect that for a given input X ∈ DTest , a [12], evaluating uncertainties of detectors is nontrivial at
SAODet makes the following decisions: detection-level. Thus, different from prior work [13, 21]
• if X ∈ DID ∪T (DID ) for corruption severities 1 and 3, conducting OOD detection at detection-level, we evalu-
‘accept’ the input and provide accurate and calibrated ate the uncertainties on image-level OOD detection task.
detections. Penalize a rejection. Thereby aligning the evaluation and the definition of an
• if X ∈ T (DID ) at corruption severity 5, provide the OOD image. Please see App. C.1 for further discussion.
choice to ‘accept’ and evaluate but do not not penalize Practically, one method to accept or reject an image is to
a ‘rejection’ as the transformed images might not con- obtain an estimate of uncertainty at the image-level through
tain enough cues to perform object detection reliably. a function G : X → R and a threshold ū ∈ R, where the
• if X ∈ DOOD , ‘reject’ the image and provide no de- image is accepted if G(X) < ū and â = 1; and rejected
tections as, by design, the predictions would be wrong. vice-versa. We take this approach when constructing our
Table 2. AUROC scores (in %) for image-level uncertainties when % of Data AP
60 30
aggregating through different methods, where we use the uncer- (% of Data)
ID
50 C1 of ( ID) (% of Data) 25
tainty score of 1 − p̂i for the detections. Here, top-m refers to the C3 of ( ID) (% of Data)
average of the lowest m uncertainties for the detections. As we 40 C5 of ( ID) (% of Data) 20
OOD (% of Data)
can see, using the most certain detections performs better. Bold 30 ID ( ID) (AP) 15
and underline are best and second best respectively. 20 10
10 5
Dataset Detector sum mean top-5 top-3 top-2 min
0 0.0-0.1 0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6 0.6-0.7 0.7-0.8 0.8-0.9 0.9-1.0 0
F-RCNN 20.9 84.1 93.4 94.1 94.4 93.8 Image-level Uncertainty
RS-RCNN 85.8 85.8 94.3 94.8 94.8 93.5
SAOD-Gen
ATSS 66.2 86.3 93.8 94.2 94.0 92.6 Figure 2. The distribution of image-level uncertainties obtained
D-DETR 85.2 85.2 94.4 94.7 94.6 93.3 from F-RCNN (SAOD-Gen) on DID , different severities 1, 3, 5
F-RCNN 27.1 84.1 96.4 97.3 97.4 96.0 (C1, C3, C5) of T (DID ) and DOOD vs. the accuracy in COCO-
SAOD-AV
ATSS 18.8 92.2 97.7 97.6 97.3 95.7 Style AP in % (AP in short). App. C includes more examples.

Table 4. AUROC scores (in %) on subsets of DOOD . In all cases,


Table 3. AUROC scores (in %) of different detection-level uncer-
near-OOD (Obj365) has a lower AUROC than far-OOD (SVHN).
tainty estimates. Classification-based uncertainties perform better
compared to localization and 1 − p̂i performs generally the best. near to far OOD
Dataset Detector all OOD
Obj365 iNat SVHN
Classification Localisation F-RCNN 83.7 97.6 99.8 94.1
Dataset Detector
H(p̂raw
i ) DS 1 − p̂i |Σ| tr(Σ) H(Σ) SAOD RS-RCNN 85.6 97.8 99.8 94.8
Gen ATSS 84.5 97.4 99.5 94.2
F-RCNN 92.6 89.7 94.1 N/A N/A N/A
D-DETR 85.7 98.1 99.4 94.4
RS-RCNN 93.7 30.0 94.8 N/A N/A N/A
SAOD ATSS 94.3 36.9 94.2 N/A N/A N/A SAOD F-RCNN 95.2 97.4 98.8 97.3
Gen D-DETR 93.9 73.8 94.4 N/A N/A N/A AV ATSS 95.0 97.3 99.7 97.7

NLL-RCNN 92.4 89.0 94.1 87.6 87.5 87.7


ES-RCNN 92.8 89.9 94.1 85.0 85.2 86.4
that high AUROC scores are obtained when G is formed
SAOD F-RCNN 97.3 96.0 97.3 N/A N/A N/A by considering up to the mean(top-5) detections, with the
AV ATSS 97.2 97.1 97.6 N/A N/A N/A mean(top-3) aggregation strategy of 1 − p̂i performs the
best. This highlights that the detections with lowest uncer-
baseline and now specifically outline the method to do so. tainty in each image provide useful information to reliably
Obtaining Image-level Uncertainties This can be estimate image-level uncertainty. We believe the poor per-
achieved through aggregating the detection-level uncertain- formance for mean and sum stem from the fact that there
ties. We hypothesise that there is implicitly enough uncer- are typically too many noisy detections (up to k = 100) for
tainty information in the detections to produce image-level only a few objects in the image. We further provide assur-
uncertainty, they just need to be extracted and aggregated ance that 1 − p̂i is the most appropriate method to extract
in an appropriate way. In terms of the extraction, we can detection-level uncertainty in Tab. 3, where we can see that
obtain detection level uncertain through: the uncertainty 1 − p̂i obtains higher AUROC scores compared to H(p̂raw i )
score (1 − p̂i ); the entropy of the predictive classification and DS. We also note that classification uncertainties (ex-
distribution of the raw detections (H(p̂raw
i )); and Dempster- cept DS) perform consistently better than localisation ones
Shafer [14,62] (DS). In addition, for probabilistic detectors, for probabilistic detectors. We believe one of the reasons
we can extract uncertainty from Σ by taking the: determi- for that is the classifier is trained using both the proposals
nant, trace, or entropy of the multivariate normal distribu- matching and not matching with any object, preventing the
tion [49]. In terms of the aggregation strategy, given the detector from becoming over-confident everywhere.
uncertainties for the detections after top-k survival, we let How Reliable are these Image-level Uncertainties?
G either take their: sum, mean, minimum, or their mean Though the aforementioned results show that the image-
of the m smallest uncertainty values, i.e. the most certain level uncertainties are effective, we now see how reliable
top-m detections. For further details, please see App. C. these uncertainties are in practice. For this, we first eval-
Whilst these strategies are simple, as we will now show, uate the detectors on different subsets of our SiNObj110K
they provide a suitable method to obtain image-level uncer- OOD set. Tab. 4 shows that for all detectors, the AUROC
tainty, enabling effective performance on OOD detection, a score is lower for near-OOD subset (Obj365) than for far-
common task for evaluating uncertainty quantification. OOD (iNat and SVHN) and is consistently very high for
To do this, we evaluate the Area-under ROC Curve (AU- far-OOD subsets (up to 99.8 on SVHN).
ROC) score between the uncertainties of the data from DID We then consider the uncertainties of DID , T (DID ) and
and DOOD and display the results in Tab. 2; which shows DOOD by plotting histograms of the image-level uncertain-
0.0
ties in 10 equally-spaced bins in the range of [0, 1]. In 0.2
0.4
0.6
Fig. 2 we see that the uncertainties from DID have a signif- 0.3 0.8
0.9
icant amount of mass in the smaller bins and vice versa for

y
y
0.6
DOOD , moreover the uncertainties get larger as the sever-
p matches
ity of corruption increases. We also display AP (black Accuracy 1.0
line), where it can be clearly seen that as the uncertainty in- 0.0 0.3 0.6 1.0
p x
creases AP decreases, implying that the uncertainty reflects (a) Classifier [17] (b) Regressor [33] (c) Detector
the performance of the detector. Thereby suggesting that Figure 3. (a) Calibrated classifier; (b) Calibrated Bayesian re-
the image-level uncertainties are reliable and effective. As gressor, where empirical and predicted CDFs match; (c) Loci of
already pointed out, this conclusion is not necessarily very constant IoU boundary, e.g. any predicted box with top-left and
surprising, since the classifiers of object detectors are gen- bottom-right corners obtained from within the green loci has an
erally trained not only by proposals matching the objects IoU > 0.2 with the blue box. The detector is calibrated if its con-
but also by a very large number of proposals not matching fidence matches the classification and the localisation quality.
with any object, which can be ∼ 1000 times more [57]. This
we define calibration as the alignment of performance and
composition of training data prevents the classifier from be-
confidence of a model; which has already been extensively
coming drastically over-confident for unseen data, enabling
studied for the classification task [8,17,34,47,52,69]. How-
the detector to yield reliable uncertainties.
ever, existing work which studies the calibration properties
Thresholding Image-level Uncertainties For our of an object detector [35, 36, 48, 51] is limited. For object
SAOD baseline, we can obtain an appropriate value for ū detection, the goal is to align a detector’s confidence with
through cross-validation. Ideally, this will require a val- the quality of the joint task of classification and localisa-
No need idation set including both ID and OOD images, but un- tion (regression). Arguably, it is not obvious how to ex-
for OOD fortunately DVal consists of only ID images. However, tend merely classification-based calibration measures such
detectors given that in this case our image-level uncertainty is ob- as Expected Calibration Error (ECE) [17] for object de-
tained by aggregating detection-level uncertainties, the im- tection. A straightforward extension would be to replace
ages which have detections with high uncertainty will pro- the accuracy in such measures by the precision of the de-
duce high image-level uncertainty and vice-versa. Using tector, which is computed by validating TPs from a spe-
this fact, if we remove the ground-truth objects from the im- cific IoU threshold. However, this perspective, as employed
ages in DVal , the resulting image-level uncertainties should by [35], does not account for the fact that two object detec-
be high. We leverage this approach to construct a pseudo tors, while having the same precision, might differ signifi-
OOD dataset out of DVal , by replacing the pixels inside the cantly in terms of localisation quality.
ground-truth bounding boxes with zeros, thereby removing Hence, as one of the main contributions of this work, we
them from the image and enabling us to cross-validation. consider the calibration of object detectors from a funda-
As for the metric to cross-validate ū against, we observe mental perspective and define Localisation-aware Calibra-
that existing metrics such as: AUC metrics are unsuitable tion Error (LaECE) which accounts for the joint nature of
to evaluate binary predictions, F-Score is sensitive to the the task (classification and localisation). We further analyse
choice of the positive class [60] and [email protected] [13, 24] how calibration measures should be coupled with accuracy
requires a fixed threshold. As an attractive candidate, Un- in object detection and adapt common post hoc calibration
certainty Error [46] computes the arithmetic mean of FP and methods such as histogram binning [74], linear regression,
FN rates. However, the arithmetic mean does not heavily and isotonic regression [75] to improve LaECE.
penalise choosing ū on extreme values, potentially leading
to the situation where â = 1 or â = 0 for all images. To ad- 5.1. Localisation-aware ECE
dress this, we instead leverage the harmonic mean, which is
To build an intuitive understanding and to appreciate the
sensitive to these extreme values. Particularly, we define the
underlying complexity in developing a metric to quantify
Balanced Accuracy (BA) as the harmonic mean of TP rate
the calibration of an object detector, we first revisit its sub-
(TPR) and FP rate (FPR), addressing the aforementioned
tasks and briefly discuss what a calibrated classifier and a
issue and enabling us to use it to obtain a suitable ū.
calibrated regressor correspond to. For the former, a classi-
fier is calibrated if its confidence matches its accuracy as
5. Calibration of Object Detectors illustrated in Fig. 3(a). For calibrating Bayesian regres-
Accepting or rejecting an image is only one component sors, there are different definitions [33, 37, 38, 64]. One
of the SAOD task, in situations where the image is accepted notable definition [33] requires aligning the predicted and
SAOD then requires the detections to be calibrated. Here the empirical cumulative distribution functions (cdf), im-
plying p% credible interval from the mean of the predictive
2 Which is the FPR for a fixed threshold set when TPR=0.95. distribution should include p% of the ground truths for all
p ∈ [0, 1] (Fig. 3(b)). Extending this definition to object de- over all the classes. We highlight that for the sake of bet-
tection is nontrivial due to the increasing complexity of the ter accuracy the recent detectors [2, 23, 28–30, 39, 40, 44, 54,
problem. For example, a detection is represented by a tuple 55, 67, 76] tend to obtain p̂i by combining the classification
{ĉi , b̂i , p̂i } with b̂i ∈ R4 , which is not univariate as in [33]. confidence with the localisation confidence (e.g., obtained
Also, this definition to align the empirical and predicted from an auxiliary IoU prediction head), which is very well
cdfs does not consider the regression accuracy explicitly, aligned with our LaECE formulation, enforcing p̂i to match
and therefore not fit for our purpose. Instead, we take in- with the joint performance in Eq. (4).
spiration from an alternative definition that aims to directly Reliability Diagrams We also produce reliability dia-
align the confidence with the regression accuracy [37, 38]. grams to provide insights on the calibration properties of a
To this end, without loss of generality, we use IoU as detector (Fig. 4(a)). To obtain a reliability diagram, we first
the measure of localisation quality for the detection boxes. obtain the performance, measured by the product of preci-
Therefore, broadly speaking, if the detection confidence sion and IoU (Eq. (4)), for each class over bins and then
score p̂i = 0.8, then the localisation task is calibrated (ig- average the performance over the classes by ignoring the
noring the classification task for now) if the average locali- empty bins. Note that if a detector is perfectly calibrated
sation performance (IoU in our case) is 80% over the entire with LaECE = 0, then all the histograms will lie along
dataset. To demonstrate, following [56] we plot the loci for the diagonal in the reliability diagram since LaECEc = 0.
fixed values of IoU in Fig. 3(c). In this example, consider- Similar to classification, if the performance tends to be
ing the blue-box to be the ground-truth, p̂i = 0.2 implies lower than the diagonal, then the detector is said to be
that a detector is calibrated if the detection box lie on the over-confident as in Fig. 4(a), and vice versa for an under-
‘green’ loci corresponding to IoU = 0.2. confident detector. Please see Fig. A.14 for more examples.
Focusing back onto the joint nature of object detection,
we say that an object detector f : X 7→ {ĉi , b̂i , p̂i }N is cali- 5.2. Impact of Top-k Survival on Calibration
brated if the classification and the localisation performances
jointly match its confidence p̂i . More formally, Top-k survival, a critical part of the post-processing step,
selects k detections with the highest confidence in an im-
P(ĉi = ci |p̂i ) Eb̂i ∈Bi (p̂i ) [IoU(b̂i , bψ(i) )] = p̂i , ∀p̂i ∈ [0, 1] (3) age. The value of k is typically significantly larger than the
| {z }|
Classification perf.
{z
Localisation perf.
}
number of objects, for example, k = 100 for COCO where
an average of only 7.3 ground-truth objects exist per image
where Bi (p̂i ) is the set of TP boxes with the confidence on the val set. Therefore, the final detections may contain
score of p̂i , and bψ(i) is the ground-truth box that b̂i matches numerous low-scoring noisy detections. In fact, ATSS on
with. Note that in the absence of localisation quality, the COCO val set, for example, produces 86.4 detections on
above calibration formulation boils down to the standard average per image after postprocessing, far more than the
classification calibration definition. average number of objects per image.
For a given Bi (p̂i ), the first term in Eq. (3), P(ĉi = Since these extra noisy detections do not impact on the
ci |p̂i ), is the ratio of the number of correctly-classified to widely used AP, most works do not pay much attention to
the total number of detections, which is simply the preci- them, however, as we show below, they do have a negative
sion. Whereas, the second term represents the average lo- impact on the calibration metric. Thus, this may mislead a
calisation quality of the boxes in Bi (p̂i ). practitioner in choosing the wrong model when it comes to
Following the approximations used to define the well- calibration quality.
known ECE, we use Eq. (3) to define LaECE. Precisely, we We design a synthetic experiment to show the impact
discretize the confidence score space into J = 25 equally- of low-scoring noisy detections on AP and calibration
spaced bins [17, 34], and to prevent more frequent classes
(LaECE). Specifically, if the number of final detections is
to dominate the calibration error, we compute the average
calibration error for each class separately [34, 47]. Thus, less than k in an image, we insert “dummy” detections
the calibration error for the c-th class is obtained as into the remaining space. These dummy detections are ran-
domly assigned a class ĉi , p̂i = 0, and only one pixel to en-
J
X |D̂jc | ¯ c (j) , sure that they do not match with any object. Hence, by de-
LaECEc = p̄cj − precisionc (j) × IoU (4)
j=1 |D̂c | sign, they are “perfectly calibrated”. As shown in Fig. 5(a),
though these dummy detections have no impact on the AP
where D̂c denotes the set of all detections, D̂jc ⊆ D̂c is the (mathematical proof in App. D), they do give an impression
set of detections in bin j and p̄cj is the average of the de- that the model becomes more calibrated (lower LaECE) as
tection confidence scores in bin j for class c. Furthermore, k increases. Therefore, considering that extra noisy detec-
precisionc (j) denotes the precision of the j-th bin for c- tions are undesirable in practice, we do not advocate top-k
¯ c (j) the average IoU of TP boxes in bin
th class and IoU survival, instead, we motivate the need to select a detec-
j. Then, LaECE is computed as the average of LaECEc tion confidence threshold v̄, where detections are rejected if
1.0 1.0 Perf. or Error (in %) det/img Perf. or Error (in %) det/img
Precision ×IoU Precision ×IoU 100 500 100 100
% of Samples % of Samples 80
0.8 0.8 80 LaECE 400 80
LaECE= 43.3% LaECE= 17.7% AP
60 300 60 60
0.6 0.6 LRP
40 200 40 40
0.4 0.4 20 100 20 20
0 0 0 0
0.2 0.2 0 up to 100 up to 300 up to 500 none 0.30 0.50 0.70
Number of Dummy Detections (up to k) Score threshold to remove noisy detections
0.0 0.0 (a) Adding dummy detections (b) Thresholding detections
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Confidence Confidence
(a) Base model (b) Calibrated by LR Figure 5. Red: ATSS, green: F-RCNN, histograms present de-
t/img using right axes, results are on COCO val set with 7.3 ob-
Figure 4. Reliability diagrams of F-RCNN on DID for SAOD- jects/img. (a) Dummy detections decrease LaECE (solid line) ar-
Gen before and after calibration. tificially with no effect on AP (dashed line). LRP (dotted line), on
the other hand, penalizes dummy detections. (b) AP is maximized
their confidence is lower than v̄. with more detections (threshold ‘none’) while LRP Error benefits
An appropriate choice of v̄ should produce a set of from properly-thresholded detections. (refer App. D)
thresholded-detections with a good balance of precision, re-
call and localisation errors3 . In Fig. 5(b), we present the ef- 6. Baseline SAODets and Their Evaluation
fect of v̄ on LRP, where the lowest error is obtained around
0.30 for ATSS and 0.70 for F-RCNN, leading to an average Using the necessary features developed in Sec. 4 and
of 6 detections/image for both detectors, far closer to the av- Sec. 5, namely, obtaining: image-level uncertainties, cali-
erage number of objects compared to using k = 100. Con- bration methods as well as the thresholds ū and v̄, we now
sequently, to obtain v̄ for our baseline, we use LRP-optimal show how to convert standard detectors into ones that are
thresholding [53, 58], which is the threshold achieving the self-aware. Then, we benchmark them using the SAOD
minimum LRP for each class on the val set. framework proposed in Sec. 3 whilst leveraging our test
datasets and LaECE.
5.3. Post hoc Calibration of Object Detectors
Baseline SAODets To address the requirements of a
For our baseline, given that LaECE provides the calibra- SAODet, we make the following design choices when con-
tion error of the model, we can calibrate an object detector verting an object detector into one which is self aware: The
using common calibration approaches from the classifica- hard requirement of predicting whether or not to accept an
tion and regression literature. Precisely, for each class, we image is achieved through obtaining image-level uncertain-
train a calibrator ζ c : [0, 1] → [0, 1] using the input-target ties by aggregating uncertainty scores. Specifically, we use
pairs ({p̂i , tcal
i }) from DVal , where ti
cal
is the target confi-
mean(top-3) and obtain an uncertainty threshold ū through
dence. As shown in App D, LaECE for bin j reduces to
cross-validation using pseudo OOD set approach (Sec. 4).
X X cal We only keep the detections with higher confidence than v̄,
tcal

i − IoU(b̂i , bψ(i) ) + ti . (5)
which is set using LRP-optimal thresholding (Sec. 5.2). To
b̂i ∈D̂jc b̂i ∈D̂jc
ψ(i)>0 ψ(i)≤0
calibrate the detection scores, we use linear regression as
discussed in Sec. 5.3. Thus, we convert all four detectors
Consequently, we seek tcal i which minimises this value as- that we use (Sec. 3) into ones that are self-aware, prefixed
suming that p̂i resides in the jth bin. In situations where the by a SA in the tables. For further details, please see App. E.
prediction is a TP (ψ(i) > 0), Eq. (5) is minimized when The SAOD Evaluation Protocol The SAOD task is a
p̂i = tcal
i = IoU(b̂i , bψ(i) ) and conversely, if ψ(i) ≤ 0, it is robust protocol unifying the evaluation of the: (i) reliabil-
minimised when p̂i = tcal i = 0. We then train linear regres- ity of uncertainties; (ii) the calibration and accuracy; (iii)
sion (LR); histogram binning (HB) [74]; and isotonic re- and performance under domain shift. To obtain quantita-
gression (IR) [75] models with such pairs. Tab. 5 shows that tive values for the above, we leverage the Balanced Accu-
these calibration methods improve LaECE in five out of six racy (Sec. 4) for (i). For (ii) we evaluate the calibration
cases, and in the case where they do not improve (ATSS on and accuracy using LaECE (Sec. 5) and the LRP [53] re-
SAOD-Gen), the calibration performance of the base model spectively, but combine them through the harmonic mean
is already good. Overall, we find IR and LR perform bet- of 1 − LRP and 1 − LaECE on X ∈ DID , which we de-
ter than HB and consequently we employ LR for SAODets fine as the In-Distribution Quality (IDQ). Similarly, for (iii)
since LR performs the best on three detectors. Fig. 4(b) we compute the IDQ for X ∈ T (DID ), denoted by IDQT ,
shows an example reliability histogram after applying LR, but with the principal difference that the detector is flexible
indicating the improvement to calibration. to accept or reject severe corruptions (C5) as discussed in
3 Using properly-thresholded detections is in fact similar to the Panoptic Sec. 3. Considering that all of these features are crucial in
Segmentation, which is a closely-related task to object detection [31, 32] a safety-critical application, a lack of performance in one
Why the accuracy values all the same across
different calibrators? The confidence scores are
already calibrated!
Table 5. Effect of post hoc calibration on LaECE and LRP (in %). ✗: Uncalibrated, HB: Histogram binning, IR: Isotonic Regresssion, LR:
Linear Regression. ATSS, combining localisation and classification confidences using multiplication as in our LaECE (Eq. (4)), performs
the best on both datasets before/after calibration. Aligned with [47], uncalibrated F-RCNN using cross-entropy loss performs the worst.

Dataset SAOD-Gen SAOD-AV


Detector F-RCNN RS-RCNN ATSS D-DETR F-RCNN ATSS
Calibrator ✗ LR HB IR ✗ LR HB IR ✗ LR HB IR ✗ LR HB IR ✗ LR HB IR ✗ LR HB IR
LaECE 43.3 17.7 18.6 16.9 32.0 17.4 19.6 17.2 15.7 16.8 18.7 16.7 15.9 15.7 17.7 15.9 26.5 9.8 10.2 10.2 16.8 9.0 9.7 9.7
LRP 74.7 74.7 74.7 74.7 73.6 73.6 73.6 73.6 74.0 74.0 74.1 74.0 71.9 71.9 71.9 71.9 73.5 73.5 73.5 73.5 70.6 70.6 70.6 70.6

Table 6. Evaluating SAODets. With higher BA and IDQs, SA- D-DETR still obtains a low score of 43.5% on the SAOD-
D-DETR achieves the best DAQ on SAOD-Gen. For SAOD-AV Gen dataset. As this performance does not seem to be con-
datasets, SA-ATSS outperforms SA-F-RCNN thanks to its higher vincing, extra care should be taken before these models are
IDQs. Bold: SAODet achieves the best, values are in %.
deployed in safety-critical applications. Consequently, our
Self-aware
DAQ↑
DOOD DID T (DID ) DVal study shows that a significant amount of attention needs to
Detector BA↑ IDQ↑ LaECE↓ LRP↓ IDQ↑ LaECE↓ LRP↓ LRP↓ AP↑
be provided in building self-aware object detectors and ef-
SA-F-RCNN 39.7 87.7 38.5 17.3 74.9 26.2 18.1 84.4 59.5 39.9
SA-RS-RCNN 41.2 88.9 39.7 17.1 73.9 27.5 17.8 83.5 58.1 42.0
fort to reduce the performance gap needs to be undertaken.
Gen

SA-ATSS 41.4 87.8 39.7 16.6 74.0 27.8 18.2 83.2 58.5 42.8 Ablation Analyses To test which components of the
SA-D-DETR 43.5 88.9 41.7 16.4 72.3 29.6 17.9 81.9 55.9 44.3
SA-F-RCNN 43.0 91.0 41.5 9.5 73.1 28.8 7.2 83.0 54.3 55.0
SAODet contribute the most to their improvement, we
AV

SA-ATSS 44.7 85.8 43.5 8.8 71.5 30.8 6.8 81.5 53.2 56.9 perform a simple experiment using SA-F-RCNN (SAOD-
Gen). In this experiment, we systematically remove the
Table 7. Ablation study by removing: LRP-Optimal threshold- LRP-optimal thresholds; LR calibration; and pseudo-set ap-
ing (Sec. 5.2) for v̄ = 0.5; LR calibration (Sec. 5.3) for uncali- proach and replace these features, with a detection-score
brated model; and image-level threshold ū (Sec. 4) for the thresh- threshold of 0.5; no calibration; and a threshold correspond-
old corresponding to TPR = 0.95. ing to a TPR of 0.95 respectively. We can see in Tab. 7 that
as hypothesized, LRP-optimal thresholding improves accu-
v̄ LR ū DAQ↑ BA↑ LaECE↓ LRP↓ LaECET ↓ LRPT ↓
racy, LR yields notable gain in LaECE and using pseudo-
36.0 83.2 42.7 76.2 44.1 84.7
sets results in a gain for OOD detection. In App. E, we
✓ 36.5 83.2 41.7 74.8 43.9 84.7
✓ ✓ 39.1 83.2 17.2 74.8 18.1 84.7 further conduct additional experiments to (i) investigate the
✓ ✓ ✓ 39.7 87.7 17.3 74.9 18.1 84.4 effect of ū and v̄ on reported metrics and (ii) how common
improvement strategies for object detectors affect DAQ.
them needs to be heavily penalized. To do so, we introduce Evaluating Individual Robustness Aspects We finally
the Detection Awareness Quality (DAQ), a unified perfor- note that our framework provides the necessary tools to
mance measure to evaluate SAODets, constructed as the the evaluate a detector in terms of reliability of uncertainties,
harmonic mean of BA, IDQ and IDQT . The resulting DAQ calibration and domain shift. Thereby enabling the re-
is a higher-better measure with a range of [0, 1]. searchers to benchmark either a SAODet using our DAQ
Main Results Here we discuss how our SAODets per- measure or one of its individual components. Specifically,
form in terms of the aforementioned metrics. In terms of our (i) uncertainties can be evaluated on DID ∪ DOOD us-
hypotheses, the first evaluation we wish observe is the effec- ing AUROC or BA (Tab. 2); (ii) calibration can be eval-
tiveness of our metrics. Specifically, we observe in Tab. 6 uated on DID ∪ T (DID ) using LaECE (Tab. 5); and (iii)
that a lower LaECE and LRP lead to a higher IDQ; and that DID ∪ T (DID ) can be used to test detectors developed for
a higher BA, IDQ and IDQT lead to a higher DAQ, indi- single domain generalization [68, 72].
cating that the constructions of these metrics is appropriate.
To justify that they are reasonable, we observe that typi- 7. Conclusive Remarks
cally more complex and better performing detectors (DETR In this paper, we developed the SAOD task, which re-
and ATSS) outperform the simpler F-RCNN, indicating that quires detectors to obtain reliable uncertainties; yield cali-
these metrics reflect the quality of the object detectors. brated confidences; and be robust to domain shift. We cu-
In terms of observing the performance of these self- rated large-scale datasets and introduced novel metrics to
aware variants, we can see that while recent state-of-the- evaluate detectors on the SAOD task. Also, we proposed
art detectors perform very well in terms of LRP and AP on a metric (LaECE) to quantify the calibration of object de-
DVal , their performance drops significantly as we expose tectors which respects both classification and localisation
them to our DID and T (DID ) which involves domain shift, quality, addressing a critical shortcoming in the literature.
corruptions and OOD. We would also like to note that the We hope that this work inspires researchers to build more
best DAQ corresponding to the best performing model SA- reliable object detectors for safety-critical applications.
References [13] Xuefeng Du, Zhaoning Wang, Mu Cai, and Sharon Li. To-
wards unknown-aware learning with virtual outlier synthe-
[1] Daniel Bolya, Sean Foley, James Hays, and Judy Hoffman. sis. In International Conference on Learning Representa-
Tide: A general toolbox for identifying object detection er- tions, 2022. 1, 3, 5, 17, 19, 20, 23
rors. In The IEEE European Conference on Computer Vision [14] Ayers Edward, Sadeghi Jonathan, Redford John, Mueller Ro-
(ECCV), 2020. 27 main, and Dokania Puneet K. Query-based hard-image re-
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. trieval for object detection at test time. arXiv, 2209.11559,
Yolact++: Better real-time instance segmentation. IEEE 2022. 4
Transactions on Pattern Analysis and Machine Intelligence, [15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
2020. 6 and A. Zisserman. The pascal visual object classes (voc)
[3] François Bourgeois and Jean-Claude Lassalle. An exten- challenge. International Journal of Computer Vision (IJCV),
sion of the munkres algorithm for the assignment problem to 88(2):303–338, 2010. 2, 24, 25
rectangular matrices. Communications of ACM, 14(12):802– [16] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
804, 1971. 16 ready for autonomous driving? the kitti vision benchmark
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, suite. In Conference on Computer Vision and Pattern Recog-
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- nition (CVPR), 2012. 15
ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- [17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger.
modal dataset for autonomous driving. In IEEE/CVF Confer- On calibration of modern neural networks. In Doina Precup
ence on Computer Vision and Pattern Recognition (CVPR), and Yee Whye Teh, editors, Proceedings of the 34th Interna-
2020. 3, 15 tional Conference on Machine Learning, volume 70 of Pro-
[5] Qi Cai, Yingwei Pan, Yu Wang, Jingen Liu, Ting Yao, and ceedings of Machine Learning Research, pages 1321–1330.
Tao Mei. Learning a unified sample weighting network for PMLR, 2017. 5, 6
object detection. In IEEE/CVF Conference on Computer Vi- [18] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A
sion and Pattern Recognition (CVPR), 2020. 2 dataset for large vocabulary instance segmentation. In The
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas IEEE Conference on Computer Vision and Pattern Recogni-
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- tion (CVPR), 2019. 1, 2, 25
end object detection with transformers. In European Confer- [19] David Hall, Feras Dayoub, John Skinner, Haoyang Zhang,
ence on Computer Vision (ECCV), 2020. 2, 31 Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia An-
[7] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu gelova, and Niko Suenderhauf. Probabilistic object de-
Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei tection: Definition and evaluation. In Proceedings of the
Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, IEEE/CVF Winter Conference on Applications of Computer
Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Vision (WACV), 2020. 2
Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli [20] Ali Harakeh, Michael H. W. Smart, and Steven L. Waslan-
Ouyang, Chen Change Loy, and Dahua Lin. MMDetec- der. Bayesod: A bayesian approach for uncertainty estima-
tion: Open mmlab detection toolbox and benchmark. arXiv, tion in deep object detectors. IEEE International Conference
1906.07155, 2019. 19 on Robotics and Automation (ICRA), 2020. 2
[8] Jiacheng Cheng and Nuno Vasconcelos. Calibrating deep [21] Ali Harakeh and Steven L. Waslander. Estimating and evalu-
neural networks by pairwise constraints. In Proceedings of ating regression predictive uncertainty in deep object detec-
the IEEE/CVF Conference on Computer Vision and Pattern tors. In International Conference on Learning Representa-
Recognition (CVPR), 2022. 5 tions (ICLR), 2021. 1, 2, 3, 17, 19, 20, 23
[9] Jiwoong Choi, Ismail Elezi, Hyuk-Jae Lee, Clement Farabet, [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
and Jose M. Alvarez. Active learning for deep object detec- Deep residual learning for image recognition. In IEEE/CVF
tion via probabilistic modeling. In IEEE/CVF International Conference on Computer Vision and Pattern Recognition
Conference on Computer Vision (ICCV), 2021. 2 (CVPR), 2016. 31
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo [23] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides,
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe and Xiangyu Zhang. Bounding box regression with uncer-
Franke, Stefan Roth, and Bernt Schiele. The cityscapes tainty for accurate object detection. In IEEE/CVF Confer-
dataset for semantic urban scene understanding. In IEEE ence on Computer Vision and Pattern Recognition (CVPR),
Conference on Computer Vision and Pattern Recognition 2019. 2, 3, 6, 23
(CVPR), 2016. 1, 15 [24] Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou,
[11] Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kir- Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt,
illov, and Ross B. Girshick. Evaluating large-vocabulary and Dawn Song. Scaling out-of-distribution detection for
object detectors: The devil is in the details. arXiv e- real-world settings. In International Conference on Machine
prints:2102.01066, 2021. 1 Learning (ICML), 2022. 5
[12] Akshay Raj Dhamija, Manuel Günther, Jonathan Ventura, [25] Dan Hendrycks and Thomas Dietterich. Benchmarking neu-
and Terrance E. Boult. The overlooked elephant of object ral network robustness to common corruptions and perturba-
detection: Open set. In IEEE Winter Conference on Applica- tions. In International Conference on Learning Representa-
tions of Computer Vision (WACV), 2020. 1, 3, 20 tions (ICLR), 2019. 3, 17
[26] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun [40] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu,
Dai. Diagnosing error in object detectors. In The IEEE Eu- Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss:
ropean Conference on Computer Vision (ECCV), 2012. 27 Learning qualified and distributed bounding boxes for dense
[27] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, object detection. In Advances in Neural Information Pro-
Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Per- cessing Systems (NeurIPS), 2020. 6
ona, and Serge J. Belongie. The inaturalist species classi- [41] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He,
fication and detection dataset. In CVPR, pages 8769–8778, Bharath Hariharan, and Serge J. Belongie. Feature pyramid
2018. 3 networks for object detection. In IEEE/CVF Conference on
[28] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Computer Vision and Pattern Recognition (CVPR), 2017. 19
Huang, and Xinggang Wang. Mask scoring r-cnn. In [42] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
IEEE/CVF Conference on Computer Vision and Pattern Piotr Dollár. Focal loss for dense object detection. IEEE
Recognition (CVPR), 2019. 6 Transactions on Pattern Analysis and Machine Intelligence
[29] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yun- (TPAMI), 42(2):318–327, 2020. 2
ing Jiang. Acquisition of localization confidence for accurate [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
object detection. In The European Conference on Computer Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Vision (ECCV), 2018. 6 Zitnick. Microsoft COCO: Common Objects in Context.
[30] Kang Kim and Hee Seok Lee. Probabilistic anchor assign- In The European Conference on Computer Vision (ECCV),
ment with iou prediction for object detection. In The Euro- 2014. 1, 2, 3, 14, 24, 25
pean Conference on Computer Vision (ECCV), 2020. 6 [44] Ji Liu, Dong Li, Rongzhang Zheng, Lu Tian, and Yi Shan.
[31] Alexander Kirillov, Ross B. Girshick, Kaiming He, and Piotr Rankdetnet: Delving into ranking constraints for object de-
Dollár. Panoptic feature pyramid networks. In IEEE/CVF tection. In IEEE/CVF Conference on Computer Vision and
Conference on Computer Vision and Pattern Recognition Pattern Recognition (CVPR), pages 264–273, June 2021. 6
(CVPR), 2019. 7 [45] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bring-
mann, A. S. Ecker, M. Bethge, and W. Brendel. Bench-
[32] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten
marking robustness in object detection: Autonomous driving
Rother, and Piotr Dollar. Panoptic segmentation. In The
when winter is coming. In NeurIPS Workshop on Machine
IEEE Conference on Computer Vision and Pattern Recogni-
Learning for Autonomous Driving, 2019. 1
tion (CVPR), June 2019. 7
[46] Dimity Miller, Feras Dayoub, Michael Milford, and Niko
[33] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon.
Sünderhauf. Evaluating merging strategies for sampling-
Accurate uncertainties for deep learning using calibrated re-
based uncertainty techniques in object detection. In Inter-
gression. In International Conference on Machine Learning
national Conference on Robotics and Automation (ICRA),
(ICML), 2018. 5, 6
2019. 5
[34] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
[35] Fabian Kuppers, Jan Kronenberger, Amirhossein Shantia, and Anselm Haselhoff. Multivariate confidence calibration for object detection. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.
[36] Fabian Kuppers, Jonas Schneider, and Anselm Haselhoff. Parametric and multivariate uncertainty calibration for regression and object detection. In Safe Artificial Intelligence for Automated Driving Workshop in The European Conference on Computer Vision, 2022.
[37] Max-Heinrich Laves, Sontje Ihler, Jacob F. Fast, Lüder A. Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. In Proceedings of the Third Conference on Medical Imaging with Deep Learning, pages 393–412, 2020.
[38] Dan Levi, Liran Gispan, Niv Giladi, and Ethan Fetaya. Evaluating and calibrating uncertainty prediction in regression tasks. Sensors (Basel), 22(15):5540–5550, 2022.
[39] Xiang Li, Wenhai Wang, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[47] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, and Puneet Dokania. Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 15288–15299, 2020.
[48] Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, and Mohsen Ali. Towards improving calibration in object detection under domain shift. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[49] Kevin P. Murphy. Probabilistic Machine Learning: An introduction. MIT Press, 2022.
[50] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[51] Lukás Neumann, Andrew Zisserman, and Andrea Vedaldi. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS MLITS Workshop on Machine Learning for Intelligent Transportation System, 2018.
[52] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[53] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Localization recall precision (LRP): A new performance metric for object detection. In The European Conference on Computer Vision (ECCV), 2018.
[54] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. A ranking-based, balanced loss function unifying classification and localisation in object detection. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[55] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Rank & sort loss for object detection and instance segmentation. In The International Conference on Computer Vision (ICCV), 2021.
[56] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Akbas. Generating positive bounding boxes for balanced training of object detectors. In IEEE Winter Applications on Computer Vision (WACV), 2020.
[57] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Akbas. Imbalance problems in object detection: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 1–1, 2020.
[58] Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Akbas. One metric to measure them all: Localisation recall precision (LRP) for evaluating visual detection tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 1–1, 2021.
[59] Tai-Yu Pan, Cheng Zhang, Yandong Li, Hexiang Hu, Dong Xuan, Soravit Changpinyo, Boqing Gong, and Wei-Lun Chao. On model calibration for long-tailed object detection and instance segmentation. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 2529–2542, 2021.
[60] Francesco Pinto, Harry Yang, Ser-Nam Lim, Philip H. S. Torr, and Puneet K. Dokania. RegMixup: Mixup as a regularizer can surprisingly improve accuracy and out-of-distribution robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[61] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(6):1137–1149, 2017.
[62] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.
[63] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[64] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
[65] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[66] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, and Ping Luo. SparseR-CNN: End-to-end object detection with learnable proposals. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[67] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[68] Vidit Vidit, Martin Engilberge, and Mathieu Salzmann. Clip the gap: A single domain generalization approach for object detection, 2023.
[69] Deng-Bao Wang, Lei Feng, and Min-Ling Zhang. Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[70] Shaoru Wang, Jin Gao, Bing Li, and Weiming Hu. Narrowing the gap: Improved detector training with noisy location annotations. IEEE Transactions on Image Processing, 31:6369–6380, 2022.
[71] Xin Wang, Thomas E Huang, Benlin Liu, Fisher Yu, Xiaolong Wang, Joseph E Gonzalez, and Trevor Darrell. Robust object detection via instance-level temporal cycle confusion. In International Conference on Computer Vision (ICCV), 2021.
[72] Aming Wu and Cheng Deng. Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[73] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[74] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In International Conference on Machine Learning (ICML), volume 1, pages 609–616, 2001.
[75] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699, 2002.
[76] Haoyang Zhang, Ying Wang, Feras Dayoub, and Niko Sünderhauf. VarifocalNet: An IoU-aware dense object detector. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[77] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[78] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[79] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), 2021.
APPENDICES

Contents

1. Introduction
2. Notations and Preliminaries
3. An Overview to the SAOD Task
4. Obtaining Image-level Uncertainty
5. Calibration of Object Detectors
   5.1. Localisation-aware ECE
   5.2. Impact of Top-k Survival on Calibration
   5.3. Post hoc Calibration of Object Detectors
6. Baseline SAODets and Their Evaluation
7. Conclusive Remarks

A. Details of the Test Sets
   A.1. Obj45K and BDD45K Splits
        A.1.1 Obj45K Split
        A.1.2 BDD45K Split
   A.2. Obj45K-C and BDD45K-C Splits
   A.3. SiNObj110K-OOD Split

B. Details of the Used Object Detectors

C. Further Details on Image-level Uncertainty
   C.1. Why is Detection-Level OOD Detection for Object Detection Nontrivial?
   C.2. Definitions
        C.2.1 Detection-Level Uncertainties
        C.2.2 Aggregation Strategies to Obtain Image-Level Uncertainties
   C.3. More Analyses on Image-Level Uncertainty
        C.3.1 Computing Detection-level Uncertainty for Sigmoid-based Classifiers
        C.3.2 Combining Classification and Localisation Uncertainties
        C.3.3 The Effect of Aggregation Techniques on a Localisation Uncertainty Estimate
        C.3.4 On the Reliability of Image-level Uncertainties
        C.3.5 The Effectiveness of Using Pseudo OOD val set for Image-level Uncertainty Thresholding

D. Further Details on Calibration of Object Detectors
   D.1. How does AP benefit from low-scoring detections?
   D.2. Sensitivity of LaECE to TP validation threshold
   D.3. Derivation of Eq. (5)
   D.4. More Examples of Reliability Diagrams
   D.5. Numerical Values of Fig. 5

E. Further Details on SAOD and SAODets
   E.1. Algorithms to Make an Object Detector Self-Aware
   E.2. Sensitivity of the SAOD Performance Measures to the Image-level Uncertainty Threshold and Detection Confidence Threshold
   E.3. Effect of common improvement strategies on DAQ
   E.4. The Impact of Domain-shift on Detection-level Confidence Score Thresholding
   E.5. Qualitative Results of SAODets in comparison to Conventional Object Detectors
   E.6. Suggestions for Future Work
A. Details of the Test Sets those datasets, it is not practical to have a manual inspection
over all images and classes. In the following, we present our
This section provides the details of our test sets summa- resulting matching between COCO and Objects365 classes:
rized in Tab. 1. To give a general overview, while construct-
ing our datasets, we impose restrictions for a more princi- ’ person ’ : ’ Person ’ ,
pled evaluation: ’ bicycle ’ : ’ Bicycle ’ ,
• We ensure that there is at least one ID object in the ’ c a r ’ : [ ’ Car ’ , ’SUV ’ , ’ S p o r t s Car ’ ,
images of DID (and also in the ones in T (DID )) to ’ Formula 1 ’ ] ,
avoid a situation that an ID image does not include any ’ motorcycle ’ : ’ Motorcycle ’ ,
ID object. ’ airplane ’ : ’ Airplane ’ ,
• DOOD images does not include any foreground object. ’ b u s ’ : ’ Bus ’ ,
Besides, we use detection datasets with different ID ’ t r a i n ’ : ’ Train ’ ,
classes (iNat, Obj365 and SVHN) than our DTrain to ’ t r u c k ’ : [ ’ Truck ’ , ’ Pickup Truck ’ ,
promote OOD objects in OOD images. ’ F i r e T r u c k ’ , ’ Ambulance ’ ,
In the following, we present how we curate each of ’ Heavy T r u c k ’ ] ,
these test splits, that are (i) Obj45K and BDD45K as ’ b o a t ’ : [ ’ Boat ’ , ’ S a i l b o a t ’ , ’ Ship ’ ] ,
DID ; (ii) Obj45K-C and BDD45K-C as T (DID ); and (iii) ’ t r a f f i c l i g h t ’ : ’ Traffic Light ’ ,
SiNObj110K-OOD as DOOD . ’ f i r e hydrant ’ : ’ F i r e Hydrant ’ ,
’ st op s ig n ’ : ’ Stop Sign ’ ,
A.1. Obj45K and BDD45K Splits ’ parking meter ’ : ’ Parking meter ’ ,
We construct DID from different but semantically simi- ’ b e n c h ’ : ’ Bench ’ ,
lar datasets; thereby introducing domain-shift to be reflec- ’ b i r d ’ : [ ’ Wild B i r d ’ , ’ Duck ’ ,
tive of the challenges faced by detectors in practice such ’ Goose ’ , ’ P a r r o t ’ , ’ C h i c k e n ’ ] ,
as distribution shifts over time or lack of data in a partic- ’ c a t ’ : ’ Cat ’ ,
ular environment. To do so, we employ Objects365 [63] ’ dog ’ : ’ Dog ’ ,
for our SAOD-Gen use-case using COCO as ID data and ’ h o r s e ’ : ’ Horse ’ ,
BDD100K for our SAOD-AV use-case with nuImages com- ’ s h e e p ’ : ’ Sheep ’ ,
prising the ID data. In the following, we discuss the specific ’ cow ’ : ’Cow ’ ,
details how we constructed our Obj45K and BDD45K splits ’ elephant ’ : ’ Elephant ’ ,
from these datasets. ’ b e a r ’ : ’ Bear ’ ,
’ zebra ’ : ’ Zebra ’ ,
’ giraffe ’ : ’ Giraffe ’ ,
A.1.1 Obj45K Split
’ b a c k p a c k ’ : ’ Backpack ’ ,
We rely on Objects365 [63] to construct our Gen-OD ID ’ umbrella ’ : ’ Umbrella ’ ,
test set. Similar to COCO [43], which we use for training ’ h a n d b a g ’ : ’ Handbag / S a t c h e l ’ ,
and validation in our Gen-OD setting, Objects365 is a gen- ’ t i e ’ : [ ’ T i e ’ , ’Bow T i e ’ ] ,
eral object detection dataset. On the other hand, Object365 ’ s u i t c a s e ’ : ’ Luggage ’ ,
includes 365 different classes, which is significantly larger ’ frisbee ’ : ’ Frisbee ’ ,
than the 80 different classes in COCO dataset. Therefore, ’ skis ’ : ’ Skiboard ’ ,
using images from Objects365 to evaluate a model trained ’ s n o w b o a r d ’ : ’ Snowboard ’ ,
on COCO requires a proper matching between the classes ’ s p o r t s b a l l ’ : [ ’ Baseball ’ , ’ Soccer ’ ,
of COCO with those of Objects365. Fortunately, by de- ’ Basketball ’ , ’ Billards ’ ,
sign, Objects365 already includes most of the classes of ’ American F o o t b a l l ’ , ’ V o l l e y b a l l ’ ,
COCO in order to facilitate using these datasets together. ’ Golf B al l ’ , ’ Table Tennis ’ , ’ Tennis ’ ] ,
However, we inspect the classes in those datasets more thor- ’ k i t e ’ : ’ Kite ’ ,
oughly to prove a more proper one-to-many matching from ’ b a s e b a l l b a t ’ : ’ B a s e b a l l Bat ’ ,
COCO classes to Objects365 classes. As an example, ex- ’ b a s e b a l l g l o v e ’ : ’ B a s e b a l l Glove ’ ,
amining the objects labelled as chair in COCO dataset, ’ skateboard ’ : ’ Skateboard ’ ,
we observe that wheelchairs also pertain to the chair class ’ surfboard ’ : ’ Surfboard ’ ,
of COCO. However, in Objects365 dataset, Wheelchair ’ t e n n i s r a c k e t ’ : ’ Tennis Racket ’ ,
and Chair are different classes. Therefore, in this case, we ’ bottle ’ : ’ Bottle ’ ,
match chair class of COCO not only with Chair but also ’ wine g l a s s ’ : ’ Wine G l a s s ’ ,
with Wheelchair of Objects365. Having said that, we ’ cup ’ : ’ Cup ’ ,
also note that due to high numbers of images and classes in ’ f o r k ’ : ’ Fork ’ ,
’ k n i f e ’ : ’ Knife ’ , ’ N i g h t s t a n d ’ , ’ Desk ’ , ’ C o f f e e T a b l e ’ ,
’ s p o o n ’ : ’ Spoon ’ , ’ S i d e T a b l e ’ , ’ Watch ’ , ’ S t o o l ’ ,
’ bowl ’ : ’ Bowl / B a s i n ’ , ’ Machinery Vehicle ’ , ’ T r i c y c l e ’ ,
’ b a n a n a ’ : ’ Banana ’ , ’ C a r r i a g e ’ , ’ Rickshaw ’ , ’ Van ’ ,
’ a p p l e ’ : ’ Apple ’ , ’ T r a f f i c S i g n ’ , ’ Speed L i m i t S i g n ’ ,
’ s a n d w i c h ’ : ’ Sandwich ’ , ’ Crosswalk Sign ’ , ’ Flower ’ , ’ Telephone ’ ,
’ o r a n g e ’ : ’ Orange / T a n g e r i n e ’ , ’ Tablet ’ , ’ Flask ’ , ’ Briefcase ’ ,
’ broccoli ’ : ’ Broccoli ’ , ’ Egg t a r t ’ , ’ P i e ’ , ’ D e s s e r t ’ , ’ C o o k i e s ’ ,
’ carrot ’ : ’ Carrot ’ , ’ Wallet / Purse ’
’ h o t dog ’ : ’ Hot dog ’ ,
Finally, we collect 45K images for Obj45K split from
’ pizza ’ : ’ Pizza ’ ,
validation set of Objects365 that contains (i) at least one ID
’ d o n u t ’ : ’ Donut ’ ,
object based on the one-to-many matching between classes
’ c a k e ’ : ’ Cake ’ ,
of COCO and Objects365; and (ii) no object from an am-
’ c h a i r ’ : [ ’ Chair ’ , ’ Wheelchair ’ ] ,
biguous class. Compared to COCO val set with 5K images
’ c o u c h ’ : ’ Couch ’ ,
with 36K annotated objects, our Gen-OD ID test has 45K
’ potted plant ’ : ’ Potted Plant ’ ,
images with 237K objects, significantly outnumbering the
’ bed ’ : ’ Bed ’ ,
val set which is commonly used to analyse and test the mod-
’ din ing t a b l e ’ : ’ Dinning Table ’ ,
els mainly in terms of robustness aspects. Fig. A.6(a) com-
’ t o i l e t ’ : [ ’ Toilet ’ , ’ Urinal ’ ] ,
pares the number of objects for Obj45K split and COCO val
’ t v ’ : ’ M o n i t e r / TV ’ ,
set, showing that the number of objects for each class of our
’ l a p t o p ’ : ’ Laptop ’ ,
Obj45K split is for almost all classes (except 2 of 80 classes)
’ mouse ’ : ’ Mouse ’ ,
larger than the COCO val set. This large number of objects
’ r e m o t e ’ : ’ Remote ’ ,
enables us to evaluate the models thoroughly.
’ k e y b o a r d ’ : ’ Keyboard ’ ,
’ c e l l phone ’ : ’ C e l l Phone ’ ,
’ microwave ’ : ’ Microwave ’ , A.1.2 BDD45K Split
’ oven ’ : ’ Oven ’ , Considering that the widely-used AV datasets [4, 10, 16,
’ toaster ’ : ’ Toaster ’ , 65, 73] have pedestrian, vehicle and bicycle in
’ si nk ’ : ’ Sink ’ , common, we consider these three classes as ID classes of
’ refrigerator ’ : ’ Refrigerator ’ , our SAOD-AV use-case4 . Then, similar to how we ob-
’ book ’ : ’ Book ’ , tain Obj45K, we match these classes of nuImages with the
’ c l o c k ’ : ’ Clock ’ , classes of BDD100K, resulting in the following one-to-
’ v a s e ’ : ’ Vase ’ , many matching:
’ scissors ’ : ’ Scissors ’ ,
’ t e d d y b e a r ’ : ’ S t u f f e d Toy ’ , ’ pedestrian ’ : [ ’ pedestrian ’ ,
’ h a i r d r i e r ’ : ’ Hair Dryer ’ , ” other person ” ] ,
’ toothbrush ’ : ’ Toothbrush ’ ’ v e h i c l e ’ : [ ’ c a r ’ , ’ t r u c k ’ , ’ bus ’ ,
’ motorcycle ’ , ’ t r a i n ’ , ” t r a i l e r ” ,
Having matched the ID classes, we label the remaining ” other vehicle ” ] ,
classes of Objects365 either as “OOD” or “ambiguous”. ’ bicycle ’ : ’ bicycle ’
Specifically, a class is labelled as OOD if COCO classes
On the other hand, we observe a key difference in anno-
(or nuImages classes that we are interested in) do not con-
tating bicycle and motorcycle classes between nuIm-
tain that class and they will be discussed in Section A.3.
ages and BDD100K datasets. Specifically, while BDD100K
Subsequently, we label a class as an ambiguous class in the
has an additional class rider that is annotated separately
cases that we cannot confidently categorize the class neither
from bicycle and motorcycle objects, the riders of bi-
as ID nor as OOD. As an example, having examined quite
cycle and motorcycle are instead included in the annotated
a few COCO images with bottle class, we haven’t ob-
bounding box of bicycle and motorcycle objects in
served a flask, which is an individual class of Objects365
nuImages dataset. In order to align the annotations of these
(Flask). Still, as there might be instances of flask labelled
classes between BDD100K and nuImages and provide a
as bottle class in COCO, we categorize Flask class of
consistent evaluation, we aim to rectify the bounding box
Objects365 as ambiguous and do not use any of the images
annotations of these classes in BDD100K dataset such that
in Objects that has a Flask object in it. Following this, we
they follow the annotations of nuImages. Particularly, there
identify the following 25 out of 365 classes in Objects365
as ambiguous: 4 Accordingly, we train the models for SAOD-AV for these three classes.
[Figure A.6 bar charts: number of objects per class (log-scale y-axes); the legends compare COCO val with Obj45K and nuImages val with BDD45K.]
(a) COCO val vs. Obj45K (b) nuImages val vs. BDD45K

Figure A.6. Distribution of the objects over classes from our test sets and existing val sets. For both SAOD-Gen and SAOD-AV use-cases,
our DID have more objects nearly for all classes to provide a thorough evaluation. Note that y-axes are in log-scale.
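For concreteness, the snippet below sketches the filtering rule used to form Obj45K from the one-to-many class matching above: an image is kept only if it contains at least one object that maps to a COCO class and no object from the ambiguous list. The dictionary and set names (coco_to_obj365, ambiguous_classes) are illustrative placeholders for the listings given in this section, not the exact curation script.

def invert_matching(coco_to_obj365):
    # Invert the one-to-many matching so that each Objects365 class
    # points back to its COCO class.
    obj365_to_coco = {}
    for coco_cls, obj365_cls in coco_to_obj365.items():
        targets = obj365_cls if isinstance(obj365_cls, list) else [obj365_cls]
        for target in targets:
            obj365_to_coco[target] = coco_cls
    return obj365_to_coco

def keep_image_for_obj45k(image_labels, obj365_to_coco, ambiguous_classes):
    # Keep an image iff it has (i) at least one ID object and
    # (ii) no object from an ambiguous class.
    if any(label in ambiguous_classes for label in image_labels):
        return False
    return any(label in obj365_to_coco for label in image_labels)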

(a) BDD100K-style annotations (b) NuImages-style annotations (obtained by Hungarian matching)

Figure A.7. Aligning the annotations of certain classes in BDD100K and nuImages datasets while curating our BDD45K test set. The
riders and ridables (bicycles or motorcycles) need to be combined properly in (a). In this example, both of the rider objects are properly
assigned to the corresponding bicycle objects by our simple method relying on Hungarian algorithm. In (b), which we use as a test image
in our BDD45K, the bounding boxes are combined by finding the smallest enclosing bounding box and the objects are labelled as bicycles.
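The following is a minimal sketch of the rider-to-rideable assignment summarised in this caption and detailed in the text below. It assumes boxes in [x1, y1, x2, y2] format and relies on scipy's linear_sum_assignment for the Hungarian step; the function names themselves are illustrative rather than the exact curation code.

import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    # a, b: [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_riders_into_rideables(riders, rideables, min_iou=0.10):
    # Assign riders to rideables (bicycles/motorcycles) by maximising the
    # total IoU with the Hungarian algorithm; return None if any assigned
    # pair overlaps less than min_iou so that the image can be discarded.
    iou_matrix = np.array([[box_iou(r, v) for v in rideables] for r in riders])
    rider_idx, rideable_idx = linear_sum_assignment(-iou_matrix)
    merged = [list(v) for v in rideables]
    for r, v in zip(rider_idx, rideable_idx):
        if iou_matrix[r, v] < min_iou:
            return None  # sanity check failed: skip this image
        # Smallest enclosing box of the rider and its assigned rideable.
        merged[v] = [min(riders[r][0], rideables[v][0]),
                     min(riders[r][1], rideables[v][1]),
                     max(riders[r][2], rideables[v][2]),
                     max(riders[r][3], rideables[v][3])]
    return merged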

should be no rider class but bicycle and motorcycle objects include their riders in the resulting annotations. To do so, we use a simple matching algorithm on BDD100K images to combine bicycle and motorcycle objects with their riders. In particular, given an image, we first identify objects from the bicycle, motorcycle and rider categories. Then, we group bicycle and motorcycle objects as “rideables” and compute the IoU between each rideable and rider object. Given this matrix representing the proximity between each rideable and rider object in terms of their IoUs, we assign riders to rideables by maximizing the total IoU using the Hungarian assignment algorithm [3]. Furthermore, we include a sanity check to avoid possible matching errors, e.g., in which a rideable object might be combined with a rider at a distant location in the image due to possible annotation errors. Specifically, our simple sanity check is to require a minimum IoU overlap of 0.10 between a rider and its assigned rideable in the resulting assignment from the Hungarian algorithm. Otherwise, if any of the riders is assigned to a rideable object with an IoU less than 0.10 in an image, we simply do not include this image in our BDD45K test set. Finally, exploiting the assignment result, we obtain the bounding box annotation as the smallest enclosing bounding box of the rider box and the rideable box. As for the category annotation of the object, we simply use the category of the rideable, which is either bicycle or motorcycle. Fig. A.7 presents an example in which we convert BDD100K annotations of these specific classes into the nuImages format. To validate our approach, we manually examine more than 2500 images in the BDD45K test set and observe that it is effective in aligning the annotations of nuImages and BDD100K.

Overall, using this strategy, we collect 45K images from the training and validation sets of BDD100K and construct our BDD45K split. We would like to highlight that our
BDD45K dataset is diverse and extensive, where (i) it is
[Figure A.8 bar charts: number of images (log-scale y-axis) by weather (clear, rainy, snowy, overcast, cloudy, foggy), time of day (day, dawn/dusk, night) and scene (city, highway, residential, parking, tunnel).]

Figure A.8. The diversity of BDD45K split in terms of weather, time of day and scene counts.

larger compared to 16K images of nuImages val set; and (ii) detector as long as it can infer that it is uncertain and rejects
it includes 543K objects in total, significantly larger than the such images with high corruption severity.
number of objects from these 3 classes in nuImages val set
with 96K objects. Please refer to Fig. A.6(b) for quantita- A.3. SiNObj110K-OOD Split
tive comparison. In terms of diversity, our BDD45K (DID ) This split is designed to evaluate the reliability of the
comes from a different distribution than nuImages (DTrain ); uncertainties. Following similar work [13, 21], we ensure
thereby introducing natural covariate shift. Fig. A.8 il- that the images in our OOD test set do not include any object
lustrates that our BDD45K is very diverse and it is col- from ID classes. Specifically, in order to use SiNObj110K-
lected from different cities using different camera types than OOD within both SAOD-Gen and SAOD-AV datasets, we
nuImages (DTrain ). As a result, as we will see in Sec. B, select an image to SiNObj110K-OOD if the image does not
the accuracy of the models drops significantly from DVal include an object from either of the ID classes of Obj45K or
to DID even before the corruptions are employed. We note BDD45K (DID ). Then, we collect 110K images from three
that ImageNet-C corruptions are then applied to this dataset, different detection datasets as detailed below:
further increasing the domain shift.
• SVHN subset of SiNObj110K-OOD. We include all
A.2. Obj45K-C and BDD45K-C Splits 46470 full numbers (not cropped digits) using both
While constructing Obj45K-C and BDD45K-C as training and test sets of SVHN dataset in our OOD test
T (DID ), we use the following 15 different corruptions from set.
4 main groups [25]: • iNaturalist OOD subset of SiNObj110K-OOD. We use
• Noise. gaussian noise, shot noise, impulse noise, the validation set of iNaturalist 2017 object detection
speckle noise dataset to obtain our iNaturalist dataset. Specifically,
we include 28768 images in our OOD test set with the
• Blur. defocus blur, motion blur, gaussian blur following classes:

• Weather. snow, frost, fog, brightness ’ A c t i n o p t e r y g i i ’ , ’ Amphibia ’ ,


’ Animalia ’ , ’ Arachnida ’ ,
• Digital. contrast, elastic transform, pixelate, jpeg com- ’ I n s e c t a ’ , ’ Mollusca ’ , ’ R e p t i l i a ’
pression
• Objects365 OOD subset of SiNObj110K-OOD. To se-
Then, given an image for a particular severity level that can
lect images for our OOD test set from Objects365
be 1, 3 or 5, we randomly sample a transformation and ap-
dataset, we use the following classes as OOD:
ply to the image. In such a way, we obtain 3 different copies
of Obj45K and BDD45K during evaluation. ’ S n e a k e r s ’ , ’ O t h e r S h o e s ’ , ’ Hat ’ ,
We outline in the definition of the SAOD task (Sec. 3) ’ Lamp ’ , ’ G l a s s e s ’ , ’ S t r e e t L i g h t s ’ ,
that an image with a corruption severity 5 might not con- ’ Cabinet / shelf ’ , ’ Bracelet ’ ,
tain enough cues to perform object detection reliably and ’ P i c t u r e / Frame ’ , ’ Helmet ’ , ’ G l o v e s ’ ,
that a SAODet is flexible to accept or reject such images as ’ S t o r a g e box ’ , ’ L e a t h e r S h o e s ’ , ’ F l a g ’ ,
long as it yields accurate and calibrated detections on the ’ P i l l o w ’ , ’ Boots ’ , ’ Microphone ’ ,
accepted ones. To provide insight of providing this flexi- ’ N e c k l a c e ’ , ’ Ring ’ , ’ B e l t ’ ,
bility, Fig. A.9 presents example corruptions with severity ’ S p e a k e r ’ , ’ T r a s h b i n Can ’ , ’ S l i p p e r s ’ ,
5. Note that several cars in the corrupted images above and ’ Barrel / bucket ’ , ’ Sandals ’ , ’ Bakset ’ ,
birds in the ones below are not visible any more due to the ’ Drum ’ , ’ Pen / P e n c i l ’ , ’ High H e e l s ’ ,
severity of the corruption. As notable examples, some of the ’ G u i t a r ’ , ’ C a r p e t ’ , ’ B r e a d ’ , ’ Camera ’ ,
cars in Fig. A.9(b) and the birds in Fig. A.9(h) Fig. A.9(h) ’ Canned ’ , ’ T r a f f i c c o n e ’ , ’ Cymbal ’ ,
are not visible. As a result, instead of enforcing the detector ’ L i f e s a v e r ’ , ’ Towel ’ , ’ C a n d l e ’ ,
to predict all of the objects accurately, we do not penalize a ’ Awning ’ , ’ F a u c e t ’ , ’ T e n t ’ , ’ M i r r o r ’ ,
(a) Clean Image - BDD45K (b) Constrast (c) Motion Blur (d) Snow

(e) Clean Image - Obj45K (f) JPEG compression (g) Elastic transform (h) Frost

Figure A.9. Clean and corrupted images using different transformations at severity 5 from AV-OD (upper row) and Gen-OD (lower row)
use-cases. We do not penalize a detector if it can infer that it is uncertain and rejects such images with high corruption severity.
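As a rough illustration of how the corrupted copies T(DID) can be generated, the sketch below samples one of the 15 transformations uniformly at random for every image and applies it at the chosen severity. It assumes the imagecorruptions package that provides ImageNet-C style corruptions [25]; if a different corruption implementation is used, only the corrupt() call changes, the sampling logic stays the same.

import random
import numpy as np
from imagecorruptions import corrupt  # assumed ImageNet-C style implementation [25]

# The 15 corruptions listed above, grouped as noise / blur / weather / digital.
CORRUPTIONS = [
    'gaussian_noise', 'shot_noise', 'impulse_noise', 'speckle_noise',
    'defocus_blur', 'motion_blur', 'gaussian_blur',
    'snow', 'frost', 'fog', 'brightness',
    'contrast', 'elastic_transform', 'pixelate', 'jpeg_compression',
]

def corrupt_image(image, severity):
    # image: HxWx3 uint8 array; severity: 1, 3 or 5 in our test sets.
    name = random.choice(CORRUPTIONS)
    return corrupt(np.asarray(image), corruption_name=name, severity=severity)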

’ Power o u t l e t ’ , ’ A i r C o n d i t i o n e r ’ , ’ B r u s h ’ , ’ P e n g u i n ’ , ’ Megaphone ’ ,
’ Hockey S t i c k ’ , ’ P a d d l e ’ , ’ B a l l o n ’ , ’ Corn ’ , ’ L e t t u c e ’ , ’ G a r l i c ’ ,
’ T r i p o d ’ , ’ Hanger ’ , ’ Swan ’ , ’ H e l i c o p t e r ’ , ’ Green Onion ’ ,
’ B l a c k b o a r d / W h i t e b o a r d ’ , ’ Napkin ’ , ’ N u t s ’ , ’ I n d u c t i o n Cooker ’ ,
’ O t h e r F i s h ’ , ’ T o i l e t r y ’ , ’ Tomato ’ , ’ Broom ’ , ’ Trombone ’ , ’ Plum ’ ,
’ L a n t e r n ’ , ’ Fan ’ , ’ Pumpkin ’ , ’ G o l d f i s h ’ , ’ Kiwi f r u i t ’ ,
’ Tea p o t ’ , ’ Head Phone ’ , ’ S c o o t e r ’ , ’ R o u t e r / modem ’ , ’ P o k e r Card ’ ,
’ S t r o l l e r ’ , ’ C r a n e ’ , ’ Lemon ’ , ’ Shrimp ’ , ’ S u s h i ’ , ’ C h e e s e ’ ,
’ S u r v e i l l a n c e Camera ’ , ’ J u g ’ , ’ P i a n o ’ , ’ Notepaper ’ , ’ Cherry ’ , ’ P l i e r s ’ ,
’ Gun ’ , ’ S k a t i n g and S k i i n g s h o e s ’ , ’CD ’ , ’ P a s t a ’ , ’ Hammer ’ ,
’ Gas s t o v e ’ , ’ S t r a w b e r r y ’ , ’ Cue ’ , ’ Avocado ’ , ’ Hamimelon ’ ,
’ Other B a l l s ’ , ’ Shovel ’ , ’ Pepper ’ , ’ Mushroon ’ , ’ S c r e w d r i v e r ’ , ’ Soap ’ ,
’ Computer Box ’ , ’ T o i l e t P a p e r ’ , ’ Recorder ’ , ’ Eggplant ’ ,
’ Cleaning Products ’ , ’ Chopsticks ’ , ’ Board E r a s e r ’ , ’ C o c o n u t ’ ,
’ P i g e o n ’ , ’ C u t t i n g / c h o p p i n g Board ’ , ’ Tape Measur / R u l e r ’ , ’ P i g ’ ,
’ Marker ’ , ’ L a d d e r ’ , ’ R a d i a t o r ’ , ’ Showerhead ’ , ’ Globe ’ , ’ C h i p s ’ ,
’ Grape ’ , ’ P o t a t o ’ , ’ S a u s a g e ’ , ’ S t e a k ’ , ’ S t a p l e r ’ , ’ Campel ’ ,
’ V i o l i n ’ , ’ Egg ’ , ’ F i r e E x t i n g u i s h e r ’ , ’ Pomegranate ’ , ’ Dishwasher ’ ,
’ Candy ’ , ’ C o n v e r t e r ’ , ’ B a t h t u b ’ , ’ Crab ’ , ’ Meat b a l l ’ , ’ R i c e Cooker ’ ,
’ G o l f Club ’ , ’ Cucumber ’ , ’ Tuba ’ , ’ C a l c u l a t o r ’ ,
’ Cigar / C i g a r e t t e ’ , ’ P a i n t Brush ’ , ’ Papaya ’ , ’ Antelope ’ , ’ S e a l ’ ,
’ P e a r ’ , ’ Hamburger ’ , ’ B u t t e f l y ’ , ’ Dumbbell ’ ,
’ E x t e n t i o n Cord ’ , ’ Tong ’ , ’ F o l d e r ’ , ’ Donkey ’ , ’ L i o n ’ , ’ D o l p h i n ’ ,
’ e a r p h o n e ’ , ’ Mask ’ , ’ K e t t l e ’ , ’ Electric Drill ’ , ’ Jellyfish ’ ,
’ Swing ’ , ’ C o f f e e Machine ’ , ’ S l i d e ’ , ’ Treadmill ’ , ’ Lighter ’ ,
’ Onion ’ , ’ Green b e a n s ’ , ’ P r o j e c t o r ’ , ’ G r a p e f r u i t ’ , ’Game b o a r d ’ ,
’ Washing Machine / D r y i n g Machine ’ , ’Mop ’ , ’ R a d i s h ’ ,
’ P r i n t e r ’ , ’ Watermelon ’ , ’ Saxophone ’ , ’ Baozi ’ , ’ Target ’ , ’ French ’ ,
’ T i s s u e ’ , ’ I c e cream ’ , ’ H o t a i r b a l l o n ’ , ’ S p r i n g R o l l s ’ , ’ Monkey ’ , ’ R a b b i t ’ ,
’ Cello ’ , ’ French F r i e s ’ , ’ Scale ’ , ’ P e n c i l Case ’ , ’ Yak ’ ,
’ Trophy ’ , ’ Cabbage ’ , ’ B l e n d e r ’ , ’ Red Cabbage ’ , ’ B i n o c u l a r s ’ ,
’ P e a c h ’ , ’ R i c e ’ , ’ Deer ’ , ’ Tape ’ , ’ Asparagus ’ , ’ B a r b e l l ’ ,
’ C o s m e t i c s ’ , ’ Trumpet ’ , ’ P i n e a p p l e ’ , ’ S c a l l o p ’ , ’ Noddles ’ ,
’ Mango ’ , ’ Key ’ , ’ H u r d l e ’ , ’Comb ’ , ’ Dumpling ’ ,
’ F i s h i n g Rod ’ , ’ Medal ’ , ’ F l u t e ’ , ’ O y s t e r ’ , ’ Green V e g e t a b l e s ’ ,
Table A.8. COCO-style AP of the used object detectors on val- of the image within the range of [480, 800] by limiting its
idation set (DVal ) and test set (DID ), along with their corrupted longer size to 1333 and keeping the original aspect ratio; or
versions (T (DVal ) and T (DID )). (ii) a sequence of
T (DVal ) T (DID )
Dataset Detector DVal DID • randomly resizing the shorter side of the image within
C1 C3 C5 C1 C3 C5
F-RCNN 39.9 31.3 20.3 10.8 27.0 20.3 12.8 6.9 the range of [400, 600] by limiting its longer size to
RS-RCNN 42.0 33.7 21.8 11.6 28.6 21.7 13.7 7.3 4200 and keeping the original aspect ratio,
SAOD ATSS 42.8 33.9 22.3 11.9 28.8 22.0 14.0 7.3
Gen D-DETR 44.3 36.2 24.0 12.2 30.5 23.4 15.4 8.0
• random cropping with a size of [384, 600],
NLL-RCNN 40.1 31.0 20.0 11.6 26.9 20.3 12.9 6.8
ES-RCNN 40.3 31.6 20.3 11.7 27.2 20.6 13.0 6.9
SAOD F-RCNN 55.0 44.9 31.1 16.7 23.2 19.8 12.8 7.2 • randomly resizing the shorter side of the cropped im-
AV ATSS 56.9 47.1 34.1 18.9 25.1 21.7 14.8 8.6 age within the range of [480, 800] by limiting its
longer size to 1333 and keeping the original aspect ra-
tio.
’ Cosmetics Brush / E y e l i n e r P e n c i l ’ ,
’ Chainsaw ’ , ’ E r a s e r ’ , ’ L o b s t e r ’ ,
Unless otherwise noted, we train all of the detectors (as
’ D u r i a n ’ , ’ Okra ’ , ’ L i p s t i c k ’ ,
aforementioned, with the exception of D-DETR, which is
’ Trolley ’ , ’ Cosmetics Mirror ’ ,
trained for 50 epochs following its recommended settings
’ Curling ’ , ’ Hoverboard ’ ,
[79]) for 36 epochs using 16 images in a batch on 8 GPUs.
’ P l a t e ’ , ’ Pot ’ ,
Following the previous works, we use the initial learning
’ E x t r a c t o r ’ , ’ Table T e n i i s paddle ’
rates of 0.020 for F-RCNN, NLL-RCNN and ES-RCNN;
Using both training and validation sets of Objects365, 0.010 for ATSS; and 0.012 for RS-RCNN. We decay the
we collect 35190 images that only contains objects learning rate by a factor of 10 after epochs 27 and 33. As
from above classes. a backbone, we use a ResNet-50 with FPN [41] for all the
models, as is common in practice. At test time, we simply
Consequently, our resulting SiNObj110K-OOD is both rescale the images to 800 × 1333 and do not use any test-
diverse and extensive compared to the datasets introduced in time augmentation. For the rest of the design choices, we
previous work [13, 21] which includes around 1-2K images follow the recommended settings of the detectors.
and is collected from a single dataset. As for SAOD-AV, we train F-RCNN [61] and ATSS [77]
on nuImages training set by following the same design
B. Details of the Used Object Detectors choices. We note that these models are trained using the
annotations of the three classes (pedestrian, vehicle
Here we demonstrate the details of the selected object
and bicycle) in nuImages dataset.
detectors and ensure that their performance is inline with
their expected results. We build our SAOD framework We display baseline results in Tab. A.8 on DVal ,
upon the mmdetection framework [7] since it enables us T (DVal ), DID and T (DID ) data splits, which shows the
using different datasets and models also with different de- performance on the COCO val set (DVal of SAOD-Gen in
sign choices. As presented in Sec. 3, we use four conven- the table) is inline or higher with those published in the cor-
tional and two probabilistic object detectors. We exploit responding papers. We would like to note that the perfor-
all of these detectors for our SAOD-Gen setting by training mance on DVal is lower than that on DID due to (i) more
them on the COCO training set as DTrain . We train all the challenging nature of Object365/BDD100K compared to
detectors with the exception of D-DETR. As for D-DETR, COCO/nuImages and (ii) the domain shift between them.
we directly employ the trained D-DETR model released in As an example, AP drops ∼ 30 points from DVal (nuIm-
mmdetection framework. This D-DETR model is trained ages) to DID (BDD45K) even before the corruptions are ap-
for 50 epochs with a batch size of 32 images on 16 GPUs (2 plied. As expected, we also see a decrease in performance
images/GPU) and corresponds to the vanilla D-DETR (i.e., with increasing severity of corruptions.
not its two-stage version and without iterative bounding box
refinement). C. Further Details on Image-level Uncertainty
While training the detectors, we incorporate the multi-
scale training data augmentation used by D-DETR into This section presents further details on image-level un-
them in order to obtain stronger baselines. Specifically, the certainty including the motivation behind; the definitions of
multi-scale training data augmentation is sampled randomly the used uncertainty estimation techniques; and more anal-
from two alternatives: (i) randomly resizing the shorter side yses.
C.2. Definitions
Here, we provide the definitions of the detection-level
uncertainty estimation methods for classification and local-
isation as well as the aggregation techniques we used to ob-
tain image-level uncertainty estimates.

C.2.1 Detection-Level Uncertainties


In the following, we present how we obtain detection-level
uncertainties from classification and localisation heads. We
note that all of these uncertainties, except the uncertainty
score, are computed on the raw detections represented by
raw
{b̂raw
i , p̂raw
i }N in Sec. 2 and then propagated through
Figure A.10. (Left) An example image from the ID test set and the post-processing steps. The uncertainty score is, instead,
(Right) an example image from the OOD test set used by [13]. directly computed using the confidence of the final detec-
The flag is not an ID class but exists in both ID and OOD test tions (p̂i ). In such a way, we obtain the uncertainty val-
sets as indicated by red bounding boxes. Labelling the detections ues of top-k final detections, which are then aggregated for
corresponding to the flag as ID or OOD is non-trivial without image-level uncertainty estimates.
labelling every pixel in the images of the training set. Conversely,
current works label all detections from an ID image as ID and
Classification Uncertainties We use the following
those from an OOD image as OOD; and then compute AUROC
in detection-level using the measured uncertainty. This way of
detection-level classification uncertainties:
detection-level OOD detection evaluation might not be ideal for • The entropy of the predictive distribution. The stan-
object detection. dard configuration of F-RCNN, NLL-RCNN and ES-
RCNN employ a softmax classifier over K ID classes
and background; resulting in a K+1-dimensional cate-
gorical distribution. Denoting this distribution by p̂raw
i
(Sec. 2), the entropy of p̂raw
i is:
C.1. Why is Detection-Level OOD Detection for Ob-
K+1
ject Detection Nontrivial? X
H(p̂raw
i )=− p̂raw raw
ij log p̂ij , (A.6)
j=1
As we motivated in Sec. 1 and Sec. 4, evaluating the
reliability of uncertainties using OOD detection task in such that p̂raw
ij is the probability mass in jth class in
detection-level is conceptually non-trivial for object detec- p̂raw
i . As for the object detectors which exploit class-
tion. This is because there is no clear definition as to which wise sigmoid classifiers, the situation is more com-
detections can be considered ID and which cannot. To elab- plicated since the prediction p̂rawi comprises of K
orate further, unknown objects may appear in two forms at Bernoulli random variables, instead of a single dis-
test time: (i) “known-unknowns”, which can manifest as tribution unlike the softmax classifier. Therefore, we
background and unlabelled objects in the training set or (ii) will discuss and analyse different ways of computing
“unknown-unknowns”, completely unseen objects, which entropy for the detectors using sigmoid classifiers in
are not present in the training data. It is not possible to Sec. C.3.1.
split these unknown objects into the two categories without • Dempster-Shafer. We use the logits as the evidence to
having labels for every pixel in the training set [12]. Cur- compute DS [62]. Accordingly, denoting the jth logit
rent evaluation [13,21] however, does not adhere to this and (i.e., for class j) of the ith detection obtained from a
instead defines “an image” with no ID object as OOD but softmax-based detector by sij , we compute the uncer-
assumes “any detection” in an OOD image is an OOD de- tainty by
tection, and vice versa for an ID image; thereby decreasing K +1
the reliability of the evaluation. Fig. A.10 presents an exam- DS = PK+1 , (A.7)
ple from an existing ID and OOD test splits [13] to illustrate K +1+ j=1 exp (sij )
why the reliability of the evaluation decreases. Conversely, and similarly, for a sigmoid-based classifier yielding
as we have followed, evaluating the reliability of the un- K logits, we simply use
certainties for object detectors based on OOD detection at
the image-level aligns with the definition of OOD images, K
DS = PK . (A.8)
which is again at image-level. K+ j=1 exp (sij )
• Uncertainty score. While H(p̂rawi ) and DS are com- Table A.9. AUROC values for different variations of computing
puted on the raw detections, we compute uncertainty entropy as the uncertainty for sigmoid-based detectors. Applying
score based on final detections using the detection con- softmax to the logits to obtain K-dimensional categorical distribu-
tion performs the best for all detectors.
fidence score as 1 − p̂i .
Detector average max class categorical
Localisation Uncertainties We utilise the covariance RS-RCNN 73.3 91.2 93.7
matrix Σ predicted by the probabilistic detectors (NLL- ATSS 79.9 27.5 94.3
RCNN and ES-RCNN) to compute the uncertainty of a de- D-DETR 63.4 27.9 93.9
tection in the localisation head. As described in Sec. 3, our
models predicts a diagonal covariance matrix,
• mean:
 2 
σ1 0 0 0 N
 0 σ22 0 0 1 X
Σ= , (A.9) G(X) = ui (A.14)
0 2
0 σ3 0  N i=1
0 0 0 σ42
• mean(top-m): Denoting ϕ(i) as the index of the ith
for each detection such that σi2 with 0 < i ≤ 4 is the pre- smallest uncertainties,
dicted variance of the Gaussian for ith bounding box param-  m
1
P
eter. Considering that an increase in Σ should imply more

m
 uϕ(i) , if N ≥ m
i=1
uncertainty of the localisation head, we define the following G(X) = N (A.15)
 N1
P
uϕ(i) , if 0 < N < m

uncertainty measures for localisation exploiting Σ,. 
i=1
• The determinant of the predicted covariance matrix
• min: Similarly, denoting uϕ(1) is the smallest uncer-
4
Y tainty or the most certain one,
|Σ| = σi2 (A.10)
i=1 G(X) = uϕ(1) (A.16)
• The trace of the predicted covariance matrix Finally, we consider the extreme case in which all of the
detections are eliminated in the background removal stage,
4
X the first step of the post-processing. It is also worth men-
tr(Σ) = σi2 (A.11)
tioning that this case can be avoided by reducing the score
i=1
threshold of the detectors, which is typically 0.05. How-
• The entropy of the predicted multivariate Gaussian ever, using off-the-shelf detectors by keeping their hyper-
distribution [49] parameters as they are, we observe rare cases that a detector
may not yield any detection for an image. To give an in-
1 tuition how rare these cases are, we haven’t observed any
H(Σ) = 2 + 2 ln(2π) + ln(|Σ|) (A.12)
2 image with no detection for D-DETR and RS-RCNN and
there very few images for F-RCNN. However, for the sake
C.2.2 Aggregation Strategies to Obtain Image-Level of completeness, we assign a large uncertainty value (typi-
Uncertainties cally 1012 ) that ensures that the image is classified as OOD
in such cases.
In this section, given detection-level uncertainties {ui }N
where ui is the detection-level uncertainty for the ith de- C.3. More Analyses on Image-Level Uncertainty
tection, we present our aggregation strategies to obtain the
This section includes more analyses on obtaining image-
image-level uncertainty G(X). Note that ui corresponds to
level uncertainties. We use our Gen-OD setting and report
a detection-level uncertainty after post-processing with top-
AUROC following Sec. 4 unless explicitly otherwise noted.
k survival (Sec. 2), hence implying N ≤ k. In particular,
we use the following aggregation techniques that enables us
to obtain reliable image-level uncertainties from different C.3.1 Computing Detection-level Uncertainty for
detectors: Sigmoid-based Classifiers
• sum: Unlike the detectors using K + 1-class softmax classi-
N
fiers, the detectors employing sigmoid-based classifiers,
such as RS R-CNN, ATSS and D-DETR, yield K differ-
X
G(X) = ui (A.13)
i=1
ent Bernoulli random variables each of which corresponds
Table A.10. Combining classification and localisation uncertain- detection, here we investigate whether there is a benefit in
ties. Please refer to the text for more details. combining such uncertainties for a detection. Basically, we
combine the entropy of the predictive classification distribu-
H(p̂raw
Detector ) H(Σ) Balanced Norm. AUROC
i tion (H(p̂rawi )) and the entropy of the predictive Gaussian
✓ 92.4 distribution for localisation (H(Σ)). Assuming these two
✓ 87.7 distributions independent, the entropy of the joint distribu-
NLL-RCNN ✓ ✓ 89.9 tion can be obtained by the summation over the entropies,
✓ ✓ ✓ 92.2 i.e., H(p̂raw ) + H(Σ). However, this way of combining
i
✓ ✓ ✓ ✓ 93.2 tends to overestimate the contribution of the localisation as
✓ 92.8 the localisation output involves four random variables but
✓ 86.4 the classification is univariate. Consequently, we first in-
ES-RCNN ✓ ✓ 89.3 crease the contribution of the classification by multiplying
✓ ✓ ✓ 91.8 its entropy by 4, which results in a positive effect (“bal-
✓ ✓ ✓ ✓ 93.2 anced” in Table A.10). We also find it useful to normalize
the uncertainties between 0 and 1 using their minimum and
maximum scores obtained on the validation set. This nor-
to one of the K different classes in DTrain . In this case, one
malisation shows for the both cases that the resulting per-
can think of different ways to compute the detection-level
formance is better compared to using only classification and
uncertainty. Here, we analyse the effect of three different
localisation (“Norm.” in Table A.10).
methods to obtain the uncertainty for a detection as (i) the
average over the entropies of K Bernoulli random variables; We also would like to note that the resulting AUROC
(ii) the entropy of the maximum-scoring class; and (iii) ob- does not outperform using only the uncertainty score (1 −
taining a categorical distribution over the classes through p̂i ), which yields 94.1 AUROC score with mean(top-3) for
softmax first, and then, computing the entropy of this cate- both of the detectors ES-RCNN and NLL-RCNN as demon-
gorical distribution. More precisely, for the jth class and the strated in Tab. 2. Hence, instead of combining classification
ith detection, denoting the predicted logit and correspond- and localisation uncertainties, we still suggest using 1 − p̂i
ing probability (obtained through sigmoid) by ŝij and p̂raw to obtain the detection-level uncertainties.
ij
respectively, we define the following uncertainties:
• The average of the entropies of the K Bernoulli ran-
dom variables: C.3.3 The Effect of Aggregation Techniques on a Lo-
K
calisation Uncertainty Estimate
1 X raw
p̂ij log p̂raw + (1 − p̂raw raw

ij ij ) log(1 − p̂ij ) ;
K j=1 In Tab. 2, we investigated the aggregation methods using
the uncertainty score, which is a classification-based uncer-
(A.17)
tainty estimate. Here in Tab. A.11, we present an extended
• The entropy of the maximum-scoring class: version of Tab. 2 including the effect of the same aggre-
gation methods on |Σ| as one example of our localisation
p̂raw raw raw raw
ik log p̂ik + (1 − p̂ik ) log(1 − p̂ik ) (A.18) uncertainty estimation methods. Tab. A.11 also validates
our conclusion on |Σ| to use the most certain detections in
with k being the maximum scoring class for detection
obtaining image-level uncertainties.
i; and
• The entropy of the categorical distribution: It is
simply obtained by first applying softmax over K-
dimensional logits, yielding a categorical distribution, C.3.4 On the Reliability of Image-level Uncertainties
and then, computing the entropy following Eq. (A.6)
with K classes. Here, similar to our analysis on SAOD-Gen, we present
Tab. A.9 indicates that the entropy of the categorical the distribution of the image-level uncertainties for differ-
distribution performs consistently the best for all three ent subsets of DTest on SAOD-AV. Fig. A.11 confirms that
sigmoid-based detectors. the following observations obtained on SAOD-Gen are also
valid for SAOD-AV: (i) the distribution gets more right-
skewed as the subset moves away from the DID and (ii) AP
C.3.2 Combining Classification and Localisation Un-
(black line) perfectly decreases as the uncertainties increase
certainties
(refer to Sec. 4 for details). These figures confirm on a dif-
Considering that the probabilistic object detectors yield an ferent dataset that the image-level uncertainties are reliable
uncertainty both for classification and localisation for each and effective.
Table A.11. AUROC scores of different aggregations to obtain image-level uncertainty. |Σ| can be computed for probabilistic detectors,
hence N/A for others. mean(top-m) refers to average of the uncertainties of the most m certain detections (the detections with the least
uncertainty based on 1 − p̂i or |Σ|). Using few most certain detections perform better for both detection-level uncertainty estimation
methods. Underlined & Bold: best of a detector, bold: second best.

Dataset sum mean mean(top-5) mean(top-3) mean(top-2) min


Detector
(DID vs. DOOD ) 1 − p̂i |Σ| 1 − p̂i |Σ| 1 − p̂i |Σ| 1 − p̂i |Σ| 1 − p̂i |Σ| 1 − p̂i |Σ|
F R-CNN [61] 20.9 N/A 84.1 N/A 93.4 N/A 94.1 N/A 94.4 N/A 93.8 N/A
RS R-CNN [55] 85.8 N/A 85.8 N/A 94.3 N/A 94.8 N/A 94.8 N/A 93.5 N/A
ATSS [77] 66.2 N/A 86.3 N/A 93.8 N/A 94.2 N/A 94.0 N/A 92.6 N/A
SAOD-Gen
D-DETR [79] 85.2 N/A 85.2 N/A 94.4 N/A 94.7 N/A 94.6 N/A 93.3 N/A
NLL R-CNN [23] 22.6 41.6 83.8 74.9 93.4 87.4 94.1 87.6 94.4 87.5 93.7 87.0
ES R-CNN [21] 22.1 24.5 84.6 32.9 93.4 83.8 94.1 85.0 94.4 85.7 93.8 86.3
F R-CNN [61] 27.1 N/A 84.1 N/A 96.4 N/A 97.3 N/A 97.4 N/A 96.0 N/A
SAOD-AV
ATSS [77] 18.8 N/A 92.2 N/A 97.7 N/A 97.6 N/A 97.3 N/A 95.7 N/A

[Bar charts: percentage of data per image-level uncertainty bin (0.0–1.0) for clean ID, corruption severities 1, 3 and 5, and OOD data, with the AP over all non-OOD data overlaid.]

(a) F-RCNN (b) ATSS

Figure A.11. The distribution of the image-level uncertainties obtained from different detectors on clean ID, corrupted ID with severities
1, 3, 5 and OOD data on SAOD-AV dataset.
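To make the aggregation used in these analyses concrete, the sketch below turns the detection-level uncertainty scores 1 − p̂i of an image into an image-level uncertainty with the mean(top-m) rule and compares it against the threshold ū; the variable and function names are illustrative, not the exact evaluation code.

import numpy as np

def image_level_uncertainty(confidences, m=3, no_detection_value=1e12):
    # confidences: final detection confidence scores p_hat of one image.
    # The detection-level uncertainty is 1 - p_hat; mean(top-m) averages the
    # m most certain detections. A large constant is returned when the
    # detector yields no detections, so that the image is treated as uncertain.
    if len(confidences) == 0:
        return no_detection_value
    uncertainties = 1.0 - np.asarray(confidences, dtype=float)
    m = min(m, len(uncertainties))
    return float(np.sort(uncertainties)[:m].mean())

def accept_image(confidences, u_bar, m=3):
    # a_hat = 1 (accept) iff the image-level uncertainty is below the threshold u_bar.
    return int(image_level_uncertainty(confidences, m=m) < u_bar)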

Table A.12. Effectiveness of our pseudo-OOD set approach com- a baseline to compare our method against and demonstrate
pared to using [email protected]. its effectiveness. However, to the best of our knowledge,
there is no such a method that obtains a threshold relying
Task Detector Method BA TPR TNR
only on the ID data for OOD detection task. As a result,
[email protected] 83.2 98.5 72.0
F-RCNN inspired from the performance measure [email protected] [13],
pseudo-OOD 87.7 94.7 81.6
[email protected] 84.0 98.3 73.4 we simply set the threshold ū to the value that corresponds
RS-RCNN to [email protected], and use it as a baseline. Note that this ap-
pseudo-OOD 88.9 92.8 85.3
Gen-OD proach only relies on the ID val set= and hence there is
[email protected] 84.7 96.9 75.2
ATSS
pseudo-OOD 87.8 93.1 83.0 no need for OOD val set, which is similar to our pseudo-
D-DETR
[email protected] 85.8 97.2 76.8 OOD approach. Tab. A.12 compares our pseudo-OOD ap-
pseudo-OOD 88.9 90.0 87.8 proach with [email protected] baseline; suggesting, on average,
[email protected] 80.9 97.7 69.1 more than 4.5 BA gain over the baseline method; thereby
F-RCNN
pseudo-OOD 91.0 94.1 88.2 confirming the effectiveness of our approach.
SAOD-AV
[email protected] 83.5 96.7 73.5
ATSS
pseudo-OOD 85.8 95.9 77.6 D. Further Details on Calibration of Object
Detectors
C.3.5 The Effectiveness of Using Pseudo OOD val set This section provides further details and analyses on cal-
for Image-level Uncertainty Thresholding ibration of object detectors.
D.1. How does AP benefit from low-scoring detec-
In order to compute the image-level uncertainty threshold ū
tions?
and decide whether or not to accept an image, we presented
a way to construct pseudo-OOD val set in Sec. 4 as DVal Here we show that enlarging the detection set with low-
only includes ID images. Here, we discuss the effectiveness scoring detections provably does not decrease AP. Thereby
of this pseudo-set approach. To do so, we also prefer to have confirming the practical limitations previously discussed by
[Precision-Recall curves with precision and recall axes in [0, 1]: (a) compares the interpolated and non-interpolated curves; (b)–(d) compare the PR curves of Ŷ and Ŷ ∪ Ŷ′ for the three cases considered in the proof.]
(a) Interpolating PR Curve (b) Case 1 of the proof (c) Case 2 of the proof (d) Case 3 of the proof

Figure A.12. Illustrations of (a) non-interpolated and interpolated PR curves. Typically, the area under the interpolated PR curve is used
as the AP value in object detection; (b), (c), (d) corresponding to each of the three different cases we consider in the proof of Theorem 1.
Following Theorem 1, in all three cases, the area under the red curve is smaller or equal to that of the blue curve.
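Complementing the formal definition used in the proof, the sketch below computes AP for a single class and a single IoU threshold from the score-sorted TP/FP labels L: precision and recall follow Eq. (A.19), the interpolation follows Eq. (A.20), and AP is the area under the interpolated curve. It is a simplified single-threshold version rather than the full COCO evaluation code.

import numpy as np

def average_precision(labels_sorted, num_gt):
    # labels_sorted: binary TP(1)/FP(0) vector L, sorted by descending confidence.
    # num_gt: number of ground-truth objects M_c of this class.
    L = np.asarray(labels_sorted, dtype=float)
    if L.size == 0 or num_gt == 0:
        return 0.0
    tp_cum = np.cumsum(L)
    precision = tp_cum / np.arange(1, L.size + 1)      # Eq. (A.19)
    recall = tp_cum / num_gt                           # Eq. (A.19)
    # Eq. (A.20): interpolate so that precision is non-increasing in recall.
    interp_precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Area under the interpolated PR curve, extended to recall 0.
    recall = np.concatenate(([0.0], recall))
    interp_precision = np.concatenate(([interp_precision[0]], interp_precision))
    return float(np.sum(np.diff(recall) * interp_precision[1:]))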

Oksuz et al. [58]. As a result, instead of top-k predictions 2. Then, going over this sorted list of detections, the jth
and AP, we require a thresholded detection set in SAOD detection is identified as a TP if there exists a ground
task and employ the LRP Error as a measure of accuracy to truth that satisfies the following two conditions:
enforce this type of output.
Before proceeding, below we provide a formal definition • The ground truth is not previously assigned to
of AP as a basis to our proof. any other detections with a larger confidence
score than that of j,
• The IoU between the ground truth and jth detec-
Definition of AP. AP is defined as the area under the
tion is more than τ , the TP validation threshold.
Precision-Recall curve [15, 53, 58]. Here we formalize how
to obtain this curve and the resulting AP in object detec-
Note that the second condition also implies that the
tion given detections and the ground truths. Considering
jth detection and the ground truth that it matches with
the common practice, we will first focus on the AP of a
should reside in the same image. If there is a single
single class and then discuss further after our proof. More
ground truth satisfying these two conditions, then j is
precisely, computing AP for class c from an IoU threshold
matched with that ground truth; else if there are more
of τ , two sets are required:
than one ground truths ensuring these conditions, then
• A set of detections obtained on the test set: This set
the jth detection is matched with the ground truth that
is represented by tuples Ŷ = {b̂i , p̂i , Xi }Nc , where b̂i
j has the largest IoU with.
and p̂i are the bounding box and confidence score of
the ith detection respectively. Xi is the image id that
the ith detection resides and Nc is the number of all detections across the dataset from class c. We assume that the number of detections obtained from a single image is less than k, where k represents the upper bound within the context of top-k survival (Sec. 2), that is, there can be up to k detections from each image.

• A set of ground truths of the test set: This set is represented by tuples Y = {bi, Xi}, i = 1, ..., Mc, where bi is the bounding box of the ground truth and Xi is similarly the image id. Mc is the number of total ground truth objects from class c across the dataset.

Then, the detections are matched with the ground truths to identify TP and FP detections using a matching algorithm. For the sake of completeness, we provide the matching algorithm that is used by the commonly-used COCO benchmark [43]:

1. All detections in Ŷ are first sorted with respect to the confidence score in descending order.

2. Following this order, each detection is matched with the not-yet-matched ground truth that it has the highest IoU with, provided that this IoU is at least the TP validation threshold τ; such matched detections are identified as TPs.

3. Upon completing this sorted list, the detections that are not matched with any ground truths are identified as FPs.

This matching procedure enables us to determine which detections are TP or FP. Now, let L = [L1, ..., LNc] be a binary vector that represents whether the ith detection is a TP or an FP, and assume that L is also sorted with respect to the confidence scores of the detections. Specifically, Li = 1 if the ith detection is a TP, and Li = 0 if it is an FP. Consequently, we need precision and recall pairs in order to obtain the Precision-Recall curve, the area under which corresponds to the AP. Noting that the precision is the ratio between the number of TPs and the number of all detections, and that the recall is the ratio between the number of TPs and the number of ground truths, we can obtain these pairs by leveraging L. Denoting the precision and recall vectors by Pr = [Pr_1, ..., Pr_Nc] and Re = [Re_1, ..., Re_Nc] respectively, the ith elements of these vectors can be obtained by:

Pr_i = (Σ_{k=1}^{i} L_k) / i,  and  Re_i = (Σ_{k=1}^{i} L_k) / Mc.    (A.19)

Since these obtained precision values Pr_i may not be a monotonically decreasing function of recall, there can be wiggles in the Precision-Recall curve. Therefore, it is common in object detection [15, 18, 43] to interpolate the precisions Pr to make them monotonically decreasing with respect to the recall Re. Denoting the interpolated precision vector by P̄r = [P̄r_1, ..., P̄r_Nc], its ith element P̄r_i is obtained as follows:

P̄r_i = max_{k: Re_k ≥ Re_i} (Pr_k).    (A.20)

Finally, Eq. (A.20) also allows us to interpolate the PR curve to the precision and recall axes. Namely, we include the pairs (i) P̄r_1 with recall 0; and (ii) precision 0 with recall Re_Nc. This allows us to obtain the final Precision-Recall curve using these two additional points as well as the vectors P̄r_i and Re_i. Then, the area under this curve corresponds to the Average Precision of the detection set Ŷ for the IoU validation threshold of τ, which we denote as APτ(Ŷ). As an example, Fig. A.12(a) illustrates a PR curve before and after interpolation. Based on this definition, we now prove that low-scoring detections do not harm AP.
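Before stating the result, it may help to see Eqs. (A.19) and (A.20) operationally. The snippet below is a minimal single-class AP computation; the function name, the numpy-based implementation and the handling of degenerate inputs are our own choices rather than the official COCO evaluation code.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Interpolated AP for one class, following Eqs. (A.19)-(A.20).

    scores: confidence scores of the detections of this class (any order).
    is_tp:  binary labels L_i (1 for TP, 0 for FP), aligned with `scores`.
    num_gt: number of ground-truth objects M_c of this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))   # descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)                                  # TPs among the top-i detections
    precision = cum_tp / np.arange(1, len(tp) + 1)          # Pr_i of Eq. (A.19)
    recall = cum_tp / max(num_gt, 1)                        # Re_i of Eq. (A.19)

    # Interpolated precision of Eq. (A.20): max precision at recall >= Re_i.
    interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Extend the curve to recall 0 (the extension to precision 0 adds no area).
    recall = np.concatenate(([0.0], recall))
    interp = np.concatenate(([interp[0] if len(interp) else 0.0], interp))

    # Area under the interpolated precision-recall curve.
    return float(np.sum(np.diff(recall) * interp[1:]))

# Example: average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=2) ~= 0.833
```

Appending detections whose scores are lower than every existing score only appends entries to the end of the sorted list used above, which is exactly the situation analysed in Theorem 1 below.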
Theorem 1. Given two sets of detections Ŷ = {b̂i, p̂i, Xi}, i = 1, ..., Nc, and Ŷ′ = {b̂j, p̂j, Xj}, j = 1, ..., N′c, and denoting pmin = min_{(b̂i,p̂i,Xi)∈Ŷ} p̂i and pmax = max_{(b̂j,p̂j,Xj)∈Ŷ′} p̂j, if pmax < pmin, then APτ(Ŷ) ≤ APτ(Ŷ ∪ Ŷ′).

Proof. We denote the recall and precision values used to compute APτ(Ŷ) by Pr = [Pr_1, ..., Pr_Nc] and Re = [Re_1, ..., Re_Nc], and similarly the interpolated precision is P̄r = [P̄r_1, ..., P̄r_Nc]. We aim to obtain these vectors for APτ(Ŷ ∪ Ŷ′) to be able to compare the resulting APτ(Ŷ ∪ Ŷ′) with APτ(Ŷ). To do so, we introduce Pr′, Re′ and P̄r′ as the precision, recall and interpolated precision vectors of APτ(Ŷ ∪ Ŷ′) respectively.

By definition, the numbers of elements in Pr′, Re′ and P̄r′ are equal to the number of detections in Ŷ ∪ Ŷ′, which is simply Nc + N′c. More precisely, we need to determine the following three vectors to be able to obtain APτ(Ŷ ∪ Ŷ′):

Pr′ = {Pr′_1, ..., Pr′_Nc, Pr′_{Nc+1}, ..., Pr′_{Nc+N′c}},    (A.21)
Re′ = {Re′_1, ..., Re′_Nc, Re′_{Nc+1}, ..., Re′_{Nc+N′c}},    (A.22)
P̄r′ = {P̄r′_1, ..., P̄r′_Nc, P̄r′_{Nc+1}, ..., P̄r′_{Nc+N′c}}.    (A.23)

As an additional insight into these three vectors, pmax < pmin implies the following:

• The first Nc elements of Pr′, Re′ and P̄r′ account for the precision, recall and interpolated precision values computed on the detections from Ŷ; and

• their elements between Nc + 1 and the last element (Nc + N′c) correspond to the precision, recall and interpolated precision values computed on the detections from Ŷ′.

Note that, by definition, computing precision and recall on the ith detection only considers the detections with higher scores than that of i (and ignores the ones with lower scores than that of i), since the list of labels denoted by L in Eq. (A.19) is sorted with respect to the confidence scores. As a result, the following holds for the precision and recall values (but not the interpolated precision):

Pr′_i = Pr_i, and Re′_i = Re_i for i ≤ Nc.    (A.24)

Then, the difference between APτ(Ŷ) and APτ(Ŷ ∪ Ŷ′) depends on two aspects:

1. Pr′_i and Re′_i for Nc < i ≤ Nc + N′c; and

2. the interpolated precision vector P̄r′ of Ŷ ∪ Ŷ′, to be obtained using Pr′ and Re′ based on Eq. (A.20).

For the rest of the proof, we enumerate all three possible cases for Ŷ′ and identify these aspects.

Case (1): Ŷ′ does not include any TP. This case implies that the detections in Ŷ′ are all FPs, and that neither the number of TPs nor the number of FNs changes for Nc < i ≤ Nc + N′c, implying:

Re′_i = Re_Nc, for Nc < i ≤ Nc + N′c.    (A.25)

As for the precision, it is monotonically decreasing as i increases for Nc < i ≤ Nc + N′c since the number of FPs increases, that is,

Pr′_{i−1} > Pr′_i, for Nc < i ≤ Nc + N′c.    (A.26)

Having identified Pr′_i and Re′_i for Nc < i ≤ Nc + N′c, we now obtain the interpolated precision P̄r′. To do so, we consider P̄r′ in two parts: up to and including its Nc-th element, and its remaining part. Since Pr′_Nc > Pr′_i for Nc < i ≤ Nc + N′c, the low-scoring detections in Ŷ′ do not affect P̄r′_i for i ≤ Nc considering Eq. (A.20), implying:

P̄r′_i = P̄r_i, for i ≤ Nc.    (A.27)

As for Nc < i ≤ Nc + N′c, since Re′_i = Re_Nc, it holds that P̄r′_i = P̄r_Nc.

As a result, the detections from Ŷ′ will all have equal recall and interpolated precision, which are also equal to Re_Nc and P̄r_Nc; implying that they do not introduce new points to the Precision-Recall curve used to obtain APτ(Ŷ). Therefore, APτ(Ŷ) = APτ(Ŷ ∪ Ŷ′) in this case.

Fig. A.12(b) illustrates this case to provide more insight. In particular, when there is no TP among the low-scoring detections (Ŷ′), no new points are introduced compared to the PR curve of Ŷ, and the resulting AP after including low-scoring detections does not change.

Case (2): Ŷ′ includes TPs and max_{Nc<i≤Nc+N′c}(Pr′_i) ≤ min_{i≤Nc}(P̄r_i). This condition implies that the interpolated precisions computed on the detection set Ŷ (P̄r_i for i ≤ Nc) will not be affected by the detections in Ŷ′. As a result, Eq. (A.24) can simply be extended to the interpolated precisions:

P̄r′_i = P̄r_i for i ≤ Nc.    (A.28)

Considering the area under the curve of the pairs P̄r′_i and Re′_i = Re_i for i ≤ Nc, it is already guaranteed that APτ(Ŷ) ≤ APτ(Ŷ ∪ Ŷ′), completing the proof for this case.

To provide more insight, we also briefly explore the effect of the remaining detections, that is, the detections for Nc < i ≤ Nc + N′c, which include TPs. Assume that the jth detection is the TP with the highest confidence score among the detections for Nc < i ≤ Nc + N′c. Then, for the jth detection, 0 < P̄r′_j < P̄r_Nc, as max_{Nc<i≤Nc+N′c}(Pr′_i) ≤ min_{i≤Nc}(P̄r_i) by definition. Moreover, since the number of TPs increases and the number of ground truths is fixed, Re′_j > Re_Nc. This implies that the PR curve now has precision P̄r′_j > 0 for some recall Re′_j. Note that the precision was implicitly 0 at Re′_j for the detection set Ŷ, since this new ground truth could not be retrieved regardless of the number of predictions. Accordingly, the area under the PR curve of Ŷ ∪ Ŷ′ is larger than that of Ŷ, and it is guaranteed that APτ(Ŷ) < APτ(Ŷ ∪ Ŷ′) in this case. As depicted in Fig. A.12(c), the area under the PR curve of Ŷ is extended towards higher recall (compare the blue curve with the red one), resulting in a larger APτ(Ŷ ∪ Ŷ′) compared to APτ(Ŷ).

Case (3): Ŷ′ includes TPs and max_{Nc<i≤Nc+N′c}(Pr′_i) > min_{i≤Nc}(P̄r_i). Unlike Case (2), this case implies that upon merging Ŷ and Ŷ′, some of the P̄r_i of Ŷ with Pr′_j > P̄r_i will be replaced by a larger value due to Eq. (A.20), i.e., P̄r′_i > P̄r_i for some i, while the rest will be equal, similar to Case (2). This simply implies that APτ(Ŷ) < APτ(Ŷ ∪ Ŷ′).

Fig. A.12(d) includes an illustration for this case, demonstrating that the PR curve of Ŷ is extended along both axes: (i) owing to the interpolation thanks to a TP in Ŷ′ with higher precision in Ŷ ∪ Ŷ′, it is extended along the precision axis; and (ii) thanks to a new TP in Ŷ′, it is extended along the recall axis. Note that in our proof for this case we only discussed the extension in precision, since either of the extensions is sufficient to show APτ(Ŷ) < APτ(Ŷ ∪ Ŷ′).

Discussion. Theorem 1 can also be extended to COCO-style AP. To be more specific and to revisit the definition of COCO-style AP, first the class-wise COCO-style APs are obtained by averaging the APs computed over τ ∈ {0.50, 0.55, ..., 0.95} for a single class. Then, the detector COCO-style AP is again the average of the class-wise APs. Considering that the arithmetic mean is a monotonically increasing function, Theorem 1 also applies to the class-wise COCO-style AP and to the detector COCO-style AP. More precisely, if Case (1) applies for some (or all) of the classes and the detections for the remaining classes stay the same, then following Case (1), COCO-style AP does not change. That is also the reason why we do not observe a change in COCO-style AP in Fig. 5(a) once we add dummy detections, which are basically FPs with lower scores. If Case (2) or (3) applies for at least a single class, then COCO-style AP increases, considering the monotonically increasing nature of the arithmetic average. Following from this, we observe some decrease in COCO-style AP when we remove detections in Fig. 5(b), where thresholding also removes some TPs. As a result, we conclude that AP, including COCO-style AP, encourages keeping detections with lower scores.

D.2. Sensitivity of LaECE to TP validation threshold

Here we analyse the sensitivity of LaECE to the TP validation threshold τ. Please note that we normally obtain class-wise LRP-optimal thresholds v̄ considering a specific τ on DVal, and then use the resulting detections while measuring the LRP Error and LaECE on the test set using the same IoU validation threshold τ. Namely, we use τ for two purposes: (i) to obtain the thresholds v̄; and (ii) to evaluate the resulting detections in terms of LaECE and LRP Error. As we aim to understand how LaECE, as a performance measure, behaves under different specifications of the TP validation threshold τ, we decouple these two purposes of τ by fixing the detection confidence threshold v̄ to the value obtained from a TP validation threshold of 0.10. This enables us to fix the detection set given as input to LaECE and to focus only on how the performance measures behave when only τ changes.

Specifically, we use the F-RCNN detector, validate v̄ on the COCO validation set, and obtain the detections on the Obj45K test set using v̄. Then, given this set of detections, for different values of τ ∈ [0, 1], we compute:
(a) Sensitivity of LaECE to τ (b) Sensitivity of LRP Error to τ

Figure A.13. Sensitivity analysis of LaECE and LRP Error. We use the detections of F-RCNN on our Obj45K split. (a) For both the calibrated and uncalibrated cases, we observe that LaECE is not sensitive for τ ∈ [0.0, 0.5]. When τ gets larger, the misalignment between detection scores and performance increases for the uncalibrated case, while calibration becomes an easier problem since most of the detections are now FPs. In the extreme case that τ = 1 (a perfect localisation is required for a TP), there is no TP and it is sufficient to assign a confidence score of 0.00 to all of the detections to obtain 0 LaECE. (b) Sensitivity analysis of LRP Error. As also previously analysed [58], when τ increases, the number of FPs increases and the LRP Error increases. In the extreme case when τ ≈ 1, LRP approaches 1.

• LaECE for uncalibrated detection confidence scores;

• LaECE for detection confidence scores calibrated using linear regression (LR); and

• LRP Error.

Fig. A.13 demonstrates how these three quantities change for τ ∈ [0, 1]. Note that for both the uncalibrated and calibrated cases, LaECE is not sensitive for τ ∈ [0.0, 0.5]. As for τ ∈ [0.5, 1.0], LaECE increases for the uncalibrated case, due to the fact that the detection task becomes more challenging once a larger TP validation threshold τ is required, and that the uncalibrated detections imply more over-confidence as τ increases. Conversely, in this regime the calibration task becomes easier, as most of the detections are now FPs. As an insight, please consider the extreme case of τ = 1, in which a perfect localisation is required for a TP. In this case there is no TP, and it is sufficient for a calibrator to assign a confidence score of 0.00 to all detections to achieve a perfect LaECE of 0. Finally, as also analysed before [58], when τ increases, the detection task becomes more challenging, and therefore the LRP Error, as a lower-better measure of accuracy, also increases. This is because the number of TPs decreases and the number of FPs increases as τ increases.

While choosing the TP validation threshold τ for our SAOD framework, we first consider that a proper τ should decompose the false positive and localisation errors properly. Looking at the literature, the general consensus among object detection analysis tools [1, 26] is to split the false positive and localisation errors by employing an IoU of 0.10. As a result, following these works, we set τ = 0.10 throughout the paper unless otherwise noted. Still, the TP validation threshold τ should be chosen according to the requirements of the specific application.

D.3. Derivation of Eq. (5)

In Sec. 5.3, we claim that the LaECE contribution of a bin reduces to:

| Σ_{b̂_i ∈ D̂_j^c, ψ(i)>0} ( t_i^cal − IoU(b̂_i, b_ψ(i)) ) + Σ_{b̂_i ∈ D̂_j^c, ψ(i)≤0} t_i^cal |,    (A.29)

which allows us to set the target t_i^cal as:

t_i^cal = IoU(b̂_i, b_ψ(i)), if ψ(i) > 0 (i is a true positive); and t_i^cal = 0, otherwise (i is a false positive).    (A.30)

In this section, we derive (A.29) from the definition of LaECE in Eq. (4) to justify our claim. For brevity, in the remainder of this subsection every sum over k runs over the detections b̂_k ∈ D̂_j^c; the constraint ψ(k) > 0 restricts it to the TPs and ψ(k) ≤ 0 to the FPs.

To start with, Eq. (4) defines LaECE for class c as

LaECE^c = Σ_{j=1}^{J} (|D̂_j^c| / |D̂^c|) · | p̄_j^c − precision^c(j) × ĪoU^c(j) |,    (A.31)

which can be expressed as

Σ_{j=1}^{J} (|D̂_j^c| / |D̂^c|) · | p̄_j − ( Σ_{ψ(k)>0} 1 / |D̂_j^c| ) × ( Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) / Σ_{ψ(k)>0} 1 ) |,    (A.32)

as

precision^c(j) = Σ_{ψ(k)>0} 1 / |D̂_j^c|,    (A.33)

and

ĪoU^c(j) = Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) / Σ_{ψ(k)>0} 1.    (A.34)

The expression Σ_{ψ(k)>0} 1 (in the numerator of precision^c(j) and the denominator of ĪoU^c(j)) corresponds to the number of TPs. Cancelling out these terms yields

Σ_{j=1}^{J} (|D̂_j^c| / |D̂^c|) · | p̄_j − Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) / |D̂_j^c| |.    (A.35)

p̄_j, the average of the confidence scores in bin j, can similarly be obtained as

p̄_j = Σ_{k} p̂_k / |D̂_j^c|,    (A.36)

and replacing p̄_j in Eq. (A.35) yields

Σ_{j=1}^{J} (|D̂_j^c| / |D̂^c|) · | Σ_{k} p̂_k / |D̂_j^c| − Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) / |D̂_j^c| |.    (A.37)

Since a|x| = |ax| if a ≥ 0, we take |D̂_j^c| / |D̂^c| inside the absolute value, where the |D̂_j^c| terms cancel out:

Σ_{j=1}^{J} | Σ_{k} p̂_k / |D̂^c| − Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) / |D̂^c| |.    (A.38)

Splitting Σ_{k} p̂_k into true positives and false positives as Σ_{ψ(k)>0} p̂_k and Σ_{ψ(k)≤0} p̂_k respectively, we have

Σ_{j=1}^{J} | ( Σ_{ψ(k)>0} p̂_k + Σ_{ψ(k)≤0} p̂_k − Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) ) / |D̂^c| |.    (A.39)

Considering that Eq. (A.39) is minimized when the error for each bin j is minimized to 0, we now focus on a single bin j. Note also that for each bin j, |D̂^c| is a constant. As a result, minimizing the following expression minimizes the error for each bin, and thus LaECE:

| Σ_{ψ(k)>0} p̂_k + Σ_{ψ(k)≤0} p̂_k − Σ_{ψ(k)>0} IoU(b̂_k, b_ψ(k)) |.    (A.40)

By rearranging the terms, we have

| Σ_{ψ(k)>0} ( p̂_k − IoU(b̂_k, b_ψ(k)) ) + Σ_{ψ(k)≤0} p̂_k |,    (A.41)

which reduces to Eq. (5) once p̂_k is replaced by t_k^cal. This concludes the derivation and validates how we construct the targets t_k^cal while obtaining the pairs to train the calibrator.

Table A.13. Dummy detections decrease LaECE superficially with no effect on AP due to top-k survival. The LRP Error penalizes dummy detections and requires the detections to be thresholded properly. COCO val set is used.

Detector   Dummy det.   det/img.   LaECE ↓   AP ↑   LRP ↓
F-RCNN     None         33.9       15.1      39.9   86.5
           up to 100    100        3.9       39.9   96.8
           up to 300    300        1.4       39.9   98.8
           up to 500    500        0.9       39.9   99.2
ATSS       None         86.4       7.7       42.8   95.1
           up to 100    100        6.0       42.8   96.2
           up to 300    300        1.8       42.8   98.9
           up to 500    500        1.1       42.8   99.3

D.4. More Examples of Reliability Diagrams

We provide more examples of reliability diagrams in Fig. A.14 for F-RCNN and ATSS on SAOD-AV. To provide insight into the error on the set that the calibrator is trained with, Fig. A.14(a-c) show the reliability diagrams on the val set, i.e., the split used to train the calibrator. On the val set, we observe that the isotonic regression method results in an LaECE of 0.0, thereby overfitting to the training data (Fig. A.14(c)). On the other hand, the linear regression method ends up with a training LaECE of 5.0 (Fig. A.14(b)). Consequently, we observe that linear regression performs slightly better than isotonic regression on the BDD45K test split (Fig. A.14(e,f)). Besides, when we compare Fig. A.14(e,f) with Fig. A.14(d), we observe that both isotonic regression and linear regression decrease the over-confidence of the baseline F-RCNN. As a different type of calibration error, ATSS, shown in Fig. A.14(g), is under-confident. Again, linear regression and isotonic regression improve the calibration performance of ATSS. This further validates on SAOD-AV that such post-hoc calibration methods are effective.

D.5. Numerical Values of Fig. 5

Tables A.13 and A.14 present the numerical values used in Fig. 5(a) and Fig. 5(b) respectively. Please refer to Sec. 5.2 for the details of the tables and the discussion.

E. Further Details on SAOD and SAODets

This section provides further details and analyses on the SAOD task and the SAODets.

E.1. Algorithms to Make an Object Detector Self-Aware

In Sec. 6, we summarized how we convert an object detector into a self-aware one.
[Figure A.14 panels: reliability diagrams plotting precision × IoU against confidence, with the percentage of samples per bin. First row, F-RCNN on DVal: (a) uncalibrated, LaECE = 23.5%; (b) calibrated by linear regression, LaECE = 5.0%; (c) calibrated by isotonic regression, LaECE = 0.0%. Second row, F-RCNN on DID (BDD45K): (d) uncalibrated, LaECE = 26.5%; (e) linear regression, LaECE = 9.8%; (f) isotonic regression, LaECE = 10.2%. Third row, ATSS on DID (BDD45K): (g) uncalibrated, LaECE = 16.8%; (h) linear regression, LaECE = 9.0%; (i) isotonic regression, LaECE = 9.7%.]

Figure A.14. (First row) Reliability diagrams of F-RCNN on SAOD-AV DVal, which is the split used to obtain the set on which we train the calibrators. (Second row) Reliability diagrams of F-RCNN on SAOD-AV DID (BDD45K). (Third row) Reliability diagrams of ATSS on SAOD-AV DID (BDD45K). Linear regression and isotonic regression improve the calibration performance of both the over-confident F-RCNN (compare (e) and (f) with (d)) and the under-confident ATSS (compare (h) and (i) with (g)).
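The post-hoc calibrators compared in Fig. A.14 are fit on (confidence, target) pairs, with the targets constructed as in Eq. (A.30). The sketch below illustrates one way to build such pairs, fit the two calibrators and report a binned LaECE; the helper names, the scikit-learn calibrators and the 25-bin binning are our own choices, not the paper's exact implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

def calibration_targets(ious, is_tp):
    """Targets t_i^cal of Eq. (A.30): IoU with the matched ground truth for TPs, 0 for FPs."""
    return np.where(np.asarray(is_tp, dtype=bool), np.asarray(ious, dtype=float), 0.0)

def fit_calibrators(scores, targets):
    """Fit the two post-hoc calibrators compared in Fig. A.14 on (confidence, target) pairs."""
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    lr = LinearRegression().fit(scores, targets)                      # linear regression
    iso = IsotonicRegression(y_min=0.0, y_max=1.0,
                             out_of_bounds="clip").fit(scores.ravel(), targets)
    return lr, iso   # lr.predict(scores) / iso.predict(scores.ravel()) give calibrated scores

def laece(scores, targets, num_bins=25):
    """Binned LaECE in the spirit of Eq. (A.31): weighted |mean confidence - mean target| per bin."""
    scores, targets = np.asarray(scores, float), np.asarray(targets, float)
    bins = np.minimum((scores * num_bins).astype(int), num_bins - 1)
    err = 0.0
    for j in range(num_bins):
        mask = bins == j
        if mask.any():
            err += mask.mean() * abs(scores[mask].mean() - targets[mask].mean())
    return err
```

Note that the per-bin term above equals |p̄_j − precision(j) × ĪoU(j)|, since the mean of the targets over a bin is exactly the product of the bin's precision and its average TP IoU.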

Specifically, to do so, we use mean(top-3) and obtain an uncertainty threshold ū through cross-validation using the pseudo OOD set approach (Sec. 4), obtain the detections through LRP-optimal thresholding (Sec. 5.2), and calibrate the detection scores using linear regression as discussed in Sec. 5.3. Here, we present the following two algorithms to give further details on how we incorporate these features into a conventional object detector:

1. The algorithm to make an object detector self-aware, in Alg. A.1. The aim of Alg. A.1 is to obtain

• the image-level uncertainty threshold ū;
• the detection confidence score thresholds for each class, {v̄c}, c = 1, ..., C;
• the calibrators for each class, {ζc(·)}, c = 1, ..., C; and
• a conventional object detector f(·) into which these features will be incorporated.

To do so, after training the conventional object detector f(·) using DTrain (line 2), we first obtain the image-level uncertainty threshold ū by using our pseudo OOD set approach as described in Sec. 4 (lines 3-7). While doing that, we do not apply detection-level thresholding yet, ensuring that we have at least 3 detections from each image on which we compute the image-level uncertainty using mean(top-3) of the uncertainty scores. While we enforce this by keeping a maximum of 100 detections following AP-based evaluation, we only use the top-3 scoring detections to compute the image-level uncertainty. Then, after cross-validating ū, we cross-validate {v̄c} for each class (line 8), and using only the thresholded detections we train the calibrators (lines 9-10). This procedure allows us to incorporate the necessary features into a conventional object detector, making it self-aware.

2. The inference algorithm of a SAODet, in Alg. A.2. Given a SAODet, the inference on an image X proceeds as follows. We first compute the uncertainty of the detector on image X, denoted by G(X), following the same method as in Alg. A.1, that is, mean(top-3) (lines 2-3). Then, the rejection or acceptance decision is made by comparing G(X) with the cross-validated threshold ū (line 5). If the image is rejected, then no detection is returned (line 7). Otherwise, if the image is accepted, then the confidence of each detection is compared against the cross-validated detection confidence threshold (line 11), enabling us to differentiate between a useful detection and a low-scoring noisy one. If the confidence of a detection is larger than the detection-level threshold, then it is added into the final set of detections with a calibrated confidence score (line 12). Therefore, Alg. A.2 checks whether the detector is able to make a detection on the given image X; if so, it preserves the accurate detections obtained by the object detector by removing the noisy detections, and it also calibrates the detection scores, applying all features of a SAODet during inference.

Table A.14. To avoid the superficial LaECE gain in Table A.13, we adopt the LRP Error, which requires the detections to be thresholded properly. We use LRP-optimal thresholding to obtain class-wise thresholds. Results are from the COCO val set, with 7.3 objects/image on average.

Detector   Threshold   det/img.   LaECE ↓   AP ↑   LRP ↓
F-RCNN     None        33.9       15.1      39.9   86.5
           0.30        11.2       27.5      38.0   67.6
           0.50        7.4        27.6      36.1   62.1
           0.70        5.2        24.5      33.2   61.5
           LRP-opt.    6.1        26.1      34.6   61.1
ATSS       None        86.4       7.7       42.8   95.1
           0.30        5.2        20.2      35.3   60.5
           0.50        2.0        26.6      19.7   78.4
           0.70        0.3        12.3      3.9    96.3
           LRP-opt.    6.0        18.3      36.7   60.2

Algorithm A.1 Making an object detector self-aware
1: procedure MAKINGSELFAWARE(DTrain, DVal)
2:   Train a standard detector f(·) on DTrain
3:   Obtain a pseudo OOD set D−Val by replacing the objects in DVal with zeros (Sec. 4)
4:   Remove the images with no objects from DVal, and denote the resulting set by D+Val
5:   Make inference on D−Val by including the top-100 detections from each image, i.e., D̂−100 = {f(Xi)} for Xi ∈ D−Val
6:   Make inference on D+Val by including the top-100 detections from each image, i.e., D̂+100 = {f(Xi)} for Xi ∈ D+Val
7:   Cross-validate ū, the image-level uncertainty threshold, on D̂+100 and D̂−100 using mean(top-3) of the uncertainty scores, against Balanced Accuracy as the performance measure (Sec. 4)
8:   Cross-validate v̄c, the detection-level threshold of class c, on D̂+100 using LRP-optimal thresholding (Sec. 5.2)
9:   Remove all detections of class c in D̂+100 with score less than v̄c to obtain the thresholded detections D̂thr
10:  Using D̂thr, train a linear regression calibrator ζc(·) for each class c (Sec. 5.3)
11:  return f(·), ū, {v̄c}, {ζc(·)}
12: end procedure

Algorithm A.2 The inference algorithm of a SAODet given an image X (please also refer to Alg. A.1 for the notation)
1: procedure INFERENCE(f(·), ū, {v̄c}, {ζc(·)}, X)
2:   D̂100 = {ĉi, b̂i, p̂i} = f(X) such that N ≤ 100
3:   Estimate G(X), the image-level uncertainty of f(·) on X, using mean(top-3) of the uncertainty scores (Sec. 4)
4:   Initialize the thresholded detection set D̂thr = Ø
5:   if G(X) > ū then
6:     â = 0
7:     return {â, D̂thr}   // REJECT X, with D̂thr being Ø
8:   else
9:     â = 1
10:    for each detection {ĉi, b̂i, p̂i} ∈ D̂100 do
11:      if p̂i ≥ v̄ĉi then   (ĉi being the predicted class)
12:        D̂thr = D̂thr ∪ {ĉi, b̂i, ζĉi(p̂i)}
13:      end if
14:    end for
15:    return {â, D̂thr}   // ACCEPT X
16:  end if
17: end procedure
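For readers who prefer code to pseudocode, the following is a compact sketch of the accept/reject and thresholding flow of Alg. A.2. The function signature, the use of 1 − score as the detection-level uncertainty, and the dictionary-based outputs are our assumptions rather than the exact implementation.

```python
import numpy as np

def saodet_inference(detector, x, u_bar, v_bar, calibrators, max_dets=100):
    """Sketch of Alg. A.2: reject the image, or return thresholded, calibrated detections.

    detector:    callable returning (classes, boxes, scores) for an image.
    u_bar:       image-level uncertainty threshold (cross-validated as in Alg. A.1).
    v_bar:       dict mapping class -> detection-level confidence threshold.
    calibrators: dict mapping class -> function turning a raw score into a calibrated one.
    """
    classes, boxes, scores = detector(x)
    classes, boxes = classes[:max_dets], boxes[:max_dets]
    scores = np.asarray(scores, dtype=float)[:max_dets]

    # Image-level uncertainty G(X): mean uncertainty of the top-3 scoring detections.
    top3 = np.sort(scores)[-3:] if len(scores) else np.zeros(3)
    g_x = float(np.mean(1.0 - top3))

    if g_x > u_bar:                          # Alg. A.2, line 5: reject the image
        return {"accept": False, "detections": []}

    detections = []                          # lines 10-14: keep and calibrate useful detections
    for c, b, p in zip(classes, boxes, scores):
        if p >= v_bar[c]:                    # class-wise (e.g., LRP-optimal) threshold
            detections.append((c, b, float(calibrators[c](p))))
    return {"accept": True, "detections": detections}
```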
the image-level uncertainty using mean(top-3) of the
100 100
detections. Third, accepting more images results in bet-

Performance or Error
80
Performance or Error

80 DAQ ter LRP values (green and purple curves in Fig. A.15(a))
BA
60 60 LRP as otherwise the ID images are rejected with the detection
40 LaECE
40 LRPT sets being empty. Similarly, to achieve a high LRP, setting
20 20 LaECET the detection confidence threshold properly is important as
0 0 well. This is because, a small confidence score threshold
0 20 40 60 80 100 0 20 40 60 80 100
Image-level threshold (TPR of val set in %) Detection-level threshold (pi in %) implies more FPs, conversely a large threshold can induce
(a) Image-level uncertainty thr. (b) Detection confidence thr. more FNs. Finally, while we don’t observe a significant ef-
fect of the uncertainty threshold on LaECE, a large number
Figure A.15. The effect of image- and detection-level thresholds. of detections due to a smaller detection confidence thresh-
DAQ (blue curve) decreases significantly for extreme cases such old has generally a lower LaECE.. This is also related to our
as when all images are rejected or all detections are accepted; im-
previous analysis in Sec. 5.2, which we show that more de-
plying its robustness to such cases. Here, for the sake of analysis
simplicity, we use a single confidence score threshold (v̄) obtained
tections imply a lower LaECE as depicted in Fig. A.15(b)
on the final detection scores p̂i in (b) instead of class-wise ap- when the threshold approaches 0. However, in that case,
proach that we used while building SAODets. LRP Error goes to 1, and as a result, DAQ significantly de-
creases; thereby preventing setting the threshold to a lower
value to superficially boost the overall performance.
Table A.15. Effect of common improvements (epochs (Ep.),
Multi-scale (MS) training, stronger backbones) on F-RCNN E.3. Effect of common improvement strategies on
(SAOD-Gen).
DAQ
Ep. MS Backbone DAQ BA mECE LRP mECET LRPT AP Here, we analyse how common improvement strategies
12 ✗ R50 38.5 88.0 16.4 76.6 16.8 85.0 24.8 of object detectors affect the DAQ in comparison with
36 ✗ R50 38.4 87.4 18.7 75.9 20.5 85.0 25.5 AP. To do so, we first use a simple but a common base-
36 ✓ R50 39.7 87.7 17.3 74.9 18.1 84.4 27.0
line model: We use F-RCNN (SAOD-Gen) trained for 12
36 ✓ R101 42.0 88.1 17.5 73.4 19.0 82.8 28.7
epochs without multi-scale training. Then, gradually, we
36 ✓ R101-DCN 45.9 87.4 17.3 70.8 19.4 79.7 31.8
include the following four common improvement strategies
commonly used for object detection [6, 55, 77]:
E.2. Sensitivity of the SAOD Performance Measures 1. increasing number of training epochs,
to the Image-level Uncertainty Threshold and
Detection Confidence Threshold 2. using multiscale training as described in Sec. B,

Here, we explore the sensitivity of the performance mea- 3. using ResNet-101 as a stronger backbone [22], and
sures used in our SAOD framework to the image-level un-
certainty threshold û and the detection confidence threshold 4. using deformable convolutions [78].
v̄. To do so, we measure DAQ, BA, LRP and LaECE of
F-RCNN on DTest of SAOD-Gen by systematically vary- Table A.15 shows the effect of these improvement strate-
ing (i) image-level uncertainty threshold ū ∈ [0, 1] and (ii) gies, where we see that stronger backbones increase DAQ,
detection-level confidence score threshold v̄ ∈ [0, 1]. Note but mainly due to an improvement in LRP Error. It is also
that in this analysis, we do not use LRP-optimal threshold worth highlighting that more training epochs improves AP
for detection-level thresholding, which obtains v̄ for each (e.g. going from 12 to 36 improves AP from 24.8 to 25.5),
class but instead employ a single threshold for all classes; but not DAQ due to a degradation in LaECE. This is some-
enabling us to change this threshold easily. Fig. A.15 what expected, as longer training improves accuracy, but
shows how there performance measures change for differ- drastically make the models over-confident [47].
ent image-level and detection-level thresholds. First, we
E.4. The Impact of Domain-shift on Detection-level
observe that it is crucial to set both thresholds properly to
Confidence Score Thresholding
achieve a high DAQ. More specifically, rejecting all images
or accepting all detections v̄ = 0 in Fig. A.15 results in a For detection-level confidence score thresholding, we
very low DAQ, highlighting the robustness of DAQ in these employ LRP-optimal thresholds by cross-validating a
extreme cases. Second, setting a proper uncertainty thresh- threshold v̄ for each class using DVal against the LRP Er-
old is also important for a high BA (Fig. A.15(a)), while it ror. While LRP-optimal thresholds are shown to be useful
is not affected by detection-level threshold (Fig. A.15(b)) if the test set follows the same distibution of DVal , we note
since BA indicates the OOD detection performance but not that our DID is collected from a different dataset, introduc-
related to the accuracy or calibration performance of the ing domain shift as discussed in App. A. As a result, here
1.0 1.0
nuImages val
BDD45K
0.8
LRP-optimal Thresholds

0.8

LRP-optimal Thresholds
0.6 0.6

0.4 0.4

0.2 COCO val 0.2


Obj45K
0.0
0.0

(a) F-RCNN (SAOD-Gen) (b) F-RCNN (SAOD-AV)
1.0 1.0
COCO val nuImages val
Obj45K BDD45K
0.8
LRP-optimal Thresholds

0.8

LRP-optimal Thresholds
0.6 0.6

0.4 0.4

0.2 0.2

0.0
0.0
(c) ATSS (SAOD-Gen) (d) ATSS (SAOD-AV)

Figure A.16. Comparison of (i) LRP-optimal thresholds obtained on DVal as presented and used in the paper (blue lines); and (ii) LRP-
optimal thresholds obtained on DID as oracle thresholds (red lines). Owing to the domain shift between DVal and DID , the optimal
thresholds do not match exactly. The thresholds between DVal and DID are relatively more similar for SAOD-Gen compared to SAOD-AV.
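As a concrete illustration of how such class-wise LRP-optimal thresholds can be obtained, the sketch below sweeps candidate confidence thresholds and keeps the one minimising a single-class LRP Error in the style of [58]. The helper names, the candidate grid and the assumption that each ground truth is matched by at most one detection are ours.

```python
import numpy as np

def lrp_error(ious, scores, num_gt, thr, tau=0.1):
    """Single-class LRP Error of the detections kept above a confidence threshold `thr`.

    ious:   IoU of each detection with its matched ground truth (0 if unmatched).
    scores: detection confidence scores.
    num_gt: number of ground-truth objects of the class.
    tau:    TP validation threshold.
    """
    ious, scores = np.asarray(ious, float), np.asarray(scores, float)
    keep = scores >= thr
    tp = keep & (ious >= tau)
    n_tp, n_fp = int(tp.sum()), int((keep & (ious < tau)).sum())
    n_fn = num_gt - n_tp
    denom = n_tp + n_fp + n_fn
    if denom == 0:
        return 0.0
    loc = np.sum((1.0 - ious[tp]) / (1.0 - tau))   # localisation component over TPs
    return (loc + n_fp + n_fn) / denom

def lrp_optimal_threshold(ious, scores, num_gt, candidates=np.linspace(0.05, 0.95, 19)):
    """Pick the class-wise confidence threshold that minimises the LRP Error on a val split."""
    errs = [lrp_error(ious, scores, num_gt, thr) for thr in candidates]
    return float(candidates[int(np.argmin(errs))])
```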

Table A.16. Evaluating self-aware object detectors. In addition to Tab. 6, this table includes the components of the LRP Error for more
insight. Particularly, LRPLoc , LRPFP , LRPFN correspond to the average 1-IoU of TPs, 1-precision and 1-recall respectively.

Self-aware DOOD vs. DID DID T (DID )


DAQ↑
Detector BA↑ TPR↑ TNR↑ IDQ↑ LaECE↓ LRP↓ LRPLoc ↓ LRPFP ↓ LRPFN ↓ IDQ↑ LaECE↓ LRP↓ LRPLoc ↓ LRPFP ↓ LRPFN ↓

SA-F-RCNN 39.7 87.7 94.7 81.6 38.5 17.3 74.9 20.4 48.5 52.3 26.2 18.1 84.4 21.9 52.2 72.4
SA-RS-RCNN 41.2 88.9 92.8 85.3 39.7 17.1 73.9 19.3 47.8 51.9 27.5 17.8 83.5 20.4 50.8 72.1
Gen

SA-ATSS 41.4 87.8 93.1 83.0 39.7 16.6 74.0 18.5 47.8 52.8 27.8 18.2 83.2 20.2 53.2 71.1
SA-D-DETR 43.5 88.9 90.0 87.8 41.7 16.4 72.3 18.8 45.1 50.7 29.6 17.9 81.9 20.4 49.6 69.4
SA-F-RCNN 43.0 91.0 94.1 88.2 41.5 9.5 73.1 26.3 13.2 58.1 28.8 7.2 83.0 26.7 12.2 74.7
AV

SA-ATSS 44.7 85.8 95.9 77.6 43.5 8.8 71.5 25.9 14.2 55.7 30.8 6.8 81.5 26.0 14.3 72.5

we investigate whether the detection-level confidence score • For both of the settings optimal thresholds computed
threshold is affected from domain shift. on val and test sets rarely match.

To do so, we compute the LRP-optimal thresholds on


both DVal and DTest of F-RCNN and ATSS, and then, we • While the thresholds obtained on DVal (COCO val
compare them in Fig. A.16 for our both datasets. As DTest set) and DID (Obj45K) for SAOD-Gen dataset is rel-
is not available during training, the thresholds obtained on atively similar, they are more different for our SAOD-
DTest correspond to the oracle detection-level confidence AV dataset, which can be especially observed for the
score thresholds. We observe in Fig. A.16 that: bicycle class.
Therefore, due to the domain shift in our datasets, the opti- now has larger confidence scores compared to its conven-
mal threshold diverges from DVal to DID especially for AV- tional version and vice versa for SA-F-RCNN. This can
OD dataset. This is not very surprising due to the challeng- enable the subsequent systems exploiting these confidence
ing nature of BDD100K dataset including images at night scores for decision making to abstract away the difference
and under various weather conditions, which also ensues a of the confidence score distributions of the detectors.
significant accuracy drop for this setting (Tab. A.8).
Furthermore, to see how these changes reflect into Removing Low-scoring Noisy Detections We previ-
the performance measures, we include LRPLoc , LRPFP , ously discussed in Sec. 5.2 that the detections obtained with
LRPFN components of LRP Error that are defined as the top-k survival allows low-scoring noisy detections and that
average over the localisation errors (1-IoU) of TPs, 1- the performance measure AP promotes them (mathematical
precision and 1-recall respectively in Tab. A.16. We would proof in App. D). This is also presented in the images of
normally expect that the precision and the recall errors to conventional object detectors in Fig. A.18. For example,
be balanced once the detections are filtered out using the the output of ATSS includes several low-scoring detections,
LRP-optimal threshold [58]. This is, in fact, what we ob- which the practical applications might hardly benefit from.
serve in LRPFP and LRPFN for the DID of SAOD-Gen On the other hand, the outputs of the SAODets in the same
setting,. For example for F-RCNN, LRPFP = 48.5 and figure are more similar to the objects presented in ground
LRPFN = 52.3; indicating a relatively balanced precision truth images (first column) and barely contain any detection
and recall errors. As for SAOD-AV setting, the significant that may not be useful.
domain shift of BDD45K is also reflected in the difference
between LRPFP and LRPFN for both F-RCNN and ATSS.
Domain Shift Fig. A.19 includes images with the corrup-
For example for F-RCNN, LRPFP = 13.2 and LRPFN =
tions with severities 1, 3 and 5. Following the design of
58.1; indicating a significant gap. Besides, as the domain
the SAOD task, SA-F-RCNN accepts the images with cor-
shift increases with T (DID ) on SAOD-AV, the gap between
ruptions 1 and 3 and provide detections also by calibrating
LRPFP and LRPFN increases more. These suggest that
the detection scores which is similar to DID . However, for
more accurate detection-level thresholding methods are re-
the image with severity 5, different from the conventional
quired under domain-shifted data.
detector, SA-F-CNN rejects the image; implying that the
E.5. Qualitative Results of SAODets in comparison detector is uncertain on the scene.
to Conventional Object Detectors
Failure Cases Finally in Fig. A.20, we provide images
In order provide more insight on the behaviour of
that SA-F-RCNN and SA-ATSS fail to identify the image
SAODets, here we present the inference of SAODets in
from DOOD as OOD, but instead perform inference.
comparison to conventional object detectors. Here, we use
SA-ATSS and SA-F-RCNN on SAOD-AV dataset. To be E.6. Suggestions for Future Work
consistent with the evaluation, we plot all the detection
boxes as they are evaluated: That is, we use top-k survival Our framework provides insights into the various ele-
for F-RCNN and ATSS and thresholded detections for SA- ments needed to build self-aware object detectors. Future
F-RCNN and SA-ATSS. In the following, we discuss the research should pay more attention to each elements inde-
main features of the SAODets using Fig. A.17, Fig. A.18, pendently while keeping in mind that these elements are
Fig. A.19 and Fig. A.20. tightly intertwined and greatly impact the ultimate goal.
One could also try to build a self-aware detector by directly
optimizing DAQ which accounts for all the elements to-
OOD Detection Fig. A.17 shows on three different input
gether, although in its current state it is not differentiable
images from different subsets of DOOD that a conventional
so a proxy loss or a method to differentiate through such
F-RCNN performs detection on OOD images and output
non-differentiable functions would need to be employed.
detections with high confidence. For example F-RCNN de-
tects a vehicle object with 0.84 confidence on the OOD
image from Objects365 (last row). On the other hand, SA-
F-R-CNN can successfully leverage uncertainty estimates
to reject these OOD images.

Calibration In Fig. A.14, we presented that ATSS is


under-confident and F-RCNN is over-confident. Now,
Fig. A.18 shows that the calibration performance of these
models are improved accordingly. Specifically, SA-ATSS
Image/Ground Truth Output of F-RCNN Output of SA-F-RCNN

REJECT

REJECT

REJECT

Figure A.17. Qualitative results of F-RCNN vs. SA-F-RCNN on DOOD. The images in the first, second and third rows correspond to the SVHN, iNaturalist and Objects365 subsets of DOOD. While F-RCNN performs inference with non-empty detection sets, SA-F-RCNN correctly rejects all of these images.
Image/Ground Truth Output of Obj. Det. Output of SAODet

Figure A.18. Qualitative Results of Object detectors and SAODets on DID . (First row) F-RCNN vs. SA-F-RCNN. (Second row) ATSS vs.
SA-ATSS. See text for discussion. The class labels and confidence scores of the detection boxes are visible once zoomed in.

Image/Ground Truth Output of F-RCNN Output of SA-F-RCNN

REJECT

Figure A.19. Qualitative Results of F-RCNN vs. SA-F-RCNN on T (DID ) using SAOD-AV dataset. First to third row includes images
from T (DID ) in severities 1, 3 and 5 as we used in our experiments. The class labels and confidence scores of the detection boxes are
visible once zoomed in. For each detector, we sample a transformation using the ‘frost’ corruption.
Image/Ground Truth Output of Obj. Det. Output of SAODet

Figure A.20. Failure cases of SAODets in comparison to object detector outputs. The first row includes an image from the iNaturalist subset of DOOD with the detections from ATSS and SA-ATSS trained on nuImages following our SAOD-AV dataset. While SA-ATSS removes most of the low-scoring detections, it still classifies the image as ID and performs inference. Similarly, the second row includes an image from the Objects365 subset of DOOD with the detections from F-RCNN and SA-F-RCNN trained on nuImages, again following our SAOD-AV dataset. SA-F-RCNN misclassifies the image as ID and performs inference.
