0% found this document useful (0 votes)

15 views34 pages

The Challenge of Door Detection

The document discusses the development and adaptation of robotic vision, specifically focusing on door detection for mobile service robots operating in human-centric environments. It highlights the challenges of applying deep learning techniques in real-world scenarios and proposes a method that combines photorealistic simulations with domain adaptation to improve detection performance. The findings are validated through extensive experiments and real-world deployments, demonstrating the effectiveness of the proposed approach in enhancing robotic navigation and operational capabilities.

Uploaded by

chriswarden0

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views34 pages

The Challenge of Door Detection

Uploaded by

chriswarden0

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

Development and Adaptation of Robotic Vision in the

Real–World: the Challenge of Door Detection

Michele Antonazzi Matteo Luperto

Department of Computer Science Department of Computer Science
University of Milan University of Milan
[email protected] [email protected]
arXiv:2401.17996v1 [cs.RO] 31 Jan 2024

N. Alberto Borghese
Department of Computer Science
University of Milan
[email protected]

Nicola Basilico
Department of Computer Science
University of Milan
[email protected]

Abstract

Mobile service robots are increasingly prevalent in human–centric, real–world domains, oper-
ating autonomously in unconstrained indoor environments. In such a context, robotic vision
plays a central role in enabling service robots to perceive high–level environmental features
from visual observations. Despite the data–driven approaches based on deep learning push
the boundaries of vision systems, applying these techniques to real–world robotic scenarios
presents unique methodological challenges. Traditional models fail to represent the chal-
lenging perception constraints typical of service robots and must be adapted for the specific
environment where robots finally operate. We propose a method leveraging photorealistic
simulations that balances data quality and acquisition costs for synthesizing visual datasets
from the robot perspective used to train deep architectures. Then, we show the benefits in
qualifying a general detector for the target domain in which the robot is deployed, showing
also the trade–off between the effort for obtaining new examples from such a setting and the
performance gain. In our extensive experimental campaign, we focus on the door detection
task (namely recognizing the presence and the traversability of doorways) that, in dynamic
settings, is useful to infer the topology of the map. Our findings are validated in a real–
world robot deployment, comparing prominent deep–learning models and demonstrating the
effectiveness of our approach in practical settings.

1 Introduction

Mobile service robots represent a flourishing technology increasingly employed across a range of human–
centric, real–world domains such as office and domestic environments. Typically deployed for the long
term, these robots find in robotic vision – the ability to understand semantic knowledge from visual robot
perceptions in real time– a cornerstone to achieving high–level autonomy and operational awareness. Indeed,
one of the key capabilities is the one to detect objects, which can significantly enhance a variety of sub–tasks
across increasing levels of abstraction, including localization, navigation, scene understanding, and planning.

Recent developments in object detection, largely driven by research in deep learning, have unlocked remark-
able possibilities for addressing this real–world AI challenge, facilitating the creation of highly effective vision
modules for mobile robots. However, despite the abundance and diversity of available models, their practical
application in field robotics continues to pose methodological and practical challenges. In this work, we
introduce a method designed to address them in the context of a particularly significant task for service
robots, whose functionalities are strongly based on autonomous navigation: door detection.

Doors are key environmental features for a mobile robot. They represent the topological connections between
adjacent sub–regions of the free space whose traversability, importantly, might not always be possible. More
specifically, our definition of a door is centered on the notion of variable–traversability passage, without
necessarily assuming the presence of typical door components like a leaf, a hinge, or any other related
furniture (this is what in the literature is commonly referred to as “explicit” doors [67]). Door detection is
the capability for a robot to recognize in real–time the location and traversability status (open or closed)
of such passages. This problem is most effectively addressed as an object detection task using RGB images
acquired by the robot, primarily due to the limitations of alternative technologies. For example, laser range
scanners, while excellent for distance measurements, are typically constrained by a 2D field of view. In
the same line, RGB–D cameras often exhibit a limited depth sensing range, making them less reliable over
longer distances or in larger indoor spaces. Furthermore, both types of sensors can struggle with transparent
or highly reflective surfaces, that are often present in doors. Figure 1 illustrates the execution of door
detection: a mobile robot navigates in an environment to perform its tasks; at the same time it acquires
images through its onboard camera. For each image, it infers in real–time the bounding boxes of doors,
distinguishing between open doors (depicted in green) and closed ones (depicted in red), also highlighted on
the map.

Figure 1: Giraff–X [49], the service robot adopted in this work, performing door detection.

In practical deployments, service robots are subject to specific perception constraints and frequently en-
counter challenging recognition instances, such as partially occluded or poorly positioned viewpoints. These
challenges are intrinsic to the domain of service robots, as they are designed to operate in environments
marked by dynamism and clutter. This condition induces critical domain shifts over the state–of–the–art
object detectors, which are typically trained over datasets that largely neglect the noisy, constrained, and
challenging operational conditions that a robot faces on the field. Properly and efficiently adapting these
models for service robots is an open methodological problem, yet one of remarkable practical importance.
Two important facts should be considered about a robot operating in its target environment: on one side,
human–centric indoor environments are indeed complex and difficult to process using computer vision due to
the aforementioned challenges. On the other side, while the position of the objects in the target environment
is highly dynamic, the robot will see the same object instances multiple times. Moreover, different instances
of the same type of object will likely have a similar aspect. For example, while it is highly likely that two
chairs from two different environments look very different, the chairs around the same table are usually of the
same model. The same holds for doors. As a result of this, the robot during its daily activities will observe
multiple times the same set of objects, and sets of objects similar among them. This fact can be used to
improve the perceptual capabilities of the robot in its target environment, compensating for the challenges
of the domain.

In this work, we demonstrate how the integration of simulation frameworks and domain adaptation techniques
can create an effective development pipeline for constructing door detectors specifically tailored for mobile
service robots. Specifically, the contributions we provide are the following.

• We analyze the trade–offs involved in using simulations to train a door detector capable of generaliz-
ing effectively across various environments. We delineate the desiderata of this process and propose
a framework based on simulation performed in 3D real–world models, Gibson [68], that achieves a
balance between data photorealism and acquisition costs.
• We explore fine–tuning to qualify detectors to a robot’s target environment, demonstrating how
leveraging typical operational settings of service robots can enhance performance in challenging
instances.
• We argue that performance metrics used to evaluate computer–vision models are not well suited to
be used in a robotic context, and we propose new evaluation metrics specifically designed to better
reflect our setting.
• We conduct an extensive experimental campaign with three prominent deep–learning models for
object detection, providing insights into how domain shifts in our scenario are influenced by the
nature of the training datasets.
• We assess them on both general and environment–specific door detectors, identifying the most
effective training setups. We evaluate the trade–off between training efforts and performance im-
provements, validating our findings in an actual deployment on a service robot.
• We evaluate the impact of different performance of a door detection method on a downstream task,
that is to evaluate the current traversability status of the whole environment.

In the next section, we briefly survey the related works relevant to this study. Section 3 motivates and outlines
our methods while the remainder of the paper is devoted to experimental analysis. Section 4 extensively
analyses the performance of our general and qualified detectors while Section 5 focuses on the evaluation in
a real robot deployment. Concluding remarks are provided in Section 6.

The contributions of this paper build upon and extend our earlier work presented in [2], where we have
illustrated a preliminary version of the findings presented here. In this work, we significantly extend the
experimental evaluation also by assessing multiple models for object detection.

2 State of the Art

Mobile service robots are a cutting–edge technology that is increasingly being adopted in a variety of real–
world scenarios. These robots are typically employed to assist humans in various tasks, often unfolding in
indoor industrial, or domestic environments [41]. Among recent and representative application domains are
healthcare, where robots assist patients and caregivers [30]; logistics, where they carry out repetitive tasks
like item deliveries or environmental monitoring [24]; at–home caregiving, where they aid in day–by–day
activities like cleaning or provide additional services such as personal assistance, wellbeing monitoring, and
social entertainment [19, 49].

Although the adoption of mobile service robots is increasing, their deployment in indoor human–centric
environments such as houses, offices, hospitals, and schools, continues to present substantial research chal-
lenges. Unlike industrial settings, where there is a higher degree of control and predictability, the lived–in
setup is typically unstructured and dynamic. This complexity arises both from the physical layout of such
environments, that is how rooms, walls, and furniture are disposed, and by their visual aspect, which is
complex and semantic–rich; consequently, two different environments of the same type can have very dif-
ferent features. As examples of recent research works show, in human–centric environments, tasks such as
user–understanding [31], activity recognition [18], and even path planning [38] face additional challenges.

To properly work inside these complex settings, a robot should be able to understand relevant proper-
ties through its vision. In this domain, one of the fundamental problems is the one of real–time object
detection. Indeed, such functionality is key for service robots, enhancing their capabilities and enabling
autonomous behavior in various situations [1]. Object detection is typically performed from images acquired
with RGB cameras, and the recent advancements in deep learning applied to computer vision [8] have led
to the development of high–performance pre–trained models, which can be exploited to obtain robotic vision
modules running locally on mobile robots. These models, among which prominent examples are YOLO [33],
DETR [6], and Faster R–CNN [56], offer significant advantages in object detection tasks for service robots,
but their field deployment poses a set of engineering and methodological challenges, especially in unstruc-
tured environments. Our work focuses on tackling common problems of object detection in human–centric
environments by assessing a specific and widely relevant case study, door detection.

The ability of mobile robots to detect doors is key for indoor operations. The traversability status of a door
(open or closed) enables the availability of passageways between sub–areas of an environment. In turn,
traversability directly determines the environment’s topology, ultimately affecting the robot’s navigational
routes and accessibility of the areas therein. The location of doors is important for tasks such as room
segmentation [4], which entails partitioning the map of an environment into semantically meaningful areas
or rooms. This knowledge is also beneficial in predicting the layout of rooms not yet observed during
exploration [48] or in identifying temporary unreachable locations during mapping. Furthermore, it plays a
key role in place categorization, a process where rooms on the occupancy map are assigned semantic labels
(like “corridor” or “office”) based on their visual appearance [21, 61]. Recent studies have shown that a
robot’s ability to recognize doors can significantly enhance its navigation capabilities in long–term scenarios.
For instance, the work presented in [37] models the periodic changes in dynamic environments over extended
periods. Similarly, the study in [52] introduces a navigation system designed for robots functioning in indoor
environments for long periods, particularly where the traversability of the area varies over time.

The use of object detection methods is the mainstream approach to tackle door detection with mobile
robots. Initial seminal methods in this domain relied on the extraction of handcrafted features [39, 51, 71],
such as edges [5] and corners [29], to describe the characteristic rectangular shape of doors. However, the
requirement to explicitly define and combine these features is a significant limitation of these approaches.
This constraint hampers their robustness and adaptability, especially when dealing with the highly variable
images encountered in real dynamic environments.

Deep learning end–to–end methods have brought significant improvements in the field. Their ability to
automatically learn features that characterize an object class, and robustly handling variations in scale,
position, rotation, and lighting, is a major advantage that has led to their widespread use in mobile robotics.
A pioneering method for door recognition in mobile robot navigation was introduced in [13]. This method
utilizes color and shape as key features to detect doors in office environments, employing two neural classifiers
to identify these elements in images. These features are then integrated using a heuristic algorithm to
determine if they form a typical door structure. The study in [11] presents a method for door detection
aimed at enhancing the autonomous navigation of mobile robots. A convolutional neural network is trained to
identify doors in indoor settings, demonstrating its utility in aiding a robot’s efficiency in traversing passages.
Additionally, recent research has explored the integration of RGB vision with other sensors commonly used in
robot navigation [35] and the identification of doors and their handles to enable interactions like grasping [45,
15, 32]. For example, the research in [45] utilizes a YOLO–based deep learning framework [55] for the
detection of door Regions Of Interest (ROI). This approach specifically targets the handles by focusing on
the area encapsulated within the door’s ROI, effectively locating the handles for interaction purposes.

The studies previously mentioned offer approaches to the door detection task in scenarios similar to the one
we address, yet they do so only to a limited extent. A notable limitation in these studies is the absence
of training from a mobile robot’s perspective. Additionally, these methods often do not exploit the typical
operational conditions of a robot. Our work introduces a strategy specifically tailored to address these
shortcomings. Viewing this from a broader angle, our method addresses the domain adaptation challenge (a
well–recognized issue in deep learning at large [72]) within the realm of door detection using mobile robots
in unstructured indoor environments. Similar findings are provided by the work of [74], where is presented
a dataset of door handles for training robotic manipulation methods to open/close doors. Our proposed
technique exploits the technique of fine–tuning of pre–trained deep neural models, a practice extensively
employed in autonomous mobile robotics both relying on manually labeled data [76, 75] and self–supervised
methods [78, 40].

3 General and Qualified Door Detection for Service Robots

We focus on a service robot designed to autonomously operate in human–centric indoor environments. We

assume a widely–used hardware setup, where the robot is equipped with one or more RGB cameras for vision.
The primary objective is developing real–time door detection, namely the task of processing an RGB image
acquired by the robot to determine the bounding box and binary traversability status (open or closed) for
each visible door.

The method we propose, graphically summarized in Figure 2, is structured around the two principal phases
that define the lifecycle of a mobile service robot: the robot’s development phase and the subsequent deploy-
ment phase. With this method, we aim to identify, experimentally evaluate, and solve some of the challenges
that are intertwined with the usage of vision–based object detection methods on mobile robots.

The development phase for a service robot involves preparing and configuring the platform, including the
installation and setup of hardware and software components. The objective here is to setup a robot that
is ready to meet the challenges of real–world environments. This work’s focus lies in the domain of visual
perception capabilities, aiming to equip the robot with vision skills that perform satisfactorily across various
environments, thereby ensuring a high level of generalizability. The development phase is the starting ground
for addressing our door–detection task. Our approach involves creating a General Detector (GD), designed
to recognize doors while adhering to the perception constraints of service robots and maintaining consistent
performance in various environments. A significant part of our method involves utilizing simulation to
develop a photorealistic visual dataset, representing typical visual perceptions of a robot. This dataset is
then used to train a GD, ensuring it achieves baseline performance in the real world.

During the deployment phase, the service robot is introduced for autonomous operation in a target envi-
ronment, usually for an extended period. This phase often involves a domain shift, presenting challenges to
the performance of the previously developed GD. This is because the pre–built computer vision methods,
focusing on the model’s development, present difficulties introduced by our environmental setup that prevent
their straightforward use on autonomous mobile robots. Given the long–term nature of this phase, there is
an opportunity to incrementally fine–tune the GD with data collected in the target environment, aligning
it more closely with the specific visual features at hand, thus obtaining a Qualified Detector (QD). This
detector can exploit the fact that usually multiple instances of the same object within the same environment
Development Deployment
Environment 2D maps
Data
Acquisition

Data
annotation
Perception
poses

Photorealistic
simulation
Fine-tune Fine-tune Fine-tune

Qualified detector Qualified detector Qualified detector

Detection Detection Detection

Training

General detector

Detection

Real world

Figure 2: A general overview of the proposed method.

present similar features, that are stable in time. Doors and windows are good examples of this fact: within
a target environment, most of them are usually produced by the same manufacturer and are of the same
type. The process of adapting the detection model to the target environment may require the collection and
annotation of data for which we propose a method demonstrating a trade–off between the effort required
and the resulting performance improvements.

3.1 Training a General Detector

The recent trends in object detection suggest that a straightforward way to address Robotic Vision is to
plug and play a deep detector in a robotic platform [77]. Despite the availability of a large number of
effective models, when faced with reality this simple approach presents several engineering challenges and
an established method to provide a GD for service robots still needs a comprehensive investigation.

State–of–the–art object detectors such as the one we shall consider, namely DETR [6], YOLO [55], and
Faster R–CNN [56], are typically trained on prominent datasets like Pascal VOC [22], ImageNet Large
Scale Visual Recognition Challenge [57], or, most commonly, MS COCO [43]. These datasets comprise
thousands of images from diverse contexts (including both indoor and outdoor settings) and feature various
object categories captured from application–agnostic viewpoints. However, when applying these models to
robotics, two primary challenges arise. First, if the dataset does not extensively represent the object of
interest, the detector’s ability to recognize it may be compromised. Our survey of existing datasets indicates
that doors frequently suffer from this lack of representation. Secondly, even when the object of interest
is well–represented, distribution shifts between the training data and real–world scenarios can significantly
affect performance. This widely recognized yet largely unresolved issue is particularly problematic in our
context, as the shift can affect multiple aspects: the input data, the feature space, and the data collection
process itself (here we rely on the discussion about domain shift types proposed in [42]).
Perhaps the dataset for door detection that is most relevant to our work is DeepDoors2 (DD2) [54], which
contains around 3000 images of doors, each annotated with a bounding box and traversability status1 .
However, in our scenario, DD2 is susceptible to performance degradation due to distribution shifts, a fact
that becomes expected already upon examining some of its examples. The images in DD2 are captured from
human–like perspectives, often showing the door fully visible and centrally located, as depicted in the indoor
and outdoor examples of Figure 3a and Figure 3b, respectively. This dataset overlooks instances such as
partially visible or nested doors, which are common in robots’ perceptions. Labels are provided only for doors
that are completely within the frame and distinct enough for clear identification, as shown in the dashed
bounding boxes of Figure 3c (partially visible door) and Figure 3d (nested doors). These shortcomings are,
to varying degrees, present in nearly all conventional computer vision datasets [57, 22, 43, 54], reflecting
their inherent limitations in capturing a robot’s visual perception model [62]. As our experimental campaign
will demonstrate concretely, these limitations significantly affect performance.

(a) (b) (c) (d)

Figure 3: Examples from the DD2 dataset [54] of open and closed doors (in green and red, respectively).
The dashed bounding boxes represent missing annotations.

To address them, one common method is fine–tuning a large–scale pre–trained model (such as one trained
on MS COCO [43]) with new examples that better represent the target object distribution. This approach is
prevalent, especially in robotics [75, 12, 45], and the strategy we evaluate in this work is based on it. Ideally,
creating an effective door detector through fine–tuning requires a dataset that:

• demonstrates a high level of photorealism (to withstand distribution shifts at the input level);

• encompasses a variety of indoor environments with diverse features (to withstand distribution shifts
at the feature level);

• accurately reflects the robot’s perspective and perception model (to withstand distribution shifts in
the data acquisition process).

Currently, no dataset fulfilling these criteria exists in the literature, as efficiently collecting it is still an open
problem. An alternative to this issue is to use labeled sequences of images obtained by a robot or by a mobile
platform, such as in ScanNet [16] or SUN3D [69]. However, these sequences are usually collected within single
rooms and, as they are based on fixed trajectories, do not allow the sampling of new viewpoints from different
perspectives that may be encountered by the robot while navigating. The most straightforward approach
would involve an extensive data collection campaign using robots in real–world environments, gathering
image samples and manually labeling them. However, the logistics and costs associated with this method
are prohibitively high and well–recognized among robotics professionals. In the following, we tackle this
problem by exploiting simulation [14], an approach frequently employed in robotics to mitigate the large
1 Notehow a dataset of this size is customary in several object detection tasks: as an example, in MS COCO, the average
number of examples per class, is 3200; the only exception is the category person, which has more examples, over 10K.
costs of on–the–field experimentation. The empirical results we present later will demonstrate how, with
appropriate design measures, simulations can provide a dataset from which an effective door detector can be
trained.

3.1.1 The Proposed Simulation Framework

Common 3D physics simulators like Gazebo [36, 65] or CoppeliaSim [64, 10] are widely adopted for proto-
typing control software in robotics before real–world deployment [14]. However, their lack of a sophisticated
rendering pipeline for realistic visual perceptions makes them unsuitable for our setting. Early efforts to
address this limitation have involved the use of 3D game engines, such as Unity3D [50] or Unreal [7], to
recreate complex robotics scenarios, including unique environments or specialized physical laws, as seen in
autonomous vehicles [20, 59], UAVs [34], or surgical robotics [63]. Despite their adaptability, customizing
these game engines for a particular robotic application can be challenging [58]. Specifically, for the task
considered in this paper, this would involve manually creating synthetic scenes that accurately reflect the
structural features of real indoor environments, a task that is more aligned with environment design than
robotics engineering.

Recently, the introduction of interactive realistic simulations for embodied AIs, as iGibson [60], has helped
mitigate the limitations of traditional simulators for indoor robotic tasks. iGibson comes with 15 artificially
constructed home–sized scenes, which are developed by populating layouts of actual environments with 3D
controllable objects whose configuration, shape, material, and texture can be automatically changed. Unlike
the simulators mentioned earlier, iGibson seamlessly integrates with the Robot Operating System (ROS),
facilitating the collection of extensive, high–quality annotated image datasets from a robot’s perspective.
However, despite these significant advantages, simulators based on synthetic scenes still fall short in achieving
the crucial aspect of photorealism. This limitation is something we empirically assess in our experimental
campaign.

In addressing these challenges, we adopted an approach that balances the photorealism of real–world data
acquisition with the automation benefits of synthetic simulations. Our solution relies on Gibson [68], a sim-
ulator designed for embodied agents with an emphasis on enhancing the photorealism of visual perceptions.
This enhancement aids in the effective transfer of models trained within the simulator to real–world envi-
ronments. Gibson achieves this by employing scene datasets scanned directly from real environments, such
as Matterport3D [9] and Stanford–2D–3Ds [3], which accurately replicate the complexity of the real world.
Additionally, it incorporates a neural rendering pipeline to further bridge the sim–to–real gap. Leveraging
these features, we developed a simulation framework based on Gibson, in conjunction with Matterport3D,
a comprehensive RGB–D dataset comprising 90 digitized real scenes also including semantic tagging for
both instance and category–level segmentation. This combination allows us to achieve a balance between
photorealism and the controlled conditions necessary for effective simulation.

Gibson provides a middleware for controlling a ROS–based virtual robot. While camera perceptions can
be easily simulated (setting resolution and FOV), navigation encounters several technical limitations. First,
conducting a real–time acquisition campaign, even in a simulated environment, can be time–intensive. More-
over, this approach does not offer complete control over the data acquisition process, as much of it depends
on the navigation stack of the simulated robot. Additionally, the 3D polygonal meshes of the Matterport3D
environments, which are digitized from real–world settings, are often cluttered and noisy. This results in
various issues: furniture models appear malformed and incomplete (as shown in Figure 4a, where the legs of
the table are not modeled), walls frequently have holes near windows or mirrors (see Figure 4b, where the
bed should be behind a mirror and a wall, which are missing), and the surfaces of floor plans are irregular
(see Figure 4c). Such artifacts, which concern the 3D meshes and not the images, can lead to inaccuracies
and failures when the robot navigates in the simulated environment.

To address these shortcomings, we have developed an enhanced version of Gibson, which introduces a highly
(a) (b) (c) (d)

Figure 4: Matterport3D mesh malformations. (a) The table and chairs have no legs. (b) A wall in the
bedroom, in front of the bed, is missing. (c) Floor plan irregularities. (d) An example of perception acquired
from the Gibson–based simulation framework.

controllable simulation mechanism. This upgraded simulation framework2 allows to script robot teleporting
actions to any location relaxing some constraints from the physics engine such as gravity or collisions. Such
a capability can enable large–scale batch data acquisition without the risk of operational failures. With
this system, the robot can effectively operate over uneven floor surfaces and across different floors without
encountering issues related to architectural barriers, like stairs or elevators. This approach significantly
streamlines the data–gathering process, ensuring efficient and uninterrupted data collection in simulated
environments. Figure 4d shows an example of an acquisition obtained with this simulation framework.

3.1.2 Pose Extraction

To effectively exploit the simulation framework proposed above it is crucial to ensure that the data acquired
aligns with the perception model of a service robot. To model how a robot perceives human–centric envi-
ronments, we rely on the experience obtained in a long–term deployment of service robots, described in [49].
To achieve this, we propose a method for principled selection of perception poses. First, data acquisition
should occur from locations within the free space that also maintain a minimum clearance from the nearest
obstacles. Additionally, these locations ought to be strategically positioned along the shortest paths con-
necting key areas of the environment. This positioning is key as these paths are the most likely to be covered
by a robot during its service time. Third, it is important to distribute the locations uniformly throughout
the environment to minimize redundancy and to ensure comprehensive coverage of the environment’s visual
features. Our method is composed of three distinct phases.

The initial phase focuses on generating a 2D map of the environment from the 3D mesh provided by the
simulation framework. This process involves aggregating obstacles identified through multiple cross–sections
of the 3D mesh, which are created using parallel planes starting from a few centimeters over the floor level.
The resulting map undergoes erosion and dilation to eliminate small gaps between obstacles so as to exclude
areas that are unreachable or too close to obstacles. Figure 5 presents some key outcomes of these steps.
Specifically to this first phase, Figure 5a illustrates the 3D mesh of a simulated environment, while Figure 5b
displays the corresponding 2D map.

In the second phase, the extracted map is used to compute a navigation graph, a data structure that represents
the principal routes likely to be traversed by a robot. This process is detailed in Algorithm 1. The map,
denoted as M = (F, O), comprises free and obstacle points sets denoted as F, O ⊆ R2 , respectively. Initially,
the algorithm identifies the contours of the obstacle shapes in O, resulting in a set of vertices Ov ⊆ R2 (line 1).
This set contains the minimum number of vertices to represent the obstacle shapes without information loss.
These vertices are used as basis points for calculating the Voronoi boundary within the free space F (line 2).
2 The pre-compiled python package is available at https://pypi.org/project/gibson/, the source code can be found at
https://github.com/micheleantonazzi/GibsonEnv.
(a) A 3D mesh of an environment (b) Map of free (F , white) and ob- (c) Grid cells (G0 , in red) lying on
obtained from the simulator stacle (O, black) space the Voronoi boundary

(d) Navigation graph (G, in red) (e) 2D locations (in red) of the perception poses (P )

Figure 5: Different phases of our pose extraction method.

This boundary, separating Voronoi cells that cover F , is structured as an undirected graph with vertices
V0 ⊆ R2 and edges E0 ⊆ V0 × V0 . The algorithm then overlays the Voronoi boundary onto a grid map
that discretizes the free space F at a resolution ϵ. Each grid cell ci has an area of ϵ2 and is centered at
coordinates (cxi , cyi ). A partial grid G0 is formed by selecting free space grid cells that contain at least one
point from Ov (line 3, also illustrated in Figure 5c). Subsequently, G0 undergoes a heuristic filtration to
eliminate spurious cells, specifically those with a degree (number of adjacent cells, assuming 8–connectivity)
of 1 or less (lines 4–12), targeting isolated or excessively narrow grid branches. The final step involves a
skeletonization process [73] to further simplify the grid structure (line 13). This involves converting G0 into
a bitmap, applying the skeletonization algorithm, and then reconstructing a final grid G, which effectively
represents the navigation graph. An example of the obtained result is shown in Figure 5d.

Algorithm 1: Compute navigation graph

Input: M = (F, O), the 2D map of the environment
Output: G, the navigation graph
1 Ov ← f indContours(O);
2 (V0 , E0 ) ← V oronoiBoundary(F, Ov );
3 G0 ← Gridϵ (F, V0 , E0 );
4 do
5 f ilter ← f alse;
6 for c ∈ G0 do
7 if degree(c) ≤ 1 then
8 G0 ← G0 \ c ; /* Filter spurious cell */
9 f ilter ← true;
10 end
11 end
12 while filter ;
13 G ← Skeletonize(G0 ) ; /* Apply skeletonization */

In the third phase, the navigation graph G is utilized to determine the poses for data acquisition, a pro-
cess detailed in Algorithm 2. A perception pose is defined by the tuple (x, y, h, θ), where x, y are the 2D
coordinates on the map, corresponding to the center of a cell in G. From this location, the robot acquires
an image at height h and orientation θ. Essentially, the algorithm performs a depth–first search on G, gen-
erating a cluster of poses each time a distance D is covered on the grid. This is achieved using a stack S
and a set of explored cells, denoted as EXP . The functions d(·, ·) and N (·), applied over G, compute the
distance between cell pairs and identify the set of adjacent cells for a given cell, respectively (assuming again
8–connectivity). The exploration initiates from a randomly selected cell (line 2), and whenever a distance
of at least D is covered (line 9), 16 poses centered on the current cell c are added to the set P . These poses
are generated by iterating over two height values (hhigh and hlow ) and 8 different orientations ranging from
0 to 2π in π4 increments (lines 10–14). An example of the set of 2D locations obtained over the navigation
graph is depicted in Figure 5e.

Algorithm 2: Pose extraction

Input: G, the navigation graph; D, a distance threshold
Output: P , the set of poses
1 S, EXP , P ← ∅, d ← 0 ; /* Initialization */
2 cur ← randomCell(G);
3 S.push(cur);
4 while S not empty do
5 c ← S.pop();
6 EXP ← EXP ∪ {c};
7 d ← d + dist(cur, c);
8 cur ← c;
9 if d ≥ D then
10 for h ∈ {hhigh , hlow } do
11 for i ∈ {0, 1, . . . , 7} do
12 P ← P ∪ (cx , cy , h, πi 4 ) ; /* Add perception pose */
13 end
14 end
15 d ← 0;
16 end
17 for c′ ∈ N (c) \ EXP do
18 S.push(c′ );
19 end
20 end

3.2 Training a Qualified Detector for a Target Environment

During the deployment phase, the robot is set up for long–term operation in a specific target environment,
denoted as e. A critical requirement for autonomous navigation is obtaining an on–site map. This process
often involves a technician who either directly operates the robot or assists it in exploring the environment
to acquire a map for later use. (We experienced this setup during an extensive experimental campaign
conducted in the scope of an assistive robotics study where service robots have been installed in several
private apartments [47, 49]. Beyond this, we deem that the setup is common and highly representative to
a very large number of on–the–field installations.) In this exploration phase, the robot has the opportunity
to collect additional data, particularly images of the environment captured with its onboard RGB camera.
A selected portion of these images can be labeled with doors and utilized to fine–tune the general detector
developed in Section 3.1, tailoring it specifically to environment e. We refer to the adapted version of this
detector as the qualified detector for environment e and we denote it as QDe . A general overview of the
proposed methodology is illustrated in Figure 6.

An intriguing approach would be to automatically label the additional acquired data in what essentially
would be an instance of an unsupervised domain adaptation problem. One method is to use pseudo–labels,
Figure 6: A general overview of the qualification procedure.

generated by applying our general detector to the new samples, a technique common in semi–supervised
learning [70]. However, our preliminary experiments showed a significant performance drop of about 20%
with this method compared to results with the general detector. This decline can be attributed to the
inherent inaccuracy of pseudo–labels, as also observed in recent studies [25]. While pseudo–labels may
improve performance in tasks where precise labels are less critical (such as semantic segmentation [78]), their
lack of accuracy makes them unsuitable for object detection tasks, such as the one we consider. Indeed,
re–training with missed or hallucinated bounding boxes produces a drift in the model in which errors keep
getting reinforced. Exploring more advanced techniques for unsupervised domain adaptation (as discussed
in [53]) is beyond the scope of this paper, where our aim is to empirically assess the trade–offs in enhancing
a general detector. Consequently, we opt for manual labeling, which can be conveniently done during the
robot’s installation phase. This approach is widely accepted in robotics, e.g., see the work presented in [76],
where manual annotations have been used to fine–tune an object detector for long–term localization tasks.

To facilitate this process, we have developed and released a ROS–integrated data annotation tool3 . This tool
allows transferring robot perceptions from a ROS bag into a database. It then samples these perceptions at a
given frequency and presents them to a technician, providing an interface for easy bounding box annotation.
To enhance efficiency, bounding boxes from one image are retained in subsequent images, leveraging the
robot’s slow movement to reduce labeling workload and reuse prior annotations.

With our experimental campaign, we prove the benefits that the qualification procedure brings to the robot’s
performance, studying also the trade–off between the effort between labeling costs and the model performance
gain. We empirically show that a relatively limited effort is sufficient to obtain remarkably better results in
object detection. In addition, we show that this procedure is more effective when applied to a GD trained
with data from the robot’s point of view.

4 Evaluation

In this section4 , we evaluate the performance of our door detectors by presenting an extensive experimental
campaign where the full workflow of our method (Figure 2) is applied to a significant selection of three
prominent deep–learning object detectors. In Section 4.1 we describe our experimental setting by detailing
our model selection (Section 4.1.1), the datasets used for the trials and the details of their preparation
(Section 4.1.2), and the evaluation metrics we propose to adopt (Section 4.1.3). We then present and discuss
the obtained results both with our general detectors (Section. 4.2) and with the qualified ones (Section 4.3).
We show how the increase in performance due to our pipeline is general regardless of the object detection
3 Available
at https://github.com/aislabunimi/ros-data-annotator
4 Thedatasets, models, and scripts of our experiments are available at https://github.com/aislabunimi/
door-detection-long-term/tree/robotic-vision
method used. To do so, we compare the results obtained with three popular object detection architectures
and we select the configuration that better suits our target problem (Section 4.4).

4.1 Experimental Setting

4.1.1 Model Selection

Research in deep learning for object detection primarily explored three types of deep learning architectures.
Initially, the focus was on two–stage detectors, which were then followed by the development of one–stage
models. More recently, considerable interest has been devoted to Transformers.

Two–stage detectors (such as [26, 56]) employ an architecture featuring two parts. The initial part generates
proposals, namely regions likely containing objects of interest. The second part classifies and refines these
proposals in a coarse–to–fine fashion. Following a more end–to–end approach, one–stage detectors (such
as [23, 44]), perform object recognition in a single step. They simultaneously predict both the locations and
the labels of objects using predefined bounding boxes, known as anchors, which are distributed uniformly
across the image. Recently, Transformer–based models (such as [6, 17]) have gained importance as a novel
paradigm in object detection. These models first create a spatial feature map from the input image, which
is then processed by a Transformer [66]. This process allows for the parallel prediction of multiple objects’
labels and locations, with the added advantage of considering inter–object relationships through the use of
attention. (See [77] for a more comprehensive survey of these techniques.)

In our experimental campaign, we chose a representative model for each of the three main types of object
detection architectures, considering both their availability and the computational requirements for training
and deployment on a robotic platform. For the two–stage architecture, we selected Faster R–CNN [56]
as implemented in the PyTorch Hub framework. This model includes a Feature Pyramid Network (FPN)
backbone based on ResNet–50 [28], coupled with a Region Proposal Network (RPN) [56] and a classifier for
bounding box regression [26], totaling around 41 million parameters. For one–stage detectors, we opted for
the medium–sized variant of YOLOv5 [33], which has approximately 20 million parameters. Both these two
architectures apply a non–maximum suppression procedure to discard bounding boxes with a high overlap
(for any pair of bounding boxes with an overlap of 50% or more, the one with the lower confidence is
removed).

As for the Transformer–based model, we selected DETR [6], which integrates a ResNet–50 backbone [28]
with a Transformer module [66] and a four–layer MLP, summing up to 41 million weights in total. DETR
requires setting a critical hyperparameter, N , which defines the fixed number of bounding boxes predicted
per image. We set N to 10, a value slightly higher than the maximum number of doors observed in any
single image in our datasets, to ensure comprehensive detection without excessive computational burden.

4.1.2 Datasets

In our experiments, we considered a total of four datasets composed of images and their relative door–status
annotations.

The first dataset, which we refer to as DDD2 , is derived from the DD2 dataset [54] discussed in Section 3.1.
This dataset includes 3000 real–world images taken from a human perspective, as provided in DD2. In these
images, doors are marked as open, semi–closed, or closed. For the purposes of our experiments, we re–labeled
the dataset to include ground truth data for complex examples that were not initially annotated (similar to
those shown in Figure 3). Additionally, considering the operational constraints of a robot, which may not be
able to navigate through partially opened doors, we categorized the doors marked as semi–closed as closed.

The second dataset, which we refer to as DiG , was generated using the iGibson simulator [60]. iGibson
provides 15 artificial environments, designed to mirror the structural features of real indoor scenes. To
capture data from the perspectives of robots, we implemented a pose extraction mechanism akin to the
one outlined in Section 3.1.2. (The details of this method are not elaborated here, as our later results
will show its limited performance.) By integrating this pose extraction process with the ability to control
door configurations within the simulation, we successfully generated a large batch of around 35000 instances
that were automatically annotated using the semantic data provided by the simulator. (Some examples are
reported in Figure 7.)

Our third dataset, referred to as DG , was created using the Gibson–based simulation framework proposed
in Section 3.1. This dataset comprises images generated from perception poses derived using Algorithms 1
and 2. For this dataset, we set the distance parameter D to 1 m, and used two different robot embodiments
with heights of hlow = 0.1 m and hhigh = 0.7 m across 10 diverse Matterport3D environments, including small
apartments and large villas with multiple floors and varied furniture styles. In processing these images, we
utilized the semantic frames provided by Matterport3D, where each pixel is classified into an object category.
We filtered out images without doors (i.e., where pixels labeled as “door” constituted less than 2.5% of the
total image). Subsequently, we automatically generated bounding box proposals around door instances.
This pre–processing step significantly simplified the final phase of manually verifying and completing the
annotations, which was carried out by human operators. The resulting DG dataset comprises 5457 images all
captured from the perspective of a mobile robot. See some examples in Figure 7, where also the enhanced
photorealism with respect to DiG can be appreciated.
DiG
DG

Figure 7: Example of annotated images obtained from simulations.

The final dataset in our study, named Dreal , is collected from a real deployment scenario of a service robot.
This dataset consists of images acquired by a Giraff–X platform [47, 49], as depicted in Figure 1, during
the exploration in 4 distinct indoor settings. These environments, as depicted in Figure 8, include a vari-
ety of settings. There is a university facility characterized by open spaces and classrooms (referred to as
Classrooms), the floors of a department consisting of narrow corridors and regularly arranged offices (de-
noted as Offices), a research facility with laboratories (labeled as Laboratories), and a private apartment
(identified as House). (In Figure 22–23 the floor plans of Classrooms and Offices are shown.) Data col-
lection was performed using an Orbbec Astra RGB–D camera (the lower camera attached to the robot in
Figure 1), capturing 320x240 RGB images at a rate of 1 fps. The images were then manually annotated.

These datasets collectively offer a comprehensive overview of the trade–offs involved in training a door
detector. DDD2 showcases what is typically available in literature but comes with significant drawbacks:
(e1 ) Classrooms (e2 ) Offices (e3 ) Laboratories (e4 ) House

Figure 8: Real environments considered in this work.

Acquisition Effort Labeling Effort Photorealism Robot POV Num. of Examples

DDD2 Medium – Acquisitions High – Manual labeling High – Real–world im- No ≈ 3000 images, from
taken by an operator is required ages several environments

DiG Low – Automatized Low – Labels provided Low – The simulator Yes ≈ 35000 images, from 15
batch acquisition by the simulator uses synthetic graphics different environments

DG Low – Automatized Medium – Manual la- Medium – Real–world Yes ≈ 5500 images, from 10
batch acquisition beling aided by simula- scans with sim–2–real different environments
tor rendering

Dreal High – Real–robot de- High – Manual labeling High – Real–world im- Yes ≈ 3000 images, from 4
ployment required is required ages different environments

Table 1: Overview of the main features of the datasets we built in this work.

the extensive effort needed for labeling and the lack in representing a robot’s perception model. DiG and
DG , on the other hand, are products of efforts to address this limitation through the use of simulation
frameworks. DiG maximizes the advantages of simulated data collection: images are acquired and annotated
in large batches, automatically, and from a robot–centric perspective. However, this comes with a critical
downside given by the lack of photorealism. Our results will demonstrate that DG achieves a more favorable
compromise, allowing for batch data collection from the robot’s viewpoint with reasonable effort, while
ensuring a decent degree of photorealism and easing the manual annotation process. Dreal , representing the
ideal data set, offers the most authentic data but its high acquisition costs make it impractical for large–
scale training. Table 1 summarizes these points, giving a broad comparison of the key characteristics of each
dataset together with the number of samples exploited in this work.

4.1.3 Performance Metrics

Our first performance metric is the mean Average Precision score (mAP), which averages the AP across all
object categories (in our case, open and closed doors). The AP, as defined in the Pascal VOC challenge [22],
is calculated as the area under the precision/recall curve that is interpolated at 11 evenly spaced recall levels.
In our evaluation, we refine this approach by introducing additional interpolation points at each recall level
where the precision reaches a local maximum. This enhancement provides a more detailed approximation of
the precision/recall curve, resulting in a more accurate assessment.

While the mAP is a widely accepted metric for object detection tasks, it has notable limitations in our
robotic context. On one hand, certain errors disproportionately affect the AP relative to their actual impact
on the robot’s functionality. For instance, as illustrated in Figure 9a, minor inaccuracies in bounding box
localization may have minimal effect on a service robot that is often primarily concerned with recognizing
a door’s traversability status rather than its precise localization. Furthermore, the AP treats multiple
bounding boxes for the same door, as seen in Figure 9b, as false positives. However, a robot can resolve such
ambiguities using additional information like its estimated pose and the map of the environment. On the
other hand, the AP may not adequately reflect the severity of errors in identifying a door’s traversability
status if the bounding box is otherwise accurate. Once again, these errors are treated as false positives but,
in our scenario, incorrectly classifying a closed door as open (or vice versa) can significantly impact the
robot’s efficiency, especially when these classifications inform the robot’s decisions. An example of this type
of error is depicted in Figure 9b.

(a) (b)

Figure 9: Errors made by a detector on Giraff–X (Figure 1). In (a) the foreground green bounding box is
only slightly misaligned compared to its ground truth (in dashed blue). The error affects the AP but not
the robot’s typical task. Similarly, in (b) the two large green bounding boxes at the corridor’s end correctly
refer to the same open door; on the right, the closed door is a false positive. While the two errors affect the
mAP similarly, the former is of little interest in the robotic domain, but the latter is critical for a navigating
robot.

Given these shortcomings, we suggest incorporating additional metrics better suited to the specific needs of
the robotic application domain where door detection is crucial. These proposed metrics are based on the
premise that a service robot will invariably employ a method to sift through and select the most reliable
predictions from a door detector. This process typically involves prioritizing high–confidence predictions and
aggregating multiple bounding boxes that are localized in the same image region. The following definitions
aim to encapsulate this approach, as well as enable the assessment of the asymmetrical nature of detection
errors as previously discussed.

Consider the i–th image xi ∈ X and call Y i and Ŷ i the set of doors present in that image and the set of
predictions computed by the detector, respectively. Given a predicted bounding box ŷ, we denote as c(ŷ)
the confidence associated to it by the detector and we select those predictions whose confidence is above a
threshold ρc , that is Ŷci = {ŷ ∈ Ŷ i | c(ŷ) ≥ ρc }. Given two bounding boxes y1 and y2 , we denote as aI (y1 , y2 )
and aU (y1 , y2 ) the area of their intersection and union, respectively. We compute the set of Background
False Detections (BF D) as the confident predictions that cannot be assigned to any real door based on a
threshold ρa on their maximum Intersection Over Union area (IOU). Formally,
( )
i aI (ŷ, y)
BF D = ŷ ∈ Ŷci max < ρa .
y∈Y i aU (ŷ, y)

BFDs occur when a robot mistakenly identifies a door in locations where none exists, such as on a wall or
a closet. As previously discussed, this type of error relates to the mislocalization of doors. In principle, a
robot might correct such errors using information from its navigation stack. For example, the robot could
infer from its map that a door cannot exist in a place designated as a wall. Therefore, provided these errors
are not excessively frequent, they are generally deemed acceptable within typical robotic scenarios.

Confident predictions that, instead, are well localized and have an above–threshold IOU for at least one door
i
in the image are contained in a set called Ŷc,a = Ŷci \ BF Di . This allows us to define, for each ground truth
door y, the set of predictions that are confident and whose area is maximally matched with it, formally
( )
i aI (ŷ, y)
B(y) = ŷ ∈ Ŷc,a arg maxi =y .
y∈Y aU (ŷ, y)

(Notice that, provided that ties are broken, the same prediction can never be matched to more than one
door.)

Finally, we define ŷ ∗ = arg maxl̂∈B(y) c(ŷ) as the most confident prediction for door y, and it is this prediction
we focus on, discarding any other predictions for the same door. If ŷ ∗ correctly predicts the traversability of
door y, it is included in the set of true positives (T P i ). Conversely, if ŷ ∗ incorrectly predicts the traversability
status, it is assigned to the set of false positives (F P i ). A false positive substantially differs from a BFD, as
an FP is potentially more consequential. An FP can lead the robot to incorrectly assess a critical aspect of
the environment’s topology, such as mistaking a closed door for an open passage, which could significantly
impact its decisions (notice how, in this example, the environment’s map cannot be exploited to fix the
error). In our evaluation, we apply the aforementioned method across all images, defining
i i
|BF Di |
P P P
i |T P | i |F P |
T P% = , F P% = , and BF D% = i ,
Y Y Y
where Y = i |Y i |. We call these Operational Performance Indicators (OPI), they represent the rates of
P
true positives, false positives, and BFDs, respectively. In our experiments, the confidence threshold ρc is set
to 75%, and the IOU threshold ρa is set to 50%.

4.2 Evaluation of General Detectors

In this section, we evaluate our pipeline for synthesizing a GD. For each of the three chosen models, we train
four general detectors on each of the following datasets: DiG , DDD2 , DG , and DDD2+G . The last dataset, DDD2+G ,
is obtained by joining the examples of DDD2 and DG .

Following a preliminary experimental campaign that explored various batch sizes ({1, 2, 4, 16, 32}) and
epoch numbers ({20, 40, 60}), we established our training parameters. For Faster R–CNN and YOLOv5, we
trained for 60 epochs with a batch size of 4 and using the SGD optimizer, with a learning rate of 10−3 . On
the other hand, DETR was trained for 60 epochs with a batch size of 1, utilizing the AdamW algorithm [46].
(The learning rate was set to 10−6 for the CNN backbone and 10−5 for the remaining layers as in [6].)

We present the performance metrics by testing on real–world samples from Dreal . For each one of the 4
T
real environments e1 , e2 , e3 , e4 of Figure 8, a test set Dreal,e is compiled by randomly choosing 25% of the
images acquired by the robot in it. The results are detailed in Table 2 and visually summarized in Figures 10
and 12.

First, notice how it is evident that the general detectors trained on DiG exhibit very poor performance,
as indicated by the blue bars in the figures. To elaborate, the YOLOv5–based GD correctly identified
only one door instance (in e4 – House). Meanwhile, its counterparts, DETR and Faster R–CNN, incur a
high number of errors in terms of F P% and BF D% (as illustrated in Figure 12), which outweighs their
very limited number of correct detections. These unsurprising outcomes confirm the intuition that training
with simulations, even those designed to replicate real environmental features, is ineffective if they lack
photorealism. This conclusion is further supported by observing the significant performance improvements
achieved when transitioning from training with DiG to DDD2 (among our training datasets, the one that
maximizes photorealism).

An interesting and perhaps counter–intuitive observation emerges when comparing the training results of
DDD2 (real–world images) with DG (our simulation framework outlined in Section 3.1). Common intuition
suggests that a detector trained on real–world data should outperform one trained on a simulation, even
DETR [6] YOLOv5 [33] Faster R–CNN [56]
Env. Dataset mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓

DiG 0 1 0 26 0 0 0 0 2 2 0 2
DDD2 13 18 9 13 2 3 2 1 18 25 14 9
e1
DG 26 30 7 22 30 31 1 8 20 25 6 11
DDD2+G 32 37 6 17 32 34 2 3 34 43 10 14
DiG 0 1 1 22 0 0 0 0 0 1 0 3
DDD2 14 19 8 17 3 5 1 3 22 27 4 18
e2
DG 28 36 6 21 14 21 9 9 14 17 4 10
DDD2+G 24 31 10 19 16 24 10 9 27 34 5 20
DiG 0 2 0 35 0 0 0 1 0 1 1 11
DDD2 9 15 3 30 3 3 0 1 10 20 8 40
e3
DG 13 19 6 33 4 6 3 10 2 4 2 10
DDD2+G 16 24 4 44 6 10 2 12 14 24 8 34
DiG 1 5 3 25 0 0 1 4 1 3 7 7
DDD2 22 20 14 9 14 12 3 1 31 35 9 14
e4
DG 31 40 9 11 16 22 2 4 12 18 4 6
DDD2+G 32 35 10 13 30 34 7 7 48 49 7 16

Table 2: Real–World performance of general detectors. The best and second–best results among the training
datasets are highlighted in bold and underlined, respectively.

DiG DDD2 DG DDD2+G Closed door Open door

e1 – Classrooms e2 – Offices
50

30
mAP

0
e3 – Laboratories e4 – House
50

30
mAP

0
DETR [6] YOLOv5 [33] Faster R–CNN [56] DETR [6] YOLOv5 [33] Faster R–CNN [56]
Model Model

Figure 10: mAP of the general detectors trained with the 4 datasets in real environments.

if photorealistic. However, as shown by the mAP scores in Figure 10 and the T P% in Figure 12, we see
that two out of the three detectors, specifically those based on DETR and YOLOv5, actually demonstrate
improved performance when trained on DG rather than DDD2 . This result indicates that while photorealism,
a characteristic highly present in DDD2 , is important, it is not the unique key feature for creating effective
general detectors for robots. It appears that the slightly compromised visual quality in DG might be effectively
DETR [6]
YOLOv5 [33]
Faster R–CNN [56]

e1 – Classrooms e2 – Offices e3 – Laboratories e4 – House

Figure 11: Real–world door instances correctly recognized by GDs trained on DDD2+G .

balanced by a closer alignment with a robot’s perception model, thereby reducing, to some extent, the sim–
to–real gap. This also suggests that in real robot deployments, the shift in data distribution might be more
significantly influenced by the data acquisition process rather than by the characteristics of the input space.

This trend does not hold for the detector based on Faster R–CNN, which shows better results with DDD2 .
Upon closer examination, this can be attributed to the Region Proposal Network, which, by localizing and
classifying bounding boxes based on features extracted from the Pyramid Backbone, is more sensitive to
the photorealistic quality of images. To support this observation, we consider the performance of Faster
R–CNN trained on DDD2+G , a dataset that combines DDD2 ’s high photorealism with DG ’s representation of the
robot’s viewpoint. As indicated by the red bars in Figure 10, Faster R–CNN’s performance improves, while
DETR and YOLOv5 are only slightly impacted by the absence of real–world data. The T P% in Figure 12
shows that the correct door status detections with DDD2+G slightly surpass those with DG . In some cases, our
simulated data even yield better results, as demonstrated by DETR’s performance in environments e2 and
e3 . However, it’s noteworthy that mixing training data often leads to an increase in erroneous detections, as
evidenced by the F P% and BF D% indicators in Figure 12.

These results prove the effectiveness of our simulation framework, which strikes a balance between photo-
realism and alignment with the robot’s perception model. This approach is hence both viable and efficient
for building general door detectors, reducing training costs while still achieving an acceptable performance
level. The general detectors we developed are capable of accurately recognizing doors across diverse real–
world environments, demonstrating a fair level of generalization. However, this strength is mainly evident in
straightforward door instances, and less so in more complex ones involving occluded views, multiple nested
doors, or combinations of these. Figure 11 showcases some representative examples where our GDs excel.
To bridge the gap in identifying such difficult cases, it is essential to qualify the general detectors for the
target environment where they are set to operate.
DiG DDD2 DG DDD2+G T P% F P% BF D%
e1 – Classrooms e2 – Offices
60

20
%

e3 – Laboratories e4 – House
60

20
%

DETR [6] YOLOv5 [33] Faster R–CNN [56] DETR [6] YOLOv5 [33] Faster R–CNN [56]
Model Model

Figure 12: Operational performance indicators of the GDs trained with the 4 datasets.

4.3 Evaluation of Qualified Detectors

In this section, we assess the benefits to general detectors introduced by the qualification procedure. In
particular, we qualify for the environments of Dreal the GDs trained with DDD2 , DG , and DDD2+G , discarding
the one based on DiG due to its unsatisfactory performance in the real world (see previous section). To ease
presentation, we say that a QD is based on a dataset Dx when it is obtained from a GD trained on such a
dataset. Considering each real environment e, we perform a series of qualification rounds of the GDs with
increasing amounts of data from the target environment, obtaining a set of qualified detectors we denote
as QDe15 , QDe25 , QDe50 , and QDe75 . Their superscripts denote the percentage of examples from Dreal,e (the
real data acquired in environment e) used for the fine–tune and can be interpreted as an indicator of the
cost to acquire and label the additional samples. We use the same training parameters as in Section 4.2
reducing the epochs to 40. Each qualified detector QDex is tested in the corresponding environment e using
T
the previously defined test set Dreal,e (random 25% of images from Dreal,e not used in any qualification
round). We assess the average performance over the three selected models in each real environment. The
detailed results can be found in Table 3. For a visual representation, refer to Figure 13 which illustrates the
mAP, and Figure 14 for the operational performance indicators.

A first evident observation is that the qualification procedure boosts the performance of the general detectors
for the target environment and, unsurprisingly, the performance (together with the data preparation costs)
increases as more samples are included, from QDe15 to QDe75 . This can be appreciated in the mAP and T P%
improvements visually depicted in Figures 13 and 14 and by the decreasing trend (after the first qualification
round) of F P% and BF D% in Figure 14.
DDD2 DG DDD2+G
Env. Exp. mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓

GD 11 ± 8 15 ± 11 8±6 8±6 25 ± 5 29 ± 3 5±3 14 ± 7 33 ± 1 38 ± 5 6±4 11 ± 7

QDe15 49 ± 21 53 ± 19 3±2 10 ± 9 63 ± 12 67 ± 9 3±1 15 ± 15 63 ± 7 67 ± 7 3±2 14 ± 10
e1
QDe25 59 ± 20 63 ± 18 2±1 16 ± 15 72 ± 14 75 ± 12 3±2 14 ± 13 73 ± 12 76 ± 9 2±2 15 ± 12
QDe50 72 ± 19 76 ± 16 1±1 13 ± 13 81 ± 9 83 ± 7 1±1 11 ± 9 80 ± 10 83 ± 9 1±1 11 ± 8
QDe75 78 ± 15 81 ± 13 1±1 11 ± 9 85 ± 9 87 ± 7 1±1 9±7 85 ± 8 87 ± 7 1±1 9±6
GD 13 ± 9 17 ± 11 4±4 13 ± 8 18 ± 8 25 ± 10 6±3 13 ± 7 22 ± 5 30 ± 5 8±3 16 ± 6
QDe15 48 ± 21 53 ± 19 3±1 16 ± 16 65 ± 12 69 ± 9 2±1 24 ± 25 62 ± 11 66 ± 10 3±1 24 ± 19
e2
QDe25 60 ± 18 66 ± 17 3±2 23 ± 24 71 ± 8 74 ± 7 3±1 17 ± 16 72 ± 10 76 ± 8 3±0 20 ± 19
QDe50 70 ± 17 74 ± 14 2±0 18 ± 19 80 ± 8 83 ± 6 2±1 14 ± 11 80 ± 7 84 ± 7 2±1 16 ± 17
QDe75 75 ± 13 80 ± 9 2±0 18 ± 18 84 ± 5 86 ± 4 2±1 12 ± 11 82 ± 7 85 ± 6 2±1 18 ± 15
GD 7±4 13 ± 9 4±4 24 ± 20 6±6 10 ± 8 4±2 18 ± 13 12 ± 5 19 ± 8 5±3 30 ± 16
QDe15 41 ± 23 48 ± 20 2±1 26 ± 20 55 ± 16 63 ± 11 5±3 27 ± 20 56 ± 14 63 ± 10 4±2 33 ± 24
e3
QDe25 54 ± 25 59 ± 21 3±2 21 ± 18 64 ± 19 70 ± 13 3±3 26 ± 28 68 ± 13 75 ± 9 3±2 21 ± 15
QDe50 68 ± 21 74 ± 16 2±1 19 ± 18 76 ± 13 81 ± 10 2±2 17 ± 15 76 ± 15 81 ± 12 3±2 19 ± 17
QDe75 76 ± 19 81 ± 15 1±1 14 ± 14 82 ± 12 86 ± 9 1±1 12 ± 10 82 ± 11 86 ± 8 1±1 15 ± 16
GD 22 ± 8 22 ± 12 9±6 8±7 20 ± 10 27 ± 12 5±4 7±4 37 ± 9 39 ± 8 8±2 12 ± 5
QDe15 60 ± 18 64 ± 19 2±1 18 ± 12 76 ± 4 77 ± 4 1±1 14 ± 7 76 ± 6 78 ± 8 3±2 18 ± 16
e4
QDe25 70 ± 18 73 ± 17 2±1 14 ± 11 79 ± 7 81 ± 5 3±1 14 ± 9 82 ± 8 83 ± 8 4±2 12 ± 9
QDe50 81 ± 10 84 ± 10 2±1 13 ± 11 91 ± 4 92 ± 3 1±1 8±6 90 ± 5 92 ± 3 1±1 8±7
QDe75 90 ± 8 91 ± 6 1±1 5±4 96 ± 2 96 ± 2 0±1 4±3 96 ± 2 96 ± 2 0±0 3±2

Table 3: Results of the qualification procedure (averaged per detector) when the GD is trained with the
DDD2 , DG , and DDD2+G .

However, the increments follow a diminishing–returns trend, with large gains in the first qualification rounds
and marginal ones as more data are used. Focusing on the average mAP and T P% it can be seen how the
qualified detector that scores the highest performance improvements is QDe15 , despite requiring a relatively
affordable effort for data preparation. From a practical perspective, this observation suggests how just
a coarse visual inspection of the target environment might be enough to obtain an environment–specific
detector whose performance is significantly better than the corresponding general one. In such a case the
robot’s deployment time is only marginally affected. To give a concrete idea, annotating the 15% of the data
collected by the robot’s first exploration of a new environment (approximately 80 images) required a human
operator using our tool (Section 3.2) about half an hour. Another key finding is how the improvements
through qualification are distributed across various types of instances encountered by the detector. Upon
direct inspection, we observed how these were particularly notable in challenging instances. Figure 15
showcases significant examples of this, illustrating how the QDe15 model, based on our dataset DG , successfully
detects doors in highly challenging instances. These include scenarios with nested or partially occluded doors
and even situations where the door is hidden in the background.

It is important to notice that the dataset chosen to train the general detector does affect the benefits of the
qualification. The trends observed in Figure 13 indicate that QDs based on DDD2 generally demonstrate lower
performance compared to those based on DG and DDD2+G . This observation is further supported by the data
presented in Figure 14. Although the error rates (F P% and BF D% ), are substantially similar, there is a
noticeable difference in the T P% . Specifically, detectors based on DG or DDD2+G show better T P% performance
than those based on DDD2 . Confirming the findings from the previous section, this again suggests that training
on images not representing the robot’s point of view, although taken from the real world, hits a performance
limit. A simulated dataset from the robot perspective with an adequate level of photorealism (as DG ), when
included in the training phase, enables the detectors to reach better performance when qualified for a target
environment.

To further support the effectiveness of the method proposed in Section 3.1, we can notice that DDD2+G , which
DDD2 DG DDD2+G Closed door Open door
e1 – Classrooms e2 – Offices
100

60
mAP

0
e3 – Laboratories e4 – House
100

60
mAP

0
GD QDe15 QDe25 QDe50 QDe75 GD QDe15 QDe25 QDe50 QDe75
Qualification rounds Qualification rounds

Figure 13: Real–world evaluation of the qualified detectors where GDs are based on different datasets. The
mAP is averaged over the three models.

integrates the realism in DDD2 and the robot perception model of DG , does not introduce significant variations
in the performance of the qualified detectors when compared with those solely based on DG . This can be
easily seen by comparing the (substantially similar) orange and red bars of Figure 13 that refer to DG and
DDD2+G , respectively. In addition, while T P% reaches comparable performance, Table 3 shows that DG enables
the qualified detectors to reduce the rate of BF D with respect to DDD2+G .

As argued in Section 3.2, the true benefit of qualification rests on the typically long–term nature of a robot’s
deployment, during which the target environment is assumed to stay the same. However, this invariance
does not imply the complete absence of domain shifts. While the overall environment may not change,
lower–order shifts could still challenge the detector’s ability to perform its task. We deem that one of the
most significant of such shifts might occur at the feature level, particularly concerning the illumination
conditions of the environment. Changes in lighting can significantly alter the appearance of doors in images
and these variations in illumination can be widespread throughout the entire environment, rather than being
localized. To test the robustness of our approach in such situations we acquire and annotate (following the
same procedure for Dreal reported in Section 4.1.2) additional data from environments e1 and e2 during
nighttime, when only artificial light is present and some rooms are dark. Then, we use these images to test
our detectors trained with daylight data. The metrics’ average performance are detailed in Table 4 and
visually shown in Figure 16 (mAP) and Figure 17 (T P% , F P% , and BF D% ).

From Figure 16 we can see how the mAP performance of the GDs reflects those shown in Figure 13, demon-
strating that the GDs are robust to illumination changes. As reported in Section 4.2, the general detectors
based on DDD2 exhibit the worst performance while those trained with our simulated dataset DG perform re-
markably better, especially in e1 (see also the T P% in Figure 17). The DDD2+G –based GDs, despite improving
the average mAP as shown in Figure 10, increase also the performance gap between the models (see the
T P% BF D% F P%
e1 – Classrooms e2 – Offices

DDD2 DG DDD2+G DDD2 DG DDD2+G

100

60
%

0
e3 – Laboratories e4 – House

DDD2 DG DDD2+G DDD2 DG DDD2+G

100

60
%

0
0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75
% of qualification data % of qualification data

Figure 14: Operational performance indicators (averages over the three models) with GDs trained on different
datasets.

standard deviation of the orange and red bars in Figure 16). More interestingly, it can be seen in Table 4 how
the improvement provided by the qualified detectors is maintained also with different light conditions. Once
again, our dataset DG increases the benefits of the qualification when compared to the human–POV images
of DDD2 (as it can be seen by the higher and stable T P% of Figure 18). The slight performance decrease ob-
servable comparing Table 4 and 3 is a direct consequence of the fine–tune, which produces QDs that slightly
overfit the illumination conditions seen during the robot’s deployment. Despite this, our method ensures a
performance improvement to the GD when used in long–term scenarios with illumination changes, enabling
the QDs to still solve challenging examples, as shown in Figure 18. Once again, QDe15 , albeit using a few
examples for fine–tuning, ensures the best performance improvement also under variable light conditions.

4.4 Model Comparison

In this section, we analyze the three selected models (DETR [6], YOLOv5 [33], and Faster R–CNN [56])
highlighting their strengths and weaknesses to provide insights for helping technicians working in Robotic
Vision scenarios to choose the best one according to their requirements.

From our experience in setting up the three models for the specific task of door detection, DETR turned out to
be the easiest to adapt. Instead of learning how to activate a set of predefined anchor boxes according to the
image features, DETR directly regresses the coordinates of the bounding boxes by construction. Moreover,
it does not require a non–maximum suppression step to discard multiple detections of the same object. This
DETR [6]
YOLOv5 [33]
Faster R–CNN [56]

e1 – Classrooms e2 – Offices e3 – Laboratories e4 – House

Figure 15: Challenging doors correctly detected by QDe15 (GDs divided by model and trained on DG ).

is achieved by its loss function that matches the (limited) inferred bounding boxes to a single target. On
the contrary, the detection performance of our detectors based on YOLOv5 and Faster R–CNN are strongly
influenced by the hyperparameters setting: the anchor dimension and scale should be compliant with the
object shape while the non–maximum suppression procedure can delete correct bounding boxes (such as
those of nested doors). In other words, while the competitors need to encode task–specific prior knowledge
in the model, DETR offers the possibility to share the same configuration among different applications (such
as [76]).

After these considerations, we compare the performance of the detectors (based on our dataset DG ) to study
how they work, on average, in the four real environments of Dreal . Table 5 reports in detail the metrics
results, depicted also in Figure 19 (mAP) and Figure 20 (T P% , F P% , and BF D% ).

DDD2 DG DDD2+G
Env. Exp. mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓

GD 11 ± 6 16 ± 9 5±2 9±5 24 ± 1 29 ± 2 4±1 9±4 31 ± 4 37 ± 5 6±2 12 ± 6

QDe15 27 ± 9 32 ± 10 4±2 13 ± 10 41 ± 5 47 ± 2 4±2 16 ± 12 42 ± 3 47 ± 1 4±2 17 ± 11
e1
QDe25 39 ± 9 45 ± 8 4±2 15 ± 12 48 ± 6 52 ± 3 4±3 17 ± 13 48 ± 3 54 ± 1 4±2 18 ± 12
QDe50 43 ± 12 49 ± 11 4±2 15 ± 13 52 ± 5 57 ± 4 5±2 15 ± 12 53 ± 8 58 ± 7 4±3 17 ± 13
QDe75 50 ± 10 54 ± 9 4±2 12 ± 10 56 ± 4 60 ± 3 5±2 13 ± 10 52 ± 6 57 ± 6 4±2 12 ± 10
GD 12 ± 9 18 ± 13 3±2 9±7 16 ± 4 26 ± 6 5±1 15 ± 9 23 ± 6 33 ± 9 5±1 15 ± 8
QDe15 30 ± 10 39 ± 17 3±2 19 ± 16 39 ± 7 48 ± 11 3±1 27 ± 21 43 ± 9 51 ± 13 3±1 25 ± 20
e2
QDe25 39 ± 15 48 ± 16 4±1 24 ± 19 47 ± 6 56 ± 5 3±2 22 ± 16 50 ± 5 60 ± 7 4±1 21 ± 15
QDe50 49 ± 14 56 ± 13 3±1 19 ± 15 58 ± 5 64 ± 4 3±1 15 ± 11 58 ± 5 64 ± 6 3±1 16 ± 9
QDe75 54 ± 12 60 ± 9 3±1 14 ± 11 62 ± 7 67 ± 6 4±1 15 ± 11 62 ± 8 68 ± 7 4±2 17 ± 12

Table 4: General and qualified detector performance during nighttime conditions.

DDD2 DG DDD2+G Closed door Open door
e1 - Classrooms e2 - Offices
80

60
mAP

0
GD QDe15 QDe25 QDe50 QDe75 GD QDe15 QDe25 QDe50 QDe75
Model Model

Figure 16: Real–world evaluation of the detectors during nighttime. GDs are based on different datasets
and the mAP is averaged over the three models.

T P% BF D% F P%
e1 – Classrooms e2 – Offices

DDD2 DG DDD2+G DDD2 DG DDD2+G

60
%

0
0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75 0 15 25 50 75
% of qualification data % of qualification data

Figure 17: Operational performance indicators during nighttime (averages over the three models) with GDs
trained on different datasets.

By observing the mAP performance shown in Figure 19 we can see that the best GD is based on DETR
that, not requiring task–oriented knowledge, better addresses the sim–to–real gap (between our dataset DG
and the real acquisitions of Dreal ). While YOLOv5 lies in the middle, Faster R–CNN reaches the worst
performance when trained in simulation (with our dataset DG ) and tested in the real world. As discussed
in Section 4.2, Faster R–CNN, being a two–stage detector, tends to overfit the distribution of the training
data acquired in simulation. This outcome is reverted by the qualification procedure. When fine–tuned for a
target environment, Faster R–CNN reaches the best mAP results while DETR the worst (see the green and
red bars in Figure 19). This is caused by the Transformer that, although popular, requires huge amounts of
data (hundreds of millions) to effectively learn the architecturally inherent biases of the CNN–based models
(such as the translation equivariance and the locality principle [27]). Moreover, by carefully examining the
extended metric’s results, we can see that the BF D% of YOLOv5 is considerably lower than the other
detectors (both the GD and its qualified versions). In a robotic domain where detections are translated into
actions, this fact is extremely important because drastically reduces robot failures. Figure 21 shows how
the additional indicators of QDe15 vary according to the confidence threshold. The results demonstrate that
our choice of ρc = 75% is a good compromise between the correct (T P% ) and the wrong (F P% , BF D% )
predictions.
e1 – Classrooms
e2 – Offices

DETR [6] YOLOv5 [33] Faster R–CNN [56]

Figure 18: Challenging doors detected by QDe15 during nighttime.

DETR [6] YOLOv5 [33] Faster R–CNN [56]

Exp. mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓ mAP↑ TP% ↑ FP% ↓ BFD% ↓

GD 24 ± 8 31 ± 9 7±1 22 ± 9 16 ± 11 20 ± 10 4±4 8±3 12 ± 8 16 ± 9 4±2 9±2

QDe15 53 ± 15 63 ± 11 4±3 35 ± 15 66 ± 5 67 ± 5 2±1 3±3 74 ± 7 78 ± 4 3±2 22 ± 9
QDe25 59 ± 12 66 ± 9 4±2 35 ± 16 75 ± 5 76 ± 5 2±1 2±1 81 ± 4 84 ± 3 3±0 16 ± 2
QDe50 72 ± 11 78 ± 8 2±1 21 ± 9 85 ± 5 86 ± 5 0±1 2±0 89 ± 4 91 ± 3 2±1 15 ± 2
QDe75 80 ± 11 84 ± 8 2±1 17 ± 7 88 ± 6 88 ± 6 0±0 2±1 93 ± 4 94 ± 3 1±0 10 ± 5

Table 5: Average real–world performance with our three selected models (GDs are are based on DG ).

mAP of the three detectors in real worlds

DETR [6] YOLOv5 [33] Faster R–CNN [56] Closed door Open door
100

80
mAP

0
GD QDe15 QDe25 QDe50 QDe75
Environment

Figure 19: Real–world mAP with our three selected models (GD is based on DG ).

Despite it is well–known from the literature that the two–stage detectors (like Faster R–CNN) are generally
better than single–stage ones [77], YOLOv5 is more suitable for edge devices typically mounted in service
robots. First, it is compatible with the NVIDIA Jetson TX2 mounted on our Giraff–X robotic platform [47,
49] (depicted in Figure 1) where it can run at 20 fps with the TensorRT framework. Since the other models
are not compatible with the NVIDIA SDK for our specific hardware, we deploy all the architectures relying
on ONNXRuntime, a less efficient inference framework able to run YOLOv5, DETR, and Faster R–CNN
at 14, 6, and 0.7 fps, respectively. In our experimental setting, YOLOv5 represents the best compromise
between performance and inference time, thus appearing as the most convenient model for door detection
Extended metric results in the real worlds

T P% BF D% F P%

DETR [6] YOLOv5 [33] Faster R–CNN [56]

100

80
%
60

0
0 15 25 50 75 0 15 25 50 75 0 15 25 50 75
% of qualification data

Figure 20: Real–world performance of the operational performance indicators with our three selected models
(GD is based on DG ).

Performance of QDe15 with different confidence threshold

T P% BF D% F P%

DETR [6] YOLOv5 [33] Faster R–CNN [56]

100

80
%

0
50 55 60 65 70 75 80 85 90 95 50 55 60 65 70 75 80 85 90 95 50 55 60 65 70 75 80 85 90 95
ρc

Figure 21: Operational performance indicators w.r.t. confidence for QDe15 (averages over the real environ-
ments, GD based on DG ).

with service robots.

5 On The Field Evaluation

The goal of equipping an autonomous mobile robot with an object detection method is to allow the robot to
have an updated representation of its working environment that can be used to plan and execute the tasks
assigned to the robot. In the scenario we consider, the ability to detect doors can be used by the robot for
the downstream task of reconstructing the current topology of the environment, that is, to infer which are
the sub–areas that are currently accessible by the robot and those that are not. In the experiments presented
in this section, we consider the environment’s topology to be a graph, wherein the nodes represent rooms
and the edges correspond to open paths connecting them. This knowledge can be used by the robot to plan
its activities [52] considering the constraint that only a subset of the sub–areas are accessible at the current
time.

We evaluate how the proposed door detector can be used to obtain such a knowledge. As discussed in Section
3.2, we assume that the robot can rely on a 2D map acquired during its setup, but we consider a situation
where the current topology of the environment has changed with respect to the one encoded in such a map,
since some doors might be closed at the moment (for obvious reasons, when the 2D map is acquired all the
doors are left opened). The task that the robot must face is to infer the current topology of the free space
it can cover, assuming that door statuses do not change during the execution of this task. To carry it out,
we consider the setting exemplified in Figure 1: the robot follows a trajectory spanning multiple rooms (the
trajectory can be either functional to this topology–inference task or to another higher–level task the robot
is performing). While doing so, it can observe, on purpose or incidentally, the status (open or closed) of
multiple doors. While some doors are perceived from a frontal view, others will likely be observed only from
a side angle as the robot moves in a different direction. We assume that the robot has full knowledge of all
the locations of doors on the map D = {d1 , d2 , . . . dn }. Furthermore, we assume to have a method that, given
the image x ∈ X where a door ŷ ∈ Ŷ has been identified by a door detector, along with the pose from which
the image was acquired, can determine the specific door instance d ∈ D being observed at the moment. Note
that multiple doors can be observed within the same image. The result of this step is that each ŷ ∈ Ŷ that
is not a BF D is associated to a door instance d. The robot thus counts, along the whole trajectory, how
many times each door d ∈ D has been identified either as open or closed, and infers its status as the one
of majority label. This information is used to infer the current topology, which, for evaluation, we compare
with the one obtained by repeating the process using true detections instead of predictions.

We had the robot following two trajectories in a real–world experiment with the same setup described in
Section 4.1.2. The first trajectory is performed in e1 – Classrooms during nighttime. The second trajectory
is performed in e2 – Offices with daylight. We compare the performance in inferring the topology with
GD against QDe15 (both based on YOLOv5 and DG ). In both cases, the QDe15 is trained with data collected
at mapping time, with daylight. The floor plans and the topologies of e1 – Classrooms and e2 – Offices
inferred with this framework are shown in Figure 22 and 23, respectively. We indicate with ■ (■) a door
correctly recognized as open (closed) and with ✖ (✖) a door that has been wrongfully recognized as closed
(open) when its current status is open (closed). We indicate with ✖ the event where the number of
detections ŷ where a door is labeled as open is equal to those where it is perceived as closed, and thus the
robot is undecided. If a door has been observed in multiple images, but the door detector was always unable
to detect any door due to false negatives, we label such door with a ✖. We indicate with ● the location of a
room that the robot can access, and we highlight the location of the main entrance/exits of the environment.
We indicated with a solid blue line a path across two different rooms that is open for the robot, and with a
dashed line a path between two rooms that has been wrongfully estimated, following the same color schema
as above: a red (green) path when a passage is estimated to be closed (open) when actually it is open
(closed).

To understand the impact of having a qualified detector in estimating the topology of an environment, we
report the topology obtained with GD and QDe15 in Figure 22–23, as well as the number of doors whose
status is correctly/wrongly detected, and the total recognition accuracy RA, that is the percentage of doors d
whose status has been successfully detected during the robot run. These metrics are specific to the detection
domain and are downstream OPI, following the definition of Section 4.1.3.

From Figure 22–23, we can see how the qualified detector can correctly identify the topological status of the
environment, albeit making minor errors. In both environments, the QD identifies correctly the status of
most doors, with an RA of around 90% (89, 29% in e1 , 95, 23% in e2 ).

We noticed how, while both GD and QDe15 could detect successfully the status of a door when it is observed
from a frontal position by the robot, the GD often fails when the robot is at one side of a door, when the
door is partially occluded, or when there are challenging light conditions. In all those cases, QDe15 does
not suffer from the same limitations. An example of this can be seen in the doors connected to the main
corridor of e2 , shown in Figure 23. While QDe15 can identify how most doors are closed (✖), GD often fails
to understand the status of those doors, thus identifying them as open (✖ or ✖) or failing to identify them
(✖).

To better highlight this event, we provide detailed results about how many times two doors, that are high-
Exp. ■■↑ ✖✖↓ ✖↓ ✖↓ RA ↑
GD 20 4 1 2 71%
QDe15 25 1 1 1 89%

(a) GD (b) QDe15

Figure 22: Topology of the environment for e1 – Classrooms during night–time as identified using GD and
QDe15 to detect the status of each door.

Exp. ■■↑ ✖✖↓ ✖↓ ✖↓ RA ↑

GD 26 9 4 3 62%
QDe15 40 1 1 0 95%

(a) GD

(b) QDe15

Figure 23: Topology of the environment for e2 – Offices during daytime as identified using GD and QDe15
to detect the status of each door.

lighted with ➀ and ➁ in Figure 23, have been viewed by the robot in the two runs. Door ➀ was observed in
60 images by the robot. While QDe15 was able to correctly detect the status of the door 52 times and was
unable to detect the door on 8, the GD was able to detect the door only on 2 occasions and was unable to
detect it 58 times. Figure 24(a–b) shows two of the 60 images, with the bounding box identified by QDe15 .
At the beginning of the run, the door was closed, and was briefly perceived in that condition when the robot
was outside the room; nevertheless, the robot was able to correctly label it, as in Figure 24a. Later, the
robot enters the room and the door status is open, as in Figure 24b. In both cases, GD fails to identify
any bounding box from those images. The door ➁ was observed in 40 images as closed; QDe15 was able to
correctly identify the status of the door 32 times, and was unable to detect the door 8 times. The GD is
far less accurate in detecting doors in the same set of images: it correctly identified the door as closed 5
times, wrongly identified the door as open 4 times, and did not identify any door in 31 perceptions. Two
examples of these images are shown in Figure 24(c–d); in Figure 24c the door ➁ is the second one on the
right, while in Figure 24d it is the one on the left side of the corridor. In both images, QDe15 was able to
identify successfully the door status and location, while GD failed to identify the presence of a door. Similar
examples can be made for all of the rooms that are connected to the central corridor of Figure 23, that are
seen by the robot from a similar perspective.

(a) (b) (c) (d)

Figure 24: Two examples where the QD identifies the door ➀ in tho challenging images (a–b), and the door
➁ in other two challenging images (c–d). In all four cases, the GD does not identify the two doors in these
images.

These results show how the general detector manages to partially reconstruct the topology of the environment
due to a high number of false positives and wrong detections. On the other hand, the qualified detector
obtains more stable and robust performance compared to its general version when used in its deployment
environment. This demonstrates how the qualification step described in Section 3.2 substantially improves
(with a little cost) the performance of the detection method in a downstream task, providing more accurate
domain–specific knowledge to the robot.

6 Conclusions

In this paper, we propose a methodological approach to synthesize deep learning–based object detectors for
the typical deployment scenario of service robots. At first, we show how to leverage simulation to reduce
the effort to acquire a photorealistic dataset from the robot’s perspective, proving that our method enables
multiple off–the–shelf models to work in the real world when used by service robots. Then, we show that
qualifying the general detector for the target environment will drastically improve the robot’s performance
in its deployment environment even if the number of new examples (and the effort for their annotation) is
limited. We evaluate our approach considering the use case of door detection providing extensive experiments
validated in several indoor environments, providing also an on–the–field evaluation in the downstream task
of reconstructing the environment topology from the detected door statuses.

Future works will investigate how to enhance the photorealism of the synthetic simulators (as iGibson [60])
that, despite offering a high level of automatization, cannot be used for training models working in the real
world (as shown in Section 4.2). Furthermore, we will investigate new solutions for enabling the general
object detector to automatically qualify itself reducing the noise of its pseudo–label.
References
[1] Mary B Alatise and Gerhard P Hancke. “A review on challenges of autonomous mobile robot and
sensor fusion methods”. In: IEEE Access 8 (2020), pp. 39830–39846.
[2] Michele Antonazzi et al. “Enhancing Door-Status Detection for Autonomous Mobile Robots During
Environment-Specific Operational Use”. In: 2023 European Conference on Mobile Robots (ECMR).
2023.
[3] Iro Armeni et al. “3D Semantic Parsing of Large-Scale Indoor Spaces”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 1534–1543.
[4] Richard Bormann et al. “Room segmentation: Survey, implementation, and analysis”. In: 2016 IEEE
International Conference on Robotics and Automation (ICRA). 2016, pp. 1019–1026.
[5] John Canny. “A Computational Approach to Edge Detection”. In: Transactions on Pattern Analysis
and Machine Intelligence PAMI-8.6 (1986), pp. 679–698.
[6] Nicolas Carion et al. “End-to-End Object Detection with Transformers”. In: Computer Vision – ECCV
2020. 2020, pp. 213–229.
[7] Stefano Carpin et al. “USARSim: a robot simulator for research and education”. In: Proceedings 2007
IEEE International Conference on Robotics and Automation. 2007, pp. 1400–1405.
[8] Junyi Chai et al. “Deep learning in computer vision: A critical review of emerging techniques and
application scenarios”. In: Machine Learning with Applications 6 (2021), pp. 100–134.
[9] Angel X. Chang et al. Matterport3D: Learning from RGB-D Data in Indoor Environments. 2017.
[10] Jian Chen, Bingxi Jia, and Kaixiang Zhang. “Trifocal Tensor-Based Adaptive Visual Trajectory Track-
ing Control of Mobile Robots”. In: IEEE Transactions on Cybernetics 47.11 (2017), pp. 3784–3798.
[11] Wei Chen et al. “Door recognition and deep learning algorithm for visual based robot navigation”. In:
2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014). 2014, pp. 1793–
1798.
[12] Agnese Chiatti et al. “Surgical Fine-Tuning for Grape Bunch Segmentation Under Visual Domain
Shifts”. In: 2023 European Conference on Mobile Robots (ECMR). 2023.
[13] G. Cicirelli, T. D’orazio, and A. Distante. “Target recognition by components for mobile robot navi-
gation”. In: Journal of Experimental & Theoretical Artificial Intelligence 15.3 (2003), pp. 281–297.
[14] Jack Collins et al. “A review of physics simulators for robotic applications”. In: IEEE Access 9 (2021),
pp. 51416–51431.
[15] Robert Cupec et al. “Teaching a robot where doors and drawers are and how to handle them”. In:
2023, pp. 2288–2294.
[16] Angela Dai et al. “Scannet: Richly-annotated 3d reconstructions of indoor scenes”. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 5828–5839.
[17] Xiyang Dai et al. “Dynamic DETR: End-to-End Object Detection With Dynamic Attention”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021, pp. 2988–
2997.
[18] Ha Manh Do, Karla Conn Welch, and Weihua Sheng. “SoHAM: A Sound-Based Human Activity
Monitoring Framework for Home Service Robots”. In: IEEE Transactions on Automation Science and
Engineering 19.3 (2022), pp. 2369–2383.
[19] Ha Manh Do et al. “RiSH: A robot-integrated smart home for elderly care”. In: Robotics and Au-
tonomous Systems 101 (2018), pp. 74–92.
[20] Alexey Dosovitskiy et al. “CARLA: An Open Urban Driving Simulator”. In: Proceedings of the 1st
Annual Conference on Robot Learning. 2017.
[21] P. Espinace et al. “Indoor scene recognition through object detection”. In: 2010 IEEE International
Conference on Robotics and Automation. 2010, pp. 1406–1413.
[22] Mark Everingham et al. “The Pascal Visual Object Classes (VOC) Challenge”. In: International Jour-
nal of Computer Vision 88 (2009), pp. 303–338.
[23] Ali Farhadi and Joseph Redmon. YoloV3: An incremental improvement. 2018.
[24] Giuseppe Fragapane et al. “Planning and control of autonomous mobile robots for intralogistics: Litera-
ture review and research agenda”. In: European Journal of Operational Research 294.2 (2021), pp. 405–
426.
[25] Jonas Frey et al. “Continual Adaptation of Semantic Segmentation Using Complementary 2D-3D Data
Representations”. In: IEEE Robotics and Automation Letters 7.4 (2022), pp. 11665–11672.
[26] Ross Girshick. “Fast R-CNN”. In: Proceedings of the IEEE International Conference on Computer
Vision - (ICCV). 2015, pp. 1440–1448.
[27] Kai Han et al. “A Survey on Vision Transformer”. In: IEEE Transactions on Pattern Analysis and
Machine Intelligence 45.1 (2023), pp. 87–110.
[28] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778.
[29] Xiaochen He and Nelson Hon Ching Yung. “Corner detector based on global and local curvature
properties”. In: Optical Engineering 47.5 (2008).
[30] Jane Holland et al. “Service robots in the healthcare sector”. In: Robotics 10.1 (2021), p. 47.
[31] Shintaro Ishikawa and Komei Sugiura. “Target-Dependent UNITER: A Transformer-Based Multimodal
Language Comprehension Model for Domestic Service Robots”. In: IEEE Robotics and Automation
Letters 6.4 (2021), pp. 8401–8408.
[32] Keunwoo Jang, Sanghyun Kim, and Jaeheung Park. “Motion Planning of Mobile Manipulator for
Navigation Including Door Traversal”. In: IEEE Robotics and Automation Letters 8.7 (2023), pp. 4147–
4154.
[33] Glenn Jocher. YOLOv5 by Ultralytics. https://github.com/ultralytics/yolov5. 2020.
[34] Pushkal Katara et al. “Open Source Simulator for Unmanned Underwater Vehicles using ROS and
Unity3D”. In: 2019 IEEE Underwater Technology (UT). 2019.
[35] Taehyeon Kim et al. “Improvement of Door Recognition Algorithm Using Lidar and RGB-D Camera
for Mobile Manipulator”. In: 2022 IEEE Sensors Applications Symposium (SAS). 2022.
[36] Nathan Koenig and Andrew Howard. “Design and use paradigms for gazebo, an open-source multi-
robot simulator”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
2004, pp. 2149–2154.
[37] Tomáš Krajnı́k et al. “FreMEn: Frequency Map Enhancement for Long-Term Mobile Robot Autonomy
in Changing Environments”. In: IEEE Transactions on Robotics 33.4 (2017), pp. 964–977.
[38] AA Nippun Kumaar and Sreeja Kochuvila. “Mobile Service Robot Path Planning using Deep Rein-
forcement Learning”. In: IEEE Access 11 (2023), pp. 100083–100096.
[39] Nosan Kwak, Hitoshi Arisumi, and Kazuhito Yokoi. “Visual recognition of a door and its knob for a hu-
manoid robot”. In: 2011 IEEE International Conference on Robotics and Automation. 2011, pp. 2079–
2084.
[40] Pierre-Yves Lajoie and Giovanni Beltrame. “Self-Supervised Domain Calibration and Uncertainty Es-
timation for Place Recognition”. In: IEEE Robotics and Automation Letters 8.2 (2023), pp. 792–799.
[41] In Lee. “Service robots: A systematic literature review”. In: Electronics 10.21 (2021), p. 2658.
[42] Yoonho Lee et al. “Surgical fine-tuning improves adaptation to distribution shifts”. In: International
Conference on Learning Representations (2023).
[43] Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context”. In: Computer Vision – ECCV
2014. 2014, pp. 740–755.
[44] Wei Liu et al. “SSD: Single Shot MultiBox Detector”. In: Computer Vision – ECCV 2016. 2016,
pp. 21–37.
[45] Adrian Llopart, Ole Ravn, and Nils. A. Andersen. “Door and cabinet recognition using Convolutional
Neural Nets and real-time method fusion for handle detection and grasping”. In: 2017 3rd International
Conference on Control, Automation and Robotics (ICCAR). 2017, pp. 144–149.
[46] Ilya Loshchilov and Frank Hutter. Fixing Weight Decay Regularization in Adam. 2017.
[47] Francesca Lunardini et al. “The MOVECARE Project: Home-based Monitoring of Frailty”. In: 2019
IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). 2019.
[48] Matteo Luperto et al. “Mapping beyond what you can see: Predicting the layout of rooms behind
closed doors”. In: Robotics and Autonomous Systems 159 (2023), p. 104282.
[49] Matteo Luperto et al. “User feedback and remote supervision for assisted living with mobile robots: A
field study in long-term autonomy”. In: Robotics and Autonomous Systems 155 (2022), p. 104170.
[50] Yoshiaki Mizuchi and Tetsunari Inamura. “Cloud-based multimodal human-robot interaction simula-
tor utilizing ROS and unity frameworks”. In: 2017 IEEE/SICE International Symposium on System
Integration (SII). 2017, pp. 948–955.
[51] Iñaki Monasterio et al. “Learning to traverse doors using visual information”. In: Mathematics and
Computers in Simulation 60.3 (2002), pp. 347–356.
[52] Lorenzo Nardi and Cyrill Stachniss. “Long-Term Robot Navigation in Indoor Environments Estimat-
ing Patterns in Traversability Changes”. In: 2020 IEEE International Conference on Robotics and
Automation (ICRA). 2020, pp. 300–306.
[53] Poojan Oza et al. “Unsupervised Domain Adaptation of Object Detectors: A Survey”. In: IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (2023).
[54] João Ramôa et al. “Real-time 2D–3D door detection and state classification on a low-power device”.
In: SN Applied Sciences 3 (2021).
[55] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 779–788.
[56] Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks”. In: Advances in Neural Information Processing Systems 28.6 (2015).
[57] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge”. In: International journal
of computer vision 115 (2015), pp. 211–252.
[58] Mirella Santos Pessoa de Melo et al. “Analysis and Comparison of Robotics 3D Simulators”. In: 2019
21st Symposium on Virtual and Augmented Reality (SVR). 2019, pp. 242–251.
[59] Shital Shah et al. “Airsim: High-fidelity visual and physical simulation for autonomous vehicles”. In:
Field and Service Robotics: Results of the 11th International Conference. 2018, pp. 621–635.
[60] Bokui Shen et al. “iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic
Scenes”. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
2021, pp. 7520–7527.
[61] Niko Sünderhauf et al. “Place categorization and semantic mapping on a mobile robot”. In: 2016 IEEE
International Conference on Robotics and Automation (ICRA). 2016, pp. 5729–5736.
[62] Niko Sünderhauf et al. “The limits and potentials of deep learning for robotics”. In: The International
Journal of Robotics Research 37.4–5 (2018), pp. 405–420.
[63] Eleonora Tagliabue et al. “Soft Tissue Simulation Environment to Learn Manipulation Tasks in Au-
tonomous Robotic Surgery”. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). 2020, pp. 3261–3266.
[64] Lei Tai, Giuseppe Paolo, and Ming Liu. “Virtual-to-real deep reinforcement learning: Continuous con-
trol of mobile robots for mapless navigation”. In: 2017 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS). 2017, pp. 31–36.
[65] Kenta Takaya et al. “Simulation environment for mobile robots testing using ROS and Gazebo”.
In: 2016 20th International Conference on System Theory, Control and Computing (ICSTCC). 2016,
pp. 96–101.
[66] Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing
Systems. 2017, pp. 5998–6008.
[67] Emily Whiting, Jonathan Battat, and Seth Teller. “Topology of urban environments”. In: Computer-
Aided Architectural Design Futures (CAADFutures) 2007: Proceedings of the 12th International CAAD-
Futures Conference. Springer. 2007, pp. 114–128.
[68] Fei Xia et al. “Gibson env: Real-world perception for embodied agents”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 2018, pp. 9068–9079.
[69] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. “Sun3d: A database of big spaces reconstructed
using sfm and object labels”. In: Proceedings of the IEEE international conference on computer vision.
2013, pp. 1625–1632.
[70] Xiangli Yang et al. “A Survey on Deep Semi-Supervised Learning”. In: IEEE Transactions on Knowl-
edge and Data Engineering 35.9 (2023), pp. 8934–8954.
[71] Xiaodong Yang and Yingli Tian. “Robust door detection in unfamiliar environments by combining edge
and corner features”. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition – Workshops. 2010, pp. 57–64.
[72] Jason Yosinski et al. “How transferable are features in deep neural networks?” In: Advances in neural
information processing systems 27 (2014).
[73] Tongjie Y Zhang and Ching Y. Suen. “A fast parallel algorithm for thinning digital patterns”. In:
Communications of the ACM 27.3 (1984), pp. 236–239.
[74] Yizhou Zhao et al. Opend: A benchmark for language-driven door and drawer opening. 2022.
[75] Nicky Zimmerman et al. “Constructing Metric-Semantic Maps Using Floor Plan Priors for Long-Term
Indoor Localization”. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). 2023, pp. 1366–1372.
[76] Nicky Zimmerman et al. “Long-Term Localization Using Semantic Cues in Floor Plan Maps”. In: IEEE
Robotics and Automation Letters 8.1 (2023), pp. 176–183.
[77] Zhengxia Zou et al. “Object Detection in 20 Years: A Survey”. In: Proceedings of the IEEE 111.3
(2023), pp. 257–276.
[78] René Zurbrügg et al. “Embodied Active Domain Adaptation for Semantic Segmentation via Informative
Path Planning”. In: IEEE Robotics and Automation Letters 7.4 (2022), pp. 8691–8698.