Search | arXiv e-print repository

doi 10.1002/rob.22414

Addressing the challenges of loop detection in agricultural environments

Authors: Nicolás Soncini, Javier Civera, Taihú Pire

Abstract: While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depen… ▽ More While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them. △ Less

Submitted 28 August, 2024; originally announced August 2024.

Journal ref: Journal of Field Robotics (2024), 1-10

arXiv:2408.01737 [pdf, other]

Real-time Localization and Mapping in Architectural Plans with Deviations

Authors: Muhammad Shaheer, Jose Andres Millan-Romera, Hriday Bavle, Marco Giberna, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos

Abstract: Having prior knowledge of an environment boosts the localization and mapping accuracy of robots. Several approaches in the literature have utilized architectural plans in this regard. However, almost all of them overlook the deviations between actual as-built environments and as-planned architectural designs, introducing bias in the estimations. To address this issue, we present a novel localizati… ▽ More Having prior knowledge of an environment boosts the localization and mapping accuracy of robots. Several approaches in the literature have utilized architectural plans in this regard. However, almost all of them overlook the deviations between actual as-built environments and as-planned architectural designs, introducing bias in the estimations. To address this issue, we present a novel localization and mapping method denoted as deviations-informed Situational Graphs or diS-Graphs that integrates prior knowledge from architectural plans even in the presence of deviations. It is based on Situational Graphs (S-Graphs) that merge geometric models of the environment with 3D scene graphs into a multi-layered jointly optimizable factor graph. Our diS-Graph extracts information from architectural plans by first modeling them as a hierarchical factor graph, which we will call an Architectural Graph (A-Graph). While the robot explores the real environment, it estimates an S-Graph from its onboard sensors. We then use a novel matching algorithm to register the A-Graph and S-Graph in the same reference, and merge both of them with an explicit model of deviations. Finally, an alternating graph optimization strategy allows simultaneous global localization and mapping, as well as deviation estimation between both the A-Graph and the S-Graph. We perform several experiments in simulated and real datasets in the presence of deviations. On average, our diS-Graphs outperforms the baselines by a margin of approximately 43% in simulated environments and by 7% in real environments, while being able to estimate deviations up to 35 cm and 15 degrees. △ Less

Submitted 3 August, 2024; originally announced August 2024.

arXiv:2407.20391 [pdf, other]

Alignment Scores: Robust Metrics for Multiview Pose Accuracy Evaluation

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose three novel metrics for evaluating the accuracy of a set of estimated camera poses given the ground truth: Translation Alignment Score (TAS), Rotation Alignment Score (RAS), and Pose Alignment Score (PAS). The TAS evaluates the translation accuracy independently of the rotations, and the RAS evaluates the rotation accuracy independently of the translations. The PAS is the average of the… ▽ More We propose three novel metrics for evaluating the accuracy of a set of estimated camera poses given the ground truth: Translation Alignment Score (TAS), Rotation Alignment Score (RAS), and Pose Alignment Score (PAS). The TAS evaluates the translation accuracy independently of the rotations, and the RAS evaluates the rotation accuracy independently of the translations. The PAS is the average of the two scores, evaluating the combined accuracy of both translations and rotations. The TAS is computed in four steps: (1) Find the upper quartile of the closest-pair-distances, $d$. (2) Align the estimated trajectory to the ground truth using a robust registration method. (3) Collect all distance errors and obtain the cumulative frequencies for multiple thresholds ranging from $0.01d$ to $d$ with a resolution $0.01d$. (4) Add up these cumulative frequencies and normalize them such that the theoretical maximum is 1. The TAS has practical advantages over the existing metrics in that (1) it is robust to outliers and collinear motion, and (2) there is no need to adjust parameters on different datasets. The RAS is computed in a similar manner to the TAS and is also shown to be more robust against outliers than the existing rotation metrics. We verify our claims through extensive simulations and provide in-depth discussion of the strengths and weaknesses of the proposed metrics. △ Less

Submitted 2 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.02422 [pdf, other]

Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition

Authors: Sergio Izquierdo, Javier Civera

Abstract: Visual Place Recognition (VPR) plays a critical role in many localization and mapping pipelines. It consists of retrieving the closest sample to a query image, in a certain embedding space, from a database of geotagged references. The image embedding is learned to effectively describe a place despite variations in visual appearance, viewpoint, and geometric changes. In this work, we formulate how… ▽ More Visual Place Recognition (VPR) plays a critical role in many localization and mapping pipelines. It consists of retrieving the closest sample to a query image, in a certain embedding space, from a database of geotagged references. The image embedding is learned to effectively describe a place despite variations in visual appearance, viewpoint, and geometric changes. In this work, we formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals, negatively impacting the recall. In order to address this issue in single-stage VPR, we propose a novel mining strategy, CliqueMining, that selects positive and negative examples by sampling cliques from a graph of visually similar images. Our approach boosts the sensitivity of VPR embeddings at small distance ranges, significantly improving the state of the art on relevant benchmarks. In particular, we raise recall@1 from 75% to 82% in MSLS Challenge, and from 76% to 90% in Nordland. Models and code are available at https://github.com/serizba/cliquemining. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.15518 [pdf, other]

Feature Splatting for Better Novel View Synthesis with Low Overlap

Authors: T. Berriel Martins, Javier Civera

Abstract: 3D Gaussian Splatting has emerged as a very promising scene representation, achieving state-of-the-art quality in novel view synthesis significantly faster than competing alternatives. However, its use of spherical harmonics to represent scene colors limits the expressivity of 3D Gaussians and, as a consequence, the capability of the representation to generalize as we move away from the training v… ▽ More 3D Gaussian Splatting has emerged as a very promising scene representation, achieving state-of-the-art quality in novel view synthesis significantly faster than competing alternatives. However, its use of spherical harmonics to represent scene colors limits the expressivity of 3D Gaussians and, as a consequence, the capability of the representation to generalize as we move away from the training views. In this paper, we propose to encode the color information of 3D Gaussians into per-Gaussian feature vectors, which we denote as Feature Splatting (FeatSplat). To synthesize a novel view, Gaussians are first "splatted" into the image plane, then the corresponding feature vectors are alpha-blended, and finally the blended vector is decoded by a small MLP to render the RGB pixel values. To further inform the model, we concatenate a camera embedding to the blended feature vector, to condition the decoding also on the viewpoint information. Our experiments show that these novel model for encoding the radiance considerably improves novel view synthesis for low overlap views that are distant from the training views. Finally, we also show the capacity and convenience of our feature vector representation, demonstrating its capability not only to generate RGB values for novel views, but also their per-pixel semantic labels. We will release the code upon acceptance. Keywords: Gaussian Splatting, Novel View Synthesis, Feature Splatting △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2404.17251 [pdf, other]

Camera Motion Estimation from RGB-D-Inertial Scene Flow

Authors: Samuel Cerezo, Javier Civera

Abstract: In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, t… ▽ More In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data. △ Less

Submitted 26 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR2024 Workshop on Visual Odometry and Computer Vision Applications

arXiv:2403.13395 [pdf, other]

Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments

Authors: Alberto García-Hernández, Riccardo Giubilato, Klaus H. Strobl, Javier Civera, Rudolph Triebel

Abstract: Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features) that 1) leverages multi-modality by cross-attention blocks between vision and LiDAR features, and 2) includes… ▽ More Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features) that 1) leverages multi-modality by cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders based on local feature matching the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured on a planetary-analogous environment, show that UMF outperforms significantly previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF △ Less

Submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted submission to International Conference on Robotics and Automation (ICRA), 2024

arXiv:2402.16598 [pdf, other]

PCR-99: A Practical Method for Point Cloud Registration with 99 Percent Outliers

Authors: Seong Hun Lee, Javier Civera, Patrick Vandewalle

Abstract: We propose a robust method for point cloud registration that can handle both unknown scales and extreme outlier ratios. Our method, dubbed PCR-99, uses a deterministic 3-point sampling approach with two novel mechanisms that significantly boost the speed: (1) an improved ordering of the samples based on pairwise scale consistency, prioritizing the point correspondences that are more likely to be i… ▽ More We propose a robust method for point cloud registration that can handle both unknown scales and extreme outlier ratios. Our method, dubbed PCR-99, uses a deterministic 3-point sampling approach with two novel mechanisms that significantly boost the speed: (1) an improved ordering of the samples based on pairwise scale consistency, prioritizing the point correspondences that are more likely to be inliers, and (2) an efficient outlier rejection scheme based on triplet scale consistency, prescreening bad samples and reducing the number of hypotheses to be tested. Our evaluation shows that, up to 98% outlier ratio, the proposed method achieves comparable performance to the state of the art. At 99% outlier ratio, however, it outperforms the state of the art for both known-scale and unknown-scale problems. Especially for the latter, we observe a clear superiority in terms of robustness and speed. △ Less

Submitted 2 August, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2312.05995 [pdf, other]

From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation

Authors: Javier Tirado-Garín, Javier Civera

Abstract: Estimating the relative camera pose from $n \geq 5$ correspondences between two calibrated views is a fundamental task in computer vision. This process typically involves two stages: 1) estimating the essential matrix between the views, and 2) disambiguating among the four candidate relative poses that satisfy the epipolar geometry. In this paper, we demonstrate a novel approach that, for the firs… ▽ More Estimating the relative camera pose from $n \geq 5$ correspondences between two calibrated views is a fundamental task in computer vision. This process typically involves two stages: 1) estimating the essential matrix between the views, and 2) disambiguating among the four candidate relative poses that satisfy the epipolar geometry. In this paper, we demonstrate a novel approach that, for the first time, bypasses the second stage. Specifically, we show that it is possible to directly estimate the correct relative camera pose from correspondences without needing a post-processing step to enforce the cheirality constraint on the correspondences. Building on recent advances in certifiable non-minimal optimization, we frame the relative pose estimation as a Quadratically Constrained Quadratic Program (QCQP). By applying the appropriate constraints, we ensure the estimation of a camera pose that corresponds to a valid 3D geometry and that is globally optimal when certified. We validate our method through exhaustive synthetic and real-world experiments, confirming the efficacy, efficiency and accuracy of the proposed approach. Code is available at https://github.com/javrtg/C2P. △ Less

Submitted 27 March, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: Accepted to CVPR 2024

arXiv:2311.15937 [pdf, other]

Optimal Transport Aggregation for Visual Place Recognition

Authors: Sergio Izquierdo, Javier Civera

Abstract: The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Ag… ▽ More The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad. △ Less

Submitted 27 June, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2309.14823 [pdf, other]

Segmentation-Free Streaming Machine Translation

Authors: Javier Iranzo-Sánchez, Jorge Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan

Abstract: Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and i… ▽ More Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data and models will be released upon paper acceptance. △ Less

Submitted 25 May, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Pre-MIT Press publication version. 18 pages, 13 figures

arXiv:2309.06792 [pdf, other]

Motion-Bias-Free Feature-Based SLAM

Authors: Alejandro Fontan, Javier Civera, Michael Milford

Abstract: For SLAM to be safely deployed in unstructured real world environments, it must possess several key properties that are not encompassed by conventional benchmarks. In this paper we show that SLAM commutativity, that is, consistency in trajectory estimates on forward and reverse traverses of the same route, is a significant issue for the state of the art. Current pipelines show a significant bias b… ▽ More For SLAM to be safely deployed in unstructured real world environments, it must possess several key properties that are not encompassed by conventional benchmarks. In this paper we show that SLAM commutativity, that is, consistency in trajectory estimates on forward and reverse traverses of the same route, is a significant issue for the state of the art. Current pipelines show a significant bias between forward and reverse directions of travel, that is in addition inconsistent regarding which direction of travel exhibits better performance. In this paper we propose several contributions to feature-based SLAM pipelines that remedies the motion bias problem. In a comprehensive evaluation across four datasets, we show that our contributions implemented in ORB-SLAM2 substantially reduce the bias between forward and backward motion and additionally improve the aggregated trajectory error. Removing the SLAM motion bias has significant relevance for the wide range of robotics and computer vision applications where performance consistency is important. △ Less

Submitted 13 September, 2023; originally announced September 2023.

Comments: BMVC 2023

arXiv:2309.05388 [pdf, other]

Robust Single Rotation Averaging Revisited

Authors: Seong Hun Lee, Javier Civera

Abstract: In this work, we propose a novel method for robust single rotation averaging that can efficiently handle an extremely large fraction of outliers. Our approach is to minimize the total truncated least unsquared deviations (TLUD) cost of geodesic distances. The proposed algorithm consists of three steps: First, we consider each input rotation as a potential initial solution and choose the one that y… ▽ More In this work, we propose a novel method for robust single rotation averaging that can efficiently handle an extremely large fraction of outliers. Our approach is to minimize the total truncated least unsquared deviations (TLUD) cost of geodesic distances. The proposed algorithm consists of three steps: First, we consider each input rotation as a potential initial solution and choose the one that yields the least sum of truncated chordal deviations. Next, we obtain the inlier set using the initial solution and compute its chordal $L_2$-mean. Finally, starting from this estimate, we iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$. An extensive evaluation shows that our method is robust against up to 99% outliers given a sufficient number of accurate inliers, outperforming the current state of the art. △ Less

Submitted 28 February, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2308.11242 [pdf, other]

Faster Optimization in S-Graphs Exploiting Hierarchy

Authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos

Abstract: 3D scene graphs hierarchically represent the environment appropriately organizing different environmental entities in various layers. Our previous work on situational graphs extends the concept of 3D scene graph to SLAM by tightly coupling the robot poses with the scene graph entities, achieving state-of-the-art results. Though, one of the limitations of S-Graphs is scalability in really large env… ▽ More 3D scene graphs hierarchically represent the environment appropriately organizing different environmental entities in various layers. Our previous work on situational graphs extends the concept of 3D scene graph to SLAM by tightly coupling the robot poses with the scene graph entities, achieving state-of-the-art results. Though, one of the limitations of S-Graphs is scalability in really large environments due to the increased graph size over time, increasing the computational complexity. To overcome this limitation in this work we present an initial research of an improved version of S-Graphs exploiting the hierarchy to reduce the graph size by marginalizing redundant robot poses and their connections to the observations of the same structural entities. Firstly, we propose the generation and optimization of room-local graphs encompassing all graph entities within a room-like structure. These room-local graphs are used to compress the S-Graphs marginalizing the redundant robot keyframes within the given room. We then perform windowed local optimization of the compressed graph at regular time-distance intervals. A global optimization of the compressed graph is performed every time a loop closure is detected. We show similar accuracy compared to the baseline while showing a 39.81% reduction in the computation time with respect to the baseline. △ Less

Submitted 22 August, 2023; originally announced August 2023.

Comments: 4 pages, 3 figures, IROS 2023 Workshop Paper

arXiv:2308.10525 [pdf, other]

LightDepth: Single-View Depth Self-Supervision from Illumination Decline

Authors: Javier Rodríguez-Puigvert, Víctor M. Batlle, J. M. M. Montiel, Ruben Martinez-Cantin, Pascal Fua, Juan D. Tardós, Javier Civera

Abstract: Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, however, with a considerable performance reduction… ▽ More Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, however, with a considerable performance reduction in comparison to supervised case. Instead, we propose a single-view self-supervised method that achieves a performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data. △ Less

Submitted 19 September, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2307.12836 [pdf, other]

doi 10.1002/rob.22232

GNSS-stereo-inertial SLAM for arable farming

Authors: Javier Cremona, Javier Civera, Ernesto Kofman, Taihú Pire

Abstract: The accelerating pace in the automation of agricultural tasks demands highly accurate and robust localization systems for field robots. Simultaneous Localization and Mapping (SLAM) methods inevitably accumulate drift on exploratory trajectories and primarily rely on place revisiting and loop closing to keep a bounded global localization error. Loop closure techniques are significantly challenging… ▽ More The accelerating pace in the automation of agricultural tasks demands highly accurate and robust localization systems for field robots. Simultaneous Localization and Mapping (SLAM) methods inevitably accumulate drift on exploratory trajectories and primarily rely on place revisiting and loop closing to keep a bounded global localization error. Loop closure techniques are significantly challenging in agricultural fields, as the local visual appearance of different views is very similar and might change easily due to weather effects. A suitable alternative in practice is to employ global sensor positioning systems jointly with the rest of the robot sensors. In this paper we propose and implement the fusion of global navigation satellite system (GNSS), stereo views, and inertial measurements for localization purposes. Specifically, we incorporate, in a tightly coupled manner, GNSS measurements into the stereo-inertial ORB-SLAM3 pipeline. We thoroughly evaluate our implementation in the sequences of the Rosario data set, recorded by an autonomous robot in soybean fields, and our own in-house data. Our data includes measurements from a conventional GNSS, rarely included in evaluations of state-of-the-art approaches. We characterize the performance of GNSS-stereo-inertial SLAM in this application case, reporting pose error reductions between 10% and 30% compared to visual-inertial and loosely coupled GNSS-stereo-inertial baselines. In addition to such analysis, we also release the code of our implementation as open source. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: This paper has been accepted for publication in Journal of Field Robotics, 2023

arXiv:2306.16917 [pdf, other]

The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes

Authors: David Recasens, Martin R. Oswald, Marc Pollefeys, Javier Civera

Abstract: Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pi… ▽ More Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pipelines, which tackle the most challenging scenario of exploratory trajectories, suffer from a lack of robustness and proper quantitative evaluation methodologies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings lets us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality. We further present a novel deformable odometry method, dubbed the Drunkard's Odometry, which decomposes optical flow estimates into rigid-body camera motion and non-rigid scene deformations. In order to validate our data, our work contains an evaluation of several baselines as well as a novel tracking error metric which does not require ground truth data. Dataset and code: https://davidrecasens.github.io/TheDrunkard'sOdometry/ △ Less

Submitted 29 June, 2023; originally announced June 2023.

arXiv:2305.12250 [pdf, other]

DAC: Detector-Agnostic Spatial Covariances for Deep Local Features

Authors: Javier Tirado-Garín, Frederik Warburg, Javier Civera

Abstract: Current deep visual local feature detectors do not model the spatial uncertainty of detected features, producing suboptimal results in downstream applications. In this work, we propose two post-hoc covariance estimates that can be plugged into any pretrained deep feature detector: a simple, isotropic covariance estimate that uses the predicted score at a given pixel location, and a full covariance… ▽ More Current deep visual local feature detectors do not model the spatial uncertainty of detected features, producing suboptimal results in downstream applications. In this work, we propose two post-hoc covariance estimates that can be plugged into any pretrained deep feature detector: a simple, isotropic covariance estimate that uses the predicted score at a given pixel location, and a full covariance estimate via the local structure tensor of the learned score maps. Both methods are easy to implement and can be applied to any deep feature detector. We show that these covariances are directly related to errors in feature matching, leading to improvements in downstream tasks, including solving the perspective-n-point problem and motion-only bundle adjustment. Code is available at https://github.com/javrtg/DAC △ Less

Submitted 15 August, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.09566 [pdf, other]

Ray-Patch: An Efficient Querying for Light Field Transformers

Authors: T. Berriel Martins, Javier Civera

Abstract: In this paper we propose the Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views. Our Ray-Patch decoding reduces the computational footprint and increases inference speed up to one order of magnitude compared to previous models, without losing global attention, and hence maintaining specific task metrics. The key idea of our nove… ▽ More In this paper we propose the Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views. Our Ray-Patch decoding reduces the computational footprint and increases inference speed up to one order of magnitude compared to previous models, without losing global attention, and hence maintaining specific task metrics. The key idea of our novel querying is to split the target image into a set of patches, then querying the transformer for each patch to extract a set of feature vectors, which are finally decoded into the target image using convolutional layers. Our experimental results, implementing Ray-Patch in 3 different architectures and evaluating it in 2 different tasks and datasets, demonstrate and quantify the effectiveness of our method, specifically a notable boost in rendering speed for the same task metrics. △ Less

Submitted 17 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

arXiv:2305.09295 [pdf, other]

Graph-based Global Robot Simultaneous Localization and Mapping using Architectural Plans

Authors: Muhammad Shaheer, Jose Andres Millan-Romera, Hriday Bavle, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos

Abstract: In this paper, we propose a solution for graph-based global robot simultaneous localization and mapping (SLAM) using architectural plans. Before the start of the robot operation, the previously available architectural plan of the building is converted into our proposed architectural graph (A-Graph). When the robot starts its operation, it uses its onboard LIDAR and odometry to carry out an online… ▽ More In this paper, we propose a solution for graph-based global robot simultaneous localization and mapping (SLAM) using architectural plans. Before the start of the robot operation, the previously available architectural plan of the building is converted into our proposed architectural graph (A-Graph). When the robot starts its operation, it uses its onboard LIDAR and odometry to carry out an online SLAM relying on our situational graph (S-Graph), which includes both, a representation of the environment with multiple levels of abstractions, such as walls or rooms, and their relationships, as well as the robot poses with their associated keyframes. Our novel graph-to-graph matching method is used to relate the aforementioned S-Graph and A-Graph, which are aligned and merged, resulting in our novel informed Situational Graph (iS-Graph). Our iS-Graph not only provides graph-based global robot localization, but it extends the graph-based SLAM capabilities of the S-Graph by incorporating into it the prior knowledge of the environment existing in the architectural plan △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: 4 Page Workshop paper for ICRA 2023. arXiv admin note: substantial text overlap with arXiv:2303.02076

arXiv:2303.02076 [pdf, other]

Graph-based Global Robot Localization Informing Situational Graphs with Architectural Graphs

Authors: Muhammad Shaheer, Jose Andres Millan-Romera, Hriday Bavle, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos

Abstract: In this paper, we propose a solution for legged robot localization using architectural plans. Our specific contributions towards this goal are several. Firstly, we develop a method for converting the plan of a building into what we denote as an architectural graph (A-Graph). When the robot starts moving in an environment, we assume it has no knowledge about it, and it estimates an online situation… ▽ More In this paper, we propose a solution for legged robot localization using architectural plans. Our specific contributions towards this goal are several. Firstly, we develop a method for converting the plan of a building into what we denote as an architectural graph (A-Graph). When the robot starts moving in an environment, we assume it has no knowledge about it, and it estimates an online situational graph representation (S-Graph) of its surroundings. We develop a novel graph-to-graph matching method, in order to relate the S-Graph estimated online from the robot sensors and the A-Graph extracted from the building plans. Note the challenge in this, as the S-Graph may show a partial view of the full A-Graph, their nodes are heterogeneous and their reference frames are different. After the matching, both graphs are aligned and merged, resulting in what we denote as an informed Situational Graph (iS-Graph), with which we achieve global robot localization and exploitation of prior knowledge from the building plans. Our experiments show that our pipeline shows a higher robustness and a significantly lower pose error than several LiDAR localization baselines. △ Less

Submitted 3 March, 2023; originally announced March 2023.

Comments: 8 pages, 5 Figures, IROS 2023 conference

arXiv:2212.11770 [pdf, other]

S-Graphs+: Real-time Localization and Mapping leveraging Hierarchical Representations

Authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos

Abstract: In this paper, we present an evolved version of Situational Graphs, which jointly models in a single optimizable factor graph (1) a pose graph, as a set of robot keyframes comprising associated measurements and robot poses, and (2) a 3D scene graph, as a high-level representation of the environment that encodes its different geometric elements with semantic attributes and the relational informatio… ▽ More In this paper, we present an evolved version of Situational Graphs, which jointly models in a single optimizable factor graph (1) a pose graph, as a set of robot keyframes comprising associated measurements and robot poses, and (2) a 3D scene graph, as a high-level representation of the environment that encodes its different geometric elements with semantic attributes and the relational information between them. Specifically, our S-Graphs+ is a novel four-layered factor graph that includes: (1) a keyframes layer with robot pose estimates, (2) a walls layer representing wall surfaces, (3) a rooms layer encompassing sets of wall planes, and (4) a floors layer gathering the rooms within a given floor level. The above graph is optimized in real-time to obtain a robust and accurate estimate of the robots pose and its map, simultaneously constructing and leveraging high-level information of the environment. To extract this high-level information, we present novel room and floor segmentation algorithms utilizing the mapped wall planes and free-space clusters. We tested S-Graphs+ on multiple datasets, including simulated and real data of indoor environments from varying construction sites, and on a real public dataset of several indoor office areas. On average over our datasets, S-Graphs+ outperforms the accuracy of the second-best method by a margin of 10.67%, while extending the robot situational awareness by a richer scene model. Moreover, we make the software available as a docker file. △ Less

Submitted 26 May, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

Comments: 8 Pages, 7 Figures, 10 Tables

arXiv:2212.05376 [pdf, other]

What's Wrong with the Absolute Trajectory Error?

Authors: Seong Hun Lee, Javier Civera

Abstract: One of the limitations of the commonly used Absolute Trajectory Error (ATE) is that it is highly sensitive to outliers. As a result, in the presence of just a few outliers, it often fails to reflect the varying accuracy as the inlier trajectory error or the number of outliers varies. In this work, we propose an alternative error metric for evaluating the accuracy of the reconstructed camera trajec… ▽ More One of the limitations of the commonly used Absolute Trajectory Error (ATE) is that it is highly sensitive to outliers. As a result, in the presence of just a few outliers, it often fails to reflect the varying accuracy as the inlier trajectory error or the number of outliers varies. In this work, we propose an alternative error metric for evaluating the accuracy of the reconstructed camera trajectory. Our metric, named Discernible Trajectory Error (DTE), is computed in five steps: (1) Shift the ground-truth and estimated trajectories such that both of their geometric medians are located at the origin. (2) Rotate the estimated trajectory such that it minimizes the sum of geodesic distances between the corresponding camera orientations. (3) Scale the estimated trajectory such that the median distance of the cameras to their geometric median is the same as that of the ground truth. (4) Compute, winsorize and normalize the distances between the corresponding cameras. (5) Obtain the DTE by taking the average of the mean and the root-mean-square (RMS) of the resulting distances. This metric is an attractive alternative to the ATE, in that it is capable of discerning the varying trajectory accuracy as the inlier trajectory error or the number of outliers varies. Using the similar idea, we also propose a novel rotation error metric, named Discernible Rotation Error (DRE), which has similar advantages to the DTE. Furthermore, we propose a simple yet effective method for calibrating the camera-to-marker rotation, which is needed for the computation of our metrics. Our methods are verified through extensive simulations. △ Less

Submitted 9 July, 2024; v1 submitted 10 December, 2022; originally announced December 2022.

arXiv:2211.13551 [pdf, other]

SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks

Authors: Sergio Izquierdo, Javier Civera

Abstract: Estimating a dense depth map from a single view is geometrically ill-posed, and state-of-the-art methods rely on learning depth's relation with visual appearance using deep neural networks. On the other hand, Structure from Motion (SfM) leverages multi-view constraints to produce very accurate but sparse maps, as matching across images is typically limited by locally discriminative texture. In thi… ▽ More Estimating a dense depth map from a single view is geometrically ill-posed, and state-of-the-art methods rely on learning depth's relation with visual appearance using deep neural networks. On the other hand, Structure from Motion (SfM) leverages multi-view constraints to produce very accurate but sparse maps, as matching across images is typically limited by locally discriminative texture. In this work, we combine the strengths of both approaches by proposing a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the performance of single-view depth networks at test time using SfM multi-view cues. Specifically, and differently from the state of the art, we use sparse SfM point clouds as test-time self-supervisory signal, fine-tuning the network encoder to learn a better representation of the test scene. Our results show how the addition of SfM-TTR to several state-of-the-art self-supervised and supervised networks improves significantly their performance, outperforming previous TTR baselines mainly based on photometric multi-view consistency. The code is available at https://github.com/serizba/SfM-TTR. △ Less

Submitted 31 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

arXiv:2211.08754 [pdf, other]

Advanced Situational Graphs for Robot Navigation in Structured Indoor Environments

Authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos

Abstract: Mobile robots extract information from its environment to understand their current situation to enable intelligent decision making and autonomous task execution. In our previous work, we introduced the concept of Situation Graphs (S-Graphs) which combines in a single optimizable graph, the robot keyframes and the representation of the environment with geometric, semantic and topological abstractio… ▽ More Mobile robots extract information from its environment to understand their current situation to enable intelligent decision making and autonomous task execution. In our previous work, we introduced the concept of Situation Graphs (S-Graphs) which combines in a single optimizable graph, the robot keyframes and the representation of the environment with geometric, semantic and topological abstractions. Although S-Graphs were built and optimized in real-time and demonstrated state-of-the-art results, they are limited to specific structured environments with specific hand-tuned dimensions of rooms and corridors. In this work, we present an advanced version of the Situational Graphs (S-Graphs+), consisting of the five layered optimizable graph that includes (1) metric layer along with the graph of free-space clusters (2) keyframe layer where the robot poses are registered (3) metric-semantic layer consisting of the extracted planar walls (4) novel rooms layer constraining the extracted planar walls (5) novel floors layer encompassing the rooms within a given floor level. S-Graphs+ demonstrates improved performance over S-Graphs efficiently extracting the room information while simultaneously improving the pose estimate of the robot, thus extending the robots situational awareness in the form of a five layered environmental model. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: 4 pages, IROS 2022 Workshop Paper. arXiv admin note: text overlap with arXiv:2202.12197

arXiv:2206.02570 [pdf, other]

RODIAN: Robustified Median

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose a robust method for averaging numbers contaminated by a large proportion of outliers. Our method, dubbed RODIAN, is inspired by the key idea of MINPRAN [1]: We assume that the outliers are uniformly distributed within the range of the data and we search for the region that is least likely to contain outliers only. The median of the data within this region is then taken as RODIAN. Our ap… ▽ More We propose a robust method for averaging numbers contaminated by a large proportion of outliers. Our method, dubbed RODIAN, is inspired by the key idea of MINPRAN [1]: We assume that the outliers are uniformly distributed within the range of the data and we search for the region that is least likely to contain outliers only. The median of the data within this region is then taken as RODIAN. Our approach can accurately estimate the true mean of data with more than 50% outliers and runs in time $O(n\log n)$. Unlike other robust techniques, it is completely deterministic and does not rely on a known inlier error bound. Our extensive evaluation shows that RODIAN is much more robust than the median and the least-median-of-squares. This result also holds in the case of non-uniform outlier distributions. △ Less

Submitted 18 November, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

arXiv:2204.14240 [pdf, other]

doi 10.1038/s41597-023-02564-7

EndoMapper dataset of complete calibrated endoscopy procedures

Authors: Pablo Azagra, Carlos Sostres, Ángel Ferrandez, Luis Riazuelo, Clara Tomasini, Oscar León Barbed, Javier Morlana, David Recasens, Victor M. Batlle, Juan J. Gómez-Rodríguez, Richard Elvira, Julia López, Cristina Oriol, Javier Civera, Juan D. Tardós, Ana Cristina Murillo, Angel Lanas, José M. M. Montiel

Abstract: Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introdu… ▽ More Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset vary from the anatomical landmarks, procedure labeling, segmentations, reconstructions, simulated sequences with ground truth and same patient procedures. The software used in this paper is publicly available. △ Less

Submitted 10 October, 2023; v1 submitted 29 April, 2022; originally announced April 2022.

Comments: 17 pages, 14 figures, 8 tables

Journal ref: Sci Data 10, 671 (2023)

arXiv:2203.02459 [pdf, other]

From Simultaneous to Streaming Machine Translation by Leveraging Streaming History

Authors: Javier Iranzo-Sánchez, Jorge Civera, Alfons Juan

Abstract: Simultaneous Machine Translation is the task of incrementally translating an input sentence before it is fully available. Currently, simultaneous translation is carried out by translating each sentence independently of the previously translated text. More generally, Streaming MT can be understood as an extension of Simultaneous MT to the incremental translation of a continuous input text stream. I… ▽ More Simultaneous Machine Translation is the task of incrementally translating an input sentence before it is fully available. Currently, simultaneous translation is carried out by translating each sentence independently of the previously translated text. More generally, Streaming MT can be understood as an extension of Simultaneous MT to the incremental translation of a continuous input text stream. In this work, a state-of-the-art simultaneous sentence-level MT system is extended to the streaming setup by leveraging the streaming history. Extensive empirical results are reported on IWSLT Translation Tasks, showing that leveraging the streaming history leads to significant quality gains. In particular, the proposed system proves to compare favorably to the best performing systems. △ Less

Submitted 31 March, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

Comments: ACL 2022 - Camera ready; v3: expanded data pre-processing

arXiv:2202.12197 [pdf, other]

Situational Graphs for Robot Navigation in Structured Indoor Environments

Authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos

Abstract: Mobile robots should be aware of their situation, comprising the deep understanding of their surrounding environment along with the estimation of its own state, to successfully make intelligent decisions and execute tasks autonomously in real environments. 3D scene graphs are an emerging field of research that propose to represent the environment in a joint model comprising geometric, semantic and… ▽ More Mobile robots should be aware of their situation, comprising the deep understanding of their surrounding environment along with the estimation of its own state, to successfully make intelligent decisions and execute tasks autonomously in real environments. 3D scene graphs are an emerging field of research that propose to represent the environment in a joint model comprising geometric, semantic and relational/topological dimensions. Although 3D scene graphs have already been combined with SLAM techniques to provide robots with situational understanding, further research is still required to effectively deploy them on-board mobile robots. To this end, we present in this paper a novel, real-time, online built Situational Graph (S-Graph), which combines in a single optimizable graph, the representation of the environment with the aforementioned three dimensions, together with the robot pose. Our method utilizes odometry readings and planar surfaces extracted from 3D LiDAR scans, to construct and optimize in real-time a three layered S-Graph that includes (1) a robot tracking layer where the robot poses are registered, (2) a metric-semantic layer with features such as planar walls and (3) our novel topological layer constraining the planar walls using higher-level features such as corridors and rooms. Our proposal does not only demonstrate state-of-the-art results for pose estimation of the robot, but also contributes with a metric-semantic-topological model of the environment △ Less

Submitted 1 July, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

Comments: 8 pages, 6 figures, RAL/IROS 2022

arXiv:2202.01821 [pdf, other]

Danish Airs and Grounds: A Dataset for Aerial-to-Street-Level Place Recognition and Localization

Authors: Andrea Vallone, Frederik Warburg, Hans Hansen, Søren Hauberg, Javier Civera

Abstract: Place recognition and visual localization are particularly challenging in wide baseline configurations. In this paper, we contribute with the \emph{Danish Airs and Grounds} (DAG) dataset, a large collection of street-level and aerial images targeting such cases. Its main challenge lies in the extreme viewing-angle difference between query and reference images with consequent changes in illuminatio… ▽ More Place recognition and visual localization are particularly challenging in wide baseline configurations. In this paper, we contribute with the \emph{Danish Airs and Grounds} (DAG) dataset, a large collection of street-level and aerial images targeting such cases. Its main challenge lies in the extreme viewing-angle difference between query and reference images with consequent changes in illumination and perspective. The dataset is larger and more diverse than current publicly available data, including more than 50 km of road in urban, suburban and rural areas. All images are associated with accurate 6-DoF metadata that allows the benchmarking of visual localization methods. We also propose a map-to-image re-localization pipeline, that first estimates a dense 3D reconstruction from the aerial images and then matches query street-level images to street-level renderings of the 3D model. The dataset can be downloaded at: https://frederikwarburg.github.io/DAG △ Less

Submitted 3 February, 2022; originally announced February 2022.

Comments: Submitted to RA-L (IROS)

arXiv:2202.00765 [pdf, other]

doi 10.1109/LRA.2022.3142905

A Model for Multi-View Residual Covariances based on Perspective Deformation

Authors: Alejandro Fontan, Laura Oliva, Javier Civera, Rudolph Triebel

Abstract: In this work, we derive a model for the covariance of the visual residuals in multi-view SfM, odometry and SLAM setups. The core of our approach is the formulation of the residual covariances as a combination of geometric and photometric noise sources. And our key novel contribution is the derivation of a term modelling how local 2D patches suffer from perspective deformation when imaging 3D surfa… ▽ More In this work, we derive a model for the covariance of the visual residuals in multi-view SfM, odometry and SLAM setups. The core of our approach is the formulation of the residual covariances as a combination of geometric and photometric noise sources. And our key novel contribution is the derivation of a term modelling how local 2D patches suffer from perspective deformation when imaging 3D surfaces around a point. Together, these add up to an efficient and general formulation which not only improves the accuracy of both feature-based and direct methods, but can also be used to estimate more accurate measures of the state entropy and hence better founded point visibility thresholds. We validate our model with synthetic and real data and integrate it into photometric and feature-based Bundle Adjustment, improving their accuracy with a negligible overhead. △ Less

Submitted 1 February, 2022; originally announced February 2022.

arXiv:2201.10602 [pdf, other]

Jacobian Computation for Cumulative B-Splines on SE(3) and Application to Continuous-Time Object Tracking

Authors: Javier Tirado, Javier Civera

Abstract: In this paper we propose a method that estimates the $SE(3)$ continuous trajectories (orientation and translation) of the dynamic rigid objects present in a scene, from multiple RGB-D views. Specifically, we fit the object trajectories to cumulative B-Splines curves, which allow us to interpolate, at any intermediate time stamp, not only their poses but also their linear and angular velocities and… ▽ More In this paper we propose a method that estimates the $SE(3)$ continuous trajectories (orientation and translation) of the dynamic rigid objects present in a scene, from multiple RGB-D views. Specifically, we fit the object trajectories to cumulative B-Splines curves, which allow us to interpolate, at any intermediate time stamp, not only their poses but also their linear and angular velocities and accelerations. Additionally, we derive in this work the analytical $SE(3)$ Jacobians needed by the optimization, being applicable to any other approach that uses this type of curves. To the best of our knowledge this is the first work that proposes 6-DoF continuous-time object tracking, which we endorse with significant computational cost reduction thanks to our analytical derivations. We evaluate our proposal in synthetic data and in a public benchmark, showing competitive results in localization and significant improvements in velocity estimation in comparison to discrete-time approaches. △ Less

Submitted 24 May, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted at IEEE Robotics and Automation Letters

arXiv:2112.08906 [pdf, other]

On the Uncertain Single-View Depths in Colonoscopies

Authors: Javier Rodríguez-Puigvert, David Recasens, Javier Civera, Rubén Martínez-Cantín

Abstract: Estimating depth information from endoscopic images is a prerequisite for a wide set of AI-assisted technologies, such as accurate localization and measurement of tumors, or identification of non-inspected areas. As the domain specificity of colonoscopies -- deformable low-texture environments with fluids, poor lighting conditions and abrupt sensor motions -- pose challenges to multi-view 3D recon… ▽ More Estimating depth information from endoscopic images is a prerequisite for a wide set of AI-assisted technologies, such as accurate localization and measurement of tumors, or identification of non-inspected areas. As the domain specificity of colonoscopies -- deformable low-texture environments with fluids, poor lighting conditions and abrupt sensor motions -- pose challenges to multi-view 3D reconstructions, single-view depth learning stands out as a promising line of research. Depth learning can be extended in a Bayesian setting, which enables continual learning, improves decision making and can be used to compute confidence intervals or quantify uncertainty for in-body measurements. In this paper, we explore for the first time Bayesian deep networks for single-view depth estimation in colonoscopies. Our specific contribution is two-fold: 1) an exhaustive analysis of scalable Bayesian networks for depth learning in different datasets, highlighting challenges and conclusions regarding synthetic-to-real domain changes and supervised vs. self-supervised methods; and 2) a novel teacher-student approach to deep depth learning that takes into account the teacher uncertainty. △ Less

Submitted 20 July, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: 11 pages

arXiv:2111.08831 [pdf, other]

HARA: A Hierarchical Approach for Robust Rotation Averaging

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose a novel hierarchical approach for multiple rotation averaging, dubbed HARA. Our method incrementally initializes the rotation graph based on a hierarchy of triplet support. The key idea is to build a spanning tree by prioritizing the edges with many strong triplet supports and gradually adding those with weaker and fewer supports. This reduces the risk of adding outliers in the spanning… ▽ More We propose a novel hierarchical approach for multiple rotation averaging, dubbed HARA. Our method incrementally initializes the rotation graph based on a hierarchy of triplet support. The key idea is to build a spanning tree by prioritizing the edges with many strong triplet supports and gradually adding those with weaker and fewer supports. This reduces the risk of adding outliers in the spanning tree. As a result, we obtain a robust initial solution that enables us to filter outliers prior to nonlinear optimization. With minimal modification, our approach can also integrate the knowledge of the number of valid 2D-2D correspondences. We perform extensive evaluations on both synthetic and real datasets, demonstrating state-of-the-art results. △ Less

Submitted 29 March, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: Accepted to CVPR2022

arXiv:2104.14202 [pdf, other]

Bayesian Deep Neural Networks for Supervised Learning of Single-View Depth

Authors: Javier Rodríguez-Puigvert, Rubén Martínez-Cantín, Javier Civera

Abstract: Uncertainty quantification is essential for robotic perception, as overconfident or point estimators can lead to collisions and damages to the environment and the robot. In this paper, we evaluate scalable approaches to uncertainty quantification in single-view supervised depth learning, specifically MC dropout and deep ensembles. For MC dropout, in particular, we explore the effect of the dropout… ▽ More Uncertainty quantification is essential for robotic perception, as overconfident or point estimators can lead to collisions and damages to the environment and the robot. In this paper, we evaluate scalable approaches to uncertainty quantification in single-view supervised depth learning, specifically MC dropout and deep ensembles. For MC dropout, in particular, we explore the effect of the dropout at different levels in the architecture. We show that adding dropout in all layers of the encoder brings better results than other variations found in the literature. This configuration performs similarly to deep ensembles with a much lower memory footprint, which is relevant forapplications. Finally, we explore the use of depth uncertainty for pseudo-RGBD ICP and demonstrate its potential to estimate accurate two-view relative motion with the real scale. △ Less

Submitted 15 December, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

arXiv:2104.08817 [pdf, other]

Stream-level Latency Evaluation for Simultaneous Machine Translation

Authors: Javier Iranzo-Sánchez, Jorge Civera, Alfons Juan

Abstract: Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the s… ▽ More Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, that is successfully evaluated on streaming conditions for a reference IWSLT task. △ Less

Submitted 8 September, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: EMNLP 2021 Camera Ready

arXiv:2103.16525 [pdf, other]

Endo-Depth-and-Motion: Reconstruction and Tracking in Endoscopic Videos using Depth Networks and Photometric Constraints

Authors: David Recasens, José Lamarca, José M. Fácil, J. M. M. Montiel, Javier Civera

Abstract: Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-su… ▽ More Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation in the public dataset Hamlyn, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons. △ Less

Submitted 3 July, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

arXiv:2011.12663 [pdf, other]

Bayesian Triplet Loss: Uncertainty Quantification in Image Retrieval

Authors: Frederik Warburg, Martin Jørgensen, Javier Civera, Søren Hauberg

Abstract: Uncertainty quantification in image retrieval is crucial for downstream decisions, yet it remains a challenging and largely unexplored problem. Current methods for estimating uncertainties are poorly calibrated, computationally expensive, or based on heuristics. We present a new method that views image embeddings as stochastic features rather than deterministic features. Our two main contributions… ▽ More Uncertainty quantification in image retrieval is crucial for downstream decisions, yet it remains a challenging and largely unexplored problem. Current methods for estimating uncertainties are poorly calibrated, computationally expensive, or based on heuristics. We present a new method that views image embeddings as stochastic features rather than deterministic features. Our two main contributions are (1) a likelihood that matches the triplet constraint and that evaluates the probability of an anchor being closer to a positive than a negative; and (2) a prior over the feature space that justifies the conventional l2 normalization. To ensure computational efficiency, we derive a variational approximation of the posterior, called the Bayesian triplet loss, that produces state-of-the-art uncertainty estimates and matches the predictive performance of current state-of-the-art methods. △ Less

Submitted 17 September, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

Journal ref: 2021 ICCV

arXiv:2011.11724 [pdf, other]

Rotation-Only Bundle Adjustment

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose a novel method for estimating the global rotations of the cameras independently of their positions and the scene structure. When two calibrated cameras observe five or more of the same points, their relative rotation can be recovered independently of the translation. We extend this idea to multiple views, thereby decoupling the rotation estimation from the translation and structure esti… ▽ More We propose a novel method for estimating the global rotations of the cameras independently of their positions and the scene structure. When two calibrated cameras observe five or more of the same points, their relative rotation can be recovered independently of the translation. We extend this idea to multiple views, thereby decoupling the rotation estimation from the translation and structure estimation. Our approach provides several benefits such as complete immunity to inaccurate translations and structure, and the accuracy improvement when used with rotation averaging. We perform extensive evaluations on both synthetic and real datasets, demonstrating consistent and significant gains in accuracy when used with the state-of-the-art rotation averaging method. △ Less

Submitted 27 March, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: Accepted to CVPR 2021

arXiv:2010.00052 [pdf, other]

DOT: Dynamic Object Tracking for Visual SLAM

Authors: Irene Ballester, Alejandro Fontan, Javier Civera, Klaus H. Strobl, Rudolph Triebel

Abstract: In this paper we present DOT (Dynamic Object Tracking), a front-end that added to existing SLAM systems can significantly improve their robustness and accuracy in highly dynamic environments. DOT combines instance segmentation and multi-view geometry to generate masks for dynamic objects in order to allow SLAM systems based on rigid scene models to avoid such image areas in their optimizations.… ▽ More In this paper we present DOT (Dynamic Object Tracking), a front-end that added to existing SLAM systems can significantly improve their robustness and accuracy in highly dynamic environments. DOT combines instance segmentation and multi-view geometry to generate masks for dynamic objects in order to allow SLAM systems based on rigid scene models to avoid such image areas in their optimizations. To determine which objects are actually moving, DOT segments first instances of potentially dynamic objects and then, with the estimated camera motion, tracks such objects by minimizing the photometric reprojection error. This short-term tracking improves the accuracy of the segmentation with respect to other approaches. In the end, only actually dynamic masks are generated. We have evaluated DOT with ORB-SLAM 2 in three public datasets. Our results show that our approach improves significantly the accuracy and robustness of ORB-SLAM 2, especially in highly dynamic scenes. △ Less

Submitted 30 September, 2020; originally announced October 2020.

arXiv:2008.01258 [pdf, other]

Robust Uncertainty-Aware Multiview Triangulation

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose a robust and efficient method for multiview triangulation and uncertainty estimation. Our contribution is threefold: First, we propose an outlier rejection scheme using two-view RANSAC with the midpoint method. By prescreening the two-view samples prior to triangulation, we achieve the state-of-the-art efficiency. Second, we compare different local optimization methods for refining the… ▽ More We propose a robust and efficient method for multiview triangulation and uncertainty estimation. Our contribution is threefold: First, we propose an outlier rejection scheme using two-view RANSAC with the midpoint method. By prescreening the two-view samples prior to triangulation, we achieve the state-of-the-art efficiency. Second, we compare different local optimization methods for refining the initial solution and the inlier set. With an iterative update of the inlier set, we show that the optimization provides significant improvement in accuracy and robustness. Third, we model the uncertainty of a triangulated point as a function of three factors: the number of cameras, the mean reprojection error and the maximum parallax angle. Learning this model allows us to quickly interpolate the uncertainty at test time. We validate our method through an extensive evaluation. △ Less

Submitted 5 August, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

arXiv:2008.01254 [pdf, other]

Geometric Interpretations of the Normalized Epipolar Error

Authors: Seong Hun Lee, Javier Civera

Abstract: In this work, we provide geometric interpretations of the normalized epipolar error. Most notably, we show that it is directly related to the following quantities: (1) the shortest distance between the two backprojected rays, (2) the dihedral angle between the two bounding epipolar planes, and (3) the $L_1$-optimal angular reprojection error. In this work, we provide geometric interpretations of the normalized epipolar error. Most notably, we show that it is directly related to the following quantities: (1) the shortest distance between the two backprojected rays, (2) the dihedral angle between the two bounding epipolar planes, and (3) the $L_1$-optimal angular reprojection error. △ Less

Submitted 30 December, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

arXiv:2004.00732 [pdf, other]

Robust Single Rotation Averaging

Authors: Seong Hun Lee, Javier Civera

Abstract: We propose a novel method for single rotation averaging using the Weiszfeld algorithm. Our contribution is threefold: First, we propose a robust initialization based on the elementwise median of the input rotation matrices. Our initial solution is more accurate and robust than the commonly used chordal $L_2$-mean. Second, we propose an outlier rejection scheme that can be incorporated in the Weisz… ▽ More We propose a novel method for single rotation averaging using the Weiszfeld algorithm. Our contribution is threefold: First, we propose a robust initialization based on the elementwise median of the input rotation matrices. Our initial solution is more accurate and robust than the commonly used chordal $L_2$-mean. Second, we propose an outlier rejection scheme that can be incorporated in the Weiszfeld algorithm to improve the robustness of $L_1$ rotation averaging. Third, we propose a method for approximating the chordal $L_1$-mean using the Weiszfeld algorithm. An extensive evaluation shows that both our method and the state of the art perform equally well with the proposed outlier rejection scheme, but ours is $2-4$ times faster. △ Less

Submitted 4 November, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

arXiv:2003.09025 [pdf, other]

LoCoQuad: A Low-Cost Arachnoid Quadruped Robot for Research and Education

Authors: Manuel Bernal, Javier Civera

Abstract: Developing real robotic systems requires a tight integration of mechanics, electronics and software. Most of the times, existing robotic platforms are either closed or expensive or both, and in-house solutions are costly to develop and maintain. Open-source and low-cost designs are essential to facilitate the access to real robotic platforms and enable further progress in the field. LoCoQuad is… ▽ More Developing real robotic systems requires a tight integration of mechanics, electronics and software. Most of the times, existing robotic platforms are either closed or expensive or both, and in-house solutions are costly to develop and maintain. Open-source and low-cost designs are essential to facilitate the access to real robotic platforms and enable further progress in the field. LoCoQuad is an arachnoid quadruped platform that we designed targeting research and education in robotics. To meet these two ends, our platform allows for a high degree of flexibility and configurability. Our legged design has the lowest hardware cost of the state of the art, in the range of 150-165USD. We validated the robot platform by running several experiments showing over all functionalities. All the mechanical and electronic designs and all the software have been made open source and can be found at: https://github.com/TomBlackroad/LoCoQuad △ Less

Submitted 19 March, 2020; originally announced March 2020.

Comments: 8 pages, 12 figures

arXiv:2001.06875 [pdf, other]

doi 10.1007/978-3-030-28603-3_6

RGB-D Odometry and SLAM

Authors: Javier Civera, Seong Hun Lee

Abstract: The emergence of modern RGB-D sensors had a significant impact in many application fields, including robotics, augmented reality (AR) and 3D scanning. They are low-cost, low-power and low-size alternatives to traditional range sensors such as LiDAR. Moreover, unlike RGB cameras, RGB-D sensors provide the additional depth information that removes the need of frame-by-frame triangulation for 3D scen… ▽ More The emergence of modern RGB-D sensors had a significant impact in many application fields, including robotics, augmented reality (AR) and 3D scanning. They are low-cost, low-power and low-size alternatives to traditional range sensors such as LiDAR. Moreover, unlike RGB cameras, RGB-D sensors provide the additional depth information that removes the need of frame-by-frame triangulation for 3D scene reconstruction. These merits have made them very popular in mobile robotics and AR, where it is of great interest to estimate ego-motion and 3D scene structure. Such spatial understanding can enable robots to navigate autonomously without collisions and allow users to insert virtual entities consistent with the image stream. In this chapter, we review common formulations of odometry and Simultaneous Localization and Mapping (known by its acronym SLAM) using RGB-D stream input. The two topics are closely related, as the former aims to track the incremental camera motion with respect to a local map of the scene, and the latter to jointly estimate the camera trajectory and the global map with consistency. In both cases, the standard approaches minimize a cost function using nonlinear optimization techniques. This chapter consists of three main parts: In the first part, we introduce the basic concept of odometry and SLAM and motivate the use of RGB-D sensors. We also give mathematical preliminaries relevant to most odometry and SLAM algorithms. In the second part, we detail the three main components of SLAM systems: camera pose tracking, scene mapping and loop closing. For each component, we describe different approaches proposed in the literature. In the final part, we provide a brief discussion on advanced research topics with the references to the state-of-the-art. △ Less

Submitted 19 January, 2020; originally announced January 2020.

Comments: This is the pre-submission version of the manuscript that was later edited and published as a chapter in RGB-D Image Analysis and Processing

arXiv:1911.03167 [pdf, ps, other]

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Authors: Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, Alfons Juan

Abstract: Current research into spoken language translation (SLT),or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for… ▽ More Current research into spoken language translation (SLT),or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable. △ Less

Submitted 12 February, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Accepted by ICASSP2020. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:1907.11917 [pdf, other]

Triangulation: Why Optimize?

Authors: Seong Hun Lee, Javier Civera

Abstract: For decades, it has been widely accepted that the gold standard for two-view triangulation is to minimize the cost based on reprojection errors. In this work, we challenge this idea. We propose a novel alternative to the classic midpoint method that leads to significantly lower 2D errors and parallax errors. It provides a numerically stable closed-form solution based solely on a pair of backprojec… ▽ More For decades, it has been widely accepted that the gold standard for two-view triangulation is to minimize the cost based on reprojection errors. In this work, we challenge this idea. We propose a novel alternative to the classic midpoint method that leads to significantly lower 2D errors and parallax errors. It provides a numerically stable closed-form solution based solely on a pair of backprojected rays. Since our solution is rotationally invariant, it can also be applied for fisheye and omnidirectional cameras. We show that for small parallax angles, our method outperforms the state-of-the-art in terms of combined 2D, 3D and parallax accuracy, while achieving comparable speed. △ Less

Submitted 23 August, 2019; v1 submitted 27 July, 2019; originally announced July 2019.

Comments: Accepted to BMVC2019 (oral presentation)

arXiv:1904.02028 [pdf, other]

CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth

Authors: Jose M. Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, Javier Civera

Abstract: Single-view depth estimation suffers from the problem that a network trained on images from one camera does not generalize to images taken with a different camera model. Thus, changing the camera model requires collecting an entirely new training dataset. In this work, we propose a new type of convolution that can take the camera parameters into account, thus allowing neural networks to learn cali… ▽ More Single-view depth estimation suffers from the problem that a network trained on images from one camera does not generalize to images taken with a different camera model. Thus, changing the camera model requires collecting an entirely new training dataset. In this work, we propose a new type of convolution that can take the camera parameters into account, thus allowing neural networks to learn calibration-aware patterns. Experiments confirm that this improves the generalization capabilities of depth prediction networks considerably, and clearly outperforms the state of the art when the train and test images are acquired with different cameras. △ Less

Submitted 3 April, 2019; originally announced April 2019.

Comments: Camera ready version for CVPR 2019. Project page: http://webdiis.unizar.es/~jmfacil/camconvs/

arXiv:1903.09115 [pdf, other]

Closed-Form Optimal Two-View Triangulation Based on Angular Errors

Authors: Seong Hun Lee, Javier Civera

Abstract: In this paper, we study closed-form optimal solutions to two-view triangulation with known internal calibration and pose. By formulating the triangulation problem as $L_1$ and $L_\infty$ minimization of angular reprojection errors, we derive the exact closed-form solutions that guarantee global optimality under respective cost functions. To the best of our knowledge, we are the first to present su… ▽ More In this paper, we study closed-form optimal solutions to two-view triangulation with known internal calibration and pose. By formulating the triangulation problem as $L_1$ and $L_\infty$ minimization of angular reprojection errors, we derive the exact closed-form solutions that guarantee global optimality under respective cost functions. To the best of our knowledge, we are the first to present such solutions. Since the angular error is rotationally invariant, our solutions can be applied for any type of central cameras, be it perspective, fisheye or omnidirectional. Our methods also require significantly less computation than the existing optimal methods. Experimental results on synthetic and real datasets validate our theoretical derivations. △ Less

Submitted 29 July, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

Comments: Accepted to ICCV2019

arXiv:1903.08094 [pdf, other]

Corners for Layout: End-to-End Layout Recovery from 360 Images

Authors: Clara Fernandez-Labrador, Jose M. Facil, Alejandro Perez-Yus, Cédric Demonceaux, Javier Civera, Jose J. Guerrero

Abstract: The problem of 3D layout recovery in indoor scenes has been a core research topic for over a decade. However, there are still several major challenges that remain unsolved. Among the most relevant ones, a major part of the state-of-the-art methods make implicit or explicit assumptions on the scenes -- e.g. box-shaped or Manhattan layouts. Also, current methods are computationally expensive and not… ▽ More The problem of 3D layout recovery in indoor scenes has been a core research topic for over a decade. However, there are still several major challenges that remain unsolved. Among the most relevant ones, a major part of the state-of-the-art methods make implicit or explicit assumptions on the scenes -- e.g. box-shaped or Manhattan layouts. Also, current methods are computationally expensive and not suitable for real-time applications like robot navigation and AR/VR. In this work we present CFL (Corners for Layout), the first end-to-end model for 3D layout recovery on 360 images. Our experimental results show that we outperform the state of the art relaxing assumptions about the scene and at a lower cost. We also show that our model generalizes better to camera position variations than conventional approaches by using EquiConvs, a type of convolution applied directly on the sphere projection and hence invariant to the equirectangular distortions. CFL Webpage: https://cfernandezlab.github.io/CFL/ △ Less

Submitted 25 March, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

Showing 1–50 of 57 results for author: Civera, J