Article
Semantic Evidential Grid Mapping Using Monocular and
Stereo Cameras †
Sven Richter 1, * , Yiqun Wang 1 , Johannes Beck 2 , Sascha Wirges 1 and Christoph Stiller 1
1 Institute of Measurement and Control Systems, Karlsruhe Institute of Technology (KIT), Engler-Bunte-Ring 21,
76131 Karlsruhe, Germany; [email protected] (Y.W.); [email protected] (S.W.);
[email protected] (C.S.)
2 Atlatec GmbH, Haid-und-Neu-Straße 7, 76131 Karlsruhe, Germany; [email protected]
* Correspondence: [email protected]
† This paper is an extended version of our paper published in Richter, S.; Beck, J.; Wirges, S.; Stiller, C. Semantic
Evidential Grid Mapping based on Stereo Vision. In Proceedings of the 2020 IEEE International Conference
on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September
2020; pp. 179–184, doi:10.1109/MFI49285.2020.9235217.
Abstract: Accurately estimating the current state of local traffic scenes is one of the key problems in the development of software components for automated vehicles. In addition to details on free space and drivability, static and dynamic traffic participants and information on the semantics may also be included in the desired representation. Multi-layer grid maps allow the inclusion of all of this information in a common representation. However, most existing grid mapping approaches only process range sensor measurements such as Lidar and Radar and solely model occupancy without semantic states. In order to add sensor redundancy and diversity, it is desirable to integrate vision-based sensor setups into a common grid map representation. In this work, we present a semantic evidential grid mapping pipeline, including estimates for eight semantic classes, that is designed for straightforward fusion with range sensor data. Unlike other publications, our representation explicitly models uncertainties in the evidential model. We present results of our grid mapping pipeline based on a monocular vision setup and a stereo vision setup. Our mapping results are accurate and dense due to the incorporation of a disparity- or depth-based ground surface estimation in the inverse perspective mapping. We conclude this paper by providing a detailed quantitative evaluation based on real traffic scenarios in the KITTI odometry benchmark dataset and demonstrating the advantages compared to other semantic grid mapping approaches.

Keywords: autonomous driving; environment perception; grid mapping; stereo vision; monocular vision

Citation: Richter, S.; Wang, Y.; Beck, J.; Wirges, S.; Stiller, C. Semantic Evidential Grid Mapping Using Monocular and Stereo Cameras. Sensors 2021, 21, 3380. https://doi.org/10.3390/s21103380

Academic Editors: Sukhan Lee, Uwe D. Hanebeck and Florian Pfaff

Received: 4 April 2021; Accepted: 7 May 2021; Published: 12 May 2021
Compared to cameras, Lidar sensors are still more expensive. Furthermore, cameras are
superior when it comes to understanding semantic details in the environment. In [1], we
presented a semantic evidential fusion approach for multi-layer grid maps by introducing
a refined set of hypotheses that allows the joint modeling of occupancy and semantic states
in a common representation. In this work, we use the same evidence theoretical framework
and present two improved sensor models for stereo vision and monocular vision that can
be incorporated in the sensor data fusion presented in [1].
In the remainder of this section, we briefly introduce the terms of the Dempster–Shafer
theory (Section 1.1) relevant to this work. We then review past publications on stereo
vision-based and monocular vision-based grid mapping, monocular depth estimation and
semantic grid mapping in Section 1.2, followed by highlighting our focus for the proposed
methods in Section 1.3. In Section 2, we give an overview of our semantic evidential
models and the multi-layer grid map representations. We further describe our proposed
semantic evidential grid mapping pipelines, depicted in Figure 1, in detail. We evaluate
our processing steps based on challenging real traffic scenarios and compare the results of
both methods in Section 3. Finally, we conclude this paper and give an outlook on future
work in Section 4.
Figure 1. Overview of the described grid mapping framework. On the front end, either monocular images are processed to obtain depth maps (1a) or stereo images are used to estimate a disparity map (1b). Both of them are accompanied by a pixel-wise semantic segmentation image. The images are used as input for a label histogram calculation in a setup-dependent grid in the second step (2). This label histogram is transformed into a Cartesian grid (3) and finally transformed into a semantic evidential grid map (4).
A basic belief assignment (BBA) is a mapping

m : 2^Ω → [0, 1],   m(∅) = 0,   ∑_{A ∈ 2^Ω} m(A) = 1.

The mass assigned to the full set Ω models the amount of total ignorance explicitly. Based on a BBA, lower and upper bounds for the probability mass Pr(·) of a set A ∈ 2^Ω can be deduced as

bel(A) := ∑_{B ⊆ A} m(B) ≤ Pr(A) ≤ ∑_{B ∩ A ≠ ∅} m(B) =: pl(A),

where bel(·) and pl(·) are called belief and plausibility, respectively.
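To make these bounds concrete, the following minimal sketch (Python, not part of the original paper) computes belief and plausibility for a toy two-hypothesis occupancy frame; the frame and mass values are illustrative only.

```python
# Toy example of a BBA with belief and plausibility bounds. The two-hypothesis
# frame and the mass values are illustrative only.
omega = frozenset({"occupied", "free"})
m = {frozenset({"occupied"}): 0.5,
     frozenset({"free"}): 0.25,
     omega: 0.25}                 # masses sum to 1, m(empty set) = 0

def belief(A, m):
    """bel(A): total mass of all focal sets contained in A."""
    return sum(mass for B, mass in m.items() if B <= A)

def plausibility(A, m):
    """pl(A): total mass of all focal sets intersecting A."""
    return sum(mass for B, mass in m.items() if B & A)

A = frozenset({"occupied"})
print(belief(A, m), plausibility(A, m))   # 0.5 <= Pr(A) <= 0.75
```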
Without requiring a depth ground truth dataset, an unsupervised learning framework is presented in [20] for the task of monocular depth and camera motion estimation from unstructured video sequences. In [21], the authors generate disparity images from monocular images by training the network with an image reconstruction loss and a stereo image training dataset, exploiting
epipolar geometry constraints. Finally, Qiao et al. tackle the inverse projection problem
in [22] by jointly performing monocular depth estimation and video panoptic segmentation.
With their method, they are able to generate 3D point clouds with instance-level semantic
estimates for each point.
The frame of discernment Ω consists of the hypotheses car (c), cyclist (cy), pedestrian (p), other movable object (om), non-movable object (nm), street (s), sidewalk (sw) and terrain (t). This hypotheses set can be seen as a refinement of the classical occupancy frame consisting of the two hypotheses occupied and free by considering the hypotheses sets

O := {{c}, {cy}, {p}, {om}, {nm}} ⊂ 2^Ω

and

F := {{s}, {sw}, {t}} ⊂ 2^Ω.
This makes it particularly suitable for the fusion of semantic estimates with range
measurements in top-view, as outlined in [1]. For the BBA, we consider the hypotheses set S consisting of singletons, as all hypotheses combinations are either conflicting by definition or not estimated by the semantic labeling. We define the two-dimensional grid G = P_1 × P_2 on the rectangular region of interest R = I_1 × I_2 ⊂ ℝ², where each P_i forms a partition of the interval I_i with equidistant length δ_i ∈ ℝ, origin o_i ∈ ℝ and size s_i ∈ ℕ. The BBA m on 2^Ω is then represented by the multi-layer grid map

g_M : G × S → [0, 1],   (C, ω) ↦ m_C(ω).
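One possible way to store such a multi-layer grid map is one 2D layer per hypothesis, with the residual mass kept on the full set Ω. The class below is an illustrative sketch under these assumptions; the layer names, the inclusion of an Ω layer and the indexing convention are ours, not prescribed by the paper.

```python
import numpy as np

# One layer per hypothesis; keeping a layer for the full set Omega (ignorance)
# is our illustrative choice, not prescribed by the paper.
HYPOTHESES = ["car", "cyclist", "pedestrian", "other_movable",
              "non_movable", "street", "sidewalk", "terrain", "Omega"]

class MultiLayerGridMap:
    """Illustrative storage of g_M : G x S -> [0, 1] as stacked 2D layers."""

    def __init__(self, origin, lengths, cell_size):
        # Each interval I_i is split into s_i cells of equidistant length delta_i.
        self.origin = np.asarray(origin, dtype=float)            # (o_1, o_2)
        self.cell_size = float(cell_size)                        # delta
        self.shape = tuple(int(l / cell_size) for l in lengths)  # (s_1, s_2)
        # All mass starts on Omega: complete ignorance in every cell.
        self.layers = np.zeros(self.shape + (len(HYPOTHESES),))
        self.layers[..., HYPOTHESES.index("Omega")] = 1.0

    def cell_index(self, x, y):
        """Cell C containing the metric position (x, y) inside the region of interest."""
        return tuple(((np.array([x, y]) - self.origin) / self.cell_size).astype(int))

    def bba(self, x, y):
        """BBA m_C of the cell containing (x, y), as a hypothesis -> mass dict."""
        return dict(zip(HYPOTHESES, self.layers[self.cell_index(x, y)]))
```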
which are the pixel-wise semantic labeling image f_sem, the disparity image f_disp and the disparity confidence image f_conf,disp.
In the case of measurements stemming from a monocular camera, the disparity image f_disp is replaced by the depth image f_depth : G_uv → ℝ:

M_mono = ({P_i ∈ G_uv, i ∈ {1, . . . , n}}, {f_sem, f_depth, f_conf,depth}).
Note that the confidence images may be set to one for all pixels in case the disparity or depth estimation does not provide confidence values. In this case, every pixel is assigned the same weight in the subsequent grid mapping pipeline. Figure 2 shows an example of the stereo vision measurements that were used in [23].
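A small sketch of how the monocular measurement tuple might be assembled when the depth network provides no confidence image; the dictionary layout and function name are assumptions for illustration.

```python
import numpy as np

def make_mono_measurement(f_sem, f_depth, f_conf_depth=None):
    """Bundle the per-pixel inputs of the monocular pipeline (illustrative layout).

    f_sem: (H, W) label image, f_depth: (H, W) metric depth image,
    f_conf_depth: optional (H, W) confidence image in [0, 1].
    """
    if f_conf_depth is None:
        # The depth network used here provides no confidence, so every pixel
        # receives the same weight in the subsequent grid mapping pipeline.
        f_conf_depth = np.ones_like(f_depth)
    assert f_sem.shape == f_depth.shape == f_conf_depth.shape
    return {"sem": f_sem, "depth": f_depth, "conf": f_conf_depth}
```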
Figure 2. The three input images to our stereo vision-based grid mapping pipeline used in [23]: (a) pixel-wise semantic labeling image; (b) stereo disparity image; (c) stereo disparity confidence image. © 2021 IEEE. Reprinted, with permission, from Richter, S.; Beck, J.; Wirges, S.; Stiller, C. Semantic Evidential Grid Mapping based on Stereo Vision. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September 2020; pp. 179–184, doi:10.1109/MFI49285.2020.9235217.
Figure 3. Mapping of measurements with assigned object labels from image to u-disparity/-depth grid.
Treating pixels with assigned object labels as points is a simplification based on a lack of knowledge about object surfaces. For pixels labeled as ground, however, the surface can be assumed to be locally planar. We use this prior knowledge and propose an approximating pixel-to-area correspondence to obtain dense mapping results for the ground hypotheses. The label histogram w_ω for the ground hypotheses ω ∈ F is given by
w_ω(C, P) = (1 / μ(C)) ∫_C (f_X ∗ 1_{A_P})(x) dx,   (1)
where f_X is the probability density function of the random variable X modeling the measurement position and

A_P = T_{uv}^{ud}(P) ⊂ R_M
is the area in the grid space corresponding to the measurement pixel P. For pixels classified with ground labels, the shape of this area depends on the ground surface. We approximate the resulting label histogram for ground hypotheses by approximating A_P with rectangles based on an encapsulated ground surface estimation in three steps: ground estimation, pixel area approximation and area integral calculation.
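The following sketch outlines how these three steps could be organized. The least-squares plane fit, the uniform density f_X over the rectangle and the helper object `grid` (with a hypothetical `add_rectangle` rasterization method) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ground_label_histogram(ground_pixels, f_depth, grid):
    """Illustrative sketch of the three steps for ground hypotheses."""
    # 1) Ground estimation: fit a plane z = a*u + b*v + c to the depths of the
    #    ground-labeled pixels (a simple least-squares stand-in).
    u = ground_pixels[:, 0].astype(float)
    v = ground_pixels[:, 1].astype(float)
    z = f_depth[ground_pixels[:, 1], ground_pixels[:, 0]]
    A = np.column_stack([u, v, np.ones_like(u)])
    plane, *_ = np.linalg.lstsq(A, z, rcond=None)

    w = np.zeros(grid.shape)          # label histogram w_omega on the u-depth grid
    for ui, vi in ground_pixels:
        # 2) Pixel area approximation: evaluate the ground plane at the four
        #    pixel corners and keep the axis-aligned rectangle in (u, z).
        corners = [(ui + du, vi + dv) for du in (0, 1) for dv in (0, 1)]
        depths = [plane @ np.array([cu, cv, 1.0]) for cu, cv in corners]
        rect = (ui, ui + 1, min(depths), max(depths))
        # 3) Area integral calculation: spread the pixel's weight over all grid
        #    cells overlapping the rectangle, assuming a uniform density f_X.
        #    `grid.add_rectangle` is a hypothetical rasterization helper.
        grid.add_rectangle(w, rect, weight=1.0)
    return w
```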
where D_1 and D_2 are the distance transforms based on the masks defining the observed and unobserved image regions, respectively, and T_1, T_2 ∈ ℝ are thresholds defining the interpolation neighborhood. We justify the application of this data augmentation by the assumption that gradient jumps in the height profile would lead to gradient jumps in the pixel intensity and thus imply well-estimated disparity and depth. Consequently, the inpainted regions are restricted to areas without jumps in the height profile.
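As an illustration of this inpainting step, the sketch below builds a mask of thin unobserved regions from a distance transform and fills it with the Navier-Stokes-based inpainting referenced in [24] (OpenCV's cv2.inpaint). The exact combination of D_1, D_2, T_1 and T_2 used in the paper is not reproduced; the threshold and the 16-bit conversion are assumptions.

```python
import cv2
import numpy as np

def inpaint_depth(f_depth, max_hole_dist=5.0):
    """Fill thin unobserved regions in a metric depth image (illustrative sketch).

    The paper combines two distance transforms D1, D2 with thresholds T1, T2;
    their exact combination is not reproduced here. We simply inpaint
    unobserved pixels that lie close to observed ones.
    """
    observed = (f_depth > 0).astype(np.uint8)
    unobserved = (1 - observed).astype(np.uint8)

    # Distance of every pixel to the nearest observed pixel (0 on observed pixels).
    dist_to_observed = cv2.distanceTransform(unobserved, cv2.DIST_L2, 3)

    # Inpaint only small holes that are surrounded by observations.
    mask = ((unobserved == 1) & (dist_to_observed < max_hole_dist)).astype(np.uint8)

    # Navier-Stokes based inpainting as referenced in [24]; depth is converted
    # to 16-bit millimetres, which assumes a reasonably recent OpenCV version.
    depth_mm = np.clip(f_depth * 1000.0, 0, 65535).astype(np.uint16)
    filled = cv2.inpaint(depth_mm, mask, 3, cv2.INPAINT_NS)
    return filled.astype(np.float32) / 1000.0
```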
where K is the 3 × 4 pinhole camera matrix. Based on the projected corner points, the projected rectangle R_P is then given by their axis-aligned bounding rectangle in the u-disparity/-depth grid, as depicted in Figure 4.
Figure 4. Mapping of measurements with assigned ground labels from image to u-disparity/-depth grid.
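A minimal sketch of this projection step, assuming corner points given in the camera frame with the optical axis as depth direction; the axis-aligned bounding rectangle is taken as the projected rectangle.

```python
import numpy as np

def project_ground_rectangle(corners_xyz, K):
    """Project 3D ground-patch corners with the 3x4 pinhole matrix K and return
    the axis-aligned rectangle (u_min, u_max, z_min, z_max) in the u-depth grid.

    corners_xyz: (4, 3) corner points in the camera frame; we assume the
    z axis is the optical axis, so z is the depth coordinate of the grid.
    """
    pts_h = np.hstack([corners_xyz, np.ones((len(corners_xyz), 1))])  # homogeneous
    proj = (K @ pts_h.T).T                    # (4, 3) image points, homogeneous
    u = proj[:, 0] / proj[:, 2]               # dehomogenized pixel columns
    z = corners_xyz[:, 2]                     # depth along the optical axis
    return u.min(), u.max(), z.min(), z.max()
```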
f̄_ω(I_{1,k}, I_{2,l}) = ∑_{i=0}^{k} ∑_{j=0}^{l} 1_{{ω}}(f_sem(I_{1,i}, I_{2,j}))
Note that in the equation above, u_i and v_i are sub-pixel coordinates and the corresponding integral image f̄_ω is evaluated using bilinear interpolation.
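The per-class integral image and its sub-pixel evaluation might look as follows; boundary handling is simplified and coordinates are assumed to lie inside the image.

```python
import numpy as np

def class_integral_image(f_sem, label):
    """Cumulative count of pixels carrying the given class label."""
    ind = (f_sem == label).astype(np.float64)
    return ind.cumsum(axis=0).cumsum(axis=1)

def bilinear(img, u, v):
    """Evaluate an image at sub-pixel coordinates (u: column, v: row)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, img.shape[1] - 1), min(v0 + 1, img.shape[0] - 1)
    au, av = u - u0, v - v0
    return ((1 - au) * (1 - av) * img[v0, u0] + au * (1 - av) * img[v0, u1]
            + (1 - au) * av * img[v1, u0] + au * av * img[v1, u1])

def rect_label_count(integral, u_min, v_min, u_max, v_max):
    """Approximate number of class pixels inside a rectangle from four look-ups."""
    return (bilinear(integral, u_max, v_max) - bilinear(integral, u_min, v_max)
            - bilinear(integral, u_max, v_min) + bilinear(integral, u_min, v_min))
```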
Figure 5. In the Cartesian grid on the right-hand side, the grid cells are influenced by the distorted overlaid areas based on the corresponding u-disparity or u-depth grid cell, respectively.
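For reference, the standard pinhole/stereo relations behind this transformation are sketched below; the coordinate conventions (x along the optical axis, y to the left) are assumptions and may differ from the paper's.

```python
def ud_to_xy(u, d, fx, cx, baseline):
    """Stereo case: map a u-disparity coordinate to Cartesian coordinates.

    Assumed conventions: x points along the optical axis, y to the left,
    disparity d > 0 in pixels, baseline in metres.
    """
    z = fx * baseline / d         # metric depth from disparity
    return z, -(u - cx) * z / fx  # (x forward, y lateral)

def uz_to_xy(u, z, fx, cx):
    """Monocular case: the grid already stores metric depth z."""
    return z, -(u - cx) * z / fx
```

Applying these relations to the corners of a u-disparity or u-depth cell yields the distorted quadrilateral overlays shown in Figure 5.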
for the relevant hypotheses ω ∈ S. Note that p_ω can easily be determined based on the confusion matrix of the evaluation dataset of the semantic labeling network.
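A small sketch of how p_ω could be read off a confusion matrix; using the precision of each predicted class is an assumption, as the paper does not specify the normalization.

```python
import numpy as np

def class_reliabilities(confusion):
    """p_omega per class from a confusion matrix (rows: true class, cols: predicted).

    We use the precision of each predicted class; whether the paper normalizes
    by rows or columns is not stated, so this is an assumption.
    """
    confusion = np.asarray(confusion, dtype=float)
    predicted_totals = confusion.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        p = np.diag(confusion) / predicted_totals
    return np.nan_to_num(p)
```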
3. Results
We execute our proposed method based on two setups using the KITTI odometry benchmark [26]. In the first case, we calculate stereo disparities based on the two color cameras in the KITTI sensor setup using the guided aggregation net for stereo matching presented by Zhang et al. in [27]. The authors combine semi-global aggregation layers with a local guided aggregation layer that follows traditional cost filtering and refines thin structures. In the second setup, we only use the left color camera and compute a depth map using the unsupervised method presented by Godard et al. in [21]. Both neural networks are openly available on GitHub and have been trained or at least refined using the KITTI 2015 stereo vision benchmark. For calculating the pixel-wise semantic labeling, the neural network proposed by Zhu et al. in [28] was used. Their network architecture is openly available as well and achieves a mean intersection over union (IoU) of 72.8% in the KITTI semantic segmentation benchmark. Note that all of the above choices were made independently of runtime considerations. In both cases, the pixel-wise confidences for depth and disparity, respectively, are set to one as the corresponding networks do not output adequate information. In Figure 6, an example of the three used input images is depicted.
Figure 6. Results of the three neural networks used to generate the input images for our proposed grid mapping pipeline.
The region of interest of our Cartesian grid map is 100 m in the x-direction and 50 m in the y-direction, with the sensor origin located at (0 m, 25 m). The cell size is 10 cm in both dimensions.
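For illustration, the stated grid configuration translates into the following index mapping; the row/column convention is ours.

```python
# Illustrative conversion from sensor-frame coordinates to grid cell indices
# for the stated configuration: 100 m x 50 m region, 10 cm cells,
# sensor origin at (0 m, 25 m) of the grid.
CELL_SIZE = 0.10                          # m
CELLS_PER_M = 10                          # 1 / CELL_SIZE
SIZE_X, SIZE_Y = 1000, 500                # cells along x and y

def to_cell(x, y):
    """Map a point (x forward, y lateral, in m) in the sensor frame to (row, col)."""
    col = int(x * CELLS_PER_M)            # sensor at x = 0 m maps to column 0
    row = int((y + 25.0) * CELLS_PER_M)   # y in [-25 m, 25 m) maps to rows 0..499
    if not (0 <= col < SIZE_X and 0 <= row < SIZE_Y):
        raise ValueError("point outside the region of interest")
    return row, col

print(to_cell(12.5, -4.25))               # -> (207, 125)
```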
In the remainder of this section, we first present the ground truth that we used to
evaluate our method in Section 3.1. We then present some visual results in Section 3.2.
Finally, we present a detailed quantitative evaluation in Section 3.3.
The 3D semantically annotated point cloud is mapped into the same grid G_xy that is used for our semantic evidential grid map representation. The generation of our ground truth is illustrated in Figure 7. When using multiple frames to build a denser ground truth, some cells covered by dynamic objects also receive road labels. In order to remove those conflicts, we use morphological operations to remove the ground labels in those regions, as can be seen in Figure 7 at the locations of the two vehicles that are present in the depicted scene. Subsequently, the grid map containing the ground truth labels is denoted as

g_GT : G_xy → S ∪ {unknown},

assigning either a semantic label or the label "unknown" to each grid cell C ∈ G_xy.
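A sketch of how such a conflict removal could look, using a dilation of the dynamic-object mask; the label ids and kernel size are placeholders, and the exact morphological operations of the paper are not specified.

```python
import cv2
import numpy as np

OBJECT_IDS = [0, 1, 2]      # car, cyclist, pedestrian (placeholder label ids)
GROUND_IDS = [5, 6, 7]      # street, sidewalk, terrain (placeholder label ids)
UNKNOWN = -1

def remove_ground_under_objects(gt_grid, kernel_size=5):
    """Clear ground labels in cells around dynamic objects of the merged
    ground-truth grid (sketch; kernel size and label ids are assumptions)."""
    object_mask = np.isin(gt_grid, OBJECT_IDS).astype(np.uint8)
    # Grow the object footprint so that cells swept by a moving object across
    # the merged frames are covered as well.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    grown = cv2.dilate(object_mask, kernel, iterations=1).astype(bool)
    conflict = grown & np.isin(gt_grid, GROUND_IDS)
    cleaned = gt_grid.copy()
    cleaned[conflict] = UNKNOWN
    return cleaned
```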
Figure 7. The generation of the ground truth used for the quantitative evaluation. Three-dimensional semantic point clouds from ten frames are merged and mapped into a top-view grid.
the mono and the stereo pipeline. The passing vehicle is detected better using the stereo
pipeline, whereas its shape is slightly distorted using the monocular vision pipeline due to
higher inaccuracy in the depth estimation. As a general observation, it can be stated that
the errors in both camera-based reconstructions are dominated by flying pixels at object
boundaries that result from inconsistencies between the pixelwise semantic estimate and
the depth or disparity estimation.
Figure 8. The resulting BBA for stereo and mono images. Each column corresponds to one frame in the KITTI odometry benchmark depicted in the image in the first row. The second row shows the ground truth, the third row shows the results for stereo vision, and the last row shows the results for monocular vision.
where TP_ω denotes the number of true-positive cells, FP_ω the number of false-positive cells, and FN_ω the number of false-negative cells of the label ω ∈ S. In this context, a grid cell is considered a true positive if the class in the ground truth coincides with the class ω ∈ S that has been assigned the highest BBA. Note, hence, that this metric does not
consider the measure of uncertainty that is encoded in the BBA. Therefore, we calculate the
modified intersection-over-union metrics
IoU′_ω = TP′_ω / (TP′_ω + FP′_ω + FN′_ω),   mIoU′ = (1/|S|) ∑_{ω ∈ S} IoU′_ω.
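Since the definitions of TP′_ω, FP′_ω and FN′_ω are given in terms of the BBA, the sketch below assumes that each cell contributes its highest assigned mass m_C instead of a count of one; this weighting is our reading of the metric, not a verbatim reproduction of the paper's definition.

```python
import numpy as np

def iou_metrics(pred_labels, pred_mass, gt_labels, classes):
    """Per-class IoU and a BBA-weighted variant over a top-view grid.

    pred_labels: per-cell class with the highest BBA, pred_mass: that mass m_C,
    gt_labels: ground truth with -1 for 'unknown'.
    """
    ious, ious_w = {}, {}
    valid = gt_labels >= 0
    for c in classes:
        tp = (pred_labels == c) & (gt_labels == c) & valid
        fp = (pred_labels == c) & (gt_labels != c) & valid
        fn = (pred_labels != c) & (gt_labels == c) & valid
        ious[c] = tp.sum() / max(tp.sum() + fp.sum() + fn.sum(), 1)
        # Assumed weighting: each cell contributes its BBA mass m_C instead of 1.
        tp_w, fp_w, fn_w = pred_mass[tp].sum(), pred_mass[fp].sum(), pred_mass[fn].sum()
        ious_w[c] = tp_w / max(tp_w + fp_w + fn_w, 1e-9)
    return ious, ious_w
```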
Tables 1 and 2 show the above-defined IoU metrics for the sequences 00 to 10 in the KITTI odometry benchmark. The tables contain the numbers for all considered semantic classes except for other movable objects (om), as this class barely occurs in the test sequences. The stereo vision pipeline outperforms the monocular vision pipeline for almost all classes. This is expected, as the used stereo disparity estimation is more accurate than the monocular depth estimation. In general, the numbers for both setups are in similar regions as the ones presented in the Lidar-based semantic grid map estimation from Bieder et al. in [30]. They reach a 39.8% mean IoU with their best configuration. Our proposed method reaches 37.4% and 41.0% mean IoU in the monocular and stereo pipeline, respectively. The accuracy for small objects such as pedestrians and cyclists is very low, as small errors in the range estimations have a high effect compared to the objects' size. Comparing mIoU with the modified mIoU′ incorporating the BBA, it stands out that the modified IoU is significantly higher. For the modified IoU, means of 44.7% and 48.7% are reached in the two setups. The reason for this is that higher uncertainties in wrongly classified cells lower the modified false-positive and false-negative rates FP′_ω and FN′_ω and thus also the denominator in the calculation of IoU′_ω. The results show that wrong classifications are assigned a higher uncertainty and that the BBA can be used as a meaningful measure of uncertainty.
Table 1. Class IoUs IoU_ω (IoU′_ω) for the stereo vision pipeline in %. The dash indicates that there are no corresponding objects in the sequence. The column on the right contains the mean IoUs mIoU (mIoU′).

Table 2. Class IoUs IoU_ω (IoU′_ω) for the monocular vision pipeline in %. The dash indicates that there are no corresponding objects in the sequence. The column on the right contains the mean IoUs mIoU (mIoU′).
CR = ∑_{C ∈ G_xy} T_C / ∑_{C ∈ G_xy} (T_C + F_C),

where T_C ∈ {0, 1} equals one if the correct label ω ∈ S was assigned the highest BBA greater than zero and F_C ∈ {0, 1} is one if the highest BBA greater than zero corresponds to the wrong label. The counterpart incorporating the BBA reads

CR′ = ∑_{C ∈ G_xy} T_C m_C / ∑_{C ∈ G_xy} (T_C + F_C) m_C,   m_C = max_{ω ∈ S} g_M(C, ω).
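The two ratios can be computed in a few lines; the handling of cells without any assigned mass or without a ground-truth label is an assumption.

```python
import numpy as np

def correct_ratios(pred_labels, pred_mass, gt_labels):
    """CR and the BBA-weighted CR' over all cells with a prediction and a label.

    pred_labels: per-cell class with the highest BBA (-1 if no mass assigned),
    pred_mass: the corresponding mass m_C, gt_labels: -1 marks 'unknown' cells.
    """
    evaluated = (pred_labels >= 0) & (gt_labels >= 0) & (pred_mass > 0)
    t = evaluated & (pred_labels == gt_labels)       # cells with T_C = 1
    f = evaluated & (pred_labels != gt_labels)       # cells with F_C = 1
    cr = t.sum() / max((t | f).sum(), 1)
    cr_w = pred_mass[t].sum() / max(pred_mass[t | f].sum(), 1e-9)
    return cr, cr_w
```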
We have calculated CR′ as well as CR for sequences 00 to 10. The results are presented in Table 3 for the stereo vision pipeline and in Table 4 for the monocular vision pipeline. The numbers confirm the tendencies observed in the IoU-based evaluation. The modified ratios CR′ based on the BBA are higher than the ones that are based solely on one predicted class per cell, and the ratios of the stereo vision pipeline are slightly above the ones of the monocular vision pipeline. Besides the consistency between range estimation and semantic segmentation, the quality of the semantic segmentation itself naturally has a strong influence on the final results. We found that the majority of the errors in the segmentation occur in the distinction between the road and the sidewalk. Experiments showed that CR′ can be improved by up to 10% depending on the sequence when merging the two classes. Besides the Lidar-based semantic top-view maps presented in [30], we can compare our results to the hybrid approach using Lidar and RGB images from Erkent et al. presented in [14]. They achieve a ratio of correctly labeled cells of 81% in their best performing setup, indicating that our approach performs slightly better. However, note that they predict a different set of classes without uncertainty considerations.
Table 3. Ratio of correctly labeled grid cells for the stereo vision pipeline.

Seq.  00    01    02    03    04    05    06    07    08    09    10    All
CR    81.5  81.1  78.6  86.7  83.7  73.6  82.5  83.9  82.9  79.3  73.4  80.8
CR′   87.0  87.9  84.2  89.3  87.9  81.3  88.2  88.4  87.8  85.9  78.8  86.2
Table 4. Ratio of correctly labeled grid cells for the monocular vision pipeline.

Seq.  00    01    02    03    04    05    06    07    08    09    10    All
CR    80.3  81.0  77.7  81.5  83.1  71.9  81.9  81.6  81.5  78.9  70.1  79.3
CR′   87.8  88.9  83.8  86.3  87.7  83.4  88.1  88.3  87.5  86.2  78.5  86.3
4. Conclusions
We presented an accurate and efficient framework for semantic evidential grid map-
ping based on two camera setups: monocular vision and stereo vision. Our resulting
top-view representation contains evidential measures for eight semantic hypotheses, which
can be seen as a refinement of the classical occupancy hypotheses free and occupied. We
explicitly model uncertainties of the sensor setup-dependent range estimation in an intermediate grid representation. The mapping results are dense and smooth, yet not complete, as no estimates are given in unobserved areas. In our quantitative evaluation, we showed the benefits of our evidential model by obtaining significantly better error metrics when considering the uncertainties. This is one of the main advantages of our method compared to other publications and enables our pipeline to perform comparably to competing approaches that use more expensive sensors such as Lidar [14,30]. The second advantage is the underlying semantic evidential representation that makes fusion with other sensor types such as range sensors straightforward; see [1]. The main bottlenecks in our pipeline are the semantic segmentation and the range estimation in the image domain, as well as the consistency between both. The influence of the latter, in particular, might easily be underestimated, as inconsistencies of a few pixels already imply large distortions at higher distances.
In future work, we will focus on developing a refinement method to improve the
consistency between range and semantic estimation in the image domain. In this regard,
it might also be promising to combine both in a mutual aid network to achieve a higher
consistency in the first place. We will then fuse the presented vision-based semantic
evidential grid maps with evidential grid maps from range sensors based on the method
described in [1]. Furthermore, we will incorporate the fused grid maps into a dynamic
grid mapping framework that is able to both accumulate a semantic evidential map and track dynamic traffic participants. Finally, we aim at providing a real-time capable
implementation of our framework by utilizing massive parallelization on state-of-the-
art GPUs.
Author Contributions: Conceptualization, S.R., J.B. and S.W.; methodology, S.R., J.B.; software,
S.R., Y.W.; validation, S.R., J.B., S.W. and Y.W.; formal analysis, S.R. and Y.W.; investigation, S.R.
and Y.W.; resources, S.R.; data curation, S.R. and Y.W.; writing—original draft preparation, S.R.
and Y.W.; writing—review and editing, S.R.; visualization, S.R. and Y.W.; supervision, C.S.; project
administration, C.S.; funding acquisition, S.R. All authors have read and agreed to the published
version of the manuscript.
Funding: We acknowledge the support from the KIT-Publication Fund of the Karlsruhe Institute
of Technology.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Richter, S.; Wirges, S.; Königshof, H.; Stiller, C. Fusion of range measurements and semantic estimates in an evidential
framework/Fusion von Distanzmessungen und semantischen Größen im Rahmen der Evidenztheorie. Tm-Tech. Mess. 2019,
86, 102–106. [CrossRef]
2. Shafer, G. A Mathematical Theory of Evidence; Princeton University Press: Princeton, NJ, USA, 1976; Volume 42.
3. Elfes, A. Using Occupancy Grids for Mobile Robot Perception and Navigation. Computer 1989, 22, 46–57. [CrossRef]
4. Nuss, D.; Reuter, S.; Thom, M.; Yuan, T.; Krehl, G.; Maile, M.; Gern, A.; Dietmayer, K. A Random Finite Set Approach for Dynamic
Occupancy Grid Maps with Real-time Application. Int. J. Robot. Res. 2018, 37, 841–866. [CrossRef]
5. Steyer, S.; Tanzmeister, G.; Wollherr, D. Grid-Based Environment Estimation Using Evidential Mapping and Particle Tracking.
IEEE Trans. Intell. Veh. 2018, 3, 384–396. [CrossRef]
6. Badino, H.; Franke, U. Free Space Computation Using Stochastic Occupancy Grids and Dynamic Programming; Technical Report;
Citeseer: University Park, PA, USA, 2007.
7. Perrollaz, M.; Spalanzani, A.; Aubert, D. Probabilistic Representation of the Uncertainty of Stereo-Vision and Application to
Obstacle Detection. In Proceedings of the IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 21–24 June 2010; pp. 313–318.
[CrossRef]
8. Pocol, C.; Nedevschi, S.; Meinecke, M.M. Obstacle Detection Based on Dense Stereovision for Urban ACC Systems. In Proceedings
of the 5th International Workshop on Intelligent Transportation, Hamburg, Germany, 18–19 March 2008.
9. Danescu, R.; Pantilie, C.; Oniga, F.; Nedevschi, S. Particle Grid Tracking System Stereovision Based Obstacle Perception in Driving
Environments. IEEE Intell. Transp. Syst. Mag. 2012, 4, 6–20. [CrossRef]
10. Yu, C.; Cherfaoui, V.; Bonnifait, P. Evidential Occupancy Grid Mapping with Stereo-Vision. In Proceedings of the IEEE Intelligent
Vehicles Symposium, Seoul, Korea, 28 June–1 July 2015; pp. 712–717. [CrossRef]
11. Giovani, B.V.; Victorino, A.C.; Ferreira, J.V. Stereo Vision for Dynamic Urban Environment Perception Using Semantic Context
in Evidential Grid. In Proceedings of the IEEE Conference on Intelligent Transportation Systems, Gran Canaria, Spain, 15–18
September 2015; pp. 2471–2476. [CrossRef]
12. Valente, M.; Joly, C.; de la Fortelle, A. Fusing Laser Scanner and Stereo Camera in Evidential Grid Maps. In Proceedings of the
2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 18–21 November 2018.
13. Thomas, J.; Tatsch, J.; Van Ekeren, W.; Rojas, R.; Knoll, A. Semantic grid-based road model estimation for autonomous driving. In
Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019. [CrossRef]
14. Erkent, O.; Wolf, C.; Laugier, C.; Gonzalez, D.S.; Cano, V.R. Semantic Grid Estimation with a Hybrid Bayesian and Deep Neural
Network Approach. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
Madrid, Spain, 1–5 October 2018; pp. 888–895.
15. Lu, C.; van de Molengraft, M.J.G.; Dubbelman, G. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational
Encoder–Decoder Networks. IEEE Robot. Autom. Lett. 2019, 4, 445–452. [CrossRef]
16. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Advances in
neural information processing systems. arXiv 2014, arXiv:1406.2283.
17. Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE
Trans. Pattern Anal. Mach. Intell. 2015, 38, 2024–2039. [CrossRef] [PubMed]
18. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 2002–2011.
19. Alhashim, I.; Wonka, P. High quality monocular depth estimation via transfer learning. arXiv 2018, arXiv:1812.11941.
20. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858.
21. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279.
22. Qiao, S.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic
Segmentation. arXiv 2020, arXiv:2012.05258.
23. Richter, S.; Beck, J.; Wirges, S.; Stiller, C. Semantic Evidential Grid Mapping based on Stereo Vision. In Proceedings of the 2020
IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16
September 2020; pp. 179–184. [CrossRef]
24. Bertalmio, M.; Bertozzi, A.; Sapiro, G. Navier-Stokes, Fluid Dynamics, and Image and Video Inpainting. In Proceedings of
the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14
December 2001; Volume 1, pp. 1–355. [CrossRef]
25. Yguel, M.; Aycard, O.; Laugier, C. Efficient GPU-based construction of occupancy grids using several laser range-finders. Int. J.
Veh. Auton. Syst. 2008, 6, 48–83. [CrossRef]
26. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
[CrossRef]
27. Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H.S. GA-Net: Guided Aggregation Net for End-To-End Stereo Matching. In Proceedings
of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019;
pp. 185–194. [CrossRef]
28. Zhu, Y.; Sapra, K.; Reda, F.A.; Shih, K.J.; Newsam, S.; Tao, A.; Catanzaro, B. Improving semantic segmentation via video
propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long
Beach, CA, USA, 16–20 June 2019; pp. 8856–8865.
29. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene
Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV),
Seoul, Korea, 27 October–2 November 2019.
30. Bieder, F.; Wirges, S.; Janosovits, J.; Richter, S.; Wang, Z.; Stiller, C. Exploiting Multi-Layer Grid Maps for Surround-View Semantic
Segmentation of Sparse LiDAR Data. arXiv 2020, arXiv:2005.06667.