Domain-Independent Perception in Autonomous Driving
1 Introduction
Autonomous vehicles have been a popular research domain for many years, and
there have recently been large investments from both technology and car
companies aiming to be the first to solve the problem. The most prominent approach in
recent years has been the modular approach, where the driving is divided into
several sub-tasks such as perception, localization, and planning. The modular
approach often results in a very complex solution, where each module has to be
fine-tuned individually. The scalability of this approach can therefore become an
issue when extending it to more complex situations.
Another rising approach is the end-to-end approach, where the entire driving
policy is generated within a single system. The system takes sensor input and
converts it directly to driving commands, similar to how humans drive vehicles.
End-to-end systems for autonomous vehicles require large amounts of data, and
the ability to train on many different scenarios. Therefore, simulated environ-
ments have been explored for training in different scenarios and creating large
datasets. These environments, however, differ significantly from the real world,
and the learned driving policy does not transfer adequately between environ-
ments.
This paper attempts to improve the ability of driving policies to transfer
between domains by abstracting away both the perception task and the
raw throttle and brake control of the vehicle, focusing mainly on the perception
task. The Mapillary Vistas dataset [21] is used for learning perception in a real-
world driving environment, and the autonomous vehicle simulator CARLA [9] is
used to learn both driving and perception. The ultimate goal of this paper is to
reduce the amount of real-world data required to train an autonomous vehicle,
by utilizing simulated environments for training.
The paper is structured as follows: Section 2 investigates related work, while
additionally providing a brief history of the field itself. Section 3 presents our
method, including the data, neural network architectures, and evaluation met-
rics. Section 4 describes our experiments, their results, and discussion related to
these. Section 5 discusses the overall implications of the experimental results, and
compares our results with conclusions from related work. Section 6 draws a final
conclusion of the work conducted, and addresses the paper’s merits, weaknesses,
and potential future work.
2 Related Work
[27] arranges autonomous vehicle control algorithms into two categories: modular
approaches and end-to-end approaches. Modular approaches divide the
responsibility of driving into several sub-tasks, such as perception, localization,
planning, and control. Conversely, end-to-end approaches can be defined as a
function f (x) = a where x is any input needed to make decisions — typically
sensor data and environmental information — and a are the output controls that
are sent to the vehicle’s actuators.
The end-to-end approach was first demonstrated in the ALVINN project,
described by [22]. ALVINN was able to follow simple tracks, but had no means
to handle more complex environments. Since then, large advancements have
been made within neural networks, resulting in new research within end-to-end
vehicle control. [3] approaches the problem using modern techniques, and showcases
a driving policy capable of driving on both highways and residential roads
in varied weather conditions. More recent approaches [6,20,15,14,26] are based
on Conditional Imitation Learning (CIL), introduced by [6] in 2017, where the
driving policy is given instructions — high-level commands (HLCs) — on which
actions to take (e.g. turn left in next intersection). [6] shows that an architecture
can be re-used for both simulated and physical environments, but they make no
attempt to use the same model weights across the two domains. Codevilla et al.
output a steering angle and either throttle or brake, which are sent to the
vehicle’s control systems. [15] outputs the target speed of the vehicle, leaving
the raw throttle and brake adjustments to a lower-level system. [20] proposes
to abstract the commands even further, into several waypoints in space. Their
model outputs two waypoints, 5 and 20 meters away from the vehicle, which a
PID controller uses to control the vehicle’s steering and velocity.
Transfer from simulation to real world. Many studies have been done
on transferring learning from simulation to the real world. [18] used images from
the driving game Grand Theft Auto to train their object detection model, and
achieved state-of-the-art performance on the KITTI [10] and Cityscapes [8]
datasets. [4] successfully used simulation to train a model for robotic grasping of
physical unseen objects. Among the techniques used was applying randomization
in the form of random textures, lighting, and camera position, to enable their
model to generalize from the simulated source domain to their physical target
domain.
Transferring driving policies between domains also requires an abstraction
of the perception data. [20] uses a perception model to generate segmentation
maps which are forwarded to the driving model, in order to generate similar per-
ception environments for both simulation and real-world. [26] combines ground-
truth segmentation and depth data from CARLA to increase driving perfor-
mance. [15] uses an encoder-decoder network with three decoder-heads — seg-
mentation, depth estimation and original RGB reproduction — to maximize the
model’s scene understanding. Hawke et al. also removes the decoding-process
when training their driving policy, making their driving policy model take only
the compressed encoding of scene understanding as input. [19] finds that the
performance of such multi-task prediction models depends highly on the relative
weighting between each task’s loss. Tuning these weights manually is an error-prone
and time-consuming process, and they therefore suggest a solution for tuning
weights based on the homoscedastic uncertainty of each task. They show that
the multi-task approach outperforms separate models trained individually. The
uncertainty-based weighting was later used by Hawke et al. and produced good
results for generating an optimal encoding of a driving scene. Depth images have also
proven useful in other simulation-to-real-world knowledge
transfers, such as robotic grasping [25,12].
3 Method

3.1 Perception Model

The perception model takes raw RGB images as input, and tries to predict one
or more outputs related to scene understanding: always semantic segmentation,
and in some experiments an additional depth map. The model has an encoder-
decoder structure, compressing the input into a layer with few neurons (encoder)
before expanding towards one or more prediction outputs (decoders). To train
the model, data from driving situations in different environments and geograph-
ical areas are used. Some experiment also use data generated from CARLA as a
means to improve the model’s performance in simulated environments.
Data. The Mapillary Vistas dataset [21] (henceforth Mapillary) was used for
RGB and ground-truth semantic segmentation data. The dataset consists of
25 000 high-resolution images from different driving situations, covering a large
variety of weather conditions and geographical locations. To simplify the environment for
the perception network, the number of classes for segmentation was reduced from
the original 66 object classes to five: unlabeled, road, lane markings,
humans and vehicles. To train the model’s depth decoder, ground truth depth
maps were generated using the Monodepth2 network from [11], as Mapillary
lacks this information. Figure 1 shows a sample from this dataset.
Fig. 1: Sample of the data used when training the perception model. The left image
is the original RGB. The center image is segmentation ground truth from Mapillary.
The right depth map was generated from RGB images with the Monodepth2 network.
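The class reduction amounts to a many-to-one remapping of label IDs. A minimal sketch in Python is shown below; the specific Mapillary label IDs in CLASS_GROUPS are illustrative placeholders, not the mapping actually used in the paper.

```python
import numpy as np

# Hypothetical grouping of Mapillary Vistas label IDs into the five coarse
# classes used here (unlabeled, road, lane markings, humans, vehicles).
# The source IDs below are placeholders; the real mapping depends on the
# dataset's label configuration.
CLASS_GROUPS = {
    1: [13, 24, 41],         # road-like surfaces
    2: [23, 43],             # lane markings
    3: [19, 20, 21, 22],     # humans (pedestrians, riders, ...)
    4: [52, 54, 55, 57, 61], # vehicles
}

def remap_labels(label_img: np.ndarray) -> np.ndarray:
    """Collapse a 66-class Mapillary label image into 5 coarse classes.
    Any ID not listed above falls back to class 0 (unlabeled)."""
    out = np.zeros_like(label_img, dtype=np.uint8)
    for target_class, source_ids in CLASS_GROUPS.items():
        out[np.isin(label_img, source_ids)] = target_class
    return out
```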
Architecture. Several encoders and decoders were explored when deciding the
model’s architecture. Encoders tested were MobileNet [17], ResNet-50 [16], and
a vanilla CNN, while decoders tested were U-Net [23] and SegNet [2]. To generate
a network that could predict both depth and segmentation estimations, we
modified the existing MobileNet-U-Net architecture to include a second U-Net
decoder. This second decoder predicts only one value per pixel, uses the
sigmoid activation function, and is trained with a regression loss function for depth
estimation, adapted from [1]. Figure 2 illustrates the new MobileNet-U-Net with
two decoders.
Fig. 2: A simplified illustration of the perception model. The different architectures all
used a variant of the encoder-decoder architecture. The figure represents the MobileNet-
U-Net model with a second depth estimation decoder, where each layer in the encoder
is connected to the corresponding layer in the decoder.
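To make the two-headed design concrete, the following is a rough Keras sketch of a shared encoder with two U-Net-style decoders. It is not the paper's exact architecture: a small plain CNN stands in for the MobileNet encoder, the layer sizes are illustrative, and plain mean absolute error stands in for the depth loss adapted from [1].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_perception_model(input_shape=(224, 224, 3), n_classes=5):
    inp = layers.Input(shape=input_shape)

    # Encoder: three downsampling blocks; intermediate features are kept as skips.
    x, skips = inp, []
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    def decoder(bottleneck, name, out_channels, activation):
        # U-Net-style decoder: upsample and concatenate the matching skip.
        y = bottleneck
        for filters, skip in zip((128, 64, 32), reversed(skips)):
            y = layers.UpSampling2D(2)(y)
            y = layers.Concatenate()([y, skip])
            y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
        return layers.Conv2D(out_channels, 1, activation=activation, name=name)(y)

    seg = decoder(x, "segmentation", n_classes, "softmax")  # per-pixel class probs
    depth = decoder(x, "depth", 1, "sigmoid")               # normalized depth in [0, 1]

    model = Model(inp, [seg, depth])
    model.compile(
        optimizer="adam",
        loss={"segmentation": "categorical_crossentropy",   # expects one-hot masks
              "depth": "mean_absolute_error"},              # stand-in for the loss from [1]
        loss_weights={"segmentation": 1.0, "depth": 1.0},
    )
    return model
```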
Evaluation and Metrics. The segmentation prediction was evaluated using In-
tersection over Union (IoU), calculated as IoU = (gt ∩ p) / (gt ∪ p), where
gt is the ground truth segmentation and p is the predicted segmentation. Mean
IoU was used as the main indicator for performance, calculated by taking the
mean of the class-wise IoU. Frequency weighted IoU was also calculated, mea-
sured as the mean IoU weighted by the number of pixels for each class.
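A straightforward way to compute these metrics from integer label maps is sketched below; treating classes that are absent from both ground truth and prediction as undefined (and skipping them in the mean) is an assumption, as the paper does not state how such classes are handled.

```python
import numpy as np

def iou_scores(gt, pred, n_classes=5):
    """Class-wise, mean and frequency-weighted IoU for integer label maps."""
    ious, freqs = [], []
    for c in range(n_classes):
        gt_c, pred_c = gt == c, pred == c
        union = np.logical_or(gt_c, pred_c).sum()
        inter = np.logical_and(gt_c, pred_c).sum()
        ious.append(inter / union if union > 0 else np.nan)  # undefined if class absent
        freqs.append(gt_c.sum())
    ious = np.array(ious, dtype=float)
    freqs = np.array(freqs, dtype=float) / max(gt.size, 1)   # pixel fraction per class
    mean_iou = np.nanmean(ious)
    weighted_iou = np.nansum(freqs * ious)
    return ious, mean_iou, weighted_iou
```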
The accuracy within threshold, as described in [5], was chosen as the metric
for depth estimation. Given the predicted depth value dp and the ground truth
depth value dgt, the accuracy δ within threshold th is defined as
max(dp/dgt, dgt/dp) = δ < th. Each pixel gets labeled as true or false based on
whether the pixel is within the specified threshold or not. The accuracy of an
image is then calculated by taking the average over all the pixels in the image.
th is a threshold that we varied between the values 1.25, 1.25^2, and 1.25^3, as in [5].
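A minimal implementation of this metric could look as follows (the epsilon guard against division by zero is an implementation detail, not from the paper):

```python
import numpy as np

def accuracy_within_threshold(d_pred, d_gt, th=1.25, eps=1e-6):
    """Fraction of pixels with max(d_pred/d_gt, d_gt/d_pred) < th."""
    ratio = np.maximum(d_pred / (d_gt + eps), d_gt / (d_pred + eps))
    return float(np.mean(ratio < th))

# Typically reported for th = 1.25, 1.25**2 and 1.25**3.
```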
3.2 Driving Model

The driving model runs raw RGB images through the perception model, and
uses its output segmentation and depth predictions as input. These images are
coupled with driving data recorded from an expert driver. The driving model
processes these inputs through its own layers, before outputting a steering angle
and target speed.
Data. The driving data was generated in CARLA version 0.9.9. This was done
by making an autopilot control a car in various environments, and recording
video from three forward-facing cameras, its steering angle, speed, target speed,
and HLC (left, right, straight, or follow lane). The autopilot has access to the full
state of the world, which includes an HD map, its own location and velocity, and
the states of other vehicles and pedestrians. It uses this information to generate
waypoints, which are finally fed into a PID-based controller to apply throttle,
brake, and steering angle. The collected training data was unevenly distributed
with regard to HLCs and steering angles, and we therefore down-sampled over-
represented values for an improved data distribution.
Various datasets were gathered for training the driving policy, all of which
were collected in Town01. These differ in complexity: the magnitude of steering
noise the autopilot has to account for, the weather conditions, and the light
conditions. 30 641 samples were collected in total, where the
weather varied according to CARLA’s 15 default weather presets. The training
data was effectively multiplied by three, as we made two copies of each data
point, where we used the recorded image from each side camera instead of the
main camera. To adjust for a slightly modified camera perspective, we added
an offset of 0.05 and -0.05 in steering angle respectively for the left and right
camera variants. This technique was first introduced by [3], and has since been
used successfully in other papers [6,14].
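A sketch of this three-camera expansion is given below; the field names of a recorded sample are hypothetical, but the ±0.05 steering offsets follow the text.

```python
def expand_with_side_cameras(sample, offset=0.05):
    """Turn one recorded sample into three, re-using the left/right camera
    frames with a small steering correction (offset as in the paper).
    The dictionary keys are hypothetical field names."""
    center = dict(sample, image=sample["center_image"])
    left = dict(sample, image=sample["left_image"],
                steering=sample["steering"] + offset)   # nudge back toward the lane center
    right = dict(sample, image=sample["right_image"],
                 steering=sample["steering"] - offset)
    return [center, left, right]
```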
Fig. 3: A simplified illustration of the driving network. The segmentation and depth
map inputs are taken directly from the outputs of the perception model (shown
in Figure 2).
The segmentation and depth outputs of the perception model are concatenated
channel-wise, and resemble an RGBD (RGB + depth) image. This representation
is then run through 5 convolutional blocks, each consisting of zero
padding of 1, 2D convolution with kernel size 3, batch normalization, ReLU activa-
tion, and finally max-pooling with pool size 2. The filter sizes are 64, 128, 256,
256, 256, respectively. The current HLC, whether the traffic light was red or not,
speed, and speed limit are concatenated with feature vectors generated from the
perception data. The last layers are a combination of fully-connected layers,
where we concatenate the HLC vector at each step, similar to [15]. The first out-
put of the model is the steering prediction; one neuron outputting the optimal
steering (between 0 and 1, 0 being max leftward, 1 being max rightward), later
mapped to CARLA’s [-1, 1] range. The second output is the optimal vehicle
speed, output as a fraction of 100 km/h (between 0 and 1).
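The following Keras sketch mirrors the driving network described above. The convolutional blocks and the two sigmoid outputs follow the text; the input resolution, the number of perception channels, the dense-layer sizes, and the layout of the measurement vector are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_driving_model(perception_shape=(160, 384, 6), n_hlc=4):
    # perception_shape channels: segmentation maps plus depth, channel count assumed.
    perception = layers.Input(shape=perception_shape, name="seg_plus_depth")
    hlc = layers.Input(shape=(n_hlc,), name="hlc")                # one-hot command
    measurements = layers.Input(shape=(3,), name="measurements")  # red light, speed, speed limit

    # Five conv blocks: zero padding 1, 3x3 conv, batch norm, ReLU, 2x2 max pooling.
    x = perception
    for filters in (64, 128, 256, 256, 256):
        x = layers.ZeroPadding2D(1)(x)
        x = layers.Conv2D(filters, 3)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(2)(x)
    features = layers.Flatten()(x)

    # Fully-connected head, re-injecting the HLC vector at each step.
    x = layers.Concatenate()([features, hlc, measurements])
    for units in (512, 256, 64):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Concatenate()([x, hlc])

    steering = layers.Dense(1, activation="sigmoid", name="steering")(x)   # 0=left, 1=right
    speed = layers.Dense(1, activation="sigmoid", name="target_speed")(x)  # fraction of 100 km/h
    return Model([perception, hlc, measurements], [steering, speed])
```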
Evaluation and Metrics. The main metric used for measuring driving model
performance was Mean Completion Rate (MCR) during real-time evaluation.
This is calculated by dividing the completed distance dc by the total route
distance dt of each run-through of a route, averaged over all run-throughs R:
MCR = (1/|R|) Σ_{r ∈ R} dc/dt. Traffic violations were not included as metrics, as the scope of this pa-
per is mainly within completing routes without major incidents, and the models
were therefore not trained to avoid such violations. The model’s validation loss
was also used as a rough metric for performance. Based on empirical observations,
we only picked models with validation loss below approximately 0.03 for further evaluation. The val-
idation loss metric was used as an initial performance estimation because the
MCR evaluation was considerably more time consuming.
4 Experiments

There are two main experiments conducted in this paper. The first experiment
and its sub-experiments focus on generating the best perception model to
be used when training the driving network. Model architecture, dataset variants,
augmentation, and multi-task learning are parameters experimented with
to increase performance. The second experiment is conducted in CARLA. This
experiment assesses the driving policy performance given the different models de-
rived in the first set of experiments. The generalizability of each model is tested
using different unseen environments. Each perception model is then compared
to a baseline model trained only on the CARLA dataset using Mean Completion
Rate as the metric.
expected. MobileNet was used for further experiments as it was significantly faster
than ResNet-50.
Experiment 1-2: Training data. To improve the model further, some CARLA
data was introduced to the Mapillary dataset. Augmentation was also introduced
for further improvements and better generalization. The Mapillary+CARLA
dataset consisted of 20 000 datapoints from the Mapillary dataset and 3 250
samples from Town01 and Town02 in CARLA. The dataset with only aug-
mented CARLA data (CARLA+Aug) used a different dataset of 15 000 samples
from Town 1-4, and 4 000 samples from Town 5 as validation. The results were
evaluated on Town 3-4, as Town 1-2 was used when training Mapillary+CARLA.
The augmentation used includes, among others, Gaussian noise, translation,
rotation, and hue and saturation shifts, and was adapted from [13].
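As a rough illustration of such a pipeline (using albumentations rather than the implementation adapted from [13], and with illustrative magnitudes):

```python
import albumentations as A

# Stand-in for the augmentation pipeline: Gaussian noise, small
# translations/rotations, and hue/saturation jitter. Magnitudes are
# illustrative, not the values used in the paper.
augment = A.Compose([
    A.GaussNoise(p=0.3),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.0, rotate_limit=5, p=0.5),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=0, p=0.5),
])

# Geometric transforms are applied to the image and its segmentation mask
# together so the labels stay aligned:
# augmented = augment(image=rgb, mask=seg)
# rgb_aug, seg_aug = augmented["image"], augmented["mask"]
```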
                               Segmentation                    Depth
Training dataset         Mean IoU        Weighted IoU   δ < 1.25   δ < 1.25^2   δ < 1.25^3
Mapillary                0.458 (+0.03)   0.817          0.320      0.572        0.684
Mapillary+CARLA          0.520 (+0.05)   0.854          0.295      0.542        0.679
CARLA                    0.717 (+0.15)   0.960          0.775      0.806        0.816
This experiment aims to assess the overall performance of the two-part (percep-
tion and driving policy) architecture. We run real-time evaluations on variants
The scenario runner. The scenario runner makes each model drive through
a predefined set of routes, each of which is defined by a set of waypoints. The
model navigates each route using HLCs provided automatically when passing
each waypoint. Each attempt at a route ends either when the vehicle completes
the route, or when the vehicle enters any of the following erroneous states: stuck
on an obstacle, leaving its correct lane and not returning within five seconds, or
ignoring an HLC. The models are then compared on their mean route completion
rate.
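The resulting metric and evaluation loop can be summarized as follows; drive_route is a hypothetical helper standing in for the actual CARLA-based scenario runner.

```python
def mean_completion_rate(runs):
    """MCR over a list of runs, each a (completed_distance, total_distance) pair."""
    return sum(dc / dt for dc, dt in runs) / len(runs)

# Illustrative evaluation loop (pseudocode-level sketch):
# runs = []
# for route in routes:
#     dc = drive_route(model, route)   # distance completed before an erroneous
#                                      # state or the end of the route
#     runs.append((dc, route.length))
# print("MCR:", mean_completion_rate(runs))
```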
Table 4: Mean completion rate in (a) Town02 and (b) Town07, in six weather condi-
tions. Day, Sunset and Night are shortened to D, S, N respectively. The individual cells
are colored on a scale where green is the best, and red is the worst. Note that no cars
or pedestrians were included in the traffic during these experiments.
5 Discussions
[26] uses ground-truth semantic segmentation data generated from CARLA, not
predicted as we do, and combines segmentation with both ground-truth depth
maps and depth estimated by a separate network. Their results align with
ours: using semantic segmentation data beats just using raw images, and
combining both segmentation and depth performs best. With a combination
of ground-truth segmentation and estimated depth, their policy is still able to
beat the raw image-based policy. Our models estimate both segmentation and
depth, and are still able to perform well in comparison to our baseline RGB
model.
[20] use predicted binary segmentation (road/not road) as driving input,
and our work extends this with predicted depth, giving additional performance
benefits. [14] achieved higher completion rates even with traffic, but focused
more on the impact of larger datasets and encoding temporal information in the
model, while this paper focused mainly on generalizability.
The driving model by [15] did not include the perception model’s decoding
layers in its architecture, which seems to be an overall more efficient approach.
Because the U-Net architecture used in our paper had connections between each
encoder-decoder layer, information could have been lost by not including the de-
coding layers. In future work, a model without connections between the encoder-
decoder layers could be explored to take advantage of [15]’s approach.
We find that models with a learned understanding of the semantics and/or
geometry of the scene are able to navigate never-before-seen environments and
weather. Our real-time experiment shows that these driving models often per-
form better than learning from raw image inputs directly, with models utilizing
both semantics and geometry performing best overall.
the results were representative. Still, conclusions based on the results in Exper-
iment 2 must be drawn carefully. A more robust approach could be to train
multiple models with the same parameters and average their results.
6 Conclusion
Splitting end-to-end models for autonomous vehicles into separate models for
perception and driving policy is shown to give good results in simulated en-
vironments. Perception models trained from public datasets such as Mapillary
Vistas can be used to reduce the amount of driving data needed when training
an end-to-end driving policy network. This approach opens up the possibility of
training the driving policy in a simulated environment, while still achieving good
performance in real-world environments.
Future work should explore how these results transfer to the real world.
Evaluating the performance of a model trained solely in simulation directly in a
real-world environment will be an important next step as a means of testing the
validity of these results.
References
1. Alhashim, I., Wonka, P.: High quality monocular depth estimation via transfer
learning. arXiv:1812.11941 [cs] (Mar 2019), arXiv: 1812.11941
2. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. arXiv:1511.00561 [cs] (Oct
2016), arXiv: 1511.00561
3. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P.,
Jackel, L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.:
End to End Learning for Self-Driving Cars (2016)
4. Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M.,
Downs, L., Ibarz, J., Pastor, P., Konolige, K., et al.: Using simulation and domain
adaptation to improve efficiency of deep robotic grasping. In: 2018 IEEE Interna-
tional Conference on Robotics and Automation (ICRA). p. 4243–4250 (May 2018).
https://doi.org/10.1109/ICRA.2018.8460875
5. Cao, Y., Zhao, T., Xian, K., Shen, C., Cao, Z., Xu, S.: Monocular depth estima-
tion with augmented ordinal depth relationships. arXiv:1806.00585 [cs] (Jul 2019),
arXiv: 1806.00585
6. Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-end Driv-
ing via Conditional Imitation Learning. arXiv:1710.02410 [cs] (Oct 2017), arXiv:
1710.02410
7. Codevilla, F., Santana, E., López, A.M., Gaidon, A.: Exploring the Limitations
of Behavior Cloning for Autonomous Driving. arXiv:1904.08980 [cs] (Apr 2019),
arXiv: 1904.08980
8. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. arXiv:1604.01685 [cs] (Apr 2016), arXiv: 1604.01685
9. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open
urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot
Learning. pp. 1–16 (2017)
10. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti
dataset. International Journal of Robotics Research (IJRR) (2013)
11. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.: Digging into self-supervised
monocular depth estimation. arXiv:1806.01260 [cs, stat] (Aug 2019), arXiv:
1806.01260
12. Gualtieri, M., Pas, A.t., Saenko, K., Platt, R.: High precision grasp pose detection
in dense clutter. arXiv:1603.01564 [cs] (Jun 2017), arXiv: 1603.01564
13. Gupta, D.: Image segmentation keras : Implementation of segnet, fcn, unet,
pspnet and other models in keras. (2020), https://github.com/divamgupta/
image-segmentation-keras
14. Haavaldsen, H., Aasboe, M., Lindseth, F.: Autonomous Vehicle Control: End-to-
end Learning in Simulated Urban Environments. arXiv:1905.06712 [cs] (May 2019),
arXiv: 1905.06712
15. Hawke, J., Shen, R., Gurau, C., Sharma, S., Reda, D., Nikolov, N., Mazur, P., Mick-
lethwaite, S., Griffiths, N., Shah, A., Kendall, A.: Urban Driving with Conditional
Imitation Learning. arXiv:1912.00177 [cs] (Dec 2019), arXiv: 1912.00177
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
arXiv:1512.03385 [cs] (Dec 2015), arXiv: 1512.03385
17. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications. arXiv:1704.04861 [cs] (Apr 2017), arXiv: 1704.04861
18. Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., Vasudevan,
R.: Driving in the matrix: Can virtual worlds replace human-generated annotations
for real world tasks? arXiv:1610.01983 [cs] (Feb 2017), arXiv: 1610.01983
19. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics. arXiv:1705.07115 [cs] (Apr 2018), arXiv:
1705.07115
20. Müller, M., Dosovitskiy, A., Ghanem, B., Koltun, V.: Driving Policy Transfer via
Modularity and Abstraction. arXiv:1804.09364 [cs] (Dec 2018), arXiv: 1804.09364
21. Neuhold, G., Ollmann, T., Rota Bulò, S., Kontschieder, P.: The mapillary vistas
dataset for semantic understanding of street scenes. In: International Conference on
Computer Vision (ICCV) (2017), https://www.mapillary.com/dataset/vistas
22. Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network. In:
Advances in Neural Information Processing Systems 1, p. 305–313. Morgan Kaufmann
Publishers Inc. (1989), http://dl.acm.org/citation.cfm?id=89851.89891
23. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. arXiv:1505.04597 [cs] (May 2015), arXiv: 1505.04597
24. Standley, T., Zamir, A.R., Chen, D., Guibas, L., Malik, J., Savarese, S.: Which
tasks should be learned together in multi-task learning? arXiv:1905.07553 [cs] (May
2019)
25. Viereck, U., Pas, A.t., Saenko, K., Platt, R.: Learning a visuomotor controller for
real world robotic grasping using simulated depth images. arXiv:1706.04652 [cs]
(Nov 2017), arXiv: 1706.04652
26. Xiao, Y., Codevilla, F., Gurram, A., Urfalioglu, O., López, A.M.: Multimodal End-
to-End Autonomous Driving. arXiv:1906.03199 [cs] (Jun 2019), arXiv: 1906.03199
27. Yurtsever, E., Lambert, J., Carballo, A., Takeda, K.: A Survey of Autonomous
Driving: Common Practices and Emerging Technologies. arXiv:1906.05113 [cs, eess]
(Jun 2019), arXiv: 1906.05113