
A Survey of Deep Learning Techniques for Autonomous Driving

Sorin Grigorescu∗, Bogdan Trasnea, Tiberiu Cocias, Gigel Macesanu

Artificial Intelligence, Elektrobit Automotive
Robotics, Vision and Control Lab, Transilvania University of Brasov
Brasov, Romania
E-mail: [email protected]

arXiv:1910.07738v2 [cs.LG] 24 Mar 2020

Abstract

The last decade witnessed increasingly rapid progress in self-driving vehicle technology, mainly backed by advances in the area of deep learning and artificial intelligence. The objective of this paper is to survey the current state-of-the-art on deep learning technologies used in autonomous driving. We start by presenting AI-based self-driving architectures, convolutional and recurrent neural networks, as well as the deep reinforcement learning paradigm. These methodologies form a base for the surveyed driving scene perception, path planning, behavior arbitration and motion control algorithms. We investigate both the modular perception-planning-action pipeline, where each module is built using deep learning methods, as well as End2End systems, which directly map sensory information to steering commands. Additionally, we tackle current challenges encountered in designing AI architectures for autonomous driving, such as their safety, training data sources and computational hardware. The comparison presented in this survey helps to gain insight into the strengths and limitations of deep learning and AI approaches for autonomous driving and assists with design choices.¹

∗ The authors are with Elektrobit Automotive and the Robotics, Vision and Control Laboratory (ROVIS Lab) at the Department of Automation and Information Technology, Transilvania University of Brasov, 500036 Romania. E-mail: (see http://rovislab.com/sorin_grigorescu.html).
¹ The articles referenced in this survey can be accessed at the web page accompanying this paper, available at http://rovislab.com/survey_DL_AD.html
Contents

1 Introduction
2 Deep Learning based Decision-Making Architectures for Self-Driving Cars
3 Overview of Deep Learning Technologies
   3.1 Deep Convolutional Neural Networks
   3.2 Recurrent Neural Networks
   3.3 Deep Reinforcement Learning
4 Deep Learning for Driving Scene Perception and Localization
   4.1 Sensing Hardware: Camera vs. LiDAR Debate
   4.2 Driving Scene Understanding
      4.2.1 Bounding-Box-Like Object Detectors
      4.2.2 Semantic and Instance Segmentation
      4.2.3 Localization
   4.3 Perception using Occupancy Maps
5 Deep Learning for Path Planning and Behavior Arbitration
6 Motion Controllers for AI-based Self-Driving Cars
   6.1 Learning Controllers
   6.2 End2End Learning Control
7 Safety of Deep Learning in Autonomous Driving
8 Data Sources for Training Autonomous Driving Systems
9 Computational Hardware and Deployment
10 Discussion and Conclusions
   10.1 Final Notes


1 Introduction

Over the course of the last decade, Deep Learning and Artificial Intelligence (AI) became the main technologies behind many breakthroughs in computer vision [1], robotics [2] and Natural Language Processing (NLP) [3]. They also have a major impact in the autonomous driving revolution seen today both in academia and industry. Autonomous Vehicles (AVs) and self-driving cars began to migrate from laboratory development and testing conditions to driving on public roads. Their deployment in our environmental landscape offers a decrease in road accidents and traffic congestion, as well as an improvement of our mobility in overcrowded cities. The title of "self-driving" may seem self-evident, but there are actually five SAE Levels used to define autonomous driving. The SAE J3016 standard [4] introduces a scale from 0 to 5 for grading vehicle automation. Lower SAE Levels feature basic driver assistance, whilst higher SAE Levels move towards vehicles requiring no human interaction whatsoever. Cars in the Level 5 category require no human input and typically will not even feature steering wheels or foot pedals.

Although most driving scenarios can be relatively simply solved with classical perception, path planning and motion control methods, the remaining unsolved scenarios are corner cases in which traditional methods fail.

One of the first autonomous cars was developed by Ernst Dickmanns [5] in the 1980s. This paved the way for new research projects, such as PROMETHEUS, which aimed to develop a fully functional autonomous car. In 1994, the VaMP driverless car managed to drive 1,600 km, out of which 95% were driven autonomously. Similarly, in 1995, CMU NAVLAB demonstrated autonomous driving on 6,000 km, with 98% driven autonomously. Other important milestones in autonomous driving were the DARPA Grand Challenges in 2004 and 2005, as well as the DARPA Urban Challenge in 2007. The goal was for a driverless car to navigate an off-road course as fast as possible, without human intervention. In 2004, none of the 15 vehicles completed the race. Stanley, the winner of the 2005 race, leveraged Machine Learning techniques for navigating the unstructured environment. This was a turning point in the development of self-driving cars, acknowledging Machine Learning and AI as central components of autonomous driving. The turning point is also notable in this survey paper, since the majority of the surveyed work is dated after 2005.

In this survey, we review the different artificial intelligence and deep learning technologies used in autonomous driving, and provide a survey of state-of-the-art deep learning and AI methods applied to self-driving cars. We also dedicate complete sections to safety aspects, the challenge of training data sources and the required computational hardware.

2 Deep Learning based Decision-Making Architectures for Self-Driving Cars

Self-driving cars are autonomous decision-making systems that process streams of observations coming from different on-board sources, such as cameras, radars, LiDARs, ultrasonic sensors, GPS units and/or inertial sensors. These observations are used by the car's computer to make driving decisions. The basic block diagrams of an AI-powered autonomous car are shown in Fig. 1. The driving decisions are computed either in a modular perception-planning-action pipeline (Fig. 1(a)), or in an End2End learning fashion (Fig. 1(b)), where sensory information is directly mapped to control outputs. The components of the modular pipeline can be designed either based on AI and deep learning methodologies, or using classical non-learning approaches. Various permutations of learning- and non-learning-based components are possible (e.g. a deep learning based object detector provides input to a classical A-star path planning algorithm). A safety monitor is designed to assure the safety of each module.

The modular pipeline in Fig. 1(a) is hierarchically decomposed into four components which can be designed using either deep learning and AI approaches, or classical methods. These components are:

• Perception and Localization,
• High-Level Path Planning,
• Behavior Arbitration, or low-level path planning,
• Motion Controllers.

Based on these four high-level components, we have grouped together relevant deep learning papers describing methods developed for autonomous driving systems. In addition to the reviewed algorithms, we have also grouped relevant articles covering the safety, data sources and hardware aspects encountered when designing deep learning modules for self-driving cars.

Given a route planned through the road network, the first task of an autonomous car is to understand and localize itself in the surrounding environment. Based on this representation, a continuous path is planned and the future actions of the car are determined by the behavior arbitration system. Finally, a motion control system reactively corrects errors generated in the execution of the planned motion. A review of classical non-AI design methodologies for these four components can be found in [6].

In the following, we give an introduction to the deep learning and AI technologies used in autonomous driving, and survey the different methodologies used to design the hierarchical decision-making process described above. Additionally, we provide an overview of End2End learning systems used to encode the hierarchical process into a single deep learning architecture which directly maps sensory observations to control outputs.
Figure 1: Deep Learning based self-driving car. The architecture can be implemented either as a sequential perception-planning-action pipeline (a), or as an End2End system (b). In the sequential pipeline case, the components can be designed either using AI and deep learning methodologies, or based on classical non-learning approaches. End2End learning systems are mainly based on deep learning methods. A safety monitor is usually designed to ensure the safety of each module.
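To make the data flow of the modular pipeline in Fig. 1(a) concrete, the following minimal Python sketch wires four placeholder components and a safety monitor into one processing step. All class, function and parameter names here are illustrative assumptions for the discussion in this survey, not an interface defined by any of the surveyed systems.

# Minimal sketch of the modular perception-planning-action pipeline from Fig. 1(a).
# Component implementations are placeholders; only the data flow is illustrated.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WorldModel:                              # output of perception & localization
    ego_pose: Tuple[float, float, float]       # x [m], y [m], heading [rad]
    obstacles: List[Tuple[float, float]]       # obstacle positions in the ego frame

def perceive_and_localize(camera_image, lidar_points) -> WorldModel:
    # In a real system: object detection, segmentation, SLAM / map matching.
    return WorldModel(ego_pose=(0.0, 0.0, 0.0), obstacles=[])

def plan_path(world: WorldModel, destination) -> List[Tuple[float, float]]:
    # High-level route / path planning (e.g. A* over a road graph).
    return [world.ego_pose[:2], destination]

def arbitrate_behavior(world: WorldModel, path) -> dict:
    # Low-level maneuver selection (keep lane, overtake, yield, ...).
    return {"maneuver": "keep_lane", "reference_path": path}

def motion_control(world: WorldModel, behavior) -> Tuple[float, float]:
    # Returns (steering [rad], acceleration [m/s^2]) tracking the reference path.
    return 0.0, 0.5

def safety_monitor(command: Tuple[float, float]) -> Tuple[float, float]:
    # Clamp commands to safe ranges before actuation.
    steer, accel = command
    return max(-0.5, min(0.5, steer)), max(-3.0, min(2.0, accel))

def driving_step(camera_image, lidar_points, destination):
    world = perceive_and_localize(camera_image, lidar_points)
    path = plan_path(world, destination)
    behavior = arbitrate_behavior(world, path)
    return safety_monitor(motion_control(world, behavior))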

3 Overview of Deep Learning Technologies

In this section, we describe the basis of the deep learning technologies used in autonomous vehicles and comment on the capabilities of each paradigm. We focus on Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Deep Reinforcement Learning (DRL), which are the most common deep learning methodologies applied to autonomous driving.

Throughout the survey, we use the following notation to describe time-dependent sequences. The value of a variable is defined either for a single discrete time step t, written as superscript <t>, or as a discrete sequence defined in the <t, t+k> time interval, where k denotes the length of the sequence. For example, the value of a state variable z is defined either at discrete time t, as z^<t>, or within a sequence interval, as z^<t,t+k>. Vectors and matrices are indicated by bold symbols.

3.1 Deep Convolutional Neural Networks

Convolutional Neural Networks (CNN) are mainly used for processing spatial information, such as images, and can be viewed as image feature extractors and universal non-linear function approximators [7], [8]. Before the rise of deep learning, computer vision systems used to be implemented based on handcrafted features, such as HAAR [9], Local Binary Patterns (LBP) [10], or Histograms of Oriented Gradients (HoG) [11]. In comparison to these traditional handcrafted features, convolutional neural networks are able to automatically learn a representation of the feature space encoded in the training set.

CNNs can be loosely understood as very approximate analogies to different parts of the mammalian visual cortex [12]. An image formed on the retina is sent to the visual cortex through the thalamus. Each brain hemisphere has its own visual cortex. The visual information is received by the visual cortex in a crossed manner: the left visual cortex receives information from the right eye, while the right visual cortex is fed with visual data from the left eye. The information is processed according to the dual flux theory [13], which states that the visual flow follows two main fluxes: a ventral flux, responsible for visual identification and object recognition, and a dorsal flux used for establishing spatial relations between objects. A CNN mimics the functioning of the ventral flux, in which different areas of the brain are sensitive to specific features in the visual field. The earlier brain cells in the visual cortex are activated by sharp transitions in the visual field of view, in the same way in which an edge detector highlights sharp transitions between neighboring pixels in an image. These edges are further used in the brain to approximate object parts and finally to estimate abstract representations of objects.

A CNN is parametrized by its weights vector θ = [W, b], where W is the set of weights governing the inter-neural connections and b is the set of neuron bias values. The set of weights W is organized as image filters, with coefficients learned during training. Convolutional layers within a CNN exploit local spatial correlations of image pixels to learn translation-invariant convolution filters, which capture discriminant image features.

Consider a multichannel signal representation M_k in layer k, which is a channel-wise integration of signal representations M_{k,c}, where c ∈ N. A signal representation can be generated in layer k+1 as:

    M_{k+1,l} = ϕ(M_k ∗ w_{k,l} + b_{k,l}),    (1)

where w_{k,l} ∈ W is a convolutional filter with the same number of channels as M_k, b_{k,l} ∈ b represents the bias, l is a channel index and ∗ denotes the convolution operation. ϕ(·) is an activation function applied to each pixel in the input signal. Typically, the Rectified Linear Unit (ReLU) is the most commonly used activation function in computer vision applications [1]. The final layer of a CNN is usually a fully-connected layer which acts as an object discriminator on a high-level abstract representation of objects.

In a supervised manner, the response R(·; θ) of a CNN can be trained using a training database D = [(x_1, y_1), ..., (x_m, y_m)], where x_i is a data sample, y_i is the corresponding label and m is the number of training examples. The optimal network parameters can be calculated using Maximum Likelihood Estimation (MLE). For clarity of explanation, we take as example the simple least-squares error function, which can be used to drive the MLE process when training regression estimators:

    θ̂ = arg max_θ L(θ; D) = arg min_θ ∑_{i=1}^{m} (R(x_i; θ) − y_i)².    (2)

For classification purposes, the least-squares error is usually replaced by the cross-entropy or the negative log-likelihood loss functions. The optimization problem in Eq. 2 is typically solved with Stochastic Gradient Descent (SGD) and the backpropagation algorithm for gradient estimation [14]. In practice, different variants of SGD are used, such as Adam [15] or AdaGrad [16].
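As a minimal numerical illustration of Eqs. (1) and (2), the NumPy sketch below computes one output channel of a convolutional layer as a channel-wise convolution (implemented as cross-correlation) followed by a ReLU activation, together with the least-squares objective that drives the MLE estimate. The filter sizes and toy data are arbitrary assumptions.

import numpy as np

def conv_layer_channel(M_k, w_kl, b_kl):
    """Eq. (1): M_{k+1,l} = phi(M_k * w_{k,l} + b_{k,l}) for one output channel l.
    M_k: input of shape (C, H, W); w_kl: filter of shape (C, h, w); b_kl: scalar bias."""
    C, H, W = M_k.shape
    _, h, w = w_kl.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(M_k[:, i:i + h, j:j + w] * w_kl) + b_kl
    return np.maximum(out, 0.0)               # phi = ReLU

def least_squares_loss(R_out, y):
    """Eq. (2): sum of squared differences between network response and labels."""
    return float(np.sum((R_out - y) ** 2))

# Toy usage: a 3-channel 8x8 input and one 3x3 filter.
rng = np.random.default_rng(0)
M_k = rng.standard_normal((3, 8, 8))
w_kl = rng.standard_normal((3, 3, 3)) * 0.1
feature_map = conv_layer_channel(M_k, w_kl, b_kl=0.0)    # shape (6, 6)
print(feature_map.shape, least_squares_loss(feature_map, np.zeros_like(feature_map)))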
3.2 Recurrent Neural Networks

Among deep learning techniques, Recurrent Neural Networks (RNN) are especially good at processing temporal sequence data, such as text or video streams. Different from conventional neural networks, an RNN contains a time-dependent feedback loop in its memory cell. Given a time-dependent input sequence [s^<t−τi>, ..., s^<t>] and an output sequence [z^<t+1>, ..., z^<t+τo>], an RNN can be "unfolded" τi + τo times to generate a loop-less network architecture matching the input length, as illustrated in Fig. 2. Here, t represents a temporal index, while τi and τo are the lengths of the input and output sequences, respectively. Such neural networks are also encountered under the name of sequence-to-sequence models. An unfolded network has τi + τo + 1 identical layers, that is, each layer shares the same learned weights. Once unfolded, an RNN can be trained using the backpropagation through time algorithm. When compared to a conventional neural network, the only difference is that the learned weights in each unfolded copy of the network are averaged, thus enabling the network to share the same weights over time.

Figure 2: A folded (a) and unfolded (b) over time, many-to-many Recurrent Neural Network. Over time t, both the input s^<t−τi,t> and output z^<t+1,t+τo> sequences share the same weights h^<·>. The architecture is also referred to as a sequence-to-sequence model.

The main challenge in using basic RNNs is the vanishing gradient encountered during training. The gradient signal can end up being multiplied a large number of times, as many as the number of time steps. Hence, a traditional RNN is not suitable for capturing long-term dependencies in sequence data. If a network is very deep, or processes long sequences, the gradient of the network's output has a hard time propagating back to affect the weights of the earlier layers. Under gradient vanishing, the weights of the network will not be effectively updated, ending up with very small weight values.

Long Short-Term Memory (LSTM) [17] networks are non-linear function approximators for estimating temporal dependencies in sequence data. As opposed to traditional recurrent neural networks, LSTMs solve the vanishing gradient problem by incorporating three gates, which control the input, output and memory state.

Recurrent layers exploit temporal correlations of sequence data to learn time-dependent neural structures. Consider the memory state c^<t−1> and the output state h^<t−1> in an LSTM network, sampled at time step t−1, as well as the input data s^<t> at time t. The opening or closing of a gate is controlled by a sigmoid function σ(·) of the current input signal s^<t> and the output signal of the last time point h^<t−1>, as follows:

    Γ_u^<t> = σ(W_u s^<t> + U_u h^<t−1> + b_u),    (3)

    Γ_f^<t> = σ(W_f s^<t> + U_f h^<t−1> + b_f),    (4)

    Γ_o^<t> = σ(W_o s^<t> + U_o h^<t−1> + b_o),    (5)

where Γ_u^<t>, Γ_f^<t> and Γ_o^<t> are gate functions of the input gate, forget gate and output gate, respectively. Given the current observation, the memory state c^<t> will be updated as:

    c^<t> = Γ_u^<t> ∗ tanh(W_c s^<t> + U_c h^<t−1> + b_c) + Γ_f^<t> ∗ c^<t−1>.    (6)

The new network output h^<t> is computed as:

    h^<t> = Γ_o^<t> ∗ tanh(c^<t>).    (7)
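The gating equations (3)–(7) translate directly into a single LSTM cell step. The NumPy sketch below follows the notation above, with ∗ as element-wise multiplication; the weight shapes and toy dimensions are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(s_t, h_prev, c_prev, W, U, b):
    """One LSTM step implementing Eqs. (3)-(7).
    W, U, b are dicts keyed by 'u' (input), 'f' (forget), 'o' (output), 'c' (candidate)."""
    gamma_u = sigmoid(W['u'] @ s_t + U['u'] @ h_prev + b['u'])   # Eq. (3), input gate
    gamma_f = sigmoid(W['f'] @ s_t + U['f'] @ h_prev + b['f'])   # Eq. (4), forget gate
    gamma_o = sigmoid(W['o'] @ s_t + U['o'] @ h_prev + b['o'])   # Eq. (5), output gate
    c_t = gamma_u * np.tanh(W['c'] @ s_t + U['c'] @ h_prev + b['c']) + gamma_f * c_prev  # Eq. (6)
    h_t = gamma_o * np.tanh(c_t)                                 # Eq. (7)
    return h_t, c_t

# Toy usage with a 4-dimensional input and an 8-dimensional state.
rng = np.random.default_rng(1)
n_in, n_h = 4, 8
W = {k: rng.standard_normal((n_h, n_in)) * 0.1 for k in 'ufoc'}
U = {k: rng.standard_normal((n_h, n_h)) * 0.1 for k in 'ufoc'}
b = {k: np.zeros(n_h) for k in 'ufoc'}
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_cell_step(rng.standard_normal(n_in), h, c, W, U, b)
print(h.shape, c.shape)    # (8,) (8,)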
An LSTM network Q is parametrized by θ = [W_i, U_i, b_i], where W_i represents the weights of the network's gates and memory cell multiplied with the input state, U_i are the weights governing the activations and b_i denotes the set of neuron bias values. ∗ symbolizes element-wise multiplication.

In a supervised learning setup, given a set of training sequences D = [(s_1^<t−τi,t>, z_1^<t+1,t+τo>), ..., (s_q^<t−τi,t>, z_q^<t+1,t+τo>)], that is, q independent pairs of observed sequences with assignments z^<t+1,t+τo>, one can train the response of an LSTM network Q(·; θ) using Maximum Likelihood Estimation:

    θ̂ = arg max_θ L(θ; D)
       = arg min_θ ∑_{i=1}^{q} l_i(Q(s_i^<t−τi,t>; θ), z_i^<t+1,t+τo>)
       = arg min_θ ∑_{i=1}^{q} ∑_{t=1}^{τo} l_i^<t>(Q^<t>(s_i^<t−τi,t>; θ), z_i^<t>),    (8)

where an input sequence of observations s^<t−τi,t> = [s^<t−τi>, ..., s^<t−1>, s^<t>] is composed of τi consecutive data samples, l(·,·) is the logistic regression loss function and t represents a temporal index.

In recurrent neural network terminology, the optimization procedure in Eq. 8 is typically used for training "many-to-many" RNN architectures, such as the one in Fig. 2, where the input and output states are represented by temporal sequences of τi and τo data instances, respectively. This optimization problem is commonly solved using gradient based methods, like Stochastic Gradient Descent (SGD), together with the backpropagation through time algorithm for calculating the network's gradients.
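A compact reading of Eq. (8) is a per-sequence loss accumulated over the τo output steps of the unrolled network. The sketch below assumes a generic recurrent step function q_step (a stand-in for the LSTM cell above, not the full model) and a squared-error per-step loss; both are illustrative simplifications.

import numpy as np

def sequence_loss(q_step, params, s_seq, z_seq):
    """Inner sum of Eq. (8): unroll the network over the input sequence and
    accumulate a per-step loss over the tau_o target steps.
    s_seq: inputs of shape (tau_i, n_in); z_seq: targets of shape (tau_o, n_out)."""
    h = np.zeros(params['n_out'])
    for s_t in s_seq:                         # consume the tau_i observations
        h = q_step(params, s_t, h)
    loss, outputs = 0.0, []
    for z_t in z_seq:                         # predict tau_o future steps
        h = q_step(params, np.zeros(params['n_in']), h)
        outputs.append(h)
        loss += float(np.sum((h - z_t) ** 2))     # l_i^{<t>}: squared error per step
    return loss, np.stack(outputs)

def q_step(params, s_t, h_prev):              # stand-in recurrent step (not a full LSTM)
    return np.tanh(params['W'] @ s_t + params['U'] @ h_prev)

params = {'n_in': 3, 'n_out': 5,
          'W': np.full((5, 3), 0.1), 'U': np.full((5, 5), 0.05)}
loss, preds = sequence_loss(q_step, params, np.ones((4, 3)), np.zeros((2, 5)))
print(round(loss, 3), preds.shape)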
3.3 Deep Reinforcement Learning

In the following, we review the Deep Reinforcement Learning (DRL) concept as an autonomous driving task, using the Partially Observable Markov Decision Process (POMDP) formalism.

In a POMDP, an agent, which in our case is the self-driving car, senses the environment with observation I^<t>, performs an action a^<t> in state s^<t>, interacts with its environment through a received reward R^<t+1>, and transits to the next state s^<t+1> following a transition function T_{s^<t>,a^<t>}^{s^<t+1>}.

In RL-based autonomous driving, the task is to learn an optimal driving policy for navigating from state s_start^<t> to a destination state s_dest^<t+k>, given an observation I^<t> at time t and the system's state s^<t>. I^<t> represents the observed environment, while k is the number of time steps required for reaching the destination state s_dest^<t+k>.

In reinforcement learning terminology, the above problem can be modeled as a POMDP M := (I, S, A, T, R, γ), where:

• I is the set of observations, with I^<t> ∈ I defined as an observation of the environment at time t.

• S represents a finite set of states, s^<t> ∈ S being the state of the agent at time t, commonly defined as the vehicle's position, heading and velocity.

• A represents a finite set of actions allowing the agent to navigate through the environment defined by I^<t>, where a^<t> ∈ A is the action performed by the agent at time t.

• T : S × A × S → [0, 1] is a stochastic transition function, where T_{s^<t>,a^<t>}^{s^<t+1>} describes the probability of arriving in state s^<t+1> after performing action a^<t> in state s^<t>.

• R : S × A × S → R is a scalar reward function which controls the estimation of a, where R_{s^<t>,a^<t>}^{s^<t+1>} ∈ R. For a state transition s^<t> → s^<t+1> at time t, we define a scalar reward function R_{s^<t>,a^<t>}^{s^<t+1>} which quantifies how well the agent performed in reaching the next state.

• γ is the discount factor controlling the importance of future versus immediate rewards.

Considering the proposed reward function and an arbitrary state trajectory [s^<0>, s^<1>, ..., s^<k>] in observation space, at any time t̂ ∈ [0, 1, ..., k], the associated cumulative future discounted reward is defined as:

    R^<t̂> = ∑_{t=t̂}^{k} γ^{t−t̂} r^<t>,    (9)

where the immediate reward at time t is given by r^<t>. In RL theory, the statement in Eq. 9 is known as a finite horizon learning episode of sequence length k [18].

The objective in RL is to find the desired trajectory policy that maximizes the associated cumulative future reward. We define the optimal action-value function Q∗(·, ·), which estimates the maximal future discounted reward when starting in state s^<t> and performing actions [a^<t>, ..., a^<t+k>]:

    Q∗(s, a) = max_π E[R^<t̂> | s^<t̂> = s, a^<t̂> = a, π],    (10)

where π is an action policy, viewed as a probability density function over a set of possible actions that can take place in a given state. The optimal action-value function Q∗(·, ·) maps a given state to the optimal action policy of the agent in any state:

    ∀s ∈ S : π∗(s) = arg max_{a ∈ A} Q∗(s, a).    (11)
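The discounted return in Eq. (9) and the greedy policy extraction in Eq. (11) are straightforward to compute once an estimate of Q∗ is available; the sketch below uses an arbitrary tabular Q with two states and two actions purely for illustration.

import numpy as np

def discounted_return(rewards, gamma, t_hat=0):
    """Eq. (9): R^{<t_hat>} = sum_{t=t_hat}^{k} gamma^{t - t_hat} * r^{<t>}."""
    return sum(gamma ** (t - t_hat) * r for t, r in enumerate(rewards) if t >= t_hat)

def greedy_policy(Q_table):
    """Eq. (11): pi*(s) = argmax_a Q*(s, a), for every state of a tabular Q."""
    return np.argmax(Q_table, axis=1)

rewards = [0.0, 0.0, 1.0, -0.5]               # immediate rewards r^{<t>} along a trajectory
print(discounted_return(rewards, gamma=0.9))  # 0.4455
Q_table = np.array([[0.2, 0.8],               # Q*(s0, a0), Q*(s0, a1)
                    [0.5, 0.1]])              # Q*(s1, a0), Q*(s1, a1)
print(greedy_policy(Q_table))                 # [1 0]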
The optimal action-value function Q∗ satisfies the Bellman optimality equation [19], which is a recursive formulation of Eq. 10:

    Q∗(s, a) = ∑_{s′} T_{s,a}^{s′} (R_{s,a}^{s′} + γ · max_{a′} Q∗(s′, a′))
             = E_{s′} [R_{s,a}^{s′} + γ · max_{a′} Q∗(s′, a′)],    (12)

where s′ represents a possible state visited after s = s^<t> and a′ is the corresponding action policy. The model-based policy iteration algorithm was introduced in [18], based on the proof that the Bellman equation is a contraction mapping [20] when written as an operator ν:

    ∀Q, lim_{n→∞} ν^{(n)}(Q) = Q∗.    (13)

However, the standard reinforcement learning method described above is not feasible in high dimensional state spaces. In autonomous driving applications, the observation space is mainly composed of sensory information made up of images, radar, LiDAR, etc. Instead of the traditional approach, a non-linear parametrization of Q∗ can be encoded in the layers of a deep neural network. In the literature, such a non-linear approximator is called a Deep Q-Network (DQN) [21] and is used for estimating the approximate action-value function:

    Q(s^<t>, a^<t>; Θ) ≈ Q∗(s^<t>, a^<t>),    (14)

where Θ represents the parameters of the Deep Q-Network.

By taking into account the Bellman optimality equation 12, it is possible to train a deep Q-network in a reinforcement learning manner through the minimization of the mean squared error. The optimal expected Q value can be estimated within a training iteration i based on a set of reference parameters Θ̄_i calculated in a previous iteration i′:

    y = R_{s,a}^{s′} + γ · max_{a′} Q(s′, a′; Θ̄_i),    (15)

where Θ̄_i := Θ_{i′}. The new estimated network parameters at training step i are evaluated using the following squared error function:

    J_{Θ_i} = min_{Θ_i} E_{s,y,r,s′} [(y − Q(s, a; Θ_i))²],    (16)

where r = R_{s,a}^{s′}. Based on Eq. 16, the maximum likelihood estimation function from Eq. 8 can be applied for calculating the weights of the deep Q-network. The gradient is approximated with random samples and the backpropagation algorithm, which uses stochastic gradient descent for training:

    ∇_{Θ_i} = E_{s,a,r,s′} [(y − Q(s, a; Θ_i)) ∇_{Θ_i} Q(s, a; Θ_i)].    (17)

The deep reinforcement learning community has made several independent improvements to the original DQN algorithm [21]. A study on how to combine these improvements has been provided by DeepMind in [22], where the combined algorithm, entitled Rainbow, was able to outperform the independently competing methods. DeepMind [22] proposes six extensions to the base DQN, each addressing a distinct concern:

• Double Q Learning addresses the overestimation bias and decouples the selection of an action and its evaluation;

• Prioritized replay samples more frequently from the data in which there is information to learn;

• Dueling Networks aim at enhancing value based RL;

• Multi-step learning is used for training speed improvement;

• Distributional RL improves the target distribution in the Bellman equation;

• Noisy Nets improve the ability of the network to ignore noisy inputs and allow state-conditional exploration.

All of the above complementary improvements have been tested on the Atari 2600 challenge. A good implementation of DQN for autonomous vehicles should start by combining the stated DQN extensions with respect to a desired performance. Given the advancements in deep reinforcement learning, the direct application of the algorithm still needs a training pipeline in which one should simulate and model the desired self-driving car's behavior.

The simulated environment state is not directly accessible to the agent. Instead, sensor readings provide clues about the true state of the environment. In order to decode the true environment state, it is not sufficient to map a single snapshot of sensor readings. The temporal information should also be included in the network's input, since the environment's state is modified over time. An example of DQN applied to autonomous vehicles in a simulator can be found in [23].

DQN has been developed to operate in discrete action spaces. In the case of an autonomous car, the discrete actions would translate to discrete commands, such as turn left, turn right, accelerate, or brake. The DQN approach described above has been extended to continuous action spaces based on policy gradient estimation [24]. The method in [24] describes a model-free actor-critic algorithm able to learn different continuous control tasks directly from raw pixel inputs. A model-based solution for continuous Q-learning is proposed in [25].

Although continuous control with DRL is possible, the most common strategy for DRL in autonomous driving is based on discrete control [26]. The main challenge here is the training, since the agent has to explore its environment, usually through learning from collisions. Such systems, trained solely on simulated data, tend to learn a biased version of the driving environment. A solution here is to use Imitation Learning methods, such as Inverse Reinforcement Learning (IRL) [27], to learn from human driving demonstrations without needing to explore unsafe actions.

4 Deep Learning for Driving Scene Perception and Localization

Self-driving technology enables a vehicle to operate autonomously by perceiving the environment and responding accordingly. In the following, we give an overview of the top methods used in driving scene understanding, considering camera based vs. LiDAR environment perception. We survey object detection and recognition, semantic segmentation and localization in autonomous driving, as well as scene understanding using occupancy maps. Surveys dedicated to autonomous vision and environment perception can be found in [28] and [29].

4.1 Sensing Hardware: Camera vs. LiDAR Debate

Deep learning methods are particularly well suited for detecting and recognizing objects in 2D images and 3D point clouds acquired from video cameras and LiDAR (Light Detection and Ranging) devices, respectively.

In the autonomous driving community, 3D perception is mainly based on LiDAR sensors, which provide a direct 3D representation of the surrounding environment in the form of 3D point clouds. The performance of a LiDAR is measured in terms of field of view, range, resolution and rotation/frame rate. 3D sensors, such as Velodyne®, usually have a 360° horizontal field of view. In order to operate at high speeds, an autonomous vehicle requires a minimum of 200 m range, allowing the vehicle to react to changes in road conditions in time. The 3D object detection precision is dictated by the resolution of the sensor, with the most advanced LiDARs being able to provide a 3 cm accuracy.

Recent debate has sparked around camera vs. LiDAR sensing technologies. Tesla® and Waymo®, two of the companies leading the development of self-driving technology [30], have different philosophies with respect to their main perception sensor, as well as regarding the targeted SAE level [4]. Waymo® is building their vehicles directly as Level 5 systems, with currently more than 10 million miles driven autonomously². On the other hand, Tesla® deploys its AutoPilot as an ADAS (Advanced Driver Assistance System) component, which customers can turn on or off at their convenience. The advantage of Tesla® resides in its large training database, consisting of more than 1 billion driven miles³. The database has been acquired by collecting data from customer-owned cars.

The main sensing technologies differ between the two companies. Tesla® tries to leverage its camera systems, whereas Waymo's driving technology relies more on LiDAR sensors⁴. The sensing approaches have advantages and disadvantages. LiDARs have high resolution and precise perception even in the dark, but are vulnerable to bad weather conditions (e.g. heavy rain) [31] and involve moving parts. In contrast, cameras are cost efficient, but lack depth perception and cannot work in the dark. Cameras are also sensitive to bad weather, if the weather conditions are obstructing the field of view.

Researchers at Cornell University tried to replicate LiDAR-like point clouds from visual depth estimation [32]. An estimated depth map is reprojected into 3D space, with respect to the left sensor's coordinate system of a stereo camera. The resulting point cloud is referred to as pseudo-LiDAR. The pseudo-LiDAR data can be further fed to 3D deep learning processing methods, such as PointNet [33] or AVOD [34]. The success of image based 3D estimation is of high importance to the large scale deployment of autonomous cars, since the LiDAR is arguably one of the most expensive hardware components in a self-driving vehicle.

Apart from these sensing technologies, radar and ultrasonic sensors are used to enhance perception capabilities. For example, alongside three LiDAR sensors, Waymo also makes use of five radars and eight cameras, while Tesla® cars are equipped with eight cameras, 12 ultrasonic sensors and one forward-facing radar.

² https://arstechnica.com/cars/2018/10/waymo-has-driven-10-million-miles-on-public-roads/
³ https://electrek.co/2018/11/28/tesla-autopilot-1-billion-miles/
⁴ https://www.theverge.com/transportation/2018/4/19/17204044/tesla-waymo-self-driving-car-data-simulation

4.2 Driving Scene Understanding

An autonomous car should be able to detect traffic participants and drivable areas, particularly in urban areas where a wide variety of object appearances and occlusions may appear. Deep learning based perception, in particular Convolutional Neural Networks (CNNs), became the de-facto standard in object detection and recognition, obtaining remarkable results in competitions such as the ImageNet Large Scale Visual Recognition Challenge [35].

Different neural network architectures are used to detect objects as 2D regions of interest [36] [37] [38] [39] [40] [41] or pixel-wise segmented areas in images [42] [43] [44] [45], as 3D bounding boxes in LiDAR point clouds [33] [46] [47], as well as 3D representations of objects in combined camera-LiDAR data [48] [49] [34]. Examples of scene perception results are illustrated in Fig. 3. Being richer in information, image data is more suited for the object recognition task. However, the real-world 3D positions of the detected objects have to be estimated, since depth information is lost in the projection of the imaged scene onto the imaging sensor.

Figure 3: Examples of scene perception results. (a) 2D object detection in images. (b) 3D bounding box detector applied on LiDAR data. (c) Semantic segmentation results on images.

4.2.1 Bounding-Box-Like Object Detectors

The most popular architectures for 2D object detection in images are single stage and double stage detectors. Popular single stage detectors are "You Only Look Once" (Yolo) [36] [50] [51], the Single Shot multibox Detector (SSD) [52], CornerNet [37] and RefineNet [38]. Double stage detectors, such as RCNN [53], Faster-RCNN [54], or R-FCN [41], split the object detection process into two parts: region of interest candidate proposals and bounding

box classification. In general, single stage detectors do not provide the same performance as double stage detectors, but are significantly faster.

If in-vehicle computation resources are scarce, one can use detectors such as SqueezeNet [40] or [55], which are optimized to run on embedded hardware. These detectors usually have a smaller neural network architecture, making it possible to detect objects using a reduced number of operations, at the cost of detection accuracy.

A comparison between the object detectors described above is given in Figure 4, based on the Pascal VOC 2012 dataset and their measured mean Average Precision (mAP) with an Intersection over Union (IoU) value equal to 50 and 75, respectively.

Figure 4: Object detection and recognition performance comparison. The evaluation has been performed on the Pascal VOC 2012 benchmarking database. The first four methods on the right represent single stage detectors, while the remaining six are double stage detectors. Due to their increased complexity, the runtime performance in Frames-per-Second (FPS) is lower for the case of double stage detectors.

A number of publications showcased object detection on raw 3D sensory data, as well as for combined video and LiDAR information. PointNet [33] and VoxelNet [46] are designed to detect objects solely from 3D data, providing also the 3D positions of the objects. However, point clouds alone do not contain the rich visual information available in images. In order to overcome this, combined camera-LiDAR architectures are used, such as Frustum PointNet [48], Multi-View 3D networks (MV3D) [49], or RoarNet [56].

The main disadvantage of using a LiDAR in the sensory suite of a self-driving car is primarily its cost⁵. A solution here would be to use neural network architectures such as AVOD (Aggregate View Object Detection) [34], which leverage LiDAR data only for training, while images are used during both training and deployment. At deployment stage, AVOD is able to predict 3D bounding boxes of objects solely from image data. In such a system, a LiDAR sensor is necessary only for training data acquisition, much like the cars used today to gather road data for navigation maps.

⁵ https://techcrunch.com/2019/03/06/waymo-to-start-selling-standalone-lidar-sensors/

4.2.2 Semantic and Instance Segmentation

Driving scene understanding can also be achieved using semantic segmentation, representing the categorical labeling of each pixel in an image. In the autonomous driving context, pixels can be marked with categorical labels representing drivable area, pedestrians, traffic participants, buildings, etc. It is one of the high-level tasks that paves the way towards complete scene understanding, being used in applications such as autonomous driving, indoor navigation, or virtual and augmented reality.

Semantic segmentation networks like SegNet [42], ICNet [43], ENet [57], AdapNet [58], or Mask R-CNN [45] are mainly encoder-decoder architectures with a pixel-wise classification layer. These are based on building blocks from common network topologies, such as AlexNet [1], VGG-16 [59], GoogLeNet [60], or ResNet [61].

As in the case of bounding-box detectors, efforts have been made to improve the computation time of these systems on embedded targets. In [44] and [57], the authors proposed approaches to speed up data processing and inference on embedded devices for autonomous driving. Both architectures are light networks providing similar results to SegNet, with a reduced computation cost.

The robustness objective for semantic segmentation was tackled for optimization in AdapNet [58]. The model is capable of robust segmentation in various environments by adaptively learning features of expert networks based on scene conditions.

A combined bounding-box object detector and semantic segmentation result can be obtained using architectures such as Mask R-CNN [45]. The method extends the effectiveness of Faster-RCNN to instance segmentation by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

Figure 5 shows test results performed on four key semantic segmentation networks, based on the CityScapes dataset. The per-class mean Intersection over Union (mIoU) refers to multi-class segmentation, where each pixel is labeled as belonging to a specific object class, while per-category mIoU refers to foreground (object) - background (non-object) segmentation. The input samples have a size of 480px × 320px.

Figure 5: Semantic segmentation performance comparison on the CityScapes dataset [74]. The input samples are 480px × 320px images of driving scenes.
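Both the detector comparison in Figure 4 and the segmentation comparison in Figure 5 rely on Intersection over Union. The sketch below computes IoU for axis-aligned boxes and a per-class mean IoU for integer label maps; it is a simplified stand-in for the benchmark-specific evaluation protocols, not the official Pascal VOC or CityScapes tooling.

import numpy as np

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_iou(pred, target, num_classes):
    """Per-class mean IoU for integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))     # 25 / 175 = 0.1428...
labels = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
print(mean_iou(labels, labels, num_classes=3))     # 1.0 for a perfect prediction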

4.2.3 Localization

Localization algorithms aim at calculating the pose (position and orientation) of the autonomous vehicle as it navigates. Although this can be achieved with systems such as GPS, in the following we will focus on deep learning techniques for visual based localization.

Visual Localization, also known as Visual Odometry (VO), is typically determined by matching keypoint landmarks in consecutive video frames. Given the current frame, these keypoints are used as input to a perspective-n-point mapping algorithm for computing the pose of the vehicle with respect to the previous frame. Deep learning can be used to improve the accuracy of VO by directly influencing the precision of the keypoint detector. In [62], a deep neural network has been trained for learning keypoint distractors in monocular VO. The so-called learned ephemerality mask acts as a rejection scheme for keypoint outliers which might decrease the vehicle localization's accuracy. The structure of the environment can be mapped incrementally with the computation of the camera pose. These methods belong to the area of Simultaneous Localization and Mapping (SLAM). For a survey on classical SLAM techniques, we refer the reader to [63].

Neural networks such as PoseNet [64], VLocNet++ [65], or the approaches introduced in [66], [67], [68], [69], or [70] use image data to estimate the 3D pose of a camera in an End2End fashion. Scene semantics can be derived together with the estimated pose [65].

LiDAR intensity maps are also suited for learning a real-time, calibration-agnostic localization for autonomous cars [71]. The method uses a deep neural network to build a learned representation of the driving scene from LiDAR sweeps and intensity maps. The localization of the vehicle is obtained through convolutional matching. In [72], laser scans and a deep neural network are used to learn descriptors for localization in urban and natural environments.

In order to safely navigate the driving scene, an autonomous car should be able to estimate the motion of the surrounding environment, also known as scene flow. Previous LiDAR based scene flow estimation techniques mainly relied on manually designed features. In recent articles, we have noticed a tendency to replace these classical methods with deep learning architectures able to automatically learn the scene flow. In [73], an encoding deep network is trained on occupancy grids with the purpose of finding matching or non-matching locations between successive timesteps.

Although much progress has been reported in the area of deep learning based localization, VO techniques are still dominated by classical keypoint matching algorithms, combined with acceleration data provided by inertial sensors. This is mainly due to the fact that keypoint detectors are computationally efficient and can be easily deployed on embedded devices.

4.3 Perception using Occupancy Maps

An occupancy map, also known as an Occupancy Grid (OG), is a representation of the environment which divides the driving space into a set of cells and calculates the occupancy probability for each cell. Popular in robotics [72], [75], the OG representation became a suitable solution for self-driving vehicles. A couple of OG data samples are shown in Fig. 6.

Figure 6: Examples of Occupancy Grids (OG). The images show a snapshot of the driving environment together with its respective occupancy grid [80].

Deep learning is used in the context of occupancy maps either for dynamic object detection and tracking [76], probabilistic estimation of the occupancy map surrounding the vehicle [77], [78], or for deriving the driving scene context [79], [80]. In the latter case, the OG is constructed by accumulating data over time, while a deep neural net is used to label the environment into driving context classes, such as highway driving, parking area, or inner-city driving.

Occupancy maps represent an in-vehicle virtual environment, integrating perceptual information in a form better suited for path planning and motion control. Deep learning plays an important role in the estimation of OG, since the information used to populate the grid cells is inferred from processing image and LiDAR data using the scene perception methods described in this chapter of the survey.
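A minimal way to populate an occupancy grid of the kind shown in Fig. 6 is the classical Bayesian log-odds update, in which each cell accumulates evidence over successive observations. The grid size, cell resolution and inverse sensor model below are illustrative assumptions; in the deep learning variants surveyed above, the per-cell evidence would instead come from a learned perception network.

import numpy as np

class OccupancyGrid:
    """2D occupancy grid storing per-cell log-odds of being occupied."""
    def __init__(self, size_m=40.0, cell_m=0.5):
        n = int(size_m / cell_m)
        self.cell_m = cell_m
        self.log_odds = np.zeros((n, n))       # zero log-odds == 0.5 probability

    def update(self, hits, p_occ=0.7, p_free=0.4):
        """Bayesian log-odds update for cells observed as occupied (hits)."""
        for x_m, y_m in hits:                  # obstacle detections in metres
            i = int(x_m / self.cell_m)
            j = int(y_m / self.cell_m)
            if 0 <= i < self.log_odds.shape[0] and 0 <= j < self.log_odds.shape[1]:
                self.log_odds[i, j] += np.log(p_occ / (1.0 - p_occ))
        # All cells drift slowly towards free space between confirmations (toy decay).
        self.log_odds -= 0.01 * np.log((1.0 - p_free) / p_free)

    def probabilities(self):
        return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds))

grid = OccupancyGrid()
for _ in range(5):                             # accumulate five identical scans
    grid.update(hits=[(10.0, 12.5), (10.5, 12.5)])
print(grid.probabilities()[20, 25])            # cell at (10.0 m, 12.5 m) -> close to 1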

5 Deep Learning for Path Planning and Behavior Arbitration

The ability of an autonomous car to find a route between two points, that is, a start position and a desired location, represents path planning. According to the path planning process, a self-driving car should consider all possible obstacles that are present in the surrounding environment and calculate a trajectory along a collision-free route. As stated in [81], autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, all while navigating unstructured urban roadways. The literature findings point to a non-trivial policy that should handle safety in driving. Considering a reward function R(s̄) = −r for an accident event that should be avoided and R(s̄) ∈ [−1, 1] for the rest of the trajectories, the goal is to learn to perform difficult maneuvers smoothly and safely.

This emerging topic of optimal path planning for autonomous cars should operate at high computation speeds, in order to obtain short reaction times, while satisfying specific optimization criteria. The survey in [82] provides a general overview of path planning in the automotive context. It addresses the taxonomy aspects of path planning, namely the mission planner, behavior planner and motion planner. However, [82] does not include a review of deep learning technologies, although the state-of-the-art literature has revealed an increased interest in using deep learning technologies for path planning and behavior arbitration. In the following, we discuss two of the most representative deep learning paradigms for path planning, namely Imitation Learning (IL) [83], [84], [85] and Deep Reinforcement Learning (DRL) based planning [86] [87].

The goal in Imitation Learning [83], [84], [85] is to learn the behavior of a human driver from recorded driving experiences [88]. The strategy implies a vehicle teaching process from human demonstration, with CNNs employed to learn planning from imitation. For example, NeuroTrajectory [85] is a perception-planning deep neural network that learns the desired state trajectory of the ego-vehicle over a finite prediction horizon. Imitation learning can also be framed as an Inverse Reinforcement Learning (IRL) problem, where the goal is to learn the reward function from a human driver [89], [27]. Such methods use real drivers' behaviors to learn reward functions and to generate human-like driving trajectories.

DRL for path planning deals mainly with learning driving trajectories in a simulator [81], [90], [86] [87]. The real environmental model is abstracted and transformed into a virtual environment, based on a transfer model. In [81], it is stated that the objective function cannot ensure functional safety without causing a serious variance problem. The proposed solution for this issue is to construct a policy function composed of learnable and non-learnable parts. The learnable policy tries to maximize a reward function (which includes comfort, safety, overtake opportunity, etc.). At the same time, the non-learnable policy follows the hard constraints of functional safety, while maintaining an acceptable level of comfort.

Both IL and DRL for path planning have advantages and disadvantages. IL has the advantage that it can be trained with data collected from the real world. Nevertheless, this data is scarce on corner cases (e.g. driving off-lanes, vehicle crashes, etc.), making the trained network's response uncertain when confronted with unseen data. On the other hand, although DRL systems are able to explore different driving situations within a simulated world, these models tend to have a biased behavior when ported to the real world.

6 Motion Controllers for AI-based Self-Driving Cars

The motion controller is responsible for computing the longitudinal and lateral steering commands of the vehicle. Learning algorithms are used either as part of Learning Controllers, within the motion control module from Fig. 1(a), or as complete End2End Control Systems which directly map
sensory data to steering commands, as shown in Fig. 1(b).

6.1 Learning Controllers

Traditional controllers make use of an a priori model composed of fixed parameters. When robots or other autonomous systems are used in complex environments, such as driving, traditional controllers cannot foresee every possible situation that the system has to cope with. Unlike controllers with fixed parameters, learning controllers make use of training information to learn their models over time. With every gathered batch of training data, the approximation of the true system model becomes more accurate, thus enabling model flexibility, consistent uncertainty estimates and anticipation of repeatable effects and disturbances that cannot be modeled prior to deployment [91]. Consider the following nonlinear, state-space system:

    z^<t+1> = f_true(z^<t>, u^<t>),    (18)

with observable state z^<t> ∈ R^n and control input u^<t> ∈ R^m, at discrete time t. The true system f_true is not known exactly and is approximated by the sum of an a-priori model and a learned dynamics model:

    z^<t+1> = f(z^<t>, u^<t>) + h(z^<t>),    (19)

where f(·) is the a-priori model and h(·) is the learned dynamics model.

In previous works, learning controllers have been introduced based on simple function approximators, such as Gaussian Process (GP) modeling [92], [93], [91], [94], or Support Vector Regression [95].

Learning techniques are commonly used to learn a dynamics model which in turn improves an a priori system model in Iterative Learning Control (ILC) [96], [97], [98], [99] and Model Predictive Control (MPC) [100], [101], [91], [94], [102], [103], [104], [105], [106].

Iterative Learning Control (ILC) is a method for controlling systems which work in a repetitive mode, such as path tracking in self-driving cars. It has been successfully applied to navigation in off-road terrain [96], autonomous car parking [97] and modeling of steering dynamics in an autonomous race car [98]. Multiple benefits are highlighted, such as the usage of a simple and computationally light feedback controller, as well as a decreased controller design effort (achieved by predicting path disturbances and platform dynamics).

Model Predictive Control (MPC) [107] is a control strategy that computes control actions by solving an optimization problem. It has received significant attention in the last two decades due to its ability to handle complex nonlinear systems with state and input constraints. The central idea behind MPC is to calculate control actions at each sampling time by minimizing a cost function over a short time horizon, while considering observations, input-output constraints and the system's dynamics given by a process model. A general review of MPC techniques for autonomous robots is given in [108].

Learning has been used in conjunction with MPC to learn driving models [100], [101], driving dynamics for race cars operating at their handling limits [102], [103], [104], as well as to improve path tracking accuracy [109], [91], [94]. These methods use learning mechanisms to identify nonlinear dynamics that are used in the MPC's trajectory cost function optimization. This enables one to better predict disturbances and the behavior of the vehicle, leading to optimal comfort and safety constraints applied to the control inputs. Training data is usually in the form of past vehicle states and observations. For example, CNNs can be used to compute a dense occupancy grid map in a local robot-centric coordinate system. The grid map is further passed to the MPC's cost function for optimizing the trajectory of the vehicle over a finite prediction horizon.

A major advantage of learning controllers is that they optimally combine traditional model-based control theory with learning algorithms. This makes it possible to still use established methodologies for controller design and stability analysis, together with a robust learning component applied at system identification and prediction levels.

6.2 End2End Learning Control

In the context of autonomous driving, End2End Learning Control is defined as a direct mapping from sensory data to control commands. The inputs are usually from a high-dimensional feature space (e.g. images or point clouds). As illustrated in Fig. 1(b), this is opposed to traditional processing pipelines, where at first objects are detected in the input image, after which a path is planned and finally the computed control values are executed. A summary of some of the most popular End2End learning systems is given in Table 1.

End2End learning can also be formulated as a back-propagation algorithm scaled up to complex models. The paradigm was first introduced in the 1990s, when the Autonomous Land Vehicle in a Neural Network (ALVINN) system was built [110]. ALVINN was designed to follow a pre-defined road, steering according to the observed road's curvature. The next milestone in End2End driving is considered to be in the mid 2000s, when DAVE (Darpa Autonomous VEhicle) managed to drive through an obstacle-filled road, after being trained on hours of human driving acquired in similar, but not identical, driving scenarios [111]. Over the last couple of years, the technological advances in computing hardware have facilitated the usage of End2End learning models. The back-propagation algorithm for gradient estimation in deep networks is now efficiently implemented on parallel Graphic Processing Units (GPUs). This kind of processing allows the training of large and complex network architectures, which in turn require huge amounts of training samples (see Section 8).

End2End control papers mainly employ either deep neural networks trained offline on real-world and/or synthetic data [119], [113], [114], [115], [120], [116], [117], [121], [118], or Deep Reinforcement Learning (DRL) systems trained and evaluated in simulation [23], [122], [26].

Table 1: Summary of End2End learning methods.

ALVINN [110] — Problem space: road following. Architecture: 3-layer back-propagation network. Sensor input: camera, laser range finder. Description: ALVINN stands for Autonomous Land Vehicle In a Neural Network. Training has been conducted using simulated road images. Successful tests on the Carnegie Mellon autonomous navigation test vehicle indicate that the network can effectively follow real roads.

DAVE [111] — Problem space: DARPA challenge. Architecture: 6-layer CNN. Sensor input: raw camera images. Description: a vision-based obstacle avoidance system for off-road mobile robots. The robot is a 50 cm off-road truck, with two front color cameras. A remote computer processes the video and controls the robot via radio.

NVIDIA PilotNet [112] — Problem space: autonomous driving in real traffic situations. Architecture: CNN. Sensor input: raw camera images. Description: the system automatically learns internal representations of the necessary processing steps, such as detecting useful road features, with the human steering angle as the training signal.

Novel FCN-LSTM [113] — Problem space: ego-motion prediction. Architecture: FCN-LSTM. Sensor input: large scale video data. Description: a generic vehicle motion model is obtained from large scale crowd-sourced video data, while developing an end-to-end trainable architecture (FCN-LSTM) for predicting a distribution over future vehicle ego-motion data.

Novel C-LSTM [114] — Problem space: steering angle control. Architecture: C-LSTM. Sensor input: camera frames, steering wheel angle. Description: C-LSTM is end-to-end trainable, learning both visual and dynamic temporal dependencies of driving. Additionally, the steering angle regression problem is considered as classification, while imposing a spatial relationship between the output layer neurons.

Drive360 [115] — Problem space: steering angle and velocity control. Architecture: CNN + fully connected layers + LSTM. Sensor input: surround-view cameras, CAN bus reader. Description: the sensor setup provides data for a 360-degree view of the area surrounding the vehicle. A new driving dataset is collected, covering diverse scenarios. A novel driving model is developed by integrating the surround-view cameras with the route planner.

DNN policy [116] — Problem space: steering angle control. Architecture: CNN + FC. Sensor input: camera images. Description: the trained neural network directly maps pixel data from a front-facing camera to steering commands and does not require any other sensors. The controller performance is compared with the steering behavior of a human driver.

DeepPicar [117] — Problem space: steering angle control. Architecture: CNN. Sensor input: camera images. Description: DeepPicar is a small scale replica of a real self-driving car called DAVE-2 by NVIDIA. It uses the same network architecture and can drive itself in real-time using a web camera and a Raspberry Pi 3.

TORCS DRL [23] — Problem space: lane keeping and obstacle avoidance. Architecture: DQN + RNN + CNN. Sensor input: TORCS simulator images. Description: incorporates Recurrent Neural Networks for information integration, enabling the car to handle partially observable scenarios. It also reduces the computational complexity for deployment on embedded hardware.

TORCS E2E [118] — Problem space: steering angle control in a simulated environment (TORCS). Architecture: CNN. Sensor input: TORCS simulator images. Description: the image features are split into three categories (sky-related, roadside-related, and road-related features). Two experimental frameworks are used to investigate the importance of each single feature for training a CNN controller.

Agile Autonomous Driving [106] — Problem space: steering angle and velocity control for aggressive driving. Architecture: CNN. Sensor input: raw camera images. Description: a CNN, referred to as the learner, is trained with optimal trajectory examples provided at training time by an MPC controller. The MPC acts as an expert, encoding the scene dynamics into the layers of the neural network.

WRC6 AD [26] — Problem space: driving in a racing game. Architecture: CNN + LSTM encoder. Sensor input: WRC6 racing game. Description: an Asynchronous Actor-Critic (A3C) framework is used to learn the car control in a physically and graphically realistic rally game, with the agents evolving simultaneously on different tracks.

End2End methods have been popularized in the last couple of years by NVIDIA®, as part of the PilotNet architecture. The approach is to train a CNN which maps raw pixels from a single front-facing camera directly to steering commands [119]. The training data is composed of images and steering commands collected in driving scenarios performed in a diverse set of lighting and weather conditions, as well as on different road types. Prior to training, the data is enriched using augmentation, adding artificial shifts and rotations to the original data.

PilotNet has 250,000 parameters and approx. 27 million connections. The evaluation is performed in two stages: first in simulation and secondly in a test car. An autonomy performance metric represents the percentage of time when the neural network drives the car:

autonomy = (1 − (no. of interventions · 6 sec) / (elapsed time [sec])) · 100.    (20)

An intervention is considered to take place when the simulated vehicle departs from the center line by more than one meter, assuming that 6 seconds is the time needed by a human to retake control of the vehicle and bring it back to the desired state. An autonomy of 98% was reached on a 20 km drive from Holmdel to Atlantic Highlands in NJ, USA. Through training, PilotNet learns how the steering commands are computed by a human driver [112]. The focus is on determining which elements in the input traffic image have the most influence on the network's steering decision. A method for finding the salient object regions in the input image is described, while reaching the conclusion that the low-level features learned by PilotNet are similar to the ones that are relevant to a human driver.
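To make Eq. (20) concrete, the short Python sketch below computes the autonomy percentage from an intervention count and the elapsed test time; the 6-second penalty follows the definition above, while the function and variable names are our own illustrative choices.

```python
def autonomy_percentage(num_interventions: int, elapsed_time_sec: float,
                        penalty_sec: float = 6.0) -> float:
    """Autonomy metric of Eq. (20): percentage of time the network drives the car."""
    if elapsed_time_sec <= 0.0:
        raise ValueError("elapsed_time_sec must be positive")
    return (1.0 - (num_interventions * penalty_sec) / elapsed_time_sec) * 100.0

# Example: 2 human interventions during a 10-minute (600 s) drive -> 98.0% autonomy.
print(autonomy_percentage(num_interventions=2, elapsed_time_sec=600.0))
```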
End2End architectures similar to PilotNet, which map visual data to steering commands, have been reported in [116], [117], [121]. In [113], autonomous driving is formulated as a future ego-motion prediction problem. The introduced FCN-LSTM (Fully Convolutional Network - Long-Short Term Memory) method is designed to jointly train pixel-level supervised tasks using a fully convolutional encoder, together with motion prediction through a temporal encoder. The combination of visual and temporal dependencies of the input data has also been considered in [114], where the C-LSTM (Convolutional Long Short Term Memory) network has been proposed for steering control. In [115], surround-view cameras were used for End2End learning. The claim is that human drivers also use rear and side-view mirrors for driving, thus all the information from around the vehicle needs to be gathered and integrated into the network model in order to output a suitable control command.

To carry out an evaluation of the Tesla® Autopilot system, [120] proposed an End2End Convolutional Neural Network framework. It is designed to determine differences between Autopilot and its own output, taking into consideration edge cases. The network was trained using real data, collected from over 420 hours of real road driving. The comparison between Tesla®'s Autopilot and the proposed framework was done in real-time on a Tesla® car. The evaluation revealed an accuracy of 90.4% in detecting differences between both systems and the control transfer of the car to a human driver.

Another approach to design End2End driving systems is DRL. This is mainly performed in simulation, where an autonomous agent can safely explore different driving strategies. In [23], a DRL End2End system is used to compute steering commands in the TORCS game simulation engine. Considering a more complex virtual environment, [122] proposed an asynchronous advantage Actor-Critic (A3C) method for training a CNN on images and vehicle velocity information. The same idea has been enhanced in [26], obtaining a faster convergence and allowing for more generalization. Both articles rely on the following procedure: receiving the current state of the game, deciding on the next control commands and then getting a reward on the next iteration. The experimental setup benefited from a realistic car game, namely World Rally Championship 6, and also from other simulated environments, like TORCS.
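The interaction procedure used by these DRL-based driving agents (observe the current state, choose control commands, receive a reward on the next iteration) can be summarized by the generic loop below; the environment interface is a hypothetical Gym-style simulator stub, not the actual TORCS or WRC6 bindings used in [23], [122], [26].

```python
import random

class DrivingSimulator:
    """Hypothetical stand-in for a simulator such as TORCS or WRC6."""
    def reset(self):
        return {"image": None, "speed": 0.0}           # initial observation
    def step(self, action):
        next_state = {"image": None, "speed": action["throttle"] * 30.0}
        reward = 1.0 - abs(action["steering"])         # toy reward: drive straight
        done = random.random() < 0.01                  # episode ends occasionally
        return next_state, reward, done

def policy(state):
    """Placeholder for the trained CNN/LSTM policy network."""
    return {"steering": random.uniform(-0.1, 0.1), "throttle": 0.5}

env = DrivingSimulator()
state, total_reward, done = env.reset(), 0.0, False
while not done:                                        # one evaluation episode
    action = policy(state)                             # decide the next control commands
    state, reward, done = env.step(action)             # reward arrives on the next iteration
    total_reward += reward
print("episode return:", total_reward)
```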
The next trend in DRL based control seems to be the inclusion of classical model-based control techniques, as the ones detailed in Section 6.1. The classical controller provides a stable and deterministic model on top of which the policy of the neural network is estimated. In this way, the hard constraints of the modeled system are transferred into the neural network policy [124]. A DRL policy trained on real-world image data has been proposed in [105] and [106] for the task of aggressive driving. In this case, a CNN, referred to as the learner, is trained with optimal trajectory examples provided at training time by a model predictive controller.

7 Safety of Deep Learning in Autonomous Driving

Safety implies the absence of the conditions that cause a system to be dangerous [125]. Demonstrating the safety of a system which is running deep learning techniques depends heavily on the type of technique and the application context. Thus, reasoning about the safety of deep learning techniques requires:

• understanding the impact of the possible failures;

• understanding the context within the wider system;

• defining the assumptions regarding the system context and the environment in which it will likely be used;

• defining what a safe behavior means, including non-functional constraints.

In [126], an example is mapped on the above requirements with respect to a deep learning component. The problem space for the component is pedestrian detection with convolutional neural networks. The top level task of the system is to locate an object of class person from a distance of 100 meters, with a lateral accuracy of +/- 20 cm, a false negative rate of 1% and a false positive rate of 5%. The assumption is that the braking distance and speed are sufficient to react when detecting persons which are 100 meters ahead of the planned trajectory of the vehicle. Alternative sensing methods can be used in order to reduce the overall false negative and false positive rates of the system to an acceptable level. The context information is that the distance and the accuracy shall be mapped to the dimensions of the image frames presented to the CNN.
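As a small illustration of how such a specification can be checked against measured detector performance, the sketch below compares false negative and false positive rates computed from an evaluation set with the 1% / 5% thresholds stated in [126]; the detection counts used here are invented for the example.

```python
def detection_rates(tp: int, fp: int, fn: int, tn: int):
    """False negative and false positive rates from raw detection counts."""
    fn_rate = fn / (tp + fn) if (tp + fn) else 0.0    # missed pedestrians
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0    # spurious detections
    return fn_rate, fp_rate

def meets_spec(fn_rate: float, fp_rate: float,
               max_fn: float = 0.01, max_fp: float = 0.05) -> bool:
    return fn_rate <= max_fn and fp_rate <= max_fp

# Invented evaluation counts for a pedestrian detector at 100 m range.
fn_rate, fp_rate = detection_rates(tp=990, fp=40, fn=10, tn=960)
print(f"FN rate={fn_rate:.3f}, FP rate={fp_rate:.3f}, spec met: {meets_spec(fn_rate, fp_rate)}")
```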
There is no commonly agreed definition for the term safety in the context of machine learning or deep learning. In [127], Varshney defines safety in terms of risk, epistemic uncertainty and the harm incurred by unwanted outcomes. He then analyses the choice of cost function and the appropriateness of minimizing the empirical average training cost.

[128] takes into consideration the problem of accidents in machine learning systems. Such accidents are defined as unintended and harmful behaviors that may emerge from a poor AI system design. The authors present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (avoiding side effects and avoiding reward hacking), an objective function that is too expensive to evaluate frequently (scalable supervision), or undesirable behavior during the learning process (safe exploration and distributional shift).

Enlarging the scope of safety, [129] propose a decision-theoretic definition of safety that applies to a broad set of domains and systems. They define safety to be the reduction or minimization of risk and epistemic uncertainty associated with unwanted outcomes that are severe enough to be seen as harmful. The key points in this definition are: i) the cost of unwanted outcomes has to be sufficiently high in some human sense for events to be harmful, and ii) safety involves reducing both the probability of expected harms, as well as the possibility of unexpected harms.

Regardless of the above empirical definitions and possible interpretations of safety, the use of deep learning components in safety critical systems is still an open question. The ISO 26262 standard for functional safety of road vehicles provides a comprehensive set of requirements for assuring safety, but does not address the unique characteristics of deep learning-based software.

[130] addresses this gap by analyzing the places where machine learning can impact the standard and provides recommendations on how to accommodate this impact. These recommendations are focused towards the direction of identifying the hazards, implementing tools and mechanisms for fault and failure situations, but also ensuring complete training datasets and designing a multi-level architecture. The usage of specific techniques for various stages within the software development life-cycle is desired.

The standard ISO 26262 recommends the use of a Hazard Analysis and Risk Assessment (HARA) method to identify hazardous events in the system and to specify safety goals that mitigate the hazards. The standard has 10 parts. Our focus is on Part 6: product development at the software level, with the standard following the well-known V model for engineering. Automotive Safety Integrity Level (ASIL) refers to a risk classification scheme defined in ISO 26262 for an item (e.g. subsystem) in an automotive system. ASIL represents the degree of rigor required (e.g., testing techniques, types of documentation required, etc.) to reduce risk, where ASIL D represents the highest and ASIL A the lowest risk. If an element is assigned to QM (Quality Management), it does not require safety management. The ASIL assessed for a given hazard is at first assigned to the safety goal set to address the hazard and is then inherited by the safety requirements derived from that goal [130].

According to ISO 26262, a hazard is defined as a "potential source of harm caused by a malfunctioning behavior, where harm is a physical injury or damage to the health of a person" [131]. Nevertheless, a deep learning component can create new types of hazards. An example of such a hazard usually happens because humans think that the automated driver assistance (often developed using learning techniques) is more reliable than it actually is [132].

Due to its complexity, a deep learning component can fail in unique ways. For example, in Deep Reinforcement Learning systems, faults in the reward function can negatively affect the trained model [128]. In such a case, the automated vehicle figures out that it can avoid getting penalized for driving too close to other vehicles by exploiting certain sensor vulnerabilities so that it can't see how close it is getting. Although hazards such as these may be unique to deep reinforcement learning components, they can be traced to faults, thus fitting within the existing guidelines of ISO 26262.

A key requirement for analyzing the safety of deep learning components is to examine whether immediate human costs of outcomes exceed some harm severity thresholds. Undesired outcomes are truly harmful in a human sense and their effect is felt in near real-time. These outcomes can be classified as safety issues. The cost of deep learning decisions is related to optimization formulations which explicitly include a loss function L. The loss function L : X × Y × Y → R is defined as the measure of the error incurred by predicting the label of an observation x as f(x), instead of y. Statistical learning defines the risk of f as the expected value of the loss of f under P:

R(f) = ∫ L(x, f(x), y) dP(x, y),    (21)

where X × Y is a random example space of observations x and labels y, distributed according to a probability distribution P(X, Y). The statistical learning problem consists of finding the function f that optimizes (i.e. minimizes) the risk R [133]. For an algorithm's hypothesis h and loss function L, the expected loss on the training set is called the empirical risk of h:

R_emp(h) = (1/m) ∑_{i=1}^{m} L(x^(i), h(x^(i)), y^(i)).    (22)

A machine learning algorithm then optimizes the empirical risk on the expectation that the risk decreases significantly. However, this standard formulation does not consider the issues related to the uncertainty that is relevant for safety. The distribution of the training samples (x_1, y_1), ..., (x_m, y_m) is drawn from the true underlying probability distribution of (X, Y), which may not always be the case. Usually the probability distribution is unknown, precluding the use of domain adaptation techniques [134], [135]. This is one of the epistemic uncertainties that are relevant for safety, because training on a dataset with a different distribution can cause much harm through bias.

In reality, a machine learning system only encounters a finite number of test samples and the actual operational risk is an empirical quantity on the test set. The operational risk may be much larger than the actual risk for small cardinality test sets, even if h is risk-optimal. This uncertainty caused by the instantiation of the test set can have large safety implications on individual test samples [136].
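A minimal sketch of the empirical risk in Eq. (22), assuming a generic per-sample loss and a plain Python list of labeled observations; it is meant only to mirror the notation above, not any particular training framework.

```python
from typing import Callable, List, Tuple

def empirical_risk(hypothesis: Callable,
                   samples: List[Tuple[float, float]],
                   loss: Callable[[float, float, float], float]) -> float:
    """R_emp(h) = (1/m) * sum_i L(x_i, h(x_i), y_i), as in Eq. (22)."""
    m = len(samples)
    return sum(loss(x, hypothesis(x), y) for x, y in samples) / m

# Example with a squared-error loss and a toy linear hypothesis.
squared_loss = lambda x, prediction, y: (prediction - y) ** 2
h = lambda x: 2.0 * x
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(empirical_risk(h, data, squared_loss))
```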
Faults and failures of a programmed component (e.g. one using a formal algorithm to solve a problem) are totally different from the ones of a deep learning component. Specific faults of a deep learning component can be caused by unreliable or noisy sensor signals (video signal due to bad weather, radar signal due to absorbing construction materials, GPS data, etc.), the neural network topology, the learning algorithm, the training set or unexpected changes in the environment (e.g. unknown driving scenes or accidents on the road). We must mention the first autonomous driving accident, produced by a Tesla® car, where, due to object misclassification errors, the AutoPilot function crashed the vehicle into a truck [137]. Despite the 130 million miles of testing and evaluation, the accident was caused under extremely rare circumstances, also known as Black Swans, given the height of the truck, its white color under a bright sky, combined with the positioning of the vehicle across the road.

Self-driving vehicles must have fail-safe mechanisms, usually encountered under the name of Safety Monitors. These must stop the autonomous control software once a failure is detected [138]. Specific fault types and failures have been cataloged for neural networks in [139], [140] and [141]. This led to the development of specific and focused tools and techniques to help find faults. [142] describes a technique for debugging misclassifications due to bad training data, while an approach for troubleshooting faults due to complex interactions between linked machine learning components is proposed in [143]. In [144], a white box technique is used to inject faults into a neural network by breaking the links or randomly changing the weights.
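The Safety Monitor concept mentioned above can be pictured as a thin supervisory layer that checks simple runtime conditions and forces a safe fallback when they are violated; the sketch below is only a schematic illustration of that idea, with invented thresholds and signals rather than an implementation from the cited works.

```python
from dataclasses import dataclass

@dataclass
class HealthStatus:
    perception_confidence: float   # 0..1, reported by the perception stack
    sensor_ok: bool                # e.g. camera/LiDAR heartbeat present
    planner_latency_ms: float      # duration of the last planning cycle

class SafetyMonitor:
    """Stops the autonomous control software once a failure is detected."""
    def __init__(self, min_confidence: float = 0.6, max_latency_ms: float = 100.0):
        self.min_confidence = min_confidence
        self.max_latency_ms = max_latency_ms

    def is_safe(self, status: HealthStatus) -> bool:
        return (status.sensor_ok
                and status.perception_confidence >= self.min_confidence
                and status.planner_latency_ms <= self.max_latency_ms)

    def supervise(self, status: HealthStatus, nominal_command: dict) -> dict:
        # Fall back to a minimal-risk maneuver (here: brake to a stop) on failure.
        return nominal_command if self.is_safe(status) else {"steering": 0.0, "brake": 1.0}

monitor = SafetyMonitor()
command = monitor.supervise(HealthStatus(0.4, True, 35.0), {"steering": 0.1, "brake": 0.0})
print(command)   # -> fallback command, since the confidence check fails
```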
The training set plays a key role in the safety of the deep learning component. The ISO 26262 standard states that the component behavior shall be fully specified and each refinement shall be verified with respect to its specification. This assumption is violated in the case of a deep learning system, where a training set is used instead of a specification. It is not clear how to ensure that the corresponding hazards are always mitigated. The training process is not a verification process, since the trained model will be correct by construction with respect to the training set, up to the limits of the model and the learning algorithm [130]. The effects of these considerations are visible in the commercial autonomous vehicle market, where Black Swan events caused by data not present in the training set may lead to fatalities [141].

Detailed requirements shall be formulated and traced to hazards. Such a requirement can specify how the training, validation and testing sets are obtained. Subsequently, the data gathered can be verified with respect to this specification. Furthermore, some specifications, for example the fact that a vehicle cannot be wider than 3 meters, can be used to reject false positive detections. Such properties are used even directly during the training process to improve the accuracy of the model [145].

Machine learning and deep learning techniques are starting to become effective and reliable even for safety critical systems, even if the complete safety assurance for this type of systems is still an open question. Current standards and regulations from the automotive industry cannot be fully mapped to such systems, requiring the development of new safety standards targeted at deep learning.

8 Data Sources for Training Autonomous Driving Systems

Undeniably, the usage of real world data is a key requirement for training and testing an autonomous driving component. The high amount of data needed in the development stage of such components made data collection on public roads a valuable activity. In order to obtain a comprehensive description of the driving scene, the vehicle used for data collection is equipped with a variety of sensors such as radar, LiDAR, GPS, cameras, Inertial Measurement Units (IMU) and ultrasonic sensors. The sensor setup differs from vehicle to vehicle, depending on how the data is planned to be used. A common sensor setup for an autonomous vehicle is presented in Fig. 7.

Figure 7: Sensor suite of the nuTonomy® self-driving car [146].

In the last years, mainly due to the large and increasing research interest in autonomous vehicles, many driving datasets were made public and documented. They vary in size, sensor setup and data format. The researchers need only to identify the proper dataset which best fits their problem space. [29] published a survey on a broad spectrum of datasets. These datasets address the computer vision field in general, but there are few of them which fit the autonomous driving topic.
The most comprehensive survey on publicly available datasets for self-driving vehicle algorithms can be found in [147]. The paper presents 27 available datasets containing data recorded on public roads. The datasets are compared from different perspectives, such that the reader can select the one best suited for his task.

Despite our extensive search, we are yet to find a master dataset that combines at least parts of the ones available. The reason may be that there are no standard requirements for the data format and sensor setup. Each dataset heavily depends on the objective of the algorithm for which the data was collected. Recently, the companies Scale® and nuTonomy® started to create one of the largest and most detailed self-driving datasets on the market to date6. This includes Berkeley DeepDrive [148], a dataset developed by researchers at Berkeley University. More relevant datasets from the literature are pending for merging7.

6 https://venturebeat.com/2018/09/14/scale-and-nutonomy-release-nuscenes-a-self-driving-dataset-with-over-1-4-million-images/
7 https://scale.com/open-datasets

In [120], the authors present a study that seeks to collect and analyze large scale naturalistic data of semi-autonomous driving in order to better characterize the state of the art of the current technology. The study involved 99 participants, 29 vehicles, 405,807 miles and approximately 5.5 billion video frames. Unfortunately, the data collected in this study is not available for the public.

In the remainder of this section we will provide and highlight the distinctive characteristics of the most relevant datasets that are publicly available.

KITTI Vision Benchmark dataset (KITTI) [151]. Provided by the Karlsruhe Institute of Technology (KIT) from Germany, this dataset fits the challenges of benchmarking stereo-vision, optical flow, 3D tracking, 3D object detection or SLAM algorithms. It is known as the most prestigious dataset in the self-driving vehicles domain. To this date it counts more than 2000 citations in the literature. The data collection vehicle is equipped with multiple high-resolution color and gray-scale stereo cameras, a Velodyne 3D LiDAR and high-precision GPS/IMU sensors. In total, it provides 6 hours of driving data collected in both rural and highway traffic scenarios around Karlsruhe. The dataset is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

NuScenes dataset [146]. Constructed by nuTonomy, this dataset contains 1000 driving scenes collected from Boston and Singapore, two cities known for their dense traffic and highly challenging driving situations. In order to facilitate common computer vision tasks, such as object detection and tracking, the providers annotated 25 object classes with accurate 3D bounding boxes at 2 Hz over the entire dataset. Collection of vehicle data is still in progress. The final dataset will include approximately 1.4 million camera images, 400,000 Lidar sweeps, 1.3 million RADAR sweeps and 1.1 million object bounding boxes in 40,000 keyframes. The dataset is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Automotive multi-sensor dataset (AMUSE) [149]. Provided by Linköping University of Sweden, it consists of sequences recorded in various environments from a car equipped with an omnidirectional multi-camera, height sensors, an IMU, a velocity sensor and a GPS. The API for reading these data sets is provided to the public, together with a collection of long multi-sensor and multi-camera data streams stored in the given format. The dataset is provided under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Ford campus vision and lidar dataset (Ford) [150]. Provided by the University of Michigan, this dataset was collected using a Ford F250 pickup truck equipped with professional (Applanix POS-LV) and consumer (Xsens MTi-G) inertial measurement units (IMU), a Velodyne Lidar scanner, two push-broom forward looking Riegl Lidars and a Point Grey Ladybug3 omnidirectional camera system. The approx. 100 GB of data was recorded around the Ford Research campus and downtown Dearborn, Michigan in 2009. The dataset is well suited to test various autonomous driving and simultaneous localization and mapping (SLAM) algorithms.

Udacity dataset [152]. The vehicle sensor setup contains monocular color cameras, GPS and IMU sensors, as well as a Velodyne 3D Lidar. The size of the dataset is 223 GB. The data is labeled and the user is provided with the corresponding steering angle that was recorded during the test runs by the human driver.

Cityscapes dataset [74]. Provided by Daimler AG R&D, Germany, the Max Planck Institute for Informatics (MPI-IS), Germany, and the TU Darmstadt Visual Inference Group, Germany, the Cityscapes dataset focuses on semantic understanding of urban street scenes, this being the reason for which it contains only stereo vision color images. The diversity of the images is very large: 50 cities, different seasons (spring, summer, fall), various weather conditions and different scene dynamics. There are 5000 images with fine annotations and 20000 images with coarse annotations. Two important challenges have used this dataset for benchmarking the development of algorithms for semantic segmentation [157] and instance segmentation [158].

The Oxford dataset [153]. Provided by Oxford University, UK, the dataset collection spanned over 1 year, resulting in over 1000 km of recorded driving with almost 20 million images collected from 6 cameras mounted to the vehicle, along with LiDAR, GPS and INS ground truth. Data was collected in all weather conditions, including heavy rain, night, direct sunlight and snow. One of the particularities of this dataset is that the vehicle frequently drove the same route over the period of a year to enable researchers to investigate long-term localization and mapping for autonomous vehicles in real-world, dynamic urban environments.
Dataset | Problem Space | Sensor setup | Size | Location | Traffic condition | License
NuScenes [146] | 3D tracking, 3D object detection | Radar, Lidar, EgoData, GPS, IMU, Camera | 345 GB (1000 scenes, clips of 20s) | Boston, Singapore | Urban | CC BY-NC-SA 3.0
AMUSE [149] | SLAM | Omnidirectional camera, IMU, EgoData, GPS | 1 TB (7 clips) | Los Angeles | Urban | CC BY-NC-ND 3.0
Ford [150] | 3D tracking, 3D object detection | Omnidirectional camera, IMU, Lidar, GPS | 100 GB | Michigan | Urban | Not specified
KITTI [151] | 3D tracking, 3D object detection, SLAM | Monocular cameras, IMU, Lidar, GPS | 180 GB | Karlsruhe | Urban, Rural | CC BY-NC-SA 3.0
Udacity [152] | 3D tracking, 3D object detection | Monocular cameras, IMU, Lidar, GPS, EgoData | 220 GB | Mountain View | Rural | MIT
Cityscapes [74] | Semantic understanding | Color stereo cameras | 63 GB (5 clips) | Darmstadt, Zurich, Strasbourg | Urban | CC BY-NC-SA 3.0
Oxford [153] | 3D tracking, 3D object detection, SLAM | Stereo and monocular cameras, GPS, Lidar, IMU | 23 TB (133 clips) | Oxford | Urban, Highway | CC BY-NC-SA 3.0
CamVid [154] | Object detection, Segmentation | Monocular color camera | 8 GB (4 clips) | Cambridge | Urban | N/A
Daimler pedestrian [155] | Pedestrian detection, Classification, Segmentation, Path prediction | Stereo and monocular cameras | 91 GB (8 clips) | Amsterdam, Beijing | Urban | N/A
Caltech [156] | Tracking, Segmentation, Object detection | Monocular camera | 11 GB | Los Angeles (USA) | Urban | N/A

Table 2: Summary of datasets for training autonomous driving systems
The Cambridge-driving Labeled Video Dataset (CamVid) [154]. Provided by the University of Cambridge, UK, it is one of the most cited datasets in the literature and the first released publicly, containing a collection of videos with object class semantic labels, along with metadata annotations. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The sensor setup is based on only one monocular camera mounted on the dashboard of the vehicle. The complexity of the scenes is quite low, the vehicle being driven only in urban areas with relatively low traffic and good weather conditions.

The Daimler pedestrian benchmark dataset [155]. Provided by Daimler AG R&D and the University of Amsterdam, this dataset fits the topics of pedestrian detection, classification, segmentation and path prediction. Pedestrian data is observed from a traffic vehicle by using only on-board mono and stereo cameras. It is the first dataset which contains pedestrians. Recently, the dataset was extended with cyclist video samples captured with the same setup [159].

Caltech pedestrian detection dataset (Caltech) [156]. Provided by the California Institute of Technology, US, the dataset contains richly annotated videos, recorded from a moving vehicle, with challenging images of low resolution and frequently occluded people. There are approx. 10 hours of driving scenarios cumulating about 250,000 frames with a total of 350 thousand bounding boxes and 2,300 unique pedestrian annotations. The annotations include both temporal correspondences between bounding boxes and detailed occlusion labels.

Given the variety and complexity of the available databases, choosing one or more to develop and test an autonomous driving component may be difficult. As it can be observed, the sensor setup varies among all the available databases. For localization and vehicle motion, the Lidar and GPS/IMU sensors are necessary, with the most popular Lidar sensors used being Velodyne [160] and Sick [161]. Data recorded from a radar sensor is present only in the NuScenes dataset. The radar manufacturers adopt proprietary data formats which are not public. Almost all available datasets include images captured from a video camera, while there is a balanced use of monocular and stereo cameras, mainly configured to capture gray-scale images. AMUSE and Ford databases are the only ones that use omnidirectional cameras.
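One practical way to narrow down the choice is to encode a few of the attributes from Table 2 and filter them by the sensors a given task requires; the snippet below does this for a small, hand-copied subset of the table and is only meant as an illustration of the selection process, not an official index of the datasets.

```python
# Minimal, hand-copied subset of Table 2 (dataset name -> available sensor modalities).
DATASETS = {
    "NuScenes":   {"radar", "lidar", "gps", "imu", "camera"},
    "KITTI":      {"lidar", "gps", "imu", "camera"},
    "Oxford":     {"lidar", "gps", "imu", "camera"},
    "Cityscapes": {"camera"},
    "CamVid":     {"camera"},
}

def suitable_datasets(required_sensors: set) -> list:
    """Return the datasets that provide at least the required sensor modalities."""
    return sorted(name for name, sensors in DATASETS.items()
                  if required_sensors <= sensors)

# Example: localization work typically needs LiDAR plus GPS/IMU.
print(suitable_datasets({"lidar", "gps", "imu"}))   # ['KITTI', 'NuScenes', 'Oxford']
```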
Besides raw recorded data, the datasets usually contain miscellaneous files such as annotations, calibration files, labels, etc. In order to cope with these files, the dataset provider must offer tools and software that enable the user to read and post-process the data. Splitting of the datasets is also an important factor to consider, because some of the datasets (e.g. Caltech, Daimler, Cityscapes) already provide pre-processed data that is classified in different sets: training, testing and validation. This enables consistent benchmarking of the desired algorithms against similar approaches.
sired algorithms against similar approaches to be consistent. Pegasus.
Another aspect to consider is the license type. The most DRIVE AGX Xavier is a scalable open platform that can
commonly used license is Creative Commons Attribution- serve as an AI brain for self driving vehicles, and is an
NonCommercial-ShareAlike 3.0. It allows the user to energy-efficient computing platform, with 30 trillion oper-
copy and redistribute in any medium or format and also ations per second, while meeting automotive standards like
to remix, transform, and build upon the material. KITTI the ISO 26262 functional safety specification. NVIDIA®
and NuScenes databases are examples of such distribution DRIVE AGX Pegasus improves the performance with an
license. The Oxford database uses a Creative Commons architecture which is built on two NVIDIA® Xavier proces-
Attribution-Noncommercial 4.0. which, compared with the sors and two state of the art TensorCore GPUs.
first license type, does not force the user to distribute his A hardware platform used by the car makers for Ad-
contributions under the same license as the database. Oppo- vanced Driver Assistance Systems (ADAS) is the R-Car
site to that, the AMUSE database is licensed under Creative V3H system-on-chip (SoC) platform from Renesas Auton-
Commons Attribution-Noncommercial-noDerivs 3.0 which omy [168]. This SoC provides the possibility to imple-
makes the database illegal to distribute if modification of the ment high performance computer vision with low power
material are made. consumption. R-Car V3H is optimized for applications
With very few exceptions, the datasets are collected from that involve the usage of stereo cameras, containing ded-
a single city, which is usually around university campuses or icated hardware for convolutional neural networks, dense
company locations in Europe, the US, or Asia. Germany is optical flow, stereo-vision, and object classification. The
the most active country for driving recording vehicles. Un- hardware features four 1.0 GHz Arm Cortex-A53 MPCore
fortunately, all available datasets together cover a very small cores, which makes R-Car V3H a suitable hardware plat-
portion of the world map. One reason for this is the memory form which can be used to deploy trained inference engines
size of the data which is in direct relation with the sensor for solving specific deep learning tasks inside the automo-
setup and the quality. For example, the Ford dataset takes tive domain.
around 30 GB for each driven kilometer, which means that Renesas also provides a similar SoC, called R-Car
covering an entire city will take hundreds of TeraBytes of H3 [169] which delivers improved computing capabilities
driving data. The majority of the available datasets consider and compliance with functional safety standards. Equipped
sunny, daylight and urban conditions, these being ideal op- with new CPU cores (Arm Cortex-A57), it can be used as
erating conditions for autonomous driving systems. an embedded platform for deploying various deep learning
algorithms, compared with R-Car V3H, which is only opti-
mized for CNNs.
9 Computational Hardware and
Deployment A Field-Programmable Gate Array (FPGA) is another vi-
able solution, showing great improvements in both perfor-
Deploying deep learning algorithms on target edge devices mance and power consumption in deep learning applica-
is not a trivial task. The main limitations when it comes to tions. The suitability of the FPGAs for running deep learn-
vehicles are the price, performance issues and power con- ing algorithms can be analyzed from four major perspec-
sumption. Therefore, embedded platforms are becoming es- tives: efficiency and power, raw computing power, flexibil-
sential for integration of AI algorithms inside vehicles due ity and functional safety. Our study is based on the research
to their portability, versatility, and energy efficiency. published by Intel [170], Microsoft [171] and UCLA [172].
The market leader in providing hardware solutions for de- By reducing the latency in deep learning applications, FP-
ploying deep learning algorithms inside autonomous cars is GAs provide additional raw computing power. The mem-
NVIDIA® . DRIVE PX [162] is an AI car computer which ory bottlenecks, associated with external memory accesses,
was designed to enable the auto-makers to focus directly on are reduced or even eliminated by the high amount of chip
the software for autonomous vehicles. cache memory. In addition, FPGAs have the advantages of
The newest version of DrivePX architecture is based on supporting a full range of data types, together with custom
two Tegra X2 [163] systems on a chip (SoCs). Each SoC user-defined types.
contains two Denve [164] cores, 4 ARM A57 cores and FPGAs are optimized when it comes to efficiency and
a graphical computeing unit (GPU) from the Pascal [165] power consumption. The studies presented by manufactur-
generation. NVIDIA® DRIVE PX is capable to perform ers like Microsoft and Xilinx show that GPUs can consume
real-time environment perception, path planning and local-
In terms of flexibility, FPGAs are built with multiple architectures, which are a mix of hardware programmable resources, digital signal processors and Processor Block RAM (BRAM) components. This architecture flexibility is suitable for deep and sparse neural networks, which are the state of the art for the current machine learning applications. Another advantage is the possibility of connecting to various input and output peripheral devices like sensors, network elements and storage devices.

In the automotive field, functional safety is one of the most important challenges. FPGAs have been designed to meet the safety requirements for a wide range of applications, including ADAS. When compared to GPUs, which were originally built for graphics and high-performance computing systems, where functional safety is not necessary, FPGAs provide a significant advantage in developing driver assistance systems.

10 Discussion and Conclusions

We have identified seven major areas that form open challenges in the field of autonomous driving. We believe that Deep Learning and Artificial Intelligence will play a key role in overcoming these challenges:

Perception: In order for an autonomous car to safely navigate the driving scene, it must be able to understand its surroundings. Deep learning is the main technology behind a large number of perception systems. Although great progress has been reported with respect to accuracy in object detection and recognition [173], current systems are mainly designed to calculate 2D or 3D bounding boxes for a couple of trained object classes, or to provide a segmented image of the driving environment. Future methods for perception should focus on increasing the levels of recognized details, making it possible to perceive and track more objects in real-time. Furthermore, additional work is required for bridging the gap between image- and LiDAR-based 3D perception [32], enabling the computer vision community to close the current debate on camera vs. LiDAR as main perception sensors.

Short- to middle-term reasoning: Additional to a robust and accurate perception system, an autonomous vehicle should be able to reason about its driving behavior over a short (milliseconds) to middle (seconds to minutes) time horizon [82]. AI and deep learning are promising tools that can be used for the high- and low-level path planning required for navigating the myriad of driving scenarios. Currently, the largest portion of papers in deep learning for self-driving cars is focused mainly on perception and End2End learning [81, 124]. Over the next period, we expect deep learning to play a significant role in the area of local trajectory estimation and planning. We consider long-term reasoning as solved, as provided by navigation systems. These are standard methods for selecting a route through the road network, from the car's current position to destination [82].

Availability of training data: "Data is the new oil" has lately become one of the most popular quotes in the automotive industry. The effectiveness of deep learning systems is directly tied to the availability of training data. As a rule of thumb, current deep learning methods are also evaluated based on the quality of training data [29]. The better the quality of the data is, the higher the accuracy of the algorithm. The daily data recorded by an autonomous vehicle is on the order of petabytes. This poses challenges on the parallelization of the training procedure, as well as on the storage infrastructure. Simulation environments have been used in the last couple of years for bridging the gap between scarce data and the deep learning's hunger for training examples. There is still a gap to be filled between the accuracy of a simulated world and real-world driving.

Learning corner cases: Most driving scenarios are considered solvable with classical methodologies. However, the remaining unsolved scenarios are corner cases which, until now, required the reasoning and intelligence of a human driver. In order to overcome corner cases, the generalization power of deep learning algorithms should be increased. Generalization in deep learning is of special importance in learning hazardous situations that can lead to accidents, especially due to the fact that training data for such corner cases is scarce. This implies also the design of one-shot and low-shot learning methods, which can be trained on a reduced number of training examples.

Learning-based control methods: Classical controllers make use of an a-priori model composed of fixed parameters. In a complex case, such as autonomous driving, these controllers cannot anticipate all driving situations. The effectiveness of deep learning components to adapt based on past experiences can also be used to learn the parameters of the car's control system, thus better approximating the underlying true system model [174, 94].

Functional safety: The usage of deep learning in safety-critical systems is still an open debate, efforts being made to bring the computational intelligence and functional safety communities closer to each other. Current safety standards, such as the ISO 26262, do not accommodate machine learning software [130]. Although new data-driven design methodologies have been proposed, there are still open issues on the explainability, stability, or classification robustness of deep neural networks.

Real-time computing and communication: Finally, real-time requirements have to be fulfilled for processing the large amounts of data gathered from the car's sensor suite, as well as for updating the parameters of deep learning systems over high-speed communication lines [170]. These real-time constraints can be backed up by advances in semiconductor chips dedicated for self-driving cars, as well as by the rise of 5G communication networks.
10.1 Final Notes

Autonomous vehicle technology has seen rapid progress in the past decade, especially due to advances in the area of artificial intelligence and deep learning. Current AI methodologies are nowadays either used or taken into consideration when designing different components for a self-driving car. Deep learning approaches have influenced not only the design of traditional perception-planning-action pipelines, but have also enabled End2End learning systems, able to directly map sensory information to steering commands.

Driverless cars are complex systems which have to safely drive passengers or cargo from a starting location to destination. Several challenges are encountered with the advent of AI based autonomous vehicles deployment on public roads. A major challenge is the difficulty in proving the functional safety of these vehicles, given the current formalism and explainability of neural networks. On top of this, deep learning systems rely on large training databases and require extensive computational hardware.

This paper has provided a survey on deep learning technologies used in autonomous driving. The survey of performance and computational requirements serves as a reference for system level design of AI based self-driving vehicles.

Acknowledgment

The authors would like to thank Elektrobit Automotive for the infrastructure and research support.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[2] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, "Learning Dexterous In-Hand Manipulation," CoRR, vol. abs/1808.00177, August 2018. [Online]. Available: https://arxiv.org/abs/1808.00177

[3] Y. Goldberg, Neural Network Methods for Natural Language Processing, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool, 2017, vol. 37.

[4] SAE Committee, "Taxonomy and Definitions for Terms Related to On-road Motor Vehicle Automated Driving Systems," 2014.

[5] E. Dickmanns and V. Graefe, "Dynamic Monocular Machine Vision," Machine Vision and Applications, vol. 1, pp. 223–240, 1988.

[6] B. Paden, M. Cáp, S. Z. Yong, D. S. Yershov, and E. Frazzoli, "A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles," IEEE Trans. Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.

[7] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.

[8] Y. Bengio, A. Courville, and P. Vincent, "Representation Learning: A Review and New Perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.

[9] P. A. Viola and M. J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," in 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8-14 December 2001, pp. 511–518.

[10] T. Ojala, M. Pietikäinen, and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Featured Distributions," Pattern Recognition, vol. 29, no. 1, pp. 51–59, Jan. 1996.

[11] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR, 2005, pp. 886–893.

[12] D. H. Hubel and T. N. Wiesel, "Shape and Arrangement of Columns in Cat's Striate Cortex," The Journal of Physiology, vol. 165, no. 3, pp. 559–568, 1963.

[13] M. A. Goodale and A. Milner, "Separate Visual Pathways for Perception and Action," Trends in Neurosciences, vol. 15, no. 1, pp. 20–25, 1992.

[14] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA, USA: MIT Press, 1986.

[15] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 2015.

[16] J. Duchi, E. Hazan, and Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.

[17] S. Hochreiter and J. Schmidhuber, "Long Short-term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[18] R. Sutton and A. Barto, Introduction to Reinforce- [30] S. O’Kane, “How Tesla and Waymo are Tackling a
ment Learning. MIT Press, 1998. Major Problem for Self-Driving Cars: Data,” Trans-
portation, 2018.
[19] R. Bellman, Dynamic Programming. Princeton Uni-
versity Press, 1957. [31] S. Hasirlioglu, A. Kamann, I. Doric, and T. Brand-
meier, “Test Methodology for Rain Influence on Au-
[20] C. Watkins and P. Dayan, “Q-Learning,” Machine tomotive Surround Sensors,” in 2016 IEEE 19th Int.
Learning, vol. 8, no. 3, p. 279292, 1992. Conf. on Intelligent Transportation Systems (ITSC),
Nov 2016, pp. 2242–2247.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu,
J. Veness, M. G. Bellemare, A. Graves, M. Ried- [32] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan,
miller, A. K. Fidjeland, G. Ostrovski, S. Petersen, M. Campbell, and K. Weinberger, “Pseudo-LiDAR
C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Ku- from Visual Depth Estimation: Bridging the Gap in
maran, D. Wierstra, S. Legg, and D. Hassabis, 3D Object Detection for Autonomous Driving,” in
“Human-level Control Through Deep Reinforcement IEEE Conf. on Computer Vision and Pattern Recog-
Learning,” Nature, vol. 518, no. 7540, pp. 529–533, nition (CVPR) 2019, June 2019.
Feb. 2015. [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Point-
Net: Deep Learning on Point Sets for 3D Classifica-
[22] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul,
tion and Segmentation,” in IEEE Conf. on Computer
G. Ostrovski, W. Dabney, D. Horgan, B. Piot,
Vision and Pattern Recognition (CVPR) 2017, July
M. Azar, and D. Silver, “Rainbow: Combining Im-
2017.
provements in Deep Reinforcement Learning,” 2017.
[34] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L.
[23] A. E. Sallab, M. Abdou, E. Perot, and S. Yoga- Waslander, “Joint 3D Proposal Generation and Ob-
mani, “Deep Reinforcement Learning framework for ject Detection from View Aggregation,” in IEEE/RSJ
Autonomous Driving,” CoRR, vol. abs/1704.02532, Int. Conf. on Intelligent Robots and Systems (IROS)
2017. 2018. IEEE, 2018.
[24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, [35] O. Russakovsky, J. Deng, H. Su, J. Krause,
T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Con- S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
tinuous Control with Deep Reinforcement Learning,” A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-
2-4 May 2016. Fei, “ImageNet Large Scale Visual Recognition Chal-
lenge,” Int. Journal of Computer Vision (IJCV), vol.
[25] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Con- 115, no. 3, pp. 211–252, 2015.
tinuous Deep Q-Learning with Model-based Accel-
eration,” in Int. Conf. on Machine Learning ICML [36] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi,
2016, vol. 48, Jun. 2016, pp. 2829–2838. “You Only Look Once: Unified, Real-time Object
Detection,” in Proceedings of the IEEE Conf. on com-
[26] M. Jaritz, R. de Charette, M. Toromanoff, E. Perot, puter vision and pattern recognition, 2016, pp. 779–
and F. Nashashibi, “End-to-End Race Driving with 788.
Deep Reinforcement Learning,” 2018 IEEE Int. Conf.
on Robotics and Automation (ICRA), pp. 2070–2075, [37] H. Law and J. Deng, “Cornernet: Detecting Objects
2018. as Paired Keypoints,” in Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp.
[27] M. Wulfmeier, D. Z. Wang, and I. Posner, “Watch 734–750.
This: Scalable Cost-Function Learning for Path
[38] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li,
Planning in Urban Environments,” 2016 IEEE/RSJ
“Single-shot Refinement Neural Network for Object
Int. Conf. on Intelligent Robots and Systems (IROS),
Detection,” IEEE Conference on Computer Vision
vol. abs/1607.02329, 2016. [Online]. Available:
and Pattern Recognition (CVPR), 2017.
http://arxiv.org/abs/1607.02329
[39] R. Girshick, “Fast R-CNN,” in Proceedings of the
[28] H. Zhu, K.-V. Yuen, L. S. Mihaylova, and H. Leung, IEEE Int. Conf. on computer vision, 2015, pp. 1440–
“Overview of Environment Perception for Intelligent 1448.
Vehicles,” IEEE Transactions on Intelligent Trans-
portation Systems, vol. 18, pp. 2584–2601, 2017. [40] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf,
W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-
[29] J. Janai, F. Guney, A. Behl, and A. Geiger, “Computer level Accuracy with 50x Fewer Parameters and¡ 0.5
Vision for Autonomous Vehicles: Problems, Datasets Mb Model Size,” arXiv preprint arXiv:1602.07360,
and State-of-the-Art,” 04 2017. 2016.
[41] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object [53] R. Girshick, J. Donahue, T. Darrell, and J. Malik,
Detection via Region-based Fully Convolutional Net- “Rich Feature Hierarchies for Accurate Object De-
works,” in Advances in neural information processing tection and Semantic Segmentation,” in Proceedings
systems, 2016, pp. 379–387. of the 2014 IEEE Conf. on Computer Vision and Pat-
tern Recognition, ser. CVPR ’14. Washington, DC,
[42] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Seg- USA: IEEE Computer Society, 2014, pp. 580–587.
Net: A Deep Convolutional Encoder-Decoder Ar-
chitecture for Image Segmentation,” IEEE Transac- [54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster
tions on Pattern Analysis and Machine Intelligence, R-CNN: Towards Real-time Object Detection with
vol. 39, 2017. Region Proposal Networks,” IEEE Transactions on
Pattern Analysis & Machine Intelligence, no. 6, pp.
[43] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Ic- 1137–1149, 2017.
net for Real-time Semantic Segmentation on High-
resolution Images,” European Conference on Com- [55] J. Li, K. Peng, and C.-C. Chang, “An Efficient Ob-
puter Vision, pp. 418–434, 2018. ject Detection Algorithm Based on Compressed Net-
works,” Symmetry, vol. 10, no. 7, p. 235, 2018.
[44] M. Treml, J. A. Arjona-Medina, T. Unterthiner,
R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, [56] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roar-
M. Heusel, M. Hofmarcher, M. Widrich, B. Nessler, Net: A Robust 3D Object Detection based on
and S. Hochreiter, “Speeding up Semantic Segmenta- RegiOn Approximation Refinement,” CoRR, vol.
tion for Autonomous Driving,” 2016. abs/1811.03818, 2018.

[57] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello,


[45] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick,
“Enet: A Deep Neural Network Architecture for
“Mask R-CNN,” 2017 IEEE Int. Conf. on Computer
Real-time Semantic Segmentation,” arXiv preprint
Vision (ICCV), pp. 2980–2988, 2017.
arXiv:1606.02147, 2016.
[46] Y. Zhou and O. Tuzel, “VoxelNet: End-to-End Learn-
[58] A. Valada, J. Vertens, A. Dhall, and W. Burgard,
ing for Point Cloud Based 3D Object Detection,”
“AdapNet: Adaptive Semantic Segmentation in Ad-
IEEE Conf. on Computer Vision and Pattern Recog-
verse Environmental Conditions,” 2017 IEEE Int.
nition 2018, pp. 4490–4499, 2018.
Conf. on Robotics and Automation (ICRA), pp. 4644–
4651, 2017.
[47] W. Luo, B. Yang, and R. Urtasun, “Fast and Furi-
ous: Real Time End-to-End 3D Detection, Tracking [59] K. Simonyan and A. Zisserman, “Very Deep Con-
and Motion Forecasting With a Single Convolutional volutional Networks for Large-scale Image Recogni-
Net,” in IEEE Conf. on Computer Vision and Pattern tion,” arXiv preprint arXiv:1409.1556, 2014.
Recognition (CVPR) 2018, June 2018.
[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
[48] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Ra-
“Frustum PointNets for 3D Object Detection from binovich, “Going Deeper with Convolutions,” IEEE
RGB-D Data,” in IEEE Conf. on Computer Vision Conference on Computer Vision and Pattern Recog-
and Pattern Recognition (CVPR) 2018, June 2018. nition (CVPR), 2015.

[49] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi- [61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual
View 3D Object Detection Network for Autonomous Learning for Image Recognition,” in Proceedings of
Driving,” in IEEE Conf. on Computer Vision and Pat- the IEEE Conf. on computer vision and pattern recog-
tern Recognition (CVPR) 2017, July 2017. nition, 2016, pp. 770–778.

[50] J. Redmon and A. Farhadi, “YOLO9000: Better, [62] D. Barnes, W. Maddern, G. Pascoe, and I. Posner,
Faster, Stronger,” IEEE Conf. on Computer Vision “Driven to Distraction: Self-Supervised Distractor
and Pattern Recognition (CVPR), 2017. Learning for Robust Monocular Visual Odometry in
Urban Environments,” in 2018 IEEE Int. Conf. on
[51] ——, “Yolov3: An Incremental Improvement,” arXiv Robotics and Automation (ICRA). IEEE, 2018.
preprint arXiv:1804.02767, 2018.
[63] G. Bresson, Z. Alsayed, L. Yu, and S. Glaser, “Simul-
[52] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, taneous Localization and Mapping: A Survey of Cur-
C.-Y. Fu, and A. C. Berg, “Ssd: Single Shot Multi- rent Trends in Autonomous Driving,” IEEE Transac-
box Detector,” in European conference on computer tions on Intelligent Vehicles, vol. 2, no. 3, pp. 194–
vision. Springer, 2016, pp. 21–37. 220, Sep 2017.
[64] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization," in Proceedings of the 2015 IEEE Int. Conf. on Computer Vision (ICCV). Washington, DC, USA: IEEE Computer Society, 2015, pp. 2938–2946.
[65] N. Radwan, A. Valada, and W. Burgard, "VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry," IEEE Robotics and Automation Letters, Sep 2018.
[66] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-Based Localization Using LSTMs for Structured Feature Correlation," 2017 IEEE Int. Conf. on Computer Vision (ICCV), pp. 627–637, 2017.
[67] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, "Image-Based Localization Using Hourglass Networks," 2017 IEEE Int. Conf. on Computer Vision Workshops (ICCVW), pp. 870–877, 2017.
[68] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, "Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network," in The IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017.
[69] E. Brachmann and C. Rother, "Learning Less is More - 6D Camera Localization via 3D Surface Regression," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) 2018, June 2018.
[70] P. Sarlin, F. Debraine, M. Dymczyk, R. Siegwart, and C. Cadena, "Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization," in Proc. of the 2nd Conf. on Robot Learning (CoRL), Oct 2018.
[71] I. A. Barsan, S. Wang, A. Pokrovsky, and R. Urtasun, "Learning to Localize Using a LiDAR Intensity Map," in Proc. of the 2nd Conf. on Robot Learning (CoRL), Oct 2018.
[72] O. Garcia-Favrot and M. Parent, "Laser Scanner Based SLAM in Real Road and Traffic Environment," in IEEE Int. Conf. on Robotics and Automation (ICRA09), Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, 2009.
[73] A. K. Ushani and R. M. Eustice, "Feature Learning for Scene Flow Estimation from LIDAR," in Proc. of the 2nd Conf. on Robot Learning (CoRL), vol. 87, Oct 2018, pp. 283–292.
[74] Cityscapes, "Cityscapes Data Collection," https://www.cityscapes-dataset.com/, 2018.
[75] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). Cambridge: The MIT Press, 2005.
[76] P. Ondruska, J. Dequaire, D. Z. Wang, and I. Posner, "End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks," CoRR, vol. abs/1604.05091, 2016.
[77] S. Hoermann, M. Bach, and K. Dietmayer, "Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: Deep Learning Approach with Fully Automatic Labeling," IEEE Int. Conf. on Robotics and Automation (ICRA), 2017.
[78] S. Ramos, S. K. Gehrig, P. Pinggera, U. Franke, and C. Rother, "Detecting Unexpected Obstacles for Self-Driving Cars: Fusing Deep Learning and Geometric Modeling," IEEE Intelligent Vehicles Symposium, vol. 4, 2016.
[79] C. Seeger, A. Müller, and L. Schwarz, "Towards Road Type Classification with Occupancy Grids," in Intelligent Vehicles Symposium - Workshop: DeepDriving - Learning Representations for Intelligent Vehicles, IEEE, Gothenburg, Sweden, July 2016.
[80] L. Marina, B. Trasnea, T. Cocias, A. Vasilcoi, F. Moldoveanu, and S. Grigorescu, "Deep Grid Net (DGN): A Deep Learning System for Real-Time Driving Context Understanding," in Int. Conf. on Robotic Computing IRC 2019, Naples, Italy, 25-27 February 2019.
[81] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving," 2016.
[82] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, "Perception, Planning, Control, and Coordination for Autonomous Vehicles," Machines, vol. 5, no. 1, p. 6, 2017.
[83] E. Rehder, J. Quehl, and C. Stiller, "Driving Like a Human: Imitation Learning for Path Planning using Convolutional Neural Networks," in Int. Conf. on Robotics and Automation Workshops, 2017.
[84] L. Sun, C. Peng, W. Zhan, and M. Tomizuka, "A Fast Integrated Planning and Control Framework for Autonomous Driving via Imitation Learning," ASME 2018 Dynamic Systems and Control Conference, vol. 3, 2018. [Online]. Available: https://arxiv.org/pdf/1707.02515.pdf
[85] S. Grigorescu, B. Trasnea, L. Marina, A. Vasilcoi, and T. Cocias, "NeuroTrajectory: A Neuroevolutionary Approach to Local State Trajectory Learning for Autonomous Vehicles," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3441–3448, October 2019.
[86] L. Yu, X. Shao, Y. Wei, and K. Zhou, "Intelligent Land-Vehicle Model Transfer Trajectory Planning Method Based on Deep Reinforcement Learning," Sensors (Basel, Switzerland), vol. 18, 09 2018.
[87] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, "Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments," 2017 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), vol. abs/1703.07887, 2017. [Online]. Available: http://arxiv.org/abs/1703.07887
[88] W. Schwarting, J. Alonso-Mora, and D. Rus, "Planning and Decision-Making for Autonomous Vehicles," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, 05 2018.
[89] T. Gu, J. M. Dolan, and J. Lee, "Human-like Planning of Swerve Maneuvers for Autonomous Vehicles," in 2016 IEEE Intelligent Vehicles Symposium (IV), June 2016, pp. 716–721.
[90] A. I. Panov, K. S. Yakovlev, and R. Suvorov, "Grid Path Planning with Deep Reinforcement Learning: Preliminary Results," Procedia Computer Science, vol. 123, pp. 347–353, 2018, 8th Annual Int. Conf. on Biologically Inspired Cognitive Architectures, BICA 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050918300553
[91] C. J. Ostafew, J. Collier, A. P. Schoellig, and T. D. Barfoot, "Learning-based Nonlinear Model Predictive Control to Improve Vision-based Mobile Robot Path Tracking," Journal of Field Robotics, vol. 33, no. 1, pp. 133–152, 2015.
[92] D. Nguyen-Tuong, J. Peters, and M. Seeger, "Local Gaussian Process Regression for Real Time Online Model Learning," in Proceedings of the Neural Information Processing Systems Conference, 2008, pp. 1193–1200.
[93] F. Meier, P. Hennig, and S. Schaal, "Efficient Bayesian Local Model Learning for Control," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2014, pp. 2244–2249.
[94] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, "Robust Constrained Learning-Based NMPC Enabling Reliable Mobile Robot Path Tracking," Int. Journal of Robotics Research, vol. 35, no. 13, pp. 1547–1563, 2016.
[95] O. Sigaud, C. Salaün, and V. Padois, "On-line Regression Algorithms for Learning Mechanical Models of Robots: A Survey," Robotics and Autonomous Systems, vol. 59, no. 12, pp. 1115–1129, Dec. 2011.
[96] C. Ostafew, A. Schoellig, and T. D. Barfoot, "Visual Teach and Repeat, Repeat, Repeat: Iterative Learning Control to Improve Mobile Robot Path Tracking in Challenging Outdoor Environments," 11 2013, pp. 176–181.
[97] B. Panomruttanarug, "Application of Iterative Learning Control in Tracking a Dubin's Path in Parallel Parking," Int. Journal of Automotive Technology, vol. 18, no. 6, pp. 1099–1107, Dec 2017.
[98] N. R. Kapania and J. C. Gerdes, "Path Tracking of Highly Dynamic Autonomous Vehicle Trajectories via Iterative Learning Control," in 2015 American Control Conference (ACC), July 2015, pp. 2753–2758.
[99] Z. Yang, F. Zhou, Y. Li, and Y. Wang, "A Novel Iterative Learning Path-tracking Control for Nonholonomic Mobile Robots Against Initial Shifts," Int. Journal of Advanced Robotic Systems, vol. 14, p. 172988141771063, 05 2017.
[100] S. Lefèvre, A. Carvalho, and F. Borrelli, "A Learning-Based Framework for Velocity Control in Autonomous Driving," IEEE Transactions on Automation Science and Engineering, vol. 13, no. 1, pp. 32–42, Jan 2016.
[101] S. Lefèvre, A. Carvalho, and F. Borrelli, "Autonomous Car Following: A Learning-based Approach," in 2015 IEEE Intelligent Vehicles Symposium (IV), June 2015, pp. 920–926.
[102] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, "Aggressive Deep Driving: Combining Convolutional Neural Networks and Model Predictive Control," 01 2017, pp. 133–142.
[103] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, "Aggressive Deep Driving: Model Predictive Control with a CNN Cost Model," CoRR, vol. abs/1707.05303, 2017.
[104] U. Rosolia, A. Carvalho, and F. Borrelli, "Autonomous Racing using Learning Model Predictive Control," in 2017 American Control Conference (ACC), May 2017, pp. 5115–5120.
[105] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots, "Learning Deep Neural Network Control Policies for Agile Off-Road Autonomous Driving," 2017.
[106] Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, "Agile Off-Road Autonomous Driving Using End-to-End Deep Imitation Learning," Robotics: Science and Systems 2018, 2018.
[107] J. Rawlings and D. Mayne, Model Predictive Control: Theory and Design. Nob Hill Pub., 2009.
[108] M. Kamel, A. Hafez, and X. Yu, "A Review on Motion Control of Unmanned Ground and Aerial Vehicles Based on Model Predictive Control Techniques," Engineering Science and Military Technologies, vol. 2, pp. 10–23, 03 2018.
[109] M. Brunner, U. Rosolia, J. Gonzales, and F. Borrelli, "Repetitive Learning Model Predictive Control: An Autonomous Racing Example," in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), Dec 2017, pp. 2545–2550.
[110] D. A. Pomerleau, "ALVINN: An Autonomous Land Vehicle in a Neural Network," in Advances in Neural Information Processing Systems, 1989, pp. 305–313.
[111] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, "Off-road Obstacle Avoidance through End-to-End Learning," in Advances in Neural Information Processing Systems, 2006, pp. 739–746.
[112] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller, "Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car," arXiv preprint arXiv:1704.07911, 2017.
[113] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-End Learning of Driving Models from Large-scale Video Datasets," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[114] H. M. Eraqi, M. N. Moustafa, and J. Honer, "End-to-end Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies," arXiv preprint arXiv:1710.03804, 2017.
[115] S. Hecker, D. Dai, and L. Van Gool, "End-to-End Learning of Driving Models with Surround-view Cameras and Route Planners," in European Conference on Computer Vision (ECCV), 2018.
[116] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K. Hedrick, "Learning a Deep Neural Net Policy for End-to-End Control of Autonomous Vehicles," in 2017 American Control Conference (ACC), May 2017, pp. 4914–4919.
[117] M. G. Bechtel, E. McEllhiney, and H. Yun, "DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car," in The 24th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications (RTCSA), August 2018, pp. 1–12.
[118] S. Yang, W. Wang, C. Liu, K. Deng, and J. K. Hedrick, "Feature Analysis and Selection for Training an End-to-End Autonomous Vehicle Controller Using the Deep Learning Approach," 2017 IEEE Intelligent Vehicles Symposium, vol. 1, 2017. [Online]. Available: http://arxiv.org/abs/1703.09744
[119] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to End Learning for Self-Driving Cars," CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[120] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley, A. Pettinato, L. Angell, B. Mehler, and B. Reimer, "MIT Autonomous Vehicle Technology Study: Large-Scale Deep Learning Based Analysis of Driver Behavior and Interaction with Automation," IEEE Access 2017, 2017. [Online]. Available: https://arxiv.org/abs/1711.06976
[121] C. Chen, A. Seff, A. L. Kornhauser, and J. Xiao, "DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving," 2015 IEEE Int. Conf. on Computer Vision (ICCV), pp. 2722–2730, 2015.
[122] E. Perot, M. Jaritz, M. Toromanoff, and R. D. Charette, "End-to-End Driving in a Realistic Racing Game with Deep Reinforcement Learning," in 2017 IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 474–475.
[123] Wayve. (2018) Learning to Drive in a Day. [Online]. Available: https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning
[124] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning Deep Control Policies for Autonomous Aerial Vehicles with MPC-guided Policy Search," 2016 IEEE Int. Conf. on Robotics and Automation (ICRA), May 2016. [Online]. Available: http://dx.doi.org/10.1109/ICRA.2016.7487175
[125] T. Ferrel, "Engineering Safety-critical Systems in the 21st Century," 2010.
[126] S. Burton, L. Gauerhof, and C. Heinzemann, "Making the Case for Safety of Machine Learning in Highly Automated Driving," Lecture Notes in Computer Science, vol. 10489, 2017.
[127] K. R. Varshney, "Engineering Safety in Machine Learning," in 2016 Information Theory and Applications Workshop (ITA), Jan 2016, pp. 1–5.
[128] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané, "Concrete Problems in AI Safety," CoRR, vol. abs/1606.06565, 2016.
[129] N. Möller, The Concepts of Risk and Safety. Springer Netherlands, 2012.
[130] R. Salay, R. Queiroz, and K. Czarnecki, "An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software," CoRR, vol. abs/1709.02435, 2017. [Online]. Available: http://arxiv.org/abs/1709.02435
[131] B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and C. Patz, "Challenges in Applying the ISO 26262 for Driver Assistance Systems," in Schwerpunkt Vernetzung, 5. Tagung Fahrerassistenz, 2012.
[132] R. Parasuraman and V. Riley, "Humans and Automation: Use, Misuse, Disuse, Abuse," Human Factors, vol. 39, no. 2, pp. 230–253, 1997.
[133] F. Jose, Safety-Critical Systems, 2018.
[134] H. Daumé III and D. Marcu, "Domain Adaptation for Statistical Classifiers," J. Artif. Int. Res., vol. 26, no. 1, pp. 101–126, May 2006.
[135] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission," in Proceedings of the 21th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
[136] K. R. Varshney and H. Alemzadeh, "On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and Data Products," Big Data, vol. 5, 10 2016.
[137] S. Levin, "Tesla Fatal Crash: 'Autopilot' Mode Sped up Car Before Driver Killed, Report Finds," The Guardian, 2018.
[138] P. Koopman, "Challenges in Autonomous Vehicle Validation: Keynote Presentation Abstract," in Proceedings of the 1st Int. Workshop on Safe Control of Connected and Autonomous Vehicles, 2017.
[139] Z. Kurd, T. Kelly, and J. Austin, "Developing Artificial Neural Networks for Safety Critical Systems," Neural Computing and Applications, vol. 16, no. 1, pp. 11–19, Jan 2007.
[140] M. Harris, "Google Reports Self-driving Car Mistakes: 272 Failures and 13 Near Misses," The Guardian, 2016.
[141] J. McPherson, "How Uber's Self-Driving Technology Could Have Failed In The Fatal Tempe Crash," Forbes, 2018.
[142] A. Chakarov, A. Nori, S. Rajamani, S. Sen, and D. Vijaykeerthy, "Debugging Machine Learning Tasks," arXiv preprint arXiv:1603.07292, 2018.
[143] B. Nushi, E. Kamar, E. Horvitz, and D. Kossmann, "On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems," in AAAI, 2017.
[144] I. Takanami, M. Sato, and Y. P. Yang, "A Fault-value Injection Approach for Multiple-weight-fault Tolerance of MNNs," in Proceedings of the IEEE-INNS-ENNS, 2000, vol. 3, pp. 515–520.
[145] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks," in CAV, 2017.
[146] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A Multimodal Dataset for Autonomous Driving," arXiv preprint arXiv:1903.11027, 2019.
[147] H. Yin and C. Berger, "When to Use What Data Set for Your Self-driving Car Algorithm: An Overview of Publicly Available Driving Datasets," in 2017 IEEE 20th Int. Conf. on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 1–8.
[148] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling," CoRR, vol. abs/1805.04687, 2018.
[149] P. Koschorrek, T. Piccini, P. Öberg, M. Felsberg, L. Nielsen, and R. Mester, "A Multi-sensor Traffic Scene Dataset with Omnidirectional Video," in Ground Truth - What is a Good Dataset? CVPR Workshop 2013, 2013.
[150] G. Pandey, J. R. McBride, and R. M. Eustice, "Ford Campus Vision and Lidar Data Set," Int. Journal of Robotics Research, vol. 30, no. 13, pp. 1543–1552, 2011.
[151] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," The Int. Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[152] Udacity, "Udacity Data Collection," http://academictorrents.com/collection/self-driving-cars, 2018.
[153] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 Year, 1000km: The Oxford RobotCar Dataset," The Int. Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017.
[154] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic Object Classes in Video: A High-definition Ground Truth Database," Pattern Recognition Letters, vol. 30, pp. 88–97, 2009.
[155] F. Flohr and D. M. Gavrila, "Daimler Pedestrian Segmentation Benchmark Dataset," in Proc. of the British Machine Vision Conference, 2013.
[156] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: A Benchmark," in 2009 IEEE Conf. on Computer Vision and Pattern Recognition, 2009, pp. 304–311.
[157] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," in 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230–6239.
[158] S. Liu, J. Jia, S. Fidler, and R. Urtasun, "SGN: Sequential Grouping Networks for Instance Segmentation," pp. 3516–3524, 10 2017.
[159] X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li, and D. M. Gavrila, "A New Benchmark for Vision-based Cyclist Detection," in 2016 IEEE Intelligent Vehicles Symposium (IV), 2016, pp. 1028–1033.
[160] Velodyne, "Velodyne LiDAR for Data Collection," https://velodynelidar.com/, 2018.
[161] Sick, "Sick LiDAR for Data Collection," https://www.sick.com/, 2018.
[162] NVIDIA, "NVIDIA AI Car Computer Drive PX," https://www.nvidia.com/en-au/self-driving-cars/drive-px/.
[163] ——, "Tegra X2," https://devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge/.
[164] ——, "Denver Core," https://en.wikichip.org/wiki/nvidia/microarchitectures/denver.
[165] ——, "Pascal Microarchitecture," https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/.
[166] ——, "NVIDIA Drive AGX," https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/.
[167] ——, "NVIDIA Volta," https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/.
[168] Renesas, "R-Car V3H," https://www.renesas.com/eu/en/solutions/automotive/soc/r-car-v3h.html/.
[169] ——, "R-Car H3," https://www.renesas.com/sg/en/solutions/automotive/soc/r-car-h3.html/.
[170] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh, "Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?" in Proceedings of the 2017 ACM/SIGDA Int. Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 5–14. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021740
[171] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung, "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," February 2015.
[172] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding Performance Differences of FPGAs and GPUs: (Abtract Only)," in Proceedings of the 2018 ACM/SIGDA Int. Symposium on Field-Programmable Gate Arrays, ser. FPGA '18. New York, NY, USA: ACM, 2018, pp. 288–288. [Online]. Available: http://doi.acm.org/10.1145/3174243.3174970
[173] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object Detection with Deep Learning: A Review," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[174] C. J. Ostafew, "Learning-based Control for Autonomous Mobile Robots," Ph.D. dissertation, University of Toronto, 2016.