A Survey of Deep Learning Applications To Autonomous Vehicle Control
to test the system in the wide variety of scenarios which it may encounter after deployment. However, deep learning methods have shown great promise in not only providing excellent performance for complex and non-linear control problems, but also in generalising previously learned rules to new scenarios. For these reasons, the use of deep learning for vehicle control is becoming increasingly popular. Although important advancements have been achieved in this field, these works have not been fully summarised. This paper surveys a wide range of research works reported in the literature which aim to control a vehicle through deep learning methods. Although there exists overlap between control and perception, the focus of this paper is on vehicle control, rather than the wider perception problem which includes tasks such as semantic segmentation and object detection. The paper identifies the strengths and limitations of available deep learning methods through comparative analysis and discusses the research challenges in terms of computation, architecture selection, goal specification, generalisation, verification and validation, as well as safety. Overall, this survey brings timely and topical information to a rapidly evolving field relevant to intelligent transportation systems.

Index Terms—Machine learning, Neural networks, Intelligent control, Computer vision, Advanced driver assistance, Autonomous vehicles

This work was supported by the UK-EPSRC grant EP/R512217/1 and Jaguar Land Rover.
Sampo Kuutti and Saber Fallah are with the Centre for Automotive Engineering, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected], [email protected]).
Richard Bowden is with the Centre for Vision Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected]).
Yaochu Jin is with the Department of Computer Science, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected]).
Phil Barber was with Jaguar Land Rover Limited (e-mail: [email protected]).

I. INTRODUCTION

IN 2016, traffic accidents resulted in 37,000 fatalities in the United States [1] and 25,500 fatalities in the European Union [2]. With the steady increase in the number of vehicles on the road, issues such as traffic congestion, pollution, and road safety are becoming critical [3]. Autonomous vehicles have gained significant interest as a solution to these challenges [4]–[7]. For instance, 90% of all car accidents are estimated to be caused by human errors, while only 2% are caused by vehicle failures [8]. Further benefits of autonomous vehicles, such as better fuel economy [9], [10], reduced pollution, car sharing [11], increased productivity, and improved traffic flow [12], have also been reported.

[...] Bundeswehr Munich for highway driving [14]. Since then, projects such as the DARPA Grand Challenges [15], [16] have continued to drive forward research in autonomous vehicles. Outside of academia, car manufacturers and tech companies have also carried out research to develop their own autonomous vehicles. This has led to multiple Advanced Driver Assistance Systems such as Adaptive Cruise Control (ACC), Lane Keeping Assistance, and Lane Departure Warning technologies, which provide modern vehicles with partial autonomy. These technologies not only increase the safety of modern vehicles and make driving easier, but also pave the way for fully autonomous vehicles which do not require any human intervention.

Early autonomous vehicle systems were heavily reliant on accurate sensory data, utilising multi-sensor setups and expensive sensors such as LIDAR to provide accurate environment perception. Control of these autonomous vehicles was handled via rule-based controllers, where the parameters are set by the developers and hand-tuned after simulation and field testing [17]–[19]. The downsides of this approach are the time-intensive hand-tuning of parameters [20] and the difficulty such rule-based controllers have in generalising to new scenarios [21]. Also, the highly non-linear nature of driving means that control methods based on linearisation of the vehicle model or other algebraic analytical solutions are often infeasible or do not scale well [22], [23]. Recently, deep learning has gained attention due to the numerous state-of-the-art results it has achieved in fields such as image classification and speech recognition [24]–[26]. This has led to increasing use of deep learning in autonomous vehicle applications, including planning and decision making [27]–[31], perception [32]–[36], as well as mapping and localisation [37]–[39]. The performance of Convolutional Neural Networks (CNNs) with raw camera inputs has the potential to reduce the number of sensors used by autonomous vehicles. This has led some organisations to investigate autonomous vehicles without expensive sensors such as LIDAR, instead employing extensive use of deep learning for scene understanding, object recognition, semantic segmentation, and motion estimation. The strong results of deep learning in these perception problems have also sparked interest in using Deep Neural Networks (DNNs) to produce control actions in autonomous vehicles. Indeed, autonomous vehicle control often has a strong link to perception, as many techniques use CNNs to predict control actions based on images of the scene, without any separate perception module, thereby removing the separation between the perception and
[...] value of a given action in a given state, is used instead. The optimal policy is then found by greedily maximising the state-action value function Q(s, a). The disadvantage of this approach is that there is no guarantee on the optimality of the learned policy [61], [62]. Policy gradient algorithms (e.g. REINFORCE [63]) do not estimate a value function, but instead parametrise the policy and then update the parameters to maximise the expected rewards. This is done by constructing a loss function and estimating a gradient of the loss function with respect to the network parameters. During training, the network parameters are then updated in the direction of the policy gradient. The main disadvantage of this approach is the high variance in the estimated policy gradients [64]–[66]. The third class, actor-critic algorithms (e.g. A3C [67]), are hybrid methods which combine the use of a value function with a parametrised policy function. This creates a trade-off between the disadvantages of the high variance of policy gradients and the bias of value-based methods [51], [68], [69]. Another separating factor between different reinforcement learning algorithms is the type of reward function used. The reward function used can be either sparse or dense. In a sparse reward function, the agent only receives a reward following specific events, such as success or failure in its task. The benefit of this approach is that the success (e.g. reaching a goal location) or failure (e.g. colliding with another object) is easy to define for most tasks. However, this can further exacerbate the sample complexity issue in reinforcement learning, since the agent would only receive a reward relatively rarely, resulting in slow convergence. On the other hand, in a dense reward function the agent is given a reward at every time-step based on the state it is in. This means that the agent receives a continuous learning signal, estimating how useful the chosen actions were in their respective states.
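As an illustration of the policy gradient family described above, the following is a minimal, hypothetical REINFORCE-style update written in PyTorch. The tiny network, the state and action dimensions, and the hyperparameters are illustrative assumptions of this sketch rather than details taken from any of the surveyed works.

# Minimal REINFORCE-style policy gradient sketch (assumed 4-D state, 3 discrete actions).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 3))  # logits over actions
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, returns):
    """One policy-gradient step from a batch of sampled transitions.

    states: (N, 4) float tensor, actions: (N,) long tensor of action indices,
    returns: (N,) float tensor of discounted returns from each state.
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # The loss is the negative expected return; its gradient is the policy gradient estimate.
    loss = -(log_probs * returns).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()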
C. Datasets and Tools for Deep Learning

The rapid progress in the implementation of deep learning systems on autonomous vehicles has led to the availability of diverse deep learning data sets for autonomous driving and perception. Perhaps the most well known data set for autonomous driving is the KITTI benchmark suite [70], [71], which includes multiple data sets for evaluation of stereo vision, optical flow, scene flow, simultaneous localisation and mapping, object detection and tracking, road detection and semantic segmentation. Other useful data sets include the Waymo Open [72], Oxford Robotcar [73], ApolloScape [74], Udacity [75], ETH Pedestrian [76], and Caltech Pedestrian [77] data sets. For a more complete overview of available autonomous driving data sets, see the survey by Yin & Berger [78]. Besides public data sets, there are also a number of other tools available for the development of deep learning in autonomous vehicles. The current leading Artificial Intelligence (AI) platform for autonomous driving is the NVIDIA Drive PX2 [79], which provides two Tegra system-on-chips (SoC) and two Pascal graphics processors with dedicated memory and specialised support for DNN calculations. For more diverse tasks, the MobilEye EyeQ5 [80] provides four fully programmable accelerators, each optimised for a different family of machine learning algorithms. This diversity can be useful in systems where different families of deep learning algorithms have been used. On the other hand, Altera's Cyclone V [81] SoC provides a driving solution optimised for sensor fusion. For a more in-depth review of autonomous driving hardware platforms, see the discussion by Liu et al. [82].

III. DEEP LEARNING APPLICATIONS TO VEHICLE CONTROL

The motion control of a vehicle can be broadly divided into two tasks: lateral motion of the vehicle is controlled by the steering of the vehicle, whilst longitudinal motion is controlled through manipulating the gas and brake pedals of the vehicle. Lateral control systems aim to control the vehicle's position in the lane, as well as carry out other lateral actions such as lane changes or collision avoidance manoeuvres. In the deep learning domain, this is typically achieved by capturing the environment using the images from on-board cameras as the input to the neural network. Longitudinal control manages the acceleration of the vehicle such that it maintains the desired velocity on the road, keeps a safe distance from the preceding vehicle, and avoids rear-end collisions. While lateral control is typically achieved through vision, longitudinal control relies on measurements of relative velocity and distance to the preceding/following vehicles. This means that ranging sensors such as RADAR or LIDAR are more commonly used in longitudinal control systems. The majority of current research projects have chosen to focus on only one of these actions, thereby simplifying the control problem. Moreover, both types of control systems have different challenges and differ in terms of implementation (e.g. sensor setups, test/use cases). For these reasons this section is split into three subsections, with the first two subsections discussing lateral and longitudinal control systems independently, and the third subsection focusing on techniques which have attempted to combine both longitudinal and lateral control.

A. Lateral Control Systems

One of the earliest applications of artificial neural networks to the vehicle control problem was the Autonomous Land Vehicle in a Neural Network (ALVINN) system by Pomerleau in 1989, which was first described in [83] and further extended in [84]. ALVINN utilised a feedforward neural network, with a 30x32-neuron input layer, one hidden layer with four neurons, and a 30-neuron output layer in which each neuron represents a possible discrete steering action. The system used the input from a camera together with the steering commands of the human driver as training data. To increase the amount of data and variety of scenarios available, the author employed data augmentation methods to increase the available training data without recording any additional footage; each image was shifted and rotated, so as to make the vehicle appear to be situated at a different part of the road laterally. Additionally, to avoid bias towards recent inputs (e.g. if a training session ends in a long right hand turn, the system could be biased to turn right more often), a buffering solution was used where previously encountered training patterns were retained in the
buffer. The buffer contained 4 patterns of previous data at any time, which were periodically replaced such that the patterns in the buffer had no right or left bias on average. Both the image shifting and the buffering solutions were shown to significantly improve the system performance. The system was trained on a 150 m stretch of road, after which it was tested on a separate stretch of road at speeds ranging from 5 to 55 mph, allowing steering without intervention for distances of up to 22 miles. The system was shown to remain, on average, 1.6 cm from the centre of the road, compared to 4.0 cm under human control. This demonstrated that neural networks can learn to steer a vehicle from recorded data.
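For concreteness, the ALVINN-style architecture described above (a 30x32 input retina, a small hidden layer, and one output unit per discrete steering bin) can be sketched as below. The layer sizes follow the description in the text, while the activation choices and training details are assumptions of this illustration.

# Sketch of an ALVINN-like steering network: 30x32 input image, 4 hidden units,
# 30 output units, each representing a discrete steering action.
import torch
import torch.nn as nn

class AlvinnLikeNet(nn.Module):
    def __init__(self, n_hidden: int = 4, n_steering_bins: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                      # 30x32 image -> 960 inputs
            nn.Linear(30 * 32, n_hidden),
            nn.Sigmoid(),                      # assumed activation
            nn.Linear(n_hidden, n_steering_bins),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)                 # logits over discrete steering bins

# Training would minimise a classification loss against the human driver's steering
# bin, e.g. nn.CrossEntropyLoss(), on shifted/rotated copies of each recorded frame.
model = AlvinnLikeNet()
dummy = torch.rand(8, 1, 30, 32)               # batch of 8 single-channel frames
print(model(dummy).shape)                      # torch.Size([8, 30])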
The first to suggest reinforcement learning for vehicle
steering was the work carried out by Yu [85]. Yu proposed
a road following system based on Pomerleau's work, utilising reinforcement learning to design a controller whose advantage was the ability to learn from previous experiences to
drive in new environments and continuously learn and improve
its road following ability through online learning. Combining
supervised learning and reinforcement learning, Moriarty et
al. [86] developed a lane-selection strategy for a highway
environment. The results showed that the vehicles with learned
controllers managed to maintain speeds close to the desired
speed and performed fewer lane changes. Moreover, the learned
control strategy resulted in better traffic flow than manually
constructed controllers.
The neural networks utilised in the aforementioned early works are significantly smaller when compared to what is feasible with today's technology [87]. Indeed, while neural networks are hardly new, the research interest and their adoption in various applications have exploded in recent years due to increased computing power, especially through parallel graphics processing units (GPUs) which can significantly reduce training time and improve performance. Moreover, the availability of large public data sets and hardware solutions optimised for deep learning have made training and validation of neural network systems easier. Overall, these recent advancements have enabled better performance through more complex systems with vastly increased amounts of training data and episodes.

Utilising deeper models with CNNs, Muller et al. [88] trained a sub-scale radio controlled car to navigate off-road in the DARPA Autonomous VEhicle (DAVE) project. The model was trained with training data collected from two forward-facing cameras while a human was controlling the vehicle. Using a 6-layer CNN, the model learned to navigate around obstacles when driving at speeds of 2 m/s. Building on the approach of DAVE, NVIDIA utilised a CNN to create an end-to-end control system for steering of a vehicle through supervised learning [87]. The system is capable of self-optimising the system performance and detecting useful environmental features (e.g. detection of roads and lanes). The CNN used (see Fig. 1) can learn the steering policy without explicit manual decomposition of the environmental features, path planning, or control actions, using a small amount of training data. The training data set consisted of recorded camera footage and steering signals from a human driven vehicle. The CNN consisted of 9 layers, including a normalisation layer, 5 convolutional layers and 3 fully connected layers, with a total of 27 million connections and 250,000 parameters. This method achieved a 98% autonomy in initial testing and 100% autonomy during a 10-mile highway test, measured based on the number of interventions required over a given test time. However, it should be noted that this measure does not include lane changes or turns, and therefore only evaluates the system's ability to stay in its current lane.

Fig. 1. Convolutional Neural Network utilised in the NVIDIA end-to-end steering system. (Figure recreated based on [87]).
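The 9-layer architecture described above (a normalisation layer, five convolutional layers, and fully connected layers regressing a single steering command) can be sketched roughly as follows. The filter sizes and strides loosely follow published descriptions of the NVIDIA network [87], but the exact values, input resolution, and choice of normalisation below should be treated as assumptions of this illustration.

# Rough PilotNet-style sketch: normalisation, 5 conv layers, fully connected
# layers, and a single steering output (layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

class PilotNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(3),                          # stands in for the normalisation layer
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),     # 64x1x18 feature map for a 66x200 input
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),                           # predicted steering command
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))

frames = torch.rand(4, 3, 66, 200)                      # batch of camera frames
print(PilotNetLike()(frames).shape)                     # torch.Size([4, 1])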
A further example of supervised learning for steering of an autonomous vehicle is the work by Rausch et al. [42], where supervised learning was employed to create an end-to-end lateral vehicle controller using a CNN with four hidden layers: three convolutional layers and one fully connected layer. The training data was the steering angle and front-facing camera footage, which was provided by a human steering a vehicle in a CarSim [89] simulation, with imaging captured at 12 frames per second (FPS) at a resolution of 1912x1036. The data was collected from a 15-minute simulation run, resulting in a total of 10,800 frames. Inappropriate frames caused by bad driving behaviour or graphic errors (e.g. due to a fault in the simulator) were removed from the training data manually. Then, the neural network was trained with three different optimisation algorithms to update the network weights, namely Stochastic Gradient Descent (SGD) [90], Adam [91], and Nesterov's Accelerated Gradient (NAG) [92]. During training, Adam resulted in the best loss convergence, while during the evaluation, the NAG-trained network performed the best in terms of keeping the vehicle in the centre of the lane. Therefore, convergence of the loss function is not necessarily representative of a well-trained neural network. The neural networks were shown to learn good estimations of the human driver's steering policy; however, by comparing the steering angles, it could be seen that the steering signal of the neural networks included noisy behaviour. A potential reason is that the system estimates the required steering angle at each frame, with no context regarding previous states or actions. This results in the steering signals between subsequent time steps varying significantly from each other, causing noisy output. This could be resolved by utilising a RNN to provide memory of previous inputs and outputs for the system, giving it temporal context.

Introducing temporal context to a deep learning steering model, Eraqi et al. [93] utilised a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM) to learn to steer a vehicle based on visual and dynamic temporal dependencies. The network was trained to predict steering angles based on image inputs, and was then compared to a simple CNN architecture used in [94]. Experimental results showed improved accuracy and smoother steering variations when using the C-LSTM network. However, the model was only evaluated offline by comparing the predicted control action against ground truth, which does not necessarily give an accurate evaluation of driving quality [95]. Live testing, where the model can control the vehicle to test the learned driving behaviour, should be used instead.
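A C-LSTM of the kind described above combines a convolutional feature extractor applied per frame with a recurrent layer over the frame sequence. The sketch below is a generic illustration of that structure, not the architecture used in [93]; all layer sizes are assumptions.

# Generic CNN + LSTM steering sketch: per-frame convolutional features are fed
# to an LSTM so the steering prediction has temporal context.
import torch
import torch.nn as nn

class CLstmSteering(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # steering angle for each time step

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.head(hidden)               # (batch, time, 1)

clips = torch.rand(2, 8, 3, 66, 200)           # two clips of 8 consecutive frames
print(CLstmSteering()(clips).shape)            # torch.Size([2, 8, 1])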
Lateral control techniques for lane change manoeuvres have also been presented. Wang et al. [96] used reinforcement learning to train an agent to execute lane change manoeuvres using a Deep Q-Network (DQN). The network uses host vehicle speed, longitudinal acceleration, position, yaw angle, target lane, lane width and road curvature to provide a continuous value for the desired yaw acceleration. To ensure Q-learning could be used to output continuous action values, a modified Q-learning approach was used, where the Q-function was a quadratic function approximated by three single hidden layer feedforward neural networks. The proposed approach was tested in a simulated highway environment, with preliminary results showing effective lane change manoeuvres learned by the agent.

A summary of the research works covered in this section can be seen in Table I. Due to the advancements mentioned previously, the recent trend has been to move to deeper models with increased amounts of training data. Recent works have also investigated introducing temporal cues into the learning model, but this suffers from instability in training. Moreover, many of the models developed so far have been trained and evaluated in relatively simple environments. For instance, most researchers have decided to focus on lateral control for a single task. For example, in models trained for lane keeping, no decision making for lane changes or turns onto different roads has been incorporated in these systems. This opens possible avenues for future research where multiple actions could be carried out by the same DNN. It should also be noted that the majority of these works were trained and evaluated in simulated environments, which further simplifies the task and would require further tests to validate their real world performance. Nevertheless, there have been important developments in this field and these results show great promise for the use of deep learning for autonomous vehicle control.

TABLE I
A COMPARISON OF LATERAL CONTROL TECHNIQUES.

B. Longitudinal Control Systems

Machine learning methods have also shown promise in applications to vehicle longitudinal control, such as ACC design. ACC can be described as an optimal tracking control problem for a complex nonlinear system [97], [98] and is therefore poorly suited to control systems based on linear vehicle models or other algebraic analytical solutions [99]. Such traditional control systems provide poor adaptability in complex environments and do not conform to the driver's habits [100]. The strong nonlinear nature of the system makes it difficult to build a vehicle model without significant uncertainty, limiting the effectiveness of model-based solutions. However, neural networks have shown great potential for optimising nonlinear, high-dimensional control systems [40], [41], [101]–[106]. For instance, reinforcement learning can learn an optimal control policy through interaction with the environment, without knowledge of the system model [50]. Furthermore, the strong adaptive capacity and model-free capability of reinforcement learning makes it an attractive solution for ACC design. In early works, Dai et al. [107] proposed a fuzzy reinforcement learning method for longitudinal control of an autonomous vehicle. The method combines a Q estimator network (QEN) with a Takagi-Sugeno-type Fuzzy Inference System (FIS). The QEN is used to estimate the optimal action value function, whilst the FIS gets the control output based on the estimated action value function. The described approach was evaluated in a simulation of a car-following scenario where the lead vehicle varies its velocity over time, with a maximum episode duration of 80 s. The controller was shown to be able to successfully drive the vehicle without failing after 68 trials. However, the reward function of the approach proposed by Dai et al. is only based on the spacing between the lead and the following vehicle. The reward function is the key to a successful reinforcement learning approach, as it is the means by which the developer indicates the desirability of being in any given state. Therefore, the reward function needs to accurately capture the task to be performed and the manner in which it should be completed. For longitudinal control, the reward function should motivate the agent to adopt a safe and efficient driving strategy. For these reasons, a reward function with only one parameter, such as inter-vehicle spacing, may not be sufficient in real-time applications.

There are several works in which the use of multi-objective reward functions has been explored. For example, Desjardins & Chaib-Draa [23] used a multi-objective reward function based on time headway (distance in time from the lead vehicle) and time headway derivative. The agent was encouraged through the reward function to keep a 2 s time headway to the lead vehicle, and the time headway derivative provided information regarding whether the vehicle is moving closer to or farther from the lead vehicle, and allowed it to adjust
its driving strategy accordingly. Taking the time headway derivative into consideration in the reward function encourages the agent to choose actions which help it progress toward the desired state (ideal time headway). The authors used this reward function in a policy-gradient method for a Cooperative Adaptive Cruise Control (CACC) system. The neural network architecture chosen had two inputs, a single hidden layer of 20 neurons, and an output layer with 3 discrete actions (brake, accelerate, do nothing). In the learning process, an average of over 2.2 million iterations were obtained over ten learning simulations. The chosen method was shown to be efficient in CACC, providing average time headway errors of 0.039 s in an emergency braking scenario. While the magnitude of the time headway errors remains small, it should be noted that the velocity profile of the subject vehicle showed oscillatory behaviour. This would make the system uncomfortable for the passengers as well as pose a potential safety risk. Potential solutions for this could include utilising continuous action values, the use of RNNs, or negative rewards for changes in acceleration to help smooth the velocity profile of the vehicle. Similarly, Sun [99] proposed a CACC system based on rewards from time headway and time headway derivative in a Q-learning algorithm. This approach was shown to reduce the learning time of the neural network. Over one hundred learning simulations, the best performing policy (the policy which obtained the highest reward) was chosen for evaluation. The algorithm was evaluated in a simulation of a stop-and-go environment in which the lead vehicle accelerated and decelerated periodically. The agent was shown to provide adequate performance in a platoon scenario. However, whilst such multi-objective reward functions are an improvement over single objective reward functions such as the one proposed by Dai et al. [107], this reward function does not consider passenger comfort, which could lead to harsh accelerations or decelerations.
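To make the reward-shaping discussion concrete, a minimal multi-objective reward for car-following might combine headway tracking, progress towards the ideal headway, and a comfort penalty. The weights and the 2 s target below are illustrative assumptions rather than values taken from the cited works.

# Illustrative multi-objective reward for longitudinal (car-following) control.
# Assumed weights; a real design would tune these against safety and comfort requirements.
TARGET_HEADWAY_S = 2.0   # desired time headway to the lead vehicle

def reward(headway_s, headway_derivative, acceleration, prev_acceleration,
           w_headway=1.0, w_progress=0.5, w_comfort=0.1):
    # Penalise deviation from the desired time headway.
    headway_term = -abs(headway_s - TARGET_HEADWAY_S)
    # Reward moving towards the desired headway: the sign of the derivative tells us
    # whether the gap is opening or closing relative to where we want to be.
    progress_term = headway_derivative if headway_s < TARGET_HEADWAY_S else -headway_derivative
    # Penalise jerky accelerations to encourage a smooth, comfortable velocity profile.
    comfort_term = -abs(acceleration - prev_acceleration)
    return w_headway * headway_term + w_progress * progress_term + w_comfort * comfort_term

print(reward(headway_s=1.5, headway_derivative=0.2, acceleration=-0.5, prev_acceleration=0.0))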
Huang et al. [108] presented a Parameterised Batch Actor-Critic (PBAC) reinforcement learning algorithm for longitudinal control of autonomous vehicles based on actor-critic algorithms. A multi-objective reward function was designed to reward the algorithm for tracking precision and drive smoothness. The method was validated by field experiments in various driving environments (e.g. flat, slippery, sloping, etc.) and the results suggested the method can track time-varying speeds more precisely than traditional Proportion-Integration (PI) or Kernel-based Least Square Policy Iteration (KLSPI) controllers trained with reinforcement learning [109], [110]. This was due to lower sensitivity to noise of speeds and accelerations. Moreover, smooth driving was achieved using the proposed method. The addition of driving smoothness in the reward function makes these systems more comfortable for passengers. However, the method was evaluated in an environment without adjacent vehicles or other obstacles. This allowed the authors to not consider safety parameters in the reward function, which leaves the algorithm susceptible to crashes in environments with other vehicles present. Therefore, additional terms for safety would be required in the reward function to ensure safe behaviour of the autonomous vehicle.

One such reward function was proposed by Chae et al. [111], who developed an autonomous braking system for collision avoidance based on a DQN approach. The reward function balances two conflicting objectives: avoiding collision and getting out of high risk situations. To speed up convergence, a replay memory was used to store a number of episodes, of which some are chosen randomly to help train the network.
[...] performance in critical scenarios such as emergency braking. A summary of the longitudinal control methods can be seen in Table II. In contrast to lateral control systems, vision-based inputs are not generally used for longitudinal control. Instead, sensor inputs from ranging sensors (e.g. RADAR, LIDAR) and host vehicle states are more commonly used. These lower dimensional inputs (e.g. time headway or relative distance) can then easily be used to define a reward function for reinforcement learning. The second major difference between lateral and longitudinal control algorithms is the choice of learning strategies. While lateral control techniques favour supervised learning techniques trained on labelled datasets, longitudinal control techniques favour reinforcement learning methods which learn through interaction with the environment. However, as seen in this section, the reward function in reinforcement learning needs to be carefully designed. Safety, performance, and comfort all need to be considered. Poorly designed reward functions result in poor performance or the model not converging. Another challenge with reinforcement learning algorithms is the trade-off between exploration and exploitation. During training, the agent must take random actions to explore the environment. However, to perform well in its task the agent should exploit its knowledge to find the optimal action. Example solutions for this are the ε-greedy exploration policies and the Upper Confidence Bound (UCB) algorithm. ε-greedy strategies choose a random action with a probability ε, which decreases over time as the agent learns its environment. On the other hand, UCB encourages exploration in states with high uncertainty, whilst exploitation is encouraged in regions with high confidence. Therefore, intrinsic motivation is implemented in the system, encouraging the agent to learn about its environment, whilst exploitation can be taken advantage of in states which have already been explored adequately [51], [117]–[119]. Other approaches have sought to use supervised learning as a pre-training step to get the advantages of both reinforcement and supervised learning.
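As a small illustration of the exploration strategies mentioned above, an ε-greedy action selection with a decaying ε can be written as below; the decay schedule and the Q-value representation are assumptions of this sketch.

# Epsilon-greedy action selection with a simple exponential decay of epsilon.
import random

class EpsilonGreedy:
    def __init__(self, n_actions, eps_start=1.0, eps_min=0.05, decay=0.999):
        self.n_actions, self.eps, self.eps_min, self.decay = n_actions, eps_start, eps_min, decay

    def select(self, q_values):
        """q_values: list of estimated action values for the current state."""
        if random.random() < self.eps:
            action = random.randrange(self.n_actions)                        # explore
        else:
            action = max(range(self.n_actions), key=lambda a: q_values[a])   # exploit
        self.eps = max(self.eps_min, self.eps * self.decay)                  # anneal exploration over time
        return action

explorer = EpsilonGreedy(n_actions=3)
print(explorer.select([0.1, 0.5, -0.2]))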
[...] Moreover, during 50 tests on a competition track, the proposed approach completed the track 49 times, compared to only 33 with NFQ. Additionally, DQFE performed better in terms of mean distance from the centre of the track. Therefore, the addition of filtered experience replay improved the speed of convergence as well as the performance of the algorithm. Comparing two neural networks for lane keeping systems, Sallab et al. [122] investigated the effects of discretised and continuous actions. Two approaches, DQN and a Deep Deterministic Actor Critic (DDAC) algorithm, were evaluated in the TORCS simulator [123]. In the two networks developed by the authors, the DQN could only output discretised values (steer, gear, brake, and acceleration), while the DDAC supports continuous action values. The DDAC consisted of two networks: an Actor Network, which is a neural network responsible for taking actions based on perceived states, and a Critic Network, which criticises the value of the action taken. The experimental results showed that the DQN algorithm suffered in performance due to the fact that it cannot support continuous actions or state spaces. The DQN algorithm is suitable for continuous (input) states, however it still requires discrete actions since it finds the action that maximises the action-value function. This would require an iterative process at every time step for continuous action spaces [124]. As shown in Fig. 3, the ability to support continuous action values allowed the DDAC algorithm to follow curved tracks more smoothly and stay closer to the centre of the lane when compared to the DQN algorithm, thereby producing better performance for lane keeping.

TABLE II
A COMPARISON OF LONGITUDINAL CONTROL TECHNIQUES.
[...] that states which were not reached in the initial training set can be covered in the new extended training set. The primary policy is then iteratively fine-tuned using the new training set. Zhang et al. proposed an extension to this method, called SafeDAgger, where the system estimates (in any given state) whether the primary policy is likely to deviate from the reference policy. If the primary policy is likely to deviate by more than a specified threshold, the reference policy is used to drive the vehicle instead. The safety policy is estimated by a fully connected network where the input is the last convolutional layer's activation. The authors used this method to train a CNN to predict a continuous steering wheel angle and a binary decision for braking (brake or do not brake). The authors then evaluated supervised learning, DAgger, and SafeDAgger by driving them on three test tracks, with up to three laps on each track. Out of the three algorithms evaluated, SafeDAgger was found to perform best in terms of the number of completed laps, number of collisions, and mean squared error of steering angles. In another work, Pan et al. [126] used DAgger-like imitation learning to learn to drive at high speeds autonomously, with continuous actions for both steering and acceleration. The reference policy for the dataset was obtained from a model predictive controller operated using expensive high resolution sensors, which the CNN then learned to imitate using only low cost camera sensors for observations. The technique was first tested in Robot Operating System (ROS) Gazebo [127] simulations, followed by a real-world 30 m long dirt track with a 1/5-scale vehicle. The sub-scale vehicle successfully learned to drive at speeds up to 7.5 m/s around the track. Instead of using direct vision for control, Wang et al. [128] demonstrated that DAgger can be used to train an object-centric policy, which uses salient objects in the image (e.g. vehicles, pedestrians) to output a control action. The trained control policy was tested in a Grand Theft Auto V simulation, with a discrete control action (left, straight, right, fast, slow, stop) which was then translated to continuous control with a PID controller. The test results demonstrated improved performance with the object-centric policy compared to models without attention or those based on heuristic object selection. Vision based techniques have also been used to mitigate collisions by Porav & Newman [129], who built on the previous work by Chae et al. [111] by using a deep reinforcement learning algorithm for collision mitigation which can provide continuous control actions for both velocity and steering. The system uses a Variational AutoEncoder (VAE) coupled with an RNN to predict the movement of obstacles and learns a control policy with Deep Deterministic Policy Gradient (DDPG) to mitigate collisions in low TTC scenarios. The network used a semantically segmented image to predict continuous steering and deceleration actions. The proposed technique shows improvement over braking-only policies for TTC values between 0.5 and 1.5 s, and up to 60% reduction in collision rates.
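The DAgger-style procedure referred to above iteratively collects the states visited by the learner, labels them with a reference (expert) policy, and retrains on the aggregated data. The sketch below assumes generic learner, expert and env objects, which are placeholders for this illustration rather than interfaces from any of the cited works.

# Schematic DAgger loop: run the learner, have the expert relabel the visited
# states, aggregate the data, and retrain. `learner`, `expert` and `env` are
# assumed placeholder objects providing the methods used below.
def dagger(learner, expert, env, n_iterations=10, rollout_len=200):
    dataset = []                                        # aggregated (state, expert_action) pairs
    for _ in range(n_iterations):
        state = env.reset()
        for _ in range(rollout_len):
            action = learner.act(state)                 # the learner drives, so it visits its own states
            dataset.append((state, expert.act(state)))  # but the labels come from the reference policy
            state, done = env.step(action)
            if done:
                break
        learner.fit(dataset)                            # supervised re-training on the aggregate
    return learner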
Inverse Reinforcement Learning (IRL) approaches have also
been investigated in the context of control systems as a way to overcome the difficulty of defining an optimal reward function. IRL is a subset of reinforcement learning, in which the reward function is not specified, but the agent attempts to learn it from an expert's demonstrations. In IRL, the agent assumes that the expert is completing the task by following an unknown reward function. It then estimates a reward function in which the demonstrators' trajectory is the most likely one. This has the advantage that instead of requiring the developer to explicitly specify a reward function, they simply have to demonstrate the intended behaviour. This can be advantageous since in large and complex tasks, defining an adequate reward function to provide optimal agent behaviour can be both difficult and time consuming [130]. IRL approaches have been shown to not only reduce the amount of time required for design and optimisation, but also improve the system performance by creating more robust reward functions. Abbeel & Ng [131] showed that when IRL was applied to a problem where the agent learned by observing an expert, the agent performed as well as the expert when evaluated with respect to the reward function used by the expert, even if the reward function derived from observations was not the expert's true reward function. Moreover, it was shown that in a simplistic highway driving scenario with 5 different actions for lane selection available to the agent and multiple driving styles demonstrated, the IRL algorithm successfully learned to mimic the demonstrated driving behaviours. Further, Silver et al. [21] used an IRL algorithm based on Maximum Margin Planning [132] which was shown to be effective in a demonstration of an autonomous vehicle in unstructured terrain. The vehicle was shown to perform better than an agent based on traditional reinforcement learning with a hand-tuned reward function. Additionally, the IRL approach was shown to require significantly less time to design and optimise compared to the reinforcement learning agent. Kuderer et al. [20] proposed a vehicle controller that can learn individual driving styles from demonstration using IRL. The algorithm assumes that the demonstrator is driving in a way to maximise an unknown reward function. From this, the learning model estimates the weights in a linear reward function based on 9 features for driving. Initially, the weights were equally set and were then updated based on demonstrations of 8 minutes per driver. After finding the driving policy, the chosen trajectories were compared to those observed from human drivers. The system was shown to learn drivers' personal driving styles from minimal training data and performed adequately in simulated testing.

Building on the IRL approaches, Wulfmeier et al. [133] proposed an IRL approach for deep learning. The proposed algorithm is based on the Maximum Entropy [134] model for a trajectory planner, and uses CNNs to infer the reward functions from expert demonstration. The approach was trained on a dataset collected over the course of one year, with a total of 120 km of driving a modified golf cart on walkways and cycle lanes. The input to the network was the LIDAR point cloud map, which was represented on a discretised grid map. The output of the network was a discrete set of actions. The proposed approach was demonstrated to work better than a manually constructed cost function. Moreover, the learned algorithm was shown to be more robust to sensor noise. This shows that the use of DNNs in an IRL algorithm for trajectory planning was beneficial overall. Therefore IRL techniques could be considered as a potential way to overcome the difficulties of designing an optimal reward function for driving.

However, there are some challenges for IRL approaches in practical applications. Firstly, there is no guarantee of optimality of the demonstrations. For example, in a driving demonstration, no human driver can carry out the driving tasks optimally every time. Therefore, the training data will include suboptimal demonstrations which will affect the final reward function constructed. There are some solutions to minimising the effect of suboptimal demonstrations: using multiple trajectories and averaging over multiple sets to find a reward function, or removing the assumption of global optimality [135]. Secondly, reward ambiguity can lead to further problems in IRL approaches. Given expert demonstrations of driving strategies, there can be multiple reward functions that explain the expert's behaviour. Therefore, an effective IRL algorithm must find a reward function that considers the expert's trajectory optimal and rejects other possible trajectories. Thirdly, the reward function derived through IRL methods may not be safe, as noted by Abbeel et al. [136], who used IRL to operate an autonomous helicopter and had to manually tune the reward function for safety. Therefore, hand tuning of the derived reward function may be required to ensure safe behaviour. Lastly, the computational burden of IRL methods can be heavy, since they often require iteratively solving reinforcement learning problems with each new reward function derived [130]. Nevertheless, in tasks where an adequately accurate reward function cannot be easily defined, IRL approaches can provide an effective solution.
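A common way to formalise the linear-reward IRL setting mentioned above is to assume R(s) = w·φ(s) and adjust w so that the learner's expected feature counts approach the expert's. The sketch below shows one naive, projection-style update of this idea; the feature map phi, the solve_policy helper, and all constants are assumptions of this illustration, not the algorithms used in the cited works.

# Naive sketch of linear-reward IRL: adjust weights w so that the feature
# expectations of the learned policy approach those of the expert.
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Average discounted feature counts over a set of trajectories of states."""
    mu = np.zeros(phi(trajectories[0][0]).shape)
    for traj in trajectories:
        for t, state in enumerate(traj):
            mu += (gamma ** t) * phi(state)
    return mu / len(trajectories)

def irl_weights(expert_trajs, phi, solve_policy, n_iters=20):
    """solve_policy(w) is an assumed helper: it trains a policy for the reward
    w.phi(s) and returns sampled trajectories of that policy."""
    mu_expert = feature_expectations(expert_trajs, phi)
    w = np.zeros_like(mu_expert)
    mu_learner = feature_expectations(solve_policy(w), phi)
    for _ in range(n_iters):
        w = mu_expert - mu_learner                 # push the reward towards under-visited expert features
        mu_learner = feature_expectations(solve_policy(w), phi)
    return w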
While the previously mentioned works in this section demonstrate that a DNN can be trained to drive a vehicle, training a vehicle to simply follow a road or keep in its lane without any outside context is not sufficient for deploying fully autonomous vehicles. Humans drive vehicles with the goal of arriving at our target destination, and learning to drive from camera images to imitate human driving behaviour is not enough to understand the full context behind the human driver's action. For instance, it has been reported [83] that upon reaching a fork in the road, end-to-end driving techniques tend to oscillate between the two possible driving directions. Not only is this impractical if our goal is to continue in the left direction, but it can result in unsafe behaviour where the DNN oscillates between left and right but never picks either direction. Aiming to provide autonomous vehicles with contextual awareness, Hecker et al. [137] collected a data set with a 360-degree view from 8 cameras and a driver following a route plan. This data set was then used to train a DNN to predict steering wheel angle and velocities from example images and route plans in the data set. Qualitative testing was done to evaluate learning on instances from the data set, suggesting the model was learning to imitate the human driver, but no live testing was completed to validate performance. With a similar aim, Codevilla et al. [138] trained a supervised learning algorithm, which uses both images and a high-level navigational command for its driving policy. The network was
trained through end-to-end supervised learning, conditioned by a high-level command which could be follow road, go straight, turn left, or turn right. The authors tested two network architectures which could take the navigational command into account: one where the command was an additional input to the network, and one where the network branched at the end into multiple sub-modules (feedforward layers), one for each possible command. The authors noted that the latter architecture performed better. The resulting network was initially tested in CARLA [139] simulation, followed by real-world testing on a 1/5-scale car. The resulting policy successfully learned to turn the correct way at intersections as commanded. The authors noted that data augmentation and noise injection during training were key to learning a robust control policy.
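The branched architecture described above can be illustrated with a small command-conditional network in which a shared image encoder feeds one output head per high-level command, and the command selects which head produces the control output. The sizes and the four-command set below are assumptions of this sketch rather than the configuration used in [138].

# Command-conditional (branched) control sketch: a shared encoder and one small
# output head per high-level navigational command.
import torch
import torch.nn as nn

COMMANDS = ["follow road", "go straight", "turn left", "turn right"]

class BranchedPolicy(nn.Module):
    def __init__(self, n_branches: int = len(COMMANDS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per command; each predicts [steering, acceleration].
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
            for _ in range(n_branches)
        ])

    def forward(self, image: torch.Tensor, command_idx: torch.Tensor) -> torch.Tensor:
        features = self.encoder(image)
        outputs = torch.stack([branch(features) for branch in self.branches], dim=1)
        # Select, for each sample, the head that matches its high-level command.
        idx = command_idx.view(-1, 1, 1).expand(-1, 1, outputs.size(-1))
        return outputs.gather(1, idx).squeeze(1)

policy = BranchedPolicy()
images = torch.rand(2, 3, 128, 128)
commands = torch.tensor([0, 3])            # "follow road", "turn right"
print(policy(images, commands).shape)      # torch.Size([2, 2])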
This method was further extended in [140] by using an extra module for velocity prediction, which helps the network in some situations, such as when the vehicle is stopped at a traffic light, to predict the expected vehicle velocity from visual cues and prevent it from getting stuck when the vehicle comes to a full stop. Further improvements to the model were a deeper network architecture and a larger training set, which reduced the variance in training. A slightly different approach was explored using reinforcement learning by Paxton et al. [141], where the high-level command is provided by another DNN responsible for decision making. The system consisted of a DDPG network for low-level control and a DQN for a stochastic high level policy subject to linear temporal logic constraints. The aim of the vehicle was to navigate a busy intersection, where some lanes had stopped vehicles so that the host vehicle had to successfully change lanes as well. The system was tested in 100 simulated intersections with and without stopped cars ahead, for a total of 200 tests. Without stopped cars the agent succeeded every time, whereas with stopped cars ahead, 3 collisions occurred.

Moving away from end-to-end approaches, researchers at Waymo recently presented ChauffeurNet [142]. ChauffeurNet uses mid-to-mid learning to learn a driving policy, where the input is a pre-processed top-down view of the surrounding environment which represents useful features such as the roadmap, traffic lights, a route plan to follow, dynamic objects, and past agent poses. The agent then processes these inputs through an RNN to provide a heading, speed, and waypoint, which are then achieved through a low-level controller. This had the advantage that pre-processed inputs could be obtained either from simulation or real-world data, which makes transferring driving policies from simulation to the real world easier [143], [144]. Furthermore, synthesising perturbations to model recoveries from incorrect lane positions, or even scenarios such as collisions or driving off-road, provides the model with robustness to errors and allows the model to learn to avoid such scenarios.

An overview of full vehicle control approaches can be seen in Table III. Unlike previous sections, a variety of learning strategies have been utilised here, however supervised learning is still the preferred approach. An important note on the works where full vehicle control via neural networks is researched is that robust and high performing models still seem out of reach. For instance, techniques which implement full vehicle control tend to have poorer performance on steering than techniques which only consider steering. This is explained by the significant increase in the complexity of the task which the neural network is trained to perform. For this reason, several of the works summarised in this section have been trained and evaluated in simplified simulated environments. While full vehicle control should be the end goal of autonomous vehicle control techniques, current approaches have yet to achieve adequate performance in complex and dynamic environments. Therefore, future research is required to further improve the control performance of neural network-driven autonomous vehicles.

TABLE III
A COMPARISON OF FULL VEHICLE CONTROL TECHNIQUES.

IV. CHALLENGES

The previous section discussed various examples of deep learning applied to vehicle controller design. While this shows that there is a significant amount of interest in the research of such systems, they are still far from ready for commercial application. There remain a number of challenges that must be overcome before learned autonomous vehicle technology is ready for widespread commercial use. This section is dedicated to discussing the technological challenges for deep learning based control of autonomous vehicles. It is worth remembering that besides these technological challenges, issues such as user acceptance, cost efficiency, machine ethics for artificial intelligence technologies, and lack of legislation/regulation for autonomous vehicles must also be addressed. However, the aim of this manuscript is to focus on deep learning based autonomous vehicle control methods and their technical challenges; therefore general and non-technological challenges for autonomous vehicles are out of the scope of this manuscript. For further reading on these topics, see [145]–[150].

A. Computation

The major drawback for deep learning methods is the large amount of data and time required for adequate training, especially for reinforcement learning methods. This can lead to long training periods which can cause delays and additional cost in the design of an autonomous vehicle. The common solution to reduce training data requirements or the time required for training is to combine reinforcement learning with supervised learning, which helps reduce the training time whilst still providing good adaptability. Nevertheless, for a fully autonomous vehicle, the amount of training data required to build a reliable and robust system can be vast. It is challenging to train a vehicle to drive in all possible scenarios that it could encounter in the real world due to the huge quantity of data that needs to be collected. There are several companies researching autonomous driving using machine learning, and collaborating and sharing data would be the fastest route to move from experimental systems to commercial ones. However, this is unlikely as companies researching autonomous vehicles are not willing to share their resources due to fear of diluting their competitive advantage [151]. However, while increasing the amount of available data is useful to learn more complex behaviours, using larger data sets brings its own challenges, such as ensuring diversity of
the data. If the amount of data used for training the model is increased, without ensuring variety in the data set, the risk of overfitting to the data set increases. For instance, Codevilla et al. [140] compared 4 driving models trained with 2, 10, 50, and 100 hours of data, and it was shown that the model trained with 10 hours of driving data performed best in most scenarios. This is due to many of the instances in the training set being very similar, captured in typical driving conditions. As the data set size increases, rare driving scenarios (where the model is more likely to fail) are encountered increasingly rarely during training. Therefore, when generating large data sets, diversity in the data set must be ensured.

Further computational complexity is caused by the continuous states and actions in which the agent has to operate. As stated in the previous section, continuous action values are necessary for a deployable vehicle control system to have adequate performance. However, as the number of dimensions grows, the computational complexity grows exponentially [152]; this is known as the Curse of Dimensionality [153]. In the high-dimensional problems of vehicle control, this has a significant effect on the computational complexity of any solution. Although discretisation of the system can reduce the complexity, as seen in previous examples, this can lead to degradation in system performance. Other solutions include using multiple learners to reduce learning time [154], [155], evolution strategies which are highly parallelisable [156], or removing unnecessary data from the training and system input data [157].

Overall, the high computational burden of DNNs is a challenge to not only the development and training of the networks
but also the deployment of such systems in vehicles. The high computational overhead of the deep learning algorithms will require high computing capabilities on-board, driving up the system cost and power requirements, which must be kept in mind during the system design.

B. Architectures

Another challenge with deep learning is selecting the architecture of the neural networks. There are no clear guidelines for a 'good' neural network architecture for a given task. For instance, in terms of size and number of layers, it has been shown that too few neurons will lead to a system with poor performance. However, too many neurons may overfit to the training data and therefore not generalise well. Also, given that additional neurons will lead to increased computational complexity, finding an optimal number of neurons would be of great benefit to deep learning methods [158], [159]. Other parameters can also have an effect on the performance, training, and convergence of the system. The fundamental architecture, training method, learning rate, loss function, batch size, etc. all need to be decided upon and defined, and all affect the performance of the agent. However, there are few methods for choosing these parameters, and often trial-and-error and heuristics are the only viable options for optimising each parameter due to the complexity of DNNs [49]. This is generally achieved by choosing a range of values for the hyperparameters in the neural network, and finding the best performing values. However, using such trial-and-error methods for exploring the hyperparameter space can be slow, given the amount of computation required for each training run.

Solutions to this challenge currently being researched include computerised ways of finding optimal values for these parameters, either by trialling across a range or using model-based methods to converge on the best values. There are several methods for changing the parameters over the chosen range, such as Coordinate Descent [160], Grid Search [160], [161], and Random Search [162]. Coordinate Descent keeps all hyperparameters except one fixed, and finds the best value for one parameter at a time. Grid Search optimises every parameter simultaneously, including the cross-product of all intervals. However, this vastly increases the computational expense by requiring a large number of neural network models to be trained, and is therefore only suitable when the models can be trained quickly. Random Search often finds a good set of parameters faster than a Grid Search by sampling the chosen interval randomly [162], [163]. However, this has the disadvantage that the parameter space is often not covered completely, and some sample points can be very close to each other. These disadvantages can be solved by using quasi-random sequences [164]. Alternatively, one can use model-based hyperparameter optimisation methods, such as Bayesian optimisation or tree-structured Parzen estimators, which tend to yield better results but are more time intensive [164]–[167]. Other proposed approaches focus on automated hyperparameter tuning by eliminating undesirable regions of the hyperparameter search space in order to converge to optimal values [168], [169].
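As a concrete illustration of the random search strategy discussed above, the snippet below samples hyperparameter configurations from given ranges and keeps the best-scoring one. The search space and the train_and_evaluate function are placeholders assumed for this sketch.

# Simple random search over a hyperparameter space. `train_and_evaluate` is an
# assumed placeholder that trains a model with the given settings and returns a
# validation score (higher is better).
import math
import random

SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),     # sampled log-uniformly
    "batch_size": [16, 32, 64, 128],
    "hidden_units": [64, 128, 256],
}

def sample_config():
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** random.uniform(math.log10(lo), math.log10(hi)),
        "batch_size": random.choice(SEARCH_SPACE["batch_size"]),
        "hidden_units": random.choice(SEARCH_SPACE["hidden_units"]),
    }

def random_search(train_and_evaluate, n_trials=20):
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config()
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score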
Recent research has also explored neural architecture search methods which take hardware efficiency into account by incorporating the hardware feedback into the learning signal [170]–[173]. This has resulted in neural network architectures which are specialised for specific hardware platforms, and demonstrate a hardware efficiency benefit over non-specialised architectures. Such methods could also be extended to find efficient network architectures for vehicle on-board hardware platforms. It should be noted that automated neural architecture search is an active area of research; for further discussion on this topic we refer the reader to the survey by Elsken et al. [174].

While architecture selection is a general problem for many deep learning applications, a complex task such as autonomous driving also brings its own challenges. Currently, most end-to-end driving systems have been limited to smaller networks. This is due to the relatively small datasets used, which would cause deeper networks to overfit to the training data. However, as noted in [140], when large amounts of data are available, deeper architectures can reduce both bias and variance in training, resulting in more robust control policies. Further thought should be given to architectures specifically designed for autonomous driving, such as the conditional imitation learning model [138], where the network included a different final network layer for each high-level command used for driving. These challenges translate to mid-to-mid approaches as well, as the selection of high-level features represented in the input to the network must be chosen carefully. Future works investigating specialised network architectures for autonomous driving can therefore be expected.

C. Goal Specification

Adequate goal specification is a challenge specific to reinforcement learning methods. One of the advantages of reinforcement learning is that the behaviour of the agent does not need to be specified explicitly as it would be in rule-based systems. Only the reward function, which can often be easier to define than the value function, and the control action (e.g. steering, acceleration, braking) need to be defined. However, the goal of reinforcement learning is to maximise the long term accumulated reward as defined by the reward function. Therefore, the desired behaviour of the agent must be accurately captured by the reward function, otherwise unexpected and undesired behaviour might occur. For instance, instead of using binary rewards for successful or unsuccessful completion of tasks, intermediate rewards can be used to guide the agent towards desired behaviour; this process is known as reward shaping [175], [176]. For example, Desjardins & Chaib-Draa [23] used the time headway derivative to reward the agent for actions that helped it move towards the ideal time headway state. Furthermore, for a complex task such as driving, a multi-objective reward function needs to consider different objectives which may conflict with each other. For example, for driving, these objectives may include maintaining a safe distance from other vehicles, staying close to the centre of the lane, avoiding pedestrians, not changing lanes too often, maintaining desired velocity, and avoiding harsh accelerations/braking. Hence, the
Hence, the reward function should not only consider all factors that affect the agent's behaviour, but also the weight of these factors.
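To make the weighting of such conflicting objectives concrete, the sketch below combines several driving-related reward terms into a single scalar; the terms, signs, and weights are illustrative assumptions only, not values reported in the surveyed works.

    # Minimal sketch of a weighted multi-objective reward for driving.
    # The terms and weights below are illustrative assumptions only.
    REWARD_WEIGHTS = {
        "headway": 1.0,        # reward keeping a safe gap to the lead vehicle
        "lane_centre": 0.5,    # penalise lateral offset from the lane centre
        "speed": 0.5,          # penalise deviation from the desired velocity
        "comfort": 0.2,        # penalise harsh acceleration/braking
        "lane_change": 0.1,    # penalise unnecessary lane changes
    }

    def driving_reward(state, action):
        """Combine per-objective terms into one scalar reward."""
        terms = {
            "headway": -abs(state["time_headway"] - state["ideal_headway"]),
            "lane_centre": -abs(state["lateral_offset"]),
            "speed": -abs(state["speed"] - state["desired_speed"]),
            "comfort": -abs(action["acceleration"]),
            "lane_change": -1.0 if action["lane_change"] else 0.0,
        }
        return sum(REWARD_WEIGHTS[name] * value for name, value in terms.items())
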
A further challenge for agents which control both lateral & longitudinal actions is the difficulty of defining a reward function when the agent must be able to perform multiple actions (steering, braking, and acceleration). In reinforcement learning, the agent uses the feedback from the reward function to improve its own performance. However, when the agent is carrying out multiple actions, it may not be clear which of the actions resulted in the given reward. For example, if the vehicle steers away from the road, the acceleration may not be at fault but a negative reward signal is sent to the agent. One solution to this is a Hybrid Reward Architecture [177], where the system uses a decomposed reward function and learns a separate value function for each component reward function. Alternatively, Shalev-Shwartz et al. [178] proposed a solution in which the reward function is decomposed into a high level decision making system, through which the agent learns to drive safely and make strategic decisions (e.g. which cars to overtake or give way to), and a low level reward function which helps the agent learn an optimal policy for different actions (e.g. overtaking, merging, decelerating etc.).
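A minimal sketch of the reward-decomposition idea behind the Hybrid Reward Architecture is given below: the critic keeps one value head per reward component, so that feedback about, for example, lane keeping is not mixed into the value estimate associated with comfort. The component names and layer sizes are assumptions made for illustration, not details of [177].

    import torch
    import torch.nn as nn

    REWARD_COMPONENTS = ["lane_keeping", "headway", "comfort"]  # assumed split

    class DecomposedCritic(nn.Module):
        """State-value network with one output head per reward component."""

        def __init__(self, state_dim, hidden_dim=64):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
            # One value head per reward component.
            self.heads = nn.ModuleList(
                [nn.Linear(hidden_dim, 1) for _ in REWARD_COMPONENTS]
            )

        def forward(self, state):
            features = self.shared(state)
            # Per-component value estimates, shape (batch, n_components).
            component_values = torch.cat([head(features) for head in self.heads], dim=1)
            # The overall value is the sum of the component values.
            return component_values, component_values.sum(dim=1, keepdim=True)

    # Usage sketch: each head would be regressed against the return of its own
    # reward component, rather than against a single mixed reward signal.
    critic = DecomposedCritic(state_dim=8)
    values, total_value = critic(torch.randn(4, 8))
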
The developer should also take care that the agent does not exploit the reward function in unexpected ways, resulting in unintended behaviour. This effect is also known as reward hacking, which occurs when the agent finds an unanticipated way of exploiting the reward function to gain large rewards in a way which goes against the developers' defined objective(s) for the agent. For example, a robot used in ball paddling with a reward function based on the distance between the ball and the desired highest point may attempt to move the racket up and keep the ball resting on it [152]. Potential solutions to avoid reward hacking were proposed by Amodei et al. [179] in the form of adversarial reward functions, model look-ahead, reward capping, multiple reward functions, and trip wires. Adversarial reward functions utilise a reward function which is its own agent, similar to generative adversarial networks. The reward function agent can then explore the environment, making it more robust to reward hacking. It could, for example, try to find instances where the system claims a high reward from its actions while a human would label it as a low reward. On the other hand, model look-ahead gives a reward based on anticipated future states, instead of the present one. Reward capping is a simple solution to reward hacking, where a maximum value is imposed on the reward function, thereby preventing unexpectedly high reward scenarios. Multiple reward functions can also increase robustness to reward hacking, since multiple rewards can be more difficult to hack than a single one. Finally, trip wires are deliberately placed vulnerabilities in the system, where reward hacking is most likely to occur. These vulnerabilities are then monitored to alert the system if the agent is attempting to exploit its reward function. Another approach to solving these challenges in goal specification is using inverse reinforcement learning to extract a reward function from expert demonstrations of the task [180]–[183].
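Of the mitigations listed above, reward capping is simple enough to state directly; the sketch below clips the reward to an assumed maximum magnitude so that no single transition can return an unexpectedly large reward.

    # Minimal reward-capping sketch; the bound of 1.0 is an assumed value.
    REWARD_CAP = 1.0

    def capped_reward(raw_reward, cap=REWARD_CAP):
        """Clamp the reward into [-cap, cap] to limit reward-hacking payoffs."""
        return max(-cap, min(cap, raw_reward))
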
D. Adaptability & Generalisation

Another challenge for learned control systems is dealing with different environments in a scalable way. For example, a driving strategy that is successful in an urban environment may not be optimal on a highway, since these are very different environments with different traffic flow patterns and safety issues. Similar issues arise with changing weather conditions, seasons, climates etc. A neural network's ability to use what it has learned from previous experiences to operate in a completely new environment is referred to as generalisation. However, the problem with generalisation is that even if the system demonstrates good generalisation in one new environment, there is no guarantee it will generalise to other possible environments. Moreover, considering the complex operating environment of a vehicle, it is not possible to test the system in all scenarios. Therefore, building a deep learning system capable of generalising to such a vast variety of situations, as well as validating its generalisation capability, poses major challenges. This is a challenge that must be overcome for deep learning driven autonomous vehicles to be deployable in the real world, as the vehicles must be able to cope with the various different environments they will be used in.

Generally, to avoid poor generalisation in DNNs, the training must be stopped before the DNN starts to overfit to the training data. Overfitting refers to creating a model that fits the training data too well, losing its ability to generalise to new data. Overfitting occurs when the network is trained with either insufficient amounts of training data or too many training episodes on the same training data. This results in the neural network memorising the training data, thereby losing generalisation. Unfortunately, there are no known methods for choosing the optimal stopping point in order to avoid overfitting [184]. However, it is possible to get some indication of the network's generalisation capability by using three different data sets: training, validation, and test sets. The training and validation sets are used during training, but only the training set results are used to update the network weights [185]. The purpose of the validation set is to minimise overfitting by monitoring the error on the validation data set. In this way, it is ensured that changes which reduce the error on the training set also reduce the error on the validation set, thereby avoiding overfitting. If the accuracy on the validation set starts to decrease over the training iterations, then the network is starting to overfit and training should be stopped. In addition to stopping overfitting, a validation set can also be used to compare different network architectures (e.g. comparing two networks with different numbers of hidden layers) to provide a measure of generalisation. Nevertheless, utilising the validation set simultaneously for the selection of the network and for terminating training can result in overfitting to the validation set. Therefore an additional independent set, known as the testing set, is required for the evaluation of the network performance [186]. The testing set is only used to test the final network to confirm its performance and generalisation capabilities. The testing set must provide an unbiased evaluation of the network's generalisation [185].
Therefore, it is crucial that the test set is not used to choose between different networks or network architectures.
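The early-stopping procedure described above can be sketched as follows; the patience of five epochs and the train_one_epoch() and evaluate() helpers are assumptions used only to illustrate how the validation error is monitored, and a full implementation would also restore the best-performing weights.

    def train_with_early_stopping(model, train_one_epoch, evaluate,
                                  train_set, val_set, max_epochs=100, patience=5):
        """Stop training once the validation error stops improving.

        train_one_epoch(model, train_set) is assumed to update the model weights
        using the training set only; evaluate(model, data) returns an error value.
        """
        best_val_error = float("inf")
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch(model, train_set)      # weights updated on training data only
            val_error = evaluate(model, val_set)   # validation data used only for monitoring
            if val_error < best_val_error:
                best_val_error = val_error
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # validation error no longer improving
                break
        return model, best_val_error
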
There are also techniques available for DNNs which aim to reduce the test error, although often at the cost of increased training error, known as regularisation techniques [48]. The basis of regularisation techniques is to introduce some constraints on the deep learning model, which either introduce prior knowledge into the model or promote simpler models, in order to achieve better generalisation capability. There is a variety of regularisation techniques to choose from. For instance, L1 and L2 regularisation introduce a constraint on the model by including an additional term in the cost function of the learning model, which makes the network prefer smaller weights. The smaller weights in the network reduce the effect of individual inputs on its behaviour, which means that the effect of local noise is reduced and the network is more likely to learn trends across the whole data set [49], [187]. Similarly, imposing constraints on the network weights through weight clipping has also been shown to improve robustness [188], [189]. Another popular regularisation technique is dropout, which drops randomly selected neurons from training and only updates the remaining weights for the given training example. At each weight update, a different set of neurons is omitted, thereby preventing complex co-adaptations between neurons. This helps each neuron learn features which are important for the given task and therefore helps reduce overfitting [190], [191].
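As an illustration, the sketch below adds the two regularisers discussed above to a small control network in PyTorch: an L2 weight penalty, applied through the optimiser's weight_decay term, and dropout between the hidden layers. The network size, dropout probability and penalty strength are assumed values, not recommendations from the surveyed works.

    import torch
    import torch.nn as nn

    # Small illustrative control network with dropout between hidden layers.
    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5),   # randomly drop half the units
        nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(64, 2),                                   # e.g. steering and acceleration
    )

    # weight_decay adds an L2 penalty on the weights, favouring smaller weights;
    # 1e-4 is an assumed value.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    loss_fn = nn.MSELoss()

    def training_step(states, targets):
        model.train()                  # enables dropout during training
        optimizer.zero_grad()
        loss = loss_fn(model(states), targets)
        loss.backward()
        optimizer.step()
        return loss.item()

    # At evaluation time dropout is disabled by calling model.eval().
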
E. Verification & Validation

The testing of the system needs to be rigorous to validate its performance and safety. However, the problem is that real-world testing can be expensive in terms of time, labour, and finances. Indeed, full-scale vehicle studies with multiple vehicles have typically been achieved through collaboration of government research projects with automotive manufacturers, such as Demo '97 [192]–[194] or Demo 2000 [195]. Alternatively, simulation studies can reduce the amount of field testing required and can be used as a first step for performance and safety evaluation. Simulation studies are significantly cheaper, faster, and more flexible, and can be used to set up situations not easily achieved in real life (e.g. crashes). Indeed, with the increasing accuracy and speed of simulation tools, simulation has become an increasingly dominant method of study in this field [196].

While simulation has multiple advantages, the model errors must be kept in consideration throughout the verification and validation process. This is especially critical for training, as training an agent in an imprecise model will result in a system that will not transfer to the real world without significant modifications [152], [197]. Complex mechanical interactions, such as contacts and friction, are often difficult to model accurately. These small variations between the simulation model and the real world can have drastic consequences on the system behaviour in the real world. In other words, the problem is the agent overfitting its policies to the simulation environment and not transferring well to the real-world environment. For a system that can be evaluated and used in the real world, training, as well as testing, in both simulation and field tests would be required [198]. The large number of trials required for reinforcement learning algorithms to converge makes them susceptible to this issue where simulation is used for training. However, recent studies in robot manipulation have shown effective transfer of learned policies from simulation to the real world [199]–[202].

Validation of the model and simulation environment alone is not enough for autonomous vehicles, as the influence of the training data can be equal to that of the algorithm itself [203]. Therefore, there should also be emphasis on validating the quality of the training set. Ensuring that the data set represents the desired operational environment adequately and covers the potential states is important. For instance, data sets that are biased towards a certain action (e.g. turn left) or scenario (e.g. driving in daytime) can introduce harmful biases into the learning model. Therefore, data sets should be validated to understand if they contain potentially harmful biases or patterns that could lead to undesirable behaviour of the learned control policy [56].
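A simple validation step of the kind suggested above can be sketched as follows: it counts how often each action or scenario label occurs in a data set and flags labels that fall below an assumed minimum share. The label names and the 5% threshold are illustrative assumptions.

    from collections import Counter

    def check_label_balance(labels, min_share=0.05):
        """Flag labels that are under-represented in the data set.

        labels is assumed to be a list of per-sample tags, e.g. the discretised
        steering command ("left", "straight", "right") or a scenario tag
        ("day", "night", "rain").
        """
        counts = Counter(labels)
        total = sum(counts.values())
        report = {}
        for label, count in counts.items():
            share = count / total
            report[label] = (share, share < min_share)   # (share, under-represented?)
        return report

    # Example: a data set heavily biased towards driving straight.
    labels = ["straight"] * 900 + ["left"] * 60 + ["right"] * 40
    for label, (share, flagged) in check_label_balance(labels).items():
        print(f"{label}: {share:.1%}" + ("  <- under-represented" if flagged else ""))
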
F. Safety

In a safety-critical system such as vehicle operation, a serious malfunction or failure could result in death or serious harm to people or property. Therefore, the safety of road users must be ensured before such systems are deployed commercially. However, ensuring functional safety in deep learning systems can be challenging. As neural networks become more complex, the solutions they provide, and how they come to those solutions, become increasingly difficult to interpret [204]. This is known as the black box problem. The opacity of these solutions is an obstacle to their implementation in safety-critical applications; while it is possible to show that these systems provide good performance in our validation environment, it is impossible to test them in all the possible environments they would encounter in the real world. Therefore, if we do not understand the way in which the system makes its decisions, ensuring it does not make unsafe decisions in new environments becomes increasingly difficult. This becomes even more challenging in online learning methods, since they change their policies during operation and could therefore shift from safe policies to unsafe policies over time [205]–[209].

Any autonomous vehicle system not only needs to drive safely, but also needs to be capable of reacting in a safe manner to other vehicles or pedestrians acting unpredictably. It can be difficult to guarantee the safety of any vehicle controller if, for example, another driver is acting recklessly or a previously unseen pedestrian runs onto the road. Therefore, it would be useful to include unsafe and aggressive driving behaviours of other vehicles in the training data of the vehicle controller to enable it to learn how to deal with such situations. One option to improve reliability and safety in such situations is utilising a trauma memory [111] where rare negative events (e.g. collisions) are stored. These are then used in training to persistently remind the agent of these events and ensure it maintains safe behaviour.
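A minimal sketch of the trauma-memory idea of [111] is shown below: rare negative events such as collisions are stored in a separate buffer, and a fixed fraction of every training batch is drawn from that buffer so the agent is persistently reminded of them. The buffer sizes and the 10% trauma fraction are assumed values.

    import random
    from collections import deque

    class TraumaReplayMemory:
        """Replay memory that always mixes in rare negative experiences."""

        def __init__(self, capacity=100000, trauma_capacity=5000, trauma_fraction=0.1):
            self.normal = deque(maxlen=capacity)
            self.trauma = deque(maxlen=trauma_capacity)   # e.g. collision transitions
            self.trauma_fraction = trauma_fraction

        def add(self, transition, is_trauma=False):
            (self.trauma if is_trauma else self.normal).append(transition)

        def sample(self, batch_size):
            n_trauma = min(int(batch_size * self.trauma_fraction), len(self.trauma))
            batch = random.sample(list(self.trauma), n_trauma)
            batch += random.sample(list(self.normal),
                                   min(batch_size - n_trauma, len(self.normal)))
            return batch
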
TABLE IV
A SUMMARY OF RESEARCH CHALLENGES.

Architectures
Challenges: lack of clear rules for network architectures; reliance on heuristics and trial-and-error.
Potential solutions: automated neural architecture search methods; specialised architectures for autonomous driving.

Adaptability & Generalisation
Challenges: wide variety of the operational environment; overfitting to training data/environment.
Potential solutions: representative data sets and/or training environments; effective use of regularisation techniques.
Also, safety must be maintained during any training or testing in the real world. For instance, during early training of a reinforcement learning agent, the agent is more likely to use exploration than exploitation of past experiences, which means the agent will effectively be learning through trial and error. Therefore, care must be taken to ensure that exploration happens in a safe manner. This is especially true in any environment including other road users or pedestrians, since inappropriate actions chosen due to exploration could have disastrous results. Exploration poses safety challenges, as the agent is encouraged to take random actions, which can lead to catastrophic events if not considered beforehand [210]–[213]. Potential solutions include the use of demonstrations, such as in IRL, to provide examples of safe behaviour which can be used as a baseline policy; simulated exploration, where exploration happens in a simulated environment; bounded exploration, which limits exploration in state spaces which are considered unsafe; and human oversight, although this is limited in scalability and not feasible in some real-time systems. The same holds true for any testing and evaluation of the system: until the system has been deemed to perform adequately and in a safe manner, all necessary precautions must be taken to ensure safety [179].
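As a minimal illustration of bounded exploration, the sketch below masks out actions that an assumed safety check deems unsafe before the epsilon-greedy choice is made; the is_action_safe() predicate, the fallback action and the epsilon value are assumptions for this sketch rather than elements of the cited methods.

    import random

    def safe_epsilon_greedy(q_values, state, is_action_safe, epsilon=0.1):
        """Epsilon-greedy action selection restricted to actions judged safe.

        q_values: list of Q-value estimates, one per discrete action.
        is_action_safe(state, action) is an assumed safety predicate, e.g. a
        time-headway or distance check supplied by a rule-based monitor.
        """
        safe_actions = [a for a in range(len(q_values)) if is_action_safe(state, a)]
        if not safe_actions:                      # fall back to a designated safe action
            return 0                              # e.g. "brake" (assumed to be action 0)
        if random.random() < epsilon:
            return random.choice(safe_actions)    # explore, but only among safe actions
        return max(safe_actions, key=lambda a: q_values[a])   # exploit
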
An approach for ensuring functional safety for deep learning based autonomous vehicles is suggested by Shalev-Shwartz et al. [178]. In the proposed system architecture, the policy function is decomposed into a learnable part and a non-learnable part. The learnable part is responsible for the comfort of driving and for making strategic decisions (e.g. which cars to overtake or give way to). This policy is learned from experience by maximising an expected reward from the reward function. On the other hand, the non-learnable policy is responsible for safety by minimising a cost function with hard constraints (e.g. the vehicle is not allowed within a specified distance of other vehicles' trajectories) to ensure functional safety. Alternatively, Xiong et al. [214] suggested a control structure which combines reinforcement learning based control with safety based control and path tracking. The aim is to combine a traditional control method with a reinforcement learning method to take advantage of the superior performance of deep learning systems whilst ensuring safety through traditional control theory. The path tracking element is included to ensure the vehicle stays on (or as close as is safe to) the centre of the lane. The reinforcement learning approach is based on the DDPG algorithm, while the safety based controller uses an Artificial Potential Field method [215], which models any obstacles with a repulsive force to steer the vehicle away from them. The final steering policy is then found by the weighted summation of the three models. The system was shown to keep a safe distance in a simulated environment where the vehicle had to drive along a curve with other vehicles nearby.
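The weighted summation used in [214] can be illustrated with the hedged sketch below; the fixed weights and the three component controllers are placeholders, since the exact weighting scheme of the original work is not reproduced here.

    # Illustrative fusion of three steering commands by weighted summation,
    # in the spirit of the combined controller in [214]; the weights and the
    # component controllers are assumptions for this sketch.
    FUSION_WEIGHTS = {"rl": 0.5, "safety": 0.3, "path_tracking": 0.2}

    def fused_steering(state, rl_policy, safety_controller, path_tracker,
                       weights=FUSION_WEIGHTS):
        """Combine learned and rule-based steering commands into one output."""
        commands = {
            "rl": rl_policy(state),                 # e.g. DDPG actor output
            "safety": safety_controller(state),     # e.g. artificial potential field
            "path_tracking": path_tracker(state),   # keeps the vehicle near the lane centre
        }
        return sum(weights[name] * cmd for name, cmd in commands.items())
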
Furthermore, malicious inputs to deep learning systems have to be considered. It has been shown that visual classification DNN systems are vulnerable to adversarial examples, which are perturbed images that cause the DNNs to misclassify them with high confidence [216]–[219], including misclassification of traffic signs [220]. DNNs have been shown to be vulnerable to printed adversarial examples in the real world [221] and even to 3D-printed physical adversarial examples [222], which suggests they are a threat to DNN applications in the real world. Moreover, the image modifications of the adversarial examples have been shown to be subtle enough that a human eye does not notice the modification, making prevention of such malicious attacks difficult [221]. These types of weaknesses in DNNs could be exploited and pose a security concern for any technology using DNNs. Although defences against these attacks have been proposed [223], state-of-the-art attacks can by-pass defences and detection mechanisms.
[12] Department for Transport, “Research on the Impacts of Connected [34] J. Janai, F. Güney, A. Behl, and A. Geiger, “Computer vision for
and Autonomous Vehicles (CAVs) on Traffic Flow: Summary Report,” autonomous vehicles: Problems, datasets and state-of-the-art,” arXiv
2017. [Online]. Available: https://www.gov.uk/government/uploads/ preprint arXiv:1704.05519, 2017.
system/uploads/attachment data/file/530091/impacts-of-connected- [35] R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten years of
and-autonomous-vehicles-on-traffic-flow-summary-report.pdf pedestrian detection, what have we learned?” in European Conference
[13] C. Thorpe, M. Herbert, T. Kanade, and S. Shafter, “Toward autonomous on Computer Vision. Springer, 2014, pp. 613–627.
driving: the cmu navlab. ii. architecture and systems,” IEEE expert, [36] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, “How
vol. 6, no. 4, pp. 44–52, 1991. far are we from solving pedestrian detection?” in Proceedings of the
[14] E. D. Dickmanns and A. Zapp, “Autonomous high speed road vehicle IEEE Conference on Computer Vision and Pattern Recognition, 2016,
guidance by computer vision1,” IFAC Proceedings Volumes, vol. 20, pp. 1259–1267.
no. 5, pp. 221–226, 1987. [37] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke,
[15] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, and M. J. Milford, “Visual place recognition: A survey,” IEEE Trans-
J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: actions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
The robot that won the darpa grand challenge,” Journal of field [38] K. R. Konda and R. Memisevic, “Learning visual odometry with a
Robotics, vol. 23, no. 9, pp. 661–692, 2006. convolutional network.” in VISAPP (1), 2015, pp. 486–490.
[16] M. Buehler, K. Iagnemma, and S. Singh, The DARPA urban challenge: [39] S. Kuutti, S. Fallah, K. Katsaros, M. Dianati, F. Mccullough, and
autonomous vehicles in city traffic. springer, 2009, vol. 56. A. Mouzakitis, “A survey of the state-of-the-art localization techniques
[17] T. Le-Anh and M. De Koster, “A review of design and control of and their potentials for autonomous vehicle applications,” IEEE Inter-
automated guided vehicle systems,” European Journal of Operational net of Things Journal, vol. 5, no. 2, pp. 829–846, 2018.
Research, vol. 171, no. 1, pp. 1–23, 2006. [40] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of
[18] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of deep visuomotor policies,” The Journal of Machine Learning Research,
motion planning and control techniques for self-driving urban vehicles,” vol. 17, no. 1, pp. 1334–1373, 2016.
IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55, [41] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye
2016. coordination for robotic grasping with large-scale data collection,” in
[19] M. Pasquier, C. Quek, and M. Toh, “Fuzzylot: a novel self-organising International Symposium on Experimental Robotics. Springer, 2016,
fuzzy-neural rule-based pilot system for automated vehicles,” Neural pp. 173–184.
networks, vol. 14, no. 8, pp. 1099–1112, 2001. [42] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K.
[20] M. Kuderer, S. Gulati, and W. Burgard, “Learning driving styles Hedrick, “Learning a deep neural net policy for end-to-end control of
for autonomous vehicles from demonstration,” Proceedings - IEEE autonomous vehicles,” in 2017 American Control Conference (ACC).
International Conference on Robotics and Automation, vol. 2015-June, IEEE, 2017, pp. 4914–4919.
no. June, pp. 2641–2646, 2015. [43] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
[21] D. Silver, J. A. Bagnell, and A. Stentz, “Learning Autonomous Driving Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski
Styles and Maneuvers from Expert Demonstration,” in Experimental et al., “Human-level control through deep reinforcement learning,”
Robotics. Springer, Heidelberg, 2013, pp. 371–386. Nature, vol. 518, no. 7540, p. 529, 2015.
[44] I. Arel, D. C. Rose, and T. P. Karnowski, “Deep machine learning-a
[22] D. Zhao, B. Wang, and D. Liu, “A supervised Actor-Critic approach
new frontier in artificial intelligence research [research frontier],” IEEE
for adaptive cruise control,” Soft Computing, vol. 17, no. 11, pp. 2089–
computational intelligence magazine, vol. 5, no. 4, pp. 13–18, 2010.
2099, 2013.
[45] J. Tani, M. Ito, and Y. Sugita, “Self-organization of distributedly
[23] C. Desjardins and B. Chaib-draa, “Cooperative Adaptive Cruise Con-
represented multiple behavior schemata in a mirror system: reviews
trol: A Reinforcement Learning Approach,” IEEE Transactions on
of robot experiments using rnnpb,” Neural Networks, vol. 17, no. 8-9,
Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248–1260, 2011.
pp. 1273–1289, 2004.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification [46] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
with deep convolutional neural networks,” in Advances in neural no. 7553, pp. 436–444, 2015.
information processing systems, 2012, pp. 1097–1105. [47] J. Schmidhuber, “Deep learning in neural networks: An overview,”
[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, Neural networks, vol. 61, pp. 85–117, 2015.
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural [48] I. Goodfellow, Y. Bengio, and A. Courville, “Deep Learning,” MIT
networks for acoustic modeling in speech recognition: The shared Press, 2016. [Online]. Available: http://www.deeplearningbook.org/
views of four research groups,” IEEE Signal Processing Magazine, [49] M. Nielsen, “Neural Networks and Deep Learning,”
vol. 29, no. 6, pp. 82–97, 2012. Determination Press, 2015. [Online]. Available: http://
[26] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning neuralnetworksanddeeplearning.com/index.html
with neural networks,” in Advances in neural information processing [50] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
systems, 2014, pp. 3104–3112. Cambridge, MA: MIT Press, 1998, vol. 9.
[27] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision- [51] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,
making for autonomous vehicles,” Annual Review of Control, Robotics, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing
and Autonomous Systems, no. 0, 2018. Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[28] T. T. Mac, C. Copot, D. T. Tran, and R. De Keyser, “Heuristic ap- [52] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint
proaches in robot path planning: A survey,” Robotics and Autonomous arXiv:1701.07274, 2017.
Systems, vol. 86, pp. 13–28, 2016. [53] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in
[29] S. M. Veres, L. Molnar, N. K. Lincoln, and C. P. Morice, “Autonomous Advances in neural information processing systems, 2008, pp. 161–168.
vehicle control systemsa review of decision making,” Proceedings of [54] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning
the Institution of Mechanical Engineers, Part I: Journal of Systems and and structured prediction to no-regret online learning,” in Proceedings
Control Engineering, vol. 225, no. 2, pp. 155–195, 2011. of the fourteenth international conference on artificial intelligence and
[30] L. Caltagirone, M. Bellone, L. Svensson, and M. Wahde, “Lidar-based statistics, 2011, pp. 627–635.
driving path generation using fully convolutional neural networks,” in [55] P. de Haan, D. Jayaraman, and S. Levine, “Causal confusion in
Intelligent Transportation Systems (ITSC), 2017 IEEE 20th Interna- imitation learning,” arXiv preprint arXiv:1905.11979, 2019.
tional Conference on. IEEE, 2017, pp. 1–6. [56] A. Torralba, A. A. Efros et al., “Unbiased look at dataset bias.” in
[31] S. Dixit, S. Fallah, U. Montanaro, M. Dianati, A. Stevens, F. Mc- CVPR, vol. 1, no. 2. Citeseer, 2011, p. 7.
cullough, and A. Mouzakitis, “Trajectory planning and tracking for [57] A. Gupta, A. Murali, D. P. Gandhi, and L. Pinto, “Robot learning
autonomous overtaking: State-of-the-art and future prospects,” Annual in homes: Improving generalization and reducing dataset bias,” in
Reviews in Control, 2018. Advances in Neural Information Processing Systems, 2018, pp. 9094–
[32] H. Zhu, K.-V. Yuen, L. Mihaylova, and H. Leung, “Overview of 9104.
environment perception for intelligent vehicles,” IEEE Transactions on [58] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu,
Intelligent Transportation Systems, vol. 18, no. 10, pp. 2584–2601, and N. de Freitas, “Sample efficient actor-critic with experience replay,”
2017. arXiv preprint arXiv:1611.01224, 2016.
[33] J. Van Brummelen, M. OBrien, D. Gruyer, and H. Najjaran, “Au- [59] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM
tonomous vehicle perception: The technology of today and tomorrow,” journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166,
Transportation research part C: emerging technologies, 2018. 2003.
[60] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, [85] G. Yu and I. K. Sethi, “Road-following with continuous learning,” in
no. 3-4, pp. 279–292, 1992. Intelligent Vehicles ’95 Symposium., Proceedings of the, Detroit, MI,
[61] G. J. Gordon, “Stable function approximation in dynamic program- 1995.
ming,” in Machine Learning Proceedings 1995. Elsevier, 1995, pp. [86] D. E. Moriarty, S. Handley, and P. Langley, “Learning distributed
261–268. strategies for traffic control,” Proc. of the fifth International Conference
[62] J. N. Tsitsiklis and B. Van Roy, “Feature-based methods for large scale of the Society for Adaptive Behavior, no. May 1998, pp. 437–446, 1998.
dynamic programming,” Machine Learning, vol. 22, no. 1-3, pp. 59–94, [87] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
1996. P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang,
[63] R. J. Williams, Reinforcement-learning connectionist systems. College J. Zhao, and K. Zieba, “End to End Learning for Self-Driving Cars,”
of Computer Science, Northeastern University, 1987. no. May, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[64] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy [88] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road
gradient methods for reinforcement learning with function approxima- obstacle avoidance through end-to-end learning,” in Advances in neural
tion,” in Advances in neural information processing systems, 2000, pp. information processing systems, 2006, pp. 739–746.
1057–1063. [89] Mechanical Simulation Corporation, “CarSim.” [Online]. Available:
[65] M. Riedmiller, J. Peters, and S. Schaal, “Evaluation of policy gradient https://www.carsim.com
methods and variants on the cart-pole benchmark,” in Approximate [90] L. Bottou, “Large-scale machine learning with stochastic gradient
Dynamic Programming and Reinforcement Learning, 2007. ADPRL descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp.
2007. IEEE International Symposium on. IEEE, 2007, pp. 254–261. 177–186.
[91] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
[66] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-
arXiv preprint arXiv:1412.6980, 2014.
miller, “Deterministic policy gradient algorithms,” 2014.
[92] W. Su, S. Boyd, and E. Candes, “A differential equation for model-
[67] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, ing nesterovs accelerated gradient method: Theory and insights,” in
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein- Advances in Neural Information Processing Systems, 2014, pp. 2510–
forcement learning,” in International conference on machine learning, 2518.
2016, pp. 1928–1937. [93] H. M. Eraqi, M. N. Moustafa, and J. Honer, “End-to-end deep learning
[68] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey for steering autonomous vehicles considering temporal dependencies,”
of actor-critic reinforcement learning: Standard and natural policy arXiv preprint arXiv:1710.03804, 2017.
gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part [94] R. Rothe, R. Timofte, and L. Van Gool, “Dex: Deep expectation
C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012. of apparent age from a single image,” in Proceedings of the IEEE
[69] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High- International Conference on Computer Vision Workshops, 2015, pp.
dimensional continuous control using generalized advantage estima- 10–15.
tion,” arXiv preprint arXiv:1506.02438, 2015. [95] F. Codevilla, A. M. López, V. Koltun, and A. Dosovitskiy, “On offline
[70] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: evaluation of vision-based driving models,” in Proceedings of the
The kitti dataset,” International Journal of Robotics Research (IJRR), European Conference on Computer Vision (ECCV), 2018, pp. 236–
2013. 251.
[71] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous [96] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learning
driving? the kitti vision benchmark suite,” in Computer Vision and based approach for automated lane change maneuvers,” in 2018 IEEE
Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012. Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384.
[72] “Waymo open dataset: An autonomous driving dataset,” 2019. [97] A. Vahidi and A. Eskandarian, “Research advances in intelligent
[Online]. Available: https://www.waymo.com/open collision avoidance and adaptive cruise control,” IEEE Transactions on
[73] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, Intelligent Transportation Systems, vol. 4, no. 3, pp. 143–153, 2003.
1000km: The Oxford RobotCar Dataset,” The International Journal [98] S. Moon, I. Moon, and K. Yi, “Design, tuning, and evaluation of a full-
of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. range adaptive cruise control system with collision avoidance,” Control
Available: http://dx.doi.org/10.1177/0278364916679498 Engineering Practice, vol. 17, no. 4, pp. 442–455, 2009.
[74] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, [99] Q. Sun, “Cooperative Adaptive Cruise Control Performance Analysis,”
and R. Yang, “The apolloscape dataset for autonomous driving,” arXiv Ph.D. dissertation, Ecole Centrale de Lille, 2016.
preprint arXiv:1803.06184, 2018. [100] X. Chen, Y. Zhai, C. Lu, J. Gong, and G. Wang, “A Learning Model
[75] Udacity Inc., “Udacity Self-driving Car Dataset,” 2018. [Online]. for Personalized Adaptive Cruise Control,” in Intelligent Vehicles
Available: https://github.com/udacity/self-driving-car Symposium (IV), 2017 IEEE, 2017, pp. 379–384.
[76] A. Ess, B. Leibe, K. Schindler, , and L. van Gool, “A mobile vision [101] D. Wang and J. Huang, “Neural network-based adaptive dynamic
system for robust multi-person tracking,” in Computer Vision and surface control for a class of uncertain nonlinear systems in strict-
Pattern Recognition (CVPR), 2008 IEEE Conference on. IEEE Press, feedback form,” IEEE Transactions on Neural Networks, vol. 16, no. 1,
June 2008. pp. 195–202, 2005.
[77] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: [102] M. M. Polycarpou, “Stable adaptive neural control scheme for nonlin-
A benchmark,” in Computer Vision and Pattern Recognition (CVPR), ear systems,” IEEE Transactions on Automatic Control, vol. 41, no. 3,
2009 IEEE Conference on. IEEE, 2009, pp. 304–311. pp. 447–451, 1996.
[103] R. Sanner and M. Mears, “Stable adaptive tracking of uncertainty
[78] H. Yin and C. Berger, “When to use what data set for your self-
systems using nonlinearly parameterized on-line approximators,” IEEE
driving car algorithm: An overview of publicly available driving
Transactions on Neural Networks, vol. 3, no. 6, pp. 837–863, 1992.
datasets,” in Intelligent Transportation Systems (ITSC), 2017 IEEE 20th
[104] D. Wang and J. Huang, “Adaptive neural network control for a class
International Conference on. IEEE, 2017, pp. 1–8.
of uncertain nonlinear systems in pure-feedback form,” Automatica,
[79] NVIDIA Corporation, “Autonomous car development platform vol. 38, no. 8, pp. 1365–1372, 2002.
from NVIDIA DRIVE PX2,” 2018. [Online]. Available: https: [105] B. Ren, S. S. Ge, C.-Y. Su, and T. H. Lee, “Adaptive neural control
//www.nvidia.com/en-us/self-driving-cars/drive-platform/ for a class of uncertain nonlinear systems in pure-feedback form
[80] MobilEye, “The Evolution of EyeQ,” 2018. [Online]. Available: with hysteresis input,” IEEE Transactions on Systems, Man, and
https://www.mobileye.com/our-technology/evolution-eyeq-chip/ Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 431–443, 2009.
[81] Intel Corporation, “Cyclone V - Overview,” 2018. [Online]. Avail- [106] T. Zhang, S. S. Ge, and C. C. Hang, “Adaptive neural network con-
able: https://www.altera.com/products/fpga/cyclone-series/cyclone-v/ trol for strict-feedback nonlinear systems using backstepping design,”
overview.html Automatica, vol. 36, no. 12, pp. 1835–1846, 2000.
[82] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Caad: Computer [107] X. Dai, C.-K. Li, and A. B. Rad, “An approach to tune fuzzy controllers
architecture for autonomous driving,” arXiv preprint arXiv:1702.01894, based on reinforcement learning for autonomous vehicle control,” IEEE
2017. Transactions on Intelligent Transportation Systems, vol. 6, no. 3, pp.
[83] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural 285–293, 2005.
network,” Advances in Neural Information Processing Systems 1, pp. [108] Z. Huang, X. Xu, H. He, J. Tan, and Z. Sun, “Parameterized Batch
305–313, 1989. Reinforcement Learning for Longitudinal Control of Autonomous Land
[84] D. Pomerleau, “Neural network vision for robot driving,” Intelligent Vehicles,” IEEE Transactions on Systems, Man, and Cybernetics:
Unmanned Ground Vehicles, pp. 1–22, 1997. Systems, pp. 1–12, 2017.
[109] X. Xu, D. Hu, and X. Lu, “Kernel-based least squares policy iteration [132] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin
for reinforcement learning,” IEEE Transactions on Neural Networks, planning,” in Proceedings of the 23rd international conference on
vol. 18, no. 4, pp. 973–992, 2007. Machine learning - ICML ’06, 2006, pp. 729–736.
[110] J. Wang, X. Xu, D. Liu, Z. Sun, and Q. Chen, “Self-learning cruise [133] M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner,
control using kernel-based least squares policy iteration,” IEEE Trans- “Large-scale cost function learning for path planning using deep inverse
actions on Control Systems Technology, vol. 22, no. 3, pp. 1078–1087, reinforcement learning,” International Journal of Robotics Research,
2014. vol. 36, no. 10, pp. 1073–1087, 2017.
[111] H. Chae, C. M. Kang, B. Kim, J. Kim, C. C. Chung, and J. W. Choi, [134] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum En-
“Autonomous braking system via deep reinforcement learning,” in tropy Inverse Reinforcement Learning.” AAAI Conference on Artificial
2017 IEEE 20th International Conference on Intelligent Transportation Intelligence, pp. 1433–1438, 2008.
Systems (ITSC). IEEE, 2017, pp. 1–6. [135] S. Levine and V. Koltun, “Continuous Inverse Optimal Control with
[112] Euro NCAP, “European New Car Assessment Programme: Test Pro- Locally Optimal Examples,” International Conference on Machine
tocol - AEB VRU systems,” 2015. Learning (ICML), pp. 41–48, 2012.
[113] D. Zhao, Z. Xia, and Q. Zhang, “Model-free optimal control based [136] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application
intelligent cruise control with hardware-in-the-loop demonstration [re- of reinforcement learning to aerobatic helicopter flight,” Education,
search frontier],” IEEE Computational Intelligence Magazine, vol. 12, vol. 19, p. 1, 2007.
no. 2, pp. 56–69, 2017. [137] S. Hecker, D. Dai, and L. Van Gool, “End-to-end learning of driving
[114] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement models with surround-view cameras and route planners,” in Proceed-
learning: A survey,” Journal of artificial intelligence research, vol. 4, ings of the European Conference on Computer Vision (ECCV), 2018,
pp. 237–285, 1996. pp. 435–453.
[115] D. Zhao, Z. Hu, Z. Xia, C. Alippi, Y. Zhu, and D. Wang, “Full- [138] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy,
range adaptive cruise control based on supervised adaptive dynamic “End-to-end driving via conditional imitation learning,” in 2018 IEEE
programming,” Neurocomputing, vol. 125, no. February, pp. 57–67, International Conference on Robotics and Automation (ICRA). IEEE,
2014. 2018, pp. 1–9.
[116] B. Wang, D. Zhao, C. Li, and Y. Dai, “Design and implementation [139] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
of an adaptive cruise control system based on supervised actor-critic “CARLA: An open urban driving simulator,” in Proceedings of the
learning,” 2015 5th International Conference on Information Science 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
and Technology (ICIST), pp. 243–248, 2015. [140] F. Codevilla, E. Santana, A. M. López, and A. Gaidon, “Exploring the
[117] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation limitations of behavior cloning for autonomous driving,” arXiv preprint
rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985. arXiv:1904.08980, 2019.
[118] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and [141] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining
R. Munos, “Unifying count-based exploration and intrinsic motivation,” neural networks and tree search for task and motion planning in
in Advances in Neural Information Processing Systems, 2016, pp. challenging environments,” in 2017 IEEE/RSJ International Conference
1471–1479. on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6059–6066.
[142] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to
[119] J. Schmidhuber, “A possibility for implementing curiosity and boredom
drive by imitating the best and synthesizing the worst,” arXiv preprint
in model-building neural controllers,” in Proc. of the international
arXiv:1812.03079, 2018.
conference on simulation of adaptive behavior: From animals to
[143] X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to real reinforcement
animats, 1991, pp. 222–227.
learning for autonomous driving,” arXiv preprint arXiv:1704.03952,
[120] W. Xia, H. Li, and B. Li, “A control strategy of autonomous vehicles
2017.
based on deep reinforcement learning,” in Computational Intelligence
[144] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driv-
and Design (ISCID), 2016 9th International Symposium on, vol. 2.
ing policy transfer via modularity and abstraction,” arXiv preprint
IEEE, 2016, pp. 198–201.
arXiv:1804.09364, 2018.
[121] M. Riedmiller, “Neural fitted q iteration–first experiences with a data
[145] M. Maurer, J. C. Gerdes, B. Lenz, and H. Winner, Autonomous Driving.
efficient neural reinforcement learning method,” in European Confer-
Berlin: Springer, Heidelberg, 2016.
ence on Machine Learning. Springer, 2005, pp. 317–328.
[146] European Commission, “Cooperative Intelligent Transportation
[122] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-end Systems - Research Theme Analysis Report,” 2016. [Online].
deep reinforcement learning for lane keeping assist,” arXiv preprint Available: http://www.transport-research.info/sites/default/files/TRIP
arXiv:1612.04340, 2016. C-ITS Report.pdf
[123] “The open racing car simulator.” [Online]. Available: http:// [147] S. A. Bagloee, M. Tavana, M. Asadi, and T. Oliver, “Autonomous
torcs.sourceforge.net/ vehicles: challenges, opportunities, and future implications for trans-
[124] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, portation policies,” Journal of Modern Transportation, vol. 24, no. 4,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce- pp. 284–303, 2016.
ment learning,” arXiv preprint arXiv:1509.02971, 2015. [148] HERE Technologies, “Consumer Acceptance of Autonomous
[125] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end Vehicles,” 2017. [Online]. Available: https://here.com/file/13726/
autonomous driving,” arXiv preprint arXiv:1605.06450, 2016. download?token=njs4ZwfW
[126] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and [149] L. Bosankic, “How consumers’ perception of autonomous
B. Boots, “Agile autonomous driving using end-to-end deep imitation cars will influence their adoption,” 2017. [Online]. Avail-
learning,” Proceedings of Robotics: Science and Systems. Pittsburgh, able: https://medium.com/@leo pold b/how-consumers-perception-
Pennsylvania, 2018. of-autonomous-cars-will-influence-their-adoption-ba99e3f64e9a
[127] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an [150] H. Abraham, B. Reimer, B. Seppelt, C. Fitzgerald, B. Mehler, and
open-source multi-robot simulator,” in 2004 IEEE/RSJ International J. F. Coughlin, “Consumer Interest in Automation: Preliminary Obser-
Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. vations Exploring a Year’s Change,” 2017. [Online]. Available: http:
04CH37566), vol. 3. IEEE, pp. 2149–2154. //agelab.mit.edu/sites/default/files/MIT-NEMPAWhitePaperFINAL.pdf
[128] D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell, “Deep object cen- [151] W. Knight, “An Ambitious Plan to Build a Self-Driving Borg,” 2016.
tric policies for autonomous driving,” arXiv preprint arXiv:1811.05432, [Online]. Available: https://www.technologyreview.com/s/602531/an-
2018. ambitious-plan-to-build-a-self-driving-borg/
[129] H. Porav and P. Newman, “Imminent collision mitigation with rein- [152] Kober, Jens J., Bagnell, Andrew, Peters, Jan, “Reinforcement Learning
forcement learning and vision,” in 2018 21st International Conference in Robotics: A Survey,” International Journal of Robotics Research,
on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 958– vol. 32, no. 11, pp. 1238–1274, 2013.
964. [153] R. Bellman, “Dynamic Programming,” Science, vol. 153, no. 3731, pp.
[130] S. Zhifei and E. M. Joo, “A review of inverse reinforcement learn- 34–37, 1966.
ing theory and recent advances,” World Congress on Computational [154] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-
Intelligence, pp. 1–8, 2012. learning with model-based acceleration,” in International Conference
[131] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein- on Machine Learning, 2016, pp. 2829–2838.
forcement learning,” Twenty-first international conference on Machine [155] J. T. Barron, D. S. Golland, and N. J. Hay, “Parallelizing reinforcement
learning - ICML ’04, p. 1, 2004. learning,” UC Berkeley, 2009.
[156] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution [181] J. Z. Kolter, P. Abbeel, and A. Y. Ng, “Hierarchical Apprenticeship
strategies as a scalable alternative to reinforcement learning,” arXiv Learning, with Application to Quadruped Locomotion,” Science, vol. 1,
preprint arXiv:1703.03864, 2017. pp. 1–8, 2008.
[157] S. Yang, W. Wang, C. Liu, W. Deng, and J. K. Hedrick, “Feature [182] D. Silver, J. A. Bagnell, and A. Stentz, “Learning from demonstra-
analysis and selection for training an end-to-end autonomous vehicle tion for autonomous navigation in complex unstructured terrain,” in
controller using deep learning approach,” in 2017 IEEE Intelligent International Journal of Robotics Research, vol. 29, no. 12, 2010, pp.
Vehicles Symposium (IV). IEEE, 2017, pp. 1033–1038. 1565–1592.
[158] Y. LeCun, “Generalization and network design strategies,” Connection- [183] N. Ratliff, J. A. Bagnell, and S. S. Srinivasa, “Imitation learning for
ism in perspective, pp. 143–155, 1989. locomotion and manipulation,” in Proceedings of the 2007 7th IEEE-
[159] N. Morgan and H. Bourlard, “Generalization and Parameter Estimation RAS International Conference on Humanoid Robots, HUMANOIDS
in Feedforward Nets: Some Experiments,” Advances in neural infor- 2007, 2008, pp. 392–397.
mation processing systems, pp. 630–637, 1989. [184] R. J. Schalkoff, Artificial Neural Networks. New York: McGraw-Hill,
[160] Y. Bengio, “Practical recommendations for gradient-based training of 1997.
deep architectures,” in Neural networks: Tricks of the trade. Springer, [185] B. D. Ripley, Pattern Recognition in Neural Networks. Cambridge:
2012, pp. 437–478. Cambridge University Press, 1996.
[161] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” [186] C. M. Bishop, “Neural networks for pattern recognition,” Journal of
in Neural networks: Tricks of the trade. Springer, 1998, pp. 9–50. the American Statistical Association, vol. 92, p. 482, 1995.
[162] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter [187] A. Y. Ng, “Feature selection, l 1 vs. l 2 regularization, and rotational
Optimization,” Journal of Machine Learning Research, vol. 13, pp. invariance,” in Proceedings of the twenty-first international conference
281–305, 2012. on Machine learning. ACM, 2004, p. 78.
[163] C. Raffel, “Neural Network Hyperparameters,” 2015. [Online]. [188] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha,
Available: http://colinraffel.com/wiki/neural network hyperparameters “Deep neural networks are robust to weight binarization and other non-
[164] Y. Sevchuk, “Hyperparameter optimization for Neural Net- linear distortions,” arXiv preprint arXiv:1606.01981, 2016.
works,” 2016. [Online]. Available: http://neupy.com/2016/12/17/ [189] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training
hyperparameter optimization for neural networks.html deep neural networks with binary weights during propagations,” in
[165] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Op- Advances in neural information processing systems, 2015, pp. 3123–
timization of Machine Learning Algorithms,” Advances in Neural 3131.
Information Processing Systems, vol. 25, pp. 2960–2968, 2012. [190] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and
[166] C. E. Rasmussen, “Gaussian processes in machine learning,” in Ad- R. R. Salakhutdinov, “Improving neural networks by preventing co-
vanced lectures on machine learning. Springer, 2004, pp. 63–71. adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[191] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
[167] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for
dinov, “Dropout: A Simple Way to Prevent Neural Networks from
Hyper-Parameter Optimization,” in Advances in Neural Information
Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–
Processing Systems (NIPS), 2011, pp. 2546–2554.
1958, 2014.
[168] M. Kumar, G. E. Dahl, V. Vasudevan, and M. Norouzi, “Parallel
[192] H. Raza and P. Ioannou, “Vehicle following control design for auto-
architecture and hyperparameter search via successive halving and
mated highway systems,” IEEE Control Systems Magazine, vol. 16,
classification,” arXiv preprint arXiv:1805.10255, 2018.
no. 6, pp. 43–60, 1996.
[169] T. B. Hashimoto, S. Yadlowsky, and J. C. Duchi, “Derivative free opti-
[193] R. Rajamani, H. S. Tan, B. K. Law, and W. B. Zhang, “Demonstration
mization via repeated classification,” arXiv preprint arXiv:1804.03761,
of integrated longitudinal and lateral control for the operation of
2018.
automated vehicles in platoons,” IEEE Transactions on Control Systems
[170] H. Cai, C. Gan, and S. Han, “Once for all: Train one network and spe- Technology, vol. 8, no. 4, pp. 695–708, 2000.
cialize it for efficient deployment,” arXiv preprint arXiv:1908.09791, [194] C. Thorpe, T. Jochem, and D. Pomerleau, “The 1997 automated high-
2019. way free agent demonstration,” in Intelligent Transportation System,
[171] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, 1997. ITSC’97., IEEE Conference on. IEEE, 1997, pp. 496–501.
and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for [195] S. Kato, S. Tsugawa, K. Tokuda, T. Matsui, and H. Fujii, “Vehicle
mobile,” in Proceedings of the IEEE Conference on Computer Vision Control Algorithms for Cooperative Driving with Automated Vehicles
and Pattern Recognition, 2019, pp. 2820–2828. and Intervehicle Communications,” IEEE Transactions on Intelligent
[172] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Transportation Systems, vol. 3, no. 3, pp. 155–160, 2002.
Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design [196] L. Ng, C. M. Clark, and J. P. Huissoon, “Reinforcement learning
via differentiable neural architecture search,” in Proceedings of the of adaptive longitudinal vehicle control for dynamic collaborative
IEEE Conference on Computer Vision and Pattern Recognition, 2019, driving,” in IEEE Intelligent Vehicles Symposium, Proceedings, 2008,
pp. 10 734–10 742. pp. 907–912.
[173] F. Scheidegger, L. Benini, C. Bekas, and C. Malossi, “Constrained deep [197] C. G. Atkeson, “Using Local Trajectory Optimizers To Speed Up
neural network architecture search for iot devices accounting hardware Global Optimization In Dynamic Programming,” Advances in Neural
calibration,” arXiv preprint arXiv:1909.10818, 2019. Information Processing Systems (NIPS),, pp. 663–670, 1994.
[174] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A [198] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot,
survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo,
1–21, 2019. and A. Gruslys, “Learning from demonstrations for real world rein-
[175] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward forcement learning,” arXiv preprint arXiv:1704.03732, 2017.
transformations : Theory and application to reward shaping,” Sixteenth [199] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. To-
International Conference on Machine Learning, vol. 3, pp. 278–287, bin, P. Abbeel, and W. Zaremba, “Transfer from simulation to real
1999. world through learning deep inverse dynamics model,” arXiv preprint
[176] A. D. Laud, “Theory and Application of Reward Shaping in Reinforce- arXiv:1610.03518, 2016.
ment Learning,” Ph.D. dissertation, University of Illinois, 2004. [200] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and
[177] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and R. Hadsell, “Sim-to-real robot learning from pixels with progressive
J. Tsang, “Hybrid reward architecture for reinforcement learning,” in nets,” arXiv preprint arXiv:1610.04286, 2016.
Advances in Neural Information Processing Systems, 2017, pp. 5392– [201] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine,
5402. K. Saenko, and T. Darrell, “Towards adapting deep visuomo-
[178] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi- tor representations from simulated to real environments,” CoRR,
agent, reinforcement learning for autonomous driving,” arXiv preprint abs/1511.07111, 2015.
arXiv:1610.03295, 2016. [202] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
[179] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, “Domain randomization for transferring deep neural networks from
and D. Mané, “Concrete problems in ai safety,” arXiv preprint simulation to the real world,” in Intelligent Robots and Systems (IROS),
arXiv:1606.06565, 2016. 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
[180] S. Russell, “Learning agents for uncertain environments (extended ab- [203] K. R. Varshney and H. Alemzadeh, “On the safety of machine learning:
stract),” Proceedings of the 11th Annual Conference on Computational Cyber-physical systems, decision sciences, and data products,” Big
Learning Theory (COLT), pp. 101–103, 1998. data, vol. 5, no. 3, pp. 246–255, 2017.
[204] D. Castelvecchi, "Can we open the black box of AI?" Nature News, vol. 538, no. 7623, p. 20, 2016.
[205] X. Zhang, M. Clark, K. Rattan, and J. Muse, "Controller verification in adaptive learning systems towards trusted autonomy," in Proceedings of the ACM/IEEE Sixth International Conference on Cyber-Physical Systems. ACM, 2015, pp. 31–40.
[206] M. Clark, X. Koutsoukos, J. Porter, R. Kumar, G. Pappas, O. Sokolsky, I. Lee, and L. Pike, "A study on run time assurance for complex cyber physical systems," Air Force Research Laboratory, Wright-Patterson AFB, OH, Aerospace Systems Directorate, Tech. Rep., 2013.
[207] S. Jacklin, J. Schumann, P. Gupta, M. Richard, K. Guenther, and F. Soares, "Development of advanced verification and validation procedures and tools for the certification of learning systems in aerospace applications," in Infotech@Aerospace, 2005, p. 6912.
[208] C. Wilkinson, J. Lynch, and R. Bharadwaj, Final Report, Regulatory Considerations for Adaptive Systems. National Aeronautics and Space Administration, Langley Research Center, 2013.
[209] P. Van Wesel and A. E. Goodloe, "Challenges in the verification of reinforcement learning algorithms," NASA, Tech. Rep., 2017.
[210] J. G. Schneider, "Exploiting model uncertainty estimates for safe dynamic control learning," in Advances in Neural Information Processing Systems, 1997, pp. 1047–1053.
[211] J. A. Bagnell, "Learning decisions: Robustness, uncertainty, and approximation," Robotics Institute, p. 78, 2004.
[212] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[213] T. M. Moldovan and P. Abbeel, "Safe exploration in Markov decision processes," arXiv preprint arXiv:1205.4810, 2012.
[214] X. Xiong, J. Wang, F. Zhang, and K. Li, "Combining deep reinforcement learning and safety based control for autonomous driving," arXiv preprint arXiv:1612.00147, 2016.
[215] S. Glaser, B. Vanholme, S. Mammar, D. Gruyer, and L. Nouveliere, "Maneuver-based trajectory planning for highly autonomous vehicles on real road with traffic and driver interaction," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 3, pp. 589–606, 2010.
[216] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[217] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 427–436.
[218] S. M. Moosavi Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[219] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[220] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, "Safety verification of deep neural networks," in Lecture Notes in Computer Science, vol. 10426, 2017, pp. 3–29.
[221] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[222] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, "Synthesizing robust adversarial examples," arXiv preprint arXiv:1707.07397, 2017.
[223] X. Yuan, P. He, Q. Zhu, and X. Li, "Adversarial examples: Attacks and defenses for deep learning," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[224] R. Salay, R. Queiroz, and K. Czarnecki, "An analysis of ISO 26262: Using machine learning safely in automotive software," arXiv preprint arXiv:1709.02435, 2017.
[225] International Organization for Standardization, "ISO 26262: Road vehicles - Functional safety," International Standard ISO/FDIS, 2011.
[226] F. Falcini, G. Lami, and A. M. Costanza, "Deep learning in automotive software," IEEE Software, vol. 34, no. 3, pp. 56–63, 2017.
[227] S. Dixit, U. Montanaro, S. Fallah, M. Dianati, D. Oxtoby, T. Mizutani, and A. Mouzakitis, "Trajectory planning for autonomous high-speed overtaking using MPC with terminal set constraints," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 1061–1068.
[228] S. Dixit, U. Montanaro, M. Dianati, D. Oxtoby, T. Mizutani, A. Mouzakitis, and S. Fallah, "Trajectory planning for autonomous high-speed overtaking in structured environments using robust MPC," IEEE Transactions on Intelligent Transportation Systems, 2019.
[229] K. Amezquita-Semprun, Y. C. Pradeep, P. C. Chen, W. Chen, and Z. Zhao, "Experimental evaluation of the stimuli-induced equilibrium point concept for automatic ramp merging systems," IEEE Transactions on Intelligent Transportation Systems, 2019.
[230] V. Milanés, J. Godoy, J. Villagrá, and J. Pérez, "Automated on-ramp merging system for congested traffic situations," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 500–508, 2010.

Sampo Kuutti received the MEng degree in mechanical engineering in 2017 from the University of Surrey, Guildford, U.K., where he is currently pursuing the PhD degree in automotive engineering with the Connected Autonomous Vehicles Lab within the Centre for Automotive Engineering. His research interests include deep learning applied to autonomous vehicles, functional safety validation, and safety and interpretability in machine learning systems.

Richard Bowden is Professor of computer vision and machine learning at the University of Surrey, where he leads the Cognitive Vision Group within the Centre for Vision, Speech and Signal Processing. His research centres on the use of computer vision to locate, track, and understand humans. He is an associate editor for the journals Image and Vision Computing and IEEE TPAMI. In 2013 he was awarded a Royal Society Leverhulme Trust Senior Research Fellowship, and he is a fellow of the Higher Education Academy, a senior member of the IEEE, and a Fellow of the International Association of Pattern Recognition (IAPR).

Yaochu Jin is a Professor in Computational Intelligence, Department of Computer Science, University of Surrey, Guildford, U.K. His main research interests include data-driven surrogate-assisted evolutionary optimization, evolutionary learning, interpretable and secure machine learning, and evolutionary developmental systems. Dr Jin is the Editor-in-Chief of the IEEE Transactions on Cognitive and Developmental Systems and Co-Editor-in-Chief of Complex & Intelligent Systems. He is an IEEE Distinguished Lecturer and an IEEE Fellow.

Phil Barber was formerly Principal Technical Specialist in Capability Research at Jaguar Land Rover. For over 30 years in the automotive industry he has witnessed the introduction of computer-controlled by-wire technology and been part of the debate over the safety issues involved in the implementation of real-time vehicle control.