Evaluation of Deep Reinforcement Learning Algorithms for Autonomous Driving

Abstract— Once considered futuristic, machine learning is already integrated into our everyday lives and will shape many areas of daily life in the future. This success is mainly due to the progress in machine learning and the increase in computing power. While machine learning is used to solve partial problems in autonomous driving, the reliance on high-resolution maps severely limits the use of autonomous vehicles in unknown areas. At the same time, the structuring of the overall problem into modular subsystems for perception, self-localization, planning, and control limits the performance of such systems. A particularly promising alternative is end-to-end learning, which optimizes the system as a whole. In this work, we investigate the application of an end-to-end learning method for autonomous driving that employs reinforcement learning. For this purpose, a system is developed which allows the examination of different reinforcement learning approaches in a simulated environment. The system receives simulated images of the front camera as input and provides the control values for steering angle, accelerator, and brake pedal position as direct output. The desired behavior is learned automatically through interaction with the environment. The reward function is currently optimized for following a lane at the highest possible speed. Using specially modeled environments with different levels of detail, multiple deep reinforcement learning approaches are compared. Among other aspects, we examine the extent to which trained models can be transferred to unknown environments. Our investigations show that Soft Actor-Critic is the best choice among the tested algorithms concerning learning speed and the ability to generalize to unseen environments.

I. INTRODUCTION

Most recent approaches to autonomous driving follow a rather conventional approach [1]. Typically, the problem is split up in one form or another into sub-problems which are then tackled independently. They can be defined as self-localization, object detection and classification, prediction of dynamic objects, and eventually the planning of the movement of the ego vehicle. In this approach, an exact localization in the environment is necessary for the system to work. One popular way to handle this is via very detailed maps that are used in conjunction with the sensor suite of the vehicle to localize within the map. The creation of these maps is complex and rather cumbersome [2]. In addition, the modular approach suffers from error propagation [3]. While this may work for areas that are well defined through the respective maps, it quickly becomes apparent that, in order to be able to drive in all areas, an extensive amount of data for the map creation must be collected and analyzed.

Autonomous driving is the key to more efficient mobility across the globe and, at the same time, promises a tremendous reduction of severe road incidents. Despite advances in machine learning, especially deep learning and deep reinforcement learning, its application to autonomous driving is still in its infancy. End-to-end learning of autonomous driving seems to be an exciting and promising alternative to the conventional approach and thus requires further investigation. Two fundamental categories of end-to-end approaches are direct supervised deep learning and deep reinforcement learning [1]. However, since supervised learning requires massive amounts of labeled data for training and validation [4], we focus on deep reinforcement learning.

Autonomous driving itself is ever-present, but the integration of deep reinforcement learning techniques into the areas mentioned above has not yet been achieved, even though deep reinforcement learning is a highly active field of research. Similar to the application of deep learning to computer vision tasks, the system must be able to generalize from seen data to new, unseen data. Nonetheless, deep reinforcement learning algorithms have shown promising results in other robotic domains [5]: they enable learning the desired behavior end-to-end, in a fully automated way, directly from a given observation.

The driving task can be divided into two main tasks, namely perceiving the environment and selecting an appropriate action. This sensorimotoric process of sensing and acting is learned through experience. Reinforcement learning is the theory of an agent that learns optimal behavior through interaction with its environment and hence matches the problem description quite well. With deep reinforcement learning, it is possible to utilize the power of deep learning techniques in conjunction with reinforcement learning to learn optimal behavior from high-dimensional inputs, such as raw pixels, to action outputs. Thus, this paper is motivated by the advances in the field of deep reinforcement learning and by its particular applicability to autonomous driving. A long-standing issue has been the trial-and-error nature of reinforcement learning algorithms, which has made them an unpopular choice for real-world applications, both due to the inherent safety risks during
the training process and its cost-ineffectiveness.

We seek to design a framework to support experiments with deep reinforcement learning algorithms for autonomous driving, since reinforcement learning is inherently a trial-and-error learning process.

The main contributions of this paper are:
• Development of a deep reinforcement learning framework that is suitable for autonomous driving,
• Comparison of different deep reinforcement learning algorithms concerning convergence time and overall performance,
• Investigation of generalization in the experiments, considering the ability of deep reinforcement learning agents to generalize between different environments.

Section II gives a short introduction to basic reinforcement learning, while Section III describes work related to this paper. The concept for the framework is presented in Section IV, and experiments are conducted in Section V.

II. BACKGROUND

The reinforcement learning setup consists of an agent that interacts with an environment in discrete time steps. At each time step t, the agent receives an observation ot of the current state st of the environment. Based on the observation, the agent takes an action at and receives a scalar reward rt [6]. The agent's behavior is defined by a policy π that maps states S to a probability distribution over actions A:

π : S → P(A)    (1)

The goal of the agent is to maximize the total reward. The total reward is defined as the return Gt, which can be undiscounted or discounted [6]. Here we use the discounted return:

Gt = Σ_{k=0}^{T} γ^k r_{t+k+1}    (2)

To preserve learning guarantees, potential-based reward shaping is used [7]. To avoid positive feedback loops, in each time step the previously awarded value is subtracted from the current one. In this way, situations in which the agent could collect unbounded reward without pursuing the actual goal are avoided. This topic is revisited in Section IV, where the reward design is discussed.

To assess the value of a certain state to the agent, the value function may be used. It is defined as the expected return when starting in state st and following the policy π [6]:

Vπ(s) = Eπ[Gt | St = s]    (3)

Analogously, the state-action value Q(s, a) is defined as the expected return when starting in state st, taking action at, and afterwards following the policy π [6]:

Qπ(s, a) = Eπ[Gt | St = s, At = a]    (4)

When the value function is known, it is possible to define an ordering over different policies. If the optimal value function is known, the optimal policy can be found by following a greedy strategy [8].

A. Deep Q-networks (DQN)

Deep Q-networks [9] try to learn the optimal state-action value function q∗. They do so by using a neural network that attempts to learn the relationship between state-action pairs and their values. High-dimensional input requires the extraction of features from the input data; a convolutional neural network is used to find meaningful representations in the high-dimensional data automatically. An update rule similar to Q-learning [10] is used to calculate the targets for training the neural network:

Q′(s, a) = Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]    (5)

Given the optimal Q-function, the best action can be found by taking the argmax over the action values:

a∗ = argmax_a Q(s, a)    (6)
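As an illustration, the update in Eq. (5) and the action selection in Eq. (6) can be sketched as follows for a small discrete problem (a minimal NumPy sketch; the table q_table and the step size alpha are illustrative assumptions and not part of the system described in this paper):

import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Eq. (5): move Q(s, a) towards the bootstrapped target r + gamma * max_a' Q(s', a')
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

def greedy_action(q_table, s):
    # Eq. (6): a* = argmax_a Q(s, a)
    return int(np.argmax(q_table[s]))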
B. Soft Actor-Critic

While DQN works with discrete action spaces and cannot easily be extended to continuous action spaces, SAC features a stochastic policy that supports continuous action spaces, which makes it particularly useful for robotic use cases [11]. Instead of using the optimal Q-function to determine the optimal action a∗, the stochastic policy is learned directly by a neural network. The goal of conventional reinforcement learning is the maximization of the expected total reward. Agents trained with this objective in mind, however, tend to focus on optimizing a single path. While this might be desirable in a deterministic world, it is prone to external disturbances.

Soft Actor-Critic resides in the maximum entropy reinforcement learning framework, which differs from the conventional reinforcement learning setup in that it extends the original objective with an additional term for entropy regularization. In this way, policies are favored that still perform the given task while at the same time remaining as random as possible. Since this randomness is included at training time, the trained policies tend to be more robust against external disturbances [11].
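For reference, the entropy-regularized objective of [11] can be written in the notation of this section as

J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(· | s_t)) ],

where H denotes the entropy of the policy and the temperature α weighs the entropy term against the reward; this formulation and its symbols are taken from [11] rather than from the present paper.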
III. RELATED WORK

A. Autonomous Driving

In general, a system capable of autonomous driving requires an understanding of both the static environment and the dynamic objects [12], [13]. Examples of the static environment are the course of the road, traffic signs, and the position of light signals. Dynamic objects, on the other hand, are other road users, who must be detected in real time using some form of object recognition and classification. Hence, recent approaches use a highly detailed map to aid the localization of the vehicle in the static environment and computer vision to perceive the dynamic objects. Most approaches are perception-control based, and the development of end-to-end learning-based approaches is still lagging behind, one of the main reasons being computational and algorithmic limitations. With the advances in deep learning, research and development are starting to catch up, however.

Besides the conventional mapping approach, imitation learning offers an alternative. Instead of solving several subproblems, imitation learning tries to mimic real driving behavior by mapping sensor data directly to the control output, i.e., steering wheel angle and acceleration. Bojarski et al. [14] train a convolutional neural network using image data from a front camera, together with the respective actuator values recorded from real driving scenarios, to learn the mapping from image to steering wheel angle.

Vitelli et al. [15] discretize the action space to apply deep Q-learning to autonomous driving. The state of the vehicle is based on an image of the environment combined with direct signals such as speed, distance to the center of the road, and an estimate of the angle between the longitudinal axis of the vehicle and the longitudinal axis of the road.

Wayve follows a combination of imitation learning and reinforcement learning: a policy is learned from expert demonstrations and then improved through reinforcement learning [16]. According to Kendall et al. [16], this is the first successful attempt to use deep reinforcement learning for autonomous driving in reality. The authors use a system that learns a lane-following task based on image data from a mono camera, a convolutional neural network, and a deep reinforcement learning algorithm. The agent's reward is proportional to the time traveled without human intervention.

B. Reinforcement Learning

With the advances in deep learning, the field of computer vision has undergone significant breakthroughs. By combining deep learning techniques with reinforcement learning, much progress has been made in recent years [17]. Mnih et al. [9] introduce deep reinforcement learning as a combination of deep learning and reinforcement learning: they combine a convolutional neural network with the reinforcement learning algorithm Q-learning into Deep Q-networks (DQN). The neural network is used to extract relevant features directly from high-dimensional input data, which are then used to find the optimal action. The image is directly mapped to one out of several discrete actions, and every state-action pair is mapped to a so-called Q-value that reflects the value of that pair. While DQN works well with discrete action spaces, it is non-trivial to extend it to continuous action spaces. Lillicrap et al. [18] propose Deep Deterministic Policy Gradient (DDPG), an algorithm that is particularly well suited for continuous action spaces. In addition to video games, challenging control tasks were included in their evaluation: they use TORCS, a racing simulator, to train an agent to drive based on sensor data as well as raw pixel data. In contrast to DDPG, Soft Actor-Critic (SAC) [19] uses a stochastic policy. SAC includes a term for entropy regularization, which encourages the agent to learn a policy that is robust against external perturbations while still being successful. Combining SAC with recent advances in deep learning techniques that allow for hierarchical learning of representations [20], Haarnoja et al. [11] teach a four-legged robot to walk solely based on sensory input.

IV. CONCEPT & DESIGN

The main requirements for the end-to-end learning system are the following:
• Ability to use the algorithm in different environments (in the sense of track scenarios).
• Efficient handling of the high-dimensional input data coming from the vehicle's front camera.
• The reward function should reflect the desired driving behavior.
• The system should allow easy switching between different environments and algorithms to compare the resulting behavior.

The system is supplied with images from the vehicle's front camera to simulate real-world operation as accurately as possible. No additional information from the simulation engine is used that would not be available in a real application. Hence, the system has to provide a way to process the high-dimensional input data. Based on the image or image sequence, a combination of steering wheel angle, gas, and brake pedal position should be output. This is very similar to human behavior and treats driving as a holistic task. In the field of artificial intelligence, reinforcement learning is a framework that learns from interaction with the environment and, given the latest research results in the field, offers an exciting alternative to the modular approach. The desired behavior is specified using the reward function; therefore, the design of the latter is of particular importance. Autonomous driving is already a non-trivial task in a single environment, and for successful application in different regions, variation in the environment must also be manageable. Different forms of vegetation, but also differences in road layout and road conditions, are just a few examples of how environments differ; they thus require a system that can handle a wide range of situations.

In this paper, a limited problem is considered: the automated following of the course of the road. To capture this goal, a suitable reward function can be designed. The question arises whether, despite features optimized for this task, a generalization to different environments is possible. On the basis of the designed system, studies must therefore be carried out on the influence of the environment on the generalizability of the system. In machine learning, generalization describes the ability of a model to adapt to new, unseen data. While in supervised learning dropout, batch normalization, and other regularization techniques are used to increase generalization, these procedures have not yet found wide application in reinforcement learning. Thus, reinforcement learning agents are typically prone to
Fig. 1: The framework used for the investigation is primarily based on the agent-environment interaction model. The simulation engine AirSim is used to simulate the vehicle's behavior. (Data flow shown in the diagram: the agent sends an action through the AirSimWrapper as CarControls to AirSim; the returned observation is preprocessed, the reward is calculated, the transition (st, at, rt, st+1, dt+1) is stored in the replay buffer, and batches of experience are used to update the agent.)
overfit to the training environment [21]. However, the ability to generalize is desirable for several reasons. First, in the case of autonomous driving it is simply impossible to cover all possible situations in training. Second, far less data or experience is needed in the training process if already acquired knowledge can be transferred and applied to new situations. Naturally, the better the generalization, the more general the learned features should be.

Cobbe et al. [21] use a custom game, CoinRun, to test their agents with respect to their generalization ability. For the specific application of autonomous driving, the generalization of different reinforcement learning algorithms is compared here using different simulation environments.

For this purpose, four different track environments are created in AirSim; they are shown in figure 2. The environments are modeled with specific connections between the individual environments in mind. On the one hand, they serve to measure the performance of different reinforcement learning agents; on the other hand, they can be used to see how well agents generalize between different environments. To build the environments, the simulation engine AirSim [22] is used in conjunction with the Unreal Engine to provide photorealistic environments. The tracks are modeled so that they get progressively harder to finish successfully. This is achieved by varying the curvature of the road: the easier tracks have fewer sharp bends, while the more difficult tracks contain sharp curves. While the first track has an entirely different course than the following ones, tracks two to four share the same course of the road. Although they share the course, they differ in the scenery on and off the road, which is a convenient feature for verifying generalization. The idea is to train agents separately, in one environment each. In the next step, the agents are placed in the other environments, which they have not seen during training, and perform evaluation episodes without any further fine-tuning. Apart from track no. 1, the remaining tracks feature the same course, meaning that if only relevant features, i.e., lane markings or road segmentation, are used, an agent should be able to find the same features on the other tracks.

The system design, which is shown in figure 1, is kept as modular as possible to simplify reusability. For instance, the interfaces between the AirSimWrapper and the agent component conform to the OpenAI Gym interfaces, so that published algorithms can be evaluated with little effort. The overall system is divided into two subsystems, the simulation environment and the system for deep reinforcement learning. To ensure that the two subsystems are as independent of each other as possible, an interface definition between the simulation environment and the end-to-end system is required. Interfaces are needed for receiving images from the front camera and the current state of the vehicle, but also for sending control signals to the environment.
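A minimal sketch of such a wrapper is shown below. It only illustrates the Gym-style interface described above; the class name AirSimWrapper is taken from figure 1, while the helper methods (_get_frame, _compute_reward), the action layout, and the use of the AirSim Python client in this exact form are illustrative assumptions rather than the original implementation.

import numpy as np
import gym
from gym import spaces
import airsim  # AirSim Python client, assumed to be available

class AirSimWrapper(gym.Env):
    """Exposes the AirSim car simulation through the OpenAI Gym interface."""

    def __init__(self):
        self.client = airsim.CarClient()
        self.client.confirmConnection()
        self.client.enableApiControl(True)
        # Observation: a stack of four preprocessed 84x84 grayscale frames (see Section V).
        self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
        # Action: steering, throttle and brake (continuous for SAC; discretized for DQN).
        self.action_space = spaces.Box(low=np.array([-1.0, 0.0, 0.0], dtype=np.float32),
                                       high=np.array([1.0, 1.0, 1.0], dtype=np.float32))

    def step(self, action):
        controls = airsim.CarControls(steering=float(action[0]),
                                      throttle=float(action[1]),
                                      brake=float(action[2]))
        self.client.setCarControls(controls)      # send control signals to the simulation
        obs = self._get_frame()                   # receive the front-camera observation
        reward, done = self._compute_reward()     # cf. the reward design discussed later in this section
        return obs, reward, done, {}

    def reset(self):
        self.client.reset()
        return self._get_frame()

    def _get_frame(self):
        # Placeholder: in the real system the front-camera image is requested from AirSim
        # and then preprocessed (cropped, resized, converted to grayscale, stacked).
        return np.zeros(self.observation_space.shape, dtype=np.uint8)

    def _compute_reward(self):
        # Placeholder for the reward and termination signal described in this section.
        return 0.0, False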
The core of the system consists of different deep reinforcement learning algorithms. Here we consider only off-policy algorithms, as they tend to have better sample efficiency. Experience replay is used to reuse collected experience: past experiences are stored in a replay buffer, which contains
tuples of observation, action, reward, and next observation.
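Such a buffer can be sketched as follows (an illustrative minimal version; the capacity of 100k transitions matches the value reported in Section V, while the uniform sampling and the class layout are assumptions):

import random
from collections import deque

class ReplayBuffer:
    """Stores (observation, action, reward, next_observation, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are dropped first

    def store(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of past transitions for an off-policy update.
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.buffer)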
In this work, Deep Q-networks and Soft Actor-Critic are chosen as the deep reinforcement learning algorithms.

Fig. 2: Overview of the different tracks.

To process the high-dimensional input data, a convolutional neural network (CNN) is used. Convolutional neural networks are particularly suitable for processing image data; therefore, as in [9], a CNN is chosen to extract relevant features automatically. In each time step, a new frame is received and preprocessed before being fed to the convolutional neural network. First, the received frame is cropped to an area of interest to constrain the agent to relevant regions. This is followed by a reduction in size and, finally, a conversion to grayscale. The structure of the CNN itself is as follows: the first layer uses 16 filter kernels of size 3 × 3, the second layer 32 filter kernels of size 3 × 3, and the third layer 64 filter kernels of size 3 × 3. An exhaustive search over convolutional neural network architectures is not part of the present work.
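A possible PyTorch sketch of this feature extractor is given below. The filter counts and kernel sizes follow the description above and the two fully connected layers mentioned in Section V; the strides, the hidden layer width of 256, and the output of six Q-values are assumptions made for illustration:

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three 3x3 convolutions (16, 32, 64 filters) followed by two fully connected layers."""

    def __init__(self, num_outputs=6, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
        )
        # With an 84x84 input and stride 2, the convolution stack yields 64 maps of size 9x9.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_outputs),   # e.g. Q-values for six discrete actions (DDQN)
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))

# Example: a batch containing one observation of four stacked 84x84 grayscale frames.
q_values = FeatureExtractor()(torch.zeros(1, 4, 84, 84))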
The reward reflects the desired behavior and hence has to be carefully designed. In this case, the desired behavior is defined as following a road with and without lane markings. This requirement can be rewritten as the following sub-goals (a possible formulation is sketched after the list):
• Maximization of the driven distance: the goal is to maximize the distance driven until the road is left.
• Minimum distance to the center of the road: minimizing the distance between the center of the vehicle and the center of the road keeps the vehicle away from the critical points at which it is most likely to exit the road.
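One possible formulation of such a reward is sketched below (purely illustrative; the quantities, the weighting factor, and the normalization are assumptions). It also reflects the shaping idea from Section II, where the previously computed value is subtracted in each step:

def shaped_reward(distance_driven, dist_to_center, lane_half_width, prev_progress):
    """Reward sketch: reward progress along the road, penalize offset from the lane center."""
    # Progress measure combining the two sub-goals; the weight 0.5 is an illustrative choice.
    progress = distance_driven - 0.5 * min(dist_to_center / lane_half_width, 1.0)
    # Only the increase since the previous step is rewarded, so the agent cannot
    # accumulate reward without actually making progress along the road.
    reward = progress - prev_progress
    return reward, progress

# Illustrative call inside the environment step:
reward, prev_progress = shaped_reward(distance_driven=12.3, dist_to_center=0.4,
                                      lane_half_width=1.75, prev_progress=11.9)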
Most agents have a high exploration rate by the time training starts. Either this is the result of an explicit exploration strategy, as in DQN, or it is implicitly integrated through the task definition. A high frequency of action selection and a high exploration rate at the beginning of training mean that the selection of actions effectively follows a uniform distribution. Due to the symmetrical action spaces of the steering angle and of the accelerator and brake pedal position, this results in a tendency to drive straight ahead. In order to enable a meaningful evaluation of the exploratory ability of the individual agents, the route should be chosen such that no large-scale exploration is required at the beginning of training.

The following considerations mainly concern the investigation of the generalization capability. Characteristics of the environments that could be used for decision making are, for example, absolute curve radii, road markings, objects beyond the road, but also objects in the background. However, it should be irrelevant whether or not there is a mountain in the background. The same applies to objects at the roadside: for successful movement on the road and transferability to similar environments, these objects should not be used for decision making.

Another choice is whether a single frame or multiple frames should be used. Using a single frame allows for greater memory efficiency, given that each observation is stored inside the replay buffer. Nevertheless, the drawback quickly becomes apparent when watching the agent drive: the agent fails to infer its speed from a single frame, since no other sensor data is used. Therefore, a sequence of four consecutive frames is taken and used as an observation. In that way, the agent can infer velocity and acceleration from the differences between the frames.
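The preprocessing and frame stacking described in this section can be sketched as follows (a minimal sketch using OpenCV; the crop region is an illustrative assumption):

from collections import deque

import cv2
import numpy as np

def preprocess(frame_rgb):
    """Crop to the area of interest, resize, and convert to grayscale."""
    cropped = frame_rgb[60:, :, :]      # illustrative crop: discard the upper (sky) region
    resized = cv2.resize(cropped, (84, 84), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(resized, cv2.COLOR_RGB2GRAY)

class FrameStack:
    """Keeps the four most recent preprocessed frames as one observation."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        self.frames.extend([preprocess(frame)] * self.frames.maxlen)
        return np.stack(self.frames, axis=-1)    # observation of shape (84, 84, 4)

    def step(self, frame):
        self.frames.append(preprocess(frame))
        return np.stack(self.frames, axis=-1)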
V. EVALUATION

The evaluation of the agents is split into three different investigations. Initial training is performed, which serves two purposes:
• Evaluation of the algorithms during the training process in different environments concerning convergence speed and overall performance.
• Training of agents for each of the environments, as required for the following generalization studies.

The network architecture is the same for both DQN and SAC. Both use a convolutional neural network with three convolution layers followed by two fully connected layers. The replay buffer is chosen to hold 100k transitions, the learning rate is 2.5 · 10−4, the batch size is 32, and the discount factor is 0.99. The same hyperparameters are used for all training runs. A resolution of 84 × 84 pixels is used as the image size of the observations; the chosen size represents a compromise between the reduction of dimensionality and the information still contained in the image. Instead of plain Deep Q-networks, we use Double Deep Q-networks (DDQN) [23], a more stable variant of DQN that suffers less from overestimation of the learned Q-values. To encourage exploration, an ε-greedy strategy is used that anneals ε from 1.0 to 0.1 over the first 240k environment interactions and further down to 0.01 until the end of training at 1M interactions. The parameters of the online network are copied to the target network every 1000 steps. The agent can choose between six discrete actions. The agent trained with SAC is trained for 500k time steps. Both agents are evaluated during training every 10k steps to measure their performance without artificial exploration. For DQN, ε is set to 0.0, while for SAC the mean of the probability distribution
Fig. 3: The training process of DQN (left) and SAC (right), shown as driven distance in meters over training episodes for each environment. SAC learns significantly faster and reaches a higher maximum driven distance before experiencing a drop in performance, which is most likely due to limitations in the size of the replay buffer.
Fig. 4 values (driven distance in m; rows: agents, columns: evaluation environments):

DQN:        Env 1   Env 2   Env 3   Env 4
Agent 1     342.5    71.7     4.8     8.4
Agent 2     339.9   288.2   168.0    73.5
Agent 3     188.9   151.4   300.4    88.6
Agent 4     273.4   115.4   113.3   143.2

SAC:        Env 1   Env 2   Env 3   Env 4
Agent 1     348.3   284.3   282.1   164.5
Agent 2     343.6   806.2   450.2   168.8
Agent 3     349.2   286.6   660.1   179.5
Agent 4     347.1   256.9   296.3   280.3
Fig. 4: Comparison of the transfer capability of DQN (left) and SAC (right). Agents trained with SAC perform better across all possible agent-environment combinations. In particular, agents two and three generalize best to the remaining environments.
is used. For each evaluation, five trajectories are driven, and the mean of the returns is taken.
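The described annealing of the exploration rate can be expressed as a simple piecewise-linear schedule (a sketch; only the breakpoints 1.0 → 0.1 at 240k interactions and 0.1 → 0.01 at 1M interactions are taken from the text, the linear interpolation in between is an assumption):

def epsilon(step, first_phase_end=240_000, training_end=1_000_000):
    """Exploration rate of the DDQN agent as a function of environment interactions."""
    if step <= first_phase_end:
        # Anneal from 1.0 to 0.1 over the first 240k environment interactions.
        return 1.0 - 0.9 * step / first_phase_end
    if step <= training_end:
        # Anneal from 0.1 to 0.01 until the end of training at 1M interactions.
        return 0.1 - 0.09 * (step - first_phase_end) / (training_end - first_phase_end)
    return 0.01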
Significant differences in performance, both between the environments and between the algorithms, become apparent. While both agents successfully drive the first track, the performance differs tremendously in the following environments. This is also apparent in figure 3. One possible explanation for the observed behavior of DQN can be found in its exploration ability. In the first environment, no sharp turns are present, which makes exploration easier. With sharp turns, more trajectories are needed to finally finish the course, but since the exploration rate has already dropped to a value close to zero, the exploration is no longer sufficient. SAC, on the other hand, utilizes its entropy-regularized policy to explore the course efficiently. In general, the agents need more time steps to learn a policy that is sufficient to drive these tracks.

To investigate the generalization of deep reinforcement learning agents in autonomous driving, the agents trained in the single environments are used as initial models. Each of the trained agents is evaluated for 100 episodes on the remaining, unseen tracks without further improving the policy. The results are displayed as a heat map in figure 4, where rows show the different agents and columns the different environments. The first track is approximately 340 m long, while the other tracks are round courses with one lap being around 340 m. The values shown in the heat map are the driven distances in meters. All agents are able to complete the first environment, even though agents two to four were never trained in this environment. One possible explanation for this is the curvature of the tracks used in the environments: the higher variance in track curvature appears to have led agents trained in environments 2-4 to learn to follow a greater range of curve radii. In contrast, agent one is unable to apply its learned knowledge to environments two to four with full success. However, the
results achieved are significantly higher than those of the random agent. As expected, in each environment the agent trained in that environment achieves the highest value among all agents, which is reflected in the diagonal structure of the heat map. Agents two and three have the best generalization ability among the trained agents, with both having their weakness in environment four. In contrast to the other environments, no road markings are used in environment four. This supports the hypothesis that agents in the other environments make decisions based on the road markings. The investigation of different deep reinforcement learning algorithms shows significant differences between the algorithms: Soft Actor-Critic stands out from Deep Q-networks in all investigated scenarios. Through maximum entropy reinforcement learning, robust policies are learned, which as a consequence show a better generalization capability. Apart from the differences between them, both algorithms show good results in the tested environments.

VI. CONCLUSION & FUTURE WORK

The goal of this work was to design and implement an end-to-end system that is capable of driving autonomously. We developed a framework for deep reinforcement learning-based driving built on AirSim, which provides a stochastic environment that is rich in detail. The core of the system is a deep reinforcement learning algorithm that only processes data from a low-resolution front camera. By using the OpenAI Gym interfaces for our framework, new reinforcement learning algorithms can be seamlessly integrated in the future. The system can successfully learn to drive solely based on a reward function. Additionally, the developed framework was used to conduct experiments regarding the general applicability to autonomous driving and the generalization between different environments. We found that especially Soft Actor-Critic was able to transfer its learned knowledge to unseen environments. All results were obtained using hyperparameter baselines from the original papers; as tuning them with conventional optimization methods is very time-consuming in reinforcement learning, alternative methods like hyperparameter optimization with evolutionary strategies [24] are left to future work. Besides, the use of further neural network architectures, e.g. LSTMs, is an interesting perspective.

REFERENCES

[1] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019.
[2] C. Ellis, "Mapping the world: Solving one of the biggest challenges for autonomous cars," April 2019. Accessed: 2020-01-29.
[3] R. McAllister, Y. Gal, A. Kendall, M. van der Wilk, A. Shah, R. Cipolla, and A. Weller, "Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 4745–4753, 2017.
[4] H. J. Vishnukumar, B. Butting, C. Müller, and E. Sax, "Machine learning and deep neural network — artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation," in 2017 Intelligent Systems Conference (IntelliSys), pp. 714–721, Sep. 2017.
[5] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Learning visual feature spaces for robotic manipulation with deep spatial autoencoders," CoRR, vol. abs/1509.06113, 2015.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
[7] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in ICML, vol. 99, pp. 278–287, 1999.
[8] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[10] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," arXiv preprint arXiv:1801.01290, 2018.
[12] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, et al., "Towards fully autonomous driving: Systems and algorithms," in 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168, IEEE, 2011.
[13] C. Linegar, W. Churchill, and P. Newman, "Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera," in 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 787–794, IEEE, 2016.
[14] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[15] M. Vitelli and A. Nayebi, "CARMA: A deep reinforcement learning approach to autonomous driving," tech. rep., Stanford University, 2016.
[16] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, "Learning to drive in a day," in 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254, IEEE, 2019.
[17] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[19] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[20] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[21] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, "Quantifying generalization in reinforcement learning," arXiv preprint arXiv:1812.02341, 2018.
[22] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, pp. 621–635, Springer, 2018.
[23] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] M. Stang, C. Meier, V. Rau, and E. Sax, "An evolutionary approach to hyper-parameter optimization of neural networks," in International Conference on Human Interaction and Emerging Technologies, pp. 713–718, Springer, 2019.