Evaluation of Deep Reinforcement Learning Algorithms for Autonomous Driving

Abstract— Once considered futuristic, machine learning is already integrated into our everyday lives and will shape many areas of daily life in the future. This success is mainly due to the progress in machine learning and the increase in computing power. While machine learning is used to solve partial problems in autonomous driving, the reliance on high-resolution maps severely limits the use of autonomous vehicles in unknown areas. At the same time, the structuring of the overall problem into modular subsystems for perception, self-localization, planning, and control limits the performance of such systems. A particularly promising alternative is end-to-end learning, which optimizes the system as a whole. In this work, we investigate the application of an end-to-end learning method for autonomous driving that employs reinforcement learning. For this purpose, a system is developed which allows the examination of different reinforcement learning approaches in a simulated environment. The system receives simulated images of the front camera as input and provides the control values for steering angle, accelerator, and brake pedal position as direct output. The desired behavior is learned automatically through interaction with the environment. The reward function is currently optimized for following a lane at the highest possible speed. Using specially modeled environments with different levels of detail, multiple deep reinforcement learning approaches are compared. Among other aspects, we examine the extent to which trained models can be transferred to unknown environments. Our investigations show that Soft Actor-Critic is the best choice among the tested algorithms concerning learning speed and the ability to generalize to unseen environments.

I. INTRODUCTION

Most recent approaches to autonomous driving follow a rather conventional approach [1]. Typically, the problem is split up in one form or another into sub-problems which are then tackled independently. They can be defined as self-localization, object detection and classification, prediction of dynamic objects, and eventually the planning of the movement of the ego vehicle. In this approach, an exact localization in the environment is necessary for the system to work. One popular way to handle this is via very detailed maps that are used in conjunction with the sensor suite of the vehicle to localize within the map. The creation of these maps is complex and rather cumbersome [2]. In addition, the modular approach suffers from error propagation [3]. While this may work for areas that are well defined through the respective maps, it quickly becomes apparent that, in order to be able to drive in all areas, an extensive amount of data for the map creation must be collected and analyzed.

Autonomous driving is the key to more efficient mobility across the globe and, at the same time, promises a tremendous reduction of severe road incidents. Despite advances in machine learning, especially deep learning and deep reinforcement learning, its application to autonomous driving is still in its infancy. End-to-end learning of autonomous driving seems to be an exciting and promising alternative to the conventional approach and thus requires further investigation. Two fundamental categories of end-to-end approaches are direct supervised deep learning and deep reinforcement learning [1]. However, since supervised learning requires massive amounts of labeled data for training and validation [4], we focus on deep reinforcement learning.

Autonomous driving itself is ever-present, but the integration of deep reinforcement learning techniques into the areas mentioned above has not yet been achieved, even though deep reinforcement learning is a highly active field of research. Similar to the application of deep learning to computer vision tasks, the system must be able to generalize from seen data to new, unseen data. Nonetheless, deep reinforcement learning algorithms have shown promising results in other robotic domains [5]: they enable learning the desired behavior end-to-end, in a fully automated way, directly from a given observation.

The driving task can be divided into two main tasks, namely perceiving the environment and selecting an appropriate action. This sensorimotoric process of sensing and acting is learned through experience. Reinforcement learning is the theory of an agent that learns optimal behavior through interaction with its environment and hence matches the problem description quite well. With deep reinforcement learning, it is possible to utilize the power of deep learning techniques in conjunction with reinforcement learning to learn optimal behavior from high-dimensional inputs, such as raw pixels, to action outputs. Thus, this paper is motivated by the advances in the field of deep reinforcement learning and by its particular applicability to autonomous driving. A long-standing issue has been the trial-and-error nature of reinforcement learning algorithms, which has made them an unpopular choice for real-world applications, both due to the inherent safety risks during
the training process and its cost-ineffectiveness.

We seek to design a framework to support experiments with deep reinforcement learning algorithms for autonomous driving, since reinforcement learning is inherently a trial-and-error learning process.

The main contributions of this paper are:
• Development of a deep reinforcement learning framework that is suitable for autonomous driving,
• Comparison of different deep reinforcement learning algorithms concerning convergence time and overall performance,
• Investigation of generalization in the experiments, considering the ability of deep reinforcement learning agents to generalize between different environments.

Section II gives a short introduction to basic reinforcement learning, while Section III describes work related to this paper. The concept for the framework is presented in Section IV, and experiments are conducted in Section V.

II. BACKGROUND

The reinforcement learning setup consists of an agent that interacts with an environment in discrete time steps. At each time step t, the agent receives an observation ot of the current state st of the environment. Based on the observation, the agent takes an action at and receives a scalar reward rt [6]. The agent's behavior is defined by a policy π that maps states S to a probability distribution over actions A:

π : S → P(A)    (1)

The goal of the agent is to maximize the total reward. The total reward is defined as the return Gt, which can be undiscounted or discounted [6]. Here we use the discounted return:

Gt = Σ_{k=0}^{T} γ^k r_{t+k+1}    (2)

To preserve learning guarantees, potential-based reward shaping is used [7]. To avoid positive feedback loops, in each time step the previously awarded value is subtracted from the current one. In this way, situations in which the agent could collect unbounded reward without pursuing the actual goal are avoided. This topic is revisited in Section IV, where the reward design is discussed.

To assess the value of a certain state to the agent, the value function may be used. It is defined as the expected return when starting in state st and following the policy π [6]:

Vπ(s) = Eπ[Gt | St = s]    (3)

Analogously, the state-action value Q(s, a) is defined as the expected return when starting in state st, taking action at, and afterwards following the policy π [6]:

Qπ(s, a) = Eπ[Gt | St = s, At = a]    (4)

When the value function is known, it is possible to define an ordering over different policies. If the optimal value function is known, the optimal policy can be found by following a greedy strategy [8].

A. Deep Q-networks (DQN)

Deep Q-networks [9] try to learn the optimal state-action value function q∗. They do so by using a neural network that attempts to learn the relationship between state-action pairs and their values. High-dimensional input requires the extraction of features from the input data; a convolutional neural network is used to find meaningful representations in the high-dimensional data automatically. An update rule similar to Q-learning [10] is used to calculate the targets for training the neural network:

Q′(s, a) = Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]    (5)

Given the optimal Q-function, the best action can be found by taking the argmax over the action values:

a∗ = argmax_a Q(s, a)    (6)
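As an illustration, the update in Eq. (5) and the action selection in Eq. (6) can be sketched as follows for a small discrete problem (a minimal NumPy sketch; the table q_table and the step size alpha are illustrative assumptions and not part of the system described in this paper):

import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Eq. (5): move Q(s, a) towards the bootstrapped target r + gamma * max_a' Q(s', a')
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

def greedy_action(q_table, s):
    # Eq. (6): a* = argmax_a Q(s, a)
    return int(np.argmax(q_table[s]))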
B. Soft Actor-Critic

While DQN works with discrete action spaces and cannot easily be extended to continuous action spaces, SAC features a stochastic policy that supports continuous action spaces, which makes it particularly useful for robotic use cases [11]. Instead of using the optimal Q-function to determine the optimal action a∗, the stochastic policy is learned directly by a neural network. The goal of conventional reinforcement learning is the maximization of the expected total reward. Agents trained with this objective in mind, however, tend to focus on optimizing a single path. While this might be desirable in a deterministic world, it is prone to external disturbances.

Soft Actor-Critic resides in the maximum entropy reinforcement learning framework, which differs from the conventional reinforcement learning setup in that it extends the original objective with an additional term for entropy regularization. In this way, policies are favored that still perform the given task while at the same time remaining as random as possible. Since this randomness is included at training time, the trained policies tend to be more robust against external disturbances [11].
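For reference, the entropy-regularized objective of [11] can be written in the notation of this section as

J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(· | s_t)) ],

where H denotes the entropy of the policy and the temperature α weighs the entropy term against the reward; this formulation and its symbols are taken from [11] rather than from the present paper.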
III. RELATED WORK

A. Autonomous Driving

In general, a system capable of autonomous driving requires an understanding of both the static environment and the dynamic objects [12], [13]. Examples of the static environment are the course of the road, traffic signs, and the position of light signals. Dynamic objects, on the other hand, are other road users, who must be detected in real time using some form of object recognition and classification. Hence, recent approaches use a highly detailed map to aid the localization of the vehicle in the static environment and computer vision to perceive the dynamic objects. Most approaches are perception-control based, and the development of end-to-end learning-based approaches is still lagging behind, one of the main reasons being computational and algorithmic limitations. With the advances in deep learning, research and development are starting to catch up, however.

Besides the conventional mapping approach, imitation learning offers an alternative. Instead of solving several subproblems, imitation learning tries to mimic real driving behavior by mapping sensor data directly to the control output, i.e., steering wheel angle and acceleration. Bojarski et al. [14] train a convolutional neural network using image data from a front camera, together with the respective actuator values recorded from real driving scenarios, to learn the mapping from image to steering wheel angle.

Vitelli et al. [15] discretize the action space to apply deep Q-learning to autonomous driving. The state of the vehicle is based on an image of the environment combined with direct signals such as speed, distance to the center of the road, and an estimate of the angle between the longitudinal axis of the vehicle and the longitudinal axis of the road.

Wayve follows a combination of imitation learning and reinforcement learning: a policy is learned from expert demonstrations and then improved through reinforcement learning [16]. According to Kendall et al. [16], this is the first successful attempt to use deep reinforcement learning for autonomous driving in reality. The authors use a system that learns a lane-following task based on image data from a mono camera, a convolutional neural network, and a deep reinforcement learning algorithm. The agent's reward is proportional to the time traveled without human intervention.

B. Reinforcement Learning

With the advances in deep learning, the field of computer vision has undergone significant breakthroughs. By combining deep learning techniques with reinforcement learning, much progress has been made in recent years [17]. Mnih et al. [9] introduce deep reinforcement learning as a combination of deep learning and reinforcement learning: they combine a convolutional neural network with the reinforcement learning algorithm Q-learning into Deep Q-networks (DQN). The neural network is used to extract relevant features directly from high-dimensional input data, which are then used to find the optimal action. The image is directly mapped to one out of several discrete actions, and every state-action pair is mapped to a so-called Q-value that reflects the value of that pair. While DQN works well with discrete action spaces, it is non-trivial to extend it to continuous action spaces. Lillicrap et al. [18] propose Deep Deterministic Policy Gradient (DDPG), an algorithm that is particularly well suited for continuous action spaces. In addition to video games, challenging control tasks were included in their evaluation: they use TORCS, a racing simulator, to train an agent to drive based on sensor data as well as raw pixel data. In contrast to DDPG, Soft Actor-Critic (SAC) [19] uses a stochastic policy. SAC includes a term for entropy regularization, which encourages the agent to learn a policy that is robust against external perturbations while still being successful. Combining SAC with recent advances in deep learning techniques that allow for hierarchical learning of representations [20], Haarnoja et al. [11] teach a four-legged robot to walk solely based on sensory input.

IV. CONCEPT & DESIGN

The main requirements for the end-to-end learning system are the following:
• Ability to use the algorithm in different environments (in the sense of track scenarios).
• Efficient handling of the high-dimensional input data coming from the vehicle's front camera.
• The reward function should reflect the desired driving behavior.
• The system should allow easy switching between different environments and algorithms to compare the resulting behavior.

The system is supplied with images from the vehicle's front camera to simulate real-world operation as accurately as possible. No additional information from the simulation engine is used that would not be available in a real application. Hence, the system has to provide a way to process the high-dimensional input data. Based on the image or image sequence, a combination of steering wheel angle, gas, and brake pedal position should be output. This is very similar to human behavior and treats driving as a holistic task. In the field of artificial intelligence, reinforcement learning is a framework that learns from interaction with the environment and, given the latest research results in the field, offers an exciting alternative to the modular approach. The desired behavior is specified using the reward function; therefore, the design of the latter is of particular importance. Autonomous driving is already a non-trivial task in a single environment, and for successful application in different regions, variation in the environment must also be manageable. Different forms of vegetation, but also differences in road layout and road conditions, are just a few examples of how environments differ; they thus require a system that can handle a wide range of situations.

In this paper, a limited problem is considered: the automated following of the course of the road. To capture this goal, a suitable reward function can be designed. The question arises whether, despite features optimized for this task, a generalization to different environments is possible. On the basis of the designed system, studies must therefore be carried out on the influence of the environment on the generalizability of the system. In machine learning, generalization describes the ability of a model to adapt to new, unseen data. While in supervised learning dropout, batch normalization, and other regularization techniques are used to increase generalization, these procedures have not yet found wide application in reinforcement learning. Thus, reinforcement learning agents are typically prone to
Fig. 1: The framework used for the investigation is primarily based on the agent-environment interaction model. The simulation engine AirSim is used to simulate the vehicle's behavior. (Data flow shown in the diagram: the agent sends an action through the AirSimWrapper as CarControls to AirSim; the returned observation is preprocessed, the reward is calculated, the transition (st, at, rt, st+1, dt+1) is stored in the replay buffer, and batches of experience are used to update the agent.)
overfit to the training environment [21]. However, the ability to generalize is desirable for several reasons. First, in the case of autonomous driving it is simply impossible to cover all possible situations in training. Second, far less data or experience is needed in the training process if already acquired knowledge can be transferred and applied to new situations. Naturally, the better the generalization, the more general the learned features should be.

Cobbe et al. [21] use a custom game, CoinRun, to test their agents with respect to their generalization ability. For the specific application of autonomous driving, the generalization of different reinforcement learning algorithms is compared here using different simulation environments.

For this purpose, four different track environments are created in AirSim; they are shown in figure 2. The environments are modeled with specific connections between the individual environments in mind. On the one hand, they serve to measure the performance of different reinforcement learning agents; on the other hand, they can be used to see how well agents generalize between different environments. To build the environments, the simulation engine AirSim [22] is used in conjunction with the Unreal Engine to provide photorealistic environments. The tracks are modeled so that they get progressively harder to finish successfully. This is achieved by varying the curvature of the road: the easier tracks have fewer sharp bends, while the more difficult tracks contain sharp curves. While the first track has an entirely different course than the following ones, tracks two to four share the same course of the road. Although they share the course, they differ in the scenery on and off the road, which is a convenient feature for verifying generalization. The idea is to train agents separately, in one environment each. In the next step, the agents are placed in the other environments, which they have not seen during training, and perform evaluation episodes without any further fine-tuning. Apart from track no. 1, the remaining tracks feature the same course, meaning that if only relevant features, i.e., lane markings or road segmentation, are used, an agent should be able to find the same features on the other tracks.

The system design, which is shown in figure 1, is kept as modular as possible to simplify reusability. For instance, the interfaces between the AirSimWrapper and the agent component conform to the OpenAI Gym interfaces, so that published algorithms can be evaluated with little effort. The overall system is divided into two subsystems, the simulation environment and the system for deep reinforcement learning. To ensure that the two subsystems are as independent of each other as possible, an interface definition between the simulation environment and the end-to-end system is required. Interfaces are needed for receiving images from the front camera and the current state of the vehicle, but also for sending control signals to the environment.
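A minimal sketch of such a wrapper is shown below. It only illustrates the Gym-style interface described above; the class name AirSimWrapper is taken from figure 1, while the helper methods (_get_frame, _compute_reward), the action layout, and the use of the AirSim Python client in this exact form are illustrative assumptions rather than the original implementation.

import numpy as np
import gym
from gym import spaces
import airsim  # AirSim Python client, assumed to be available

class AirSimWrapper(gym.Env):
    """Exposes the AirSim car simulation through the OpenAI Gym interface."""

    def __init__(self):
        self.client = airsim.CarClient()
        self.client.confirmConnection()
        self.client.enableApiControl(True)
        # Observation: a stack of four preprocessed 84x84 grayscale frames (see Section V).
        self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
        # Action: steering, throttle and brake (continuous for SAC; discretized for DQN).
        self.action_space = spaces.Box(low=np.array([-1.0, 0.0, 0.0], dtype=np.float32),
                                       high=np.array([1.0, 1.0, 1.0], dtype=np.float32))

    def step(self, action):
        controls = airsim.CarControls(steering=float(action[0]),
                                      throttle=float(action[1]),
                                      brake=float(action[2]))
        self.client.setCarControls(controls)      # send control signals to the simulation
        obs = self._get_frame()                   # receive the front-camera observation
        reward, done = self._compute_reward()     # cf. the reward design discussed later in this section
        return obs, reward, done, {}

    def reset(self):
        self.client.reset()
        return self._get_frame()

    def _get_frame(self):
        # Placeholder: in the real system the front-camera image is requested from AirSim
        # and then preprocessed (cropped, resized, converted to grayscale, stacked).
        return np.zeros(self.observation_space.shape, dtype=np.uint8)

    def _compute_reward(self):
        # Placeholder for the reward and termination signal described in this section.
        return 0.0, False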
The core of the system consists of different deep reinforcement learning algorithms. Here we consider only off-policy algorithms, as they tend to have better sample efficiency. Experience replay is used to reuse collected experience: past experiences are stored in a replay buffer, which contains
tuples of observation, action, reward, and next observation.
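Such a buffer can be sketched as follows (an illustrative minimal version; the capacity of 100k transitions matches the value reported in Section V, while the uniform sampling and the class layout are assumptions):

import random
from collections import deque

class ReplayBuffer:
    """Stores (observation, action, reward, next_observation, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are dropped first

    def store(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of past transitions for an off-policy update.
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.buffer)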
In this work, Deep Q-networks and Soft Actor-Critic are chosen as the deep reinforcement learning algorithms.

Fig. 2: Overview of the different tracks.

To process the high-dimensional input data, a convolutional neural network (CNN) is used. Convolutional neural networks are particularly suitable for processing image data; therefore, as in [9], a CNN is chosen to extract relevant features automatically. In each time step, a new frame is received and preprocessed before being fed to the convolutional neural network. First, the received frame is cropped to an area of interest to constrain the agent to relevant regions. This is followed by a reduction in size and, finally, a conversion to grayscale. The structure of the CNN itself is as follows: the first layer uses 16 filter kernels of size 3 × 3, the second layer 32 filter kernels of size 3 × 3, and the third layer 64 filter kernels of size 3 × 3. An exhaustive search over convolutional neural network architectures is not part of the present work.
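A possible PyTorch sketch of this feature extractor is given below. The filter counts and kernel sizes follow the description above and the two fully connected layers mentioned in Section V; the strides, the hidden layer width of 256, and the output of six Q-values are assumptions made for illustration:

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three 3x3 convolutions (16, 32, 64 filters) followed by two fully connected layers."""

    def __init__(self, num_outputs=6, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
        )
        # With an 84x84 input and stride 2, the convolution stack yields 64 maps of size 9x9.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_outputs),   # e.g. Q-values for six discrete actions (DDQN)
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))

# Example: a batch containing one observation of four stacked 84x84 grayscale frames.
q_values = FeatureExtractor()(torch.zeros(1, 4, 84, 84))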
The reward reflects the desired behavior and hence has to be carefully designed. In this case, the desired behavior is defined as following a road with and without lane markings. This requirement can be rewritten as the following sub-goals (a possible formulation is sketched after the list):
• Maximization of the driven distance: the goal is to maximize the distance driven until the road is left.
• Minimum distance to the center of the road: minimizing the distance between the center of the vehicle and the center of the road keeps the vehicle away from the critical points at which it is most likely to exit the road.
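One possible formulation of such a reward is sketched below (purely illustrative; the quantities, the weighting factor, and the normalization are assumptions). It also reflects the shaping idea from Section II, where the previously computed value is subtracted in each step:

def shaped_reward(distance_driven, dist_to_center, lane_half_width, prev_progress):
    """Reward sketch: reward progress along the road, penalize offset from the lane center."""
    # Progress measure combining the two sub-goals; the weight 0.5 is an illustrative choice.
    progress = distance_driven - 0.5 * min(dist_to_center / lane_half_width, 1.0)
    # Only the increase since the previous step is rewarded, so the agent cannot
    # accumulate reward without actually making progress along the road.
    reward = progress - prev_progress
    return reward, progress

# Illustrative call inside the environment step:
reward, prev_progress = shaped_reward(distance_driven=12.3, dist_to_center=0.4,
                                      lane_half_width=1.75, prev_progress=11.9)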
Most agents have a high exploration rate by the time training starts. Either this is the result of an explicit exploration strategy, as in DQN, or it is implicitly integrated through the task definition. A high frequency of action selection and a high exploration rate at the beginning of training mean that the selection of actions effectively follows a uniform distribution. Due to the symmetrical action spaces of the steering angle and of the accelerator and brake pedal position, this results in a tendency to drive straight ahead. In order to enable a meaningful evaluation of the exploratory ability of the individual agents, the route should be chosen such that no large-scale exploration is required at the beginning of training.

The following considerations mainly concern the investigation of the generalization capability. Characteristics of the environments that could be used for decision making are, for example, absolute curve radii, road markings, objects beyond the road, but also objects in the background. However, it should be irrelevant whether or not there is a mountain in the background. The same applies to objects at the roadside: for successful movement on the road and transferability to similar environments, these objects should not be used for decision making.

Another choice is whether a single frame or multiple frames should be used. Using a single frame allows for greater memory efficiency, given that each observation is stored inside the replay buffer. Nevertheless, the drawback quickly becomes apparent when watching the agent drive: the agent fails to infer its speed from a single frame, since no other sensor data is used. Therefore, a sequence of four consecutive frames is taken and used as an observation. In that way, the agent can infer velocity and acceleration from the differences between the frames.
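The preprocessing and frame stacking described in this section can be sketched as follows (a minimal sketch using OpenCV; the crop region is an illustrative assumption):

from collections import deque

import cv2
import numpy as np

def preprocess(frame_rgb):
    """Crop to the area of interest, resize, and convert to grayscale."""
    cropped = frame_rgb[60:, :, :]      # illustrative crop: discard the upper (sky) region
    resized = cv2.resize(cropped, (84, 84), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(resized, cv2.COLOR_RGB2GRAY)

class FrameStack:
    """Keeps the four most recent preprocessed frames as one observation."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        self.frames.extend([preprocess(frame)] * self.frames.maxlen)
        return np.stack(self.frames, axis=-1)    # observation of shape (84, 84, 4)

    def step(self, frame):
        self.frames.append(preprocess(frame))
        return np.stack(self.frames, axis=-1)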
V. EVALUATION

The evaluation of the agents is split into three different investigations. Initial training is performed, which serves two purposes:
• Evaluation of the algorithms during the training process in different environments concerning convergence speed and overall performance.
• Training of agents for each of the environments, as required for the following generalization studies.

The network architecture is the same for both DQN and SAC. Both use a convolutional neural network with three convolution layers followed by two fully connected layers. The replay buffer is chosen to hold 100k transitions, the learning rate is 2.5 · 10−4, the batch size is 32, and the discount factor is 0.99. The same hyperparameters are used for all training runs. A resolution of 84 × 84 pixels is used as the image size of the observations; the chosen size represents a compromise between the reduction of dimensionality and the information still contained in the image. Instead of plain Deep Q-networks, we use Double Deep Q-networks (DDQN) [23], a more stable variant of DQN that suffers less from overestimation of the learned Q-values. To encourage exploration, an ε-greedy strategy is used that anneals ε from 1.0 to 0.1 over the first 240k environment interactions and further down to 0.01 until the end of training at 1M interactions. The parameters of the online network are copied to the target network every 1000 steps. The agent can choose between six discrete actions. The agent trained with SAC is trained for 500k time steps. Both agents are evaluated during training every 10k steps to measure their performance without artificial exploration. For DQN, ε is set to 0.0, while for SAC the mean of the probability distribution
Fig. 3: The training process of DQN (left) and SAC (right), shown as driven distance in meters over training episodes for each environment. SAC learns significantly faster and reaches a higher maximum driven distance before experiencing a drop in performance, which is most likely due to limitations in the size of the replay buffer.
Fig. 4 values (driven distance in m; rows: agents, columns: evaluation environments):

DQN:        Env 1   Env 2   Env 3   Env 4
Agent 1     342.5    71.7     4.8     8.4
Agent 2     339.9   288.2   168.0    73.5
Agent 3     188.9   151.4   300.4    88.6
Agent 4     273.4   115.4   113.3   143.2

SAC:        Env 1   Env 2   Env 3   Env 4
Agent 1     348.3   284.3   282.1   164.5
Agent 2     343.6   806.2   450.2   168.8
Agent 3     349.2   286.6   660.1   179.5
Agent 4     347.1   256.9   296.3   280.3
Fig. 4: Comparison of the transfer capability of DQN (left) and SAC (right). Agents trained with SAC perform better across all possible agent-environment combinations. In particular, agents two and three generalize best to the remaining environments.
is used. For each evaluation, five trajectories are driven, and the mean of the returns is taken.
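The described annealing of the exploration rate can be expressed as a simple piecewise-linear schedule (a sketch; only the breakpoints 1.0 → 0.1 at 240k interactions and 0.1 → 0.01 at 1M interactions are taken from the text, the linear interpolation in between is an assumption):

def epsilon(step, first_phase_end=240_000, training_end=1_000_000):
    """Exploration rate of the DDQN agent as a function of environment interactions."""
    if step <= first_phase_end:
        # Anneal from 1.0 to 0.1 over the first 240k environment interactions.
        return 1.0 - 0.9 * step / first_phase_end
    if step <= training_end:
        # Anneal from 0.1 to 0.01 until the end of training at 1M interactions.
        return 0.1 - 0.09 * (step - first_phase_end) / (training_end - first_phase_end)
    return 0.01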
Significant differences in performance, both between the environments and between the algorithms, become apparent. While both agents successfully drive the first track, the performance differs tremendously in the following environments. This is also apparent in figure 3. One possible explanation for the observed behavior of DQN can be found in its exploration ability. In the first environment, no sharp turns are present, which makes exploration easier. With sharp turns, more trajectories are needed to finally finish the course, but since the exploration rate has already dropped to a value close to zero, the exploration is no longer sufficient. SAC, on the other hand, utilizes its entropy-regularized policy to explore the course efficiently. In general, the agents need more time steps to learn a policy that is sufficient to drive these tracks.

To investigate the generalization of deep reinforcement learning agents in autonomous driving, the agents trained in the single environments are used as initial models. Each of the trained agents is evaluated for 100 episodes on the remaining, unseen tracks without further improving the policy. The results are displayed as a heat map in figure 4, where rows show the different agents and columns the different environments. The first track is approximately 340 m long, while the other tracks are round courses with one lap being around 340 m. The values shown in the heat map are the driven distances in meters. All agents are able to complete the first environment, even though agents two to four were never trained in this environment. One possible explanation for this is the curvature of the tracks used in the environments: the higher variance in track curvature appears to have led agents trained in environments 2-4 to learn to follow a greater range of curve radii. In contrast, agent one is unable to apply its learned knowledge to environments two to four with full success. However, the
results achieved are significantly higher than those of the random agent. As expected, in each environment the agent trained in that environment achieves the highest value among all agents, which is reflected in the diagonal structure of the heat map. Agents two and three have the best generalization ability among the trained agents, with both having their weakness in environment four. In contrast to the other environments, no road markings are used in environment four. This supports the hypothesis that agents in the other environments make decisions based on the road markings. The investigation of different deep reinforcement learning algorithms shows significant differences between the algorithms: Soft Actor-Critic stands out from Deep Q-networks in all investigated scenarios. Through maximum entropy reinforcement learning, robust policies are learned, which as a consequence show a better generalization capability. Apart from the differences between them, both algorithms show good results in the tested environments.

VI. CONCLUSION & FUTURE WORK

The goal of this work was to design and implement an end-to-end system that is capable of driving autonomously. We developed a framework for deep reinforcement learning-based driving built on AirSim, which provides a stochastic environment that is rich in detail. The core of the system is a deep reinforcement learning algorithm that only processes data from a low-resolution front camera. By using the OpenAI Gym interfaces for our framework, new reinforcement learning algorithms can be seamlessly integrated in the future. The system can successfully learn to drive solely based on a reward function. Additionally, the developed framework was used to conduct experiments regarding the general applicability to autonomous driving and the generalization between different environments. We found that especially Soft Actor-Critic was able to transfer its learned knowledge to unseen environments. All results were obtained using hyperparameter baselines from the original papers; as tuning them with conventional optimization methods is very time-consuming in reinforcement learning, alternative methods like hyperparameter optimization with evolutionary strategies [24] are left to future work. Besides, the use of further neural network architectures, e.g. LSTMs, is an interesting perspective.

REFERENCES

[1] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019.
[2] C. Ellis, "Mapping the world: Solving one of the biggest challenges for autonomous cars," April 2019. Accessed: 2020-01-29.
[3] R. McAllister, Y. Gal, A. Kendall, M. van der Wilk, A. Shah, R. Cipolla, and A. Weller, "Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 4745–4753, 2017.
[4] H. J. Vishnukumar, B. Butting, C. Müller, and E. Sax, "Machine learning and deep neural network — artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation," in 2017 Intelligent Systems Conference (IntelliSys), pp. 714–721, Sep. 2017.
[5] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Learning visual feature spaces for robotic manipulation with deep spatial autoencoders," CoRR, vol. abs/1509.06113, 2015.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
[7] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in ICML, vol. 99, pp. 278–287, 1999.
[8] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[10] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," arXiv preprint arXiv:1801.01290, 2018.
[12] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, et al., "Towards fully autonomous driving: Systems and algorithms," in 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168, IEEE, 2011.
[13] C. Linegar, W. Churchill, and P. Newman, "Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera," in 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 787–794, IEEE, 2016.
[14] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[15] M. Vitelli and A. Nayebi, "CARMA: A deep reinforcement learning approach to autonomous driving," tech. rep., Stanford University, 2016.
[16] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, "Learning to drive in a day," in 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254, IEEE, 2019.
[17] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[19] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[20] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[21] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, "Quantifying generalization in reinforcement learning," arXiv preprint arXiv:1812.02341, 2018.
[22] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, pp. 621–635, Springer, 2018.
[23] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] M. Stang, C. Meier, V. Rau, and E. Sax, "An evolutionary approach to hyper-parameter optimization of neural networks," in International Conference on Human Interaction and Emerging Technologies, pp. 713–718, Springer, 2019.