
DeepRacer: Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning

Bharathan Balaji1*, Sunil Mallya1*, Sahika Genc1*, Saurabh Gupta1, Leo Dirac1, Vineet Khare1, Gourav Roy1, Tao Sun1, Yunzhe Tao1, Brian Townsend1, Eddie Calleja1, Sunil Muralidhara1, Dhanasekar Karuppasamy1

1 Authors are employees of Amazon Web Services. * Contributed equally. Send all correspondence to: [email protected]
2 DeepRacer training source code: https://git.io/fjxoJ

Abstract— DeepRacer is a platform for end-to-end experimentation with RL and can be used to systematically investigate the key challenges in developing intelligent control systems. Using the platform, we demonstrate how a 1/18th scale car can learn to drive autonomously using RL with a monocular camera. It is trained in simulation with no additional tuning in the physical world and demonstrates: 1) formulation and solution of a robust reinforcement learning algorithm, 2) narrowing the reality gap through joint perception and dynamics, 3) distributed on-demand compute architecture for training optimal policies, and 4) a robust evaluation method to identify when to stop training. It is the first successful large-scale deployment of deep reinforcement learning on a robotic control agent that uses only raw camera images as observations and a model-free learning method to perform robust path planning. We open source our code and video demo on GitHub2.

I. INTRODUCTION

Reinforcement Learning (RL) has been used to accomplish diverse robotic tasks: manipulation [1], [2], [3], [4], locomotion [5], [6], navigation [7], [8], [9], [10], flight [11], [12], interaction [13], [14], motion planning [15], [16] and more. Due to high sample complexity and safety requirements, it is common to train the RL agent in simulation [1], [5], [17]. To reduce training time and encourage exploration, the agent is usually trained with distributed rollouts [18], [19], [20], [21]. For a successful transfer to the real world, researchers use calibration [2], [22], domain randomization [23], [24], [25], [12], fine tuning with real world data [9], and learn features from a combination of simulation and real data [26], [27].

To experiment with robotic reinforcement learning, one needs to have expertise in many areas, access to a physical robot, an accurate robot model for simulations, a distributed training mechanism and customizability of the training procedure such as modifying the neural network and the loss function or introducing noise. For the uninitiated, dealing with this complexity is daunting and dissuades adoption. As a result, much of prior work is limited to a single robot [1], [23], [28] or a few robots [16]. We reduce the learning curve and alleviate development effort with DeepRacer.

DeepRacer supports state-of-the-art deep RL algorithms [29], simulations with the OpenAI Gym [17] interface, distributed rollouts and integration with cloud services. We introduce a training mechanism that decouples RL policy updates from the rollouts, which enables independent scaling of the simulation cluster and supports popular simulators like Gazebo [30]. The DeepRacer 1/18th scale car is one realization of a physical robot in our platform that uses RL for navigating a race track with a fisheye lens camera. The car hardware includes a GPU for executing the neural network policy locally, live streams the camera view over WiFi, the compute battery supports ~6 hours of development time, and the car retails at $400. We have a corresponding robot model in simulation, along with rendering for multiple race tracks. We can train the RL policy with different simulation parameters and multiple tracks in parallel using distributed rollouts.

We learn an end-to-end policy for navigating a race track. We use a single grayscale camera image as observation and discretized throttle/steering as actions. We train in simulation using the Proximal Policy Optimization (PPO) algorithm [31], which can converge in <5 minutes and ~5000 simulation steps. With no pre-processing, real world data or expert labeling, the learned policy successfully transfers from simulation to real tracks (sim2real [32]). The entire process from training a policy to testing in the real car takes <30 minutes. Multiple models can be trained in parallel with on-demand compute and stored in the car. Thousands of users have designed their own reward functions, trained their models on our platform, and demonstrated real track navigation. To the best of our knowledge, this is the first demonstration of model-free RL based sim2real at scale.

DeepRacer serves as a testbed for many areas of RL research such as reducing sample complexity [33], sim2real [34] and generalizability [35]. The car can log camera images, inertial sensor measurements and policy decisions. Simulations can be randomized with different tracks, lighting, sensor and actuator noise. The learned policy can underfit/overfit to the simulation settings. We use a robust evaluation method to identify when the learned policy will generalize to the real world. We evaluate multiple checkpoints of the saved policy with domain randomization such as action noise and different starting points. Models that give good results in robust evaluation generalize well to the real world. Our policies trained with domain randomization generalize to multiple cars, tracks and to variations in speed, background, lighting, track shape, color and texture.

II. RELATED WORK

RL has been used in robotics for several decades [36], [37], [38], [39]. Initial works used low dimensional state spaces due to scalability challenges. RL concepts were generalized to high dimensional problems with deep networks [40], [41], [42]. High variance, sample complexity
and replicability challenges [43] in deep RL algorithms led to the development of simulators [44], benchmarks [17], [45] and libraries [46], [47]. We build upon these works to create a platform for experimentation with simulation and real robots.

Distributed Rollouts: Algorithms that use distributed rollouts, where multiple simulations are executed in parallel to collect experience data, were introduced to reduce training time [2], [20], [48]. OpenAI Baselines [47] uses OpenMPI [49] to support distributed gradient algorithms, where each worker computes gradients on the data it collects. OpenAI Rapid [2] generalizes this to a distributed system for the PPO algorithm and demonstrates sim2real transfer on dextrous manipulation. Flex [19] extends the same distribution mechanism to use GPUs for simulation and hence can run 750 humanoid MuJoCo simulations with a single GPU. Chebotar et al. [50] use Flex to demonstrate sim2real transfer for manipulation. Surreal [18] uses a decoupled rollout mechanism to support experience replay algorithms, where each worker stores the experience data in a buffer and a separate training worker computes gradients. Ray RLlib [21], [51] introduces a stateful actor framework to support distributed rollouts. DeepRacer integrates with the Intel Coach library [29] that supports >20 deep RL algorithms in an easy-to-use, modular interface. DeepRacer uses the same rollout mechanism as Surreal, and extends support to Gazebo. Similar to Rapid, DeepRacer can use different simulation settings for each worker and has separate evaluation workers that validate the performance of the current policy.

Sim2Real: Training RL policies in the real world is challenging due to high sample complexity and safety issues. Simulations alleviate these concerns and serve as a testbed to experiment with algorithms and debug software. However, sim2real transfer is challenging because of differences in dynamics and imagery, and because simulated models are just approximations of the real world [23], [24], [34]. Domain randomization, where simulation parameters are perturbed during training, has been used for successful sim2real transfer for various robotic tasks [2], [12], [50]. Methods include adding noise in dynamics [23], [2] and imagery [12], [52], learning model ensembles [53], [54], adding adversarial noise [25], [55] and assessing simulation bias [24]. Domain adaptation [56] has also been used for sim2real, particularly to address the visual reality gap [26], [57], [58], [59]. DeepRacer serves as a platform to reproduce and experiment with sim2real methods. We demonstrate various forms of domain randomization in our experiments. Navigation with the DeepRacer car can be structured from simple, low-speed lane following to complex tasks such as high speed racing or commuting in traffic.

Our distributed rollout mechanism facilitates iterative experimentation as policies converge faster and helps identify underfitting/overfitting. Prior sim2real works use a fixed number of simulation steps [2], [23], [60], [61]. We show that policies can both underfit and overfit to the simulation while training, as identified by prior works [24], [35], [62]. We use a separate robust evaluation to identify the policy checkpoints that are likely to transfer well to the real world.

TABLE I: Comparison of DeepRacer with contemporary RL and self-driving platforms (checkmark = supported, cross = not supported, half-mark = partial support; rendered below as Yes/No/Partial).

Platform | Deep RL | Distributed Rollouts | Simulation | Rigid Body Dynamics | Physical Robot | Sim2Real Demo | GPU on Robot | Robot Cost (USD)
AutoRally [63] | Yes | Yes | Yes | No | Yes | No | Yes | 10K
BARC [64] | No | No | No | No | Yes | No | No | 500
Blue [65] | No | No | No | No | Yes | No | No | 5K
CARLA [66] | Yes | Partial | Yes | Yes | No | Yes | Yes | -
DonkeyCar [67] | Yes | Partial | Yes | No | Yes | Yes | No | 200
Duckietown [68] | Yes | Partial | Yes | No | Yes | Yes | No | 150
F1/10 [69] | No | No | No | No | Yes | No | Yes | 3600
Fetch [1], [23] | Yes | Yes | Yes | Yes | Yes | Yes | No | 100K
Flex [19] | Yes | Yes | Yes | Yes | No | Yes | No | -
RACECAR [70] | Yes | Yes | No | No | No | No | Yes | 2600
MuSHR [71] | Yes | No | No | No | Yes | Yes | Yes | 900
Poppy [72] | Yes | Yes | Yes | Yes | Yes | Yes | No | 350
RLlib [21] | Yes | Yes | Yes | Yes | No | No | No | -
Surreal [18] | Yes | Yes | Yes | Yes | No | No | No | -
DeepRacer | Yes | Yes | Yes | Yes | Yes | Yes | Yes | 400

Sim2Real Navigation: Many works rely on simulators only for testing and use methods such as state estimation, motion planning and model predictive control (MPC) [68], [73], [74] for navigation. Other works have used imitation learning, where expert demonstrations are given either by a person [67], [75] or with an MPC algorithm [76], [77]. Kahn et al. [10] directly learn the RL policy in the real car, with a fixed maneuver when a collision occurs. Domain randomization and image segmentation in simulations have been used to close the visual reality gap with a model based controller [12], [66], [78]. Image pre-processing [79], learned embeddings [9] and depth cameras [80] have been used to achieve sim2real transfer. Bharadhwaj et al. [27] demonstrate sim2real transfer by mixing expert demonstrations with simulations. We observe that prior sim2real works rely on a model based controller for high speed navigation [78], [81] or achieve slow speeds because of poor transfer of dynamics [79], [80]. With DeepRacer, we demonstrate speeds of 1.6 m/s with a single grayscale monocular image as input and discretized steering/throttle as output. We use simple, non-recurrent networks for our policy and still demonstrate robustness in the real world to multiple cars, tracks, and variations in the environment. We also achieve slow speed (0.5 m/s) sim2real transfer with <5 minutes of training.

Table I compares DeepRacer with other platforms for RL, sim2real and autonomous driving. The other simulation platforms can also be used with DeepRacer. We provide an easy-to-use, economical and flexible platform with support for distributed RL, domain randomization and robust evaluation. DeepRacer tools have enabled us to replicate sim2real RL policy transfer with consistency and at scale.

III. AUTONOMOUS RACING WITH RL

In our formulation, the agent steers the car and the environment is the race track. The track is marked by white lanes, there is a single car on track with no obstacles and the
car only moves forwards. The image from the car's camera is the observation, and actions are the throttle/steering of the car. As the agent does not receive the full state, such as the track layout, this is a partially observed Markov Decision Process. An episode starts with the car somewhere on track and finishes when the car goes off-track or finishes a lap. The images from the camera are streamed at 15 fps, downsized to 160 x 120 pixels and converted to grayscale. We discretize the actions to 10 values, with 2 levels for throttle and 5 for steering. Users can customize this discretization, which gets mapped to low level controls. We use discrete actions as the mapping to low level control is non-linear and challenging to calibrate in the continuous space. We fix the maximum throttle in simulation and set it manually in the real car. We incentivize the agent to stay close to the center line of the track. If the car is at the edge of the track, a small deviation can off-road the car and the track is not visible in the image. Staying close to the center of the track leads to a stable policy. Users can customize this reward function. Figure 1 illustrates our problem formulation.

Fig. 1: Observation, action and reward for DeepRacer agent
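As an illustration of this formulation, a minimal Python sketch of a 10-value discrete action grid and a center-line reward is given below. The specific throttle/steering values and the keys of the params dictionary are assumptions for illustration; the released DeepRacer code defines its own action list and reward interface.

# Sketch only: 10 discrete actions (2 throttle levels x 5 steering levels)
# mapped to low level controls, plus a center-line reward.
THROTTLE_LEVELS = [0.5, 1.0]                       # fraction of the maximum throttle
STEERING_LEVELS = [-30.0, -15.0, 0.0, 15.0, 30.0]  # degrees

ACTIONS = [(t, s) for t in THROTTLE_LEVELS for s in STEERING_LEVELS]  # 10 actions

def action_from_index(index):
    """Map a discrete policy output to (throttle, steering) low level controls."""
    throttle, steering = ACTIONS[index]
    return throttle, steering

def reward_function(params):
    """Reward the agent for staying close to the center line of the track.
    The 'track_width' and 'distance_from_center' keys are illustrative."""
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    if distance_from_center <= 0.1 * track_width:
        return 1.0
    if distance_from_center <= 0.25 * track_width:
        return 0.5
    if distance_from_center <= 0.5 * track_width:
        return 0.1
    return 1e-3  # close to the edge, likely to go off-track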
A. Reinforcement Learning Algorithm

We use PPO, a state-of-the-art policy gradient algorithm [31]. The algorithm uses two neural networks during training – a policy network and a value network. The policy network decides which action to take given an image as input and the value network estimates the expected cumulative discounted reward given the image. The agent initializes a policy that takes random actions. The policy network is used to interact with the simulation environment to collect data. The resulting dataset is used to update the policy and value networks as per the algorithm's loss function. The updated policy is used to interact with the environment to collect more data and the training cycle continues until a time limit. The policy loss function maximizes the actions that give higher rewards on average, as given by the generalized advantage estimation algorithm [82], and applies a clipped importance sampling weight as the policy that collects the dataset is an older version of the policy being updated. The value loss function uses the mean squared error between the predicted value and the observed value. Only the policy network gets deployed in the real car. By default, we use three convolutional layers and two fully connected layers for both networks. We train a new policy every 20 episodes. The full list of hyperparameters is given in our source code.
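As a reference for the loss terms described above, a minimal NumPy sketch of the clipped importance-sampling policy loss (with advantages from generalized advantage estimation) and the mean squared error value loss is shown below. It is a simplified illustration of PPO [31], not the exact implementation used in the RL Coach library.

import numpy as np

def ppo_losses(new_log_probs, old_log_probs, advantages, value_preds, returns,
               clip_epsilon=0.2):
    """Simplified PPO losses: clipped surrogate policy loss + MSE value loss."""
    # Importance sampling ratio between the updated policy and the older
    # policy that collected the dataset.
    ratio = np.exp(new_log_probs - old_log_probs)
    clipped_ratio = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
    # Maximize the clipped surrogate objective, i.e. minimize its negative.
    policy_loss = -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
    # Value network regression target: the observed discounted returns.
    value_loss = np.mean((value_preds - returns) ** 2)
    return policy_loss, value_loss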
Fig. 2: Training the agent with DeepRacer distributed rollouts

IV. DEEPRACER DESIGN AND IMPLEMENTATION

We decouple the simulation data collection from the policy updates. We use RoboMaker [83] for our simulations with Gazebo and SageMaker [84] to train our policy with the RL Coach [29] library. Simulations help us train without manual effort. The decoupled training allows us to use separate machines which are specialized for simulations (e.g. license, Mac/Windows OS) and neural network training (e.g. GPU, large RAM) respectively. We also get the flexibility to launch multiple simulations, each with their own settings for domain randomization, as well as evaluate policies in parallel.

A. Training Workflow

Figure 2 shows the DeepRacer training workflow. The training starts by initializing the policy/value network models and hyper-parameters in SageMaker. The neural network models are saved in S3 [85], an object store service. RoboMaker initializes the simulation and the agent, and loads the models from S3. The agent interacts with the simulation over the OpenAI Gym interface. The agent takes actions a (steering/throttle) based on the observation o (camera image). The simulator updates the position of the car based on the action and returns the updated camera image and reward r. The experiences collected in the form of ⟨o_t, a_t, r_t, o_{t+1}⟩ are stored in Redis [86], an in-memory database. SageMaker trains the neural networks with the data collected in Redis and saves the models in S3. RoboMaker copies the model from S3 and creates more experience data. The cycle continues until training stops. The models in S3 are continually evaluated in a separate simulation to assess convergence and generalizability. Models in S3 can be deployed on the real car. While we show our results with the PPO algorithm, our architecture can be used for various experience replay based algorithms such as DQN [40], DDPG [87] and SAC [88]. RoboMaker can be replaced with other simulators that can integrate with the Gym interface.
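A minimal sketch of the decoupled rollout loop described above is shown below: a simulation worker steps a Gym-style environment with the current policy, collects ⟨o_t, a_t, r_t, o_{t+1}⟩ tuples and pushes them to Redis, while a separate trainer process consumes them. The Redis key name, the pickle serialization and the policy.act interface are assumptions for illustration, not the platform's exact implementation.

import pickle
import redis

def run_rollout_worker(env, policy, redis_host="localhost", episodes=20):
    """Collect experience with the current policy and push it to Redis.
    A separate training process pops these tuples, updates the policy and
    writes a new checkpoint that the rollout worker then reloads."""
    buffer = redis.Redis(host=redis_host)
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = policy.act(obs)                       # grayscale image -> discrete action
            next_obs, reward, done, _ = env.step(action)   # Gym-style step
            experience = (obs, action, reward, next_obs, done)
            buffer.rpush("experience_queue", pickle.dumps(experience))  # assumed key name
            obs = next_obs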
B. Training with Amazon SageMaker

SageMaker is a platform to train and deploy machine learning models at scale using the Jupyter Notebook [89] as interface. SageMaker integrates RL algorithms using the Coach
and RLlib [21] libraries that build on top of existing deep learning frameworks. SageMaker uses RL Coach to support the decoupled simulation based training used in DeepRacer, and RLlib for integrated simulation and training. The libraries are packaged in a Docker container [90] and training can be launched in a cluster of machines with different configurations (CPU/GPU/RAM). The training clusters are created on-demand and billed per second, freeing users from infrastructure maintenance. Metrics such as rewards per episode, the policy entropy and CPU/memory use are visualized, source code is saved and logs are recorded. Users can launch experiments in parallel and search across experiment metadata. In addition to autonomous racing, SageMaker contains RL examples for HVAC control, robot locomotion, portfolio management and more.

Fig. 3: We train in multiple tracks and evaluate with a replica track as well as a track made with duct tape. (a) Simulation tracks. (b) Camera view of simulation tracks. (c) Camera view of real world tracks.

C. Simulation with AWS RoboMaker

RoboMaker is a cloud service to develop, test and deploy robot software. It uses Gazebo for simulation. A robot model describes each component of the DeepRacer car - the chassis, wheels, camera, Ackermann steering - their dimensions, how they link together, and their properties such as mass and camera angle. We create our tracks and background environment in Blender, a 3D modeling software, and import them into Gazebo. We use the ODE physics engine, which simulates the laws of physics using the robot model and takes into account factors like collision, friction, acceleration, etc. A rendering engine, OGRE, visualizes the graphics. We use Gazebo plugins to add the camera and light sources. We use ROS [91] for communication between the agent and the simulation. The agent uses ROS to place the car in the track at the beginning of an episode, get images from the camera module, get the car's position and velocity, and send throttle and steering commands to control the car. Users can customize the simulation in Gazebo with their own robot models and environments.
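To make the ROS plumbing concrete, a hedged sketch of an agent-side interface is shown below. The topic names and the Twist message type are placeholders chosen for illustration; the actual DeepRacer robot model and simulation define their own topics and message types.

import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist

class SimAgentInterface:
    """Illustrative sketch of the agent's ROS interface to the simulation."""

    def __init__(self):
        rospy.init_node("deepracer_agent", anonymous=True)
        self.latest_image = None
        # Camera stream from the Gazebo camera plugin (topic name assumed).
        rospy.Subscriber("/camera/image_raw", Image, self._on_image)
        # Throttle/steering command publisher (topic and message type assumed).
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

    def _on_image(self, msg):
        self.latest_image = msg

    def send_action(self, throttle, steering):
        cmd = Twist()
        cmd.linear.x = throttle    # forward speed
        cmd.angular.z = steering   # steering command
        self.cmd_pub.publish(cmd)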
D. Sim2Real Calibration

We have matched the URDF robot model to the measured dimensions of the car. We compared images from the real camera and calibrated the height, angle and field of view of the simulation camera to match the real images. As the DeepRacer camera can capture 15 fps, we match the simulation environment to use the same frame rate and use a producer-consumer mechanism to ensure one action per image. We map the agent's action space to the motor control commands by measuring the steering angles and speed of the car under different settings. We have created a real world track that is identical in color, shape and dimensions to one of the simulation tracks. We use barricades around this track to reduce visual distractions. In addition, we have eight other tracks with varying shapes, backgrounds and textures.
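The producer-consumer coupling mentioned above can be sketched with a queue of size one: the camera callback is the producer and the policy loop is the consumer, so at most one action is issued per frame. This is a simplified illustration under those assumptions, not the platform's actual implementation.

import queue

frame_queue = queue.Queue(maxsize=1)  # holds at most the newest frame

def on_camera_frame(image):
    """Producer: called at 15 fps; skip the frame if the consumer is still busy."""
    try:
        frame_queue.put_nowait(image)
    except queue.Full:
        pass  # keep actions roughly one-to-one with images

def control_loop(policy, send_action):
    """Consumer: block until a fresh frame arrives, then issue exactly one action."""
    while True:
        image = frame_queue.get()          # one action per dequeued image
        throttle, steering = policy.act(image)
        send_action(throttle, steering)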
E. Calculating Rewards

We compute an ordered set of points along the middle of the track, called waypoints, to estimate the relative position of the car on the track. The track and the background are modeled as a polygon mesh. We separate the track mesh from the background and identify the border edges as those which belong to a single triangle. We get two boundaries corresponding to the inner and outer parts of the track by grouping the border vertices. We construct a bipartite graph from the two sets of vertices and compute the linear sum assignment using the Euclidean distance as edge length. This gives us border vertices parallel to each other on both sides of the track. The waypoints are the means of the vertices connected by each edge. The spline is the line joining the waypoints. The car starts an episode at a waypoint. We flag the car as off-track when it deviates from the spline by more than half the track width. We measure the car's progress by the relative distance it covers compared to the length of the spline.
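A compact sketch of this waypoint computation is given below, using SciPy's linear sum assignment to pair inner and outer border vertices by Euclidean distance. The array shapes and the nearest-waypoint off-track test follow the description above; the exact mesh-processing code in the platform may differ.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def compute_waypoints(inner_vertices, outer_vertices):
    """Pair inner/outer track border vertices and return center-line waypoints.
    inner_vertices and outer_vertices are arrays of shape (N, 2) and (M, 2)."""
    cost = cdist(inner_vertices, outer_vertices)         # Euclidean edge lengths
    inner_idx, outer_idx = linear_sum_assignment(cost)    # minimum-cost bipartite matching
    # Each waypoint is the midpoint of a matched pair of border vertices.
    return 0.5 * (inner_vertices[inner_idx] + outer_vertices[outer_idx])

def is_off_track(car_xy, waypoints, track_width):
    """Flag the car as off-track when it is farther than half the track width
    from the center line (approximated here by the nearest waypoint)."""
    distances = np.linalg.norm(waypoints - np.asarray(car_xy), axis=1)
    return distances.min() > 0.5 * track_width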
Fig. 4: DeepRacer Hardware Specifications

F. DeepRacer Hardware

Figure 4 gives an overview of the DeepRacer hardware. We have designed the car for experimentation while keeping the cost nominal. The Intel Atom processor with a built-in GPU can perform >15 inferences per second with our default five layer neural network. The motors are equipped with
electronic speed controllers. We can use the car as a regular computer with a monitor, mouse and keyboard connected via HDMI and USB. The camera connects over USB and there are three USB ports for extensions. The 13600 mAh compute battery lasts ~6 hours. The 1100 mAh drive battery lasts for ~45 minutes in typical experiments. The WiFi chip enables remote monitoring and programming. We built the car software on top of ROS. We can load multiple trained models over WiFi. We use Intel OpenVino to convert our Tensorflow models to an optimized binary for fast inference. The camera images are fed to the OpenVino inference engine and to a real-time video feed on a browser. There is a web UI for calibrating steering and throttle. The model inference results are converted to motor control commands based on the calibration and action space mapping. In addition, the browser has an interface for manual joystick-like control.

V. EVALUATION

We evaluate our track navigation policies extensively across multiple tracks, with domain randomization in both simulation and the real world. We have created a replica of Track A with the track printed on carpet with the same dimensions as in simulation. We place barriers around the track to reduce distractions and evaluate performance both with and without barriers, as well as at different speeds and lighting conditions. We also made a custom "tape track" with 2 inch white duct tape in our office corridor to test model robustness. The track is roughly 24 inches wide, 12 m in length, traverses both carpet and concrete, has multiple turns, and the car camera is exposed to clutter and bright lights in the background.

Fig. 5: Training with multiple rollout workers on (a) Track A with a maximum throttle of 1 m/s, (b) Track A with a maximum throttle of 1.67 m/s, and (c) Track B with a maximum throttle of 1.67 m/s. Progress on track is reported across two runs in a ml.p3.2xlarge instance in SageMaker, which has one NVIDIA V100 GPU. Each rollout is a separate simulation job in RoboMaker.

A. Training with Multiple Rollouts

We train policies with three different conditions: on Track A with a maximum throttle of 1 m/s, on Track A with throttle 1.67 m/s and on Track B with throttle 1.67 m/s. The task gets harder at higher speeds. Track B is more difficult to navigate because of a background with buildings and a higher number of turns. Each episode starts at a different waypoint so that all parts of the track are experienced by the policy. We use a ml.p3.2xlarge instance for training in SageMaker and run each experiment twice for 2 hours. Figure 5 shows the progress on track during training with different numbers of rollout workers. As we expect, more rollout workers lead to faster convergence. There are diminishing returns as we increase workers; 16 workers give a slightly faster convergence compared to 8. Somewhat surprisingly, the higher throttle of 1.67 m/s helped speed up convergence in Track A. We hypothesize that the agent collects more uniform experience with the faster speed and this helps with convergence. Track B takes longer to converge but follows similar trends as Track A.

B. Robust Evaluation

We test whether robust evaluation in simulation is indicative of real world performance. If true, we can identify when to stop training in simulation and avoid underfitting/overfitting. We can tune our hyper-parameters entirely in simulation and avoid extensive testing in the real world. We train policies with increasing levels of domain randomization and evaluate the policy in both simulation and real.

Our baseline case is trained on Track A with no domain randomization and a throttle of 1 m/s. For domain randomization, we train policies on Track A with (i) up to 10% uniform random noise added to steering and throttle (action noise), (ii) reversed direction of travel each episode (reverse), (iii) both action noise and reverse, and (iv) training on Track B with both action noise and reverse. For robust evaluation, we add uniform random noise to actions and evaluate from multiple starting positions and in both directions of travel on Track A. For naive evaluation, we evaluate on Track A with a fixed starting point without randomization. Both evaluations test each checkpoint 10 times in the simulator. We pick six policies during training from checkpoints 5 through 30, and test their sim2real performance in the Track A replica with 3 trials for each direction of travel. The model performance varies with speed, but it is difficult to maintain a constant speed due to changing battery levels and as the model switches between throttle levels. For sim2real experiments we ensure the model completes a lap in 18 to 22 seconds (0.8-1 m/s). In simulation, the models complete the lap in ~35 seconds, so we test the policy at about double speed on the real track.
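A hedged sketch of this robust evaluation loop is shown below: each saved checkpoint is run several times with up to 10% uniform action noise, randomized starting waypoints and both directions of travel, and the mean progress is used to rank checkpoints. The env.reset signature, the info "progress" key and the policy.act interface are illustrative assumptions.

import numpy as np

def noisy(action, noise_level=0.1):
    """Perturb throttle and steering by up to +/-10% uniform noise."""
    throttle, steering = action
    throttle *= 1.0 + np.random.uniform(-noise_level, noise_level)
    steering *= 1.0 + np.random.uniform(-noise_level, noise_level)
    return throttle, steering

def robust_evaluation(env, policy, trials=10):
    """Average track progress over randomized trials; used to rank checkpoints."""
    progress = []
    for trial in range(trials):
        obs = env.reset(start_waypoint=np.random.randint(env.num_waypoints),
                        reverse=bool(trial % 2))     # assumed reset signature
        done = False
        info = {}
        while not done:
            obs, reward, done, info = env.step(noisy(policy.act(obs)))
        progress.append(info["progress"])             # assumed info key
    return np.mean(progress)

# Checkpoints whose robust score is consistently high are the ones we
# transfer to the real car.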
Figure 6 shows the experimental results. The models that perform consistently well with robust evaluation also perform well on the real track. The models are particularly robust when a sequence of checkpoints performs well in the simulator. Reversing the direction of travel significantly improves model performance. Action noise does not help by itself, but improves performance when combined with reverse. Policies trained on Track B do not perform well for the checkpoints in
Figure 6, but with more training they start performing well in both robust evaluation and on the real track; policy checkpoint 35 traversed the real track successfully in 5 out of 6 trials.

Fig. 6: Robust evaluation with domain randomization as a criterion to select policy checkpoints for sim2real transfer.

The performance of the model changes dramatically at slower speeds (35 s lap, 0.5 m/s): even checkpoint 5 of the policy trained on Track A with no randomization traverses the real track. This model is trained in <5 minutes. All the above policies were trained in <1 hour with 4 rollouts.

C. Robust Sim2Real

We test the robustness of sim2real by training on multiple tracks, with multiple speeds, regularization and domain randomization in actions and observations. By default, we train on Track B with a throttle of 1 m/s, with action noise and reverse direction each episode. We pick model checkpoints based on performance in robust evaluation and test the policy on the Track A replica at two speeds (0.5 m/s, 1 m/s), with bright sunlight, with no barriers and on the tape track.

TABLE II: Sim2Real for policies trained with regularization and domain randomization. Results are out of 6 trials.

Training Track | Type of Training | Checkpoint # (Progress %) | Replica 0.5 m/s | Replica 1 m/s | Replica Sunlight 0.8-1 m/s | Replica No Barriers 0.8-1 m/s | Tape Track 0.7-0.9 m/s | Total
B | Default | 54 (100) | 5 | 3 | 0 | 1 | 3 | 12
C | Default | 53 (99.7) | 5 | 3 | 2 | 3 | 3 | 16
D | Default | 50 (100) | 5 | 3 | 3 | 3 | 0 | 13
B | L2=2e-5 | 53 (100) | 5 | 4 | 2 | 4 | 2 | 17
B | Dropout=0.3 | 49 (100) | 6 | 3 | 5 | 5 | 4 | 23
B | BatchNorm | 41 (100) | 4 | 2 | 1 | 4 | 2 | 13
B | Throttle=0.33 m/s | 21 (100) | 2 | 0 | 0 | 0 | 2 | 4
B | Throttle=1.67 m/s | 72 (91.1) | 6 | 4 | 5 | 6 | 2 | 23
B | Throttle=2.33 m/s | 79 (57.9) | 6 | 5 | 5 | 6 | 2 | 24
B, D | Default | 41 (100) | 3 | 3 | 3 | 3 | 1 | 13
B, D | Color Aug. | 49 (100) | 6 | 5 | 6 | 6 | 3 | 26
B, D | Translation | 37 (100) | 6 | 5 | 5 | 3 | 3 | 22
B, D | Shadow | 46 (100) | 5 | 3 | 5 | 3 | 2 | 18
B, D | Sharpen | 48 (89.5) | 4 | 4 | 5 | 4 | 0 | 17
B, D | Pepper | 53 (98.9) | 6 | 3 | 4 | 2 | 1 | 16
B, D | All image aug | 48 (100) | 5 | 6 | 3 | 4 | 0 | 18
C | Best combo, Throttle=2.33 m/s | 67 (91.7) | 6 | 6 | 6 | 5 | 4 | 27

Table II summarizes our results. Training on a different track gives good sim2real results, but the results vary from track to track. For regularization, we used the L2 norm, dropout, batch normalization and an entropy bonus added to the policy loss. We tested the models that give the best performance in robust evaluation. Reducing the entropy bonus to 0.001 (it is 0.1 by default) and dropout with probability 0.3 were particularly effective. Larger throttle speeds in training increased the robustness of the model dramatically but also increased convergence time in the presence of action noise. Mixing multiple tracks during training did not lead to an improvement in performance. We perturb the observation images with random color, horizontal translation, shadow, and salt and pepper noise, each with 0.2 probability. For random color, we combine the effects of random hue, saturation, brightness and contrast to create variations in observation. Random color was the most effective method for sim2real transfer.
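The observation perturbations listed above can be sketched as follows. The 0.2 application probabilities follow the text, while the magnitude ranges are illustrative assumptions; color jitter is approximated here as brightness/contrast jitter since the sketch operates on a single-channel image.

import numpy as np

def augment(image, rng=np.random):
    """Randomly perturb a grayscale observation with values in [0, 255]."""
    img = image.astype(np.float32)
    if rng.rand() < 0.2:                      # random color (intensity jitter here)
        img = img * rng.uniform(0.7, 1.3) + rng.uniform(-30, 30)
    if rng.rand() < 0.2:                      # horizontal translation
        img = np.roll(img, rng.randint(-10, 11), axis=1)
    if rng.rand() < 0.2:                      # shadow: darken a random vertical band
        lo, hi = sorted(rng.randint(0, img.shape[1], size=2))
        img[:, lo:hi] *= rng.uniform(0.4, 0.9)
    if rng.rand() < 0.2:                      # salt and pepper noise
        mask = rng.rand(*img.shape)
        img[mask < 0.02] = 0.0
        img[mask > 0.98] = 255.0
    return np.clip(img, 0, 255).astype(np.uint8)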
We combine the best of our parameters and train a model on Track C with L2 regularization, a lower entropy bonus, dropout, color randomization and a maximum throttle of 2.33 m/s. This model performed the best overall in our experiments. The model consistently completed 11 second laps (1.6 m/s) in our Track A replica.

VI. CONCLUSION

DeepRacer is an experimentation platform for sim2real reinforcement learning. The platform integrates state-of-the-art deep RL algorithms and multiple simulation engines with the OpenAI Gym interface, provides on-demand compute, and supports distributed rollouts that facilitate domain randomization and robust evaluation in parallel. We demonstrate the DeepRacer platform features with a 1/18th scale car that navigates a race track using reinforcement learning. We have created a calibrated robot model for the car in Gazebo along with multiple race tracks. We demonstrate robust sim2real navigation performance trained in DeepRacer with the PPO algorithm on both our real world replica track and a custom tape track. We achieve sim2real transfer on the real track with <5 minutes of training at slow speeds, and achieve speeds of 1.6 m/s using models trained with tuned parameters. Thousands of users have replicated our model training and demonstrated sim2real RL navigation.
REFERENCES

[1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
[2] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," arXiv preprint arXiv:1808.00177, 2018.
[3] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
[4] A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, "Sim-to-real robot learning from pixels with progressive nets," in Proceedings of the 1st Annual Conference on Robot Learning, ser. Proceedings of Machine Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds., vol. 78. PMLR, 13–15 Nov 2017, pp. 262–270. [Online]. Available: http://proceedings.mlr.press/v78/rusu17a.html
[5] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, no. 26, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/26/eaau5872
[6] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, "Feedback control for Cassie with deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1241–1246.
[7] S.-H. Hsu, S.-H. Chan, P.-T. Wu, K. Xiao, and L.-C. Fu, "Distributed deep reinforcement learning based indoor visual navigation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2532–2537.
[8] J. Choi, K. Park, M. Kim, and S. Seok, "Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5993–6000.
[9] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[10] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, "Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[11] H. J. Kim, M. I. Jordan, S. Sastry, and A. Y. Ng, "Autonomous helicopter flight via reinforcement learning," in Advances in Neural Information Processing Systems, 2004, pp. 799–806.
[12] F. Sadeghi and S. Levine, "CAD2RL: Real single-image flight without a single real image," arXiv preprint arXiv:1611.04201, 2016.
[13] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6015–6022.
[14] S. Christen, S. Stevsic, and O. Hilliges, "Guided deep reinforcement learning of control policies for dexterous human-robot interaction," arXiv preprint arXiv:1906.11695, 2019.
[15] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3052–3059.
[16] G. Sartoretti, J. Kerr, Y. Shi, G. Wagner, T. S. Kumar, S. Koenig, and H. Choset, "PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning," IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2378–2385, 2019.
[17] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[18] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, "Surreal: Open-source reinforcement learning framework and robot manipulation benchmark," in Conference on Robot Learning, 2018, pp. 767–782.
[19] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox, "GPU-accelerated robotic simulation for distributed reinforcement learning," in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 270–282. [Online]. Available: http://proceedings.mlr.press/v87/liang18a.html
[20] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 1407–1416. [Online]. Available: http://proceedings.mlr.press/v80/espeholt18a.html
[21] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 3053–3062. [Online]. Available: http://proceedings.mlr.press/v80/liang18b.html
[22] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," in Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.
[23] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[24] F. Muratore, F. Treede, M. Gienger, and J. Peters, "Domain randomization for simulation-based policy optimization with transferability assessment," in Conference on Robot Learning, 2018, pp. 700–713.
[25] A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese, "Adversarially robust policy learning: Active construction of physically-plausible perturbations," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 3932–3939.
[26] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, "DARLA: Improving zero-shot transfer in reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1480–1490.
[27] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, "A data-efficient framework for training and sim-to-real transfer of navigation policies," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 782–788.
[28] K. Hightower, B. Burns, and J. Beda, Kubernetes: Up and Running: Dive into the Future of Infrastructure, 1st ed. O'Reilly Media, Inc., 2017.
[29] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, "Reinforcement Learning Coach," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899
[30] N. Koenig and A. Howard, "Design and use paradigms for Gazebo, an open-source multi-robot simulator," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, Sep 2004, pp. 2149–2154.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[32] F. Sadeghi, A. Toshev, E. Jang, and S. Levine, "Sim2real viewpoint invariant visual servoing by recurrent control," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4691–4699.
[33] S. M. Kakade, "On the sample complexity of reinforcement learning," Ph.D. dissertation, University of London, London, England, 2003.
[34] N. Jakobi, P. Husbands, and I. Harvey, "Noise and the reality gap: The use of simulation in evolutionary robotics," in European Conference on Artificial Life. Springer, 1995, pp. 704–720.
[35] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, "Quantifying generalization in reinforcement learning," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 1282–1289. [Online]. Available: http://proceedings.mlr.press/v97/cobbe19a.html
[36] M. J. Matarić, "Reinforcement learning in the multi-robot domain," in Robot Colonies. Springer, 1997, pp. 73–83.
[37] M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda, "Purposive behavior acquisition for a real robot by vision-based reinforcement learning," Machine Learning, vol. 23, no. 2-3, pp. 279–303, 1996.
[38] V. Gullapalli, J. A. Franklin, and H. Benbrahim, "Acquiring robot skills via reinforcement learning," IEEE Control Systems Magazine, vol. 14, no. 1, pp. 13–24, 1994.
[39] S. Mahadevan and J. Connell, "Automatic programming of behavior-based robots using reinforcement learning," Artificial Intelligence, vol. 55, no. 2-3, pp. 311–365, 1992.
[40] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–503, 2016. [Online]. Available: http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
[42] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[43] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[44] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[45] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., "DeepMind control suite," arXiv preprint arXiv:1801.00690, 2018.
[46] O. S. Oguz, "Setting up a benchmark environment for deep reinforcement learning."
[47] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, "OpenAI Baselines," GitHub repository, 2017.
[48] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[49] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, concept, and design of a next generation MPI implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.
[50] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, "Closing the sim-to-real loop: Adapting simulation randomization with real world experience," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8973–8979.
[51] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., "Ray: A distributed framework for emerging AI applications," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577.
[52] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[53] I. Mordatch, K. Lowrey, and E. Todorov, "Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 5307–5314.
[54] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, "EPOpt: Learning robust neural network policies using model ensembles," arXiv preprint arXiv:1610.01283, 2016.
[55] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, "Robust adversarial reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2817–2826.
[56] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, "Visual domain adaptation: A survey of recent advances," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 53–69, 2015.
[57] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4243–4250.
[58] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, "Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12627–12637.
[59] G. J. Stein and N. Roy, "GeneSIS-RT: Generating synthetic images for training secondary real-world tasks," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7151–7158.
[60] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," arXiv preprint arXiv:1804.10332, 2018.
[61] J. Matas, S. James, and A. J. Davison, "Sim-to-real reinforcement learning for deformable object manipulation," in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 734–743. [Online]. Available: http://proceedings.mlr.press/v87/matas18a.html
[62] S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone, "Protecting against evaluation overfitting in empirical reinforcement learning," in 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 2011, pp. 120–127.
[63] B. Goldfain, P. Drews, C. You, M. Barulic, O. Velev, P. Tsiotras, and J. M. Rehg, "AutoRally: An open platform for aggressive autonomous driving," IEEE Control Systems Magazine, vol. 39, no. 1, pp. 26–55, 2019.
[64] J. Gonzales, F. Zhang, K. Li, and F. Borrelli, "Autonomous drifting with onboard sensors," in Advanced Vehicle Control: Proceedings of the 13th International Symposium on Advanced Vehicle Control (AVEC'16), Munich, Germany, September 13–16, 2016, p. 133.
[65] D. V. Gealy, S. McKinley, B. Yi, P. Wu, P. R. Downey, G. Balke, A. Zhao, M. Guo, R. Thomasson, A. Sinclair et al., "Quasi-direct drive for low-cost compliant robotic manipulation," arXiv preprint arXiv:1904.03815, 2019.
[66] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, ser. Proceedings of Machine Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds., vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16. [Online]. Available: http://proceedings.mlr.press/v78/dosovitskiy17a.html
[67] W. Roscoe, "Donkey Car: An opensource DIY self driving platform for small scale cars," http://donkeycar.com, 2019.
[68] L. Paull, J. Tani, H. Ahn, J. Alonso-Mora, L. Carlone, M. Cap, Y. F. Chen, C. Choi, J. Dusek, Y. Fang et al., "Duckietown: An open, inexpensive and flexible platform for autonomy education and research," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1497–1504.
[69] M. O'Kelly, V. Sukhil, H. Abbas, J. Harkins, C. Kao, Y. V. Pant, R. Mangharam, D. Agarwal, M. Behl, P. Burgio et al., "F1/10: An open-source autonomous cyber-physical platform," arXiv preprint arXiv:1901.08567, 2019.
[70] S. Karaman, A. Anders, M. Boulet, J. Connor, K. Gregson, W. Guerra, O. Guldner, M. Mohamoud, B. Plancher, R. Shin et al., "Project-based, collaborative, algorithmic robotics for high school students: Programming self-driving race cars at MIT," in 2017 IEEE Integrated STEM Education Conference (ISEC). IEEE, 2017, pp. 195–203.
[71] S. S. Srinivasa, P. Lancaster, J. Michalove, M. Schmittle, C. S. M. Rockett, J. R. Smith, S. Choudhury, C. Mavrogiannis, and F. Sadeghi, "MuSHR: A low-cost, open-source robotic racecar for education and research," arXiv preprint arXiv:1908.08031, 2019.
[72] M. Lapeyre, P. Rouanet, J. Grizou, S. Nguyen, F. Depraetre, A. Le Falher, and P.-Y. Oudeyer, "Poppy project: Open-source fabrication of 3D printed humanoid robot for science, education and art," 2014.
[73] N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots, "An online learning approach to model predictive control," in Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June 2019.
[74] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots,
and E. A. Theodorou, “Information theoretic mpc for model-based
reinforcement learning,” in 2017 IEEE International Conference on
Robotics and Automation (ICRA). IEEE, 2017, pp. 1714–1721.
[75] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End to
end learning for self-driving cars,” arXiv preprint arXiv:1604.07316,
2016.
[76] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and
B. Boots, “Agile autonomous driving using end-to-end deep imitation
learning,” in Robotics: science and systems, 2018.
[77] M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving
policy transfer via modularity and abstraction,” in Conference on Robot
Learning, 2018, pp. 1–15.
[78] A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, and
D. Scaramuzza, “Deep drone racing: From simulation to reality with
domain randomization,” arXiv preprint arXiv:1905.09727, 2019.
[79] Q. Zhang and T. Du, “Self-driving scale car trained by deep reinforce-
ment learning,” arXiv preprint arXiv:1909.03467, 2019.
[80] K. Wu, M. Abolfazli Esfahani, S. Yuan, and H. Wang, “Learn to steer
through deep reinforcement learning,” Sensors, vol. 18, no. 11, p. 3650,
2018.
[81] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg,
“Vision-based high-speed driving with a deep dynamic observer,”
IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1564–1571,
2019.
[82] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-
dimensional continuous control using generalized advantage estima-
tion,” in Proceedings of the International Conference on Learning
Representations (ICLR), 2016.
[83] "AWS RoboMaker," https://aws.amazon.com/robomaker/, 2019.
[84] "Amazon SageMaker," https://aws.amazon.com/sagemaker/, 2019.
[85] "Amazon S3," https://aws.amazon.com/s3/, 2019.
[86] "Redis," https://redis.io, 2019.
[87] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez,
Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with
deep reinforcement learning,” in 4th International Conference on
Learning Representations, ICLR 2016, San Juan, Puerto Rico, May
2-4, 2016, Conference Track Proceedings, 2016. [Online]. Available:
http://arxiv.org/abs/1509.02971
[88] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic:
Off-policy maximum entropy deep reinforcement learning with a
stochastic actor,” in Proceedings of the 35th International Conference
on Machine Learning, ser. Proceedings of Machine Learning
Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan,
Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 1861–1870. [Online].
Available: http://proceedings.mlr.press/v80/haarnoja18b.html
[89] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. E. Granger, M. Bussonnier,
J. Frederic, K. Kelley, J. B. Hamrick, J. Grout, S. Corlay et al.,
“Jupyter notebooks-a publishing format for reproducible computa-
tional workflows.” in ELPUB, 2016, pp. 87–90.
[90] D. Merkel, “Docker: lightweight linux containers for consistent de-
velopment and deployment,” Linux Journal, vol. 2014, no. 239, p. 2,
2014.
[91] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs,
R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating
system,” in ICRA workshop on open source software, vol. 3, no. 3.2.
Kobe, Japan, 2009, p. 5.
