Deep-Reinforcement-Learning-Based Semantic Navigation of Mobile Robots in Dynamic Environments

Linh Kästner, Cornelius Marx and Jens Lambrecht are with the Chair Industry Grade Networks and Clouds, Faculty of Electrical Engineering and Computer Science, Technical University of Berlin, Berlin, Germany. [email protected]

Abstract— Mobile robots have gained increased importance within industrial tasks such as commissioning, delivery or operation in hazardous environments. The ability to navigate autonomously and safely, especially within dynamic environments, is paramount in industrial mobile robotics. Current navigation methods depend on preexisting static maps and are error-prone in dynamic environments. Furthermore, for safety reasons, they often rely on hand-crafted safety guidelines, which makes the system less flexible and slow. Vision-based navigation and high-level semantics bear the potential to enhance the safety of path planning by creating links the agent can reason about for a more flexible navigation. On this account, we propose a reinforcement-learning-based local navigation system which learns navigation behavior based solely on visual observations to cope with highly dynamic environments. Therefore, we develop a simple yet efficient simulator - ARENA2D - which is able to generate highly randomized training environments and provide semantic information to train our agent. We demonstrate enhanced results in terms of safety and robustness over a traditional baseline approach based on the dynamic window approach.

I. INTRODUCTION

The demand for mobile robots has risen significantly due to their flexibility and the variety of use cases they can operate in. Tasks such as the provision of components, transportation, commissioning or work in hazardous environments are increasingly being executed by such robots [1], [2]. Safe and reliable navigation is essential in the operation of mobile robots. Typically, the navigation stack consists of self-localization, mapping, global and local planning modules. Simultaneous Localization and Mapping (SLAM) is most commonly conducted as part of the navigation stack. It is used to create a map of the environment from the sensor observations, upon which the further navigation relies. However, this form of navigation depends on the preexisting map, and its performance degrades in highly dynamic environments [3], [4]. Furthermore, it requires an exploration step to generate a map, which can be time consuming, especially for large environments.

Deep Reinforcement Learning (DRL) emerged as a solid alternative to tackle this challenge of navigation within dynamic environments [5]. A trial-and-error approach lets the agent learn its behavior based purely on its observations. The training process is accelerated with neural networks, and recent research showed remarkable results in mobile navigation. Yet, a main concern still lies in the safety aspect of navigation within human-robot collaboration. Most ubiquitous are hand-defined safety restrictions and measures, which are inflexible and result in slow navigation. Higher-level semantics bear the potential to enhance the safety of path planning by creating links the agent can reason about and consider for its navigation [6]. On this account, we propose a DRL-based local navigation system for autonomous navigation in unknown dynamic environments that works both in simulation and reality. To alleviate the problem of overfitting, we include highly random and dynamic components in our developed simulation engine called ARENA2D. For enhanced safety, we incorporate high-level semantic information to learn safe navigation behavior for specific classes. The result is an end-to-end DRL local navigation system which learns to navigate and avoid dynamic obstacles based directly on visual observations. The robot is able to reason about safety distances and measures by itself, based solely on its visual input. The main contributions of this work are the following:

• Proposal of a reinforcement learning local navigation system for autonomous robot navigation based solely on visual input.
• Proposal of an efficient 2D simulation environment - ARENA2D - to enable safe and generalizable navigation behavior.
• Evaluation of the performance in terms of safety and robustness in highly dynamic environments.

The paper is structured as follows. Sec. II gives an overview of related work. Sec. III presents the conceptual design of our approach, while Sec. IV describes the implementation and training process. Sec. V demonstrates the results. Finally, Sec. VI gives a conclusion.

II. RELATED WORK

A. Deep Reinforcement Learning for Navigation

With the advent of powerful neural networks, deep reinforcement learning (DRL) mitigated the bottleneck of tedious policy acquisition by accelerating the policy exploration phase using neural networks. Mnih et al. [10] first used neural networks to find an optimal policy for navigation behavior. They used high-level sensory input and proposed an end-to-end policy learning system termed deep Q-network (DQN). Bojarski et al. [11] applied the same techniques to mobile robot navigation by proposing an end-to-end method that maps raw camera pixels directly to steering commands within a simulation environment and showed the feasibility of an RL approach. Most recently, Pokle et al. [12] presented a system for autonomous robot navigation using deep neural networks to map observed Lidar sensor data to navigation commands.
B. Semantic Navigation

Semantic navigation has been a research direction for many years. Borkowski et al. [6] introduce a path-planning algorithm that semantically segments the nearby objects in the robot's environment. The researchers were able to show the feasibility and extract information about object classes to consider for mobile robot navigation. Wang et al. [14] propose a three-layer perception framework for achieving semantic navigation. The proposed network uses visual information from an RGB camera to recognize the semantic region the agent is currently located in and to generate a map. The work demonstrates that the robot can correct its position and orientation by recognizing current states from visual input. However, dynamic elements such as the presence of pedestrians are not taken into consideration. Zhi et al. [15] propose a learning-based approach for semantic interpretation of visual scenes. The authors present an end-to-end vision-based exploration and mapping approach which builds semantic maps on which the navigation is based. One limitation of their method is the assumption of perfect odometry, which is hard to achieve on real robots. Zhu et al. [16] presented a semantic navigation framework where the agent was trained with 3D scenes containing real-world objects. The researchers were able to transfer the algorithm to the real robot, which could navigate to specific real-world objects like doors or tables. Furthermore, they showed the potential of using semantics for navigation without any hand-crafted feature mapping, working solely on visual input. The training within the 3D environment, however, is resource-intensive and requires a large amount of computational capacity. Our work follows a more simplified way of incorporating semantic information. We include different object classes in our 2D simulation environment and train the agent with specific policies which should shape the behavior of the robot when encountering the specific classes. Furthermore, we transfer the algorithms to the real robot.

III. CONCEPTUAL DESIGN

We propose a DRL-based local navigation system which maps observed sensory input directly to robot actions for obstacle avoidance and safe navigation in dynamic environments. Furthermore, we aspire to explore the safety enhancements using semantic information. More specifically, rules based on detected nearby objects are defined and incorporated into the reward functions: the mobile robot has to keep a distance of 0.8 meters from detected humans and 0.3 meters from collaborating robots. The training is accelerated with neural networks and built upon a deep Q-network (DQN), which maps states to so-called Q-values that each state-action pair possesses. Therefore, we employ the proposed DRL workflow described in [17].

Fig. 1: Design of the navigation system

A. Design of the Proposed Navigation System

The general workflow of our system is illustrated in Fig. 1. The agent is trained by the RL algorithm within simulation by analyzing states, actions and rewards. For our use case, the states are represented by laser scan observations and object detections, while actions are the possible robot movements. Within the training stage, this information is simulated in our simulation environment. Within the deployment stage, an RGB camera with integrated object detection and a 360-degree 2D Lidar scanner of the Turtlebot3 deliver the required data. The core component of our system is the neural network which optimizes the Q-function. We input the data into the neural network, which maps the state-action tuples to Q-values with the maximal reward. Thus, the agent learns optimal behavior given a set of states and actions. In addition, we integrate several optimization modules to accelerate the training process even further. Finally, in the deployment stage, the RL model is deployed to the real robot using a proposed deployment infrastructure consisting of a top-down camera to detect the goal and surrounding objects. As a middleware for communication between all entities, ROS Kinetic on Ubuntu 16.04 is used. For the deployment of the trained algorithms to the real robot, several challenges have to be considered. Unlike in simulation, the robot does not know when the goal is reached. Therefore, we propose a solution using object detection and markers with the aforementioned global camera. The problem of inaccurate and noisy laser scan data is considered within our simulation environment, where we include a module to add several levels of noise to the laser scan data. Furthermore, we included dynamic obstacles and randomness in the simulation to alleviate the differences between the real and the simulation environment.
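As an illustration of this workflow, the following is a minimal sketch of the deployment-side control loop, assuming the standard Turtlebot3 ROS topics (/scan, /cmd_vel); the detection helper, the file name of the trained network and the mapping from discrete actions to velocity commands are hypothetical placeholders, since they are not specified here.

```python
# Minimal sketch of the deployment-side loop on the real robot. Assumptions:
# standard Turtlebot3 topics /scan and /cmd_vel, a trained Q-network stored in
# "dqn_semantic.pt", and placeholder helpers for the semantic detections and
# the discrete-action-to-velocity mapping (neither is specified in the text).
import rospy
import torch
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

model = torch.load("dqn_semantic.pt")        # assumed file name of the trained net
model.eval()

latest_scan = None

def scan_callback(msg):
    global latest_scan
    latest_scan = list(msg.ranges)            # 360 range values, one per degree

def get_semantic_detections():
    # Placeholder: would query the pose estimation / object detection node and
    # return (distance, angle) for the nearest human and robot; defaults otherwise.
    return [10.0, 0.0, 10.0, 0.0]

def action_to_twist(a):
    # Hypothetical discrete action set; the text only states that there are 7 actions.
    twist = Twist()
    twist.linear.x = [0.0, 0.2, 0.2, 0.2, -0.1, 0.0, 0.0][a]
    twist.angular.z = [0.0, 0.0, 0.5, -0.5, 0.0, 0.5, -0.5][a]
    return twist

rospy.init_node("drl_local_planner")
rospy.Subscriber("/scan", LaserScan, scan_callback)
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    if latest_scan is not None:
        state = torch.tensor(latest_scan + get_semantic_detections()).float()
        action = int(model(state.unsqueeze(0)).argmax(dim=1))
        cmd_pub.publish(action_to_twist(action))  # greedy action from the Q-network
    rate.sleep()
```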
IV. IMPLEMENTATION

In the following chapter, each module of our proposed navigation system is presented in detail.

A. Training Algorithm

The basic training loop is based on the suggestions from [17] and employs deep Q-learning. We rely on three major techniques: a replay buffer and an epsilon-greedy strategy to cope with the exploration-exploitation problem, and a target network which we use to stabilize the training process in terms of robustness. To abstract the algorithm for our simulation environment, we split it into two parts: a simulation step, which interacts with the simulation environment, and a learning step, which is responsible for the refinement of the neural network model. The implementation of those steps is described by Algorithms 1 and 2.

In the PreStep algorithm, a random action is chosen with a probability of ε. Otherwise, the action with the maximum Q-value, according to the current network's estimation, is retrieved. Using that action, a simulation step is performed, revealing the new state and the reward gained. The new state, along with the previous state, reward, action and a flag indicating whether the episode has ended, is stored in the replay buffer R.

Algorithm 1 PreStep
  if random() < ε then
      a ← random_action()
  else
      a ← argmax_a Q(s, a)
  end if
  (s′, r) ← simulation_step(a)
  if episode is over then
      R.insert((s, s′, a, r, TRUE))
      s ← simulation_reset()
  else
      R.insert((s, s′, a, r, FALSE))
      s ← s′
  end if
  t ← t + 1

The actual training takes place in the PostStep algorithm. Here, a random batch is sampled from the replay buffer and the mean squared error (MSE) loss is calculated using the Bellman equation. Every Nsync frames, the weights of the network Q are copied over to the target network Q̂. Using stochastic gradient descent (SGD) optimization, the weights of the network Q are optimized according to the MSE loss L calculated for every batch sample. Finally, the epsilon value is updated according to the current step t. Thereby, epsilon denotes the randomness of the executed actions in each step, which ensures an efficient trade-off between exploration and exploitation. PreStep and PostStep are called in a loop until the network converges.

Algorithm 2 PostStep
  B ← R.random_batch(Bsize)
  L ← new Array(Bsize)
  for i in B do
      (si, s′i, ai, ri, di) ← B[i]
      if di = TRUE then
          y ← ri
      else
          y ← ri + γ max_a Q̂(s′i, a)
      end if
      L[i] ← (Q(si, ai) − y)²
  end for
  Q.optimize(L)
  if t mod Nsync = 0 then
      Q̂ ← Q
  end if
  ε ← max(εmin, 1 − t/tmax)
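A compact Python sketch of how PreStep and PostStep could be combined into a single loop is given below. It is only an illustration of the described scheme, assuming a generic env object with reset/step methods and two PyTorch networks q_net and target_net passed in by the caller; the hyperparameter defaults are placeholders rather than the values used in this work.

```python
# Illustrative DQN loop combining PreStep (interaction) and PostStep (learning),
# following the scheme of Algorithms 1 and 2. env, q_net and target_net are
# passed in; env is assumed to expose reset()/step()/sample_random_action().
import random
from collections import deque
import torch
import torch.nn.functional as F

def train(env, q_net, target_net, max_steps=200_000, gamma=0.95,
          batch_size=64, n_sync=1000, eps_min=0.05):
    replay = deque(maxlen=100_000)                   # replay buffer R
    optimizer = torch.optim.Adam(q_net.parameters(), lr=0.0025)
    state, eps = env.reset(), 1.0

    for t in range(1, max_steps + 1):
        # --- PreStep: epsilon-greedy interaction with the simulation ---
        if random.random() < eps:
            action = env.sample_random_action()
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax())
        next_state, reward, done = env.step(action)
        replay.append((state, next_state, action, reward, float(done)))
        state = env.reset() if done else next_state

        # --- PostStep: sample a batch and minimise the Bellman error ---
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            s, s2, a, r, d = (torch.as_tensor(x, dtype=torch.float32)
                              for x in zip(*batch))
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - d)
            loss = F.mse_loss(q_sa, y)               # mean of (Q(s,a) - y)^2
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if t % n_sync == 0:                          # copy weights to target net
            target_net.load_state_dict(q_net.state_dict())
        eps = max(eps_min, 1.0 - t / max_steps)      # linear epsilon decay
```

In our system, this loop runs against ARENA2D rather than the generic env object used here.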
Fig. 2: Architecture of the fully connected neural network

B. Neural Network Design

We use fully connected neural networks for our deep Q-learning system. Our model consists of 4 hidden, fully connected layers and is depicted in Fig. 2. The input of the first fully connected layer represents the laser scan data and the information about nearby humans or robots: 360 neurons serve as input values, one for each degree of the laser scanner. The nearby dynamic objects are each represented by two neurons encoding their distance and angle relative to the robot. For simplicity, we restrict this input to one object for each of the classes human and robot. After 2 dense layers and a dropout layer, the resulting output is 7 neurons denoting the robot's actions. Adam is chosen as the optimizer with an adaptive learning rate of 0.0025. As loss function, the mean squared error (MSE) is chosen.
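As a reference, a minimal PyTorch sketch of such a network is shown below. Only the input size (360 laser values plus distance and angle for one human and one robot), the 7 output neurons and the Adam/MSE choice are taken from the text; the hidden-layer widths and the dropout probability are not specified here and are placeholders.

```python
# Sketch of the fully connected Q-network described above. Hidden-layer sizes
# and the dropout rate are assumptions; input/output sizes follow the text:
# 360 laser beams + (distance, angle) for one human and one robot -> 7 actions.
import torch
import torch.nn as nn

class SemanticDQN(nn.Module):
    def __init__(self, n_laser=360, n_semantic=4, n_actions=7, hidden=64, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_laser + n_semantic, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),            # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

q_net = SemanticDQN()
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.0025)  # as stated in the text
loss_fn = nn.MSELoss()
```

The exact layer configuration used in our system follows Fig. 2.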
C. Training Stage with ARENA2D

We apply the presented methods in the training stage of the robot through simulations to generate a model which, subsequently, can be used on the real robot. Therefore, we developed a simple yet efficient 2D training environment - ARENA2D - with a large number of built-in capabilities for performance enhancements and the exploration of training settings. The simulation environment is depicted in Fig. 3. Within the simulation environment, we include static as well as dynamic components and create several stages with ascending difficulty. In total, we executed the training on 3 different stages, which we denote as static, dynamic and semantic. The static stage contains only static obstacles, whereas the dynamic stage includes moving obstacles. These two stages use a neural network which does not include the neurons indicating the distance of the human and the collaborating robot as additional input. The semantic stage uses the network presented in Fig. 2. If the robot hits a wall or times out, it is reset to the center of the stage. For human obstacles, we included an additional stop rate which lets the obstacle stop at random positions for 2 seconds, thus simulating human behavior.

For the reward calculation, α denotes the angle between the robot and the goal. The rewards and penalties are listed in Table II.

TABLE II: Rewards and penalties for training
Event                        Description       Reward   Ep. Over
Goal reached                 yes               +100     yes
Moving towards goal          |α| ≤ 30°         +0.1     no
Wall hit                     Robot hit wall    −100     yes
Moving away from goal        |α| > 30°         −0.2     no
Violate distance to human    d < 0.7 m         −10      no
Violate distance to robot    d < 0.2 m         −10      no
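A minimal sketch of how a per-step reward following Table II could be computed is given below; the function signature and the way events are detected (flags passed in by the simulator) are assumptions for illustration, while the numerical values are taken from the table.

```python
# Reward sketch following Table II. The event flags and distances are assumed to
# be provided by the simulator; only the numeric values come from the table.
def step_reward(goal_reached, wall_hit, angle_to_goal_deg,
                dist_to_human, dist_to_robot):
    """Return (reward, episode_over) for one simulation step."""
    if goal_reached:
        return 100.0, True
    if wall_hit:
        return -100.0, True

    reward = 0.1 if abs(angle_to_goal_deg) <= 30.0 else -0.2   # heading term
    if dist_to_human < 0.7:                                     # semantic safety terms
        reward -= 10.0
    if dist_to_robot < 0.2:
        reward -= 10.0
    return reward, False
```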
Hyperparameters: To determine the optimal hyperparameters, we conducted several training runs and adjusted the hyperparameters manually according to our literature research as well as our experience. The optimal hyperparameters used for all further training runs are listed in Table III.
For the object detection module, we utilized a pose estimation module working on RGB input based on SSPE [21]. Thus, the position and distance of humans or collaborating robots can be detected globally. Currently, the model is able to localize humans, the Kuka Youbot and the Turtlebot models Burger and Waffle. We fine-tune the model by training on a human and robot RGB dataset utilizing the pipeline proposed in our previous work [22]. The results are transmitted to the Observation Packer node to be considered for the agent. Subsequently, the DRL algorithm will refine the trajectory of the robot.

The neurons representing the distance of the agent to humans and other robots are initially set to a distance of 10 meters to make sure the neural network assumes no nearby human or robot. Once a robot or human is detected and localized, the estimated position is given as input to the neural network via the Pose Estimation node.
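The following is a small sketch of how such an observation vector could be assembled from the laser scan and the detections, with the 10 m default used when no human or robot is visible; the dictionary-based detection format is an assumption for illustration.

```python
# Sketch of the observation packing: 360 laser ranges followed by
# (distance, angle) for the human and the robot class. When a class is not
# detected, the default distance of 10 m (and angle 0) is used, as described
# above. The detection dictionary format is an assumption.
DEFAULT_DISTANCE = 10.0

def pack_observation(laser_ranges, detections):
    """laser_ranges: 360 floats; detections: {"human": (dist, angle), ...}."""
    assert len(laser_ranges) == 360
    state = list(laser_ranges)
    for cls in ("human", "robot"):
        dist, angle = detections.get(cls, (DEFAULT_DISTANCE, 0.0))
        state.extend([dist, angle])
    return state  # length 364, matching the network input

# Example: no detections yet -> semantic part falls back to the defaults.
obs = pack_observation([3.5] * 360, {})
```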
Conducted Experiments

We conduct several experiments with the model in different real environments to compare our method against the traditional local planner of the Turtlebot3 navigation package, which uses a preexisting map of the environment with static obstacles, and against an algorithm without semantic rules. The setup of the experiments is illustrated in Fig. 5. We tested different setups with static as well as dynamic components, like moving humans and other robots, and placed 10 different goal positions ranging from 0.2 m to 2.5 m distance. The start position of the robot was the same for every run. For each approach, we conducted 30 measurements, consisting of 3 runs per goal position.

Fig. 5: Test scenarios for the conducted experiments

V. RESULTS AND EVALUATION

In the following, the results from our conducted experiments are presented. The deployment of the models to the real robot was without any difficulties, and we compare our approaches with the traditional local planner of the Turtlebot3 navigation package in terms of relevant metrics such as distance, speed, error rate and safety of the navigation. The results are listed in Table IV. The distance metric indicates the efficiency of the path planner and is obtained through the odometry topic. We measured the time each approach required to reach the goal. The safety rate is calculated from the total number of collisions the robot had with static or dynamic obstacles while still reaching the goal. Robustness describes how many times the robot failed to reach the goal, either due to a failed path planning resulting in a navigation stack shutdown, or because the robot pursued a completely wrong direction and moved out of the arena. In total, we placed 10 different goal positions, and for each goal, 3 measurements were carried out for all approaches. If one run resulted in a failure, this was added to the error rate count and another measurement was conducted, such that finally there are 30 measurements for each approach to calculate the mean distances and speed. The error rate is calculated as the percentage of failed to successful runs. Table IV lists the results for each approach. It can be observed that our approaches outperform the traditional local planner of the robot in terms of speed and distance. Furthermore, our methods eliminate the need to generate a map, which is necessary for the SLAM packages on which the global and local planner rely.
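As a small illustration of the evaluation protocol described above, the error rate can be read as the ratio of failed runs to the 30 successful runs per approach; the helper below is only a sketch of that bookkeeping.

```python
# Sketch of the evaluation bookkeeping: 10 goals x 3 successful runs = 30
# measurements per approach; failed runs are repeated and counted separately.
def summarize(successful_distances, successful_times, n_failed):
    n_success = len(successful_distances)            # expected to be 30
    return {
        "mean_distance_m": sum(successful_distances) / n_success,
        "mean_time_s": sum(successful_times) / n_success,
        "error_rate_pct": 100.0 * n_failed / n_success,
    }

# Example: 5 failures over 30 successful runs -> error rate of about 16.7 %.
print(summarize([4.7] * 30, [15.7] * 30, 5))
```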
TABLE IV: Comparison of navigation approaches
Metric           Trad.   DRL Stat.   DRL Dyn.   DRL Sem.
Distance [m]     4.72    3.71        4.1        5.28
Speed [s]        15.7    11.7        12.49      17.9
Error Rate [%]   16.66   10          3.33       0
Obstacles hit    6.4     4.8         3.2        0

The model trained with semantic information achieves the best performance in terms of obstacle avoidance and had no collisions in all our test runs. However, this comes at the cost of longer distances and times, because the robot keeps a larger distance when encountering a human, sometimes driving backwards. The greater distance alleviates the high number of collisions that were observed in our previous work, where we mitigated the issue by training in highly dynamic environments. The additional semantic information enhances this effect even further, as indicated in Table IV. Notably, the transfer of the simulated agent to the real environment did not cause major differences, even though in the simulation environment only round obstacles were deployed as dynamic obstacles. Our agent could generalize its behavior to all obstacles, both static and dynamic, and thus still managed to avoid the objects and keep a safe distance to the human. For a more visual demonstration of our experiments, we refer to our demonstration video, which is available at https://youtu.be/KqHkqMqyStM.

VI. CONCLUSION

We proposed a deep-reinforcement-learning-based local navigation system for autonomous robot navigation within highly dynamic environments. Therefore, we developed a simple yet efficient simulation engine from scratch, which showed fast and efficient training and feasibility in transferring the models to the real robot. Our navigation algorithm works solely on visual input and eliminates the need for any additional map. Furthermore, we explored the potential of semantic information by incorporating semantic classes such as human and robot, and concluded safety enhancements for the navigation. This will be extended in our further work to include more classes such as long corridors, doors or restricted areas. For the deployment into the real environment, a framework was proposed to integrate the DRL algorithms on the real robot using marker detection and odometry data. Thereby, we ease the transferability of simulated models and enable a map-independent solution. The results were remarkable both in static as well as dynamic environments and surpass the traditional RRT-PWA baseline approach in terms of safety and robustness. For future work, we plan to incorporate more semantic classes such as long corridors, doors or restricted areas into the training environment to enhance safety and the overall performance even further. Additionally, an extension of our framework with more capabilities and features, e.g. including more reinforcement learning algorithms, recurrent modules and continuous actions, is planned.

REFERENCES

[1] V. Paelke, "Augmented reality in the smart factory: Supporting workers in an industry 4.0 environment," in Proceedings of the 2014 IEEE Emerging Technology and Factory Automation (ETFA). IEEE, 2014, pp. 1–4.
[2] R. Siegwart, I. R. Nourbakhsh, D. Scaramuzza, and R. C. Arkin, Introduction to Autonomous Mobile Robots. MIT Press, 2011.
[3] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[4] M. S. Bahraini, A. B. Rad, and M. Bozorg, "SLAM in dynamic environments: A deep learning approach for moving object tracking using ML-RANSAC algorithm," Sensors, vol. 19, no. 17, p. 3699, 2019.
[5] J. Zeng, R. Ju, L. Qin, Y. Hu, Q. Yin, and C. Hu, "Navigation in unknown dynamic environments based on deep reinforcement learning," Sensors, vol. 19, no. 18, p. 3837, 2019.
[6] A. Borkowski, B. Siemiatkowska, and J. Szklarski, "Towards semantic navigation in mobile robotics," in Graph Transformations and Model-Driven Engineering. Springer, 2010, pp. 719–748.
[7] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 3. IEEE, 2004, pp. 2619–2624.
[8] D. Vasquez, B. Okal, and K. O. Arras, "Inverse reinforcement learning algorithms and features for robot navigation in crowds: An experimental comparison," in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1341–1346.
[9] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in AAAI, vol. 8. Chicago, IL, USA, 2008, pp. 1433–1438.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[11] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[12] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese et al., "Deep local trajectory replanning and control for robot navigation," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5815–5822.
[13] A. Nüchter, H. Surmann, K. Lingemann, and J. Hertzberg, "Semantic scene analysis of scanned 3D indoor environments," in VMV, 2003, pp. 215–221.
[14] L. Wang, L. Zhao, G. Huo, R. Li, Z. Hou, P. Luo, Z. Sun, K. Wang, and C. Yang, "Visual semantic navigation based on deep learning for indoor mobile robots," Complexity, vol. 2018, 2018.
[15] X. Zhi, X. He, and S. Schwertfeger, "Learning autonomous exploration and mapping with semantic vision," in Proceedings of the 2019 International Conference on Image, Video and Signal Processing, 2019, pp. 8–15.
[16] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[17] M. Lapan, Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing, 2018, pp. 125–214.
[18] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[19] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[20] F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer, "Speeded up detection of squared fiducial markers," Image and Vision Computing, vol. 76, pp. 38–47, 2018.
[21] B. Tekin, S. N. Sinha, and P. Fua, "Real-time seamless single shot 6D object pose prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 292–301.
[22] L. Kästner, D. Dimitrov, and J. Lambrecht, "A markerless deep learning-based 6 degrees of freedom pose estimation for mobile robots using RGB data," arXiv preprint arXiv:2001.05703, 2020.