Deep-Reinforcement-Learning-Based Semantic Navigation of Mobile Robots in Dynamic Environments

Linh Kästner, Cornelius Marx and Jens Lambrecht are with the Chair Industry Grade Networks and Clouds, Faculty of Electrical Engineering and Computer Science, Technical University of Berlin, Berlin, Germany. [email protected]

Abstract— Mobile robots have gained increased importance within industrial tasks such as commissioning, delivery or operation in hazardous environments. The ability to navigate autonomously and safely, especially within dynamic environments, is paramount in industrial mobile robotics. Current navigation methods depend on preexisting static maps and are error-prone in dynamic environments. Furthermore, for safety reasons, they often rely on hand-crafted safety guidelines, which makes the system less flexible and slow. Vision-based navigation and high-level semantics bear the potential to enhance the safety of path planning by creating links the agent can reason about for a more flexible navigation. On this account, we propose a reinforcement-learning-based local navigation system which learns navigation behavior based solely on visual observations to cope with highly dynamic environments. Therefore, we develop a simple yet efficient simulator - ARENA2D - which is able to generate highly randomized training environments and provide semantic information to train our agent. We demonstrate enhanced results in terms of safety and robustness over a traditional baseline approach based on the dynamic window approach.

I. INTRODUCTION

The demand for mobile robots has risen significantly due to their flexibility and the variety of use cases they can operate in. Tasks such as the provision of components, transportation, commissioning or work in hazardous environments are increasingly being executed by such robots [1], [2]. Safe and reliable navigation is essential in the operation of mobile robots. Typically, the navigation stack consists of self-localization, mapping, global and local planning modules. Simultaneous Localization and Mapping (SLAM) is most commonly conducted as part of the navigation stack. It is used to create a map of the environment from the sensor observations, upon which the further navigation relies. However, this form of navigation depends on the preexisting map, and its performance degrades in highly dynamic environments [3], [4]. Furthermore, it requires an exploration step to generate a map, which can be time consuming, especially for large environments.

Deep Reinforcement Learning (DRL) emerged as a solid alternative to tackle this challenge of navigation within dynamic environments [5]. A trial-and-error approach lets the agent learn its behavior based purely on its observations. The training process is accelerated with neural networks, and recent research showed remarkable results in mobile navigation. Yet, a main concern still lies in the safety aspect of navigation within human-robot collaboration. Most ubiquitous are hand-defined safety restrictions and measures, which are inflexible and result in slow navigation. Higher-level semantics bear the potential to enhance the safety of path planning by creating links the agent can reason about and consider for its navigation [6]. On this account, we propose a DRL-based local navigation system for autonomous navigation in unknown dynamic environments that works both in simulation and reality. To alleviate the problem of overfitting, we include highly random and dynamic components in our developed simulation engine called ARENA2D. For enhanced safety, we incorporate high-level semantic information to learn safe navigation behavior for specific classes. The result is an end-to-end DRL local navigation system which learns to navigate and avoid dynamic obstacles based directly on visual observations. The robot is able to reason about safety distances and measures by itself, based solely on its visual input. The main contributions of this work are the following:

• Proposal of a reinforcement learning local navigation system for autonomous robot navigation based solely on visual input.
• Proposal of an efficient 2D simulation environment - ARENA2D - to enable safe and generalizable navigation behavior.
• Evaluation of the performance in terms of safety and robustness in highly dynamic environments.

The paper is structured as follows. Sec. II gives an overview of related work. Sec. III presents the conceptual design of our approach, while Sec. IV describes the implementation and training process. Sec. V demonstrates the results. Finally, Sec. VI gives a conclusion.

II. RELATED WORK

A. Deep Reinforcement Learning for Navigation

With the advent of powerful neural networks, deep reinforcement learning (DRL) mitigated the bottleneck of tedious policy acquisition by accelerating the policy exploration phase using neural networks. Mnih et al. [10] first used neural networks to find an optimal policy for navigation behavior. They used high-level sensory input and proposed an end-to-end policy learning system termed deep Q-network (DQN). Bojarski et al. [11] applied the same techniques to mobile robot navigation by proposing an end-to-end method that maps raw camera pixels directly to steering commands within a simulation environment and showed the feasibility of an RL approach. Most recently, Pokle et al. [12] presented a system for autonomous robot navigation using deep neural networks to map observed Lidar sensor data to navigation commands.
B. Semantic Navigation

Semantic navigation has been a research direction for many years. Borkowski et al. [6] introduce a path-planning algorithm that semantically segments the nearby objects in the robot's environment. The researchers were able to show the feasibility and extract information about object classes to consider for mobile robot navigation. Wang et al. [14] propose a three-layer perception framework for achieving semantic navigation. The proposed network uses visual information from an RGB camera to recognize the semantic region the agent is currently located in and to generate a map. The work demonstrates that the robot can correct its position and orientation by recognizing current states from visual input. However, dynamic elements such as the presence of pedestrians are not taken into consideration. Zhi et al. [15] propose a learning-based approach for semantic interpretation of visual scenes. The authors present an end-to-end vision-based exploration and mapping approach which builds semantic maps on which the navigation is based. One limitation of their method is the assumption of perfect odometry, which is hard to achieve on real robots. Zhu et al. [16] presented a semantic navigation framework where the agent was trained with 3D scenes containing real-world objects. The researchers were able to transfer the algorithm to the real robot, which could navigate to specific real-world objects like doors or tables. Furthermore, they showed the potential of using semantics for navigation without any hand-crafted feature mapping, working solely on visual input. The training within the 3D environment, however, is resource-intensive and requires a large amount of computational capacity. Our work follows a more simplified way of incorporating semantic information. We include different object classes in our 2D simulation environment and train the agent with specific policies which should shape the behavior of the robot when encountering the specific classes. Furthermore, we transfer the algorithms to the real robot.

III. CONCEPTUAL DESIGN

We propose a DRL-based local navigation system which maps observed sensory input directly to robot actions for obstacle avoidance and safe navigation in dynamic environments. Furthermore, we aspire to explore the safety enhancements using semantic information. More specifically, rules based on detected nearby objects are defined and incorporated into the reward functions: the mobile robot has to keep a distance of 0.8 meters from detected humans and 0.3 meters from collaborating robots. The training is accelerated with neural networks and built upon a deep Q-network (DQN), which maps states to so-called Q-values that each state-action pair possesses. Therefore, we employ the proposed DRL workflow described in [17].

Fig. 1: Design of the navigation system

A. Design of the Proposed Navigation System

The general workflow of our system is illustrated in Fig. 1. The agent is trained by the RL algorithm within simulation by analyzing states, actions and rewards. For our use case, the states are represented by laser scan observations and object detections, while actions are the possible robot movements. Within the training stage, this information is simulated in our simulation environment. Within the deployment stage, an RGB camera with integrated object detection and a 360-degree 2D Lidar scanner of the Turtlebot3 deliver the required data. The core component of our system is the neural network which optimizes the Q-function. We input the data into the neural network, which maps the state-action tuples to Q-values with the maximal reward. Thus, the agent learns optimal behavior given a set of states and actions. In addition, we integrate several optimization modules to accelerate the training process even further. Finally, in the deployment stage, the RL model is deployed to the real robot using a proposed deployment infrastructure consisting of a top-down camera to detect the goal and surrounding objects. As a middleware for communication between all entities, ROS Kinetic on Ubuntu 16.04 is used. For the deployment of the trained algorithms to the real robot, several challenges have to be considered. Unlike in simulation, the robot does not know when the goal is reached. Therefore, we propose a solution using object detection and markers with the aforementioned global camera. The problem of inaccurate and noisy laser scan data is considered within our simulation environment, where we include a module to add several levels of noise to the laser scan data. Furthermore, we included dynamic obstacles and randomness in the simulation to alleviate the differences between the real and the simulation environment.
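As an illustration of this workflow, the following is a minimal sketch of the deployment-side control loop, assuming the standard Turtlebot3 ROS topics (/scan, /cmd_vel); the detection helper, the file name of the trained network and the mapping from discrete actions to velocity commands are hypothetical placeholders, since they are not specified here.

```python
# Minimal sketch of the deployment-side loop on the real robot. Assumptions:
# standard Turtlebot3 topics /scan and /cmd_vel, a trained Q-network stored in
# "dqn_semantic.pt", and placeholder helpers for the semantic detections and
# the discrete-action-to-velocity mapping (neither is specified in the text).
import rospy
import torch
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

model = torch.load("dqn_semantic.pt")        # assumed file name of the trained net
model.eval()

latest_scan = None

def scan_callback(msg):
    global latest_scan
    latest_scan = list(msg.ranges)            # 360 range values, one per degree

def get_semantic_detections():
    # Placeholder: would query the pose estimation / object detection node and
    # return (distance, angle) for the nearest human and robot; defaults otherwise.
    return [10.0, 0.0, 10.0, 0.0]

def action_to_twist(a):
    # Hypothetical discrete action set; the text only states that there are 7 actions.
    twist = Twist()
    twist.linear.x = [0.0, 0.2, 0.2, 0.2, -0.1, 0.0, 0.0][a]
    twist.angular.z = [0.0, 0.0, 0.5, -0.5, 0.0, 0.5, -0.5][a]
    return twist

rospy.init_node("drl_local_planner")
rospy.Subscriber("/scan", LaserScan, scan_callback)
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    if latest_scan is not None:
        state = torch.tensor(latest_scan + get_semantic_detections()).float()
        action = int(model(state.unsqueeze(0)).argmax(dim=1))
        cmd_pub.publish(action_to_twist(action))  # greedy action from the Q-network
    rate.sleep()
```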
IV. IMPLEMENTATION

In the following chapter, each module of our proposed navigation system is presented in detail.

A. Training Algorithm

The basic training loop is based on the suggestions from [17] and employs deep Q-learning. We rely on three major techniques: a replay buffer and an epsilon-greedy strategy to cope with the exploration-exploitation problem, and a target network which we use to stabilize the training process in terms of robustness. To abstract the algorithm for our simulation environment, we split it into two parts: a simulation step, which interacts with the simulation environment, and a learning step, which is responsible for the refinement of the neural network model. The implementation of those steps is described by Algorithms 1 and 2.

In the PreStep algorithm, a random action is chosen with a probability of ε. Otherwise, the action with the maximum Q-value, according to the current network's estimation, is retrieved. Using that action, a simulation step is performed, revealing the new state and the reward gained. The new state, along with the previous state, reward, action and a flag indicating whether the episode has ended, is stored in the replay buffer R.

Algorithm 1 PreStep
  if random() < ε then
      a ← random_action()
  else
      a ← argmax_a Q(s, a)
  end if
  (s′, r) ← simulation_step(a)
  if episode is over then
      R.insert((s, s′, a, r, TRUE))
      s ← simulation_reset()
  else
      R.insert((s, s′, a, r, FALSE))
      s ← s′
  end if
  t ← t + 1

The actual training takes place in the PostStep algorithm. Here, a random batch is sampled from the replay buffer and the mean squared error (MSE) loss is calculated using the Bellman equation. Every Nsync frames, the weights of the network Q are copied over to the target network Q̂. Using stochastic gradient descent (SGD) optimization, the weights of the network Q are optimized according to the MSE loss L calculated for every batch sample. Finally, the epsilon value is updated according to the current step t. Thereby, epsilon denotes the randomness of the executed actions in each step, which ensures an efficient trade-off between exploration and exploitation. PreStep and PostStep are called in a loop until the network converges.

Algorithm 2 PostStep
  B ← R.random_batch(Bsize)
  L ← new Array(Bsize)
  for i in B do
      (si, s′i, ai, ri, di) ← B[i]
      if di = TRUE then
          y ← ri
      else
          y ← ri + γ max_a Q̂(s′i, a)
      end if
      L[i] ← (Q(si, ai) − y)²
  end for
  Q.optimize(L)
  if t mod Nsync = 0 then
      Q̂ ← Q
  end if
  ε ← max(εmin, 1 − t/tmax)
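A compact Python sketch of how PreStep and PostStep could be combined into a single loop is given below. It is only an illustration of the described scheme, assuming a generic env object with reset/step methods and two PyTorch networks q_net and target_net passed in by the caller; the hyperparameter defaults are placeholders rather than the values used in this work.

```python
# Illustrative DQN loop combining PreStep (interaction) and PostStep (learning),
# following the scheme of Algorithms 1 and 2. env, q_net and target_net are
# passed in; env is assumed to expose reset()/step()/sample_random_action().
import random
from collections import deque
import torch
import torch.nn.functional as F

def train(env, q_net, target_net, max_steps=200_000, gamma=0.95,
          batch_size=64, n_sync=1000, eps_min=0.05):
    replay = deque(maxlen=100_000)                   # replay buffer R
    optimizer = torch.optim.Adam(q_net.parameters(), lr=0.0025)
    state, eps = env.reset(), 1.0

    for t in range(1, max_steps + 1):
        # --- PreStep: epsilon-greedy interaction with the simulation ---
        if random.random() < eps:
            action = env.sample_random_action()
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax())
        next_state, reward, done = env.step(action)
        replay.append((state, next_state, action, reward, float(done)))
        state = env.reset() if done else next_state

        # --- PostStep: sample a batch and minimise the Bellman error ---
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            s, s2, a, r, d = (torch.as_tensor(x, dtype=torch.float32)
                              for x in zip(*batch))
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - d)
            loss = F.mse_loss(q_sa, y)               # mean of (Q(s,a) - y)^2
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if t % n_sync == 0:                          # copy weights to target net
            target_net.load_state_dict(q_net.state_dict())
        eps = max(eps_min, 1.0 - t / max_steps)      # linear epsilon decay
```

In our system, this loop runs against ARENA2D rather than the generic env object used here.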
Fig. 2: Architecture of the fully connected neural network

B. Neural Network Design

We use fully connected neural networks for our deep Q-learning system. Our model consists of 4 hidden, fully connected layers and is depicted in Fig. 2. The input of the first fully connected layer represents the laser scan data and the information about nearby humans or robots: 360 neurons serve as input values, one for each degree of the laser scanner. The nearby dynamic objects are each represented by two neurons encoding their distance and angle relative to the robot. For simplicity, we restrict this input to one object for each of the classes human and robot. After 2 dense layers and a dropout layer, the resulting output is 7 neurons denoting the robot's actions. Adam is chosen as the optimizer with an adaptive learning rate of 0.0025. As loss function, the mean squared error (MSE) is chosen.
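As a reference, a minimal PyTorch sketch of such a network is shown below. Only the input size (360 laser values plus distance and angle for one human and one robot), the 7 output neurons and the Adam/MSE choice are taken from the text; the hidden-layer widths and the dropout probability are not specified here and are placeholders.

```python
# Sketch of the fully connected Q-network described above. Hidden-layer sizes
# and the dropout rate are assumptions; input/output sizes follow the text:
# 360 laser beams + (distance, angle) for one human and one robot -> 7 actions.
import torch
import torch.nn as nn

class SemanticDQN(nn.Module):
    def __init__(self, n_laser=360, n_semantic=4, n_actions=7, hidden=64, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_laser + n_semantic, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),            # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

q_net = SemanticDQN()
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.0025)  # as stated in the text
loss_fn = nn.MSELoss()
```

The exact layer configuration used in our system follows Fig. 2.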
C. Training Stage with ARENA2D

We apply the presented methods in the training stage of the robot through simulations to generate a model which, subsequently, can be used on the real robot. Therefore, we developed a simple yet efficient 2D training environment - ARENA2D - with a large number of built-in capabilities for performance enhancements and the exploration of training settings. The simulation environment is depicted in Fig. 3. Within the simulation environment, we include static as well as dynamic components and create several stages with ascending difficulty. In total, we executed the training on 3 different stages, which we denote as static, dynamic and semantic. The static stage contains only static obstacles, whereas the dynamic stage includes moving obstacles. These two stages use a neural network which does not include the neurons indicating the distance of the human and the collaborating robot as additional input. The semantic stage uses the network presented in Fig. 2. If the robot hits a wall or times out, it is reset to the center of the stage. For human obstacles, we included an additional stop rate which lets the obstacle stop at random positions for 2 seconds, thus simulating human behavior.

For the reward calculation, α denotes the angle between the robot and the goal. The rewards and penalties are listed in Table II.

TABLE II: Rewards and penalties for training
Event                        Description       Reward   Ep. Over
Goal reached                 yes               +100     yes
Moving towards goal          |α| ≤ 30°         +0.1     no
Wall hit                     Robot hit wall    −100     yes
Moving away from goal        |α| > 30°         −0.2     no
Violate distance to human    d < 0.7 m         −10      no
Violate distance to robot    d < 0.2 m         −10      no
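A minimal sketch of how a per-step reward following Table II could be computed is given below; the function signature and the way events are detected (flags passed in by the simulator) are assumptions for illustration, while the numerical values are taken from the table.

```python
# Reward sketch following Table II. The event flags and distances are assumed to
# be provided by the simulator; only the numeric values come from the table.
def step_reward(goal_reached, wall_hit, angle_to_goal_deg,
                dist_to_human, dist_to_robot):
    """Return (reward, episode_over) for one simulation step."""
    if goal_reached:
        return 100.0, True
    if wall_hit:
        return -100.0, True

    reward = 0.1 if abs(angle_to_goal_deg) <= 30.0 else -0.2   # heading term
    if dist_to_human < 0.7:                                     # semantic safety terms
        reward -= 10.0
    if dist_to_robot < 0.2:
        reward -= 10.0
    return reward, False
```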
Hyperparameters: To determine the optimal hyperparameters, we conducted several training runs and adjusted the hyperparameters manually according to our literature research as well as our experience. The optimal hyperparameters used for all further training runs are listed in Table III.
For the object detection module, we utilized a pose estimation module working on RGB input based on SSPE [21]. Thus, the position and distance of humans or collaborating robots can be detected globally. Currently, the model is able to localize humans, the Kuka Youbot and the Turtlebot models Burger and Waffle. We fine-tune the model by training on a human and robot RGB dataset utilizing the pipeline proposed in our previous work [22]. The results are transmitted to the Observation Packer node to be considered for the agent. Subsequently, the DRL algorithm will refine the trajectory of the robot.

The neurons representing the distance of the agent to humans and other robots are initially set to a distance of 10 meters to make sure the neural network assumes no nearby human or robot. Once a robot or human is detected and localized, the estimated position is given as input to the neural network via the Pose Estimation node.
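The following is a small sketch of how such an observation vector could be assembled from the laser scan and the detections, with the 10 m default used when no human or robot is visible; the dictionary-based detection format is an assumption for illustration.

```python
# Sketch of the observation packing: 360 laser ranges followed by
# (distance, angle) for the human and the robot class. When a class is not
# detected, the default distance of 10 m (and angle 0) is used, as described
# above. The detection dictionary format is an assumption.
DEFAULT_DISTANCE = 10.0

def pack_observation(laser_ranges, detections):
    """laser_ranges: 360 floats; detections: {"human": (dist, angle), ...}."""
    assert len(laser_ranges) == 360
    state = list(laser_ranges)
    for cls in ("human", "robot"):
        dist, angle = detections.get(cls, (DEFAULT_DISTANCE, 0.0))
        state.extend([dist, angle])
    return state  # length 364, matching the network input

# Example: no detections yet -> semantic part falls back to the defaults.
obs = pack_observation([3.5] * 360, {})
```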
Conducted Experiments

We conduct several experiments with the model in different real environments to compare our method against the traditional local planner of the Turtlebot3 navigation package, which uses a preexisting map of the environment with static obstacles, and against an algorithm without semantic rules. The setup of the experiments is illustrated in Fig. 5. We tested different setups with static as well as dynamic components, like moving humans and other robots, and placed 10 different goal positions ranging from 0.2 m to 2.5 m distance. The start position of the robot was the same for every run. For each approach, we conducted 30 measurements, consisting of 3 runs per goal position.

Fig. 5: Test scenarios for the conducted experiments

V. RESULTS AND EVALUATION

In the following, the results from our conducted experiments are presented. The deployment of the models to the real robot was without any difficulties, and we compare our approaches with the traditional local planner of the Turtlebot3 navigation package in terms of relevant metrics such as distance, speed, error rate and safety of the navigation. The results are listed in Table IV. The distance metric indicates the efficiency of the path planner and is obtained through the odometry topic. We measured the time each approach required to reach the goal. The safety rate is calculated from the total number of collisions the robot had with static or dynamic obstacles while still reaching the goal. Robustness describes how many times the robot failed to reach the goal, either due to a failed path planning resulting in a navigation stack shutdown, or because the robot pursued a completely wrong direction and moved out of the arena. In total, we placed 10 different goal positions, and for each goal, 3 measurements were carried out for all approaches. If one run resulted in a failure, this was added to the error rate count and another measurement was conducted, such that finally there are 30 measurements for each approach to calculate the mean distances and speed. The error rate is calculated as the percentage of failed to successful runs. Table IV lists the results for each approach. It can be observed that our approaches outperform the traditional local planner of the robot in terms of speed and distance. Furthermore, our methods eliminate the need to generate a map, which is necessary for the SLAM packages on which the global and local planner rely.
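As a small illustration of the evaluation protocol described above, the error rate can be read as the ratio of failed runs to the 30 successful runs per approach; the helper below is only a sketch of that bookkeeping.

```python
# Sketch of the evaluation bookkeeping: 10 goals x 3 successful runs = 30
# measurements per approach; failed runs are repeated and counted separately.
def summarize(successful_distances, successful_times, n_failed):
    n_success = len(successful_distances)            # expected to be 30
    return {
        "mean_distance_m": sum(successful_distances) / n_success,
        "mean_time_s": sum(successful_times) / n_success,
        "error_rate_pct": 100.0 * n_failed / n_success,
    }

# Example: 5 failures over 30 successful runs -> error rate of about 16.7 %.
print(summarize([4.7] * 30, [15.7] * 30, 5))
```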
TABLE IV: Comparison of navigation approaches
Metric           Trad.   DRL Stat.   DRL Dyn.   DRL Sem.
Distance [m]     4.72    3.71        4.1        5.28
Speed [s]        15.7    11.7        12.49      17.9
Error Rate [%]   16.66   10          3.33       0
Obstacles hit    6.4     4.8         3.2        0

The model trained with semantic information achieves the best performance in terms of obstacle avoidance and had no collisions in all our test runs. However, this comes at the cost of longer distances and times, because the robot keeps a larger distance when encountering a human, sometimes driving backwards. The greater distance alleviates the high number of collisions that were observed in our previous work, where we mitigated the issue by training in highly dynamic environments. The additional semantic information enhances this effect even further, as indicated in Table IV. Notably, the transfer of the simulated agent to the real environment did not cause major differences, even though in the simulation environment only round obstacles were deployed as dynamic obstacles. Our agent could generalize its behavior to all obstacles, both static and dynamic, and thus still managed to avoid the objects and keep a safe distance to the human. For a more visual demonstration of our experiments, we refer to our demonstration video, which is available at https://youtu.be/KqHkqMqyStM.

VI. CONCLUSION

We proposed a deep-reinforcement-learning-based local navigation system for autonomous robot navigation within highly dynamic environments. Therefore, we developed a simple yet efficient simulation engine from scratch, which showed fast and efficient training and feasibility in transferring the models to the real robot. Our navigation algorithm works solely on visual input and eliminates the need for any additional map. Furthermore, we explored the potential of semantic information by incorporating semantic classes such as human and robot, and concluded safety enhancements for the navigation. This will be extended in our further work to include more classes such as long corridors, doors or restricted areas. For the deployment into the real environment, a framework was proposed to integrate the DRL algorithms on the real robot using marker detection and odometry data. Thereby, we ease the transferability of simulated models and enable a map-independent solution. The results were remarkable both in static as well as dynamic environments and surpass the traditional RRT-PWA baseline approach in terms of safety and robustness. For future work, we plan to incorporate more semantic classes such as long corridors, doors or restricted areas into the training environment to enhance safety and the overall performance even further. Additionally, an extension of our framework with more capabilities and features, e.g. including more reinforcement learning algorithms, recurrent modules and continuous actions, is planned.

REFERENCES

[1] V. Paelke, "Augmented reality in the smart factory: Supporting workers in an industry 4.0 environment," in Proceedings of the 2014 IEEE Emerging Technology and Factory Automation (ETFA). IEEE, 2014, pp. 1–4.
[2] R. Siegwart, I. R. Nourbakhsh, D. Scaramuzza, and R. C. Arkin, Introduction to Autonomous Mobile Robots. MIT Press, 2011.
[3] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[4] M. S. Bahraini, A. B. Rad, and M. Bozorg, "SLAM in dynamic environments: A deep learning approach for moving object tracking using ML-RANSAC algorithm," Sensors, vol. 19, no. 17, p. 3699, 2019.
[5] J. Zeng, R. Ju, L. Qin, Y. Hu, Q. Yin, and C. Hu, "Navigation in unknown dynamic environments based on deep reinforcement learning," Sensors, vol. 19, no. 18, p. 3837, 2019.
[6] A. Borkowski, B. Siemiatkowska, and J. Szklarski, "Towards semantic navigation in mobile robotics," in Graph Transformations and Model-Driven Engineering. Springer, 2010, pp. 719–748.
[7] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 3. IEEE, 2004, pp. 2619–2624.
[8] D. Vasquez, B. Okal, and K. O. Arras, "Inverse reinforcement learning algorithms and features for robot navigation in crowds: An experimental comparison," in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1341–1346.
[9] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in AAAI, vol. 8. Chicago, IL, USA, 2008, pp. 1433–1438.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[11] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[12] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese et al., "Deep local trajectory replanning and control for robot navigation," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5815–5822.
[13] A. Nüchter, H. Surmann, K. Lingemann, and J. Hertzberg, "Semantic scene analysis of scanned 3D indoor environments," in VMV, 2003, pp. 215–221.
[14] L. Wang, L. Zhao, G. Huo, R. Li, Z. Hou, P. Luo, Z. Sun, K. Wang, and C. Yang, "Visual semantic navigation based on deep learning for indoor mobile robots," Complexity, vol. 2018, 2018.
[15] X. Zhi, X. He, and S. Schwertfeger, "Learning autonomous exploration and mapping with semantic vision," in Proceedings of the 2019 International Conference on Image, Video and Signal Processing, 2019, pp. 8–15.
[16] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[17] M. Lapan, Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing, 2018, pp. 125–214.
[18] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[19] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[20] F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer, "Speeded up detection of squared fiducial markers," Image and Vision Computing, vol. 76, pp. 38–47, 2018.
[21] B. Tekin, S. N. Sinha, and P. Fua, "Real-time seamless single shot 6D object pose prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 292–301.
[22] L. Kästner, D. Dimitrov, and J. Lambrecht, "A markerless deep learning-based 6 degrees of freedom pose estimation for mobile robots using RGB data," arXiv preprint arXiv:2001.05703, 2020.