
Goal-Driven Autonomous Exploration Through Deep Reinforcement Learning

Reinis Cimurs, Il Hong Suh, Fellow, IEEE, and Jin Han Lee

IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 730–737, April 2022

Abstract—In this letter, we present an autonomous navigation system for goal-driven exploration of unknown environments through deep reinforcement learning (DRL). Points of interest (POI) for possible navigation directions are obtained from the environment and an optimal waypoint is selected based on the available data. Following the waypoints, the robot is guided towards the global goal and the local optimum problem of reactive navigation is mitigated. Then, a motion policy for local navigation is learned through a DRL framework in a simulation. We develop a navigation system where this learned policy is integrated into a motion planning stack as the local navigation layer to move the robot between waypoints towards a global goal. The fully autonomous navigation is performed without any prior knowledge while a map is recorded as the robot moves through the environment. Experiments show that the proposed method has an advantage over similar exploration methods, without reliance on a map or prior information, in complex static as well as dynamic environments.

Index Terms—AI-enabled robotics, reinforcement learning, sensor-based control.

Manuscript received September 8, 2021; accepted November 25, 2021. Date of publication December 10, 2021; date of current version December 14, 2021. This letter was recommended for publication by Associate Editor S. Cascianelli and Editor M. Vincze upon evaluation of the reviewers' comments. This work was supported in part by the Ministry of Trade, Industry and Energy (MOTIE) through the Technology Innovation Program (Industrial Strategic Technology Development) under Grant 10080638 and in part by the National Research Foundation, Ministry of Science and ICT, South Korea, through the Korea Advanced Research Program under Grant 2020M3H8A1114945. (Corresponding author: Reinis Cimurs.) Reinis Cimurs is with the Robert Bosch GmbH, Hanyang University, Seoul 04763, South Korea (e-mail: [email protected]). Il Hong Suh is with COGAPLEX, Seoul 04763, South Korea (e-mail: [email protected]). Jin Han Lee is with CLE Inc., Hanyang University, Seoul 04763, South Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/LRA.2021.3133591

I. INTRODUCTION

Over the last couple of decades, the field of simultaneous localization and mapping (SLAM) has been studied extensively. Typically, in SLAM systems a person operates a measuring device and a map is generated from the location and the environmental landmarks [1]. The operator decides which parts of the previously unmapped environment to visit and in what order, more so if the requirement is to map a path between two locations. Humans can use their best knowledge of their surroundings and instincts to locate possible pathways to the goal, even when working in unknown environments. Afterward, they can manually guide the mapping robot along the selected path to the global goal. However, it is not always possible to manually operate the mapping device for various reasons: the high cost of labor, physical constraints, limited resources, environmental dangers, and others. Subsequently, autonomous exploration is an area that is garnering a lot of attention, with ongoing research into delegating navigation and mapping tasks to autonomous vehicles. But unlike regular environment exploration, where the task is only to map the surroundings, fully autonomous goal-driven exploration is a twofold problem. First, the exploration robot needs to decide where to go to have the highest possibility of arriving at the global goal. Without prior information or the global goal in sight, the system needs to identify possible navigation directions directly from the sensor data. From such points of interest (POI), the best possible one needs to be selected as a waypoint to guide the robot to the global goal in the most optimal way. Second, a robot motion policy for uncertain environments that does not depend on map data needs to be obtained. With the recent advances in deep reinforcement learning (DRL) for robot navigation, high-precision decision-making has become feasible for autonomous agents. Using DRL, an agent control policy can be learned to achieve the target task in an unknown environment [2]. However, due to its reactive nature and lack of global information, it easily encounters the local optimum problem, especially in large-scale navigation tasks [3].

Therefore, we present a fully autonomous exploration system for navigation to a global goal, without the necessity of human control or prior information about the environment. Points of interest are extracted from the immediate vicinity of the robot, evaluated, and one of them is selected as a waypoint. Waypoints guide the DRL-based motion policy to the global goal, mitigating the local optimum problem. Then, motion is performed based on the policy, without requiring a fully mapped representation of the surroundings. The main contributions of this work can be itemized as follows:
• Designed a global navigation and waypoint selection strategy for goal-driven exploration.
• Developed a TD3 architecture based neural network for mobile robot navigation in continuous action space.
• Combined the DRL motion policy with the global navigation strategy to mitigate the local optimum problem for exploration of unknown environments and performed extensive experiments to validate the proposed method.

The remainder of this paper is organized as follows. In Section II related works are reviewed. The proposed method is described in Section III. Experimental results are given in Section IV and conclusions are given in Section V.


II. RELATED WORKS

Environmental exploration and mapping has been a popular study in the field of robotics for decades [4]–[6]. With the availability of various low-cost sensors and computational devices, it has become possible to perform SLAM on the robotic agent in real-time with a number of different approaches. Sensor devices such as cameras [7]–[9], two-dimensional and three-dimensional LiDARs [10]–[12], and their combinations [13]–[15] are used not only to detect and record the environment but also to autonomously position the agents within it. However, to obtain reliable map information of the surroundings through navigation, a large portion of developed SLAM systems rely on human operators or a pre-described plan [16]–[18]. Also, many autonomous exploration approaches are developed based on a previously available map [19]–[21]. POI for exploration are extracted from the map edges that show free space in the environment. Subsequently, a path is planned and navigation is performed towards the selected POI [22]–[24].

Due to the rise in popularity and the capabilities of deep learning methods, robot navigation methods through neural networks have been developed. Robots are able to perform directed exploration by predicting the environment. In [25] the environment is mapped by using learned predictions of the frontier areas that guide the exploration of the robot, but the navigation is still carried out based on a planner. Similarly, in [26] and [27] a prediction is made about the map based on an RGB image and an action is selected towards an either specified or predicted goal. A planner is still used in the uncertain map to create a route towards the goal, and immediate actions are obtained from the neural network to follow the path. In [28] free space in the environment is predicted in a static simulated grid world, and a robot is taught to navigate in a novel setting. Exploratory navigation through deep Q-learning is presented in [29], where a robot learns to avoid obstacles in unknown environments. This is further extended in [30], where a robot learns an obstacle-avoiding policy in a simulation and the network is applied to real-world environments. Here, discrete navigation actions are selected to avoid obstacles without a specific goal. Robot actions in continuous action space are obtained by performing learning in deep deterministic policy gradient-based networks. In [31], [32] the environment state information is combined with the goal position to form the input for the network. However, since the information about the environment is given locally in time and space, these networks may encounter a local optimum problem. In [33] the learned, behavior-based navigation is successfully combined with planning. However, local planners are still used for motion planning and the neural network is used only to avoid dynamic obstacles in a previously mapped environment. Similarly, in [34], [35] creation of a local plan is learned and combined with a global path planner to avoid obstacles along the way. Here, a local path is calculated through a neural network in an already mapped environment and exploration is not considered. Generalization of target-driven navigation from visual inputs in the immediate vicinity is discussed in [36]. A neural network learns navigation commands to arrive at a given target within the surrounding scene. In [37] a similar approach is introduced for target search in unknown environments by learning decoupled target detection and navigation tasks through deep reinforcement and supervised learning. RGB images are used to obtain depth information and recognize the target, if it is present in the scene, and a discrete navigation action is selected to guide the robot towards it. The robot develops a right-wall-following strategy that works well for maze environments, but might not be optimal for real-life implementations in open spaces. Even though these methods provide high accuracy and reliability in their respective domains, they are difficult to deploy in practice due to concerns relating to safety, generalization ability, the local optimum problem, and knowledge transfer from simulation to reality. As such, the neural network-based methods can tackle modular tasks but might not be suitable for implementation in global, end-to-end solutions. Visual navigation in general can be difficult to down-scale and transfer from a simulation to a real implementation. Therefore, we propose combining a lightweight learned motion policy from laser data with a broader global navigation strategy to solve a goal-driven exploration problem. The navigation system's aim is not only to explore and map out an unknown environment towards a specified global goal but also to reactively navigate around obstacles without a predefined plan. The proposed fully autonomous system is presented in Fig. 1.

Fig. 1. Navigation system implementation. Robot setup is depicted on the left. Global and local navigation, with their individual parts and data flow, are visualized in the middle and right, respectively.

III. GOAL-DRIVEN AUTONOMOUS EXPLORATION

To achieve autonomous navigation and exploration in an unknown environment, we propose a navigation structure that consists of two parts: global navigation with optimal waypoint selection from POI and mapping, and a deep reinforcement learning-based local navigation. Points of interest are extracted from the environment and an optimal waypoint is selected following the goal evaluation criteria. At every step, a waypoint is given to the neural network in the form of polar coordinates, relative to the robot's location and heading. An action is calculated based on the sensor data and executed towards the waypoint. Mapping is performed while moving between waypoints towards the global goal.

A. Global Navigation

For the robot to navigate towards the global goal, intermediate waypoints for local navigation need to be selected from the available POI. As there is no initial information about the environment, it is not possible to calculate an optimal path. Therefore, the robot needs not only to be guided towards the destination but also to explore the environment along the way to recognize possible alternative routes if it were to encounter a dead-end. Since no prior information is given, the possible POI need to be obtained from the immediate surroundings of the robot and stored in memory.

We implement two methods of obtaining new POI (a sketch of both rules is given after the list):
• A POI is added if the value difference between two sequential laser readings is larger than a threshold, allowing the robot to navigate through the presumed gap.
• Due to laser sensors having a maximum range, readings outside this range are returned as a non-numerical type and represent free space in the environment. A POI is placed in the environment if sequential laser readings return a non-numerical value.
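For illustration, the following minimal Python sketch applies these two extraction rules to a single scan. The threshold value, the placement of the POI within a gap or along a free-space direction, and the helper names (extract_poi, gap_threshold) are assumptions made for this example rather than the implementation used on the robot.

```python
import math
import numpy as np

def extract_poi(ranges, angle_min, angle_increment, max_range,
                gap_threshold=1.0):
    """Return candidate POI (x, y) points in the robot frame from one scan.

    Rule 1: a jump between consecutive readings larger than gap_threshold
            suggests a navigable gap; place a POI in the middle of the gap.
    Rule 2: readings beyond the sensor range come back as inf/NaN and mark
            free space; place a POI along the first such direction of a run.
    """
    pois = []
    for i in range(1, len(ranges)):
        r_prev, r_curr = ranges[i - 1], ranges[i]
        angle = angle_min + i * angle_increment

        # Rule 2: non-numerical reading (out of range) -> free space.
        if not np.isfinite(r_curr):
            if np.isfinite(r_prev):  # only at the start of a non-numeric run
                d = max_range / 2.0  # assumed placement distance
                pois.append((d * math.cos(angle), d * math.sin(angle)))
            continue

        # Rule 1: large difference between two sequential readings.
        if np.isfinite(r_prev) and abs(r_curr - r_prev) > gap_threshold:
            d = min(r_prev, r_curr) + gap_threshold / 2.0
            pois.append((d * math.cos(angle), d * math.sin(angle)))
    return pois

# Usage on a synthetic 180-degree scan with a gap and an out-of-range sector.
scan = np.full(180, 4.0)
scan[60:70] = 9.0        # sudden jump -> presumed gap (Rule 1)
scan[120:130] = np.inf   # beyond max range -> free space (Rule 2)
print(extract_poi(scan, -math.pi / 2, math.pi / 180, max_range=10.0)[:3])
```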


Examples of POI extraction from the environment are depicted in Fig. 2.

Fig. 2. POI extraction from the environment. Blue circles represent an extracted POI with the respective method. Green circles represent the current waypoint. (a) Blue POI is obtained from a gap between laser readings. (b) Blue POI extracted from non-numerical laser readings.

If in subsequent steps any of the POI are found to be located near an obstacle, they are deleted from the memory. A POI will not be obtained from the laser readings in a place that the robot has already visited. Additionally, if a POI is selected as a waypoint but cannot be reached over a number of steps, it is deleted and a new waypoint selected.

From the available POI, the optimal waypoint at time-step t is selected by using the Information-based Distance Limited Exploration (IDLE) evaluation method [38]. The IDLE method evaluates the fitness of each candidate POI as:

$$h(c_i) = \tanh\left(\frac{e^{\left(\frac{d(p_t, c_i)}{l_2 - l_1}\right)^2}}{e^{\left(\frac{l_2}{l_2 - l_1}\right)^2}}\right) l_2 + d(c_i, g) + e^{I_{i,t}}, \qquad (1)$$
where the score h of each candidate POI c with index i is a sum of three components. The Euclidean distance component d(p_t, c_i) between the robot's position p at t and the candidate POI is expressed through a hyperbolic tangent (tanh) function:

$$\tanh\left(\frac{e^{\left(\frac{d(p_t, c_i)}{l_2 - l_1}\right)^2}}{e^{\left(\frac{l_2}{l_2 - l_1}\right)^2}}\right) l_2, \qquad (2)$$

where e is Euler's number and l_1 and l_2 are two-step distance limits within which to discount the score. The two-step distance limits are set based on the area size of the DRL training environment. The second component d(c_i, g) represents the Euclidean distance between the candidate and the global goal g. Finally, the map information score at t is expressed as:

$$e^{I_{i,t}}, \qquad (3)$$

where I_{i,t} is calculated as:

$$I_{i,t} = \frac{\sum_{w=-\frac{k}{2}}^{\frac{k}{2}} \sum_{h=-\frac{k}{2}}^{\frac{k}{2}} c_{(x+w)(y+h)}}{k^2}, \qquad (4)$$

where k is the size of the kernel over which the map information around the candidate point's coordinates x and y is calculated, and w and h represent the kernel width and height, respectively. A POI with the smallest IDLE score from (1) is selected as the optimal waypoint for local navigation.
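For concreteness, a small Python sketch of the IDLE evaluation in (1)–(4) is given below. The occupancy-grid convention (cell values c in [0, 1]), the map resolution handling, and the function names are assumptions made for the example, not part of the original implementation.

```python
import math
import numpy as np

def idle_score(candidate, robot_pos, goal, occ_map, resolution,
               l1=5.0, l2=10.0, kernel=1.5):
    """IDLE fitness of one candidate POI, lower is better (eqs. (1)-(4)).

    candidate, robot_pos, goal: (x, y) positions in metres
    occ_map: 2D numpy array of map cell values c in [0, 1]
    resolution: metres per cell; kernel: kernel size k in metres
    """
    d_rc = math.dist(robot_pos, candidate)   # d(p_t, c_i)
    d_cg = math.dist(candidate, goal)        # d(c_i, g)

    # Distance component (2): tanh of a ratio of exponentials, scaled by l2.
    num = math.exp((d_rc / (l2 - l1)) ** 2)
    den = math.exp((l2 / (l2 - l1)) ** 2)
    dist_term = math.tanh(num / den) * l2

    # Map information term (3)-(4): mean cell value in a k x k window
    # around the candidate's grid coordinates.
    k = max(1, int(round(kernel / resolution)))
    cx, cy = int(candidate[0] / resolution), int(candidate[1] / resolution)
    window = occ_map[max(0, cy - k // 2): cy + k // 2 + 1,
                     max(0, cx - k // 2): cx + k // 2 + 1]
    info = window.mean() if window.size else 0.0   # I_{i,t}
    return dist_term + d_cg + math.exp(info)

# The waypoint is the candidate with the smallest IDLE score.
occ = np.zeros((200, 200))
candidates = [(3.0, 1.0), (8.0, 2.0), (12.0, 4.0)]
best = min(candidates, key=lambda c: idle_score(c, (0.0, 0.0), (15.0, 5.0),
                                                occ, resolution=0.1))
print(best)
```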


B. Local Navigation

In a planning-based navigation stack, local motion is performed following the local planner. In our approach, we replace this layer with a neural network motion policy. We employ DRL to train the local navigation policy separately in a simulated environment.

A Twin Delayed Deep Deterministic Policy Gradient (TD3) based neural network architecture is used to train the motion policy [39]. TD3 is an actor-critic network that allows performing actions in continuous action space. The local environment is described by bagged laser readings in a 180° range in front of the robot. This information is combined with the polar coordinates of the waypoint with respect to the robot's position. The combined data is used as the input state s of the actor-network of the TD3. The actor-network consists of two fully connected (FC) layers. A rectified linear unit (ReLU) activation follows each of these layers. The last layer is then connected to the output layer with two action parameters a that represent the linear velocity a_1 and angular velocity a_2 of the robot. A tanh activation function is applied to the output layer to limit it to the range (−1, 1). Before applying the action in the environment, it is scaled by the maximum linear velocity v_max and the maximum angular velocity ω_max as follows:

$$a = \left[ v_{max} \frac{a_1 + 1}{2},\ \omega_{max} a_2 \right]. \qquad (5)$$

Since the laser readings only record data in front of the robot, backward motion is not considered and the linear velocity is adjusted to only be positive.
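A minimal PyTorch sketch of such an actor with the scaling of (5) is shown below. The hidden-layer width and the exact state dimension (here 21 laser bins plus 2 polar coordinates) are illustrative assumptions; the paper's Fig. 3 gives the actual layer sizes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Two FC layers with ReLU, tanh-bounded output (a1, a2) in (-1, 1)."""
    def __init__(self, state_dim=23, hidden=256):  # assumed sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

def scale_action(a, v_max=0.5, w_max=1.0):
    """Eq. (5): map a1 to a forward-only linear velocity, a2 to angular."""
    v = v_max * (a[..., 0] + 1.0) / 2.0
    w = w_max * a[..., 1]
    return torch.stack([v, w], dim=-1)

# Example: one state -> one (v, omega) command.
actor = Actor()
state = torch.randn(1, 23)
print(scale_action(actor(state)))
```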


The Q value of the state-action pair Q(s, a) is evaluated by two critic-networks. Both critic-networks have the same structure, but their parameter updates are delayed, allowing for divergence in parameter values. The critic-networks use a pair of the state s and action a as an input. The state s is fed into a fully connected layer followed by ReLU activation with output L_s. The output of this layer, as well as the action, are fed into two separate transformation fully connected layers (TFC) of the same size, τ_1 and τ_2, respectively. These layers are then combined as follows:

$$L_c = L_s W_{\tau_1} + a W_{\tau_2} + b_{\tau_2}, \qquad (6)$$

where L_c is the combined fully connected layer (CFC), W_{τ_1} and W_{τ_2} are the weights of τ_1 and τ_2, respectively, and b_{τ_2} is the bias of layer τ_2. Then ReLU activation is applied to the combined layer. Afterward, it is connected to an output with one parameter representing the Q value. The minimum Q value of both critic-networks is selected as the final critic output to limit the overestimation of the state-action pair value. The full network architecture is visualized in Fig. 3.

Fig. 3. TD3 network structure including the actor and critic parts. Layer type and the number of their respective parameters are described within the layers. TFC layers refer to transformation fully connected layers τ and CFC layer to combined fully connected layer L_c.

The policy is rewarded according to the following function:

$$r(s_t, a_t) = \begin{cases} r_g & \text{if } D_t < \eta_D \\ r_c & \text{if collision} \\ v - |\omega| & \text{otherwise,} \end{cases} \qquad (7)$$

where the reward r of the state-action pair (s_t, a_t) at timestep t depends on three conditions. If the distance to the goal at the current timestep D_t is less than the threshold η_D, a positive goal reward r_g is applied. If a collision is detected, a negative collision reward r_c is applied. If neither of these conditions is present, an immediate reward is applied based on the current linear velocity v and angular velocity ω. To guide the navigation policy towards the given goal, a delayed attributed reward method is employed following the calculation:

$$r_{t-i} = r(s_{t-i}, a_{t-i}) + \frac{r_g}{i}, \quad \forall i = \{1, 2, 3, \ldots, n\}, \qquad (8)$$

where n is the number of previous steps over which rewards are updated. This means that the positive goal reward is attributed decreasingly over the last n steps before reaching a goal. The network learned a generalized local navigation policy that is capable of arriving at a local goal, while simultaneously avoiding obstacles directly from the laser inputs.
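The sketch below illustrates the reward of (7) and the delayed attribution of (8). The numeric values of r_g, r_c, and η_D are example values assumed for the snippet; the paper does not fix them in this section.

```python
def step_reward(goal_dist, collided, v, w,
                eta_d=1.0, r_goal=100.0, r_coll=-100.0):
    """Immediate reward of eq. (7); eta_d, r_goal, r_coll are example values."""
    if goal_dist < eta_d:
        return r_goal
    if collided:
        return r_coll
    return v - abs(w)

def attribute_goal_reward(rewards, r_goal=100.0, n=10):
    """Delayed attribution of eq. (8): once the goal is reached, add r_goal / i
    to each of the last n transitions, with i = 1 being the most recent one."""
    out = list(rewards)
    for i in range(1, min(n, len(out)) + 1):
        out[-i] += r_goal / i
    return out

# Example: an episode of small motion rewards that ends at the goal.
episode = [0.3, 0.4, 0.2, 0.5, 0.1]
print(attribute_goal_reward(episode, r_goal=100.0, n=10))
```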
C. Exploration and Mapping

Following the waypoints, the robot is guided towards the global goal. Once the robot is near the global goal, it navigates to it. The environment is explored and mapped along the way. Mapping uses the laser and robot odometry sensors as sources and obtains an occupancy grid map of the environment. The pseudo-code of the fully autonomous exploration algorithm with mapping is described in Algorithm 1.

Algorithm 1: Goal-Driven Autonomous Exploration
1: globalGoal ← set the global goal
2: δ ← set the threshold for navigating to the global goal
3: while reachedGlobalGoal ≠ True do
4:   Read sensor data
5:   Update map from sensor data
6:   Obtain new POI
7:   if D_t < η_D then
8:     if waypoint = globalGoal then
9:       reachedGlobalGoal ← True
10:    else
11:      if d(p_t, g) < δ then
12:        waypoint ← globalGoal
13:      else
14:        for i in POI do
15:          calculate h(i) from (1)
16:        end for
17:        waypoint ← POI_min(h)
18:      end if
19:    end if
20:  end if
21:  Obtain an action from TD3
22:  Perform action
23: end while

IV. EXPERIMENTS

Experiments in real-life settings of varying complexity were executed to validate the proposed goal-driven exploration system.

A. System Setup

The learning of local navigation through DRL was performed on a computer equipped with an NVIDIA GTX 1080 graphics card, 32 GB of RAM, and an Intel Core i7-6800K CPU. The TD3 network was trained in the Gazebo simulator for 800 episodes, which took approximately 8 hours. Each training episode concluded when a goal was reached, a collision was detected, or 500 steps were taken. v_max and ω_max were set as 0.5 meters per second and 1 radian per second, respectively. The delayed rewards were updated over the last n = 10 steps and the parameter update delay was set as 2 episodes. The training was carried out in a simulated 10x10 m environment depicted in Fig. 4.

For the network to work not only in simulation but also in real life, it needs to learn generic obstacle avoidance behaviour from laser data. In order to facilitate generalization and policy exploration, Gaussian noise was added to the sensor and action values. To help with sim2real transfer, the environment was varied on each episode by randomly changing the locations of the box-shaped obstacles. Examples of their changing locations are depicted in Fig. 4(a), (b), and (c). The IDLE method of POI evaluation selects waypoints within the learned distance of the simulation. The same robot model was used in simulation as in the experiments, and its starting position and the goal locations were randomized on every episode.
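A short sketch of how such exploration noise and episode termination might look is given below; the noise standard deviation and the helper names are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(action, sigma=0.1, low=-1.0, high=1.0):
    """Gaussian exploration noise on the raw (a1, a2) output, clipped to (-1, 1)."""
    return np.clip(action + rng.normal(0.0, sigma, size=action.shape), low, high)

def episode_done(goal_dist, collided, step, eta_d=1.0, max_steps=500):
    """An episode ends on reaching the goal, on collision, or after 500 steps."""
    return goal_dist < eta_d or collided or step >= max_steps

# Example call with placeholder values.
action = np.array([0.2, -0.4])
print(noisy(action), episode_done(goal_dist=3.2, collided=False, step=120))
```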


Fig. 4. Examples of training environments. The blue field represents the input laser readings and range. The four box obstacles change their location on every episode, as presented in (a), (b), and (c), in order to randomize the training data.

In the real experiments, the robot was equipped with two RpLidar laser sensors at different heights with a maximal measuring distance of 10 meters. The robot setup is displayed in Fig. 1. The location and angle of both lasers were calibrated, and laser readings were recorded in 180° in front of the robot. The data from each device was bagged into 21 groups, where the minimum value of each group created the final laser input state of 21 values. The laser data was then combined with the polar coordinates to the waypoint. The mapping of the environment was performed based on the full laser readings of the top RpLidar sensor in combination with the internal mobile robot odometry. The ROS package SLAM Toolbox [40] was used to obtain and update the global map of the environment as well as to localize the robot within it. Maximal linear and angular velocities were set to the same values as in the simulation. The kernel size k was set as 1.5 meters, and the l_1, l_2 values in (2) were selected as 5 and 10 meters, respectively. The waypoints and the global goal were considered reached at a 1-meter distance.
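The following sketch shows one way the 21-value laser state and the polar waypoint coordinates could be assembled into the network input; the bin-splitting strategy and function names are assumptions for illustration, not the exact on-robot code.

```python
import math
import numpy as np

def laser_state(ranges, num_bins=21, max_range=10.0):
    """Bag a 180-degree scan into num_bins groups and keep each group's minimum."""
    ranges = np.nan_to_num(np.asarray(ranges, dtype=float),
                           nan=max_range, posinf=max_range)
    bins = np.array_split(ranges, num_bins)
    return np.array([b.min() for b in bins])

def waypoint_polar(robot_xy, robot_yaw, waypoint_xy):
    """Waypoint expressed as (distance, heading error) relative to the robot."""
    dx = waypoint_xy[0] - robot_xy[0]
    dy = waypoint_xy[1] - robot_xy[1]
    dist = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) - robot_yaw
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to (-pi, pi]
    return dist, angle

# Example: combine both into a single input state vector.
scan = np.random.uniform(0.5, 10.0, size=360)  # merged readings from both lidars
state = np.concatenate([laser_state(scan),
                        waypoint_polar((0.0, 0.0), 0.0, (4.0, 3.0))])
print(state.shape)  # (23,)
```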
For quantitative experiments and comparison to similar methods with constrained resources, the network was transferred to an Intel NUC mini-PC (2.70 GHz i7-8559U CPU, 16 GB of RAM) that facilitated the full exploration system. For qualitative experiments, the system was embedded on a laptop with an NVIDIA RTX 2070M graphics card, 16 GB of RAM, and an Intel Core i9-10980HK CPU running the Ubuntu 18.04 operating system. The ROS Melodic version managed the packages and control commands. A Pioneer P3-DX mobile platform was used as the robot base.

B. Quantitative Experiments

In order to quantifiably evaluate the proposed method, it was compared to different environment exploration methods in indoor settings. We refer to the proposed method as Goal-Driven Autonomous Exploration (GDAE), which combines reinforcement learning with a global navigation strategy to arrive at the global goal. To the best of the authors' knowledge, there are currently no comparable goal-driven exploration methods that employ neural network-based motion policies from low-cost laser data. Therefore, a SLAM-based state-of-the-art Nearest Frontier (NF) exploration strategy from [41] was employed for comparison with a planning-based method. Here, the distance factor for the nearest frontier was updated to include the distance to the goal. Additionally, GDAE was compared to a reinforcement learning-based method without the global navigation strategy, referred to as Goal-Driven Reinforcement Learning (GD-RL). Experiments with the proposed navigation system were also carried out where the neural network was substituted with a ROS local planner package (TrajectoryPlanner). We refer to this system as Local Planner Autonomous Exploration (LP-AE). Finally, as a baseline comparison, a path obtained with the Dijkstra algorithm in an already known map was executed. We refer to this method as Path Planner (PP). Experiments were performed five times with each method in two environments. The recorded data includes travel distance (D.) in meters, travel time (T.) in seconds, recorded map size in square meters, and how many times the method successfully reached the goal. The recorded map size was calculated only from known pixels. The average (Av.), maximal (max.), minimal (min.), and standard deviation (σ) were calculated from the obtained results. Additionally, the Success weighted by Path Length (SPL) measure, introduced in [42], is used for comparison, where the shortest path distance is the length of the shortest PP output.
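For reference, the SPL measure of [42] over N runs can be written as

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i, \ell_i)},$$

where S_i indicates the success of run i, ℓ_i is the shortest-path length (here the length of the shortest PP output), and p_i is the path length actually travelled.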

The first environment is depicted in Fig. 5 and consisted mostly of smooth walls with multiple local optima. The goal was located at coordinate (−12, 15). While all the methods were able to arrive at the global goal, they did so with differing travel distance and time. GDAE was capable of arriving at the global goal at a travel distance comparable to the similar methods, but the navigation took less time. The time increase for the NF and LP-AE methods was due to needing to wait while a new path was calculated towards a new waypoint. However, the GD-RL method fell into the local optimum trap, from which it escaped by following a wall. This significantly increased the distance and time to the global goal. The results of the experiment are described in Table I.

Fig. 5. Environment and resulting navigation examples of the respective methods in the first quantitative experiment. (a) Depiction of the experiment environment. (b) Example of a resulting map of the environment with the overlaid path of each method.

TABLE I. RESULTS OF THE FIRST QUANTITATIVE EXPERIMENT

The second experiment is depicted in Fig. 6 and introduced obstacles of various complexity into the environment, such as furniture, chairs, tables, glass walls, and others. The start point was located in a local optimum area with a see-through glass wall at the height of the top laser. The proposed method successfully and reliably navigated to the global goal in the shortest time and at a comparable distance. The NF and LP-AE methods are highly dependent on the accuracy of the map. As such, they calculated paths through not yet registered obstacles, such as chairs and glass walls, on multiple occasions and failed to reach the goal. Initially, the GD-RL method was unable to escape the local optimum and failed to reach the global goal. In subsequent runs, a human operator intervened as the robot approached the escape from the starting area and guided the robot towards the free space. The obtained results are described in Table II.

Fig. 6. Environment with complex obstacles and resulting navigation examples of the respective methods in the second quantitative experiment. (a) Depiction of the experiment environment. (b) Example of a resulting map of the environment with the overlaid path of each method.

TABLE II. RESULTS OF THE SECOND QUANTITATIVE EXPERIMENT
*Successful only after human interference.

The SPL results of the experiments show that the proposed method works comparably well with SLAM and planning-based methods in simpler environments but has an advantage in more complex environments. In environments where obstacles are difficult to detect, basing a motion plan on an uncertain map without properly registered obstacles leads to failure in execution. On such occasions, the learned generalized motion policy from the neural network makes instantaneous decisions directly from the laser data on first detection and executes motion accordingly.

C. Qualitative Experiments

Additional experiments with the GDAE method were performed in various indoor settings. In Fig. 7 three different environments are depicted, where a local optimum needed to be avoided or navigated out of to arrive at the global goal. The blue dots represent the obtained POI, and the green path represents the robot's path.

In Fig. 7(a) the goal position was located at coordinate (22, 12). A straight path towards the goal led towards a room corner and a local optimum. Once the robot had obtained the information about the corner, it was able to backtrack, find the exit of the room, and navigate towards the global goal through the hallways. In Fig. 7(b) the goal position was located at coordinate (23, 2), and the robot starting location was placed in the local optimum. By using the global navigation strategy, the local environment was explored, which allowed the robot to navigate out of the confined hallway and to the global goal. In Fig. 7(c) the goal coordinates were (10, 10) and the robot starting location was in the local optimum. The environment was littered with obstacles of various shapes and sizes, such as chairs, signs, potted plants, and others. The robot successfully explored its surroundings, navigated through the clutter using the local navigation, and arrived at the global goal.

Fig. 7. Qualitative experiments. (a) Navigation towards a goal in a hallway. (b) Navigation towards a goal in a hallway, when the starting position is in a local optimum. (c) Navigation towards a goal in a hall through clutter.


Afterward, navigation and mapping experiments were performed on a larger scale. In Fig. 8 three scenarios are introduced, where a robot navigated towards a goal in hallways and an underground parking lot. In Fig. 8(a) the goal was located at coordinate (12, −90). In Fig. 8(b) the goal coordinate was (60, −7). In Fig. 8(c) the goal was located 100 meters diagonally across a parking lot at coordinate (100, 0). In these experiments, the robot navigated out of the cluttered environment and arrived at a goal located at a significant distance.

Fig. 8. Experimental results of goal-driven exploration and mapping with the goal located at a significant distance from the robot. (a) Navigation towards a goal in a hallway. (b) Navigation towards a goal in a hall with backtracking. (c) Navigation towards a goal in an underground parking lot.

From the experiments, we can observe that the navigation system is capable of exploring and navigating in unknown environments of various scale and complexity and reliably finds its way to the global goal. The experimentation code, with images and videos¹ of the presented and additional experiments in static and dynamic settings, is available in our repository² and the supplementary material.

¹ Video: https://youtu.be/MhuhsSdzZFk
² Repository: https://github.com/reiniscimurs/GDAE
V. CONCLUSIONS

In this paper, a DRL-based goal-driven fully autonomous exploration system is introduced. As the experiments show, the system successfully combines reactive local and global navigation strategies. Moreover, introducing a neural network-based module into an end-to-end system proves to be beneficial, as it allows the robot to move without generating an explicit plan, while its shortcomings are alleviated by introducing the global navigation strategy. The obtained experimental results show that the proposed system works reasonably close to the optimal solution obtained by the path planner from an already known environment. Additionally, as the SPL measurement results show, GDAE is more reliable in complex scenarios than map-dependent SLAM methods.

In the current implementation, the motion policy training was performed with the model of the same robot as in the real-life experiments. This allowed for easy transfer of the network parameters to the embedded implementation, as the network parameters were optimized for its specifications. To generalize to various types of robots, the system dynamics could be introduced as a separate input state to the neural network, and training performed accordingly. By only providing the robot dynamics, it would be possible to perform local navigation up to its best capabilities.

REFERENCES

[1] K. Yousif, A. Bab-Hadiashar, and R. Hoseinnezhad, "An overview to visual odometry and visual SLAM: Applications to mobile robotics," Intell. Ind. Syst., vol. 1, no. 4, pp. 289–311, 2015.
[2] M. Sugiyama, Statistical Reinforcement Learning: Modern Machine Learning Approaches. Boca Raton, FL, USA: CRC Press, 2015.
[3] D. Aberdeen, "Policy-gradient algorithms for partially observable Markov decision processes," Ph.D. dissertation, Australian Nat. Univ., Canberra, ACT, Australia, 2003.
[4] A. Zelinsky, "A mobile robot navigation exploration algorithm," IEEE Trans. Robot. Automat., vol. 8, no. 6, pp. 707–717, Dec. 1992.
[5] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robot. Auton. Syst., vol. 45, no. 3–4, pp. 181–198, 2003.


[6] T. Chen, S. Gupta, and A. Gupta, "Learning exploration policies for navigation," CoRR, 2019, arXiv:1903.01959.
[7] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[8] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 1168–1174.
[9] L. Cui and C. Ma, "SOF-SLAM: A semantic visual SLAM for dynamic environments," IEEE Access, vol. 7, pp. 166528–166539, 2019.
[10] B. Li, Y. Wang, Y. Zhang, W. Zhao, J. Ruan, and P. Li, "GP-SLAM: Laser-based SLAM approach based on regionalized Gaussian process map reconstruction," Auton. Robots, vol. 44, pp. 947–967, 2020.
[11] G. Jiang, L. Yin, G. Liu, W. Xi, and Y. Ou, "FFT-based scan-matching for SLAM applications with low-cost laser range finders," Appl. Sci., vol. 9, no. 1, pp. 1–18, 2019, Art. no. 41.
[12] Z. Ren, L. Wang, and L. Bi, "Robust GICP-based 3D LiDAR SLAM for underground mining environment," Sensors, vol. 19, no. 13, 2019, Art. no. 2915.
[13] X. Liang, H. Chen, Y. Li, and Y. Liu, "Visual laser-SLAM in large-scale indoor environments," in Proc. IEEE Int. Conf. Robot. Biomimetics, 2016, pp. 19–24.
[14] S.-H. Chan, P.-T. Wu, and L.-C. Fu, "Robust 2D indoor localization through laser SLAM and visual SLAM fusion," in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2018, pp. 1263–1268.
[15] Z. Zhang, S. Liu, G. Tsai, H. Hu, C.-C. Chu, and F. Zheng, "PIRVS: An advanced visual-inertial SLAM system with flexible sensor fusion and hardware co-design," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 1–7.
[16] T. Taketomi, H. Uchiyama, and S. Ikeda, "Visual SLAM algorithms: A survey from 2010 to 2016," IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1, p. 16, 2017.
[17] M. Filipenko and I. Afanasyev, "Comparison of various SLAM systems for mobile robot in an indoor environment," in Proc. Int. Conf. Intell. Syst., 2018, pp. 400–407.
[18] D. W. Ko, Y. N. Kim, J. H. Lee, and I. H. Suh, "A scene-based dependable indoor navigation system," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2016, pp. 1530–1537.
[19] L. von Stumberg, V. Usenko, J. Engel, J. Stückler, and D. Cremers, "From monocular SLAM to autonomous drone exploration," in Proc. Eur. Conf. Mobile Robots, 2017, pp. 1–8.
[20] H. Gao, X. Zhang, J. Wen, J. Yuan, and Y. Fang, "Autonomous indoor exploration via polygon map construction and graph-based SLAM using directional endpoint features," IEEE Trans. Automat. Sci. Eng., vol. 16, no. 4, pp. 1531–1542, Oct. 2019.
[21] N. Palomeras, M. Carreras, and J. Andrade-Cetto, "Active SLAM for autonomous underwater exploration," Remote Sens., vol. 11, no. 23, 2019, Art. no. 2827.
[22] M. Keidar and G. A. Kaminka, "Efficient frontier detection for robot exploration," Int. J. Robot. Res., vol. 33, no. 2, pp. 215–236, 2014.
[23] W. Gao, M. Booker, A. Adiwahono, M. Yuan, J. Wang, and Y. W. Yun, "An improved frontier-based approach for autonomous exploration," in Proc. 15th Int. Conf. Control, Automat., Robot. Vis., 2018, pp. 292–297.
[24] Y. Tang, J. Cai, M. Chen, X. Yan, and Y. Xie, "An autonomous exploration algorithm using environment-robot interacted traversability analysis," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 4885–4890.
[25] R. Shrestha, F.-P. Tian, W. Feng, P. Tan, and R. Vaughan, "Learned map prediction for enhanced mobile robot exploration," in Proc. Int. Conf. Robot. Automat., 2019, pp. 1197–1204.
[26] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, "Learning to explore using active neural SLAM," CoRR, 2020, arXiv:2004.05155.
[27] S. K. Ramakrishnan, Z. Al-Halah, and K. Grauman, "Occupancy anticipation for efficient exploration and navigation," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 400–418.
[28] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, "Cognitive mapping and planning for visual navigation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2616–2625.
[29] L. Tai and M. Liu, "A robot exploration strategy based on Q-learning network," in Proc. IEEE Int. Conf. Real-Time Comput. Robot., 2016, pp. 57–62.
[30] L. Xie, S. Wang, A. Markham, and N. Trigoni, "Towards monocular vision based obstacle avoidance through deep reinforcement learning," CoRR, 2017, arXiv:1706.09829.
[31] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2017, pp. 31–36.
[32] R. Cimurs, J. H. Lee, and I. H. Suh, "Goal-oriented obstacle avoidance with deep reinforcement learning in continuous action space," Electronics, vol. 9, no. 3, pp. 1–16, 2020, Art. no. 411.
[33] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2017, pp. 1343–1350.
[34] A. Pokle et al., "Deep local trajectory replanning and control for robot navigation," in Proc. Int. Conf. Robot. Automat., 2019, pp. 5815–5822.
[35] A. Faust et al., "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 5113–5120.
[36] Y. Zhu et al., "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. IEEE Int. Conf. Robot. Automat., 2017, pp. 3357–3364.
[37] A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi, "Towards generalization in target-driven visual navigation by using deep reinforcement learning," IEEE Trans. Robot., vol. 36, no. 5, pp. 1546–1561, Oct. 2020.
[38] R. Cimurs, I. H. Suh, and J. H. Lee, "Information-based heuristics for learned goal-driven exploration and mapping," in Proc. 18th Int. Conf. Ubiquitous Robots, 2021, pp. 571–578.
[39] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1587–1596.
[40] S. Macenski and I. Jambrecic, "SLAM Toolbox: SLAM for the dynamic world," J. Open Source Software, vol. 6, no. 61, pp. 2783–2790, 2021.
[41] D. L. da Silva Lubanco, M. Pichler-Scheder, and T. Schlechter, "A novel frontier-based exploration algorithm for mobile robots," in Proc. 6th Int. Conf. Mechatronics Robot. Eng., 2020, pp. 1–5.
[42] P. Anderson et al., "On evaluation of embodied navigation agents," CoRR, 2018, arXiv:1807.06757.
