CQLite: Communication-Efficient Multi-Robot Exploration Using Coverage-biased Distributed Q-Learning

Ehsan Latif Ramviyas Parasuraman The authors are with the Heterogeneous Robotics Research Lab, School of Computing, University of Georgia, Athens, GA 30602, USA.Author emails: {ehsan.latif;ramviyas}@uga.edu

Abstract

Frontier exploration and reinforcement learning have historically been used to solve the problem of enabling many mobile robots to autonomously and cooperatively explore complex surroundings. These methods need to keep an internal global map for navigation, but they do not take into consideration the high costs of communication and information sharing between robots. This study offers CQLite, a novel distributed Q-learning technique designed to minimize data communication overhead between robots while achieving rapid convergence and thorough coverage in multi-robot exploration. The proposed CQLite method uses ad hoc map merging, and selectively shares updated Q-values at recently identified frontiers to significantly reduce communication costs. The theoretical analysis of CQLite’s convergence and efficiency, together with extensive numerical verification on simulated indoor maps utilizing several robots, demonstrates the method’s novelty. With over 2x reductions in computation and communication alongside improved mapping performance, CQLite outperformed cutting-edge multi-robot exploration techniques like Rapidly Exploring Random Trees and Deep Reinforcement Learning. Related codes are open-sourced at https://github.com/herolab-uga/cqlite.

Index Terms:

Multi-Robot Systems, Cooperative Robots, Exploration, Robot Communication

I Introduction

Map-based coverage and exploration is a significant problem of interest in the robotics and multi-robot systems (MRS) community [1]. In this problem, robots continuously explore to obtain the full environmental map in an unknown bounded environment. It can be helpful in various applications, including search and rescue, domestic service, survey and operations, field robotics, etc. Autonomous exploration and surveillance solutions can also demonstrate the adaptability of the MRS since robots can carry out these missions in different and uncharted areas. In such applications, the robots need efficient wireless network connectivity for robust cooperation in uncertain environments [2].

Recent works have been influential in realizing an efficient exploration objective. For example, information-based methods (e.g., [3]) typically use the Shannon entropy to describe the uncertainty of the environmental map and construct the optimization problems such that the robot’s control variable (e.g., velocity) is continuously optimized during the exploration process. On the other hand, frontier-based methods (e.g., [4]) involve deciding the robot’s next move (or path) by searching the frontier points on the border of free and unknown points. Often, these methods only produce approximate solutions due to optimization.

Integrating learning with planning solutions is promising, especially for robot exploration [5, 6]. In the reinforcement learning (RL) paradigm, robots can continuously improve competence and adapt to the dynamics of natural surroundings by observing the results of navigational choices made in the actual world [7]. On the other hand, cooperation among robots in an MRS can help achieve a complex mission through simple distributed approaches [8, 9].

Refer to caption — Figure 1: Overview of the distributed CQLite method for efficient multi-robot exploration, shown with an illustrative simulation.

This paper explores the intersection between learning and cooperation, designs a combined solution to achieve efficient map exploration, and provides theoretical support for fast convergence and time complexity. We leverage the benefits of learning-based paradigms for joint exploration. We aim to create a distributed algorithm that gains knowledge through robot-robot information sharing while minimizing communication and computing overheads. Specifically, we utilize a distributed Q-learning methodology with a coverage-biased reward function with a light communication and information fusion strategy. In our approach, we reduce communication complexity by sharing only the current state information, i.e., Q-value, instead of the complete Q-table as done in [10] and explored frontier. Fig. 1 provides an overview of the proposed method implemented in the Robot Operating System (ROS) framework. The main contributions of this paper are summarized below.

•

We propose a novel distributed coverage-biased Q-learning approach (CQLite) for efficient multi-robot map exploration, leading to a significant reduction of the communication and computation overheads at each robot. In our approach, we reduce communication complexity by sharing only the current state information, i.e., Q-value, instead of the complete Q-table as done in [10] and apply map merging in an ad-hoc manner.
•

We substantiate the potential of our method with theoretical proofs of fast convergence in learning.
•

With extensive simulation experiments, we evaluate the performance of our approach against two state-of-the-art (SOTA) multi-robot exploration methods: Rapidly-exploring Random Trees (RRT) for Optimized Exploration [11] and Deep Reinforcement Learning (DRL) for Voronoi-based Exploration [12].
•

We open source¹¹1https://github.com/herolab-uga/cqlite the CQLite as a ROS package for use and further development by the robotics community.
•

We confirm the findings through real-world robot experiments. The video of sample simulation experiments and real-world demonstrations are available at https://youtu.be/n3unL1nuieQ.

The key idea behind the CQLite is that it uses a coverage-biased reward function to perform efficient exploration by sharing limited information among robots in a distributed fashion. Our method achieves fast convergence with the best coverage performance, reduced communication, and update costs compared to the baselines.

II Related Work

Map exploration problems focus on frontier-based and learning-based coverage planning approaches. A robot can be greedily pushed in an occupancy grip-map to the closest boundaries [13] or to the most uncertain (or informative) regions [14]. In frontier-based strategies, the robots will look to expand coverage into the unexplored regions by choosing the next waypoints based on the frontier of the explored map boundaries [15]. For instance, in [11], the multi-robot map exploration objective is integrated into an optimization framework incorporating Rapidly-exploring Random Trees (RRTs) to increase the effectiveness and efficiency of multi-robot map exploration. However, the constraints of such frontier-based approaches are the computing expense of the optimization methods and the possibility of non-optimal outcomes resulting from RRTs’ stochastic characteristics.

Researchers also presented communication-efficient solutions for exploration in multi-robot systems. For instance, Zhang et al. [16] introduced the MR-TopoMap based on a topological map, which can independently explore the robot’s surroundings while sporadically exchanging topological maps when communication is possible. However, path planning through topological mapping can lead to a sub-optimal path and, specifically, in the case where robots start exploring from the same position, exploring the same map, making it difficult to divide the map into topologies. Masaba and Li [17] proposed an exploration algorithm using the topology of the generalized Voronoi graphs made efficient through a recurrent and lean communication strategy. Corah et al. [18] use information-based distributed planning considering communication restrictions. However, the planner’s finite-horizon nature could lead to suboptimal exploration paths because it doesn’t consider long-term planning beyond the given horizon, making it more difficult for the system to make decisions based on knowledge in the future. This might prevent robots from efficiently exploring or discovering key regions of interest. More recently, Gao et al. [19] reduced inter-robot communication costs by utilizing a mission-based protocol and centralized planning, where the former can actively disconnect robots to proceed with distributed (independent exploration) and the latter will help them achieve rendezvous to reconnect and share information. However, computing the super-frontier information is computationally expensive, and the active disconnection strategy may limit the robots from sharing other critical data during the mission.

A body of research concentrates on Reinforcement Learning (RL) and Q-learning for multi-robot tasks, modifying the learning mechanism in low communication scenarios for better navigation and exploration [20, 5] and utilizing deep reinforcement learning to achieve optimality in robotic exploration [21]. A Deep RL (DRL) approach for cooperative multi-robot exploration using Voronoi cells was proposed in [12]. Despite its intriguing concept, it was constrained by training difficulties and sub-optimal solution tendencies. In another study [22], researchers suggested using an improved Q-learning algorithm in RL map navigation to stop robots from lingering in the past. However, these methods call for frequent map merging, which raises the cost of updates. Further, DRLs have shown promise in some problem spaces, but they frequently offer less-than-ideal solutions outside those contexts. They cannot guarantee convergence in infinite horizons.

To address these gaps in the literature, CQLite considers an efficient information transfer mechanism combined with distributed Q-learning with a coverage-biased reward function for efficient multi-robot cooperation to solve map exploration tasks. CQLite departs from RRT (frontier-based) and DRL (learning-based) regarding exploration strategy by reducing recurrent frontier exploration to avoid mapping overlap and Q-learning update strategy for communication efficiency by only sharing and utilizing recently calculated Q-value to the robots, respectively. Additionally, in both RRT and DRL, robots share locally explored maps on every iteration and apply map merging, which consequently gives rise to computational complexity. We reduce this overhead by only sharing and applying map merging in an ad-hoc manner. By incorporating these novelties, our approach provides significant improvement in computation and communication efficiency, even in cases of limited connectivity scenarios.

It is worth noting that the objectives of cooperative simultaneous localization and mapping (SLAM) techniques and exploration approaches are fundamentally different. The SLAM problem focuses on accurately building and merging the map, while the exploration problem focuses on using the available map to determine waypoints to maximize coverage area. Cooperative SLAM techniques emphasize communication effectiveness. Although computational efficiency is still a problem, Liu et al. [23] presented a multi-agent SLAM technique that lowers bandwidth use. Others use spectral graph analysis for cooperative mapping but overlook the computational costs of graph formation and optimization [24]. In contrast, others concentrate on lifelong localization and mapping but fail to optimize the communication and computational cost [25]. Cooperative RL techniques, as those in [26], have difficulty keeping up with the rising computational complexity of growing state spaces. In our work, we use an existing map merging method²²2https://wiki.ros.org/multirobot_map_merge from the literature to perform multi-robot SLAM. At the same time, our proposed CQLite is designed to maximize exploration and lower communication and computation. Specifically, our CQLite method addresses the above limitations by delivering computational and communication efficiency through selective data sharing and utilizing effective Q-learning to determine the best exploration strategy.

III Proposed Approach

We consider $n$ connected robots $r\in V,|V|=n$ deployed at random starting locations in an unknown environment. The connectivity of the robots is expressed by a graph $G=(V,E)$ , and a robot $i$ can share data with their immediate neighbors $\mathcal{N}_{i}=\{j\in V|(i,j)\in E\}$ . The robots must find and navigate toward the frontier position on their locally built map as a standard map exploration strategy. To accomplish this efficiently, a robot must decide which frontier to navigate after leaving its current explored region and is expected to reduce the number of steps to take and the size of data exchanges between robots.

Robots only share updated Q-value and the newly explored frontier with other robots in its range (neighbor set). Each robot keeps track of its local and shared frontiers to avoid re-exploration. Robots continue to generate local maps and share the newly developed map only when asked by other robots in case of the already explored frontier. Robots who cannot find new frontiers merge their local map with a map received from peers to build a global map using the feature similarity-based map merging technique [27]. The robot’s decisions regarding an action plan are based on the shared information and Q-learning computation. The whole procedure concerning robot $i$ can be visualized in Fig. 2.

III-A Q-learning

Markov’s decision processes frequently model the robot’s interaction with the environment. A robot’s state is $(x,y,\theta,active/inactive)$ in a global frame. Robots are localized and initialized in a global frame, and positions are known concerning virtually defined bounded regions that can be expanded based on exploration requirements. We consider the frontier’s position as states for exploration by applying efficient frontier detection [28]. We consider the current frontier’s position as the current state $s_{t}$ and the newly explored frontier as the next state $s_{t+1}$ . A robot can transition from state $s_{t}\in S$ to state $s_{t+1}$ due to acting $a_{t}\in A$ based on its state at time $t$ . Robot action $a_{t}$ to reach $s_{t+1}$ from $s_{t}$ can be determined using discrete-time Hopfield function [29]. The transition probability is defined as $T:S\times A\times S\rightarrow[0,1]$ . The robot will receive a reward for each action using a reward function $R:S\times A\times S\rightarrow R$ specific to the task. The robot will have learned the course of action to take in each state and will be able to maximize the reward of the entire interaction process.

In Q-learning, all possible states and actions are created using the Q table, which then updates each value through iterative learning. The robot then chooses the best course of action for each state based on the values in the table. This approach is frequently utilized in path planning, chess, card games, and other activities.

Assumptions: For simplicity, we assume a flat ground terrain environment for exploration.

•

The robots have omnidirectional sensors to detect obstacle boundaries within a maximum sensing range $r_{s}$ .
•

Each robot has a communication range, $r_{c}>>r_{s}$ forming a neighbor set $\mathcal{N}_{i}$ with constant connectivity to send and receive information about its relative position.
•

The connectivity graph $G$ is connected throughout exploration, which is practical to achieve in a multi-robot application. Since the proposed solution is distributed, it ensures maximum coverage even in partial disconnectivity, but at the cost of increased re-exploration.

Here, we introduce CQLite as a distributed method for robot $i$ , which is now at state $s_{t}$ at time $t$ and selects the following state as $s_{t+1}$ to explore independently. CQLite begins with an empty Q-table and updates its table values as exploration proceeds; CQLite aims to achieve fast convergence based on the optimized Q-learning and reward function. Finding the action $a$ that maximizes the Q-value for a specific state $s$ is the goal of the maximum optimization function for Q-learning, i.e., $a^{*}=\operatorname*{arg\,max}_{a}Q_{i}(s,a)$ , where $a^{*}$ is the optimal action for a given state $s$ .

The Q-learning algorithm updates the Q value as

	$\displaystyle{Q_{i,t+1}}\left({{s_{t}},{a_{t}}}\right)$
	$\displaystyle=(1-\alpha){Q_{i,t}}\left({{s_{t}},{a_{t}}}\right)+\alpha\left[{{% r_{i,t}}+\gamma{Q_{i,t}}\left({{s_{t+1}},{a^{*}}}\right)}\right],$		(1)

where $r_{i,t}$ is the reward received for taking action $a$ in state $s_{t}$ . The $\alpha\in(0,1]$ controls the balance between the coverage and delay, and $\gamma$ is the discount factor to prioritize present vs. future rewards. This optimization function is used in the action selection step of the Q-learning, where the agent selects the action that maximizes its expected future reward.

The objective of the CQLite is to perform maximum coverage in less time and avoid overlapping exploration, which can be numerically defined as

\displaystyle max_{\pi}\{P_{a}^{\pi}(t)-\lambda_{i}E_{t}(a|\pi)\},

(2)

where $P_{a}^{\pi}(t)$ is the probability to cover the unexplored region using for action $a$ using policy $\pi$ at time $t$ , $E_{t}(a|\pi)$ is estimated time to reach the state $s_{t}$ by taking action $a$ at time $t$ in policy $\pi$ and $\lambda_{i}$ is the cost associated with each step taken by robot $i$ . We have a vector $path$ extracted by containing position waypoints connecting $s_{t}$ to $s_{t+1}$ associated with $a$ [30]. For each dimension of $path$ at each control instant $t=t_{j}$ , we first compute the velocity command as:

\displaystyle v_{t_{j}}=K_{p}\cdot e_{t_{j}}+K_{I}\sum_{t=t_{0}}^{t_{j}}(e_{t_% {j}}),

(3)

where $e_{t_{j}}=s_{t,j}-s_{t,j-1}$ represents the instantaneous error between the intermediate states associated with action $a$ (i.e., the feedback) at time $t=t_{j}$ . Further, $K_{p}$ and $K_{I}$ are the so-called proportional and integral gains of the motor controller that regulate the contributions of the corrections induced by the actual error and the error accumulated over time, respectively. These constant values determination is based on the motion constraints of our differential drive robots as discussed by Li et al. [30]. They can be different based on the robot’s physical and motion characteristics. In our case, we predetermined values of $K_{P}$ and $K_{I}$ as 2 and 0.5, respectively. Now we apply simple kinematic to find the estimate $E_{t}(a|\pi)$ as:

E_{t}(a|\pi)=\sum_{j=1}^{m}\frac{(e_{t_{j}})^{2}}{v_{t_{j}}},

(4)

where $m=|\mathcal{N}_{i}|$ is the number of intermediate neighbors of the robot $i$ . To avoid the exploration of an already explored region for state $s_{t}$ , we determine $P(s_{t}\cap ES_{t})$ as:

P(s_{t}\cap ES_{t})=\sum_{j=1}^{m}\frac{P(s_{t}\cap es_{j})}{m},

(5)

where $ES_{t}$ is the set of explored states at time $t$ , $es_{j}\in ES_{t}$ , and $m$ is the number of explored states and overlap probability of each explored state in $ES_{t}$ can be determined as:

\displaystyle P(s_{t}\cap es_{j})=\begin{cases}1&dist(s_{t},es_{j})\leq r_{i,s% }\\ 0&dist(s_{t},es_{j})>r_{i,s}\\ \end{cases}

(6)

At each discrete time step $t$ , the robot $i$ acquires an observation $s_{t}$ from the environment, selects a corresponding action $a_{t}$ , then receives feedback from the environment in the form of a reward $r_{it}=R(s_{t},a_{t})$ as shown below:

\displaystyle r_{it}=\begin{cases}-\lambda_{i}&s_{t}\in ES_{t}\\ \lambda_{i}-Q_{i,t}+\rho(1-P(s_{t}\cap ES_{t}))+\sigma r_{i,c}&s_{t}\notin ES_% {t}\end{cases}

(7)

Where $P(s_{t}\cap ES_{t})$ is the probability of overlap between the current state $s_{t}$ , and the already explored states $ES_{t}$ by $robot_{i}$ and other robots, $\rho$ is a scaling factor that controls the importance of minimizing the overlap, $r_{c}$ is the communication range, and $\sigma$ is the scaling factor that determines the importance of maximizing the communication range. $\sigma$ depends upon the robot’s sensing capabilities and makes the reward function modular for heterogeneous robots with different sensing capabilities [31]. Then the state information is updated $s_{t+1}$ . The goal of the RL is to select policy $\pi$ that maximizes the discounted sum of future rewards, i.e., $Q_{\pi}(s_{1})=\sum_{\tau}^{t=1}\gamma^{t}R(s_{t},a_{t})$ , which according to the Bellman optimality principle satisfies.

The reward function in Eq. (7) produces a negative reward whenever the agent has looped back, and the calculated reward is based on the step-cost, Q-value, probability of overlap, and scaling factor otherwise.

Multi-Robot Lite Cooperation: We reduce the communication overhead amongst individual exploration-capable robots through a distributed approach, allowing each robot to make independent decisions based on local information and with little interaction from other robots. In our lite version of Q-learning, only the current state and Q-value are communicated amongst nearby robots to encourage cooperation. When another robot receives the information, it will update the received Q value in its Q table and update the local map. We develop a discovery approach based on the distance between simulated robots to replicate the network range in which we only share the current position of a robot $i$ , its Q-Value for each direction, and mark the current situation as explored to avoid repetitive exploration.

III-B Exploration Strategy

Robots create a global Q table for each cell and action after searching the map and experiencing several experiences. The Q table is then turned into a weighted graph $G=(\mathcal{S},\mathcal{E},\mathcal{C})$ , where $\mathcal{S}=\{s_{1},s_{2},...,s_{n}\}$ denotes the set of states, and $\mathcal{E}\in|\mathcal{S}|\times|\mathcal{S}|$ signifies the set of edges whose elements indicate whether or not a path exists between the center points of each pair of states. It is assumed that robots do not exchange nodes during exploration, and Voronoi boundaries are fixed. Furthermore, $\mathcal{C}$ is the weight matrix indicating the edge metric cost. The primary goal of discovering this study’s reduced graph and significant states is to optimally disperse robots over the coverage region by minimizing the relevant cost function. Because robots move at varying speeds, we formulate the cost as a function of the defined traveling time as $t_{(s_{p_{i}},s_{q})}=\frac{d_{(s_{p_{i}},s_{q})}}{v_{i}},$ where $v_{i}$ is the $i^{th}$ robot’s speed, and $d_{(s_{p_{i}},s_{q})}\in\mathcal{C}$ is the Euclidean distance between the $i^{th}$ robot’s current state $p_{i}$ and state $q$ . Furthermore, knowing the optimal path from state $p_{i}$ to state $s$ , each robot’s overall optimal traveling time is the sum of the trip times (costs) from state $p_{i}$ to state $s$ . This study’s shortest path between each pair of states is computed using the A* algorithm. Then the total time $\tau$ is calculated by knowing path $\mathcal{P}=\{p_{1},p_{2},...,p_{n}\}$ as

\displaystyle\tau_{(S_{p},S_{q})}={t_{(s_{p},s_{p_{1})}}+t_{(s_{p_{1}},s_{p_{2% }})}+...+t_{(s_{p_{n}},s_{q})}}.

(8)

After determining the shortest time between each pair of states, the field is partitioned into M Voronoi subgraphs $g_{r_{i}}$ for $i\in\{1,2,...,M\}$ to distribute work proportionally among $M$ robots. To that aim, the ideal Voronoi diagram $g_{r_{i}}$ for $i^{th}$ robot, according to Lloyd’s algorithm, is a split of the area determined as:

\displaystyle g_{r_{i}}=\{s_{q}\in S|\tau_{(s_{p_{i}},s_{q})}\leq{\tau_{(s_{p_% {j}},s_{q})}},\forall i\neq j\}.

(9)

Where $j$ is the other connected robot. The $i^{th}$ robot is responsible for covering the state $s$ (associated robot) in its sub-graph $g_{r_{i}}$ using the Voronoi partitioning result. The entire cost is then calculated as

\displaystyle\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\lambda_{i,(p,g_{r})}={% \sum_{q\in g_{r_{i}}}\tau_{(s_{p_{j}},s_{q})}\phi_{q}},\color[rgb]{0,0,0}% \definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}% \pgfsys@color@gray@fill{0}

(10)

where $\phi_{q}$ is the priority value associated with state $s_{q}$ . As the map turns into a graph, higher priority values are assigned to target states, while lower priority values are assigned to states far and already explored from the current state. The entire travel time (cost) will therefore be minimized, and an optimal solution will be obtained only when the current distance between the robot $i$ and the target state $s_{q}$ , $d_{(s_{p_{i}},s_{q})}$ converges to zero. Algorithm 1 provides the pseudocode description of CQLite for efficient map exploration implemented in a distributed manner on each robot $i$ .

1 Data: Reward matrix

R

, learning rate

\alpha

, discount factor

\gamma

, step cost

\lambda_{i}

;

2 Number of iterations:

t=0

;

3 Initialize empty Q-table as

Q_{i}

for robot

i

; Initialize empty explored frontier list

ES_{t}

;

4 Generate a local map using range sensor;

5 Initialize Explored Frontier Detected

s_{d}

as 0;

6 while $(t\leq t_{max})$ and $s_{d}$ <2 do

S_{t}\rightarrow S_{start}

, step = 1;

8 Find new frontiers at new

ES_{t}

and update

S_{t}

using locally explored map;

Q_{i,t}

= list(0);

10 for each frontier $s$ in $S_{t}$ do

11 if $s$ not in $ES_{t}$ then

12 Calculate Q-value as

q_{s}

for actions

a_{s}

to reach frontier

s

using Eq. (1);

13 Append

Q_{i,t}(s,a_{s})

;

14 else

15 Request for explored Maps;

16 Merge maps into local maps;

s_{d}

s_{d}

+ 1;

19 if $f_{d}<2$ then

20 Set updated Q-value

q_{update}

max_{a}(Q_{i,t}(s,a))

;

21 Update

Q_{i}

with

q_{update}

;

22 Take action

a_{t}

associated with

Q_{i}

;

23 Share

q_{update}

with connected robots;

24 Receive Q-value for

ES_{t}

from connected robots

j\in\mathcal{N}_{i}

;

25 Receive explored frontiers

es_{1:n-1}

and update

ES_{t}

with

s_{t}

and

es_{1:n-1}

;

26 Set new state associated with

a_{t}

S_{t}

;

27 Reset

s_{d}

as 0;

Algorithm 1 Distributed CQLite Exploration

IV Theoretical Analysis

We theoretically analyze the convergence, efficiency, and time complexity of our proposed CQLite for learning-based multi-robot map exploration.

IV-A Convergence

We analyze the convergence of the target Q value update function Eq. (1). We denote the error ratio $\delta_{t}=\frac{MSE(Q_{t})}{E_{t}(a|\pi)}$ , where $MSE(Q_{t})$ is the calculated mean square error for Q-table at time $t$ and $E_{t}(a|\pi)$ is the average number of steps to cover the region by taking action $a$ at time $t$ for policy $\pi$ .

Theorem 1 (Convergence of Q-values): Using Eq. (1) for Q-value updates, then if $0\leq\delta_{t}\leq 1$ , with probability $1-e$ , we have the estimated time to reach a given state as:

\displaystyle E_{t}(a|\pi)\leq\omega E_{1}(a|\pi)+\sqrt{\frac{\ln(1/e)\sum_{i=% 0}^{t-1}\psi_{i}^{2}(\delta_{t-i:t})}{2}}

(11)

Here, $\psi_{i}(\delta_{t-1:t})=\frac{\prod_{j=t-i}^{t-1}(j+\gamma\delta_{j})}{\prod_% {j=t-i}^{t}j}$ , $\alpha_{t}=\frac{\prod_{j=1}^{t-1}(j+\gamma\delta_{j})}{\prod_{j=2}^{t}j}$ and $\gamma=0.95$ .

Proof.

Our analysis is derived based on the subsequent (synchronous) Q-learning. In contrast to the conventional Q-learning, we swap out the current $Qt$ for the independent Q-function $Q^{\prime}(s,a)$ for the target $Q_{t}(s_{t},a_{t})$ and note that if $Q^{\prime}_{t}(s,a)=Q^{*}_{source}$ , we know that $0\leq\delta_{t}\leq 1$ .
First, we break down the update role into:

	$\displaystyle{Q_{t}}\left({{s_{t}},{a_{t}}}\right)$
	$\displaystyle=\left(\frac{t-1}{t}\right)Q_{t-1}(s,a)+\frac{1}{t}\big{(}r_{t}+% \gamma max_{a^{\prime}}Q^{\prime}_{t-1}(s^{\prime},a^{\prime})$
	$\displaystyle+\gamma max_{a^{\prime}}Q^{}_{t-1}(s^{\prime},a^{\prime})-\gamma max% _{a^{\prime}}Q^{}_{t-1}(s^{\prime},a^{\prime})\big{)}$

Let $\varepsilon_{t}(s,a)=Q_{t}(s,a)-Q^{*}(s,a)$ and
$\xi(s^{\prime})=\gamma\times max_{a^{\prime}}\left(Q^{*}_{t-1}(s^{\prime},a^{% \prime})\right)$ then recall the definition of $\delta_{t}$ , we will have $\varepsilon_{t}(s,a)$

\displaystyle\leq\frac{t-1}{t}\varepsilon_{t-1}(s,a)+\frac{1}{t}\big{(}\xi(s^{% \prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}+\frac{1}{t}\gamma\delta_{t}E_{t-1}

As we know $\varepsilon_{t}(s,a)\leq E_{t}$ , by applying maximization and recursion of $E$ , we will have:

	$\displaystyle E_{t}\leq\frac{t-1+\gamma\delta_{t}}{t}E_{t-1}+\frac{1}{t}\big{(% }\xi(s^{\prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}$
	$\displaystyle\leq\frac{\prod_{j=1}^{t-1}(j+\gamma\delta_{j})}{\prod_{j=2}^{t}j% }E_{1}+\sum_{i=1}^{t-1}\frac{\prod_{j=t-i}^{t-1}(j+\gamma\delta_{j})}{\prod_{j% =t-i}^{t}j}$
	$\displaystyle\times\big{(}\xi(s^{\prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}$
	$\displaystyle=\alpha_{t}E_{1}+\sum_{i=1}^{t-1}\psi_{i}(\delta)\big{(}\xi(s^{% \prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}$

According to weighted Hoeffding inequality [32], with prob- ability $1-e$ , we can prove Eq (11) for Theorem 1. ∎

This convergence result demonstrates the influence of the error ratio on the convergence rate. In other words, learning will go more quickly for our chosen Q value update function. Even though CQLite shares only updated Q-value, it still achieves the required convergence and provides an optimal strategy for robots to explore the map efficiently.

IV-B CQLite Efficiency

Proposition 1 (Q-table Update Efficiency): The CQLite Exploration method reduces the communication and computation cost for exploration by sharing and appending only updated Q-values and newly discovered frontiers to the local Q-table, which reduces communication and computation cost by $\frac{1}{n}$ than the cost of the SOTA approaches. Where $n$ is the total number of possible states (size of Q-table).

Proof.

The CQLite approach reduces the size of the Q-table and the amount of data that needs to be transmitted between robots by sharing and appending only the updated Q-values and recently found frontiers to the Q-table. We prove this by comparing the data needs with that of the SOTA, where the full Q-table is shared between robots. The shared Q-value for a given state-action combination $(i,j)$ in the Q-table will be $Q_{i,j}$ .

CQLite only updates Q-value once during the whole exploration, in contrast to SOTA as it updates each value in every iteration. Compared to sharing and updating the whole Q-table, the communication and computing costs are decreased by $\frac{1}{n}$ . The update cost of CQLite for Q-table with size $n$ is $n$ , but the update cost of SOTA is $n^{2}$ ; hence the cost reduction relation case be written as:

\displaystyle C_{i,CQLite}=\frac{1}{n}C_{i,SOTA},

where $C_{i,SOTA}$ is the communication and calculation cost of updating and sharing the whole Q-table in SOTA exploration techniques, and $C_{i,CQLite}$ is the communication and computation cost of the CQLite Exploration method.

To further determine the effectiveness of the Q-table update in CQLite, the cost of sending the matrix $Q$ over a network can be used to indicate the cost of sharing and updating the whole Q-table and can be stated as follows:

\displaystyle C_{i,SOTA}=\kappa\cdot\sum\limits_{j=1}^{n}|Q_{i,j}|,

where $|Q_{i,n}|$ is the absolute value of the Q-value of state $j$ for robot $i$ , and $\alpha$ is a constant that denotes the cost of sending one unit of data across the network. SOTA requires all Q-values for policy determination; hence all Q-values are shared to update Q-table in every iteration.

The CQLite Exploration approach reduces the size of the matrix and the quantity of data that needs to be transferred by sharing and appending only the updated Q-values and newly found frontiers. Let $Q^{\prime}$ be the updated matrix that only includes the new frontiers and updated Q-values. This modified matrix’s transmission cost can be expressed as

\displaystyle C_{i,CQLite}=\kappa\cdot\sum\limits_{j=1}^{n}|Q^{\prime}_{i,j}|.

Since $Q^{\prime}$ is a subset of $Q$ , it can be concluded that $\sum\limits_{j=1}^{l}|Q^{\prime}_{i,j}|\leq\sum\limits_{j=1}^{l}|Q_{i,j}|$ , and therefore:

\displaystyle C_{i,CQLite}\leq\frac{1}{n}C_{i,SOTA}

This proves that the CQLite exploration approach is more efficient regarding Q-table updating than the SOTA exploration methods like RRT and DRL. ∎

Proposition 2 (Mapping Efficiency): CQLite performs map sharing and merging with the probability $P(s_{t}\cap ES_{t})$ , which requires $<<{iterations}$ compared to relevant SOTA exploration approaches (e.g., RRT and DRL) for maximum exploration. Here, $P(s_{t}\cap ES_{t})$ is the probability of overlap between the current state $s_{t}$ , and the already explored states $ES_{t}$ by $robot_{i}$ and other robots $iterations$ is the total number of iterations carried out by the algorithm.

Proof.

The probability of overlap $P(s_{t}\cap ES_{t})$ between the current state $s_{t}$ of robot $i$ and the previously explored states $ES_{t}$ by other robots is used to determine if map sharing and merging will take place in the CQLite Exploration technique. This map merging and sharing aims to reduce the number of iterations and steps the algorithm must perform.

As part of the CQLite Exploration approach, the algorithm updates the map by combining shared maps regularly as follows:

f_{CQLite}={P(s_{t}\cap ES_{t})}\cdot{iterations}

Where $f_{CQLite}$ is the frequency of map merging carried out by the algorithm in the CQLite Exploration method, and ${iterations}$ is the mapping frequency of SOTA exploration methods like RRT and DRL.

The probability $P(s_{t}\cap ES_{t})$ can be derived using Bayes’ theorem as follows:

P(s_{t}\cap ES_{t})=P(ES_{t}\mid s_{t})\cdot P(s_{t})

Given the previously explored states $ES_{t}$ , $P(ES_{t}\mid s_{t})$ is the conditional probability of the current state $s_{t}$ , and $P(s_{t})$ is the probability of the current state $s_{t}$ .

$P(s_{t})$ can be represented as a uniform distribution over the state space, assuming that the exploration process is a random walk, with:

P(s_{t})=\frac{1}{n},

where the state space’s overall state count is $n$ .

The frequency of occurrence of the current state $s_{t}$ in the previously investigated states $ES_{t}$ can be used to estimate the conditional probability $P(ES_{t}|s_{t})$ . If the frequency with which the present state $s_{t}$ occurs in the previously studied states $ES_{t}$ is $f_{s_{t}}$ , then:

P(ES_{t}\mid s_{t})=\frac{f_{s_{t}}}{n_{e}},

where $n_{e}$ is the total number of states in the already explored states $ES_{t}$ . Substituting the above expressions into the equation for $P(s_{t}\cap ES_{t})$ gives:

\displaystyle P(s_{t}\cap ES_{t})

\displaystyle=\frac{f_{s_{t}}}{n_{e}}\cdot\frac{1}{n}

\displaystyle=\frac{f_{s_{t}}}{nn_{e}}<<{iterations}

For the total number of $iterations$ CQLite only updates the map for $\frac{f_{s_{t}}}{nn_{e}}$ times and $f_{s_{t}}<nn_{e}$ and $nn_{e}$ is equal to the $iterations$ in case of visiting each state at each $iteration$ . Hence, the above derivation proved that the CQLite method is more efficient as CQLite’s update frequency (frequency of map merging) is $<<{iterations}$ in map sharing and merging than SOTA exploration methods like RRT and DRL. ∎

Both propositions signify that the CQLite exploration method is more efficient in computation, communication, and mapping operations compared to the state-of-the-art RL-based multi-robot exploration approaches.

IV-C Time Complexity

Assume that the grid factor is $k_{g}$ (resolution of the grid map on which the grid is divided) and that the target sub-map is $k\times l$ in size. The grid map’s size is $k_{g}k\times k_{g}l$ , and the total number of points is $k_{g}^{2}kl$ . The operations to find Q-values must be carried out cyclically $k_{g}^{2}kl$ times. $kl$ can calculate and represent the CQLite’s state space size; however CQLite doesn’t perform a merging and searching strategy at every iteration; hence the length of the Q-value table is significantly less than $(kl)$ through selecting the training process, and the computing complexity of the algorithm is considerably less than $O(kl)$ .

V Experimental Validation

Turtlebot3 robots are used to carry out the exploration plan, implemented using the ROS framework. The open-source openslam-gmapping³³3https://openslam-org.github.io/gmapping.html technique of the ROS gmapping package is used to create 2D maps. It uses odometer data and a particle filter method as its foundation. The local maps created by each robot are combined to create the global map. Feature-based map merging⁴⁴4http://wiki.ros.org/multirobot_map_merge is employed to merge maps when required. Frame conversion between the local map frames is necessary for map merger. The coordinate transformation correlation between the robots must be calibrated before combining local maps. In the current work, the global frame is one robot’s frame, and the relative positions and orientations of the robots are initialized to a known state. The ROS movebase⁵⁵5https://github.com/ros-planning/navigation package allows the robot to move toward the goal point while securely avoiding barriers between robots. The Dijkstra algorithm for global path planning and the Dynamic Window Approach (DWA) for local dynamic obstacle avoidance are both implemented in this package. In this study, the units of time and distance are in seconds and meters, respectively.

TABLE I: Performance Results of RRT, DRL, and the proposed CQLite.

{}^{*}

indicates the best performer.

Evaluation parameters	Three robots in house world			Three robots in bookstore world			Six robots in bookstore world
	RRT [11]	DRL [12]	CQLite (ours)	RRT [11]	DRL [12]	CQLite (ours)	RRT [11]	DRL [12]	CQLite (ours)
Mapping Time (s)	$1208\pm 52$	924 ${}^{*}\pm 67$	$1029\pm 59$	$347\pm 32$	$324\pm 21$	317 ${}^{*}\pm 19$	$212\pm 18$	$267\pm 29$	197 ${}^{*}\pm 13$
Path Length (m)	$592\pm 11$	$604\pm 19$	543 ${}^{*}\pm 9$	$278\pm 26$	$235\pm 29$	147 ${}^{*}\pm 21$	$223\pm 12$	$196\pm 17$	121 ${}^{*}\pm 11$
Exploration Percentage (%)	$87\pm 4$	$91\pm 3$	$\textbf{95}^{*}\pm 3$	$90\pm 5$	$93\pm 2$	$\textbf{97}^{*}\pm 2$	$93\pm 4$	$94\pm 5$	$\textbf{98}^{*}\pm 2$
Overlap Percentage (%)	$51\pm 5$	$46\pm 6$	$\textbf{28}^{*}\pm 2$	$57\pm 8$	$51\pm 9$	$\textbf{31}^{*}\pm 6$	$47\pm 7$	$39\pm 8$	$\textbf{21}^{*}\pm 6$
MAP SSIM	$0.73\pm 0.12$	$0.89\pm 0.08$	$\textbf{0.91}^{*}\pm 0.06$	$0.68\pm 0.21$	$0.71\pm 0.13$	$\textbf{0.89}^{*}\pm 0.08$	$0.71\pm 0.17$	$0.73\pm 0.15$	$\textbf{0.93}^{*}\pm 0.10$
CPU Utilization (%)	$112\pm 22$	$79\pm 18$	$\textbf{42}^{*}\pm 8$	$97\pm 18$	$65\pm 15$	$\textbf{34}^{*}\pm 9$	$68\pm 21$	$47\pm 16$	$\textbf{26}^{*}\pm 9$
RAM (MB)	$824\pm 19$	$1264\pm 41$	$\textbf{665}^{*}\pm 24$	$624\pm 16$	$819\pm 33$	$\textbf{432}^{*}\pm 21$	$452\pm 19$	$724\pm 38$	$\textbf{319}^{*}\pm 18$
COM Payload (MB)	$2.2\pm 0.08$	$2.4\pm 0.06$	$\textbf{0.6}^{*}\pm 0.02$	$1.3\pm 0.06$	$1.8\pm 0.04$	$\textbf{0.4}^{*}\pm 0.01$	$1.1\pm 0.04$	$1.3\pm 0.05$	$\textbf{0.2}^{*}\pm 0.01$

V-A Simulation Setup:

A closed simulation environment based on the ROS Gazebo simulator with two indoor template environments is used: the Gazebo’s house world ( $\approx$ 250 $m^{2}$ area) and the Amazon AWS bookstore world ( $\approx$ 100 $m^{2}$ area). The robots may quickly finish the map exploration in a closed environment. Each robot has a laser scanner to gather data about its surroundings. The robots form a fully connected graph through a WiFi communication channel with a standard range of 40-60m. The robot’s trajectory is determined based on the fusion of wheel odometry and laser scan information.

The following parameters are used in the experiments in the simulated environment. The laser scanner’s range and $r_{i,s}$ are set to $15m$ and $1m$ , respectively. Additionally, the robot’s maximum linear and angular speeds in the simulation are set to $0.5ms^{-1}$ and $\frac{\pi}{4}rads^{-1}$ , respectively. The global detector’s growth factor $\eta$ and the local detector’s growth factor $\eta_{1}$ in the RRT detector are set to $5m$ and $3m$ , respectively. The weight parameters, $\alpha=0.6$ , $\gamma=0.95$ and $\lambda_{i}=2$ for $1m$ distant step.

Each experiment was run for ten trials, with average observations reported. We evaluate the performance in the following three scenarios to validate the robustness and scalability of the proposed solution: 1) 3 robots in the house world, 2) 3 robots in the bookstore world, and 3) 6 robots in the bookstore world.

V-B Evaluation Metrics

The proposed CQLite and the methods put forward by RRT [11] and DRL [12] are compared in our experiments. We use the below metrics for a comprehensive evaluation:

1.

Mapping Time: The amount of time spent mapping is a gauge of the efficiency of the exploration process;
2.

Path Length: This term refers to the path length of all robot’s trajectories combined until exploration converges. The entire trajectory length gives an idea of the robot’s energy usage while subtly describing its investigation’s effectiveness;
3.

Exploration Percentage: The percentage of the generated map with time elapsed;
4.

Overlap Percentage: The percentage of the overlap of the explored map with time elapsed;
5.

Map SSIM: Structural similarity index measure of generated maps compared with ground truth map to measure map correctness;
6.

CPU Utilization: The maximum % consumption of the processor of a robot throughout the trajectory;
7.

Memory Consumption (RAM): The maximum occupied memory by the robot throughout the trajectory;
8.

COM payload: The size of the data communicated by a robot averaged over iterations.

VI Results and Discussion

We have reported each approach’s average performance after ten trials in each condition to reduce the measurement noise and analyze the statistical details. A sample of the mapping outcomes of the compared approaches with the trajectories followed by three robots in the simulated environment is shown in Fig. 3 and generated maps also delineate the map correctness. The outcome should be stated considering the average mapping time, distance traveled, and mapping efficiency. Mapping efficiency is determined by comparing with the original map, and reported percentages are normalized with gazebo world dimensions.

Table I provides a comparative analysis of different methods on all the performance metrics and the statistical data from the results. It also lists the theoretical (algorithmic) computational complexity. Figs. 4 shows the comparison of the approaches in the three key performance metrics: computation, communication, and exploration. Fig. 5 shows a zoomed-in view of the mapping process at three different closely timed instances.

The proposed CQLite reliably outperforms other strategies on the key performance metrics. CQLite covers a larger area in less time, improving mapping efficiency by 10% while traveling 22 fewer meters than RRT in the experiment. In three-robot scenarios, CQLite was more effective than DRL and RRT, with 9% and 8% shorter mapping times, respectively. Its path length was also less than DRL’s by about 38%. The advantages became even more apparent when the trial involved six robots. While the mapping time was around 26% faster than DRL and 7% faster than RRT for the bookhouse world, the path length was about 38% shorter than with DRL. However, the mapping time of DRL is 10% better than CQLite for house world because DRL is trained for the world and has an optimized path for coverage.

CQLite had an exploration percentage that was 4% greater than DRL in the three-robot scenario. This advantage persisted in the six-robot case, where CQLite’s exploration percentage was almost 4% higher than DRL’s while maintaining the lowest overlap percentage. The stability and effectiveness of CQLite in multi-robot exploration tasks are highlighted by these results from various experiments.

In the reward convergence (Fig. 6), the 3-robots in the house world show a steady progression, converging at around 850 iterations. The 3-robots in the bookstore demonstrate convergence near 280 iterations, while the 6-robots in the bookstore achieve stability quickest, by approximately 175 iterations. After these points, rewards in each experiment remain consistent, indicating optimal behavior within their respective environments. These convergence results imply the generalizability of the CQLite approach. Further, it can be noted that the local convergence of Q-values (Theorem 1) within the robot’s neighborhood set will lead to the group’s global convergence when the graph is connected [1, 33].

Communication-wise, CQLite’s strategy is more effective. Contrary to RRT and DRL, which exchange locally explored maps continually. CQLite showed a significant reduction of more than 80% in the communication payload (average data size) shared between the robots. Notably, CQLite continues to explore at a constant rate even after reaching 60% coverage, in contrast to RRT, which slows down. This dominance carries over into a real-life three-robot bookshop scenario, where it outperformed DRL and RRT regarding reduced mapping time and shorter journey distances. Results have validated the practicality of CQLite by surpassing DRL and RRT in terms of most of the performance matrices in all scenarios. Further, demonstrating its efficacy and applicability on resource-constrained robots, CQLite maintained decreased RAM, CPU, and communication payload usage. CQLite demonstrates its power in managing a range of multi-robot exploration scenarios by offering improved map quality, as higher MAP SSIM ratings indicate. It is particularly appropriate in situations when there are significant communication and resource constraints.

One of the limitations of the proposed CQLite is that it relies on wireless communication, which can be intermittent or harsh in specific real-world situations. In such scenarios, a communication-aware strategy can be integrated with our approach to tolerate changes in communication channels.

VI-A Real Robot Demonstration

Two Turtlebot3 robots are used for real-world map exploration using the ROS Noetic framework. Robots can share information about the odometry and map output of their respective SLAM by subscribing to specific topic messages in ROS. Experiments are performed in a small-scale lab setting of $10m^{2}$ . We tested the CQLite under two scenarios: 1) a simplistic setup where one robot can cover the entire map without the other robot needing to move; 2) a complex setup where obstacles obstruct both robots in their initial positions, necessitating both robots’ contributions (see Fig. 7). Both scenarios were successfully tested by exploring the entire map, as can be seen in this video: https://youtu.be/n3unL1nuieQ. We believe this provides further evidence for the practicality of the proposed exploration approach.

VII Conclusion

This paper proposed CQLite, a distributed Q-learning strategy for multi-robot exploration created to get around the excessive communication and computation complexity and expense of learning-based systems. CQLite, which reduces communication and processing overhead by simply sharing Q-value and mapping data over the network, performs well in practical applications. Experimental results revealed that it ensures comprehensive coverage, quick convergence, and cheaper computing costs compared to well-liked RRT and DRL techniques. The same mapping efficiency was attained with only half the CPU load and 80% less communication overhead. In the future, we will examine the reward generality of CQLite to cope with various multi-robot applications.

References

[1] W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Coordinated multi-robot exploration,” IEEE Transactions on robotics, vol. 21, no. 3, pp. 376–386, 2005.
[2] R. Parasuraman, S. Caccamo, F. Båberg, P. Ögren, and M. Neerincx, “A new ugv teleoperation interface for improved awareness of network connectivity and physical surroundings,” Journal of Human-Robot Interaction, vol. 6, no. 3, pp. 48–70, 2017.
[3] B. Fang, J. Ding, and Z. Wang, “Autonomous robotic exploration based on frontier point optimization and multistep path planning,” IEEE Access, vol. 7, pp. 46 104–46 113, 2019.
[4] A. Dai, S. Papatheodorou, N. Funk, D. Tzoumanikas, and S. Leutenegger, “Fast frontier-based information-driven autonomous exploration with an mav,” in 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 9570–9576.
[5] E. Latif, W. Song, and R. Parasuraman, “Communication-efficient reinforcement learning in swarm robotic networks for maze exploration,” in IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2023, pp. 1–6.
[6] R. Shrestha, F.-P. Tian, W. Feng, P. Tan, and R. Vaughan, “Learned map prediction for enhanced mobile robot exploration,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 1197–1204.
[7] Z. Zhang, X. Wang, Q. Zhang, and T. Hu, “Multi-robot cooperative pursuit via potential field-enhanced reinforcement learning,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8808–8814.
[8] E. Tolstaya, J. Paulos, V. Kumar, and A. Ribeiro, “Multi-robot coverage and exploration using spatial graph neural networks,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8944–8950.
[9] Q. Yang, Z. Luo, W. Song, and R. Parasuraman, “Self-reactive planning of multi-robots with dynamic task assignments,” in 2019 International Symposium on Multi-Robot and Multi-Agent Systems (MRS). IEEE, 2019, pp. 89–91.
[10] A. K. Sadhu and A. Konar, “An efficient computing of correlated equilibrium for cooperative $q$ -learning-based multi-robot planning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 8, pp. 2779–2794, 2018.
[11] L. Zhang, Z. Lin, J. Wang, and B. He, “Rapidly-exploring random trees multi-robot map exploration under optimization framework,” Robotics and Autonomous Systems, vol. 131, p. 103565, 2020.
[12] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, “Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 14 413–14 423, 2020.
[13] W. Gao, M. Booker, A. Adiwahono, M. Yuan, J. Wang, and Y. W. Yun, “An improved frontier-based approach for autonomous exploration,” in 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2018, pp. 292–297.
[14] A. Bouman, M. F. Ginting, N. Alatur, M. Palieri, D. D. Fan, T. Touma, T. Pailevanian, S.-K. Kim, K. Otsu, J. Burdick et al., “Autonomous spot: Long-range autonomous exploration of extreme environments with legged locomotion,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 2518–2525.
[15] E. Latif and R. Parasuraman, “Seal: Simultaneous exploration and localization for multi-robot systems,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 5358–5365.
[16] Z. Zhang, J. Yu, J. Tang, Y. Xu, and Y. Wang, “Mr-topomap: Multi-robot exploration based on topological map in communication restricted environment,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10 794–10 801, 2022.
[17] K. Masaba and A. Q. Li, “Gvgexp: Communication-constrained multi-robot exploration system based on generalized voronoi graphs,” in 2021 International Symposium on Multi-Robot and Multi-Agent Systems (MRS). IEEE, 2021, pp. 146–154.
[18] M. Corah, C. O’Meadhra, K. Goel, and N. Michael, “Communication-efficient planning and mapping for multi-robot exploration in large environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1715–1721, 2019.
[19] Y. Gao, Y. Wang, X. Zhong, T. Yang, M. Wang, Z. Xu, Y. Wang, Y. Lin, C. Xu, and F. Gao, “Meeting-merging-mission: A multi-robot coordinate framework for large-scale communication-limited exploration,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 13 700–13 707.
[20] A. Serra-Gómez, B. Brito, H. Zhu, J. J. Chung, and J. Alonso-Mora, “With whom to communicate: learning efficient communication for multi-robot collision avoidance,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 11 770–11 776.
[21] R. Han, S. Chen, and Q. Hao, “Cooperative multi-robot navigation in dynamic environment with deep reinforcement learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 448–454.
[22] X. Yu, Y. Wu, and X.-M. Sun, “A navigation scheme for a random maze using reinforcement learning with quadrotor vision,” in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 518–523.
[23] J. Liu, K. Chen, R. Liu, Y. Yang, Z. Wang, and J. Zhang, “Robust and accurate multi-agent slam with efficient communication for smart mobiles,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2782–2788.
[24] L. Bernreiter, S. Khattak, L. Ott, R. Siegwart, M. Hutter, and C. Cadena, “Collaborative robot mapping using spectral graph analysis,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 3662–3668.
[25] M. Zhao, X. Guo, L. Song, B. Qin, X. Shi, G. H. Lee, and G. Sun, “A general framework for lifelong localization and mapping in changing environment,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3305–3312.
[26] Y. Jia, H. Luo, F. Zhao, G. Jiang, Y. Li, J. Yan, Z. Jiang, and Z. Wang, “Lvio-fusion: A self-adaptive multi-sensor fusion slam framework using actor-critic method,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 286–293.
[27] J. G. Mangelson, D. Dominic, R. M. Eustice, and R. Vasudevan, “Pairwise consistent measurement set maximization for robust multi-robot map merging,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 2916–2923.
[28] M. Keidar and G. A. Kaminka, “Efficient frontier detection for robot exploration,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 215–236, 2014.
[29] Z. Uykan, “On the working principle of the hopfield neural networks and its equivalence to the gadia in optimization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3294–3304, 2019.
[30] J. Li, “Faster parallel algorithm for approximate shortest path,” in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 308–321.
[31] S. S. OV, R. Parasuraman, and R. Pidaparti, “Impact of heterogeneity in multi-robot systems on collective behaviors studied using a search and rescue problem,” in 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2020, pp. 290–297.
[32] P. Duda, M. Jaworski, L. Pietruczuk, and L. Rutkowski, “A novel application of hoeffding’s inequality to decision trees construction for data streams,” in 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 2014, pp. 3324–3330.
[33] R. Konda, H. M. La, and J. Zhang, “Decentralized function approximated q-learning in multi-robot systems for predator avoidance,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6342–6349, 2020.