HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: hyphenat
  • failed: graphbox

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license
arXiv:2307.00500v2 [cs.RO] 18 Jan 2024

CQLite: Communication-Efficient Multi-Robot Exploration Using Coverage-biased Distributed Q-Learning

Ehsan Latif Ramviyas Parasuraman The authors are with the Heterogeneous Robotics Research Lab, School of Computing, University of Georgia, Athens, GA 30602, USA.Author emails: {ehsan.latif;ramviyas}@uga.edu
Abstract

Frontier exploration and reinforcement learning have historically been used to solve the problem of enabling many mobile robots to autonomously and cooperatively explore complex surroundings. These methods need to keep an internal global map for navigation, but they do not take into consideration the high costs of communication and information sharing between robots. This study offers CQLite, a novel distributed Q-learning technique designed to minimize data communication overhead between robots while achieving rapid convergence and thorough coverage in multi-robot exploration. The proposed CQLite method uses ad hoc map merging, and selectively shares updated Q-values at recently identified frontiers to significantly reduce communication costs. The theoretical analysis of CQLite’s convergence and efficiency, together with extensive numerical verification on simulated indoor maps utilizing several robots, demonstrates the method’s novelty. With over 2x reductions in computation and communication alongside improved mapping performance, CQLite outperformed cutting-edge multi-robot exploration techniques like Rapidly Exploring Random Trees and Deep Reinforcement Learning. Related codes are open-sourced at https://github.com/herolab-uga/cqlite.

Index Terms:
Multi-Robot Systems, Cooperative Robots, Exploration, Robot Communication

I Introduction

Map-based coverage and exploration is a significant problem of interest in the robotics and multi-robot systems (MRS) community [1]. In this problem, robots continuously explore to obtain the full environmental map in an unknown bounded environment. It can be helpful in various applications, including search and rescue, domestic service, survey and operations, field robotics, etc. Autonomous exploration and surveillance solutions can also demonstrate the adaptability of the MRS since robots can carry out these missions in different and uncharted areas. In such applications, the robots need efficient wireless network connectivity for robust cooperation in uncertain environments [2].

Recent works have been influential in realizing an efficient exploration objective. For example, information-based methods (e.g., [3]) typically use the Shannon entropy to describe the uncertainty of the environmental map and construct the optimization problems such that the robot’s control variable (e.g., velocity) is continuously optimized during the exploration process. On the other hand, frontier-based methods (e.g., [4]) involve deciding the robot’s next move (or path) by searching the frontier points on the border of free and unknown points. Often, these methods only produce approximate solutions due to optimization.

Integrating learning with planning solutions is promising, especially for robot exploration [5, 6]. In the reinforcement learning (RL) paradigm, robots can continuously improve competence and adapt to the dynamics of natural surroundings by observing the results of navigational choices made in the actual world [7]. On the other hand, cooperation among robots in an MRS can help achieve a complex mission through simple distributed approaches [8, 9].

Refer to caption
Figure 1: Overview of the distributed CQLite method for efficient multi-robot exploration, shown with an illustrative simulation.

This paper explores the intersection between learning and cooperation, designs a combined solution to achieve efficient map exploration, and provides theoretical support for fast convergence and time complexity. We leverage the benefits of learning-based paradigms for joint exploration. We aim to create a distributed algorithm that gains knowledge through robot-robot information sharing while minimizing communication and computing overheads. Specifically, we utilize a distributed Q-learning methodology with a coverage-biased reward function with a light communication and information fusion strategy. In our approach, we reduce communication complexity by sharing only the current state information, i.e., Q-value, instead of the complete Q-table as done in [10] and explored frontier. Fig. 1 provides an overview of the proposed method implemented in the Robot Operating System (ROS) framework. The main contributions of this paper are summarized below.

  • We propose a novel distributed coverage-biased Q-learning approach (CQLite) for efficient multi-robot map exploration, leading to a significant reduction of the communication and computation overheads at each robot. In our approach, we reduce communication complexity by sharing only the current state information, i.e., Q-value, instead of the complete Q-table as done in [10] and apply map merging in an ad-hoc manner.

  • We substantiate the potential of our method with theoretical proofs of fast convergence in learning.

  • With extensive simulation experiments, we evaluate the performance of our approach against two state-of-the-art (SOTA) multi-robot exploration methods: Rapidly-exploring Random Trees (RRT) for Optimized Exploration [11] and Deep Reinforcement Learning (DRL) for Voronoi-based Exploration [12].

  • We open source111https://github.com/herolab-uga/cqlite the CQLite as a ROS package for use and further development by the robotics community.

  • We confirm the findings through real-world robot experiments. The video of sample simulation experiments and real-world demonstrations are available at https://youtu.be/n3unL1nuieQ.

The key idea behind the CQLite is that it uses a coverage-biased reward function to perform efficient exploration by sharing limited information among robots in a distributed fashion. Our method achieves fast convergence with the best coverage performance, reduced communication, and update costs compared to the baselines.

II Related Work

Map exploration problems focus on frontier-based and learning-based coverage planning approaches. A robot can be greedily pushed in an occupancy grip-map to the closest boundaries [13] or to the most uncertain (or informative) regions [14]. In frontier-based strategies, the robots will look to expand coverage into the unexplored regions by choosing the next waypoints based on the frontier of the explored map boundaries [15]. For instance, in [11], the multi-robot map exploration objective is integrated into an optimization framework incorporating Rapidly-exploring Random Trees (RRTs) to increase the effectiveness and efficiency of multi-robot map exploration. However, the constraints of such frontier-based approaches are the computing expense of the optimization methods and the possibility of non-optimal outcomes resulting from RRTs’ stochastic characteristics.

Researchers also presented communication-efficient solutions for exploration in multi-robot systems. For instance, Zhang et al. [16] introduced the MR-TopoMap based on a topological map, which can independently explore the robot’s surroundings while sporadically exchanging topological maps when communication is possible. However, path planning through topological mapping can lead to a sub-optimal path and, specifically, in the case where robots start exploring from the same position, exploring the same map, making it difficult to divide the map into topologies. Masaba and Li [17] proposed an exploration algorithm using the topology of the generalized Voronoi graphs made efficient through a recurrent and lean communication strategy. Corah et al. [18] use information-based distributed planning considering communication restrictions. However, the planner’s finite-horizon nature could lead to suboptimal exploration paths because it doesn’t consider long-term planning beyond the given horizon, making it more difficult for the system to make decisions based on knowledge in the future. This might prevent robots from efficiently exploring or discovering key regions of interest. More recently, Gao et al. [19] reduced inter-robot communication costs by utilizing a mission-based protocol and centralized planning, where the former can actively disconnect robots to proceed with distributed (independent exploration) and the latter will help them achieve rendezvous to reconnect and share information. However, computing the super-frontier information is computationally expensive, and the active disconnection strategy may limit the robots from sharing other critical data during the mission.

A body of research concentrates on Reinforcement Learning (RL) and Q-learning for multi-robot tasks, modifying the learning mechanism in low communication scenarios for better navigation and exploration [20, 5] and utilizing deep reinforcement learning to achieve optimality in robotic exploration [21]. A Deep RL (DRL) approach for cooperative multi-robot exploration using Voronoi cells was proposed in [12]. Despite its intriguing concept, it was constrained by training difficulties and sub-optimal solution tendencies. In another study [22], researchers suggested using an improved Q-learning algorithm in RL map navigation to stop robots from lingering in the past. However, these methods call for frequent map merging, which raises the cost of updates. Further, DRLs have shown promise in some problem spaces, but they frequently offer less-than-ideal solutions outside those contexts. They cannot guarantee convergence in infinite horizons.

To address these gaps in the literature, CQLite considers an efficient information transfer mechanism combined with distributed Q-learning with a coverage-biased reward function for efficient multi-robot cooperation to solve map exploration tasks. CQLite departs from RRT (frontier-based) and DRL (learning-based) regarding exploration strategy by reducing recurrent frontier exploration to avoid mapping overlap and Q-learning update strategy for communication efficiency by only sharing and utilizing recently calculated Q-value to the robots, respectively. Additionally, in both RRT and DRL, robots share locally explored maps on every iteration and apply map merging, which consequently gives rise to computational complexity. We reduce this overhead by only sharing and applying map merging in an ad-hoc manner. By incorporating these novelties, our approach provides significant improvement in computation and communication efficiency, even in cases of limited connectivity scenarios.

It is worth noting that the objectives of cooperative simultaneous localization and mapping (SLAM) techniques and exploration approaches are fundamentally different. The SLAM problem focuses on accurately building and merging the map, while the exploration problem focuses on using the available map to determine waypoints to maximize coverage area. Cooperative SLAM techniques emphasize communication effectiveness. Although computational efficiency is still a problem, Liu et al. [23] presented a multi-agent SLAM technique that lowers bandwidth use. Others use spectral graph analysis for cooperative mapping but overlook the computational costs of graph formation and optimization [24]. In contrast, others concentrate on lifelong localization and mapping but fail to optimize the communication and computational cost [25]. Cooperative RL techniques, as those in [26], have difficulty keeping up with the rising computational complexity of growing state spaces. In our work, we use an existing map merging method222https://wiki.ros.org/multirobot_map_merge from the literature to perform multi-robot SLAM. At the same time, our proposed CQLite is designed to maximize exploration and lower communication and computation. Specifically, our CQLite method addresses the above limitations by delivering computational and communication efficiency through selective data sharing and utilizing effective Q-learning to determine the best exploration strategy.

Refer to caption
Figure 2: System architecture of CQLite distributed across several robots. It shows the Robot i𝑖iitalic_i’s process showing the mapping, frontier detection, and Q-learning operations along with the communication of local map and updated Q-value information to n𝑛nitalic_n connected robots.

III Proposed Approach

We consider n𝑛nitalic_n connected robots rV,|V|=nformulae-sequence𝑟𝑉𝑉𝑛r\in V,|V|=nitalic_r ∈ italic_V , | italic_V | = italic_n deployed at random starting locations in an unknown environment. The connectivity of the robots is expressed by a graph G=(V,E)𝐺𝑉𝐸G=(V,E)italic_G = ( italic_V , italic_E ), and a robot i𝑖iitalic_i can share data with their immediate neighbors 𝒩i={jV|(i,j)E}subscript𝒩𝑖conditional-set𝑗𝑉𝑖𝑗𝐸\mathcal{N}_{i}=\{j\in V|(i,j)\in E\}caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_j ∈ italic_V | ( italic_i , italic_j ) ∈ italic_E }. The robots must find and navigate toward the frontier position on their locally built map as a standard map exploration strategy. To accomplish this efficiently, a robot must decide which frontier to navigate after leaving its current explored region and is expected to reduce the number of steps to take and the size of data exchanges between robots.

Robots only share updated Q-value and the newly explored frontier with other robots in its range (neighbor set). Each robot keeps track of its local and shared frontiers to avoid re-exploration. Robots continue to generate local maps and share the newly developed map only when asked by other robots in case of the already explored frontier. Robots who cannot find new frontiers merge their local map with a map received from peers to build a global map using the feature similarity-based map merging technique [27]. The robot’s decisions regarding an action plan are based on the shared information and Q-learning computation. The whole procedure concerning robot i𝑖iitalic_i can be visualized in Fig. 2.

III-A Q-learning

Markov’s decision processes frequently model the robot’s interaction with the environment. A robot’s state is (x,y,θ,active/inactive)𝑥𝑦𝜃𝑎𝑐𝑡𝑖𝑣𝑒𝑖𝑛𝑎𝑐𝑡𝑖𝑣𝑒(x,y,\theta,active/inactive)( italic_x , italic_y , italic_θ , italic_a italic_c italic_t italic_i italic_v italic_e / italic_i italic_n italic_a italic_c italic_t italic_i italic_v italic_e ) in a global frame. Robots are localized and initialized in a global frame, and positions are known concerning virtually defined bounded regions that can be expanded based on exploration requirements. We consider the frontier’s position as states for exploration by applying efficient frontier detection [28]. We consider the current frontier’s position as the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and the newly explored frontier as the next state st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT. A robot can transition from state stSsubscript𝑠𝑡𝑆s_{t}\in Sitalic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_S to state st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT due to acting atAsubscript𝑎𝑡𝐴a_{t}\in Aitalic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_A based on its state at time t𝑡titalic_t. Robot action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to reach st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT from stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can be determined using discrete-time Hopfield function [29]. The transition probability is defined as T:S×A×S[0,1]:𝑇𝑆𝐴𝑆01T:S\times A\times S\rightarrow[0,1]italic_T : italic_S × italic_A × italic_S → [ 0 , 1 ]. The robot will receive a reward for each action using a reward function R:S×A×SR:𝑅𝑆𝐴𝑆𝑅R:S\times A\times S\rightarrow Ritalic_R : italic_S × italic_A × italic_S → italic_R specific to the task. The robot will have learned the course of action to take in each state and will be able to maximize the reward of the entire interaction process.

In Q-learning, all possible states and actions are created using the Q table, which then updates each value through iterative learning. The robot then chooses the best course of action for each state based on the values in the table. This approach is frequently utilized in path planning, chess, card games, and other activities.

Assumptions: For simplicity, we assume a flat ground terrain environment for exploration.

  • The robots have omnidirectional sensors to detect obstacle boundaries within a maximum sensing range rssubscript𝑟𝑠r_{s}italic_r start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT.

  • Each robot has a communication range, rc>>rsmuch-greater-thansubscript𝑟𝑐subscript𝑟𝑠r_{c}>>r_{s}italic_r start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT > > italic_r start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT forming a neighbor set 𝒩isubscript𝒩𝑖\mathcal{N}_{i}caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with constant connectivity to send and receive information about its relative position.

  • The connectivity graph G𝐺Gitalic_G is connected throughout exploration, which is practical to achieve in a multi-robot application. Since the proposed solution is distributed, it ensures maximum coverage even in partial disconnectivity, but at the cost of increased re-exploration.

Here, we introduce CQLite as a distributed method for robot i𝑖iitalic_i, which is now at state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at time t𝑡titalic_t and selects the following state as st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT to explore independently. CQLite begins with an empty Q-table and updates its table values as exploration proceeds; CQLite aims to achieve fast convergence based on the optimized Q-learning and reward function. Finding the action a𝑎aitalic_a that maximizes the Q-value for a specific state s𝑠sitalic_s is the goal of the maximum optimization function for Q-learning, i.e., a*=argmaxaQi(s,a)superscript𝑎subscriptargmax𝑎subscript𝑄𝑖𝑠𝑎a^{*}=\operatorname*{arg\,max}_{a}Q_{i}(s,a)italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s , italic_a ), where a*superscript𝑎a^{*}italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is the optimal action for a given state s𝑠sitalic_s.

The Q-learning algorithm updates the Q value as

Qi,t+1(st,at)subscript𝑄𝑖𝑡1subscript𝑠𝑡subscript𝑎𝑡\displaystyle{Q_{i,t+1}}\left({{s_{t}},{a_{t}}}\right)italic_Q start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
=(1α)Qi,t(st,at)+α[ri,t+γQi,t(st+1,a*)],absent1𝛼subscript𝑄𝑖𝑡subscript𝑠𝑡subscript𝑎𝑡𝛼delimited-[]subscript𝑟𝑖𝑡𝛾subscript𝑄𝑖𝑡subscript𝑠𝑡1superscript𝑎\displaystyle=(1-\alpha){Q_{i,t}}\left({{s_{t}},{a_{t}}}\right)+\alpha\left[{{% r_{i,t}}+\gamma{Q_{i,t}}\left({{s_{t+1}},{a^{*}}}\right)}\right],= ( 1 - italic_α ) italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_α [ italic_r start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT + italic_γ italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) ] , (1)

where ri,tsubscript𝑟𝑖𝑡r_{i,t}italic_r start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT is the reward received for taking action a𝑎aitalic_a in state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The α(0,1]𝛼01\alpha\in(0,1]italic_α ∈ ( 0 , 1 ] controls the balance between the coverage and delay, and γ𝛾\gammaitalic_γ is the discount factor to prioritize present vs. future rewards. This optimization function is used in the action selection step of the Q-learning, where the agent selects the action that maximizes its expected future reward.

The objective of the CQLite is to perform maximum coverage in less time and avoid overlapping exploration, which can be numerically defined as

maxπ{Paπ(t)λiEt(a|π)},𝑚𝑎subscript𝑥𝜋superscriptsubscript𝑃𝑎𝜋𝑡subscript𝜆𝑖subscript𝐸𝑡conditional𝑎𝜋\displaystyle max_{\pi}\{P_{a}^{\pi}(t)-\lambda_{i}E_{t}(a|\pi)\},italic_m italic_a italic_x start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT { italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_t ) - italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) } , (2)

where Paπ(t)superscriptsubscript𝑃𝑎𝜋𝑡P_{a}^{\pi}(t)italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_t ) is the probability to cover the unexplored region using for action a𝑎aitalic_a using policy π𝜋\piitalic_π at time t𝑡titalic_t, Et(a|π)subscript𝐸𝑡conditional𝑎𝜋E_{t}(a|\pi)italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) is estimated time to reach the state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT by taking action a𝑎aitalic_a at time t𝑡titalic_t in policy π𝜋\piitalic_π and λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the cost associated with each step taken by robot i𝑖iitalic_i. We have a vector path𝑝𝑎𝑡pathitalic_p italic_a italic_t italic_h extracted by containing position waypoints connecting stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT associated with a𝑎aitalic_a [30]. For each dimension of path𝑝𝑎𝑡pathitalic_p italic_a italic_t italic_h at each control instant t=tj𝑡subscript𝑡𝑗t=t_{j}italic_t = italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, we first compute the velocity command as:

vtj=Kpetj+KIt=t0tj(etj),subscript𝑣subscript𝑡𝑗subscript𝐾𝑝subscript𝑒subscript𝑡𝑗subscript𝐾𝐼superscriptsubscript𝑡subscript𝑡0subscript𝑡𝑗subscript𝑒subscript𝑡𝑗\displaystyle v_{t_{j}}=K_{p}\cdot e_{t_{j}}+K_{I}\sum_{t=t_{0}}^{t_{j}}(e_{t_% {j}}),italic_v start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_K start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ⋅ italic_e start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_e start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , (3)

where etj=st,jst,j1subscript𝑒subscript𝑡𝑗subscript𝑠𝑡𝑗subscript𝑠𝑡𝑗1e_{t_{j}}=s_{t,j}-s_{t,j-1}italic_e start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT - italic_s start_POSTSUBSCRIPT italic_t , italic_j - 1 end_POSTSUBSCRIPT represents the instantaneous error between the intermediate states associated with action a𝑎aitalic_a (i.e., the feedback) at time t=tj𝑡subscript𝑡𝑗t=t_{j}italic_t = italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Further, Kpsubscript𝐾𝑝K_{p}italic_K start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT and KIsubscript𝐾𝐼K_{I}italic_K start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT are the so-called proportional and integral gains of the motor controller that regulate the contributions of the corrections induced by the actual error and the error accumulated over time, respectively. These constant values determination is based on the motion constraints of our differential drive robots as discussed by Li et al. [30]. They can be different based on the robot’s physical and motion characteristics. In our case, we predetermined values of KPsubscript𝐾𝑃K_{P}italic_K start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT and KIsubscript𝐾𝐼K_{I}italic_K start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT as 2 and 0.5, respectively. Now we apply simple kinematic to find the estimate Et(a|π)subscript𝐸𝑡conditional𝑎𝜋E_{t}(a|\pi)italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) as:

Et(a|π)=j=1m(etj)2vtj,subscript𝐸𝑡conditional𝑎𝜋superscriptsubscript𝑗1𝑚superscriptsubscript𝑒subscript𝑡𝑗2subscript𝑣subscript𝑡𝑗E_{t}(a|\pi)=\sum_{j=1}^{m}\frac{(e_{t_{j}})^{2}}{v_{t_{j}}},italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT divide start_ARG ( italic_e start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_v start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG , (4)

where m=|𝒩i|𝑚subscript𝒩𝑖m=|\mathcal{N}_{i}|italic_m = | caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | is the number of intermediate neighbors of the robot i𝑖iitalic_i. To avoid the exploration of an already explored region for state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we determine P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) as:

P(stESt)=j=1mP(stesj)m,𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡superscriptsubscript𝑗1𝑚𝑃subscript𝑠𝑡𝑒subscript𝑠𝑗𝑚P(s_{t}\cap ES_{t})=\sum_{j=1}^{m}\frac{P(s_{t}\cap es_{j})}{m},italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT divide start_ARG italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_e italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_m end_ARG , (5)

where ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the set of explored states at time t𝑡titalic_t, esjESt𝑒subscript𝑠𝑗𝐸subscript𝑆𝑡es_{j}\in ES_{t}italic_e italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and m𝑚mitalic_m is the number of explored states and overlap probability of each explored state in ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can be determined as:

P(stesj)={1dist(st,esj)ri,s0dist(st,esj)>ri,s𝑃subscript𝑠𝑡𝑒subscript𝑠𝑗cases1𝑑𝑖𝑠𝑡subscript𝑠𝑡𝑒subscript𝑠𝑗subscript𝑟𝑖𝑠0𝑑𝑖𝑠𝑡subscript𝑠𝑡𝑒subscript𝑠𝑗subscript𝑟𝑖𝑠\displaystyle P(s_{t}\cap es_{j})=\begin{cases}1&dist(s_{t},es_{j})\leq r_{i,s% }\\ 0&dist(s_{t},es_{j})>r_{i,s}\\ \end{cases}italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_e italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = { start_ROW start_CELL 1 end_CELL start_CELL italic_d italic_i italic_s italic_t ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_e italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ≤ italic_r start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_d italic_i italic_s italic_t ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_e italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) > italic_r start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT end_CELL end_ROW (6)

At each discrete time step t𝑡titalic_t, the robot i𝑖iitalic_i acquires an observation stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from the environment, selects a corresponding action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, then receives feedback from the environment in the form of a reward rit=R(st,at)subscript𝑟𝑖𝑡𝑅subscript𝑠𝑡subscript𝑎𝑡r_{it}=R(s_{t},a_{t})italic_r start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT = italic_R ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) as shown below:

rit={λistEStλiQi,t+ρ(1P(stESt))+σri,cstEStsubscript𝑟𝑖𝑡casessubscript𝜆𝑖subscript𝑠𝑡𝐸subscript𝑆𝑡subscript𝜆𝑖subscript𝑄𝑖𝑡𝜌1𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡𝜎subscript𝑟𝑖𝑐subscript𝑠𝑡𝐸subscript𝑆𝑡\displaystyle r_{it}=\begin{cases}-\lambda_{i}&s_{t}\in ES_{t}\\ \lambda_{i}-Q_{i,t}+\rho(1-P(s_{t}\cap ES_{t}))+\sigma r_{i,c}&s_{t}\notin ES_% {t}\end{cases}italic_r start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT = { start_ROW start_CELL - italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_CELL start_CELL italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT + italic_ρ ( 1 - italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) + italic_σ italic_r start_POSTSUBSCRIPT italic_i , italic_c end_POSTSUBSCRIPT end_CELL start_CELL italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∉ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW (7)

Where P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the probability of overlap between the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and the already explored states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT by roboti𝑟𝑜𝑏𝑜subscript𝑡𝑖robot_{i}italic_r italic_o italic_b italic_o italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and other robots, ρ𝜌\rhoitalic_ρ is a scaling factor that controls the importance of minimizing the overlap, rcsubscript𝑟𝑐r_{c}italic_r start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is the communication range, and σ𝜎\sigmaitalic_σ is the scaling factor that determines the importance of maximizing the communication range. σ𝜎\sigmaitalic_σ depends upon the robot’s sensing capabilities and makes the reward function modular for heterogeneous robots with different sensing capabilities [31]. Then the state information is updated st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT. The goal of the RL is to select policy π𝜋\piitalic_π that maximizes the discounted sum of future rewards, i.e., Qπ(s1)=τt=1γtR(st,at)subscript𝑄𝜋subscript𝑠1superscriptsubscript𝜏𝑡1superscript𝛾𝑡𝑅subscript𝑠𝑡subscript𝑎𝑡Q_{\pi}(s_{1})=\sum_{\tau}^{t=1}\gamma^{t}R(s_{t},a_{t})italic_Q start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t = 1 end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_R ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), which according to the Bellman optimality principle satisfies.

The reward function in Eq. (7) produces a negative reward whenever the agent has looped back, and the calculated reward is based on the step-cost, Q-value, probability of overlap, and scaling factor otherwise.

Multi-Robot Lite Cooperation: We reduce the communication overhead amongst individual exploration-capable robots through a distributed approach, allowing each robot to make independent decisions based on local information and with little interaction from other robots. In our lite version of Q-learning, only the current state and Q-value are communicated amongst nearby robots to encourage cooperation. When another robot receives the information, it will update the received Q value in its Q table and update the local map. We develop a discovery approach based on the distance between simulated robots to replicate the network range in which we only share the current position of a robot i𝑖iitalic_i, its Q-Value for each direction, and mark the current situation as explored to avoid repetitive exploration.

III-B Exploration Strategy

Robots create a global Q table for each cell and action after searching the map and experiencing several experiences. The Q table is then turned into a weighted graph G=(𝒮,,𝒞)𝐺𝒮𝒞G=(\mathcal{S},\mathcal{E},\mathcal{C})italic_G = ( caligraphic_S , caligraphic_E , caligraphic_C ), where 𝒮={s1,s2,,sn}𝒮subscript𝑠1subscript𝑠2subscript𝑠𝑛\mathcal{S}=\{s_{1},s_{2},...,s_{n}\}caligraphic_S = { italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } denotes the set of states, and |𝒮|×|𝒮|𝒮𝒮\mathcal{E}\in|\mathcal{S}|\times|\mathcal{S}|caligraphic_E ∈ | caligraphic_S | × | caligraphic_S | signifies the set of edges whose elements indicate whether or not a path exists between the center points of each pair of states. It is assumed that robots do not exchange nodes during exploration, and Voronoi boundaries are fixed. Furthermore, 𝒞𝒞\mathcal{C}caligraphic_C is the weight matrix indicating the edge metric cost. The primary goal of discovering this study’s reduced graph and significant states is to optimally disperse robots over the coverage region by minimizing the relevant cost function. Because robots move at varying speeds, we formulate the cost as a function of the defined traveling time as t(spi,sq)=d(spi,sq)vi,subscript𝑡subscript𝑠subscript𝑝𝑖subscript𝑠𝑞subscript𝑑subscript𝑠subscript𝑝𝑖subscript𝑠𝑞subscript𝑣𝑖t_{(s_{p_{i}},s_{q})}=\frac{d_{(s_{p_{i}},s_{q})}}{v_{i}},italic_t start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT = divide start_ARG italic_d start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT end_ARG start_ARG italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG , where visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT robot’s speed, and d(spi,sq)𝒞subscript𝑑subscript𝑠subscript𝑝𝑖subscript𝑠𝑞𝒞d_{(s_{p_{i}},s_{q})}\in\mathcal{C}italic_d start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ∈ caligraphic_C is the Euclidean distance between the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT robot’s current state pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and state q𝑞qitalic_q. Furthermore, knowing the optimal path from state pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to state s𝑠sitalic_s, each robot’s overall optimal traveling time is the sum of the trip times (costs) from state pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to state s𝑠sitalic_s. This study’s shortest path between each pair of states is computed using the A* algorithm. Then the total time τ𝜏\tauitalic_τ is calculated by knowing path 𝒫={p1,p2,,pn}𝒫subscript𝑝1subscript𝑝2subscript𝑝𝑛\mathcal{P}=\{p_{1},p_{2},...,p_{n}\}caligraphic_P = { italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } as

τ(Sp,Sq)=t(sp,sp1)+t(sp1,sp2)++t(spn,sq).\displaystyle\tau_{(S_{p},S_{q})}={t_{(s_{p},s_{p_{1})}}+t_{(s_{p_{1}},s_{p_{2% }})}+...+t_{(s_{p_{n}},s_{q})}}.italic_τ start_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT = italic_t start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_t start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT + … + italic_t start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT . (8)

After determining the shortest time between each pair of states, the field is partitioned into M Voronoi subgraphs grisubscript𝑔subscript𝑟𝑖g_{r_{i}}italic_g start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT for i{1,2,,M}𝑖12𝑀i\in\{1,2,...,M\}italic_i ∈ { 1 , 2 , … , italic_M } to distribute work proportionally among M𝑀Mitalic_M robots. To that aim, the ideal Voronoi diagram grisubscript𝑔subscript𝑟𝑖g_{r_{i}}italic_g start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT for ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT robot, according to Lloyd’s algorithm, is a split of the area determined as:

gri={sqS|τ(spi,sq)τ(spj,sq),ij}.subscript𝑔subscript𝑟𝑖conditional-setsubscript𝑠𝑞𝑆formulae-sequencesubscript𝜏subscript𝑠subscript𝑝𝑖subscript𝑠𝑞subscript𝜏subscript𝑠subscript𝑝𝑗subscript𝑠𝑞for-all𝑖𝑗\displaystyle g_{r_{i}}=\{s_{q}\in S|\tau_{(s_{p_{i}},s_{q})}\leq{\tau_{(s_{p_% {j}},s_{q})}},\forall i\neq j\}.italic_g start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT = { italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ∈ italic_S | italic_τ start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ≤ italic_τ start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT , ∀ italic_i ≠ italic_j } . (9)

Where j𝑗jitalic_j is the other connected robot. The ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT robot is responsible for covering the state s𝑠sitalic_s (associated robot) in its sub-graph grisubscript𝑔subscript𝑟𝑖g_{r_{i}}italic_g start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT using the Voronoi partitioning result. The entire cost is then calculated as

λi,(p,gr)=qgriτ(spj,sq)ϕq,subscript𝜆𝑖𝑝subscript𝑔𝑟subscript𝑞subscript𝑔subscript𝑟𝑖subscript𝜏subscript𝑠subscript𝑝𝑗subscript𝑠𝑞subscriptitalic-ϕ𝑞\displaystyle\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\lambda_{i,(p,g_{r})}={% \sum_{q\in g_{r_{i}}}\tau_{(s_{p_{j}},s_{q})}\phi_{q}},\color[rgb]{0,0,0}% \definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}% \pgfsys@color@gray@fill{0}italic_λ start_POSTSUBSCRIPT italic_i , ( italic_p , italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_q ∈ italic_g start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , (10)

where ϕqsubscriptitalic-ϕ𝑞\phi_{q}italic_ϕ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT is the priority value associated with state sqsubscript𝑠𝑞s_{q}italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT. As the map turns into a graph, higher priority values are assigned to target states, while lower priority values are assigned to states far and already explored from the current state. The entire travel time (cost) will therefore be minimized, and an optimal solution will be obtained only when the current distance between the robot i𝑖iitalic_i and the target state sqsubscript𝑠𝑞s_{q}italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT, d(spi,sq)subscript𝑑subscript𝑠subscript𝑝𝑖subscript𝑠𝑞d_{(s_{p_{i}},s_{q})}italic_d start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT converges to zero. Algorithm 1 provides the pseudocode description of CQLite for efficient map exploration implemented in a distributed manner on each robot i𝑖iitalic_i.

1 Data: Reward matrix R𝑅Ritalic_R, learning rate α𝛼\alphaitalic_α, discount factor γ𝛾\gammaitalic_γ, step cost λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT;
2 Number of iterations: t=0𝑡0t=0italic_t = 0;
3 Initialize empty Q-table as Qisubscript𝑄𝑖Q_{i}italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for robot i𝑖iitalic_i; Initialize empty explored frontier list ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT;
4 Generate a local map using range sensor;
5 Initialize Explored Frontier Detected sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT as 0;
6 while  (ttmax)𝑡subscript𝑡𝑚𝑎𝑥(t\leq t_{max})( italic_t ≤ italic_t start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT ) and sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT<2 do
7       StSstartsubscript𝑆𝑡subscript𝑆𝑠𝑡𝑎𝑟𝑡S_{t}\rightarrow S_{start}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT → italic_S start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_t end_POSTSUBSCRIPT, step = 1;
8       Find new frontiers at new ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and update Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using locally explored map;
9       Qi,tsubscript𝑄𝑖𝑡Q_{i,t}italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT = list(0);
10       for each frontier s𝑠sitalic_s in Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT do
11             if s𝑠sitalic_s not in ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT then
12                   Calculate Q-value as qssubscript𝑞𝑠q_{s}italic_q start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT for actions assubscript𝑎𝑠a_{s}italic_a start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT to reach frontier s𝑠sitalic_s using Eq. (1);
13                   Append Qi,t(s,as)subscript𝑄𝑖𝑡𝑠subscript𝑎𝑠Q_{i,t}(s,a_{s})italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ( italic_s , italic_a start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT );
14            else
15                   Request for explored Maps;
16                  Merge maps into local maps;
17                   sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT = sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + 1;
18                  
19      if fd<2subscript𝑓𝑑2f_{d}<2italic_f start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT < 2 then
20             Set updated Q-value qupdatesubscript𝑞𝑢𝑝𝑑𝑎𝑡𝑒q_{update}italic_q start_POSTSUBSCRIPT italic_u italic_p italic_d italic_a italic_t italic_e end_POSTSUBSCRIPT as maxa(Qi,t(s,a))𝑚𝑎subscript𝑥𝑎subscript𝑄𝑖𝑡𝑠𝑎max_{a}(Q_{i,t}(s,a))italic_m italic_a italic_x start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( italic_Q start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ( italic_s , italic_a ) );
21             Update Qisubscript𝑄𝑖Q_{i}italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with qupdatesubscript𝑞𝑢𝑝𝑑𝑎𝑡𝑒q_{update}italic_q start_POSTSUBSCRIPT italic_u italic_p italic_d italic_a italic_t italic_e end_POSTSUBSCRIPT;
22             Take action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT associated with Qisubscript𝑄𝑖Q_{i}italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT;
23             Share qupdatesubscript𝑞𝑢𝑝𝑑𝑎𝑡𝑒q_{update}italic_q start_POSTSUBSCRIPT italic_u italic_p italic_d italic_a italic_t italic_e end_POSTSUBSCRIPT with connected robots;
24             Receive Q-value for ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from connected robots j𝒩i𝑗subscript𝒩𝑖j\in\mathcal{N}_{i}italic_j ∈ caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT;
25             Receive explored frontiers es1:n1𝑒subscript𝑠:1𝑛1es_{1:n-1}italic_e italic_s start_POSTSUBSCRIPT 1 : italic_n - 1 end_POSTSUBSCRIPT and update ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and es1:n1𝑒subscript𝑠:1𝑛1es_{1:n-1}italic_e italic_s start_POSTSUBSCRIPT 1 : italic_n - 1 end_POSTSUBSCRIPT;
26             Set new state associated with atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT;
27             Reset sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT as 0;
28            
Algorithm 1 Distributed CQLite Exploration

IV Theoretical Analysis

We theoretically analyze the convergence, efficiency, and time complexity of our proposed CQLite for learning-based multi-robot map exploration.

IV-A Convergence

We analyze the convergence of the target Q value update function Eq. (1). We denote the error ratio δt=MSE(Qt)Et(a|π)subscript𝛿𝑡𝑀𝑆𝐸subscript𝑄𝑡subscript𝐸𝑡conditional𝑎𝜋\delta_{t}=\frac{MSE(Q_{t})}{E_{t}(a|\pi)}italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_M italic_S italic_E ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) end_ARG, where MSE(Qt)𝑀𝑆𝐸subscript𝑄𝑡MSE(Q_{t})italic_M italic_S italic_E ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the calculated mean square error for Q-table at time t𝑡titalic_t and Et(a|π)subscript𝐸𝑡conditional𝑎𝜋E_{t}(a|\pi)italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) is the average number of steps to cover the region by taking action a𝑎aitalic_a at time t𝑡titalic_t for policy π𝜋\piitalic_π.

Theorem 1 (Convergence of Q-values): Using Eq. (1) for Q-value updates, then if 0δt10subscript𝛿𝑡10\leq\delta_{t}\leq 10 ≤ italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ 1, with probability 1e1𝑒1-e1 - italic_e, we have the estimated time to reach a given state as:

Et(a|π)ωE1(a|π)+ln(1/e)i=0t1ψi2(δti:t)2subscript𝐸𝑡conditional𝑎𝜋𝜔subscript𝐸1conditional𝑎𝜋1𝑒superscriptsubscript𝑖0𝑡1superscriptsubscript𝜓𝑖2subscript𝛿:𝑡𝑖𝑡2\displaystyle E_{t}(a|\pi)\leq\omega E_{1}(a|\pi)+\sqrt{\frac{\ln(1/e)\sum_{i=% 0}^{t-1}\psi_{i}^{2}(\delta_{t-i:t})}{2}}italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_π ) ≤ italic_ω italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_a | italic_π ) + square-root start_ARG divide start_ARG roman_ln ( 1 / italic_e ) ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_δ start_POSTSUBSCRIPT italic_t - italic_i : italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG 2 end_ARG end_ARG (11)

Here, ψi(δt1:t)=j=tit1(j+γδj)j=titjsubscript𝜓𝑖subscript𝛿:𝑡1𝑡superscriptsubscriptproduct𝑗𝑡𝑖𝑡1𝑗𝛾subscript𝛿𝑗superscriptsubscriptproduct𝑗𝑡𝑖𝑡𝑗\psi_{i}(\delta_{t-1:t})=\frac{\prod_{j=t-i}^{t-1}(j+\gamma\delta_{j})}{\prod_% {j=t-i}^{t}j}italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_δ start_POSTSUBSCRIPT italic_t - 1 : italic_t end_POSTSUBSCRIPT ) = divide start_ARG ∏ start_POSTSUBSCRIPT italic_j = italic_t - italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ( italic_j + italic_γ italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∏ start_POSTSUBSCRIPT italic_j = italic_t - italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_j end_ARG, αt=j=1t1(j+γδj)j=2tjsubscript𝛼𝑡superscriptsubscriptproduct𝑗1𝑡1𝑗𝛾subscript𝛿𝑗superscriptsubscriptproduct𝑗2𝑡𝑗\alpha_{t}=\frac{\prod_{j=1}^{t-1}(j+\gamma\delta_{j})}{\prod_{j=2}^{t}j}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ( italic_j + italic_γ italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∏ start_POSTSUBSCRIPT italic_j = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_j end_ARG and γ=0.95𝛾0.95\gamma=0.95italic_γ = 0.95.

Proof.

Our analysis is derived based on the subsequent (synchronous) Q-learning. In contrast to the conventional Q-learning, we swap out the current Qt𝑄𝑡Qtitalic_Q italic_t for the independent Q-function Q(s,a)superscript𝑄𝑠𝑎Q^{\prime}(s,a)italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s , italic_a ) for the target Qt(st,at)subscript𝑄𝑡subscript𝑠𝑡subscript𝑎𝑡Q_{t}(s_{t},a_{t})italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and note that if Qt(s,a)=Qsource*subscriptsuperscript𝑄𝑡𝑠𝑎subscriptsuperscript𝑄𝑠𝑜𝑢𝑟𝑐𝑒Q^{\prime}_{t}(s,a)=Q^{*}_{source}italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a ) = italic_Q start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s italic_o italic_u italic_r italic_c italic_e end_POSTSUBSCRIPT, we know that 0δt10subscript𝛿𝑡10\leq\delta_{t}\leq 10 ≤ italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ 1.
First, we break down the update role into:

Qt(st,at)subscript𝑄𝑡subscript𝑠𝑡subscript𝑎𝑡\displaystyle{Q_{t}}\left({{s_{t}},{a_{t}}}\right)italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
=(t1t)Qt1(s,a)+1t(rt+γmaxaQt1(s,a)\displaystyle=\left(\frac{t-1}{t}\right)Q_{t-1}(s,a)+\frac{1}{t}\big{(}r_{t}+% \gamma max_{a^{\prime}}Q^{\prime}_{t-1}(s^{\prime},a^{\prime})= ( divide start_ARG italic_t - 1 end_ARG start_ARG italic_t end_ARG ) italic_Q start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s , italic_a ) + divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ italic_m italic_a italic_x start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )
+γmaxaQt1*(s,a)γmaxaQt1*(s,a))\displaystyle+\gamma max_{a^{\prime}}Q^{*}_{t-1}(s^{\prime},a^{\prime})-\gamma max% _{a^{\prime}}Q^{*}_{t-1}(s^{\prime},a^{\prime})\big{)}+ italic_γ italic_m italic_a italic_x start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_γ italic_m italic_a italic_x start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )

Let εt(s,a)=Qt(s,a)Q*(s,a)subscript𝜀𝑡𝑠𝑎subscript𝑄𝑡𝑠𝑎superscript𝑄𝑠𝑎\varepsilon_{t}(s,a)=Q_{t}(s,a)-Q^{*}(s,a)italic_ε start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a ) = italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a ) - italic_Q start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_s , italic_a ) and
ξ(s)=γ×maxa(Qt1*(s,a))𝜉superscript𝑠𝛾𝑚𝑎subscript𝑥superscript𝑎subscriptsuperscript𝑄𝑡1superscript𝑠superscript𝑎\xi(s^{\prime})=\gamma\times max_{a^{\prime}}\left(Q^{*}_{t-1}(s^{\prime},a^{% \prime})\right)italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_γ × italic_m italic_a italic_x start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_Q start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) then recall the definition of δtsubscript𝛿𝑡\delta_{t}italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we will have εt(s,a)subscript𝜀𝑡𝑠𝑎\varepsilon_{t}(s,a)italic_ε start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a )

t1tεt1(s,a)+1t(ξ(s)Esξ(s))+1tγδtEt1absent𝑡1𝑡subscript𝜀𝑡1𝑠𝑎1𝑡𝜉superscript𝑠subscript𝐸superscript𝑠𝜉superscript𝑠1𝑡𝛾subscript𝛿𝑡subscript𝐸𝑡1\displaystyle\leq\frac{t-1}{t}\varepsilon_{t-1}(s,a)+\frac{1}{t}\big{(}\xi(s^{% \prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}+\frac{1}{t}\gamma\delta_{t}E_{t-1}≤ divide start_ARG italic_t - 1 end_ARG start_ARG italic_t end_ARG italic_ε start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_s , italic_a ) + divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ( italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) + divide start_ARG 1 end_ARG start_ARG italic_t end_ARG italic_γ italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_E start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT

As we know εt(s,a)Etsubscript𝜀𝑡𝑠𝑎subscript𝐸𝑡\varepsilon_{t}(s,a)\leq E_{t}italic_ε start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a ) ≤ italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, by applying maximization and recursion of E𝐸Eitalic_E, we will have:

Ett1+γδttEt1+1t(ξ(s)Esξ(s))subscript𝐸𝑡𝑡1𝛾subscript𝛿𝑡𝑡subscript𝐸𝑡11𝑡𝜉superscript𝑠subscript𝐸superscript𝑠𝜉superscript𝑠\displaystyle E_{t}\leq\frac{t-1+\gamma\delta_{t}}{t}E_{t-1}+\frac{1}{t}\big{(% }\xi(s^{\prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}italic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ divide start_ARG italic_t - 1 + italic_γ italic_δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_t end_ARG italic_E start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ( italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
j=1t1(j+γδj)j=2tjE1+i=1t1j=tit1(j+γδj)j=titjabsentsuperscriptsubscriptproduct𝑗1𝑡1𝑗𝛾subscript𝛿𝑗superscriptsubscriptproduct𝑗2𝑡𝑗subscript𝐸1superscriptsubscript𝑖1𝑡1superscriptsubscriptproduct𝑗𝑡𝑖𝑡1𝑗𝛾subscript𝛿𝑗superscriptsubscriptproduct𝑗𝑡𝑖𝑡𝑗\displaystyle\leq\frac{\prod_{j=1}^{t-1}(j+\gamma\delta_{j})}{\prod_{j=2}^{t}j% }E_{1}+\sum_{i=1}^{t-1}\frac{\prod_{j=t-i}^{t-1}(j+\gamma\delta_{j})}{\prod_{j% =t-i}^{t}j}≤ divide start_ARG ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ( italic_j + italic_γ italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∏ start_POSTSUBSCRIPT italic_j = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_j end_ARG italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT divide start_ARG ∏ start_POSTSUBSCRIPT italic_j = italic_t - italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ( italic_j + italic_γ italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∏ start_POSTSUBSCRIPT italic_j = italic_t - italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_j end_ARG
×(ξ(s)Esξ(s))absent𝜉superscript𝑠subscript𝐸superscript𝑠𝜉superscript𝑠\displaystyle\times\big{(}\xi(s^{\prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}× ( italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
=αtE1+i=1t1ψi(δ)(ξ(s)Esξ(s))absentsubscript𝛼𝑡subscript𝐸1superscriptsubscript𝑖1𝑡1subscript𝜓𝑖𝛿𝜉superscript𝑠subscript𝐸superscript𝑠𝜉superscript𝑠\displaystyle=\alpha_{t}E_{1}+\sum_{i=1}^{t-1}\psi_{i}(\delta)\big{(}\xi(s^{% \prime})-E_{s^{\prime}}\xi(s^{\prime})\big{)}= italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_δ ) ( italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_ξ ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )

According to weighted Hoeffding inequality [32], with prob- ability 1e1𝑒1-e1 - italic_e, we can prove Eq (11) for Theorem 1. ∎

This convergence result demonstrates the influence of the error ratio on the convergence rate. In other words, learning will go more quickly for our chosen Q value update function. Even though CQLite shares only updated Q-value, it still achieves the required convergence and provides an optimal strategy for robots to explore the map efficiently.

IV-B CQLite Efficiency

Proposition 1 (Q-table Update Efficiency): The CQLite Exploration method reduces the communication and computation cost for exploration by sharing and appending only updated Q-values and newly discovered frontiers to the local Q-table, which reduces communication and computation cost by 1n1𝑛\frac{1}{n}divide start_ARG 1 end_ARG start_ARG italic_n end_ARG than the cost of the SOTA approaches. Where n𝑛nitalic_n is the total number of possible states (size of Q-table).

Proof.

The CQLite approach reduces the size of the Q-table and the amount of data that needs to be transmitted between robots by sharing and appending only the updated Q-values and recently found frontiers to the Q-table. We prove this by comparing the data needs with that of the SOTA, where the full Q-table is shared between robots. The shared Q-value for a given state-action combination (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) in the Q-table will be Qi,jsubscript𝑄𝑖𝑗Q_{i,j}italic_Q start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT.

CQLite only updates Q-value once during the whole exploration, in contrast to SOTA as it updates each value in every iteration. Compared to sharing and updating the whole Q-table, the communication and computing costs are decreased by 1n1𝑛\frac{1}{n}divide start_ARG 1 end_ARG start_ARG italic_n end_ARG. The update cost of CQLite for Q-table with size n𝑛nitalic_n is n𝑛nitalic_n, but the update cost of SOTA is n2superscript𝑛2n^{2}italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ; hence the cost reduction relation case be written as:

Ci,CQLite=1nCi,SOTA,subscript𝐶𝑖𝐶𝑄𝐿𝑖𝑡𝑒1𝑛subscript𝐶𝑖𝑆𝑂𝑇𝐴\displaystyle C_{i,CQLite}=\frac{1}{n}C_{i,SOTA},italic_C start_POSTSUBSCRIPT italic_i , italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG italic_C start_POSTSUBSCRIPT italic_i , italic_S italic_O italic_T italic_A end_POSTSUBSCRIPT ,

where Ci,SOTAsubscript𝐶𝑖𝑆𝑂𝑇𝐴C_{i,SOTA}italic_C start_POSTSUBSCRIPT italic_i , italic_S italic_O italic_T italic_A end_POSTSUBSCRIPT is the communication and calculation cost of updating and sharing the whole Q-table in SOTA exploration techniques, and Ci,CQLitesubscript𝐶𝑖𝐶𝑄𝐿𝑖𝑡𝑒C_{i,CQLite}italic_C start_POSTSUBSCRIPT italic_i , italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT is the communication and computation cost of the CQLite Exploration method.

To further determine the effectiveness of the Q-table update in CQLite, the cost of sending the matrix Q𝑄Qitalic_Q over a network can be used to indicate the cost of sharing and updating the whole Q-table and can be stated as follows:

Ci,SOTA=κj=1n|Qi,j|,subscript𝐶𝑖𝑆𝑂𝑇𝐴𝜅superscriptsubscript𝑗1𝑛subscript𝑄𝑖𝑗\displaystyle C_{i,SOTA}=\kappa\cdot\sum\limits_{j=1}^{n}|Q_{i,j}|,italic_C start_POSTSUBSCRIPT italic_i , italic_S italic_O italic_T italic_A end_POSTSUBSCRIPT = italic_κ ⋅ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT | italic_Q start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT | ,

where |Qi,n|subscript𝑄𝑖𝑛|Q_{i,n}|| italic_Q start_POSTSUBSCRIPT italic_i , italic_n end_POSTSUBSCRIPT | is the absolute value of the Q-value of state j𝑗jitalic_j for robot i𝑖iitalic_i, and α𝛼\alphaitalic_α is a constant that denotes the cost of sending one unit of data across the network. SOTA requires all Q-values for policy determination; hence all Q-values are shared to update Q-table in every iteration.

The CQLite Exploration approach reduces the size of the matrix and the quantity of data that needs to be transferred by sharing and appending only the updated Q-values and newly found frontiers. Let Qsuperscript𝑄Q^{\prime}italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT be the updated matrix that only includes the new frontiers and updated Q-values. This modified matrix’s transmission cost can be expressed as

Ci,CQLite=κj=1n|Qi,j|.subscript𝐶𝑖𝐶𝑄𝐿𝑖𝑡𝑒𝜅superscriptsubscript𝑗1𝑛subscriptsuperscript𝑄𝑖𝑗\displaystyle C_{i,CQLite}=\kappa\cdot\sum\limits_{j=1}^{n}|Q^{\prime}_{i,j}|.italic_C start_POSTSUBSCRIPT italic_i , italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT = italic_κ ⋅ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT | italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT | .

Since Qsuperscript𝑄Q^{\prime}italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is a subset of Q𝑄Qitalic_Q, it can be concluded that j=1l|Qi,j|j=1l|Qi,j|superscriptsubscript𝑗1𝑙subscriptsuperscript𝑄𝑖𝑗superscriptsubscript𝑗1𝑙subscript𝑄𝑖𝑗\sum\limits_{j=1}^{l}|Q^{\prime}_{i,j}|\leq\sum\limits_{j=1}^{l}|Q_{i,j}|∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT | ≤ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | italic_Q start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT |, and therefore:

Ci,CQLite1nCi,SOTAsubscript𝐶𝑖𝐶𝑄𝐿𝑖𝑡𝑒1𝑛subscript𝐶𝑖𝑆𝑂𝑇𝐴\displaystyle C_{i,CQLite}\leq\frac{1}{n}C_{i,SOTA}italic_C start_POSTSUBSCRIPT italic_i , italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT ≤ divide start_ARG 1 end_ARG start_ARG italic_n end_ARG italic_C start_POSTSUBSCRIPT italic_i , italic_S italic_O italic_T italic_A end_POSTSUBSCRIPT

This proves that the CQLite exploration approach is more efficient regarding Q-table updating than the SOTA exploration methods like RRT and DRL. ∎

Proposition 2 (Mapping Efficiency): CQLite performs map sharing and merging with the probability P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), which requires <<iterationsmuch-less-thanabsent𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠<<{iterations}< < italic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s compared to relevant SOTA exploration approaches (e.g., RRT and DRL) for maximum exploration. Here, P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the probability of overlap between the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and the already explored states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT by roboti𝑟𝑜𝑏𝑜subscript𝑡𝑖robot_{i}italic_r italic_o italic_b italic_o italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and other robots iterations𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠iterationsitalic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s is the total number of iterations carried out by the algorithm.

Proof.

The probability of overlap P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) between the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of robot i𝑖iitalic_i and the previously explored states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT by other robots is used to determine if map sharing and merging will take place in the CQLite Exploration technique. This map merging and sharing aims to reduce the number of iterations and steps the algorithm must perform.

As part of the CQLite Exploration approach, the algorithm updates the map by combining shared maps regularly as follows:

fCQLite=P(stESt)iterationssubscript𝑓𝐶𝑄𝐿𝑖𝑡𝑒𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠f_{CQLite}={P(s_{t}\cap ES_{t})}\cdot{iterations}italic_f start_POSTSUBSCRIPT italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT = italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⋅ italic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s

Where fCQLitesubscript𝑓𝐶𝑄𝐿𝑖𝑡𝑒f_{CQLite}italic_f start_POSTSUBSCRIPT italic_C italic_Q italic_L italic_i italic_t italic_e end_POSTSUBSCRIPT is the frequency of map merging carried out by the algorithm in the CQLite Exploration method, and iterations𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠{iterations}italic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s is the mapping frequency of SOTA exploration methods like RRT and DRL.

The probability P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) can be derived using Bayes’ theorem as follows:

P(stESt)=P(EStst)P(st)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡𝑃conditional𝐸subscript𝑆𝑡subscript𝑠𝑡𝑃subscript𝑠𝑡P(s_{t}\cap ES_{t})=P(ES_{t}\mid s_{t})\cdot P(s_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_P ( italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⋅ italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )

Given the previously explored states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, P(EStst)𝑃conditional𝐸subscript𝑆𝑡subscript𝑠𝑡P(ES_{t}\mid s_{t})italic_P ( italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the conditional probability of the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and P(st)𝑃subscript𝑠𝑡P(s_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the probability of the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

P(st)𝑃subscript𝑠𝑡P(s_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) can be represented as a uniform distribution over the state space, assuming that the exploration process is a random walk, with:

P(st)=1n,𝑃subscript𝑠𝑡1𝑛P(s_{t})=\frac{1}{n},italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ,

where the state space’s overall state count is n𝑛nitalic_n.

The frequency of occurrence of the current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the previously investigated states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can be used to estimate the conditional probability P(ESt|st)𝑃conditional𝐸subscript𝑆𝑡subscript𝑠𝑡P(ES_{t}|s_{t})italic_P ( italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). If the frequency with which the present state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT occurs in the previously studied states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is fstsubscript𝑓subscript𝑠𝑡f_{s_{t}}italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT, then:

P(EStst)=fstne,𝑃conditional𝐸subscript𝑆𝑡subscript𝑠𝑡subscript𝑓subscript𝑠𝑡subscript𝑛𝑒P(ES_{t}\mid s_{t})=\frac{f_{s_{t}}}{n_{e}},italic_P ( italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = divide start_ARG italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_ARG ,

where nesubscript𝑛𝑒n_{e}italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT is the total number of states in the already explored states ESt𝐸subscript𝑆𝑡ES_{t}italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Substituting the above expressions into the equation for P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) gives:

P(stESt)𝑃subscript𝑠𝑡𝐸subscript𝑆𝑡\displaystyle P(s_{t}\cap ES_{t})italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ italic_E italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) =fstne1nabsentsubscript𝑓subscript𝑠𝑡subscript𝑛𝑒1𝑛\displaystyle=\frac{f_{s_{t}}}{n_{e}}\cdot\frac{1}{n}= divide start_ARG italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_ARG ⋅ divide start_ARG 1 end_ARG start_ARG italic_n end_ARG =fstnne<<iterationsabsentsubscript𝑓subscript𝑠𝑡𝑛subscript𝑛𝑒much-less-than𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠\displaystyle=\frac{f_{s_{t}}}{nn_{e}}<<{iterations}= divide start_ARG italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_n italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_ARG < < italic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s

For the total number of iterations𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠iterationsitalic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s CQLite only updates the map for fstnnesubscript𝑓subscript𝑠𝑡𝑛subscript𝑛𝑒\frac{f_{s_{t}}}{nn_{e}}divide start_ARG italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_n italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_ARG times and fst<nnesubscript𝑓subscript𝑠𝑡𝑛subscript𝑛𝑒f_{s_{t}}<nn_{e}italic_f start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT < italic_n italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT and nne𝑛subscript𝑛𝑒nn_{e}italic_n italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT is equal to the iterations𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠iterationsitalic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s in case of visiting each state at each iteration𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛iterationitalic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n. Hence, the above derivation proved that the CQLite method is more efficient as CQLite’s update frequency (frequency of map merging) is <<iterationsmuch-less-thanabsent𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠<<{iterations}< < italic_i italic_t italic_e italic_r italic_a italic_t italic_i italic_o italic_n italic_s in map sharing and merging than SOTA exploration methods like RRT and DRL. ∎

Both propositions signify that the CQLite exploration method is more efficient in computation, communication, and mapping operations compared to the state-of-the-art RL-based multi-robot exploration approaches.

IV-C Time Complexity

Assume that the grid factor is kgsubscript𝑘𝑔k_{g}italic_k start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT (resolution of the grid map on which the grid is divided) and that the target sub-map is k×l𝑘𝑙k\times litalic_k × italic_l in size. The grid map’s size is kgk×kglsubscript𝑘𝑔𝑘subscript𝑘𝑔𝑙k_{g}k\times k_{g}litalic_k start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT italic_k × italic_k start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT italic_l, and the total number of points is kg2klsuperscriptsubscript𝑘𝑔2𝑘𝑙k_{g}^{2}klitalic_k start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_k italic_l. The operations to find Q-values must be carried out cyclically kg2klsuperscriptsubscript𝑘𝑔2𝑘𝑙k_{g}^{2}klitalic_k start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_k italic_l times. kl𝑘𝑙klitalic_k italic_l can calculate and represent the CQLite’s state space size; however CQLite doesn’t perform a merging and searching strategy at every iteration; hence the length of the Q-value table is significantly less than (kl)𝑘𝑙(kl)( italic_k italic_l ) through selecting the training process, and the computing complexity of the algorithm is considerably less than O(kl)𝑂𝑘𝑙O(kl)italic_O ( italic_k italic_l ).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 3: A depiction of the outcome in a sample trial. It shows the map generated by three robots in the house world (left column) and three and six robots in the bookstore world (canter and right column, respectively) created by the three compared approaches; RRT (top), DRL (center), and CQLite (bottom), with robots moving in a simulated House and Bookstore worlds along with the following trajectories, start and end locations.

V Experimental Validation

Turtlebot3 robots are used to carry out the exploration plan, implemented using the ROS framework. The open-source openslam-gmapping333https://openslam-org.github.io/gmapping.html technique of the ROS gmapping package is used to create 2D maps. It uses odometer data and a particle filter method as its foundation. The local maps created by each robot are combined to create the global map. Feature-based map merging444http://wiki.ros.org/multirobot_map_merge is employed to merge maps when required. Frame conversion between the local map frames is necessary for map merger. The coordinate transformation correlation between the robots must be calibrated before combining local maps. In the current work, the global frame is one robot’s frame, and the relative positions and orientations of the robots are initialized to a known state. The ROS movebase555https://github.com/ros-planning/navigation package allows the robot to move toward the goal point while securely avoiding barriers between robots. The Dijkstra algorithm for global path planning and the Dynamic Window Approach (DWA) for local dynamic obstacle avoidance are both implemented in this package. In this study, the units of time and distance are in seconds and meters, respectively.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4: Computation (Left), Communication (Center) cost, and Exploration over time (Right) comparison plot of CQLite with RRT and DRL approach in three Gazebo simulated world. Row-wise: Top 3 robots in house world, Middle 3 robots in bookstore world, and Bottom 6 robots in bookstore world.
Refer to caption
Figure 5: Exploration map of the House world before, during, and after map sharing and merging corresponds to points (A, B, and C from Fig. 4) in Computation and Communication plots. Peaks demonstrate the request for map merging in CQLite for Computation and communication plots; RRT runs longer with persistent high communication and computational overhead but explores fewer regions than DRL and CQLite.
TABLE I: Performance Results of RRT, DRL, and the proposed CQLite. *{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT indicates the best performer.
Evaluation parameters Three robots in house world Three robots in bookstore world Six robots in bookstore world
RRT [11] DRL [12] CQLite (ours) RRT [11] DRL [12] CQLite (ours) RRT [11] DRL [12] CQLite (ours)
Mapping Time (s) 1208±52plus-or-minus1208521208\pm 521208 ± 52 924±*67{}^{*}\pm 67start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 67 1029±59plus-or-minus1029591029\pm 591029 ± 59 347±32plus-or-minus34732347\pm 32347 ± 32 324±21plus-or-minus32421324\pm 21324 ± 21 317±*19{}^{*}\pm 19start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 19 212±18plus-or-minus21218212\pm 18212 ± 18 267±29plus-or-minus26729267\pm 29267 ± 29 197±*13{}^{*}\pm 13start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 13
Path Length (m) 592±11plus-or-minus59211592\pm 11592 ± 11 604±19plus-or-minus60419604\pm 19604 ± 19 543±*9{}^{*}\pm 9start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 9 278±26plus-or-minus27826278\pm 26278 ± 26 235±29plus-or-minus23529235\pm 29235 ± 29 147±*21{}^{*}\pm 21start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 21 223±12plus-or-minus22312223\pm 12223 ± 12 196±17plus-or-minus19617196\pm 17196 ± 17 121±*11{}^{*}\pm 11start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT ± 11
Exploration Percentage (%) 87±4plus-or-minus87487\pm 487 ± 4 91±3plus-or-minus91391\pm 391 ± 3 𝟗𝟓*±3plus-or-minussuperscript𝟗𝟓3\textbf{95}^{*}\pm 395 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 3 90±5plus-or-minus90590\pm 590 ± 5 93±2plus-or-minus93293\pm 293 ± 2 𝟗𝟕*±2plus-or-minussuperscript𝟗𝟕2\textbf{97}^{*}\pm 297 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 2 93±4plus-or-minus93493\pm 493 ± 4 94±5plus-or-minus94594\pm 594 ± 5 𝟗𝟖*±2plus-or-minussuperscript𝟗𝟖2\textbf{98}^{*}\pm 298 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 2
Overlap Percentage (%) 51±5plus-or-minus51551\pm 551 ± 5 46±6plus-or-minus46646\pm 646 ± 6 𝟐𝟖*±2plus-or-minussuperscript𝟐𝟖2\textbf{28}^{*}\pm 228 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 2 57±8plus-or-minus57857\pm 857 ± 8 51±9plus-or-minus51951\pm 951 ± 9 𝟑𝟏*±6plus-or-minussuperscript𝟑𝟏6\textbf{31}^{*}\pm 631 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 6 47±7plus-or-minus47747\pm 747 ± 7 39±8plus-or-minus39839\pm 839 ± 8 𝟐𝟏*±6plus-or-minussuperscript𝟐𝟏6\textbf{21}^{*}\pm 621 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 6
MAP SSIM 0.73±0.12plus-or-minus0.730.120.73\pm 0.120.73 ± 0.12 0.89±0.08plus-or-minus0.890.080.89\pm 0.080.89 ± 0.08 0.91*±0.06plus-or-minussuperscript0.910.06\textbf{0.91}^{*}\pm 0.060.91 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.06 0.68±0.21plus-or-minus0.680.210.68\pm 0.210.68 ± 0.21 0.71±0.13plus-or-minus0.710.130.71\pm 0.130.71 ± 0.13 0.89*±0.08plus-or-minussuperscript0.890.08\textbf{0.89}^{*}\pm 0.080.89 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.08 0.71±0.17plus-or-minus0.710.170.71\pm 0.170.71 ± 0.17 0.73±0.15plus-or-minus0.730.150.73\pm 0.150.73 ± 0.15 0.93*±0.10plus-or-minussuperscript0.930.10\textbf{0.93}^{*}\pm 0.100.93 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.10
CPU Utilization (%) 112±22plus-or-minus11222112\pm 22112 ± 22 79±18plus-or-minus791879\pm 1879 ± 18 𝟒𝟐*±8plus-or-minussuperscript𝟒𝟐8\textbf{42}^{*}\pm 842 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 8 97±18plus-or-minus971897\pm 1897 ± 18 65±15plus-or-minus651565\pm 1565 ± 15 𝟑𝟒*±9plus-or-minussuperscript𝟑𝟒9\textbf{34}^{*}\pm 934 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 9 68±21plus-or-minus682168\pm 2168 ± 21 47±16plus-or-minus471647\pm 1647 ± 16 𝟐𝟔*±9plus-or-minussuperscript𝟐𝟔9\textbf{26}^{*}\pm 926 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 9
RAM (MB) 824±19plus-or-minus82419824\pm 19824 ± 19 1264±41plus-or-minus1264411264\pm 411264 ± 41 𝟔𝟔𝟓*±24plus-or-minussuperscript𝟔𝟔𝟓24\textbf{665}^{*}\pm 24665 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 24 624±16plus-or-minus62416624\pm 16624 ± 16 819±33plus-or-minus81933819\pm 33819 ± 33 𝟒𝟑𝟐*±21plus-or-minussuperscript𝟒𝟑𝟐21\textbf{432}^{*}\pm 21432 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 21 452±19plus-or-minus45219452\pm 19452 ± 19 724±38plus-or-minus72438724\pm 38724 ± 38 𝟑𝟏𝟗*±18plus-or-minussuperscript𝟑𝟏𝟗18\textbf{319}^{*}\pm 18319 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 18
COM Payload (MB) 2.2±0.08plus-or-minus2.20.082.2\pm 0.082.2 ± 0.08 2.4±0.06plus-or-minus2.40.062.4\pm 0.062.4 ± 0.06 0.6*±0.02plus-or-minussuperscript0.60.02\textbf{0.6}^{*}\pm 0.020.6 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.02 1.3±0.06plus-or-minus1.30.061.3\pm 0.061.3 ± 0.06 1.8±0.04plus-or-minus1.80.041.8\pm 0.041.8 ± 0.04 0.4*±0.01plus-or-minussuperscript0.40.01\textbf{0.4}^{*}\pm 0.010.4 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.01 1.1±0.04plus-or-minus1.10.041.1\pm 0.041.1 ± 0.04 1.3±0.05plus-or-minus1.30.051.3\pm 0.051.3 ± 0.05 0.2*±0.01plus-or-minussuperscript0.20.01\textbf{0.2}^{*}\pm 0.010.2 start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ± 0.01

V-A Simulation Setup:

A closed simulation environment based on the ROS Gazebo simulator with two indoor template environments is used: the Gazebo’s house world (\approx 250m2superscript𝑚2m^{2}italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT area) and the Amazon AWS bookstore world (\approx 100m2superscript𝑚2m^{2}italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT area). The robots may quickly finish the map exploration in a closed environment. Each robot has a laser scanner to gather data about its surroundings. The robots form a fully connected graph through a WiFi communication channel with a standard range of 40-60m. The robot’s trajectory is determined based on the fusion of wheel odometry and laser scan information.

The following parameters are used in the experiments in the simulated environment. The laser scanner’s range and ri,ssubscript𝑟𝑖𝑠r_{i,s}italic_r start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT are set to 15m15𝑚15m15 italic_m and 1m1𝑚1m1 italic_m, respectively. Additionally, the robot’s maximum linear and angular speeds in the simulation are set to 0.5ms10.5𝑚superscript𝑠10.5ms^{-1}0.5 italic_m italic_s start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT and π4rads1𝜋4𝑟𝑎𝑑superscript𝑠1\frac{\pi}{4}rads^{-1}divide start_ARG italic_π end_ARG start_ARG 4 end_ARG italic_r italic_a italic_d italic_s start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, respectively. The global detector’s growth factor η𝜂\etaitalic_η and the local detector’s growth factor η1subscript𝜂1\eta_{1}italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT in the RRT detector are set to 5m5𝑚5m5 italic_m and 3m3𝑚3m3 italic_m, respectively. The weight parameters, α=0.6𝛼0.6\alpha=0.6italic_α = 0.6, γ=0.95𝛾0.95\gamma=0.95italic_γ = 0.95 and λi=2subscript𝜆𝑖2\lambda_{i}=2italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 2 for 1m1𝑚1m1 italic_m distant step.

Each experiment was run for ten trials, with average observations reported. We evaluate the performance in the following three scenarios to validate the robustness and scalability of the proposed solution: 1) 3 robots in the house world, 2) 3 robots in the bookstore world, and 3) 6 robots in the bookstore world.

V-B Evaluation Metrics

The proposed CQLite and the methods put forward by RRT [11] and DRL [12] are compared in our experiments. We use the below metrics for a comprehensive evaluation:

  1. 1.

    Mapping Time: The amount of time spent mapping is a gauge of the efficiency of the exploration process;

  2. 2.

    Path Length: This term refers to the path length of all robot’s trajectories combined until exploration converges. The entire trajectory length gives an idea of the robot’s energy usage while subtly describing its investigation’s effectiveness;

  3. 3.

    Exploration Percentage: The percentage of the generated map with time elapsed;

  4. 4.

    Overlap Percentage: The percentage of the overlap of the explored map with time elapsed;

  5. 5.

    Map SSIM: Structural similarity index measure of generated maps compared with ground truth map to measure map correctness;

  6. 6.

    CPU Utilization: The maximum % consumption of the processor of a robot throughout the trajectory;

  7. 7.

    Memory Consumption (RAM): The maximum occupied memory by the robot throughout the trajectory;

  8. 8.

    COM payload: The size of the data communicated by a robot averaged over iterations.

VI Results and Discussion

We have reported each approach’s average performance after ten trials in each condition to reduce the measurement noise and analyze the statistical details. A sample of the mapping outcomes of the compared approaches with the trajectories followed by three robots in the simulated environment is shown in Fig. 3 and generated maps also delineate the map correctness. The outcome should be stated considering the average mapping time, distance traveled, and mapping efficiency. Mapping efficiency is determined by comparing with the original map, and reported percentages are normalized with gazebo world dimensions.

Table I provides a comparative analysis of different methods on all the performance metrics and the statistical data from the results. It also lists the theoretical (algorithmic) computational complexity. Figs. 4 shows the comparison of the approaches in the three key performance metrics: computation, communication, and exploration. Fig. 5 shows a zoomed-in view of the mapping process at three different closely timed instances.

The proposed CQLite reliably outperforms other strategies on the key performance metrics. CQLite covers a larger area in less time, improving mapping efficiency by 10% while traveling 22 fewer meters than RRT in the experiment. In three-robot scenarios, CQLite was more effective than DRL and RRT, with 9% and 8% shorter mapping times, respectively. Its path length was also less than DRL’s by about 38%. The advantages became even more apparent when the trial involved six robots. While the mapping time was around 26% faster than DRL and 7% faster than RRT for the bookhouse world, the path length was about 38% shorter than with DRL. However, the mapping time of DRL is 10% better than CQLite for house world because DRL is trained for the world and has an optimized path for coverage.

CQLite had an exploration percentage that was 4% greater than DRL in the three-robot scenario. This advantage persisted in the six-robot case, where CQLite’s exploration percentage was almost 4% higher than DRL’s while maintaining the lowest overlap percentage. The stability and effectiveness of CQLite in multi-robot exploration tasks are highlighted by these results from various experiments.

Refer to caption
Figure 6: Reward progression with iterations for three experiments

In the reward convergence (Fig. 6), the 3-robots in the house world show a steady progression, converging at around 850 iterations. The 3-robots in the bookstore demonstrate convergence near 280 iterations, while the 6-robots in the bookstore achieve stability quickest, by approximately 175 iterations. After these points, rewards in each experiment remain consistent, indicating optimal behavior within their respective environments. These convergence results imply the generalizability of the CQLite approach. Further, it can be noted that the local convergence of Q-values (Theorem 1) within the robot’s neighborhood set will lead to the group’s global convergence when the graph is connected [1, 33].

Communication-wise, CQLite’s strategy is more effective. Contrary to RRT and DRL, which exchange locally explored maps continually. CQLite showed a significant reduction of more than 80% in the communication payload (average data size) shared between the robots. Notably, CQLite continues to explore at a constant rate even after reaching 60% coverage, in contrast to RRT, which slows down. This dominance carries over into a real-life three-robot bookshop scenario, where it outperformed DRL and RRT regarding reduced mapping time and shorter journey distances. Results have validated the practicality of CQLite by surpassing DRL and RRT in terms of most of the performance matrices in all scenarios. Further, demonstrating its efficacy and applicability on resource-constrained robots, CQLite maintained decreased RAM, CPU, and communication payload usage. CQLite demonstrates its power in managing a range of multi-robot exploration scenarios by offering improved map quality, as higher MAP SSIM ratings indicate. It is particularly appropriate in situations when there are significant communication and resource constraints.

Refer to caption
Figure 7: Sample outcome of the CQLite exploration with two Turtlebot3 robots in a real-world scenario with complex obstacle configuration.

One of the limitations of the proposed CQLite is that it relies on wireless communication, which can be intermittent or harsh in specific real-world situations. In such scenarios, a communication-aware strategy can be integrated with our approach to tolerate changes in communication channels.

VI-A Real Robot Demonstration

Two Turtlebot3 robots are used for real-world map exploration using the ROS Noetic framework. Robots can share information about the odometry and map output of their respective SLAM by subscribing to specific topic messages in ROS. Experiments are performed in a small-scale lab setting of 10m210superscript𝑚210m^{2}10 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. We tested the CQLite under two scenarios: 1) a simplistic setup where one robot can cover the entire map without the other robot needing to move; 2) a complex setup where obstacles obstruct both robots in their initial positions, necessitating both robots’ contributions (see Fig. 7). Both scenarios were successfully tested by exploring the entire map, as can be seen in this video: https://youtu.be/n3unL1nuieQ. We believe this provides further evidence for the practicality of the proposed exploration approach.

VII Conclusion

This paper proposed CQLite, a distributed Q-learning strategy for multi-robot exploration created to get around the excessive communication and computation complexity and expense of learning-based systems. CQLite, which reduces communication and processing overhead by simply sharing Q-value and mapping data over the network, performs well in practical applications. Experimental results revealed that it ensures comprehensive coverage, quick convergence, and cheaper computing costs compared to well-liked RRT and DRL techniques. The same mapping efficiency was attained with only half the CPU load and 80% less communication overhead. In the future, we will examine the reward generality of CQLite to cope with various multi-robot applications.

References

  • [1] W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Coordinated multi-robot exploration,” IEEE Transactions on robotics, vol. 21, no. 3, pp. 376–386, 2005.
  • [2] R. Parasuraman, S. Caccamo, F. Båberg, P. Ögren, and M. Neerincx, “A new ugv teleoperation interface for improved awareness of network connectivity and physical surroundings,” Journal of Human-Robot Interaction, vol. 6, no. 3, pp. 48–70, 2017.
  • [3] B. Fang, J. Ding, and Z. Wang, “Autonomous robotic exploration based on frontier point optimization and multistep path planning,” IEEE Access, vol. 7, pp. 46 104–46 113, 2019.
  • [4] A. Dai, S. Papatheodorou, N. Funk, D. Tzoumanikas, and S. Leutenegger, “Fast frontier-based information-driven autonomous exploration with an mav,” in 2020 IEEE international conference on robotics and automation (ICRA).   IEEE, 2020, pp. 9570–9576.
  • [5] E. Latif, W. Song, and R. Parasuraman, “Communication-efficient reinforcement learning in swarm robotic networks for maze exploration,” in IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2023, pp. 1–6.
  • [6] R. Shrestha, F.-P. Tian, W. Feng, P. Tan, and R. Vaughan, “Learned map prediction for enhanced mobile robot exploration,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 1197–1204.
  • [7] Z. Zhang, X. Wang, Q. Zhang, and T. Hu, “Multi-robot cooperative pursuit via potential field-enhanced reinforcement learning,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 8808–8814.
  • [8] E. Tolstaya, J. Paulos, V. Kumar, and A. Ribeiro, “Multi-robot coverage and exploration using spatial graph neural networks,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 8944–8950.
  • [9] Q. Yang, Z. Luo, W. Song, and R. Parasuraman, “Self-reactive planning of multi-robots with dynamic task assignments,” in 2019 International Symposium on Multi-Robot and Multi-Agent Systems (MRS).   IEEE, 2019, pp. 89–91.
  • [10] A. K. Sadhu and A. Konar, “An efficient computing of correlated equilibrium for cooperative q𝑞qitalic_q-learning-based multi-robot planning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 8, pp. 2779–2794, 2018.
  • [11] L. Zhang, Z. Lin, J. Wang, and B. He, “Rapidly-exploring random trees multi-robot map exploration under optimization framework,” Robotics and Autonomous Systems, vol. 131, p. 103565, 2020.
  • [12] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, “Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 14 413–14 423, 2020.
  • [13] W. Gao, M. Booker, A. Adiwahono, M. Yuan, J. Wang, and Y. W. Yun, “An improved frontier-based approach for autonomous exploration,” in 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV).   IEEE, 2018, pp. 292–297.
  • [14] A. Bouman, M. F. Ginting, N. Alatur, M. Palieri, D. D. Fan, T. Touma, T. Pailevanian, S.-K. Kim, K. Otsu, J. Burdick et al., “Autonomous spot: Long-range autonomous exploration of extreme environments with legged locomotion,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 2518–2525.
  • [15] E. Latif and R. Parasuraman, “Seal: Simultaneous exploration and localization for multi-robot systems,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 5358–5365.
  • [16] Z. Zhang, J. Yu, J. Tang, Y. Xu, and Y. Wang, “Mr-topomap: Multi-robot exploration based on topological map in communication restricted environment,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10 794–10 801, 2022.
  • [17] K. Masaba and A. Q. Li, “Gvgexp: Communication-constrained multi-robot exploration system based on generalized voronoi graphs,” in 2021 International Symposium on Multi-Robot and Multi-Agent Systems (MRS).   IEEE, 2021, pp. 146–154.
  • [18] M. Corah, C. O’Meadhra, K. Goel, and N. Michael, “Communication-efficient planning and mapping for multi-robot exploration in large environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1715–1721, 2019.
  • [19] Y. Gao, Y. Wang, X. Zhong, T. Yang, M. Wang, Z. Xu, Y. Wang, Y. Lin, C. Xu, and F. Gao, “Meeting-merging-mission: A multi-robot coordinate framework for large-scale communication-limited exploration,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 13 700–13 707.
  • [20] A. Serra-Gómez, B. Brito, H. Zhu, J. J. Chung, and J. Alonso-Mora, “With whom to communicate: learning efficient communication for multi-robot collision avoidance,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 11 770–11 776.
  • [21] R. Han, S. Chen, and Q. Hao, “Cooperative multi-robot navigation in dynamic environment with deep reinforcement learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 448–454.
  • [22] X. Yu, Y. Wu, and X.-M. Sun, “A navigation scheme for a random maze using reinforcement learning with quadrotor vision,” in 2019 18th European Control Conference (ECC).   IEEE, 2019, pp. 518–523.
  • [23] J. Liu, K. Chen, R. Liu, Y. Yang, Z. Wang, and J. Zhang, “Robust and accurate multi-agent slam with efficient communication for smart mobiles,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 2782–2788.
  • [24] L. Bernreiter, S. Khattak, L. Ott, R. Siegwart, M. Hutter, and C. Cadena, “Collaborative robot mapping using spectral graph analysis,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 3662–3668.
  • [25] M. Zhao, X. Guo, L. Song, B. Qin, X. Shi, G. H. Lee, and G. Sun, “A general framework for lifelong localization and mapping in changing environment,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 3305–3312.
  • [26] Y. Jia, H. Luo, F. Zhao, G. Jiang, Y. Li, J. Yan, Z. Jiang, and Z. Wang, “Lvio-fusion: A self-adaptive multi-sensor fusion slam framework using actor-critic method,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 286–293.
  • [27] J. G. Mangelson, D. Dominic, R. M. Eustice, and R. Vasudevan, “Pairwise consistent measurement set maximization for robust multi-robot map merging,” in 2018 IEEE international conference on robotics and automation (ICRA).   IEEE, 2018, pp. 2916–2923.
  • [28] M. Keidar and G. A. Kaminka, “Efficient frontier detection for robot exploration,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 215–236, 2014.
  • [29] Z. Uykan, “On the working principle of the hopfield neural networks and its equivalence to the gadia in optimization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3294–3304, 2019.
  • [30] J. Li, “Faster parallel algorithm for approximate shortest path,” in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 308–321.
  • [31] S. S. OV, R. Parasuraman, and R. Pidaparti, “Impact of heterogeneity in multi-robot systems on collective behaviors studied using a search and rescue problem,” in 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR).   IEEE, 2020, pp. 290–297.
  • [32] P. Duda, M. Jaworski, L. Pietruczuk, and L. Rutkowski, “A novel application of hoeffding’s inequality to decision trees construction for data streams,” in 2014 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2014, pp. 3324–3330.
  • [33] R. Konda, H. M. La, and J. Zhang, “Decentralized function approximated q-learning in multi-robot systems for predator avoidance,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6342–6349, 2020.