Review
Multi-Agent Reinforcement Learning: A Review of Challenges
and Applications
Lorenzo Canese † , Gian Carlo Cardarilli, Luca Di Nunzio † , Rocco Fazzolari † , Daniele Giardino † ,
Marco Re † and Sergio Spanò *,†
Department of Electronic Engineering, University of Rome “Tor Vergata”, Via del Politecnico 1, 00133 Rome, Italy;
[email protected] (L.C.); [email protected] (G.C.C.);
[email protected] (L.D.N.); [email protected] (R.F.); [email protected] (D.G.);
[email protected] (M.R.)
* Correspondence: [email protected]; Tel.: +39-06-7259-7273
† These authors contributed equally to this work.
Abstract: In this review, we present an analysis of the most used multi-agent reinforcement learn-
ing algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the
most critical issues that must be taken into account in their extension to multi-agent scenarios. The
analyzed algorithms were grouped according to their features. We present a detailed taxonomy of
the main multi-agent approaches proposed in the literature, focusing on their related mathematical
models. For each algorithm, we describe the possible application fields, while pointing out its pros
and cons. The described multi-agent algorithms are compared in terms of the most important char-
acteristics for multi-agent reinforcement learning applications—namely, nonstationarity, scalability,
and observability. We also describe the most common benchmark environments used to evaluate the
performances of the considered methods.

Keywords: machine learning; reinforcement learning; multi-agent; swarm

1. Introduction

In the field of machine learning (ML), reinforcement learning (RL) has attracted the attention of the scientific community owing to its ability to solve a wide range of tasks by using a simple architecture and without the need for prior knowledge of the dynamics of the problem to solve. RL has found uses in many applications, from finance [1] and robotics [2–4], to natural language processing [5] and telecommunications [6]. The core of a RL system is the agent that operates in an environment that models the task that it has to fulfill. In all of the above applications, the RL agents interact with the environment via a trial and error approach, within which they receive rewards (reinforcement) for their actions. This mechanism, similar to human learning, guides the agent to the improvement of its future decisions in order to maximize the upcoming rewards. Despite the success of this approach, a large number of real-world problems cannot be fully solved by a single active agent that interacts with the environment; the solution to that problem is the multi-agent system (MAS), in which several agents learn concurrently how to solve a task by interacting with the same environment [7]. In Figure 1, we show the representation of the RL structure for a single agent and for an MAS.

MASs can be used in several fields, for example, traffic control, network packet routing, energy distribution, systems of robots, economic modeling, and the analysis of social dilemmas. For these reasons, in the last few years, researchers have attempted to extend the existing single-agent RL algorithms to multi-agent approaches. Empirical evaluations, however, have shown that a direct implementation of single-agent RL to several agents cannot converge to optimal solutions, because the environment is no longer stationary from each agent's perspective. In fact, an action performed by a certain agent can
yield different rewards depending on the actions taken by the other agents. This challenge
is called the non-stationarity of the environment and is the main problem to address in order
to develop an efficient multi-agent RL (MARL) algorithm.
Figure 1. (a) In the single-agent RL paradigm, an agent interacts with an environment by performing
an action for which it receives a reward. (b) In the MARL paradigm, from the agent’s point of view,
the other agents may be considered to be part of the environment, which of course changes due to
the actions of all the agents.
Even when convergence is obtained, typically, this kind of algorithm can maintain
acceptable performance in terms of the quality of the policies derived and speed of conver-
gence only if a restricted number of agents is involved. The scalability to a high number of
agents is an essential feature that must be taken into account when developing algorithms
that can be applied to real-world problems [8].
In this survey, we present an introduction to multi-agent reinforcement learning. We
focus on the models used to describe the framework environment and how to adapt
the most relevant single-agent reinforcement learning techniques for multi-agent settings.
Below, we present an assortment of MARL algorithms that address the above-mentioned
challenges of non-stationarity and scalability. We then address partially observable envi-
ronments. For MAS, partial observability is far more common than in the single-agent
setting; thus, it is crucial for the development of algorithms that can be applied to real-
world problems. Finally, we introduce an overview of the most common benchmarking
environments used to evaluate the performances of RL algorithms. This work is intended
to be an introduction to multi-agent reinforcement learning, introducing the main chal-
lenges of the field and the main solutions adopted in the literature. Finally, we describe
typical applications of MARL. While the research into MARL is still at an early stage and is
often not supported by theoretical proof, it has shown promising progress in terms of its
application. It could be considered a novel approach to achieve systems that are capable of
helping humans perform complex tasks, such as working in hazardous environments, and
exhibit general artificial intelligence.
2. Background
In the field of machine learning, reinforcement learning stands apart from the classic
paradigm based on learning from examples. It adopts a trial and error procedure using
a reward provided by an interpreter that observes the interaction of the agent with the
environment. It takes inspiration from research into animal learning [9] combined with
theoretical concepts of optimal control theory.
The interaction between the agent and the environment can be described by three
fundamental elements: state, actions, and reward. The state represents a particular configu-
ration of the environment, the actions are the options available to the agent to interact with and modify the environment, and the reward is a signal used to define the task of the agent and is what motivates the agent to pick one action over the others.
The learning process is iterative. The agent senses the environment while collecting
its current state. The agent performs an action, thereby reaching the next state, and it
receives a reward based on the combination of the state and selected action. This process is
then repeated. The agent adopts a policy to determine the best action to select, which is a mapping from every possible state of the environment to the action to be selected.
The reward, however, is not sufficient to determine the optimal policy because an
instantaneous reward does not give information about the future rewards that a specific
action can lead to—that is, the long-term profit. For this reason, it is useful to introduce a
new kind of reward: the return value R_t. If we denote by r_t the reward received by the agent at time-step t, the return value over a finite-length time horizon T is defined as
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T = \sum_{i=0}^{T-t-1} \gamma^i r_{t+i+1} \qquad (1)
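As a concrete illustration, the finite-horizon return of Equation (1) can be computed directly from a list of rewards; the following minimal Python sketch (all names are ours) sums the rewards collected after time-step t with geometrically decaying weights.

```python
def discounted_return(rewards, t, gamma):
    """Return value R_t of Equation (1).

    rewards[i] is the reward r_{i+1} received after time-step i, so the
    slice rewards[t:] contains r_{t+1}, ..., r_T.
    """
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

# Toy usage with T = 4 rewards and gamma = 0.9:
print(discounted_return([1.0, 0.0, 0.0, 10.0], t=0, gamma=0.9))  # 1 + 0.9**3 * 10
```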
Sometimes, the return value is also considered for non-finite time horizons.
R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \qquad (2)
where γ is a discount factor such that 0 ≤ γ < 1. To evaluate the quality of a particular
state or state–action pair, it is possible to define two value functions. In particular, under a
policy π, the value function of the state is calculated as
and
Q^{\pi}(s, a) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a) \left( W_{s_t \to s_{t+1} \mid a} + \gamma \sum_{a'} \pi(s_{t+1}, a') \, Q^{\pi}(s_{t+1}, a') \right) \qquad (6)
q_{xy} \equiv \Pr\{ Y_t = y \mid X_t = x \} \qquad (7)
A POMDP is well suited to modeling a large variety of RL problems, even though it is an intrinsically more complex model than an MDP.
2.1.4. Dec-POMDP
A decentralized partially observable Markov decision process (Dec-POMDP) is defined by the tuple ⟨I, S, A, T, R, Ω, O, γ⟩, where
• I is the set of n agents;
• S is the state space;
• A = ×i A(i) is the joint action space;
• The critic, which has the task of estimating the value function, typically using TD
methods;
• The actor, which represents the parameterized policy and updates its action distribu-
tion in the direction "suggested" by the critic using a policy gradient.
The critic outputs a signal that changes the action selection preferences in order to choose actions related to higher value functions more frequently. This kind of algorithm
presents a reduced variance in the gradient estimates due to the use of the value function
information to guide the policy evolution. In addition, minimal computation is required
to select an action as there is no need to compare different state–action values, and it is
possible to learn an explicitly stochastic policy [16].
All of these methods, as summarized in Figure 3, have been used with success in the
solution of RL problems, and their effectiveness is dependent on the type of problem.
2.2.1. Q-Learning
Q-learning [15] is a form of model-free, value-based reinforcement learning. It takes
inspiration from dynamic programming and methods of temporal differences such as
TD(0) [17]. The Q-learning agent faces the task of determining an optimal policy that
maximizes the total discounted reward; this policy π ∗ is such that
V^{*}(s) \equiv V^{\pi^{*}}(s) = \max_{a} \left\{ R_s(a) + \gamma \sum_{s'} P_{ss'}[a] \, V^{\pi^{*}}(s') \right\} \qquad (9)
where R_s(a) is the average reward received by an agent in the state s if it selects the action a, and P_{ss'}[a] is the transition probability from the state s to s', given the action a. The objective is to estimate the state–action pair value function (the Q-values); this is done by creating a matrix Q_n(s, a) of dimensions S × A containing the estimates of all the value functions at
the time-step n. At each iteration n of the learning process, the Q-values are updated using
the equation
Q_n(s, a) = \begin{cases} (1 - \alpha_n) Q_{n-1}(s, a) + \alpha_n \left[ r_n + \gamma V_{n-1}(s_{n+1}) \right] & \text{if } s = s_n \text{ and } a = a_n \\ Q_{n-1}(s, a) & \text{otherwise} \end{cases} \qquad (10)
It was shown in [18] that, under the assumption of bounded rewards |rn | ≤ R and
learning rates 0 ≤ αn < 1 such that
\sum_{i=1}^{\infty} \alpha_{n^i(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \left[ \alpha_{n^i(s,a)} \right]^2 < \infty \quad \forall s, a, \qquad (11)
the estimates Qn (s, a) will converge to the optimal Q-value Q∗ (s, a) with probability 1.
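For concreteness, a minimal tabular sketch of the update in Equation (10) is shown below; it assumes small discrete state and action spaces and a fixed learning rate (the decaying rates required by Equation (11) are omitted), and all names are illustrative.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning update: only the visited (s, a) entry changes, as in
    Equation (10); V_{n-1}(s_{n+1}) is taken as max_a' Q(s_{n+1}, a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# Toy usage: 5 states, 3 actions.
Q = np.zeros((5, 3))
Q = q_learning_step(Q, s=0, a=2, r=1.0, s_next=1)
```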
2.2.2. REINFORCE
REINFORCE is a class of episodic reinforcement learning algorithms that provide the simplest implementations of policy gradient methods; for this reason, they are also known as vanilla policy gradient [19]. The policy is directly optimized without the need to
estimate the value functions. The policy π is parameterized with a set of weights θ such that
π(s; θ) ≡ π_θ(s), and it represents a probability distribution over actions given the state. Regardless
of the architecture of the parameterized policy, all REINFORCE algorithms use the same
weight update procedure:
\theta \leftarrow \theta + \alpha (R_t - b) \nabla_{\theta} \ln \pi_{\theta} \qquad (12)
where α is a non-negative learning parameter, Rt is the discounted return value, and b is
the baseline, which is used to reduce the variance of the gradient estimation. b is a function of the state (or simply a constant). The steps to implement a REINFORCE
algorithm are the following:
1. Initialize the policy parameters θ at random;
2. Use πθ to generate a trajectory, which is a sequence of states, actions and rewards,
π_θ: s_1, a_1, r_2, s_2, . . . , r_T;
3. For each time-step t = 1, 2, . . . , T:
   (a) Estimate the return R_t;
   (b) Update the policy parameters using Equation (12);
4. Iterate the process.
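The procedure above can be written compactly for a linear-softmax policy over a discrete action set; the sketch below is our own illustration of Equation (12) (zero baseline, illustrative names), not the original REINFORCE implementation.

```python
import numpy as np

def softmax_policy(theta, s_feat):
    """Action probabilities of a linear-softmax policy pi_theta(a | s)."""
    logits = theta @ s_feat                 # one logit per action
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, trajectory, alpha=0.01, gamma=0.99, baseline=0.0):
    """Apply Equation (12) at every time-step of one generated trajectory.

    trajectory is a list of (state_features, action, reward) tuples, where the
    reward is the one received after taking the action (i.e., r_{t+1})."""
    rewards = [r for _, _, r in trajectory]
    for t, (s_feat, a, _) in enumerate(trajectory):
        R_t = sum(gamma ** i * r for i, r in enumerate(rewards[t:]))  # return estimate
        p = softmax_policy(theta, s_feat)
        grad_log = -np.outer(p, s_feat)     # gradient of log pi_theta(a_t | s_t) ...
        grad_log[a] += s_feat               # ... for a linear-softmax parameterization
        theta += alpha * (R_t - baseline) * grad_log
    return theta

theta = np.zeros((4, 3))                    # 4 actions, 3 state features
trajectory = [(np.array([1.0, 0.0, 0.5]), 2, 1.0), (np.array([0.0, 1.0, 0.2]), 1, 0.0)]
theta = reinforce_update(theta, trajectory)
```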
2.2.3. A3C
Recalling the policy update rule of REINFORCE θ ← θ + α( Rt − b)∇θ ln πθ , a fre-
quently used baseline b is a learned estimate of the state value function b(s) = V π (s; θv ). If
this kind of baseline is selected, we can express the term Rt − b as an advantage function
A( at , st ) = Q(st , at ) − V (st ) because Rt is an estimate of the state-action value function.
This approach can be viewed as an actor–critic method in which the actor is the policy
network πθ and the critic is represented by the baseline. In [20], A3C (asynchronous advan-
tage actor critic) and its synchronous variant A2C are presented; the algorithm maintains
a policy function π ( at |st ; θ ) parameterized by θ and an estimate of the value function
V (st ; θv ) parameterized by θv . In the formulation, the weights are considered separately; in
the implementation, the same deep convolutional neural network is used to approximate
the policy and the state value, with the only difference being in the separated output
layers (a soft-max layer for the policy and linear layer for the state value). The algorithm
continues for a number of steps tmax or until it reaches a terminal state; the return value is
then calculated, and the updating of the parameters is performed following the gradient.
In Algorithm 1, the pseudocode of the A3C algorithm can be found.
The algorithm was trained and tested in a variety of domains, such as the Atari 2600
platform, improving upon the results of state-of-the-art RL algorithms such as the Deep
Q-network [21].
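To make the actor–critic structure concrete, a minimal synchronous (A2C-style) loss is sketched below; it assumes a shared torso with separate policy and value heads, as described above, and is our own PyTorch illustration rather than the pseudocode of Algorithm 1.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared torso with a softmax policy head and a linear value head."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.torso(obs)
        return Categorical(logits=self.policy_head(h)), self.value_head(h).squeeze(-1)

def a2c_loss(model, obs, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Policy-gradient loss using the advantage R_t - V(s_t) as described in the text."""
    dist, values = model(obs)
    advantages = returns - values.detach()           # the critic guides the actor
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()    # regression target for the critic
    entropy = dist.entropy().mean()                  # entropy bonus encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```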
team games in which the agents have to cooperate with their own team-mates while com-
peting with the opposing teams. The transition from single-agent to multi-agent settings
introduces new challenges that require a different design approach for the algorithms.
3.1. Nonstationarity
The environment in a multi-agent setting can be modified by the actions of all agents;
thus, from the single-agent perspective, the environment becomes non-stationary. The ef-
fectiveness of most reinforcement learning algorithms is tied to the Markov property, which
does not hold in non-stationary environments [16]. Policies learned in a non-stationary environment are doomed to become outdated. Despite the loss of theoretical support, algorithms designed for the single-agent setting, such as independent learners (IL), have been applied in multi-agent settings, occasionally achieving desirable results [22]. A naive approach to tackling non-stationarity is the use of joint action learners (JAL), which apply single-agent RL but use the joint action, instead of the local action, to compute their value functions. This approach eliminates the problem of non-stationarity entirely; however, it is computationally ineffective, and the action space dimension becomes |A|^N,
where N is the number of agents, making it difficult to scale this type of approach to more
than a few agents. In addition, to ensure that every agent knows the actions of others,
some type of centralized controller or communication network is required [23]. In [24],
a variation of Q-learning for swarm systems is presented, called Q-RTS. The key idea is
to use a centralized aggregation center to combine all the Q-value tables of the agents to
form a global swarm matrix Q_sw(s, a) containing the highest and lowest Q-values, representing the most interesting interactions experienced by the agents. The swarm matrix is then linearly combined with the local Q-value matrix of each agent Q_i(s, a) using an independence factor β ∈ [0, 1), which controls the tradeoff between local and global knowledge. The Q-learning update then proceeds independently on the combined matrix Q_i′ for each agent.
\delta \leftarrow r - Q_i(a_i, s_i)

Q_i(a_i, s_i) \leftarrow \begin{cases} Q_i(a_i, s_i) + \alpha \delta & \text{if } \delta \geq 0 \\ Q_i(a_i, s_i) + \beta \delta & \text{otherwise} \end{cases} \qquad (15)
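A tabular sketch of the optimistic update in Equation (15) is shown below, assuming α > β so that negative errors, often caused by teammates' exploratory actions, are applied with a smaller step; the names are illustrative.

```python
import numpy as np

def optimistic_update(Q_i, s, a, r, alpha=0.1, beta=0.01):
    """Update of Equation (15): positive errors use the learning rate alpha,
    negative ones the smaller rate beta."""
    delta = r - Q_i[s, a]
    Q_i[s, a] += (alpha if delta >= 0 else beta) * delta
    return Q_i

Q_i = np.zeros((5, 3))                      # toy table: 5 states, 3 actions
Q_i = optimistic_update(Q_i, s=0, a=1, r=1.0)
```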
This heuristic has the positive effect of implementing "optimistic" agents that are able
to ignore the bad rewards caused by the actions of others, improving the performance
compared to traditional IL approaches in cooperative MGs. In [27], the authors argue
that in order to obtain robustness against incorrect future reward estimation, in the early
iterations of the reinforcement learning algorithms, the agents need to show some sort of
leniency towards others. In particular, the future reward for a performed action can be
assessed as the maximum reward received over a number of different actions chosen by the
other agents. The proposed algorithm is called lenient multi-agent reinforcement learning
(LMRL) and implements varying leniency, lowering the amount that each agent exhibits in
the later stages of learning. Agents decide the amount of leniency to apply by associating a
temperature with each state–action pair. The level of leniency is inversely proportional to
the temperature and decreases as those state–action pairs are selected.
3.2. Scalability
As the number of agents increases, there is a growth in the joint action space. For this
reason, centralized approaches, in which an observer selects the actions after receiving the
action–state information of every agent, require large amounts of computational resources
and memory to work with more than a couple of agents. A possible solution to the curse of
dimensionality in MARL is to use independent learners, but as we have seen, this approach
is unable to obtain consistent results in a non-stationary environment. A third model of
agent connection is the decentralized setting with networked agents. In this setting, every
agent is able to interact with the environment and to exchange information with a few other
agents (typically, those in its vicinity), creating a time-varying communication network
between all the agents. Algorithms developed for this setting are scalable to a massive
number of agents and more real-world-oriented applications, as the absence of a central
controller and uncertainty in communication links are typical requirements in a large
number of applications. In [28], a distributed version of Q-learning, namely, QD-learning,
is proposed under the assumption that each agent is only aware of its local action and
reward and the inter-agent communication network is weakly connected. The optimal
policy is achieved by agents sending their Q-values to their neighbors. The update of the
Q-value is then computed locally for each agent using the following equation:
Q^n_{i,u}(t+1) = Q^n_{i,u}(t) - \beta_{i,u}(t) \sum_{l \in \Omega_n(t)} \left[ Q^n_{i,u}(t) - Q^n_{l,u}(t) \right] + \alpha_{i,u}(t) \left[ r_n(x_t, u_t) + \gamma \min_{v \in U} Q^n_{x_{t+1},v}(t) - Q^n_{i,u}(t) \right] \qquad (16)
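A minimal per-agent sketch of the update in Equation (16) is given below; the neighbour Q-tables and the weight sequences α and β are assumed to be available, and all names are illustrative.

```python
import numpy as np

def qd_learning_step(Q_n, neighbor_Qs, s, a, r, s_next, alpha, beta, gamma=0.95):
    """One QD-learning update of the visited (s, a) entry for a single agent.

    The first term pulls the local Q-value towards the neighbours' values
    (consensus); the second is the local, min-based TD update of Equation (16)
    (innovation)."""
    consensus = sum(Q_n[s, a] - Q_l[s, a] for Q_l in neighbor_Qs)
    innovation = r + gamma * np.min(Q_n[s_next]) - Q_n[s, a]
    Q_n[s, a] += -beta * consensus + alpha * innovation
    return Q_n
```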
It can be seen from Equation (16) that the update is defined by two processes, consen-
sus and innovation, where the former is the sum of the differences of the Q-value of the
agent and the Q-values of its neighbors, controlled by the weight sequence β i,u (t), and the
latter is the traditional Q-learning update rule weighted by the sequence αi,u (t). In [29], the
same setting of decentralized reinforcement learning with networked agents is addressed
using an actor–critic algorithm. The policy is parametrized by a set θ ∈ Θ, and the gradient
of the globally averaged return J (θ ) with respect to θ is given by
where Ṽ_θ^i(s, a^{-i}) = ∑_{a^i ∈ A^i} π_{θ^i}^i(s, a^i) · Q_θ(s, a^i, a^{-i}) and a^{-i} is the action chosen by every
agent except for agent i. The advantage function is not available considering only local
information, so Aθ is estimated with the state–value TD-error, which is an unbiased
estimator of the advantage function.
approximate the state–action value function, called deep reinforcement learning (DRL),
was first presented in [21] in the single-agent setting and achieved promising results in
the training of an agent capable of playing a large set of Atari 2600 games using the raw
pixel data from the screen as the input for training. The success of the proposed approach
is based on two features. The first is the introduction of an experience replay mechanism
in which every experience tuple et , composed of the state transition, action selected, and
reward received, is stored in a dataset and then randomly batched to train the action–value
approximation network. This method eliminates the correlation between consecutive
iterations, which is inevitable for sequentially generated data points. The experience replay
also has the additional effect of increasing the sample efficiency by reusing experience
tuples. The second feature is the use of two networks to approximate Q: the Q-network and the target network; the parameters of the latter are updated every C steps with the Q-network parameters and used as the target of the training loss function, defined as
" # 2
Li (θi ) = E(s,a,r,s′ )∼U ( D) r + γ max Q(s′ , a′ ; θi− ) − Q(s, a; θi ) (19)
a′
in which γ is the discount factor; θ_i^- and θ_i are the parameters of the target network and
of the Q-network at iteration i, respectively. Policy gradient methods have been extended
to make use of deep neural networks, keeping the advantage of allowing for policies
in the continuous action space. In [30], a deep Q-network was combined with an actor–
critic approach. The parameterized actor µ(s|θµ ) that represented the policy and the critic
Q(s, a|θ ) that estimated the value using the state–action pair were represented by DQNs.
The networks were trained using a deterministic policy gradient algorithm with a batch
normalization technique [31]. The performances obtained have motivated the research
community to adopt deep networks in the multi-agent environment. In [32], the lenient
reinforcement learning algorithm was adapted to DRL, and the authors thus proposed the
lenient deep Q-network (LDQN) algorithm. In [33], the authors proposed two techniques
to stabilize the effect of the experience replay in the multi-agent setting: low-dimensional
fingerprints, consisting of a Boolean vector, were added to the experience tuple to disambiguate training samples; and importance sampling, which consists of recording the other agents' policies µ_{-a}(u_{-a} | s) = ∏_{i ∈ -a} π_i(u_i | s) in the experience replay, forming an augmented transition tuple ⟨s, u_a, r, π(u_{-a} | s), s'⟩^{t_c}. The DQN parameters θ are trained by minimizing an importance-weighted loss function analogous to Equation (19):
L(\theta) = \sum_{i=1}^{b} \frac{\pi_{-a}^{t_r}(u_{-a} \mid s)}{\pi_{-a}^{t_i}(u_{-a} \mid s)} \left[ \left( y_i^{DQN} - Q(s, u; \theta) \right)^2 \right] \qquad (20)
where b is the size of the batch used in the learning and y_i^{DQN} is the output of the
target network.
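A compact sketch of the importance-weighted loss of Equation (20) on a sampled batch is shown below; it assumes the replay buffer stores, alongside each transition, the other agents' action probabilities at collection time and under the current policies (our own illustration, not the authors' implementation).

```python
import torch

def importance_weighted_dqn_loss(q_net, target_net, batch, gamma=0.99):
    """DQN loss of Equation (19), re-weighted as in Equation (20) by the ratio of
    the other agents' current policy to their policy at collection time."""
    obs, actions, rewards, next_obs, mu_collect, mu_now = batch
    with torch.no_grad():
        y = rewards + gamma * target_net(next_obs).max(dim=1).values  # target y_i
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = mu_now / mu_collect            # pi^{t_r}_{-a} / pi^{t_i}_{-a}
    return (weights * (y - q_taken) ** 2).mean()
```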
of the environment and the capabilities of representing the hidden state space of deep
recurrent Q-networks, proposing an algorithm called the decentralized hysteretic deep
recurrent Q-network (Dec-HDRQN). In the work, a variation of the experience replay
mechanism called concurrent experience replay trajectories (CERTs) was used. In CERT,
each experience tuple (o_t^{(i)}, a_t^{(i)}, r_t^{(i)}, o_{t+1}^{(i)}), containing the current observation, action, and
reward, and the subsequent observation, is indexed by the agent number, the time-step
of acquisition, and the episode. The samples given to the Q-network for training are then
taken from this structure in a synchronized way.
In particular, the actions u_{-a} are given as an input to a network that determines the
state–action value for each action of the agent a in a single forward pass.
This algorithm (full pseudocode can be found in Algorithm 2) was tested in StarCraft,
a combat-based video-game environment. Several homogeneous and heterogeneous unit
combinations, with each one represented by an agent, were considered. It was shown
that, in that setting, COMA reached competitive performance in regard to fully centralized
methods with better training speed. While the paradigm of the centralized learning of
decentralized policies is easily implementable for actor–critic and policy gradient methods,
it is not as straightforward when considering value-based methods. A possible approach
was presented in [40] that consists of decomposing the team value function into agent-wise
value functions. The assumption on which this approach is based is that the joint action–
value function can be factorized according to each agent’s Q-function based only on each
agent’s local observation:
Q\big( (o^1, o^2, \dots, o^N), (a^1, a^2, \dots, a^N) \big) \approx \sum_{i=1}^{N} Q_i(o^i, a^i) \qquad (22)
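The additive factorization of Equation (22) amounts to summing per-agent utilities computed from local observations; a minimal sketch of such a mixer (our own, with illustrative shapes) is shown below.

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """Value decomposition of Equation (22): Q_tot is the sum of the per-agent
    Q-values of the actions actually taken."""
    def forward(self, agent_qs):
        # agent_qs: (batch, n_agents) chosen-action Q-values of each agent
        return agent_qs.sum(dim=1)
```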
Each agent taking a greedy action to maximize its return is equivalent to a central
controller maximizing the joint action–value functions. The value decomposition network
(VDN) used DQN or DRQN, with the possibility of communication between agents at a
low level (sharing observations) or high level (sharing network weights). It was tested in a
two-dimensional grid environment, obtaining better performances than the centralized
and independent learners methods. The limit of this approach is that not every multi-
agent problem can be approximated and solved as a summation of Q-functions. In [41],
the VDN method was extended with the QMIX algorithm. The authors argued that a
full factorization of the value function is not required to extract effective policies. It is sufficient that an argmax performed on the joint action–value function produces the same result as applying an argmax to each individual action–value function. This is possible
if a monotonicity constraint is enforced between the total Q-value Qtot and local agent
Q-value Q a .
\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a \qquad (23)
Each agent value function Q_a(o_t^a, u_{t-1}^a) is represented by a DRQN that takes, as its input, the local observation and the last action at each time step. Then, the individual value functions are combined using a mixing network, which is a feed-forward neural network.
The weights of the mixing network are bounded to be non-negative to enforce the condition
presented in Equation (23). These weights are determined by a separate hyper-network
which takes the augmented state as its input. The networks are trained to minimize the
loss function in a way analogous to DQN:
L(\theta) = \sum_{i=1}^{b} \left[ \left( y_i^{tot} - Q_{tot}(\tau, u, s; \theta) \right)^2 \right] \qquad (24)
with transitions sampled from the replay buffer. In the evaluation carried out in the StarCraft II Learning Environment, QMIX obtained better results than VDN at the cost of added architectural complexity: VDN combines the local Q-functions using a simple summation, whereas QMIX uses a neural network.
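A compact sketch of a QMIX-style mixer is shown below: hyper-networks conditioned on the global state generate the mixing weights, and an absolute value keeps them non-negative so that the monotonicity condition of Equation (23) holds; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing of per-agent Q-values into Q_tot (Equation (23))."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hyper-networks: the global state generates the mixer's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.hyper_w1(state).abs().view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = self.hyper_w2(state).abs().view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1)  # Q_tot for each batch element
```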
c_j^{i+1} = \frac{1}{J-1} \sum_{j' \neq j} h_{j'}^{i+1} \qquad (26)
The first hidden state is obtained through an encoder function h_j^0 = r(s_j), which takes the state observation of agent j and outputs its hidden state; the encoder r is a single-layer neural network. After the last round of communication, a decoder function, made of a single-layer neural network followed by a softmax, is used to convert the hidden state h_j^K into a probability distribution over the action space, q(h_j^K). The action is chosen by sampling from this distribution.
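A sketch of one CommNet-style communication round as in Equation (26) is given below: each agent's hidden state is updated from its own previous hidden state and the mean of the other agents' hidden states, with weights shared across agents (names and dimensions are illustrative, and at least two agents are assumed).

```python
import torch
import torch.nn as nn

class CommStep(nn.Module):
    """One communication round: h' = tanh(H h + C c), where c_j is the mean of
    the other agents' hidden states as in Equation (26)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.H = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.C = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, h):
        # h: (n_agents, hidden_dim) hidden states of all J agents
        J = h.shape[0]
        comm = (h.sum(dim=0, keepdim=True) - h) / (J - 1)  # Equation (26)
        return torch.tanh(self.H(h) + self.C(comm))
```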
5. Applications
MARL algorithms have been used in a variety of applications in different fields thanks
to their ability to model competitive and cooperative scenarios. In [47], a team of UAVs
equipped with a downward-facing camera was considered; the UAVs were assigned the
task of optimal sensing coverage. The drone fleet had to cover an area of interest under
its FOV while minimizing the overlap between camera captures. This task is required in
several applications, such as wildlife monitoring or search and rescue, where the use of
UAVs is growing due to their mobility and ease of deployment. The UAVs are considered as independent agents, and convergence to a single joint action policy is obtained by means of social conventions. The agents act in a predefined order so that the last agent can observe the preceding actions of its teammates and make a decision accordingly. This action selection order allows easy collision avoidance, as the agents are not able to choose an action that would cause a collision with their predecessors. To emphasize collaboration between
agents, a single global reward is used during learning, and to reduce the dimensionality
of the state space, several techniques of function approximation are implemented. In [48],
the authors proposed a novel actor–critic multi-agent reinforcement learning algorithm
to address the problem of multi-UAV target assignment and path planning (MUTAPP). A
team of UAVs operates in an environment containing target locations to reach and locations
to avoid and has to decide the most efficient way to maneuver while avoiding collisions.
The critic and the actor are designed according to the multi-agent deep deterministic
policy gradient (MADDPG) algorithm [49]. The critic of each agent shares its actions and
observations during training while the actor network works exclusively based on local
information. The reward signal is factorized into three components, with each one being
responsible for a desired behavior: minimizing travel distance, avoiding collisions with
other agents, or avoiding collisions with target areas. The evaluation was performed using
the openAI platform and showed that the proposed algorithm can solve the problem of
MUTAPP. In terms of convergence speed over the number of agents, the method showed
some evident limitations in scaling.
Another UAV application used a UAV team to implement a network for line-of-sight
(LoS) communications using MARL to optimize resource management, including the
transmit power and subchannel selection [50]. The multi-agent setting is addressed via
independent learner agents using Q-learning to maximize the future rewards, defined as
the difference between the data throughput and the cost of power consumption. Agents
select an action tuple θ_m(t) = (a_m(t), c_m(t), p_m(t)), where a_m(t) is the selected user, c_m(t) is the selected subchannel, and p_m(t) is the power level. A simulation showed that, in this context, if a high exploration level is selected (ε = 0.5), the algorithm can reach a tradeoff between system performance and the information exchange that other methods in this setting require, such as the Gale–Shapley algorithm, which is based on matching theory for
user selection [51]. In [52], the authors study the problem of the limited spectrum in UAV
networks considering a relay-based cooperative spectrum leasing scenario. The UAV fleet
needs to transmit its data to a fusion center and does that by forwarding data packets for a
ground user in exchange for spectrum access. The objective of the algorithm is to partition
the UAVs into two groups: the relaying group, which handles the data transfer for the
ground user, and the transmitting group, which forwards packets to the UAV fusion center.
The learning algorithm is developed in a distributed way using independent learner agents with Q-learning, without the need for communication; each agent only has access to the task partition, which serves as the state (we can consider this to be a fully observable setting).
The algorithm was tested in scenarios with two UAVs and six UAVs and in both cases
managed to achieve the optimal configuration.
In [53], a multi-agent approach based on a CommNet architecture [42] was proposed
to coordinate the operation of the wireless recharging towers of a group of UAVs. The
aim of the algorithm was to schedule which UAV to serve, determine how much energy
should be delivered, and share the energy between charging towers. Each tower had
access to a local photo-voltaic power generator, and the energy was shared to minimize
the purchase of electricity from the energy market, resulting in lower operating costs.
The proposed algorithm was tested in an urban simulation consisting of 30 UAVs and
four charging towers. It showed performance improvements over the baseline policies of random sharing and scheduling.
Applications of MARL do not consider only the control of UAV fleets. In [54], the
problem of a joint top-down active search of multiple objects was addressed. Each detector
was an agent, and deep reinforcement learning was used to learn the optimal policy for
object localization. Coordination between agents was obtained by the use of a differential
message for each agent, which was a function of the action selected and the state of
the environment with parameters θm , m( a, s; θm ). The agent-wise Q-function was then
defined as
Q^{(i)} := Q^{(i)}\big( a^{(i)}, m^{(i)}, s^{(i)}, m^{(-i)}; \theta_a^{(i)}, \theta_m^{(i)} \big) \qquad (28)
The agent could control the effect of messages from other agents on its decision-making process through a learned gating mechanism. The algorithm was tested with a
two-agent implementation on a series of datasets for joint image detection of man–bike,
ball–racket, or person–handbag; the joint model achieved good performance using fewer
iterations than the single-agent counterpart. Agents managed to "help" each other by
sending clues about the correlations between object locations in the messages. In [55], a
deep multi-agent reinforcement algorithm was developed to regulate the energy exchange
between a community of buildings with heterogeneous energy production and storage
capabilities. The objective was to reach nearly zero energy community (nZEC) status, which
is defined as "A micro-grid that has distributed generation, storage, delivery, consumption
and zero net annual energy balance." The buildings were modeled as DRL agents, and the
authors proposed the presence of a community monitoring service (CMS) to aggregate data
from all the agents and enable cooperation. Each agent used its local energy generation, consumption, and storage status, together with the energy balance of the entire community, as its state.
This information was used to select an optimal action, maximizing the global reward signal
given by the CMS, which was the negative of the community energy status
r = - \sum_{i=1}^{n} \big( c(h_i) - g(h_i) \big) \qquad (29)

where c(h_i) is the energy consumed by the i-th house and g(h_i) is the energy generated by
the i-th house. This approach was tested in simulations of summer and winter settings with up to 10 agents and compared with several behavioral baselines, including never-share, always-share, and a random selection of actions. The baselines were outperformed, especially in the summer setting. A limitation of this approach is that the learning was conducted in an episodic manner, and thus there is no guarantee that using an online learning
approach would lead to the same convergence to an optimum.
In [56], a novel multi-agent reinforcement learning algorithm called Equilibrium Selection MARL (ES-MARL) was presented to control the energy scheduling of a residential microgrid. The microgrid consisted of renewable energy generators (RG) (wind turbines
and photo-voltaic), households that demanded energy from the grid and a number of
electric vehicles (EV) that could offer or request energy when connected to a recharging
station with a Vehicle to Grid (V2G) connectivity. The algorithm uses several types of agents:
an EV Aggregator agent that macromanages the exchanges of energy of all the EVs parked
by demanding a certain charging power from the grid or by offering power to the grid
while selecting the price; a User Aggregator agent, which receives the energy demand of the
residential household and decides how much load to curtail (e.g., reducing air conditioning) and how much load to shift to another time-step (e.g., postponing the use of a washing
machine). Two RG agents—one for each type of energy generation—were used to decide the
price for selling their energy production. Cooperation between the heterogeneous agents
is achieved through the action of an Equilibrium Selecting Agent with the objective
of separately negotiating with all the agents to select the optimal equilibrium based on
the average reward. Several scenarios of the microgrid were simulated, and the proposed
approach showed a higher average reward compared to single-agent reinforcement learning approaches. When compared with another MARL algorithm, such as Nash-
Q [57], ES-MARL showed a faster convergence rate.
A similar application but for an industry production control setting was presented
in [58]. An independent MARL algorithm based on Proximal Policy Optimization (PPO)
was proposed to control the energy exchanges in a factory, composed of local power
generators (renewable and fuel based), a battery system and a certain number of resources
that could consume power to produce a variety of products. Each of these elements was represented in the algorithm as a separate agent, alongside a market agent that regulated the energy purchase from the energy market. The coordination between agents was
encouraged by the use of a global reward combined with agent-specific local rewards. The
reward function could be decomposed into energy and production costs that needed to
be minimized. The proposed algorithm was compared to a reactive control strategy (RCS)
and a predictive–reactive control strategy (PCS). MARL outperformed the RCS but did
not match the performance of the PCS, indicating that it was able to reach only a local
optimum. Comparing the time required to make a decision (1 s for MARL and 2.5 h for
PCS), the MARL approach showed the capability to operate online and change its policy in
real-time according to stochastic changes in the environment, such as changes of electricity
cost or the failure of a production job.
The autonomous driving setting is a natural framework to develop MARL algorithms;
however, controlling vehicles with passengers on-board requires strict safety guarantees
that are not compatible with the learned nature of MARL. In [59], a safe reinforcement
learning algorithm was presented. The policy function was decomposed into a learned
policy for desires and trajectory planning. The desires policy was related to granting
a comfortable driving experience (for example, the absence of sharp curves or sudden
acceleration) and produced a cost function over driving trajectories; it was learned using policy gradient methods. The trajectory planning, on the other hand, was
not learned and was tied to hard constraints. It took as its input the cost function and aimed
to find a trajectory that minimized the cost while enforcing driving safety. For the trajectory
planning, an approach based on the option framework was used [60]. The resulting
algorithm was tested in the noticeably difficult setting of a double-merge intersection while
avoiding collision.
A multi-agent reinforcement learning framework was used to address the fleet man-
agement problem for large-scale online ride-sharing platforms such as Uber and Lyft [61].
The authors proposed two algorithms, namely, contextual DQN (cDQN) and contextual
actor–critic (cA2C), which allocate each car (represented by an agent) to a particular zone
of a city, divided into hexagonal grids. The computational efficiency was increased by the
use of contexts: a geographical context was employed that reduced the action space of an
agent in a grid by filtering the actions that would lead to an infeasible grid, as well as a
collaborative context that avoided situations in which the agents moved in conflicting di-
rections (for example, swapping grids at a given time). The efficiencies of those algorithms
were tested in a simulator environment calibrated using the historical data provided by the
Didi Chuxing ride-sharing app. It was compared to independent-DQN and the historical
data, achieving good performances in terms of gross merchandise volume (GMV) and
order response rate.
In [62], an independent deep Q-network (IDQN) architecture was used to address a
heterogeneous multi-junction urban traffic control scenario. The state was considered as
an image-like representation of the simulator environment, and the actions for each agent
(which represents an intersection) were the possible configurations of a traffic light. The
reward to be maximized was connected to the cumulative waiting time of all the vehicles in
the road network. The algorithm was tested in an open-source traffic simulator SUMO [63],
showing promising results, particularly in a low-traffic setting.
Sector: Applications
UAVs: Drone field coverage [47]; Target assignment and path planning [48]; LoS networks [50]; Packet routing relaying [52]; Recharging towers [53]
Image processing: Joint active object search [54]
Energy sharing and scheduling: Zero-energy community [55]; Residential microgrid with V2G exchange [56]; Industry production control [58]; Lithium battery lifetime optimization [64]
Automotive: Safe driving [59]; Fleet control for ride-sharing platform [61]; Intersection traffic light control [62]
Social science: Common-pool resource appropriation [66]; Sequential social dilemmas [67]
Networking: Base-station parameter approximation [68]
6. Conclusions
Multi-agent reinforcement learning is a new, promising branch of machine learning
theory. The technological trend is moving towards distributed systems composed of
multitudes of computational units, which is also due to the development of the IoT and
the edge computing sector. MARL could be the answer to the realization of intelligent
systems that are capable of learning how to cooperate to maximize their efficiency. The
development of such algorithms can eliminate the need to interface with a centralized
controller for multi-agent systems, such as cloud servers, minimizing the time required to
select an action and, from a reliability perspective, avoiding a single point of failure.
In this work, the main challenges in the development of MARL algorithms were presented,
addressing the nonstationarity of the environment, scalability, and the need to handle
partially observable settings as key components of a fast-converging, efficient algorithm.
The research community has proposed an assortment of solutions to these challenges in
recent years. It was shown that MARL algorithms have been used to address a large variety
of applications, such as traffic light control, autonomous driving, and smart energy grids;
however, the vast majority of approaches have adopted an independent learning paradigm.
It would be interesting to observe the performance of a broader range of MARL algorithms in
real-world applications.
Author Contributions: Supervision, G.C.C.; investigation, L.C., L.D.N., R.F., D.G., M.R., S.S.; re-
sources, L.C., L.D.N., R.F., D.G., M.R., S.S.; visualization, L.C., L.D.N., R.F., D.G., M.R., S.S.; writing—
original draft, L.C., L.D.N., R.F., D.G., M.R., S.S.; writing—review & editing, L.C., L.D.N., R.F., D.G.,
M.R., S.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data sharing not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
ML Machine Learning
RL Reinforcement Learning
MAS Multi Agent System
MARL Multi Agent Reinforcement Learning
MDP Markov Decision Process
MG Markov Game
POMDP Partially-Observable Markov Decision Process
Dec-POMDP Decentralized Partially Observable Markov Decision Process
TD Temporal Difference
DQN Deep Q-Network
IL Independent Learner
JAL Joint Action Learner
Q-RTS Q Learning Real Time Swarm
LMRL Lenient Multi-agent Reinforcement Learning
DRL Deep Reinforcement Learning
cDQN Contextual DQN
LDQN Lenient Deep Q-Network
DRQN Deep Recurrent Q-Network
LSTM Long Short Term Memory
Dec-HDRQN Decentralized Hysteretic Deep Recurrent Q-Network
CERT Concurrent Experience Replay Trajectories
TRPO Trust Region Policy Optimization
PS-TRPO Parameter Sharing Trust Region Policy Optimization
COMA Counterfactual Multi-Agent
A3C Asynchronous Advantage Actor-Critic
cA2C Contextual Asynchronous Advantage Actor-Critic
VDN Value Decomposition Network
DDRQN Distributed Deep Recurrent Q-Network
MA Multi-Agent
MADQN Multi-Agent Deep Q-Network
References
1. Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy.
SSRN 2020. [CrossRef]
2. Abbeel, P.; Darrell, T.; Finn, C.; Levine, S. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res.
2016, 17, 1334–1373.
3. Konar, A.; Chakraborty, I.G.; Singh, S.J.; Jain, L.C.; Nagar, A.K. A deterministic improved q-learning for path planning of a mobile
robot. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2013, 43. [CrossRef]
4. Lin, J.L.; Hwang, K.S.; Jiang, W.C.; Chen, Y.J. Gait Balance and Acceleration of a Biped Robot Based on Q-Learning. IEEE Access
2016, 4. [CrossRef]
5. Panagiaris, N.; Hart, E.; Gkatzia, D. Generating unambiguous and diverse referring expressions. Comput. Speech Lang. 2021, 68.
[CrossRef]
6. Matta, M.; Cardarilli, G.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Nannarelli, A.; Re, M.; Spanò, S. A reinforcement learning-
based QAM/PSK symbol synchronizer. IEEE Access 2019, 7. [CrossRef]
7. Stone, P.; Veloso, M. Multiagent systems: A survey from a machine learning perspective. Auton. Robots 2000, 8, 345–383.
[CrossRef]
8. Zhuang, Y.; Hu, Y.; Wang, H. Scalability of Multiagent Reinforcement Learning. In Interactions in Multiagent Systems; Chapter 1;
World Scientific: Singapore, 2000; pp. 1–17. [CrossRef]
9. Thorndike, E.L. Animal Intelligence: An experimental study of the associative processes in animals. Am. Psychol. 1998, 58. [CrossRef]
10. Monahan, G.E. A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms. Manag. Sci.
1982, 28, 1–16. [CrossRef]
11. Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The Complexity of Decentralized Control of Markov Decision Processes.
Math. Oper. Res. 2002, 27, 819–840. [CrossRef]
12. Sutton, R.S. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In
Machine Learning Proceedings; Morgan Kaufmann: Burlington, MA, USA, 1990; pp. 216–224. [CrossRef]
13. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.;
et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [CrossRef]
14. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al.
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144.
[CrossRef]
15. Watkins, C. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989.
16. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
17. Sutton, R. Learning to Predict by the Method of Temporal Differences. Mach. Learn. 1988, 3, 9–44. [CrossRef]
18. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [CrossRef]
19. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn.
1992, 8, 229–256. [CrossRef]
20. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for
Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine
Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1928–1937. [CrossRef]
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef]
22. Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International
Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
23. Claus, C.; Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the
Fifteenth National Conference on Artificial Intelligence, Madison, WI, USA, 26–30 July 1998; pp. 746–752.
24. Matta, M.; Cardarilli, G.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Silvestri, F.; Spanò, S. Q-RTS: A real-time swarm
intelligence based on multi-agent Q-learning. Electron. Lett. 2019, 55, 589–591. [CrossRef]
25. Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Matta, M.; Nannarelli, A.; Re, M.; Spanò, S. FPGA Implementation of
Q-RTS for Real-Time Swarm Intelligence systems. In Proceedings of the 2020 Asilomar Conference on Signals, Systems, and
Computers, Pacific Grove, CA, USA, 1–4 November 2020.
26. Matignon, L.; Laurent, G.J.; Le Fort-Piat, N. Hysteretic Q-learning: An algorithm for Decentralized Reinforcement Learning
in Cooperative Multi-Agent Teams. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and
Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 64–69. [CrossRef]
27. Bloembergen, D.; Kaisers, M.; Tuyls, K. Lenient Frequency Adjusted Q-Learning; University of Luxemburg: Luxembourg, 2010;
pp. 19–26.
28. Kar, S.; Moura, J.M.F.; Poor, H.V. QD -Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning
Through Consensus + Innovations. IEEE Trans. Signal Process. 2013, 61, 1848–1862. [CrossRef]
29. Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; Başar, T. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents;
ML Research Press: Maastricht, NL, USA, 2018.
30. Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement
learning. arXiv 2016, arXiv:1509.02971.
31. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Pro-
ceedings of the 32nd International Conference on International Conference on Machine Learning, ICML’15, Lille, France,
6–11 July 2015; Volume 37, pp. 448–456. [CrossRef]
32. Palmer, G.; Tuyls, K.; Bloembergen, D.; Savani, R. Lenient Multi-Agent Deep Reinforcement Learning. In Proceedings of the 17th
International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 10–15 July 2018; pp. 443–451.
33. Foerster, J.; Nardelli, N.; Farquhar, G.; Torr, P.; Kohli, P.; Whiteson, S. Stabilising Experience Replay for Deep Multi-Agent Reinforce-
ment Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
34. Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the AAAI Fall
Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15), Arlington, VA, USA, 12–14 November 2015.
35. Omidshafiei, S.; Pazis, J.; Amato, C.; How, J.P.; Vian, J. Deep Decentralized Multi-task Multi-Agent Reinforcement Learning
under Partial Observability. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia,
6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR, International Convention Centre: Sydney, Australia, 2017; Volume 70,
pp. 2681–2690.
36. Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative Multi-agent Control Using Deep Reinforcement Learning. In Proceedings
of the International Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017; Volume 10642.
[CrossRef]
37. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd
International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 1889–1897.
38. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the 26th Annual International
Conference on Machine Learning, ICML’09, Montreal, QC, Canada, 14–18 June 2009; Association for Computing Machinery: New
York, NY, USA, 2009, pp. 41–48. [CrossRef]
39. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Proceedings of
the AAAI 2018, Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
40. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.;
et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the 17th
International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents
and Multiagent Systems, AAMAS’18, Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [CrossRef]
41. Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation
for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR,
Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4295–4304.
42. Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning multiagent communication with backpropagation. In Advances in Neural Information
Processing Systems, Proceedings of the 30th Annual Conference on Neural Information Processing Systems, NIPS 2016, Barcelona, Spain,
5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2252–2260.
43. Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate to Solve Riddles with Deep Distributed
Recurrent Q-Networks. arXiv 2016, arXiv:1602.02672.
44. Sukhbaatar, S.; Szlam, A.; Synnaeve, G.; Chintala, S.; Fergus, R. MazeBase: A Sandbox for Learning from Games; Cornell University
Library: Ithaca, NY, USA, 2016.
45. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Sasha Vezhnevets, A.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.;
Schrittwieser, J.; et al. StarCraft II: A New Challenge for Reinforcement Learning. arXiv 2017, arXiv:1708.04782.
46. MuJoCo: Advanced Physics Simulation. Available online: http://www.mujoco.org (accessed on 10 April 2021).
47. Pham, H.X.; La, H.M.; Feil-Seifer, D.; Nefian, A. Cooperative and Distributed Reinforcement Learning of Drones for Field
Coverage. arXiv 2018, arXiv:1803.07250.
48. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on
Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [CrossRef]
49. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive
Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17,
Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6382–6393.
50. Cui, J.; Liu, Y.; Nallanathan, A. Multi-Agent Reinforcement Learning-Based Resource Allocation for UAV Networks. IEEE Trans.
Wirel. Commun. 2020, 19, 729–743. [CrossRef]
51. Gale, D.; Shapley, L.S. College Admissions and the Stability of Marriage. Am. Math. Mon. 1962, 69, 9–15. [CrossRef]
52. Shamsoshoara, A.; Khaledi, M.; Afghah, F.; Razi, A.; Ashdown, J. Distributed Cooperative Spectrum Sharing in UAV Networks
Using Multi-Agent Reinforcement Learning. In Proceedings of the 2019 16th IEEE Annual Consumer Communications
Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2019; pp. 1–6. [CrossRef]
53. Jung, S.; Yun, W.J.; Kim, J.; Kim, J.H. Coordinated Multi-Agent Deep Reinforcement Learning for Energy-Aware UAV-Based
Big-Data Platforms. Electronics 2021, 10, 543. [CrossRef]
54. Kong, X.; Xin, B.; Wang, Y.; Hua, G. Collaborative Deep Reinforcement Learning for Joint Object Search. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
55. Prasad, A.; Dusparic, I. Multi-agent Deep Reinforcement Learning for Zero Energy Communities. In Proceedings of the 2019
IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), Bucharest, Romania, 29 September–2 October 2019; pp. 1–5.
[CrossRef]
56. Fang, X.; Wang, J.; Song, G.; Han, Y.; Zhao, Q.; Cao, Z. Multi-Agent Reinforcement Learning Approach for Residential Microgrid
Energy Scheduling. Energies 2020, 13, 123. [CrossRef]
57. Hu, J.; Wellman, M.P. Nash Q-Learning for General-Sum Stochastic Games. J. Mach. Learn. Res. 2003, 4, 1039–1069.
58. Roesch, M.; Linder, C.; Zimmermann, R.; Rudolf, A.; Hohmann, A.; Reinhart, G. Smart Grid for Industry Using Multi-Agent
Reinforcement Learning. Appl. Sci. 2020, 10, 6900. [CrossRef]
59. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv 2016,
arXiv:1610.03295.
60. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement
learning. Artif. Intell. 1999, 112, 181–211. [CrossRef]
61. Lin, K.; Zhao, R.; Xu, Z.; Zhou, J. Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK,
19–23 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1774–1783.
62. Calvo, A.; Dusparic, I. Heterogeneous Multi-Agent Deep Reinforcement Learning for Traffic Lights Control. In Proceedings of
the 26th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, Ireland, 6–7 December 2018.
63. Krajzewicz, D.; Erdmann, J.; Behrisch, M.; Bieker, L. Recent Development and Applications of SUMO – Simulation of Urban
Mobility. Int. J. Adv. Syst. Meas. 2012, 5, 128–138.
64. Sui, Y.; Song, S. A Multi-Agent Reinforcement Learning Framework for Lithium-ion Battery Scheduling Problems. Energies 2020,
13, 1982. [CrossRef]
65. Kim, H.; Shin, K.G. Scheduling of Battery Charge, Discharge, and Rest. In Proceedings of the 2009 30th IEEE Real-Time Systems
Symposium, Washington, DC, USA, 1–4 December 2009; pp. 13–22. [CrossRef]
66. Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool
resource appropriation. arXiv 2017, arXiv:1707.06600.
67. Leibo, J.Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; Graepel, T. Multi-Agent Reinforcement Learning in Sequential Social Dilemmas.
In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS'17, São Paulo, Brazil, 8–12 May 2017;
International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2017; pp. 464–473.
68. Pandey, B. Adaptive Learning for Mobile Network Management. Master's Thesis, Aalto University, Espoo, Finland, 12 December 2016.