Optimal Control in Large Stochastic Multi-Agent Systems

B. van den Broek, W. Wiegerinck, and B. Kappen
1 Introduction
A collaborative multi-agent system is a group of agents in which each member
behaves autonomously to reach the common goal of the group. Some examples
are teams of robots or unmanned vehicles, and networks of automated resource
allocation. An issue typically appearing in multi-agent systems is decentralized
coordination; the communication between agents may be restricted, there may
be no time to receive all the demands for a certain resource, or an unmanned
vehicle may be unsure about how to anticipate another vehicle's movement and
avoid a collision.
In this paper we focus on the issue of optimal control in large multi-agent sys-
tems where the agents' dynamics are continuous in space and time. In particular
we look at cases where the agents have to distribute themselves in admissible
ways over a number of targets. Due to the noise in the dynamics, a configura-
tion that initially seems attainable with little effort may become harder to reach
later on.
Common approaches to derive a coordination rule are based on discretizations
of space and time. These often suffer from the curse of dimensionality, as the
complexity increases exponentially in the number of agents. Some successful
ideas, however, have recently been put forward, which are based on structures
that are assumed to be present [1,2].
Here we rather model the system in continuous space and time, following the
approach of Wiegerinck et al. [3]. The agents satisfy dynamics with additive
control and noise, and the joint behaviour of the agents is valued by a joint cost
function that is quadratic in the control. The stochastic optimization problem
may then be transformed into a linear partial differential equation, which can
be solved using generic path integral methods [4,5]. The dynamics of the agents
are assumed to factorize over the agents, such that the agents are coupled by
their joint task only.
The optimal control problem is equivalent to a graphical model inference prob-
lem [3]. In large and sparsely coupled multi-agent systems the optimal control
can be computed using the junction tree algorithm. Exact inference, however,
will break down when the system is both large and densely coupled. Here we
explore the use of graphical model approximate inference methods in optimal
control of large stochastic multi-agent systems. We apply the mean field approximation and show that accurate control is feasible in systems where exact inference breaks down.
given the agents' initial state x and the joint control over time u(t → T). R is a symmetric k × k matrix with positive eigenvalues, such that u_a(θ)ᵀR u_a(θ) is always non-negative, and V(x(θ), θ) is the cost for the agents to be in a joint state x(θ) at time θ. The issue is to find the optimal control which minimizes the expected cost-to-go.
The optimal controls are given by the gradient

$$u_a(x, t) = -R^{-1} B^\top \partial_{x_a} J(x, t), \qquad (3)$$
where J(x, t) is the optimal expected cost-to-go, i.e. the cost (2) minimized over all possible controls; a brief derivation is contained in the appendix. An impor-
tant implication of equation (3) is that at any moment in time, each agent can
compute its own optimal control if it knows its own state and that of the other
agents: there is no need to discuss possible strategies! This is because the agents
always perform the control that is optimal, and the optimal control is unique.
To compute the optimal controls, however, we first need to find the optimal
expected cost-to-go J. The latter may be expressed in terms of a forward diffusion
process:
$$J(x, t) = -\lambda \log \int \mathrm{d}y\, \rho(y, T\,|\,x, t)\, e^{-\phi(y)/\lambda}, \qquad (4)$$
ρ(y, T |x, t) being the transition probability for the system to go from a state
x at time t to a state y at the end time T . The constant λ is determined by
the relation σσᵀ = λBR⁻¹Bᵀ, equation (14) in the appendix. The density ρ(y, θ|x, t), t < θ ≤ T, satisfies the forward Fokker-Planck equation,
$$\partial_\theta \rho = -\frac{V}{\lambda}\,\rho \;-\; \sum_{a=1}^{n} \partial_{y_a}\!\left(b_a\,\rho\right) \;+\; \sum_{a=1}^{n} \frac{1}{2}\,\mathrm{Tr}\!\left(\sigma\sigma^\top \partial_{y_a}^{2}\,\rho\right). \qquad (5)$$
The solution to this equation may generally be estimated using path integral methods [4,5]; a minimal sampling sketch is given below. In a few special cases a solution exists in closed form, as the following example shows.
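As an illustration, here is a minimal Monte Carlo sketch of this estimate, assuming V = 0 so that equation (4) reduces to averaging e^{−φ(y_T)/λ} over end states of the uncontrolled dynamics; for V ≠ 0 each path would additionally carry a weight exp(−(1/λ)∫V dθ). The function names and the Euler-Maruyama discretization are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def estimate_J(x, t, T, b, sigma, phi, lam, n_paths=100_000, n_steps=100):
    """Monte Carlo estimate of J(x, t) = -lam * log E[exp(-phi(y_T) / lam)],
    sampling the uncontrolled dynamics dy = b(y) dt + sigma dw (V = 0)."""
    dt = (T - t) / n_steps
    y = np.tile(np.atleast_1d(x).astype(float), (n_paths, 1))
    for _ in range(n_steps):
        dw = np.sqrt(dt) * np.random.randn(*y.shape)
        y += b(y) * dt + sigma * dw  # Euler-Maruyama step
    return -lam * np.log(np.mean(np.exp(-phi(y) / lam)))
```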
Example 1. Consider a multi-agent system in one dimension in which there is
noise and control in the velocities of the agents, according to the set of equations
$$dx_a(t) = \dot x_a(t)\, dt, \qquad d\dot x_a(t) = u_a(t)\, dt + \sigma\, dw(t).$$
Note that this set of equations can be merged into a single equation of the
form (1) by a concatenation of xa and ẋa into a single vector. We choose the
potential V = 0. Under the task where each agent a has to reach a target with
location μa at the end time T , and arrive with speed μ̇a , the end cost function
φ can be given in terms of a product of delta functions, that is
$$e^{-\phi(x,\dot x)/\lambda} = \prod_{a=1}^{n} \delta(x_a - \mu_a)\,\delta(\dot x_a - \dot\mu_a),$$
and the system decouples into n independent single-agent systems. The dynamics
of each agent a is given by a transition probability ρ_a(y_a, ẏ_a, T | x_a, ẋ_a, t), a Gaussian with covariance matrix

$$c = \frac{\sigma^{2}}{6}\begin{pmatrix} 2(T-t)^{3} & 3(T-t)^{2} \\ 3(T-t)^{2} & 6(T-t) \end{pmatrix}.$$
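To make the example concrete, the following sketch evaluates this transition density. The covariance is the matrix c above; the mean (x_a + (T − t)ẋ_a, ẋ_a) is our assumption, taken as the noise-free drift of the uncontrolled dynamics of this example.

```python
import numpy as np

def transition_density(y, y_dot, x, x_dot, t, T, sigma):
    """Gaussian transition density rho_a(y, y_dot, T | x, x_dot, t) of the
    uncontrolled double integrator: covariance c as above, and mean
    (x + (T - t) * x_dot, x_dot), the noise-free drift (our assumption)."""
    tau = T - t
    c = (sigma**2 / 6.0) * np.array([[2.0 * tau**3, 3.0 * tau**2],
                                     [3.0 * tau**2, 6.0 * tau]])
    d = np.array([y - (x + tau * x_dot), y_dot - x_dot])
    return np.exp(-0.5 * d @ np.linalg.solve(c, d)) / (2.0 * np.pi * np.sqrt(np.linalg.det(c)))
```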
The optimal control follows from equations (3) and (4) and reads
$$u_a(x_a, \dot x_a, t) = \frac{6\left(\mu_a - x_a - (T-t)\,\dot x_a\right) - 2(T-t)\left(\dot\mu_a - \dot x_a\right)}{(T-t)^{2}}. \qquad (7)$$
The first term in the control steers the agent towards the target μ_a in a straight line, but since the agent may then arrive with a speed that differs from the required arrival speed μ̇_a, the second term initially 'exaggerates' the speed of the straight-line approach, so that towards the end there is time to adjust the speed to the end speed μ̇_a.
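A short simulation makes this behaviour visible. The sketch below applies control (7) to the noisy double integrator of this example; the noise level and step count are illustrative assumptions.

```python
import numpy as np

def single_target_control(x, v, mu, mu_dot, t, T):
    """The optimal control (7): steer to position mu, arriving with speed mu_dot at T."""
    tau = T - t
    return (6.0 * (mu - x - tau * v) - 2.0 * tau * (mu_dot - v)) / tau**2

def simulate(x0, v0, mu, mu_dot, sigma=0.5, T=1.0, n_steps=1000):
    """Euler-Maruyama simulation of dx = v dt, dv = u dt + sigma dw under (7)."""
    dt = T / n_steps
    x, v, t = x0, v0, 0.0
    for _ in range(n_steps - 1):  # stop one step before T, where (7) diverges
        u = single_target_control(x, v, mu, mu_dot, t, T)
        x += v * dt
        v += u * dt + sigma * np.sqrt(dt) * np.random.randn()
        t += dt
    return x, v  # close to (mu, mu_dot) for moderate sigma

print(simulate(x0=0.0, v0=0.0, mu=1.0, mu_dot=0.0))
```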
Now consider a task defined by functions Φ_a(y_a, s_a) that are peaked around the location (μ_{s_1}, . . . , μ_{s_n}) of a joint target (s_1, . . . , s_n), that is,

$$e^{-\phi(y)/\lambda} = \sum_{s_1, \ldots, s_n} w(s_1, \ldots, s_n) \prod_{a=1}^{n} \Phi_a(y_a, s_a),$$
where the w(s1 , . . . , sn ) are positive weights. We will refer to these weights as
coupling factors, since they introduce dependencies between the agents. The
optimal control of a single agent is obtained using equations (3) and (4), and is
a weighted combination of single-target controls,
$$u_a = \sum_{s=1}^{m} p_a(s)\, u_a(s), \qquad (8)$$
(the explicit (x, t) dependence has been dropped in the notation). Here ua (s) is
the control for agent a to go to target s,
$$u_a(s) = -R^{-1} B^\top \partial_{x_a} Z_a(s), \qquad (9)$$
with Z_a(s) defined by

$$Z_a(s_a) = \int \mathrm{d}y_a\, \rho_a(y_a, T\,|\,x_a, t)\, \Phi_a(y_a, s_a).$$
In the setting of Example 1, with point targets, the end cost takes the form

$$e^{-\phi(y,\dot y)/\lambda} = \sum_{s_1, \ldots, s_n} w(s_1, \ldots, s_n) \prod_{a=1}^{n} \delta(y_a - \mu_{s_a})\,\delta(\dot y_a).$$
For any agent a, the optimal control under this task is a weighted average of single-target controls (7), obtained by replacing the target μ_a in (7) with the weighted mean

$$\mu_a = \sum_{s=1}^{n} p_a(s)\, \mu_s.$$
The average is taken with respect to the marginal pa of the joint distribution
$$p(s_1, \ldots, s_n) \propto w(s_1, \ldots, s_n) \prod_{a=1}^{n} \rho_a(\mu_{s_a}, 0, T\,|\,x_a, \dot x_a, t),$$
over the joint task of the agents. In the most complex case, each agent will have to take the joint state of the entire system into account to fulfil the task. In less complicated cases, an agent will only consider the states of a few agents in the system; in other words, the coupling factors will have a nontrivial factorized form:
$$w(s_1, \ldots, s_n) = \prod_{A} w_A(s_A),$$
where the A are subsets of agents. In such cases we may represent the couplings,
and thus the joint distribution, by a factor graph; see Figure 1 for an example.
Fig. 1. Example of a factor graph for a multi-agent system of four agents (nodes 1, 2, 3, 4). The couplings are represented by the factors A, with A = {1, 4}, {1, 2}, {2, 4}, {3, 4}, {2, 3}.
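For small systems these marginals can be computed exactly by brute-force enumeration, as in the sketch below; representing w as a Python function and the agent factors as a precomputed table is our own choice of interface. The n_targets ** n_agents iterations make plain why exact inference by enumeration is intractable for large systems.

```python
import itertools
import numpy as np

def exact_marginals(w, rho, n_agents, n_targets):
    """Brute-force marginals p_a(s) of the joint distribution
    p(s_1, ..., s_n) proportional to w(s) * prod_a rho[a][s_a].

    w: function mapping a tuple s to a positive coupling factor.
    rho[a][s]: transition density of agent a to target s, e.g. the
    Gaussian density of Example 1 evaluated at (mu_s, 0)."""
    p = np.zeros((n_agents, n_targets))
    for s in itertools.product(range(n_targets), repeat=n_agents):
        weight = w(s) * np.prod([rho[a][s[a]] for a in range(n_agents)])
        for a in range(n_agents):
            p[a, s[a]] += weight  # accumulate unnormalized marginals
    return p / p.sum(axis=1, keepdims=True)
```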
where q(s) = q_1(s_1) · · · q_n(s_n). Here the H(q_a) are the entropies of the distributions q_a,

$$H(q_a) = -\sum_{s} q_a(s) \log q_a(s).$$
The minimum

$$J_{\mathrm{MF}} = \min_{\{q_a\}} F_{\mathrm{MF}}(\{q_a\})$$

is an upper bound for the optimal expected cost-to-go J; it equals J in case the agents are uncoupled. F_MF has zero gradient in its local minima, that is,
The mean field equations are solved by means of iteration, and the solutions are
the local minima of the mean field free energy. Thus the mean field free energy
minimized over all solutions to the mean field equations equals the minimum
JMF .
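The mean field equations are not reproduced here in closed form, but they are the standard coordinate-wise updates for a fully factorized q. The sketch below iterates them for the special case of pairwise coupling factors w_ab, an assumption we make so that the required expectations stay tractable.

```python
import numpy as np

def mean_field(log_rho, log_w_pair, n_iter=100):
    """Coordinate-wise mean field updates for a distribution
    p(s) proportional to prod_a rho_a(s_a) * prod_{(a,b)} w_ab(s_a, s_b).

    log_rho: (n, m) array with log_rho[a, s] = log rho_a(s).
    log_w_pair: dict mapping an edge (a, b) to an (m, m) array of log w_ab.
    Returns the factorized approximation q as an (n, m) array."""
    n, m = log_rho.shape
    q = np.full((n, m), 1.0 / m)
    for _ in range(n_iter):
        for a in range(n):
            log_qa = log_rho[a].copy()
            for (i, j), log_w in log_w_pair.items():
                if i == a:
                    log_qa += log_w @ q[j]    # E_{q_j}[log w_aj(s_a, s_j)]
                elif j == a:
                    log_qa += log_w.T @ q[i]  # E_{q_i}[log w_ia(s_i, s_a)]
            log_qa -= log_qa.max()            # numerical stability
            q[a] = np.exp(log_qa) / np.exp(log_qa).sum()
    return q
```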
The mean field approximation of the optimal control is found by taking the
gradient of the minimum JMF of the mean field free energy, similar to the exact
case where the optimal control is the gradient of the optimal expected cost-to-go,
equation (3):
$$u_a(x, t) = -R_a^{-1} B_a^\top \partial_{x_a} J_{\mathrm{MF}}(x, t) = \sum_{s_a} q_a(s_a)\, u_a(x_a, t; s_a).$$
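In code, this weighted combination can be assembled from the single-target controls (7); the sketch below reuses single_target_control from the earlier example and assumes, as in the task above, zero arrival speeds.

```python
def mf_control(a, x, v, t, T, q, mu):
    """Mean field control of agent a: the single-target controls (7) to the
    targets mu[s], weighted by the mean field marginal q[a]."""
    return sum(q[a][s] * single_target_control(x, v, mu[s], 0.0, t, T)
               for s in range(len(mu)))
```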
repeated for the remaining agents and targets, until no agents and targets remain. We will refer to this method as the sort distances (SD) method; a sketch is given below.
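The description above leaves some details open; the following greedy sketch is one plausible reading of the SD method, repeatedly matching the globally closest (agent, target) pair and removing both, so that no two agents head to the same target.

```python
def sort_distances(positions, targets):
    """Greedy assignment in the spirit of the SD method (our reading):
    repeatedly match the globally closest (agent, target) pair and
    remove both, until no agents and targets remain."""
    agents = list(range(len(positions)))
    free = list(range(len(targets)))
    assignment = {}
    while agents:
        a, s = min(((a, s) for a in agents for s in free),
                   key=lambda pair: abs(positions[pair[0]] - targets[pair[1]]))
        assignment[a] = s
        agents.remove(a)
        free.remove(s)
    return assignment
```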
For several sizes of the system we computed, under both control methods, the control cost and the CPU time required to calculate the controls.
Figures 2(a) and (b) show the control cost and the required CPU time as a
function of the system size n; each value is an average obtained from 100 sim-
ulations. To emphasize the necessity of the approximate inference methods, in
figure 2(b) we included the required CPU time under exact inference; this quantity increases exponentially with n, as expected, making exact
inference intractable in large MASs. In contrast, both under the SD method and
the MF method the required CPU time appears to increase polynomially with n,
the SD method requiring less computation time than the MF method. Though
the SD method is faster than the MF method, it also is more costly: the control
cost under the SD method is significantly higher than under the MF method.
The MF method thus better approximates the optimal control.
Fig. 2. The control cost (a) and the required CPU time in seconds (b), as functions of the number of agents n, under the exact method (· − ·), the MF method (−−), and the SD method (—).
Figure 3 shows the positions and the velocities of the agents over time, both
under the control obtained using the MF approximation and under the control
obtained with the SD method. We observe that under MF control the agents determine their targets early, between t = 0 and t = 0.5, and the agents' velocities gradually increase from zero to a maximum value at t = 0.5, then gradually decrease to zero again, as required. This is not very surprising, since the
MF approximation is known to show an early symmetry breaking. In contrast,
under the SD method the decision making process of the agents choosing their targets takes place over almost the entire time interval, and the velocities of the agents are subject to frequent changes; in particular, as time increases, the agents that have not yet chosen a target seem to exchange targets frequently. This may be understood by realising that under the SD method an agent always performs a control towards its nearest target only, instead of a weighted combination of controls towards different targets, as is the case under MF control.
Fig. 3. A multi-agent system of 15 agents. The positions (a) and the velocities (b) over time under MF control, and the positions (c) and the velocities (d) over time under SD control.
Furthermore, compared with the velocities under the MF method, the velocities under the SD method reach higher maximum values. This may account for the relatively high control costs under SD control.
4 Discussion
We have studied the control of agents that have to distribute themselves over a number of targets, such that each target is reached by precisely one agent. In the sort
distances method each agent performs a control to a single nearby target, in
such a way that no two agents head to the same target at the same time. This
method has the advantage of being fast, but it results in relatively high control
costs. Because each agent performs a control to a single target, agents switch
targets frequently during the control process. In the mean field approximation
each agent performs a control which is a weighted sum of controls to single tar-
gets. This requires more computation time than the sort distances method, but
involves significantly lower control costs and therefore is a better approximation
to the optimal control.
An obvious choice for a graphical model inference method not considered in
the present paper would be belief propagation. Results of numeric simulations
with this method in the context of multi-agent control, and comparisons with the
mean field approximation and the exact junction tree algorithm will be published
elsewhere.
There are many possible model extensions worth exploring in future re-
search. Examples are non-zero potentials V in case of a non-empty environment,
penalties for collisions in the context of robotics, non-fixed end times, or bounded
state spaces in the context of a production process. Typically, such model ex-
tensions will not allow for a solution in closed form, and approximate numerical
methods will be required. Some suggestions are given by Kappen [4,5]. In the
setting that we considered, the model describing the behaviour of the agents was given. It would be worthwhile, however, to consider cases of stochastic optimal control of multi-agent systems in continuous space and time where the model first needs to be learned.
Acknowledgments
We thank Joris Mooij for making available useful software and the reviewers for
their useful remarks. This research is part of the Interactive Collaborative Infor-
mation Systems (ICIS) project, supported by the Dutch Ministry of Economic
Affairs, grant BSIK03024.
References
1. Guestrin, C., Koller, D., Parr, R.: Multiagent planning with factored MDPs. In:
Proceedings of NIPS, vol. 14, pp. 1523–1530 (2002)
2. Guestrin, C., Venkataraman, S., Koller, D.: Context-specific multiagent coordination
and planning with factored MDPs. In: Proceedings of AAAI, vol. 18, pp. 253–259
(2002)
3. Wiegerinck, W., van den Broek, B., Kappen, B.: Stochastic optimal control in con-
tinuous space-time multi-agent systems. In: UAI 2006 (2006)
4. Kappen, H.J.: Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, P11011 (2005)
5. Kappen, H.J.: Linear theory for control of nonlinear stochastic systems. Physical Review Letters 95(20), 200201 (2005)
then

$$\frac{1}{2}\, u_a^\top R\, u_a + (B u_a)^\top \partial_{x_a} J = -\frac{1}{2}\lambda^{2} Z^{-2}\, (\partial_{x_a} Z)^\top B R^{-1} B^\top \partial_{x_a} Z,$$

$$\frac{1}{2}\,\mathrm{Tr}\!\left(\sigma\sigma^\top \partial_{x_a}^{2} J\right) = \frac{1}{2}\lambda Z^{-2}\, (\partial_{x_a} Z)^\top \sigma\sigma^\top \partial_{x_a} Z - \frac{1}{2}\lambda Z^{-1}\, \mathrm{Tr}\!\left(\sigma\sigma^\top \partial_{x_a}^{2} Z\right).$$
The terms quadratic in Z vanish when σσᵀ and R are related via

$$\sigma\sigma^\top = \lambda B R^{-1} B^\top. \qquad (14)$$

In the one-dimensional case a constant λ can always be found such that equation (14) is satisfied; for scalar σ, B and R it reads λ = σ²R/B². In the higher-dimensional case the equation puts restrictions on the matrices σ and R, because in general σσᵀ and BR⁻¹Bᵀ will not be proportional.
When equation (14) is satisfied, the HJB equation becomes

$$\partial_t Z = \left(\frac{V}{\lambda} - \sum_{a=1}^{n} b_a\, \partial_{x_a} - \sum_{a=1}^{n} \frac{1}{2}\,\mathrm{Tr}\!\left(\sigma\sigma^\top \partial_{x_a}^{2}\right)\right) Z = -HZ, \qquad (15)$$
the density ρ(y, θ|x, t) (t < θ ≤ T) satisfying the forward Fokker-Planck equa-
tion (5). Combining the equations (13) and (16) yields the expression (4) for the
optimal expected cost-to-go.