Distributed Subgradient Methods For Multi-Agent Optimization
Angelia Nedic
We would like to thank Rayadurgam Srikant, Devavrat Shah, the associate editor, three anonymous
referees, and various seminar participants for useful comments and discussions.
We study the problem of cooperatively minimizing the sum $\sum_{i=1}^m f_i(x)$, where the function $f_i : \mathbb{R}^n \to \mathbb{R}$ represents the cost function of agent $i$, known by this agent only, and $x \in \mathbb{R}^n$ is a decision vector. The decision vector can be viewed as either a resource vector where sub-components correspond to resources allocated to each agent, or a global decision vector which the agents are trying to compute using local information.
Our algorithm is as follows: each agent generates and maintains estimates of the opti-
mal decision vector based on information concerning his own cost function (in particular,
the subgradient information of f
i
) and exchanges these estimates directly or indirectly
with the other agents in the network. We allow such communication to be asynchronous,
local, and with time varying connectivity. Under some weak assumptions, we prove that
this type of local communication and computation achieves consensus in the estimates
and converges to an (approximate) global optimal solution. Moreover, we provide rate of
convergence analysis for the estimates maintained by each agent. In particular, we show
that there is a tradeoff between the quality of an approximate optimal solution and the
computation load required to generate such a solution. Our convergence rate estimate
captures this dependence explicitly in terms of the system and algorithm parameters.
Our approach builds on the seminal work of Tsitsiklis [23] (see also Tsitsiklis et al.
[24], Bertsekas and Tsitsiklis [2]), who developed a framework for the analysis of distributed computation models. This framework focuses on the minimization of a (smooth) function $f(x)$ by distributing the processing of the components of the vector $x \in \mathbb{R}^n$ among $n$ agents. In contrast, our focus is on problems in which each agent has a locally known, different, convex and potentially nonsmooth cost function. To the best of our knowledge, this paper presents the first analysis of this type of distributed resource allocation problem.
In addition, our work is also related to the literature on reaching consensus on a
particular scalar value or computing exact averages of the initial values of the agents,
which has attracted much recent attention as natural models of cooperative behavior in
networked systems (see Vicsek et al. [25], Jadbabaie et al. [8], Boyd et al. [4], Olfati-Saber and Murray [17], Cao et al. [5], and Olshevsky and Tsitsiklis [18, 19]). Our work
is also related to the utility maximization framework for resource allocation in networks
(see Kelly et al. [10], Low and Lapsley [11], Srikant [22], and Chiang et al. [7]). In
contrast to this literature, we allow the local performance measures to depend on the
entire resource allocation vector and provide explicit rate analysis.
The remainder of this paper is organized as follows: In Section 2, we introduce
a model for the information exchange among the agents and a distributed optimiza-
tion model. In Section 3, we establish the basic convergence properties of the transition
matrices governing the evolution of the system in time. In Section 4, we establish conver-
gence and convergence rate results for the proposed distributed optimization algorithm.
Finally, we give our concluding remarks in Section 5.
Basic Notation and Notions:
A vector is viewed as a column vector, unless clearly stated otherwise. We denote by $x_i$ or $[x]_i$ the $i$-th component of a vector $x$. When $x_i \ge 0$ for all components $i$ of a vector $x$, we write $x \ge 0$. For a matrix $A$, we write $A^j_i$ or $[A]^j_i$ to denote the matrix entry in the $i$-th row and $j$-th column. We write $[A]_i$ to denote the $i$-th row of the matrix $A$, and $[A]^j$ to denote the $j$-th column of $A$.

We denote the nonnegative orthant by $\mathbb{R}^n_+$, i.e., $\mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x \ge 0\}$. We write $x'$ to denote the transpose of a vector $x$, and $\|x\|$ to denote the standard Euclidean norm, $\|x\| = \sqrt{x'x}$. We write $\|x\|_\infty$ to denote the max norm, $\|x\|_\infty = \max_{1 \le i \le m} |x_i|$.
A vector $a \in \mathbb{R}^m$ is said to be a stochastic vector when its components $a_i$, $i = 1, \dots, m$, are nonnegative and their sum is equal to 1, i.e., $\sum_{i=1}^m a_i = 1$. A square $m \times m$ matrix $A$ is said to be a stochastic matrix when each row of $A$ is a stochastic vector. A square $m \times m$ matrix $A$ is said to be a doubly stochastic matrix when both $A$ and $A'$ are stochastic matrices.

For a function $F : \mathbb{R}^n \to (-\infty, \infty]$, we denote the domain of $F$ by $\mathrm{dom}(F)$, where $\mathrm{dom}(F) = \{x \in \mathbb{R}^n \mid F(x) < \infty\}$.
We use the notion of a subgradient of a convex function $F(x)$ at a given vector $\bar{x} \in \mathrm{dom}(F)$. We say that $s_F(\bar{x}) \in \mathbb{R}^n$ is a subgradient of the function $F$ at $\bar{x} \in \mathrm{dom}(F)$ when the following relation holds:

$$F(\bar{x}) + s_F(\bar{x})'(x - \bar{x}) \le F(x) \quad \text{for all } x \in \mathrm{dom}(F). \tag{1}$$

$\sum_{j=1}^m a^i_j(k) = 1$ for all $i$ and $k$.
Assumption 1(a) states that each agent gives significant weight to her own state $x^i(k)$ and to the information state $x^j(k)$ available from her neighboring agents $j$ at the update time $t_k$. Naturally, an agent $i$ assigns zero weight to the states $x^j$ for those agents $j$ whose information state is not available at the update time.¹ Note that, under Assumption 1, for a matrix $A(k)$ whose columns are $a^1(k), \dots, a^m(k)$, the transpose $A'(k)$ is a stochastic matrix.

consisting of edges $(j, i)$ such that $j$ is a neighbor of $i$ who communicates with $i$ infinitely often. The connectivity requirement is formally stated in the following assumption.
Assumption 2 (Connectivity) The graph $(V, E_\infty)$ is connected, where $E_\infty$ is the set of edges $(j, i)$ representing agent pairs communicating directly infinitely many times, i.e.,

$$E_\infty = \{(j, i) \mid (j, i) \in E_k \text{ for infinitely many indices } k\}.$$

In other words, this assumption states that for any $k$ and any agent pair $(j, i)$, there is a directed path from agent $j$ to agent $i$ with edges in the set $\bigcup_{l \ge k} E_l$. Thus, Assumption 2 is equivalent to the assumption that the composite directed graph $(V, \bigcup_{l \ge k} E_l)$ is connected for all $k$.

When analyzing the system state behavior, we use an additional assumption that the intercommunication intervals are bounded for those agents that communicate directly. In particular, we use the following.

Assumption 3 (Bounded Intercommunication Interval) There exists an integer $B \ge 1$ such that every agent $j$ with $(j, i) \in E_\infty$ sends its information to the neighboring agent $i$ at least once every $B$ consecutive time slots, i.e., $(j, i) \in E_k \cup E_{k+1} \cup \cdots \cup E_{k+B-1}$ for all $(j, i) \in E_\infty$ and $k \ge 0$.
2.2 Optimization Model
We consider a scenario where the agents cooperatively minimize a common additive cost. Each agent has information only about one cost component, and minimizes that component while exchanging information with the other agents. In particular, the agents want to cooperatively solve the following unconstrained optimization problem:

$$\text{minimize } \sum_{i=1}^m f_i(x) \quad \text{subject to } x \in \mathbb{R}^n, \tag{2}$$

where each $f_i : \mathbb{R}^n \to \mathbb{R}$ is a convex function. We denote the optimal value of this problem by $f^*$ and the set of optimal solutions by $X^*$, i.e.,

$$X^* = \Big\{x \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^m f_i(x) = f^*\Big\}.$$
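As a concrete, purely hypothetical instance of problem (2), one can take quadratic agent costs, for which the global minimizer has a closed form; the costs and centers below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical instance of problem (2): m = 3 agents, each with a private
# quadratic cost f_i(x) = 0.5 * ||x - c_i||^2 (the paper only assumes each
# f_i is convex; the quadratic form and the centers c_i are assumptions).

def global_cost(x, centers):
    """f(x) = sum_i 0.5 * ||x - c_i||^2, the additive cost in problem (2)."""
    return sum(0.5 * np.dot(x - c, x - c) for c in centers)

def optimal_point(centers):
    """For this quadratic instance, the minimizer of f is the mean of the c_i."""
    return np.mean(centers, axis=0)

centers = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 3.0])]
x_star = optimal_point(centers)
```

For general convex $f_i$ no such closed form exists, which is precisely the setting the distributed subgradient method addresses.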
In this setting, the information state of an agent $i$ is an estimate of an optimal solution of the problem (2). We denote by $x^i(k) \in \mathbb{R}^n$ the estimate maintained by agent $i$ at the time $t_k$. The agents update their estimates as follows: when generating a new estimate, agent $i$ combines his/her current estimate $x^i$ with the estimates $x^j$ received from some of the other agents $j$. Here we assume that there is no communication delay in delivering a message from agent $j$ to agent $i$.²
In particular, agent $i$ updates his/her estimates according to the following relation:

$$x^i(k+1) = \sum_{j=1}^m a^i_j(k)\, x^j(k) - \alpha^i(k)\, d^i(k), \tag{3}$$

where the vector $a^i(k) = (a^i_1(k), \dots, a^i_m(k))'$ is the vector of weights of agent $i$, the scalar $\alpha^i(k) > 0$ is a stepsize used by agent $i$, and the vector $d^i(k)$ is a subgradient of the agent $i$ cost function $f_i(x)$ at $x = x^i(k)$. Unwinding this recursion, we obtain, for $s$ and $k$ with $k \ge s \ge 0$,

$$x^i(k+1) = \sum_{j=1}^m \big[A(s)A(s+1)\cdots A(k-1)\,a^i(k)\big]_j\, x^j(s)$$
$$- \sum_{j=1}^m \big[A(s+1)\cdots A(k-1)\,a^i(k)\big]_j\, \alpha^j(s)\, d^j(s)$$
$$- \sum_{j=1}^m \big[A(s+2)\cdots A(k-1)\,a^i(k)\big]_j\, \alpha^j(s+1)\, d^j(s+1)$$
$$- \cdots - \sum_{j=1}^m \big[A(k-1)\,a^i(k)\big]_j\, \alpha^j(k-2)\, d^j(k-2)$$
$$- \sum_{j=1}^m \big[a^i(k)\big]_j\, \alpha^j(k-1)\, d^j(k-1) - \alpha^i(k)\, d^i(k). \tag{4}$$

² A more general model that accounts for the possibility of such delays is the subject of our current work; see [16].
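A minimal simulation can make update (3) concrete. The fixed doubly stochastic weight matrix, the scalar costs $f_i(x) = |x - a_i|$, and the constant stepsize below are illustrative assumptions, not the paper's general time-varying model:

```python
import numpy as np

# Sketch of update (3) with a constant stepsize: each agent mixes the current
# estimates with doubly stochastic weights and steps along a subgradient of
# its own cost.  The ring weights, the scalar costs f_i(x) = |x - a_i|, and
# the stepsize are illustrative assumptions.

def run(A, targets, alpha=0.01, iters=3000):
    m = len(targets)
    x = np.zeros(m)                          # x[i] is agent i's estimate
    for _ in range(iters):
        d = np.sign(x - targets)             # subgradient of |x_i - a_i| at x_i
        x = A @ x - alpha * d                # relation (3)
    return x

A = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])           # symmetric, doubly stochastic
targets = np.array([0.0, 1.0, 5.0])          # minimizer of sum_i |x - a_i| is 1.0
estimates = run(A, targets)
```

All three estimates cluster near the median 1.0, with residual error and disagreement proportional to the stepsize, as the convergence analysis below quantifies.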
Let us introduce the matrices

$$\Phi(k, s) = A(s)A(s+1)\cdots A(k-1)A(k) \quad \text{for all } s \text{ and } k \text{ with } k \ge s,$$

where $\Phi(k, k) = A(k)$ for all $k$. Note that the $i$-th column of $\Phi(k, s)$ is given by

$$[\Phi(k, s)]^i = A(s)A(s+1)\cdots A(k-1)\,a^i(k) \quad \text{for all } i, s \text{ and } k \text{ with } k \ge s,$$

while the entry in the $i$-th column and $j$-th row of $\Phi(k, s)$ is given by

$$[\Phi(k, s)]^i_j = \big[A(s)A(s+1)\cdots A(k-1)\,a^i(k)\big]_j \quad \text{for all } i, j, s \text{ and } k \text{ with } k \ge s.$$
We can now rewrite relation (4) compactly in terms of the matrices $\Phi(k, s)$, as follows: for any $i \in \{1, \dots, m\}$, and $s$ and $k$ with $k \ge s \ge 0$,

$$x^i(k+1) = \sum_{j=1}^m [\Phi(k, s)]^i_j\, x^j(s) - \sum_{r=s+1}^{k} \Big( \sum_{j=1}^m [\Phi(k, r)]^i_j\, \alpha^j(r-1)\, d^j(r-1) \Big) - \alpha^i(k)\, d^i(k). \tag{5}$$
We start our analysis by considering the transition matrices $\Phi(k, s)$.

3 Convergence of the Transition Matrices $\Phi(k, s)$
In this section, we study the convergence behavior of the matrices $\Phi(k, s)$ as $k$ goes to infinity, and we establish convergence rate estimates for these matrices. Clearly, the convergence rate of these matrices dictates the convergence rate of the agents' estimates to an optimal solution of the overall optimization problem (2). Recall that these matrices are given by

$$\Phi(k, s) = A(s)A(s+1)\cdots A(k-1)A(k) \quad \text{for all } s \text{ and } k \text{ with } k \ge s, \tag{6}$$

where

$$\Phi(k, k) = A(k) \quad \text{for all } k. \tag{7}$$

3.1 Basic Properties
Here, we establish some properties of the transition matrices $\Phi(k, s)$ under the assumptions discussed in Section 2.
Lemma 1 Let Weights Rule (a) hold [cf. Assumption 1(a)]. We then have:
(a) $[\Phi(k, s)]^j_j \ge \eta^{k-s+1}$ for all $j$, $k$, and $s$ with $k \ge s$.
(b) $[\Phi(k, s)]^i_j \ge \eta^{k-s+1}$ for all $k$ and $s$ with $k \ge s$, and for all $(j, i) \in E_s \cup \cdots \cup E_k$, where $E_t$ is the set of edges defined by
$$E_t = \{(j, i) \mid a^i_j(t) > 0\} \quad \text{for all } t.$$
(c) Let $(j, v) \in E_s \cup \cdots \cup E_r$ for some $r \ge s$, and let $(v, i) \in E_{r+1} \cup \cdots \cup E_k$ for $k > r$. Then,
$$[\Phi(k, s)]^i_j \ge \eta^{k-s+1}.$$
(d) Let Weights Rule (b) also hold [cf. Assumption 1(b)]. Then, the matrices $\Phi'(k, s)$ are stochastic for all $k$ and $s$ with $k \ge s$.
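Lemma 1(a) can be sanity-checked numerically. The random weight matrices below (columns stochastic, diagonal entries at least $\eta$) are an illustrative assumption:

```python
import numpy as np

# Numerical sanity check of Lemma 1(a): for products of nonnegative weight
# matrices whose diagonal entries are at least eta, the diagonal of
# Phi(k, s) = A(s) ... A(k) stays above eta^(k-s+1).  The random matrices
# below (columns stochastic, diagonal >= eta) are an illustrative assumption.

rng = np.random.default_rng(0)
m, eta = 4, 0.3

def random_weight_matrix():
    """Columns are stochastic weight vectors; diagonal entries are >= eta."""
    A = rng.random((m, m))
    A = A / A.sum(axis=0, keepdims=True)     # make every column sum to 1
    return (1 - eta) * A + eta * np.eye(m)   # enforce a^i_i >= eta

factors = [random_weight_matrix() for _ in range(6)]   # k - s + 1 = 6 factors
P = factors[0]
for A in factors[1:]:
    P = P @ A                                # the product Phi(k, s)
lemma_bound = eta ** len(factors)            # eta^(k-s+1)
```

Because every factor has nonnegative entries and diagonal at least $\eta$, the diagonal of the product is bounded below by the product of the diagonals, which is the induction step in the proof that follows.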
Proof. We let $s$ be arbitrary, and prove the relations by induction on $k$.
(a) Note that, in view of Assumption 1(a), the matrices $\Phi(k, s)$ have nonnegative entries for all $k$ and $s$ with $k \ge s$. Furthermore, by Assumption 1(a)(i), we have $[\Phi(s, s)]^j_j \ge \eta$. Thus, the relation $[\Phi(k, s)]^j_j \ge \eta^{k-s+1}$ holds for $k = s$.

Now, assume that for some $k$ with $k > s$ we have $[\Phi(k, s)]^j_j \ge \eta^{k-s+1}$, and consider $[\Phi(k+1, s)]^j_j$. By the definition of the matrix $\Phi(k, s)$ [cf. Eq. (6)], we have

$$[\Phi(k+1, s)]^j_j = \sum_{h=1}^m [\Phi(k, s)]^h_j\, a^j_h(k+1) \ge [\Phi(k, s)]^j_j\, a^j_j(k+1),$$

where the inequality in the preceding relation follows from the nonnegativity of the entries of $\Phi(k, s)$. By using the inductive hypothesis and the relation $a^j_j(k+1) \ge \eta$ [cf. Assumption 1(a)(i)], we obtain

$$[\Phi(k+1, s)]^j_j \ge \eta^{k-s+2}.$$

Hence, the relation $[\Phi(k, s)]^j_j \ge \eta^{k-s+1}$ holds for all $k \ge s$.
(b) Let $(j, i) \in E_s$. Then, by the definition of $E_s$ and Assumption 1(a), we have $a^i_j(s) \ge \eta$. Since $\Phi(s, s) = A(s)$ [cf. Eq. (7)], it follows that the relation $[\Phi(k, s)]^i_j \ge \eta^{k-s+1}$ holds for $k = s$ and any $(j, i) \in E_s$. Assume now that for some $k > s$ and all $(j, i) \in E_s \cup \cdots \cup E_k$, we have $[\Phi(k, s)]^i_j \ge \eta^{k-s+1}$. Consider $k + 1$, and let $(j, i) \in E_s \cup \cdots \cup E_k \cup E_{k+1}$. There are two possibilities: $(j, i) \in E_s \cup \cdots \cup E_k$ or $(j, i) \in E_{k+1}$.

Suppose that $(j, i) \in E_s \cup \cdots \cup E_k$. Then, by the induction hypothesis, we have

$$[\Phi(k, s)]^i_j \ge \eta^{k-s+1}.$$

Therefore,

$$[\Phi(k+1, s)]^i_j = \sum_{h=1}^m [\Phi(k, s)]^h_j\, a^i_h(k+1) \ge [\Phi(k, s)]^i_j\, a^i_i(k+1),$$

where the inequality in the preceding relation follows from the nonnegativity of the entries of $\Phi(k, s)$. By combining the preceding two relations, and using the fact $a^i_i(r) \ge \eta$ for all $i$ and $r$ [cf. Assumption 1(a)(i)], we obtain

$$[\Phi(k+1, s)]^i_j \ge \eta^{k-s+2}.$$

Suppose now that $(j, i) \in E_{k+1}$. Then, by the definition of $E_{k+1}$, we have $a^i_j(k+1) \ge \eta$. Furthermore, by part (a), we have

$$[\Phi(k, s)]^j_j \ge \eta^{k-s+1}.$$

Therefore,

$$[\Phi(k+1, s)]^i_j = \sum_{h=1}^m [\Phi(k, s)]^h_j\, a^i_h(k+1) \ge [\Phi(k, s)]^j_j\, a^i_j(k+1) \ge \eta^{k-s+2}.$$

Hence, $[\Phi(k, s)]^i_j \ge \eta^{k-s+1}$ holds for all $k \ge s$ and all $(j, i) \in E_s \cup \cdots \cup E_k$.
(c) Let $(j, v) \in E_s \cup \cdots \cup E_r$ for some $r \ge s$, and let $(v, i) \in E_{r+1} \cup \cdots \cup E_k$ for $k > r$. Then, by the nonnegativity of the entries of $\Phi(r, s)$ and $\Phi(k, r+1)$, we have

$$[\Phi(k, s)]^i_j = \sum_{h=1}^m [\Phi(r, s)]^h_j\, [\Phi(k, r+1)]^i_h \ge [\Phi(r, s)]^v_j\, [\Phi(k, r+1)]^i_v.$$

By part (b), we further have

$$[\Phi(r, s)]^v_j \ge \eta^{r-s+1}, \qquad [\Phi(k, r+1)]^i_v \ge \eta^{k-r},$$

implying that

$$[\Phi(k, s)]^i_j \ge \eta^{r-s+1}\,\eta^{k-r} = \eta^{k-s+1}.$$
(d) Recall that, for each $k$, the columns of the matrix $A(k)$ are the weight vectors $a^1(k), \dots, a^m(k)$. Hence, by Assumption 1, the matrix $A'(k)$ is stochastic for each $k$. From $\Phi'(k, s) = A'(k) \cdots A'(s+1)A'(s)$, it follows that $\Phi'(k, s)$ is stochastic, as a product of stochastic matrices.

there is a path $j = i_0 \to i_1 \to \cdots \to i_r = i$ from $j$ to $i$ with distinct nodes $i_\ell$, $\ell = 0, \dots, r$, and with edges $(i_{\ell-1}, i_\ell)$ in the set

$$E_\infty = \{(\bar h, h) \mid (\bar h, h) \in E_k \text{ for infinitely many indices } k\}.$$

Because each edge $(i_{\ell-1}, i_\ell)$ belongs to $E_\infty$ and the intercommunication intervals are bounded [cf. Assumption 3], we obtain

$$(i_{\ell-1}, i_\ell) \in E_{s+(\ell-1)B} \cup \cdots \cup E_{s+\ell B-1} \quad \text{for } \ell = 1, \dots, r.$$
By using Lemma 1(b), we have

$$\big[\Phi\big(s+\ell B-1,\, s+(\ell-1)B\big)\big]^{i_\ell}_{i_{\ell-1}} \ge \eta^{B} \quad \text{for } \ell = 1, \dots, r.$$

By Lemma 1(c), it follows that

$$[\Phi(s+rB-1, s)]^i_j \ge \eta^{rB}.$$

Since there are $m$ agents, and the nodes in the path $j = i_0 \to i_1 \to \cdots \to i_{r-1} \to i_r = i$ are distinct, it follows that $r \le m-1$. Hence, we have

$$[\Phi(s+(m-1)B-1, s)]^i_j = \sum_{h=1}^m [\Phi(s+rB-1, s)]^h_j\, [\Phi(s+(m-1)B-1,\, s+rB)]^i_h$$
$$\ge [\Phi(s+rB-1, s)]^i_j\, [\Phi(s+(m-1)B-1,\, s+rB)]^i_i \ge \eta^{rB}\, \eta^{(m-1)B-rB} = \eta^{(m-1)B},$$

where the last inequality follows from $[\Phi(k, s)]^i_i \ge \eta^{k-s+1}$ for all $i$, $k$, and $s$ with $k \ge s$ [cf. Lemma 1(a)].
Our ultimate goal is to study the limit behavior of $\Phi(k, s)$ as $k \to \infty$ for a fixed $s \ge 0$. For this analysis, we introduce the matrices $D_k(s)$ as follows: for a fixed $s \ge 0$,

$$D_k(s) = \Phi'\big(s + kB_0 - 1,\, s + (k-1)B_0\big) \quad \text{for } k = 1, 2, \dots, \tag{8}$$

where $B_0 = (m-1)B$. The next lemma shows that, for each $s \ge 0$, the product of these matrices converges as $k$ increases to infinity.
Lemma 3 Let Weights Rule, Connectivity, and Bounded Intercommunication Interval assumptions hold [cf. Assumptions 1, 2, and 3]. Let the matrices $D_k(s)$ for $k \ge 1$ and a fixed $s \ge 0$ be given by Eq. (8). We then have:
(a) The limit $\bar{D}(s) = \lim_{k\to\infty} D_k(s) \cdots D_1(s)$ exists.
(b) The limit $\bar{D}(s)$ is a stochastic matrix with identical rows [a function of $s$], i.e.,
$$\bar{D}(s) = e\,\phi(s)',$$
where $e \in \mathbb{R}^m$ is a vector of ones and $\phi(s) \in \mathbb{R}^m$ is a stochastic vector.
(c) The convergence of $D_k(s) \cdots D_1(s)$ to $\bar{D}(s)$ is geometric: for every $x \in \mathbb{R}^m$,
$$\big\| \big(D_k(s) \cdots D_1(s)\big)x - \bar{D}(s)x \big\|_\infty \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^k \|x\|_\infty \quad \text{for all } k \ge 1.$$
In particular, for every $j$, the entries $[D_k(s) \cdots D_1(s)]^j_i$, $i = 1, \dots, m$, converge to the same limit $\phi_j(s)$ as $k \to \infty$ with a geometric rate: for every $j$,
$$\Big| [D_k(s) \cdots D_1(s)]^j_i - \phi_j(s) \Big| \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^k \quad \text{for all } k \ge 1 \text{ and } i,$$
where $\eta$ is the lower bound of Assumption 1(a), $B_0 = (m-1)B$, $m$ is the number of agents, and $B$ is the intercommunication interval bound of Assumption 3.
Proof. In this proof, we suppress the explicit dependence of the matrices $D_i$ on $s$ to simplify our notation.
(a) We prove that the limit $\lim_{k\to\infty}(D_k \cdots D_1)$ exists by showing that the sequence $\{(D_k \cdots D_1)x\}$ converges for every $x \in \mathbb{R}^m$. To show this, let $x_0 \in \mathbb{R}^m$ be arbitrary, and consider the vector sequence $\{x_k\}$ defined by

$$x_k = D_k \cdots D_1 x_0 \quad \text{for } k \ge 1.$$

We write each vector $x_k$ in the following form:

$$x_k = z_k + c_k e \quad \text{with } z_k \ge 0 \text{ for all } k \ge 0, \tag{9}$$

where $e \in \mathbb{R}^m$ is the vector with all entries equal to 1. The recursion is initialized with

$$z_0 = x_0 - \min_{1\le i\le m}[x_0]_i\, e \quad \text{and} \quad c_0 = \min_{1\le i\le m}[x_0]_i. \tag{10}$$

Having the decomposition for $x_k$, we consider the vector $x_{k+1} = D_{k+1} x_k$. In view of relation (9) and the stochasticity of $D_{k+1}$, we have

$$x_{k+1} = D_{k+1} z_k + c_k e.$$

We define

$$z_{k+1} = D_{k+1} z_k - \Big(\min_{1\le j\le m}[D_{k+1} z_k]_j\Big) e, \tag{11}$$
$$c_{k+1} = \min_{1\le j\le m}[D_{k+1} z_k]_j + c_k. \tag{12}$$

Since, by Lemma 2, every entry of $D_{k+1}$ is at least $\eta^{B_0}$ and each row of $D_{k+1}$ is a stochastic vector, it can be seen that

$$\max_{1\le j\le m}[D_{k+1} z_k]_j - \min_{1\le j\le m}[D_{k+1} z_k]_j \le \big(1 - \eta^{B_0}\big)\,\|z_k\|_\infty.$$

Because $z_{k+1} \ge 0$, it follows that

$$\|z_{k+1}\|_\infty \le \big(1 - \eta^{B_0}\big)\,\|z_k\|_\infty \quad \text{for all } k \ge 0,$$

implying that

$$\|z_k\|_\infty \le \big(1 - \eta^{B_0}\big)^k\,\|z_0\|_\infty \quad \text{for all } k \ge 0. \tag{14}$$

Moreover, since $z_k \ge 0$ and each row of $D_{k+1}$ is a stochastic vector, we have

$$0 \le \min_{1\le j\le m}\sum_{i=1}^m [D_{k+1}]^i_j\, [z_k]_i \le \|z_k\|_\infty,$$

so that $c_k \le c_{k+1} \le c_k + \|z_k\|_\infty$. From the preceding two relations and Eq. (14), it follows that

$$0 \le c_{k+1} - c_k \le \|z_k\|_\infty \le \big(1 - \eta^{B_0}\big)^k\,\|z_0\|_\infty \quad \text{for all } k.$$

Therefore, we have for any $k \ge 1$ and $r \ge 1$,

$$0 \le c_{k+r} - c_k \le (c_{k+r} - c_{k+r-1}) + \cdots + (c_{k+1} - c_k) \le \big(q^{k+r-1} + \cdots + q^k\big)\,\|z_0\|_\infty = \frac{1-q^r}{1-q}\,q^k\,\|z_0\|_\infty,$$

where $q = 1 - \eta^{B_0}$. Hence, $\{c_k\}$ is a Cauchy sequence and therefore it converges to some $c \in \mathbb{R}$. By letting $r \to \infty$ in the preceding relation, we obtain

$$0 \le c - c_k \le \frac{q^k}{1-q}\,\|z_0\|_\infty. \tag{15}$$

From the decomposition of $x_k$ [cf. Eq. (9)], and the relations $z_k \to 0$ and $c_k \to c$, it follows that $(D_k \cdots D_1)x_0 \to c\,e$ for any $x_0 \in \mathbb{R}^m$, with $c$ being a function of $x_0$. Therefore, the limit of $D_k \cdots D_1$ as $k \to \infty$ exists. We denote this limit by $\bar{D}$, for which we have

$$\bar{D}x_0 = c(x_0)\,e \quad \text{for all } x_0 \in \mathbb{R}^m.$$

(b) Since each $D_k$ is stochastic, the limit matrix $\bar{D}$ is also stochastic. Furthermore, because $(D_k \cdots D_1)x \to c(x)\,e$ for any $x \in \mathbb{R}^m$, the limit matrix $\bar{D}$ has rank one. Thus, the rows of $\bar{D}$ are collinear. Because the sum of all entries of $\bar{D}$ in each of its rows is equal to 1, it follows that the rows of $\bar{D}$ are identical. Therefore, for some stochastic vector $\phi(s) \in \mathbb{R}^m$ [a function of the fixed $s$], we have $\bar{D} = e\,\phi(s)'$.
(c) Let $x_k = (D_k \cdots D_1)x_0$ for an arbitrary $x_0 \in \mathbb{R}^m$. By omitting the explicit dependence on $x_0$ in $c(x_0)$, and by using the decomposition of $x_k$ [cf. Eq. (9)], we have

$$(D_k \cdots D_1)x_0 - \bar{D}x_0 = z_k + (c_k - c)\,e \quad \text{for all } k.$$

Using the estimates in Eqs. (14) and (15), we obtain for all $k \ge 1$,

$$\big\|(D_k \cdots D_1)x_0 - \bar{D}x_0\big\|_\infty \le \|z_k\|_\infty + |c_k - c| \le \Big(1 + \frac{1}{1-q}\Big)\,q^k\,\|z_0\|_\infty.$$

Since $z_0 = x_0 - \min_{1\le i\le m}[x_0]_i\, e$ [cf. Eq. (10)], we have $\|z_0\|_\infty \le 2\|x_0\|_\infty$. Therefore,

$$\big\|(D_k \cdots D_1)x_0 - \bar{D}x_0\big\|_\infty \le 2\Big(1 + \frac{1}{1-q}\Big)\,q^k\,\|x_0\|_\infty \quad \text{for all } k,$$

with $q = 1 - \eta^{B_0}$, or equivalently

$$\big\|(D_k \cdots D_1)x_0 - \bar{D}x_0\big\|_\infty \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^k\,\|x_0\|_\infty.$$

By setting $x_0 = e_j$, where $e_j$ is the $j$-th unit vector, and noting that $\bar{D}e_j = \phi_j(s)\,e$ and $\|e_j\|_\infty = 1$, we obtain

$$\big\| [D_k \cdots D_1]^j - \phi_j(s)\,e \big\|_\infty \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^k \quad \text{for all } k.$$

Thus, it follows that

$$\Big| [D_k \cdots D_1]^j_i - \phi_j(s) \Big| \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^k \quad \text{for all } k \ge 1 \text{ and } i.$$
The explicit form of the bound in part (c) of Lemma 3 is new; the other parts have been established by Tsitsiklis [23].

In the following lemma, we present convergence results for the matrices $\Phi(k, s)$ as $k$ goes to infinity. Lemma 3 plays a crucial role in establishing these results. In particular, we show that the matrices $\Phi(k, s)$ have the same limit as the matrices $[D_1(s) \cdots D_k(s)]'$, as $k$ increases to infinity.
Lemma 4 Let Weights Rule, Connectivity, and Bounded Intercommunication Interval assumptions hold [cf. Assumptions 1, 2, and 3]. We then have:
(a) The limit $\bar{\Phi}(s) = \lim_{k\to\infty} \Phi(k, s)$ exists for each $s$.
(b) The limit matrix $\bar{\Phi}(s)$ has identical columns and the columns are stochastic, i.e.,
$$\bar{\Phi}(s) = \phi(s)\,e',$$
where $\phi(s) \in \mathbb{R}^m$ is a stochastic vector for each $s$.
(c) For every $i$, the entries $[\Phi(k, s)]^j_i$, $j = 1, \dots, m$, converge to the same limit $\phi_i(s)$ as $k \to \infty$ with a geometric rate, i.e., for every $i \in \{1, \dots, m\}$ and all $s \ge 0$,
$$\Big| [\Phi(k, s)]^j_i - \phi_i(s) \Big| \le 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}} \quad \text{for all } k \ge s \text{ and } j \in \{1, \dots, m\},$$
where $\eta$ is the lower bound of Assumption 1(a), $B_0 = (m-1)B$, $m$ is the number of agents, and $B$ is the intercommunication interval bound of Assumption 3.
Proof. For a given $s$ and $k \ge s + B_0$, there exists $\kappa \ge 1$ such that $s + \kappa B_0 \le k < s + (\kappa+1)B_0$. Then, by the definition of $\Phi(k, s)$ [cf. Eqs. (6)-(7)], we have

$$\Phi(k, s) = \Phi(s + B_0 - 1,\, s)\,\Phi(k,\, s + B_0)$$
$$= \Phi(s + B_0 - 1,\, s)\cdots\Phi\big(s + \kappa B_0 - 1,\, s + (\kappa-1)B_0\big)\,\Phi(k,\, s + \kappa B_0)$$
$$= \Big(\Phi'\big(s + \kappa B_0 - 1,\, s + (\kappa-1)B_0\big)\cdots\Phi'(s + B_0 - 1,\, s)\Big)'\,\Phi(k,\, s + \kappa B_0).$$

By using the matrices

$$D_k = \Phi'\big(s + kB_0 - 1,\, s + (k-1)B_0\big) \quad \text{for } k \ge 1 \tag{17}$$

[the dependence of $D_k$ on $s$ is suppressed], we can write

$$\Phi(k, s) = (D_\kappa \cdots D_1)'\,\Phi(k,\, s + \kappa B_0).$$

Therefore, for any $i$, $j$ and $k \ge s + B_0$, we have

$$[\Phi(k, s)]^j_i = \sum_{h=1}^m [D_\kappa \cdots D_1]^i_h\,[\Phi(k,\, s + \kappa B_0)]^j_h \le \max_{1\le h\le m} [D_\kappa \cdots D_1]^i_h\,\sum_{h=1}^m [\Phi(k,\, s + \kappa B_0)]^j_h.$$

Since the columns of the matrix $\Phi(k,\, s + \kappa B_0)$ are stochastic vectors, it follows that for any $i$, $j$ and $k \ge s + B_0$,

$$[\Phi(k, s)]^j_i \le \max_{1\le h\le m} [D_\kappa \cdots D_1]^i_h. \tag{18}$$

Similarly, it can be seen that for any $i$, $j$ and $k \ge s + B_0$,

$$[\Phi(k, s)]^j_i \ge \min_{1\le h\le m} [D_\kappa \cdots D_1]^i_h. \tag{19}$$
In view of Lemma 3, for a given $s$, there exists a stochastic vector $\phi(s)$ such that

$$\lim_{\kappa\to\infty} D_\kappa \cdots D_1 = e\,\phi'(s).$$

Furthermore, by Lemma 3(c), we have for every $h, \tilde h \in \{1, \dots, m\}$ and $\kappa \ge 1$,

$$\Big| [D_\kappa \cdots D_1]^h_{\tilde h} - [\phi(s)]_h \Big| \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^\kappa.$$

From the preceding relation, and inequalities (18) and (19), it follows that for $k \ge s + B_0$ and any $i, j \in \{1, \dots, m\}$,

$$\Big| [\Phi(k, s)]^j_i - [\phi(s)]_i \Big| \le \max\Big\{ \max_{1\le h\le m} [D_\kappa \cdots D_1]^i_h - [\phi(s)]_i,\; [\phi(s)]_i - \min_{1\le h\le m} [D_\kappa \cdots D_1]^i_h \Big\} \le 2\big(1 + \eta^{-B_0}\big)\big(1 - \eta^{B_0}\big)^\kappa.$$
Since $\kappa \ge 1$ and $s + \kappa B_0 \le k < s + (\kappa+1)B_0$, we have

$$\big(1 - \eta^{B_0}\big)^\kappa = \big(1 - \eta^{B_0}\big)^{\kappa+1}\,\frac{1}{1 - \eta^{B_0}} = \big(1 - \eta^{B_0}\big)^{\frac{s + (\kappa+1)B_0 - s}{B_0}}\,\frac{1}{1 - \eta^{B_0}} \le \big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}}\,\frac{1}{1 - \eta^{B_0}},$$

where the last inequality follows from the relations $0 < 1 - \eta^{B_0} < 1$ and $k < s + (\kappa+1)B_0$.

By combining the preceding two relations, we obtain for $k \ge s + B_0$ and any $i, j \in \{1, \dots, m\}$,

$$\Big| [\Phi(k, s)]^j_i - [\phi(s)]_i \Big| \le 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}}. \tag{20}$$
Therefore, we have

$$\lim_{k\to\infty} \Phi(k, s) = \phi(s)\,e' = \bar{\Phi}(s) \quad \text{for every } s,$$

thus showing part (a) of the lemma. Furthermore, we have that all the columns of $\bar{\Phi}(s)$ coincide with the vector $\phi(s)$, which is a stochastic vector by Lemma 3(b). This shows part (b) of the lemma.

Note that relation (20) holds for $k \ge s + B_0$ and any $i, j \in \{1, \dots, m\}$. To prove part (c) of the lemma, we need to show that the estimate of Eq. (20) holds for arbitrary $s$, for $k$ with $s + B_0 > k \ge s$, and any $i, j \in \{1, \dots, m\}$. Thus, let $s$ be arbitrary and let $s + B_0 > k \ge s$. Because the entries of $\Phi(k, s)$ and $\phi(s)$ lie in $[0, 1]$, we have

$$\Big| [\Phi(k, s)]^j_i - [\phi(s)]_i \Big| \le 2 < 2\big(1 + \eta^{-B_0}\big) = 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{s + B_0 - s}{B_0}} < 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}},$$

where the last inequality follows from the relations $0 < 1 - \eta^{B_0} < 1$ and $k < s + B_0$.

From the preceding relation and Eq. (20), it follows that for every $s$ and $i \in \{1, \dots, m\}$,

$$\Big| [\Phi(k, s)]^j_i - [\phi(s)]_i \Big| \le 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}} \quad \text{for all } k \ge s \text{ and } j \in \{1, \dots, m\},$$

thus showing part (c) of the lemma.
The preceding results are shown by following the line of analysis of Tsitsiklis [23] (see Lemma 5.2.1 in [23]; see also Bertsekas and Tsitsiklis [2]). The rate estimate given in part (c) is new and provides the explicit dependence of the convergence of the transition matrices on the system parameters and problem data. This estimate will be essential in providing convergence rate results for the subgradient method of Section 2 [cf. Eq. (5)].
3.2 Limit Vectors $\phi(s)$
The agents' objective is to cooperatively minimize the additive cost function $\sum_{i=1}^m f_i(x)$, while each agent individually performs his own state updates according to the subgradient method of Eq. (5). In order to reach a consensus on the optimal solution of the problem, it is essential that the agents process their individual objective functions with the same frequency in the long run. This will be guaranteed if the limit vectors $\phi(s)$ correspond to the uniform distribution $\frac{1}{m}e$. One way of ensuring this is to have $\phi(s) = \frac{1}{m}e$ for all $s$, which holds when the weight matrices $A(k)$ are doubly stochastic. We formally impose this condition in the following.
Assumption 4 (Doubly Stochastic Weights) Let the weight vectors $a^1(k), \dots, a^m(k)$, $k = 0, 1, \dots$, satisfy the Weights Rule [cf. Assumption 1]. Assume further that the matrices $A(k) = [a^1(k), \dots, a^m(k)]$ are doubly stochastic for all $k$.

Under this and some additional assumptions, we show that all the limit vectors $\phi(s)$ are the same and correspond to the uniform distribution $\frac{1}{m}e$. This is an immediate consequence of Lemma 4, as seen in the following.
Proposition 1 (Uniform Limit Distribution) Let Connectivity, Bounded Intercommunication Interval, and Doubly Stochastic Weights assumptions hold [cf. Assumptions 2, 3, and 4]. We then have:
(a) The limit matrices $\bar{\Phi}(s) = \lim_{k\to\infty} \Phi(k, s)$ are doubly stochastic and correspond to a uniform steady state distribution for all $s$, i.e.,
$$\bar{\Phi}(s) = \frac{1}{m}\,e\,e' \quad \text{for all } s.$$
(b) The entries $[\Phi(k, s)]^j_i$ converge to $\frac{1}{m}$ as $k \to \infty$ with a geometric rate uniformly with respect to $i$ and $j$, i.e., for all $i, j \in \{1, \dots, m\}$,
$$\Big| [\Phi(k, s)]^j_i - \frac{1}{m} \Big| \le 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}} \quad \text{for all } s \text{ and } k \text{ with } k \ge s,$$
where $\eta$ is the lower bound of Assumption 1(a), $B_0 = (m-1)B$, $m$ is the number of agents, and $B$ is the intercommunication interval bound of Assumption 3.
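Proposition 1 can be illustrated numerically. For simplicity, the sketch below reuses a single symmetric ring-graph weight matrix for every $k$; this fixed choice of doubly stochastic weights is an assumption for the demo, not from the paper:

```python
import numpy as np

# Illustration of Proposition 1: with doubly stochastic weights, the products
# Phi(k, s) approach the averaging matrix (1/m) e e' geometrically.  A single
# symmetric ring-graph matrix is reused for every k (an illustrative choice).

m = 5
C = np.roll(np.eye(m), 1, axis=1)                 # cyclic shift permutation
A = 0.5 * np.eye(m) + 0.25 * C + 0.25 * C.T       # symmetric => doubly stochastic

def phi_error(num_factors):
    """Max-entry distance of Phi(k, s) (num_factors factors) from (1/m) e e'."""
    P = np.linalg.matrix_power(A, num_factors)
    return np.abs(P - 1.0 / m).max()

err20 = phi_error(21)   # k - s = 20
err40 = phi_error(41)   # k - s = 40
```

Doubling the number of factors roughly squares the error, in line with the geometric rate in part (b).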
Proof. (a) Since the matrix $A(k)$ is doubly stochastic for all $k$, the matrix $\Phi(k, s)$ [cf. Eqs. (6)-(7)] is also doubly stochastic for all $s$ and $k$ with $k \ge s$. In view of Lemma 4, the limit matrix $\bar{\Phi}(s) = \phi(s)\,e'$ is then also doubly stochastic, implying that $\phi(s) = \frac{1}{m}e$ and hence $\bar{\Phi}(s) = \frac{1}{m}\,e\,e'$ for all $s$.

$\sum_{j=1}^m p^i_j(k) = 1$ for all $i$ and $k$. Furthermore, let the actual weights $a^i_j(k)$, $i, j = 1, \dots, m$, that the agents use in their updates be given by:
(i) $a^i_j(k) = \min\{p^i_j(k),\, p^j_i(k)\}$ when agents $i$ and $j$ communicate during the time interval $(t_k, t_{k+1})$, and $a^i_j(k) = 0$ otherwise;
(ii) $a^i_i(k) = 1 - \sum_{j \ne i} a^i_j(k)$.
The preceding discussion, combined with Lemma 1, yields the following result.
Proposition 2 Let Connectivity, Bounded Intercommunication Interval, Simultaneous Information Exchange, and Symmetric Weights assumptions hold [cf. Assumptions 2, 3, 5, and 6]. We then have:
(a) The limit matrices $\bar{\Phi}(s) = \lim_{k\to\infty} \Phi(k, s)$ are doubly stochastic and correspond to a uniform steady state distribution for all $s$, i.e.,
$$\bar{\Phi}(s) = \frac{1}{m}\,e\,e' \quad \text{for all } s.$$
(b) The entries $[\Phi(k, s)]^j_i$ converge to $\frac{1}{m}$ as $k \to \infty$ with a geometric rate uniformly with respect to $i$ and $j$, i.e., for all $i, j \in \{1, \dots, m\}$,
$$\Big| [\Phi(k, s)]^j_i - \frac{1}{m} \Big| \le 2\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}\,\big(1 - \eta^{B_0}\big)^{\frac{k-s}{B_0}} \quad \text{for all } s \text{ and } k \text{ with } k \ge s,$$
where $\eta$ is the lower bound of the Symmetric Weights Assumption 6, $B_0 = (m-1)B$, $m$ is the number of agents, and $B$ is the intercommunication interval bound of Assumption 3.

Proof. In view of the Uniform Limit Distribution [cf. Proposition 1], it suffices to show that the Simultaneous Information Exchange and Symmetric Weights assumptions [cf. Assumptions 5 and 6] imply that Assumption 4 holds. In particular, we need to show that the actual weights $a^i_j(k)$, $i, j = 1, \dots, m$, satisfy the Weights Rule [cf. Assumption 1], and that the vectors $a^i(k)$, $i = 1, \dots, m$, form a doubly stochastic matrix.

First, note that by the Symmetric Weights assumption, the weights $a^i_j(k)$, $i, j = 1, \dots, m$, satisfy the Weights Rule. Thus, the agent weight vectors $a^i(k)$, $i = 1, \dots, m$, are stochastic, and hence the weight matrix $A'(k)$ is stochastic for any $k$. Since the weights are symmetric, the matrix $A(k)$ is symmetric, and therefore $A(k)$ is doubly stochastic for all $k$.
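The weight construction (i)-(ii) above can be sketched as follows; the planned weights $p^i_j(k)$ below are illustrative values, not from the paper:

```python
import numpy as np

# Sketch of the symmetric weight rule (i)-(ii): given planned weights
# p^i_j(k) with rows summing to 1, the realized weights take the pairwise
# minimum, and the leftover mass goes to the self-weight.

def symmetric_weights(P, contacts):
    """contacts[i, j] is True when agents i and j communicated in (t_k, t_{k+1})."""
    m = P.shape[0]
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and contacts[i, j]:
                A[i, j] = min(P[i, j], P[j, i])   # rule (i)
        A[i, i] = 1.0 - A[i].sum()                # rule (ii)
    return A

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])                   # illustrative planned weights
A = symmetric_weights(P, np.ones((3, 3), dtype=bool))
```

The result is symmetric with nonnegative entries and unit row sums, hence doubly stochastic, which is exactly what the proof of Proposition 2 requires.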
4 Convergence of the Subgradient Method
Here, we study the convergence behavior of the subgradient method introduced in Section 2. In particular, we have shown that the iterates of the method satisfy the following relation: for any $i \in \{1, \dots, m\}$, and $s$ and $k$ with $k \ge s$,

$$x^i(k+1) = \sum_{j=1}^m [\Phi(k, s)]^i_j\, x^j(s) - \sum_{r=s+1}^{k} \Big( \sum_{j=1}^m [\Phi(k, r)]^i_j\, \alpha^j(r-1)\, d^j(r-1) \Big) - \alpha^i(k)\, d^i(k)$$

[cf. Eq. (5)]. We analyze this model under the symmetric weights assumption (cf. Assumption 6). Also, we consider the case of a constant stepsize that is common to all agents, i.e., $\alpha^j(r) = \alpha$ for all $r$ and all agents $j$, so that the model reduces to the following: for any $i \in \{1, \dots, m\}$, and $s$ and $k$ with $k \ge s$,

$$x^i(k+1) = \sum_{j=1}^m [\Phi(k, s)]^i_j\, x^j(s) - \alpha \sum_{r=s+1}^{k} \Big( \sum_{j=1}^m [\Phi(k, r)]^i_j\, d^j(r-1) \Big) - \alpha\, d^i(k). \tag{21}$$
To analyze this model, we consider a related stopped model whereby the agents stop computing the subgradients $d^j(k)$ at some time, but keep exchanging their information and updating their estimates using only the weights for the rest of the time. To describe the stopped model, we use relation (21) with $s = 0$, from which we obtain

$$x^i(k+1) = \sum_{j=1}^m [\Phi(k, 0)]^i_j\, x^j(0) - \alpha \sum_{s=1}^{k} \Big( \sum_{j=1}^m [\Phi(k, s)]^i_j\, d^j(s-1) \Big) - \alpha\, d^i(k). \tag{22}$$

Suppose that the agents cease computing $d^j(k)$ after some time $t_{\bar k}$, so that

$$d^j(k) = 0 \quad \text{for all } j \text{ and all } k \text{ with } k \ge \bar k.$$

Let $\{\tilde x^i(k)\}$, $i = 1, \dots, m$, be the sequences of the estimates generated by the agents in this case. Then, from relation (22) we have for all $i$,

$$\tilde x^i(k) = x^i(k) \quad \text{for all } k \le \bar k,$$

and for $k > \bar k$,

$$\tilde x^i(k) = \sum_{j=1}^m [\Phi(k-1, 0)]^i_j\, x^j(0) - \alpha \sum_{s=1}^{\bar k} \Big( \sum_{j=1}^m [\Phi(k-1, s)]^i_j\, d^j(s-1) \Big),$$

since all the subgradient terms with indices $k \ge \bar k$ vanish. By letting $k \to \infty$ and by using Proposition 2(b), we see that the limit $\lim_{k\to\infty} \tilde x^i(k)$ exists. Furthermore, the limit vector does not depend on $i$, but does depend on $\bar k$. We denote this limit by $y(\bar k)$, i.e.,

$$\lim_{k\to\infty} \tilde x^i(k) = y(\bar k),$$

for which, by Proposition 2(a), we have

$$y(\bar k) = \frac{1}{m} \sum_{j=1}^m x^j(0) - \alpha \sum_{s=1}^{\bar k} \Big( \sum_{j=1}^m \frac{1}{m}\, d^j(s-1) \Big).$$

Note that this relation holds for any $\bar k$, so we may re-index these relations by using $k$, and thus obtain

$$y(k+1) = y(k) - \frac{\alpha}{m} \sum_{j=1}^m d^j(k) \quad \text{for all } k. \tag{23}$$
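Iteration (23) is an ordinary subgradient recursion on the network-wide average. In the sketch below the subgradients are, for simplicity, evaluated at $y(k)$ itself rather than at the agent estimates $x^j(k)$ (the paper's method uses the latter, which is why it is an approximate subgradient method); the costs $f_j(x) = |x - a_j|$ are an illustrative assumption:

```python
import numpy as np

# Sketch of iteration (23): a subgradient step on the network-wide average,
# y(k+1) = y(k) - (alpha/m) * sum_j d^j(k).  Subgradients are evaluated at
# y(k) itself for simplicity; costs f_j(x) = |x - a_j| are illustrative.

def average_iteration(targets, alpha=0.005, iters=4000, y0=0.0):
    m = len(targets)
    y = y0
    for _ in range(iters):
        d = np.sign(y - targets)          # one subgradient per agent cost
        y = y - (alpha / m) * d.sum()     # iteration (23)
    return y

targets = np.array([0.0, 1.0, 5.0])       # sum_j |x - a_j| is minimized at 1.0
y_final = average_iteration(targets)
```

The iterate approaches the minimizer of $\sum_j f_j$ up to a stepsize-dependent error, consistent with the constant-stepsize analysis below.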
Recall that the vector $d^j(k)$ is a subgradient of the agent $j$ objective function $f_j(x)$ at $x = x^j(k)$. Thus, the preceding iteration can be viewed as an iteration of an approximate subgradient method. Specifically, for each $j$, the method uses a subgradient of $f_j$ at the estimate $x^j(k)$ approximating the vector $y(k)$ [instead of a subgradient of $f_j(x)$ at $x = y(k)$].

We start with a lemma providing some basic relations used in the analysis of subgradient methods. Similar relations have been used in various ways to analyze subgradient approaches (for example, see Shor [21], Polyak [20], Nedic and Bertsekas [12], [13], and Nedic, Bertsekas, and Borkar [14]). In the following lemma and thereafter, we use the notation $f(x) = \sum_{i=1}^m f_i(x)$.
Lemma 5 (Basic Iterate Relation) Let the sequence $\{y(k)\}$ be generated by the iteration (23), and the sequences $\{x^j(k)\}$ for $j \in \{1, \dots, m\}$ be generated by the iteration (22). Let $\{g^j(k)\}$ be a sequence of subgradients such that $g^j(k)$ is a subgradient of $f_j$ at $y(k)$ for all $j \in \{1, \dots, m\}$ and $k \ge 0$. We then have:
(a) For any $x \in \mathbb{R}^n$ and all $k \ge 0$,
$$\|y(k+1) - x\|^2 \le \|y(k) - x\|^2 + \frac{2\alpha}{m} \sum_{j=1}^m \big( \|d^j(k)\| + \|g^j(k)\| \big)\, \|y(k) - x^j(k)\| - \frac{2\alpha}{m}\,\big[f(y(k)) - f(x)\big] + \frac{\alpha^2}{m^2} \sum_{j=1}^m \|d^j(k)\|^2.$$
(b) When the optimal solution set $X^*$ is nonempty, we have for all $k \ge 0$,
$$\mathrm{dist}^2(y(k+1), X^*) \le \mathrm{dist}^2(y(k), X^*) + \frac{2\alpha}{m} \sum_{j=1}^m \big( \|d^j(k)\| + \|g^j(k)\| \big)\, \|y(k) - x^j(k)\| - \frac{2\alpha}{m}\,\big[f(y(k)) - f^*\big] + \frac{\alpha^2}{m^2} \sum_{j=1}^m \|d^j(k)\|^2.$$
Proof. From relation (23) we obtain for any $x \in \mathbb{R}^n$ and all $k \ge 0$,

$$\|y(k+1) - x\|^2 = \Big\| y(k) - \frac{\alpha}{m} \sum_{j=1}^m d^j(k) - x \Big\|^2,$$

implying that

$$\|y(k+1) - x\|^2 \le \|y(k) - x\|^2 - \frac{2\alpha}{m} \sum_{j=1}^m d^j(k)'\,(y(k) - x) + \frac{\alpha^2}{m^2} \sum_{j=1}^m \|d^j(k)\|^2. \tag{24}$$

We now consider the term $d^j(k)'(y(k) - x)$. We have

$$d^j(k)'(y(k) - x) = d^j(k)'(y(k) - x^j(k)) + d^j(k)'(x^j(k) - x) \ge -\|d^j(k)\|\, \|y(k) - x^j(k)\| + d^j(k)'(x^j(k) - x).$$

Since $d^j(k)$ is a subgradient of $f_j$ at $x^j(k)$ [cf. Eq. (1)], we further have for any $j$ and any $x \in \mathbb{R}^n$,

$$d^j(k)'(x^j(k) - x) \ge f_j(x^j(k)) - f_j(x).$$

Furthermore, by using a subgradient $g^j(k)$ of $f_j$ at $y(k)$ [cf. Eq. (1)], we also have for any $j$ and $x \in \mathbb{R}^n$,

$$f_j(x^j(k)) - f_j(x) = f_j(x^j(k)) - f_j(y(k)) + f_j(y(k)) - f_j(x) \ge g^j(k)'(x^j(k) - y(k)) + f_j(y(k)) - f_j(x)$$
$$\ge -\|g^j(k)\|\, \|x^j(k) - y(k)\| + f_j(y(k)) - f_j(x).$$

By combining the preceding three relations, it follows that for any $j$ and $x \in \mathbb{R}^n$,

$$d^j(k)'(y(k) - x) \ge -\big( \|d^j(k)\| + \|g^j(k)\| \big)\, \|y(k) - x^j(k)\| + f_j(y(k)) - f_j(x).$$

Summing this relation over all $j$, we obtain

$$\sum_{j=1}^m d^j(k)'(y(k) - x) \ge -\sum_{j=1}^m \big( \|d^j(k)\| + \|g^j(k)\| \big)\, \|y(k) - x^j(k)\| + f(y(k)) - f(x).$$

By combining the preceding inequality with Eq. (24), the relation in part (a) follows, i.e., for all $x \in \mathbb{R}^n$ and all $k \ge 0$,

$$\|y(k+1) - x\|^2 \le \|y(k) - x\|^2 + \frac{2\alpha}{m} \sum_{j=1}^m \big( \|d^j(k)\| + \|g^j(k)\| \big)\, \|y(k) - x^j(k)\| - \frac{2\alpha}{m}\,\big[f(y(k)) - f(x)\big] + \frac{\alpha^2}{m^2} \sum_{j=1}^m \|d^j(k)\|^2.$$

The relation in part (b) follows by letting $x$ be the projection of $y(k)$ on $X^*$, which is well defined since the set $X^*$ is nonempty.
Our main convergence results are given in the following proposition. In particular, we provide a uniform bound on the norm of the difference between $y(k)$ and $x^i(k)$ that holds for all $i \in \{1, \dots, m\}$ and all $k \ge 0$. We also consider the averaged vectors $\hat y(k)$ and $\hat x^i(k)$ defined for all $k \ge 1$ as follows:

$$\hat y(k) = \frac{1}{k} \sum_{h=0}^{k-1} y(h), \qquad \hat x^i(k) = \frac{1}{k} \sum_{h=0}^{k-1} x^i(h) \quad \text{for all } i \in \{1, \dots, m\}. \tag{25}$$

We provide upper bounds on the objective function value of the averaged vectors. Note that averaging allows us to provide our estimates per iteration.³
Proposition 3 Let Connectivity, Bounded Intercommunication Interval, Simultaneous Information Exchange, and Symmetric Weights assumptions hold [cf. Assumptions 2, 3, 5, and 6]. Let the Bounded Subgradients and Nonempty Optimal Set assumptions hold [cf. Assumptions 7 and 8]. Let $x^j(0)$ denote the initial vector of agent $j$ and assume that

$$\max_{1\le j\le m} \|x^j(0)\| \le \alpha L.$$

Let the sequence $\{y(k)\}$ be generated by the iteration (23), and let the sequences $\{x^i(k)\}$ be generated by the iteration (22). We then have:
(a) For every $i \in \{1, \dots, m\}$, a uniform upper bound on $\|y(k) - x^i(k)\|$ is given by:
$$\|y(k) - x^i(k)\| \le 2\alpha L C_1 \quad \text{for all } k \ge 0, \qquad C_1 = 1 + \frac{m}{1 - \big(1 - \eta^{B_0}\big)^{\frac{1}{B_0}}}\;\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}}.$$
(b) Let $\hat y(k)$ and $\hat x^i(k)$ be the averaged vectors of Eq. (25). An upper bound on the objective cost $f(\hat y(k))$ is given by:
$$f(\hat y(k)) \le f^* + \frac{m\,\mathrm{dist}^2(y(0), X^*)}{2\alpha k} + \frac{\alpha L^2 C}{2} \quad \text{for all } k \ge 1.$$
When there are subgradients $g_{ij}(k)$ of $f_j$ at $\hat x^i(k)$ that are bounded uniformly by some constant $\hat L_1$, an upper bound on the objective value $f(\hat x^i(k))$ for each $i$ is given by:
$$f(\hat x^i(k)) \le f^* + \frac{m\,\mathrm{dist}^2(y(0), X^*)}{2\alpha k} + \alpha L \Big( \frac{LC}{2} + 2m \hat L_1 C_1 \Big) \quad \text{for all } k \ge 1,$$
where $L$ is the subgradient norm bound of Assumption 7, $y(0) = \frac{1}{m}\sum_{j=1}^m x^j(0)$, and $C = 1 + 8mC_1$. The constant $B_0$ is given by $B_0 = (m-1)B$, and $B$ is the intercommunication interval bound of Assumption 3.

³ See also our recent work [15] which uses averaging to generate approximate primal solutions with convergence rate estimates for dual subgradient methods.
Proof. (a) From Eq. (23) it follows that $y(k) = y(0) - \frac{\alpha}{m}\sum_{s=0}^{k-1}\sum_{j=1}^m d^j(s)$ for all $k \ge 1$. Using this relation, the relation $y(0) = \frac{1}{m}\sum_{j=1}^m x^j(0)$, and Eq. (22), we obtain for all $k \ge 1$ and $i \in \{1, \dots, m\}$,

$$y(k) - x^i(k) = \sum_{j=1}^m x^j(0)\Big( \frac{1}{m} - [\Phi(k-1, 0)]^i_j \Big) - \alpha \sum_{s=1}^{k-1} \sum_{j=1}^m \Big( \frac{1}{m} - [\Phi(k-1, s)]^i_j \Big) d^j(s-1) - \alpha \Big( \frac{1}{m} \sum_{j=1}^m d^j(k-1) - d^i(k-1) \Big).$$

Therefore, for all $k \ge 1$ and $i \in \{1, \dots, m\}$,

$$\|y(k) - x^i(k)\| \le \max_{1\le j\le m} \|x^j(0)\| \sum_{j=1}^m \Big| \frac{1}{m} - [\Phi(k-1, 0)]^i_j \Big| + \alpha \sum_{s=1}^{k-1} \sum_{j=1}^m \|d^j(s-1)\| \Big| \frac{1}{m} - [\Phi(k-1, s)]^i_j \Big| + \frac{\alpha}{m} \sum_{j=1}^m \|d^j(k-1)\| + \alpha \|d^i(k-1)\|.$$

Using the estimates of Proposition 2(b) for the terms $\big| \frac{1}{m} - [\Phi(k-1, s)]^i_j \big|$, together with the subgradient boundedness $\|d^j(k)\| \le L$ and the assumption $\max_{1\le j\le m} \|x^j(0)\| \le \alpha L$, we obtain

$$\|y(k) - x^i(k)\| \le 2\alpha L m\,\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}} \sum_{s=0}^{k-1} \big(1 - \eta^{B_0}\big)^{\frac{k-1-s}{B_0}} + 2\alpha L \le 2\alpha L \Bigg( 1 + \frac{m}{1 - \big(1 - \eta^{B_0}\big)^{\frac{1}{B_0}}}\;\frac{1 + \eta^{-B_0}}{1 - \eta^{B_0}} \Bigg).$$
(b) By using Lemma 5(b) and the subgradient boundedness [cf. Assumption 7], we have for all $k \ge 0$,

$$\mathrm{dist}^2(y(k+1), X^*) \le \mathrm{dist}^2(y(k), X^*) + \frac{4\alpha L}{m} \sum_{j=1}^m \|y(k) - x^j(k)\| - \frac{2\alpha}{m}\,\big[f(y(k)) - f^*\big] + \frac{\alpha^2 L^2}{m}.$$

Using the estimate of part (a), we obtain for all $k \ge 0$,

$$\mathrm{dist}^2(y(k+1), X^*) \le \mathrm{dist}^2(y(k), X^*) + \frac{4\alpha L}{m}\,\big(2m\alpha L C_1\big) - \frac{2\alpha}{m}\,\big[f(y(k)) - f^*\big] + \frac{\alpha^2 L^2}{m} = \mathrm{dist}^2(y(k), X^*) + \frac{\alpha^2 L^2 C}{m} - \frac{2\alpha}{m}\,\big[f(y(k)) - f^*\big],$$

where $C = 1 + 8mC_1$. Therefore, we have

$$f(y(k)) - f^* \le \frac{\mathrm{dist}^2(y(k), X^*) - \mathrm{dist}^2(y(k+1), X^*)}{2\alpha/m} + \frac{\alpha L^2 C}{2} \quad \text{for all } k \ge 0.$$

By summing the preceding relations over $0, \dots, k-1$ and dividing the sum by $k$, we have for any $k \ge 1$,

$$\frac{1}{k} \sum_{h=0}^{k-1} f(y(h)) - f^* \le \frac{\mathrm{dist}^2(y(0), X^*) - \mathrm{dist}^2(y(k), X^*)}{2\alpha k/m} + \frac{\alpha L^2 C}{2} \le \frac{\mathrm{dist}^2(y(0), X^*)}{2\alpha k/m} + \frac{\alpha L^2 C}{2}. \tag{26}$$

By the convexity of the function $f$, we have

$$\frac{1}{k} \sum_{h=0}^{k-1} f(y(h)) \ge f(\hat y(k)), \qquad \text{where } \hat y(k) = \frac{1}{k} \sum_{h=0}^{k-1} y(h).$$

Therefore, by using the relation in (26), we obtain

$$f(\hat y(k)) \le f^* + \frac{m\,\mathrm{dist}^2(y(0), X^*)}{2\alpha k} + \frac{\alpha L^2 C}{2} \quad \text{for all } k \ge 1. \tag{27}$$
for all k 1. (27)
We now show the estimate for f( x
i
(k)). By the subgradient denition, we have
f( x
i
(k)) f( y(k)) +
m
j=1
g
ij
(k)
( x
i
(k) y(k)) for all i {1, . . . , m} and k 1,
where g
ij
(k) is a subgradient of f
j
at x
i
(k). Since g
ij
(k)
L
1
for all i, j {1, . . . , m},
and k 1, it follows that
f( x
i
(k)) f( y(k)) + m
L
1
x
i
(k) y(k).
Using the estimate in part (a), the relation x
i
(k) y(k)
k1
l=0
x
i
(l) y(l)/k, and
Eq. (27), we obtain for all i {1, . . . , m} and k 1,
f( x
i
(k)) f
+
m dist
2
(y(0), X
)
2k
+
L
2
C
2
+ 2m
L
1
LC
1
.
Part (a) of the preceding proposition shows that the error between y(k) and x
i
(k)
for all i is bounded from above by a constant that is proportional to the stepsize , i.e.,
by picking a smaller stepsize in the subgradient method, one can guarantee a smaller
error between the vectors y(k) and x
i
(k) for all i {1, . . . , m} and all k 0. In part
(b) of the proposition, we provide upper bounds on the objective function values of the
averaged-vectors y(k) and x
i
(k). The upper bounds on f( x
i
(k)) provide estimates for
the error from the optimal value f
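The stepsize-proportional disagreement described above can be observed empirically: reducing $\alpha$ by a factor of ten roughly reduces the steady disagreement between agent estimates and their average by the same factor. The fixed doubly stochastic weights and costs $f_i(x) = |x - a_i|$ below are illustrative assumptions:

```python
import numpy as np

# Empirical look at Proposition 3(a): the disagreement between the agent
# estimates and their average is proportional to the stepsize alpha.

def final_disagreement(alpha, iters=3000):
    A = np.array([[0.50, 0.25, 0.25],
                  [0.25, 0.50, 0.25],
                  [0.25, 0.25, 0.50]])          # doubly stochastic weights
    targets = np.array([0.0, 1.0, 5.0])         # costs f_i(x) = |x - a_i|
    x = np.zeros(3)
    for _ in range(iters):
        x = A @ x - alpha * np.sign(x - targets)  # update (3), constant stepsize
    return np.abs(x - x.mean()).max()

d_big = final_disagreement(0.02)
d_small = final_disagreement(0.002)
```

This mirrors the tradeoff quantified in the proposition: a smaller stepsize tightens consensus (and the error floor) at the cost of slower progress toward the optimum.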