Clustering in Distributed Incremental Estimation in Wireless Sensor Networks
Abstract
Energy efficiency, low latency, high estimation accuracy, and fast convergence are important goals in
distributed incremental estimation algorithms for sensor networks. One approach that adds flexibility in
achieving these goals is clustering. In this paper, the framework of distributed incremental estimation is
extended by allowing clustering amongst the nodes. Among the observations made is that a scaling law
exists where the estimation accuracy increases proportionally with the number of clusters. The distributed
parameter estimation problem is posed as a convex optimization problem involving a social cost function
and data from the sensor nodes. An in-cluster algorithm is then derived using the incremental subgradient
method. Sensors in each cluster successively update a cluster parameter estimate based on local data,
which is then passed on to a fusion center for further processing. We prove convergence results for the
distributed in-cluster algorithm, and provide simulations that demonstrate the benefits of clustering for least-squares and robust estimation.
Index Terms
Distributed estimation, optimization, incremental subgradient method, clustering, wireless sensor networks.
This research was supported in part by the ONR under grant N00014-03-1-0290, the Army Research Office under grant DAAD19-
00-1-0466, Draper Laboratory under IR&D 6002 grant DL-H-546263, and the National Science Foundation under grant CCR-
0312413. Portions of this work were presented at the 2005 IEEE International Conference on Wireless Networks, Communications and Mobile Computing.
I. INTRODUCTION
A wireless sensor network (WSN) is comprised of a fusion center and a set of geographically
distributed sensor nodes. The fusion center provides a central point in the network to consolidate
the sensor data while the sensor nodes collect information about a state of nature. In our scenario,
the fundamental objective of a WSN is to reconstruct that state of nature, e.g., estimation of a
parameter, given the sensor observations. Depending on the application and the resources of the
WSN, many possible algorithms exist that solve this parameter estimation problem.
One failsafe approach that accomplishes this objective is the centralized approach in which all
sensor nodes send their observations to the fusion center and allow the fusion center to make the
parameter estimation. The centralized scheme allows the most information to be present when
making the inference. However, the main drawback is the drainage of energy resources from each
sensor of the WSN [1]. In an energy constrained WSN, the energy expenditure of transmitting all
the observations to the fusion center might be too costly, thus making the method highly energy
inefficient. In our application, the purpose of a WSN is to make an inference, not collect all the
sensor observations.
Another approach avoids the fusion center altogether and allows the sensors to collaboratively
make the inference. This approach is referred to as the distributed in-network scheme, recently
proposed by Rabbat and Nowak [2]. First, consider a path that passes through all the sensor nodes
and visits each node only once. The path hops from one neighbor node to another until all the
sensor nodes are covered. Instead of passing the data along the sensor node path, a parameter esti-
mate is passed from node to node. As the parameter estimate passes through each node, each node
updates the parameter estimate with its own local observations. The distributed in-network ap-
proach significantly reduces the transmission energy for communication required by the network.
However, this approach has drawbacks in terms of latency, accuracy, and convergence. While the
centralized approach takes one iteration to have access to all data, the distributed approach takes
n iterations to have seen all the data captured by the network, where n is the number of sensors.
Also, the parameter estimate of the distributed in-network algorithm is less accurate when compared to the parameter estimate of the centralized algorithm. In terms of the number of iterations,
the distributed in-network scheme converges slower than the centralized scheme. The distributed
in-network scheme remedies the issue of energy inefficiency, but suffers in terms of these other
performance parameters.
In this paper, we consider a hybrid form of the two aforementioned approaches. While the
former approach relies heavily on the fusion center and the latter approach eliminates the fusion
center altogether, we allow the fusion center to minimally interact with the sensor nodes. We
formulate a distributed in-cluster approach where the nodes are clustered and there exists a path
within each cluster that passes through each node only once as shown in Fig. 1. While the precise
mathematical formulation of the algorithm is stated in Section IV, roughly speaking, each cluster
operates similarly to the distributed in-network scheme. Within each cluster, every sensor node
updates its own parameter estimate based on its own local observations. Hence, each cluster has
its own parameter estimate. The sensor node that initiates the algorithm is designated to be the
cluster head. After completion of all the iterations within each cluster, the parameter estimate of
each cluster is then passed to the fusion center and averaged. Then, the fusion center announces
the average parameter value back to the cluster heads to start another set of cluster iterations if
necessary.
The purpose of clustering is to address the inherent inflexibility of both the centralized and the
in-network algorithms. For example, if the WSN application calls for the most accurate estimate
regardless of the communication costs, then the centralized algorithm would suffice. If the WSN
demands the most energy efficient algorithm irrespective of the other performance parameters like
latency, accuracy, and convergence speed, then the distributed in-network algorithm would be most
suitable. However, given a WSN application with specific accuracy demands or energy constraints,
are we able to develop an algorithm that is tailored to those desired performance levels? With the
distributed in-cluster algorithm, it is more feasible since the number of clusters, or equivalently
the size of the clusters, adds another dimension to the algorithm development process.
Throughout the rest of the paper, we consider the following criteria in comparing the distributed estimation algorithms:
• Energy efficiency
• Latency
• Estimation accuracy
We show that the proposed distributed in-cluster algorithm adds a flexible tradeoff among all
the aforementioned criteria. Specifically, due to clustering, we are able to control the estimation
accuracy since the residual error scales as a function of the number of clusters. The inclusion
of clusters improves the scaling behavior of the estimation accuracy and latency. We use the
centralized and the distributed in-network algorithms as extreme cases of maximal and minimal energy usage, respectively. For the special case where the WSN has √n clusters with each cluster having √n sensors, we show that the transport cost of the distributed in-cluster algorithm has the same order of magnitude as the distributed in-network algorithm. However, the latency and accuracy improve by a factor of √n under this specific clustering situation.
The organization of the paper is as follows. Previous work in the areas of distributed incremental
estimation and clustering is discussed in Section II. We formulate the problem and provide two
concrete applications in Section III. The distributed in-cluster algorithm is precisely formulated
in Section IV and convergence analysis is discussed in Section V. Section VI surveys the benefits
of clustering on the performance parameters. Analytical results are verified in Section VII by
simulations that involve two applications, least-squares and robust estimation, followed by the conclusion in Section VIII.
II. PREVIOUS WORK

The ideas of incremental subgradient methods and clustering applied to distributed estimation
are quite prevalent in the literature. Incremental subgradient methods were first studied by Kibardin
[3] and then, more recently, by Nedić and Bertsekas [4]. Then, Rabbat and Nowak [2] applied
the framework used in [4] to handle the issue of energy consumption in distributed estimation
in WSNs. To further save on energy expenditure, Rabbat and Nowak [5] also implemented a quantization scheme for distributed estimation, in which they showed that quantization does not significantly degrade the performance of the distributed estimation algorithm.
Clustering schemes have been implemented throughout the area of WSNs to provide a hierarchical structure that minimizes the energy spent in the system (see [6] and references therein). The purpose of clustering is either to minimize the number of hops the data needs to arrive at a destination or to provide a fusion point within the network to consolidate the amount of data sent. In our work, we establish a scaling law for estimation accuracy and latency in relation to the number of clusters.
III. PROBLEM FORMULATION

Consider a WSN with n sensors, each taking m measurements. The parameter
estimation objective of the WSN can be viewed as a convex optimization problem if the distortion
measure between the parameter estimate and the data is convex. The problem is
    minimize    f(x, θ)
    subject to  θ ∈ Θ                                                    (1)

where f : R^{nm+1} → R is a convex cost function, x ∈ R^{nm} is a vector of all the observations collected by the WSN, θ is a scalar, and Θ is a nonempty, closed and convex subset of R.
One method to decompose Problem (1) into a distributed optimization problem is to assume
that the cost function has an additive structure. The additive property states that the social cost
function given all the WSN data, f (x, θ), can be expressed as the normalized sum of individual
cost functions given only individual sensor data, fi (xi , θ). Hence, the problem becomes
    minimize    f(x, θ) = (1/n) Σ_{i=1}^{n} fi(xi, θ)
    subject to  θ ∈ Θ                                                    (2)

where fi(xi, θ) : R^{m+1} → R is a convex local cost function for the ith sensor only using its own measurement data xi ∈ R^m.
Although the additive property does not hold for general cost functions, two important appli-
cations in estimation satisfy this property: least-squares estimation and robust estimation with the Huber loss function.
A. Least-Squares Estimation

The simplest estimation procedure is least-squares estimation. For the classical least-squares estimation problem, the distortion measure is f(x, θ) = ||x − θ1||², where ||·|| is the Euclidean norm. Clearly, the least-squares distortion measure is convex and additive. Hence, the optimization problem becomes
    minimize    (1/n) Σ_{i=1}^{n} { (1/m) Σ_{p=1}^{m} ||xi^p − θ||² }
    subject to  θ ∈ Θ                                                    (3)

where fi(xi, θ) = (1/m) Σ_{p=1}^{m} ||xi^p − θ||² and xi^p denotes the pth entry in the vector of observations
from sensor node i. The beauty of least-squares estimation lies in its simplicity, but the technique is
prone to suffer greatly in terms of accuracy if some of the measurements are not as accurate as the
others. If some measurements have higher variances than other measurements, the least-squares
inference procedure does not take this effect into account. Thus, the least squares procedure is
highly sensitive to large deviations. To make the inference procedure more robust to these types of measurements, we turn to robust estimation.
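For concreteness, the following is a minimal sketch (in Python, with assumed synthetic measurements and illustrative names, not part of the formulation above) of the local least-squares cost fi(xi, θ) and its gradient for a single sensor:

```python
import numpy as np

def local_ls_cost(x_i, theta):
    """Local least-squares cost f_i(x_i, theta) = (1/m) * sum_p (x_i[p] - theta)**2."""
    return np.mean((x_i - theta) ** 2)

def local_ls_gradient(x_i, theta):
    """Gradient of f_i with respect to the scalar parameter theta."""
    return np.mean(2.0 * (theta - x_i))

# Example with assumed synthetic measurements from one sensor.
x_i = np.array([9.8, 10.1, 10.3, 9.9])
print(local_ls_cost(x_i, theta=10.0), local_ls_gradient(x_i, theta=10.0))
```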
B. Robust Estimation
Another practical application is the robust estimation problem, which has the following form,
    minimize    (1/n) Σ_{i=1}^{n} { (1/m) Σ_{p=1}^{m} fH(xi^p, θ) }
    subject to  θ ∈ Θ                                                    (4)
where

    fH(xi^p, θ) = { ||xi^p − θ||²/2,          if ||xi^p − θ|| ≤ γ
                  { γ||xi^p − θ|| − γ²/2,     if ||xi^p − θ|| > γ         (5)

and γ ≥ 0 is the Huber loss function constant [7]. Note, fi(xi, θ) = (1/m) Σ_{p=1}^{m} fH(xi^p, θ). The
purpose behind robust estimation is to introduce a new distortion measure that puts more weight
on good measurements, and less weight on, or even discards, bad measurements. The parameter
γ sets the threshold for the measurement values around the parameter estimate θ, i.e., the values
within a γ–range of θ are considered good measurements and the values outside a γ–range of θ are considered bad measurements.
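A minimal sketch of the Huber loss fH and its derivative with respect to θ (the scalar-parameter setting and the function names are illustrative assumptions):

```python
import numpy as np

def huber_loss(x_p, theta, gamma=1.0):
    """Huber loss f_H(x_p, theta): quadratic within gamma of theta, linear outside."""
    r = abs(x_p - theta)
    if r <= gamma:
        return 0.5 * r ** 2
    return gamma * r - 0.5 * gamma ** 2

def huber_derivative(x_p, theta, gamma=1.0):
    """Derivative of f_H with respect to theta; its magnitude never exceeds gamma."""
    return np.clip(theta - x_p, -gamma, gamma)

# A measurement far from theta contributes only a bounded (linear) penalty.
print(huber_loss(25.0, theta=10.0, gamma=1.0), huber_derivative(25.0, theta=10.0, gamma=1.0))
```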
To solve a convex optimization problem like (1), the most common method used is a gradient
descent method. Given any starting point θ̂ ∈ dom f, update θ̂ by descending along the gradient:

    θ̂ := θ̂ + α ∆θ̂,                                                      (6)

where α is the step size and ∆θ̂ = −∇f. The convexity of the function f guarantees that a
local minimum will be a global minimum. However, if the function f is not differentiable, then a
subgradient can be used. A subgradient of any convex function f (x) at a point y is any vector g
such that f (x) ≥ f (y) + (x − y)T g, ∀x. For a differentiable function, the subgradient is just the
gradient.
Along with convexity, if the cost function has an additive structure, a variant of the subgradient
method can be used. This method is called the incremental subgradient method [8]. The key idea
of the incremental subgradient algorithm is to sequentially take steps along the subgradients of the
marginal function fi (xi , θ) instead of taking one large step along the subgradient of f (x, θ). In
doing so, the parameter estimate can be adjusted by the subgradient of each individual cost function
given only its individual observations. Following this procedure leads to the distributed in-network
algorithm [2]. The convergence results for the algorithm follow directly from the incremental
subgradient method.
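A minimal sketch of one such incremental cycle for the in-network case (a single path through all nodes; the least-squares local costs, step size, and synthetic data below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = [rng.normal(10.0, 1.0, size=10) for _ in range(20)]  # one measurement array per sensor

def local_subgradient(x_i, theta):
    # Subgradient (here a gradient) of the local least-squares cost f_i at theta.
    return np.mean(2.0 * (theta - x_i))

theta = 0.0          # estimate passed along the path
alpha = 0.4          # constant step size
n = len(data)
for _ in range(50):                 # cycles over the sensor path
    for x_i in data:                # each node updates the estimate and forwards it
        theta -= (alpha / n) * local_subgradient(x_i, theta)
print(theta)
```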
IV. DISTRIBUTED IN-CLUSTER ALGORITHM

We now describe the in-cluster algorithm. Consider a WSN with nC clusters and nS sensors per cluster, where nS · nC = n. Assume that nS and nC are factors of n. Note that the distributed in-network algorithm can be viewed as a special case of the distributed in-cluster algorithm where nC = 1 and nS = n.
We use i to index sensor nodes, j to index clusters, and k to index the iteration number.
Let i = 0 represent the cluster head. Let fi,j (xi,j , θ) and φi,j,k denote the local cost function and
the parameter estimate at node i in cluster j during iteration k, respectively. For conciseness, we
suppress the dependency of f on the parameters in the notation and let fi,j (xi,j , θ) = fi,j (θ) and
f(x, θ) = f(θ). Also, let θk be the estimate maintained by the fusion center during iteration k. The distributed in-cluster algorithm proceeds as follows:
1) Fusion center passes the current estimate θk to the cluster heads in all clusters.
2) Incremental update is conducted in parallel in all the clusters. Within each cluster,
the updates are conducted through update paths that traverse all the nodes in each
cluster:
        φi,j,k = φi−1,j,k − (αk/n) gi,j,k,                               (7)

    where αk is the step size, φ0,j,k = θk, and gi,j,k is a subgradient of fi,j using the last
    estimate φi−1,j,k and the local measurement data xi,j. This is denoted as gi,j,k ∈ ∂fi,j(φi−1,j,k).
    3) All clusters pass the last in-cluster estimate φnS,j,k to the fusion center, which takes
    the average to produce the next estimate θk+1 = (1/nC) Σ_{j=1}^{nC} φnS,j,k.
4) Repeat.
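A minimal sketch of steps 1-4 on synthetic data (least-squares local costs are assumed, and names such as run_cluster_cycle are illustrative, not part of the algorithm specification):

```python
import numpy as np

rng = np.random.default_rng(1)
n_C, n_S, m = 4, 25, 10                  # clusters, sensors per cluster, measurements per sensor
n = n_C * n_S
clusters = [[rng.normal(10.0, 1.0, size=m) for _ in range(n_S)] for _ in range(n_C)]

def subgrad(x_ij, phi):
    # Subgradient of the local least-squares cost f_{i,j} at phi.
    return np.mean(2.0 * (phi - x_ij))

def run_cluster_cycle(cluster_data, theta_k, alpha, n):
    """Step 2: incremental updates along the path inside one cluster, starting from theta_k."""
    phi = theta_k                         # cluster head initializes phi_{0,j,k} = theta_k
    for x_ij in cluster_data:             # visit each node in the cluster exactly once
        phi -= (alpha / n) * subgrad(x_ij, phi)
    return phi                            # phi_{n_S,j,k}, returned to the fusion center

theta, alpha = 0.0, 0.4
for k in range(50):
    # Steps 1-3: broadcast theta_k, run all clusters (in parallel in practice), then average.
    estimates = [run_cluster_cycle(c, theta, alpha, n) for c in clusters]
    theta = float(np.mean(estimates))     # theta_{k+1}
print(theta)
```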
In step 3 of the distributed in-cluster algorithm, the fusion center may process the in-cluster estimates in other ways, for example, by weighting them according to the signal-to-noise ratio of the observations. Involving the fusion center in this way allows more flexibility in the algorithm development. In the convergence proofs, we will consider the case where an average of the in-cluster estimates is taken.
V. CONVERGENCE
Following the approach for the distributed incremental subgradient algorithm in [8], we show
convergence for the distributed in-cluster approach. The main difference in our proofs is the emer-
gence of the clustering values, nC and nS . In these proofs, we make reasonable assumptions that
the optimal solution exists and the subgradient is bounded as shown in the following statements.
Let the true underlying state of the environment be the (finite) minimizer, θ∗ , of the cost func-
tion. Also, assume there exist scalars Ci,j ≥ 0 such that Ci,j ≥ ||gi,j,k|| for all i = 1, ..., nS,
j = 1, ..., nC , and k.
We start with the following lemma that is true for each cluster parameter estimate {φi,j,k }.
Lemma 1: Let {φi,j,k } be the sequence of subiterations generated by Eq. (7). Then for all y ∈ Θ
and for k ≥ 0,
    ||φi,j,k − y||² ≤ ||φi−1,j,k − y||² − (2αk/n)(fi,j(φi−1,j,k) − fi,j(y)) + (αk²/n²) Ci,j²   ∀ i, j.      (8)
By summing all the inequalities in Eq. (8) over all i = 1, ..., nS and j = 1, ..., nC, we have the following lemma.
Lemma 2: Let {θk} be a sequence generated by the distributed in-cluster method. Then, for all
y ∈ Θ and for k ≥ 0
    ||θk+1 − y||² ≤ ||θk − y||² − (2αk/nC)(f(θk) − f(y)) + (αk²/(n²·nC)) Ĉ²,                               (9)

where Ĉ² = Σ_{j=1}^{nC} ( Σ_{i=1}^{nS} Ci,j )².
Theorem 1: Let {θk} be a sequence generated by the distributed in-cluster method. Then, for a fixed step size αk = α,

    lim inf_{k→∞} f(θk) ≤ f(θ*) + αĈ²/(2n²).                                                               (10)
If all the subgradients gi,j,k are bounded by one scalar C, we have the following corollary.
Corollary 1: Let C0 = max_{i,j} Ci,j. It is evident that C0 ≥ ||gi,j,k|| for all i = 1, ..., nS, j = 1, ..., nC, and k. Then, for a fixed step size αk = α,

    lim inf_{k→∞} f(θk) ≤ f(θ*) + αC0²/(2nC).                                                              (11)
Since the incremental subgradient method is a primal feasible method, a lower bound of f (θ∗ )
is always satisfied. Therefore, the sequence of estimates f (θk ) will eventually be trapped between
f(θ*) and f(θ*) + αC0²/(2nC). The fluctuation around this equilibrium,

    R = αC0²/(2nC),                                                                                         (12)

is the residual estimation error due to the fact that a constant step size is used for subgradient
methods.
Comparing Corollary 1 with both the standard results in [8] and the result used by Rabbat and
Nowak in [2], we observe that in our case, we have a smaller threshold tolerance. Using the same assumptions, the corresponding bound for the distributed in-network algorithm is

    lim inf_{k→∞} f(θk) ≤ f(θ*) + αC0²/2.                                                                  (13)

Thus, as k → ∞, the distributed in-network algorithm converges to an (αC0²/2)-suboptimal solution, whereas the distributed in-cluster algorithm converges to an (αC0²/(2nC))-suboptimal solution. We ob-
serve a key advantage of the in-cluster approach: estimation accuracy is tighter by a factor of nC .
Even for medium scale sensor networks, a factor of nC can be an order of magnitude improvement.
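For instance, under illustrative values α = 0.4 and C0 = 1 (assumptions, not tied to a particular deployment), the residual bound of Corollary 1 compares as follows:

```python
alpha, C0 = 0.4, 1.0
for n_C in (1, 4, 10):                      # n_C = 1 corresponds to the in-network case
    R = alpha * C0 ** 2 / (2 * n_C)         # residual error bound from Corollary 1
    print(f"n_C = {n_C:2d}: residual bound = {R:.3f}")
```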
The next theorem provides the necessary number of iterations, K, to achieve a certain desired
estimation accuracy.
Theorem 2: Let {θk} be a sequence generated by the distributed in-cluster method. Then, for a fixed step size α and for any positive scalar ε,

    min_{0≤k≤K} f(θk) ≤ f(θ*) + (1/2)(αĈ²/n² + ε),                                                         (14)

where K is given by

    K = ⌊ nC ||θ0 − θ*||² / (αε) ⌋.                                                                         (15)
Along the same lines as Theorem 1 and Corollary 1, we have the following corollary.
Corollary 2: Let C0 = max_{i,j} Ci,j. Then, for a fixed step size α and for any positive scalar ε,

    min_{0≤k≤K} f(θk) ≤ f(θ*) + (1/2)(αC0²/nC + ε),                                                        (16)

with K given by Eq. (15).
By observing Eq. (16), we see that the index k refers to the parameter estimate obtained at
the end of each cluster cycle. Since each cluster has nS iterations, the total number of iterations required for an accuracy of (1/2)(αC0²/nC + ε) is nS ⌊ nC ||θ0 − θ*||²/(αε) ⌋. In comparison, for the distributed in-network case, the total number of iterations required for an accuracy of (1/2)(αC0² + ε) is n ⌊ ||θ0 − θ*||²/(αε) ⌋. Therefore, for both algorithms, the total number of iterations necessary is of the same order of magnitude, while the benefit of the distributed in-cluster algorithm over the distributed in-network algorithm is an estimation accuracy improvement by a factor of nC.
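The following small sketch (with assumed values of ||θ0 − θ*||², α, and ε) makes the comparison of total iteration counts concrete:

```python
import math

alpha, eps, dist0_sq = 0.4, 0.1, 100.0    # assumed step size, accuracy parameter, and ||theta_0 - theta*||^2
n, n_C = 100, 10
n_S = n // n_C

K_cluster = math.floor(n_C * dist0_sq / (alpha * eps))   # cluster cycles, Eq. (15)
K_network = math.floor(dist0_sq / (alpha * eps))          # cycles for the in-network case (n_C = 1)
print("in-cluster total iterations:", n_S * K_cluster)
print("in-network total iterations:", n * K_network)      # same order of magnitude
```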
Another natural extension is varying the step-size. For a fixed step-size, complete convergence
cannot be achieved. The parameter estimates, {θk}, enter a limit cycle after an arbitrary number
of iterations. To force convergence to the optimal value, f (θ∗ ), the step-size can be set to diminish
at a rate inversely proportional to the number of iterations, e.g., αk = α/k. More generally, we have the
following theorem.
Theorem 3: Let {θk} be a sequence generated by the distributed in-cluster method. Also, assume that the step sizes αk are positive, diminish to zero, and satisfy Σ_{k=0}^{∞} αk = ∞; then,

    lim inf_{k→∞} f(θk) = f(θ*).
VI. BENEFITS OF CLUSTERING

A. Energy Efficiency
The main expenditure of energy for a WSN is in the cost of communication. This entails trans-
porting bits either from sensor to sensor or from sensor to fusion center. So, the transport cost is used as the measure of energy efficiency.
For example, consider our original WSN consisting of n sensors where each sensor has m mea-
surements. The sensors are distributed randomly (uniformly) over one square meter. We use
bit-meters as the metric to measure the transport cost in the transmission of data.
In the centralized setting, all n sensors send their m observations to the fusion center, requiring
O(mn) bits to be transmitted over an average distance of O(1) meters. In total, the transport cost
is O(mn) bit-meters.
In the distributed in-network setting, all n sensors use their m observations to update the pa-
rameter estimate and pass the parameter estimate along the path that contains all the sensor nodes.
The distributed in-network method needs O(n) bits to be transmitted over an average distance of O(1/√n) meters. In total, the transport cost is O(√n) bit-meters.
In the distributed in-cluster setting, the sensor network forms nC clusters with nS nodes per
cluster. This method requires O(n) bits to be transmitted over an average distance of O(1/√n)
meters which accounts for the sensor to sensor transport cost and O(nC ) bits to be transmitted over
an average distance of O(1) meters which accounts for the sensor to fusion center transport cost.
Thus, the transport cost is O(√n + nC) bit-meters.
An interesting result arises when the cluster size and the number of sensors per cluster are equal,
nC = nS = √n. For this case, the total transport cost of the distributed in-cluster algorithm becomes O(√n). The distributed in-cluster algorithm and the distributed in-network algorithm have the same order of magnitude in terms of transport cost when nC = nS = √n.
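A back-of-the-envelope sketch of these bit-meter estimates (the constants are illustrative assumptions; only the scaling behavior matters):

```python
import math

def transport_cost(n, m, n_C=None):
    """Rough bit-meter scaling for the three schemes on a unit square."""
    centralized = m * n * 1.0                       # O(mn) bits over O(1) meters
    in_network = n * (1.0 / math.sqrt(n))           # O(n) bits over O(1/sqrt(n)) meters
    if n_C is None:
        return centralized, in_network
    in_cluster = n * (1.0 / math.sqrt(n)) + n_C * 1.0   # sensor-to-sensor plus heads-to-fusion-center
    return centralized, in_network, in_cluster

print(transport_cost(n=100, m=10, n_C=10))   # n_C = sqrt(n): in-cluster stays the same order as in-network
```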
B. Latency
Latency is defined as the number of iterations needed to see all the data captured by the network.
For the centralized case, only one iteration is needed while for the in-network case, n iterations
are needed. However, with the in-cluster algorithm, the latency of the WSN can be adjusted by the size of the clusters. The latency for the in-cluster case reduces to n/nC, or more simply, nS iterations, as shown in Table I.
C. Estimation Accuracy
By forming nC clusters in a WSN, estimation accuracy can be improved. For the fixed step-size case, the residual estimation error is reduced by a factor of 1/nC when compared to the distributed in-network case, as shown in Table I. The accuracy improvement by a factor of nC holds both when k tends toward infinity and when k is finite, as shown in Corollary 1 and Corollary 2, respectively.
VII. SIMULATIONS

Consider a WSN with 100 sensors uniformly distributed over a region, each taking 10 measurements. The observations are independent and identically distributed from measurement to measurement, and observations are independent from sensor to sensor. If the sensors are working properly, the measurements are distributed according to a Gaussian distribution with mean 10 and variance 1, and if the sensors are defective, the measurements are distributed according to a Gaussian distribution with mean 10 and variance 100. This application can be viewed as a deterministic mean-location parameter estimation problem. The simulations assume that 10% of the sensor nodes are damaged and a fixed step size of αk = 0.4 is used. As summarized in this section, a variety of simulations are conducted to verify the theorems and characterize other properties and tradeoffs in the proposed distributed in-cluster algorithm.
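A minimal sketch of how such a data set can be generated (the mixture of working and defective sensors follows the description above; the function name and random seed are assumptions):

```python
import numpy as np

def generate_measurements(n=100, m=10, damaged_fraction=0.1, seed=0):
    """n sensors, m measurements each; damaged sensors have variance 100 instead of 1."""
    rng = np.random.default_rng(seed)
    damaged = rng.random(n) < damaged_fraction
    std = np.where(damaged, 10.0, 1.0)                   # per-sensor standard deviations
    return rng.normal(10.0, std[:, None], size=(n, m))   # mean-10 Gaussian measurements

x = generate_measurements()
print(x.shape, x.mean())   # roughly 10 on average
```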
A. Basic Simulations
Least squares estimation and Huber robust estimation are simulated, and the resulting conver-
gence behavior of the residual value is shown in Figs. 2 and 3, respectively. In both figures, the
distributed in-network method and the distributed in-cluster method are shown by a solid line while
the centralized method is shown by a dashed line. Since a fixed step size of αk = 0.4 is used, there is a residual fluctuation around the optimal value in both distributed methods.
In both estimation procedures, an increase in the number of clusters causes a decrease in fluctu-
ations. The precise data points confirm the theoretical prediction: the distributed in-cluster method
fluctuation is smaller by a factor of 1/nC than the distributed in-network fluctuation. By observing the least squares estimation plots in Fig. 2, when nC = 4 and nC = 10, the fluctuations are smaller by factors of 1/4 and 1/10, respectively, compared to the plot when nC = 1. In the robust estimation example of Fig. 3, we use a Huber parameter of γ = 1. The distributed in-cluster method again shows narrower fluctuations and is almost indistinguishable from the centralized estimation curve.
To determine the tolerance bounds for the robust estimation procedure, the gradient of the Huber
loss function is calculated. Since fi(θ) = (1/m) Σ_{p=1}^{m} fH(xi^p, θ), it is clear that
||∇fi (θ)|| ≤ γ
by differentiation, while
||∇fi (θ)|| ≤ C0
by definition. Hence, C0 can be set to equal γ to provide an upper bound for the gradient.
This gives the following results that analytically characterize the tradeoff among three competing
criteria: accuracy, robustness, and speed of convergence for incremental estimation. Combining
Eq. (12) with the relation that C0 = γ, we have the following formula. For a given network size
n, the tradeoff among estimation error bound R, Huber robustness parameter γ, and constant step size α is

    2 nC R = αγ².                                                                                           (17)
For example, to maintain a desired level of robustness γ, tighter convergence bounds (smaller
R) imply a slower convergence speed (smaller α). As another example, to get tighter convergence
bounds, we would like R to be small. This can be achieved by either reducing α, which means
smaller step size and slower convergence speed, or reducing γ, which means accepting less reliable
data for Huber estimation and reducing the robustness as well as the speed of convergence, since
less reliable data are used for estimation. An illustrative example is shown in Fig. 4, where we
reduce γ by a factor of 2 (cf. Figs. 4a and 4b). This reduction of γ reduces the estimation error by
about a factor of 4 but also increases the convergence time by roughly a factor of 2. The nC term in
the tradeoff characterization in Eq. (17) again highlights an advantage of the in-cluster approach: for the same robustness γ and step size α, the residual error bound R is smaller by a factor of nC.
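A one-line check of this tradeoff under assumed values (matching the qualitative effect in Fig. 4, where halving γ roughly quarters the error bound):

```python
alpha, n_C = 0.4, 10
for gamma in (1.0, 0.5):
    R = alpha * gamma ** 2 / (2 * n_C)     # from 2 * n_C * R = alpha * gamma^2
    print(f"gamma = {gamma}: R = {R:.4f}")
```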
VIII. CONCLUSION
We have presented a distributed in-cluster scheme for a WSN that uses the incremental subgra-
dient method. By incorporating clustering within the sensor network, we have created a degree
of freedom which allows us to tune the algorithm for energy efficiency, estimation accuracy, con-
vergence speed and latency. Specifically, in terms of estimation accuracy, we have shown that a
different scaling law applies to the clustered algorithm: the residual error is inversely proportional
to the number of clusters. Also, for the special case where a WSN with n sensors forms √n clusters, we are able to maintain the same transport cost as the distributed in-network scheme, while
increasing both accuracy of the estimate and convergence speed, and reducing latency. Simulations
have been provided for both least squares and robust estimation.
We plan to extend our work by relaxing the independence assumption of sensor to sensor obser-
vations. In particular, in future work, we will consider a WSN scenario where the data within each
cluster are spatially correlated, while the data from cluster to cluster are independent.
REFERENCES
[1] G. J. Pottie and W. J. Kaiser, “Wireless integrated network sensors,” Communications of the ACM, vol. 43, no. 5, pp. 51–58,
May 2000.
[2] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the Third International Symposium on Information Processing in Sensor Networks (IPSN), Berkeley, CA, April 2004.
[3] V. M. Kibardin, “Decomposition into functions in the minimization problem,” Automation and Remote Control, vol. 40, pp.
1311–1323, 1980.
[4] A. Nedić and D. P. Bertsekas, “Incremental subgradient methods for nondifferentiable optimization,” Tech. Rep., Massachusetts Institute of Technology, Cambridge, MA.
[5] M. Rabbat and R. Nowak, “Quantized incremental algorithms for distributed optimization,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 4, pp. 798–808, April 2005.
[6] S. Bandyopadhyay and E. Coyle, “An energy efficient hierarchical clustering algorithm for wireless sensor networks,” in Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (Infocom 2003), San Francisco, CA, 2003.
[7] P. J. Huber, Robust Statistics, John Wiley & Sons, New York, 1981.
[8] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization, Athena Scientific, Belmont, MA, 2003.
APPENDIX
A. Proof of Lemma 1
Proof:
    ||φi,j,k − y||² = ||φi−1,j,k − (αk/n) gi,j,k − y||²
                   ≤ ||φi−1,j,k − y||² − (2αk/n)(φi−1,j,k − y) gi,j,k + (αk²/n²) ||gi,j,k||²
                   ≤ ||φi−1,j,k − y||² − (2αk/n)(fi,j(φi−1,j,k) − fi,j(y)) + (αk²/n²) Ci,j².
The last line of the proof uses the fact that gi,j,k is a subgradient of the convex function fi,j at φi−1,j,k. Thus, fi,j(y) ≥ fi,j(φi−1,j,k) + (y − φi−1,j,k) gi,j,k, which yields the last inequality.
B. Proof of Lemma 2
Proof:
    ||θk+1 − y||² = || (1/nC) Σ_{j=1}^{nC} φnS,j,k − y ||²
                 = || (1/nC) Σ_{j=1}^{nC} (φnS,j,k − y) ||²
                 ≤ (1/nC) Σ_{j=1}^{nC} ||φnS,j,k − y||²
                 ≤ (1/nC) Σ_{j=1}^{nC} { ||φnS−1,j,k − y||² − (2αk/n)(fnS,j(φnS−1,j,k) − fnS,j(y)) + (αk²/n²) CnS,j² },

where in the third and fourth lines of the proof, we used the Quadratic Mean-Arithmetic Mean inequality and Lemma 1, respectively. After recursively decomposing ||φi,j,k − y||² nS times, we get

    ||θk+1 − y||² ≤ (1/nC) Σ_{j=1}^{nC} { ||φ0,j,k − y||² − (2αk/n) Σ_{i=1}^{nS} (fi,j(φi−1,j,k) − fi,j(y)) + (αk²/n²) Σ_{i=1}^{nS} Ci,j² }
                 = ||θk − y||² − (2αk/(n·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} (fi,j(φi−1,j,k) − fi,j(y)) + (αk²/(n²·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} Ci,j²
                 = ||θk − y||² − (2αk/(n·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} (fi,j(φi−1,j,k) − fi,j(y) + fi,j(θk) − fi,j(θk)) + (αk²/(n²·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} Ci,j².

Then,

    ||θk+1 − y||² ≤ ||θk − y||² − (2αk/nC) { f(θk) − f(y) − (1/n) Σ_{j=1}^{nC} Σ_{i=1}^{nS} Ci,j ||φi−1,j,k − θk|| } + (αk²/(n²·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} Ci,j²
                 ≤ ||θk − y||² − (2αk/nC) { f(θk) − f(y) − (1/n) Σ_{j=1}^{nC} Σ_{i=2}^{nS} Ci,j (αk/n) Σ_{m=1}^{i−1} Cm,j } + (αk²/(n²·nC)) Σ_{j=1}^{nC} Σ_{i=1}^{nS} Ci,j²
                 = ||θk − y||² − (2αk/nC)(f(θk) − f(y)) + (αk²/(n²·nC)) Σ_{j=1}^{nC} { 2 Σ_{i=2}^{nS} Ci,j Σ_{m=1}^{i−1} Cm,j + Σ_{i=1}^{nS} Ci,j² }
                 = ||θk − y||² − (2αk/nC)(f(θk) − f(y)) + (αk²/(n²·nC)) Σ_{j=1}^{nC} ( Σ_{i=1}^{nS} Ci,j )²,

where the two inequalities in the last display use the facts that

    fi,j(θk) − fi,j(φi−1,j,k) ≤ Ci,j ||φi−1,j,k − θk||

and

    ||φi,j,k − θk|| ≤ (αk/n) Σ_{m=1}^{i} Cm,j,   i = 1, ..., nS,

respectively.
C. Proof of Theorem 1
Proof: Proof is by contradiction. If Thm. 1 is not true, then there exists an ε > 0 such that

    lim inf_{k→∞} f(θk) > f(θ*) + αĈ²/(2n²) + 2ε.

Let z ∈ Θ be a point such that

    lim inf_{k→∞} f(θk) ≥ f(z) + αĈ²/(2n²) + 2ε,

and let k0 be large enough so that f(θk) ≥ lim inf_{k→∞} f(θk) − ε for all k ≥ k0. Then, by combining the above two relations with Lemma 2 and setting y = z, we have, for all k ≥ k0,

    ||θk+1 − z||² ≤ ||θk − z||² − 2αε/nC.

Therefore,

    ||θk+1 − z||² ≤ ||θk − z||² − 2αε/nC
                 ≤ ||θk−1 − z||² − 4αε/nC
                 ⋮
                 ≤ ||θk0 − z||² − 2(k + 1 − k0)αε/nC,

which cannot hold for k sufficiently large, a contradiction.
D. Proof of Theorem 2
Proof: Proof is by contradiction. Assume that, for all k with 0 ≤ k ≤ K,

    f(θk) > f(θ*) + (1/2)(αĈ²/n² + ε).

By setting αk = α and y = θ* in Lemma 2 and by combining that with the above relation, we have

    ||θk+1 − θ*||² ≤ ||θk − θ*||² − (2α/nC)(f(θk) − f(θ*)) + α²Ĉ²/(n²·nC)
                  ≤ ||θk − θ*||² − (2α/nC)·(1/2)(αĈ²/n² + ε) + α²Ĉ²/(n²·nC)
                  = ||θk − θ*||² − αε/nC.

Applying this inequality recursively for k = 0, ..., K gives

    ||θK+1 − θ*||² ≤ ||θ0 − θ*||² − (K + 1)αε/nC.

Since K = ⌊ nC ||θ0 − θ*||²/(αε) ⌋, we have (K + 1)αε/nC > ||θ0 − θ*||², so the right-hand side is negative, which is a contradiction.
E. Proof of Theorem 3
Proof: Proof is by contradiction. If Thm. 3 does not hold, then there exists an ε > 0 such that

    lim inf_{k→∞} f(θk) − 2ε > f(θ*).

Then, using the convexity of f and Θ, there exists a point z ∈ Θ and an index k0 such that, for all k ≥ k0,

    f(θk) − f(z) ≥ ε.

By setting y = z in Lemma 2 and by combining that with the above relation, we obtain for all k ≥ k0,

    ||θk+1 − z||² ≤ ||θk − z||² − (2αk/nC) ε + αk²Ĉ²/(n²·nC)
                 = ||θk − z||² − (αk/nC)(2ε − αkĈ²/n²).

Since αk → 0, we can take k0 large enough so that

    2ε − αkĈ²/n² ≥ ε,   ∀k ≥ k0.

Then,

    ||θk+1 − z||² ≤ ||θk − z||² − αkε/nC
                 ≤ ||θk−1 − z||² − (αk + αk−1)ε/nC
                 ⋮
                 ≤ ||θk0 − z||² − (ε/nC) Σ_{j=k0}^{k} αj,

which cannot hold for k sufficiently large since Σ_{k} αk = ∞, a contradiction.
Fig. 1. Illustration of a sensor network implementing the distributed in-cluster algorithm. The dash-dotted lines represent the
borders of the clusters. The shaded nodes represent the cluster heads that communicate with the fusion center. All clusters run the
algorithm in parallel, although in the schematic only the lower-right cluster is shown running the incremental subgradient algorithm.
TABLE I
Summary of performance tradeoffs among the different algorithms. Note that in the special case where nC = √n, the distributed in-cluster and in-network algorithms have the same transport cost, but the latency and estimation accuracy are improved by a factor of 1/√n.

    Algorithm                      Transport cost    Latency    Residual error
    centralized                    O(mn)             1          0
    distributed in-network         O(√n)             n          αC0²/2
    distributed in-cluster         O(√n + nC)        n/nC       αC0²/(2nC)
    special case (nC = √n)         O(√n)             √n         αC0²/(2√n)
Fig. 2. Plots of least squares residual value vs. total number of iterations for three different clustering scenarios are shown.
(a) Distributed in-network algorithm, (b) Distributed in-cluster algorithm with nC = 4 and nS = 25, (c) Distributed in-cluster algorithm with nC = 10 and nS = 10.
Fig. 3. Plots of robust residual value vs. total number of iterations for three different clustering scenarios are shown. (a) Distributed
in-network algorithm, (b) Distributed in-cluster algorithm with nC = 4 and nS = 25, (c) Distributed in-cluster algorithm with
nC = 10 and nS = 10.
Fig. 4. Plots of robust residual value vs. total number of iterations for the case where nC = 10 and nS = 10 are shown. The plots
are rescaled for clarity. (a) Robust estimation with γ = 1, (b) Robust estimation with γ = 0.5.