Online Collection and Forecasting of Resource Utilization in Large-Scale Distributed Systems


Tiffany Tuor*, Shiqiang Wang†, Kin K. Leung*, Bong Jun Ko†
*Imperial College London, UK. Email: {tiffany.tuor14, kin.leung}@imperial.ac.uk
†IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. Email: {wangshiq, bko}@us.ibm.com

arXiv:1905.09219v1 [cs.DC] 22 May 2019

This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Abstract—Large-scale distributed computing systems often contain thousands of distributed nodes (machines). Monitoring the conditions of these nodes is important for system management purposes, which, however, can be extremely resource demanding as this requires collecting local measurements of each individual node and constantly sending those measurements to a central controller. Meanwhile, it is often useful to forecast the future system conditions for various purposes such as resource planning/allocation and anomaly detection, but it is usually too resource-consuming to have one forecasting model running for each node, which may also neglect correlations in observed metrics across different nodes. In this paper, we propose a mechanism for collecting and forecasting the resource utilization of machines in a distributed computing system in a scalable manner. We present an algorithm that allows each local node to decide when to transmit its most recent measurement to the central node, so that the transmission frequency is kept below a given constraint value. Based on the measurements received from local nodes, the central node summarizes the received data into a small number of clusters. Since the cluster partitioning can change over time, we also present a method to capture the evolution of clusters and their centroids. As an effective way to reduce the amount of computation, time-series forecasting models are trained on the time-varying centroids of each cluster, to forecast the future resource utilizations of a group of local nodes. The effectiveness of our proposed approach is confirmed by extensive experiments using multiple real-world datasets.

I. INTRODUCTION

Modern cloud computing systems often include thousands of machines that process tasks originating from different geographical regions. The effective management of such large-scale distributed systems is very challenging. For example, even within a single data center, where machines are interconnected with high-speed networking and often owned by a single service provider, it is very difficult to allocate resources optimally, due to the high variation of resource demands of different tasks. As a result, resource over-provisioning (allocating too much resource) and under-provisioning (allocating too little resource) often occur in practical cloud systems [1], [2]. The former causes waste in resources and high operational cost, and the latter causes degradation in user experience.

To overcome these issues, we need to precisely monitor and predict the resource utilization (such as CPU and memory utilization) of individual machines [3], based on which the current and future available resources at each machine can be inferred so that the system can be properly managed and resource allocation can be performed in a near-optimal way. In particular, measurements of the resource utilization at each physical machine (local node) have to be transmitted to a central controller (central node). The controller needs to forecast the future resource availability of each machine, so that it can assign new incoming tasks to machines that are predicted to have the most suitable amount of available resources. Furthermore, the forecasting has to be done in an online manner, where algorithms make decisions based on information received up to the current time and do not assume knowledge of future information, for obvious practical reasons.

There exist several challenges towards a distributed system that can perform the above functionalities of collecting and forecasting resource utilization. First, it is often bandwidth-consuming and unnecessary to transmit all the measurements collected at local nodes to the central node. Second, predictive models for data forecasting typically have high complexity, thus running a forecasting model for the time-series measurement data collected at each local node would consume too much computational resource. Third, measurements at each local node are collected in an online manner, which form a time series; decisions related to data collection and forecasting need to be made in an online manner as well.

In this paper, we address the above challenges and propose a mechanism that efficiently collects and forecasts the resource utilization at machines in a large-scale distributed system. The results provided by our mechanism can be used for system management such as resource allocation. We focus on the collection and forecasting of resource utilization in this paper, and leave its application to system management for future work. Our main contributions are summarized as follows.

1) We propose an algorithm for each local node to adaptively decide when to transmit its latest measurement to the central node, subject to a maximum frequency of transmissions that is given as a system-constraint parameter. The algorithm adapts to the degree of changes in observations since the last transmission, so that the allowed transmission bandwidth is efficiently used.
2) We propose a dynamic clustering algorithm for the central node to partition the measurements received from local nodes into a given number of clusters. The algorithm allows the clustering to evolve over time, and the cluster centroids are a compressed representation of the dynamic observations of the large distributed system.

3) We propose a forecasting mechanism where the centroids of each cluster evolving over time constitute a time series that is used to train a forecasting model. The trained model is then used to forecast the future resource utilizations of a group of local nodes.

4) Extensive experiments of our proposed mechanism have been conducted using three real-world computing cluster datasets, to show the effectiveness of our proposed approach.

The clustering, model training, and forecasting are all performed in an online manner, based on "intermittent" measurement data received at the central node.

The rest of this paper is organized as follows. In the next section, we review the related work. In Section III, we present a motivational experiment. In Section IV, we present the system overview together with some definitions. The proposed algorithms are described in Section V. The experimentation settings and results are given in Section VI, and Section VII draws our conclusion.

II. RELATED WORK

The existing body of work that uses prediction/forecasting models to assist resource scheduling mostly focuses on aggregated workloads or resource demands that can be described as a single time series [1], [2], [4], [5]. While these approaches are useful for predicting the future demand, they do not capture the dynamics of resource utilization at individual physical machines, and hence cannot predict how much resource is utilized or available in the physical system. In this paper, we focus on resource utilization at machines in the distributed system, which is more complex because each machine generates time-series measurement data on its own.

Some existing approaches for efficient data collection in a distributed system involve only a selected subset of local nodes that transmit data to the central node [3], [6]–[12]. More specifically, techniques in [3], [11], [12] select the best set of monitors (local nodes) subject to a constraint on the number of monitors, and infer data from the unobserved local nodes based on Gaussian models. Methods in [6]–[10] are based on compressed sensing, where a subset of local nodes is randomly selected to collect data, then matrix completion is applied to reconstruct data from unobserved nodes. The approaches based on compressed sensing generally perform worse than Gaussian-based approaches [3]. All these approaches, where only some of the local nodes send data in the monitoring phase, lead to unbalanced resource consumption (such as communication bandwidth, energy, etc.).

To avoid unbalanced resource consumption, some existing approaches consider settings where every node sends data to the central node but with a sampling rate adapted directly at each node [13]–[17]. However, the sampling rate in these works is only implicitly related to the transmission frequency. None of them allows one to specify a target transmission frequency, which is proportional to the required communication bandwidth. In this paper, we propose an algorithm that decides when to transmit subject to a maximum transmission frequency. This allows the system to explicitly specify the communication budget.

For the clustering of local node measurements, Gaussian models are widely used, such as in [3], [11], [12]. However, these methods require a separate training phase to estimate the covariance matrix, during which it needs to collect all the data from all local nodes, which can be bandwidth consuming. In addition, a sufficiently large number of samples are required for a good estimation of the covariance matrix. When the correlation among local nodes varies frequently, which is the case with resource utilization at machines in distributed systems (see Section III for further discussion), the system may not be able to collect enough samples to estimate the covariance matrix with a reasonable accuracy. In this paper, we propose a clustering mechanism that works well with highly varying resource utilization data.

The evolution of clusters over time is related to the area of evolutionary clustering [18]–[21], for which typical applications include community matching in social science [20], disease diagnosis in bio-informatics [22], user preference modelling in dynamic recommender systems [23], etc. To our knowledge, evolutionary clustering techniques have not been applied to the dynamic clustering and forecasting of resource utilization at multiple machines, where the objectives are different from the above applications.

In summary, while there exist methods in the literature that are related to specific parts of our problem, they focus on different scenarios or applications and do not directly apply to our problem, as explained above. Furthermore, to our knowledge, a system/mechanism that efficiently collects and forecasts resource utilization of all machines in a distributed system does not exist in the literature. This paper overcomes the challenge of developing such a mechanism with different components working smoothly together, while providing good performance in practical settings.

III. MOTIVATIONAL EXPERIMENT

To illustrate the challenge in the problem we study in this paper, we start with a motivational experiment comparing the long-term spatial correlations¹ in resource utilizations at different machines in a distributed computing environment and sensor measurements at different nodes in a sensor network. We consider the sensor network dataset collected by Intel Research Laboratory at Berkeley [24], which includes sensor measurements over 12 days, and the Google cluster usage trace (version 2) [25], which includes resource utilizations at machines over one month. The empirical cumulative distribution function (CDF) of the spatial correlation values computed on the temperature and humidity data from the sensor dataset and the CPU and memory utilization data (aggregated for all tasks running on each machine) from the Google cluster dataset are plotted in Fig. 1, where each type of data is considered separately.

¹The (spatial) correlation of two nodes is defined as the sample covariance of measurements obtained at the two nodes divided by the standard deviations of both sets of measurements (each obtained at one of the two nodes) [5].
Fig. 1: Empirical cumulative distribution function (CDF) of correlation values of different datasets.

We see that for CPU and memory utilization, most of the spatial correlation values are between −0.5 and 0.5, whereas most correlation values are above 0.5 for temperature and humidity data. This shows that in the long term (over the entire duration of the dataset), the spatial correlation in resource utilization among machines in a distributed computing system is much weaker than the spatial correlation in sensor measurements at different nodes in a sensor network. Therefore, we do not have strong long-term spatial correlation in our scenario, which is required by Gaussian-based methods for covariance matrix estimation (see also the related discussions in Section II). Hence, Gaussian models, which are widely used in the clustering of sensor network data [3], [11], [12], are not suitable for our case with resource utilization data. This justifies the need for developing our own clustering mechanism that focuses more on short-term spatial correlations.

A more detailed comparison between our approach and the Gaussian-based approach in [3] will also be presented later in Section VI-E.

IV. DEFINITIONS AND SYSTEM OVERVIEW

We consider a distributed system with N local nodes (machines) generating resource utilization measurements, and a central node (controller) that receives a summary of all the local measurements and forecasts the future. We assume that time is slotted. For each time step t, let x_t := [x_{1,t}, x_{2,t}, ..., x_{N,t}] denote the N-tuple that contains the true measurements of the N local nodes, and let z_t := [z_{1,t}, z_{2,t}, ..., z_{N,t}] be the measurements stored at the central node. Here, x_{i,t} and z_{i,t} (1 ≤ i ≤ N) are d-dimensional vectors, where d is equal to the number of resource types (e.g., CPU, memory). The values in z_t depend on the transmission frequency (i.e., how often each local node sends its measurement to the central node). For each node i, let β_{i,t} be an indicator variable such that β_{i,t} = 1 if node i has sent its most recent measurement at time step t to the central node, otherwise β_{i,t} = 0. Then, z_{i,t} = x_{i,t−p}, where p ≥ 0 is defined as the smallest p such that β_{i,t−p} = 1. If β_{i,t} = 1, then p = 0 and z_{i,t} = x_{i,t}.

We define K as a given input parameter to the system that specifies the number of different forecasting models the system uses, which is related to the computational overhead. At each time step t, the central node partitions the N measurements z_{1,t}, z_{2,t}, ..., z_{N,t} into K clusters, so that one forecasting model can be used for each cluster. Let C_{j,t} (1 ≤ j ≤ K) denote the j-th cluster at time step t, which is defined as a set of indices of local nodes whose measurements are included in this cluster, i.e., C_{j,t} ⊆ {1, 2, ..., N}. Each cluster j has a centroid, defined as

c_{j,t} := (1/|C_{j,t}|) Σ_{i∈C_{j,t}} z_{i,t}    (1)

where |·| denotes the cardinality (size) of the set.

At time step t, a time-series forecasting model is trained using the time series formed by the set of historical centroids (i.e., {c_{j,τ} : τ ≤ t}) for each cluster j. The model can forecast future values of the cluster centroid, i.e., for any forecasting step h ≥ 1, the model provides a forecasted value ĉ_{j,t+h} at the future time step t + h. The future resource utilization at each individual local node i is predicted as the value of its centroid plus an offset for this node, thus we define²

x̂_{i,t+h} = ĉ_{j,t+h} + ŝ_{i,t+h}    (2)

for i ∈ Ĉ_{j,t+h}, where Ĉ_{j,t+h} is the forecasted set of nodes in cluster j at time step t + h, and ŝ_{i,t+h} is the forecasted offset of node i with respect to the centroid of cluster j (to which node i is forecasted to belong) at time step t + h. In this way, the estimation of x̂_{i,t+h} involves both spatial estimation³ (using the cluster centroid and per-node offset as estimates of the values at individual nodes) and temporal forecasting. Fig. 2 illustrates the system with the functionalities described above.

Fig. 2: System overview.

²For convenience (and with slight abuse of notation), we use the subscript t + h to denote that the current time step is t and we forecast h steps ahead. With this notation, even if t₁ + h₁ = t₂ + h₂, we may have x̂_{i,t₁+h₁} ≠ x̂_{i,t₂+h₂} if t₁ ≠ t₂.

³The use of the term spatial estimation or spatial correlation is for notational convenience. We acknowledge that the clustering behavior of the measurement data from different local nodes results from their spatial relationship as well as non-spatial reasons such as application-driven workloads.
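To make the definitions above concrete, the following minimal NumPy sketch computes the centroids in (1) and the per-node estimate in (2). The array shapes and function names are our own illustration and are not prescribed by the paper.

```python
import numpy as np

def cluster_centroids(z_t, clusters):
    """Centroid c_{j,t} in (1): mean of the stored measurements z_{i,t}
    over the nodes i in cluster C_{j,t}.
    z_t:      array of shape (N, d), measurements stored at the central node
    clusters: list of K index lists, clusters[j] = C_{j,t}
    """
    return [z_t[idx].mean(axis=0) for idx in clusters]

def estimate_node(c_hat_j, s_hat_i):
    """Per-node prediction in (2): forecasted centroid plus per-node offset."""
    return c_hat_j + s_hat_i

# Toy example: N = 4 nodes, d = 2 resource types (CPU, memory), K = 2 clusters.
z_t = np.array([[0.20, 0.30], [0.25, 0.35], [0.80, 0.70], [0.90, 0.60]])
clusters = [[0, 1], [2, 3]]
print(cluster_centroids(z_t, clusters))  # [array([0.225, 0.325]), array([0.85, 0.65])]
```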
We define the root mean square error (RMSE) of x̂_{t+h} := [x̂_{1,t+h}, x̂_{2,t+h}, ..., x̂_{N,t+h}] for h ≥ 0 as

RMSE(t, h) := √( (1/N) Σ_{i=1}^{N} ‖x̂_{i,t+h} − x_{i,t+h}‖² )    (3)

where we define x̂_{i,t} := z_{i,t} for h = 0 for convenience. With this definition, when h = 0, the RMSE only includes the error caused by infrequent transmission of local node measurements to the central node. We also note that the true value x_{t+h} cannot be observed by the central node.

We also define the time-averaged RMSE over T time steps for a given forecasting step h as

RMSE(T, h) := √( (1/T) Σ_{t=1}^{T} (RMSE(t, h))² )    (4)

where the time average is over the square error and the square root is taken afterwards.

Let B_i (0 ≤ B_i ≤ 1) denote the maximum transmission frequency (for node i). Using the above definitions, and considering a maximum forecasting range H, the algorithms to be introduced in the next section aim at solving the following problem:

min  lim_{T→∞} √( (1/(H+1)) Σ_{h=0}^{H} (RMSE(T, h))² )    (5)
s.t.  lim_{T→∞} (1/T) Σ_{t=1}^{T} β_{i,t} ≤ B_i,  ∀i

where the minimization is over all {β_{i,t}}, {C_{j,t}}, {Ĉ_{j,t+h}}, {ĉ_{j,t+h}}, and {ŝ_{i,t+h}}. Intuitively, we would like to find the transmission schedule (indicator) β_{i,t} for each local node i and time step t, the membership of clusters C_{j,t}, ∀j, for each time step t, and the forecasted cluster memberships, centroids, and offsets for every forecasting step h ∈ [0, H] computed at each time step t, to minimize the average RMSE over all forecasting steps and all time steps.

As we do not make any assumption on the characteristics of the time series constituting the cluster centroids {c_{j,t}}, we cannot hope to find the theoretically optimal forecasting scheme, because for any forecasted time series, there can always exist a true time series that is very different from the forecasted values and thus gives a high forecasting error. In addition, it is often reasonable in the clustering step to minimize the error between the data and their closest cluster centroids (we refer to this error as the "intermediate RMSE" later in the paper), which is the K-means clustering problem and is NP-hard [26]. We also note that an online algorithm is required because measurements from local nodes are obtained over time and decisions have to be made only based on the current and past information (with future information unknown to the algorithm). All the above impose challenges in solving (5). We propose online heuristics to solve the problem (5) approximately in the next section. These heuristics work well in practice as we show in Section VI later.

V. PROPOSED ALGORITHMS

A. Measurement Collection with Adaptive Transmission

In every time step t, each node i determines its action β_{i,t}, i.e., whether it transmits its current measurement x_{i,t} to the central node or not. To capture the error of the measurements stored at the central node, we define a penalty function

F_{i,t}(β_{i,t}) := (1/d) ‖z_{i,t} − x_{i,t}‖²  if β_{i,t} = 0,  and 0 if β_{i,t} = 1.    (6)

To take into account the maximum transmission frequency B_i, we also define Y_i(β_{i,t}) := β_{i,t} − B_i. We also define V_0 > 0 and γ ∈ (0, 1) as control parameters. The algorithm that runs at each node i to determine β_{i,t} is given as follows.

1) In the first time slot t = 1, initialize a variable Q_i(t) ← 0. The variable Q_i(t) represents the length of a "virtual queue" at node i.
2) For every t ∈ {1, 2, 3, ...}, choose β_{i,t} according to

β_{i,t} ← argmin_{β∈{0,1}} [ V_t F_{i,t}(β) + Q_i(t) Y_i(β) ]    (7)

where

V_t := V_0 · (t + 1)^γ.    (8)

Then, update the virtual queue length according to

Q_i(t + 1) ← Q_i(t) + Y_i(β_{i,t}).    (9)

The intuition behind the above algorithm is as follows. The virtual queue length Q_i(t) captures how much the B_i constraint in (5) has been violated up to the current time step t. The determination of β_{i,t} in (7) considers a trade-off between the penalty (error) F_{i,t}(β) and the constraint violation (related to Q_i(t)), where the trade-off is controlled by the parameter V_t. When Q_i(t) is large, the term Q_i(t)Y_i(β) in (7) becomes dominant, and the algorithm tends to choose β = 0 because this gives a negative value of Q_i(t)Y_i(β), which is in favor of the minimization. Since β = 0 corresponds to not transmitting, this relieves the constraint violation. When Q_i(t) is small and ‖z_{i,t} − x_{i,t}‖² is relatively large, the term V_t F_{i,t}(β) in (7) is dominant. In this case, the algorithm tends to choose β = 1 because this will make F_{i,t}(β) = 0 and reduce the error of the measurements stored at the central node.

The above algorithm is a form of the drift-plus-penalty framework in Lyapunov optimization [27]. According to Lyapunov optimization theory, as long as F_{i,t}(β) has a finite upper bound⁴, the above algorithm can always guarantee that the B_i constraint in (5) is satisfied with equality (for T → ∞ as given in the constraint, not necessarily for finite T), because lim_{t→∞} Q_i(t)/t = 0 (see [27, Chapter 4]). Note that satisfying the B_i constraint with equality is never worse than satisfying it with inequality, because more transmissions cannot hurt the RMSE performance. For finite T, the satisfaction of the B_i constraint is related to the parameter V_t, which can be tuned by the parameters V_0 and γ.

⁴F_{i,t}(β) usually has a finite upper bound because measurement data is usually finite. Also note that the lower bound of F_{i,t}(β) is zero, thus finite.
From (8), we see that V_t increases with t, which means that we give more emphasis on minimizing the penalty function when t is large. This is because for a larger t, we can allow a larger Q_i(t) while still maintaining Q_i(t)/t close to zero.

Note, however, that the penalty function F_{i,t}(β) depends on transmission decisions in previous time steps that impact the value of z_{i,t}. Therefore, the optimality analysis of Lyapunov optimization theory does not hold for our algorithm, and we do not have a theoretical bound on how optimal the result is. Nevertheless, we have observed that this algorithm with the current penalty definition works well in practice (see experimentation results in Section VI).
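For illustration, a minimal per-node sketch of steps 1)–2) above, i.e., of (6)–(9), is given below. The data source, the forced transmission at the very first step, and the parameter defaults are our own placeholder choices, not values prescribed by the paper.

```python
import numpy as np

def adaptive_transmission(x_stream, B_i, V0=1e-12, gamma=0.65, d=2):
    """Decide beta_{i,t} for a single node i following (6)-(9).
    x_stream: iterable of d-dimensional measurements x_{i,t}
    B_i:      maximum transmission frequency, 0 <= B_i <= 1
    Returns the list of decisions beta_{i,t} in {0, 1}.
    """
    Q = 0.0          # virtual queue length Q_i(t), initialized in step 1)
    z = None         # copy of node i's measurement currently stored at the central node
    decisions = []
    for t, x in enumerate(x_stream, start=1):
        x = np.asarray(x, dtype=float)
        V_t = V0 * (t + 1) ** gamma                                # (8)
        # Penalty F_{i,t}(beta) in (6); transmit unconditionally before any z exists.
        F = {0: np.sum((z - x) ** 2) / d if z is not None else np.inf,
             1: 0.0}
        Y = {0: 0.0 - B_i, 1: 1.0 - B_i}                           # Y_i(beta) = beta - B_i
        beta = min((0, 1), key=lambda b: V_t * F[b] + Q * Y[b])    # (7)
        Q = Q + Y[beta]                                            # virtual queue update (9)
        if beta == 1:
            z = x.copy()      # transmission: central copy becomes the current measurement
        decisions.append(beta)
    return decisions
```

As the Lyapunov argument above suggests, the long-run average of the returned decisions approaches B_i, since the virtual queue grows sublinearly.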
B. Dynamic Cluster Construction Over Time

We now discuss how the central node computes the clusters C_{j,t}, for 1 ≤ j ≤ K, from z_t over time. The computation includes two steps. First, K-means clustering is computed using the stored measurements z_t in time step t only. Second, the clusters computed in the first step are re-indexed so that they align the best with the clusters computed in previous time steps. The re-indexing step is only performed for t > 1.

The first step of K-means clustering is straightforward, and efficient heuristic algorithms for K-means exist [28]. Let C′_{k,t} (1 ≤ k ≤ K) denote the K-means clustering result on z_t in time step t. If t = 1, we let j = k, such that C_{j,t} = C′_{j,t}, ∀j, where we recall that {C_{j,t} : ∀j} is the final set of clusters in time step t. If t > 1, the cluster indices of {C′_{k,t} : ∀k} need to be reassigned in order to obtain {C_{j,t} : ∀j}, because the cluster indices resulting from the K-means algorithm are random, and for each cluster C′_{k,t}, we need to find out which cluster among {C_{j,t−1} : ∀j} in the previous time step t − 1 it evolves from.

To associate the clusters {C′_{k,t} : ∀k} in time step t with the clusters in previous time steps, we define a similarity measure between the k-th cluster from the K-means result in time step t, i.e., C′_{k,t}, and the j-th clusters in a subset of previous time steps. Formally, the similarity measure is defined as

w_{k,j} = | C′_{k,t} ∩ ( ∩_{m=1}^{min{M,t−1}} C_{j,t−m} ) |    (10)

where M ≥ 1 specifies the number of time steps to look back into the history when computing the intersection in the similarity measure. Intuitively, the similarity measure w_{k,j} specifies how many local nodes exist concurrently in the k-th cluster obtained from the K-means algorithm in time step t and in the j-th clusters in all M most recent time steps (excluding time step t). If w_{k,j} is large, it means that most of the nodes in the corresponding clusters are the same.

Now, to find C_{j,t} from C′_{k,t}, we find a one-to-one mapping between the indices j and k. Let ϕ denote the one-to-one mapping from k to j. We would like to find the mapping ϕ such that the sum similarity is maximized, i.e.,

max_ϕ Σ_{k=1}^{K} w_{k,ϕ(k)}.    (11)

Intuitively, with the mapping ϕ found from (11), the clusters {C_{j,t} : ∀j} are indexed in such a way that most nodes remain in the same cluster in the current time step t and the M previous time steps. In this way, the evolution of the centroids of each cluster j represents a majority of local nodes within that cluster, and it is reasonable to perform time-series forecasting with the centroids of clusters that are dynamically constructed in this way.

Solution to (11): The problem in (11) is equivalent to a maximum weighted bipartite graph matching problem, where one side of the bipartite graph has nodes representing the values of k, the other side of the bipartite graph has nodes representing the values of j, and each k–j pair is connected with an edge with weight w_{k,j}. This can then be solved in polynomial time using existing algorithms for maximum weighted bipartite graph matching, such as the Hungarian algorithm [29].
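A sketch of the re-indexing step is shown below: it builds the similarity matrix w_{k,j} of (10) from the K-means partition of the current time step and the final partitions of up to M previous time steps, and solves (11) with SciPy's optimal-assignment routine (an implementation of the Hungarian method). The data structures (clusters kept as Python sets) and function names are our own choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reindex_clusters(kmeans_clusters, history, M):
    """Re-index the K-means result of the current time step using (10)-(11).
    kmeans_clusters: list of K sets, the clusters C'_{k,t} from K-means at time t
    history:         list of past final partitions; history[-1] = {C_{j,t-1}: all j},
                     each partition itself a list of K sets
    M:               number of previous time steps used in the similarity measure
    Returns the re-indexed partition {C_{j,t}} as a list of K sets.
    """
    K = len(kmeans_clusters)
    past = history[-min(M, len(history)):]      # the min{M, t-1} most recent partitions
    w = np.zeros((K, K))
    for k in range(K):
        for j in range(K):
            common = set(kmeans_clusters[k])
            for partition in past:
                common &= partition[j]          # intersection over the look-back window
            w[k, j] = len(common)               # similarity w_{k,j} in (10)
    # Maximum weighted bipartite matching (Hungarian algorithm) solves (11).
    rows, cols = linear_sum_assignment(w, maximize=True)
    reindexed = [None] * K
    for k, j in zip(rows, cols):
        reindexed[j] = set(kmeans_clusters[k])  # K-means cluster k becomes cluster j = phi(k)
    return reindexed
```

With the matched indices, the centroid time series of each cluster index j keeps tracking roughly the same group of nodes across time steps, which is what the forecasting models in Section V-C rely on.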
The parameter M in the similarity measure (10) controls whether to consider long- or short-term history when computing the similarity. The proper choice of M is related to the temporal variation in the data correlation among different local nodes, because each cluster contains a group of nodes that are (positively) correlated with each other. Our experimentation results in Section VI show that a fixed value of M usually works well for a given scenario.

Our clustering approach can be extended in several ways. For example, one can define a time window of a given length, which contains multiple time steps, and perform clustering on extended feature vectors that include measurements at multiple time steps within each time window [30]. In this case, t represents the time window index, and everything else in our approach presented above works in the same way. In this paper, we mainly focus on dynamic settings where the time series and node correlation can fluctuate frequently. In such settings, as we will see in the experimentation results in Section VI, it is best to use a time window of length one (equivalent to no windowing), so that the clustering can adapt to the most recent measurements. We can also perform clustering on each type of resource (e.g., CPU, memory) independently from other resource types, in which case the K-means step is performed on one-dimensional vectors (equivalent to scalars). We will see in Section VI that this way of independent clustering performs better than joint clustering on the datasets we use for evaluation.

Our dynamic clustering approach shares some similarities with the approach in [20]. However, we define a different similarity measure that can look back multiple time steps and is not normalized. This is more suitable for the RMSE objective in (5), which considers the errors at all nodes. Moreover, we focus on the clustering and forecasting of time-series data, which is different from existing work.

C. Temporal Forecasting
As discussed in Section IV, temporal forecasting is performed using models trained on historical centroids of measurements stored at the central controller. The models can include Autoregressive Integrated Moving Average (ARIMA) [31], Long Short-Term Memory (LSTM) [32], etc. Different models have different computational complexities. When the system starts for the first time, there is an initial data collection phase during which there is no forecasting model available to use. Afterwards, forecasting models are trained on the time series constituted by the historical centroids of clusters. After the models are trained, the system can forecast future centroids using the models, based on the most updated measurements at the central node. The transient state of each model gets updated whenever a new measurement is available. The models are retrained periodically at a given time interval using all (or a subset of) the historical cluster centroids up to the current time.

As explained in Section IV, at time step t, the forecasted resource utilization at node i in the future time step t + h is computed using the forecasted centroid plus an offset, i.e., x̂_{i,t+h} = ĉ_{j,t+h} + ŝ_{i,t+h}, where j is chosen such that i ∈ Ĉ_{j,t+h}. We explain how to find the forecasted cluster Ĉ_{j,t+h} and the offset ŝ_{i,t+h} in the following. We define M′ as the number of time steps to look back into the history (excluding the current time step t). For each node i, consider the time steps within the interval [t − M′, t], and compute the frequency with which node i belongs to the j-th cluster C_{j,t} within this time interval, for all j. Let j* denote the cluster that node i belongs to for the most time within [t − M′, t]. The algorithm then predicts that node i belongs to the j*-th cluster in time step t + h. By finding j* for all i, the forecasted cluster Ĉ_{j,t+h} is obtained for all j.

For node i ∈ Ĉ_{j,t+h}, the offset ŝ_{i,t+h} is computed as

ŝ_{i,t+h} = (1/(M′+1)) Σ_{m=0}^{M′} α_{t−m} (z_{i,t−m} − c_{j,t−m})    (12)

where α_{t−m} ∈ (0, 1] is a scaling coefficient that ensures the cluster centroid plus the offset, c_{j,t−m} + α_{t−m}(z_{i,t−m} − c_{j,t−m}), still belongs to cluster j in time step t − m, i.e., its value is still closest to the centroid c_{j,t−m} of cluster j compared to the centroids of all other clusters. If z_{i,t−m} belongs to cluster j, we choose α_{t−m} = 1. Otherwise, we choose α_{t−m} as the largest value such that c_{j,t−m} + α_{t−m}(z_{i,t−m} − c_{j,t−m}) belongs to cluster j. This is useful because we do not want the offset to be so large that the resulting estimated value belongs to a different cluster (other than cluster j), as the forecasted x̂_{i,t+h} is still based on the forecasted centroid ĉ_{j,t+h} of cluster j.
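The sketch below puts the two ingredients of this subsection together for a single node i: the majority vote over [t − M′, t] that determines the forecasted cluster membership, and the offset of (12). The paper does not prescribe how the largest feasible α_{t−m} is found; the binary search below is one simple realization, which uses the fact that the feasible α values form an interval starting at 0. All function and variable names are ours, and measurements are assumed to be stored as arrays of shape (d,) with the K centroids as an array of shape (K, d).

```python
import numpy as np

def forecast_cluster(membership_history, M_prime):
    """Majority vote: the cluster index that node i belonged to most often
    during the last M'+1 time steps (the interval [t-M', t])."""
    window = membership_history[-(M_prime + 1):]
    return max(set(window), key=window.count)

def largest_alpha(z_im, centroids, j, iters=30):
    """Largest alpha in (0, 1] such that c_j + alpha*(z_im - c_j) is still
    closest to centroid j (binary search within the Voronoi cell of c_j)."""
    centroids = np.asarray(centroids, dtype=float)
    def stays_in_cluster(alpha):
        p = centroids[j] + alpha * (z_im - centroids[j])
        return np.argmin(np.linalg.norm(centroids - p, axis=1)) == j
    if stays_in_cluster(1.0):
        return 1.0          # covers the case where z_{i,t-m} already belongs to cluster j
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if stays_in_cluster(mid) else (lo, mid)
    return lo

def offset(z_hist_i, centroid_hist_j, all_centroids_hist, j, M_prime):
    """Offset s_hat_{i,t+h} of (12), averaged over m = 0, ..., M'."""
    total = np.zeros_like(np.asarray(z_hist_i[-1], dtype=float))
    for m in range(M_prime + 1):
        z_im = np.asarray(z_hist_i[-(m + 1)], dtype=float)          # z_{i,t-m}
        c_jm = np.asarray(centroid_hist_j[-(m + 1)], dtype=float)   # c_{j,t-m}
        alpha = largest_alpha(z_im, all_centroids_hist[-(m + 1)], j)
        total += alpha * (z_im - c_jm)
    return total / (M_prime + 1)
```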
VI. EXPERIMENTATION RESULTS

A. Setup

We evaluate the performance of our proposed approach on three real-world computing cluster datasets.

1) Datasets: The first dataset is the Alibaba cluster trace (version 2018) [33] that includes CPU and memory utilizations of 4,000 machines over a period of 8 days. The raw measurements are sampled at 1 minute intervals (i.e., each local node obtains a new measurement every minute) and the entire compressed dataset is about 48 GB. The second dataset is the Rnd trace of the GWA-T-12 Bitbrains dataset [34]. It contains 500 machines, the data is collected over a period of 3 months (we only use data in the first month because there is a 24-hour gap between different months), and raw measurements are sampled at 5 minute intervals. The size of the dataset is 156 MB. The third dataset is the Google cluster usage trace (version 2) [25], which contains job/task usage information of approximately 12,478 machines⁵ over 29 days, sampled at 5 minute intervals. The total size of the compressed dataset is approximately 41 GB. For each dataset, we pre-processed the raw data to obtain the normalized CPU and memory utilizations for each individual machine.

⁵We had to remove 2 machines that have errors in the measurement data (unreasonably high CPU/memory utilization).

2) Choice of Parameters: Unless otherwise specified, we set the transmission frequency constraint B_i = B := 0.3 for all i, the control parameters for adaptive transmission V_0 = 10⁻¹² and γ = 0.65, the number of forecasting models (which is equal to the number of clusters) K = 3, and the look-back durations for the similarity measure M = 1 and temporal forecasting M′ = 5. The clustering is performed on the scalar values of the measurements of each resource type, unless noted otherwise. These parameter choices are justified in our experiments, which will be further discussed later in this section.

3) Forecasting Models: We use ARIMA and LSTM models for temporal forecasting. For the ARIMA model, after making some initial observations of the stationarity, autocorrelation, and partial autocorrelation functions, we conduct a grid search over the following ranges of parameters: the order of the autoregressive terms p ∈ [0, 5], the degree of differencing d ∈ [0, 2], the order of the moving average terms q ∈ [0, 5], and for the corresponding seasonal components: P ∈ [0, 2], D ∈ [0, 1], Q ∈ [0, 2]. The best model is selected from the grid search using the Akaike information criterion with correction term (AICc) [35]. For the LSTM model, we stacked two LSTM layers, and on top of that we stacked a dense layer with a rectified linear unit (ReLU) as the activation function. Due to the randomness of LSTM, we plot the average forecasting results over 10 different simulation runs.
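As an illustration, the non-seasonal part of this grid search could look like the sketch below, which uses statsmodels' SARIMAX and computes AICc manually from the AIC. The seasonal components (P, D, Q) searched in the paper are omitted here for brevity, and the treatment of the sample size in the AICc correction is a simplification of ours.

```python
import itertools
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def select_arima(y, p_max=5, d_max=2, q_max=5):
    """Grid search over (p, d, q) and return the fit with the smallest AICc.
    y: 1-D array holding the centroid time series of one cluster."""
    best_aicc, best_order, best_res = np.inf, None, None
    n = len(y)
    for p, d, q in itertools.product(range(p_max + 1), range(d_max + 1), range(q_max + 1)):
        try:
            res = SARIMAX(y, order=(p, d, q)).fit(disp=False)
        except Exception:
            continue                           # some orders may fail to converge
        k = len(res.params)                    # number of estimated parameters
        aicc = res.aic + (2 * k * (k + 1)) / max(n - k - 1, 1)
        if aicc < best_aicc:
            best_aicc, best_order, best_res = aicc, (p, d, q), res
    return best_aicc, best_order, best_res

# Multi-step forecasting h steps ahead with the selected model:
#   _, order, res = select_arima(centroid_series)
#   future_centroids = res.forecast(steps=h)
```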
For both ARIMA and LSTM, the initial data collection phase includes the first 1000 time steps. Then, the models are retrained every 288 time steps, equivalent to a day when the raw measurements are sampled at 5 minute intervals. For each cluster j, a separate model is trained for forecasting the centroids of this cluster. At every time step t, forecasting is made for a given number of time steps h ahead.
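A sketch of the stacked LSTM forecaster described in Section VI-A3 is shown below using tf.keras. The number of units per layer, the input window length, and the training settings are illustrative assumptions of ours; the paper only specifies the two stacked LSTM layers and the ReLU dense output layer.

```python
import numpy as np
import tensorflow as tf

WINDOW = 12   # past centroid values fed to the model (assumed)
UNITS = 32    # LSTM units per layer (assumed)

def build_lstm():
    """Two stacked LSTM layers followed by a dense layer with ReLU activation."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(WINDOW, 1)),
        tf.keras.layers.LSTM(UNITS, return_sequences=True),
        tf.keras.layers.LSTM(UNITS),
        tf.keras.layers.Dense(1, activation="relu"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def make_windows(series):
    """Turn one centroid time series into (input window, next value) training pairs."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    y = series[WINDOW:]
    return X[..., np.newaxis], y

# Usage sketch: retrain on the historical centroids of one cluster every 288 steps.
#   X, y = make_windows(centroid_series)
#   model = build_lstm(); model.fit(X, y, epochs=20, verbose=0)
```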
We present results on different aspects of our proposed mechanism in the following.

Remark: As mentioned in Section II, to the best of our knowledge, there does not exist work in the literature that solves the entire problem in our setting. Therefore, we cannot compare our overall method with another existing approach. We will compare individual parts of our method with existing work where possible.

B. Adaptive Transmission Algorithm

We first study some behavior of the algorithm presented in Section V-A. Fig. 3 shows that the required transmission frequency B always matches closely with the actual transmission frequency (with parameters V_0 and γ chosen as described in Section VI-A2). This confirms that the algorithm is able to adapt the transmission frequency to remain within the B_i constraint in (5).

Fig. 3: Behavior of the adaptive transmission algorithm.

In Fig. 4, we compare our proposed adaptive transmission approach with a uniform sampling approach, and show the time-averaged RMSE as defined in (4) with h = 0 and T equal to the total number of time steps in the dataset (recall that we defined x̂_{i,t} := z_{i,t} for h = 0, so the RMSE only includes error caused by infrequent transmission in this case). The uniform sampling baseline transmits each local node's measurement at a fixed interval, so that the average transmission frequency at each node i is equal to B_i. We see that our proposed approach outperforms the uniform sampling approach for any required transmission frequency. When the required transmission frequency is 1.0, we always have z_{i,t} = x_{i,t} and the RMSE is zero for both approaches.

Fig. 4: RMSE comparison of our proposed adaptive transmission method with the uniform sampling method.

C. Spatial Estimation without Per-node Offset

In this subsection, we evaluate the impact of using cluster centroids to represent the group of nodes in the cluster, where we ignore the offset ŝ_{i,t+h} and choose h = 0. We evaluate the intermediate RMSE, which is the time-averaged RMSE between the data and their closest cluster centroids. This evaluation is useful because the forecasting models are trained on cluster centroids, so we would like the cluster centroids to be not too far from the actual measurements at each node even if there is no per-node offset added to the estimated value. It also provides useful insights on the clustering mechanism.

1) Impact of Clustering Dimensions: We first discuss the impact of the different dimensions we cluster over time and over resource types. As mentioned in Section V-B, we can cluster either on the measurement obtained at a single time step or at multiple time steps, i.e., over different temporal dimensions. Fig. 5 shows the results of intermediate RMSE when we vary the temporal clustering dimension, where we cluster CPU and memory measurements separately and independently. We see that using a temporal clustering dimension of 1 (i.e., clustering the measurements obtained at a single time step) always gives the best performance.

Fig. 5: Intermediate RMSE of clustering different temporal dimensions.

Section V-B also mentions that we can either cluster different resource types independently using their scalar values, or we can jointly cluster vectors of multiple resource types. Table I compares the intermediate RMSEs of these two approaches, where the intermediate RMSEs are always computed for individual resource types, but the clustering is computed either on independent scalars or full vectors. We see that clustering using scalar values of each resource type performs better than clustering using the full vector. This suggests that the correlation among different types of resources in each dataset is relatively weak.

TABLE I: Intermediate RMSE of clustering independent scalars & full vectors

Resource type & dataset | Scalar | Full
CPU Alibaba             | 0.069  | 0.075
Memory Alibaba          | 0.066  | 0.072
CPU Bitbrains           | 0.086  | 0.089
Memory Bitbrains        | 0.096  | 0.098
CPU Google              | 0.063  | 0.082
Memory Google           | 0.055  | 0.067

The above results show that it is beneficial to use scalar measurement values of each resource type at a single time step for clustering. We will use this setting in all our experiments presented next.

2) Different Clustering Methods: We compare our proposed dynamic clustering approach with two baselines. The first baseline, static clustering, is an offline baseline, where nodes are grouped into static clusters based on the entire time series at each node, which is assumed to be known in advance. The clusters are found using K-means on multi-dimensional vectors, where each vector represents the entire time series at a node. With this setting, the clusters remain fixed over all time steps. The second baseline, minimum distance, is obtained
by randomly selecting K nodes at each time step, treating the selected nodes as "centroids" and mapping the remaining nodes to the "centroids" based on the minimum Euclidean distance between measurements. The minimum distance baseline represents approaches which select monitoring nodes randomly, such as [6]–[10].

Fig. 6 shows the intermediate RMSE with varying B while fixing K = 3. We can see that our proposed approach performs better than the baseline approaches in (almost) all cases. Note that the static approach is an offline baseline with stronger assumptions than our proposed online approach. We also see that in most cases, the intermediate RMSE starts to converge at approximately B = 0.3. This shows that a transmission frequency higher than 0.3 will not provide much benefit.

Fig. 6: Intermediate RMSE when varying the transmission frequency B and fixing K = 3.

Fig. 7 shows the results with varying K while fixing B = 0.3. We see that the intermediate RMSE of the proposed approach is close to the lowest value even with only a few clusters (i.e., a small value of K). This is a strong result because it shows that a small number of cluster centroids is sufficient for representing a large number of nodes. We also note that because B = 0.3, the measurements stored at the central node are not always up-to-date, which explains why the intermediate RMSE is larger than zero even when K = N.

Fig. 7: Intermediate RMSE when varying the number of clusters K and fixing B = 0.3.

The above observations explain the rationale behind choosing B = 0.3 and K = 3 as default parameters as mentioned in Section VI-A2. In general, we can conclude that our proposed approach can provide close to optimal clustering error by using a small transmission frequency and a very small number of clusters, which significantly reduces the communication and computation overhead for system monitoring.

D. Joint Spatial Estimation and Temporal Forecasting (with Per-node Offset)

We now consider the entire pipeline with joint spatial estimation (through dynamic clustering) and temporal forecasting. We include the per-node offset ŝ_{i,t+h} in this subsection and focus on the time-averaged RMSE as defined in (4).

1) Different Forecasting Models: We compare our predictions based on ARIMA and LSTM with a sample-and-hold prediction method, which simply uses the cluster centroid values at time step t as the predicted future values. We also compare with the standard deviation computed over all resource utilizations over time (except for the instantaneous plot in Fig. 8). The standard deviation serves as an error upper bound of an offline mechanism where forecasting is made only based on long-term statistics (such as the mean value) without considering temporal correlation.

We first show the instantaneous true and forecasted CPU utilization values of three different centroids for t ∈ [1000, 2000] with the Alibaba dataset in Fig. 8, where the forecasting is for h = 5 steps ahead. We see that with our methods, the trajectories of the forecasted centroid values from all models follow those of the true centroid values very closely.

Fig. 8: Instantaneous true and forecasted (h = 5) results of K = 3 centroids on CPU data of the Alibaba dataset: (a) centroid j = 1, (b) centroid j = 2, (c) centroid j = 3.

The time-averaged RMSE with different forecasting models is shown in Fig. 9, where we include results for both K = 3
and K = N for the sample-and-hold method, and use the default K = 3 for all the other methods. Also note that the standard deviation does not depend on K. We see that although sample-and-hold is simple enough to run on every local node (i.e., K = N), the case with K = N generally performs worse than the cases with K = 3. This is due to the fluctuation of resource utilization at individual nodes, which makes the forecasting model perform badly when running on every node. The cluster centroids are averages of data at multiple nodes, which remove noisy fluctuations and provide better performance. LSTM performs the best among all the models, which is expected since LSTM is the most complex and advanced model compared to the others. We also see that the RMSE is lower than the standard deviation for most forecasting models when the forecasting step h ≤ 50. This shows that our forecasting mechanism, which takes into account both spatial and temporal correlations, is beneficial over mechanisms that are only based on long-term statistics.

Fig. 9: Time-averaged RMSE with different number of forecasting steps (h), with our proposed dynamic clustering approach.

Table II shows the total (aggregated) computation time used for training the ARIMA and LSTM models for the entire duration of one centroid, on a regular personal computer (without GPU) with an Intel Core i7-6700 3.4 GHz CPU and 16 GB memory. The model is trained or re-trained at each of the initial training and retraining periods defined in Section VI-A3, and the result shown in Table II is the sum of the computation time for training at all periods. We can see that for data traces that span over at least multiple days, the total computation time used for model training is only a few minutes. Since we only need to train K = 3 models, the computation overhead (time) for training forecasting models is very small compared to the entire monitoring duration.

TABLE II: Aggregated training time (in seconds) of the forecasting model on one centroid over the entire duration of the dataset

Dataset                                    | ARIMA | LSTM
Alibaba dataset (11519 total time steps)   | 61.25 | 855.34
Bitbrains dataset (8259 total time steps)  | 33.4  | 554.97
Google dataset (8350 total time steps)     | 37.86 | 554.97

In the remainder of this subsection, we use the sample-and-hold method (with K = 3) for forecasting and consider the impact of other aspects on the RMSE.

2) Different Clustering Methods: We consider the different clustering methods as in Section VI-C2 combined with temporal forecasting. The RMSE results with different forecasting steps (h) are shown in Fig. 10. We see that our proposed approach performs the best in almost all cases. For long-term forecasting with large h, the static clustering method often performs similarly to our proposed approach, because when there are fluctuations, dynamic clustering may not perform as well as static clustering over long time periods. Note, however, that the static clustering baseline is an offline method which requires knowledge of the entire time series beforehand, thus it is not really applicable in practice.

Fig. 10: Time-averaged RMSE with different number of forecasting steps (h) using the sample-and-hold method.

3) Different Values of M and M′: Table III shows the RMSE with different values of M and M′ on the Google dataset with the CPU resource, where we recall that M and M′ are the number of time steps to look back into history when computing the similarity measure and the forecasted cluster (and per-node offset), respectively (see Sections V-B and V-C). We observe that the optimal choices of M and M′ depend on the forecasting step h. Generally, M = 1 is a reasonably good value for all cases. The optimal value of M′ tends to increase with h. This means that the farther ahead we would like to forecast, the more we should look back into the history when determining the cluster membership and offset values of local nodes, which is intuitive because we need to rely more on long-term (stable) characteristics when forecasting farther ahead into the future. We choose M′ = 5 as the default in Section VI-A2, which is a relatively good value for different h.

TABLE III: RMSE with different values of M and M′ for the Google dataset with CPU resource

h = 1:
        | M′ = 1 | M′ = 5 | M′ = 12 | M′ = 100
M = 1   | 0.055  | 0.068  | 0.071   | 0.106
M = 5   | 0.058  | 0.068  | 0.068   | 0.098
M = 12  | 0.059  | 0.048  | 0.046   | 0.050
M = 100 | 0.065  | 0.089  | 0.047   | 0.055

h = 5:
        | M′ = 1 | M′ = 5 | M′ = 12 | M′ = 100
M = 1   | 0.088  | 0.073  | 0.076   | 0.108
M = 5   | 0.105  | 0.081  | 0.074   | 0.099
M = 12  | 0.117  | 0.079  | 0.076   | 0.097
M = 100 | 0.091  | 0.0899 | 0.078   | 0.101

h = 10:
        | M′ = 1 | M′ = 5 | M′ = 12 | M′ = 100
M = 1   | 0.098  | 0.082  | 0.081   | 0.107
M = 5   | 0.121  | 0.095  | 0.080   | 0.099
M = 12  | 0.129  | 0.102  | 0.081   | 0.098
M = 100 | 0.104  | 0.112  | 0.084   | 0.101

4) Proposed Similarity Measure vs. Jaccard Index: As discussed in Section V-B, the Jaccard index used in [20] is another possible similarity measure that one could use. In Fig. 11, we compare the RMSE when using our proposed similarity measure and the Jaccard index. Our proposed similarity measure gives a better or similar performance in all cases.

Fig. 11: Time-averaged RMSE with the Jaccard index and our proposed similarity measure.
E. Comparison to Gaussian-based Method in [3]

Finally, we modify our setup and compare our proposed approach with the Gaussian-based method in [3].

The method in [3] includes separate training and testing phases, both set to 500 time steps (which is the value chosen in [3]). During the training phase, the central node receives measurements from every node (i.e., B = 1) and uses this information to select a subset of nodes (K ≪ N) that will continue to send measurements during the testing phase. This subset of K nodes is called monitors. During the testing phase, the central node receives measurements only from the selected nodes (which is equivalent to having a transmission frequency of B = K/N), and the measurements of the non-monitor nodes are inferred based on the measurements from the monitors. There is no temporal forecasting in this mechanism.

We adapt our proposed approach to the above setting with separate training and testing phases as follows. During training, we perform K-means clustering, where we group nodes into clusters based on their 500 latest measurements (i.e., we perform K-means on 500-dimensional vectors). This gives us K clusters of nodes. We select the one node in each cluster that has the smallest Euclidean distance from the centroid of this cluster. We consider this node as a monitor. During testing, we only receive measurements from the monitors. The resource utilizations at all nodes that belong to the same cluster as the monitor are estimated as equal to the measurement of the monitor. The minimum distance baseline in this setting is one that selects the K monitors randomly, and the other nodes are assigned to clusters based on their Euclidean distances from the monitors, where each cluster contains one monitor. Three algorithms that are proposed in [3] are also considered as baselines: Top-W, Top-W-Update, and Batch Selection, which are based on Gaussian models.
approach with the Gaussian-based method in [3]. are based on Gaussian models.
The method in [3] includes separate training and testing
phases, both set to 500 time steps (which is the value chosen We only use 100 randomly selected machines in this exper-
in [3]). During the training phase, the central node receives iment, because the approaches in [3] are too time-consuming
measurements from every node (i.e., B = 1) and uses this to run on the entire dataset. The results of RMSE defined
information to select a subset of nodes (K  N ) that will on the estimation method described in this subsection above6
continue to send measurements during the testing phase. This are shown in Fig. 12 and the computational time of different
subset of K nodes is called monitors. During the testing phase, approaches (on computer with Intel Core i7-6700 3.4 GHz
the central node receives measurements only from the selected CPU, 16 GB memory) is shown in Table IV. We see that
nodes (which is equivalent to having a transmission frequency our proposed approach provides the smallest RMSE, and it
of B = K runs much faster than the three approaches (Top-W, Top-W-
N ), and the measurements of the non-monitor nodes
are inferred based on the measurements from the monitors. Update, and Batch Selection) from [3]. This observation is
There is no temporal forecasting in this mechanism. consistent with our discussions in Sections II and III that
We adapt our proposed approach to the above setting with Gaussian models do not work well in our setting.
separate training and testing phases as follows. During train- 6 Note that this RMSE definition is different from that in earlier parts of
ing, we perform K-means clustering, where we group nodes this paper.
VII. CONCLUSION

In this paper, we have proposed a novel mechanism for the efficient collection and forecasting of resource utilization at different machines in large-scale distributed systems. The mechanism is a tight integration of algorithms for adaptive transmission, dynamic clustering, and temporal forecasting, with the goal of minimizing the RMSE of both spatial estimation and temporal forecasting. Experiments on three real-world datasets show the effectiveness of our approach compared to baseline methods. Future work can study the integration of our approach with resource allocation and other system management mechanisms.

REFERENCES

[1] B. Cai, R. Zhang, L. Zhao, and K. Li, “Less provisioning: A fine-grained resource scaling engine for long-running services with tail latency guarantees,” in Proceedings of the 47th International Conference on Parallel Processing, ser. ICPP 2018. New York, NY, USA: ACM, 2018, pp. 30:1–30:11. [Online]. Available: http://doi.acm.org/10.1145/3225058.3225113
[2] M. Grechanik, Q. Luo, D. Poshyvanyk, and A. Porter, “Enhancing rules for cloud resource provisioning via learned software performance models,” in Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’16. New York, NY, USA: ACM, 2016, pp. 209–214. [Online]. Available: http://doi.acm.org/10.1145/2851553.2851568
[3] S. Silvestri, R. Urgaonkar, M. Zafer, and B. J. Ko, “An online method for minimizing network monitoring overhead,” in Distributed Computing Systems (ICDCS), 2015 IEEE 35th International Conference on. IEEE, 2015, pp. 268–277.
[4] A. Y. Nikravesh, S. A. Ajila, and C.-H. Lung, “Towards an autonomic auto-scaling prediction system for cloud resource provisioning,” in Proceedings of the 10th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, ser. SEAMS ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 35–45. [Online]. Available: http://dl.acm.org/citation.cfm?id=2821357.2821365
[5] H. Shen and L. Chen, “Resource demand misalignment: An important factor to consider for reducing resource over-provisioning in cloud datacenters,” IEEE/ACM Transactions on Networking, vol. 26, no. 3, pp. 1207–1221, June 2018.
[6] G. Coluccia, E. Magli, A. Roumy, and V. Toto-Zarasoa, “Lossy compression of distributed sparse sources: a practical scheme,” in Signal Processing Conference, 2011 19th European. IEEE, 2011, pp. 422–426.
[7] J. E. Barceló-Lladó, A. M. Pérez, and G. Seco-Granados, “Enhanced correlation estimators for distributed source coding in large wireless sensor networks,” IEEE Sensors Journal, vol. 12, no. 9, pp. 2799–2806, 2012.
[8] C. Anagnostopoulos and S. Hadjiefthymiades, “Advanced principal component-based compression schemes for wireless sensor networks,” ACM Transactions on Sensor Networks (TOSN), vol. 11, no. 1, p. 7, 2014.
[9] Y. Li and Y. Liang, “Compressed sensing in multi-hop large-scale wireless sensor networks based on routing topology tomography,” IEEE Access, vol. 6, pp. 27637–27650, 2018.
[10] M. Leinonen, M. Codreanu, and M. Juntti, “Compressed acquisition and progressive reconstruction of multi-dimensional correlated data in wireless sensor networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 6449–6453.
[11] Y. Zhang, T. N. Hoang, K. H. Low, and M. S. Kankanhalli, “Near-optimal active learning of multi-output Gaussian processes,” in AAAI, 2016, pp. 2351–2357.
[12] A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies,” Journal of Machine Learning Research, vol. 9, no. Feb, pp. 235–284, 2008.
[13] C. Liu, K. Wu, and M. Tsao, “Energy efficient information collection with the ARIMA model in wireless sensor networks,” in Global Telecommunications Conference, 2005 (GLOBECOM ’05), vol. 5. IEEE, 2005, 5 pp.
[14] Y. W. Law, S. Chatterjea, J. Jin, T. Hanselmann, and M. Palaniswami, “Energy-efficient data acquisition by adaptive sampling for wireless sensor networks,” in Proceedings of the 2009 International Conference on Wireless Communications and Mobile Computing: Connecting the World Wirelessly. ACM, 2009, pp. 1146–1151.
[15] H. Harb, A. Makhoul, A. Jaber, R. Tawil, and O. Bazzi, “Adaptive data collection approach based on sets similarity function for saving energy in periodic sensor networks,” International Journal of Information Technology and Management, vol. 15, no. 4, pp. 346–363, 2016.
[16] S. Chatterjea and P. Havinga, “An adaptive and autonomous sensor sampling frequency control scheme for energy-efficient data acquisition in wireless sensor networks,” in International Conference on Distributed Computing in Sensor Systems. Springer, 2008, pp. 60–78.
[17] A. K. Idrees and A. K. M. Al-Qurabat, “Distributed adaptive data collection protocol for improving lifetime in periodic sensor networks,” IAENG International Journal of Computer Science, vol. 44, no. 3, 2017.
[18] D. Chakrabarti, R. Kumar, and A. Tomkins, “Evolutionary clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 554–560.
[19] K. S. Xu, M. Kliger, and A. O. Hero III, “Adaptive evolutionary clustering,” Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 304–336, 2014.
[20] D. Greene, D. Doyle, and P. Cunningham, “Tracking the evolution of communities in dynamic social networks,” in Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on. IEEE, 2010, pp. 176–183.
[21] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, “Detecting communities and their evolutions in dynamic social networks – a Bayesian approach,” Machine Learning, vol. 82, no. 2, pp. 157–189, 2011.
[22] P. C. Ma, K. C. Chan, X. Yao, and D. K. Chiu, “An evolutionary clustering algorithm for gene expression microarray data analysis,” IEEE Transactions on Evolutionary Computation, vol. 10, no. 3, pp. 296–314, 2006.
[23] C. Rana and S. K. Jain, “An evolutionary clustering algorithm based on temporal features for dynamic recommender systems,” Swarm and Evolutionary Computation, vol. 14, pp. 21–30, 2014.
[24] P. Bodik, W. Hong, C. Guestrin, S. Madden, M. Paskin, and R. Thibaux, “Intel lab data,” online dataset, 2004.
[25] C. Reiss, J. Wilkes, and J. L. Hellerstein, “Google cluster-usage traces: format + schema,” Google Inc., White Paper, pp. 1–14, 2011.
[26] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, “NP-hardness of Euclidean sum-of-squares clustering,” Machine Learning, vol. 75, no. 2, pp. 245–248, May 2009.
[27] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
[28] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[29] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[30] T. W. Liao, “Clustering of time series data – a survey,” Pattern Recognition, vol. 38, no. 11, pp. 1857–1874, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320305001305
[31] G. E. Box, G. M. Jenkins, and J. F. MacGregor, “Some recent advances in forecasting and control,” Applied Statistics, pp. 158–179, 1974.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[33] “Alibaba trace,” https://github.com/alibaba/clusterdata/tree/v2018, 2018.
[34] S. Shen, V. Van Beek, and A. Iosup, “Workload characterization of cloud datacenter of Bitbrains,” TU Delft, Tech. Rep. PDS-2014-001, 2014.
[35] K. P. Burnham and D. R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Science & Business Media, 2003.