ATOM: Efficient Tracking, Monitoring, and Orchestration of Cloud Resources
IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 8, August 2017
Abstract—The emergence of the Infrastructure as a Service (IaaS) framework brings new opportunities, but it is also accompanied by new challenges in auto scaling, resource allocation, and security. A fundamental challenge underpinning these problems is the continuous tracking and monitoring of resource usage in the system. In this paper, we present ATOM, an efficient and effective framework to automatically track, monitor, and orchestrate resource usage in an IaaS system that is widely used in cloud infrastructure. We use a novel tracking method to continuously track important system usage metrics with low overhead, and develop a Principal Component Analysis (PCA) based approach to continuously monitor and automatically find anomalies based on the approximated tracking results. We show how to dynamically set the tracking threshold based on the detection results and, further, how to adjust the tracking algorithm to ensure its optimality under dynamic workloads. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on VMs, guided by the analyzed results from tracking and monitoring, to identify malicious behavior inside a VM. We demonstrate the extensibility of ATOM through virtual machine (VM) clustering. The performance of our framework is evaluated in an open source IaaS system.
Index Terms—Infrastructure as a service, cloud, tracking, monitoring, anomaly detection, virtual machine introspection
1 INTRODUCTION
and orchestration (for VM introspection) into one framework, whereas UBL focuses on anomaly detection in performance data without the integration of tracking and orchestration. Hence, UBL is “equivalent” to the monitoring component in ATOM.

More specifically, UBL can be plugged into ATOM’s monitoring component as an alternative anomaly detection method, making it more effective in capturing different types of anomalies. Note that the PCA-based approach has the advantage of enabling us to analyze the theoretical bounds when there are bounded tracking errors in the continuously tracked measurements returned by the tracking component. UBL is an empirical method that may perform very well on some instances, but it remains an open problem to study its performance theoretically, especially with approximate measurements when it is used together with ATOM’s tracking module. The PCA-based approach also allows us to adjust the tracking threshold automatically in an online fashion by adjusting only the false alarm rate, as shown later in Section 5, where we establish the theoretical connection between the false alarm rate and the tracking threshold.

Paper Organization. The rest of this paper is organized as follows. Section 2 gives an overview of the design of ATOM and the threat model it considers. Sections 3 and 4 describe the online tracking and the online monitoring modules in ATOM. We further demonstrate the interaction between the tracking component and the monitoring component in Section 5. Section 6 introduces the orchestration module. Section 7 shows an extension to VM clustering using the ATOM framework. Section 8 evaluates ATOM using the Eucalyptus cloud and shows its effectiveness. Lastly, Section 9 reviews the related work, and Section 10 concludes the paper.

Fig. 2. The ATOM framework.

TABLE 1
Frequently Used Notations
Symbol          Definition
$\Delta$        tracking threshold
$\gamma$        finest resolution for floating point values
$t$             number of time instances in a sliding window
$n$             number of monitored VMs
$d'$            number of metrics for each VM
$d$             $d' \times n$
$M$             data matrix ($t \times d$) of the most recent monitored data
$avg_j$         mean of the $j$th column in $M$
$std_j$         standard deviation of the $j$th column in $M$
$Y$             standardized $M$, each value $y_{i,j} = (m_{i,j} - avg_j)/std_j$
$t_{now}$       current time-stamp
$A$             consecutively abnormal data from $t_{now} - t$ to $t_{now}$
$B$             standardized $A$
$z$             the metric vector monitored at $t_{now}$ (with $d$ dimensions)
$x$             standardized $z$
$v_i$           the $i$th eigenvector output by PCA
$\lambda_i$     the $i$th eigenvalue output by PCA
$k$             number of principal components output by PCA
$\alpha$        input false alarm rate in PCA anomaly detection
$Q_\alpha$      PCA anomaly detection threshold
$\mu$           false alarm rate deviation, to control tracking threshold

dramatically reduces the overhead used to monitor cloud resources and enables continuous measurements to CC and CLC;
(2) Monitoring component (anomaly detection): ATOM adds this component in the CLC to analyze the results produced by the tracking component, which provides continuous resource usage data in real time. It uses a modified PCA method to continuously track the divided subspace, as defined by the multi-dimensional values from the tracking results, and automatically detects anomalies by identifying a notable shift in the subspace of interest. When this happens, it also generates anomaly information for further analysis by the orchestration component. The monitoring component also adjusts the tracking threshold of the tracking component dynamically online, based on the data trends and a desired false alarm rate.
(3) Orchestration component (introspection and debugging): when a potential anomaly is identified by the monitoring component, an INTROSPECT request along with anomaly information is sent to the orchestration component on the NC, in which VMI tools (such as LibVMI [9]) and VM debugging tools (such as StackDB [10]) are used to identify the anomalous behavior inside a VM and raise an alarm to cloud users for further analysis.
In the following sections we investigate each component in further detail. Table 1 lists some frequently-used notations.
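As a concrete reading of the notation in Table 1, the sketch below standardizes a small data matrix $M$ column by column into $Y$ using $y_{i,j} = (m_{i,j} - avg_j)/std_j$. It is only an illustration, not ATOM's implementation: the function name `standardize`, the use of the population standard deviation, and the choice to map a constant column to zeros are assumptions made here, and the sample window is hypothetical.

```python
from statistics import fmean, pstdev

def standardize(M):
    """Column-wise standardization of a t x d matrix given as a list of rows.

    Returns (Y, avg, std) with Y[i][j] = (M[i][j] - avg[j]) / std[j].
    Assumptions (not from the paper): population standard deviation is used,
    and a constant column (std == 0) is mapped to 0.0 to avoid dividing by zero.
    """
    cols = list(zip(*M))                  # d columns, each of length t
    avg = [fmean(c) for c in cols]
    std = [pstdev(c) for c in cols]
    Y = [
        [(row[j] - avg[j]) / std[j] if std[j] > 0 else 0.0
         for j in range(len(cols))]
        for row in M
    ]
    return Y, avg, std

# Hypothetical window: t = 4 time instances, d = 2 metrics.
M = [[10.0, 200.0], [20.0, 200.0], [30.0, 200.0], [40.0, 200.0]]
Y, avg, std = standardize(M)
```

The same `avg` and `std` would be reused to turn a newly observed vector $z$ into its standardized form $x$.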
Also, despite the various security rules and policies that are in place, it’s still possible that a smart attacker could bypass them and perform malicious tasks. The malicious behavior is very likely to cause some change in resource usage. Note, however, that this is not necessarily accompanied by more resource consumption! Some attacks could actually lead to less resource usage, or simply to different ways of using the same amount of resources on average. All these attacks are targeted by the ATOM framework. The possibility of incorporating other types of attacks into ATOM is discussed in Sections 6 and 8.7.

3 TRACKING COMPONENT

This section introduces the tracking component in ATOM. Consider Eucalyptus CloudWatch as an example, which is an AWS CloudWatch compatible monitoring service that enables cloud users to monitor their cloud resources and make operational decisions based on the statistics. CloudWatch is capable of collecting, aggregating, and dispensing data from resources such as VMs and storage. Cloud users can specify what they would like to monitor, and then query the history data for up to two weeks through the interface in the CLC. They can also set an alarm (essentially, a threshold) for a specific measure, and be notified, or let it trigger some predefined action, if the alarm conditions are met. Clearly, collecting such statistics continuously is expensive. Thus, the default in Eucalyptus and AWS is to ask an NC to send measurements to the CLC only at some predefined interval, e.g., once every minute in Eucalyptus.

A user VM in Eucalyptus is called an instance. In the following we will use the terms “instance” and “VM” interchangeably. There are various variables that can be monitored over time on each instance, each of which is called a metric. The measurement for each metric, for example, Percent for CPUUtilization, Count for DiskReadOps and DiskWriteOps, Bytes for DiskReadBytes, DiskWriteBytes, NetworkIn and NetworkOut, is called Unit and is numerical.

A continuous understanding of these values is much more useful than a periodic, discretely sampled view that is only available, say, every minute. But doing so is expensive; an NC needs to constantly send data to the CLC. A key observation is that, for most purposes, cloud users may not be interested in the exact value at every time instance. Thus, a continuous understanding of these values within some predefined error range is an appealing alternative. For example, it’s acceptable to learn that CPUUtilization is guaranteed to be within 3 percent of its exact value at any time instance.

This way the NC only sends a value whenever the newest one is more than $\Delta$ away from the last sent value on a measurement, where $\Delta$ is a user-specified, maximum allowed error on this measurement. The CLC could use the last received value as an acceptable approximation for all values in between. In practice, certain metrics on a VM often do not change much over a long period. Thus far fewer values need to be sent to the CLC. Not only can we save the communication overhead from the NC to the CLC, but also the database space on the CLC used to store every value reported by the NC (so that the history data could be kept for much longer than two weeks). Furthermore, instead of having only a sampled view at every minute, users can now query values at any time instance in the entire history that is available.

But unfortunately, this seemingly natural idea may perform very badly in practice. In fact, in the worst case, its asymptotic cost is infinite in terms of competitive ratio over the optimal offline algorithm that knows the entire data series in advance. For example, suppose the first value the NC observes is 0 and then it oscillates between 0 and $\Delta + 1$. Then the NC keeps sending 0 and $\Delta + 1$ to the CLC, while the optimal offline algorithm that knows the entire $f(t)$ at the beginning could send only one message to the CLC: the value $\Delta/2$. Formally, this is known as the online tracking problem, which is formalized and studied in [11]. In online tracking, an observer observes a function $f(t)$ in an online fashion, which means she sees $f(t)$ for any time $t$ before the current time (including the current time). A tracker would like to keep track of the current function value within some predefined error. The observer needs to decide when and what value she needs to send to the tracker so that the communication cost is minimized.

Suppose function $f: \mathbb{Z}^+ \to \mathbb{Z}$ is the function the observer observes over time, and $g(t)$ stands for the value she chooses to send to the tracker at time $t$. The predefined error is $\Delta$, which means that at any time $t_{now}$, if the observer does not send a new value $g(t_{now})$ to the tracker, then it must satisfy $\|f(t_{now}) - g(t_{last})\| \le \Delta$, where $g(t_{last})$ is the last value the tracker received from the observer. This is online tracking over a one-dimensional positive integer function.

Instead of the naive algorithm shown above, Yi and Zhang provide an online algorithm that is proved to be optimal with a competitive ratio of only $O(\log \Delta)$; that means in the worst case, its communication cost is only $O(\log \Delta)$ times worse than the cost of the offline optimal algorithm that knows the function $f(t)$ for the entire time domain [11]. But unfortunately, the algorithm works only for integer values. We observe that in reality, especially in our setting, real values (e.g., “double” for CPUUtilization) need to be tracked instead. To that end, we adapt the algorithm from [11], and design Algorithm 1 to track real values continuously in an online fashion. The algorithm proceeds in rounds. A round ends when $S$ becomes an empty set, and a new round starts.

Algorithm 1. One Round of Online Tracking for Real Values
  let $S = [f(t_{now}) - \Delta,\ f(t_{now}) + \Delta]$;
  while $S_{\text{upper bound}} - S_{\text{lower bound}} > \gamma$ do
    $g(t_{now}) = (S_{\text{upper bound}} + S_{\text{lower bound}})/2$;
    send $g(t_{now})$ to tracker;
    wait until $\|f(t_{now}) - g(t_{last})\| > \Delta$;
    $S_{\text{upper bound}} = \min(S_{\text{upper bound}},\ f(t_{now}) + \Delta)$;
    $S_{\text{lower bound}} = \max(S_{\text{lower bound}},\ f(t_{now}) - \Delta)$;
  end while /* this algorithm is run by observer */

The central idea of our algorithm is to always send the median value from the range of possible valid values, denoted by $S$, whenever $f(t_{now})$ has changed by more than $\Delta$ (which could be non-integer) from $g(t_{last})$. The next key observation is that any real domain in a system must have a finite precision. Suppose $\gamma$ is the finest resolution for the floating point values being tracked in the algorithm. Then at the
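The tracking loop of Algorithm 1 can be sketched in Python as follows, next to the naive "send whenever the value moves by more than $\Delta$" policy discussed above. This is a minimal sketch under stated assumptions rather than the authors' code: the function names `track` and `naive_track`, the representation of the observed function as a list of samples, and the default value of `gamma` are all hypothetical. Following the text, the observer sends the midpoint of the valid range $S$.

```python
def naive_track(values, delta):
    """Naive policy: resend the exact value once it drifts more than delta."""
    messages, last = [], None
    for i, f in enumerate(values):
        if last is None or abs(f - last) > delta:
            last = f
            messages.append((i, f))
    return messages

def track(values, delta, gamma=1e-9):
    """Observer side of Algorithm 1; returns the (index, value) messages sent.

    S = [lower, upper] holds the values still consistent with every
    observation that forced a message in the current round.
    """
    messages = []
    lower = upper = last_sent = None
    for i, f in enumerate(values):
        if last_sent is not None and abs(f - last_sent) <= delta:
            continue                          # tracker's copy is still valid
        if last_sent is None:
            lower, upper = f - delta, f + delta   # first round starts here
        else:
            # Violation: intersect S with [f - delta, f + delta].
            upper = min(upper, f + delta)
            lower = max(lower, f - delta)
            if upper - lower <= gamma:        # S is (nearly) empty: new round
                lower, upper = f - delta, f + delta
        last_sent = (upper + lower) / 2       # midpoint of the valid range S
        messages.append((i, last_sent))
    return messages

# The oscillating example from the text with delta = 10: 0, 11, 0, 11, ...
values = [0, 11] * 10
print(len(naive_track(values, 10)))   # 20 messages, one per sample
print(track(values, 10))              # [(0, 0.0), (1, 5.5)]
```

On this input the naive policy sends a message at every step, while the midpoint strategy settles on a value that stays within $\Delta$ of both extremes after two messages.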
2176 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
to the direction having the largest data variance while the remaining axis forms the residual subspace. The abnormal point is detected by comparing its projection length onto the residual subspace (second axis) against a threshold (detailed in Section 4.3.3). Using PCA for anomaly detection has been widely studied in the context of network traffic analysis and monitoring, e.g., [12], [13].

To the best of our knowledge, there is no prior work on adapting PCA for online monitoring and anomaly detection over VMs in an IaaS system. That said, there are three new challenges that we need to address: 1) unlike most existing work that uses PCA for anomaly detection in an offline batch setting [13], ATOM needs to do online monitoring; 2) once an anomaly is identified, ATOM needs to figure out which metrics from which VM instance(s) might have caused the anomaly; 3) the input data to ATOM’s online monitoring module are approximate results from the tracking module, which have an error that is bounded by $\Delta$. We need to take such tracking errors into account in the analysis. Next we will explain our method in detail.

4.2 The Data Matrix

Given $d'$ metrics reported by the tracking module for each VM, and with $t$ being the length of a time-based sliding window, PCA could be performed on these data, which form a $t \times d'$ matrix. A more general and more interesting case is to perform online monitoring over a data matrix composed of multiple VMs’ data, e.g., $d = d' \times n$ dimensions. For VMs hosted on the same physical node, or even the same cloud, it’s quite possible that one VM may attack another [14], or that some VMs are attacked by the same process simultaneously. Detecting anomalies in a $d$-dimensional space makes it easier to discover such correlations. It also provides better detection accuracy. Performing PCA on multiple VMs’ statistics yields a higher-dimensional residual space, leading to more accurate anomaly detection.

Recall that ATOM’s tracking module ensures that at any time point $t$, for each metric $E$, the CLC can obtain a value $v'_t$ that is within $v_t \pm \Delta$, where $v_t$ is the exact value of this metric at time $t$ from a VM instance of interest. Next we will show how to design an online PCA method to detect anomalies using a $t \times d$ matrix $M$. Each data value in this matrix is guaranteed to be within $\Delta$ of the true exact value for the same metric at that same time instance.

4.3 Our Approach

The following matrices are used in our construction: $M$, $Y$, $A$, $B$, whose definitions can be found in Table 1.

At first, a standard, offline batch PCA analysis [13] is applied to the data using the newest $t$ time instances to find potential anomalies. If anomalies are found, we eliminate data corresponding to those time instances, and use the rest as the initial data matrix $M$ to find the residual subspace $\tilde{S}$ through a regular PCA analysis. Afterwards, for each $z$ at $t_{now}$, we use the latest residual subspace $\tilde{S}$ to perform anomaly detection.

In summary, our monitoring method has five steps: (1) process data from $M$ to form $Y$; (2) build the PCA model based on $Y$; (3) find the residual subspace of the PCA model; (4) do anomaly detection for data at each new time instance using the latest PCA model; if the newest time instance data $z$ is normal, move it to $M$ and update the PCA model, otherwise (it does not agree with the residual subspace) move it to $A$; (5) if $z$ is abnormal, do metrics identification to find which metrics of which VM instances might have caused the anomaly. Step 1 is trivial by the definition of $Y$. The details of steps (2) to (5) are as follows.

4.3.1 Building the PCA Model

To build the PCA model, we perform eigenvalue decomposition on the covariance matrix of $Y$, and get a set of eigenvectors $V = (v_1, v_2, \ldots, v_d)$ sorted by their eigenvalues. These eigenvectors form the new axes in the transformed coordinate system, with the first principal axis $v_1$ pointing in the direction that has the largest variance in $Y$ and each following principal axis pointing in the direction of largest variance orthogonal to the previous ones. The corresponding eigenvalues are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$.

4.3.2 Find the Residual Subspace

We define the principal subspace and the residual subspace as follows. The principal subspace $S$ stands for the space spanned by the first several principal axes in $V$, while the residual subspace $\tilde{S}$ stands for the space spanned by the rest. The number of significant principal components in the principal subspace is denoted as $k$. Hence, the first $k$ eigenvectors form the principal subspace, and the remaining $(d - k)$ eigenvectors form the residual subspace that could be used to detect anomalies. Of the numerous methods to determine $k$, we choose the cumulative percent variance (CPV) method [15] for its ease of computation and good performance in practice as shown by previous work. For the first $\ell$ principal components,

$$CPV(\ell) = \frac{\sum_{i=1}^{\ell} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \cdot 100\%,$$

and we choose $k$ to be $k = \arg\min_{\ell}\,(CPV(\ell) > 90\%)$.

4.3.3 Anomaly Detection

Unlike previous methods, e.g., [13], that perform offline, batched backbone network anomaly detection, we are not required to detect anomalies for every row in $M$. Instead, we only need to check the newest vector $z$ at $t_{now}$. That’s because we have classified data into the (normal) data matrix $M$ and the abnormal matrix $A$, and the real-time detection of ongoing anomalies is based on the PCA model built from $M$.

To do this, we first standardize $z$ using the mean and standard deviation of each column in $M$. We use $x$ to denote the standardized vector.

Given the normal subspace $S: P_1 = [v_1, \ldots, v_k]$, and the residual subspace $\tilde{S}: P_2 = [v_{k+1}, \ldots, v_d]$, $x$ is divided into two parts by being projected onto these two subspaces:

$$x = \hat{x} + \tilde{x} = P_1 P_1^T x + P_2 P_2^T x.$$

If $z$ is normal, it should fit the distribution (e.g., mean and variance) of the normal data. Moreover, the values of $\tilde{x}$, which is the projection of $x$ onto $P_2$, are supposed to be small. Specifically, we define the squared prediction error (SPE) to quantify this:

$$SPE(x) = \|\tilde{x}\|^2 = \|P_2 P_2^T x\|^2 = \|(I - P_1 P_1^T)x\|^2.$$
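The quantities used in Sections 4.3.2 and 4.3.3 can be sketched in a few lines of Python. The sketch assumes the eigenvalues (sorted in decreasing order) and orthonormal principal axes have already been produced by some eigendecomposition routine, which is not shown. The names `choose_k`, `spe`, and `q_threshold` and the sample numbers are hypothetical. `q_threshold` evaluates the threshold $Q_\alpha$ that Section 4.3.3 derives from Equation (1), and it assumes at least one residual eigenvalue and $h_0 \neq 0$.

```python
from math import sqrt
from statistics import NormalDist

def choose_k(eigvals, cpv_target=90.0):
    """Smallest k whose cumulative percent variance exceeds cpv_target."""
    total = sum(eigvals)
    running = 0.0
    for k, lam in enumerate(eigvals, start=1):
        running += lam
        if 100.0 * running / total > cpv_target:
            return k
    return len(eigvals)

def spe(x, principal_axes):
    """SPE(x) = ||(I - P1 P1^T) x||^2, assuming orthonormal principal axes."""
    proj_sq = sum(sum(vi * xi for vi, xi in zip(v, x)) ** 2
                  for v in principal_axes)
    return sum(xi * xi for xi in x) - proj_sq

def q_threshold(eigvals, k, alpha):
    """Q_alpha computed from the residual eigenvalues lambda_{k+1..d}."""
    residual = eigvals[k:]
    th1, th2, th3 = (sum(lam ** i for lam in residual) for i in (1, 2, 3))
    h0 = 1.0 - 2.0 * th1 * th3 / (3.0 * th2 ** 2)
    c_alpha = NormalDist().inv_cdf(1.0 - alpha)   # (1 - alpha) percentile
    inner = (c_alpha * sqrt(2.0 * th2 * h0 ** 2) / th1
             + 1.0 + th2 * h0 * (h0 - 1.0) / th1 ** 2)
    return th1 * inner ** (1.0 / h0)

# Hypothetical eigenvalues for d = 5 dimensions, sorted in decreasing order.
eigvals = [6.0, 2.5, 1.0, 0.3, 0.2]
k = choose_k(eigvals)                  # CPV is 60, 85, 95, ... so k == 3
threshold = q_threshold(eigvals, k, alpha=0.01)
```

A standardized vector `x` would then be flagged as abnormal when `spe(x, P1) > threshold`, where `P1` holds the first `k` eigenvectors.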
Let $Q = \|\tilde{x}\|^2$. A classic result for the PCA model is that the following variable $c$ approximately follows a standard normal distribution with zero mean and unit variance [16]:

$$c = \frac{\theta_1\left[(Q/\theta_1)^{h_0} - 1 - \theta_2 h_0 (h_0 - 1)/\theta_1^2\right]}{\sqrt{2\theta_2 h_0^2}}, \tag{1}$$

where $\theta_i = \sum_{j=k+1}^{d} \lambda_j^i$ for $i = 1, 2, 3$, and $h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2}$.

We consider $x$ to be abnormal if $SPE(x) > Q_\alpha$, where the threshold $Q_\alpha$ is derived from the distribution of $c$:

$$Q_\alpha = \theta_1\left[\frac{c_\alpha\sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2}\right]^{1/h_0},$$

and $c_\alpha$ is the $(1 - \alpha)$ percentile in a standard normal distribution, with $\alpha$ being the false alarm rate.

Finally, if $z$ is normal, we add it to $M$, delete the oldest data in $M$, and update the PCA model accordingly. Otherwise it is added to $A$, and the corresponding standardized $x$ is moved to matrix $B$. Matrices $A$ and $B$ need to contain time-consecutive data only (so that we detect an anomaly corresponding to a continuous event); thus, they are cleared if their last vector is not consecutive in time with the new incoming vector.

4.3.4 Metrics Identification

When an anomaly is detected, we need to do further analysis to identify which metrics on which VM instance(s) from the $d = d' \times n$ dimensions might have caused the anomaly, to assist the orchestration module. Our identification method consists of three steps. It compares the abnormal data matrix $A$ (and the corresponding standardized matrix $B$) with the normal matrix $M$ (and $Y$). Suppose there are $m$ vectors in $A$ ($B$) and $t$ vectors in $M$ ($Y$).

Step 1. Since the anomaly is detected by $\|\tilde{x}\|^2$, it is natural to compare the residual data between $B$ and $Y$. Suppose $y_i$ is the transpose of the $i$th row vector in $Y$, and $\tilde{y}_i = P_2 P_2^T y_i$ is its residual traffic; then

$$(\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_t)^T = (P_2 P_2^T (y_1, y_2, \ldots, y_t))^T = Y P_2 P_2^T$$

forms a residual matrix of $Y$, denoted as $Y_r$. Similarly, $A_r = A P_2 P_2^T$. For each dimension $j \in [1, d]$, let

$$a_j = \frac{1}{m}\|(A_r)_j\|^2 \quad\text{and}\quad y_j = \frac{1}{t}\|(Y_r)_j\|^2,$$

where $(A_r)_j$ is the $j$th column in $A_r$ and $(Y_r)_j$ is the $j$th column in $Y_r$. Then $rd_j = (a_j - y_j)/y_j$.

Step 2. If for some dimension $j$, $rd_j \ge b_1$ for some constant $b_1$, we measure the change between $A$ and $M$. In particular, for each such dimension $j$, we calculate how far the abnormal data in $A$ deviate, in units of the standard deviation of the normal data, along that dimension in $M$. Specifically, we calculate $stddev_j = \frac{1}{m}\sum_{i=1}^{m} |a_{ij} - avg_j| / std_j$. A dimension $j$ is considered abnormal if $stddev_j \ge b_2$ for some constant $b_2$. In practice, we find that setting $b_1$ and $b_2$ to small positive integers works well, say $b_1 = 2$ and $b_2 = 3$.

Step 3. For a dimension $j$ that has been considered abnormal in Step 2, we measure the difference between the means of the abnormal and normal data. Specifically, we measure $meandiff_j = \left(\frac{1}{m}\sum_{i=1}^{m} a_{ij} - avg_j\right)/avg_j$.

Step 1 reveals which dimension has a larger projection onto the residual subspace than the normal data; however, it is hard to map such a change back to the original data. Furthermore, as shown in Section 8.2, this measure is not highly reliable and could be omitted to save some computation cost. Step 2 is a useful measure to show which dimension has a significantly different pattern compared to the normal data. However, it does not tell us whether some metric usage goes up or down. Thus we use Step 3 at last to find this pattern. Step 3 by itself is not good enough to indicate a pattern, because the oscillation of metric usage statistics might make the mean of some dimension in $A$ appear benign. Thus, the outputs of Steps 2 and 3 are sent together, along with an introspection request, to the orchestration module on the corresponding NC(s) that administer the identified VM instance(s). Section 8 shows how information identified from these three steps could help the orchestration module find a “real cause” of what might have gone wrong and how wrong it is.

4.3.5 Other Remarks

Raising Alarms to Cloud Users. Once a data vector is detected as abnormal, it is moved to the abnormal data matrix, on which metrics identification is performed. Suppose there are $m$ vectors in total in the abnormal data matrix $A$; an alarm will be raised with an alarm level $m$. The alarm level indicates how serious the detected anomaly is; intuitively, the larger the number of data vectors contained in $A$, the longer the duration of the currently detected anomaly. The alarm can be raised either right after the metrics identification step, or after the virtual machine inspection by the orchestration module has finished (so that more information is gathered). The alarm notifies the user about the potential abnormal behavior in the IaaS system and lets the user identify whether the ongoing behavior on his/her VM(s) is normal. If the behavior is due to a change in the tasks on a VM, the corresponding data vectors in the abnormal matrix should be moved to the normal data matrix and used to build the PCA model, to accommodate and reflect the new behavior. The abnormal data matrix is cleared once the anomaly on the VM is removed, or is identified as normal by the cloud user.

Scalability. The computation complexity of the monitoring module is evaluated in Section 8.4 (Fig. 11). Although its computation cost increases with the number of VMs, it remains a very small overhead. The average computation cost per sliding window for the monitoring module is less than 3 milliseconds in most cases for up to 6 VMs. What’s more, due to the significant message savings from ATOM’s tracking module, both the PCA-based computation overhead and the Eucalyptus storage overhead are reduced significantly. A larger number of VMs could significantly improve the detection accuracy, meaning smaller false alarm rates, because the monitoring component uses a larger data matrix that helps find the normal subspace more reliably, as also evaluated in Section 8.4 (Fig. 11).

5 INTERACTION BETWEEN TRACKING AND MONITORING COMPONENTS

5.1 Deriving the Tracking Error Threshold

As mentioned earlier, the input data to the monitoring module is produced by the tracking module and each value may contain an approximation error of at most $\Delta$ (away from the
DU AND LI: ATOM: EFFICIENT TRACKING, MONITORING, AND ORCHESTRATION OF CLOUD RESOURCES 2179
true value at that time instance for that metric). The approximation error introduced by the tracking module may degrade the performance of ATOM’s monitoring module. Thus, a formal analysis is needed to bound the effect of tracking errors and show how to set a proper value as the error threshold $\Delta$ for each metric in the tracking module.

As shown in Section 4.3.3, the random variable $c$ follows a normal distribution, and the SPE threshold $Q_\alpha$ is computed after an $\alpha$ value is specified. However, we do not have $c$ from the exact data matrix; instead, the approximate data matrix leads to the value $\hat{c}$. The SPE threshold is computed using a user-specified $\alpha$ value. However, the threshold calculated from the approximated matrix no longer represents the confidence limit $1 - \alpha$; instead it leads to a corresponding approximation $1 - \hat{\alpha}$. We want to understand the relationship between $\hat{\alpha}$ and $\alpha$. Formally, the cloud user specifies $\alpha$ and a maximally allowed deviation rate $\mu$ such that our tracking and monitoring methods guarantee that $|\hat{\alpha} - \alpha| \le \mu$ (even though $c$ is unknown). Thus, we need to establish the relationship between $\mu$ and the tracking error threshold $\Delta$ for each metric dimension used by the tracking module [17].

We achieve this objective via two steps: 1) given $\mu$, find an approximate error bound $\epsilon$ on the average eigenvalues produced by PCA; 2) once we have the error bound $\epsilon$ on the eigenvalues, calculate the tracking threshold $\Delta$ based on $\epsilon$.

Step 1. We could approximate $\mu$ from $\epsilon$ according to Equation (1), yet the reverse cannot be done with a closed-form formula. We observe that $\mu$ monotonically increases with $\epsilon$. Hence the idea is to use a binary search to approximate $\epsilon$: we first guess a value $\epsilon'$, then calculate the corresponding $\mu'$ and compare it with the user-input $\mu$, and finally adjust the value of $\epsilon'$ and compute $\mu'$ again. We repeat this process until the difference between $\mu'$ and $\mu$ is within a desired precision. Then we can treat $\epsilon'$ as $\epsilon$, the input for the next step. The way to calculate $\mu$ using $\epsilon$ is derived as follows. Given that $c$ approximately follows a normal distribution, $\mu = \Pr[c_\alpha - h_c < U < c_\alpha + h_c]$, where $h_c = |\hat{c} - c|$, and $U$ is a random variable following the normal distribution $N(0, 1)$. $h_c$ can be approximated from $\epsilon$ using the Monte Carlo sampling technique according to Equation (1). In each loop, we generate a random value $\hat{\lambda}$ in the range $[\lambda - \epsilon, \lambda + \epsilon]$, compute $\hat{c}$ based on Equation (1), and compute its difference from $c$, which is calculated from $\lambda$. This loop is repeated a constant number of times and the largest difference is assigned to $h_c$, which is then used to calculate $\mu$.

Step 2. Once we have the eigen-error $\epsilon$, using the stochastic matrix perturbation method we can get the relation between the eigen-error $\epsilon$ and the variance $\sigma_i^2$ along each dimension:

$$2\sqrt{\frac{\bar{\lambda}}{t}\sum_{i=1}^{d}\sigma_i^2} + \sqrt{\left(\frac{1}{t} + \frac{1}{d}\right)\sum_{i=1}^{d}\sigma_i^4} = \epsilon,$$

where $\bar{\lambda}$ is the average of the eigenvalues, $t$ is the number of points used to build the PCA model, and $d$ is the number of dimensions. Then the estimation of the tracking error $\Delta$ is based on the following assumptions:

1) the errors between the approximated values sent to the tracker (the CLC) and the true values observed by the observer (an NC) are independently and uniformly distributed within the threshold, according to which the tracking threshold for the $i$th dimension is $\delta_i = \sqrt{3}\,\sigma_i$.
2) we use homogeneous slack allocation, which is to assume a uniform distribution of tracking error $\delta$ on each dimension.

Applying these two assumptions, we get a tracking threshold

$$\delta = \frac{\sqrt{3\bar{\lambda}n + 3\epsilon\sqrt{m^2 + mn}} - \sqrt{3\bar{\lambda}n}}{\sqrt{m + n}}, \tag{2}$$

where $m$ and $n$ play the roles of $t$ and $d$ in the relation above (the number of rows and columns of the data matrix); setting $\sigma_i = \sigma$ for all $i$ in that relation and solving the resulting quadratic in $\sigma$ gives exactly this expression for $\delta = \sqrt{3}\,\sigma$.

Note that we cannot send this threshold directly to observers, since the data matrix used to build the PCA model has been standardized. Recall that $std_i$ is the standard deviation along the $i$th dimension of matrix $M$; then the original variance is $S_i = (std_i\,\sigma_i)^2$. Thus, the tracking threshold for the $i$th dimension is calculated as $\Delta_i = \sqrt{3 S_i} = \sqrt{3 (std_i\,\sigma_i)^2} = std_i\,\delta_i$. The CLC calculates the results for each metric dimension whenever there is a PCA update, and then sends the new tracking thresholds to the corresponding NCs (observers), which use the updated thresholds to adjust their tracking algorithm. A possible improvement is to allocate the tracking slack for each metric dimension according to the frequency of messages sent to the CLC. By giving larger tracking error thresholds to the dimensions that are sent more frequently, and smaller tracking error thresholds to the other dimensions, the tracking overhead could potentially be further reduced.

5.2 Accommodating Dynamic Tracking Thresholds

In the monitoring component (CLC), each time a new set of tracking thresholds is calculated, they are sent back to the tracking component (NC). This means that the tracking threshold on each metric dimension may change from time to time. On the tracking component, we use a buffer $B$ to store the newest tracking threshold for each metric, and adjust the tracking method in Algorithm 1 accordingly, as shown in Algorithm 2. Here $\Delta_{new}$ is the current tracking threshold in buffer $B$ for the metric being tracked.

Algorithm 2. One Round of Online Tracking for Real Values
  let $S = [f(t_{now}) - \Delta,\ f(t_{now}) + \Delta]$;
  while $S_{\text{upper bound}} - S_{\text{lower bound}} > \gamma$ do
    $g(t_{now}) = (S_{\text{upper bound}} + S_{\text{lower bound}})/2$;
    send $g(t_{now})$ to tracker;
    while $\|f(t_{now}) - g(t_{last})\| \le \Delta$ do
      wait until $f(t_{now})$ is updated;
      $\Delta = \Delta_{new}$;
    end while
    $S_{\text{upper bound}} = \min(S_{\text{upper bound}},\ f(t_{now}) + \Delta)$;
    $S_{\text{lower bound}} = \max(S_{\text{lower bound}},\ f(t_{now}) - \Delta)$;
  end while /* this algorithm is run by observer */

We can show that this style of “lazy update of the tracking threshold value” ensures that the competitive ratio is the maximum of $\log \Delta$ over all possible $\Delta$ (or $\log(\Delta/\gamma)$, where $\gamma$ is the finest precision for “double” values) in a tracking period, and that it is optimal. It also guarantees that on the
monitoring component, the PCA detection result calculated by the approximated tracking values has a false alarm rate â that is within the user-specified deviation value m of the true false alarm rate a (i.e., â ∈ [a - m, a + m]).
Fig. 4. Intersection with dynamically changing values of D.
Claim 2. When the tracking threshold D changes at NC, by simply changing the D value in Algorithm 1 during a round, the correctness and optimality of the tracking algorithm still hold. The competitive ratio with dynamically changing values of D becomes the log of the maximum D value for integers, and the log of the maximum D/g value for floating point values, where g is the finest precision.
Proof. Here we prove the case of tracking integer values. The extension to real values is straightforward following the proof for Claim 1. We use the same notation as in Section 3. Recall that a range S is initialized as [f(t0) - D, f(t0) + D], where f(t0) is the first value observed, and is updated as the intersection of the ranges [f(t) - D, f(t) + D] up to tnow. A round lasts from the initialization of S until S becomes empty.
Correctness. When the tracking error bound changes from D1 to D2, Alice sends Bob a new value whenever the newest value observed is beyond the D2 range of the last sent one.
Competitive Ratio. Note that in Algorithm 1, A_SOL uses binary search to guess what value A_OPT might have sent in each round. The range S contains all the possible values that A_OPT might have sent, and it shrinks by at least half upon the sending of each message (the median of S). So in each round, A_OPT sends out only one value while A_SOL sends out at most log D. Even if D changes in the middle, as shown in Fig. 4, it does not affect the fact that S shrinks by at least half upon each message sent. When the tracking error bound changes from D1 to D2, use S1 to denote the region of S at that time, and S2 to denote [y - D2, y + D2], where x is the median of S1 (the last sent value) and y is the first value observed that exceeds D2 of x after D changes. According to our “lazy update” method, the new S is the intersection of S1 and S2. Because y - D2 > x, |new S| = S1(upper bound) - (y - D2) < |S1|/2. Hence, no matter whether D2 is bigger or smaller than D1, S1 shrinks by at least half when this change happens. If S1 and S2 do not intersect, then a new round starts and D2 becomes the initial threshold of the new round. Therefore, the competitive ratio for each round depends ONLY on the initial size of S. If the initial threshold of a round is D, then the competitive ratio for that round is log D. Throughout the whole period, the competitive ratio becomes the log of the maximum threshold value that ever appears.
Optimality. Suppose the last value sent by an online algorithm A_SOL before the change from D1 to D2 is x. An adversary Carole operates the value of f here. If x is greater than the median of S, Carole decreases f until it
6 ORCHESTRATION COMPONENT
The monitoring component in Section 4 detects the abnormal state and identifies which measurement on which VM might be responsible. In this section, we describe how the orchestration component is able to automatically mitigate the malicious behavior after an anomaly is detected.
Modern IaaS cloud vendors offer services mostly in the form of VMs, which makes it critical to ensure VM security in order to attract more customers. The VMI technique has been widely studied to introspect VMs for security purposes. There are also several popular open source general-purpose VMI tools, such as LibVMI [9], Volatility [18], and StackDB [10], for researchers to explore and develop more sophisticated applications. LibVMI has many basic APIs that support memory read and write on live memory. Volatility itself supports memory forensics on a VM memory snapshot file, and it has many Linux plugins that are ready to use. StackDB is designed to be a multi-level debugger, and also serves well as a memory-forensics tool. Other more sophisticated techniques developed for special-purpose VMI anomaly detection are generally based on these tools. Blacksheep [19], for instance, utilizes Volatility and specifically developed plug-ins to implement a distributed system for detecting anomalies inside VMs among groups of similar machines. However, like most other VMI strategies to secure VMs, it needs to dump the whole memory space of the target VM, and then analyze each piece, typically by comparing it with what is defined as a “normal” state. Thus, to protect VMs in real time, the whole memory space needs to be analyzed constantly, introducing much overhead into the production system.
ATOM implements its orchestration component based on Volatility (with the LibVMI plug-in for live introspection) and StackDB. A crucial difference from other systems is that ATOM only introspects a VM when an anomaly happens, and only on the relevant memory space of the suspicious VMs. The monitoring component in ATOM serves as a trigger to inform VMI tools when and where to do introspection. The anomalies are found by analyzing previously monitored resource usage data in the monitoring component, which is much more lightweight than analyzing the whole memory space. Then the metrics identification process in the monitoring component locates which dimensions are suspicious, indicating the relevant metrics on some particular VMs. This information is sent to the orchestration component along with a VMI request, which then only introspects the relevant memory space, reducing the overhead dramatically. For example, if it is detected and identified that the network usages on VM-2 and VM-3 are unusual, as shown in Fig. 5, then ATOM only needs to introspect the network connections using Volatility network plug-ins on VM-2 and VM-3, in contrast to other VMI-based detection strategies, which typically need to walk over the whole process list, opened network sockets, opened files, etc.
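As a rough illustration of this trigger-then-introspect flow, the sketch below turns the (VM, metric) pairs flagged by the monitoring component into a per-VM list of Volatility Linux plugins to run. It is a minimal sketch under stated assumptions, not ATOM's implementation: the mapping table `PLUGINS_BY_METRIC`, the function name `plan_introspection`, and the choice of `linux_lsof` for disk metrics are hypothetical. This paper itself only mentions the network (`linux_netstat`) and process-list plugins, and the code only builds a plan without invoking Volatility.

```python
# Hypothetical mapping from a flagged metric to the Volatility (2.x) Linux
# plugins to try first. Only the netstat and process-list plugins are named
# in the paper; the disk entries are an assumption for illustration.
PLUGINS_BY_METRIC = {
    "CPUUtilization": ["linux_pslist"],
    "NetworkIn": ["linux_netstat"],
    "NetworkOut": ["linux_netstat"],
    "DiskReadOps": ["linux_lsof"],
    "DiskWriteOps": ["linux_lsof"],
    "DiskReadBytes": ["linux_lsof"],
    "DiskWriteBytes": ["linux_lsof"],
}


def plan_introspection(suspicious):
    """Build {vm_name: [plugin, ...]} from (vm_name, metric_name) pairs.

    Each plugin appears at most once per VM, in first-seen order, so only
    the memory structures relevant to the flagged metrics are inspected.
    """
    plan = {}
    for vm, metric in suspicious:
        plugins = plan.setdefault(vm, [])
        for plugin in PLUGINS_BY_METRIC.get(metric, []):
            if plugin not in plugins:
                plugins.append(plugin)
    return plan


# The example from the text: unusual network usage on VM-2 and VM-3.
flagged = [("VM-2", "NetworkIn"), ("VM-2", "NetworkOut"), ("VM-3", "NetworkOut")]
print(plan_introspection(flagged))
```

With these inputs, each of the two VMs is scheduled for a single network-connection scan rather than a walk over every kernel structure.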
DU AND LI: ATOM: EFFICIENT TRACKING, MONITORING, AND ORCHESTRATION OF CLOUD RESOURCES 2181
based clustering algorithm does not require prior knowledge of the number of clusters, nor does it need to iteratively compute an explicit “centroid” and re-cluster at every iteration.
By default, ATOM sets minPts=10, and computes the threshold value using a sampling-based approach. More specifically, we randomly select n pairs of VMs and compute their VMdist. We sort the n VMdist values, and set the threshold to VMdist_i if VMdist_{i+1} > 5 × VMdist_i. The intuition is that for any point, the distance to a point in a different cluster is much longer than the distance to a point in the same cluster, and we want to find a large enough “inner cluster” distance and use it as the threshold value to determine whether two points belong to the same cluster.
8 EVALUATION
We implemented ATOM using Eucalyptus as the underlying IaaS system. The virtual machine hypervisor running on each NC is the default KVM hypervisor. Each VM has an m1.medium type on Eucalyptus. ATOM tracks seven metrics from each VM instance: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes. All experiments are executed on a Linux machine with an 8-core Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz.
8.1 Online Tracking
In the evaluation the data collection time interval is set to 10 seconds, i.e., raw values for different metrics are collected every 10 seconds on an NC (observer), which produces 360 values for each metric per hour. Instead of sending every value to the CLC (the tracker), the modified CloudWatch with ATOM’s online tracking component selectively sends certain values, based on Algorithm 1, from the NC to the CLC. Fig. 6 shows the number of values sent for each metric over 2 hours, with different workloads (e.g., TPC-C benchmark over MySQL) and different D values. Among the seven metrics for each VM, only the first five are shown in each sub-figure, as DiskReadBytes/DiskWriteBytes follow the same patterns as DiskReadOps/DiskWriteOps in all experiments.
Fig. 6a shows the result when the VM is idle, using D = 0. This is the base case with no error allowed for any metric. The result shows that our tracking component still achieves significant savings when no error is allowed. In Fig. 6b, the VM is also idle, while D is set to 10 percent of the average value (calculated from exact values collected) in 2 hours for each metric. Note that this is a very small error threshold. For example, the metric CPUUtilization is always between 0 and 0.2 percent when the VM is idle, so the D value for this metric is only (roughly) 0.01 percent. This figure shows that even when allowing a very small error, the tracking component already leads to significant savings. Fig. 6c shows the results when the VM is running the TPC-C benchmark on a MySQL database, which involves large disk reads and writes. D is set as the average of the exact values in 2 hours when the VM is idle. This is reasonable even for users who do not allow any error, because D is merely the average of the amount consumed by an idle VM. Note that in this figure, NetworkIn and NetworkOut only have two values sent to the CLC in 2 hours with the tracking component. This figure tells us that even if the VM is intensively used and almost no error is allowed, the tracking component is still highly effective.
Fig. 6d demonstrates the result when the VM is running the same workload, while the D value for each metric is now set as 10 percent of the average value when the VM has been running the same workload for 2 hours, i.e., larger errors are allowed. Clearly, the tracking component becomes more effective. Allowing error is expected to improve ATOM's performance, because new values within the error threshold of the last sent one need not be sent.
Fig. 7 explains how the online tracking component works. It shows both the values sent by standard CloudWatch (without tracking) and the values sent by the modified CloudWatch with ATOM tracking, over a time interval of 1000 seconds for the NetworkOut metric from Fig. 6b. This clearly illustrates that at each time instance, with online tracking, the current (exact) value is not sent if it is within the D threshold of the last sent value; and at each time point, the last value sent to the CLC is always within D of the newest value observed on the NC. The values sent by the tracking method closely approximate the exact values, with much smaller overhead.
Fig. 7. A comparison on NetworkOut values sent by NC.
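The sending rule evaluated in this subsection can be sketched in a few lines. The code below is a minimal, hedged illustration of an observer that keeps a candidate range S, sends its median, stays silent while the newest value is within D of the last sent value, and picks up a changed threshold only when it next has to act (the lazy update discussed in Section 5.2). The function name `lazy_track`, the per-step `deltas` list, and the default resolution `gamma` are assumptions for illustration, and starting a fresh round as soon as S becomes empty is one possible reading of the round structure.

```python
def lazy_track(values, deltas, gamma=1e-9):
    """Return the (time index, value) messages an observer would send.

    values[i] is the exact metric value f at step i; deltas[i] is the
    tracking threshold D in force at step i (it may change over time).
    """
    sent = []
    i, n = 0, len(values)
    while i < n:
        # Start a round: S = [f - D, f + D] around the current value.
        lo, hi = values[i] - deltas[i], values[i] + deltas[i]
        while True:
            g = (lo + hi) / 2.0              # send the median of S
            sent.append((i, g))
            i += 1
            # Stay silent while f remains within the current D of g.
            while i < n and abs(values[i] - g) <= deltas[i]:
                i += 1
            if i == n:
                return sent
            # f escaped: intersect S with the range allowed right now.
            lo = max(lo, values[i] - deltas[i])
            hi = min(hi, values[i] + deltas[i])
            if hi - lo <= gamma:             # S is empty: the round is over
                break
    return sent


# Eight samples with a constant threshold D = 1.0.
samples = [10.0, 10.2, 10.5, 12.0, 12.1, 9.0, 9.1, 9.2]
messages = lazy_track(samples, [1.0] * len(samples))
print(len(messages), "messages for", len(samples), "samples:", messages)
```

In this toy run only three of the eight samples trigger a message, and at every step the last sent value stays within D of the exact value, which is the property Fig. 7 illustrates.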
TABLE 2
Online Monitoring Experiment Setup
Experiment   Workload                                               Attack
1            VM 1, 3 idle; VM 2 network workload                    DDoS attack inside VM 2
2            VM 1 idle; VM 2, 3 network workload                    DDoS attack inside VM 2, 3
3            VM 1 idle; VM 2 network workload; VM 3 disk workload   Resource-freeing attack from VM 3 to VM 2

8.2 Automated Online Monitoring and Orchestration
We design three experiments to illustrate the effectiveness of ATOM’s monitoring module. For each experiment, we use a false alarm rate a = 0.2 percent and its deviation m = 1 percent (to set the tracking error bound). Meanwhile, the Q_a threshold with a = 0.5 percent is also calculated for comparison. The online tracking error D is calculated dynamically according to the equations in Section 5.1 at the CLC, and set using the algorithm in Section 5.2 on each NC. Three VMs of type m1.medium, co-located on one Eucalyptus physical node, are monitored for each experiment, which form a t × 21 data matrix. Dimensions 1-7 belong to VM 1, 8-14 are for VM 2, whereas VM 3 owns the rest.
We use two types of normal workloads and two kinds of attacks in all three experiments. The two types of normal workloads are network and disk workloads. For the network workload, an Apache web server is installed and constantly responds to WebBench network requests. The disk workload is the TPC-C benchmark against a MySQL database [21]. The two types of attacks are the DDoS attack and the resource-freeing attack [14]. In our experiment, the DDoS attack treats the affected VM as a compromised zombie and sends malicious traffic to the target IP address. The resource-freeing attack is launched by VM 3, targeting the web server on VM 2 to gain more cache usage. Note that there is a 4th VM running WebBench and a 5th VM running an Apache web server as the target of DDoS bots. The first two hours are used to build the PCA model for each experiment, while the anomaly happens at the third hour. The settings for each experiment are shown in Table 2.
In the first experiment, VM 2 runs an Apache web server while the other 2 VMs are idle. A DDoS attack turns VM 2 into a zombie at the third hour, using it to generate traffic towards the target IP (the 5th VM in our experiment). Note that this attack is hard to detect using the simple threshold approach in existing IaaS systems. The normal workload on VM 2 is a network workload, which already has a large amount of NetworkIn/NetworkOut usage, so sending out malicious traffic only changes the mean of the normal statistics by roughly 10 to 30 percent. Hence it is difficult to set an effective threshold value even for an experienced user, because the underlying normal traffic might oscillate within a range. Yet ATOM’s monitoring module successfully finds the underlying pattern, and detects the time instances that are abnormal (when attacks are ongoing). Fig. 8a shows the online monitoring and detection process. The dashed line corresponds to the threshold Q_a for a = 0.2 percent, and the solid line shows Q_a for a = 0.5 percent. The SPE of the approximate data matrix projected onto the residual subspace is plotted, where the black dots indicate the time instances when the DDoS attack happens. Clearly, ATOM has correctly identified all abnormal time instances.
Fig. 8. Time series plots of SPE against thresholds Q_a with a = 0.2 and 0.5 percent.
Once a time instance is considered abnormal, ATOM immediately runs the metrics identification procedure to find the affected VMs and metrics. As described in Section 4.3.4, ATOM first finds potential abnormal dimension(s) by analyzing the average change portion rd_j between abnormal data points and normal data points projected onto the residual subspace. Then, for dimensions that have significant changes, ATOM computes stddev_j as suggested in Section 4.3.4, and also calculates the average change meandiff_j if stddev_j is above a threshold. Recall that m is the number of consecutive abnormal time instances until tnow. The results when m = 5 are shown in the first table of Table 3. Note that only for the dimensions having a large enough residual portion (rd_j) does ATOM compute the standard deviation error (stddev_j). Among the 3 VM instances being tracked and monitored, ATOM correctly identifies an anomaly happening on VM 2, and more specifically, it discovers that the anomaly is from its first three dimensions (CPUUtilization, NetworkIn, NetworkOut), indicated by the bold values. Note that NetworkIn and NetworkOut actually go down because of the DDoS attack. Our guess is that WebBench tends to saturate the bandwidth available for the VM, while the DDoS attack we use launches many network connections but does not send as much traffic. The CPUUtilization, however, goes up due to the attack. Nevertheless, ATOM is able to identify all three abnormal metric dimensions.
After abnormal metrics are identified, a VMI request is sent to the corresponding NC for introspection. ATOM’s
orchestration module first identifies this as a possible network problem, and then calls Volatility to analyze the network connections (linux_netstat plugin) on that VM, which finds numerous network connections targeting one IP address, a typical pattern of DDoS attacks. Volatility is then used to find the processes related to these network connections and their parent processes (pslist plugin). At this time ATOM raises an alarm with alarm level m, notifying the user about the findings, and asks the user to check whether those processes are normal or malicious. The alarm level is useful; for example, m = 1 could be treated as a mild warning. If the user identifies them as malicious, he/she could either investigate the VM in further detail, or use ATOM’s monitoring module to do auto-debugging and kill malicious processes automatically through StackDB [22]. Fig. 8a shows that SPE goes back to normal after the attack is mitigated on the affected VM through ATOM’s orchestration module.
In the second experiment, both VM 2 and VM 3 are running the network workload, and the same DDoS attack turns both VMs into zombie VMs simultaneously. Not only is ATOM able to detect that an anomaly happened, as shown in Fig. 8b, but it also finds similar patterns on the correct metrics from both VM 2 and VM 3, as illustrated in the second table of Table 3, which shows the metrics identification results when m = 5. By sending this information to the orchestration component, the introspection overhead could be reduced by first introspecting one VM, and then checking whether the other one has the same malicious behavior going on.
The third experiment illustrates ATOM’s ability to detect a different type of attack, the resource-freeing attack [14], a subtle attack where the goal is to improve a VM’s performance by forcing a competing VM to saturate some bottleneck and shift its usage on the target resource (oftentimes with legitimate behavior). This kind of attack is known to be very hard to catch except by using strong isolation on the physical node. In this experiment, VM 2 runs an Apache web server constantly handling network requests. VM 3 runs the TPC-C benchmark on a MySQL database. According to [14], if VM 3 wants more cache usage, it could make the network resource a bottleneck for VM 2, and shift its usage on cache (VM 2 and VM 3 are running on the same physical node). In this experiment VM 3 launches a GoldenEye attack, which achieves a denial-of-service attack on the HTTP server running on VM 2 by consuming all available sockets, and is paired with cache control. We show that ATOM successfully finds the two VMs, and by its metrics identification procedure, it suggests the possibility of a resource-freeing attack and provides useful data to its orchestration module in assisting the VMI procedure on VMs 2 and 3.
Fig. 8c plots the monitoring and the detection process. The black dots indicate the time instances when abnormal behavior happens. This figure, as before, only shows that an anomaly has happened, whereas the third table in Table 3 analyzes where the anomaly has originated. The stddev_j values show what the abnormal dimensions are: on VM 2, CPUUtilization (vm2-d1), NetworkIn (vm2-d2), NetworkOut (vm2-d3); on VM 3, NetworkIn (vm3-d2), NetworkOut (vm3-d3), DiskReadOps (vm3-d4), DiskReadBytes (vm3-d6).
Further analysis of meandiff_j finds that the NetworkIn and NetworkOut statistics on VM 2 decrease by nearly an order of magnitude, while VM 3 sees a significant increase in NetworkIn, NetworkOut and especially its disk read statistics (DiskReadOps and DiskReadBytes). This is a typical resource-freeing attack as described in [14], where the network resource has become the bottleneck of a target VM, and the beneficiary VM gains much of the shared cache usage, shown by a significant increase in disk read statistics. The sudden increase in NetworkIn/NetworkOut in VM 3

TABLE 3
Metrics Identification Results
(a blank meandiff_j cell means the value was not computed for that dimension)

Experiment 1 Metrics Identification Results
Dim (j)        vm1-d1   vm1-d2   vm1-d3   vm1-d4   vm1-d5   vm1-d6   vm1-d7   vm2-d1   vm2-d2   vm2-d3   vm2-d4
rd_j             1.87    36.62    27.17    13.39    -0.56     0.08     8.55    32.63     7.31    35.82     0.00
stddev_j         0.50     0.32     0.72     0.00     0.76     0.00     0.90    48.68     3.82     6.74     0.08
meandiff_j                                                                      0.11    -0.12    -0.21
Dim (j)        vm2-d5   vm2-d6   vm2-d7   vm3-d1   vm3-d2   vm3-d3   vm3-d4   vm3-d5   vm3-d6   vm3-d7
rd_j             0.00     0.00     0.00     2.94    -0.50    -0.41    18.45    18.00     1.22     1.88
stddev_j         0.90     0.08     0.41     0.72     0.31     1.06     0.00     0.18     0.00     0.66
meandiff_j

Experiment 2 Metrics Identification Results
Dim (j)        vm1-d1   vm1-d2   vm1-d3   vm1-d4   vm1-d5   vm1-d6   vm1-d7   vm2-d1   vm2-d2   vm2-d3   vm2-d4
rd_j            23.70    -0.98    -0.98    -0.55    -0.57     4.27     3.76     9.14    64.18    65.05     3.50
stddev_j         0.78     0.42     0.58     0.00     0.67     0.00     0.71     3.17     8.01     8.30     0.00
meandiff_j                                                                      0.16    -0.26    -0.28
Dim (j)        vm2-d5   vm2-d6   vm2-d7   vm3-d1   vm3-d2   vm3-d3   vm3-d4   vm3-d5   vm3-d6   vm3-d7
rd_j            -0.51    -0.82     4.23     9.04    60.56    61.16     1.45    -0.56     1.89    -0.51
stddev_j         0.31     0.00     0.35     7.23     6.06     6.98     0.17     3.39     0.12     3.65
meandiff_j                                  0.39    -0.23    -0.31

Experiment 3 Metrics Identification Results
Dim (j)        vm1-d1   vm1-d2   vm1-d3   vm1-d4   vm1-d5   vm1-d6   vm1-d7   vm2-d1   vm2-d2   vm2-d3   vm2-d4
rd_j             2.58    -0.65    -0.93    -0.65    28.23    -0.98    -0.15     6.90     7.94     7.27    -0.76
stddev_j         0.24     0.42     0.63     0.95     0.43     0.98     0.86     7.36     4.52     4.74     0.21
meandiff_j                                                                     -0.91    -0.85    -0.89
Dim (j)        vm2-d5   vm2-d6   vm2-d7   vm3-d1   vm3-d2   vm3-d3   vm3-d4   vm3-d5   vm3-d6   vm3-d7
rd_j             0.30    -0.99    -0.44    10.70  1282.80  1401.34  1363.47    -0.70  1544.73    -0.53
stddev_j         1.41     0.17     1.43     1.86    13.05    12.79    13.42     1.72    13.60     1.78
meandiff_j                                         101.81   110.97   187.16            196.30
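To make the dimension labels in Table 3 concrete, the following sketch maps a dimension index of the t × 21 data matrix back to its VM and metric (dimensions 1-7 for VM 1, 8-14 for VM 2, the rest for VM 3, in the metric order listed in Section 8) and then flags the dimensions whose rd_j and stddev_j are both large. It is a simplified illustration only: the function names and the two cut-off values `rd_min` and `std_min` are assumptions chosen for this example, not the thresholds or the exact procedure of Section 4.3.4. The input rows are the rd_j and stddev_j values reported for the third experiment.

```python
METRICS = ["CPUUtilization", "NetworkIn", "NetworkOut", "DiskReadOps",
           "DiskWriteOps", "DiskReadBytes", "DiskWriteBytes"]


def dim_to_vm_metric(j):
    """Map a 1-based dimension index to (VM number, metric name)."""
    vm, k = divmod(j - 1, len(METRICS))
    return vm + 1, METRICS[k]


def identify(rd, stddev, rd_min=5.0, std_min=3.0):
    """Flag dimensions whose rd_j and stddev_j both reach the assumed cut-offs."""
    return [dim_to_vm_metric(j + 1)
            for j, (r, s) in enumerate(zip(rd, stddev))
            if r >= rd_min and s >= std_min]


# rd_j and stddev_j rows of the third experiment in Table 3 (21 dimensions).
rd3 = [2.58, -0.65, -0.93, -0.65, 28.23, -0.98, -0.15,
       6.90, 7.94, 7.27, -0.76, 0.30, -0.99, -0.44,
       10.70, 1282.80, 1401.34, 1363.47, -0.70, 1544.73, -0.53]
std3 = [0.24, 0.42, 0.63, 0.95, 0.43, 0.98, 0.86,
        7.36, 4.52, 4.74, 0.21, 1.41, 0.17, 1.43,
        1.86, 13.05, 12.79, 13.42, 1.72, 13.60, 1.78]

for vm, metric in identify(rd3, std3):
    print("VM", vm, metric)
```

With these illustrative cut-offs the flagged set is the first three metrics of VM 2 plus NetworkIn, NetworkOut, DiskReadOps and DiskReadBytes of VM 3, matching the dimensions named in the discussion of the third experiment.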
different overlay nodes with reduced overhead achieved by ad-hoc conditions filters. InfoTrack [38] is a monitoring system that is similar to ATOM’s tracking module, in that it tries to minimize continuous monitoring cost with most information precision preserved, by leveraging temporal and spatial correlation of monitored attributes, while ATOM utilizes an optimal online tracking algorithm that is proved to achieve the best saving in network cost without any prior knowledge of the data. MELA [39] is a monitoring framework for cloud services which collects different dimensions of data tailored for analyzing cloud elasticity (e.g., scale up and scale down). ATOM may use MELA to collect, track, and monitor different types of metrics than those already available through CloudWatch.
Cloud Security. IaaS systems also bring us a new set of security problems. Leading cloud providers have developed advanced mechanisms to ensure the security of their IaaS systems. AWS [40] has many built-in security features such as firewalls, encrypted storage and security logs. OpenStack uses a security component called Keystone [41] to do authentication and authorization. It also has security rules for network communication in its network component Neutron [42]. Other IaaS platforms have similar security solutions, which are mainly firewalls and security groups.
Nevertheless, it is still possible that hackers could bypass known security policies, or cloud users may accidentally run some malicious software. It is thus critical to be able to detect such anomalies in near real-time to avoid leaving hackers plenty of time to cause significant damage. Hence we need a monitoring solution that could actively detect anomalies, and identify potentially malicious behavior over a large number of VM instances. AWS recently adopted its CloudWatch service for DDoS attacks [3], but it requires the user to check historical data and set a “magic value” as the threshold manually, which is unrealistic if the user’s underlying workloads change frequently.
In contrast, ATOM could automatically learn the normal behavior from previously monitored data, and detect more complex attacks besides DDoS attacks using PCA. PCA has been widely used to detect anomalies in network traffic volume in backbone networks [12], [13], [17], [43], [44], [45]. As we have argued in Section 4.1, adapting a PCA-based approach to our setting has not been studied before and presented significant new challenges.
The security challenges in IaaS systems were analyzed in [7], [46], [47], [48]. Virtual machine attacks are considered a major security threat. ATOM’s introspection component leverages existing open source VMI tools such as StackDB [10] and Volatility [18] to pinpoint the anomaly to the exact process.
VMI is a well-known method for ensuring VM security [49], [50], [51], [52]. It has also been studied for IaaS systems [53], [54], [55]. However, to constantly secure a VM using the VMI technique, the entire VM memory needs to be traversed and analyzed periodically. It may also require the VM to be suspended in order to gain access to VM memory. Blacksheep [19] is such a system, which detects rootkits by dumping and comparing groups of similar machines. Though the performance overhead is claimed to be acceptably low to support real-time monitoring, clearly user programs will be negatively affected. Another solution was suggested [56] for cloud users to verify the integrity of their VMs. However, this is not an “active detection and reaction” system. In contrast, ATOM enables triggering VMI only when a potential attack is identified, and it also helps locate the relevant memory region to analyze and introspect much more effectively and efficiently using its orchestration component.
10 CONCLUSION
We present the ATOM framework that can be easily integrated into a standard IaaS system to provide automated, continuous tracking, monitoring, and orchestration of system resource usage in nearly real-time. ATOM is extremely useful for anomaly detection, auto scaling, and dynamic resource allocation and load balancing in IaaS systems. Interesting future work includes extending ATOM for more sophisticated resource orchestration and incorporating the defense against even more complex attacks in ATOM.
ACKNOWLEDGMENTS
Min Du and Feifei Li were supported in part by grants US National Science Foundation CNS-1314945, CNS-1514520 and US National Science Foundation IIS-1251019. We wish to thank Eric Eide, Jacobus (Kobus) Van der Merwe, Robert Ricci, and other members of the TCloud project and the Flux group for helpful discussion and valuable feedback. The preliminary version of this paper appeared in IEEE BigData 2015 [57].
REFERENCES
[1] Amazon. [Online]. Available: http://www.aws.amazon.com/, Accessed on: Nov. 5, 2016.
[2] ITWORLD. [Online]. Available: http://www.itworld.com/security/428920/attackers-install-ddos-bots-amazon-cloud-exploiting-elasticsearch-weakness, Accessed on: Nov. 5, 2016.
[3] Amazon, “AWS Best Practices for DDoS Resiliency,” [Online]. Available: https://d0.awsstatic.com/whitepapers/DDoS_White_Paper_June2015.pdf, Accessed on: Nov. 5, 2016.
[4] Eucalyptus. [Online]. Available: http://www8.hp.com/us/en/cloud/helion-eucalyptus.html, Accessed on: Nov. 5, 2016.
[5] D. Nurmi, et al., “The eucalyptus open-source cloud-computing system,” in Proc. 9th IEEE/ACM Int. Symp. Cluster Comput. Grid, 2009, pp. 124–131.
[6] M. Du and F. Li, “SPELL: Streaming parsing of system event logs,” in Proc. IEEE Int. Conf. Data Mining, 2016.
[7] W. Dawoud, I. Takouna, and C. Meinel, “Infrastructure as a service security: Challenges and solutions,” in Proc. 7th Int. Conf. Inf. Syst., 2010, pp. 1–8.
[8] D. J. Dean, H. Nguyen, and X. Gu, “UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems,” in Proc. 9th Int. Conf. Auton. Comput., 2012, pp. 191–200.
[9] LibVMI. [Online]. Available: http://libvmi.com/, Accessed on: Nov. 5, 2016.
[10] D. Johnson, M. Hibler, and E. Eide, “Composable multi-level debugging with Stackdb,” in Proc. 10th ACM SIGPLAN/SIGOPS Int. Conf. Virtual Execution Environ., 2014, pp. 213–226.
[11] K. Yi and Q. Zhang, “Multi-dimensional online tracking,” in Proc. 20th Annu. ACM-SIAM Symp. Discrete Algorithms, 2009, pp. 1098–1107.
[12] H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of PCA for traffic anomaly detection,” ACM SIGMETRICS Performance Eval. Rev., vol. 35, pp. 109–120, 2007.
[13] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” in Proc. Conf. Appl. Technol. Archit. Protocols Comput. Commun., 2004, pp. 219–230.
[14] V. Varadarajan, T. Kooburat, B. Farley, T. Ristenpart, and M. M. Swift, “Resource-freeing attacks: Improve your cloud performance (at your neighbor’s expense),” in Proc. ACM Conf. Comput. Commun. Secur., 2012, pp. 281–292.
[15] W. Li, H. H. Yue, S. Valle-Cervantes, and S. J. Qin, “Recursive PCA for adaptive process monitoring,” J. Process Control, vol. 10, pp. 471–486, 2000.
[16] J. E. Jackson and G. S. Mudholkar, “Control procedures for residuals associated with principal component analysis,” Technometrics, vol. 21, pp. 341–349, 1979.
[17] L. Huang, M. I. Jordan, A. Joseph, M. Garofalakis, and N. Taft, “In-network PCA and anomaly detection,” in Proc. Neural Inf. Process. Syst., 2006, pp. 617–624.
[18] Volatility. [Online]. Available: http://www.volatilityfoundation.org/, Accessed on: Nov. 5, 2016.
[19] A. Bianchi, Y. Shoshitaishvili, C. Kruegel, and G. Vigna, “Blacksheep: detecting compromised hosts in homogeneous crowds,” in Proc. ACM Conf. Comput. Commun. Secur., 2012, pp. 341–352.
[20] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, 1996, pp. 226–231.
[21] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “OLTP-Bench: An extensible testbed for benchmarking relational databases,” Proc. VLDB Endowment, vol. 7, pp. 277–288, 2013.
[22] StackDB. [Online]. Available: http://www.flux.utah.edu/software/stackdb/doc/all.html#using-eucalyptus-to-run-qemukvm, Accessed on: Nov. 5, 2016.
[23] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, “ApproxHadoop: Bringing approximations to MapReduce frameworks,” in Proc. 20th Int. Conf. Archit. Support Program. Languages Operating Syst., 2015, pp. 383–397.
[24] M. T. Al Amin, et al., “Social trove: A self-summarizing storage service for social sensing,” in Proc. IEEE Int. Conf. Auton. Comput., 2015, pp. 41–50.
[25] J. Kelley, C. Stewart, N. Morris, D. Tiwari, Y. He, and S. Elnikety, “Measuring and managing answer quality for online data-intensive services,” in Proc. IEEE Int. Conf. Auton. Comput., 2015, pp. 167–176.
[26] X. Wang, U. Kruger, and G. W. Irwin, “Process monitoring approach using fast moving window PCA,” Ind. Eng. Chemistry Res., vol. 44, pp. 5691–5702, 2005.
[27] Amazon, “Amazon cloudwatch,” [Online]. Available: http://aws.amazon.com/cloudwatch/, Accessed on: Nov. 5, 2016.
[28] OpenStack. [Online]. Available: http://www.openstack.org/, Accessed on: Nov. 5, 2016.
[29] OpenStack, “Openstack ceilometer,” [Online]. Available: https://wiki.openstack.org/wiki/Ceilometer, Accessed on: Nov. 5, 2016.
[30] DATADOG. [Online]. Available: https://www.datadoghq.com/, Accessed on: Nov. 5, 2016.
[41] OpenStack, “OpenStack Keystone,” [Online]. Available: http://docs.openstack.org/developer/keystone/, Accessed on: Nov. 5, 2016.
[42] OpenStack, “OpenStack Neutron,” [Online]. Available: https://wiki.openstack.org/wiki/Neutron, Accessed on: Nov. 5, 2016.
[43] X. Li, et al., “Detection and identification of network anomalies using sketch subspaces,” in Proc. 6th ACM SIGCOMM Conf. Internet Meas., 2006, pp. 147–152.
[44] Y. Liu, L. Zhang, and Y. Guan, “Sketch-based streaming PCA algorithm for network-wide traffic anomaly detection,” in Proc. IEEE 30th Int. Conf. Distrib. Comput. Syst., 2010, pp. 807–816.
[45] L. Huang, et al., “Communication-efficient online detection of network-wide anomalies,” in Proc. 26th IEEE Int. Conf. Comput. Commun., 2007, pp. 134–142.
[46] A. S. Ibrahim, J. H. Hamlyn-Harris, and J. Grundy, “Emerging security challenges of cloud virtual infrastructure,” in Proc. APSEC Cloud Workshop, 2010.
[47] L. M. Vaquero, L. Rodero-Merino, and D. Moran, “Locking the sky: A survey on IaaS cloud security,” Computing, vol. 91, pp. 93–118, 2011.
[48] C. R. Li, et al., “Potassium: Penetration testing as a service,” in Proc. 6th ACM Symp. Cloud Comput., 2015, pp. 30–42.
[49] T. Garfinkel, et al., “A virtual machine introspection based architecture for intrusion detection,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2003, pp. 191–206.
[50] J. Pfoh, C. Schneider, and C. Eckert, “A formal model for virtual machine introspection,” in Proc. ACM Workshop Virtual Mach. Secur., 2009, pp. 1–10.
[51] B. Dolan-Gavitt, T. Leek, M. Zhivich, J. Giffin, and W. Lee, “Virtuoso: Narrowing the semantic gap in virtual machine introspection,” in Proc. IEEE Symp. Secur. Privacy, 2011, pp. 297–312.
[52] Y. Fu and Z. Lin, “Space traveling across VM: Automatically bridging the semantic gap in virtual machine introspection via online kernel data redirection,” in Proc. IEEE Symp. Secur. Privacy, 2012, pp. 586–600.
[53] A. S. Ibrahim, J. Hamlyn-Harris, J. Grundy, and M. Almorsy, “CloudSec: A security monitoring appliance for virtual machines in the IaaS cloud model,” in Proc. 5th Int. Conf. Netw. Syst. Secur., 2011, pp. 113–120.
[54] F. Zhang, J. Chen, H. Chen, and B. Zang, “CloudVisor: Retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization,” in Proc. 23rd ACM Symp. Operating Syst. Principles, 2011, pp. 203–216.
[55] H. W. Baek, A. Srivastava, and J. Van der Merwe, “CloudVMI: Virtual machine introspection as a cloud service,” in Proc. IEEE Int. Conf. Cloud Eng., 2014, pp. 153–158.
[56] B. Bertholon, S. Varrette, and P. Bouvry, “Certicloud: A novel TPM-based approach to ensure cloud IaaS security,” in Proc. IEEE 4th Int. Conf. Cloud Comput., 2011, pp. 121–130.
[31] librato. [Online]. Available: https://www.librato.com/. Accessed
on: Nov. 5, 2016. [57] M. Du and F. Li, “ATOM: Automated tracking, orchestration and
[32] D. J. Dean, H. Nguyen, P. Wang, and X. Gu, “PerfCompass: monitoring of resource usage in infrastructure as a service sys-
Toward runtime performance anomaly fault localization for infra- tems,” in Proc. IEEE Int. Conf. Big Data, 2015, pp. 271–278.
structure-as-a-service clouds,” in Proc. 6th USENIX Workshop Hot
Topics Cloud Comput., 2014, pp. 16–16. Min Du received the bachelor’s and master’s
[33] R. Van Renesse , K. P. Birman, and W. Vogels, “Astrolabe: A degrees from Beihang University, in 2009 and
robust and scalable technology for distributed system monitoring, 2012, respectively. She is currently working
management, and data mining,” ACM Trans. Comput. Syst., toward the PhD degree in the School of Comput-
vol. 21, pp. 164–206, 2003. ing, University of Utah. Her research interests
[34] P. Yalagandula and M. Dahlin, “A scalable distributed informa- include big data analytics and cloud security. She
tion management system,” in Proc. Conf. Appl. Technol. Archit. Pro- is a student member of the IEEE.
tocols Comput. Commun., 2004, pp. 379–390.
[35] M. L. Massie, B. N. Chun, and D. E. Culler, “The ganglia distrib-
uted monitoring system: Design, implementation, and experi-
ence,” Parallel Comput., vol. 30, pp. 817–840, 2004.
[36] N. Jain, D. Kit, P. Mahajan, P. Yalagandula, M. Dahlin, and
Y. Zhang, “STAR: Self-tuning aggregation for scalable monitoring,” Feifei Li received the BS degree in computer
in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 962–973. engineering from Nanyang Technological Univer-
[37] J. Liang, X. Gu, and K. Nahrstedt, “Self-configuring information sity, in 2002 and the PhD degree in computer sci-
management for large-scale service overlays,” in Proc. 26th IEEE ence from Boston University, in 2007. He is
Int. Conf. Comput. Commun., 2007, pp. 472–480. currently an associate professor in the School of
[38] Y. Zhao, Y. Tan, Z. Gong, X. Gu, and M. Wamboldt, “Self-correlating Computing, University of Utah. His research
predictive information tracking for large-scale production systems,” interests include database and data management
in Proc. 6th Int. Conf. Auton. Comput., 2009, pp. 33–42. systems and big data analytics. He is a member
[39] D. Moldovan, G. Copil, H.-L. Truong, and S. Dustdar, “MELA: of the IEEE.
Monitoring and analyzing elasticity of cloud services,” in Proc.
IEEE 5th Int. Conf. Cloud Comput. Technol. Sci., 2013, pp. 80–87.
[40] Amazon, “Aws security center,” [Online]. Available: http://aws. " For more information on this or any other computing topic,
amazon.com/security/, Accessed on: Nov. 5, 2016. please visit our Digital Library at www.computer.org/publications/dlib.