
2172 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017

ATOM: Efficient Tracking, Monitoring, and Orchestration of Cloud Resources

Min Du, Student Member, IEEE, and Feifei Li, Member, IEEE

Abstract—The emergence of the Infrastructure as a Service framework brings new opportunities, accompanied by new challenges in auto scaling, resource allocation, and security. A fundamental challenge underpinning these problems is the continuous tracking and monitoring of resource usage in the system. In this paper, we present ATOM, an efficient and effective framework to automatically track, monitor, and orchestrate resource usage in an Infrastructure as a Service (IaaS) system that is widely used in cloud infrastructure. We use a novel tracking method to continuously track important system usage metrics with low overhead, and develop a Principal Component Analysis (PCA) based approach to continuously monitor and automatically find anomalies based on the approximated tracking results. We show how to dynamically set the tracking threshold based on the detection results and, further, how to adjust the tracking algorithm to ensure its optimality under dynamic workloads. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on VMs, guided by analyzed results from tracking and monitoring, to identify malicious behavior inside a VM. We demonstrate the extensibility of ATOM through virtual machine (VM) clustering. The performance of our framework is evaluated in an open source IaaS system.

Index Terms—Infrastructure as a service, cloud, tracking, monitoring, anomaly detection, virtual machine introspection

1 INTRODUCTION

THE Infrastructure as a Service (IaaS) framework is a popular model in realizing cloud computing services. In this model, a cloud provider manages and outsources her computing resources through an IaaS system. For example, Amazon offers cloud service with its Elastic Compute Cloud (EC2) platform [1], which is an IaaS system. While IaaS is an attractive model, since it enables cloud providers to outsource their computing resources and cloud users to cut their cost on a pay-per-use basis, it has raised new challenges in auto scaling, resource allocation, and security.

For example, auto scaling in the IaaS framework is the process to automatically add and remove computing resources based upon the actual resource usage. Cloud users want to pay for more resources only when they need them, and to make the best use of their (paid) resources by evenly distributing their workloads. Auto scaling and load balancing, two critical services provided by Amazon Web Service (AWS) [1] and other IaaS platforms, are designed to address these issues. A critical module in achieving auto-scaling and load balancing is the ability to monitor resource usage from many virtual machines (VMs) running on top of EC2. In Amazon cloud, resource usage information needs to be collected and reported back to a cloud controller, not only for the cloud controller to make various administrative decisions, but also for cloud users to query.

Security is another paramount issue while using an IaaS system. For example, it was reported in late July 2014 that adversaries attacked the Amazon cloud by installing distributed denial-of-service (DDoS) bots on user VMs, exploiting a vulnerability in Elasticsearch [2]. Resource usage data could provide critical insights to address security concerns. Thus, a cloud provider needs to constantly monitor resource usage and utilize these statistics not only for resource allocation, but also for anomaly detection in the system. Until now, the best practices for mitigating DDoS and other attacks in AWS include using CloudWatch to create simple threshold alarms on monitored metrics and alert users of potential attacks [3]. In our work we show how to detect these anomalies automatically while saving users the trouble of setting magic threshold values.

These observations illustrate that a fundamental challenge underpinning several important problems in an IaaS system is the continuous tracking and monitoring of resource usage in the system. Furthermore, several applications (e.g., security) also need intelligent and automated orchestration of system resources, by going beyond passive tracking and monitoring, and introducing auto-detection of abnormal behavior in the system, and active introspection and correction once an anomaly has been identified and confirmed. This motivates us to design and implement ATOM, an efficient and effective framework to automatically track, orchestrate, and monitor resource usage in an IaaS system.

A Motivating Example. Eucalyptus [4], [5] is an open source cloud software that provides an AWS-compatible environment and interface. A simplified architecture of Eucalyptus, similar to other IaaS systems, is shown in Fig. 1. Cloud users interact with the cloud controller (CLC) to issue requests such as to allocate resources and query resource usage. CLC handles incoming user requests, collects

The authors are with the School of Computing, University of Utah, Salt Lake City, UT 84112. E-mail: {mind, lifeifei}@cs.utah.edu.
Manuscript received 19 May 2016; revised 9 Nov. 2016; accepted 30 Dec. 2016. Date of publication 16 Jan. 2017; date of current version 14 July 2017.
Recommended for acceptance by X. Gu.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2017.2652467
1045-9219 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on March 31,2024 at 11:48:31 UTC from IEEE Xplore. Restrictions apply.

information of the entire cloud, makes high-level decisions, and controls other components such as the cluster controller (CC) and node controller (NC). A CC forwards requests from the CLC to a NC, gathers status data on each NC, and reports back to the CLC. A NC controls the VMs running on it. One CLC controls several CCs, and each CC could in turn control several NCs, on which multiple user VMs could be running. Note that only one CLC exists on each cloud.

Fig. 1. A simplified architecture of Eucalyptus.

Eucalyptus provides an AWS-like service called CloudWatch. CloudWatch is able to monitor resource usage of each VM. To reduce overhead, such data are only collected from each VM at every minute, and then reported to the CLC through a CC. Clearly, gathering resource usage in real time introduces overhead in the system (e.g., communication overhead from a NC to the CLC). When there are plenty of VMs to monitor, the problem becomes even worse and will bring significant overhead to the system. CloudWatch addresses this problem by collecting measurements only once every minute, but this provides only a discrete, sampled view of the system status and is not sufficient for providing continuous understanding and protection of the system.

Another limitation in existing approaches like CloudWatch is that they only do passive monitoring. No active online resource orchestration is in place towards detecting system anomalies, potential threats, and attacks. We observe that, e.g., in the aforementioned DDoS attack on the Amazon cloud, alarming signals can be learned automatically from resource usage data, which are ready to analyze without any pre-processing, unlike system logs [6]. Active online resource monitoring and orchestration is very useful in achieving a more secure and reliable system. Active online resource monitoring gives us the opportunity to trigger VM introspection to debug the system and figure out what has possibly gone wrong. The introspection into VMs then allows us to orchestrate resource usage and allocation in the IaaS system to achieve a more secure system and/or better performance. Note that VM introspection is expensive. Without continuous tracking and online monitoring and orchestration, it is almost impossible to figure out when to do VM introspection and what specific target to introspect in a host VM. Our goal is to automate this process and trigger VM introspection only when needed. We refer to this process as resource orchestration.

Our Contribution. Motivated by these discussions, we present the ATOM framework. ATOM is an end-to-end framework that could be easily plugged into an IaaS system, to provide automated tracking, orchestration, and monitoring of resource usage for a potentially large number of VMs running on an IaaS cloud, in an online fashion.

ATOM introduces an online tracking module that runs at NC and continuously tracks various performance metrics and resource usage values of all VMs. The CLC is denoted as the tracker, and the NCs are denoted as the observers. The goal is to replace the sampled view at the CLC with a continuous understanding of system status, with minimum overhead.

ATOM then uses an automated monitoring module that continuously monitors the resource usage data reported by the online tracking module. The goal is to detect anomalies by mining the resource usage data. This is especially helpful for detecting attacks that could cause changes in resource usage, for example, one VM consuming all available resources and starving all other VMs running on the same physical computer [7]. The baseline for online monitoring is to simply define a threshold value for any metric of interest. Clearly, this approach is not very effective against dynamic and complex attacks and anomalies. ATOM uses a dynamic online monitoring method that is developed based on PCA. We design a PCA-based method that continuously analyzes the dominant subspace defined by the measurements from the tracking module, and automatically raises an alarm whenever a shift in the dominant subspace has been detected. Even though PCA-based methods have been used for anomaly detection in various contexts, a new challenge in our setting is to cope with approximate measurements produced by online tracking, and to design methods that are able to automatically adapt to and adjust for the tracking errors.

Lastly, virtual machine introspection (VMI) is used to detect and identify malicious behavior inside a VM. VMI techniques such as analyzing VM memory space tend to be of great cost. If we don't know where and when an attack might have happened, we will need to go through the entire memory constantly, which is clearly expensive, especially when there are many VMs to be analyzed. ATOM provides two options here. The first option is to set a threshold for each resource usage measure (the baseline as discussed above); we consider there may be an anomaly if the reported value is beyond (or below) the threshold for that measure, and trigger a VMI. This is the method that existing systems like AWS and Eucalyptus have adopted for auto scaling tasks. The second option is to use the online monitoring method in the monitoring module to automatically detect anomalies and trigger a VMI, as well as to guide the introspection to specific regions in the VM memory space based on the data from online monitoring and tracking. We denote the second method as orchestration.

Comparison with UBL. UBL [8] stands for Unsupervised Behavior Learning, which is designed for monitoring virtualized cloud systems. It collects resource usage data from each VM, and trains Self-Organizing Maps (SOM) using normal data to predict future performance anomalies. UBL shows that SOM is an effective learning method for VM statistics and has better prediction accuracy compared with PCA/KNN in some experiments [8].

That said, note that ATOM is an end-to-end framework that integrates online tracking, online monitoring,

and orchestration (for VM introspection) into one framework, whereas UBL focuses on anomaly detection in performance data without the integration of tracking and orchestration. Hence, UBL is "equivalent" to the monitoring component in ATOM.

More specifically, UBL can be plugged/integrated into ATOM's monitoring component as an alternative anomaly detection method to be more effective in capturing different types of anomaly. Note that the PCA-based approach has the advantage of enabling us to analyze the theoretical bounds when there are bounded tracking errors present in the continuously tracked measurements returned by the tracking component. UBL is an empirical method which may perform really well on some instances, but it remains an open problem to theoretically study its performance, especially with approximate measurements, when used together with ATOM's tracking module. The PCA-based approach also allows us to adjust the tracking threshold automatically in an online fashion by only adjusting the false alarm rate, as later shown in Section 5, where we establish the theoretical connection between the false alarm rate and the tracking threshold.

Paper Organization. The rest of this paper is organized as follows. Section 2 gives an overview of the design of ATOM, and the threat model it considers. Sections 3 and 4 describe the online tracking and online monitoring modules in ATOM. We further demonstrate the interaction between the tracking component and the monitoring component in Section 5. Section 6 introduces the orchestration module. Section 7 shows an extension on VM clustering using the ATOM framework. Section 8 evaluates ATOM using the Eucalyptus cloud and shows its effectiveness. Lastly, Section 9 reviews the related work, and Section 10 concludes the paper.

Fig. 2. The ATOM framework.

2 THE ATOM FRAMEWORK

Fig. 2 shows the ATOM framework. For simplicity, only one CC and one NC are shown in this example. ATOM adds three components to an IaaS system like AWS and Eucalyptus:

(1) Tracking component: ATOM adapts the optimal online tracking algorithm for one-dimension online tracking inside the monitoring service on NCs. This dramatically reduces the overhead used to monitor cloud resources and enables continuous measurements to CC and CLC;
(2) Monitoring component (anomaly detection): ATOM adds this component in CLC to analyze tracking results from the tracking component, which provides continuous resource usage data in real time. It uses a modified PCA method to continuously track the divided subspace, as defined by the multi-dimensional values from the tracking results, and automatically detects anomalies by identifying a notable shift in the interesting subspace. It also generates anomaly information for further analysis by the orchestration component when this happens. The monitoring component also adjusts the tracking threshold of the tracking component dynamically online, based on the data trends and a desired false alarm rate.
(3) Orchestration component (introspection and debugging): when a potential anomaly is identified by the monitoring component, an INTROSPECT request along with anomaly information is sent to the orchestration component on NC, in which VMI tools (such as LibVMI [9]) and VM debugging tools (such as StackDB [10]) are used to identify the anomalous behavior inside a VM and raise an alarm to cloud users for further analysis.

In the following sections we investigate each component in further detail. Table 1 lists some frequently-used notations.

TABLE 1
Frequently Used Notations

Symbol    Definition
Δ         tracking threshold
γ         finest resolution for floating point values
t         number of time instances in a sliding window
n         number of monitored VMs
d'        number of metrics for each VM
d         d' × n
M         data matrix (t × d) of the most recent monitored data
avg_j     mean of the jth column of M
std_j     standard deviation of the jth column of M
Y         standardized M, each value y_{i,j} = (m_{i,j} - avg_j)/std_j
t_now     current time-stamp
A         consecutively abnormal data from t_now - t to t_now
B         standardized A
z         the metric vector monitored at t_now (with d dimensions)
x         standardized z
v_i       the ith eigenvector output by PCA
λ_i       the ith eigenvalue output by PCA
k         number of principal components output by PCA
α         input false alarm rate in PCA anomaly detection
Q_α       PCA anomaly detection threshold
μ         false alarm rate deviation, used to control the tracking threshold

2.1 Threat Model
ATOM provides real-time tracking and monitoring of cloud resource usage in an IaaS system. It further aims to detect and prevent attacks that could cause a notable change in resource usage from its typical subspace.

To that end, we need to formalize a threat model. We assume cloud users to be trustworthy, but they might accidentally run some malicious software out of ignorance.

Also, despite various security rules and policies that are in place, it's still possible that a smart attacker could bypass them and perform malicious tasks. The malicious behavior could very likely cause some change in resource usage. Note that, however, this is not necessarily always accompanied by more resource consumption! Some attacks could actually lead to less resource usage, or simply different ways of using the same amount of resources on average. All these attacks are targeted by the ATOM framework. The possibility of incorporating other types of attacks into ATOM is discussed in Sections 6 and 8.7.

3 TRACKING COMPONENT

This section introduces the tracking component in ATOM. Consider Eucalyptus CloudWatch as an example, which is an AWS CloudWatch compatible monitoring service that enables cloud users to monitor their cloud resources and make operational decisions based on the statistics. CloudWatch is capable of collecting, aggregating, and dispensing data from resources such as VMs and storage. Cloud users can specify what they would like to monitor, and then query the history data for up to two weeks through the interface in the CLC. They can also set an alarm (essentially, a threshold) for a specific measure, and be notified or let it trigger some predefined action if the alarm conditions are met. Clearly, collecting such statistics continuously is expensive. Thus, the default in Eucalyptus and AWS is to ask a NC to only send measurements to the CLC at some predefined interval, e.g., once every minute in Eucalyptus.

A user VM in Eucalyptus is called an instance. In the following we will use the terms "instance" and "VM" interchangeably. There are various variables that can be monitored over time on each instance, each of which is called a metric. The measurement for each metric, for example, Percent for CPUUtilization, Count for DiskReadOps and DiskWriteOps, and Bytes for DiskReadBytes, DiskWriteBytes, NetworkIn, and NetworkOut, is called a Unit and is numerical.

A continuous understanding of these values is much more useful than a periodic, discrete sampled view that is only available, say, every minute. But doing so is expensive; a NC needs to constantly send data to the CLC. A key observation is that, for most purposes, cloud users may not be interested in the exact value at every time instance. Thus, a continuous understanding of these values within some predefined error range is an appealing alternative. For example, it's acceptable to learn that CPUUtilization is guaranteed to be within 3 percent of its exact value at any time instance.

This way a NC only sends a value whenever the newest one is more than Δ away from the last sent value on a measurement, where Δ is a user-specified, maximum allowed error on this measurement. The CLC could use the last received value as an acceptable approximation for all values in-between. In practice, oftentimes certain metrics on a VM do not change much over a long period, so far fewer values need to be sent to the CLC. Not only can we save the communication overhead from NC to the CLC, but also the database space on the CLC used to store every value reported by a NC (so that the history data could be kept for much longer than two weeks). Furthermore, instead of having only a sampled view at every minute, users now can query values at any time instance in the entire history that is available.

But unfortunately, this seemingly natural idea may perform very badly in practice. In fact, in the worst case, its asymptotic cost is infinite in terms of competitive ratio over the optimal offline algorithm that knows the entire data series in advance. For example, suppose the first value a NC observes is 0, and then it oscillates between 0 and Δ + 1. Then the NC continues to send 0 and Δ + 1 to the CLC, while the optimal offline algorithm, which knows the entire f(t) at the beginning, could send only one message to the CLC: the value Δ/2. Formally, this is known as the online tracking problem, which is formalized and studied in [11]. In online tracking, an observer observes a function f(t) in an online fashion, which means she sees f(t) for any time t before the current time (including the current time). A tracker would like to keep track of the current function value within some predefined error. The observer needs to decide when and what value she needs to send to the tracker so that the communication cost is minimized.

Suppose f : Z+ -> Z is the function the observer observes over time. g(t) stands for the value she chooses to send to the tracker at time t. The predefined error is Δ, which means at any time t_now, if the observer does not send a new value g(t_now) to the tracker, then it must satisfy ||f(t_now) - g(t_last)|| <= Δ, where g(t_last) is the last value the tracker received from the observer. This is online tracking over a one-dimensional positive integer function.

Instead of the naive algorithm shown above, Yi and Zhang provide an online algorithm that is proved to be optimal with a competitive ratio of only O(log Δ); that means in the worst case, its communication cost is only O(log Δ) times worse than the cost of the offline optimal algorithm that knows the function f(t) for the entire time domain [11]. Unfortunately, however, that algorithm works only for integer values. We observe that in reality, especially in our setting, real values (e.g., "double" for CPUUtilization) need to be tracked instead. To that end, we adapt the algorithm from [11], and design Algorithm 1 to track real values continuously in an online fashion. The algorithm performs in rounds. A round ends when S becomes an empty set, and a new round starts.

Algorithm 1. One Round of Online Tracking for Real Values

let S = [f(t_now) - Δ, f(t_now) + Δ];
while S.upperbound - S.lowerbound > γ do
    g(t_now) = (S.upperbound + S.lowerbound)/2;
    send g(t_now) to tracker;
    wait until ||f(t_now) - g(t_last)|| > Δ;
    S.upperbound = min(S.upperbound, f(t_now) + Δ);
    S.lowerbound = max(S.lowerbound, f(t_now) - Δ);
end while /* this algorithm is run by the observer */

The central idea of our algorithm is to always send the median value of the range of possible valid values, denoted by S, whenever f(t_now) has changed by more than Δ (which could be non-integer) from g(t_last). The next key observation is that any real domain in a system must have a finite precision. Suppose γ is the finest resolution for the floating point values being tracked in the algorithm.
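As a concrete sketch of this round-based protocol, the following Python fragment illustrates one round of the real-valued tracking algorithm. It is a simplified illustration only, with hypothetical observe/send callbacks standing in for the NC's metric reader and its channel to the CLC; it is not the authors' implementation.

```python
def track_one_round(observe, send, delta, gamma):
    """One round of online tracking for real values (a sketch of Algorithm 1).

    observe() returns the current observed value f(t_now); send(v) reports
    g(t_now) to the tracker. delta is the tracking threshold; gamma is the
    finest resolution of the tracked floating point values.
    """
    f = observe()
    # S = [f(t_now) - delta, f(t_now) + delta]
    lo, hi = f - delta, f + delta
    g_last = None
    while hi - lo > gamma:          # the round ends once S is exhausted
        g_last = (hi + lo) / 2.0    # always send the median of S
        send(g_last)
        # block until the observation drifts more than delta from g_last
        f = observe()
        while abs(f - g_last) <= delta:
            f = observe()
        # shrink S so it stays consistent with the latest observation
        hi = min(hi, f + delta)
        lo = max(lo, f - delta)
    return g_last
```

For instance, with delta = 1 and gamma = 0.5, the observation stream 0, 0.5, 1.5 triggers exactly one message: the median 0 of the initial range [-1, 1], after which the shrunken candidate set is already no wider than gamma.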

Then at the beginning of each round, the number of possible values within S is 2Δ/γ, and since S is a finite set, it always becomes an empty set at some step following the above algorithm. As long as S contains a finite number of elements in Algorithm 1, we can show its correctness and optimality, with a competitive ratio of only O(log(Δ/γ)), for online tracking of real values.

Theorem 1. Algorithm 1 is correct and optimal for tracking double values, and has a competitive ratio of log(Δ/γ), where γ is the finest precision for floating point values.

Proof. Since γ is the finest resolution for the floating point values being tracked in the algorithm, by multiplying every possible value in region S by the integer 1/γ, all the values become integers. Therefore S becomes a region of integers, and all values we could choose to send to the tracker are integers. Now we can adapt the proof for tracking integers to prove the correctness and optimality of Algorithm 1, and compute its competitive ratio. We denote the online algorithm as A_SOL and the offline algorithm as A_OPT.

Correctness. The correctness is obvious, since Alice sends a value to Bob whenever the observation exceeds the threshold Δ, and the value sent is within Δ of the observed value.

Competitive Ratio. The competitive ratio follows from two facts. In each round: i) A_SOL sends at most log(Δ/γ) messages. This is because the cardinality of S decreases by half each time, and the initial range of S is 2Δ/γ. ii) A_OPT sends at least one message. S is maintained as the intersection ∩_t [f(t) - Δ, f(t) + Δ] for t up to t_now in the current round. If no value had been sent in this round, then the value (call it y) sent at the end of the last round would be within Δ of all observations in the current round, which makes y still lie in range S, a contradiction to the fact that S becomes empty in the end.

Optimality. The optimality holds because any online algorithm needs to send at least log(Δ/γ) messages in an extreme case. Suppose an adversary Carole operates the function f. Whenever Alice sends some value to Bob, if the value is above the median of S, Carole decreases f until Alice sends a new value; otherwise Carole increases f until Alice sends a new one. This way the cardinality of S decreases by at most half each time, so any online algorithm needs to send at least log(Δ/γ) messages, whereas A_OPT only needs to send out one value, at the beginning of each round, that lies within the final S until the current round ends. In this case the lower bound of the competitive ratio is log(Δ/γ). Hence the optimality of Algorithm 1 is proved.

The competitive ratio for Algorithm 1 thus becomes O(log(Δ/γ)), which is optimal among all online tracking functions for floating point values. □

In an IaaS system, a NC obtains the values for a metric of interest and acts as an observer for these values, and then chooses what to send to the CLC by following Algorithm 1. The CLC, as the tracker, simply stores the values into its local database whenever a value is reported from a NC. This is how ATOM's tracking module is able to save the network communication overhead from NC to CLC, and the storage overhead in the CLC. Note that the tracking algorithm is applied independently per dimension, meaning that the more VMs being tracked/monitored, the more savings ATOM will lead to, as evaluated in Section 8.4.

Fig. 3. An example of PCA anomaly detection in 2-dimensional space.

4 MONITORING COMPONENT

With the continuously tracked values of various metrics, compared with having only discrete, sampled views on these metrics, ATOM is able to do a much better job in monitoring system health and detecting anomalies.

To find anomalies in real time, a naive method is to use the threshold approach. For example, Eucalyptus and AWS CloudWatch allow users to set an alarm along with an alarm action that can be triggered if the alarm condition is met. The alarm action is optional, and could be some predefined auto scaling policy such as changing disk capacity. The alarm condition consists of a threshold value T on a metric E of interest. The condition is met when the value v_t from the metric E has exceeded T (or gone below T) at some time instance t. However, in practice, it is very hard for cloud users to set a magic value as the threshold for a metric that will be effective in a dynamic environment like that in an IaaS system. Besides, it's inconvenient to change the threshold for each metric every time a user does some different tasks (which may invalidate the old threshold value). Thus an automated monitoring method would be very useful.

4.1 An Overview of the PCA Method
Given a data matrix in R^d, some dimensions of which are possibly correlated, the PCA method can transform this matrix into a new coordinate system, where each dimension is orthogonal. By mapping the original matrix onto the new coordinate system, we get a set of principal components. The first principal component points to the direction with the largest variance, and the following principal components each point to the largest-variance direction that is orthogonal to all the previous ones. The intuition for using PCA as an anomaly detection method is that abnormal data points most likely do not fit the correlation between dimensions in the original space. Thus, by transforming the data matrix onto a new space using PCA, the original anomaly point would have a large projection length on the axes supposed to have very small variance (the so-called "residual subspace" in our following analysis). This way, anomalies can be detected by analyzing the projection length onto these axes. A simple example when d = 2 is shown in Fig. 3. PCA rotates the original coordinates into a new space, where the first axis points

to the direction having the largest data variance while the remaining axis forms the residual subspace. The abnormal data point is detected by comparing its projection length onto the residual subspace (second axis) against a threshold (detailed in Section 4.3.3). Using PCA for anomaly detection has been widely studied in the context of network traffic analysis and monitoring, e.g., [12], [13].

To the best of our knowledge, there is no prior work in adapting PCA for online monitoring and anomaly detection over VMs in an IaaS system. That said, there are three new challenges that we need to address: 1) unlike most existing work that uses PCA for anomaly detection in an offline batch setting [13], ATOM needs to do online monitoring; 2) once an anomaly is identified, ATOM needs to figure out which metrics from which VM instance(s) might have caused the anomaly; 3) the input data to ATOM's online monitoring module are approximate results from the tracking module, which have an error that is bounded by Δ. We need to take such tracking errors into account in the analysis. Next we will explain our method in detail.

4.2 The Data Matrix

using the latest PCA model; and if the newest time instance z is normal, move it to M and update the PCA model; otherwise move it to A in case it doesn't agree with the residual subspace; (5) if z is abnormal, do metrics identification to find which metrics of which VM instances might have caused the anomaly. Step (1) is trivial by the definition of Y. The details of steps (2) to (5) are as follows.

4.3.1 Building the PCA Model
To build the PCA model, we perform eigenvalue decomposition on the covariance matrix of Y, and get a set of eigenvectors V = (v_1, v_2, ..., v_d) sorted by their eigenvalues. These eigenvectors form the new axes in the transformed coordinate system, with the first principal axis v_1 pointing to the direction that has the largest variance in Y, and the following principal axes each pointing to the largest-variance direction orthogonal to previous ones. The corresponding eigenvalues are λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0.

4.3.2 Find the Residual Subspace
We define the principal subspace and the residual subspace
Given d0 metrics reported by the tracking module for each spanned by the first several principal axes in V, while resid-
VM and t is the length of a time-based sliding window, PCA ual subspace Se stands for the space spanned by the rest.
could be performed on these data which form a t  d0 matrix. The number of significant principal components in the prin-
A more general and more interesting case is to perform cipal subspace is denoted as k. Hence, the first k eigen vec-
online monitoring over a data matrix composed of multiple tors form the principal subspace, and the rest ðd  kÞ eigen
VMs’ data, e.g., d ¼ d0  n dimensions. For VMs hosted on vectors form the residual subspace that could be used to
the same physical node, or even the same cloud, it’s quite detect anomalies. Of numerous methods to determine k, we
possible that one VM may attack another [14], or some VMs choose cumulative percent variance (CPV) method [15] for
are attacked by the same process simultaneously. Detecting its ease of computation and good performance in practice as
anomaly on a d-dimensional space makes it easier to dis- shown by previous work. For the first ‘ principal compo-
cover such correlations. It also provides better detection P‘
i
accuracy. Performing PCA on multiple VMs’ statistics nents, CPV ð‘Þ ¼ Pi¼1d  100%; and we choose k to be:

i¼1 i
yields a higher residual dimension space, leading to more
k ¼ arg min‘ ðCPV ð‘Þ > 90%Þ.
accurate anomaly detection.
Recall that ATOM’s tracking module ensures that at any
time point t, for each metric E, CLC can obtain a value v0t 4.3.3 Anomaly Detection
that is within vt  D, where vt is the exact value of this met- Unlike previous methods, e.g., [13], that perform offline,
ric at time t from a VM instance of interest. Next we will batched backbone network anomaly detection, we are not
show how to design an online PCA method to detect anom- required to detect anomalies for every row in M. Instead, we
aly using a t  d matrix M. Each data value in this matrix is only need to check the newest vector z at tnow . That’s because
guaranteed to be within D of the true exact value for the we have classified data into the (normal) data matrix M and
same metric at that same time instance. the abnormal matrix A, and the real-time detection of ongo-
ing anomalies is based on the PCA model built from M.
4.3 Our approach To do this, we first standardize z using the mean and
standard deviation of each column in M. We use x to denote
The following matrices are used in our construction: M, Y,
the standardized vector.
A, B, whose definitions could be found in Table 1.
Given the normal subspace S : P1 ¼ ½v1 ; . . . ; vk , and the
At first, a standard, offline batch PCA analysis [13] is
residual subspace Se : P2 ¼ ½vkþ1 ; . . . ; vd , x is divided into
applied to the data using the newest t time instances to find
two parts by being projected on these two subspaces
potential anomalies. If anomalies are found, we eliminate
data corresponding to those time instances, and use the rest x¼^
xþ~
x ¼ P1 P1 T x þ P2 P2 T x:
as the initial data matrix M to find the residual subspace Se
through a regular PCA analysis. Afterwards, for each z at If z is normal, it should fit the distribution (e.g., mean
tnow , we use the latest residual subspace Se to perform anom- and variance) of the normal data. Moreover, the values of ~ x,
aly detection. which are the projection onto P2 by x, are supposed to be
In summary, our monitoring method has five steps: (1) small. Specifically, we define the squared prediction error
process data from M to form Y; (2) build the PCA model (SPE) to quantify this:
based on Y; (3) find the residual subspace of the PCA model; 2 2
(4) do anomaly detection for data at each new time instance xjj2 ¼ jjP2 P2 T xjj ¼ jjðI  P1 P1 T Þxjj :
SPEðxÞ ¼ jj~
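Steps (2)-(4) above — building the PCA model from the covariance matrix, picking $k$ by the CPV rule, and scoring a new observation by its residual projection — can be sketched with numpy as follows. This is a minimal illustration under our own naming; the incremental model update and the per-VM matrix bookkeeping are omitted:

```python
import numpy as np

def build_pca_model(M, cpv_threshold=0.90):
    """Eigendecompose the covariance of the standardized data matrix M
    and split the eigenvectors into principal (P1) and residual (P2)
    subspaces using the cumulative-percent-variance (CPV) rule."""
    mean, std = M.mean(axis=0), M.std(axis=0)
    Y = (M - mean) / std                          # standardized matrix Y
    eigvals, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cpv = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.argmax(cpv > cpv_threshold)) + 1   # smallest l with CPV(l) > 90%
    return mean, std, eigvecs[:, :k], eigvecs[:, k:]

def spe(z, mean, std, P2):
    """Squared prediction error of a new observation z: the squared
    length of its projection onto the residual subspace P2."""
    x = (z - mean) / std
    x_res = P2 @ (P2.T @ x)
    return float(x_res @ x_res)
```

A vector whose SPE is large relative to that of the normal data would be flagged and moved to the abnormal matrix $A$.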
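One way to compute the SPE alarm threshold $Q_\alpha$ from the residual-subspace eigenvalues is the classic normal approximation (the result the text below cites as [16]). A sketch with illustrative names; `q_alpha` and its parameters are ours:

```python
import math
from statistics import NormalDist

def q_alpha(residual_eigvals, alpha=0.01):
    """SPE threshold Q_a from the eigenvalues lambda_{k+1}..lambda_d of the
    residual subspace, via the Jackson-Mudholkar normal approximation."""
    t1, t2, t3 = (sum(l ** i for l in residual_eigvals) for i in (1, 2, 3))
    h0 = 1.0 - (2.0 * t1 * t3) / (3.0 * t2 ** 2)
    c_a = NormalDist().inv_cdf(1.0 - alpha)        # (1 - alpha) percentile
    term = (c_a * math.sqrt(2.0 * t2 * h0 ** 2)) / t1 \
           + 1.0 + (t2 * h0 * (h0 - 1.0)) / t1 ** 2
    return t1 * term ** (1.0 / h0)
```

A smaller false alarm rate $\alpha$ yields a larger (more conservative) threshold.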
Let $Q = \|\tilde{x}\|^2$. A classic result for the PCA model is that the following variable $c$ approximately follows a standard normal distribution with zero mean and unit variance [16]:

$$c = \frac{\theta_1\left[(Q/\theta_1)^{h_0} - 1 - \theta_2 h_0 (h_0 - 1)/\theta_1^2\right]}{\sqrt{2\theta_2 h_0^2}}, \qquad (1)$$

where $\theta_i = \sum_{j=k+1}^{d} \lambda_j^i$ for $i = 1, 2, 3$, and $h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2}$.

We consider $x$ to be abnormal if $SPE(x) > Q_\alpha$, where the threshold $Q_\alpha$ is derived from the distribution of $c$:

$$Q_\alpha = \theta_1\left[\frac{c_\alpha\sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2}\right]^{1/h_0},$$

and $c_\alpha$ is the $(1 - \alpha)$ percentile of the standard normal distribution, with $\alpha$ being the false alarm rate.

Finally, if $z$ is normal, we add it to $M$, delete the oldest data in $M$, and update the PCA model accordingly. Otherwise it is added to $A$, and the corresponding standardized $x$ is moved to matrix $B$. Matrices $A$ and $B$ need to contain time-consecutive data only (so that we detect an anomaly corresponding to a continuous event); thus, they are cleared if their last vector is not consecutive in time with the new incoming vector.

4.3.4 Metrics Identification
When an anomaly is detected, we need to do further analysis to identify which metrics on which VM instance(s), out of the $d = d_0 \times n$ dimensions, might have caused the anomaly, in order to assist the orchestration module. Our identification method consists of three steps. It compares the abnormal data matrix $A$ (and the corresponding standardized matrix $B$) with the normal matrix $M$ (and $Y$). Suppose there are $m$ vectors in $A$ ($B$) and $t$ vectors in $M$ ($Y$).

Step 1. Since the anomaly is detected through $\|\tilde{x}\|^2$, it is natural to compare the residual data between $B$ and $Y$. Suppose $y_i$ is the transpose of the $i$th row vector in $Y$, and $\tilde{y}_i = P_2 P_2^T y_i$ is its residual traffic; then

$$(\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_t)^T = (P_2 P_2^T (y_1, y_2, \ldots, y_t))^T = Y P_2 P_2^T$$

forms a residual matrix of $Y$, denoted as $Y_r$. Similarly, $A_r = A P_2 P_2^T$. For each dimension $j \in [1, d]$, let

$$a_j = \frac{1}{m}\|(A_r)_j\|^2 \quad \text{and} \quad y_j = \frac{1}{t}\|(Y_r)_j\|^2,$$

where $(A_r)_j$ is the $j$th column in $A_r$ and $(Y_r)_j$ is the $j$th column in $Y_r$. Then $rd_j = (a_j - y_j)/y_j$.

Step 2. If for some dimension $j$, $rd_j \geq \beta_1$ for some constant $\beta_1$, we measure the change between $A$ and $M$. In particular, for each such dimension $j$, we calculate how far the abnormal data in $A$ deviate, in units of the standard deviation, from the normal data along that dimension in $M$. Specifically, we calculate $stddev_j = \frac{1}{m}\sum_{i=1}^{m} |a_{ij} - avg_j|/std_j$. A dimension $j$ is considered abnormal if $stddev_j \geq \beta_2$ for some constant $\beta_2$. In practice, we find that setting $\beta_1$ and $\beta_2$ to small positive integers works well, say $\beta_1 = 2$ and $\beta_2 = 3$.

Step 3. For a dimension $j$ that has been considered abnormal in Step 2, we measure the difference between the means of the abnormal and the normal data. Specifically, we measure $meandiff_j = (\frac{1}{m}\sum_{i=1}^{m} a_{ij} - avg_j)/avg_j$.

Step 1 reveals which dimension has a larger projection on the residual subspace than the normal data; however, it is hard to map such a change back to the original data. Furthermore, as shown in Section 8.2, this measure is not highly reliable and could be omitted to save some computation cost. Step 2 is a useful measure to show which dimension has a significantly different pattern compared to the normal data. However, it does not tell us whether some metric usage goes up or down. Thus we use Step 3 at last to find this pattern. Step 3 by itself is not good enough to indicate a pattern, because the oscillation of metric usage statistics might make the mean of some dimension in $A$ appear benign. Thus, the outputs of Steps 2 and 3 are sent together, along with an introspection request, to the orchestration module on the corresponding NC(s) that administrates the identified VM instance(s). Section 8 shows how the information identified in these three steps facilitates the orchestration module in finding a "real cause" of the detected anomaly.

4.3.5 Other Remarks
Raising Alarms to Cloud Users. Once a data vector is detected as abnormal, it is moved to the abnormal data matrix, on which metrics identification is performed. Suppose there are in total $m$ vectors in the abnormal data matrix $A$; an alarm will then be raised with alarm level $m$. The alarm level indicates how serious the detected anomaly is; intuitively, the larger the number of data vectors contained in $A$, the longer the duration of the currently detected anomaly. The alarm can be raised either right after the metrics identification step, or after the virtual machine inspection by the orchestration module has finished (so that more information is gathered). The alarm notifies the user of the potential abnormal behavior in the IaaS system and lets the user decide whether the ongoing behavior on his/her VM(s) is normal. If the cause is that the tasks on a VM have changed, the corresponding data vectors in the abnormal matrix should be moved to the normal data matrix and used to build the PCA model, to accommodate and reflect the new behavior. The abnormal data matrix is cleared once the anomaly on the VM is removed, or once it is identified as normal by the cloud user.

Scalability. The computation complexity of the monitoring module is evaluated in Section 8.4 (Fig. 11). Although its computation cost increases with the number of VMs, it remains a very small overhead: the average computation cost per sliding window for the monitoring module is less than 3 milliseconds in most cases for up to 6 VMs. What's more, due to the significant message savings from ATOM's tracking module, both the PCA-based computation overhead and the Eucalyptus storage overhead are reduced significantly. A larger number of VMs can also significantly improve the detection accuracy, i.e., yield smaller false alarm rates, because the monitoring component then uses a larger data matrix that helps find the normal subspace more reliably, as also evaluated in Section 8.4 (Fig. 11).

5 INTERACTION BETWEEN TRACKING AND MONITORING COMPONENTS

5.1 Deriving the Tracking Error Threshold
As mentioned earlier, the input data to the monitoring module are produced by the tracking module, and each value may contain an approximation error of at most $\Delta$ (away from the true value at that time instance for that metric). The approximation error introduced by the tracking module may degrade the quality of ATOM's monitoring module. Thus, a formal analysis is needed to bound the effect of tracking errors and to show how to set a proper value for the error threshold $\Delta$ of each metric in the tracking module.

As shown in Section 4.3.3, the random variable $c$ follows a normal distribution, and the SPE threshold $Q_\alpha$ is computed once an $\alpha$ value is specified. However, we do not have $c$ from the exact data matrix; instead, the approximate data matrix leads to the value $\hat{c}$. The SPE threshold is computed using a user-specified $\alpha$ value; however, the threshold calculated from the approximated matrix does not represent the confidence limit $1 - \alpha$ anymore, but a corresponding approximation $1 - \hat{\alpha}$. We want to understand the relationship between $\hat{\alpha}$ and $\alpha$. Formally, the cloud user specifies $\alpha$ and a maximally allowed deviation rate $\mu$ such that our tracking and monitoring methods guarantee $|\hat{\alpha} - \alpha| \leq \mu$ (even though $c$ is unknown). Thus, we need to establish the relationship between $\mu$ and the tracking error threshold $\Delta$ for each metric dimension used by the tracking module [17].

We achieve this objective in two steps: 1) given $\mu$, find an approximate error bound $\epsilon$ on the average eigenvalue produced by PCA; 2) once we have the error bound $\epsilon$ on the eigenvalues, calculate the tracking threshold $\Delta$ based on $\epsilon$.

Step 1. We can approximate $\mu$ from $\epsilon$ via Equation (1), yet the reverse cannot be done with a closed-form formula. We observe that $\mu$ monotonically increases with $\epsilon$. Hence the idea is to use a binary search to approximate $\epsilon$: we first guess a value $\epsilon_0$, then calculate a $\mu_0$ and compare it with the user-input $\mu$, and finally adjust the value of $\epsilon_0$ and compute $\mu_0$ again. We repeat this process until the difference between $\mu_0$ and $\mu$ is within a desired precision. We then treat $\epsilon_0$ as $\epsilon$, the input to the next step. The way to calculate $\mu$ from $\epsilon$ is derived as follows. Given that $c$ approximately follows a normal distribution, $\mu = \Pr[c_\alpha - h_c < U < c_\alpha + h_c]$, where $h_c = |\hat{c} - c|$ and $U$ is a random variable following the normal distribution $N(0, 1)$. $h_c$ can be approximated from $\epsilon$ using the Monte Carlo sampling technique, according to Equation (1): in each loop, we generate a random value $\hat{\lambda}$ in the range $[\lambda - \epsilon, \lambda + \epsilon]$, compute $\hat{c}$ based on Equation (1), and compute its difference from the $c$ calculated with $\lambda$. This loop is repeated a constant number of times, and the largest difference is assigned to $h_c$, which is then used to calculate $\mu$.

Step 2. Once we have the eigen-error $\epsilon$, using the stochastic matrix perturbation method we can relate the eigen-error $\epsilon$ to the variance $\sigma_i^2$ along each dimension:

$$\frac{2}{t}\sqrt{\bar{\lambda}\sum_{i=1}^{d}\sigma_i^2} + \sqrt{\frac{1}{t} + \frac{1}{d}\sum_{i=1}^{d}\sigma_i^4} = \epsilon,$$

where $\bar{\lambda}$ is the average of the eigenvalues, $t$ is the number of points used to build the PCA model, and $d$ is the number of dimensions. The estimation of the tracking error $\Delta$ is then based on the following assumptions:

1) the errors between the approximated values sent to the tracker (the CLC) and the true values observed by the observer (a NC) are independently and uniformly distributed within the threshold, according to which the tracking threshold for the $i$th dimension is $\delta_i = \sqrt{3}\sigma_i$;
2) we use homogeneous slack allocation, that is, we assume a uniform distribution of the tracking error $\delta$ over the dimensions.

Applying these two assumptions, we get a tracking threshold

$$\delta = \frac{\sqrt{3n\bar{\lambda} + 3\epsilon\sqrt{m^2 + mn}} - \sqrt{3n\bar{\lambda}}}{\sqrt{m + n}}. \qquad (2)$$

Note that we cannot send this threshold directly to the observers, since the data matrix used to build the PCA model has been standardized. Recall that $std_i$ is the standard deviation along the $i$th dimension of matrix $M$; then the original variance is $S_i = (std_i \cdot \sigma_i)^2$. Thus, the tracking threshold for the $i$th dimension is calculated as $\Delta_i = \sqrt{3S_i} = \sqrt{3(std_i \cdot \sigma_i)^2} = std_i \cdot \delta_i$. The CLC recalculates the thresholds for each metric dimension whenever there is a PCA update, and then sends the new tracking thresholds to the corresponding NCs (observers), which use the updated thresholds to adjust their tracking algorithms. A possible improvement is to allocate the tracking slack of each metric dimension according to the frequency of messages sent to the CLC: by giving the dimensions that are reported more frequently larger tracking error thresholds, and the other dimensions smaller ones, the tracking overhead could potentially be reduced further.

5.2 Accommodating Dynamic Tracking Thresholds
In the monitoring component (CLC), each time a new set of tracking thresholds is calculated, they are sent back to the tracking component (NC). This means that the tracking threshold on each metric dimension may change from time to time. On the tracking component, we use a buffer $B$ to store the newest tracking threshold for each metric, and adjust the tracking method of Algorithm 1 accordingly, as shown in Algorithm 2. Here $\Delta_{new}$ is the current tracking threshold in buffer $B$ for the metric being tracked.

Algorithm 2. One Round of Online Tracking for Real Values

  let $S = [f(t_{now}) - \Delta, f(t_{now}) + \Delta]$;
  while $S_{upper\,bound} - S_{lower\,bound} > \gamma$ do
    $g(t_{now}) = (S_{upper\,bound} + S_{lower\,bound})/2$;
    send $g(t_{now})$ to the tracker;
    while $\|f(t_{now}) - g(t_{last})\| \leq \Delta$ do
      wait until $f(t_{now})$ is updated;
      $\Delta = \Delta_{new}$;
    end while
    $S_{upper\,bound} = \min(S_{upper\,bound}, f(t_{now}) + \Delta)$;
    $S_{lower\,bound} = \max(S_{lower\,bound}, f(t_{now}) - \Delta)$;
  end while  /* this algorithm is run by the observer */

We can show that this style of "lazy update of the tracking threshold value" ensures that the competitive ratio is the maximum of $\log \Delta$ over all possible $\Delta$ (or $\log(\Delta/\gamma)$, where $\gamma$ is the finest precision for "double" values) in a tracking period; and that it is optimal. It also guarantees that on the
monitoring component, the PCA detection result calculated from the approximated tracking values has a false alarm rate $\hat{\alpha}$ that is within the user-specified deviation value $\mu$ of the true false alarm rate $\alpha$ (i.e., $\hat{\alpha} \in [\alpha - \mu, \alpha + \mu]$).

Fig. 4. Intersection with dynamically changing values of $\Delta$.

Claim 2. When the tracking threshold $\Delta$ changes at a NC, simply changing the $\Delta$ value in Algorithm 1 during a round preserves the correctness and optimality of the tracking algorithm. The competitive ratio with dynamically changing values of $\Delta$ becomes the log of the maximum $\Delta$ value for integers, and the log of the maximum $\Delta/\gamma$ value for floating point values, where $\gamma$ is the finest precision.

Proof. Here we prove the case of tracking integer values; the extension to real values is straightforward following the proof of Claim 1. We use the same notation as in Section 3. Recall that a range $S$ is initialized as $[f(t_0) - \Delta, f(t_0) + \Delta]$, where $f(t_0)$ is the value observed first, and is updated as the intersection with $[f(t) - \Delta, f(t) + \Delta]$ up to $t_{now}$. A round lasts from the initialization of $S$ until $S$ becomes empty.

Correctness. When the tracking error bound changes from $\Delta_1$ to $\Delta_2$, Alice sends Bob a new value whenever the newest value observed is beyond the $\Delta_2$ range of the last sent one.

Competitive Ratio. Note that in Algorithm 1, $A_{SOL}$ uses binary search to guess what value $A_{OPT}$ might have sent in each round. The range $S$ contains all the possible values that $A_{OPT}$ might have sent, and it shrinks by at least half upon the sending of each message (the median of $S$). Hence, in each round, $A_{OPT}$ sends out only one value while $A_{SOL}$ sends out at most $\log \Delta$. Even if $\Delta$ changes in the middle, as shown in Fig. 4, this does not affect the fact that $S$ shrinks by at least half upon each message sent. When the tracking error bound changes from $\Delta_1$ to $\Delta_2$, use $S_1$ to denote the region of $S$ at that time, and $S_2$ to denote $[y - \Delta_2, y + \Delta_2]$, where $x$ is the median of $S_1$ (the last sent value), and $y$ is the first value observed that exceeds the $\Delta_2$ range of $x$ after $\Delta$ changes. According to our "lazy update" method, the new $S$ is the intersection of $S_1$ and $S_2$. Because $y - \Delta_2 > x$, we have $|new\ S| = S_1(upper\ bound) - (y - \Delta_2) < |S_1|/2$. Hence, no matter whether $\Delta_2$ is bigger or smaller than $\Delta_1$, $S_1$ shrinks by at least half when this change happens. If $S_1$ and $S_2$ do not intersect, then a new round starts and $\Delta_2$ becomes the initial threshold of the new round. Therefore, the competitive ratio of each round depends ONLY on the initial size of $S$: if the initial threshold of a round is $\Delta$, the competitive ratio for that round is $\log \Delta$. Over the whole period, the competitive ratio becomes the log of the maximum threshold value that ever appears.

Optimality. Suppose the last value sent by an online algorithm $A_{SOL}$ before the change from $\Delta_1$ to $\Delta_2$ is $x$. An adversary Carole operates the value of $f$ here. If $x$ is greater than the median of $S$, Carole decreases $f$ until it exceeds the $\Delta_2$ threshold of $x$, and otherwise increases $f$ until $A_{SOL}$ has to send out a new value. This way the cardinality of $S$ decreases at most half during the change from $\Delta_1$ to $\Delta_2$. So the optimality of Algorithm 1 still holds even with the tracking error bound changing. □

6 ORCHESTRATION COMPONENT
The monitoring component in Section 4 detects the abnormal state and identifies which measurement on which VM might be responsible. In this section, we describe how the orchestration component is able to automatically mitigate the malicious behavior after an anomaly is detected.

Modern IaaS cloud vendors offer services mostly in the form of VMs, which makes it critical to ensure VM security in order to attract more customers. The VMI technique has been widely studied for introspecting VMs for security purposes. There are also several popular open-source general-purpose VMI tools, such as LibVMI [9], Volatility [18], and StackDB [10], for researchers to explore and develop more sophisticated applications. LibVMI has many basic APIs that support memory reads and writes on live memory. Volatility supports memory forensics on a VM memory snapshot file, and it has many Linux plugins that are ready to use. StackDB is designed to be a multi-level debugger, while it also serves well as a memory-forensics tool. Other, more sophisticated techniques developed for special-purpose VMI anomaly detection are generally based on these tools. Blacksheep [19], for instance, utilizes Volatility and specifically developed plug-ins to implement a distributed system for detecting anomalies inside VMs among groups of similar machines. However, like most other VMI strategies to secure VMs, it needs to dump the whole memory space of the target VM and then analyze each piece, typically by comparing against what is defined as a "normal" state. Thus, to protect VMs in real time, the whole memory space needs to be analyzed constantly, introducing much overhead into the production system.

ATOM implements its orchestration component based on Volatility (with the LibVMI plug-in for live introspection) and StackDB. A crucial difference from other systems is that ATOM only introspects a VM when an anomaly happens, and only the relevant memory space of the suspicious VMs. The monitoring component in ATOM serves as a trigger to inform the VMI tools when and where to do introspection. Anomalies are found by analyzing previously monitored resource usage data in the monitoring component, which is much more lightweight than analyzing the whole memory space. The metrics identification process in the monitoring component then locates which dimensions are suspicious, indicating the relevant metrics on some particular VMs. This information is sent to the orchestration component along with a VMI request, which then introspects only the relevant memory space, reducing the overhead dramatically. For example, if it is detected and identified that the network usages on VM-2 and VM-3 are unusual, as shown in Fig. 5, then ATOM can introspect only the network connections on VM-2 and VM-3, using Volatility network plug-ins, in contrast to other VMI-based detection strategies, which typically need to walk over the whole process list, opened network sockets, opened files, etc.
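The "introspect only what the monitor flagged" idea can be sketched as a small dispatcher that turns identified (VM, metric) pairs into targeted introspection requests. This is purely illustrative: the mapping table and function names are our assumptions, not ATOM's actual implementation, and the plugin names are Volatility 2's standard Linux plugins (`linux_netstat`, `linux_pslist`, `linux_lsof`):

```python
# Hypothetical mapping from a flagged metric dimension to the narrow
# Volatility 2 Linux plugin that inspects only the related kernel state.
SUSPICIOUS_METRIC_TO_PLUGIN = {
    "NetworkIn": "linux_netstat",
    "NetworkOut": "linux_netstat",
    "CPUUtilization": "linux_pslist",
    "DiskReadOps": "linux_lsof",
    "DiskWriteOps": "linux_lsof",
    "DiskReadBytes": "linux_lsof",
    "DiskWriteBytes": "linux_lsof",
}

def introspection_plan(flagged):
    """flagged: list of (vm_id, metric) pairs from metrics identification.
    Returns the minimal set of (vm_id, plugin) introspection requests,
    instead of a full-memory scan of every VM."""
    plan = set()
    for vm_id, metric in flagged:
        plugin = SUSPICIOUS_METRIC_TO_PLUGIN.get(metric)
        if plugin:
            plan.add((vm_id, plugin))
    return sorted(plan)
```

For the example above, flagging NetworkIn on VM-2 and NetworkOut on VM-3 yields only two `linux_netstat` requests rather than whole-memory analysis of all VMs.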
developing customized, fine-tuned monitoring modules for


each cluster. For instance, a cloud provider may want to
evenly distribute the VMs having similar resource usage
patterns to different physical nodes, in order to make sure
the physical resources are fully utilized and fewer VMs may
suffer from resource starvation. In another example, we
may want to use different anomaly detection techniques for
VMs running a database server workload than those run-
ning a web server.
The basic idea of our proposed approach is as follows.
The monitoring component in ATOM, using its PCA-based
approach, transforms the original coordinates to a new
Fig. 5. Memory space introspected by ATOM.
coordinate system where the principal components (PCs)
are ordered by the amount of variations on each direction
After the orchestration component identifies potential (as explained in Fig. 3). Thus, if two VMs share similar
abnormal processes, an alarm is raised with associated workloads, the directions of the corresponding PCs between
information identified by VMI tools. The alarm and such the two should also be similar. That said,
information are provided to the VM user. If user confirms Step 1. On CLC, a data matrix for each VM is maintained,
this as an abnormal behavior, ATOM is able to terminate where the columns are metric types and rows are time
the malicious processes inside a VM instance by using tools instances (i.e., a t  d0 matrix for each VM with a sliding
like StackDB [10]. StackDB could be used to debug, inspect, window of t), and is updated over time.
modify, and analyze the behavior of running programs Step 2. ATOM performs a PCA on each VM data matrix
inside a VM instance. To kill a process, it first finds the without standardization; since for clustering purposes, not
task_struct object of the running process using process only the variations on each direction is important, but also
name or id, and then passes in SIGKILL signal. Next time the average usage on each dimension. For example, a VM
the process is being scheduled, it is killed immediately. having a disk usage that oscillates between 10,000 and
Although the anomalies that could be detected by ATOM 20,000 bytes is obviously not the same as one having oscilla-
is limited compared with other systems which analyze the tion between 100 and 200 bytes on the same dimension;
whole memory space, we argue the framework of ATOM whereas a standardization procedure which first performs
could be easily extended to detect more complex attacks. mean-center and then normalization will make the two
First, more metrics could be easily added to monitor for oscillations look similar.
each VM. Also, many other auto-debugging tools could be This step yields a set of PCs for each VM. The direction of
developed, which are useful to find various kinds of attacks each PC is denoted by the corresponding eigen vector while
and perform different desirable actions. the variation is shown by the associated eigen value.
Note that killing the identified, potentially malicious pro- Step 3. Suppose VM1 has eigen vectors ðv11 ; v12 ; . . .Þ and
cess is just one possible choice provided by ATOM, which is corresponding eigen values ð11 ; 12 ; . . .Þ, while VM2 has
performed only if user agrees to (ATOM is certainly able to ðv21 ; v22 ; . . .Þ and ð21 ; 22 ; . . .Þ. We measure the distance
automate this as well if desired). Alternatives could be to between two directions using cosine distance; defined as
terminate the network connections or to close file handles. ð1  cosine similarityÞ. Intuitively, the bigger the angle
A more sophisticated way is to study a rich dataset of between two directions (the less similar they are), the
known attacks (e.g., Exploits Database) and design rule- smaller their cosine similarity is, hence the larger the cosine
based approaches to mitigate attacks based on different pat- distance becomes. Finally, the distance between the two
terns. We refer these active actions, together with introspec- VMs is defined as: VMdist(VM1, VM2) ¼ j11  21 jð1
tion, as ATOM’s orchestration module. Orchestration in v11 v21 v22
jv11 jv21 Þ þ j12  22 jð1  jvv12
12 jv22
Þ þ    . Note that it is simply
ATOM can be greatly customized to suite the needs for dif-
ferent tasks, such as identification of different attacks, and the sum of the cosine distance of each corresponding pair of
dynamic resource allocation in an IaaS system. eigen vectors from VM1 and VM2, weighted by the differ-
ence of the corresponding eigen values to ensure that the
variations do not differ a lot.
7 VM CLUSTERING Step 4. Using VMdist as the distance measure between
ATOM enables a continuous understanding of the VMs in an any two VMs, we use DBSCAN [20] to cluster similar VMs
IaaS system. In addition to anomaly detection, this frame- together. DBSCAN is a threshold-based (aka density based)
work is also useful for many other decision making and ana- clustering algorithm which requires two parameters: 
lytics applications. Hence, in addition to using a PCA-based which is the density threshold, and minPts which is the
approach in the monitoring component, we will demonstrate number of minimum points to form a cluster. DBSCAN
that it is possible to design and implement a VM clustering expands a cluster from an un-visited data point towards all
module to be used in the monitoring component. its neighboring points provided the distance is within , and
The objective of VM clustering is to cluster a set of VMs then recursively expands from each of the neighboring
into different clusters so that VMs with similar workload point. Points are marked as an outlier if the number of
characteristics end up in the same group. This operation points in their cluster is fewer than minPts. Compared with
assists making load balancing decisions, as well as other popular clustering methods like k-means, density-
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on March 31,2024 at 11:48:31 UTC from IEEE Xplore. Restrictions apply.
2182 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017

Fig. 6. A comparison on number of values sent by NC for each metric.

based clustering algorithm does not require prior knowledge of the number of clusters, nor does it need to iteratively compute an explicit "centroid" and re-cluster at every iteration.

By default, ATOM sets minPts = 10, and computes the threshold value ε using a sampling-based approach. More specifically, we randomly select n pairs of VMs and compute their VMdist. We sort the n VMdist values, and set ε = VMdist_i if VMdist_{i+1} > 5 × VMdist_i. The intuition is that, for any point, the distance to a point in a different cluster is much longer than the distance to a point in the same cluster, and we want to find a large enough "inner cluster" distance and use it as the threshold value ε to determine whether two points belong to the same cluster.

8 EVALUATION

We implemented ATOM using Eucalyptus as the underlying IaaS system. The virtual machine hypervisor running on each NC is the default KVM hypervisor. Each VM has the m1.medium type on Eucalyptus. ATOM tracks seven metrics from each VM instance: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes. All experiments are executed on a Linux machine with an 8-core Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz.

8.1 Online Tracking

In the evaluation, the data collection time interval is set to 10 seconds, i.e., raw values for different metrics are collected every 10 seconds on an NC (the observer), which produces 360 values for each metric per hour. Instead of sending every value to the CLC (the tracker), the modified CloudWatch with ATOM's online tracking component selectively sends certain values, based on Algorithm 1, from NC to CLC. Fig. 6 shows the number of values sent for each metric over 2 hours, with different workloads (e.g., the TPC-C benchmark over MySQL) and different Δ values. Among the seven metrics for each VM, only the first five are shown in each sub-figure, as DiskReadBytes/DiskWriteBytes follow the same patterns as DiskReadOps/DiskWriteOps in all experiments.

Fig. 6a shows the result when the VM is idle, using Δ = 0. This is the base case with no error allowed for any metric. The result shows that our tracking component still achieves significant savings when no error is allowed. In Fig. 6b, the VM is also idle, while Δ is set to 10 percent of the average value (calculated from the exact values collected) in 2 hours for each metric. Note that this is a very small error threshold. For example, the metric CPUUtilization is always between 0 and 0.2 percent when the VM is idle, so the Δ value for this metric is only (roughly) 0.01 percent. This figure shows that even when allowing a very small error, the tracking component already leads to significant savings. Fig. 6c shows the results when the VM is running the TPC-C benchmark on a MySQL database, which involves large disk reads and writes. Δ is set as the average of the exact values over 2 hours when the VM is idle. This is reasonable even for users who do not allow any error, because Δ is merely the average of the amount consumed by an idle VM. Note that in this figure, NetworkIn and NetworkOut only have two values sent to the CLC in 2 hours with the tracking component. This figure tells us that even if the VM is intensively used and almost no error is allowed, the tracking component is still highly effective.

Fig. 6d demonstrates the result when the VM is running the same workload, while the Δ value for each metric is now set as 10 percent of the average value when the VM has been running the same workload for 2 hours, i.e., larger errors are allowed. Clearly, the tracking component becomes more effective: a larger error is expected to improve ATOM's performance, because new values within the error threshold of the last sent value need not be sent.

Fig. 7 explains how the online tracking component works. It shows both the values sent by standard CloudWatch (without tracking) and the values sent by the modified CloudWatch with ATOM tracking, over a time interval of 1,000 seconds, for the NetworkOut metric from Fig. 6b. This clearly illustrates that at each time instance, with online tracking, the current (exact) value is not sent if it is within the Δ threshold of the last sent value; and at each time point, the last value sent to the CLC is always within Δ of the newest value observed on the NC. The values sent by the tracking method closely approximate the exact values, with much smaller overhead.

Fig. 7. A comparison of NetworkOut values sent by NC.
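The tracking rule illustrated in Fig. 7 — report a value only when it deviates from the last reported one by more than Δ — can be sketched in a few lines. This is a minimal reconstruction from the description above (Algorithm 1 itself is not reproduced in this section), and the class and variable names are illustrative:

```python
# Sketch of the per-metric online tracking filter: the observer (NC) reports
# a new value to the tracker (CLC) only when it deviates from the last
# reported value by more than the error bound delta. Names are illustrative.
class OnlineTracker:
    def __init__(self, delta):
        self.delta = delta
        self.last_sent = None   # last value the tracker (CLC) has seen
        self.sent_count = 0

    def observe(self, value):
        """Return True if `value` must be sent to the tracker."""
        if self.last_sent is None or abs(value - self.last_sent) > self.delta:
            self.last_sent = value
            self.sent_count += 1
            return True
        return False            # tracker's copy is still within delta

# With delta = 0.5, a slowly drifting signal triggers few reports:
tracker = OnlineTracker(delta=0.5)
readings = [1.0, 1.1, 1.2, 1.8, 1.9, 3.0, 3.1]
sent = [v for v in readings if tracker.observe(v)]   # [1.0, 1.8, 3.0]
```

The observer still samples every 10 seconds; only the reports to the tracker are filtered, which preserves the invariant that the tracker's last received value stays within Δ of the latest observation.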
DU AND LI: ATOM: EFFICIENT TRACKING, MONITORING, AND ORCHESTRATION OF CLOUD RESOURCES 2183

TABLE 2
Online Monitoring Experiment Setup

Experiment | Workload | Attack
1 | VM 1, 3 idle; VM 2 network workload | DDoS attack inside VM 2
2 | VM 1 idle; VM 2, 3 network workload | DDoS attack inside VM 2, 3
3 | VM 1 idle; VM 2 network workload; VM 3 disk workload | Resource-freeing attack from VM 3 to VM 2

8.2 Automated Online Monitoring and Orchestration

We design three experiments to illustrate the effectiveness of ATOM's monitoring module. For each experiment, we use a false alarm rate α = 0.2 percent and its deviation μ = 1 percent (to set the tracking error bound). Meanwhile, the Qα threshold with α = 0.5 percent is also calculated to compare against. The online tracking error Δ is calculated dynamically according to the equations in Section 5.1 at the CLC, and set using the algorithm in Section 5.2 on each NC. Three VMs of type m1.medium co-located in one Eucalyptus physical node are monitored for each experiment, which form a t × 21 data matrix. Dimensions 1-7 belong to VM 1, 8-14 are for VM 2, whereas VM 3 owns the rest.

We use two types of normal workloads and two kinds of attacks in all three experiments. The two types of normal workloads are network and disk workloads. For the network workload, an Apache web server is installed and constantly responds to WebBench network requests. The disk workload is the TPC-C benchmark against a MySQL database [21]. The two kinds of attacks are the DDoS attack and the resource-freeing attack [14]. In our experiment, the DDoS attack treats the affected VM as a compromised zombie and sends malicious traffic to the target IP address. The resource-freeing attack is launched by VM 3 targeting the web server on VM 2 to gain more cache usage. Note that there is a 4th VM running WebBench and a 5th VM running an Apache web server as the target of the DDoS bots. The first two hours are used to build the PCA model for each experiment, while the anomaly happens at the third hour. The settings for each experiment are shown in Table 2.

In the first experiment, VM 2 runs an Apache web server while the other two VMs are idle. A DDoS attack turns VM 2 into a zombie at the third hour, using it to generate traffic towards the target IP (the 5th VM in our experiment). Note that this attack is hard to detect using the simple threshold approach in existing IaaS systems. The normal workload on VM 2 is a network workload, which already has a large amount of NetworkIn/NetworkOut usage; sending out malicious traffic only changes the mean of the normal statistics by roughly 10%-30 percent. Hence it is difficult to set an effective threshold value even for an experienced user, due to the fact that the underlying normal traffic might oscillate within a range. Yet ATOM's monitoring module successfully finds the underlying pattern, and detects the time instances that are abnormal (when attacks are ongoing). Fig. 8a shows the online monitoring and detection process. The dashed line corresponds to the threshold Qα for α = 0.2 percent, and the solid line shows Qα for α = 0.5 percent. The SPE of the approximate data matrix projected onto the residual subspace is plotted, where the black dots indicate the time instances when the DDoS attack happens. Clearly, ATOM has correctly identified all abnormal time instances.

Once a time instance is considered abnormal, ATOM immediately runs the metrics identification procedure to find the affected VMs and metrics. As described in Section 4.3.4, ATOM first finds potential abnormal dimension(s) by analyzing the average change portion rd_j between abnormal data points and normal data points projected onto the residual subspace. Then, for dimensions that have significant changes, ATOM computes stddev_j as suggested in Section 4.3.4, and also calculates the average change meandiff_j if stddev_j is above a threshold. Recall that m is the number of consecutive abnormal time instances until t_now. The results when m = 5 are shown in the first table of Table 3. Note that only for the dimensions having a large enough residual portion (rd_j) does ATOM compute the standard deviation error (stddev_j). Among the 3 VM instances being tracked and monitored, ATOM correctly identifies an anomaly happening on VM 2; more specifically, it discovers that the anomaly is from its first three dimensions (CPUUtilization, NetworkIn, NetworkOut), indicated by the bold values. Note that NetworkIn and NetworkOut actually go down because of the DDoS attack. Our guess is that WebBench tends to saturate the bandwidth available for the VM, while the DDoS attack we use launches many network connections but does not send as much traffic. The CPUUtilization, however, goes up due to the attack. Nevertheless, ATOM is able to identify all three abnormal metric dimensions.

After the abnormal metrics are identified, a VMI request is sent to the corresponding NC for introspection. ATOM's
Fig. 8. Time series plots of SPE against thresholds Qα with α = 0.2 and 0.5 percent.
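The SPE-versus-threshold test plotted in Fig. 8 can be illustrated with a small sketch: fit a principal subspace on (assumed normal) training rows, compute the squared prediction error (SPE) of a new observation in the residual subspace, and flag it when the SPE exceeds a threshold. The paper derives Qα analytically (Section 4.3.3); here a simple empirical percentile stands in for it, and the data, names, and subspace size are illustrative:

```python
import numpy as np

# Hedged sketch of PCA/SPE anomaly detection: the threshold below is an
# empirical-percentile stand-in for the analytic Q_alpha statistic.
def fit_normal_subspace(X, k):
    """Top-k principal directions of the row-wise data matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T                      # d x k matrix of principal axes

def spe(x, mean, P):
    """Squared prediction error: energy of x left in the residual subspace."""
    c = x - mean
    r = c - P @ (P.T @ c)                # residual after removing principal part
    return float(r @ r)

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 7))        # 7 metrics, normal behavior
P = fit_normal_subspace(train, k=3)
mu = train.mean(axis=0)
q_alpha = np.percentile([spe(x, mu, P) for x in train], 99.5)

normal_point = rng.normal(size=7)
attack_point = normal_point + 25.0       # large shift on every metric
flag = bool(spe(attack_point, mu, P) > q_alpha)   # attack point is flagged
```

A shifted point leaves large energy in the residual subspace, so its SPE far exceeds the threshold, which mirrors the spikes above the Qα lines in Fig. 8.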
TABLE 3
Metrics Identification Results

Experiment 1 Metrics Identification Results:
Dim (j):    vm1-d1  vm1-d2  vm1-d3  vm1-d4  vm1-d5  vm1-d6  vm1-d7  vm2-d1  vm2-d2  vm2-d3  vm2-d4
rd_j:       1.87    36.62   27.17   13.39   -0.56   0.08    8.55    32.63   7.31    35.82   0.00
stddev_j:   0.50    0.32    0.72    0.00    0.76    0.00    0.90    48.68   3.82    6.74    0.08
meandiff_j: vm2-d1 = 0.11, vm2-d2 = -0.12, vm2-d3 = -0.21

Dim (j):    vm2-d5  vm2-d6  vm2-d7  vm3-d1  vm3-d2  vm3-d3  vm3-d4  vm3-d5  vm3-d6  vm3-d7
rd_j:       0.00    0.00    0.00    2.94    -0.50   -0.41   18.45   18.00   1.22    1.88
stddev_j:   0.90    0.08    0.41    0.72    0.31    1.06    0.00    0.18    0.00    0.66
meandiff_j: (none)

Experiment 2 Metrics Identification Results:
Dim (j):    vm1-d1  vm1-d2  vm1-d3  vm1-d4  vm1-d5  vm1-d6  vm1-d7  vm2-d1  vm2-d2  vm2-d3  vm2-d4
rd_j:       23.70   -0.98   -0.98   -0.55   -0.57   4.27    3.76    9.14    64.18   65.05   3.50
stddev_j:   0.78    0.42    0.58    0.00    0.67    0.00    0.71    3.17    8.01    8.30    0.00
meandiff_j: vm2-d1 = 0.16, vm2-d2 = -0.26, vm2-d3 = -0.28

Dim (j):    vm2-d5  vm2-d6  vm2-d7  vm3-d1  vm3-d2  vm3-d3  vm3-d4  vm3-d5  vm3-d6  vm3-d7
rd_j:       -0.51   -0.82   4.23    9.04    60.56   61.16   1.45    -0.56   1.89    -0.51
stddev_j:   0.31    0.00    0.35    7.23    6.06    6.98    0.17    3.39    0.12    3.65
meandiff_j: vm3-d1 = 0.39, vm3-d2 = -0.23, vm3-d3 = -0.31

Experiment 3 Metrics Identification Results:
Dim (j):    vm1-d1  vm1-d2  vm1-d3  vm1-d4  vm1-d5  vm1-d6  vm1-d7  vm2-d1  vm2-d2  vm2-d3  vm2-d4
rd_j:       2.58    -0.65   -0.93   -0.65   28.23   -0.98   -0.15   6.90    7.94    7.27    -0.76
stddev_j:   0.24    0.42    0.63    0.95    0.43    0.98    0.86    7.36    4.52    4.74    0.21
meandiff_j: vm2-d1 = -0.91, vm2-d2 = -0.85, vm2-d3 = -0.89

Dim (j):    vm2-d5  vm2-d6  vm2-d7  vm3-d1  vm3-d2  vm3-d3  vm3-d4  vm3-d5  vm3-d6  vm3-d7
rd_j:       0.30    -0.99   -0.44   10.70   1282.80 1401.34 1363.47 -0.70   1544.73 -0.53
stddev_j:   1.41    0.17    1.43    1.86    13.05   12.79   13.42   1.72    13.60   1.78
meandiff_j: vm3-d2 = 101.81, vm3-d3 = 110.97, vm3-d4 = 187.16, vm3-d6 = 196.30
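As a rough illustration of the kind of per-dimension screening summarized in Table 3, the sketch below flags dimensions whose mean shifts substantially between normal and abnormal windows. The statistic here is a simplified stand-in for the paper's rd_j (the stddev_j and meandiff_j refinements are omitted), and the function names, threshold, and data are all illustrative:

```python
# Loose sketch of per-dimension screening: compare the mean of each metric
# dimension over abnormal time instances against its mean over normal ones,
# and flag dimensions with a large relative change. Illustrative only; this
# is not the paper's exact rd_j / stddev_j / meandiff_j computation.
def mean(xs):
    return sum(xs) / len(xs)

def identify_metrics(normal_rows, abnormal_rows, rd_threshold=5.0):
    """Return indices of dimensions whose mean shifts significantly."""
    flagged = []
    for j in range(len(normal_rows[0])):
        n = [row[j] for row in normal_rows]
        a = [row[j] for row in abnormal_rows]
        base = abs(mean(n)) or 1.0               # avoid division by zero
        rd_j = abs(mean(a) - mean(n)) / base     # relative change of the mean
        if rd_j > rd_threshold:
            flagged.append(j)
    return flagged

# Dimension 1 jumps by an order of magnitude during the anomaly:
normal = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.9]]
abnormal = [[1.1, 30.0, 3.1], [0.9, 28.0, 3.0]]
print(identify_metrics(normal, abnormal))   # → [1]
```

In ATOM, this screening runs only over the m consecutive abnormal time instances, and only the flagged dimensions proceed to the more expensive statistics and, ultimately, to introspection.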

orchestration module first identifies this as a possible network problem, and then calls Volatility to analyze the network connections (the linux_netstat plugin) on that VM, which finds the numerous network connections targeting one IP address, a typical pattern of DDoS attacks. Volatility is then used to find the related processes and their parent process (the pslist plugin) for these network connections. At this point, ATOM raises an alarm with alarm level m, notifying the user about the findings, and asks the user to check whether those processes are normal or malicious. The alarm level is useful; for example, m = 1 could be treated as a mild warning. If the user identifies them as malicious, he/she could either investigate the VM in further detail, or use ATOM's monitoring module to do auto-debugging and kill the malicious processes automatically through StackDB [22]. Fig. 8a shows that the SPE goes back to normal after the attack is mitigated on the affected VM through ATOM's orchestration module.

In the second experiment, both VM 2 and VM 3 run the network workload, and the same DDoS attack turns both VMs into zombie VMs simultaneously. Not only is ATOM able to detect that an anomaly happened, as shown in Fig. 8b, but it also finds similar patterns on the correct metrics from both VM 2 and VM 3, as illustrated in the second table of Table 3, which shows the metrics identification results when m = 5. By sending this information to the orchestration component, the introspection overhead could be reduced by first introspecting one VM, and then checking whether the other one has the same malicious behavior going on.

The third experiment illustrates ATOM's ability to detect a different type of attack, the resource-freeing attack [14], a subtle attack whose goal is to improve a VM's performance by forcing a competing VM to saturate some bottleneck and shift its usage away from the target resource (oftentimes with legitimate behavior). This kind of attack is known to be very hard to catch except by using strong isolation on the physical node. In this experiment, VM 2 runs an Apache web server constantly handling network requests, and VM 3 runs the TPC-C benchmark on a MySQL database. According to [14], if VM 3 wants more cache usage, it could make the network resource a bottleneck for VM 2, and shift VM 2's usage away from the cache (VM 2 and VM 3 are running on the same physical node). In this experiment, VM 3 launches the GoldenEye attack, which achieves a denial-of-service attack on the HTTP server running on VM 2 by consuming all available sockets, and is paired with cache control. We show that ATOM successfully finds the two VMs, and, by its metrics identification procedure, it suggests the possibility of a resource-freeing attack and provides useful data to its orchestration module in assisting the VMI procedure on VMs 2 and 3.

Fig. 8c plots the monitoring and detection process. The black dots indicate the time instances when the abnormal behavior happens. This figure, as before, only shows that an anomaly has happened, while the third table in Table 3 analyzes where the anomaly has originated. The stddev_j values show what the abnormal dimensions are: on VM 2, CPUUtilization (vm2-d1), NetworkIn (vm2-d2), NetworkOut (vm2-d3); on VM 3, NetworkIn (vm3-d2), NetworkOut (vm3-d3), DiskReadOps (vm3-d4), DiskReadBytes (vm3-d6).

Further analysis of meandiff_j finds that the NetworkIn and NetworkOut statistics on VM 2 decrease by nearly an order of magnitude, while VM 3 sees a significant increase in NetworkIn, NetworkOut and especially its disk read statistics (DiskReadOps and DiskReadBytes). This is a typical resource-freeing attack as described in [14], where the network resource has become the bottleneck of a target VM, and the beneficiary VM gains much of the shared cache usage, shown by a significant increase in its disk read statistics. The sudden increase in NetworkIn/NetworkOut in VM 3
Fig. 9. Network bandwidth saving in each experiment in Fig. 8.
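As a worked example of the saving metric reported in Fig. 9 — one minus the ratio of the number of messages sent with the tracking module to the number sent without it — the baseline count below follows the text (7 metrics × 360 values per hour per VM), while the tracked message count is made up for illustration:

```python
# Worked example of the network bandwidth saving metric from Fig. 9.
# Every message has the same size, so the saving reduces to a message-count
# ratio. The tracked count (300) is a hypothetical value for illustration.
def bandwidth_saving(sent_with_tracking, sent_without_tracking):
    return 1.0 - sent_with_tracking / sent_without_tracking

baseline = 7 * 360                        # 2520 values per VM per hour
saving = bandwidth_saving(300, baseline)  # hypothetical tracked count
print(f"{saving:.1%}")                    # → 88.1%
```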


Fig. 10. Sensitivity analysis.
also suggests that VM 3 might be the attacker of VM 2, by sending malicious traffic to it.

Further analysis by VMI in ATOM's orchestration module shows that most of VM 2's sockets are occupied by connections to VM 3; thus the anomaly could be mitigated by closing such connections and limiting future ones. Of course, VM 3 could use a helper to establish such malicious connections with VM 2, as suggested in [14], yet ATOM is still able to raise an alarm to the end user and suggest a possible ongoing resource-freeing attack.

Users should be aware that the larger the alarm level m is, the more accurate the metrics identification process is, and the more likely it is that the attack has caused bigger damage.

Lastly, Fig. 9 shows the communication overhead saving achieved by ATOM in these three experiments. For the first hour in each experiment, each VM would have sent 7 × 360 = 2,520 values to the CLC. However, by using online tracking and setting the tracking error Δ dynamically, the number of values sent is significantly reduced. Since each message is of the same size, we calculate the network bandwidth saving as (1 - (number of messages sent with the tracking module) / (number of messages sent without the tracking module)); the results are shown in Fig. 9. Note that here the deviation is a very small value, μ = 1 percent, which could be a much bigger value in practice because, as shown in Fig. 8, the anomalies tend to have a much bigger SPE value. A larger μ leads to a bigger threshold Δ, leading to even more savings than what is shown in Fig. 9. What is more, the monitoring overhead is also reduced because PCA only needs to be computed when new data arrives. With less data reported, PCA can be computed less frequently.

8.3 Sensitivity Analysis

There are only two parameters to set up for ATOM: the desired false alarm rate α, which is used to calculate the anomaly detection threshold Qα in Section 4.3.3, and the maximally allowed false alarm deviation rate μ, as defined in Section 5, which is used to bound and adjust the tracking threshold Δ.

We design two experiments to analyze the sensitivity of α and μ, respectively. We use the same dataset for these experiments to clearly demonstrate the impacts of different values for α and μ. The first experiment varies the values of α and counts the actual number of false alarms. The second experiment gradually increases the values of μ, and measures the actual number of false alarms and the network savings achieved by the tracking module (when its tracking threshold Δ is dynamically adjusted by ATOM).

The result of the first experiment is shown in Fig. 10a, which shows how the actual false alarm rate changes with increasing values of α. When α increases from 0.1 percent (confidence level 99.9 percent) to 5 percent, the actual false alarm rate also increases. But the actual false alarm rate is always much smaller, ranging from just 0.006 to 0.05 percent. The common practice is to choose α < 0.5 percent, which produces far fewer actual false alarms (fewer than 0.5 per 100 data points).

Fig. 10b shows the results of the second experiment. We vary the false alarm deviation rate μ from 0 to 2.6 percent, with a step size of 0.2 percent. For each value of μ, we run the experiments 10 times, where the α value ranges from 0.1 to 0.5 percent, with a step size of 0.1 percent. Finally, the average results are computed with respect to each μ. We are interested in the network bandwidth saving in percentage (y axis) and the actual false alarm rate (x axis). A larger μ means a bigger Δ threshold is used by ATOM, and thus leads to more savings in network bandwidth. However, the growth of μ is also accompanied by an increase in the number of actual false alarms, which suggests a trade-off between using more network bandwidth and having fewer false alarms. Generally speaking, though, a small value for μ is sufficient to provide enough communication savings. Note that all attacks were still detected in all experiments, achieving a false negative rate of 0 percent. As shown in Fig. 8, using both α = 0.2 percent and α = 0.5 percent, ATOM could easily identify the attacks. A higher α value leads to a lower threshold value for attack detection, meaning attacks are more easily detected, though it may lead to more false alarms. ATOM allows users to control the false alarm rate and the tracking threshold by adjusting α and μ.

In our analysis, a wide range of thresholds suffices to detect denial-of-service attacks or resource-starving attacks while achieving large communication savings. However, in production systems, it is important to provide users feedback about the effects of their error setting. ApproxHadoop [23] and Social Trove [24] use statistical models from extreme value theory to estimate the effects of delta. In our models, statistical models can help over short periods, but over long periods we would expect malicious attackers to adapt their attacks to reduce their chances of being caught. In this situation, it is important to use offline benchmarking to assess the effect of the error threshold as shown in [25], in which the authors provide techniques to overlap online executions with different delta settings, allowing us to understand the effects of delta empirically without degrading throughput.

8.4 ATOM Scalability Evaluation

To evaluate the scalability of ATOM, we measure the key performance metrics of ATOM with an increasing number
of VMs (from 2 to 6). In each configuration, we perform online monitoring using the adapted PCA-based anomaly detection with a sliding window of size 100 (time instances), combined with either online tracking or no tracking (i.e., sending everything). We report the average false alarm rate, the average PCA running time, and the total number of messages sent from the NC to the CLC, per sliding window. The results are shown in Fig. 11.

Fig. 11. Impacts to ATOM's performance with respect to the increasing number of VMs.

A larger number of VMs leads to a higher communication cost in ATOM. However, the tracking component of ATOM becomes more effective with more VMs, as shown in Fig. 11a. This is because there are more opportunities for communication savings when there is a higher probability of temporal locality on one of the many VMs' performance metrics. Fig. 11b shows that the computation cost of ATOM increases linearly with the number of VMs, which is as expected. Nevertheless, the overall computation overhead of ATOM is still fairly small (just a few milliseconds per sliding window). The measured false alarm rate actually decreases initially with more VMs. But when the number of VMs keeps increasing, the measured false alarm rate will eventually start to increase, as indicated in Fig. 11c. Initially, when presented with more data, the PCA-based approach becomes more effective in "learning" the normal subspace, and hence results in a reduced false alarm rate. But as the number of VMs continues to increase, the dimensionality of the data matrix becomes larger, eventually making it less effective in detecting the abnormal subspace after dimensionality reduction. Nevertheless, ATOM remains very effective in all cases; the false alarm rates are smaller than 1 percent in nearly all test cases.

8.5 ATOM versus Classic Offline PCA Anomaly Detection

In this section, we show what happens if we simply apply the classic offline PCA method (offlinePCA) that has been widely used for anomaly detection in previous literature [13].

Specifically, to use offlinePCA, each time we delete the oldest time instance and add the newest one to the sliding window, and then use the data matrix inside this sliding window to do PCA. Each time, only the newest time instance needs to be verified. After transforming the original data to the rotated PCA space, we measure each dimension at the newest time instance and use the first dimension that exceeds 3 times the standard deviation along that axis as the starting dimension of the residual subspace; if no such dimension exists, the SPE need not be calculated and checked against the anomaly threshold.

To compare the performance of this method with our approach, we run another experiment with the same setting as that of experiment 1 shown in Table 2 and apply the above method. Fig. 12 shows the result of anomaly detection using this method. Note that the SPE is not calculated at all time points, as explained above. The blue dots indicate the SPE at certain time instances, and the red crosses show the threshold Qα to compare against at the same time instances. The anomaly happens at the end of the first hour (3,600 on the time dimension in Fig. 12), and it continues ever since. So we would expect that, during the second hour, there should be blue dots and red crosses at every time point, and all blue dots should be above the red crosses. However, as shown in Fig. 12, it takes less than 4 minutes (only the first 20 time instances after 3,600 are detected as abnormal) for the attack to escape the monitoring and detection, and make itself be identified as normal behavior. Note that towards the end, the SPE values in Fig. 12 differ from those in Fig. 8, where the SPE values become normal again at the end because the attacks were mitigated in the experiment by ATOM's orchestration module. The key difference causing this is that ATOM's online PCA-based method ensures that only normal data points are used for anomaly detection, while the abnormal data points may skew the PCA model built by the naive offline method.

Fig. 12. offlinePCA using the settings in Experiment 1 from Table 2 (SPE is computed as in Section 4.3.3).

8.6 VM Clustering Evaluation

To evaluate the accuracy and robustness of our VM clustering method, we run an experiment using 102 VM data vectors (each VM data vector is a seven-dimension vector with seven performance metrics, collected from a VM running a particular workload at the time of data collection). Among the 102 VM data vectors, 34 were idle, another 34 were running the TPC-C benchmark on a MySQL database, and the remaining 34 were running an Apache web server. We run the VM clustering 10 times and calculate the average results. The experiment result shows that our method is able to precisely identify the three clusters, with a few points marked as outliers each time. The average clustering precision is 96.08 percent, the average clustering recall is 95.10 percent, and the average clustering F-measure is 95.59 percent.

Since we use a density-based clustering algorithm which groups nearby VM data vectors together, to test its robustness, for each VM data vector we measure its distance to the closest neighbor inside the same cluster (denoted the "inner cluster distance") and the closest distance to a VM data vector in a different cluster (denoted the "inter cluster distance"). A histogram of the two measures for all 102 VM data vectors is shown in Fig. 13. We can see that there is a
Fig. 13. Histogram of clustering distances.

large gap between the two distances, which shows that our clustering method is fairly robust and insensitive to a wide range of threshold ε values.

8.7 Discussion

The Choice of Δ. Larger Δ values lead to more savings, but to a less accurate data matrix. However, cloud users do not have to worry about setting Δ values; ATOM only needs the user to specify a tolerable deviation rate μ on the detection threshold. ATOM is then able to adjust Δ values dynamically online.

Possible False Alarms. The resource usage pattern may simply change due to normal changes in user activities, in which case ATOM may raise false alarms. Nevertheless, ATOM is able to raise alarms and let users decide the right course of action to take, assisting them with its orchestration module. ATOM also uses the new workload characteristics to adjust its monitoring component, adapting to a new workload dynamically and automatically in an online fashion.

Overhead. The tracking module, by simply applying Algorithm 1 before sending out each value, introduces only O(1) overhead. The monitoring module could leverage a recursive update procedure, so that it is possible to use the current PCA model to do an incremental update instead of computing from scratch, e.g., [15], [26]. Depending on the PCA algorithm used, it is polynomial in the sliding window size and the number of dimensions. In contrast, the overhead saved by ATOM is significant. Not only can a major fraction of the network traffic from CC to CLC be saved, but also the effort to apply VMI. The orchestration module orchestrates and introspects only the affected VMs and metrics, and only when needed; hence, it leads to much smaller overhead than the full-scale VM introspection that is typically required.

Other Attacks. Our experiments use the same set of metrics that are monitored by CloudWatch and demonstrate two different types of attacks. But ATOM can easily add any additional metric without much overhead. This means that it can be easily extended when necessary with additional metrics for monitoring and detecting different kinds of attacks.

9 RELATED WORK

To the best of our knowledge, none of the existing IaaS platforms is able to provide continuous tracking, monitoring, and orchestration of system resource usage. Furthermore, none of them is able to do intelligent, automated monitoring for a large number of VMs and carry out orchestration inside a VM.

Cloud Monitoring. Most existing IaaS systems follow the general, hierarchical architecture shown in Fig. 1. Inside these systems, there are imperative needs for the controller to continuously collect resource usage data and monitor system health. AWS [1] and Eucalyptus [4], [5] use the CloudWatch [27] service to monitor VMs and other components at some fixed interval, e.g., every minute. This provides cloud users system-wide visibility into resource utilization, and allows users to set some simple threshold-based alarms to monitor and ensure system health. OpenStack [28] is developing a project called Ceilometer [29] to collect resource utilization measurements. However, these approaches only provide a discrete, sampled view of the system. Several emerging startup companies such as DATADOG [30] and librato [31] can monitor at a more fine-grained granularity, provided the required software is installed. However, this inevitably introduces more network overhead to the cloud, which becomes worse when the monitored infrastructure scales up. On the contrary, ATOM significantly reduces the network overhead by utilizing the optimal online tracking algorithm, while providing just about the same amount of information. Furthermore, all these cloud monitoring services offer very limited capability in monitoring and ensuring system health. UBL [8] uses collected VM usage data to train Self-Organizing Maps (SOM) for anomaly prediction, which serves a purpose similar to ATOM's monitoring component. Beyond the detailed comparison in Section 1, SOM requires an explicit training stage and needs to be trained on normal data, while PCA can identify what is normal directly from the historical data, provided normal data is the majority. Unlike UBL and ATOM, which only require VM usage data, PerfCompass collects system call traces and checks the execution units being affected [32] to identify whether a VM performance anomaly is caused by an internal fault like a software bug, or by an external source such as co-existing VMs.

Astrolabe [33] is a monitoring service for distributed resources, which performs user-defined aggregation (e.g., the number of nodes that satisfy a certain property) on-the-fly over the host hierarchy. It is intended as a "summarizing mechanism". Similar to Astrolabe, SDIMS [34] is another system that aggregates information about large-scale networked systems, with better scalability, flexibility, and administrative isolation. Ganglia [35] is a general-purpose scalable distributed monitoring system for high-performance computing systems, which also has a hierarchical design to monitor and aggregate all the nodes, and has been used in many clusters. These efforts are similar to the CloudWatch module currently used in AWS/Eucalyptus, and they reduce monitoring overhead through simple aggregations. The purpose of ATOM's tracking module is also to reduce data transfer, but it does so using online tracking instead of simple aggregation, which delivers much more fine-grained information.

STAR [36] is a hierarchical algorithm for scalable aggregation that reduces communication overhead by carefully distributing the allowed error budgets. It suits systems like SDIMS [34] well. InfoEye [37] is a model-based information management system for large-scale service overlay networks through a set of monitoring sensors deployed on
different overlay nodes, with reduced overhead achieved by ad-hoc condition filters. InfoTrack [38] is a monitoring system similar to ATOM's tracking module, in that it tries to minimize the continuous monitoring cost while preserving most of the information precision, by leveraging temporal and spatial correlation of the monitored attributes; ATOM instead utilizes an optimal online tracking algorithm that is proved to achieve the best saving in network cost without any prior knowledge of the data. MELA [39] is a monitoring framework for cloud services which collects different dimensions of data tailored for analyzing cloud elasticity (e.g., scale up and scale down). ATOM may use MELA to collect, track, and monitor different types of metrics than those already available through CloudWatch.
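The contrast between aggregation-based monitors and online tracking can be made concrete with a toy sketch. The snippet below is a hypothetical illustration of the basic threshold-based idea (report a value only when it drifts beyond an allowed error bound, so the observer's view is always within that bound); it is not the competitive algorithm of [11], and the class name, delta value, and sample stream are invented for this example.

```python
class OnlineTracker:
    """Toy threshold-based online tracker (illustrative, not the algorithm of [11])."""

    def __init__(self, delta):
        self.delta = delta          # allowed tracking error
        self.last_reported = None   # observer's current view of the metric
        self.messages = 0           # communication cost so far

    def observe(self, value):
        """Process one new observation; return True if an update is sent."""
        if self.last_reported is None or abs(value - self.last_reported) > self.delta:
            self.last_reported = value
            self.messages += 1
            return True
        return False


tracker = OnlineTracker(delta=5.0)
sent = [tracker.observe(v) for v in [50, 52, 54, 61, 60, 70]]
# Only 50, 61, and 70 trigger updates; the other readings stay within the
# error bound, so three messages replace six raw samples.
```

Aggregation would instead ship a periodic summary regardless of how the signal moves; tracking sends nothing while the metric is stable, which is where the data-transfer savings come from.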
Cloud Security. IaaS systems also bring a new set of security problems. Leading cloud providers have developed advanced mechanisms to ensure the security of their IaaS systems. AWS [40] has many built-in security features such as firewalls, encrypted storage, and security logs. OpenStack uses a security component called Keystone [41] to do authentication and authorization. It also has security rules for network communication in its network component Neutron [42]. Other IaaS platforms have similar security solutions, which are mainly firewalls and security groups.

Nevertheless, it is still possible that hackers could bypass known security policies, or cloud users may accidentally run some malicious software. It is thus critical to detect such anomalies in near real-time, to avoid leaving hackers plenty of time to cause significant damage. Hence we need a monitoring solution that can actively detect anomalies and identify potentially malicious behavior over a large number of VM instances. AWS recently adopted its CloudWatch service for DDoS attacks [3], but it requires the user to check historical data and manually set a "magic value" as the threshold, which is unrealistic if the user's underlying workloads change frequently.

In contrast, ATOM can automatically learn the normal behavior from previously monitored data, and it detects more complex attacks besides DDoS attacks using PCA. PCA has been widely used to detect anomalies in network traffic volume in backbone networks [12], [13], [17], [43], [44], [45]. As we have argued in Section 4.1, adapting a PCA-based approach to our setting has not been studied before and presented significant new challenges.
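As a concrete, purely illustrative sketch of this style of detection, the snippet below learns a normal subspace from a window of metric samples and flags an observation whose squared prediction error (SPE) in the residual subspace exceeds a threshold, in the spirit of the subspace method [13], [16]. The synthetic data, the choice of k, and the 99th-percentile threshold are all assumptions of this example, not ATOM's actual parameters.

```python
import numpy as np

# Training window: 200 samples of 4 metrics, with two metrics correlated,
# as resource metrics of a real workload often are.
rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 4))
normal[:, 1] = 0.9 * normal[:, 0]

mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
k = 2            # dimension of the "normal" subspace (illustrative choice)
P = vt[:k].T     # top-k principal directions, shape (4, k)

def spe(x):
    """Squared prediction error: energy of x outside the normal subspace."""
    centered = x - mean
    residual = centered - P @ (P.T @ centered)
    return float(residual @ residual)

# Threshold from the training window itself (99th percentile, illustrative).
threshold = np.quantile([spe(x) for x in normal], 0.99)

# This observation breaks the learned correlation between metrics 0 and 1,
# so almost all of its energy falls in the residual subspace.
anomaly = np.array([10.0, -8.0, 3.0, 0.0])
is_attack = spe(anomaly) > threshold
```

The appeal of the residual-subspace test is exactly what the text argues: the threshold is learned from the observed workload rather than set as a manual "magic value".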
The security challenges in IaaS systems were analyzed in [7], [46], [47], [48]. Virtual machine attacks are considered a major security threat. ATOM's introspection component leverages existing open source VMI tools such as Stackdb [10] and Volatility [18] to pinpoint the anomaly to the exact process.

VMI is a well-known method for ensuring VM security [49], [50], [51], [52]. It has also been studied for IaaS systems [53], [54], [55]. However, to constantly secure a VM using the VMI technique, the entire VM memory needs to be traversed and analyzed periodically. It may also require the VM to be suspended in order to gain access to VM memory. Blacksheep [19] is such a system: it detects rootkits by dumping and comparing the memory of groups of similar machines. Though the performance overhead is claimed to be acceptably low to support real-time monitoring, user programs will clearly be negatively affected. Another solution was suggested [56] for cloud users to verify the integrity of their VMs. However, this is not an "active detection and reaction" system. In contrast, ATOM triggers VMI only when a potential attack is identified, and it also helps locate the relevant memory region to analyze, so introspection is performed much more effectively and efficiently through its orchestration component.

10 CONCLUSION

We present the ATOM framework, which can be easily integrated into a standard IaaS system to provide automated, continuous tracking, monitoring, and orchestration of system resource usage in nearly real-time. ATOM is extremely useful for anomaly detection, auto scaling, and dynamic resource allocation and load balancing in IaaS systems. Interesting future work includes extending ATOM for more sophisticated resource orchestration and incorporating the defense against even more complex attacks in ATOM.

ACKNOWLEDGMENTS

Min Du and Feifei Li were supported in part by grants US National Science Foundation CNS-1314945, CNS-1514520, and US National Science Foundation IIS-1251019. We wish to thank Eric Eide, Jacobus (Kobus) Van der Merwe, Robert Ricci, and other members of the TCloud project and the Flux group for helpful discussion and valuable feedback. The preliminary version of this paper appeared in IEEE BigData 2015 [57].

REFERENCES

[1] Amazon. [Online]. Available: http://www.aws.amazon.com/, Accessed on: Nov. 5, 2016.
[2] ITWORLD. [Online]. Available: http://www.itworld.com/security/428920/attackers-install-ddos-bots-amazon-cloud-exploiting-elasticsearch-weakness, Accessed on: Nov. 5, 2016.
[3] Amazon, "AWS Best Practices for DDoS Resiliency," [Online]. Available: https://d0.awsstatic.com/whitepapers/DDoS_White_Paper_June2015.pdf, Accessed on: Nov. 5, 2016.
[4] Eucalyptus. [Online]. Available: http://www8.hp.com/us/en/cloud/helion-eucalyptus.html, Accessed on: Nov. 5, 2016.
[5] D. Nurmi, et al., "The eucalyptus open-source cloud-computing system," in Proc. 9th IEEE/ACM Int. Symp. Cluster Comput. Grid, 2009, pp. 124–131.
[6] M. Du and F. Li, "SPELL: Streaming parsing of system event logs," in Proc. IEEE Int. Conf. Data Mining, 2016.
[7] W. Dawoud, I. Takouna, and C. Meinel, "Infrastructure as a service security: Challenges and solutions," in Proc. 7th Int. Conf. Inf. Syst., 2010, pp. 1–8.
[8] D. J. Dean, H. Nguyen, and X. Gu, "UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems," in Proc. 9th Int. Conf. Auton. Comput., 2012, pp. 191–200.
[9] LibVMI. [Online]. Available: http://libvmi.com/, Accessed on: Nov. 5, 2016.
[10] D. Johnson, M. Hibler, and E. Eide, "Composable multi-level debugging with Stackdb," in Proc. 10th ACM SIGPLAN/SIGOPS Int. Conf. Virtual Execution Environ., 2014, pp. 213–226.
[11] K. Yi and Q. Zhang, "Multi-dimensional online tracking," in Proc. 20th Annu. ACM-SIAM Symp. Discrete Algorithms, 2009, pp. 1098–1107.
[12] H. Ringberg, A. Soule, J. Rexford, and C. Diot, "Sensitivity of PCA for traffic anomaly detection," ACM SIGMETRICS Performance Eval. Rev., vol. 35, pp. 109–120, 2007.
[13] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in Proc. Conf. Appl. Technol. Archit. Protocols Comput. Commun., 2004, pp. 219–230.
[14] V. Varadarajan, T. Kooburat, B. Farley, T. Ristenpart, and M. M. Swift, "Resource-freeing attacks: Improve your cloud performance (at your neighbor's expense)," in Proc. ACM Conf. Comput. Commun. Secur., 2012, pp. 281–292.
[15] W. Li, H. H. Yue, S. Valle-Cervantes, and S. J. Qin, "Recursive PCA for adaptive process monitoring," J. Process Control, vol. 10, pp. 471–486, 2000.
[16] J. E. Jackson and G. S. Mudholkar, "Control procedures for residuals associated with principal component analysis," Technometrics, vol. 21, pp. 341–349, 1979.
[17] L. Huang, M. I. Jordan, A. Joseph, M. Garofalakis, and N. Taft, "In-network PCA and anomaly detection," in Proc. Neural Inf. Process. Syst., 2006, pp. 617–624.
[18] Volatility. [Online]. Available: http://www.volatilityfoundation.org/, Accessed on: Nov. 5, 2016.
[19] A. Bianchi, Y. Shoshitaishvili, C. Kruegel, and G. Vigna, "Blacksheep: Detecting compromised hosts in homogeneous crowds," in Proc. ACM Conf. Comput. Commun. Secur., 2012, pp. 341–352.
[20] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, 1996, pp. 226–231.
[21] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, "OLTP-Bench: An extensible testbed for benchmarking relational databases," Proc. VLDB Endowment, vol. 7, pp. 277–288, 2013.
[22] StackDB. [Online]. Available: http://www.flux.utah.edu/software/stackdb/doc/all.html#using-eucalyptus-to-run-qemukvm, Accessed on: Nov. 5, 2016.
[23] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, "ApproxHadoop: Bringing approximations to MapReduce frameworks," in Proc. 20th Int. Conf. Archit. Support Program. Languages Operating Syst., 2015, pp. 383–397.
[24] M. T. Al Amin, et al., "Social trove: A self-summarizing storage service for social sensing," in Proc. IEEE Int. Conf. Auton. Comput., 2015, pp. 41–50.
[25] J. Kelley, C. Stewart, N. Morris, D. Tiwari, Y. He, and S. Elnikety, "Measuring and managing answer quality for online data-intensive services," in Proc. IEEE Int. Conf. Auton. Comput., 2015, pp. 167–176.
[26] X. Wang, U. Kruger, and G. W. Irwin, "Process monitoring approach using fast moving window PCA," Ind. Eng. Chemistry Res., vol. 44, pp. 5691–5702, 2005.
[27] Amazon, "Amazon CloudWatch," [Online]. Available: http://aws.amazon.com/cloudwatch/, Accessed on: Nov. 5, 2016.
[28] OpenStack. [Online]. Available: http://www.openstack.org/, Accessed on: Nov. 5, 2016.
[29] OpenStack, "OpenStack Ceilometer," [Online]. Available: https://wiki.openstack.org/wiki/Ceilometer, Accessed on: Nov. 5, 2016.
[30] DATADOG. [Online]. Available: https://www.datadoghq.com/, Accessed on: Nov. 5, 2016.
[31] librato. [Online]. Available: https://www.librato.com/, Accessed on: Nov. 5, 2016.
[32] D. J. Dean, H. Nguyen, P. Wang, and X. Gu, "PerfCompass: Toward runtime performance anomaly fault localization for infrastructure-as-a-service clouds," in Proc. 6th USENIX Workshop Hot Topics Cloud Comput., 2014, pp. 16–16.
[33] R. Van Renesse, K. P. Birman, and W. Vogels, "Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining," ACM Trans. Comput. Syst., vol. 21, pp. 164–206, 2003.
[34] P. Yalagandula and M. Dahlin, "A scalable distributed information management system," in Proc. Conf. Appl. Technol. Archit. Protocols Comput. Commun., 2004, pp. 379–390.
[35] M. L. Massie, B. N. Chun, and D. E. Culler, "The ganglia distributed monitoring system: Design, implementation, and experience," Parallel Comput., vol. 30, pp. 817–840, 2004.
[36] N. Jain, D. Kit, P. Mahajan, P. Yalagandula, M. Dahlin, and Y. Zhang, "STAR: Self-tuning aggregation for scalable monitoring," in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 962–973.
[37] J. Liang, X. Gu, and K. Nahrstedt, "Self-configuring information management for large-scale service overlays," in Proc. 26th IEEE Int. Conf. Comput. Commun., 2007, pp. 472–480.
[38] Y. Zhao, Y. Tan, Z. Gong, X. Gu, and M. Wamboldt, "Self-correlating predictive information tracking for large-scale production systems," in Proc. 6th Int. Conf. Auton. Comput., 2009, pp. 33–42.
[39] D. Moldovan, G. Copil, H.-L. Truong, and S. Dustdar, "MELA: Monitoring and analyzing elasticity of cloud services," in Proc. IEEE 5th Int. Conf. Cloud Comput. Technol. Sci., 2013, pp. 80–87.
[40] Amazon, "AWS security center," [Online]. Available: http://aws.amazon.com/security/, Accessed on: Nov. 5, 2016.
[41] OpenStack, "OpenStack Keystone," [Online]. Available: http://docs.openstack.org/developer/keystone/, Accessed on: Nov. 5, 2016.
[42] OpenStack, "OpenStack Neutron," [Online]. Available: https://wiki.openstack.org/wiki/Neutron, Accessed on: Nov. 5, 2016.
[43] X. Li, et al., "Detection and identification of network anomalies using sketch subspaces," in Proc. 6th ACM SIGCOMM Conf. Internet Meas., 2006, pp. 147–152.
[44] Y. Liu, L. Zhang, and Y. Guan, "Sketch-based streaming PCA algorithm for network-wide traffic anomaly detection," in Proc. IEEE 30th Int. Conf. Distrib. Comput. Syst., 2010, pp. 807–816.
[45] L. Huang, et al., "Communication-efficient online detection of network-wide anomalies," in Proc. 26th IEEE Int. Conf. Comput. Commun., 2007, pp. 134–142.
[46] A. S. Ibrahim, J. H. Hamlyn-Harris, and J. Grundy, "Emerging security challenges of cloud virtual infrastructure," in Proc. APSEC Cloud Workshop, 2010.
[47] L. M. Vaquero, L. Rodero-Merino, and D. Moran, "Locking the sky: A survey on IaaS cloud security," Computing, vol. 91, pp. 93–118, 2011.
[48] C. R. Li, et al., "Potassium: Penetration testing as a service," in Proc. 6th ACM Symp. Cloud Comput., 2015, pp. 30–42.
[49] T. Garfinkel, et al., "A virtual machine introspection based architecture for intrusion detection," in Proc. Netw. Distrib. Syst. Secur. Symp., 2003, pp. 191–206.
[50] J. Pfoh, C. Schneider, and C. Eckert, "A formal model for virtual machine introspection," in Proc. ACM Workshop Virtual Mach. Secur., 2009, pp. 1–10.
[51] B. Dolan-Gavitt, T. Leek, M. Zhivich, J. Giffin, and W. Lee, "Virtuoso: Narrowing the semantic gap in virtual machine introspection," in Proc. IEEE Symp. Secur. Privacy, 2011, pp. 297–312.
[52] Y. Fu and Z. Lin, "Space traveling across VM: Automatically bridging the semantic gap in virtual machine introspection via online kernel data redirection," in Proc. IEEE Symp. Secur. Privacy, 2012, pp. 586–600.
[53] A. S. Ibrahim, J. Hamlyn-Harris, J. Grundy, and M. Almorsy, "CloudSec: A security monitoring appliance for virtual machines in the IaaS cloud model," in Proc. 5th Int. Conf. Netw. Syst. Secur., 2011, pp. 113–120.
[54] F. Zhang, J. Chen, H. Chen, and B. Zang, "CloudVisor: Retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization," in Proc. 23rd ACM Symp. Operating Syst. Principles, 2011, pp. 203–216.
[55] H. W. Baek, A. Srivastava, and J. Van der Merwe, "CloudVMI: Virtual machine introspection as a cloud service," in Proc. IEEE Int. Conf. Cloud Eng., 2014, pp. 153–158.
[56] B. Bertholon, S. Varrette, and P. Bouvry, "Certicloud: A novel TPM-based approach to ensure cloud IaaS security," in Proc. IEEE 4th Int. Conf. Cloud Comput., 2011, pp. 121–130.
[57] M. Du and F. Li, "ATOM: Automated tracking, orchestration and monitoring of resource usage in infrastructure as a service systems," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 271–278.

Min Du received the bachelor's and master's degrees from Beihang University, in 2009 and 2012, respectively. She is currently working toward the PhD degree in the School of Computing, University of Utah. Her research interests include big data analytics and cloud security. She is a student member of the IEEE.

Feifei Li received the BS degree in computer engineering from Nanyang Technological University, in 2002 and the PhD degree in computer science from Boston University, in 2007. He is currently an associate professor in the School of Computing, University of Utah. His research interests include database and data management systems and big data analytics. He is a member of the IEEE.