
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE SYSTEMS JOURNAL 1

System of Systems for Quality-of-Service Observation and Response in Cloud Computing Environments
Paul C. Hershey, Senior Member, IEEE, Shrisha Rao, Senior Member, IEEE,
Charles B. Silio, Jr., Life Senior Member, IEEE, and Akshay Narayan, Student Member, IEEE

Abstract: As military, academic, and commercial computing systems evolve from autonomous entities that deliver computing products into network centric enterprise systems that deliver computing as a service, opportunities emerge to consolidate computing resources, software, and information through cloud computing. Along with these opportunities come challenges, particularly to service providers and operations centers that struggle to monitor and manage quality of service (QoS) for these services in order to meet customer service commitments. Traditional approaches fall short in addressing these challenges because they examine QoS from a limited perspective rather than from a system-of-systems (SoS) perspective applicable to a net-centric enterprise system in which any user from any location can share computing resources at any time. This paper presents a SoS approach to enable QoS monitoring, management, and response for enterprise systems that deliver computing as a service through a cloud computing environment. A concrete example is provided for application of this new SoS approach to a real-world scenario (viz., distributed denial of service). Simulated results confirm the efficacy of the approach.

Index Terms: Cloud computing, distributed denial of service (DDoS), enterprise systems, information assurance, net centric, quality of service (QoS), security, service-oriented architecture (SOA), systems of systems (SoS).

I. INTRODUCTION

AS ECONOMIC pressure intensifies for network and enterprise operations centers, those responsible for these centers seek a method to lower costs in the presence of data overload. This dramatic increase in the quantity of data is a product of the evolution of complex net-centric enterprise systems over which multiple disparate users in dispersed locations share gigabytes, terabytes, or even petabytes of data at high speeds over production networks. Cloud computing provides one possible method to meet this challenge by reducing cost through shared computing resources while distributing data to multiple end users efficiently [1], [2]. However, complex systems that use cloud computing, such as shown in Fig. 1, are prone to failure and security compromise in five main areas: computing performance (e.g., latency and time delay experienced by a system when processing a request), cloud reliability (e.g., network connectivity), economic goals (e.g., interoperability between cloud providers), compliance (e.g., digital forensics to discern what happened, learn how to prevent incidents, and collect information for future actions), and information security (e.g., to protect the confidentiality and integrity of data and ensure data availability) [1], [2]. Present approaches to mitigate these issues are not sufficient to ensure quality of service (QoS) for end users [3].

Recent survey articles [4]–[7] summarize challenges for cloud computing to provide mechanisms to mitigate the issues listed above, the need for monitoring the various layers of the cloud computing environment, and the need to provide end-to-end solutions for the problems. The architectural frameworks considered in these excellent surveys are one dimensional and principally deal with the infrastructure, platform, and software as service layers of cloud computing. Federated clouds with cross-cloud connectivity provide an opportunity mentioned in [6] to support service-oriented computing, and QoS is discussed in [8] but is limited in application to the task of virtual machine provisioning in data centers rather than to end-user satisfiability. None treats the clouds from a system-of-systems point of view as is done here. The architectural model used here to support a service-oriented architecture (SOA) is multidimensional, extending to layers of business and governance as services, while reducing complexity by combining the traditional platform and applications layers into one software layer that provides overall Software as a Service (SaaS) to end users. Furthermore, specific locations and the use of software agents for monitoring and observation in this system of systems (SoS) are presented.
Manuscript received November 16, 2012; revised September 19, 2013; accepted December 10, 2013.
P. C. Hershey is with Raytheon Intelligence, Information and Services, Dulles, VA, 20166 USA (e-mail: [email protected]).
S. Rao is with the International Institute of Information Technology-Bangalore, Bangalore 560100, India (e-mail: [email protected]).
C. B. Silio Jr., is with the University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]).
A. Narayan is with School of Computing, National University of Singapore, Singapore 117417 (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSYST.2013.2295961

Motivation: QoS specification and monitoring for cloud services is a complex and challenging issue, one where there are few universal benchmarks or standards. While the usual quality metrics such as uptime and reliability may still be considered applicable in the context of cloud systems, it is less clear what the QoS parameters unique to such systems are and how they should be applied in specific contexts. A lack of common metrics to be applied to cloud offerings from various providers is also a barrier to standardization of cloud offerings from different providers. Thus, although “federated” clouds using services from different providers are thought to

1932-8184 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Complex cloud computing environment [2].

TABLE I
SoS CHARACTERISTICS

be the way of the future, it is not yet possible to construct complex services using a mixture of services from different cloud providers [9]. QoS parameters are also often specified in semitechnical legalese, and it is a challenge to derive their appropriate mathematical formulations.

Another issue with QoS parameters and their evaluation is that there typically are multiple metrics, some positive (where a higher value means a better QoS, e.g., network bandwidth) and some negative (where a lower value means a better QoS, e.g., network latency), and it is far from obvious how, in general, the various parameters should be weighted and compared (particularly in cases where the metrics are incompatible or unavailable all at once). This, in fact, is an example of a challenging problem in multicriteria decision making, for which a method called simple additive weighting [10] has been applied in the past in the context of web services [11]. More recently, in the context of cloud computing, simple additive weighting and the analytic hierarchy process [12] have been used [13] to scale and compare QoS parameters, both positive and negative.

Proper resource allocations to different tiers of a cloud system are also known to be critical to improving the QoS, but existing approaches tend to be ad hoc and wasteful [14].
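To make the weighting of positive and negative QoS metrics concrete, the following is a minimal Python sketch of simple additive weighting. It is an illustration only and is not taken from [10], [11], or [13]: the min-max normalization, the function name saw_scores, the provider names, the metric names, the weights, and all numeric values are assumptions chosen for the example.

```python
def saw_scores(offers, weights, higher_is_better):
    """Score each offer with simple additive weighting (SAW).

    offers: {offer_name: {metric_name: raw_value}}
    weights: {metric_name: weight}, assumed to sum to 1
    higher_is_better: {metric_name: bool}, True for positive metrics
    """
    scores = {name: 0.0 for name in offers}
    for metric, weight in weights.items():
        values = [offers[name][metric] for name in offers]
        lo, hi = min(values), max(values)
        span = hi - lo
        for name in offers:
            value = offers[name][metric]
            if span == 0:
                normalized = 1.0  # every offer ties on this metric
            elif higher_is_better[metric]:
                normalized = (value - lo) / span  # positive metric
            else:
                normalized = (hi - value) / span  # negative metric
            scores[name] += weight * normalized
    return scores


# Hypothetical offers from two providers (illustrative numbers only).
offers = {
    "provider_a": {"bandwidth_mbps": 100.0, "latency_ms": 20.0},
    "provider_b": {"bandwidth_mbps": 50.0, "latency_ms": 10.0},
}
weights = {"bandwidth_mbps": 0.25, "latency_ms": 0.75}
higher_is_better = {"bandwidth_mbps": True, "latency_ms": False}

scores = saw_scores(offers, weights, higher_is_better)
best = max(scores, key=scores.get)
```

Normalizing each metric to the range 0 to 1 before weighting puts bandwidth and latency on a common scale, and inverting the negative metrics means a larger score is always better. Other normalizations are possible, and the choice of weights remains the hard part of the problem described above.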
The approach presented in this paper introduces a SoS to provide a clear and concise view of QoS events within cloud computing environments that proactively informs enterprise operators of the state of the enterprise and, thereby, enables timely operator response to QoS problems. Section II provides a step-by-step description of the SoS approach; Section III provides the mathematical model for the QoS metrics considered in our work. Section IV describes the application of this approach to a cyber security scenario in which a complex cloud computing environment is subjected to a distributed denial-of-service (DDoS) attack. Section V details the experimental results obtained in the prototype system created to verify the SoS approach. Section VI presents benefits and conclusions.

II. APPROACH

Step 1: Define a SoS for monitoring, management, and response. A SoS [15]–[18] possesses the characteristics shown in Table I. Fig. 2 presents a SoS comprising a system of multiple

HERSHEY et al.: SoS FOR QoS OBSERVATION AND RESPONSE IN CLOUD COMPUTING ENVIRONMENTS 3

Fig. 2. Net-centric SOA-based SoS.

Fig. 3. EMMRA CC.

administrative domains operating within a SOA-based cloud computing system [19]. For the system depicted in Fig. 2, a single authority provides governance services to multiple heterogeneous administrative domains in which SOA-based applications enable business and collaboration services that support end users who are producing and consuming data using software and infrastructure services.

Step 2: Derive framework for QoS monitoring, management, and response in cloud computing environments. The Enterprise Monitoring, Management, and Response Architecture (EMMRA) for Cloud Computing Environments (EMMRA CC), as shown in Fig. 3, extends previous work [19] to provide structure from which to identify points within the administrative domains of Fig. 2 where key QoS metrics may be monitored and managed.

This framework is multidimensional to enable an end-to-end view of the system where metric viewpoints may be located within traditional Open Systems Interconnection (OSI) layers (infrastructure through applications/software) and SOA-based layers (business and governance), as well as across these layers.

The X dimension (Response Time) defines time-based services based on measurement time intervals (MTIs) ranging from microseconds to days or beyond, depending on the time criticality of the information to the mission. In each domain, the term service refers to any program, algorithm, function, analysis technique, or monitoring activity that uses or interprets enterprise system information. A service can be a single individual or group of individuals (e.g., individuals doing problem determination or analysis), a program providing statistics, a piece of code performing network functions (such as load balancing), or another entity using network data. Different services in which network managers are interested require different time periods (i.e., MTIs) for data collection. For example, a critical mission in which lives may be at stake could require real-time (i.e., microsecond) monitoring, analysis, and response for QoS event problem determination. On the other end of the spectrum, a department store studying customer trends may be content to collect and analyze data on a daily basis as they plan for an upcoming sales event. Traditional performance measurement and analysis approaches do not have the flexibility to collect data and assess QoS performance in real time for the wide range of MTIs required by enterprise-level services. A need exists for a flexible data collection device that can collect information within complex enterprise systems for varying MTIs. EMMRA CC provides this flexibility and thereby meets critical QoS mission requirements for diverse missions with varying MTIs.

The Y dimension (Domains) detects and responds to enterprise events using similar techniques and instrumentation. A key contribution from this step is the extension of the domain dimension from Infrastructure as a Service (IaaS) and SaaS, where traditional techniques attempt to enforce QoS in cloud computing environments, to include Business as a Service (BaaS) and Governance as a Service (GaaS) to meet the challenges of SOA-based net-centric enterprise systems. For BaaS, EMMRA CC focuses on monitoring business processes and managing these to ensure their uniform implementation among end users. For GaaS, EMMRA CC identifies ownership of governance services and enables the institution and enforcement of policies that influence enterprise-wide behavior.

The Z dimension (Planes) introduces structures that monitor and manage particular end-to-end events. Planes (namely, usage, control, management, and cyber security) encompass all domains and thereby provide a cross-domain solution that

enables enterprise-wide monitoring, management, and response. Planes also span multiple MTIs, enabling them to address multiple and diverse services. Implementing a plane-based approach within a SOA methodology is highly effective compared with existing domain-only QoS parameter evaluation techniques because the EMMRA CC approach fills the gaps between the domain monitoring layers. For example, the usage monitoring plane encompasses those activities that filter, collect, analyze, and disseminate information about the user data. The monitoring and management of this information could originate with an activity in the GaaS domain (e.g., dissemination, implementation, and assessment of operational policies), but this activity can then influence similar activities within the BaaS, SaaS, and IaaS domains. The usage monitoring plane provides this enterprise-wide view.

Step 3a: Identify cloud computing metrics. To enable effective QoS monitoring, management, and response across and among multiple and diverse enterprise operations centers, we define relevant service-based QoS metric categories and metrics within those categories. The term service-based metric represents an object that has a name (i.e., metric name), a definition, a value (i.e., measure), an observation time period, a computing or collection method (i.e., measurement), and one or more threshold values for setting alarms (e.g., minor, major, and critical). In this paper, whenever the word metric is used, it refers to the metric name of the metric object. The term measure is a quantitative value of the metric object derived for the observation time interval. The term measurement is the computing or collection process or method of determining the value of the metric object.

To clarify the meaning of these definitions, consider the example of a toll bridge with two parameters of interest: the length of the bridge and the number of cars that have passed over the bridge. Let the service provider be the local department of transportation whose mission is to determine whether the capacity of the bridge is adequate or if a second bridge span will be required to handle the traffic at some time in the future. A common characteristic of metrics is that they change over time. The length of the bridge is static over time and does not change regardless of how often it is measured. Thus, this parameter is not a metric. By contrast, the number of cars that have passed over the bridge requires counting cars over an observation time interval. This observation time interval could be the last hour, the last day, or from the time that the bridge first opened. The number of cars counted can be reset to zero at the beginning of the observation time interval as required by the end user in order to best meet the purpose of collecting the traffic flow information, e.g., throughput analysis or capacity planning. For this example, the terminology number of cars that have passed over the bridge is a service-based metric. The value representing the counted number of cars per observation time interval provides the measure for the metric. The measurement method for the metric could be to use a sensor placed across the lanes of the bridge that would advance a counter each time a car passed over it.

This paper focuses on the QoS metrics categories of performance and security and their respective metrics, as shown in Table II. In Section V, we present results for the categories of Delay, Delay Variation, Throughput, Authentication, Authorization, and Certification and Accreditation.

TABLE II
METRICS CATEGORIES

Delay is the elapsed time observed for a completed or on-going task and is caused by processing, queuing, and transmission of data. Examples of other terminology used interchangeably with delay include response time, round-trip time, and latency. Delay metrics identify when the cumulative effects of processing and transport hinder the end user's ability to accomplish the mission. Delay is obvious to the end user and therefore influences customer satisfaction.

The comparison of delay for different observation time intervals is Delay Variation. Examples of delay variation include the following: the variation in application response time between peak and nonpeak hours, the edge-to-edge delay variation in the pattern of packet arrival events, and jitter. Delay variation identifies system instability that either presently prevents the end users from successfully executing their missions or that is a forewarning of upcoming problems that will do so.

Throughput metrics describe the amount of work completed over a time period [20]. Throughput metrics identify the level of work accomplished by the team, application, computer, and network. Proper interpretation of these metrics enhances productivity, efficiency, and resource allocation.

The Authentication metrics category includes metrics that confirm the identity of users, systems, or data sources. Authentication employs one or more mechanisms such as passwords, key exchange, digital certificates, and biometrics. Confirmation of entity identity is a fundamental requirement in establishing trust and confidence in the service and enabling other security functions.

User Authorization metrics report the success and failure of access to resources based on policy and permission levels. Authorization extends authentication (confirming an entity’s identity) to define the entity’s privileges (i.e., those functions that the entity can be trusted to perform). Authorization metrics enforce the principle of least privilege. By ensuring that entities are assigned the fewest privileges consistent with their mission, the overall service integrity is maintained, and mission effectiveness is enhanced.

Certification and Accreditation (C&A) is the comprehensive evaluation of the technical and nontechnical security features of an IT system and other safeguards, made in support of the accreditation process, to establish the extent to which a particular design and implementation meets a set of specified security requirements [21]. DoD requires C&A security

Fig. 4. Locations for IA/security, delay, and throughput metric observation.

through the DoD Information Technology Security Certification and Accreditation Process. Metrics within the C&A metrics category identify security risks and deficiencies, provide information to help ensure that steps are taken to correct these deficiencies and vulnerabilities, and provide information to help ensure the safeguarding of applications, networks, systems, data, and information.

Step 3b: Measuring performance metrics. When considering heterogeneous resource types, having a baseline for measuring all performance and security metrics is nontrivial. In our work, we measure low-level independent performance metrics such as delay at each level of EMMRA CC. In the prototype of an online transaction processing application, we measure the response time at each of the layers. Specifically, we measure the following: 1) time taken for a database query to return the results (on the cloud instance hosting the database), 2) time taken by the application logic to execute (on the cloud instance hosting the application server), and 3) time taken for the data to be transmitted over the network between the application server and the database server. Throughput is measured as the number of transactions completed when viewed from the application perspective and hence is measured on the instance hosting the application server. A similar approach to measure performance metrics for SLA management has been used in earlier work [22].

Step 4: Identify suitable locations within the cloud computing environment for metric detection. This step identifies suitable locations to observe and collect the metrics in Table II. Fig. 4 shows these locations for an expanded set of metrics for security (green boxes), delay (yellow boxes), and throughput (blue boxes) for a representative cloud computing environment realized through a SOA-based net-centric enterprise system. This system includes four user communities: 1) end users shown with the client workstation, 2) help desk shown with the trouble management system database, 3) operations shown with the configuration management database, and 4) engineering shown with the project control and development tracking database. With respect to the system components, the end-user client machines that run applications over the base/camp/post/station network appear at the left side of the figure. The network edge appears as a Customer Edge (CE) router attached to a High Assurance IP Encryptor (HAIPE) device and a Provider Edge (PE) router, along with a cache. The core network includes information transported via Multiprotocol Label Switching (MPLS), Dense Wavelength-Division Multiplexing, and Synchronous Optical Network services. The network terminates at a Defense Enterprise Computing Center (DECC) with the connection from a second PE router to a HAIPE device and then to a DECC Edge (DE) router and a DECC LAN. To the right of the DE are the computing services to be shared through a cloud computing environment, including those for Dynamic Host Configuration Protocol (DHCP), Domain Name System (DNS), Discovery, App Server, Portal, Message Queue, Web Server, Service Node, Security, and Database. As an example, we provide the rationale used to select the locations for IA/security metrics.

Authentication: Cyber security can be monitored at the Apps, Portal, and Security/SSO servers. For example, EMMRA CC agents can monitor Security Assertion Markup Language authentication assertions at the Security/SSO server. These

assertions facilitate the secure exchange of authentication information between systems regardless of their underlying security mechanisms.

Fig. 5. Locations for EMMRA CC agents and CA nodes.

Likewise, EMMRA CC agents can effectively monitor and respond to Authorization events from the Apps, Portal, and Security/SSO servers where they can monitor and respond to information such as need-to-know determination required to grant authorization to a resource. This is typically implemented in the DD Form 2875, System Authorization Access Request. For this example, EMMRA CC agents embedded within the Apps server would report the number of roles supported and the number of users assigned to each role, including the number added and the number removed over an observation period. The EMMRA CC agent would also monitor system resources access attempts to determine authorization failures.
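The per-period authorization report described above could be assembled along the following lines. This Python sketch is hypothetical and does not reproduce the EMMRA CC agent: the AuthorizationReport type, the summarize_period function, and the input shapes (role-to-users mappings sampled at the start and end of the observation period, plus a list of access attempts) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuthorizationReport:
    roles_supported: int
    users_per_role: dict
    users_added: int
    users_removed: int
    authorization_failures: int


def summarize_period(roles_at_start, roles_at_end, access_attempts):
    """Build one report for a single observation period.

    roles_at_start / roles_at_end: {role_name: set of user names}
    access_attempts: iterable of (user, resource, granted) tuples
    """
    added = removed = 0
    for role in set(roles_at_start) | set(roles_at_end):
        before = roles_at_start.get(role, set())
        after = roles_at_end.get(role, set())
        added += len(after - before)    # assignments gained in the period
        removed += len(before - after)  # assignments lost in the period
    failures = sum(1 for _user, _resource, granted in access_attempts
                   if not granted)
    return AuthorizationReport(
        roles_supported=len(roles_at_end),
        users_per_role={role: len(users)
                        for role, users in roles_at_end.items()},
        users_added=added,
        users_removed=removed,
        authorization_failures=failures,
    )
```

Comparing two snapshots keeps the sketch independent of how role changes are logged. An agent with access to a change log could count additions and removals directly instead.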
Non-repudiation confirms that a transaction between two parties took place. Thus, a good location for EMMRA CC agents would be at the receiving end or a processing intermediary, such as the Message Queue in Fig. 4. Here, the EMMRA CC agents can observe confirmation of messages both from the sender to the agent and from the agent to the receiver.

A prime location to monitor and respond to Integrity events is the DNS server. For example, web applications use DNS name resolution to resolve host names to IP addresses. An attack called DNS poisoning corrupts the DNS domain-to-IP address database such that requests are redirected from the legitimate server to the attackers’ server for the purpose of presenting fraudulent content or collecting sensitive information. To prevent DNS integrity problems, EMMRA CC agents should audit all administrative updates to the master database and data transfer activity and provide a response so that operators configure DNS such that the end user never directly contacts the master database. EMMRA CC agents should search the DNS logs for particular events such as spikes in DNS traffic that could indicate a redirection. In addition, EMMRA CC agents should work with Intrusion Detection Systems, Intrusion Prevention Systems, anti-virus, and configuration tracking software at the DNS servers to deter unauthorized changes to the server.

For Information Availability monitoring and response, EMMRA CC agents should be installed at the DECC LAN router and DHCP server to observe malicious traffic that could reduce information availability below its required threshold, thereby indicating spurious threat activity at individual servers. The commander Joint Task Force for Computer Network Defense can use this metric to set and track the Information Operations Condition (INFOCON) [23]. Based on the INFOCON status level, a Collection and Analysis (CA) node in the operations center can invoke countermeasures to uniformly heighten or reduce defensive posture, defend against computer network attacks, and mitigate sustained damage to the DoD information infrastructure.

Certification and Accreditation monitoring and response supports the DoD Information Assurance Certification and Accreditation Process Instruction [24]. EMMRA CC agents distributed within the engineering project control and development-tracking database can provide the relevant information to support ongoing certification and accreditation.

The DoD requires Physical Security for every enclave of a net-centric enterprise system. Thus, each user community could benefit from embedded EMMRA CC agents located at perimeter devices that indicate destruction, theft, or sabotage from unauthorized access to facilities, equipment, material, data, information, or documents.

Step 5: Identify potential implementation schemes from which to collect and analyze the cloud computing QoS metrics. One possible scheme presented in this paper embeds EMMRA CC agents within multiple diverse cloud computing components where they can continuously monitor the enterprise system for QoS metrics associated with cloud computing environments (e.g., delay, throughput, and security metrics). These agents communicate over an out-of-band (OOB) monitoring network to EMMRA CC CA nodes that are located at local, regional, enterprise, and global operations centers, as shown in Fig. 5.

III. SYSTEM MODEL

Here, we describe the mathematical model for the QoS metrics considered in our work. Furthermore, we describe the possible response actions to be considered on a QoS breach. We describe the individual metric classes considered in Table II.

A. Performance

The performance QoS metrics are additive in the numerical sense. The SoS view from the top-level domain in Fig. 3 (i.e., GaaS) perceives delay as a sum of the delays experienced in the other lower domain levels of the cloud. This is also dependent on the infrastructure components used to provide the service. Hence, we must include the component-induced performance degradation. The delay metric can be represented as

$$D_{SoS} = p_1 \cdot D_G + p_2 \cdot D_B + p_3 \cdot D_S + p_4 \cdot D_I \tag{1}$$

where each $p_i$ is a parameter that is dependent on the infrastructure component used. $D_j$ is the delay experienced in each layer $j$ in EMMRA, where the specific letter for $j$ is the domain (i.e., Governance, Business, Software, and Infrastructure).

We define throughput at the system level as the number of transactions that are completed per unit time. Throughput can

be visualized at different levels. Throughput at the GaaS level of EMMRA is on the order of a few days. This must be captured in a different scale. However, the throughput at the lower levels of EMMRA is multiplicative in nature. The throughput at every level is a function of the throughput at a lower level in EMMRA. Hence, we have

$$\begin{aligned} T_I &= n \times \text{Transaction Throughput} \\ T_S &= m \times T_I \\ T_B &= q \times T_S \end{aligned} \tag{2}$$

Here, $m$, $n$, and $q$ are the numbers of transactions at the lower domain needed to complete a transaction in the higher domain. Additionally, at each level, throughput is additive in nature. For example, at the software layer, if there are $p$ operations independent of each other (which may or may not require services from the infrastructure layer), then the throughput of the software layer is the sum total of the number of operations completed per unit time.
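As a rough numerical illustration of (1) and (2), the Python sketch below evaluates both expressions exactly as they are written. The function names, the single-letter domain keys, and any values passed in are assumptions for the example. The prototype's actual parameters are not given in this section.

```python
DOMAINS = ("G", "B", "S", "I")  # Governance, Business, Software, Infrastructure


def sos_delay(domain_delays, component_params):
    """Eq. (1): D_SoS as the p-weighted sum of the per-domain delays."""
    return sum(component_params[d] * domain_delays[d] for d in DOMAINS)


def domain_throughputs(transaction_throughput, n, m, q):
    """Eq. (2), applied literally: T_I first, then T_S, then T_B."""
    t_i = n * transaction_throughput
    t_s = m * t_i
    t_b = q * t_s
    return {"I": t_i, "S": t_s, "B": t_b}


def layer_throughput(operation_rates):
    """Within one layer, the rates of independent operations add up."""
    return sum(operation_rates)
```

For instance, with hypothetical delays of 2.0, 1.0, 0.5, and 0.25 time units for G, B, S, and I, and component parameters of 1.0 everywhere except 2.0 for I, sos_delay returns 4.0. Keeping the component parameters separate from the measured delays mirrors the split in (1) between what is observed per domain and what depends on the infrastructure component used.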

Fig. 6. Security enclave.

B. Security

Security can be thought of as a functional requirement of the system. It comprises authentication and authorization using certificates and accreditation.

The authentication QoS metric is the logical conjunction at each level in EMMRA. The users’ access to the system ceases at the level at which authentication fails. Hence, the SoS view of authentication is a logical AND of the authentications at various levels in EMMRA. Security can be viewed as a top-down metric, i.e.,

$$A_{SoS} = A_G \wedge A_B \wedge A_S \wedge A_I. \tag{3}$$

The lower level EMMRA components have to be kept secure from the end user. A user at the top level can obtain service from the bottom levels but is not authorized to access the components directly. Only specific personnel are allowed access to the lower level components (viz., administrators). Hence, in order to obtain access to lower level components, the user needs to be authenticated at the top level. This is modeled in (3).

Authorization, however, is a bottom-up metric and is applicable at each level. User access to the service at any layer of EMMRA is subject to authorization. The authorization is such that the least privilege is granted sufficient to accomplish the operation. Authorization is applicable at each level in EMMRA CC. For example, in a banking application, an administrator is not authorized to access account details of the customer of the bank. Authorization at the IaaS level can be represented as

$$\mathrm{Auth}_I = \min_{i \in \text{Set of actions}} p_i \tag{4}$$

where $p_i$ is the permission to perform action $i$ at the IaaS level. The min operator is used here as an indication of the least privilege level that is granted to the user. Authorization is defined similarly for the rest of the levels in EMMRA CC. The SoS view of authorization can be obtained using methods such as linear logic [25].

IV. REAL-WORLD APPLICATION SCENARIOS

This section describes a real-world scenario (i.e., use case) in which the EMMRA CC SoS approach is applied to a complex cloud computing environment that is exposed to a cyber security threat (i.e., DDoS) [26]. In this scenario, the Cyber Security Plane is used to observe cyber security threats across all domains in order to detect and enable proactive response to a DDoS security breach within any of these domains that could compromise the transactions and cause potentially devastating consequences to the end user.

DDoS cyber attacks are on the rise with ever increasing sophistication and costs to victims. A DDoS attack attempts to exhaust the victims’ resources, including network bandwidth, computing power, and operating system data structures. To launch a DDoS attack, malicious users first build a network of computers that they will use to produce the volume of traffic needed to deny services to computer users. To create this attack network, attackers discover vulnerable sites or hosts on the network that are then exploited by attackers who install new programs (known as attack tools) on the compromised hosts of the attack network. The hosts that are running these attack tools are known as zombies, and they can carry out any attack under the control of the attacker. Many zombies together form an army that comprises both master zombies and slave zombies. The attacker coordinates and orders master zombies, and they, in turn, coordinate and trigger slave zombies that send a large volume of packets to the victim, flooding its system with useless load and exhausting its resources.

In order to devise defenses against DDoS attacks in a cloud computing environment, we segregate this system into security enclaves, i.e., a collection of computing environments connected by one or more internal networks under the control of a single authority and security policy. Fig. 6 depicts the main components of a security enclave [27], [28], including a Demilitarized Zone (DMZ), which is defined as a security
Fig. 7. Operational view of EMMRA CC agents and CA node deployment to counter DDoS attacks.
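As a minimal sketch of the security metrics in (3) and (4), the following Python fragment evaluates authentication as a logical AND across the four EMMRA levels and authorization as the minimum permission over a set of actions. The level labels, the integer encoding of permission levels (larger meaning more privilege), the action names, and the function names are illustrative assumptions and are not part of the EMMRA CC framework.

```python
from typing import Mapping

# Hypothetical level labels: governance, business, software, infrastructure.
LEVELS = ("G", "B", "S", "I")


def authentication_sos(auth: Mapping[str, bool]) -> bool:
    """SoS authentication in the spirit of (3): AND of the per-level results."""
    return all(auth[level] for level in LEVELS)


def authorization_level(permissions: Mapping[str, int]) -> int:
    """Authorization in the spirit of (4): minimum permission over the actions.

    Permissions are encoded as integers (assumption: larger = more privilege).
    """
    return min(permissions.values())


# A failure at any single level denies access at the SoS level.
auth_ok = authentication_sos({"G": True, "B": True, "S": True, "I": True})
auth_fail = authentication_sos({"G": True, "B": False, "S": True, "I": True})

# Hypothetical permissions for three actions at the IaaS level.
auth_i = authorization_level({"read": 2, "write": 1, "delete": 0})
```

Under these assumptions, a single failed level is enough to make `authentication_sos` return `False`, and the granted authorization is bounded by the most restrictive permission in the action set.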

boundary created between two security policy-enforcing components. Note that the existence of a security enclave, while it reduces the chances and severity of a DDoS attack, does not guarantee attack protection.

Fig. 7 depicts a scenario in which a cloud computing system comprises six security enclaves that communicate over a global network. In this scenario, an attacker residing in enclave 1 launches a DDoS attack using master zombies (i.e., bots) in enclaves 2 and 3 and slave zombies in enclaves 4 and 5 to initiate and sustain the DDoS attack on a victim in enclave 6.

The EMMRA CC nodes are distributed among local, regional, enterprise, and global operations centers, as described in Section II, and communicate in an OOB network, as shown in Fig. 7.

This distribution of CA nodes corresponds to the domain layers of the EMMRA CC framework in Fig. 3 and enables DDoS detection and response for the infrastructure, software, business, and governance layers of SOA-enabled cloud computing environments [26]. EMMRA CA nodes are implemented through software or by programmable hardware devices. They are initially programmed to collect and analyze metrics from CC agents that indicate anomalous data relevant to their original mission, e.g., excessive delay on voice-based links used for military operations collaboration. Once anomalous behavior is detected, the CA node conducts correlation and analyses to isolate the threat and provide a response, viz., shutting down the affected links and redistributing traffic so that the users do not lose QoS.

V. RESULTS

Here, we describe the experimental setup and our insights into the QoS in a SoS setup. We built a prototype of an online transaction processing application. While each layer of the application was hosted on a different cloud instance, the cloud itself was deployed in a private network. This allows us to demonstrate the dependence of performance metrics on the component-induced degradation or, in our example, a lack of such degradation. For example, because the setup was deployed in a private network supported by gigabit Ethernet, network-induced delay in the transactions was barely perceptible. Network-induced delay is a significant component of induced degradation in enterprise systems that have a global presence. Our setup emulates the setup followed by many of the online transaction processing applications in the enterprise environment. The descriptive statistics of the recorded delay for the application instance are shown in Table III. The standard deviation is large, which indicates large delay variations. In addition, we note that the data are skewed, with a huge difference between minimum and maximum delays. These data are similar to those of real-world applications [29].

TABLE III
DESCRIPTIVE STATISTICS OF APPLICATION INSTANCE

A. Delay

As discussed in Section II, Step 3, delay is one of the primary parameters considered while evaluating the QoS of a system and can be introduced into the system for various reasons, including misconfiguration of the software stack, blocked ports in the network, and data processing delays. In our work,
TABLE IV
DELAY RECORDED IN TEN SAMPLE TRANSACTIONS

Fig. 8. Variation in delay recorded over time.

Fig. 9. Variation in observed delay and calculated delay.

Fig. 10. Throughput: Number of transactions per minute.
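The kind of descriptive statistics reported for the recorded delay (spread, extremes, and skew) can be computed from delay samples along the following lines. This is only a sketch: the sample values are invented for illustration and are not the measurements behind Table III, and the moment-based skewness estimator is an assumption, since the text does not state which estimator was used.

```python
import statistics


def describe_delays(delays_ms):
    """Return basic descriptive statistics for a list of delay samples (ms)."""
    mean = statistics.fmean(delays_ms)
    stdev = statistics.pstdev(delays_ms)  # population standard deviation
    # Moment-based skewness: mean cubed deviation divided by stdev cubed.
    # Assumes the samples are not all identical (stdev > 0).
    skew = sum((d - mean) ** 3 for d in delays_ms) / (len(delays_ms) * stdev ** 3)
    return {
        "min": min(delays_ms),
        "max": max(delays_ms),
        "mean": mean,
        "stdev": stdev,
        "skewness": skew,
    }


# Hypothetical samples (ms): mostly short delays with one long outlier.
samples = [10.0, 12.0, 11.0, 13.0, 12.0, 90.0]
stats = describe_delays(samples)
```

With a single long outlier among short delays, the standard deviation is large relative to the mean and the skewness is positive, which is the qualitative pattern the text describes.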

we monitor delay at various levels of EMMRA CC. We notice that the delay perceived at the business or governance level in our work is the sum total of the delay experienced at every level of EMMRA. This follows the relation established in (1). Table IV indicates that the SoS perspective of the delay metric is additive in nature, as discussed in Section III.

Monitoring and characterizing delay helps determine the necessary configuration for a particular type of application stack to be deployed on the cloud. The delay modeling in Section III is an attempt to achieve the same. The experimental evidence provided here supports the model defined earlier.

Although characterization of delay in a deployment is advantageous, delay is not constant in the system. The following section describes the variation in delay and its implications.

B. Variability in Delay

Although delay is an important performance metric, characterizing and monitoring delay alone is not sufficient in a system. Due to various factors, the delay in the system varies with time and with the operation being performed. For instance, when processing a data stream, the delay may vary based on the size and type of data being processed. In addition, there are various component-induced factors that affect delay in the system. As a simple example, consider Ethernet cards made by different manufacturers: although the cards have the same rating, the delay induced by these cards may differ on deployment owing to the raw material used or differences in manufacturing technologies.

Observing the delay variation over time (see Fig. 8) helps to identify a particular set of operations that cause maximum delay in the setup. For example, a regular pattern of spikes clearly indicates a set of operations that cause repeated delay in the system. Such operations can be isolated, and remedial actions can be suggested to the users to avert a QoS breach.

Another relevant variation is the difference between the observed delay and the delay computed using the model in (1) (without the correction factors $p_i$). This variation accounts for the component-induced performance degradation due to which delay is induced in the system. In Fig. 9, we notice this variation between the observed delay and the computed delay value. To account for this component-induced performance degradation, we need to include correction factors in delay modeling, and hence, it is apt to have $p_i$ at each level in EMMRA CC. It may be necessary to perform a calibration run to determine accurate values of $p_i$ in a particular setup.

C. Throughput

Throughput is an indicator of the overall system performance. A high-throughput system is desirable in the cloud. In our work, we measured throughput at the application and the database layers. The overall system throughput is represented in Fig. 10. Like delay, throughput varies in the system. Analyzing the throughput variation pattern gives us insight into the behavior of the deployment (physical and software components).

Our application comprises two phases: the first phase is independent of the database transactions, whereas the second phase needs three database transactions. We tabulated the findings independently and present them in Table V. Since all these operations are at the same layer in EMMRA CC, we notice that throughput is additive in nature. The factor in the
TABLE V
THROUGHPUT OF TEN SAMPLE TRANSACTIONS
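The additive view of delay and the calibration of per-level correction factors discussed under "Variability in Delay" can be sketched as follows. The exact form of (1) is not reproduced in this part of the paper, so the sketch assumes, purely for illustration, that each level's delay is scaled by a multiplicative correction factor before summation and that a calibration run yields one observed delay per level. The level names, function names, and all numeric values are hypothetical.

```python
def sos_delay(level_delays, corrections=None):
    """Sum per-level delays; optionally scale each by a correction factor.

    Assumption: correction factors are multiplicative, one per level.
    """
    if corrections is None:
        corrections = {level: 1.0 for level in level_delays}
    return sum(corrections[level] * delay for level, delay in level_delays.items())


def calibrate(nominal, observed):
    """Estimate one correction factor per level from a calibration run."""
    return {level: observed[level] / nominal[level] for level in nominal}


# Hypothetical per-level delays in milliseconds.
nominal = {"governance": 2.0, "business": 4.0, "software": 10.0, "infrastructure": 20.0}
observed = {"governance": 2.0, "business": 5.0, "software": 12.0, "infrastructure": 25.0}

factors = calibrate(nominal, observed)
computed = sos_delay(nominal)            # model without correction factors
corrected = sos_delay(nominal, factors)  # model with calibrated factors
```

In this toy setup, the gap between `computed` and the sum of the observed delays plays the role of the component-induced degradation, and applying the calibrated factors closes that gap by construction. A real calibration would need repeated runs, since the text notes that delay varies with time and operation.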

table represents the number of database transactions needed to complete phase 2 of our application. The total throughput shown in Table V follows the additive rule presented in Section III.

VI. CONCLUSION

The new approach presented in this paper enables cloud computing service providers and operations centers to meet committed customer QoS levels using a trusted QoS metric collection and analysis implementation scheme that extends traditional monitoring, management, and response for IaaS and SaaS to a complete SOA stack that includes business logic (BaaS) and governance (GaaS). This paper includes real-world scenarios that describe the applications of this approach to voice and data systems for performance metrics and to DDoS for security metrics. Next steps include simulating these scenarios to quantify the effectiveness of this approach with respect to operations center response time to restore QoS in the presence of anomalous enterprise events.

REFERENCES

[1] P. Mell and T. Grance, “The NIST definition of cloud computing,” Natl. Inst. Standards Technol. (NIST), U.S. Dept. of Commerce, Gaithersburg, MD, USA, NIST Special Publication 800-145, Sep. 2011. [Online]. Available: http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
[2] L. Badger, T. Grance, R. Patt-Corner, and J. Voas, “Cloud computing synopsis and recommendations,” Natl. Inst. Standards Technol. (NIST), U.S. Dept. of Commerce, Gaithersburg, MD, USA, NIST Special Publication 800-146, May 2012. [Online]. Available: http://csrc.nist.gov/publications/nistpubs/800-146/sp800-146.pdf
[3] M. F. Mithani, M. Salsburg, and S. Rao, “A decision support system for moving workloads to public clouds,” in Proc. Annu. Int. Conf. CCV, Singapore, May 2010, pp. 3–9.
[4] J. Spring, “Monitoring cloud computing by layer, Part 1,” IEEE Security Privacy, vol. 9, no. 2, pp. 66–68, Mar./Apr. 2011.
[5] J. Spring, “Monitoring cloud computing by layer, Part 2,” IEEE Security Privacy, vol. 9, no. 3, pp. 52–55, May/Jun. 2011.
[6] Y. Wei and M. Blake, “Service-oriented computing and cloud computing: Challenges and opportunities,” IEEE Internet Comput., vol. 14, no. 6, pp. 72–75, Nov./Dec. 2010.
[7] Q. Zhang, L. Cheng, and R. Boutaba, “Cloud computing: State-of-the-art and research challenges,” J. Internet Serv. Appl., vol. 1, no. 1, pp. 7–18, May 2010.
[8] R. Calheiros, R. Ranjan, and R. Buyya, “Virtual machine provisioning based on analytical performance and QoS in cloud computing environments,” in Proc. ICPP, 2011, pp. 295–304.
[9] B. Rochwerger, D. Breitgand, A. Epstein, D. Hadas, I. Loy, K. Nagin, J. Tordsson, C. Ragusa, M. Villari, S. Clayman, E. Levy, A. Maraschini, P. Massonet, H. Muñoz, and G. Tofetti, “Reservoir - When one cloud is not enough,” Computer, vol. 44, no. 3, pp. 44–51, Mar. 2011.
[10] T. Gwo-Hshiung, G. H. Tzeng, and J.-J. Huang, Multiple Attribute Decision Making: Methods and Applications. Boca Raton, FL, USA: CRC Press, Jun. 2011.
[11] L. Zeng, B. Benatallah, A. H. H. Ngu, M. Dumas, J. Kalagnanam, and H. Chang, “QoS-aware middleware for web services composition,” IEEE Trans. Softw. Eng., vol. 30, no. 5, pp. 311–327, May 2004.
[12] T. L. Saaty, Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. Pittsburgh, PA, USA: RWS Publications, 1990.
[13] A. S. Prasad and S. Rao, “A mechanism design approach to resource procurement in cloud computing,” IEEE Trans. Comput., vol. 63, no. 1, pp. 17–30, Jan. 2014.
[14] M. F. Mithani and S. Rao, “Improving resource allocation in multi-tier cloud systems,” in Proc. 6th Annu. IEEE Int. SysCon, Vancouver, BC, Canada, Mar. 2012, pp. 356–361.
[15] G. A. Lewis, E. Morris, P. Place, S. Simanta, and D. B. Smith, “Requirements engineering for systems of systems,” in Proc. 3rd Annu. IEEE Int. SysCon, Mar. 2009, pp. 247–252.
[16] S. M. White, “Modeling a system of systems to analyze requirements,” in Proc. 3rd Annu. IEEE Int. SysCon, Mar. 2009, pp. 83–89.
[17] Defense Acquisition Guidebook (DAG), Jan. 2012. [Online]. Available: https://dag.dau.mil/Pages/Default.aspx
[18] Systems Engineering Guide for Systems of Systems, Version 1.0, ser. OUSD (A & T) SSE, Aug. 2008.
[19] P. Hershey and D. Runyon, “SOA monitoring for enterprise computing systems,” in Proc. 11th Int. IEEE EDOC Conf., Oct. 2007, pp. 443–450.
[20] P. J. Denning and J. P. Buzen, “The operational analysis of queueing network models,” ACM Comput. Surv., vol. 10, no. 3, pp. 225–261, Sep. 1978.
[21] DoD instruction 5200.40: DoD Information Technology Security Certification and Accreditation Process, Dec. 1997. [Online]. Available: http://csrc.nist.gov/groups/SMA/fasp/documents/c&a/DLABSP/i520040p.pdf
[22] V. Emeakaroha, I. Brandic, M. Maurer, and S. Dustdar, “Low level metrics to high level SLAs - LoM2HiS framework: Bridging the gap between monitored metrics and SLA parameters in cloud environments,” in Proc. 2010 Int. Conf. HPCS, 2010, pp. 48–54.
[23] DoD instruction 5200.80: Security of DoD Installations and Resources, 2005. [Online]. Available: http://www.dtic.mil/whs/directives/corres/pdf/520008p.pdf
[24] IETF, “A one-way delay metric for IPPM,” Fremont, CA, USA, RFC2679. [Online]. Available: http://www.ietf.org/rfc/rfc2679.txt
[25] J.-Y. Girard, “Linear logic,” Theoretical Comput. Sci., vol. 50, no. 1, pp. 1–101, 1987.
[26] P. Hershey and C. Silio, “Procedure for detection of and response to distributed denial of service cyber attacks on complex enterprise systems,” in Proc. 6th Annu. IEEE Int. SysCon, Mar. 2012, pp. 85–90.
[27] Network Infrastructure Technology Overview. Version 8, Release 5, Apr. 27, 2012, (ArchiveEntry=U_Network_V8R5_Overview.pdf). [Online]. Available: http://goo.gl/Ja8Ajk
[28] Enclave Security Technical Implementation Guide, Jan. 3, 2011, (ArchiveEntry=U_Enclave_V4R3_STIG.pdf). [Online]. Available: http://goo.gl/Xhy4L2
[29] A. Iyengar, M. Squillante, and L. Zhang, “Analysis and characterization of large-scale web server access patterns and performance,” World Wide Web, vol. 2, no. 1/2, pp. 85–100, Jun. 1999.
Paul C. Hershey (S’80–M’84–SM’99) received the A.B. degree in mathematics from the College of William and Mary, Williamsburg, VA, USA, and the Ph.D. and M.S. degrees in electrical engineering from the University of Maryland, College Park, MD, USA.
He is currently a Senior Engineering Fellow (with honors) and the Chief Engineer of Global Hawk Ground Segment programs at Raytheon Intelligence, Information and Services, Dulles, VA. He is an Adjunct Professor with George Washington University, Washington, DC, USA, where he serves on the Curriculum Advisory Board. He has authored or coauthored more than 45 technical articles and has published 28 patents (issued) with six additional patents filed and pending. His current research focuses on real-time information collection and analysis, data to decision, autonomous systems, cloud computing, and cyber security.

Shrisha Rao (M’08–SM’13) received the M.S. degree in logic and computation from Carnegie Mellon University, Pittsburgh, PA, USA, and the Ph.D. degree in computer science from the University of Iowa, Iowa City, IA, USA.
He is currently an Associate Professor with the International Institute of Information Technology-Bangalore, Bangalore, India, a graduate school of information technology. His research interests are in distributed computing, specifically algorithms and approaches for concurrent and distributed systems, and include solar energy and microgrids, cloud computing, energy-aware computing (“green IT”), and demand-side resource management.
Dr. Rao is a member of the IEEE Computer Society, the Association for Computing Machinery, the American Mathematical Society, and the Computer Society of India.

Charles B. Silio, Jr. (S’62–M’72–SM’89–LSM’09) received the B.S.E.E., M.S.E.E., and Ph.D. degrees in electrical engineering from the University of Notre Dame, Notre Dame, IN, USA.
He is currently an Associate Professor of electrical and computer engineering with the University of Maryland, College Park, MD, USA. His research interests include performance evaluation and reliability of computer networks.
Prof. Silio is a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi and a retired Lieutenant Colonel of the U.S. Army. He served as an IEEE Computer Society Treasurer, chaired its technical committee on multiple-valued logic, and has been an NRC Research Associate at the Naval Postgraduate School and the Army Research Laboratory.

Akshay Narayan (S’13) received the M.Tech. degree in information technology from the International Institute of Information Technology-Bangalore, Bangalore, India. He is currently working toward the Ph.D. degree in the School of Computing, National University of Singapore, Singapore.
He is currently working on sequential decision making. His research interests include artificial intelligence and its application to cloud computing.
Mr. Narayan is a Student Member of the Association for Computing Machinery.
