Differentiated Availability in Cloud Computing SLAs

Astrid Undheim
Telenor ASA
Corporate Development
Trondheim, Norway
[email protected]

Ameen Chilwan and Poul Heegaard


Department of Telematics
Norwegian University of Science and Technology (NTNU)
Trondheim, Norway
[email protected], [email protected]

Abstract: Cloud computing is the new trend in service
delivery, and promises large cost savings and agility for the
customers. However, some challenges still remain to be solved
before widespread use can be seen. This is especially relevant
for enterprises, which currently lack the necessary assurance
for moving their critical data and applications to the cloud.
The cloud SLAs are simply not good enough.
This paper focuses on the availability attribute of a cloud
SLA, and develops a complete model for cloud data centers,
including the network. Different techniques for increasing the
availability in a virtualized system are investigated, quantifying
the resulting availability. The results show that depending on
the failure rates, different deployment scenarios and
fault-tolerance techniques can be used for achieving availability
differentiation. However, large differences can be seen from
using different priority levels for restarting of virtual machines.
Keywords: Availability, cloud, differentiation, SLA

I. INTRODUCTION
Cloud computing presents a new computing paradigm
that has attracted a lot of attention lately. It enables
on-demand access to a shared pool of highly scalable computing
resources that can be rapidly provisioned and released [1].
This is achieved by offering computing resources and services
from large data centers, where the physical resources
(servers, network, storage) are virtualized and offered as
services over a network.
A large part of cloud applications has so far been targeted
at consumers with low willingness to pay and low
expectations of the service QoS (dependability, performance
and security). Recently, more and more enterprises are also
investigating how to leverage cloud computing advantages
such as the pay-per-use model and rapid elasticity.
However, major challenges have to be faced before
enterprises will trust cloud providers with their core business
applications. These challenges are mainly related to QoS, in
our view covering dependability, performance and security,
and a comprehensive Service Level Agreement (SLA) is
needed to cover all these aspects. This is in contrast to the
insufficient SLAs offered today.
In this paper, we focus on dependability, and more specifically
on the availability attribute. Availability is defined in
[2] as the readiness for correct service, which can be
interpreted as the probability of providing service according
to defined requirements. In order to avoid costly down
times contributing to service unavailability, fault avoidance
and fault tolerance are used in the design of dependable
systems. Traditionally, fault tolerance has been implemented
in hardware, resulting in expensive systems, or using cluster
software, often very specific for each application [3]. In
cloud computing, the approach to fault tolerance has mainly
been to use cheap, off-the-shelf hardware, allowing failures
and then tolerating these in software. The reason for this is
partly the large size of cloud data centers, which means that
hardware will fail constantly. Adding hardware
resources to account for failures and letting the failover be
handled by software is then a more cost-effective approach
than using special-built hardware. Another advantage of
virtualization is that virtual instances can be migrated to
arbitrary physical machines, sharing redundant capacity
among a large number of Virtual Machines (VMs). The
standby resources needed are thus much smaller than for a
traditional system. However, also with virtualization, fault
tolerance means adding resources to the system,
which adds cost. Fault tolerance should therefore be targeted
to specific needs.
Differentiating with respect to fault-tolerance techniques
and physical deployment for different applications gives better
resource utilization, being cost effective for the provider
while still delivering service according to user expectations.
In particular, stateful applications require synchronized/updated
replicas for tolerating failures, while stateless
applications that tolerate short downtimes can be implemented
using non-updated replicas. In addition, adding
replicas at different physical locations increases the fault
tolerance, tolerating failures that may affect specific parts
of a cloud data center.
In related work, hardware reliability for cloud data centers
has been characterized in [4], and reliability/availability
models for cloud data centers are stated as important ongoing
work. The availability of a service running in VMs
on two physical hosts is modeled in [5], and a non-virtualized
system is compared with a virtualized system. A very simple
cloud computing availability model is used in [6], combined
with performance models to give Quality of Experience
(QoE) measures for online services. The authors state the
need for availability models for complete cloud data centers.
The main contribution of this paper is the effort of
modeling a complete cloud system, including the network
between the cloud data centers and the customers. The
availability resulting from deployment in different physical
locations can thus be studied with respect to different failure
rates both in the network and the cloud infrastructure itself.
In addition, these deployment options will be influenced by
management software failure rates, an important aspect to
include in the analysis.
The rest of this paper is organized as follows. In Section
II, we focus on the SLAs offered by commercial cloud
providers today and the missing pieces. In Section III,
principles for achieving fault tolerance are described, as well
as their application in cloud computing. An availability model
of a cloud service deployment is described in Section IV,
and the different types of failures are discussed. In Section
V, different scenarios for VM deployment are described.
Numerical results are presented in Section VI. Finally, we
conclude the paper in Section VII, together with some
thoughts on future work.
II. SLAs IN CLOUD COMPUTING
Cloud computing gives the customers less control of
the service delivery, and they need to take precautions in
order not to suffer low performance, long downtimes or
loss of critical data. Service Level Agreements (SLAs) have
therefore become an important part of the cloud service
delivery model. An SLA is a binding agreement between the
service provider and the service customer, used to specify
the level of service to be delivered as well as how measuring,
reporting and violation handling should be done. Today,
most of the major cloud service providers include QoS
guarantees in their SLA proposals, specified in a Service Level
Specification (SLS), as seen in Figure 1. The focus in most
cases is on dependability, measured as service availability,
usually covering a time period of a month or a whole
year. Credits are issued if the SLA is violated, e.g. the
Amazon EC2 SLA includes an annual uptime percentage
of 99.99% and issues 10% service credits (footnote 1).
Figure 1: The structure of an SLA. The Service Level Agreement
(SLA) covers measuring, reporting and violation handling, and
contains the Service Level Specification (SLS) with its SLS
parameters and SLS thresholds.
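To make such uptime guarantees concrete, an availability percentage can be translated into the downtime it permits over the measurement period. The following is a minimal sketch of that arithmetic; the function name and the 30-day month are illustrative assumptions and are not taken from any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24      # non-leap year
HOURS_PER_MONTH = 30 * 24      # assumed 30-day month


def allowed_downtime_minutes(availability: float, period_hours: float) -> float:
    """Return the downtime (in minutes) that a given availability permits."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours * 60.0


# An annual uptime percentage of 99.99% leaves under an hour per year.
annual = allowed_downtime_minutes(0.9999, HOURS_PER_YEAR)
# A 99.9% target measured over an assumed 30-day month.
monthly = allowed_downtime_minutes(0.999, HOURS_PER_MONTH)
print(f"99.99% per year -> {annual:.2f} minutes")
print(f"99.9% per month -> {monthly:.2f} minutes")
```

The length of the measurement period matters: the same percentage measured per month instead of per year allows far less downtime in any single incident.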


1 http://aws.amazon.com/ec2-sla/

Recently, we have seen many examples where cloud
services have been unavailable for the customer, but where
the unavailability has not been covered by the SLA (footnote 2). One
reason is that the cloud SLAs are not specific enough when
defining availability. From the customer's point of view this is
a major drawback. A response time above
a certain threshold will be perceived by the customer as
service unavailability and should be credited accordingly.
This issue is covered in [7], where the throughput of a
load-balanced application is studied under failure events.
It is clear that the availability parameter alone is not enough
to ensure a satisfactory service delivery.
The on-demand characteristic of cloud computing is one
aspect that complicates the QoS provisioning and SLA
management. The cloud infrastructure needs to adjust to
changing user demands, resource conditions and environmental
issues. Hence, the cloud management system needs
to automatically allocate resources to match the SLAs and
also detect possible violations and take appropriate action
in order to avoid paying credits. Several challenges for
autonomic SLA management still remain. First, resources
need to be allocated according to a given SLA. Next,
measurements and monitoring are needed to detect possible
violations and react accordingly, e.g., by allocating more
resources. For availability violations this may require adding
more standby resources to handle a given number of failures,
and for performance violations this may require moving a
VM to another physical machine if the current machine
is overloaded. All these actions require a mapping between
low-level resource metrics and high-level SLA parameters.
One proposal on how to do this mapping is given in [8],
where the amount of allocated resources is adjusted on
the fly to avoid an SLA violation. Another approach,
dynamic resource allocation using a feedback control system,
is proposed in [9]. Here, the allocation of physical resources
(e.g. physical CPU, memory and input/output) to VMs is
adjusted based on measured performance.
With deployment of widely different services in the cloud,
there is clearly a need for cloud providers to offer differentiated
SLAs with respect to dependability, performance and
security. Core business functions such as production systems
and billing need a higher availability than applications
targeted at consumers, such as email and document handling.
Also, different user groups may have different requirements.
One example is Gmail, where the SLAs for email services
for consumers and business users are differentiated, offering
the business users an availability of 99.9% at a fixed price,
while consumers have a free offering without any SLA (footnote 3).
III. FAULT TOLERANCE IN CLOUD COMPUTING
2 http://cloudcomputingfuture.wordpress.com/2011/04/24/why-amazonscloud-computing-outage-didnt-violate-its-sla
3 http://www.google.com/apps/intl/en/business/features.html

In the design of dependable systems, a combination of
fault avoidance (called fault prevention in [2]) and fault
tolerance is used to increase availability. Fault avoidance
aims at avoiding faults being introduced, through use of
better components (e.g. SSD instead of HDD), debugging
of software or protecting the system against environmental
faults. Fault tolerance is often used in addition to fault
avoidance, allowing a fault to lead to an error but preventing
errors from leading to service failure. Fault tolerance thus uses
redundancy in order to remove or compensate for errors.
This section gives a short overview of general fault tolerance
techniques used in the design of dependable communication
systems, and then looks at how fault tolerance is achieved
in virtualized environments.
A. Fault Tolerance Principles
Cloud infrastructure is built using off-the-shelf hardware,
and standby redundancy is the preferred fault tolerance
technique. With standby redundancy, there are two or more
replicas of the system. Only the active replica will produce
results to be presented to the receiver, while the standby
replicas are ready to take over should the active replica
fail. Hot and cold standbys are possible. Hot standbys are
powered standbys, capable of taking over service execution
with no downtime (as long as the state is updated). Cold
standbys are non-powered and need some time to be started
in case of failure in the active replica.
Different levels of synchronization are possible for the hot
standbys (updated/not-updated), and the backup resources
can be dedicated or shared, for both the hot and cold
standbys. This gives the overall classification shown in
Figure 2.
Figure 2: Standby redundancy classification. Hot standbys are
either updated (dedicated, e.g. VMware FT, or shared, e.g. Remus)
or not updated; cold standbys (e.g. VMware HA) are either
dedicated or shared.


The choice between hot and cold standbys will decide the
service restoration time, but more importantly the choice
should depend upon the application's need for an updated
state space, as described in the next section.
B. Fault Tolerance in Cloud Computing
Cloud computing uses virtualization of computing resources
made available as VMs, virtual storage and virtual
networks. We concentrate here on computation services and
the use of VMs. In this case, backup is made easy with
virtualization, since the virtual image contains everything
that is needed to run the application and can be transparently
migrated between physical machines. One of the downsides
of virtualization, though, is that one single hardware fault
in a physical server can affect several VMs and hence
many applications. Replicas of the same application must
therefore always be deployed on different physical machines.
The standby resources must also be dimensioned to handle
the high number of failed VMs in case of a physical server
failure.
Virtualization facilitates live migration of VMs, where a
running VM instance can be transferred between physical
machines. Live migration has been implemented both for the
Xen hypervisor [10] and for VMware with its VMotion [11]
and ensures zero downtime in case of planned migrations
due to resource optimization or planned maintenance. In
case of failures in the physical host running the VM, live
migration is not possible. The configuration file or the
VM image should then be available on possible new host
machines in order to restart the application. In addition, the
configuration file should be stored at a centralized location
should all replicas fail. How this is performed depends
on the type of standby redundancy, as described next.
1) Hot Standby: For stateful applications, state must be
stored on the standby virtual machine in order to allow
failover. In traditional fault-tolerance terminology, this
requires the use of updated hot standbys. Different levels of
updating/synchronization between the active and standby
replica are possible; either the input is evaluated at each
replica, or the state information is transferred at specified
checkpoints. The former method will then consume more
compute resources than the latter, and is denoted dedicated
in our classification (Figure 2). The latter method allows
many replicas to share backup resources and is hence
denoted shared.
Examples of hot standby techniques are VMware's Fault
Tolerance [3] and Remus for the Xen hypervisor [12], as
seen in Figure 2. VMware Fault Tolerance is designed for
mission-critical workloads, using a technique called virtual
Lockstep, and ensures no data or state loss, including all
active network connections etc. Both active and standby
replicas execute all instructions, but the output from the
standby replica is suppressed by the hypervisor. The
hypervisor thus hides the complexity from both the application
and the underlying hardware. This scheme is classified as
updated and dedicated since the standby replica is fully
synchronized and consumes resources equal to the active
replica.
In Remus, fault tolerance is achieved by transmitting state
information to the standby VM at frequent checkpoints,
and buffering intermediate inputs between checkpoints. The
standby can hence be up and running with a complete state
space in case of failures, with only a short downtime needed
for catching up with the input buffer. The standby is not executing
any inputs, which means that fewer resources are consumed
compared to VMware Fault Tolerance, and a short downtime
and loss of ongoing transactions is experienced in case of
failure of the active replica. This scheme is classified as
updated and shared since the standby only consumes a small
amount of resources compared to the active replica.
Hot standbys can be used for both stateless and stateful
applications, but since all replicas consume resources they
are most often used for stateful applications. For stateless
services, the not-updated hot standby is a possibility if high
availability is important.
2) Cold Standby: The cold standby solution requires
fewer resources and should in general be used for stateless
applications that allow short downtimes. The same is true in
cloud computing. In addition, a virtualized environment adds
functionality that is valuable for fault tolerance.
Virtualization facilitates the running of different VMs on
top of the same hardware, and the standby resources can
be shared by different VMs, reducing the total resource
needs. Dedicated standby resources are still possible for cold
standbys, and should be used for stateless applications with
high availability requirements. In practice, the dedicated
solution can be implemented by prioritizing the restart of a
standby VM in case of a failure. The low priority VMs may
then experience a longer downtime, and possibly migration
to a different part of the cloud.
VMware High Availability (HA) is one example of the
use of cold standbys and supports both dedicated and shared
resource usage, i.e., by allowing for different priority levels
when restarting failed VMs.
With this simple classification, we end up with four
different service levels as seen in Figure 3, where the choice
between updated and not-updated hot standby is strictly a
choice on the state preservation, while choosing between
cold shared, cold dedicated and hot not-updated will give
different availabilities. Next, the physical deployment of the
standby resources may influence the resulting availability.
These principles can lay the foundation for offering
differentiated availability levels in cloud SLAs.
Figure 3: Classification of fault-tolerance techniques according
to state and availability. State axis: hot updated (shared and
dedicated). Availability axis: cold shared, cold dedicated and
hot not updated.
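The classification can be read as a small decision rule: statefulness decides whether an updated hot standby is needed, and the availability requirement separates the stateless options. The sketch below is one hypothetical encoding of that rule; the enum, the function and the level names "basic", "high" and "highest" are introduced here for illustration only.

```python
from enum import Enum


class Standby(Enum):
    HOT_UPDATED_DEDICATED = "hot, updated, dedicated"  # e.g. VMware FT
    HOT_UPDATED_SHARED = "hot, updated, shared"        # e.g. Remus
    HOT_NOT_UPDATED = "hot, not updated"
    COLD_DEDICATED = "cold, dedicated"  # e.g. VMware HA, high restart priority
    COLD_SHARED = "cold, shared"        # e.g. VMware HA, low restart priority


def choose_standby(stateful: bool, availability: str = "basic",
                   tolerates_short_state_loss: bool = False) -> Standby:
    """Pick a standby technique; the availability level names are hypothetical."""
    if stateful:
        # State must survive a failover, so only updated hot standbys qualify.
        if tolerates_short_state_loss:
            return Standby.HOT_UPDATED_SHARED
        return Standby.HOT_UPDATED_DEDICATED
    # Stateless services are differentiated on availability alone.
    levels = {
        "basic": Standby.COLD_SHARED,
        "high": Standby.COLD_DEDICATED,
        "highest": Standby.HOT_NOT_UPDATED,
    }
    if availability not in levels:
        raise ValueError(f"unknown availability level: {availability!r}")
    return levels[availability]


print(choose_standby(stateful=True).value)
print(choose_standby(stateful=False, availability="high").value)
```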


IV. CLOUD AVAILABILITY MODEL
A. High Level Model
A simplified model of a cloud system is developed. Each
cloud provider will typically have two or more data centers
at different physical locations, connected to the customers
via the Internet. Following [13], we model the data centers
with racks of servers that are organized into clusters. Each
cluster shares some infrastructure elements such as power
distribution elements and network switches. The overall
network architecture is simplified (inspired by [14]), and
consists of two levels of switches (L1 and L2) in addition to
the gateway routers. The overall model is then as shown in
Figure 4.
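Figure 4: High level cloud model. Each server consists of
hardware (HW), a VMM and VMs; the servers of a cluster share
PDUs and an L1 switch; the clusters of a data center share L2
switches, the COL and PWR blocks and a gateway (GW1, GW2); the
two data centers of the cloud provider reach the customer over
the Internet.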


B. Failure Classification
From the high level model, we focus on four different
types of failures, namely failures in the power distribution/cooling, network failures, management software failures
and server failures. These are described next.
1) Power Failures: An overview of power distribution in
data centers can be found in [13]. In general, the power
supply to the data center is from the utility power network.
The Uninterruptible Power Supply (UPS) unit will distribute
the power to the data center, and also handle switching
from utility to generator and provide backup batteries
should there be a utility power failure. We cannot assume
perfect failover, and a Markov model is needed to model the
complexity of the power supply. Since the power supply is
not the main focus in this paper, we choose to use the availability
numbers documented in [15].
Within the data center, each cluster will be connected
to a (duplicated) Power Distribution Unit (PDU), which is
connected to the central power supply over a power bus. A
failure in the distribution system is assumed to only affect
one cluster. These failures are independent of failures in
the power supply, and the two parts can be modeled in a
series structure as seen in Figure 5.
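As a worked example of this series structure, the power availability seen by one cluster can be written as the central power supply in series with the duplicated PDU. The sketch assumes that the two PDUs fail independently and that either one can carry the cluster alone; the values are those listed for the power and PDU blocks in Table II, and the symbol for the combined block is introduced here only for illustration.

```latex
A_{\mathrm{pwr,cluster}}
  = A_{\mathrm{power}} \left[ 1 - \left(1 - A_{\mathrm{PDU}}\right)^{2} \right]
  = 0.9975 \left[ 1 - (0.0008)^{2} \right]
  \approx 0.9975
```

Under these assumptions the duplicated PDU contributes almost nothing to the unavailability, so the central power supply dominates the block.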
2) Network Failures: The cloud services are accessed
over the Internet, and the high level structure can be seen in
Figure 4. In addition, we model the data center internal network in
two levels [13]. First, there is one (duplicated) level 1 switch
connecting all servers in one cluster. Next, there is one
(duplicated) level 2 switch connecting all level 1 switches
from all clusters. These are again connected to the WAN
gateways, of which there are also two since we assume the
cloud provider to be multi-homed to two independent ISPs.
The resulting structure model, including the core Internet
and the user access network, is seen in Figure 6. Note
that we assume common core Internet failures for different
data centers of the same cloud provider; this will typically
depend on the physical location of the data centers.

Figure 5: Power model. The data center power/cooling block is
followed by the two PDUs of the cluster (PDU 1 and PDU 2).
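The blocks of the network model are the duplicated L1 switches
(L1-A, L1-B) in the cluster, the duplicated L2 switches (L2-A,
L2-B), the gateways (GW1, GW2) and the blocks W1 and W2 in the
data center, the Internet and the user access (UA).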
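A structure model of this kind can be evaluated with two reliability block diagram primitives, series and parallel. The sketch below uses the block values from Table II; the exact wiring (pairwise duplicated switches, followed by two alternative gateway paths in which each router is in series with an access link) is an assumption made for illustration, since only the block names of the figure are available here.

```python
from functools import reduce


def series(*blocks: float) -> float:
    """All blocks must work: multiply the availabilities."""
    return reduce(lambda acc, a: acc * a, blocks, 1.0)


def parallel(*blocks: float) -> float:
    """At least one block must work: 1 minus the product of unavailabilities."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), blocks, 1.0)


# Block availabilities from Table II.
A_SWITCH, A_ROUTER, A_ACCESS = 0.97986, 0.99966, 0.989
A_CORE, A_USER = 0.999, 0.99

# Assumed wiring: each gateway router in series with its access link.
gateway_path = series(A_ROUTER, A_ACCESS)
data_center_network = series(
    parallel(A_SWITCH, A_SWITCH),          # L1-A / L1-B
    parallel(A_SWITCH, A_SWITCH),          # L2-A / L2-B
    parallel(gateway_path, gateway_path),  # GW1-W1 / GW2-W2
)
end_to_end = series(data_center_network, A_CORE, A_USER)
print(f"data center network: {data_center_network:.5f}")
print(f"end to end:          {end_to_end:.5f}")
```

Because the duplicated blocks are in parallel, the non-duplicated core Internet and user access blocks dominate the end-to-end figure in this sketch.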

In the model of the VMware FT scheme [3], two identical VMs
run on two different physical servers, always within
the same cluster. The replicas receive the same input and
perform the same operations, but only the active VM delivers
services. In case of a failure in the active replica, the
hypervisor will immediately detect the failure and switch
to the standby replica, which is ready to perform service
without any delay or loss of data. The cluster management
software will then deploy a new standby VM. With the
failure of the standby VM, the management software will
likewise deploy a new standby VM. This means that in a
dependability context, it does not matter which VM
fails. This setup will always tolerate one failure. However, it
may happen that the resources are exhausted when trying to
deploy a new standby VM, in which case the service will
fail with the next failure. The resulting model is shown in
Figure 8. Its parameters are the failure rate of the server (including
hardware, software and operational failures), the restart
rate of a new VM, and the coverage factor c, i.e., the
probability that a restart is successful. We assume here that
the resources are dimensioned so that there will always be
enough resources for restarting a hot standby, since these
should host the highest priority applications.
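A Markov model of this kind is evaluated by solving the steady-state balance equations and summing the probabilities of the states in which the service is up. The sketch below does this for a three-state chain (both OK, one down, both down) with the failure rate, restart rate and coverage from Table I. The transition structure is an assumption made for illustration: a failed restart is simply retried, so the effective restart rate is the coverage times the restart rate. It is not claimed to reproduce the transitions of Figure 8 or the numbers in Table III.

```python
def steady_state(rates, n):
    """Solve pi Q = 0 with sum(pi) = 1 for an n-state CTMC.

    rates maps (from_state, to_state) to a transition rate.
    """
    # Build the transposed generator; the last balance equation is
    # replaced by the normalization condition.
    a = [[0.0] * n for _ in range(n)]
    for (i, j), r in rates.items():
        a[j][i] += r   # inflow into j from i
        a[i][i] -= r   # outflow from i
    a[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
                b[r] -= f * b[col]
    return [b[i] / a[i][i] for i in range(n)]


# Values from Table I; the constant names are illustrative.
FAILURE_RATE = 0.00722   # per hour
RESTART_RATE = 2.0       # per hour
COVERAGE = 0.95

# Assumed states: 0 = both OK, 1 = one down, 2 = both down.
hot_dedicated = {
    (0, 1): 2 * FAILURE_RATE,          # either replica fails
    (1, 0): COVERAGE * RESTART_RATE,   # successful restart of a standby
    (1, 2): FAILURE_RATE,              # remaining replica fails
    (2, 1): COVERAGE * RESTART_RATE,   # successful restart of one replica
}
pi = steady_state(hot_dedicated, 3)
availability = pi[0] + pi[1]           # service is up unless both are down
print(f"server availability (assumed model): {availability:.6f}")
```

The solver is generic, so a different transition structure can be evaluated by changing only the dictionary of rates.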
Figure 6: Network model


3) Management Software Failures: Cloud computing requires
extensive management systems, which are complex
software systems and are exposed to failures. Depending
on what level of management software these failures
affect, a cluster (VM Management), the whole data center
(Virtual Infrastructure Management) or the whole cloud
(Cloud Management) can be affected. The resulting model
is shown in Figure 7, where we assume that these software
failures are independent.
Figure 7: Management software model, with one block per level:
VM management (cluster), VI management (data center) and cloud
management (cloud).


4) Server Failures: The server models include failures
from hardware, software and operation. However, application
software failures that will take all replicas down are
excluded. We choose to study the schemes that are currently
deployed in commercial products, i.e., the VMware FT,
Remus, and VMware HA with two different priority levels.
Hot Standby, Updated, Dedicated: The hot standby
option with dedicated, updated standbys provides the highest
availability and the most updated state, corresponding to
the VMware FT scheme [3].

Figure 8: Updated, hot stand-by with dedicated backup resources.
States: 1 (both OK), 2 (one down) and 3 (both down).
Hot Standby, Updated, Shared: The hot standby option
with shared updated standbys is different from the dedicated
option in that the state information is transmitted at regular
intervals instead of running the replicas in a synchronized
fashion. This scheme corresponds to the Remus scheme for
Xen [12], and as seen above this scheme experiences a short
downtime and loss of data in case of failures. This also
means that it matters which replica fails, since failure of the
active replica will cause a short downtime. The model will
therefore be different from that of the dedicated standby.
Since the standbys share the backup resources, there will
be a non-zero probability that the standby will not have
enough resources to start in case of failure of the primary.
We assume here that the overall load on the cluster is
dimensioned such that there will always be enough resources.
The resulting model is shown in Figure 9, where the
parameters are the same as for the updated, dedicated model.
However, one additional rate parameter is introduced, whose
reciprocal is the time needed to switch to the standby replica in
case of failure in the active replica. It is then clear that
when this time is short enough, this model will be equal to
the previous model.
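The effect of the switchover time can be gauged with a first-order estimate. The symbols below are notation assumed for this sketch: lambda is the server failure rate and delta is the switchover rate, taken here to be the standby update rate of Table I. If only failures of the active replica cause a switchover, the fraction of time spent switching is roughly:

```latex
U_{\mathrm{switch}} \approx \frac{\lambda}{\delta}
  = \frac{0.00722\ \mathrm{hr}^{-1}}{60\ \mathrm{hr}^{-1}}
  \approx 1.2 \times 10^{-4}
```

This term vanishes as the switchover rate grows, which is the sense in which the shared model approaches the dedicated one.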

We look at different deployment options and the resulting
dependability when replicas are located in the same cluster,
in different clusters of a data center or even in different
data centers. The latter two deployment options provide
tolerance also towards power, network and management
software failures.

Figure 9: Hot stand-by with shared backup resources. States:
1 (both OK), 2 (active down), 3 (standby down) and 4 (both down).


Cold Standby: For the cold standby setup, no state information
is retained in the standby and the standby is simply
restarted in case of a failure. This corresponds to VMware HA
[16], where heartbeats are used to detect failures in the active
replica and restart the VM. This restart usually takes some
time, during which the service is not available. Also, the
backup resources may be shared between different VMs,
and are usually dimensioned to allow for a specific number
of physical server failures. If the resources are exhausted,
VMs cannot be restarted in case of a failure in the active
replica.
Here we look at two different classes, both with shared
backup resources, but where the high priority class has
preemptive priority over the low priority class. Given that the
resources are properly dimensioned, the high priority class
will experience having dedicated standby resources.
The resulting models are shown in Figure 10. One
additional parameter is the preemption rate from a higher
priority application, and we introduce the parameter p,
which is the probability that there are enough resources for
restarting the replica. This is different from the previous
models since we can no longer assume that the resources are
dimensioned to handle these restarts for the lowest priority
applications. Also, we introduce a further parameter, the
rate at which the management system adds more resources
to the cluster if the resources are exhausted.
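The differentiation between the two priority classes can be illustrated with a simple up/down approximation instead of the full Markov models. The sketch below is an assumption made for illustration and not the model of Figure 10: the VM goes down at the failure rate plus the preemption rate, and the mean downtime is the mean time to a successful restart, extended by the expected wait for a cluster expansion when resources are exhausted. The default values are those of Table I; all function and parameter names are hypothetical.

```python
def cold_standby_availability(failure_rate: float, restart_rate: float,
                              coverage: float, preemption_rate: float = 0.0,
                              p_resources: float = 1.0,
                              expansion_rate: float = 0.0) -> float:
    """Approximate availability as mean uptime / (mean uptime + mean downtime)."""
    if p_resources < 1.0 and expansion_rate <= 0.0:
        raise ValueError("expansion_rate is required when p_resources < 1")
    # Failed restarts are retried, so the effective restart rate is c * mu.
    mean_down = 1.0 / (coverage * restart_rate)
    if p_resources < 1.0:
        # With probability 1 - p the restart first waits for more resources.
        mean_down += (1.0 - p_resources) / expansion_rate
    down_event_rate = failure_rate + preemption_rate
    return 1.0 / (1.0 + down_event_rate * mean_down)


# High priority: never preempted, resources assumed always available.
high = cold_standby_availability(0.00722, 2.0, 0.95)
# Low priority: preempted at 0.05 per hour, resources available with p = 0.99.
low = cold_standby_availability(0.00722, 2.0, 0.95, preemption_rate=0.05,
                                p_resources=0.99, expansion_rate=6.0)
print(f"high priority: {high:.5f}")
print(f"low priority:  {low:.5f}")
```

In this approximation the gap between the classes grows with the preemption rate, which is the differentiation effect that priority levels are meant to provide.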

A. Same Cluster
The easiest deployment is to place all replicas in the
same cluster. This means low network latency in upgrading
replicas etc., but it also means that power, management
software and network failures may lead to unavailability
of all replicas and thus the service. The resulting model is
shown in Figure 11.
Figure 11: Deployment in the same cluster. Blocks: Mngmt, Power,
Server and Network.


B. Same Data Center
Next, replicas are placed in two different clusters, but in
the same data center. The cluster block will then incorporate
the cluster part of the power, management software and
network blocks as well as the server block, all in series.
The server block will then include the Markov model from
the respective fault tolerance technique. The Mngmt, Power
and Network blocks will likewise exclude the cluster part
shown in Figures 5-7. The resulting model is shown in
Figure 12.
Figure 12: Deployment in the same data center. Blocks: Cluster A
and Cluster B as alternatives, together with the Mngmt, Network
and Power blocks.
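The difference between the first two deployments can be written out symbolically. The sketch assumes that all blocks are independent, that the two cluster blocks are identical and in parallel, and that the primed blocks denote the data center and cloud level parts that remain once the cluster part has been moved into the cluster block; the primed notation and the subscripts I and II are introduced here for illustration.

```latex
A_{\mathrm{I}} = A_{\mathrm{mngmt}} \, A_{\mathrm{power}} \, A_{\mathrm{server}} \, A_{\mathrm{network}}

A_{\mathrm{II}} = A'_{\mathrm{mngmt}} \, A'_{\mathrm{network}} \, A'_{\mathrm{power}}
  \left[ 1 - \left(1 - A_{\mathrm{cluster}}\right)^{2} \right]
```

Only the failures that are local to a cluster gain from the parallel term; everything that stays in the primed series blocks limits the total availability as before.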

Figure 10: Cold standby with high and low priority. Panel (a),
high priority, has the states OK and VM down; panel (b), low
priority, has the states OK, VM down and Queue.

C. Same Cloud Provider
The final option is to deploy replicas in two different
data centers. The DC block will then include the whole
power block, as well as the cluster and data center part of
the network and management software blocks. The resulting
model is shown in Figure 13.

Figure 13: Deployment with the same cloud provider. Blocks: DC A
and DC B as alternatives, together with the Mngmt and Network
blocks.


V. LOCATION OF REPLICAS
The power, network, management software and server
failures are assumed to be independent, which means that
reliability block diagrams can be used to model the system
availability, where individual blocks (here the power
block and the server block) are detailed using Markov models.

VI. NUMERICAL RESULTS


The input parameters for the server models are listed in
Table I. These are mostly collected from [5]. The last three
parameters are guessed, and will typically depend
on the load of the system (the preemption rate and the
probability p) and on the operational aspects of the data center
(the cluster expansion rate).

Table I: Parameter values for the server model

Name (symbol)               Value            Source
VM Failure Rate             0.00722 hr^-1    [5]
VM Restart Rate             2.0 hr^-1        [5]
Standby Update Rate         60 hr^-1         [12]
VM Restart Coverage (c)     0.95             [5]
Preemption Rate             0.05 hr^-1       Guessed
Exhausted probability (p)   0.99             Guessed
Cluster Expansion Rate      6.0 hr^-1        Guessed

Table II: Availability values for the high level model

Name                   Parameter   Value     Source
Power                  A_power     0.9975    [15]
PDU                    A_PDU       0.9992    [17]
Management software    A_mngmt     0.999     Guessed
Switches               A_switch    0.97986   [14]
Router                 A_router    0.99966   [17]
Access Network         A_access    0.989     [18]
Core Network           A_core      0.999     [13]
User Access            A_user      0.99      [13]

Table III: Availability results for the different deployment
scenarios and fault-tolerance techniques

Scenario   VM Fault Tolerance      Cloud A    Netw A    Tot A
I          Updated Dedicated Hot   0.99354    0.98901   0.98262
I          Updated Shared Hot      0.99342    0.98901   0.98262
I          Shared Cold (HP)        0.98981    0.98901   0.97893
I          Shared Cold (LP)        0.95385    0.98901   0.94336
II         Updated Dedicated Hot   0.994972   0.98901   0.98403
II         Updated Shared Hot      0.99497    0.98901   0.98403
II         Shared Cold (HP)        0.99494    0.98901   0.98401
II         Shared Cold (LP)        0.98311    0.98901   0.97230
III        Updated Dedicated Hot   0.997971   0.99      0.98799
III        Updated Shared Hot      0.99797    0.99      0.98799
III        Shared Cold (HP)        0.99791    0.99      0.98799
III        Shared Cold (LP)        0.98683    0.99      0.97697
The availability values for the different blocks in the high
level model are listed in Table II.
The resulting availability for the different deployment
scenarios (I: same cluster, II: same data center, III: same cloud
provider) and fault tolerance techniques is shown in Table
III. The availability is increased when replica VMs are
deployed in different clusters (scenario II) and data centers
(scenario III), but the effect is small. The
difference between the hot and cold standby techniques is
more prominent, at least for the scenario with all replicas
in one cluster (scenario I). For the hot standbys, there are
small differences between the dedicated and shared standbys.
However, with shared backup resources, the availability will
decrease when the load increases. We also see that the cold
standby with high priority gives nearly the same availability as
the hot standby solutions. However, only the latter provide
updated state, and the cold standby option is only possible
for stateless applications.
The network part (Internet and user access) is separated
in order to see the effect of the network availability on the
total availability. For scenario III (same cloud provider), the
different data centers are accessed using disjoint networks,
resulting in a higher availability. This influence of
the network availability on the resulting end-to-end cloud
service availability is a topic for future study.
Next, we look at the updated, dedicated hot standby
scenario with different failure rates in the power and management software. The results are shown in Figure 14 and
show that scenario III, i.e. using different data centers, is
superior to the less distributed scenarios
when the availability of the management software is low.
The same is true for the power part.
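The intuition behind this result can be sketched with a simplified reliability block diagram. This is not the model used in the paper: it assumes two replicas with independent failures, treats power and management software as single blocks in series with the VMs, and all numeric values are hypothetical. In the same data center, both replicas share one power and management block; in different data centers, each replica has its own.

```python
import math


def series(*availabilities: float) -> float:
    """All blocks must be up: product of the availabilities."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one block must be up: one minus the product of unavailabilities."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def same_data_center(a_vm: float, a_power: float, a_mngt: float) -> float:
    # Two replicas behind one shared power and management block.
    return series(a_power, a_mngt, parallel(a_vm, a_vm))


def different_data_centers(a_vm: float, a_power: float, a_mngt: float) -> float:
    # Each replica has its own power and management block.
    site = series(a_power, a_mngt, a_vm)
    return parallel(site, site)


A_VM, A_POWER = 0.99, 0.9995  # hypothetical values
for a_mngt in (0.9990, 0.9994, 0.9998):
    shared = same_data_center(A_VM, A_POWER, a_mngt)
    split = different_data_centers(A_VM, A_POWER, a_mngt)
    print(f"Amngt = {a_mngt:.4f}  same DC = {shared:.6f}  different DCs = {split:.6f}")
```

In this toy model the shared block caps the availability of the single data center deployment, so the advantage of spreading replicas grows as the management software availability drops.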

[Figure 14(b): Power system availability. Aservice versus Apower (0.9975 to 0.9995) for Scenarios I, II and III.]

Figure 14: Availability of updated, dedicated hot standbys for different deployment scenarios

Finally, the availability of the cold standby with high and
low priority is plotted against the preemption rate in Figure
15. The preemption rate depends on the load in the
system. With a preemption rate of zero, the high and
low priority techniques give equal availability, but for higher
preemption rates the high priority technique is superior. Hence, using different
priority levels and allowing for preemption gives a clear
differentiation effect when the load increases.
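The qualitative behaviour can be reproduced with a toy two-state Markov model. This is an assumption made for illustration and not the server model of the paper: a VM alternates between up and down, a high priority VM leaves the up state only on failure, and a low priority VM additionally leaves it when preempted. The rates and the function name are hypothetical.

```python
def steady_state_availability(failure_rate: float,
                              restart_rate: float,
                              preemption_rate: float = 0.0) -> float:
    """Steady-state probability of the up state in a two-state up/down chain.

    The chain leaves 'up' at rate failure_rate + preemption_rate and
    returns from 'down' at rate restart_rate.
    """
    interruption_rate = failure_rate + preemption_rate
    return restart_rate / (interruption_rate + restart_rate)


FAILURE_RATE = 0.01  # hypothetical failures per hour
RESTART_RATE = 2.0   # hypothetical restarts per hour

high = steady_state_availability(FAILURE_RATE, RESTART_RATE)
for preemption_rate in (0.00, 0.05, 0.10, 0.15, 0.20):
    low = steady_state_availability(FAILURE_RATE, RESTART_RATE, preemption_rate)
    print(f"preemption = {preemption_rate:.2f}  high = {high:.4f}  low = {low:.4f}")
```

With a preemption rate of zero the two priority levels coincide, and the low priority availability falls monotonically as the preemption rate grows, which is the differentiation effect described above.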
[Figure 15 plot: Aserver (0.90 to 1.00) for the High Priority Service and the Low Priority Service, x-axis from 0.00 to 0.20.]

Figure 15: Availability for high and low priority cold standbys with increasing preemption rate

VII. CONCLUSIONS AND FUTURE WORK
SLAs have received a lot of attention in cloud computing,
and availability in particular is covered by public cloud SLAs.
However, some important improvements remain to be
made. First, the SLAs must become more detailed with
respect to the actual KPIs used to define availability. Next, in
order to also deploy important enterprise services in clouds,
different levels of availability should be offered, depending
on the actual user requirements. Finally, the SLAs should
be available on demand, which also means that they should
be adjustable on demand.
This paper has proposed an overall availability model
for a cloud system, including the network. We have shown
how deploying replicas in different physical locations affects
the resulting availability, and also how different applications
need different fault tolerance schemes. These are two possible dimensions for differentiating cloud applications.
Future work includes modeling more complex services,
e.g. a tiered web service. Also, the server models should
be made more detailed, taking into account the characteristics
of the different failures and repairs. We have discussed the
need for well-defined KPIs for availability; the next step
is to also include performance measures in the availability
models. Finally, the network availability strongly influences
the total availability of a cloud service, and should ideally
be included in the cloud service SLA.

REFERENCES

[1] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," v.15, 2009. [Online]. Available: http://csrc.nist.gov/groups/SNS/cloud-computing/clouddef-v15.doc
[2] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan. 2004.
[3] VMware White Paper, "Protecting Mission-Critical Workloads with VMware Fault Tolerance," 2009.
[4] K. V. Vishwanath and N. Nagappan, "Characterizing Cloud Computing Hardware Reliability," in Proceedings of the ACM Symposium on Cloud Computing (SOCC), 2010.
[5] D. S. Kim, F. Machida, and K. S. Trivedi, "Availability Modeling and Analysis of a Virtualized System," in Proceedings of the 15th IEEE Pacific Rim International Symposium on Dependable Computing, Nov. 2009, pp. 365-371.
[6] H. Qian, D. Medhi, and K. Trivedi, "A Hierarchical Model to Evaluate Quality of Experience of Online Services hosted by Cloud Computing," Time, no. May, pp. 1-8, 2011.
[7] D. Menascé, "Performance and Availability of Internet Data Centers," IEEE Internet Computing, vol. 8, no. 3, pp. 94-96, May 2004.
[8] I. Brandic, V. C. Emeakaroha, M. Maurer, S. Dustdar, S. Acs, A. Kertesz, and G. Kecskemeti, "LAYSI: A Layered Approach for SLA-Violation Propagation in Self-manageable Cloud Infrastructures," in Proceedings of the 2010 34th Annual IEEE Computer Software and Applications Conference Workshops, Jul. 2010, pp. 365-370.
[9] Q. Li, Q. Hao, L. Xiao, and Z. Li, "Adaptive Management of Virtualized Resources in Cloud Computing Using Feedback Control," in Proc. of 1st International Conference on Information Science and Engineering (ICISE09), Dec. 2010.
[10] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live Migration of Virtual Machines," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI 05), 2005, pp. 273-286.
[11] M. Nelson, B.-h. Lim, and G. Hutchins, "Fast Transparent Migration for Virtual Machines," in Proceedings of USENIX 05, Anaheim, California, 2005, pp. 5-9.
[12] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High Availability via Asynchronous Virtual Machine Replication," in NSDI08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, 2008.
[13] L. A. Barroso and U. Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1-108, Jan. 2009.
[14] A. Greenberg, D. A. Maltz, and J. R. Hamilton, "VL2: A Scalable and Flexible Data Center Network," in Proceedings of SIGCOMM09. ACM, 2009.
[15] W. P. Turner and J. Seader, "Tier classifications define site infrastructure performance," The Uptime Institute White Paper, 2006.
[16] VMware White Paper, "VMware High Availability. Concepts, Implementation and Best Practices," 2007.
[17] J. Dean, "Designs, Lessons and Advice from Building Large Distributed Systems," Keynote Presentation at LADIS 2009, The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware, 2009.
[18] M. Dahlin, B. B. V. Chandra, L. Gao, and A. Nayate, "End-to-End WAN Service Availability," IEEE/ACM Transactions on Networking, vol. 11, no. 2, pp. 300-313, 2003.
