Reliability Aspect of Cloud Computing Environment

Vikas Kumar · R. Vidhyalakshmi

Vikas Kumar
School of Business Studies, Sharda University, Greater Noida, Uttar Pradesh, India

R. Vidhyalakshmi
Army Institute of Management & Technology, Greater Noida, Uttar Pradesh, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore.
Contents

1 Cloud Computing
1.1 Introduction
1.1.1 Characteristics
1.1.2 Deployment Methods
1.1.3 Service Models
1.1.4 Virtualization Concepts
1.1.5 Business Benefits
1.2 Cloud Adoption and Migration
1.2.1 Merits of Cloud Adoption
1.2.2 Cost–Benefit Analysis of Cloud Adoption
1.2.3 Strategy for Cloud Migration
1.2.4 Mitigation of Cloud Migration Risks
1.2.5 Case Study for Adoption and Migration to Cloud
1.3 Challenges of Cloud Adoption
1.3.1 Technology Perspective
1.3.2 Service Provider Perspective
1.3.3 Consumer Perspective
1.3.4 Governance Perspective
1.4 Limitations of Cloud Adoption
1.5 Summary
References
2 Cloud Reliability
2.1 Introduction
2.1.1 Mean Time Between Failure
2.1.2 Mean Time to Repair
2.1.3 Mean Time to Failure
Dr. Vikas Kumar received his M.Sc. in electronics from Kurukshetra University,
Haryana, India, followed by M.Sc. in computer science and Ph.D. from the same
university. His Ph.D. work was in collaboration with CEERI, Pilani, and he has
worked in a number of ISRO-sponsored projects. He has designed and conducted a
number of training programs for the corporate sector and has served as a trainer for
various Government of India departments. Along with six books, he has published
more than 100 research papers in various national and international conferences and
journals. He was Editor of the international refereed journal Asia-Pacific Business
Review from June 2007 to June 2009. He is a regular reviewer for a number of
international journals and prestigious conferences. He is currently Professor at
Sharda University, Greater Noida, and Visiting Professor at the Indian Institute of
Management, Indore, and the University of Northern Iowa, USA.
Chapter 1
Cloud Computing
Moore's Law, stated by Intel co-founder Gordon Moore in 1965, predicted that the processing power of a silicon chip (i.e., the number of transistors on it) would double roughly every 18–24 months. The prediction held for decades, yielding abundant computing power: processing power often doubled in less than the expected time and was leveraged in almost all domains to deliver speed, accuracy, and efficiency. However, integrated circuit fabrication is approaching physical limits, and tweaking the transistor count within a fixed die area also has an upper bound. Correspondingly, the benefit of making chips smaller is diminishing, and the operating capacity of high-end chips has plateaued since the mid-2000s. This led to a lookout for developments in the computing field beyond the hardware. One such realization is the new computing paradigm called cloud computing. Since its introduction about a decade ago, cloud computing has evolved at a rapid pace and has found an inevitable place in every business operation. This chapter provides an insight into various aspects of cloud computing and its business benefits, along with real-time business implementation examples.
1.1 Introduction
1.1.1 Characteristics
Cloud computing services are delivered over the Internet. They provide a very high level of technology abstraction, due to which even customers with very limited technical knowledge can start using cloud applications at the click of a mouse. NIST describes the characteristics of cloud computing as follows (NIST 2015):
Cloud services can be deployed in one of four ways: private cloud, public cloud, community cloud, and hybrid cloud. The physical location of the resources, security levels, and access methods vary with the deployment type. The selection of a cloud deployment method is based on the data sensitivity of the business and its requirements (Liu et al. 2011). Figure 1.1 depicts the advantages of the various deployment methods.
i. Private Cloud
This is a cloud setup maintained within the premises of the organization; it is also called an "Internal Cloud". A third party can also be involved, hosting an on-site or outsourced private cloud maintained exclusively for a single organization. This type of deployment is preferred by large organizations that have a strong IT team to set up, maintain, and control cloud operations. It is intended as a single-tenant cloud setup with strong data security capabilities. Availability, resiliency, privacy, and security are the major advantages of this type of deployment. A private cloud can be set up using major service providers such as Amazon, Microsoft, VMware, Sun, IBM, etc. Some of the open-source implementations for the same are Eucalyptus and OpenStack.
ii. Public Cloud
This type of cloud setup is open to the general public. Multiple tenants exist in this cloud setup, which is owned, managed, and operated by service providers. Small- and mid-sized companies opt for this type of cloud deployment with the prime intention of replacing CapEx with OpEx. A "pay as you go" model is used in this setup, where consumers pay only for the resources they utilize. Adoption of this facility eliminates the overhead of predicting and forecasting IT infrastructure requirements. A public cloud includes thousands of servers spanning various data centers situated across the globe. Consumers are given the facility to choose a data center near their business operations to reduce latency in service provisioning. A public cloud setup requires huge investment, so it is set up by large enterprises like Amazon, Microsoft, Google, Oracle, etc.
Software, storage, network, and processing capacity are provided as services from the cloud. The wide range of services offered is built one on top of another and is also termed the cloud computing stack, represented in Figure 1.2. The three major cloud computing services are Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). With the proliferation of cloud in almost all computing-related activities, various other services are also provided on demand and are collectively termed Anything as a Service (XaaS). The XaaS list includes Communication as a Service, Network as a Service, Monitoring as a Service, Storage as a Service, Database as a Service, etc.
[Figure 1.2: The cloud computing stack. PaaS (Platform as a Service): development tools, web servers, databases; e.g., Google App Engine, Microsoft Azure, Amazon Elastic Cloud. IaaS (Infrastructure as a Service): virtual machines, servers, storage, and networks; e.g., Amazon EC2, Rackspace, VMware, IBM Smart Cloud, Google Cloud Storage.]
Figure 1.3 shows multiple operating systems (OS 1, OS 2, OS 3) running on shared host hardware. Hardware intended for one operating system can be made to work with other operating systems using virtualization. This increases the utilization of hardware resources and allows organizations to reduce the number of enormously power-consuming servers, which also helps organizations achieve green IT (Menascé 2005).
VMware and Oracle are among the leading companies providing products, such as VMware Player and Oracle's VirtualBox, that support virtualization implementation. Virtualization can be achieved through a hosted approach or a hypervisor architecture. In the hosted approach, partitioning services are provided on top of the existing operating system to support a wide range of guest operating systems. A hypervisor, also known as a Virtual Machine Manager (VMM), is software that implements virtualization on the bare machine. It has direct access to the machine hardware and acts as an interface and controller between the hosting machine and the guest operating systems or applications, regulating resource usage (VMware 2006).
Virtualization can also be used to combine multiple physical resources into a single virtual resource. It helps eliminate server sprawl, reduces the complexity of maintaining business continuity, and enables rapid provisioning for test and development. Figure 1.3 describes the virtualized environment. The various types of virtualization include:
i. Storage virtualization
This is the combination of multiple network storage devices projected as a single huge storage unit. The storage spaces of several interconnected devices are combined into a simulated single storage space. It is implemented in software on a Storage Area Network (SAN), a high-speed sub-network of shared storage devices primarily used for backup and archiving.
ii. Server virtualization
Here the concept of one physical dedicated server is replaced with virtual servers: the physical server is divided into many virtual servers to enhance utilization. The identity of the physical server is masked, and users interact through the virtual servers only. Virtual web servers help provide low-cost web hosting. This also conserves infrastructure space, as several servers are replaced by a single physical server, and the hardware maintenance overhead is reduced to a large extent (Beal 2018).
iii. Operating system virtualization
This type of virtualization allows the same machine to run multiple instances of different operating systems concurrently through software, which helps a single machine run different applications requiring different operating systems. Another type of virtualization involving the OS is operating system-level virtualization, where a single OS kernel supports multiple applications running in different partitions of a single machine.
iv. Network virtualization
This is achieved through logical segmentation of the physical network resources. The available bandwidth is divided into channels, each separated and distinguished from the others; these channels are then assigned to servers or devices for further operations. The true complexity of the network is abstracted, and it is presented as a simple resource for usage.
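The storage virtualization idea above, several physical devices presented as one logical pool, can be sketched in a few lines. This is a minimal illustration, not a real SAN implementation; all class and device names are invented for the example.

```python
class PhysicalDisk:
    """One physical storage device in the pool (illustrative)."""
    def __init__(self, name, capacity_gb):
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0


class VirtualStoragePool:
    """Aggregates several physical disks into a single simulated storage space."""
    def __init__(self, disks):
        self.disks = disks

    @property
    def total_capacity_gb(self):
        return sum(d.capacity_gb for d in self.disks)

    @property
    def free_gb(self):
        return sum(d.capacity_gb - d.used_gb for d in self.disks)

    def allocate(self, size_gb):
        # Place the allocation on the first disk with room; the caller only
        # ever sees the pool, never the individual device behind it.
        for d in self.disks:
            if d.capacity_gb - d.used_gb >= size_gb:
                d.used_gb += size_gb
                return d.name  # which device actually holds the data
        raise RuntimeError("pool exhausted")


pool = VirtualStoragePool([PhysicalDisk("sanA", 500), PhysicalDisk("sanB", 1000)])
print(pool.total_capacity_gb)  # 1500 -- appears as one logical unit
pool.allocate(400)
print(pool.free_gb)            # 1100
```

The consumer of the pool asks only for capacity; which interconnected device actually stores the data is hidden, which is exactly the abstraction storage virtualization provides.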
Cloud adoption gives a wide array of benefits to business, such as reduced CapEx, greater flexibility, business agility, increased efficiency, enhanced web presence, faster time to market, enhanced collaboration, etc. The business benefits of cloud adoption include:
i. Enhanced Business Agility
Cloud adoption enables organizations to handle business dynamism without complexity. It enhances the agility of an organization by equipping it to accommodate changing business and customer needs, and it keeps the organization in pace with new technology updates with minimal or no human interaction. This is achieved through fast self-provisioning and de-provisioning of IT resources at the time of need, from anywhere and using any type of device. New application inclusion time is reduced from months to minutes.
ii. Pay-As-You-Go
This factor, abbreviated PAYG, allows customers to pay for resources based on the time and amount of their utilization. Cloud services are either metered, with usage-based payment, or subscription-based. This convenient payment facility enables customers to concentrate on core business activities rather than worrying about IT investments: IT infrastructure investment planning is replaced with planning for successful cloud migration and efficient cloud adoption. This factor entitles new entrants to leverage the entire benefit of ICT implementation with minimal investment.
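The metered, usage-based side of pay-as-you-go can be sketched as a simple rate table applied to consumption records. The rates and resource names below are invented for illustration; real providers publish their own price lists.

```python
# Hypothetical per-unit rates (illustrative, not any provider's real pricing)
RATES = {
    "compute_hours": 0.10,     # price per instance-hour
    "storage_gb_month": 0.02,  # price per GB stored for the month
    "egress_gb": 0.09,         # price per GB of outbound traffic
}

def monthly_bill(usage):
    """usage: dict mapping a metered resource to the quantity consumed.
    The customer is charged only for what was actually used."""
    return sum(RATES[resource] * qty for resource, qty in usage.items())

# A consumer pays only for the resources utilized in the period:
bill = monthly_bill({"compute_hours": 300, "storage_gb_month": 50, "egress_gb": 10})
print(f"${bill:.2f}")  # $31.90
```

Because every resource is metered, scaling usage down immediately scales the bill down, which is what lets PAYG replace upfront infrastructure planning.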
iii. Elimination of CapEx
This important cost factor eradicates one of the most important barriers to IT adoption for small businesses. The strenuous traditional way of using software in business involves purchasing, installing, maintaining, and upgrading it; this is simplified to simple browser usage. The user need not worry about initial costs such as purchase costs or costs related to updates and renewal; in fact, the only CapEx left is the cost of an Internet connection. The software required by the organization is used directly from the provider's site using authenticated login IDs, which eliminates a huge initial investment. All cloud services are metered, and this enables the customer to have greater control over the use of expensive resources. The basic IT requirements of the business should be assessed before cloud adoption, and allocations made only for those basic requirements; this controls the initial investment. Careful monitoring of cloud usage enables organizations to predict the financial implications of expanding their cloud usage. Huge capital investment in resources that may not be fully utilized is replaced with operational expenses, paying only for the resources utilized, thus managing costs.
v. Increased Efficiency
This refers to the optimal utilization of IT resources, which prevents devices from being over- or under-provisioned. Traditional IT resource allocations for servers, processing power, and storage are planned for the resource requirement spikes that occur during peak business seasons, which last only a small part of the year. The additional resources remain idle for most of the year, reducing IT resource efficiency; for example, the estimated server utilization rate is 5–15% of total capacity. Cloud adoption eliminates the need for such over-investment: the required resources are provisioned at the time of need and paid for as per usage, which increases resource efficiency.
Any disruption to the IT infrastructure will affect business continuity and might also result in financial losses. In a traditional IT setup, periodic maintenance of the hardware, software, storage, and network is essential to avoid such losses, and relying on traditional ICT for enterprise operations carries risk because recovery of affected IT systems is a time-consuming process. Cloud adoption increases IT reliability for enterprise operations by improving uptime and enabling faster recovery from unplanned outages. This is achieved through live migrations, fault tolerance, storage migrations, distributed resource scheduling, and high availability.
Cloud adoption assists organizations in reducing their carbon footprint. Organizations invest in huge servers and IT infrastructure to satisfy their future needs, and the utilization of these huge IT resources and heavy cooling systems contributes to the carbon footprint. On cloud adoption, the over-provisioning of resources is eliminated and only the required resources are utilized from the cloud, thus reducing the carbon footprint. The operation of cloud data centers also produces a carbon footprint, but it is shared by multiple users, and providers employ natural cooling mechanisms to reduce it.
x. Cost Reduction
Cloud adoption reduces cost in many ways. The initial investment in proprietary software is eliminated. Overhead charges such as data storage cost, quality control cost, and software and hardware update and maintenance costs are eliminated. Expensive proprietary license costs, such as license renewal and additional licenses for multiple-user access, are completely removed with cloud adoption.
Most big organizations have already adopted cloud computing, and many medium and small organizations are on the path to adopting it. Gartner's 2017 report projects that cloud computing spending will increase to $162B by 2020. As of 2017, nearly 74% of Chief Financial Officers believe cloud computing will have the most measurable impact on their business. Cloud spending has grown 4.5 times since 2009 and is expected to grow at a better rate of six times from 2015 through 2020 (www.forbes.com). As with the two sides of a coin, cloud adoption has both merits and demerits. Complexity does exist in choosing between the service models (IaaS, SaaS, PaaS) and deployment models (private, public, hybrid, community). SaaS services can be used as utility services without any worry about the underlying hardware or software, but other services need careful selection to enjoy the complete benefits of cloud adoption. This section deals with various aspects to understand before going for cloud adoption or migration.
i. Faster Deployments
Cloud applications are deployed faster than on-premise applications, because the cumbersome process of installation and configuration is replaced by a registration and subscription-plan selection process. On-premise applications are designed, created, and implemented for a specific customer and have to go through a complete software development life cycle that spans months; the update process likewise goes through a time-consuming development cycle. In contrast, cloud application adoption takes less time, as the software is readily available with the provider, and the time to initial software usage is reduced from months to minutes. Automatic software integration is another benefit of cloud adoption: it helps people with less technical knowledge use cloud applications without any additional installation process. Even organizations with existing IT infrastructure and in-house applications can migrate to the cloud after performing the required data migration.
ii. Multi-tenancy
This factor is responsible for the reduced cost of cloud services. A single instance of an application is used by multiple customers, called tenants. The cost of software development, maintenance, and IT infrastructure incurred by the CSP is shared by multiple users, which results in delivery of the software at low cost. Tenants are provided with customization of the user interface or business rules, but not of the application code. This factor streamlines the release management of software patches and updates: changes made to the single instance are reflected to all customers, eliminating version compatibility issues. Multi-tenancy also increases the optimal utilization of resources, thus reducing the resource usage cost for individual tenants.
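A common way to realize the single-instance, multiple-tenant pattern just described is to keep one shared data store and scope every query by a tenant identifier. The sketch below illustrates the idea with an in-memory SQLite table; the schema, tenant names, and amounts are assumptions made up for the example.

```python
import sqlite3

# One shared data store serving all tenants of the single application instance
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (tenant_id TEXT, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [("acme", 100.0), ("acme", 250.0), ("globex", 75.0)])

def invoices_for(tenant_id):
    # Every data access is filtered by tenant, so tenants sharing the same
    # application instance and table never see each other's rows.
    rows = conn.execute("SELECT amount FROM invoices WHERE tenant_id = ?",
                        (tenant_id,)).fetchall()
    return [amount for (amount,) in rows]

print(invoices_for("acme"))    # [100.0, 250.0]
print(invoices_for("globex"))  # [75.0]
```

One codebase and one database serve both tenants, which is where the shared-cost benefit comes from; the tenant filter is what keeps their data isolated.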
iii. Scalability
iv. Flexibility
Recovery is an essential process for business continuity, and it can be achieved successfully with the help of an efficient backup process. Cloud adoption provides a backup facility by default. Depending on the financial viability of the organization, either selected business operations or the entire business operations can be backed up. For small and medium organizations, backup storage locations must be planned in such a way that core department or critical data are centrally located and replicated regionally; this helps mitigate risk by moving critical data close to the region and its local customers. Primary and secondary backup sites must be geographically distributed to ensure business continuity. The backup types defined by NIST are full, incremental, and differential. A full backup covers all files and folders; an incremental backup captures files changed or created since the last backup of any kind; a differential backup captures changes or new files since the last full backup (Onlinetech 2013).
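The difference between the three backup types comes down to which reference point is compared against each file's last-modified time. The sketch below makes that distinction concrete; file names and timestamps are simulated values chosen for the example.

```python
def select_files(files, backup_type, last_full, last_backup):
    """files: dict mapping file name -> last-modified time (any comparable value).
    full: copy everything; incremental: changed since the last backup of any
    kind; differential: changed since the last *full* backup."""
    if backup_type == "full":
        return sorted(files)
    if backup_type == "incremental":
        return sorted(f for f, mtime in files.items() if mtime > last_backup)
    if backup_type == "differential":
        return sorted(f for f, mtime in files.items() if mtime > last_full)
    raise ValueError(f"unknown backup type: {backup_type}")

files = {"a.txt": 1, "b.txt": 5, "c.txt": 9}  # modified at times 1, 5, 9
# Suppose the last full backup ran at t=2 and an incremental ran at t=6:
print(select_files(files, "incremental", last_full=2, last_backup=6))   # ['c.txt']
print(select_files(files, "differential", last_full=2, last_backup=6))  # ['b.txt', 'c.txt']
```

Incrementals copy less per run but require the whole chain to restore, whereas a differential restore needs only the last full backup plus the latest differential, which is the usual trade-off when planning the backup schedule.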
Cloud computing also has some associated challenges, which are discussed in detail in Sect. 1.3, along with solutions that need to be followed to leverage the benefits of cloud computing adoption.
Cost–Benefit Analysis (CBA) is the process of evaluating the costs and the corresponding benefits of an investment, in this context cloud adoption. The process helps in making decisions for operations that have calculable financial risks. CBA should take into account the costs and revenue over a period of time, including changes in monetary value depending on the length and timing of the project. Calculating Net Present Value (NPV) helps measure the present profitability of the project by comparing the present ongoing cash flow with the present value of future cash flows. The three main steps of CBA are:
i. Identifying costs
ii. Identifying benefits
iii. Comparing both
The main cost benefit of cloud adoption is reduced CapEx: initial IT hardware and infrastructure expenses are eliminated, owing to the virtualization and consolidation characteristics of the cloud. The various costs associated with cloud adoption include server cost, storage cost, application subscription cost, cost of power, network cost, etc. The cloud pricing model (pay-as-you-go) is one of the main drivers of cloud adoption. The costs incurred in cloud adoption can be categorized as upfront costs, ongoing costs, and service termination costs (Cloud standards council 2013). Table 1.1 lists the various costs associated with cloud computing adoption.
Various financial metrics such as Total Cost of Ownership (TCO), Return on Investment (ROI), Net Present Value (NPV), Internal Rate of Return (IRR), and payback period are used to measure the costs and monitor the financial benefits of a SaaS investment. ROI estimates the financial benefit of the SaaS investment, while TCO calculates the total associated direct and indirect costs over the entire life span of the SaaS. NPV compares the estimated benefits and costs of SaaS adoption over a specified time period, using a discount rate to calculate the present value of future cash flows. IRR identifies the discount rate that would make the NPV of the investment zero. ROI, being simple compared with the other metrics, is preferred for financial evaluations (ISACA 2012).
The payback period refers to the time taken for the cumulative benefits to equal the investment. The main payback areas of cloud computing, with the savings and additional costs involved, are listed in Table 1.2 (Mayo and Perng 2009).
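The ROI, NPV, and payback-period metrics discussed above can be computed directly from a projected cash-flow series. The sketch below is a hedged illustration: the upfront cost, yearly savings, and 10% discount rate are invented figures, not data from the book.

```python
def roi(total_benefit, total_cost):
    """Return on Investment: net gain relative to cost."""
    return (total_benefit - total_cost) / total_cost

def npv(rate, cashflows):
    """Net Present Value. cashflows[0] is the upfront cost (negative);
    later entries are yearly net benefits, each discounted to today."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def payback_period(cashflows):
    """Years until the cumulative cash flow turns non-negative."""
    total = 0.0
    for t, cf in enumerate(cashflows):
        total += cf
        if total >= 0:
            return t
    return None  # never pays back within the planning horizon

# Hypothetical cloud-adoption case: $10,000 migration cost, then $4,000
# saved per year for four years, discounted at 10%.
flows = [-10_000, 4_000, 4_000, 4_000, 4_000]
print(round(npv(0.10, flows), 2))     # 2679.46 -- positive, so adoption pays off
print(payback_period(flows))          # 3 -- cumulative savings cover cost in year 3
print(round(roi(16_000, 10_000), 2))  # 0.6 -- 60% return over the period
```

A positive NPV at the chosen discount rate is the usual "go" signal in a CBA, while ROI and payback period give the simpler headline figures the text notes are preferred in practice.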
1.2.3 Strategy for Cloud Migration
Cloud migration refers to moving the data and applications related to business operations from on-premise IT infrastructure to cloud infrastructure. Moving IT operations from one cloud environment to another is also called cloud migration. Cisco mentions three types of migration options based on the service models: IaaS, PaaS, and SaaS. If an organization switches to SaaS, it is not strictly a migration but a simple replacement of existing applications. In PaaS migration, business applications based on standard on-premise application servers are moved to a cloud-based development environment; such migrations involve steps such as refactor, revise, and rebuild, as the existing on-premise applications need to be modified to suit the cloud architecture. IaaS migration deals with migrating applications and data storage onto servers maintained by the cloud service provider; this is also called re-hosting, where existing on-premise applications and data are moved to the cloud (Zhao and Zhou 2014).
Plan, deploy, and optimize are the three main phases to be followed for a successful cloud migration. The plan phase includes a complete cloud assessment in functional, financial, and technical terms, identifying whether to opt for IaaS, PaaS, or SaaS, and deciding on the cloud deployment option (public, private, or hybrid). The costs associated with servers, storage, network, and IT labor have to be detailed and compared with the corresponding on-premise costs (Chugh 2018). A security and compliance assessment needs to be done to understand the availability and confidentiality of data, prevailing security threats, risk tolerance levels, and disaster recovery measures.
The deploy phase deals with application and data migration. Careful planning for porting the existing on-premise applications and data onto the cloud platform is carried out in this phase so as to reduce or avoid disturbance to business continuity. Either forklift migration, where applications are shifted onto the cloud all at once, or hybrid migration, where applications are shifted partially, can be followed. Self-contained, stateless, and tightly coupled applications are the ones selected and moved in the forklift approach. The optimize phase deals with increasing the efficiency of data access, auto-terminating unused instances, and re-engineering existing applications to suit the cloud environment (CRM Trilogix 2015).
Training the staff to utilize the cloud environment is essential to keep control of fluctuating cloud expenses. Dynamic provisioning helps cater to sudden increases in workload, with payment made under the subscription-based model. At the same time, continuous monitoring has to be done to scale down the resources when the demand subsides. This helps reap the complete cost benefit of cloud adoption. Unmanaged open-source tools and provider-based managed tools are available for error-free cloud migrations.
Some of the major migration options are live migration, host cloning, data migration, etc. In live migration, running applications are moved from on-premise physical machines onto the cloud without suspending operations. In data migration, synchronization between the on-premise physical storage and cloud storage is carried out. After a successful migration, users can leverage the cloud and monitor and optimize their usage patterns with various cloud monitoring tools.
Business continuity might be affected due to disturbances to the existing IT operations of the organization: the existing on-premise IT infrastructure, applications, and data have to be completely or partially migrated to the cloud, which carries risks such as disruption of business continuity, loss of data, applications not working, loss of control over data, etc. Some of the cloud migration risk mitigation measures are:
Before migrating to the cloud, the data transfer time needs to be calculated. The number of days the transfer will take depends on the amount of data and the network speed (Chugh 2018):

No. of days = Total bytes / (Mbps × 125 × 1000 × network utilization × 60 s × 60 min × 24 h)
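The formula can be translated directly into code: 1 Mbps moves 125 × 1000 bytes per second, scaled by the average network utilization and by the 86,400 seconds in a day. The 10 TB / 100 Mbps / 80% figures below are example inputs, not values from the text.

```python
def transfer_days(total_bytes, mbps, network_utilization):
    """Days needed to move total_bytes over a link of the given speed.
    network_utilization is the fraction of the link actually usable (0..1)."""
    # 1 Mbps = 125 * 1000 bytes per second; 60 * 60 * 24 seconds per day
    bytes_per_day = mbps * 125 * 1000 * network_utilization * 60 * 60 * 24
    return total_bytes / bytes_per_day

# Example: 10 TB of data over a 100 Mbps link at 80% average utilization
days = transfer_days(10 * 10**12, mbps=100, network_utilization=0.8)
print(round(days, 1))  # 11.6
```

A week and a half for 10 TB shows why large migrations often schedule transfers around business hours or use provider-supplied bulk-transfer options instead of the network alone.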
Industry: Entertainment
Company: Netflix
Source: https://increment.com/cloud/case-studies-in-cloud-migration/
In October 2008, Neil Hunt, chief product officer at Netflix, called a meeting of his engineering staff. The reason for the meeting was to discuss the problems Netflix was facing: its backend client architecture was having issues with connections and threads. Even an upgrade to a machine worth $5 million crashed immediately, as it could not withstand the extra capacity of the thread pool. It was a disagreeable position for Netflix, which had introduced online streaming of its video library a year before. It had also partnered with Microsoft to get its app on the Xbox 360, had agreed with TV set-top box makers to serve their customers, and had agreed to terms with the manufacturers of Blu-ray players, but its back end could not cope with the load. The public had huge expectations, as the Netflix concept was viewed as a game-changing technology for the online video streaming industry.
There was a single point of failure in the physical technology: a single Oracle database, stored on an array of blade servers and run as a single unit. With this setup it was impossible to run the show, so it had to be made redundant; that is, a second data center had to be set up to remove the single point of failure. But the company could not go ahead due to its financial situation. The company then tried to push a piece of firmware to the disk array, which corrupted the Netflix database, and it took three days to recover the data. The meeting called by Hunt concluded with a decision to rethink and do everything from the beginning using cloud technology.
They detailed all the issues that were plaguing the smooth functioning of online streaming and were determined that these issues should not recur after cloud adoption. No maintenance or physical upkeep of data centers, a flexible way to ensure reliable IT resources, lower costs, scaling of capacity, and increased adaptability are the main features required by companies with highly unpredictable growth, and Netflix, facing exactly such growth, made a wise decision in migrating to the cloud.
Between December 2007 and December 2015, with cloud adoption, the company achieved a thousand-fold increase in the number of hours of content streamed, and user sign-ups increased eight-fold. The cloud infrastructure was able to stretch to meet the ever-expanding demand, and cloud adoption also proved to be cost-effective.
Since the cloud was a young technology in 2006, with Amazon being the leader, caution was required, and Netflix decided to move in small steps. It moved a single page onto Amazon Web Services (AWS) to make sure that it worked. AWS was chosen over the alternatives for its breadth of features, scaling capacity, and broader variety of APIs. Netflix's cloud adoption came at a point when organizations were not fully aware of the cloud migration process, so it involved a lot of out-of-the-box thinking, and the lack of standards for cloud adoption was a point of concern for Netflix.
Ruslan Meshenberg of Netflix says, "Running physical data centers is simple: we have to keep our servers up and running at all times and at all cost. That's not the case with the cloud. Software runs on ephemeral instances that aren't guaranteed to be up for any particular duration or at any particular time. You can either lament that ephemerality and try to counteract it, or you can try to embrace it and say, I'm going to build a reliable system on top of something that is not."
Netflix decided to build a system that can fail in parts but not as a whole. It built a
tool named Chaos Monkey that would self-sabotage its systems, simulating crash
conditions to make sure its engineers architect, write, and test software that is resilient
to failures. Meshenberg admits, “In the initial days, Chaos Monkey's tantrums in the
cloud were dispiriting. It was painful, as we didn't have the best practices, and many of
our systems failed in production. But this helped our engineers build software using
best practices that can withstand such destructive testing.”
Meshenberg says, “The crux of our decision to go into the cloud was a simple one.
Maintaining and building data centers wasn't our core business. It is not something
from which our users get value. Our users get value from enjoying their entertainment.
We decided to focus on that and push the underlying infrastructure to cloud providers
like AWS.”
Scalability was the main factor that inspired Netflix to move to the cloud. Meshenberg
recalls, “Every time you grow your business, your traffic grows by an order of magnitude.
The things that worked on a small scale may no longer work at a bigger scale. We made
a bet that the cloud would be sufficient in terms of capacity and capability to support
our business, and the rest was figuring out the technical details of how to migrate and
monitor.”
1.3 Challenges of Cloud Adoption
The Internet-based working pattern of the cloud inherits security and continuity risks,
and these factors also act as inhibitors of cloud adoption. Careful security and risk
management is essential to overcome this barrier. Cloud adoption should not be carried
out on the basis of market hype but after a detailed parsing of the merits and demerits
of cloud adoption (Vidhyalakshmi and Kumar 2013).
1.3.1 Technology Perspective
This category includes challenges arising from the base technology aspects such as
virtualization, Internet-based operations, and remote access. High latency, security,
insufficient bandwidth, interaction with on-premise applications, bulk data transfer,
and mobile access are some of the challenges in this category.
i. High Latency
The time delay between placing a request with the cloud provider and the availability
of the service is called latency. Network congestion, packet loss, data encryption,
distributed computing, virtualization of cloud applications, data center location, and
the load at the data center are the various factors responsible for latency. Tightly
coupled modules with intense data interactions between them, when used in distributed
computing, result in a data storm and hence in latency, depending on the location of
the interacting modules. Latency is a serious business concern: a half-second delay
reportedly caused a 20% drop in Google's traffic, and a tenth-of-a-second delay a 1%
drop in Amazon's sales.
Segregating applications into latency-tolerant and latency-intolerant, and keeping the
intolerant applications on-premise or opting for a hybrid cloud, is a suggested solution
(David and Jelly 2013). Choosing a data center located near the enterprise can also
bring down latency.
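As a toy illustration of the data-center-location advice, one can pick the region with the lowest measured round-trip time. The region names and latency figures below are made up:

```python
def nearest_region(latencies_ms):
    """Pick the data-center region with the lowest measured round-trip time."""
    return min(latencies_ms, key=latencies_ms.get)

# Hypothetical latencies (ms) measured from the enterprise's location.
measured = {"ap-south": 18.0, "eu-west": 110.0, "us-east": 190.0}
assert nearest_region(measured) == "ap-south"
```

In practice, the measurements would come from periodic probes against each provider endpoint rather than a static table.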
ii. Security
Distributed computing, shared resources, multi-tenancy, remote access, and third-party
hosting are the various reasons that introduce security challenges into cloud computing
(Doelitzscher et al. 2012). Data may be modified or deleted, either accidentally or
deliberately, at the provider's end by those who have access to it. Any breach of
conduct or privilege misuse by a data center employee may go unnoticed, as it is
beyond the scope of customer monitoring. Security breaches identified even in highly
secured fiber-optic cable networks, such as data tapping without accessing the network,
add further challenges (Jessica 2011). The security concern applies to both data at rest
and data in motion.
Encryption at the customer's end is one solution to these security issues. Distributed
identity management has to be maintained, using the Lightweight Directory Access
Protocol (LDAP) to connect to identity systems.
Pervasive computing, which enables an application to be accessed from any type of
device, also introduces a host of issues such as authentication, authorization, and
provisioning. The failure of the hypervisor to control remote devices, mobile
connectivity disruptions due to signal failure, and stickiness issues caused by frequent
switching of application usage between PCs and mobile devices are the challenges of
mobile access.
Topology-agnostic identification of mobile devices is essential to control and monitor
mobile access to cloud applications. 4G/LTE services, with advantages such as
plug-and-play features, high-capacity data storage, and low latency, can also provide
a solution.
1.3.2 Service Provider Perspective
Providers are classified as Cloud Service Providers (CSPs), offering IaaS, PaaS, or SaaS
services on a contractual basis; Cloud Infrastructure Providers (CIPs), providing
infrastructure support to CSPs; and Communication Service Providers, providing
transmission services to CSPs. The various challenges they face are regulatory
compliance, service level agreements, interoperability, performance monitoring and
reporting, and environmental concerns.
i. Regulatory Compliance
Providers are expected to comply with PCI DSS, HIPAA, SAS 70, SSAE 16, and other
regulatory standards to provide proof of security. This is a challenging task due to the
cross-border, geographically distributed nature of cloud processes. Another challenge
is the huge customer base, spanning different industry verticals with varied security
requirement levels.
Some providers offer compliance features at an increased cost, with product pricing
varying with the intensity of the compliance requirement.
1.3.3 Consumer Perspective
Consumers adopt cloud mainly to pass IT concerns over to a third party and to
concentrate on core business operations and innovation. The challenges from their
perspective are availability, data ownership, organizational barriers, scalability, data
location, and migration.
i. Availability
This is one of the primary concerns for consumers, as any availability issue would
affect business operations and may result in financial and reputational losses. It is
difficult for consumers to verify the availability claims of the provider, owing to the
Internet-based working of the cloud.
Providers take utmost care to make services available as per their agreements, using
replication, mirroring, and live migration. Critical business operations that must
maintain continuity should opt for replication across the globe. Availability is an
important focus of cloud performance and hence an integral part of all Service Level
Agreements.
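An availability figure promised in an SLA translates directly into a downtime budget. The helper below is a generic sketch of that arithmetic, not tied to any provider's actual SLA terms:

```python
def downtime_budget_hours(availability_pct, period_hours=365 * 24):
    """Maximum downtime allowed per period under a given availability claim."""
    return period_hours * (1 - availability_pct / 100.0)

# A "three nines" (99.9%) claim permits roughly 8.76 hours of downtime a year.
assert abs(downtime_budget_hours(99.9) - 8.76) < 0.01
```

Comparing this budget against actual measured downtime is one concrete way for a consumer to check an availability claim.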
ii. Organizational Barriers
Business complexity is a very big challenge for cloud adoption. Organizations that deal
with sensitive data, highly time-critical processing, or complex interdependencies
between working modules face a major challenge in migrating to the cloud, as does an
organization's unwillingness to adapt its working to suit cloud operations.
Cloud Service Brokers (CSBs) play a major role in such situations, providing hybrid
solutions that preserve the organization's way of working while also leveraging cloud
benefits.
iii. Scalability
This is one of the primary benefits of the cloud, helping startups utilize ICT facilities
according to their business requirements, but it is also a great challenge to monitor
regularly. The auto-deployment option accommodates spikes in user requirements with
extra resources at an additional cost. Monitoring these spikes and de-provisioning the
additional resources once the spike has passed, so as to contain the additional cost, is
a major challenge for consumers.
The organization's IT personnel must be trained to handle the dashboard efficiently
and to constantly monitor the service provided.
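The de-provisioning problem can be sketched as a simple heuristic: given the current load (expressed in instance-equivalents) and the number of instances still running, compute the surplus that can be released. The 60% target utilization is an illustrative assumption, not a recommended value:

```python
import math

def deprovision_candidates(load, capacity, target_utilization=0.6):
    """Surplus instances that can be released once a demand spike subsides."""
    needed = math.ceil(load / target_utilization)  # instances still required
    return max(capacity - needed, 0)

# A spike left 12 instances running, but the current load needs only 6.
assert deprovision_candidates(load=3.2, capacity=12) == 6
```

A real monitor would feed this from dashboard metrics and apply hysteresis so that brief dips do not trigger premature release.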
The data location may keep changing due to data center load balancing or data center
failures. A consumer may also change providers, either because the provider terminates
the service or out of discontent with it. In either case, data have to be migrated, and
the risk of data leaks is a big challenge.
Localization is one suggested solution, but it may introduce issues such as latency due
to overload and increased cost due to under-utilization of resources. Alternatively,
customers can be given the option of selecting the data center location, so that instead
of localizing the data, they choose the desired locations to which the data can be
shifted.
1.4 Limitations of Cloud Adoption
The workings of cloud computing, such as remote access, virtualization, distributed
computing, and geographically distributed databases, place limitations on the design
and usage of cloud applications. Internet penetration also imposes limitations on cloud
usage, since the Internet is the base for delivering any type of cloud service. For
example, the Internet penetration of India is 34.1% (www.internetsociety.org), which
eventually limits cloud usage by Indian users. The other limitations are:
i. Customization
Cloud applications are created based on the general requirements of a huge customer
base. Customer-specific customizations are often not possible, forcing customers either
to tolerate unwanted modules or to modify their working according to the application's
requirements. This is one of the main barriers to cloud adoption by SMBs.
ii. Provider Dependency
Total control of the application lies with the provider. Updates are carried out at the
provider's pace, depending on global requirements. Incompatible data formats
maintained by providers may force customers to stick with them, and any unplanned
outage results in financial and customer losses, as business continuity depends on the
provider.
The complexity of the application can also limit cloud usage. Applications with many
module interactions involving intensive data movement between the modules are not
suitable for cloud migration. 3D modeling applications, when migrated to the cloud,
may experience slow I/O operations due to virtual device drivers (Jamsa 2011).
Applications that can be parallelized are more suitable for cloud adoption.
Traditional ACID-based databases do not support the shared-nothing architecture
essential for scalability, and using an RDBMS for geographically distributed cloud
applications requires complex distributed locking and commit mechanisms. The
traditional RDBMS, which compromises on partition tolerance, has to be replaced with
databases that preserve partition tolerance but compromise on either consistency or
availability.
The majority of cloud applications process data at the petabyte scale and use
distributed storage mechanisms. The traditional RDBMS has to be replaced with NoSQL
databases to keep pace with the volume of data growing beyond the capacity of a
single server, the variety of data gathered, and the velocity with which it is gathered.
The categories of NoSQL databases are column-oriented databases (HBase, Google's
BigTable, Cassandra), key-value stores (Amazon's SimpleDB), and document-based
stores (Apache CouchDB, MongoDB).
1.5 Summary
This chapter has outlined the basic characteristics of cloud computing, its deployment
methods (public, private, community, and hybrid), and its service models (IaaS, PaaS,
and SaaS). The technical base for cloud adoption, i.e., the concepts of virtualization,
has also been discussed, giving readers a clear understanding of the important aspects
of cloud computing. Cost is one of the main factors projected as an advantage of cloud
adoption, and the various cost heads included in cloud adoption have been detailed,
providing a good understanding of the essentials to be monitored for cloud cost
control. The business benefits of cloud adoption, along with its challenges and
limitations, have also been highlighted. Careful selection of the cloud service model
and deployment model is essential for leveraging cloud benefits. The metrics to be
used for cloud service selection and the model for identifying a reliable cloud service
provider are detailed in the succeeding chapters.
References
Abbreviations
BC Business continuity
IA Information availability
MTTF Mean time to failure
MTTR Mean time to recovery
MTBF Mean time between failure
ERP Enterprise resource planning
SRE Software reliability engineering
Reliability is a tag that can be attached to any product or service delivery. The mere
attachment of this tag conveys perceived characteristics such as trustworthiness and
consistent performance. The tag becomes even more important for cloud computing
environments, due to their strong dependence on the Internet for service delivery.
Cloud adoption eliminates IT overhead, but it also brings in security, privacy,
availability, and reliability issues. Based on a survey by Juniper Research, the number
of worldwide cloud service consumers was projected to reach 3.6 billion in 2018. The
cloud computing market is flooded with numerous cloud service providers, and it is a
herculean task for consumers to choose the CSP that best suits their business needs.
Possessing the reliability tag for their services helps CSPs outshine their competitors.
This chapter deals with the reliability aspect of cloud environments. Various reliability
requirements with respect to business, along with a basic understanding of cloud
reliability concepts, are detailed in this chapter.
2.1 Introduction
(Fig. 2.1: reliability and failure intensity plotted against time in hours.)
A fault is a defect in the software which, when executed under certain conditions,
results in failure. In other words, the reason for a failure is referred to as a fault.
Reliability values are always represented as mean values. The various general
reliability measurement techniques are (Aggarwal and Singh 2007):
i. Rate of Occurrence of Failure
ii. Mean Time to Failure (MTTF)
iii. Mean Time to Repair (MTTR)
iv. Mean Time Between Failure (MTBF)
v. Probability of Failure on Demand
vi. Availability
Reliability quantities are measured with respect to a unit of time. Probability theory
enters the estimation of reliability because of the random nature of failure occurrence:
the values of these quantities cannot be predicted exactly, as the occurrence of failures
is not known with certainty and differs with the usage pattern. Reliability is represented
as the cumulative number of failures by time t, and failure intensity is measured as the
number of failures per unit time at time t. Figure 2.1 shows the reliability and failure
intensity curves with respect to time. It is evident from the graph that as faults are
removed from the system after multiple tests and corrections, the system stabilizes:
the failure intensity decreases and hence the reliability of the system increases.
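Failure intensity can be computed directly from logged failure times by counting failures per time window. The failure times below are invented for illustration:

```python
def failure_intensity(failure_times, window):
    """Failures per unit time in each successive window of the given width."""
    if not failure_times:
        return []
    bins = int(max(failure_times) // window) + 1
    counts = [0] * bins
    for t in failure_times:
        counts[int(t // window)] += 1
    return [c / window for c in counts]

# Failure times in hours; intensity falls as faults are found and removed.
times = [1, 2, 3, 5, 12, 25]
assert failure_intensity(times, window=10.0) == [0.4, 0.1, 0.1]
```

The decreasing sequence mirrors the stabilization trend described above.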
Mean Time Between Failures (MTBF) provides the frequency of failure of a product
with respect to time. It is one of the deciding factors, as it indicates the efficiency of
the product. This factor is essential for developers or manufacturers rather than for
consumers, and the data are not readily available to consumers or end users.
Consumers give it importance only for products or services used in real-time or critical
operations, where failure leads to huge losses.
Mean Time to Repair (MTTR) refers to the time taken to repair a failed system or
component. Repairing could mean replacing a failed component, or modifying an
existing component to adapt to changes or to remove failures caused by faults. Taking
a long time to repair a product or software shoots up the operational cost, so
organizations strive to reduce MTTR by having backup plans. This factor matters to
consumers, who enquire about the turnaround time for repairing a product.
Mean Time to Failure (MTTF) denotes the average time for which a device performs as
per its specification; the bigger the value, the better the product's reliability. It is
similar to MTBF, the difference being that MTBF is used for products that can be
repaired, whereas MTTF is used for non-repairable products. MTTF data are collected
for a product by running many thousands of units. This metric is crucial for hardware
components, especially those used in mission-critical applications.
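Given a log of (failure time, repair-completion time) pairs, MTBF and MTTR can be estimated as sketched below. The event log is invented for illustration:

```python
def mtbf_mttr(events):
    """Estimate MTBF and MTTR (hours) from (failure, repair_done) time pairs."""
    mttr = sum(done - fail for fail, done in events) / len(events)
    fail_times = [fail for fail, _ in events]
    gaps = [b - a for a, b in zip(fail_times, fail_times[1:])]
    # With a single failure there is no gap to average over.
    mtbf = sum(gaps) / len(gaps) if gaps else float("inf")
    return mtbf, mttr

# Hypothetical log: three failures over ~500 hours of operation.
events = [(100, 102), (300, 301), (500, 503)]
mtbf, mttr = mtbf_mttr(events)
assert mtbf == 200 and mttr == 2
```

In practice such estimates are built from far larger samples, as the text notes for MTTF.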
Business Continuity (BC) is an enterprise-wide process that encompasses all IT
planning activities to prepare for, respond to, and recover from planned and unplanned
outages. Planning involves proactive measures, such as business impact analysis and
risk assessment, and reactive measures, such as disaster recovery. Backup and
replication are used in the proactive processes, and recovery in the reactive process.
Information unavailability that disrupts business can have catastrophic effects,
depending on the criticality of the business. Information may be inaccessible due to
natural disasters or due to planned or unplanned outages. Planned outages may occur
due to hardware maintenance, new hardware installation, software upgrades or
patches, backup operations, migration of applications from the testing to the
production environment, etc. Unplanned outages occur due to physical or virtual
device failures, database failures, unintentional or intentional human errors, etc.
(Figure: stages of the BC planning lifecycle, including analyze, design and develop,
and implement.)
ii. Analyze
The first step of this stage is to gather all information regarding business processes,
infrastructure dependencies, data profiles, and the frequency of use of the business
infrastructure. A business impact analysis is carried out in terms of the revenue and
productivity losses due to service disruption. The critical business processes are
identified and their recovery priorities assigned. Risk analysis is performed for the
critical functions, and mitigation strategies are designed. The available BC options
are evaluated using cost–benefit analysis.
iii. Design and develop
Teams are defined for activities such as emergency response, infrastructure recovery,
damage assessment, and application recovery, with clearly defined roles and
responsibilities. Data protection strategies are designed, and the required
infrastructure and recovery sites are developed, along with contingency procedures,
emergency response procedures, and recovery and restart procedures.
iv. Implement
Risk mitigation procedures such as backup, replication, and resource management are
implemented. The identified recovery sites are prepared for use during a disaster.
Replication is implemented for every resource to avoid single points of failure.
v. Train, test, assess and maintain
The employees responsible for BC maintenance are trained in all the proactive and
reactive BC measures developed by the team. Vulnerability testing must be performed
on the BC plans for performance evaluation and limitation identification. The BC plans
are to be updated periodically, based on technology updates or modifications to
business requirements.
The ability of traditional or cloud-based IT infrastructure to perform its functions as
per business expectations at the required time of operation is termed Information
Availability (IA). Accessibility, reliability, and timeliness are the attributes of IA:
accessibility refers to access to information by the right person at the right time,
reliability refers to the consistency and correctness of the information, and timeliness
refers to the time window during which the information will be available (Wiley 2010).
Information unavailability, also termed downtime, leads to loss of productivity, loss of
reputation, and loss of revenue. Reduced output per unit of labor, capital, and
equipment constitutes the loss of productivity. Direct loss, future revenue loss,
investment loss, compensatory payments, and billing loss are the various components
of the loss of revenue. Loss of reputation is the loss of confidence or credibility with
customers, suppliers, business partners, and banks (Somasundaram and Shrivastava
2009).
The sum of all losses incurred due to a service disruption is captured by the metric
average cost of downtime per hour. It is used to measure the business impact of
downtime and also assists in identifying the BC solution to be adopted. The average
cost of downtime per hour is calculated as (Wiley 2010)

Avgdt = Avgpl + Avgrl (2.1)

where
Avgdt is the average cost of downtime per hour,
Avgpl is the average productivity loss per hour, and
Avgrl is the average revenue loss per hour.

The average productivity loss per hour is calculated as

Avgpl = (Total salary and financial benefits of all employees per week) /
        (Average number of working hours per week) (2.2)
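These downtime-cost quantities can be computed directly; the payroll and revenue figures below are illustrative, not drawn from any real business:

```python
def avg_productivity_loss(weekly_payroll, weekly_hours):
    """Avgpl: payroll and benefits burned per working hour."""
    return weekly_payroll / weekly_hours

def avg_cost_of_downtime(avg_pl, avg_rl):
    """Avgdt: productivity loss plus revenue loss per hour of downtime."""
    return avg_pl + avg_rl

# Illustrative figures: $80,000 weekly payroll over 40 working hours.
pl = avg_productivity_loss(weekly_payroll=80_000, weekly_hours=40)
assert pl == 2000
assert avg_cost_of_downtime(pl, avg_rl=5000) == 7000
```

A $7,000-per-hour downtime cost would, for instance, justify a proportionally priced BC solution.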
IA is calculated as the fraction of time during which the system was functional to
perform its intended task, either in terms of system uptime and downtime or in terms
of Mean Time Between Failures (MTBF) and Mean Time to Recovery (MTTR):

IA = System uptime / (System uptime + System downtime) (2.4)

or

IA = MTBF / (MTBF + MTTR) (2.5)
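Both availability expressions can be evaluated in a few lines; the uptime and MTBF/MTTR values below are illustrative:

```python
def information_availability(uptime, downtime):
    """IA as the fraction of time the system was functional."""
    return uptime / (uptime + downtime)

def ia_from_mtbf(mtbf, mttr):
    """The same ratio expressed through MTBF and MTTR."""
    return mtbf / (mtbf + mttr)

# 8754 hours up and 6 hours down in a year is roughly "three nines".
assert abs(information_availability(8754, 6) - 0.99932) < 1e-4
assert abs(ia_from_mtbf(200, 2) - 0.990099) < 1e-6
```

The two forms agree because, over a long run, uptime accumulates at rate MTBF per failure cycle and downtime at rate MTTR.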
Software is also a product that needs to be delivered with the reliability tag attached.
The profitability of software is directly related to achieving precise reliability
objectives. Software used in business is expected to adapt to rapidly changing business
needs at a fast pace. Faster delivery time carries the tag of “greater agility”, which
refers to the rapid response of the product to changes in user needs or requirements
(Musa 2004). The reliability of software is influenced by both logical and physical
failures: a bug that causes a physical failure is corrected and the system restored to
its state before the bug appeared, whereas a bug that causes a logical failure is
removed and the system enhanced.
Reliability/availability, rapid delivery, and low cost are the most important
characteristics of good software from the perspective of software users. Developing
reliable software depends upon applying quality attributes at each phase of the
development cycle, with the main concentration on error prevention (Rosenberg et al.
1998). The various software quality attribute domains and their sub-attributes are
provided in Table 2.2.
As discussed in the preceding sections, reliability is user-oriented: it deals with the
usage of the software rather than its design. Evidence of reliability is obtained after
prolonged running of the software; it relates operational experience to the influence
of failures on that experience. Even though reliability can only be measured after
operating the software for a period of time, consumers need software with some
guaranteed reliability.
Well-developed reliability theories exist that can be applied directly to hardware
components. Hardware components fail due to design errors, fabrication quality
issues, momentary overload, aging, etc. Failure data are collected during the
development as well as the operational phase to predict reliability. Major differences
exist between the reliability measurements and metrics of hardware and software, so
hardware reliability theory cannot be directly applied to software. The life of a
hardware device is classified into three phases: burn-in, useful life, and burn-out.
During the burn-in phase, failures are frequent, as the product is in its nascent stage,
and hence reliability is low. During the useful-life phase, the failure rate is almost
constant, as the product has stabilized after all corrections. In the burn-out phase,
the product suffers from aging or wear-out issues, and hence the failure rate is again
high. This is represented in Fig. 2.3 as the popular bathtub curve.
The same concept cannot be applied to software, as software has no wear-out phase;
software instead becomes obsolete. The failure rate is high during the testing phase,
and the failure rate of software does not decrease with its age. Figure 2.4 depicts
software reliability in terms of failure rate with respect to time. Software Reliability
Engineering (SRE) is a special field of engineering devoted to developing and
maintaining such software systems.
2.4 Reliability in Distributed Environments
Distributed systems are preferred over traditional environments as they provide higher
availability and higher speed at low cost. Easy resource sharing and data exchange,
however, may cause concurrency and security issues. The various types of distributed
systems are distributed computing systems, distributed information systems, and
distributed pervasive systems.
A fault in any component of a distributed system results in failure. The failures thus
encountered can lead to simple repairable errors or to major system outages. Table 2.3
lists various failures of a distributed system.
Two general types of faults occur in distributed systems: transient faults and
permanent faults. Table 2.4 lists the differences between these two types.
Apart from these general types, various faults also occur in the constituents of
distributed systems, such as components, processors, and the network. These faults
are discussed as follows:
i. Component faults: These occur due to the malfunctioning of components such as
connectors, switches, or chips, and can be transient, intermittent, or permanent.
Transient faults occur once and vanish when the operation is repeated. Intermittent
faults occur due to loose connections and recur sporadically until the faulty part is
fixed. Permanent faults result from a faulty or non-functional component; the system
will not function until the part is replaced.
ii. Processor faults: The processor is the main component of a distributed system
responsible for fast and efficient working. Faults in processor functioning lead to three
types of failure: fail-silent, Byzantine, and slowdown. In a fail-silent failure, the
processor stops accepting input and producing output, i.e., it stops functioning
completely. A Byzantine failure does not stop the processor: it continues to work but
gives out wrong answers. In a slowdown failure, the faulty processor functions slowly
and is labeled as “failed” by the system; it may later return to normal and issue
orders, leading to problems within the distributed system.
iii. Network faults: The network is the backbone of a distributed system, and any fault
in it leads to loss of communication. The failures that may arise are one-way links and
network partitions. In a one-way link failure, message transfer between two systems A
and B works in only one direction: system A can send messages to system B but
cannot receive replies, leading A to assume that B has failed. A network partition
failure occurs due to a fault in the connection between two sections of systems. The
two separated sections continue working independently, and when the partition fault
is fixed, consistency errors may occur if both sections had been working on the same
resource during the partition.
Designing a fault-tolerant system with reliability, availability, and security is essential
to leverage the benefits of distributed systems. To ensure reliable communication
between processors, a redundancy approach is incorporated into the design of a
distributed system. Any of three types of redundancy, namely information redundancy,
time redundancy, or physical redundancy, can be followed to ensure continuous system
availability. Information redundancy is the addition of extra bits to the data to allow
recovery from distorted bits. Time redundancy refers to the repetition of a failed
communication or transaction; this is the solution for transient and intermittent faults.
Physical redundancy is the inclusion of a new component in place of the failed
component. Physical redundancy can be implemented as active replication, where
each processor's work is replicated simultaneously; the number of replications
depends on the fault tolerance requirement of the system. The other way of
implementing physical redundancy is primary backup, where, along with the primary
server, an unused backup server is maintained. Any outage of the primary server
initiates a switch, making the backup server the primary server.
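The primary-backup switch, the simplest form of physical redundancy, can be sketched as follows; the server names are placeholders:

```python
class PrimaryBackup:
    """Minimal primary-backup pair: the backup takes over on a primary outage."""
    def __init__(self, primary="server-A", backup="server-B"):
        self.primary = primary
        self.backup = backup

    def handle_outage(self, failed):
        """Promote the backup if the primary fails; return the active server."""
        if failed == self.primary and self.backup is not None:
            self.primary, self.backup = self.backup, None
        return self.primary

pair = PrimaryBackup()
assert pair.handle_outage("server-A") == "server-B"  # failover occurred
```

A production design would also re-provision a fresh backup after failover so the system is not left with a single point of failure.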
The checkpointing technique can also be used to maintain continuity of the system.
The process state, together with the contents of active registers and variables, defines
the state of a system at a particular moment. All this information about the system is
collected and stored; the stored snapshots are called checkpoints. The collection and
storage process may be user-triggered, coordinated through process communication,
or message-based. When a system failure is encountered, the stored values are used
to restore the system to the most recently stored checkpoint. Some transaction details
are lost, but the grueling process of repeating the entire application from the beginning
is eliminated. The checkpointing method is useful but time-consuming.
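A minimal user-triggered checkpoint can be implemented with serialization; the state dictionary and file name below are illustrative:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Persist process state (variables, counters) for later recovery."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint(path):
    """Reload the most recently stored state after a failure."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "app.ckpt")
state = {"step": 1200, "registers": {"acc": 42}, "vars": {"total": 3.14}}
save_checkpoint(state, path)
# After a crash, resume from step 1200 instead of restarting from scratch.
assert restore_checkpoint(path)["step"] == 1200
```

The time cost noted above shows up here as the serialization and disk I/O performed at every checkpoint interval.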
The demand for cost-effective and flexible scaling of resources has paved the way for
the adoption of cloud computing (Jhawar et al. 2013). The cloud computing industry is
expanding day by day: Forrester has predicted that the market will grow from $146
billion in 2017 to $236 billion in 2020 (Bernheim 2018), along with growth in
industry-specific services offered by a diverse pool of cloud service providers.
The reliability models of traditional software cannot be directly applied to the cloud
environment, due to the technical shift from product-oriented to service-oriented
architecture. With developments in cloud computing, the reliability of applications
deployed on the cloud attracts more attention from cloud providers and consumers.
The layered structure of cloud applications and services increases the complexity of
the reliability process. Depending on the cloud services subscribed to, the CSP and
the cloud consumer share the responsibility of offering a reliable service. The
customer's trust in the services provided by the CSP is paramount, particularly in the
case of SaaS, due to the total dependency of the business on the SaaS. Customers
expect services to be available all the time, given the advancements in cloud
computing and online services (Microsoft 2014).
The main aims of applying reliability concepts to cloud services are to:
i. Maximize service availability.
ii. Minimize the impact of service failure.
iii. Maximize service performance and capacity.
iv. Enhance business continuity.
Reliability in a cloud environment is viewed as failure tolerance, which is quantifiable,
along with qualitative features such as adherence to compliance standards, swift
adaptability to changing business needs, implementation of open standards, an easy
data migration policy, a clear exit process, etc. Various types of failures, such as
request timeout failures, resource-missing failures, overflow failures, network failures,
database failures, and software and hardware failures, are interleaved in a cloud
computing environment (Dai et al. 2009).
Cloud customers and cloud providers share the responsibility for ensuring a reliable
service or application when they enter into a contract agreement (SLA) to provide or
utilize cloud services. The intensity of the responsibility varies for both, depending
on the cloud offering. In an IaaS offering, the customer is completely responsible for
building a reliable software solution, while the provider is responsible for providing
reliable infrastructure such as storage, compute cores, and network. In a PaaS
offering, the provider is responsible for reliable infrastructure and OS, and the
customer is responsible for the design and installation of a reliable software
solution. In a SaaS offering, the provider is completely responsible for delivering
a reliable software service at all times, and the customer has little or nothing to
do to obtain reliable SaaS (Microsoft 2014).
2.5 Defining Cloud Reliability
A number of models have been proposed by researchers in the area of cloud
computing reliability. The research areas include interleaved failures in cloud
models, scheduling reliability, quality of cloud services, homomorphic encryption
methods, and multi-state-system-based reliability assessment.
Dai et al. (2009) proposed a cloud service reliability model based on graph theory,
Markov models, and queuing theory, built on the observation that failures in cloud
computing models are interleaved. The parameters considered in this model are
processing speed, amount of data transferred, bandwidth, and failure rates. Graph
theory and Bayesian approaches are integrated to develop an evaluation algorithm.
Banerjee et al. (2011) designed a practical approach to assess the reliability of a
cloud computing suite using log file details. Traditional reliability of web servers
is used as a base for establishing the availability and reliability of SaaS applications.
The data are extracted from the log file using a log filtering method based on
transaction categorization, with workload characterized by session and request
counts. Transactions of registered users are taken into consideration because of
their direct business impact. It is suggested that the findings of such log-based
reliability techniques and measures be included as a component of the SLA.
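A minimal sketch of log-based reliability estimation in the spirit of this approach might look as follows; the log format and the status-code failure rule are assumptions for illustration, not the authors' actual method:

```python
# Hedged sketch: estimate reliability as the fraction of logged requests that
# completed successfully. The "last field is an HTTP status code" layout and
# the "status >= 500 means failure" rule are illustrative assumptions.
def reliability_from_log(log_lines):
    """Fraction of requests that completed successfully."""
    total = failures = 0
    for line in log_lines:
        status = int(line.split()[-1])  # assumed: status code is the last field
        total += 1
        if status >= 500:
            failures += 1
    return 1.0 - failures / total if total else 0.0

log = ["GET /login 200", "POST /order 200", "GET /report 503", "GET /home 200"]
print(reliability_from_log(log))  # 0.75
```

Such a figure, computed per session or per registered user, is the kind of measure the authors suggest attaching to the SLA.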
Malik et al. (2012) proposed reliability assessment, fault tolerance, and
reliability-based scheduling models for PaaS and IaaS. Different reliability
assessment algorithms for general, hard real-time, and soft real-time applications
are presented. The proposed model has several modules: a Fault Monitor module to
identify faults during execution, a Time Checker module to identify real-time
processes, and the core Reliability Assessor module to assess the reliability of
each compute instance. The algorithm proposed for general applications is more
failure-focused and more adaptive.
Dastjerdi and Buyya (2012) proposed automating the negotiation process between
the cloud service requester and the provider for service discovery, scaling, and
monitoring, along with a reliability assessment of the cloud provider. The
objective of the automated negotiation is to minimize cost and maximize
availability for the requester, and to maximize price while limiting committed
availability for the provider. The challenges addressed are tracking the
reliability offers given by the provider and balancing resource utilization. The
research findings conclude that simultaneous negotiations with multiple
requesters improve profits for providers.
Quality of Reliability (QoR) of cloud services is proposed by Wu et al. (2012).
Rather than analyzing from the consumer end or the provider end, a layered
composable system accounting architecture is proposed. The S5 system accounting
framework, consisting of Service existence, Service capability, Service
availability, Service usability, and Service self-healing, identifies the levels of
QoR for cloud services. The primary aim of this research is to analyze past events,
update occurrence probabilities, and make failure predictions.
The resource-sharing, distributed, and multi-tenant nature of cloud computing,
together with virtualization, is the main reason for its increased risks and
vulnerabilities (Ahamed et al. 2013). A public key infrastructure that includes
confidence, authentication, and privacy is identified as the base for providing
essential security services that will eventually build trust and confidence between
the provider and the consumer. The challenges and vulnerabilities of cloud
environments are discussed. Traditional encryption is suggested as a partial
solution to some of these challenges, while data-centric and homomorphic
encryption methods are suggested as the solutions best suited to cloud
environment challenges.
Hendricks et al. (2013) designed "CloudHealth", a nationwide system providing a
reliable cloud platform for the healthcare community in the USA. High
availability, global access, security, and compliance are mentioned as the prime
attributes of a reliable SaaS health product. OpenNebula is used as the default
monitoring system for host and VM monitoring and for balanced resource
allocation, and add-on monitoring is done using Zenoss, a Linux-based monitoring
system. The add-on monitoring system was tested by creating a network failure on
a VM, a kernel panic in a VM, and a simulated failure of the VM's host machine,
all of which were immediately identified and notified to the administrators.
Wu et al. (2013) modeled the reliability and performance of cloud service
composition based on multi-state system theory, which suits systems capable of
accomplishing a task at partial or degraded performance. Traditional reliability
models are found unfit for cloud services because they assume independent
component execution. The reliability of a cloud application is defined as the
probability that the performance rate matches the user requirement. A fast
optimization algorithm based on the universal generating function (UGF) and a
genetic algorithm (GA), which runs with little time consumption and eliminates
the risk of state space explosion, is presented in the paper.
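The UGF idea can be illustrated with a toy sketch: each component is represented as a distribution over performance rates, and composition rules combine these distributions. The rates, probabilities, and the series/min composition rule below are illustrative assumptions, not the algorithm of Wu et al.:

```python
# Toy sketch of the universal generating function (UGF) idea. A UGF is stored
# as {performance_rate: probability}; a series composition of two components
# is assumed to deliver the minimum of their rates.
from collections import defaultdict
from itertools import product

def series_ugf(u1, u2):
    """Combine two UGFs assuming series composition (rate = min of the two)."""
    out = defaultdict(float)
    for (g1, p1), (g2, p2) in product(u1.items(), u2.items()):
        out[min(g1, g2)] += p1 * p2
    return dict(out)

u_a = {100: 0.9, 0: 0.1}   # component A works at rate 100 or fails
u_b = {80: 0.95, 0: 0.05}  # component B works at rate 80 or fails
combined = series_ugf(u_a, u_b)
print(round(combined[80], 4))  # 0.855 = P(system delivers rate 80)
```

Reliability against a user demand then reads directly off the combined distribution, e.g. P(rate >= demand), which is what makes the multi-state view a natural fit for degraded-performance services.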
A model to measure the software quality, QoS, and security of SaaS applications
is proposed by Pang and Li (2013). The proposed model includes separate
perspectives for the customer and the platform provider. Based on this, an
evaluation model is proposed that categorizes the level of a SaaS product as
basic, standard, optimized, or integrated. The security metrics included in the
model are customer security, data security, network security, application
security, and management security. Quality of experience, quality of platform,
and quality of application are the metrics considered for QoS. The
characteristics of the quality-in-use model and product quality model of ISO/IEC
25010:2011 are utilized for the software quality metrics. The metrics to be met
at each of the four SaaS levels are also listed.
Anjali et al. (2013) identified undetermined latency and little or no control over
computing nodes as the reasons for failure in cloud computing, and devised a
fault-tolerant model accordingly. The model evaluates the reliability of each node
and decides on its inclusion or exclusion. An Acceptor module is provided for each
VM; it tests the VM and identifies its efficiency. If the test results are produced
before the specified time, they are sent to the Timer module. The Reliability
Assessor checks the reliability of each VM after every computing cycle.
The initial reliability is assumed to be 100%. Maximum and minimum reliability
limits are predefined, and any VM whose reliability falls below the minimum is
removed. The decision maker node accepts the list of reliable nodes from the
Reliability Assessor module and selects the node with the highest reliability.
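The scheme just described might be sketched roughly as follows; the reward/penalty update rule, the thresholds, and the function names are illustrative assumptions rather than the authors' exact algorithm:

```python
# Hedged sketch of the fault-tolerant node-selection scheme described above.
# The +0.05 / -0.15 update and the 0.4 minimum are illustrative assumptions.
MIN_R, MAX_R = 0.4, 1.0

def update_reliability(r: float, passed: bool) -> float:
    """Adjust a VM's reliability after one computing cycle."""
    return min(MAX_R, r + 0.05) if passed else r - 0.15

def select_node(vms: dict) -> str:
    """Exclude VMs below the minimum and pick the most reliable remaining one."""
    reliable = {name: r for name, r in vms.items() if r >= MIN_R}
    return max(reliable, key=reliable.get)

vms = {"vm1": 1.0, "vm2": 1.0, "vm3": 1.0}     # initial reliability 100%
vms["vm2"] = update_reliability(vms["vm2"], passed=False)  # vm2 failed a test
print(select_node(vms))  # vm1
```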
In the IaaS service model, the base for any application execution, such as server,
storage, and network, is provided by the CSP. Organizations opt for IaaS for its
scalability, as resources can be provisioned and de-provisioned at the time of
need. High availability, security, load balancing, storage options, data center
location, the ability to scale resources, and fast data access are essential
attributes of any IaaS service.
The PaaS service model is preferred by developers, as they need not worry about the
installation and maintenance of servers, patches, authentication, and upgrades.
PaaS provides workflow and design tools and rich APIs to help in faster and easier
application development; companies can hence concentrate on enhancing the user
experience. Dynamic provisioning, manageability, performance, fault tolerance,
accessibility, and monitoring are the qualities that must be taken care of to
maintain the reliability of a PaaS environment.
In the SaaS service model, reliability factors are considered from the requirement
gathering phase until the delivery of the software as a service. This type of
service delivery requires more
2.6 Summary
In this chapter, we have discussed various terms and definitions pertaining to
reliability. Reliability is a tag attached to enhance the trustworthiness of a
product or service, and cloud computing environments are no exception. Even though
the cloud industry is expanding rapidly, consumers still have inhibitions about
cloud adoption owing to its dependency on the Internet, remote data storage, and
the loss of control over applications and data. Hence it is imperative for all
cloud service providers and cloud application developers to adopt every quality
measure needed to provide efficient and reliable cloud services. The following
chapters outline reliability factors and quantification methods for IaaS, PaaS,
and SaaS.
References
Aggarwal, K. K., & Singh, Y. (2007). Software engineering (3rd ed). New Age International Pub-
lisher.
Ahamed, F., Shahrestani, S., & Ginige, A. (2013). Cloud computing: security and reliability issues.
Communications of the IBIMA, 2013, 1.
Anjali, D. M., Sambare, A. S., & Zade, S. D. (2013). Fault tolerance model for reliable cloud comput-
ing. International Journal on Recent and Innovation Trends in Computing and Communication,
1(7), 600–603.
Banerjee, P., Friedrich, R., Bash, C., Goldsack, P., Huberman, B., Manley, J., et al. (2011). Everything
as a service: Powering the new information economy. IEEE Computer, 3, 36–43.
Bernheim, L. (2018). IaaS vs. PaaS vs. SaaS cloud models (differences & examples). Retrieved July,
2018 from https://www.hostingadvice.com/how-to/iaas-vs-paas-vs-saas/.
Briand L. (2010). Introduction to software reliability estimation. Simula research laboratory
material. Retrieved May, 5, 2015 from www.uio.no/studier/emner/matnat/ifi/INF4290/v10/
undervisningsmateriale/INF4290-SRE.pdf.
Dai, Y. S., Yang, B., Dongarra, J. & Zhang, G. (2009). Cloud service reliability: Modeling and
analysis. In 15th IEEE Pacific rim international symposium on dependable computing (pp. 1–17).
Dastjerdi, A. V., & Buyya, R. (2012). An autonomous reliability-aware negotiation strategy for
cloud computing environments. In 12th IEEE/ACM international symposium on cluster, cloud
and Grid computing (pp. 284–291).
Hendricks, E., Schooley, B., & Gao, C. (2013). Cloud health: developing a reliable cloud platform
for healthcare applications. In Conference proceedings of 3rd IEEE international workshop on
consumer e-health platforms, services and applications (pp. 887–890).
Jhawar, R., Piuri, V., & Santambrogio, M. (2013). Fault tolerance management in cloud computing:
a system-level perspective. IEEE Systems Journal 7(2), 288–297
Lyu, M. R. (2007). Software reliability engineering: A roadmap. In 2007 Future of Software Engi-
neering (pp. 153–170). IEEE Computer Society.
Malik, S., Huet, F., & Caromel, D. (2012). Reliability aware scheduling in cloud computing. In 7th
IEEE international conference for internet technology and secured transactions (ICITST 2012)
(pp. 194–200).
Microsoft Corporation White paper. (2014). An introduction to designing reliable cloud ser-
vices. Retrieved on September 10, 2014 from http://download.microsoft.com/download/…/An-
introduction-to-designing-reliable-cloud-services-January-2014.pdf.
Musa, J. D. (2004). Software reliability engineering (2nd ed., pp. 2–3). Tata McGraw-Hill.
Musa, J. D., Iannino, A., & Okumoto, K. (1990). Software reliability. Advances in computers, 30,
4–6.
Pang, X. W., & Li, D. (2013). Quality model for evaluating SaaS software. In Proceedings of 4th
IEEE international conference on emerging intelligent data and web technologies (pp. 83–87).
Rosenberg, L., Hammer, T., & Shaw, J. (1998). Software metrics and reliability. In 9th international
symposium on software reliability engineering.
Somasundaram, G., & Shrivastava, A. (2009). Information storage management. EMC education
services. Retrieved August 10, 2014 from www.mikeownage.com/mike/ebooks/Information%
20Storage%20and%20Management.
Wiley, J. (2010). Information storage and management: storing, managing, and protecting digital
information. USA: Wiley Publishing.
Wu, Z., Chu, N., & Su, P. (2012). Improving cloud service reliability—A system accounting
approach. In 9th IEEE international conference on services computing (SCC) (pp. 90–97).
Wu, Z., Xiong, N., Huang, Y., Gu, Q., Hu, C., Wu, Z., & Hang, B. (2013). A fast optimization method
for reliability and performance of cloud services composition application. Journal of Applied
Mathematics (407267). Retrieved April, 2014 from http://dx.doi.org/10.1155/2013/407267.
Chapter 3
Reliability Metrics
3.1 Introduction
Reliability refers to the performance of a system as per its specification.
Numerous components are involved in the performance of a system, and the efficient
working of these components as per expectation increases the overall efficiency of
the system. This makes the system worthy of trust; in other words, it makes the
system more reliable. The operation of these components thus becomes a metric of
reliability.
Some metrics are easy to calculate and are thus called quantitative metrics. Other
metrics have qualitative values, such as satisfaction level, adherence (or not) to
compliance, or an expected security level with values like high, intermediate, and
normal. Both quantitative and qualitative metrics must be included in the final
reliability evaluation to capture the holistic performance of the system; hence it
is important to devise measures to quantify qualitative metrics. Metrics are
essential to support informed decision making and can also be used for
i. Selecting the suitable cloud services
ii. Defining and enforcing service level agreements
iii. Monitoring the services rendered
iv. Accounting and auditing of the measured services
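One simple way to quantify a qualitative metric is to map its levels onto a numeric scale so that it can be combined with quantitative metrics; the scale values and the weights below are illustrative assumptions, not a standard:

```python
# Hedged sketch: quantifying a qualitative security-level metric so it can be
# blended with a quantitative uptime metric. Scale and weights are assumptions.
QUALITATIVE_SCALE = {"high": 1.0, "intermediate": 0.6, "normal": 0.3}

def combined_score(uptime_fraction: float, security_level: str,
                   w_uptime: float = 0.7, w_security: float = 0.3) -> float:
    """Weighted blend of a quantitative and a quantified qualitative metric."""
    return w_uptime * uptime_fraction + w_security * QUALITATIVE_SCALE[security_level]

print(round(combined_score(0.999, "high"), 3))  # 0.999
```

The same device works for any ordinal attribute (satisfaction level, compliance adherence), which is what makes a single holistic reliability figure computable.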
This chapter starts with reliability aspects of Service-Oriented Architecture (SOA)
and virtualized environments. This is because these two are the backbone of cloud
applications and services delivery.
SOA and cloud computing complement each other to achieve efficiency in service
delivery. The working of cloud and SOA, and the overlap between them, is given in
Fig. 3.1 (Raines 2009). Both share the basic concept of service orientation. In
SOA, business functions are implemented as discoverable services and are published
in a services directory. Users wishing to use a functionality request the service
and consume it through a suitable standardized message passing facility. Cloud
computing, on the other hand, provides all IT requirements as commodities that can
be provisioned at the time of need from cloud service providers. Both SOA and
cloud rely on the network for the execution of services. Cloud has broader
coverage, as it includes everything related to IT implementation, whereas SOA is
restricted to software implementation concepts.
Virtualization is the basic technology that powers the cloud computing paradigm.
It is software that handles hardware manipulation efficiently, while cloud
computing offers services that are the results of this manipulation.
Virtualization separates the compute environment from the physical infrastructure,
which allows the simultaneous execution of multiple operating systems and
applications. A workstation with the Windows operating system installed can easily
switch to performing a Mac-based task without switching off the system.
Virtualization has helped organizations to reduce IT costs and to increase the
utilization, flexibility, and efficiency of hardware.
Cloud computing has shifted the utilization of compute resources from asset-based
resources to service-based virtual resources. The dependency of cloud
implementations on SOA and virtualization makes it necessary to discuss these
topics before deciding on the metrics of reliability. This chapter therefore
includes two separate sections on SOA and virtualization, along with their
reliability requirements. Apart from this, various standards organizations work on
setting quality and performance standards for cloud computing. Organizations such
as ISO, NIST, CSMIC, ISACA, CSA, etc., work for the betterment of cloud service
development, deployment, and delivery. These organizations have listed various
quality attributes, which keep being updated as technology changes. The quality
features of ISO 9126, the NIST specifications on cloud service delivery, and the
Service Measurement Index (SMI) designed by CSMIC are discussed in detail. The
chapter concludes with the categorization of reliability metrics along with their
quantification mechanisms. The metrics are classified based on expectation, usage
pattern, and standard specification; the quantification method varies depending on
the nature of the value stored in each metric.
Software systems built for business operations need to be updated to keep pace
with the ever-changing global scope of business. Software architecture is chosen
in a way that provides flexibility in system modification without affecting
current working, while maintaining functional and non-functional quality
attributes; this is essential for the success of such software systems. Use of
Service-Oriented Architecture (SOA) helps to achieve flexibility and dynamism in
software implementations. It provides adaptive and dynamic solutions for building
distributed systems. SOA is defined in many ways by various companies. Some of
the definitions are:
[Fig. 3.2: Overall view of SOA — service providers publish services to a services registry; service consumers find them there and invoke them via request/response messages.]
“SOA is an application framework that takes business operations and breaks them
into individual business functions and processes called services. SOA lets you build,
deploy and integrate these services, independent of applications and the computing
platform on which they run”—IBM Corporation.
“SOA is a set of components which can be invoked, and whose interface descriptions
can be published”—World Wide Web Consortium.
“SOA is an approach to organize information technology in which data, logic and
infrastructure resources are accessed by routing messages between network inter-
faces”—Microsoft.
A service in SOA refers to the complete implementation of a well-defined module of
business functionality. These services are expected to have published interfaces
that are easily discoverable. Figure 3.2 represents the overall view of SOA.
Well-established services can further be used as building blocks for new business
applications. The design principles of services are (Erl 2005; McGovern et al. 2003):
i. Services are self-contained and reusable.
ii. A service logically represents a business activity with a specified outcome.
iii. A service is a black box for users, i.e., it abstracts the underlying logic.
iv. Services are loosely coupled.
v. Services are location transparent and have a network-addressable interface.
vi. A service may itself be composed of other services.
Large systems are built using loose coupling of autonomous services which have
the potential to bind dynamically and discover each other through standard protocols.
3.2 Reliability of Service-Oriented Architecture
[Fig. 3.3: Elements of SOA — application front end, service (comprising contract, interface, and an implementation of business logic and data), service repository, and service bus.]
This also includes easy integration of existing systems and rapid inclusion of new
requirements (Arikan 2012).
The six core values of SOA are (innovativearchitects.com):
i. Business value is given more weight than technical strategy.
ii. Intrinsic interoperability is preferred over custom integration.
iii. Shared services are valued over specific-purpose implementations.
iv. Strategic goals are preferred over project-specific benefits.
v. Flexibility has an edge over optimization.
vi. Evolutionary refinement is expected rather than initial perfection.
This style of architecture reuses services at the macrolevel, which helps
businesses adapt quickly to changing market conditions in a cost-effective way and
helps organizations view problems in a holistic manner. In practice, a large
number of developers will code business operations in the languages of their
choice while complying with the standards for usage interfaces, data, and message
communications. Figure 3.3 represents the elements of SOA (Krafzig et al. 2005).
The various quality metrics that need to be considered for ensuring effective and
efficient SOA implementations are:
i. Interoperability
Distributed systems are designed and developed using various platforms and
languages. They are also used across various devices, from handheld portable
devices to mainframes. In the early days of distributed systems, there was no standard
very difficult due to the presence of loosely coupled components. A two-phase
commit can be used, which relies on compatible transaction agents at the end
points that interact using standard formats.
iv. Scalability and Extensibility
Scalability refers to the ability of SOA functions to change in size or volume
based on user needs, but without any degradation of the existing performance.
The options for solving capacity issues are horizontal and vertical scalability.
Horizontal scalability distributes the extra workload across computers, which
might involve adding an extra tier of systems. Vertical scalability is an upgrade
to more powerful hardware. Effective scaling increases trust in the services.
Extensibility refers to modifying a service's capability without affecting the
existing parts of the service. This is an essential feature of SOA, as it enables
software to adapt to ever-changing business needs. Loose coupling of the
components enables SOA to make the required changes without affecting other
services. Restricting the message interface makes it easy to read and understand
but reduces extensibility, so a trade-off between interface messages and
extensibility is required in SOA.
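The horizontal-scaling arithmetic above can be sketched as a back-of-the-envelope calculation; the figures and the headroom parameter are illustrative assumptions:

```python
# Illustrative sketch: how many identical instances are needed to serve a
# target load when scaling horizontally, keeping spare capacity for spikes.
import math

def instances_needed(requests_per_sec: float, capacity_per_instance: float,
                     headroom: float = 0.2) -> int:
    """Round up, reserving a 'headroom' fraction of each instance's capacity."""
    return math.ceil(requests_per_sec / (capacity_per_instance * (1 - headroom)))

# 900 req/s against instances that each handle 120 req/s:
print(instances_needed(900, 120))  # 10
```

Vertical scaling would instead raise `capacity_per_instance`; the horizontal form is what "adding an extra tier of systems" amounts to.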
v. Auditability
This quality factor represents the ability of the services to comply with
regulatory requirements. The flexibility offered in SOA design complicates the
auditing process. End-to-end audits involving the logging and reporting of
distributed service requests are essential. This can be achieved by incorporating
business-level metadata into each SOA message header, so that it can be captured
in the audit logs for future tracing. This implementation requires the different
service providers to follow messaging standards.
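The business-level audit metadata described above might be attached to a message header as in the following sketch; the field names are assumptions, not any messaging standard:

```python
# Hedged sketch: wrapping a service payload with audit metadata so distributed
# requests can be traced end to end. Field names are illustrative assumptions.
import time
import uuid

def with_audit_header(payload: dict, service: str, caller: str) -> dict:
    return {
        "header": {
            "correlation_id": str(uuid.uuid4()),  # follows the request end to end
            "service": service,
            "caller": caller,
            "timestamp": time.time(),
        },
        "body": payload,
    }

msg = with_audit_header({"order_id": 42}, service="billing", caller="storefront")
print(sorted(msg["header"]))  # ['caller', 'correlation_id', 'service', 'timestamp']
```

Each hop logs the header before forwarding, so the shared `correlation_id` lets the audit trail be reassembled across providers.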
The reliability of SOA is based on the software architecture used for building
services, as the main focus is on components and the data flow between them. The
quality attributes mentioned above have to be maintained to meet SLA requirements.
State-based, additive, and path-based models are the three architecture-based
reliability models (Goýeva-Popstojanova et al. 2001). A state-based reliability
model uses the control flow graph of the software architecture to estimate
reliability. In a path-based model, all possible execution paths are computed and
the reliability is evaluated for each path. In an additive model, the reliability
of each component is evaluated and the total system reliability is computed as a
non-homogeneous Poisson process. Normal software reliability models cannot be used
directly in SOA, because in SOA a single piece of software is built from
interacting groups of autonomous services created by various geographically
distributed stakeholders, and these services may come with varying levels of
reliability assurance. The reliability of SOA must be evaluated for the basic
components: basic service working, data flow, service composition, and the
complete workflow. As service publication and discovery are done at run time, a
reliability model designed for SOA must react at runtime to evaluate the dynamic
changes of the system.
Virtualization is an old technique that has existed since the 1960s but became
popular with the advent of cloud computing. It is the creation of a virtual
(rather than actual) version of a server, desktop, storage, operating system, or
network resource. It is the essential platform that helps an organization's IT
infrastructure meet dynamic business requirements, and its implementation helps IT
organizations achieve high application performance in a cost-effective manner.
Organizations using VMware vSphere with Operations Management report a 30%
increase in hardware savings, a 34% increase in hardware capacity utilization, and
a 36% increase in consolidation ratios (VMware 2015).
Virtualization, in cloud computing terms, allows multiple operating systems and
applications to run on the same server. It helps create the required levels of
customization, isolation, security, and manageability, which are the basics for
the delivery of IT services on demand (Buyya et al. 2013). Adoption of
virtualization also increases resource utilization, which in turn helps to reduce
cost. Virtual machines are created on top of the existing operating system and
hardware, providing an environment logically separated from the underlying
hardware. Figure 3.4 explains the concept of virtualization. The various types of
virtualization are:
i. Hardware virtualization
The software used for implementing virtualization is called a Virtual Machine
Manager (VMM) or hypervisor. In hardware virtualization, the hypervisor is
installed directly on the hardware. It monitors and controls the memory,
processor, and other hardware resources, and helps consolidate various hardware
segments or servers. Once the hardware systems are virtualized, different
operating systems can be installed on the same machine, with different
applications running on them. The advantage of this type of virtualization is
increased processing power due to maximized hardware utilization. The sub-types
of hardware virtualization are full virtualization, emulation virtualization, and
paravirtualization.
ii. Storage virtualization
The process of grouping physical storage from multiple devices into a single
storage block with the help of networks is called storage virtualization. It
provides the benefit of using a single large contiguous memory without its actual
physical presence. This type of virtualization is mostly used during backup and
recovery processes, where huge storage space is required. Its advantages are the
homogenization of storage across devices of different capacities and speeds,
reduced downtime, enhanced load balancing, and increased reliability. Block
virtualization and file virtualization are the sub-types of storage virtualization.
iii. Software virtualization
Installing the virtual machine software or virtual machine manager on a host
operating system, instead of directly on the machine, is called software
virtualization. It is used in situations where applications need to be tested on
different operating system platforms. It creates a full virtual computer system
and allows a guest operating system to run on it. For example, a user can run the
Android operating system on a machine whose native OS is Windows. Application
virtualization, operating system virtualization, and service virtualization are
the three flavors of software virtualization.
iv. Desktop virtualization
This is a common feature in almost all organizations. The desktop activities of
end users are stored on remote servers, and users can access their desktops from
any location using any device. This enables employees to work conveniently from
the comfort of their homes. The risk of data theft is minimized, as data transfer
happens over secured protocols.
Hardware and software virtualization are the most relevant to cloud computing:
hardware virtualization is an enabling factor for Infrastructure as a Service
(IaaS), while Platform as a Service (PaaS) is leveraged by software virtualization
(Buyya et al. 2013). Managed isolation and execution are the two main reasons for
including virtualization; these two characteristics help in building controllable
and secure computing environments. Portability is another advantage, which helps
in the easy transfer of computing environments from one machine to another and
also reduces migration costs. The motivating factors that urge the inclusion of
virtualization are confidentiality, availability, and integrity, which are
achieved through properties like isolation, duplication, and monitoring.
The various quality attributes that need to be maintained for the assured
reliability of a virtualized environment are (Pearce et al. 2013):
i. Improved confidentiality
The efficiency of this quality attribute is achieved through effective isolation
techniques. Placing the OS inside a virtual machine helps achieve the highest
level of isolation: it not only isolates the software from the hardware of the
same machine but also isolates the various guest operating systems from one
another and from the hardware (Ormandy 2007). Proactive intrusion and malware
analysis processes are also simplified, as samples can be executed and analyzed
in a virtualized environment, eliminating the need for a full system setup for
sample analysis. A fully virtualized system is set up to offer an isolated
environment, but it has to be logically identical to the physical environment.
ii. Duplication
The ability to capture all activities and restore them at the time of need is an
important quality feature of virtual machines. Rapid capture of the working state
of the guest OS to a file is essential; the captured state is called a snapshot.
The states of memory, hard disk, and other attached devices are captured in a
snapshot. Snapshots are captured periodically while running and also during any
system outage. It is easy to restore previously captured snapshots, and sometimes
snapshots are restored while VMs are running, with little degradation. The ability
of VMs to store and restore states also provides hardware abstraction, which in
turn improves the availability of virtual machines. Load balancing can be
performed by moving running VMs between hosts, which is referred to as live
migration.
iii. Monitor
The virtual machine manager has full control over the working of the VMs. This
also provides assurance that no VM activity will go unobserved. Full low-level
visibility of the operations and the ability to intervene in the operations of
the guest OS help the VMM to capture, analyze, and restore operations with ease
(Pearce et al. 2013). The low-level visibility aspect of VM operations is also
referred to as
3.4 Recommendations for Reliable Services
Standards organizations such as ISO/IEC, NIST, CSMIC, etc., have laid out various recommendations for providing cloud services. These are to be followed meticulously by providers that claim to offer reliable services to customers. Compliance with these standards will also increase the trust factor and hence the customer base. Some of the standards to be followed with respect to cloud services are discussed below.
ii. Reliability
After delivery of the software as per specification, reliability refers to the capability of the software to maintain its working under defined conditions for the stated period of time. This characteristic is used to measure the resiliency of the system. Fault tolerance measures should be in place to excel in resiliency.
iii. Usability
This feature refers to the ease of use of the system. It is connected with system functionality and also includes learnability, i.e., the ability of the end user to learn the system usage with ease. The overall user experience is judged here, and it is collected as a Boolean value: either the feature is present or it is not.
iv. Efficiency
This feature refers to the use of system resources while providing the required functionality. It deals with processor usage, amount of storage utilized, network usage without congestion issues, efficient response time, etc. This feature is closely linked with the usability characteristic: the higher the efficiency, the higher the usability.
v. Maintainability
This includes the ability of the system to fix issues that might occur during software usage. This characteristic is measured in terms of the ability to identify faults and fix them. It is also referred to as supportability. The effectiveness of this characteristic depends on the code readability and the modularity used in the software design.
vi. Portability
This characteristic refers to the ability of the software to adapt to the ever-changing implementation environment and business requirements. Including modularity by implementing object-oriented design principles, and separating the logical design from the physical implementation, will help to achieve adaptability without any effect on the existing system working.
Table 3.1 presents the complete characteristics and sub-characteristics of the ISO 9126-1 Quality Model (www.sqa.net/iso9126.html).
Understanding these quality attributes and incorporating them in the software design will enhance the quality of the software and its delivery. This will in turn enhance the overall trust in the software. Maintaining all quality attributes at their highest efficiency level is a challenging task. For example, if code is highly modularized, then it is easy to maintain and will also have high adaptability, but this degrades resource usage such as CPU usage. Hence, trade-offs need to be applied wherever required, depending on the business requirements of the customers. These quality attributes can also be considered as metrics for the evaluation of software.
3.4.2 NIST
(Figure: NIST service measurement concepts, relating a property of the service, its observation, the measurement results, metrics, and the resulting knowledge.)
For example, availability is mentioned as a percentage, such as 99.9 or 99.99%. These percentages are converted into an amount of downtime, which can be checked against the actual downtime encountered.
Metrics used in cloud computing service provisioning can be categorized as service selection, service agreement, and service measurement metrics.
i. Metrics for service selection
Service selection metrics are used for identifying and finalizing the cloud offering that is best suited to the business requirements. Independent auditing or monitoring agencies can be used to produce metric values such as scalability, performance, responsiveness, availability, etc. These values can be used by customers to assess the readiness and the service quality of the cloud provider. Some metrics, like security, accessibility, customer support, and adaptability of the system, can be determined from customers who are currently using the cloud product or services.
ii. Metrics for service agreement
The Service Agreement (SA) is the contract that binds the customer and the provider. It is a combination of the Service Level Agreement (SLA) and Service Level Objectives (SLOs). It sets the boundaries and the allowed margin of error to be followed by providers. It includes term definitions, service descriptions, and the roles and responsibilities of both provider and customer. Details of measuring cloud services, such as the performance level and the metrics used for monitoring and balancing, are also included in the SLA.
iii. Metrics for service measurement
This is designed with the aim of measuring the assurance of meeting service level objectives. In case of failure to meet the guaranteed service levels, pre-determined remedies have to be initiated. Metrics like notification of failure, availability, updation frequency, etc., have to be checked. These details are gathered from the dashboard of the product or service or accepted as feedback from the existing users.
Other aspects of cloud usage can also be measured using metrics for auditing, accounting, and security. Accounting is linked with the amount of usage of a service. Auditing and security are related to assessing compliance with certification requirements for the customer segment. Figure 3.6 represents the various properties or functions required for the management of offered services. These are broadly classified as business support, portability, and provisioning (Liu et al. 2011).
3.4.3 CSMIC
The CSMIC Service Measurement Index (SMI) groups cloud service quality attributes into seven categories, which are further divided into sub-categories. The first-level quality attributes are (CSMIC 2014):
i. Accountability
Evaluation of this attribute will help customers decide about the trust factor of the provider. The sub-attributes relate to the organization of the cloud service provider; they are given in Table 3.2.
ii. Agility
This attribute will help customers to identify the ability of the provider to meet the
changing business demands. The disruption due to product or service changes is
expected to be minimal. Table 3.3 lists the sub attributes of agility.
iii. Assurance
This attribute indicates the likelihood that the provider will meet the assured service levels. Sub-attributes of assurance are listed in Table 3.4.
iv. Financial
Evaluation of this attribute will help customers prepare a cost-benefit analysis for cloud product or service adoption. It provides an idea about the cost involved and the billing process. The sub-attributes are billing process and cost. The billing process provides details about the interval at which bills are generated. Cost covers the transition cost, recurring cost, service usage cost, and termination cost.
v. Performance
Evaluation of this attribute will prove the efficiency of the cloud product or services.
Sub attributes of performance are listed in Table 3.5.
vi. Security and privacy
This attribute will help customers check the safety and privacy measures followed by the provider. It also indicates the level of control maintained by the provider for service access, service data, and physical security. Table 3.6 lists the various sub-attributes of the security and privacy attribute.
vii. Usability
This attribute is used to evaluate the ease with which the cloud products can be
installed and used. Sub-attributes of usability are listed in Table 3.7.
All the above-mentioned quality attributes of ISO 9126, NIST, and CSMIC are updated periodically based on changing technology and the growing business adoption of cloud. These quality attributes have been taken into consideration in designing the reliability metrics, with the basic idea that the overall reliability of the product can be enhanced if the intermediate operations are maintained at high quality. The reliability metrics of the three cloud services, IaaS, PaaS, and SaaS, are explained in detail in the next chapter.
3.5 Categories of Cloud Reliability Metrics
(Figure 3.7: Classification of reliability factors. Requirement-based factors from prospective customers are evaluated using goodness of fit (chi-square test); feedback-based factors from existing customers are evaluated using dichotomous values (cumulative binomial distribution); standards-based factors form the third category.)
The metrics used for cloud reliability calculations are the quality attributes of the products. These metrics are defined in detail in the next chapter. Some of them are quantitative, while a few others are qualitative. The qualitative metrics are gathered through a questionnaire mechanism and are quantified. Metric evaluation of both types is based on various calculation methods. The three major classifications used in this book are
i. Expectation-based
ii. Usage-based
iii. Standards-based.
Figure 3.7 illustrates the categorization of reliability metrics used in the following
chapters. These are also named as Type I, Type II, and Type III metrics.
These are also referred to as Type I metrics. They are computed based on input from prospective customers, i.e., customers who are willing to adopt cloud services for business operations. Input for Type I metrics has to be provided by the end users of the organization, who should possess the following:
1. Clear understanding of the business operations
2. Knowledge about cloud product and services
3. List of modules that need to be moved to cloud platform
4. Amount of data that needs to be migrated
5. Details of on-premise modules that have to interoperate with cloud products
6. Security and compliance requirements
7. Risk mitigation measures expected from cloud services
8. List of data center location choices (if any)
Prospective customers will approach the proposed reliability evaluation model with a collection of shortlisted cloud products or services. The model will assist them in choosing the cloud product that suits their business needs with the help of the above inputs provided by them.
The customer requirements based on the above-listed points are accepted using a questionnaire. The business requirements of the prospective customer are checked against the actual offerings of the SaaS product. The formula used for this check is
Reliability = Number of features offered by the product / Total number of features required by the customer   (3.1)
Based on this calculation, the success probability of the product is measured. The
inclusion of this type of factor enhances the customer orientation of the proposed
model.
Example 3.1
If a customer expects 10 features to be present in a product or service and 8 of them are offered by the provider, then the probability value of 8/10 = 0.8 is the reliability value.
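Equation (3.1) can be sketched in a few lines of code. This is an illustrative sketch, not part of the book's model; the function name and the feature lists are assumptions.

```python
def expectation_reliability(features_required, features_offered):
    """Type I (expectation-based) reliability, Eq. (3.1): the fraction of
    features required by the customer that the product actually offers."""
    required = set(features_required)
    offered = set(features_offered)
    if not required:
        raise ValueError("at least one required feature must be listed")
    return len(required & offered) / len(required)

# Example 3.1: the customer requires 10 features and 8 are offered.
required = [f"feature_{i}" for i in range(10)]
offered = required[:8]
print(expectation_reliability(required, offered))  # 0.8
```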
These metrics are also referred to as Type II metrics. They provide an insight into the reliability of the service provided by checking conformance with the assured services. The rendered services are compared with those guaranteed, and based on this comparison the reliability values are computed. The product usage experiences of the existing customers are gathered using a questionnaire; these are the actual working details of the cloud products. The assured working details are gathered from the product catalog. The two are compared in one of the following ways to calculate the probability of the cloud products or services: based on the nature of the gathered value, either the chi-square test, the binomial distribution method, or the simple division method is used to identify the reliability of the metric.
i. Chi-square test
The chi-square test is used with two types of data. One test is used with quantitative values to check goodness of fit, i.e., to determine whether the sample data matches the population. The other test is used with categorical values to test for independence; in this type of test, two variables are compared for a relation using a contingency table. In both tests, a higher chi-square value indicates that the sample does not match the population or that there is no relation; a lower chi-square value indicates that the sample matches the population or that there is a relation between the values.
In this book, the chi-square test method is used when the assured values for rendering services are numeric, for example assured updation frequency, availability, mirroring latency, backup frequency, etc. The expected values for these metrics are retrieved from the SLA, and the observed values are accepted from the existing customers. Computations between observed and expected values are performed, and the goodness of fit of the observed values with the assured values is used to compute the final reliability of the metric. The chi-square statistic is computed as

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, summed over i = 1 to n   (3.2)

where Oᵢ is the observed value and Eᵢ is the expected value of the i-th observation.
Example 3.2
Let us take the example of mark prediction versus actual marks earned in an exam. A course had 120 students, divided into 12 syndicates of 10 students each. The odd-numbered syndicates were assigned to class A and the even-numbered syndicates to class B. A class test was conducted in statistics. Based on interaction and caliber assessment, the tutor predicted that class B would average 90 marks and class A would average 75. Table 3.8 lists the actual average scores of the class test. Conduct a chi-square test to check the accuracy of the tutor's prediction.
The tutor's prediction for class B is 90 and for class A is 75. Hence syndicates 1, 3, 5, 7, 9, and 11 have an expected value of 75, and syndicates 2, 4, 6, 8, 10, and 12 have an expected value of 90. Table 3.9 shows the calculations for χ².
The sum of the last column in Table 3.9 is 1.80667; this is the χ² value. A small χ² value indicates that the tutor's prediction is correct: it matches the actual scores of the students of both sections.
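The goodness-of-fit computation in Example 3.2 can be sketched as follows. Since Tables 3.8 and 3.9 are not reproduced here, the observed syndicate averages below are hypothetical; only the expected values (75 for odd syndicates, 90 for even) come from the text.

```python
def chi_square(observed, expected):
    """Chi-square statistic, Eq. (3.2): sum of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Tutor's predictions: odd-numbered syndicates 75, even-numbered 90.
expected = [75 if i % 2 == 1 else 90 for i in range(1, 13)]
# Hypothetical observed class-test averages for the 12 syndicates.
observed = [73, 92, 76, 88, 74, 91, 77, 89, 75, 90, 74, 92]
print(round(chi_square(observed, expected), 5))
```

A small resulting value indicates that the observed averages fit the predictions, as in the book's Table 3.9.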
ii. Binomial distribution
Experiments with dichotomous outcomes, "success" or "failure", where the probability of success is the same in every trial, are called Bernoulli trials or binomial trials. A real-life example of binomial distribution usage is drug testing: assume a new drug is introduced in the market for curing some disease; it will either cure the disease, which is termed "success", or it will not, which is termed "failure".
The binomial distribution is used to evaluate metric values which indicate success in rendering the assured services or failure in meeting the SLA specification. The reliability of this type of value is calculated using the binomial distribution function; the value indicates whether the services were rendered or not. Examples are audit log success, log retention, recovery success, etc. The formula to calculate the probability of success is
f(x) = ⁿCₓ pˣ qⁿ⁻ˣ,   (3.3)
where
n is the number of trials
x is the count of successful trials
n − x is the count of failed trials
p is the success probability
q is the failure probability
f(x) is the probability of obtaining exactly x successful trials and n − x failed trials
The binomial coefficient ⁿCₓ is calculated as n!/((n − x)! × x!).
The average value of the binomial distribution is used to obtain the reliability of the provider in meeting the SLA specification:

F(x) = (Σ f(r), summed over r = 0 to n) / n   (3.4)
Example 3.3
A coin is tossed 10 times. What is the probability of getting six heads? The probability of getting heads and the probability of getting tails are the same (0.5).
The number of trials n is 10.
The probability of success p is 0.5.
The value of q is (1 − p), which is 0.5.
The number of successes, x, is 6.
The probability of getting heads six times when a coin is tossed 10 times is 0.205078125.
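Equation (3.3) and Example 3.3 can be checked directly; `math.comb` provides the binomial coefficient. The function name is an illustrative choice.

```python
from math import comb

def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n Bernoulli trials, Eq. (3.3)."""
    q = 1 - p  # failure probability
    return comb(n, x) * p**x * q**(n - x)

# Example 3.3: six heads in ten tosses of a fair coin.
print(binomial_pmf(10, 6, 0.5))  # 0.205078125
```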
iii. Simple division method
Some features, like response and resolution time for troubleshooting, or notification and incident reporting, are guaranteed in the SLA. Providers make claims such as "there will be immediate resolution of issues" or "all downtime and attacks will be reported". Numeric values for these types of claims are not guaranteed in the SLA. In such cases the simple division method is used for metric evaluation. It is a simple probability calculation based on the number of conforming sub-events and the total occurrences of the event. The formula used is

Reliability = Number of times the event has happened as per assurance / Total number of occurrences of the event   (3.5)
Example 3.4
Assume a company selling used cars has assured free service for one and a half years, with a gap of three months between services. In a span of 18 months the company should provide six services; this is the assured value calculated from the company's statement. The company also claims to have a good customer support and after-sales service system: free service reminders will be sent a week prior to the scheduled service date through a call and also through mail. The reality of the service notification reminders is accepted from the owner of the car. If the owner gets a service reminder before each service, it is counted as a success; if not, it is counted as a failure.
If the customer says all six services were provided and notifications were given a week prior to the scheduled service, then the success value is 6. The reliability of the company for rendering prior service notification is 1 (6 assured and 6 provided, so 6/6 = 1).
If the customer says that all six free services were given but a reminder call was received only for five services, then the reliability of the company for rendering prior service notification is 5/6 = 0.83.
This type of metric, also referred to as Type III metrics, is used to measure the adherence of the product to specified standards. Due to the cross-border and globally distributed working of cloud, various standards have to be maintained depending on the country in which the service is used. Organizations such as CSA, ISO/IEC, CSCC, AICPA, CSMIC, and ISACA are working towards setting standards for cloud service delivery, data privacy and security, data encryption policies, and other service organization controls (www.nist.gov). It is not mandatory for the CSP to acquire all available cloud standards; standards are to be maintained depending on the type of customers for whom the cloud services are being delivered. Conformance to these standards increases the customer base of the CSP, as it eventually increases the trust of the CC in the CSP. Conformance to the standards is revealed by acquiring certificates of standards. These certificates are issued after a successful auditing process and carry a validity period. Possession of a valid certificate is essential, as it reflects strict conformance to the standards. The quality standards are amended periodically depending on technology advancements; the CSP is expected to keep updated with new and emerging quality standards and include them appropriately in its service delivery.
The basic requirements of the standards designed by international standards organizations, which are maintained in the repository layer of the model, require updation depending on quality standard amendments. A subscription to these standards sites will provide alerts on modifications of existing standards and inclusion of new standards. This is used as a reminder to perform the standards repository updation process.
These are checked against the features offered by the product. The check is done using the formula

Reliability = Number of standards certificates possessed by the organization / Number of certificates suggested by the standards   (5.2)

Based on valid certificate possession, the success probability of the competency to match the standards requirements is measured.
Example 3.5
Every business establishment needs the required certifications to comply with government rules and regulations. Possession of these certifications will also help providers gain the trust of customers.
Assume a vendor supplies spices and grains to a chain of restaurants. The vendor should possess all certificates related to food safety and standards. If the vendor is in India, then the company must possess an FSSAI (Food Safety and Standards Authority of India) license, a health/trade license, company registrations, etc. The set of required licenses varies from country to country; apart from this, it also depends on the place and industry for which the services or products are provided.
If the grain supplier is expected to have five certificates and all five are possessed by the vendor, then the certificate possession metric will have the value 1 (5/5). If the grain supplier has only three of the five required certificates, then the metric value will be 3/5 = 0.60.
3.6 Summary
This chapter has detailed the need for metrics, the use of SOA and virtualization in cloud service delivery, and the various standards followed in cloud service delivery. SOA complements cloud usage, and cloud implementation enhances the flexibility of SOA implementation. Based on this, the reliability requirements of SOA are also discussed in detail. Virtualization is the backbone of cloud deployments, as it helps in efficient resource utilization and cost reduction. The reliability aspects of virtualization are discussed, as they form the base for the IaaS reliability metrics. Various quality requirements suggested by ISO 9126, NIST, and CSMIC are discussed in detail. The reliability metrics to be discussed in the next chapter are derived from these quality features. The chapter concludes with the classification of the metric types. Not all reliability metric values are of the same data type: they are numeric or Boolean, and the values are either quantitative or qualitative. Some of the metrics can be calculated directly from the product catalog, while others need to be calculated based on feedback accepted from the existing users. The categorizations are expectation-based (Type I), usage-based (Type II), and standards-based (Type III). The mathematical method of metric calculation is also explained with examples.
References
McGovern, J., Tyagi, S., Stevens, M., & Mathew, S. (2003). Java web services architecture. San Francisco, CA: Morgan Kaufmann Publishers.
NIST Special Publication Article. (2015). Cloud computing service metrics description. An article
published by NIST Cloud Computing Reference Architecture and Taxonomy Working Group.
Retrieved September 12, 2016 from http://dx.doi.org/10.6028/NIST.SP.307.
O’Brien, L., Merson, P., & Bass, L. (2007, May). Quality attributes for service-oriented architectures.
In Proceedings of the international workshop on systems development in SOA environments (p. 3).
IEEE Computer Society.
Ormandy, T. (2007). An empirical study into the security exposure to hosts of hostile virtualized
environments. In Proceedings of the CanSecWest applied security conference (pp. 1–10).
Pearce, M., Zeadally, S., & Hunt, R. (2013). Virtualization: Issues, security threats, and solutions.
ACM Computing Surveys (CSUR), 45(2), 17.
Raines, G. (2009). Cloud computing and SOA. Service Oriented architecture (SOA) series, systems
engineering at MITRE. Retrieved March 23, 2015 from www.mitre.org.
Vmware. (2015). 5 essential characteristics of a winning virtualization platform. Retrieved August,
2017 from https://www.vmware.com/content/…/pdf/solutions/vmw-5-essential-characteristics-
ebook.pdf.
Chapter 4
Reliability Metrics Formulation
4.1 Introduction
Reliability is often considered one of the quality factors. ISO 9126, an international standard for software evaluation, includes reliability as a prime quality attribute along with functionality, efficiency, usability, portability, and maintainability. The CSMIC has identified various quality metrics to be used for the comparison of cloud computing services (www.csmic.org). These are collectively named the SMI metrics, which include accountability, assurance, cost, performance, usability, privacy, and security. In the SMI metric collection, reliability is mentioned as a sub-factor of the assurance metric, which deals with the failure rate of service availability (Garg et al. 2013). IEEE 982.2-1988 states that a software reliability management program should encompass a balanced set of user quality attributes along with the identification of intermediate quality objectives. The main requirement of highly reliable software is the presence of high-quality attributes at each phase of the development life cycle, with the main intention of error prevention (Rosenberg et al. 1998). Mathematical assurance of the presence of high-quality attributes can further be used to evaluate the reliability of the entire product or service.
For example, consider a mobile phone. The reliability of mobile phone will be
evaluated based on the number of failures that occur during a time period. The failures
might occur due to mobile performance slow down, application crash, quick battery
drain, Wi-Fi connectivity issues, poor call quality, random crashes, etc. The best way to assure a failure-resistant mobile phone is to include checks at each step of mobile phone manufacturing. If the steps are performed as per specifications, following all standards and quality checks, then failures will reduce to a great extent. Reduced failures will increase the overall reliability of the product.
Metric quantification was already explained in the previous chapter. A quick recap of the metric categorization is given below:
i. Expectation-based metrics (Type I) are gathered from the business requirements of the user.
ii. Usage-based metrics (Type II) are gathered from the existing users of the cloud products or services to check the performance assurance.
iii. Standards-based metrics (Type III) check the adherence of the product to specified standards.
Type II metrics have a further categorization based on the type of values accepted from the existing customers. If the gathered values are numeric and the match between assured and actual has to be calculated, then the chi-square test method is used. If the value captured from the feedback is dichotomous, having "yes" or "no" values, then the binomial distribution method is used to calculate the reliability value of those metrics. Some metric assurances are provided as a simple statement in the SLA without any numerical value; performance values of these metrics are gathered as a count from the users and are evaluated using the simple division method. Further detailed information on the quantification methods is available in Sect. 3.5.
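The routing of a Type II metric to its evaluation method, as described above, can be sketched as a small dispatcher. This is a simplified illustration: the function name and value categories are assumptions, and the dichotomous case is reduced to the success fraction rather than a full binomial treatment.

```python
def type2_reliability(assured, observed, kind):
    """Evaluate a Type II (usage-based) metric by the kind of value gathered."""
    if kind == "numeric":
        # Chi-square style goodness of fit: smaller means closer to assured.
        return sum((o - e) ** 2 / e for o, e in zip(observed, assured))
    if kind == "dichotomous":
        # observed is a list of per-trial success/failure booleans.
        return sum(observed) / len(observed)
    if kind == "statement":
        # observed = count of events delivered as assured;
        # assured = total assured occurrences of the event.
        return observed / assured
    raise ValueError(f"unknown metric kind: {kind}")

print(type2_reliability([99.9], [99.5], "numeric"))       # near 0: good fit
print(type2_reliability(None, [True, True, False, True], "dichotomous"))  # 0.75
print(type2_reliability(6, 5, "statement"))               # 5 of 6 notifications
```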
The reliability metric discussion in this chapter is divided into common metrics and model-specific metrics. Under each model (IaaS, PaaS, and SaaS), the metrics specific to that model and the hierarchical framework of all the metrics are provided. As discussed already, these metrics are taken from the various literature and standards documentation available. As the cloud computing paradigm is evolving rapidly, these metrics have to be updated from time to time.
Some reliability metrics are so vital that they are present in all types of cloud service models. These are considered common cloud reliability metrics and are discussed below.
The common metrics, irrespective of the service model, are availability, support hours, scalability, usability, adherence to SLA, security certificates, built-in security, incidence reporting, regulatory compliance, and disaster management.
i. Availability
This metric indicates the ability of the cloud product or service to be accessible and usable by an authorized entity at the time of demand. This is one of the key Service Level Objectives and is specified using numeric values in the SLA. Availability values such as 99.5, 99.9, or 99.99% mentioned in the SLA help to attract customers. Hence it is imperative to include availability as a reliability metric to ensure continuity of service.
Availability is measured in terms of uptime of the service. The measurement should also account for the planned downtime for maintenance. The period of calculation could be daily, monthly, or yearly, in terms of hours, minutes, or seconds. The uptime calculation using the standard SLA formula is (Brussels 2014)

Uptime (%) = ((T_avl − T_tdt + T_mdt) / T_avl) × 100
where
T_avl is the total agreed available time,
T_tdt is the total downtime, and
T_mdt is the agreed maintenance downtime.
All the fields should follow the same period and unit of measurement. If the downtime T_tdt is considered in hours per month, then the rest of the values (T_avl and T_mdt) should be converted to hours per month. The assured available time is usually given as a percentage in the SLA; it needs to be converted to hours per month. For example, if the assured availability percentage is 99%, then the allowed downtime per month is calculated as
Total hours per month = 30 × 24 = 720 h
Availability percentage = 99%
Total uptime expected = 720 × 99/100 = 712.8 h/month
Total downtime = 720 − 712.8 = 7.2 h/month
Projected downtime per year = 7.2 × 12 = 86.4 h/year
Allowed downtime per year = 86.4/24 = 3.6 days/year
Likewise, the downtime in terms of hours/month, hours/year, or days/year can be calculated. 99% reliable systems have 3.65 days/year of downtime; three nines (99.9%) reliable systems have 0.365 days/year, i.e., about 8.76 h/year of downtime; four nines (99.99%) reliable systems have 52.56 min/year of downtime; and five nines (99.999%) reliable systems have 5.256 min/year of downtime. Each additional nine of cloud availability increases the cost, as the assurance of increased availability is achieved with the help of enhanced backup. The other way out is to design a system architecture that handles failovers during cloud outages. This can be implemented in any cloud technology with extra design and configuration effort and should be tested rigorously. Failover solutions are generally less expensive to implement in the cloud due to the on-demand, pay-as-you-go nature of cloud services.
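The "nines" figures quoted above can be reproduced with a short conversion; a 365-day year is assumed here, and the function name is illustrative.

```python
def allowed_downtime_minutes_per_year(availability_percent):
    """Convert an assured availability percentage into the allowed
    downtime per year, in minutes (365-day year assumed)."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_percent / 100) * minutes_per_year

# Two nines through five nines, matching the figures in the text:
# 99.99% allows about 52.56 min/year, 99.999% about 5.256 min/year.
for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {allowed_downtime_minutes_per_year(nines):.2f} min/year")
```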
ii. Support
This metric is used to measure the intensity of the support provided by the CSP to handle issues and queries raised by the CCs. Maintaining strict quality of this metric is of utmost importance, as positive feedback on this metric from existing customers will help to attract more customers. The efficiency of support is measured in terms of the process and time by which issues are resolved. Based on the specifications in the standard SLA guidelines, the success probability of support is calculated with the help of three different factors (Brussels 2014).
Support hours → the value of this factor, like 24 × 7 or 09–18 hrs, indicates the assured working hours during which a CC can communicate with the CSP for support or inquiry.
Support responsiveness → this factor indicates the maximum amount of time taken by the CSP to respond to a CC's request or inquiry. This refers to the time within which the resolution process starts; it denotes only the start of the resolution process, not its completion.
Resolution time → the value of this factor specifies the target time taken to completely resolve a CC's service request, for example an assurance in the SLA that any reported problem will be resolved within 24 hrs of service request generation.
iii. Scalability
This metric is used to identify the efficiency with which the dynamic provisioning feature of cloud is implemented. Scalability is an important feature which provides
4.2 Common Cloud Reliability Metrics 83
v. Adherence to SLA
A Service Level Agreement (SLA) is a binding agreement between the provider and the customer of a cloud product or service. It acts as a measuring scale to check the effective rendering of the services, i.e., service rendering as per assurance. It contains information covering different jurisdictions due to the geographically distributed and global working nature of cloud services. Currently, SLA terminology varies from provider to provider, and this increases the complexity of understanding SLAs. This has been addressed by C-SIG-SLA, a group formed by the European Commission in liaison with the ISO Cloud Computing working group, by devising standardization guidelines for cloud computing SLAs (Brussels 2014). This was done with the prime intention of bringing clarity to the agreement terms and making them comprehensive and comparable.
84 4 Reliability Metrics Formulation
The SLA contains Service Level Objectives (SLOs) specifying the assured efficiency level of the services provided by the CSP. Due to the different types of service provisioning and the globally distributed customer base, the SLOs include various specifications. The specifications required by the users have to be chosen depending on the business requirements. Various SLOs mentioned in the SLA are
a. Availability
b. Response time and throughput
c. Efficient simultaneous access to resources
d. Interoperability with on-premise applications
e. Customer support
f. Security incidence reporting
g. Logging and monitoring
h. Vulnerability management
i. Service charge specification
j. Data mirroring and backup facility
The above list needs to be updated based on the technology and business model
developments.
vi. Security Certificates
The presence of this metric is essential to give an assurance of security to the customers. Due to the integrated working model of cloud services, a set of certification acquisitions is important. A certified CSP stands a high chance of being selected by the customers, as certification gives the potential customer more confidence in the provider. Organizations such as ISACA, AICPA, and NIST have provided frameworks (such as COBIT) and certifications for evaluating IT security, and each certification has a different importance. The Service Organization Control reports, SOC 1 and SOC 2, contain details about the working of the cloud organization. A SOC 1 report contains the SSAE 16 audit details (SSAE 16 being the successor of the SAS 70 standard), certifying that the internal control design is adequate to meet the quality working requirements. A SOC 2 report, which is specifically relevant for SaaS providers, contains a comprehensive report certifying availability, security, processing integrity, and confidentiality. The list of certificates that can be acquired, with their validity, is given in Table 4.1 (www.iso.org, www.infocloud.gov.hk).
vii. Built-In Security
Robust, verifiable, and flexible authentication and authorization are essential built-in features of a cloud product or service. They help to enable secure data sharing among applications and storage locations. All the standard built-in security features expected to be present are listed in Table 4.2. The list is designed in accordance with CSA and CSCC recommendations and industry suggestions, and needs to be updated periodically based on CSA policy updates and cloud industry developments. The table also has two columns, "Presence in SLA" and "Customer Input". These two columns are used for the reliability calculations of the built-in security metric and are explained in the next section.
(Tech Target 2015). Various compliance certificates and their detailed descriptions are given in Table 4.3 (Singh and Kumar 2013).
x. Disaster Management
Operational disturbances need to be mitigated to ensure operational resiliency. Contingency plans, also termed Disaster Recovery (DR) plans, are used to ensure business continuity. The aim of DR is to provide the organization with a way to recover data or implement various failover mechanisms in the event of man-made or natural disasters. Most DR plans include procedures for making servers, data, and storage available through remote access; these are maintained and manipulated by the CSPs. There must be an effective failover system to a second site at times of hardware or software failure. Failback must also be in place, ensuring a return to the original system once the failures are resolved. The organization, on its part, must ensure the availability of the required network resources and bandwidth for the transfer of data between the primary data site and cloud storage. The organization must also ensure
proper encryption of the data leaving the organization. Testing of these DR activities must be carried out on isolated networks, without affecting the operational data activity (Rouse 2016).
SLAs for cloud disaster recovery must include guaranteed uptime, the Recovery Point Objective (RPO), and the Recovery Time Objective (RTO). The service cost of cloud-based DR depends on the speed with which failover is expected; faster failover needs higher cost investment.
xi. Financial Metrics
The final selection of the product needs to include financial considerations as well. This metric is used as a deciding tool to identify the financial viability of the product. The main objective is to assist the CCs in selecting the best cloud product or service based on their business requirements and within the planned budget. TCO reduction, low startup cost, and increased ROI are the metrics related to the financial benefits of cloud implementations.
TCO reduction is one of the prime cloud adoption factors: organizations need not make heavy initial investments, which reduces TCO drastically.
Low startup cost is the reason for TCO reduction. The minimum infrastructure investment required for cloud usage is a PC purchase and an Internet connection setup.
The profitability of any investment is judged by its ROI, and the ROI value improves over a span of time. The procedure to calculate the ROI of cloud applications is given in formula 4.24.
The metrics identified in the previous section are quantified using the categories explained in Chap. 3. The quantification method for a metric is decided by the user from whom the inputs are accepted and by the type of value accepted from the users. The quantification formulas for the common metrics are given below.
i. Availability (Type II Metric)
The probability of success of this metric for a cloud product or service is calculated based on the exact monthly uptime hours gathered from existing customers. This is taken as the observed uptime value for availability and is compared with the assured uptime mentioned in the SLA using a chi-square test. The corresponding success probability is calculated based on the chi-square value, with degrees of freedom equal to the number of customers − 1. The period of data collection and the unit of measurement used for uptime must be the same.
The formula for calculating the chi-square value is

CHI_AVL = \sum_{i=1}^{n} (avl_o − avl_e)^2 / avl_e,    (4.2)

where
avl_o is the observed average uptime for 6 months
avl_e is the assured uptime for 6 months
n is the total number of customers surveyed
Based on the CHI_AVL value and (n − 1) degrees of freedom, the probability Q value is calculated. Q calculations can be done through available websites, one of which is https://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html; it must be used in a browser that supports JavaScript.
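The same Q value can be computed locally; a minimal sketch of formula (4.2), assuming SciPy is available for the chi-square survival function. The assured uptime of 4,370 h over 6 months and the observed customer uptimes are illustrative figures, not taken from the text.

```python
# CHI_AVL per formula (4.2), then Q = P(chi-square >= CHI_AVL) with n-1 df.
from scipy.stats import chi2

def availability_q(observed_uptimes, assured_uptime):
    """Success probability Q for the availability metric."""
    n = len(observed_uptimes)
    chi_avl = sum((o - assured_uptime) ** 2 / assured_uptime
                  for o in observed_uptimes)
    return chi2.sf(chi_avl, n - 1)  # same Q as the fourmilab calculator

q = availability_q([4368.0, 4370.5, 4369.2, 4371.0], assured_uptime=4370.0)
print(f"success probability Q = {q:.4f}")
```

A Q value close to 1 indicates that the observed uptimes fit the SLA assurance well; a small Q flags a systematic shortfall.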
ii. Support Hours (Type II Metric)
The reliability of this metric is calculated based on three different values: support hours, support response time, and resolution time. These values are assured in the SLA and are taken as the expected values. The actual performance of the service or product is gathered from existing customers, and a chi-square test is applied to check for the best fit. The formula to calculate support hours reliability is

R_shrs = No. of successful calls placed during working hours / Total no. of calls attempted during working hours    (4.3)

Along with support hours, the support responsiveness and resolution time factors need to be calculated as follows:

R_resp = No. of services responded to within the maximum response time / Total no. of services done    (4.4)

R_rt = No. of services resolved within the target resolution time / Total no. of services done    (4.5)
After calculation of these sub-factors, the final support evaluation is done based on the average of all three values R_shrs, R_resp, and R_rt, and the formula for calculation is

R_Support = [ (\sum_{i=1}^{n} R_shrs(i))/n + (\sum_{i=1}^{n} R_resp(i))/n + (\sum_{i=1}^{n} R_rt(i))/n ] / 3,    (4.6)

where
n is the total number of customers involved in the feedback
R_shrs(i) is the support hours reliability of the ith customer
R_resp(i) is the support responsiveness reliability of the ith customer
R_rt(i) is the resolution time reliability of the ith customer
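A minimal sketch of formulas (4.3), (4.4), and (4.6): per-customer support ratios averaged into an overall support reliability. The per-customer call and service counts below are illustrative, not taken from the text.

```python
# Per-customer (R_shrs, R_resp, R_rt) ratios, then the overall average.

def ratio(success: int, total: int) -> float:
    """Success ratio such as formula (4.3) or (4.4); 0 if no attempts."""
    return success / total if total else 0.0

# (R_shrs, R_resp, R_rt) for each surveyed customer
customers = [
    (ratio(48, 50), ratio(45, 50), ratio(44, 50)),
    (ratio(30, 30), ratio(28, 30), ratio(27, 30)),
]

n = len(customers)
# Averaging the three sub-factors per customer, then over customers, equals
# formula (4.6)'s average of the three per-metric means (by linearity).
r_support = sum(sum(c) / 3 for c in customers) / n
print(f"R_Support = {r_support:.3f}")
```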
iii. Scalability (Type II Metric)
The scalability metric is measured as the time taken to scale. It is mentioned in the SLA as a range value (S_max and S_min), where S_max is the maximum time limit to scale and S_min is the minimum time limit to scale. A scaling process completed within these limits is considered a success, and a process that takes longer than S_max is considered a failure. The metric is calculated based on the number of scalings done and the number of successful scaling processes, taken from the feedback of existing customers. The formula used is

R_SCL = [ \sum_{i=1}^{m} \sum_{x=0}^{c_i} \binom{n}{x} p^x q^{n−x} ] / m,    (4.7)

where
m denotes the number of customers surveyed
c_i denotes the count of successful scalings done by the ith customer
n denotes the total number of scalings done by the ith customer
p denotes the probability of a scaling being a success, which is 0.5
q denotes the probability of failure, q = 1 − p
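A minimal sketch of formula (4.7), using only the standard library: each customer's count of successful scalings is scored with the cumulative binomial distribution (p = 0.5), and the scores are averaged over the customers. The feedback counts are illustrative; each customer here reports their own total, whereas the formula writes a common n.

```python
# Cumulative binomial scoring of per-customer scaling feedback.
from math import comb

def binom_cdf(c: int, n: int, p: float = 0.5) -> float:
    """Sum of binomial terms for x = 0..c, i.e. the inner sum of (4.7)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(c + 1))

# (successful scalings c_i, total scalings) per surveyed customer
feedback = [(9, 10), (7, 10), (10, 10)]

r_scl = sum(binom_cdf(c, n) for c, n in feedback) / len(feedback)
print(f"R_SCL = {r_scl:.4f}")
```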
iv. Usability (Type I Metric)
The measurement of this metric is based on the input accepted from the prospective customers, and hence it is a type I metric. The factor is measured in terms of the presence of usability features in the product: the more of these features are present, the higher the usability. It is calculated as

R_USBLTY = No. of usability features present / Total number of usability features    (4.8)
v. Adherence to SLA (Type II Metric)
The chi-square value for SLA adherence is calculated as

CHI_SLA = \sum_{i=1}^{n} (SLO_req − SLO_act)^2 / SLO_req,    (4.9)

where
SLO_req is the number of objectives required to be maintained
SLO_act is the number of objectives actually maintained
n is the total number of customers from whom the feedback is gathered
Based on the CHI_SLA value and (n − 1) degrees of freedom, the probability Q value is calculated, as described for the availability metric. This Q will be the R_sla value.
vi. Security Certificates (Type I Metric)
This is a type I metric, calculated based on the input from the prospective users of cloud services; the inputs are provided based on the business requirements. Periodic auditing, an evaluation process, has to be carried out to identify the degree to which the audit criteria are fulfilled. This auditing is performed by a third party or an external organization, and it enables a systematic and documented process for providing evidence of the standard conformance maintained by the CSP. Conformance to standards and security certificates also increases the trust of the CC in the services provided. All certificates need to be renewed and should be within their validity period. The required security certificate details are accepted from the CC, and the possession of the valid required security certificates is used to measure this factor.

R_sec-cert = Count of required valid security certificates possessed / Total count of required valid security certificates    (4.10)
vii. Built-In Security
The "Presence in SLA" column is filled with the product security specification given in the SLA, and the "Customer Input" column holds the feedback value gathered from customers based on their experience of security issues or satisfaction. The built-in security metric is calculated from two components. R_fg is the reliability of the guaranteed security features and is calculated based on the features mentioned in the brochure against the features mentioned in the standards; type III metric calculation is used here. The formula for calculation is

R_fg = Number of security features guaranteed in the SLA / Total number of desired built-in security features    (4.12)
R_fa is the reliability of the provider in adhering to the guaranteed security features and is calculated using the chi-square method (the type II metric method) to find the best fit of the assured value with the actual value:

R_fa = \sum_{i=1}^{n} (F_o − F_e)^2 / F_e,    (4.13)

where
n is the total number of customers surveyed
F_o is the observed count of features rendered
F_e is the expected count of assured features based on the SLA
viii. Incidence Reporting (Type II Metric)
Security incidents are expected to be rare, but when they happen they have to be reported to the customer. The value of this metric is calculated based on the reporting efficiency, which can be gathered from the existing customers. The formula to calculate the reliability of the incidence reporting metric is

R_sec-inc-rep = (\sum_{i=1}^{n} REP_ef(i)) / n,    (4.14)
where
n is the number of customers involved in the feedback
REP_ef is the reporting efficiency experienced by the customer and is calculated as

REP_ef = Number of security incidents reported to the customer / Total number of security incidents that occurred    (4.15)
x. Disaster Management
This metric is calculated based on the inputs accepted from the current users of cloud products or services. DR capabilities are an inherent part of any cloud service where data replication takes place. This metric is very important for Small and Medium Enterprise (SME) customers, as they do not possess the in-house IT skills to take care of risk mitigation measures. The role of the CCs is to ensure the suitability of the DR plans for their business requirements. Organizations opting for cloud DR should have contingency plans apart from the cloud DR investment; this is needed to ensure business continuity in the worst-case scenario. The DR plan of the CSP should specify a comprehensive list of the DR features guaranteed to be maintained. The reliability of this factor is calculated as

R_DR = Count of required DR features offered by the CSP / Total number of required DR features    (4.17)
xi. Financial Metrics
Comparing the TCO of the existing on-premise application with that of the cloud offering gives the TCO reduction efficiency. The percentage of TCO reduction is calculated as

TCO_reduce = (TCO_on-premise − TCO_cloud) / TCO_on-premise    (4.18)

The TCO is calculated as the summation of the initial investment, referred to as the upfront cost, the operational cost, and the annual disinvestment cost (Kumar and Vidhyalakshmi 2013). The upfront cost is a one-time cost; the other two costs are incurred from the second year onward.
TCO = C_u + \sum_{i=2}^{n} (C_ad + C_o),    (4.19)

where
C_u is the upfront cost and is calculated as

C_u = C_h + C_d + C_t + C_ps + C_cust,    (4.20)

where
C_h denotes the cost of hardware
C_d denotes the cost of software development
C_t denotes the staff training cost for proper utilization of the software
C_ps is the professional consultancy cost
C_cust is the customization cost to suit the business requirements
C_ad is the annual disinvestment cost and is calculated as

C_ad = C_hmaint + C_smaint + C_pspt + C_cust,    (4.21)

where
C_hmaint denotes the hardware maintenance cost
C_smaint denotes the software maintenance cost
C_pspt denotes the professional support cost
C_cust is the customization cost to incorporate business changes
C_o is the operational cost and is calculated as

C_o = C_inet + C_pow + C_infra + C_adm,    (4.22)

where
C_inet refers to the Internet cost
C_pow refers to the cost of power utilized for ICT operations
C_infra refers to the floor space infrastructure cost
C_adm refers to the administration cost
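An illustrative sketch of formulas (4.18) and (4.19): TCO as the upfront cost plus the annual disinvestment and operational costs from year 2 onward, and the resulting TCO reduction against a cloud offering. All cost figures are made up for the example; only the structure follows the text.

```python
# TCO per formula (4.19) and TCO reduction per formula (4.18).

def tco(c_upfront: float, c_annual_disinvest: float,
        c_operational: float, years: int) -> float:
    # C_u is paid once; (C_ad + C_o) accrue from the second year onward.
    return c_upfront + (years - 1) * (c_annual_disinvest + c_operational)

tco_on_premise = tco(c_upfront=500_000, c_annual_disinvest=60_000,
                     c_operational=40_000, years=5)
tco_cloud = tco(c_upfront=50_000, c_annual_disinvest=10_000,
                c_operational=90_000, years=5)

tco_reduce = (tco_on_premise - tco_cloud) / tco_on_premise  # formula (4.18)
print(f"TCO reduction = {tco_reduce:.1%}")
```

Note that the cloud option trades a large upfront cost for higher operational cost, so the comparison is sensitive to the number of years considered.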
The increase in ROI is calculated by comparing the on-premise ROI with the ROI after cloud adoption:

ROI_increase = (ROI_SaaS − ROI_on-premise) / ROI_on-premise    (4.23)
Basic IT resources, such as servers for processing, data storage, and communication networks, are offered as services in the Infrastructure as a Service (IaaS) model. Virtualization is the base technique followed in IaaS for efficient resource sharing along with low cost and increased flexibility (CSCC 2015). Management of resources lies with the service provider, whereas some of the backup facilities may be left to the customers. Using IaaS is like moving the data center activity of the organization to the cloud environment. IaaS is more under the control of operators, who deal with decisions on server allocation, storage capacity, and the network topologies to be used.
Some of the metrics that are of specific importance to IaaS operations are location awareness, notification reports, sustainability, adaptability, elasticity, and throughput.
i. Location Awareness
Data location refers to the geographic location of the CC's data storage or data processing. The fundamental design principle of cloud applications permits data to be stored, processed, and transferred to any data center, server, or device operated by the service provider, wherever it is geographically located. This is basically done to provide service continuity in case of any data center service disruption, or to share the excess workload of one data center with another, less loaded data center to increase resource utilization. Choosing the correct data center is a crucial task, as relocating hardware is complex and time consuming, and a wrong choice will also lead to losses from a bad investment. A list of tips to be followed before choosing a data center, and to avoid a bad decision, is given below (Zeifman 2015).
a. Explore the physical location of the data center if possible
b. Connectivity evaluation
c. Security standards understanding
d. Understanding bandwidth limit and bursts costs
e. Power management and energy backup plans at data centers
Data or processing capacity might also be transferred to locations where the legislation does not guarantee the required level of data protection. The lawfulness of cross-border data transfers should be backed by one of the appropriate measures, viz. safe harbor arrangements, binding corporate rules, or EU model clauses (Brussels 2014).
4.3 Infrastructure as a Service 95
All IaaS-specific metrics are quantified using the metric categories explained in Sect. 3.5.
i. Location Awareness (Type II Metric)
The CSPs are expected to notify the CCs of data movement details, to keep them aware of the data location and to maintain openness and transparency of operations. The CSPs provide a list of prospective geographic locations to which the data could be moved, and some providers also give the option to choose the geographic location in which to store data. The efficiency of location awareness is calculated based on feedback from existing users. The SLA will provide an assurance that the CSP will stick to the chosen location; compliance with this assurance is examined by asking users about the reality. The count of correct data movements is captured from the existing users: if a data movement is to a location chosen by the user, it is counted as a correct data movement. The formula used for calculation is
R_LA = (\sum_{i=1}^{n} LA_eff(i)) / n,    (4.25)

where
n is the number of customers considered for feedback
LA_eff is the efficiency of data movement, which is calculated as

LA_eff = Count of correct data movements / Total number of data movements    (4.26)
ii. Notification Reports (Type II Metric)
Evaluation of this metric is done based on the feedback accepted from the existing users. Transparency assurance in the SLA is achieved through notifications to the customers, which includes communicating details of any policy changes, payment charge changes, downtime issues, and security breaches. The efficiency of notification is measured as

EFF_notify = Number of changes notified / Total number of changes    (4.27)

The reliability of this factor is calculated as the average of the notification efficiency feedback from existing customers.
R_notify = (\sum_{i=1}^{n} EFF_notify(i)) / n,    (4.28)

where
n denotes the number of customers surveyed
EFF_notify(i) denotes the notification efficiency of the ith customer
iii. Sustainability
DCiE is the percentage of the total facility power that is utilized by the IT equipment, such as compute, storage, and network. PUE is the ratio of the total energy utilized by the data center to the energy utilized by the IT equipment; the ideal PUE value of a data center is 1.0. The formula for calculating PUE, which is the inverse of DCiE, is (Garg et al. 2013)

PUE = Total Facility Energy / IT Equipment Energy    (4.29)
The Data center Performance Per Energy (DPPE) is another metric, used to correlate the performance of the data center with its emissions. The formula for calculating DPPE is (Garg et al. 2013)

DPPE = ITEU × ITEE × (1 / PUE) × (1 / (1 − GEC)),    (4.30)

where ITEU is the IT Equipment Utilization, the average utilization factor of all IT equipment in the data center. It denotes the degree of energy saving accomplished using virtualization and optimal utilization of IT resources. It is calculated as

ITEU = Total actual energy consumed by IT devices / Total energy specification by the manufacturer    (4.31)
ITEE is the IT Equipment Energy Efficiency, which represents the energy saving of the IT devices due to efficient usage, by extracting high processing capacity per unit of power consumption. It is calculated as

ITEE = (a × Server capacity + b × Network capacity + c × Storage capacity) / Total energy specification provided by the manufacturer    (4.32)

GEC (Green Energy Coefficient) represents the utilization of renewable energy in the data center. It is calculated as

GEC = Green energy consumed by the data center / Total energy consumed by the data center    (4.33)
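A minimal sketch of formulas (4.29) and (4.30): PUE from facility versus IT energy, and DPPE combining ITEU, ITEE, PUE, and the green energy coefficient GEC. All energy figures are illustrative, not measured values.

```python
# PUE per formula (4.29) and DPPE per formula (4.30).

def pue(total_facility_energy: float, it_equipment_energy: float) -> float:
    """Power Usage Effectiveness; ideal value is 1.0."""
    return total_facility_energy / it_equipment_energy

def dppe(iteu: float, itee: float, pue_val: float, gec: float) -> float:
    """Data center Performance Per Energy."""
    return iteu * itee * (1 / pue_val) * (1 / (1 - gec))

p = pue(total_facility_energy=1500.0, it_equipment_energy=1000.0)
d = dppe(iteu=0.6, itee=0.8, pue_val=p, gec=0.25)
print(f"PUE = {p:.2f}, DCiE = {1 / p:.1%}, DPPE = {d:.3f}")
```

Higher ITEU, ITEE, and GEC, and a PUE closer to 1.0, all push DPPE up, which is the direction of a greener data center.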
iv. Adaptability (Type II Metric)
This metric is measured using the time taken by the provider to adapt to a change. It is a type II metric, as the efficiency of adaptability is accepted from the existing users. Any adaptation that happens within the specified time delay is considered a successful adaptation, and one that goes beyond the allowed delay time is considered a failed adaptation. The formula to calculate this metric is

R_adapt = (\sum_{i=1}^{n} EFF_adapt(i)) / n,    (4.34)

where
n is the number of customers
EFF_adapt is calculated as

EFF_adapt = Number of successful adaptations / Total number of adaptations    (4.35)
v. Throughput
Throughput is the number of tasks completed by the cloud infrastructure in one unit of time. Assume an application has n tasks which are submitted to m machines at the cloud provider's end. Let T_m,n be the total execution time of all the tasks, and let T_o be the time taken for the overhead processes. The formula to calculate throughput efficiency is (Garg et al. 2013)

R_Tput = n / (T_m,n + T_o)    (4.36)
PaaS adoption also provides facilities that enable applications to take advantage of the native characteristics of the cloud system without the addition of any special code. It also facilitates the building of "born on the cloud" applications without requiring any specialized programming skills (CSCC 2015).
Metrics specific to the PaaS service model are audit logs, development tools, provision of runtime applications, portability, service provisioning, rapid deployment mechanisms, etc.
i. Audit Logs
Logs can be termed the "flight data recorder" of IT operations. Logging is the process of recording data related to the processes handled by servers, networking nodes, applications, client devices, and cloud service usage. The logging and monitoring activities for a cloud service are done by the CSP. These log files are huge, and the analysis process is very complex; centralized data logging needs to be followed by the CSP to reduce the complexity. Cloud-based log management solutions, identified as Logging as a Service, are available and can be used to gain insight from the log entries. This is very essential in a development environment, as the log files are used for tracing errors and server-process-related activities.
The CC can use log files to monitor day-to-day cloud service usage details. They are also used by CCs for analyzing security breaches or failures, if any. The parameters captured in the log file, the accessibility of the log file to the CCs, and the retention period of the log files will be mentioned in the SLA.
ii. Development Tools
The PaaS service model aims at developing applications on the go and streamlining the development process. It helps to support DevOps by removing the separation between development and operations, a separation that is most common in in-house application development. PaaS systems provide tools for code editing, code repositories, development, runtime code building, testing, security checking, and service provisioning. Tools also exist for control and analytics activities such as monitoring, analytics services, logging, log analysis, app usage analytics, dashboard visualization, etc.
iii. Portability
Many PaaS systems are designed for greenfield application development, where the applications are primarily designed to be built and deployed using the cloud environment. These applications can be ported to any PaaS deployment with little or no modification. Sometimes, applications built on a non-cloud development framework are hosted in a PaaS environment. This raises questions: will the ported applications function without any errors, and can they get the complete benefits of the PaaS environment? (CSCC 2015)
EFF_log_acc = \sum_{x=0}^{c} \binom{n}{x} p^x (1 − p)^{n−x},    (4.38)

where
n denotes the total number of customers surveyed
c denotes the count of successful log file accesses
p denotes the probability of an access being a success, which is 0.5

EFF_log_ret is the efficiency of the log data retention, which is calculated using the cumulative distribution of the binomial function:

EFF_log_ret = \sum_{x=0}^{c} \binom{n}{x} p^x (1 − p)^{n−x},    (4.39)

where
n denotes the total number of customers surveyed
c denotes the count of successful log file retentions
p denotes the probability of a retention being a success, which is 0.5
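A minimal sketch of formulas (4.38) and (4.39): log-file access and retention efficiencies scored as cumulative binomial probabilities with p = 0.5 over the n surveyed customers, of whom c reported success. The counts are illustrative.

```python
# Cumulative binomial efficiency for log access and retention feedback.
from math import comb

def eff_binomial(c: int, n: int, p: float = 0.5) -> float:
    """Sum of binomial terms for x = 0..c, per formulas (4.38)/(4.39)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(c + 1))

eff_log_acc = eff_binomial(c=18, n=20)  # 18 of 20 customers accessed logs OK
eff_log_ret = eff_binomial(c=16, n=20)  # 16 of 20 saw retention honoured
print(f"EFF_log_acc = {eff_log_acc:.4f}, EFF_log_ret = {eff_log_ret:.4f}")
```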
ii. Development Tools (Type I Metric)
This metric is used to evaluate the efficiency of the PaaS provider in providing the development environment. The presence of the various development tools essential for rapid application development will increase the chances of selection. This is a type I metric, as the evaluation is done based on the requirements of the PaaS developer. The formula for development tools reliability evaluation is
4.4 Platform as a Service 101
R_dev_tool = (\sum_{i=1}^{n} Tool_i) / n,    (4.40)

where
n is the total number of tools required by the developer
Tool_i is 1 if the ith required development tool is available with the PaaS provider, and 0 otherwise

The final evaluated value of this metric is expected to be 1, indicating the presence of all the required tools.
iii. Portability (Type III Metric)
The efficiency of this metric is identified using feedback from the existing customers, and it is measured based on the efficiency of the porting process. Any porting process that ends in the application executing without any errors or conversion requirement is considered a successful porting. If the porting process takes more than the assured time, or the ported application fails to execute, it is considered a failed porting. As the measured values are dichotomous, with "success" and "failure", the binomial method of type II metric calculation is used for the portability metric evaluation.

R_port = [ \sum_{i=1}^{m} \sum_{x=0}^{c_i} \binom{n}{x} p^x q^{n−x} ] / m,    (4.41)

where
m denotes the number of developers surveyed
c_i denotes the count of successful portings done by the ith developer
n denotes the total number of portings done by the ith developer
p denotes the probability of a successful porting, which is 0.5
Apart from the metrics defined in Sect. 4.2, which are common to all service models, there are a few metrics specific to SaaS operations. SaaS applications are adopted mostly by MSME customers due to the lower IT overhead, faster time to market, and control of the entire application by the CSP. The entire business operation will depend on the failure-free working of the SaaS, and the various SaaS-specific metrics are discussed below.
i. Workflow Match
The cloud product market is flooded with numerous applications that perform the same business process. For example, the SaaS products available for accounting processes include FreshBooks, QuickBooks, Intacct, Kashoo, Zoho Books, Clear Books, Wave Accounting, FinancialForce, etc. Customers are faced with the herculean task of choosing a suitable product from this wide array of SaaS products, and this metric can be used as a first filter to identify the product or products suitable for the business process. Not all organizations have the same business processes, and not all applications have the same functionality. Some organizations have an organized way of working while others do not. Some may want to stick to their operational profile and would want the software to be customized; others may want business streamlining and would want to include standard procedures in the business so as to achieve a global presence. This metric provides an opportunity to analyze the business processes and list out the requirements.
ii. Interoperability
This refers to the ability of two or more applications to exchange and mutually use information. These applications could be existing on-premise applications or applications used from other CSPs. This is an essential feature of a SaaS product, as SaaS products are built by integrating loosely coupled modules that need proper orchestration so as to operate accurately and securely irrespective of the platforms used or the hosting locations. Some organizations may also have proprietary modules which need to be maintained along with the cloud applications; in such cases, the selected cloud application has to interoperate with the on-premise applications without any data loss and with little coding requirement. The success of this feature also assures the ability of the product to interact with other products offered by the same provider or by other providers. Maintaining a high interoperability index also eliminates the risk of being locked in with a single provider.
Reliability of this metric is of major importance for those SaaS applications which are used alongside on-premise applications. It is also important in cases where an organization uses SaaS products from various vendors to accomplish its tasks. The success of interoperability depends on the degree to which the provider uses open or published file formats, architectures, and protocols.
4.5 Software as a Service 103
Different types of backup sites exist, such as hot sites, warm sites, and cold sites. Hot sites are used for the backup of critical business data, as they are up and running continuously and the failover takes place with minimum lag time. Hot sites must be online and should be located away from the original site. Warm sites are more cost attractive than hot sites and are used for less critical operation backups; the failover process from a warm site takes more time than from a hot site. Cold sites are the cheapest for backup operations, but the switch-over process is very time consuming when compared to hot and warm sites (www.Omnisecu.com). In addition to the automatic backup processes, the CCs can utilize the data export facility to export data to another location or to another CSP, to maintain continuity in case of disaster or failure.
Periodic backups are conducted automatically by the CSP and have to be tested by the CCs at specific intervals of time. Local backups are expected to be completed within 24 h, and offsite backups are taken daily or weekly based on the SaaS offering and the business requirement. The choice of backup method needs to be taken from the CCs.
vi. Recovery Process
Cloud recovery processes, such as automatic failover and redirection of users to replicated servers, need to be handled efficiently by the CSP, enabling the CCs to perform their operations even in times of critical server failure (BizTechReports 2010). The guaranteed RPO and RTO will be specified in the SLA. The RPO indicates the acceptable time window between recovery points: a smaller time window calls for mirroring of both persistent and transient data, while a larger window results in a periodic backup process. The RTO is the maximum acceptable amount of business interruption time (Brussels 2014). The CCs should prepare their recovery plans by determining the acceptable RPO and RTO for the services they use, and ensure their adherence to the recovery plans of the CSP.
are satisfied by the product are marked as 1, and those that are not satisfied are marked as 0. The count of the "1"s gives the workflow match value of the product. The
formula to calculate the workflow match of a product is
$$R_{WFM} = \frac{\sum_{i=1}^{n} Req_i}{n}, \qquad (4.42)$$
where

n is the total number of workflow requirements of the customer
Req_i is 1 if workflow requirement i is matched by the product and 0 otherwise

The desired R_WFM value for a product is 1, which indicates that all the requirements are satisfied and the product is reliable in terms of workflow match.
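The Workflow Match calculation of Eq. (4.42) can be sketched as follows; the requirement names and the product's feature set are hypothetical examples, not taken from the chapter.

```python
# Illustrative sketch of the Workflow Match (WFM) metric, Eq. (4.42).
# Requirement and feature names below are hypothetical.

def workflow_match(requirements, product_features):
    """Mark each customer requirement 1 if the product satisfies it, else 0,
    and return the fraction of satisfied requirements (R_WFM)."""
    marks = [1 if req in product_features else 0 for req in requirements]
    return sum(marks) / len(requirements)

reqs = ["invoicing", "inventory", "payroll", "reporting"]
features = {"invoicing", "inventory", "reporting", "crm"}
print(workflow_match(reqs, features))  # 3 of 4 requirements matched -> 0.75
```

A value below 1 immediately shows how many customer workflows the shortlisted product leaves uncovered.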
ii. Interoperability (Type I Metric)
This metric is measured based on the facility to transfer data effortlessly, without data loss or re-entry. The value for this metric is gathered from prospective customers who are willing to adopt SaaS for their business operations; hence it is a Type I metric. Greater efficiency on this metric can be achieved if the data needed for interaction are in an open standard file format (e.g., CSV or NoSQL formats) so that they can be used without any conversion.
This factor is measured based on the time taken for the conversion process. Let T_min be the threshold conversion time set by the customer. All interacting modules that take more than T_min will eventually affect the interoperability of the product, which is calculated as
$$R_{INTROP} = \frac{\sum_{i=1}^{m} M_i}{n}, \qquad (4.43)$$
where

M_i is a module whose data conversion time exceeds T_min
n is the total number of modules that need to interact

The desired value of R_INTROP is 1, indicating smooth interaction between modules with little or no conversion overhead.
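A small sketch of this threshold test follows. Note an interpretive assumption: the chapter defines M_i via modules exceeding T_min yet names 1 as the desired value, so the sketch reads the metric as the fraction of modules whose conversion completes within the threshold; the per-module times are hypothetical.

```python
# Sketch of the interoperability metric around Eq. (4.43).
# Assumption: the metric is read as the fraction of modules converting
# within the customer-set threshold, so that 1 means smooth interaction.

def interoperability(conversion_times, t_min):
    within = sum(1 for t in conversion_times if t <= t_min)
    return within / len(conversion_times)

times = [0.8, 1.2, 0.5, 2.5]   # hypothetical per-module conversion times (s)
print(interoperability(times, t_min=2.0))  # 3 of 4 modules within threshold -> 0.75
```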
iii. Ease of Migration (Type II Metric)
This metric is evaluated based on data gathered from existing customers who have used the SaaS product. A chi-square test is used to find the goodness of fit of the observed values with the expected values, and the corresponding probability is calculated for both data conversion and training; their average is taken as the final migration value of the product.
$$R_{MIGR} = \frac{R_{dc} + R_{tr}}{2}, \qquad (4.44)$$
where
Rdc is the reliability of data conversion calculated from MIGRdc
Rtr is the reliability of training for staff from MIGRtr
$$R_{dc} = \sum_{i=1}^{n} \frac{(DC_o - DC_e)^2}{DC_e}, \qquad (4.45)$$
where
DCo is the actual data conversion time
DCe is the assured or expected data conversion time
n is the total customers surveyed
$$R_{tr} = \sum_{i=1}^{n} \frac{(TR_o - TR_e)^2}{TR_e}, \qquad (4.46)$$
where
TRo is the actual time taken for training
TRe is the assured or expected time for training
n is the total customers surveyed
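Eqs. (4.44)–(4.46) can be sketched as below; the observed and assured times are hypothetical. The chapter's final step of mapping each statistic to a probability (e.g., via `scipy.stats.chi2.sf`) is omitted here, so the sketch computes only the chi-square statistics and their average.

```python
# Sketch of the Ease-of-Migration metric, Eqs. (4.44)-(4.46).
# All observed/assured time figures are hypothetical examples.

def chi_square_stat(observed, expected):
    """Chi-square goodness-of-fit statistic over paired observations."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

dc_observed, dc_expected = [10, 12, 9], [10, 10, 10]   # data conversion hours
tr_observed, tr_expected = [5, 7, 6], [6, 6, 6]        # training hours

r_dc = chi_square_stat(dc_observed, dc_expected)       # Eq. (4.45)
r_tr = chi_square_stat(tr_observed, tr_expected)       # Eq. (4.46)
r_migr = (r_dc + r_tr) / 2                             # Eq. (4.44)
print(round(r_migr, 3))
```

A statistic near 0 means the observed conversion and training times closely track the assured values.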
iv. Updation Frequency (Type II Metric)
This metric assures customers that they are kept in pace with the latest technology. The desired software updation frequency is 3–6 months, and the assured frequency is specified in the SLA. The measurement of this factor is based on feedback from existing customers, with the updation frequency considered over a period of 24 months. The factor is calculated as
$$R_{UPFRQ} = \frac{\sum_{i=1}^{n} U_{ai}/n}{U_p}, \qquad (4.47)$$
where

U_ai is the actual number of software updations experienced by customer i
n is the number of customers surveyed
U_p is the proposed number of updations, calculated from the updation frequency specified in the SLA with respect to the feedback period
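As a small numerical sketch of Eq. (4.47): with an assured update every 4 months over a 24-month feedback window, U_p = 24/4 = 6; the per-customer counts below are hypothetical.

```python
# Sketch of the Updation Frequency metric, Eq. (4.47).
# The SLA interval and customer counts are hypothetical.

def updation_frequency(actual_updates, proposed):
    """Average experienced updates across customers, relative to U_p."""
    return (sum(actual_updates) / len(actual_updates)) / proposed

u_p = 24 // 4                    # proposed updates in the 24-month period
experienced = [5, 6, 4]          # updates seen by each surveyed customer
print(round(updation_frequency(experienced, u_p), 3))  # mean 5 of 6 -> 0.833
```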
v. Backup Frequency (Type II Metric)
The reliability of the backup frequency metric is calculated based on feedback from existing customers. The average of the cumulative efficiencies of mirroring latency, data backup frequency, and backup retention time provides the reliability of this metric. These values are gathered over a duration of 6 months.
$$R_{Backup} = \frac{\dfrac{\sum_{i=1}^{n} R_{mirror}(i)}{n} + \dfrac{\sum_{i=1}^{n} R_{bck\text{-}frq}(i)}{n} + \dfrac{\sum_{i=1}^{n} R_{brt}(i)}{n}}{3}, \qquad (4.48)$$
where
n is the total number of customers involved in the feedback
Rmirror (i) is the mirroring reliability of the ith customer calculated using chi-square
test of the assured latency and experienced latency
$$R_{mirror} = \sum_{i=1}^{n} \frac{(ml_o - ml_e)^2}{ml_e}, \qquad (4.49)$$
where
mlo is the observed mirror latency, mle is the assured mirror latency and n is
the total customers surveyed
Rbck-frq (i) is the data backup frequency reliability of the ith customer calculated
using chi-square test of the assured frequency and experienced frequency
$$R_{bck\text{-}frq} = \sum_{i=1}^{n} \frac{(bf_o - bf_e)^2}{bf_e}, \qquad (4.50)$$
where
bfo is the observed backup frequency, bfe is the assured backup frequency and
n is the total customers surveyed
Rbrt (i) is the backup retention time reliability of the ith customer calculated using
chi-square test of the assured retention time and experienced retention time
$$R_{brt} = \sum_{i=1}^{n} \frac{(brt_o - brt_e)^2}{brt_e}, \qquad (4.51)$$
where
brto is the observed backup retention time
brte is the assured backup retention time and n is the total customers surveyed
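The three-component average of Eq. (4.48) can be sketched as follows; every observed/assured figure here is hypothetical, and each component is the chi-square statistic of Eqs. (4.49)–(4.51) averaged over customers.

```python
# Sketch of the backup reliability metric, Eqs. (4.48)-(4.51).
# All (observed, assured) series below are hypothetical.

def chi_sq(observed, assured):
    return sum((o - a) ** 2 / a for o, a in zip(observed, assured))

# per-customer (observed, assured) series for mirroring latency (s),
# backup frequency (per week), and backup retention time (days)
mirror  = [([1.1, 0.9], [1.0, 1.0]), ([1.0, 1.2], [1.0, 1.0])]
bck_frq = [([7, 6], [7, 7]), ([7, 7], [7, 7])]
brt     = [([30, 28], [30, 30]), ([30, 30], [30, 30])]

def mean_component(series):
    """Average chi-square statistic across surveyed customers."""
    return sum(chi_sq(o, a) for o, a in series) / len(series)

r_backup = (mean_component(mirror) + mean_component(bck_frq)
            + mean_component(brt)) / 3                  # Eq. (4.48)
print(round(r_backup, 4))
```

Values near 0 indicate that the experienced backup behaviour matches the assured SLA figures.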
vi. Recovery Process (Type II Metric)
The reliability of the recovery process is based on successful recoveries, i.e., those completed within the RTO and without any errors. Recovery processes not completed within the RTO, or completed within the RTO but with errors, are treated as failures for the reliability calculations. This metric is based on input from existing customers. The recovery process metric for a product is calculated as the average of the successful recoveries of the existing customers.
[Fig. 4.1 SaaS metric hierarchy; visible labels include SaaS Reliability, Security, Incidence Reporting, Usability, Notification Reports, and Updation Frequency]
$$R_{recovery} = \frac{\sum_{i=1}^{n} EFF_{rec}(i)}{n}, \qquad (4.52)$$
where
n denotes the number of customers surveyed
EFFrec (i) is the efficiency of recovery of ith customer which is calculated as
$$EFF_{rec} = \binom{n}{x} p^x (1-p)^{n-x},$$
where
n stands for the total number of recovery processes done
x stands for the number of successful recovery processes
p is the probability of the successful recovery which is 0.5
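Taking the binomial formula as literally stated (elsewhere the chapter mentions a cumulative binomial distribution, so this is one reading), Eq. (4.52) can be sketched as below; the per-customer attempt/success counts are hypothetical.

```python
# Sketch of the recovery-process metric, Eq. (4.52), using the binomial
# term as stated, with p = 0.5. Customer figures are hypothetical.
from math import comb

def eff_rec(n_attempts, x_success, p=0.5):
    """Binomial probability of x successful recoveries in n attempts."""
    return comb(n_attempts, x_success) * p**x_success * (1 - p)**(n_attempts - x_success)

customers = [(4, 4), (4, 3), (5, 5)]   # (recoveries attempted, succeeded)
r_recovery = sum(eff_rec(n, x) for n, x in customers) / len(customers)
print(round(r_recovery, 4))
```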
SaaS-specific reliability metrics and the common metrics are to be evaluated together to compute the reliability of a SaaS product. Some of the metrics specific to IaaS and PaaS are also considered when listing all SaaS metrics, as SaaS implementations are done with IaaS and PaaS assistance. The metrics collection is further grouped based on functionality. Figure 4.1 shows the SaaS metric hierarchy. The metrics of IaaS and PaaS can be categorized likewise.
4.6 Summary
This chapter has detailed the cloud reliability metrics. Some of the metrics, such as availability, security, regulatory compliance, incidence reporting, and adherence to SLA, are common to all three service models. The metrics specific to each service model (IaaS, PaaS, and SaaS) have been discussed in detail, and quantification details for these metrics have been explained for measurement purposes. The quantification method varies based on the type of users from whom the metric values are gathered. The three types of users are (a) customers or developers who are planning to opt for cloud services, (b) customers or developers who are already using cloud applications or services, and (c) cloud brokers who keep track of cloud standards and cloud service operations. A 360° view covering all the characteristics of cloud reliability evaluation has been presented in this chapter.
References
Singh, J., & Kumar, V. (2013). Compliance and regulatory standards for cloud computing. A volume
in IGI global book series advances in e-business research (AEBR). https://doi.org/10.4018/978-
1-4666-4209-6.ch006.
Tech Target Whitepaper. (2015). Regaining control of the cloud. Information Security, 17(8).
Retrieved October 3, 2015 from www.techtarget.com.
Vidhyalakshmi, R., & Kumar, V. (2016). Determinants of cloud computing adoption by SMEs.
International Journal of Business Information Systems, 22(3), 375–395.
Zeifman, I. (2015). 12 tips for choosing data center location. Retrieved April, 2017 from https://www.incapsula.com/blog/choosing-data-center-location.html.
Chapter 5
Reliability Model
5.1 Introduction
Reliability metrics for all the service models, IaaS, PaaS, and SaaS, were discussed in detail in the previous chapter. The final reliability value has to be computed from these metrics. As already discussed in detail in Chap. 2, reliability is user-oriented; hence, not all metrics are of equal importance. The priorities of the reliability metrics vary with the business requirements of the customers. For example, security is considered an important factor for a financial organization but may be of less importance for an academic setup. Multiple platform provisioning is of prime importance for academic users but of little importance for Small and Medium Enterprises (SMEs) adopting cloud services for their business operations. Due to this variance, factor ranking cannot be standardized, and ranking the reliability factors based on user priority is essential.
The presence of multiple metrics whose importance varies with the business requirement, together with the complexity involved in metric representation, excludes the use of traditional methods for reliability evaluation. Conventional methods such as weighted-sum or weighted-product based methods are avoided, and Multiple-Criteria Decision-Making (MCDM) methods are used. MCDM is a subfield of operations research that utilizes mathematical and computational tools to assist decision-making in the presence of multiple quantifiable and nonquantifiable factors. There are numerous categories of MCDM methods; of these, the Analytic Hierarchy Process (AHP) is chosen due to the hierarchical categorization of the reliability metrics.
A model named Customer-Oriented Reliability Evaluation (CORE) (Vidhyalakshmi and Kumar 2017) is explained in this chapter, which helps customers calculate the reliability of the cloud services they have chosen. The user of the model should be a person with sound knowledge of cloud services and technology. It could be the customer or a broker. When customers are naïve to cloud usage, the broker-based method can be used. The brokers are also termed Reliability Evaluators (REs). The role of the REs is to provide guidance to customers and educate them about the following:
i. Streamline business processes to meet global standards
ii. Features to look for in any cloud services
iii. Risk mitigation measures
iv. Ways to prioritize reliability metrics
v. Monitoring of cloud services
vi. Standards and certifications required for compliance
The CORE model has three layers: User Preference Layer, Reliability Evaluation Layer, and Repository Layer. All the standards- and usage-based metrics defined in Chap. 3 under Sect. 3.4 are calculated for popular cloud products and stored in the Repository Layer. The customer preferences are evaluated as metric priorities and stored in the User Preference Layer. The priorities and metric values are used by
the middle layer (i.e., the Reliability Evaluation Layer) to arrive at the final reliability value. These calculated values are also time-stamped and stored in the Repository Layer. This helps to identify the growth of the cloud service or product over a period of time.
The CORE model will be helpful for cloud users and cloud service providers. Cloud users can identify a reliable cloud product that suits their business. Naïve users get a chance to streamline business processes to meet global standards and enhance their web presence. Customers also gain knowledge about what to expect from an efficient cloud product. Existing cloud users benefit from the model implementation, as it enables them to monitor performance, which is essential to keep the operational cost under control. The CORE model will also assist Cloud Service Providers in many ways. Interaction with Reliability Evaluators helps providers gain an insight into customer requirements and enhance product features, resulting in improved product quality and customer base. Some of the other benefits for the service providers are
i. Enhancement of their product.
ii. Creation of healthy competition among the providers.
iii. Keeping track of their product performance.
iv. Knowing the position of their product in the market.
v. Gaining information about the performance of the competitors' products.
[Figure: MCDM decision process — a formulation phase (identification of decision parameters, performance evaluation) followed by a selection phase (implementation of the selected method, result evaluation, decision)]
Various methods used in MCDM are weighted sum, weighted product, Analytic Hierarchy Process (AHP), Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE), Elimination and Choice Translating Reality (ELECTRE), Multiattribute Utility Theory (MAUT), Analytic Network Process, goal programming, fuzzy methods, Data Envelopment Analysis (DEA), Gray Relation Analysis (GRA), etc. (Whaiduzzaman et al. 2014). All these methods share the common characteristics of divergent criteria, difficulty in the selection of alternatives, and unique units. They are solved using evaluation, decision, and payoff matrices. Some of the MCDM methods are discussed below.
i. Multiattribute Utility Theory: The preferences of the decision maker are captured in the form of a utility function defined over a set of factors. Preferences are given on a scale of 0–1, with 0 being the worst preference and 1 the best. The overall utility function is composed from single-attribute utility functions additively or multiplicatively (Keeney and Raiffa 1976).
ii. Goal Programming: This is a branch of multiobjective optimization and a generalization of linear programming techniques. It is used to achieve goals subject to dynamically changing and conflicting objective constraints, with the help of slack and a few other variables that represent deviation from the goal. Unwanted deviations from a collection of target values are minimized. It is used to determine the resources required and the degree of goal attainment, and to provide the best optimal solution under dynamically varying resources and goal priorities (Schniederjans 1995).
iii. Preference Ranking Organization Method for Enrichment Evaluation: This method is abbreviated as PROMETHEE. It uses the outranking principle to provide priorities for the alternatives. Ranking is done based on pairwise comparison with respect to a number of criteria. It provides the best suitable solution rather than the single right decision. Six general criterion functions are used: the Gaussian criterion, level criterion, quasi criterion, usual criterion, criterion with linear preference, and criterion with linear preference and indifference area (Brans et al. 1986).
iv. Elimination and Choice Translating Reality: ELECTRE is the abbreviation for this method. Both qualitative and quantitative criteria are handled by this method. The best alternative is chosen with respect to most of the criteria. A concordance index, a discordance index, and threshold values are used, and based on these indices, graphs for strong and weak relationships are developed. Iterative procedures are applied to the graphs to get the ranking of the alternatives (Roy 1985).
AHP, developed by Thomas L. Saaty, is used in this research for the quantification of factors, as it is a suitable method for qualitative and quantitative factors and for pairwise comparisons of factors arranged in hierarchical order.
It utilizes the decisions of experts to derive the priorities of the factors and measures the alternatives using pairwise comparisons. The complex decision-making problem, involving various factors, is decomposed into clusters, and pairwise factor comparisons are done within each cluster. This
enables the decision-making problems to be solved easily with less cognitive load.
The following steps are followed in AHP (Saaty 2008):
i. Examine the problem and identify the aim of decision.
ii. Build the decision hierarchy with the goal at the top of the hierarchy along with
the factors identified at the next level. The intermediate levels are filled with
the compound factors with the lowest level of the hierarchy having the atomic
measurable factors.
iii. Construct the pairwise comparison matrix
iv. Calculate the Eigen vector iteratively to find the ranking of the factors.
The pairwise comparison is done by scaling the factors to indicate the importance of one factor over another. This process is done between factors at every level of the hierarchy. The fundamental scale used for factor comparison is given in Table 5.1. Cloud users wanting to evaluate the reliability of chosen cloud services should have a clear understanding of their business operations. Based on the requirements, the importance of one reliability metric over another has to be provided, using the absolute number scale specified in Table 5.1.
The comparison matrix is a square matrix of order equal to the number of factors being compared. Each cell of the matrix is filled depending on the preferences chosen by the customers. The diagonal elements of this matrix are 1. The rest of the elements are filled following the rule given below:
If factor i is five times more important than factor j, then CM_{i,j} is 5 and the transpose position CM_{j,i} is filled with its reciprocal value.
Let us take an example of a computer system purchase. Assume Customer A is an
end user who has decided to purchase a new computer system for his small business.
The factors to be looked for are hardware, software and vendor support. Three systems
such as S1, S2, and S3 are shortlisted. Now based on the usage, the factors are to be
provided significance based on the numbers mentioned in Table 5.1. The first step
[Figure: AHP hierarchy for the selection of a computer system]
After the creation of the comparison matrix, iterations of Eigen vector computation need to be done. The steps followed in the Eigen vector creation are
i. Square the comparison matrix by multiplying the matrix with itself.
ii. Add all the elements of the rows to create the row sum.
iii. Total the row sum values.
iv. Normalize the row sums to create the Eigen vector by dividing each row sum by the total of the row sums. The Eigen vector is calculated to four digits of precision.
The above steps are repeated until the difference between the Eigen vectors of the ith and (i − 1)th iterations is negligible with respect to four digits of precision. The sum of the Eigen vector elements is 1. The values of the Eigen vector identify the rank or order of the factors. These are assigned as the weights of the first-level factors and used in the final computation of reliability.
An alternate way is:
i. Multiply all values of a row and calculate the nth root, where "n" refers to the number of factors.
ii. Find the total of the nth root values.
iii. Normalize each row's nth root value by dividing it by the total.
iv. The resulting values are the Eigen vector values.
For the example of computer selection the Eigen vector is calculated as follows:
i. Multiply the comparison matrix with itself. The resultant matrix will be
$$\begin{bmatrix} 1 & 0.111 & 0.333 \\ 9 & 1 & 3 \\ 3 & 0.333 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 0.111 & 0.333 \\ 9 & 1 & 3 \\ 3 & 0.333 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 0.333 & 1 \\ 27 & 3 & 9 \\ 9 & 1 & 3 \end{bmatrix}$$
ii. Compute the row sum for each row and calculate the total of the row sum column.

$$\begin{bmatrix} 3 & 0.333 & 1 \\ 27 & 3 & 9 \\ 9 & 1 & 3 \end{bmatrix} \rightarrow \text{row sums} \begin{bmatrix} 4.333 \\ 39 \\ 13 \end{bmatrix}$$
iii. The total of the row sum column is 56.333. This value is used to normalize the row sum column:

$$\begin{bmatrix} 4.333 \\ 39 \\ 13 \end{bmatrix} \div 56.333 = \begin{bmatrix} 0.0769 \\ 0.6923 \\ 0.2308 \end{bmatrix}$$
iv. The last column is the Eigen vector [0.0769, 0.6923, 0.2308], also called the priority vector. Its elements add up to 1. The priority for hardware is 0.0769, for software 0.6923, and for vendor support 0.2308.
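The squaring-and-row-sum procedure above can be reproduced numerically for the computer-selection example (factors: hardware, software, vendor support); this is a sketch using NumPy, not the chapter's own code.

```python
# AHP priority vector via repeated matrix squaring and row-sum
# normalization, for the computer-selection comparison matrix.
import numpy as np

cm = np.array([[1, 1/9, 1/3],
               [9, 1,   3  ],
               [3, 1/3, 1  ]])

v_prev = None
while True:
    cm = cm @ cm                      # square the comparison matrix
    row_sum = cm.sum(axis=1)
    v = row_sum / row_sum.sum()       # normalized row sums = Eigen vector
    if v_prev is not None and np.allclose(v, v_prev, atol=1e-4):
        break                         # converged to four-digit precision
    v_prev = v

print(np.round(v, 4))                 # -> [0.0769 0.6923 0.2308]
```

For this matrix the judgments are perfectly consistent, so the vector converges after a single squaring.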
After the calculation of the Eigen vector, the Consistency Ratio (CR) is calculated to validate the consistency of the decision. The ability to calculate CR sets AHP ahead of other MCDM methods like Goal Programming, Multiattribute Utility Theory, choice experiments, etc. The four steps of CR calculation are
i. Compute the column sums of the comparison matrix. Multiply each column sum by its respective priority value.
ii. Calculate λmax as the sum of the multiplied values. The λmax value need not be 1.
iii. The Consistency Index (CI) is calculated as (λmax − n)/(n − 1), where n is the number of criteria.
iv. CR is calculated as CI/RI, where RI is the Random Index. RI is a direct function of the number of criteria being used. The RI lookup table provided by Thomas L. Saaty for 1–10 criteria is given in Table 5.3.
A lower CR value indicates consistent decision-making, whereas a higher CR value indicates that the decision is not consistent. A CR value ≤0.1 indicates that the decision is consistent. If the value of CR is >0.1, the decision maker should reconsider the pairwise comparison values.
For the computer selection example, the CR values are calculated as follows. The column sums of the comparison matrix are calculated first:

$$\begin{bmatrix} 1 & 0.111 & 0.333 \\ 9 & 1 & 3 \\ 3 & 0.333 & 1 \end{bmatrix}, \quad \text{column sums} = [13,\; 1.444,\; 4.333]$$

The column sums are then multiplied by the priority values [0.0769, 0.6923, 0.2308] and added to get λmax:

λmax = 13 × 0.0769 + 1.444 × 0.6923 + 4.333 × 0.2308 ≈ 3.0

With n = 3, CI = (3.0 − 3)/2 = 0 and CR = 0, so these judgments are perfectly consistent.
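The consistency check can be sketched numerically as below; the Random Index value 0.58 for three criteria is taken from Saaty's standard table.

```python
# Consistency Ratio check for the computer-selection example: column sums
# of the original comparison matrix are weighted by the priority vector
# to obtain lambda_max, then CI and CR follow.
import numpy as np

cm = np.array([[1, 1/9, 1/3],
               [9, 1,   3  ],
               [3, 1/3, 1  ]])
priority = np.array([1/13, 9/13, 3/13])      # [0.0769, 0.6923, 0.2308]

lam_max = cm.sum(axis=0) @ priority          # weighted column sums
n = cm.shape[0]
ci = (lam_max - n) / (n - 1)                 # Consistency Index
ri = 0.58                                    # Random Index for n = 3 (Saaty)
cr = ci / ri                                 # Consistency Ratio, ~0 here
print(round(lam_max, 3))                     # -> 3.0
```

A CR this close to 0 confirms the example's perfectly consistent judgments; values above 0.1 would call for revising the pairwise comparisons.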
The reliability metrics of all three models and their hierarchies were defined in Chap. 4. Further in this chapter, let us consider the metrics of the SaaS model and their hierarchy. The pairwise comparisons for the metrics and the corresponding calculations are shown in this section. As an example, let us assume customer A has a small business setup and is willing to adopt a SaaS product. The metrics are explained, and the pairwise comparison is accepted for all the levels of reliability metrics.
a. First-level metrics
The first level has four metrics: Operational, Security, Support and Monitoring, and Fault Tolerance. The pairwise comparison between them is as follows:
i. Operational metrics have strong dominance (8) over Security and moderate-plus-strong dominance (5) over Support and Monitoring and over Fault Tolerance.
ii. Support and Monitoring has moderate-plus-strong dominance (5) over Security.
iii. Fault Tolerance is preferred strongly (7) over Security and has moderate-plus-strong dominance (5) over Support and Monitoring.
The comparison matrix will be a 4 × 4 matrix, as there are four metrics at this level, with values following from the pairwise comparisons above.
The addition of the row sum column (after squaring the matrix) is 215.206. The priority vector, or Eigen vector, for the first-level reliability metrics is obtained by normalizing the row sums with this total: [0.598, 0.033, 0.094, 0.275]. The sum of the Eigen vector is 1. These are the preferences for the metrics [Operational, Security, Support and Monitoring, Fault Tolerance], respectively.
The next step is to check the Consistency Ratio for the given preferences. The column sums of the comparison matrix are first calculated:

$$\begin{bmatrix} 1.000 & 8.000 & 5.000 & 5.000 \\ 0.125 & 1.000 & 0.200 & 0.142 \\ 0.200 & 5.000 & 1.000 & 0.200 \\ 0.200 & 7.000 & 5.000 & 1.000 \end{bmatrix}, \quad \text{column sums} = [1.525,\; 21.000,\; 11.200,\; 6.342]$$

λmax is calculated by multiplying each column sum by its corresponding priority vector value:

λmax = 1.525 × 0.598 + 21.000 × 0.033 + 11.200 × 0.094 + 6.342 × 0.275 ≈ 4.393
CI = (λmax − n)/(n − 1) = (4.393 − 4)/3 ≈ 0.1. The RI for four criteria is 0.9, so CR = CI/RI = 0.1/0.9 ≈ 0.1.
As CR is ≤0.1, the pairwise comparisons provided are consistent.
b. Second-level metrics
Each metric in the first level is further subdivided. After completion of the pairwise comparison of the first-level metrics, the same procedure has to be applied at the next level of metrics. The pairwise comparison of the second level of Operational metrics is given in Table 5.4. The sub-metrics of the Operational metric are
i. Workflow Match
ii. Interoperability
iii. Ease of Migration
iv. Scalability
v. Usability
vi. Updation Frequency
The pairwise comparison should be provided only in the direction of the more favored metric. For example, the Workflow Match metric is of extreme importance compared with Interoperability, so "9" is entered in the table. The reverse pairwise comparison is not required (i.e., a comparison between Interoperability and Workflow Match is not needed), as the reciprocal value is stored automatically. Refer to Sect. 5.3.1 for comparison matrix creation.
The comparison matrix for the above table is
$$\begin{bmatrix} 1.000 & 9.000 & 8.000 & 7.000 & 5.000 & 4.000 \\ 0.111 & 1.000 & 0.250 & 0.125 & 0.200 & 0.111 \\ 0.125 & 4.000 & 1.000 & 0.142 & 0.142 & 0.142 \\ 0.142 & 8.000 & 7.000 & 1.000 & 3.000 & 0.200 \\ 0.200 & 5.000 & 7.000 & 0.333 & 1.000 & 0.200 \\ 0.250 & 9.000 & 7.000 & 5.000 & 5.000 & 1.000 \end{bmatrix}$$
This matrix encodes the preferences assigned for the second level of the Operational metric. λmax for the above comparison matrix is 6.98. The Consistency Index for 6 metrics is calculated as 0.19, and the final Consistency Ratio as 0.15 (refer to the Random Index for six metrics in Table 5.3). This value is greater than 0.1; hence, the pairwise comparison is not consistent and needs to be reconsidered.
Workflow Match is essential to maintain dynamism in business, and Updation Frequency likewise refers to keeping up with changes; hence both can be given equal importance (i.e., 1). On replacing the 4 (Workflow Match over Updation Frequency) with 1 and recalculating the entire process, the Consistency Ratio comes to 0.1, which shows that the comparisons are valid and consistent.
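This consistency analysis can be checked numerically with an eigenvalue computation, as sketched below; exact figures may differ slightly from the chapter's rounded values, but the original matrix fails the 0.1 test and the revised one improves on it.

```python
# Numerical consistency check for the 6x6 second-level comparison matrix:
# compute lambda_max via the principal eigenvalue, then CI and CR.
import numpy as np

def consistency_ratio(cm, ri):
    eigvals = np.linalg.eigvals(cm)
    lam_max = eigvals.real.max()          # principal (Perron) eigenvalue
    n = cm.shape[0]
    return (lam_max - n) / (n - 1) / ri

cm = np.array([
    [1.000, 9.000, 8.000, 7.000, 5.000, 4.000],
    [0.111, 1.000, 0.250, 0.125, 0.200, 0.111],
    [0.125, 4.000, 1.000, 0.142, 0.142, 0.142],
    [0.142, 8.000, 7.000, 1.000, 3.000, 0.200],
    [0.200, 5.000, 7.000, 0.333, 1.000, 0.200],
    [0.250, 9.000, 7.000, 5.000, 5.000, 1.000]])

ri_6 = 1.24                               # Random Index for six criteria
cr_orig = consistency_ratio(cm, ri_6)     # above the 0.1 threshold

cm_fixed = cm.copy()
cm_fixed[0, 5], cm_fixed[5, 0] = 1.0, 1.0  # equal importance: WFM vs. UpdFreq
cr_fixed = consistency_ratio(cm_fixed, ri_6)
print(round(cr_orig, 2), round(cr_fixed, 2))
```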
The pairwise comparison process given above has to be carried out for all levels of the metric hierarchy; the higher-level values are then calculated based on the lower-level metric priorities.
[Figure: CORE (Customer-Oriented Reliability Evaluation) System context — inputs: user requirements, factor preferences, cloud product list; output: reliability-based product ranking]
The working of the CORE model is divided into layers. The three layers of the CORE model are the User Preference Layer, Reliability Evaluator Layer, and Repository Layer. The container diagram of the CORE model is given in Fig. 5.4, and the detailed component diagram in Fig. 5.5.
This layer consists of the user interface, user preference template, product list, and rules for preference assignment. It is the base layer through which the customer orientation of the CORE model is achieved. The customer's reliability factor preferences are accepted through a dashboard interface. The rules to be followed in preference assignment are provided as on-demand help. The input has to be given by a person with a clear understanding of the business needs and technology aspects. This is the layer where the user interacts with the RE, along with the product list, to evaluate reliability. Sample preferences are provided from the factor preference template. On-demand
help is provided in case the user does not have any information on products that suit their business needs.
The reliability metrics used and their importance are explained to the customers, and their preferences for the factors are accepted. A questionnaire needs to be answered by prospective customers to gather details about the business requirements.
The questionnaire for SaaS product selection is given in Table 5.5. The business operations that are to be deployed as cloud services are accepted from prospective customers as the required functionality or modules and are checked against the modules provided by the shortlisted products. The required module match count is accepted from the user; this value is used in the Workflow Match sub-factor. The lists of usability, disaster recovery, security, and compliance specifications are accepted from the customer.
[Fig. 5.5 Component diagram of the CORE model — User Preference Layer: metrics preference template, preference value list, user business requirements; Reliability Evaluator Layer: existing customer data analysis, metrics ranking module, reliability evaluation module, feedback and updations reminder module; Repository Layer: feedback and data gathering module, time-stamped reliability database, feedback database, standards database, standards updation module]
This layer is used by the REs, who assist customers in finding a reliable product for their business operations. The various activities of the REs are customer preference acceptance, reliability calculation, data gathering, and data analysis.
i. Reliability Evaluation
Based on the input from the customer, comparison matrices and Eigen vectors are created for factor priority identification using the AHP technique. The reliability of the atomic metrics is calculated using the methods described in Chap. 4. The Relative Reliability Matrix (RRM) is created using the reliability values of the products for each level-II sub-factor. The same comparison matrix creation procedure is followed for RRM creation. The corresponding Eigen vector is the Relative Reliability Vector (RRV) of the products for that sub-metric. These RRVs are then multiplied by the priority values to create the final RRV for the products. The repetition of this procedure for all the sub-metrics finally results in a single RRV specifying the reliability ranking of the products.
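The final weighting step can be sketched as a simple matrix-vector product; the per-sub-metric RRVs and the priority weights below are hypothetical figures, not the model's actual data.

```python
# Sketch of the final CORE ranking step: sub-metric Relative Reliability
# Vectors (RRVs) are combined with AHP priority weights into one composite
# RRV. All numbers below are hypothetical.
import numpy as np

# rows: sub-metrics, columns: products P1..P3 (each row is an RRV, sums to 1)
rrv = np.array([[0.60, 0.25, 0.15],
                [0.30, 0.50, 0.20],
                [0.20, 0.30, 0.50]])
priorities = np.array([0.598, 0.275, 0.127])   # hypothetical AHP weights

final_rrv = priorities @ rrv                   # weighted combination
best = final_rrv.argmax()                      # index of the top-ranked product
print(np.round(final_rrv, 3))                  # highest value = top rank
```

Because each RRV and the priority vector sum to 1, the composite RRV also sums to 1 and can be read directly as a relative reliability ranking.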
ii. Data Analysis
The values gathered from existing customers and stored in the Repository Layer are processed into individual factor values for each product. As the data are gathered periodically, the analysis is also carried out periodically. Depending on the type of reliability metric, the evaluation is done using chi-square statistics, the cumulative binomial distribution, or count-based calculations. The calculated values are stored in the Repository Layer in a time-stamped reliability database used by the reliability evaluation module.
This layer stores the various databases used by the CORE model and is the main source for the reliability calculations. Existing customer data and standards data are gathered and stored in this layer. Due to the huge array of products and their vast customer base, this layer will eventually qualify as big data as the CORE model matures. The basic standard specifications of cloud services are retrieved from the specifications laid down by organizations and frameworks such as ISO, CSA, CSCC, COBIT, etc., and stored in this layer. These specifications undergo periodic updations, which need to be reflected in the repository maintained by the REs. They are utilized in the security, compliance, and SLA factor calculations. Existing customer usage details are gathered, and the evaluated reliability is stored in the Repository Layer in a time-stamped reliability database.
The data from existing customers need to be gathered at specific intervals of time with the help of a survey questionnaire. The questionnaire for SaaS product feedback is given in Table 5.6. As the REs can be approached by customers of any business sector, an extensive collection of data is required, with the intent to cover the working of the majority of cloud services. These data need to be stored with a time stamp, as an increase in the customer base will instigate the development of the product. This will eventually result in enhanced performance, which will also enhance the reliability of the product. The time-stamped data will help to analyze the reliability improvement of the product.
The CORE model is completely user-oriented and caters to the business require-
ments of all types of organizations. The model utilizes factors based on business
requirements, integrated with the technical components used in SaaS product devel-
opment. The model generates flexible reliability values for a single SaaS product
depending on the business-requirement-based preferences and hence will be of great
use in the pre-SaaS-adoption process. MSMEs/SMEs can benefit from the imple-
mentation of this model, as it will enable them to enhance their business processes,
improve their SaaS product monitoring processes, and obtain a comparative reli-
ability ranking of the selected SaaS products.
5.5 Summary
This chapter has presented a detailed description of cloud reliability evaluation.
Reliability is user-oriented and has many metrics to consider; because of this, an
MCDM approach is used instead of the traditional weight-assigning method. AHP is
the chosen MCDM method with which the reliability value is calculated. The
Customer-Oriented Reliability Evaluation (CORE) model is used to evaluate the
final reliability value. The model has three layers: the User Preference Layer, the
Reliability Evaluator Layer, and the Repository Layer. The end user of the model
interacts through the User Preference Layer. This layer has templates for the metric
preferences. The entire interaction with the model, like providing cloud product
expectations and reliability metric preferences based on business requirements, is
done in this layer. These inputs are then taken by the Reliability Evaluator Layer,
which does the final reliability calculations with the help of the Repository Layer of
the model. Single cloud product reliability evaluation or comparative ranking of
multiple cloud products based on the reliability value is the output of the model.
Chapter 6
Reliability Evaluation
6.1 Introduction
Existing customers' product usage feedback is accepted periodically from numerous
customers using surveys and is stored in the repository layer. This feedback gathering
process is carried out product-wise, and the evaluations are done by the reliability
calculation module of the RE layer. These evaluations are stored back in the repository
layer for future reference.
The two customers chosen are C1 and C2. They have different business setups and
different business requirements but share the same need to computerize accounting.
Profile of Customer C1
Type of business: Designer Textile retailer
Age of business: 4 years
Initial Investment: 20 Lakhs
Customer C1 had rented a shop in the market area and had started his business with
a single employee. C1's family was helping in garment designing. A manual billing
system and bookkeeping methods were used for accounts operations in the initial
years. The business expansion is approximately 25% per year. Now, after four years,
customer C1 owns a shop and has employed around four people for sales and garment
designing processes. MS Excel was being used for monthly and annual accounting
operations. With the current IT setup, customer C1 wants to venture into the usage
of accounting software. C1 is hesitant to shift to any software for accounting because
of the fear that the software usage may force a change in the business process. The
lookout is for software that suits the business operations.
Profile of customer C2
Type: Stationery retailer
Age of business: 3 years
Initial Investment: 30 Lakhs
Customer C2 had purchased a shop in a shopping complex situated among five
educational institutes. Along with stationery, project typing, printing, binding, and
Xerox (photocopying) facilities are also handled. Various computer storage devices
and printer cartridges are also sold. Initially, two people were employed. Customer
C2, being a tech-savvy person, used an electronic billing system and MS Excel for
accounts operations. The business expansion is approximately 35% per year.
Currently, after 3 years, the business has a huge customer base and the employee
count has increased to 7. The customer, with knowledge of the cloud, wants to try
out a cloud application for finance operations. The existing Excel data needs to be
moved to the new application, and the customer is also willing to mend the business
process to suit the software process. Business continuity and security of data are of
prime concern for the customer.
The preferences between the metrics and the sub-metrics are accepted from the cus-
tomers C1 and C2 based on their business requirements. Refer to Sect. 5.3.4 for an
understanding of how to assign comparative preferences for the metrics. The customers
are briefed about the way of preference assignment by the Reliability Engineers
(REs). If the customers do not have any preference for the metrics, a sample prefer-
ence is provided. Each metric is compared with the other metrics of the same level,
leaving out the comparison with itself; hence, the first and the last columns of the
table are the same. Table 6.1 presents the first-level metric preferences of customers
C1 and C2.
Priority calculations are done by creating the comparison matrix and its eigenvector,
as explained in the sub-sections of Sect. 5.3. The consistency ratio is also checked as
a proof to validate the consistency of the preferences assigned to the metrics. If the
consistency ratio is above 0.1, the customers are advised to change the preferences.
If a customer is not willing to change, the same preferences with a consistency ratio
greater than 0.1 are accepted with a warning. The permissible CR limits are 0.1–0.2
(BPMSG 2017).
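The priority derivation and consistency check described above can be sketched in Python. This is a minimal sketch, assuming the matrix-squaring approximation to the eigenvector (the same squaring, row-sum, and normalization steps used later in this chapter for the RRV) and Saaty's standard random-index values; the sample comparison matrix is illustrative, not taken from Table 6.1, and the function names are ours.

```python
def priorities(A):
    # Approximate the principal eigenvector: square the comparison matrix,
    # take row sums, and normalize.
    n = len(A)
    A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    row_sums = [sum(row) for row in A2]
    total = sum(row_sums)
    return [s / total for s in row_sums]

def consistency_ratio(A):
    n = len(A)
    w = priorities(A)
    # lambda_max is estimated from A.w divided element-wise by w
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    ri = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24}[n]  # Saaty's random index
    return ci / ri if ri else 0.0

# A perfectly consistent 3 x 3 comparison matrix built from weights 0.5, 0.3, 0.2
A = [[1.0, 5 / 3, 5 / 2],
     [0.6, 1.0, 3 / 2],
     [0.4, 2 / 3, 1.0]]
w = priorities(A)           # recovers [0.5, 0.3, 0.2]
cr = consistency_ratio(A)   # 0 for a consistent matrix; above 0.1 a revision is advised
```

With an inconsistent set of preferences, `cr` rises above 0.1 and the customer would be asked to revise, exactly as described above.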
The final first-level metric preferences for customers C1 and C2 are given in
Table 6.2.
After the acceptance of the preferences and the calculation of the priorities for the
metrics, the metric performance has to be calculated. The SaaS reliability metrics
hierarchy, with priorities assigned based on the preferences provided by customer
C1, is presented in Fig. 6.1.
The values of the sub-metrics have to be calculated and combined with the priority
values to evaluate the reliability of a product. The metric values of multiple products
can be combined in a matrix format, from which an eigenvector is calculated. The
value of the eigenvector provides the comparative reliability ranking of the products.
This is explained in detail in Sect. 6.4.
[Fig. 6.1: SaaS reliability metrics hierarchy with the priorities derived from customer C1's preferences. The sub-metrics shown with their priority values include workflow match, interoperability, ease of migration, scalability, usability, and updation frequency; security certificates, audit logs, location awareness, and security incidence reporting; adherence to SLA, support, reporting, and notification; and disaster management, backup frequency, and recovery process.]
The lowest-level sub-metrics of the reliability metrics hierarchy have to be calcu-
lated first. Depending on the type of metric, the calculation happens either at the time
of reliability evaluation or periodically, with the results stored in the Repository Layer
of the CORE model.
The metrics based on the expectations of the customer who is going to choose the
product or service are calculated at the time of reliability evaluation. The metrics
based on the feedback from the existing customers are calculated at regular intervals
and are stored in the repository layer. The metrics based on the standards are updated
periodically based on reminders; the computation of these metrics is also done
periodically, and the results are stored in the repository layer of the model. The
following sections explain the computation of all three types of metrics.
This metric is used to accept input from the users who are planning to adopt a SaaS
product for their business operations. Table 5.5 of Chap. 5 lists the questionnaire
to be used for accepting input from the customers. It has to be filled in by customers
who have complete knowledge of the business and also have knowledge related to
cloud usage. Filling in this questionnaire will also provide a platform for unorganized-
sector customers to streamline their business operations. These metrics are used to
evaluate the performance of the product, which will further be used to calculate the
product reliability. These values are calculated like percentage calculations.
If 10 features are expected to be present in a product and all ten are present, then
the reliability of the product is 1. If 8 features are present, then the reliability of the
product is 8/10 = 0.8.
This input is accepted from the existing users of a SaaS product. The calculation gives
assurance that the product is delivered as per the catalog specifications.
SaaS product usage feedback for the various products is accepted from existing cus-
tomers and stored in the repository layer. The feedback is obtained using a survey
that contains a questionnaire about the performance of the product. Refer to Table 5.6
of Chap. 5 for the SaaS product usage questionnaire. The target population is the SaaS
product users, and the study population for the survey is the MSME customers who
have migrated to the cloud with or without existing IT infrastructure. The stratified
sampling technique is used, as the MSMEs are already classified as micro, small, and
medium enterprises, which are then further classified into various sectors based on
the type of industry or services. The customers are divided into "strata" depending
on the size and type of business. Each stratum is considered an independent sub-
population, and samples are randomly selected. The preferred person for the feedback
is the IT personnel of the organization who has complete information about the
control and monitoring processes of the cloud applications. The panel sampling
technique is also included after six months of CORE model usage, in which the same
group of existing customers is surveyed several times over a period of six months.
The same strata are surveyed periodically, as product maturity takes place with
time, which enhances the reliability. The periodic execution of the existing customer
feedback module is used to survey the new customers using stratified sampling, along
with the repetitive survey of the old customers using panel sampling.
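The stratified draw described above can be sketched as follows. This is an illustrative sketch: the stratum labels, member IDs, and sample size are assumptions for demonstration, not data from the survey.

```python
import random

# Customers grouped into strata by enterprise size and sector
# (names and membership are illustrative assumptions).
strata = {
    ("micro", "retail"): ["c01", "c02", "c03", "c04", "c05"],
    ("small", "services"): ["c06", "c07", "c08", "c09"],
    ("medium", "manufacturing"): ["c10", "c11", "c12"],
}

def stratified_sample(strata, per_stratum, seed=42):
    # Each stratum is treated as an independent sub-population and
    # sampled at random, as described in the text.
    rng = random.Random(seed)
    return {key: rng.sample(members, min(per_stratum, len(members)))
            for key, members in strata.items()}

panel = stratified_sample(strata, per_stratum=2)
# The same selected panel can then be re-surveyed periodically (panel sampling).
```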
A few challenges faced during the feedback collection process are:
i. The list of the customers who are using the SaaS products was to be obtained from
the SaaS providers. They were reluctant to give the list, as they claimed that their
product is reliable. They were encouraged to use the model, as it will provide a
reliability comparison of their product with their competitors' products. It was
explained to them that the model will also help them in their product promotion
and enhance their customer base. They were assured that the survey results would
be shared with them, which would also help them to address the shortcomings, if
any.
ii. The majority of the SaaS product users were not performing exact time-based mon-
itoring of the SaaS operations. The users were briefed about the importance of
monitoring the operations. The significance of SaaS product reliability for their
business operations and the importance of their role in the proposed reliability
model were explained to them. The various factors of the CORE model that are to
be monitored and noted down, along with the monitoring procedures, were
explained to the product users. These users were visited after three months to
collect their feedback.
The reliability calculation of these feedback-based metrics is done using the chi-
square method or the Cumulative Distribution Function (CDF) for binomials. A
detailed explanation of the same is available in Sect. 3.5.2. SaaS reliability sub-
metrics like availability, support hours, response time, backup retention, location
awareness, ease of migration, updation frequency, etc., are calculated using the
goodness-of-fit test. The assured values of these metrics are retrieved from the SLA,
and the actually provided values are accepted from the existing users of the product.
Table 6.11 lists example data for the observed availability hours of products P1, P2,
and P3. The successive reliability value calculations are listed below the table. The
performance of the product with respect to availability is calculated using the
goodness-of-fit method due to the presence of expected and observed values.
Product P1 has an assured availability of 95%, product P2 an assured availability
of 99%, and product P3 an assured availability of 99.99%. Based on this, the expected
availability hours for each product have to be calculated as follows:
Number of hours in a month = 24 * 30 = 720 h
Availability hours for product P1 with 95% assurance = 720 * 95% = 684 h
Availability hours for product P2 with 99% assurance = 720 * 99% = 712.8 h
Availability hours for product P3 with 99.99% assurance = 720 * 99.99% = 719.928 h
The values provided in Table 6.11 are the observed availability of the products,
collected for a month from 10 different existing customers. The difference between
the observed and the expected value for products P1, P2, and P3 is calculated for each
customer. The difference is squared and divided by the expected value. The sum of
these terms is the chi-square value (χ²). A higher chi-square value indicates more
deviation from the assurance provided. The probability of chance occurrence for a
given χ² value can be obtained from the chi-square calculators available online.
Sample calculations for the single product P1 are given in Table 6.12.
The sum of the last column is 3.15, which is slightly on the higher side. This indicates
that the availability hours provided by the product deviate somewhat from the assured
hours. The probability of the χ² value calculated from an online calculator is 0.957,
which also indicates the performance of the product. The same procedure needs to be
followed for the chi-square evaluation of the other sub-metrics.
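The chi-square statistic for the availability sub-metric can be sketched as below. The expected value follows product P1's 95% assurance (684 h in a 720-hour month); the observed hours are illustrative assumptions, not the data of Table 6.11, so the resulting statistic differs from the 3.15 worked above.

```python
# Goodness-of-fit sketch for the availability sub-metric.
expected = 720 * 0.95  # 684 h assured per month for product P1
observed = [680, 675, 690, 684, 670, 688, 679, 683, 676, 681]  # 10 customers (illustrative)

# Chi-square statistic: sum of squared deviations divided by the expected value
chi_square = sum((o - expected) ** 2 / expected for o in observed)

# A larger statistic means a larger deviation from the assured availability.
# The probability of the statistic (here at 9 degrees of freedom) can then be
# read from chi-square tables or an online calculator, as the text suggests.
```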
If the feedback value for a metric is a dichotomous value like "Yes/No", then the
cumulative distribution function for binomials is used. The probability of success or
failure to deliver the assured services for a product is taken as 0.5 each. The number
of trials is the number of occurrences of the event, and the number of successes is the
number of times the product has performed as per the specification. For example,
the audit log metric is calculated based on the successful log access and retention
activity. The input accepted from the user is "Yes" or "No" based on the log access
activity at the time of need. The count of customers who have answered "Yes"
provides the value for the successful events. The number of customers surveyed is
the number of trials. Applying this in the formula as explained in Sect. 3.5.2, we get
the performance of the metric. The performance calculation of the audit log metric is
discussed below.
If ten customers are surveyed for products P1, P2, and P3, then the number of
trials = 10.
If 8 customers have answered "Yes" as a mark of successful log access for product
P1, then the number of successes is 8, and the probability of success and failure is 0.5
each. Applying all these values in the CDF formula, the result is 0.9892. This value
indicates the audit log efficiency of product P1.
If five customers have provided "Yes" for successful log access, then the CDF cal-
culation with 10 as the number of trials and 0.5 as the success probability gives 0.623.
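The two worked values above can be reproduced with a short cumulative binomial calculation. This is a sketch; the function name is ours, and the numbers follow the audit log example (n = 10 customers surveyed, success probability 0.5).

```python
from math import comb

def binomial_cdf(trials, successes, p=0.5):
    # P(X <= successes): sum of binomial probabilities up to the success count
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes + 1))

eight_yes = binomial_cdf(10, 8)  # ~0.989, the audit log efficiency of P1
five_yes = binomial_cdf(10, 5)   # ~0.623, as in the second example
```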
Standards-based inputs are accepted by the Reliability Engineers (REs) at regular
intervals and are stored in the repository layer of the model. These standard details
have to be updated periodically by the REs. The standards-based sub-metrics are
calculated using the Type III metric formulation as explained in Sect. 3.4.1. The
reliability value for each sub-metric is calculated as

Reliability = Number of standard certificates possessed by the organization /
Number of certificates suggested by the standards organization (6.1)
Some of the certificate requirement details are explained above. These certification
requirements have to be updated periodically, as the standards organizations keep
modifying the requirements to meet the changing business needs and growing threats.
If the required number of certificates is five and the SaaS company possesses 3,
then the reliability of the company with respect to certificates is 3/5 = 0.6. If all the
required certificates are possessed by the company, then the reliability is 1.
If multiple products are chosen for reliability evaluation, then a relative ranking of
the products is done based on their computed performance values. The metric value
computation for each product is done and stored in the Reliability Evaluation Layer
or the Repository Layer, depending on its type. The relative weighing method is used
to form a Relative Reliability Matrix (RRM), from which Relative Reliability Vectors
(RRVs) are computed to rank the products.
The RRM, or Relative Reliability Matrix, is used to compare the reliability of the
chosen products and to calculate their ranking based on their performance values. It
is a square matrix, like the comparison matrix, of order N × N, where N is the number
of products being compared. The diagonal elements of this matrix are 1, and each
cell holds the result of dividing the row product's performance value by the column
product's performance value.
Let us see the RRM for the workflow match metric of products P1, P2, and P3. To
create the RRM, the performance value of the metric has to be calculated first.
Workflow match is a Type I metric (refer to Sect. 3.5 for the metric types and their
calculation methods).
Assume customer C1 has a requirement of 10 functionalities in a SaaS product, and
the requirement matching of products P1, P2, and P3 is as given below.
P1 has 8 functionalities that match the requirement, so 8/10 = 0.8
P2 has 9 matching required functionalities, so 9/10 = 0.9
P3 matches all the required functionalities, hence 10/10 = 1
Now the RRM of the workflow match metric will be a 3 × 3 matrix, as three products
are being compared. The diagonal elements of the RRM will always be 1, as each
product's performance is compared with itself.
The eigenvector computation steps are followed, and the Relative Reliability Vector
is calculated. The RRM is squared, and the row sums are calculated. The row sum
values are normalized to create the RRV. The resulting vector holds the comparative
reliability values of the products.
The creation of the RRV from the RRM explained in Sect. 6.4.1 above is given below.
The following matrix is the result of squaring the RRM.
⎡ 3.000 2.667 2.400 ⎤
⎢ 3.375 3.000 2.700 ⎥
⎣ 3.750 3.333 3.000 ⎦
The row sum of each row has to be calculated, followed by the total of the row sums.
⎡ 3.000 2.667 2.400 ⎤ → 8.066
⎢ 3.375 3.000 2.700 ⎥ → 9.075
⎣ 3.750 3.333 3.000 ⎦ → 10.083
The total of the row sum column is 27.225. Normalize the values by dividing each
row sum by the total, 27.225. The resulting vector is the RRV, which is also the
comparative ranking of the products P1, P2, and P3.
[8.066/27.225, 9.075/27.225, 10.083/27.225]
This, when computed, results in RRV = [0.30, 0.33, 0.37]. Product P3 is ranked
first among the three products. These values are further used along with their priorities
for the reliability calculation.
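The RRM-to-RRV steps above can be sketched as follows. The helper name `rrv` is ours; the input values are the workflow match performances 0.8, 0.9, and 1.0 from the example.

```python
def rrv(values):
    # Build the RRM: each cell is the row product's performance value divided
    # by the column product's performance value (diagonal = 1).
    n = len(values)
    rrm = [[vi / vj for vj in values] for vi in values]
    # Square the RRM, take row sums, and normalize to obtain the RRV.
    squared = [[sum(rrm[i][k] * rrm[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
    row_sums = [sum(row) for row in squared]
    total = sum(row_sums)
    return [s / total for s in row_sums]

ranking = rrv([0.8, 0.9, 1.0])  # approximately [0.30, 0.33, 0.37]
```

The same function reproduces the Type I, Type II, and Type III examples below when fed their respective performance values.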
Sample data for all the SaaS metrics is provided in Annexure I. A few metric value
computations, followed by the RRM and RRV, for all three types of SaaS metrics are
given below.
Type I SaaS metric
The disaster management metric is a Type I metric. It works with the number of DR
features required for a product. Depending on the business requirement and the IT
technical strength of the company, the need for DR features varies. Assume customer
C1 requires five DR features.
Product P1 satisfies four, hence 4/5 = 0.8
Product P2 has three and a half features, so 3.5/5 = 0.7
Product P3 has almost all but a few slight issues, hence 4.5/5 = 0.9
Based on these three values, the RRM and RRV computations are done as follows.
The product matrix after squaring, with the row sum of each row, is
⎡ 3.000 3.428 2.667 ⎤ → 9.09
⎢ 2.625 3.000 2.333 ⎥ → 7.95
⎣ 3.375 3.857 3.000 ⎦ → 10.23
The total of the row sum column is 27.28. The RRV after normalizing the row sum
values is [0.33, 0.292, 0.375].
Type II SaaS metric
The RRM formed from the computed performance values of a Type II metric for
products P1, P2, and P3 is
⎡ 1.000 0.945 0.942 ⎤
⎢ 1.057 1.000 0.996 ⎥
⎣ 1.060 1.003 1.000 ⎦
The total of the row sums after squaring is 27.01. The RRV after normalizing each
row sum is [0.32, 0.34, 0.34].
Type III SaaS metric
The security certificates metric values 0.6, 0.9, and 1 for products P1, P2, and P3
are used to calculate the RRM and RRV as explained above. The RRM using the
metric values is
⎡ 1.000 0.666 0.600 ⎤
⎢ 1.500 1.000 0.900 ⎥
⎣ 1.667 1.111 1.000 ⎦
The total of the row sum column after squaring is 28.33. Upon normalizing the row
sums with the total, the resulting RRV is [0.24, 0.36, 0.40].
The RRM and RRV calculations for all three types of SaaS reliability metrics follow
the same procedure.
After the completion of the individual metric value computation and the priority
calculation based on the customer preferences, the final reliability values are calcu-
lated. Depending on the user's choice, either a single product's reliability is provided
or a comparative reliability ranking of multiple products is provided.
The sample data required for the metric calculations and the final reliability compu-
tation in this section is given in Annexure I. Readers are advised to calculate the
value of each metric and the metric priorities based on the sample preferences and
data provided. These calculations can be carried out with the help of Microsoft Excel.
We request readers to proceed further in this section only after completing all the
required computations. Most of the values used in this section are the result of the
computations of the previous sections of this chapter.
To calculate the final reliability of a product, the required values are:
i. The priorities of the metrics at all levels.
ii. The individual metric performance values.
iii. If multiple products are compared, the RRV value of each metric.
If there are n levels in the reliability metrics hierarchy, then the nth level holds the
atomic metrics. These are computed first. Based on these values and the priorities
of these metrics, the (n − 1)th-level metric values are computed. This proceeds up
the hierarchy until the final reliability value is computed.
The SaaS product reliability metrics are provided as a two-level hierarchy (refer to
Fig. 6.1). Operational, security, support and monitoring, and fault tolerance are the
first-level metrics, which are further subdivided. We will take the sub-metrics of each
metric and perform the required computations. For single product reliability, the
values of product P1 alone are chosen, and for multiple product reliability, all three
products P1, P2, and P3 are used.
In the single product reliability calculations, the priority computations and the metric
performance computations are done first. Product P1 and the priorities of customer
C1 are chosen for the sample calculation. The sub-metrics of all four first-level
metrics have to be calculated.
i. The values of the operational sub-metrics are as given below.
Workflow match (Type I) = 0.8
Interoperability (Type I) = 0
Migration Ease (Type II) = 0.74
Scalability (Type II) = 0.99
Usability (Type I) = 0.7
Updation Frequency (Type II) = 0.7333
The priorities of these metrics are [0.38, 0.02, 0.03, 0.14, 0.10, 0.33]. Refer to
Sect. 6.2 to understand the computation of the priority values. Matrix multipli-
cation of these metric values with the metric priorities results in the operational
metric value.
[0.8 0 0.74 0.99 0.7 0.73] × [0.38 0.02 0.03 0.14 0.10 0.33]ᵀ
The priorities of the first-level metrics for customer C1 are [0.60, 0.03, 0.09, 0.28]
(refer Table 6.2). The values of the first-level metrics are [0.77, 0.79, 0.86, 0.92]
(refer to the above calculations). Matrix multiplication of the priorities of the first-
level metrics with their values provides the final reliability of the product.
[0.77 0.79 0.86 0.92] × [0.60 0.03 0.09 0.28]ᵀ
The result, 0.824, is the final reliability of product P1 based on the metric
preferences provided by customer C1.
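The two-level weighted aggregation can be sketched as follows, using the sub-metric values and priority vectors quoted above. Note that the inputs here are the rounded values printed in the text, so the totals come out near, but not exactly at, the chapter's 0.77 and 0.824, which were computed with unrounded intermediates.

```python
def weighted(values, priorities):
    # Matrix multiplication of a value row vector with a priority column vector
    return sum(v * p for v, p in zip(values, priorities))

# Operational metric of P1 from its six sub-metric values and C1's priorities
operational = weighted([0.8, 0, 0.74, 0.99, 0.7, 0.7333],
                       [0.38, 0.02, 0.03, 0.14, 0.10, 0.33])  # ~0.78

# Final reliability of P1 from the four first-level metric values and priorities
final = weighted([0.77, 0.79, 0.86, 0.92],
                 [0.60, 0.03, 0.09, 0.28])  # ~0.82
```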
i. Operational metric
All the sub-metrics of the operational metric are evaluated and listed in Table 6.14.
For each sub-metric, the values of all three products are used to calculate the RRM
and RRV. The RRV values of all the operational sub-metrics are given below.
The above matrix of RRV values has to be multiplied by the priorities assigned
by customers C1 and C2 (refer Table 6.4). The priority assigned by C1 is [0.38,
0.02, 0.03, 0.14, 0.10, 0.33] and that of customer C2 is [0.06, 0.03, 0.39, 0.11, 0.12,
0.29].
⎡ 0.30 0.33 0.28 0.34 0.29 0.31 ⎤
⎢ 0.33 0.33 0.38 0.34 0.33 0.33 ⎥ × [0.38 0.02 0.03 0.14 0.10 0.33]ᵀ
⎣ 0.37 0.33 0.34 0.33 0.38 0.35 ⎦
The resulting product vector with customer C1's priorities is [0.30, 0.34, 0.36], which
is also the relative ranking of the products based on the operational metric. According
to customer C1's preferences, the products are ranked as P3 > P2 > P1.
The same RRV matrix, when multiplied by the priorities assigned by customer C2,
gives the resulting vector [0.30, 0.35, 0.35]. Based on this, the relative ranking is
P3 = P2 > P1.
[0.30, 0.34, 0.36] and [0.30, 0.35, 0.35] are the operational metric values for
customers C1 and C2, respectively.
ii. Security metric
The above matrix of RRV values has to be multiplied by the priorities assigned
by customers C1 and C2 (refer to Table 6.6). The priority assigned by C1 is [0.45,
0.35, 0.03, 0.17] and that of customer C2 is [0.63, 0.08, 0.25, 0.04].
⎡ 0.375 0.240 0.320 0.330 ⎤
⎢ 0.198 0.360 0.340 0.320 ⎥ × [0.45 0.35 0.03 0.17]ᵀ
⎣ 0.427 0.400 0.340 0.350 ⎦
The resulting product vector with customer C1's priorities is [0.32, 0.28, 0.40], which
is the relative ranking of the products based on the security metric. According to
customer C1's preferences, the products are ranked as P3 > P1 > P2.
The same RRV matrix, when multiplied by the priorities assigned by customer C2,
gives the resulting vector [0.35, 0.25, 0.40]. Based on this, the relative ranking is also
P3 > P1 > P2.
[0.32, 0.28, 0.40] and [0.35, 0.25, 0.40] are the security metric values for customers
C1 and C2, respectively.
iii. Support and monitor metric
The sub-metric values of the support and monitor metric are listed in Table 6.16.
For each sub-metric, the values of all three products are used to calculate the RRM
and RRV. The RRV values of all the support and monitor sub-metrics are given below.
The above matrix of RRV values has to be multiplied by the priorities assigned
by customers C1 and C2 (refer to Table 6.8). The priority assigned by C1 is [0.17,
0.30, 0.05, 0.38, 0.10] and that of customer C2 is [0.14, 0.44, 0.04, 0.32, 0.06].
⎡ 0.33 0.33 0.34 0.31 0.34 ⎤
⎢ 0.25 0.33 0.29 0.33 0.31 ⎥ × [0.17 0.30 0.05 0.38 0.10]ᵀ
⎣ 0.42 0.34 0.37 0.36 0.35 ⎦
The resulting product vector with customer C1's priorities is [0.32, 0.32, 0.36], which
is the relative ranking of the products based on the support and monitor metric.
According to customer C1's preferences, the products are ranked as P3 > P1 = P2.
The same RRV matrix, when multiplied by the priorities assigned by customer C2,
gives the resulting vector [0.32, 0.32, 0.36]. Based on this, the relative ranking is also
P3 > P1 = P2.
[0.32, 0.32, 0.36] and [0.32, 0.32, 0.36] are the support and monitor metric values
for customers C1 and C2, respectively.
iv. Fault tolerance metric
The sub-metric values of the fault tolerance metric are listed in Table 6.17. For each
sub-metric, the values of all three products are used to calculate the RRM and RRV.
The RRV values of all the fault tolerance sub-metrics are given below.
The above matrix of RRV values has to be multiplied by the priorities assigned
by customers C1 and C2 (refer to Table 6.10). The priority assigned by C1 is [0.65,
0.05, 0.15, 0.15] and that of customer C2 is [0.73, 0.09, 0.09, 0.09].
⎡ 0.32 0.33 0.30 0.32 ⎤
⎢ 0.34 0.29 0.33 0.34 ⎥ × [0.65 0.05 0.15 0.15]ᵀ
⎣ 0.34 0.38 0.37 0.34 ⎦
The resulting product vector with customer C1's priorities is [0.32, 0.33, 0.35], which
is the relative ranking of the products based on the fault tolerance metric. According
to customer C1's preferences, the products are ranked as P3 > P2 > P1.
The same RRV matrix, when multiplied by the priorities assigned by customer C2,
gives the resulting vector [0.32, 0.33, 0.35]. Based on this, the relative ranking is also
P3 > P2 > P1.
[0.32, 0.33, 0.35] and [0.32, 0.33, 0.35] are the fault tolerance metric values for
customers C1 and C2, respectively.
After the completion of the metric value calculation for all the first-level metrics,
their priorities are used to compute the final comparative reliability ranking of the
products. The priority of the first-level metrics for customer C1 is [0.60, 0.03, 0.09,
0.28] and that for customer C2 is [0.19, 0.68, 0.09, 0.04] (refer Table 6.2).
The reliability ranking based on customer C1's priority assignment is given below.
There are four first-level metrics, and three products are being compared; hence a
3 × 4 matrix is used.
⎡ 0.30 0.32 0.32 0.32 ⎤
⎢ 0.34 0.28 0.33 0.33 ⎥ × [0.60 0.03 0.09 0.28]ᵀ
⎣ 0.36 0.40 0.36 0.35 ⎦
The resulting product vector is [0.31, 0.33, 0.36]. This gives the ranking of the
products as P3 > P2 > P1, as per customer C1's preferences.
The reliability ranking based on customer C2's priority assignment is given below.
⎡ 0.30 0.35 0.32 0.32 ⎤
⎢ 0.35 0.25 0.33 0.33 ⎥ × [0.19 0.68 0.09 0.04]ᵀ
⎣ 0.35 0.40 0.36 0.35 ⎦
The resulting product vector is [0.34, 0.28, 0.38]. This gives the ranking of the
products as P3 > P1 > P2, as per customer C2's preferences.
The variation in the ranking of the same set of products based on the user preferences
is proof that the model is customer-oriented.
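The two final rankings can be reproduced with a single matrix-vector multiplication over the first-level RRV matrices and priority vectors quoted above; the helper name `rank` is ours. The differing orderings for C1 and C2 illustrate the customer-oriented behaviour just described.

```python
def rank(rrv_matrix, priority):
    # Weight each product's row of first-level RRV values by the customer's
    # first-level metric priorities.
    return [sum(v * p for v, p in zip(row, priority)) for row in rrv_matrix]

# Rows are products P1, P2, P3; columns are the four first-level metrics.
rrv_c1 = [[0.30, 0.32, 0.32, 0.32],
          [0.34, 0.28, 0.33, 0.33],
          [0.36, 0.40, 0.36, 0.35]]
c1 = rank(rrv_c1, [0.60, 0.03, 0.09, 0.28])  # ~[0.31, 0.33, 0.36]: P3 > P2 > P1

rrv_c2 = [[0.30, 0.35, 0.32, 0.32],
          [0.35, 0.25, 0.33, 0.33],
          [0.35, 0.40, 0.36, 0.35]]
c2 = rank(rrv_c2, [0.19, 0.68, 0.09, 0.04])  # ~[0.34, 0.28, 0.38]: P3 > P1 > P2
```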
6.6 Summary
This chapter has dealt with the complete numerical calculations used in the CORE
model. For easy understanding, sample preferences from two different customers, C1
and C2, were accepted. These two customers have varying business needs but are both
willing to adopt a cloud application for accounting purposes. The three products
chosen are named P1, P2, and P3 to maintain anonymity. The preference acceptance
for all the metrics and sub-metrics, along with the priority calculations for both
customers, is explained. The metric performance calculations are discussed in detail
with the help of sample data. Separate examples are discussed for the three types of
metric calculations: Type I, Type II, and Type III. The Relative Reliability Matrix and
Relative Reliability Vector calculations are discussed with the example of a metric.
The last section of the chapter explains the computation of reliability for a single
product and also for a group of products. Readers are advised to proceed to the last
section only after attempting the metric performance calculations based on the sample
data provided in Annexure I.
Annexure
Sample Data for SaaS Reliability
Calculations
SaaS reliability metrics are categorized into levels, and the priorities of the reliability metrics need to be assigned for the final reliability evaluation. Sample data for the priorities and for the individual metrics is provided in this annexure to demonstrate the calculations.
Note
Tables A.1, A.2, A.3, A.4 and A.5 provide comparative metric preferences for all levels of reliability metrics. In Table A.1, some of the metric preference values are given as “–”. For example, the metric preference between operational and security is given as 8, but the preference between security and operational is given as “–”. This is because if a preference p is provided between metrics M1 and M2, then the preference between M2 and M1 is calculated as 1/p. Henceforth, in the succeeding tables, only the stated preference values are provided; the reciprocals have to be assumed during calculation.
Based on the above data, the preference values can be calculated; the final values are given in Fig. A.1.
Sample data for the individual metrics of the SaaS reliability calculation is given below. These can be used for practicing the reliability calculations. Details of three products, named P1, P2, and P3, are given, using which either single-product reliability or a reliability-based comparative ranking can be calculated. Only the sample data and the various calculation headings are given; readers are encouraged to do the calculations.
1. Workflow match (Type I metric)
Total functionality requirement of the customer is 10
P1 matches 8 functionalities
P2 matches 9 functionalities
P3 matches 10 functionalities
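Assuming the Type I metric value is simply the ratio of matched to required functionalities (consistent with the counts above; the exact formula is given earlier in the chapter), the workflow-match scores can be sketched as:

```python
# Workflow match (Type I): fraction of required functionalities matched.
total_required = 10
matched = {"P1": 8, "P2": 9, "P3": 10}

workflow_match = {p: m / total_required for p, m in matched.items()}
# → {"P1": 0.8, "P2": 0.9, "P3": 1.0}
```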
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
6 6 10 12 8 9
3 6 9 12 7 9
4 6 8 12 6 9
6 6 11 12 5 9
5 6 12 12 9 9
4 6 10 12 6 9
3 6 8 12 7 9
2 6 9 12 8 9
6 6 7 12 9 9
5 6 9 12 9 9
164 Annexure: Sample Data for SaaS Reliability Calculations
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
7 9 5 8 9 10
9 9 6 8 10 10
8 9 7 8 9 10
7 9 6 8 9 10
6 9 7 8 9 10
7 9 5 8 10 10
8 9 5 8 10 10
9 9 6 8 10 10
9 9 5 8 10 10
9 9 6 8 10 10
Total Total Total
In the above table, DC is the number of data movements to the correct location and TM is the total number of data movements.
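The chi-square tables above leave their Total rows blank for the reader. A minimal sketch of that computation, using the columns of the first table (assured values 6, 12 and 9 for P1, P2 and P3, respectively):

```python
# Type II chi-square statistic: sum of (obs - assu)^2 / assu per product.

def chi_square_total(observed, assured):
    """Total of the (obs - assu)^2 / assu column over all observations."""
    return sum((o - assured) ** 2 / assured for o in observed)

p1 = chi_square_total([6, 3, 4, 6, 5, 4, 3, 2, 6, 5], 6)      # 44/6  ≈ 7.33
p2 = chi_square_total([10, 9, 8, 11, 12, 10, 8, 9, 7, 9], 12)  # 93/12 = 7.75
p3 = chi_square_total([8, 7, 6, 5, 9, 6, 7, 8, 9, 9], 9)       # 44/9  ≈ 4.89
```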
10. Incidence reporting (Type II metric) simple division method
NI TI Eff (NI/TI) NI TI Eff (NI/TI) NI TI Eff (NI/TI)
3 3 2 2 3 3
2 3 2 2 3 3
3 3 1 2 3 3
3 3 2 2 3 3
3 3 2 2 3 3
3 3 2 2 3 3
3 3 1 2 3 3
2 3 2 2 3 3
3 3 2 2 3 3
3 3 2 2 3 3
Average Average Average
In the above table NI is the number of incidences that were reported to customers
and TI is the total number of security incidences.
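The Eff and Average cells are left blank for the reader. A sketch of the simple-division computation (Eff = NI/TI per row, averaged over the ten rows):

```python
# Incidence reporting (Type II, simple division): mean of NI/TI per product.

def average_efficiency(rows):
    """Average of NI/TI over the sampled (NI, TI) rows."""
    return sum(ni / ti for ni, ti in rows) / len(rows)

p1 = average_efficiency([(3, 3), (2, 3), (3, 3), (3, 3), (3, 3),
                         (3, 3), (3, 3), (2, 3), (3, 3), (3, 3)])  # ≈ 0.93
p2 = average_efficiency([(2, 2), (2, 2), (1, 2), (2, 2), (2, 2),
                         (2, 2), (1, 2), (2, 2), (2, 2), (2, 2)])  # = 0.90
p3 = average_efficiency([(3, 3)] * 10)                             # = 1.00
```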
11. Regulatory compliance certificates (Type III metric)
Total certificates required = 10
Product P1 has 8 certificates
P2 has 6 certificates and P3 has 10 certificates
12. Adherence to SLA (Type II metric) chi-square test
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
9 10 8 10 9 10
8 10 8 10 9 10
9 10 8 10 9 10
8 10 9 10 9 10
10 10 9 10 9 10
7 10 9 10 10 10
8 10 9 10 10 10
9 10 10 10 10 10
10 10 10 10 10 10
10 10 10 10 10 10
Total Total Total
In the above table, CC is the number of calls connected and TC is the total number of calls made.
Step 2: Response time reliability
(continued)
RWT TR Eff (RWT/TR) RWT TR Eff (RWT/TR) RWT TR Eff (RWT/TR)
9 12 7 9 7 7
7 8 5 6 8 8
Average Average Average
In the above table RWT is the number of responses within time and TR is the
total number of responses.
Step 3: Resolution time reliability
In the above table SWT is the number of solutions provided within assured time
and TS is the total number of solutions provided.
The average value of steps 1, 2, and 3 will be the final value of the support metric.
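As a sketch of that final aggregation (the three step values below are hypothetical, since the full step tables are not reproduced here):

```python
# Support metric: mean of the three step reliabilities
# (call connect, response time, resolution time).

def support_metric(call_connect, response_time, resolution_time):
    """Final support-metric value as the average of the three steps."""
    return (call_connect + response_time + resolution_time) / 3

# Hypothetical step averages for one product:
value = support_metric(0.90, 0.85, 0.80)  # ≈ 0.85
```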
15. Notification report (Type II metric) simple division method
NC TC Eff (NC/TC) NC TC Eff (NC/TC) NC TC Eff (NC/TC)
9 10 7 7 5 5
10 10 6 7 4 5
7 10 5 7 5 5
6 10 6 7 5 5
7 10 5 7 4 5
8 10 4 7 3 5
9 10 7 7 5 5
10 10 6 7 5 5
10 10 5 7 3 5
9 10 4 7 5 5
Average Average Average
obs assu (95%) (obs − assu)²/assu obs assu (99.9%) (obs − assu)²/assu obs assu (99.99%) (obs − assu)²/assu
680 684 700 712.8 710 719.28
678 684 698 712.8 705 719.28
664 684 710 712.8 719 719.28
650 684 703 712.8 715 719.28
679 684 705 712.8 719 719.28
683 684 700 712.8 716 719.28
680 684 712 712.8 719 719.28
679 684 701 712.8 715 719.28
662 684 700 712.8 719 719.28
684 684 705 712.8 719 719.28
Total Total Total
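The blank Total cells of the table above follow the same chi-square pattern; a sketch using product P1's column (assured value 684):

```python
# Chi-square total for one column: sum of (obs - assu)^2 / assu.

def chi_square_total(observed, assured):
    """Total of the (obs - assu)^2 / assu column over all observations."""
    return sum((o - assured) ** 2 / assured for o in observed)

obs_p1 = [680, 678, 664, 650, 679, 683, 680, 679, 662, 684]
total_p1 = chi_square_total(obs_p1, 684)  # 2159/684 ≈ 3.16
```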
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
6 5 7 5 5 5
7 5 6 5 6 5
5 5 8 5 5 5
7 5 6 5 5 5
8 5 7 5 5 5
5 5 8 5 5 5
6 5 5 5 5 5
7 5 5 5 6 5
8 5 5 5 6 5
7 5 6 5 5 5
Total Total Total
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
9 10 10 12 15 15
7 10 11 12 15 15
7 10 12 12 14 15
9 10 11 12 13 15
10 10 9 12 15 15
7 10 12 12 15 15
7 10 11 12 15 15
8 10 10 12 14 15
8 10 9 12 15 15
10 10 11 12 14 15
Total Total Total
obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu obs assu (obs − assu)²/assu
28 30 19 20 35 35
30 30 18 20 34 35
30 30 20 20 35 35
29 30 20 20 35 35
30 30 19 20 35 35
27 30 18 20 35 35
30 30 20 20 34 35
26 30 25 20 33 35
30 30 20 20 35 35
29 30 19 20 35 35
Total Total Total
(continued)
NR NSR Prob NR NSR Prob NR NSR Prob
5 3 4 4 3 3
5 4 4 4 3 3
5 3 4 4 3 2
5 4 4 3 3 3
Average Average Average
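Assuming NR is the total number of requests in each sampled row, NSR the number of successful ones, and Prob = NSR/NR (the extract does not define these columns, so this reading is an assumption), the Average row can be sketched as:

```python
# Probability sketch: Prob = NSR / NR, averaged over the rows shown.

def average_probability(rows):
    """Mean of NSR/NR over the sampled (NR, NSR) rows."""
    return sum(nsr / nr for nr, nsr in rows) / len(rows)

p1 = average_probability([(5, 3), (5, 4), (5, 3), (5, 4)])  # ≈ 0.70
p2 = average_probability([(4, 4), (4, 4), (4, 4), (4, 3)])  # ≈ 0.94
p3 = average_probability([(3, 3), (3, 3), (3, 2), (3, 3)])  # ≈ 0.92
```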