Model For Fault Tolerance and Checkpoints in Cloud Computing Environment
Name: Anteyi Benedict O.
Fault tolerance is the ability of a system to continue performing its intended functions in
the presence of faults. In a broad sense, fault tolerance is associated with reliability, with successful
operation, and with the absence of breakdowns. A fault-tolerant system should be able to handle
unexpected problems and still meet its specification. Fault tolerance is necessary because it is
practically impossible to build a perfect system. The fundamental problem is that, as the
complexity of a system grows, its reliability drastically decreases, unless compensatory measures
are taken. For example, if the reliability of individual components is 99.99%, then the reliability
of a system consisting of 10,000 non-redundant components is just 36.79%. Such a low reliability is
unacceptable in most applications.
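
The 36.79% figure can be checked with a short calculation. The sketch below assumes a series system in which every component must work and components fail independently, so the system reliability is the product of the component reliabilities. The function name series_reliability is illustrative and does not come from the original text.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of a non-redundant (series) system.

    Assumption: components fail independently and the system works only
    if all n components work, so R_system = R_component ** n.
    """
    return component_reliability ** n_components


# Values taken from the text: 99.99% reliable components, 10,000 of them.
r = series_reliability(0.9999, 10_000)
print(f"{r:.2%}")  # 36.79%
```

The result is close to 1/e, which is what R = (1 - 1/n)^n approaches as n grows.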
Checkpoint/restart is a fault tolerance strategy that increases the wall clock time of the execution
of applications, which in turn increases the execution cost. Checkpointing is the process of saving
system states periodically during failure-free execution. With the checkpointing fault
tolerance strategy, if a failure occurs while a system is running, the system can roll back to
the latest checkpoint and restart from it, thereby bounding the amount of lost
work to be recomputed.
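
The roll-back behaviour described above can be illustrated with a minimal single-process sketch. It is not the model proposed in this work. The function names (save_checkpoint, load_checkpoint, run), the state layout, and the simulated failure are all assumptions made for illustration. The state is saved every `interval` steps, and a restarted run resumes from the latest saved state instead of from step 0.

```python
import os
import pickle


def save_checkpoint(state: dict, path: str) -> None:
    """Write the state to a temporary file, then rename it into place,
    so a crash during the write never leaves a corrupt checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return the latest checkpoint, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(n_steps: int, interval: int, path: str, fail_at=None) -> int:
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    state = load_checkpoint(path)  # roll back to the latest checkpoint
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]  # the "work" done in one step
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(state, path)
    return state["total"]
```

With interval=3 and a failure at step 7, the latest checkpoint holds step 6, so a restarted run recomputes only one step. This is the bounded loss the paragraph refers to, and the periodic writes are the source of the added wall clock time.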
Cloud computing, the long-held dream of computing as a utility, has opened up a new era of
future computing, transformed a large part of the IT industry, and reshaped the purchase and use of
highly available, and configurable and reconfigurable computing resources (e.g., networks,
servers, storage, applications, data, and so on) can be rapidly provisioned and released with
minimal management effort in the data centers. Services are delivered on demand to external
customers over high-speed Internet with the “X as a service (XaaS)” computing architecture,
which is broken down into three segments: “applications”, “platforms”, and “infrastructure”. Its
aims are to provide users with more flexible services in a transparent manner and with ever
cheaper and more powerful processors. Cloud computing offers a new, flexible, high-capacity
solution for high performance computing (HPC) applications by provisioning a large number
of virtual machines for computationally intensive applications. Fault tolerance allows HPC systems
to continue operating in the presence of faults. The most commonly used fault tolerance techniques
for HPC are checkpoint/restart and replication. However, the focus of this research is to present a
model for fault tolerance and checkpointing in the cloud computing environment.
1.2 Motivation
Cloud computing has the ability and flexibility to provision computing resources at scale for HPC
(High Performance Computing) scientific purposes. However, fault tolerance is one of the
challenges facing HPC applications in the cloud. As the number of processors in today's HPC
systems grows, and as those systems are provisioned on the cloud with virtual instances,
communication links, and integrated circuit environments running on virtual machines (VMs), fault
tolerance (FT) for such applications needs to ensure that computationally intensive applications
run smoothly and simultaneously, with reduced overhead and with visibility of the
environment. Fault tolerance is one of the major obstacles to opening up a new era of
highly serviceable cloud computing, as sockets as well as processors
are prone to failure. It has been predicted that a system with 100,000 processors will experience a
processor failure every few minutes. Therefore, fault-tolerant service is an essential part of
Service Level Objectives (SLOs), and fault tolerance and checkpointing functions should be
considered in clouds. When users transfer their critical systems to clouds, they always ask whether
the cloud can achieve 100% uptime. Unfortunately, cloud
serviceability and fault tolerance are still far from perfect. Failures are normal rather than
exceptional in cloud computing environments for several reasons: clouds support large-scale,
time-critical data; cloud platforms usually run on voluntary, much cheaper, less powerful,
virtual computing nodes; cloud nodes are usually connected by unpredictable
communication links, so communication failures such as timeouts greatly influence the
serviceability of clouds; and malicious behavior can occur because nodes are contributed by users.
Nowadays, demands for high fault tolerance and high serviceability are becoming
unprecedentedly strong. Building a highly fault-tolerant, highly serviceable cloud is therefore
a critical, challenging, and urgently required task in the cloud computing environment. Hence, in this
project, an optimized model for fault tolerance and checkpoint serviceability will be
developed.
1.3 Objective of Study
i. To design an optimized model for fault tolerance and checkpoint in cloud computing
environment.
1.4 Methodology
The goal of any optimization problem is to minimize objective functions such as error, faults,
failure, fault tolerance degree, fault tolerance overhead, response time, and other factors that
computing system environment. In this work, our main goal is to minimize the objective
function subject to the fault tolerance, FŦ subject to units of cloud computing environment
The steps listed below were taken in order to develop the optimized model for fault tolerance
The linear programming problem to be formulated will be of the
form,
Minimize c^T x
Subject to Ax ≤ b
and x ≥ 0
where c^T x is the objective function, and Ax ≤ b and x ≥ 0 are the inequality and
non-negativity constraints, respectively.
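
To make the form above concrete, the following sketch solves a toy two-variable instance. The numbers are hypothetical and are not taken from this study. Read x1 and x2 as amounts of two fault tolerance measures with unit overheads 2 and 3, a combined coverage requirement x1 + x2 ≥ 4 (written as -x1 - x2 ≤ -4 to fit Ax ≤ b), and capacity limits x1 ≤ 3 and x2 ≤ 5. The solver enumerates the vertices of the feasible region. That approach is chosen here only because it fits in a few lines and assumes a feasible, bounded problem; a full-scale model would normally be passed to a dedicated LP solver.

```python
from itertools import combinations


def solve_lp_2d(c, A, b, eps=1e-9):
    """Minimize c^T x subject to A x <= b and x >= 0, for two variables.

    Illustrative vertex enumeration: an optimum of a feasible, bounded LP
    lies at a vertex, i.e. at the intersection of two constraint boundaries.
    Returns (None, inf) if no feasible vertex is found.
    """
    # Fold the non-negativity constraints into the system as -x_i <= 0.
    rows = [list(r) for r in A] + [[-1.0, 0.0], [0.0, -1.0]]
    rhs = list(b) + [0.0, 0.0]

    best_x, best_val = None, float("inf")
    for i, j in combinations(range(len(rows)), 2):
        (a11, a12), (a21, a22) = rows[i], rows[j]
        det = a11 * a22 - a12 * a21
        if abs(det) < eps:
            continue  # parallel boundaries never intersect in a vertex
        # Intersection of the two boundary lines (Cramer's rule).
        x1 = (rhs[i] * a22 - a12 * rhs[j]) / det
        x2 = (a11 * rhs[j] - rhs[i] * a21) / det
        # Keep the point only if it satisfies every constraint.
        feasible = all(r[0] * x1 + r[1] * x2 <= h + eps
                       for r, h in zip(rows, rhs))
        val = c[0] * x1 + c[1] * x2
        if feasible and val < best_val:
            best_x, best_val = (x1, x2), val
    return best_x, best_val


# Hypothetical instance: minimize 2*x1 + 3*x2
c = [2.0, 3.0]
A = [[-1.0, -1.0],  # x1 + x2 >= 4, rewritten as -x1 - x2 <= -4
     [1.0, 0.0],    # x1 <= 3
     [0.0, 1.0]]    # x2 <= 5
b = [-4.0, 3.0, 5.0]

x_opt, cost = solve_lp_2d(c, A, b)
print(x_opt, cost)
```

For this instance the minimum is at x = (3, 1) with objective value 9: the cheaper measure is used up to its limit and the more expensive one covers the remainder.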
At the end of this research, an optimized model will have been developed for fault tolerance
and checkpoint in the cloud computing environment, which will contribute to the existing study in