Model For Fault Tolerance and Checkpoints in Cloud Computing Environment

Uploaded by

Anteyi Benedict

MODEL FOR FAULT TOLERANCE AND CHECKPOINTS IN CLOUD COMPUTING ENVIRONMENT
Name: Anteyi Benedict O.

Matric No: CSC/13/4987

1.1 Background of Study

Fault tolerance is the ability of a system to continue performing its intended functions in the
presence of faults. In a broad sense, fault tolerance is associated with reliability, with successful
operation, and with the absence of breakdowns. A fault-tolerant system should be able to handle
faults in individual hardware or software components, power failures, or other kinds of
unexpected problems and still meet its specification. Fault tolerance is necessary because it is
practically impossible to build a perfect system. The fundamental problem is that, as the
complexity of a system grows, its reliability drastically decreases unless compensatory measures
are taken. For example, if the reliability of each individual component is 99.99% and components
fail independently, then the reliability of a system consisting of 100 non-redundant components is
about 99.00% (0.9999^100), whereas the reliability of a system consisting of 10,000 non-redundant
components is just 36.79% (0.9999^10000). Such a low reliability is unacceptable in most
applications.
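The reliability figures above can be reproduced with a short calculation. The sketch below assumes a series system, in which every component must work and components fail independently, so the system reliability is the product of the component reliabilities. The function name `series_reliability` is illustrative and is not part of the original study.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of a non-redundant (series) system.

    Assumes all components share the same reliability and fail
    independently, so the system works only if every component works.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    return component_reliability ** n_components


if __name__ == "__main__":
    # Same component reliability as in the text: 99.99%.
    for n in (100, 10_000):
        r = series_reliability(0.9999, n)
        print(f"{n} components: {r:.4%}")
```

The calculation shows how quickly reliability falls as the component count grows, which is why redundancy or recovery mechanisms are needed at scale.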

Checkpointing is the process of saving system states periodically during failure-free execution.
With the checkpoint/restart fault tolerance strategy, if a failure occurs while a system is
running, the system can roll back to the latest checkpoint and restart from it, thereby bounding
the amount of lost work that must be recomputed. The trade-off is that checkpoint/restart
increases the wall-clock time of application execution, which in turn increases the execution cost.
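A minimal sketch of the checkpoint/restart idea is given here for illustration only. It assumes a single process whose state fits in a small dictionary and is saved to a local file with `pickle`. The function name `run_with_checkpoints`, the `fail_at` parameter used to simulate a failure, and the toy workload (summing integers) are all hypothetical and do not come from the original study.

```python
import os
import pickle


def run_with_checkpoints(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint file exists at `path`, execution rolls back to it
    instead of starting from step 0. `fail_at` simulates a crash just
    before the given step is executed.
    """
    # Restart: load the latest saved state if there is one.
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Checkpoint: save the state every `checkpoint_every` steps.
        if state["step"] % checkpoint_every == 0:
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            # Replace the old checkpoint only once the new one is fully written.
            os.replace(tmp_path, path)
    return state["total"]
```

With a checkpoint every 10 steps and a failure at step 57, a restart resumes from step 50, so at most 10 steps of work are ever recomputed. Each checkpoint write adds to the running time, which is the overhead the strategy trades for bounded recovery.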

Cloud computing, the long-held dream of computing as a utility, has opened up a new era of
computing, transformed a large part of the IT industry, and reshaped the purchase and use of
IT software and hardware. Cloud computing is a large-scale distributed computing paradigm
driven by economies of scale, in which a pool of abstracted, virtualized, dynamically scalable,
highly available, and configurable and reconfigurable computing resources (e.g., networks,
servers, storage, applications, and data) can be rapidly provisioned and released in the data
centers with minimal management effort. Services are delivered on demand to external
customers over the high-speed Internet with the “X as a service (XaaS)” computing architecture,
which is broken down into three segments: “applications”, “platforms”, and “infrastructure”. Its
aims are to provide users with more flexible services in a transparent manner and with ever
cheaper and more powerful processors. Cloud computing offers a new capacity and flexibility
solution for high performance computing (HPC) applications by provisioning a large number
of virtual machines for computationally intensive applications. Fault tolerance allows HPC
systems on the cloud with multiple nodes to complete the execution of computationally intensive
applications in the presence of faults. The most commonly used fault tolerance techniques for
HPC are checkpoint/restart and replication. The focus of this research is to present a model for
fault tolerance and checkpoints in a cloud computing environment.

1.2 Motivation

Cloud computing has the ability and flexibility to provision large amounts of computing resources
for High Performance Computing (HPC) scientific purposes. However, fault tolerance is one of the
challenges facing HPC applications on the cloud. The number of processors in today's HPC systems
keeps increasing, and these systems will also be provisioned on the cloud, where virtual instances,
communication links, and the integrated circuit environment run on virtual machines (VMs). Fault
tolerance (FT) for such applications therefore needs to ensure that computationally intensive
applications run smoothly and simultaneously, with reduced overhead and with visibility of the
environment. Fault tolerance is one of the major obstacles to opening up a new era of highly
serviceable cloud computing, because sockets as well as processors are prone to failure. It has
been predicted that a system with 100,000 processors will experience a processor failure every few
minutes. Fault-tolerant service is therefore an essential part of Service Level Objectives (SLOs),
and fault tolerance and checkpoint functions should be considered in clouds. When users transfer
their critical systems to clouds, they always ask whether cloud serviceability can achieve 100%
uptime. Unfortunately, cloud serviceability and fault tolerance are still far from perfect.
Failures are normal rather than exceptional in cloud computing environments, for several reasons.
Clouds support large-scale, time-critical data. Cloud platforms usually run on voluntary, much
cheaper, less powerful, virtual computing nodes. Cloud nodes are usually connected by
unpredictable communication links, so communication failures such as timeouts greatly influence
the serviceability of clouds. Finally, some malicious behavior occurs in clouds because nodes are
contributed by users. Demands for high fault tolerance and high serviceability are becoming
unprecedentedly strong, yet building a highly fault-tolerant and highly serviceable cloud remains
a critical, challenging, and urgently required task. Hence, in this project, an optimized model
for fault tolerance and checkpoint serviceability will be developed.

1.3 Objective of Study

i. To design an optimized model for fault tolerance and checkpoint in cloud computing
environment using linear programming.

ii. To implement the model in (i).

1.4 Methodology

The goal of any optimization problem is to minimize objective functions such as error, faults,
failure, fault tolerance degree, fault tolerance overhead, response time, and other factors that
affect the computation of processes in HPC, while maximizing the reliability of computational
operations and processes, subject to the operational constraints of the cloud computing system
environment. In this work, the main goal is to minimize the objective function for fault
tolerance, FŦ, subject to the units of the cloud computing environment.

The steps listed below were taken in order to develop the optimized model for fault tolerance

in cloud computing environment.

i. Literature Review: A detailed review of relevant literature on fault tolerance in cloud
computing was carried out.

ii. The strengths and limitations of reviewed works were highlighted.

iii. A mathematical model using Linear Programming Optimization will be implemented.
This model will be implemented by formulating a Linear Programming problem from
generation cost coefficients. Since the initial coefficients are non-linear in nature, they
will be linearized by finding the incremental linear approximation. The programming
problem that will be formulated will be of the form,

Minimize c^T x

Subject to Ax ≤ b

and x ≥ 0

where c^T x is the objective function, Ax ≤ b are the inequality constraints, and x ≥ 0 are
the non-negativity constraints.
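To make the standard form above concrete, the following is a small sketch of solving a two-variable linear program by enumerating the vertices of the feasible region. It is not the model proposed in this study. The function name `solve_lp_2d` and all coefficients are hypothetical, and the approach assumes the feasible region is non-empty and bounded so that an optimum lies at a vertex. A real implementation would normally use an established LP solver.

```python
from itertools import combinations


def solve_lp_2d(c, A, b, tol=1e-9):
    """Minimize c^T x subject to A x <= b and x >= 0, for two variables.

    Assumes a non-empty, bounded feasible region, so an optimal
    solution exists at a vertex (intersection of two constraint lines).
    Returns (best_point, best_value), or (None, inf) if nothing is feasible.
    """
    # Fold x >= 0 into the inequality system as -x1 <= 0 and -x2 <= 0.
    rows = [list(r) for r in A] + [[-1.0, 0.0], [0.0, -1.0]]
    rhs = list(b) + [0.0, 0.0]

    best_x, best_val = None, float("inf")
    for i, j in combinations(range(len(rows)), 2):
        (a11, a12), (a21, a22) = rows[i], rows[j]
        det = a11 * a22 - a12 * a21
        if abs(det) < tol:
            continue  # Parallel boundaries: no unique intersection point.
        # Solve the 2x2 system of the two boundary lines (Cramer's rule).
        x1 = (rhs[i] * a22 - a12 * rhs[j]) / det
        x2 = (a11 * rhs[j] - a21 * rhs[i]) / det
        # Keep the point only if it satisfies every constraint.
        if all(r[0] * x1 + r[1] * x2 <= h + tol for r, h in zip(rows, rhs)):
            val = c[0] * x1 + c[1] * x2
            if val < best_val:
                best_x, best_val = (x1, x2), val
    return best_x, best_val


if __name__ == "__main__":
    # Hypothetical example: two fault tolerance mechanisms with unit
    # overhead costs 2 and 3, each usable up to 3 units, and at least
    # 4 units needed in total (written as -x1 - x2 <= -4).
    c = [2.0, 3.0]
    A = [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
    b = [-4.0, 3.0, 3.0]
    print(solve_lp_2d(c, A, b))
```

Note how the "at least" requirement is rewritten with negated coefficients so that every constraint fits the Ax ≤ b form.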

1.5 Expected Contribution to Knowledge

At the end of this research, an optimized model will have been developed for fault tolerance
and checkpoint in cloud computing environment, contributing to the existing body of knowledge
on cloud computing systems.
