
Acta Scientific COMPUTER SCIENCES

Volume 4 Issue 1 January 2022


Literature Review

Fault Tolerance in Distributed System: A Review

Gajendra Sharma* and Santosh Sah


Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kathmandu, Nepal
*Corresponding Author: Gajendra Sharma, Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kathmandu, Nepal.
Received: October 19, 2021
Published: December 17, 2021
© All rights are reserved by Gajendra Sharma and Santosh Sah.

Abstract
Distributed systems consist of several interconnected hardware and software components, any of which may eventually fail. The system should therefore be designed with an appropriate fault tolerance technique, so that when a fault occurs it can recover from the failure without loss of service. Faults may occur for various reasons, such as communication failure, resource or hardware failure, faults in a process, or software errors. Any of these faults may put the system into a faulty state, in which it may not perform its task as expected and may produce faulty output or no output at all. This paper introduces faults, fault tolerance, and fault tolerance techniques in detail, drawing on previous research in the field of fault tolerance in distributed systems.

Keywords: Fault; Fault-Tolerance; Distributed System; Fault Tolerance Techniques

Introduction

Distributed computing systems consist of a variety of hardware and software components. Failure of any of these components can lead to unanticipated, potentially disruptive behavior and service unavailability [1]. In the event of a failure in the system or in any of its components, the system must be capable of operating as it does under normal conditions. This ability of a system to operate normally in the presence of failures is its fault tolerance. High availability of the system depends on its fault tolerance. Incorrect results or unpredictable service cannot be accepted in a real-time distributed system; examples include online transactions, process control, and computer-based communication systems. A system's fault-tolerant capability minimizes the loss that may be caused by unpredictable system behavior. The demand for fault tolerance in systems is increasing day by day.

Common characteristics of a distributed system are resource sharing, openness, scalability, transparency, and, most importantly, fault tolerance. In a distributed system the individual workstations communicate with each other by passing messages. There is always a chance that a fault will occur, whether due to communication failure, hardware failure, shortage of memory, software bugs, etc.
Figure 1: Distributed system [2].

Citation: Gajendra Sharma and Santosh Sah. “Fault Tolerance in Distributed System: A Review". Acta Scientific Computer Sciences 4.1 (2022): 36-40.

The above figure (Figure 1) shows an example of a real-time system in a distributed environment. A real-time system is highly dependent on deadlines: a task given to the system must be completed within the allocated amount of time, and a result obtained after that period is of no use. Some examples of real-time systems are nuclear systems, robotic controls, medical equipment, and defense systems. Distributed systems are presented to the user as a single system image and can easily expand their resources based on the load on the system. Additional components can be added to the system and used when a fault occurs, helping to reduce the impact of the fault and to produce better output.

Below are a few terms related to fault tolerance in distributed systems.

Fault – At the lowest level of abstraction a fault can be termed a “defect”. It can lead to an inaccurate system state. Faults in the system can be categorized based on time as below:
• Transient: This type of fault occurs once and then disappears.
• Intermittent: This type of fault occurs many times in an irregular way.
• Permanent: This type of fault persists and brings the system to a halt.

Error – May be defined as a state of the system which is undesirable and may lead to failure of the system.

Failure – May be defined as a fault due to unintentional intrusion. The different types of failure are as below:
• Crash Failure: A server halts, but works correctly until it halts.
• Omission Failure: A server fails to respond to incoming requests.
    • Receive Omission: A server fails to receive incoming messages.
    • Send Omission: A server fails to send messages.
• Timing Failure: A server's response lies outside the specified time interval.
• Response Failure: The server's response is incorrect.
    • Value Failure: The value of the response is wrong.
    • State Transition Failure: The server deviates from the correct flow of control.
• Arbitrary (Byzantine) Failure: A server may produce arbitrary responses at arbitrary times.

Fault Tolerance – The ability of a system to behave in a well-defined manner upon the occurrence of faults.

Recovery – A passive approach in which the state of the system is saved and later used to roll back execution to a predefined checkpoint.

Redundancy – With respect to fault tolerance, the replication of hardware, software components, or computation.

Security – Robustness of the system, characterized by secrecy, integrity, availability, reliability, and safety during its operation.

Fault tolerance is needed for the availability, reliability, safety, maintainability, and security of the system:
• Availability: The system must be usable immediately at any time.
• Reliability: The system must work for a long period of time without error.
• Safety: There should be no catastrophic consequences of a temporary failure.
• Maintainability: It must be possible to repair the system and fix the fault quickly and easily.
• Security: The system should be able to resist attacks against its integrity.

Failure masking by redundancy
• Information redundancy: Extra bits are added (e.g. CRC).
• Time redundancy: An action may be redone (e.g. a transaction after an abort).
• Physical redundancy: Hardware and software components may be multiplied (e.g. adding an extra disk, replicating the database, or TMR).

Triple modular redundancy (TMR)
TMR uses the principle of majority voting. Each device is replicated three times and the signal passes through all three devices. If one device fails, a voter can reproduce the correct value from the two correct signals. It is assumed that at every stage at most one device and one voter may fail.
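
The majority-voting principle behind TMR, as described above, can be sketched in a few lines of Python. This is a minimal illustration and is not taken from the paper: the function names `majority_vote` and `run_tmr` and the example replicas `healthy` and `faulty` are assumptions made for the example, and the voter itself is assumed to be fault-free.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(signals: Sequence[int]) -> int:
    """Return the value reported by a strict majority of the replicas."""
    value, count = Counter(signals).most_common(1)[0]
    if 2 * count <= len(signals):
        raise ValueError("no majority: more than one replica disagrees")
    return value


def run_tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to three replicas and vote on their outputs."""
    if len(replicas) != 3:
        raise ValueError("TMR expects exactly three replicas")
    return majority_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    # Hypothetical correct device: computes the intended function.
    return x * x


def faulty(x: int) -> int:
    # Hypothetical failed device: always emits the same wrong value.
    return -1


if __name__ == "__main__":
    # One of the three replicas is faulty; the voter masks the failure.
    print(run_tmr([healthy, faulty, healthy], 4))
```

With a single faulty replica the voter still returns the value agreed on by the other two. If all three outputs differ, no majority exists and the sketch raises an error rather than guessing.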


Literature Review

A lot of research has already been performed, and is still ongoing, in the field of fault tolerance in distributed systems. Research and experimentation efforts began in earnest in the 1970s and continued through the 1990s, with focused interest peaking in the late 1980s.

A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved even modest commercial success.

Different authors have reviewed the concept of fault-tolerant computing systems, such as Ramamoorthy [1967], Short [1968], Avizienis [1971], Kuhl and Reddy [1980, 1981], and Bagchi and Hakimi [1991]. The SAPO computer, built in Prague, Czechoslovakia, was probably the first fault-tolerant computer. It was built in 1950-1954 under the supervision of Antonin Svoboda, using relays and a magnetic drum, and was operated in 1957-1960.

According to Leslie Lamport [3], time should be used instead of timeouts to increase fault tolerance. A general method is described for implementing a distributed system with any desired degree of fault tolerance. Instead of relying upon explicit timeouts, processes execute a simple clock-driven algorithm. For the solution to the Byzantine problem, the author assumes reliable clock synchronization.

According to Pavan., et al. [4], fault resilience techniques can be broadly classified into three categories as below:
• Hardware Resilience
• Resilient System Software
• Application Based Resilience

According to Diego Zuquim Guimarães Garcia., et al. [5], the web service architecture still lacks the facilities to support fault tolerance. The authors propose an architecture that provides mediation and monitoring for web services.

Arvind Kumar., et al. [7] discuss the types of faults occurring in a system, along with fault detection and recovery techniques. A system after failure can be in one of the three states below:
• Fail Stop System
• Byzantine System
• Fail-Fast System

They mention the following approaches for fault tolerance in real-time distributed systems:
• Replication
    • Job Replication
    • Component Replication
    • Data Replication
• Checkpointing
• Scheduling/Redundancy
    • Space Scheduling/Redundancy
    • Time Scheduling/Redundancy
    • Hybrid Redundancy

A common way to handle crashes involves two steps: (1) detect the failure; and (2) recover, by restarting or failing over the crashed component. Failure recovery has received a lot more attention than failure detection. Joshua B. Leners., et al. [9] have presented a failure detection mechanism named Falcon. According to the authors, Falcon achieves its features by coordinating a network of spies, each monitoring a layer of the system.

Padmakumari (2015) [10] has reviewed diverse fault tolerance and monitoring mechanisms for enhancing reliability in cloud computing environments. The review presents various techniques and methods used for fault tolerance and also discusses future research directions in cloud fault tolerance. Joshi (2014) [11] has presented the concept of virtual data centres (VDC), which is based on the migration technique. In this methodology, if a virtual machine is overloaded, some of its resources are migrated to another virtual machine to handle the server failure. FTT-based designs have proved to be a viable solution in the scope of adaptive systems, and recent works in this area show that there is ongoing interest in improving the RT-related features of the FTT protocol [12].

Fault tolerance model
Fault model
The fault model describes which faults, and which associated rates of occurrence, are assumed by the system being designed. According to different viewpoints, faults can be classified by: system boundaries (internal or external); phenomenological cause (natural or human-made); intent (deliberate or non-deliberate); capacity (accidental or incompetence); and persistence (permanent or transient). Two examples of the faults considered by the presented fault model are


physical deterioration and physical interference, as seen in Figure 2. These faults are caused by processes such as radiation, power transients, noisy input lines, etc.

Figure 2: Fault tolerance model.

Summary and Conclusion

When a fault occurs in a system, the system requires a fault tolerance method to detect the fault and recover to its normal state. Fault tolerance techniques are also required to predict failures and take appropriate action before the faults actually occur. Fault detection is as important as failure recovery for a better fault tolerance mechanism in a system.

All the activities executed by a fault-tolerant system have to be synchronized. For example, node replicas first have to execute certain application tasks in order to produce results, then exchange the produced results by transmitting and receiving messages using the network protocol, and then execute application tasks that vote on the locally produced result and the results received through the channel.

After going through previous works in the field of fault tolerance, it is found that several fault tolerance models exist. Transient link faults may affect the capacity of a node for transmitting/receiving, but they are transparently tolerated by using the pro-active retransmission mechanism. Replication of hardware and software is the most common technique used for fault tolerance. Faults may lead a node replica to become desynchronized at the communication and/or the application level beyond the error recovery capacity. Thus, it was realized that it was necessary to propose more sophisticated recovery mechanisms to prevent undesirable fault attrition. These replication techniques may not be economical. Many fault tolerance techniques have been proposed, but no single fault tolerance mechanism can fulfill all aspects of fault tolerance. The model can be used with the necessary customization according to the system that is being designed.

Bibliography

1. Flavin Cristian. “Understanding Fault-Tolerant Distributed Systems”. Computer Science and Engineering, University of California, San Diego (1993).

2. Lakshmi PS. “Distributed Fault Tolerance System in Real Time Environment”. Kundal Kr. Medhi, International Journal of Advance Research in Computer Science and Software Engineering (2013).

3. Leslie Lamport. “Using Time instead of Timeout for Fault Tolerant Distributed System”. SRI International (2017).

4. Pavan B., et al. “Fault Tolerance Techniques for Scalable Computing”. Mathematics and Computer Science Division, Argonne National Laboratory (2014).

5. Diego Z., et al. “A Fault Tolerant Web Service Architecture”. Institute of Computing University of Campinas, São Paulo, Brazil (2016).

6. Avizienis A. “Fault Tolerance Computing-An overview”. IEEE Computer 3 (2011).

7. Arvind Kumar., et al. “Fault Tolerance in Real Time Distributed System”. International Journal on Computer Science and Engineering 3 (2011): 933-939.

8. Mitvin S. “Fault Tolerant Distributed System”. Department of Computer Science and Engineering, University of Texas at Arlington (2019).

9. Joshua B L., et al. “Detecting Failure in Distributed Systems with FALCON spy network”. The University of Texas at Austin, Microsoft Research Silicon Valley (2011).

10. Padmakumari P. “Methodical Review on Various Fault Tolerant and Monitoring Mechanisms to improve Reliability on Cloud Environment”. Indian Journal of Science and Technology 8 (2015).

11. Joshi SC and Sivalingam KM. “Fault tolerance mechanisms for virtual data center architectures”. Photonic Network Communications 28 (2014): 154-164.


12. Garibay-Martínez R. “Improved Holistic Analysis for Fork–Join Distributed Real-Time Tasks Supported by the FTT-SE Protocol”. In: IEEE Transactions on Industrial Informatics 12.5 (2016): 1865-1876.
