Ascs 04 0213
Abstract
Distributed systems consist of several hardware and software components connected together, any of which may eventually fail. The system should therefore be designed with an appropriate fault tolerance technique, so that when a fault occurs it can recover from the failure without loss of service. Faults may occur for various reasons, such as communication failure, resource or hardware failure, process faults, and software errors. Any of these faults can put the system into a faulty state, in which it may not perform its task as expected and may produce faulty output or no output at all. This paper introduces faults, fault tolerance, and fault tolerance techniques in detail, drawing on previous research on fault tolerance in distributed systems.
Citation: Gajendra Sharma and Santosh Sah. “Fault Tolerance in Distributed System: A Review". Acta Scientific Computer Sciences 4.1 (2022): 36-40.
Fault Tolerance in Distributed System: A Review
Figure 1 shows an example of a real-time system in a distributed environment. A real-time system is highly dependent on deadlines: a task given to the system must be completed within the allocated amount of time, and a result obtained after that period is of no use. Some ex-
• Arbitrary (Byzantine) Failure: A server may produce arbitrary responses at arbitrary times.
Fault Tolerance – Ability of a system to behave in a well-defined manner upon the occurrence of faults.
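
The deadline requirement and the definition of fault tolerance given above can be illustrated together. The following is a minimal sketch, not taken from the reviewed work: it assumes that a missed deadline is treated as a fault and that the well-defined behaviour is to discard the late result and return a fallback value. The names `run_with_deadline` and `slow_sensor_read`, and the status strings, are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as DeadlineMissed


def run_with_deadline(task, deadline_s, fallback):
    """Return (value, status); a late or failed task yields the fallback.

    Hypothetical helper: the names and the fallback policy are assumptions
    made for illustration, not part of the reviewed work.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task)
    try:
        # A result is only useful if it arrives within the deadline.
        return future.result(timeout=deadline_s), "ok"
    except DeadlineMissed:
        # Late result: discard it and behave in a well-defined way.
        return fallback, "deadline_missed"
    except Exception:
        # The task itself failed (for example, a software error).
        return fallback, "task_failed"
    finally:
        # Do not wait for a task that has already missed its deadline.
        executor.shutdown(wait=False)


def slow_sensor_read():
    """Stand-in for a computation that takes too long."""
    time.sleep(0.5)
    return 21.5


if __name__ == "__main__":
    print(run_with_deadline(lambda: 21.5, deadline_s=1.0, fallback=None))
    print(run_with_deadline(slow_sensor_read, deadline_s=0.05, fallback=None))
```

Note that this sketch only stops waiting for the late task; the worker thread keeps running in the background. A real-time system would typically need a way to preempt or cancel the overdue work, which is outside the scope of this illustration.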
Literature Review
A great deal of research has been performed, and is still ongoing, in the field of fault tolerance in distributed systems. Research and experimentation efforts began in earnest in the 1970s and continued through the 1990s, with focused interest peaking in the late 1980s.
A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved even modest commercial success.
Different authors have reviewed the concept of fault-tolerant computing systems, including Ramamoorthy [1967], Short [1968], Avizienis [1971], Kuhl and Reddy [1980, 1981], and Bagchi and Hakimi [1991]. The SAPO computer, built in Prague, Czechoslovakia in 1950-1954 under the supervision of Antonin Svoboda, was probably the first fault-tolerant computer. It used relays and a magnetic drum and was operated in 1957-1960.
According to Leslie Lamport [3], time should be used instead of timeouts to increase fault tolerance. A general method is described for implementing a distributed system with any desired degree of fault tolerance. Instead of relying upon explicit timeouts, processes execute a simple clock-driven algorithm. For the solution to the Byzantine problem, the author assumes reliable clock synchronization.
According to Paval., et al. [4], fault resilience techniques can be broadly classified into three categories:
• Hardware Resilience
• Resilient System Software
• Application Based Resilience
According to Diego Zuquim Guimarães Garcia., et al. [5], the web service architecture still lacks the facilities to support fault tolerance. The authors propose an architecture that provides mediation and monitoring for web services.
Arvind Kumar., et al. [7] discuss the types of faults occurring in a system, along with fault detection and recovery techniques. After a failure, a system can be in one of the three states below:
• Fail Stop System
• Byzantine System
• Fail-Fast System
They also mention some approaches for fault tolerance in real-time distributed systems, as follows:
• Replication
  • Job Replication
  • Component Replication
  • Data Replication
• Check pointing
• Scheduling/Redundancy
  • Space Scheduling/Redundancy
  • Time Scheduling/Redundancy
  • Hybrid Redundancy
A common way to handle crashes involves two steps: (1) detect the failure; and (2) recover, by restarting or failing over the crashed component. Failure recovery has received far more attention than failure detection. Joshua B. Leners., et al. [9] present a failure detector named Falcon. According to the authors, Falcon works by coordinating a network of spies, each monitoring a layer of the system.
Padmakumari (2015) [10] presents diverse fault tolerance and monitoring mechanisms to enhance reliability in cloud computing environments. The review describes the various techniques and methods used for fault tolerance and also focuses on future research directions in cloud fault tolerance. Joshi (2014) [11] introduces the concept of virtual data centres (VDC), which is based on the migration technique. In this methodology, if a virtual machine is overloaded, some of its resources are migrated to another virtual machine to handle the server failure. TT-based designs have proved to be a viable solution in the scope of adaptive systems, and recent work in this area shows an ongoing interest in further improving the RT-related features of the FTT protocol [12].
Fault tolerance model
Fault model
The fault model describes which faults, and which associated rates of occurrence, are assumed by the system being designed. From different viewpoints, faults can be classified by: system boundaries (internal or external); phenomenological cause (natural or human-made); intent (deliberate or non-deliberate); capacity (accidental or incompetence); and persistence (permanent or transient). Two examples of the faults considered by the presented fault model are
physical deterioration and physical interference, as seen in Figure 2. These faults are caused by processes such as radiation, power transients, and noisy input lines.
All the activities executed by a fault-tolerant system have to be synchronized. For example, node replicas first have to execute certain application tasks to produce results, then exchange the produced results by transmitting and receiving messages using the network protocol, and finally execute application tasks that vote on the locally produced result and the results received through the channel.
After going through previous work in the field of fault tolerance, it is found that several fault tolerance models are present.
Transient link faults may affect the capacity of a node to transmit or receive, but they are transparently tolerated by the pro-active retransmission mechanism. Replication of hardware and software is the most common technique used for fault tolerance. Faults may lead a node replica to become desynchronized at the communication and/or application level beyond the error recovery capacity. Thus, it was realized that more sophisticated recovery mechanisms were needed to prevent undesirable fault attrition. These replication techniques may not be economical. Many fault tolerance techniques have been proposed, but no single fault tolerance mechanism can fulfill all aspects of fault tolerance. The model can be used with the necessary customization according to the system being designed.
Bibliography
7. Arvind Kumar., et al. “Fault Tolerance in Real Time Distributed System”. International Journal on Computer Science and Engineering 3 (2011): 933-939.
8. Mitvin S. “Fault Tolerant Distributed System”. Department of Computer Science and Engineering, University of Texas at Arlington (2019).
9. Joshua B L., et al. “Detecting Failure in Distributed Systems with FALCON spy network”. The University of Texas at Austin, Microsoft Research Silicon Valley (2011).
10. Padmakumari P. “Methodical Review on Various Fault Tolerant and Monitoring Mechanisms to improve Reliability on Cloud Environment”. Indian Journal of Science and Technology 8 (2015).
11. Joshi SC and Sivalingam KM. “Fault tolerance mechanisms for virtual data center architectures”. Photonic Network Communications 28 (2014): 154-164.