
DC Unit 4 Important

Uploaded by

Logesh Waran

1. Analyze the Byzantine Agreement Problem in distributed systems.
Explain its significance, the challenges it poses, and the main solutions
or algorithms developed to address it, including their impact on system
performance and fault tolerance.
OR
2. How does asynchronous checkpointing differ from synchronous
checkpointing, and why might it be preferred in large-scale distributed
systems? Illustrate with the algorithm proposed by Juang and Venkatesan
for asynchronous checkpointing and recovery, including its key steps and
benefits.
3. Summarize agreement in a failure-free system. Explain how agreement
is reached in message-passing synchronous systems with failures.
OR
4. Compare consistent sets of checkpoints and rollback-recovery
techniques in detail.

Demonstrate how agreement is reached in a message-passing synchronous
system with failures. How is agreement affected in a synchronous system?
OR
Sketch how asynchronous checkpointing differs from synchronous
checkpointing and why it might be preferred in large-scale distributed
systems. Illustrate with the algorithm proposed by Juang and Venkatesan
for asynchronous checkpointing and recovery, including its key steps and
benefits.
Describe why rollback recovery in distributed systems is complicated.
Explain in detail.
OR
Illustrate the checkpoint-based rollback-recovery techniques in detail.
“Recovery and Consensus are two important concepts in distributed
systems, particularly in the context of ensuring system reliability, fault
tolerance, and consistency.”

• Recovery in distributed systems is about bringing the system back to
a consistent state after failures. It involves techniques such as rollback
and forward recovery, log management, and checkpointing.
• Consensus is the process by which distributed nodes agree on a
single value or decision, ensuring consistency even in the presence of
faults and failures. Consensus algorithms like Paxos, Raft, and BFT are
fundamental for ensuring that distributed systems can continue to
operate reliably and consistently.

1. Recovery in Distributed Systems


Recovery in distributed systems refers to the mechanisms and strategies
used to restore the system to a consistent state after a failure. Distributed
systems are inherently vulnerable to various types of failures, including
hardware crashes, software bugs, communication errors, and network
partitions. Recovery ensures that the system can continue to operate
correctly despite these failures.
Key Aspects of Recovery:
• Failure Detection: Identifying when and where failures occur is
the first step in recovery. This could be a crash of a node, loss of
connectivity, or other faults that disrupt normal operation.
• Rollback Recovery: After a failure, the system might roll back to a
previous consistent state. This could involve undoing operations
that have been partially completed or that may have been
compromised due to the failure. Techniques such as logging,
checkpoints, and versioning are commonly used to help with
rollback.
• Forward Recovery: In some cases, instead of rolling back to a
previous state, the system may attempt to "move forward" by
repairing or reconstructing a failed state using available
information (e.g., by retrying operations or restoring from
backups).
• State Persistence: To facilitate recovery, systems often rely on
logs and checkpoints. Logs record the sequence of operations
performed, and checkpoints capture the state of the system at
particular points in time. When recovery is needed, these logs and
checkpoints can be used to bring the system back to a known good
state.
• Consistency After Recovery: Recovery mechanisms must ensure
that the system returns to a consistent and correct state, particularly
in cases where operations have been distributed across multiple
nodes. The challenge is to maintain data integrity and consistency
when some parts of the system may have failed.
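
The interplay of logs and checkpoints described above can be sketched for a single node as follows. This is a minimal illustration, not a full distributed protocol: the class `RecoverableStore` and its method names are hypothetical, and in-memory attributes stand in for stable storage. Recovery rolls back to the last checkpoint and then replays the logged operations.

```python
import copy


class RecoverableStore:
    """Toy key-value store with a checkpoint and an operation log (hypothetical)."""

    def __init__(self):
        self.state = {}        # volatile state, lost on a crash
        self.checkpoint = {}   # last saved snapshot (stands in for stable storage)
        self.log = []          # operations applied since the last checkpoint

    def apply(self, key, value):
        # Log first, then apply, so a logged operation is never lost.
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        # Save a snapshot and drop the log entries it makes redundant.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate a failure that wipes volatile state only.
        self.state = {}

    def recover(self):
        # Roll back to the checkpoint, then replay the logged operations.
        self.state = copy.deepcopy(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value


store = RecoverableStore()
store.apply("x", 1)
store.take_checkpoint()
store.apply("y", 2)
store.crash()
store.recover()
print(store.state)  # {'x': 1, 'y': 2}
```

In a real distributed system the harder part is choosing checkpoints on different nodes that together form a consistent global state; this sketch leaves that out.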

Key Challenges in Consensus:


• Network Partitions: In the presence of network partitions (i.e.,
when parts of the network are temporarily isolated from others),
achieving consensus is challenging. Some nodes may be unable to
communicate with others, leading to situations where nodes are
unable to make decisions or where they might diverge in their
decisions.
• Performance: Achieving consensus, particularly in a large system
with many nodes, can be slow and resource-intensive. The process
of ensuring that all nodes reach agreement often requires multiple
rounds of communication and coordination.
• Fault Tolerance: Consensus protocols must ensure that the system
can tolerate a certain number of failures (e.g., crashes or network
delays). However, there is a trade-off between the number of
failures the system can tolerate and the number of nodes required:
tolerating f crash failures typically requires at least 2f + 1 nodes,
and tolerating f Byzantine failures requires at least 3f + 1.
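
A small calculation makes the fault-tolerance trade-off concrete. The sketch below assumes the standard bounds (a majority quorum for crash failures, n >= 3f + 1 for Byzantine failures); the function names are illustrative only.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def max_crash_faults(n: int) -> int:
    """Crash failures a majority-quorum protocol can tolerate with n nodes."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Byzantine failures tolerable with n nodes under the n >= 3f + 1 bound."""
    return (n - 1) // 3


for n in (3, 4, 5, 7):
    print(n, majority(n), max_crash_faults(n), max_byzantine_faults(n))
# 3 2 1 0
# 4 3 1 1
# 5 3 2 1
# 7 4 3 2
```

Note that going from 3 to 4 nodes adds no crash tolerance, which is why crash-tolerant clusters are usually sized with an odd number of nodes.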
2. Consensus in Distributed Systems
Consensus refers to the process by which a group of distributed nodes
(or processes) agree on a common decision or value, even in the
presence of failures or network partitions. Achieving consensus is
crucial for maintaining consistency in distributed systems, where nodes
do not have access to a global memory and may be asynchronous, fail
independently, or become temporarily disconnected.
Key Aspects of Consensus:
• Agreement Among Nodes: Consensus ensures that all
participating nodes agree on a common decision (e.g., whether to
commit a transaction or which value to choose as the current state).
It is critical in scenarios such as distributed databases, coordination
services, and fault-tolerant systems.
• Fault Tolerance: A key requirement of consensus protocols is that
they must tolerate certain types of failures (e.g., by recovering from
crashed nodes or handling network splits). The system must ensure
that even if some nodes fail or are unreachable, the remaining
nodes can still make a decision and continue to function correctly.
• Safety and Liveness:
o Safety: Ensures that no two nodes can decide on different
values (i.e., the system avoids conflicting decisions).
o Liveness: Guarantees that eventually a decision will be made
(i.e., the system avoids deadlock or indefinite waiting).
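
The two properties can be stated as simple checks over the outcome of a run. This is an illustrative sketch: the dictionary format (node name mapped to its decided value, or `None` if undecided) and the function names are assumptions. Liveness concerns what eventually happens, so the second check can only confirm it for a run that has already finished.

```python
def check_safety(decisions: dict) -> bool:
    """Safety: all nodes that have decided chose the same value."""
    decided = {value for value in decisions.values() if value is not None}
    return len(decided) <= 1


def check_termination(decisions: dict, correct_nodes: set) -> bool:
    """Liveness at the end of a finite run: every correct node has decided."""
    return all(decisions.get(node) is not None for node in correct_nodes)


# n3 crashed before deciding; n1 and n2 are the correct nodes.
run = {"n1": "commit", "n2": "commit", "n3": None}
print(check_safety(run))                     # True
print(check_termination(run, {"n1", "n2"}))  # True
```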

Consensus Algorithms: Several consensus algorithms have been
proposed and widely used in distributed systems, each with its trade-offs
in terms of performance, fault tolerance, and complexity.
1. Paxos:
o Paxos is a well-known consensus algorithm in which a value is chosen
once a majority of nodes accept it. A proposer first asks the acceptors
to promise support for a numbered proposal, then asks them to accept a
value; the value is chosen if a majority accepts. Paxos never lets two
nodes decide differently, even in the presence of network partitions or
crashes, and it makes progress as long as a majority of nodes are
functioning and can communicate, although termination is not
guaranteed in a fully asynchronous system.
o Challenges: Paxos can be difficult to implement efficiently
due to its complexity. It is often described as elegant but hard to
apply in practice because its details are difficult to understand and
implement.
2. Raft:
o Raft was designed as a more understandable alternative to
Paxos while providing similar fault tolerance guarantees. Raft
is widely used in modern distributed systems (e.g., etcd,
Consul, and HashiCorp Vault). It organizes the nodes in a
leader-follower configuration where the leader is responsible
for managing the log entries and making decisions. Consensus
is achieved by having the leader propose log entries and commit
each one once a majority of the nodes have stored it.
o Raft's advantages: Raft is easier to implement and reason
about than Paxos, which has made it more popular in
practical applications.
3. Byzantine Fault Tolerance (BFT):
o BFT algorithms (e.g., PBFT) are used in environments where
nodes may exhibit arbitrary (Byzantine) failures. A Byzantine
node might behave incorrectly, send conflicting messages, or
act maliciously, so BFT algorithms ensure that the system can
still reach consensus as long as the faulty nodes stay below a
bound (fewer than one third of the nodes in PBFT).
o Applications: BFT is commonly used in blockchain systems
(e.g., Hyperledger and other platforms with BFT-based consensus
protocols) and other environments where malicious behavior needs
to be tolerated.
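
Returning to the Paxos entry in the list above, its propose-then-accept flow can be sketched as single-decree Paxos. This is a simplified illustration under stated assumptions: direct method calls replace network messages, no node fails, and the names `Acceptor` and `propose` are hypothetical rather than taken from any library.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered proposal was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1

    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None

    # If some acceptor already accepted a value, adopt the highest-numbered one.
    prior_n, prior_value = max(granted, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value

    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="A")
second = propose(acceptors, n=2, value="B")
print(first, second)  # A A
```

The second proposer asked for "B" but ends up confirming "A": once a value has been chosen, later proposals must adopt it, which is how Paxos prevents conflicting decisions.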
