DS Unit - 4
Fault Tolerance: Introduction, Process Resilience, Reliable Client-Server Communication, Reliable Group
Communication, Distributed Commit, Recovery.
Fault Tolerance
Fault Tolerance in Distributed System
Fault tolerance in distributed systems is the capability to continue operating smoothly despite failures or errors in
one or more of its components. This resilience is crucial for maintaining system reliability, availability, and
consistency. By implementing strategies like redundancy, replication, and error detection, distributed systems can
handle various types of failures, ensuring uninterrupted service and data integrity.
In distributed systems, three related types of problems occur: faults, errors, and failures.
Fault: A fault is a weakness or shortcoming in the system or in any hardware or software component. The presence of a fault can lead to an error and, eventually, to a failure.
Error: An error is the part of the system's state that deviates from the correct state as a result of a fault.
Failure: A failure is the outcome in which the system does not achieve its assigned goal.
Fault Tolerance is defined as the ability of the system to function properly even in the presence of any
failure. Distributed systems consist of multiple components due to which there is a high risk of faults occurring. Due
to the presence of faults, the overall performance may degrade.
Types of Faults
Transient Faults: Transient faults occur once and then disappear. They do not harm the system to a great extent, but they are very difficult to locate. A momentary processor fault is an example of a transient fault.
Intermittent Faults: Intermittent faults occur repeatedly: the fault appears, vanishes on its own, and then reappears. A working computer that occasionally hangs and then resumes is an example of an intermittent fault.
Permanent Faults: Permanent Faults are the type of faults that remain in the system until the component is
replaced by another. These types of faults can cause very severe damage to the system but are easy to
identify. A burnt-out chip is an example of a permanent fault.
Characteristics of a Fault-Tolerant System
1. Availability: Availability is defined as the property where the system is readily available for use at any time.
2. Reliability: Reliability is defined as the property where the system can work continuously without any
failure.
3. Safety: Safety is defined as the property where the system can remain safe from unauthorized access even if
any failure occurs.
4. Maintainability: Maintainability describes how easily and quickly a failed node or system can be repaired.
Fault Tolerance in Distributed Systems
To implement fault-tolerance techniques in distributed systems, the design, configuration, and relevant applications need to be considered. Below are the phases carried out for fault tolerance in any distributed system.
1. Fault Detection
Fault Detection is the first phase, in which the system is monitored continuously and the outcomes are compared with the expected output. If any faults are identified during monitoring, they are reported. These faults can occur for various reasons such as hardware failure, network failure, and software issues. The main aim of this first phase is to detect faults as soon as they occur so that the assigned work is not delayed.
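As an illustration of this phase, the sketch below is a minimal, hypothetical heartbeat-based fault detector in Python: every node is expected to report within a timeout, and any node that misses the deadline is flagged as suspect. The timeout value and node identifiers are assumptions made for the example.

```python
import time

# Hypothetical heartbeat-based fault detector: a node is suspected faulty
# if it has not reported a heartbeat within the timeout window.
HEARTBEAT_TIMEOUT = 5.0        # seconds without a heartbeat before suspicion
last_seen = {}                 # node id -> timestamp of the last heartbeat

def record_heartbeat(node_id):
    """Called whenever a heartbeat message arrives from a node."""
    last_seen[node_id] = time.time()

def suspected_faulty(now=None):
    """Return the nodes whose heartbeats are overdue (possible faults)."""
    now = now if now is not None else time.time()
    return [node for node, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT]

# Example: node B stops reporting and is eventually flagged as suspect.
record_heartbeat("A")
last_seen["B"] = time.time() - 10    # simulate a node that went silent 10 s ago
print(suspected_faulty())            # ['B']
```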
2. Fault Diagnosis
Fault diagnosis is the process in which the fault identified in the first phase is analyzed to determine its root cause and its likely nature. Fault diagnosis can be done manually by the administrator or by using automated techniques, so that the fault can be resolved and the given task performed.
3. Evidence Generation
Evidence generation is defined as the process where the report of the fault is prepared based on the diagnosis done
in an earlier phase. This report involves the details of the causes of the fault, the nature of faults, the solutions that
can be used for fixing, and other alternatives and preventions that need to be considered.
4. Assessment
Assessment is the process where the damages caused by the faults are analyzed. It can be determined with the help
of messages that are being passed from the component that has encountered the fault. Based on the assessment
further decisions are made.
5. Recovery
Recovery is the process whose aim is to make the system fault-free and restore it to a consistent state, either by rolling the system back to an earlier state (backward recovery) or by moving it forward to a new correct state (forward recovery). Common recovery techniques such as reconfiguration and resynchronization can be used.
Types of Fault Tolerance in Distributed Systems
1. Hardware Fault Tolerance: Hardware Fault Tolerance involves keeping a backup plan for hardware devices
such as memory, hard disk, CPU, and other hardware peripheral devices. Hardware Fault Tolerance is a type
of fault tolerance that does not examine faults and runtime errors but can only provide hardware backup.
The two different approaches that are used in Hardware Fault Tolerance are fault-masking and dynamic
recovery.
2. Software Fault Tolerance: Software Fault Tolerance is a type of fault tolerance where dedicated software is
used in order to detect invalid output, runtime, and programming errors. Software Fault Tolerance makes
use of static and dynamic methods for detecting faults and providing solutions. Software Fault Tolerance also includes additional mechanisms such as rollback recovery and checkpoints.
3. System Fault Tolerance: System Fault Tolerance covers the system as a whole. Its advantage is that it stores not only checkpoints but also memory blocks and program state, and it detects errors in applications automatically. If the system encounters any type of fault or error, it provides the required mechanism for recovery, which makes system fault tolerance reliable and efficient.
Process resilience
Process Resilience is a critical aspect of fault tolerance in distributed systems. It refers to the system's ability to
handle and recover from process failures while maintaining availability, reliability, and consistent operation. The goal
is to ensure that individual process failures do not compromise the overall functionality of the system.
1. Key Mechanisms:
o Failure Detection: Identifies failed processes using mechanisms like heartbeats, timeout checks, or monitoring tools.
o Recovery: Restarts or replaces failed processes and restores their state so that work can continue.
o Redundancy: Ensures critical tasks are handled by multiple processes to avoid single points of failure.
2. Process Failures:
o Crash Failures: A process halts and does not resume (no further communication is received).
o Byzantine Failures: A process behaves arbitrarily or maliciously, possibly sending conflicting or incorrect information.
o Omission Failures: A process fails to send or receive messages it was expected to handle.
o Timing Failures: A process responds, but outside the expected time bounds.
3. Design Principles:
o Idempotence: Ensure that re-executing a process produces the same result (e.g., retrying a failed transaction).
o Isolation: Contain the effect of a failure so that one faulty process cannot corrupt the others.
o Graceful Degradation: Continue to offer reduced but useful service when some processes have failed.
Techniques for Process Resilience:
1. Replication:
o Run multiple copies of a critical process so that a replica can take over when one fails.
o Types: active replication (all replicas process every request) and passive (primary-backup) replication.
2. Process Monitoring:
o Continuously observe the health of processes so that failures are noticed quickly.
o Types: heartbeat-based monitoring and supervisor (watchdog) processes.
3. Process Migration:
o Move a process to another node if the current node fails or becomes overloaded.
4. Leader Election:
o Choose a new coordinator among the surviving processes when the current one fails.
o Protocols:
Bully Algorithm: The process with the highest priority becomes the leader (a short sketch is given below, after the challenges list).
5. Failover Mechanisms:
o Automatically redirect work from a failed process to a standby replica.
6. Consensus Protocols:
o Protocols such as Paxos allow the remaining processes to agree on a single value or state despite failures.
Goals: Process resilience preserves reliability and, above all, availability: processes remain able to serve requests despite individual failures.
Challenges:
1. Network Partitioning:
o Partitioning can isolate processes, making it difficult to detect failures or to recover from them.
2. Byzantine Failures:
o Handling malicious or arbitrary failures requires complex protocols like Byzantine Fault Tolerance (BFT).
3. State Synchronization:
o Keeping the state of replicas consistent after failures and recoveries is difficult and costly.
4. Resource Overheads:
o Replication, monitoring, and checkpointing consume additional CPU, memory, and network bandwidth.
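The Bully Algorithm mentioned above can be sketched very compactly: among the processes that are still alive, the one with the highest identifier wins the election. The single-function version below collapses the message exchange into one step and uses illustrative process identifiers; it is a sketch, not a full protocol implementation.

```python
# Minimal sketch of the Bully Algorithm: among the processes that are still
# alive, the one with the highest identifier becomes the leader.
def bully_election(initiator, processes, alive):
    """Return the id of the new leader elected by the bully algorithm."""
    # The initiator challenges every live process with a higher id.
    higher = [p for p in processes if p > initiator and p in alive]
    if not higher:
        return initiator          # nobody higher answered: the initiator wins
    # Otherwise the highest alive process eventually declares itself leader.
    return max(higher)

processes = [1, 2, 3, 4, 5]
alive = {1, 2, 3}                 # processes 4 and 5 have crashed
print(bully_election(initiator=1, processes=processes, alive=alive))   # 3
```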
Applications
Financial Systems: Require high resilience to handle transaction processing without downtime.
E-commerce: Essential for handling user requests and maintaining consistent inventory.
Process resilience is vital for building robust distributed systems, ensuring continuous operation despite failures, and
improving user trust in system reliability.
Reliable Client-Server Communication in a Distributed System
Reliable client-server communication in a distributed system refers to the dependable exchange of data between
clients and servers across a network. Ensuring this reliability is critical for maintaining system integrity, consistency,
and performance.
Challenges like network latency, packet loss, and data corruption can hinder effective communication.
Addressing these issues involves using robust protocols and error-handling techniques.
In this article, we will explore the importance of reliable communication, common challenges, and the best
practices for achieving it in distributed systems.
Reliable communication is vital for ensuring the smooth operation of distributed systems. It guarantees that data
transmitted between clients and servers remains accurate and consistent. Here are several key reasons why reliable
communication is essential:
Data Integrity: Ensuring data integrity means that the information sent is received without errors. This is
crucial for applications like financial transactions where accuracy is paramount.
Consistency: Consistent communication prevents data mismatches across different parts of the system. This
helps maintain a unified state across distributed nodes.
Security: Reliable protocols often include security features that protect data from interception and
tampering. This ensures that sensitive information remains confidential and intact.
Scalability: As systems grow, maintaining reliable communication becomes more challenging. Reliable
communication strategies support scalable solutions that can handle increased load without compromising
performance.
Maintaining reliable client-server communication in distributed systems can be complex due to various inherent
challenges. These challenges can impact the system's performance, data integrity, and overall user experience. Here
are some common issues faced in client-server communication:
Network Latency: Delays in data transmission can slow down system responses. High latency can degrade
user experience and hinder real-time processing.
Packet Loss: Data packets may get lost during transmission due to network issues. Packet loss can lead to
incomplete or corrupted messages, affecting data integrity.
Data Corruption: Errors during transmission can corrupt data, rendering it unusable. Ensuring data integrity
requires robust error detection and correction mechanisms.
Concurrency Issues: Simultaneous data requests can cause conflicts and inconsistencies. Managing
concurrent requests effectively is crucial for maintaining data consistency.
Scalability: As the system grows, ensuring reliable communication becomes more challenging. Increased
traffic can strain network resources and lead to performance bottlenecks.
Security Threats: Data transmitted over the network can be intercepted or tampered with. Implementing
strong encryption and security measures is essential to protect sensitive information.
Protocols and Techniques for Reliable Communication
Ensuring reliable communication in a distributed system requires a combination of robust protocols and effective
techniques. Here are several key methods and protocols that help achieve dependable client-server communication:
Transmission Control Protocol (TCP): TCP ensures reliable, ordered, and error-checked delivery of data
between applications. It manages packet loss by retransmitting lost packets and ensures data integrity
through checksums.
HTTP/2 and HTTP/3: These protocols improve performance and reliability with features like multiplexing,
which allows multiple requests and responses simultaneously over a single connection. They also include
header compression to reduce overhead.
Message Queues: Systems like RabbitMQ and Apache Kafka help manage message delivery. They queue
messages and retry sending them if they fail, ensuring no message is lost even if the server is temporarily
unavailable.
Automatic Repeat reQuest (ARQ): ARQ is a protocol for error control that automatically retransmits lost or
corrupted packets. This technique ensures that all data reaches its destination intact.
Forward Error Correction (FEC): FEC adds redundant data to the original message. This allows the receiver to
detect and correct errors without needing a retransmission.
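As an illustration of reliability layered on top of TCP, the sketch below retries a request when the connection times out or drops. The host, port, and message format are placeholders invented for the example, not part of any real service.

```python
import socket

HOST, PORT = "127.0.0.1", 9000    # placeholder address of a hypothetical server
MAX_RETRIES = 3
TIMEOUT = 2.0                     # seconds to wait for a reply

def send_with_retries(payload):
    """Send a request over TCP and retry on timeouts or connection errors."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as sock:
                sock.sendall(payload)
                return sock.recv(4096)        # wait for the server's reply
        except (socket.timeout, ConnectionError) as exc:
            print(f"attempt {attempt} failed: {exc}")
    raise RuntimeError("server unreachable after retries")
```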
Error Detection and Correction Mechanisms
Error detection and correction mechanisms are essential for maintaining data integrity in client-server communication. They ensure that any data corrupted during transmission is identified and corrected.
Checksums: Checksums generate a small value from a block of data. The sender includes this value with the
data, and the receiver recalculates it to verify integrity.
Cyclic Redundancy Check (CRC): CRC is a more advanced form of checksum. It uses polynomial division to
detect errors in transmitted messages.
Parity Bits: Parity bits add an extra bit to data to make the number of set bits either even or odd. This helps
detect single-bit errors.
Hamming Code: Hamming code adds redundant bits to data. It detects and corrects single-bit errors and
detects two-bit errors.
Automatic Repeat reQuest (ARQ): ARQ protocols, like Stop-and-Wait and Go-Back-N, request
retransmission of corrupted or lost packets. This ensures reliable delivery.
Forward Error Correction (FEC): FEC adds redundant data to enable the receiver to detect and correct errors
without needing retransmission.
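As a small illustration of checksum-based error detection, the sketch below appends a CRC-32 value to each message using Python's standard zlib module and verifies it on receipt. The framing format (a 4-byte big-endian CRC appended to the payload) is an assumption made for the example.

```python
import zlib

def frame(data):
    """Sender side: append a CRC-32 checksum to the payload."""
    crc = zlib.crc32(data)
    return data + crc.to_bytes(4, "big")

def verify(message):
    """Receiver side: recompute the CRC and reject corrupted frames."""
    data, received_crc = message[:-4], int.from_bytes(message[-4:], "big")
    if zlib.crc32(data) != received_crc:
        raise ValueError("checksum mismatch: request retransmission (ARQ)")
    return data

msg = frame(b"transfer 100 to account 42")
print(verify(msg))                    # payload passes the integrity check

corrupted = bytearray(msg)
corrupted[0] ^= 0xFF                  # flip some bits "in transit"
try:
    verify(bytes(corrupted))
except ValueError as exc:
    print(exc)                        # checksum mismatch: request retransmission (ARQ)
```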
Examples of Reliable Client-Server Communication
Reliable client-server communication is crucial for various real-world applications where data integrity and
performance are paramount. Below are some examples demonstrating its importance:
Financial Systems: In banking and stock trading platforms, reliable communication ensures transaction
accuracy and data consistency. A single error can lead to significant financial loss and undermine trust.
E-commerce Platforms: Online shopping sites rely on dependable communication for inventory
management and payment processing. This ensures users have a smooth and secure shopping experience.
Healthcare Systems: Electronic health records and telemedicine services require accurate and timely data
exchange. Reliable communication ensures patient information is correct and up-to-date.
Cloud Services: Cloud platforms like AWS and Google Cloud maintain data consistency and availability across
distributed servers. This enables seamless access and high availability for users.
Gaming Applications: Multiplayer online games need real-time data synchronization to ensure a fair and
enjoyable experience. Reliable communication minimizes lag and prevents data discrepancies.
IoT Devices: Smart home systems and industrial IoT applications rely on consistent data transmission. This
ensures devices function correctly and respond promptly to commands.
Reliable Group Communication
Reliable group communication is used when you want to send messages to a group of processes or nodes, ensuring
that the messages are delivered correctly, even if some of the processes fail or the network experiences issues. It
involves mechanisms that guarantee reliable message delivery, message ordering, and fault tolerance for all
members of the group.
Key Challenges in Reliable Group Communication:
1. Message Loss: Messages sent to the group may be lost due to network issues or process crashes.
2. Message Duplication: A message might be delivered more than once, causing inconsistencies.
3. Out-of-Order Delivery: Messages may arrive out of order, violating the intended sequence.
4. Network Partitions: The system might become divided into separate groups, with some members unable to
communicate with others.
5. Fault Tolerance: Some processes in the group may fail, and the system should still ensure the message
delivery to the surviving members.
Types of Reliable Group Communication
1. Atomic Broadcast
Definition: An atomic broadcast ensures that all members of the group either receive a message or none at
all. This means that the message is either delivered to all members of the group or none, even in the
presence of network failures or process crashes.
Key Properties:
o Uniformity: All processes in the group either deliver the message or do not.
o Order: Messages are delivered in the same order to all processes (i.e., no reordering).
o Fault Tolerance: Even if some processes crash or the network partitions, the system ensures that a message is either delivered to all surviving processes or to none of them.
Example: A leader election protocol in a distributed system might use atomic broadcast to ensure that when the
leader is elected, all nodes receive the same information about the new leader at the same time.
2. Causal Ordering
Definition: Causal ordering ensures that messages are delivered in a way that respects their causal
relationships. If one message causes another (for example, process A sends a message to process B, and
process B sends a response back to process A), the order in which these messages are delivered must
respect the causal chain.
Key Properties:
o Causal Consistency: The system respects the causal relationships between events (i.e., a cause
precedes its effect).
o Concurrency: If two messages are not causally related, they may be delivered in any order.
Example: If a client sends a request to a server, and the server sends a response back, the client must receive the
response after receiving the original request. If two requests are independent of each other, they may be delivered
in any order.
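Causal relationships like the one in this example are commonly tracked with vector clocks. The sketch below is a minimal illustration assuming two processes, A and B; the helper functions are hypothetical, not a library API.

```python
# Minimal vector-clock sketch: each process keeps one counter per process,
# increments its own entry on a send, and merges clocks on receive.
def new_clock(processes):
    return {p: 0 for p in processes}

def on_send(clock, sender):
    clock[sender] += 1
    return dict(clock)                # snapshot attached to the outgoing message

def on_receive(clock, msg_clock, receiver):
    for p in clock:
        clock[p] = max(clock[p], msg_clock[p])
    clock[receiver] += 1

def happened_before(a, b):
    """True if the event with clock `a` causally precedes the event with clock `b`."""
    return all(a[p] <= b[p] for p in a) and a != b

procs = ["A", "B"]
a_clock, b_clock = new_clock(procs), new_clock(procs)
m1 = on_send(a_clock, "A")            # A sends a request to B
on_receive(b_clock, m1, "B")
m2 = on_send(b_clock, "B")            # B's reply causally depends on the request
print(happened_before(m1, m2))        # True: the request precedes the response
```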
3. Total Ordering
Definition: Total ordering ensures that all members of the group receive messages in the same order. This is
particularly important when the order of message processing matters (e.g., in transactions).
Key Properties:
o Global Agreement: Every process receives messages in the same order.
o Consistency: It ensures that there is no divergence in the order in which messages are processed
across the group.
Example: If a distributed ledger system needs to process transactions in a specific order (e.g., in a blockchain), total
ordering ensures that all participants process the transactions in the same sequence.
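One common way to realize total ordering is a sequencer that stamps every message with a global sequence number, so every member delivers in the same order. The sketch below is an illustrative simplification in which in-memory lists stand in for the group members.

```python
import itertools

# Sequencer-based total-ordering sketch: every message gets a global
# sequence number, and each member delivers strictly in that order.
sequence = itertools.count(1)

def broadcast(message, members):
    seq = next(sequence)              # the sequencer assigns the global order
    for member in members:
        member.append((seq, message))

def deliver_in_order(inbox):
    return [msg for _, msg in sorted(inbox)]    # identical order at every member

node1, node2 = [], []
broadcast("tx: debit A", [node1, node2])
broadcast("tx: credit B", [node1, node2])
print(deliver_in_order(node1) == deliver_in_order(node2))   # True
```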
4. Message Delivery Guarantees
Reliable Delivery: Messages sent to a group are reliably delivered to all group members, even in the face of
network failures or crashes. The system ensures that if a message is sent, all group members either receive it
or none do.
No Duplicates: A message should not be delivered more than once, even if retries or retransmissions occur
due to network failure.
No Loss: Messages must not be lost. If a message is sent to the group, all processes in the group should
eventually receive it.
Timeouts and Retries: In case of message loss or delay, the system should have mechanisms for retrying
delivery until all processes receive the message.
5. Handling Network Partitions (Partition Tolerance)
Problem: In a distributed system, network failures might cause a partition, where some nodes become
isolated from the rest of the group. During a partition, some processes may not receive messages from
others, which can lead to inconsistencies.
Solution: To ensure reliable communication in the presence of partitions, systems often use Quorum-based
protocols or Leader-based protocols to make decisions about message delivery and partition resolution.
Quorum-based Protocol: A majority (quorum) of the group must agree to deliver a message. This ensures
consistency while handling partitions.
Leader-based Protocol: A leader makes decisions for the group, and if the leader fails, a new leader is
elected to avoid split-brain scenarios.
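As a tiny illustration of the quorum idea, the check below accepts a decision only when a strict majority of the group has acknowledged it; the group size and acknowledgement counts are illustrative.

```python
def has_quorum(acks, group_size):
    """A decision stands only if a strict majority of the group acknowledged it."""
    return acks > group_size // 2

group_size = 5
print(has_quorum(3, group_size))   # True: 3 of 5 is a majority
print(has_quorum(2, group_size))   # False: a minority partition cannot decide
```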
Distributed Commit
A distributed commit protocol ensures that a transaction spanning several nodes either commits everywhere or aborts everywhere. Key requirements:
1. Atomicity:
o A transaction must be executed completely or not at all across all participating nodes.
2. Consistency:
o All participants must agree on the outcome (commit or abort) to ensure data integrity.
3. Coordination:
o A coordinator process orchestrates the decision, collecting votes from all participants.
4. Fault Tolerance:
o The protocol must handle failures gracefully, ensuring the system remains consistent.
1. Two-Phase Commit (2PC)
Overview:
A widely used protocol for achieving consensus among distributed nodes on whether to commit a transaction.
Phases:
1. Prepare Phase: The coordinator sends a PREPARE message to all participants, and each participant votes YES if it is able to commit or NO otherwise.
2. Commit Phase: If every participant voted YES, the coordinator sends COMMIT; if any participant voted NO, it sends ABORT, and all participants apply the decision.
Advantages:
Simple to implement and guarantees atomicity as long as the coordinator remains available.
Disadvantages:
Blocking: If the coordinator fails after sending PREPARE, participants cannot proceed until it recovers.
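To make the control flow concrete, here is a compressed sketch of the two-phase decision logic. The Participant class and its prepare/commit/abort methods are illustrative assumptions; real 2PC implementations also log decisions to stable storage and handle timeouts.

```python
# Compressed Two-Phase Commit sketch: the coordinator collects votes in the
# prepare phase and commits only if every participant voted YES.
def two_phase_commit(participants):
    # Phase 1 (prepare): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2 (commit/abort): announce the global decision to everyone.
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
    def prepare(self):                 # vote YES only if the local transaction is ready
        return self.can_commit
    def commit(self): print("local commit")
    def abort(self):  print("local rollback")

print(two_phase_commit([Participant(True), Participant(True)]))    # COMMIT
print(two_phase_commit([Participant(True), Participant(False)]))   # ABORT
```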
2. Three-Phase Commit (3PC)
Overview:
An extension of 2PC designed to avoid blocking during coordinator failures.
Phases:
1. Prepare Phase:
o Same as in 2PC.
2. Pre-Commit Phase:
o If all participants voted YES, the coordinator sends a PRE-COMMIT message and participants acknowledge it, so every participant knows the transaction is about to commit.
3. Commit Phase:
o The coordinator sends a COMMIT message, and participants finalize the transaction.
Advantages:
Non-blocking under a single coordinator failure, because the pre-commit state lets participants decide without the coordinator.
Disadvantages:
Requires an extra round of messages and can still behave incorrectly under network partitions.
3. Paxos-Based Commit
Overview:
Uses the Paxos consensus algorithm to achieve a distributed commit decision.
Steps:
Participants propose values (commit/abort), and Paxos ensures agreement on the outcome.
Advantages:
Guarantees consistency.
Disadvantages:
Complex implementation.
4. Quorum-Based Commit
Overview:
Relies on quorum-based voting to decide commit or abort.
Steps:
Participants vote, and the transaction commits only if a commit quorum (typically a majority) is reached; otherwise it aborts.
Disadvantages:
Requires careful choice of quorum sizes and adds extra voting messages.
Challenges in Distributed Commit
1. Coordinator Failures: A failed coordinator can leave participants blocked or uncertain about the outcome.
2. Network Partitions: A partition can separate participants from the coordinator and delay or prevent the decision.
3. Performance Overhead: The additional message rounds and logging slow down transaction processing.
4. Concurrency Control: Commit protocols must be combined with locking or other concurrency-control mechanisms to keep concurrent transactions consistent.
Applications
1. Distributed Databases: Committing a transaction that updates data stored on several database nodes.
2. Financial Systems: Ensuring that transfers and payments either complete on every involved system or not at all.
Recovery
Recovery in distributed systems focuses on maintaining functionality and data integrity despite failures. It involves
strategies for detecting faults, restoring state, and ensuring continuity across interconnected nodes. This article
delves into techniques for handling various types of failures—such as network issues and node crashes—by
implementing robust recovery mechanisms. Understanding these principles helps in designing resilient systems that
can quickly recover from disruptions and maintain consistent operations.
Effective recovery in distributed systems is crucial for ensuring system reliability, availability, and fault tolerance.
When a component fails or an error occurs, the system must recover quickly and correctly to minimize downtime
and data loss. Effective recovery mechanisms, such as checkpointing, rollback, and forward recovery, help maintain
system consistency, prevent cascading failures, and ensure that the system can continue to function even in the
presence of faults.
Recovery techniques in distributed systems are essential for ensuring that the system can return to a stable state
after encountering errors or failures. These techniques can be broadly categorized into the following:
Checkpointing: Periodically saving the system’s state to a stable storage, so that in the event of a failure, the
system can be restored to the last known good state. Checkpointing is a key aspect of backward recovery.
Rollback Recovery: Involves reverting the system to a previous checkpointed state upon detecting an error.
This technique is useful for undoing the effects of errors and is often combined with checkpointing.
Forward Recovery: Instead of reverting to a previous state, forward recovery attempts to move the system
from an erroneous state to a new, correct state. This requires anticipating possible errors and having
strategies in place to correct them on the fly.
Logging and Replay: Keeping logs of system operations and replaying them from a certain point to recover
the system’s state. This is useful in scenarios where a complete rollback might not be feasible.
Replication: Maintaining multiple copies of data or system components across different nodes. If one
component fails, another can take over, ensuring continuity of service.
Error Detection and Correction: Incorporating mechanisms that detect errors and automatically correct
them before they lead to system failure. This is a proactive approach that enhances system resilience.
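The checkpointing and rollback techniques listed above can be illustrated with a minimal sketch that uses a JSON file as a stand-in for stable storage; the file name and state layout are illustrative assumptions.

```python
import json, os

CHECKPOINT_FILE = "checkpoint.json"    # stands in for stable storage

def take_checkpoint(state):
    """Periodically persist the current state so the system can roll back to it."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def rollback():
    """Backward recovery: restore the last known good state after a failure."""
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

state = {"processed_orders": 120, "balance": 5000}
take_checkpoint(state)

state["balance"] = -999                # a fault corrupts the in-memory state
state = rollback()                     # revert to the last checkpoint
print(state)                           # {'processed_orders': 120, 'balance': 5000}
os.remove(CHECKPOINT_FILE)             # cleanup for the example
```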
Recovery from an error is essential to fault tolerance; an error is the part of the system's state that could result in a failure. The whole idea of error recovery is to replace an erroneous state with an error-free state. Error recovery can be broadly divided into two categories.
Backward Recovery: This involves rolling the system back to a previously known good state, using
checkpoints to periodically save the system’s state. When an error occurs, the system can revert to one of
these saved states to recover from the error.
Forward Recovery: This approach focuses on moving the system from an erroneous state to a new, correct
state without reverting to a previous checkpoint. It requires anticipation of potential errors and the ability to
correct them, allowing the system to continue functioning.
These two categories are fundamental to understanding how distributed systems handle errors and maintain fault
tolerance. The distinction between backward and forward recovery highlights different strategies for ensuring
system resilience in the face of failures.
Introduction to Security in Distributed Systems
Security in distributed systems refers to the protection of system resources and data from unauthorized access,
modification, or destruction. Distributed systems consist of multiple components, often running on different
machines or even different geographical locations, which increases the attack surface for potential security
breaches. Effective security in these systems typically involves:
Confidentiality: Ensuring that data is accessible only to authorized parties and is kept secret from everyone else.
Integrity: Ensuring that data is not tampered with, altered, or corrupted during transmission or storage.
Availability: Ensuring that the system remains operational and accessible even under attack (e.g., denial-of-service attacks).
Authentication: Verifying the identity of users, processes, or nodes before they interact with the system.
Authorization: Ensuring that an authenticated entity has the necessary permissions to access resources.
Non-repudiation: Ensuring that actions cannot be denied after they have been performed (e.g., signing a
document).
Securing distributed systems poses several significant challenges due to their complexity, scale, and dynamic nature.
Here are the key challenges in distributed system security:
Network Complexity: Increases the attack surface and complexity of managing security configurations and
updates across diverse network environments.
Data Protection and Encryption: Vulnerabilities in encryption implementations or weak key management
practices can lead to data breaches and unauthorized access.
Diverse Technologies and Platforms: Compatibility issues, differing security postures, and varying levels of
support for security standards can introduce vulnerabilities and complexities in maintaining a consistent
security posture.
Scalability and Performance: Security measures such as encryption and authentication may introduce
latency and overhead, affecting system performance and responsiveness, especially under high load
conditions.
Communication over insecure networks: Messages can be intercepted or altered during transmission.
Data consistency and integrity: Protecting data from being modified or corrupted by malicious actors.
Secure Channels
A secure channel is a communication link between two or more entities in a distributed system that is designed to
protect the data being transmitted. The goal is to prevent eavesdropping, tampering, and unauthorized access.
1. Encryption: Encrypting data before transmission ensures that even if the data is intercepted, it cannot be
read or altered without the appropriate decryption key.
o Symmetric Encryption: The same key is used for both encryption and decryption (e.g., AES). It is fast
but requires secure key distribution.
o Asymmetric Encryption: Uses a pair of keys: a public key to encrypt and a private key to decrypt
(e.g., RSA). It is more secure but computationally expensive.
2. TLS/SSL (Transport Layer Security / Secure Sockets Layer): These are protocols that provide secure
communication over a computer network by using encryption, integrity checks, and authentication. TLS/SSL
is widely used for securing HTTP connections (HTTPS).
o TLS ensures that data is encrypted between the client and server, preventing man-in-the-middle
(MITM) attacks.
o Handshake: During the TLS handshake, the client and server exchange cryptographic keys and
authenticate each other.
3. Digital Signatures: A digital signature is a cryptographic technique that verifies the authenticity and integrity
of a message. It ensures that the message has not been tampered with and that it was sent by the legitimate
sender.
o Example: A distributed blockchain system uses digital signatures to verify the authenticity of
transactions.
4. Perfect Forward Secrecy (PFS): This ensures that even if the private key of a server is compromised in the
future, past communications will remain secure because the session keys are not derived from the server’s
private key. Each session uses unique session keys.
5. VPNs (Virtual Private Networks): A VPN provides an encrypted communication channel between remote
clients and servers over a public network, ensuring privacy and security.
In a distributed cloud-based storage system, when a user uploads or downloads sensitive data, the communication
between the user’s device and the cloud server should occur over a secure channel, such as HTTPS, to prevent
eavesdropping or man-in-the-middle attacks.
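To make the symmetric-encryption idea from the list above concrete, the sketch below uses the Fernet recipe from the third-party cryptography package (an assumed dependency, installable with pip install cryptography). In a real secure channel the shared key itself would have to be distributed securely, for example during a TLS handshake.

```python
from cryptography.fernet import Fernet   # third-party: pip install cryptography

# Symmetric encryption sketch: the same key encrypts on one end of the
# channel and decrypts on the other, so the key must be shared securely.
key = Fernet.generate_key()
channel = Fernet(key)

ciphertext = channel.encrypt(b"account=42;amount=100")   # what travels on the wire
print(channel.decrypt(ciphertext))                        # b'account=42;amount=100'

# An eavesdropper without the key sees only opaque ciphertext, and any
# tampering causes decryption to fail with an InvalidToken error.
```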
Access Control
Access control refers to the mechanisms that restrict access to system resources based on the identity of the user or
process and their permissions. It ensures that only authorized users or entities can perform specific operations on
system resources.
1. Discretionary Access Control (DAC):
o Definition: In DAC, the resource owner has control over who can access their resources and what actions they can perform.
o Example: A user may decide to share a file with specific users or groups, granting them read or write
access.
o Downside: DAC is considered less secure because the resource owner can grant permissions to
anyone.
2. Mandatory Access Control (MAC):
o Definition: In MAC, access decisions are made by a central authority, not the resource owner. The system enforces strict policies about who can access resources.
o Example: A classified government document system where access is based on security labels (e.g.,
"Top Secret," "Confidential").
o Key Advantage: MAC provides a more secure and enforceable access control policy, reducing the
risk of unauthorized access.
3. Role-Based Access Control (RBAC):
o Definition: RBAC is based on roles assigned to users. Each role has specific permissions, and users are assigned to roles based on their job responsibilities.
o Example: In a corporate setting, the "Admin" role might have full access to the system, while the
"Employee" role might have limited access.
o Key Advantage: RBAC is scalable and easy to manage as permissions are granted based on roles
rather than individual users.
4. Attribute-Based Access Control (ABAC):
o Definition: ABAC uses attributes (characteristics) of the user, resource, and environment to determine access. This model allows fine-grained access control.
o Example: A system might allow access to resources based on the user's location, time of day, or
department.
In an online banking application, access to sensitive data (e.g., account balance, transaction history) is controlled
through RBAC. A regular user can view their own account data, but an admin might have broader access to all users’
data.
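A minimal sketch of the RBAC check described in this banking example is shown below; the role names, users, and permission strings are illustrative assumptions, not a real API.

```python
# Minimal RBAC sketch for the banking example above: permissions are
# attached to roles, and users acquire permissions only through their role.
ROLE_PERMISSIONS = {
    "admin":    {"view_any_account", "view_own_account", "freeze_account"},
    "employee": {"view_own_account"},
}

USER_ROLES = {"alice": "admin", "bob": "employee"}    # illustrative users

def is_allowed(user, permission):
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("bob", "view_own_account"))    # True
print(is_allowed("bob", "view_any_account"))    # False: not in the employee role
```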
Security Management
Security management involves the processes and tools used to ensure that security policies are effectively
implemented, monitored, and enforced in distributed systems. It covers everything from identifying threats to
responding to security incidents.
1. Security Policies:
o Definition: A security policy is a document that outlines the rules and guidelines for maintaining the
security of a system. It includes policies for data protection, user access, and incident response.
o Example: A security policy might dictate that all sensitive data must be encrypted at rest and during
transit.
2. Authentication and Identity Management:
o Definition: Authentication is the process of verifying a user's identity, typically through credentials like usernames and passwords, biometrics, or security tokens. Identity management systems handle the creation, storage, and management of user identities and their associated roles and permissions.
o Example: Single Sign-On (SSO) systems allow users to authenticate once and access multiple services
without having to log in separately to each one.
3. Auditing and Monitoring:
o Definition: Auditing involves tracking and logging all access to system resources, and monitoring involves continuously checking the system for suspicious activity or potential threats.
o Example: In a distributed system, auditing logs could track all user access to sensitive data, while
monitoring could involve detecting unusual traffic patterns that might indicate an ongoing
Distributed Denial of Service (DDoS) attack.
4. Incident Response:
o Definition: Incident response refers to the steps taken to detect, respond to, and recover from
security incidents or breaches.
o Example: If a breach is detected in a system, the response might include isolating the affected
system, identifying the cause of the breach, notifying affected users, and restoring systems from
backups.
5. Patching and Updates:
o Definition: Regular patching and updating of software components is essential for security. Vulnerabilities in software are often discovered, and patches must be applied to mitigate potential exploits.
o Example: An organization might have a policy to apply critical security patches to all servers within
24 hours of release.
6. Key Management:
o Definition: Key management covers the secure generation, distribution, storage, rotation, and revocation of cryptographic keys.
o Example: In a system using asymmetric encryption, key management ensures that private keys are
protected and only authorized users have access to public keys.