Unit 4 Distributed Systems
A Distributed File System (DFS) is a file system that allows access to files and data over a network,
enabling multiple users and applications to read, write, and share data as if it were located on a local
disk. Unlike traditional file systems that are tied to a single machine, a DFS distributes files across
multiple servers, providing benefits such as increased availability, scalability, and fault tolerance.
1.Location Transparency:
Users access files without needing to know their physical location. The system abstracts the details
of where data is stored, making it easier for users and applications to interact with files seamlessly.
2.Scalability:
DFS can handle increasing amounts of data and user requests by adding more servers and storage
devices. This scalability allows organizations to grow their storage capabilities without significant
redesign.
3.Fault Tolerance:
Distributed file systems often include redundancy and replication mechanisms. If one server fails,
data can still be accessed from another server, ensuring continuous availability and reliability.
4.Concurrency Control:
A DFS manages concurrent access to files by multiple users or applications. It implements protocols
to ensure data consistency and integrity, preventing conflicts when multiple clients attempt to read
or write the same file simultaneously.
5.Data Replication:
To enhance reliability and performance, DFS often replicates data across multiple nodes. This not
only provides fault tolerance but also allows for load balancing by distributing read requests across
replicas.
6.Network Transparency:
Users interact with the file system without being aware of the underlying network. The system
manages network communication and data transfer, allowing users to focus on file management
rather than connectivity issues.
• Cloud Storage Solutions: Services like Google Drive, Dropbox, and Amazon S3 use
distributed file systems to provide scalable and accessible storage for users.
• Big Data Applications: Distributed file systems are essential in big data frameworks (e.g.,
Hadoop Distributed File System) that manage large datasets across many servers.
• Collaborative Environments: Environments that require multiple users to access and edit
files concurrently, such as shared project directories in organizations.
Examples of Distributed File Systems
1.Google File System (GFS):
Designed to support Google's data-intensive applications, GFS provides fault tolerance and high throughput, and supports very large files.
2.Hadoop Distributed File System (HDFS):
Part of the Apache Hadoop project, HDFS is designed for high-throughput access to large datasets, supporting big data processing applications.
3.Ceph:
A unified distributed storage system that provides object, block, and file storage in a scalable and
fault-tolerant manner.
4.Network File System (NFS):
Originally developed by Sun Microsystems, NFS allows users to access files over a network as if they were on local storage, though it is primarily designed for smaller networks compared to more modern distributed file systems.
File Service Architecture
1. Client Module
Role: The client module runs on the client computer and acts as an intermediary between the client
applications and the server services (Flat File Service and Directory Service).
Functionality:
Provides a unified interface that combines the capabilities of the Flat File Service and Directory
Service, making it easier for applications to interact with files in a way similar to conventional file
systems (e.g., UNIX).
Manages file operations requested by the application programs, such as reading, writing, and
creating files.
Interprets file names and translates them into unique file identifiers (UFIDs) through iterative
communication with the Directory Service.
Caches recently accessed file data locally to enhance performance and reduce the number of
requests sent to the server.
Stores information about the network locations of the Flat File Service and Directory Service,
enabling direct communication with them.
2. Directory Service
Role: The Directory Service is responsible for managing file names and their mapping to unique file
identifiers (UFIDs).
Functionality:
Acts as a mapping system, translating human-readable file names into UFIDs. This allows the system
to use UFIDs to uniquely identify files across the network.
Supports directory-related operations, such as creating directories, adding new file names, and
retrieving UFIDs for files.
Uses a hierarchical naming system, similar to that in UNIX, where directories can reference other
directories, enabling complex file structures.
Stores directory data within the Flat File Service, meaning directory information is kept as files in the
flat file system, providing a uniform approach to data storage.
The Directory Service functions as a client of the Flat File Service, where it manages and stores
directory-related data as regular files.
3. Flat File Service
Role: The Flat File Service provides low-level access to the actual file contents and handles operations
on the data within each file.
Functionality:
Manages files directly by UFIDs, which ensure each file is uniquely identified and accessed.
Provides operations on file contents and attributes (such as reading and writing data at a given position, creating and deleting files, and getting or setting attributes), all addressed by UFID.
Implements these operations through a Remote Procedure Call (RPC) interface, allowing the Client Module to request operations without direct file manipulation.
Access Control:
Operations in the Flat File Service check for appropriate access rights before allowing access to or
modification of file data. Invalid file identifiers or permissions trigger exceptions (e.g., BadPosition).
• File Naming and Access: When an application program requests access to a file, it specifies
the file name. The Client Module contacts the Directory Service to resolve the file name to a
UFID.
• File Data Operations: Once the Client Module has the UFID, it uses the Flat File Service to
perform file operations like reading, writing, or deleting file contents.
• Caching: The Client Module may cache recently accessed files or data blocks locally,
improving performance by reducing repeated requests to the server.
• Separation of Concerns: The Flat File Service focuses only on file content management, while
the Directory Service manages the mapping of file names to UFIDs. This separation makes the
system modular and easier to scale and maintain.
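The interaction described above can be summarized in a short sketch. This is an illustrative example only: directory_service and flat_file_service stand for hypothetical RPC stubs, and the method names (lookup, get_length, read) are assumptions rather than a real DFS API.

```python
# Hypothetical sketch of name resolution and data access in the file service model.

def open_and_read(pathname, directory_service, flat_file_service):
    """Resolve a pathname to a UFID one component at a time, then read the file."""
    ufid = directory_service.root_ufid()            # UFID of the root directory
    for name in pathname.strip("/").split("/"):
        # Each path component is resolved by the Directory Service into a UFID.
        ufid = directory_service.lookup(ufid, name)
    length = flat_file_service.get_length(ufid)     # attribute query, addressed by UFID
    # The Flat File Service operates purely on UFIDs, positions and counts.
    return flat_file_service.read(ufid, 0, length)
```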
1. Modularity: The separation between file data operations (Flat File Service) and file
naming/directory management (Directory Service) allows each component to be maintained
and scaled independently.
2. Flexibility: Different client modules can be implemented for various client systems, adapting
to specific operating system conventions and performance requirements.
3. Scalability: By separating directory and file data management, the system can handle more
clients and files more efficiently, making it well-suited for large distributed environments.
4. Caching for Performance: The client-side caching helps reduce latency by storing frequently
accessed data locally, which reduces the load on the server and improves user experience.
5. Unified Access: The Client Module provides a single interface for application programs,
abstracting the complexities of managing directories, UFIDs, and network communication.
Characteristics of Peer-to-Peer (P2P) Systems
1.Decentralization:
In P2P systems, there is no central authority or server that manages resources or connections. Each
peer is both a client and a server, capable of initiating requests and fulfilling others' requests.
This decentralized structure improves scalability and fault tolerance, as there’s no single point of
failure.
2.Resource Sharing:
Peers share resources, such as processing power, storage, or bandwidth, directly with one another.
For instance, in file-sharing P2P systems, peers store and share files with others.
Sharing resources at the peer level allows P2P networks to handle high volumes of traffic and
accommodate large numbers of users.
3.Scalability:
P2P systems are inherently scalable because as more peers join, they bring additional resources,
helping the network grow organically.
The network’s ability to handle traffic increases with the number of active peers.
4.Self-Organization:
Peers autonomously join and leave the network without requiring permission or coordination from
a central server. This characteristic makes P2P networks robust and adaptable.
5.Fault Tolerance:
P2P networks can remain operational even if several peers fail, as there is no central point of
dependency. The network dynamically adapts by rerouting tasks or data through active peers.
Redundant copies of data or resources are often distributed across multiple peers, ensuring
availability even when some nodes go offline.
Types of P2P Systems
In pure P2P systems, all peers have equal roles, and there is no central authority or hierarchy.
Hybrid P2P systems use a combination of P2P and client-server architectures. Some peers or servers
may have more control, often to handle specific tasks like indexing or authentication.
Examples include BitTorrent, where "tracker" servers help locate peers but do not participate in file
sharing.
Structured: Peers follow specific protocols to organize themselves and manage resources. For
example, Distributed Hash Tables (DHTs) assign data locations based on hash functions, allowing
efficient searches.
Unstructured: Peers connect randomly or semi-randomly, without any fixed organization. While this
makes them easier to set up, they can be inefficient in finding specific resources.
Applications of P2P Systems
1.File Sharing:
P2P technology is widely used in file-sharing systems, such as BitTorrent, where users share files by
downloading and uploading file chunks to each other.
2.Distributed Computing:
In systems like SETI@home, peers contribute their processing power to perform computational
tasks, aggregating resources from millions of users.
3.Content Distribution:
P2P networks are also used to distribute content such as media and software updates directly between peers, reducing the load on central servers.
4.Blockchain and Cryptocurrency:
Blockchain-based cryptocurrencies like Bitcoin and Ethereum use P2P networks to verify transactions
and maintain a distributed ledger. Each peer in the network validates transactions and helps maintain
the security and integrity of the ledger.
5.Communication and Social Media:
P2P technology can be used for secure messaging, voice calls, and video calls, where messages travel
directly between users instead of through central servers. Applications like Skype initially used a P2P-
based architecture.
Advantages of P2P Systems
1.Scalability:
Because resources increase as more peers join, P2P networks are naturally scalable. This contrasts
with client-server systems, where more users increase the load on centralized servers.
2.Cost Efficiency:
P2P networks reduce infrastructure costs as there’s no need for powerful central servers. Each peer
provides some storage, processing, or bandwidth.
3.Fault Tolerance:
P2P systems are resilient to failures. If some peers go offline, the network can still function, as resources are spread across multiple nodes.
4.Anonymity:
Many P2P networks allow users to connect and share resources directly, without revealing their identity to a central authority. This can provide a level of anonymity.
Challenges of P2P Systems
1.Security:
P2P systems are vulnerable to attacks like data corruption, poisoning, and Sybil attacks, where
malicious users create multiple fake identities to manipulate the network.
2.Data Integrity:
Without central oversight, ensuring data integrity and authenticity can be challenging. Malicious
peers may introduce corrupted or fake data.
3.Legal Issues:
Due to the decentralized nature of file-sharing, P2P networks are often associated with piracy and
copyright infringement.
4.Management Complexity:
Since peers can join and leave at will, coordinating and managing resources in a P2P network is complex, especially for unstructured systems.
Case Study: Napster
Napster combined a centralized file index with decentralized file storage. A typical file-sharing interaction worked as follows:
1. File Location Request: A user searching for a song would send a request to Napster's centralized server.
2. List of Peers: The server returned a list of peers who had the requested file, identified by their
IP addresses.
3. File Request: The client then directly contacted one of these peers to request the file.
4. File Delivered: The peer with the file sent it directly to the requesting client.
5. Index Update: Each user updated Napster’s index with the available files on their computer.
This combination of central indexing and decentralized file hosting allowed Napster to scale quickly,
enabling massive simultaneous file-sharing activity.
Napster's Legacy
1.Demonstrating Large-Scale Peer-to-Peer Sharing:
Napster demonstrated that a decentralized network could leverage the resources of ordinary Internet users to create a large-scale information-sharing service.
This approach showed the power of distributing storage and computing across multiple nodes,
laying the foundation for future P2P systems like BitTorrent.
2.Legal and Copyright Issues:
Napster’s reliance on copyrighted music files led to extensive legal battles with the music industry, culminating in a court-ordered shutdown in 2001.
The case highlighted intellectual property issues in the digital age, influencing copyright laws and
leading to the development of legal digital music services like iTunes and Spotify.
3.Architectural Lessons:
Napster’s use of a centralized index showed both the strengths and limitations of this approach.
While it made file location efficient, it also created a single point of failure.
Later P2P systems adopted fully decentralized indexing (e.g., Distributed Hash Tables in BitTorrent)
to improve resilience and remove reliance on a single server.
4.Anonymity:
Napster’s structure offered minimal anonymity since the centralized index tracked file locations. This
spurred interest in anonymity for P2P systems, leading to projects like Freenet, which emphasize
privacy and resilience against censorship.
5.Application-Specific Design:
Napster benefited from application-specific properties: music files were static (no updates), and
temporary unavailability was acceptable since users could download files later. This simplicity
facilitated scalability but was not suitable for applications requiring strict data consistency.
Napster’s popularity and scalability influenced future P2P architectures, such as:
• Freenet and FreeHaven: P2P systems focusing on anonymous and censorship-resistant file
storage.
• BitTorrent: Decentralized file-sharing with a focus on efficient resource distribution and load
management.
• Blockchain: Decentralized ledgers that rely on distributed networks, where each peer plays a
role in maintaining data integrity.
Napster’s blend of centralized indexing and decentralized storage allowed it to scale rapidly,
transforming digital media distribution and inspiring future innovations in P2P networking. Despite
its legal challenges and eventual shutdown, Napster’s impact on digital culture, technology, and the
music industry remains profound.
Peer-to-peer (P2P) middleware
Peer-to-peer (P2P) middleware is a specialized software layer that enables the efficient creation and
management of distributed applications across a network of hosts, with the primary purpose of
allowing clients to locate and access data resources quickly, regardless of where they are in the
network. Unlike traditional client-server systems, P2P systems use decentralized approaches,
distributing both the workload and data storage across multiple participating nodes or "peers."
Functional Requirements
1.Resource Location and Communication: P2P middleware must enable clients to locate and interact with any resource distributed across the network, even though these resources may be widely spread among nodes.
2.Dynamic Addition and Removal of Resources: Resources and hosts can join or leave the network
dynamically. Middleware must handle these changes without manual reconfiguration.
3.Simplified Programming Interface: The middleware should provide a simple API, allowing
developers to interact with distributed resources without needing to understand the complexities of
the underlying network.
Non-functional Requirements
To achieve high performance, P2P middleware must address several critical non-functional aspects:
1.Global Scalability:
P2P systems aim to leverage the hardware resources of a vast number of Internet-connected hosts,
often reaching thousands or millions of nodes.
The middleware must support this scale by efficiently managing resource discovery and distribution.
2.Load Balancing:
Efficient resource usage across nodes depends on an even workload distribution. Middleware should
ensure random placement of resources and replication for heavily used files to prevent overloading
individual nodes.
3.Optimization for Local Interactions:
Minimizing the "network distance" (latency) between interacting nodes helps reduce delays in resource access and lowers network traffic.
Middleware should ideally place frequently accessed resources closer to the requesting nodes.
4.Accommodating Highly Dynamic Host Availability:
P2P networks operate with nodes that can join or leave at any time, a factor driven by the fact that hosts in P2P systems are not centrally managed and may face connectivity issues.
Middleware should automatically detect and adjust to the addition or removal of nodes, redistributing load and resources accordingly.
5.Data Security in a Heterogeneous Trust Environment:
Given the varied ownership and potential lack of trust between nodes, security is a major concern.
Middleware should employ authentication, encryption, and other mechanisms to maintain data
integrity and privacy.
6.Anonymity and Resistance to Censorship:
Middleware should provide anonymity for users and enable hosts to plausibly deny responsibility for holding or supplying certain data. This is crucial in resisting censorship and protecting user privacy.
Architectural Challenges
Given the need for scalability and high availability, maintaining a unified database of resource
locations across all nodes is impractical. Instead, P2P middleware relies on:
Partitioned and Distributed Indexing: Knowledge of data locations is divided across the network,
with each node managing a portion of the namespace (a segment of the global resource directory).
Topology Awareness: Nodes maintain limited knowledge of the overall network topology,
enhancing both efficiency and fault tolerance.
Replication: High replication levels (e.g., 16 copies of data) ensure system resilience against node
unavailability or network disruptions.
First-generation P2P systems like Napster used a centralized index, whereas second-generation
systems like Gnutella and Freenet moved to fully decentralized approaches. The evolution of P2P
middleware addresses the challenges of dynamic, large-scale networks by distributing control and
data management across multiple nodes, forming the foundation for modern P2P applications such
as file-sharing platforms, decentralized storage solutions, and even blockchain technologies.
Routing overlays
Routing overlays are a foundational concept in peer-to-peer (P2P) systems and distributed
networks, providing a mechanism for locating and retrieving data in a network where data is
decentralized and spread across multiple nodes (peers). Routing overlays manage how nodes
communicate with each other and how resources or data are located across the network without
relying on a central server. They provide efficient paths for data requests and responses, even as
nodes frequently join or leave the network.
Structured Overlays
Structured overlays use specific algorithms to maintain a strict topology and organize data. This
structure enables efficient, deterministic routing.
Chord: Organizes nodes in a circular ring. Each node is responsible for a specific range of data, and data lookup is achieved in O(log N) hops, where N is the number of nodes.
Pastry: Routes requests by progressively finding nodes with closer IDs, organizing nodes in a circular
ID space and assigning a unique numeric identifier to each node.
Tapestry: Uses a similar approach to Pastry, routing based on the "prefix" of the destination node’s
ID, with a focus on optimizing proximity between nodes.
Kademlia: Relies on XOR-based distance between node IDs, allowing nodes to find data based on
shortest logical distance. Known for its robustness and used by BitTorrent.
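As an illustration of how a structured overlay assigns responsibility, the sketch below hashes node names and keys onto the same circular identifier space and stores each key at its successor node, in the style of Chord. The identifier size and node names are illustrative; a real Chord implementation additionally maintains finger tables to reach the responsible node in O(log N) hops.

```python
import hashlib
from bisect import bisect_left

M = 16                               # bits in the identifier space (2**16 positions)

def ident(name: str) -> int:
    """Hash a node name or key onto the circular identifier space."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

def successor(key: int, ring: list[int]) -> int:
    """Return the node responsible for `key` on a sorted ring of node IDs."""
    i = bisect_left(ring, key)       # first node ID >= key
    return ring[i % len(ring)]       # wrap around the circle if needed

nodes = sorted(ident(f"node-{i}") for i in range(8))
k = ident("some-file.mp3")
print(f"key {k} is stored on node {successor(k, nodes)}")
```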
Unstructured Overlays
Unstructured overlays do not impose a strict topology on the network, allowing nodes to join and
leave freely. These systems are generally more resilient but may require more resources to locate
data.
Flooding and Random Walks: Nodes in unstructured overlays often use methods like flooding
(broadcasting requests to all neighbors) or random walks (sending requests to random neighbors)
to locate data.
Examples: Early Gnutella networks and Freenet, which lack structured routing and rely on
broadcasting or searching through known nodes.
When a node joins, it is assigned a unique identifier and is introduced to the overlay structure by
connecting to existing nodes.
In structured overlays, the new node is integrated into the routing structure and takes responsibility
for a specific range of data keys.
When a node requests data, it routes the request through the overlay structure. Using its knowledge
of neighbor nodes, it forwards the request closer to the node responsible for the data key.
Structured overlays typically have predictable routing paths, achieving lookups within O(log N) hops.
Routing overlays use redundancy and replication to handle node failures. For example, each data
item might be replicated on multiple nodes, or alternative paths can be used if the primary path
fails.
Nodes periodically update their neighbor information to account for changes in the network.
Fault Tolerance: By distributing data and routing paths, routing overlays can remain operational
despite node churn and network failures.
Load Distribution: Many routing overlays use DHTs to ensure that data is evenly distributed across
nodes, preventing hotspots and balancing the load.
Latency and Network Distance: Although DHTs provide efficient routing, network distance can still
introduce delays. Routing overlays often attempt to optimize for local interactions to minimize
latency.
Security and Anonymity: Ensuring data integrity, preventing malicious nodes from disrupting
routing, and protecting user privacy are ongoing challenges in decentralized networks.
File Sharing (e.g., BitTorrent): Distributes file chunks across peers, where Kademlia-based DHT
helps locate peers with required chunks.
Blockchain and Distributed Ledgers: Nodes use routing overlays to share transaction data and
synchronize distributed ledgers.
Content Distribution Networks (CDNs): Nodes cache and route content, improving latency and
load distribution.
Decentralized Web: Projects like IPFS (InterPlanetary File System) use DHTs to locate and retrieve
content in a decentralized manner.
Routing overlays are essential for decentralized, scalable, and resilient data access in distributed
networks. By providing structured or unstructured mechanisms for data location, they enable
efficient resource sharing in P2P networks, handling challenges like node churn, load balancing, and
network distance. Routing overlays continue to play a critical role in P2P systems and inspire new
decentralized technologies, from file-sharing applications to blockchain networks.
Coordination and Agreement
Coordination and Agreement are crucial aspects in distributed systems, where multiple nodes (or
processes) work together to achieve common objectives. Due to the lack of a single centralized
control in distributed systems, these nodes must coordinate to reach a consensus on shared data,
task assignments, or the order of operations. This coordination ensures consistency, reliability, and
correctness in the presence of network delays, faults, or varying speeds of different nodes.
1. Consistency: Distributed systems often require data consistency across nodes. For example,
in a distributed database, if multiple nodes are updating or reading shared data, they must
agree on a consistent view of that data.
2. Fault Tolerance: Distributed systems are prone to partial failures. Nodes may crash, networks
may partition, or messages may be delayed or lost. Agreement protocols help ensure that the
system can handle these failures without losing coherence.
3. Order of Operations: In distributed systems, the order in which operations are executed can
significantly impact the final outcome. For example, in a banking application, the order of
transactions matters for account balances. Agreement protocols help nodes maintain a
consistent order for operations.
4. Load Distribution: Coordination is needed for efficient distribution of tasks and resources
across nodes. Agreement on task allocation helps balance the load, prevent redundant work,
and ensure that all nodes are used effectively.
Key Concepts in Coordination and Agreement
1. Consensus: Nodes must agree on a single value or decision (for example, whether to commit a transaction), even in the presence of failures and message delays.
2. Mutual Exclusion: In some cases, only one node should access a resource at a time. Mutual exclusion protocols allow nodes to coordinate and ensure that only one node can access a resource or perform a specific operation at any given time.
3. Leader Election: In distributed systems, it is often necessary to have one node act as a
coordinator (leader) to make decisions on behalf of others. Leader election algorithms help
nodes decide which among them should take on this role.
4. Atomic Commitment: In transactions that span multiple nodes, atomic commitment ensures
that all nodes agree on whether to commit or abort a transaction. If any part of the transaction
fails, the entire transaction should be rolled back to maintain consistency.
5. Fault Models: Distributed systems must handle various types of faults, including crash faults (a node stops working), omission faults (messages are lost or never sent), and Byzantine faults (nodes behave arbitrarily or maliciously).
6. Quorums: Quorum-based methods are often used to ensure consistency. By requiring that a
certain number of nodes (a quorum) agree on a value before it is accepted, systems can
achieve a level of fault tolerance and consistency.
Several protocols and algorithms help achieve coordination and agreement in distributed systems.
These include:
1.Two-Phase Commit Protocol (2PC):
In the first phase (prepare phase), the coordinator node asks all participants if they can commit. If all agree, the transaction proceeds to the second phase (commit phase), where all nodes commit.
2PC is blocking and can leave the system waiting indefinitely if the coordinator crashes.
2.Three-Phase Commit Protocol (3PC):
3PC adds an additional phase to ensure that even if the coordinator crashes, nodes can reach a consistent decision without being left in an uncertain state.
3.Paxos Algorithm:
A consensus algorithm designed for distributed systems to achieve agreement on a single value.
Paxos is fault-tolerant, handling node failures and network partitions, and is widely used in
applications requiring high availability.
4.Raft Algorithm:
A consensus algorithm that is easier to understand than Paxos and achieves similar goals.
Raft elects a leader among nodes, and all changes to the distributed state are handled through this
leader. If the leader fails, a new one is elected.
5.Byzantine Fault Tolerance (BFT):
Byzantine Fault Tolerance algorithms are designed to handle Byzantine faults, where nodes may act maliciously or arbitrarily.
Practical Byzantine Fault Tolerance (PBFT) is an example, used in systems where a high level of
reliability and security is required, such as blockchain networks.
6.Lamport Timestamps (Logical Clocks):
Lamport timestamps provide a logical clock to order events without requiring synchronized physical clocks, which helps achieve a consistent view of events across nodes.
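A minimal sketch of a Lamport logical clock is shown below; the class and method names are illustrative.

```python
class LamportClock:
    """Minimal sketch of a Lamport logical clock for one process."""

    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                      # every local event advances the clock
        return self.time

    def send(self) -> int:
        return self.local_event()           # attach this timestamp to the outgoing message

    def receive(self, msg_time: int) -> int:
        # On receipt, jump ahead of the sender's timestamp, then tick once.
        self.time = max(self.time, msg_time) + 1
        return self.time
```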
Challenges in Achieving Coordination and Agreement
2. Fault Tolerance vs. Consistency: Achieving agreement often requires trading off between consistency, availability, and partition tolerance (the CAP theorem). Consensus protocols typically favor consistency over availability in case of network partitions.
3. Latency: Communication delays can cause inconsistencies and make agreement difficult to
reach quickly, especially in geographically distributed systems.
4. Scalability: As the number of nodes increases, achieving agreement becomes more complex
and time-consuming. Protocols must be designed to scale efficiently.
5. Security and Trust: In some distributed systems, nodes may act maliciously. Byzantine fault-
tolerant protocols are necessary in such environments, but they are complex and resource-
intensive.
Coordination and agreement are essential for ensuring that distributed systems operate reliably
and consistently. Through consensus algorithms, fault tolerance mechanisms, and protocols for
resource allocation, distributed systems can achieve coordination even in complex environments.
The field continues to evolve, addressing challenges of scalability, security, and efficiency, with
newer algorithms and models like Raft, Paxos, and BFT being applied to large-scale applications
across industries.
Distributed Mutual Exclusion
In centralized systems, mutual exclusion can be achieved with locking mechanisms. However, in distributed systems, the lack of a single coordinating process and the unreliable nature of network communication add complexity to the problem.
Key Requirements for Distributed Mutual Exclusion
1. Mutual Exclusion: Only one node should be allowed to enter the critical section (access the
shared resource) at any given time.
2. Freedom from Deadlock: The system should prevent situations where two or more nodes
wait indefinitely for each other to release the critical section.
3. Freedom from Starvation: Every request to enter the critical section should eventually be
granted to avoid indefinite waiting.
4. Fault Tolerance: The system should continue to function even if some nodes or network
connections fail.
5. Fairness: Requests should be granted in the order they are made, or based on some fairness
criterion, to prevent any node from having an unfair advantage.
Several algorithms have been developed to address distributed mutual exclusion, each with its own
advantages and trade-offs. Here are some widely used approaches:
1. Centralized Algorithm
• Structure: One node is designated as the coordinator, which controls entry to the critical section on behalf of all other nodes.
• When a node wants to enter the critical section, it sends a request to the coordinator.
• The coordinator grants the request if no other node is in the critical section. Otherwise, it
queues the request.
• Pros: Simple to implement and guarantees mutual exclusion with low message overhead (request, grant, and release messages per entry).
• Cons: The coordinator is a single point of failure and may become a performance bottleneck in large systems.
2. Ring-Based Algorithm
• Structure: Nodes are arranged in a logical ring topology, where each node is connected to
two neighboring nodes, forming a circular structure. There is no central coordinator, and
communication is typically unidirectional.
• Token Passing: A token (a special control message) circulates around the ring. A node must
possess the token to access the shared resource. When a node wants to access the resource,
it waits for the token, processes its request, and then passes the token to the next node in
the ring.
• Pros:
o Fairness: Each node gets an equal opportunity to access the resource, as the token is
passed in a circular manner, ensuring no node is starved of access.
• Cons:
o Message Overhead: The need to pass the token around the entire ring can lead to
increased message overhead, especially if the number of nodes is large or if the token
is far from the requesting node.
o Single Point of Failure: If the token is lost or if a node fails while holding the token,
the entire system can be disrupted, requiring mechanisms for token recovery or
regeneration.
o Latency: The time it takes for the token to circulate can lead to higher latency,
particularly if nodes are far apart in the ring.
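The token-passing idea above can be illustrated with a small single-process simulation; the node names and the set of nodes wanting the critical section are made up for the example.

```python
from itertools import cycle, islice

nodes = ["N0", "N1", "N2", "N3"]        # logical ring order
wants_cs = {"N1", "N3"}                 # nodes currently wanting the critical section

def circulate_token(rounds: int = 1):
    # The token visits every node in ring order; only the holder may enter.
    for node in islice(cycle(nodes), rounds * len(nodes)):
        if node in wants_cs:
            print(f"{node} holds the token -> enters critical section, then releases it")
            wants_cs.discard(node)
        # A node that does not need the resource simply forwards the token.

circulate_token()
```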
Elections in Coordination and Agreement in Distributed Systems
Elections in the context of coordination and agreement in distributed systems are crucial for
achieving consensus among multiple nodes, especially when determining a leader or coordinator
for tasks such as resource management, decision-making, and fault tolerance. Below is an overview
of the election process and its significance in distributed systems:
Purpose: Elections are used to select a coordinator or leader node among a set of distributed nodes.
The selected leader is responsible for coordinating activities, managing resources, and ensuring that
consensus is reached among the nodes.
1.Bully Algorithm:
Mechanism: Each node has a unique identifier. When a node wants to initiate an election, it sends
an election message to nodes with higher identifiers. If no higher identifier responds, it declares
itself the leader.
Cons: High message overhead and can be inefficient in large systems due to multiple messages
being sent.
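The Bully Algorithm's decision rule can be sketched as follows; the node identifiers and the set of "alive" nodes are illustrative, and real implementations exchange election, answer, and coordinator messages rather than calling a single function.

```python
def bully_election(initiator: int, all_ids: set[int], alive: set[int]) -> int:
    """Return the leader as decided by an election started by `initiator`."""
    higher = {n for n in all_ids if n > initiator and n in alive}
    if not higher:
        return initiator          # no higher ID answered: the initiator declares itself leader
    # Otherwise higher-ID nodes take over the election; the highest alive ID wins.
    return max(higher)

ids = {1, 2, 3, 4, 5}
print(bully_election(initiator=2, all_ids=ids, alive={1, 2, 3}))   # nodes 4 and 5 are down -> 3 wins
```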
2.Ring Algorithm:
Mechanism: Nodes are arranged in a logical ring. When a node wants to start an election, it sends
a message containing its identifier around the ring. Nodes update the message with their identifiers
and forward it until it returns to the initiating node, which then selects the highest identifier as the
leader.
Pros: Reduces message complexity compared to the Bully Algorithm; efficient for systems with many
nodes.
Cons: Can have higher latency due to the message needing to circulate the entire ring.
3.Paxos Algorithm:
Mechanism: Involves multiple phases where nodes propose values, agree on a value, and commit
to that value. A leader is elected based on the highest proposal number.
4.Raft Algorithm:
Mechanism: Organizes nodes into a leader and followers. The leader handles all client requests, and elections occur if the leader fails. Nodes vote for candidates based on log completeness.
Pros: Easier to understand and implement than Paxos; provides strong consistency guarantees.
Cons: Requires more stable leader election processes, which can be a challenge in dynamic
environments.
Challenges in Election:
• Fault Tolerance: The system must be able to handle node failures gracefully. Elections should
ensure that a new leader can be elected if the current leader fails.
• Network Partitions: In the case of network splits, different parts of the system may elect
different leaders. The system must reconcile these states when the network is restored.
• Scalability: As the number of nodes increases, the election process must remain efficient to
prevent performance degradation.
Applications: Leader election is used in distributed databases, coordination services such as Apache ZooKeeper, and cluster management systems to designate a node that coordinates the others.
Elections play a vital role in achieving coordination and agreement in distributed systems. By
selecting a leader, the system can effectively manage resources, ensure consistent state across
nodes, and provide fault tolerance. The choice of election algorithm can significantly impact the
system's performance, scalability, and resilience. Understanding these dynamics is essential for
designing robust distributed applications.
Multicast communication
The term “multicast” refers to a method of sending a single message to a group of receivers at once, making the most of available network bandwidth while conserving system resources. In other words, multicast communication is a technique that transfers packets from one source to many receivers simultaneously.
Many everyday applications, such as audio/video conferencing, online gaming, and IPTV, are built on multicast communication.
Ethernet Multicast
Ethernet multicast signifies the process of multicasting at the data link layer in Ethernet networks.
Ethernet frames are sent to a group of destination devices that share a common multicast address. These frames are identified by setting the least significant bit of the first byte of the destination address to 1, which differentiates them from unicast frames.
IP Multicast
A multicast group consists of all hosts that have been configured to receive packets sent to a specific multicast (class D) IP address.
Multicast Groups
When a host is configured to receive datagrams sent to a multicast address, it is added to the multicast group for that address.
A group may contain anywhere from one host to an unlimited number of hosts. The list of individual group members is maintained neither by the hosts nor by the routers.
A host can belong to several multicast groups and send multicast messages to various multicast
addresses.
A host can send datagrams to a multicast group address even though there are no members present
in that group, and a host doesn’t need to be a member of a group to send multicast datagrams to
that group.
Multicasting on the Internet
A router checks whether any hosts on a locally connected network are set up to accept multicast datagrams by using IGMP (Internet Group Management Protocol).
The router regularly listens for IGMP messages on the local subnet and sends membership queries using the multicast group address 224.0.0.1 (reserved for all hosts).
Multicast routers do not keep a record of which hosts are members of a group but only need to
know if any hosts on that subnet are part of a group.
If a router gets a multicast datagram from another network and does not have any members for that
group address on any of its subnets, the packet is dropped.
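The sketch below shows a host joining a multicast group using standard socket options; joining causes the operating system to issue an IGMP membership report, which is how the local router learns that the subnet has a member. The group address and port are arbitrary example values.

```python
import socket
import struct

GROUP = "224.1.1.1"    # example multicast group address (illustrative)
PORT = 5007            # example port (illustrative)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Joining the group triggers an IGMP membership report from the OS.
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(1024)
    print(f"received {len(data)} bytes from {addr}")
```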
Unit IV Part-II
Transactions and Replications in Distributed Systems: An Introduction
In distributed systems, transactions and replication are two fundamental concepts that enhance the
reliability, consistency, and availability of data across multiple nodes. They address challenges posed
by network latency, node failures, and concurrent access to shared resources.
Transactions
A transaction is a sequence of operations executed as a single logical unit. Distributed transactions are expected to satisfy the ACID properties:
• Atomicity: Transactions are all-or-nothing. If any part of the transaction fails, the entire transaction is aborted, and changes are rolled back.
• Consistency: A transaction brings the system from one valid state to another, maintaining database invariants.
• Isolation: Concurrent transactions do not interfere with one another; the intermediate effects of a transaction are not visible to other transactions until it completes.
• Durability: Once a transaction is committed, its changes are permanent, even in the case of a system failure.
Importance
• Data Integrity: Ensures that data remains accurate and consistent across distributed nodes.
Replication
Replication involves creating copies of data across multiple nodes in a distributed system. The
primary goal of replication is to enhance data availability, fault tolerance, and performance by
distributing the data across various locations.
Types of Replication
1.Synchronous Replication:
Updates are made to all replicas simultaneously. This ensures strong consistency but can introduce
latency, as the system must wait for all replicas to acknowledge the update.
2.Asynchronous Replication:
Updates are propagated to replicas after the primary operation completes. This improves
performance and reduces latency but may lead to temporary inconsistencies among replicas.
3.Quorum-Based Replication:
Requires a majority (or quorum) of replicas to agree on an update before it is considered committed.
This balances consistency and availability.
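A minimal sketch of the quorum idea: with N replicas, choosing read and write quorum sizes so that R + W > N guarantees that every read overlaps the latest successful write. The replica counts and values below are illustrative.

```python
N, W, R = 5, 3, 3          # 5 replicas, write quorum of 3, read quorum of 3

assert R + W > N           # every read quorum overlaps every write quorum
assert 2 * W > N           # two conflicting writes cannot both reach a quorum

def read_latest(replies: list[tuple[int, str]]) -> str:
    """Given (version, value) pairs from R replicas, return the newest value."""
    version, value = max(replies)      # the overlap guarantees the latest write is among them
    return value

print(read_latest([(3, "x=7"), (2, "x=5"), (3, "x=7")]))   # -> "x=7"
```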
Importance
• Fault Tolerance: If one node fails, other replicas can continue to provide access to data,
ensuring system resilience.
• Load Balancing: Distributing read operations among multiple replicas can enhance
performance and reduce the load on any single node.
• High Availability: Replication helps ensure that data is always accessible, even in the event
of node failures or network partitions.
• Consistency Models: When transactions are used in conjunction with replication, consistency
models (like eventual consistency, strong consistency, and causal consistency) dictate how
replicas synchronize and maintain data integrity.
Transactions and replication are essential components of distributed systems, enabling reliable and
consistent data management across multiple nodes. By understanding and effectively implementing
these concepts, distributed systems can achieve higher levels of availability, fault tolerance, and
performance. As distributed applications become more prevalent, the need for robust transaction
management and effective replication strategies will continue to grow.
System Model:
The system model for managing replicated data in distributed systems consists of clients, front
ends, and replica managers. Clients make requests to access or modify data, which are handled by
front ends that act as intermediaries. The front ends send these requests to replica managers, which
store the actual copies of the data. Replica managers communicate with each other to keep the data
consistent across all replicas. This model is designed to ensure data availability, consistency, and
fault tolerance in distributed environments.
This system model is designed to ensure data consistency across multiple copies, or "replicas," stored
in different locations. It typically consists of three main components:
• Clients (C): The clients are the entities that make requests to the system to either retrieve or
update data. They don’t interact directly with the replica managers but instead go through
an intermediary.
• Front Ends (FE): The front ends act as intermediaries between the clients and the replica
managers. They receive requests from clients, forward these requests to the appropriate
replica managers, and relay the responses back to the clients. Front ends play a crucial role in
ensuring requests are handled efficiently and that clients receive consistent data.
• Replica Managers (RM): The replica managers are responsible for storing the actual copies
(replicas) of the data. They communicate with each other to keep their replicas synchronized.
In case of updates, each replica manager must ensure that the change is propagated to other
replica managers to maintain data consistency. The network of replica managers is called the
"Service."
Group communication between replica managers supports this model in several ways:
• Failure Tolerance: Group communication helps to manage the failure of individual replica
managers. If one replica manager becomes unreachable, the others can still continue
operating, ensuring the system remains available to serve client requests. Additionally, group
communication can help detect failures and redistribute requests to healthy replicas.
• Ordering of Operations: In distributed systems, the order in which updates are applied
across replicas is crucial for consistency. Group communication protocols often enforce
ordering guarantees (like total order or causal order), ensuring that all replicas process
updates in the same sequence, which prevents inconsistencies from arising due to out-of-
order updates.
• Coordination and Synchronization: When multiple clients try to update the same data
simultaneously, the replica managers need to coordinate to determine the final state of the
data. Group communication protocols help in achieving this by enabling synchronized
communication between replica managers, so they can agree on the update to be applied.
This system model uses group communication to enable efficient and consistent data replication, ensuring that multiple copies of data remain synchronized and resilient to failures. Group communication protocols provide the necessary infrastructure for replication, data consistency, and fault tolerance across distributed replica managers.
Concurrency Control in Distributed Transactions
1. Locking
Locking is a widely used technique to control access to data in distributed transactions by restricting
simultaneous access to resources.
Mechanism: In a locking mechanism, a transaction must acquire a lock on the data it wants to read
or write. There are typically two types of locks:
• Shared Lock (Read Lock): Multiple transactions can hold a shared lock on a resource,
allowing them to read but not write to it.
• Exclusive Lock (Write Lock): Only one transaction can hold an exclusive lock on a resource,
preventing other transactions from reading or writing to it.
Two-Phase Locking (2PL): To prevent conflicts, many distributed systems use the two-phase locking protocol, which has two phases:
• Growing Phase: The transaction acquires all the locks it needs and does not release any lock.
• Shrinking Phase: The transaction releases its locks and cannot acquire any new locks.
Benefits and Limitations: Locking provides strong consistency but can lead to deadlocks (where
two or more transactions are waiting for each other’s locks) and reduced concurrency due to lock
contention.
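A minimal sketch of a lock table supporting shared and exclusive locks is shown below; it is a single-process illustration (a distributed lock manager would also need messaging and deadlock handling), and the class and method names are assumptions.

```python
from collections import defaultdict

class LockManager:
    """Illustrative lock table with shared (read) and exclusive (write) locks."""

    def __init__(self):
        self.shared = defaultdict(set)    # item -> transactions holding a read lock
        self.exclusive = {}               # item -> transaction holding the write lock

    def acquire_shared(self, txn, item) -> bool:
        if item in self.exclusive and self.exclusive[item] != txn:
            return False                  # conflicts with another writer
        self.shared[item].add(txn)
        return True

    def acquire_exclusive(self, txn, item) -> bool:
        other_readers = self.shared[item] - {txn}
        if other_readers or (item in self.exclusive and self.exclusive[item] != txn):
            return False                  # conflicts with readers or another writer
        self.exclusive[item] = txn
        return True

    def release_all(self, txn):
        # Called in the shrinking phase of 2PL: drop every lock the transaction holds.
        for item in list(self.shared):
            self.shared[item].discard(txn)
        for item, holder in list(self.exclusive.items()):
            if holder == txn:
                del self.exclusive[item]
```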
2. Timestamp Ordering Concurrency Control
Mechanism: Each transaction is assigned a unique timestamp when it starts. Transactions are
ordered based on their timestamps, and operations are executed according to this order:
If a transaction tries to read or write data that another newer transaction has accessed, it may be
aborted and restarted to maintain timestamp order.
Types:
Basic Timestamp Ordering: Ensures that conflicting operations are executed according to the
timestamp order of transactions.
Multiversion Timestamp Ordering (MVTO): Keeps multiple versions of a data item, each
associated with the timestamp of the transaction that created it. This allows transactions to read
different versions of the same data, increasing concurrency.
Benefits and Limitations: Timestamp ordering reduces the risk of deadlocks and increases
concurrency. However, it may lead to frequent transaction restarts if timestamps conflict,
especially in high-contention environments.
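The basic timestamp-ordering test can be sketched as follows: each item records the largest read and write timestamps seen so far, and an operation that arrives "too late" relative to a younger transaction is rejected, forcing a restart. The data structures and names are illustrative.

```python
read_ts, write_ts = {}, {}      # per-item largest read / write timestamps seen so far

def read(txn_ts: int, item: str) -> bool:
    if txn_ts < write_ts.get(item, 0):
        return False                               # item written by a younger txn: abort/restart
    read_ts[item] = max(read_ts.get(item, 0), txn_ts)
    return True

def write(txn_ts: int, item: str) -> bool:
    if txn_ts < read_ts.get(item, 0) or txn_ts < write_ts.get(item, 0):
        return False                               # a younger txn already read or wrote it: abort/restart
    write_ts[item] = txn_ts
    return True

print(write(5, "x"), read(3, "x"))                 # the older reader (ts=3) must restart -> True False
```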
3. Optimistic Concurrency Control (OCC)
Optimistic concurrency control assumes conflicts are rare and allows transactions to execute without restrictions until they are ready to commit. A transaction passes through three phases:
• Read Phase: The transaction reads data items without acquiring any locks.
• Validation Phase: Before committing, the transaction checks if other concurrent transactions
have modified the data it has read. If conflicts are detected, the transaction is aborted and
restarted.
• Write Phase: If the transaction passes validation, it writes its changes to the database.
Benefits and Limitations: OCC is beneficial in environments with low contention, as it allows high
concurrency without the need for locks. However, in systems with high contention, OCC can lead to
frequent rollbacks, as transactions may fail validation more often.
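The validation phase can be sketched as a simple backward-validation check: the committing transaction's read set is compared against the write sets of transactions that committed while it was executing. The sets below are illustrative.

```python
def validate(read_set: set, committed_write_sets: list[set]) -> bool:
    """Return True if the transaction may commit, False if it must restart."""
    for ws in committed_write_sets:
        if read_set & ws:          # it read something overwritten by a concurrent committer
            return False
    return True

print(validate({"a", "b"}, [{"c"}, {"d"}]))    # True: no overlap, commit
print(validate({"a", "b"}, [{"b"}]))           # False: conflict, abort and restart
```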
Distributed deadlock
Distributed deadlock occurs in a distributed system when two or more transactions or processes,
running on different servers, wait indefinitely for resources held by each other, forming a circular
dependency. This makes it impossible for any of the transactions involved to proceed, leading to a
system stall.
Resource Deadlock: A resource deadlock occurs when two or more processes wait permanently for
resources held by each other.
• A process requires certain resources for its execution and cannot proceed until it has acquired all of those resources.
• It proceeds to execution only once it has acquired all required resources.
• This can be represented using an AND condition, since the process executes only if it has all of the required resources.
• Example: Process 1 has R1, R2, and requests resources R3. It will not execute if any one of
them is missing. It will proceed only when it acquires all requested resources i.e. R1, R2, and
R3.
Communication Deadlock: On the other hand, a communication deadlock occurs among a set
of processes when they are blocked waiting for messages from other processes in the set in order
to start execution but there are no messages in transit between them. When there are no
messages in transit between any pair of processes in the set, none of the processes will ever
receive a message. This implies that all processes in the set are deadlocked. Communication
deadlocks can be easily modeled by using wait-for graphs (WFGs) to indicate which processes are waiting to
receive messages from which other processes. Hence, the detection of communication deadlocks
can be done in the same manner as that for systems having only one unit of each resource type.
Example of Distributed Deadlock:
• Transaction U running on server X locks object A and wants to access B, which is locked by
V. Transaction V running on server Y locks object B and wants to access C, which is locked
by W.
• Transaction W running on server Z locks object C and wants to access A, which is locked by
U.
Circular Dependency:
U waits for V, V waits for W, and W waits for U (U → V → W → U). This cycle indicates a distributed deadlock where none of the transactions can proceed unless one of them is aborted and releases its lock.
Deadlock Detection: Servers communicate to build a global wait-for graph and detect cycles,
aborting one of the transactions involved in the cycle.
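Using the example above, a global wait-for graph and a simple cycle check might look like the following sketch (the graph would be assembled from information the servers exchange; the representation here is illustrative).

```python
# Global wait-for graph for the example: U waits for V, V for W, W for U.
wait_for = {"U": {"V"}, "V": {"W"}, "W": {"U"}}

def has_cycle(graph: dict[str, set[str]]) -> bool:
    visited, on_stack = set(), set()

    def dfs(node: str) -> bool:
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True            # back edge found -> deadlock cycle
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

print(has_cycle(wait_for))             # True: abort one of U, V, W to break the deadlock
```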
Transaction Recovery in Distributed Systems
Transaction recovery refers to the process of restoring a system to a consistent state after a
failure during or after the execution of a transaction. It ensures that a transaction either commits
(completes successfully) or aborts (reverts all changes), and that no partial results are visible to
the system.
For a distributed transaction recovery to work, the transaction must adhere to the ACID (Atomicity,
Consistency, Isolation, Durability) properties:
Atomicity: The entire transaction is treated as a single unit. If any part of the transaction fails, the
entire transaction fails.
Consistency: The system transitions from one consistent state to another. Transactions should bring
the system to a valid state according to predefined rules.
Isolation: The effects of one transaction should be isolated from other concurrent transactions until
the transaction is complete.
Durability: Once a transaction is committed, its effects are permanent, even in the event of a crash.
The goal of transaction recovery is to ensure that, despite failures, transactions will always leave the
system in a consistent state, either by successfully committing or by aborting and undoing any
partial changes made.
Failures that can require recovery include:
Media Failure: Failure of disk or storage devices where transaction logs and data are stored.
Node Crash: A node in the distributed system crashes, leading to data inconsistency or the
transaction not being fully processed.
To achieve transaction recovery, several mechanisms and techniques are employed in distributed
systems. These include logging, checkpointing, and two-phase commit protocols. Below are the
primary techniques used for transaction recovery:
1. Write-Ahead Logging (WAL)
Write-Ahead Logging (WAL) ensures that before any changes are made to the actual data, the
changes are first written to a log. This log records the sequence of transaction operations and their
effects. If a failure occurs, the system can use this log to determine which operations were completed
and which were not.
Redoing: Reapplying the changes from the log that were committed but not yet reflected in the
system (i.e., recovering committed transactions).
Undoing: Reverting the changes of uncommitted transactions to ensure the system is consistent.
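A minimal sketch of log-based recovery: committed updates are redone in log order, and updates of uncommitted transactions are undone in reverse order. The log format and values are illustrative.

```python
# Each update record carries: transaction id, operation, item, old value (undo), new value (redo).
log = [
    ("T1", "begin", None, None, None),
    ("T1", "update", "x", 1, 2),
    ("T1", "commit", None, None, None),
    ("T2", "begin", None, None, None),
    ("T2", "update", "y", 5, 9),        # T2 never committed before the crash
]

def recover(log, db):
    committed = {t for (t, op, *_) in log if op == "commit"}
    for t, op, item, old, new in log:
        if op == "update" and t in committed:
            db[item] = new              # REDO committed changes in log order
    for t, op, item, old, new in reversed(log):
        if op == "update" and t not in committed:
            db[item] = old              # UNDO uncommitted changes in reverse order

db = {"x": 1, "y": 9}                   # y=9 leaked to disk before the crash
recover(log, db)
print(db)                               # {'x': 2, 'y': 5}
```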
2. Two-Phase Commit Protocol (2PC)
The Two-Phase Commit Protocol (2PC) is a consensus protocol used to ensure that a transaction is either fully committed across all participating nodes or fully aborted. It operates in two phases:
Prepare Phase: The coordinator sends a "prepare" message to all participants (nodes), asking them
if they are ready to commit the transaction.
Commit/Abort Phase: If all participants respond with "yes," the coordinator sends a "commit"
message to all participants, and they make the transaction permanent. If any participant responds
with "no," the coordinator sends an "abort" message, and all participants roll back the transaction.
2PC guarantees atomicity but has limitations, such as blocking in case of failures (if the coordinator
or participants crash).
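The coordinator's side of 2PC can be sketched as follows; the participants are assumed to be stubs exposing prepare, commit, and abort operations, and timeouts and logging are omitted for brevity.

```python
def two_phase_commit(participants) -> str:
    # Phase 1: ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2a: unanimous "yes" -> commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    # Phase 2b: at least one "no" -> abort everywhere.
    for p in participants:
        p.abort()
    return "aborted"
```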
3. Three-Phase Commit Protocol (3PC)
The Three-Phase Commit Protocol (3PC) improves upon 2PC by addressing its blocking problem. In 3PC, there is an additional phase after the "prepare" phase, allowing for more robust recovery in the event of failures.
Phase 1: The coordinator sends a "prepare" (can-commit) request to all participants.
Phase 2: Participants send back an acknowledgment that they are ready to commit.
Phase 3: Once the coordinator receives acknowledgments, it sends a "commit" message to all
participants. If any participant is unable to prepare, it sends an "abort" message.
This protocol reduces blocking by providing a "safe" state after the second phase, preventing
deadlocks due to participant crashes.
4. Compensation and Forward Recovery
In some cases, instead of rolling back a transaction to a previous state (backward recovery), compensation or forward recovery is used. In forward recovery, once an error is detected, the system does not revert to an earlier state but instead takes corrective actions to move forward and maintain consistency.
For instance, if a transaction partially commits and later encounters an error, the system might apply
compensating transactions to neutralize the effects of the failed transaction.
5. Distributed Snapshot and Global State
Distributed Snapshot is a technique used to create a consistent global state of all nodes in a
distributed system at a certain point in time. During recovery, these snapshots can be used to restore
the system’s state to a consistent point in time, undoing any incomplete or erroneous transactions.
This technique is particularly useful in distributed systems with multiple processes that need to be
synchronized to prevent inconsistent states.
Replication
Replication is the process of duplicating data, operations, or services across multiple systems or
locations to improve performance, reliability, and fault tolerance. In the context of distributed
systems or databases, replication ensures that copies of data are consistently maintained across
several nodes, so that even if one node fails, the system can continue functioning normally by relying
on the other copies.
There are two primary types of replication strategies: active replication and passive replication.
Both have distinct characteristics and are used in different scenarios based on system requirements,
such as fault tolerance, consistency, and performance.
1. Active Replication
In active replication, every replica receives and processes each client request.
Concurrency: All replicas are active and handle requests simultaneously, meaning that multiple replicas are processing operations at the same time.
Fault Tolerance: If one or more replicas fail, the system can continue to operate as long as there are
other replicas that can process the requests. There’s no single point of failure.
Consistency: In active replication, it is crucial that all replicas maintain a consistent state. This is
typically achieved through consensus algorithms or synchronization techniques like Paxos or Raft,
which ensure that all replicas agree on the sequence of operations.
Distributed databases where read queries can be served by any replica, but write queries must be
synchronized across all replicas to maintain consistency.
Load-balancing systems where requests are distributed across multiple servers, each of which
processes requests concurrently.
2. Passive Replication
In passive replication, a single primary (master) replica handles all requests, while the other replicas (secondary or slave) are passive and act as backups. The passive replicas only take over in case the primary replica fails.
Single Primary Replica: Only one replica, the "primary," handles all requests and updates to the
system.
Failover Mechanism: If the primary replica fails, one of the passive replicas (usually the most up-
to-date one) is promoted to be the new primary. This failover process ensures that the system can
continue operating.
Consistency and Availability: Passive replication ensures consistency since the primary replica is
the only one making updates to the system. However, it might have lower availability compared to
active replication, as there is downtime during the failover process.
Performance: Passive replication can be more efficient in write-heavy workloads, as only one node
handles write requests, reducing synchronization overhead. However, read performance can be
improved if read requests are distributed to passive replicas.
Database systems where one master database handles all write operations, and read queries can
be distributed to read-only replicas.
Distributed file systems where there is a master node for updates, but backup nodes (passive
replicas) store copies of the data.
Comparison of Active and Passive Replication
• Replica Activity: Active replication: all replicas are active and process requests. Passive replication: only one replica (the primary) processes requests; the others are passive.
• Fault Tolerance: Active replication: high; can tolerate multiple failures. Passive replication: lower; only one replica is active, so failover is necessary.
• Consistency: Active replication: achieved through coordination (e.g., consensus algorithms). Passive replication: easier to maintain, as only the primary replica updates data.
• Performance: Active replication: good for read-heavy workloads, but write synchronization can be slow. Passive replication: good for write-heavy workloads, but failover can introduce delays.
• Availability: Active replication: high availability even during failures. Passive replication: availability may drop during failover if the primary fails.