Unit 4
Replication in distributed systems involves creating multiple copies of data or services across
different nodes to improve reliability, availability, and performance. System models in
replication typically describe the behavior, consistency, and architectural aspects of
replication. These models can be broadly categorized as follows:
1. Replication Architectures
These models define how replicas are organized and managed.
Master-Slave Model
o One master replica handles all write operations and propagates changes to
slave replicas.
o Slave replicas handle read requests.
o Advantage: Simplified conflict resolution.
o Disadvantage: Single point of failure at the master.
Multi-Master Model
o Multiple replicas can accept write operations.
o Updates are propagated and synchronized between masters.
o Advantage: Improved availability and fault tolerance.
o Disadvantage: Requires complex conflict resolution mechanisms.
Peer-to-Peer Model
o All replicas are equal, and any replica can serve read and write requests.
o Replicas coordinate to maintain consistency.
o Advantage: High fault tolerance and scalability.
o Disadvantage: Synchronization overhead.
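To make the master-slave model above concrete, here is a minimal sketch (class and method names are illustrative, not from any particular system): the master applies each write locally and propagates it to every slave, while slaves serve reads only.

```python
# Minimal sketch of the master-slave replication model.
# The master handles all writes and pushes them to slaves;
# slaves handle only reads. Names are illustrative.

class Slave:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):          # called by the master only
        self.data[key] = value

    def read(self, key):                  # slaves serve read requests
        return self.data.get(key)

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value            # apply locally first
        for s in self.slaves:             # then propagate to every slave
            s.apply(key, value)

slaves = [Slave(), Slave()]
master = Master(slaves)
master.write("x", 42)
print(all(s.read("x") == 42 for s in slaves))  # every replica sees the write
```

Because only the master accepts writes, there are no write-write conflicts to resolve, which is exactly the simplification (and the single point of failure) noted above.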
2. Consistency Models
Consistency models specify the guarantees provided about the visibility of updates across
replicas.
Strong Consistency
o All replicas reflect the same data at any given time.
o Ensured through protocols like Two-Phase Commit (2PC) or Paxos.
o Advantage: Simplified programming model.
o Disadvantage: High latency and lower availability.
Eventual Consistency
o Replicas may diverge temporarily but will eventually converge to the same
state.
o Suitable for systems with high availability needs, like DNS.
o Advantage: High performance and scalability.
o Disadvantage: Temporary inconsistency can lead to anomalies.
Causal Consistency
o Guarantees that causally related operations are seen in the same order by all
replicas.
o Advantage: Reduces anomalies while being more efficient than strong
consistency.
o Disadvantage: Requires maintaining causal history, which can be complex.
Quorum-Based Consistency
o Operations must be acknowledged by a majority (quorum) of replicas to
succeed.
o Used in systems like Apache Cassandra and Amazon DynamoDB.
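A minimal sketch of the quorum idea, assuming N = 5 replicas with write quorum W = 3 and read quorum R = 3 (all names and values are illustrative): because R + W > N, every read quorum overlaps every write quorum, so a read always contacts at least one replica holding the latest acknowledged write.

```python
# Toy quorum-based read/write over N replicas, with versioned values.
# A write succeeds once W replicas acknowledge it; a read consults R
# replicas and returns the value with the highest version.

N, W, R = 5, 3, 3
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    acks = 0
    for rep in replicas:
        if acks == W:                 # stop once the write quorum is reached
            break
        rep["version"], rep["value"] = version, value
        acks += 1
    return acks >= W

def read():
    contacted = replicas[-R:]         # any R replicas; overlap with W is guaranteed
    newest = max(contacted, key=lambda rep: rep["version"])
    return newest["value"]

write("v1", version=1)
print(read())  # -> v1, because R + W > N forces quorum overlap
```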
3. Failure Models
These models address how systems handle faults during replication.
Crash Failure Model
o Nodes may stop functioning but do not exhibit arbitrary behavior.
o Replication systems handle such failures using techniques like failover.
Byzantine Failure Model
o Nodes may behave maliciously or unpredictably.
o Requires robust protocols like Byzantine Fault Tolerance (BFT) to maintain
correctness.
Network Partition Model
o Network failures may isolate parts of the system, leading to partition
tolerance challenges.
o The CAP theorem highlights the trade-off between Consistency, Availability,
and Partition tolerance.
Group communication in distributed systems refers to the process where multiple nodes or entities
communicate with each other as a group.
This method is essential for coordinating actions, sharing data, and ensuring that all
participants in the system are informed and synchronized. It’s particularly useful in scenarios
like collaborative applications and real-time updates.
Group communication is critically important in distributed systems due to several key reasons:
Multiple nodes must collaborate and synchronize their actions. Group communication helps
them exchange information and stay updated.
Different nodes can create data that needs to be shared. Group communication helps quickly
send this information to everyone involved, reducing delays and keeping data consistent.
1. Unicast Communication
Unicast communication is the point-to-point transmission of data between two nodes in a network.
In the context of distributed systems:
Unicast is when a sender sends a message to a specific recipient, using their unique network
address.
Each message targets one recipient, creating a direct connection between the sender and
the receiver.
You commonly see unicast in client-server setups, where a client makes requests and
receives responses, as well as in direct connections between peers.
This method makes good use of network resources, is easy to implement, and keeps latency
low because messages go straight to the intended recipient.
Unicast isn’t efficient for sending messages to many recipients at once, as it requires
separate messages for each one, leading to more work.
2. Multicast Communication
Multicast communication involves sending a single message from one sender to multiple receivers
simultaneously within a network. It is particularly useful in distributed systems where broadcasting
information to a group of nodes is necessary:
Multicast lets a sender share a message with a specific group of receivers that have
joined the group. This way, the sender can reach many receivers at once, which is more
efficient than sending separate messages.
By sending data just once to a group, multicast saves bandwidth, simplifies communication,
and can easily handle a larger number of recipients.
Managing group membership is necessary to ensure reliable message delivery, and multicast
can run into issues if there are network problems that affect everyone in the group.
3. Broadcast Communication
Broadcast communication involves sending a message from one sender to all nodes in the network,
ensuring that every node receives the message:
Broadcast is when a sender sends a message to every node in the network without targeting
specific recipients.
Messages are delivered to all nodes at once using a special address designed for this
purpose.
It’s often used for network management tasks, like sending status updates, or for emergency
alerts that need to reach everyone quickly.
Broadcast ensures that every node receives the message without needing to specify who the
recipients are, making it efficient for sharing information widely.
It can cause network congestion in larger networks and raises security concerns since anyone
on the network can access the broadcast message, which might lead to unauthorized access.
Messages sent from a sender to multiple recipients should be delivered reliably, consistently, and in
a specified order. Types of Reliable Multicast Protocols include:
FIFO Ordering:
o Ensures that messages are delivered to all group members in the order they were
sent by the sender.
Causal Ordering:
o Ensures that messages are delivered in an order that respects the causal
dependencies between them.
Total (Atomic) Ordering:
o Guarantees that all group members receive messages in the same global order.
o Ensures that operations based on the multicast messages (like updates to shared
data) appear atomic or indivisible to all recipients.
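The FIFO ordering described above can be sketched with per-sender sequence numbers (a simplified illustration; all names are made up): each sender numbers its messages, and a receiver delivers message n+1 from a sender only after delivering message n, buffering anything that arrives early.

```python
# Toy FIFO-ordered delivery: out-of-order messages from a sender are
# buffered until the gap before them is filled.

class FifoReceiver:
    def __init__(self):
        self.next_seq = {}     # per-sender: next sequence number expected
        self.buffer = {}       # per-sender: held-back out-of-order messages
        self.delivered = []

    def receive(self, sender, seq, msg):
        expected = self.next_seq.setdefault(sender, 1)
        self.buffer.setdefault(sender, {})[seq] = msg
        # deliver as many consecutive messages as are now available
        while expected in self.buffer[sender]:
            self.delivered.append(self.buffer[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("p1", 2, "b")       # arrives early: buffered, not delivered
r.receive("p1", 1, "a")       # fills the gap, releasing both messages
print(r.delivered)            # -> ['a', 'b']
```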
Scalability and performance are vital for effective group communication in distributed systems. It’s
essential for the system to handle more nodes, messages, and participants while still operating
efficiently. Here’s a closer look at these important aspects:
1. Scalability
Scalability refers to the system’s ability to grow without losing efficiency. This includes:
Handling more messages exchanged among group members, ensuring timely and responsive
communication.
Supporting connections across distant nodes or networks, which can introduce latency and
bandwidth issues.
2. Challenges in Scalability
The management of group membership and message routing becomes more complex,
adding overhead.
The network must have enough bandwidth to support the higher traffic from a larger group
to avoid congestion.
Keeping distributed nodes consistent and synchronized gets more complicated as the system
scales.
3. Strategies for Scalability
Partitioning and Sharding: Breaking the system into smaller parts can make communication
and management more manageable.
Load Balancing: Evenly distributing the workload across nodes helps prevent bottlenecks and
optimizes resource use.
Replication and Caching: Duplicating data or messages across nodes can lower access times
and enhance fault tolerance, aiding scalability.
Performance in group communication is crucial for optimizing message speed, resource utilization,
and addressing challenges like message ordering and concurrent access, ensuring efficient
collaboration in distributed systems.
1. Performance
It’s crucial to minimize the time it takes for messages to reach their intended recipients.
We want to enhance the rate at which messages are handled and sent out.
Efficient use of bandwidth, CPU, and memory helps keep communication fast and effective.
2. Challenges in Performance
Making sure messages arrive in the right order can be tough, especially with strict protocols.
Handling multiple users trying to access shared resources at the same time can lead to
slowdowns.
Communication needs to adjust based on changing conditions, like slow bandwidth or lost
packets.
3. Strategies for Performance
Smart Routing: Using efficient routing methods can reduce delays by cutting down on the
number of hops messages take.
Pre-storing Data: Keeping frequently accessed messages or data ready can help lower delays
and speed up responses.
Group communication in distributed systems comes with several challenges due to the need to
coordinate multiple nodes that may be spread out or connected through unreliable networks. Key
challenges include:
Reliability: Messages must reach all intended recipients, even during network failures or
node crashes, which can be complicated when nodes frequently join or leave.
Concurrency and Consistency: Keeping shared data consistent while allowing simultaneous
updates is tricky, requiring strong synchronization to avoid conflicts.
Fault Tolerance: The system must handle node failures and communication issues without
losing reliability. This means having mechanisms to detect failures and manage changes in
group membership.
In distributed systems, three types of problems occur: faults, errors, and failures. All three
are related.
Fault: A fault is a weakness or shortcoming in the system or in any hardware or
software component. The presence of a fault can lead to an error and then to a failure.
Error: An error is the part of the system state that deviates from the correct state as a
result of a fault and that may lead to a failure.
Failure: A failure is the outcome in which the assigned goal is not achieved.
Types of Faults
Transient Faults: Transient faults occur once and then disappear. These faults do not
harm the system to a great extent but are very difficult to find or locate. A momentary
processor fault is an example of a transient fault.
Intermittent Faults: Intermittent faults occur again and again: the fault appears, vanishes
on its own, and then reappears. A working computer that occasionally hangs is an example
of an intermittent fault.
Permanent Faults: Permanent faults remain in the system until the faulty component is
replaced. These faults can cause very severe damage to the system but are easy to
identify. A burnt-out chip is an example of a permanent fault.
1. Availability: Availability is defined as the property where the system is readily available for its
use at any time.
2. Reliability: Reliability is defined as the property where the system can work continuously
without any failure.
3. Safety: Safety is defined as the property where the system can remain safe from
unauthorized access even if any failure occurs.
4. Maintainability: Maintainability is defined as the property that states how easily and
quickly the failed node or system can be repaired.
In order to implement the techniques for fault tolerance in distributed systems, the design,
configuration and relevant applications need to be considered. Below are the phases carried out for
fault tolerance in any distributed systems.
1. Fault Detection
Fault detection is the first phase, in which the system is monitored continuously and the
outcomes are compared with the expected output. If any faults are identified during
monitoring, they are reported. These faults can occur for various reasons such as hardware
failure, network failure, and software issues. The main aim of the first phase is to detect
faults as soon as they occur so that the assigned work is not delayed.
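The fault-detection phase can be sketched with a simple heartbeat monitor (all timing values and node names are illustrative): each node reports periodically, and the monitor flags any node whose last report is older than a timeout.

```python
# Toy heartbeat-based fault detector: a node is suspected faulty if its
# last heartbeat is older than `timeout` seconds.

def detect_faults(last_heartbeat, now, timeout):
    """Return the nodes whose heartbeat is older than `timeout` seconds."""
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

last_heartbeat = {"node-a": 100.0, "node-b": 94.0, "node-c": 99.5}
suspects = detect_faults(last_heartbeat, now=101.0, timeout=5.0)
print(suspects)  # -> ['node-b']  (no report for 7 seconds)
```

A real detector must distinguish a crashed node from a slow network; timeouts only make the suspicion, which is why the diagnosis phase below follows detection.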
2. Fault Diagnosis
Fault diagnosis is the process where the fault that is identified in the first phase will be diagnosed
properly in order to get the root cause and possible nature of the faults. Fault diagnosis can be done
manually by the administrator or by using automated Techniques in order to solve the fault and
perform the given task.
3. Evidence Generation
Evidence generation is defined as the process where the report of the fault is prepared based on the
diagnosis done in an earlier phase. This report involves the details of the causes of the fault, the
nature of faults, the solutions that can be used for fixing, and other alternatives and preventions that
need to be considered.
4. Assessment
Assessment is the process where the damages caused by the faults are analyzed. It can be
determined with the help of messages that are being passed from the component that has
encountered the fault. Based on the assessment further decisions are made.
5. Recovery
Recovery is the process whose aim is to make the system fault free and restore it to a
correct state, through forward recovery or backward recovery. Common recovery
techniques such as reconfiguration and resynchronization can be used.
1. Hardware Fault Tolerance: Hardware Fault Tolerance involves keeping a backup plan for
hardware devices such as memory, hard disk, CPU, and other hardware peripheral devices.
Hardware Fault Tolerance is a type of fault tolerance that does not examine faults and
runtime errors but can only provide hardware backup. The two different approaches that are
used in Hardware Fault Tolerance are fault-masking and dynamic recovery.
2. Software Fault Tolerance: Software Fault Tolerance is a type of fault tolerance where
dedicated software is used in order to detect invalid output, runtime, and programming
errors. Software Fault Tolerance makes use of static and dynamic methods for detecting and
providing the solution. Software Fault Tolerance also consists of additional data points such
as recovery rollback and checkpoints.
3. System Fault Tolerance: System Fault Tolerance is a type of fault tolerance that consists of a
whole system. It has the advantage that it not only stores the checkpoints but also the
memory block, and program checkpoints and detects the errors in applications
automatically. If the system encounters any type of fault or error it does provide the required
mechanism for the solution. Thus system fault tolerance is reliable and efficient.
Fault Tolerance Strategies
Fault tolerance strategies are essential for ensuring that distributed systems continue to operate
smoothly even when components fail. Here are the key strategies commonly used:
Checkpointing and Recovery Mechanisms
o Checkpointing: Periodic saving of the system’s state so that if a failure occurs, the
system can be restored to the last saved state.
o Rollback Recovery: The system reverts to a previous state after detecting an error,
using saved checkpoints or logs.
o Forward Recovery: The system attempts to correct or compensate for the failure to
continue operating. This may involve reprocessing or reconstructing data.
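Checkpointing with rollback recovery can be sketched as follows (a toy example; class and method names are illustrative): the system saves its state periodically and, on error, reverts to the last checkpoint instead of restarting from scratch.

```python
# Toy checkpoint/rollback: state is saved at a checkpoint and restored
# after a simulated fault.
import copy

class CheckpointedCounter:
    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint_state = copy.deepcopy(self.state)

    def checkpoint(self):
        self.checkpoint_state = copy.deepcopy(self.state)  # save known-good state

    def rollback(self):
        self.state = copy.deepcopy(self.checkpoint_state)  # revert to it

    def add(self, n):
        if not isinstance(n, int):
            raise ValueError("bad input")
        self.state["total"] += n

c = CheckpointedCounter()
c.add(10)
c.checkpoint()                 # save a known-good state
try:
    c.add(5)
    c.add("oops")              # simulated fault mid-operation
except ValueError:
    c.rollback()               # revert to the last saved state
print(c.state["total"])        # -> 10
```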
Design patterns for fault tolerance help in creating systems that can handle failures gracefully and
maintain reliable operations. Here are some key fault tolerance design patterns:
1. Circuit Breaker Pattern
This pattern prevents a system from making calls to a failing service by wrapping it in a “circuit
breaker.” When the service fails, the circuit breaker trips, causing further calls to fail fast instead of
repeatedly trying to connect to a failing service.
Useful in scenarios where services might experience temporary outages. For example, a microservices
architecture where a downstream service might be unreliable.
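A minimal sketch of the circuit breaker pattern, with an illustrative failure threshold (names are made up): after a threshold of consecutive failures the breaker “trips,” and later calls fail fast without touching the unreliable service.

```python
# Toy circuit breaker: trips to the open state after `threshold`
# consecutive failures; open-state calls fail fast.

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            raise CircuitOpenError("failing fast: circuit is open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True       # trip the breaker
            raise
        self.failures = 0              # any success resets the count
        return result

def flaky():
    raise RuntimeError("downstream service unavailable")

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(flaky)
    except RuntimeError:
        pass
print(breaker.open)  # -> True: further calls now raise CircuitOpenError
```

Production breakers also add a “half-open” state that periodically lets one probe call through to test whether the service has recovered; that is omitted here for brevity.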
2. Bulkhead Pattern
This pattern isolates different components or services to prevent a failure in one part of the system
from affecting others. It’s similar to the bulkheads in a ship that prevent flooding in one
compartment from sinking the entire vessel.
Essential in systems where failures in one service should not impact others. For instance, an e-
commerce platform might use bulkhead isolation to separate payment processing from inventory
management.
3. Retry Pattern
This pattern involves automatically retrying an operation that has failed due to transient errors. The
retries are typically done with exponential backoff to avoid overwhelming the system.
Suitable for scenarios where operations might fail intermittently due to temporary issues like network
glitches or service overloads.
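The retry pattern with exponential backoff can be sketched like this (names are illustrative; the sleep function is injectable so the backoff schedule is visible without real delays):

```python
# Toy retry with exponential backoff: the wait doubles after each
# failed attempt (base_delay, 2*base_delay, 4*base_delay, ...).

def retry(op, attempts=4, base_delay=0.1, sleep=None):
    delays = []
    for i in range(attempts):
        try:
            return op(), delays
        except Exception:
            if i == attempts - 1:
                raise                      # out of attempts: give up
            wait = base_delay * (2 ** i)   # 0.1, 0.2, 0.4, ...
            delays.append(wait)
            if sleep:
                sleep(wait)

calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient glitch")
    return "ok"

result, delays = retry(sometimes_fails)
print(result, delays)  # -> ok [0.1, 0.2]
```

Real implementations usually also add random jitter to the delays so that many clients retrying at once do not synchronize into load spikes.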
4. Rate Limiting Pattern
This pattern controls the number of requests a system or service can handle within a specific time
window to prevent overload and ensure fair usage.
Essential for APIs and services that might be susceptible to abuse or excessive traffic. It helps in
maintaining system stability and performance.
5. Failover Pattern
This pattern involves switching to a backup system or component when the primary one fails. It
ensures continuity of service by having redundant systems ready to take over.
Distributed Shared Memory (DSM) implements the shared memory model in a distributed
system that has no physically shared memory. The DSM model provides a virtual address
space shared among all nodes, overcoming the high cost of communication in distributed
systems. DSM systems move data to the location of access: data moves between main
memory and secondary memory (within a node) and between the main memories of
different nodes. Every data object is owned by a node; the initial owner is the node that
created the object, and ownership can change as the object moves from node to node.
When a process accesses data in the shared address space, the mapping manager maps the
shared memory address to physical memory (local or remote).
DSM allows programs running on separate machines to share data without the programmer
having to deal with sending messages; instead, the underlying runtime sends the messages
needed to keep the DSM consistent between machines. DSM also allows programs that
used to run on the same computer to be easily adapted to run on separate machines.
Programs access what appears to them to be ordinary memory. Hence, programs that use
DSM are usually shorter and easier to understand than programs that use message passing.
However, DSM is not suitable for all situations. Client-server systems are generally less
suited to DSM, although a server may be used to help provide DSM functionality for data
shared between clients.
The architecture of a Distributed Shared Memory (DSM) system typically consists of several key
components that work together to provide the illusion of a shared memory space across distributed
nodes. The components of the architecture of Distributed Shared Memory are:
1. Nodes: Each node in the distributed system consists of one or more CPUs and a memory unit.
These nodes are connected via a high-speed communication network.
2. Memory Mapping Manager Unit: The memory mapping manager routine in each node is
responsible for mapping the local memory onto the shared memory space. This involves dividing the
shared memory space into blocks and managing the mapping of these blocks to the physical memory
of the node.
Caching is employed to reduce operation latency. Each node uses its local memory to cache portions
of the shared memory space. The memory mapping manager treats the local memory as a cache for
the shared memory space, with memory blocks as the basic unit of caching.
3. Communication Network Unit: This unit facilitates communication between nodes. When a process
accesses data in the shared address space, the memory mapping manager maps the shared memory
address to physical memory. The communication network unit handles the communication of data
between nodes, ensuring that data can be accessed remotely when necessary.
A layer of code, either implemented in the operating system kernel or as a runtime routine, is
responsible for managing the mapping between shared memory addresses and physical memory
locations.
Each node’s physical memory holds pages of the shared virtual address space. Some pages are local
to the node, while others are remote and stored in the memory of other nodes.
In summary, the architecture of a DSM system includes nodes with CPUs and memory, a memory
mapping manager responsible for mapping local memory to the shared memory space, caching
mechanisms to reduce latency, a communication network unit for inter-node communication, and a
mapped layer to manage the mapping between shared memory addresses and physical memory
locations.
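The memory mapping manager's address translation can be illustrated with a toy sketch, assuming a made-up block size and ownership map (a real DSM would fetch or migrate remote blocks rather than just naming their owner):

```python
# Toy DSM address mapping: the shared address space is split into
# fixed-size blocks, each owned by a node; the manager resolves a
# shared address to (owning node, block, offset).

BLOCK_SIZE = 4096
owner_of_block = {0: "node-a", 1: "node-b", 2: "node-a"}  # block -> owning node

def map_address(shared_addr):
    """Map a shared-memory address to the node holding it and the offset."""
    block = shared_addr // BLOCK_SIZE
    offset = shared_addr % BLOCK_SIZE
    node = owner_of_block[block]       # a real DSM would fetch/migrate here
    return node, block, offset

print(map_address(5000))  # -> ('node-b', 1, 904)
```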
Distributed systems are a collection of independent computers that appear to their users as a single
coherent system. These computers collaborate to achieve a common goal and communicate over a
network.
What is Consistency?
Consistency is a fundamental property of distributed systems that ensures that all replicas of a
shared data store have the same value. In a distributed system, data is typically replicated across
multiple nodes to improve availability and fault tolerance. However, maintaining consistency across
all replicas can be challenging, especially in the presence of concurrent updates and network delays.
When multiple processes access shared resources concurrently, where the order of execution is not
deterministic, the problem arises due to the need for maintaining consistency across these
processes. This challenge is further complicated by the fact that distributed systems are prone to
various types of failures, such as message losses, network delays, and crashes, which can lead to
inconsistencies in the execution order.
There are several consistency models that can be used in distributed systems to ensure
that shared resources are accessed in a consistent order.
These models define the level of consistency that is guaranteed for shared resources in the
presence of concurrent access by multiple processes.
Every concurrent execution of a program should ensure that the execution results are in some
sequential order. It provides an intuitive model of how the distributed systems should behave, which
makes it easier for users to understand and reason about the behavior of the system.
Sequential consistency is a consistency model that ensures that the order in which operations are
executed by different processes in the system appears to be consistent with a global order of
execution. It ensures that the result of any execution of a distributed system is the same as if the
operations of all the processes in the system were executed in some sequential order, one after the
other.
Below is an example of sequential consistency in a distributed system with three processes, P1, P2,
and P3, interacting with a shared variable X. The processes perform the following operations:
o P1: W(X)5 — writes the value 5 to X.
o P3: W(X)10 — writes the value 10 to X.
o P2: R(X), R(X) — reads X twice.
Sequential consistency requires that the result of the operations appear as if they were executed in
some global sequence that respects the program order of operations within each process. Possible
sequential orders are:
Sequential Order 1:
o P2: R(X) — Read 5 (as 5 was the last write before this read).
o P2: R(X) — Read 10 (as 10 was the last write before this read).
Consistency Check:
o The second read by P2 returns 10, which is the value written by P3.
o This order maintains sequential consistency as it reflects a global sequence where
operations are ordered as if they occurred one after the other.
Sequential Order 2:
o P2: R(X) — Read 10 (as 10 was the latest write before this read).
o P2: R(X) — Read 10 (as 10 was the latest write before this read).
Consistency Check:
o The first read by P2 returns 10, reflecting that it reads the value written by P3, which
was the latest write.
o The second read by P2 also returns 10, consistent with the latest write.
o This order is valid because it maintains a consistent view of the latest write
operation, reflecting a valid global order.
The following techniques can be used to implement sequential consistency in distributed systems:
Two-Phase Locking:
o Once a process acquires a lock, it holds the lock until it has completed its operation.
o Two-phase locking ensures that no two processes can access the same data at the
same time, which can prevent inconsistencies in the data.
Timestamp Ordering:
o The system ensures that all operations are performed in the order of their
timestamps.
o Timestamp ordering ensures that the order of operations is preserved across all
nodes.
Quorum-based Replication:
o To ensure consistency, the system requires that a majority of nodes agree on the
value of the data.
o Quorum-based replication ensures that the data is replicated across different nodes,
and inconsistencies in the data are prevented.
Vector Clocks:
o Vector clocks are a technique used to ensure that operations are performed in the
same order across all nodes.
o In this technique, each node maintains a vector clock that contains the timestamp of
each operation performed by the node.
o The system ensures that all operations are performed in the order of their vector
clocks.
o Vector clocks ensure that the order of operations is preserved across all nodes.
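A minimal sketch of vector clocks (function names are illustrative): each node keeps a counter per node, increments its own entry on a local event, and merges element-wise maxima on message receipt; comparing vectors then yields the causal (“happened-before”) partial order.

```python
# Toy vector clocks: clocks are dicts mapping node name -> counter.

def local_event(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1       # tick own entry
    return clock

def on_receive(clock, node, sender_clock):
    # merge: element-wise max of the two clocks, then tick own entry
    merged = {k: max(clock.get(k, 0), sender_clock.get(k, 0))
              for k in set(clock) | set(sender_clock)}
    return local_event(merged, node)

def happened_before(a, b):
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

p1 = local_event({}, "p1")              # p1 sends a message after this event
p2 = on_receive({}, "p2", p1)           # p2 receives it
print(happened_before(p1, p2))          # -> True: the send precedes the receive
```

Note that vector clocks give only a partial order: two events whose clocks are incomparable are concurrent, which is exactly the information causal consistency needs.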
Challenges in implementing sequential consistency include:
Network Latency: Communication between different nodes in a distributed system can take
time, and this latency can result in inconsistencies in the data.
Node Failure: In a distributed system, nodes can fail, which can result in data inconsistencies.
Concurrent Access: Concurrent access to shared data can lead to inconsistencies in the data.
Replication: Replication of data across different nodes can lead to inconsistencies in the
data.
Best practices for sequential consistency include:
Design for Global Order: Use global timestamps or logical clocks to order operations across
processes.
Utilize Consensus Algorithms: Apply consensus protocols like Two-Phase Commit (2PC) or
Three-Phase Commit (3PC) to agree on the order of transactions.
Monitor and Test Consistency: Continuously monitor for consistency and perform regular
testing to validate that operations adhere to sequential consistency.
Use Logical Clocks: Implement Lamport timestamps or vector clocks to provide a partial
ordering of events.
Ivy is a multi-user read/write peer-to-peer file system. Ivy has no centralized or dedicated
components, and it provides useful integrity properties without requiring users to fully trust either
the underlying peer-to-peer storage system or the other users of the file system.
An Ivy file system consists solely of a set of logs, one log per participant. Ivy stores its logs in the
DHash distributed hash table. Each participant finds data by consulting all logs, but performs
modifications by appending only to its own log. This arrangement allows Ivy to maintain meta-data
consistency without locking. Ivy users can choose which other logs to trust, an appropriate
arrangement in a semi-open peer-to-peer system.
Ivy presents applications with a conventional file system interface. When the underlying network is
fully connected, Ivy provides NFS-like semantics, such as close-to-open consistency. Ivy detects
conflicting modifications made during a partition, and provides relevant version information to
application-specific conflict resolvers. Performance measurements on a wide-area network show
that Ivy is two to three times slower than NFS.
Difference between Write Invalidate Protocol & Write Update Protocol (Bus Snooping)
Cache coherence becomes an issue in today’s multiprocessor systems, where all the
processors use a common memory. The two key protocols that achieve this coherency
through bus snooping are the write invalidate protocol and the write update protocol. To
manage memory efficiently and achieve better performance, the differences between these
protocols should be made clear.
What is Cache Coherence Protocol?
To keep data consistent, cache coherence protocols are used. These protocols update
cache copies in multiprocessor systems. In bus-snooping mechanisms, processors
snoop on (monitor) the bus and take appropriate action on relevant events (such as a data
update) to ensure data consistency.
The two protocols usually used to update cache copies are:
1. Write-update protocol
2. Write-invalidate protocol
What is Write Invalidate Protocol?
Wherever multiple caches share a common memory, there must be a protocol that
keeps these caches in harmony when sharing data, and this is where the Write
Invalidate Protocol comes into play. When a processor wants to write to a certain
memory location, this protocol guarantees that any other cache in the system holding the
same memory line will invalidate its copy. This avoids the situation where other
processors read stale data from their caches and ensures that only one cache holds
the latest data.
Advantages of Write Invalidate Protocol
Reduces Bandwidth Usage: Only invalidation messages are transmitted on the bus,
and these are smaller than the actual data.
Simpler Implementation: The protocol is easy to implement because it simplifies
the problem of cache coherence.
Prevents Stale Data Access: By invalidating other caches’ copies, it ensures that no
processor reads old data from its own cache.
Disadvantages of Write Invalidate Protocol
Increased Cache Misses: Invalidating caches may result in more cache misses, in the
sense that other processors may have to read the data in from either the main
memory or from the cache of the writing processor.
Performance Overhead: Frequent invalidations can add latency when there is high
write contention.
What is Write Update Protocol?
The Write Update Protocol, also called the Write Broadcast Protocol, is another cache
coherence strategy applied in multiprocessor systems. Unlike the Write Invalidate Protocol,
on a write operation the cache sends the updated data to all other caches that hold that
memory line. This ensures that all the caches hold the new value and no cache copy has
to be invalidated.
Advantages of Write Update Protocol
Reduced Cache Misses: Because the updated data is available in all the caches,
they do not have to fetch it again, reducing the number of cache misses.
Improved Performance for Read-Heavy Workloads: Applications with frequent
reads benefit most, since the latest data is already in all the caches.
Consistent Data Across Caches: Ensures all caches have up-to-date data, thus
improving data synchronization.
Disadvantages of Write Update Protocol
Higher Bandwidth Consumption: Broadcasting updated data to all caches requires
more bus bandwidth than sending invalidation messages only.
Complex Implementation: Coordinating all the caches and ensuring that they hold
the correct data can be more challenging.
Potential for Increased Traffic: When multiple processors are involved, the volume of
update messages can grow large, causing congestion on the bus.
Difference between Write Invalidate Protocol & Write Update Protocol
1. Invalidations vs. broadcasts: With write invalidate, only one initial invalidation is
required when multiple writes to the same word are done with no intervening reads. With
write update, multiple write broadcasts are required in the same situation.
2. Granularity: With multiword cache blocks, write invalidate needs to generate an
invalidate only on the first write to any word in the block; the protocol works on cache
blocks. Write update requires a write broadcast each time a word in the block is written;
the protocol works on individual words (bytes).
3. Read latency after a write: Under write invalidate, the written data is not instantly
updated in the reader’s cache, so the delay between writing a word in one processor and
reading the written value in another processor is larger than with write update. Under
write update, the written data is instantly updated in the reader’s cache, so this delay is
normally shorter.
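The write invalidate behavior can be illustrated with a toy simulation (structure and names are made up): on a write, every other cache holding the line is invalidated, so a later read in another processor misses and re-fetches the fresh value.

```python
# Toy write-invalidate simulation with two CPU caches over one memory.

memory = {"X": 0}
caches = {"cpu0": {}, "cpu1": {}}

def read(cpu, addr):
    if addr not in caches[cpu]:            # cache miss: fetch from memory
        caches[cpu][addr] = memory[addr]
    return caches[cpu][addr]

def write(cpu, addr, value):
    caches[cpu][addr] = value
    memory[addr] = value
    for other, cache in caches.items():    # snoop: invalidate other copies
        if other != cpu:
            cache.pop(addr, None)

read("cpu1", "X")                          # cpu1 caches X = 0
write("cpu0", "X", 7)                      # invalidates cpu1's copy
print("X" in caches["cpu1"], read("cpu1", "X"))  # -> False 7 (miss, then fresh value)
```

This makes the extra cache miss noted above visible: cpu1 loses its copy on the invalidation and must re-read, whereas a write update protocol would have pushed the value 7 into cpu1’s cache directly.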
Conclusion
Write invalidate and write update protocols both keep caches coherent in multiprocessor
systems, but in contrasting ways. The Write Invalidate Protocol is generally more
bandwidth-efficient and simpler, so it is useful for systems where writes are comparatively
less common. The Write Update Protocol keeps data synchronized across different caches
and helps programs that do mostly reads, but this comes at the expense of more
bandwidth use and a larger implementation cost. The decision between them has to be
based on the workload of the particular system.
Munin
Munin is a distributed shared memory (DSM) system that implements release consistency
with optimizations for parallel applications. It’s designed to provide a balance between
performance and consistency by leveraging RC and a few additional techniques:
1. Multiple Consistency Protocols: Munin uses different protocols tailored to different
types of data. For example, read-mostly data might be handled differently from data
that is frequently updated, optimizing communication for specific access patterns.
2. Delayed Update Propagation: Using release consistency, Munin delays the
propagation of updates until synchronization points. This can make it more efficient
in distributed environments where immediate consistency isn’t necessary.
3. Lazy Release Consistency (LRC): Munin incorporates Lazy Release Consistency,
which further reduces overhead by delaying the propagation of updates until they
are explicitly needed, rather than at each release point.
4. Minimizing False Sharing: Munin also employs strategies to minimize false sharing,
where two processes frequently access different parts of the same memory page,
leading to unnecessary synchronization.
In summary, Munin was one of the first DSM systems to explore the release consistency
model, demonstrating that RC could enhance the performance of distributed systems by
minimizing synchronization costs and adjusting consistency protocols based on application
needs.
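The delayed update propagation of release consistency can be sketched as follows (a toy model in the spirit of Munin, not its actual API; all names are illustrative): writes inside a critical section are buffered locally and only pushed to other replicas at the release point.

```python
# Toy release-consistency replica: writes are buffered in `pending`
# and propagated to peers only at release (a synchronization point).

class RCReplica:
    def __init__(self, peers=()):
        self.data = {}
        self.pending = {}              # writes buffered until release
        self.peers = list(peers)

    def write(self, key, value):
        self.pending[key] = value      # no communication yet

    def release(self):
        # synchronization point: propagate all buffered updates at once
        self.data.update(self.pending)
        for peer in self.peers:
            peer.data.update(self.pending)
        self.pending = {}

other = RCReplica()
local = RCReplica(peers=[other])
local.write("a", 1)
local.write("b", 2)
print(other.data)                      # -> {}: updates not yet visible
local.release()
print(other.data)                      # -> {'a': 1, 'b': 2}
```

Batching the two writes into one propagation at release is exactly the communication saving that release consistency (and, more aggressively, lazy release consistency) exploits.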